Add RandomSampler #24
Conversation
This relies on an unreleased dimod feature - hence the CI failures.
@@ -24,13 +24,32 @@ or locally on your CPU.
*dwave-samplers* implements the following classical algorithms for solving
:term:`binary quadratic model`\ s (BQM):

* Random: a sampler that draws uniform random samples.
Missing backticks, i.e.,

    * `Random` : a sampler that draws uniform random samples.
Those are for the link. In this case I am not linking to anything. Though open to a good place to link to, https://en.wikipedia.org/wiki/Randomization seemed a bit over general 😄
It's backticks plus underscore to link to the new Random section in line 35
Got it, I misunderstood. Will do.
Random
======

Random samplers provide a useful baseline performance comparison. The variable
Would this be the place to mention which PRNG is used? e.g., Mersenne-Twister?
dwave/samplers/random/cyrandom.pyx
Outdated
cdef double sampling_start_time = realtime_clock()

cdef double sampling_stop_time
if time_limit < 0:
Should this be `<=`? Quoting the docstring below: "If given and non-negative, samples are drawn until ``time_limit``." Should this be strictly positive?

update: Nvm, I can see why a strict inequality is an intuitive choice. I don't have an objective argument at the moment.
dwave/samplers/random/cyrandom.pyx
Outdated
if is_spin:
    # go back through and convert to spin
    for i in range(state.sample.size()):
        state.sample[i] = 2 * state.sample[i] - 1

state.energy = cybqm.data().energy(state.sample.begin())
Converting spins and computing energies seem more fitting as a postprocessing step.
Only in the case when you're keeping every sample. Otherwise you need the energies in order to keep the population to ``num_reads``. Because for "real uses" of the solver you need to keep the population limited, IMO it's part of sampling.
OK that makes sense. Follow-up question: why do we use ``num_reads`` as a cap, as opposed to accepting one and only one parameter (one of ``num_reads``, ``time_limit``)? The latter seems like a more natural use case.
Four different combos with their uses:

- both unspecified: I just want to use the defaults
- ``time_limit`` set, ``num_reads`` unset: I want the best sample I can find randomly in the given time
- ``time_limit`` unset, ``num_reads`` set: I want ``num_reads`` samples drawn uniformly. If I was to specify the ``time_limit``, they would not be uniform.
- both set: I want the best sample(s) found after ``time_limit``, and I want to specify the number of returned samples and/or the internal memory used.

Obviously these are not inclusive of use cases but should give some idea.
I would forbid the user setting both ``num_reads`` and ``time_limit``, since ``time_limit`` takes precedence and might return fewer than ``num_reads``.
947320b changed the behavior to always return ``num_reads``. ``time_limit`` now just allows the algorithm to keep sampling.
I was looking at your comment above, and missed that update, thanks.
dwave/samplers/random/cyrandom.pyx
Outdated
prepreocessing_time=sampling_start_time-preprocessing_start_time,
sampling_time=postprocessing_start_time-sampling_start_time,
postprocessing_time=realtime_clock()-preprocessing_start_time,
Return as a dict, i.e., ``timing=dict(preprocessing_time=..., sampling_time=..., postprocessing_time=...)``, for consistency with how we output QPU timings.
I am open to having it be nested. Though since we're already changing the names, IMO consistency is not the main reason to do it.
If we do nest it, and if we're already changing the names, IMO it would look nicer to have

    info = dict(timing=dict(
        preprocessing=...,
        sampling=...,
        postprocessing=...,
    ))
IMHO consistency of timing terminology with the QPU and other solvers is a priority. Otherwise the code to plot data from different solvers on the same graph becomes very unwieldy. This may require some exceptions and compromises because different solvers have different structures, but aiming for maximum comparability would be a good thing. This would also be an argument for generic timing categories based on "once per input" and "once per output", which are natural comparators, rather than solver-specific functions.
+1 for nested timings!
I would also consider typing the duration variables. Perhaps best to use ``datetime.timedelta``. I always need to check if some time variable in Ocean is in seconds, ms or us. Hate that.
I would also like that. I did it this way for consistency with QPU timing, but we're already being inconsistent in other places, so it's probably worth raising at #22?
QPU timing info could be considered encoded for transport -- and we can decode it as ``timedelta`` client-side.
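As a minimal sketch of that decoding idea: QPU timing fields are reported as integer microseconds, and wrapping them in ``datetime.timedelta`` makes the unit explicit. The field names and values below are assumed for illustration only.

```python
from datetime import timedelta

# Assumed example payload: QPU timing fields as integer microseconds.
raw_timing = {"qpu_access_time": 15831, "qpu_sampling_time": 1640}

# Decode client-side into timedelta so the unit can never be misread.
decoded = {name: timedelta(microseconds=us) for name, us in raw_timing.items()}

print(decoded["qpu_access_time"].total_seconds())  # 0.015831
```

With this, downstream code can compare or sum durations without guessing whether a number is in seconds, ms, or us.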
sampleset = RandomSampler().sample(bqm, num_reads=10, time_limit=.02)
runtime = time.perf_counter() - t

self.assertTrue(.01 < runtime < .04)
Should also have a test case for ``info['timing']['sampling_time']`` being reasonably close to 0.02.
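Such a test might look like the sketch below. Since the real sampler isn't available here, a stub stands in for the sampling loop and records its own elapsed time, mirroring the proposed ``info['timing']['sampling_time']`` field (in seconds); the stub and its tolerance are assumptions, not the PR's actual test.

```python
import time
import unittest


class TestSamplingTime(unittest.TestCase):
    def stub_sample(self, time_limit):
        # Stand-in for the sampler: "sample" for time_limit seconds and
        # report the measured sampling time, as the proposed info dict would.
        start = time.perf_counter()
        time.sleep(time_limit)
        return {"timing": {"sampling_time": time.perf_counter() - start}}

    def test_sampling_time_close_to_limit(self):
        info = self.stub_sample(time_limit=.02)
        # Generous delta to tolerate scheduler jitter on CI machines.
        self.assertAlmostEqual(
            info["timing"]["sampling_time"], .02, delta=.02)


if __name__ == "__main__":
    unittest.main()
```

In the real test, the stub call would be replaced by `RandomSampler().sample(bqm, time_limit=.02)` and the assertion pointed at `sampleset.info`.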
dwave/samplers/random/sampler.py
Outdated
Only the best ``num_reads`` (or fewer) samples are kept.

Returns:
    A sample set.
    Some additional information is provided in the
    :attr:`~dimod.SampleSet.info` dictionary:

    * **num_drawn**: The total number of samples generated.
Is my understanding as follows correct? ``num_drawn != num_reads`` is true when ``time_limit`` allows for more draws than ``num_reads``?
Specifically when ``time_limit`` is provided and when it allows for more reads.
with self.assertWarns(dimod.exceptions.SamplerUnknownArgWarning):
    RandomSampler().sample(bqm, a=5, b=2)

def test_time_limit(self):
Maybe add test cases for different settings:

- default (no ``num_reads``, no ``time_limit``)
- only ``num_reads``
- only ``time_limit``
The default case is covered rather extensively by ``@dimod.testing.load_sampler_bqm_tests(RandomSampler)``, which adds ~30 tests. But sure, can separate the ``num_reads`` and ``time_limit`` tests.
Maybe add a few more test cases, otherwise it looks good.
Looks good overall; my only concern is the sampling termination criteria interface.
dwave/samplers/random/cyrandom.pyx
Outdated
sort(samples.begin(), samples.end(), comp_state)
samples.pop_back()
Perhaps use a heap instead? (for total O(N log N) complexity)

Alternatively, you can use insertion sort (for O(N) per step here, and total ~O(N^2)) instead of "IntroSort" (for O(N log N) per step here, and total of ~O(N^2 log N)).
Tried that, it actually hurt performance. Imagine it's the compiler being smart.
Or constant factor too big to notice benefits on smaller scale.
Indeed, though I tried fairly large numbers. You know what, there have been a bunch of changes since then. Let me retest.
Ok, after the change introduced in 176c25c, I was able to get a performance boost at very large ``num_reads``. So I will indeed make the change. Thanks, and good suggestion!
For posterity I used

    bqm = dimod.generators.gnp_random_bqm(1000, .5, 'BINARY')
    num_reads = 10000
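The heap idea can be sketched in plain Python (not the Cython code under review): keep only the best ``num_reads`` samples while drawing many more, for O(N log num_reads) total work rather than re-sorting the population on every eviction. The helper name and toy energy below are illustrative assumptions.

```python
import heapq


def sample_best(energy, draw, num_reads, num_draws):
    """Draw num_draws samples, retaining only the num_reads lowest-energy.

    heapq is a min-heap, so energies are negated: the heap root is then the
    *worst* retained sample, which is what we evict when a better one appears.
    """
    heap = []  # entries: (-energy, draw_index, sample); the index breaks ties
    for i in range(num_draws):
        s = draw()
        e = energy(s)
        if len(heap) < num_reads:
            heapq.heappush(heap, (-e, i, s))
        elif -e > heap[0][0]:  # strictly better than the current worst
            heapq.heapreplace(heap, (-e, i, s))
    return sorted((-neg_e, s) for neg_e, _, s in heap)  # lowest energy first


# Toy usage: deterministic "draws" [0], [1], ... with energy = the entry.
seq = iter(range(1000))
best = sample_best(lambda s: s[0], lambda: [next(seq)],
                   num_reads=5, num_draws=1000)
```

With the deterministic toy draw, ``best`` ends up holding the five lowest-energy samples, ``[0]`` through ``[4]``.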
dwave/samplers/random/sampler.py
Outdated
time_limit:
    The maximum sampling time in seconds.
    If given and non-negative, samples are drawn until ``time_limit``.
    Only the best ``num_reads`` (or fewer) samples are kept.
Until we figure out how best to extend/generalize the ``Initialized`` sampler interface to handle other termination criteria, I would forbid setting both ``num_reads`` and ``time_limit`` here. It's confusing because ``time_limit`` takes precedence and the sampler might return fewer than ``num_reads``.

A case of setting ``time_limit``, but not ``num_reads``, is also confusing -- we return only one sample after ``num_drawn`` tries.

Perhaps we should introduce ``max_answers`` (or ``max_samples``) that can be used in combination with ``time_limit``. See dwavesystems/dwave-greedy#24 for discussion.

So:

- ``num_reads`` set: exactly ``num_reads`` returned, ``time_limit`` not allowed, ``max_answers`` ignored.
- ``time_limit`` set: ``max_answers`` (which can default to 1 or ``inf``/None) is used.

With this scheme, I would set ``num_reads`` by default to ``None``, with the interpretation of how much that actually is depending on ``time_limit``/``max_answers``. Similarly to how we resolve it in the ``Initialized`` arg parser.
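The scheme proposed in this comment could be sketched as a small resolution helper. This is a hypothetical illustration, not code from the PR; the function name, return shape, and the ``num_reads`` default are all assumptions.

```python
def resolve(num_reads=None, time_limit=None, max_answers=None):
    """Resolve mutually exclusive termination arguments (sketch).

    num_reads and time_limit are forbidden together; max_answers only
    applies in the timed mode, defaulting to 1 as suggested above.
    """
    if num_reads is not None:
        if time_limit is not None:
            raise ValueError(
                "specify either num_reads or time_limit, not both")
        return dict(mode="count", num_reads=num_reads)
    if time_limit is not None:
        return dict(mode="timed", time_limit=time_limit,
                    max_answers=1 if max_answers is None else max_answers)
    return dict(mode="count", num_reads=1)  # assumed default
```

A caller would then dispatch on the returned ``mode`` to pick the count-based or time-based sampling loop.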
I had already changed it so ``num_reads`` has precedence. And I disagree that they should be mutually exclusive. The solver will now always return ``num_reads``. ``time_limit`` just determines whether it keeps going.
To summarize the current implementation's behaviour...

``num_reads`` [x], ``time_limit`` [x]

- Case more num_reads than time_limit allows for: always return ``num_reads`` states and go past the time limit.
- Case fewer num_reads than time_limit allows for: always return ``num_reads`` states and continue sampling until ``time_limit``.

``num_reads`` [ ], ``time_limit`` [x]

- ``num_reads`` is implicitly set to the default. See above scenarios.

``num_reads`` [x], ``time_limit`` [ ]

- Always return ``num_reads``.

``num_reads`` [ ], ``time_limit`` [ ]

- ``num_reads`` is set to the default. See above scenario.

I think the bolded case could be an invitation to making human errors for a default ``num_reads`` > 1 (currently 10). For example, ``sampler.sample(time_limit=very-small-value)`` suggests the sampler should return 1 state, but instead it will return 10. If we want to keep this behaviour, I suggest changing the default ``num_reads`` to 1.

edit: I reread the above, and I think the ambiguity comes from the argument name "time limit" being suggestive of a hard cutoff.
Sure, happy to lower the default to 1. I had it higher for backwards compatibility reasons with dimod but I don't think that actually matters.
Ok, so it looks like an implied AND between ``==num_reads`` and ``>=time_limit``:

- sampler will always return ``num_reads``,
- sampler will run for at least ``time_limit``?

I find this awkward and inconsistent with ``Initialized`` (``num_reads`` adapts to ``initial_states`` if states are user-specified and reads are not) and termination criteria in dwave-hybrid/Kerberos (time/convergence/energy/iteration count are OR-ed - we stop as soon as one condition is met).
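The OR-ed alternative this comment points to can be sketched as a sampling loop that stops as soon as any criterion is met. This is an illustrative stand-in, not dwave-hybrid code; the function name and the fallback default are assumptions.

```python
import time


def run(draw, num_reads=None, time_limit=None):
    """Draw samples until ANY termination criterion is met (OR semantics)."""
    if num_reads is None and time_limit is None:
        num_reads = 10  # assumed default so the loop always terminates
    deadline = None if time_limit is None else time.monotonic() + time_limit
    samples = []
    while True:
        if num_reads is not None and len(samples) >= num_reads:
            break  # read-count criterion met first
        if deadline is not None and time.monotonic() >= deadline:
            break  # time criterion met first
        samples.append(draw())
    return samples
```

Under these semantics, ``run(draw, num_reads=10**9, time_limit=.01)`` returns whatever was drawn within 0.01 s, in contrast to the AND behaviour summarized above, which would keep going until the read count is reached.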
Ok, how about I just make two samplers: one that takes a ``time_limit`` and ``max_answers``, and one that takes ``num_reads``. If we're going to make the arguments disjoint, we might as well make that clear at the sampler level.

Personally I think that's way overkill, but it seems like no one else agrees with me 😄
If the user sets time_limit (and not num_reads), they care about run time, although it's not clear what number of samples they want. But if they haven't specified the number, it makes sense to return all samples generated during the runtime. Here's where max_samples/max_answers comes into play.

The problem is that we generate a lot of samples. Way more than can easily fit in memory. (We can probably fit them in memory, but still it can be millions for even small runtimes.) So returning all is not practical for many problems. Which is why I used the default ``num_reads``, which was 10.

In this framework, setting both num_reads and time_limit would run until one of them is satisfied. This is what I had before. edit: nevermind, I understand. I was treating ``time_limit`` as having priority.
So, we can either treat time_limit as the most important and then have max_answers/max_samples as the secondary criterion, or we can treat num_reads as the most important and treat time_limit as the secondary criterion when provided.

If we're deciding between the two, I also favour the latter for similar reasons. I do think Radomir's proposal is more intuitive and gives more control over the behaviour of the sampler.

Another point to consider is that if we add a similar feature for Neal, consistent behaviour across classes of samplers would be desirable. And in the case of Neal, I think it's reasonable to set a time limit and retrieve all solutions it came across.
Ok, I concede. I will refactor.
Extending the sampler interface with ``time_limit`` (and ``max_answers`` or whatever we call it) would be a good candidate for a follow-up PR, just to keep concerns separated. But since we had all this discussion here, we might as well do it here and then generalize in a follow-up.
@@ -0,0 +1,3 @@
---
features:
  - Add ``RandomSampler`` which generates samples with variable assignments chosen by coin flip.
Unless we have a coin flipping device attached to user's machine, I would rephrase this 😄
Ok, refactored. I have also dropped the timing info, pending a decision on #22. We can restore it in a follow-up PR.
LGTM!
Get the best 5 samples found in .1 seconds

>>> sampleset = sampler.sample(bqm, time_limit=.1, max_num_samples=5)
Perhaps include ``.info['num_drawn']`` / ``.info['num_reads']`` here to clarify that many more samples were actually generated.
If ``time_limit`` is provided, there is no limit imposed on
the number of reads.
Isn't the limit set by ``max_num_samples`` then?
That's the limit on the population, it will take unlimited reads up to the time limit.
Add a performant random sampler. This is intended to replace ``dimod.RandomSampler`` as part of dwavesystems/dimod#298.

This implements the same API, but also includes a ``time_limit`` parameter. This is a feature used by the benchmarking team - this sampler is meant to replace their random batch sampler with one maintained as part of Ocean.

I have also attempted to include timing information following the spec in #22.
I kind of hate that, because we list the samplers alphabetically, this one gets put first 🤷.