
Add RandomSampler #24

Merged: 11 commits, Jun 23, 2022

Conversation

arcondello (Member)

Add a performant random sampler. This is intended to replace dimod.RandomSampler as part of dwavesystems/dimod#298.

This implements the same API, but also includes a time_limit parameter. This is a feature used by the benchmarking team - this sampler is meant to replace their random batch sampler with one maintained as part of Ocean.

I have also attempted to include timing information following the spec in #22.

I kind of hate that, because we list the samplers alphabetically, this one gets put first 🤷.

arcondello mentioned this pull request on Jun 21, 2022.

arcondello (Member Author) commented Jun 21, 2022:

This relies on an unreleased dimod feature - hence the CI failures. Will open a PR shortly. dwavesystems/dimod#1210

@@ -24,13 +24,32 @@ or locally on your CPU.
*dwave-samplers* implements the following classical algorithms for solving
:term:`binary quadratic model`\ s (BQM):

* Random: a sampler that draws uniform random samples.
kevinchern (Contributor) commented Jun 21, 2022:

Missing backticks, i.e.,

* `Random` : a sampler that draws uniform random samples.

arcondello (Member Author):

Those are for the link. In this case I am not linking to anything. Though open to a good place to link to, https://en.wikipedia.org/wiki/Randomization seemed a bit over general 😄

JoelPasvolsky (Contributor) commented Jun 21, 2022:

It's backticks plus an underscore to link to the new Random section on line 35.

arcondello (Member Author):

Got it, I misunderstood. Will do.

Random
======

Random samplers provide a useful baseline performance comparison. The variable
Contributor:

Would this be the place to mention which PRNG is used? e.g., Mersenne-Twister?

arcondello (Member Author):

Could do, though the user can actually set that by passing in a Generator. By default we use NumPy's default, which is currently PCG64. Though I would want to just say that we use the default from NumPy in case NumPy changes.
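For illustration, a minimal sketch of passing a Generator (the seed parameter name is an assumption based on similar Ocean samplers, not confirmed by this PR):

import dimod
import numpy as np
from dwave.samplers import RandomSampler

bqm = dimod.generators.gnp_random_bqm(10, .5, 'BINARY')

# NumPy's default bit generator (currently PCG64)
rng = np.random.default_rng(42)
# or an explicit Mersenne-Twister, as asked about above
rng = np.random.Generator(np.random.MT19937(42))

sampleset = RandomSampler().sample(bqm, num_reads=10, seed=rng)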

cdef double sampling_start_time = realtime_clock()

cdef double sampling_stop_time
if time_limit < 0:
kevinchern (Contributor) commented Jun 21, 2022:

Should this be <=?

Quoting the docstring below:

If given and non-negative, samples are drawn until time_limit.

Should this be strictly positive?

Update: never mind, I can see why a strict inequality is an intuitive choice. I don't have an objective argument at the moment.

Comment on lines 83 to 88
if is_spin:
# go back through and convert to spin
for i in range(state.sample.size()):
state.sample[i] = 2 * state.sample[i] - 1

state.energy = cybqm.data().energy(state.sample.begin())
Contributor:

Converting spins and computing energies seem more fitting as a postprocessing step.

arcondello (Member Author) commented Jun 21, 2022:

Only in the case when you're keeping every sample. Otherwise you need the energies in order to keep the population to num_reads. Because for "real uses" of the solver you need to keep the population limited, IMO it's part of sampling.

Contributor:

OK that makes sense. Follow-up question: why do we use num_reads as a cap, as opposed to accepting one and only one parameter (one of num_reads, time_limit)? The latter seems like a more natural use case.

arcondello (Member Author):

Four different combos with their uses:

  • both unspecified: I just want to use the defaults
  • time_limit set, num_reads unset: I want the best sample I can find randomly in the given time
  • time_limit unset, num_reads set: I want num_reads samples drawn uniformly. If I were to specify the time_limit, they would not be uniform.
  • both set: I want the best sample(s) found after time_limit, and I want to specify the number of returned samples and/or the internal memory used.

Obviously this is not an exhaustive list of use cases, but it should give some idea.

Member:

I would forbid the user from setting both num_reads and time_limit, since time_limit takes precedence and might return fewer than num_reads samples.

arcondello (Member Author) commented Jun 23, 2022:

947320b changed the behavior to always return num_reads. time_limit now just allows the algorithm to keep sampling.

Member:

I was looking at your comment above, and missed that update, thanks.

Comment on lines 169 to 171
preprocessing_time=sampling_start_time-preprocessing_start_time,
sampling_time=postprocessing_start_time-sampling_start_time,
postprocessing_time=realtime_clock()-postprocessing_start_time,
Contributor:

Return as a dict, i.e., timing=dict(preprocessing_time=..., sampling_time=..., postprocessing_time=...) for consistency with how we output QPU timings

arcondello (Member Author):

I am open to having it be nested. Though since we're already changing the names, IMO consistency is not the main reason to do it.

arcondello (Member Author):

If we do nest it, and if we're already changing the names, IMO it would look nicer to have

info = dict(timing=dict(
    preprocessing=...,
    sampling=...,
    postprocessing=...,
))

Reviewer:

IMHO consistency of timing terminology with the QPU and other solvers is a priority. Otherwise the code to plot data from different solvers on the same graph becomes very unwieldy. This may require some exceptions and compromises because different solvers have different structures, but aiming for maximum comparability would be a good thing. This would also be an argument for generic timing categories based on "once per input" and "once per output", which are natural comparators, rather than solver-specific functions.

Contributor:

+1 for nested timings!

Member:

I would also consider typing the duration variables. Perhaps best to use datetime.timedelta. I always need to check if some time variable in Ocean is in seconds, ms or us. Hate that.

arcondello (Member Author):

I would also like that. I did it this way for consistency with QPU timing, but we're already being inconsistent in other places, so it's probably worth raising at #22?

Member:

QPU timing info could be considered encoded for transport -- and we can decode it as timedelta client-side.
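A minimal sketch of the timedelta suggestion (variable and key names here are illustrative, not from the PR):

import time
from datetime import timedelta

t0 = time.perf_counter()
# ... preprocessing ...
t1 = time.perf_counter()
# ... sampling ...
t2 = time.perf_counter()

timing = {
    'preprocessing': timedelta(seconds=t1 - t0),
    'sampling': timedelta(seconds=t2 - t1),
}

# the unit is now unambiguous and recoverable at any resolution
print(timing['sampling'].total_seconds())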

sampleset = RandomSampler().sample(bqm, num_reads=10, time_limit=.02)
runtime = time.perf_counter() - t

self.assertTrue(.01 < runtime < .04)
kevinchern (Contributor) commented Jun 21, 2022:

Should also have a test case for info['timing']['sampling_time'] being reasonably close to 0.02.
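A sketch of the suggested assertion (the 'timing'/'sampling_time' keys follow the spec discussed in #22; the merged PR later dropped the timing info, so this is illustrative only):

import unittest

import dimod
from dwave.samplers import RandomSampler

class TestTimeLimit(unittest.TestCase):
    def test_sampling_time(self):
        bqm = dimod.generators.gnp_random_bqm(100, .5, 'BINARY')
        ss = RandomSampler().sample(bqm, num_reads=10, time_limit=.02)
        # the sampling time should be reasonably close to the requested limit
        self.assertAlmostEqual(
            ss.info['timing']['sampling_time'], .02, delta=.01)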

Comment on lines 96 to 103
Only the best ``num_reads`` (or fewer) samples are kept.

Returns:
A sample set.
Some additional information is provided in the
:attr:`~dimod.SampleSet.info` dictionary:

* **num_drawn**: The total number of samples generated.
kevinchern (Contributor) commented Jun 21, 2022:

Is my understanding correct that num_drawn != num_reads when time_limit allows for more draws than num_reads?

arcondello (Member Author):

Specifically, when time_limit is provided and it allows for more reads.

with self.assertWarns(dimod.exceptions.SamplerUnknownArgWarning):
RandomSampler().sample(bqm, a=5, b=2)

def test_time_limit(self):
Contributor:

Maybe add test cases for different settings:

  1. default (no num_reads, no time_limit)
  2. only num_reads
  3. only time_limit

arcondello (Member Author) commented Jun 21, 2022:

The default case is covered rather extensively by @dimod.testing.load_sampler_bqm_tests(RandomSampler), which adds ~30 tests.
But sure, can separate the num_reads and time_limit tests.

kevinchern (Contributor) left a review:

Maybe add a few more test cases, otherwise it looks good.

randomir (Member) left a review:

Looks good overall; my only concern is about the sampling termination criteria interface.

Comment on lines 117 to 118
sort(samples.begin(), samples.end(), comp_state)
samples.pop_back()
Member:

Perhaps use a heap instead? (for O(N log N) total complexity)

Alternatively, you can use insertion sort (O(N) per step here, ~O(N^2) total) instead of "IntroSort" (O(N log N) per step here, ~O(N^2 log N) total).
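A minimal Python sketch of the heap idea (the PR's actual code is Cython over C++ vectors; keep_best and stream are illustrative names, not PR API):

import heapq

def keep_best(stream, num_reads):
    """Keep the num_reads lowest-energy (energy, sample) pairs from a stream.

    A max-heap (via negated energies) keeps the worst retained sample at
    the root, so each new draw costs O(log num_reads) instead of a re-sort.
    """
    heap = []  # entries: (-energy, draw_index, sample); the index breaks ties
    for i, (energy, sample) in enumerate(stream):
        if len(heap) < num_reads:
            heapq.heappush(heap, (-energy, i, sample))
        elif energy < -heap[0][0]:  # strictly better than the worst kept
            heapq.heapreplace(heap, (-energy, i, sample))
    # unpack best-to-worst
    return [(-neg, s) for neg, _, s in sorted(heap, reverse=True)]

For example, keep_best(iter([(3.0, 'a'), (1.0, 'b'), (2.0, 'c'), (0.5, 'd')]), 2) returns [(0.5, 'd'), (1.0, 'b')].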

arcondello (Member Author):

Tried that; it actually hurt performance. I imagine it's the compiler being smart.

Member:

Or the constant factor is too big to notice the benefits at smaller scales.

arcondello (Member Author) commented Jun 23, 2022:

Indeed, though I tried fairly large numbers. You know what, there have been a bunch of changes since then. Let me retest.

arcondello (Member Author):

Ok, after the change introduced in 176c25c, I was able to get a performance boost at very large num_reads. So I will indeed make the change. Thanks, and good suggestion!

arcondello (Member Author):

For posterity I used

bqm = dimod.generators.gnp_random_bqm(1000, .5, 'BINARY')
num_reads = 10000
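A runnable reconstruction of that benchmark (the timing harness itself is assumed):

import time

import dimod
from dwave.samplers import RandomSampler

bqm = dimod.generators.gnp_random_bqm(1000, .5, 'BINARY')

t = time.perf_counter()
RandomSampler().sample(bqm, num_reads=10000)
print(f"elapsed: {time.perf_counter() - t:.3f} s")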



Comment on lines 93 to 96
time_limit:
The maximum sampling time in seconds.
If given and non-negative, samples are drawn until ``time_limit``.
Only the best ``num_reads`` (or fewer) samples are kept.
Member:

Until we figure out how best to extend/generalize the Initialized sampler interface to handle other termination criteria, I would forbid setting both num_reads and time_limit here. It's confusing because time_limit takes precedence and the sampler might return fewer than num_reads samples.

A case of setting time_limit but not num_reads is also confusing: we return only one sample after num_drawn tries.

Perhaps we should introduce max_answers (or max_samples) that can be used in combination with time_limit. See dwavesystems/dwave-greedy#24 for discussion.

So:

  • num_reads set: exactly num_reads returned, time_limit not allowed. max_answers ignored.
  • time_limit set: max_answers (which can default to 1 or inf/None) is used

With this scheme, I would set num_reads by default to None, with interpretation on how much that actually is depending on time_limit/max_answers. Similarly to how we resolve it in the Initialized arg parser.
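A sketch of the scheme proposed above (argument names as proposed; this is not the behavior that was ultimately merged, which accepts both together):

def resolve_termination(num_reads=None, time_limit=None, max_answers=None):
    """Resolve termination criteria under the proposed scheme."""
    if num_reads is not None:
        if time_limit is not None:
            raise ValueError("num_reads and time_limit are mutually exclusive")
        return dict(reads=num_reads)  # exactly num_reads returned
    if time_limit is not None:
        keep = 1 if max_answers is None else max_answers
        return dict(seconds=time_limit, keep=keep)
    return dict(reads=10)  # fall back to a default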

arcondello (Member Author) commented Jun 23, 2022:

I had already changed it so num_reads has precedence. And I disagree that they should be mutually exclusive. The solver will now always return num_reads. time_limit just determines whether it keeps going.

kevinchern (Contributor) commented Jun 23, 2022:

To summarize the current implementation's behaviour...

num_reads [x], time_limit [x]
- Case: more num_reads than time_limit allows for: always return num_reads states and go past the time limit.
- Case: fewer num_reads than time_limit allows for: always return num_reads states and continue sampling until time_limit.
**num_reads [ ], time_limit [x]**
- num_reads is implicitly set to the default. See the above scenarios.
num_reads [x], time_limit [ ]
- Always return num_reads.
num_reads [ ], time_limit [ ]
- num_reads is set to the default. See the above scenario.

I think the bolded case could be an invitation to human error for a default num_reads > 1 (currently 10). For example, sampler.sample(time_limit=very-small-value) suggests the sampler should return 1 state, but instead it will return 10. If we want to keep this behaviour, I suggest changing the default num_reads to 1.

edit: I reread the above, and I think the ambiguity comes from the argument name "time_limit" being suggestive of a hard cutoff.

arcondello (Member Author):

Sure, happy to lower the default to 1. I had it higher for backwards compatibility reasons with dimod but I don't think that actually matters.

Member:

Ok, so it looks like an implied AND between ==num_reads and >=time_limit:

  • sampler will always return num_reads,
  • sampler will run for at least time_limit?

I find this awkward and inconsistent with Initialized (num_reads adapts to initial_states if states are user-specified and reads are not) and with the termination criteria in dwave-hybrid/Kerberos (time/convergence/energy/iteration count are OR-ed; we stop as soon as one condition is met).

arcondello (Member Author):

Ok, how about I just make two samplers: one that takes a time_limit and max_answers, and one that takes num_reads. If we're going to make the arguments disjoint, we might as well make that clear at the sampler level.

Personally I think that's way overkill, but it seems like no one else agrees with me 😄

arcondello (Member Author) commented Jun 23, 2022:

> If the user sets time_limit (and not num_reads), they care about run time, although it's not clear what number of samples they want. But if they haven't specified the number, it makes sense to return all samples generated during the runtime. Here's where max_samples/max_answers comes into play.

The problem is that we generate a lot of samples. Way more than can easily fit in memory. (We can probably fit them in memory, but still, it can be millions even for small runtimes.) So returning all of them is not practical for many problems. Which is why I used the default num_reads, which was 10.

> In this framework, setting both num_reads and time_limit would run until one of them is satisfied.

This is what I had before. edit: never mind, I understand. I was treating time_limit as having priority.

kevinchern (Contributor) commented Jun 23, 2022:

So, we can either treat time_limit as the most important and have max_answers/max_samples as the secondary criterion, or we can treat num_reads as the most important and treat time_limit as the secondary criterion when provided.

If we're deciding between the two, I also favour the latter for similar reasons. I do think randomir's proposal is more intuitive and gives more control over the behaviour of the sampler.

Another point to consider: if we add a similar feature for Neal, consistent behaviour across classes of samplers would be desirable. And in the case of Neal, I think it's reasonable to set a time limit and retrieve all the solutions it came across.

arcondello (Member Author) commented Jun 23, 2022:

Ok, I concede. I will refactor.

Member:

Extending the sampler interface with time_limit (and max_answers, or whatever we call it) would be a good candidate for a follow-up PR, just to keep concerns separated. But since we had all this discussion here, we might as well do it here and then generalize in a follow-up.

@@ -0,0 +1,3 @@
---
features:
- Add ``RandomSampler`` which generates samples with variable assignments chosen by coin flip.
Member:

Unless we have a coin-flipping device attached to the user's machine, I would rephrase this 😄

arcondello (Member Author) commented Jun 23, 2022:

Ok, refactored. Now both num_reads and time_limit are termination criteria. Added max_num_samples to limit the population of samples kept internally. I defaulted max_num_samples to 1000. It's a magic number, but I don't think it would be "safe" to leave it unbounded. The sampler is rather fast and can generate many samples for even short runtimes.

I also have dropped the timing info, pending a decision on #22. We can restore it in a followup PR.
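A hedged sketch of the refactored usage (argument names as merged in this PR; released versions may differ):

import dimod
from dwave.samplers import RandomSampler

bqm = dimod.generators.gnp_random_bqm(100, .5, 'BINARY')
sampler = RandomSampler()

# num_reads alone: stop after exactly 50 draws
ss = sampler.sample(bqm, num_reads=50)

# time_limit alone: draw as many reads as fit in ~0.1 s,
# keeping only the best max_num_samples internally
ss = sampler.sample(bqm, time_limit=.1, max_num_samples=1000)
print(ss.info, len(ss))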

randomir (Member) left a review:

LGTM!


Get the best 5 samples found in .1 seconds

>>> sampleset = sampler.sample(bqm, time_limit=.1, max_num_samples=5)
Member:

Perhaps include .info['num_drawn'] / .info['num_reads'] here to clarify that many more were actually generated.

Comment on lines +99 to +100
If ``time_limit`` is provided, there is no limit imposed on
the number of reads.
Member:

Isn't the limit set by max_num_samples then?

arcondello (Member Author):

That's the limit on the population; it will take unlimited reads up to the time limit.

arcondello merged commit 4dc4b57 into dwavesystems:main on Jun 23, 2022.
arcondello deleted the feature/random-sampler branch on June 23, 2022.
Labels: enhancement (New feature or request)

5 participants