
Add RandomSampler #24

Merged: 11 commits, Jun 23, 2022

Conversation

arcondello (Member)

Add a performant random sampler. This is intended to replace dimod.RandomSampler as part of dwavesystems/dimod#298.

This implements the same API, but also includes a time_limit parameter. This is a feature used by the benchmarking team - this sampler is meant to replace their random batch sampler with one maintained as part of Ocean.

I have also attempted to include timing information following the spec in #22.

I kind of hate that, because we list the samplers alphabetically, this one gets put first 🤷.

arcondello mentioned this pull request on Jun 21, 2022.

arcondello (Member Author) commented Jun 21, 2022:

This relies on an unreleased dimod feature - hence the CI failures. Will open a PR shortly. dwavesystems/dimod#1210

@@ -24,13 +24,32 @@ or locally on your CPU.
*dwave-samplers* implements the following classical algorithms for solving
:term:`binary quadratic model`\ s (BQM):

* Random: a sampler that draws uniform random samples.
kevinchern (Contributor) commented Jun 21, 2022:

Missing backticks, i.e.,

* `Random` : a sampler that draws uniform random samples.

arcondello (Member Author):

Those are for the link. In this case I am not linking to anything. Though open to a good place to link to, https://en.wikipedia.org/wiki/Randomization seemed a bit over general 😄

JoelPasvolsky (Contributor) commented Jun 21, 2022:

It's backticks plus an underscore to link to the new Random section on line 35.

arcondello (Member Author):

Got it, I misunderstood. Will do.

Random
======

Random samplers provide a useful baseline performance comparison. The variable
Contributor:

Would this be the place to mention which PRNG is used? e.g., Mersenne-Twister?

arcondello (Member Author):

Could do, though the user can actually set that by passing in a Generator. By default we use NumPy's default, which is currently PCG64. Though I would want to just say that we use the default from NumPy in case NumPy changes.
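For illustration, a minimal sketch of passing a Generator (the seed parameter name is an assumption based on similar Ocean samplers, not confirmed by this PR):

import dimod
import numpy as np
from dwave.samplers import RandomSampler

bqm = dimod.generators.gnp_random_bqm(10, .5, 'BINARY')

# NumPy's default bit generator (currently PCG64)
rng = np.random.default_rng(42)
# or an explicit Mersenne-Twister, as asked about above
rng = np.random.Generator(np.random.MT19937(42))

sampleset = RandomSampler().sample(bqm, num_reads=10, seed=rng)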

cdef double sampling_start_time = realtime_clock()

cdef double sampling_stop_time
if time_limit < 0:
kevinchern (Contributor) commented Jun 21, 2022:

Should this be <=?

Quoting the docstring below:

If given and non-negative, samples are drawn until time_limit.

Should this be strictly positive?

Update: never mind, I can see why a strict inequality is an intuitive choice. I don't have an objective argument at the moment.

Comment on lines 83 to 88
if is_spin:
# go back through and convert to spin
for i in range(state.sample.size()):
state.sample[i] = 2 * state.sample[i] - 1

state.energy = cybqm.data().energy(state.sample.begin())
Contributor:

Converting spins and computing energies seem more fitting as a postprocessing step.

arcondello (Member Author) commented Jun 21, 2022:

Only in the case when you're keeping every sample. Otherwise you need the energies in order to keep the population to num_reads. Because for "real uses" of the solver you need to keep the population limited, IMO it's part of sampling.

Contributor:

OK that makes sense. Follow-up question: why do we use num_reads as a cap, as opposed to accepting one and only one parameter (one of num_reads, time_limit)? The latter seems like a more natural use case.

arcondello (Member Author):

Four different combos with their uses:

  • both unspecified: I just want to use the defaults
  • time_limit set, num_reads unset: I want the best sample I can find randomly in the given time
  • time_limit unset, num_reads set: I want num_reads samples drawn uniformly. If I were to specify the time_limit, they would not be uniform.
  • both set: I want the best sample(s) found after time_limit, and I want to specify the number of returned samples and/or the internal memory used.

Obviously this is not an exhaustive list of use cases, but it should give some idea.

Member:

I would forbid the user from setting both num_reads and time_limit, since time_limit takes precedence and might return fewer than num_reads samples.

arcondello (Member Author) commented Jun 23, 2022:

947320b changed the behavior to always return num_reads. time_limit now just allows the algorithm to keep sampling.

Member:

I was looking at your comment above, and missed that update, thanks.

Comment on lines 169 to 171
preprocessing_time=sampling_start_time-preprocessing_start_time,
sampling_time=postprocessing_start_time-sampling_start_time,
postprocessing_time=realtime_clock()-postprocessing_start_time,
Contributor:

Return as a dict, i.e., timing=dict(preprocessing_time=..., sampling_time=..., postprocessing_time=...) for consistency with how we output QPU timings

arcondello (Member Author):

I am open to having it be nested. Though since we're already changing the names, IMO consistency is not the main reason to do it.

arcondello (Member Author):

If we do nest it, and if we're already changing the names, IMO it would look nicer to have

info = dict(timing=dict(
    preprocessing=...,
    sampling=...,
    postprocessing=...,
))

Reviewer:

IMHO consistency of timing terminology with the QPU and other solvers is a priority. Otherwise the code to plot data from different solvers on the same graph becomes very unwieldy. This may require some exceptions and compromises because different solvers have different structures, but aiming for maximum comparability would be a good thing. This would also be an argument for generic timing categories based on "once per input" and "once per output", which are natural comparators, rather than solver-specific functions.

Contributor:

+1 for nested timings!

Member:

I would also consider typing the duration variables. Perhaps best to use datetime.timedelta. I always need to check if some time variable in Ocean is in seconds, ms or us. Hate that.

arcondello (Member Author):

I would also like that. I did it this way for consistency with QPU timing, but we're already being inconsistent in other places, so it's probably worth raising at #22?

Member:

QPU timing info could be considered encoded for transport -- and we can decode it as timedelta client-side.
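A minimal sketch of the timedelta suggestion (variable and key names here are illustrative, not from the PR):

import time
from datetime import timedelta

t0 = time.perf_counter()
# ... preprocessing ...
t1 = time.perf_counter()
# ... sampling ...
t2 = time.perf_counter()

timing = {
    'preprocessing': timedelta(seconds=t1 - t0),
    'sampling': timedelta(seconds=t2 - t1),
}

# the unit is now unambiguous and recoverable at any resolution
print(timing['sampling'].total_seconds())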

sampleset = RandomSampler().sample(bqm, num_reads=10, time_limit=.02)
runtime = time.perf_counter() - t

self.assertTrue(.01 < runtime < .04)
kevinchern (Contributor) commented Jun 21, 2022:

Should also have a test case for info['timing']['sampling_time'] being reasonably close to 0.02.
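A sketch of the suggested assertion (the 'timing'/'sampling_time' keys follow the spec discussed in #22; the merged PR later dropped the timing info, so this is illustrative only):

import unittest

import dimod
from dwave.samplers import RandomSampler

class TestTimeLimit(unittest.TestCase):
    def test_sampling_time(self):
        bqm = dimod.generators.gnp_random_bqm(100, .5, 'BINARY')
        ss = RandomSampler().sample(bqm, num_reads=10, time_limit=.02)
        # the sampling time should be reasonably close to the requested limit
        self.assertAlmostEqual(
            ss.info['timing']['sampling_time'], .02, delta=.01)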

Comment on lines 96 to 103
Only the best ``num_reads`` (or fewer) samples are kept.

Returns:
A sample set.
Some additional information is provided in the
:attr:`~dimod.SampleSet.info` dictionary:

* **num_drawn**: The total number of samples generated.
kevinchern (Contributor) commented Jun 21, 2022:

Is my understanding correct that num_drawn != num_reads when time_limit allows for more draws than num_reads?

arcondello (Member Author):

Specifically, when time_limit is provided and it allows for more reads.

with self.assertWarns(dimod.exceptions.SamplerUnknownArgWarning):
RandomSampler().sample(bqm, a=5, b=2)

def test_time_limit(self):
Contributor:

Maybe add test cases for different settings:

  1. default (no num_reads, no time_limit)
  2. only num_reads
  3. only time_limit

arcondello (Member Author) commented Jun 21, 2022:

The default case is covered rather extensively by @dimod.testing.load_sampler_bqm_tests(RandomSampler), which adds ~30 tests.
But sure, can separate the num_reads and time_limit tests.

kevinchern (Contributor) left a review:

Maybe add a few more test cases, otherwise it looks good.

randomir (Member) left a review:

Looks good overall; my only concern is about the sampling termination criteria interface.

Comment on lines 117 to 118
sort(samples.begin(), samples.end(), comp_state)
samples.pop_back()
Member:

Perhaps use a heap instead? (for O(N log N) total complexity)

Alternatively, you can use insertion sort (O(N) per step here, ~O(N^2) total) instead of "IntroSort" (O(N log N) per step here, ~O(N^2 log N) total).
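A minimal Python sketch of the heap idea (the PR's actual code is Cython over C++ vectors; keep_best and stream are illustrative names, not PR API):

import heapq

def keep_best(stream, num_reads):
    """Keep the num_reads lowest-energy (energy, sample) pairs from a stream.

    A max-heap (via negated energies) keeps the worst retained sample at
    the root, so each new draw costs O(log num_reads) instead of a re-sort.
    """
    heap = []  # entries: (-energy, draw_index, sample); the index breaks ties
    for i, (energy, sample) in enumerate(stream):
        if len(heap) < num_reads:
            heapq.heappush(heap, (-energy, i, sample))
        elif energy < -heap[0][0]:  # strictly better than the worst kept
            heapq.heapreplace(heap, (-energy, i, sample))
    # unpack best-to-worst
    return [(-neg, s) for neg, _, s in sorted(heap, reverse=True)]

For example, keep_best(iter([(3.0, 'a'), (1.0, 'b'), (2.0, 'c'), (0.5, 'd')]), 2) returns [(0.5, 'd'), (1.0, 'b')].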

arcondello (Member Author):

Tried that; it actually hurt performance. I imagine it's the compiler being smart.

Member:

Or the constant factor is too big to notice the benefits at smaller scales.

arcondello (Member Author) commented Jun 23, 2022:

Indeed, though I tried fairly large numbers. You know what, there have been a bunch of changes since then. Let me retest.

arcondello (Member Author):

Ok, after the change introduced in 176c25c, I was able to get a performance boost at very large num_reads. So I will indeed make the change. Thanks, and good suggestion!

arcondello (Member Author):

For posterity I used

bqm = dimod.generators.gnp_random_bqm(1000, .5, 'BINARY')
num_reads = 10000
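A runnable reconstruction of that benchmark (the timing harness itself is assumed):

import time

import dimod
from dwave.samplers import RandomSampler

bqm = dimod.generators.gnp_random_bqm(1000, .5, 'BINARY')

t = time.perf_counter()
RandomSampler().sample(bqm, num_reads=10000)
print(f"elapsed: {time.perf_counter() - t:.3f} s")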



Comment on lines 93 to 96
time_limit:
The maximum sampling time in seconds.
If given and non-negative, samples are drawn until ``time_limit``.
Only the best ``num_reads`` (or fewer) samples are kept.
Member:

Until we figure out how best to extend/generalize the Initialized sampler interface to handle other termination criteria, I would forbid setting both num_reads and time_limit here. It's confusing because time_limit takes precedence and the sampler might return fewer than num_reads samples.

A case of setting time_limit but not num_reads is also confusing: we return only one sample after num_drawn tries.

Perhaps we should introduce max_answers (or max_samples) that can be used in combination with time_limit. See dwavesystems/dwave-greedy#24 for discussion.

So:

  • num_reads set: exactly num_reads returned, time_limit not allowed. max_answers ignored.
  • time_limit set: max_answers (which can default to 1 or inf/None) is used

With this scheme, I would set num_reads by default to None, with interpretation on how much that actually is depending on time_limit/max_answers. Similarly to how we resolve it in the Initialized arg parser.
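A sketch of the scheme proposed above (argument names as proposed; this is not the behavior that was ultimately merged, which accepts both together):

def resolve_termination(num_reads=None, time_limit=None, max_answers=None):
    """Resolve termination criteria under the proposed scheme."""
    if num_reads is not None:
        if time_limit is not None:
            raise ValueError("num_reads and time_limit are mutually exclusive")
        return dict(reads=num_reads)  # exactly num_reads returned
    if time_limit is not None:
        keep = 1 if max_answers is None else max_answers
        return dict(seconds=time_limit, keep=keep)
    return dict(reads=10)  # fall back to a default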

arcondello (Member Author) commented Jun 23, 2022:

I had already changed it so num_reads has precedence. And I disagree that they should be mutually exclusive. The solver will now always return num_reads. time_limit just determines whether it keeps going.

kevinchern (Contributor) commented Jun 23, 2022:

To summarize the current implementation's behaviour...

num_reads [x], time_limit [x]
- Case: more num_reads than time_limit allows for: always return num_reads states and go past the time limit.
- Case: fewer num_reads than time_limit allows for: always return num_reads states and continue sampling until time_limit.
**num_reads [ ], time_limit [x]**
- num_reads is implicitly set to the default. See the above scenarios.
num_reads [x], time_limit [ ]
- Always return num_reads.
num_reads [ ], time_limit [ ]
- num_reads is set to the default. See the above scenario.

I think the bolded case could be an invitation to human error for a default num_reads > 1 (currently 10). For example, sampler.sample(time_limit=very-small-value) suggests the sampler should return 1 state, but instead it will return 10. If we want to keep this behaviour, I suggest changing the default num_reads to 1.

edit: I reread the above, and I think the ambiguity comes from the argument name "time_limit" being suggestive of a hard cutoff.

arcondello (Member Author):

Sure, happy to lower the default to 1. I had it higher for backwards compatibility reasons with dimod but I don't think that actually matters.

Member:

Ok, so it looks like an implied AND between ==num_reads and >=time_limit:

  • sampler will always return num_reads,
  • sampler will run for at least time_limit?

I find this awkward and inconsistent with Initialized (num_reads adapts to initial_states if states are user-specified and reads are not) and with the termination criteria in dwave-hybrid/Kerberos (time/convergence/energy/iteration count are OR-ed; we stop as soon as one condition is met).

arcondello (Member Author):

Ok, how about I just make two samplers: one that takes a time_limit and max_answers, and one that takes num_reads. If we're going to make the arguments disjoint, we might as well make that clear at the sampler level.

Personally I think that's way overkill, but it seems like no one else agrees with me 😄

arcondello (Member Author) commented Jun 23, 2022:

> If the user sets time_limit (and not num_reads), they care about run time, although it's not clear what number of samples they want. But if they haven't specified the number, it makes sense to return all samples generated during the runtime. Here's where max_samples/max_answers comes into play.

The problem is that we generate a lot of samples. Way more than can easily fit in memory. (We can probably fit them in memory, but still, it can be millions even for small runtimes.) So returning all of them is not practical for many problems. Which is why I used the default num_reads, which was 10.

> In this framework, setting both num_reads and time_limit would run until one of them is satisfied.

This is what I had before. edit: never mind, I understand. I was treating time_limit as having priority.

kevinchern (Contributor) commented Jun 23, 2022:

So, we can either treat time_limit as the most important and have max_answers/max_samples as the secondary criterion, or we can treat num_reads as the most important and treat time_limit as the secondary criterion when provided.

If we're deciding between the two, I also favour the latter for similar reasons. I do think randomir's proposal is more intuitive and gives more control over the behaviour of the sampler.

Another point to consider: if we add a similar feature for Neal, consistent behaviour across classes of samplers would be desirable. And in the case of Neal, I think it's reasonable to set a time limit and retrieve all the solutions it came across.

arcondello (Member Author) commented Jun 23, 2022:

Ok, I concede. I will refactor.

Member:

Extending the sampler interface with time_limit (and max_answers, or whatever we call it) would be a good candidate for a follow-up PR, just to keep concerns separated. But since we had all this discussion here, we might as well do it here and then generalize in a follow-up.

@@ -0,0 +1,3 @@
---
features:
- Add ``RandomSampler`` which generates samples with variable assignments chosen by coin flip.
Member:

Unless we have a coin-flipping device attached to the user's machine, I would rephrase this 😄

arcondello (Member Author) commented Jun 23, 2022:

Ok, refactored. Now both num_reads and time_limit are termination criteria. Added max_num_samples to limit the population of samples kept internally. I defaulted max_num_samples to 1000. It's a magic number, but I don't think it would be "safe" to leave it unbounded. The sampler is rather fast and can generate many samples for even short runtimes.

I also have dropped the timing info, pending a decision on #22. We can restore it in a followup PR.
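A hedged sketch of the refactored usage (argument names as merged in this PR; released versions may differ):

import dimod
from dwave.samplers import RandomSampler

bqm = dimod.generators.gnp_random_bqm(100, .5, 'BINARY')
sampler = RandomSampler()

# num_reads alone: stop after exactly 50 draws
ss = sampler.sample(bqm, num_reads=50)

# time_limit alone: draw as many reads as fit in ~0.1 s,
# keeping only the best max_num_samples internally
ss = sampler.sample(bqm, time_limit=.1, max_num_samples=1000)
print(ss.info, len(ss))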

randomir (Member) left a review:

LGTM!


Get the best 5 samples found in .1 seconds

>>> sampleset = sampler.sample(bqm, time_limit=.1, max_num_samples=5)
Member:

Perhaps include .info['num_drawn'] / .info['num_reads'] here to clarify that many more were actually generated.

Comment on lines +99 to +100
If ``time_limit`` is provided, there is no limit imposed on
the number of reads.
Member:

Isn't the limit set by max_num_samples then?

arcondello (Member Author):

That's the limit on the population; it will take unlimited reads up to the time limit.

arcondello merged commit 4dc4b57 into dwavesystems:main on Jun 23, 2022.
arcondello deleted the feature/random-sampler branch on June 23, 2022.
Labels: enhancement (New feature or request)

5 participants