
WIP: ENH: Basic implementation of SGD #69

Open · wants to merge 5 commits into base: main
Conversation

@stsievert (Member) commented Mar 11, 2018

This implements the popular mini-batch stochastic gradient descent (SGD) algorithm in dask-glm. At each iteration, it grabs batch_size examples, computes the gradient for those examples, and then updates the parameters accordingly.

The main benefit of SGD is that the convergence time does not depend on the number of examples. Here, "convergence time" means "number of arithmetic operations", not "wall-clock time until completion".
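
For concreteness, here is a minimal NumPy-only sketch of that update rule (the names grad, step_size, and so on are illustrative, not this PR's actual API):

import numpy as np

def minibatch_sgd(grad, beta, X, y, batch_size=32, step_size=0.1, epochs=10):
    # Illustrative sketch only: shuffle once per epoch, then walk through the
    # data in mini-batches and take one gradient step per batch.
    n = X.shape[0]
    for epoch in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            i = order[start:start + batch_size]
            beta = beta - step_size * grad(X[i], y[i], beta)
    return beta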

TODOs:

  • finish implementation
  • resolve bugs
  • test

@mrocklin (Member)

This is cool to see. Have you tried this on a dataset, artificial or otherwise? Do you have a sense for the performance bottlenecks here? The first thing I would do here would be to run this with the distributed scheduler on a single machine and watch the dashboard during execution. I suspect that it would be illuminating.

@stsievert (Member Author)

@mrocklin and I have had some discussion over at dask/dask#3251 about one of the bugs I encountered while implementing this. The takeaway from that discussion was that I should randomly shuffle the dataset every epoch and walk sequentially through it, not sample the dataset at random.

I'm now curious what kinds of parallel SGD algorithms exist. dask/dask#3251 (comment)

Most parallel SGD algorithms I've seen rely on some sparsity constraints (e.g., Hogwild!, Hogwild++, Cyclades). Practically, there's a much simpler way to parallelize SGD: parallelize the gradient computation within each mini-batch. For deep learning at least, that's the main bottleneck.

futures samples asynchronously to overlap communication and computation

Got it. I see what you're saying. The futures API should make that pretty convenient.

We are computing the gradient for different examples, then summing them. Could we use map_blocks to compute the gradient locally for each chunk, then send the gradients computed from each chunk back to the master? Or maybe a better question: would this change anything?

(I meant to send this as soon as the PR was filed; sorry for the delay.)

@stsievert (Member Author)

Have you tried this on a dataset, artificial or otherwise? Do you have a sense for the performance bottlenecks here?

This is very early work (the code isn't even error-free), and I only filed this PR to move the discussion off the other, unrelated issue. I'll look more into performance bottlenecks after I finish the implementation.

@mrocklin (Member)

My understanding of Hogwild! and similar algorithms is that they expect roundtrip latencies in the microseconds, which is, for them, a critical number for performance. Is this understanding correct? If so, how does a dynamic distributed system like Dask (which has latencies of several milliseconds) manage?

The takeaway from this discussion was that I should randomly shuffle the dataset every epoch and walk sequentially through it, not sample the dataset at random

How many epochs are there? Are you expected to use all of the data every epoch? You should be aware that while shuffling is much faster than n random accesses, it's still quite slow. There might be other approaches, like gathering a sizable random sample (like 10% of the data) 10x more frequently.
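
A rough sketch of that sampling idea, assuming X and y are dask arrays chunked along the first axis (the helper name and parameters are illustrative, not existing dask-glm code):

import dask
import numpy as np

def pull_local_sample(X, y, frac=0.1):
    # Pick a random subset of row-blocks, pull them to the local machine in a
    # single compute call, and shuffle the result in memory.
    nblocks = X.numblocks[0]
    picked = np.random.choice(nblocks, max(1, int(frac * nblocks)), replace=False)
    X_parts, y_parts = dask.compute(
        [X.blocks[i] for i in picked], [y.blocks[i] for i in picked]
    )
    X_local = np.concatenate(X_parts)
    y_local = np.concatenate(y_parts)
    order = np.random.permutation(len(y_local))
    return X_local[order], y_local[order]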

This is very early work (the code isn't even error-free), and I only filed this PR to move the discussion off the other, unrelated issue. I'll look more into performance bottlenecks after I finish the implementation.

Understood, and I really appreciate the early submission to stimulate discussion. I think that most of my questions so far have been aiming at the following point: our cost model has changed dramatically from what it was on a multi-core machine. We should build intuition sooner rather than later to help inform algorithm decisions.

We are computing the gradient for different examples, then summing them. Could we use map_blocks to compute the gradient locally for each chunk, then send the gradients computed from each chunk back to the master? Or maybe a better question: would this change anything?

Sure, that's easy to do. It seems quite different from the SGD approach you were mentioning earlier, though, which avoids looking at all of the data on each iteration.

@mrocklin (Member)

In case you haven't seen it already you might want to look at #57

@mrocklin (Member)

Ah, as before I see that you've already covered this ground: #57 (comment)

@stsievert (Member Author)

expect roundtrip latencies in the microseconds, which is, for them, a critical number for performance. Is this understanding correct?

In the literature, latency is measured in the number of writes to the parameter vector (e.g., Section 2.1 of "Taming the Wild"). Even in the limiting case where bandwidth is the constraint, I don't see how halving the communication bandwidth would be a problem (though it could be if stragglers are considered).

How many epochs are there? Are you expected to use all of the data every epoch? You should be aware that while shuffling is much faster than n random accesses, it's still quite slow

Normally fewer than 100 epochs (~100 for CIFAR-10 or ImageNet, ~10–20 for MNIST). For the taxicab dataset I'd imagine fewer: it's a simple model with many data points.

Is that acceptable? I'm okay with shuffling the complete dataset infrequently, and with shuffling on each machine infrequently. Either way, I think this will provide gains over computing the gradient for all examples.

Our cost model has changed dramatically from what it was on a multi-core machine. We should build intuition sooner rather than later to help inform algorithm decisions.

Agreed. We should avoid premature optimization and fix the problems we see.

@mrocklin (Member)

Full shuffles of the NYC taxicab dataset on a modest-sized cluster take a minute or so. Pulling a random GB or so of data from the cluster and then shuffling it locally would likely take a few seconds. I could also imagine doing the shuffling asynchronously while also pulling data locally.

Commits added to the PR:

  • BUG: dataframe size info not exact and indexing needed
  • squash
  • Getting rid of ASGD

@mrocklin (Member)

Have you had a chance to run this while looking at the dashboard? If not let me know and we can walk through this together. I think that it will help.

@stsievert (Member Author) commented Mar 23, 2018

Yeah, I've had a chance to look at the dashboard (after making sure the code optimizes successfully). It looks like indexing is the bottleneck; I believe about 67% of the time is spent there. At least, that portion of the dashboard remains constant when "getitem" is selected instead of "All" in the dropdown menu.

I think something is using a lot of bandwidth too: some task is transferring up to 20 Mb/s. That seems high, especially since the iterations are not fast: my beta has only 20 elements and 32 examples are used to approximate the gradient. Could indexing be responsible for this? I don't think the dot products or scalar functions are.

I think we can use map_blocks to get around this issue (well, if this is an issue), which would rely on the fact that gradient(X, beta) = sum(gradient(x, beta) for x in X) for most losses (including GLMs). This would only require the communication of beta, which has the same communication cost as 1 example (at least for a linear model).
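
A rough sketch of that idea, using squared loss as a stand-in gradient and assuming X and y share row chunking with X chunked only along the rows (names here are illustrative):

import dask.array as da
import numpy as np

def chunk_grad(X_chunk, y_chunk, beta):
    # Gradient of squared loss on one chunk, returned with a leading axis of
    # length 1 so the per-chunk gradients stack along axis 0.
    g = X_chunk.T @ (X_chunk @ beta - y_chunk.ravel())
    return g[None, :]

# beta is a small NumPy array, so only beta (plus the tiny per-chunk
# gradients) moves over the network; X and y stay where they are.
per_chunk = da.map_blocks(
    chunk_grad, X, y[:, None], beta,
    chunks=(1, X.shape[1]), dtype=X.dtype,
)
full_grad = per_chunk.sum(axis=0).compute()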

@stsievert (Member Author) commented Mar 26, 2018

I've played with this a little more, encountered some performance issues, and thought of some possible solutions. Shuffling the arrays (algorithms.py#L189) takes a long time on my local machine, even if it's only once per epoch. This issue can be resolved by moving the indexing to each array chunk, which means we wouldn't ever have to move the entire dataset.

This would require one assumption: that no row of X is split across blocks, i.e., the array is chunked only along the first dimension. SGD is designed for when the number of examples is very large, and it's typically much larger than the dimension of each example (e.g., the NYC taxicab dataset: 19 features, millions of examples).

The implementation would look something like

def _sgd_grad(family, Xbeta, X, y, idx, chunks, block_id=0):
    # Translate the global mini-batch indices in ``idx`` into indices local to
    # this block, then compute the gradient only for the rows in this block.
    i = _proper_indices(idx, block_id, chunks)
    return family.grad(Xbeta[i], X[i], y[i])

for k in range(100):
    # Sample a mini-batch of global row indices.
    i = np.random.choice(n, size=(batch_size,))
    # Each block computes the gradient for its share of the mini-batch ...
    grad = da.map_blocks(_sgd_grad, family, Xbeta, X, y, i, X.chunks)
    # ... then the per-block gradients are combined into a single gradient.
    grad = grad.reshape(...).sum(axis=...)
    ...

I think I'll pursue this idea tomorrow, @mrocklin.

@mrocklin (Member) left a comment

Some small comments. Is there a test dataset that you're using for this work?

raise ValueError('SGD needs shape information to allow indexing. '
                 'This is possible by passing in a computed array '
                 '(`X.compute()` or `X.values.compute()`) and then using '
                 '`dask.array.from_array`.')
Member

This probably won't work well on a larger dataset on a cluster. We should probably find a better approach here if possible.

Member Author

I've allowed passing in a keyword arg to give the number of examples. If it's not present I print an error message with helpful examples for a distributed dataframe (the implementation is at algorithms.py#_get_n).

Is this more like what you'd like to see? I debated converting to a dataframe and computing the length from that, but this option was the cleanest and lets the user supply outside information.
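
For reference, a hypothetical sketch of what such a helper might look like (the signature and message below are illustrative, not the exact code in algorithms.py):

import numpy as np

def _get_n(X, **kwargs):
    # Prefer an explicitly supplied number of examples; otherwise fall back to
    # the array's shape, which may be NaN for arrays built from dataframes.
    if "n" in kwargs:
        return kwargs["n"]
    n = X.shape[0]
    if np.isnan(n):
        raise ValueError(
            "SGD needs the number of examples. Pass it explicitly, e.g. "
            "n=len(df) when X comes from a dask.dataframe."
        )
    return int(n)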

Member

In general this is something that we should probably push upstream into dask.array proper. It looks like there is an upstream issue here: dask/dask#3293

Short term, if you're experimenting, I say we just go with this and move on. Long term, I don't think it would be reasonable to ask users to provide this information, but if we're OK not merging this implementation (which, given the performance issues, seems wise) then it might be best to just blow past it for now.

Member Author

Leaving this to the upstream issue sounds best to me.

for epoch in range(epochs):
    j = np.random.permutation(n)
    X = X[j]
    y = y[j]
Member

As mentioned by @stsievert this can be inefficient both because it causes a shuffle, and because we're pushing a possibly biggish array through the scheduler. Ideally we would find ways to avoid explicit indexing.

Member

This might also force a dask array to a single chunk?

Member

Another approach would be to add a random column, switch to dask.dataframe, sort by that column using set_index, and then come back to an array. This avoids the movement of a possibly large numpy array, incurs extra costs from moving to a dask.dataframe and back (probably not that large) and keeps the costs of doing a full shuffle.
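
A hedged sketch of that approach, assuming X and y share row chunking (the helper name and details are illustrative):

import numpy as np
import dask.array as da
import dask.dataframe as dd

def shuffle_rows(X, y):
    # Stack X and y so both get the same permutation, add a random key column,
    # and let dask.dataframe's set_index machinery perform the shuffle.
    Xy = da.hstack([X, y.reshape(-1, 1)])
    df = dd.from_dask_array(Xy)
    df = df.map_partitions(
        lambda part: part.assign(_key=np.random.random(len(part)))
    )
    df = df.set_index("_key")
    # Back to arrays; lengths=True recomputes chunk sizes so later slicing works.
    shuffled = df.to_dask_array(lengths=True)
    return shuffled[:, :-1], shuffled[:, -1]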

@stsievert (Member Author)

There is a dask/sklearn/skimage sprint soon, which will help achieve some of the ML goals (scisprints/2018_05_sklearn_skimage_dask#1 (comment)), specifically scaling to larger datasets. I'll try to put in time during the sprint to implement this, even if I'm unable to attend (which may happen for a variety of reasons).

cc @mrocklin

@mrocklin (Member) commented May 5, 2018 via email

@stsievert (Member Author) commented Jul 27, 2018

I've updated this with help from dask/dask#3407. The implementation relies on slicing a Dask array with a NumPy array, and I'm not convinced this is the best approach: I generate something like 200k tasks when indexing with Dask slices for a feature matrix of shape (1015701, 20) and a batch size of 1000.

I'll try to rework this using the re-indexing approach mentioned in #69 (comment).

Base automatically changed from master to main February 10, 2021 01:06