LogisticRegression cannot train from Dask DataFrame #84
Thanks. At the moment the dask_glm-based estimators just work with dask arrays, not dataframes. You can use `.values` to get the underlying dask array from a dataframe. I'm hoping to put in some helpers for handling all the extra DataFrame metadata sometime soon, so this will be more consistent across estimators.
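A minimal sketch of that conversion, assuming dask_ml's `LogisticRegression` and a synthetic dataset (the snippet is illustrative, not from the original comment):

```python
import dask.dataframe as dd
from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

# Build a dask DataFrame, then hand the estimator its underlying dask array.
X, y = make_classification(chunks=50)
df = dd.from_dask_array(X)

# fit_intercept=False sidesteps the unknown-chunks issue discussed below.
lr = LogisticRegression(fit_intercept=False)
lr.fit(df.values, y)  # .values returns the underlying dask array
```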
Thank you so much for the quick response! The problem is that when fitting a GLM with an intercept (which is usually the case), the dask array containing the features needs to have a defined chunk size, which I believe is not possible when the array comes from a dataframe. Anyway, I will reach out on the main dask issue page and ask there. Thank you!
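A small sketch of what is being described here: after round-tripping through a dataframe, the row chunk sizes are unknown (`nan`). The snippet is illustrative, not from the original comment.

```python
import dask.dataframe as dd
from dask_ml.datasets import make_classification

X, y = make_classification(chunks=50)   # chunks: ((50, 50), (20,))
X2 = dd.from_dask_array(X).values

# Row chunk sizes are lost in the round-trip:
print(X2.chunks)  # ((nan, nan), (20,))
```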
@julioasotodv, yes I forgot about that case. Let me put something together quickly.
Do you think there is a way to achieve this without making changes to dask's engine itself? |
What do you mean by "dask's engine"?
See dask/dask-glm#63 for a discussion on the relationship between dask-ml and dask-glm, and dask/dask-glm@master...TomAugspurger:add-intercept-dd for what the fix will look like.
I see. Would it work with that fix, even if the chunk size is not defined for the underlying dask array?
Yes, that should work. The solvers only require that the shape along the second axis is known:

```python
import dask.dataframe as dd
from dask_ml.linear_model import LinearRegression
from dask_ml.datasets import make_regression

X, y = make_regression(chunks=50)
df = dd.from_dask_array(X)
X2 = df.values  # dask.array with unknown chunks along first dim

lm = LinearRegression(fit_intercept=False)
lm.fit(X2, y)
```

Note that this uses `fit_intercept=False`; with the fix linked above, the intercept is added during the fit.
That's awesome! But let me be just a little picky about that change (dask/dask-glm@master...TomAugspurger:add-intercept-dd): in theory, if using either L1 or L2 regularization (or Elastic Net), the penalty term should not affect the intercept (that is, the column of ones that acts as the intercept should not be multiplied by the Lagrange multipliers that perform the actual regularization). However, it would still be better than not having an intercept. What do you think about this?
Thanks, I'll take a look at how other packages handle regularization of the intercept, but I think you're correct. cc @moody-marlin, thoughts on that?
Yea, I agree that the intercept should not be included in the regularization; I believe this is recommended best practice. Not regularizing the intercept also ensures that all regularizers still produce estimates whose residuals have mean 0, which preserves the standard interpretation of things like R², etc.
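As a sketch of the point being made (the notation here is mine, not from the thread): the intercept appears in the loss but is excluded from the penalty term.

```latex
\min_{\beta_0,\,\beta}\; \frac{1}{n}\sum_{i=1}^{n} L\left(y_i,\; \beta_0 + x_i^\top \beta\right) + \lambda P(\beta),
\qquad P(\beta) = \alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2
```

Note that the penalty P takes only the coefficients, never the intercept, regardless of whether it is L1, L2, or Elastic Net.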
Opened dask/dask-glm#65 to track that. I'll deprecate the estimators in dask-glm.
I see there is a PR ( dask/dask-glm#66 ) to deprecate the dask-glm estimators and a PR ( #94 ) which seems to have migrated the bulk of that content to dask-ml. Is this still the plan?
Yes, in my mind dask-glm has the optimizers, and dask-ml has the estimators built on top of those.
I'm facing the same issue. Initially I tried with a Dask DataFrame, and later changed to a Dask Array.
@asifali22 that looks strange. Can you provide a full example? Does the following work for you?
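The snippet itself was not preserved; a minimal example along the lines the thread suggests (synthetic data, illustrative names) might be:

```python
from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

# Dask arrays with known chunk sizes train directly.
X, y = make_classification(n_samples=1000, chunks=100)

lr = LogisticRegression()
lr.fit(X, y)
print(lr.score(X, y))
```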
Having a similar issue with a dask array. @TomAugspurger, see my SO question. Any idea?
@thebeancounter do you have a minimal example? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports |
@TomAugspurger Hi. The code is in the SO question, do you mean copy it here?
It looks like `data` isn't defined.
Also the error says you have multiple columns with no variance. You probably don’t want that.
`data` is defined. It's regular cifar10 data, passed through a pre-trained ResNet50 for feature extraction. It trains well with sklearn. I can't guarantee that there are no zero-variance columns, but those should not prevent learning anyway! They only waste some processing time.

Here is how the data is built (read from a folder with a generator, just to keep memory from exploding):

```python
import numpy as np
from keras.applications.resnet50 import ResNet50, preprocess_input
from keras.preprocessing.image import ImageDataGenerator
from keras.optimizers import Adam
from keras.losses import categorical_crossentropy

gen = ImageDataGenerator(preprocessing_function=preprocess_input)
train_flow = gen.flow_from_directory(directory=test_dir, target_size=(224, 224),
                                     class_mode="sparse", batch_size=1024, shuffle=True)
pre_model = ResNet50(weights="imagenet", include_top=False)
pre_model.compile(optimizer=Adam(), loss=categorical_crossentropy)

labels = []
data = []
for i in range(len(train_flow)):
    imgs, l = next(train_flow)
    data.append(pre_model.predict(imgs))
    labels.append(l)

labels = np.concatenate(labels)
data = np.concatenate(data, axis=0)
data = data.reshape(-1, np.prod(data.shape[1:]))
```

The data is under github.com/thebeancounter/data
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports may be helpful for writing an example.
Does the error show up if you have a dummy dataset where two columns have no variance?
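A sketch of such a check (names and shapes are my choosing): two constant columns appended to an otherwise random feature matrix.

```python
import dask.array as da
from dask_ml.linear_model import LogisticRegression

# Random features plus two zero-variance (constant) columns.
X = da.random.random((1000, 10), chunks=100)
X = da.concatenate([X, da.zeros((1000, 2), chunks=100)], axis=1)
y = (da.random.random(1000, chunks=100) > 0.5).astype(int)

lr = LogisticRegression(fit_intercept=False)
lr.fit(X, y)  # does the no-variance error appear here?
```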
Hi, I posted the code and the data. It's a solid example :-) Anyhow, can you maybe post a working example of using a numpy array for logistic regression in dask?
I’m guessing it’s not minimal. Simplifying it may reveal the issue.
Why do you want to use dask-ml’s LR on a numpy array?
@TomAugspurger My data originally comes from a numpy array; I need to convert it to some form that dask can learn on. I can't find any example of that in the tutorial, so maybe that's the issue. Can you point me to something of that kind?
https://docs.dask.org/en/latest/array-creation.html documents creating dask arrays, including from array-like things like NumPy arrays. Though my (vague) question was a bit deeper: why do you want to use dask's LR rather than scikit-learn's or SciPy's? If you're coming from a NumPy array, does your data fit in memory? If so, you should just use one of those.
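For the conversion itself, a minimal sketch using dask's documented `da.from_array` (array sizes and names are illustrative):

```python
import numpy as np
import dask.array as da
from dask_ml.linear_model import LogisticRegression

X_np = np.random.random((10_000, 50))
y_np = (np.random.random(10_000) > 0.5).astype(int)

# Wrap the in-memory arrays as chunked dask arrays.
X = da.from_array(X_np, chunks=(1_000, 50))
y = da.from_array(y_np, chunks=1_000)

LogisticRegression().fit(X, y)
```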
I have seen the discussion above, and there is one case I am unsure about: if I use that approach, will it influence the distributed computation? And in my case, I find that the scikit-learn estimators accept the dask data formats (array, dataframe), so what is the big difference between the two?
Scikit-learn will not utilize the machine's cores, and takes way, way too long to run...
@xiaozhongtian can you please clarify? Are you asking a question? I'm not sure I see the connection to this thread.
@TomAugspurger With `n_jobs=-1`, scikit-learn uses multiple processes to fit, no? But here, I want to understand how memory is managed in scikit-learn versus dask-ml.
I'm having the same problem: building a dataframe from dask arrays, then calling `fit` on it.
I presume you mean an array with unknown chunk sizes. You can call `compute_chunk_sizes()` on it; this will compute the chunk sizes and the length of the array.
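A sketch of that call (`compute_chunk_sizes` is available on dask arrays from dask 2.4 onward):

```python
import dask.dataframe as dd
from dask_ml.datasets import make_classification

X, y = make_classification(chunks=50)
X2 = dd.from_dask_array(X).values  # chunks along axis 0 are unknown (nan)

X2 = X2.compute_chunk_sizes()      # triggers a small computation to fill them in
print(X2.chunks)                   # now concrete, e.g. ((50, 50), (20,))
```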
Use `lr.fit(X.values, y.values)` instead.
A simple example returns:

```
KeyError: (<class 'dask.dataframe.core.DataFrame'>,)
```

I did not have time to check whether this is also the case for other models.
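The example itself was not preserved; a hypothetical minimal reproduction consistent with that traceback might look like:

```python
import dask.dataframe as dd
from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

X, y = make_classification(chunks=50)
df = dd.from_dask_array(X)  # a dask DataFrame, not a dask array

lr = LogisticRegression()
lr.fit(df, y)  # raised KeyError: (<class 'dask.dataframe.core.DataFrame'>,)
```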