Feature selection taking magnitudes longer than it should #1073
Comments
Hi @Sarius2009! Or maybe I misunderstood: Are you referring to the number of distinct values in |
Hi @nils-braun, |
I am indeed a bit confused (but not because of your question or data, but because of the issue you see). I played around with the data a bit.
finishes in about 3s on my laptop (note: I purposely reduced the problem to a binary classification), but this
(only change is the |
Good to know I am not the only one confused. I can also confirm your observations, and 3s is right around what I would expect. Also, the reverse happens for
|
@nils-braun |
Oh this is really great to hear. Thanks for looking further into this! Yes, let's update the requirements. Would you like to do the PR (because you found it)? |
As I only tested with 1.14.0, I will wait 2 weeks for the full release, and then do the PR |
Just made the PR, closing this issue: #1081 |
When extracting data from the same dataset and selecting from the extracted features, one set of extraction parameters, which results in 154 classes, 10400 time series, a longest time series of 10000 datapoints and 1.2 GB of JSON data, works fine with EfficientFCParameters. Another one, which results in <90 classes, 550 sessions and 8 MB of data, takes 40 seconds to extract the features, as expected, but 3 hours to select the features. This was the smallest sample I could create which exhibited this behaviour; larger samples (~200 MB) will take >60 GB of RAM and crash the program. Using n_jobs=0 did not help. Profiling my example below, almost all of the time seems to be taken up by _recv_bytes and _get_more_data.
Running Windows 11 on a system with 16 GB of RAM, tsfresh 0.20.2 installed via pip; tested the code in a Jupyter notebook and in normal Python 3.12.
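For anyone who wants to reproduce the profiling step, here is a minimal sketch of how such a profile can be collected with the standard library. The helper name profile_top and the toy workload are made up for illustration and are not part of tsfresh; in practice you would pass your own function that wraps the feature selection call. Note that cProfile only sees the calling process, so if the names _recv_bytes and _get_more_data come from multiprocessing.connection (an assumption on my side), a profile dominated by them would mostly mean the parent is waiting on data from worker processes.

```python
import cProfile
import pstats
import time


def profile_top(func, *args, limit=5, **kwargs):
    """Run func under cProfile and return (result, top entries by cumulative time).

    Hypothetical helper for illustration only.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stats = pstats.Stats(profiler)
    rows = []
    # stats.stats maps (filename, lineno, funcname) to
    # (primitive calls, total calls, own time, cumulative time, callers)
    for (filename, lineno, name), (_cc, _nc, _tt, ct, _callers) in stats.stats.items():
        rows.append((f"{name} ({filename}:{lineno})", ct))
    rows.sort(key=lambda row: row[1], reverse=True)
    return result, rows[:limit]


def slow_step():
    # Stand-in for an expensive call, e.g. waiting on a worker process
    time.sleep(0.05)


def workload():
    # Toy workload; replace with a function that runs the slow code path
    slow_step()
    return sum(range(1000))


if __name__ == "__main__":
    _, top = profile_top(workload)
    for label, cumulative in top:
        print(f"{cumulative:8.3f}s  {label}")
```

Sorting by cumulative time makes it easy to see which call chain the wall-clock time ends up in, which is how a hotspot like the two functions named above would show up at the top of the list.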
Minimal example with provided data:
problem_features_eff.csv
problem_id_to_userid_eff.csv
Edit: Changed Parameters to efficient, attached problematic data
Edit: Found out it wasn't inherently a memory issue, adjusted title and text accordingly