I have a dataset stored in Parquet format. I want to filter the dataset by a categorical column and then scale the numerical columns, which make up the vast majority of the columns.
When running the above code, I encounter the following error:
IndexError Traceback (most recent call last)
3 scaler = MinMaxScaler(features=feat_cols, prefix='prep_')
----> 4 scaler.fit_transform(data)
File ~/conda/lib/python3.9/site-packages/vaex/ml/transformations.py:46, in Transformer.fit_transform(self, df)
39 '''Fit and apply the transformer to the supplied DataFrame.
40
41 :param df: A vaex DataFrame.
42
43 :returns copy: A shallow copy of the DataFrame that includes the transformations.
44 '''
45 self.fit(df=df)
---> 46 return self.transform(df=df)
File ~/conda/lib/python3.9/site-packages/vaex/ml/transformations.py:719, in MinMaxScaler.transform(self, df)
717 b = self.feature_range[1]
718 expr = copy[feature]
--> 719 expr = (b-a)*(expr-self.fmin_[i])/(self.fmax_[i]-self.fmin_[i]) + a
720 copy[name] = expr
721 return copy
IndexError: list index out of range
The error occurs because fmin_ and fmax_ are empty after calling fit. Normally, they should contain the minimum and maximum of each column to be scaled.
However, when I remove the filter step, MinMaxScaler works as expected.
import vaex
from vaex.ml import MinMaxScaler
data = vaex.open('my_parquet_file_dir')
# data = data[data.filter_col == "A"]
scaler = MinMaxScaler(features=feat_cols, prefix='prep_')
scaler.fit_transform(data)
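To illustrate why empty fmin_ and fmax_ lead to the IndexError in the traceback, here is a minimal pure-Python sketch of the same scaling formula. It is a stand-in for illustration only, not vaex's implementation: the function `minmax_transform` and its arguments are hypothetical, and plain lists are used instead of a vaex DataFrame.

```python
def minmax_transform(columns, features, fmin, fmax, feature_range=(0.0, 1.0)):
    """Scale each feature to feature_range using precomputed minima and maxima.

    Hypothetical stand-in that mirrors the formula shown in the traceback;
    it does not use vaex.
    """
    a, b = feature_range
    scaled = {}
    for i, feature in enumerate(features):
        # If fit produced empty lists, this lookup fails at i == 0.
        lo, hi = fmin[i], fmax[i]
        scaled["prep_" + feature] = [
            (b - a) * (x - lo) / (hi - lo) + a for x in columns[feature]
        ]
    return scaled


columns = {"x": [1.0, 2.0, 3.0]}

# With one minimum and maximum per feature, scaling works.
ok = minmax_transform(columns, ["x"], fmin=[1.0], fmax=[3.0])

# With empty minima and maxima, the first lookup raises IndexError.
try:
    minmax_transform(columns, ["x"], fmin=[], fmax=[])
    error = None
except IndexError as exc:
    error = str(exc)
```

The failure therefore surfaces in transform, but its cause lies in fit, which left the lists unpopulated for the filtered DataFrame.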
Additional information
The dataset is distributed across 100 parquet files. The shape of the data is around 3M rows and 120 columns.
I tried to create a minimal dataset to reproduce the error but failed. Even when I create a dataset with similar properties like below, filtering and MinMaxScaler still work as expected.
Software information
Vaex version (import vaex; vaex.__version__): {'vaex': '4.17.0', 'vaex-core': '4.17.1', 'vaex-viz': '0.5.4', 'vaex-hdf5': '0.14.1', 'vaex-server': '0.9.0', 'vaex-astro': '0.9.3', 'vaex-jupyter': '0.8.2', 'vaex-ml': '0.18.3'}