Column order impacting the predictions of LGBM (regression) #6671
Thanks for using LightGBM, and for taking the time to put together an excellent reproducible example!

**Short answer**

During tree-building, LightGBM evaluates multiple candidate splits. If several splits produce the same "best" gain, LightGBM simply chooses the first one, which will generally be a split on a feature appearing earlier in the training data (lower column index, i.e. further left).

**Longer answer**

I've narrowed it down to a smaller example that reproduces the behavior, to help us focus on the root cause:

```python
import pandas as pd
import sklearn
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import lightgbm as lgb

california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target, name="target")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# identical feature sets, differing only in column order
features1 = ["HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "Longitude", "MedInc"]
features2 = ["Longitude", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "MedInc"]

params = {
    "max_depth": 4,
    "learning_rate": 0.019324,
    "min_data_in_leaf": 16,
    "num_iterations": 200,
    "verbose": -1,
    "seed": 42,
    "num_thread": 1,
    "deterministic": True,
    "force_row_wise": True,
}

train_data_1 = lgb.Dataset(X_train[features1], label=y_train)
model_1 = lgb.train(params, train_data_1)
y_pred_1 = model_1.predict(X_test[features1])
mse_1 = mean_squared_error(y_test, y_pred_1)
model_1.save_model("model_1.txt")

train_data_2 = lgb.Dataset(X_train[features2], label=y_train)
model_2 = lgb.train(params, train_data_2)
model_2.save_model("model_2.txt")
y_pred_2 = model_2.predict(X_test[features2])
mse_2 = mean_squared_error(y_test, y_pred_2)

assert mse_1 == mse_2, f"mse_1 ({mse_1}) != mse_2 ({mse_2})"
```

In that snippet, notice that I've also added saving the models out (in text format). I compared those files in a text differ, and saw differing per-feature split counts in the summary near the end. The default "importance" reported there is the number of splits each feature is chosen for (see `LightGBM/python-package/lightgbm/basic.py`, lines 4457 to 4459 at commit 668bf5d).
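To make the tie-breaking idea concrete: this is not LightGBM's actual split-selection code, just a minimal sketch of how a "keep the first best" rule makes the winning split depend on column order when gains tie.

```python
# Hypothetical sketch, NOT LightGBM's implementation: selecting the best
# split with a strict ">" comparison means that when two candidate splits
# tie on gain, the one seen first (lower column index) is kept.

def pick_best_split(candidate_gains):
    """candidate_gains: list of (feature_index, gain), in column order."""
    best_feature, best_gain = None, float("-inf")
    for feature_index, gain in candidate_gains:
        if gain > best_gain:  # strict inequality: ties keep the earlier feature
            best_feature, best_gain = feature_index, gain
    return best_feature

# Features 1 and 2 tie at gain 0.8; the earlier column wins.
print(pick_best_split([(0, 0.5), (1, 0.8), (2, 0.8)]))  # -> 1

# Reordering the columns flips the winner, even though the gains are identical.
print(pick_best_split([(2, 0.8), (1, 0.8), (0, 0.5)]))  # -> 2
```

So two models trained on the same data with permuted columns can legitimately pick different (equally good on the training set) splits, and those trees can then disagree slightly on held-out rows.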
Notice that for the model whose split counts differ, I suspect there are some regions of the distribution where it's possible to draw a split for more, and deeper, trees.
Hi 🙂
I encountered some unexpected behavior and wanted to understand the reasoning behind it. The issue concerns the impact of column order on model predictions in a regression setup. I've seen similar questions on this topic and tried applying various suggestions to achieve deterministic results, but without success.
Below is a toy example with:
With the default hyperparameters (params 1), I get the same results regardless of column order. However, with the second set (params 2), the results are the same for feature set 1 but differ for feature set 2: only one observation in the test set returns a different prediction.
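To pinpoint which observation differs, one way is to compare the two prediction arrays elementwise; a small sketch with made-up numbers standing in for the two models' predictions:

```python
import numpy as np

# Hypothetical prediction arrays from two models trained on the same rows
# but with permuted columns (values here are illustrative only).
y_pred_1 = np.array([2.1, 1.5, 3.0, 0.9])
y_pred_2 = np.array([2.1, 1.5, 2.8, 0.9])

# Indices of test rows where the predictions differ beyond floating-point noise.
diff_idx = np.flatnonzero(~np.isclose(y_pred_1, y_pred_2))
print(diff_idx)  # -> [2]
```

The row indices in `diff_idx` can then be looked up in `X_test` to inspect the observations whose predictions diverge.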
Could you please help me understand where the difference is coming from? In my actual use case, the discrepancies are larger than in this toy dataset.
If you need any further details regarding the environment, please let me know :)
Env:
Toy example: