
Column order impacting the predictions of LGBM (regression) #6671

Open
erykml opened this issue Oct 10, 2024 · 1 comment

Comments


erykml commented Oct 10, 2024

Hi 🙂

I encountered some unexpected behavior and wanted to understand the reasoning behind it. The issue is regarding the impact of column order on model predictions in a regression setup. I’ve seen similar questions on this topic and tried applying various suggestions to achieve deterministic results, but without success.

Below is a toy example with:

  • Two sets of features
  • Two sets of hyperparameters

With the default hyperparameters (params 1), I get the same results regardless of column order. However, with the second set (params 2), the results are the same for feature set 1 but differ for feature set 2: only one observation in the test set returns a different prediction.

Could you please help me understand where the difference is coming from? In my actual use case, the discrepancies are larger than in this toy dataset.

If you need any further details regarding the environment, please let me know :)

Env:

macOS Sonoma 14.6.1
LGBM 4.5.0
sklearn 1.5.1

Toy example:

import pandas as pd

import sklearn
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import lightgbm as lgb

california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target, name="target")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# feature set #1
# features_set = ['HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude', 'MedInc']

# feature set #2
features_set = ["Longitude", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "MedInc",]

# params 1
# params = {
#     "verbose": -1,
#     "seed": 42,
# }

# params 2
params = {
    "boosting_type": "gbdt",
    "max_depth": 4,
    "bagging_fraction": 1.0,
    "bagging_freq": 0,
    "feature_fraction": 1.0,
    "learning_rate": 0.019324,
    "num_leaves": 128,
    "min_data_in_leaf": 16,
    "max_bin": 90,
    "num_iterations": 267,
    "min_gain_to_split": 0.0,
    "lambda_l1": 0.001356,
    "lambda_l2": 0.000581,
    "verbose": -1,
    "seed": 42,
    "num_thread": 1,
    "deterministic": True,
    "force_row_wise": True,
}

train_data_1 = lgb.Dataset(X_train, label=y_train)
model_1 = lgb.train(params, train_data_1)
y_pred_1 = model_1.predict(X_test)
mse_1 = mean_squared_error(y_test, y_pred_1)

train_data_2 = lgb.Dataset(X_train[features_set], label=y_train)
model_2 = lgb.train(params, train_data_2)
y_pred_2 = model_2.predict(X_test[features_set])
mse_2 = mean_squared_error(y_test, y_pred_2)

print(mse_1 == mse_2)
erykml changed the title from "Column order impacting the prediction of LGBM (regression)" to "Column order impacting the predictions of LGBM (regression)" on Oct 10, 2024
jameslamb (Collaborator) commented:

Thanks for using LightGBM, and for taking the time to put together an excellent reproducible example!

Short Answer

During tree-building, LightGBM evaluates many candidate "splits": (feature, threshold) pairs. For each candidate, it computes a "gain": roughly, the improvement in the in-sample fit that results from splitting the data on that feature at that threshold.

If multiple candidate splits tie for the best gain, LightGBM simply keeps the first one it evaluated, which generally means a split on a feature appearing earlier (lower column index, or "further left") in the training data.
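As a toy illustration of that tie-breaking (plain Python, not LightGBM internals): taking the maximum over an ordered list of candidates keeps the first maximal element, so when gains tie, the left-most feature wins. The candidate values below are made up.

```python
# Toy illustration of tie-breaking (not LightGBM internals): when two
# candidate splits have identical gain, taking the max keeps the first
# candidate encountered, i.e. the feature with the lower column index.
candidates = [
    {"feature": "Longitude", "col_idx": 0, "gain": 0.75},
    {"feature": "Latitude", "col_idx": 6, "gain": 0.75},
]

# Python's max() returns the first maximal element on ties,
# so "Longitude" (column index 0) is chosen here.
best = max(candidates, key=lambda c: c["gain"])
print(best["feature"])  # Longitude
```

Reorder the candidate list and the winner flips, even though every gain is identical; that is the same kind of order-sensitivity described above.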

Longer Answer

I've narrowed it down to a smaller example that reproduces the behavior, to help us focus on the root cause:

check.py
import pandas as pd

import sklearn
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import lightgbm as lgb

california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target, name="target")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

features1 = ["HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "Longitude", "MedInc"]
features2 = ["Longitude", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "MedInc"]

params = {
    "max_depth": 4,
    "learning_rate": 0.019324,
    "min_data_in_leaf": 16,
    "num_iterations": 200,
    "verbose": -1,
    "seed": 42,
    "num_thread": 1,
    "deterministic": True,
    "force_row_wise": True,
}

train_data_1 = lgb.Dataset(X_train[features1], label=y_train)
model_1 = lgb.train(params, train_data_1)
y_pred_1 = model_1.predict(X_test[features1])
mse_1 = mean_squared_error(y_test, y_pred_1)
model_1.save_model("model_1.txt")

train_data_2 = lgb.Dataset(X_train[features2], label=y_train)
model_2 = lgb.train(params, train_data_2)
model_2.save_model("model_2.txt")
y_pred_2 = model_2.predict(X_test[features2])
mse_2 = mean_squared_error(y_test, y_pred_2)

assert mse_1 == mse_2, f"mse_1 ({mse_1}) != mse_2 ({mse_2})"

In that code snippet, notice that I've also saved both models out in text format. I compared those files in a text differ and saw the following in the summary near the end:

[Screenshots: side-by-side diff of model_1.txt and model_2.txt, showing differing per-feature split counts in the feature-importances summary]

The default "importance" reported there is "number of splits the feature is chosen for".

importance_type : str, optional (default="split")
What type of feature importance should be saved.
If "split", result contains numbers of times the feature is used in a model.

Notice that for the model where Longitude appears earlier in the feature list, it is chosen for 6 more splits. In the model where Latitude appears earlier, it's chosen for 6 more splits.

I suspect there are some regions of the feature space where a split on Longitude and a split on Latitude select exactly the same samples. You may have only observed this with what you called "params 2" because, in general, those parameters encourage LightGBM to grow more and deeper trees than it would by default.

more trees:

  • num_iterations = 267 (default: 100)

deeper trees:

  • num_leaves: 128 (default: 31)
  • min_data_in_leaf: 16 (default: 20)
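To see how localized the discrepancy is, you can compare the two prediction vectors element-wise rather than only comparing MSEs. A minimal sketch with made-up stand-in values (in your real case you'd pass the `y_pred_1` / `y_pred_2` arrays from the script above):

```python
# Find indices where two prediction vectors disagree beyond float noise.
# The hard-coded lists below are stand-ins for the two models' predictions.
y_pred_1 = [2.1, 0.9, 3.4, 1.7]
y_pred_2 = [2.1, 0.9, 3.5, 1.7]

diff_idx = [
    i for i, (a, b) in enumerate(zip(y_pred_1, y_pred_2))
    if abs(a - b) > 1e-9
]
print(diff_idx)  # only row index 2 differs
```

In the original report only one test observation differed, so a check like this would pinpoint exactly which row (and hence which leaf assignments) the column reordering affected.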
