- Classification
  - Binary classification
    - sigmoid (negative log Bernoulli likelihood)
    - sigmoid_cross_entropy
    - hinge (SVM)
    - l2_hinge
    - smooth_hinge
    - exponential
  - Multi-class classification
    - softmax (negative log multinoulli likelihood)
    - softmax_cross_entropy
    - hsoftmax (hierarchical softmax)
    - hsoftmax_cross_entropy
    - multiclass_hinge
    - multiclass_l2_hinge
    - multiclass_smooth_hinge
- Regression
  - l2 (mean squared error, least squares regression)
  - l1 (mean absolute error, least absolute deviation regression)
  - huber
  - poisson
  - mape (mean absolute percentage error)
  - smape (symmetric mean absolute percentage error)
  - inv_mape (inverse mean absolute percentage error)
The abbreviations of the objective functions listed above can be set via `loss_function` in the model configuration.
If you want to define a new objective function, there are three steps (a sketch of the first step follows this list):
- Implement the `ILossFunction` interface.
- Add the corresponding functions to the `LossFunctions` class.
- Run `sh tool/package.sh` to repackage.
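For reference, the sketch below shows what such a loss might look like. The `ILossFunction` body here is a hypothetical stand-in, defined only so the example compiles on its own; the real interface in ytk-learn may differ, so check the source tree before implementing.

```java
// A self-contained sketch of plugging in a custom loss.
// NOTE: the ILossFunction methods shown here (loss / gradient) are assumptions
// made for illustration; check the real interface in the ytk-learn source tree.
public class CustomLossSketch {

    // Hypothetical stand-in for ytk-learn's ILossFunction.
    interface ILossFunction {
        double loss(double prediction, double label);
        double gradient(double prediction, double label);
    }

    // Log-cosh loss: roughly quadratic near zero error, linear for large errors.
    static class LogCoshLoss implements ILossFunction {
        @Override
        public double loss(double prediction, double label) {
            return Math.log(Math.cosh(prediction - label));
        }

        @Override
        public double gradient(double prediction, double label) {
            // d/d(prediction) of log(cosh(prediction - label)) = tanh(prediction - label)
            return Math.tanh(prediction - label);
        }
    }

    public static void main(String[] args) {
        ILossFunction loss = new LogCoshLoss();
        System.out.printf("loss=%.4f grad=%.4f%n",
                loss.loss(2.0, 1.5), loss.gradient(2.0, 1.5));
    }
}
```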
Use GitHub with a MathJax plugin to render the LaTeX formulas below.
- linear
- multiclass_linear
- fm (Factorization Machine)
- ffm (Field-aware Factorization Machine)
- gbdt (Gradient Boosting Decision/Regression Tree, Random Forest)
- gbmlr (Gradient Boosting Mixture Linear Tree Model, Mixture Linear Random Forest Model)
- gbsdt (Gradient Boosting Soft Decision/Regression Tree Model, Soft Random Forest Decision/Regression Tree Model)
- gbhmlr (Gradient Boosting Hierarchical Mixture Linear Tree Model, Hierarchical Mixture Linear Random Forest Model)
- gbhsdt (Gradient Boosting Hierarchical Soft Decision/Regression Tree Model, Hierarchical Soft Random Forest Decision/Regression Tree Model)
The model abbreviations listed above can be set via `model_name` in the running scripts.
Linear Model: $$ f(\mathbf{x})=\mathbf{w}^T\mathbf{x}+b $$ Different activation functions and loss functions can be combined to generate different models that share the same score function (see the worked example after this list):
- L2: linear model + identity activation function + square loss
- L1: linear model + identity activation function + absolute loss
- Logistic Regression: linear model + sigmoid activation function + negative log loss
- Binary Cross Entropy: linear model + sigmoid activation function + cross entropy loss
- Hinge: linear model + identity activation function + hinge loss
- ...
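As a worked example, the logistic regression combination above passes the linear score through a sigmoid and minimizes the negative log (Bernoulli) likelihood:

$$
p(y=1\mid\mathbf{x})=\sigma(\mathbf{w}^T\mathbf{x}+b)=\frac{1}{1+e^{-(\mathbf{w}^T\mathbf{x}+b)}},\qquad
\ell(y,\mathbf{x})=-\big[y\log p(y=1\mid\mathbf{x})+(1-y)\log\big(1-p(y=1\mid\mathbf{x})\big)\big]
$$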
configuration, demo, reference.
Multiclass Linear Model: $$ f_k(\mathbf{x})=\mathbf{w}_k^T\mathbf{x},\quad k=1,2,...,K $$ Different activation functions and loss functions can be combined to generate different models that share the same score function (the softmax case is worked out after this list):
- Softmax: multiclass linear model + softmax activation function + negative log loss
- Multiclass Hinge: multiclass linear model + identity activation function + hinge loss
- ...
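For example, the softmax combination turns the $K$ scores into class probabilities and minimizes the negative log likelihood of the true class:

$$
p(y=k\mid\mathbf{x})=\frac{e^{f_k(\mathbf{x})}}{\sum_{j=1}^{K}e^{f_j(\mathbf{x})}},\qquad
\ell(y,\mathbf{x})=-\log p(y\mid\mathbf{x})
$$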
configuration, demo, reference
Factorization Machines (FM):
$$
f(\mathbf{x})=b+\sum_{i=1}^n {w_i}{x_i} + \sum_{i=1}^n\sum_{j=i+1}^n \mathbf{v}_i^T\mathbf{v}_j{x_i}{x_j}
$$
A 2-way FM (degree d = 2) contains a linear part plus pairwise feature interactions whose weights are factorized as inner products of the latent vectors $\mathbf{v}_i$ and $\mathbf{v}_j$.
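As a side note (a standard identity from the FM literature, not specific to this implementation), the pairwise term can be computed in time linear in the number of non-zero features, where $k$ is the latent dimension:

$$
\sum_{i=1}^n\sum_{j=i+1}^n \mathbf{v}_i^T\mathbf{v}_j x_i x_j
=\frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^n v_{i,f}x_i\right)^2-\sum_{i=1}^n v_{i,f}^2x_i^2\right]
$$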
configuration, demo, reference
Field-aware Factorization Machines (FFM): $$ f(\mathbf{x})=b+\sum_{i=1}^n {w_i}{x_i} + \sum_{i=1}^n\sum_{j=i+1}^n \mathbf{w}_{i,{f_j}}^T\mathbf{w}_{j,{f_i}}{x_i}{x_j} $$ FFM is a variant of FM that has outperformed existing models in several worldwide CTR-prediction competitions. Just like FM, FFM can also be applied to regression and classification tasks (using different activation functions and loss functions).
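To make the score function concrete, the sketch below evaluates the FFM formula for one sparse example. The data layout (parallel arrays of feature index, field index, and value) and class names are illustrative assumptions and do not mirror ytk-learn's internal representation.

```java
import java.util.Arrays;

// Illustrative FFM scoring of a single sparse example:
// f(x) = b + sum_i w_i * x_i + sum_{i<j} <w_{i,f_j}, w_{j,f_i}> * x_i * x_j
// Names and array shapes are assumptions for this sketch only.
public class FfmScoreDemo {

    static double score(double bias,
                        double[] w,          // linear weight per feature
                        double[][][] v,      // v[feature][field][latentDim]
                        int[] featIdx, int[] fieldIdx, double[] value) {
        double s = bias;
        for (int a = 0; a < featIdx.length; a++) {
            s += w[featIdx[a]] * value[a];                 // linear part
        }
        for (int a = 0; a < featIdx.length; a++) {         // pairwise part
            for (int b = a + 1; b < featIdx.length; b++) {
                double[] vi = v[featIdx[a]][fieldIdx[b]];  // w_{i, f_j}
                double[] vj = v[featIdx[b]][fieldIdx[a]];  // w_{j, f_i}
                double dot = 0.0;
                for (int d = 0; d < vi.length; d++) {
                    dot += vi[d] * vj[d];
                }
                s += dot * value[a] * value[b];
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // Toy model: 3 features, 2 fields, latent dimension 2.
        double[][][] v = new double[3][2][2];
        for (double[][] perFeature : v) {
            for (double[] latent : perFeature) {
                Arrays.fill(latent, 0.1);
            }
        }
        double[] w = {0.5, -0.2, 0.3};
        double s = score(0.1, w, v,
                new int[]{0, 2}, new int[]{0, 1}, new double[]{1.0, 1.0});
        System.out.println("FFM score = " + s);
    }
}
```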
configuration, demo, reference
Gradient Boosting Decision Tree (GBDT or GBRT): $$ F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \alpha f_m(\mathbf{x}) \\ F_{m-1}(\mathbf{x})=\sum_{i=1}^{m-1}f_i(\mathbf{x})\\ f_m(\mathbf{x})=\mathbf{w}_{q(\mathbf{x})},\quad q:\mathbb{R}^d\rightarrow\{1,2,\dots,T\},\ \mathbf{w}\in\mathbb{R}^T $$ GBDT is a very effective machine learning method and is widely used in many machine learning tasks. The model is trained in an additive manner, greedily adding at each round the new tree that decreases the objective (loss) function the most.
In the above formulations, $q$ maps an example to one of the $T$ leaves, $\mathbf{w}$ holds the leaf weights, and $\alpha$ is the shrinkage (learning rate) applied to each new tree $f_m$.
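In Friedman's formulation (first reference below), each new tree $f_m$ is fit to the negative gradient of the loss at the current prediction; for squared-error loss this reduces to fitting the residuals:

$$
r_i^{(m)}=-\left.\frac{\partial L\big(y_i,F(\mathbf{x}_i)\big)}{\partial F(\mathbf{x}_i)}\right|_{F=F_{m-1}},\qquad
f_m\approx\arg\min_{f}\sum_{i=1}^{N}\big(r_i^{(m)}-f(\mathbf{x}_i)\big)^2
$$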
Reference:
- Friedman, Jerome H. "Greedy function approximation: a gradient boosting machine." Annals of statistics (2001): 1189-1232.
- Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
Gradient Boosting Soft Tree (GBST):
$$
F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \alpha f_m(\mathbf{x}) \\ F_{m-1}(\mathbf{x})=\sum_{i=1}^{m-1}f_i(\mathbf{x}) \\ f_m(\mathbf{x};W_g,W_s) = \sum_{k=1}^{K} g_m^k(\mathbf{x};\mathbf{w}_g^k)h_m^k(\mathbf{x};\mathbf{w}_s^k) \\ ({W_g},{W_s})=\arg\min_{{W_g}^{'},{W_s}^{'}}L\big(F_{m-1}(\mathbf{x}) + f_m(\mathbf{x};{W_g}^{'},{W_s}^{'})\big)
$$
GBST is very similar to the traditional gradient boosting decision tree. The main differences are:
- In GBST, $f_m$ is a parameterized function optimized with gradient-based algorithms such as L-BFGS or CG, whereas in GBDT, $f_m$ is a rule-based function obtained by greedily constructing a regression tree.
- Trees in GBDT use axis-aligned splits, while trees in GBST can use arbitrary (oblique) split directions.
Two kinds of gating functions are provided:
Softmax Gating Function (SGF):
$$
g^k(\mathbf{x}) = \frac{e^{\mathbf{w}_k^T\mathbf{x}}}{\sum_{i=1}^{K}e^{\mathbf{w}_i^T\mathbf{x}}}
$$
Hierarchical Softmax Gating Function (HSGF):
$$
g^k(\mathbf{x})=\prod_{j=1}^{L} \sigma\big((2\,\delta_j(\text{left child}) - 1)\,\mathbf{w}_j^T\mathbf{x}\big)
$$
where the product runs over the $L$ internal nodes on the path from the root to leaf $k$, and $\delta_j(\text{left child})$ is $1$ if the path goes to the left child at node $j$ and $0$ otherwise.
HSGF has stronger non-linear ability than SGF; theoretically, HSGF can partition arbitrarily complex spaces. GBST with SGF is a tree stump with $K$ leaf experts, while GBST with HSGF is a depth-$L$ binary tree whose leaves are the experts.
There are two basis score functions:
Linear Score Function (LSF):
$$
h^k(\mathbf{x})=\mathbf{w}^T\mathbf{x}+b
$$
Single Variable Score Function (SVSF), where the expert's score is a single learnable variable $z$ independent of $\mathbf{x}$:
$$
h^k(\mathbf{x}) = z
$$
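Putting the pieces together, a single soft tree scores an example as a gate-weighted sum of expert scores, $f_m(\mathbf{x})=\sum_{k=1}^{K}g^k(\mathbf{x})\,h^k(\mathbf{x})$. The sketch below combines the softmax gate (SGF) with linear experts (LSF), i.e. the gbmlr pairing; the array shapes and class names are illustrative assumptions, not ytk-learn's API.

```java
// Illustrative prediction of one soft tree with a softmax gate (SGF) and
// linear experts (LSF): f(x) = sum_k g_k(x) * (w_k^T x + b_k).
// Names and shapes are assumptions for this sketch only.
public class SoftTreeDemo {

    static double predict(double[][] gateW,   // K x d gating weights
                          double[][] expertW, // K x d expert weights
                          double[] expertB,   // K expert biases
                          double[] x) {
        int k = gateW.length;
        double[] logits = new double[k];
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < k; i++) {
            logits[i] = dot(gateW[i], x);
            max = Math.max(max, logits[i]);
        }
        double norm = 0.0;
        for (int i = 0; i < k; i++) {          // numerically stable softmax
            logits[i] = Math.exp(logits[i] - max);
            norm += logits[i];
        }
        double f = 0.0;
        for (int i = 0; i < k; i++) {
            double gate = logits[i] / norm;                   // g^k(x)
            double expert = dot(expertW[i], x) + expertB[i];  // h^k(x)
            f += gate * expert;
        }
        return f;
    }

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            s += a[i] * b[i];
        }
        return s;
    }

    public static void main(String[] args) {
        double[][] gateW = {{1.0, -1.0}, {-1.0, 1.0}};
        double[][] expertW = {{0.5, 0.0}, {0.0, 0.5}};
        double[] expertB = {0.1, -0.1};
        System.out.println(predict(gateW, expertW, expertB, new double[]{2.0, 1.0}));
    }
}
```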
| | Advantages | Disadvantages |
|---|---|---|
| GBDT | Training/inference is fast. In most dense-feature scenes its performance is better than GBST's. | It cannot handle sparse categorical features directly. |
| GBST | In most scenes its performance is better than linear models, FM, and FFM. Although GBST is slightly worse than GBDT in most dense-feature scenes, it can handle large sparse categorical features (e.g. CTR prediction, recommendation). GBST needs fewer trees than GBDT; in most cases a single tree does the trick. | With more parameters, training can be very slow. GBST is a non-convex model, so it is hard to tune. |
Different gating functions combined with different basis score functions create different GBST models. Ytk-learn supports eight GBST models (2 gating functions × 2 score functions × 2 ensemble modes: gradient boosting or random forest):
- gbmlr (SGF + LSF), configuration, demo
- gbsdt (SGF + SVSF), configuration, demo
- gbhmlr (HSGF + LSF), configuration, demo
- gbhsdt (HSGF + SVSF), configuration, demo
GBST can also be applied to regression and classification tasks with different activation functions and loss functions.
Each model can be used with the types of objective functions listed below.
Models | Types of Objective Function |
---|---|
linear | Binary Classification/Regression |
multiclass_linear | Multiclass Classification |
fm | Binary Classification/Regression |
ffm | Binary Classification/Regression |
gbdt | Binary Classification/Regression/Multiclass Classification |
gbmlr | Binary Classification/Regression |
gbsdt | Binary Classification/Regression |
gbhmlr | Binary Classification/Regression |
gbhsdt | Binary Classification/Regression |
Reference:
- Friedman, Jerome H. "Greedy function approximation: a gradient boosting machine." Annals of statistics (2001): 1189-1232.
- Jordan, Michael I., and Robert A. Jacobs. "Hierarchical mixtures of experts and the EM algorithm." Neural computation 6.2 (1994): 181-214.