In recommendation and online advertising, source data often come from different channels, each describing a different dimension, such as:
- data including age, gender, marital status ...
- data including internet behavior ...
- data including phone brand ...
- data including installed apps ...
- ...
Combining features from different dimensions into one usually yields a better CTR (CVR), so we release the Combine Features Logistic Regression (CLR).
CLR.run is defined as below:
- each element of data is a single instance/sample arranged as (Array[fregata.Vector], fregata.Num)
- features from the same channel should be put into the same fregata.Vector in sorted order
- the size of Array[fregata.Vector] is the number of channels in the source data
- each channel's fregata.Vector should be put into Array[fregata.Vector] in sorted order, so that a given channel always occupies the same position
- fregata.Num is the instance's label
- combines specifies how features from different channels are combined. For example, combines=Array(Array(0,1,2), Array(2)) means that:
  - the source data come from 3 different channels
  - suppose channel#1 has r features, channel#2 has m, and channel#3 has n
  - Array(0,1,2) means that r×m×n combined features are generated by Cartesian product, each built from one feature of each of the 3 channels (channels are referenced by 0-based index, so 0, 1, 2 refer to channel#1, channel#2, channel#3)
  - Array(2) means that all the features from channel#3 are kept as they are
  - in this example, the total number of features is therefore r×m×n + n
def run(data: RDD[(Array[fregata.Vector], fregata.Num)], combines: Array[Array[Int]], iterationNum: Int = 1): CLRModel
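The feature count implied by combines can be checked with a small standalone sketch. It does not use Fregata or Spark; `CombineSketch`, `totalFeatures` and `crossIndices` are hypothetical names introduced only for illustration, and the sketch only mirrors the counting rule described above, not how CLR builds its features internally.

```scala
object CombineSketch {
  // Hypothetical helper (not part of Fregata): number of features implied by
  // `combines`, given how many features each channel has (0-based channel index).
  def totalFeatures(channelSizes: Array[Int], combines: Array[Array[Int]]): Long =
    combines.map(_.map(i => channelSizes(i).toLong).product).sum

  // Hypothetical helper: Cartesian product of per-channel feature positions
  // for a single entry of `combines`.
  def crossIndices(channelSizes: Array[Int], combine: Array[Int]): Seq[Seq[Int]] =
    combine.foldLeft(Seq(Seq.empty[Int])) { (acc, ch) =>
      for (prefix <- acc; j <- 0 until channelSizes(ch)) yield prefix :+ j
    }

  def main(args: Array[String]): Unit = {
    val sizes = Array(2, 3, 4)                      // r = 2, m = 3, n = 4
    val combines = Array(Array(0, 1, 2), Array(2))
    println(totalFeatures(sizes, combines))         // 2*3*4 + 4 = 28
    println(crossIndices(sizes, Array(0, 1, 2)).size) // 24 combined features
  }
}
```

Because the count is a product of channel sizes, it grows quickly as more channels are crossed, which is worth keeping in mind when choosing combines.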
CLRModel.clrPredict is defined as below:
- parameter data's structure is the same as **CLR.run**'s
def clrPredict(data: RDD[(Array[fregata.Vector], fregata.Num)]): RDD[((Array[fregata.Vector], fregata.Num), (fregata.Num, fregata.Num))]
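Each output record pairs the input instance with two numbers. Judging only from how the usage example feeds them to AreaUnderRoc and Accuracy, the first appears to be a score or probability and the second a predicted class; treat that reading as an assumption. The sketch below shows how such records can be consumed, using a plain Seq in place of the RDD and Double as a stand-in for fregata.Num; `PredictionSketch` and `accuracy` are hypothetical names.

```scala
object PredictionSketch {
  type Num = Double // stand-in for fregata.Num (assumed here to be a Double)

  // Fraction of records whose predicted class equals the label.
  // Record shape: ((features, label), (score, predictedClass)).
  def accuracy[A](pd: Seq[((A, Num), (Num, Num))]): Double =
    if (pd.isEmpty) 0.0
    else pd.count { case ((_, label), (_, cls)) => cls == label }.toDouble / pd.size

  def main(args: Array[String]): Unit = {
    // Made-up records; the String stands in for Array[fregata.Vector].
    val pd = Seq(
      (("a", 1.0), (0.9, 1.0)),
      (("b", 0.0), (0.2, 0.0)),
      (("c", 1.0), (0.4, 0.0)),
      (("d", 0.0), (0.7, 1.0))
    )
    println(accuracy(pd)) // 2 of 4 records match: 0.5
  }
}
```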
/**
* Created by takun on 16/9/19.
*/
def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("logistic regression")
  val sc = new SparkContext(conf)
  // the dataset a9a can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a9a
  val (_, trainData) = LibSvmReader.read(sc, "/Volumes/takun/data/libsvm/a9a", 123)
  val (_, testData) = LibSvmReader.read(sc, "/Volumes/takun/data/libsvm/a9a.t", 123)
  // a9a has a single channel, so each instance wraps its vector in a one-element Array;
  // combines = Array(Array(0,0)) crosses channel 0 with itself, and 10 is iterationNum
  val model = CLR.run(trainData.map {
    case (x, label) => Array(x) -> label
  }, Array(Array(0, 0)), 10)
  val pd = model.clrPredict(testData.map {
    case (x, label) => Array(x) -> label
  })
  // accuracy compares the second prediction value with the label
  val acc = Accuracy.of(pd.map {
    case ((_, l), (_, c)) => c -> l
  })
  println(s"Accuracy = $acc ")
  // AUC is computed from the first prediction value and the label
  val auc = AreaUnderRoc.of(pd.map {
    case ((_, l), (p, _)) => p -> l
  })
  println(s"AreaUnderRoc = $auc ")
}
Accuracy = 0.8462719567620686
AreaUnderRoc = 0.900320784272655