# Combine Features Logistic Regression (beta)

## Introduction

In recommendation or online advertising, source data often come from different channels, each characterizing a different dimension, such as:

- data including age, gender, marital status, ...
- data including internet behavior ...
- data including phone brand ...
- data including installed apps ...
- ...

Combining features from different dimensions into one usually yields better CTR (CVR), so we release Combine Features Logistic Regression (CLR).

## Quick Start

`CLR.run` is defined as below:

- each element of `data` is a single instance/sample arranged as `(Array[fregata.Vector], fregata.Num)`
  - features from the same channel should be put into the same `fregata.Vector`, in sorted order
  - the size of the `Array[fregata.Vector]` is the number of channels in the source data
  - every channel's `fregata.Vector` should be put into the `Array[fregata.Vector]` in sorted order
  - `fregata.Num` is the instance's label
- `combines` describes how to combine the features of different channels. For example, `combines = Array(Array(0,1,2), Array(2))` means that
  - the source data come from 3 different channels
  - suppose channel#1's size is r, channel#2's is m, and channel#3's is n
  - `Array(0,1,2)` means that r×m×n combined features are generated by Cartesian product, each combined feature built from 3 features selected from the 3 different channels
  - `Array(2)` means that all the features from channel#3 are kept as they are
  - based on the example above, the total number of features is r×m×n + n (see the sketch after the signature below)

```scala
def run(data: RDD[(Array[fregata.Vector], fregata.Num)], combines: Array[Array[Int]], iterationNum: Int = 1): CLRModel
```
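As a reading aid for the `combines` parameter, here is a minimal plain-Scala sketch (not part of the library; the channel sizes and the helper `combinedFeatureCount` are made up for illustration) that computes the total number of generated features for the example above:

```scala
// Hypothetical illustration of how `combines` determines the feature count.
// Assume channel sizes r = 4, m = 3, n = 5.
val channelSizes = Array(4, 3, 5)
val combines = Array(
  Array(0, 1, 2), // Cartesian product over channels #1, #2, #3 -> r*m*n features
  Array(2)        // keep channel#3's features as they are      -> n features
)

// Each inner array contributes the product of the sizes of the channels it lists.
def combinedFeatureCount(sizes: Array[Int], combines: Array[Array[Int]]): Int =
  combines.map(_.map(i => sizes(i)).product).sum

println(combinedFeatureCount(channelSizes, combines)) // 4*3*5 + 5 = 65
```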


`CLRModel.clrPredict` is defined as below:

- the structure of the `data` parameter is the same as `CLR.run`'s

```scala
def clrPredict(data: RDD[(Array[fregata.Vector], fregata.Num)]):
  RDD[((Array[fregata.Vector], fregata.Num), (fregata.Num, fregata.Num))]
```
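For orientation, the snippet below unpacks a few prediction records; the variable names are illustrative only, while the tuple shape follows the signature above (as in the Example section, the first element of the inner pair is the predicted score and the second is the predicted class):

```scala
// Assuming `pd` is the RDD returned by clrPredict; names here are illustrative.
pd.take(5).foreach {
  case ((features, label), (score, predictedClass)) =>
    println(s"label = $label, score = $score, predicted = $predictedClass")
}
```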

## Example

```scala
// fregata imports assume the library's standard package layout; adjust to your Fregata version
import fregata.spark.data.LibSvmReader
import fregata.spark.metrics.classification.{Accuracy, AreaUnderRoc}
import fregata.spark.model.classification.CLR
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by takun on 16/9/19.
 */
object CLRExample { // wrapper object added so the example compiles as-is
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("logistic regression")
    val sc = new SparkContext(conf)
    // the a9a dataset can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a9a
    val (_, trainData) = LibSvmReader.read(sc, "/Volumes/takun/data/libsvm/a9a", 123)
    val (_, testData) = LibSvmReader.read(sc, "/Volumes/takun/data/libsvm/a9a.t", 123)
    // a9a has a single channel, so each instance is wrapped into a one-element Array;
    // combines = Array(Array(0,0)) crosses that channel with itself, and 10 is the iteration number
    val model = CLR.run(trainData.map {
      case (x, label) => Array(x) -> label
    }, Array(Array(0, 0)), 10)
    val pd = model.clrPredict(testData.map {
      case (x, label) => Array(x) -> label
    })
    // accuracy compares the predicted class (c) against the true label (l)
    val acc = Accuracy.of(pd.map {
      case ((x, l), (p, c)) => c -> l
    })
    println(s"Accuracy = $acc")
    // AUC uses the predicted score (p) against the true label (l)
    val auc = AreaUnderRoc.of(pd.map {
      case ((x, l), (p, c)) => p -> l
    })
    println(s"AreaUnderRoc = $auc")
  }
}
```
```
Accuracy = 0.8462719567620686
AreaUnderRoc = 0.900320784272655
```