This project contains the code used in the ROSEFW-RF paper.

ROSEFW-RF

This repository includes the MapReduce implementations used in [1]. The implementation is based on the Apache Mahout 0.8 library (http://mahout.apache.org/), a project whose goal is to build an environment for quickly creating scalable, performant machine learning applications.

Prerequisites:

  • Hadoop 2.5.
  • ant

Associated paper:

  • I. Triguero, S. Río, V. López, J. Bacardit, J.M. Benítez, F. Herrera. ROSEFW-RF: The winner algorithm for the ECBDL'14 Big Data Competition: An extremely imbalanced big data bioinformatics problem. Knowledge-Based Systems, in press. doi: 10.1016/j.knosys.2015.05.027

Compile the whole project with ANT:

$ ant

Put the datasets folder into HDFS:

hadoop fs -put datasets/

Generate the descriptor file needed by the Mahout code (see: ...classifier.df.tools.Describe.java).

$ hadoop jar Model.jar org.apache.mahout.classifier.df.tools.Describe -p  datasets/ECBDL14subset.data  -f  datasets/ECBDL14subset.info -d  3 N 18 C 18 N 54 C 38 N 20 C 480 N L
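The -d argument describes the attributes in order, where a leading number repeats the following type token (N numerical, C categorical, L label). A minimal Python sketch of that expansion, assuming this repeat-count convention of Mahout's Describe tool (the `expand` function is illustrative, not part of Mahout):

```python
# Sketch: expand a Mahout Describe attribute spec into per-attribute types.
# Assumed format: "<count> <type>" pairs, where N = numerical, C = categorical,
# L = label; a bare type token counts once.
descriptor = "3 N 18 C 18 N 54 C 38 N 20 C 480 N L"

def expand(spec):
    types = []
    count = 1
    for tok in spec.split():
        if tok.isdigit():
            count = int(tok)
        else:
            types.extend([tok] * count)
            count = 1
    return types

attrs = expand(descriptor)
print(len(attrs))         # -> 632 attributes in total, including the label
print(attrs.count("N"))   # -> 539 numerical attributes
```

So the ECBDL14 subset used here has 631 input attributes (539 numerical, 92 categorical) plus the class label as the last column.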

== Random Oversampling

hadoop jar Model.jar  org.apache.mahout.classifier.df.mapreduce.Resampling --help

Usage:
 [--data <path> --dataset <dataset> --time <path> --help
  --resampling <resampling> --dataPreprocessing <path>
  --nbpartitions <nbpartitions> --npos <npos>
  --nneg <nneg> --negclass <negclass>]
Options                                                                         
  --data (-d) path                    Data path                                 
  --dataset (-ds) dataset             Dataset path                              
  --time (-tm) path                   Time path                                 
  --help (-h)                         Print out help                            
  --resampling (-rs) resampling       The resampling technique (oversampling    
                                      (overs), undersampling (unders) or SMOTE  
                                      (smote))                                  
  --dataPreprocessing (-dp) path      Data Preprocessing path                   
  --nbpartitions (-p) nbpartitions    Number of partitions                      
  --npos (-npos) npos                 Number of instances of the positive class 
  --nneg (-nneg) nneg                 Number of instances of the negative class 
  --negclass (-negclass) negclass     Name of the negative class      

Example of generating the preprocessed data:

To set the number of mappers, first check the size in bytes of the training file:

$ ls -l datasets/
-rw-r--r--. 1 isaact users 19019170 Jun  9 14:10 ECBDL14subset.data

If we want 4 map tasks, we divide this size by 4, giving a split size of 4754792 bytes.

$ hadoop jar Model.jar org.apache.mahout.classifier.df.mapreduce.Resampling -Dmapred.min.split.size=4754792 -Dmapred.max.split.size=4754793 -dp datasets/ECBDL14subset.data -d output-ROS -ds datasets/ECBDL14subset.info -rs overs -p 4 -tm ROS-ECBDL14-build_time
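The split-size arithmetic used in the command above can be sketched as follows (a sketch: the actual number of splits Hadoop produces also depends on HDFS block boundaries):

```python
# Sketch: derive min/max split sizes so Hadoop launches roughly n_maps map
# tasks. The file size (19019170 bytes) comes from the `ls -l` output above.
file_size = 19019170
n_maps = 4

min_split = file_size // n_maps   # used for -Dmapred.min.split.size
max_split = min_split + 1         # used for -Dmapred.max.split.size
print(min_split, max_split)       # -> 4754792 4754793
```

The same calculation applies wherever the commands below show XXXX split-size placeholders: divide the size in bytes of that step's input file by the desired number of map tasks.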

== Evolutionary Feature Weighting

hadoop jar Model.jar org.apache.mahout.classifier.feature_weighting.mapreduce.FeatureWeightingModel --help

Usage:
 [--data <path> --dataset <dataset> --header <header> --output <path>]
Options                                                                         
  --data (-d) path           Data path                                          
  --dataset (-ds) dataset    The path of the file descriptor of the dataset     
  --header (-he) header      Header of the dataset in Keel format               
  --output (-o) path         Output path, will contain the set of selected      
                             features   

Example of applying EFW to the previously generated balanced data (adjust the split size according to the size of the input data):

hadoop jar Model.jar org.apache.mahout.classifier.feature_weighting.mapreduce.FeatureWeightingModel -Dmapred.max.split.size=XXXX -d output-ROS/part-r-00000 -ds datasets/ECBDL14subset.info -he datasets/ECBDL14subset.header -o output-DEFW

Create the resulting preprocessed dataset:

hadoop jar Model.jar org.apache.mahout.classifier.feature_weighting.mapreduce.FWconstructor --help

Usage:
 [--input <input> --info <test> --header <header> --feature_weighting <path>
  --weight threshold <path> --output <output> --help]
Options                                                                         
  --input (-i) input                Path to job input directory.                
  --info (-ds) test                 The path of the file descriptor of the      
                                    dataset                                     
  --header (-he) header             Header of the dataset in Keel format        
  --feature_weighting (-fw) path    Feature weights path                        
  --weight threshold (-w) path      Weight threshold to select features         
  --output (-o) output              The directory pathname for output.          
  --help (-h)                       Print out help  

hadoop jar Model.jar org.apache.mahout.classifier.feature_weighting.mapreduce.FWconstructor -i output-ROS/part-r-00000 -fw output-DEFW/Pesos.txt -w 0.46 -ds datasets/ECBDL14subset.info -he datasets/ECBDL14subset.header -o output-FWconstructor
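Conceptually, the -w 0.46 threshold keeps only the features whose evolutionary weight reaches 0.46. A hypothetical sketch of that selection; the layout of Pesos.txt (one weight per attribute, in order) and the non-strict comparison are assumptions, not confirmed by the source:

```python
# Sketch: keep feature indices whose weight reaches the threshold, mimicking
# what FWconstructor does with -w. The weights-file format (one weight per
# line, in attribute order) is an assumption about Pesos.txt.
def select_features(weights, threshold):
    return [i for i, w in enumerate(weights) if w >= threshold]

weights = [0.12, 0.58, 0.46, 0.07, 0.91]   # illustrative values only
print(select_features(weights, 0.46))       # -> [1, 2, 4]
```

The resulting dataset written to output-FWconstructor then contains only the selected columns, which is why a new descriptor file must be generated for it in the next step.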

== RandomForest

First, generate the descriptor (.info) file for this data:

hadoop jar Model.jar org.apache.mahout.classifier.df.tools.Describe -p output-FWconstructor/part-r-00000.out -f   output-FWconstructor/part-r-00000.info -d 3 N 18 C 18 N 54 C 38 N 20 C 480 N L

Build a model with the preprocessed data generated above. Adjust the split size as before.

hadoop jar Model.jar  org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.min.split.size=XXXXX -Dmapred.max.split.size=XXXX -o output-RF/  -d output-FWconstructor/part-r-00000.out -ds output-FWconstructor/part-r-00000.info -sl 25 -p -t 200 -tm model_build_time

Classify test data:

hadoop jar Model.jar org.apache.mahout.classifier.df.mapreduce.TestForest -Dmapred.min.split.size=XXXX -Dmapred.max.split.size=XXXX -i datasets/ECBDL14subset.data -ds datasets/ECBDL14subset.info -m output-RF/ -a -mr -o outputTEST-RF
