
Wrapper FS based on Cuckoo Search algorithm #335

Open · wants to merge 39 commits into base: main

Conversation

Mohammed-Ryiad-Eiadeh

Description

Adding a wrapper feature selection (FS) approach based on the binary Cuckoo Search algorithm to the Tribuo engine.

Motivation

Recently, Evolutionary Computation has shown good results across a wide range of technologies, especially in engineering problems. Moreover, many published research papers utilize EO algorithms for FS and report good results. Therefore, adding a wrapper FS model to the Tribuo engine would be a valuable addition.

Paper reference

Rodrigues, D. et al. BCS: A Binary Cuckoo Search algorithm for feature selection. Proc. IEEE Int. Symp. Circuits Syst., 465–468 (2013). doi:10.1109/ISCAS.2013.6571881

@oracle-contributor-agreement

Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.

When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.

If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. label Apr 21, 2023
@Craigacp
Member

Craigacp commented Apr 24, 2023

Thanks for the contribution. Once you've sorted out the OCA I can go through and review this, but I'm not allowed to review it until the OCA check passes.

@Mohammed-Ryiad-Eiadeh
Author

I signed the OCA form four days ago, and now I am waiting for confirmation.

@Craigacp
Member

> I signed the OCA form four days ago, and now I am waiting for confirmation.

Ok, thanks. Once it's gone through the system we can get going with this. It shouldn't take too long.

@Mohammed-Ryiad-Eiadeh
Author

Thank you.

@Craigacp
Member

It looks like there were some issues with the OCA you submitted, could you check the status of it and fix the issues it describes?

@Mohammed-Ryiad-Eiadeh
Author

The problem was that in the first OCA form I forgot to write down my GitHub username. A few moments ago I signed the OCA again and provided my GitHub username.

@oracle-contributor-agreement oracle-contributor-agreement bot added OCA Verified All contributors have signed the Oracle Contributor Agreement. and removed OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. labels May 1, 2023
@Mohammed-Ryiad-Eiadeh
Author

Dear Adam,

Is there any way to get a list or an array of the feature values from the 'selectedFeatureDataset', such as getting the array of values of a certain feature based on its index?

I need this in order to compute functions like the Pearson correlation coefficient between two features.

@Craigacp
Member

Craigacp commented May 5, 2023

Not easily from the selected feature dataset, as it's stored row-wise. The current information-theoretic feature selection converts the data into a columnar storage format before running the feature selection procedure.

Depending on what you want to do it might be more efficient to do that here too, e.g. if you need to compute the correlations between all features. Otherwise you pay a log(d) lookup cost for each feature value, which makes a single Pearson's correlation coefficient cost 2n*log(d) just to get the data into columnar format. The other way around, you can make an n*d matrix and then write all the feature values into it, which is much more efficient if you're computing all of the correlations, and not worth it if it's only for a few values.
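
To make the n*d matrix suggestion concrete, here is a minimal sketch (illustrative names, not part of the PR or of Tribuo's utilities) that writes every feature value into a dense matrix, assuming Tribuo's standard Dataset, Example, Feature and ImmutableFeatureMap APIs:

```java
import org.tribuo.Dataset;
import org.tribuo.Example;
import org.tribuo.Feature;
import org.tribuo.ImmutableFeatureMap;
import org.tribuo.classification.Label;

final class DenseMatrixSketch {
    /**
     * Copies a row-wise Dataset into an n x d matrix where column j holds every
     * value of feature j. Absent (implicitly zero) features stay 0.0 because the
     * matrix is zero-initialised.
     */
    static double[][] toDense(Dataset<Label> dataset) {
        ImmutableFeatureMap fmap = dataset.getFeatureIDMap();
        double[][] matrix = new double[dataset.size()][fmap.size()];
        int row = 0;
        for (Example<Label> example : dataset) {
            for (Feature f : example) {
                matrix[row][fmap.getID(f.getName())] = f.getValue();
            }
            row++;
        }
        return matrix;
    }
}
```

Once the data is in columns, Pearson's correlation between any pair of features is a single pass over two columns.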

@Mohammed-Ryiad-Eiadeh
Author

Mohammed-Ryiad-Eiadeh commented May 5, 2023

I want to compute the correlation between all of the features in the selected feature dataset. How can I construct the d*n matrix using the Tribuo engine?

@Craigacp
Member

Craigacp left a comment

Thanks for the PR, sorry it took a while to review. The implementation looks interesting, but there's a bit of work required to get it to play nicely with Tribuo's provenance and configuration systems. Also could you add a basic unit test which runs a feature selection operation, similar to the other tests in the FeatureSelection subproject?

@@ -0,0 +1,27 @@
package FS_Wrapper_Approaches.Discreeting;
Member

The base package name needs updating to org.tribuo.classification.fs.wrapper and then the files should be moved into the right directory. At the moment this won't compile because the directory and package names don't line up.

Member

Also all the files need the copyright and license header.
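
As an illustration of both points, the top of each file would look roughly like this; the exact header text should be copied verbatim from an existing Tribuo source file rather than from this sketch:

```java
/*
 * Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * ... (copy the full header from an existing Tribuo source file)
 */
package org.tribuo.classification.fs.wrapper;
```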

@@ -0,0 +1,27 @@
package FS_Wrapper_Approaches.Discreeting;

import static org.apache.commons.math3.special.Erf.erf;
Member

Tribuo no longer depends on Apache Commons Math 3, and this import isn't used.

* Enumeration that contains the types of transfer functions in which they are used to define the type of transfer function
*/
public enum TransferFunction {
V1, V2, V3, V4, S1, S2, S3, S4
Member

The implementation of discreteValue could be folded in here, making the enum contain a DoubleUnaryOperator which performs the necessary operation, and then it could have a discretise method which applies the function to the output. Then the Binarizing interface can be removed.
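
A minimal sketch of that suggestion, assuming java.util.function.DoubleUnaryOperator; the two formulas below are illustrative placeholders, and the real constants would carry the S1–S4 and V1–V4 equations currently in discreteValue:

```java
import java.util.SplittableRandom;
import java.util.function.DoubleUnaryOperator;

public enum TransferFunction {
    // Placeholder formulas for illustration only.
    S1(x -> 1.0 / (1.0 + Math.exp(-2.0 * x))),   // an S-shaped (sigmoid-style) function
    V1(x -> Math.abs(Math.tanh(x)));             // a V-shaped function

    private final DoubleUnaryOperator function;

    TransferFunction(DoubleUnaryOperator function) {
        this.function = function;
    }

    /** Thresholds the transformed value against a uniform random draw to produce a bit. */
    public int discretise(double value, SplittableRandom rng) {
        return rng.nextDouble() < function.applyAsDouble(value) ? 1 : 0;
    }
}
```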

import org.tribuo.classification.Label;
import org.tribuo.classification.evaluation.LabelEvaluation;
import org.tribuo.classification.evaluation.LabelEvaluator;
import org.tribuo.common.nearest.KNNClassifierOptions;
Member

We try to avoid surfacing the options classes in the library code as they are not under the same compatibility guarantees as the library code. It's probably better to import KNNTrainer directly.

/**
* This interface includes the evaluation function of each solution
*/
public interface FitnessFunction {
Member

This interface only contains static members. For the time being those methods can be moved onto CuckooSearchOptimizer and be private, then if we need to use them from other places they can be easily moved and made more accessible.

private final double delta;
private final int populationSize;
private int [][] setOfSolutions;
private final int maxIteration;
Member

All the final variables here should not be final and need to be tagged @Config (with appropriate description fields) so they are automatically captured by the provenance system.
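
A rough sketch of what that looks like, assuming OLCUT's com.oracle.labs.mlrg.olcut.config.Config annotation; the field names, defaults and descriptions here are illustrative:

```java
import com.oracle.labs.mlrg.olcut.config.Config;

public class CuckooSearchOptimizer /* implements the relevant Tribuo interfaces */ {

    @Config(description = "Number of nests (candidate feature subsets) in the population.")
    private int populationSize = 25;

    @Config(description = "Fraction of the worst nests abandoned at each iteration.")
    private double delta = 0.25;

    @Config(description = "Maximum number of search iterations.")
    private int maxIteration = 100;

    // Mutable working state is not configuration, so it stays un-annotated
    // and is rebuilt each time the optimizer runs.
    private int[][] setOfSolutions;

    /** No-args constructor for the OLCUT configuration system. */
    private CuckooSearchOptimizer() {}
}
```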

* @param totalNumberOfFeatures The number of features in the given dataset
* @return The population of subsets of selected features
*/
private int[][] GeneratePopulation(int totalNumberOfFeatures) {
Member

As mentioned elsewhere this method should accept a SplittableRandom rather than make a fresh RNG each time.
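
For example (a sketch with illustrative names, not the PR's code), the method could take the optimizer's shared SplittableRandom as a parameter:

```java
import java.util.SplittableRandom;

final class PopulationSketch {
    private final int populationSize;

    PopulationSketch(int populationSize) {
        this.populationSize = populationSize;
    }

    /** Builds the initial population of binary feature masks from the shared RNG. */
    int[][] generatePopulation(int totalNumberOfFeatures, SplittableRandom rng) {
        int[][] population = new int[populationSize][totalNumberOfFeatures];
        for (int[] solution : population) {
            for (int j = 0; j < solution.length; j++) {
                solution[j] = rng.nextInt(2); // include (1) or exclude (0) each feature
            }
        }
        return population;
    }
}
```

Reusing one seeded SplittableRandom also keeps the feature selection reproducible, which matters for provenance.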

/**
* This record is used to hold subset of features with its corresponding fitness score
*/
record FeatureSet_FScore_Container(int[] subSet, double score) { }
Member

CuckooSearchFeatureSet?

Author

I used this record to store the solutions with their corresponding fitness values, so that I can arrange the solutions based on their fitness scores using a Comparator.

Member

Ah, sorry, I was suggesting this record should be renamed CuckooSearchFeatureSet as it's specific to this approach and it's currently got a very general name.

Author

Ah, I am sorry, I misunderstood your note. You are right, I should change the name, and I am currently working on the FJP issue. Thanks, Dr. Adam, for your suggestions; all of them are suitable, and I know that I still need more time to learn more about programming in Java (my best language).
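
To make the rename concrete, here is a small sketch (illustrative names) of the renamed record together with the Comparator-based sort described above:

```java
import java.util.Arrays;
import java.util.Comparator;

record CuckooSearchFeatureSet(int[] subSet, double score) { }

final class SortSketch {
    /** Orders the population so the best-scoring subsets come first. */
    static void sortByFitness(CuckooSearchFeatureSet[] population) {
        Arrays.sort(population,
                Comparator.comparingDouble(CuckooSearchFeatureSet::score).reversed());
    }
}
```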

KNNClassifierOptions classifier = new KNNClassifierOptions();
CrossValidation<Label, LabelEvaluation> crossValidation = new CrossValidation<>(classifier.getTrainer(), selectedFeatureDataset, new LabelEvaluator(), 10);
double avgAccuracy = 0d;
for (Pair<LabelEvaluation, Model<Label>> ACC : crossValidation.evaluate())
Member

We always use curly braces, even for single-line for loops and if statements.
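
For reference, a sketch of the same evaluation loop written with braces and with the Trainer passed in directly rather than built from an options class (per the earlier comment); it assumes Tribuo's CrossValidation and LabelEvaluator plus OLCUT's Pair, and the class and method names are illustrative:

```java
import com.oracle.labs.mlrg.olcut.util.Pair;
import org.tribuo.Dataset;
import org.tribuo.Model;
import org.tribuo.Trainer;
import org.tribuo.classification.Label;
import org.tribuo.classification.evaluation.LabelEvaluation;
import org.tribuo.classification.evaluation.LabelEvaluator;
import org.tribuo.evaluation.CrossValidation;

final class FitnessSketch {
    /** Returns the mean 10-fold cross-validated accuracy of the trainer on the data. */
    static double averageAccuracy(Trainer<Label> trainer, Dataset<Label> data) {
        CrossValidation<Label, LabelEvaluation> crossValidation =
                new CrossValidation<>(trainer, data, new LabelEvaluator(), 10);
        double avgAccuracy = 0d;
        for (Pair<LabelEvaluation, Model<Label>> fold : crossValidation.evaluate()) {
            avgAccuracy += fold.getA().accuracy();
        }
        return avgAccuracy / 10;
    }
}
```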

import FS_Wrapper_Approaches.Discreeting.Binarizing;
import FS_Wrapper_Approaches.Discreeting.TransferFunction;
import com.oracle.labs.mlrg.olcut.util.Pair;
import org.tribuo.*;
Member

All imports should be explicit, no star imports for packages.

@Mohammed-Ryiad-Eiadeh
Author

Thank you for your suggestions; all of them are great and I will apply them. However, I am currently working on my PhD proposal and have meetings with my advisor during the current month, so I will make these changes next month.
