Detecting Cancer Biomarkers from RNA-seq Using Machine Learning

ConsensusML

Machine Learning to Detect Cancer Biomarkers from RNAseq Data

A workflow that applies machine learning methods for feature selection, and for consensus across selections, to identify the gene biomarkers that best discriminate subgroups in RNA-seq data from heterogeneous cancer populations (e.g. pediatric AML).
The workflow is intended to be publicly available to bioinformaticians who use RNA-seq gene expression data to characterize tumors, and to help them find the biomarkers that best identify specific subpopulations in a cancer cohort (e.g. TARGET AML, TCGA AML, TARGET NBL, TARGET WT).

Background

Leukemia is a cancer of the blood arising in white blood cells of the bone marrow. It poses a substantial population burden as the most common pediatric cancer (Steliarova-Foucher et al 2017; SEER data). Acute Myelogenous Leukemia (AML) is a type of leukemia affecting the myeloblast stem cells. AML currently arises at a rate of approximately 20,000 cases per year, with a 5-year survival rate of 27.4% [2]. It is a molecularly heterogeneous cancer with clinically relevant subtypes, perhaps dozens, defined by factors ranging from cell differentiation state to cytogenetic and sequencing assay results (Yi G. et al 2019; Tyner J. W. et al 2018). Pediatric AML is characterized at the molecular level by rare somatic mutations, the absence of common adult AML mutations, and relatively frequent structural variants (Bolouri H et al 2018). Here, we apply several machine learning approaches for feature selection to RNA-seq data from both pediatric and adult AML cases. Our goal is to better understand the gene expression-based heterogeneity underlying AML cases, as well as age-related and age-unrelated dysregulation patterns. We used clinical and assay data from pediatric cancer patients from the Therapeutically Applicable Research To Generate Effective Treatments (TARGET) initiative (https://ocg.cancer.gov/programs/target/).

Methods and Analysis Overview

We applied machine learning to feature selection in order to identify the genes and gene sets most important for predicting clinically relevant sample classifiers in pediatric and adult AML cases. Classifiers of main interest include age, stage, and survival; for this investigation, we focused on the risk group classifier. We performed both pan-cancer and cancer-specific analyses of TARGET pediatric cancers. For the analysis of AML cases, we combined primary peripheral blood and bone marrow samples.

Methods Workflow

Expression Data

TARGET gene counts, based on RNA-seq runs on the Illumina HiSeq platform, were obtained from the Genomic Data Commons (https://gdc.cancer.gov/). Gene counts were normalized using the trimmed mean of M-values (TMM) method. We then pre-filtered the TMM-normalized expression data by the extent of differential expression with respect to the classifiers of interest, using multiple thresholds. Once preprocessing and QC were complete, for each classifier of interest we randomly divided the data into training and test subsets, preserving classifier frequency in each subset.
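The split described above, which preserves classifier frequency in both subsets, is commonly called a stratified split. The following is a minimal sketch of that idea using only the Python standard library. The function name `stratified_split`, the 25% test fraction, and the toy risk-group labels are assumptions made for illustration and are not taken from the project code.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with class frequencies preserved.

    `labels` holds one class label per sample (hypothetical input).
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Take the same fraction of every class for the test subset.
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy risk-group labels (made up for illustration, not TARGET data).
labels = ["low"] * 8 + ["standard"] * 12 + ["high"] * 4
train_idx, test_idx = stratified_split(labels, test_fraction=0.25, seed=1)
```

Sampling within each class separately keeps a rare risk group from being left out of the test subset by chance, which a plain random split over all samples could do.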

Analysis Approach using Machine Learning Algorithms

We then applied an "ensemble" learning approach comparing feature selection results. We assessed results using multiple machine learning algorithms from various R and Python packages, including:

1. Support Vector Machines (SVM) using e1071
2. Random Forest with boosting in Python
3. Neural Networks with keras
4. Logistic regression in R
5. Elastic net with Lasso using glmnet
6. AutoML

Results and Feature Selection

Using normalized and pre-filtered RNA-seq expression data, we fitted models using the algorithms described in Methods. For models that performed well on the filtered gene set, we then identified the gene features most important for model prediction. We assessed the consensus of selected features across algorithm classes, and for the most frequently recurring features, we mined the scientific literature for evidence validating these genes' functional roles in leukemia and AML.
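One simple way to assess consensus across algorithms is to count how many models selected each gene and keep the genes chosen by at least some minimum number of them. The sketch below illustrates that counting step with the Python standard library. The function name `consensus_features`, the `min_models` threshold, and the placeholder gene names are assumptions for illustration; the project's actual consensus procedure may differ.

```python
from collections import Counter


def consensus_features(selections, min_models=2):
    """Return (gene, count) pairs for genes chosen by >= `min_models` models.

    `selections` maps a model name to the list of genes it selected.
    Results are ordered by descending count, then by gene name.
    """
    counts = Counter()
    for features in selections.values():
        counts.update(set(features))  # count each gene once per model
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(gene, n) for gene, n in ranked if n >= min_models]


# Hypothetical per-algorithm selections; gene names are placeholders.
selections = {
    "svm": ["GENE_A", "GENE_B", "GENE_C"],
    "random_forest": ["GENE_B", "GENE_C", "GENE_D"],
    "elastic_net": ["GENE_C", "GENE_E"],
}
consensus = consensus_features(selections, min_models=2)
```

Raising `min_models` gives a shorter, stricter list, while lowering it keeps genes that only one class of algorithm found informative.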

Links to Shared Documents

1. Manuscript

https://docs.google.com/document/d/1DPAmUFfggAnAjsMIPTs1hV90k25ZKckYLi18b3dBot0/edit#

2. Day 2 Presentation

https://docs.google.com/presentation/d/1HxHyaGLNxAbhsEd2OVs6R3HiGGu5hrJO_lcLS5ffZlc/edit#slide=id.g4f487fb995_0_278

3. Final Hackathon Presentation

https://docs.google.com/presentation/d/1WdvgktxKQAXjABnFVWuMOW6ENCd5gwZKb_SjKrnLZbs/edit#slide=id.g4ec347d104_0_7
