In this module, we validate the final ML model.
We use the models from 2.train_model to classify nuclei images from the Cell Health dataset. In cell_health_correlations.ipynb, the classification probabilities for each CRISPR guide/cell line are then correlated with the Cell Health labels for the respective CRISPR perturbation/cell line.
The Cell Health dataset contains Cell Painting images across 119 CRISPR guide perturbations (~2 per gene perturbation) and 3 cell lines. More information about how this dataset was generated can be found at https://github.com/broadinstitute/cell-health.
In Cell-Health-Data/4.classify-features, we use the trained models to determine phenotypic class probabilities for each of the Cell Health cells. We average these probabilities across CRISPR guide/cell line to create 357 classification profiles (119 CRISPR guides x 3 cell lines).
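The averaging step can be sketched with pandas as follows. This is a minimal illustration, not the module's actual code: the column names (`Metadata_pert_name`, `Metadata_cell_line`), the phenotype names, and all values are hypothetical stand-ins for the real single-cell probability tables.

```python
import pandas as pd

# Hypothetical single-cell classification probabilities (toy data).
# All column names and values are assumptions for illustration only.
single_cell_probas = pd.DataFrame(
    {
        "Metadata_pert_name": ["guide_A", "guide_A", "guide_A", "guide_B"],
        "Metadata_cell_line": ["cell_line_1", "cell_line_1", "cell_line_2", "cell_line_1"],
        "Apoptosis": [0.2, 0.4, 0.5, 0.9],
        "Interphase": [0.8, 0.6, 0.5, 0.1],
    }
)

# Average the probabilities over all cells that share a CRISPR guide and cell line.
classification_profiles = single_cell_probas.groupby(
    ["Metadata_pert_name", "Metadata_cell_line"], as_index=False
).mean()

print(classification_profiles)
```

Each row of the result is one guide/cell line profile, so the full dataset would yield one row per combination.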
Way et al. derived cell health indicators as part of "Predicting cell health phenotypes using image-based morphology profiling". These indicators consist of 70 specific cell health phenotypes, including proliferation, apoptosis, reactive oxygen species, DNA damage, and cell cycle stage. Way et al. averaged these indicators across CRISPR guide/cell line to create 357 Cell Health label profiles.
We use pandas.DataFrame.corr to find the Pearson correlation coefficient between the classification profiles and the Cell Health label profiles. The Pearson correlation coefficient measures the linear relationship between two datasets, with correlations of -1 and +1 implying exact inverse and direct linear relationships, respectively.
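Because pandas.DataFrame.corr correlates the columns of a single DataFrame, one way to compare two sets of profiles is to concatenate them on their shared index and then slice out the cross-correlation block. The sketch below shows this under stated assumptions: the index labels, phenotype names, and indicator names are hypothetical, and the notebook may organize the computation differently.

```python
import pandas as pd

# Toy profiles indexed by hypothetical guide/cell line identifiers.
index = ["guide_A__line_1", "guide_A__line_2", "guide_B__line_1", "guide_B__line_2"]
classification_profiles = pd.DataFrame(
    {"Apoptosis": [0.1, 0.2, 0.3, 0.4], "Interphase": [0.9, 0.7, 0.8, 0.5]},
    index=index,
)
cell_health_labels = pd.DataFrame(
    {"indicator_1": [2.0, 4.0, 6.0, 8.0], "indicator_2": [4.0, 3.0, 2.0, 1.0]},
    index=index,
)

# DataFrame.corr computes pairwise correlations between columns of one frame,
# so join both profile sets column-wise on their shared index first.
combined = pd.concat([classification_profiles, cell_health_labels], axis=1)
full_corr = combined.corr(method="pearson")

# Keep only the phenotype-vs-indicator block of the full correlation matrix.
correlations = full_corr.loc[classification_profiles.columns, cell_health_labels.columns]

print(correlations)
```

In this toy data, Apoptosis rises in exact proportion to indicator_1 and falls in exact proportion to indicator_2, so those two entries sit at the +1 and -1 extremes described above.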
We also derive the Clustermatch Correlation Coefficient (CCC) introduced in Pividori et al., 2022. This is a not-only-linear coefficient based on machine learning models. It indicates how strongly two variables are related, where 0 is no relationship and 1 is a perfect relationship.
These correlations are briefly interpreted in preview_CH_correlations.ipynb and preview_CH_correlations.ipynb, which use seaborn.clustermap to display the hierarchically clustered correlation values. Seaborn's clustermap groups rows and columns with similar correlation patterns into clusters.
Inside the notebook cell_health_correlations.ipynb, the variable classification_profiles_save_dir needs to be set to the directory where the classification profiles are saved.
We used an external hard drive and therefore needed to use specific paths.
The classification profiles are the output of cell-health-data/4.classify-single-cell-phenotypes.
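As a rough sketch, setting the variable might look like the following. The path shown is a hypothetical placeholder, not the location used by the authors; replace it with the directory on your own system.

```python
import pathlib

# Hypothetical placeholder path; point this at wherever your
# classification profiles were actually written.
classification_profiles_save_dir = pathlib.Path("/path/to/classification_profiles")

# Warn early if the directory is missing (for example, an unmounted external drive).
if not classification_profiles_save_dir.is_dir():
    print(f"Directory not found: {classification_profiles_save_dir}")
```

Checking the directory up front avoids a less obvious failure later in the notebook when the profiles are loaded.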
Use the commands below to validate the final ML model:
# Make sure you are located in 5.validate_model
cd 5.validate_model
# Activate phenotypic_profiling conda environment
conda activate phenotypic_profiling
# Validate model
bash validate_model.sh
Notes:
- Intermediate .tsv data are stored in tidy format, a standardized data structure (see Tidy Data by Hadley Wickham for more details).
- SCM stands for "single cell model(s)" and is used as an abbreviation for the binary, single-class models throughout this module.
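To illustrate the tidy format mentioned in the notes, the sketch below reshapes a small wide correlation matrix into one row per observation and writes it as tab-separated text. All names (`phenotypic_class`, `cell_health_indicator`, `corr_value`, and the toy phenotypes and indicators) are hypothetical, and the module's actual column names may differ.

```python
import io

import pandas as pd

# Toy wide correlation matrix: rows are phenotypes, columns are indicators.
# All names and values are assumptions for illustration only.
wide = pd.DataFrame(
    {"indicator_1": [0.8, -0.2], "indicator_2": [-0.5, 0.1]},
    index=pd.Index(["Apoptosis", "Interphase"], name="phenotypic_class"),
)

# Tidy (long) format: one row per observation, one column per variable.
tidy = wide.reset_index().melt(
    id_vars="phenotypic_class",
    var_name="cell_health_indicator",
    value_name="corr_value",
)

# Write tab-separated text; an in-memory buffer stands in for a real .tsv file.
buffer = io.StringIO()
tidy.to_csv(buffer, sep="\t", index=False)
tsv_text = buffer.getvalue()

print(tsv_text)
```

Storing the data this way makes it straightforward to filter, group, or plot by any single variable without reshaping first.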