🔥 INFRNO: Interpretable framework for uncovering interaction opportunities in macromolecules

Samantha Stuart, Jeffrey Watchorn, Frank Gu

Institute of Biomedical Engineering, University of Toronto, Toronto, Ontario, Canada

Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario, Canada


This formal analysis repository accompanies the work: An Interpretable Machine Learning Framework for Modelling Macromolecular Interaction Mechanisms with Nuclear Magnetic Resonance.

In this work, we developed a framework for modelling the interactions that arise between large-molecule systems to inform biomaterial design. In addition to modelling structure-activity relationships, the framework identifies "undervalued" ligand sites as engineering design opportunities to unlock receptor interaction. The input data and feature descriptors are obtained from experimental screening with DISCO-NMR. Any receptor-ligand interaction dataset generated from DISCO-NMR screening 🕺 can be analyzed equivalently with INFRNO 🔥.


Using INFRNO, we can:

  • Model Atomic-Level Macromolecular Interaction Trends: We apply linear principal component analysis to DISCO NMR data descriptors and labels, and train a binary decision tree classifier to construct proton structure-interaction trends across ligand chemical species (a minimal sketch of this step follows the list).

  • Identify Opportunities for Designed Interaction: Inert-labeled protons bordering cross-species decision regions indicate opportunities for physical property tuning towards interaction without additional chemical functionalization.

  • Create a Runway to Interaction Prediction: The decision tree for a given receptor can be re-trained to "grow" as increasingly diverse ligands are screened, while informing ligand design with data-driven insights along the way.
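
The modelling step can be sketched with scikit-learn as below. This is an illustrative sketch only, not the repository's exact pipeline: the random data stands in for the proton-level descriptors and interaction labels produced by DISCO NMR feature generation, and the component count and tree depth are placeholder values.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder data standing in for DISCO NMR proton descriptors and binary
# interaction labels (in the repository these come from
# notebooks/utils/feature_generation.py and data/raw).
rng = np.random.default_rng(148)
X = rng.normal(size=(200, 8))
y = rng.integers(0, 2, size=200)

# Project standardized descriptors onto principal components, then fit a
# shallow, interpretable binary decision tree on the component scores.
scores = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))
tree = DecisionTreeClassifier(max_depth=3, random_state=148).fit(scores, y)
print(export_text(tree, feature_names=["PC1", "PC2", "PC3"]))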

Quick Start on Google Colab:

To get quick intuition for the framework, we provide a tutorial in Google Colab that can be run without any local environment setup.

The input dataset to upload to the Colab notebook can be downloaded from this repository at data/raw/proton_binding_dataset.xlsx
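
To take a quick look at the dataset locally before uploading it, a minimal loading sketch (assuming pandas and openpyxl are installed; the columns printed are simply whatever the spreadsheet contains):

import pandas as pd

# Load the raw DISCO NMR proton binding dataset shipped with the repository
# and preview its shape and first rows.
df = pd.read_excel("data/raw/proton_binding_dataset.xlsx")
print(df.shape)
print(df.head())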


Project Organization

├── LICENSE
├── README.md          <- The top-level README for this project.
├── data
│   ├── processed      <- The benchmarking result files output from scripts
│   └── raw            <- The training dataset
│
├── notebooks          <- Notebooks and scripts for formal analysis
│   ├── benchmark_CDEpipe.py         <- Benchmarking script for cumulative  
│   │                                   disco effect pipeline
│   ├── benchmark_maxsteadyslope.py  <- Benchmarking script for curve attribute 
│   │                                   pipeline
│   ├── benchmark_meandiscoeff.py    <- Benchmarking script for mean disco effect 
│   │                                   pipeline
│   ├── benchmark_chemonly.py        <- Benchmarking script for pipeline without 
│   │                                   disco effect
│   ├── benchmarking_analysis.ipynb  <- Global pipeline benchmarking analysis (SI)
│   ├── final_model_paper_CDE_rs148.ipynb  <- Formal analysis and figure generation
│   └── utils                        <- Utility functions
│       └── feature_generation.py    <- DISCO NMR feature generation script
│
├── figures           
│   ├── main           <- Main formal analysis figures
│   ├── misc           <- Misc. figure files
│   └── supplementary  <- SI figures
│
└── requirements.txt   <- The requirements for the analysis environment

Setup to run the code locally:

1. Clone or download this GitHub repository:

Do one of the following:

  • Clone this repository to a directory of your choice on your computer using the command line or GitHub Desktop.

  • Download the ZIP archive of the repository, then move and extract it in a directory of your choice on your computer.

2. Install dependencies using Anaconda or Pip

Instructions for installing dependencies via Anaconda:

  1. Download and install Anaconda

  2. Navigate to the project directory

  3. Open Anaconda prompt in this directory (or Terminal)

  4. Run the following command from the Anaconda prompt (or Terminal) to automatically create an environment from the requirements.txt file: $ conda create --name infrno --file requirements.txt

  5. Run the following command to activate the environment: $ conda activate infrno

  6. You are now ready to open and run files in the repository in a code editor of your choice that uses your virtual environment (e.g., VS Code)

For detailed information about creating, managing, and working with Conda environments, please see the corresponding help page.

Instructions for installing dependencies with pip

If you prefer to manage your packages using pip, navigate in Terminal to the project directory and run the command below to install the prerequisite packages into your virtual environment:

$ pip install -r requirements.txt

With either install option, you may need to create an additional Jupyter Notebook kernel containing your virtual environment, if it does not automatically appear. See this guide for more information.
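
For example, assuming ipykernel is installed in the environment, a kernel can typically be registered with:

$ python -m ipykernel install --user --name infrno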


3. Run the model

  1. Navigate to the notebook notebooks/final_model_paper_CDE_rs148.ipynb (an example launch command is shown after this list)

  2. Execute all cells sequentially
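
For example, assuming Jupyter is installed in your environment, the notebook can be launched from the project root with:

$ jupyter notebook notebooks/final_model_paper_CDE_rs148.ipynb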

4. Run the benchmarking

  1. Execute each benchmarking script in the notebooks directory (example commands are shown after this list):

    • benchmark_CDEpipe.py
    • benchmark_chemonly.py
    • benchmark_maxsteadyslope.py
    • benchmark_meandiscoeff.py
  2. Open notebooks/benchmarking_analysis.ipynb

  3. Execute all cells sequentially to compare pipelines
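
For example, assuming the environment is active and the scripts are run from the notebooks directory:

$ python benchmark_CDEpipe.py
$ python benchmark_chemonly.py
$ python benchmark_maxsteadyslope.py
$ python benchmark_meandiscoeff.py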


Re-using the repository with a new dataset

  1. Replace the training dataset in data/raw with any new DISCO NMR screening results named proton_binding_dataset.xlsx. The name proton_binding_dataset must be preserved to maintain compatibility with all file read operations in this repository.

  2. Open notebooks/final_model_paper_CDE_rs148.ipynb, re-run all cells for file reading, feature generation, and model generation until updated tree figures are generated and displayed in the console

    • Adjustment of the hyperparameter grid and random seed may be required for new datasets to yield the best tree (a minimal tuning and baseline sketch is shown after this list)
  3. To interpret the resulting tree decisions, customize the provided example figure generation cells and proton average properties according to the updated high-importance principal components and decision rules

  4. Where cross-polymer decision rules result, examine the identities of inert protons near the interactive border as "hypotheses" for physical property tuning towards achieving interaction

  5. If desired, execute the benchmark_CDEpipe.py script to evaluate the out-of-sample error of the updated model on the updated dataset

    • Note that if the hyperparameter grid and random seeds have been altered, the benchmarking script should be adjusted to reflect those updates
    • The majority classifier baseline F1 score should also be recomputed for the new dataset (see the sketch after this list)
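
The re-tuning and baseline updates in the two notes above can be sketched as follows. This is an illustrative sketch on synthetic data with a hypothetical grid; the repository's actual grid, random seeds, and scoring live in the benchmarking scripts and should be edited there.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder principal component scores and interaction labels standing in
# for the features generated from a new DISCO NMR dataset.
rng = np.random.default_rng(148)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=148)

# Hypothetical hyperparameter grid: widen or shift it until the tuned tree
# is both accurate and shallow enough to interpret.
grid = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=148), grid, scoring="f1", cv=5)
search.fit(X_train, y_train)
print("tuned tree F1:", f1_score(y_test, search.predict(X_test)))

# Majority-class baseline F1 to compare the tuned model against.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("majority baseline F1:", f1_score(y_test, baseline.predict(X_test)))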

How to cite

@article{TBD,
  title={An Interpretable Machine Learning Framework for Modelling Macromolecular Interaction Mechanisms with Nuclear Magnetic Resonance},
  author={Stuart, Samantha and Watchorn, Jeffrey and Gu, Frank},
  journal={TBD},
  year={2022},
  publisher={TBD}
}

License

MIT License
