Add Blood-Brain Barrier Database (B3DB) to TDC #215

Merged 2 commits on Feb 21, 2024

@ayushnoori
Member

ayushnoori commented Dec 20, 2023

Based on discussion in #174, adding the Blood-Brain Barrier Database (B3DB) to TDC. Currently not adding B3DB to the ADMET benchmark group, but this could be added later (@kexinhuang12345, please see ac35e01).

Dataset Description

The Blood-Brain Barrier Database (B3DB) is a curated resource of 7,807 small molecules, each classified as either BBB permeable (BBB+) or BBB non-permeable (BBB-); the original release included 4,956 BBB+ and 2,851 BBB- molecules. BBB permeability is measured by the logarithm of the brain-to-blood (plasma) concentration ratio:

$$\log{BB} = \log{\frac{C_{brain}}{C_{blood}}}$$

Numerical $\log{BB}$ data was originally included for 1,058 of the 7,807 molecules in the dataset.
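
As a worked illustration of the formula above, here is a minimal sketch that computes $\log{BB}$ from a pair of concentrations. It assumes a base-10 logarithm (the description only says "logarithm"); the function name `log_bb` and the example concentrations are hypothetical and not taken from B3DB.

```python
import math


def log_bb(c_brain: float, c_blood: float) -> float:
    """Return logBB = log10(C_brain / C_blood).

    Assumption: base-10 logarithm, with both concentrations in the same units.
    """
    if c_brain <= 0 or c_blood <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(c_brain / c_blood)


# Hypothetical concentrations, for illustration only.
print(log_bb(10.0, 1.0))  # brain concentration is 10x blood: positive logBB
print(log_bb(1.0, 10.0))  # brain concentration is 1/10 of blood: negative logBB
```

A positive value means the compound is more concentrated in the brain than in the blood, and a negative value means the opposite.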

Data Processing

After removing duplicates and entries with NA IUPAC identifiers, classification data remain for 6,167 molecules and regression data for 942 molecules. The data processing script is available at: https://gist.github.com/ayushnoori/af42cc651856f347614d0bd2a8fe7def
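
The filtering described above could look roughly like the sketch below. This is not the linked script: the column name `IUPAC_name`, the choice to deduplicate on that column, the helper name `clean_records`, and the example rows are all assumptions made for illustration.

```python
def clean_records(records, id_key="IUPAC_name"):
    """Drop rows with a missing identifier, then keep the first row per identifier.

    Assumption: missing identifiers appear as None, an empty string, or "NA".
    """
    seen = set()
    cleaned = []
    for row in records:
        identifier = row.get(id_key)
        if identifier is None:
            continue
        identifier = identifier.strip()
        if identifier in ("", "NA"):
            continue
        if identifier in seen:
            continue  # duplicate identifier: keep only the first occurrence
        seen.add(identifier)
        cleaned.append(row)
    return cleaned


# Hypothetical rows, for illustration only.
rows = [
    {"IUPAC_name": "compound-a", "label": "BBB+"},
    {"IUPAC_name": "NA", "label": "BBB-"},
    {"IUPAC_name": "compound-a", "label": "BBB+"},
    {"IUPAC_name": "compound-b", "label": "BBB-"},
]
print(len(clean_records(rows)))
```

Which key defines a duplicate (name, SMILES, or another identifier) changes the final counts, so the gist remains the authoritative reference for the numbers quoted here.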

Data Description

We add two new datasets:

  • b3db_classification: Binary permeability label for all 6,167 small molecules, where the $\log{BB}$ ratio is binarized so that $\log{BB} > 0$ is labeled BBB+ (i.e., 1) and $\log{BB} < 0$ is labeled BBB- (i.e., 0).
  • b3db_regression: Numerical $\log{BB}$ data for 942 small molecules.
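
The binarization rule stated in the list above can be sketched as follows. The function name `binarize_log_bb` is hypothetical, and because the rule as written does not say how a value of exactly 0 is handled, the sketch returns `None` for that case rather than guessing.

```python
from typing import Optional


def binarize_log_bb(log_bb: float) -> Optional[int]:
    """Map a logBB value to 1 (BBB+) or 0 (BBB-), following the rule as stated.

    Assumption: logBB == 0 is left unlabeled because the rule does not cover it.
    """
    if log_bb > 0:
        return 1  # BBB+
    if log_bb < 0:
        return 0  # BBB-
    return None


# Hypothetical logBB values, for illustration only.
for value in (0.5, -1.2, 0.0):
    print(value, binarize_log_bb(value))
```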

Reference

Meng, F., Xi, Y., Huang, J. & Ayers, P. W. A curated diverse molecular database of blood-brain barrier permeability with chemical descriptors. Sci Data 8, 289 (2021).

DOI: 10.1038/s41597-021-01069-5

GitHub: https://github.com/theochem/B3DB

Harvard DataVerse: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1RVMJ0

@kexinhuang12345
Collaborator

Thank you, Ayush, and I hope you had a nice winter break! This looks good. One question: how does the regression dataset differ from the classification dataset? Are the regression molecules a subset of the classification molecules? Ideally, we would have the raw values for all ~6K molecules in the classification dataset. Let me know your thoughts!

@ayushnoori
Member Author

Thanks, Kexin, and to you as well! I'm not sure if the regression dataset is a strict subset of the classification one. In the original repo, they're described as two separate files, perhaps with different pre-processing or features available.

I can check on this. If I were to combine them into one TSV file, how should I name the columns? Right now, for both b3db_classification.tab and b3db_regression.tab, the column names are: Drug_ID, Drug, Y.
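
One way to run the subset check is to compare the Drug columns of the two files. The sketch below assumes tab-delimited files with the Drug_ID, Drug, Y columns listed above and that the Drug strings are written consistently in both files. The in-memory file contents and the helper name `read_drugs` are hypothetical.

```python
import csv
import io


def read_drugs(tab_text: str) -> set:
    """Return the set of values in the Drug column of a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    return {row["Drug"] for row in reader}


# Hypothetical file contents, standing in for the two .tab files.
classification_tab = "Drug_ID\tDrug\tY\nmol_1\tCCO\t1\nmol_2\tCCN\t0\n"
regression_tab = "Drug_ID\tDrug\tY\nmol_1\tCCO\t0.3\nmol_3\tCCC\t-0.7\n"

classification_drugs = read_drugs(classification_tab)
regression_drugs = read_drugs(regression_tab)

# True only if every regression molecule also appears in the classification set.
is_subset = regression_drugs <= classification_drugs
only_in_regression = sorted(regression_drugs - classification_drugs)
print(is_subset, only_in_regression)
```

Exact string comparison only works if both files encode molecules the same way; otherwise the strings would need to be canonicalized first.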

In a new b3db.tab file, I could potentially keep the classification label as Y, and then add a new column called logBB for the regression data. What do you think?
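
A rough sketch of that combined layout: keep the classification rows, and fill a logBB column from the regression table where the Drug value matches. The helper name `merge_tables` and the example rows are hypothetical, and regression molecules that are absent from the classification table are dropped in this version.

```python
import csv
import io


def merge_tables(classification_tab: str, regression_tab: str) -> str:
    """Return a tab-delimited table with columns Drug_ID, Drug, Y, logBB.

    Y keeps the classification label; logBB is empty when no regression
    value exists for that Drug.
    """
    cls_rows = csv.DictReader(io.StringIO(classification_tab), delimiter="\t")
    reg_rows = csv.DictReader(io.StringIO(regression_tab), delimiter="\t")
    log_bb_by_drug = {row["Drug"]: row["Y"] for row in reg_rows}

    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["Drug_ID", "Drug", "Y", "logBB"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in cls_rows:
        writer.writerow(
            {
                "Drug_ID": row["Drug_ID"],
                "Drug": row["Drug"],
                "Y": row["Y"],
                "logBB": log_bb_by_drug.get(row["Drug"], ""),
            }
        )
    return out.getvalue()


# Hypothetical file contents, for illustration only.
classification_tab = "Drug_ID\tDrug\tY\nmol_1\tCCO\t1\nmol_2\tCCN\t0\n"
regression_tab = "Drug_ID\tDrug\tY\nmol_1\tCCO\t0.3\n"
merged = merge_tables(classification_tab, regression_tab)
print(merged)
```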

@kexinhuang12345
Collaborator

Hi Ayush - I think we can keep two files and two dataset names. The single pred data class currently assumes a single Y column, which many data functions rely on. Hope this clarifies!!

@ayushnoori
Member Author

ayushnoori commented Feb 20, 2024

Great! Would you like us to make any additional changes then, or does this look good to merge? @kexinhuang12345

@kexinhuang12345
Collaborator

Sounds good! Thanks!!

@kexinhuang12345 merged commit 10865f6 into mims-harvard:main on Feb 21, 2024
2 checks passed