ModelSEED Biochemistry Database Scripts (Structures)

There are several scripts that we use to organize and collate the structures that we wish to use as part of the biochemistry:

  • Print_Structure_Formula_Charge.py

This script takes all the InChI and SMILES strings from the original sources and extracts their formulas and charges, including protonation states. You should only need to run it when you update the structures themselves, but the results can change if you install updated versions of RDKit and/or OpenBabel in your conda environment, so double-check. At the time of submission we used RDKit 2020.03.1.0 and OpenBabel 2.4.1.
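To illustrate what "formula and charge, including protonation state" means for an InChI string, here is a minimal standard-library sketch. It is not the code in Print_Structure_Formula_Charge.py, which relies on RDKit and OpenBabel. The function name is hypothetical, and the sketch assumes a single-component standard InChI, where the formula is the first layer after the version prefix, `/q` carries the charge, and `/p` carries added or removed protons.

```python
import re


def formula_and_charge_from_inchi(inchi):
    """Return (element_counts, net_charge) for a single-component standard InChI.

    Hypothetical helper for illustration only; real structures should be
    processed with a cheminformatics toolkit.
    """
    layers = inchi.split("/")
    if len(layers) < 2 or not layers[0].startswith("InChI="):
        raise ValueError(f"not an InChI string: {inchi!r}")

    # The formula layer, when present, starts with an uppercase element symbol.
    formula = layers[1] if layers[1][:1].isupper() else ""
    if "." in formula:
        raise ValueError("multi-component InChI is not handled in this sketch")

    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)

    charge = 0   # value of the /q layer
    protons = 0  # value of the /p layer (protons added or removed)
    for layer in layers[1:]:
        if layer.startswith("q"):
            charge = int(layer[1:])
        elif layer.startswith("p"):
            protons = int(layer[1:])

    # Each added or removed proton changes both the H count and the net charge.
    hydrogens = counts.get("H", 0) + protons
    if hydrogens:
        counts["H"] = hydrogens
    else:
        counts.pop("H", None)
    return counts, charge + protons


# Acetate: neutral formula C2H4O2 with one proton removed (/p-1).
acetate = formula_and_charge_from_inchi(
    "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1"
)
# Choline: a fixed positive charge recorded in the /q layer.
choline = formula_and_charge_from_inchi(
    "InChI=1S/C5H14NO/c1-6(2,3)4-5-7/h7H,4-5H2,1-3H3/q+1"
)
```

For acetate, the proton layer reduces the hydrogen count from 4 to 3 and gives a net charge of -1. A toolkit upgrade can change how such cases are normalized, which is one reason to re-check the output after updating RDKit or OpenBabel.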

  • List_ModelSEED_Structures.py

This script takes the list of InChI and SMILES strings from the original sources and attempts to consolidate them for the ModelSEED database. Because it checks for formula conflicts, it needs the output of the previous script, and it reports structure and formula conflicts in files in the same directory. Its two key output files are ../../Biochemistry/Structures/All_ModelSEED_Structures.txt and ../../Biochemistry/Structures/Unique_ModelSEED_Structures.txt.

The latter file is the structural heart of the database, and our goal is to expand it by curating the conflicts and integrating more structures.

  • Update_Compound_Structures_Formulas_Charge.py

This script takes the output of the previous two scripts and uses it to update the ModelSEED database.

  • Update_Compound_Structures_Formulas_Charge.py

This script takes the pKa files found in ../../Biochemistry/Structures, and uses them to update the ModelSEED database.

  • Report_Redundant_Structures.py
  • Report_Conflicting_Structures.py

These two scripts build tab-separated files of conflicts and redundancies for reporting. Their output isn't necessarily optimal and may not include all possible problems, but it serves as a starting point for loading into spreadsheets for curation by a team.

Redundant structures are structures assigned to more than one compound; those compounds could be merged. Conflicting structures are multiple, different structures assigned to a single compound; these could be disambiguated.
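The two definitions can be sketched as a pair of groupings over (compound, structure) assignments. This is only an illustration, not the logic of Report_Redundant_Structures.py or Report_Conflicting_Structures.py. The function name and the compound IDs (`cpdA`, `cpdB`, `cpdC`) are hypothetical, and the sketch assumes the structure strings have already been canonicalized so that identical structures compare equal as text.

```python
from collections import defaultdict


def find_redundant_and_conflicting(assignments):
    """Group (compound_id, structure) pairs both ways.

    Returns (redundant, conflicting):
      redundant   maps a structure to the compounds that share it
      conflicting maps a compound to the different structures assigned to it
    """
    compounds_by_structure = defaultdict(set)
    structures_by_compound = defaultdict(set)
    for compound_id, structure in assignments:
        compounds_by_structure[structure].add(compound_id)
        structures_by_compound[compound_id].add(structure)

    redundant = {
        structure: sorted(compounds)
        for structure, compounds in compounds_by_structure.items()
        if len(compounds) > 1
    }
    conflicting = {
        compound_id: sorted(structures)
        for compound_id, structures in structures_by_compound.items()
        if len(structures) > 1
    }
    return redundant, conflicting


# Hypothetical assignments using SMILES strings.
example_assignments = [
    ("cpdA", "CCO"),   # cpdA and cpdB share one structure: redundant
    ("cpdB", "CCO"),
    ("cpdC", "CO"),    # cpdC has two different structures: conflicting
    ("cpdC", "CC=O"),
]
redundant, conflicting = find_redundant_and_conflicting(example_assignments)
```

Here `redundant` lists `CCO` with both `cpdA` and `cpdB`, which are candidates for merging, while `conflicting` lists `cpdC` with its two structures, which need disambiguation.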

Addendum

Acyl-carrier proteins (ACPs) are frequently found in metabolic reconstructions and are a key component of some pathways, such as fatty acid biosynthesis. Because the protein chains vary from species to species, there is no single determined structure to use, and the formula and charge of the phosphopantetheine prosthetic group, as well as of the attached fatty acyl chain, are frequently overlooked, which leads to reaction imbalance. In the ACPs_Master_Formula_Charge.txt file we attempt to manually curate the formulas and charges that maintain the mass balance of the fatty acid biosynthetic pathways, and others.
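A minimal sketch of the kind of mass and charge balance check that motivates this curation is shown below. It is not taken from the repository. The function names, compound keys, and formulas are hypothetical toy values, with `R` standing for the unspecified remainder of a carrier; none of them come from ACPs_Master_Formula_Charge.txt.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count element symbols in a flat formula such as 'C2H3ORS'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number or 1)
    return counts


def reaction_imbalance(stoichiometry, compounds):
    """Return (element_imbalance, charge_imbalance) for one reaction.

    stoichiometry: {compound_key: coefficient}, negative for reactants
    compounds:     {compound_key: (formula, charge)}
    A balanced reaction gives ({}, 0).
    """
    elements = Counter()
    charge = 0
    for key, coefficient in stoichiometry.items():
        formula, compound_charge = compounds[key]
        for element, count in parse_formula(formula).items():
            elements[element] += coefficient * count
        charge += coefficient * compound_charge
    return {el: n for el, n in elements.items() if n != 0}, charge


# Toy thioester formation: carrier-SH + acetic acid -> acetyl-S-carrier + water.
toy_compounds = {
    "carrier_SH": ("RHS", 0),
    "acetic_acid": ("C2H4O2", 0),
    "acetyl_carrier": ("C2H3ORS", 0),
    "water": ("H2O", 0),
}
toy_reaction = {
    "carrier_SH": -1,
    "acetic_acid": -1,
    "acetyl_carrier": 1,
    "water": 1,
}
balanced = reaction_imbalance(toy_reaction, toy_compounds)

# Overlooking the thiol hydrogen on the free carrier breaks the balance.
overlooked = dict(toy_compounds, carrier_SH=("RS", 0))
unbalanced = reaction_imbalance(toy_reaction, overlooked)
```

With the thiol hydrogen recorded, the toy reaction balances. Dropping it leaves one excess hydrogen on the product side, which is the kind of error that consistent carrier formulas are meant to prevent.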