Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encourage the use of dedicated fields rather than semantic overloading #27

Open
draeger opened this issue Sep 15, 2020 · 12 comments
Open

Comments

@draeger
Copy link
Collaborator

draeger commented Sep 15, 2020

Description of the issue:

Traditionally, identifiers have been used to encode various types of information, e.g., a metabolite's compartment. While this approach might lead to recognizable displays, it causes several problems in information storage and is an avoidable source of error and ongoing debates. It should be noted that display names and scientific names can alternatively be encoded in different ways.

Expected feature/value/output:

Working with models would become easier if identifiers could be treated as what they are: unique String keys to identify particular objects within a model. We should avoid expecting anything else from them. Using them as display-names or for any further semantic information often leads to misunderstanding and complicated parsing routines. Trying to validate their assumed structure can lead to incompatibilities to models from scientists outside of the COBRA community.

Current feature/value/output:

Let's use only dedicated fields to store the semantics of a model. For instance, the compartmental localization of a metabolite can unambiguously be stored using the species' compartment attribute, so that we don't need to rely on any suffix of an identifier. If a display name is needed as an abbreviation, it can, for instance, be stored using the new key-value pair structure in the upcoming FBC version 3 package. To avoid redundancy, one could, for example, keep an abbreviation for a metabolite (e.g., glc) and algorithmically append _c, [c], or any other representation for the compartment depending on user preferences.

Reproducing these results:

Many sources of error could be eliminated. Identifier parsing is cumbersome and error-prone. It should be avoided. It is complicated and compromises the reproducibility and interoperability of models.

@JonathanRob
Copy link
Member

Very good points, @draeger. To be sure that I understand correctly, are you suggesting that metabolite identifiers in a model should not contain any compartment information? If so, I suspect this would create problems for software/formats that expect or require that the metabolite identifiers be unique.

Or are you suggesting that the metabolite identifiers, which may contain a compartment abbreviation, should not be used to extract such information?

@Midnighter
Copy link
Collaborator

I think it's a good idea to propose a change like this for a forward looking recommendation like standard-GEM. Even though it will make it even harder for tools to read since there will be thousands of "old-style" models and a few new ones 😝

@draeger
Copy link
Collaborator Author

draeger commented Sep 15, 2020

@JonathanRob, an ID can, of course, contain any prefix, infix, or suffix that is commonly allowable. If it is an abbreviation for a compartment, that is fine. However, a software should not rely on a specific pattern for an identifier. If a metabolite has the ID _452oijrkjna095486foherl it is as valid as M_glc_6_p_c. The ID is not a display name. If a short, human-readable display name is needed, it can either be generated from the information stored in the object or stored as name or within the key-value pair list.

@JonathanRob
Copy link
Member

Ok, yes, I absolutely agree.

@edkerk
Copy link
Collaborator

edkerk commented Sep 15, 2020

@draeger Very good point. Besides the case of compartment information in metabolite IDs, are there other cases that you are aware of?

@Midnighter Reading should be fine, use of compartment field is part of SBML L3V1 + FBCv2 (maybe even earlier?) and at least COBRA toolbox, cobrapy and RAVEN can deal with this. Once loaded, however, COBRA still appends the compartment information to metabolite ID instead of having a dedicated compartment field, but there has been some progress on including a metComps fields: opencobra/cobratoolbox#1396. When that will be released, COBRA can import->export models without changing metabolite IDs whatsoever. I don't know what is the status of cobrapy.

@Midnighter
Copy link
Collaborator

I think @matthiaskoenig would have some ideas on this topic, too.

@matthiaskoenig
Copy link

Not sure what to add here.

  • It is clear how to do correct SBML models (see SBML specification and validation).
  • Abusing identifiers to store additional information is just incorrect (nothing to discuss here).
  • If tools/services use identifier overloading they should stop and use the correct mechanism to store the information. The community should discourage/stop using these tools and protest whenever they see such problems.
  • identifier changing should be a crime! Everybody is relying in their tool chains on the SBML identifiers. Adding/removing prefixes to identifiers will break so many workflows. Please stop doing this !!! (e.g., adding/removing compartments to identifiers programatically (please don't); adding/removing "G_" or other prefixes so that identifiers are valid SBMLIdentifiers. Please don't.

@Midnighter
Copy link
Collaborator

I think it's worth talking about how to best organize "display names" or "symbols" for model elements. I know that the interactivity offered by using most of the BiGG identifiers in cobrapy interactively is a major selling point for it. (Basically, being able to call methods on an element such as model.metabolites.atp_c is super convenient for exploration.)

  • In Python, this means the display name needs to be a valid variable name. I think this would be similar for most languages.
  • Where is this information best stored? You mention the key-value store but we have to come up with an agreed key fast then, to avoid everyone using a different key. It also does not enforce that the value is compatible. Or maybe it should be another SBML SBase attribute so this is more formalized?

@draeger
Copy link
Collaborator Author

draeger commented Sep 16, 2020

We have three different things that people seem to like to represent with identifiers or names:

  1. An actual identifier for specific objects, e.g., metabolites,
  2. An abbreviated display name that reflects a summarized version of key attribute values,
  3. In contrast, a scientific name that is typically relatively long.

Now, the identifier could be an arbitrary String, e.g., e83oiha24vuiase2 or so. The abbreviated display name (e.g., M_glc_c) could be stored in at least two different ways in SBML:

  1. Use the name attribute that all objects in SBML have now. When doing so, the scientific name has to go somewhere else. As it turns out, there is often more than one scientific name. Let's stick to the example with metabolites. They usually have a lot of synonyms. If MIRIAM annotations are used, any scientific name can be retrieved on the fly from linked resources, such as KEGG, ChEBI, or BiGG via REST APIs or similar mechanisms. Alternatively, one could store the scientific name in a key-value pair field.
  2. If using the new key-value pair list from FBC version 3 for encoding the abbreviated name, then the preferred scientific name can be stored in the name attribute.

In any case, I'd recommend only keeping a very fundamental abbreviation and algorithmically appending any information, such as the compartment code, because if the model is changed, that information is likely to get out of sync. If only an abbreviation, such as glc for D-glucose, is stored in the key-value list, all other information can be obtained from the object, e.g., the compartment.

As @matthiaskoenig points out, if a software internally stores multiple different kinds of information in one variable at once, this is a tool problem and should be solved by adding additional variables to that software (which is much easier to do than complicated and highly error-prone String processing).

To answer @edkerk's question: There can be many different things encoded in some identifiers, such as compartment, tissue, reaction types, etc.. Here is an overview: https://github.com/SBRG/bigg_models/wiki/BiGG-Models-ID-Specification-and-Guidelines.

@matthiaskoenig
Copy link

matthiaskoenig commented Sep 22, 2020

I know that the interactivity offered by using most of the BiGG identifiers in cobrapy interactively is a major selling point for it.

@Midnighter Yes, it would be a selling point if it would work. With the current Bigg identifiers this does not work consistently (see opencobra/cobrapy#828). With valid SBML identifiers code completion just works, you just use model.metabolites.M_3pg_c. Another reason not to add/remove prefixes. If you use SBML SIds as they are without any changes they work as variable names and in autocompletion, e.g. in ipython.

SBML SIds are valid variable names! This is one of the big selling points for them.

Storing the additional information is not too complicated. One can use the name attribute for a long display name and store an additional label (short-name) for instance by using key-value pairs.

@Midnighter
Copy link
Collaborator

Storing the additional information is not too complicated. One can use the name attribute for a long display name and store an additional label (short-name) for instance by using key-value pairs.

I strongly suggest that if standard-GEM recommends using key-value pairs to also recommend one specific key. I don't care what it's called but better to settle on one key early on.

@matthiaskoenig
Copy link

@Midnighter The complete key-value pair feature is already in the open pull request for cobrapy from GSOC (https://github.com/Hemant27031999/cobrapy/blob/metadata_fbc3_group/src/cobra/core/metadata/keyvaluepairs.py)
Importantly the Key-Value-Pairs have a URI attribute which solves exactly the issue of different keys everywhere. I.e. the URI allows to link to a set of defined keys with explanation. The fbc community just has to decide which keys to use and what they exactly mean, then we can define the URI for the keys and the meaning is unique throughout the community.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

5 participants