Merge branch 'reorganizing_documentation' of https://github.com/ga4gh…
…/cat-vrs into reorganizing_documentation
DanielPuthawala committed Apr 16, 2024
2 parents 1a3a5cd + c151a3b commit f2e3e6a
Showing 3 changed files with 10 additions and 10 deletions.
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -2,7 +2,7 @@ Categorical Variation Representation Specification
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


The Categorical Variation Representation Specification (Cat-VRS, pronounced "cat verse") is a specification developped by the Global Alliance for Genomics and Health (GA4GH) to provide a standard for the representation of categorical variant concepts in genomics knowledgebases, and improve genomic knowledge search, curation, and harmonization. The specification consists of a JSON Schema for representing classes of categorical variation, conventions to maximize the utility of the schema, and a python implementation that promotes adoption of the standard.
The Categorical Variation Representation Specification (Cat-VRS, pronounced "cat verse") is a specification developed by the Global Alliance for Genomics and Health (GA4GH) to provide a standard for the representation of categorical variant concepts in genomics knowledgebases, and to improve genomic knowledge search, curation, and harmonization. The specification consists of a JSON Schema for representing classes of categorical variation, conventions to maximize the utility of the schema, and a Python implementation that promotes adoption of the standard.



14 changes: 7 additions & 7 deletions docs/source/introduction.rst
@@ -48,16 +48,16 @@ Challenges to Unifying the Representation of Categorical Variants
.. CatVars are hard to pin down
.. Why they arise
Categorical variants arise organically and continuously in the course of genomics research. When clinical studies are run and journal papers published, the results are typically not charactorized in terms of an exhaustive list of assayed variants to which the conclusions apply. Rather, the domain of the conclusions are currently characterized in terms of a chategorical variant, all of the individual assayed variants that fall into the same biological bucket. Like all scientific abstractions, these models have several useful properties. They describe insightful conclusions related to the biological events that underly a function common to a class of variants. They also make useful predictions, namely that the same conclusions should apply to variants that weren't explicitly tested but ought to function in a similar way to those explicitly tested. They thus allow us to generalize genomic knowledge.
Categorical variants arise organically and continuously in the course of genomics research. When clinical studies are run and journal papers published, the results are typically not characterized in terms of an exhaustive list of assayed variants to which the conclusions apply. Rather, the domain of the conclusions is currently characterized in terms of a categorical variant: all of the individual assayed variants that fall into the same biological bucket. Like all scientific abstractions, these models have several useful properties. They describe insightful conclusions related to the biological events that underlie a function common to a class of variants. They also make useful predictions, namely that the same conclusions should apply to variants that weren't explicitly tested but ought to function in a similar way to those explicitly tested. They thus allow us to generalize genomic knowledge.

To return to the running example, the BRAF V600E categorical variant inlcudes as its members any of 2 single-nucleotide substitutions and 6 double-nucleotide substitions that convert a Valine codon into one coding for Glutamic acid. The Valine to Glutamic Acid amino acid substitution variant is also a member of that set. Any other variant or series of variants that would have the net effect of substituting glutamic acid for valine in the same location of the resulting polypeptide chain is also a member of the same categorical variant.
To return to the running example, the BRAF V600E categorical variant includes as its members any of 2 single-nucleotide substitutions and 6 double-nucleotide substitutions that convert a Valine codon into one coding for Glutamic acid. The Valine to Glutamic Acid amino acid substitution variant is also a member of that set. Any other variant or series of variants that would have the net effect of substituting Glutamic acid for Valine in the same location of the resulting polypeptide chain is also a member of the same categorical variant.
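
The counts quoted above (2 single-nucleotide and 6 double-nucleotide substitutions) can be checked by enumerating every Valine codon against every Glutamic acid codon in the standard genetic code. The following is a minimal sketch for illustration only; the function names are hypothetical and are not part of Cat-VRS or its Python implementation.

```python
from itertools import product

# Standard genetic code: codons for Valine (V) and Glutamic acid (E).
VALINE_CODONS = ["GTT", "GTC", "GTA", "GTG"]
GLUTAMIC_ACID_CODONS = ["GAA", "GAG"]


def count_differences(codon_a: str, codon_b: str) -> int:
    """Number of positions at which two equal-length codons differ."""
    return sum(a != b for a, b in zip(codon_a, codon_b))


def tally_val_to_glu() -> dict[int, list[tuple[str, str]]]:
    """Group every Valine -> Glutamic acid codon change by number of bases changed."""
    tally: dict[int, list[tuple[str, str]]] = {}
    for val, glu in product(VALINE_CODONS, GLUTAMIC_ACID_CODONS):
        tally.setdefault(count_differences(val, glu), []).append((val, glu))
    return tally


tally = tally_val_to_glu()
for n_changed, pairs in sorted(tally.items()):
    print(f"{n_changed}-nucleotide substitutions ({len(pairs)}): {pairs}")
```

Run as written, this groups the eight codon pairs into two single-nucleotide changes (GTA to GAA and GTG to GAG) and six double-nucleotide changes, matching the figures in the paragraph above.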




.. CatVars have complicated relationships with each other
While a single categorical variant may have many assayed variant members, the same is true in the other direction. A single assayed variant is a member of many possible categorical variants simultaneously. While NC_000007.13:g.140453136A>T is a member of the BRAF V600E categorical variant, it is also a Change-of-function variant, a protein missense variant, and a chromosome 7 variant, among other categorical variants.
While a single categorical variant may have many assayed variant members, the same is true in the other direction. A single assayed variant is a member of many possible categorical variants simultaneously. While NC_000007.13:g.140453136A>T is a member of the BRAF V600E categorical variant, it is also a Change-of-Function variant, a protein missense variant, and a chromosome 7 variant, among other categorical variants.


.. image:: images/relations-between-assayed-and-CatVars-and-CatVars-to-other-CatVars.png
@@ -78,7 +78,7 @@ Because a single categorical variant may have many assayed variants as members,

.. CatVar labels do not always denote the same thing across different KBs, and may even be redundant-specified
To make categoricla variant matching even more complicated, it is often the case that identical labels across different resiuorces in fact describe different categroical variants, as seen in the figure below where an ACT sequence has been inserted directly 3' of a ACTG sequence. While this would not be considered a duplication variant in the HGVS nomenclature due to the intervening G base pair, it could appear in other resources as a duplication of the preceeding ACT sequence. This implies that the catgorical variant descriptor "duplication" has different meanings across different resources.
To make categorical variant matching even more complicated, it is often the case that identical labels across different resources in fact describe different categorical variants, as seen in the figure below where an ACT sequence has been inserted directly 3' of an ACTG sequence. While this would not be considered a duplication variant in the HGVS nomenclature due to the intervening G base pair, it could appear in other resources as a duplication of the preceding ACT sequence. This implies that the categorical variant descriptor "duplication" has different meanings across different resources.


.. image:: images/CatVar-CatVar-matching.png
@@ -87,21 +87,21 @@ To make categoricla variant matching even more complicated, it is often the case
:alt: The figure depicts a hypothetical variant where an ACT sequence has been inserted directly 3' of an ACTG sequence. While this would not be considered a duplication variant in the HGVS nomenclature due to the intervening G base pair, it could appear in other resources as a duplication of the preceding ACT sequence, or alternately simply as an insertion of ACT. This implies that the categorical variant descriptor "duplication" has different meanings across different resources.
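
The disagreement over the label "duplication" can be made concrete with two toy classification rules. This is a sketch under stated assumptions: the strict rule only approximates the HGVS requirement that the copy sit directly 3' of the original sequence (it ignores other HGVS rules such as 3' normalization), and the loose rule with its ``max_gap`` parameter is a hypothetical stand-in for how some other resource might label the same event.

```python
def is_strict_duplication(reference: str, position: int, inserted: str) -> bool:
    """True if `inserted` exactly copies the bases immediately 5' of `position`."""
    start = position - len(inserted)
    return start >= 0 and reference[start:position] == inserted


def is_loose_duplication(reference: str, position: int, inserted: str, max_gap: int = 1) -> bool:
    """Hypothetical looser rule: allow up to `max_gap` intervening bases
    between the original copy and the insertion point."""
    for gap in range(max_gap + 1):
        end = position - gap
        start = end - len(inserted)
        if start >= 0 and reference[start:end] == inserted:
            return True
    return False


reference = "ACTG"
position = len(reference)  # insertion point directly 3' of the G
inserted = "ACT"

# The bases immediately 5' of the insertion point are "CTG", not "ACT".
print(is_strict_duplication(reference, position, inserted))
# With one intervening base tolerated, "ACT" is found just upstream of the G.
print(is_loose_duplication(reference, position, inserted, max_gap=1))
```

The two rules disagree on the same event, which is exactly the situation in which one label ends up naming two different categorical variants.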


On the other hand, it is also often the case that spurious ambiguity exists within resources. The figure depicts a hypothetical case where compared to a reference sequence ACT, the variant sequence is ACCCCCT. In HVGS, this variant could either validly be described as an insertion of 4 C nucleotides, or else a five repetitions of the single nucleotide sequence C. This demonstrates spurious ambiguity of categorical variant descriptors, as both categorical variants desribe two sets with all and only the same member variants.
On the other hand, it is also often the case that spurious ambiguity exists within resources. The figure depicts a hypothetical case where, compared to a reference sequence ACT, the variant sequence is ACCCCCT. In HGVS, this variant could validly be described either as an insertion of 4 C nucleotides or as five repetitions of the single-nucleotide sequence C. This demonstrates spurious ambiguity of categorical variant descriptors, as both descriptors denote sets with all and only the same member variants.


.. image:: images/CatVar-CatVar-spurious-ambiguity.png
:width: 40%
:align: center
:alt: The figure depicts a hypothetical case where compared to a reference sequence ACT, the variant sequence is ACCCCCT. In HVGS, this variant could either be described as an insertion of 4 C nucleotides, or else a five repetitions of the single nucleotide sequence C. This demonstrates spurious ambiguity of categorical variant descriptors, as both categorical variants desribe two sets with all and only the same member variants.
:alt: The figure depicts a hypothetical case where, compared to a reference sequence ACT, the variant sequence is ACCCCCT. In HGVS, this variant could be described either as an insertion of 4 C nucleotides or as five repetitions of the single-nucleotide sequence C. This demonstrates spurious ambiguity of categorical variant descriptors, as both descriptors denote sets with all and only the same member variants.
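
The spurious ambiguity in the ACT / ACCCCCT case can be illustrated by expanding both descriptions into sequence and comparing the results. The helpers below are hypothetical and use 0-based string offsets purely for illustration; they do not parse or emit HGVS expressions.

```python
def apply_insertion(reference: str, position: int, inserted: str) -> str:
    """Insert `inserted` into `reference` before the base at 0-based `position`."""
    return reference[:position] + inserted + reference[position:]


def expand_repeat(prefix: str, unit: str, count: int, suffix: str) -> str:
    """Build a sequence as prefix + `count` copies of `unit` + suffix."""
    return prefix + unit * count + suffix


reference = "ACT"

# Description 1: an insertion of four C nucleotides next to the existing C.
as_insertion = apply_insertion(reference, 2, "CCCC")
# Description 2: five repetitions of the single-nucleotide unit C.
as_repeat = expand_repeat("A", "C", 5, "T")

print(as_insertion, as_repeat, as_insertion == as_repeat)
```

Both descriptions expand to the same sequence, so a matcher that compares labels rather than the resulting sequence would treat one variant as two.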



Discussion
@@@@@@@@@@


In summary, a crucial step in the course of genomic variant interpretation is assayed-categorical variant matching, where one determines all and only those categorical variants to whoch the assayed variant in question is a member. Successful assayed-categorical variant matching makes it possible to connect evidence to support or refute determinations of pathogenicity and/or oncogenicity of the assayed variants. In a different but related use case, categorical-categorical variant matching is crucial to the process of data harmonization and knowledgebase curation.
In summary, a crucial step in the course of genomic variant interpretation is assayed-categorical variant matching, where one determines all and only those categorical variants of which the assayed variant in question is a member. Successful assayed-categorical variant matching makes it possible to connect evidence to support or refute determinations of pathogenicity and/or oncogenicity of the assayed variants. In a different but related use case, categorical-categorical variant matching is crucial to the process of data harmonization and knowledgebase curation.



4 changes: 2 additions & 2 deletions docs/source/terms_and_model.rst
@@ -11,7 +11,7 @@ correctly reflecting uncertainty of our understanding at the
time. Unfortunately, such terms are not readily translatable into an
unambiguous representation of knowledge.

As discussed in the :ref:'Introduction', categorical variation labels are homophonous, ambiguous, and vague, often all three simultanously. This poses a great difficulty to the precise repreentation of categorical variation. In contrast, **the computational representation of categorical variation concepts requires
As discussed in the :ref:`Introduction`, categorical variation labels are homophonous, ambiguous, and vague, often all three simultaneously. This poses a great difficulty to the precise representation of categorical variation. In contrast, **the computational representation of categorical variation concepts requires
translating precise categorical definitions into information models and
data structures that may be used in software.** This translation
should result in a representation of information that is consistent
@@ -25,7 +25,7 @@ Accordingly, for each term we define below, we begin by describing the
term as used by the genetics and/or bioinformatics communities as
available. When a term has multiple such definitions, we
explicitly choose one of them for the purposes of computational
modelling. We then define the **computational definition** that
modeling. We then define the **computational definition** that
reformulates the community definition in terms of information content.
Finally, we translate each of these computational definitions into precise
specifications for the **information model**.