Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "ga4gh" namespace identifier registry and instructions/policy to TASC repo #32

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
53 changes: 53 additions & 0 deletions namespace/ga4gh-namespace-identifiers.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
# GA4GH Namespace Identifiers

This document outlines the cross-workstream, GA4GH-wide namespace identifier system. Identifiers under the `ga4gh` namespace are compact uniform resource identifiers (CURIEs) that unambiguously associate the object at the given id with the data model (outlined in a GA4GH specification) it is constructed according to.

## Background

The GA4GH mission entails structuring, connecting, and sharing data reliably. A key component of this effort is to be able to `identify` entities, that is, to associate identifiers with entities. Ideally, there will be exactly one identifier for each entity, and one entity for each identifier. Traditionally, identifiers are assigned to entities, which means that disconnected groups must coordinate on identifier assignment.

The computed identifier scheme proposed in the Variation Representation Specification (VRS) computes identifiers from the data itself. Because identifiers depend on the data, groups that independently generate the same variation will generate the same computed identifier for that entity, thereby obviating centralized identifier systems and enabling identifiers to be used in isolated settings such as clinical labs.

The computed identifier mechanism is broadly applicable and useful to the entire GA4GH ecosystem. Adopting a common identifier scheme will make interoperability of GA4GH entities more obvious to consumers, will enable the entire organization to share common entity definitions (such as sequence identifiers), and will enable all GA4GH products to share tooling that manipulate identified data. In short, it provides an important consistency within the GA4GH ecosystem.
Copy link
Contributor

@andrewyatz andrewyatz Sep 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whilst this has origins within the computed identifier space, is there a reason to limit this to just computed identifiers? Unambiguous is the target right? So long as that can be achieved then no dramas I think

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i do think it would be limited only to computed identifiers.

However, if we were to allow one or more authorities to mint, maintain and manage curated identifiers then we would need some oversight to make sure each prefix was assigned to one and only one authority so as to avoid collissions.

I suppose it would be technically okay to do this, but there would need to be some oversight, guidelines and transparency on assuring certain prefixed namespaces are "owned" by one and only one authority.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Computed identifiers would be my preference too, as curated identifiers already have a route via identifiers.org etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realistically, the way GA4GH-namespaced identifiers are managed is already tightly coupled to the associated specification, so I can see the argument to allow non-computed identifiers in this namespace. However, in the case of entities that have registered identifiers (e.g. DUO codes) it makes more sense to have a dedicated namespace for those. So I think it makes sense to limit the scope to computed identifiers as written here.

Regardless, I agree with @larrybabb that the key message here is for TASC to create and maintain oversight, guidelines, and transparency on identifier prefixes within the GA4GH namespace.


## Background

The following algorithmic processes, described in depth in VRS [Computed Identifiers](https://vrs.ga4gh.org/en/1.0/impl-guide/computed_identifiers.html#computed-identifiers), are included in this proposal by reference:

* **GA4GH Digest Serialization** is the process of converting an object to a canonical binary form based on JSON and inspired by similar (but unratified) JSON standards. This serialization for is used only for the purposes of computing a digest.

* **GA4GH Truncated Digest** is a convention for using [SHA-512](https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf), truncated to 24 bytes, and encoding using [base64url](https://datatracker.ietf.org/doc/html/rfc4648#section-5).

* **GA4GH Identification** is the CURIE-based syntax for constructing a namespaced and typed identifier for an object.

## Type Prefixes
ahwagner marked this conversation as resolved.
Show resolved Hide resolved

![GA4GH type prefixes](./img/type-prefixes.png)

A GA4GH identifier is constructed according to this syntax:

```
"ga4gh" ":" type_prefix "." digest
```

The `digest` is computed as described above. The type_prefix is a short alphanumeric code that corresponds to the type of object being represented.

These are our recommendations for type prefixes:

* Prefixes SHOULD be short, ideally 2-4 characters.
* Prefixes SHOULD be for concrete types, not polymorphic parent classes.
* A prefix MUST map 1:1 with a schema type.

See the [type prefix registry](./type-prefix-registry.md) for the current list of approved type prefixes under the `ga4gh` namespace.

## Type Prefix Submission Process

All GA4GH work streams are encouraged to use existing type prefixes in their specifications wherever possible, and to submit new type prefixes for data models that aren't currently represented in the registry.

To add an entry to the type prefix registry:
* Fork the TASC repository in your own user space
* In your user space, create a new branch from `ga4gh:master`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GA4GH should use a consistent default branch name across all projects. Most projects are migrating to using main rather than master for reasons discussed elsewhere. That is my preference too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would agree with this and certainly we should move what we can do

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I will bring this up at TASC to rename it as main for at least this repo

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jb-adams is there an open GH issue to track this so we can resolve this thread?

* In the new branch, add an additional row in the markdown table in `type-prefix-registry.md` with your proposed type prefix.
Copy link
Member

@reece reece Sep 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, the authoritative table should be code, preferably YAML, and not merely in documentation. That code should probably be in a new central/shared/core repo under TASC's purview, e.g., ga4gh-common. PRs to add prefixes should be in that code. And, the documentation for those should be in the same repo, adjacent to the changes.

When I originally implemented prefixes for VRS, I anticipated separating GA4GH shared components from VRS. For this reason, the VRS repo has a separate file ga4gh.yaml.

So, the most straightforward action would be to move this file to another repo and VRS clients would include that (as a subrepo or with vendoring).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. the new repo would be the authoritative GA4GH namespace registry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also strongly agree.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reece do you have any examples of how the registry would be used in a machine-readable sense? We discussed this at TASC and saw it mainly as human-readable documentation. i.e. developers of data platforms/systems would read through the registry and find GA4GH prefixes related to data models they are using, and then apply our identifier system to their objects.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jb-adams not that I want to speak for @reece or @larrybabb, but I do think the intent here is to be descriptive of the repo contents, as you suggest.

It might make sense to rename this repo to do so, but I think that there are several other TASC-related concerns that are tracked here; do you want to limit the scope of the TASC repo to concern itself only with the ga4gh-common technical product?

I think @reece sees such a repo to eventually serve as a complete replacement for the GA4GH YAML in VRS in addition to other technical artifacts to be coordinated / blessed centrally by GA4GH.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jb-adams : Apologies for missing the questions two weeks ago.

As @ahwagner suggested, please see the VRS repo. The file he pointed to (ga4gh.yaml) is used to build a map of type ⇒ prefix, which is then used during identifier assignment. That is, when an object of type Allele is serialized, the prefix for that identifier is specified in the file ga4gh.yaml. In an ideal world, this mapping would be stored in exactly one place and all code and documentation would be derived from it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jb-adams : IMO, "TASC" is a group and doesn't help understand what this code is about. "config" or "globals" might be a better name. Admittedly, I know of only one file that would go there now.

Despite being a primarily Python guy these days, I do think that files of this sort should be language neutral.

Copy link
Member

@jmarshall jmarshall Jan 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re repository naming: This is trivia. It doesn't matter what repo this prefix registry is in; what matters is that it exists. This repository currently contains information about the TASC committee and the registry of Service Info types, and the PR proposes to add a registry of prefixes. Both of these registries are administered by the TASC committee, and IMHO this ga4gh/TASC repository is a convenient place to gather them.

The place that this information ought to be located in is a GA4GH technical web site. To date, GA4GH as an organisation has not provided a technical web site for the information produced by its working groups. The rumblings are that this is finally on the way to changing. In the meantime, this repository with the PR as proposed is as good a staging area as any; IMHO “keep it simple” applies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re maps of type ⇒ prefix: VRS's ga4gh.yaml file contains entries like Allele: VA. But what does Allele mean?

In the VRS context, it specifically means VRS's Allele object as defined in your schema. That's good and concrete.

However in the (lack of) context of this general registry list, it's just a word. Different groups will inevitably mean slightly different things when they use the word. That's why IMHO this registry's entries need to consist of the reserved prefix (VA), a prose description of the intention (“An allele, encompassing blah blah”), and links to its specific incarnations in individual GA4GH protocols (link to VRS's Allele initially; eventually also links to other groups' protocols' uses of VA).

I guess this is the crux of why I don't see this registry as a machine-readable list. What would the machine-readable entries contain? We are trying to make a list of registered prefixes like VA (and collate their use in GA4GH formats and protocols); we are not trying to define an ontology of terms like Allele. It's great to have a type/prefix map in a concrete context like VRS, but I don't see that it's feasible or useful for TASC to attempt in the general context.

* Create a Pull Request (PR) to the `ga4gh:master` branch
* TASC will review the proposed type prefix and comment and/or request modifications on the PR thread.
* TASC will merge the PR once all comments have been addressed
Binary file added namespace/img/type-prefixes.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
10 changes: 10 additions & 0 deletions namespace/type-prefix-registry.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
| Type Prefix | Class Name | Definition | Example | Reference |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be computationally consumable IMO. I don't have a concern with a textual representation but the primary source should be elsewhere (this might be covered by Reece's comment above)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes machine readable-readable at the minimum, human-readable form may exist in sync with this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jb-adams see GA4GH YAML in VRS for an example of the computable resource we are seeking.

|-------------|------------|------------|---------|-----------|
| SQ | Sequence | ... | `ga4gh:SQ:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html)|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| SQ | Sequence | ... | `ga4gh:SQ:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html)|
| SQ | Sequence | ... | `ga4gh:SQ.…` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html)|

(…and similarly throughout the table)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I preferred three periods to the ellipsis character, but this probably wouldn't cause any issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three periods would obviously be confusing next to a literal period character, but either way the ellipsis is just a placeholder until the examples (if they are retained) are filled in. But the comment is a reminder that the : should be a ..

| VA | Allele | ... | `ga4gh:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) |
| VH | Haplotype | ... | `ga4gh:VH:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) |
| VAB | Abundance | ... | `ga4gh:VAB:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure we changed Abundance to CopyNumber and I think VAB should be VCN (Alex W can verify)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larrybabb is correct. Though in fairness we haven't updated the documentation on VRS latest to reflect this. Made an issue to address this here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I wait until the VRS PR merges before I update it here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please wait for 1.2.1; I will put in PRs this week so that we may merge and release soon.

| VS | VariationSet | ... | `ga4gh:VS:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) |
| VSL | SequenceLocation | ... | `ga4gh:VSL:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) |
| VCL | ChromosomeLocation | ... | `ga4gh:VCL:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) |
| VT | Text | ... | `ga4gh:VT:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) |