GA4GH Namespace policy #16

Open
ahwagner opened this issue Apr 28, 2020 · 39 comments · May be fixed by #37
Labels
CRITICAL Decision needs to be made soon

Comments

@ahwagner
Member

ahwagner commented Apr 28, 2020

Problem Statement

The VRS standard submitted a proposal to the GA4GH Steering Committee to approve the use of the ga4gh namespace for VRS computed identifier CURIEs. This was approved by the Steering Committee, with the understanding that a governance mechanism would be established for object identifiers under this namespace once an appropriate technical committee was established (TASC).

VRS is preparing to release new data type prefixes under the ga4gh namespace as part of its upcoming minor version releases. We intend to move forward with these as scheduled until a formal mechanism for registering / requesting identifier prefixes is created.

Impact of alignment between standards

Due to the organization-level namespacing, any CURIEs generated under this namespace should have a consistent meaning across products. If two standards (e.g. refget and VRS) develop identifiers with similar prefixes or inconsistent structures under the ga4gh namespace, interoperability between resources implementing multiple GA4GH products expecting ga4gh CURIEs will fail.

Background research and landscape analysis

We have worked to create an identifier scheme that has been reviewed and approved by unanimous vote of the GA4GH Steering Committee.

We have established several type prefixes already, with more that will be generated over the next year as we iteratively expand VRS.

This has direct relevance to the DRS, VRS, and refget standards, and likely others down the road.

Proposed solution

The VRS Project Maintainers are granted authority to build additional type prefixes under the ga4gh namespace without external review, so long as those identifiers begin with V, (e.g. VS for VariationSet, VA for Allele, VH for Haplotype). Data type prefixes outside of this scope must be reviewed and approved by TASC before implementation. Similar project-level subdomains could be approved and allocated by TASC for other standards as needed.

In addition, a TASC-maintained public registry of type prefixes (approved, reserved, and pending) should be provided for quick reference by product developers and implementers.
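
As a rough illustration of the proposed rule, the sketch below models a registry of type prefixes together with the delegation check. The status values mirror the approved / reserved / pending states named above; the `PrefixRegistry` class, its method names, and the `XX` sample entry are hypothetical and not part of any existing GA4GH tooling.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    APPROVED = "approved"
    RESERVED = "reserved"
    PENDING = "pending"


@dataclass
class PrefixRegistry:
    """Hypothetical in-memory registry of type prefixes under the ga4gh namespace."""

    entries: dict = field(default_factory=dict)  # prefix -> (entity, Status)

    def request(self, prefix: str, entity: str, requester: str) -> Status:
        # A type prefix can be registered only once; overloaded use is rejected.
        if prefix in self.entries:
            raise ValueError(f"type prefix {prefix!r} is already registered")
        # Proposed rule: VRS maintainers may self-approve prefixes starting
        # with "V"; every other request waits for TASC review.
        if requester == "VRS" and prefix.startswith("V"):
            status = Status.APPROVED
        else:
            status = Status.PENDING
        self.entries[prefix] = (entity, status)
        return status

    def approve(self, prefix: str) -> None:
        # Stands in for a TASC decision on a pending request.
        entity, _ = self.entries[prefix]
        self.entries[prefix] = (entity, Status.APPROVED)


registry = PrefixRegistry()
print(registry.request("VA", "Allele", requester="VRS").value)
print(registry.request("XX", "ExampleEntity", requester="other-product").value)
```

Under these assumptions the first request is approved immediately and the second stays pending until `approve("XX")` is called.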

@rrfreimuth
Collaborator

There are several items underlying this proposal that TASC must address:

  1. Determine the scope of prefixes within the ga4gh namespace. Specifically, should a prefix be established independently by a GA4GH product (irrespective of similar prefixes or entities used in other products) or should prefixes describe entities that may be used across products?
    • If two products reference a common object (e.g., sequence), should both products use the same prefix for that object?
    • If the same prefix is used by two products to describe a common entity (e.g., sequence), are the products required to use the same structure and semantics for that entity?
  2. Determine whether naming conventions should be established for prefixes.
    • It is natural to choose prefixes that contain some semantic meaning, as they are often abbreviations of the entities they describe (e.g., VRS uses SQ for sequence).
    • Should there be ga4gh-wide conventions for prefix length or substructure (e.g., OID-like subdomains using the first letter to indicate product or work space)?
  3. Establish governance process for approval of prefixes.
    • Should prefixes be approved by a project team, sponsoring work space, TASC, and/or Steering Committee?
    • Should prefixes be assigned on a “first come, first served” basis, or chosen in an open process that includes potential future users of the prefix?

I will add my own thoughts to these issues when TASC brings the issue Under Discussion.

@mamanambiya mamanambiya added the CRITICAL Decision needs to be made soon label Apr 30, 2020
@rrfreimuth
Collaborator

Since this issue is on the agenda for Monday, I'm adding some thoughts here. Numbers correspond to the points I listed earlier in this thread.

  1. I think prefixes should be scoped beyond a single product, and should instead roughly reflect the type of entity they name. Products and organizational structures (e.g., work spaces) change over time, whereas prefixes and the entities they describe should be constant. This means if two products refer to the same type of entity (i.e., implement the same object), they should use the same prefix. A corollary is that each prefix should be associated with a defined entity type (object) so its use is consistent across products.

  2. I think we need to establish some minimal conventions, and we should try to avoid embedding too much meaning in what are essentially business identifiers.

  3. I think we need a short process to assign prefixes. As an example:
    a. Sponsoring team submits to TASC a request, which includes the proposed prefix and its corresponding entity. The latter includes a description and a reference to the object in a spec or GA4GH object repository (e.g., SchemaBlocks). This could require no more than a few sentences of description and cross-references.
    b. TASC reviews the proposal, checking for overlap with any existing or similar elements
    c. TASC announces the proposal to all project teams, which have 1 week to declare interest; non-response is taken to mean assent
    d. Discussion among project team representatives, as needed, to achieve consensus
    e. TASC reviews and approves the final proposal

@jaeddy
Member

jaeddy commented May 11, 2020

Do we need conventions/governance beyond "register the proposed namespace in identifiers.org"? It seems like that would handle most use cases in terms of exposing existing, related products and avoiding duplication in prefixes.

Where TASC could add value is by aggregating namespaces in identifiers.org that are specific to or being used by GA4GH — just to have a more constrained search scope. Something similar to this for ServiceInfo could be sufficient.

@jaeddy
Member

jaeddy commented May 11, 2020

In terms of structure for a registry/catalog, here's an example for an identifiers.org namespace object (as returned by the registry API) (cc @rishidev):

    "namespaces" : [ {
      "prefix" : "3dmet",
      "mirId" : "MIR:00000066",
      "name" : "3DMET",
      "pattern" : "^B\\d{5}$",
      "description" : "3DMET is a database collecting three-dimensional structures of natural metabolites.",
      "created" : "2019-06-11T14:15:50.652+0000",
      "modified" : "2019-06-11T14:15:50.652+0000",
      "deprecated" : false,
      "deprecationDate" : null,
      "sampleId" : "B00162",
      "namespaceEmbeddedInLui" : false,
      "_links" : {
        "self" : {
          "href" : "https://registry.api.identifiers.org/restApi/namespaces/230"
        },
        "namespace" : {
          "href" : "https://registry.api.identifiers.org/restApi/namespaces/230"
        },
        "contactPerson" : {
          "href" : "https://registry.api.identifiers.org/restApi/namespaces/230/contactPerson"
        }
      }
    },
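
To show how the `pattern` and `sampleId` fields in a record like this relate, here is a short Python sketch that checks a local identifier against the registered regular expression. The field values are copied from the 3dmet record above; the helper name `is_valid_local_id` is illustrative only.

```python
import re

# Values copied from the 3dmet namespace record above.
namespace = {"prefix": "3dmet", "pattern": r"^B\d{5}$", "sampleId": "B00162"}


def is_valid_local_id(local_id: str, pattern: str) -> bool:
    """Return True if local_id matches the namespace's registered pattern."""
    return re.fullmatch(pattern, local_id) is not None


print(is_valid_local_id(namespace["sampleId"], namespace["pattern"]))  # True
print(is_valid_local_id("B162", namespace["pattern"]))                 # False
```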

@ahwagner
Member Author

I agree with registering at identifiers.org. This was a recommended registry in the above mentioned presentation that was reviewed and approved by the Steering Committee. However, I think that I'm not understanding how registering at identifiers.org will resolve the issue presented here.

To clarify the problem statement, ga4gh is the prefix that would be registered, and the type prefixes (e.g. VA) should not be confused with CURIE prefixes (e.g. ga4gh). You can read more about rationale for why we want a unified namespace / CURIE prefix in the above-linked documents.

@mamanambiya
Collaborator

I agree with registering at identifiers.org. This was a recommended registry in the above mentioned presentation that was reviewed and approved by the Steering Committee. However, I think that I'm not understanding how registering at identifiers.org will resolve the issue presented here.

To clarify the problem statement, ga4gh is the prefix that would be registered, and the type prefixes (e.g. VA) should not be confused with CURIE prefixes (e.g. ga4gh). You can read more about rationale for why we want a unified namespace / CURIE prefix in the above-linked documents.

So @ahwagner, in ga4gh:VA.56789zyxwv for example, is the prefix VA simply referring to the type of entity or rather to a product? In other words, is the prefix agnostic of a product or WS? So that an existing prefix could be used by different groups/teams when proposing identifiers of the same type?

@ahwagner
Member Author

ahwagner commented May 25, 2020

So @ahwagner in ga4gh:VA.56789zyxwv for example, is the prefix VA simply referring to the type of entity or rather to a product? In other words is the prefix agnostic of a product or WS?

Yes. For example, the VA refers to an identifier describing a VRS Allele. VRS also uses the SQ type prefix for Sequence identifiers, which may also be generated by another group, such as the refget standard. @andrewyatz may have more to say on the plans for using these type prefixes in refget.

@ahwagner
Member Author

ahwagner commented May 25, 2020

So that an existing prefix could be used by different groups/teams when proposing identifiers of the same type?

The idea would be that groups could reuse identifiers–but I think that it is important to clarify that once a type prefix is assigned, new groups should not be proposing / TASC should not be approving overloaded use of that prefix.

@mellybelly

Please ensure that multiple designated resolvers are utilized. I recommend N2T as the second one (though there are a number of options). It would also be terrific if all prefixes could be registered in prefix commons. https://github.com/prefixcommons

@mellybelly

@cmungall @jmcmurry @deepakunni3 might have suggestions too

@rrfreimuth
Collaborator

It appears there may be some confusion/conflation of the various topics in this issue. I apologize for not catching that earlier and providing clarification. To recap:

The CURIE syntax that was previously approved is of the following format: namespace:type.id, where namespace (aka CURIE prefix) is "ga4gh", type is an abbreviation (aka "type prefix" or "prefix"), and id is the identifier. The type prefix is intended to give the consumer an idea of what type of entity the id references.

The VR Spec has defined several type prefixes already, including SQ for "sequence", VA for "allele", and VT for "text" (meaning variation represented as unstructured text).
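
For readers who prefer code, the namespace:type.id layout can be sketched as a small parser. This is a minimal sketch that only splits the string; the `Ga4ghCurie` and `parse_curie` names are illustrative, and no validation of the type prefix or id is attempted.

```python
from typing import NamedTuple


class Ga4ghCurie(NamedTuple):
    namespace: str    # CURIE prefix, e.g. "ga4gh"
    type_prefix: str  # e.g. "SQ", "VA", "VT"
    identifier: str   # the id itself


def parse_curie(curie: str) -> Ga4ghCurie:
    """Split 'namespace:type.id' into its three parts."""
    namespace, colon, reference = curie.partition(":")
    if not colon or not namespace:
        raise ValueError(f"missing namespace in {curie!r}")
    type_prefix, dot, identifier = reference.partition(".")
    if not dot or not type_prefix or not identifier:
        raise ValueError(f"expected 'type.id' after the colon in {curie!r}")
    return Ga4ghCurie(namespace, type_prefix, identifier)


print(parse_curie("ga4gh:VA.56789zyxwv"))
# Ga4ghCurie(namespace='ga4gh', type_prefix='VA', identifier='56789zyxwv')
```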

The proposal by the VRS team is reposted here for convenience:

The VRS Project Maintainers are granted authority to build additional type prefixes under the ga4gh namespace without external review, so long as those identifiers begin with V, (e.g. VS for VariationSet, VA for Allele, VH for Haplotype). Data type prefixes outside of this scope must be reviewed and approved by TASC before implementation. Similar project-level subdomains could be approved and allocated by TASC for other standards as needed.

In addition, a TASC-maintained public registry of type prefixes (approved, reserved, and pending) should be provided for quick reference by product developers and implementers.

Since the type prefixes are unique within the ga4gh namespace, any project that defines a type prefix (e.g., VS, VA, VH, SQ, VT) is reserving that prefix across the entire organization. The reuse of type prefixes should be encouraged, but if two projects use the same prefix for different types of entities, confusion will abound.

In my post earlier in this thread I provided 3 opinions. I will summarize all 3 here and add a fourth that was mentioned by others.

  1. Type prefixes (e.g., VA, SQ) should represent entities and their naming should be independent of a particular product; the former are constant, the latter change. Prefixes must be defined well so they can be re-used appropriately across products.

  2. Naming conventions are fine, but we should avoid embedding too much meaning in business identifiers.

  3. TASC needs a short process to review and approve proposed type prefixes. A suggestion for this is in my previous post.

  4. All GA4GH-approved type prefixes should be registered at identifiers.org and/or similar registries.

@jaeddy
Member

jaeddy commented Jun 8, 2020

I think I'm more or less on board with the conventions that @rrfreimuth and @ahwagner are proposing above, with the caveat that I don't think ga4gh is a logical/meaningful namespace (or prefix). I would argue that 'GA4GH' (an SDO) as a namespace is analogous to iso: or w3c:, etc. (which, to my knowledge, are not used in most contexts).

The actual namespace/standard in question here is the Variant Representation Specification (VRS). Thus, I would argue that the prefix should be vrs: OR for the sake of uniqueness (and also compliant with identifiers.org conventions), ga4gh.vrs:.

The end result being something along the lines of...

"sequence_id": "ga4gh.vrs:SQ._0wi-qoDrvram155UmcSC-zA5ZK4fpLT"

There probably isn't a huge value-add with the ga4gh.<standard> syntax in terms of avoiding collisions (i.e., there probably won't be a ga4gh.wes:SQ... or ga4gh.duo:SQ), but it definitely provides greater transparency and clarifies provenance.

To @rrfreimuth's point (4): the above still wouldn't establish a convention for registering type prefixes, but more so GA4GH standard prefixes. I think, to some extent, this is where the work @jb-adams is doing on the service-registry API/portal comes into play... but probably still needs some more thought.

@jaeddy
Member

jaeddy commented Jun 8, 2020

Examples for reference:

image

@andrewyatz
Contributor

My push back against the THING.THING syntax is based on the fact that the ga4gh: prefix was agreed at SC, is now part of VRS, and will be part of the refget spec. For refget, any subsequent change would be the 3rd change in identifiers.

Also, just to bring up that example you've got involving Ensembl (close to my heart): here's an example of where namespaces have been developed so that each points to a single website; however, that tightly binds namespaces to an implementation. As Ensembl changes its implementation (one website vs. 6) or introduces new resources (e.g. the recent covid-19 browser), new namespaces are created. So it suggests that these differences are now encoded in external identifiers rather than delegating this all to Ensembl resolvers to sort out.

@cdvoisin
Collaborator

cdvoisin commented Jun 8, 2020

In the solution statement, it would be helpful to mention that this namespace solution is for CURIE-related identifiers, and may not be applicable for other types of namespaces. My example is in the Passport v1 spec, there is a "ga4gh" namespace created for JWT claim names of "ga4gh_passport_v1" to avoid collisions with non-ga4gh claims. However, this is:

  1. a name field namespace, not an identifier value itself;
  2. the field name is not a CURIE, and has its own rules on what format is allowed, what should be avoided, and how long it can be.

So any wording in the final solution text that scopes the approach to CURIEs, in a similar way to what the problem statement does, would help clarify the applicability of the solution.

@mamanambiya mamanambiya pinned this issue Apr 12, 2021
@susanfairley
Contributor

Mentioning that I am hoping to discuss this on the next TASC call on 12th July

@mellybelly

Consider synching prefixes with other prefix coordination strategies. Some of these are documented here:
OBOFoundry/OBOFoundry.github.io#1524
OBOFoundry/OBOFoundry.github.io#1519
OBOFoundry/OBOFoundry.github.io#1038

I would recommend that GA4GH prefixes be disseminated to multiple resolvers such as N2T and W3ID.
GA4GH could declare a context file, more info here for an example: https://github.com/prefixcommons/biocontext

@ianfore

ianfore commented Mar 8, 2022

Rereading the thread I believe the simple solution that @jaeddy proposed with vrs: as the prefix covers it.

Points above that this addresses:

The VRS Project Maintainers are granted authority to build additional type prefixes under the ga4gh namespace without external review, so long as those identifiers begin with V, (e.g. VS for VariationSet, VA for Allele, VH for Haplotype). Data type prefixes outside of this scope must be reviewed and approved by TASC before implementation.

That is not only provided for but can be simplified to

The VRS Project Maintainers are granted authority to build additional type prefixes under the vrs namespace without external review.

An important part of the semantics is that the prefix is the part before the colon. The VS, VA

The resolver problem
This hasn't been raised in the thread yet. The ga4gh: prefix and sub-prefixes as described above require someone to run a resolver, as a continually available and reliable service, to resolve the prefixes. That's a large scope; is GA4GH undertaking to do it?

This is simplified with the vrs: prefix. A resolver is still needed, but it then sits with VRS. It should not require much more than VRS would have to do to create/resolve the computed ids anyway. Keeping those things together is good encapsulation. VRS owns the logic; as they extend it (e.g. Vx, Vy, Vz), it is incumbent on them to revise it. That is best kept as a separate concern from other GA4GH ids.

@susanfairley
Contributor

Listing some key points from the TASC call on 8th March 2022 below - to those on the call, please flag anything important that I miss.

  1. The 'ga4gh' prefix is what was approved at Steering Committee (SC) two years ago and, therefore, is what VRS has developed against. A decision against having a 'ga4gh' prefix would necessitate VRS reworking recently published material.

  2. Having a 'ga4gh' prefix would, we believe, necessitate a resolver being run, which identifiers.org (or similar) would direct to for resolution of sub-prefixes (see comment from @ianfore above and https://docs.identifiers.org/articles/docs/resolving_mechanisms.html)

  3. GA4GH is focussed on federated solutions that do not rely on centralised infrastructure

My opinion is that SC committed to a 'ga4gh' prefix two years ago and I'd want a very good reason to ask VRS to reverse their subsequent development work, based on that decision. While the 'ga4gh' prefix will require that we take on maintenance of a resolver, the discussion indicated that this was likely fairly easily achievable. I continue to argue against centralised infrastructure as a general direction of travel but think this case likely warrants it.

That would then bring us to management and maintenance:

  1. We already have a document outlining a proposal from two years ago, which looks to cover most things (https://docs.google.com/document/d/1Jq1g5FkRUf4Mky1FKWcvDaUoAjTSKVh2Re9rK4P7Xac/edit?usp=sharing). (I'm afraid I only spotted this while looking for the meeting minutes)

  2. Registration of the prefixes should be done by members of Secretariat, likely registering under GA4GH Inc. This would be done after review of proposals (new or changes) had been completed by TASC, as outlined in the existing doc.

  3. We need to address versioning. On the call we wanted to invite @ahwagner to share his thinking on this. @ahwagner , could you share your thoughts here please?

@andrewyatz
Contributor

A few more points here that I hope help to clarify:

  • Identifiers with ga4gh: were meant to represent organisation-wide unambiguous identifiers, e.g. ga4gh:SQ.NNNN is how we prefer to uniquely refer to a sequence
  • Those identifiers are generated from checksum-based methods, meaning if you have the data you can create the id
  • The id is also based on a now peer-reviewed method: a truncated base64 url-encoding of SHA-512
  • The design of the identifier and sub-prefix was to aid a resolver service to disambiguate identifiers should the need arise to create one
  • There was no expectation that GA4GH was to run such a resolver service capable of routing an individual to knowledge about that identifier. I cannot see how this would be possible as GA4GH does not act as an issuing authority nor a data repository. Where would we redirect the example SQ id to? ENA? SeqRepo? Ensembl? Another meta resolver?
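
The "if you have the data you can create the id" point can be sketched in a few lines of Python. This is a minimal sketch, not the reference implementation: the 24-byte truncation length is an assumption not stated in this thread, the function names are illustrative, and any normalisation or serialisation the specs apply to the data before digesting is left out.

```python
import base64
import hashlib


def truncated_digest(blob: bytes, nbytes: int = 24) -> str:
    """base64url-encode the first nbytes of the SHA-512 digest of blob."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:nbytes]).decode("ascii")


def sequence_id(sequence: str) -> str:
    # Assumption: the sequence is digested as its ASCII bytes, unmodified.
    return "ga4gh:SQ." + truncated_digest(sequence.encode("ascii"))


print(sequence_id("ACGT"))
```

Because the id depends only on the input bytes, two parties holding the same data compute the same id without consulting any issuing authority.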

I feel like the fact that this is a partial CURIE has sent discussions down the resolver route and with fair reason.

@susanfairley
Contributor

Thanks @andrewyatz

If I'm following correctly, that could take us to:

  1. Registering the overall 'ga4gh' namespace (to avoid hopefully unlikely use by another group and to make it explicit at identifiers.org what the prefix 'ga4gh' relates to).

  2. We would be looking at managing the namespace for sub prefixes after 'ga4gh:' (such as SQ) via TASC, as outlined in the doc.

  3. Interpretation of a specific instance of an identifier (such as abc123 in ga4gh:SQ.abc123), would stem from the algorithm associated with the entity and the data combining to create the identifier/digest in a process similar to the one described here: https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html#identify The expectation would be that someone would use data and the digest algorithm to generate the digest, which would then be used to verify someone was looking at the same data.

Also touched on in the call but not included in my summary was that describing the entities could be done via the GA4GH Data Model Library, which was consulted on last year.

Some questions and open issues for me at this point:

  1. Do all of the entities that we would currently want a sub prefix for also have an algorithm to generate a digest? (which would allow anyone to create, as is the case for VRS, and not rely on a single authority).

  2. I haven't seen reference to a VRS use case of "I have ga4gh:VA.xxxxxxxx and I want to know what the data in the VA object was that led to that digest", or digest -> data. I'm assuming at present that there is no intention to support this here but am aware that I am assuming :) - happy to be corrected. I think this would be akin to the reverse lookup conversation?

  3. For understanding what the sub prefix level entity is, would having some means of linking from ga4gh:SQ to information that SQ is a sequence make sense? I could imagine someone wanting to understand what SQ represented but curious if there is support or concrete use cases here? This feels similar to the data model library discussions of last year. Possibly useful but not essential to the VRS example that motivated this issue?

Happy to hear thoughts and thanks for the contributions.

@jb-adams
Member

There was no expectation that GA4GH was to run such a resolver service capable of routing an individual to knowledge about that identifier. I cannot see how this would be possible as GA4GH does not act as an issuing authority nor a data repository. Where would we redirect the example SQ id to? ENA? SeqRepo? Ensembl? Another meta resolver?

@andrewyatz , at a recent TASC call, we discussed a potential resolving service that would refer the client to the standard GA4GH schema denoted by the type prefix (e.g. VRS referring to the JSON schema for a VRS variant). So while we may not be able to refer to the platform/API holding the data, we could at least provide a lookup service for the data type indicated by the prefix. Does this sound valuable or irrelevant/misleading to you?

@ahwagner

@andrewyatz
Contributor

@jb-adams agreed, there's no way we can provide a service that could route someone to the platform/API hosting said data. But yes, something where the sub-extensions can be programmatically described and explored would be useful. I'd expect, though, that this is a very easy thing to implement, as there wouldn't be that many of them and the processes above would limit them pretty much to things that have been accepted

@andrewyatz
Contributor

@susanfairley for your point 2 (I have a digest and want to know what data that digest represents): yes, I do think this is something that is wanted in a number of use cases that haven't yet been recorded, but I would personally expect services/resources that implement the specs to support that if they want to and feel it is useful to do so. Some may for sure. Some may not. I would say this point is closer to the underlying use case behind refget than to the reverse lookup

@jb-adams jb-adams linked a pull request Apr 1, 2022 that will close this issue
@susanfairley
Contributor

TASC has met today and discussed this further.

@ahwagner - could you please come back to this issue with information on versioning?

The other focus of discussion was on the necessity (or otherwise?) of being able to get from an identifier to the data in the specific instance of that entity (resolving). The conversation touched on a number of points.

  1. There can be instances where the intention behind an identifier may actively be to avoid linking back to the real data while still consistently identifying a given instance. For example, a patient ID: you want to recognise that the same individual in many settings is the same individual, but you don't want to identify them personally (for example by sharing their name and address).
  2. In some cases, having identifiers that people cannot resolve is potentially problematic. This is frequently seen in science.
  3. The identifiers here are generated by an algorithm. As such, who owns the namespace if multiple institutions can generate these identifiers?

In practical terms, providing generic entity level resolution (saying that SQ is a sequence) and providing some minimal information (possibly including information on how subsequent accessions/identifiers/checksums/digests are produced) could likely be fairly easily implemented, possibly working with the planned GA4GH Data Model Library. This doesn't seem to be a source of controversy.

Beyond that, resolving to the digest level would seem to be problematic. It is also likely out of scope for GA4GH, being more in line with the level of service provision associated with EMBL-EBI or NCBI.

In summary, a simple generic-entity level resolution is likely achievable and has some support. Digest level resolution would be problematic, likely out of scope for GA4GH but there is concern that having unresolvable IDs may be problematic for the community. As such, any further context or clarification from GKS would be welcome.

Comments welcome before the end of April. TASC will next meet in May.

@ahwagner
Member Author

ahwagner commented Apr 5, 2022

Hi all, apologies for the delay in my reply. I have had many thoughts on the several subjects above and have been putting off my response until I could make time to compose it here. I have split these out by headers into discrete areas of concern raised in this thread.

Uniform Resource Names (URNs) and Uniform Resource Locators (URLs)

First, just a reminder that VRS computed identifiers are Uniform Resource Names (URNs), which are a subclass of Uniform Resource Identifiers (URIs). URNs by their definition must be globally unique and persistent (which the GA4GH computed identifier algorithm guarantees to meet with extremely high probability), and similarly are not intended to be resolvable entities as URLs are. This is why a resolver service on VRS IDs or RefGet sequences is not feasible; as URNs they were never meant to be globally retrievable (like URLs), only globally unique and persistent.

Identifiers are useful

Regarding point 2:

In some cases, having identifiers that people cannot resolve is potentially problematic. This is frequently seen in science.

I think that all identifiers, including both accessioned and computed identifiers, are potentially problematic in that they may be difficult to resolve. In fact, obsolete sequence accessions were a major factor in the development of the Universal Transcript Archive.

Since this point was raised, however, I feel obligated to point out that IDs (both accessioned and computed) are useful despite these potential problems. In fact, variants are often referenced by identifier instead of by a descriptive syntax; you don't need to look any further than rsIDs, allele registry IDs, ClinVar IDs, etc. to see the use cases where identifiers provide a convenient linking mechanism between evidence and knowledge statements. These are all in widespread use, especially for evidence and knowledge documents: e.g. variant x is_pathogenic_for disease y. But what happens if one of these resources is discontinued? Unlike accessioned identifiers from a resource, VRS provides a mechanism to create truly persistent and globally unique identifiers directly from underlying data (both requirements for URNs), a distinct advantage over the heterogeneous accessioned identifiers (typically crafted as resource-maintained URLs) that dominate the field.

Of course, for any resource that provides documents using VRS IDs (or accessioned IDs generated by that resource, e.g. ClinVar IDs), it is the responsibility of that resource to provide necessary context and/or reverse lookup services for the use of identifiers as needed for the use cases anticipated by that resource. Regarding what resources may actually implement retrieval services, I agree with @andrewyatz's comment:

[I] would personally expect services/resources that implement the specs to support that if they want to and feel it is useful to do so. Some may for sure. Some may not.

Notably, the VRS reference implementation does provide a mechanism for enref / deref operations on VRS objects that may be leveraged by a genomic data provider resource.

A GA4GH Namespace Resolver

With respect to question #3, there is no requirement for "ownership" of these URNs (as you would expect for URLs), but there is an opportunity for ownership of the ga4gh namespace for resolving the identifier type to schemas/documentation (see figure below for distinction between type and namespace as used in VRS and this comment). As discussed above, a type resolver is lightweight and may advance the objective of interoperable data models between GA4GH projects.

[Figure: VRS paper, Figure 4, showing the distinction between namespace and type in a computed identifier]

I see the associated GA4GH Namespace resolver working as described by @jb-adams and @susanfairley, namely that GA4GH will host a service that can inspect the ID type and return information on that type. For example, a ga4gh:VA.xxx CURIE could resolve to the VRS Allele documentation through the VA prefix.
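
A type-level resolver of this kind could be as small as the sketch below. The `TYPE_REGISTRY` table and `resolve_type` function are hypothetical; the entries reuse type prefixes already mentioned in this thread, and a real service would presumably return links to schemas or documentation rather than short descriptions.

```python
# Hypothetical registry: type prefix -> short description of the entity.
TYPE_REGISTRY = {
    "SQ": "Sequence",
    "VA": "VRS Allele",
    "VH": "VRS Haplotype",
    "VS": "VRS VariationSet",
    "VT": "VRS Text variation",
}


def resolve_type(curie: str) -> str:
    """Return the description registered for the type prefix of a ga4gh CURIE."""
    namespace, _, reference = curie.partition(":")
    if namespace != "ga4gh":
        raise ValueError(f"not a ga4gh CURIE: {curie!r}")
    # Only the type prefix is inspected; the digest is never looked up.
    type_prefix = reference.split(".", 1)[0]
    try:
        return TYPE_REGISTRY[type_prefix]
    except KeyError:
        raise LookupError(f"unregistered type prefix {type_prefix!r}") from None


print(resolve_type("ga4gh:VA.xxx"))  # VRS Allele
```

Note that the lookup never touches the digest, which is what keeps the service lightweight: it needs no knowledge of where the underlying data is hosted.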

Versioning and type

As VRS and associated VRSATILE specs mature, we will inevitably need to make some changes to our data models to accommodate previously unforeseen use cases or improvements to usability or adoption. We are working on a maturity model and data model versioning system to assist with this. Our current thinking is to have several levels of maturity for any given data model (e.g. an Allele, Haplotype, or SequenceLocation), corresponding to the degree of community adoption / stability of the model. Once a data model is granted a "stable" designation, it is assigned a version. Any breaking changes to that model will go through the full maturity model review and implementation cycle as a new version of that model, and that version would be indicated between the computed identifier type and digest.

For example, the current Allele model may be represented as ga4gh:VA.1.xxx and the next stable version would be referenced as ga4gh:VA.2.xxx. We would only increment new identifier versions on advancement of a data type to "stable". Implementations computing identifiers for any trial use or draft versions of upcoming models will use a type identifier scheme that allows them to indicate the exact draft entity that is being shared; e.g. ga4gh:VA.d-45a8d585ba.xxx might resolve to a draft version of Allele from the specific commit at 45a8d585ba. This scheme would require little-to-no additional governance from the Secretariat by allowing us to develop an existing type space with versions as needs and the data model evolve.
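To make the proposed layout concrete, here is a small parsing sketch. The grammar is only inferred from the two examples above (an integer for stable versions, `d-<commit>` for drafts); the `VersionedId` and `parse_versioned` names are hypothetical and the final scheme may well differ.

```python
import re
from typing import NamedTuple, Optional

# Assumed grammar, inferred from the examples above:
#   ga4gh:<TYPE>.<VERSION>.<DIGEST>
# where VERSION is a positive integer (stable) or "d-<commit>" (draft).
_VERSIONED = re.compile(
    r"ga4gh:(?P<type>[A-Z]+)\."
    r"(?:(?P<stable>[1-9][0-9]*)|d-(?P<commit>[0-9a-f]+))\."
    r"(?P<digest>[^.]+)"
)


class VersionedId(NamedTuple):
    type_prefix: str
    stable_version: Optional[int]  # set only for stable models
    draft_commit: Optional[str]    # set only for draft models
    digest: str


def parse_versioned(curie: str) -> VersionedId:
    """Split a versioned ga4gh identifier into type, version, and digest."""
    m = _VERSIONED.fullmatch(curie)
    if m is None:
        raise ValueError(f"not a versioned ga4gh identifier: {curie!r}")
    stable = m.group("stable")
    return VersionedId(
        type_prefix=m.group("type"),
        stable_version=int(stable) if stable is not None else None,
        draft_commit=m.group("commit"),
        digest=m.group("digest"),
    )
```

Under this sketch an unversioned identifier such as ga4gh:VA.xxx is rejected, so a real implementation would need to decide how identifiers issued before versioning are treated.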

@susanfairley
Contributor

@ahwagner, thank you for the above.

TASC is due to meet next week (I believe on Wednesday 4th May at 1pm BST/ 8am EDT - TBC).

Looking at the comments above, including my comments from 29th March and particularly those from @ahwagner and @andrewyatz, my impression is that:

  1. The 'ga4gh' prefix is what was approved at Steering Committee (SC) two years ago and, therefore, is what VRS has developed against. A decision against having a 'ga4gh' prefix would necessitate VRS reworking recently published material.
  2. We could register the 'ga4gh' namespace with identifiers.org (and any other appropriate groups).
  3. GA4GH could operate a resolver to the type+version level.
  4. Versioning could be handled as outlined by @ahwagner in the previous comment.

As this has been an open issue for a while, I'd like to suggest that we aim to conclude this discussion on the upcoming call. I'd propose that:

  1. Secretariat takes responsibility for registering the 'ga4gh' namespace
  2. Secretariat takes responsibility for developing a type+version level resolver as part of the Data Model Library work
  3. The type namespace (under 'ga4gh') be managed via TASC, with use of new type prefixes requiring TASC approval
  4. Versioning be handled as outlined by @ahwagner

@rrfreimuth
Collaborator

  1. Secretariat takes responsibility for registering the 'ga4gh' namespace
  2. Secretariat takes responsibility for developing a type+version level resolver as part of the Data Model Library work
  3. The type namespace (under 'ga4gh') be managed via TASC, with use of new type prefixes requiring TASC approval
  4. Versioning be handled as outlined by @ahwagner

+1 to @susanfairley 's proposal, above. This does, however, lead to the question about when and how versioning will be performed. I suggest the following:

  • A new version shall be defined when a) the object referenced by the type prefix changes in structure or semantics, or b) the algorithm used to generate the identifier is changed
  • Versions shall be represented as <TASC to determine, see below>
  • When a new version is needed, TASC shall be notified as soon as possible and provided with a brief rationale, description of the change, and new definition (to be used with the resolver)
  • As a guiding principle, projects shall strive to minimize the need for new versions

We should provide guidance on the representation of versions within the "type.version" string. Options include: string (see VRS Computed Identifiers for character set restrictions), Semantic Versioning string, decimal number, integer, alphanumeric string. Since the period character is used for field separation I think we should avoid SemVer and decimals. In addition, I think it should be intuitive to determine which version is more recent, so I prefer integers or simple alphanumerics (e.g., [1-9][\d]*[a-z]?) over a string.
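For the integer-with-optional-letter option, a sketch of validation and ordering might look like the following. It assumes the pattern suggested above (written with `[0-9]` in place of `\d` to stay ASCII-only); `version_key` is a hypothetical helper name.

```python
import re
from typing import Tuple

# Pattern from the suggestion above: an integer without leading zeros,
# optionally followed by a single lowercase letter.
_VERSION = re.compile(r"([1-9][0-9]*)([a-z]?)")


def version_key(version: str) -> Tuple[int, str]:
    """Return a sort key so that later versions compare greater."""
    m = _VERSION.fullmatch(version)
    if m is None:
        raise ValueError(f"invalid version string: {version!r}")
    # Compare the numeric part as an integer, then the letter suffix;
    # the empty suffix sorts before any letter.
    return int(m.group(1)), m.group(2)


ordered = sorted(["10", "2", "2a", "1"], key=version_key)
```

Sorting by this key puts "10" after "2a", which a plain string sort would not, and strings containing a period (such as SemVer or decimals) fail validation outright.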

@ahwagner
Member Author

ahwagner commented May 5, 2022

To supplement @rrfreimuth's point: I think some of the specifics of versioning are still being trialed, so just want to put in a note that while my above proposal is an idea for a starting point, what we find most effective in practice is still being worked out. It should be close to above.

For example, one potential change from the original proposal may be the notion of a draft version, similar to alpha/beta/rc builds in SemVer. So instead of a commit string (e.g. the above VA.d-45a8d585ba.xxx) you would have VA.2d3.xxx for the third draft of Allele v2. This is less granular and requires more tooling on the standards dev side (ensuring version bumps on draft change), but is nice for compactness.
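A sketch of how the compact draft form could be parsed and ordered is below. Two things are assumptions rather than anything decided here: the `<major>d<draft>` grammar is read off the single VA.2d3.xxx example, and drafts are ordered before the stable release of the same major version by analogy with SemVer pre-releases. `draft_sort_key` is a hypothetical name.

```python
import re
from typing import Tuple

# Assumed grammar: "<major>" for a stable version, "<major>d<n>" for the
# n-th draft of that major version (e.g. "2d3" is the third draft of v2).
_DRAFT_VERSION = re.compile(r"([1-9][0-9]*)(?:d([1-9][0-9]*))?")


def draft_sort_key(version: str) -> Tuple[int, int, int]:
    """Sort key that places drafts of vN before the stable vN (assumption)."""
    m = _DRAFT_VERSION.fullmatch(version)
    if m is None:
        raise ValueError(f"invalid version string: {version!r}")
    major = int(m.group(1))
    draft = m.group(2)
    if draft is None:
        return (major, 1, 0)       # stable release
    return (major, 0, int(draft))  # draft, ordered by draft number


ordered = sorted(["2", "2d3", "1", "2d1", "2d10"], key=draft_sort_key)
```

This keeps the version token free of periods and hyphens, at the cost of tooling having to bump the draft number on every draft change.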

Tagging @larrybabb who might have some additional thoughts on this.

@mamanambiya
Collaborator

I agree with the current proposal as it stands. However, I think the two last points regarding governance by TASC and versioning might need more discussion.

@jaeddy
Member

jaeddy commented May 17, 2022

+1 to @susanfairley's proposal from me as well. Thanks to @ahwagner for the careful explanation, and to everyone else for plenty of thoughtful discussion!

@mamanambiya mamanambiya unpinned this issue Apr 19, 2023
@mamanambiya mamanambiya pinned this issue Apr 19, 2023
@mbaudis
Member

mbaudis commented May 23, 2023

@mamanambiya @ahwagner @andrewyatz As mentioned in London etc. I've started another attempt at a GA4GH standards documentation site, splitting this off from SchemaBlocks. Current discussion is referenced here - waiting to be taken over...

https://ga4gh-community-standards.github.io/standards/identifiers-and-CURIEs/#the-ga4gh-namespace

@ahwagner
Member Author

Thanks @mbaudis. I appreciate you trying to move the ball here. Is the idea of the ga4gh-community-standards GitHub site for the community to document ongoing discussions, to provide "unofficial" recommendations for GA4GH developers to consider, or something else? I am very supportive of anything that helps us move forward with standard development alignment across GA4GH!

@mbaudis
Member

mbaudis commented May 25, 2023

@ahwagner Still unofficially filling the gaps; providing a template for IMO TASC to take over. Not bound to stick with name, repo etc. but why not ¯\_(ツ)_/¯.

But then - what is "unofficial"? GH threads as documentation are even more so (un-).

@tcezard
Contributor

tcezard commented Jun 20, 2023

A quick comment in this thread to highlight that Refget v2 is now going to officially adopt the SQ prefix for describing sequence entities, following on from what VRS already defined.

Similarly, Sequence collection has been discussing the adoption of a prefix for describing a group of sequences. No decision has been made yet, but we will be looking to TASC for guidance on how to choose and register it once we have.

@mbaudis
Member

mbaudis commented Jun 20, 2023

@mbaudis
Member

mbaudis commented Jul 10, 2023

Re. ga4gh namespace: identifiers.org doesn't know it, just ga4ghdos. Hasn't GA4GH (TASC) established claims before we get into a squatter's rights situation? (And then ga4ghdos:xxx... could IMO become ga4gh:DOS.xxx... - another argument to move forward here).

@mbaudis
Member

mbaudis commented Jul 10, 2023

Proposal:

  • TASC registers the ga4gh prefix w/ relevant resolvers etc.
  • TASC documents the general ga4gh usage format & policies, namely
    • the format of identifiers using an internal prefix (e.g. ga4gh:__IP__.__namespaced-part__) where __IP__ is an uppercase, TASC approved part
  • TASC establishes a registry (in the form of a GH repo) where the internal GA4GH prefixes are discussed / maintained
  • there is no requirement for GA4GH standards to implement such identifiers; it is proposed as a "good practice where good fit & need"
  • standards/products are responsible for their identifier use; ids themselves are not issued/monitored by TASC but occasionally reviewed, documented (?)
  • the use of identifiers is documented in the ga4gh-community-standards

This as a basis for discussions and a decision?
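For what it is worth, checking an identifier against such a registry could be as simple as the sketch below. The `PrefixEntry` structure, the `REGISTRY` contents, and `check_identifier` are hypothetical; the two entries only reflect the VA and SQ prefixes mentioned earlier in this thread, not an actual TASC-approved list.

```python
import re
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PrefixEntry:
    prefix: str   # uppercase internal prefix, e.g. "VA"
    product: str  # GA4GH product(s) using the prefix


# Illustrative entries only; a real registry would live in a TASC-managed repo.
REGISTRY: Dict[str, PrefixEntry] = {
    e.prefix: e
    for e in (
        PrefixEntry("VA", "VRS"),
        PrefixEntry("SQ", "VRS, Refget v2"),
    )
}

# Assumed format: ga4gh:<UPPERCASE PREFIX>.<namespaced part>
_FORMAT = re.compile(r"ga4gh:([A-Z]+)\.(.+)")


def check_identifier(identifier: str) -> PrefixEntry:
    """Validate the identifier format and look up its internal prefix."""
    m = _FORMAT.fullmatch(identifier)
    if m is None:
        raise ValueError(f"malformed ga4gh identifier: {identifier!r}")
    entry = REGISTRY.get(m.group(1))
    if entry is None:
        raise LookupError(f"prefix not in registry: {m.group(1)!r}")
    return entry
```

The check covers only the prefix and overall shape; the namespaced part stays the responsibility of the owning standard, in line with the proposal above.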

@jmcmurry

This conversation has been going on for a while and there are good arguments on both sides. At this juncture, I would gently suggest making a decision and sticking with it. It isn't super clear to me whose decision it is to make as there are resourcing questions at play, especially for the ga4gh: scheme. However, the longer work goes on, the harder it will be to retrofit a coherent strategy.
