GA4GH Namespace policy #16
Comments
There are several items underlying this proposal that TASC must address:
I will add my own thoughts to these issues when TASC brings the issue Under Discussion. |
Since this issue is on the agenda for Monday, I'm adding some thoughts here. Numbers correspond to the points I listed earlier in this thread.
|
Do we need conventions/governance beyond "register the proposed namespace in identifiers.org"? It seems like that would handle most use cases in terms of exposing existing, related products and avoiding duplication in prefixes. Where TASC could add value is by aggregating namespaces in identifiers.org that are specific to or being used by GA4GH — just to have a more constrained search scope. Something similar to this for ServiceInfo could be sufficient. |
In terms of structure for a registry/catalog, here's an example for an identifiers.org namespace object (as returned by the registry API) (cc @rishidev):
"namespaces" : [ {
"prefix" : "3dmet",
"mirId" : "MIR:00000066",
"name" : "3DMET",
"pattern" : "^B\\d{5}$",
"description" : "3DMET is a database collecting three-dimensional structures of natural metabolites.",
"created" : "2019-06-11T14:15:50.652+0000",
"modified" : "2019-06-11T14:15:50.652+0000",
"deprecated" : false,
"deprecationDate" : null,
"sampleId" : "B00162",
"namespaceEmbeddedInLui" : false,
"_links" : {
"self" : {
"href" : "https://registry.api.identifiers.org/restApi/namespaces/230"
},
"namespace" : {
"href" : "https://registry.api.identifiers.org/restApi/namespaces/230"
},
"contactPerson" : {
"href" : "https://registry.api.identifiers.org/restApi/namespaces/230/contactPerson"
}
}
}, |
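The namespace object above carries enough information for a simple client-side check. As a minimal sketch (the helper names `is_valid_local_id` and `compact_identifier` are hypothetical, and only fields shown in the example are used), a local identifier can be tested against the namespace's `pattern` and combined with its `prefix`:

```python
import re

# Subset of the namespace object shown above (values copied from the example).
namespace = {
    "prefix": "3dmet",
    "pattern": r"^B\d{5}$",
    "sampleId": "B00162",
    "deprecated": False,
}


def is_valid_local_id(ns: dict, local_id: str) -> bool:
    """Return True if local_id matches the namespace's declared pattern."""
    return re.match(ns["pattern"], local_id) is not None


def compact_identifier(ns: dict, local_id: str) -> str:
    """Build a 'prefix:local_id' string, rejecting ids that fail the pattern."""
    if not is_valid_local_id(ns, local_id):
        raise ValueError(f"{local_id!r} does not match {ns['pattern']!r}")
    return f"{ns['prefix']}:{local_id}"


print(compact_identifier(namespace, namespace["sampleId"]))  # 3dmet:B00162
```

A GA4GH-specific catalog could reuse the same fields, so that existing tooling written against identifiers.org objects keeps working.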
I agree with registering at identifiers.org. This was a recommended registry in the above mentioned presentation that was reviewed and approved by the Steering Committee. However, I think that I'm not understanding how registering at identifiers.org will resolve the issue presented here. To clarify the problem statement, |
So @ahwagner in |
Yes. For example, the |
The idea would be that groups could reuse identifiers, but I think it is important to clarify that once a type prefix is assigned, new groups should not propose, and TASC should not approve, overloaded use of that prefix. |
Please ensure that multiple designated resolvers are utilized. I recommend N2T as the second one (though there are a number of options). It would also be terrific if all prefixes could be registered in prefix commons. https://github.com/prefixcommons |
@cmungall @jmcmurry @deepakunni3 might have suggestions too |
It appears there may be some confusion/conflation of the various topics in this issue. I apologize for not catching that earlier and providing clarification. To recap: The CURIE syntax that was previously approved is of the following format: The VR Spec has defined several type prefixes already, including The proposal by the VRS team is reposted here for convenience:
Since the type prefixes are unique within the ga4gh namespace, any project that defines a type prefix (e.g., VS, VA, VH, SQ, VT) is reserving that prefix across the entire organization. The reuse of type prefixes should be encouraged, but if two projects use the same prefix for different types of entities, confusion will abound. In my post earlier in this thread I provided 3 opinions. I will summarize all 3 here and add a fourth that was mentioned by others.
|
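To make the collision concern concrete, here is a minimal sketch. It assumes identifiers shaped like `ga4gh:<type prefix>.<digest>` and a hypothetical in-memory list of prefix claims; the function names, the project name `project-x`, and the placeholder digest are illustrative only.

```python
from collections import defaultdict


def split_ga4gh_curie(curie: str) -> tuple[str, str]:
    """Split 'ga4gh:<type>.<digest>' into (type prefix, digest).

    Assumed layout for illustration; a versioned form is not handled here.
    """
    namespace, _, reference = curie.partition(":")
    if namespace != "ga4gh" or "." not in reference:
        raise ValueError(f"not a ga4gh type-prefixed CURIE: {curie!r}")
    type_prefix, _, digest = reference.partition(".")
    return type_prefix, digest


def find_collisions(claims: list[tuple[str, str, str]]) -> dict[str, set[str]]:
    """Return type prefixes claimed for more than one entity type.

    Each claim is (project, type prefix, entity type). Reuse of a prefix for
    the same entity type by several projects is not reported.
    """
    meanings = defaultdict(set)
    for _project, prefix, entity in claims:
        meanings[prefix].add(entity)
    return {p: e for p, e in meanings.items() if len(e) > 1}


claims = [
    ("VRS", "VA", "Allele"),
    ("refget", "SQ", "Sequence"),
    ("VRS", "SQ", "Sequence"),          # reuse with the same meaning: fine
    ("project-x", "VA", "Annotation"),  # hypothetical overloaded use
]
print(split_ga4gh_curie("ga4gh:VA.placeholder-digest"))
print(find_collisions(claims))
```

The check treats reuse as acceptable and flags only overloading, which matches the distinction drawn in the comment above.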
I think I'm more or less on board with the conventions that @rrfreimuth and @ahwagner are proposing above, with the caveat that I don't think The actual namespace/standard in question here is the Variant Representation Specification (VRS). Thus, I would argue that the prefix should be The end result being something along the lines of...
There probably isn't a huge value-add with the To @rrfreimuth's point (4): the above still wouldn't establish a convention for registering type prefixes, but more so GA4GH standard prefixes. I think, to some extent, this is where the work @jb-adams is doing on the service-registry API/portal comes into play... but probably still needs some more thought. |
My push back against the Also to just bring up that example you've got involving Ensembl (close to my heart). Here's an example of where the namespaces have been developed: each of those points to a single website, but that tightly binds namespaces to an implementation. As Ensembl changes its implementation (one website vs. 6) or introduces new resources (e.g. the recent covid-19 browser), new namespaces are created. So it suggests that these differences are now encoded in external identifiers rather than delegated to Ensembl resolvers to sort out. |
In the solution statement, it would be helpful to mention that this namespace solution is for CURIE-related identifiers, and may not be applicable for other types of namespaces. My example is in the Passport v1 spec, there is a "ga4gh" namespace created for JWT claim names of "ga4gh_passport_v1" to avoid collisions with non-ga4gh claims. However, this is:
So any wording in the final solution text that scopes the approach to CURIEs, in the same way the problem statement does, would help clarify the applicability of the solution. |
Mentioning that I am hoping to discuss this on the next TASC call on 12th July |
Consider synching prefixes with other prefix coordination strategies. Some of these are documented here: I would recommend that GA4GH prefixes be disseminated to multiple resolvers such as N2T and W3ID. |
Rereading the thread I believe the simple solution that @jaeddy proposed with vrs: as the prefix covers it. Points above that this addresses:
That is not only provided for but can be simplified to
An important part of the semantics is that the prefix is the part before the colon. The VS, VA

The resolver problem

This is simplified with the vrs: prefix. A resolver is still needed, but that then sits with VRS. It should not require much more than VRS would have to do to create/resolve the computed ids anyway. Keeping those things together is good encapsulation. VRS owns the logic; as they extend it (e.g. Vx, Vy, Vz) it is incumbent on them to revise it. That is best kept as a separate concern from other GA4GH ids. |
Listing some key points from the TASC call on 8th March 2022 below - to those on the call, please flag anything important that I miss.
My opinion is that SC committed to a 'ga4gh' prefix two years ago and I'd want a very good reason to ask VRS to reverse their subsequent development work, based on that decision. While the 'ga4gh' prefix will require that we take on maintenance of a resolver the discussion indicated that this was likely fairly easily achievable. I continue to argue against centralised infrastructure as a general direction of travel but think this case likely warrants it. That would then bring us to management and maintenance:
|
A few more points here that I hope help to clarify:
I feel like the fact that this is a partial CURIE has sent discussions down the resolver route and with fair reason. |
Thanks @andrewyatz If I'm following correctly, that could take us to:
Also touched on in the call but not included in my summary was that describing the entities could be done via the GA4GH Data Model Library, which was consulted on last year. Some questions and open issues for me at this point:
Happy to hear thoughts and thanks for the contributions. |
@andrewyatz , at a recent TASC call, we discussed a potential resolving service that would refer the client to the standard GA4GH schema denoted by the type prefix (e.g. |
@jb-adams agreed, there's no way we can provide a service that could route someone to the platform/API hosting said data. But yes, something where the sub-extensions can be described and explored programmatically would be useful. I'd expect this to be very easy to implement, as there wouldn't be that many of them and the processes above would limit it pretty much to things that have been accepted. |
@susanfairley for your point 2 (I have a digest and want to know what that digest is): yes, I do think this is wanted in a number of use cases that haven't yet been recorded, but I would personally expect services/resources that implement the specs to support it if they want to and feel it is useful to do so. Some may, for sure. Some may not. I would say this point is closer to the underlying use case behind refget and not the reverse lookup. |
TASC has met today and discussed this further. @ahwagner - could you please come back to this issue with information on versioning? The other focus of discussion was on the necessity (or otherwise?) of being able to get from an identifier to the data in the specific instance of that entity (resolving). The conversation touched on a number of points.
In practical terms, providing generic entity level resolution (saying that SQ is a sequence) and providing some minimal information (possibly including information on how subsequent accessions/identifiers/checksums/digests are produced) could likely be fairly easily implemented, possibly working with the planned GA4GH Data Model Library. This doesn't seem to be a source of controversy. Beyond that, resolving to the digest level would seem to be problematic. It is also likely out of scope for GA4GH, being more in line with the level of service provision associated with EMBL-EBI or NCBI. In summary, a simple generic-entity level resolution is likely achievable and has some support. Digest level resolution would be problematic, likely out of scope for GA4GH but there is concern that having unresolvable IDs may be problematic for the community. As such, any further context or clarification from GKS would be welcome. Comments welcome before the end of April. TASC will next meet in May. |
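The generic entity-level resolution described above could be as small as a lookup table. The following is a sketch only: the table contents are limited to type prefixes named in this thread, and the function name `describe`, the returned field names, and the placeholder digest are assumptions rather than part of any GA4GH service.

```python
# Hypothetical type-prefix table; entries are prefixes named in this thread.
TYPE_PREFIXES = {
    "SQ": "Sequence",
    "VA": "Allele",
    "VH": "Haplotype",
    "VS": "VariationSet",
}


def describe(identifier: str) -> dict:
    """Resolve a ga4gh identifier to its entity type only, not to the data."""
    namespace, _, reference = identifier.partition(":")
    if namespace != "ga4gh":
        raise ValueError("only the ga4gh namespace is handled in this sketch")
    type_prefix = reference.split(".", 1)[0]
    entity = TYPE_PREFIXES.get(type_prefix)
    if entity is None:
        raise KeyError(f"unregistered type prefix: {type_prefix!r}")
    # The digest is deliberately ignored: digest-level resolution is left to
    # the resources that actually hold the data.
    return {"type_prefix": type_prefix, "entity": entity, "digest_resolved": False}


print(describe("ga4gh:SQ.placeholder-digest"))
```

Because the digest is never interpreted, such a service needs no knowledge of any data provider, which is the split between entity-level and digest-level resolution discussed above.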
Hi all, apologies for the delay in my reply. I have had many thoughts on the several subjects above and have been putting off my response until I could make time to compose it here. I have split these out by headers into discrete areas of concern raised in this thread.

Uniform Resource Names (URNs) and Uniform Resource Locators (URLs)

First, just a reminder that VRS computed identifiers are Uniform Resource Names (URNs), which are a subclass of Uniform Resource Identifiers (URIs). URNs by definition must be globally unique and persistent (a requirement the GA4GH computed identifier algorithm meets with extremely high probability), and, unlike URLs, they are not intended to be resolvable entities. This is why a resolver service on VRS IDs or RefGet sequences is not feasible; as URNs they were never meant to be globally retrievable (like URLs), only globally unique and persistent.

Identifiers are useful

Regarding point 2:
I think that all identifiers, including both accessioned and computed identifiers, are potentially problematic in that they may be difficult to resolve. In fact, obsolete sequence accessions were a major factor in the development of the Universal Transcript Archive. Since this point was raised, however, I feel obligated to point out that IDs (both accessioned and computed) are useful despite these potential problems. In fact, variants are often referenced by identifier instead of by a descriptive syntax; you don't need to look any further than rsIDs, allele registry IDs, ClinVar IDs, etc. to see the use cases where identifiers provide a convenient linking mechanism between evidence and knowledge statements. These are all in widespread use, especially for evidence and knowledge documents: e.g. Of course, for any resource that provides documents using VRS IDs (or accessioned IDs generated by that resource, e.g. ClinVar IDs), it is the responsibility of that resource to provide necessary context and/or reverse lookup services for the use of identifiers as needed for the use cases anticipated by that resource. Regarding what resources may actually implement retrieval services, I agree with @andrewyatz's comment:
Notably, the VRS reference implementation does provide a mechanism for enref / deref operations on VRS objects that may be leveraged by a genomic data provider resource.

A GA4GH Namespace Resolver

With respect to question #3, there is no requirement for "ownership" of these URNs (as you would expect for URLs), but there is an opportunity for ownership of the I see the associated GA4GH Namespace resolver working as described by @jb-adams and @susanfairley, namely that GA4GH will host a service that can inspect the ID type and return information on that type. For example, a

Versioning and type

As VRS and associated VRSATILE specs mature, we will inevitably need to make some changes to our data models to accommodate previously unforeseen use cases or improvements to usability or adoption. We are working on a maturity model and data model versioning system to assist with this. Our current thinking is to have several levels of maturity for any given data model (e.g. an Allele, Haplotype, or SequenceLocation), corresponding to the degree of community adoption / stability of the model. Once a data model is granted a "stable" designation, it is assigned a version. Any breaking changes to that model will go through the full maturity model review and implementation cycle as a new version of that model, and that version would be indicated between the computed identifier type and digest. For example, the current |
@ahwagner, thank you for the above. TASC is due to meet next week (I believe on Wednesday 4th May at 1pm BST/ 8am EDT - TBC). Looking at the comments above, including my comments from 29th March and particularly those from @ahwagner and @andrewyatz, my impression is that:
As this has been an open issue for a while, I'd like to suggest that we aim to conclude this discussion on the upcoming call. I'd propose that:
|
+1 to @susanfairley 's proposal, above. This does, however, lead to the question about when and how versioning will be performed. I suggest the following:
We should provide guidance on the representation of versions within the "type.version" string. Options include: string (see VRS Computed Identifiers for character set restrictions), Semantic Versioning string, decimal number, integer, alphanumeric string. Since the period character is used for field separation I think we should avoid SemVer and decimals. In addition, I think it should be intuitive to determine which version is more recent, so I prefer integers or simple alphanumerics (e.g., [1-9][\d]*[a-z]?) over a string. |
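As an illustration of why an integer-plus-letter scheme keeps ordering intuitive, here is a small sketch built on the pattern suggested above. The function name `version_key` is hypothetical, and treating a letter suffix as later than the bare integer (so that `2a` follows `2`) is an assumption, not something decided in this thread.

```python
import re

# Pattern suggested above: an integer with an optional lowercase letter suffix.
VERSION_RE = re.compile(r"[1-9][\d]*[a-z]?")


def version_key(version: str) -> tuple[int, str]:
    """Return a sortable key so that '2' < '2a' < '10'."""
    if VERSION_RE.fullmatch(version) is None:
        raise ValueError(f"invalid version: {version!r}")
    digits = version.rstrip("abcdefghijklmnopqrstuvwxyz")
    return int(digits), version[len(digits):]


versions = ["10", "2a", "1", "2"]
print(sorted(versions, key=version_key))  # ['1', '2', '2a', '10']
```

A SemVer or decimal string such as `1.2` is rejected by the pattern, so the period stays unambiguous as the field separator.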
To supplement @rrfreimuth's point: I think some of the specifics of versioning are still being trialed, so just want to put in a note that while my above proposal is an idea for a starting point, what we find most effective in practice is still being worked out. It should be close to above. For example, one potential change from the original proposal may be the notion of a draft version, similar to alpha/beta/rc builds in SemVer. So instead of a commit string (e.g. the above Tagging @larrybabb who might have some additional thoughts on this. |
I agree with the current proposal as it stands. However, I think the two last points regarding |
+1 to @susanfairley's proposal from me as well. Thanks to @ahwagner for the careful explanation, and to everyone else for plenty of thoughtful discussion! |
@mamanambiya @ahwagner @andrewyatz As mentioned in London etc. I've started another attempt at a GA4GH standards documentation site, splitting this off from SchemaBlocks. Current discussion is referenced here - waiting to be taken over... https://ga4gh-community-standards.github.io/standards/identifiers-and-CURIEs/#the-ga4gh-namespace |
Thanks @mbaudis. I appreciate you trying to move the ball here. Is the idea of the ga4gh-community-standards GitHub site for the community to document ongoing discussions, to provide "unofficial" recommendations for GA4GH developers to consider, or something else? I am very supportive of anything that helps us move forward with standard development alignment across GA4GH! |
@ahwagner Still unofficially filling the gaps; providing a template for IMO TASC to take over. Not bound to stick with name, repo etc. but why not But then - what is "unofficial"? GH threads as documentation are even more so (un-). |
A quick comment in this thread to highlight that Refget v2 is now going to officially adopt the Similarly, Sequence collection has been discussing the adoption of a prefix for describing a group of sequences. No decision has been made yet, but we will be looking to TASC for guidance on how to choose and register it once we have. |
@tcezard So if you feel like adding this as example to https://github.com/ga4gh-community-standards/ga4gh-community-standards.github.io/blob/main/docs/standards/identifiers-and-CURIEs.md be my guest :-) |
Re. |
Proposal:
This as basis for |
This conversation has been going on for a while and there are good arguments on both sides. At this juncture, I would gently suggest making a decision and sticking with it. It isn't super clear to me whose decision it is to make as there are resourcing questions at play, especially for the |
Problem Statement
The VRS standard submitted a proposal to the GA4GH Steering Committee to approve the use of the ga4gh namespace for VRS computed identifier CURIEs. This was approved by the Steering Committee, with the understanding that a governance mechanism would be established for object identifiers under this namespace once an appropriate technical committee was established (TASC).

VRS is preparing to release new data type prefixes under the ga4gh namespace as part of its upcoming minor version releases. We intend to move forward with these as scheduled until a formal mechanism for registering / requesting identifier prefixes is created.

Impact of alignment between standards

Due to the organization-level namespacing, any CURIEs generated under this namespace should have a consistent meaning across products. If two standards (e.g. refget and VRS) develop identifiers with similar prefixes or inconsistent structures under the ga4gh namespace, interoperability between resources implementing multiple GA4GH products expecting ga4gh CURIEs will fail.

Background research and landscape analysis

We have worked to create an identifier scheme that has been reviewed and approved by unanimous vote of the GA4GH Steering Committee.
We have established several type prefixes already, with more that will be generated over the next year as we iteratively expand VRS.
This has direct relevance to the DRS, VRS, and refget standards, and likely others down the road.

Proposed solution

The VRS Project Maintainers are granted authority to build additional type prefixes under the ga4gh namespace without external review, so long as those identifiers begin with V (e.g. VS for VariationSet, VA for Allele, VH for Haplotype). Data type prefixes outside of this scope must be reviewed and approved by TASC before implementation. Similar project-level subdomains could be approved and allocated by TASC for other standards as needed.

In addition, a TASC-maintained public registry of type prefixes (approved, reserved, and pending) should be provided for quick reference by product developers and implementers.
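
The proposed policy can be sketched as a small registry. This is an illustration under stated assumptions, not a TASC design: the class and method names are hypothetical, the requester is identified by a plain string, and the three statuses are the ones named in the proposal.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    APPROVED = "approved"
    RESERVED = "reserved"
    PENDING = "pending"


@dataclass
class PrefixRegistry:
    # Maps type prefix -> (entity type, status).
    entries: dict = field(default_factory=dict)

    def request(self, prefix: str, entity: str, requester: str) -> Status:
        """Register a prefix; V* prefixes from VRS skip external review."""
        if prefix in self.entries:
            raise ValueError(f"type prefix {prefix!r} is already registered")
        delegated = requester == "VRS" and prefix.startswith("V")
        status = Status.APPROVED if delegated else Status.PENDING
        self.entries[prefix] = (entity, status)
        return status

    def approve(self, prefix: str) -> None:
        """Record TASC approval of a pending or reserved prefix."""
        entity, _ = self.entries[prefix]
        self.entries[prefix] = (entity, Status.APPROVED)


registry = PrefixRegistry()
print(registry.request("VA", "Allele", "VRS"))       # delegated: approved at once
print(registry.request("SQ", "Sequence", "refget"))  # outside V*: pending review
```

Rejecting a second registration of the same prefix is what keeps a type prefix unique across the whole ga4gh namespace.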