Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADR: sequence digest algorithm to be GA4GH digest #31

Merged
merged 7 commits into from
Aug 25, 2023

Conversation

andrewyatz
Copy link
Collaborator

No description provided.

@nsheff
Copy link
Member

nsheff commented May 13, 2022

Looks right to me!

@yash-puligundla
Copy link
Member

Looks good to me. Link for sha512t24u function points to the comments section. Maybe replace it with "https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239883"

@andrewyatz
Copy link
Collaborator Author

Good point @yash-puligundla I will need to update that

@sveinugu
Copy link
Collaborator

sveinugu commented Jun 2, 2022

Looks good to me. However, the way the ADR is worded, I think it really also says that we will also include the namespace part of the identifier ("ga4gh:"). I am very much in favor of this, but I am not sure all agree. So we might need to take a separate round on this.

@sveinugu
Copy link
Collaborator

Looks good to me. However, the way the ADR is worded, I think it really also says that we will also include the namespace part of the identifier ("ga4gh:"). I am very much in favor of this, but I am not sure all agree. So we might need to take a separate round on this.

Just a few CURIE-related clarifications from the CURIE standard that seems relevant based on the discussions today:

[1]

The prefix part is required to be somehow mappable to a IRI prefix. Concatenating the IRI prefix with the reference part should create a valid IRI, e.g.:

"ga4gh:SQ.whatnot" + {"ga4gh": "https://ga4gh.net/refget/"} = "https://ga4gh.net/refget/SQ.whatnot"

Apart from that, I don't think the CURIE standard defines semantically what the prefix should represent.

[2]

Quoting from the standard:

A host language MAY declare a default prefix value, or MAY provide a mechanism for defining a default prefix value. In such a host language, when the prefix is omitted from a CURIE, the default prefix value MUST be used. Conversely, if such a language does not define a default prefix value mechanism and does not define a set of reserved values, CURIEs MUST NOT be used without a leading prefix and colon.

Basically, the way I read this, is that the prefix part of a CURIE is optional given that the host language defines a default prefix. This is also described in the grammar of CURIEs:

[ [ prefix ] ':' ] reference

I am not sure whether the seqcol standard should be defined as a "host language" though. I assume not. Python, SQL, etc. are host languages, but I don't think data models/APIs are.


### Decision

The GA4GH identifier will be used as our default sequence identifier instead of MD5. Other identifiers can be provided in a separate array and should not be part of the collection checksum calculation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should specify where we intend to use the identifiers. Something along the line of

Suggested change
The GA4GH identifier will be used as our default sequence identifier instead of MD5. Other identifiers can be provided in a separate array and should not be part of the collection checksum calculation.
The GA4GH identifier will be used as our default sequence identifier instead of MD5. Other identifiers can be provided in a separate array and should not be part of the collection checksum calculation.
It will be used to digest:
- the sequences that are stored in the `sequences` array
- the canonical representation of arrays of level 2
- the canonical representation of the sequence collection of level 1


### Rationale

GA4GH identifiers were created as part of the [Variation Representation Specification standard](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html), which included a way of creating identifiers to be used with sequences e.g. ACGT results in the identifier `ga4gh:SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2`. The scheme uses the [`sha512t24u` function](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239883) to create a base64 URL-safe representation of a sha512 digest. Adopting GA4GH identifiers ensures sequence collections remains inline with newer standards within the GA4GH ecosystem.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should talk about the prefix ga4gh:SQ in this ADR. First, I don't think we want to use SQ since it is reserved for sequences and second it's not clear that the prefix is part of all the digest that Sequence Collection uses. It should only be used for the level identifier 0.

@nsheff
Copy link
Member

nsheff commented Oct 5, 2022

There are two issues:

  1. What digest algorithm to use
  2. How to construct the prefixes of the identifiers that we actually either 2A) Digest; or 2B) Return to users.

I think this ADR should be restricted to the former: this is only about algorithm and not about identifier construction.

We still need to debate the identifier construction question, which I've tried to summarize in issue #37


### Rationale

The GA4GH digest was created as part of the [Variation Representation Specification standard](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html). This document included a way of creating identifiers to be used with sequences e.g. ACGT results in the identifier `ga4gh:SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2`. The scheme uses the [`sha512t24u` function](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239883) to create a base64 URL-safe representation of a sha512 digest. Adopting this standard ensures sequence collections remains inline with newer standards within the GA4GH ecosystem.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would clarify the example like this:

Suggested change
The GA4GH digest was created as part of the [Variation Representation Specification standard](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html). This document included a way of creating identifiers to be used with sequences e.g. ACGT results in the identifier `ga4gh:SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2`. The scheme uses the [`sha512t24u` function](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239883) to create a base64 URL-safe representation of a sha512 digest. Adopting this standard ensures sequence collections remains inline with newer standards within the GA4GH ecosystem.
The GA4GH digest was created as part of the [Variation Representation Specification standard](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html). This document included a way of creating identifiers to be used with sequences e.g. ACGT is digested as `aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2` and result in the identifier `ga4gh:SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2` once prefixed. The scheme uses the [`sha512t24u` function](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239883) to create a base64 URL-safe representation of a sha512 digest. Adopting this standard ensures sequence collections remains inline with newer standards within the GA4GH ecosystem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrewyatz I think we should make this ADR only about algorithm, and not about identifier construction at all, which is handled in #37

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nsheff I agree! We should in that case make sure that we use the term digest and not identifier everywhere.

@nsheff nsheff changed the title Adding sequence identifier ADR ADR: sequence digest algorithm to be GA4GH digest Jan 11, 2023
Copy link
Collaborator

@sveinugu sveinugu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me if we change this to only talk about digests and not include the word "identifier" (or possibly only to explain the difference, if that is needed).


### Rationale

The GA4GH digest was created as part of the [Variation Representation Specification standard](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html). This document included a way of creating identifiers to be used with sequences e.g. ACGT results in the identifier `ga4gh:SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2`. The scheme uses the [`sha512t24u` function](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239883) to create a base64 URL-safe representation of a sha512 digest. Adopting this standard ensures sequence collections remains inline with newer standards within the GA4GH ecosystem.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nsheff I agree! We should in that case make sure that we use the term digest and not identifier everywhere.

Formalising the use of sha512t24u over md5. Switching this text to reflect this rather than just its use as a sequence identifier
@nsheff
Copy link
Member

nsheff commented Jul 28, 2023

ok I think the changes look good... just that are you explicitly wanting to say "preferred" ? Probably saying "will use" or "MUST use" is more accurate...

@nsheff nsheff changed the base branch from master to dev August 25, 2023 19:40
@nsheff nsheff merged commit 413af2b into ga4gh:dev Aug 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants