ADR: sequence digest algorithm to be GA4GH digest #31
Looks right to me!
Looks good to me. Link for
Good point, @yash-puligundla. I will need to update that.
Looks good to me. However, the way the ADR is worded, I think it also says that we will include the namespace part of the identifier ("ga4gh:"). I am very much in favor of this, but I am not sure all agree. So we might need to take a separate round on this.
Just a few CURIE-related clarifications from the CURIE standard that seems relevant based on the discussions today: [1] The
Apart from that, I don't think the CURIE standard defines semantically what the [2] Quoting from the standard:
Basically, the way I read this, is that the
I am not sure whether the seqcol standard should be defined as a "host language" though. I assume not. Python, SQL, etc. are host languages, but I don't think data models/APIs are.
docs/decision_record.md
Outdated
### Decision
The GA4GH identifier will be used as our default sequence identifier instead of MD5. Other identifiers can be provided in a separate array and should not be part of the collection checksum calculation.
I think we should specify where we intend to use the identifiers. Something along the lines of:
The GA4GH identifier will be used as our default sequence identifier instead of MD5. Other identifiers can be provided in a separate array and should not be part of the collection checksum calculation.
It will be used to digest:
- the sequences that are stored in the `sequences` array
- the canonical representation of arrays of level 2
- the canonical representation of the sequence collection of level 1
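The three uses listed in this suggestion could be sketched roughly as follows. This is only an illustration under stated assumptions: `canonical_bytes` uses compact, key-sorted JSON as a stand-in for whatever canonical representation the specification settles on, and `digest_collection` and the `names` array are hypothetical names that do not come from the ADR.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """First 24 bytes of the SHA-512 digest, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_bytes(value) -> bytes:
    # ASSUMPTION: compact JSON with sorted keys stands in for the
    # canonical representation; the ADR does not define it.
    return json.dumps(value, separators=(",", ":"), sort_keys=True).encode("utf-8")


def digest_collection(sequences, other_arrays):
    """Hypothetical helper: digest sequences, then arrays, then the collection."""
    # 1. Digest each sequence stored in the `sequences` array.
    level2 = dict(other_arrays)
    level2["sequences"] = [sha512t24u(seq) for seq in sequences]
    # 2. Digest the canonical representation of each level 2 array.
    level1 = {name: sha512t24u(canonical_bytes(array)) for name, array in level2.items()}
    # 3. Digest the canonical representation of the level 1 collection.
    return sha512t24u(canonical_bytes(level1))


# `names` is a hypothetical extra array used only for illustration.
print(digest_collection([b"ACGT", b"TTGA"], {"names": ["chr1", "chr2"]}))
```

The same function is applied at every level, which is why the prefix question can be debated separately from the choice of algorithm.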
docs/decision_record.md
Outdated
### Rationale
GA4GH identifiers were created as part of the [Variation Representation Specification standard](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html), which included a way of creating identifiers to be used with sequences, e.g. ACGT results in the identifier `ga4gh:SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2`. The scheme uses the [`sha512t24u` function](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239883) to create a base64 URL-safe representation of a sha512 digest. Adopting GA4GH identifiers ensures sequence collections remain in line with newer standards within the GA4GH ecosystem.
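For readers unfamiliar with `sha512t24u`, a minimal sketch using only the Python standard library is shown here. It assumes the usual definition (SHA-512, truncated to the first 24 bytes, then base64url-encoded); it is not taken from this repository, and the VRS documentation linked above remains the reference.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """Truncated SHA-512 digest: first 24 bytes, base64url-encoded."""
    full_digest = hashlib.sha512(data).digest()  # 64 raw bytes
    truncated = full_digest[:24]                 # keep the first 24 bytes
    # 24 bytes encode to exactly 32 base64url characters, so no "=" padding appears.
    return base64.urlsafe_b64encode(truncated).decode("ascii")


# Should reproduce the digest part of the identifier quoted in the rationale.
print(sha512t24u(b"ACGT"))
```

The output is always 32 characters from the URL-safe alphabet, so it can be embedded in a URL or a prefixed identifier without escaping.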
I don't think we should talk about the prefix `ga4gh:SQ` in this ADR. First, I don't think we want to use `SQ`, since it is reserved for sequences, and second, it's not clear that the prefix is part of all the digests that Sequence Collection uses. It should only be used for the level 0 identifier.
There are two issues:
I think this ADR should be restricted to the former: this is only about the algorithm and not about identifier construction. We still need to debate the identifier construction question, which I've tried to summarize in issue #37.
docs/decision_record.md
Outdated
### Rationale
The GA4GH digest was created as part of the [Variation Representation Specification standard](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html). This document included a way of creating identifiers to be used with sequences, e.g. ACGT results in the identifier `ga4gh:SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2`. The scheme uses the [`sha512t24u` function](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239883) to create a base64 URL-safe representation of a sha512 digest. Adopting this standard ensures sequence collections remain in line with newer standards within the GA4GH ecosystem.
I would clarify the example like this:
The GA4GH digest was created as part of the [Variation Representation Specification standard](https://vrs.ga4gh.org/en/stable/impl-guide/computed_identifiers.html). This document included a way of creating identifiers to be used with sequences, e.g. ACGT is digested as `aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2` and results in the identifier `ga4gh:SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2` once prefixed. The scheme uses the [`sha512t24u` function](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239883) to create a base64 URL-safe representation of a sha512 digest. Adopting this standard ensures sequence collections remain in line with newer standards within the GA4GH ecosystem.
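The digest/identifier distinction in this suggestion can be made concrete with a short sketch. The helper names `make_identifier` and `split_identifier` are hypothetical and only illustrate the `namespace:type.digest` shape of the example identifier; whether seqcol adopts any prefix is the open question tracked in #37.

```python
import base64
import hashlib
from typing import Tuple


def sha512t24u(data: bytes) -> str:
    """First 24 bytes of the SHA-512 digest, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def make_identifier(digest: str, type_prefix: str = "SQ", namespace: str = "ga4gh") -> str:
    """Hypothetical helper: wrap a bare digest as namespace:type.digest."""
    return f"{namespace}:{type_prefix}.{digest}"


def split_identifier(identifier: str) -> Tuple[str, str, str]:
    """Hypothetical helper: recover (namespace, type prefix, digest)."""
    namespace, _, rest = identifier.partition(":")
    type_prefix, _, digest = rest.partition(".")
    return namespace, type_prefix, digest


digest = sha512t24u(b"ACGT")          # the digest: algorithm output only
identifier = make_identifier(digest)  # the identifier: digest plus prefix
print(digest)
print(identifier)
```

Because the base64url alphabet contains neither `:` nor `.`, splitting on the first occurrence of each is unambiguous.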
@andrewyatz I think we should make this ADR only about the algorithm, and not about identifier construction at all, which is handled in #37.
@nsheff I agree! We should in that case make sure that we use the term `digest` and not `identifier` everywhere.
Fine by me if we change this to only talk about digests and not include the word "identifier" (or possibly only to explain the difference, if that is needed).
Formalising the use of sha512t24u over md5. Switching this text to reflect this rather than just its use as a sequence identifier.
OK, I think the changes look good. Do you explicitly want to say "preferred", though? Saying "will use" or "MUST use" is probably more accurate.