Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What is the hashing algorithm, and will there be one or multiple? #6

Closed
nsheff opened this issue Nov 25, 2020 · 6 comments
Closed

What is the hashing algorithm, and will there be one or multiple? #6

nsheff opened this issue Nov 25, 2020 · 6 comments
Milestone

Comments

@nsheff
Copy link
Member

nsheff commented Nov 25, 2020

The seqcol spec relies on a hashing algorithm to compute digests. These are digests of refget digests, so it makes sense to align the hashing algorithm with the one used by refget. Right now, refget allows 2 options, md5 or TRUNC512. As I understand it, TRUNC512 is tweaked slightly into a "GA4GH identifier".

@andrewyatz put it like this:

TRUNC512:
Normalise seq -> sha-512 -> take first 24 bits -> encode into hex

GA4GH identifier:
Normalise seq -> sha-512 -> take first 24 bits -> base64 url encode -> prefix "ga4gh:SQ." to the encoding

The two identifiers are the same, the only difference being GA4GH was taken up by the VR group in GKS. So if we want to produce identifiers which VR can use for their statements, we need to support the GA4GH identifier and since they're both the same "thing" under the hood we can deprecate trunc512 in favour of ga4gh (plus you can convert on the fly between the two).

So, it seems clear that we will base seqcol on this GA4GH identifier, but the question is: what should we do about md5 digests? Are md5s so deeply embedded that we should continue to allow them as an option? Do we:

  1. Make it so that you can use either GA4GH digests or md5 digests to look up sequence collections?

or,

  1. Allow only GA4GH digests?

If we do choose to allow either digest type, then do we make separate endpoints for each digest type, or do we have just a single endpoint that can accept either type of digest? I'm not sure I see the value of separate endpoints. From the perspective of the lookup, the which algorithm was used to create the digest is irrelevant -- it simply enters the digest in as a key that maps to some value in a database. It is also possible to infer the digest type from the length.

@daviesrob
Copy link
Contributor

As there is no legacy system for sequence collections (unlike refget, which has to support MD5 for compatibility), I think we should only support one algorithm. Simplicity suggests that TRUNC512 would be a reasonable choice.

@nsheff
Copy link
Member Author

nsheff commented Dec 9, 2020

Something seems unsatisfying about using a different algorithm from what the refget digests are though. But I guess you're right that there's no practical reason why we can't.

@andrewyatz
Copy link
Collaborator

I agree on using just one checksum identifier generator scheme so meaning it doesn't matter what the input digests are. @daviesrob I would still push towards using the GA4GH identifier over trunc512, since that is the one used by VRS (variation representation specification). In fact I'd propose that refget deprecates trunc512 usage to avoid this.

@jmarshall
Copy link
Member

I'd want to think about it further, but on the face of it: IMHO for our more low-level protocol, sticking ga4gh:SQ. on the front of everything is just noise.

(Also this is not a SQ.. The definition of that prefix has I think not yet been formalised, but this is a set of sequences, not a sequence.)

@andrewyatz
Copy link
Collaborator

I'd be happy if we were to use the non-prefixed version of the identifier within the low-level protocol. I agree it adds noise and the prefix is useful when transmitting or holding identifiers in situations when there are unclear semantics/provenance behind the data

@nsheff
Copy link
Member Author

nsheff commented Jan 11, 2023

ADR for decision to use GA4GH identifier is in PR #31

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants