What is the hashing algorithm, and will there be one or multiple? #6

nsheff · 2020-11-25T16:13:28Z

The seqcol spec relies on a hashing algorithm to compute digests. These are digests of refget digests, so it makes sense to align the hashing algorithm with the one used by refget. Right now, refget allows 2 options, md5 or TRUNC512. As I understand it, TRUNC512 is tweaked slightly into a "GA4GH identifier".

@andrewyatz put it like this:

TRUNC512:
Normalise seq -> sha-512 -> take first 24 bytes -> encode into hex

GA4GH identifier:
Normalise seq -> sha-512 -> take first 24 bytes -> base64 url encode -> prefix "ga4gh:SQ." to the encoding

The two identifiers are the same; the only difference is that the GA4GH form was taken up by the VR group in GKS. So if we want to produce identifiers that VR can use for their statements, we need to support the GA4GH identifier, and since they're both the same "thing" under the hood, we can deprecate trunc512 in favour of ga4gh (plus you can convert on the fly between the two).
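To make the "same thing under the hood" point concrete, here is a minimal Python sketch of both encodings and the on-the-fly conversion. It assumes the truncation keeps the first 24 bytes of the SHA-512 digest, and that normalisation means stripping whitespace and uppercasing; the function names are made up for illustration and are not taken from refget or any library.

```python
import base64
import hashlib

GA4GH_PREFIX = "ga4gh:SQ."


def normalise(seq: str) -> str:
    # Assumption: normalisation = remove whitespace and uppercase.
    return "".join(seq.split()).upper()


def sha512_truncated(seq: str, nbytes: int = 24) -> bytes:
    # SHA-512 of the normalised sequence, keeping only the first nbytes bytes.
    return hashlib.sha512(normalise(seq).encode("ascii")).digest()[:nbytes]


def trunc512(seq: str) -> str:
    # Hex encoding of the truncated digest (48 characters).
    return sha512_truncated(seq).hex()


def ga4gh_identifier(seq: str) -> str:
    # base64url encoding of the same bytes (32 characters), plus the prefix.
    encoded = base64.urlsafe_b64encode(sha512_truncated(seq)).decode("ascii")
    return GA4GH_PREFIX + encoded


def trunc512_to_ga4gh(hex_digest: str) -> str:
    # Re-encode the same 24 bytes; no access to the sequence is needed.
    raw = bytes.fromhex(hex_digest)
    return GA4GH_PREFIX + base64.urlsafe_b64encode(raw).decode("ascii")


def ga4gh_to_trunc512(identifier: str) -> str:
    if not identifier.startswith(GA4GH_PREFIX):
        raise ValueError("expected an identifier starting with " + GA4GH_PREFIX)
    return base64.urlsafe_b64decode(identifier[len(GA4GH_PREFIX):]).hex()
```

Because both forms encode the same 24 bytes, `trunc512_to_ga4gh(trunc512(seq))` gives the same string as `ga4gh_identifier(seq)`, and the conversion works in either direction without rehashing.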

So, it seems clear that we will base seqcol on this GA4GH identifier, but the question is: what should we do about md5 digests? Are md5s so deeply embedded that we should continue to allow them as an option? Do we:

Make it so that you can use either GA4GH digests or md5 digests to look up sequence collections?

or,

Allow only GA4GH digests?

If we do choose to allow either digest type, then do we make separate endpoints for each digest type, or do we have just a single endpoint that can accept either type of digest? I'm not sure I see the value of separate endpoints. From the perspective of the lookup, which algorithm was used to create the digest is irrelevant -- the lookup simply uses the digest as a key that maps to some value in a database. It is also possible to infer the digest type from the length.
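A rough sketch of what inferring the type could look like for a single endpoint, in Python. The function name and return labels are hypothetical. One caveat worth flagging: an md5 hex digest and an unprefixed 24-byte base64url digest are both 32 characters long, so length alone only separates hex TRUNC512 (48 characters) from the other two; this sketch falls back on the prefix and the character set for the rest.

```python
import re

HEX_RE = re.compile(r"[0-9a-fA-F]+")
B64URL_RE = re.compile(r"[A-Za-z0-9_-]+")


def infer_digest_type(digest: str) -> str:
    """Guess which scheme produced a digest string (illustrative only)."""
    if digest.startswith("ga4gh:"):
        return "ga4gh"
    if len(digest) == 48 and HEX_RE.fullmatch(digest):
        return "trunc512"
    if len(digest) == 32:
        # A 32-character all-hex string is treated as md5. A base64url
        # digest could in principle consist only of hex characters, so
        # this is a heuristic, not a guarantee.
        if HEX_RE.fullmatch(digest):
            return "md5"
        if B64URL_RE.fullmatch(digest):
            return "ga4gh"
    return "unknown"
```

With something like this in front of the lookup, one endpoint could accept any of the digest forms and still use the string itself as the database key.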

daviesrob · 2020-12-09T15:18:21Z

As there is no legacy system for sequence collections (unlike refget, which has to support MD5 for compatibility), I think we should only support one algorithm. Simplicity suggests that TRUNC512 would be a reasonable choice.

nsheff · 2020-12-09T16:07:32Z

Something seems unsatisfying about using a different algorithm from what the refget digests are though. But I guess you're right that there's no practical reason why we can't.

andrewyatz · 2020-12-10T14:27:36Z

I agree on using just one checksum identifier scheme, which means it doesn't matter what the input digests are. @daviesrob I would still push towards using the GA4GH identifier over trunc512, since that is the one used by VRS (Variation Representation Specification). In fact, I'd propose that refget deprecates trunc512 usage to avoid this.

jmarshall · 2020-12-10T15:06:39Z

I'd want to think about it further, but on the face of it: IMHO for our more low-level protocol, sticking ga4gh:SQ. on the front of everything is just noise.

(Also this is not a SQ.. The definition of that prefix has I think not yet been formalised, but this is a set of sequences, not a sequence.)

andrewyatz · 2020-12-10T15:11:03Z

I'd be happy if we were to use the non-prefixed version of the identifier within the low-level protocol. I agree it adds noise; the prefix is useful when transmitting or holding identifiers in situations where the semantics/provenance behind the data are unclear.

nsheff · 2023-01-11T20:13:12Z

ADR for decision to use GA4GH identifier is in PR #31

nsheff mentioned this issue May 4, 2022

Accepted/Recommended Sequence digest algorithms #30

Closed

tcezard added this to the V1.0 milestone Sep 5, 2022

nsheff added the likely-solved label Jan 11, 2023

nsheff closed this as completed Feb 21, 2024
