-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
What is the hashing algorithm, and will there be one or multiple? #6
Comments
As there is no legacy system for sequence collections (unlike refget, which has to support MD5 for compatibility), I think we should only support one algorithm. Simplicity suggests that TRUNC512 would be a reasonable choice. |
Something seems unsatisfying about using a different algorithm from what the refget digests are though. But I guess you're right that there's no practical reason why we can't. |
I agree on using just one checksum identifier generator scheme so meaning it doesn't matter what the input digests are. @daviesrob I would still push towards using the GA4GH identifier over trunc512, since that is the one used by VRS (variation representation specification). In fact I'd propose that refget deprecates trunc512 usage to avoid this. |
I'd want to think about it further, but on the face of it: IMHO for our more low-level protocol, sticking (Also this is not a |
I'd be happy if we were to use the non-prefixed version of the identifier within the low-level protocol. I agree it adds noise and the prefix is useful when transmitting or holding identifiers in situations when there are unclear semantics/provenance behind the data |
ADR for decision to use GA4GH identifier is in PR #31 |
The seqcol spec relies on a hashing algorithm to compute digests. These are digests of refget digests, so it makes sense to align the hashing algorithm with the one used by refget. Right now, refget allows 2 options, md5 or TRUNC512. As I understand it, TRUNC512 is tweaked slightly into a "GA4GH identifier".
@andrewyatz put it like this:
TRUNC512:
Normalise seq -> sha-512 -> take first 24 bits -> encode into hex
GA4GH identifier:
Normalise seq -> sha-512 -> take first 24 bits -> base64 url encode -> prefix "ga4gh:SQ." to the encoding
So, it seems clear that we will base seqcol on this GA4GH identifier, but the question is: what should we do about md5 digests? Are md5s so deeply embedded that we should continue to allow them as an option? Do we:
or,
If we do choose to allow either digest type, then do we make separate endpoints for each digest type, or do we have just a single endpoint that can accept either type of digest? I'm not sure I see the value of separate endpoints. From the perspective of the lookup, the which algorithm was used to create the digest is irrelevant -- it simply enters the digest in as a key that maps to some value in a database. It is also possible to infer the digest type from the length.
The text was updated successfully, but these errors were encountered: