Shasum of gzipped files depends on use of Matlab vs. Octave & host #457
Comments
+1 on this. Especially considering that gzip can be called with different options (e.g. compression level) and can even carry optional comment fields, this was always rather fragile. It's annoying, but I don't see a workaround.
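To illustrate the fragility described above, here is a minimal sketch using only Python's standard library. The payload is a made-up stand-in for an uncompressed image, and the choice of SHA-256 is an assumption made for illustration, not something the thread specifies.

```python
import gzip
import hashlib

payload = b"voxel data " * 1000  # hypothetical stand-in for an uncompressed image


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Same payload, two different gzip settings (compression level and header timestamp).
fast = gzip.compress(payload, compresslevel=1, mtime=1)
best = gzip.compress(payload, compresslevel=9, mtime=2)

# The compressed byte streams differ, so their checksums differ ...
print(sha256(fast) == sha256(best))
# ... while the decompressed content is identical.
print(sha256(gzip.decompress(fast)) == sha256(gzip.decompress(best)))
```

The first comparison is expected to be false and the second true: the gzip header alone (timestamp, flags) is enough to change the checksum of the compressed file.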
+1 on storing non-zipped sums. but in general, since a change of a single bit can affect a shasum, these are not good substitutes for anything other than identity. we have always considered more flexible hashes to match the binary blob, the header, etc. we can describe an image based on the overall hash, the blob being the same, the header being the same, etc.
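One way to picture the "overall hash, blob, header" idea is the sketch below. It assumes a little-endian, single-file NIfTI-1 image (348-byte header, `vox_offset` stored as a float32 at byte 108, `descrip` at byte 148); `component_digests` and `fake_nifti` are hypothetical helpers written for this illustration, and the toy buffer only fills in the fields the sketch reads. A real implementation would more likely rely on a NIfTI library.

```python
import hashlib
import struct

NIFTI1_HEADER_SIZE = 348  # fixed header length in NIfTI-1
VOX_OFFSET_POS = 108      # byte position of the float32 vox_offset field
DESCRIP_POS = 148         # byte position of the 80-byte descrip field


def component_digests(raw: bytes) -> dict:
    """Hypothetical helper: hash the whole file, the header and the data separately."""
    # Assumption: little-endian header; a robust reader would detect byte order.
    (vox_offset,) = struct.unpack_from("<f", raw, VOX_OFFSET_POS)
    data_start = int(vox_offset)
    return {
        "file": hashlib.sha256(raw).hexdigest(),
        "header": hashlib.sha256(raw[:NIFTI1_HEADER_SIZE]).hexdigest(),
        "data": hashlib.sha256(raw[data_start:]).hexdigest(),
    }


def fake_nifti(descrip: bytes, voxels: bytes) -> bytes:
    """Build a toy buffer containing only the fields this sketch needs."""
    header = bytearray(NIFTI1_HEADER_SIZE)
    struct.pack_into("<i", header, 0, NIFTI1_HEADER_SIZE)   # sizeof_hdr
    struct.pack_into("<f", header, VOX_OFFSET_POS, 352.0)   # vox_offset
    header[DESCRIP_POS:DESCRIP_POS + len(descrip)] = descrip
    return bytes(header) + b"\x00" * 4 + voxels             # 4-byte extension flag


# Same voxel data, different free-text description in the header.
a = component_digests(fake_nifti(b"SPM", b"\x01\x02\x03\x04"))
b = component_digests(fake_nifti(b"FSL", b"\x01\x02\x03\x04"))
print(a["file"] == b["file"], a["header"] == b["header"], a["data"] == b["data"])
```

With separate digests, two files whose headers differ only in a comment field would still match on the data digest, while the whole-file digest keeps its "is this the same file" meaning.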
given that shasums are bit dependent, what is the likelihood of two unzipped nifti files having the same shasum when run through the same processing, say in spm and fsl? i.e. should we start moving towards breaking down the information content into pieces that we want to query on?
@satra: if two pipelines reused the same data?
@cmaumet - yes. i worry there are too many pieces in the nifti file that would be different. so the only thing consistent would be at the level of the input data. and if that's the case, then the SHASUM as it stands currently would be fine to refer to input data.
@satra - what update would you suggest for NIDM? Creating separate entities for headers & image, for each file?
@cmaumet - perhaps it would be useful to know what sort of equality comparisons you are planning to make?
I had understood this was only the most superficial comparison... "is this the same file" basically, without any nuance of "Is this the same data before preprocessing". If I went in and changed the NIFTI comment header field the shasum should change. (It's a trivial change, but it's no longer the same file).
@nicholst - that is correct. hence my question of what types of comparisons to make. i used the phrase "same same but different" for an ohbm brainhack project last year, to illustrate issues with similarity. two files can be similar on the basis of:
- image similarity
- graph similarity

for this specific issue, perhaps we should be focusing on attributes directly/easily extractable from the image. we want a set of comparison attributes associated with an image. we could insert new attributes to the file, or create a new companion entity of similarity measures, i.e. when are two files similar. i do think this topic is worth a good discussion. we should determine what aspects of similarity we get:
and what use cases these pieces of information are intended to help address.
FYI - There is a similar discussion regarding the use of
Hi everyone,
In a NIDM-Results pack:
But the shasums of the gzipped files are different:
The differences in shasum can be explained by the fact that different processes were used to gzip the images. But this undermines our initial goal of being able to identify common images across multiple NIDM graphs (for reconciliation).
As a workaround, we could additionally store the shasum of the file before compression.
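A minimal sketch of this workaround, assuming the exporter can read back the gzipped file: the digest is computed over the decompressed stream, so it does not depend on how the file was compressed. `uncompressed_digest` is a hypothetical helper name, and the default algorithm is an assumption.

```python
import gzip
import hashlib


def uncompressed_digest(path: str, algorithm: str = "sha512") -> str:
    """Hypothetical helper: hash the decompressed content of a .gz file."""
    digest = hashlib.new(algorithm)
    with gzip.open(path, "rb") as stream:
        # Read in chunks so that large images need not fit in memory.
        for chunk in iter(lambda: stream.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two `.nii.gz` files produced by different gzip implementations would then yield the same value as long as the underlying `.nii` bytes are identical.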
What are your thoughts on this?
Note: This issue was identified with @gllmflndn when testing the SPM-NIDM-Results exporter in Octave at incf-nidash/nidmresults-spm#46 and briefly discussed on NIDM call (Jan. 29th, 2018).