Shasum of gzipped files depends on use of Matlab vs. Octave & host #457

cmaumet · 2018-03-01T17:11:17Z

Hi everyone,

In a NIDM-Results pack:

we store the shasum of the files we refer to.
we store gzipped images (.nii.gz) (in order to save space).

But the shasums of gzipped files differ:

depending on whether the images were compressed using Matlab or Octave
depending on the host system (local macOS versus Ubuntu on Travis CI) when Octave is launched via Docker

The differences in shasum can be explained by the fact that different processes were used to gzip the images. However, this undermines our initial goal of identifying common images across multiple NIDM graphs (for reconciliation).

As a workaround, we could additionally store the shasum of the file before compression.
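A minimal sketch of this workaround, assuming SHA-512 as the digest (substitute whichever one the pack actually records). The helper names `sha512_stored_and_raw` and `gzip_bytes` are illustrative only, and the payload is a placeholder rather than a real image. The two calls to `gzip_bytes` stand in for two different tools compressing the same file.

```python
import gzip
import hashlib
import io
from typing import Tuple


def sha512_stored_and_raw(gz_bytes: bytes) -> Tuple[str, str]:
    """Return (digest of the .gz bytes as stored, digest of the decompressed bytes)."""
    stored = hashlib.sha512(gz_bytes).hexdigest()
    raw = hashlib.sha512(gzip.decompress(gz_bytes)).hexdigest()
    return stored, raw


def gzip_bytes(data: bytes, level: int, mtime: int) -> bytes:
    """Compress with an explicit level and header timestamp."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=level, mtime=mtime) as fh:
        fh.write(data)
    return buf.getvalue()


payload = b"fake NIfTI bytes " * 1000  # placeholder content, not a real image
a = gzip_bytes(payload, level=9, mtime=0)
b = gzip_bytes(payload, level=1, mtime=1519924277)

stored_a, raw_a = sha512_stored_and_raw(a)
stored_b, raw_b = sha512_stored_and_raw(b)
print(stored_a == stored_b)  # False: the compressed byte streams differ
print(raw_a == raw_b)        # True: the decompressed content is identical
```

Storing both digests keeps the existing "is this the same .gz file" check while adding one that survives recompression.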

What are your thoughts on this?

Note: This issue was identified with @gllmflndn when testing the SPM-NIDM-Results exporter in Octave at incf-nidash/nidmresults-spm#46 and was briefly discussed on the NIDM call (Jan. 29th, 2018).

nicholst · 2018-03-01T17:27:55Z

+1 on this... especially considering that gzip can be called with different options (e.g. compression level) and even have optional comment fields, this was always rather fragile. It's annoying, but I don't see a workaround.
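To see which of these fields differ between two archives, the fixed 10-byte gzip member header (RFC 1952) can be decoded directly. This is a sketch only: `gzip_header_fields` and `make_gz` are hypothetical helper names, and the file name used below is made up for illustration.

```python
import gzip
import io
import struct


def gzip_header_fields(gz_bytes: bytes) -> dict:
    """Decode the fixed 10-byte gzip member header defined in RFC 1952."""
    magic, method, flags, mtime, xfl, os_id = struct.unpack("<HBBIBB", gz_bytes[:10])
    if magic != 0x8B1F:  # bytes 0x1f 0x8b read as a little-endian uint16
        raise ValueError("not a gzip stream")
    return {
        "method": method,                    # 8 means deflate
        "mtime": mtime,                      # modification time, 0 if unset
        "xfl": xfl,                          # extra flags (compression hint)
        "os": os_id,                         # originating OS / filesystem code
        "has_name": bool(flags & 0x08),      # FNAME: original file name stored
        "has_comment": bool(flags & 0x10),   # FCOMMENT: free-text comment stored
    }


def make_gz(data: bytes, mtime: int, filename=None) -> bytes:
    """Build a gzip stream in memory with a chosen timestamp and optional name."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=filename, fileobj=buf, mode="wb", mtime=mtime) as fh:
        fh.write(data)
    return buf.getvalue()


first = gzip_header_fields(make_gz(b"same voxels", mtime=0))
second = gzip_header_fields(
    make_gz(b"same voxels", mtime=1519924277, filename="example.nii")
)
print(first)
print(second)
```

Any difference in these fields changes the shasum of the .gz file even when the compressed payload decodes to identical bytes.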

satra · 2018-03-01T17:31:16Z

+1 on storing non-zipped sums. but in general, since a change of a single bit can affect a shasum, these are not good substitutes for anything other than identity.

we have always considered more flexible hashes to match the binary blob, header, etc. we can describe an image based on the overall hash, the blob being the same, the header being the same, etc.
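One way such per-part hashes might look, as a sketch under several assumptions: an uncompressed single-file NIfTI-1 image (348-byte header, data starting at `vox_offset`), SHA-512 as the digest, and any header extensions between the header and the data left out of both part digests. `nifti1_part_digests` and `fake_nifti1` are hypothetical names, and the test image is synthetic.

```python
import hashlib
import struct

NIFTI1_HEADER_SIZE = 348  # size of the fixed NIfTI-1 header in bytes


def nifti1_part_digests(raw: bytes) -> dict:
    """Hash the whole file, the fixed header, and the data blob separately."""
    for order in ("<", ">"):  # sizeof_hdr == 348 reveals the byte order
        (sizeof_hdr,) = struct.unpack_from(order + "i", raw, 0)
        if sizeof_hdr == NIFTI1_HEADER_SIZE:
            break
    else:
        raise ValueError("not a NIfTI-1 header")
    (vox_offset,) = struct.unpack_from(order + "f", raw, 108)
    data_start = int(vox_offset)
    return {
        "file": hashlib.sha512(raw).hexdigest(),
        "header": hashlib.sha512(raw[:NIFTI1_HEADER_SIZE]).hexdigest(),
        "data": hashlib.sha512(raw[data_start:]).hexdigest(),
    }


def fake_nifti1(descrip: bytes, voxels: bytes) -> bytes:
    """Build a tiny synthetic little-endian NIfTI-1 file for demonstration."""
    header = bytearray(NIFTI1_HEADER_SIZE)
    struct.pack_into("<i", header, 0, NIFTI1_HEADER_SIZE)  # sizeof_hdr
    struct.pack_into("<f", header, 108, 352.0)             # vox_offset
    header[148:148 + len(descrip)] = descrip               # descrip field (80 bytes)
    header[344:348] = b"n+1\x00"                           # single-file magic
    return bytes(header) + b"\x00" * 4 + voxels            # 4 bytes: no extensions


voxels = bytes(range(256)) * 4
original = nifti1_part_digests(fake_nifti1(b"SPM12", voxels))
edited = nifti1_part_digests(fake_nifti1(b"edited description", voxels))

print(original["file"] == edited["file"])      # False: header text changed
print(original["header"] == edited["header"])  # False
print(original["data"] == edited["data"])      # True: same voxel blob
```

With separate digests, a query can ask "same voxels?" independently of "same file?".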

cmaumet · 2018-03-05T16:26:34Z

We discussed this on NIDM call on March 5th.

@cmaumet - to write up a proposal on how to store the original shasum (including pros and cons).

satra · 2018-03-05T16:29:33Z

given that shasums are bit-dependent, what is the likelihood of two unzipped nifti files having the same shasum when run through the same processing, say in spm and fsl?

i.e. should we start moving towards breaking down the information content into pieces that we want to query on?

cmaumet · 2018-03-05T16:32:46Z

@satra: if two pipelines reused the same data?

satra · 2018-03-05T16:42:43Z

@cmaumet - yes. i worry there are too many pieces in the nifti file that would be different.

so the only thing consistent would be at the level of the input data. and if that's the case, then the SHASUM as it stands currently would be fine to refer to input data.

cmaumet · 2018-03-05T17:13:30Z

@satra - what update to NIDM would you suggest? Creating separate entities for the header and the image, for each file?

satra · 2018-03-05T17:33:07Z

@cmaumet - perhaps it would be useful to know what sort of equality comparisons you are planning to make?

nicholst · 2018-03-05T20:21:43Z

I had understood this was only the most superficial comparison, basically "is this the same file", without any nuance of "is this the same data before preprocessing". If I went in and changed the NIFTI comment header field, the shasum should change. (It's a trivial change, but it's no longer the same file.)

satra · 2018-03-06T13:43:01Z

@nicholst - that is correct. hence my question of what types of comparisons to make.

i used the phrase "same same but different" for an ohbm brainhack project last year, to illustrate issues with similarity. two files can be similar on the basis of:

image similarity

imaging modality
imaging object (brain, cerebellum, spinal cord, ...)
image subject to transformation

graph similarity

processing applied

participant/group similarity

participant characteristics (age, gender, zygosity, clinical diagnosis, ...)

for this specific issue, perhaps we should be focusing on attributes directly/easily extractable from the image. we want a set of comparison attributes associated with an image. we could add new attributes to the file entity, or create a new companion entity of similarity measures, i.e. one that answers: when are two files similar?

i do think this topic is worth a good discussion. we should determine what aspects of similarity we get:

directly from attributes
through processing of the graph
through processing of the image

and what use cases these pieces of information are intended to help address.
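As a rough illustration of the "directly from attributes" case only, a companion record could carry several digests per image and report the levels at which two records agree. Everything here is hypothetical: `ImageFingerprint`, its field names, and `matching_levels` are not NIDM terms, and the digest strings are placeholders.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class ImageFingerprint:
    """Hypothetical companion record of comparison attributes for one image."""
    file_sha: Optional[str] = None          # digest of the file as stored (.nii.gz)
    uncompressed_sha: Optional[str] = None  # digest after gunzip
    header_sha: Optional[str] = None        # digest of the image header only
    data_sha: Optional[str] = None          # digest of the voxel blob only


def matching_levels(a: ImageFingerprint, b: ImageFingerprint) -> List[str]:
    """List attributes for which both records have a digest and the digests agree."""
    return [
        f.name
        for f in fields(ImageFingerprint)
        if getattr(a, f.name) is not None
        and getattr(a, f.name) == getattr(b, f.name)
    ]


# Placeholder digests: same image gzipped by two different tools.
octave = ImageFingerprint(file_sha="aa", uncompressed_sha="cc", header_sha="dd", data_sha="ee")
matlab = ImageFingerprint(file_sha="bb", uncompressed_sha="cc", header_sha="dd", data_sha="ee")
print(matching_levels(octave, matlab))
# ['uncompressed_sha', 'header_sha', 'data_sha']
```

Similarity that requires processing the graph or the image itself is outside what a record like this can express.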

khelm · 2018-03-06T16:54:31Z

FYI - There is a similar discussion regarding the use of owl:sameAs. They point out that owl:sameAs is often used to convey "represents", "very similar to", "same thing but a different context", etc. Some of these are relevant to the points raised above by @satra.

cmaumet added nidm-results Data Model labels Mar 1, 2018

cmaumet mentioned this issue Mar 2, 2018

Tests ignore shasums incf-nidash/nidmresults-spm#49

Open
