NIH Common Fund - SPARC Dataset Structure #8
There is an effort within the SPARC program to develop such a structure:
https://sparc.science/help/3FXikFXC8shPRd8xZqhjVT
A white paper is about to be published. There is also some tooling being developed to assist researchers in migrating files to the structure, as well as tools for validation.

Comments
Thanks @jgrethe for this post! I think this is the same initiative as what @tgbugs described e.g. here: #4 (comment), correct? (if so, I propose to close this issue to keep things centralized in just one thread... ok with you @jgrethe?) Cheers, Sylvain |
.xlsx files for seemingly trivial tabular data - yikes! Does anyone know what the motivation was for going with that beast instead of a simple .tsv? |
👍 |
There is some tension between the deposition format (xlsx) and other, more interoperable formats that we might like to publish with the dataset. Right now we have only implemented functions that go from xlsx -> json, but we plan to implement the other direction as well, so that the xlsx file could serve purely as a user interface and never actually appear in the published dataset. |
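For readers who want to see what the xlsx -> json step amounts to, here is a minimal sketch, assuming openpyxl and a single header row per sheet; it is only an illustration, not the actual SPARC tooling:

```python
# Minimal sketch (illustrative only): flatten the first worksheet of an
# .xlsx file into JSON records, treating the first row as the header.
import json
from openpyxl import load_workbook

def xlsx_to_json(path):
    ws = load_workbook(path, read_only=True, data_only=True).worksheets[0]
    rows = ws.iter_rows(values_only=True)
    header = [str(h).strip() if h is not None else "" for h in next(rows)]
    return json.dumps(
        [
            {key: cell for key, cell in zip(header, row) if key}
            for row in rows
            if any(cell is not None for cell in row)  # skip fully empty rows
        ],
        indent=2,
        default=str,  # dates and other Excel cell types fall back to strings
    )

if __name__ == "__main__":
    print(xlsx_to_json("subjects.xlsx"))  # hypothetical metadata file
```

Going "in the other direction" would reverse this: render the same records back into a styled worksheet so depositors never have to touch the json.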
@SylvainTakerkart yes, same one I mention in #4 (comment). |
So every tool supporting this format for output needs to be able to write xlsx and also ensure consistent dumping into all the other formats? In other words: a multiplicity of possible data representations IMHO just invites inconsistency and I/O difficulties, for unclear benefit, since Excel etc. open tsv just fine. |
@yarikoptic no. Writing xlsx is only needed to make the life of the user easier if they are depositing data in xlsx format. In the minimal case writing xlsx would not be required, and for publication we might replace the xlsx files with tsv or json so that people who want to use the dataset do not have to deal with parsing the xlsx files.

In the minimal case a validator would just read the xlsx file in and tell the user "this is malformed." That validation is implemented at 3 levels: xlsx -> generic tabular, tabular -> json, and json. Only the xlsx -> generic tabular step needs additional work beyond csv/tsv. In the maximal case it can be easier to show users what is malformed by writing another xlsx file with all the bad fields marked in red. If you were doing this via a web interface there are other options, and of course the user might never interact with the underlying json structure at all.

edit: with regard to possible inconsistency, we have found that the more steps away from the defaults a user has to take, the more likely they are to produce inconsistent data. By supporting the defaults that 90% of our data depositors experience, we cut out a lot of steps that they can screw up. In short, there are more human errors that can happen when using tsv and csv, and they are significantly harder to fix than any of the implementation issues that might or might not be encountered when using xlsx. I think this holds despite the fact that the validation pipelines as currently implemented always run two parsers for xlsx files so that we can catch different sets of errors. Better to do that than to try to get 20 different labs to change how they save their files on 3 operating systems with 5 different localization defaults (actually probably more operating systems, because some labs are probably still on Windows XP for some of their data acquisition computers). |
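To make the "maximal case" above concrete, here is a rough sketch of writing back an xlsx with the bad fields marked in red. It assumes openpyxl and takes a hypothetical list of failing (row, column) coordinates from an earlier validation level; it is not the actual SPARC validator:

```python
# Rough sketch (illustrative only): re-save a workbook with a red fill on
# every cell whose (row, column) coordinate failed validation.
from openpyxl import load_workbook
from openpyxl.styles import PatternFill

RED = PatternFill(start_color="FFFF0000", end_color="FFFF0000", fill_type="solid")

def mark_errors(in_path, out_path, bad_cells):
    """bad_cells: 1-indexed (row, column) pairs, e.g. collected by a
    hypothetical tabular -> json validation pass."""
    wb = load_workbook(in_path)
    ws = wb.worksheets[0]
    for row, col in bad_cells:
        ws.cell(row=row, column=col).fill = RED
    wb.save(out_path)

# Example: flag two cells that the (hypothetical) validator rejected.
mark_errors("subjects.xlsx", "subjects_errors.xlsx", [(3, 2), (7, 5)])
```

The web-interface option mentioned above would render the same error coordinates in the browser instead of writing a new file.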
The paper on the SPARC Data Structure is here https://doi.org/10.1101/2021.02.10.430563 |