Allow for custom attributes and read type description of fastq #102

cgirardot · 2024-12-24T15:38:54Z

Implements #86

Allows adding "schema_attribute[tag]" columns (e.g. sample_attribute[treatment]) to the input schema tables (tsv only), where schema is one of 'sample', 'run', 'experiment' or 'study'; for example, a new sample_attribute[treatment] column in ena_sample.tsv. These extra headers are passed to the XML generation step and written to the generated XML as an ATTRIBUTE sequence (templates were modified accordingly). For samples, only the default ERC000011 checklist was modified to support these additional attributes. Units are not yet supported.
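The column convention above can be sketched as follows. This is a minimal illustration using only the standard library, not the code from this PR: the helper names `extra_attributes` and `attributes_xml`, the regular expression and the two-column tsv excerpt are all assumptions made for the example.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Matches headers such as "sample_attribute[treatment]"; captures schema and tag.
ATTRIBUTE_HEADER = re.compile(r"^(sample|run|experiment|study)_attribute\[(.+)\]$")


def extra_attributes(row, schema):
    """Return {tag: value} for the non-empty <schema>_attribute[tag] cells of one row."""
    found = {}
    for header, value in row.items():
        match = ATTRIBUTE_HEADER.match(header)
        if match and match.group(1) == schema and value:
            found[match.group(2)] = value
    return found


def attributes_xml(row, schema):
    """Build a <SCHEMA>_ATTRIBUTES element with one TAG/VALUE pair per extra column."""
    parent = ET.Element(f"{schema.upper()}_ATTRIBUTES")
    for tag, value in extra_attributes(row, schema).items():
        attribute = ET.SubElement(parent, f"{schema.upper()}_ATTRIBUTE")
        ET.SubElement(attribute, "TAG").text = tag
        ET.SubElement(attribute, "VALUE").text = value
    return parent


# Hypothetical excerpt of an ena_sample.tsv file (assumed content).
tsv = "alias\tsample_attribute[treatment]\nsample_1\theat shock\n"
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
xml_text = ET.tostring(attributes_xml(rows[0], "sample"), encoding="unicode")
print(xml_text)
```

Empty cells are skipped so that a sample without a value for a given tag does not produce an empty attribute.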

Additionally, support for read_type and read_label (as new headers in ena_run.tsv) is added to the run XML for files of type fastq. This covers single-cell cases where more than two fastq files are available (ENA then requires read_type to be described). Multiple values can be passed in CSV format, e.g. paired,cell_barcode
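A minimal sketch of how a comma-separated read_type cell could be expanded into one READ_TYPE element per value inside a fastq FILE element. The function names, the file name, the placeholder checksum and the order of the child elements are assumptions for illustration; the PR itself does this in the run jinja template.

```python
import xml.etree.ElementTree as ET


def split_csv(cell):
    """Split a comma-separated cell into trimmed, non-empty values."""
    return [item.strip() for item in (cell or "").split(",") if item.strip()]


def fastq_file_element(filename, checksum, read_type="", read_label=""):
    """Build a FILE element with optional READ_LABEL and READ_TYPE children."""
    file_el = ET.Element("FILE", {
        "filename": filename,
        "filetype": "fastq",
        "checksum_method": "MD5",
        "checksum": checksum,
    })
    # Assumed order: labels first, then types.
    for label in split_csv(read_label):
        ET.SubElement(file_el, "READ_LABEL").text = label
    for value in split_csv(read_type):
        ET.SubElement(file_el, "READ_TYPE").text = value
    return file_el


# Hypothetical file name and placeholder checksum.
element = fastq_file_element("cells_R1.fastq.gz", "0" * 32, read_type="paired,cell_barcode")
read_types = [child.text for child in element.findall("READ_TYPE")]
print(read_types)
```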

Limitations: read_label is not fully supported, as that would require supporting SpotDescriptorType in the run XML, and it is unclear how this information could be passed. Basic support for SPOT_DECODE_SPEC with a READ_SPEC using BASE_COORD (see SRA.common.xsd) could be provided with:

- headers like READ_SPEC 1 ... READ_SPEC n, where the header number is the READ_SPEC's READ_INDEX
- values formatted like READ_LABEL:READ_CLASS:READ_TYPE:BASE_COORD:SPOT_LENGTH, for example UMI1:Application Read:Other:1:8.
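The proposed header and value format could be parsed along these lines. This is only a sketch of the proposal above, which is not implemented in this PR; the `ReadSpec` dataclass and `parse_read_spec` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadSpec:
    read_index: int
    read_label: str
    read_class: str
    read_type: str
    base_coord: int
    spot_length: int


def parse_read_spec(header, value):
    """Parse a 'READ_SPEC n' header and its colon-separated cell value."""
    prefix, _, index = header.partition(" ")
    if prefix != "READ_SPEC" or not index.isdigit():
        raise ValueError(f"not a READ_SPEC header: {header!r}")
    fields = value.split(":")
    if len(fields) != 5:
        raise ValueError(f"expected 5 colon-separated fields, got {len(fields)}")
    label, read_class, read_type, base_coord, spot_length = fields
    return ReadSpec(int(index), label, read_class, read_type,
                    int(base_coord), int(spot_length))


spec = parse_read_spec("READ_SPEC 1", "UMI1:Application Read:Other:1:8")
print(spec)
```

A colon-separated format only works as long as none of the five fields contains a colon itself, which holds for the example value.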

…e[tag]' and inject each of these tags as a schema_attribute XML sequence. Additionally, support for read_type and read_label (not fully supported yet though) is added to the run XML (when present in the run ena table)

bedroesb · 2024-12-25T11:55:58Z

@cgirardot Thanks a lot for starting this. This is a long-awaited feature. I was wondering if you could also test/make it work for xlsx files or ISA-JSON (which should be easy since they all, including the tsv files, get parsed to a pandas dataframe). The controlled vocabulary you added, could you add this to the xml updater script in var? I don't hardcode these things but pull them straight from ENA and create the XMLs with variable vocabularies on the fly using jinja templates. Which makes me wonder where you are getting the ENA_template_FASTQ and ENA_template_READ_TYPE from? The SpotDescriptorType question I will look into another day.

cgirardot · 2024-12-25T12:26:57Z

@bedroesb thank you for your quick feedback.

> @cgirardot Thanks a lot for starting this. This is a long-awaited feature. I was wondering if you could also test/make it work for xlsx files or ISA-JSON (which should be easy since they all, including the tsv files, get parsed to a pandas dataframe).

I don't know when I can have a look at this. My data management system exports data in the tsv table format, so it was easy to extend my code to export the additional information for the different study types in our DM (single cell, not single cell, ...). I am not sure how to get started with the other formats and how to test them.

> The controlled vocabulary you added, could you add this to the xml updater script in var? I don't hardcode these things but pull them straight from ENA and create the XMLs with variable vocabularies on the fly using jinja templates. Which makes me wonder where you are getting the ENA_template_FASTQ and ENA_template_READ_TYPE from? The SpotDescriptorType question I will look into another day.

I am really not familiar with these templating frameworks (it took me quite some time to get into them...) but I can take a look. The values come from:

- ENA_template_READ_TYPE comes from SRA.run.xsd; see the READ_TYPE element at line 85.
- The commit of ENA_template_FASTQ is a mistake: this was a first attempt and I forgot to remove it (it is not used), i.e. the fastq case is now handled directly in ENA_template_runs.xml; see line 26 and below.
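As an illustration of pulling a controlled vocabulary out of an XSD rather than hardcoding it, here is a minimal sketch with the standard library. The XSD fragment is invented for the example (the real values live in SRA.run.xsd), and `enumeration_values` is a hypothetical helper, not the project's updater script.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Invented XSD fragment for illustration only; not copied from SRA.run.xsd.
XSD_FRAGMENT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="READ_TYPE">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value=""/>
        <xs:enumeration value="single"/>
        <xs:enumeration value="paired"/>
        <xs:enumeration value="cell_barcode"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>
"""


def enumeration_values(xsd_text, element_name):
    """Return the non-empty enumeration values declared under the named element."""
    root = ET.fromstring(xsd_text)
    for element in root.iter(f"{XS}element"):
        if element.get("name") == element_name:
            values = [e.get("value", "") for e in element.iter(f"{XS}enumeration")]
            # Skip empty values so they are not offered as an option.
            return [value for value in values if value]
    return []


read_type_options = enumeration_values(XSD_FRAGMENT, "READ_TYPE")
print(read_type_options)
```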

Will try to find some time in the next days. Happy holidays!

cgirardot · 2024-12-25T12:28:11Z

@bedroesb I forgot to mention the failing pipeline; this is not on me as far as I can see, the command line is simply missing a value for the --center option.

cgirardot · 2024-12-26T11:35:24Z

@bedroesb I managed the second point, i.e. updating the jinja templates so that all XML templates are generated automatically.

I also looked into the first point, but I don't see how to approach ISA-JSON easily. This feels out of scope for my contribution and requires much more knowledge of ISA-JSON than I have. One would need to work out which attributes should be exported as schema_attribute[tag]. Also, I don't see where READ_TYPE lives in the ISA-JSON.

Regarding xlsx support, it seems this would just work out of the box when adding schema_attribute[tag] columns; not tested though.

bedroesb · 2024-12-26T17:44:05Z

> @bedroesb I forgot to mention the failing pipeline; this is not on me as far as I can see, the command line is simply missing a value for the --center option.

Interesting, 2 weeks ago this worked and I didn't change anything. Let me double-check the secrets.

> @bedroesb I managed the second point, i.e. updating the jinja templates so that all XML templates are generated automatically.
>
> I also looked into the first point, but I don't see how to approach ISA-JSON easily. This feels out of scope for my contribution and requires much more knowledge of ISA-JSON than I have. One would need to work out which attributes should be exported as schema_attribute[tag]. Also, I don't see where READ_TYPE lives in the ISA-JSON.
>
> Regarding xlsx support, it seems this would just work out of the box when adding schema_attribute[tag] columns; not tested though.

That is really great! And it looks good at first sight. Regarding the ISA-JSON, you are right. I will run a test with the test-isa-json and text-xlsx to see if everything still behaves as it should. Which reminds me that I should improve the testing to cover those and do some test submissions; the only problem there is that the files and aliases one submits must be unique.

bedroesb · 2024-12-26T17:45:53Z

To Do list for myself:

- Fix workflow checks
- Test with ISA-JSON and xlsx files as input
- Do a real submission with extra attributes
- Test a submission with read-type

bedroesb · 2024-12-26T17:48:31Z

I am very grateful for your contributions btw; you seem to have quickly found your way through the (not so well documented) code of templates creating templates :) It's one way of keeping them automatically up to date, plus client-side validation using XSDs.

cgirardot added 6 commits December 25, 2024 18:31

deleted ENA_template_FASTQ.xml left over

9e9a2e7

should produce ENA_template_READ_TYPE.xml automatically from xsd

9cbeff8

updated sample jinja template to loop over extra_attributes

a338533

skips the empty value as an READ_TYPE option

5c08076

adds manual test flag to export in alt tests/ folder, handy for testing without overriding the templates

2767429

new sample templates from running xml_converter.py using the new templates

863e3f6
