[ENH] Include data processing steps, reference to which the reads were aligned or if possible lab protocol into the main table #188

ajandria · 2023-04-11T12:43:37Z

Is your feature request related to a problem? Please describe.

I was wondering whether it is possible to also retrieve the data processing description that is present in the sample records in GEO. See here for an example: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6005004. There is a lot of information there that we would like to see in the table that pysradb generates:

Status
Title
Sample type
Source name
Organism
Characteristics
Treatment protocol
Growth protocol
Extracted molecule
Extraction protocol
Library strategy
Library source
Library selection
Instrument model
Description
Data processing

Describe the solution you'd like

I like the table that is currently generated using the following:
df = db.sra_metadata(df["study_accession"], detailed = True, expand_sample_attributes = True, output_read_lengths = True)
although I feel it sometimes misses crucial information that is only included in GEO under the individual sample records. For example, in the record of the sample linked above you can find the following:

Sequenced reads were trimmed for adaptor sequence and low-quality sequence (bbduk; minlength=30, qtrim=rl, trimq=15)
Reads were then mapped to the reference genome of Mus musculus (GRCm38) using STAR aligner version 2.5.3a with parameters --quantMode GeneCounts --runThreadN 4
Assembly: GRCm38

It would be nice to get that into the sra_metadata table too, if possible. For now I could probably use GEOquery for that and then merge the two tables by GSM sample IDs, although I would need to test that. In that case the hassle of including this here might be redundant. Still, it seems like a nice direction to take to expand this :)
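For reference, the merge workaround could look roughly like the sketch below. It assumes the pysradb table carries the GSM ID in its experiment_alias column and that the GEO-derived table has one row per sample; the `gsm` and `data_processing` column names and the placeholder accessions are made up for illustration.

```python
import pandas as pd

# Stand-in for the pysradb output; only the join column matters here.
# The run accessions are placeholders, not real records.
sra = pd.DataFrame(
    {
        "run_accession": ["SRR_PLACEHOLDER_1", "SRR_PLACEHOLDER_2"],
        "experiment_alias": ["GSM6005004", "GSM_PLACEHOLDER"],
    }
)

# Stand-in for a table exported from GEO, one row per GSM sample.
# The column names here are hypothetical.
geo = pd.DataFrame(
    {
        "gsm": ["GSM6005004"],
        "data_processing": ["Assembly: GRCm38"],
    }
)

# Left join keeps every SRA row; samples without a GEO match get NaN.
# validate="many_to_one" raises if the GEO table has duplicate GSM IDs.
merged = sra.merge(
    geo,
    how="left",
    left_on="experiment_alias",
    right_on="gsm",
    validate="many_to_one",
).drop(columns="gsm")

print(merged)
```

A left join is used so that runs without a matching GEO record stay in the table instead of being dropped silently.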

Thank you for your work so far!

saketkc · 2023-04-11T14:10:53Z

Thanks, this is a great suggestion! It is doable - once the experiment_alias is fetched pysradb would need to make another request for the corresponding detailed GEO metadata. I currently do not have the bandwidth to do this, but pull requests are always welcome!
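A minimal sketch of what that second step might involve, not pysradb code: it assumes the experiment_alias is a GSM ID, that GEO's plain-text view of a record is reachable through the `targ=self&form=text&view=quick` query parameters, and that fields come back as `!Sample_<field> = <value>` lines. The function names are hypothetical, and the example text reuses the data processing lines quoted in the issue rather than a real download.

```python
from collections import defaultdict
from urllib.parse import urlencode

GEO_ACC_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def geo_soft_url(gsm_id):
    """Build the URL for the plain-text view of one GEO sample record."""
    query = urlencode(
        {"acc": gsm_id, "targ": "self", "form": "text", "view": "quick"}
    )
    return f"{GEO_ACC_URL}?{query}"


def parse_soft_sample(text):
    """Collect '!Sample_<field> = <value>' lines into a dict of lists.

    Fields such as data_processing repeat once per step, so every value
    is stored in a list, in the order it appears.
    """
    fields = defaultdict(list)
    prefix = "!Sample_"
    for line in text.splitlines():
        if not line.startswith(prefix) or " = " not in line:
            continue
        key, value = line.split(" = ", 1)
        fields[key[len(prefix):]].append(value.strip())
    return dict(fields)


# Hand-written example in the assumed layout, not fetched from GEO.
example = """^SAMPLE = GSM6005004
!Sample_data_processing = Sequenced reads were trimmed for adaptor sequence and low-quality sequence (bbduk; minlength=30, qtrim=rl, trimq=15)
!Sample_data_processing = Reads were then mapped to the reference genome of Mus musculus (GRCm38) using STAR aligner version 2.5.3a with parameters --quantMode GeneCounts --runThreadN 4
!Sample_data_processing = Assembly: GRCm38
"""

record = parse_soft_sample(example)
print(geo_soft_url("GSM6005004"))
print(" | ".join(record["data_processing"]))
```

The parser is kept separate from any network call so it can be tested offline; fetching the URL for each experiment_alias and joining the list values into one column would be the remaining work.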

ajandria added the enhancement label Apr 11, 2023

ajandria changed the title ~~[ENH]~~ [ENH] Include data processing steps, reference to which the reads were aligned or if possible lab protocol into the main table Apr 11, 2023
