Output DGE analysis files omit gene names #86

sydjo07 · 2024-04-17T18:00:27Z

Ask away!

This is my first time running this workflow and my DGE analysis tsv files (for example results_dge.tsv) aren't incorporating the gene names in the gene_name column. Instead, the files display NA in the gene_name column and MSTRG in the gene_id column. In the output html file, the gene names are displaying correctly under the differential gene analysis table. Is there a reason that it's omitting this from the tsv output files but not the html output?

For reference, I am working with a non-conventional yeast strain, so I used a non-publicly available reference genome and annotation files. However, I also tested with genome/annotation files from NCBI for a different strain of the same organism and found the same issue. When running the test dataset, the gene names displayed correctly in the tsv files.

sarahjeeeze · 2024-04-19T11:45:44Z

Hi, thanks for raising this. This is a known issue with how stringtie assigns gene_name as a unique identifier. We plan to fix or look into it, hopefully by the next release. See:

prepDE.py pulls around 50% MSTRG as gene_id from Stringtie_merge RNA-Seq · Issue #179 · gpertea/stringtie

Disreprency in counts between MSTRG genes and nonMSTRG genes · Issue #206 · gpertea/stringtie

sydjo07 · 2024-04-23T20:56:09Z

Hi Sarah, thanks for your help! I didn't realize this was a known issue but thanks for pointing me in the right direction.

sarahjeeeze · 2024-05-15T13:33:15Z

Sorry for the delay, this is still on our radar, will hopefully have an improvement soon.

sydjo07 · 2024-05-17T13:43:59Z

Thanks, I appreciate it! I've been able to work around this a bit because I noticed that the unfiltered_tpm_transcript_counts.tsv and the unfiltered_transcript_counts_with_genes.tsv files contain both the proper annotation and their associated MSTRG annotations. I've been able to merge the annotations with the results_dge.tsv to get the proper gene name associations in most cases, although it's not perfect and I know I miss some.
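The merge described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the column names `gene_id` (holding the MSTRG ID) and `gene_name` (holding the reference name) in the counts table are hypothetical and should be checked against the real header of unfiltered_transcript_counts_with_genes.tsv, and the helper names `build_id_map`, `annotate`, and `read_tsv` are made up for this example. Where one MSTRG ID maps to several reference names, the names are joined with `;` so the ambiguity stays visible instead of being silently resolved.

```python
import csv
import io

# Values treated as "no gene name available".
MISSING = {"", "NA"}


def build_id_map(counts_rows, mstrg_col="gene_id", ref_col="gene_name"):
    """Map each MSTRG gene ID to the set of reference names seen with it.

    Column names are assumptions; pass the real ones from your file header.
    """
    id_map = {}
    for row in counts_rows:
        mstrg = row.get(mstrg_col, "")
        ref = row.get(ref_col, "")
        if mstrg and ref not in MISSING:
            id_map.setdefault(mstrg, set()).add(ref)
    return id_map


def annotate(dge_rows, id_map):
    """Return copies of the DGE rows with missing gene_name values filled in.

    Rows whose gene_id has no known reference name are left as they are.
    """
    annotated = []
    for row in dge_rows:
        row = dict(row)  # copy so the input rows are not modified
        if row.get("gene_name", "NA") in MISSING:
            names = id_map.get(row["gene_id"])
            if names:
                row["gene_name"] = ";".join(sorted(names))
        annotated.append(row)
    return annotated


def read_tsv(handle):
    """Read a tab-separated file into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the two real files.
    counts = read_tsv(io.StringIO(
        "gene_id\tgene_name\tsample1\nMSTRG.1\tABC1\t10\nMSTRG.2\tNA\t4\n"))
    dge = read_tsv(io.StringIO(
        "gene_id\tgene_name\tlogFC\nMSTRG.1\tNA\t1.5\nMSTRG.2\tNA\t-0.3\n"))
    for row in annotate(dge, build_id_map(counts)):
        print(row["gene_id"], row["gene_name"])
```

For real files, replace the `io.StringIO(...)` stand-ins with `open(path, newline="")`. Rows that still show NA afterwards are the cases the merge cannot resolve.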

Also to clarify, does the unfiltered_transcript_counts_with_genes.tsv file contain the raw counts before filtering and normalization? If so, then I should be able to use this file as input to EdgeR and generate my own DEG list since it contains the MSTRG to feature_id associations?

sarahjeeeze · 2024-06-06T09:50:02Z

Hi, that's correct: the file contains the counts before filtering and normalisation, so you could use it with EdgeR. We are still working on this; we haven't got round to it yet but will do soon!
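If a gene-level analysis is wanted, one possible preparation step before loading the data into EdgeR is to sum the raw transcript counts per gene. The sketch below is an assumption-laden illustration, not the workflow's own method: the thread does not say how the pipeline aggregates counts, the column names (`gene_id`, `s1`, `s2`) are hypothetical, and `gene_level_counts` and `write_matrix` are helper names invented for this example.

```python
import csv
import io
from collections import defaultdict


def gene_level_counts(rows, gene_col, sample_cols):
    """Sum per-transcript counts into one row of totals per gene.

    rows: dicts as produced by csv.DictReader (values are strings).
    gene_col and sample_cols are assumptions; use your file's real header names.
    """
    totals = defaultdict(lambda: dict.fromkeys(sample_cols, 0.0))
    for row in rows:
        gene_totals = totals[row[gene_col]]
        for sample in sample_cols:
            gene_totals[sample] += float(row[sample])
    return dict(totals)


def write_matrix(totals, sample_cols, handle):
    """Write a genes x samples TSV with a gene_id column first."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_id", *sample_cols])
    for gene in sorted(totals):
        writer.writerow([gene, *(totals[gene][s] for s in sample_cols)])


if __name__ == "__main__":
    # Three transcripts belonging to two genes, two samples.
    rows = [
        {"gene_id": "MSTRG.1", "s1": "3", "s2": "0"},
        {"gene_id": "MSTRG.1", "s1": "2", "s2": "5"},
        {"gene_id": "MSTRG.2", "s1": "1", "s2": "1"},
    ]
    out = io.StringIO()
    write_matrix(gene_level_counts(rows, "gene_id", ["s1", "s2"]), ["s1", "s2"], out)
    print(out.getvalue())
```

The resulting table can be read into R as a count matrix. Filtering and normalisation are left to EdgeR itself, since the input file is described above as containing counts before both steps.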

kfletcherelo · 2024-08-09T13:56:48Z

I have a question further to this issue, perhaps either of you can help? I notice that there seem to be three types of genes in the de_analysis output:

1. gene_id = gene_name & gene_id ~ /MSTRG/
2. gene_id ~ /MSTRG/ & gene_name = NULL
3. gene_id & gene_name match entries in provided annotation

My assumption is that:

1. was assembled by stringtie and not present in the annotation - sequence can be extracted from final_non_redundant_transcriptome.fasta using the stringtie ID
2. was assembled by stringtie and was present in the annotation - sequence cannot be extracted from final_non_redundant_transcriptome.fasta using the stringtie ID, instead ID should be found in unfiltered_transcript_counts_with_genes.tsv
3. was not assembled by stringtie but had reads mapping to it for DGE so GFF id was used?

Are my assumptions correct or am I missing something?
I am not sure the third classification makes sense, but I am also not sure how else it could come about.
Thanks
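To see how many genes fall into each of the three types, the rows of a de_analysis table could be tallied with something like the sketch below. It only encodes the patterns listed above; the category labels and the function name `classify` are invented for this example, and treating `NA`, `NULL`, and the empty string as a missing name is an assumption about how the table represents it.

```python
from collections import Counter

# Representations of a missing gene_name; an assumption about the file format.
MISSING = {None, "", "NA", "NULL"}


def classify(gene_id, gene_name):
    """Assign a row to one of the three observed gene types.

    Labels are hypothetical names for the three patterns in the list above.
    """
    is_mstrg = "MSTRG" in gene_id
    if is_mstrg and gene_name in MISSING:
        return "stringtie_id_no_name"   # type 2: MSTRG ID, no gene name
    if is_mstrg and gene_name == gene_id:
        return "stringtie_id_as_name"   # type 1: MSTRG ID reused as the name
    if not is_mstrg:
        return "annotation_id"          # type 3: IDs as in the provided annotation
    return "other"                      # MSTRG ID paired with a different name


if __name__ == "__main__":
    rows = [
        ("MSTRG.5", "MSTRG.5"),
        ("MSTRG.8", "NA"),
        ("gene-ABC1", "ABC1"),
        ("MSTRG.9", "NA"),
    ]
    tally = Counter(classify(gene_id, name) for gene_id, name in rows)
    for label, count in sorted(tally.items()):
        print(label, count)
```

The `other` bucket catches any row that fits none of the three patterns, which is a cheap way to check whether the three-way split is exhaustive for a given dataset.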

sarahjeeeze · 2024-11-04T16:16:25Z

Yes, I think your assumptions are correct. We will aim to make this clearer in the documentation in future.

sydjo07 added the question Further information is requested label Apr 17, 2024
