Output DGE analysis files omit gene names #86

Open
sydjo07 opened this issue Apr 17, 2024 · 7 comments
Labels
question Further information is requested

Comments

@sydjo07

sydjo07 commented Apr 17, 2024

Ask away!

This is my first time running this workflow and my DGE analysis tsv files (for example results_dge.tsv) aren't incorporating the gene names in the gene_name column. Instead, the files display NA in the gene_name column and MSTRG in the gene_id column. In the output html file, the gene names are displaying correctly under the differential gene analysis table. Is there a reason that it's omitting this from the tsv output files but not the html output?

For reference, I am working with a non-conventional yeast strain, so I used a non-publicly available reference genome and annotation files. However, I also tested with genome/annotation files from NCBI for a different strain of the same organism and found the same issue. When running the test dataset, I found the gene names displaying correctly in the tsv files.

@sydjo07 sydjo07 added the question Further information is requested label Apr 17, 2024
@sarahjeeeze
Contributor

sarahjeeeze commented Apr 19, 2024

Hi, thanks for raising this. This is a known issue with how stringtie assigns gene_name as a unique identifier. We plan to look into a fix, hopefully by the next release. See:

prepDE.py pulls around 50% MSTRG as gene_id from Stringtie_merge RNA-Seq · Issue #179 · gpertea/stringtie

Disreprency in counts between MSTRG genes and nonMSTRG genes · Issue #206 · gpertea/stringtie

@sydjo07
Author

sydjo07 commented Apr 23, 2024

Hi Sarah, thanks for your help! I didn't realize this was a known issue but thanks for pointing me in the right direction.

@sarahjeeeze
Contributor

Sorry for the delay. This is still on our radar, and we will hopefully have an improvement soon.

@sydjo07
Author

sydjo07 commented May 17, 2024

Thanks, I appreciate it! I've been able to work around this a bit because I noticed that the unfiltered_tpm_transcript_counts.tsv and the unfiltered_transcript_counts_with_genes.tsv files contain both the proper annotation and their associated MSTRG annotations. I've been able to merge the annotations with the results_dge.tsv to get the proper gene name associations in most cases, although it's not perfect and I know I miss some.
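
A minimal sketch of that kind of merge, using only the Python standard library and small inline tables in place of the real files. Only the gene_id and gene_name columns are taken from this thread; ref_gene_name and logFC are hypothetical column names, so check the actual headers before adapting it. MSTRG IDs that map to more than one annotated name are reported rather than guessed, which is one likely source of the missed cases.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for results_dge.tsv and
# unfiltered_transcript_counts_with_genes.tsv.
# ASSUMPTION: ref_gene_name and logFC are hypothetical column names.
DGE_TSV = (
    "gene_id\tgene_name\tlogFC\n"
    "MSTRG.1\tNA\t1.5\n"
    "MSTRG.2\tNA\t-0.7\n"
    "MSTRG.3\tNA\t2.1\n"
)
COUNTS_TSV = (
    "feature_id\tgene_id\tref_gene_name\n"
    "tx1\tMSTRG.1\tADH1\n"
    "tx2\tMSTRG.1\tADH1\n"
    "tx3\tMSTRG.2\tPDC1\n"
    "tx4\tMSTRG.2\tPDC5\n"
    "tx5\tMSTRG.3\tNA\n"
)

MISSING = ("", "NA")


def read_tsv(text):
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def build_name_map(count_rows):
    """Map each MSTRG gene_id to its annotated name when exactly one exists."""
    names = defaultdict(set)
    for row in count_rows:
        if row["ref_gene_name"] not in MISSING:
            names[row["gene_id"]].add(row["ref_gene_name"])
    unique = {g: next(iter(n)) for g, n in names.items() if len(n) == 1}
    ambiguous = sorted(g for g, n in names.items() if len(n) > 1)
    return unique, ambiguous


def fill_gene_names(dge_rows, name_map):
    """Return copies of the DGE rows with missing gene_name values filled."""
    filled = []
    for row in dge_rows:
        row = dict(row)
        if row["gene_name"] in MISSING:
            row["gene_name"] = name_map.get(row["gene_id"], "NA")
        filled.append(row)
    return filled


name_map, ambiguous = build_name_map(read_tsv(COUNTS_TSV))
annotated = fill_gene_names(read_tsv(DGE_TSV), name_map)
for row in annotated:
    print(row["gene_id"], row["gene_name"], row["logFC"], sep="\t")
print("ambiguous:", ambiguous)
```

In this toy input, MSTRG.2 spans two annotated genes and MSTRG.3 has no annotated name, so both stay NA and need manual review.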

Also to clarify, does the unfiltered_transcript_counts_with_genes.tsv file contain the raw counts before filtering and normalization? If so, then I should be able to use this file as input to edgeR and generate my own DEG list, since it contains the MSTRG to feature_id associations?

@sarahjeeeze
Contributor

Hi, that's correct: it is before filtering and normalisation, so you could use it with edgeR. We are still working on this; we haven't got round to it yet but will do soon!
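
If gene-level counts are wanted for edgeR, the transcript-level rows first need to be summed per gene. The sketch below shows one way to do that in Python on a small inline table. The sample column names (sample_A, sample_B) and the exact layout of the real file are assumptions, and this is not necessarily how the workflow itself aggregates counts.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for unfiltered_transcript_counts_with_genes.tsv.
# ASSUMPTION: two identifier columns followed by one column per sample.
COUNTS_TSV = (
    "feature_id\tgene_id\tsample_A\tsample_B\n"
    "tx1\tMSTRG.1\t10\t4\n"
    "tx2\tMSTRG.1\t5\t0\n"
    "tx3\tMSTRG.2\t7\t9\n"
)


def gene_level_counts(text, id_columns=("feature_id", "gene_id")):
    """Sum transcript-level counts per gene_id for every sample column."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    samples = [c for c in reader.fieldnames if c not in id_columns]
    totals = defaultdict(lambda: [0.0] * len(samples))
    for row in reader:
        for i, sample in enumerate(samples):
            # Counts are parsed as floats because estimated counts
            # are not always whole numbers.
            totals[row["gene_id"]][i] += float(row[sample])
    return samples, dict(totals)


def to_tsv(samples, totals):
    """Render the gene-by-sample matrix as tab-separated text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_id"] + samples)
    for gene_id in sorted(totals):
        writer.writerow([gene_id] + totals[gene_id])
    return out.getvalue()


samples, totals = gene_level_counts(COUNTS_TSV)
print(to_tsv(samples, totals))
```

The resulting table has one row per gene and one column per sample, which is the usual shape for a count matrix passed to a DGE tool.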

@kfletcherelo

I have a question further to this issue; perhaps either of you can help? I notice that there seem to be three types of genes in the de_analysis output:

  1. gene_id = gene_name & gene_id ~ /MSTRG/
  2. gene_id ~ /MSTRG/ & gene_name = NULL
  3. gene_id & gene_name match entries in provided annotation

My assumption, for each type in turn, is that it:

  1. was assembled by stringtie and not present in the annotation - sequence can be extracted from final_non_redundant_transcriptome.fasta using the stringtie ID
  2. was assembled by stringtie and was present in the annotation - sequence cannot be extracted from final_non_redundant_transcriptome.fasta using the stringtie ID, instead ID should be found in unfiltered_transcript_counts_with_genes.tsv
  3. was not assembled by stringtie but had reads mapping to it for DGE so GFF id was used?

Are my assumptions correct or am I missing something?
I am not sure the third classification makes sense, but I am also not sure how else it could come about.
Thanks
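
The three types described above can be separated mechanically, which makes it easy to count how many rows fall into each. This is a rough sketch on made-up rows: it treats any non-MSTRG gene_id with a gene_name as the third type without checking it against the annotation, and the set of values treated as missing is an assumption.

```python
from collections import Counter

# ASSUMPTION: these are the ways a missing gene_name may be written.
MISSING = {None, "", "NA", "NULL"}


def classify(gene_id, gene_name):
    """Assign a row to one of the three types from the list above."""
    is_mstrg = "MSTRG" in gene_id
    has_name = gene_name not in MISSING
    if is_mstrg and gene_name == gene_id:
        return "type1"  # gene_id = gene_name, both MSTRG
    if is_mstrg and not has_name:
        return "type2"  # MSTRG gene_id, no gene_name
    if not is_mstrg and has_name:
        return "type3"  # presumed to come from the provided annotation
    return "other"      # anything the three types do not cover


# Hypothetical (gene_id, gene_name) pairs for illustration only.
rows = [
    ("MSTRG.10", "MSTRG.10"),
    ("MSTRG.11", "NA"),
    ("gene-ABC1", "ABC1"),
    ("MSTRG.12", "XYZ2"),
]

tally = Counter(classify(gene_id, gene_name) for gene_id, gene_name in rows)
for label in ("type1", "type2", "type3", "other"):
    print(label, tally[label], sep="\t")
```

A non-zero "other" count would point to rows that fit none of the three descriptions, such as an MSTRG gene_id paired with an annotated gene_name.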

@sarahjeeeze
Contributor

Yes, I think your assumptions are correct. We will aim to make this clearer in the documentation in future.
