how to get log(TPM+1) values #44

Open
sunta3iouxos opened this issue Oct 28, 2024 · 9 comments

Comments

@sunta3iouxos

Thank you for this tool.
I am new to TCGA data and would like to download TPM-normalised values so that I can combine them with my own RNA-seq data. Since I want to run GSVA, I think TPM is more appropriate for my needs than the percentile ranking.
Following some tutorials, I got values that look more scaled than TPM-normalised.
I want to use the data for GSVA or singscore.
Is there a way to accomplish this with UCSCXenaTools?
This is the code (taken from https://github.com/XSLiuLab/tumor-immunogenicity-score):

library(UCSCXenaTools)
library(dplyr)
xe <- XenaGenerate(subset = XenaHostNames == "tcgaHub")
xe %>% XenaFilter(filterDatasets = "clinical") -> xe_clinical
xe %>% XenaFilter(filterDatasets = "HiSeqV2_PANCAN$") -> xe_rna_pancan
#Create data queries and download them:
# download_xena_pancan, eval=FALSE
xe_clinical.query <- XenaQuery(xe_clinical)
xe_clinical.download <- XenaDownload(xe_clinical.query,
  destdir = "UCSC_Xena/TCGA/Clinical", trans_slash = TRUE, force = TRUE
)

xe_rna_pancan.query <- XenaQuery(xe_rna_pancan)
xe_rna_pancan.download <- XenaDownload(xe_rna_pancan.query,
  destdir = "UCSC_Xena/TCGA/RNAseq_Pancan", trans_slash = TRUE
)
# hide_download_pancan, include=FALSE
if (!dir.exists("UCSC_Xena")) {
  xe_clinical.query <- XenaQuery(xe_clinical)
  xe_clinical.download <- XenaDownload(xe_clinical.query,
    destdir = "UCSC_Xena/TCGA/Clinical", trans_slash = TRUE
  )

  xe_rna_pancan.query <- XenaQuery(xe_rna_pancan)
  xe_rna_pancan.download <- XenaDownload(xe_rna_pancan.query,
    destdir = "UCSC_Xena/TCGA/RNAseq_Pancan", trans_slash = TRUE
  )
}

The author of the code mentions:
The RNASeq data we downloaded are pancan normalized. For comparing data within independent cohort (like TCGA-LUAD), we recommend to use the "gene expression RNAseq" dataset. For questions regarding the gene expression of this particular cohort in relation to other types tumors, you can use the pancan normalized version of the "gene expression RNAseq" data. For comparing with data outside TCGA, we recommend using the percentile version if the non-TCGA data is normalized by percentile ranking. For more information, please see our Data FAQ: [here](https://docs.google.com/document/d/1q-7Tkzd7pci4Rz-_IswASRMRzYrbgx1FTTfAWOyHbmk/edit?usp=sharing)

Do you have any recommendations on this?
Theodoros


Thanks for reporting, Shixiang will reply as soon as possible:)

@ShixiangWang
Member

Hi, for simple datasets, you can find the count data in the GDC hub and transform it to TPM.
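
The counts-to-TPM step mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the maintainers' method: `counts_to_log2_tpm` and `gene_length_bp` are hypothetical names, the gene lengths have to come from an annotation that matches the count file (they are not part of it), and the input is assumed to hold raw counts. The GDC hub count files appear to be stored as log2(count + 1), in which case they would first need back-transforming with `2^x - 1`; please check the unit on the dataset page.

```r
# Sketch: convert a gene-by-sample matrix of raw counts to log2(TPM + 1).
# `gene_length_bp` is a hypothetical vector of gene lengths in base pairs,
# ordered like the rows of `counts`.
counts_to_log2_tpm <- function(counts, gene_length_bp) {
  stopifnot(nrow(counts) == length(gene_length_bp))
  # Reads per kilobase: each row (gene) is divided by its length in kb.
  rate <- counts / (gene_length_bp / 1000)
  # Scale every sample (column) so that its TPM values sum to one million.
  tpm <- t(t(rate) / colSums(rate)) * 1e6
  log2(tpm + 1)
}

# Toy data, made up purely for illustration.
counts <- matrix(
  c(10, 20, 30, 0, 5, 15),
  nrow = 3,
  dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))
)
gene_length_bp <- c(g1 = 1000, g2 = 2000, g3 = 3000)
log2_tpm <- counts_to_log2_tpm(counts, gene_length_bp)
```

Because TPM is rescaled within each sample, the back-transformed values of every column sum to one million, which is a quick sanity check on real data.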

@sunta3iouxos
Author

Thank you for this, but it seems that I cannot download the counts:

library(UCSCXenaTools)
library(dplyr) # provides %>%
XE <- XenaGenerate(subset = XenaHostNames == "gdcHub")
XE %>% XenaFilter(filterDatasets = "clinical") -> XE_clinical
XE %>% XenaFilter(filterDatasets = "htseq_counts") -> XE_rna_counts
# download from the GDC hub
# download clinical information, this one works
XE_clinical.query <- XenaQuery(XE_clinical)
XE_clinical.download <- XenaDownload(XE_clinical.query,
                                     destdir = "UCSC_Xena/TCGA/counts_Clinical", trans_slash = TRUE, force = TRUE
)
# try to download the counts
XE_rna_counts.query <- XenaQuery(XE_rna_counts)
XE_rna_counts.download <- XenaDownload(XE_rna_counts.query,
                                       destdir = "UCSC_Xena/TCGA/counts_RNAseq", trans_slash = TRUE
)
# only download if the data directory does not exist yet
if (!dir.exists("UCSC_Xena")) {
    XE_clinical.query <- XenaQuery(XE_clinical)
    XE_clinical.download <- XenaDownload(XE_clinical.query,
                                         destdir = "UCSC_Xena/TCGA/counts_Clinical", trans_slash = TRUE
    )

    XE_rna_counts.query <- XenaQuery(XE_rna_counts)
    XE_rna_counts.download <- XenaDownload(XE_rna_counts.query,
                                           destdir = "UCSC_Xena/TCGA/counts_RNAseq", trans_slash = TRUE
    )
}

Downloading all of the GDC counts fails:

Downloading TCGA-LAML.htseq_counts.tsv.gz
trying URL 'https://gdc.xenahubs.net/download/TCGA-LAML.htseq_counts.tsv.gz'
==> Trying #2
trying URL 'https://gdc.xenahubs.net/download/TCGA-LAML.htseq_counts.tsv.gz'
==> Trying #3
trying URL 'https://gdc.xenahubs.net/download/TCGA-LAML.htseq_counts.tsv.gz'
Tried 3 times but failed, please check your internet connection!

This is what the query looks like:

> head(XE_rna_pancan.download)
                     hosts                       datasets
1 https://gdc.xenahubs.net     TCGA-BLCA.htseq_counts.tsv
2 https://gdc.xenahubs.net     TCGA-LUSC.htseq_counts.tsv
3 https://gdc.xenahubs.net     TCGA-ESCA.htseq_counts.tsv
4 https://gdc.xenahubs.net     TARGET-RT.htseq_counts.tsv
5 https://gdc.xenahubs.net MMRF-COMMPASS.htseq_counts.tsv
6 https://gdc.xenahubs.net     TCGA-MESO.htseq_counts.tsv
                                                                  url                         fileNames
1     https://gdc.xenahubs.net/download/TCGA-BLCA.htseq_counts.tsv.gz     TCGA-BLCA.htseq_counts.tsv.gz
2     https://gdc.xenahubs.net/download/TCGA-LUSC.htseq_counts.tsv.gz     TCGA-LUSC.htseq_counts.tsv.gz
3     https://gdc.xenahubs.net/download/TCGA-ESCA.htseq_counts.tsv.gz     TCGA-ESCA.htseq_counts.tsv.gz
4     https://gdc.xenahubs.net/download/TARGET-RT.htseq_counts.tsv.gz     TARGET-RT.htseq_counts.tsv.gz
5 https://gdc.xenahubs.net/download/MMRF-COMMPASS.htseq_counts.tsv.gz MMRF-COMMPASS.htseq_counts.tsv.gz
6     https://gdc.xenahubs.net/download/TCGA-MESO.htseq_counts.tsv.gz     TCGA-MESO.htseq_counts.tsv.gz
                                                       destfiles
1     UCSC_Xena/TCGA/counts_RNAseq/TCGA-BLCA.htseq_counts.tsv.gz
2     UCSC_Xena/TCGA/counts_RNAseq/TCGA-LUSC.htseq_counts.tsv.gz
3     UCSC_Xena/TCGA/counts_RNAseq/TCGA-ESCA.htseq_counts.tsv.gz
4     UCSC_Xena/TCGA/counts_RNAseq/TARGET-RT.htseq_counts.tsv.gz
5 UCSC_Xena/TCGA/counts_RNAseq/MMRF-COMMPASS.htseq_counts.tsv.gz
6     UCSC_Xena/TCGA/counts_RNAseq/TCGA-MESO.htseq_counts.tsv.gz

@sunta3iouxos
Author

How can I get those counts using the Xena tools?

https://xenabrowser.net/datapages/?dataset=tcga_RSEM_gene_tpm&host=https%3A%2F%2Ftoil.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
This is what I am looking for: RSEM values as log(TPM+1).
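
One possible way to fetch the dataset linked above is sketched here. It rests on several assumptions: that the `toilHub` host and the `tcga_RSEM_gene_tpm` dataset are listed in the installed version of UCSCXenaTools, and that the values are stored as log2(TPM + 0.001), as the dataset page appears to state. `download_toil_tpm` and `toil_to_log2_tpm1` are hypothetical helper names. The download is wrapped in a function and not called, because the file is large.

```r
# Hypothetical helper: fetch the Toil TCGA TPM matrix via UCSCXenaTools.
# Not run here; it needs the package, network access and ample disk space.
download_toil_tpm <- function(destdir = "UCSC_Xena/TCGA/toil_TPM") {
  hub <- UCSCXenaTools::XenaGenerate(subset = XenaHostNames == "toilHub")
  ds <- UCSCXenaTools::XenaFilter(hub, filterDatasets = "^tcga_RSEM_gene_tpm$")
  query <- UCSCXenaTools::XenaQuery(ds)
  dl <- UCSCXenaTools::XenaDownload(query, destdir = destdir, trans_slash = TRUE)
  UCSCXenaTools::XenaPrepare(dl)
}

# Assumption: values are stored as log2(TPM + pseudo) with pseudo = 0.001.
toil_to_log2_tpm1 <- function(x, pseudo = 0.001) {
  tpm <- pmax(2^x - pseudo, 0) # undo the stored transform, clamp rounding noise
  log2(tpm + 1)
}

# Made-up stored values corresponding to TPM = 0, 7 and 1023.
stored <- log2(c(0, 7, 1023) + 0.001)
converted <- toil_to_log2_tpm1(stored)
```

If the stored unit turns out to differ, only the `pseudo` argument or the back-transform needs to change.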

@ShixiangWang
Member

Hi @sunta3iouxos , please rerun the code with the latest version from GitHub

remotes::install_github("ropensci/UCSCXenaTools")


Also change XE <- XenaGenerate(subset = XenaHostNames == "gdcHub") to XE <- XenaGenerate(subset = XenaHostNames == "gdcHubV18"), as UCSC Xena has updated the data source.

@sunta3iouxos
Author

I will do so and report back.

@sunta3iouxos
Author

sunta3iouxos commented Nov 12, 2024

This one works.
Could you please help with this:
"For comparing data within independent cohort (like TCGA-LUAD), we recommend to use the "gene expression RNAseq" dataset. For questions regarding the gene expression of this particular cohort in relation to other types tumors, you can use the pancan normalized version of the "gene expression RNAseq" data. For comparing with data outside TCGA, we recommend using the percentile version if the non-TCGA data is normalized by percentile ranking. For more information, please see our Data FAQ: here."
I understand that TCGA normalises the data with the EB++ algorithm to avoid batch effects, but they also state that if you need to add your own dataset, it may be better to normalise by percentile ranking. Any pointers on how to do this?
I have never normalised data using that approach.

Is this approach related to the following?
https://www.nature.com/articles/s41598-020-72664-6#Sec2
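
For reference, per-sample percentile ranking can be sketched as below. This is only one plausible reading of "percentile ranking" (ranking the genes within each sample onto a 0 to 100 scale); it is not confirmed to be the exact procedure behind the TCGA percentile dataset, and `percentile_rank` is a hypothetical name.

```r
# Sketch: rank genes within each sample and rescale the ranks to 0-100.
# `expr` is a gene-by-sample numeric matrix with at least two genes.
percentile_rank <- function(expr) {
  apply(expr, 2, function(x) {
    r <- rank(x, ties.method = "average") # ties share their mean rank
    (r - 1) / (length(x) - 1) * 100
  })
}

# Toy data, made up purely for illustration.
expr <- matrix(
  c(5, 1, 3, 10, 30, 20),
  nrow = 3,
  dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))
)
pr <- percentile_rank(expr)
```

Since only the within-sample ordering is used, the result does not depend on whether the input is counts, TPM or log-transformed values, which is why this kind of scaling is sometimes used to put datasets from different pipelines on a common footing. Both datasets would need to be restricted to the same gene set before ranking.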
