how to get log(TPM+1) values #44

sunta3iouxos · 2024-10-28T16:11:42Z

Thank you for this tool.
I am new to TCGA data, but I would like to do some analysis, and I want to download TPM-normalised values so that I can combine them with my own RNA-seq data. Since I want to run GSVA, I think TPM should be more appropriate than the percentile ranking.
Following some tutorials, I got values that look scaled rather than TPM-normalised.
I want to use the data for GSVA or singscore.
Is there a way to accomplish this with UCSCXenaTools?
This is the code (taken from https://github.com/XSLiuLab/tumor-immunogenicity-score):

library(UCSCXenaTools)
library(dplyr)
xe <- XenaGenerate(subset = XenaHostNames == "tcgaHub")
xe %>% XenaFilter(filterDatasets = "clinical") -> xe_clinical
xe %>% XenaFilter(filterDatasets = "HiSeqV2_PANCAN$") -> xe_rna_pancan
#Create data queries and download them:
# download_xena_pancan, eval=FALSE
xe_clinical.query <- XenaQuery(xe_clinical)
xe_clinical.download <- XenaDownload(xe_clinical.query,
  destdir = "UCSC_Xena/TCGA/Clinical", trans_slash = TRUE, force = TRUE
)

xe_rna_pancan.query <- XenaQuery(xe_rna_pancan)
xe_rna_pancan.download <- XenaDownload(xe_rna_pancan.query,
  destdir = "UCSC_Xena/TCGA/RNAseq_Pancan", trans_slash = TRUE
)
# hide_download_pancan, include=FALSE
if (!dir.exists("UCSC_Xena")) {
  xe_clinical.query <- XenaQuery(xe_clinical)
  xe_clinical.download <- XenaDownload(xe_clinical.query,
    destdir = "UCSC_Xena/TCGA/Clinical", trans_slash = TRUE
  )

  xe_rna_pancan.query <- XenaQuery(xe_rna_pancan)
  xe_rna_pancan.download <- XenaDownload(xe_rna_pancan.query,
    destdir = "UCSC_Xena/TCGA/RNAseq_Pancan", trans_slash = TRUE
  )
}

The author of the code mentions:
The RNASeq data we downloaded are pancan normalized. For comparing data within independent cohort (like TCGA-LUAD), we recommend to use the "gene expression RNAseq" dataset. For questions regarding the gene expression of this particular cohort in relation to other types tumors, you can use the pancan normalized version of the "gene expression RNAseq" data. For comparing with data outside TCGA, we recommend using the percentile version if the non-TCGA data is normalized by percentile ranking. For more information, please see our Data FAQ: [here](https://docs.google.com/document/d/1q-7Tkzd7pci4Rz-_IswASRMRzYrbgx1FTTfAWOyHbmk/edit?usp=sharing)

Do you have any recommendations on this?
Theodoros

github-actions · 2024-10-28T16:12:09Z

Thanks for reporting, Shixiang will reply as soon as possible:)

ShixiangWang · 2024-10-29T07:44:01Z

Hi, for simple datasets, you can find the count data in the gdc hub, and transform it into TPM format.

Example: https://xenabrowser.net/datapages/?dataset=TCGA-GBM.star_counts.tsv&host=https%3A%2F%2Fgdc.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
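
The count-to-TPM step suggested above could be sketched roughly as follows. This is a minimal sketch under assumptions, not the maintainer's method: it assumes the GDC hub matrix stores values as log2(count+1) (the dataset page should confirm the unit), and it assumes you supply a gene-length vector yourself. The helper names `xena_log2_to_counts`, `counts_to_tpm`, and the argument `gene_length_bp` are hypothetical.

```r
# Assumption: Xena GDC count matrices are stored as log2(count + 1).
# Undo that transform to recover approximate raw counts.
xena_log2_to_counts <- function(x) {
  pmax(2^x - 1, 0)
}

# Convert a genes x samples count matrix to TPM.
# gene_length_bp is a hypothetical user-supplied vector of gene lengths
# in base pairs, in the same order as the rows of `counts`.
counts_to_tpm <- function(counts, gene_length_bp) {
  stopifnot(nrow(counts) == length(gene_length_bp), all(gene_length_bp > 0))
  # Reads per kilobase: each row is divided by its own gene length
  rate <- counts / (gene_length_bp / 1000)
  # Scale each sample (column) so that its values sum to one million
  sweep(rate, 2, colSums(rate), "/") * 1e6
}

# The value asked for in this issue, using log2 as the log base
log2_tpm_plus_one <- function(tpm) {
  log2(tpm + 1)
}
```

A typical call would be `log2_tpm_plus_one(counts_to_tpm(xena_log2_to_counts(m), len))`, where `m` is the downloaded matrix with gene IDs moved to row names and `len` holds the matching gene lengths. TPM values computed this way depend on which gene-length definition is used, so they may not match RSEM-derived TPM exactly.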

sunta3iouxos · 2024-10-29T10:13:36Z

Thank you for this, but it seems that I cannot download the counts:

library(UCSCXenaTools)
library(dplyr) # provides the %>% pipe used below
XE <- XenaGenerate(subset = XenaHostNames == "gdcHub")
XE %>% XenaFilter(filterDatasets = "clinical") -> XE_clinical
XE %>% XenaFilter(filterDatasets = "htseq_counts") -> XE_rna_counts
# download from the GDC hub
# download clinical information, this one works
XE_clinical.query <- XenaQuery(XE_clinical)
XE_clinical.download <- XenaDownload(XE_clinical.query,
                                     destdir = "UCSC_Xena/TCGA/counts_Clinical", trans_slash = TRUE, force = TRUE
)
# try to download the counts
XE_rna_counts.query <- XenaQuery(XE_rna_counts)
XE_rna_counts.download <- XenaDownload(XE_rna_counts.query,
                                       destdir = "UCSC_Xena/TCGA/counts_RNAseq", trans_slash = TRUE
)
if (!dir.exists("UCSC_Xena")) {
    XE_clinical.query <- XenaQuery(XE_clinical)
    XE_clinical.download <- XenaDownload(XE_clinical.query,
                                         destdir = "UCSC_Xena/TCGA/counts_Clinical", trans_slash = TRUE
    )

    # use the counts object defined above (XE_rna_pancan was never created in this script)
    XE_rna_counts.query <- XenaQuery(XE_rna_counts)
    XE_rna_counts.download <- XenaDownload(XE_rna_counts.query,
                                           destdir = "UCSC_Xena/TCGA/counts_RNAseq", trans_slash = TRUE
    )
}

Downloading of all GDC counts fails:

Downloading TCGA-LAML.htseq_counts.tsv.gz
trying URL 'https://gdc.xenahubs.net/download/TCGA-LAML.htseq_counts.tsv.gz'
==> Trying #2
trying URL 'https://gdc.xenahubs.net/download/TCGA-LAML.htseq_counts.tsv.gz'
==> Trying #3
trying URL 'https://gdc.xenahubs.net/download/TCGA-LAML.htseq_counts.tsv.gz'
Tried 3 times but failed, please check your internet connection!

This is what the query looks like:

> head(XE_rna_counts.download)
                     hosts                       datasets
1 https://gdc.xenahubs.net     TCGA-BLCA.htseq_counts.tsv
2 https://gdc.xenahubs.net     TCGA-LUSC.htseq_counts.tsv
3 https://gdc.xenahubs.net     TCGA-ESCA.htseq_counts.tsv
4 https://gdc.xenahubs.net     TARGET-RT.htseq_counts.tsv
5 https://gdc.xenahubs.net MMRF-COMMPASS.htseq_counts.tsv
6 https://gdc.xenahubs.net     TCGA-MESO.htseq_counts.tsv
                                                                  url                         fileNames
1     https://gdc.xenahubs.net/download/TCGA-BLCA.htseq_counts.tsv.gz     TCGA-BLCA.htseq_counts.tsv.gz
2     https://gdc.xenahubs.net/download/TCGA-LUSC.htseq_counts.tsv.gz     TCGA-LUSC.htseq_counts.tsv.gz
3     https://gdc.xenahubs.net/download/TCGA-ESCA.htseq_counts.tsv.gz     TCGA-ESCA.htseq_counts.tsv.gz
4     https://gdc.xenahubs.net/download/TARGET-RT.htseq_counts.tsv.gz     TARGET-RT.htseq_counts.tsv.gz
5 https://gdc.xenahubs.net/download/MMRF-COMMPASS.htseq_counts.tsv.gz MMRF-COMMPASS.htseq_counts.tsv.gz
6     https://gdc.xenahubs.net/download/TCGA-MESO.htseq_counts.tsv.gz     TCGA-MESO.htseq_counts.tsv.gz
                                                       destfiles
1     UCSC_Xena/TCGA/counts_RNAseq/TCGA-BLCA.htseq_counts.tsv.gz
2     UCSC_Xena/TCGA/counts_RNAseq/TCGA-LUSC.htseq_counts.tsv.gz
3     UCSC_Xena/TCGA/counts_RNAseq/TCGA-ESCA.htseq_counts.tsv.gz
4     UCSC_Xena/TCGA/counts_RNAseq/TARGET-RT.htseq_counts.tsv.gz
5 UCSC_Xena/TCGA/counts_RNAseq/MMRF-COMMPASS.htseq_counts.tsv.gz
6     UCSC_Xena/TCGA/counts_RNAseq/TCGA-MESO.htseq_counts.tsv.gz

sunta3iouxos · 2024-10-29T13:23:23Z

How can I get those counts using the Xena tools?

https://xenabrowser.net/datapages/?dataset=tcga_RSEM_gene_tpm&host=https%3A%2F%2Ftoil.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
This is what I am looking for: RSEM values as log(TPM+1).
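
For the Toil dataset linked above, a download through UCSCXenaTools might look like the sketch below. It is untested against the live hub and rests on two assumptions: that the host is exposed under the name "toilHub", and that the matrix is stored as log2(TPM + 0.001), as the dataset page should state. The function names `download_toil_tpm` and `toil_to_log2_tpm1` are hypothetical helpers, not part of the package.

```r
# Sketch only: needs network access and the UCSCXenaTools package,
# so the function is defined here but not called.
download_toil_tpm <- function(destdir = "UCSC_Xena/TCGA/toil_tpm") {
  xe <- UCSCXenaTools::XenaGenerate(subset = XenaHostNames == "toilHub")
  xe <- UCSCXenaTools::XenaFilter(xe, filterDatasets = "^tcga_RSEM_gene_tpm$")
  query <- UCSCXenaTools::XenaQuery(xe)
  UCSCXenaTools::XenaDownload(query, destdir = destdir, trans_slash = TRUE)
}

# Assumption: stored values are log2(TPM + 0.001).
# Convert them to log2(TPM + 1) by undoing the small pseudocount first.
toil_to_log2_tpm1 <- function(x, pseudo = 0.001) {
  tpm <- pmax(2^x - pseudo, 0)
  log2(tpm + 1)
}
```

The object returned by `download_toil_tpm()` can be passed to `XenaPrepare()` to load the table, after which `toil_to_log2_tpm1()` would be applied to the numeric columns only. The file covers all TCGA cohorts and may be large.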

ShixiangWang · 2024-10-30T08:10:26Z

Hi @sunta3iouxos , please rerun the code with the latest version from GitHub

remotes::install_github("ropensci/UCSCXenaTools")

ShixiangWang · 2024-10-30T08:21:56Z

Hi @sunta3iouxos, please rerun the code with the latest version from GitHub:
remotes::install_github("ropensci/UCSCXenaTools")

Also change XE <- XenaGenerate(subset = XenaHostNames == "gdcHub") to XE <- XenaGenerate(subset = XenaHostNames == "gdcHubV18"), because UCSC Xena updated the data source.
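
Put together, the adjusted download could be sketched as below. This only combines the host name given in this comment with the dataset pattern from the earlier script; the wrapper `download_gdc_counts` and its default arguments are hypothetical, and the code has not been run against the hub here.

```r
# Sketch only: needs network access and a recent UCSCXenaTools install,
# so the function is defined here but not called.
download_gdc_counts <- function(host = "gdcHubV18",
                                pattern = "htseq_counts",
                                destdir = "UCSC_Xena/TCGA/counts_RNAseq") {
  # Select every dataset on the requested hub
  xe <- UCSCXenaTools::XenaGenerate(subset = XenaHostNames == host)
  # Keep only datasets whose name matches the pattern
  xe <- UCSCXenaTools::XenaFilter(xe, filterDatasets = pattern)
  query <- UCSCXenaTools::XenaQuery(xe)
  UCSCXenaTools::XenaDownload(query, destdir = destdir, trans_slash = TRUE)
}
```

To see which host names the installed version knows about, `unique(UCSCXenaTools::XenaData$XenaHostNames)` can be inspected before calling the wrapper.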

sunta3iouxos · 2024-11-05T18:25:25Z

I will do that and report back.

sunta3iouxos · 2024-11-12T10:45:08Z

This one works.
Could you please help with this:
"For comparing data within independent cohort (like TCGA-LUAD), we recommend to use the "gene expression RNAseq" dataset. For questions regarding the gene expression of this particular cohort in relation to other types tumors, you can use the pancan normalized version of the "gene expression RNAseq" data. For comparing with data outside TCGA, we recommend using the percentile version if the non-TCGA data is normalized by percentile ranking. For more information, please see our Data FAQ: here."
I understand that TCGA normalises the data with the EB++ algorithm to avoid batch effects, but they also state that if you need to add your own dataset, it may be better to normalise by percentile ranking. Any clues on how to do this?
I have never normalised data using that approach.

Is this approach something related to this:
https://www.nature.com/articles/s41598-020-72664-6#Sec2

ShixiangWang · 2024-11-13T08:15:36Z

Check https://www.r-bloggers.com/2024/03/mastering-quantile-normalization-in-r-a-step-by-step-guide/ and see more at https://www.google.com/search?q=percentile+normalization+in+r&sca_esv=5487afd26f79d4e0&sxsrf=ADLYWIL88t2cjXP4xQNDR8JUUzRTbtmP2g%3A1731485684107&source=hp&ei=9F80Z9nEBKrh0-kPja2O0Qc&iflsig=AL9hbdgAAAAAZzRuBHEtAsgdwPxbLON8SrenTMM22rhN&ved=0ahUKEwjZjojp7tiJAxWq8DQHHY2WI3oQ4dUDCBY&uact=5&oq=percentile+normalization+in+r&gs_lp=Egdnd3Mtd2l6Ih1wZXJjZW50aWxlIG5vcm1hbGl6YXRpb24gaW4gcjIFECEYoAFI4TdQAFilNnAAeACQAQCYAeABoAH-KKoBBjAuMjYuNbgBA8gBAPgBAvgBAZgCF6ACuh_CAgUQABiABMICCBAAGIAEGMsBwgIEEAAYHsICCBAAGAUYChgewgIGEAAYBRgewgIGEAAYCBgewgIIEAAYgAQYogSYAwCSBwYwLjE4LjWgB_F9&sclient=gws-wiz
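
Note that the guide linked above covers quantile normalization, which forces all samples onto a common distribution; per-sample percentile ranking is a simpler, different operation. A minimal sketch of percentile ranking follows, assuming a genes x samples numeric matrix and average ranks for ties. How Xena handled ties and missing values in its percentile dataset is not stated in this thread, so treat those choices as assumptions. The function name `percentile_rank` is hypothetical.

```r
# Rank genes within each sample and rescale the ranks to 0-100.
# Assumptions: ties get their average rank, NAs stay NA.
percentile_rank <- function(expr) {
  stopifnot(is.matrix(expr), is.numeric(expr))
  apply(expr, 2, function(x) {
    n <- sum(!is.na(x))
    r <- rank(x, ties.method = "average", na.last = "keep")
    # Lowest value maps to 0, highest to 100
    (r - 1) / (n - 1) * 100
  })
}

# Tiny illustration with made-up numbers (not real expression data)
toy <- matrix(c(10, 20, 30, 5, 5, 50), ncol = 2,
              dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
toy_pct <- percentile_rank(toy)
```

Applying the same function to both the TCGA matrix and your own matrix, restricted to the genes they share, puts the two on a comparable 0 to 100 scale. Rank-based methods such as singscore work from within-sample ranks anyway, so this step may matter more for GSVA than for singscore.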
