Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions on the how genomicElementDistribution works #15

Open
Kyung-TaeLee opened this issue Aug 19, 2021 · 1 comment
Open

Questions on the how genomicElementDistribution works #15

Kyung-TaeLee opened this issue Aug 19, 2021 · 1 comment

Comments

@Kyung-TaeLee
Copy link

hi @jianhong ,

First of all, thank you for providing such a nice tool! I am currently using the tool to analyze ATAC-seq and it works beautifully. I have some questions on the results from genomicElementDistribution.

In "genomicEelementDistribution", there are options named as "promoterRegion" and "geneDownstream". I wonder if promoterRegion is calculated based on the transcription start site (5'end of gene) of gene. Second, I wonder if either promoterRegion and geneDownstream option consider strand (+ or - strand) of gene.

so for example, let's say there is a gene name as "Tubulin" whose coordinates of start and end are 1000 and 2000 (start and end coordinates specified in GTF file), respectively, and strand is "-" and options for promoterRegion and geneDownstream are set as below

promoterRegion=c(upstream=500, downstream=500),
geneDownstream=c(upstream=0, downstream=200))

If the strandness of gene is considered, promoter region would be 1500-2500 (5'end of Tubulin is 2000 considering that the strand is "-") and downstream region would be 800-1000 (3'end of Tubulin in 1000 considering that the strand is "-").
So it would be great if you can confirm that genomicElementDistribution consider the strandness of gene or not.

Lastly, as far as I understand, If the peaks overlap multiple features (exon, intron, etc), single feature is assigned to the peak based on the pre-determined priority. It would be great if you can tell me how the features are prioritized.

In my R code, I make txdb from gencode gene annotation using the makeTxdbFromGFF as shown below

R code

setwd(dir)
library(ChIPpeakAnno)
library(GenomicFeatures)
gtf <- "gencode.v29.annotation.gtf" ## genocde annoataion
txdb <- makeTxDbFromGFF('gencode.v29.annotation.gtf')
test_peak_file <- "ATAC-seq_peaks.bed"
test_peak <- toGRanges(test_peak_file, format="BED")
out<- genomicElementDistribution(test_peak,
TxDb = txdb,
promoterRegion=c(upstream=5000, downstream=5000),
geneDownstream=c(upstream=0, downstream=2000))

Please let me know if there is any flaw in the code.
Thank you and have a nice day!

@jianhong
Copy link
Owner

Thank you for trying ChIPpeakAnno to annotate your data.
Yes, the promoter region and downstream region will consider the strand information. The ignore.strand=TRUE is designed for the peaks to be annotated.
About the priority, it is determined by the order of element in labels parameter.
for example:

Exons=c(utr5="5' UTR", utr3="3' UTR", CDS="CDS", otherExon="Other exon")
will behavior different from
Exons=c(CDS="CDS", utr5="5' UTR", utr3="3' UTR", otherExon="Other exon")

And your codes looks good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants