What it does

This repo contains scripts and data files that look up Gencode gene IDs for given gene names. It relies on 3 resources to find hits, and it tries them in the following priority order:

Note: if a gene name is linked to more than one Gencode ID, it is reported as no hit.
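
The lookup rule described above can be sketched in a few lines of Python. This is only an illustration, not the code in find_gencodeID.r: the function name `find_id`, the table layout (one dict per resource, mapping a gene name to a set of IDs), and the placeholder names and IDs are all hypothetical. It also assumes that an ambiguous name in a higher-priority resource ends the search, which the README does not spell out.

```python
def find_id(name, tables):
    """Return the single Gencode ID for `name`, or None for "no hit".

    `tables` is a list of dicts in priority order (hypothetical layout):
    each dict maps a gene name to the set of Gencode IDs linked to it.
    """
    for table in tables:
        ids = table.get(name, set())
        if len(ids) == 1:
            # Exactly one ID in this resource: accept it and stop searching.
            return next(iter(ids))
        if len(ids) > 1:
            # Assumption: an ambiguous name is reported as no hit right away.
            return None
    # The name was not found in any resource.
    return None


# Placeholder data: two resources, searched in list order.
tables = [
    {"GENE1": {"ID_A"}},
    {"GENE2": {"ID_B", "ID_C"}, "GENE3": {"ID_D"}},
]
for gene in ["GENE1", "GENE2", "GENE3", "GENE4"]:
    print(gene, find_id(gene, tables))
```

With this placeholder data, GENE1 and GENE3 each resolve to one ID, while GENE2 (two IDs) and GENE4 (absent) both come back as no hit.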

Load function and data in R

library(RCurl)
script=getURL("https://raw.githubusercontent.com/weishwu/find_gencodeID/master/find_gencodeID.r", ssl.verifypeer = FALSE)
load(url("https://raw.githubusercontent.com/weishwu/find_gencodeID/master/queryTabs.RData"))
eval(parse(text = script))

Example run

  • Read input from a file whose name ends with ".txt". The file lists gene names in the first column, one name per line.
hits=find_gencodeID("ENSG_Missing.txt")
  • Read input from a character vector, passed by its variable name.
hits=find_gencodeID(ENSG_Missing)

Read below only if data needs to be re-created

  • Merge NCBI and HGNC records with the Gencode GTF to find the unique matches (each NCBI/HGNC record can be linked with only one Gencode ID; each gene name can be linked with only one Gencode ID).
  • Required inputs: gencode.v34.annotation.gtf.zip, NCBI_genes.txt.zip, HGNC_gene_aliases.txt.zip (downloaded from the links listed above)
python gencodeID_queryTables.py gencode.v34.annotation.gtf.zip
  • Combine the gene name and ID tables created above into an R object
Rscript gencodeID_queryTables.r
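
To illustrate the uniqueness rule on the Gencode side, here is a minimal Python sketch. It is not the contents of gencodeID_queryTables.py. It assumes standard GTF lines (nine tab-separated columns, with `gene_id "..."` and `gene_name "..."` in the attribute column), and the function name `unique_name_to_id` and the sample records are hypothetical. Merging with the NCBI and HGNC tables is not shown.

```python
import re
from collections import defaultdict

# Matches quoted GTF attributes such as: gene_id "X"; gene_name "Y";
ATTR = re.compile(r'(\S+) "([^"]*)"')


def unique_name_to_id(gtf_lines):
    """Map each gene name to its Gencode ID, keeping only unambiguous names."""
    name_to_ids = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only gene-level records are used here
        attrs = dict(ATTR.findall(fields[8]))
        if "gene_id" in attrs and "gene_name" in attrs:
            name_to_ids[attrs["gene_name"]].add(attrs["gene_id"])
    # Drop every name that is linked to more than one ID.
    return {n: next(iter(ids)) for n, ids in name_to_ids.items() if len(ids) == 1}


# Placeholder records: GENE2 appears with two different IDs and is dropped.
sample = [
    "##description: placeholder header",
    'chr1\tSRC\tgene\t1\t100\t.\t+\t.\tgene_id "ID_A"; gene_name "GENE1";',
    'chr1\tSRC\tgene\t200\t300\t.\t+\t.\tgene_id "ID_B"; gene_name "GENE2";',
    'chr2\tSRC\tgene\t200\t300\t.\t-\t.\tgene_id "ID_C"; gene_name "GENE2";',
    'chr1\tSRC\ttranscript\t1\t100\t.\t+\t.\tgene_id "ID_A"; gene_name "GENE1";',
]
unique = unique_name_to_id(sample)
print(unique)
```

Only names with exactly one linked ID survive the filter, which matches the "no hit" behavior for ambiguous names described at the top of this README.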