Specification for GTF file #30

dylkot · 2018-04-23T12:49:54Z

Hi There, thanks for putting all of this together!

I am just putting a few comments together as I seek to run this pipeline on a Rhesus Macaque sample. I'm downloading the GTF from ENSEMBL:
ftp://ftp.ensembl.org/pub/release-92/gtf/macaca_mulatta

Initially this file didn't include a gene_name in the attribute column. This was leading to a NullPointerException in org.broadinstitute.dropseqrna.annotation.ReduceGTF

I manually added a gene_name attribute but now I'm getting an exception "Missing transcript_name"

Problems:
Missing transcript_name
at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:97)
at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39)
at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71)
at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:57)
at org.broadinstitute.dropseqrna.utils.FilteredIterator.next(FilteredIterator.java:71)
at org.broadinstitute.dropseqrna.annotation.ReduceGTF.writeRecords(ReduceGTF.java:166)
at org.broadinstitute.dropseqrna.annotation.ReduceGTF.doWork(ReduceGTF.java:112)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

It would be helpful if there was a similar error message for gene_name missing and if there was documentation in the "reference files" section indicating that the transcript_name and gene_name attributes are required and may not be included directly in the ensembl download.

Hoohm · 2018-04-27T15:59:27Z

Hey ! Glad you enjoy the tool. I've had another user having similar issue with arabidopsis annotation. It would be nice to have some safety checks to deal with those issues. Have you finally managed to fix this problem? If yes, would share what you have done so that we can integrate it?

…

On Mon, Apr 23, 2018, 14:50 Dylan Kotliar ***@***.***> wrote: Hi There, thanks for putting all of this together! I am just putting a few comments together as I seek to run this pipeline on a Rhesus Macaque sample. I'm downloading the GTF from ENSEMBL: ftp://ftp.ensembl.org/pub/release-92/gtf/macaca_mulatta Initially this file didn't include a gene_name in the attribute column. This was leading to a NullPointerException in org.broadinstitute.dropseqrna.annotation.ReduceGTF I manually added a gene_name attribute but now I'm getting an exception "Missing transcript_name" Problems: Missing transcript_name at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:97) at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39) at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71) at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:57) at org.broadinstitute.dropseqrna.utils.FilteredIterator.next(FilteredIterator.java:71) at org.broadinstitute.dropseqrna.annotation.ReduceGTF.writeRecords(ReduceGTF.java:166) at org.broadinstitute.dropseqrna.annotation.ReduceGTF.doWork(ReduceGTF.java:112) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94) at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42) It would be helpful if there was a similar error message for gene_name missing and if there was documentation in the "reference files" section indicating that the transcript_name and gene_name attributes are required and may not be included directly in the ensembl download. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#30>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABNXaCXqC84rNG6RXtCtXQCH8jlWBKA9ks5trc3zgaJpZM4Tf1KX> .

dylkot · 2018-05-02T12:16:03Z

Hi. Yep, I fixed it by manually appending gene_name and transcript_name entries to the rows that were missing them in the GTF. I simply used the gene_id and transcript_id for those that were lacking them. I then found that gene_name (which is used by the pipeline as the gene identifier) was not unique in the Macaque annotation. So I overwrote the gene_name entry in the attribute column in the format geneid__genename so it would propagate information for both features through the pipeline. I'm not sure which of these features you would want to add into the package or if you would want to just indicate that gene_name and transcript_name are required attributes. In any case, I'm copying some relevant code here below:

`
import pandas as pd
import numpy as np
import re

def append_gtf_attr(ingtf, outgtf):
    '''Adds gene_name and transcript_name fields to rows that are missing them. Saves gene_name 
       in the format geneid__genename'''
    gtf = pd.read_csv(ingtf, sep='\t', header=None, comment='#')

    ## Extract gene_id and transcript_id
    gtf['gid'] = gtf[8].apply(lambda x: re.findall('gene_id "(.+?)"', x)[0])
    gtf['tid'] = np.nan
    tidind = gtf[2] != 'gene'
    gtf.loc[tidind, 'tid'] = gtf.loc[tidind,8].apply(lambda x: re.findall('transcript_id "(.+?)"', x)[0])

    ## Add attr column to hold our updated attribute info
    gtf['attr'] = gtf[8].copy()

    ## Use gene_id as gene_name for rows that don't have gene_name in attr
    hasgn = gtf['attr'].apply(lambda x: 'gene_name' in x)
    gtf.loc[~hasgn, 'attr'] += ' gene_name "' + gtf.loc[~hasgn,'gid'] + '";'

    ## Use transcript_id as transcript_name for rows missing transcript_name
    hastn = gtf['attr'].apply(lambda x: 'transcript_name' in x)
    notnidx = ~hastn & (gtf[2]!='gene')
    gtf.loc[notnidx, 'attr'] += ' transcript_name "' + gtf.loc[notnidx,'tid'] + '";'

    ## Update gene_name to be geneid__genename format
    gtf['gene_name'] = gtf['attr'].apply(lambda x: re.findall('gene_name "(.+?)"', x)[0])
    gtf['newname'] = gtf['gid'] + '__' + gtf['gene_name']
    strbase = 'gene_name "%s"'
    gtf['attr'] = gtf.apply(lambda x: x['attr'].replace(strbase % x['gene_name'], strbase % x['newname']), axis=1)

    ## Drop extraneous columns and save resulting file
    gtf = gtf.drop([8, 'gid', 'tid', 'gene_name', 'newname'], axis=1)
    gtf.to_csv(outgtf, sep='\t', index=False, header=False, quoting=3)

`

Hoohm · 2018-11-28T12:57:36Z

Hey @dylkot !
I'm working on automating the reference and genome download generation. I'm not familiar with all possible species that people use for single cell and I would greatly appreciate help for specific ones such as Rhesus Macaque.

Would you be able to help integrate this once the new version is out?

dylkot · 2018-11-28T13:48:12Z

Sure! I'd be happy to help with integration for Rhesus Macaque. By 'new version', are you referring to incorporating Drop-seq tools 2.0? I was just starting to wonder about how I will use that going forward because I would especially like to have quantitation of the intronic reads in my data

Hoohm · 2018-11-28T14:06:39Z

Version 0.4 and many more things.

Which functionnality would you specifically need?

dylkot · 2018-11-28T17:56:22Z

Exciting! Really just the ability to generate a count matrices for reads falling in introns in addition to reads falling in the coding sequence.

Hoohm · 2018-12-19T21:39:15Z

Hey! If you want to try out the new feature for automatic mixed species download, merging and generation, you can test out the new 0.4 version!

dylkot · 2018-12-20T15:21:08Z

Awesome, I just got a new load of data and so I'll run it with the new version! I'll let you know how it goes.

dylkot · 2018-12-22T22:10:09Z

Is there documentation anywhere on how to setup the binaries to run the new version? When I try to run it with the environment that was working for the previous version of dropSeqPipe, I get the error:

/bin/bash: cutadapt: command not found

I can install cutadapt but I imagine there might be several changes to the environment including downloading the next version of Drop-seq_tools and I just want to make sure I do all of that before I try to run things.

Hoohm · 2018-12-22T22:12:05Z

are you adding --use-conda to the command? Because that would download and activate the envs for each rule and you should not get this kind of errors.

dylkot · 2018-12-22T22:13:45Z

Nope, I wasn't. Sorry I missed that! Will give it a try.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Specification for GTF file #30

Specification for GTF file #30

dylkot commented Apr 23, 2018

Hoohm commented Apr 27, 2018 via email

dylkot commented May 2, 2018 •

edited

Loading

Hoohm commented Nov 28, 2018

dylkot commented Nov 28, 2018

Hoohm commented Nov 28, 2018 •

edited

Loading

dylkot commented Nov 28, 2018

Hoohm commented Dec 19, 2018

dylkot commented Dec 20, 2018

dylkot commented Dec 22, 2018

Hoohm commented Dec 22, 2018

dylkot commented Dec 22, 2018

Specification for GTF file #30

Specification for GTF file #30

Comments

dylkot commented Apr 23, 2018

Hoohm commented Apr 27, 2018 via email

dylkot commented May 2, 2018 • edited Loading

Hoohm commented Nov 28, 2018

dylkot commented Nov 28, 2018

Hoohm commented Nov 28, 2018 • edited Loading

dylkot commented Nov 28, 2018

Hoohm commented Dec 19, 2018

dylkot commented Dec 20, 2018

dylkot commented Dec 22, 2018

Hoohm commented Dec 22, 2018

dylkot commented Dec 22, 2018

dylkot commented May 2, 2018 •

edited

Loading

Hoohm commented Nov 28, 2018 •

edited

Loading