Improve handling of gene_name attribute in GTF modification #145

Open: khajoue2 wants to merge 7 commits into base: develop

khajoue2 (Collaborator) commented Oct 21, 2024

Changes Made to GTF Filter Script

Major Changes

  1. Removed dependency on external biotypes file

    • Hardcoded allowable biotypes in the script
    • Removed -b/--biotypes argument
    • Added support for tRNA, rRNA, and mRNA biotypes
  2. Added mitochondrial gene handling

    • Special handling for MT genes regardless of biotype
    • Added checks for MT genes with missing biotypes
    • Support for genes with MT- prefix in gene name
  3. Added species-specific handling

    • Added new -s/--species argument
    • Added human-specific PAR region filtering (chr Y: 2752083-56887903)
    • Added human-specific XGY2 gene filtering (ENSG00000290840)
  4. Improved biotype detection

    • Now checks both transcript_type and transcript_biotype
    • Better handling of missing biotype fields
    • Changed RefSeq processing to look at 'transcript' entries instead of 'gene' entries
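
For reviewers, the rules in items 1, 2, and 4 can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function names, the entries in ALLOWED_BIOTYPES other than tRNA, rRNA, and mRNA, and the list of mitochondrial contig names are all assumptions.

```python
import re

# Hypothetical allow-list. The PR names only tRNA, rRNA, and mRNA explicitly;
# the other entries are placeholders for illustration.
ALLOWED_BIOTYPES = {"protein_coding", "lncRNA", "tRNA", "rRNA", "mRNA"}

# Assumed contig names for the mitochondrial genome.
MT_CONTIGS = {"MT", "chrM", "chrMT"}

# Matches quoted key/value pairs in the ninth GTF column, e.g. gene_id "X";
_ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(attr_field):
    """Parse the GTF attribute column into a dict (last value wins on repeats)."""
    return dict(_ATTR_RE.findall(attr_field))


def get_biotype(attrs):
    """Return transcript_type if present, else transcript_biotype, else None."""
    for key in ("transcript_type", "transcript_biotype"):
        if key in attrs:
            return attrs[key]
    return None


def is_mitochondrial(seqname, attrs):
    """True for records on an MT contig or with an MT- prefixed gene name."""
    return seqname in MT_CONTIGS or attrs.get("gene_name", "").startswith("MT-")


def keep_record(seqname, attrs):
    """Keep MT records regardless of biotype; otherwise require an allowed biotype."""
    if is_mitochondrial(seqname, attrs):
        return True
    return get_biotype(attrs) in ALLOWED_BIOTYPES
```

In this sketch a record with no biotype field is dropped unless it is mitochondrial, which is one reading of "better handling of missing biotype fields"; the script itself may behave differently.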

Code Structure Improvements

  1. Split GTF filtering into a separate function for better organization
  2. Added more descriptive error messages and progress reporting
  3. Improved code documentation
  4. Removed unnecessary comments and debug print statements

Bug Fixes

  1. Fixed the mitochondrial gene filter, which was incorrectly excluding MT genes
  2. Fixed handling of genes without biotype attributes
  3. Fixed version handling for IDs without version numbers
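
The version-handling fix (item 3) concerns IDs that carry no ".N" suffix. The PR text does not show how the script splits IDs, so the helper below is a hypothetical sketch of one way to handle both cases; the name split_version is made up for illustration.

```python
def split_version(identifier):
    """Split 'ENSG00000290840.1' into ('ENSG00000290840', '1').

    IDs without a numeric version suffix are returned unchanged with
    a version of None, rather than raising or being truncated.
    """
    base, sep, version = identifier.rpartition(".")
    if sep and version.isdigit():
        return base, version
    return identifier, None
```

For example, split_version("ENSG00000290840") returns the ID unchanged with None as the version, so downstream code can decide whether to write a version attribute at all.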

Usage Changes

Old usage:

python3 filter_gtf.py -i input.gtf -o output.gtf -b biotypes.tsv

New usage:

python3 filter_gtf.py -i input.gtf -o output.gtf -s species_name

The script no longer depends on an external biotypes file, so it can filter a GTF for any species, while the human-specific PAR and XGY2 filtering is kept for human input.
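
As a rough picture of the new command-line interface, the argument parsing might look like the sketch below. Only the short flags -i, -o, and -s and the long name --species come from this PR; the long names --input and --output, the decision to make every argument required, and the is_human helper with its accepted spellings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the new usage: -i input.gtf -o output.gtf -s species_name."""
    parser = argparse.ArgumentParser(description="Filter a GTF file.")
    # --input and --output are assumed long names; the PR shows only -i and -o.
    parser.add_argument("-i", "--input", required=True, help="input GTF path")
    parser.add_argument("-o", "--output", required=True, help="output GTF path")
    parser.add_argument("-s", "--species", required=True, help="species name")
    return parser


def is_human(species):
    """Hypothetical check used to switch on the human-specific filters."""
    return species.strip().lower().replace(" ", "_") in {"human", "homo_sapiens"}
```

With a parser like this, the old -b/--biotypes flag is rejected as an unrecognized argument, which makes the interface change visible to anyone still calling the script the old way.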
