Add support for SSM files within ICGC projects #4

victorlin · 2018-11-28T04:09:05Z

I'm wondering if it would be appropriate to add functionality for reading files such as simple_somatic_mutation.open.BRCA-US.tsv.gz.

Reading this data is the initial step of a project I will be starting soon. I would be more than willing to implement the parser.

For reference, the file has these columns:

icgc_mutation_id
icgc_donor_id
project_code
icgc_specimen_id
icgc_sample_id
matched_icgc_sample_id
submitted_sample_id
submitted_matched_sample_id
chromosome
chromosome_start
chromosome_end
chromosome_strand
assembly_version
mutation_type
reference_genome_allele
mutated_from_allele
mutated_to_allele
quality_score
probability
total_read_count
mutant_allele_read_count
verification_status
verification_platform
biological_validation_status
biological_validation_platform
consequence_type
aa_mutation
cds_mutation
gene_affected
transcript_affected
gene_build_version
platform
experimental_protocol
sequencing_strategy
base_calling_algorithm
alignment_algorithm
variation_calling_algorithm
other_analysis_algorithm
seq_coverage
raw_data_repository
raw_data_accession
initial_data_release_date

Ad115 · 2019-02-04T22:47:07Z

Sorry for the late response. Of course! It would be great. Feel free to make the changes and send me a pull request! 💃

Ad115 · 2019-02-04T23:04:10Z

Also, it would be great if you used the facilities in the standard library for gzip files and tsv files.
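A minimal sketch of what reading the file with only the standard library could look like. The function name `iter_ssm_rows` is hypothetical and not part of the existing library; the sketch assumes the file is gzip-compressed and tab-separated with a header row, as in the column list above.

```python
import csv
import gzip


def iter_ssm_rows(filename):
    """Yield each record of a gzipped, tab-separated SSM file as a dict.

    Hypothetical helper: keys are taken from the header row of the file.
    """
    # 'rt' opens the gzip stream in text mode; newline='' is what the
    # csv module recommends for file objects it reads from.
    with gzip.open(filename, mode='rt', newline='') as handle:
        reader = csv.DictReader(handle, delimiter='\t')
        for row in reader:
            yield row
```

Each yielded row can then be indexed by column name, for example `row['icgc_mutation_id']` or `row['chromosome_start']`. All values arrive as strings, so numeric columns would still need converting.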

Maybe the parser for project SSMs could be another class analogous to SSM_Reader, but what would it be named? Maybe Project_SSM_Reader? Or should we extend the interface of the existing parser? I'm thinking of something like:

reader = SSM_Reader(filename='simple_somatic_mutation.open.BRCA-US.tsv.gz', file_type='project ssm')

Another thing that seems sensible to me would be to refactor the dependency on the vcf.Reader, so that one could change from vcf to tsv and not make a separate class. Something like:

# The old behavior:
reader = SSM_Reader(filename='simple_somatic_mutation.aggregated.vcf.gz', file_type='vcf')

# The new behavior:
reader = SSM_Reader(filename='simple_somatic_mutation.open.BRCA-US.tsv.gz', file_type='tsv')

And in each case, a different reader (vcf.Reader or csv.reader) would be instantiated internally.
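One way the internal dispatch could be sketched is below. `open_records` is a hypothetical helper, not the current SSM_Reader code, and it uses `csv.DictReader` rather than plain `csv.reader` so that TSV rows are keyed by column name. The VCF branch assumes PyVCF is installed and imports it lazily.

```python
import csv
import gzip


def open_records(filename, file_type):
    """Return an iterable of records using the reader matching file_type.

    Hypothetical helper illustrating the dispatch idea only.
    """
    if file_type == 'vcf':
        # Imported here so that TSV-only users do not need PyVCF.
        import vcf
        return vcf.Reader(filename=filename)
    if file_type == 'tsv':
        # The handle stays open while the reader is consumed; a real
        # implementation would need to close it when done.
        handle = gzip.open(filename, mode='rt', newline='')
        return csv.DictReader(handle, delimiter='\t')
    raise ValueError(f"Unknown file_type: {file_type!r}")
```

The two readers yield different record types (PyVCF record objects versus dicts), so a shared SSM_Reader would still need a thin layer that maps both to a common interface.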

It'd be great to hear your thoughts on the subject 🌝

victorlin · 2019-02-05T05:22:07Z

It seems like ICGC provides SSM data in two formats: VCF-like and ICGC-like Mutation Format. I'll reference the "ICGC-like" format as TSV for now.

It's worth noting that the TSV format isn't only available per-project. All SSM data downloaded from the web portal is in the TSV format. It seems to be a more widely available option, whereas the VCF format is only available by downloading all at once from the data release in DCC/current/Summary.

That is just my personal understanding of the ICGC structure. Maybe this library could automatically detect either format:

reader1 = SSM_Reader(filename='simple_somatic_mutation.aggregated.vcf.gz')
reader2 = SSM_Reader(filename='simple_somatic_mutation.open.BRCA-US.tsv.gz')

or have the user specify which is passed in:

reader1 = SSM_Reader(vcf='simple_somatic_mutation.aggregated.vcf.gz')
reader2 = SSM_Reader(tsv='simple_somatic_mutation.open.BRCA-US.tsv.gz')

Ad115 · 2019-02-05T05:54:46Z

I love the idea of automatic detection of the file format, given that every well-formed VCF must start with a line specifying the format, according to the VCF specification.
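A rough sketch of that detection, assuming the input is always gzip-compressed and that anything not starting with the VCF `##fileformat=` line is the TSV format. The name `detect_ssm_format` is hypothetical.

```python
import gzip


def detect_ssm_format(filename):
    """Guess 'vcf' or 'tsv' from the first line of a gzipped SSM file.

    Hypothetical helper: only the first line is inspected.
    """
    with gzip.open(filename, mode='rt') as handle:
        first_line = handle.readline()
    # A well-formed VCF begins with a line such as '##fileformat=VCFv4.1'.
    if first_line.startswith('##fileformat=VCF'):
        return 'vcf'
    # Assumption: every other file is treated as the ICGC TSV format.
    return 'tsv'
```

A stricter version could also check that the first line of a TSV file contains an expected column such as icgc_mutation_id, and raise an error for files that match neither format.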

Given that, I cannot think of a plausible case for having the user specify the file format manually. Maybe it would be good to have an optional switch, or to remove the option altogether.

Ad115 assigned victorlin Feb 4, 2019
