Add support for SSM files within ICGC projects #4
Comments
Sorry for the late response. Of course! It would be great. Feel free to make the changes and send me a pull request! 💃
Also, it would be great if you used the standard library's facilities for gzip and TSV files. Perhaps the parser for project SSMs could be another class analogous to the
Another thing that seems sensible to me would be to refactor the dependency on the
And in each case, internally, a different reader ( It'd be great to hear your thoughts on the subject 🌝
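As a rough illustration of the standard-library suggestion above, here is a minimal Python sketch. It assumes the project files are gzip-compressed, tab-separated text with a header row. The function name `read_ssm_tsv` is hypothetical and not part of this library.

```python
import csv
import gzip


def read_ssm_tsv(path):
    """Yield one dict per data row of a gzip-compressed, tab-separated file.

    Hypothetical helper for illustration only; it is not part of this library.
    """
    # "rt" opens the gzip stream in text mode; newline="" is what the csv
    # module recommends for file objects it reads.
    with gzip.open(path, "rt", newline="") as handle:
        # DictReader takes the column names from the header line.
        yield from csv.DictReader(handle, delimiter="\t")
```

Because the function is a generator, rows are decompressed and parsed lazily, which should help with large per-project files.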
It seems like ICGC provides SSM data in two formats: a VCF-like format and an ICGC-like Mutation Format. I'll refer to the "ICGC-like" format as TSV for now. It's worth noting that the TSV format isn't only available per project: all SSM data downloaded from the web portal is in the TSV format. It seems to be the more widely available option, whereas the VCF format is only available by downloading everything at once from the data release in DCC/current/Summary. That is just my personal understanding of the ICGC structure. Maybe this library could automatically detect either format:
or have the user specify which is passed in:
I love the idea of automatically detecting the file format, given that every well-formed VCF file must start with a line specifying the format, according to the VCF specification. Given that, I cannot think of a plausible case for having the user specify the file format manually. Maybe it would be good to make it an optional switch, or to remove the option altogether.
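A minimal Python sketch of the detection idea discussed above. It relies on the `##fileformat=` line that the VCF specification requires as the first line, and treats anything else as the TSV format. The names `open_text` and `detect_ssm_format` are hypothetical, and the gzip check is an added assumption so that both compressed and plain files can be inspected.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed file in text mode (hypothetical helper)."""
    # Gzip streams start with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "r")


def detect_ssm_format(path):
    """Return "vcf" if the first line is a VCF fileformat header, else "tsv"."""
    with open_text(path) as handle:
        first_line = handle.readline()
    # Assumption: any file without the VCF header is the ICGC-like TSV format.
    return "vcf" if first_line.startswith("##fileformat=VCF") else "tsv"
```

An optional argument that overrides the detected value would cover the "optional switch" variant without forcing every caller to name the format.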
I'm wondering if it would be appropriate to add functionality for reading files such as simple_somatic_mutation.open.BRCA-US.tsv.gz. Reading this data is the initial step of a project I will be starting soon. I would be more than willing to implement the parser.
For reference, the file has these columns: