Provide raw data and scripts #2
Comments
Hi @jszinger, I'm guessing you're referring to the SRA tracks. The raw data for each track can be obtained by following the link in the "about this track" dialog box, which you get by mousing over the track label and clicking the menu's down triangle. The "about" box also links to the analysis performed by the Galaxy people, from whom I got the VCF files. That URL points to the top level of the analysis repo, but you can get more information about the variant analysis by digging a few directories down to the variant readme: https://github.com/galaxyproject/SARS-CoV-2/blob/master/genomics/4-Variation/README.md

The only things I did to the VCF files after getting them from the Galaxy folks were to change the name of the reference sequence (Galaxy used "NC_045512"; in JBrowse I used "NC_045512.2") and then to filter out variants with a frequency of less than 1%, which I did with a simple perl one-liner:

Is that what you're looking for?
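For readers who want to reproduce the two VCF changes described above (renaming the reference sequence and dropping variants below 1% frequency), here is a minimal Python sketch. It is not the perl one-liner used for the tracks. It assumes the allele frequency is stored in the INFO column under an `AF=` key, which may not match every VCF; the function names `parse_info` and `filter_vcf` are hypothetical.

```python
import re
import sys


def parse_info(info):
    """Turn a VCF INFO string such as 'DP=100;AF=0.02' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def filter_vcf(lines, old_name="NC_045512", new_name="NC_045512.2", min_af=0.01):
    """Yield VCF lines with the reference renamed and low-frequency variants dropped.

    Assumption: allele frequency lives in INFO as AF=<float>[,<float>...].
    Records without a usable AF value are passed through unfiltered.
    """
    contig_id = re.compile(r"ID=" + re.escape(old_name) + r"(?=[,>])")
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Rename the reference in ##contig header lines; pass other headers through.
            if line.startswith("##contig="):
                line = contig_id.sub("ID=" + new_name, line)
            yield line
            continue
        cols = line.split("\t")
        if cols[0] == old_name:
            cols[0] = new_name
        try:
            freqs = [float(x) for x in parse_info(cols[7])["AF"].split(",")]
        except (KeyError, ValueError, IndexError):
            freqs = []
        # Drop the record only when every listed allele is below the threshold.
        if freqs and max(freqs) < min_af:
            continue
        yield "\t".join(cols)


if __name__ == "__main__":
    # Usage: python filter_vcf.py < input.vcf > filtered.vcf
    for out_line in filter_vcf(sys.stdin):
        print(out_line)
```

Keeping records that lack an `AF` value is a deliberate choice in this sketch, so that nothing is silently discarded when the frequency field is named differently.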
I'm actually asking about the other tracks: CDS, Genes, primers, and multi-alignment. For example, there's a bunch of processing that needs to happen to https://www.ncbi.nlm.nih.gov/nuccore/NC_045512 before it can be displayed by JBrowse. I wish to know the details of retrieval and processing.

Thanks,
Ah, OK. The data processing for that is relatively straightforward. It requires getting the FASTA and GFF files for NC_045512 from the page you linked to: click the "send to" link, select "complete record", choose "file" as the destination, and then select FASTA and GFF3 from the format drop-down menu. Once you have the files, you first run

Then you can process data from the GFF3 file to get tracks for genes and CDS. The command generally looks like

The primers tracks resulted from me "scraping" the primer sequences from the linked resources and using "Add sequence search track" for each primer sequence to identify its coordinates, then writing a GFF3 file by hand and processing it with flatfile-to-json, similar to above. The primers.gff file in this repo is the result of those searches.

The multi-alignment track I know a little bit less about: the BED file I used was created by @cmdcolin and I just grabbed the data. I know that it was fairly straightforward: all SARS-CoV-2 sequences were obtained from GenBank, downloaded as a multi-alignment FASTA file, and processed into a BED file that was then tabix indexed. Yes, that feels a little hand-wavy; perhaps @cmdcolin can fill in a little bit of detail if you like. I added the track configuration for this track to the trackList.json file by hand.

This is a fairly brief overview, but it should do the job of letting you know how the data were processed. If you want to do something similar, please feel free to email the JBrowse mailing list at [email protected] or hit us up in Gitter: https://gitter.im/GMOD/jbrowse
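The primer step described above (find each primer's coordinates in the reference, then write a GFF3 line for it) could also be scripted instead of done by hand. The following is a rough Python sketch of that idea, not the procedure actually used for primers.gff: it does exact string matching on both strands, and the helper names, the `manual` source value, and the `primer_binding_site` feature type are all assumptions chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_primer(reference, primer):
    """Return sorted (start, end, strand) hits using 1-based inclusive coordinates.

    Only exact matches are reported; the '-' strand is searched by looking
    for the reverse complement of the primer in the forward reference.
    """
    ref = reference.upper()
    primer = primer.upper()
    hits = []
    for strand, query in (("+", primer), ("-", reverse_complement(primer))):
        pos = ref.find(query)
        while pos != -1:
            hits.append((pos + 1, pos + len(query), strand))
            pos = ref.find(query, pos + 1)
    return sorted(hits)


def gff3_line(seqid, name, start, end, strand,
              source="manual", ftype="primer_binding_site"):
    """Format one tab-separated, nine-column GFF3 feature line."""
    return "\t".join([
        seqid, source, ftype, str(start), str(end),
        ".", strand, ".", "Name=" + name,
    ])


if __name__ == "__main__":
    # Toy reference and primers for illustration only; not real primer sequences.
    toy_reference = "ATGCCCTTAGGA"
    toy_primers = {"toy_fwd": "CCCTT", "toy_rev": "TCCTA"}
    print("##gff-version 3")
    for name, seq in toy_primers.items():
        for start, end, strand in find_primer(toy_reference, seq):
            print(gff3_line("NC_045512.2", name, start, end, strand))
```

The resulting file would still need to be loaded with flatfile-to-json, as mentioned in the comment above.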
If you feel like the above descriptions are adequate, let me know and I can add them to the "about this track" for each track where it makes sense.
Please provide links to the raw data and the scripts necessary to format them for JBrowse. I would like to set up my own instance using data of known provenance and a proven chain of custody. Pulling preformatted data from the cloud does not meet this requirement.