Skip to content

Latest commit

 

History

History
996 lines (801 loc) · 32.9 KB

README.md

File metadata and controls

996 lines (801 loc) · 32.9 KB

Table of Contents

Created by gh-md-toc

README

Call Omicron variants

The Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is an RNA virus currently causing the 2019–20 coronavirus pandemic. This repository contains code and notes for the analysis of SARS-CoV-2.

I also wrote some blog posts related to the analyses in this repository.

  1. https://davetang.org/muse/2020/03/05/sequence-analysis-sars-cov-2/

  2. https://davetang.org/muse/2020/03/06/sequence-analysis-of-sars-cov-2-part-2/

  3. https://davetang.org/muse/2020/03/12/sequence-analysis-of-sars-cov-2-part-3/

  4. https://davetang.org/muse/2022/01/26/omicron-variants/ (this analysis workflow is also implemented as a GitHub Actions workflow)

Tools

Install Mambaforge and then use mamba to create the environment that has all the necessary tools.

mamba env create --file environment.yml

conda activate sars_cov_2

The Conda package for snpEff is out of date and does not download the reference sequences properly. We need to install this manually.

cd src
wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
unzip snpEff_latest_core.zip

cd snpEff
java -jar snpEff.jar download NC_045512.2

Sequences

Reference sequence

Reference sequence NC_045512. Download GFF for NC_045512 from https://www.ncbi.nlm.nih.gov/sars-cov-2/.

mkdir tmp && cd tmp
datasets download genome accession GCF_009858895.2 --filename GCF_009858895.2.zip --include genome,gtf,gbff
unzip GCF_009858895.2.zip

gzip ncbi_dataset/data/GCF_009858895.2/GCF_009858895.2_ASM985889v3_genomic.fna
mv ncbi_dataset/data/GCF_009858895.2/GCF_009858895.2_ASM985889v3_genomic.fna.gz ../raw
cd .. && rm -rf tmp

Variants

See SARS-CoV-2 Variant Classifications and Definitions and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8342008/ for more information.

Download variants of concern using script/download_variants.sh. The script will keep trying to download a lineage until successful; this was implemented because for variants with a lot of sequences (Alpha and Delta), the download would disconnect quite often (despite having a very fast Internet connection).

  • Alpha (B.1.1.7)
  • Beta (B.1.351, B.1.351.2, B.1.351.3)
  • Delta (B.1.617.2, AY.1, AY.2, AY.3)
  • Gamma (P.1, P.1.1, P.1.2)
  • Omicron (B.1.1.529, BA.1, BA.2, BA.3)

Summarise using dataformat.

bin/macos/dataformat tsv virus-genome --package raw/variants/SARS-CoV-2-P.1.1.20210819.zip --fields accession,virus-pangolin,release-date,isolate-lineage | head -5
Accession       Virus Pangolin Classification   Release date    Isolate Lineage
MZ799138.1      P.1.1   2021-08-15      SARS-CoV-2/human/USA/2105200446/2021
MZ788313.1      P.1.1   2021-08-13      SARS-CoV-2/human/USA/TX-DSHS-7363/2021
MZ770359.1      P.1.1   2021-08-12      SARS-CoV-2/human/USA/UT-UPHL-210729378097/2021
MZ746322.1      P.1.1   2021-08-10      SARS-CoV-2/human/USA/UT-UPHL-210729378097/2021

# confirm that accession MZ157012 is a delta variant
bin/macos/dataformat tsv virus-genome --package raw/variants/SARS-CoV-2-B.1.617.2.20210819.zip --fields accession,virus-pangolin,release-date,isolate-lineage | grep MZ157012
MZ157012.1      B.1.617.2       2021-05-11      SARS-CoV-2/human/NPL/LMB11/2021

Count number of sequences per lineage.

for file in $(ls raw/variants/*.zip); do
   echo ${file};
   bin/macos/dataformat tsv virus-genome --package ${file} | wc -l
done

raw/variants/SARS-CoV-2-AY.1.20210819.zip
     254
raw/variants/SARS-CoV-2-AY.2.20210819.zip
    1036
raw/variants/SARS-CoV-2-AY.3.20210819.zip
    5678
raw/variants/SARS-CoV-2-B.1.1.7.20210819.zip
  467304
raw/variants/SARS-CoV-2-B.1.351.2.20210819.zip
      47
raw/variants/SARS-CoV-2-B.1.351.20210819.zip
    4311
raw/variants/SARS-CoV-2-B.1.351.3.20210819.zip
     340
raw/variants/SARS-CoV-2-B.1.617.2.20210819.zip
  152212
raw/variants/SARS-CoV-2-P.1.1.20210819.zip
    1004
raw/variants/SARS-CoV-2-P.1.2.20210819.zip
     746
raw/variants/SARS-CoV-2-P.1.20210819.zip
   16109

Omicron

BA.1.

bin/macos/dataformat tsv virus-genome --package raw/variants/SARS-CoV-2-BA.1.20220225.zip --fields accession,virus-pangolin,release-date,isolate-lineage,geo-location | head -5
Accession	Virus Pangolin Classification	Release date	Isolate Lineage	Geographic location
OL815078.1	BA.1	2021-12-11	SARS-CoV-2/human/USA/NY-CDC-IBX854867598434/2021	USA: New York
OL808790.1	BA.1	2021-12-10	SARS-CoV-2/human/USA/MA-MGB-02636/2021	USA: Massachusetts
OL802692.1	BA.1	2021-12-10	SARS-CoV-2/human/USA/NY-CDCBI-CRSP_GHHFUDBUXELXFY54/2021	USA: New York
OL800703.1	BA.1	2021-12-10	SARS-CoV-2/human/DEU/FFM-ZAF0396/2021	Germany: Hesse

bin/macos/dataformat tsv virus-genome --package raw/variants/SARS-CoV-2-BA.1.20220225.zip | wc -l
8792232

BA.2.

bin/macos/dataformat tsv virus-genome --package raw/variants/SARS-CoV-2-BA.2.20220225.zip --fields accession,virus-pangolin,release-date,isolate-lineage,geo-location | head -5
Accession	Virus Pangolin Classification	Release date	Isolate Lineage	Geographic location
OM364005.1	BA.2	2022-01-24	SARS-CoV-2/human/USA/WA-S16397/2022	USA: Washington,King County
OM362060.1	BA.2	2022-01-24	SARS-CoV-2/human/USA/NY-CDC-LC0466930/2022	USA: New York
OM354922.1	BA.2	2022-01-24	SARS-CoV-2/human/USA/CA-CDC-STM-X4Q4J65PP/2022	USA: California
OM349979.1	BA.2	2022-01-24	SARS-CoV-2/human/IND/MH-ICMR-NIV-INSACOG-GSEQ-7029/2021	India

bin/macos/dataformat tsv virus-genome --package raw/variants/SARS-CoV-2-BA.2.20220225.zip | wc -l
   97503

bin/macos/dataformat tsv virus-genome --package raw/variants/SARS-CoV-2-BA.2.20220225.zip --fields geo-location | grep -v "^Geographic" | cut -f1 -d':' | sort | uniq -c | sort -k1rn
6896 United Kingdom
1623 USA
 301 Switzerland
  50 Germany
  32 Bangladesh
  30 Denmark
  25 Slovakia
  18 South Africa
  16 Bahrain
   5 China
   4 Austria
   3 Japan
   1 India

bin/macos/dataformat tsv virus-genome --package raw/variants/SARS-CoV-2-BA.2.20220225.zip --fields accession,virus-pangolin,release-date,isolate-lineage,geo-location | grep Japan
BS002318.1	BA.2	2022-01-26	hCoV-19/Japan/SZ-NIG-4-C118/2022	Japan:Shizuoka, Hamamatsu
BS002317.1	BA.2	2022-01-26	hCoV-19/Japan/SZ-NIG-4-C117/2022	Japan:Shizuoka, Hamamatsu
BS002303.1	BA.2	2022-01-26	hCoV-19/Japan/SZ-NIG-4-C67/2022	Japan:Shizuoka, Hamamatsu

Genomes

Download Coronavirus genomes using datasets.

mkdir -p raw/genome
today=$(date +%Y%m%d)
bin/datasets download virus genome tax-name sars2 --filename raw/genome/sars2.${today}.zip

973,966 genome sequences as of Thu Jul 22 17:37:22 JST 2021 up from 7,031 genome sequences from Mon Jul 20 14:32:46 JST 2020.

cd raw/genome/
unzip sars2.20210722.zip

cat ncbi_dataset/data/genomic.fna | grep "^>" | wc -l
973966

Look for MN908947.

cat ncbi_dataset/data/genomic.fna | grep MN908947
>MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
>MT576029.1 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/ESP/2019-nCoV-MN908947-cOVID-96_19/2020, complete genome

Look for MZ157012 (an isolate from lineage B.1.617.2 or also known as the Delta variant).

cat ncbi_dataset/data/genomic.fna | grep MZ157012
>MZ157012.1 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/NPL/LMB11/2021 ORF1ab polyprotein (ORF1ab) gene, partial cds; ORF1a polyprotein (ORF1ab) gene, complete cds; surface glycoprotein (S) and ORF3a protein (ORF3a) genes, partial cds; envelope protein (E), membrane glycoprotein (M), and ORF6 protein (ORF6) genes, complete cds; ORF7a protein (ORF7a) gene, partial cds; ORF8 gene, partial sequence; and nucleocapsid phosphoprotein (N) gene, partial cds

Extract MN908947.3 and MZ157012.1.

echo -e "MN908947.3\nMZ157012.1" > raw/wanted.txt

bin/seqtk subseq raw/genome/ncbi_dataset/data/genomic.fna raw/wanted.txt > raw/wanted.fa

Proteins

The download virus protein command downloads complete protein sequences (excluding partial sequences) and annotation data as a zip file in the BDBag (Big Data Bag) format.

bin/datasets download virus protein S --filename raw/SARS2-spike.zip

The default protein dataset for a given protein includes the following for all complete SARS2 RefSeq and GenBank genomes:

  • amino acid sequences in FASTA (.faa) format
  • protein structures in PDB (.pdb) format
  • nucleotide coding (CDS) sequences in FASTA (.fna) format
  • data report containing taxonomy, isolate, host and other metadata (data_report.yaml)
  • annotation and amino acid sequences in the GenPept flat file format (protein.gpff)
  • a README containing details on sequence file data content and other general information (virus_dataset.md)
  • a list of files and file types (dataset_catalog.json)
|-- ncbi_dataset
|   |-- bag-info.txt
|   |-- bagit.txt
|   |-- data
|   |   |-- cds.fna
|   |   |-- data_report.yaml
|   |   |-- dataset_catalog.json
|   |   |-- pdb
|   |   |   |-- 6VYB.pdb
|   |   |   |-- 6VYO.pdb
|   |   |   |-- 6W37.pdb
|   |   |   |-- 6W4H.pdb
|   |   |   |-- 6W9C.pdb
|   |   |   |-- 6W9Q.pdb
|   |   |   |-- 6WEY.pdb
|   |   |   |-- 6WJI.pdb
|   |   |   |-- 6WLC.pdb
|   |   |   |-- 7BQY.pdb
|   |   |   `-- 7BV2.pdb
|   |   |-- protein.faa
|   |   |-- protein.gpff
|   |   `-- virus_dataset.md
|   |-- fetch.txt
|   |-- manifest-md5.txt
|   `-- tagmanifest-md5.txt
`-- README.md

Extract only the amino acid FASTA sequences.

for file in $(ls raw/protein/SARS2*.zip); do
   base=$(basename $file .zip);
   unzip -jp $file ncbi_dataset/data/protein.faa | gzip > raw/protein/${base}.faa.gz
done

zcat raw/protein/SARS2.20200720.S.faa.gz | grep "^>" | wc -l
7015

Number of unique sequences.

for file in $(ls raw/protein/*.faa.gz); do
   echo $file;
   script/unique_seq.pl -f $file | wc -l;
done

raw/protein/SARS2.20200720.E.faa.gz
30
raw/protein/SARS2.20200720.M.faa.gz
95
raw/protein/SARS2.20200720.N.faa.gz
360
raw/protein/SARS2.20200720.nsp10.faa.gz
36
raw/protein/SARS2.20200720.nsp11.faa.gz
7
raw/protein/SARS2.20200720.nsp13.faa.gz
254
raw/protein/SARS2.20200720.nsp14.faa.gz
307
raw/protein/SARS2.20200720.nsp15.faa.gz
166
raw/protein/SARS2.20200720.nsp16.faa.gz
146
raw/protein/SARS2.20200720.nsp1.faa.gz
134
raw/protein/SARS2.20200720.nsp2.faa.gz
388
raw/protein/SARS2.20200720.nsp3.faa.gz
862
raw/protein/SARS2.20200720.nsp4.faa.gz
198
raw/protein/SARS2.20200720.nsp5.faa.gz
118
raw/protein/SARS2.20200720.nsp6.faa.gz
120
raw/protein/SARS2.20200720.nsp7.faa.gz
31
raw/protein/SARS2.20200720.nsp8.faa.gz
69
raw/protein/SARS2.20200720.nsp9.faa.gz
48
raw/protein/SARS2.20200720.ORF10.faa.gz
28
raw/protein/SARS2.20200720.orf1ab.faa.gz
5773
raw/protein/SARS2.20200720.orf1a.faa.gz
3756
raw/protein/SARS2.20200720.ORF3a.faa.gz
290
raw/protein/SARS2.20200720.ORF6.faa.gz
58
raw/protein/SARS2.20200720.ORF7a.faa.gz
88
raw/protein/SARS2.20200720.ORF7b.faa.gz
1
raw/protein/SARS2.20200720.ORF8.faa.gz
99
raw/protein/SARS2.20200720.RdRp.faa.gz
342
raw/protein/SARS2.20200720.S.faa.gz
779

Nucleotide sequences

The page https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/ contains a list of SARS-CoV-2 sequences. Download latest list of SARS-CoV-2 nucleotide sequence IDs.

outfile=acc.$(date +%F).txt.gz
wget https://www.ncbi.nlm.nih.gov/sars-cov-2/download-nuccore-ids/ -O - | gzip > raw/$outfile

zcat raw/$outfile | wc -l
10001

We can use efetch to download an assembled SARS-CoV-2 genome sequence in various formats.

mkdir raw

efetch -db sequences -format genbank -id MN908947 > raw/MN908947.genbank
efetch -db sequences -format fasta -id MN908947 > raw/MN908947.fa
efetch -db sequences -format fasta_cds_aa -id MN908947

We can use esearch to query the number of sequences associated with the term "coronavirus". There were 41,874 results as of 2020/03/01.

esearch -db nuccore -query coronavirus
<ENTREZ_DIRECT>
  <Db>nuccore</Db>
  <WebEnv>NCID_1_121026101_130.14.22.33_9001_1583054896_904858778_0MetA0_S_MegaStore</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>41874</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

We can pipe the output from esearch to efetch to fetch all sequences.

today=$(date "+%Y%m%d")

time esearch -db nuccore -query coronavirus | efetch -db sequences -format fasta > raw/coronavirus_$today.fa

real    10m39.511s
user    0m53.739s
sys     0m7.705s

cat raw/coronavirus_$today.fa | grep "^>" | wc -l
   43581

cat raw/coronavirus_$today.fa | grep MN908947
>MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome

I wrote a simple Perl script to calculate the length of the FASTA sequences.

today=$(date "+%Y%m%d")
script/fasta_stats.pl -f raw/coronavirus_$today.fa | gzip > result/coronavirus_${today}_stat.txt.gz

Identifying variants in lineages

The SARS-CoV-2 Delta variant is a variant of lineage B.1.617 of SARS-CoV-2; it belongs to lineage B.1.617.2. It has mutations in the gene encoding the SARS-CoV-2 spike protein causing the substitutions T478K, P681R and L452R. The workflow below will download the FASTA sequence of one particular Delta variant (accession MZ157012), align it with the SARS-CoV-2 reference sequence, generate a VCF file based on the alignment, and finally annotate the differences to the reference sequence.

id=MZ157012
efetch -db sequences -format fasta -id ${id} | gzip > raw/${id}.fa.gz
gunzip -c raw/GCF_009858895.2_ASM985889v3_genomic.fna.gz raw/MZ157012.fa.gz | gzip > raw/ref_vs_delta.fa.gz
gunzip -c raw/ref_vs_delta.fa.gz | muscle -out raw/ref_vs_delta.aln.fa
snp-sites -v -o raw/ref_vs_delta.vcf raw/ref_vs_delta.aln.fa
cat raw/ref_vs_delta.vcf | sed 's/ID=1/ID=NC_045512.2/;s/^1/NC_045512.2/' > blah
mv -f blah raw/ref_vs_delta.vcf
java -Xmx8g -jar bin/snpEff/snpEff.jar NC_045512.2 raw/ref_vs_delta.vcf > raw/ref_vs_delta.ann.vcf 

There are four mutations in the spike protein in MZ157012; two intersect with the reported substitutions in the Wikipedia article: L452R and T478K but P681R was not found.

cat raw/ref_vs_delta.ann.vcf | perl -nle 'next if /^#/; @s=split(/\|/); next unless $s[3] eq "S"; print $s[10]' 
p.Leu452Arg
p.Thr478Lys
p.Asp614Gly
p.Asp950Asn

The above workflow can be run by using script/var_to_vcf.sh; make sure you activate the Conda environment before running the script. We will use the script to generate VCF files for BA.1 (OL815078) and BA.2 (OM364005) variants.

script/var_to_vcf.sh -i OL815078

cat raw/ref_vs_OL815078.ann.vcf | perl -nle 'next if /^#/; @s=split(/\|/); next unless $s[3] eq "S"; print $s[10]'
p.Ala67Val
p.Thr95Ile
p.Gly339Asp
p.Ser371Pro
p.Ser371Phe
p.Ser373Pro
p.Ser375Phe
p.Lys417Asn
p.Thr547Lys
p.Asp614Gly
p.His655Tyr
p.Asn679Lys
p.Pro681His
p.Ala701Val
p.Asn764Lys
p.Asp796Tyr
p.Asn856Lys
p.Gln954His
p.Asn969Lys
p.Leu981Phe
p.Asp1146Asp

script/var_to_vcf.sh -i OM364005
cat raw/ref_vs_OM364005.ann.vcf | perl -nle 'next if /^#/; @s=split(/\|/); next unless $s[3] eq "S"; print $s[10]'
p.Thr19Ile
p.Gly142Asp
p.Val213Gly
p.Gly339Asp
p.Ser371Phe
p.Ser373Pro
p.Ser375Phe
p.Thr376Ala
p.Asp405Asn
p.Arg408Ser
p.Lys417Asn
p.Asn440Lys
p.Ser477Asn
p.Thr478Lys
p.Glu484Ala
p.Gln493Arg
p.Gln498Arg
p.Asn501Tyr
p.Tyr505His
p.Asp614Gly
p.His655Tyr
p.Asn679Lys
p.Pro681His
p.Asn764Lys
p.Asp796Tyr
p.Gln954His
p.Asn969Lys
p.Asn1023Ser
p.Asp1146Asp

Spike protein variants specific to BA.1 (OL815078 <) and BA.2 (OM364005 >).

diff <(cat raw/ref_vs_OL815078.ann.vcf | perl -nle 'next if /^#/; @s=split(/\|/); next unless $s[3] eq "S"; print $s[10]') <(cat raw/ref_vs_OM364005.ann.vcf | perl -nle 'next if /^#/; @s=split(/\|/); next unless $s[3] eq "S"; print $s[10]')
1,2c1,3
< p.Ala67Val
< p.Thr95Ile
---
> p.Thr19Ile
> p.Gly142Asp
> p.Val213Gly
4d4
< p.Ser371Pro
7a8,10
> p.Thr376Ala
> p.Asp405Asn
> p.Arg408Ser
9c12,19
< p.Thr547Lys
---
> p.Asn440Lys
> p.Ser477Asn
> p.Thr478Lys
> p.Glu484Ala
> p.Gln493Arg
> p.Gln498Arg
> p.Asn501Tyr
> p.Tyr505His
14d23
< p.Ala701Val
17d25
< p.Asn856Lys
20c28
< p.Leu981Phe
---
> p.Asn1023Ser

Gamma variant: This variant of SARS-CoV-2 has been named lineage P.1 and has 17 amino acid substitutions, ten of which are in its spike protein, including these three designated to be of particular concern: N501Y, E484K and K417T.

BLAST

Create BLAST database using the sequences associated with the term "coronavirus".

mkdir db

makeblastdb -dbtype nucl \
            -in raw/coronavirus_20200426.fa \
            -input_type fasta \
            -title coronavirus_20200426 \
            -out db/coronavirus_20200426

Building a new DB, current time: 04/26/2020 09:29:19
New DB name:   /Users/dtang/github/sars_cov_2/db/coronavirus_20200426
New DB title:  coronavirus_20200426
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 43581 sequences in 13.0157 seconds.

After creating the database we can blast the assembled genome to all the sequences we fetched to see if it matches other coronaviruses.

# -evalue <Real> - Expectation value (E) threshold for saving hits 
# Default = `10'
blastn -outfmt 7 -query raw/MN908947.fa -db db/coronavirus_20200426 > result/MN908947_blast.txt

blastn -evalue 1 -outfmt 7 -query raw/MN908947.fa -db db/coronavirus_20200426 | wc -l
     506

# accessions with BLAST hits
cat result/MN908947_blast.txt | grep -v "^#" | cut -f2 | sort -u | wc -l
     500

cat result/MN908947_blast.txt | grep -v "^#" | cut -f2 | sort -u > result/MN908947_matched.txt

The script extract_fasta.pl will extract specific FASTA entries. Below we fetch all the sequences that our query sequence matched (500 in total).

script/extract_fasta.pl -i result/MN908947_matched.txt -f raw/coronavirus_20200426.fa > result/MN908947_matched.fa

cat result/MN908947_matched.fa | grep "^>" | wc -l
     500

We will calculate some simple FASTA stats on the matched sequences.

script/fasta_stats.pl -f result/MN908947_matched.fa > result/MN908947_matched_stats.txt

Parse results

The script parse_outfmt7.pl simply parses the BLAST result and outputs the results in a more readable format.

script/parse_outfmt7.pl -i result/MN908947_blast.txt -p 80 -l 10000 -f raw/coronavirus_20200426.fa | less

ClustalW

Create a multiple sequence alignment of MN908947 to other bat coronaviruses.

script/extract_fasta.pl -i raw/wanted.txt -f raw/coronavirus_20200426.fa > raw/wanted.fa
clustalw -infile=raw/wanted.fa

script/extract_fasta.pl -i raw/MN908947_MN996532.txt -f raw/coronavirus_20200301.fa > raw/MN908947_MN996532.fa
clustalw -infile=raw/MN908947_MN996532.fa

SRA

The page https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/ contains a YAML file with accession information on the SARS-CoV-2 sequences currently deposited on NCBI databases.

wget -N https://www.ncbi.nlm.nih.gov/core/assets/genbank/files/ncov-sequences.yaml -O raw/ncov-sequences.yaml

head raw/ncov-sequences.yaml
updated: 'Thursay Apr 23 15:15 2020 EST'

genbank-sequences: [
    {
      "accession": "NC_045512",
      "accession_list": "<a href=\"https://www.ncbi.nlm.nih.gov/nuccore/NC_045512\">NC_045512</a>",
      "collection_date": "2019-12",
      "country": "China"
    },
    {

cat raw/ncov-sequences.yaml | grep "sra-run\"" | wc -l
     288

Did we get all the GenBank sequences using esearch and efetch? We should directly use the accessions in the YAML file as 417 sequences are missing.

cat raw/ncov-sequences.yaml | grep 'accession"' | perl -nle 's/.*"(\w+)",/$1/; print' > raw/genbank_list.txt

cat raw/genbank_list.txt | wc -l
    1622

script/extract_fasta.pl -i raw/genbank_list.txt -f raw/coronavirus_20200426.fa > raw/genbank_list.fa 2> raw/genbank_list_missing.txt

cat raw/genbank_list.fa | grep "^>" | wc -l
    1205

cat raw/genbank_list_missing.txt | wc -l
     417

Download using efetch and check out the distribution of sequence lengths.

my_accession=$(cat raw/genbank_list.txt | tr '\n' ',' | sed 's/,$//')
efetch -db sequences -format fasta -id $my_accession > raw/genbank_efetch.fa

cat raw/genbank_efetch.fa | grep "^>" | wc -l
    1621

script/fasta_stats.pl -f raw/genbank_efetch.fa | gzip > result/genbank_stat.txt.gz

# get stats from https://github.com/arq5x/filo/
gunzip -c result/genbank_stat.txt.gz  | cut -f2 | grep -v "Length" | stats
Total lines:            1621
Sum of lines:           45891887
Ari. Mean:              28310.8494756323
Geo. Mean:              23869.0792681778
Median:                 29844
Mode:                   29882 (N=208)
Anti-Mode:              64 (N=1)
Minimum:                64
Maximum:                29945
Variance:               41665645.6466825
StdDev:                 6454.89315532663

# some short sequences
gunzip -c result/genbank_stat.txt.gz  |  sort -k2n | head -6
Accession       Length  A       C       G       T       Unknown
MT293547.1      64      13      16      12      23      0
MT273658.1      84      20      15      22      27      0
MT163712.1      87      20      17      23      27      0
MN938387.1      107     38      14      22      33      0
MN938388.1      107     38      14      22      33      0

cat raw/genbank_efetch.fa | grep  MT293547.1
>MT293547.1 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/IRQ/KRD/2020 envelope protein (E) gene, partial cds

Use wget to obtain metadata of all SARS-CoV-2 raw sequences upload to the SRA. The YAML file no longer uses project accessions (SRP242226), so we will get information on each run instead (SRR10948550).

mkdir sra

# "sra-run": "SRR10948550"
wget -O sra/SRR10948550_info.csv 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRR10948550'

Checkout metadata using csvkit.

csvcut -n sra/SRR10948550_info.csv 
  1: Run
  2: ReleaseDate
  3: LoadDate
  4: spots
  5: bases
  6: spots_with_mates
  7: avgLength
  8: size_MB
  9: AssemblyName
 10: download_path
 11: Experiment
 12: LibraryName
 13: LibraryStrategy
 14: LibrarySelection
 15: LibrarySource
 16: LibraryLayout
 17: InsertSize
 18: InsertDev
 19: Platform
 20: Model
 21: SRAStudy
 22: BioProject
 23: Study_Pubmed_id
 24: ProjectID
 25: Sample
 26: BioSample
 27: SampleType
 28: TaxID
 29: ScientificName
 30: SampleName
 31: g1k_pop_code
 32: source
 33: g1k_analysis_group
 34: Subject_ID
 35: Sex
 36: Disease
 37: Tumor
 38: Affection_Status
 39: Analyte_Type
 40: Histological_Type
 41: Body_Site
 42: CenterName
 43: Submission
 44: dbgap_study_accession
 45: Consent
 46: RunHash
 47: ReadHash

csvcut -c Run,Sample,BioSample,spots,LibraryStrategy,LibrarySource,LibraryLayout,Platform,Model,SampleType,CenterName sra/SRR10948550_info.csv | csvlook
| Run         | Sample     | BioSample    |   spots | LibraryStrategy | LibrarySource | LibraryLayout | Platform        | Model  | SampleType | CenterName            |
| ----------- | ---------- | ------------ | ------- | --------------- | ------------- | ------------- | --------------- | ------ | ---------- | --------------------- |
| SRR10948550 | SRS6014638 | SAMN13871323 | 425,717 | RNA-Seq         | GENOMIC       | SINGLE        | OXFORD_NANOPORE | MinION | simple     | HKU-SHENZHEN HOSPITAL |
|             |            |              |         |                 |               |               |                 |        |            |                       |

Download all metadata using SRR accessions from raw/ncov-sequences.yaml.

my_accession=$(cat raw/ncov-sequences.yaml | grep 'sra-run"' | perl -nle 's/.*"(\w+)",/$1/; print')
for acc in $my_accession; do
   wget -O sra/${acc}_info.csv "http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=$acc"
done

# concenate into one metadata file
rm -f sra/metadata.txt
touch sra/metadata.txt
for file in `ls sra/SRR*info.csv`; do
   echo $file;
   csvcut -c Run,Sample,BioSample,spots,LibraryStrategy,LibrarySource,LibraryLayout,Platform,Model,SampleType,CenterName $file >> sra/metadata.txt
done

head sra/metadata.txt
Run,Sample,BioSample,spots,LibraryStrategy,LibrarySource,LibraryLayout,Platform,Model,SampleType,CenterName
SRR10902284,SRS6014638,SAMN13871323,261890,RNA-Seq,METAGENOMIC,SINGLE,OXFORD_NANOPORE,MinION,simple,UNIVERSITY OF HONG KONG
,,,,,,,,,,
Run,Sample,BioSample,spots,LibraryStrategy,LibrarySource,LibraryLayout,Platform,Model,SampleType,CenterName
SRR10903401,SRS6007144,SAMN13872787,476632,RNA-Seq,METATRANSCRIPTOMIC,PAIRED,ILLUMINA,Illumina MiSeq,simple,WUHAN UNIVERSITY
,,,,,,,,,,
Run,Sample,BioSample,spots,LibraryStrategy,LibrarySource,LibraryLayout,Platform,Model,SampleType,CenterName
SRR10903402,SRS6007143,SAMN13872786,676694,RNA-Seq,METATRANSCRIPTOMIC,PAIRED,ILLUMINA,Illumina MiSeq,simple,WUHAN UNIVERSITY
,,,,,,,,,,
Run,Sample,BioSample,spots,LibraryStrategy,LibrarySource,LibraryLayout,Platform,Model,SampleType,CenterName

# technology
cat sra/metadata.txt | grep -v "^," | grep -v "^Run" | cut -f9 -d','| sort | uniq -c
   1 BGISEQ-500
 248 GridION
   7 Illumina HiSeq 3000
  18 Illumina MiSeq
   1 Illumina MiniSeq
   2 Ion Torrent S5
   3 MinION
   8 NextSeq 500

# center
cat sra/metadata.txt | grep -v "^," | grep -v "^Run" | cut -f11 -d','| sort | uniq -c
   1 "SHANGHAI PUBLIC HEALTH CLINICAL CENTER & SCHOOL OF PUBLIC HEALTH
  11 "WUHAN INSTITUTE OF VIROLOGY
   1 BGI
   2 HKU-SHENZHEN HOSPITAL
   1 THE SCRIPPS RESEARCH INSTITUTE
   2 UFBA
   1 UNIVERSIDAD TECNOLOGICA DE PEREIRA
   1 UNIVERSITY OF HONG KONG
   3 UNIVERSITY OF MELBOURNE
  14 UNIVERSITY OF WASHINGTON
   8 UNIVERSITY OF WISCONSIN - MADISON
 243 WUHAN UNIVERSITY

cat sra/metadata.txt | grep MELBOURNE
SRR11267570,SRS6201528,SAMN14167851,779208,RNA-Seq,VIRAL RNA,SINGLE,OXFORD_NANOPORE,GridION,simple,UNIVERSITY OF MELBOURNE
SRR11350376,SRS6201528,SAMN14167851,779208,RNA-Seq,VIRAL RNA,SINGLE,OXFORD_NANOPORE,GridION,simple,UNIVERSITY OF MELBOURNE
SRR11300652,SRS6313628,SAMN14371025,430923,RNA-Seq,VIRAL RNA,SINGLE,OXFORD_NANOPORE,GridION,simple,UNIVERSITY OF MELBOURNE

Download all Illumina data for further analysis. (I have not analysed Oxford Nanopore before, so I won't download these yet.)

for acc in `cat sra/metadata.txt | grep ILLUMINA | cut -f1 -d','`; do
   echo $acc
   fasterq-dump -p --outdir raw/fastq $acc
done

SRR10971381

We will analysis the data from SRR10971381 following the preprocessing steps outlined in https://github.com/galaxyproject/SARS-CoV-2/blob/master/1-PreProcessing/pp_wf.png.

Use SRA Toolkit to download FASTQ sequences from the SRA. First use prefetch, a command-line for downloading SRA, dbGaP, and ADSP data, to download sra files. See https://www.ncbi.nlm.nih.gov/books/NBK242621/ for more information.

Using prefetch kept resulting in timeout errors. Perhaps https://github.com/ncbi/sra-tools/wiki/06.-Connection-Timeouts will help?

prefetch --output-directory raw SRR10971381

# connection keeps dying
wget https://sra-download.ncbi.nlm.nih.gov/traces/sra46/SRR/010714/SRR10971381

md5sum SRR10971381 > SRR10971381.md5sum
cat SRR10971381.md5sum
5496488662893a836e23541b84bfb7cd  SRR10971381

Next use fastq-dump to convert SRA data into FASTQ.

fastq-dump --split-files ./SRR10971381
2020-03-10T14:51:00 fastq-dump.2.10.3 err: name not found while resolving query within virtual file system module - failed to resolve accession './SRR10971381' - Cannot resolve accession ( 404 )
Read 28282964 spots for ./SRR10971381
Written 28282964 spots for ./SRR10971381

Preprocess using fastp.

fastp --thread 8 -i SRR10971381_1.fastq -I SRR10971381_2.fastq -o SRR10971381_1_fastp.fastq -O SRR10971381_2_fastp.fastq
Read1 before filtering:
total reads: 28282964
total bases: 4017125680
Q20 bases: 1783384314(44.3945%)
Q30 bases: 1735659038(43.2065%)

Read2 before filtering:
total reads: 28282964
total bases: 4013917534
Q20 bases: 1723725994(42.9437%)
Q30 bases: 1652844944(41.1779%)

Read1 after filtering:
total reads: 13054241
total bases: 1786633510
Q20 bases: 1671652872(93.5644%)
Q30 bases: 1634552420(91.4878%)

Read2 aftering filtering:
total reads: 13054241
total bases: 1782180210
Q20 bases: 1625652911(91.2171%)
Q30 bases: 1568126467(87.9892%)

Filtering result:
reads passed filter: 26108482
reads failed due to low quality: 30441256
reads failed due to too many N: 16190
reads failed due to too short: 0
reads with adapter trimmed: 582728
bases trimmed due to adapters: 30162896

Duplication rate: 5.57505%

Insert size peak (evaluated by paired-end reads): 150

JSON report: fastp.json
HTML report: fastp.html

fastp --thread 8 -i SRR10971381_1.fastq -I SRR10971381_2.fastq -o SRR10971381_1_fastp.fastq -O SRR10971381_2_fastp.fastq --thread 8
fastp v0.20.0, time used: 544 seconds

Quality control using FastQC.

mkdir fastqc_out
fastqc -o fastqc_out -f fastq SRR10971381_1_fastp.fastq SRR10971381_2_fastp.fastq

Use BWA MEM to map raw reads back to MN908947.

mkdir bwa_index
cp MN908947.fa bwa_index
cd bwa_index
bwa index MN908947.fa

bwa mem -t 8 raw/bwa_index/MN908947.fa raw/SRR10971381/SRR10971381_1_fastp.fastq raw/SRR10971381/SRR10971381_2_fastp.fastq | samtools sort - -o result/SRR10971381_MN908947.bam

Stats on the BAM file.

samtools flagstat -@8 SRR10971381_MN908947.bam 
26137357 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
28875 + 0 supplementary
0 + 0 duplicates
174256 + 0 mapped (0.67% : N/A)
26108482 + 0 paired in sequencing
13054241 + 0 read1
13054241 + 0 read2
136034 + 0 properly paired (0.52% : N/A)
136414 + 0 with itself and mate mapped
8967 + 0 singletons (0.03% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

samtools view -F 0x804 -f 2 -b SRR10971381_MN908947.bam > SRR10971381_MN908947_mapped.bam

Variant calling.

bcftools mpileup -f raw/MN908947.fa result/SRR10971381_MN908947_mapped.bam | bcftools call -mv -Ov -o result/SRR10971381_MN908947_mapped.vcf

Links

Appendix

Entrez Direct Functions

See Entrez Direct (EDirect) for more information.

Navigation functions support exploration within the Entrez databases:

  • esearch performs a new Entrez search using terms in indexed fields.
  • elink looks up neighbors (within a database) or links (between databases).
  • efilter filters or restricts the results of a previous query.

Records can be retrieved in specified formats or as document summaries:

  • efetch downloads records or reports in a designated format.

Desired fields from XML results can be extracted without writing a program:

  • xtract converts EDirect XML output into a table of data values.

Several additional functions are also provided:

  • einfo obtains information on indexed fields in an Entrez database.
  • epost uploads unique identifiers (UIDs) or sequence accession numbers.
  • nquire sends a URL request to a web page or CGI service.