Find full-test data #85

Closed
jfy133 opened this issue Jun 2, 2022 · 17 comments · Fixed by #229
Labels
enhancement Improvement for existing functionality first-release Functionality defined in initial design high-priority

Comments

@jfy133
Member

jfy133 commented Jun 2, 2022

Description of feature

These should be 3-5 'real life' samples that you would profile against.

Ideally these would be short-read/long-read pairs.

@jfy133 jfy133 added the enhancement Improvement for existing functionality label Jun 2, 2022
@jfy133
Member Author

jfy133 commented Jun 2, 2022

https://lomanlab.github.io/mockcommunity/

And @sofstam will help look for 'real' Illumina/Nanopore stuff :)

@sofstam
Collaborator

sofstam commented Jun 3, 2022

I have asked at our site whether we are allowed to use some of our data as test data. Meanwhile, I found this for Nanopore:

https://www.ebi.ac.uk/ena/browser/view/PRJNA312719

@sofstam
Collaborator

sofstam commented Jun 10, 2022

@jfy133 I got a response from our site: we cannot use any of our data as test data right now. We need ethical approval, which we will be working on during the fall.

@jfy133
Member Author

jfy133 commented Jun 10, 2022

OK, let's just look for already-published stuff 👍

Maybe: https://www.nature.com/articles/s41597-019-0287-z ?

@sofstam
Collaborator

sofstam commented Jun 10, 2022

I will have a look at this!

Sounds good with the dataset from this article. Since the dataset is focused on bacteria, it might be good to have test data for viruses as well? https://www.ebi.ac.uk/ena/browser/view/PRJNA670157?show=reads

@jfy133 jfy133 added the first-release Functionality defined in initial design label Jun 16, 2022
@jfy133 jfy133 added this to the First Release milestone Sep 27, 2022
@Midnighter
Collaborator

Beyond the mock communities, I'm personally interested in using the CAMI data.

@jfy133
Member Author

jfy133 commented Nov 3, 2022

We also need to decide on the databases, and where to store them (presumably AWS...?)

@Midnighter
Collaborator

I think Zenodo could be a good place if we want to publish the benchmark. Then every database has a DOI.

@jfy133
Member Author

jfy133 commented Nov 3, 2022

I fear the file sizes for some will be too large for Zenodo (50 GB limit), but we can see.
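
As a rough way to see which database archives would hit that limit, something like the following could be used. It is only a sketch: the helper name is made up, the 50 GB figure is taken from the comment above and assumed to mean decimal gigabytes, and the sizes in the example are invented placeholders rather than measurements.

```python
# Hypothetical helper: flag archives that would exceed a per-upload size limit.
ZENODO_LIMIT_BYTES = 50 * 10**9  # assumption: "50GB" means decimal gigabytes


def too_large(sizes_in_bytes, limit=ZENODO_LIMIT_BYTES):
    """Return the sorted names of archives whose size exceeds the limit."""
    return sorted(name for name, size in sizes_in_bytes.items() if size > limit)


if __name__ == "__main__":
    # Invented sizes for illustration; real values would come from os.path.getsize().
    example = {"small_db.tar.gz": 8 * 10**9, "large_db.tar.gz": 120 * 10**9}
    print(too_large(example))
```

Anything the helper returns would need to be split up or hosted elsewhere.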

@sofstam
Collaborator

sofstam commented Dec 13, 2022

Minimum criteria for full-test data:

  • Fastq files, 5 Illumina, 2 Nanopore
  • Sequencing depth > 10M
  • Shotgun experiment
  • Multiple run accessions for one sample
  • Host removal (contaminant preferably)
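
The checkable parts of this list could be expressed as a small script for screening candidate datasets. This is a sketch under several assumptions: the `Run` fields are hypothetical, "5 Illumina, 2 Nanopore" is read as a count of samples, and "> 10M" is read as reads per run. The host removal criterion is left out because it cannot be judged from run metadata alone.

```python
from collections import defaultdict
from dataclasses import dataclass

MIN_READS = 10_000_000  # "Sequencing depth > 10M", assumed to mean reads per run
MIN_ILLUMINA = 5        # assumed to count samples, not files
MIN_NANOPORE = 2


@dataclass
class Run:
    """Hypothetical metadata record for one sequencing run."""
    sample: str
    run_accession: str
    platform: str  # "ILLUMINA" or "OXFORD_NANOPORE"
    reads: int
    shotgun: bool


def unmet_criteria(runs):
    """Return a list of problems; an empty list means all checked criteria hold."""
    problems = []
    samples = defaultdict(set)
    runs_per_sample = defaultdict(int)
    for run in runs:
        samples[run.platform].add(run.sample)
        runs_per_sample[run.sample] += 1
        if run.reads <= MIN_READS:
            problems.append(f"{run.run_accession}: depth not above 10M")
        if not run.shotgun:
            problems.append(f"{run.run_accession}: not a shotgun experiment")
    if len(samples["ILLUMINA"]) < MIN_ILLUMINA:
        problems.append("fewer than 5 Illumina samples")
    if len(samples["OXFORD_NANOPORE"]) < MIN_NANOPORE:
        problems.append("fewer than 2 Nanopore samples")
    if not any(count > 1 for count in runs_per_sample.values()):
        problems.append("no sample has multiple run accessions")
    return problems
```

Returning a list of problems rather than a single boolean makes it easier to see how close a candidate project is to qualifying.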

@jfy133
Member Author

jfy133 commented Jan 12, 2023

To 'borrow' from the MAG full-test data, we can pick 2-3 Illumina and 2-3 ONT samples/runs from here: https://www.ebi.ac.uk/ena/browser/view/PRJEB29152

It has both Illumina and Nanopore data, the sequencing depth is >10M, and it is shotgun.

@sofstam
Collaborator

sofstam commented Jan 19, 2023

https://www.nature.com/articles/s41597-019-0287-z

I'm posting the article here so we do not forget.

@jfy133
Member Author

jfy133 commented Jan 23, 2023

Meslier2022 is what we are going for: https://www.nature.com/articles/s41597-022-01762-z

I did a test run:

nextflow run nf-core/taxprofiler \
    -profile mpcdf,raven \
    -r combine-kreports-fix \
    --input fulltest_samplesheet.csv \
    --databases fulltest_dbsheet.csv \
    --outdir ./results \
    --save_preprocessed_reads \
    --perform_shortread_qc \
    --shortread_qc_mergepairs \
    --perform_shortread_complexityfilter \
    --save_complexityfiltered_reads \
    --perform_longread_qc \
    --perform_shortread_hostremoval \
    --perform_longread_hostremoval \
    --hostremoval_reference 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/819/615/GCA_000819615.1_ViralProj14015/GCA_000819615.1_ViralProj14015_genomic.fna.gz' \
    --save_hostremoval_index \
    --save_hostremoval_mapped \
    --save_hostremoval_unmapped \
    --perform_runmerging \
    --save_runmerged_reads \
    --run_centrifuge \
    --centrifuge_save_reads \
    --run_diamond \
    --run_kaiju \
    --run_kraken2 \
    --kraken2_save_reads \
    --kraken2_save_readclassification \
    --kraken2_save_minimizers \
    --run_krakenuniq \
    --krakenuniq_save_reads \
    --krakenuniq_save_readclassifications \
    --run_bracken \
    --run_malt \
    --malt_save_reads \
    --malt_generate_megansummary \
    --run_metaphlan3 \
    --run_motus \
    --run_profile_standardisation \
    --run_krona \
    -ansi-log false \
    -with-tower \
    -resume

With the following samplesheet:

sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta
MOCK_001_Minion_R9,1,OXFORD_NANOPORE,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/000/ERR9765780/ERR9765780.fastq.gz,,
MOCK_002_Minion_R9,1,OXFORD_NANOPORE,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/001/ERR9765781/ERR9765781.fastq.gz,,
MOCK_003_Minion_R9,1,OXFORD_NANOPORE,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/002/ERR9765782/ERR9765782.fastq.gz,,
MOCK_001_Illumina_Hiseq_3000,1,ILLUMINA,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/006/ERR9765746/ERR9765746_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/006/ERR9765746/ERR9765746_2.fastq.gz,
MOCK_002_Illumina_Hiseq_3000,1,ILLUMINA,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/007/ERR9765747/ERR9765747_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/007/ERR9765747/ERR9765747_2.fastq.gz,
MOCK_003_Illumina_Hiseq_3000,1,ILLUMINA,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/008/ERR9765748/ERR9765748_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/008/ERR9765748/ERR9765748_2.fastq.gz,
MOCK_003_Illumina_Hiseq_3000,2,ILLUMINA,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/009/ERR9765749/ERR9765749_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/009/ERR9765749/ERR9765749_2.fastq.gz,
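
To see at a glance which samples in a sheet like this carry more than one run (and so exercise run merging), a sketch like the one below could be used. The function names are made up for illustration, and the embedded sheet is a shortened excerpt with the FTP prefixes dropped from the file paths.

```python
import csv
import io
from collections import defaultdict

# Shortened excerpt of the samplesheet above; file paths are abbreviated.
SHEET = """sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta
MOCK_001_Minion_R9,1,OXFORD_NANOPORE,ERR9765780.fastq.gz,,
MOCK_003_Illumina_Hiseq_3000,1,ILLUMINA,ERR9765748_1.fastq.gz,ERR9765748_2.fastq.gz,
MOCK_003_Illumina_Hiseq_3000,2,ILLUMINA,ERR9765749_1.fastq.gz,ERR9765749_2.fastq.gz,
"""


def summarise(sheet_text):
    """Group rows by sample as (run_accession, platform, layout) tuples."""
    runs = defaultdict(list)
    for row in csv.DictReader(io.StringIO(sheet_text)):
        layout = "paired" if row["fastq_2"] else "single"
        runs[row["sample"]].append(
            (row["run_accession"], row["instrument_platform"], layout)
        )
    return dict(runs)


def multi_run_samples(summary):
    """Return the sorted names of samples that have more than one run."""
    return sorted(name for name, runs in summary.items() if len(runs) > 1)


if __name__ == "__main__":
    print(multi_run_samples(summarise(SHEET)))
```

In the full sheet, only MOCK_003_Illumina_Hiseq_3000 has two run accessions.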

And the database sheet:

tool,db_name,db_params,db_path
bracken,bracken-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/bracken.tar.gz
centrifuge,centrifuge-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/centrifuge.tar.gz
diamond,diamond-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022/diamond/diamond.dmnd
kaiju,kaiju-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/kaiju.tar.gz
kraken2,kraken2-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/kraken2.tar.gz
krakenuniq,krakenuniq-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/krakenuniq.tar.gz
malt,malt-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/malt.tar.gz
metaphlan3,metaphlan3-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/metaphlan3.tar.gz
motus,motus-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/motus.tar.gz
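
One way to sanity-check that every profiler switched on in the command has a row in this sheet is sketched below. The function name is hypothetical, and treating `--run_profile_standardisation` and `--run_krona` as steps that need no database row is an assumption made for this sketch.

```python
import csv
import io

# Assumption: these --run_* flags enable post-processing steps, not profilers,
# so they are not expected to appear in the database sheet.
NON_PROFILER_STEPS = {"profile_standardisation", "krona"}


def tools_missing_databases(flags, db_sheet_text):
    """Return the sorted names of enabled profilers with no row in the database sheet."""
    enabled = {
        flag[len("--run_"):] for flag in flags if flag.startswith("--run_")
    } - NON_PROFILER_STEPS
    with_db = {row["tool"] for row in csv.DictReader(io.StringIO(db_sheet_text))}
    return sorted(enabled - with_db)


if __name__ == "__main__":
    # Placeholder path for illustration; not one of the real database locations.
    sheet = "tool,db_name,db_params,db_path\nkraken2,kraken2-db,,/path/to/kraken2.tar.gz\n"
    print(tools_missing_databases(["--run_kraken2", "--run_malt", "-resume"], sheet))
```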

The database TARs are based on these instructions

And it mostly ran! Observations in the next comment

@jfy133
Member Author

jfy133 commented Jan 23, 2023

Output files:

Overall we get 35% of reads classified with Bracken, so I think this is a good sign that this is a reasonable dataset

@jfy133
Member Author

jfy133 commented Jan 24, 2023

@sofstam
Collaborator

sofstam commented Jan 31, 2023

Shall we close this?

@jfy133
Member Author

jfy133 commented Jan 31, 2023

Not yet, let's get the samplesheets and database sheets uploaded to test-datasets and then we can close it :)


3 participants