Find full-test data #85

jfy133 · 2022-06-02T10:38:14Z

Description of feature

These should be 3-5 'real life' samples that you would profile against.

Ideally these would be short-read/long-read pairs.

jfy133 · 2022-06-02T13:45:27Z

https://lomanlab.github.io/mockcommunity/

And @sofstam will help look for 'real' Illumina/Nanopore stuff :)

sofstam · 2022-06-03T12:32:51Z

I have asked at our site whether we are allowed to use some of our data as test data. Meanwhile, I found these for Nanopore:

https://www.ebi.ac.uk/ena/browser/view/PRJNA312719

sofstam · 2022-06-10T08:17:53Z

@jfy133 I got a response from our site: we cannot use any of our data as test data right now. We need ethical approval, which we will be working on during the fall.

jfy133 · 2022-06-10T08:23:17Z

OK, let's just look for already published stuff 👍

Maybe: https://www.nature.com/articles/s41597-019-0287-z ?

sofstam · 2022-06-10T08:27:05Z

I will have a look at this!

The dataset from this article sounds good. Since it is focused on bacteria, it might be good to have test data for viruses as well? https://www.ebi.ac.uk/ena/browser/view/PRJNA670157?show=reads

Midnighter · 2022-11-03T12:57:09Z

Beyond the mock communities, I'm personally interested in using the CAMI data.

jfy133 · 2022-11-03T13:41:16Z

We also need to decide on the databases, and where to store them (presumably AWS...?)

Midnighter · 2022-11-03T14:03:20Z

I think Zenodo could be a good place if we want to publish the benchmark. Then every database has a DOI.

jfy133 · 2022-11-03T14:27:07Z

I fear the file sizes for some will be too large for Zenodo (50GB limit), but we can see.

sofstam · 2022-12-13T09:50:44Z

Minimum criteria for full-test data:

Fastq files, 5 Illumina, 2 Nanopore
Sequencing depth > 10M
Shotgun experiment
Multiple run accessions for one sample
Host removal (contaminant preferably)

jfy133 · 2023-01-12T13:15:53Z

To 'borrow' from the MAG full-test data, we can pick 2-3 Illumina and 2-3 ONT samples/runs from here: https://www.ebi.ac.uk/ena/browser/view/PRJEB29152

It has both Illumina and Nanopore data, the sequencing depth is >10M, and it is shotgun.

sofstam · 2023-01-19T15:52:31Z

https://www.nature.com/articles/s41597-019-0287-z

I am posting the article here so we do not forget.

jfy133 · 2023-01-23T14:35:06Z

Meslier2022 is what we are going for: https://www.nature.com/articles/s41597-022-01762-z

I did a test run:

nextflow run nf-core/taxprofiler \
    -profile mpcdf,raven \
    -r combine-kreports-fix \
    --input fulltest_samplesheet.csv \
    --databases fulltest_dbsheet.csv \
    --outdir ./results \
    --save_preprocessed_reads \
    --perform_shortread_qc \
    --shortread_qc_mergepairs \
    --perform_shortread_complexityfilter \
    --save_complexityfiltered_reads \
    --perform_longread_qc \
    --perform_shortread_hostremoval \
    --perform_longread_hostremoval \
    --hostremoval_reference 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/819/615/GCA_000819615.1_ViralProj14015/GCA_000819615.1_ViralProj14015_genomic.fna.gz' \
    --save_hostremoval_index \
    --save_hostremoval_mapped \
    --save_hostremoval_unmapped \
    --perform_runmerging \
    --save_runmerged_reads \
    --run_centrifuge \
    --centrifuge_save_reads \
    --run_diamond \
    --run_kaiju \
    --run_kraken2 \
    --kraken2_save_reads \
    --kraken2_save_readclassification \
    --kraken2_save_minimizers \
    --run_krakenuniq \
    --krakenuniq_save_reads \
    --krakenuniq_save_readclassifications \
    --run_bracken \
    --run_malt \
    --malt_save_reads \
    --malt_generate_megansummary \
    --run_metaphlan3 \
    --run_motus \
    --run_profile_standardisation \
    --run_krona \
    -ansi-log false \
    -with-tower \
    -resume

With the following samplesheet:

sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta
MOCK_001_Minion_R9,1,OXFORD_NANOPORE,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/000/ERR9765780/ERR9765780.fastq.gz,,
MOCK_002_Minion_R9,1,OXFORD_NANOPORE,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/001/ERR9765781/ERR9765781.fastq.gz,,
MOCK_003_Minion_R9,1,OXFORD_NANOPORE,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/002/ERR9765782/ERR9765782.fastq.gz,,
MOCK_001_Illumina_Hiseq_3000,1,ILLUMINA,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/006/ERR9765746/ERR9765746_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/006/ERR9765746/ERR9765746_2.fastq.gz,
MOCK_002_Illumina_Hiseq_3000,1,ILLUMINA,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/007/ERR9765747/ERR9765747_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/007/ERR9765747/ERR9765747_2.fastq.gz,
MOCK_003_Illumina_Hiseq_3000,1,ILLUMINA,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/008/ERR9765748/ERR9765748_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/008/ERR9765748/ERR9765748_2.fastq.gz,
MOCK_003_Illumina_Hiseq_3000,2,ILLUMINA,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/009/ERR9765749/ERR9765749_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/009/ERR9765749/ERR9765749_2.fastq.gz,
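A sheet like this can be sanity-checked against the minimum criteria listed earlier in the thread (number of samples per platform, at least one sample with multiple run accessions). The following is only a rough Python sketch, not something the pipeline ships: the function name `summarise_samplesheet` is made up for illustration, and the file names in the example are shortened placeholders rather than the real ENA paths.

```python
import csv
import io
from collections import defaultdict


def summarise_samplesheet(text):
    """Return (sample count per platform, samples with more than one run)."""
    samples_by_platform = defaultdict(set)
    runs_by_sample = defaultdict(set)
    for row in csv.DictReader(io.StringIO(text)):
        samples_by_platform[row["instrument_platform"]].add(row["sample"])
        runs_by_sample[row["sample"]].add(row["run_accession"])
    platform_counts = {p: len(s) for p, s in samples_by_platform.items()}
    multi_run = sorted(s for s, runs in runs_by_sample.items() if len(runs) > 1)
    return platform_counts, multi_run


# Same sample names, run numbers and platforms as the sheet above;
# the file names are placeholders, not the real ENA paths.
EXAMPLE = """\
sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta
MOCK_001_Minion_R9,1,OXFORD_NANOPORE,ont_1.fastq.gz,,
MOCK_002_Minion_R9,1,OXFORD_NANOPORE,ont_2.fastq.gz,,
MOCK_003_Minion_R9,1,OXFORD_NANOPORE,ont_3.fastq.gz,,
MOCK_001_Illumina_Hiseq_3000,1,ILLUMINA,a_1.fastq.gz,a_2.fastq.gz,
MOCK_002_Illumina_Hiseq_3000,1,ILLUMINA,b_1.fastq.gz,b_2.fastq.gz,
MOCK_003_Illumina_Hiseq_3000,1,ILLUMINA,c_1.fastq.gz,c_2.fastq.gz,
MOCK_003_Illumina_Hiseq_3000,2,ILLUMINA,d_1.fastq.gz,d_2.fastq.gz,
"""

platform_counts, multi_run = summarise_samplesheet(EXAMPLE)
print(platform_counts)  # samples per platform
print(multi_run)        # samples that will go through run merging
```

For the sheet above this reports three samples per platform and one sample (MOCK_003_Illumina_Hiseq_3000) with two runs, which is the sample that exercises `--perform_runmerging`.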

And the database sheet:

tool,db_name,db_params,db_path
bracken,bracken-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/bracken.tar.gz
centrifuge,centrifuge-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/centrifuge.tar.gz
diamond,diamond-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022/diamond/diamond.dmnd
kaiju,kaiju-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/kaiju.tar.gz
kraken2,kraken2-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/kraken2.tar.gz
krakenuniq,krakenuniq-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/krakenuniq.tar.gz
malt,malt-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/malt.tar.gz
metaphlan3,metaphlan3-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/metaphlan3.tar.gz
motus,motus-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/motus.tar.gz
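With nine profilers switched on, it is easy to forget a row in the database sheet. A quick cross-check of the `--run_<tool>` flags against the `tool` column could look like the sketch below. This is a hypothetical helper (`missing_databases` is not part of the pipeline, and whether the pipeline performs a similar check itself is not covered in this thread). The `db_path` values are placeholders standing in for the cluster paths.

```python
import csv
import io


def missing_databases(enabled_tools, dbsheet_text):
    """Return the enabled tools that have no row in the database sheet."""
    listed = {row["tool"] for row in csv.DictReader(io.StringIO(dbsheet_text))}
    return sorted(set(enabled_tools) - listed)


# Profilers switched on with --run_<tool> in the command above.
ENABLED = ["bracken", "centrifuge", "diamond", "kaiju", "kraken2",
           "krakenuniq", "malt", "metaphlan3", "motus"]

# tool and db_name columns as in the sheet above; db_path values are
# placeholders, not the real paths.
DBSHEET = """\
tool,db_name,db_params,db_path
bracken,bracken-db,,bracken.tar.gz
centrifuge,centrifuge-db,,centrifuge.tar.gz
diamond,diamond-db,,diamond.dmnd
kaiju,kaiju-db,,kaiju.tar.gz
kraken2,kraken2-db,,kraken2.tar.gz
krakenuniq,krakenuniq-db,,krakenuniq.tar.gz
malt,malt-db,,malt.tar.gz
metaphlan3,metaphlan3-db,,metaphlan3.tar.gz
motus,motus-db,,motus.tar.gz
"""

# An empty list means every enabled profiler has a database row.
print(missing_databases(ENABLED, DBSHEET))
```

Note that `--run_profile_standardisation` and `--run_krona` are deliberately left out of `ENABLED`, since they have no row in the database sheet.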

The database TARs were built based on these instructions.

And it mostly ran! Observations in the next comment.

jfy133 · 2023-01-23T14:48:09Z

jfy133 · 2023-01-24T10:10:06Z

Add to the docs that we recommend running SR/LR separately (although running them together is supported): Add comment about recommendation to split SR/LR #220

sofstam · 2023-01-31T12:26:42Z

Shall we close this?

jfy133 · 2023-01-31T12:48:24Z

Not yet, let's get the samplesheets and database sheets uploaded to test-datasets, and then we can close it :)

jfy133 added the enhancement Improvement for existing functionality label Jun 2, 2022

jfy133 assigned jfy133 and sofstam Jun 2, 2022

jfy133 added the first-release Functionality defined in initial design label Jun 16, 2022

jfy133 added this to the First Release milestone Sep 27, 2022

jfy133 added the high-priority label Nov 3, 2022

jfy133 mentioned this issue Feb 2, 2023

Add full test data and documentation of the test data #229

Merged

9 tasks

jfy133 closed this as completed in #229 Feb 2, 2023
