3. Pointing seQuoia Sheppard to Location of Sequencing Data

Rauf Salamzade edited this page Dec 22, 2020 · 1 revision

Meta-Information File

The meta-information file is a tab-delimited file that can contain any data associated with the sequenced samples. The only required field is "sample_id", which is expected as the first column. Its values should correspond to the naming of the sequencing files found in the sequencing directory.

Here is an example of how this file could look. The header is essential.

sample_id       taxonomy                  dna_concentration_ug/L
Sample_1        Escherichia coli          5.0
Sample_2        Bacillus subtilis         5.4
Sample_3        Escherichia coli          3.3
.
.
.
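Before running seQuoia, it can help to confirm that the meta-information file has the expected header. The following is a minimal sketch, not part of seQuoia itself; the function name `read_sample_ids` is hypothetical and only the "sample_id" first-column requirement comes from the text above.

```python
import csv

def read_sample_ids(meta_path):
    """Return the sample IDs from a tab-delimited meta-information file.

    Hypothetical helper: it only enforces the rule stated above, namely
    that the header's first column is "sample_id".
    """
    with open(meta_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, None)
        if not header or header[0] != "sample_id":
            raise ValueError('first column of the header must be "sample_id"')
        # Skip blank lines; the first field of each row is the sample ID.
        return [row[0] for row in reader if row and row[0].strip()]
```

The returned list can then be compared against the sample names used in the mapping files described below.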

Illumina Input Data

Illumina data can be input in a variety of ways. Some require minimal preparation on the user's end. Others require the construction of 'mapping' files, which relate sample names to the full paths where the sequencing data is stored. seQuoia currently accepts four formats for Illumina data. The data type needs to be specified when users run seQuoia:

gp-directory

Use this when you have the Illumina data as BAM files stored in GP directories in the common structure. In this case you will often have Picard stats for these samples as well. These contain useful information, which sheppard.py will copy over to the seQuoia repos for each sample for downstream visualization in reporter.py.

How to provide input files:

These can be provided in only one way: a two-column tab-delimited file. The first column should be the sample identifier (matching what is in the --meta input file) and the second column should be the path to the current or specific GP repository (the full path to the directory containing the BAM file and the Picard stats files):

# no header is required for these files
Sample_1    /path/to/GP-33234/current/
Sample_2    /path/to/GP-33233/current/
.
.
.
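A mapping file like this can be sanity-checked against the meta-information file before it is handed to seQuoia. The sketch below is an illustration under stated assumptions: `parse_mapping` and `unmapped_samples` are hypothetical names, every non-blank line is treated as a record, and nothing here reflects how sheppard.py parses the file internally.

```python
import csv

def parse_mapping(mapping_path, n_columns=2):
    """Map each sample ID to its tuple of paths from a headerless mapping file."""
    mapping = {}
    with open(mapping_path, newline="") as handle:
        rows = csv.reader(handle, delimiter="\t")
        for line_no, row in enumerate(rows, start=1):
            if not row:
                continue  # ignore blank lines
            if len(row) != n_columns:
                raise ValueError(
                    f"line {line_no}: expected {n_columns} columns, got {len(row)}"
                )
            mapping[row[0]] = tuple(row[1:])
    return mapping

def unmapped_samples(sample_ids, mapping):
    """Return meta-file sample IDs that have no entry in the mapping."""
    return sorted(set(sample_ids) - set(mapping))
```

An empty list from `unmapped_samples` means every sample in the meta file has a corresponding path.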

illumina-paired or illumina-single

If you did not get your data directly from GP, chances are it is still in FASTQ format. If so, you will need to specify the data as illumina-paired if you have paired-end sequencing data or illumina-single if you have single-end sequencing data.

How to provide input files:

When you have FASTQ files, you can provide these in one of two ways depending on how you have them structured and what you find more convenient:

option 1: tab delimited file

This will require you to create a tab-delimited file with two or three columns, depending on whether you have single-end or paired-end FASTQ data, respectively. The first column is the sample_id and the subsequent columns provide the full paths to either the single-end read file or both paired-end read files:

# no header is required for these files
Sample_1    /path/to/sample1_R1.fastq.gz 
Sample_2    /path/to/sample2_R1.fastq.gz

or

# no header is required for these files
Sample_1    /path/to/sample1_R1.fastq.gz    /path/to/sample1_R2.fastq.gz
Sample_2    /path/to/sample2_R1.fastq.gz    /path/to/sample2_R2.fastq.gz
.
.
.
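If the FASTQ paths are already known in a script, the mapping file can be generated rather than typed by hand. This is a minimal sketch with a hypothetical helper name, `write_fastq_mapping`; it assumes that a single file is either all single-end or all paired-end, which is my reading of the two examples above rather than a documented rule.

```python
import csv

def write_fastq_mapping(out_path, reads_by_sample):
    """Write sample_id plus one (single-end) or two (paired-end) FASTQ paths per line.

    reads_by_sample maps a sample ID to a list of 1 or 2 full paths.
    """
    widths = {len(paths) for paths in reads_by_sample.values()}
    # Assumption: do not mix single-end and paired-end rows in one file.
    if len(widths) != 1 or not widths <= {1, 2}:
        raise ValueError("every sample needs exactly 1 path, or every sample exactly 2")
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for sample_id, paths in reads_by_sample.items():
            writer.writerow([sample_id, *paths])  # no header row is written
```

For example, passing `{"Sample_1": ["/path/to/sample1_R1.fastq.gz", "/path/to/sample1_R2.fastq.gz"]}` produces one three-column line for Sample_1.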

option 2: a directory of FASTQ files with appropriate names

Alternatively, if all your FASTQ files are named to match the sample identifiers provided in the meta input file and all of these files are in the same directory, you can point to this directory directly instead of mapping each sample name to its sequencing data as in option 1.

This option relies on some assumptions about file naming and the suffixes used. Some accepted formats include:

X_R1.fastq   (with / without gzip accepted)
X.1.fastq    (with / without gzip accepted)
X_R1.fq      (with / without gzip accepted)
X.1.fq       (with / without gzip accepted)

Here X is the sample name, which must match what is found in the meta input file.
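To check ahead of time whether the files in a directory follow one of the naming patterns listed above, a regular expression can be used. The sketch below covers only the four listed patterns; treating `_R2` and `.2` as the second read of a pair is an assumption on my part, and `classify_fastq` is a hypothetical name, not a seQuoia function. seQuoia may accept other patterns that this check would reject.

```python
import re

# X_R1.fastq, X.1.fastq, X_R1.fq, X.1.fq, each optionally followed by .gz.
# Assumption: the matching read-2 files use _R2 or .2 in the same position.
FASTQ_NAME = re.compile(
    r"(?P<sample>.+?)(?:_R|\.)(?P<read>[12])\.(?:fastq|fq)(?:\.gz)?"
)

def classify_fastq(filename):
    """Return (sample_name, read_number) or None if the name does not match."""
    match = FASTQ_NAME.fullmatch(filename)
    if match is None:
        return None
    return match.group("sample"), int(match.group("read"))
```

Running `classify_fastq` over `os.listdir()` of the FASTQ directory and comparing the resulting sample names to the meta input file shows which samples would be left without reads.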

bam

If you did not get your data directly from GP but have it in BAM format (unaligned or aligned), fear not: you can simply provide your data in the following way.

How to provide input files:

These can be provided in only one way: a two-column tab-delimited file. The first column should be the sample identifier (matching what is in the --meta input file) and the second column should be the path to the sample's BAM file.

# no header is required for these files
Sample_1    /path/to/sample1.bam
Sample_2    /path/to/sample2.bam
.
.
.

Providing Nanopore Data

Currently, there is only one supported way to provide nanopore data, and it assumes the data is structured as an Albacore post-basecalling + demultiplexing result directory. Importantly, this result directory should be devoid of subdirectories labelled '0/', '1/', '2/', '3/' ... The presence of such subdirectories indicates that Albacore mimicked the structure of how the data was provided to it. To collapse things down for easier parsing, please try setting the READS_PER_FASTQ_BATCH option in Albacore to something really high.

Once you have your data in a collapsed-down Albacore results directory, you need to create a listings file that links the sample name (consistent with what is provided in the meta information file) to the Albacore results directory and to the barcode identifier for the sample. This information is provided to seQcSheppard in a tab-delimited three-column text file that looks something like the following:

Sample1       /path/to/Albacore_FlowCell1_results/      barcode0001
Sample2       /path/to/Albacore_FlowCell1_results/      barcode0002
Sample3       /path/to/Albacore_FlowCell2_results/      barcode0001
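Both requirements above (no numbered batch subdirectories, and a three-column listings file) can be checked with a short script. This is a rough sketch with hypothetical helper names. It assumes the numbered subdirectories may appear at any depth under the results directory, which the text above does not specify, and it does not reproduce any parsing done by seQuoia itself.

```python
import os

def find_batch_subdirs(results_dir):
    """Return paths of subdirectories whose names are purely numeric (0, 1, 2, ...)."""
    hits = []
    for root, dirs, _files in os.walk(results_dir):
        hits.extend(os.path.join(root, d) for d in dirs if d.isdigit())
    return sorted(hits)

def parse_nanopore_listing(lines):
    """Parse lines of: sample name, Albacore results directory, barcode identifier."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = [field.strip() for field in line.rstrip("\n").split("\t")]
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 tab-separated columns")
        records.append(tuple(fields))
    return records
```

If `find_batch_subdirs` returns anything for a results directory named in the listings file, that directory has not been collapsed down as described above.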