
Check Fastqs Program

sarpiens edited this page Mar 6, 2024 · 7 revisions

Description

The Check Fastqs program runs a set of checks on the downloaded fastq files. It belongs to the Control Check Programs group, which means that it does not generate output files; it only prints information to the command terminal (stdout).

There are two execution modes available:

  • ENA Metadata Mode. Performs the fastq checks using the ENA Metadata Table as reference. Use this mode when following the typical ENA workflow.

  • Generic Metadata Mode. Performs the fastq checks using a Generic Manifest Table and a Generic Metadata Table as reference. It is especially useful when working with external datasets. The Manifest Table file must contain the following columns:

    • File Name. Indicates the names of the expected Fastq files. This column must be named "file_name" in the table header.

    • Sample Name. Indicates the sample names associated with the Fastq files. This column must be named "sample_name" in the table header. The Generic Metadata Table must contain a column with sample identifiers (--generic_common_column_mt), which is compared with this column of the Manifest Table.

    • File MD5 Sum. Indicates the MD5 sum values of the Fastq files. This column must be named "file_md5" in the table header.

    See the filtered_manifest_CRA001372_example.tsv test file for an example.
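
Before running Generic Metadata Mode, it can help to confirm that a manifest carries the three required column names. The following is a minimal sketch, not part of Check Fastqs itself: the function name `missing_manifest_columns` and the example rows are hypothetical, and only the header names come from the description above.

```python
import csv
import io

# Column names required in the Manifest Table header (taken from the text above).
REQUIRED_COLUMNS = ("file_name", "sample_name", "file_md5")


def missing_manifest_columns(manifest_text):
    """Return the required columns absent from a tab-separated manifest header."""
    reader = csv.reader(io.StringIO(manifest_text), delimiter="\t")
    header = next(reader, [])  # an empty manifest yields an empty header
    return [column for column in REQUIRED_COLUMNS if column not in header]


# Hypothetical two-row manifest, used only for illustration.
example = (
    "file_name\tsample_name\tfile_md5\n"
    "S1_1.fastq.gz\tS1\td41d8cd98f00b204e9800998ecf8427e\n"
    "S1_2.fastq.gz\tS1\td41d8cd98f00b204e9800998ecf8427e\n"
)
print(missing_manifest_columns(example))  # prints []
```

An empty list means that every required column is present; any returned names would need to be added or renamed in the manifest header.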

The analyses are divided into two main parts:

  • Main Information. Some relevant default information will be displayed depending on the execution mode:

    • ENA Metadata Mode. (1) Number of run accessions, (2) Number of unique sample accessions, (3) Number of unique sample aliases, (4) Appearances per library layout, (5) Number of URLs expected to be downloaded, (6) Number of fastq files in the provided directory.

    • Generic Metadata Mode. (1) Number of rows in metadata, (2) Number of unique samples for the provided generic_common_column_mt in metadata, (3) Number of Fastq files in manifest, (4) Number of unique samples for "sample_name" column in manifest, (5) Number of fastq files in the provided directory.

  • Fastq Checks. Some relevant checks will be carried out for the fastq files depending on the execution mode:

    • ENA Metadata Mode. (1) Check that the expected files from the Metadata Table exist in the provided directory, (2) Check if there are fastq files in the provided directory that are absent from the Metadata Table, (3) Check if there are fastq files in the provided directory with multiple matches in the Metadata Table. In addition, an optional mode (parameter -m) checks that the MD5 sums of the downloaded files match the corresponding MD5s in the Metadata Table.

    • Generic Metadata Mode. (1) Check that the expected files from the Manifest Table exist in the provided directory, (2) Check if there are fastq files in the provided directory that are absent from the Manifest Table, (3) Check if there are fastq files in the provided directory with multiple matches in the Manifest Table, (4) Compare the values of the "sample_name" column of the Manifest Table with those of the provided generic_common_column_mt column of the Generic Metadata Table. In addition, an optional mode (parameter -m) checks that the MD5 sums of the downloaded files match the corresponding MD5s in the "file_md5" column of the Manifest Table.
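
Checks (1) to (3) amount to comparing the file names expected from the table with the file names found in the directory. The sketch below illustrates that idea only; it is not the program's implementation. The helper names `list_fastqs` and `compare_fastqs` are hypothetical, and treating the Fastq pattern as a file-name suffix is an assumption.

```python
from collections import Counter
from pathlib import Path


def list_fastqs(directory, pattern=".fastq.gz"):
    """List file names in a directory that end with the given pattern.

    Assumption: the pattern is treated as a file-name suffix.
    """
    return sorted(p.name for p in Path(directory).iterdir()
                  if p.is_file() and p.name.endswith(pattern))


def compare_fastqs(expected_names, found_names):
    """Compare expected Fastq file names with those found in a directory.

    Returns three sorted lists: expected files that are missing, found files
    absent from the table, and found files listed more than once in the table.
    """
    expected_counts = Counter(expected_names)
    found = set(found_names)
    missing = sorted(set(expected_counts) - found)
    unexpected = sorted(found - set(expected_counts))
    multiple = sorted(name for name in found if expected_counts[name] > 1)
    return missing, unexpected, multiple


# Hypothetical file names, for illustration only.
expected = ["A_1.fastq.gz", "A_2.fastq.gz", "B.fastq.gz", "B.fastq.gz"]
found = ["A_1.fastq.gz", "B.fastq.gz", "C.fastq.gz"]
missing, unexpected, multiple = compare_fastqs(expected, found)
print(missing)     # ['A_2.fastq.gz']
print(unexpected)  # ['C.fastq.gz']
print(multiple)    # ['B.fastq.gz']
```

In practice the `found` list would come from `list_fastqs("downloads")`, and the `expected` list from the relevant column of the Metadata or Manifest Table.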

If any check raises a warning, an advice message is displayed explaining the likely cause for concern and what should be done about it.
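
The optional MD5 check (parameter -m) is the kind of check most likely to produce such a warning, for example after an interrupted download. As a rough illustration of what an MD5 comparison involves, and not the program's actual code, the sketch below hashes each file in chunks and reports the names whose digest differs from a reference value. The function names and the `expected_md5s` dictionary are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_mismatches(directory, expected_md5s):
    """Return names of existing files whose MD5 differs from the expected value.

    expected_md5s maps file names to reference MD5 hex digests.
    Files that do not exist are skipped here; a presence check covers them.
    """
    mismatches = []
    for name, expected in expected_md5s.items():
        path = Path(directory) / name
        if path.is_file() and md5_of_file(path) != expected.lower():
            mismatches.append(name)
    return sorted(mismatches)
```

Reading in chunks keeps memory use flat even for large compressed fastq files.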

Input Elements:

| Input | Type | Description |
|---|---|---|
| PROJECT_metadata.tsv | File | Metadata Table. One of the Metadata Tables generated in the different steps of the ENA workflow by the Download Metadata ENA program (PROJECT_ENA_metadata.tsv), the Merge Metadata program (PROJECT_merged_metadata.tsv) or the Filter Metadata program (PROJECT_filtered_metadata.tsv). A Generic Metadata Table (GENERIC_metadata_file.tsv) is also accepted. |
| manifest.tsv | File | Manifest Table. Only needed when working in Generic mode. |
| /directory/path/ | Directory | Downloaded Fastqs Directory |

Output Elements:

| Output | Type | Description |
|---|---|---|
| Analysis and Checks | stdout | Results of the different analyses and checks of the Downloaded Fastq Files |

Arguments

Usage:

check_fastqs [-h] -t METADATA_TABLE -d FASTQS_DIRECTORY [-a MANIFEST_TABLE] [-s {ENA,Generic}]
             [-c {fastq_ftp,fastq_aspera,fastq_galaxy,submitted_ftp,submitted_aspera,submitted_galaxy}]
             [-g GENERIC_COMMON_COLUMN_MT] [-p FASTQ_PATTERN] [-m] [-x] [-v]

Options:

| Parameter | Description |
|---|---|
| -h, --help | Show help message and exit. |
| -t, --metadata_table | Metadata Table [Expected sep=TABS]. Indicate the path to the Metadata Table file. |
| -d, --fastqs_directory | Fastqs Directory. Indicate the path to the Fastqs Directory. |
| -a, --manifest_table | Manifest Table [Expected sep=TABS]. Indicate the path to the Manifest Table file. This parameter will be skipped if ENA mode is used. |
| -s, --mode | Execution Mode (Optional) [Default:ENA]. Options: 1) ENA Metadata Table File [Expected sep=TABS] or 2) Generic Metadata and Manifest Table Files [Expected sep=TABS]. Permitted options are {ENA, Generic}. |
| -c, --ena_download_column | ENA Download Column (Optional) [Default:fastq_ftp]. Indicate the ENA Metadata Table column that was used to download Fastq files. Permitted options are {fastq_ftp, fastq_aspera, fastq_galaxy, submitted_ftp, submitted_aspera, submitted_galaxy}. This parameter will be skipped if Generic mode is used. |
| -g, --generic_common_column_mt | Generic Common Metadata Column (Optional) [Default:sample_id]. Indicate the name of the Common Column in Metadata Table to compare Metadata and Manifest Table Files. This parameter will be skipped if ENA mode is used. |
| -p, --fastq_pattern | Fastq File Pattern (Optional) [Default:".fastq.gz"]. Indicate the pattern to identify Fastq files. |
| -m, --md5_check | MD5 Check (Optional). If indicated, it will enable MD5 Check mode. |
| -x, --plain_text | Plain Text Mode (Optional). If indicated, it will enable Plain Text mode, and text will appear without colors. |
| -v, --version | Show program's version number and exit. |
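
To make the defaults and permitted values listed above easier to see at a glance, here is a hedged sketch of how an equivalent interface could be declared with Python's argparse. It mirrors only the options documented on this page; it is not the program's source code, and the `-v/--version` flag is left out because the version string is not given here.

```python
import argparse

# Permitted values for -c, as listed in the options above.
ENA_COLUMNS = ["fastq_ftp", "fastq_aspera", "fastq_galaxy",
               "submitted_ftp", "submitted_aspera", "submitted_galaxy"]


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="check_fastqs")
    parser.add_argument("-t", "--metadata_table", required=True)
    parser.add_argument("-d", "--fastqs_directory", required=True)
    parser.add_argument("-a", "--manifest_table")
    parser.add_argument("-s", "--mode", choices=["ENA", "Generic"], default="ENA")
    parser.add_argument("-c", "--ena_download_column",
                        choices=ENA_COLUMNS, default="fastq_ftp")
    parser.add_argument("-g", "--generic_common_column_mt", default="sample_id")
    parser.add_argument("-p", "--fastq_pattern", default=".fastq.gz")
    parser.add_argument("-m", "--md5_check", action="store_true")
    parser.add_argument("-x", "--plain_text", action="store_true")
    return parser


# With only the two required arguments, the documented defaults apply.
args = build_parser().parse_args(["-t", "PRJEB10949_ENA_metadata.tsv", "-d", "downloads"])
print(args.mode, args.ena_download_column, args.fastq_pattern)  # prints: ENA fastq_ftp .fastq.gz
```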

Examples

Commands:

  • Check fastqs with colored text stdout:
check_fastqs -t PRJEB10949_ENA_metadata.tsv -d downloads
  • Check fastqs with plain text stdout:
check_fastqs -t PRJEB10949_ENA_metadata.tsv -d downloads --plain_text
  • Check fastqs using "submitted_ftp" instead of the default "fastq_ftp" as ENA Download Column:
check_fastqs -t PRJEB10949_ENA_metadata.tsv -d downloads -c submitted_ftp
  • Check fastqs using additional MD5 Check mode:
check_fastqs -t PRJEB10949_ENA_metadata.tsv -d downloads --md5_check 
  • Check fastqs using ".fq.gz" instead of the default ".fastq.gz" as Fastq Pattern:
check_fastqs -t PROJECT_metadata_files_other_fastq_extension.tsv -d downloads -p ".fq.gz"
  • Check fastqs in Generic mode:
check_fastqs -s Generic -t GENERIC_metadata_file.tsv -a GENERIC_manifest_file.tsv -d downloads
  • Check fastqs in Generic mode using a different generic_common_column_mt:
check_fastqs -s Generic -g SampleID -t GENERIC_metadata_file.tsv -a GENERIC_manifest_file.tsv -d downloads
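
The last two commands run Generic mode, in which the "sample_name" column of the Manifest Table is compared with the Metadata Table column named by -g. A minimal sketch of such a comparison follows, assuming both tables are tab-separated text. The function names and the example tables are hypothetical and only illustrate the idea.

```python
import csv
import io


def column_values(table_text, column):
    """Return the set of non-empty values in one column of a tab-separated table."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return {row[column] for row in reader if row.get(column)}


def compare_samples(manifest_text, metadata_text, common_column="sample_id"):
    """Return samples found only in the manifest and only in the metadata."""
    manifest_samples = column_values(manifest_text, "sample_name")
    metadata_samples = column_values(metadata_text, common_column)
    only_manifest = sorted(manifest_samples - metadata_samples)
    only_metadata = sorted(metadata_samples - manifest_samples)
    return only_manifest, only_metadata


# Hypothetical tables, for illustration only.
manifest = "file_name\tsample_name\tfile_md5\nS1.fastq.gz\tS1\tabc\nS2.fastq.gz\tS2\tdef\n"
metadata = "SampleID\tcountry\nS2\tES\nS3\tFR\n"
print(compare_samples(manifest, metadata, common_column="SampleID"))  # (['S1'], ['S3'])
```

Two empty lists would mean that the manifest and the metadata describe exactly the same set of samples.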

For a full and detailed example of dataset curation, see the Tutorial Full Example page.