
Make Treatment Template Program

sarpiens edited this page Mar 6, 2024 · 2 revisions

Description

The Make Treatment Template program generates a raw treatment template from the information in a reference input file and the downloaded fastq files. The resulting raw template file must be curated further before it can be used by the next programs of the workflow, Treat Metadata and Treat Fastqs. Together, the three programs provide extra treatment of fastq files and their associated metadata. This program belongs to the Optional Programs group, which means that these steps can be skipped if your metadata and fastq files need no further treatment.

There are two execution modes available:

  • ENA Mode. This mode generates the initial raw treatment template using the ENA Metadata Table as reference, for cases where the typical ENA workflow is being followed.

  • Generic Mode. This mode generates the initial raw treatment template using a Generic Manifest Table as reference. The Manifest Table file must contain the following columns of interest:

    • File Name. Indicates the names of the expected Fastq files. This column must be indicated as "file_name" in the table header.

    • Sample Name. Indicates the sample names associated with the Fastq files. This column must be indicated as "sample_name" in the table header. The Generic Metadata Table must contain a column with sample identifiers (--generic_merge_column) that will be compared with this column of the Manifest Table.

    • File MD5 Sum. Indicates the associated MD5 sum values of the Fastq files. This column must be indicated as "file_md5" in the table header.
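
As an illustration of the three required columns, the following is a minimal sketch of how a Generic Manifest Table could be assembled from a directory of Fastq files. It is not part of the program: the function names (`md5_of_file`, `guess_sample_name`, `write_manifest`) are hypothetical, the rule that derives the sample name by stripping a file suffix is only an assumption, and it assumes that "file_md5" is the MD5 sum of each file as stored on disk.

```python
import csv
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def guess_sample_name(file_name, suffixes=("_1.fastq.gz", "_2.fastq.gz", ".fastq.gz")):
    """Hypothetical rule: strip the first matching suffix to get a sample name."""
    for suffix in suffixes:
        if file_name.endswith(suffix):
            return file_name[: -len(suffix)]
    return file_name


def write_manifest(fastqs_directory, manifest_path, fastq_pattern=".fastq.gz"):
    """Write a tab-separated manifest with the three required columns."""
    fastq_files = sorted(
        path for path in Path(fastqs_directory).iterdir()
        if path.is_file() and path.name.endswith(fastq_pattern)
    )
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["file_name", "sample_name", "file_md5"])
        for path in fastq_files:
            writer.writerow([path.name, guess_sample_name(path.name), md5_of_file(path)])
    return len(fastq_files)
```

Calling `write_manifest("downloads", "GENERIC_manifest_file.tsv")` would then produce a file that could be passed with `-i` in Generic Mode. In practice the sample names should be checked against the sample identifier column of the Generic Metadata Table.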

By default, the raw template will have the following structure:

  • Sample Name Columns. The final curated treatment template must contain a column indicated as "sample_name" in its header. This column will be used to define the final names of the files, except when the "copy" treatment mode is used. The program will provide different results depending on the execution mode:

    • ENA Mode. The first element will be an empty column named "sample_name" followed by various candidate sample columns (run_accession, sample_title, sample_alias, secondary_sample_accession, library_name, sample_accession, run_alias). More candidate columns can be provided using the Extra Sample Columns option (-e parameter). The user will have to manually select a candidate column and copy its content into the "sample_name" column. The other candidate sample columns can be left as they are or removed, since they will be ignored by the Treat Fastqs and Treat Metadata programs.

    • Generic Mode. The values of the "sample_name" column of the provided Manifest Table will be used directly to generate this column.

  • Fastq File Name. The second element will be a ready-to-use column named "fastq_file_name", which contains a list of the fastq file names of interest in the provided Fastqs Directory. This column must be indicated as "fastq_file_name" in the final curated treatment template header.

  • Fastq Type. The third element will be a ready-to-use column named "fastq_type", which contains a list of the fastq types. Valid options are "pair1" (forward, R1 reads), "pair2" (reverse, R2 reads) or "single". This column must be indicated as "fastq_type" in the final curated treatment template header.

  • Treatment. The fourth element will be an empty column named "treatment" in which the user will have to indicate the treatment for each file. Valid options are "merge", "rename" or "copy". This column must be indicated as "treatment" in the final curated treatment template header.

For further details on a curated template, see the treatment_template_filtered_PRJEB10949_merged_metadata_example.tsv test file.
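
The header and value rules listed above can be checked before passing a curated template to the next programs. The following is a minimal sketch of such a check, not the validation that Treat Metadata or Treat Fastqs actually perform. The function name `check_template` is hypothetical, and flagging every empty "sample_name" (even for rows with the "copy" treatment) is a conservative assumption.

```python
import csv

REQUIRED_COLUMNS = ("sample_name", "fastq_file_name", "fastq_type", "treatment")
VALID_FASTQ_TYPES = {"pair1", "pair2", "single"}
VALID_TREATMENTS = {"merge", "rename", "copy"}


def check_template(template_path):
    """Return a list of human-readable problems found in a curated template."""
    problems = []
    with open(template_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        missing = [column for column in REQUIRED_COLUMNS if column not in header]
        if missing:
            return ["missing columns: " + ", ".join(missing)]
        for row in reader:
            line_number = reader.line_num
            # Short rows yield None for absent fields, so normalise to "".
            values = {column: (row.get(column) or "").strip() for column in REQUIRED_COLUMNS}
            if not values["sample_name"]:
                problems.append(f"line {line_number}: empty sample_name")
            if not values["fastq_file_name"]:
                problems.append(f"line {line_number}: empty fastq_file_name")
            if values["fastq_type"] not in VALID_FASTQ_TYPES:
                problems.append(f"line {line_number}: invalid fastq_type {values['fastq_type']!r}")
            if values["treatment"] not in VALID_TREATMENTS:
                problems.append(f"line {line_number}: invalid treatment {values['treatment']!r}")
    return problems
```

An empty list means that the template has the four expected columns and only uses the documented values. Extra candidate sample columns are tolerated, since the text above states that they are ignored.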

Input Elements:

| Input | Type | Description |
| --- | --- | --- |
| PROJECT_metadata.tsv or manifest.tsv | File | Input File. For ENA Mode, it will be one of the Metadata Tables generated in the different steps of the workflow by the Download Metadata ENA program (PROJECT_ENA_metadata.tsv), the Merge Metadata program (PROJECT_merged_metadata.tsv) or the Filter Metadata program (PROJECT_filtered_metadata.tsv). For Generic Mode, it will be a Generic Manifest Table. |
| /directory/path/ | Directory | Downloaded Fastqs Directory |

Output Elements:

| Output | Type | Description |
| --- | --- | --- |
| PROJECT_raw_treatment_template.tsv | File | Raw Treatment Template File |

The resulting PROJECT_raw_treatment_template.tsv file needs to be further curated to obtain the PROJECT_treatment_template.tsv. The latter will be used in the next workflow steps, using the Treat Metadata and Treat Fastqs programs. To get a general idea of the optional treatment steps of the workflow, check the workflow's diagram.
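
The curation step is manual, but for simple projects it could be scripted. The following is a minimal sketch under stated assumptions: the function name `curate_template` is hypothetical, the raw template is assumed to be a tab-separated file with the column names described above, and the same treatment is applied to every file, which will not suit datasets that need different treatments per file.

```python
import csv

KEPT_COLUMNS = ["sample_name", "fastq_file_name", "fastq_type", "treatment"]
VALID_TREATMENTS = {"merge", "rename", "copy"}


def curate_template(raw_path, curated_path, sample_column, treatment):
    """Fill sample_name from a candidate column and set one treatment for all rows."""
    if treatment not in VALID_TREATMENTS:
        raise ValueError(f"invalid treatment: {treatment}")
    # Read everything first so that no output file is created on bad input.
    with open(raw_path, newline="") as source:
        reader = csv.DictReader(source, delimiter="\t")
        if sample_column not in (reader.fieldnames or []):
            raise ValueError(f"column not found in raw template: {sample_column}")
        rows = list(reader)
    with open(curated_path, "w", newline="") as target:
        writer = csv.DictWriter(
            target,
            fieldnames=KEPT_COLUMNS,
            delimiter="\t",
            lineterminator="\n",
            extrasaction="ignore",  # drop the unused candidate sample columns
        )
        writer.writeheader()
        for row in rows:
            row["sample_name"] = row[sample_column]
            row["treatment"] = treatment
            writer.writerow(row)
    return len(rows)
```

For example, `curate_template("PROJECT_raw_treatment_template.tsv", "PROJECT_treatment_template.tsv", "sample_alias", "rename")` would copy the "sample_alias" values into "sample_name". The result should still be reviewed by hand before running the next programs.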

Arguments

Usage:

make_treatment_template [-h] -i INPUT_FILE -d FASTQS_DIRECTORY [-s {ENA,Generic}]
                        [-c {fastq_ftp,fastq_aspera,fastq_galaxy,submitted_ftp,submitted_aspera,submitted_galaxy}]
                        [-p FASTQ_PATTERN] [-r1 R1_PATTERN] [-r2 R2_PATTERN]
                        [-e EXTRA_SAMPLE_COLUMNS [EXTRA_SAMPLE_COLUMNS ...]] [-o OUTPUT_DIRECTORY] [-x] [-v]

Options:

| Parameter | Description |
| --- | --- |
| -h, --help | Show help message and exit. |
| -i, --input_file | Input Reference File. Indicate the path to the Input Reference File with the information to create the raw treatment template. |
| -d, --fastqs_directory | Fastqs Directory. Indicate the path to the Fastqs Directory. |
| -s, --mode | Execution Mode (Optional) [Default:ENA]. Options: 1) ENA Metadata Table File [Expected sep=TABS] or 2) Generic Manifest Table File [Expected sep=TABS]. Permitted options are {ENA, Generic}. |
| -c, --ena_download_column | ENA Download Column (Optional) [Default:fastq_ftp]. Indicate the ENA Metadata Table column that was used to download Fastq files. Permitted options are {fastq_ftp, fastq_aspera, fastq_galaxy, submitted_ftp, submitted_aspera, submitted_galaxy}. This parameter will be skipped if Generic mode is used. |
| -p, --fastq_pattern | Fastq File Pattern (Optional) [Default:".fastq.gz"]. Indicate the pattern to identify Fastq files. |
| -r1, --r1_pattern | R1 File Pattern (Optional) [Default:"_1.fastq.gz"]. Indicate the pattern to identify R1 PAIRED Fastq files. |
| -r2, --r2_pattern | R2 File Pattern (Optional) [Default:"_2.fastq.gz"]. Indicate the pattern to identify R2 PAIRED Fastq files. |
| -e, --extra_sample_columns | Extra Sample Columns (Optional). Indicate the column names for the extra possible sample names separated by spaces (if a column name has spaces, quote it). This parameter will be skipped if Generic mode is used. |
| -o, --output_directory | Output Directory (Optional). Indicate the path to the Output Directory. Output files will be created in the current directory if not indicated. |
| -x, --plain_text | Plain Text Mode (Optional). If indicated, it will enable Plain Text mode, and text will appear without colors. |
| -v, --version | Show program's version number and exit. |
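
To make the roles of the three pattern options more concrete, the following sketch shows one way a file name could be mapped to a "fastq_type" value. It is an illustration only and may differ from the program's internal logic: the function name `classify_fastq` is hypothetical, and treating each pattern as a file name suffix is an assumption.

```python
def classify_fastq(
    file_name,
    fastq_pattern=".fastq.gz",
    r1_pattern="_1.fastq.gz",
    r2_pattern="_2.fastq.gz",
):
    """Return "pair1", "pair2" or "single", or None if the file is not a Fastq file.

    Assumption: each pattern is matched as a suffix of the file name.
    """
    if not file_name.endswith(fastq_pattern):
        return None
    if file_name.endswith(r1_pattern):
        return "pair1"
    if file_name.endswith(r2_pattern):
        return "pair2"
    return "single"


if __name__ == "__main__":
    for name in ("ERR1_1.fastq.gz", "ERR1_2.fastq.gz", "ERR2.fastq.gz", "notes.txt"):
        print(name, classify_fastq(name))
```

Under this assumption, files with another extension are only recognized when all three patterns are changed together, for example `classify_fastq("S1_1.fq.gz", ".fq.gz", "_1.fq.gz", "_2.fq.gz")`.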

Examples

Commands:

  • Make Raw Treatment Template with colored text stdout:
    make_treatment_template -i filtered_PRJEB10949_merged_metadata.tsv -d downloads

  • Make Raw Treatment Template with plain text stdout:
    make_treatment_template -i filtered_PRJEB10949_merged_metadata.tsv -d downloads --plain_text

  • Make Raw Treatment Template using "submitted_ftp" instead of the default "fastq_ftp" as ENA Download Column:
    make_treatment_template -i filtered_PRJEB10949_merged_metadata.tsv -d downloads -c submitted_ftp

  • Make Raw Treatment Template using ".fq.gz" instead of the default ".fastq.gz" Fastq Pattern (the R1 and R2 patterns are changed to match):
    make_treatment_template -i PROJECT_metadata_files_other_fastq_extension.tsv -d downloads -p ".fq.gz" -r1 "_1.fq.gz" -r2 "_2.fq.gz"

  • Make Raw Treatment Template including an extra sample column candidate:
    make_treatment_template -i filtered_PRJEB10949_merged_metadata.tsv -d downloads --extra_sample_columns sample_column

  • Make Raw Treatment Template and save results in the specified directory (Example):
    make_treatment_template -i filtered_PRJEB10949_merged_metadata.tsv -d downloads -o /home/user/Desktop/Example

  • Make Raw Treatment Template using Generic Mode:
    make_treatment_template -s Generic -i GENERIC_manifest_file.tsv -d downloads

For a full and detailed example of dataset curation, see the Tutorial Full Example page, which is particularly recommended in this case.