Merge Metadata Program

sarpiens edited this page Feb 17, 2024 · 29 revisions

Description

The Merge Metadata program provides several options for merging your Main Metadata Table with an Extra Metadata Table, so that metadata from different sources can be combined. In the typical ENA workflow, the Main Metadata Table is the ENA metadata previously obtained with the Download Metadata ENA program, and the Extra Metadata Table is the publication's metadata. The program is general-purpose, however, and can also be used to merge metadata from external projects.

By default, a left join is performed, taking the Main Metadata Table file as the reference. After merging, the program also compares the values of the two provided merge columns by computing their intersection, to check for unique values that are not common to both tables (this check is not run in cross merge mode). For a more detailed explanation of the pandas merge modes, see the official pandas documentation. This program belongs to the Optional Programs group, which means that this step can be skipped if no extra metadata is available.
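As a rough illustration of the default behaviour described above, the following sketch performs a left join with pandas on two tiny in-memory tables. The accession values and column contents are made up for the example, and the exact way the program calls pandas internally is an assumption; only `DataFrame.merge` with `how="left"` is taken from the description.

```python
import io

import pandas as pd

# Hypothetical miniature tables (tab-separated, like the real input files).
MAIN_TSV = "run_accession\tsample_alias\nERR1\tA\nERR2\tB\nERR3\tC\n"
EXTRA_TSV = "run_accessions\ttreatment\nERR1\tcontrol\nERR2\tdrug\nERR9\tdrug\n"

main = pd.read_csv(io.StringIO(MAIN_TSV), sep="\t")
extra = pd.read_csv(io.StringIO(EXTRA_TSV), sep="\t")

# Left join: every row of the main table is kept; rows of the extra table
# without a match in the main table (ERR9 here) are dropped.
merged = main.merge(
    extra,
    how="left",
    left_on="run_accession",    # plays the role of -mc
    right_on="run_accessions",  # plays the role of -ec
)

print(merged.to_csv(sep="\t", index=False))
```

In this sketch the main-table row ERR3 has no counterpart in the extra table, so its extra columns are left empty, which is the kind of mismatch the intersection check is meant to surface.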

Input Elements:

| Input | Type | Description |
| --- | --- | --- |
| MAIN_metadata.tsv | File | Main Metadata Table |
| EXTRA_metadata.tsv | File | Extra Metadata Table |

Output Elements:

| Output | Type | Description |
| --- | --- | --- |
| MERGED_metadata.tsv | File | Merged Metadata Table |
| Merge Columns Intersection Checks | stdout | Results of the Merge Columns Intersection Analysis (except when using the cross merge mode) |
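The idea behind the Merge Columns Intersection Checks can be sketched with plain Python sets. The function name `compare_merge_columns` and the report keys are hypothetical, chosen for illustration; the program's actual stdout format is not reproduced here.

```python
def compare_merge_columns(main_values, extra_values):
    """Compare the unique values of two merge columns.

    Hypothetical helper: returns the values shared by both columns and
    the values found in only one of them, each as a sorted list.
    """
    main_unique = set(main_values)
    extra_unique = set(extra_values)
    return {
        "common": sorted(main_unique & extra_unique),
        "only_in_main": sorted(main_unique - extra_unique),
        "only_in_extra": sorted(extra_unique - main_unique),
    }


# Made-up accession values; duplicates are collapsed before comparing.
report = compare_merge_columns(
    ["ERR1", "ERR2", "ERR3", "ERR1"],
    ["ERR1", "ERR2", "ERR9"],
)
for label, values in report.items():
    print(f"{label}: {values}")
```

Non-empty "only in" lists indicate rows that have no match in the other table, which is worth reviewing before continuing with the merged file.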

The resulting MERGED_metadata.tsv file is the input for the next workflow step, which is the Check Metadata ENA program in the typical ENA workflow. Depending on your particular case, it can also be used in other workflow steps, including the Filter Metadata, Download Fastqs, Check Fastqs and Make Treatment Template programs. To find out which step comes next in your case, check the workflow's diagram.

Arguments

Usage:

merge_metadata [-h] -m MAIN_METADATA_TABLE -e EXTRA_METADATA_TABLE [-mc MAIN_MERGE_COLUMN]
               [-ec EXTRA_MERGE_COLUMN] [-p {left,right,outer,inner,cross}]
               [-ms MAIN_MERGE_SUFFIX] [-es EXTRA_MERGE_SUFFIX] [-o OUTPUT_DIRECTORY] [-x] [-v]

Options:

| Parameter | Description |
| --- | --- |
| -h, --help | Show help message and exit. |
| -m, --main_metadata_table | Main Metadata Table [Expected sep=TABS]. Path to the Main Metadata Table file. |
| -e, --extra_metadata_table | Extra Metadata Table [Expected sep=TABS]. Path to the Extra Metadata Table file. |
| -mc, --main_merge_column | Main Metadata Merge Column. Main Metadata Table column to be used for merging. This parameter is ignored if pandas_merge_mode = cross. |
| -ec, --extra_merge_column | Extra Metadata Merge Column. Extra Metadata Table column to be used for merging. This parameter is ignored if pandas_merge_mode = cross. |
| -p, --pandas_merge_mode | Pandas Merge Mode (Optional) [Default:left]. Pandas merge mode to be used for merging. Permitted options are {left,right,outer,inner,cross}. |
| -ms, --main_merge_suffix | Main Metadata Pandas Merge Suffix (Optional) [Default:"_x"]. Suffix added to overlapping column names that come from the Main Metadata Table. |
| -es, --extra_merge_suffix | Extra Metadata Pandas Merge Suffix (Optional) [Default:"_y"]. Suffix added to overlapping column names that come from the Extra Metadata Table. |
| -o, --output_directory | Output Directory (Optional). Path to the Output Directory. If not indicated, output files are created in the current directory. |
| -x, --plain_text | Plain Text Mode (Optional). If indicated, Plain Text mode is enabled and text is printed without colors. |
| -v, --version | Show program's version number and exit. |
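To make the effect of the merge suffixes concrete, the sketch below merges two small tables that share a non-key column name (`sample_alias`). The data are invented, and mapping `-ms`/`-es` directly onto the `suffixes` argument of pandas `merge` is an assumption based on the option descriptions.

```python
import pandas as pd

# Both hypothetical tables contain a "sample_alias" column, so the merged
# table needs suffixes to tell the two versions apart.
main = pd.DataFrame(
    {"run_accession": ["ERR1", "ERR2"], "sample_alias": ["A", "B"]}
)
extra = pd.DataFrame(
    {"run_accessions": ["ERR1", "ERR2"], "sample_alias": ["patient_1", "patient_2"]}
)

# pandas defaults, matching the documented "_x" / "_y" defaults.
default_merge = main.merge(
    extra, how="left", left_on="run_accession", right_on="run_accessions"
)

# Custom suffixes, analogous to: -ms _ENA -es _publication
custom_merge = main.merge(
    extra,
    how="left",
    left_on="run_accession",
    right_on="run_accessions",
    suffixes=("_ENA", "_publication"),
)

print(list(default_merge.columns))
print(list(custom_merge.columns))
```

Descriptive suffixes such as `_ENA` and `_publication` make it easier to see later which source each duplicated column came from.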

Examples

Commands:

  • Merge metadata with colored text stdout:
merge_metadata -m PRJEB10949_ENA_metadata.tsv -mc run_accession -e PRJEB10949_publication_example.tsv -ec run_accessions
  • Merge metadata with plain text stdout:
merge_metadata -m PRJEB10949_ENA_metadata.tsv -mc run_accession -e PRJEB10949_publication_example.tsv -ec run_accessions --plain_text
  • Merge metadata and save results in the specified directory (Example):
merge_metadata -m PRJEB10949_ENA_metadata.tsv -mc run_accession -e PRJEB10949_publication_example.tsv -ec run_accessions -o Example
  • Merge metadata with different suffixes:
merge_metadata -m PRJEB10949_ENA_metadata.tsv -mc run_accession -ms _ENA -e PRJEB10949_publication_example.tsv -ec run_accessions -es _publication
  • Merge metadata in right mode:
merge_metadata -m PRJEB10949_ENA_metadata.tsv -mc run_accession -e PRJEB10949_publication_example.tsv -ec run_accessions -p right
  • Merge metadata in outer mode:
merge_metadata -m PRJEB10949_ENA_metadata.tsv -mc run_accession -e PRJEB10949_publication_example.tsv -ec run_accessions -p outer
  • Merge metadata in inner mode:
merge_metadata -m PRJEB10949_ENA_metadata.tsv -mc run_accession -e PRJEB10949_publication_example.tsv -ec run_accessions -p inner
  • Merge metadata in cross mode:
merge_metadata -m PRJEB10949_ENA_metadata.tsv -e PRJEB10949_publication_example.tsv -p cross
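The merge modes used in the commands above differ mainly in which rows survive. The following sketch counts the rows produced by each pandas mode for two invented tables with unique keys; the helper name `merged_row_counts` is hypothetical, and the cross mode requires pandas 1.2 or later.

```python
import pandas as pd


def merged_row_counts(main, extra, main_col, extra_col):
    """Return the number of merged rows for each pandas merge mode."""
    counts = {}
    for mode in ("left", "right", "outer", "inner"):
        merged = main.merge(extra, how=mode, left_on=main_col, right_on=extra_col)
        counts[mode] = len(merged)
    # Cross mode takes no merge columns: it pairs every main row
    # with every extra row.
    counts["cross"] = len(main.merge(extra, how="cross"))
    return counts


# Hypothetical tables: ERR1 and ERR2 are shared, ERR3 and ERR9 are not.
main = pd.DataFrame(
    {"run_accession": ["ERR1", "ERR2", "ERR3"], "sample_alias": ["A", "B", "C"]}
)
extra = pd.DataFrame(
    {"run_accessions": ["ERR1", "ERR2", "ERR9"], "treatment": ["control", "drug", "drug"]}
)

counts = merged_row_counts(main, extra, "run_accession", "run_accessions")
for mode, n_rows in counts.items():
    print(f"{mode}: {n_rows} rows")
```

With unique keys, left and right keep the rows of one table, inner keeps only the shared keys, outer keeps the union, and cross produces the product of the two row counts.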

For a full and detailed example of dataset curation, see the Tutorial Full Example page. If you only want to run this particular program, you can use the PRJEB10949_publication_example.tsv test file together with the PRJEB10949_ENA_metadata.tsv file generated by the Download Metadata ENA program.