Skip to content

Latest commit

 

History

History
393 lines (293 loc) · 19.7 KB

README.md

File metadata and controls

393 lines (293 loc) · 19.7 KB

2022_Bean_mergeData

Merge 2 files by a common column.
Add missing labels from the reference 2-column (matching label pairs) file into the data file.

Algorithm Notes
Merge data app
The figure shows the algorithm of merging two files by common column.
  • both input files [data, labels] should be a column-like text file
  • both input files should include a matching label-column (with the same values)
  • the header of label-column [ L1 ] may be different in each file
  • the automatically created output file contains all columns from the data file and additional column [ L2 ] of matching sample numbers from the labels file
  • if some labels pairs are missing in the labels file, the corresponding fields in the [ L2 ] column of the output file are filled with error_value
  • if labels file contains some additional labels pairs (with no correponding data in the data file), the corresponding data fields in the output file are filled with pre-set missing_value (-9999.99 by default, user can customize it)
  • App description

    The application enables:

    1. adding matching labels from the reference file of label pairs into the data file,
    2. transfering annotated data into the original (master) file of samples with assigned error_value=-9999.99 for missing data.

    Inputs

    The user has to provide input files:

    • [ labels ] - a 2-column file with pairs of corresponding labels (e.g., ARS label and Sample num)
    • [ data ] - a multi-column file annotated with one of the reference labels (e.g., ARS label)
    • [ samples ] - optionally, a 1-column file with the list of all samples (e.g., Sample num)
      when the file is provided the additional output will be generated automatically and it will contain the data assigned to the original samples while missing data fields will be filled with error_value

    The user has to determine:

    • the index of label column in the [ data ] file;
      by default, the algorithm assumes it is the first column in the data file (indexing starts from 0)

    The user can customize:

    • the output filename and format of exported data;
      by default the filename is data_output-$date and it is exported as an Excel file (.xlsx)

    Requirements: python3, pandas, openpyxl, conda (optionally)

    PRO TIP:
    If you do NOT want to use Conda, but have python3 already installed, install the remaining requirements directly:
    pip3 install pandas
    pip3 install openpyxl

    Hands-on tutorial

    The data_matcher.py is a ready-made mini application written in Python programming language. Technically, it is a block of code, without the graphical user interface (GUI), and it needs to be executed from the command line (CLI) in the terminal window.


    Figure. In the graphical user interface (GUI, on the left), you use a mouse to navigate the file system and execute applications by clicking, while in the command-line interface (CLI, on the right), you type text-like commands on your keyboard to do the same (and much more!).

    PRO TIP:
    If you are not familiar with the Command-line Interface (CLI) and the basics of working in the terminal window, start by exploring the Terminal: the command-line interface ⤴ tutorial, available in the Data Science Workbook ⤴. You can also learn more about the Unix Basics ⤴ from another tutorial.

    Setting up the computational environment

    A. On your local machine

    1. Open terminal window: see instructions here ⤴
    2. Install Python3
    sudo apt-get update
    sudo apt-get install python3.9 python3-pip
    
    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
    brew update && brew upgrade
    brew install python3
    
    1. Install Conda ⤴, an open-source and cross-platform environment management system.

    Test your Conda installation by typing in the terminal window:

    conda info
    

    On the screen your Conda settings should be displayed. Otherwise, if a command not found error is thrown back, you need to fix the Conda installation step.

    1. Set up Conda environment
    • A) create the data_wrangling virtual environment:
    conda create -n data_wrangling python=3.9
    
    • B) activate the data_wrangling environment:
    conda activate data_wrangling
    


    Once the environment is active, you can see its name preceding the prompt.

    • C) install new dependencies within environment:
    pip3 install pandas
    pip3 install openpyxl
    
    1. Navigate to your workspace on the file system using the cd command followed by the path to the desired location. For example, if you want to create a new working directory in your $HOME location, copy-paste the following commands:
    mkdir ~/WORKDIR
    cd ~/WORKDIR
    

    B. On HPC infrastructure (Ceres, Atlas, and other clusters)

    1. Open terminal window: see instructions here ⤴
    2. Log in to the remote computing infrastructure using the ssh command. Learn more about the Secure Shell Connection ⤴ from the tutorial in the Data Science Workbook ⤴.
    • log in to Atlas:
    ssh <user.name>@atlas-login.hpc.msstate.edu
    
    • log in to Ceres:
    ssh <user.name>@ceres.scinet.usda.gov
    
    1. Load Conda module:
    • A) search for conda modules
    module avail conda
    

    • B) load selected module variant
    module load miniconda3/4.3.30-qdauveb
    
    1. Set up Conda environment
    • A) create the data_wrangling virtual environment:
    conda create -n data_wrangling python=3.9
    
    • B) activate the data_wrangling environment:
    source activate data_wrangling
    
    • C) install new dependencies within environment:
    pip3 install pandas
    pip3 install openpyxl
    
    1. Navigate to your workspace on the file system using the cd command followed by the path to the desired location. On the SCINet HPC clusters your workspace should be located on the path: /project/{project_account}/{user}/{workdir}.
    cd /project/< provide-here-path-to-your-project-account >
    

    Download Python mini app

    1. Make sure you are in your workspace on the file system using the pwd command.

    2. Download the 2022_Bean_mergeData repository from the GitHub:

    • if you use git already:
    git clone https://github.com/ISUgenomics/2022_Bean_mergeData.git
    
    PRO TIP:
    If you tried the command above and received an error message command not found, you do NOT have GIT installed. Then, if working on your local machine, follow instructions provided at git-scm.com : 1.5 Getting Started - Installing Git ⤴.

    In the downloaded repo, you will find: the app and example_dataset folders, containing the data_matcher.py Python script and input files, respectively.

    • alternatively, you can download directly only the necesary files (app script + example inputs):
    wget https://raw.githubusercontent.com/ISUgenomics/2022_Bean_mergeData/main/app/data_matcher.py
    wget https://github.com/ISUgenomics/2022_Bean_mergeData/raw/main/example_dataset/data.xlsx
    wget https://github.com/ISUgenomics/2022_Bean_mergeData/raw/main/example_dataset/labels.xlsx
    wget https://github.com/ISUgenomics/2022_Bean_mergeData/raw/main/example_dataset/samples.xlsx
    

    Options available in the application

    python3 data_matcher.py -l labels -f data
                            [-t samples]
                            [-c col]
                            [-o format] [-n outfile] [-m master]
                            [-h]
    

    ^ arguments provided in square brackets [] are optional

    help & info arguments:

    -h,          --help                         # show this help message and exit
    

    required arguments:

    -l labels,   --labels-filename labels       # [string] input 2-col file in the xlsx format; required
    -f data,     --data-file data               # [string] input multi-col file in the xlsx format; required
    

    optional arguments:

    -t samples,  --master-file samples          # [string] input Master Excel (samples) file in the xlsx format; optional
    -c col,      --label-column col             # [int]    index of the label column in the data_file; default=0
    -o format,   --output-format format         # [int]    select format for output file: 0 - xlsx, 1 - csv; default=0
    -n outfile,  --output-filename outfile      # [string] provide custom name for the output file; default="data_output"
    -m master,   --output-masterfile master     # [string] provide custom name for the output master_file; default: "master_output"
    

    defaults for optional arguments:

    -t ''                                       # means: no samples file is provided
    -c 0                                        # the column of index 0 in [ data ] file is assumed to contain labels
    -o 0                                        # output is saved in the .xlsx (Excel) format
    -n "data_output"                            # output is saved in the "data_output.xlsx" file
    -m "master_output"                          # output master_file is saved in the "master_output.xlsx" file
    

    Example usage:

    Inputs

    Below is the preview of the required data and labels input files, as well as the optional samples file. The data and labels share the same column with ARS labels, even though in both files this column has a different name (i.e., USDA# and ARS_Label, marked with the green frames). The labels file contains an additional column with sample IDs paired with the ARS labels. As you can see, the number of rows in the labels file is significantly higher than inthe data file. And finally, the samples file is much longer because it contains all the pre-set sample identifiers.

    Tasks

    There are two tasks to performe:

    1. [REQUIRED] Adding matching sample IDs from the labels file into the data file. The data_output file size is equal to the data input file.
    2. [OPTIONAL] Transfering data into the original samples (master) file with assigned error_value=-9999.99 for missing data hits. The master_output file size is equal to the samples input file.

    Practical Examples

    NOTE:
    Depending on your operating system and Python configuration, you should use a python or python3 keyword to indicate the correct interpreter for the "data_matcher.py" script.
    The script requires Python 3.8 or later release. To check the version of your Python installation, type in the terminal or command prompt:

    python --version
    

    If the returned Python version is lower than 3.8, try again by typing:

    python3 --version
    

    If the returned value is still lower than 3.8, then you need to install a new instance of Python in version 3.8 or newer.

    PRO TIP:
    You can use all built-in flags [-l, -f, -t, -c, -n, -m, -o] for a given script run but each of them only once. The value following the flag can be customized by users depending on their needs and the datasets used. The type of values matching the flags are specified in the Options available in the application. For files (inputs and outputs), they should be a string, while for all other parameters, they should be an integer.

    For the purpose of this tutorial, we use simplified names of input and output files to make it easier for the user to know which flag matches each file type. So, please feel free to use your original or customized filenames without renaming them to "labels.xlsx", "data.xlsx", and "samples.xlsx".


    1. Minimal arguments to generate data_output:

    python3 data_matcher.py -l labels.xlsx -f data.xlsx
    

    This run will generate a single file with a name randomized by a current date in the format day-month-year-seconds, e.g., data_output-17-03-2023-104731.xlsx


    2. Minimal arguments to generate data_output and master_output:

    python3 data_matcher.py -l labels.xlsx -f data.xlsx -t samples.xlsx
    

    This run will generate two files with names randomized by a current date in the format day-month-year-seconds, e.g., data_output-17-03-2023-104731.xlsx and master_output-17-03-2023-104731.xlsx


    3. Customize names of the output files:

    • use -n flag to change the name of the data_output file
    • use -m flag to change the name of the master_output file
    python3 data_matcher.py -l labels.xlsx -f data.xlsx -t samples.xlsx -n cutom_data_output -m cutom_master_output
    


    NOTE:
    The user-provided filenames can be randomized by a variable, for example depending on the date (added as a unique tag):

    -n custom_name-$(date +"%d-%m-%Y-%s")
    

    4. Specify the index of the label column in the data file:

    • use -c flag followed by the index of the label column, if other than the first column, i.e., 0 (in Python, indexing starts from 0 instead of 1)
    python3 data_matcher.py -l labels.xlsx -f data.xlsx -c 2
    

    In this variant, the script will assume that the label column is a third column in the "data" file. This column has the index=2 because in Python we count from 0: 0 (first), 1 (second), and 2 (third). When the flag -c is NOT used, the script assumes that the label column is the first column with index 0.


    5. Customize the format of the output files

    • use -o flag to select the format of the output files: 0 - xlsx, 1 - csv; default=0
    python3 data_matcher.py -l labels.xlsx -f data.xlsx -o 1
    

    This run will save the "output_data" file in CSV format instead of the Excel file (.xlsx). When the -o flag is NOT used, the script saves all outputs in the default Excel format.


    6. Example of the fully customized run:

    python3 data_matcher.py -l reference_labels.xlsx -f our_data.xlsx -t experimental_samples.xlsx -n cutom_data_output -m cutom_master_output -c 1 -o 1
    

    This run will use three input files (with their original names), all located in the same directory as the data_matcher.py script. The run is executed from within this directory. The second column (so with index 1) from the data file provided with the -f flag will be used as a label column. Two output files will be generated, both with names provided by the user after the corresponding flags (-n and -m). Both files will be saved in CSV format instead of the default Excel format.


    7. You can run the script from any location in the file system:

    There is no need to copy the Python script into each directory with a new dataset. You can store this and other scripts in a selected location in the file system. For example, create the PYTHON_SCRIPTS directory in your home directory by typing:

    mkdir ~/PYTHON_SCRIPTS
    cd ~/PYTHON_SCRIPTS
    

    Then, download the current version of the python script from the GitHub repository:

    wget https://raw.githubusercontent.com/ISUgenomics/2022_Bean_mergeData/main/app/data_matcher.py
    

    or if you have it already, then copy it locally using the cp data_matcher.py ~/PYTHON_SCRIPTS/ command in the location where the script is stored.

    Then, navigate to the location of your dataset, replacing the content of the < > with the path to the selected directory:

    cd <path/to/your/dataset>
    

    Now, you are ready to call the script in the external location by specifying its path:

    python3 ~/PYTHON_SCRIPTS/data_matcher.py -l reference_labels.xlsx -f our_data.xlsx -t experimental_samples.xlsx -n cutom_data_output -m cutom_master_output -c 1 -o 1
    

    If you place the script on different path (i.e., in different location) than ~/PYTHON_SCRIPTS/, then you need to provide the corresponding path instead of the one from this example. You can also provide an individual path to each file (input or output) if they are stored in a different location than your current workdir (i.e., the location from which you run the command).

    Input data and outputs

    data_output
    The exact copy of the input datafile [data.xlsx] with one additional column (next to the ARS label) is created [output.xlsx by default]. The new column contains the corresponding sample label. If there is no matching sample label, the error value -f -9999.99 is assigned.

    master_output
    The exact copy of the input sample file [sample.xlsx] (containing the sample labels) is extended of the corresponding ARS label and all remaining data columns (taken from data_file). If there is no matching hit for a sample label, the values in the new columns will contain the default (-9999.99) for the missing data.

    Files description:

    file description
    Original_Sample_Master_List.xlsx [samples.xlsx] original sample labels of experimental data
    F-set_complete.xlsx [labels.xlsx] list of matching labels pairs; 2-column file
    Sample_Data_by_ARS_F-number.xlsx data annotated with ARS label
    Sample_check_by_F-number.xlsx [data.xlsx] data annotated with ARS label
    data_matcher.py python script