satmut_utils is a Python package for simulation and variant calling of saturation mutagenesis data. The two main subcommands are:
- 'sim'
- 'call'
satmut_utils commands are designed to simulate and call variants in paired-end, targeted sequencing reads. Alignments to a mature mRNA reference (contiguous, spliced coding sequence with possible untranslated regions) are expected. Genome-wide and transcriptome-wide variant calling is not supported.
Currently, only Unix/Linux and MacOSX operating systems are supported. To get started, follow these steps:
-
If conda is not installed, install miniconda for managing environments. See this link for installation on your particular architecture.
-
Clone the satmut_utils repository:
git clone https://github.com/CenikLab/satmut_utils.git
SATMUT_ROOT="$PWD/satmut_utils"
- Execute the provided shell script to generate the satmut_utils environment, install the package, and optionally download curated reference files, which are required if using Ensembl identifiers (see Reference files). Finally, activate the satmut_utils environment.
REF_DIR="$HOME/satmut_utils_refs"
$SATMUT_ROOT/install_satmut_utils.sh -h
$SATMUT_ROOT/install_satmut_utils.sh -t -g -r "$REF_DIR" "$SATMUT_ROOT"
conda activate satmut_utils
You are now ready to call the command-line executable satmut_utils
satmut_utils is the primary command, with subcommands 'sim' and 'call'.
See the satmut_utils manual for more detailed usage information.
Parameter help:
satmut_utils -h
satmut_utils sim -h
satmut_utils call -h
Common arguments to both 'sim' and 'call' subcommands should be provided first, then the subcommand, and then the subcommand-specific arguments.
It is recommended that a new output directory is created for each job. Default is to output to the current directory.
OUTPUT_DIR="/tmp/satmut_utils_test"
Additionally, the user can set a directory for temporary files by setting an environment variable $SCRATCH. If this variable is not set, temporary files are written to /tmp.
export SCRATCH="$(mktemp -d -t satmut_temp)"
Run 'sim' on in silico alignments to generate SNPs, MNPs, insertions, and deletions. Structural variants and gene fusions are not currently supported.
TEST_DIR="$SATMUT_ROOT/src/tests/test_data"
OUTPUT_DIR="/tmp/satmut_utils_test"
satmut_utils -i ENST00000398165.7 -x $REF_DIR -o $OUTPUT_DIR -p $TEST_DIR/CBS_sim_primers.bed sim -f -a $TEST_DIR/CBS_sim.bam -v $TEST_DIR/CBS_sim.vcf
The 'sim' workflow outputs paired FASTQs, a realigned BAM file, and a truth VCF containing simulated variants and their expected counts and frequencies.
Support exists for calling SNPs/SNVs, MNPs/MNVs, insertions, and deletions. Multi-codon changes and complex InDels are not supported (e.g. delete a codon, insert a nt).
Run 'call' on the simulated data by specifying an Ensembl transcript/gene ID and the directory containing curated reference files.
TEST_DIR="$SATMUT_ROOT/src/tests/test_data"
OUTPUT_DIR="/tmp/satmut_utils_test"
satmut_utils -i ENST00000398165.7 -x $REF_DIR -o $OUTPUT_DIR -p $TEST_DIR/CBS_sim_primers.bed call -1 $TEST_DIR/CBS_sim.R1.fq.gz -2 $TEST_DIR/CBS_sim.R2.fq.gz -v -m 1
Here, we call variants on a simulated dataset with no adapter sequences, so we pass -v. However, in most cases the user should provide 5' and 3' adapters for trimming.
If the Ensembl ID is not in the curated set of primary transcripts, or if you want to align to a custom reference, several reference files most be provided. Reference files.
satmut_utils -r $TEST_DIR/CBS.fa -o $OUTPUT_DIR -p $TEST_DIR/CBS_sim_primers.bed call -1 $TEST_DIR/CBS_sim.R1.fq.gz -2 $TEST_DIR/CBS_sim.R2.fq.gz -v -m 1 -g $TEST_DIR/CBS.gff -k $REF_DIR/GRCh38.fa.gz
The 'call' workflow produces a VCF of candidate variant calls as well as a bedgraph file reporting fragment coverage across the transcript reference. The output VCF and its corresponding tab-delimited summary.txt file contain records for each mismatched base in a MNP. See the satmut_utils manual or the corresponding VCF header for column/field descriptions.
For convenience, a curated transcriptome of primary human transcripts from APPRIS is provided, allowing the user to pass an Ensembl gene or transcript ID to satmut_utils. However, if the requested Ensembl ID is not found in this set, custom reference files must be passed, which include:
A. Transcript reference (FASTA)
B. Transcript annotations (GFF)
C. GFF reference (FASTA)
Common transcript annotations in GFF format (file B) map coordinates in the genome. For this case, file A should specify a transcript FASTA and file C should specify the genome FASTA.
In typical saturation mutagenesis datasets, an intron-less coding sequence is expressed from a vector. In this case, set file A and C to the same composite (vector + coding sequence) reference FASTA, then make a custom GFF annotation (file B) with a coding sequence exon. See the user manual for more details on creating custom reference files.
To run unit tests, execute the following from the satmut_utils repository:
nose2 -q
Two command-line interfaces are provided to enable pre-processing of reads prior to satmut_utils 'sim':
- satmut_trim
- satmut_align
satmut_trim is a wrapper around cutadapt, and satmut_align a wrapper around bowtie2. satmut_align should be used to generate the BAM file accepted by 'sim'. If reads have been aligned with some other method, there is no guarantee 'sim' will complete without error, as alignment tags output by bowtie2 are required for 'sim' (MD, NM).
Additionally, a number of helpful scripts are available on the development repository of satmut_utils in the src/scripts directory. For example, if subdomains of a gene were targeted, MAVE-HGVS annotations can be updated with run_mave_hgvs_annot.py to report positions relative to target regions (as required by MAVEdb).
If you use satmut_utils, please cite the following paper:
Hoskins I, Sun S, Cote A, Roth FP, Cenik C. satmut_utils: a simulation and variant calling package for multiplexed assays of variant effect. Genome Biol. BioMed Central; 2023 Apr 20;24(1):1–2