Two ways to install LOTUS are presented here. The easiest is the following:
conda create -p lotus_env
conda activate ./lotus_env
conda install -c gsiekaniec -c conda-forge -c bioconda lotus
Warning: With this installation method, the external files are not included and must be retrieved from the LOTUS GitHub repository: the reference genome annotation file (Homo_sapiens.GRCh38.108.chr.gff3.gz), the cytoband file (hg38_cytoband.tsv) and the external databases file (Lotus_ExternalBases_202301.xlsx).
LOTUS is composed of four modules that process vcf files from GATK output (annotated with Funcotator): filter, summarise, compare and merge.
Note: To keep the output files organized, it can be useful to create a results folder with the following tree structure:
results
|
+-- filter
|
+-- summarise
|
+-- compare
|
+-- merge
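The tree above can be created in one command. This is a minimal sketch, assuming it is run from the directory that should contain the results folder:

```shell
# Create one output folder per LOTUS module under results/
mkdir -p results/filter results/summarise results/compare results/merge
```

The `-p` option creates the parent `results` folder if needed and does not fail when the folders already exist.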
Details of the output and input files for every module can be found in the inputs_outputs_description directory.
When starting from raw fastq files, many steps are needed before using LOTUS to go from fastq (sequences) to annotated vcf (variants).
There are many ways to do this; we recommend following the GATK best practices [1].
The filter module applies simple filters to the vcf file from Funcotator, using several criteria to keep only trustworthy somatic variants.
Inputs/Outputs (get more details)
Parameters
Parameters | Description | Default |
---|---|---|
--vcf, -v | Result vcf file from Funcotator output. | |
--output, -o | Filtered vcf file. The passed vcf file is also created using this output name. | output.filtered.vcf and output.passed.vcf |
--working-method, -w | "InMemory" (default) loads the vcf file in memory into a list (faster but higher memory consumption); "Direct" reads and modifies the vcf file on the fly (slower but lower memory consumption). | InMemory |
--MBQ | Minimum median base quality for a variant. | 20 |
--DP | Minimum variant coverage. | 10 |
--AF | Minimum variant fraction in the tumor. | 0.1 |
--AD | Minimum variant depth. | 5 |
--POPAF | Maximum population (often gnomAD) variant frequency. | 0.00001 |
--unpaired | Flag to use if the reads are unpaired (single-end); it sets the paired variable to False. | True |
lotus filter -v {PATH_TO_VCF}/sample.funcotated.vcf -o {OUTPUT_PATH}/sample.vcf
lotus filter -v {PATH_TO_VCF}/sample_unpaired_reads.funcotated.vcf -o {OUTPUT_PATH}/sample.vcf -wm Direct --MBQ 20 --DP 10 --AF 0.1 --AD 5 --POPAF 0.00001 --unpaired
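When several samples need filtering, the first command above can be generated once per file. The sketch below is a dry run under stated assumptions: `lotus_filter_cmds` is a hypothetical helper, and the `vcf` input folder, the `results/filter` output folder and the `.funcotated.vcf` suffix are taken from the example names rather than required by LOTUS. It only prints commands and never calls LOTUS itself.

```shell
# Hypothetical helper: print one `lotus filter` command per annotated vcf
# found in a folder. Assumes paths without spaces.
lotus_filter_cmds() {
  vcf_dir="$1"   # folder holding the *.funcotated.vcf files (assumed layout)
  out_dir="$2"   # folder that should receive the filtered vcf files
  for vcf in "$vcf_dir"/*.funcotated.vcf; do
    [ -e "$vcf" ] || continue   # skip when the pattern matches nothing
    sample="$(basename "$vcf" .funcotated.vcf)"
    printf 'lotus filter -v %s -o %s/%s.vcf\n' "$vcf" "$out_dir" "$sample"
  done
}

# Dry run: review the printed commands before executing them
lotus_filter_cmds vcf results/filter
```

Once the printed commands look right, they can be executed by piping the output to `sh`.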
The summarise module provides information on the variants from the vcf files: statistics on the number and nature of variants passing or failing the filters, graphs of the mutational profiles and of the indel sizes, and the list of impacted genes with their tumor burden.
Inputs/Outputs (get more details)
Parameters
Parameters | Description | Default |
---|---|---|
--vcf, -v | Vcf file from the filter module containing all variants, passing or not (*.filtered.vcf). | None |
--vcf_pass, -vp | Vcf file containing only the variants that pass the filters (*.passed.vcf). | |
--genome, -g | Genome fasta file (allowed extensions: .fasta, .fa, .fan) or pickle file (.pk, .pickle) created after a first run. | |
--statistics, -s | Output statistics file. | stats.txt |
--genes, -genes | Output file containing genes impacted by variants. | genes.txt |
--profile, -p | SVG or PNG file that shows the mutation profile of the vcf file. | |
--indel, -i | SVG or PNG file that shows the size of the indel mutations of the vcf file. | |
--enrichment | Runs the GO enrichment analysis on the gene list using ToppGene and Panther and returns the biological processes (works only if the APIs are available). | False |
lotus summarise -vp {FILTER_OUTPUT_PATH}/sample.passed.vcf -g hg38.fasta
lotus summarise -vp {FILTER_OUTPUT_PATH}/sample.passed.vcf -v {FILTER_OUTPUT_PATH}/sample.filtered.vcf -s {SUMMARISE_OUTPUT_PATH}/sample.stats.txt -p {SUMMARISE_OUTPUT_PATH}/sample_profile.svg -i {SUMMARISE_OUTPUT_PATH}/sample_indel.svg -g ../hg38.fasta -genes {SUMMARISE_OUTPUT_PATH}/sample.tsv --enrichment
Note: hg38.fasta represents the reference genome fasta file.
The compare module allows a longitudinal comparative genomic analysis of the vcf files of a sample in order to determine the variants present at a time point (TPn) and disappearing/appearing at a time point (TPn+1) as well as the genes impacted by these variants.
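The idea of genes appearing or disappearing between two time points can be illustrated with plain set differences. This is only a conceptual sketch, not how LOTUS computes its results: the gene lists and file names below are made up for the example.

```shell
# Hypothetical gene lists for two time points (one gene symbol per line).
# comm requires sorted input, hence the sort.
printf 'TP53\nKRAS\nEGFR\n' | sort > tp1_genes.txt
printf 'KRAS\nBRAF\nTP53\n' | sort > tp2_genes.txt

# Genes impacted only at TPn (no longer impacted at TPn+1)
comm -23 tp1_genes.txt tp2_genes.txt > lost_genes.txt

# Genes impacted only at TPn+1 (newly impacted)
comm -13 tp1_genes.txt tp2_genes.txt > gained_genes.txt
```

With these lists, `lost_genes.txt` contains EGFR and `gained_genes.txt` contains BRAF.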
Inputs/Outputs (get more details)
Parameters
Parameters | Description | Default |
---|---|---|
--config, -c | Configuration file containing the paths to the vcf files (filtered.vcf and pass.vcf files from LOTUS filter) and to the indel and snp tsv files from LOTUS summarise. Example available here. | |
--gff3, -gff3 | Gff3 file. This file can be found here or in LOTUS. | |
--output, -o | Excel file containing the genes specific to the first or second biopsy. | "genes.xlsx", which gives "{vcf1}_{vcf2}_genes.tsv/.xlsx" |
--profile, -p | SVG or PNG file that shows the comparison between the mutation profiles of the two vcf files. | |
--indel, -i | SVG or PNG file that shows the size of the indel mutations of the vcf file. | |
--enrichment | Runs the GO enrichment analysis on the gene list using ToppGene and Panther and returns the biological processes (works only if the APIs are available). | False |
--pickle_gff3 | Indicates that the given gff3 file is a pickle file from a previous launch. | False |
--additional_gene_information | Adds gene information using the LOTUS file containing information from tumor-specific databases (CancerHotSpot, CIViC, COSMIC, DoCM, IntOGen and TSGene 2.0). | False |
--profile_proportion_off | Uses different y-axes for the snp profile plot. Useful when one of the two axes is flattened by the size of the other one. | False |
lotus compare -c config_compare_sample.txt -gff3 LOTUS_external_files/Homo_sapiens.GRCh38.108.chr.gff3.gz
lotus compare -c config_compare_sample.txt -gff3 LOTUS_external_files/Homo_sapiens.GRCh38.108.chr.gff3.pk -i {COMPARE_OUTPUT_PATH}/sample_indel.svg -o {COMPARE_OUTPUT_PATH}/compare.tsv -p {COMPARE_OUTPUT_PATH}/sample_profile.svg --additional_gene_information --enrichment --pickle_gff3 --profile_proportion_off
The merge module gives an overview of all the samples by grouping and comparing all TPn against all TPn+1.
Inputs/Outputs (get more details)
Parameters
Parameters | Description | Default |
---|---|---|
--config, -c | Configuration file containing genes list from all patients. Merged patients results. | |
--output, -o | Output file name. | union.xlsx |
--cytoband, -cyto | Human cytoband file for the corresponding genome version. This file can be downloaded here or found on the LOTUS GitHub (for hg38). If this file is not provided, the chromosomes.svg plot will not be created. | None |
--chromosome-step, -step | Frame (window size) used for counting the number of genes along the chromosomes. | 500000 |
--chromosomes_output, -co | Output file name for the chromosomes plot. | chromosomes.svg |
--upset, -u | Output name for the upset plot. The upset plot is not created if no name is given. | None |
--weakness_threshold, -w | Mean weakness threshold to take a gene into account. | 100 |
--min_subset_size, -minsb | Minimum size of a subset (nb of genes by subset) to be shown in the UpSetPlot. All subsets with a size smaller than this threshold will be omitted from plotting. | 1 |
--max_subset_size, -maxsb | Maximum size of a subset (nb of genes by subset) to be shown in the UpSetPlot. All subsets with a size greater than this threshold will be omitted from plotting. | 0 |
--min_degree, -mind | Minimum degree of a subset (nb of patients) to be shown in the UpSetPlot. | 1 |
--max_degree, -maxd | Maximum degree of a subset (nb of patients) to be shown in the UpSetPlot. | 0 |
--additional_gene_information | Adds gene information using the LOTUS file containing information from tumor-specific databases (CancerHotSpot, CIViC, COSMIC, DoCM, IntOGen and TSGene 2.0). | False |
--enrichment | Runs the GO enrichment analysis on the gene list using ToppGene and Panther and returns the biological processes (works only if the APIs are available). | False |
lotus merge -c config_merge.txt
lotus merge -c config_merge.txt -o {MERGE_OUTPUT_PATH}/union.xlsx -cyto LOTUS_external_files/hg38_cytoband.tsv -w 99 -co {MERGE_OUTPUT_PATH}/chromosomes.svg -step 500000 --additional_gene_information --enrichment
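The `--chromosome-step` option sets the frame in which genes are counted along each chromosome. As a rough illustration of that kind of counting (not LOTUS's own implementation), the sketch below bins hypothetical gene start positions into non-overlapping windows. The `count_genes_per_window` helper, the two-column input and the use of the start position as the binning key are all assumptions made for the example.

```shell
# Hypothetical helper: count genes per fixed-size window.
# $1    = window size (same unit as the positions)
# stdin = tab-separated lines: chromosome, gene start position
# Output: chromosome, window start, number of genes in that window
count_genes_per_window() {
  awk -v step="$1" -F '\t' '
    { bin = int($2 / step); count[$1 "\t" bin * step]++ }
    END { for (key in count) print key "\t" count[key] }
  ' | sort -k1,1 -k2,2n
}

# Made-up positions, binned with the default step of 500000
printf 'chr1\t100\nchr1\t499999\nchr1\t500000\nchr2\t1200000\n' |
  count_genes_per_window 500000
```

For this input, the two chr1 genes starting below 500000 fall in the window starting at 0, the third chr1 gene in the window starting at 500000, and the chr2 gene in the window starting at 1000000.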
Future plot
Currently, LOTUS can create an UpSet plot [2] representing, for each set of samples, the corresponding set of impacted genes. However, due to the high computational complexity, this plot is only available for a maximum of 15 samples. Support for a larger number of samples is planned for the future.
Footnotes
1. Van der Auwera GA, Carneiro M, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella K, Altshuler D, Gabriel S, DePristo M. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Curr Protoc Bioinformatics, 43:11.10.1-11.10.33. DOI: 10.1002/0471250953.bi1110s43.
2. Lex A, Gehlenborg N, Strobelt H, Vuillemot R, Pfister H. (2014). UpSet: Visualization of Intersecting Sets. IEEE Transactions on Visualization and Computer Graphics, 20(12):1983-1992. DOI: 10.1109/TVCG.2014.2346248.