Releases: broadinstitute/Drop-seq
Fix invocation of Picard programs in alignment and reference creation scripts
Changes
- Fix invocation of Picard command-line programs in Drop-seq_alignment.sh and create_Drop-seq_reference_metadata.sh, which were broken by change to skinny picard.jar.
- Reduce memory footprint of CorrectAndSplitScrnaReadPairs. Some metrics that were memory-intensive were removed.
- Fix bug when processing reads that did not have UMI tag.
- Upgrade to picard 3.1.1 to address security issue.
- Add option GENE_SORT to CreateMetacells.
- Enable multiple BAM input to FilterBAM.
Installation
Java
Download and unzip dropseq-3.0.2.zip . Wrapper scripts for all the command-line programs will be in the dropseq-3.0.2 directory that is created when unzipping.
R packages
Do this in the order below, because DropSeq.dropulation depends on DropSeq.utilities, and if you don't install that package first, the source version of DropSeq.utilities will be installed, which may be unstable.
install.packages("https://github.com/broadinstitute/Drop-seq/releases/download/v3.0.0/DropSeq.utilities_3.0.2.tar.gz")
install.packages("https://github.com/broadinstitute/Drop-seq/releases/download/v3.0.0/DropSeq.dropulation_3.0.2.tar.gz")
Shell script standardization and other minor changes
Changes
- Standardize shell scripts, including alignment, reference metadata creation, and tool wrapper scripts. Thanks to @mschilli87.
- Add ability to request multiple cores in alignment script.
- Remove dependent jars and instead get from Maven Central. Thanks to @lbergelson.
- In various programs, filter out reads with UMI ==
-
, which is how STARsolo marks chimeric reads. - Change the dependency between our R packages from
Imports
toSuggests
.
Installation
Java
Download and unzip dropseq-3.0.1.zip . Wrapper scripts for all the command-line programs will be in the dropseq-3.0.1 directory that is created when unzipping.
R packages
Do this in the order below, because DropSeq.dropulation depends on DropSeq.utilities, and if you don't install that package first, the source version of DropSeq.utilities will be installed, which may be unstable.
install.packages("https://github.com/broadinstitute/Drop-seq/releases/download/v3.0.0/DropSeq.utilities_3.0.1.tar.gz")
install.packages("https://github.com/broadinstitute/Drop-seq/releases/download/v3.0.0/DropSeq.dropulation_3.0.1.tar.gz")
Java 21, R packages for Dropulation and Census-seq
New and changed
Java 21 and Gradle
This is the first release built with Java 21. You will need to use a JVM version >= 21.
R packages for Dropulation and Census-seq
Our Java programs have long been available for donor assignment and Census-Seq, but we've typically analyzed their outputs using R for standard QC plots and analysis. This release marks the first time we're making these tools public, offering the community a standard set of downstream analyses for this data. This ensures proper analysis of outputs and provides users with plots to help diagnose technical or biological issues with donor pools.
We will shortly release a new "cookbook" detailing how to run donor assignment on both scRNASeq and scATACSeq data, with more in depth explanations of the program outputs and explanations of how to interpret each QC plot. Please look for that on this github's front page next to the other cookbooks.
New program FilterReadsByUMISupport
For some analysis, it may be useful to only evaluate UMIs that have at least some minimum or maximum number of reads supporting them. This tool counts the number of UMIs supporting each read, and emits a BAM file that only contains reads within the bounds of MIN_READ_SUPPORT and MAX_READ_SUPPORT.
GatherDigitalAlleleCounts changes
Tracking of alternate alleles improved. Multiallelic sites now report the alternate allele with the highest allele frequency instead of randomly selecting one of the alternate alleles. When POLYMORPHIC_SNPS_ONLY=false
, in rare cases GDAC lost track of the alternate allele and emitted an N instead with 0 counts. This was fixed to emit the proper alternate allele and counts data.
DigitalExpression changes - smells like STARSolo
We've conducted a recent assessment of how Optimus/STARSolo expression measurements stack up against DropSeq's techniques. STARSolo outperformed DropSeq in retrieving UMIs from reads that align to multiple genomic locations but only support a single gene. To replicate this capability, simply set READ_MQ=0, allowing for the recovery of these UMIs in a similar fashion.
STARSolo uses different functional annotation strategy - that is, the way annotations of which regions of the genome are exonic, intronic, and antisense are interpreted and prioritized. The STARSOLO strategy priority is very similar to DropSeq, except in cases where a read overlaps both an intron on the sense strand and a coding region on the antisense strand. In these cases, DropSeq favors the intronic interpretation, while STARSolo interprets this as a technical artifact and labels the read as coming from the antisense coding gene, and the read does not contribute to the expression counts matrix. This functional strategy is enabled in STARsolo using the flag --soloFeatures GeneFull_Ex50pAS
. This can be summarized by this schematic:
We have added a new parameter FUNCTIONAL_STRATEGY to all programs that interpret gene annotations. This can be set to interpret the annotations with method's priority. When READ_MQ=0
and FUNCTIONAL_STRATEGY=STARSOLO
, the methods generate very similar results. The plot below is the summed expression of each gene across cells for a single experiment, where the expression was generated by STARSolo and DropSeq on the same set of sequencing reads aligned by STARSolo.
Cell barcode correction tools
These tools emulate Cell Ranger's cell barcode correction algorithm.
- CountBarcodeSequences tallies the appearance of cell barcodes in the input paired-end BAM. Recommended usages is to pass the list of expected barcodes via ALLOWED_BARCODES option, so that only these cell barcodes are counted.
- CorrectAndSplitScrnaReadPairs corrects cell barcodes that are edit-distance 1 away from one of the barcodes in the ALLOWED_BARCODE_COUNTS file (produced by CountBarcodeSequences). The program can also split the read pairs into multiple BAM files, in which all the reads for a cell barcode are in a single BAM. This can facilitate parallel processing of the reads.
Installation
Java
Download and unzip dropseq-3.0.0.zip . Wrapper scripts for all the command-line programs will be in the dropseq-3.0.0 directory that is created when unzipping.
R packages
Do this in the order below, because DropSeq.dropulation depends on DropSeq.utilities, and if you don't install that package first, the source version of DropSeq.utilities will be installed, which may be unstable.
install.packages("https://github.com/broadinstitute/Drop-seq/releases/download/v3.0.0/DropSeq.utilities_3.0.0.tar.gz")
install.packages("https://github.com/broadinstitute/Drop-seq/releases/download/v3.0.0/DropSeq.dropulation_3.0.0.tar.gz")
Donor Assignment Minor Release
Update to donor assignment and Census-Seq code
We have recently noticed that as the sequence length increases and SNP backbone density improves, the fraction of reads or UMIs observing more than one transcribed SNP also increases. This trend was particularly evident when comparing the number of "informative" UMIs to the total number of UMIs expressed by each cell. In some instances, the count of informative UMIs surpassed the total UMIs produced by DGE for a cell. In such cases, multiple SNPs were observed on the same read or UMI, which might not represent independent observations. This essentially leads to "double counting" of these observations. To address this issue, AssignCellsToSamples and DetectDoublets now pre-sort the reads by cell and molecular barcode, selecting only the "best" SNP for each pileup. SNP quality is measured using the mean GQ score across all donors, and in case of a tie, a random SNP is chosen. For algorithms not employing UMIs (CensusSeq tools), this filtering occurs at the read level.
Changes in results in a modest test data set of 1614 cells
Donor assignment changes were minor - all discordant donor assignment cells were also flagged as doublets, so would not affect the final donor labels. Probability of assignment was lower on average due to removal of non-independent allele observations.
This change alters the doublet classification of a small percentage (~3%) of cells, with roughly equal numbers of cells flipping between singlet/doublet status. The cells undergoing status changes have very few informative UMIs, making them underpowered for this classification, so it is less surprising that they might change assignment. The vast majority of cells classified as singlets by both the previously released algorithm and the new algorithm show no disagreements in donor assignment. In general, this update does not change outcomes significantly but does affect the likelihoods of assignment due to fewer observations.
Quality of life improvements
Moreover, we have implemented several quality of life improvements to enhance the transparency of the read filtering methods used. This should help in better understanding situations of low-quality data (sequencing or VCF) or mismatches between the sequence data and VCF data, which result in low amounts of data used in the likelihood calculations. It will also assist us in addressing any issues you may submit. Additionally, we have incorporated a "faster-fail" threshold to detect when no UMIs detect a transcribed SNP. This threshold, TRANSCRIBED_SNP_FAIL_FAST_THRESHOLD, represents the number of UMIs that can be observed without detecting a transcribed SNP. If this threshold is exceeded, the program will stop running early and report the number of UMIs observed.
For API users, FiltererIterator has a new method, logFilterResults, which can be overridden by your subclass to automatically generate logging when the iterator is empty.
MarkChimericReads release
When investigating differences in expression matrixes generated by 10x and our software, we discovered that 10x has implemented a new cleanup step for UMI sequences that are observed by multiple genes with in a cell. This is implemented in 10x's CellRanger software, please see this method:
Cell Ranger again groups the reads by barcode, UMI (possibly corrected), and gene annotation. If two or more groups of reads have the same barcode and UMI, but different gene annotations, the gene annotation with the most supporting reads is kept for UMI counting, and the other read groups are discarded. In case of a tie for maximal read support, all read groups are discarded, as the gene cannot be confidently assigned.
We found this technique to be useful, but wanted to extend this correction to apply not only to a counts matrix, but to the reads in a BAM file. MarkChimericReads accomplishes this by identifying UMIs that are shared across gene within a cell using the same heuristics, then marking problematic reads as map quality 0 so they are ignored by other downstream processes. There are two options for this filtering. We refer to the 10x strategy as RETAIN_MOST_SUPPORTED (our default), and there is an additional strategy REMOVE_ALL that removes all reads where UMIs are observed on multiple genes. We found this strategy to be too conservative in practice, but have left it available for experimentation.
This program is best run after cell barcode selection, but before running any programs that rely on UMIs, such as DigitalExpression or AssignCellsToSamples. The software can be run on a BAM file specifying a cell barcode list, but the memory required to track problematic UMIs for all cell barcodes in a BAM may be prohibitive. We have found this UMI cleanup to yield small incremental improvements in donor assignment and doublet detection.
Donor assignment / doublet detection tools in support of Cell Stem Cell publication
This release is in support of our manuscript "Natural variation in gene expression and viral susceptibility revealed by neural progenitor cell villages" by Wells, et al., which has been accepted for publication in Cell Stem Cell. The following software is relevant for release.
- AssignCellsToSamples - This program takes as input a BAM file containing sequencing reads from a pool of donors, and a VCF file containing genotypes from those donors. The program emits the most likely donor for each cell.
- DetectDoublets - This program takes as input a BAM file containing sequencing reads from a pool of donors, a VCF file containing genotypes from those donors, and the output from AssignCellsToSamples. The program emits the probability that each cell barcode is a doublet where a cell from two different donors have been co-encapsulated in the same droplet.
- TagReadWithGeneFunction - This program adds additional gene function tags to a BAM file that are required to run donor assignment code. This programs are run as part of the standard Drop-seq pipeline, but can also be run on standard 10x BAM files to make them compatible with our toolkit. Please see the Drop-seq alignment cookbook and command line help for details.
Please also see our biorxiv manuscript Natural variation in gene expression and Zika virus susceptibility revealed by villages of neural progenitor cells which is already available. We will publish a computational protocols guide in support of this work in the following weeks.
See the Drop-seq alignment cookbook and Census-seq computational protocols for detailed usage instructions, diagrams, and explanations of how these new systems work. For more information about Census-seq, see our manuscript on bioRxiv
Upgrade to Picard 2.26.10, which includes log4j 2.17.1
Picard and log4j upgrades
- Picard 2.26.10, which includes log4j 2.17.1
Minor changes
- Reduce memory requirement of MergeDgeSparse by making it multi-pass.
See the Drop-seq alignment cookbook and Census-seq computational protocols for detailed usage instructions, diagrams, and explanations of how these new systems work. For more information about Census-seq, see our manuscript on bioRxiv.
Dropulation + upgrade to picard 2.26.8, which includes log4j.16.0
Picard and log4j upgrades
- picard.jar v2.26.8, which includes log4j 2.16.0
- remove stand-alone log4j.jar
- upgrade to biojava 6.0.3
- remove slf4j-log4j12
Dropulation programs - these programs are in support of our manuscript Natural variation in gene expression and Zika virus susceptibility revealed by villages of neural progenitor cells.
- AssignCellsToSamples - This program takes as input a BAM file containing sequencing reads from a pool of donors, and a VCF file containing genotypes from those donors. The program emits the most likely donor for each cell.
- DetectDoublets - This program takes as input a BAM file containing sequencing reads from a pool of donors, a VCF file containing genotypes from those donors, and the output from AssignCellsToSamples. The program emits the probability that each cell barcode is a doublet where a cell from two different donors have been co-encapsulated in the same droplet.
GTF parsing
- pass command-line VALIDATION_STRINGENCY to GTFParser
- Set default command-lin VALIDATION_STRINGENCY to LENIENT
Minor changes
- improvements to sparse matrix I/O
- allow merging of DGEs with same prefix if there are no cell barcode collisions.
See the Drop-seq alignment cookbook and Census-seq computational protocols for detailed usage instructions, diagrams, and explanations of how these new systems work. For more information about Census-seq, see our manuscript on bioRxiv
Support for SBARRO experiments
Support for SBARRO experiments
- TagReadWithRabiesBarcodes - Tags an unaligned BAM with rabies virus barcode sequences and associated metrics. Barcode sequences are extracted using invariant sequences which flank the barcode regions.
- FilterValidRabiesBarcodes - Filters rabies virus tags into pass/fail BAMs based on metrics associated with the extracted viral barcodes.
- BipartiteRabiesVirusCollapse - Collapse rabies virus sequences in which half the barcode matches within some edit distance, within a cell.
Other updates
- TrimStartingSequence update: New algorithm with mismatch rate and sequence not require to be anchored to start of read. Note that different sequence is recommended for 10X and Drop-seq. Old, buggy behavior can be enabled with LEGACY=true.
- PolyATrimmer now optionally stores length of polyA and trimmed adapter
- ConvertTagToReadGroup improved error messages when read group missing from input data
- FilterBamByGeneFunction - new program to filter reads to specified gene functions, in the same way DigtialGeneExpression and related programs internally filter data.
- CensusSeq and related software can now take as input a list of BAMs.
- Csi analysis now generates a contamination confidence interval.
- Improvements to make GTF parsing more robust
Release 2.4.0
- Census-seq tools and documentation (CensusSeq, RollCall, CsiAnalysis)
- New program SplitBamByCell - partitions a large BAM into many smaller BAMs by cell barcodes, so that each cell barcode is contained in only one output BAM, and BAMs have roughly the same size.
- More flexible DGE merging
- New program CountUnmatchedSampleIndices
- Upgrade to Picard 2.20.5
- Add option to FilterBam and FilterBamByTag to fail if not enough reads pass filter
- Many bug fixes --
git log v2.3.0..v2.4.0
to view
See the Drop-seq alignment cookbook and Census-seq computational protocols for detailed usage instructions, diagrams, and explanations of how these new systems work. For more information about Census-seq, see our manuscript on bioRxiv