MarkChimericReads release
While investigating differences between expression matrices generated by 10x and by our software, we discovered that 10x has implemented a new cleanup step for UMI sequences that are observed on multiple genes within a cell. This step is implemented in 10x's Cell Ranger software, which describes the method as follows:
Cell Ranger again groups the reads by barcode, UMI (possibly corrected), and gene annotation. If two or more groups of reads have the same barcode and UMI, but different gene annotations, the gene annotation with the most supporting reads is kept for UMI counting, and the other read groups are discarded. In case of a tie for maximal read support, all read groups are discarded, as the gene cannot be confidently assigned.
We found this technique useful, but wanted to extend the correction so that it applies not only to a counts matrix but also to the reads in a BAM file. MarkChimericReads accomplishes this by identifying UMIs that are shared across genes within a cell using the same heuristics, then setting the mapping quality of the problematic reads to 0 so that they are ignored by downstream processes. There are two options for this filtering. We refer to the 10x strategy as RETAIN_MOST_SUPPORTED (our default). An additional strategy, REMOVE_ALL, marks all reads whose UMI is observed on multiple genes. We found REMOVE_ALL to be too conservative in practice, but have left it available for experimentation.
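The two strategies can be illustrated with a minimal Python sketch. This is not the MarkChimericReads implementation: the function name `find_chimeric_groups`, the tuple-based input, and the barcode and gene values are all hypothetical, and the tie handling for RETAIN_MOST_SUPPORTED is assumed to follow the Cell Ranger description quoted above. The sketch only decides which (cell barcode, UMI, gene) read groups would be marked; it does not read or write BAM files.

```python
from collections import Counter, defaultdict

STRATEGIES = ("RETAIN_MOST_SUPPORTED", "REMOVE_ALL")


def find_chimeric_groups(reads, strategy="RETAIN_MOST_SUPPORTED"):
    """Return the (cell_barcode, umi, gene) groups whose reads would be marked.

    reads: iterable of (cell_barcode, umi, gene) tuples, one tuple per read.
    This is an illustrative helper, not part of any released tool.
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")

    # Count supporting reads per gene for every (cell barcode, UMI) pair.
    support = defaultdict(Counter)
    for cell, umi, gene in reads:
        support[(cell, umi)][gene] += 1

    flagged = set()
    for (cell, umi), genes in support.items():
        if len(genes) < 2:
            continue  # UMI seen on a single gene: nothing to clean up

        keep = None
        if strategy == "RETAIN_MOST_SUPPORTED":
            (top_gene, top_n), (_, second_n) = genes.most_common(2)
            # Keep the best-supported gene only when it wins outright;
            # on a tie, every group for this UMI is flagged (assumed).
            if top_n > second_n:
                keep = top_gene

        for gene in genes:
            if gene != keep:
                flagged.add((cell, umi, gene))
    return flagged


# Made-up example data: two cells, one clear winner and one tie.
reads = (
    [("AAAC", "TTGCA", "ACTB")] * 3
    + [("AAAC", "TTGCA", "GAPDH")]
    + [("AAAC", "GGGTA", "ACTB")]
    + [("CCGT", "TTGCA", "MALAT1")] * 2
    + [("CCGT", "TTGCA", "XIST")] * 2
)

most_supported = find_chimeric_groups(reads)
remove_all = find_chimeric_groups(reads, strategy="REMOVE_ALL")
```

In this example the default strategy keeps the three ACTB reads for UMI TTGCA in cell AAAC and flags only the single GAPDH read, while the tied MALAT1 and XIST groups in cell CCGT are both flagged. REMOVE_ALL flags the ACTB group as well, which shows why it discards more data.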
This program is best run after cell barcode selection, but before running any programs that rely on UMIs, such as DigitalExpression or AssignCellsToSamples. The software can be run on a BAM file specifying a cell barcode list, but the memory required to track problematic UMIs for all cell barcodes in a BAM may be prohibitive. We have found this UMI cleanup to yield small incremental improvements in donor assignment and doublet detection.