Donor Assignment Minor Release
Update to donor assignment and Census-Seq code
We recently noticed that as sequence length increases and SNP backbone density improves, the fraction of reads or UMIs that observe more than one transcribed SNP also increases. The trend was most evident when comparing the number of "informative" UMIs to the total number of UMIs expressed by each cell: in some instances, the count of informative UMIs exceeded the total UMIs reported by DGE for that cell. In such cases, multiple SNPs were observed on the same read or UMI. These observations may not be independent, so counting each of them amounts to "double counting".
To address this issue, AssignCellsToSamples and DetectDoublets now pre-sort the reads by cell and molecular barcode and select only the "best" SNP for each pileup. SNP quality is measured as the mean GQ score across all donors; in case of a tie, a random SNP is chosen. For algorithms that do not use UMIs (the CensusSeq tools), this filtering occurs at the read level.
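The selection rule described above can be illustrated with a minimal Python sketch. It is not the released implementation: the record layout (`cell`, `umi`, `snp`, `donor_gqs`) and the function name are hypothetical, and grouping by key stands in for the pre-sort the tools perform.

```python
import random
from collections import defaultdict
from statistics import mean


def select_best_snp_per_umi(observations, rng=None):
    """Keep one SNP observation per (cell, UMI) pair.

    `observations` is an iterable of dicts with the hypothetical keys
    'cell', 'umi', 'snp', and 'donor_gqs' (GQ scores across all donors).
    The SNP with the highest mean GQ wins; ties are broken at random.
    """
    rng = rng or random.Random()

    # Group observations by cell barcode and molecular barcode.
    groups = defaultdict(list)
    for obs in observations:
        groups[(obs["cell"], obs["umi"])].append(obs)

    selected = []
    for key in sorted(groups):
        candidates = groups[key]
        scores = [mean(o["donor_gqs"]) for o in candidates]
        best = max(scores)
        # Collect every candidate sharing the top score, then pick one.
        tied = [o for o, s in zip(candidates, scores) if s == best]
        selected.append(rng.choice(tied))
    return selected


example = [
    {"cell": "AAAC", "umi": "TTGA", "snp": "chr1:100", "donor_gqs": [30, 40]},
    {"cell": "AAAC", "umi": "TTGA", "snp": "chr1:150", "donor_gqs": [90, 80]},
    {"cell": "AAAC", "umi": "CCGT", "snp": "chr2:500", "donor_gqs": [50, 50]},
]
kept = select_best_snp_per_umi(example, rng=random.Random(0))
```

In this toy input the UMI `TTGA` covers two SNPs, so only the one with the higher mean GQ is kept and each UMI contributes a single observation.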
Changes in results on a modest test data set of 1614 cells
Donor assignment changes were minor: every cell with a discordant donor assignment was also flagged as a doublet, so these changes would not affect the final donor labels. The probability of assignment was lower on average because non-independent allele observations were removed.
This change alters the doublet classification of a small percentage (~3%) of cells, with roughly equal numbers of cells flipping in each direction between singlet and doublet status. The cells that change status have very few informative UMIs, which leaves them underpowered for this classification, so it is not surprising that their assignment might change. The vast majority of cells classified as singlets by both the previously released algorithm and the new algorithm show no disagreements in donor assignment. Overall, this update does not change outcomes significantly, but it does affect the likelihoods of assignment because fewer observations are used.
Quality of life improvements
We have also implemented several quality of life improvements that make the read filtering methods more transparent. This should make it easier to understand cases of low-quality data (sequencing or VCF), or mismatches between the sequence data and the VCF data, that leave little data for the likelihood calculations. It will also help us address any issues you submit. Additionally, we have added a "fail-fast" threshold for runs in which no UMIs observe a transcribed SNP. This threshold, TRANSCRIBED_SNP_FAIL_FAST_THRESHOLD, is the number of UMIs that can be observed without detecting a transcribed SNP. If this threshold is exceeded, the program stops early and reports the number of UMIs observed.
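The fail-fast behaviour can be sketched in Python as follows. This illustrates only the rule as described here, not the tool's actual code: the function, the exception class, and the boolean-per-UMI input are all hypothetical.

```python
class NoTranscribedSnpError(RuntimeError):
    """Hypothetical error raised when the fail-fast threshold is exceeded."""


def count_informative_umis(umi_has_snp, fail_fast_threshold):
    """Count UMIs, stopping early if none has observed a transcribed SNP.

    `umi_has_snp` yields one boolean per UMI (True if the UMI overlaps a
    transcribed SNP). `fail_fast_threshold` plays the role of
    TRANSCRIBED_SNP_FAIL_FAST_THRESHOLD in this sketch.
    """
    observed = 0
    informative = 0
    for has_snp in umi_has_snp:
        observed += 1
        if has_snp:
            informative += 1
        # Abort once more UMIs than the threshold allows have been seen
        # without a single transcribed SNP.
        if informative == 0 and observed > fail_fast_threshold:
            raise NoTranscribedSnpError(
                f"No transcribed SNP detected after {observed} UMIs"
            )
    return observed, informative
```

Once any UMI has observed a transcribed SNP, the check in this sketch no longer fires, so a run with sparse but real signal continues to completion.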
For API users, FiltererIterator has a new method, logFilterResults, which can be overridden by your subclass to automatically generate logging when the iterator is empty.