Verifying status of "uniqueness" of alignments for single cell analysis #54

sknaack · 2023-07-17T16:19:28Z

I have a question regarding the "uniqueness" of pseudoalignments and how this is handled in Kallisto/Bustools. I've prepared a mouse GRCm39 reference from the ENSEMBL R109 mouse transcriptome and genome .fa files, as well as the .gtf file, using the recommended kb ref usage of "ref -i index.idx -g t2g.txt -f1 transcriptome.fa <GENOME_ANNOTATION>". This produced a working index for GRCm39, which I've used successfully on a set of single-cell data. My questions are as follows:

My results show very low percentages of uniquely pseudoaligned reads, only 12-33% per sample across 6 samples. How does Kallisto handle non-uniquely mapped reads? Are they simply excluded from the output count matrix? I'm concerned that a substantial amount of data is being thrown out because of this. I've copied an example run_info.json and inspect.json file below.

cat run_info.json
{
"n_targets": 219393688,
"n_bootstraps": 0,
"n_processed": 340480788,
"n_pseudoaligned": 90125537,
"n_unique": 46079947,
"p_pseudoaligned": 26.5,
"p_unique": 13.5,
"kallisto_version": "0.48.0",
"index_version": -1293124848,
"start_time": "Sun Jul 2 21:08:02 2023",
"call": "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/kb_python/bins/darwin/kallisto/kallisto bus -i KIndex.Standard.GRCm39 -o cDNA9337WT_fr_10xMulti -x 10XV3 -t 16 --fr-stranded cDNA9337WT_CKDL230013957-1A_22572CLT3_S4_L002_R1_001.fastq.gz cDNA9337WT_CKDL230013957-1A_22572CLT3_S4_L002_R2_001.fastq.gz"
}
cat inspect.json
{
"numRecords": 37168977,
"numReads": 92983952,
"numBarcodes": 1937400,
"medianReadsPerBarcode": 3.000000,
"meanReadsPerBarcode": 47.994194,
"numUMIs": 12884443,
"numBarcodeUMIs": 33830912,
"medianUMIsPerBarcode": 1.000000,
"meanUMIsPerBarcode": 17.462017,
"gtRecords": 11411219,
"numBarcodesOnWhitelist": 469183,
"percentageBarcodesOnWhitelist": 24.217147,
"numReadsOnWhitelist": 85704414,
"percentageReadsOnWhitelist": 92.171189
}

Is the "p_unique" value reported in run_info.json as concerning as I suspect, or should it not be over-interpreted? Are there any alternative or additional options I can pass to any of the Kallisto/Bustools components to control how uniqueness of pseudoalignments is handled, or to generate alternative statistics that are better to use?
A previous analysis I performed with bulk data produced ~90% unique pseudoalignments, but that was for a different genome with only gene-level annotations. Would it make sense to prepare an index that is gene-level only? I'm mainly interested in tabulation by gene at this point.
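
For reference, here is a minimal sketch of how the percentages in the run_info.json above relate to the raw counts. It assumes that both p_pseudoaligned and p_unique are taken relative to n_processed, which is consistent with the reported values but is my assumption rather than something I've confirmed in the Kallisto source. The third ratio (unique reads as a share of pseudoaligned reads) is my own derived quantity and is not reported by Kallisto.

```python
# Counts copied from the run_info.json shown above.
run_info = {
    "n_processed": 340480788,
    "n_pseudoaligned": 90125537,
    "n_unique": 46079947,
}


def summarize(info):
    """Re-derive percentages from the raw counts.

    Assumption: the reported percentages use n_processed as the denominator.
    """
    n_proc = info["n_processed"]
    n_pseudo = info["n_pseudoaligned"]
    n_uniq = info["n_unique"]
    return {
        # Share of all processed reads that pseudoaligned at all.
        "p_pseudoaligned": 100.0 * n_pseudo / n_proc,
        # Share of all processed reads that pseudoaligned uniquely.
        "p_unique": 100.0 * n_uniq / n_proc,
        # Derived (not reported by Kallisto): unique share among pseudoaligned reads.
        "p_unique_of_pseudoaligned": 100.0 * n_uniq / n_pseudo,
    }


stats = summarize(run_info)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Under that assumption, the low p_unique in this sample partly reflects the low overall pseudoalignment rate: of the reads that did pseudoalign, roughly half are unique.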

Thank you in advance for any input and advice! I like Kallisto/Bustools a lot, and am finding it easy to use, but need to ensure I'm applying it intelligently to this single cell data.

Sara Knaack
