Verifying status of "uniqueness" of alignments for single cell analysis #54

Open
sknaack opened this issue Jul 17, 2023 · 0 comments
sknaack commented Jul 17, 2023

I have a question about the "uniqueness" of pseudoalignments and how it is handled in Kallisto/Bustools. I prepared a mouse GRCm39 reference from the ENSEMBL R109 mouse transcriptome and genome .fa files and the .gtf annotation, using the recommended kb ref usage of "ref -i index.idx -g t2g.txt -f1 transcriptome.fa <GENOME_ANNOTATION>". This produced a working index for GRCm39, which I have used successfully on a set of single-cell data. My questions are as follows:

  1. Only a very low percentage of reads is reported as uniquely pseudoaligned in my results: 12-33% per sample across 6 samples. How does Kallisto handle non-uniquely mapped reads? Are they simply excluded from the output count matrix? I'm concerned that a substantial amount of data is being thrown out because of this. I've copied an example run_info.json and inspect.json below:

cat run_info.json
{
"n_targets": 219393688,
"n_bootstraps": 0,
"n_processed": 340480788,
"n_pseudoaligned": 90125537,
"n_unique": 46079947,
"p_pseudoaligned": 26.5,
"p_unique": 13.5,
"kallisto_version": "0.48.0",
"index_version": -1293124848,
"start_time": "Sun Jul 2 21:08:02 2023",
"call": "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/kb_python/bins/darwin/kallisto/kallisto bus -i KIndex.Standard.GRCm39 -o cDNA9337WT_fr_10xMulti -x 10XV3 -t 16 --fr-stranded cDNA9337WT_CKDL230013957-1A_22572CLT3_S4_L002_R1_001.fastq.gz cDNA9337WT_CKDL230013957-1A_22572CLT3_S4_L002_R2_001.fastq.gz"
}
cat inspect.json
{
"numRecords": 37168977,
"numReads": 92983952,
"numBarcodes": 1937400,
"medianReadsPerBarcode": 3.000000,
"meanReadsPerBarcode": 47.994194,
"numUMIs": 12884443,
"numBarcodeUMIs": 33830912,
"medianUMIsPerBarcode": 1.000000,
"meanUMIsPerBarcode": 17.462017,
"gtRecords": 11411219,
"numBarcodesOnWhitelist": 469183,
"percentageBarcodesOnWhitelist": 24.217147,
"numReadsOnWhitelist": 85704414,
"percentageReadsOnWhitelist": 92.171189
}

  2. Is the "p_unique" value reported in run_info.json as concerning as I suspect, or should it not be over-interpreted? Are there any alternative or additional options I can pass to any of the Kallisto/Bustools components to control how uniqueness of pseudoalignments is handled, or to generate alternative statistics that are better to use?

  3. A previous analysis I performed with bulk data produced ~90% unique pseudoalignments, but that was for a different genome with only gene-level annotations. Would it make sense to prepare a gene-level-only index? I'm mainly interested in tabulation by gene at this point.
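
To make the numbers in questions 1 and 2 concrete, here is a minimal Python sketch that re-derives the percentages from the raw counts in the run_info.json above. The arithmetic suggests that both "p_pseudoaligned" and "p_unique" are expressed relative to "n_processed", so the share of unique reads among the reads that pseudoaligned at all is considerably higher than 13.5%. The second helper illustrates the idea behind question 3, under the assumption (not confirmed here) that "unique" is judged at the transcript level: a read compatible with several transcripts of the same gene is still unambiguous once collapsed to genes. The function names, the `T2G` map, and the transcript IDs are hypothetical and exist only for illustration; they are not part of kallisto, bustools, or kb-python.

```python
import json

# Counts copied from the run_info.json shown above (other fields omitted).
RUN_INFO = json.loads("""
{
  "n_processed": 340480788,
  "n_pseudoaligned": 90125537,
  "n_unique": 46079947
}
""")


def alignment_rates(info):
    """Return percentages derived from the read counts in run_info.json."""
    processed = info["n_processed"]
    aligned = info["n_pseudoaligned"]
    unique = info["n_unique"]
    return {
        "pct_pseudoaligned_of_processed": 100.0 * aligned / processed,
        "pct_unique_of_processed": 100.0 * unique / processed,
        "pct_unique_of_pseudoaligned": 100.0 * unique / aligned,
    }


def genes_for_class(transcripts, t2g):
    """Collapse a set of transcript IDs to the set of genes they belong to."""
    return {t2g[t] for t in transcripts}


rates = alignment_rates(RUN_INFO)
for name, value in rates.items():
    print(f"{name}: {value:.1f}")

# Hypothetical transcript-to-gene map and read compatibility set.
T2G = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
same_gene = genes_for_class({"tx1", "tx2"}, T2G)   # two transcripts, one gene
two_genes = genes_for_class({"tx1", "tx3"}, T2G)   # ambiguous at the gene level too
print(same_gene, two_genes)
```

With the counts above, the first two values reproduce the reported 26.5 and 13.5, while the unique share among pseudoaligned reads comes out at roughly 51%.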

Thank you in advance for any input and advice! I like Kallisto/Bustools a lot, and am finding it easy to use, but need to ensure I'm applying it intelligently to this single cell data.

Sara Knaack
