I have a question regarding the "uniqueness" of pseudoalignments and how this is handled in Kallisto/Bustools. I've prepared a mouse GRCm39 transcriptome reference from the Ensembl R109 mouse transcriptome and genome .fa files, as well as the .gtf file, using the recommended kb ref usage of "ref -i index.idx -g t2g.txt -f1 transcriptome.fa <GENOME_ANNOTATION>". This produced a working index for GRCm39, which I've used successfully on a set of single-cell data. My questions are as follows:
My results show very low percentages of uniquely pseudoaligned reads: only 12-33% per sample across 6 samples. How does Kallisto handle non-uniquely mapped reads? Are they simply excluded from the output count matrix? I'm concerned that a substantial amount of data is being thrown out because of this. I've copied an example run_info.json and inspect.json file below.
Is the "p_unique" value reported in run_info.json as concerning as I suspect, or should it not be over-interpreted? Are there any alternative or additional options I can pass to any of the Kallisto/Bustools components to control how uniqueness of pseudoalignments is handled, or to generate alternative statistics that are better to use?
A previous analysis I performed with bulk data produced ~90% unique pseudoalignments, but that was for a different genome with only gene-level annotations. Would it make sense to prepare a gene-level-only index? I'm mainly interested in tabulation by gene at this point.
Thank you in advance for any input and advice! I like Kallisto/Bustools a lot and am finding it easy to use, but I need to ensure I'm applying it intelligently to this single-cell data.
Sara Knaack
cat run_info.json
{
"n_targets": 219393688,
"n_bootstraps": 0,
"n_processed": 340480788,
"n_pseudoaligned": 90125537,
"n_unique": 46079947,
"p_pseudoaligned": 26.5,
"p_unique": 13.5,
"kallisto_version": "0.48.0",
"index_version": -1293124848,
"start_time": "Sun Jul 2 21:08:02 2023",
"call": "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/kb_python/bins/darwin/kallisto/kallisto bus -i KIndex.Standard.GRCm39 -o cDNA9337WT_fr_10xMulti -x 10XV3 -t 16 --fr-stranded cDNA9337WT_CKDL230013957-1A_22572CLT3_S4_L002_R1_001.fastq.gz cDNA9337WT_CKDL230013957-1A_22572CLT3_S4_L002_R2_001.fastq.gz"
}
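
As a sanity check on how these percentages relate to each other, here is a minimal Python sketch that recomputes them from the counts in the run_info.json above. It assumes (inferred from the numbers, not confirmed from the Kallisto documentation) that "p_unique" is n_unique divided by n_processed rather than by n_pseudoaligned. The variable names `p_unique_of_processed` and `p_unique_of_pseudoaligned` are my own, for illustration only.

```python
import json

# Counts copied verbatim from the run_info.json shown above.
run_info = json.loads("""
{
  "n_processed": 340480788,
  "n_pseudoaligned": 90125537,
  "n_unique": 46079947
}
""")


def pct(numerator, denominator):
    """Return numerator/denominator as a percentage rounded to one decimal."""
    return round(100.0 * numerator / denominator, 1)


# Fraction of all processed reads that pseudoaligned at all.
p_pseudoaligned = pct(run_info["n_pseudoaligned"], run_info["n_processed"])

# Assumption: this is what the reported "p_unique" measures.
p_unique_of_processed = pct(run_info["n_unique"], run_info["n_processed"])

# Uniqueness restricted to reads that did pseudoalign (hypothetical extra statistic).
p_unique_of_pseudoaligned = pct(run_info["n_unique"], run_info["n_pseudoaligned"])

print(p_pseudoaligned, p_unique_of_processed, p_unique_of_pseudoaligned)
```

If that assumption holds, the low "p_unique" here is driven largely by the low overall pseudoalignment rate: among the reads that pseudoaligned, roughly half did so uniquely.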
cat inspect.json
{
"numRecords": 37168977,
"numReads": 92983952,
"numBarcodes": 1937400,
"medianReadsPerBarcode": 3.000000,
"meanReadsPerBarcode": 47.994194,
"numUMIs": 12884443,
"numBarcodeUMIs": 33830912,
"medianUMIsPerBarcode": 1.000000,
"meanUMIsPerBarcode": 17.462017,
"gtRecords": 11411219,
"numBarcodesOnWhitelist": 469183,
"percentageBarcodesOnWhitelist": 24.217147,
"numReadsOnWhitelist": 85704414,
"percentageReadsOnWhitelist": 92.171189
}