Improving speed when running `kb count` #55

reetm09 · 2023-11-06T23:52:40Z

When you pass multiple FASTQ files to `kb count`, does it process them sequentially, or is there a way to parallelize it? For me, the first step, "kallisto bus" (loading the index and mapping), takes the longest. Is there a way to parallelize this step, or are there any other tips to improve speed?

Thank you!


Yenaled · 2023-11-07T00:15:35Z

It should parallelize automatically (rather than reading the files sequentially) if you enable multiple threads. That's one reason splitting FASTQ files into multiple chunks enables faster processing.
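The thread does not say how the chunks should be produced, so the following is only a minimal sketch of one way to do it with standard tools. The function name `split_fastq`, the chunk prefix, and the chunk size are hypothetical. It relies on a FASTQ record being four lines long, and it assumes R1 and R2 are split with the same number of reads per chunk so that the pairs stay in the same order.

```shell
# Split a gzipped FASTQ into chunks holding a fixed number of reads.
# A FASTQ record is 4 lines, so each chunk gets reads * 4 lines.
# Chunks are written uncompressed as <prefix>aa, <prefix>ab, ...
split_fastq() {
    input=$1    # gzipped FASTQ to split (placeholder name in the usage below)
    prefix=$2   # output prefix for the chunk files
    reads=$3    # number of reads per chunk
    gzip -dc "$input" | split -l "$((reads * 4))" - "$prefix"
}

# Hypothetical usage: split both mates with the SAME chunk size.
# split_fastq subSample1_R1.fastq.gz subSample1_R1.part_ 1000000
# split_fastq subSample1_R2.fastq.gz subSample1_R2.part_ 1000000
```

Whether chunking helps in a given setup is worth measuring; a later reply in this thread notes that passing several FASTQ pairs with `-t` already parallelizes without any extra splitting.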

kallisto should be pretty fast unless you're doing single-nucleus RNA-seq or RNA velocity. With enough threads, it takes only 1-3 seconds to process a million reads.

Also, make sure you're using the current version of kb-python (version 0.27.3) since speed improvements have been made.

Finally, please post issues on the kallisto or kb-python GitHub pages; I'm usually more responsive there.

reetm09 · 2023-11-07T00:30:09Z

Hi,

Thank you so much for your quick response! This is the command I'm running for RNA velocity analysis. Currently it takes 30-40 minutes, with each FASTQ file containing 1000 reads and the index file being ~40GB. Additionally, each of the files here is 119MB. Is this expected?

kb count --h5ad -i index.idx -g t2g.tsv -x 10xv2 --workflow lamanno -c1 cdna.t2g.tsv -c2 introns.t2g.tsv -o subSample1 --filter bustools -t 20 subSample1_R1.fastq.gz subSample1_R2.fastq.gz

Additionally, just to clarify once more: if I run the following command, will it already be parallelized?
kb count --h5ad -i index.idx -g t2g.tsv -x 10xv2 --workflow lamanno -c1 cdna.t2g.tsv -c2 introns.t2g.tsv -o subSample --filter bustools -t 20 subSample1_R1.fastq.gz subSample1_R2.fastq.gz subSample2_R1.fastq.gz subSample2_R2.fastq.gz subSample3_R1.fastq.gz subSample3_R2.fastq.gz

Or do I need to do anything additional to split the FASTQ files into multiple chunks? And would the output folder (subSample) here contain the combined .h5ad file?

Thanks so much for your help!

Yenaled · 2023-11-07T02:18:16Z

OK, yes, RNA velocity is just slow with kallisto. This will change in the forthcoming release of kb-python (version 0.28; currently on the devel branch), which will be released in the next week or so.

I don't think there's much you can do in terms of speed with the current version of kb-python.

And yes, the command you supplied will parallelize automatically (and the output will be no different from combining the subsamples into a single FASTQ file).
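For reference, a small sketch of assembling that multi-pair command without typing every file name by hand. The helper name `build_kb_count_cmd` is hypothetical, and it assumes the `_R1.fastq.gz` / `_R2.fastq.gz` naming used in this thread and file names without spaces. All `kb count` flags are copied unchanged from the command above. The function only prints the command so it can be inspected before running.

```shell
# Print (do not run) a kb count command for several R1/R2 pairs.
# Flags are copied from the command posted in this thread.
build_kb_count_cmd() {
    out_dir=$1   # value for -o
    threads=$2   # value for -t
    shift 2      # remaining arguments are the R1 files
    fastqs=""
    for r1 in "$@"; do
        # Derive the R2 name from the R1 name (assumes _R1/_R2 naming).
        r2=$(printf '%s' "$r1" | sed 's/_R1\.fastq\.gz$/_R2.fastq.gz/')
        # Keep each R1 immediately followed by its R2 mate.
        fastqs="$fastqs $r1 $r2"
    done
    printf 'kb count --h5ad -i index.idx -g t2g.tsv -x 10xv2 --workflow lamanno -c1 cdna.t2g.tsv -c2 introns.t2g.tsv -o %s --filter bustools -t %s%s\n' \
        "$out_dir" "$threads" "$fastqs"
}

build_kb_count_cmd subSample 20 \
    subSample1_R1.fastq.gz subSample2_R1.fastq.gz subSample3_R1.fastq.gz
```

The printed line matches the six-file command quoted earlier in the thread, with each R1 file followed directly by its R2 mate.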
