Improving speed when running kb count #55

Open
reetm09 opened this issue Nov 6, 2023 · 3 comments

Comments

reetm09 commented Nov 6, 2023

When you input multiple FASTQ files into the kb count function, does it process them sequentially or is there a way to parallelize it? Especially because for me, the first step "kallisto bus" takes the longest (when loading the index and mapping). Is there a way to parallelize this process or any other tips to improve speed?

Thank you!

Yenaled commented Nov 7, 2023

It should process the input in parallel automatically (rather than reading the files sequentially) if you enable many threads -- that's one reason splitting FASTQ files into multiple chunks speeds up processing.
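For readers who want to try the chunking idea mentioned above, here is a minimal Python sketch. It is an illustration, not part of kb-python: the function name `split_fastq` and the chunk file naming are hypothetical, and it assumes gzipped FASTQ files with exactly four lines per record. For paired files, use the same `records_per_chunk` for R1 and R2 so the chunks stay in sync.

```python
import gzip
from itertools import islice
from pathlib import Path


def split_fastq(path, out_dir, records_per_chunk):
    """Split a gzipped FASTQ file into chunks of at most records_per_chunk records.

    Hypothetical helper; assumes every record is exactly 4 lines long.
    Returns the list of chunk paths in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(path).name.replace(".fastq.gz", "")
    chunks = []
    with gzip.open(path, "rt") as src:
        while True:
            # Read up to records_per_chunk records (4 lines each).
            lines = list(islice(src, 4 * records_per_chunk))
            if not lines:
                break
            chunk = out_dir / f"{stem}.chunk{len(chunks):03d}.fastq.gz"
            with gzip.open(chunk, "wt") as dst:
                dst.writelines(lines)
            chunks.append(chunk)
    return chunks
```

Calling it once on the R1 file and once on the R2 file with the same chunk size gives matching chunk lists that can be passed to kb count as R1/R2 pairs.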

kallisto should be pretty fast unless you're doing single-nucleus RNA-seq or RNA velocity -- with enough threads, it will only take 1-3 seconds to process a million reads.

Also, make sure you're using the current version of kb-python (version 0.27.3) since speed improvements have been made.

Finally, post issues on the kallisto or the kb-python github page -- I'm usually more responsive on those pages.

reetm09 commented Nov 7, 2023

Hi,

Thank you so much for your quick response! This is the command I'm running for RNA velocity analysis. It currently takes 30-40 minutes. Each FASTQ file contains 1000 reads and is 119MB, and the index file is ~40GB. Is this expected?

kb count --h5ad -i index.idx -g t2g.tsv -x 10xv2 --workflow lamanno -c1 cdna.t2g.tsv -c2 introns.t2g.tsv -o subSample1 --filter bustools -t 20 subSample1_R1.fastq.gz subSample1_R2.fastq.gz

Additionally, just to clarify once again, if I specify the following command, it should already be parallelizing?
kb count --h5ad -i index.idx -g t2g.tsv -x 10xv2 --workflow lamanno -c1 cdna.t2g.tsv -c2 introns.t2g.tsv -o subSample --filter bustools -t 20 subSample1_R1.fastq.gz subSample1_R2.fastq.gz subSample2_R1.fastq.gz subSample2_R2.fastq.gz subSample3_R1.fastq.gz subSample3_R2.fastq.gz

Or do I need to do anything additional to split the FASTQ files into multiple chunks? And would the output folder (subSample) here contain the combined .h5ad file?

Thanks so much for your help!

Yenaled commented Nov 7, 2023

OK, yes, RNA velocity is just slow with kallisto. This will change in our forthcoming release of kb-python (version 0.28; currently on the devel branch), which will be released in the next week or so.

I don't think there's much you can do in terms of speed with the current version of kb-python.

And yes, it will be parallelizing automatically with the command you supplied (and the output will be no different than combining the subsamples into a single fastq file).
