Feasibility of including deduplicated alignments #109

Open
ijhoskins opened this issue Jul 9, 2020 · 1 comment

Comments

@ijhoskins

I see that the dropEst program reports a matrix of counts for genes in the input GTF. Would it be feasible to support output of a deduplicated BAM as well? I am not interested in scRNA-seq counts but rather in the ability of your pipeline to identify and deduplicate erroneous UMIs for other applications. I realize this may be out of scope, but your pipeline appears to be the superior solution for determining UMI duplicate networks!

@VPetukhov
Contributor

VPetukhov commented Jul 14, 2020

> Would it be feasible to support output of a deduplicated BAM as well?

Unfortunately, it doesn't fit the workflow. Merging duplicated UMIs requires a lot of R calls, but all BAM-related functionality is in C++. So the simplest solution would be to run the UMI correction in R, save the list of CB+Gene+UMI+CorrectedUMI to a file, and then have a C++ script that parses this file and writes out a corrected BAM.
In my experience, writing such a C++ script is generally faster than waiting for Python to do the same :) You basically need to take the BamTools library, iterate over the BAM, update the tags, and save the result to another BAM. Something like ~50 lines of code. Here is an example of iterating over a BAM, and here is another one for editing tags.
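To make the file-based hand-off concrete, here is a minimal sketch of the lookup half of such a script, using only the standard library. It assumes the R step writes one tab-separated line per record in the order CB, Gene, UMI, CorrectedUMI; that layout and the names `CorrectionTable`, `make_key`, `load_corrections` and `corrected_umi` are hypothetical and not part of dropEst.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical table: "CB<TAB>Gene<TAB>UMI" -> corrected UMI.
using CorrectionTable = std::unordered_map<std::string, std::string>;

// Join the three fields with a tab, which cannot occur inside a field
// of a tab-separated file.
std::string make_key(const std::string& cb, const std::string& gene,
                     const std::string& umi) {
    return cb + '\t' + gene + '\t' + umi;
}

// Parse lines of the assumed form "CB<TAB>Gene<TAB>UMI<TAB>CorrectedUMI".
// Lines with fewer than four non-empty fields are skipped.
CorrectionTable load_corrections(std::istream& in) {
    CorrectionTable table;
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        std::istringstream fields(line);
        std::string cb, gene, umi, corrected;
        if (std::getline(fields, cb, '\t') &&
            std::getline(fields, gene, '\t') &&
            std::getline(fields, umi, '\t') &&
            std::getline(fields, corrected, '\t')) {
            table[make_key(cb, gene, umi)] = corrected;
        }
    }
    return table;
}

// Return the corrected UMI, or the original UMI if no correction is listed.
std::string corrected_umi(const CorrectionTable& table, const std::string& cb,
                          const std::string& gene, const std::string& umi) {
    auto it = table.find(make_key(cb, gene, umi));
    return it == table.end() ? umi : it->second;
}
```

In the BAM loop, one would read the cell barcode, gene and UMI tags of each alignment, call `corrected_umi`, and write the result back before saving the record (with BamTools, roughly `GetNextAlignment`, `GetTag`, `EditTag` and `SaveAlignment`). Which tag names hold these values depends on how the BAM was produced, so they are left out here. Note that this only rewrites UMIs; dropping duplicate reads afterwards would be a separate step.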

> I am not interested in scRNA-seq counts but rather the ability of your pipeline to identify and deduplicate erroneous UMIs for other applications.

Do you mean "deduplicate erroneous scRNA-seq UMIs", or is it about some completely different kind of data? The approach should work whenever you have cells, genes and UMIs. But maybe it can also be adapted to other cases.
