
Counting lower complexity barcodes yields many false positives #12

Open
mschubert opened this issue Feb 10, 2023 · 6 comments · May be fixed by #14

@mschubert

I found this tool after trying to use the function vmatchPDict from the Bioconductor package Biostrings for barcode matching (which was horribly slow and took 100 GB of memory when matching with mismatches). guide-counter is amazingly fast and easy to use! 👍

I have one issue, however: the barcodes I'm matching are lower complexity than CRISPR sgRNAs would be, i.e. only 12 nucleotides instead of the 20 mentioned in #8.

As such, the automated offset detection for the library sequences yields many false positives, stemming from reads randomly containing a barcode sequence at some other position. As a result, I get many more barcode matches than I have reads (usually 1.5-2 per read on average).
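
To put rough numbers on why 12nt is so much worse than 20nt here (a back-of-the-envelope sketch only; the library size and offset range are hypothetical, and it assumes uniform random bases):

```rust
// Expected chance matches per read for exact matching, assuming
// i.i.d. uniform bases: E ≈ num_barcodes * num_offsets * (1/4)^len.
// Real data will be worse (biased base composition, low-complexity
// barcode sets), and mismatch-tolerant matching inflates this further.
fn expected_random_matches(num_barcodes: f64, num_offsets: f64, len: i32) -> f64 {
    num_barcodes * num_offsets * 0.25_f64.powi(len)
}

fn main() {
    // hypothetical: 5,000 barcodes scanned over ~130 offsets of a 150bp read
    println!("{:.1e}", expected_random_matches(5_000.0, 130.0, 20)); // ~5.9e-7
    println!("{:.1e}", expected_random_matches(5_000.0, 130.0, 12)); // ~3.9e-2
}
```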

Would it be possible to restrict the offset matching to a common value for all reads, or provide a custom offset via a command-line option?


tfenne commented Feb 10, 2023

Hi @mschubert - yes, I think that should be relatively easy to do. In your case, do you use:

i) a single fixed offset into all the reads
ii) a fixed range (e.g. 10-12bp in)
iii) different or the same values for all samples?


mschubert commented Feb 10, 2023

In my case that would be i), the same offset for all samples, but if the offset were determined per sample, that would also work.

@mschubert

Upon reading the guide-counter count --help page again, I realized that I had misunderstood what --offset-min-fraction does. In fact, this option already makes it possible to set a minimum number of matches for an offset to be considered, so it already provides a way to limit off-target matches.

However, the value is given as a fraction of the total number of matches, not as a fraction of the 100,000 reads sampled. This means the appropriate value changes whenever the number of off-target matches changes. In my case, with 100,000 sampled reads I have 75,000 matches at my desired offset, but when I run with --offset-min-fraction 0.5, I get no counts. And I can't know what to set this value to before running guide-counter, because I don't know how many off-target matches I will get.
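
Concretely, with my numbers (and a hypothetical off-target rate of ~1.75 matches per read, in line with the 1.5-2 I observed above):

```rust
fn main() {
    let reads_sampled = 100_000.0_f64;
    let matches_at_offset = 75_000.0; // matches at my desired offset
    let total_matches = 175_000.0; // assumed: off-targets inflate the denominator

    // current behavior: fraction of matches -> falls below a 0.5 threshold
    println!("{:.2}", matches_at_offset / total_matches); // 0.43

    // what I expected: fraction of reads sampled -> a stable 0.75
    println!("{:.2}", matches_at_offset / reads_sampled); // 0.75
}
```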

Wouldn't it make more sense to specify this parameter per reads sampled?


tfenne commented Feb 18, 2023

@mschubert My thought in having the denominator be the number of matches (instead of the number of reads) is that this fraction should stay consistent/predictable even when sequencing error or other problems cause the overall fraction of matching reads to drop.

Looking at the code, one issue I see is that it currently counts all matches from a read, so to your point, with short kmers you can end up with multiple matches per read. This is definitely not something I had anticipated. Something that is perhaps exacerbating this is that the prefix search also tolerates mismatches, which obviously increases the chance of finding multiple matches.
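
To quantify that: a k-mer has 3k one-mismatch neighbors, so a one-mismatch-tolerant lookup can hit roughly (1 + 3k) times as many sequences by chance as an exact one:

```rust
// one-mismatch neighborhood of a k-mer: 3 alternative bases at each of k positions
fn neighborhood(k: u32) -> u32 {
    1 + 3 * k
}

fn main() {
    println!("{}x", neighborhood(12)); // 37x more chance hits than exact matching
    println!("{}x", neighborhood(20)); // 61x, but from a far smaller baseline
}
```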

I think there are probably a few ways to fix this and I'd appreciate your input:

  1. Add the ability to just specify the prefixes you want to use and skip auto-detection
  2. Add the ability to restrict auto-detection to just a smaller range of the read, but still do auto-detection
  3. Make auto-detection more intelligent/complicated. I think this would look like: i) tally all the possible matched offsets for a single read; ii) if a single offset is found, use it; iii) if multiple offsets are found, and there is a mixture of no-mismatch and one-mismatch hits, reduce to just the no-mismatch set; iv) instead of counting each offset as 1 match, count it as 1 / num_selected_matches_for_read (sketched below).

(3) would make the prefix auto-detection a bit slower but also more robust, whereas (1) is obviously the simplest. Thoughts?
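
Roughly, I imagine the per-read logic for (3) looking something like this (a sketch only, not the actual implementation; names are illustrative):

```rust
use std::collections::HashMap;

/// One candidate prefix hit found within a single read.
struct Hit {
    offset: usize,
    mismatches: u8,
}

/// Per-read tally for option (3): prefer exact hits over one-mismatch
/// hits, then split the read's single vote across the surviving offsets
/// so no read can contribute more than 1.0 in total.
fn tally_read(hits: &[Hit], counts: &mut HashMap<usize, f64>) {
    if hits.is_empty() {
        return; // nothing matched, nothing to tally
    }
    // iii) if exact and one-mismatch hits are mixed, keep only the best
    let min_mm = hits.iter().map(|h| h.mismatches).min().unwrap();
    let selected: Vec<&Hit> = hits.iter().filter(|h| h.mismatches == min_mm).collect();
    // ii) + iv) a lone offset gets weight 1; ties share 1/n each
    let weight = 1.0 / selected.len() as f64;
    for hit in selected {
        *counts.entry(hit.offset).or_insert(0.0) += weight;
    }
}

fn main() {
    let mut counts = HashMap::new();
    // a read with an exact hit at offset 0 and a one-mismatch hit at 35:
    // only offset 0 is counted, with weight 1.0
    let hits = vec![
        Hit { offset: 0, mismatches: 0 },
        Hit { offset: 35, mismatches: 1 },
    ];
    tally_read(&hits, &mut counts);
    println!("{:?}", counts); // {0: 1.0}
}
```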


tfenne commented Feb 18, 2023

@mschubert I'm not sure if you're up for checking out a branch and building, but I took a shot at implementing (3) in #14. Alternatively, if you're able to share the first 25-100k reads of your fastq, I can give it a shot too.

@mschubert

Thanks for your quick answer @tfenne!

I should be able to build this locally. Be aware that I'm already using --exact-match, so preferring no-mismatch over 1-mismatch hits will not affect my counts. Before I look into building this, though, I have a conceptual question. The way I see it, we've got two expectations of the --offset-min-fraction parameter:

  • (Yours) Reads with obvious problems (say, sequencing errors, bacterial contamination, or primer dimers) should not be counted in this fraction
  • (Mine) Offsets should not compete with each other (i.e., if half of the reads have the barcode at position 35, that should correspond to an --offset-min-fraction of 0.5, irrespective of whether the same number of matches occurs elsewhere)

> I think there are probably a few ways to fix this and I'd appreciate your input

Your proposed changes in #14 would still not align with my expectation, because the counting is still competitive. If anything, they make the fraction I expect harder to estimate, because I don't know when a match will be counted as 1 and when as 1/n. What about:

  1. Still count each match as 1, but keep a separate counter of how many reads contain a barcode at any position. Then, for --offset-min-fraction, divide by this read-based counter instead of the match-based counter (as sketched below).
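
A sketch of what I mean (hypothetical names, not guide-counter's actual internals):

```rust
use std::collections::HashMap;

/// Count every match as 1, but also count how many sampled reads matched
/// *anywhere*, and use that read-based total as the denominator for
/// --offset-min-fraction.
#[derive(Default)]
struct OffsetStats {
    matches_by_offset: HashMap<usize, u64>,
    reads_with_any_match: u64,
}

impl OffsetStats {
    fn record_read(&mut self, offsets: &[usize]) {
        if offsets.is_empty() {
            return;
        }
        self.reads_with_any_match += 1; // one increment per read, not per match
        for &o in offsets {
            *self.matches_by_offset.entry(o).or_insert(0) += 1; // still 1 per match
        }
    }

    /// Non-competitive fraction: an offset matching half of the matched
    /// reads yields 0.5 no matter how many other offsets also matched.
    fn fraction(&self, offset: usize) -> f64 {
        let m = *self.matches_by_offset.get(&offset).unwrap_or(&0) as f64;
        m / self.reads_with_any_match.max(1) as f64
    }
}

fn main() {
    let mut stats = OffsetStats::default();
    stats.record_read(&[35, 77]); // a read matching at two offsets
    stats.record_read(&[35]);
    println!("{:.2}", stats.fraction(35)); // 1.00
    println!("{:.2}", stats.fraction(77)); // 0.50
}
```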

(This is your tool, of course, so feel free to ignore my opinion here. Maybe there is a reason to use competitive counts that I don't yet see.)
