Multi-context seeds plus fixes and optimized parameters #426
f4ec683 to 7ae48f2
Increases the number of seeds per read by 2*w_min seeds
…he -b parameter value
… try modifying this at a later stage
…l hits may benefit from larger syncmers reverting to only change shortest read lengths before starting benchmarks - will wait for proper parameter optimization instead"
Several full seeds can have the same partial base seed (identical query and ref coordinates). Such partial seeds got added to the same NAM and thus increased the score (through incrementing n_hits several times)
…we did not sort on seed length, they could still be added if there was a full hit with the same base hash value. This commit also sorts on seed length and checks whether we have already added a full hit with the same base hash value
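A minimal sketch of the de-duplication idea described in the two commit messages above, not the code from this PR. The `Hit` struct, its field names and `drop_redundant_hits()` are hypothetical and chosen only for illustration. Hits are sorted so that, for each base hash and coordinate pair, the longest (full) hit comes first; a partial hit is then skipped if a hit at the same site has already been kept.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical hit record; the field names are illustrative only.
struct Hit {
    uint64_t base_hash;  // hash of the base strobe, shared by full and partial seeds
    int query_start;
    int ref_start;
    int length;          // a full seed is longer than its partial seed
    bool is_partial;
};

// Keep all full hits, but keep a partial hit only if no hit with the same
// base hash and the same query/ref coordinates was kept before it.
std::vector<Hit> drop_redundant_hits(std::vector<Hit> hits) {
    // Sort by site, and within a site by descending length, so that full
    // hits precede the partial hits that share their base seed.
    std::sort(hits.begin(), hits.end(), [](const Hit& a, const Hit& b) {
        return std::tie(a.base_hash, a.query_start, a.ref_start, b.length)
             < std::tie(b.base_hash, b.query_start, b.ref_start, a.length);
    });
    auto same_site = [](const Hit& a, const Hit& b) {
        return a.base_hash == b.base_hash
            && a.query_start == b.query_start
            && a.ref_start == b.ref_start;
    };
    std::vector<Hit> kept;
    for (const Hit& h : hits) {
        if (h.is_partial && !kept.empty() && same_site(kept.back(), h)) {
            continue;  // redundant: this site is already represented
        }
        kept.push_back(h);
    }
    return kept;
}
```

With this ordering, a site contributes to a NAM's hit count at most once through partial seeds, which is the effect the commits aim for.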
Here are a couple of comments, mostly directed at myself.
7ae48f2 to 1702c10
Commit 21599d9 "Possibly fix redundant alignment sites for symmetric multi context seeds." also breaks a test. After it, the SAM record for rescuable.43 changes as follows:
-rescuable.43 83 NC_001422.1 3137 60 120S14=1X4=1X9=1X4=1X4=1X9=1X4=1X4=1X4=1X9=1X19=1X4=1X4=1X4=1X4=1X4=1X4=1X9=1X4=1X9=1X4=1X9=1X4=1X4=1X4=1X = 2955 -362 (sequence) (qual) NM:i:25 AS:i:120 RG:Z:1
+rescuable.43 69 NC_001422.1 2955 0 * = 2955 0 (sequence) (qual) RG:Z:1
By adding get_aux_len() to StrobemerIndex and using that.
Reduces code duplication a little bit.
65240de to 2c90913
I’m not sure where, but @ksahlin has some measurements that show that the multi-context seeds branch is slower than I used the Linux
That is, the multi-context seeds version from this PR spends 21% more time in the
The above was done with read length 150. I am going to check 50 bp as well, since one idea was to disable mcs at read lengths of ~100 bp and above.
I then changed strobealign to keep track of a couple more statistics, in particular the number of hits and NAMs generated per read. This is the output. Before (3a97f6b):
After (this PR):
So is the explanation that mcs is slower because it generates more hits and NAMs?
Pasted below are the plots you're referring to. I was interested mainly in reducing, or at least finding the cause of, the overhead for longer reads of 200 bp and above (see figs). This may not be so relevant if we disable mcs for reads over 125 bp or so.
I tested all commits in this branch to see which ones contribute to the slowdown (using the sim3 dataset and drosophila-200 reads, single-end, mapping only). This table shows the commits at which runtimes change:
Here are also runtimes for disabling multi-context seeds (by just commenting out the appropriate
Since runtime in |
Great detective work! I am not sure I have any good input on the above yet. As you mention, the speed increase from d5125b6 is very strange indeed.
I wrote that as a TODO in the code, but my first guess was that small vectors (up to 100 elements or so) are faster to search and maintain than small sets. It may still be worth a try.
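For readers unfamiliar with the trade-off mentioned here, this is a minimal sketch of the "small vector as a set" approach. The function name `insert_unique()` is hypothetical and not taken from strobealign. Membership is checked with a linear scan, which for a handful of elements is often competitive with `std::set` or `std::unordered_set` because the data is contiguous and no per-element allocation is needed; whether that holds in this code path would have to be measured.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Insert `value` only if it is not already present.
// Returns true if the value was inserted, false if it was a duplicate.
// The linear scan is O(n), which is acceptable only for small n.
bool insert_unique(std::vector<uint64_t>& seen, uint64_t value) {
    if (std::find(seen.begin(), seen.end(), value) != seen.end()) {
        return false;
    }
    seen.push_back(value);
    return true;
}
```

Swapping in `std::unordered_set<uint64_t>` and calling `insert(value).second` gives the same semantics, so the two variants are easy to benchmark against each other.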
We do
In the cases we set Also, I don't think we need Could one of these be responsible for the slowdown in d5125b6 ? EDIT: Oh I see that "Assertion statements compile only if DEBUG is defined", nvm the second comment then. |
I think my first statement is also not correct. It only results in strobe2 being main, which is not harmful since they are set to be the same at the start. I will blame my lame attempt on being both on vacation and sick, trying to do something meaningful from a hotel room. I'll let you do the detective work instead :)
Oh no, sorry to hear! Yeah, the assertions are ignored when compiling in Release mode. And I think the swap is very cheap; that should be something like two machine instructions. I’ll investigate further.
One of the remaining to-do items here is to enable multi-context seeds only when the read length is below a certain threshold. I just pushed commit 47f9257, which makes it possible to enable or disable mcs dynamically by setting a boolean parameter, but for the moment, I hard-coded this parameter to I ran a full evaluation comparing
We can use this to decide which criteria to use for enabling mcs. But I’m having a hard time making such a decision right now because the slowdown feels relatively large for some datasets. I’ll try to improve the runtime a bit; maybe that makes it easier.
Nice! Would the difference between green and pink mostly be from new seeding parameters? From your plots it seems favourable to use mcs for read lengths 100 and below. Actually, I think I mentioned around 110-125 somewhere earlier. For 50-100bp, the mapping accuracy is substantially better at reasonable (sometimes no) cost in runtime. Drosophila seems most affected by mapping slowdown for these short read lengths (relative terms). Assuming there would be a significant improvement in runtime for mcs (e.g., lowering some of the increases we observed in your table), I'd say we would want to use it up to lengths of 200. Particularly, it seems mcs can help the mapping only and, thus, the |
I’m finding it difficult to work on some aspects of strobealign at the moment because many changes would actually be in conflict with the multi-context seeds PR. Also, as we saw in the table above, this PR contains multiple things that change strobealign’s behavior. I think it would be good to tease them apart into separate PRs where each one is much smaller than this PR. I’m mainly thinking:
Edit: All PRs opened as discussed. Added links to the list above.
I was wondering that as well, so I ran an evaluation for this PR but without parameter optimization (mcs lookups disabled, parameters from main). It’s "mcs-nooptim" (the olive lines) in these plots (some read lengths missing so I wouldn’t have to wait that long):
Accuracy is identical to main. Runtime is consistently slightly higher than main. Due to the observation in #447 that only changing the hash function actually reduces runtime slightly, I think this is due to the extra syncmers at the end of reads.
Very good suggestion! I agree.
Ok, makes sense!
Everything in this PR has now been split out into other, smaller PRs. Since we’ve been discussing speed of the mcs implementation here and that is still ongoing, I’ll leave this PR open for the moment. See what I tried below.
I tried a couple of things to make multi-context seeds faster:
Also, when looking up a full hash, the interval that we find is a subinterval of the interval for the partial hash. We currently look up the subinterval first and then the larger interval encompassing it, which requires that we do the full lookup again. I tried to swap this around, that is, to always look up the partial hash first. The advantage should be:
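To make the "partial hash first" lookup order concrete, here is a minimal sketch under stated assumptions; it is not strobealign's index code. It assumes that the low `aux_len` bits of a hash carry the second strobe, so that all entries sharing a partial hash are contiguous in a sorted array, and it uses a plain sorted `std::vector` as a stand-in for the real index. The names `partial_of()`, `find_partial()` and `find_full_within()` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Range = std::pair<std::size_t, std::size_t>;  // half-open [first, second)

// Assumption: dropping the low aux_len bits yields the partial (base) hash.
inline uint64_t partial_of(uint64_t hash, unsigned aux_len) {
    return hash >> aux_len;
}

// Step 1: interval of all entries that share the partial hash of `query`.
Range find_partial(const std::vector<uint64_t>& hashes, uint64_t query, unsigned aux_len) {
    const uint64_t key = partial_of(query, aux_len);
    auto lo = std::lower_bound(hashes.begin(), hashes.end(), key,
        [aux_len](uint64_t h, uint64_t k) { return partial_of(h, aux_len) < k; });
    auto hi = std::upper_bound(lo, hashes.end(), key,
        [aux_len](uint64_t k, uint64_t h) { return k < partial_of(h, aux_len); });
    return {static_cast<std::size_t>(lo - hashes.begin()),
            static_cast<std::size_t>(hi - hashes.begin())};
}

// Step 2: interval of entries equal to the full hash, searched only inside
// the partial interval found in step 1. An empty result has first == second.
Range find_full_within(const std::vector<uint64_t>& hashes, Range partial, uint64_t query) {
    auto first = hashes.begin() + static_cast<std::ptrdiff_t>(partial.first);
    auto last = hashes.begin() + static_cast<std::ptrdiff_t>(partial.second);
    auto r = std::equal_range(first, last, query);
    return {static_cast<std::size_t>(r.first - hashes.begin()),
            static_cast<std::size_t>(r.second - hashes.begin())};
}
```

If the full interval comes back empty, the partial interval is already at hand and no second lookup over the whole index is needed, which is the saving this ordering is meant to provide.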
I have also tried to vary the |
I like this idea(!) and also find it conceptually cleaner. However, I don't think drosophila is a good test for this as most partial seeds may already be unique(?). I would benchmark this on CHM13. I would prefer this lookup order if it wasn't slower than the current lookup order.
Makes sense to try! But I would also benchmark this on CHM13.
Good point about trying on CHM13, will do so.
Regarding varying |
Documenting another failed attempt at improving speed: In commit 6a5eca7 I tried to make it so that the |
I realized only now that #388 is incomplete and that @ksahlin’s updates were made in a separate branch. We need a place to review this mcs-optimized-parameters branch and someplace where we have a "Merge" button, so this is a separate PR that supersedes #388.
I’m making quite a few changes to this branch (squashing commits, changing commit messages etc.). If anyone wants the original branch without my modifications, it is available as mcs-optimized-parameters-backup.
Original parameter optimization was done with the commit that has the description "Fix so that partial rescue hits are added properly". The commit hash has changed due to history rewriting.
To Do
- rescuable.43 is no longer rescuable
- const unsigned int aux_len = 24; in find_nams()