
Multi-context seeds plus fixes and optimized parameters #426

Open. marcelm wants to merge 29 commits into main from the mcs-optimized-parameters branch.

Conversation

@marcelm (Collaborator) commented May 22, 2024

I realized only now that #388 is incomplete and that @ksahlin’s updates were made in a separate branch. We need a place to review this mcs-optimized-parameters branch and someplace where we have a "Merge" button, so this is a separate PR that supersedes #388.

I’m making quite a few changes to this branch (squashing commits, changing commit messages, etc.). If anyone wants the original branch without my modifications, it is available as mcs-optimized-parameters-backup.

The original parameter optimization was done with the commit that has the description "Fix so that partial rescue hits are added properly". (The commit hash has changed due to history rewriting.)

To Do

  • Changelog entry
  • Document the concept (take from Add multi-context seeds #388)
  • Document the data structures
  • Document terminology: main, partial, auxiliary, ...
  • Profile: Mapping-only mode is somewhat slower for read lengths up to 100 or so
  • Fix tests
    • rescuable.43 is no longer rescuable
  • Get rid of hardcoded const unsigned int aux_len = 24; in find_nams()

@marcelm force-pushed the mcs-optimized-parameters branch 2 times, most recently from f4ec683 to 7ae48f2 on May 22, 2024
Itolstoganov and others added 10 commits on May 22, 2024. Commit messages include:

  • Increases the number of seeds per read by 2*w_min seeds
  • …l hits may benefit from larger syncmers
  • reverting to only change shortest read lengths before starting benchmarks - will wait for proper parameter optimization instead
  • several full seeds can have the same partial base seed (identical query and ref coordinates). Such partial seeds got added to the same NAM and thus increased the score (through incrementing n_hits several times)
  • …we did not sort on seed length, they could still be added if there was a full hit with the same base hash value. This commit sorts also on seed length and checks whether we have already added a full hit with the same base hash value
@marcelm (Collaborator) left a comment:

Here are a couple of comments, mostly directed at myself.

(Inline review comments on src/arguments.hpp, src/nam.cpp, src/randstrobes.cpp and src/index.hpp; all resolved.)
@marcelm (Collaborator, Author) commented May 23, 2024

Commit 21599d9 "Possibly fix redundant alignment sites for symmetric multi context seeds." also breaks a test. After it, the rescuable.43 read in the tiny test dataset is no longer rescuable. Shortened diff of read 1:

-rescuable.43   83  NC_001422.1  3137  60  120S14=1X4=1X9=1X4=1X4=1X9=1X4=1X4=1X4=1X9=1X19=1X4=1X4=1X4=1X4=1X4=1X4=1X9=1X4=1X9=1X4=1X9=1X4=1X4=1X4=1X      =       2955    -362    (sequence)       (qual)       NM:i:25 AS:i:120        RG:Z:1
+rescuable.43   69  NC_001422.1  2955  0  *  =  2955  0       (sequence)  (qual)  RG:Z:1

@marcelm (Collaborator, Author) commented Sep 13, 2024

I’m not sure where, but @ksahlin has some measurements that show that the multi-context seeds branch is slower than main. I did some profiling to find out why. In short, the simplest explanation seems to be that this branch generates more hits.

I used the Linux perf tool for profiling, which has a perf diff subcommand that can compare two profiles. Here is how the first couple of lines in its output look:

              +20.54%  strobealign-mcs       [.] find_nams
              +13.15%  strobealign-mcs       [.] SyncmerIterator::next
              +12.84%  strobealign-mcs       [.] (anonymous namespace)::merge_hits_into_nams
               +6.38%  strobealign-mcs       [.] sw_sse2_byte

That is, the multi-context seeds version from this PR spends 21% more time in the find_nams function, 13% more in the SyncmerIterator::next function etc.

The above was done with read length 150. I am going to check 50 bp as well. (Since one idea was to disable mcs at read lengths ~100bp and above.)

I then changed strobealign to keep track of a couple more statistics, in particular the number of hits and NAMs generated per read. This is the output.

Before (3a97f6b):

Number of reads: 1000000
Number of non-rescue hits: 25160733 total. Per read: 25.16
Number of rescue hits: 583028 total. Per rescue attempt: 23.07
Number of non-rescue NAMs: 27321159 total. Per read: 27.32
Number of rescue NAMs: 11457402 total. Per rescue attempt: 453.27
Total mapping sites tried: 1883792
Total calls to ssw: 184179
Inconsistent NAM ends: 1154
Tried NAM rescue: 25277

After (this PR):

Number of reads: 1000000
Number of non-rescue hits: 38371991 total. Per read: 38.37
Number of rescue hits: 843076 total. Per rescue attempt: 30.16
Number of non-rescue NAMs: 39108199 total. Per read: 39.11
Number of rescue NAMs: 14522183 total. Per rescue attempt: 519.54
Total mapping sites tried: 1898603
Total calls to ssw: 193726
Inconsistent NAM ends: 1205
Tried NAM rescue: 27952

So is the explanation that mcs is slower because it generates more hits and NAMs?

@ksahlin (Owner) commented Sep 13, 2024

Pasted below are the plots you're referring to. I was mainly interested in reducing, or at least finding the cause of, the overhead for longer reads (200 bp and above; see figs). This may not be so relevant if we disable mcs for reads over 125 bp or so.

time_plot_SE.pdf

time_plot_PE.pdf

@marcelm (Collaborator, Author) commented Sep 27, 2024

I tested all commits in this branch to see which ones contribute to the slowdown (using the sim3 dataset and drosophila-200 reads, single-end, mapping only). This table shows the commits at which runtimes change:

Commit   Time (s)  Overall slowdown  Description
f59bcb4  12.9      0%                Merge pull request #425 from ksahlin/optional
379a1fc  13.9      8%                (Added mcs, lumped together with earlier commits)
3902f41  14.7      14%               Prevent looking up and, thus, adding the same partial hits twice
caea5c2  15.4      19%               Updated parameters according to optimization in issue #423
d5125b6  16.8      30%               Make RandstrobeIterator/-Generator yield the same results
9c37683  16.5      28%               Ensure unmappable.41 really is unmappable
  • Commit 3902f41 adds a linear search in a vector; this can probably be optimized a bit (using a set).
  • I don’t understand, yet, why d5125b6 slows things down so much. It should only add a couple of strobemers at the end of each contig (syncmer paired with itself). That is, it only affects the reference. There was a problem here that the second hash value was set to zero (see discussion above), but that was fixed in commit 9c37683. However, runtime only goes down a bit with that commit.

@marcelm (Collaborator, Author) commented Sep 27, 2024

Here are also runtimes for disabling multi-context seeds (by just commenting out the appropriate else in find_nams or find_nams_rescue):

  • 13.1s when disabling mcs only in find_nams
  • 13.0s when disabling mcs also in find_nams_rescue

Since runtime in main was 12.9s, this means that all the other changes done in this branch affect runtime very little. (For sim3-drosophila-200, that is.)

@ksahlin (Owner) commented Sep 27, 2024

Great detective work!

I am not sure I have any good input on the above yet. As you mention, the runtime increase from d5125b6 is very strange indeed.

Commit 3902f41 adds a linear search in a vector; this can probably be optimized a bit (using a set).

I wrote that as a TODO in the code, but my first guess was that small vectors (up to 100 elements or so) are faster to search and maintain than small sets. But maybe it's worth a try.

@ksahlin (Owner) commented Sep 30, 2024

We do bool first_strobe_is_main = strobe1.hash < strobe2.hash; in make_randstrobe. Then

    if (hash1 > hash2) {
        std::swap(hash1, hash2);
    }

In the cases where we set Syncmer strobe2 = strobe1; (which happens several times at both the fw and rc ends of each query), this becomes an unnecessary swap.

Also, I don't think we need assert(i <= w_max); and assert(i < syncmers.size()); in the RandstrobeIterator and RandstrobeGenerator::next.

Could one of these be responsible for the slowdown in d5125b6?

EDIT: Oh I see that "Assertion statements compile only if DEBUG is defined", nvm the second comment then.

@ksahlin (Owner) commented Sep 30, 2024

I think my first statement is also not correct. It only results in strobe2 being the main strobe, which is not harmful since they are set to be the same at the start. I will blame my lame attempt on the fact that I am both on vacation and sick -- trying to do something meaningful from a hotel room. I'll let you do the detective work instead :)

@marcelm (Collaborator, Author) commented Sep 30, 2024

Oh no, sorry to hear! Yeah, the assertions are ignored when compiling in Release mode. And I think the swap is very cheap; that should be something like two machine instructions. I’ll investigate further.

@marcelm (Collaborator, Author) commented Sep 30, 2024

One of the remaining to-do items here is to enable multi-context seeds only when the read length is below a certain threshold. I just pushed commit 47f9257, which makes it possible to enable or disable mcs dynamically by setting a boolean parameter, but for the moment, I hard-coded this parameter to false. That is, with this commit, all the other changes in this PR are still enabled, but actually using multi-context seeds at mapping time is disabled.
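
For illustration, here is a minimal sketch of what such a gate could look like. The struct, parameter name and lookup stubs below are hypothetical, not the actual code of commit 47f9257:

    #include <cstdint>

    struct MappingParameters {
        // Hypothetical flag: hard-coded to false for now; later it could be
        // set based on read length (e.g. enable mcs only below ~100-125 bp).
        bool use_mcs = false;
    };

    // Stand-ins for the real index lookups.
    bool full_lookup(uint64_t) { return false; }
    bool partial_lookup(uint64_t) { return true; }

    // A full-hash miss falls back to the partial (base) hash only when
    // multi-context seeds are enabled. With use_mcs == false, seeding
    // behaves like main while all other changes in this PR stay active.
    bool lookup(uint64_t hash, const MappingParameters& params) {
        if (full_lookup(hash)) {
            return true;
        }
        return params.use_mcs && partial_lookup(hash);
    }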

I ran a full evaluation comparing main, multi-context seeds and this most recent commit; please see the most relevant plots here:

We can use this to decide which criteria to use for enabling mcs. But I’m having a hard time making such a decision right now because the slowdown feels relatively large for some datasets. I’ll try to improve the runtime a bit; maybe that makes it easier.

@ksahlin (Owner) commented Oct 1, 2024

Nice!

Would the difference between green and pink mostly be from new seeding parameters?

From your plots it seems favourable to use mcs for read lengths 100 and below. Actually, I think I mentioned around 110-125 somewhere earlier. For 50-100 bp, the mapping accuracy is substantially better at a reasonable (sometimes no) cost in runtime. Drosophila seems most affected by the mapping slowdown for these short read lengths (in relative terms).

Assuming there would be a significant improvement in runtime for mcs (e.g., lowering some of the increases we observed in your table), I'd say we would want to use it up to lengths of 200. In particular, it seems mcs can help mapping-only mode and, thus, the --aemb setting quite a bit.

@marcelm (Collaborator, Author) commented Oct 1, 2024

I’m finding it difficult to work on some aspects of strobealign at the moment because many changes would actually be in conflict with the multi-context seeds PR.

Also, as we saw in the table above, this PR contains multiple things that change strobealign’s behavior. I think it would be good to tease them apart into separate PRs, each much smaller than this one.

I’m mainly thinking:

  1. Change the way randstrobes are hashed (that is, introduce aux_len) (Edit: Done in Switch to multi-context seeds hash function #447)
  2. Use multi-context seeds when looking up randstrobes (Implement multi-context seed lookups (but leave them disabled) #448)
  3. Add syncmers at the ends of reads (Add syncmers at ends of reads #452)
  4. Switch to optimized parameters (Use optimized parameters according to optimization in issue #423 #453)

Edit: All PRs opened as discussed. Added links to the list above.

@marcelm (Collaborator, Author) commented Oct 1, 2024

Would the difference between green and pink mostly be from new seeding parameters?

I was wondering that as well, so I ran an evaluation for this PR but without parameter optimization (mcs lookups disabled, parameters from main). It’s "mcs-nooptim" (the olive lines) in these plots (some read lengths missing so I wouldn’t have to wait that long):

Accuracy is identical to main. Runtime is consistently slightly higher than main. Given the observation in #447 that changing only the hash function actually reduces runtime slightly, I think this is due to the extra syncmers at the ends of reads.

@ksahlin (Owner) commented Oct 1, 2024

Also, as we saw in the table above, this PR contains multiple things that change strobealign’s behavior. I think it would be good to tease them apart into separate PRs, each much smaller than this one.

Very good suggestion! I agree.

Given the observation in #447 that changing only the hash function actually reduces runtime slightly, I think this is due to the extra syncmers at the ends of reads.

Ok, makes sense!

@marcelm (Collaborator, Author) commented Oct 7, 2024

Everything in this PR has now been split out into other, smaller PRs.

Since we’ve been discussing the speed of the mcs implementation here and that discussion is still ongoing, I’ll leave this PR open for the moment. See below for what I tried.

Commit 3902f41 adds a linear search in a vector; this can probably be optimized a bit (using a set).

I wrote that as a TODO in the code, but my first guess was that small vectors (up to 100 elements or so) are faster to search and maintain than small sets. But maybe it's worth a try.

I tried a couple of things to make multi-context seeds faster:

  • Use std::unordered_set and robin_hood::unordered_set for partial_queried: both are actually slower (see the sketch after this list).
  • Do not call partial_queried.reserve(10) (or use different values other than 10): This makes a small difference, which is why I disabled the call to reserve() entirely if multi-context seeds are disabled. If they are enabled, reserving a couple of elements seems to be a bit beneficial.
  • In the query_randstrobes vector, we first get all forward randstrobes and then all reverse randstrobes. I tried to truncate the partial_queried vector when we see the first reverse randstrobe because we do not need to compare them to the forward ones. This does not make a difference.
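
For concreteness, the membership check under discussion is roughly the following pattern (a sketch; the real partial_queried handling differs in details). For the few dozen partial hashes seen per read, a linear scan over a small contiguous vector avoids the hashing and allocations of an unordered_set, which matches the observation above:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Sketch: "have we already queried this partial hash?" For small n,
    // a linear scan over contiguous memory tends to beat a hash set
    // because it avoids hashing, heap allocation and pointer chasing.
    bool already_queried(std::vector<uint64_t>& partial_queried, uint64_t hash) {
        if (std::find(partial_queried.begin(), partial_queried.end(), hash)
                != partial_queried.end()) {
            return true;
        }
        partial_queried.push_back(hash);
        return false;
    }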

Also, when looking up a full hash, the interval that we find is a subinterval of the interval for the partial hash. We currently look up the subinterval first and then the larger interval encompassing it, which requires that we do the full lookup again. I tried to swap this around, that is, to always look up the partial hash first. The advantages should be:

  • If we don’t find the partial hash, we can stop because we won’t find the full hash either.
  • In the majority of the cases, the first randstrobe that matches the partial hash also matches the full hash. Then we don’t need to do a second lookup.
  • If the position we found does not match the full hash, we could just search forward from that position and would not have to re-lookup.
Unfortunately, this wasn’t actually faster. I presume this is because everything is in the cache already when we do the second lookup of the partial hash. I’m not quite ready to give up on this because I find it a bit cleaner conceptually, so I’ll invest a little bit of time to ensure I haven’t overlooked anything.
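
A sketch of the proposed partial-hash-first order, assuming the index entries are sorted by hash so that all entries sharing a partial (masked) hash form one contiguous run and the full-hash entries are a subinterval of it. The 24-bit mask mirrors the hardcoded aux_len = 24 from the to-do list; everything else here is illustrative, not the actual index API:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Keep the high bits that make up the partial (base) hash; the low
    // 24 bits play the role of the auxiliary part (cf. aux_len = 24).
    constexpr uint64_t PARTIAL_MASK = ~uint64_t{0} << 24;

    enum class Hit { None, Partial, Full };

    Hit lookup(const std::vector<uint64_t>& sorted_hashes, uint64_t full_hash) {
        uint64_t partial = full_hash & PARTIAL_MASK;
        auto it = std::lower_bound(sorted_hashes.begin(), sorted_hashes.end(), partial);
        if (it == sorted_hashes.end() || (*it & PARTIAL_MASK) != partial) {
            return Hit::None;  // no partial hit, so no full hit either
        }
        if (*it == full_hash) {
            return Hit::Full;  // common case: first partial match is also full
        }
        // Otherwise scan forward within the partial run for the full hash
        // instead of doing a second binary search.
        for (; it != sorted_hashes.end() && (*it & PARTIAL_MASK) == partial; ++it) {
            if (*it == full_hash) {
                return Hit::Full;
            }
        }
        return Hit::Partial;
    }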

@marcelm (Collaborator, Author) commented Oct 7, 2024

I have also tried to vary the MAX_LINEAR_SEARCH parameter (currently set to 4). Interestingly, it makes no measurable difference in total runtime whether I set it to 0 or 10000 (at least on Drosophila).
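
For context, the kind of cutoff MAX_LINEAR_SEARCH controls looks roughly like this (a sketch; the actual index code differs in details): within a sorted candidate interval, short intervals are scanned linearly and long ones binary-searched.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    constexpr size_t MAX_LINEAR_SEARCH = 4;

    // Sketch: for tiny intervals, a linear scan is at least as fast as a
    // binary search (better branch prediction, one or two cache lines);
    // for large intervals, binary search wins.
    const uint64_t* find_hash(const uint64_t* begin, const uint64_t* end,
                              uint64_t hash) {
        if (static_cast<size_t>(end - begin) <= MAX_LINEAR_SEARCH) {
            for (const uint64_t* p = begin; p != end; ++p) {
                if (*p == hash) return p;
            }
            return end;
        }
        const uint64_t* p = std::lower_bound(begin, end, hash);
        return (p != end && *p == hash) ? p : end;
    }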

@ksahlin (Owner) commented Oct 7, 2024

Also, when looking up a full hash, the interval that we find is a subinterval of the interval for the partial hash. We currently look up the subinterval first and then the larger interval encompassing it, which requires that we do the full lookup again. I tried to swap this around, that is, to always look up the partial hash first.

I like this idea(!) and also find it conceptually cleaner. However, I don't think drosophila is a good test for this as most partial seeds may already be unique(?). I would benchmark this on CHM13.

I would prefer this lookup order if it weren't slower than the current one.

I have also tried to vary the MAX_LINEAR_SEARCH parameter (currently set to 4). Interestingly, it makes no measurable difference in total runtime whether I set it to 0 or 10000 (at least on Drosophila).

Makes sense to try! But I would also benchmark this on CHM13.

@marcelm (Collaborator, Author) commented Oct 7, 2024

Good point about trying this on CHM13, will do so.

@marcelm (Collaborator, Author) commented Oct 7, 2024

Regarding varying MAX_LINEAR_SEARCH: I tried it on CHM13 and set it to various powers of two. There’s no discernible difference in runtime when set to 0, 4 or 256. It gets a little bit (1%) slower at 512 and then it gets worse with higher values (16% at 16384). So I think we’re fine with leaving it at 4.

@marcelm (Collaborator, Author) commented Oct 8, 2024

Documenting another failed attempt at improving speed: In commit 6a5eca7 I tried to make it so that the partial_queried vector is not re-created every time the find_nams function is run. Instead, there’s just one vector that gets cleared. This avoids the memory allocation (which I thought was somewhat expensive because the call to reserve() takes some time). But alas, there’s no difference in speed.
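
The pattern attempted there is roughly the following (a sketch; the actual change in 6a5eca7 and the surrounding types may differ):

    #include <cstdint>
    #include <vector>

    struct NamFinder {
        // Reused across calls: clear() sets the size to 0 but keeps the
        // allocated capacity, so after the first read no further heap
        // allocation is needed for this vector.
        std::vector<uint64_t> partial_queried;

        void find_nams(/* const Read& read, const StrobemerIndex& index, ... */) {
            partial_queried.clear();
            // ... look up randstrobes and record partial hashes as before ...
        }
    };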
