
Multi-context seeds plus fixes and optimized parameters #426

Open. marcelm wants to merge 29 commits into main from the mcs-optimized-parameters branch.

Conversation

@marcelm (Collaborator) commented May 22, 2024

I realized only now that #388 is incomplete and that @ksahlin’s updates were made in a separate branch. We need a place to review this mcs-optimized-parameters branch and someplace where we have a "Merge" button, so this is a separate PR that supersedes #388.

I’m making quite a few changes to this branch (squashing commits, changing commit messages, etc.). If anyone wants the original branch without my modifications, it is available as mcs-optimized-parameters-backup.

The original parameter optimization was done with the commit that has the description "Fix so that partial rescue hits are added properly". (The commit hash has changed due to history rewriting.)

To Do

  • Changelog entry
  • Document the concept (take from Add multi-context seeds #388)
  • Document the data structures
  • Document terminology: main, partial, auxiliary, ...
  • Profile: Mapping-only mode is somewhat slower for read lengths up to 100 or so
  • Fix tests
    • rescuable.43 is no longer rescuable
  • Get rid of hardcoded const unsigned int aux_len = 24; in find_nams()

@marcelm force-pushed the mcs-optimized-parameters branch 2 times, most recently from f4ec683 to 7ae48f2 on May 22, 2024
Itolstoganov and others added 10 commits on May 22, 2024. Commit messages include:

  • Increases the number of seeds per read by 2*w_min seeds
  • …l hits may benefit from larger syncmers
  • reverting to only change shortest read lengths before starting benchmarks - will wait for proper parameter optimization instead
  • several full seeds can have the same partial base seed (identical query and ref coordinates). Such partial seeds got added to the same NAM and thus increased the score (through incrementing n_hits several times)
  • …we did not sort on seed length, they could still be added if there was a full hit with the same base hash value. This commit sorts also on seed length and checks whether we have already added a full hit with the same base hash value
@marcelm (Collaborator) left a comment:

Here are a couple of comments, mostly directed at myself.

(Inline review comments on src/arguments.hpp, src/nam.cpp, src/randstrobes.cpp and src/index.hpp; all resolved.)
@marcelm (Collaborator, Author) commented May 23, 2024

Commit 21599d9 "Possibly fix redundant alignment sites for symmetric multi context seeds." also breaks a test. After it, the rescuable.43 read in the tiny test dataset is no longer rescuable. Shortened diff of read 1:

-rescuable.43   83  NC_001422.1  3137  60  120S14=1X4=1X9=1X4=1X4=1X9=1X4=1X4=1X4=1X9=1X19=1X4=1X4=1X4=1X4=1X4=1X4=1X9=1X4=1X9=1X4=1X9=1X4=1X4=1X4=1X      =       2955    -362    (sequence)       (qual)       NM:i:25 AS:i:120        RG:Z:1
+rescuable.43   69  NC_001422.1  2955  0  *  =  2955  0       (sequence)  (qual)  RG:Z:1

@marcelm (Collaborator, Author) commented Sep 13, 2024

I’m not sure where, but @ksahlin has some measurements that show that the multi-context seeds branch is slower than main. I did some profiling to find out why. In short, the simplest explanation seems to be that this branch generates more hits.

I used the Linux perf tool for profiling, which has a perf diff subcommand that can compare two profiles. Here is how the first couple of lines in its output look:

              +20.54%  strobealign-mcs       [.] find_nams
              +13.15%  strobealign-mcs       [.] SyncmerIterator::next
              +12.84%  strobealign-mcs       [.] (anonymous namespace)::merge_hits_into_nams
               +6.38%  strobealign-mcs       [.] sw_sse2_byte

That is, the multi-context seeds version from this PR spends 21% more time in the find_nams function, 13% more in the SyncmerIterator::next function etc.

The above was done with read length 150. I am going to check 50 bp as well. (Since one idea was to disable mcs at read lengths ~100bp and above.)

I then changed strobealign to keep track of a couple more statistics, in particular the number of hits and NAMs generated per read. This is the output.

Before (3a97f6b):

Number of reads: 1000000
Number of non-rescue hits: 25160733 total. Per read: 25.16
Number of rescue hits: 583028 total. Per rescue attempt: 23.07
Number of non-rescue NAMs: 27321159 total. Per read: 27.32
Number of rescue NAMs: 11457402 total. Per rescue attempt: 453.27
Total mapping sites tried: 1883792
Total calls to ssw: 184179
Inconsistent NAM ends: 1154
Tried NAM rescue: 25277

After (this PR):

Number of reads: 1000000
Number of non-rescue hits: 38371991 total. Per read: 38.37
Number of rescue hits: 843076 total. Per rescue attempt: 30.16
Number of non-rescue NAMs: 39108199 total. Per read: 39.11
Number of rescue NAMs: 14522183 total. Per rescue attempt: 519.54
Total mapping sites tried: 1898603
Total calls to ssw: 193726
Inconsistent NAM ends: 1205
Tried NAM rescue: 27952

So is the explanation that mcs is slower because it generates more hits and NAMs?

@ksahlin (Owner) commented Sep 13, 2024

Pasted below are the plots you're referring to. I was mainly interested in reducing, or at least finding the cause of, the overhead for longer reads (200 bp and above; see figs). This may not be so relevant if we disable mcs for reads over 125 bp or so.

time_plot_SE.pdf

time_plot_PE.pdf

@marcelm (Collaborator, Author) commented Sep 27, 2024

I tested all commits in this branch to see which ones contribute to the slowdown (using the sim3 dataset and drosophila-200 reads, single-end, mapping only). This table shows the commits at which runtimes change:

Commit   Time (s)  Overall slowdown  Description
f59bcb4  12.9      0%                Merge pull request #425 from ksahlin/optional
379a1fc  13.9      8%                (Added mcs, lumped together with earlier commits)
3902f41  14.7      14%               Prevent looking up and, thus, adding the same partial hits twice
caea5c2  15.4      19%               Updated parameters according to optimization in issue #423
d5125b6  16.8      30%               Make RandstrobeIterator/-Generator yield the same results
9c37683  16.5      28%               Ensure unmappable.41 really is unmappable
  • Commit 3902f41 adds a linear search in a vector; this can probably be optimized a bit (using a set).
  • I don’t understand, yet, why d5125b6 slows things down so much. It should only add a couple of strobemers at the end of each contig (syncmer paired with itself). That is, it only affects the reference. There was a problem here that the second hash value was set to zero (see discussion above), but that was fixed in commit 9c37683. However, runtime only goes down a bit with that commit.

@marcelm (Collaborator, Author) commented Sep 27, 2024

Here are also runtimes for disabling multi-context seeds (by just commenting out the appropriate else in find_nams or find_nams_rescue):

  • 13.1s when disabling mcs only in find_nams
  • 13.0s when disabling mcs also in find_nams_rescue

Since runtime in main was 12.9s, this means that all the other changes done in this branch affect runtime very little. (For sim3-drosophila-200, that is.)

@ksahlin (Owner) commented Sep 27, 2024

Great detective work!

I am not sure I have any good input on the above yet. As you mention, the runtime increase from d5125b6 is very strange indeed.

Commit 3902f41 adds a linear search in a vector; this can probably be optimized a bit (using a set).

I wrote that as a TODO in the code, but my first guess was that small vectors (up to 100 elements or so) are faster to search and maintain than small sets. But maybe it's worth a try.

@ksahlin (Owner) commented Sep 30, 2024

We do bool first_strobe_is_main = strobe1.hash < strobe2.hash; in make_randstrobe. Then

    if (hash1 > hash2) {
        std::swap(hash1, hash2);
    }

In the cases where we set Syncmer strobe2 = strobe1; (which happens several times at both the fw and rc ends of each query), this becomes an unnecessary swap.

Also, I don't think we need assert(i <= w_max); and assert(i < syncmers.size()); in the RandstrobeIterator and RandstrobeGenerator::next.

Could one of these be responsible for the slowdown in d5125b6?

EDIT: Oh I see that "Assertion statements compile only if DEBUG is defined", nvm the second comment then.

@ksahlin (Owner) commented Sep 30, 2024

I think my first statement is also not correct. It only results in strobe2 being the main strobe, which is not harmful since they are set to be the same at the start. I will blame my lame attempt on the fact that I am both on vacation and sick -- trying to do something meaningful from a hotel room. I'll let you do the detective work instead :)

@marcelm (Collaborator, Author) commented Sep 30, 2024

Oh no, sorry to hear! Yeah, the assertions are ignored when compiling in Release mode. And I think the swap is very cheap; that should be something like two machine instructions. I’ll investigate further.

@marcelm (Collaborator, Author) commented Sep 30, 2024

One of the remaining to-do items here is to enable multi-context seeds only when the read length is below a certain threshold. I just pushed commit 47f9257, which makes it possible to enable or disable mcs dynamically by setting a boolean parameter, but for the moment, I hard-coded this parameter to false. That is, with this commit, all the other changes in this PR are still enabled, but actually using multi-context seeds at mapping time is disabled.
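
For illustration, here is a minimal sketch of what such a gate could look like. The struct, parameter name and lookup stubs below are hypothetical, not the actual code of commit 47f9257:

    #include <cstdint>

    struct MappingParameters {
        // Hypothetical flag: hard-coded to false for now; later it could be
        // set based on read length (e.g. enable mcs only below ~100-125 bp).
        bool use_mcs = false;
    };

    // Stand-ins for the real index lookups.
    bool full_lookup(uint64_t) { return false; }
    bool partial_lookup(uint64_t) { return true; }

    // A full-hash miss falls back to the partial (base) hash only when
    // multi-context seeds are enabled. With use_mcs == false, seeding
    // behaves like main while all other changes in this PR stay active.
    bool lookup(uint64_t hash, const MappingParameters& params) {
        if (full_lookup(hash)) {
            return true;
        }
        return params.use_mcs && partial_lookup(hash);
    }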

I ran a full evaluation comparing main, multi-context seeds and this most recent commit; please see the most relevant plots here:

We can use this to decide which criteria to use for enabling mcs. But I’m having a hard time making such a decision right now because the slowdown feels relatively large for some datasets. I’ll try to improve the runtime a bit; maybe that makes it easier.

@ksahlin (Owner) commented Oct 1, 2024

Nice!

Would the difference between green and pink mostly be from new seeding parameters?

From your plots it seems favourable to use mcs for read lengths 100 and below. Actually, I think I mentioned around 110-125 somewhere earlier. For 50-100 bp, the mapping accuracy is substantially better at a reasonable (sometimes no) cost in runtime. Drosophila seems most affected by the mapping slowdown for these short read lengths (in relative terms).

Assuming there would be a significant improvement in runtime for mcs (e.g., lowering some of the increases we observed in your table), I'd say we would want to use it up to lengths of 200. In particular, it seems mcs can help mapping-only mode and, thus, the --aemb setting quite a bit.

@marcelm (Collaborator, Author) commented Oct 1, 2024

I’m finding it difficult to work on some aspects of strobealign at the moment because many changes would actually be in conflict with the multi-context seeds PR.

Also, as we saw in the table above, this PR contains multiple things that change strobealign’s behavior. I think it would be good to tease them apart into separate PRs, each much smaller than this one.

I’m mainly thinking:

  1. Change the way randstrobes are hashed (that is, introduce aux_len) (Edit: Done in Switch to multi-context seeds hash function #447)
  2. Use multi-context seeds when looking up randstrobes (Implement multi-context seed lookups (but leave them disabled) #448)
  3. Add syncmers at the ends of reads (Add syncmers at ends of reads #452)
  4. Switch to optimized parameters (Use optimized parameters according to optimization in issue #423 #453)

Edit: All PRs opened as discussed. Added links to the list above.

@marcelm (Collaborator, Author) commented Oct 1, 2024

Would the difference between green and pink mostly be from new seeding parameters?

I was wondering that as well, so I ran an evaluation for this PR but without parameter optimization (mcs lookups disabled, parameters from main). It’s "mcs-nooptim" (the olive lines) in these plots (some read lengths missing so I wouldn’t have to wait that long):

Accuracy is identical to main. Runtime is consistently slightly higher than main. Given the observation in #447 that changing only the hash function actually reduces runtime slightly, I think this is due to the extra syncmers at the ends of reads.

@ksahlin (Owner) commented Oct 1, 2024

Also, as we saw in the table above, this PR contains multiple things that change strobealign’s behavior. I think it would be good to tease them apart into separate PRs, each much smaller than this one.

Very good suggestion! I agree.

Given the observation in #447 that changing only the hash function actually reduces runtime slightly, I think this is due to the extra syncmers at the ends of reads.

Ok, makes sense!

@marcelm (Collaborator, Author) commented Oct 7, 2024

Everything in this PR has now been split out into other, smaller PRs.

Since we’ve been discussing the speed of the mcs implementation here and that discussion is still ongoing, I’ll leave this PR open for the moment. See below for what I tried.

Commit 3902f41 adds a linear search in a vector; this can probably be optimized a bit (using a set).

I wrote that as a TODO in the code, but my first guess was that small vectors (up to 100 elements or so) are faster to search and maintain than small sets. But maybe it's worth a try.

I tried a couple of things to make multi-context seeds faster:

  • Use std::unordered_set and robin_hood::unordered_set for partial_queried: both are actually slower (see the sketch after this list).
  • Do not call partial_queried.reserve(10) (or use different values other than 10): This makes a small difference, which is why I disabled the call to reserve() entirely if multi-context seeds are disabled. If they are enabled, reserving a couple of elements seems to be a bit beneficial.
  • In the query_randstrobes vector, we first get all forward randstrobes and then all reverse randstrobes. I tried to truncate the partial_queried vector when we see the first reverse randstrobe because we do not need to compare them to the forward ones. This does not make a difference.
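
For concreteness, the membership check under discussion is roughly the following pattern (a sketch; the real partial_queried handling differs in details). For the few dozen partial hashes seen per read, a linear scan over a small contiguous vector avoids the hashing and allocations of an unordered_set, which matches the observation above:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Sketch: "have we already queried this partial hash?" For small n,
    // a linear scan over contiguous memory tends to beat a hash set
    // because it avoids hashing, heap allocation and pointer chasing.
    bool already_queried(std::vector<uint64_t>& partial_queried, uint64_t hash) {
        if (std::find(partial_queried.begin(), partial_queried.end(), hash)
                != partial_queried.end()) {
            return true;
        }
        partial_queried.push_back(hash);
        return false;
    }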

Also, when looking up a full hash, the interval that we find is a subinterval of the interval for the partial hash. We currently look up the subinterval first and then the larger interval encompassing it, which requires that we do the full lookup again. I tried to swap this around, that is, to always look up the partial hash first. The advantages should be:

  • If we don’t find the partial hash, we can stop because we won’t find the full hash either.
  • In the majority of the cases, the first randstrobe that matches the partial hash also matches the full hash. Then we don’t need to do a second lookup.
  • If the position we found does not match the full hash, we could just search forward from that position and would not have to re-lookup.
Unfortunately, this wasn’t actually faster. I presume this is because everything is in the cache already when we do the second lookup of the partial hash. I’m not quite ready to give up on this because I find it a bit cleaner conceptually, so I’ll invest a little bit of time to ensure I haven’t overlooked anything.
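
A sketch of the proposed partial-hash-first order, assuming the index entries are sorted by hash so that all entries sharing a partial (masked) hash form one contiguous run and the full-hash entries are a subinterval of it. The 24-bit mask mirrors the hardcoded aux_len = 24 from the to-do list; everything else here is illustrative, not the actual index API:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Keep the high bits that make up the partial (base) hash; the low
    // 24 bits play the role of the auxiliary part (cf. aux_len = 24).
    constexpr uint64_t PARTIAL_MASK = ~uint64_t{0} << 24;

    enum class Hit { None, Partial, Full };

    Hit lookup(const std::vector<uint64_t>& sorted_hashes, uint64_t full_hash) {
        uint64_t partial = full_hash & PARTIAL_MASK;
        auto it = std::lower_bound(sorted_hashes.begin(), sorted_hashes.end(), partial);
        if (it == sorted_hashes.end() || (*it & PARTIAL_MASK) != partial) {
            return Hit::None;  // no partial hit, so no full hit either
        }
        if (*it == full_hash) {
            return Hit::Full;  // common case: first partial match is also full
        }
        // Otherwise scan forward within the partial run for the full hash
        // instead of doing a second binary search.
        for (; it != sorted_hashes.end() && (*it & PARTIAL_MASK) == partial; ++it) {
            if (*it == full_hash) {
                return Hit::Full;
            }
        }
        return Hit::Partial;
    }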

@marcelm (Collaborator, Author) commented Oct 7, 2024

I have also tried to vary the MAX_LINEAR_SEARCH parameter (currently set to 4). Interestingly, it makes no measurable difference in total runtime whether I set it to 0 or 10000 (at least on Drosophila).
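
For context, the kind of cutoff MAX_LINEAR_SEARCH controls looks roughly like this (a sketch; the actual index code differs in details): within a sorted candidate interval, short intervals are scanned linearly and long ones binary-searched.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    constexpr size_t MAX_LINEAR_SEARCH = 4;

    // Sketch: for tiny intervals, a linear scan is at least as fast as a
    // binary search (better branch prediction, one or two cache lines);
    // for large intervals, binary search wins.
    const uint64_t* find_hash(const uint64_t* begin, const uint64_t* end,
                              uint64_t hash) {
        if (static_cast<size_t>(end - begin) <= MAX_LINEAR_SEARCH) {
            for (const uint64_t* p = begin; p != end; ++p) {
                if (*p == hash) return p;
            }
            return end;
        }
        const uint64_t* p = std::lower_bound(begin, end, hash);
        return (p != end && *p == hash) ? p : end;
    }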

@ksahlin (Owner) commented Oct 7, 2024

Also, when looking up a full hash, the interval that we find is a subinterval of the interval for the partial hash. We currently look up the subinterval first and then the larger interval encompassing it, which requires that we do the full lookup again. I tried to swap this around, that is, to always look up the partial hash first.

I like this idea(!) and also find it conceptually cleaner. However, I don't think drosophila is a good test for this as most partial seeds may already be unique(?). I would benchmark this on CHM13.

I would prefer this lookup order if it weren't slower than the current one.

I have also tried to vary the MAX_LINEAR_SEARCH parameter (currently set to 4). Interestingly, it makes no measurable difference in total runtime whether I set it to 0 or 10000 (at least on Drosophila).

Makes sense to try! But I would also benchmark this on CHM13.

@marcelm (Collaborator, Author) commented Oct 7, 2024

Good point about trying this on CHM13, will do so.

@marcelm (Collaborator, Author) commented Oct 7, 2024

Regarding varying MAX_LINEAR_SEARCH: I tried it on CHM13 and set it to various powers of two. There’s no discernible difference in runtime when set to 0, 4 or 256. It gets a little bit (1%) slower at 512 and then it gets worse with higher values (16% at 16384). So I think we’re fine with leaving it at 4.

@marcelm (Collaborator, Author) commented Oct 8, 2024

Documenting another failed attempt at improving speed: In commit 6a5eca7 I tried to make it so that the partial_queried vector is not re-created every time the find_nams function is run. Instead, there’s just one vector that gets cleared. This avoids the memory allocation (which I thought was somewhat expensive because the call to reserve() takes some time). But alas, there’s no difference in speed.
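
The pattern attempted there is roughly the following (a sketch; the actual change in 6a5eca7 and the surrounding types may differ):

    #include <cstdint>
    #include <vector>

    struct NamFinder {
        // Reused across calls: clear() sets the size to 0 but keeps the
        // allocated capacity, so after the first read no further heap
        // allocation is needed for this vector.
        std::vector<uint64_t> partial_queried;

        void find_nams(/* const Read& read, const StrobemerIndex& index, ... */) {
            partial_queried.clear();
            // ... look up randstrobes and record partial hashes as before ...
        }
    };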
