-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add multi-context seeds #388
Conversation
src/randstrobes.cpp
Outdated
static inline digest_hash_t digest_hash(syncmer_hash_t hash1, syncmer_hash_t hash2, size_t digest_size) { | ||
syncmer_hash_t main_hash = std::min(hash1, hash2); | ||
syncmer_hash_t aux_hash = std::max(hash1, hash2); | ||
return ((main_hash >> digest_size) << digest_size) ^ (aux_hash >> (64 - digest_size)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to pass information of which strobe (1 or 2) that was saved as the base value. probably here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this information to the last bit of the hash
src/nam.cpp
Outdated
@@ -14,12 +15,14 @@ inline void add_to_hits_per_ref( | |||
int query_start, | |||
int query_end, | |||
const StrobemerIndex& index, | |||
size_t position | |||
size_t position, | |||
bool is_partial | |||
) { | |||
int min_diff = std::numeric_limits<int>::max(); | |||
for (const auto hash = index.get_hash(position); index.get_hash(position) == hash; ++position) { | |||
int ref_start = index.get_strobe1_position(position); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if using multi-context, we are not sure if it is index.get_strobe1_position(position) or position of the second strobe here. Will yield inconsistent NAMs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
bf8abf0
to
8c3dfe4
Compare
0aea3fc
to
c4a7f61
Compare
# Conflicts: # src/indexparameters.cpp
b89d123
to
77b5933
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This replaces randstrobes with multi-context seeds.
For practical reasons, I’d like to argue that we still have randstrobes, we just compute their hash differently. If we didn’t have randstrobes anymore, we’d have to rename a lot of functions, files and variables. So I see it as "replaces randstrobe hashes with multi-context hashes" or something like that.
Where
hash1
andhash2
are the syncmer hashes, withhash1
being the smaller hash.
Is this so that randstrobe hashes are symmetric as before? That is, would that be different when we switch to asymmetric hashes?
The
64 - digest_size
prefix of the multi-context hash corresponding tohash1
is denoted as the main hash, and thedigest_size
suffix corresponding tohash2
is denoted as auxiliary hash.
This PR introduces quite a bit of terminology that is sometimes overlapping and sometimes used inconsistently, and I wonder whether that could be simplified a bit.
- main hash
- auxiliary hash
- digest, digest size (inconsistency: command-line option is
--digest
, but what is meant is actually "digest length"). What is a digest anyway? - subhash (used in a comment somewhere)
- multi-context hash
- digest hash
- partial hit
I guess that some of this will resolve itself if we switch to asymmetric randstrobe hashes, so it is not super important to fix this at the moment.
src/index.cpp
Outdated
@@ -150,6 +150,8 @@ void StrobemerIndex::populate(float f, unsigned n_threads) { | |||
} | |||
stats.tot_strobemer_count = total_randstrobes; | |||
|
|||
logger.info() << "Digest parameter is : " << parameters.randstrobe.digest; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.info() << "Digest parameter is : " << parameters.randstrobe.digest; | |
logger.info() << "Digest size: " << parameters.randstrobe.digest; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced "digest" with "aux_len" here and in other places.
src/dumpstrobes.cpp
Outdated
@@ -125,7 +125,8 @@ int run_dumpstrobes(int argc, char **argv) { | |||
l_set ? l : IndexParameters::DEFAULT, | |||
u_set ? u : IndexParameters::DEFAULT, | |||
c_set ? c : IndexParameters::DEFAULT, | |||
max_seed_len_set ? max_seed_len : IndexParameters::DEFAULT | |||
max_seed_len_set ? max_seed_len : IndexParameters::DEFAULT, | |||
digest ? digest: IndexParameters::DEFAULT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
digest ? digest: IndexParameters::DEFAULT | |
digest ? digest : IndexParameters::DEFAULT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
src/index.hpp
Outdated
@@ -80,18 +83,66 @@ struct StrobemerIndex { | |||
return end(); | |||
} | |||
|
|||
//Returns the first entry that matches the first strobe subhash (if using multi-context seeds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is using multi-context seeds optional with this PR? If not, then please remove the text in parentheses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not optional, but I wanted to leave the possibility of changing between hashes open at some point. But it does not seem to have any benefits. Fixed the comment.
src/indexparameters.cpp
Outdated
Profile{ 50, 70, 18, -4, -2, 1, 24}, | ||
Profile{ 75, 90, 20, -4, -3, 2, 24}, | ||
Profile{100, 110, 20, -4, -2, 2, 24}, | ||
Profile{125, 135, 20, -4, -1, 4, 24}, | ||
Profile{150, 175, 20, -4, 1, 7, 24}, | ||
Profile{250, 375, 22, -4, 2, 12, 24}, | ||
Profile{400, max, 23, -6, 2, 12, 24}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it’s always 24, it doesn’t need to be part of the profile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
src/nam.cpp
Outdated
if (diff <= min_diff) { | ||
hits_per_ref[index.reference_index(position)].push_back(Hit{query_start, query_end, ref_start, ref_end}); | ||
min_diff = diff; | ||
if (not is_partial) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you notice you were using a digraph?
if (not is_partial) { | |
if (!is_partial) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please split this up into two functions: One for the partial and one for the non-partial case. (A function that has completely different behavior based on a boolean parameter should be split up into two functions.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Splitted the function.
src/randstrobes.hpp
Outdated
@@ -16,6 +16,7 @@ | |||
|
|||
using syncmer_hash_t = uint64_t; | |||
using randstrobe_hash_t = uint64_t; | |||
using digest_hash_t = uint64_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t think this typedef is necessary. digest_hash_t
is only used once as the return type of the digest_hash
function, but everywhere else, it is still called randstrobe_hash_t
. If there was a reason to use a name other than randstrobe_hash_t
, we would rename randstrobe_hash_t
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
src/index.hpp
Outdated
|
||
bool is_filtered(bucket_index_t position) const { | ||
return get_hash(position) == get_hash(position + filter_cutoff); | ||
} | ||
|
||
bool is_partial_filtered(bucket_index_t position) const { | ||
uint shift = parameters.randstrobe.digest; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer explicitly sized types, that is, don’t use uint
, use whichever type the digest
parameter has.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced all uint occurences
src/randstrobes.cpp
Outdated
@@ -49,6 +49,13 @@ static inline randstrobe_hash_t randstrobe_hash(syncmer_hash_t hash1, syncmer_ha | |||
return hash1 + hash2; | |||
} | |||
|
|||
static inline digest_hash_t digest_hash(syncmer_hash_t hash1, syncmer_hash_t hash2, size_t digest_size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding a digest_hash
function, the randstrobe_hash
function should be modified. (Currently, the randstrobe_hash
function is unused.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
src/randstrobes.cpp
Outdated
@@ -220,9 +238,11 @@ QueryRandstrobeVector randstrobes_query(const std::string_view seq, const IndexP | |||
RandstrobeIterator randstrobe_fwd_iter{syncmers, parameters.randstrobe}; | |||
while (randstrobe_fwd_iter.has_next()) { | |||
auto randstrobe = randstrobe_fwd_iter.next(); | |||
uint partial_start = randstrobe.is_first_main ? randstrobe.strobe1_pos : randstrobe.strobe2_pos; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint
src/randstrobes.hpp
Outdated
|
||
private: | ||
SyncmerIterator syncmer_iterator; | ||
const unsigned w_min; | ||
const unsigned w_max; | ||
const uint64_t q; | ||
const unsigned int max_dist; | ||
uint digest; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint
Sure, edited the PR description.
Yes, the only reason for selecting minimal hash is to keep the hash symmetric.
I tried to alleviate that by removing all mentions of "digest" (by which I meant the same thing as the "auxiliary" part of the multi-context hash) and "subhash". |
Is this PR ready for a larger benchmarking? |
Co-authored-by: Marcel Martin <[email protected]>
The randstrobe iterator for queries stops when there is still
This is expected behaviour for our current seeds. For example, However, I think mcs can be boosted further by adding the remaining syncmers 'in the ends of reads' as seeds. This means 2*w_min more seeds for a read (fw and rc ends). I haven't thought carefully about the best way to change it in the code, but perhaps changing the |
I implemented and tested this briefly:
The results only very slightly improved/nearly unchanged on a 'no variation' drosophila genome in SE mapping mode, but they improve substantially for a high variation simulation (see below for read lengths 100 and 150). This also suggest that we may be overfitting our optimization to high quality simulations without much variation.
Needs to be tested on larger genome(s) obv. |
… the -b parameter value
I am a bit worried about this. Doesn’t this fit the pattern that we often see worse variant detection rates for "optimized" mapping parameters? Shall I perhaps run the parameter optimization on data with higher variation? |
Sounds like a good suggestion to me! From what you told me, it should come with relatively little extra computation time since the same index can be used for read sets with different levels of variation w.r.t. the reference genome. Below were the variant rates I used for SIM1-4. But I think SIM4 is the only setting that really tests aligning arouind variants properly. I would maybe even set
|
Sorry if you got the wrong impression, but since I don’t store the index on disk, everything has to be recomputed. I’ll run this on a cluster somewhere to get the results faster this time.
Good, I’ll use that and would like to call it |
Yes, sounds good! |
This PR has been reworked and became #426, which now supersedes it. I’ve read through the comments here and think we have addressed all relevant ones, so I’m closing this PR. |
This replaces randstrobe hashes with multi-context hashes. Multi-context hashes are defined as follows
((hash1 >> digest_size) << digest_size) ^ (hash2 >> (64 - digest_size))
Where
hash1
andhash2
are the syncmer hashes, withhash1
being the smaller hash. The64 - digest_size
prefix of the multi-context hash corresponding tohash1
is denoted as the main hash, and thedigest_size
suffix corresponding tohash2
is denoted as auxiliary hash.The prefix of the size (
64 - digest_size
) corresponding to the smaller syncmer is used for partial search. If a given full 64-bit query hash was not found in the index, the seeds with the same main hash are looked up instead and added as the matches. These partial matches are used for the NAM construction in the same way as the full matches.In order to calculate the reference range of the partial match correctly, we need to know which strobe was used as main. This information is stored in the
m_packed
field ofRefRandstrobe
, replacing one of the bits previously reserved for the reference index.List of changes:
find_nams
function if full matches were not found. Partial matches have a larger abundance threshold than full matches due to higher seed repetitiveness.add_to_hits_per_ref
.partial_find
function that looks for the seeds with the same main hash as the query was added to theStrobemerIndex
.StrobemerIndex
.m_packed
field of theRefRandstrobe
is now set iff the first strobe was used as the main part of the multi-context seed.QueryRandstrobe
class.--digest
parameter was added to regulate the auxiliary part of the multi-context hash with the default value of 24.