[hashset feature] Iterator bugfixes - handle empty hashset and change iteration order so rehashed buckets can be skipped #1147
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff            @@
##           hashset    #1147   +/-   ##
===========================================
- Coverage    70.40%   70.33%   -0.07%
===========================================
  Files          115      115
  Lines        63780    63782       +2
===========================================
- Hits         44907    44864      -43
- Misses       18873    18918      +45
```
Didn't review the tests yet.
src/hashset.c (outdated)
```c
if ((growing && !behind_rehash) || (!growing && !ahead_rehash)) {
    /* Emit elements in smaller table */
```
How about some more explicit comment to explain this condition?
Suggested change:

```c
/* Emit elements in smaller table, except if this table is the
 * rehashing source table and the current index has already been
 * rehashed, or the rehashing destination table and the current
 * index has not yet been populated by rehashing. */
if ((growing && !behind_rehash) || (!growing && !ahead_rehash)) {
```
Can't it happen that we're shrinking and we're behind rehashing, but rehash put some elements before or at the current index?
If the big (rehashing source) table had a lot of empty buckets with 'everfull' set, I guess it can happen, because rehash doesn't follow probing chains like scan does.
If we make rehashing follow probing chains, are we safe here?
Being able to always skip one of the tables is a very good thing, I believe, especially in things like the expiry scanning. Eviction does sampling which is using scan too. Therefore, to make sure it's safe to skip in the destination table, I think we need to make rehash follow probing chains. Then we can ensure that a key is not rehashed from the not-yet-scanned part of the source table to the already-scanned part of the destination table. Do you want to do this change to rehashStep in this PR?
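To make the idea concrete, here is a minimal toy sketch, not the actual hashset implementation, of what "rehash follows probe chains" could mean. All names (`bucket`, `rehash_step_follow_chain`) are invented for illustration: a rehash step keeps consuming source buckets until it ends on a bucket whose `everfull` flag is clear, so no element whose probe chain starts at or before the old rehash index can remain in the source table.

```c
#include <stdbool.h>

/* Hypothetical model of a hash table bucket; everfull marks buckets
 * that were full at some point, i.e. a probe chain may pass through. */
typedef struct {
    bool everfull;
} bucket;

/* Consume the bucket at rehash_idx and every bucket reachable by the
 * probe chain starting there (the chain continues while the previous
 * bucket has everfull set). Returns the advanced rehash index. */
static int rehash_step_follow_chain(bucket *table, int num_buckets, int rehash_idx) {
    do {
        /* (real code would move this bucket's elements to the new table) */
        rehash_idx++;
    } while (rehash_idx < num_buckets && table[rehash_idx - 1].everfull);
    return rehash_idx;
}
```

With this shape, a single step can cover several buckets when a chain is present, which is exactly the "too large and random" cost discussed below.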
Names like `table_small` are better than `table0`. I'm not sure that all the extra variables actually make it simpler though. Each variable adds a mental indirection, so if the names are not extremely clear and if they are just a trivial abstraction, then it might not be worth it.

I'm thinking that maybe instead of `growing` we can just use `table_small == 0`.
The names `behind_rehash` and `ahead_rehash` confuse me a little, because each can mean two things: is the rehash ahead of the cursor, or is the cursor ahead of the rehash? I think `already_rehashed` and `not_yet_rehashed` are better names, and I think it's enough to have one of them: `already_rehashed = cursorIsLessThan(t->rehashIdx, cursor)`.
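As a sanity check, here is a toy model (names hypothetical, not the hashset's API) of the emit-from-small-table condition collapsed onto a single `already_rehashed` flag, assuming `behind_rehash` and `ahead_rehash` are exact complements at any index:

```c
#include <stdbool.h>

/* When growing, the small table is the rehash source, so skip indices
 * whose elements were already moved out; when shrinking, the small
 * table is the destination, so only emit indices that rehash has
 * already populated. Equivalent to
 *   (growing && !behind_rehash) || (!growing && !ahead_rehash)
 * with behind_rehash == already_rehashed and
 * ahead_rehash == !already_rehashed. */
static bool emit_from_small_table(bool growing, bool already_rehashed) {
    return growing ? !already_rehashed : already_rehashed;
}
```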
We don't need to mask the cursor and `rehashIdx` here, because the cursor should already be masked with the small mask and, if we make rehashing follow probe chains, the `rehashIdx` will already be masked with the small mask in `rehashStep` too. Maybe we don't strictly have to rehash from both locations in the larger table to the same location in the smaller destination table at the same time, but if we do, we can guarantee that a cursor is either rehashed or not rehashed, never half-rehashed. Is this right?
Hmm. Do you mean that rehashStep would continue until there can be no probe chains ahead of its cursor in either table? It would need to track the furthest bucket touched (due to probe chains in destination table) and continue until the rehash cursor has caught up and covered that bucket, otherwise we can't guarantee that the destination table is completely empty ahead of the rehash cursor.
I feel like that would make rehash steps too large and random, and we'd lose a lot of the benefit of incremental rehashing. I think it'd be better to not skip the destination bucket when scanning.
Because scan already follows probe chains, I don't think we ever have to worry about rehash moving an unscanned item behind the scan cursor because of its probe chain becoming shorter like you described. If there was an element at the end of a probe chain like that, scan would have either followed the whole potential probe chain and covered the element, or it would be before the earliest bucket it could land in without a probe chain. :)
Interesting. I am almost convinced. I will think about it during the night. Do you want to draw an ASCII picture or something to make it look more obvious?
I agree that skipping in the new table is not worth the effort.
I still want to argue that it's safe to skip already rehashed indices in the old table though, because those have been rehashed = moved to the new table. (Sam can't be behind the rehash index in the old table, but he can be behind the (index corresponding to the) rehash index in the new table.)
Regarding skipping not-yet-rehashed indices in the new table, I agree it's possible but not really worth the complexity.
My take is that it's not as complex as you say though. Assume the new table is twice as large as the old table. Then each index x in the old table corresponds to two indices y1 and y2 in the new table. Note that these two indices are neighbours though. We have `y1 = x & mask_new` and `y2 = nextCursor(y1, mask_new)`. Therefore, any probe chain starting before these indices in the new table needs to involve `y1`. Thus, we only need to check `y1 > (rehashIdx & mask_new) && new_table[y1]->everfull == 0` to know whether we can skip the remaining buckets in the new table in this scan call. (This allows us to skip remaining buckets in the new table when we're following a probe chain in the old table.)
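A small sketch of why y1 and y2 are neighbours in scan order, assuming power-of-two table sizes and the classic reverse-binary cursor increment used by scan (the helper names here are invented, not the hashset's actual ones): for a new table of size 2m with `mask_new = 2m - 1`, incrementing the reversed cursor flips the highest bit first, so `nextCursor(y1)` is `y1 + m`, which is exactly the second new-table index corresponding to old index `x = y1`.

```c
/* Reverse the low bits of v covered by mask (mask = size - 1,
 * size a power of two). */
static unsigned long rev_bits(unsigned long v, unsigned long mask) {
    unsigned long r = 0;
    for (unsigned long m = mask; m; m >>= 1) {
        r = (r << 1) | (v & 1);
        v >>= 1;
    }
    return r;
}

/* The classic scan increment: reverse, add one, reverse back.
 * For v < m and mask = 2m - 1 this yields v + m. */
static unsigned long next_cursor(unsigned long v, unsigned long mask) {
    return rev_bits((rev_bits(v, mask) + 1) & mask, mask);
}
```

For example, with a new table of size 8 (`mask_new = 7`, so m = 4), the scan visits indices in the order 0, 4, 2, 6, 1, 5, 3, 7, and each pair (x, x + 4) is adjacent.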
> I still want to argue that it's safe to skip already rehashed indices in the old table though, because those have been rehashed = moved to the new table. (Sam can't be behind the rehash index in the old table, but he can be behind the (index corresponding to the) rehash index in the new table.)
Maybe I'm misunderstanding you, but I think I disagree. Sam can hash to a bucket behind the current rehash index while simultaneously being stored in a bucket ahead of the rehash index due to probe chaining. Suppose when Sam gets rehashed Sam gets placed directly in the bucket he hashes to in the new table, which is behind the rehash index? Remember that rehashing does not follow probe chains and can stop between where Sam is stored and the bucket Sam hashes to!
I've changed it to match the new table being twice as large as the old table:

```
     Sam (☺) hashes to this bucket in the old table
     |
old: E E ☺ .
new: .......
       |
       RehashIdx
     |
     cursor
```

```
      Sam (☺) hashes here in new table (one of the two new
      |       buckets that correspond to the old bucket he hashed to)
old: E E . .
new: .☺.....
        | |
        | RehashIdx
        |
        cursor
```
In your last picture, Sam's face is moved to the new table. In the new table we're not going to skip him. (Especially if we're not skipping anything in the new table.)
But why not? Sam is behind `cursor` now. Scan assumes it has already covered all buckets behind `cursor`, including the bucket Sam moved to, and won't revisit them.
Sorry, I didn't reconnect to your pictures properly.

> But why not? Sam is behind `cursor` now.

OK, if Sam was rehashed to behind the cursor, then he is not covered in this scan call. Agreed.

But then he must have been covered in the previous scan call. In the old table, he was ahead of the cursor, but scan follows the probe chain and covered him. If he hashes to a bucket behind the cursor but is located in a bucket ahead of the cursor, there must be a probe chain between those two buckets, because probing is what placed him there.
So, in this scan call, we cover him in the old table:

```
     Sam (☺) hashes to this bucket in the old table
     |
old: E E ☺ .
new: .......
       |
       RehashIdx
     |
     cursor
```
The only way we could miss him is that rehashing was called from the scan callback, but we have prevented that by pausing rehashing during the scan call.
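The "pausing rehashing during the scan call" safeguard could be as simple as a counter, sketched here with invented names (the pattern mirrors the `pauserehash` counter used in dict.c, but the actual hashset mechanism may differ):

```c
#include <stdbool.h>

/* Hypothetical sketch: a pause counter that makes incremental rehash
 * steps no-ops while a scan (or safe iterator) is in progress.
 * Pairs of pause/resume calls may nest. */
typedef struct {
    int pause_rehash; /* >0 means rehash steps must not run */
} hashset_model;

static void pause_rehashing(hashset_model *s) { s->pause_rehash++; }
static void resume_rehashing(hashset_model *s) { s->pause_rehash--; }

/* Checked by rehashStep-like code before doing any work. */
static bool can_rehash_step(hashset_model *s) { return s->pause_rehash == 0; }
```

Scan would call `pause_rehashing` before invoking the user callback and `resume_rehashing` after, so a callback that mutates the set cannot trigger a rehash step mid-scan.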
Cool test case, rigging a race between scan and rehash.
I want to merge the fix in the iterator and the tests.

We don't seem to agree about what we can skip in scan. Can you revert those changes for now? (We can keep the renamed variables, `table0` to `table_small`, etc.)

Can you also update the PR title and description to match what's actually implemented?
Force-pushed from c3a9d71 to 8f1e2eb.
I've done the rebasing onto the new squashed hashset branch, and I've removed the hashsetScan() stuff we're still discussing. (I did keep a small simple optimization though.)
…rent PR Signed-off-by: Rain Valentine <[email protected]>
Force-pushed from 8f1e2eb to 60d557c.
Fixes 2 bugs in hashsetNext(). Minor optimization in hashsetScan and some renamed variables.