[hashset feature] Convert SET datatype to use hashset instead of dict #1176

Open: wants to merge 2 commits into base branch hashset

SoftlyRaining

A fairly straightforward conversion, though I had to do a lot of debugging along the way. It requires a few fixes in hashset.c to pass all tests. This PR contains minimal versions of those fixes; my earlier PR (#1147) has better ones.

codecov bot commented Oct 16, 2024

Codecov Report

Attention: Patch coverage is 86.27451% with 28 lines in your changes missing coverage. Please review.

Project coverage is 70.58%. Comparing base (8fe59b3) to head (add04d0).

Files with missing lines   Patch %   Lines
src/debug.c                0.00%     14 Missing ⚠️
src/defrag.c               72.72%    6 Missing ⚠️
src/rdb.c                  83.33%    3 Missing ⚠️
src/db.c                   92.00%    2 Missing ⚠️
src/t_set.c                97.05%    2 Missing ⚠️
src/hashset.c              93.33%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           hashset    #1176      +/-   ##
===========================================
+ Coverage    70.40%   70.58%   +0.17%     
===========================================
  Files          115      115              
  Lines        62480    63812    +1332     
===========================================
+ Hits         43989    45039    +1050     
- Misses       18491    18773     +282     
Files with missing lines   Coverage Δ
src/object.c               80.75% <100.00%> (+1.56%) ⬆️
src/server.c               88.86% <100.00%> (+0.14%) ⬆️
src/server.h               100.00% <ø> (ø)
src/t_zset.c               95.64% <100.00%> (+<0.01%) ⬆️
src/hashset.c              67.56% <93.33%> (+27.35%) ⬆️
src/db.c                   88.79% <92.00%> (+0.28%) ⬆️
src/t_set.c                97.49% <97.05%> (-0.34%) ⬇️
src/rdb.c                  75.95% <83.33%> (-0.41%) ⬇️
src/defrag.c               86.02% <72.72%> (-0.90%) ⬇️
src/debug.c                51.80% <0.00%> (-1.92%) ⬇️

... and 84 files with indirect coverage changes

return {[string match {*table size: $table_size*number of elements: $keys*} $htstats]}
}

test "SRANDMEMBER with a dict containing long chain" {
Author

I deleted this test because hashset does not have linked list chains the way that dict does, so the aspect this is attempting to test no longer exists.

Contributor

For the keys and expires using hashset, I updated DEBUG HTSTATS to count probing chain lengths instead of linked list lengths. Maybe it makes sense here too...?

Author

Hmm, I think this should be a hashset UT instead. Assuming that our random sampling doesn't follow probe chains, we should be unaffected, but we want to guard against regressions in the future. My UT would make a hashset with one long chain of similar elements, then ensure those elements aren't under- or over-represented.
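
A rough sketch of the kind of check such a UT could make. It uses a toy linear-probing table, not the real hashset API, and every name in it (toy_table, sample_reject, sample_walk, toy_spread) is hypothetical. Instead of drawing random samples it tries every possible start slot once, so the outcome is deterministic. It compares a sampler that only looks at the start slot against one that walks forward along the probe sequence.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 64
#define EMPTY (-1)

/* Toy open-addressing table (hypothetical, not the real hashset):
 * each slot holds an element id or EMPTY. */
typedef struct {
    int slots[TABLE_SIZE];
} toy_table;

typedef int (*toy_sampler)(const toy_table *t, size_t start);

/* Insert with linear probing, starting at the element's home slot. */
static void toy_insert(toy_table *t, size_t home, int id) {
    size_t i = home % TABLE_SIZE;
    while (t->slots[i] != EMPTY) i = (i + 1) % TABLE_SIZE;
    t->slots[i] = id;
}

/* 16 elements share home slot 0 and form one chain in slots 0..15;
 * four more elements sit alone at slots 20, 30, 40 and 50. */
static void toy_build_long_chain(toy_table *t) {
    for (size_t i = 0; i < TABLE_SIZE; i++) t->slots[i] = EMPTY;
    for (int id = 0; id < 16; id++) toy_insert(t, 0, id);
    for (int k = 0; k < 4; k++) toy_insert(t, 20 + 10 * (size_t)k, 16 + k);
}

/* Sampler A: look only at the start slot; EMPTY means "retry". */
static int sample_reject(const toy_table *t, size_t start) {
    return t->slots[start % TABLE_SIZE];
}

/* Sampler B: walk forward to the next occupied slot.
 * Requires a non-empty table, otherwise it never terminates. */
static int sample_walk(const toy_table *t, size_t start) {
    size_t i = start % TABLE_SIZE;
    while (t->slots[i] == EMPTY) i = (i + 1) % TABLE_SIZE;
    return t->slots[i];
}

/* Try every start slot once and return (max hits - min hits) over ids
 * 0..n_ids-1. A result of 0 means every element is equally likely. */
static int toy_spread(const toy_table *t, toy_sampler sampler, int n_ids) {
    assert(n_ids > 0 && n_ids <= TABLE_SIZE);
    int hits[TABLE_SIZE] = {0};
    for (size_t s = 0; s < TABLE_SIZE; s++) {
        int id = sampler(t, s);
        if (id != EMPTY) hits[id]++;
    }
    int min = hits[0], max = hits[0];
    for (int i = 1; i < n_ids; i++) {
        if (hits[i] < min) min = hits[i];
        if (hits[i] > max) max = hits[i];
    }
    return max - min;
}
```

With this layout, sample_reject gives a spread of 0, while sample_walk gives a spread of 13: the head of the chain is reached from 14 start slots and every other chain element from only one. A real UT would assert that the spread stays within some tolerance for whatever sampling function hashset exposes.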

Contributor

Sounds good. Can you write that UT in another PR towards the hashset branch?

Contributor

@zuiderkwast left a comment

Partial review. I'll look more later.

@@ -71,6 +71,7 @@
* addressing scheme, including the use of linear probing by scan cursor
* increment, by Viktor Söderqvist. */
#include "hashset.h"
#include "server.h"
Contributor

I tried hard to avoid having hashset depend on the whole valkey server.h. It's better to have dependencies in only one direction and allow hashset to be a relatively independent component only depending on low level stuff like zmalloc.h.

You added this for dismissMemory, right?

I think we can move the logic of dismissMemory from object.c to zmalloc.c. dismissMemory basically just calls zmadvise_dontneed, which already knows the page size without using server.page_size. zmadvise_dontneed doesn't accept a size parameter, but we can change that, since it's only called from dismissMemory: add a size parameter and make it do everything dismissMemory does. In server.h we can then define dismissMemory as an alias of zmadvise_dontneed.

/* server.h */
#define dismissMemory zmadvise_dontneed

/* zmalloc.c */
void zmadvise_dontneed(void *ptr, size_t size_hint) {
    /* Code moved from dismissMemory */
    ...
    /* Code that was already in zmadvise_dontneed since before */
    ...
}
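
To illustrate the page arithmetic a merged function like this would need, here is a sketch of a pure helper that only computes which whole pages of an allocation could be released. It is not valkey code: dismissable_range and its parameters are hypothetical names, and the small-allocation cutoff of half a page is an assumption. The real function would get the allocation size from the allocator and pass the resulting range to madvise with MADV_DONTNEED.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: given an allocation at addr with real_size usable
 * bytes, compute the range of whole pages inside it.
 * Returns 1 and fills *start and *len if at least one full page is covered,
 * otherwise returns 0 and leaves the outputs untouched. */
static int dismissable_range(uintptr_t addr, size_t real_size, size_t size_hint,
                             size_t page_size, uintptr_t *start, size_t *len) {
    /* The masking below only works for power-of-two page sizes. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    size_t mask = page_size - 1;

    /* Assumed cutoff: a small size hint means no whole page can be freed. */
    if (size_hint && size_hint <= page_size / 2) return 0;
    if (real_size < page_size) return 0;

    /* Round the start address up to the next page boundary. */
    uintptr_t aligned = (addr + mask) & ~(uintptr_t)mask;
    size_t usable = real_size - (size_t)(aligned - addr);

    /* Round the remaining length down to a whole number of pages. */
    usable &= ~mask;
    if (usable == 0) return 0;

    *start = aligned;
    *len = usable;
    return 1;
}
```

Keeping the arithmetic in a function with no dependency on server.h or on global state matches the goal of letting hashset depend only on low-level code.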

This changes the type of command tables from dict to hashset. Command
table lookup takes ~3% of overall CPU time in benchmarks, so it is a
good candidate for optimization.

My initial SET benchmark comparison suggests that hashset is about 4.5
times faster than dict and this replacement reduced overall CPU time by
2.79% 🥳

---------

Signed-off-by: Rain Valentine <[email protected]>
Co-authored-by: Rain Valentine <[email protected]>
@zuiderkwast
Contributor

Sorry for force-pushing the hashset branch again, to fix a DCO issue.

Can you rebase and force-push again? (I guess it's better than merge when we have a DCO issue following us.)
