Add hashset (new hash table) and use in command lookup #1186
base: unstable
Conversation
Signed-off-by: Viktor Söderqvist <[email protected]>
Codecov Report. Attention: Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## unstable #1186 +/- ##
============================================
+ Coverage 70.65% 70.77% +0.12%
============================================
Files 114 115 +1
Lines 61799 64714 +2915
============================================
+ Hits 43664 45804 +2140
- Misses 18135 18910 +775
For command lookup, since the command table is rarely updated, why not build an efficient trie? Or would it be even slower?
Just a partial review (everything but hashset.c, which I guess is the interesting stuff) and tests. Blocked time off tomorrow to read through :) Exciting stuff!
/* --- Global variables --- */

static uint8_t hash_function_seed[16];
static hashsetResizePolicy resize_policy = HASHSET_RESIZE_ALLOW;
Can we use type callbacks or values instead of the globals? We discussed changing this in Redis to make it more modular, but never got around to it. This allows much finer control of the policies as well.
Yes, we could. It needs some design work. I wanted to avoid getting side-tracked too much, just focusing on replacing dict as the first step.
Follow-up.
Yeah, followup is always fine.
.keyCompare = hashsetStringKeyCaseCompare,
.instant_rehashing = 1};

/* Command set, hashed by char* string, stores serverCommand structs. */
Suggested change:
/* Command set, hashed by char* string, stores serverCommand structs. */
/* Sub-command set, hashed by char* string, stores serverCommand structs. */
Is there a reason these need to be different? It seems like we could use the declared name for both.
@SoftlyRaining knows this better.
I suppose it's about renamed commands or something, where the declared name and the real name are not the same.
This changes the type of command tables from dict to hashset. Command table lookup takes ~3% of overall CPU time in benchmarks, so it is a good candidate for optimization. My initial SET benchmark comparison suggests that hashset is about 4.5 times faster than dict, and this replacement reduced overall CPU time by 2.79% 🥳

---------

Signed-off-by: Rain Valentine <[email protected]>
Signed-off-by: Rain Valentine <[email protected]>
Co-authored-by: Rain Valentine <[email protected]>
@xingbowang For the command table, yes, a perfect hash function would be even better. The main point of this implementation is to use it for the database keys though, which will come in another PR. (A trie is possibly slower because there might be more memory accesses.)
excellent work!
To be honest, I didn't like open-addressing hash tables before, because the tombstones generated by deletions have a very negative impact on both performance and space. I felt they were not suitable for online services with a large number of deletion operations.
However, this design is quite clever: deleting an element does not create a tombstone. Only a bucket that has been full at some point is considered to have a tombstone, which greatly reduces the impact of deletion operations.
A bucket's length is 7, so in terms of chained hashing, a tombstone effect would only occur when there are 7 collisions, which intuitively seems to be a relatively rare situation.
In theory, this should save memory. Compared to the previous chained hashing, it saves two pointers (for the previous and next dictEntry) and adds one hash value, saving about 15 bytes per element. However, the frequency of resizing might be higher, as resizing is triggered not only by the fill factor but also by the proportion of tombstones. Do we have reliable tests for space and performance?
… rehashed buckets can be skipped (#1147)

Fixes 2 bugs in hashsetNext():
- null dereference when iterating over a new unused hashset
- change order of iteration so skipping past already-rehashed buckets works correctly and won't miss elements

Minor optimization in hashsetScan and some renamed variables.

Signed-off-by: Rain Valentine <[email protected]>
@soloestoy I'm glad you like the design. In my WIP branch for using this for keys, I get 10-15% better performance for GET compared to unstable.

I set the fill factor to 77% currently, and with 7/8 of the space used for pointers (1/8 in each bucket is metadata bits), it means that 67% of the allocation is pointers.

Rehashing triggered by the proportion of tombstones is a corner case. I didn't implement it at first, but I realized in the defrag tests, the test code tried to create a lot of fragmentation: it added many keys and eviction deleted many keys. In this case, the table became full of tombstones. It does not happen very often and I think it will not affect the performance.

There is a potential problem with CoW during fork though. The dict can avoid rehashing until it reaches a 500% fill factor. This is not possible with open addressing; we have to rehash at some point. I set the hard limit to 90% currently. I hope it will not be a major problem. In the "resize avoid" mode, resizing is allowed but incremental rehashing is paused, so only new keys are added to the new table. This also avoids destroying the CoW.
Co-authored-by: Madelyn Olson <[email protected]> Signed-off-by: Viktor Söderqvist <[email protected]>
Use variable name `s` for hashset instead of `t`. Use variable name `element` instead of `elem`. Remove the hashsetType.userdata field. Some more fixes. Signed-off-by: Viktor Söderqvist <[email protected]>
Signed-off-by: Viktor Söderqvist <[email protected]>
Signed-off-by: Viktor Söderqvist <[email protected]>
Signed-off-by: Viktor Söderqvist <[email protected]>
Like 75% review of hashset.c, and my brain is done with reviewing. Overall looks really solid though.
The one thing that occurred to me while I was reviewing is that instead of using the everfull bit to start probing, we could use it to indicate that the last position is a pointer to another bucket and do linked-list resolution. This does a couple of things:
- Prevents significant chain lengths, since collisions won't cause issues.
- Keeps the overall scanning implementation much more similar to the current implementation.
- We get most of the performance benefit from the cache lines, and the nextCursor() call is likely almost as random as a random DRAM lookup. (Which we'll probably prefetch anyways?)
#if ELEMENTS_PER_BUCKET < 8
#define BUCKET_BITS_TYPE uint8_t
#define BITS_NEEDED_TO_STORE_POS_WITHIN_BUCKET 3
#elif ELEMENTS_PER_BUCKET < 16
#define BUCKET_BITS_TYPE uint16_t
#define BITS_NEEDED_TO_STORE_POS_WITHIN_BUCKET 4
#else
#error "Unexpected value of ELEMENTS_PER_BUCKET"
#endif
Suggested change (match the exact bucket sizes):
#if ELEMENTS_PER_BUCKET == 7
#define BUCKET_BITS_TYPE uint8_t
#define BITS_NEEDED_TO_STORE_POS_WITHIN_BUCKET 3
#elif ELEMENTS_PER_BUCKET == 12
#define BUCKET_BITS_TYPE uint16_t
#define BITS_NEEDED_TO_STORE_POS_WITHIN_BUCKET 4
#else
#error "Unexpected value of ELEMENTS_PER_BUCKET"
#endif
It's currently 12 per bucket in 32-bit mode though. (It'd be possible to use 13 and to store fewer hash bits.) I used < 8 and < 16 here to select an 8-bit or a 16-bit integer type, the number of bits we need here. It seemed more logical than making it depend on the actual bucket sizes we have.
I just found it confusing since it wasn't the exact number.
It's the exact numbers 8 and 16, as in uint8_t and uint16_t. 😁
I thought it better to decouple it as much as possible, but we could put these definitions within the main big #if 64-bit block if you prefer.
        if (table_index) *table_index = table;
        return b;
    }
    bucket_idx = nextCursor(bucket_idx, mask);
Note for Madelyn she should update before posting:
- Are there any ways for this to become completely full so that this can enter an infinite loop?
Yes! At least there was until I added rehashing when it gets too full. There's a unit test case for that.
Well, scan got into an infinite loop, but it's a very similar condition.
Yeah, I forgot to delete this comment (like I said I should), but I commented on the insertion path, which doesn't seem to have a guard.
Should we have some assertion or breakout condition here, just in case?
src/hashset.c (outdated)
/* Bucket done. Advance to the next bucket in probing order, to cover
 * complete probing chains. Other alternatives are (1) just rehashIdx++ or
 * (2) in reverse scan order and clear the tombstones while doing so.
 * (Alternative is to do rehashIdx++.) */
I guess I'll comment here. Why have probing order be the same as cursor order?
Because in scan we need to follow probing chains and not return until we reach the end of a probing chain. This prevents the possibility that we miss an element if rehashing happens between two calls to scan.
Maybe that should be documented here then? The comment now doesn't make a lot of sense, since the other two options don't work.
For your statement to be true, don't we also need to follow the probing chains until the end during rehashing as well? Items can be moved to early positions in the chain during rehashing if we are growing the table and items in the current table were placed in buckets later in the chain.
Scan needs to guarantee that rehash can't move something between scan calls into scan's already-covered area such that scan will miss it. I did rant at you today in person about this, so I won't rehash it here 😜
Yeah, we can formulate this comment better. I agree the "other alternatives" are not real alternatives if they don't work. The rehash step also doesn't cover complete probe chains in one call; scan does that though.
This comment needs to explain why we need to do this. I don't remember the reasoning right now.
TBH, I'm not completely sure we do, though. When we scan at cursor X, we follow probing chains in both tables to cover all elements that hash to the index that corresponds to cursor X. This includes elements located at X+1 or later (due to probing), regardless of which of the two tables they are currently in. This seems to imply that the rehashing order doesn't matter.
What am I missing?
@SoftlyRaining you have a test case where you rigged a race between rehashing and scan. Let's add it and see if it breaks?
I remember now: we rehash in cursor order to be able to skip already rehashed buckets in scan.
/* Encode bucket_index, pos_in_bucket, table_index into an opaque pointer. */
static void *encodePositionInTable(size_t bucket_index, int pos_in_bucket, int table_index) {
    uintptr_t encoded = bucket_index;
    encoded <<= BITS_NEEDED_TO_STORE_POS_WITHIN_BUCKET;
    encoded |= pos_in_bucket;
    encoded <<= 1;
    encoded |= table_index;
    encoded++; /* Add one to make sure we don't return NULL. */
    return (void *)encoded;
}
This isn't good for very large tables in 32-bit mode; we don't have enough bits. It could return a 64-bit integer though.
I've considered just returning a pointer to the element's location in the bucket. It's possible to decode that too, back to the bucket index and the other indexes.
For what it's worth, I can make nextCursor() use sequential, in-memory order for rehashing, scanning, and the iterator if we use the most significant bits of the hash instead of the least significant bits for the bucket index. The math for iterating the two different-sized tables in parallel for rehash and scan would be a little different but not bad: basically bit shifting instead of masking. At any rate, I plan to add prefetch hints; I'm not sure how much that would eliminate the benefit of sequential memory access.
@madolson This is a great idea. It reminds me of "bulk chaining" in the SegCache paper. I think it can be done without changing the API, so it can be done at any point. I believe it has many benefits, like we can postpone rehashing almost indefinitely like the dict does when there's a fork. We can also prevent scan from returning duplicates (when it wraps around zero following a probe chain).
@SoftlyRaining This sounds interesting. Memory is fetched per cache line, so I'm wondering if there are benefits to accessing it sequentially compared to the current reverse-bit increment. Systems with larger cache lines, sure, but that's only the MacBook so far? Does predictive memory fetching predict that cache lines are accessed sequentially?
Signed-off-by: Viktor Söderqvist <[email protected]>
Signed-off-by: Viktor Söderqvist <[email protected]>
Signed-off-by: Viktor Söderqvist <[email protected]>
This is the first in a series of PRs about using a new hash table.
The "hashset" is a cache-line optimized open addressing hash table implemented as outlined in #169. It supports incremental rehashing, scan, random key, etc., just like the dict, but faster and using less memory. For details, see the comments in src/hashset.{c,h}. The plan (WIP) is to use it for keys, expires and many other things, but this first PR just contains these two:
Fixes #991