
fix: don't lock in-memory shares storage for database ops #1778

Open · wants to merge 4 commits into base: stage from fix/lockless-share-storage-2
Conversation

@anatolie-ssv commented Oct 7, 2024

The approach taken here is to minimize the time spent holding the lock on the shares map.
By locking the Badger instance separately, we allow access to the in-memory shares while the "heavy lifting" disk access is being executed.
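The two-lock idea can be sketched roughly like this (a minimal illustration with hypothetical names and a map of ints, not the PR's actual code):

```go
package main

import (
	"errors"
	"sync"
)

// errDB stands in for a real Badger write failure in this sketch.
var errDB = errors.New("database write failed")

// sharesSketch illustrates the two-lock idea: dbMu serializes the slow
// database write, while mu guards only the short in-memory map update,
// so readers of the map are not blocked while the disk write is in flight.
type sharesSketch struct {
	dbMu   sync.Mutex   // held only across the database write
	mu     sync.RWMutex // held only across the map update
	shares map[string]int
}

func (s *sharesSketch) Save(key string, val int, writeDB func() error) error {
	// Heavy lifting: the disk write happens under its own lock.
	s.dbMu.Lock()
	err := writeDB()
	s.dbMu.Unlock()
	if err != nil {
		return err
	}

	// Short critical section: update the map only after the write succeeded.
	s.mu.Lock()
	s.shares[key] = val
	s.mu.Unlock()
	return nil
}

func (s *sharesSketch) Get(key string) (int, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.shares[key]
	return v, ok
}
```

Note that because the DB write happens before the map update, a failed write never leaves a phantom entry in memory; the tradeoff is that `Get` can briefly lag behind a concurrent `Save`.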

@anatolie-ssv self-assigned this Oct 7, 2024
@anatolie-ssv marked this pull request as draft October 7, 2024 14:34
@anatolie-ssv changed the title from "Fix: don't lock in-memory shares storage for database ops" to "fix: don't lock in-memory shares storage for database ops" Oct 7, 2024
@anatolie-ssv (Author) commented:
@moshe-blox I'm looking at ways to move the interaction with the database into a separate goroutine.
Right now the problem I'm facing is the inability to propagate a potential database error from the goroutine back to the method that updates the in-memory shares map. So if there is an error from Badger, we would find out about it only after we've already made changes to the shares map. I'm not sure how to get around this issue.

@iurii-ssv commented Oct 9, 2024

> @moshe-blox I'm looking at ways to move the interaction with the database in a separate goroutine. Right now the problem that I'm facing is the inability to propagate the potential database error from the goroutine to the method updating the in-memory shares map. So if there is an error from badger db we would find out about it only after we've made changes to the shares map. I'm not really sure how to get around this issue.

One pattern I've seen for this is:

  • instead of returning an error (from funcs like Save, for example), you return a chan error
  • the Save implementation passes this chan error to some background worker goroutine, which will either close it (important) or send an error through it
  • the caller of Save now has to receive from the channel if they intend to handle the potential error (which might eventually arrive on this chan error); they can also handle it by spinning up a goroutine of their own that listens on the chan error and, say, logs the result
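A rough sketch of that pattern (illustrative names, not the project's actual API; the worker closes the channel in every case so a receive never blocks forever):

```go
package main

import "errors"

// errDisk stands in for a real Badger write failure in this sketch.
var errDisk = errors.New("disk full")

// saveJob carries one write request plus the channel on which the
// background worker reports the outcome.
type saveJob struct {
	key  string
	done chan error
}

// startWorker consumes jobs and performs the (potentially slow) DB write.
// On failure it sends the error first; either way it closes done, so a
// receive on done yields nil on success.
func startWorker(jobs <-chan saveJob, writeDB func(key string) error) {
	go func() {
		for j := range jobs {
			if err := writeDB(j.key); err != nil {
				j.done <- err
			}
			close(j.done) // important: closing signals completion
		}
	}()
}

// save returns a chan error instead of an error; the caller decides
// whether to block on it or handle it in a goroutine of their own.
func save(jobs chan<- saveJob, key string) <-chan error {
	done := make(chan error, 1) // buffered: the worker never blocks sending
	jobs <- saveJob{key: key, done: done}
	return done
}
```

A caller who wants synchronous behavior simply writes `err := <-save(jobs, key)`, which is essentially the synchronous wrapper idea described in this comment.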

It does complicate the sharesStorage API and its usage quite a bit (although a synchronous API can be built as a wrapper on top of the asynchronous one, for callers who don't want to deal with a chan error and would rather just wait for the error). The question is whether there are real callers that would benefit from this.

> So if there is an error from badger db we would find out about it only after we've made changes to the shares map.

The Save implementation itself can spin up a goroutine that listens on the chan error (a construction similar to the one above) and, upon receiving an error from the background DB goroutine, handle it however is considered meaningful (e.g. remove the entry from the in-memory map) before relaying the error to the original caller of Save (who can maybe retry the Save call after a while).

It doesn't really change the underlying assumption this PR introduces, which is that consistency between the in-memory data and the DB data becomes eventual. Meaning, you can call Save, then another caller might call Range in parallel and apply some fn to the newly saved object, and then Save might actually fail to store the data in the DB; in the end the Range caller observed an object that was never meant to exist. Unless the implementation is arranged (which is not always possible, is tricky to implement correctly, and probably warrants comments in the code calling it out) so that for every update operation we update the DB first and only then update the in-memory state, ensuring every read operation can only observe data after it has already been stored in the DB.
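The rollback variant could look roughly like this (a hypothetical sketch, not the PR's code): the map is updated first, and the update is undone if the background DB write later reports an error. A closed dbErrCh stands in for "the DB write succeeded".

```go
package main

import (
	"errors"
	"sync"
)

// errWrite stands in for a Badger write error in this sketch.
var errWrite = errors.New("db write failed")

type memStore struct {
	mu     sync.Mutex
	shares map[string]int
}

// SaveOptimistic updates the in-memory map immediately, then waits (in a
// goroutine) for the background DB result on dbErrCh. A closed dbErrCh
// means success; an error triggers a rollback of the in-memory change
// before the error is relayed to the caller.
func (s *memStore) SaveOptimistic(key string, val int, dbErrCh <-chan error) <-chan error {
	s.mu.Lock()
	s.shares[key] = val
	s.mu.Unlock()

	out := make(chan error, 1)
	go func() {
		defer close(out)
		if err := <-dbErrCh; err != nil {
			// Roll back so readers stop observing an object that
			// was never actually persisted.
			s.mu.Lock()
			delete(s.shares, key)
			s.mu.Unlock()
			out <- err
		}
	}()
	return out
}

func (s *memStore) has(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.shares[key]
	return ok
}
```

The window between the map update and the rollback is exactly the eventual-consistency caveat discussed above: a concurrent reader can briefly observe an entry that will later be rolled back.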

@moshe-blox (Contributor) commented Oct 10, 2024

@anatolie-ssv oh that's right, during event sync we can't ignore DB errors, i forgot!

@iurii-ssv returning <-chan error is an interesting idea to simulate async/await behavior. That said, i like Go's approach of blocking by default and delegating async to the caller, who can simply spawn a goroutine if they wish: it keeps the API trivial, as you noted, and lets the caller avoid unnecessary channels when they do need to block.

i think going for a pure solution might be too complex, because what does it mean if a database save fails? should we undo the changes we've made in the map?

maybe we should target the specific issue we have actually witnessed causing a slowdown in the real world: updating thousands of share metadata entries completely locks the storage for many seconds at a time

i'm not aware of other points of pressure around shares than that (so far!)

WDYT?

@iurii-ssv commented Oct 10, 2024

> i'm not aware of other points of pressure around shares than that (so far!)

Yeah, that's what I was getting at. If the performance of the fully blocking implementation is sufficient, there is no need to go for the harder solution.

> with that said i like Go's approach of blocking by default and delegating async to the caller, who can simply spawn a goroutine if they wish, because it keeps the API trivial as you noted, and allows the caller to avoid unnecessary channels when they do need to block.

Yup, the simpler the better. I was just describing a potential solution (with its tradeoffs) in case we need to go down this path sometime in the future, because @anatolie-ssv mentioned he might want to consider it.

@moshe-blox (Contributor) commented Oct 10, 2024

@iurii-ssv i've been discussing with @anatolie-ssv, and indeed we're throwing away my original proposal; instead we're expanding on his initial implementation with a few improvements (such as promoting the db mutex to a parent lock of the in-memory mutex), which he's still playing around with to confirm it makes sense

hopefully @anatolie-ssv can fill you in on the details or present it once it's ready :)

@anatolie-ssv force-pushed the fix/lockless-share-storage-2 branch 2 times, most recently from 5ad0a20 to 688454c on October 11, 2024 14:41
@anatolie-ssv marked this pull request as ready for review October 13, 2024 04:46
@anatolie-ssv force-pushed the fix/lockless-share-storage-2 branch 3 times, most recently from 161c1f3 to ef25bf0 on October 14, 2024 10:16
@iurii-ssv left a comment:

LGTM, but it seems we lock dbmu for longer than necessary (I assume its only goal is to manage access to sharesStorage.db).

Comment on lines 61 to 62
dbmu sync.Mutex // parent lock for in-memory mutex
mu sync.RWMutex


Perhaps more detailed descriptions wouldn't hurt (because it's not quite clear what "parent" really means here):

	// dbMtx is used to synchronize access to db
	dbMtx sync.Mutex
	// sMtx is used to synchronize access to shares
	sMtx  sync.RWMutex

@moshe-blox (Contributor) commented Oct 15, 2024

i'd suggest renaming to something like storageMtx and memoryMtx because it explains the fundamental difference between the two

@anatolie-ssv force-pushed the fix/lockless-share-storage-2 branch 6 times, most recently from d53f013 to 3257ec3 on October 17, 2024 09:57
Comment on lines +238 to +244
if err := s.validatorStore.handleSharesUpdated(updateShares...); err != nil {
return err
}

if err := s.validatorStore.handleSharesAdded(addShares...); err != nil {
return err
}


Would be nice to wrap the errors here (since they come from different code branches, it might be hard to debug without it):

		if err := s.validatorStore.handleSharesUpdated(updateShares...); err != nil {
			return fmt.Errorf("handleSharesUpdated: %w", err)
		}

		if err := s.validatorStore.handleSharesAdded(addShares...); err != nil {
			return fmt.Errorf("handleSharesAdded: %w", err)
		}

@@ -210,48 +210,57 @@ func (s *sharesStorage) Save(rw basedb.ReadWriter, shares ...*types.SSVShare) er
}
}

s.mu.Lock()
defer s.mu.Unlock()
s.logger.Debug("save validators to shares storage", zap.Int("count", len(shares)))


nit: save validators to shares storage -> save validator shares to shares storage

3 participants