Libsql-wal: streaming compaction #1762

Merged: 3 commits from streaming-compaction merged into main, Sep 30, 2024

Conversation

MarinPostma
Contributor

This PR implements streaming compaction for libsql-wal.

Motivation

Previously, we used buffer files for segment compaction: compacting a long sequence of segments required a lot of disk storage to hold both the intermediate segments and the resulting segment. With some changes to the compacted segment format, this PR enables streaming compaction of segments.

How?

Initially, compacted segments contained a header with the number of frames in the segment. When compacting, we can't cheaply know in advance how many frames the resulting segment will contain, so we needed a different way to know how many frames there are in a segment. The segment also contained a footer with the checksum, but without the frame count it's impossible to know where to fetch the footer. Instead, we change the frame headers, introducing a CompactedFrameHeader. The compacted frame header drops the size_after field (all frames in a compacted segment are logically committed together) and introduces a checksum field and a flags field with a LAST flag. The LAST flag is set on the last frame in the segment.
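
To make the layout concrete, here is a minimal sketch of what the new frame header could look like; the field names, widths, and ordering are assumptions for illustration, not the actual libsql-wal definitions.

```rust
// Hypothetical sketch of the compacted frame header described above; the
// real libsql-wal layout may differ.

/// Flag set on the last frame of a compacted segment, replacing the old
/// frame-count field in the segment header.
pub const COMPACTED_FRAME_LAST: u32 = 1 << 0;

#[repr(C)]
#[derive(Clone, Copy)]
pub struct CompactedFrameHeader {
    /// Page number stored in this frame.
    pub page_no: u32,
    /// Bit flags; COMPACTED_FRAME_LAST marks the final frame. There is no
    /// size_after field: all frames in a compacted segment commit together.
    pub flags: u32,
    /// Running crc32 over this frame's header + data (excluding this field),
    /// seeded by the previous frame's checksum, or by the segment header's
    /// checksum for the first frame.
    pub checksum: u32,
}
```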

The checksum is computed as the crc32 of the frame header + data (excluding the checksum field itself), seeded by the checksum of the previous frame. The first frame is seeded with the checksum of the segment header.
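
As a sketch of this checksum chaining (using the crc32fast crate; the exact byte ranges hashed may differ from the real implementation):

```rust
use crc32fast::Hasher;

/// Checksum of one compacted frame, chained from the previous frame's
/// checksum (or the segment header checksum for the first frame).
fn frame_checksum(prev_checksum: u32, header_bytes: &[u8], page_data: &[u8]) -> u32 {
    // Seed the crc32 state with the previous checksum, then hash the frame
    // header (with its checksum field excluded) followed by the page data.
    let mut hasher = Hasher::new_with_initial(prev_checksum);
    hasher.update(header_bytes);
    hasher.update(page_data);
    hasher.finalize()
}
```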

The dedup_stream method in the compactor is the meat of this PR. It takes a SegmentSet and returns a deduplicated stream of all the frames for that set. Here's how it works:
Iterating over the segments in the set backwards (most recent segment first), we start downloading the segment indexes (this step is done concurrently). Then, we sequentially iterate over the received indexes and check whether each segment contains any data that we still need. To do this, we maintain a seen_pages bitset with all the pages we have already collected. If any page in the segment index is not in that set, we download the segment data. For every frame in the segment data whose page we haven't seen yet, we stream that frame out. We repeat this process until we either have enough pages (as indicated by size_after) or run out of segments to search.
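A simplified, synchronous sketch of that loop follows (the real dedup_stream is async and streams frames out as it goes; the Segment type and the use of a HashSet instead of a bitset are placeholders for illustration):

```rust
use std::collections::HashSet;

/// Placeholder segment representation: the downloaded index (page numbers
/// present in the segment) plus the frames themselves.
struct Segment {
    index: Vec<u32>,
    frames: Vec<(u32, Vec<u8>)>, // (page_no, page data)
}

/// Walks segments from most recent to oldest, emitting each page at most once,
/// and stops once `size_after` distinct pages have been collected.
fn dedup(segments: &[Segment], size_after: usize) -> Vec<(u32, Vec<u8>)> {
    let mut seen_pages: HashSet<u32> = HashSet::new();
    let mut out = Vec::new();

    // `segments` is ordered oldest to newest, so iterate backwards.
    for segment in segments.iter().rev() {
        // Skip segments whose index contains no page we still need.
        if segment.index.iter().all(|p| seen_pages.contains(p)) {
            continue;
        }
        // Otherwise "download" the segment data and stream out every frame
        // whose page we haven't seen yet.
        for (page_no, data) in &segment.frames {
            if seen_pages.insert(*page_no) {
                out.push((*page_no, data.clone()));
            }
        }
        if seen_pages.len() >= size_after {
            break;
        }
    }

    out
}
```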

@MarinPostma added this pull request to the merge queue Sep 30, 2024
Merged via the queue into main with commit 8abff7b Sep 30, 2024
18 checks passed
@MarinPostma deleted the streaming-compaction branch September 30, 2024 12:08