Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parquet output using parquet2 via Rust #1240

Draft
wants to merge 22 commits into
base: main
Choose a base branch
from

Conversation

the80srobot
Copy link
Contributor

@the80srobot the80srobot commented Nov 24, 2023

This patch series adds support for building the parquet2 Rust crate and using it, from C++, to write a parquet file.

@the80srobot the80srobot force-pushed the parquet2 branch 2 times, most recently from 376b54c to ec5cf3d Compare December 1, 2023 10:10
@pmarkowsky pmarkowsky mentioned this pull request Dec 11, 2023
@the80srobot the80srobot force-pushed the parquet2 branch 3 times, most recently from a3657c1 to 2f82a0b Compare December 14, 2023 13:03
@the80srobot the80srobot changed the title WIP: Add parquet output using parquet2 via Rust Add parquet output using parquet2 via Rust Dec 14, 2023
@the80srobot the80srobot marked this pull request as ready for review December 14, 2023 23:30
@the80srobot the80srobot requested a review from a team as a code owner December 14, 2023 23:30
@the80srobot
Copy link
Contributor Author

the80srobot commented Dec 14, 2023

This is ready for review. What's been done so far:

  1. Rust build support
  2. FFI interface and linkage with C++ code using cxx crate
  3. Parquet Table implementation on top of the parquet2 crate
  4. C++ bindings to generate a Parquet table and write it to a file
  5. Unit tests in C++ and Rust and an e2e test using Pandas (via Docker)

The code size when compiled is about 3.5 MiB, which can probably be reduced further with future work. (Most of the .text size is compression libraries, and brotli is one of the larger ones.) This compares favorably to 10-30 MiB for C++ Arrow (depending on how you build it and how unmaintainable you're willing to make things for the sake of shrinking the build).

The build time for the whole thing is about 5 seconds. Again this compares favorably to Arrow (in Pedro, that build takes about 30-60 seconds.)

As I mentioned to Pete, this PR is quite large, but I think anything smaller wouldn't be reviewable, because it couldn't work end-to-end. I'm happy to jump on a video call and walk people through it. There is also quite detailed commit history, if you'd like to step through it. (Though a word of warning, the first draft of the Rust code got rewritten.)

Future work:

  • Benchmarking
  • Get rid of the buffer reinitialization after flush (will probably need to rewrite FileWriter to an imperative style API, which can wait)
  • Wrap in a C++ class for use Santa's Logger

@the80srobot
Copy link
Contributor Author

Sorry for the timeline saying I added 19 commits, I just rebased the branch onto main and GH got confused. (I also added a few more comments.)

@pmarkowsky
Copy link
Contributor

Still looking into internal build stuff with this.

@pmarkowsky pmarkowsky added this to the 2024.3 milestone Mar 1, 2024
@pmarkowsky
Copy link
Contributor

Apologies for the long delay here. There were some internal things that needed to happen before we could resume looking at this.

load("@build_bazel_rules_apple//apple:macos.bzl", "macos_unit_test")
load("@build_bazel_rules_apple//apple:resources.bzl", "apple_resource_group")
load("@rules_cc//cc:defs.bzl", "cc_library")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit but it complicates the build internally: this can be removed and native.cc_library used below.

@pmarkowsky
Copy link
Contributor

@the80srobot Can you make this a static library as it simplifies the internal builds dramatically?

@mlw mlw modified the milestones: 2024.3, 2024.5 Apr 5, 2024
@mlw mlw removed this from the 2024.5 milestone May 9, 2024
@mlw
Copy link
Contributor

mlw commented Jul 2, 2024

Moving back to draft for now. This PR has become a bit stale and will require some effort to get back to being merge-ready.

@mlw mlw marked this pull request as draft July 2, 2024 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants