Proposal: Introduce Catalog for Lance #3257
Comments
Hi @eddyxu @westonpace, the proposal document is here. We can discuss it here and plan the next steps. Would you like any more input? |
@yanghua Have you implemented a catalog for another format before?
Basic questions
I'm not familiar with catalogs myself, so I have a few basic questions:
Rust language complexity
I do wonder if your experience here is influenced because you are working in Java, which has the worst devexp in this repo at the moment. For example, the fact that you can't get a traceback when there is a panic is terrible. It would probably be worthwhile to invest effort into improving that developer experience.
Choosing an architecture
For the architecture, my preference is we don't make a hard decision now. I don't think we know enough yet to know what will be best. Instead, let's make an MVP catalog implementation that shows what it would look like. Whoever implements this can decide on which initial architecture will be faster to create.
What would an MVP look like?
What I would love to see from this proposal is an MVP user story we are targeting. Something like: A user in Spark can:
This chooses Spark and Unity Catalog as the first targets. I think Spark should likely be high priority, though I'm unsure about Unity vs. another catalog. |
Hi @wjones127, thanks for your feedback. Let me answer your questions one by one.
No, but when I joined the Lance community, I dug into some other catalog systems and the catalog modules of table formats, e.g. Iceberg.
Generally speaking, a catalog is for hierarchical data management. A classic example is an RDBMS, which uses the concept of a database to organize tables, views, and functions. For a fuller introduction, I can share a good blog post that researches catalogs in depth. A catalog is not just for listing tables: it organizes and manages them (think CRUD for tables, which is not strictly accurate but easy to understand). A catalog system can be a running service; yes, it can expose a URI to connect to and integrate with. A third-party engine or framework can integrate with it and then expose an easier interface, such as SQL DDL.
Yes, it needs a place to store the stateful information, that's what professional catalog systems need to do. For Lance (this proposal), we would like to define an interface and let:
Note: we do not plan to design or describe the details of how to store all this information, nor to provide a service owned by us. Is the title of this proposal not clear? Maybe
IMO, the table schema may be in the catalog, but column statistics may not.
About syncing: the interface should be implemented in all the SDKs, and all actions that operate on a dataset and its schema must go through the Catalog interface.
The Catalog interface runs in users' code, wherever they use the Lance SDK, just like now. We provide a catalog interface in the Lance SDKs and make it optional. The implementation of the Catalog interface can be understood as a plugin (maybe a Unity Catalog or Gravitino plugin), or some default implementation hosted in this repo (for example,
Perhaps some of my personal views were misunderstood. The comparison in the document reflects my own perspective (that of a Java developer). The complexity is mostly about the complexity of using Rust and the audience for this language, and there is always a certain subjectivity to that, especially for a Rust beginner.
Agree. We have done a POC version of this proposal (which includes Spark integration). We chose Option 2 (just like Iceberg's design). I listed both options here to gather more thoughts and further discussion. As for the MVP implementation, I will submit a draft PR. |
LanceDB has an implicit catalog. I think this proposal makes the catalog explicit and configurable and that would be useful. From the standpoint of the The current implicit catalog has a single layer of nesting (databases but not namespaces (e.g. schemas in classic terminology)). The implicit catalog is stateless and relies on a list dir operation on storage combined with a few different commit handler implementations. There are a few advantages I can see that could be obtained from a catalog.
From the From the lance/rust/lance-table/src/io/commit.rs Line 402 in 2b29487
|
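To make the "stateless, list dir" idea above concrete, here is a minimal sketch in Rust. It is not LanceDB's actual code: the function name `list_tables` is hypothetical, the sketch uses the local filesystem instead of an object store, and it assumes each table lives in a directory named `<table>.lance` under the database root.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: lists table names under `root` by scanning for
/// directories named `<table>.lance` (an assumed naming convention).
/// No state is kept between calls; every call re-reads the directory.
fn list_tables(root: &Path) -> io::Result<Vec<String>> {
    let mut names = Vec::new();
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        // Only directories with a `.lance` extension count as tables.
        let is_lance = path.extension().and_then(|e| e.to_str()) == Some("lance");
        if path.is_dir() && is_lance {
            if let Some(stem) = path.file_stem().and_then(|s| s.to_str()) {
                names.push(stem.to_string());
            }
        }
    }
    // Directory iteration order is unspecified, so sort for stable output.
    names.sort();
    Ok(names)
}
```

Because the listing is derived entirely from storage, there is nothing to keep in sync, but there is also nowhere to record extra levels of nesting or per-table metadata, which is the gap an explicit catalog would fill.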
Hi @westonpace, thanks for sharing more information and context. IMO, we both agree on the benefits that the Catalog feature would bring.
I have looked at the What's more, I have some questions:
|
I think it would actually be helpful to see the catalog implemented in Rust as For In-memory testing catalog, we actually support an in-memory object store if you provide the lance/rust/lance/src/dataset/rowids.rs Lines 131 to 133 in ae70478
If you want to have the interface defined in Java, you could do that. And the Rust-backed FSCatalog could implement that interface. |
OK, I will start drafting a PR for further discussion. Since we always need to implement it in Rust, let me try to define the interface in both the Rust and Java modules. When we need to integrate with the Hive Catalog, the most suitable option is to implement it via the Java module. |
@westonpace, I will maintain the task split in this issue (please see the last section of this proposal). WDYT? If you have any suggestions, please let me know. |
Motivation
As an effective concept for hierarchically managing data assets, the Catalog has been widely accepted in the industry, and many open-source engines have implemented it. The most typical one is the Apache Hive Catalog. At the same time, with the increasing popularity of LLMs, the industry is also actively using Catalogs to manage AI data assets, and even data assets of any type (for example, Unity Catalog, Gravitino, etc.). As a modern data format, Lance's success depends on support from many ecosystem components in AI and big data. The core of that integration is allowing these components to obtain Lance's information and metadata correctly and to connect seamlessly with their own Catalog systems. Therefore, it is necessary to design a Catalog for Lance so that it can deliver greater value in the AI and big data ecosystem.
Investigation of Mainstream Catalog Systems
In the fields of big data and AI, there are several mainstream Catalog Systems in the industry (typical ones are Unity Catalog and Apache Gravitino). The best way for Lance to expand its ecosystem is to integrate with them. However, no Catalog System has yet become a de facto standard in the industry, so here we compare them along several dimensions to support more in-depth discussion and decisions.
Conclusion: Regarding the current capabilities and maturity of the two projects, integrating Lance with Apache Gravitino may be a better choice. However, in the following text, we will define Lance's Catalog interface, enabling it to be integrated with either of these systems.
Lance Catalog Conceptual Design
After researching mainstream Catalog Systems and the catalog designs of table formats, we believe that the Lance Catalog is mainly a two-level conceptual design:
The overall structure is illustrated as follows:
Based on the above concept introduction, we need to define the following entities and interfaces.
Entity
Interface
Lance Catalog architecture design
At the architectural level, there are two approaches to implement the Lance Catalog. The diagram is as follows:
Option 1: Integration mode centered on Rust.
Option 2: Multi-language vertical integration mode
The choice between these two designs is a trade-off on multiple levels. Here we compare them along several dimensions:
Design comparison
Conclusion: The choice between the two options still requires further discussion.
Lance Catalog interface design
For the POC, the following is provisionally implemented in the form of Option 2.
Catalog Interface
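The original interface listing is not reproduced in this text. As a rough illustration only, the following Rust sketch shows what a two-level catalog interface with an in-memory implementation for testing could look like. Everything in it is an assumption made for illustration: the two levels are taken to be namespace and dataset, and the names `Catalog`, `CatalogError`, `InMemoryCatalog`, and all method names are hypothetical rather than the POC's actual API (which follows Option 2).

```rust
use std::collections::BTreeMap;

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
enum CatalogError {
    NamespaceNotFound(String),
    NamespaceAlreadyExists(String),
    DatasetNotFound(String),
    DatasetAlreadyExists(String),
}

/// Hypothetical two-level catalog: namespaces contain datasets,
/// and each dataset maps to a storage location (URI).
trait Catalog {
    fn create_namespace(&mut self, namespace: &str) -> Result<(), CatalogError>;
    fn list_namespaces(&self) -> Vec<String>;
    fn register_dataset(&mut self, namespace: &str, name: &str, location: &str)
        -> Result<(), CatalogError>;
    fn list_datasets(&self, namespace: &str) -> Result<Vec<String>, CatalogError>;
    fn dataset_location(&self, namespace: &str, name: &str) -> Result<String, CatalogError>;
    fn drop_dataset(&mut self, namespace: &str, name: &str) -> Result<(), CatalogError>;
}

/// In-memory implementation, useful only for tests: namespace -> (dataset -> location).
#[derive(Default)]
struct InMemoryCatalog {
    namespaces: BTreeMap<String, BTreeMap<String, String>>,
}

impl InMemoryCatalog {
    /// Looks up a namespace or reports that it does not exist.
    fn datasets(&self, namespace: &str) -> Result<&BTreeMap<String, String>, CatalogError> {
        self.namespaces
            .get(namespace)
            .ok_or_else(|| CatalogError::NamespaceNotFound(namespace.to_string()))
    }
}

impl Catalog for InMemoryCatalog {
    fn create_namespace(&mut self, namespace: &str) -> Result<(), CatalogError> {
        if self.namespaces.contains_key(namespace) {
            return Err(CatalogError::NamespaceAlreadyExists(namespace.to_string()));
        }
        self.namespaces.insert(namespace.to_string(), BTreeMap::new());
        Ok(())
    }

    fn list_namespaces(&self) -> Vec<String> {
        self.namespaces.keys().cloned().collect()
    }

    fn register_dataset(&mut self, namespace: &str, name: &str, location: &str)
        -> Result<(), CatalogError> {
        let datasets = self
            .namespaces
            .get_mut(namespace)
            .ok_or_else(|| CatalogError::NamespaceNotFound(namespace.to_string()))?;
        if datasets.contains_key(name) {
            return Err(CatalogError::DatasetAlreadyExists(name.to_string()));
        }
        datasets.insert(name.to_string(), location.to_string());
        Ok(())
    }

    fn list_datasets(&self, namespace: &str) -> Result<Vec<String>, CatalogError> {
        Ok(self.datasets(namespace)?.keys().cloned().collect())
    }

    fn dataset_location(&self, namespace: &str, name: &str) -> Result<String, CatalogError> {
        self.datasets(namespace)?
            .get(name)
            .cloned()
            .ok_or_else(|| CatalogError::DatasetNotFound(name.to_string()))
    }

    fn drop_dataset(&mut self, namespace: &str, name: &str) -> Result<(), CatalogError> {
        let datasets = self
            .namespaces
            .get_mut(namespace)
            .ok_or_else(|| CatalogError::NamespaceNotFound(namespace.to_string()))?;
        datasets
            .remove(name)
            .map(|_| ())
            .ok_or_else(|| CatalogError::DatasetNotFound(name.to_string()))
    }
}
```

In this shape, a Unity Catalog, Gravitino, or filesystem-backed implementation would be another `impl Catalog for ...`, and engine integrations would depend only on the trait.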
DatasetOperation Interface
TODO(discussion)
Implementation