Proposal: Introduce Catalog for Lance #3257

Open
1 task
yanghua opened this issue Dec 17, 2024 · 8 comments

Comments

@yanghua
Contributor

yanghua commented Dec 17, 2024

Motivation

As an effective concept for hierarchically managing data assets, the catalog has been widely adopted in the industry, and many open-source engines implement it; the most typical example is the Apache Hive catalog. With the rising popularity of LLMs, the industry is also actively using catalogs to manage AI data assets, and even data assets of any type (for example, Unity Catalog, Gravitino, etc.). As a modern data format, Lance's success depends on support from the many ecosystem components in AI and big data. The core of that integration is allowing these components to obtain Lance's information and metadata correctly and to connect seamlessly with their catalog systems. It is therefore necessary to design a catalog for Lance so that it can deliver greater value in the AI and big data ecosystem.

Investigation of Mainstream Catalog Systems

In the fields of big data and AI, there are several mainstream catalog systems in the industry, most notably Unity Catalog and Apache Gravitino. The best choice for Lance in expanding its ecosystem is to integrate with them. However, no catalog system has yet become a de facto industry standard, so we compare them along several dimensions below to support more in-depth discussion and decisions.

| Dimension | Unity Catalog (OSS) | Gravitino |
| --- | --- | --- |
| License | Apache v2.0 | Apache v2.0 |
| Multilingual ecosystem | No Python client currently | Python SDK |
| Unstructured data support | | |
| Iceberg REST API | Read-only, via UniForm | |
| Security control | | |
| Engine support | Spark, DuckDB, Trino, Daft, PuppyGraph, SpiceAI, XTable | Trino, Spark, Flink, PyTorch, Ray |
| Web GUI | | |

Conclusion: Given the current capabilities and maturity of the two projects, integrating Lance with Apache Gravitino may be the better choice. However, in the following text, we will define Lance's catalog interface so that it can be integrated with either system.

Lance Catalog Conceptual Design

After researching mainstream catalog systems and the catalog designs of table formats, we believe the Lance Catalog is essentially a two-level conceptual design:

  • Namespace: Used to organize several Lance datasets. It is equivalent to the Schema (or Database) concept in RDBMS and also comparable to the Namespace in Iceberg. Dataset assets can be organized under the Namespace.
  • Dataset: Corresponding to the current Lance Dataset concept, it is equivalent to a Table in RDBMS.

The overall structure is illustrated as follows:

catalog-concept-design

Based on the above concept introduction, we need to define the following entities and interfaces.

Entity

  • Namespace: Used for organizing datasets, similar to a database in an RDBMS.
  • DatasetIdentifier: Used to uniquely identify a dataset in a catalog.
  • DatasetMetadata (optional): Used to encapsulate a dataset's metadata (such as schema, location, and some extended attributes).
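To make the two entities concrete, here is a minimal, hypothetical sketch in Java; the class shapes, fields, and factory methods are illustrative only and not the final Lance API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** A multi-level namespace, e.g. ["prod", "ml"], comparable to an Iceberg Namespace. */
final class Namespace {
    private final List<String> levels;

    private Namespace(List<String> levels) {
        this.levels = Collections.unmodifiableList(levels);
    }

    static Namespace of(String... levels) {
        return new Namespace(Arrays.asList(levels));
    }

    List<String> levels() {
        return levels;
    }
}

/** Uniquely identifies a dataset within a catalog: a namespace plus a dataset name. */
final class DatasetIdentifier {
    private final Namespace namespace;
    private final String name;

    DatasetIdentifier(Namespace namespace, String name) {
        this.namespace = namespace;
        this.name = name;
    }

    Namespace namespace() { return namespace; }

    String name() { return name; }

    @Override
    public String toString() {
        // Render as a dotted path, e.g. "prod.ml.embeddings"
        return String.join(".", namespace.levels()) + "." + name;
    }
}
```

With these two types, an identifier such as `new DatasetIdentifier(Namespace.of("prod", "ml"), "embeddings")` maps directly onto the two-level Namespace/Dataset design above.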

Interface

  • Catalog: An abstract interface that defines the basic semantic operations a Lance catalog needs to support.
  • DatasetOperation: An abstract interface that defines how dataset metadata is committed or refreshed.

Lance Catalog architecture design

At the architectural level, there are two approaches to implementing the Lance Catalog, illustrated in the diagrams below:

Option 1: Integration mode centered on Rust.

design-option-1

Option 2: Multi-language vertical integration mode

design-option-2

The choice between these two designs is a trade-off on multiple levels. Here we compare them in some dimensions:

Design comparison

| Dimension | Explanation (mainly the reasons against the inferior option) |
| --- | --- |
| Complexity | The language complexity of Rust and the costs of writing binding calls, debugging, etc. are higher. |
| Controllability | The vertical mode only stipulates interfaces; each language implements the logic itself, so a bug will not affect the integration scenarios of all languages. |
| Workload | Taking Hive Catalog as an example, each language's Lance SDK requires its own adaptation. |
| Maintainability | Vertical integration relies on "conventions" between programming languages to "guarantee" consistent interface semantics, with relatively weak enforcement. |
| Maturity (are there industry reference cases for this solution?) | Iceberg adopted Solution 2. Although the base languages of different formats have different implementation paths and historical backgrounds, only the results are evaluated here. |
| Friendliness of integration with mainstream catalog systems | Both can interact with Unity Catalog/Apache Gravitino through RESTful APIs. |

Conclusion: The choice of specific schemes still requires further discussion.

Lance Catalog interface design

For the POC, the following is implemented in the style of Option 2 for now.

Catalog Interface

/** A Catalog API for dataset create, drop, and load operations. */
public interface Catalog {

  /**
   * Return the name for this catalog.
   *
   * @return this catalog's name
   */
  String name();

  /**
   * Create dataset with a given identifier and schema.
   */
  Dataset createDataset(DatasetIdentifier identifier, Schema schema);

  /**
   * Create dataset with a given identifier, schema, location and properties.
   */
  Dataset createDataset(
      DatasetIdentifier identifier,
      Schema schema,
      String location,
      Map<String, String> properties);

  /**
   * Return all the identifiers under this namespace.
   *
   * @param namespace a namespace
   * @return a list of identifiers for datasets
   * @throws NoSuchNamespaceException if the namespace is not found
   */
  List<DatasetIdentifier> listDatasets(Namespace namespace);

  /**
   * Drop a dataset; optionally delete data and metadata files.
   *
   * <p>If purge is set to true the implementation should delete all data and metadata files.
   *
   * @param identifier a dataset identifier
   * @param purge if true, delete all data and metadata files in the dataset
   * @param storageOptions a map of storage options to use when deleting data and metadata files
   * @return true if the dataset was dropped, false if the dataset did not exist
   */
  boolean dropDataset(
      DatasetIdentifier identifier, boolean purge, Map<String, String> storageOptions);

  boolean dropDataset(DatasetIdentifier identifier);

  /**
   * Rename a dataset.
   *
   * @param from identifier of the dataset to rename
   * @param to new dataset name
   * @throws NoSuchDatasetException if the from dataset does not exist
   * @throws AlreadyExistsException if the to dataset already exists
   */
  void renameDataset(DatasetIdentifier from, DatasetIdentifier to);

  /**
   * Load a dataset.
   *
   * @param identifier a dataset identifier
   * @return instance of {@link Dataset} implementation referred by {@code identifier}
   * @throws NoSuchDatasetException if the dataset does not exist
   */
  Optional<Dataset> loadDataset(DatasetIdentifier identifier);

  /**
   * Invalidate cached dataset metadata from current catalog.
   *
   * <p>If the dataset is already loaded or cached, drop cached data. If the dataset does not exist
   * or is not cached, do nothing.
   *
   * @param identifier a dataset identifier
   */
  default void invalidateDataset(DatasetIdentifier identifier) {}

  /**
   * Register a dataset with the catalog if it does not exist.
   *
   * @param identifier a dataset identifier
   * @param metadataFileLocation the location of a metadata file
   * @return a dataset instance
   * @throws AlreadyExistsException if the dataset already exists in the catalog.
   */
  Dataset registerDataset(DatasetIdentifier identifier, String metadataFileLocation);

  DatasetBuilder buildDataset(DatasetIdentifier identifier, Schema schema);

  /**
   * Create a namespace in the catalog.
   *
   * @param namespace a namespace. {@link Namespace}.
   * @throws AlreadyExistsException If the namespace already exists
   * @throws UnsupportedOperationException If create is not a supported operation
   */
  void createNamespace(Namespace namespace);

  /**
   * Create a namespace in the catalog.
   *
   * @param namespace a multi-part namespace
   * @param metadata a string Map of properties for the given namespace
   * @throws AlreadyExistsException If the namespace already exists
   * @throws UnsupportedOperationException If create is not a supported operation
   */
  void createNamespace(Namespace namespace, Map<String, String> metadata);

  /**
   * List top-level namespaces from the catalog.
   *
   * <p>If an object such as a dataset exists, its parent namespaces must also exist and must be
   * returned by this discovery method. For example, if dataset a.b.t exists, this method must
   * return ["a"] in the result array.
   *
   * @return a List of namespace {@link Namespace} names
   */
  List<Namespace> listNamespaces();

  /**
   * List child namespaces from the namespace.
   *
   * <p>For two existing datasets named 'a.b.c.table' and 'a.b.d.table', this method returns:
   *
   * <ul>
   *   <li>Given: {@code Namespace.empty()}
   *   <li>Returns: {@code Namespace.of("a")}
   * </ul>
   *
   * <ul>
   *   <li>Given: {@code Namespace.of("a")}
   *   <li>Returns: {@code Namespace.of("a", "b")}
   * </ul>
   *
   * <ul>
   *   <li>Given: {@code Namespace.of("a", "b")}
   *   <li>Returns: {@code Namespace.of("a", "b", "c")} and {@code Namespace.of("a", "b", "d")}
   * </ul>
   *
   * <ul>
   *   <li>Given: {@code Namespace.of("a", "b", "c")}
   *   <li>Returns: empty list, because there are no child namespaces
   * </ul>
   *
   * @return a List of child {@link Namespace} names from the given namespace
   * @throws NoSuchNamespaceException If the namespace does not exist (optional)
   */
  List<Namespace> listNamespaces(Namespace namespace) throws NoSuchNamespaceException;

  /**
   * Load metadata properties for a namespace.
   *
   * @param namespace a namespace. {@link Namespace}
   * @return a string map of properties for the given namespace
   * @throws NoSuchNamespaceException If the namespace does not exist (optional)
   */
  Map<String, String> loadNamespaceMetadata(Namespace namespace) throws NoSuchNamespaceException;

  /**
   * Drop a namespace. If the namespace exists and was dropped, this will return true.
   *
   * @param namespace a namespace. {@link Namespace}
   * @return true if the namespace was dropped, false otherwise.
   * @throws NamespaceNotEmptyException If the namespace is not empty
   */
  boolean dropNamespace(Namespace namespace) throws NamespaceNotEmptyException;

  /**
   * Set a collection of properties on a namespace in the catalog.
   *
   * <p>Properties that are not in the given map are not modified or removed by this method.
   *
   * @param namespace a namespace. {@link Namespace}
   * @param properties a collection of metadata to apply to the namespace
   * @throws NoSuchNamespaceException If the namespace does not exist (optional)
   * @throws UnsupportedOperationException If namespace properties are not supported
   */
  boolean setProperties(Namespace namespace, Map<String, String> properties)
      throws NoSuchNamespaceException;

  /**
   * Remove a set of property keys from a namespace in the catalog.
   *
   * <p>Properties that are not in the given set are not modified or removed by this method.
   *
   * @param namespace a namespace. {@link Namespace}
   * @param properties a collection of metadata to apply to the namespace
   * @throws NoSuchNamespaceException If the namespace does not exist (optional)
   * @throws UnsupportedOperationException If namespace properties are not supported
   */
  boolean removeProperties(Namespace namespace, Set<String> properties)
      throws NoSuchNamespaceException;

  /**
   * Checks whether the Namespace exists.
   *
   * @param namespace a namespace. {@link Namespace}
   * @return true if the Namespace exists, false otherwise
   */
  boolean namespaceExists(Namespace namespace);

  /**
   * Initialize a catalog given a custom name and a map of catalog properties.
   *
   * <p>A custom Catalog implementation must have a no-arg constructor. A compute engine like Spark
   * or Flink will first initialize the catalog without any arguments, and then call this method to
   * complete catalog initialization with properties passed into the engine.
   *
   * @param name a custom name for the catalog
   * @param properties catalog properties
   */
  default void initialize(String name, Map<String, String> properties) {}

  interface DatasetBuilder {

    DatasetBuilder withLocation(String location);

    DatasetBuilder withProperties(Map<String, String> properties);

    DatasetBuilder withProperty(String key, String value);
    
    //....

    Dataset create();
  }

}
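To show how the interface could behave in practice, here is a minimal, hypothetical in-memory sketch covering only a subset of the semantics (namespaces and dataset names, without schemas or metadata). The class name `MiniCatalog` and its string-keyed storage are illustrative simplifications, not the real Lance implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A simplified, single-level in-memory catalog for illustration and testing. */
final class MiniCatalog {
    // namespace name -> dataset names registered under it
    private final Map<String, List<String>> store = new HashMap<>();

    void createNamespace(String namespace) {
        if (store.containsKey(namespace)) {
            throw new IllegalStateException("Namespace already exists: " + namespace);
        }
        store.put(namespace, new ArrayList<>());
    }

    boolean namespaceExists(String namespace) {
        return store.containsKey(namespace);
    }

    void createDataset(String namespace, String name) {
        List<String> datasets = store.get(namespace);
        if (datasets == null) {
            throw new IllegalStateException("No such namespace: " + namespace);
        }
        datasets.add(name);
    }

    /** Return a defensive copy so callers cannot mutate catalog state. */
    List<String> listDatasets(String namespace) {
        List<String> datasets = store.get(namespace);
        if (datasets == null) {
            throw new IllegalStateException("No such namespace: " + namespace);
        }
        return new ArrayList<>(datasets);
    }

    /** Returns true if the dataset existed and was removed, false otherwise. */
    boolean dropDataset(String namespace, String name) {
        List<String> datasets = store.get(namespace);
        return datasets != null && datasets.remove(name);
    }
}
```

A real implementation of the Catalog interface above would additionally carry Schema and DatasetMetadata through these calls and delegate storage to a backing system such as Hive, Unity Catalog, or Gravitino.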

DatasetOperation Interface

  • current: Return the dataset metadata that is currently loaded.
  • refresh: Refresh the dataset metadata.
  • commit: Replace the dataset's metadata with a new version. (The commit operation needs to consider locking.)
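A possible shape for this interface, with a single-process implementation that uses a version number for conflict detection, is sketched below. The `DatasetMetadata` fields and the version-based check are assumptions for illustration; the real design would need a distributed locking or atomic-swap mechanism.

```java
/** Illustrative metadata holder; the real DatasetMetadata would also carry schema, etc. */
final class DatasetMetadata {
    final int version;
    final String location;

    DatasetMetadata(int version, String location) {
        this.version = version;
        this.location = location;
    }
}

interface DatasetOperation {
    /** Return the dataset metadata that is currently loaded. */
    DatasetMetadata current();

    /** Reload the dataset metadata from the catalog. */
    DatasetMetadata refresh();

    /** Atomically replace {@code base} with {@code updated}; fails if another writer committed first. */
    void commit(DatasetMetadata base, DatasetMetadata updated);
}

/** Single-process sketch: optimistic concurrency via the version number. */
final class InMemoryDatasetOperation implements DatasetOperation {
    private DatasetMetadata committed;

    InMemoryDatasetOperation(DatasetMetadata initial) {
        this.committed = initial;
    }

    @Override
    public synchronized DatasetMetadata current() { return committed; }

    @Override
    public synchronized DatasetMetadata refresh() { return committed; }

    @Override
    public synchronized void commit(DatasetMetadata base, DatasetMetadata updated) {
        // If another writer committed since we loaded `base`, reject and force a retry.
        if (base.version != committed.version) {
            throw new IllegalStateException("Concurrent commit detected; refresh and retry");
        }
        committed = updated;
    }
}
```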

TODO(discussion)

  • Provide index-related APIs and store them in metadata;

Implementation

@yanghua
Contributor Author

yanghua commented Dec 19, 2024

Hi @eddyxu @westonpace The proposal document is here. We can discuss it here and plan the next steps. Do you need any more input?

@wjones127
Contributor

@yanghua Have you implemented a catalog for another format before?

Basic questions

I'm not familiar with catalogs myself, so I have a few basic questions:

  • What do users use the catalogs for? Is it just to list the tables? And then the catalog provides the URI so they can read the full
  • Is there any stateful information about a table stored? For example, table schema or column statistics? If so, how are they kept in sync with the table?
  • Where does the Lance catalog code run? Is it a plugin on some catalog server? Is it a plugin on the engine? Or somewhere else?

Rust language complexity

The language complexity of RUST and the costs of binding call writing, DEBUG, etc. are higher.

I do wonder if your experience here is influenced because you are working in Java, which has the worst devexp in this repo at the moment. For example, the fact that you can't get a traceback when there is a panic is terrible. It would probably be worthwhile to invest effort into improving that developer experience.

Choosing an architecture

For the architecture, my preference is we don't make a hard decision now. I don't think we know enough yet to know what will be best. Instead, let's make an MVP catalog implementation that shows what it would look like. Whoever implements this can decide on which initial architecture will be faster to create.

What would an MVP look like?

What I would love to see from this proposal is an MVP user story we are targeting. Something like:

A user in Spark can:

  1. Install the Unity catalog and Lance spark plugin on their cluster
  2. When a user creates a new table in Spark, it will show up in the catalog's list of tables
  3. The user can open the table by name from the catalog
  4. When a user drops the table in Spark, it will no longer show up in the catalog's list of tables

This chooses the Spark and the Unity catalog as the first target. I think Spark should likely be high priority, though unsure about unity vs another catalog.

@yanghua
Copy link
Contributor Author

yanghua commented Dec 20, 2024

Hi @wjones127 thanks for your feedback. Let me answer your question one by one.

Have you implemented a catalog for another format before?

No, but when I joined the Lance community, I dived into some other catalog systems or modules of table format, e.g. Iceberg.

What do users use the catalogs for? Is it just to list the tables? And then the catalog provides the URI so they can read the full

Generally speaking, it is for hierarchical data management. A classic example is an RDBMS, which uses the concept of a database to organize tables/views/functions. For more background on catalogs, I can share a good blog post that researched them in depth.

It's not just for listing tables; a catalog can organize and manage tables (doing CRUD on tables — not strictly accurate, but easy to understand).

A catalog system can be a running service, and yes, it can expose a URI for others to connect and integrate with. A third-party engine or framework can integrate with it and then expose an easier UI, such as SQL DDL.

Is there any stateful information about a table stored? For example, table schema or column statistics? If so, how are they kept in sync with the table?

Yes, it needs a place to store the stateful information; that is what professional catalog systems do. For Lance (this proposal), we would like to define an interface that lets:

  • users know how to use the catalog interface to manage their Lance datasets (tables);
  • developers know how to integrate and adapt to other catalog systems (e.g. Unity Catalog, Gravitino, Iceberg REST-based catalogs)

Note: we do not plan to design and describe the details of how to store all this information, or to provide a service owned by us.

Is the title of this proposal not clear? Maybe Introduce Catalog interface for Lance sounds better?

IMO, the table schema may live in the catalog, but column statistics may not. As for syncing, the interface should be implemented in all the SDKs, and all actions that operate on a dataset and its schema must go through the Catalog interface.

Where does the Lance catalog code run? Is it a plugin on some catalog server? Is it a plugin on the engine? Or somewhere else?

The Catalog interface runs in users' code wherever they use a Lance SDK, just like now. We provide a catalog interface in the Lance SDKs and make it an optional choice.

We can think of an implementation of the Catalog interface as a plugin (perhaps a Unity Catalog or Gravitino plugin), or as a default implementation hosted in this repo (for example, integration/HiveCatalog, or an InMemoryCatalog for testing).

Rust language complexity

My personal view here may be misunderstood or biased. The comparison in the document reflects my perspective (that of a Java developer?). "Complexity" is mainly about the difficulty of using Rust and the audience for the language; there is inevitably some subjectivity to that, especially for a Rust beginner.

Choosing an architecture

Agree. We have done a POC version of this proposal (including Spark integration). We chose Option 2 (just like Iceberg's design). I put both options here to gather more thoughts and further discussion. As for the MVP implementation, I will submit a draft PR.

@westonpace
Contributor

westonpace commented Dec 20, 2024

LanceDB has an implicit catalog. I think this proposal makes the catalog explicit and configurable, and that would be useful. From the standpoint of the lancedb packages, which are built on the Rust lancedb crate, we would need to make a trait to make the catalog explicit.

The current implicit catalog has a single layer of nesting (databases but not namespaces, i.e. schemas in classic terminology). The implicit catalog is stateless and relies on a list-dir operation on storage combined with a few different commit handler implementations.

There are a few advantages I can see that could be obtained from a catalog.

  • Performance - maybe there are some minor performance advantages to be gained by replacing list_dir with something cached in mutable memory. Performance doesn't seem like the most promising advantage.
  • Integration - sometimes it is beneficial to have a single catalog server that is capable of managing several different database products. For example, maybe a user has postgres tables, lance tables, and iceberg tables.
  • Extra capability - Putting the catalog in a standalone service allows for extra capabilities to be added. For example, one potential issue with the implicit catalog we have is that there is no real way to define access controls to limit which users can see which databases and it isn't clear how this would be solved.

From the lancedb perspective it seems pretty clear that making the catalog explicit is probably not too hard and has some good advantages.

From the pylance perspective I think we already have the trait that we need: the CommitHandler, located here:

pub trait CommitHandler: Debug + Send + Sync {

I'd encourage you to take a look at the trait; it is something that a catalog would need to be able to implement.

@yanghua
Contributor Author

yanghua commented Dec 23, 2024

Hi, @westonpace thanks for sharing more information and context to let me know more.

IMO, we both agree on the benefits that the Catalog feature would bring.

I'd encourage you to take a look at the trait, it is something that a catalog would need to be able to implement.

I have looked at the CommitHandler. We need to change the manifest in some dataset operations, e.g. schema evolution (altering columns), and these operations need a transactional commit. Lance has introduced an external manifest store to avoid multi-writer conflicts because it needs a *-if-not-exists feature. The two features may be combined in one catalog implementation. Is that correct?

What's more, I have some questions:

  • what's the subsequent action in your mind?
  • will we reuse the implicit catalog in LanceDB, or start a new, community-driven one?
  • based on the discussion above, it seems we will touch the Rust core module first, which is very similar to Option 1 in my proposal document. Is this correct?

@wjones127
Contributor

what's the subsequence action in your mind?

I think it would actually be helpful to see the catalog implemented in Rust as an FSCatalog containing the current "implicit" catalog. I think there's a lot of feedback we could provide on the proposed interface once we see that. For example, the catalog you propose has properties, while Lance already has a similar notion of config. Having a PR with the interface, implemented for the current implicit catalog, would help drive that discussion. Does that seem reasonable?

As for an in-memory testing catalog, we actually support an in-memory object store if you provide the "memory://" URI. So I would use an FSCatalog with that base URL as the in-memory catalog for testing. You can see examples of us using that here:

let dataset = Dataset::write(reader, "memory://", Some(write_params))
    .await
    .unwrap();

If you want to have the interface defined in Java, you could do that. And the Rust-backed FSCatalog could implement that interface.

@yanghua
Contributor Author

yanghua commented Dec 24, 2024

OK, I will start drafting a PR for further discussion. Since we will need a Rust implementation anyway, let me try to define the interface in both the Rust and Java modules. When we need to integrate with Hive Catalog, the most suitable option is to implement it in the Java module.

@yanghua
Contributor Author

yanghua commented Dec 24, 2024

@westonpace, I will maintain the task split in this issue (please see the last section of this proposal).

WDYT? If you have any suggestions please let me know.
