Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: Comprehensive Export/Import Functionality for Table Management #623

Open
mrchypark opened this issue Aug 20, 2024 · 5 comments

Comments

@mrchypark
Copy link
Contributor

Feature Request: Comprehensive Export/Import Functionality for Datalake Table Management

Overview

First and foremost, thank you for your incredible work on the seafowl project. Your efforts in managing table information using SQLite are truly appreciated. To further enhance the project's capabilities and support continuous table creation and read-only service operations, I'd like to propose a new feature: comprehensive export and import functionality for table information.

Proposed Feature

The feature would consist of three main components:

  1. Full Non-System Table Export/Import:

    • Export all table information from SQLite, excluding system tables.
    • Import this information, including updates to existing tables.
  2. Individual Table Export/Import:

    • Allow export and import of information for specific tables.
  3. Schema-based Table Export/Import:

    • Enable export and import of all table information within a given schema.

Use Case

This feature would greatly benefit users who need to:

  • Continuously create and manage tables in one environment.
  • Import these tables into a separate read-only service environment for operational use.

Potential Implementation

While I understand that the specifics of implementation would be up to the project maintainers, some initial thoughts include:

  • Utilizing SQLite's built-in backup functionality for full database exports.
  • Developing custom SQL queries to extract and insert table information for more granular operations.
  • Implementing a flexible file format (e.g., JSON or CSV) for exported data to ensure compatibility and ease of use.

Benefits

  1. Enhanced data portability between different seafowl instances.
  2. Simplified management of development/production environments.
  3. Improved support for backup and restore operations.
  4. Greater flexibility in table management across different use cases.

Conclusion

I believe this feature would significantly enhance seafowl's utility and appeal to a broader range of users and use cases. I'm excited to hear your thoughts on this proposal and would be happy to provide any additional information or clarification if needed.

Thank you for considering this feature request, and for your continued dedication to the seafowl project.

@gruuya
Copy link
Contributor

gruuya commented Aug 20, 2024

Thanks for the kind words!

What you propose makes sense, and I think for the most part the functionality is already there, mainly via the clade crate (stands for Catalog of Lakehouse Definitions). It enables Seafowl to be run in "headless" mode, where the catalog is abstracted away to some remote system, that can then itself perform all table management functionality according to its needs (which would probably substantially vary from case to case).

The idea is that the clade::schema module provides common definitions needed for Seafowl to read (from) tables, facilitated by implementing and running clade::schema::schema_store_service_server::SchemaStoreServiceServer; you can see an example in the tests here and here.

In fact this is precisely what how we use Seafowl in production ourselves, the metastore is a separate component and Seafowl(s) talk to to it to learn about schemas, tables and object store locations when executing a query. One major drawback however is that it does not support writes, meaning the initial DML would need to be performed out-of-band for now.

@mrchypark
Copy link
Contributor Author

mrchypark commented Aug 21, 2024

Thank you for letting me know. I was thinking of a scenario where writing tasks, which use more resources, are performed on multiple Seafowl instances, and then the table information is combined in a read-only Seafowl. I will try using clade, referring to the example you provided.

What does out-of-band mean? Can you tell me what methods are available? I'm currently using a method where I create an external table and then perform aggregation queries to create the necessary tables.

@gruuya
Copy link
Contributor

gruuya commented Aug 21, 2024

By out-of-band mean that you use something else to write the tables, or even Seafowl, but then you separately persist the metadata to you metastore and expose it to Seafowl instances via clade.

@mrchypark
Copy link
Contributor Author

Thank you for explaining clade to me. It seems very useful for working with multiple read-only Seafowl instances.
However, it appears that to use this, each Seafowl instance would need to perform its own 'create table' operations, and then a server running clade would need to extract and maintain information about these tables. Is this correct? If so, could you suggest a method for extracting the table information from Seafowl instances so that the clade server can provide this information?
I appreciate your help with this. Thank you in advance for any insights you can offer.

@gruuya
Copy link
Contributor

gruuya commented Aug 23, 2024

However, it appears that to use this, each Seafowl instance would need to perform its own 'create table' operations, and then a server running clade would need to extract and maintain information about these tables. Is this correct?

Yes that is correct. Until the clade interface is extended to perform writes as well you're going to have to do it out-of-band.

If so, could you suggest a method for extracting the table information from Seafowl instances so that the clade server can provide this information?

Perhaps some custom service can get access to the PG/SQLite connection string and perform replication from them. Or something more out-of-the-box, e.g. https://github.com/superfly/litefs or https://www.splitgraph.com/blog/deploying-serverless-seafowl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants