
Column/field description #447

Open
simonaubertbd opened this issue Jul 29, 2024 · 6 comments
@simonaubertbd

simonaubertbd commented Jul 29, 2024

Describe the enhancement requested

Hello,

I may have misread the documentation, but I couldn't find this feature. I would love to have a field/column description, just like we have in databases:

image

Best regards,

Simon

Component(s)

No response

@gaborcsardi

For every column, you can store arbitrary key-value metadata (string -> string); see:

8: optional list<KeyValue> key_value_metadata

@simonaubertbd
Author

Hello @gaborcsardi Yes, thanks for the tip, but it can't be exploited by software such as dataviz or ETL tools.

@liquidcarbon

Hi! I'm confused by different types of metadata.

  1. there's key-value metadata of type BLOB that you can write via schema metadata (in Python, pyarrow.table(data, metadata=...)) and read similarly, or through https://duckdb.org/docs/data/parquet/metadata.html#parquet-key-value-metadata
  2. there's a separate place, with these fields, that you can read with https://duckdb.org/docs/data/parquet/metadata.html#parquet-metadata
['file_name', 'row_group_id', 'row_group_num_rows',
       'row_group_num_columns', 'row_group_bytes', 'column_id', 'file_offset',
       'num_values', 'path_in_schema', 'type', 'stats_min', 'stats_max',
       'stats_null_count', 'stats_distinct_count', 'stats_min_value',
       'stats_max_value', 'compression', 'encodings', 'index_page_offset',
       'dictionary_page_offset', 'data_page_offset', 'total_compressed_size',
       'total_uncompressed_size', 'key_value_metadata']

Column-level information and data types are recorded here. But how do you write something into THAT key_value_metadata?

@simonaubertbd
Author

Since Parquet is now meant to be a standard file format, used for intermediate storage between tools, it really seems necessary to add this field/column description. It exists in a lot of formats, such as Tableau Hyper, Alteryx yxdb, etc.

As of now, this means losing information when storing data in Parquet.

Best regards,

Simon

@gaborcsardi

gaborcsardi commented Sep 28, 2024

> Hello @gaborcsardi Yes, thanks for the tip, but it can't be exploited by software such as dataviz or ETL tools.

Why can't dataviz or etl software add the description as metadata? E.g. as a key called description?

@simonaubertbd
Author

@gaborcsardi Because it means asking both the ETL user and the dataviz user for the key name. That requires yet another place (a file, an email, an issue...) where users have to share the information: "hey guys, I put the field description in the key label". And since most ETL or dataviz tools don't let you retrieve metadata by key, the users of each tool have to ask the tool vendor for a way to retrieve a value by its key. The whole process isn't user friendly.
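
To illustrate the coordination problem, here is a rough sketch of what a consuming tool would have to do without a standard field. The function `find_description` and its list of candidate keys are hypothetical; they only reflect the two key names mentioned in this thread (`description` and `label`).

```python
from typing import Mapping, Optional, Sequence


def find_description(
    field_metadata: Mapping[str, str],
    candidate_keys: Sequence[str] = ("description", "label"),
) -> Optional[str]:
    """Return the first non-empty value found under a guessed key, else None."""
    for key in candidate_keys:
        value = field_metadata.get(key)
        if value:
            return value
    return None


# The writer and the reader must agree on the key out of band.
found = find_description({"label": "Order date"})
missed = find_description({"comment": "Order date"})
```

If the writer picks a key that the reader does not guess (here `comment`), the description is silently lost even though it is present in the file.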

Since it's a very common piece of metadata and a lot of file formats already have it, I think it would be much better to have it in the file format specification. As of today, this is probably my main pain point with Parquet.

Best regards,

Simon
