catalog: support composite primary key #625

skyzh · 2022-04-14T05:17:51Z

Currently, our catalog cannot handle the case for composite primary key. For example,

create table t1 (x int not null, y int not null, primary key(x, y));
create table t2 (x int not null, y int not null, primary key(y, x));

These two tables will have exactly the same catalog entry, despite that the first one should have (x, y) as composite sort key, and the second one have (y, x). The order is different!

Therefore, we will need to refactor the catalog to correctly record such information. For TableCatalog, we should have a field primary_key_indices, and remove the is_primary field from ColumnDesc.

This task might involve some refactors and some regressions in functionalities. Need plan first.

The text was updated successfully, but these errors were encountered:

skyzh · 2022-04-14T05:18:07Z

... may also be done together with #590

skyzh · 2022-04-14T05:42:35Z

assigned to @shmiwy

shmiwy · 2022-04-18T09:40:53Z

@skyzh
It seems that we don't support primary key? when I type this sql

create table t (a int not null, b int not null, c int, d int, primary key(a, b));

the physicalCreateTable says that neither a nor b is primary key🤣

skyzh · 2022-04-18T09:47:03Z

Then our binder doesn't support such syntax now 🤣 Please also help refactor the binder to support primary key (a, b) syntax.

risinglight/src/binder/statement/create_table.rs

Lines 20 to 22 in a5644b1

    
           Statement::CreateTable { name, columns, .. } => { 
        
               let name = &lower_case_name(name); 
        
               let (database_name, schema_name, table_name) = split_name(name)?;

For example, the first statement produces AST like:

        CreateTable {
            or_replace: false,
            temporary: false,
            external: false,
            if_not_exists: false,
            name: ObjectName([Ident {
                value: "t1",
                quote_style: None,
            }]),
            columns: [
                ColumnDef {
                    name: Ident {
                        value: "x",
                        quote_style: None,
                    },
                    data_type: Int(None),
                    collation: None,
                    options: [ColumnOptionDef {
                        name: None,
                        option: NotNull,
                    }],
                },
                ColumnDef {
                    name: Ident {
                        value: "y",
                        quote_style: None,
                    },
                    data_type: Int(None),
                    collation: None,
                    options: [ColumnOptionDef {
                        name: None,
                        option: NotNull,
                    }],
                },
            ],
            constraints: [Unique {
                name: None,
                columns: [
                    Ident {
                        value: "x",
                        quote_style: None,
                    },
                    Ident {
                        value: "y",
                        quote_style: None,
                    },
                ],
                is_primary: true,
            }],
            hive_distribution: NONE,
            hive_formats: Some(HiveFormat {
                row_format: None,
                storage: None,
                location: None,
            }),
            table_properties: [],
            with_options: [],
            file_format: None,
            location: None,
            query: None,
            without_rowid: false,
            like: None,
            engine: None,
            default_charset: None,
            collation: None,
        };

We can read the constraints variable and assign the primary keys.

skyzh · 2022-04-18T09:48:28Z

create table t (a int not null primary key, b int not null, c int, d int);

By the way, currently the binder only supports primary key at this location 🤣

shmiwy · 2022-04-18T15:50:10Z

Can we add a field called primary_key_indices in TableCatalog without removng is_primary field from ColumnDesc?
I think we can just change the new function for TableCatalog(src/ catalog/ table.rs) around these line,

risinglight/src/catalog/table.rs

Lines 43 to 46 in a5644b1

    
           for col_catalog in columns { 
        
               table_catalog.add_column(col_catalog).unwrap(); 
        
           } 
        
           table_catalog

add a judgment about col_catalog.is_primary() before line 44. If it is true, we can record the corresponding col_id in somewhere (maybe a vector). When we end up the for loop, we can change the primary_key_indices.

After we have created the a table, it seems that subsequent queries will not modify the catalog data like primary keys' id of the table. So it seems acceptable that we store redundant data about primary keys.🤣

shmiwy · 2022-04-18T15:53:54Z

By the way, changing the binder to supports this kind of sql query seems easy.

create table t(a int not null, b int not null, c int, d int, primary key(a, b));

(emmm, also I still have one weird mistake I haven't figured out, one sql logic test falled????)🤪

skyzh · 2022-04-18T15:58:30Z

(emmm, also I still have one weird mistake I haven't figured out, one sql logic test falled????)🤪

You may send a draft PR and we can investigate it...

Can we add a field called primary_key_indices in TableCatalog without removng is_primary field from ColumnDesc?
I think we can just change the new function for TableCatalog(src/ catalog/ table.rs) around these line,

Of course we can :) But from my perspective, is_primary won't be used after we have refactored all occurrences. We can have such intermediate state, and refactor it later.

shmiwy · 2022-04-18T16:07:36Z

But from my perspective, is_primary won't be used after we have refactored all occurrences. We can have such intermediate state, and refactor it later.

I think it just make some thing like is_primary function for ColumnCatalog easier hhhh(I I find this function used in many places). if we remove it, maybe we need a more complex logic to Implement this function（maybe a extra filed in ColumnDesc to indicate which table the column belongs to, or a complex logic to scan travel from database to schema to table to column just to find whether the column is primary

shmiwy · 2022-04-18T16:17:37Z

Or we need to refactor everything that uses the function （more bugs can emerge hhhh🤩）

skyzh · 2022-04-18T16:44:21Z

I think it just make some thing like is_primary function for ColumnCatalog easier

Or we need to refactor everything that uses the function

Exactly. If you want to refactor, we can do this. But storing is_primary inside ColumnDesc is more efficient -- we can do an O(1) hash lookup to determine whether a key is a primary key, and we don't need to refactor everywhere. For the first step, I'd like to retain it.

I find this function used in many places

For most of the places using is_primary (especially in planner and optimizer), it's used to determine the sort key of storage. However, most of the usages only consider single primary key. If we want to extend the case to multiple sort keys, a refactor is unavoidable. For example, BoundColumnRef's is_primary IMO should be removed in the future. We can do this later when we have a more detailed plan. Just keep a single PR concise and only solving one problem.

shmiwy · 2022-04-19T03:31:46Z

still not tell difference between primary key(a, b) and primary key(b, a), i will do this int the not-too-distant future.

shmiwy · 2022-04-19T08:14:27Z

on the first step, I'm going to add two filed in TableCatalog.Inner, one is pk_ids which is a HashSet to indicate which columns are primary, the other one is extra_ipk_ids. extra_pk_ids is a vector and will be used to record which columns are declared by "primary key(c1, c2, ..)" syntax.

for example, if we have a sql like

create table t (a int primary key, b int not null, c int not null, d int , primary key(b, c));

the pk_ids will record the column id for a b c, and the extra_pk_ids will record the column id for b c in order b followed by c.

I'm trying to maintain the data of the two filed without using any of the ColumnDesc information except column id, because we will remove is_primary filed in ColumnDesc in the future and refactor some code. So I will not use any of them.

To achieve this goal, I will refactor some codes when creating a table, add two filed in BoundCreateTable like 'TableCatalog,Inner', get the two filed data from stmt after parsing, pass them through binder, logical_plan .. until I call create_table on storage.

skyzh · 2022-04-19T08:26:45Z

create table t (a int primary key, b int not null, c int not null, d int , primary key(b, c));

I think we should forbid this in binder. We should only allow single primary key if a int primary key, or multiple primary keys using primary key(b, c) syntax.

skyzh · 2022-04-19T08:28:47Z

So, if

create table t (a int, b int not null, c int not null, d int , primary key(b, c));

We should have hashset to set b -> true, c -> true, and ordered_pk_ids to id of b, id of c.

If

create table t (a int primary key, b int not null, c int not null, d int);

We should set a -> true, while ordered_pk_ids only contain id of a.

I believe in the future, the HashSet won't be used, as all inquiries about pks will consider order. But we can keep it for now, to refactor everything little by little.

skyzh · 2022-04-19T08:29:28Z

To achieve this goal, I will refactor some codes when creating a table, add two filed in BoundCreateTable like 'TableCatalog,Inner', get the two filed data from stmt after parsing, pass them through binder, logical_plan .. until I call create_table on storage.

This plan looks great! We can proceed on this.

skyzh · 2022-04-19T08:30:10Z

ok, if so, I will only keep one vector in tableCatalog, and forbiden that syntax from binder

Sounds good to me.

shmiwy · 2022-04-19T08:31:38Z

ok🥳

skyzh mentioned this issue Apr 14, 2022

Tracking: sort key support across the system #626

Open

skyzh assigned shmiwy Apr 18, 2022

shmiwy mentioned this issue Apr 19, 2022

feat(binder): support “primary key (a, b)” syntax #631

Merged

shmiwy mentioned this issue Apr 19, 2022

feat: support multi ordered primary key #633

Merged

skyzh added the enhancement New feature or request label Apr 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

catalog: support composite primary key #625

catalog: support composite primary key #625

skyzh commented Apr 14, 2022

skyzh commented Apr 14, 2022

skyzh commented Apr 14, 2022

shmiwy commented Apr 18, 2022 •

edited

Loading

skyzh commented Apr 18, 2022

skyzh commented Apr 18, 2022

shmiwy commented Apr 18, 2022 •

edited

Loading

shmiwy commented Apr 18, 2022

skyzh commented Apr 18, 2022

shmiwy commented Apr 18, 2022 •

edited

Loading

shmiwy commented Apr 18, 2022 •

edited

Loading

skyzh commented Apr 18, 2022

shmiwy commented Apr 19, 2022

shmiwy commented Apr 19, 2022 •

edited

Loading

skyzh commented Apr 19, 2022 •

edited

Loading

skyzh commented Apr 19, 2022 •

edited

Loading

skyzh commented Apr 19, 2022

skyzh commented Apr 19, 2022

shmiwy commented Apr 19, 2022

catalog: support composite primary key #625

catalog: support composite primary key #625

Comments

skyzh commented Apr 14, 2022

skyzh commented Apr 14, 2022

skyzh commented Apr 14, 2022

shmiwy commented Apr 18, 2022 • edited Loading

skyzh commented Apr 18, 2022

skyzh commented Apr 18, 2022

shmiwy commented Apr 18, 2022 • edited Loading

shmiwy commented Apr 18, 2022

skyzh commented Apr 18, 2022

shmiwy commented Apr 18, 2022 • edited Loading

shmiwy commented Apr 18, 2022 • edited Loading

skyzh commented Apr 18, 2022

shmiwy commented Apr 19, 2022

shmiwy commented Apr 19, 2022 • edited Loading

skyzh commented Apr 19, 2022 • edited Loading

skyzh commented Apr 19, 2022 • edited Loading

skyzh commented Apr 19, 2022

skyzh commented Apr 19, 2022

shmiwy commented Apr 19, 2022

shmiwy commented Apr 18, 2022 •

edited

Loading

shmiwy commented Apr 18, 2022 •

edited

Loading

shmiwy commented Apr 18, 2022 •

edited

Loading

shmiwy commented Apr 18, 2022 •

edited

Loading

shmiwy commented Apr 19, 2022 •

edited

Loading

skyzh commented Apr 19, 2022 •

edited

Loading

skyzh commented Apr 19, 2022 •

edited

Loading