Stage READ/WRITE support for LOAD DATA, External Table and SELECT OUTFILE #17979

Closed
wants to merge 147 commits into from

Conversation

cpegeric
Contributor

@cpegeric cpegeric commented Aug 8, 2024

User description

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #17820
issue #17747
issue #17748

What this PR does / why we need it:

  • Import data through the SQL statement "LOAD DATA INFILE" using a stage URL.
  • Read data from an external table backed by a local file or S3.
  • SELECT INTO OUTFILE accepts a stage URL as its file path and writes to a local file or S3.
  • A stage URL can point to a local file, an S3 file, or another stage with a subpath.
  • Restrict the stage URL format to file:///, s3://, or stage://.
  • Remove HashPassword for stage credentials.
  • Add a stage_list() function that lists the files in a directory, or the files matching a file path with wildcards.
  • Export functions now work on top of fileservice.
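
As a rough illustration of the URL restriction listed above, here is a minimal Go sketch that accepts only the three schemes named in this description. The helper name `validateStageURL` and the `allowedSchemes` map are hypothetical and are not taken from this PR; the actual validation in `pkg/frontend/authenticate.go` may differ.

```go
package main

import (
	"fmt"
	"net/url"
)

// allowedSchemes mirrors the protocols named in the PR description.
// The map itself is an assumption made for this sketch.
var allowedSchemes = map[string]bool{"file": true, "s3": true, "stage": true}

// validateStageURL is a hypothetical helper: it parses raw and rejects
// any URL whose scheme is not file, s3, or stage.
func validateStageURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("invalid stage URL %q: %w", raw, err)
	}
	if !allowedSchemes[u.Scheme] {
		return fmt.Errorf("unsupported stage URL scheme %q in %q", u.Scheme, raw)
	}
	return nil
}

func main() {
	for _, raw := range []string{
		"file:///tmp/data/a.csv",
		"s3://bucket/prefix/a.csv",
		"stage://mystage/sub/a.csv",
		"http://example.com/a.csv",
	} {
		fmt.Printf("%s -> %v\n", raw, validateStageURL(raw))
	}
}
```

In this sketch a URL without a scheme, such as a bare local path, is also rejected, because its parsed scheme is empty.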

PR Type

Bug fix, Enhancement, Tests


Description

  • Refactored export logic for file handling and CSV writing.
  • Updated stage creation and alteration logic to validate URL protocols and removed password hashing for stage credentials.
  • Added utility functions for handling stage URLs and credentials, including functions to list files in a stage.
  • Enhanced various components to support stage URLs, including compileExternScan, file existence checks, and external stats initialization.
  • Added and updated tests for stage creation, data loading, and snapshot handling with new URL formats and credentials.
  • Registered stage_list built-in function and added corresponding function ID.
  • Updated query result dumping and status statement execution to use refactored export logic.

Changes walkthrough 📝

Relevant files
Enhancement
12 files
export.go
Refactor export logic for file handling and CSV writing. 

pkg/frontend/export.go

  • Removed unused imports and variables.
  • Refactored file handling to use io.Pipe and fileservice.
  • Simplified the openNewFile function.
  • Updated CSV writing logic to use a buffer.
  • +122/-201
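
The io.Pipe approach mentioned for export.go can be pictured with the following minimal sketch, which uses only the Go standard library. It is not the code from this PR: `exportRows` and `renderCSV` are hypothetical names, and the `io.Writer` sink is a stand-in for whatever interface fileservice actually exposes.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"io"
)

// exportRows encodes rows as CSV on the write side of an io.Pipe while
// the read side is copied into sink, so producer and consumer run
// concurrently. sink is an assumed stand-in for the file service writer.
func exportRows(rows [][]string, sink io.Writer) error {
	pr, pw := io.Pipe()
	// Closing the read side unblocks the producer if the sink fails early.
	defer pr.Close()

	go func() {
		w := csv.NewWriter(pw)
		// WriteAll writes every record and flushes; any error it returns
		// is handed to the reader, and nil closes the pipe with EOF.
		pw.CloseWithError(w.WriteAll(rows))
	}()

	_, err := io.Copy(sink, pr)
	return err
}

// renderCSV runs exportRows against an in-memory buffer (illustration only).
func renderCSV(rows [][]string) (string, error) {
	var buf bytes.Buffer
	if err := exportRows(rows, &buf); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderCSV([][]string{{"id", "name"}, {"1", "alice"}})
	if err != nil {
		fmt.Println("export failed:", err)
		return
	}
	fmt.Print(out)
}
```

The pipe decouples row encoding from the destination, which is one plausible reason to route exports through a single writer abstraction instead of opening local files directly.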
    stage_util.go
    Add utility functions for stage URL handling and file listing.

    pkg/sql/plan/function/stage_util.go

  • Added new utility functions for handling stage URLs and credentials.
  • Implemented functions to parse and convert stage URLs to paths.
  • Added functions to list files in a stage with or without wildcards.
  • +430/-0 
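
To make the wildcard listing concrete, the sketch below filters a set of file names with the standard library's `path.Match`. It only illustrates the general technique: `matchFiles` is a hypothetical helper, and the pattern syntax that stage_list actually supports is not stated in this PR description.

```go
package main

import (
	"fmt"
	"path"
)

// matchFiles returns the names that match the glob pattern.
// Hypothetical helper; with path.Match, "*" does not cross "/" boundaries.
func matchFiles(names []string, pattern string) ([]string, error) {
	var out []string
	for _, name := range names {
		ok, err := path.Match(pattern, name)
		if err != nil {
			return nil, err // malformed pattern
		}
		if ok {
			out = append(out, name)
		}
	}
	return out, nil
}

func main() {
	names := []string{"a.csv", "b.parquet", "c.csv", "sub/d.csv"}
	matched, err := matchFiles(names, "*.csv")
	if err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	fmt.Println(matched)
}
```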
    authenticate.go
    Refactor stage creation and alteration logic.                       

    pkg/frontend/authenticate.go

  • Removed commented-out SQL query strings.
  • Updated stage creation and alteration to validate URL protocols.
  • Removed password hashing for stage credentials.
  • +24/-73 
    utils.go
    Add initialization functions for stage and S3 parameters.

    pkg/sql/plan/utils.go

  • Added functions to initialize parameters for stage and S3.
  • Implemented logic to handle stage URLs in external parameters.
  • +125/-0 
    func_unary.go
    Add `StageList` function for listing stage files.               

    pkg/sql/plan/function/func_unary.go

    • Added StageList function to list files in a stage.
    +55/-0   
    list_builtIn.go
    Register `stage_list` built-in function.                                 

    pkg/sql/plan/function/list_builtIn.go

    • Added stage_list function to the list of built-in functions.
    +34/-0   
    compile.go
    Enhance `compileExternScan` to support stage URLs.             

    pkg/sql/compile/compile.go

    • Updated compileExternScan to handle stage URLs.
    +16/-1   
    query_result.go
    Update query result dumping to use new export logic.         

    pkg/frontend/query_result.go

    • Updated query result dumping to use refactored export logic.
    +1/-5     
    function_id.go
    Add function ID for `STAGE_LIST`.                                               

    pkg/sql/plan/function/function_id.go

    • Added STAGE_LIST function ID.
    +6/-0     
    status_stmt.go
    Use refactored export logic in status statement execution.

    pkg/frontend/status_stmt.go

    • Updated status statement execution to use refactored export logic.
    +3/-5     
    build_load.go
    Enhance file existence check to support stage URLs.           

    pkg/sql/plan/build_load.go

    • Updated file existence check to handle stage URLs.
    +1/-1     
    external.go
    Enhance external stats initialization for stage URLs.       

    pkg/sql/plan/external.go

    • Updated external stats initialization to handle stage URLs.
    +1/-1     
    Tests
    12 files
    authenticate_test.go
    Update test cases for stage creation and alteration.         

    pkg/frontend/authenticate_test.go

  • Removed single quotes from URLs and credentials in test cases.
  • Updated URLs to use correct protocols.
  • +13/-314
    export_test.go
    Update export tests to match refactored logic.                     

    pkg/frontend/export_test.go

  • Removed tests for obsolete functions.
  • Updated tests to match refactored export logic.
  • +31/-93 
    session_test.go
    Comment out failing test case for system time zone.           

    pkg/frontend/session_test.go

    • Commented out a test case for updating time zone to "system".
    +6/-4     
    nonsys_restore_system_table_to_nonsys_account.result
    Update snapshot test results with new timestamps and credentials.

    test/distributed/cases/snapshot/nonsys_restore_system_table_to_nonsys_account.result

    • Updated timestamps and credentials in snapshot test results.
    +36/-36 
    restore_cluster_table.result
    Update cluster restore test results with new timestamps and
    credentials.

    test/distributed/cases/snapshot/cluster/restore_cluster_table.result

  • Updated timestamps and credentials in cluster restore test results.
  • +18/-18 
    cluster_level_snapshot_restore_system_table_to_nonsys.result
    Update snapshot and stage details with new timestamps and formats

    test/distributed/cases/snapshot/cluster_level_snapshot_restore_system_table_to_nonsys.result

  • Updated timestamps for snapshot creation and function definitions.
  • Changed stage credentials format.
  • Adjusted stage URLs and statuses.
  • +31/-31 
    sys_restore_system_table_to_nonsys_account.result
    Update snapshot and stage details with new timestamps and formats

    test/distributed/cases/snapshot/sys_restore_system_table_to_nonsys_account.result

  • Updated timestamps for snapshot creation and function definitions.
  • Changed stage credentials format.
  • Adjusted stage URLs and statuses.
  • +30/-30 
    sys_restore_system_table_to_newnonsys_account.result
    Update snapshot and stage details with new timestamps and formats

    test/distributed/cases/snapshot/sys_restore_system_table_to_newnonsys_account.result

  • Updated timestamps for snapshot creation and function definitions.
  • Changed stage credentials format.
  • Adjusted stage URLs and statuses.
  • +29/-29 
    stage.result
    Add and update stage creation and data loading tests         

    test/distributed/cases/stage/stage.result

  • Added new stage creation tests with various URL protocols.
  • Updated stage credentials and statuses.
  • Added tests for loading data from stages.
  • +137/-110
    load_data_parquet.result
    Add tests for loading data from parquet files                       

    test/distributed/cases/load_data/load_data_parquet.result

  • Added new table creation and data loading from parquet files.
  • Included stage creation and data loading tests.
  • +15/-1   
    load_data_parquet.sql
    Add SQL commands for loading data from parquet files         

    test/distributed/cases/load_data/load_data_parquet.sql

  • Added SQL commands for creating tables and loading data from parquet
    files.
  • Included stage creation and data loading commands.
  • +8/-1     
    stage.sql
    Update stage SQL tests for new URL formats and sub-stages

    test/distributed/cases/stage/stage.sql

  • Added new CREATE STAGE statements with various URL formats, including
    file://, s3://, and stage://.
  • Updated ALTER STAGE statements to reflect new URL formats and
    credentials.
  • Modified SELECT INTO OUTFILE statements to use stage URLs.
  • Introduced tests for listing stage directories and handling
    sub-stages.
  • +99/-65 


    @matrix-meow matrix-meow added the size/L label (a PR that changes [500,999] lines) and removed the size/XXL label (a PR that changes 2000+ lines) on Aug 21, 2024
    @cpegeric cpegeric requested a review from m-schen August 21, 2024 20:07
    @m-schen
    Contributor

    m-schen commented Aug 22, 2024

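    I don't know why "Files changed" shows so many extra changes from the main branch. I'm worried this will cause trouble.

    So I have added a not-merge label for now. If the check turns up no problems, it can be removed and the PR will be merged normally.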

    @m-schen
    Contributor

    m-schen commented Aug 22, 2024

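    > I don't know why "Files changed" shows so many extra changes from the main branch. I'm worried this will cause trouble.
    >
    > So I have added a not-merge label for now. If the check turns up no problems, it can be removed and the PR will be merged normally.

    No problem here. This should not introduce new commits or bring any code owner changes.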

    @sukki37 sukki37 closed this Aug 22, 2024
    Contributor

    mergify bot commented Aug 22, 2024

    ⚠️ The sha of the head commit of this PR conflicts with #18280. Mergify cannot evaluate rules on this PR. ⚠️

    Labels
    Bug fix, Enhancement, kind/bug (Something isn't working), kind/feature, Review effort [1-5]: 4, size/L (a PR that changes [500,999] lines), Tests