Hive partition columns with leading underscore: No match for FieldRef.Name(_file) #44352

Closed
tmontes opened this issue Oct 9, 2024 · 4 comments



tmontes commented Oct 9, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Hi Arrow team, thanks for sharing such a powerful and fundamental data handling lib! :)

I'm failing to read a Hive-partitioned Parquet dataset whose partition column names have a leading underscore, using the latest pandas 2.2.3 + PyArrow 17.0.0 combination.

I admit I might be doing something wrong, but I found nothing to guide me after browsing the docs, searching the web, and even asking a few LLMs (!!!). The fact is that other tools, like DuckDB, which I also use often, read the same dataset without issue.
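
For reference, a sketch of the kind of DuckDB check that works for me (hypothetical query; the glob assumes the dataset layout created by the reproduction script below):

import duckdb

# DuckDB exposes the underscore-prefixed partition columns just fine
# when hive_partitioning is enabled.
count = duckdb.sql(
    "SELECT count(*) "
    "FROM read_parquet('dataset/**/*.parquet', hive_partitioning=true) "
    "WHERE _file = 'a'"
).fetchone()[0]
assert count == 2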

REPRODUCTION:

import pathlib
import tempfile

import pandas as pd
import pyarrow.dataset as ds


YEAR_COLUMN = '_year'
FILE_COLUMN = '_file'


with tempfile.TemporaryDirectory() as td:

    dataset_path = pathlib.Path(td) / 'dataset'

    # create parquet dataset partitioned by YEAR_COLUMN / FILE_COLUMN
    pd.DataFrame([
        {'data': 0, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
        {'data': 1, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
        {'data': 2, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 4, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 5, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 6, YEAR_COLUMN: 2021, FILE_COLUMN: 'b'},
        {'data': 7, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
        {'data': 8, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
    ]).to_parquet(
        dataset_path,
        partition_cols=[YEAR_COLUMN, FILE_COLUMN],
        index=False,
    )

    # get dataset row_count for a given FILE_COLUMN value: 'a' in this case
    dataset = ds.dataset(
        dataset_path,
        partitioning=ds.partitioning(flavor='hive')
    )
    row_count_for_file_a = sum(
        batch.num_rows
        for batch in dataset.to_batches(
            columns=[YEAR_COLUMN],
            filter=(ds.field(FILE_COLUMN) == 'a')
        )
    )
    assert row_count_for_file_a == 2

FAILURE:

$ python x.py
Traceback (most recent call last):
  File ".../x.py", line 39, in <module>
    for batch in dataset.to_batches(
                 ^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 475, in pyarrow._dataset.Dataset.to_batches
  File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 3557, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 3475, in pyarrow._dataset.Scanner._make_scan_options
  File "pyarrow/_dataset.pyx", line 3409, in pyarrow._dataset._populate_builder
  File "pyarrow/_compute.pyx", line 2724, in pyarrow._compute._bind
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(_file) in
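
For what it's worth, printing the inferred schema right after the ds.dataset() call shows what the filter fails to bind against (a diagnostic sketch; the commented output is what I'd expect given the truncated error above):

print(dataset.schema)
# data: int64
#
# Neither _year nor _file was discovered, so the filter
# expression has no field to bind to.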

MORE:

  • Removing the leading underscore from only one of the two partitioning columns still fails.
  • The code works only when neither partitioning column name has a leading underscore:

                       YEAR_COLUMN='_year'                 YEAR_COLUMN='year'
  FILE_COLUMN='_file'  No match for FieldRef.Name(_file)   No match for FieldRef.Name(_file)
  FILE_COLUMN='file'   No match for FieldRef.Name(file)    works

LASTLY:

  • A consistent observation is that pd.read_parquet on the same dataset returns an empty DataFrame, which I suspect happens for the same underlying reason (a minimal sketch of that check follows).
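
A minimal sketch of that check, reusing dataset_path from the reproduction script above:

import pandas as pd

# Reading the dataset back yields no rows at all.
df = pd.read_parquet(dataset_path)
assert df.empty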

QUESTION:

  • I found no docs stating that leading underscores in Hive-partition column names are invalid; maybe I missed them.
  • Could this be a bug? Or am I coding it wrong?

Thanks for the Arrow project and any insight/assistance on this.

Component(s)

Python


tmontes commented Oct 9, 2024

Variation of #42160?


tmontes commented Oct 9, 2024

UPDATE:

  • Investigating related issues (why didn't I do that before?!) led me to the solution.
  • I was, indeed, missing something.

FIX:

  • Add ignore_prefixes=['.'] to the ds.dataset() call. Explanation: the default value is ['.', '_'], so discovery skips any file or directory whose name starts with either prefix, which silently hides the _year=.../_file=... partition directories (see the snippet below).
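
The relevant change in isolation (the full working script is in the next comment):

dataset = ds.dataset(
    dataset_path,
    partitioning=ds.partitioning(flavor='hive'),
    ignore_prefixes=['.'],  # override the ['.', '_'] default
)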


tmontes commented Oct 9, 2024

WORKING VARIATION:

import pathlib
import tempfile

import pandas as pd
import pyarrow.dataset as ds


YEAR_COLUMN = '_year'
FILE_COLUMN = '_file'


with tempfile.TemporaryDirectory() as td:

    dataset_path = pathlib.Path(td) / 'dataset'

    # create parquet dataset partitioned by YEAR_COLUMN / FILE_COLUMN
    pd.DataFrame([
        {'data': 0, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
        {'data': 1, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
        {'data': 2, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 4, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 5, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 6, YEAR_COLUMN: 2021, FILE_COLUMN: 'b'},
        {'data': 7, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
        {'data': 8, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
    ]).to_parquet(
        dataset_path,
        partition_cols=[YEAR_COLUMN, FILE_COLUMN],
        index=False,
    )

    # get dataset row_count for a given FILE_COLUMN value: 'a' in this case
    dataset = ds.dataset(
        dataset_path,
        partitioning=ds.partitioning(flavor='hive'),
        # override the default ['.', '_'] so discovery does not skip the _year=.../_file=... dirs
        ignore_prefixes=['.'],
    )
    row_count_for_file_a = sum(
        batch.num_rows
        for batch in dataset.to_batches(
            columns=[YEAR_COLUMN],
            filter=(ds.field(FILE_COLUMN) == 'a')
        )
    )
    assert row_count_for_file_a == 2

CLOSING

PS: Thanks and sorry for the noise! :-)

tmontes closed this as completed Oct 9, 2024