Hive partition columns with leading underscore: No match for FieldRef.Name(_file) #44352

Closed
tmontes opened this issue Oct 9, 2024 · 4 comments



tmontes commented Oct 9, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Hi Arrow team, thanks for sharing such a powerful and fundamental data handling lib! :)

I'm failing to read a Hive-partitioned Parquet dataset whose partition column names have a leading underscore, using the latest pandas 2.2.3 + PyArrow 17.0.0 combination.

I admit I might be doing something wrong, but I found nothing to guide me after browsing the docs, searching the web, and even asking a few LLMs (!!!). The fact is that other tools, like DuckDB, which I also use often, read the same dataset without issue.
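
For reference, a sketch of the kind of DuckDB check that works for me (hypothetical query; the glob assumes the dataset layout created by the reproduction script below):

import duckdb

# DuckDB exposes the underscore-prefixed partition columns just fine
# when hive_partitioning is enabled.
count = duckdb.sql(
    "SELECT count(*) "
    "FROM read_parquet('dataset/**/*.parquet', hive_partitioning=true) "
    "WHERE _file = 'a'"
).fetchone()[0]
assert count == 2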

REPRODUCTION:

import pathlib
import tempfile

import pandas as pd
import pyarrow.dataset as ds


YEAR_COLUMN = '_year'
FILE_COLUMN = '_file'


with tempfile.TemporaryDirectory() as td:

    dataset_path = pathlib.Path(td) / 'dataset'

    # create parquet dataset partitioned by YEAR_COLUMN / FILE_COLUMN
    pd.DataFrame([
        {'data': 0, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
        {'data': 1, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
        {'data': 2, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 4, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 5, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 6, YEAR_COLUMN: 2021, FILE_COLUMN: 'b'},
        {'data': 7, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
        {'data': 8, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
    ]).to_parquet(
        dataset_path,
        partition_cols=[YEAR_COLUMN, FILE_COLUMN],
        index=False,
    )

    # get dataset row_count for a given FILE_COLUMN value: 'a' in this case
    dataset = ds.dataset(
        dataset_path,
        partitioning=ds.partitioning(flavor='hive')
    )
    row_count_for_file_a = sum(
        batch.num_rows
        for batch in dataset.to_batches(
            columns=[YEAR_COLUMN],
            filter=(ds.field(FILE_COLUMN) == 'a')
        )
    )
    assert row_count_for_file_a == 2

FAILURE:

$ python x.py
Traceback (most recent call last):
  File ".../x.py", line 39, in <module>
    for batch in dataset.to_batches(
                 ^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 475, in pyarrow._dataset.Dataset.to_batches
  File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 3557, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 3475, in pyarrow._dataset.Scanner._make_scan_options
  File "pyarrow/_dataset.pyx", line 3409, in pyarrow._dataset._populate_builder
  File "pyarrow/_compute.pyx", line 2724, in pyarrow._compute._bind
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(_file) in
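
For what it's worth, printing the inferred schema right after the ds.dataset() call shows what the filter fails to bind against (a diagnostic sketch; the commented output is what I'd expect given the truncated error above):

print(dataset.schema)
# data: int64
#
# Neither _year nor _file was discovered, so the filter
# expression has no field to bind to.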

MORE:

  • Removing the leading underscore from only one of the two partitioning columns still fails.
  • The code works only when neither partitioning column name has a leading underscore:

                       YEAR_COLUMN='_year'                 YEAR_COLUMN='year'
  FILE_COLUMN='_file'  No match for FieldRef.Name(_file)   No match for FieldRef.Name(_file)
  FILE_COLUMN='file'   No match for FieldRef.Name(file)    works

LASTLY:

  • A consistent observation is that pd.read_parquet on the same dataset returns an empty DataFrame, which I suspect happens for the same underlying reason (a minimal sketch of that check follows).
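
A minimal sketch of that check, reusing dataset_path from the reproduction script above:

import pandas as pd

# Reading the dataset back yields no rows at all.
df = pd.read_parquet(dataset_path)
assert df.empty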

QUESTION:

  • I found no docs stating that leading underscores in Hive-partition column names are invalid; maybe I missed them.
  • Could this be a bug? Or am I coding it wrong?

Thanks for the Arrow project and any insight/assistance on this.

Component(s)

Python


tmontes commented Oct 9, 2024

Variation of #42160?


tmontes commented Oct 9, 2024

UPDATE:

  • Investigating related issues (why didn't I do that before?!) led me to the solution.
  • I was, indeed, missing something.

FIX:

  • Add ignore_prefixes=['.'] to the ds.dataset() call. Explanation: the default value is ['.', '_'], so discovery skips any file or directory whose name starts with either prefix, which silently hides the _year=.../_file=... partition directories (see the snippet below).
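
The relevant change in isolation (the full working script is in the next comment):

dataset = ds.dataset(
    dataset_path,
    partitioning=ds.partitioning(flavor='hive'),
    ignore_prefixes=['.'],  # override the ['.', '_'] default
)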


tmontes commented Oct 9, 2024

WORKING VARIATION:

import pathlib
import tempfile

import pandas as pd
import pyarrow.dataset as ds


YEAR_COLUMN = '_year'
FILE_COLUMN = '_file'


with tempfile.TemporaryDirectory() as td:

    dataset_path = pathlib.Path(td) / 'dataset'

    # create parquet dataset partitioned by YEAR_COLUMN / FILE_COLUMN
    pd.DataFrame([
        {'data': 0, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
        {'data': 1, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
        {'data': 2, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 4, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 5, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 6, YEAR_COLUMN: 2021, FILE_COLUMN: 'b'},
        {'data': 7, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
        {'data': 8, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
    ]).to_parquet(
        dataset_path,
        partition_cols=[YEAR_COLUMN, FILE_COLUMN],
        index=False,
    )

    # get dataset row_count for a given FILE_COLUMN value: 'a' in this case
    dataset = ds.dataset(
        dataset_path,
        partitioning=ds.partitioning(flavor='hive'),
        # override the default ['.', '_'] so discovery does not skip the _year=.../_file=... dirs
        ignore_prefixes=['.'],
    )
    row_count_for_file_a = sum(
        batch.num_rows
        for batch in dataset.to_batches(
            columns=[YEAR_COLUMN],
            filter=(ds.field(FILE_COLUMN) == 'a')
        )
    )
    assert row_count_for_file_a == 2

CLOSING

PS: Thanks and sorry for the noise! :-)

tmontes closed this as completed Oct 9, 2024