
Support Hive partitioning at the Npix level #435

Open
3 tasks done
troyraen opened this issue Oct 3, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

troyraen commented Oct 3, 2024

Feature request

Request

Allow Npix to be a directory and to be named without ".parquet" at the end.

This could be a relaxation of the current standard (where Npix is expected to be a single file) rather than a complete change. There would be some work to update your existing code, but since most parquet readers accept files and directories interchangeably, you could probably continue supporting the current standard without much trouble.

Benefits

  1. Pyarrow (and thus other readers that use it under the hood) could recognize Npix as an actual partition, the same way it recognizes Norder and Dir.

    • This recognition can be quite important for efficient reads. I haven't tested the relative efficiency of this specific change, but the screenshot and surrounding text in hats#367 ("Partitioning column dtypes conflict with Pyarrow's handling of Hive partitioning") show a similar case.
    • This recognition would also mean that methods like pyarrow.dataset.get_partition_keys (demonstrated below) would return keys for all three partition levels instead of just the first two.
  2. More than one file could be allowed in each leaf partition.

    • This could make it much easier to update HATS catalogs. I have specific use cases in mind for both IRSA and Pitt-Google Broker that would rely on being able to do this.

Example demonstrating that Npix is not currently recognized as a partition

One way to see this is to ask pyarrow what the partitioning keys/values are for a specific leaf partition. Even if we explicitly tell it that Npix is a partition, it won't recognize it.

import pyarrow.dataset

# Assuming we're in the hipscat-import root directory.
small_sky_object_catalog = "tests/hipscat_import/data/small_sky_object_catalog"
ignore_prefixes = [".", "_", "catalog_info.json", "partition_info.csv", "point_map.fits", "provenance_info.json"]

# Explicitly define the partitioning, including Npix.
partitioning_fields = [
    pyarrow.field(name="Norder", type=pyarrow.uint8()),
    pyarrow.field(name="Dir", type=pyarrow.uint64()),
    pyarrow.field(name="Npix", type=pyarrow.uint64()),
]
partitioning = pyarrow.dataset.partitioning(schema=pyarrow.schema(partitioning_fields), flavor="hive")

# Load the dataset and get a single fragment.
dataset = pyarrow.dataset.dataset(
    small_sky_object_catalog, ignore_prefixes=ignore_prefixes, partitioning=partitioning
)
frag = next(dataset.get_fragments())

# Look at the file path so we know which partition this is.
frag.path
# Output: 'tests/hipscat_import/data/small_sky_object_catalog/Norder=0/Dir=0/Npix=11.parquet'

# Ask for the expression that IDs this specific partition.
frag.partition_expression
# Output: <pyarrow.compute.Expression ((Norder == 0) and (Dir == 0))>

# The above didn't show the Npix partition.
# Just to make sure it's not hidden somewhere in the Expression object, ask for a plain dict.
pyarrow.dataset.get_partition_keys(frag.partition_expression)
# Output: {'Norder': 0, 'Dir': 0}

Before submitting
Please check the following:

  • I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
  • I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
  • If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.
@troyraen troyraen added the enhancement New feature or request label Oct 3, 2024
@nevencaplar
Member

To be implemented in HATS 0.4

@delucchi-cmu delucchi-cmu transferred this issue from astronomy-commons/hats Oct 11, 2024
@delucchi-cmu
Contributor

I believe there are no changes required in HATS, but we may want to allow for more custom path-finding when loading via LSDB.
