From_flat catalog creation #434

Open
nevencaplar opened this issue Oct 10, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@nevencaplar
Member

Provide a function, provisionally called from_flat, that creates a catalog with nested structure from a single source table, i.e., a list of observations of astronomical objects, where observations of the same object can repeat.

Connected with #421.

@dougbrn Can you elaborate further or explain what I might have gotten wrong?

@nevencaplar nevencaplar added the enhancement New feature or request label Oct 10, 2024
@dougbrn
Contributor

dougbrn commented Oct 10, 2024

Looks good, would just add the (maybe obvious) rider that this functionality is already available within Nested-Dask: https://nested-dask.readthedocs.io/en/latest/autoapi/nested_dask/core/index.html#nested_dask.core.NestedFrame.from_flat

So this ticket would just be creating a catalog function that directly wraps/uses this.
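
For readers unfamiliar with the flat-to-nested step, here is a minimal plain-Python sketch of the idea. It is not the Nested-Dask implementation; the helper `nest_flat_rows`, its parameter names, and the sample columns (`object_id`, `ra`, `dec`, `mjd`, `flux`) are assumptions made up for illustration.

```python
def nest_flat_rows(rows, on, base_columns, name="lc"):
    """Collapse repeated observation rows into one record per object.

    Hypothetical helper for illustration only; it is not the nested-dask API.
    """
    objects = {}
    for row in rows:
        key = row[on]
        if key not in objects:
            # First time we see this object: copy its per-object columns.
            objects[key] = {col: row[col] for col in base_columns}
            objects[key][name] = []
        # Every remaining column is treated as a per-observation value.
        obs = {k: v for k, v in row.items() if k != on and k not in base_columns}
        objects[key][name].append(obs)
    return objects


# Three observations of two objects (made-up values).
flat = [
    {"object_id": 1, "ra": 10.0, "dec": -5.0, "mjd": 59000.1, "flux": 1.2},
    {"object_id": 1, "ra": 10.0, "dec": -5.0, "mjd": 59001.1, "flux": 1.3},
    {"object_id": 2, "ra": 20.0, "dec": 15.0, "mjd": 59000.2, "flux": 0.7},
]
nested = nest_flat_rows(flat, on="object_id", base_columns=["ra", "dec"])
```

Each object ends up with its base columns plus a nested `lc` list holding its observations; a catalog-level wrapper would apply the real per-partition equivalent of this.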

@dougbrn
Contributor

dougbrn commented Oct 15, 2024

@hombit do you think it would be good to follow #421 here and provide a nest_flat function within the catalog class? Or do you think this should diverge from #421 and be a catalog constructor class? Or just do both?

@hombit
Contributor

hombit commented Oct 15, 2024

I believe it should be consistent with nest_lists, because these two are very close to each other.

@nevencaplar nevencaplar moved this to Suggested Todo in HATS / LSDB Oct 29, 2024
@hombit
Contributor

hombit commented Nov 7, 2024

Implementing from_flat poses one challenge: generating a new _healpix_29 index. In general, the original _healpix_29 differs between observations of a single object (e.g. for Zubercal, LSST DRs, etc.). This is the pipeline I'd propose (for each pixel):

  1. Concatenate the catalog and margin partitions and call NestedFrame.from_flat(df.reset_index(), on='column_name', name='lc')
  2. Now we have a nested column lc with the _healpix_29 subcolumn. First, we should split the df into "catalog" and "margin" dfs. "margin" would hold the objects whose lc._healpix_29 list-values all lie outside the partition pixel. "catalog" would include all other objects.
  3. Then, we aggregate lc._healpix_29 into a new index value. There could be different strategies, but basically we want to select one of the _healpix_29 values as the new index, and for the "catalog" df it should be a _healpix_29 value within the partition.
    a. A possible strategy is just selecting the smallest _healpix_29 value,
    b. or the order-29 tile closest to the average coordinates (which we can get by converting healpix-29 to RA & Dec)
  4. Reindex both "catalog" and "margin" dfs with this new index and construct a new Catalog object
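
The steps above can be sketched in plain Python for a single partition pixel. This is only a rough illustration under stated assumptions, not LSDB or Nested-Dask code: the helpers `in_pixel` and `nest_and_reindex` are hypothetical, rows are plain dicts rather than a NestedFrame, and `_healpix_29` is assumed to use NESTED ordering, so an order-k pixel covers a contiguous range of order-29 values that a bit shift can test. Only strategy (a), the smallest value, is shown.

```python
from collections import defaultdict

MAX_ORDER = 29  # order of the _healpix_29 index


def in_pixel(healpix_29, order, pixel):
    """True if an order-29 index falls inside the (order, pixel) tile.

    Assumes NESTED ordering: each order step adds two bits to the index.
    """
    return healpix_29 >> (2 * (MAX_ORDER - order)) == pixel


def nest_and_reindex(rows, on, order, pixel):
    """Hypothetical per-pixel pipeline over concatenated catalog + margin rows."""
    # Step 1: group repeated observations by object id (the from_flat step).
    nested = defaultdict(list)
    for row in rows:
        nested[row[on]].append(row)

    catalog, margin = [], []
    for obj_id, lc in nested.items():
        values = [r["_healpix_29"] for r in lc]
        inside = [v for v in values if in_pixel(v, order, pixel)]
        if inside:
            # Steps 2-3: at least one observation lies in the partition, so the
            # object is "catalog"; index it by the smallest in-partition value.
            catalog.append((min(inside), obj_id, lc))
        else:
            # All observations lie outside the partition: the object is "margin".
            margin.append((min(values), obj_id, lc))

    # Step 4: order both parts by the new index.
    catalog.sort(key=lambda t: t[:2])
    margin.sort(key=lambda t: t[:2])
    return catalog, margin
```

For example, with `order=0, pixel=1` the partition covers order-29 values from `1 << 58` up to, but not including, `2 << 58`. An object with one observation inside that range and one outside stays in the catalog part and is indexed by the inside value.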

@nevencaplar nevencaplar moved this from Suggested Todo to Todo in HATS / LSDB Nov 14, 2024
@nevencaplar nevencaplar moved this from Todo to Suggested Todo in HATS / LSDB Nov 15, 2024
Projects
Status: Suggested Todo
Development

No branches or pull requests

3 participants