[Feature]: Define partial chunk shape for GenericDataChunkIterator #995

Open
3 tasks done
bendichter opened this issue Nov 8, 2023 · 4 comments · May be fixed by #996 or #997
Labels
category: enhancement improvements of code or code behavior priority: medium non-critical problem and/or affecting only a small set of users

Comments

@bendichter
Contributor

What would you like to see added to HDMF?

Right now, for the GenericDataChunkIterator, it's possible to define chunk_mb or chunk_shape. I would like to enable a hybrid approach, where a user could input chunk_mb=10.0, chunk_shape=(None, 64), and the GenericDataChunkIterator would identify the remaining dimension that gets you close to the target chunk size.
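For a concrete sense of the numbers, here is a minimal sketch of the case with a single free axis. It assumes 2-byte samples (e.g. int16), that 1 MB means 10^6 bytes, and a made-up data shape. `fill_free_axis` is a hypothetical name used only for illustration, not an existing HDMF function.

```python
import math


def fill_free_axis(chunk_mb, partial_shape, maxshape, itemsize):
    """Hypothetical helper: resolve the single None in partial_shape.

    Assumes 1 MB == 1e6 bytes and exactly one free (None) axis.
    """
    free_axes = [i for i, n in enumerate(partial_shape) if n is None]
    if len(free_axes) != 1:
        raise ValueError("this sketch handles exactly one free axis")
    axis = free_axes[0]

    # Number of elements that fit in the byte budget.
    budget = int(chunk_mb * 1e6) // itemsize
    # Elements already claimed by the user-fixed axes.
    fixed = math.prod(n for n in partial_shape if n is not None)

    # Largest length that stays within budget, clipped to the data extent.
    length = max(1, min(budget // fixed, maxshape[axis]))
    shape = list(partial_shape)
    shape[axis] = length
    return tuple(shape)


# 10 MB of 2-byte samples with 64 columns fixed by the user.
print(fill_free_axis(10.0, (None, 64), maxshape=(1_000_000, 64), itemsize=2))
# (78125, 64)
```

Here 78125 * 64 * 2 bytes is exactly 10^7 bytes, so the resolved chunk hits the 10 MB target.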

Is your feature request related to a problem?

It is pretty common for users to have some insight into the likely read patterns of a dataset.

What solution would you like?

I would like GenericDataChunkIterator to find the largest chunk size (product of the chunk dimensions) that is <= the target size. I would also like the chunk to be as cube-like as possible, so I would like to minimize the sum of the chunk's dimensions. Previously, we tried building chunks that were scaled-down versions of the data shape, similar to h5py, but experience with Jeremy has shown that this approach is poorly suited for common data reading routines, and I think a better naive assumption would be that (hyper)cube chunks are a good default.

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

@oruebel oruebel added this to the 4.0 milestone Nov 8, 2023
@oruebel oruebel added the priority: medium non-critical problem and/or affecting only a small set of users label Nov 8, 2023
@oruebel
Contributor

oruebel commented Nov 8, 2023

@CodyCBakerPhD is this an issue you could help with, since you are most familiar with GenericDataChunkIterator?

@oruebel oruebel added the category: enhancement improvements of code or code behavior label Nov 8, 2023
@bendichter bendichter linked a pull request Nov 8, 2023 that will close this issue
6 tasks
@bendichter
Contributor Author

Above is a proposed solution. Obviously this needs tests, but I wanted to run it by the group before moving forward.

@oruebel
Contributor

oruebel commented Nov 8, 2023

To be honest, the functionality you describe sounds to me more like a utility function that would be more broadly useful for DataChunk iterators. I.e., this could be a method (e.g., max_chunk_shape) that a user would call to get suggested chunk sizes that they would then hand to the iterator, e.g.:

GenericDataChunkIterator(chunk_shape=max_chunk_shape(chunk_mb=10.0, chunk_shape=(None, 64)), ...)

This function could either live on its own in the same module as GenericDataChunkIterator or maybe be a static method on AbstractDataChunkIterator or DataChunk (but I think keeping it separate may make sense). The main reason I think this would be useful to do as a utility function is that it:

  • Makes the logic reusable
  • Makes the logic explicit, since the user calls the function, rather than being additional "magic" inside the constructor of GenericDataChunkIterator

@oruebel
Contributor

oruebel commented Nov 8, 2023

Sorry, I didn't see that you made a draft PR; I was referring to the solution you suggested in the issue. Let me take a look at the PR.

@CodyCBakerPhD CodyCBakerPhD linked a pull request Nov 8, 2023 that will close this issue
6 tasks
@mavaylon1 mavaylon1 removed this from the 4.0 milestone Mar 3, 2024