Constrained chunk shape estimation #997

CodyCBakerPhD · 2023-11-08T23:53:56Z

Motivation

Potential replacement of #996

This is actually really good timing; the latest stage of Zarr integration in NeuroConv (catalystneuro/neuroconv#569) benefits greatly from having these types of methods become static methods of the iterator (and its children) for broader usage, much as @oruebel suggests in #995 (comment)

@bendichter Anyway, I did some mathematical reduction and refactoring of the suggested approach from #996; it's actually quite similar to the original axis ratio method in that one could imagine an altogether more general algorithm that places 'weight factors' on each axis to determine how strongly to bias the distribution of bytes throughout the process...

In the recently suggested update these weight factors would all be 1; in the original, they are something proportional to the relative length of each axis

This PR also addresses @oruebel's comments by no longer requiring secondary methods

Two other changes:

I strongly suggest keeping the previous method accessible 'just in case' something unexpected arises in the new implementation; a lesson from learned from previous bugs in this iterator. The other fallback would be to force installation of a previous HDMF version, but that could be less desirable and cause other problems (such as incompatibilities) on top of that

Another lesson learned from the past, always use math.prod for these operations. Eventually these methods will be used on shapes with very large values, and np.prod overflows easily and silently (well, occasionally there might be a warning thrown but not always)

How to test the behavior?

Existing tests may require updating to the newly predicted default values

A few new tests should be added for the static usage and showcase of the expected results from each strategy

Checklist

Did you update CHANGELOG.md with your changes?
Have you checked our Contributing document?
Have you ensured the PR clearly describes the problem and the solution?
Is your contribution compliant with our coding style? This can be checked running ruff from the source directory.
Have you checked to ensure that there aren't other open Pull Requests for the same change?
Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

bendichter · 2023-11-09T02:12:51Z

src/hdmf/data_utils.py

+            estimated_chunk_shape = np.array(chunk_shape_constraints)
+            capped_axes = none_axes
+            while any(capped_axes):
+                number_of_free_axes = len(none_axes)
+
+                # Estimate the amount to fill uniformly across the unconstrained axes
+                # Note that math.prod is 1 if all axes are None
+                constrained_bytes = math.prod(estimated_chunk_shape[constrained_axes]) * itemsize
+                estimated_fill_factor = math.floor((chunk_bytes / constrained_bytes) ** (1 / number_of_free_axes))
+
+                # Update axes and constraints in the event that the fill factor pushed some axes beyond their maximum
+                capped_axes = none_axes[np.where(maxshape[none_axes] <= estimated_fill_factor)[0]]
+                estimated_chunk_shape[capped_axes] = maxshape[capped_axes]  # Cap the estimate at the max
+                chunk_shape_constraints[capped_axes] = maxshape[capped_axes]  # Consider capped axis a constraint
+                none_axes = np.where([x is None for x in chunk_shape_constraints])[0]
+                constrained_axes = np.where([x is not None for x in chunk_shape_constraints])[0]
+            estimated_chunk_shape[none_axes] = estimated_fill_factor


OK, I think this seems like a good alternative solution, which should achieve the same answer in far fewer loops, but I agree with @oruebel that it could go in a separate function. I liked the way I broke it apart where the math was in a separate function that was just trying to achieve a vector with a certain product (not knowing that it was related to the size and shape of an array).

Also, a minor point: I would remove the np.wheres. If you are going this way you can just index with the boolean np arrays, which will make the syntax a bit cleaner.

bendichter · 2023-11-09T02:15:11Z

src/hdmf/data_utils.py

+        maxshape, itemsize, chunk_shape, buffer_gb = getargs(
+            "maxshape", "chunk_shape", "itemsize", "buffer_gb", kwargs
        )
+        assert buffer_gb > 0, f"buffer_gb ({buffer_gb}) must be greater than zero!"
+        assert all(
+            chunk_axis > 0 for chunk_axis in chunk_shape
+        ), f"Some dimensions of chunk_shape ({chunk_shape}) are less than zero!"

        k = math.floor(
-            (
-                buffer_gb * 1e9 / (math.prod(self.chunk_shape) * self.dtype.itemsize)
-            ) ** (1 / len(self.chunk_shape))
-        )
-        return tuple(
-            [
-                min(max(k * x, self.chunk_shape[j]), self.maxshape[j])
-                for j, x in enumerate(self.chunk_shape)
-            ]
+            (buffer_gb * 1e9 / (math.prod(chunk_shape) * itemsize)) ** (1 / len(chunk_shape))
        )
+        return tuple([min(max(k * x, chunk_shape[j]), maxshape[j]) for j, x in enumerate(chunk_shape)])


Nice, pulling these attributes out made the code much more readable.

bendichter · 2023-11-09T02:21:44Z

I think you probably want the ratio to be informed by the overall size of the dataset. I.e., for (time, electrodes) where time is much longer than electrodes, then you'd also want the chunks to be larger in the time dimension. (@oruebel #996 (comment))

@oruebel, this was our original guess, and it was conveniently also the approach h5py uses, but our experience with @magland has shown that this can cause big problems for certain common applications and at this point I am leaning towards the hypercube approach, as I don't expect it to totally blow up in most cases. The (time, electrode) is the perfect example- it's actually much more common for users to want many channels over a short time than one channel over a long time, particularly in streaming applications.

This PR also addresses @oruebel's comments by no longer requiring secondary methods

See the way I separated the function out in my PR. I think this the function as-is is a bit too long and should be broken up, but that's more of a style thing.

I'm fine with keeping the old method for now. Let's see how often we or anyone else tends to use it.

And yes, using math.prod is a good idea. I do remember that being a particularly sneaky bug we definitely do not want to reintroduce.

refactor of pr 996

8a6d4e8

CodyCBakerPhD self-assigned this Nov 8, 2023

bendichter reviewed Nov 9, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Constrained chunk shape estimation #997

Constrained chunk shape estimation #997

CodyCBakerPhD commented Nov 8, 2023 •

edited

Loading

bendichter Nov 9, 2023 •

edited

Loading

bendichter Nov 9, 2023

bendichter commented Nov 9, 2023 •

edited

Loading

Constrained chunk shape estimation #997

Are you sure you want to change the base?

Constrained chunk shape estimation #997

Conversation

CodyCBakerPhD commented Nov 8, 2023 • edited Loading

Motivation

How to test the behavior?

Checklist

bendichter Nov 9, 2023 • edited Loading

Choose a reason for hiding this comment

bendichter Nov 9, 2023

Choose a reason for hiding this comment

bendichter commented Nov 9, 2023 • edited Loading

CodyCBakerPhD commented Nov 8, 2023 •

edited

Loading

bendichter Nov 9, 2023 •

edited

Loading

bendichter commented Nov 9, 2023 •

edited

Loading