
VectorData Expand by Default via write_dataset #1093

Merged: 15 commits merged into dev from the expand_dataset branch on May 7, 2024
Conversation

@mavaylon1 (Contributor) commented Apr 7, 2024

Motivation

What was the reasoning behind this change? Please explain the changes briefly.

This change makes expandable datasets the new default behavior when writing VectorData. We do this by providing a maxshape in the dataset settings whenever the user has not already defined one.
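For illustration, here is a minimal h5py sketch (not HDMF code) of what the new default amounts to at the file level: passing a maxshape containing `None` makes h5py store the data chunked and allows the dataset to be resized later.

```python
import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    # Without maxshape, the dataset shape is fixed at creation time.
    f.create_dataset("fixed", data=np.arange(5))

    # With maxshape=(None,), the dataset is chunked and can grow along axis 0.
    col = f.create_dataset("expandable", data=np.arange(5), maxshape=(None,))
    col.resize((10,))
    col[5:] = np.arange(5, 10)
```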

How to test the behavior?

Show how to reproduce the new behavior (can be a bug fix or a new feature)

Tests

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Does the PR clearly describe the problem and the solution?
  • Have you reviewed our Contributing Guide?
  • Does the PR use "Fix #XXX" notation to tell GitHub to close the relevant issue numbered XXX when the PR is merged?

@mavaylon1 (Contributor Author)

Notes and Questions:

  1. Does scalar_fill imply that the dataset has only one value and should only ever have one? There is no docstring. If so, I can move the maxshape logic into list_fill.
  2. There are three cases with references where the shape is defined within require_dataset. I assume this is because get_data_shape returns a shape (#, 1), where "#" is the number of references; this is an edge case I found. The quickest solution is to set maxshape at each of the three locations, the precedent being that all three locations already repeat the code that sets shape.
    @oruebel

@oruebel (Contributor) commented Apr 7, 2024

> 1. Does scalar_fill imply that the dataset has only one value and should only ever have one?

Are you referring to

def __scalar_fill__(cls, parent, name, data, options=None):

If so, this function is used to write scalar datasets, i.e., datasets with a single value.

> 2. There are three cases with references where the shape is defined within require_dataset.

Could you point to the case you are referring to? require_dataset is usually used to create a dataset if it doesn't exist and open the dataset if it does.

> 2. The quickest solution is to set maxshape at each of the three locations.

This would mean making all datasets expandable by enabling chunking for all of them. That is a broader approach than making this the default just for VectorData, but it would make it the default behavior for all (non-scalar) datasets. If that is the approach we want to take, then I would suggest adding a parameter enable_chunking=True on HDF5IO.write and HDF5IO.export so that we can configure the default behavior for write. @rly thoughts?
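A rough sketch of what such a default could look like at the dataset-creation level; the function below is a stand-in for illustration only (its name, signature, and the `enable_chunking` flag are assumptions taken from this discussion, not the HDMF API).

```python
import h5py
import numpy as np


def write_array(group, name, data, enable_chunking=True, io_settings=None):
    """Create a dataset that is expandable by default, unless the caller
    opts out or has already supplied their own maxshape."""
    io_settings = dict(io_settings or {})
    data = np.asarray(data)
    if enable_chunking and "maxshape" not in io_settings:
        # An unlimited maxshape implies chunked storage in HDF5.
        io_settings["maxshape"] = (None,) * data.ndim
    return group.create_dataset(name, data=data, **io_settings)


with h5py.File("sketch.h5", "w") as f:
    write_array(f, "expandable_col", [1, 2, 3])                    # default: expandable
    write_array(f, "fixed_col", [1, 2, 3], enable_chunking=False)  # contiguous, fixed shape
```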

@mavaylon1 (Contributor Author)

> If that is the approach we want to take, then I would suggest adding a parameter enable_chunking=True on HDF5IO.write and HDF5IO.export so that we can configure the default behavior for write.

Is the enable_chunking parameter meant to give the user the option to turn off the expandable default? If so, is there a reason they would want to?

@oruebel (Contributor) commented Apr 7, 2024

> Is the enable_chunking parameter meant to give the user the option to turn off the expandable default? If so, is there a reason they would want to?

In my experience it is best to make choices explicit and provide useful defaults rather than hiding configurations. A user may not want to use chunking if they want to use numpy memory mapping to read contiguous datasets.
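For context, this is a general h5py/NumPy detail rather than anything specific to this PR: np.memmap needs the raw data to occupy one contiguous extent in the file, which chunked datasets do not, so a user who relies on memory mapping has a concrete reason to opt out of chunking.

```python
import h5py
import numpy as np

with h5py.File("contig.h5", "w") as f:
    dset = f.create_dataset("data", data=np.arange(10, dtype="int64"))  # contiguous layout
    offset = dset.id.get_offset()  # byte offset of the raw data; None for chunked datasets
    assert offset is not None

# Map the raw bytes directly, bypassing HDF5 reads entirely.
mapped = np.memmap("contig.h5", dtype="int64", mode="r", offset=offset, shape=(10,))
print(mapped[:5])  # [0 1 2 3 4]
```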

@mavaylon1 (Contributor Author) commented May 2, 2024

@rly I will shoot to have this done by next week (the previous target was Friday, May 3).

mavaylon1 self-assigned this May 2, 2024

@mavaylon1 (Contributor Author)

Dev Notes:
When writing datasets, we have a few options:

  1. setup_chunked_dset: this is for DataChunkIterator, so we do not need to change it.
  2. scalar_fill: scalar datasets do not need to be expandable, by nature of being scalar.
  3. setup_empty_dset: this is also for DataChunkIterator, so we do not need to change it.
  4. list_fill: writes a regular in-memory array (e.g., a numpy array or a list).

From my understanding, we only need to modify the input parameter options for list_fill.

Now, Oliver mentioned being more explicit with a switch enable_chunking=True (the default will be True) on HDF5IO.write and HDF5IO.export so that we can configure the default behavior for write. This will need to be passed through the chain of methods from write and export down to write_dataset.

@oruebel (Contributor) commented May 2, 2024

> From my understanding, we only need to modify the input parameter options for list_fill.
>
> Now, Oliver mentioned being more explicit with a switch enable_chunking=True (the default will be True) on HDF5IO.write and HDF5IO.export so that we can configure the default behavior for write. This will need to be passed through the chain of methods from write and export down to write_dataset.

Yes, I believe that is correct. I think only the logic in list_fill should need to be modified, and then the enable_chunking setting will need to be passed through. Note that list_fill is already being passed the argument options, which contains io_settings, so you may just need to set chunks=True in the io_settings (if chunks is None) to enable chunking. I'm not sure whether it will be easier to make this change to io_settings inside list_fill or to update io_settings outside of list_fill so that list_fill would not need to change at all.

https://github.com/hdmf-dev/hdmf/blob/126bdb100c6d5ce3e2dadd375de9d32524219404/src/hdmf/backends/hdf5/h5tools.py#L1432C5-L1438C53
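As a sketch of that io_settings tweak (the helper and variable names here are placeholders for illustration, not the actual diff), adjusting the settings before the dataset is created keeps the rest of list_fill unchanged:

```python
def _ensure_expandable(io_settings, data_shape, enable_chunking=True):
    """Make the dataset expandable unless the caller already chose chunks or maxshape."""
    if not enable_chunking:
        return io_settings
    if io_settings.get("chunks") is None and "maxshape" not in io_settings:
        # In h5py, an unlimited maxshape implies chunked storage, so this single
        # setting is enough to make the written dataset expandable.
        io_settings["maxshape"] = tuple(None for _ in data_shape)
    return io_settings


# e.g. {'maxshape': (None, None)} for a (3, 2) array with no user-specified settings
print(_ensure_expandable({}, data_shape=(3, 2)))
```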

@mavaylon1 (Contributor Author) commented May 5, 2024

Tests:

  1. compound data: added a check to an existing test
  2. regular array data: added a check to an existing test
  3. existing tests: updated three tests (refer to the changed files)
  4. check that my update does not interfere when maxshape is already set

codecov bot commented May 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.70%. Comparing base (126bdb1) to head (58f3bf0).
Report is 29 commits behind head on dev.

Additional details and impacted files
@@           Coverage Diff           @@
##              dev    #1093   +/-   ##
=======================================
  Coverage   88.70%   88.70%           
=======================================
  Files          45       45           
  Lines        9745     9748    +3     
  Branches     2767     2769    +2     
=======================================
+ Hits         8644     8647    +3     
  Misses        779      779           
  Partials      322      322           


@mavaylon1 (Contributor Author) commented May 5, 2024

@oruebel This is mostly done. I need to check/update or write a test that my changes do not interfere with existing maxshape settings (and do another pass to make sure the logic is efficient). However, the main point I want to bring up is your idea of having a parameter for turning the expandability on and off. This would mean HDMFIO has a parameter that is not used in ZarrIO; in fact, there are tests failing because the parameter is not recognized. I see two options:

  1. Update hdmf-zarr to accept the argument and do nothing with it.
  2. Don't have the argument.

I do like the explicit nature of having it, but I also think having this trickle into hdmf-zarr is not clean.

@oruebel (Contributor) commented May 5, 2024

> This would mean HDMFIO has a parameter that is not used in ZarrIO.

I don't think the parameter needs to be in HDMFIO. I think it's ok to just add it as a parameter on HDF5IO.

@mavaylon1 (Contributor Author)

> > This would mean HDMFIO has a parameter that is not used in ZarrIO.
>
> I don't think the parameter needs to be in HDMFIO. I think it's ok to just add it as a parameter on HDF5IO.

HDF5IO.write needs to call write_builder. It does that by calling super().write(**kwargs), which gets us to HDMFIO.write, which in turn calls write_builder.

@oruebel (Contributor) commented May 5, 2024

> HDF5IO.write needs to call write_builder. It does that by calling super().write(**kwargs), which gets us to HDMFIO.write, which in turn calls write_builder.

Yes, but HDMFIO.write allows extra keyword arguments:

'default': None}, allow_extra=True)
def write(self, **kwargs):

and those are being passed through to write_builder

self.write_builder(f_builder, **kwargs)

So you can add custom keyword arguments without having to add them in HDMFIO. HDF5IO already has several additional arguments on write and write_builder that are not in HDMFIO, such as the exhaust_dci parameter.
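A stripped-down illustration of that pass-through (plain Python, not the docval machinery itself): the subclass introduces its own keyword, and the base class simply forwards whatever extras it receives to write_builder. The class and parameter names below are hypothetical.

```python
class BaseIO:
    """Stand-in for HDMFIO: write() forwards extra kwargs to write_builder()."""

    def write(self, builder, **kwargs):
        self.write_builder(builder, **kwargs)

    def write_builder(self, builder, **kwargs):
        raise NotImplementedError


class HDF5LikeIO(BaseIO):
    """Stand-in for HDF5IO: adds a backend-specific keyword of its own."""

    def write(self, builder, enable_chunking=True, **kwargs):
        # The backend-specific flag never has to appear in BaseIO's signature.
        super().write(builder, enable_chunking=enable_chunking, **kwargs)

    def write_builder(self, builder, enable_chunking=True, **kwargs):
        print(f"writing {builder!r} with enable_chunking={enable_chunking}")


HDF5LikeIO().write("my-builder", enable_chunking=False)
```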

@mavaylon1 (Contributor Author)

> So you can add custom keyword arguments without having to add them in HDMFIO. HDF5IO already has several additional arguments on write and write_builder that are not in HDMFIO, such as the exhaust_dci parameter.

Well isn't that just right in front of my face.

@mavaylon1 (Contributor Author) commented May 5, 2024

Notes:
My approach to the tests:

  1. Make sure the default is now expandable. This is accomplished by adding an additional assert to an existing test, for both regular and compound data (see the sketch below).
  2. Make sure the default does not override user-specified chunking. This is verified by the existing tests passing.
  3. Make sure that setting the flag to False turns off said chunking.
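The checks themselves boil down to assertions like the following; these are plain h5py stand-ins for what HDMF would write, not the actual test suite.

```python
import h5py
import numpy as np

with h5py.File("check.h5", "w") as f:
    # Stand-in for a column written with the new default.
    default = f.create_dataset("col", data=np.arange(4), maxshape=(None,))
    assert default.maxshape == (None,)   # expandable
    assert default.chunks is not None    # chunking implied by the unlimited maxshape

    # Stand-in for user-specified settings that the default must not override.
    user = f.create_dataset("user_col", data=np.arange(4), chunks=(2,), maxshape=(8,))
    assert user.chunks == (2,)
    assert user.maxshape == (8,)
```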

mavaylon1 marked this pull request as ready for review May 6, 2024 15:56
mavaylon1 requested a review from oruebel May 6, 2024 15:56
@oruebel (Contributor) left a comment

I added some minor suggestions, but otherwise this looks good to me.

@mavaylon1 (Contributor Author)

> I added some minor suggestions, but otherwise this looks good to me.

Thanks for the quick review. I will make the docstring more detailed, but take a look at my comments for the other changes. The pass was deliberate (not a leftover from a draft), and I like the warning.

mavaylon1 requested a review from oruebel May 6, 2024 23:32

@oruebel (Contributor) left a comment

Looks good to me. Thanks!

mavaylon1 merged commit 201b8c4 into dev May 7, 2024
27 of 28 checks passed
mavaylon1 deleted the expand_dataset branch May 7, 2024 11:44
@@ -7,6 +7,7 @@
- Added `TypeConfigurator` to automatically wrap fields with `TermSetWrapper` according to a configuration file. @mavaylon1 [#1016](https://github.com/hdmf-dev/hdmf/pull/1016)
- Updated `TermSetWrapper` to support validating a single field within a compound array. @mavaylon1 [#1061](https://github.com/hdmf-dev/hdmf/pull/1061)
- Updated testing to not install in editable mode and not run `coverage` by default. @rly [#1107](https://github.com/hdmf-dev/hdmf/pull/1107)
- Updated the default behavior for writing HDF5 datasets to be expandandable datasets with chunking enabled by default. This does not override user set chunking parameters. @mavaylon1 [#1093](https://github.com/hdmf-dev/hdmf/pull/1093)
A review comment from a Contributor on the changelog line above:

expandandable -> expandable

@rly (Contributor) commented May 7, 2024

Could you add documentation on how to expand a VectorData?

It looks like creation of a dataset of references is not modified here. Some tables in NWB contain columns that are all references, e.g., the electrode table has a column with references to the ElectrodeGroup. I think such datasets should be expandable as well.

@mavaylon1 (Contributor Author)

> Could you add documentation on how to expand a VectorData?
>
> It looks like creation of a dataset of references is not modified here. Some tables in NWB contain columns that are all references, e.g., the electrode table has a column with references to the ElectrodeGroup. I think such datasets should be expandable as well.

Yeah, the lack of support for datasets of references was just to keep the scope of this change smaller. I agree this makes a lot of sense to have; I will open an issue ticket for it.

As for documentation on expanding VectorData, I thought we had that. Maybe I am thinking of the HDF5 documentation, but I will look; if it does not exist, I will fold that into the ticket for datasets of references.
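For reference, once a column has been written with an unlimited maxshape, the low-level expansion with h5py looks like the following; the in-file path is purely illustrative, and appending through the HDMF API is what the documentation should ultimately cover.

```python
import h5py
import numpy as np

# Assumes "table.h5" already contains a column written with maxshape=(None,).
with h5py.File("table.h5", "a") as f:
    col = f["my_table/my_column"]      # hypothetical path to the column dataset
    n = col.shape[0]
    col.resize((n + 3,))               # grow the dataset by three rows
    col[n:] = np.array([7, 8, 9])      # write the appended values
```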
