
VectorData Expand by Default via write_dataset #1093

Merged: 15 commits merged into dev from the expand_dataset branch on May 7, 2024
Conversation

@mavaylon1 (Contributor) commented Apr 7, 2024

Motivation

What was the reasoning behind this change? Please explain the changes briefly.

This change makes expandable datasets the new default behavior when writing VectorData. We do this by providing a maxshape in the dataset settings whenever the user has not already defined one.
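For illustration, here is a minimal h5py sketch (not HDMF code) of what the new default amounts to at the file level: passing a maxshape containing `None` makes h5py store the data chunked and allows the dataset to be resized later.

```python
import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    # Without maxshape, the dataset shape is fixed at creation time.
    f.create_dataset("fixed", data=np.arange(5))

    # With maxshape=(None,), the dataset is chunked and can grow along axis 0.
    col = f.create_dataset("expandable", data=np.arange(5), maxshape=(None,))
    col.resize((10,))
    col[5:] = np.arange(5, 10)
```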

How to test the behavior?

Show how to reproduce the new behavior (can be a bug fix or a new feature)

Tests

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Does the PR clearly describe the problem and the solution?
  • Have you reviewed our Contributing Guide?
  • Does the PR use "Fix #XXX" notation to tell GitHub to close the relevant issue numbered XXX when the PR is merged?

@mavaylon1 (Contributor Author)

Notes and Questions:

  1. Does scalar_fill imply that the dataset has only one value and should only ever have one? There is no docstring. If so, I can move the maxshape logic into list_fill.
  2. There are three cases with references where the shape is defined within require_dataset. I assume this is because get_data_shape returns a shape (#, 1), where "#" is the number of references; this is an edge case I found. The quickest solution is to set maxshape at each of the three locations, the precedent being that all three locations already repeat the code that sets shape.
    @oruebel

@oruebel (Contributor) commented Apr 7, 2024

> 1. Does scalar_fill imply that the dataset has only one value and should only ever have one?

Are you referring to

def __scalar_fill__(cls, parent, name, data, options=None):

If so, this function is used to write scalar datasets, i.e., datasets with a single value.

> 2. There are three cases with references where the shape is defined within require_dataset.

Could you point to the case you are referring to? require_dataset is usually used to create a dataset if it doesn't exist and open the dataset if it does.

> 2. The quickest solution is to set maxshape at each of the three locations.

This would mean making all datasets expandable by enabling chunking for all of them. That is a broader approach than making this the default just for VectorData, but it would make it the default behavior for all (non-scalar) datasets. If that is the approach we want to take, then I would suggest adding a parameter enable_chunking=True on HDF5IO.write and HDF5IO.export so that we can configure the default behavior for write. @rly thoughts?
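A rough sketch of what such a default could look like at the dataset-creation level; the function below is a stand-in for illustration only (its name, signature, and the `enable_chunking` flag are assumptions taken from this discussion, not the HDMF API).

```python
import h5py
import numpy as np


def write_array(group, name, data, enable_chunking=True, io_settings=None):
    """Create a dataset that is expandable by default, unless the caller
    opts out or has already supplied their own maxshape."""
    io_settings = dict(io_settings or {})
    data = np.asarray(data)
    if enable_chunking and "maxshape" not in io_settings:
        # An unlimited maxshape implies chunked storage in HDF5.
        io_settings["maxshape"] = (None,) * data.ndim
    return group.create_dataset(name, data=data, **io_settings)


with h5py.File("sketch.h5", "w") as f:
    write_array(f, "expandable_col", [1, 2, 3])                    # default: expandable
    write_array(f, "fixed_col", [1, 2, 3], enable_chunking=False)  # contiguous, fixed shape
```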

@mavaylon1 (Contributor Author)

> If that is the approach we want to take, then I would suggest adding a parameter enable_chunking=True on HDF5IO.write and HDF5IO.export so that we can configure the default behavior for write.

Is the enable_chunking parameter meant to give the user the option to turn off the expandable default? If so, is there a reason they would want to?

@oruebel (Contributor) commented Apr 7, 2024

> Is the enable_chunking parameter meant to give the user the option to turn off the expandable default? If so, is there a reason they would want to?

In my experience it is best to make choices explicit and provide useful defaults rather than hiding configurations. A user may not want to use chunking if they want to use numpy memory mapping to read contiguous datasets.
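For context, this is a general h5py/NumPy detail rather than anything specific to this PR: np.memmap needs the raw data to occupy one contiguous extent in the file, which chunked datasets do not, so a user who relies on memory mapping has a concrete reason to opt out of chunking.

```python
import h5py
import numpy as np

with h5py.File("contig.h5", "w") as f:
    dset = f.create_dataset("data", data=np.arange(10, dtype="int64"))  # contiguous layout
    offset = dset.id.get_offset()  # byte offset of the raw data; None for chunked datasets
    assert offset is not None

# Map the raw bytes directly, bypassing HDF5 reads entirely.
mapped = np.memmap("contig.h5", dtype="int64", mode="r", offset=offset, shape=(10,))
print(mapped[:5])  # [0 1 2 3 4]
```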

@mavaylon1 (Contributor Author) commented May 2, 2024

@rly I will shoot to have this done by next week (the previous target was Friday, May 3).

mavaylon1 self-assigned this May 2, 2024

@mavaylon1 (Contributor Author)

Dev Notes:
When writing datasets, we have a few options:

  1. setup_chunked_dset: this is for DataChunkIterator, so we do not need to change it.
  2. scalar_fill: scalar datasets do not need to be expandable, by nature of being scalar.
  3. setup_empty_dset: this is also for DataChunkIterator, so we do not need to change it.
  4. list_fill: writes a regular in-memory array (e.g., a numpy array or a list).

From my understanding, we only need to modify the input parameter options for list_fill.

Now, Oliver mentioned being more explicit with a switch enable_chunking=True (the default will be True) on HDF5IO.write and HDF5IO.export so that we can configure the default behavior for write. This will need to be passed through the chain of methods from write and export down to write_dataset.

@oruebel (Contributor) commented May 2, 2024

> From my understanding, we only need to modify the input parameter options for list_fill.
>
> Now, Oliver mentioned being more explicit with a switch enable_chunking=True (the default will be True) on HDF5IO.write and HDF5IO.export so that we can configure the default behavior for write. This will need to be passed through the chain of methods from write and export down to write_dataset.

Yes, I believe that is correct. I think only the logic in list_fill should need to be modified, and then the enable_chunking setting will need to be passed through. Note that list_fill is already being passed the argument options, which contains io_settings, so you may just need to set chunks=True in the io_settings (if chunks is None) to enable chunking. I'm not sure whether it will be easier to make this change to io_settings inside list_fill or to update io_settings outside of list_fill so that list_fill would not need to change at all.

https://github.com/hdmf-dev/hdmf/blob/126bdb100c6d5ce3e2dadd375de9d32524219404/src/hdmf/backends/hdf5/h5tools.py#L1432C5-L1438C53
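As a sketch of that io_settings tweak (the helper and variable names here are placeholders for illustration, not the actual diff), adjusting the settings before the dataset is created keeps the rest of list_fill unchanged:

```python
def _ensure_expandable(io_settings, data_shape, enable_chunking=True):
    """Make the dataset expandable unless the caller already chose chunks or maxshape."""
    if not enable_chunking:
        return io_settings
    if io_settings.get("chunks") is None and "maxshape" not in io_settings:
        # In h5py, an unlimited maxshape implies chunked storage, so this single
        # setting is enough to make the written dataset expandable.
        io_settings["maxshape"] = tuple(None for _ in data_shape)
    return io_settings


# e.g. {'maxshape': (None, None)} for a (3, 2) array with no user-specified settings
print(_ensure_expandable({}, data_shape=(3, 2)))
```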

@mavaylon1 (Contributor Author) commented May 5, 2024

Tests:

  1. compound data: added a check to an existing test
  2. regular array data: added a check to an existing test
  3. existing tests: updated three tests (refer to the changed files)
  4. check that my update does not interfere when maxshape is already set

codecov bot commented May 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.70%. Comparing base (126bdb1) to head (58f3bf0).
Report is 29 commits behind head on dev.

Additional details and impacted files
@@           Coverage Diff           @@
##              dev    #1093   +/-   ##
=======================================
  Coverage   88.70%   88.70%           
=======================================
  Files          45       45           
  Lines        9745     9748    +3     
  Branches     2767     2769    +2     
=======================================
+ Hits         8644     8647    +3     
  Misses        779      779           
  Partials      322      322           


@mavaylon1 (Contributor Author) commented May 5, 2024

@oruebel This is mostly done. I need to check/update or write a test that my changes do not interfere with existing maxshape settings (and do another pass to make sure the logic is efficient). However, the main point I want to bring up is your idea of having a parameter for turning the expandability on and off. This would mean HDMFIO has a parameter that is not used in ZarrIO; in fact, there are tests failing because the parameter is not recognized. I see two options:

  1. Update hdmf-zarr to accept the argument and do nothing with it.
  2. Don't have the argument.

I do like the explicit nature of having it, but I also think having this trickle into hdmf-zarr is not clean.

@oruebel (Contributor) commented May 5, 2024

> This would mean HDMFIO has a parameter that is not used in ZarrIO.

I don't think the parameter needs to be in HDMFIO. I think it's ok to just add it as a parameter on HDF5IO.

@mavaylon1 (Contributor Author)

> > This would mean HDMFIO has a parameter that is not used in ZarrIO.
>
> I don't think the parameter needs to be in HDMFIO. I think it's ok to just add it as a parameter on HDF5IO.

HDF5IO.write needs to call write_builder. It does that by calling super().write(**kwargs), which gets us to HDMFIO.write, which in turn calls write_builder.

@oruebel (Contributor) commented May 5, 2024

> HDF5IO.write needs to call write_builder. It does that by calling super().write(**kwargs), which gets us to HDMFIO.write, which in turn calls write_builder.

Yes, but HDMFIO.write allows extra keyword arguments:

'default': None}, allow_extra=True)
def write(self, **kwargs):

and those are being passed through to write_builder

self.write_builder(f_builder, **kwargs)

So you can add custom keyword arguments without having to add them in HDMFIO. HDF5IO already has several additional arguments on write and write_builder that are not in HDMFIO, such as the exhaust_dci parameter.
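A stripped-down illustration of that pass-through (plain Python, not the docval machinery itself): the subclass introduces its own keyword, and the base class simply forwards whatever extras it receives to write_builder. The class and parameter names below are hypothetical.

```python
class BaseIO:
    """Stand-in for HDMFIO: write() forwards extra kwargs to write_builder()."""

    def write(self, builder, **kwargs):
        self.write_builder(builder, **kwargs)

    def write_builder(self, builder, **kwargs):
        raise NotImplementedError


class HDF5LikeIO(BaseIO):
    """Stand-in for HDF5IO: adds a backend-specific keyword of its own."""

    def write(self, builder, enable_chunking=True, **kwargs):
        # The backend-specific flag never has to appear in BaseIO's signature.
        super().write(builder, enable_chunking=enable_chunking, **kwargs)

    def write_builder(self, builder, enable_chunking=True, **kwargs):
        print(f"writing {builder!r} with enable_chunking={enable_chunking}")


HDF5LikeIO().write("my-builder", enable_chunking=False)
```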

@mavaylon1 (Contributor Author)

> So you can add custom keyword arguments without having to add them in HDMFIO. HDF5IO already has several additional arguments on write and write_builder that are not in HDMFIO, such as the exhaust_dci parameter.

Well isn't that just right in front of my face.

@mavaylon1 (Contributor Author) commented May 5, 2024

Notes:
My approach to the tests:

  1. Make sure the default is now expandable. This is accomplished by adding an additional assert to an existing test, for both regular and compound data (see the sketch below).
  2. Make sure the default does not override user-specified chunking. This is verified by the existing tests passing.
  3. Make sure that setting the flag to False turns off said chunking.
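The checks themselves boil down to assertions like the following; these are plain h5py stand-ins for what HDMF would write, not the actual test suite.

```python
import h5py
import numpy as np

with h5py.File("check.h5", "w") as f:
    # Stand-in for a column written with the new default.
    default = f.create_dataset("col", data=np.arange(4), maxshape=(None,))
    assert default.maxshape == (None,)   # expandable
    assert default.chunks is not None    # chunking implied by the unlimited maxshape

    # Stand-in for user-specified settings that the default must not override.
    user = f.create_dataset("user_col", data=np.arange(4), chunks=(2,), maxshape=(8,))
    assert user.chunks == (2,)
    assert user.maxshape == (8,)
```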

mavaylon1 marked this pull request as ready for review May 6, 2024 15:56
mavaylon1 requested a review from oruebel May 6, 2024 15:56
@oruebel (Contributor) left a comment

I added some minor suggestions, but otherwise this looks good to me.

@mavaylon1 (Contributor Author)

> I added some minor suggestions, but otherwise this looks good to me.

Thanks for the quick review. I will make the docstring more detailed, but take a look at my comments for the other changes. The pass was deliberate (not a leftover from a draft), and I like the warning.

mavaylon1 requested a review from oruebel May 6, 2024 23:32

@oruebel (Contributor) left a comment

Looks good to me. Thanks!

mavaylon1 merged commit 201b8c4 into dev May 7, 2024
27 of 28 checks passed
mavaylon1 deleted the expand_dataset branch May 7, 2024 11:44
@@ -7,6 +7,7 @@
- Added `TypeConfigurator` to automatically wrap fields with `TermSetWrapper` according to a configuration file. @mavaylon1 [#1016](https://github.com/hdmf-dev/hdmf/pull/1016)
- Updated `TermSetWrapper` to support validating a single field within a compound array. @mavaylon1 [#1061](https://github.com/hdmf-dev/hdmf/pull/1061)
- Updated testing to not install in editable mode and not run `coverage` by default. @rly [#1107](https://github.com/hdmf-dev/hdmf/pull/1107)
- Updated the default behavior for writing HDF5 datasets to be expandandable datasets with chunking enabled by default. This does not override user set chunking parameters. @mavaylon1 [#1093](https://github.com/hdmf-dev/hdmf/pull/1093)
A review comment from a Contributor on the changelog line above:

expandandable -> expandable

@rly (Contributor) commented May 7, 2024

Could you add documentation on how to expand a VectorData?

It looks like creation of a dataset of references is not modified here. Some tables in NWB contain columns that are all references, e.g., the electrode table has a column with references to the ElectrodeGroup. I think such datasets should be expandable as well.

@mavaylon1 (Contributor Author)

> Could you add documentation on how to expand a VectorData?
>
> It looks like creation of a dataset of references is not modified here. Some tables in NWB contain columns that are all references, e.g., the electrode table has a column with references to the ElectrodeGroup. I think such datasets should be expandable as well.

Yeah, the lack of support for datasets of references was just to keep the scope of this change smaller. I agree this makes a lot of sense to have; I will open an issue ticket for it.

As for documentation on expanding VectorData, I thought we had that. Maybe I am thinking of the HDF5 documentation, but I will look; if it does not exist, I will fold that into the ticket for datasets of references.
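For reference, once a column has been written with an unlimited maxshape, the low-level expansion with h5py looks like the following; the in-file path is purely illustrative, and appending through the HDMF API is what the documentation should ultimately cover.

```python
import h5py
import numpy as np

# Assumes "table.h5" already contains a column written with maxshape=(None,).
with h5py.File("table.h5", "a") as f:
    col = f["my_table/my_column"]      # hypothetical path to the column dataset
    n = col.shape[0]
    col.resize((n + 3,))               # grow the dataset by three rows
    col[n:] = np.array([7, 8, 9])      # write the appended values
```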
