[Bug]: Converting HDF5 to Zarr with unlimited dimensions fails #154

Closed
3 tasks done
oruebel opened this issue Jan 16, 2024 · 1 comment · Fixed by #155
Labels: category: bug (errors in the code or code behavior) · priority: high (impacts proper operation or use of feature; important to most users)
Milestone: Next Release
Assignee: oruebel

Comments

oruebel commented Jan 16, 2024

What happened?

When converting an HDF5 dataset with an unlimited dimension to Zarr, the data_shape computed via get_data_shape will include None values for the unlimited dimensions:

elif isinstance(dtype, np.dtype) and np.issubdtype(dtype, np.number) or dtype == np.bool_:
    data_shape = get_data_shape(data)

This is because hdmf.utils.get_data_shape returns the maxshape of the data when it is available, rather than the actual shape: https://github.com/hdmf-dev/hdmf/blob/8892ba6472d9751ff44b977a1d5a50e6512d3b05/src/hdmf/utils.py#L902-L903

However, the data shape when creating a new dataset in Zarr cannot contain None values, i.e., it must be a tuple of only integer values.
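A minimal sketch of the failure mode, assuming h5py, zarr, and hdmf are installed (the file name is arbitrary):

import h5py
import zarr
from hdmf.utils import get_data_shape

with h5py.File("unlimited_demo.h5", "w") as f:
    # The first dimension is unlimited, so maxshape contains None
    dset = f.create_dataset("data", shape=(10,), maxshape=(None,), dtype="f8")

    print(get_data_shape(dset))  # (None,) -- the maxshape, not the current shape
    print(dset.shape)            # (10,)   -- the concrete shape Zarr needs

    try:
        zarr.create(shape=get_data_shape(dset), dtype="f8")
    except Exception as err:
        # Zarr shapes must be all-integer tuples, so the None entry is rejected
        print("zarr.create failed:", err)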

Steps to Reproduce

# Download sub-1214579789/sub-1214579789_ses-1214621812_icephys.nwb (asset URL 
# https://api.dandiarchive.org/api/assets/1142cd85-96f9-4ae2-9d71-f722887fb3c0/download/) 
# from DANDI and convert with:

from pynwb import NWBHDF5IO
from hdmf_zarr.nwb import NWBZarrIO
import sys

def print_help():
    print("The script requires exactly one arguments. Call with")
    print("python nwb_hdf_to_zarr.py <hdf5_filename>")
    print("")
    print("The output zarr file will be named the same as the HDF5 file with the added .zarr extension")

def convert(hdf_filename: str, zarr_filename: str):
    with NWBHDF5IO(hdf_filename, 'r', load_namespaces=True) as read_io:  # Create HDF5 IO object for read
        with NWBZarrIO(zarr_filename, mode='w') as export_io:         # Create Zarr IO object for write
            export_io.export(src_io=read_io, write_args=dict(link_data=False))   # Export from HDF5 to Zarr

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print_help()
        exit(0)
    else:
        hdf_filename = sys.argv[1]
        zarr_filename = hdf_filename +".zarr"
        convert(hdf_filename=hdf_filename, zarr_filename=zarr_filename)

Traceback

See above

Operating System

Linux

Python Executable

Conda

Python Version

3.11

Package Versions

No response

Code of Conduct


oruebel commented Jan 16, 2024

@rly I'll submit a PR with a fix for hdmf_zarr. However, I'm wondering whether a more appropriate fix would be to update the behavior of get_data_shape in hdmf to make sure we don't return shapes that include None values. I do not recall whether that is intended behavior of get_data_shape and why it may be needed. Alternatively, we could add a flag to get_data_shape that lets users configure whether None values are allowed in the returned shape. Thoughts?
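For reference, a hedged sketch of what the flag idea could look like as a thin wrapper; the allow_none parameter and function name are hypothetical and not part of hdmf's actual get_data_shape API:

from hdmf.utils import get_data_shape

def get_data_shape_no_none(data, allow_none=True):
    """Like get_data_shape, but optionally substitute concrete sizes for None entries."""
    shape = get_data_shape(data)
    if not allow_none and shape is not None and any(s is None for s in shape):
        # Fall back to the dataset's current shape (h5py and numpy both expose .shape)
        concrete = getattr(data, "shape", None)
        if concrete is not None:
            shape = tuple(c if s is None else s for s, c in zip(shape, concrete))
    return shape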

oruebel added the category: bug (errors in the code or code behavior) and priority: high (impacts proper operation or use of feature; important to most users) labels Jan 16, 2024
oruebel added this to the Next Release milestone Jan 16, 2024
oruebel self-assigned this Jan 16, 2024
mavaylon1 pushed a commit that referenced this issue Jan 16, 2024
* Fix #154: determine data shape for unlimited HDF5 datasets on write

* Update changelog

* Add unit test using maxshape with None values in HDF5

* Exclude backup code from codecov

* Simplify io options logic
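A hedged sketch of what a regression test along the lines of the commit above might look like; object names, fixtures, and the exact setup are illustrative and not the actual test added in #155:

import numpy as np
from datetime import datetime, timezone
from pynwb import NWBFile, NWBHDF5IO, TimeSeries
from hdmf.backends.hdf5 import H5DataIO
from hdmf_zarr.nwb import NWBZarrIO

def test_export_dataset_with_unlimited_dimension(tmp_path):
    hdf_path = str(tmp_path / "test.nwb")
    zarr_path = str(tmp_path / "test.nwb.zarr")

    nwbfile = NWBFile(session_description="test", identifier="test",
                      session_start_time=datetime(2024, 1, 16, tzinfo=timezone.utc))
    # maxshape=(None,) makes the first dimension of the HDF5 dataset unlimited
    data = H5DataIO(data=np.arange(10.0), maxshape=(None,))
    nwbfile.add_acquisition(TimeSeries(name="ts", data=data, unit="m", rate=1.0))

    with NWBHDF5IO(hdf_path, "w") as write_io:
        write_io.write(nwbfile)

    # Exporting to Zarr should now succeed instead of failing on a None in the shape
    with NWBHDF5IO(hdf_path, "r", load_namespaces=True) as read_io:
        with NWBZarrIO(zarr_path, mode="w") as export_io:
            export_io.export(src_io=read_io, write_args=dict(link_data=False))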