
[Backend Configuration IIa] Add dataset identification tools #569

Merged: 40 commits into main on Nov 22, 2023

Conversation

CodyCBakerPhD (Member)

replaces #555

~80 changes to source code (bug fixes/refactoring of the iterator)
~200 additions to source code
~500 lines of tests

This is the main bulk of the complexity related to this feature, which required debugging across a much larger number of test cases than the original conception of the method in #475.

In a nutshell, the public method exposed in tools.nwb_helpers scans an in-memory NWBFile and identifies all the primary fields that could become datasets in the file when it is written. These are instantiated as the models from the previous PRs using default compression/filter/chunking/buffer options, and are then passed back to the user for confirmation before any final configuration of the datasets is performed (in follow-up PRs).
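A rough usage sketch of that flow (the backend keyword and the review/print step here are illustrative assumptions, not the final API):

from pynwb.testing.mock.base import mock_TimeSeries
from pynwb.testing.mock.file import mock_NWBFile

from neuroconv.tools.nwb_helpers import get_default_dataset_io_configurations

# Build an in-memory NWBFile with at least one dataset candidate
nwbfile = mock_NWBFile()
nwbfile.add_acquisition(mock_TimeSeries())

# Scan the NWBFile and collect one default configuration model per dataset candidate
default_configurations = list(get_default_dataset_io_configurations(nwbfile=nwbfile, backend="hdf5"))

# The user would review/tweak chunking, buffering, and compression here before the final write
for dataset_configuration in default_configurations:
    print(dataset_configuration)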

@CodyCBakerPhD (Member, Author)

add test cases for dataset config parsing on common extensions such as ndx-events, as well as compass directions

Base automatically changed from new_backend_pydantic_backend_configuration_models to main on November 7, 2023 at 16:04
@h-mayorquin (Collaborator) left a comment

OK, I think this is great. I have only one request about adding some further tests.

I am also wondering about something. The method _get_dataset_metadata effectively works as a constructor for DatasetInfo and DatasetConfiguration. If this is the main way we are going to use these classes in the library, I think we should give them constructor methods containing the logic currently in _get_dataset_metadata. There is no point in separating code that semantically belongs together into another file.

Concretely:

# Imports shown for completeness; find_location_within_nwb_file and
# _determine_dtype_like_data_chunk_iterator stand for the corresponding helpers in this PR.
from typing import Literal, Tuple

import numpy as np
from hdmf import Container
from hdmf.utils import get_data_shape
from pydantic import BaseModel, Field


class DatasetInfo(BaseModel):
    """A data model to represent immutable aspects of an object that will become a HDF5 or Zarr dataset on write."""

    # TODO: When using Pydantic v2, replace with
    # model_config = ConfigDict(allow_mutation=False)
    class Config:  # noqa: D106
        allow_mutation = False
        arbitrary_types_allowed = True

    object_id: str = Field(description="The UUID of the neurodata object containing the dataset.")
    location: str = Field(  # TODO: in v2, use init_var=False or assign as a property
        description="The relative location of this dataset within the in-memory NWBFile."
    )
    dataset_name: Literal["data", "timestamps"] = Field(description="The reference name of the dataset.")
    dtype: np.dtype = Field(  # TODO: When using Pydantic v2, replace np.dtype with InstanceOf[np.dtype]
        description="The data type of elements of this dataset."
    )
    full_shape: Tuple[int, ...] = Field(description="The maximum shape of the entire dataset.")

    # ... other methods in the class ...

    @classmethod
    def from_neurodata_object(cls, field_name: str, neurodata_object: Container) -> "DatasetInfo":
        location = find_location_within_nwb_file(current_location=field_name, neurodata_object=neurodata_object)
        dataset_name = location.split("/")[-1]  # last path component, e.g. "data" or "timestamps"
        dtype = _determine_dtype_like_data_chunk_iterator(neurodata_object)

        candidate_dataset = getattr(neurodata_object, field_name)
        full_shape = get_data_shape(data=candidate_dataset)

        dataset_info = cls(
            object_id=neurodata_object.object_id,
            location=location,
            dataset_name=dataset_name,
            full_shape=full_shape,
            dtype=dtype,
        )
        return dataset_info

Even if we were using this class in some other way, this seems like a very useful method to have: you pass the object and the field and you get the information. An analogous method can be made for DatasetIOConfiguration with the logic inside _get_dataset_metadata.
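For illustration, the analogous constructor could look roughly like this (the default-estimation helpers named here are placeholders I made up, not the actual logic inside _get_dataset_metadata):

class DatasetIOConfiguration(BaseModel):
    # ... existing fields: dataset_info, chunk_shape, buffer_shape, compression options, etc. ...

    @classmethod
    def from_neurodata_object(cls, field_name: str, neurodata_object: Container) -> "DatasetIOConfiguration":
        dataset_info = DatasetInfo.from_neurodata_object(field_name=field_name, neurodata_object=neurodata_object)

        # Placeholder names for whatever default heuristics _get_dataset_metadata currently applies
        chunk_shape = _estimate_default_chunk_shape(dataset_info=dataset_info)
        buffer_shape = _estimate_default_buffer_shape(dataset_info=dataset_info)

        return cls(dataset_info=dataset_info, chunk_shape=chunk_shape, buffer_shape=buffer_shape)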

I also think that this is the functionality we are adding in this PR, and it should be easier to test directly. Otherwise, we are only testing all of this indirectly through get_default_dataset_io_configurations, which collects these constructors across the whole NWBFile.
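For example, a direct test could then be as simple as this (a sketch assuming the proposed classmethod; the exact location string is my guess at the convention):

import numpy as np
from pynwb.testing.mock.base import mock_TimeSeries
from pynwb.testing.mock.file import mock_NWBFile


def test_dataset_info_from_neurodata_object():
    nwbfile = mock_NWBFile()
    time_series = mock_TimeSeries(name="TestTimeSeries", data=np.zeros(shape=(30, 4), dtype="float32"))
    nwbfile.add_acquisition(time_series)

    dataset_info = DatasetInfo.from_neurodata_object(field_name="data", neurodata_object=time_series)

    assert dataset_info.object_id == time_series.object_id
    assert dataset_info.dataset_name == "data"
    assert dataset_info.location == "acquisition/TestTimeSeries/data"  # guessed format
    assert dataset_info.full_shape == (30, 4)
    assert dataset_info.dtype == np.dtype("float32")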

What do you think? Are there some drawbacks of this approach that I am missing?

@CodyCBakerPhD (Member, Author)

> What do you think? Are there some drawbacks of this approach that I am missing?

It's a good idea, but this PR is already massive enough and we need to start drawing some lines. I'd say raise an issue with this idea to migrate the logic from the private method to a class method on the models, and leave it as a 'nice to have' follow-up that is not required to proceed along this chain.

Also keep in mind that some of those model designs may change once we get the Pydantic v2 ban lifted upstream, so I wasn't going to perfect them too much until that time.

Otherwise, the one publicly exposed function that forms the basis of this PR and its tests (get_default_dataset_io_configurations) is the only thing I'd focus on at the moment.

@h-mayorquin (Collaborator)

> What do you think? Are there some drawbacks of this approach that I am missing?
>
> It's a good idea, but this PR is already massive enough and we need to start drawing some lines. I'd say raise an issue with this idea to migrate the logic from the private method to a class method on the models, and leave it as a 'nice to have' follow-up that is not required to proceed along this chain.
>
> Also keep in mind that some of those model designs may change once we get the Pydantic v2 ban lifted upstream, so I wasn't going to perfect them too much until that time.
>
> Otherwise, the one publicly exposed function that forms the basis of this PR and its tests (get_default_dataset_io_configurations) is the only thing I'd focus on at the moment.

This makes sense. I agree.

@CodyCBakerPhD (Member, Author)

@h-mayorquin I think all issues and next steps should be addressed now

@h-mayorquin (Collaborator) left a comment

LGTM. I would like to add a test for how the methods that extract the shape and dtype behave with ragged arrays in a DynamicTable, as that issue will come back for us when dealing with the units table or writing sorting objects from spikeinterface.

We can do that now or later, up to you.

data = iterator(array)

nwbfile = mock_NWBFile()
column = VectorData(name="TestColumn", description="", data=data)
h-mayorquin (Collaborator)

Can we add a ragged array column in the test? I am wondering how it will work with the shape and dtype.
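For reference, this is the kind of column I mean (a minimal sketch; the names are just examples, and the companion index column is created automatically when index=True):

from hdmf.common import DynamicTable

table = DynamicTable(name="TestTable", description="")
table.add_column(name="TestColumn", description="", index=True)  # ragged column
table.add_row(TestColumn=[1.0, 2.0, 3.0])
table.add_row(TestColumn=[4.0])

# On write this produces two datasets: the flat "TestColumn" VectorData ([1.0, 2.0, 3.0, 4.0])
# and a "TestColumn_index" VectorIndex marking where each row ends ([3, 4]).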

CodyCBakerPhD (Member, Author)

Very good thinking on this request. It caught a bug where _index columns weren't being found.

And in general, a good reminder for myself not to access the column fields of a table by using their name as a key in the dictionary: since they override __getitem__ for user convenience, it actually behaves differently than I was expecting from a simple mapping.

@h-mayorquin (Collaborator)

Ah, yes, also: you changed the name of the function that finds the location, but the attribute in DatasetInfo is still called like that. I don't know if that is something we should change as well.

@CodyCBakerPhD (Member, Author)

> I don't know if that is something we should change as well.

Given that the class was introduced two PRs back, I'd rather scope such a refactor to its own PR.

This far downstream it takes a lot to go back and adjust things, and again, this PR is already quite a lot.

I only updated the part of this PR that interacts with it and would be happy to do a top-down rename to location_in_file in a follow-up.

@h-mayorquin (Collaborator)

> I don't know if that is something we should change as well.
>
> Given that the class was introduced two PRs back, I'd rather scope such a refactor to its own PR.
>
> This far downstream it takes a lot to go back and adjust things, and again, this PR is already quite a lot.
>
> I only updated the part of this PR that interacts with it and would be happy to do a top-down rename to location_in_file in a follow-up.

Follow-up is great.

@@ -177,16 +177,15 @@ def get_default_dataset_io_configurations(
if isinstance(neurodata_object, DynamicTable):
    dynamic_table = neurodata_object  # for readability

    for column_name in dynamic_table.colnames:
        candidate_dataset = dynamic_table[column_name].data  # VectorData object
h-mayorquin (Collaborator)

Wait, so how do these two behave differently?

CodyCBakerPhD (Member, Author)

Yep 😅

CodyCBakerPhD (Member, Author)

In particular, dynamic_table[column_name] calls __getitem__ on dynamic_table with the key column_name. Default dict behavior would just return the value, but they adjust it to resolve downstream links, so that you can do things like units["spike_times"][:] and get back the actual lists of spikes shaped as units x spikes_per_unit.
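A quick illustration of that (a sketch using pynwb's mock_NWBFile; the inline results are from memory and worth double-checking):

from pynwb.testing.mock.file import mock_NWBFile

nwbfile = mock_NWBFile()
nwbfile.add_unit(spike_times=[1.1, 2.2, 3.3])
nwbfile.add_unit(spike_times=[4.4])
units = nwbfile.units

units.colnames                             # ('spike_times',) - the _index column is not listed here
{column.name for column in units.columns}  # {'spike_times', 'spike_times_index'} - but both become datasets
units["spike_times"][0]                    # [1.1, 2.2, 3.3] - __getitem__ resolves the index to a per-unit view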


codecov bot commented Nov 22, 2023

Codecov Report

Merging #569 (3032755) into main (b732807) will decrease coverage by 0.06%.
The diff coverage is 90.13%.

Additional details and impacted files


@@            Coverage Diff             @@
##             main     #569      +/-   ##
==========================================
- Coverage   91.66%   91.61%   -0.06%     
==========================================
  Files         106      107       +1     
  Lines        5517     5627     +110     
==========================================
+ Hits         5057     5155      +98     
- Misses        460      472      +12     
Flag | Coverage Δ
unittests | 91.61% <90.13%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown.

Files | Coverage Δ
src/neuroconv/tools/nwb_helpers/__init__.py | 100.00% <100.00%> (ø)
...euroconv/tools/nwb_helpers/_models/_base_models.py | 100.00% <100.00%> (ø)
...euroconv/tools/nwb_helpers/_models/_hdf5_models.py | 62.50% <100.00%> (ø)
...euroconv/tools/nwb_helpers/_models/_zarr_models.py | 86.15% <100.00%> (ø)
src/neuroconv/tools/testing/__init__.py | 100.00% <ø> (ø)
...roconv/tools/testing/_mock/_mock_dataset_models.py | 100.00% <100.00%> (ø)
src/neuroconv/tools/hdmf.py | 87.05% <86.95%> (-0.82%) ⬇️
...roconv/tools/nwb_helpers/_dataset_configuration.py | 90.00% <90.00%> (ø)

CodyCBakerPhD merged commit 1d3d58d into main on Nov 22, 2023
34 of 36 checks passed
CodyCBakerPhD deleted the new_backend_default_dataset_configuration branch on November 22, 2023 at 20:05