
[Backend Configuration IIa] Add dataset identification tools #569

Merged: 40 commits into main on Nov 22, 2023

Conversation

CodyCBakerPhD (Member)

replaces #555

~80 changes to source code (bug fixes/refactoring of the iterator)
~200 additions to source code
~500 lines of tests

This is the main bulk of the complexity related to this feature, which required debugging across a much larger number of test cases than the original conception of the method in #475.

In a nutshell, the public method exposed in tools.nwb_helpers scans an in-memory NWBFile and identifies all the primary fields that could become datasets in the file when it is written. These are instantiated as the models from the previous PRs using default compression/filter/chunking/buffer options, and are then passed back to the user for confirmation before any final configuration of the datasets is performed (in follow-up PRs).
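A rough usage sketch of that flow (the backend keyword and the review/print step here are illustrative assumptions, not the final API):

from pynwb.testing.mock.base import mock_TimeSeries
from pynwb.testing.mock.file import mock_NWBFile

from neuroconv.tools.nwb_helpers import get_default_dataset_io_configurations

# Build an in-memory NWBFile with at least one dataset candidate
nwbfile = mock_NWBFile()
nwbfile.add_acquisition(mock_TimeSeries())

# Scan the NWBFile and collect one default configuration model per dataset candidate
default_configurations = list(get_default_dataset_io_configurations(nwbfile=nwbfile, backend="hdf5"))

# The user would review/tweak chunking, buffering, and compression here before the final write
for dataset_configuration in default_configurations:
    print(dataset_configuration)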

@CodyCBakerPhD (Member, Author)

add test cases for dataset config parsing on common extensions such as ndx-events, as well as compass directions

Base automatically changed from new_backend_pydantic_backend_configuration_models to main on November 7, 2023 at 16:04
@h-mayorquin (Collaborator) left a comment

OK, I think this is great. I have only one request about adding some further tests.

I am also wondering about something. The method _get_dataset_metadata effectively works as a constructor for DatasetInfo and DatasetConfiguration. If this is the main way we are going to use these classes in the library, I think we should give them constructor methods containing the logic currently in _get_dataset_metadata. There is no point in separating code that semantically belongs together into another file.

Concretely:

# Imports shown for completeness; find_location_within_nwb_file and
# _determine_dtype_like_data_chunk_iterator stand for the corresponding helpers in this PR.
from typing import Literal, Tuple

import numpy as np
from hdmf import Container
from hdmf.utils import get_data_shape
from pydantic import BaseModel, Field


class DatasetInfo(BaseModel):
    """A data model to represent immutable aspects of an object that will become a HDF5 or Zarr dataset on write."""

    # TODO: When using Pydantic v2, replace with
    # model_config = ConfigDict(allow_mutation=False)
    class Config:  # noqa: D106
        allow_mutation = False
        arbitrary_types_allowed = True

    object_id: str = Field(description="The UUID of the neurodata object containing the dataset.")
    location: str = Field(  # TODO: in v2, use init_var=False or assign as a property
        description="The relative location of this dataset within the in-memory NWBFile."
    )
    dataset_name: Literal["data", "timestamps"] = Field(description="The reference name of the dataset.")
    dtype: np.dtype = Field(  # TODO: When using Pydantic v2, replace np.dtype with InstanceOf[np.dtype]
        description="The data type of elements of this dataset."
    )
    full_shape: Tuple[int, ...] = Field(description="The maximum shape of the entire dataset.")

    # ... other methods in the class ...

    @classmethod
    def from_neurodata_object(cls, field_name: str, neurodata_object: Container) -> "DatasetInfo":
        location = find_location_within_nwb_file(current_location=field_name, neurodata_object=neurodata_object)
        dataset_name = location.split("/")[-1]  # last path component, e.g. "data" or "timestamps"
        dtype = _determine_dtype_like_data_chunk_iterator(neurodata_object)

        candidate_dataset = getattr(neurodata_object, field_name)
        full_shape = get_data_shape(data=candidate_dataset)

        dataset_info = cls(
            object_id=neurodata_object.object_id,
            location=location,
            dataset_name=dataset_name,
            full_shape=full_shape,
            dtype=dtype,
        )
        return dataset_info

Even if we were using this class in some other way, this seems like a very useful method to have: you pass the object and the field and you get the information. An analogous method can be made for DatasetIOConfiguration with the logic inside _get_dataset_metadata.
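For illustration, the analogous constructor could look roughly like this (the default-estimation helpers named here are placeholders I made up, not the actual logic inside _get_dataset_metadata):

class DatasetIOConfiguration(BaseModel):
    # ... existing fields: dataset_info, chunk_shape, buffer_shape, compression options, etc. ...

    @classmethod
    def from_neurodata_object(cls, field_name: str, neurodata_object: Container) -> "DatasetIOConfiguration":
        dataset_info = DatasetInfo.from_neurodata_object(field_name=field_name, neurodata_object=neurodata_object)

        # Placeholder names for whatever default heuristics _get_dataset_metadata currently applies
        chunk_shape = _estimate_default_chunk_shape(dataset_info=dataset_info)
        buffer_shape = _estimate_default_buffer_shape(dataset_info=dataset_info)

        return cls(dataset_info=dataset_info, chunk_shape=chunk_shape, buffer_shape=buffer_shape)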

I also think that this is the functionality we are adding in this PR, and it should be easier to test directly. Otherwise, we are only testing all of this indirectly through get_default_dataset_io_configurations, which collects these constructors across the whole NWBFile.
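For example, a direct test could then be as simple as this (a sketch assuming the proposed classmethod; the exact location string is my guess at the convention):

import numpy as np
from pynwb.testing.mock.base import mock_TimeSeries
from pynwb.testing.mock.file import mock_NWBFile


def test_dataset_info_from_neurodata_object():
    nwbfile = mock_NWBFile()
    time_series = mock_TimeSeries(name="TestTimeSeries", data=np.zeros(shape=(30, 4), dtype="float32"))
    nwbfile.add_acquisition(time_series)

    dataset_info = DatasetInfo.from_neurodata_object(field_name="data", neurodata_object=time_series)

    assert dataset_info.object_id == time_series.object_id
    assert dataset_info.dataset_name == "data"
    assert dataset_info.location == "acquisition/TestTimeSeries/data"  # guessed format
    assert dataset_info.full_shape == (30, 4)
    assert dataset_info.dtype == np.dtype("float32")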

What do you think? Are there some drawbacks of this approach that I am missing?

@CodyCBakerPhD (Member, Author)

> What do you think? Are there some drawbacks of this approach that I am missing?

It's a good idea, but this PR is already massive enough and we need to start drawing some lines. I'd say raise an issue with this idea to migrate the logic from the private method to a class method on the models, and leave it as a 'nice to have' follow-up that is not required to proceed along this chain.

Also keep in mind that some of those model designs may change once we get the Pydantic v2 ban lifted upstream, so I wasn't going to perfect them too much until that time.

Otherwise, the one publicly exposed function that forms the basis of this PR and its tests (get_default_dataset_io_configurations) is the only thing I'd focus on at the moment.

@h-mayorquin (Collaborator)

> What do you think? Are there some drawbacks of this approach that I am missing?
>
> It's a good idea, but this PR is already massive enough and we need to start drawing some lines. I'd say raise an issue with this idea to migrate the logic from the private method to a class method on the models, and leave it as a 'nice to have' follow-up that is not required to proceed along this chain.
>
> Also keep in mind that some of those model designs may change once we get the Pydantic v2 ban lifted upstream, so I wasn't going to perfect them too much until that time.
>
> Otherwise, the one publicly exposed function that forms the basis of this PR and its tests (get_default_dataset_io_configurations) is the only thing I'd focus on at the moment.

This makes sense. I agree.

@CodyCBakerPhD (Member, Author)

@h-mayorquin I think all issues and next steps should be addressed now

@h-mayorquin (Collaborator) left a comment

LGTM. I would like to add a test for how the methods that extract the shape and dtype behave with ragged arrays in a DynamicTable, as that issue will come back for us when dealing with the units table or writing sorting objects from spikeinterface.

We can do that now or later, up to you.

data = iterator(array)

nwbfile = mock_NWBFile()
column = VectorData(name="TestColumn", description="", data=data)
h-mayorquin (Collaborator)

Can we add a ragged array column in the test? I am wondering how it will work with the shape and dtype.
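For reference, this is the kind of column I mean (a minimal sketch; the names are just examples, and the companion index column is created automatically when index=True):

from hdmf.common import DynamicTable

table = DynamicTable(name="TestTable", description="")
table.add_column(name="TestColumn", description="", index=True)  # ragged column
table.add_row(TestColumn=[1.0, 2.0, 3.0])
table.add_row(TestColumn=[4.0])

# On write this produces two datasets: the flat "TestColumn" VectorData ([1.0, 2.0, 3.0, 4.0])
# and a "TestColumn_index" VectorIndex marking where each row ends ([3, 4]).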

CodyCBakerPhD (Member, Author)

Very good thinking on this request. It caught a bug where _index columns weren't being found.

And in general, a good reminder for myself not to access the column fields of a table by using their name as a key in the dictionary: since they override __getitem__ for user convenience, it actually behaves differently than I was expecting from a simple mapping.

@h-mayorquin (Collaborator)

Ah, yes, also: you changed the name of the function that finds the location, but the attribute in DatasetInfo is still called like that. I don't know if that is something we should change as well.

@CodyCBakerPhD (Member, Author)

> I don't know if that is something we should change as well.

Given that the class was introduced two PRs back, I'd rather scope such a refactor to its own PR.

This far downstream it takes a lot to go back and adjust things, and again, this PR is already quite a lot.

I only updated the part of this PR that interacts with it and would be happy to do a top-down rename to location_in_file in a follow-up.

@h-mayorquin (Collaborator)

> I don't know if that is something we should change as well.
>
> Given that the class was introduced two PRs back, I'd rather scope such a refactor to its own PR.
>
> This far downstream it takes a lot to go back and adjust things, and again, this PR is already quite a lot.
>
> I only updated the part of this PR that interacts with it and would be happy to do a top-down rename to location_in_file in a follow-up.

Follow-up is great.

@@ -177,16 +177,15 @@ def get_default_dataset_io_configurations(
if isinstance(neurodata_object, DynamicTable):
    dynamic_table = neurodata_object  # for readability

    for column_name in dynamic_table.colnames:
        candidate_dataset = dynamic_table[column_name].data  # VectorData object
h-mayorquin (Collaborator)

Wait, so how do these two behave differently?

CodyCBakerPhD (Member, Author)

Yep 😅

CodyCBakerPhD (Member, Author)

In particular, dynamic_table[column_name] calls __getitem__ on dynamic_table with the key column_name. Default dict behavior would just return the value, but they adjust it to resolve downstream links, so that you can do things like units["spike_times"][:] and get back the actual lists of spikes shaped as units x spikes_per_unit.
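A quick illustration of that (a sketch using pynwb's mock_NWBFile; the inline results are from memory and worth double-checking):

from pynwb.testing.mock.file import mock_NWBFile

nwbfile = mock_NWBFile()
nwbfile.add_unit(spike_times=[1.1, 2.2, 3.3])
nwbfile.add_unit(spike_times=[4.4])
units = nwbfile.units

units.colnames                             # ('spike_times',) - the _index column is not listed here
{column.name for column in units.columns}  # {'spike_times', 'spike_times_index'} - but both become datasets
units["spike_times"][0]                    # [1.1, 2.2, 3.3] - __getitem__ resolves the index to a per-unit view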


codecov bot commented Nov 22, 2023

Codecov Report

Merging #569 (3032755) into main (b732807) will decrease coverage by 0.06%.
The diff coverage is 90.13%.

Additional details and impacted files


@@            Coverage Diff             @@
##             main     #569      +/-   ##
==========================================
- Coverage   91.66%   91.61%   -0.06%     
==========================================
  Files         106      107       +1     
  Lines        5517     5627     +110     
==========================================
+ Hits         5057     5155      +98     
- Misses        460      472      +12     
Flag | Coverage Δ
unittests | 91.61% <90.13%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown.

Files | Coverage Δ
src/neuroconv/tools/nwb_helpers/__init__.py | 100.00% <100.00%> (ø)
...euroconv/tools/nwb_helpers/_models/_base_models.py | 100.00% <100.00%> (ø)
...euroconv/tools/nwb_helpers/_models/_hdf5_models.py | 62.50% <100.00%> (ø)
...euroconv/tools/nwb_helpers/_models/_zarr_models.py | 86.15% <100.00%> (ø)
src/neuroconv/tools/testing/__init__.py | 100.00% <ø> (ø)
...roconv/tools/testing/_mock/_mock_dataset_models.py | 100.00% <100.00%> (ø)
src/neuroconv/tools/hdmf.py | 87.05% <86.95%> (-0.82%) ⬇️
...roconv/tools/nwb_helpers/_dataset_configuration.py | 90.00% <90.00%> (ø)

CodyCBakerPhD merged commit 1d3d58d into main on Nov 22, 2023
34 of 36 checks passed
CodyCBakerPhD deleted the new_backend_default_dataset_configuration branch on November 22, 2023 at 20:05