[Backend Configuration IIa] Add dataset identification tools #569
Conversation
add test cases for dataset config parsing on common extensions such as ndx-events as well as compass directions
OK, I think this is great. I have only one request about adding some further tests.
I am also wondering: the method _get_dataset_metadata is effectively working as a constructor for the DatasetInfo and the DatasetConfiguration. If this is the main way that we are going to use these classes in the library, I think we should have constructor methods holding the logic that is currently in _get_dataset_metadata. No point in separating code that semantically belongs together into another file.
Concretely:
from typing import Literal, Tuple

import numpy as np
from hdmf.container import Container
from hdmf.utils import get_data_shape
from pydantic import BaseModel, Field


class DatasetInfo(BaseModel):
    """A data model to represent immutable aspects of an object that will become an HDF5 or Zarr dataset on write."""

    # TODO: When using Pydantic v2, replace with
    # model_config = ConfigDict(allow_mutation=False)
    class Config:  # noqa: D106
        allow_mutation = False
        arbitrary_types_allowed = True

    object_id: str = Field(description="The UUID of the neurodata object containing the dataset.")
    location: str = Field(  # TODO: in v2, use init_var=False or assign as a property
        description="The relative location of this dataset within the in-memory NWBFile."
    )
    dataset_name: Literal["data", "timestamps"] = Field(description="The reference name of the dataset.")
    dtype: np.dtype = Field(  # TODO: When using Pydantic v2, replace np.dtype with InstanceOf[np.dtype]
        description="The data type of elements of this dataset."
    )
    full_shape: Tuple[int, ...] = Field(description="The maximum shape of the entire dataset.")

    # ... other methods of the class ...

    @classmethod
    def from_neurodata_object(cls, field_name: str, neurodata_object: Container) -> "DatasetInfo":
        # find_location_within_nwb_file and _determine_dtype_like_data_chunk_iterator are the
        # existing helpers from this module.
        location = find_location_within_nwb_file(current_location=field_name, neurodata_object=neurodata_object)
        dataset_name = location.split("/")[-1]
        dtype = _determine_dtype_like_data_chunk_iterator(neurodata_object)
        candidate_dataset = getattr(neurodata_object, field_name)
        full_shape = get_data_shape(data=candidate_dataset)

        dataset_info = cls(
            object_id=neurodata_object.object_id,
            location=location,
            dataset_name=dataset_name,
            full_shape=full_shape,
            dtype=dtype,
        )
        return dataset_info
Even if we are using this class in some other way, this seems like a very useful method to have: you pass the object and the field and you get the information. An analogous constructor can be made for DatasetIOConfiguration with the logic inside _get_dataset_metadata.
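For illustration, a hypothetical call of the proposed classmethod (a sketch only; it assumes the from_neurodata_object definition above plus the pynwb mock helpers, and the attribute values shown are examples, not outputs from the PR):

import numpy as np
from pynwb.testing.mock.base import mock_TimeSeries
from pynwb.testing.mock.file import mock_NWBFile

# Build a small in-memory NWBFile holding a single TimeSeries.
nwbfile = mock_NWBFile()
time_series = mock_TimeSeries(name="ExampleTimeSeries", data=np.random.random(size=(10, 3)))
nwbfile.add_acquisition(time_series)

# One call yields the dataset description for the "data" field of that object.
dataset_info = DatasetInfo.from_neurodata_object(field_name="data", neurodata_object=time_series)
# e.g. dataset_info.location == "acquisition/ExampleTimeSeries/data"
#      dataset_info.full_shape == (10, 3)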
I also think that this is the functionality we are adding in this PR, and it should be easier to test this way. Otherwise, we are only testing all of this indirectly through get_default_dataset_io_configurations, which collects these constructions across the whole nwbfile.
What do you think? Are there some drawbacks of this approach that I am missing?
It's a good idea, but this PR is already massive enough and we need to start drawing some lines. I'd say raise an issue with this idea to migrate the logic from a private method to a class method on the models, and leave it as a TODO follow-up that is 'nice' but not required to proceed along this chain. Also keep in mind some of those model designs may change once we get the Pydantic v2 ban lifted upstream, so I wasn't going to perfect them too much until that time. Otherwise, the one publicly exposed function that forms the basis of this PR and its tests (
This makes sense. I agree.
@h-mayorquin I think all issues and next steps should be addressed now
LGTM. I would like to add a test for how the methods that extract the shape and dtype behave with ragged arrays in the dynamic table, as that issue will come back for us when dealing with the units table or writing sorting objects from SpikeInterface.
We can do that now or later, up to you.
data = iterator(array)

nwbfile = mock_NWBFile()
column = VectorData(name="TestColumn", description="", data=data)
Can we add a ragged array column in the test? I am wondering how it will work with the shape and dtype.
Very good thinking on this request. Caught a bug where _index columns weren't found.
And in general, a good reminder for myself not to access column fields of a table by treating their name as a key in a dictionary - since DynamicTable overrides __getitem__ for user convenience, it actually has different behavior than I was expecting from a simple mapping.
Ah, yes, also: you changed the name of the function that finds the location, but the attribute in the DatasetInfo is still called like that. I don't know if that is something we should change as well.
Given that class was introduced two PRs back, I'd rather scope such a refactor to its own PR. This far downstream it takes a lot to go back and adjust things, and again this PR is already quite a lot. I only updated the part of this PR that interacts with it and would be happy to do a top-down rename to
Follow-up is great.
@@ -177,16 +177,15 @@ def get_default_dataset_io_configurations(
        if isinstance(neurodata_object, DynamicTable):
            dynamic_table = neurodata_object  # for readability

            for column_name in dynamic_table.colnames:
                candidate_dataset = dynamic_table[column_name].data  # VectorData object
Wait, so how are these two guys different?
Yep 😅
In particular, dynamic_table[column_name] calls __getitem__ on dynamic_table with the key column_name. Default dict behavior would return the stored value, but DynamicTable adjusts it to return downstream links, so that you can do things like units["spike_times"][:] and get back the actual list of spikes shaped as units x spikes_per_unit.
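A small standalone sketch of that behavior (a hypothetical table, not taken from this PR's test suite), contrasting the resolved row view returned by __getitem__ with the raw column objects the table actually holds:

from hdmf.common import DynamicTable

# A table with one ragged column: two rows holding 2 and 3 spike times respectively.
table = DynamicTable(name="units", description="demo table with a ragged column")
table.add_column(name="spike_times", description="per-row spike times", index=True)
table.add_row(spike_times=[0.1, 0.2])
table.add_row(spike_times=[0.3, 0.4, 0.5])

# __getitem__ resolves the VectorIndex, so each row comes back as its own list.
print(table["spike_times"][0])  # [0.1, 0.2]
print(table["spike_times"][1])  # [0.3, 0.4, 0.5]

# The raw columns are different objects: a flat VectorData plus a VectorIndex of row
# offsets ([2, 5]); only the flat data would become a dataset on write.
for column in table.columns:
    print(column.name, list(column.data))

# colnames only lists "spike_times"; the "_index" column does not appear there.
print(table.colnames)  # ('spike_times',)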
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #569      +/-   ##
==========================================
-  Coverage   91.66%   91.61%   -0.06%
==========================================
   Files         106      107       +1
   Lines        5517     5627     +110
==========================================
+  Hits         5057     5155      +98
-  Misses        460      472      +12

Flags with carried forward coverage won't be shown.
Replaces #555.
~80 changes to source code (debugging/refactoring of the iterator)
~200 additions to source code
~500 lines of tests
This is the main bulk of the complexity related to this feature, which required debugging over a large number of test cases compared to the original conception of the method in #475.
In a nutshell, the non-private method exposed through tools.nwb_helpers scans an in-memory NWBFile and identifies all the primary fields that could become datasets in the file when it is written. These are instantiated as the models from the previous PRs using default compression/filter/chunking/buffer options, and are then passed back to the user for confirmation before any final configuration of the datasets is performed (in follow-up PRs).
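A rough usage sketch of that workflow (the import path and parameter names follow the names used in this PR's discussion and may differ from the final API):

from pynwb.testing.mock.base import mock_TimeSeries
from pynwb.testing.mock.file import mock_NWBFile
from neuroconv.tools.nwb_helpers import get_default_dataset_io_configurations

# An in-memory NWBFile that has not been written to disk yet.
nwbfile = mock_NWBFile()
nwbfile.add_acquisition(mock_TimeSeries())

# Scan the file and build default configurations (chunk/buffer shapes, compression)
# for every field that would become a dataset on write; the user can then inspect
# and adjust these before the final write step handled in the follow-up PRs.
for dataset_io_configuration in get_default_dataset_io_configurations(nwbfile=nwbfile, backend="hdf5"):
    print(dataset_io_configuration)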