Timx 288 marc field method refactor #200
One question for both @jonavellecuerdo and @ehanson8: I can't comment directly in the PR, but I'm wondering about moving these static file mappings to methods/properties on the [...]. Seems like we'd want to avoid reading the file each time they are used, but that could be solved with cached instances, or even just reading and saving on class init.
Left a question about the temporary values you noted in the PR, asking if they could just be called directly even if not part of a field method proper.
Also, maybe dovetails with a general comment about the static variables at the top.
These all seem to have one thing in common: parse some data that multiple fields will need. If so, then feels like we could do that on init for the MarcTransform class somehow, and reuse that data throughout via self. As the first part of the MARC refactoring, might be a good time to consider that pattern.
transmogrifier/sources/xml/marc.py
Outdated
leader_field = Marc._get_leader_field(source_record)
control_field_general_info = Marc._get_control_field_general_info(source_record)
I see your thinking a bit in having these temporarily here for use in fields that don't have their own methods yet... but what about just calling these new and/or renamed methods from those places directly?
e.g. I see leader_field used, but could call directly:
# content_type
if content_type := Marc.json_crosswalk_code_to_name(
    self._get_leader_field(source_record)[6:7],
    marc_content_type_crosswalk,
    source_record_id,
    "Leader/06",
):
    fields["content_type"] = [content_type]
What I think this might help guide is how often these are used. If often, it might be worth caching the results on those sub-methods. If not often, it's probably okay to invoke them directly 1-2 times. I'm a big fan of caching -- where you call the method, and it reuses data if safe to cache and already called -- but there is a bit of overhead there.
Either way, might avoid needing these temporary values at all, and would give some insight into usage during future parts of this refactor. Just a thought.
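To make the caching idea above concrete, here is a minimal sketch, assuming a hypothetical per-record wrapper (RecordFields is not part of transmogrifier) and Python's functools.cached_property. The leader is looked up on first access and reused afterwards; the lookup counter exists only to make the reuse visible.

```python
import xml.etree.ElementTree as ET
from functools import cached_property

# MARCXML namespace. The wrapper class below is hypothetical, for illustration only.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


class RecordFields:
    """Hypothetical per-record wrapper that caches the parsed leader."""

    def __init__(self, source_record: ET.Element) -> None:
        self.source_record = source_record
        self.leader_lookups = 0  # counts real XML lookups, for illustration

    @cached_property
    def leader(self) -> str:
        # Runs once per instance; later accesses reuse the stored value.
        self.leader_lookups += 1
        element = self.source_record.find("marc:leader", MARC_NS)
        return (element.text or "") if element is not None else ""


record = ET.fromstring(
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    "<leader>01234cam a2200301 a 4500</leader>"
    "</record>"
)
fields = RecordFields(record)
content_type_code = fields.leader[6:7]  # Leader/06, triggers the lookup
record_status = fields.leader[5:6]  # Leader/05, served from the cache
```

Because the cache lives on the instance, it is discarded along with the record, so one record's leader cannot leak into the next. If a value is only read once or twice per record, calling the private method directly is simpler.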
Updated Marc to call the private methods as needed but as discussed, caching is not needed at this time!
Great work, a few minor suggestions!
tests/sources/xml/test_marc.py
Outdated
leader_field_insert (str): A string representing a MARC fixed length 'leader'
    XML element. Defaults to a dummy value.
control_field_general_info_insert (str): A string representing a MARC fixed length
    'general info control field' (i.e., code 008) XML element.
    Defaults to a dummy value.
Smart change!
Thank you! Heads up: Renamed control_field_general_info_insert -> control_field_insert.
tests/sources/xml/test_marc.py
Outdated
def test_get_control_field_general_info_if_field_blank():
    source_record = create_marc_source_record_stub(
        control_field_general_info_insert='<controlfield tag="008"></controlfield>'
    )
    with pytest.raises(
        SkippedRecordEvent,
        match=(
            'Record skipped because key information is missing: <controlfield tag="008">.'
        ),
    ):
        Marc._get_control_field_general_info(source_record)


def test_get_control_field_general_info_if_field_missing():
Add raises_exception to test names?
Actually looks like we've used _raises_skipped_record_event in other transforms
Good catch! I renamed the test as suggested.
@@ -894,11 +968,7 @@ def test_marc_record_missing_leader_logs_error(caplog):
    output_records = Marc("alma", marc_xml_records)
I'd update this test name since it no longer logs an error
I updated the test to say skips_record for now, but I think these two tests may be part of the old tests we'll want to deprecate eventually!
@@ -908,11 +978,7 @@ def test_marc_record_missing_008_logs_error(caplog):
    output_records = Marc("alma", marc_xml_records)
    assert len(list(output_records)) == 0
    assert output_records.processed_record_count == 1
Same here with the test name
See previous comment above.
As @ghukill and I discussed, we opted to only make the following change: assign 'crosswalk' variables as attributes of
Good summary and I agree that seems like the best path!
country_code_crosswalk = load_external_config("config/loc-countries.xml", "xml")
holdings_collection_crosswalk = load_external_config(
    "config/holdings_collection_crosswalk.json", "json"
)
holdings_format_crosswalk = load_external_config(
    "config/holdings_format_crosswalk.json", "json"
)
holdings_location_crosswalk = load_external_config(
    "config/holdings_location_crosswalk.json", "json"
)
language_code_crosswalk = load_external_config("config/loc-languages.xml", "xml")
marc_content_type_crosswalk = load_external_config(
    "config/marc_content_type_crosswalk.json", "json"
)
Perfection!
Thanks for making this change. Kind of minor, but a bit of nice code organization I think. Also, while I was definitely using "loose" in our conversations, it occurred to me those kinds of variables are probably most accurately described as "module level" variables. Maybe that's a good term we can use going forward to have a collective understanding of what we're talking about.
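For anyone following along, a small sketch of what moving a "module level" crosswalk onto a class looks like, assuming a hypothetical loader (load_config_stub stands in for load_external_config and reads no file) and a hypothetical class MarcSketch with an illustrative mapping. The point it shows is that a class attribute, like a module level variable, is evaluated once when the class body runs, so every instance shares the same loaded data.

```python
from typing import Dict, List, Optional

LOAD_CALLS: List[str] = []  # records each load, for illustration only


def load_config_stub(path: str) -> Dict[str, str]:
    """Hypothetical stand-in for load_external_config; reads no file."""
    LOAD_CALLS.append(path)
    return {"a": "Language material"}  # illustrative mapping, not the real crosswalk


class MarcSketch:
    """Hypothetical class; not the project's Marc transform."""

    # Evaluated once, when the class body runs at import time.
    content_type_crosswalk = load_config_stub(
        "config/marc_content_type_crosswalk.json"
    )

    def content_type(self, code: str) -> Optional[str]:
        # Instances reach the shared class attribute through self.
        return self.content_type_crosswalk.get(code)


first, second = MarcSketch(), MarcSketch()
name = first.content_type("a")
```

Creating more instances does not trigger more loads, so this organization does not reintroduce the "read the file each time" concern raised earlier in the thread.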
As a part 1, looks great to me!
I like "module level" and am always in favor of standardizing the vocabulary for our coding!
Why these changes are being introduced:
* These updates are required to implement the architecture described in the following ADR: https://github.com/MITLibraries/transmogrifier/blob/main/docs/adrs/0005-field-methods.md

How this addresses that need:
* Add field methods and corresponding unit tests: alternate_titles, call_numbers
* Add private methods for key MARC elements: leader and control field '008'
* Rename 'xml' -> 'source_record'
* Update dependencies

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-288
Purpose and background context
Field method refactor for transform class MARC (Part 1).
* Add private methods for key MARC elements: <leader>, <controlfield tag="008">
* Add field methods and corresponding unit tests (alternate_titles and call_numbers).

Note(s):
* [...] get_optional_fields(), which is required by field derivations (that will eventually be moved into their own field methods).
* get_control_field_general_info was chosen over get_fixed_length_data as there are technically multiple fixed length fields in MARC. The fixed length field that MARC depends on is for code 008, which describes "general information" about the record.

How can a reviewer manually see the effects of these changes?
Run make test and verify all unit tests are passing.

Includes new or updated dependencies?
NO

Changes expectations for external applications?
NO

What are the relevant tickets?
* https://mitlibraries.atlassian.net/browse/TIMX-288
Developer
Code Reviewer(s)
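For reviewers less familiar with MARC XML, a rough sketch of what a private method for the 008 control field might do, assuming the standard MARCXML namespace and a locally defined stand-in for SkippedRecordEvent. The function name and error message mirror the tests discussed in this PR, but the body is an assumption, not the code under review.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


class SkippedRecordEvent(Exception):
    """Local stand-in for the project's exception of the same name."""


def get_control_field_general_info(source_record: ET.Element) -> str:
    """Return the text of <controlfield tag="008">, or skip the record."""
    element = source_record.find("marc:controlfield[@tag='008']", MARC_NS)
    # Treat a missing element and an empty element the same way.
    if element is None or not element.text:
        raise SkippedRecordEvent(
            "Record skipped because key information is missing: "
            '<controlfield tag="008">.'
        )
    return element.text


good_record = ET.fromstring(
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<controlfield tag="008">170906s2016    fr mun|o         e zxx d</controlfield>'
    "</record>"
)
blank_record = ET.fromstring(
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<controlfield tag="008"></controlfield>'
    "</record>"
)
general_info = get_control_field_general_info(good_record)
```

Raising instead of returning an empty string lets the caller skip the whole record in one place, which matches the renamed skips_record tests above.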