Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timx 288 marc field method refactor #200

Merged
merged 1 commit into from
Jul 23, 2024

Conversation

jonavellecuerdo
Copy link
Contributor

@jonavellecuerdo jonavellecuerdo commented Jul 22, 2024

Purpose and background context

Field method refactor for transform class MARC (Part 1).

  • Added private methods for key elements in MARC-formatted XML records (<leader>, <controlfield tag="008">(
  • Added field methods and corresponding unit tests for alternate_titles and call_numbers.

Note(s):

  • For the time-being, I've added calls to the private methods at the beginning of get_optional_fields(), which is required by field derivations (that will eventually be moved into their own field methods).
  • The name get_control_field_general_info was chosen over get_fixed_length_data as there are technically multiple fixed length fields in MARC. The fixed length field that MARC depends on is for code 008, which describes "general information" about the record.

How can a reviewer manually see the effects of these changes?

  1. Run make test and verify all unit tests are passing.
  2. Run CLI command
    pipenv run transform -i tests/fixtures/marc/marc_record_all_fields.xml -o output/marc-transformed-records.json -s alma
    
    Output:
    2024-07-22 13:23:04,806 INFO transmogrifier.cli.main(): Logger 'root' configured with level=INFO
    2024-07-22 13:23:04,806 INFO transmogrifier.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
    2024-07-22 13:23:04,806 INFO transmogrifier.cli.main(): Running transform for source alma
    2024-07-22 13:23:05,330 INFO transmogrifier.cli.main(): Completed transform, total records processed: 1, transformed records: 1, skipped records: 0, deleted records: 0
    2024-07-22 13:23:05,330 INFO transmogrifier.cli.main(): Total time to complete transform: 0:00:00.524199
    

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-288

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo jonavellecuerdo self-assigned this Jul 22, 2024
@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-288-marc-field-method-refactor branch from afa2320 to cdad188 Compare July 22, 2024 17:25
@ghukill
Copy link
Contributor

ghukill commented Jul 22, 2024

One question for both @jonavellecuerdo and @ehanson8: I can't comment directly in the PR, but wondering about moving these static file mappings to methods/properties on the MarcTransformer class?

https://github.com/MITLibraries/transmogrifier/pull/200/files#diff-58573f6e57860b560075cbe825c45aace3f2830bb51d47f8d8dbf4b1b0b5d5ebR14-R30

Seems like we'd want to avoid them reading the file each time they are used, but that could be solved with cached instances, or even just read and save on class init.

@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-288-marc-field-method-refactor branch 2 times, most recently from 2c8c701 to e0d7b44 Compare July 22, 2024 18:07
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a question about the temporary values you noted in the PR, asking if they could just be called directly even if not part of a field method proper.

Also, maybe dovetails with a general comment about the static variables at the top.

These all seem to have one thing in common: parse some data that multiple fields will need. If so, then feels like we could do that on init for the MarcTransform class somehow, and reuse that data throughout via self. As the first part of the MARC refactoring, might be a good time to consider that pattern.

Comment on lines 49 to 50
leader_field = Marc._get_leader_field(source_record)
control_field_general_info = Marc._get_control_field_general_info(source_record)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your thinking a bit in having these temporarily here for use in field that don't have their own methods yet... but what about just calling these new and/or renamed methods from those places directly?

e.g. I see leader_field used, but could call directly:

# content_type
if content_type := Marc.json_crosswalk_code_to_name(
    self._get_leader_field(source_record)[6:7],
    marc_content_type_crosswalk,
    source_record_id,
    "Leader/06",
):
    fields["content_type"] = [content_type]

What I think this might help guide is how often these are used. If often, it might be worth caching the results on those sub-methods. If not often, it's probably okay to invoke them directly 1-2 times. I'm a big fan of caching -- where you call the method, and it reuses data if safe to cache and already called -- but there is a bit of overhead there.

Either way, might avoid needing these temporary values at all, and would give some insight into usage during future parts of this refactor. Just a thought.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated Marc to call the private methods as needed but as discussed, caching is not needed at this time!

Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, a few minor suggestions!

Comment on lines 23 to 27
leader_field_insert (str): A string representing a MARC fixed length 'leader'
XML element. Defaults to a dummy value.
control_field_general_info_insert (str): A string representing a MARC fixed length
'general info control field' (i.e., code 008) XML element.
Defaults to a dummy value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smart change!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Heads up: Renamed control_field_general_info_insert -> control_field_insert.

Comment on lines 828 to 841
def test_get_control_field_general_info_if_field_blank():
source_record = create_marc_source_record_stub(
control_field_general_info_insert='<controlfield tag="008"></controlfield>'
)
with pytest.raises(
SkippedRecordEvent,
match=(
'Record skipped because key information is missing: <controlfield tag="008">.'
),
):
Marc._get_control_field_general_info(source_record)


def test_get_control_field_general_info_if_field_missing():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add raises_exception to test names?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually looks like we've used _raises_skipped_record_event in other transforms

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I renamed the test as suggested.

@@ -894,11 +968,7 @@ def test_marc_record_missing_leader_logs_error(caplog):
output_records = Marc("alma", marc_xml_records)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd update this test name since it no longer logs an error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the test to say skips_record for now, but I think these two tests may be part of the old tests we'll want to deprecate eventually!

@@ -908,11 +978,7 @@ def test_marc_record_missing_008_logs_error(caplog):
output_records = Marc("alma", marc_xml_records)
assert len(list(output_records)) == 0
assert output_records.processed_record_count == 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here with the test name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment above.

@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-288-marc-field-method-refactor branch from d1c30e6 to 692c090 Compare July 22, 2024 20:17
@jonavellecuerdo
Copy link
Contributor Author

One question for both @jonavellecuerdo and @ehanson8: I can't comment directly in the PR, but wondering about moving these static file mappings to methods/properties on the MarcTransformer class?

https://github.com/MITLibraries/transmogrifier/pull/200/files#diff-58573f6e57860b560075cbe825c45aace3f2830bb51d47f8d8dbf4b1b0b5d5ebR14-R30

Seems like we'd want to avoid them reading the file each time they are used, but that could be solved with cached instances, or even just read and save on class init.

As @ghukill and I discussed, we opted to only make the following change: assign 'crosswalk' variables as attributes of Marc transform class for the following reasons:

  • Setting them as class attributes has the same effect as having them as "loose" global variables in the module in that they are instantiated only once when the Marc module is imported; whereas calling a class / property method for each one would be called for every transformed record.
  • The idea of having these variables set on the instance via a __post_attrs_init__ was also floated, but transmogrifier does not currently use attrs, this would be a heavier lift. We also spoke about having a custom __init__ method but realized that the base Transformer.__init__ method is "final" and can't be overridden.

@ehanson8
Copy link
Contributor

  • Setting them as class attributes has the same effect as having them as "loose" global variables in the module in that they are instantiated only once when the Marc module is imported; whereas calling a class / property method for each one would be called for every transformed record.

Good summary and I agree that seems like the best path!

Comment on lines +17 to +31
country_code_crosswalk = load_external_config("config/loc-countries.xml", "xml")
holdings_collection_crosswalk = load_external_config(
"config/holdings_collection_crosswalk.json", "json"
)
holdings_format_crosswalk = load_external_config(
"config/holdings_format_crosswalk.json", "json"
)
holdings_location_crosswalk = load_external_config(
"config/holdings_location_crosswalk.json", "json"
)
language_code_crosswalk = load_external_config("config/loc-languages.xml", "xml")
marc_content_type_crosswalk = load_external_config(
"config/marc_content_type_crosswalk.json", "json"
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfection!

@ghukill
Copy link
Contributor

ghukill commented Jul 23, 2024

One question for both @jonavellecuerdo and @ehanson8: I can't comment directly in the PR, but wondering about moving these static file mappings to methods/properties on the MarcTransformer class?
https://github.com/MITLibraries/transmogrifier/pull/200/files#diff-58573f6e57860b560075cbe825c45aace3f2830bb51d47f8d8dbf4b1b0b5d5ebR14-R30
Seems like we'd want to avoid them reading the file each time they are used, but that could be solved with cached instances, or even just read and save on class init.

As @ghukill and I discussed, we opted to only make the following change: assign 'crosswalk' variables as attributes of Marc transform class for the following reasons:

  • Setting them as class attributes has the same effect as having them as "loose" global variables in the module in that they are instantiated only once when the Marc module is imported; whereas calling a class / property method for each one would be called for every transformed record.
  • The idea of having these variables set on the instance via a __post_attrs_init__ was also floated, but transmogrifier does not currently use attrs, this would be a heavier lift. We also spoke about having a custom __init__ method but realized that the base Transformer.__init__ method is "final" and can't be overridden.

Thanks for making this change. Kind of minor, but a bit of nice code organization I think. Also, while I was definitely using "loose" in our conversations, it occurred to me those kind of variables are probably most accurately described as "module level" variables. Maybe that's a good term we can use going forward to have a collective understanding of what we're talking about.

Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a part 1, looks great to me!

@ehanson8
Copy link
Contributor

Thanks for making this change. Kind of minor, but a bit of nice code organization I think. Also, while I was definitely using "loose" in our conversations, it occurred to me those kind of variables are probably most accurately described as "module level" variables. Maybe that's a good term we can use going forward to have a collective understanding of what we're talking about.

I like "module level" and am always in favor of standardizing the vocabulary for our coding!

@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-288-marc-field-method-refactor branch from c2a8ff9 to fedc35a Compare July 23, 2024 18:03
Why these changes are being introduced:
* These updates are required to implement the architecture described
in the following ADR: https://github.com/MITLibraries/transmogrifier/blob/main/docs/adrs/0005-field-methods.md

How this addresses that need:
* Add field methods and corresponding unit tests:
  alternate_titles, call_numbers
* Add private methods for key MARC elements:
  leader and control field '008'
* Rename 'xml' -> 'source_record'
* Update dependencies

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-288
@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-288-marc-field-method-refactor branch from fedc35a to 61db0f7 Compare July 23, 2024 18:04
@jonavellecuerdo jonavellecuerdo merged commit d83c86d into main Jul 23, 2024
5 checks passed
@jonavellecuerdo jonavellecuerdo deleted the TIMX-288-marc-field-method-refactor branch July 23, 2024 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants