Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timx 287 ead field method refactor 1 #185

Merged
merged 1 commit into from
May 31, 2024

Conversation

jonavellecuerdo
Copy link
Contributor

@jonavellecuerdo jonavellecuerdo commented May 30, 2024

Purpose and background context

Field method refactor for transform class Ead (Part 1).

  • Added private"methods for key elements in EAD-formatted XML records (<archdesc>, <archdesc><did>, <archdesc><controlaccess>
  • Added field methods and corresponding unit tests for alternate_titles, citation, content_type

Note: For the time-being, I've added calls to the private methods at the beginning of get_optional_fields(), which is required by field derivations that will eventually be moved into their own field methods.

How can a reviewer manually see the effects of these changes?

  1. Run make test and verify all unit tests are passing.
  2. Run CLI command
    pipenv run transform -i tests/fixtures/ead/ead_record_all_fields.xml -o output/ead-transformed-records.json -s aspace
    
    Output:
    2024-05-30 12:33:59,914 INFO transmogrifier.cli.main(): Logger 'root' configured with level=INFO
    2024-05-30 12:33:59,914 INFO transmogrifier.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
    2024-05-30 12:33:59,914 INFO transmogrifier.cli.main(): Running transform for source aspace
    2024-05-30 12:33:59,930 WARNING transmogrifier.sources.transformer.get_valid_title(): Record repositories/2/resources/1 has multiple titles. Using the first title from the following titles found: ['Charles J. Connick Stained Glass Foundation Collection', 'Title 2', 'Title 3']
    2024-05-30 12:33:59,930 INFO transmogrifier.cli.main(): Completed transform, total records processed: 1, transformed records: 1, skipped records: 0, deleted records: 0
    2024-05-30 12:33:59,930 INFO transmogrifier.cli.main(): Total time to complete transform: 0:00:00.016147
    
    

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo jonavellecuerdo self-assigned this May 30, 2024
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review May 30, 2024 16:37
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think it looks great! Left a request for docstrings in the helper methods, and a question, but overall looking good to me.

transmogrifier/sources/xml/ead.py Show resolved Hide resolved
@@ -362,19 +327,86 @@ def generate_name_identifier_url(cls, name_element: Tag) -> list | None:
return None

@classmethod
def get_main_titles(cls, xml: Tag) -> list[str]:
def _get_collection_description(cls, source_record: Tag) -> Tag:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose a small docstring here just to explain that it's widely used for TIMDEX field values, and therefore makes sense to define this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstrings have been added!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

raise SkippedRecordEvent(message)

@classmethod
def _get_collection_description_did(cls, source_record: Tag) -> Tag:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: small docstring could be helpful to explain why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstrings have been added!

Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good, a few suggestions!

@@ -1,10 +1,61 @@
# ruff: noqa: E501
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps just a personal opinion but I think it's preferable to not ignore E501 for a module (or even a line really). For the xml_string var in create_ead_source_record_stub, you can use spaces instead of tabs and you can break up the long citation strings like this:

assert Ead.get_citation(source_record) == (
        "Charles J. Connick Stained Glass Foundation Collection, "
        "VC-0002, box X. Massachusetts Institute of Technology, "
        "Department of Distinctive Collections, Cambridge, Massachusetts."
    )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! On a somewhat related note (and is also why I hesitated to reformat the strings):

When I save on an open XML file, I have a tool in my IDE that reformats the XML. It results in newlines being added throughout the XML file, which then causes the tests to fail.

@ehanson8 @ghukill Do you think it's worth creating an issue to whitespace/newline-proof the XML transforms? 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd probably want to preserve whitespace inside an element (its text), as this is often intentional by the record creator. I think it would fundamentally change the nature of the content if we introduce newline characters inside a long string of text for an element.

But reformatting the XML in a way that just moves elements around, but not element text, that seems okay / good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll forego creating an issue for now as the newlines are resulting from me not turning off the XML formatter in my IDE 😅 .

Circling back to original comment, I removed the syntax to ignore E501 errors and made updates accordingly!

assert ("has a date that couldn't be parsed: 'undated'") in caplog.text
assert ("has a later start date than end date: '2001', '1999'") in caplog.text


def test_ead_record_correct_identifiers_from_multiple_unitid(caplog):
def test_ead_transform_with_multiple_unitid_gets_valid_ids():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ead has more of these edge case tests than the other sources we've done so far, we might consider redoing these with stub records but not sure if that's worth the effort

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ehanson8 For this PR, I'll stick to just renaming the test and will revisit when I work on the get_identifiers field method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's totally fine, just food for thought!

Comment on lines 613 to 626
<archdesc level="collection">
<did>
<unittitle>
Charles J. Connick Stained Glass
<emph>
Foundation
<emph>Collection</emph>
</emph>
<num>VC.0002</num>
</unittitle>
<unittitle></unittitle>
<unittitle></unittitle>
</did>
</archdesc>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is duplicated a lot, you might refactor so top level elements like <archdesc> aren't repeated in every test

Copy link
Contributor Author

@jonavellecuerdo jonavellecuerdo May 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ehanson8 @ghukill Let me know what you think of the updates I made to the create_ead_source_record_stub method. It was the only way I could think of simplifying some of the metadata_insert arguments throughout the testing module (removes the <archdesc> and <archdesc><did> elements from the string.

In general, we want to avoid complex functions in the testing module, but maybe this is a simple enough method that it isn't a problem? 🤔

tests/sources/xml/test_ead.py Outdated Show resolved Hide resolved
transmogrifier/sources/xml/ead.py Outdated Show resolved Hide resolved
transmogrifier/sources/xml/ead.py Show resolved Hide resolved
Comment on lines 371 to 372
citation := cls.create_string_from_mixed_value(
citation_element, " ", ["head"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to start using named args for this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was considering this but felt conflicted with @ghukill 's comment on a previous PR, but perhaps the guideline @ghukill is only applicable for "methods with single positional arguments" or "when testing"?

In any case, I will update to use named args! I think create_string_from_mixed_value is nifty but is difficult to understand just by the name alone.

Copy link
Contributor

@ghukill ghukill May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify my comment a bit...

If I'm reading correctly, this is the signature:

def create_string_from_mixed_value(
        cls,
        xml_element: Tag,
        separator: str = "",
        skipped_elements: list[str] | None = None,
    )...

So anytime called, I'd expect to see:

  • required, positional argument xml_element is not named, but just passed positionally
  • the rest of them, passed as name arguments, because named in the signature
    • unless the defaults are okay, and then okay to omit

Apologies if I over-complicated anywhere, but I think it really boils down to:

  • if positional argument, pass as positional
  • if named argument, passed as named

Does that mesh with the discussion here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in this case, I'd expect:

cls.create_string_from_mixed_value(
    citation_element,
    separator = " ",
    skipped_elements = ["head"]
)

Open to pushback, but this convention feels like a pretty widely followed convention?

Copy link
Contributor Author

@jonavellecuerdo jonavellecuerdo May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think my confusion was I took "named arguments" (or keyword arguments) literally. 🤔 Technically, all the arguments are positional since the signature isn't as follows:

def create_string_from_mixed_value(
        cls,
        xml_element: Tag,
        *,
        separator: str = "",
        skipped_elements: list[str] | None = None,
)

I'm definitely on board with the proposed convention; just curious if it's more accurate to say:

use keyword arguments in the method call when (1) overriding positional arguments with default values and (2) there are multiple arguments in the function signature

?

Copy link
Contributor

@ghukill ghukill May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm reading it right, this is the actual function definition:

@classmethod
def create_string_from_mixed_value(
    cls,
    xml_element: Tag,
    separator: str = "",
    skipped_elements: list[str] | None = None,
) -> str:
   ...

Apologies if I'm being imprecise with my words here or earlier, but I would characterize the arguments like:

  • cls: required, placeholder, can kind of ignore
  • xml_element: required, positional
  • separator: because it has equals sign =, it's named, optional, with default
  • skipped_elements: same as separator

I take your point that technically all the arguments are positional, good point. But I suppose my position is that conventionally -- and this is open to debate for sure -- if an argument has an equals sign = than it's treated as a named argument, and therefore optional with a default value.

This SO post breaks it down nicely, and exposes that I have been lazy with my terminology.

If we ever do deal with functions that use *args or **kwargs, then I think it gets more subtle (to your example above), but for most signatures, I feel like it's often positional (no defaults) and then named/keyword (with defaults). My vote is to:

  1. not use the arg name for positionals
  2. do use the arg name for named/keyword

Thoughts? or am I just making a mountain out of a mole hill 😅?

Copy link
Contributor Author

@jonavellecuerdo jonavellecuerdo May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for clarifying! I accept / understand your definitions and we are on the same page! I do wonder whether we should start using * or / in our function signatures to more explicitly distinguish positional vs named/keyword arguments (but that's beyond the scope of this PR 🤓). If I remember correctly, Ruff throws errors for boolean positional arguments (FBT001).

Circling back to original comment: changes have been made.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That feels right to me but I honestly am learning some of these conventions in these PR conversations 🙃

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conversation is the first time I've seen the / operator for function/method signatures, and helped clarify for me what the * operator does without an immediate following variable name like *args.

This helped me understand it more tacitly:

In [1]: def foo(a, b, *, c="horse"):
   ...:     print(a,b,c)
   ...: 

In [2]: foo(42, 101, 999, 888, c='zebra')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 foo(42, 101, 999, 888, c='zebra')

TypeError: foo() takes 2 positional arguments but 4 positional arguments (and 1 keyword-only argument) were given

In [3]: def bar(a, b, *args, c="horse"):
   ...:     print(a,b,args,c)
   ...: 

In [4]: bar(42, 101, 999, 888, c='zebra')
42 101 (999, 888) zebra

I'm definitely not opposed to using * and / in signatures if we find them useful. Here is an edge case where I could see it being useful, when you want to force keyword argument or no positional:

def baz(*, a=1):
    pass

In [6]: baz()
1

In [7]: baz(42)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 baz(42)

TypeError: baz() takes 0 positional arguments but 1 was given

In [8]: baz(a=101)
101    

Ineresting stuff! In cases without * or /, glad we're honing in on some conventions, and curious if we ever run into situations where it would be nice to use them. I read an article this morning that I found pretty convincing that the main appeal is often APIs, where you want to finely control how functions are called by users potentially not familiar with the library.

@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-287-ead-field-method-refactor-1 branch 3 times, most recently from b8adf19 to 2c8339b Compare May 31, 2024 13:26
Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me and I'm glad this PR led to some learning for all of us! 🙂

header_insert: str = "", metadata_insert: str = ""
header_insert: str = "",
metadata_insert: str = "",
parent_element: Literal["archdesc", "did"] = None, # noqa: RUF013
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't too complex in my mind and saves space in the unit tests but I defer to @ghukill for final approval of this

Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Thanks again for good discussions. I'm glad in these early updates, we're hashing out some of these conventions and details.

@@ -362,19 +327,86 @@ def generate_name_identifier_url(cls, name_element: Tag) -> list | None:
return None

@classmethod
def get_main_titles(cls, xml: Tag) -> list[str]:
def _get_collection_description(cls, source_record: Tag) -> Tag:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment on lines +335 to +339
For EAD-formatted XML, all of the information about a collection
that is relevant to the TIMDEX data model is
encapsulated within the <archdesc level="collection"> element.
If this element is missing, it suggests that there is a structural
error in the record.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, thanks.

Why these changes are being introduced:
* These updates are required to implement the architecture described
in the following ADR: https://github.com/MITLibraries/transmogrifier/blob/main/docs/adrs/0005-field-methods.md

How this addresses that need:
* Rename param 'xml' to 'source_record'
* Add private methods for retrieving key elements:
  <archdesc>, <archdesc><did>, <controlaccess>
* Raise SkippedRecordEvent when <archdesc> and <archdesc><did>
  are missing
* Add field methods for alternate_titles, citation, content_type
* Use named args for 'create_list/string_from_mixed_value'
* Update tests
  * Rename tests
  * Add method to create source records from EAD-formatted XML for unit tests
  * Add tests for field methods alternate_titles, main_titles, citation, content_type
  * Avoid linting error E501

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-287
@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-287-ead-field-method-refactor-1 branch from 2c8339b to 714e2cc Compare May 31, 2024 15:21
@jonavellecuerdo jonavellecuerdo merged commit 2b499cb into main May 31, 2024
3 checks passed
@jonavellecuerdo jonavellecuerdo deleted the TIMX-287-ead-field-method-refactor-1 branch May 31, 2024 15:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants