-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate date strings when transforming MITAardvark records #124
Conversation
Why these changes are being introduced: The source gisogm should also use the MITAardvark transformer, as the GeoHarvester, even for OGM records, will produce what we are referring to as MITAardvark files. They are Aardvark in format, but meet MIT requirements for TIMDEX. How this addresses that need: * Changes defined transformer class in config Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-109
Why these changes are being introduced: While technically valid for an Aardvark file to have multiple download links as part of dct_references_s, when creating a TIMDEX record, we cannot know which link to include as the sole and primary download link. Single, scalar values are by far the most common and will be used, but if an array of values is encountered, adding a Download link is skipped. How this addresses that need: * Ensure that value coming from dct_references_s for URI 'http://schema.org/downloadUrl' is a scalar value Side effects of this change: * Slight decrease in download links for transforming OGM records Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-109
Why these changes are being introduced: It was determined after some end-to-end testing that a large number of Aardvark records, when transformed, were setting dates in the TIMDEX record that Opensearch could not parse. This was unlike other transformations where the date strings are validated. NOTE: it is anticipated that we will return to date validation, possibly to extend it beyond just validation and into extraction as well (e.g. pulling valid year '1933' from date string '1933?'). How this addresses that need: * A try/except block added to field method _range_dates to warn and continue * All non DateRange values are validated in field method get_dates; if fail validation, they are dropped Side effects of this change: * Massive increase in success rate for Opensearch indexing for gisogm source * Decrease in date strings present in final records that are not valid without manipulation Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-109
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good but 1 bigger picture question before approving
@@ -112,7 +112,7 @@ | |||
"gisogm": { | |||
"name": "OpenGeoMetadata GIS Resources", | |||
"base-url": "https://search.libraries.mit.edu/record/", | |||
"transform-class": "transmogrifier.sources.json.aardvark.OGMAardvark", | |||
"transform-class": "transmogrifier.sources.json.aardvark.MITAardvark", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, OGMAardvark
was referenced here before being defined which we were expecting to need because of the get_source_link
method including gismit
, are we thinking differently about that approach now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really glad you raised this up, for a couple of reasons.
First of all, looking more closely at the get_source_link
which I had not thought to check, I believe that we can update that for MITAardvark transformations to use data from dct_references_s
. In this way, it also helps explain why MIT and OGM records can share the same transformer.
Every record that comes out of GeoHarvester is an "MIT" Aardvark record in the sense that, regardless of origin institution or metadata format, we have crafted the Aardvark file in a way that meets our TIMDEX needs. During that work in GeoHarvester, quite a bit of care is taken to craft the dct_references_s
field which contains URLs.
The value for dct_refereces_s['http://schema.org/url']
is what the "source link" for the record should be. For MIT records this will be https://geodata.libraries.mit.edu/record/<IDENTIFIER>
and for OGM records it will be an external URL that we extracted from the source metadata; gauranteed to be present or it does not get included in the harvester output.
Taking all this together, will work on another commit that:
- updates
get_source_link
to actually read data from the record - will remove
base_url
from thegismit
andgisogm
configurations, as it's not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ehanson8 - just pushed this commit and have re-requested review.
Why these changes are being introduced: Unlike other Transmog sources, which dynamically create a source link for the record based on a base URL and some identifier, MITAardvark records are expected to have a fully formed URL already present in the metadata. This is true for both MIT and OGM records, where MIT records point to geo TIMDEX UI, and OGM records point to the holding institution. How this addresses that need: * Updates get_source_link() method to use data from metadata * Sets config.base_url to string 'None' for both gismit and gisogm Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-109
if len(url_links) == 1: | ||
return url_links[0].url | ||
message = "Could not locate a kind=Website link to pull the source link from." | ||
raise ValueError(message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, much better than what we were doing before
Purpose and background context
The primary change this PR introduces is validating date strings for MITAardvark transformations, particularly for records from OpenGeoMetadata (OGM) where dates are less controlled than when we create them.
Using
validate_date(...)
is inline with most other transformations in Transmogrifier. By adding this, all date related errors while indexing to Opensearch locally were avoided, allowing an additional 26k records to index. However, this means we also lost adding dates to TIMDEX records that were not immediately valid, e.g.:"1993?"
"approximately 1569"
"[1910 to 1919]"
While not ideal to lose information during transformation, this seems acceptable at this time for a couple of reasons:
?
orxxx to yyy
-- we can apply that logic once and know that all transformations will benefitAlso in this PR:
gisogm
should also use theMITAardvark
transformation as those records; MIT and OGM share transformationdct_references_s
from Aardvark recordHow can a reviewer manually see the effects of these changes?
The new unit test in this commit is indicative of how it drops values that are not valid.
Includes new or updated dependencies?
NO
Changes expectations for external applications?
YES: increase in documents indexed, decrease in multiple dates per record
What are the relevant tickets?
Developer
Code Reviewer(s)