Validate date strings when transforming MITAardvark records #124
Merged
5 commits
d0648b7 gisogm transformations use MITAardvark transformer (ghukill)
cb9c1f6 Allow Aardvark download links if scalar value (ghukill)
83f26c7 Validate Date.value strings for Aardvark transformations (ghukill)
5d166e8 Additional unit test for MITAardvark date validation (ghukill)
aa84d19 MITAardvark get source link from source metadata (ghukill)
@@ -1,3 +1,3 @@
-{"dct_accessRights_s": "Access rights", "dct_references_s": "", "dct_title_s": "Test title 1", "gbl_mdModified_dt": "", "gbl_mdVersion_s": "", "gbl_resourceClass_sm": "", "gbl_suppressed_b": false, "id": "mit:123", "locn_geometry": ""}
-{"dct_accessRights_s": "Access rights", "dct_references_s": "", "dct_title_s": "Test title 2", "gbl_mdModified_dt": "", "gbl_mdVersion_s": "", "gbl_resourceClass_sm": "", "gbl_suppressed_b": false, "id": "ogm:456", "locn_geometry": ""}
-{"dct_accessRights_s": "Access rights", "dct_references_s": "", "dct_title_s": "Test title 3", "gbl_mdModified_dt": "", "gbl_mdVersion_s": "", "gbl_resourceClass_sm": "", "gbl_suppressed_b": true, "id": "ogm:789", "locn_geometry": ""}
+{"dct_accessRights_s": "Access rights", "dct_references_s": "{\"http://schema.org/url\": \"https://geodata.libraries.mit.edu/record/abc:123\"}", "dct_title_s": "Test title 1", "gbl_mdModified_dt": "", "gbl_mdVersion_s": "", "gbl_resourceClass_sm": "", "gbl_suppressed_b": false, "id": "mit:123", "locn_geometry": ""}
+{"dct_accessRights_s": "Access rights", "dct_references_s": "{\"http://schema.org/url\": \"https://geodata.libraries.mit.edu/record/abc:123\"}", "dct_title_s": "Test title 2", "gbl_mdModified_dt": "", "gbl_mdVersion_s": "", "gbl_resourceClass_sm": "", "gbl_suppressed_b": false, "id": "ogm:456", "locn_geometry": ""}
+{"dct_accessRights_s": "Access rights", "dct_references_s": "{\"http://schema.org/url\": \"https://geodata.libraries.mit.edu/record/abc:123\"}", "dct_title_s": "Test title 3", "gbl_mdModified_dt": "", "gbl_mdVersion_s": "", "gbl_resourceClass_sm": "", "gbl_suppressed_b": true, "id": "ogm:789", "locn_geometry": ""}
@@ -3,6 +3,7 @@
 import re

 import transmogrifier.models as timdex
+from transmogrifier.helpers import validate_date
 from transmogrifier.sources.transformer import JSON, JSONTransformer

 logger = logging.getLogger(__name__)
@@ -31,25 +32,33 @@ def get_main_titles(cls, source_record: dict) -> list[str]:

     @classmethod
     def get_source_link(
-        cls, source_base_url: str, source_record_id: str, source_record: dict[str, JSON]
+        cls,
+        _source_base_url: str,
+        source_record_id: str,
+        source_record: dict[str, JSON],
     ) -> str:
         """
         Class method to set the source link for the item.

         May be overridden by source subclasses if needed.

-        Default behavior is to concatenate the source base URL + source record id.
+        Unlike other Transmogrifier sources that dynamically build a source link,
+        MITAardvark files are expected to have a fully formed and appropriate source link
+        in the metadata already. This method relies on that data.

         Args:
-            source_base_url: Source base URL.
+            _source_base_url: Source base URL. Not used for MITAardvark transforms.
             source_record_id: Record identifier for the source record.
             source_record: A BeautifulSoup Tag representing a single XML record.
                 - not used by default implementation, but could be useful for subclass
                 overrides
         """
-        return source_base_url + cls.get_timdex_record_id(
-            "gismit", source_record_id, source_record
-        )
+        links = cls.get_links(source_record, source_record_id)
+        url_links = [link for link in links if link.kind == "Website"]
+        if len(url_links) == 1:
+            return url_links[0].url
+        message = "Could not locate a kind=Website link to pull the source link from."
+        raise ValueError(message)

Review comment: Perfect, much better than what we were doing before.

     @classmethod
     def get_timdex_record_id(
@@ -183,12 +192,25 @@ def get_contributors(source_record: dict) -> list[timdex.Contributor]:

     @classmethod
     def get_dates(cls, source_record: dict, source_record_id: str) -> list[timdex.Date]:
-        """Get values from source record for TIMDEX dates field."""
-        return (
+        """Get values from source record for TIMDEX dates field.
+
+        This method aggregates dates from a variety of Aardvark fields. Once aggregated,
+        the results are filtered to allow only well formed DateRanges or validated date
+        strings.
+        """
+        dates = (
             cls._issued_dates(source_record)
             + cls._coverage_dates(source_record)
             + cls._range_dates(source_record, source_record_id)
         )
+        return [
+            date
+            for date in dates
+            # skip value validation for DateRange type dates
+            if isinstance(date.range, timdex.DateRange)
+            # validate date string if not None
+            or (date.value is not None and validate_date(date.value, source_record_id))
+        ]

     @classmethod
     def _issued_dates(cls, source_record: dict) -> list[timdex.Date]:
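The filtering rule added to `get_dates` can be tried in isolation. The following is a minimal sketch rather than the Transmogrifier code: `Date` and `DateRange` are simplified stand-ins for the `timdex` models, `filter_dates` is a hypothetical helper name, and this `validate_date` is an assumed validator that accepts only `YYYY-MM-DD` strings, whereas the real helper in `transmogrifier.helpers` may accept other formats.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class DateRange:
    """Simplified stand-in for timdex.DateRange (assumption)."""

    gte: Optional[str] = None
    lte: Optional[str] = None


@dataclass
class Date:
    """Simplified stand-in for timdex.Date (assumption)."""

    kind: Optional[str] = None
    value: Optional[str] = None
    range: Optional[DateRange] = None


def validate_date(date_string: str, source_record_id: str) -> bool:
    """Hypothetical validator: accept only YYYY-MM-DD strings."""
    try:
        datetime.strptime(date_string, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def filter_dates(dates: list[Date], source_record_id: str) -> list[Date]:
    """Keep DateRange dates as-is; keep value dates only if they validate."""
    return [
        date
        for date in dates
        if isinstance(date.range, DateRange)
        or (date.value is not None and validate_date(date.value, source_record_id))
    ]


dates = [
    Date(kind="Issued", value="2023-01-01"),  # kept: valid date string
    Date(kind="Coverage", value="circa 1950"),  # dropped: fails validation
    Date(kind="Coverage", range=DateRange(gte="1943", lte="1946")),  # kept: range
    Date(kind="Issued"),  # dropped: neither a range nor a value
]
kept = filter_dates(dates, "mit:123")
```

Because the `isinstance` check comes first in the `or`, range-type dates never reach the string validator, which mirrors the order of the conditions in the diff.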
@@ -228,9 +250,13 @@ def _range_dates(
         """Get values for issued dates."""
         range_dates = []
         for date_range_string in source_record.get("gbl_dateRange_drsim", []):
-            date_range_values = cls.parse_solr_date_range_string(
-                date_range_string, source_record_id
-            )
+            try:
+                date_range_values = cls.parse_solr_date_range_string(
+                    date_range_string, source_record_id
+                )
+            except ValueError as exc:
+                logger.warning(exc)
+                continue
             range_dates.append(
                 timdex.Date(
                     kind="Coverage",
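The effect of the `try`/`except` in `_range_dates` is that one unparseable range string is logged and skipped instead of failing the whole record. A rough sketch of that behaviour follows. The body of `parse_solr_date_range_string` is not shown in this diff, so the regex-based parser here is an assumption (it expects strings shaped like `[1943 TO 1946]`), and `collect_ranges` is a hypothetical name that returns plain dicts rather than `timdex.Date` objects.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Assumed input shape: "[<start> TO <end>]"
RANGE_PATTERN = re.compile(r"^\[(\S+) TO (\S+)\]$")


def parse_solr_date_range_string(
    date_range_string: str, source_record_id: str
) -> tuple[str, str]:
    """Hypothetical parser: raise ValueError when the string does not match."""
    match = RANGE_PATTERN.match(date_range_string.strip())
    if match is None:
        message = (
            f"Record ID '{source_record_id}': "
            f"unable to parse date range string '{date_range_string}'"
        )
        raise ValueError(message)
    return match.group(1), match.group(2)


def collect_ranges(source_record: dict, source_record_id: str) -> list[dict]:
    """Collect parseable ranges; log and skip the ones that fail to parse."""
    ranges = []
    for date_range_string in source_record.get("gbl_dateRange_drsim", []):
        try:
            gte, lte = parse_solr_date_range_string(
                date_range_string, source_record_id
            )
        except ValueError as exc:
            logger.warning(exc)
            continue
        ranges.append({"gte": gte, "lte": lte})
    return ranges


record = {"gbl_dateRange_drsim": ["[1943 TO 1946]", "1943-1946", "[2000 TO 2001]"]}
ranges = collect_ranges(record, "mit:123")
```

The malformed middle entry produces a warning only; the two well-formed ranges on either side of it are still collected.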
@@ -292,6 +318,7 @@ def get_links(source_record: dict, source_record_id: str) -> list[timdex.Link]:
                     url=link.get("url"), kind="Download", text=link.get("label")
                 )
                 for link in links_object.get("http://schema.org/downloadUrl", [])
+                if isinstance(link.get("url", {}), str)
             ]
         )
         if schema_url := links_object.get("http://schema.org/url"):
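The added `isinstance` guard keeps a download link only when its `url` is a plain string. As an illustration only, the sketch below applies the same guard to a record whose `dct_references_s` is a JSON-encoded string, as in the test fixtures above. `download_links` is a hypothetical function that returns dicts instead of `timdex.Link` objects, and the example URLs are made up.

```python
import json

DOWNLOAD_KEY = "http://schema.org/downloadUrl"


def download_links(source_record: dict) -> list[dict]:
    """Return download links whose 'url' is a scalar string (sketch only)."""
    # dct_references_s is treated as a JSON string; an empty value yields no links
    links_object = json.loads(source_record.get("dct_references_s") or "{}")
    return [
        {"url": link["url"], "kind": "Download", "text": link.get("label")}
        for link in links_object.get(DOWNLOAD_KEY, [])
        if isinstance(link.get("url"), str)
    ]


record = {
    "dct_references_s": json.dumps(
        {
            DOWNLOAD_KEY: [
                {"label": "Data", "url": "https://example.com/data.zip"},
                # non-scalar url: skipped by the isinstance guard
                {"label": "Nested", "url": ["https://example.com/other.zip"]},
                # missing url: also skipped
                {"label": "Missing"},
            ]
        }
    )
}
links = download_links(record)
```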
So, `OGMAardvark` was referenced here before being defined, which we were expecting to need because of the `get_source_link` method including `gismit`. Are we thinking differently about that approach now?
Really glad you raised this, for a couple of reasons.

First of all, looking more closely at `get_source_link`, which I had not thought to check, I believe that we can update it for MITAardvark transformations to use data from `dct_references_s`. In this way, it also helps explain why MIT and OGM records can share the same transformer.

Every record that comes out of GeoHarvester is an "MIT" Aardvark record in the sense that, regardless of origin institution or metadata format, we have crafted the Aardvark file in a way that meets our TIMDEX needs. During that work in GeoHarvester, quite a bit of care is taken to craft the `dct_references_s` field, which contains URLs.

The value for `dct_references_s['http://schema.org/url']` is what the "source link" for the record should be. For MIT records this will be `https://geodata.libraries.mit.edu/record/<IDENTIFIER>`, and for OGM records it will be an external URL that we extracted from the source metadata. That URL is guaranteed to be present; otherwise the record does not get included in the harvester output.

Taking all this together, will work on another commit that:
- updates `get_source_link` to actually read data from the record
- removes `base_url` from the `gismit` and `gisogm` configurations, as it's not needed
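To make the proposal concrete, here is a minimal sketch of reading the source link straight from `dct_references_s`. It is an assumption-laden illustration, not the merged code: the diff above goes through `get_links` and selects the single `kind == "Website"` link, whereas this hypothetical `source_link_from_references` function parses the JSON string directly.

```python
import json

SCHEMA_URL_KEY = "http://schema.org/url"


def source_link_from_references(source_record: dict) -> str:
    """Return the schema.org/url value from dct_references_s (sketch only)."""
    # an empty or missing dct_references_s is treated as an empty object
    references = json.loads(source_record.get("dct_references_s") or "{}")
    url = references.get(SCHEMA_URL_KEY)
    if not isinstance(url, str) or not url:
        message = "Could not locate a source link in dct_references_s."
        raise ValueError(message)
    return url


mit_record = {
    "id": "mit:123",
    "dct_references_s": json.dumps(
        {SCHEMA_URL_KEY: "https://geodata.libraries.mit.edu/record/abc:123"}
    ),
}
source_link = source_link_from_references(mit_record)
```

Since the link comes from the record itself, no per-source `base_url` is consulted, which is why that setting can be dropped from the `gismit` and `gisogm` configurations.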
@ehanson8 - just pushed this commit and have re-requested review.