Invalid Characters Allowed in Metadata Saved by Metacat UI Editor cause catastrophic dataset error #2481

Open
vchendrix opened this issue Jul 8, 2024 · 5 comments

@vchendrix
Collaborator

Description
The Metacat UI Editor allowed invalid characters to be saved in metadata. When the Metacat indexer tried to process the metadata file, the following error was encountered:

metacat-index 20240630-23:50:14: [ERROR]: SolrIndex.update - could not update the solr index for the object ess-dive-3619bd077a60b7c-20240624T120319367 since Invalid byte 2 of 4-byte UTF-8 sequence. [edu.ucsb.nceas.metacat.index.SolrIndex:update:656]
org.apache.solr.client.solrj.SolrServerException: Invalid byte 2 of 4-byte UTF-8 sequence.
        at edu.ucsb.nceas.metacat.index.SolrIndex.process(SolrIndex.java:237) ~[classes/:?]
        at edu.ucsb.nceas.metacat.index.SolrIndex.insert(SolrIndex.java:396) ~[classes/:?]
        at edu.ucsb.nceas.metacat.index.SolrIndex.update(SolrIndex.java:697) ~[classes/:?]
        at edu.ucsb.nceas.metacat.index.SolrIndex.update(SolrIndex.java:620) [classes/:?]
        at edu.ucsb.nceas.metacat.index.SystemMetadataEventListener$1.run(SystemMetadataEventListener.java:187) [classes/:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_402]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_402]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_402]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_402]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_402].

The result was that the dataset metadata was not indexed in Solr. However, the resource map was created successfully, rendering the dataset uneditable. The metadata in Solr looked as follows:

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"id:ess-dive-3619bd077a60b7c-20240624T120319367",
      "wt":"javabin",
      "version":"2"}},
  "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
      {
        "read_count_i":44,
        "id":"ess-dive-3619bd077a60b7c-20240624T120319367",
        "identifier":"ess-dive-3619bd077a60b7c-20240624T120319367",
        "sku":"ess-dive-3619bd077a60b7c-20240624T120319367",
        "_version_":1803150904808964096,
        "serviceCoupling":"false",
        "isService":false,
        "isDocumentedBy":["ess-dive-3619bd077a60b7c-20240624T120319367"],
        "documents":["ess-dive-9725a595229ffc6-20240520T181650760",
          "ess-dive-a947e57390f1fad-20240613T203820095",
          "ess-dive-babae844b274bf2-20240613T212651812",
          "ess-dive-f718cd02247b6b7-20240520T181650806",
          "ess-dive-03a811f10de6c4a-20240613T204125926",
          "ess-dive-3619bd077a60b7c-20240624T120319367",
          "ess-dive-6c73eb2d4ac33cb-20240624T115801116",
          "ess-dive-e87e6b2bb4d0b0d-20240624T115801104",
          "ess-dive-8775aeed8499ba7-20240613T203820082",
          "ess-dive-a2b05a328913511-20240613T203820068",
          "ess-dive-645a4c9d54aacec-20240624T115754244",
          "ess-dive-8641172de4e1937-20240613T210540301",
          "ess-dive-3ac7448d1be1e0f-20240613T210540311",
          "ess-dive-a29fa7c825dea22-20240613T203820108",
          "ess-dive-aac74b2ca73dbee-20240613T203820102",
          "ess-dive-323f59eaa468ca0-20240520T181650795",
          "ess-dive-047dc22f57f82d8-20240624T115801110",
          "ess-dive-f9fd47d9e4c8c34-20240613T203820077",
          "ess-dive-cf5ba5193c8d2ef-20240621T121606390",
          "ess-dive-8742ead85f7c535-20240613T203820088",
          "ess-dive-0d69c0b5a6f7e45-20240613T203820055",
          "ess-dive-35eccae477fcaaa-20240613T203820115",
          "ess-dive-d3ccee76444e6d9-20240624T115801123"],
        "resourceMap":["ess-dive-2c4cdf7a877c0f4-20240624T120319346"],
        "language":""}]
  }
}
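The record above carries only relationship fields (`documents`, `isDocumentedBy`, `resourceMap`) and none of the fields that parsing the metadata would normally contribute. As a rough sketch, a check like the following could flag such stub records from a Solr JSON response. The names in `EXPECTED_FIELDS` and the `find_stub_docs` helper are assumptions for illustration, not part of Metacat; substitute whichever fields a fully indexed record carries in your schema.

```python
import json

# Assumed field names: placeholders for fields that a fully indexed
# metadata record is expected to carry in the Solr schema in use.
EXPECTED_FIELDS = ("title", "formatId")


def find_stub_docs(solr_response, expected=EXPECTED_FIELDS):
    """Map each document id to the expected fields it is missing."""
    stubs = {}
    for doc in solr_response.get("response", {}).get("docs", []):
        missing = [name for name in expected if name not in doc]
        if missing:
            stubs[doc.get("id", "<no id>")] = missing
    return stubs


if __name__ == "__main__":
    # Trimmed copy of the response shown above.
    raw = (
        '{"response": {"numFound": 1, "docs": [{'
        '"id": "ess-dive-3619bd077a60b7c-20240624T120319367", '
        '"resourceMap": ["ess-dive-2c4cdf7a877c0f4-20240624T120319346"], '
        '"language": ""}]}}'
    )
    print(find_stub_docs(json.loads(raw)))
```

A record that appears in the result was created in the index but never populated from its metadata, which matches the state described here.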

Steps to Reproduce

  1. Use Metacat UI Editor to save metadata with invalid characters.
  2. Attempt to index the metadata with Metacat indexer.
  3. Observe the error in the logs as shown above.

Expected behavior
The metadata should be properly encoded as UTF-8 before being saved, ensuring that it can be indexed without errors.
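One way to picture the expected check is a strict validation pass over the serialized metadata before it is submitted. The sketch below is written in Python for brevity and is not MetacatUI code; `check_metadata_bytes` is a hypothetical helper. It assumes two failure modes are worth catching: bytes that are not valid UTF-8, and characters that decode correctly but are not permitted in XML 1.0.

```python
import re
from typing import List

# Characters that XML 1.0 forbids even when the bytes are valid UTF-8:
# C0 control characters other than tab, LF, and CR, plus U+FFFE and U+FFFF.
_XML10_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def check_metadata_bytes(payload: bytes) -> List[str]:
    """Return a list of problems; an empty list means no problem was found."""
    try:
        text = payload.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # Stop at the first undecodable byte; later offsets are unreliable.
        return [f"invalid UTF-8 at byte offset {exc.start}: {exc.reason}"]

    problems = []
    for match in _XML10_ILLEGAL.finditer(text):
        problems.append(
            f"character U+{ord(match.group()):04X} at index {match.start()} "
            "is not allowed in XML 1.0"
        )
    return problems
```

Rejecting the save when the list is non-empty would surface the problem to the user at edit time instead of at indexing time.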

Screenshots
Screenshot 2024-07-08 at 3 08 58 PM

Additional context
We recovered from this by using the API directly to upload a new metadata file that the Metacat indexer can parse, and then manually creating the resource map. This fixed the issue enough to allow the dataset to be edited and published. However, the previous version is in a state where it will never be properly indexed. The Metacat UI metadata editor should ensure that the metadata is encoded properly as UTF-8.

@mbjones
Member

mbjones commented Jul 9, 2024

Thanks for the report, @vchendrix. This could be related to #2167 and certainly seems to be in the same category of character encoding problems. As with that bug, our error handling pipeline in MetacatUI seems to miss that Metacat produces an error and silently moves on. This has been a common thread and involves data loss, so I am going to label this as critical. I will discuss this with @robyngit and @rushirajnenuji to try to figure out a path forward. Thanks.

@vchendrix
Collaborator Author

No problem. The solution will probably be the same in MetacatUI. The only noticeable difference is that in this case Metacat accepts the update but fails to parse the EML for the Solr index, which was very difficult to remedy. In #2167, Metacat rejects the update, which makes it easier to recover.

@mbjones
Member

mbjones commented Jul 9, 2024

@vchendrix could you attach to this ticket the original EML document that triggers this Solr indexing error? It would be very helpful to be able to reproduce what you mean by "invalid characters" with a concrete reproducible example.

@vchendrix
Collaborator Author

what you mean by "invalid characters" with a concrete reproducible example.

Here is the URL: https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-3619bd077a60b7c-20240624T120319367

The_importance_of_accounting_for_landscape.xml

@vchendrix
Collaborator Author

@mbjones NOTE that once the file is opened in an editor, the characters are automatically encoded, and I was able to upload it and have it parse successfully. The characters were garbage, but it sidestepped the error. The invalid characters, I suspect, are in Step 7 of the Methods.
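Because an editor may rewrite the bytes on open, one way to confirm where the invalid sequences sit is to inspect the downloaded file as raw bytes. The following is a minimal sketch, assuming the file has been saved locally without modification; `find_invalid_utf8` is a hypothetical helper that relies only on Python's strict UTF-8 decoder.

```python
def find_invalid_utf8(data: bytes, context: int = 20):
    """Yield (offset, reason, surrounding_bytes) for each undecodable span."""
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises at the first invalid byte sequence.
            data[pos:].decode("utf-8")
            return
        except UnicodeDecodeError as exc:
            start = pos + exc.start
            end = pos + exc.end
            snippet = data[max(0, start - context):end + context]
            yield start, exc.reason, snippet
            # Resume scanning just past the span the decoder rejected.
            pos = end
```

To use it, read the file in binary mode (`open(path, "rb").read()`) and iterate over the results. The surrounding bytes in each result should help tie an offset back to a specific element in the EML, such as the Methods step mentioned above.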
