Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] xinclude error message is shadowed by missing DTD warning message. #2562

Closed
jarl-dk opened this issue Jun 1, 2022 · 4 comments · Fixed by #3257
Closed

[bug] xinclude error message is shadowed by missing DTD warning message. #2562

jarl-dk opened this issue Jun 1, 2022 · 4 comments · Fixed by #3257

Comments

@jarl-dk
Copy link

jarl-dk commented Jun 1, 2022

High level description
When using xinclude on a non-existing file followed by an xinclude with a DOCTYPE declaration (from a non-existing DTD file) The warning text of the missing DTD file ends up in the error text for the xinclude of the non-existing file.

Reproduce the problem

Create a file named with_doctype_missing_files.xml with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root_element PUBLIC "-//ORG//DTD Doc XML V1.0//EN" "doc.dtd">
<root_element>
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="xinclude_doctype_sub_missing.xml" parse="xml"/>
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="xinclude_doctype_sub_1.xml" parse="xml"/>
</root_element>

Create a file named xinclude_doctype_sub_1.xml with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE preface PUBLIC "-//ORG//DTD Doc XML V1.0//EN" "doc.dtd">
<preface>
  <title>Title</title>
  <para>Paragraph</para>
</preface>

Then run

ruby -rnokogiri -e 'p Nokogiri::XML(Pathname("with_doctype_missing_files.xml")) { |c| c.xinclude.nonet.noent.nodtdload }.to_xml'; echo $?

It will produce something like this:

/home/jarl/.rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/gems/nokogiri-1.13.6-x86_64-linux/lib/nokogiri/xml/node.rb:366:in `process_xincludes': 2:67: WARNING: failed to load external entity "/home/jarl/tmp/test_nokogiri_xinclude/doc.dtd" (Nokogiri::XML::SyntaxError)
        from /home/jarl/.rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/gems/nokogiri-1.13.6-x86_64-linux/lib/nokogiri/xml/node.rb:366:in `do_xinclude'
        from /home/jarl/.rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/gems/nokogiri-1.13.6-x86_64-linux/lib/nokogiri/xml/document.rb:76:in `parse'
        from /home/jarl/.rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/gems/nokogiri-1.13.6-x86_64-linux/lib/nokogiri/xml.rb:8:in `XML'
        from -e:1:in `<main>'
1

The exit code is as expected. However the error message is wrong; it is the warning message from the missing DTD-file that turns up in the error message for the missing xinclude flie. Interesting enough it is the warning caused by the DOCTYPE declaration in xinclude_doctype_sub_1.xml appearing AFTER the xinclude of the missing file missing file (that is the actual error), not from warning caused by the DOCTYPE declaration in with_doctype_missing_files.xml.

Expected behavior

Expected error message:

4:0: ERROR: could not load /home/jarl/tmp/test_nokogiri_xinclude/xinclude_doctype_sub_missing.xml, and no fallback was found

Environment

$ nokogiri -v
# Nokogiri (1.13.6)
    ---
    warnings: []
    nokogiri:
      version: 1.13.6
      cppflags:
      - "-I/home/jarl/.rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/gems/nokogiri-1.13.6-x86_64-linux/ext/nokogiri"
      - "-I/home/jarl/.rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/gems/nokogiri-1.13.6-x86_64-linux/ext/nokogiri/include"
      - "-I/home/jarl/.rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/gems/nokogiri-1.13.6-x86_64-linux/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 3.1.1
      platform: x86_64-linux
      gem_platform: x86_64-linux
      description: ruby 3.1.1p18 (2022-02-18 revision 53f5fc4236) [x86_64-linux]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - 0004-use-glibc-strlen.patch
      - 0005-avoid-isnan-isinf.patch
      - 0006-update-automake-files-for-arm64.patch
      - '0008-htmlParseComment-handle-abruptly-closed-comments.patch'
      - '0009-allow-wildcard-namespaces.patch'
      libxml2_path: "/home/jarl/.rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/gems/nokogiri-1.13.6-x86_64-linux/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.9.14
      loaded: 2.9.14
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      datetime_enabled: true
      compiled: 1.1.35
      loaded: 1.1.35
    other_libraries:
      zlib: 1.2.12
      libgumbo: 1.0.0-nokogiri

Additional context
I believe it is related to this one: #1610

@jarl-dk jarl-dk added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Jun 1, 2022
@flavorjones
Copy link
Member

@jarl-dk Thanks for opening this issue. I'll take a deeper look shortly.

@flavorjones
Copy link
Member

OK, digging in, it turns out that the libxml2 function xmlXIncludeProcessTreeFlags records 3 errors (via our callback registered via xmlSetStructuredErrorFunc) during its operation:

[
 #<Nokogiri::XML::SyntaxError: WARNING: failed to load external entity "xinclude_doctype_sub_missing.xml">,
 #<Nokogiri::XML::SyntaxError: 4:0: ERROR: could not load xinclude_doctype_sub_missing.xml, and no fallback was found>,
 #<Nokogiri::XML::SyntaxError: 2:67: WARNING: failed to load external entity "doc.dtd">
]

Note that there's little to distinguish the "warnings" from the "error" that was the cause of the failure, so the code chooses the last error under the assumption that a fatal error would be the final recorded error. In this case, that assumption is busted.

So the question really becomes: if we raise one and only one exception, which error do we choose? We could have some more complicated logic to look for an error with the word "ERROR" in it, but I don't like depending on unstructured data for this. I'm going to think about it a bit.

@flavorjones
Copy link
Member

flavorjones commented Jun 1, 2022

I think perhaps this is a sign that we should be using the xmlErrorLevel to help distinguish warnings and errors, which will definitely add some complexity to the error handling but might be the right thing to do. I need to think about this a bit more.

As a user, what would you expect Nokogiri to do in this case? If we raised an exception that included the text of all errors and warnings, would that be helpful and/or sufficient? That would be straightforward to implement and gives the full context to the user ...

@flavorjones flavorjones added topic/error-handling and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Jun 1, 2022
flavorjones added a commit that referenced this issue Jun 24, 2024
In places (like document parsing) where it's possible for libxml2 to
emit multiple warnings and errors, aggregate these libxml2 errors into
a single exception so users can see all the problems.

Previously, we were grabbing the most recent "error" which might just
be a warning and not the fatal error preventing parsing from
completing. This was misleading and hid the source of the real
problem.

Now, if there are multiple errors, they're aggregated into a single
Nokogiri::XML::SyntaxError with a message like:

    Multiple errors encountered:
    - WARNING: text of warning message
    - ERROR: text of error message
    - WARNING: text of second warning message

Note that I've also renamed some internal C functions to try to
incrementally get consistent with naming.

Closes #2562
flavorjones added a commit that referenced this issue Jun 24, 2024
In places (like document parsing) where it's possible for libxml2 to
emit multiple warnings and errors, aggregate these libxml2 errors into
a single exception so users can see all the problems.

Previously, we were grabbing the most recent "error" which might just
be a warning and not the fatal error preventing parsing from
completing. This was misleading and hid the source of the real
problem.

Now, if there are multiple errors, they're aggregated into a single
Nokogiri::XML::SyntaxError with a message like:

    Multiple errors encountered:
    WARNING: text of warning message
    ERROR: text of error message
    WARNING: text of second warning message

Note that I've also renamed some internal C functions to try to
incrementally get consistent with naming.

Closes #2562
flavorjones added a commit that referenced this issue Jun 24, 2024
In places (like document parsing) where it's possible for libxml2 to
emit multiple warnings and errors, aggregate these libxml2 errors into
a single exception so users can see all the problems.

Previously, we were grabbing the most recent "error" which might just
be a warning and not the fatal error preventing parsing from
completing. This was misleading and hid the source of the real
problem.

Now, if there are multiple errors, they're aggregated into a single
Nokogiri::XML::SyntaxError with a message like:

    Multiple errors encountered:
    WARNING: text of warning message
    ERROR: text of error message
    WARNING: text of second warning message

Note that I've also renamed some internal C functions to try to
incrementally get consistent with naming.

Closes #2562
@flavorjones
Copy link
Member

See #3257 for what I think is an adequate cosmetic improvement in how errors are handled in these cases.

@flavorjones flavorjones added this to the v1.17.0 milestone Jun 24, 2024
flavorjones added a commit that referenced this issue Jun 24, 2024
**What problem is this PR intended to solve?**

In places (like document parsing) where it's possible for libxml2 to
emit multiple warnings and errors, aggregate these libxml2 errors into a
single exception so users can see all the problems.

Previously, we were grabbing the most recent "error" which might just be
a warning and not the fatal error preventing parsing from completing.
This was misleading and hid the source of the real problem.

Now, if there are multiple errors, they're aggregated into a single
Nokogiri::XML::SyntaxError with a message like:

    Multiple errors encountered:
    WARNING: text of warning message
    ERROR: text of error message
    WARNING: text of second warning message

Note that I've also renamed some internal C functions to try to
incrementally get consistent with naming.

Closes #2562


**Have you included adequate test coverage?**

Yes.


**Does this change affect the behavior of either the C or the Java
implementations?**

It's a cosmetic improvement on the behavior of the CRuby XML and HTML4
parser.

I have not made this improvement to the JRuby implementation, but it
should be easy enough to add if someone wishes to do the work.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants