Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nokogiri_error_raise as a libxml2 handler considered harmful #1610

Open
npac opened this issue Mar 1, 2017 · 7 comments
Open

Nokogiri_error_raise as a libxml2 handler considered harmful #1610

npac opened this issue Mar 1, 2017 · 7 comments
Assignees
Milestone

Comments

@npac
Copy link

npac commented Mar 1, 2017

Memory leak is observed while searching in a document with undefined namespace prefix.

# Nokogiri (1.7.0.1)
    ---
    warnings: []
    nokogiri: 1.7.0.1
    ruby:
      version: 2.3.1
      platform: x86_64-linux
      description: ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/opt/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/nokogiri-1.7.0.1/ports/x86_64-pc-linux-gnu/libxml2/2.9.4"
      libxslt_path: "/opt/rbenv/versions/2.3.1/lib/ruby/gems/2.3.0/gems/nokogiri-1.7.0.1/ports/x86_64-pc-linux-gnu/libxslt/1.1.29"
      libxml2_patches: []
      libxslt_patches: []
      compiled: 2.9.4
      loaded: 2.9.4

Reproduced with:

require 'nokogiri'
while true do 
  begin
    #parse simple xml. no namespace definition
    doc = Nokogiri::XML('<ns1:Root></ns1:Root>')
    # below code raises exception and leaks memory  
    body = doc.search('//ns1:Root').first 
    puts body.to_xml 
  rescue => e
   puts e
  end 
end 
@flavorjones
Copy link
Member

flavorjones commented Mar 2, 2017 via email

@flavorjones
Copy link
Member

I've reproduced this behavior. Digging in.

@flavorjones
Copy link
Member

OK, this leak appears to come from using Nokogiri_error_raise to raise an exception during evaluation of the xpath query, which under the hood does a C longjmp and prevents libxml2 from cleaning up memory that had been allocated before the error was encountered.

Nokogiri_error_raise is used in a handful of places, and so now I'm worried that we may be leaking memory under other circumstances as well. I think I'd like to take a look at overhauling our current error-handling pattern to avoid raising exceptions while libxml2 is in the middle of something.

@flavorjones flavorjones changed the title Undefined namespace prefix error results in memory leak Nokogiri_error_raise considered harmful Mar 6, 2017
@flavorjones
Copy link
Member

I've audited the C code for non-trivial raises, use of handlers, and general patterns of error handling:

places where Nokogiri_error_raise is used

ext/nokogiri/xml_sax_push_parser.c

38-  if (xmlParseChunk(ctx, chunk, size, Qtrue == _last_chunk ? 1 : 0)) {
39-    if (!(ctx->options & XML_PARSE_RECOVER)) {
40-      xmlErrorPtr e = xmlCtxtGetLastError(ctx);
41:      Nokogiri_error_raise(NULL, e);
42-    }
43-  }
44-

characterization:

  • call: xmlParseChunk
  • handlers: none
  • check: function return value
  • raise: xmlErrorPtr from xmlCtxtGetLastError
  • errors: none

ext/nokogiri/xml_relax_ng.c

81-  if(NULL == schema) {
82-    xmlErrorPtr error = xmlGetLastError();
83-    if(error)
84:      Nokogiri_error_raise(NULL, error);
85-    else
86-      rb_raise(rb_eRuntimeError, "Could not parse document");
87-

characterization:

  • call: xmlRelaxNGParse
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
    • xmlRelaxNGSetParserStructuredErrors: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: on document
133-  if(NULL == schema) {
134-    xmlErrorPtr error = xmlGetLastError();
135-    if(error)
136:      Nokogiri_error_raise(NULL, error);
137-    else
138-      rb_raise(rb_eRuntimeError, "Could not parse document");
139-

characterization:

  • call: xmlRelaxNGParse
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
    • xmlRelaxNGSetParserStructuredErrors: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: on document

ext/nokogiri/html_sax_push_parser.c

23-  if(htmlParseChunk(ctx, chunk, size, Qtrue == _last_chunk ? 1 : 0)) {
24-    if (!(ctx->options & XML_PARSE_RECOVER)) {
25-      xmlErrorPtr e = xmlCtxtGetLastError(ctx);
26:      Nokogiri_error_raise(NULL, e);
27-    }
28-  }
29-

characterization:

  • call: htmlParseChunk
  • handlers: none
  • check: return value
  • raise: xmlCtxtGetLastError
  • errors: none

ext/nokogiri/xml_xpath_context.c

209-  }
210-
211-  xmlResetLastError();
212:  xmlSetStructuredErrorFunc(NULL, Nokogiri_error_raise);
213-
214-  /* For some reason, xmlXPathEvalExpression will blow up with a generic error */
215-  /* when there is a non existent function. */

characterization:

  • call: xmlXPathEvalExpression
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_raise # TODO bug!
    • xmlSetGenericErrorFunc
  • check: handler # TODO bug!
  • raise: handler

ext/nokogiri/xml_schema.c

120-  if(NULL == schema) {
121-    xmlErrorPtr error = xmlGetLastError();
122-    if(error)
123:      Nokogiri_error_raise(NULL, error);
124-    else
125-      rb_raise(rb_eRuntimeError, "Could not parse document");
126-

characterization:

  • call: xmlSchemaParse
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
    • xmlSchemaSetParserStructuredErrors: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: on document
173-  if(NULL == schema) {
174-    xmlErrorPtr error = xmlGetLastError();
175-    if(error)
176:      Nokogiri_error_raise(NULL, error);
177-    else
178-      rb_raise(rb_eRuntimeError, "Could not parse document");
179-

characterization:

  • call: xmlSchemaParse
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
    • xmlSchemaSetParserStructuredErrors: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: on document

places where rb_exc_raise is used

ext/nokogiri/xml_reader.c

458-
459-  error = xmlGetLastError();
460-  if(error)
461:    rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
462-  else
463-    rb_raise(rb_eRuntimeError, "Error pulling: %d", ret);
464-

characterization:

  • call: xmlTextReaderRead
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: none # TODO this may be a bug?

ext/nokogiri/xml_node.c

1401-
1402-    error = xmlGetLastError();
1403-    if(error)
1404:      rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
1405-    else
1406-      rb_raise(rb_eRuntimeError, "Could not perform xinclude substitution");
1407-  }

characterization:

  • call: xmlXIncludeProcessTreeFlags
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: none # TODO this may be a bug?

ext/nokogiri/xslt_stylesheet.c

80-    if (!ss) {
81-	xmlFreeDoc(xml_cpy);
82-	exception = rb_exc_new3(rb_eRuntimeError, errstr);
83:	rb_exc_raise(exception);
84-    }
85-
86-    return Nokogiri_wrap_xslt_stylesheet(ss);

characterization:

  • call: xsltParseStylesheetDoc
  • handlers:
    • xsltSetGenericErrorFunc: xslt_generic_error_handler
  • check: return value
  • raise: handler error string
  • errors: none
177-
178-    if (parse_error_occurred) {
179-      exception = rb_exc_new3(rb_eRuntimeError, errstr);
180:      rb_exc_raise(exception);
181-    }
182-
183-    return Nokogiri_wrap_xml_document((VALUE)0, result) ;

characterization:

  • call: xsltApplyStylesheet
  • handlers:
    • xsltSetGenericErrorFunc: xslt_generic_error_handler
    • xmlSetGenericErrorFunc: swallow_superfluous_xml_errors
  • check: handler error string # TODO bug? might be leaking return value
  • raise: handler error string
  • errors: none

ext/nokogiri/xml_xpath_context.c

221-
222-  if(xpath == NULL) {
223-    xmlErrorPtr error = xmlGetLastError();
224:    rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
225-  }
226-
227-  assert(ctx->doc);

characterization:

  • call: xmlXPathEvalExpression
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_raise
    • xmlSetGenericErrorFunc: xpath_generic_exception_handler
  • check: return value
  • raise: xmlGetLastError
  • errors: none

ext/nokogiri/xml_document.c

254-
255-    error = xmlGetLastError();
256-    if(error)
257:      rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
258-    else
259-      rb_raise(rb_eRuntimeError, "Could not parse document");
260-

characterization:

  • call: xmlReadIO
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: on document
298-
299-    error = xmlGetLastError();
300-    if(error)
301:      rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
302-    else
303-      rb_raise(rb_eRuntimeError, "Could not parse document");
304-

characterization:

  • call: xmlReadMemory
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: on document
446-  if(NULL == ptr) {
447-    xmlErrorPtr error = xmlGetLastError();
448-    if(error)
449:      rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
450-    else
451-      rb_raise(rb_eRuntimeError, "Could not create entity");
452-

characterization:

  • call: xmlAddDocEntity
  • handlers: none # TODO bug?
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: none

ext/nokogiri/html_document.c

66-    VALUE encoding_found = rb_funcall(io, id_encoding_found, 0);
67-    if (!NIL_P(encoding_found)) {
68-      xmlFreeDoc(doc);
69:      rb_exc_raise(encoding_found);
70-    }
71-  }
72-

characterization:

  • call: htmlReadIO
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
  • check: return value rb_respond_to(io, id_encoding_found) # TODO i have no idea what this is doing
  • raise: encoding_found
  • errors: none
77-
78-    error = xmlGetLastError();
79-    if(error)
80:      rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
81-    else
82-      rb_raise(rb_eRuntimeError, "Could not parse document");
83-

characterization:

  • call: htmlReadIO
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: on document
123-
124-    error = xmlGetLastError();
125-    if(error)
126:      rb_exc_raise(Nokogiri_wrap_xml_syntax_error(error));
127-    else
128-      rb_raise(rb_eRuntimeError, "Could not parse document");
129-

characterization:

  • call: htmlReadMemory
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: on document

places where rb_raise is used

ext/nokogiri/xml_reader.c

547-
548-  if(reader == NULL) {
549-    xmlFreeTextReader(reader);
550:    rb_raise(rb_eRuntimeError, "couldn't create a parser");
551-  }
552-
553-  rb_reader = Data_Wrap_Struct(klass, NULL, dealloc, reader);

characterization:

  • call: xmlReaderForMemory
  • handlers: none
  • check: return value
  • raise: string
  • errors: none
592-
593-  if(reader == NULL) {
594-    xmlFreeTextReader(reader);
595:    rb_raise(rb_eRuntimeError, "couldn't create a parser");
596-  }
597-
598-  rb_reader = Data_Wrap_Struct(klass, NULL, dealloc, reader);

characterization:

  • call: xmlReaderForIO
  • handlers: none
  • check: return value
  • raise: string
  • errors: none

ext/nokogiri/xml_node.c

1491-    switch (error) {
1492-      case XML_ERR_INTERNAL_ERROR:
1493-      case XML_ERR_NO_MEMORY:
1494:	rb_raise(rb_eRuntimeError, "error parsing fragment (%d)", error);
1495-	break;
1496-      default:
1497-	break;

characterization:

  • call: xmlXIncludeProcessTreeFlags
  • handlers:
    • xmlSetStructuredErrorFunc: Nokogiri_error_array_pusher
  • check: return value
  • raise:
    • xmlGetLastError
    • string
  • errors: none

ext/nokogiri/xml_sax_push_parser.c

69-          filename
70-        );
71-  if (ctx == NULL) {
72:    rb_raise(rb_eRuntimeError, "Could not create a parser context");
73-  }
74-
75-  ctx->userData = NOKOGIRI_SAX_TUPLE_NEW(ctx, self);

characterization:

  • call: xmlCreatePushParserCtxt
  • handlers: none
  • check: return value
  • raise: string
  • errors: none
93-  Data_Get_Struct(self, xmlParserCtxt, ctx);
94-
95-  if (xmlCtxtUseOptions(ctx, (int)NUM2INT(options)) != 0) {
96:    rb_raise(rb_eRuntimeError, "Cannot set XML parser context options");
97-  }
98-
99-  return Qnil;

characterization:

  • call: xmlCtxtUseOptions
  • handlers: none
  • check: return value
  • raise: string
  • errors: none

@flavorjones flavorjones changed the title Nokogiri_error_raise considered harmful Nokogiri_error_raise as a libxml2 handler considered harmful Mar 6, 2017
@flavorjones flavorjones self-assigned this May 11, 2017
@flavorjones
Copy link
Member

I'll further note that this is the script I used to generate the samplings of code I filtered through above (in case I need to re-do the analysis after so much time):

#!/usr/bin/env ruby

target = "ext/nokogiri"
command = "ack -C3 --group --nocolor --ignore-file=match:xml_syntax_error*"

expressions = %w[
  Nokogiri_error_raise
  rb_exc_raise
  rb_raise
]

expressions.each do |expression|
  puts "## places where `#{expression}` is used"
  puts

  cmd = "#{command} #{expression} #{target}"
  output = `#{cmd}`

  output.split("\n").each do |line|
    if line =~ /\.[ch]$/
      puts "### #{line}"
      puts
      puts "```C"
    elsif line.empty?
      puts "```"
      puts
    elsif line =~ /^--$/
      puts "```"
      puts 
      puts "```C"
    else
      puts line
    end
  end
  puts "```"
  puts
end

@flavorjones flavorjones added this to the v1.11.0 milestone Oct 12, 2020
@flavorjones
Copy link
Member

Note that TruffleRuby would also prefer if we didn't raise from a native code callback, see #1882

flavorjones added a commit that referenced this issue Oct 13, 2020
also ensure CRuby raises an XML::XPath::SyntaxError on xpath function
errors (like JRuby). previously CRuby raised a RuntimeError.

Related to #1610 and #1882.
@flavorjones flavorjones modified the milestones: v1.12.0, v1.13.0 Aug 2, 2021
flavorjones added a commit that referenced this issue Dec 8, 2021
The remaining leaks are small. Some are from exception handling
detailed at #1610 and being worked on in #2096.

This allows us to detect and new memory leaks introduced.
flavorjones added a commit that referenced this issue Dec 8, 2021
The remaining leaks are small. Some are from exception handling
detailed at #1610 and being worked on in #2096.

This allows us to detect and new memory leaks introduced.
flavorjones added a commit that referenced this issue Dec 10, 2021
The remaining leaks are small. Some are from exception handling
detailed at #1610 and being worked on in #2096.

This allows us to detect and new memory leaks introduced.
@flavorjones flavorjones modified the milestones: v1.13.0, v1.14.0 Jan 6, 2022
@flavorjones
Copy link
Member

Recent relevant post from @peterzhu2118 with explanatory notes: https://blog.peterzhu.ca/ruby-c-ext-part-8/

Note also some exploratory work I've started at #2096

@flavorjones flavorjones modified the milestones: v1.14.0, v1.15.0 Aug 28, 2022
flavorjones added a commit that referenced this issue Oct 14, 2022
in xml_xpath_context.c:evaluate generic error handling

Related to #1610
@flavorjones flavorjones modified the milestones: v1.15.0, v1.16.0 Apr 28, 2023
flavorjones added a commit that referenced this issue Jul 15, 2024
**What problem is this PR intended to solve?**

I think the stack trace changed with ruby/ruby@51bd8165, see for example
https://github.com/sparklemotion/nokogiri/actions/runs/9935752042/job/27442553528

I'm not sure why this is the only leak showing up now, but am deferring
further study until I devote some time to cleaning up known leaks where
we raise exceptions in Ruby callbacks (from libxml2), see #2096 and
#1610
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants