
IDEA: improving Turtle performance with non-strict mode #81

Open
pchampin opened this issue Mar 4, 2022 · 14 comments

Comments

@pchampin
Collaborator

pchampin commented Mar 4, 2022

If I remember correctly, we found out some time ago that a lot of time was spent in checking PN_CHARS.

We could add a strict flag in the parser configuration; when set to true, the parser would use the current code, rejecting any invalid character. When set to false (the default, following Postel's law), the parser would accept any non-ASCII character (since only ASCII characters are "significant" for the syntax anyway).

Not sure how much this would improve performance, but it's worth a try. @Tpt WDYT?
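For illustration, the lenient check described above might look something like this (a minimal sketch with hypothetical names, not rio's actual code; the strict branch shows only a few of the PN_CHARS ranges from the Turtle grammar):

```rust
/// Hypothetical sketch: in lenient mode, any non-ASCII character is
/// accepted without consulting the PN_CHARS ranges of the Turtle grammar.
fn is_pn_chars(c: char, strict: bool) -> bool {
    // ASCII subset of PN_CHARS (identical in both modes).
    let ascii_ok = c.is_ascii_alphanumeric() || c == '_' || c == '-';
    if !strict {
        // Lenient: every non-ASCII code point passes, skipping range checks.
        ascii_ok || !c.is_ascii()
    } else {
        // Strict: check the Unicode ranges listed in the Turtle grammar
        // (only a few of them shown here for brevity).
        ascii_ok
            || c == '\u{00B7}'
            || ('\u{00C0}'..='\u{00D6}').contains(&c)
            || ('\u{00D8}'..='\u{00F6}').contains(&c)
            || ('\u{00F8}'..='\u{02FF}').contains(&c)
    }
}
```

The lenient branch replaces a cascade of range comparisons with a single `is_ascii` test, which is where the hoped-for speedup would come from.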

@Tpt
Collaborator

Tpt commented Mar 4, 2022

It's a great idea! I would add two elements to this "lenient" mode:

  1. As cheap as possible IRI validation: it seems to me this is currently the major bottleneck of the N-Triples parser. We might validate only against the Turtle grammar and not against RFC 3987.
  2. Cheap language tag validation: validate only against the Turtle grammar and not RFC 5646.
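A sketch of what such grammar-only checks could look like (hypothetical helpers, not rio's API; the IRI check assumes `\uXXXX` escapes have already been resolved):

```rust
/// Turtle's IRIREF production only forbids control characters, space,
/// and the characters <>"{}|^`\ between the angle brackets; everything
/// else passes, with no RFC 3987 structural validation.
fn iri_ok_turtle(iri: &str) -> bool {
    iri.chars().all(|c| {
        c > '\u{20}'
            && !matches!(c, '<' | '>' | '"' | '{' | '}' | '|' | '^' | '`' | '\\')
    })
}

/// LANGTAG ::= '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
/// Checked against the grammar only -- no RFC 5646 subtag validation.
fn langtag_ok_turtle(tag: &str) -> bool {
    let mut parts = tag.split('-');
    let first = match parts.next() {
        Some(p) => p,
        None => return false,
    };
    !first.is_empty()
        && first.bytes().all(|b| b.is_ascii_alphabetic())
        && parts.all(|p| !p.is_empty() && p.bytes().all(|b| b.is_ascii_alphanumeric()))
}
```

Both checks are a single linear pass over ASCII-classifiable bytes, which is about as cheap as validation can get.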

@pchampin
Collaborator Author

pchampin commented Mar 4, 2022

I'm not sure about that... Being a valid IRI (resp. language tag) is part of the contract of the datatypes you expose. So letting the parser produce such "impure" data will contaminate the user's code, and may eventually break it.

@Tpt
Collaborator

Tpt commented Mar 4, 2022

Being a valid IRI (resp. language tag) is part of the contract of the datatypes you expose.

That's a good point. However, if we want IRIs to always be valid, we would still have to validate (maybe less strictly?) PN_CHARS, because they are used in PN_LOCAL, whose value is used directly to build IRIs.

@pchampin
Collaborator Author

pchampin commented Mar 4, 2022

The limitations in PN_LOCALS are not all related to IRI syntax. @drobilla pointed out to me some time ago that, for example, slashes are forbidden in Turtle pnames only because they are forbidden in SPARQL. So a:b/c is invalid in Turtle, for the sole reason that it would be ambiguous in SPARQL (looks like a property path). That's what got me thinking about a lenient mode for the parser.

Another option would be to have

  • strict: only accepts valid Turtle
  • lenient: accepts some invalid Turtle, but produces valid IRIs and language tags
  • bisounours: accepts some invalid Turtle and does not check IRIs nor language tags

with a big red flag on the last one...
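For illustration, the three modes could be captured in a configuration enum along these lines (a hypothetical sketch, not rio's actual API):

```rust
/// Hypothetical parser configuration for the three modes discussed above.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ParserMode {
    /// Only accepts valid Turtle (the default).
    Strict,
    /// Accepts some invalid Turtle, but still produces valid IRIs
    /// and language tags.
    Lenient,
    /// Accepts some invalid Turtle and does not check IRIs nor
    /// language tags. Big red flag: output terms may be invalid.
    Bisounours,
}

impl ParserMode {
    /// Whether the terms this mode produces are guaranteed valid.
    fn guarantees_valid_terms(self) -> bool {
        !matches!(self, ParserMode::Bisounours)
    }
}
```

The point of the enum is that only the last variant breaks the datatype contract mentioned earlier; the first two differ only in what input they reject.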

@Tpt
Collaborator

Tpt commented Mar 4, 2022

The limitations in PN_LOCALS are not all related to IRI syntax. @drobilla pointed out to me some time ago that, for example, slashes are forbidden in Turtle pnames only because they are forbidden in SPARQL. So a:b/c is invalid in Turtle, for the sole reason that it would be ambiguous in SPARQL (looks like a property path). That's what got me thinking about a lenient mode for the parser.

That's a great point! Thank you! I tend to believe that the "lenient" mode won't bring much performance improvement over "strict", because we would still have to validate PN_LOCAL quite carefully, but if someone is willing to take the time to implement and benchmark it, I would love to see the results.

bisounours: accepts some invalid Turtle and does not check IRIs nor language tags

I love this name. This mode might be convenient for simple use cases where people don't care much about validity.

@pchampin
Collaborator Author

pchampin commented Mar 4, 2022

I love this name.

It was intended as a working title, but why not ;)
The English version is "care bear", but I'm not sure it has the same connotation in English as "bisounours" has in French...

@drobilla

drobilla commented Mar 5, 2022

The Turtle grammar for pnames is indeed... unfortunate.

If you want to journey down that rabbit hole, I'd be happy to implement a compatible "lenient" (for lack of a better term) mode in serd, although we'd need to agree on what the grammar is. I confess to having complained for years about how silly all of those SPARQL warts are, but never actually writing down or implementing what I think the grammar should be :)

@pchampin
Collaborator Author

pchampin commented Mar 7, 2022

although we'd need to agree on what the grammar is

I don't think that we need that. The rationale of the non-strict mode is not to define an alternate version of Turtle, that we would need to agree on. It is to separate two distinct concerns: validation, and production of triples.

Currently, the rio_turtle parser (and, I guess, the serd parser) plays both roles. It produces triples, but also rejects any invalid input, therefore also acting as a validator. But it does not have to be that way.

Consider HTML: the two concerns are clearly separated. The HTML validator only validates; it does not produce (render) anything. Browsers, on the other hand, produce visual output from (i.e. render) HTML input, and they strive to be liberal, producing output even for invalid HTML (following Postel's law). While we can complain if two browsers render the same valid HTML differently (one of them, at least, must be wrong w.r.t. the HTML standard), we cannot complain if they differ in rendering invalid HTML. In that case, we should indeed blame the producer of the invalid HTML.

Coming back to Turtle and the Turtle test suite: validating parsers need to be strict in what they accept (i.e. pass the positive and negative tests). But producing-only parsers may, on the contrary, be liberal. Then it does not really matter if they don't pass all the negative tests (they still need, of course, to pass all positive tests). And it does not really matter either if the superset of Turtle that one parser accepts is not exactly the same as the superset accepted by another liberal parser.


All that being said, I agree that it could also be interesting to define a superset of Turtle that would have a simpler grammar, not encumbered with constraints inherited from SPARQL. Then even validating parsers (and strict serialisers) would be easier to develop. But that's one step further than what I was proposing above.

@drobilla

drobilla commented May 9, 2022

As it happens, I think the lax mentality that browser tech in general uses is one of the most catastrophic mistakes in technology there has ever been. The web is an irreparable mess, virtually impossible for new implementations to join, and the infinite minefield of invalid markup out there is a big part of why. Strict tools are, in the grand scheme of things, better, in my opinion. Lax seems like a nice idea in a myopic sort of way, but can be disastrous in the grand scheme of things. That said, it's a useful feature, but I think lax/tolerant/whatever parsing should be optional (and disabled by default). Otherwise, all we get is garbage data everywhere, and interoperability suffers. In short, the HTML-and-friends universe is the last place I think we should be taking guidance from on strictness.

Anyway, that's something of a tangent, but since this was originally about performance, I doubt that this would have much impact. The non-validating case would get a bit faster, perhaps, but at the cost of making the validating case (which should be the default) slower, since all the data would need to be scanned twice (which, on modern architectures, is a much more impactful concern than doing a check or two here and there; cache rules everything around me...).

My interest in this is more about complexity, maintenance burden, and so on. Turtle is far too annoying to implement from scratch, particularly when the most annoying parts are just holding the syntax back for no good reason. That said, a much simpler grammar certainly wouldn't be slower, so...

@pchampin
Collaborator Author

@drobilla Granted, the analogy with HTML has its limits. Humans are more tolerant than machines, and so lenient HTML is less of a problem than lenient RDF. In any case, I agree that the strict mode should be the default.

@Tpt
Collaborator

Tpt commented May 12, 2022

I agree that the strict mode should be the default.

+1

I doubt that this would have much impact.

I believe that Rio is currently stricter than serd and validates IRIs according to RFC 3987. I did some quick profiling which suggests that this is the most important bottleneck in Rio for N-Triples, hence @pchampin's suggestion. Not validating IRIs at all would also allow making use of fast tools like memchr, which might bring some extra performance.
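To make the memchr idea concrete: once the IRI content is not validated, reading an IRIREF reduces to scanning for the closing `>`. A sketch with hypothetical names, using std's `position` as a stand-in for `memchr::memchr` (which does the same byte scan with SIMD):

```rust
/// Sketch: with no IRI validation, the parser can scan straight from
/// '<' to the next '>' byte. Iterator::position stands in here for
/// memchr::memchr, which performs the same scan much faster.
fn read_iriref(input: &[u8]) -> Option<(&str, usize)> {
    if input.first() != Some(&b'<') {
        return None;
    }
    // Index of '>' in the full input (offset by 1 for the leading '<').
    let end = input[1..].iter().position(|&b| b == b'>')? + 1;
    // In "bisounours" mode the bytes in between are taken as-is.
    let iri = std::str::from_utf8(&input[1..end]).ok()?;
    Some((iri, end + 1)) // IRI text and number of bytes consumed
}
```

The per-character RFC 3987 checks disappear entirely; only UTF-8 well-formedness of the slice is still enforced.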

@arenas-guerrero-julian

Looking at CPU usage when loading N-Quads in Oxigraph, it mostly uses 1 core. Would it be feasible to parallelize parsing of N-Triples/Quads using rayon? Since they are row-based (like CSV), chunks of rows could be parsed independently by different cores.

@Tpt
Collaborator

Tpt commented Dec 13, 2022

Would it be feasible to parallelize parsing of N-Triples/Quads using rayon?

Yes, that's definitely doable. We might use memchr to find the line breaks and then send the resulting lines to a pool of worker threads.
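The chunked scheme could be sketched as follows, using only the standard library so the example is self-contained (in a real implementation, memchr would locate the line breaks and rayon would manage the worker pool; counting non-empty lines stands in for parsing each line):

```rust
use std::thread;

/// Sketch of chunked parallel processing of a line-based format such as
/// N-Triples/N-Quads: split the input into runs of whole lines, hand
/// each run to a worker, and combine the per-worker results.
fn count_nonempty_lines_parallel(input: &str, workers: usize) -> usize {
    let lines: Vec<&str> = input.lines().collect();
    // Chunk size is always >= 1 so `chunks` never panics.
    let chunk = lines.len() / workers.max(1) + 1;
    thread::scope(|s| {
        let handles: Vec<_> = lines
            .chunks(chunk)
            .map(|run| {
                // Each worker processes its run of lines independently;
                // a real parser would emit triples here instead.
                s.spawn(move || run.iter().filter(|l| !l.trim().is_empty()).count())
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}
```

The key property this relies on is the one noted above: every line of N-Triples/N-Quads is an independent statement, so no parser state crosses chunk boundaries.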

@rami3l

rami3l commented Feb 6, 2023

I ran into this issue today when parsing a real-world .ttl (a DBpedia dump) with lightrdf (which depends on this parser for its really great performance)...

Error                                     Traceback (most recent call last)
Cell In[6], line 8
      5 cnt = 0
      7 # `None` matches arbitrary term
----> 8 for triple in doc.search_triples(None, None, None):
      9     cnt += 1
     11 cnt

Error: error while parsing IRI 'http://www:FromMyHometown.net': Invalid character 'F' on line 1810714 at position 129

I do understand that it's better to have a valid .ttl here; however, I hesitate to patch the original data, so having a less strict parsing mode would definitely help :)

(cc ozekik/lightrdf#9)
