
rdfstar parsers and serializers update #2115

Open
wants to merge 12 commits into base: main

Conversation


@XuguangSong98 XuguangSong98 commented Sep 14, 2022

Summary of changes

  1. created a new term rdflib.term.RdfstarTriple to store the new quoted triple object
  2. added parser and serializer for turtle-star, trig-star, ntriples-star
  3. added test cases for turtle-star, trig-star, ntriples-star
  4. registered the new parsers and serializers in plugin
  5. comments and code styles remaining to be updated
  6. sparql star is in progress

Checklist

  • Checked that there aren't other open pull requests for
    the same change.
  • Added tests for any changes that have a runtime impact.
  • Checked that all tests and type checking passes.
  • For changes that have a potential impact on users of this project:
    • Updated relevant documentation to avoid inaccuracies.
    • Considered adding additional documentation.
    • Considered adding an example in ./examples for new features.
    • Considered updating our changelog (CHANGELOG.md).
  • Considered granting push permissions to the PR branch,
    so maintainers can fix minor issues and keep your PR up to date.

@XuguangSong98
Author

@nicholascar all the new stuff from https://github.com/RDFLib/rdflib-rdfstar

@XuguangSong98
Author

Run the serializer tests with python test_serializer_trigstar.py, python test_serializer_turtlestar.py, or python test_serializer_ntriplesstar.py from the rdflib directory, and run the parser tests with pytest test_parser_turtlestar.py, pytest test_parser_trigstar.py, or pytest test_parser_ntriplesstar.py from rdflib/test.

@XuguangSong98
Author

XuguangSong98 commented Sep 14, 2022

The parser goes through a preprocessing stage that rewrites the RDF-star syntax into reification the old parser can understand, because the new "{|" annotation syntax in RDF-star is hard to expand with the old parser. Then, instead of adding the reification statements with a blank node, I edited the old parser so that it creates an RdfstarTriple object for any <<:s :p :o>> quoted triple and indexes only the one statement. This implementation is also compatible with plain RDF documents, because any part without new syntax is left unchanged by the preprocessing stage.
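
For reference, a minimal sketch of how the new plugins would be used (based on the test snippets later in this thread; the "ttls" format name and the exact behaviour of the RdfstarTriple term may still change before merge):

from rdflib import Graph

data = """
PREFIX : <http://example/>
<< :s :p :o >> :z :w .
"""

g = Graph()
g.parse(data=data, format="ttls")  # preprocessing pass + modified N3 parser
for s, p, o in g:
    # With this PR the subject is a single RdfstarTriple term rather than a
    # blank node plus four reification statements.
    print(s, p, o)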

@XuguangSong98
Author

XuguangSong98 commented Sep 15, 2022

@nicholascar so compared to treating it as syntactic sugar, we reduce the number of statements stored in the graph. But to save memory, so that we don't need to store the whole nested rdflib.term.RdfstarTriple object, we need better support in the store and memory classes, e.g. storing the RdfstarTriple as a tuple (:b3f5e185425e8b59dc2f2c4178794ea6RdfstarTriple, s, p, o) and treating the RdfstarTriple object like a BNode that only carries an id. But adding that support and editing the store and memory classes is a huge amount of work, so I may leave it as future work.

@niklasl
Member

niklasl commented Sep 15, 2022

While this is admirable work, I'm not sure merging this before RDF-star is a REC is wise. There are still open issues on RDF-star itself.

Also, I see uses of rdfstar:QuotedStatement and rdfstar:AssertedStatement here which are not referenced in the RDF CG Report. Any thoughts on those?

Furthermore, one of the open questions AFAIK is still whether RDF-star should turn triples into nodes in the core RDF model itself. I believe that this PR does just that (by adding RdfstarTriple), which is fine (and necessary) if that is the final decision. But until the complexities of that decision has been analysed and decided upon (within W3C), do we want to have that in an RDFLib release?

(I do know RDFLib already has e.g. Notation 3 support with variables (and formulae?), while that is still not a standard. One of the things W3C must do here is to make a decision on whether these drafts are in conflict or complementary. I recommend to proceed with caution in RDFLib to keep some check on the complexities that can arise.)

Also, are we to replace all parsers in RDFLib with Lark-generated ones, or use a mix of them? (There's been some discussion on that in other issues.)

(For the record I'm still in favour of a non-monolithic RDFLib with pluggable parsers and serializers, so people can choose which specs and implementations to use.)

@XuguangSong98
Author

XuguangSong98 commented Sep 15, 2022

Hi @niklasl

Also, I see uses of rdfstar:QuotedStatement and rdfstar:AssertedStatement here which are not referenced in the RDF CG Report. Any thoughts on those?

Hi, this is a new ontology we designed to mark whether a reification refers to a quoted statement, an asserted statement, or both, during further processing in the old parser:

ont = """
PREFIX rdfstar: <https://w3id.org/rdf-star/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>

rdfstar:QuotedStatement
    a rdfs:Class ;
    rdfs:subClassOf rdf:Statement ;
.

rdfstar:AssertedStatement
    a rdfs:Class ;
    rdfs:subClassOf rdf:Statement ;
    rdfs:comment "An Asserted Statement is a..." ;
    rdfs:label "Asserted Statement" ;
    dcterms:creator "Xuguang Song" ;
.
"""

Because rewriting the whole parser with Lark is too difficult and would take too much time, I decided to preprocess all RDF-star syntax into plain RDF style and then further edit the old parser.
We may publish the ontology onto the RDF CG Report, but either way it doesn't affect the functionality, I believe.
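
Purely as an illustration of that tagging idea (the exact blank-node labels and internal layout used by the preprocessing stage may differ), the expansion looks roughly like this:

turtle_star_input = """
PREFIX : <http://example/>
<< :s :p :o >> :z :w .
"""

# rough guess at the intermediate reification form the old parser then consumes
preprocessed_sketch = """
PREFIX : <http://example/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfstar: <https://w3id.org/rdf-star/>

_:q a rdfstar:QuotedStatement ;
    rdf:subject :s ;
    rdf:predicate :p ;
    rdf:object :o .
_:q :z :w .
"""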

Furthermore, one of the open questions AFAIK is still whether RDF-star should turn triples into nodes in the core RDF model itself. I believe that this PR does just that (by adding RdfstarTriple), which is fine (and necessary) if that is the final decision. But until the complexities of that decision has been analysed and decided upon (within W3C), do we want to have that in an RDFLib release?

I think GraphDB/rdf4j did that as well https://rdf4j.org/documentation/programming/rdfstar/

For example, if we have a Turtle file, it will first be parsed into standard reification. Then, when the old parser reads it in, instead of processing it as an anonymous node, it is processed independently: asserted triples are indexed, and quoted triples are made into new objects, registered in a quoted-triple dictionary, and used in the asserted triples' indexing.

(I do know RDFLib already has e.g. Notation 3 support with variables (and formulae?), while that is still not a standard. One of the things W3C must do here is to make a decision on whether these drafts are in conflict or complementary. I recommend to proceed with caution in RDFLib to keep some check on the complexities that can arise.)

I don't think it is complete; correct me if I am wrong.

@ghost

ghost commented Sep 15, 2022

(I do know RDFLib already has e.g. Notation 3 support with variables (and formulae?),

I don't think it is complete; correct me if I am wrong.

How is it incomplete in your view?

@XuguangSong98
Author

(I do know RDFLib already has e.g. Notation 3 support with variables (and formulae?),
I don't think it is complete; correct me if I am wrong.

How is it incomplete in your view?

@gjhiggins I looked into the formulae design before I started, but I couldn't figure out how it works. I did find there is an option in the store to add quoted triples/statements, but it is set to False by default, and you need to specify the context if you want to set it to True? (I don't know what it means or how it works), and I don't know whether it relates to the RDF-star extension or is for other use.

@ghost

ghost commented Sep 15, 2022

@gjhiggins I looked into the formulae design before I started, but I couldn't figure out how it works. I did find there is an option in the store to add quoted triples/statements, but it is set to False by default, and you need to specify the context if you want to set it to True? (I don't know what it means or how it works), and I don't know whether it relates to the RDF-star extension or is for other use.

Ah okay, then I can confirm that i) it's not incomplete and ii) it would be inappropriate to (ab)use the QuotedGraph for RDF-star reification purposes; its intended use is for representing rules expressed in N3 syntax:

from rdflib import Dataset, Namespace

LOG_NS = Namespace("http://www.w3.org/2000/10/swap/log#")

rulegraph = """@prefix : <urn:example:> .
@prefix log: <http://www.w3.org/2000/10/swap/log#> .
:bob :hasmother :alice .
:john :hasmother :alice .

@forAll :x, :y, :z .
{ :x :hasmother :z . :y :hasmother :z } log:implies { :x :issibling :y } .
"""

def test_formula():
    ds = Dataset()
    g = ds.parse(data=rulegraph, format="n3")

    rule_statement = list(g.triples((None, LOG_NS.implies, None)))[0]

    assert (
        repr(rule_statement)
        == "(<Graph identifier=_:Formula2 (<class 'rdflib.graph.QuotedGraph'>)>, "
        + "rdflib.term.URIRef('http://www.w3.org/2000/10/swap/log#implies'), "
        + "<Graph identifier=_:Formula3 (<class 'rdflib.graph.QuotedGraph'>)>)"
    )

    quoted_graph_sub_obj = list(g.subject_objects(predicate=LOG_NS.implies))

    assert (
        str(quoted_graph_sub_obj)
        == "[("
        + "<Graph identifier=_:Formula2 (<class 'rdflib.graph.QuotedGraph'>)>, "
        + "<Graph identifier=_:Formula3 (<class 'rdflib.graph.QuotedGraph'>)>"
        + ")]"
    )

    formula2, formula3 = quoted_graph_sub_obj[0]

    assert str(sorted(list(formula2))) == (
        "["
        "(rdflib.term.Variable('urn:example:x'), "
        "rdflib.term.URIRef('urn:example:hasmother'), "
        "rdflib.term.Variable('urn:example:z')), "
        "(rdflib.term.Variable('urn:example:y'), "
        "rdflib.term.URIRef('urn:example:hasmother'), "
        "rdflib.term.Variable('urn:example:z'))"
        "]"
    )

    assert (
        str(sorted(list(formula3)))
        == "[("
        + "rdflib.term.Variable('urn:example:x'), "
        + "rdflib.term.URIRef('urn:example:issibling'), "
        + "rdflib.term.Variable('urn:example:y')"
        + ")]"
    )

@aucampia
Member

Furthermore, one of the open questions AFAIK is still whether RDF-star should turn triples into nodes in the core RDF model itself. I believe that this PR does just that (by adding RdfstarTriple), which is fine (and necessary) if that is the final decision. But until the complexities of that decision has been analysed and decided upon (within W3C), do we want to have that in an RDFLib release?

(I do know RDFLib already has e.g. Notation 3 support with variables (and formulae?), while that is still not a standard. One of the things W3C must do here is to make a decision on whether these drafts are in conflict or complementary. I recommend to proceed with caution in RDFLib to keep some check on the complexities that can arise.)

A possible compromise here is to add RDFStarTriple as provisional.

On the one hand, I am hesitant to commit to a public interface prematurely, but on the other hand, I would rather take sub-optimal RDFStar support than no RDFStar support, and making the API for it provisional kind of gets us out of this bind.

Also, are we to replace all parsers in RDFLib with Lark-generated ones, or use a mix of them? (There's been some discussion on that in other issues.)

For the parsers that are not based on existing formats Lark presents a quite attractive offering, in that it is easier to deal with than hand written parsers, and that it can generate standalone parsers that don't depend on third party libraries.

I'm fine with using a mix of them though, but we should try and avoid leaking any internals so we can switch out the implementations if need be.

(For the record I'm still in favour of a non-monolithic RDFLib with pluggable parsers and serializers, so people can choose which specs and implementations to use.)

At the moment resources for maintenance are still quite sparse. A good cookiecutter template integrated with cruft could make it easier to manage more repos, but even if we had that there would still be a lot of challenges keeping everything working and aligned.

If we do things right, and make the use of any third-party libraries optional (i.e. by using extras), then the only real downside I see here is size, and at least to me the size of RDFLib is still quite manageable. Our wheel is less than 500 KB, and a significant chunk of that (>33%) is from our pre-compiled namespaces.

@aucampia
Member

@alooba1 thanks for the PR, I will have a more thorough look at it over the weekend, but do try to look at the CI failures. If you are unsure about their causes just ask and we will try to help. Also, just a reminder that there is a gitter/matrix chat and I will be happy to provide assistance to you there as well.

@ghost

ghost commented Sep 15, 2022

For the parsers that are not based on existing formats Lark presents a quite attractive offering, in that it is easier to deal with than hand written parsers, and that it can generate standalone parsers that don't depend on third party libraries.

fwiw, the import for a Lark LALR standalone parser (Lark_StandAlone) for (say) triples is only two lines of code different from the corresponding import for a Lark standalone parser for (say) trigstar (wherein the two lines of code contain strings of the corresponding grammar and terminals respectively). So, one common Lark_StandAlone with suitably swapped-in strings will likely serve all the Transformers for RDFLib Lark-based parsers.

@ghost

ghost commented Sep 15, 2022

(For the record I'm still in favour of a non-monolithic RDFLib with pluggable parsers and serializers, so people can choose which specs and implementations to use.)

At the moment resources for maintenance are still quite sparse ... there would still be a lot of challenges keeping everything working and aligned.

Nevertheless, if the work that I've done (implementing a suite of Lark-based parsers for RDFLib¹) ever sees the light of day, that's probably the route I'll be choosing for making that work available. Same goes for the SQLite-based stores (one SQL, one key-value) and an RDNA serializer.

¹ I was interested in assessing the degree of any performance penalty that might accrue from a switch to Lark-based parsers and the only truly reliable way to investigate that was to implement the parsers.

@aucampia
Member

Nevertheless, if the work that I've done (implementing a suite of Lark-based parsers for RDFLib¹) ever sees the light of day, that's probably the route I'll be choosing for making that work available. Same goes for the SQLite-based stores (one SQL, one key-value) and an RDNA serializer.

You should make a draft PR for this or at least drop the link to the branch or commits somewhere, having it as a baseline to look at will be helpful to others I think.

@XuguangSong98
Author

@alooba1 thanks for the PR, I will have a more thorough look at it over the weekend, but do try to look at the CI failures. If you are unsure about their causes just ask and we will try to help. Also, just a reminder that there is a gitter/matrix chat and I will be happy to provide assistance to you there as well.

OK, I will start fixing the CI failures. There is still a problem with some nested annotations that I am trying to fix; I think it will be fixed today or tomorrow.

@niklasl
Member

niklasl commented Sep 17, 2022

@alooba1 I believe I understand. It is an interesting approach, and it would be good to discuss its design with the RDF-star CG. How can I get back the triple itself from the "tagged" bnode ID if the reified form isn't sent to storage?

I do wonder if this multiple passes approach is an unnecessary complication though. Using Lark, this should be simple enough to do in one sweep (see the Turtle parser in Pymantic for an example of that). A separate utility function could then be used if someone wanted to convert between RDF and RDF-star, following the CG report. Of course, by using Lark here, I believe it would be wise to adapt the regular Turtle and TriG parsers to this approach, rather than mixing them up.

Regarding that, there seems to be a lot of code duplication in this PR. The contents of notation3.py, turtlestar.py and trigstar.py have huge chunks of overlap (in fact, most of those are identical, so I suspect lots have been copied and pasted). I think this needs much more code reuse, and some more cleanup (lots of commented code for one) to be considered for merging.

Note that with RDF-star, all triples are to be possible to use as subjects and objects (asserted or not). Is this addressed in this PR, or is the RdfStarTriple disjoint from regular triples?

Have you considered the relative merits of defining e.g. StarGraph, or StarDataset (and corresponding stores) to keep things explicit, given that RDF-star is defined as a superset of RDF itself? (I must confess I do not know enough about the "formula-aware" notion in RDFLib to be able to recommend going on that route or this one here). Or is RDFLib to be considered an "RDF-star library" throughout after this addition?

@niklasl
Member

niklasl commented Sep 17, 2022

@aucampia What would the concrete use case be for adding possibly sub-optimal support which results in objects with an unstable interface? How would users of RDFLib be able to use that without depending on the resulting types (particularly when using type annotations)? If this is to be added mostly to support conversion between formats, note that there is only one other syntax with pending support for this extension to RDF, and that is JSON-LD-star. That isn't complete either (though I hope it will be in sync with the RDF-star REC timeline). I would suggest keeping this in a branch (for further cleanup) and tracking the progress of RDF-star towards REC (which might entail little to no changes if all of its features are universally approved).

Note that RDF-star has gotten some criticism for adding complexity (and for possibly being at odds with named graphs/datasets, and not being aligned with Notation 3, for better or worse). I can see its value though (as the old form of RDF reification has left most of us wanting; and RDF-star addresses that, to some extent). It does turn RDF graphs into hypergraphs though, and you cannot easily go back from that new level of complexity.

Having implemented parsing and serializing of TriG-star in both LDTR (in JS, mainly for teaching and visualization of RDF) and TRLD (in Python), I know that adding the star-syntax to existing TriG parsers/serializers does not have to be that hard, even with a handwritten parser. The hard part is stabilizing and agreeing upon the semantics and underlying data model. So I wouldn't rush it out into a stable library release before it's thoroughly vetted. The effects on the RDF community at large are the real concern here.

Also, since maintenance time is scarce, I believe screening for quality and code reuse is essential to keep the RDFLib code base in a decent state. Copying and pasting will eventually make it unmanageable. I know you do a lot of screening and quality improvements, and I am humbled by your continued efforts. I just want to stress that adding this in a sub-optimal form right now might increase the maintenance burden.

(Side note: some of the precompiled namespaces should, IMHO, not be a part of RDFLib core. I'd put those in another package, installable for those who need it.)

@niklasl
Member

niklasl commented Sep 17, 2022

@gjhiggins I do think reworking most/all parsers to use Lark could be beneficial. Of course, it needs to be meticulously done, but following the specs is much easier by using the official EBNF, so it would probably raise the quality of RDFLib parsers. I did some simple benchmarking, and both Pymantic and TRLD parse Turtle about ten times faster than RDFLib right now (that could be more due to the internal data model (or namespace handling) than the actual string wrangling though). Properly wiring Turtle/TriG/-star EBNF into RDFLib graphs ought to reduce the code a lot too (following the lead of Pymantic, and attributing them for the approach if we're looking at their code).

I would be interested in having a go at replacing RDFLib parsers with Lark-based ones (well, time permitting). As @aucampia says, would you like to publish your efforts, or collaborate on this in some other way? Of course, the results have to be thoroughly assessed for compatibility, speed and ease of maintenance.

We could then incorporate things from this PR, unless @alooba1 has the time to also join in this Lark readaptation. As I commented above, I believe clearly defining how to extend RDFLib to the hypergraph space of RDF-star is the hard part, and that would go much deeper than mere surface syntaxes. Unless doing clever tricks to keep triples-as-subjects as mere BNodes with special IDs is doable; albeit perhaps at odds with the RDF-star ambitions. (You could even consider URI-encoding NTriples to keep them as "just" RDF identifiers. That is a hack of course; flattening the hypergraph into opaque triples again, making it unnavigable "below" the triple level.)

@aucampia
Member

Also, since maintenance time is scarce, I believe screening for quality and code reuse is essential to keep the RDFLib code base in a decent state. Copying and pasting will eventually make it unmanageable. I know you do a lot of screening and quality improvements, and I am humbled by your continued efforts. I just want to stress that adding this in a sub-optimal form right now might increase the maintenance burden.

I'm in complete agreement with this, I don't want to accept sub-optimal code, and I don't want duplication in human maintained code. I'm fine with duplication in computer maintained (i.e. computer generated) code though, but even there my preference would be to reduce it as much as possible.

I can see some things in this PR which do look like manual duplication, as it contains a copyright notice from Tim Berners-Lee, and I think the code in this PR needs some work, but I will see what I can do to get it to a quality that I'm happy with before approving.

Regarding my suggestion of adding the RDFStar support as provisional,

Provisional should not be taken to mean sub-optimal, it should only be taken to imply that it does not carry the guarantees of our public interfaces. It does not mean the code that implements it can or should be lower quality. The interface may be sub-optimal, or incomplete, but it should be as optimal and complete as it can be given the situation. In this case, because there are still open questions about RDFStar, there are still open questions about what the final interface should look like, and even if there were no open questions about RDFStar there would still be open questions about what constitutes an optimal interface I think.

@aucampia What would the concrete use case be for adding possibly sub-optimal support which results in objects with an unstable interface?

To me the main benefit of merging the RDFStar support as provisional over not merging it is that it allows us to see what does and doesn't work in terms of an interface, and it makes it possible for people to try out RDFStar. If we see that the interface for RDFStar has no flaws, we promote the provisional interface to a stable interface, and then all code that was using the provisional interface keeps working.

Most of the current interfaces of RDFLib are not optimal, and making optimal interfaces is extremely difficult. Having an interface that is as optimal as we can conceive but does not carry the guarantees of our public interface gives us an opportunity to adjust it if necessary, but also allows us to finalize it with no negative impact on its users if we find that it is suitable.

How would users of RDFLib be able to use that without depending on the resulting types (particularly when using type annotations)?

They would depend on the interface, they just won't have the guarantees associated with the rest of our public interface.

I would suggest keeping this in a branch (for further cleanup) and tracking the progress of RDF-star towards REC (which might entail little to no changes if all of its features are universally approved).

Given that the quality of the code is sufficient, and the interface is as good as we can make it right now, I would much rather merge the code into the main branch as provisional than keep it in a feature branch, because I think the most likely outcome of that is that the feature branch will become defunct.

If it is in the main branch then any changes to the main branch have to include accommodations for it.

I do realize though that the quality of the code as it stands is possibly not quite sufficient, and the interface could be improved, but as I said I will try and work to get the PR where it needs to be before approving.

Another thing to keep in mind is that we do plan to do a massive overhaul of the core interfaces of RDFLib. This will be a lengthy process, and it is quite unlikely that the RDFStar support will look similar after the overhaul, but we should consider RDFStar as part of this redesign I think. It does add complexity, but to some extent I hope that RDFStar does gain widespread adoption because I think the value it gives is commensurate with the complexity it adds. And to me having some RDFStar support with the current core interfaces makes it easier to design new core interfaces that are even better suited to RDFStar.

@XuguangSong98
Author

XuguangSong98 commented Sep 17, 2022

I can see some things in this PR which do look like manual duplication, as it contains a copyright notice from Tim Berners-Lee, and I think the code in this PR needs some work, but I will see what I can do to get it to a quality that I'm happy with before approving.

Just as I stated in the previous sections: because I preprocess the input sequence to a level the old parser can understand, and then use and further edit the old parser so that it indexes the reification the way RDF-star does (not just adding blank nodes), the code reuses part of the n3 parser. Maybe I could do an import for the parts that I never touched?

@XuguangSong98
Author

XuguangSong98 commented Sep 17, 2022

@niklasl

@alooba1 I believe I understand. It is an interesting approach, and it would be good to discuss its design with the RDF-star CG. How can I get back the triple itself from the "tagged" bnode ID if the reified form isn't sent to storage?

@niklasl the s, p, o that an RdfstarTriple contains are recorded in the triple (nestedly) and sent to storage. As I discussed above, this stores some duplicate information in memory if the RdfstarTriple is used multiple times. So it would be better to have support for something like graph.rdfstartriple((hashvalueoftriple, s, p, o)), so that this statement (rdfhash s p o) represents the nested relationship, the way a BNode does, and the RdfstarTriple becomes only an id. Then the relationships won't be stored multiple times even if the RdfstarTriple is reused in the document.
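
A rough sketch of that future-work idea (all names here are illustrative and not part of this PR's interface): keep one record per quoted triple, keyed by a hash, and let the RdfstarTriple-like term carry only that id, the way a BNode carries only a label.

from hashlib import sha256

from rdflib import URIRef

def quoted_triple_id(s, p, o) -> str:
    """Stable id for a quoted triple so its s/p/o are recorded only once."""
    return sha256(" ".join((str(s), str(p), str(o))).encode("utf-8")).hexdigest()

s, p, o = URIRef("http://example/s"), URIRef("http://example/p"), URIRef("http://example/o")
qid = quoted_triple_id(s, p, o)

quoted_index = {qid: (s, p, o)}  # the nested structure is stored once
# statements that use the quoted triple then only reference qid, e.g.:
annotation = (qid, URIRef("http://example/z"), URIRef("http://example/w"))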

I do wonder if this multiple passes approach is an unnecessary complication though. Using Lark, this should be simple enough to do in one sweep (see the Turtle parser in Pymantic for an example of that). A separate utility function could then be used if someone wanted to convert between RDF and RDF-star, following the CG report. Of course, by using Lark here, I believe it would be wise to adapt the regular Turtle and TriG parsers to this approach, rather than mixing them up.

Yeah, it is not that efficient to do a preprocessing pass before parsing. I noticed that part of the code in Pymantic; maybe we can shift to pure Lark in the future.

Note that with RDF-star, all triples are to be possible to use as subjects and objects (asserted or not). Is this addressed in this PR, or is the RdfStarTriple disjoint from regular triples?

It's covered; that's why we need the asserted-triple and quoted-triple tags. The quoted triple is stored in the parser's dictionary and used as subject/object when necessary. I think the RDF-star test cases do cover all the possible combinations there.

Have you considered the relative merits of defining e.g. StarGraph, or StarDataset (and corresponding stores) to keep things explicit, given that RDF-star is defined as a superset of RDF itself? (I must confess I do not know enough about the "formula-aware" notion in RDFLib to be able to recommend going on that route or this one here). Or is RDFLib to be considered an "RDF-star library" throughout after this addition?

That's kind of time consuming. Since I haven't fully looked into the structure of the memory and store classes, I'd rather put that down as future work.

@aucampia
Member

@alooba1 to get rid of the pre-commit.ci failures you can write a comment "pre-commit.ci autofix"; you can also install pre-commit and run pre-commit run --all-files.

You can also use the devcontainer to get the same thing done:

docker-compose run --rm devcontainer task validate:fix

And if you use docker-compose run --rm devcontainer task validate you will get more or less the same validation that is being used by the CI.

But alternatively you can also use tox which is what the CI pipeline runs:

# run tox for all environments (this will take very long and run for python 3.7 -> 3.10)
tox
# run tox for only one specific environment, like python 3.7
tox -e py37

@@ -0,0 +1 @@
<http://example/a> << <http://example/s> <http://example/p> <http://example/o> >> <http://example/z> .
Member

These files should be relocated to test/data/ and test/data/fetcher.py should be updated to fetch them.

For more info see test/data/README.md

Member

@aucampia aucampia Sep 18, 2022

I'm happy to move it and add it to fetcher.py if you are fine with it.


g = Graph()
g.parse("test/turtlestar-evaluation/turtle-star-eval-quoted-annotation-2.ttl", format = "ttls")
print(g.serialize(format = "ttlstar"))
Member

You should use logging rather than print, as logging can be controlled with the --log-cli-level/--log-level pytest flags.
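
A minimal sketch of that suggestion, reusing the snippet above (the "ttls"/"ttlstar" format names are the ones this PR registers; the test name is hypothetical):

import logging

from rdflib import Graph

logger = logging.getLogger(__name__)

def test_quoted_annotation_roundtrip() -> None:
    g = Graph()
    g.parse("test/turtlestar-evaluation/turtle-star-eval-quoted-annotation-2.ttl", format="ttls")
    # visible with e.g. pytest --log-cli-level=DEBUG, instead of always printing
    logger.debug("serialized: %s", g.serialize(format="ttlstar"))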

@@ -0,0 +1,68 @@
import pytest
Member

This test should go into the test/test_serializers directory.

Comment on lines +22 to +24
g = Graph()
g.parse(data="test/turtlestar-evaluation/turtle-star-eval-01.ttl", format = "ttls")
print(g.serialize(format = "ttlstar"))
Member

Tests should not be bare in the file; they should be inside test functions so that pytest correctly reports on them.

Further, this should ideally be using a parameterized test; most likely the right thing to do here is to use test/test_roundtrip.py.

Probably a good baseline is to look at the n3 suite roundtrips being done:

@pytest.mark.parametrize(
    "checker, args",
    make_cases(
        N3_W3C_SUITE_FILES,
        formats={"n3"},
        # NOTE: Isomomorphic check does not work on Quoted Graphs
        checks={Check.SET_EQUALS_WITHOUT_BLANKS},
        graph_type=Graph,
        same_public_id=True,
    ),
)
def test_n3_suite(
    checker: Callable[[str, str, Path], None], args: Tuple[str, str, Path]
):
    checker(*args)

Comment on lines +27 to +28
g.parse("test/turtlestar-evaluation/turtle-star-eval-02.ttl", format = "ttls")
print(g.serialize(format = "ttlstar"))
Member

It would be best to attempt some manner of equivalence checking here. Blank nodes may complicate matters, but we have GraphHelper.assert_sets_equals, which could be adapted to deal with RDFStar nodes; the best would be to use bnode_handling=BNodeHandling.COLLAPSE.

We should actually change GraphHelper.assert_sets_equals to also compare sizes, now that I think about it.
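
A hedged sketch of such a check, assuming the helpers in rdflib's test suite keep their current names (the roundtrip helper below is illustrative only):

from rdflib import Graph
from test.utils import BNodeHandling, GraphHelper

def assert_parse_serialize_roundtrip(path: str) -> None:  # illustrative helper
    g1 = Graph().parse(path, format="ttls")
    g2 = Graph().parse(data=g1.serialize(format="ttlstar"), format="ttls")
    GraphHelper.assert_sets_equals(g1, g2, bnode_handling=BNodeHandling.COLLAPSE)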

Comment on lines 17 to 22
register(
    "ntstar",
    Parser,
    "rdflib.plugins.parsers.ntriples-star",
    "NtriplesStarParser",
)
Member

Is this (and similar registrations in other parser tests) needed here if you already have

rdflib/rdflib/plugin.py

Lines 267 to 284 in e0d5746

register(
    "ntstar",
    Serializer,
    "rdflib.plugins.serializers.ntriples-star",
    "NtriplesStarSerializer"
)
register(
    "ttlstar",
    Serializer,
    "rdflib.plugins.serializers.turtlestar",
    "TurtlestarSerializer"
)
register(
    "trigstar",
    Serializer,
    "rdflib.plugins.serializers.trigstar",
    "TrigstarSerializer"
)

"NtriplesStarParser",
)

# tests should be past
Member

You should be using the N-Triples-star test manifest [ref] for this together with our manifest processor [ref]

For an example see:

@pytest.mark.parametrize(
    ["manifest_entry"],
    params_from_sources(
        MAPPER,
        ManifestEntry,
        LOCAL_BASE_DIR / "manifest.ttl",
        mark_dict=MARK_DICT,
        report_prefix="rdflib_w3c_ntriples",
    ),
)
def test_entry(manifest_entry: ManifestEntry) -> None:
    check_entry(manifest_entry)

@@ -0,0 +1,96 @@

Member

This file (and other test files which essentially run the test suites) should be in test/test_w3c_spec/.

@@ -0,0 +1 @@
<http://example/a> << <http://example/s> <http://example/p> <http://example/o> >> <http://example/z> .
Member

It seems like you have the same files in test/rdf-star/turtle-star/ and test/turtle-star

@aucampia
Member

@alooba1 as far as I can tell you are not using Lark in standalone mode, as you are importing lark; is there a reason for this? If you do this we will have to update the RDFLib dependencies to include lark, and this is not entirely ideal.

Also, can you include the grammar files you used as inputs to lark?

It would also be best to add the steps to generate the code to Taskfile.yml, as we ideally want to be able to re-generate it easily if we need to tune or update the grammar. If you are unsure about this, maybe just start by adding the grammar files and sharing the commands you used, and then I can give you advice on how to update the taskfile.

And optimally the generated code should be untouched unless there is a very good reason that it must be hand edited, and then I would prefer that such changes are in the form of a unified diff patch so we can reapply them when we regenerate the code, but even just having somewhat of a separation between generated and written code is a good start.

@XuguangSong98
Author

XuguangSong98 commented Sep 19, 2022

@aucampia thanks for your advice, I will try to fix these issues asap. But since I'm working on this as a student research project that ends at the end of October, I need to implement sparql-star as much as possible and write the thesis first; after that, ideally in October, I will start polishing my code. I know it is very bad style-wise.

@ghost

ghost commented Sep 19, 2022

You should make a draft PR for this or at least drop the link to the branch or commits somewhere, having it as a baseline to look at will be helpful to others I think.

I'll work towards it. The work's not in a suitable state for exposure as yet, I'm fairly deep in the weeds, as I'll explain later. I'd like to reiterate, my interest was/is in assessing the degree of the performance penalty, if any, that accrues from using Lark-based parsers. I based my work on pymantic's implementation of Lark-based turtle/ntriples/nquads parsers ... but ... because I was focussed on assessing performance, I retained quite a bit of the pymantic support code so's I could get on with the job without having to do a complete port to RDFLib first.

To wit, I rewired pymantic's BaseParser to use RDFLib objects and then bound a slightly-modified version of their Turtle Transformer class to the Lark parser instantiation (as one is enabled to do and advised: "Applies the transformer to every parse tree (equivalent to applying it after the parse, but faster)") ...

import os

import rdflib
from lark import Lark, Transformer
from rdflib.parser import FileInputSource, StringInputSource


def unpack_predicate_object_list(subject, pol):
    # ...

class TurtleTransformer(BaseParser, Transformer):
    # ...

turtle_lark = Lark(
    open(
        os.path.abspath(
            os.path.join(os.path.dirname(__file__), "grammar", "turtle.lark")
        ),
        "r",
    ).read(),
    parser="lalr",
    transformer=TurtleTransformer(),
    debug=False,
    # _plugins=lark_cython.plugins,
)


class LarkTurtleParser(rdflib.Parser):
    format = None

    def parse(self, string_or_stream, graph=None, base="", **kwargs):
        if hasattr(string_or_stream, "readline"):
            string = string_or_stream.read()
        elif isinstance(string_or_stream, bytes):
            string = string_or_stream.decode("utf-8")
        else:
            # Presume string.
            string = string_or_stream

        # TODO: stringify the remaining input sources
        if isinstance(string, FileInputSource):
            string = string.file.read().decode()
        elif isinstance(string, StringInputSource):
            string = string.getCharacterStream().read()

        tf = turtle_lark.options.transformer
        tf.base_iri = base
        tf.preserve_bnode_ids = kwargs.get("preserve_bnode_ids", False)

        if graph is None:
            graph = tf._make_graph()

        tf._prepare_parse(graph)

        statements = list(turtle_lark.parse(string))

        tf._cleanup_parse()

        return graph

The above is near-boilerplate for the rest of the RDF syntax formats, as even a casual inspection of the three pymantic RDF parsers will reveal. The approach results in an RDF Parser class that conforms to the current RDFLib API and can be registered via the plugin interface.

The thing is, there are tradeoffs to consider. The tradeoff comments in the documentation of Lark's extensive set of Visitor/Transformer classes provoke some thoughtful consideration, and I've only explored a few options as yet.

Of note in the above illustration is that the returned statements are ignored; this is because the graph is accessible in the Transformer (via the BaseParser inheritance) and I chose to bang the triples into self._call_state.graph as soon as the Transformer assembles them ...

    def triples(self, children):
        if len(children) == 2:
            subject = children[0]
            for triple in unpack_predicate_object_list(subject, children[1]):
                self._call_state.graph.add(triple)
                yield triple
        elif len(children) == 1:
            for triple_or_node in children[0]:
                if isinstance(triple_or_node, rdflib.graph.Triple):
                    self._call_state.graph.add(
                        (
                            triple_or_node.subject,
                            triple_or_node.predicate,
                            triple_or_node.object,
                        )
                    )
                    yield triple_or_node

vs the original pymantic code ...

    def triples(self, children):
        if len(children) == 2:
            subject = children[0]
            for triple in unpack_predicate_object_list(subject, children[1]):
                yield triple
        elif len(children) == 1:
            for triple_or_node in children[0]:
                if isinstance(triple_or_node, Triple):
                    yield triple_or_node

And the parser returns a populated RDFLib Graph, as per standard.

I should stress that I've really only just scratched the surface, there's a fair amount of work still to be done in terms of working through the memory use/speed tradeoff combos as well as investigating the opportunities for generalising the various Transformer implementations.

@aucampia
Member

@aucampia thanks for your advice, I will try to fix these issues asap. But since I'm working on this as a student research project that ends at the end of October, I need to implement sparql-star as much as possible and write the thesis first; after that, ideally in October, I will start polishing my code. I know it is very bad style-wise.

If you are open to it I would be willing to address some of these issues in your branch, and if there are limits within which this is acceptable to you please set out adequate constraints.

@XuguangSong98
Author

XuguangSong98 commented Sep 19, 2022

If you are open to it I would be willing to address some of these issues in your branch, and if there are limits within which this is acceptable to you please set out adequate constraints.

Sure, I am open to all advice. When I started writing this code I didn't think too deeply; I just wanted the functionality to work first and then to start polishing the code. Now I can think about performance and style. Actually I could avoid most of the global variables and improve the time complexity, but the time for the project is too tight for me.

@XuguangSong98
Author

XuguangSong98 commented Sep 20, 2022

@aucampia I'm actually very worried about my serializer, because I didn't really understand how the old one works (the getQName thing), and it would be very nice if you know of documentation about the old serializer or could give me some advice about my implementation. I basically rewrote the whole thing and there are a lot of redundancies at the moment. Also, currently the serializer is just designed to pass the tests, so the turtle-star that gets parsed in will be serialized into ntriples and trig-star will be serialized into nquads.

@aucampia
Member

aucampia commented Sep 20, 2022

@aucampia I'm actually very worried about my serializer, because I didn't really understand how the old one works (the getQName thing), and it would be very nice if you know of documentation about the old serializer or could give me some advice about my implementation

The extent of the documentation about serializers is really this - if you have specific questions do ask though. It may be worth considering basing your serializer on https://github.com/RDFLib/rdflib/blob/main/rdflib/plugins/serializers/turtle.py#L39 - but I'm not that sure about it.

currently the serializer is just designed to pass the tests, so the turtle-star that gets parsed in will be serialized into ntriples and trig-star will be serialized into nquads

If it is valid turtle-star (which ntriples-star should be) and valid trig-star (which nquads-star should be) it is technically correct™, which is the best kind of correct 😄 - however, I would somewhat favour us just having ntriples-star and nquads-star serializers if that is all they really are. We can potentially alias them to turtle-star and trig-star, but I can't really think of any compelling reason to do that. If we have ntriples-star and nquads-star serializers then we can do some round tripping, which would already be very useful.

I basically rewrote the whole thing and there are a lot of redundancies at the moment

The best advice I have here is to see where you can reuse code; maybe try to collect your code into something for nquads-star and then see where you can branch the logic to do the same as ntriples-star.

Just a reminder, I'm available on gitter/matrix.

@ghost

ghost commented Sep 20, 2022

Hmm. I wonder if there's a need to add an attribute declaring a Graph as an RDF-star graph with selective referential transparency enabled?

According to 6.4.5 Selective referential transparency ...

we introduce the notion of transparency-enabling property (TEP), for selectively enabling this kind of entailment on specific properties. The basis of the idea is that each such property p is identified by adding to the RDF-star graph a triple of the form (p, rdf:type, rdf-star:TransparencyEnablingProperty); i.e., for the previous example, this triple would be (:measuredOn, rdf:type, rdf-star:TransparencyEnablingProperty)

By default, quoted RDF-star statements are referentially opaque, the above allows for selective referential transparency to be declared for certain predicates.

I'm wondering about the situation where statements in RDF-star graphs G¹ and G² include the same predicate p where p is declared as referentially transparent in G¹ but not in G² and the consequent impact on set-theoretic operations such as G¹ ∪ G². Or (as is likely the case) is my understanding flawed?

@nicholascar
Member

I would much rather merge the code into the main branch as provisional

I support this too. Using branches within demo code, as opposed to development, is a pain and this is a demo of capability.

The main purpose of this work is to provide another RDF-star implementation in readiness for the moves to standardise RDF-star. RDFLib is an RDF / Linked Data implementation testbed, as much as it is a stable, mature library. We have needed a major piece of work to push RDF-star support in RDFLib ahead as we have had several failed starts previously.

If the final RDF-star standard is different to the provisional specification we have now, all other implementations (GraphDB, Jena, RDF4J etc.) will have to change too, but anyway, the changes are hardly going to be completely revolutionary, only small updates to this base.

I'll be meeting with @alooba1 shortly (in a few hours) and we'll try to focus on what code quality updates and replies to the comments above can be made ASAP. I think his effort on this work will conclude in about a week, so we will need to carry on from where he gets to.

I wanted Lark to be used, for it's pretty clear to me that a Lark implementation of our RDF parsers is the way to go. It's going to be either the same or better in performance, and the use of common, modern Python will make parser maintenance much, much easier.

@nicholascar
Member

Thank you @aucampia @niklasl & @aucampia for all of your comments and analysis above too! This helps @alooba1 improve his work and learn from your experiences, and it will also help him with his student project write-up!

@ghost

ghost commented Sep 24, 2022

Nevertheless, if the work that I've done (implementing a suite of Lark-based parsers for RDFLib¹) ever sees the light of day, that's probably the route I'll be choosing for making that work available. Same goes for the SQLite-based stores (one SQL, one key-value) and an RDNA serializer.

You should make a draft PR for this or at least drop the link to the branch or commits somewhere, having it as a baseline to look at will be helpful to others I think.

As invited ...

Lark experimental draft PR

@XuguangSong98
Author

XuguangSong98 commented Sep 26, 2022

@gjhiggins the pure lark based parser is amazing, I looked into the experimental lark and found that some nested annotation and quoted triple tests are WIP. I wired my preprocessing class for expanding annotations into pure quoted-triple form into your current work, and now they are able to pass all tests, including all the nested ones:

https://github.com/RDFLib/rdflib-rdfstar/blob/experimental_lark/rdflib/experimental/plugins/parsers/larkturtlestar.py

Although it is not the most efficient way..

@XuguangSong98
Author

Also @aucampia I cleaned up the turtle-star and trig-star parsers a bit and only kept those functions I overrode. I am still working on sparql-star at the moment; simple select/insert/delete are not supported here but are on another branch. I think I may need to start a new draft PR for sparql-star once it is completed?

@ghost

ghost commented Sep 26, 2022

@gjhiggins the pure lark based parser is amazing,

Glad you found it useful. All the credit should go to Nicholas Pilon and Gavin Carothers; it's nearly all their code, slightly adapted for RDFLib.

I looked into the experimental lark and found that some nested annotation and quoted triple tests are WIP

It doesn't impact your RDF Star work but fwiw, there's also some work left to do on the basic Lark Turtle parser. The N3 folk introduced a couple of new tests of the Turtle grammar's blankNodePropertyList rule which don't appear in the 2013 test suite (used both by pymantic and RDFLib) and which the Lark Turtle parser fails to process. Happily, the extant RDFLib Turtle parser does pass these additional tests.

@0m1n0

0m1n0 commented Jun 22, 2023

Hi,

First of all, thank you for all your contributions :)
I also support this PR too.

I would like to attach information to triples, and what I'm doing now is as follows:

  1. Create RDF data with RDFLib, using reification (rdf:Statement)
  2. Once pushed to a server, convert the reification data into RDF-star with SPARQL (see https://www.ontotext.com/knowledgehub/fundamentals/what-is-rdf-star/, "Converting Standard Reification to RDF*" section)

If I could directly create RDF-star with RDFLib, it would be really great!

@niklasl
Member

niklasl commented Jul 26, 2024

Here's an informal status update regarding the RDF-star WG chartered to define RDF 1.2. (I'm a member, but I write here as myself; i.e. not representing the group as a whole):

We will have a charter extension until August 2026 (basically one year for finalizing, one year for maintenance). There's been quite some discussion and some agreed-upon changes. This will reasonably affect when this PR (and/or subsequent ones) can be merged into something releasable.

As things stand (debates are ongoing but this is the currently agreed baseline), RDF 1.2 will have triple terms, which are to be used with regular resources relating to those with a new predicate, currently named rdf:reifies, covering use cases ranging from triple provenance via qualifiers to N-ary relationships. These triple terms are "transparent", i.e. within the same interpretation as the graph wherein they are used. I believe that effectively they refer to a relationship, which may also hold by being an asserted triple in the graph. A notion of "well-formed graphs" might be defined to state that triple terms only make sense to use as objects, specifically with the rdf:reifies predicate. The reason for that is to ensure that the reifying resources ("reifiers") are distinct resources, which is needed for the collected use cases and also avoids the "seminal example" problem. The RDF-star syntax will be defined as shorthand for these constructs, along with a new syntactic representation for the triple terms, mainly for use in the lower-level N-triples format. A means for representing these in RDF 1.1 ("unstarring") should also be defined, and possibly an entailment rule with the same effect, to enable interoperability with existing reasoners, etc.

(Again, this is my interpretation and summary, and not an official one from the WG.)
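
As a hedged illustration of that baseline (my reading of the summary above, not official WG output, and the concrete triple-term syntax is still subject to change), the annotation shorthand would be sugar for an asserted triple plus a reifier related to a triple term via rdf:reifies:

shorthand = ':s :p :o {| :source :wikipedia |} .'

expanded_reading = """
:s :p :o .
_:r rdf:reifies <<( :s :p :o )>> .
_:r :source :wikipedia .
"""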

I will try to keep an eye on the work here and possibly assist in it. I want to help solidify an addition to RDFLib for what will be standardized.

The WG will also define tests for N-triples, Turtle and SPARQL which this addition would reasonably need to pass.

@tae898

tae898 commented Aug 22, 2024

Would be really nice if this is merged!
