Proposed `Dataset` API changes #2591

namedgraph · 2023-09-12T12:18:21Z

namedgraph
Sep 12, 2023

The Dataset is quite weird and assumes that standalone Graphs have identifiers, which will be phased out (#2537). For example, adding a named graph to a Dataset looks like this:

g = Graph(identifier=URIRef("urn:named_graph"))
g.add((..., ... ,...))
...
d = Dataset()
d.add_graph(g)

Moreover, Dataset uses the term context when referring to named graphs. I think it should be phased out as well.
If in doubt, I suggest just copying Jena's Dataset API.

My suggestions for Dataset:

add add_named_graph(uri: IdentifiedNode, graph: Graph) method
add has_named_graph(uri: IdentifiedNode) method
add remove_named_graph(uri: IdentifiedNode) method
add replace_named_graph(uri: IdentifiedNode, graph: Graph) method
add graphs() method as an alias for contexts()
add default_graph property as an alias for default_context
add get_named_graph as an alias for get_graph
deprecate graph(graph) method
deprecate remove_graph(graph) method
deprecate contexts() method

IdentifiedNode is used here as the common supertype of URIRef and BNode, since both are allowed as graph names in RDF 1.1.

The above example would become something like this after these changes:

g = Graph()
g.add((..., ... ,...))
...
d = Dataset()
d.add_named_graph(URIRef("urn:named_graph"), g)
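
A minimal sketch of how the proposed named-graph methods might behave, written with plain Python containers rather than RDFLib. The class name `NamedGraphStore`, the dict storage, and the merge/ignore behaviour are all assumptions for illustration; graph names are plain strings standing in for IdentifiedNode, and a graph is just a set of triples.

```python
# Hypothetical sketch: a plain-Python stand-in, not RDFLib's Dataset.
Triple = tuple[str, str, str]


class NamedGraphStore:
    def __init__(self) -> None:
        self.default_graph: set[Triple] = set()
        self._named: dict[str, set[Triple]] = {}

    def add_named_graph(self, uri: str, graph: set[Triple]) -> None:
        # Assumption: adding under an existing name merges the triples.
        self._named.setdefault(uri, set()).update(graph)

    def has_named_graph(self, uri: str) -> bool:
        return uri in self._named

    def get_named_graph(self, uri: str) -> set[Triple]:
        # Assumption: an unknown name raises KeyError.
        return self._named[uri]

    def remove_named_graph(self, uri: str) -> None:
        # Assumption: removing an unknown name is silently ignored.
        self._named.pop(uri, None)

    def replace_named_graph(self, uri: str, graph: set[Triple]) -> None:
        # Unlike add_named_graph, previous contents are discarded.
        self._named[uri] = set(graph)

    def graphs(self):
        # Yields (name, graph) pairs for the named graphs only.
        return iter(self._named.items())


d = NamedGraphStore()
d.add_named_graph("urn:named_graph", {("ex:s", "ex:p", "ex:o")})
```

The point of the sketch is only that the graph name lives in the dataset, not in the graph object itself.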

namedgraph · 2023-09-12T12:20:53Z

namedgraph
Sep 12, 2023
Author

Also aligns with #2446 and #2407

0 replies

namedgraph · 2023-09-13T08:23:28Z

namedgraph
Sep 13, 2023
Author

It also looks like a simple Graph.add(self, g: Graph) method is missing 🤷‍♂️ There's only __add__.

0 replies

nicholascar · 2023-10-13T11:06:54Z

nicholascar
Oct 13, 2023
Maintainer

I support all of the proposals in this discussion. This has been a long time coming - we've noticed these things for years - but we have never done anything about them and they still hurt us - @edmondchuc is battling with datasets in a current project.

I suggest we also remove the ConjunctiveGraph class and fold any differences it has with Dataset into Dataset constructor parameters.

0 replies

namedgraph · 2023-10-20T15:49:46Z

namedgraph
Oct 20, 2023
Author

I don't think ConjunctiveGraph has to go -- as I understand, it provides a "union graph" which most triplestores support. But Dataset probably does not need to extend it.

0 replies

recalcitrantsupplant · 2024-11-26T07:08:10Z

recalcitrantsupplant
Nov 26, 2024

I've written down my thoughts on what the expected interfaces are, in a pseudo Python/rdflib format:

without reference to the current implementation. As such I don't expect what I've written below would be backwards compatible - I'd think it should be changed for this reason.
without reading others' suggestions

Hopefully it's a coherent perspective; it may take some effort to reconcile / integrate with others'. Will have a go at this next.

Minimal class definitions

Only enough to illustrate the thinking/scenarios
Graph:

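from enum import Enum

from rdflib import URIRef


class GraphType(Enum):
    DEFAULT = "default"
    NAMED = "named"


class Graph:
    # Proposed constructor only; this is not the current rdflib.Graph signature.
    def __init__(
        self,
        identifier: URIRef | None = None,
        graph_type: GraphType | None = None,
    ):
        pass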

Dataset:

class Dataset:
    def __init__(self):
        pass

    def quads(
        self,
        context: GraphType | URIRef | list[GraphType | URIRef] | None = None,
    ):
        pass

    def triples(
        self,
        context: GraphType | URIRef | list[GraphType | URIRef] | None = None,
    ):
        pass

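    def add_graph(
        self,
        graph: Graph,
        # A graph name or GraphType.DEFAULT. An enum member cannot be used
        # as a type annotation, so the enum type is used here.
        target: URIRef | GraphType | None = None,
    ):
        pass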

Graph Scenarios

Scenario 1: Default Graph (Start with Triple)

Graph instantiated without context becomes a "default" or contextless graph when the first thing added is a triple.

g = Graph()
g.parse(data="<ex:s1> <ex:p1> <ex:o1> .", format="turtle")
print(list(g.triples()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>')]

print(list(g.quads()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>', None)]

print(g.graph_type)
> default  # graph type is now "default"; any triples or quads added after this have no context
g.parse(data="<ex:s2> <ex:p2> <ex:o2> <ex:graph> .", format="nquads")
print(list(g.triples()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>'), ('<ex:s2>', '<ex:p2>', '<ex:o2>')]

print(list(g.quads()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>', None), ('<ex:s2>', '<ex:p2>', '<ex:o2>', None)]

Scenario 2: Named Graph (Start with Quad)

Graph instantiated without context gets context from parsed quad.
Subsequently parsed triples inherit the context.

g = Graph()
g.parse(data="<ex:s2> <ex:p2> <ex:o2> <ex:g2> .", format="nquads")
print(list(g.triples()))
> [('<ex:s2>', '<ex:p2>', '<ex:o2>')]

print(list(g.quads()))
> [('<ex:s2>', '<ex:p2>', '<ex:o2>', '<ex:g2>')]

print(g.graph_type)
> named
g.parse(data="<ex:s3> <ex:p3> <ex:o3> .", format="turtle")
print(list(g.triples()))
> [('<ex:s2>', '<ex:p2>', '<ex:o2>'), ('<ex:s3>', '<ex:p3>', '<ex:o3>')]

print(list(g.quads()))
> [('<ex:s2>', '<ex:p2>', '<ex:o2>', '<ex:g2>'), ('<ex:s3>', '<ex:p3>', '<ex:o3>', '<ex:g2>')]

Scenario 3: Named Graph with Identifier

Triples added to graph inherit the context.

g = Graph(identifier="ex:g1")
print(g.graph_type)
> named
g.parse(data="<ex:s1> <ex:p1> <ex:o1> .", format="turtle")
print(list(g.triples()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>')]

print(list(g.quads()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>', '<ex:g1>')]

Scenario 4: Add quad to default graph

Context is ignored.

g = Graph(graph_type="default")
g.parse(data="<ex:s1> <ex:p1> <ex:o1> <ex:graph> .", format="nquads")
print(list(g.triples()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>')]

print(list(g.quads()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>', None)]
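
To make the rules in Scenarios 1-4 concrete, here is a rough sketch in plain Python (no RDFLib). `SketchGraph` and its `add` signature are hypothetical; parsing is replaced by adding one statement at a time, with an optional fourth element standing in for the graph name of a quad.

```python
# Hypothetical sketch of Scenarios 1-4; plain Python, not RDFLib.
class SketchGraph:
    def __init__(self, identifier=None, graph_type=None):
        self.identifier = identifier
        # An identifier implies a named graph (Scenario 3).
        self.graph_type = "named" if identifier is not None else graph_type
        self._triples = []

    def add(self, s, p, o, g=None):
        if self.graph_type is None:
            # The first statement fixes the graph type (Scenarios 1 and 2).
            self.graph_type = "named" if g is not None else "default"
            self.identifier = g
        # Once the type is fixed, the context of later statements is ignored:
        # a default graph drops it (Scenario 4), a named graph keeps its own.
        self._triples.append((s, p, o))

    def triples(self):
        return list(self._triples)

    def quads(self):
        return [(s, p, o, self.identifier) for s, p, o in self._triples]


# Scenario 2: start with a quad, then add a triple.
g = SketchGraph()
g.add("ex:s2", "ex:p2", "ex:o2", "ex:g2")
g.add("ex:s3", "ex:p3", "ex:o3")
```

The case of `graph_type="named"` without an identifier is not covered by the scenarios, so the sketch leaves it undefined.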

Dataset Scenarios

Scenario 5: Add a Default Graph to a Dataset

g = Graph(graph_type="default")
g.parse(data="<ex:s1> <ex:p1> <ex:o1> .", format="turtle")

ds = Dataset()
ds.add_graph(g)

print(list(ds.triples()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>')]

print(list(ds.quads()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>', None)]

Scenario 6: Add a Named Graph to a Dataset

g = Graph(identifier="ex:g1")
g.parse(data="<ex:s2> <ex:p2> <ex:o2> .", format="turtle")

ds = Dataset()
ds.add_graph(g)

print(list(ds.triples()))
> [('<ex:s2>', '<ex:p2>', '<ex:o2>')]

print(list(ds.quads()))
> [('<ex:s2>', '<ex:p2>', '<ex:o2>', '<ex:g1>')]

Scenario 7: Add a Graph to the Default Context

Graph ID of graph being added (if present) is overridden by "target".

g = Graph(identifier="ex:g1")
g.parse(data="<ex:s3> <ex:p3> <ex:o3> .", format="turtle")

ds = Dataset()
ds.add_graph(g, target="default")

print(list(ds.triples()))
> [('<ex:s3>', '<ex:p3>', '<ex:o3>')]

print(list(ds.quads()))
> [('<ex:s3>', '<ex:p3>', '<ex:o3>', None)]

Scenario 8: Add a Graph to a Dataset Under a Different Graph Name

Graph ID of graph being added (if present) is overridden by "target".

g = Graph(identifier="ex:g2", graph_type="named")
g.parse(data="<ex:s4> <ex:p4> <ex:o4> .", format="turtle")

ds = Dataset()
ds.add_graph(g, target="ex:newg")

print(list(ds.triples()))
> [('<ex:s4>', '<ex:p4>', '<ex:o4>')]

print(list(ds.quads()))
> [('<ex:s4>', '<ex:p4>', '<ex:o4>', '<ex:newg>')]
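
The "target" override in Scenarios 5-8 could be sketched roughly as below, again in plain Python. `TargetedDataset` is a hypothetical name, the graph is passed as a list of triples plus an optional identifier, and the `DEFAULT` string is an assumed sentinel standing in for GraphType.DEFAULT.

```python
# Hypothetical sketch of Scenarios 5-8; plain Python, not RDFLib.
DEFAULT = "default"  # sentinel standing in for GraphType.DEFAULT


class TargetedDataset:
    def __init__(self):
        self._quads = []

    def add_graph(self, triples, identifier=None, target=None):
        # `target`, when given, overrides the graph's own identifier.
        name = identifier if target is None else target
        if name == DEFAULT:
            name = None  # the default graph has no name
        for s, p, o in triples:
            self._quads.append((s, p, o, name))

    def triples(self):
        return [(s, p, o) for s, p, o, _ in self._quads]

    def quads(self):
        return list(self._quads)


ds = TargetedDataset()
# Scenario 7: a named graph is redirected into the default graph.
ds.add_graph([("ex:s3", "ex:p3", "ex:o3")], identifier="ex:g1", target=DEFAULT)
# Scenario 8: a named graph is stored under a different name.
ds.add_graph([("ex:s4", "ex:p4", "ex:o4")], identifier="ex:g2", target="ex:newg")
```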

Scenario 9: Iterate Over Triples with Contexts

g1 = Graph(graph_type="default")
g1.parse(data="<ex:s1> <ex:p1> <ex:o1> .", format="turtle")

g2 = Graph(identifier="ex:g1")
g2.parse(data="<ex:s2> <ex:p2> <ex:o2> .", format="turtle")

ds = Dataset()
ds.add_graph(g1)
ds.add_graph(g2)

print(list(ds.triples()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>'), ('<ex:s2>', '<ex:p2>', '<ex:o2>')]

print(list(ds.triples(context=["NAMED", "DEFAULT"])))  # equivalent to default behaviour when not specifying context 
> [('<ex:s1>', '<ex:p1>', '<ex:o1>'), ('<ex:s2>', '<ex:p2>', '<ex:o2>')]

print(list(ds.triples(context="NAMED")))
> [('<ex:s2>', '<ex:p2>', '<ex:o2>')]

print(list(ds.triples(context="DEFAULT")))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>')]

print(list(ds.triples(context=["DEFAULT", "ex:g2"])))  # ex:g2 is not in the dataset so no data returned from this graph.
> [('<ex:s1>', '<ex:p1>', '<ex:o1>')]

Scenario 10: Iterate Over Quads with Contexts

g1 = Graph(graph_type="default")
g1.parse(data="<ex:s1> <ex:p1> <ex:o1> .", format="turtle")

g2 = Graph(identifier="ex:g1")
g2.parse(data="<ex:s2> <ex:p2> <ex:o2> .", format="turtle")

ds = Dataset()
ds.add_graph(g1)
ds.add_graph(g2)

print(list(ds.quads()))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>', None), ('<ex:s2>', '<ex:p2>', '<ex:o2>', '<ex:g1>')]

print(list(ds.quads(context="NAMED")))
> [('<ex:s2>', '<ex:p2>', '<ex:o2>', '<ex:g1>')]

print(list(ds.quads(context="DEFAULT")))
> [('<ex:s1>', '<ex:p1>', '<ex:o1>', None)]

print(list(ds.quads(context=["DEFAULT", "ex:g2"])))  # ex:g2 is not in the dataset so no data returned from this graph.
> [('<ex:s1>', '<ex:p1>', '<ex:o1>', None)]
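
The context filtering in Scenarios 9 and 10 amounts to a selection over quads. A rough plain-Python sketch follows; `select_quads` is a hypothetical helper, and the keyword strings "NAMED" and "DEFAULT" are taken from the scenarios (a real design would presumably use the GraphType enum so they cannot collide with graph names).

```python
# Hypothetical sketch of Scenarios 9-10; plain Python, not RDFLib.
def select_quads(quads, context=None):
    """Filter (s, p, o, g) tuples; g is None for the default graph."""
    if context is None:
        # Not specifying a context behaves like ["NAMED", "DEFAULT"].
        context = ["NAMED", "DEFAULT"]
    elif isinstance(context, str):
        context = [context]
    wanted = set(context)
    selected = []
    for quad in quads:
        g = quad[3]
        if g is None:
            keep = "DEFAULT" in wanted
        else:
            # A named graph matches the keyword or its own name.
            keep = "NAMED" in wanted or g in wanted
        if keep:
            selected.append(quad)
    return selected


data = [
    ("ex:s1", "ex:p1", "ex:o1", None),
    ("ex:s2", "ex:p2", "ex:o2", "ex:g1"),
]
named_only = select_quads(data, context="NAMED")
default_plus_missing = select_quads(data, context=["DEFAULT", "ex:g2"])
```

Here `named_only` holds just the ex:g1 quad and `default_plus_missing` just the default-graph quad, since ex:g2 is not in the data.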

12 replies

nicholascar Dec 2, 2024
Maintainer

Confirming what @ashleysommer says: this is about fixing things not patching. Yes things will break. For people unable to adopt v8, v7 will still be available for quite some time, as v4.22, v5 & v6 still are.

recalcitrantsupplant Dec 5, 2024

Thanks for the feedback.

To clarify a few points around my thinking, the terminology I'm using is:

Inclusive Dataset

covers all triples/quads in the Default Graph and all Named Graphs

Exclusive Dataset

graphs targeted must be specified or inferred.
- default union off: refers to the Default Graph
- default union on: refers to the union of all Named Graphs.

Dataset Default Context

in SPARQL:
- set by FROM and FROM NAMED
- when not set using FROM and FROM NAMED, can be inclusive (e.g. GraphDB) or exclusive (e.g. Fuseki)
in RDFLib:
- I'm proposing setting it via the context parameter. It would be an array of one or more of "Named", "Default", and URIs for Graphs.
- when not set, e.g. Graph().triples() or Dataset().quads(), I'm proposing the behaviour is inclusive. In my experience many people find the exclusive behaviour counterintuitive.

Default Graph: the "unnamed" graph.

What am I proposing:
The context parameter on the triples and quads methods (and all related methods that iterate over the dataset, e.g. subject_objects) is equivalent to the Dataset clauses in SPARQL, i.e. FROM and FROM NAMED; it is not equivalent to Graph clauses.

I think the options are:

| Target Data | Context for Inclusive Dataset | Context for Exclusive Dataset |
| --- | --- | --- |
| Everything (Default Graph + all Named Graphs) | Context not set OR `["Default", "Named"]` | `["Default", "Named"]` |
| Default Graph only | `["Default"]` | Default Union = true: `["Default"]`; Default Union = false: Context not set |
| All Named Graphs only | `["Named"]` | Default Union = true: Context not set; Default Union = false: `["Named"]` |
| Specific Named Graphs | `[uri1, uri2]` | `[uri1, uri2]` |
| Default Graph + specific Named Graphs | `["Default", uri1, uri2]` | `["Default", uri1, uri2]` |
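
Read as code, the options above differ only in what an unset context means. A small sketch, where `effective_context` and its parameters are hypothetical names rather than anything in RDFLib:

```python
# Hypothetical sketch of the options above; names are illustrative only.
def effective_context(context=None, inclusive=True, default_union=False):
    """Return the list of targets a triples()/quads() call would read from."""
    if context is not None:
        # An explicit context means the same thing in both modes.
        return list(context)
    if inclusive:
        # Inclusive dataset: unset means everything.
        return ["Default", "Named"]
    # Exclusive dataset: unset depends on the default-union setting.
    return ["Named"] if default_union else ["Default"]
```

For an inclusive dataset the `default_union` argument is never consulted, which is the first advantage listed below.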

I think Inclusive would be the better option here:

no need for default union setting
easier for new users to not "lose" their data:
- with an Exclusive dataset, adding a quad to the Dataset when Default Union is off means the quad won't be found when using the triples or quads methods unless context is set to "Named". The inverse is true when adding a triple where Default Union is on.
- equivalent scenarios for Inclusive Dataset only occur when a user explicitly sets context= to something that does not include their data.

Perhaps the inclusive/exclusive options are a good place to start as the other methods depend on these:

are the options as I've drawn them correct? other options?
what are others' preferences?

namedgraph Dec 18, 2024
Author

@nicholascar I understand 8.x will contain breaking changes, but I still think they should be kept to a minimum. More specifically, they should be limited to where RDFLib is non-compliant or the API is awkward, but not to rename or redesign the APIs that already make sense, just for the sake of redesigning them.

namedgraph Dec 18, 2024
Author

@recalcitrantsupplant honestly I don't understand/like your proposal :) Why are GraphType and context necessary at all? As for dataset, I think get_graph(identifier), which retrieves a named Graph, is much more useful than quads().

recalcitrantsupplant Dec 18, 2024

Hey Martynas, I'll write this up more clearly soon outside of this discussion, including what I think shouldn't be changed. Given the feedback on Graph specifically, i.e. that having a name or identifier for it is inconsistent with the spec, I'll remove that from proposed changes.

The thinking with quads and related methods (by which I mean "subjects", "subject_predicates", "subject_objects" etc.) is that there should be a way to provide "dataset subsetting" (equivalent to SPARQL FROM and FROM NAMED, but entirely separate from the query method). In my day-to-day use of rdflib this would be quite useful.

The main pain point for me with Dataset otherwise is adding and accessing triples or quads, so I'll include proposed changes here too.

afs · 2024-12-06T13:47:15Z

afs
Dec 6, 2024

Observations from afar ...

The context parameter on the triples and quads methods

Would the dataset (the storage unit) have a default context setting? Otherwise if an app changes, then it might require every API call to be tracked down and changed.

FWIW Fuseki has both modes - union default graph is SPARQL only, and it is a view of the dataset at query time. The usual way is to have a setting on the dataset but it can be set per query execution.
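
One way to picture a dataset-level setting with a per-call override, as a rough sketch only. `ContextDefaults`, `default_context` and `resolve` are hypothetical names, not part of RDFLib or Fuseki:

```python
# Hypothetical sketch; class and parameter names are assumptions.
class ContextDefaults:
    def __init__(self, default_context=("Default", "Named")):
        # Dataset-wide setting, chosen once when the dataset is created.
        self.default_context = list(default_context)

    def resolve(self, context=None):
        # A per-call context wins; otherwise fall back to the dataset setting.
        if context is None:
            return list(self.default_context)
        return list(context)


ds = ContextDefaults(default_context=["Default"])
```

With this shape, switching an application between modes would be a single change at construction time rather than a change to every API call.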

For update, where do new triples go to in an inclusive dataset?

0 replies
