Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trimming SHACL from the ontology graph #170

Open
ajnelson-nist opened this issue Nov 22, 2022 · 9 comments
Open

Trimming SHACL from the ontology graph #170

ajnelson-nist opened this issue Nov 22, 2022 · 9 comments

Comments

@ajnelson-nist
Copy link
Contributor

This is a port of a comment I'd made on Sep. 16 on Issue 89. I've migrated the comment here to avoid expanding the scope of that ticket.

I've come across undesired behavior related to pySHACL's ontology graph and shapes graph mix-in strategy. The short of it is, my community adopted a SHACL-based mechanism to review some OWL requirements and cross-check SHACL and OWL definitions with general patterns. (The shapes graph for this is here. We'd also appreciate hearing if you've seen something like this elsewhere; we couldn't find something like it.) My community's ontology is managed as Turtle files, one per namespace, and each Turtle file includes all OWL and SHACL statements for concepts in that namespace.

What we didn't appreciate is that this incurs a significant cost on reviewing all data graphs, due to the mix-in strategy, so our "TBox"-targeting tests also run on "ABox" data review because pyshacl mixes the "ontology" TBox and ABox stuff together. With our ontology (a few hundred classes and properties, each with associated SHACL shapes), this is a 20--30 second cost per pyshacl run, guaranteed to be redundantly re-reviewing the ontology.

So, I'd like to discuss whether the whole ontology mix-in is still the right approach in all cases.

The case my community is facing at the moment is our ontology is written including SHACL shapes and OWL definitions in the same graph file, so we feed SHACL statements into .validate()'s ont_graph keyword argument. Question: Would the ontology graph pySHACL uses (ont_graph) ever be expected to include SHACL statements? My guess is no.

I was considering whether there could be a further reasonable trim based on the user's inferencing capabilities request, but I think the answer there may be no. I suspect SHACL-SPARQL makes the whole ontology graph always potentially necessary, though I can't think of a good example at the moment - maybe linking via some annotation property only available in the ontology, or searching for deprecated classes?

For now, my one question is just the SHACL-in-ontology-graph question: Can SHACL be safely trimmed from the ontology graph?

@ashleysommer
Copy link
Collaborator

ashleysommer commented Dec 4, 2022

Hi @ajnelson-nist
Your issue description and your argument makes a lot of sense, for your use case for your community.

So, I'd like to discuss whether the whole ontology mix-in is still the right approach in all cases.

There are no rules for this side of SHACL validation. This is operating outside of the SHACL spec document, so we can change the operation as we see fit, depending on what makes most sense. I don't know how other SHACL validator implementations handle it, but my understanding is PySHACL is the only validator that uses an ontology mix-in option. Other implementations require you to script in a pre-prep stage for your datagraph before the validation run to ensure it has all of your ontological definitions in place, to ensure validation is able to proceed as expected.

I was considering whether there could be a further reasonable trim based on the user's inferencing capabilities request, but I think the answer there may be no.

There is a very fine line that would need to be walked to determine what needs to be trimmed and what does not. If you consider the open-world assumption of linked-data, then nothing could safely be trimmed.

I suppose there is certainly a best-effort approach, where we simply detect all instances of NodeShape, PropertyShape, and their constraints, and have an option to filter them from the ontology graph before performing the mix-in.

It is also common for an SHACL NodeShape definition to be an instance of RDFS:Class (see Implicit Class Targeting). In that case, the RDF instance is both a SHACL statement, and an ontological definition, so it cannot be removed from the graph.

So an alternative approach would be to discard everything except known ontological constructs, such as keep only RDFS:Class, OWL:Class, RDFS:subclassof, subpropertyof, domain, range, etc, from the ontology graph to mix in. That may be a better approach, except for the case where as you mentioned, as user could use SHACL-SPARQL and assume other triples are present in the datagraph.

@ashleysommer
Copy link
Collaborator

ashleysommer commented Apr 18, 2023

@ajnelson-nist

I have pushed a pretty big change to ontology graph behaviour in the latest (0.22.0) release.

The feature now copies only what is required for RDFS and OWL inferencing, from the ontology graph into the datagraph.

As you are a big user of this feature, are you able to test it (eg, with the problematic case above) and let me know if it:
a) Still works as expected, produces correct validation results?
b) Mitigates any TBox/ABox issues that you were seeing above?
c) Runs faster?

Thanks!

@ajnelson-nist
Copy link
Contributor Author

Hi @ashleysommer ,

First, thank you for considering this issue further, and putting some effort into trying out a solution!

Unfortunately, this solution has not panned out, and may need to be reverted. For further drafting on this, I am quite comfortable testing against non-release branches with Git submodule tracking, and I would be happy to exercise a branch to the fullest.

The summary of impacts I've seen is:

  1. This did not end up speeding up my use case. I think my use case, which involves TBox and ABox review happening when only ABox is needed, just needs a different solution that splits out the TBox shapes from ABox shapes. Your solution tried stripping out SHACL from the mix-in, which addressed some TBox/"SBox" (SHACL shapes confirming shape consistency with OWL definitions) review slowdowns. Unfortunately, the slowdown comes also from a "Quality control" / "Sanity checking" TBox review shape that checks OWL ObjectPropertys and DatatypePropertys for consistency with OWL. So, the right answer here is wholly in my use case's space, not in pySHACL's.
  2. An unfortunate side effect is that a named individual I had left as a "Universal" individual (would be very frequently needed as a fallback) was stripped from the ontology + data graph mix-in, causing a SHACL minimal requirements set to fail. (The individual is here, and the now-failing shape is here.) Adding a type designation of owl:NamedIndividual doesn't carry forward the extra triples the ABox SHACL needs.

So, unfortunately, the change in 0.22.0 introduces a new fail case, where owl:NamedIndividuals that are bundled in the TBox file no longer work with SHACL validation.

ajnelson-nist added a commit to casework/CASE-Corpora that referenced this issue Apr 25, 2023
This patch is implemented to skip the pySHACL 0.22.0 release due to an
unforeseen bug with an individual node included in `case-corpora.ttl`.

References:
* RDFLib/pySHACL#170 (comment)

Signed-off-by: Alex Nelson <[email protected]>
@ashleysommer
Copy link
Collaborator

Hi @ajnelson-nist
Thanks for testing this new behavior.
I want to avoid reverting to the old behavior, because I believe the new implementation is still more in line with the original intention of the ontology-graph feature, it does have a measurable performance improvement in my benchmark testing, and it still passes the whole of SHT and DASH test suites.

Your failing case indicates that I have not taken into account the full content of owl:NamedIndividual instances in the ontology in the new implementation. This is an oversight, and that will be fixed in the next version.

Thank you for volunteering to test feature branches before release. This is a case where I thought my own internal testing was sufficient, but I should have put it to you in a branch to test first. I will push the above mentioned NamedIndividual bugfix in a normal release, then any further modifications I will push to a testing branch that you can evaluate before release.

@ashleysommer
Copy link
Collaborator

@ajnelson-nist New PySHACL version v0.22.1 is release that should fix the owl:NamedIndividual behaviour.

ajnelson-nist added a commit to casework/CASE-Corpora that referenced this issue Apr 26, 2023
This is to let the node be copied in pySHACL 0.22.1.

References:
* RDFLib/pySHACL#170 (comment)

Signed-off-by: Alex Nelson <[email protected]>
@ajnelson-nist
Copy link
Contributor Author

@ashleysommer Thank you for the quick fix. I've confirmed it works in my local build environment. Unfortunately, a new issue is now presenting, either from your change for owl:NamedIndividual or from the debug timing for #180. The qualified shape is now yielding the skolemized ID rather than a dump of the shape in Turtle.

Constraint Violation in QualifiedValueShapeConstraintComponent (http://www.w3.org/ns/shacl#QualifiedMinCountConstraintComponent):
	Severity: sh:Violation
	# ...
	Message: Focus node does not conform to shapes: [nde6d76368ff64c828817d5da07663b82b46]

Granted, that Turtle compacted into one line is usually barely legible to the eye. But it's better to see that than this new behavior which I suspect is unintended. I also see how this would be difficult to catch in a test framework and just took a manual run to see.

@ajnelson-nist
Copy link
Contributor Author

Unfortunately, I think there's another issue that is more attributable to the cloning strategy.

I just had CI fail when trying to evaluate a SHACL-SPARQL constraint. The constraint is here, and the failing data graph is here. The failure is that a certain prefix in the SPARQL query, uco-types:, is not recognized at the time pyshacl is running.

That SHACL-SPARQL constraint should be moot from that data graph, because the data graph doesn't include any usage of the uco-types: prefix. However, apparently the evaluation process now only picks up namespaces from the data graph, while it used to pick up namespaces from the shapes and/or ontology graphs as well. The cloning process apparently forgets to carry the ontology graph's namespaces dictionary forward.

From a community member using another tool, I've come to understand that the SPARQL I linked above is a little lazy/lax in its spelling. sh:prefix isn't used, and PREFIX statements aren't inlined in the query. The tool the community member was using rightly called this out. RDFLib had been letting us get away with it. I suppose if you let this new processing behavior from the cloning stand, you'd be moving pySHACL towards a more SHACL-conformant behavior. If you encode a catch for this case in a test, I'm not sure whether a prefix used in the ontology graph or shapes graph, and not used in the data graph, should be done as a PASS or XFAIL test.

For now, users who've been lax with prefixes are in for a mild surprise on their next pyshacl call.

ajnelson-nist added a commit to casework/CASE-Corpora that referenced this issue Apr 26, 2023
@ashleysommer
Copy link
Collaborator

ashleysommer commented Apr 26, 2023

The qualified shape is now yielding the skolemized ID rather than a dump of the shape in Turtle.

That looks like a bug in the new default message behaviour I implemented for QualifiedValueShapeConstraintComponent. In previous versions that constraint did not have any sh:message present. I can fix the stringification on that. Is that the only place you saw that bnode skolemized ID, or is it showing up elsewhere too?

apparently the evaluation process now only picks up namespaces from the data graph
I've come to understand that the SPARQL I linked above is a little lazy/lax in its spelling. sh:prefix isn't used, and PREFIX statements aren't inlined in the query. The tool the community member was using rightly called this out.

Yeah, the SPARQL query runs on the DataGraph, so by default it only includes the Prefixes present on the datagraph. That is the correct behaviour according to the SHACL-SPARQL Spec. If you need extra prefixes present, you should use the sh:declare and sh:prefixs pattern as outlined in the SHACL-SPARQL spec.

It seems like the old ontology-mix-in behaviour did copy the namespace-prefix mapping from the ontology file into the working datagraph. This was for purposes of generating shortened prefixed stringification of nodes in the textual output and Turtle representation of the result graph, but seems like it had the side effect of allowing your SPARQL queries to work without prefix declarations too.

However I will modify the new ontology inoculation feature to also copy over the namespace-prefix mapping from the ontology graph to the datagraph, because it seems like the sensible thing to do. That will likely fix the failure you are seeing, but I do recommend you update your SPARQL constraints to use the sh:declare and sh:prefixes pattern to avoid relying on this.

@ashleysommer
Copy link
Collaborator

@ajnelson-nist PySHACL v0.22.2 is released now that address the two issues mentioned above.

ajnelson-nist added a commit to casework/CASE-Utilities-Python that referenced this issue Aug 24, 2023
This is part of addressing what originally inspired pySHACL Issue 170.

This patch modifies the behavior of `case_validate`, that reviews OWL
syntax and OWL-SHACL interactions.  With this patch, that functionality
is now **opt-in** at call time.

Further work on separating the OWL review shapes from UCO into a general
CDO repository (originally started for CDOTSC-34) is currently believed
to not have an impact on the user interface element where the user opts
in to the more extensive review.

References:
* [CDOTSC-34] CDO should provide shapes for external ontologies and
  other RDF models, including OWL
* RDFLib/pySHACL#170

Signed-off-by: Alex Nelson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants