assert(): cli-friendly validation + entailment api for data pipelines #198
Hi @usalu
Thank you for validating my opinion on that. Most people who raise this issue seem to think that validating should modify their input graph by default and return the modified graph. That not only violates the W3C SHACL spec, it also does not make sense conceptually for a validator.
I admit that, despite reading it several times, I am having a lot of trouble following and understanding your example. It seems very specific to a particular application case and not general at all: it has too much application-specific detail to serve as a general example, and not enough detail for me to understand the problem you are attempting to explain.
From the rest of your writeup, I gather that you are asking about two different things:
In response to issue 1.:
In response to issue 2.: PySHACL has had full support for SHACL Functions from the SHACL-AF spec for more than two years. Specifically, it implements SPARQLFunction using RDFLib's register_custom_function(), and it implements SHACL-JS JSFunctions using pyduktape2. So what you are describing is already possible (aside from the debugging ability).
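For reference, here is a minimal sketch of the RDFLib mechanism mentioned above. The ex:multiply IRI mirrors the SHACL-AF example discussed later in this thread; everything else is illustrative and not pySHACL's internal code.

```python
# Sketch: registering a custom SPARQL function with RDFLib, the same
# mechanism pySHACL builds its SPARQLFunction support on. Illustrative only.
from rdflib import Graph, Literal, URIRef
from rdflib.plugins.sparql.operators import register_custom_function

MULTIPLY = URIRef("http://example.com/ns#multiply")

def multiply(op1: Literal, op2: Literal) -> Literal:
    # Arguments arrive as already-evaluated RDF terms.
    return Literal(op1.toPython() * op2.toPython())

register_custom_function(MULTIPLY, multiply)

g = Graph()
for row in g.query(
    "SELECT (<http://example.com/ns#multiply>(6, 7) AS ?result) WHERE {}"
):
    print(row.result)  # -> 42
```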
@ashleysommer Thank you for the quick and detailed answer!
The general idea behind data pipelines was to share mappings between two different SHACL shapes in a reusable, entirely descriptive way. To be more precise, it would be something like a qualified data pipeline, because it does not only pipe one graph in and return a modified graph; the output graph itself would be SHACL-validated. This alone is the reason why the pipeline is reusable. It acts like a statically defined function, but instead of static types as the schema, you have a SHACL shape. Only the combination of the two together in one API like assert makes it powerful. Think of SHACL being an Interface Definition Language like protobuf, and a transpiler from one IDL definition into another one at the same time.
Yes, exactly.
I would leave all the OWL-RL and OWL-related inferencing out because, in my understanding, OWL and SHACL have completely different purposes despite them technically doing the same thing (checking a schema, inferencing triples and reasoning about whether the input graph is valid). For OWL, I see the main value in searching for knowledge inside an arbitrarily large graph which holds more knowledge than I can ever understand. The idea is: here is a complex ontology whose rules I (think I) understand, and here is an arbitrarily complex graph. Please give me back everything you know, so that I can find out something new. Aka open world. For SHACL, I see the main value in limiting what a graph may look like. Not the entire WWW; only something that I can process. This limit is what creates the freedom for interoperable behaviour. Something like OpenAPI and JSON Schema for microservices. A qualified data pipeline would be like the source code for a microservice which itself is a graph.
Currently, in my understanding, this is only possible at "compile time". What I was proposing is a way to use such a graph to create the function and register it at runtime, from the definition in the graph itself. If you look at the example:

```turtle
@prefix ex: <http://example.com/ns#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:multiply
a sh:SPARQLFunction ;
rdfs:comment "Multiplies its two arguments $op1 and $op2." ;
sh:parameter [
sh:path ex:op1 ;
sh:datatype xsd:integer ;
sh:description "The first operand" ;
] ;
sh:parameter [
sh:path ex:op2 ;
sh:datatype xsd:integer ;
sh:description "The second operand" ;
] ;
sh:returnType xsd:integer ;
sh:select """
SELECT ($op1 * $op2 AS ?result)
WHERE {
}
""" . then you see that def multiply(grpah, op1, op2):
return graph.query(f'SELECT ({op1} * {op2} AS ?result)') But that would be again at time of definition and not at runtime. So the issue is about using metaprograming to define such functions and register them at runtime. This is necissary because the shacl graph of a qualified data pipeline has to be reusable. It wouldn't work. if you had to pull all python implementations and manually register them. Let me try give a more detailed explanation: The COMPANYAGRAPH would be a custom shacl shape COMPANYASHAPE of Company A. The FIWARE2SIM shacl shape contains all mapping behaviour to transform a COMPANYASHAPE into a SIM shape graph which itself contains all mapping behaviour to translate into SIMREP which itself contains all mapping behaviour to VISREP. The original shape has geometry (3D), the SIM shape has simulation related information (2D + energy characteristics such as how many people per m²) and returns calculates the energy demand (simply by multiplying areas with usage) for individual rooms. The SIMREP shape is about reporting energy behaviour (e.g. in relative units such as kWh/m²*year which divides the energy use per m², etc). You can see it as a General Purpose programming Language which accepts one shape and returns another shape. Here a totaly different application: It would be pipeline for computing transdiciplinary connections from an article. Hopefully, I these examples on graph level help understand the idea. Feel free to tell me if things are still unclear to you! |
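To make the runtime-registration idea concrete, here is a hedged sketch of what I mean. register_sparql_function is hypothetical (no such pySHACL API exists), and the parameter handling is deliberately simplified (real code would respect sh:order):

```python
# Hedged sketch: read a sh:SPARQLFunction definition out of a shapes graph
# and register it as an RDFLib custom function at runtime. Not an existing
# pySHACL API; parameter ordering and error handling are simplified.
from rdflib import Graph, Namespace, URIRef
from rdflib.plugins.sparql.operators import register_custom_function

SH = Namespace("http://www.w3.org/ns/shacl#")
EX = Namespace("http://example.com/ns#")

def register_sparql_function(shapes: Graph, func_iri: URIRef) -> None:
    select = str(shapes.value(func_iri, SH.select))
    # Collect the local names of the declared parameters ($op1, $op2, ...);
    # here we simply sort them alphabetically instead of using sh:order.
    params = sorted(
        str(shapes.value(p, SH.path)).rsplit("#", 1)[-1]
        for p in shapes.objects(func_iri, SH.parameter)
    )

    def impl(*args):
        # Substitute the pre-bound variables, then run the SELECT body.
        query = select
        for name, value in zip(params, args):
            query = query.replace(f"${name}", value.n3())
        return next(iter(Graph().query(query)))[0]

    register_custom_function(func_iri, impl)

shapes = Graph().parse("multiply.ttl")  # hypothetically, the Turtle above
register_sparql_function(shapes, EX.multiply)
```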
Following the discussion and the related issues [1][2][3][...], many people (including me) would be interested in natively accessing entailment.
First, I also believe that a validate function shouldn't return an entailed input; conceptually, that would be wrong for a validator.
The name Shapes Constraint Language in general doesn't suggest entailment. Likewise, SPARQL Update makes little sense concept-wise, but of course it is very useful for solving practical problems.
Here is a practical example that I think depicts a very general problem:
Company A decided to create a digital twin and equip their buildings with sensors. Company B decided the same. Company A and B are not related. While company A used FIWARE for their digital twin, company B uses BRICK and a custom broker solution.
Time has passed. They both need to retrofit their buildings, and for that they need another company for consultancy. Company C chooses company D to do the simulation for them. Company B heard about surrogate modelling and hires company E, which claims to predict energy consumption more accurately because they can also use the measured data in their model.
Company D uses SIM, a long-established standard for simulation. Company E uses their custom machine learning models. Both produce the same SIMREP, for which a non-profit organization has already produced a standard visualization VISREP that is importable by data visualization platforms.
Now, company A needs to entail and validate their graph to be SIM-compliant. For that, companies A and C write FIWARE2SIM.
Company E already knows about BRICK, and therefore they have already developed BRICK2SURROGATESIM in-house.
The following data pipelines would be possible:
COMPANYAGRAPH | assert(FIWARE2SIM) | assert(SIM) | assert(SIMREP) | assert(VISREP)
COMPANYBGRAPH | assert(BRICK2SURROGATESIM) | assert(SURROGATESIM) | assert(SIMREP) | assert(VISREP)
An important difference from conventional pipelines is that the output of one stage doesn't necessarily need to be pruned; it can be the union of the previous stages (sh:closed being false), which is only possible due to the nature of RDF. This means that assert(SURROGATESIM) can access parts of the COMPANYBGRAPH if wanted/needed.
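As a hedged illustration of that union point (the file names are hypothetical), RDFLib already makes this trivial:

```python
# Sketch: a pipeline stage's input as the union of the previous stages,
# using RDFLib's graph addition operator. File names are hypothetical.
from rdflib import Graph

company_b = Graph().parse("companyb.ttl")
surrogatesim_output = Graph().parse("surrogatesim_output.ttl")

# With sh:closed false on the downstream shapes, the next assert() stage
# can validate this union and still reach back into the original triples.
union_graph = company_b + surrogatesim_output
```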
Instead of a building it could be about any other digital twin. Instead of a digital twin it could be about any product. Instead of a product it could be about any other subject. Instead of simulation or visualization it could be about any other service.
In general, it would solve the problem that even if people use RDF as their data model, there is still a lot of duct taping required to make such pipelines work. Manual imperative labour, again and again (**unless you use our closed API which comforts you!**).
As a concept I would suggest assert because it can include both validation and entailment. The value of assert is exactly the combination of both: the validated entailment from one SHACL shape A into a SHACL shape B is what brings the value. For validation only, SHACL Core is wonderful. For entailment only, RDFLib, dotNetRDF or whatever is wonderful. Why hustle so much to create a trivial multiply function? Consistency, and being able to use it inside a declarative environment (SPARQL). What I don't like about assert is that it focuses more on the validation than on the entailment, but I think the entailment is the more important part. I guess entail would also just work.
The API would be as simple as this:
assert
stdin: data graph
stderr: validation graph
stdout: entailed graph
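A hedged sketch of what this could look like on top of pySHACL's existing validate() call; the name assert_graph is hypothetical (assert is a Python keyword), as is the exact stream handling:

```python
# Sketch of the proposed assert() pipeline step, built on pySHACL's
# validate(). assert_graph and the stream handling are hypothetical.
import sys
from rdflib import Graph
from pyshacl import validate

def assert_graph(data: Graph, shapes: Graph) -> Graph:
    # advanced=True enables SHACL-AF features (sh:rule, SHACL Functions);
    # inplace=True lets the inferred triples land in the data graph itself.
    conforms, report_graph, report_text = validate(
        data, shacl_graph=shapes, advanced=True, inplace=True, inference="none"
    )
    # stderr: validation graph
    sys.stderr.write(report_graph.serialize(format="turtle"))
    if not conforms:
        raise AssertionError(report_text)
    # stdout: entailed graph, ready for the next pipeline stage
    return data

# Chained, this would mirror the COMPANYAGRAPH pipeline above:
# sim_graph = assert_graph(assert_graph(company_a, fiware2sim), sim_shapes)
```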
Now for this to work properly SHACL Functions and SHACL Javascript are vital.
For SHACL Functions, I have already proposed a solution to dynamically register extension functions inside an issue for rdflib.
Due to the imperative nature of SHACL JS, debugging should be possible. The easiest solution I can imagine is a wrapper webservice which the JS functions fetch into, but that would be a rather hacky solution. Instead, there could directly be a SPARQL endpoint and a SPARQL function which calls SERVICE. Better would be to integrate that into pyduktape2; following the issues there, there seems to be nothing like that. But I guess that due to the complexity of such a pipeline, the JS code will generally be small and will just glue together what is not possible in SPARQL, so that doesn't have a high priority.
As you might have guessed, I am more of an application guy. Therefore the open-world OWL reasoning stuff is not really my interest; I am rather in favour of closed-world SHACL validation. While it is technically possible to run entailment with both engines, from a practical standpoint I can't see the use of both (or do you have an example which uses OWL for something that SHACL can't provide?). Therefore I would set the SHACL-AF and JS flags to true and the ont flags to false by default. After all, it is pySHACL, right?
Long story short: everything is already there to make this happen, because you have done amazing work implementing it! Just one more function and that's it. I can create a pull request if the feature is wanted. :)