50 add the backend from graphdetector (#51)
* Added Query Builder for building AQL queries based on graph pattern. Some adaptations are yet to be done.

* Commented some debug code

* Update component.py

---------

Co-authored-by: Opitz <[email protected]>
TechDom and Opitz authored Dec 7, 2023
1 parent e8e7e0b commit 15a03ef
Showing 10 changed files with 1,445 additions and 2 deletions.
181 changes: 181 additions & 0 deletions cag/framework/analyzer/Readme.md
@@ -0,0 +1,181 @@
## Query Builder
### Overview
The Query Builder builds an ArangoDB query (AQL query) that matches a given graph pattern. The graph pattern is defined in a JSON structure (see below).
### Usage
##### Database Configuration
To connect to the database, provide the connection details in a configuration object of type `cag.utils.config.Config`:

    my_config = Config(
        url=<host>:<port>,
        user=<username>,
        password=<password>,
        database=<database name>,
        graph=<database graph>
    )

##### Graph Pattern
Next, specify the graph pattern to query as a JSON object. The graph pattern object is of type `GraphData` and has the following structure:

*GraphData:*
Contains two lists.

    nodes: list[NodeDTO]
    edges: list[EdgeDTO]

*NodeDTO*:
Each node must have an id (which can be chosen arbitrarily), a collection name (which must match a vertex collection name in the database), and filter properties, provided as a list of objects.

    id: str
    collection: str
    userData: list[UserData]

*EdgeDTO*:
Each edge must have an id (which can be chosen arbitrarily), a source and a target (which reference the corresponding node ids), a collection name (`name`, which must match an edge collection name in the database), and filter properties, provided as a list of objects.

    id: str
    source: str
    target: str
    name: str
    userData: list[UserData]

*UserData*:
UserData defines the properties by which documents are filtered (e.g. to keep only documents with a certain property value). Each condition must contain the attribute name (i.e. the property name in the database collection), its data type ("text", "number", "date" or "datetime"), the operator to use, and the filter value.

    attribute: str
    attributeType: str
    operator: OperatorEnum
    value: str

*OperatorEnum*:
Within `userData`, the operator can take any of the following values:

    CONTAINS
    CONTAINS_NOT
    EQUALS
    EQUALS_NOT
    ALPHABETIC_CONTAINS
    ALPHABETIC_CONTAINS_NOT
    ALPHABETIC_STARTS_WITH
    ALPHABETIC_ENDS_WITH
    NUMERIC_EQUALS
    NUMERIC_SMALLER_THAN
    NUMERIC_SMALLER_OR_EQUAL
    NUMERIC_LARGER_THAN
    NUMERIC_LARGER_OR_EQUAL
    DATE_SMALLER_THAN
    DATE_SMALLER_OR_EQUAL
    DATE_LARGER_THAN
    DATE_LARGER_OR_EQUAL
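
For orientation, the structures above could be mirrored in Python roughly as follows. This is a minimal sketch using standard-library dataclasses, not the actual definitions from the code base. The lowercase enum values are an assumption based on the lowercase operator strings used in the `graph_data` example, and only a few operators are listed.

```python
from dataclasses import dataclass, field
from enum import Enum


class OperatorEnum(str, Enum):
    # Assumption: each value is the lowercase form of the member name.
    ALPHABETIC_CONTAINS = "alphabetic_contains"
    NUMERIC_EQUALS = "numeric_equals"
    DATE_SMALLER_THAN = "date_smaller_than"
    # Remaining operators omitted for brevity.


@dataclass
class UserData:
    attribute: str
    attributeType: str  # "text", "number", "date" or "datetime"
    operator: OperatorEnum
    value: str


@dataclass
class NodeDTO:
    id: str
    collection: str
    userData: list[UserData] = field(default_factory=list)


@dataclass
class EdgeDTO:
    id: str
    source: str
    target: str
    name: str
    userData: list[UserData] = field(default_factory=list)


@dataclass
class GraphData:
    nodes: list[NodeDTO] = field(default_factory=list)
    edges: list[EdgeDTO] = field(default_factory=list)


# Build a one-node pattern: a Person whose name contains "Bob".
person = NodeDTO(
    id="0",
    collection="Person",
    userData=[UserData("name", "text", OperatorEnum("alphabetic_contains"), "Bob")],
)
pattern = GraphData(nodes=[person])
```

Deriving the enum from `str` keeps the members comparable to the plain strings that appear in the JSON payload.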

A valid example of `graph_data` could look like this:

    {
      "nodes": [
        {
          "id": "0",
          "collection": "Person",
          "userData": [
            {
              "attribute": "name",
              "attributeType": "text",
              "operator": "alphabetic_contains",
              "value": "Bob"
            }
          ]
        },
        {
          "id": "1",
          "collection": "Document",
          "userData": [
            {
              "attribute": "title",
              "attributeType": "text",
              "operator": "alphabetic_contains",
              "value": "Climate Change"
            }
          ]
        }
      ],
      "edges": [
        {
          "id": "0",
          "source": "0",
          "target": "1",
          "name": "Wrote",
          "userData": [
            {
              "attribute": "timestamp",
              "attributeType": "datetime",
              "operator": "date_smaller_than",
              "value": "2023-11-24T23:00:00.000Z"
            }
          ]
        }
      ]
    }

This pattern describes a Person whose name contains the string "Bob" and who wrote a Document whose title contains "Climate Change" before 2023-11-24T23:00:00.000Z.
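
Because edges refer to nodes by id, it can be useful to check a pattern before handing it to the Query Builder. The helper below is a hypothetical sketch (`find_dangling_edges` is not part of the library); it assumes only the `nodes`/`edges` layout described above and uses a shortened pattern with empty `userData` lists.

```python
import json


def find_dangling_edges(graph_data: dict) -> list[str]:
    """Return the ids of edges whose source or target is not a known node id."""
    node_ids = {node["id"] for node in graph_data.get("nodes", [])}
    return [
        edge["id"]
        for edge in graph_data.get("edges", [])
        if edge["source"] not in node_ids or edge["target"] not in node_ids
    ]


graph_data = json.loads("""
{
  "nodes": [
    {"id": "0", "collection": "Person", "userData": []},
    {"id": "1", "collection": "Document", "userData": []}
  ],
  "edges": [
    {"id": "0", "source": "0", "target": "1", "name": "Wrote", "userData": []}
  ]
}
""")

# An empty list means every edge points at nodes that exist in the pattern.
dangling = find_dangling_edges(graph_data)
```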

##### Query Generation
Next, create a new `QueryBuilder` object using this configuration:

    query_builder = QueryBuilder(my_config)

Finally, call the `generate_aql_graph_query()` function to generate the AQL query, ready to be executed:

    aql_query, bind_vars = query_builder.generate_aql_graph_query(graph_data, max_start_nodes=444, max_paths=555)

`max_start_nodes` denotes the maximum number of documents to be obtained in the first part of the query (see LIMIT 444 below). `max_paths` denotes the maximum number of paths to be obtained, before the graph traversal halts (see LIMIT 555 below).

`aql_query` contains the actual AQL query. It is parameterized, meaning that field names and values are replaced with respective parameter names (e.g. "@field_xy"). The corresponding mapping is stored in `bind_vars`.
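
To illustrate the idea of parameterization, the sketch below builds one case-insensitive "contains" filter together with its bind variables. It is not the Query Builder's implementation: `contains_filter` and `_random_name` are hypothetical helpers, and the naming scheme (a prefix plus ten random characters) is only inferred from the names that appear in the generated query.

```python
import secrets
import string

_ALPHABET = string.ascii_letters + string.digits


def _random_name(prefix: str, length: int = 10) -> str:
    """Create a bind parameter name such as 'field_bvsRP4nK2b'."""
    return prefix + "".join(secrets.choice(_ALPHABET) for _ in range(length))


def contains_filter(loop_var: str, attribute: str, value: str):
    """Build a case-insensitive CONTAINS filter plus its bind variables."""
    field_name = _random_name("field_")
    value_name = _random_name("value_")
    fragment = (
        f"FILTER CONTAINS(LOWER({loop_var}.@{field_name}), LOWER(@{value_name}))"
    )
    # Keys carry no leading "@" here (the form python-arango expects);
    # the "@" only appears inside the query text.
    bind_vars = {field_name: attribute, value_name: value}
    return fragment, bind_vars


fragment, bind_vars = contains_filter("doc", "name", "Bob")
```

Keeping attribute names and values out of the query text means user input is never spliced into the AQL string directly.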

A generated query typically looks like the following example, corresponding to the above example of `graph_data`:

    LET doc_0 = FLATTEN(
        FOR doc IN Person
            FILTER CONTAINS(LOWER(doc.@field_bvsRP4nK2b), LOWER(@value_F83IiG4DIU)) /* "name" contains "Bob" */
            LIMIT 444
            RETURN doc
    )

    LET start_nodes = UNION(doc_0, [])

    /* For each start node, start finding a path */
    FOR start_node IN start_nodes
        /* Each Edge has the corresponding direction specified to make querying faster */
        FOR v, e, p IN @min..@max ANY start_node OUTBOUND Wrote
        OPTIONS { vertexCollections: @vertex_collections, edgeCollections: @edge_collections }
            FILTER (
                (
                    (
                        IS_SAME_COLLECTION(p.vertices[0], @w2SVniSu3W) /* Vertex at index 0 is of "Person" collection */
                    )
                    AND
                    (
                        CONTAINS(LOWER(p.vertices[0].@field_5E8ggp4h39), LOWER(@value_IiTNm4EVbU)) /* "name" contains "Bob" */
                    )
                    AND
                    (
                        p.edges[0].@field_ktvTx42cK9 < DATE_ISO8601(@value_7sOZ02Ctnf) /* "timestamp" before "2023-11-24T23:00:00.000Z" */
                    )
                    AND
                    (
                        IS_SAME_COLLECTION(p.vertices[1], @DxJ8jHKDO0) /* Vertex at index 1 is of "Document" collection */
                    )
                    AND
                    (
                        CONTAINS(LOWER(p.vertices[1].@field_5x0FjrXzek), LOWER(@value_llAcaYxkIU)) /* "title" contains "Climate Change" */
                    )
                )
            )
        LIMIT 555
        RETURN p

##### Query Execution
To execute the generated query, call `execute_aql_query()`:

    query_result = query_builder.execute_aql_query(aql_query, bind_vars)

The query result is of type `arango.cursor.Cursor` (see the [docs](https://python-driver-for-arangodb.readthedocs.io/_/downloads/en/dev/pdf/)).
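
The cursor can be iterated like any other Python iterable. Since the generated query ends with `RETURN p`, each item should be an AQL path object with `vertices` and `edges` lists. The following sketch turns such paths into short readable strings; `summarize_paths` and the sample data are hypothetical and only stand in for a real cursor.

```python
def summarize_paths(paths):
    """Render each traversal path as 'vertex -[edge]- vertex ...'."""
    summaries = []
    for path in paths:
        vertex_ids = [vertex["_id"] for vertex in path["vertices"]]
        edge_ids = [edge["_id"] for edge in path["edges"]]
        parts = [vertex_ids[0]]
        # Interleave each edge with the vertex it leads to.
        for edge_id, vertex_id in zip(edge_ids, vertex_ids[1:]):
            parts.append(f"-[{edge_id}]-")
            parts.append(vertex_id)
        summaries.append(" ".join(parts))
    return summaries


# Stand-in for the cursor: any iterable of path dictionaries works here.
fake_result = [
    {
        "vertices": [{"_id": "Person/1"}, {"_id": "Document/9"}],
        "edges": [{"_id": "Wrote/5", "_from": "Person/1", "_to": "Document/9"}],
    }
]
lines = summarize_paths(fake_result)
```

With a real result, `summarize_paths(query_result)` would consume the cursor in the same way.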



41 changes: 41 additions & 0 deletions cag/framework/analyzer/aql_queries.py
@@ -0,0 +1,41 @@
### FLOW QUERIES ###
# Get start nodes without using views
GET_DOCUMENTS_FILTERS_USING_COLLECTION = (
    '\n' +
    'LET {query_var} = FLATTEN( ' +
    '\n\tFOR {loop_var} IN {collection} ' +
    '\t{filters} ' +
    '\n\t{limit} ' +
    # Return the loop variable itself so the template works for any loop_var
    '\n\tRETURN {loop_var} ' +
    '\n)'
)

# Get start nodes using views
GET_DOCUMENTS_FILTERS_USING_VIEW = (
    '\n' +
    'LET {query_var} = FLATTEN( ' +
    '\n\tFOR {loop_var} IN {view} ' +
    '\t{filters} ' +
    '\n\t{limit} ' +
    # Return the loop variable itself so the template works for any loop_var
    '\n\tRETURN {loop_var} ' +
    '\n)'
)

# Combine multiple start nodes queries
UNION_START_NODES = (
    '\n' +
    'LET {var} = UNION({lists}, []) '
)

# Perform graph traversal
GET_GRAPH = (
    '{start_nodes_aql_query} ' +
    '\n\n/* For each start node, start finding a path */ ' +
    '\nFOR start_node IN {start_nodes_var} ' +
    '\n\t/* Each Edge has the corresponding direction specified to make querying faster */ ' +
    '\n\tFOR v, e, p IN @min..@max {base_direction} start_node {edge_directions} {graph_stmt} ' +
    '\n\tOPTIONS {{ vertexCollections: @vertex_collections, edgeCollections: @edge_collections }} ' +
    '\n\t\t{path_filters} ' +
    '\n\t{limit} ' +
    '\n\tRETURN p'
)
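
These templates are plain Python format strings: `str.format` substitutes the `{name}` placeholders, while doubled braces (`{{`, `}}`) become literal braces, which is why the `OPTIONS` object is written with them. How the Query Builder actually fills the templates is not shown in this file; the sketch below uses a reduced copy of the traversal template, and the argument values are assumptions chosen to match the Readme example.

```python
# Reduced copy of the traversal template, kept short for illustration.
TRAVERSAL_TEMPLATE = (
    'FOR start_node IN {start_nodes_var} '
    '\n\tFOR v, e, p IN @min..@max {base_direction} start_node {edge_directions} '
    '\n\tOPTIONS {{ vertexCollections: @vertex_collections }} '
    '\n\t{limit} '
    '\n\tRETURN p'
)

# Placeholders are filled by keyword; "@..." bind parameters stay untouched.
query = TRAVERSAL_TEMPLATE.format(
    start_nodes_var='start_nodes',
    base_direction='ANY',
    edge_directions='OUTBOUND Wrote',
    limit='LIMIT 555',
)
print(query)
```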
26 changes: 26 additions & 0 deletions cag/framework/analyzer/objects/Edge.py
@@ -0,0 +1,26 @@
from typing import Optional

from objects.Node import Node


class Edge:
    def __init__(self,
                 id: str,
                 source: Node,
                 target: Node,
                 name: str,
                 conditions: list):
        # Edge id
        self.id: str = id

        # Source node (object)
        self.source: Node = source

        # Target node (object)
        self.target: Node = target

        # Edge name (i.e. the name of the collection that this edge is given in ArangoDB)
        self.name: str = name

        # Edge filters. Each item is a separate condition
        self.conditions: list = conditions

        # Edge direction (ingoing, outgoing); None until it is assigned
        self.direction: Optional[int] = None