Motivation
It would be great to implement a parallel version of the Cassandra connector. Assume that the Semagrow execution engine spans a cluster of nodes, each of which can execute part of the execution plan in parallel, and that each Semagrow node is colocated with a Cassandra node. A single CQL query can then be processed in parallel by all the colocated Cassandra and Semagrow nodes, with each pair doing its share of the work locally on its physical node.
Suggested Solution
An easy way to retrieve data local to a Cassandra node is to use the CQL token function. The same technique is used by the spark-cassandra-connector (see, for example, CqlTokenRange and CassandraTableScanRDD). Each Cassandra node gets an altered CQL query with token ranges added to the WHERE clause. For example, suppose that there are 3 nodes in the cluster and the initial CQL query is
SELECT event_description
FROM events
WHERE event_category = 'Alerts'
The i-th node will then get a query of the form
SELECT event_description
FROM events
WHERE token(event_name) >= x_i AND token(event_name) < y_i AND event_category = 'Alerts'
Ideally, the token range [x_i, y_i) matches the data stored locally on the i-th node, so no data is exchanged over the network. However, if not every Cassandra node participates in a Semagrow computation, some nodes will get a query with tokens outside their own range. The Cassandra cluster handles such a query by finding which nodes own those tokens and transferring the matching rows to the node that handles the query.
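To make the rewriting step concrete, here is a minimal Python sketch of how the per-node queries could be generated. It assumes the default Murmur3Partitioner (tokens from -2^63 to 2^63 - 1) and simply splits the ring into n equal half-open ranges. A real implementation would instead read the token ranges each node actually owns from the cluster metadata, since with virtual nodes a node usually owns many small ranges. The function names `split_ring` and `token_query` are hypothetical and not part of Semagrow or any driver.

```python
MIN_TOKEN = -2**63       # smallest Murmur3Partitioner token
MAX_TOKEN = 2**63 - 1    # largest Murmur3Partitioner token


def split_ring(n):
    """Split the full token ring into n contiguous half-open ranges [x_i, y_i).

    Assumption: an even split, which only approximates real token ownership.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    span = MAX_TOKEN - MIN_TOKEN + 1  # total number of tokens, 2**64
    bounds = [MIN_TOKEN + (span * i) // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def token_query(select_from, partition_key, where, lo, hi):
    """Add token-range predicates for [lo, hi) to an existing WHERE condition."""
    preds = [f"token({partition_key}) >= {lo}"]
    # The upper bound of the last range is 2**63, which is not a valid token,
    # so the last range is left open-ended instead.
    if hi <= MAX_TOKEN:
        preds.append(f"token({partition_key}) < {hi}")
    preds.append(where)
    return f"{select_from} WHERE " + " AND ".join(preds)


if __name__ == "__main__":
    for lo, hi in split_ring(3):
        print(token_query("SELECT event_description FROM events",
                          "event_name", "event_category = 'Alerts'", lo, hi))
```

Each generated query would be sent to the Semagrow node whose colocated Cassandra node owns the corresponding range. The ranges are contiguous and non-overlapping, so the union of the partial results should equal the result of the original query.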
Hope that the suggestion is at least sound.