Requirements
We need an analytics module that allows us to run queries, get summaries, and find insights into the data stored in Caluma.
Requirements are as follows:
Analytics over multiple entity types possible
Possibility to run aggregate functions over any part of a case - work item attributes, document answers, and so on
Respect permission and visibility rules
Implementation - Specification
Architecture overview
The analytics module's functionality will be guided by the "Pivot Table" feature found in Microsoft Excel and LibreOffice Calc. There are two steps for a full analysis view:
Generate a table of data to be analyzed
Apply the pivot table mechanism on selected fields
The queries shall be stored as objects in the database, and be queried via GraphQL. This way, the module splits into a design time part and a runtime part, the same as the rest of Caluma.
Table building
The first step in building an analysis is selecting the relevant data. A starting point is required to identify each row. From there, the data needed for the analysis can be selected.
For example, let's start with Cases. From there, all answers in the case's document are available for selection. To also be able to select answers from workitem (and sub-workitem!) documents, those should be available as well.
class CaseStartingPoint(BaseStartingPoint):
    def get_fields(self):
        """Return a list of (slug, label) tuples that can be selected as fields."""
        fields = set(
            [
                ("document.form", _("Case doc: form slug")),
                ("workflow", _("Workflow slug")),
            ]
        )

        for case in Case.objects.all():
            # naive impl!
            for metakey in case.meta:
                fields.add((f"meta.{metakey}", _("Case: Meta key {}").format(metakey)))
            for question in case.document.form.questions.all():
                fields.add(
                    (
                        f"document.{question.slug}",
                        _("Case Doc: Question {}").format(question.label),
                    )
                )
            # descend into subforms etc. as needed
            # descend into work items, repeat

        return sorted(fields)

    def join_field(self, field):
        """Generate SQL to JOIN the given field"""
        # ...
For every such field, there must be a method that tells the system how to JOIN the value, so that an efficient SQL query can be built.
The above example shows the basic structure of how this could look.
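As a rough illustration, a join_field implementation for answer fields might emit a JOIN fragment like the one below. This is a sketch only: the table names (caluma_form_answer, caluma_workflow_case) and the fragment/parameter format are assumptions, not the final design.

    def join_field(self, field):
        """Sketch: return an SQL JOIN fragment plus its parameters for one field."""
        if field.startswith("document."):
            question_slug = field.split(".", 1)[1]
            alias = "ans_" + question_slug.replace("-", "_")
            # LEFT JOIN so that cases without an answer still produce a row
            sql = (
                f'LEFT JOIN "caluma_form_answer" AS "{alias}" '
                f'ON "{alias}"."document_id" = "caluma_workflow_case"."document_id" '
                f'AND "{alias}"."question_id" = %s'
            )
            return sql, [question_slug]
        raise NotImplementedError(f"no JOIN strategy for {field!r}")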
To improve performance in the design phase, we must consider returning a partial list of available fields, filtered by prefix: At first, only the top-level available fields are returned, and subsequent queries can then drill down to get more fields with a given prefix. The returned fields should be typed such that the frontend will know if a field is only a path segment or an actually usable field.
Users should then be able to filter the resulting table. As every field contains a manageable number of distinct values, a simple list of values to show for each column should work fine: for example, one could request that only cases with a given workflow slug are shown.
Users should also be able to rename the selected fields (define aliases), so that they make more sense in the context in which they're used. Some fields may also be hidden when the query is run, as they may be used for filtering only.
Thus, a "table" specification model could look roughly like the following in the database:
An example table specification could look as follows. We assume some procurement workflow.
field_source                           alias        show   filters
case.document.form                     form         false  ["main-form"]
case.created_at                        created_at   true
case.document.answers.total_price      total_price  true
case.workitems[ordering].status                     false  ["completed"]
case.document.answers.supplier_name    supplier     true
This would generate a table with three columns (the two columns used for filtering are hidden via the show flag).
Important: For every given starting point, we only allow one output row per item. In case of the "Case" starting point, we allow only one row per case in the output. The field sources must ensure that this is always the case. If an analysis needs to check multiple work items, we will need a corresponding "work item" starting point (that may then go on and join related cases for filtering, for example). This also implies that field sources (such as the workitems[ordering] above) will need to provide variants for cases where there are multiple work items. For example, they could provide workitems[task=ordering,first-by-created] and workitems[task=ordering,last-by-created] so the user can select the relevant one.
A possible frontend mockup is shown below:
Alternative idea: We should check whether we could provide a GraphQL query as input instead, but this would probably require massive optimisation of the query structure, as the returned data could become rather large. Graphene currently doesn't do a good enough job of optimizing the queries if we want more than a handful of data (i.e. via pagination).
UNIONs
There are cases where multiple data sources need to be queried to build a single analytics view. For example, multiple workflows could contain the same approval form, but possibly in different locations in the work-item structure.
Thus, we should support UNIONs of tables. The condition for allowing tables to be used together in such a way is that they have the same set of aliases.
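Assuming a specification model along the lines of the sketch above, the compatibility check essentially compares the sets of exposed aliases:

    def can_union(*table_specs):
        """Sketch: tables may be UNIONed only if they expose identical aliases."""
        alias_sets = {
            frozenset(
                field.alias or field.field_source
                for field in spec.fields.filter(show=True)
            )
            for spec in table_specs
        }
        return len(alias_sets) == 1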
Pivot tables
A pivot table specification consists of a few sets of fields:
row fields - each combination of row field values will be unique in the output table. For example, if you had "year" and "quarter" in the row fields, you would get four rows for each year. If you only had "quarter", there would be a total of four rows.
column fields - Each combination of values in this list will generate a custom column. (This may be empty, and probably won't be implemented at first)
data fields - Each data field will become a column in the pivot table. With each data field comes an aggregate function like sum, average, count, etc.
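Conceptually (ignoring column fields, quoting, and visibility), the pivot step translates into a GROUP BY over the prepared table. A minimal sketch, with assumed helper names:

    def build_pivot_sql(table_name, row_fields, data_fields):
        """Sketch: row fields become GROUP BY columns, data fields become aggregates.

        `data_fields` is a list of (column, aggregate) tuples,
        e.g. [("total_price", "SUM"), ("total_price", "AVG")].
        """
        group_cols = ", ".join(f'"{field}"' for field in row_fields)
        aggregates = ", ".join(
            f'{func}("{col}") AS "{func.lower()}_{col}"' for col, func in data_fields
        )
        return f'SELECT {group_cols}, {aggregates} FROM "{table_name}" GROUP BY {group_cols}'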
Faster SELECTs
The above structure is quite heavy on the database, due to the fact that a Caluma case can be arbitrarily deeply nested, using work items and child cases. If one were to query such a structure naively, it would imply an iterative/recursive approach just to get the possible fields. The generated SQL would also be very inefficient.
We need a way to access all relevant data for a case in an efficient way. For this, we need to de-normalize some aspects of the case structure. We already do this within documents, via the family attribute, which always points to the root document, for example within a table row.
The same concept now needs to be extended to case structures as well. We need to have not only the root object, but a full "path" on every case, work item, and document.
A new path attribute is an array field that denotes identifiers of the parent objects. For example, in a work item's document, the path should look like this: [case_id, workitem_id, document_id]. This should allow for faster querying of hierarchical data.
To keep the database consistent, the value shall be updated in a pre_save event handler, and a migration needs to be written that updates existing applications' data.
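A minimal sketch of such a handler for work items is shown below; relation and field names are assumptions, and cases, child cases, and documents would need equivalent handlers so that path is always populated top-down.

    from django.db.models.signals import pre_save
    from django.dispatch import receiver

    from caluma.caluma_workflow.models import WorkItem  # assumed import path


    @receiver(pre_save, sender=WorkItem)
    def set_work_item_path(sender, instance, **kwargs):
        """Denormalize a work item's ancestry into its `path` array."""
        parent_path = getattr(instance.case, "path", None) or [instance.case.pk]
        instance.path = [*parent_path, instance.pk]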
Notes on performance
When building analytics tools, or generally business intelligence (BI), there is the choice of doing on-line or off-line analytics. Off-line in this context means copying and de-normalizing data periodically and storing it in a separate analytics database, while on-line implies querying the live production database.
We chose to build the analytics module as an on-line implementation. For one, we assume relatively low usage of the feature, and second, we want to apply the same visibilities as for regular data access.
There is some low-hanging fruit regarding analytics performance that can be picked when needed:
Create indexes on the data being queried (such as date answers), even functional indexes extracting e.g. the quarter from a date (see the sketch after this list)
Implement a secondary read-only database mirror for analytic processing
Cache results on an application level
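For example, the functional-index idea could be realised with a raw-SQL migration roughly like the one below (a PostgreSQL sketch; the table/column names and the migration dependency are assumptions):

    from django.db import migrations


    class Migration(migrations.Migration):
        dependencies = [("caluma_form", "0001_initial")]  # placeholder dependency

        operations = [
            migrations.RunSQL(
                # functional index on the quarter of date answers
                sql=(
                    "CREATE INDEX answer_date_quarter_idx "
                    'ON caluma_form_answer ((EXTRACT(QUARTER FROM "date")));'
                ),
                reverse_sql="DROP INDEX answer_date_quarter_idx;",
            )
        ]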
Generally, we assume that, despite analytics queries being rather slow, the expected volume won't impact the transactional load too much.
Visibilities and permissions
All analytics should also be subject to the visibility/permission framework:
Analyses and their input tables shall be filtered via their own visibility rule. This allows developers to define who can see any analysis.
Access to analyses and tables should be governed by the permissions as well, so that creation and management of analysis views can be allowed and disallowed as needed.
All data that is run through the analysis must be subject to the visibility rules that are defined for the corresponding models. In other words: if a user cannot see a given set of cases, those cases won't be counted when the user looks at an analysis that queries cases. However, we may add an option to disable this at design time, so that certain analyses can be run across all data, even if the user viewing the analysis wouldn't have access to the underlying entities (see the sketch below).
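As a sketch of the intended behaviour (filter_queryset_for_user is an assumed stand-in for whatever hook the visibility framework actually exposes), the analytics SQL would be built on top of an already-filtered queryset:

    # sketch only: restrict the base queryset before generating the analytics SQL
    visible_cases = filter_queryset_for_user(Case.objects.all(), request.user)
    sql, params = table_spec.build_sql(base_queryset=visible_cases)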
GraphQL interface
We need to discuss the GraphQL interface of the table output: it is not typed in the classical way, where a GraphQL client could query a selection of fields.
Discussion points
We could probably use the same entity for "regular" tables and pivot tables. In theory, this would give us the ability to use the output of a pivot table as another input... However, I'm not sure this is a good idea performance-wise. It would reduce interface clutter, though.
WBS / Tasks
Ground work
Store "path" information on every case, work item, document (feat(analysis): add path attribute / mixin to structural models #1570):
The path attribute is an array field that denotes identifiers of the parent objects. For example, in a work item's document, the path should look like this: [case_id, workitem_id, document_id]
This should allow for faster querying of hierarchical data
Generate a list of identifiers / values that are selectable for any given case
Basic field types (directly on the primary object, i.e. the case)
Support fields on case's document (Answers).
At first, we probably won't support table questions, as they would cause cartesian products.
Support descending through work item structure
Note: The same issue applies as for table questions if we have multiple work items with the same task. We probably won't support that in the first version without some explicit rule on how to deal with it.
Build SQL query to output a table of the selected values
GraphQL view to query the tables
Integrate visibilities on generated Queryset/SQL: Users should not see things in analytics that they wouldn't see otherwise (Might be switchable on a per-table setting)
CHECKPOINT: Check performance of SQL generated and see if we need to optimize here, or abort and redirect to an offline solution
Implement Pivot Table
Model and GraphQL interface for pivot tables (including support for visibility / permissions)
Implement SQL/Query generator to actually query the database
GraphQL view to query the pivot tables
Extensions
Support UNIONs as input of pivot tables
Support filters that are already in use in the rest of Caluma
Support for table questions: Potentially, we'd already aggregate the values there, but per-parent-document instead of globally. Would require an additional "aggregate function" field
Same for values from multiple work items
Support custom lookups for users from internal user IDs/usernames. This would potentially be interesting for other use cases as well. However, this could just as well be done exclusively in the frontend, provided the frontend has a way of knowing which fields will be user IDs or group IDs, etc.
Possible optimisations / Further investigations
check if we could provide a GraphQL query as input instead:
This would probably require massive optimisation of the query structure, as the returned data could become rather large. Graphene currently doesn't do a good enough job of optimizing the queries if we want more than a handful of data (i.e. via pagination).