History tracking, provenance and states #690

pablo-de-andres · 2020-06-22T14:45:10Z


In GitLab by @pablo-de-andres on May 5, 2020, 10:58

Keeping track of different versions of an object has been a recurring topic in our discussions for a long time. This issue will group the motivation, approaches and decisions regarding this topic.

Relates to #127

Motivation

Wrappers

The behaviour of the wrappers could follow two main paradigms: modify the data or create new data.

1. Modify data

Advocate: @pablo-de-andres

Example: Current implementation of SimLammps

Description: The wrapper takes some input data, runs for a number of steps, and overwrites the input data with the latest values. The wrapper here behaves as a process generating an output from an input. It has no memory of earlier states, and there is no direct way of tracing back the changes.

History tracking: Has to be implemented outside of the wrapper, and controlled by the user. The user sets what and when to store a snapshot.

Pros:

No need to keep track of the changes. This simplifies the handling of uids in the wrapper level.
Less memory consumption. No initial or intermediate states are stored.

Cons:

Requires an external history tracking implementation.
Theoretically, the user has to control the history.
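
To make option 1 concrete, here is a minimal sketch in plain Python. It does not use the OSP-core or SimLammps API: `InPlaceWrapper`, `ExternalHistory` and the toy "engine" are hypothetical names made up for illustration. It only shows the division of labour: the wrapper overwrites its input, and the user decides when a snapshot is taken.

```python
import copy


class InPlaceWrapper:
    """Hypothetical stand-in for a wrapper that follows option 1."""

    def __init__(self, data):
        self.data = data  # input data, overwritten on every run

    def run(self, steps):
        # Placeholder "engine": advance the position by the velocity each step.
        for _ in range(steps):
            self.data["position"] += self.data["velocity"]


class ExternalHistory:
    """Hypothetical user-controlled history, living outside the wrapper."""

    def __init__(self):
        self.snapshots = []

    def snapshot(self, label, data):
        # Deep copy, so that later in-place changes do not alter the snapshot.
        self.snapshots.append((label, copy.deepcopy(data)))


atom = {"uid": "atom-1", "position": 0.0, "velocity": 0.5}
wrapper = InPlaceWrapper(atom)
history = ExternalHistory()

history.snapshot("initial", atom)         # the user decides what to store...
wrapper.run(steps=10)                     # ...the wrapper only overwrites
history.snapshot("after 10 steps", atom)  # ...and when to store it
```

Without the two `snapshot` calls, the initial position would be gone after `run`, which is the trade-off listed under the cons.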

2. Create data

Advocate: @urbanmatthias

Example: Current implementation of Gromacs Wrapper

Description: The wrapper has some input data (connected through a hasInput or similar relationship) and generates output data. When the user queries the wrapper, the data will be stored under a hasOutput (or equivalent) relationship. If multiple runs are called sequentially, the output of one simulation becomes the input of the next one. This means loading the full state of the engine into the output, so that it is available as an input. The behaviour of the wrapper would more closely mimic a full workflow, where every run is a process with its own input and output.

History tracking: An inherent part of the wrapper. Requires keeping a connection between all the states of an entity (possibly through a relationship).

Pros:

No data is lost.
Avoids conflicts when multiple users work on the same data.

Cons:

It would still require an external History Tracking to encompass more complex scenarios.
The uid changes its meaning. It no longer refers to an entity, but to an entity in a specific state. This would make tracking the first and last states of an entity after multiple runs a bit cumbersome.
Higher memory impact, which could be unnecessary (depending on the use case).
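
A rough sketch of option 2, again in plain Python rather than the actual Gromacs wrapper or OSP-core API. The `State` class, its `previous` attribute (standing in for a relationship between states) and the `run` function are assumptions made for illustration. Every run returns a new object with a new uid, and the chain of `previous` links is the history.

```python
import uuid


class State:
    """Hypothetical entity-in-a-state; every state gets its own uid."""

    def __init__(self, position, velocity, previous=None):
        self.uid = uuid.uuid4()
        self.position = position
        self.velocity = velocity
        self.previous = previous  # assumed link to the state this one came from


def run(state, steps):
    """Return a new output state; the input state is left untouched."""
    return State(
        position=state.position + steps * state.velocity,
        velocity=state.velocity,
        previous=state,
    )


def first_state(state):
    """Walk the chain back to the initial state."""
    while state.previous is not None:
        state = state.previous
    return state


initial = State(position=0.0, velocity=0.5)
after_first = run(initial, steps=10)
after_second = run(after_first, steps=10)  # output of one run feeds the next
```

Note that finding the first state means walking the whole chain, and that the three objects have three different uids although they describe the same entity. These are the two uid-related cons listed above.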

Bonus option: Internal engine output file storage

Advocate: @ahashibon

Description: A hybrid of the first option where the engine is asked to generate output files every step (or every fixed number of steps), and these files are stored internally. The files can be parsed on demand if the data is required.

History tracking: Done through the tracking of the files generated by the engine and kept internally.

Pros:

No data is lost.
Extra processing is only required when the files have to be parsed.

Cons:

Requires file handling. This would also become more complex when we want to persist the data.
Requires parsers for all engines. However, some of them might not generate files.
The history tracking becomes specific to each wrapper.
It would still require an external History Tracking to encompass more complex scenarios.
Higher memory impact, which could be unnecessary (depending on the use case).
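
The file-based option could look roughly like the sketch below. The engine, the JSON file format and the `FileBackedHistory` class are all hypothetical. A real engine would write its own format and would need its own parser, which is exactly the "parsers for all engines" con.

```python
import json
import tempfile
from pathlib import Path


class FileBackedHistory:
    """Hypothetical tracker: remembers which file holds which step."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.files = {}  # step number -> path of the output file

    def record(self, step, path):
        self.files[step] = Path(path)

    def load(self, step):
        # Parsing happens only here, when the data is actually requested.
        return json.loads(self.files[step].read_text())


def run_engine(history, steps, every):
    """Placeholder engine that dumps its state every `every` steps."""
    position = 0.0
    for step in range(1, steps + 1):
        position += 0.5
        if step % every == 0:
            path = history.directory / f"step_{step}.json"
            path.write_text(json.dumps({"step": step, "position": position}))
            history.record(step, path)


with tempfile.TemporaryDirectory() as tmp:
    history = FileBackedHistory(tmp)
    run_engine(history, steps=10, every=5)
    recorded_steps = sorted(history.files)  # steps for which a file exists
    state_at_5 = history.load(5)            # parsed on demand
```

The temporary directory is deleted at the end of the `with` block, which hints at the persistence problem mentioned in the cons: the files have to be kept somewhere for as long as the history is needed.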

Decision

Standardise option 1.

This requires the design and implementation of an external history tracking mechanism.

A desirable requirement that follows from this is to integrate the history tracking so that the user can set a few parameters and optionally get automatic behaviour similar to option 2. This means the tracking should become part of the semantic layer or the interoperability (session class) layer.

Implementation considerations

History tracking could be independent of the ontology, and become an intrinsic part of OSP-core (@yoavnash).
It should provide a way to keep multiple instances of objects with the same uid in the same place. (@pablo-de-andres)

Implementation ideas

Create a pseudo database wrapper that keeps a table per state, allowing objects with the same uuid in different tables.
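
As a sketch of this idea, the snippet below uses the standard library `sqlite3` module with one table per state. The `StateTables` class, the `state_<n>` table naming and the single `value` column are assumptions for illustration, not a proposed schema. The point is only that the same uid can appear once in each table.

```python
import sqlite3


class StateTables:
    """Hypothetical pseudo database: one table per stored state."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.count = 0  # number of states stored so far

    def save_state(self, objects):
        """Store a {uid: value} mapping in a new table; return the state index."""
        # The table name is built from an internal integer, never from user input.
        table = f"state_{self.count}"
        self.conn.execute(
            f"CREATE TABLE {table} (uid TEXT PRIMARY KEY, value REAL)"
        )
        self.conn.executemany(
            f"INSERT INTO {table} VALUES (?, ?)", list(objects.items())
        )
        self.count += 1
        return self.count - 1

    def load(self, state, uid):
        """Return the value of `uid` in the given state, or None if absent."""
        row = self.conn.execute(
            f"SELECT value FROM state_{int(state)} WHERE uid = ?", (uid,)
        ).fetchone()
        return None if row is None else row[0]


store = StateTables()
before = store.save_state({"atom-1": 0.0})
after = store.save_state({"atom-1": 5.0})  # same uid, different table

value_before = store.load(before, "atom-1")
value_after = store.load(after, "atom-1")
```

Within a table the uid is still unique (it is the primary key), so the uid keeps referring to an entity and the table identifies the state.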

pablo-de-andres · 2020-06-22T14:45:14Z


In GitLab by @yoavnash on May 6, 2020, 11:01

To clarify the point about ontology independence: a solution may use an ontology, but I think it should be general so that it's not just working for EMMO. In this case, extending CUBA might be a good idea.

Another point to keep in mind is that a wrapper is not the only thing that can create an event relevant to the history of a CUDS object. A user working through the semantic layer can create such events too. This might be relevant for keeping track of workflows that go beyond a specific wrapper.

Concerning the implementation, allowing metadata to be stored for a CUDS object could also be useful here. We could have a metadata module that keeps track of the changes (if the user wishes to do so) and exposes its own API through which the user can query them. See a previous discussion about the metadata here.
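
A very small sketch of what such a metadata module might look like. `ChangeLog`, `record` and `changes` are hypothetical names and not part of OSP-core. The sketch assumes that both wrappers and users report events against the uid of a CUDS object, and that the user queries them through a separate API.

```python
from collections import defaultdict
from datetime import datetime, timezone


class ChangeLog:
    """Hypothetical metadata module that records events per uid."""

    def __init__(self):
        self._events = defaultdict(list)

    def record(self, uid, source, description):
        # `source` says who created the event, e.g. "user" or "wrapper".
        self._events[uid].append(
            {
                "time": datetime.now(timezone.utc),
                "source": source,
                "description": description,
            }
        )

    def changes(self, uid, source=None):
        """Return the events of `uid`, optionally filtered by source."""
        events = self._events.get(uid, [])
        return [e for e in events if source is None or e["source"] == source]


log = ChangeLog()
log.record("atom-1", "user", "set velocity")      # event from the semantic layer
log.record("atom-1", "wrapper", "ran 10 steps")   # event from a wrapper

all_changes = log.changes("atom-1")
wrapper_changes = log.changes("atom-1", source="wrapper")
```

Because events are keyed by uid and tagged with their source, the same log could cover workflows that span several wrappers as well as manual changes.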
