RFC: a graph-based observability platform #22

Open
RichMorin opened this issue May 20, 2023 · 3 comments

RichMorin commented May 20, 2023

I've been speculating about a fairly radical addition to Elixir's observability toolbox. One set of open questions has to do with the acquisition of runtime information, which is part of this working group's area of expertise. So, I'm strongly hoping for some comments, suggestions, etc.

The objective of the graph-based observability platform is to harvest, organize, and present a wide range of information having to do with a BEAM application (e.g., in Elixir). The harvesting starts with canonical Elixir definitions, adds code base definitions, then monitors the runtime for behavior patterns. As a result, it contains both contextual and current information on everything of interest, including:

  • canonical Elixir definitions: "Functions are defined in Modules", "Functions are used by Processes", etc. This kind of information is harvestable from existing documents.

  • code base definitions: "Function foo/2 is defined in Module FOO.", "Function foo/2 uses BAR.bar/3 in Library L.", etc. This kind of information is readily available (e.g., from an LSP server and/or other tooling).

  • runtime behavior patterns (see the harvesting sketch after this list):

    • "Process FOO.foo/2, running as part of process P, on node N,
      has been spawning processes as often as 1/minute."
    • "S is the set of processes spawned by function FOO.foo/2 under process P."
    • ...
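
As a rough illustration of the runtime-pattern harvesting, here is a minimal sketch that uses the BEAM's built-in tracing to count spawn events per parent process. The module name and message shapes are illustrative; a real harvester would need to manage trace flags, overload protection, and tracer placement far more carefully.

```elixir
defmodule SpawnHarvester do
  # Become the tracer for process events (spawn, exit, link, ...) on this node.
  def start do
    :erlang.trace(:all, true, [:procs])
    loop(%{})
  end

  defp loop(counts) do
    receive do
      # Emitted when a traced process spawns a child.
      {:trace, parent, :spawn, _child, {m, f, args}} ->
        key = {parent, m, f, length(args)}
        loop(Map.update(counts, key, 1, &(&1 + 1)))

      # Ask the harvester for its current tallies.
      {:dump, reply_to} ->
        send(reply_to, {:spawn_counts, counts})
        loop(counts)

      _other ->
        loop(counts)
    end
  end
end
```

Feeding these tallies (plus timestamps) into the graph would yield exactly the kind of "spawning as often as 1/minute" statements listed above.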

The careful reader will note that these statements map naturally onto a property graph database such as Neo4j. Also, given Neo4j's architecture and solid real-time capabilities, it should be able to handle substantial amounts of harvested information without getting bogged down.
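
For instance, the code-base statements above might land in Neo4j via Cypher along these lines (a sketch assuming the community bolt_sips driver; the labels and relationship types are invented for illustration):

```elixir
# "Function foo/2 is defined in Module FOO."
# "Function foo/2 uses BAR.bar/3 in Library L."
cypher = """
MERGE (m:Module {name: "FOO"})
MERGE (f:Function {name: "foo", arity: 2})
MERGE (f)-[:DEFINED_IN]->(m)
MERGE (g:Function {module: "BAR", name: "bar", arity: 3, library: "L"})
MERGE (f)-[:USES]->(g)
"""

Bolt.Sips.query!(Bolt.Sips.conn(), cypher)
```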

So, this system should allow a user to explore a specified conceptual neighborhood, asking questions such as "Which processes are spawning most frequently?" and "Are any processes exhibiting anti-social behavior?"

That said, presentation is an open question, albeit with a reasonable set of existing approaches. Neo4j has a declarative, SQL-like query language called Cypher. It also has a graphical presentation mode and a well-developed API for programmatic access. Using this API, any existing reporting system could obtain full system information.
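
As a sketch of that query path, the first question above might reduce to a single Cypher query, issued from Elixir (again assuming bolt_sips; the :Process label and :SPAWNED relationship are guesses at the eventual schema):

```elixir
top_spawners = """
MATCH (p:Process)-[s:SPAWNED]->(:Process)
RETURN p.pid AS pid, count(s) AS spawns
ORDER BY spawns DESC
LIMIT 10
"""

Bolt.Sips.query!(Bolt.Sips.conn(), top_spawners)
```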

Hand-waving a bit further, it seems likely that an LLM-based approach could be grafted onto the collected information. However, this isn't required in order to make the effort pay off.

Anyway, that's the pipe dream. (I did say it was speculative :-).

-r

lpil (Member) commented May 20, 2023

On storage, I would appreciate the backend being pluggable. PostgreSQL is a fantastic graph database and is more widely used than Neo4j in the BEAM ecosystem, and on a single server SQLite3 has the potential to outperform them both.

RichMorin (Author) commented May 20, 2023

Pluggability should certainly be a design goal. In terms of databases, my suspicion is that Neo4j's architecture gives it some strong performance advantages. For example, thanks to its index-free adjacency, following an edge only requires dereferencing a pointer in memory.

However, there are certainly advantages to PostgreSQL, SQLite3, etc. For that matter, databases such as Datomic, Mnesia, and Wikidata might also be worth investigating. The important thing, IMHO, is to collect anything and everything that seems useful; storage and presentation can then be experimented with, as desired.

One tricky challenge is defining a BEAM-friendly intermediate data structure. For example, each data source (and sink!) is likely to have its own preferred format and structure. So, finding a neutral data structure will be necessary in order to keep the combinatorics of interfacing under control. Suggestions welcome...
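
To make that concrete, here is one minimal possibility: plain node and edge structs that any source could emit and any sink could consume. The field names are purely illustrative, not a settled schema.

```elixir
defmodule ObsGraph do
  defmodule Node do
    @enforce_keys [:id, :labels]
    defstruct [:id, :labels, props: %{}]
  end

  defmodule Edge do
    @enforce_keys [:from, :to, :type]
    defstruct [:from, :to, :type, props: %{}]
  end
end

# "Function foo/2 is defined in Module FOO."
nodes = [
  %ObsGraph.Node{id: "mod:FOO", labels: [:module], props: %{name: "FOO"}},
  %ObsGraph.Node{id: "fun:FOO.foo/2", labels: [:function], props: %{name: "foo", arity: 2}}
]

edges = [%ObsGraph.Edge{from: "fun:FOO.foo/2", to: "mod:FOO", type: :defined_in}]
```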

-r

RichMorin (Author) commented May 22, 2023

I thought it might be interesting to find out what ChatGPT has to suggest regarding data structures:

Q: I'm interested in finding a convenient data structure for storing and serializing property graph data for use in Neo4j and other databases (e.g., Datomic, Mnesia, PostgreSQL, SQLite, Wikidata). One possibility would be to use JSON-LD for serialization and import the graph into a set of maps, etc. Another would be to encode the data in Cypher and translate it into JSON-LD. Suggestions?

A: When it comes to storing and serializing property graph data for use in multiple databases like Neo4j, Datomic, Mnesia, PostgreSQL, SQLite, and Wikidata, you have several options to consider. Let's discuss two possible approaches you mentioned: using JSON-LD for serialization or encoding the data in Cypher.

JSON-LD for Serialization:

JSON-LD is a convenient choice for serialization as it provides a way to represent linked data using JSON. It offers flexibility and interoperability across various platforms and tools. You can represent the property graph data using JSON-LD's context, subjects, predicates, and objects.

To store the data, you can import the JSON-LD into a set of maps or objects in the respective databases. Each vertex in the graph can be represented as a JSON object, and each edge can be represented as a nested object within the vertex or as a separate JSON object with references to the connected vertices.

This approach allows you to leverage the flexibility of JSON-LD for serializing the graph data and then import it into different databases according to their respective data structures.

Encoding Data in Cypher:

Cypher is a query language specifically designed for querying and manipulating graph data in Neo4j. While you can encode the graph data in Cypher, it might not be the most suitable choice for direct serialization into other databases.

Instead, you could consider using Cypher to query the data from Neo4j and then transform the results into a format suitable for the target database. For example, you could use Cypher queries to retrieve the graph data from Neo4j and translate it into JSON-LD or any other format supported by the target database.

By decoupling the serialization and storage steps, you have more flexibility in handling the data transformation and ensuring compatibility with the target databases.

In summary, using JSON-LD for serialization and importing the graph into a set of maps or objects is a straightforward approach. Alternatively, encoding the data in Cypher can be useful for querying the data from Neo4j and then transforming it into a suitable format for other databases. Ultimately, the choice depends on your specific requirements and the capabilities of the target databases you are working with.

Q: Can you suggest any other plausible alternatives?

A: Certainly! Here are a few other plausible alternatives for storing and serializing property graph data:

Apache Parquet: Parquet is a columnar storage format that is efficient for handling large datasets. It can be a good choice for storing property graph data, especially if you need to optimize for query performance. You can use libraries or tools that support Parquet for reading and writing graph data, such as Apache Arrow or Apache Spark.

Apache Avro: Avro is a data serialization framework that provides a compact binary format for storing structured data. It supports schema evolution and can be a good fit for storing property graph data. You can define a schema for your graph data and then serialize it into Avro format. Libraries like Apache Avro or Confluent's Avro can help with the serialization and deserialization process.

RDF (Resource Description Framework): RDF is a standard for representing knowledge and data on the web using subject-predicate-object triples. You can model your property graph data as RDF triples, where vertices are subjects, edges are predicates, and connected vertices are objects. RDF databases like Apache Jena, Virtuoso, or Stardog can be used to store and query RDF-based property graph data.

Protocol Buffers: Protocol Buffers, also known as Protobuf, is a language-agnostic binary serialization format developed by Google. It allows you to define the schema of your graph data using a specialized language and then generate code for serialization and deserialization. Protobuf can provide efficient and compact serialization, and there are libraries available for multiple programming languages.

MessagePack: MessagePack is a binary serialization format that aims to be fast and compact. It is a good choice when you need to optimize for performance and reduce data size. MessagePack allows you to serialize your property graph data into a compact binary format and has support for various programming languages.

Remember to consider the specific requirements of your project, such as query performance, interoperability, schema evolution, and compatibility with the target databases, when choosing the most suitable alternative for storing and serializing your property graph data.
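
For concreteness, the JSON-LD route might look something like this on the Elixir side (a sketch assuming the Jason library; the vocabulary and keys are invented for illustration):

```elixir
# One function node, serialized as a JSON-LD-flavored document.
doc = %{
  "@context" => %{"@vocab" => "http://example.org/beam-obs#"},
  "@id" => "fun:FOO.foo/2",
  "@type" => "Function",
  "name" => "foo",
  "arity" => 2,
  "defined_in" => %{"@id" => "mod:FOO"}
}

IO.puts(Jason.encode!(doc, pretty: true))
```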
