[RFC] Log pattern support in OpenSearch #16627

Open
songkant-aws opened this issue Nov 13, 2024 · 14 comments
Labels
enhancement Enhancement or improvement to existing feature or request Libraries Lucene Upgrades and Libraries, Any 3rd party library that Core depends on, ex: nebula; team is respo

Comments

@songkant-aws

songkant-aws commented Nov 13, 2024

Is your feature request related to a problem? Please describe

Today, OpenSearch supports the Grok and Patterns operators in PPL, which use regular expressions to strip characters from a log message and derive its pattern. By default, a very simple rule is applied that removes every character matching [a-zA-Z\d]. For example, [email protected] and [email protected] are potentially grouped as the same pattern because, after processing, their patterns are both @. This simple approach has low grouping accuracy because different log statements can share the same combination of punctuation, and the generated pattern is hard for humans to read. Achieving better grouping accuracy requires an expert with domain knowledge to hand-craft a suitable regex case by case.
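To illustrate the accuracy problem, here is a minimal Python sketch of the default rule described above. The helper name `simple_pattern` and the two sample messages are made up for illustration; only the `[a-zA-Z\d]` character class comes from the text.

```python
import re

# Default rule described above: drop every character matching [a-zA-Z\d].
DEFAULT_STRIP = re.compile(r"[a-zA-Z\d]")


def simple_pattern(message: str) -> str:
    """Return what is left after removing letters and digits (hypothetical helper)."""
    return DEFAULT_STRIP.sub("", message)


# Two unrelated log statements (invented samples) end up with the same pattern,
# because only whitespace and punctuation survive.
a = simple_pattern("connection 42 closed: timeout")
b = simple_pattern("worker 7 started: ok")
print(repr(a), repr(b), a == b)
```

Both messages reduce to the same short run of spaces and a colon, so they would be grouped together even though they come from different log statements.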

I see that automatically extracting log patterns is a popular trend in industrial log analysis. Products such as Sumo Logic have a logreduce operator that groups log messages together based on string and pattern similarity. Ideally, a good log pattern feature should process a stream of semi-structured log messages and identify, for each message, which tokens are constant words and which are variables. For example, the list of log messages [proxy.cse.cuhk.edu.hk:5070 open through proxy proxy.cse.cuhk.edy.hk:5070 HTTPS, proxy.cse.cuhk.edu.hk:5171 open through p3p.sogou.com:80 SOCKS, proxy.cse.cuhk.edu.hk:5070 open through proxy 182.254.114.110:80 HTTPS, proxy.cse.cuhk.edu.hk:5172 open through proxy socks.cse.cuhl.edu.hk:5070 SOCKS] could have two common patterns: <*> open through proxy <*> HTTPS and <*> open through proxy <*> SOCKS.

I propose creating a new module in OpenSearch that provides several log parsing algorithms for extracting common patterns from a stream of log messages, so that other components such as DSL or the SQL/PPL plugin can build their own operators on top of those algorithms. Please share your thoughts on whether this is a good idea.

Describe the solution you'd like

The proposal is to first create a new module such as org.opensearch.patterns in milestone 1, similar to org.opensearch.grok. The goal of this module is to act as a library of multiple log parsing algorithms.

In milestone 2, import the algorithms into other plugins such as opensearch-skills to migrate the existing simple log pattern approach to the advanced algorithms.

In milestone 3, implement a new operator in the SQL/PPL plugin based on suitable algorithms.

In milestone 4, since grouping log patterns can be treated as a special aggregation, we could support a log pattern aggregator (the reduce part) in OpenSearch DSL or pipelines.

Related component

Libraries

Describe alternatives you've considered

Today, for performance reasons, DSL or PPL returns at most 10,000 results by default (MAX_RESULT_WINDOW_SETTING). For this volume of data, it is probably enough to extract common log patterns on the Coordinator Node first.

Instead of applying the algorithms only in the aggregator (reduce) part, we could support partial log pattern aggregation at the DataNode level over all filtered documents, which could amount to millions of log messages. Given the heavy effort involved, we want to prioritize grouping log patterns on the Coordinator Node.

Additional context

Assumptions

  1. Based on industrial empirical knowledge, IP addresses, URLs, numbers, and special software ids such as process ids are known variable tokens. In the preprocessing step, every algorithm applies a default regex to exclude those known variable tokens and default delimiters to split tokens. Users can also pass a customized regex and delimiters to improve preprocessing if they have deep domain knowledge.
  2. Log messages that are generated by the same log statement usually have the same number of tokens after delimiting.
  3. Constant tokens have high frequencies at the same token position if the same log statement is executed many times.
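The preprocessing step in assumption 1 could look roughly like the Python sketch below. The regexes, the delimiter set, and the function name `preprocess` are illustrative assumptions; the RFC does not specify the actual defaults.

```python
import re

# Illustrative regexes for known variable tokens (assumptions, not the real defaults).
KNOWN_VARIABLES = [
    re.compile(r"https?://\S+"),                          # URLs
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"),  # IPv4 address, optional port
    re.compile(r"\b\d+\b"),                               # plain numbers, e.g. process ids
]
DEFAULT_DELIMITERS = r"[ \t,;=]+"  # assumed default delimiters


def preprocess(message, extra_patterns=(), delimiters=DEFAULT_DELIMITERS):
    """Mask known variable tokens with <*> and split the message into tokens."""
    # User-supplied patterns run first so domain knowledge takes precedence.
    for pattern in [*extra_patterns, *KNOWN_VARIABLES]:
        message = pattern.sub("<*>", message)
    return [token for token in re.split(delimiters, message) if token]


tokens = preprocess("Connection from 10.0.0.5:8080 closed after 35 ms pid=4412")
print(tokens)
```

A user with domain knowledge could pass an extra compiled regex, for example `preprocess(msg, extra_patterns=[re.compile(r"u-\d+")])`, to mask application-specific ids before the defaults run.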

Design Considerations

We ran several algorithms, along with the existing OpenSearch simple log pattern algorithm, on an open-source benchmark called logparser to compare how well each one groups logs. The benchmark covers 16 industrial software datasets from loghub. We also compared time and space cost across different volumes of log data, running 10 iterations to calculate the mean finish time in seconds and the average memory cost in MB.

Grouping Accuracy

The following graph shows each algorithm's grouping accuracy distribution as a box plot across the 16 industrial log datasets. We observed that the OpenSearch simple log pattern approach is not as competitive as the others. The three most accurate algorithms are, in order, Brain > Drain > AEL.
image

Time Complexity

Overall, the time complexity of all top 3 algorithms is bounded by O(n), where n is the number of log lines. Brain is the fastest algorithm on the 4 selected datasets.
image
image
image
image

Space Complexity

Overall, the space complexity of all top 3 algorithms is bounded by O(n * L), where n is the number of log lines and L is the average number of tokens per log message. Brain has up to twice the memory cost of the other two algorithms.
image
image
image
image

Preferred Algorithm

Although the Brain algorithm has a larger memory cost, it processes different volumes of log data with excellent time efficiency and has the highest grouping accuracy with the lowest variance. It will be the first algorithm to be implemented.

Algorithm Introduction

  1. After the preprocessing step, the algorithm input is a stream of split token lists like [[token01, token02, token03, ..], [token11, token02, token03, ..], ...].
  2. It calculates the global token frequencies per column over the entire input, like a histogram of words at each column position. Each token is embedded as a frequency vector <frequency, token, position>. For example, token02 and token03 have the vectors (2, token02, 1) and (2, token03, 2) based on the sample input in the first step.
  3. An initial log pattern is formed from the tokens that share the same highest frequency within a log message. For example, <*> token02 token03 ... is the initial log pattern based on the second step's result.
  4. The algorithm maintains a bidirectional tree data structure to refine the final log pattern, applying heuristic rules to the other tokens in the same log message.
  5. Final log patterns are generated by traversing the tree.
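Steps 1 to 3 can be sketched in Python as follows. This is a simplified illustration only: it stops at the initial pattern and omits the bidirectional tree and heuristic rules of steps 4 and 5. The function names are hypothetical, and the histogram is keyed by (position, token) rather than stored as a <frequency, token, position> vector.

```python
from collections import Counter


def positional_frequencies(token_lists):
    """Step 2: count each (position, token) pair over the whole input."""
    freq = Counter()
    for tokens in token_lists:
        for pos, tok in enumerate(tokens):
            freq[(pos, tok)] += 1
    return freq


def initial_pattern(tokens, freq):
    """Step 3: keep tokens sharing the highest frequency in this message, mask the rest."""
    if not tokens:
        return ""
    top = max(freq[(pos, tok)] for pos, tok in enumerate(tokens))
    return " ".join(
        tok if freq[(pos, tok)] == top else "<*>" for pos, tok in enumerate(tokens)
    )


# Sample input from step 1, truncated to three tokens per message.
logs = [["token01", "token02", "token03"], ["token11", "token02", "token03"]]
freq = positional_frequencies(logs)
patterns = [initial_pattern(tokens, freq) for tokens in logs]
print(patterns)  # both messages yield "<*> token02 token03"
```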

Implementation Proposal

The benefit of a separate module is that it provides general algorithm implementations and interfaces that run on any type of compute resource, whether a DataNode, a Coordinator Node, or an ML Node. It is also agnostic to the query language.
This section briefly discusses how we could implement log pattern grouping in the OpenSearch SearchService.

Phase 1

In phase 1, we will prioritize grouping log patterns only on the Coordinator node based on results from DataNodes, considering that the MAX_RESULT_WINDOW limit caps the search results returned from DataNodes and that the algorithm has low cost when handling a data volume of 10,000. An example component is shown as follows:

image

Resource Isolation and Circuit Breakers

Grouping log patterns on a single Coordinator node adds memory and CPU pressure. Although it is not a frequent query, it is still better to apply a quick circuit breaker that checks memory usage and cancels the search request early.
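As a rough illustration of the idea, the Python sketch below tracks an estimated byte count while tokens are collected and aborts once a limit is exceeded. The class and exception names, the per-token overhead, and the estimation formula are all assumptions made for this example; this is not the OpenSearch circuit breaker API.

```python
class PatternCircuitBreakingError(Exception):
    """Raised when the estimated memory use exceeds the configured limit."""


class PatternMemoryBreaker:
    """Hypothetical breaker that tracks an estimated byte count."""

    PER_TOKEN_OVERHEAD = 16  # assumed bookkeeping cost per token, in bytes

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def reserve(self, tokens):
        # Rough estimate: UTF-8 size of each token plus a fixed overhead.
        estimate = sum(len(t.encode("utf-8")) + self.PER_TOKEN_OVERHEAD for t in tokens)
        if self.used_bytes + estimate > self.limit_bytes:
            raise PatternCircuitBreakingError(
                f"estimated {self.used_bytes + estimate} bytes exceeds limit {self.limit_bytes}"
            )
        self.used_bytes += estimate


def collect_tokens(messages, breaker):
    """Tokenize messages, checking the breaker before keeping each token list."""
    token_lists = []
    for message in messages:
        tokens = message.split()
        breaker.reserve(tokens)  # cancels early instead of running out of memory
        token_lists.append(tokens)
    return token_lists
```

Checking before each message is stored means the request fails fast, before the expensive grouping step starts.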

Phase 2

In phase 2, we could push log pattern grouping down to the query phase so that DataNodes can compute partial results over a larger volume of data. Since the Brain algorithm requires a global histogram, it needs two passes of map-reduce for distributed computation. The global histogram generated in the first pass needs to be dispatched to the Data Nodes for the second-pass query. The initial idea is illustrated in the following graph:

image
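The two-pass flow can be sketched in Python, with plain lists standing in for shards. The function names are hypothetical, and the masking rule is the simplified "highest frequency within a message" rule rather than the full Brain algorithm.

```python
from collections import Counter


def shard_histogram(token_lists):
    """Pass 1 (map): local (position, token) counts on one data node."""
    return Counter((pos, tok) for tokens in token_lists for pos, tok in enumerate(tokens))


def merge_histograms(partials):
    """Pass 1 (reduce): the coordinator sums local histograms into a global one."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged


def shard_patterns(token_lists, global_hist):
    """Pass 2 (map): each data node masks tokens using the global histogram."""
    counts = Counter()
    for tokens in token_lists:
        if not tokens:
            continue
        top = max(global_hist[(p, t)] for p, t in enumerate(tokens))
        pattern = " ".join(
            t if global_hist[(p, t)] == top else "<*>" for p, t in enumerate(tokens)
        )
        counts[pattern] += 1
    return counts


def merge_patterns(partials):
    """Pass 2 (reduce): the coordinator sums pattern counts from all data nodes."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged


shard_a = [["a", "open", "x"], ["b", "open", "x"]]
shard_b = [["c", "open", "x"]]
global_hist = merge_histograms([shard_histogram(s) for s in (shard_a, shard_b)])
result = merge_patterns([shard_patterns(s, global_hist) for s in (shard_a, shard_b)])
print(result)
```

With only its local histogram, `shard_b` would see every token exactly once and could not mask anything, which is why the global histogram has to be sent back to the data nodes before the second pass.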

@songkant-aws songkant-aws added enhancement Enhancement or improvement to existing feature or request untriaged labels Nov 13, 2024
@github-actions github-actions bot added the Libraries Lucene Upgrades and Libraries, Any 3rd party library that Core depends on, ex: nebula; team is respo label Nov 13, 2024
@dbwiddis
Member

I absolutely love this proposal.

I recently added a split search response processor and thought that the power of regex could really improve that capability. I envisioned the possibilities of regex and realized the Grok processor existed in ingest, and filed a feature request for a Grok Search Response Processor that I thought I would eventually get to.

But this proposal is better.

Yes, yes, yes, let's optimize common log patterns.

@xluo-aws
Member

@anirudha , @dblock , @msfroh, @gaobinlong , could you help take a look at this proposal?

@dblock
Member

dblock commented Nov 19, 2024

I am +1 to what @dbwiddis is saying.

@gaobinlong
Collaborator

Implementing log pattern as a common library is a good idea. We may consider making the library extensible, since we may add more algorithms in the future.

Another question is where we add the circuit breaker settings for the log pattern algorithm; maybe that is not feasible in the library?

@songkant-aws
Author

As log pattern will be placed in a new common library, it's not feasible to directly add protections like a circuit breaker there. The initial plan is to add the circuit breaker in the new operators, whether they are in DSL or PPL.

@dblock
Member

dblock commented Dec 2, 2024

[Catch All Triage - 1, 2, 3]

@songkant-aws
Author

@penghuo @andrross + Andrew and Peng to review RFC if you're interested in this topic.

@reta
Collaborator

reta commented Dec 4, 2024

@songkant-aws thanks for the proposal, a few questions if you don't mind, to understand the tradeoffs

Phase 1
In phase 1, we will prioritize grouping log patterns only in Coordinator node based on results from DataNode ...

Is my understanding correct that the grouping is going to be done at search time for each query (over the results returned from data nodes)? I assume we are constrained in what could be done at ingestion time since the range of possible grouping patterns may not be known ahead of time?

@andrross
Member

andrross commented Dec 4, 2024

Thanks @songkant-aws. I don't have a ton of domain knowledge with log pattern matching. I see your PR that adds a library and the high level architecture diagrams here, but what are the concrete integration points that you're considering? Would this library just be used by the sql/ppl plugin? Would it be used by some sort of search pipeline? Are there DSL changes if you're building new log pattern matching features into core directly?

@songkant-aws
Author

songkant-aws commented Dec 5, 2024

Is my understanding correct that the grouping is going to be done at search time for each query (over the results returned from data nodes)? I assume we are constrained in what could be done at ingestion time since the range of possible grouping patterns may not be known ahead of time?

@reta Yes, I think the more common case is grouping over the results returned from data nodes, assuming users would like to apply various filtering conditions to focus on a partial range of data. We don't know the possible grouping patterns ahead of time until the range is specified. For example, the top N ERROR level logs could be significantly different from the top N INFO level logs.

@songkant-aws
Author

songkant-aws commented Dec 5, 2024

Some pattern grouping algorithms support online parsing, which is similar to training a machine learning model. For that case, I think we could leverage the ingestion pipeline to create an initial parser for grouping any range of free-text log messages. Another idea is directly integrating a machine learning training process with the ingestion pipeline (out of scope here).

Offline algorithms need to know the statistics of the data range. Those algorithms are suitable for the on-demand searching and grouping method.

@songkant-aws
Author

@andrross Andrew, it's planned to be reused in different places. The proposal is to expand the scope of log pattern support beyond sql/ppl to dsl as well. Some plugins like the skills/discover plugins could also directly import the classes to use the functionality. That's the main reason why I added it as a new library (similar to other libraries like Grok). Would you have any suggestions on a better place to add it?

@penghuo
Contributor

penghuo commented Dec 20, 2024

@songkant-aws good idea! this solution will help log analytics a lot. couple questions

  1. What is the expected usage of DSL/PPL with this library? Are we planning to add new queries or functions to DSL?
  2. Should we focus on Phase 2 design, ensuring the functionality is truly production-ready. This approach will also help us evaluate the technical pros and cons of integrating this function into the core.
  3. Is this algorithm tightly coupled with OpenSearch functionality, or can it run independently on any engine? For instance, if we apply this function on Spark to query CloudWatch data, would it still require an OpenSearch cluster?

@songkant-aws
Author

@penghuo Thanks for your interest, Peng

  1. What is the expected usage of DSL/PPL with this library? Are we planning to add new queries or functions to DSL?

Ideally, the expectation is to add a new query, such as a special aggregation type, built on this library. That way, we can query millions of logs or more and extract log patterns in a map-reduce fashion, i.e., combine the partial log patterns from DataNodes and reduce them into global log patterns for each query (with different filters or similar).

  2. Should we focus on Phase 2 design, ensuring the functionality is truly production-ready. This approach will also help us evaluate the technical pros and cons of integrating this function into the core.

Yeah, I think that's fair. The Phase 2 design is thorough and can also make sure PPL queries are pushed down to DataNodes. PPL's current simple log pattern only works on the top 10,000 results on the coordinator node. To simplify the implementation, I'm thinking we could use a similarity function to merge similar partial patterns from DataNodes instead of the current Phase 2 structure.
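One possible shape for that similarity-based merge, as a Python sketch. The similarity measure (fraction of matching token positions, with <*> treated as a wildcard), the 0.9 threshold, and all function names are assumptions for illustration; the comment does not specify which similarity function would be used.

```python
WILDCARD = "<*>"


def similarity(a, b):
    """Fraction of positions where tokens match; a wildcard matches anything."""
    if len(a) != len(b):
        return 0.0
    if not a:
        return 1.0
    hits = sum(x == y or WILDCARD in (x, y) for x, y in zip(a, b))
    return hits / len(a)


def merge_two(a, b):
    """Keep tokens that agree and mask positions that differ."""
    return [x if x == y else WILDCARD for x, y in zip(a, b)]


def merge_partial_patterns(partials, threshold=0.9):
    """Greedily fold (pattern, count) pairs from data nodes into global patterns."""
    merged = []  # each entry is [tokens, count]
    for pattern, count in partials:
        tokens = pattern.split()
        for entry in merged:
            if similarity(entry[0], tokens) >= threshold:
                entry[0] = merge_two(entry[0], tokens)
                entry[1] += count
                break
        else:
            merged.append([tokens, count])
    return [(" ".join(tokens), count) for tokens, count in merged]


partials = [
    ("<*> open through proxy <*> HTTPS", 5),
    ("<*> open through proxy 182.254.114.110:80 HTTPS", 2),
    ("<*> open through proxy <*> SOCKS", 3),
]
print(merge_partial_patterns(partials))
```

In this sample, the second pattern folds into the first because its only difference sits under a wildcard, while the SOCKS pattern stays separate since 5 of 6 matching positions falls below the threshold.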

  3. Is this algorithm tightly coupled with OpenSearch functionality, or can it run independently on any engine? For instance, if we apply this function on Spark to query CloudWatch data, would it still require an OpenSearch cluster?

The algorithm itself is independent of any engine. Each query engine needs to define its own physical plan (a special aggregation) to integrate with this computation method. Taking Spark querying CloudWatch data as an example, I don't think it requires an OpenSearch cluster. The reason we proposed putting it in opensearch.libs is that it will be published as an independent jar, so that others can consume it directly as a dependency to incorporate the algorithm.
