[RFC] Log pattern support in OpenSearch #16627
Comments
I absolutely love this proposal. I recently added a split search response processor and thought that the power of regex could really improve that capability. I envisioned the possibilities of regex and realized the Grok processor existed in ingest, and filed a feature request for a Grok Search Response Processor that I thought I would eventually get to. But this proposal is better. Yes, yes, yes, let's optimize common log patterns.
@anirudha, @dblock, @msfroh, @gaobinlong, could you help take a look at this proposal?
I am +1 to what @dbwiddis is saying.
Implementing log patterns as a common library is a good idea; we may consider making the library extensible so that we can add more algorithms in the future. Another question is where to add the circuit breaker settings for the log pattern algorithm; maybe that's not feasible in the library?
Since the log pattern support will be placed in a new common library, it's not feasible to add protections like circuit breakers directly there. The initial plan is to add circuit breakers in the new operators, whether they are in DSL or PPL.
@songkant-aws thanks for the proposal, a few questions if you don't mind, to understand the tradeoffs
Is my understanding correct that the grouping is going to be done at search time for each query (over the results returned from the data nodes)? I assume we are constrained in what can be done at ingestion time, since the range of possible grouping patterns may not be known ahead of time?
Thanks @songkant-aws. I don't have a ton of domain knowledge with log pattern matching. I see your PR that adds a library and the high-level architecture diagrams here, but what are the concrete integration points that you're considering? Would this library just be used by the sql/ppl plugin? Would it be used by some sort of search pipeline? Are there DSL changes if you're building new log pattern matching features into core directly?
@reta Yes, I think the more common case is grouping over the results returned from data nodes, assuming users would like to apply various filtering conditions to focus on a partial range of data. We don't know the possible grouping patterns ahead of time until the range is specified. For example, the top N ERROR-level logs could be significantly different from the top N INFO-level logs.
Some pattern grouping algorithms support online parsing, which is similar to training a machine learning model. For that case, I think we could leverage the ingestion pipeline to create an initial parser for grouping any range of free-text log messages. Another idea is directly integrating a machine learning training process with the ingestion pipeline (out of scope here). Offline algorithms need to know the statistics of the range of data, so they are suitable for the on-demand search-and-group method.
@andrross Andrew, it's planned to be reused in different places. The proposal is to expand log pattern support not only to sql/ppl but also to DSL. Some plugins, like the skills/discover plugins, could also directly import the classes to use the functionality. That's the main reason I'm adding it as a new library (similar to other libraries like Grok). Would you have any suggestions on a better place to add it?
@songkant-aws good idea! This solution will help log analytics a lot. A couple of questions:
@penghuo Thanks for your interest, Peng
Ideally, the expectation is to add a new query, like a special aggregation type, backed by this library. That way, we can query millions or more logs and extract log patterns in a map-reduce flavor: combine each DataNode's partial log patterns and reduce them into global log patterns for each query (with different filters or similar).
Yeah, I think that's fair. The Phase 2 design is thorough and also makes sure PPL queries are pushed down to the DataNodes. The current PPL simple log pattern only works on the top 10,000 results on the coordinator node. To simplify the implementation, I'm thinking we can leverage a similarity function to merge similar partial patterns from the DataNodes instead of the current Phase 2 structure.
The algorithm itself is independent of any particular engine. Each query engine needs to define its own physical plan (a special aggregation) to integrate with this computation method. Say Spark queries CloudWatch data; I don't think that requires an OpenSearch cluster. The reason we proposed putting it in opensearch.libs is that it will publish an independent jar, so that others can directly consume the jar dependency to incorporate the algorithms.
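The similarity-based merge of partial patterns mentioned above could be sketched roughly as follows. This is illustrative only, not the proposed implementation: the class name, the token-agreement similarity measure, and the threshold are all assumptions.

```java
import java.util.*;

// Sketch: merge two partial pattern templates by token-level similarity.
// If two templates of equal length agree on enough positions, collapse
// them into one template with "<*>" at the disagreeing positions.
public class PatternSimilaritySketch {
    // Fraction of positions where both templates have the same token.
    static double similarity(String[] a, String[] b) {
        if (a.length != b.length) return 0.0;
        int same = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i].equals(b[i])) same++;
        }
        return (double) same / a.length;
    }

    // Merge b into a when similarity meets the (illustrative) threshold.
    static Optional<String> tryMerge(String a, String b, double threshold) {
        String[] ta = a.split(" ");
        String[] tb = b.split(" ");
        if (similarity(ta, tb) < threshold) return Optional.empty();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < ta.length; i++) {
            if (i > 0) out.append(' ');
            out.append(ta[i].equals(tb[i]) ? ta[i] : "<*>");
        }
        return Optional.of(out.toString());
    }

    public static void main(String[] args) {
        // 3 of 4 tokens agree (0.75 >= 0.5), so the templates merge.
        System.out.println(tryMerge("user <*> logged in", "user <*> logged out", 0.5));
    }
}
```

A real integration would also have to handle templates of different lengths and decide the threshold empirically; the sketch sidesteps both.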
Is your feature request related to a problem? Please describe
Today, OpenSearch supports the Grok and Patterns operators in PPL, which leverage regex to exclude stop characters and generate a log message's pattern. By default, a very simple rule is applied that just excludes alphanumeric `[a-zA-Z\d]` characters. For example, `[email protected]` and `[email protected]` are potentially grouped as the same pattern because, after processing, their patterns are both `@.`.
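The simple default rule amounts to a one-line regex replacement. A minimal sketch (the class name and the email values are hypothetical, since the concrete examples above are redacted):

```java
// Sketch of the existing simple approach: strip all alphanumeric
// characters so only punctuation remains. Structurally different
// values then collapse into the same "pattern".
public class SimplePatternSketch {
    // Remove every character matched by [a-zA-Z\d]; the remainder is the pattern.
    static String simplePattern(String message) {
        return message.replaceAll("[a-zA-Z\\d]", "");
    }

    public static void main(String[] args) {
        // Two different email addresses reduce to the identical pattern "@."
        System.out.println(simplePattern("alice@example.com")); // prints "@."
        System.out.println(simplePattern("bob@foo.net"));       // prints "@."
    }
}
```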
This simple approach has low grouping accuracy because different log statements can have the same combination of punctuation, and the generated pattern is not friendly to human reading. To achieve better grouping accuracy, an expert with domain knowledge has to manually apply suitable regexes case by case.
Automatically extracting log patterns is a popular trend in industrial log analysis. Industrial products like Sumo Logic have a logreduce operator that groups log messages together based on string and pattern similarity. Ideally, a good log pattern feature should process a stream of semi-structured log messages and identify which parts of each log message are constant words and which are variables. For example, the list of log messages
[proxy.cse.cuhk.edu.hk:5070 open through proxy proxy.cse.cuhk.edy.hk:5070 HTTPS, proxy.cse.cuhk.edu.hk:5171 open through p3p.sogou.com:80 SOCKS, proxy.cse.cuhk.edu.hk:5070 open through proxy 182.254.114.110:80 HTTPS, proxy.cse.cuhk.edu.hk:5172 open through proxy socks.cse.cuhl.edu.hk:5070 SOCKS]
could have two common patterns: `<*> open through proxy <*> HTTPS` and `<*> open through proxy <*> SOCKS`.
I'm of the opinion that we should create a new module in OpenSearch that adds several log parsing algorithms for extracting common log patterns from a stream of log messages, so that other components like DSL or the SQL/PPL plugin can leverage those algorithms to develop their own operators. Please share your thoughts on whether this is a good or bad idea.
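As a rough illustration of the `<*>` template idea (not the actual library API), two tokenized log lines of equal length can be merged into a shared template by masking the positions where their tokens differ:

```java
// Illustrative sketch only: merge two equal-length token lists into one
// template, replacing mismatched positions with the variable marker "<*>".
public class TemplateMergeSketch {
    static String merge(String a, String b) {
        String[] ta = a.split(" ");
        String[] tb = b.split(" ");
        if (ta.length != tb.length) {
            throw new IllegalArgumentException("token counts differ");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < ta.length; i++) {
            if (i > 0) out.append(' ');
            out.append(ta[i].equals(tb[i]) ? ta[i] : "<*>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Adapted from the example messages above (ports changed so the
        // first token differs between the two lines).
        String t = merge(
            "proxy.cse.cuhk.edu.hk:5070 open through proxy 182.254.114.110:80 HTTPS",
            "proxy.cse.cuhk.edu.hk:5171 open through proxy p3p.sogou.com:80 HTTPS");
        System.out.println(t); // prints "<*> open through proxy <*> HTTPS"
    }
}
```

Real algorithms like Drain or Brain are far more involved (they decide *which* lines belong together before merging), but the output shape is the same.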
Describe the solution you'd like
The proposal is to first create a new module, `org.opensearch.patterns` (similar to `org.opensearch.grok`), in milestone 1. The goal of this module is to act as a library of multiple log parsing algorithms.
In milestone 2, import the algorithms into other plugins like opensearch-skills to migrate the existing simple log pattern to the advanced algorithms.
In milestone 3, implement new operators in the SQL/PPL plugin based on suitable algorithms.
In milestone 4, since grouping log patterns can be treated as a special aggregation, support the log pattern aggregator (reduce) part in OpenSearch DSL or a pipeline.
Related component
Libraries
Describe alternatives you've considered
Today, for performance reasons, DSL and PPL may only return up to 10,000 results by default (the MAX_RESULT_WINDOW setting). For this volume of data, it's probably enough to apply common log pattern extraction on the Coordinator Node first.
Instead of only applying the algorithms in the aggregator (reduce) part, we could support partial log pattern aggregation at the DataNode level over all filtered documents, which could be millions of log messages. Considering the heavy implementation effort, we want to prioritize grouping log patterns on the Coordinator Node.
Additional context
Assumptions
Design Considerations
We ran a set of algorithms, as well as the existing OpenSearch simple log pattern algorithm, on an open-source benchmark called logparser to compare the algorithms' log grouping efficiency. The benchmark has 16 industrial software datasets from loghub. We also compared time complexity and space complexity across different volumes of log data, using 10 iterations to calculate the mean finish time in seconds and the average memory cost in MB.
Grouping Accuracy
The following graph shows each algorithm's grouping accuracy percentiles as box plots across the 16 industrial log datasets. We observed that the OpenSearch simple log pattern approach is not as competitive as the others. The three most accurate algorithms are Brain > Drain > AEL.
Time Complexity
Overall, the time complexity of all top 3 algorithms is bounded by O(n), where n is the number of log lines. Brain is the fastest algorithm on the 4 selected datasets.
Space Complexity
Overall, the space complexity of all top 3 algorithms is bounded by O(n * L), where n is the number of log lines and L is the average number of tokens per log message. Brain has up to twice the memory cost of the other two algorithms.
Preferred Algorithm
Although the Brain algorithm has a larger memory cost, it has excellent time efficiency across different volumes of log data and the highest grouping accuracy with the lowest variance. It will be the first algorithm implemented.
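To give a feel for the frequency idea at the core of Brain-style parsing, here is a highly simplified sketch, not the full algorithm: count how often each (position, token) pair occurs across the input, then for one log line keep the tokens whose frequency matches the line's dominant frequency and mask the rest as `<*>`. All names are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.HashMap;

// Simplified sketch of frequency-based initial pattern extraction.
// Constant words (e.g. "open", "through") occur with the same high
// frequency across lines; variables (hosts, ports) occur rarely.
public class BrainSketch {
    // `line` is assumed to be one of the entries in `logs`.
    static String initialPattern(List<String> logs, String line) {
        // Frequency of each (position, token) pair across all logs.
        Map<String, Integer> freq = new HashMap<>();
        for (String log : logs) {
            String[] toks = log.split(" ");
            for (int i = 0; i < toks.length; i++) {
                freq.merge(i + ":" + toks[i], 1, Integer::sum);
            }
        }
        String[] toks = line.split(" ");
        // Dominant frequency: the count shared by most of this line's tokens.
        Map<Integer, Integer> countOfCounts = new HashMap<>();
        for (int i = 0; i < toks.length; i++) {
            countOfCounts.merge(freq.get(i + ":" + toks[i]), 1, Integer::sum);
        }
        int dominant = countOfCounts.entrySet().stream()
            .max(Map.Entry.comparingByValue()).get().getKey();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < toks.length; i++) {
            if (i > 0) out.append(' ');
            out.append(freq.get(i + ":" + toks[i]) == dominant ? toks[i] : "<*>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> logs = List.of(
            "host1 open through proxy a HTTPS",
            "host2 open through proxy b HTTPS",
            "host3 open through proxy c HTTPS");
        // prints "<*> open through proxy <*> HTTPS"
        System.out.println(initialPattern(logs, logs.get(0)));
    }
}
```

The real Brain algorithm then refines this initial pattern in further steps; this sketch only shows why a frequency histogram is needed up front, which is also why the Phase 2 design below requires two distributed passes.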
Algorithm Introduction
`<*> token02 token03 ...` is the initial log pattern based on the second step's result.
Implementation Proposal
The benefit of creating a separate module is that it provides general algorithm implementations and interfaces on any type of computation resource, whether a DataNode, Coordinator Node, or ML Node, and it is agnostic to the declarative language.
This section briefly discusses how we could implement log pattern grouping in the OpenSearch SearchService.
Phase 1
In phase 1, we will prioritize grouping log patterns only on the Coordinator node, based on the results from the DataNodes, considering that the MAX_RESULT_WINDOW limit caps the search results returned from a DataNode and the algorithm has low cost for a data volume of 10,000. An example component diagram is shown as follows:
Resource Isolation and Circuit Breakers
Applying log pattern grouping on a single Coordinator node adds additional memory and CPU pressure. Although it's not a frequent query, it's still better to apply a quick circuit breaker that checks memory usage and cancels the search request early.
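A pre-flight check of this kind could look roughly like the following. This is an assumption-laden sketch: it uses plain JVM heap statistics and a made-up threshold, not OpenSearch's actual CircuitBreaker API.

```java
// Illustrative sketch of a pre-flight memory check before running pattern
// grouping on the coordinator node. Threshold, class name, and exception
// type are assumptions, not OpenSearch's CircuitBreaker service.
public class PatternCircuitBreakerSketch {
    static final double MEMORY_USE_LIMIT = 0.85; // hypothetical 85% heap limit

    static void checkBeforeGrouping() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        double ratio = (double) used / rt.maxMemory();
        if (ratio > MEMORY_USE_LIMIT) {
            // A real integration would trip the breaker and cancel the search request.
            throw new IllegalStateException("log-pattern grouping rejected: heap at "
                + Math.round(ratio * 100) + "%");
        }
    }

    public static void main(String[] args) {
        checkBeforeGrouping(); // passes on a mostly idle JVM
        System.out.println("grouping allowed");
    }
}
```

In core OpenSearch this would more likely delegate to the existing parent circuit breaker rather than read `Runtime` directly; the sketch only shows where the check sits in the flow.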
Phase 2
In phase 2, we could push the log pattern computation down to the query phase so that DataNodes can compute partial results for larger volumes of data. Since the Brain algorithm requires a global histogram, it needs two passes of map-reduce for distributed computation. The global histogram generated in the first pass needs to be dispatched to the Data Nodes for the second-pass query. The initial idea is illustrated in the following graph:
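The first of the two passes can be sketched as a map-reduce over per-shard histograms. Everything here is illustrative (method names, token splitting): each data node builds a local token histogram, the coordinator merges them into the global histogram, and that global histogram is then shipped back for the second pass.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of pass 1 of the two-pass flow: per-shard token histograms
// (map) merged into one global histogram (reduce).
public class TwoPassSketch {
    // Pass 1 map: local histogram of space-separated tokens on one shard.
    static Map<String, Long> localHistogram(List<String> shardLogs) {
        return shardLogs.stream()
            .flatMap(l -> Arrays.stream(l.split(" ")))
            .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
    }

    // Pass 1 reduce: merge the per-shard histograms on the coordinator.
    static Map<String, Long> mergeHistograms(List<Map<String, Long>> partials) {
        Map<String, Long> global = new HashMap<>();
        for (Map<String, Long> p : partials) {
            p.forEach((tok, c) -> global.merge(tok, c, Long::sum));
        }
        return global;
    }

    public static void main(String[] args) {
        Map<String, Long> shard1 = localHistogram(List.of("conn open", "conn closed"));
        Map<String, Long> shard2 = localHistogram(List.of("conn open"));
        Map<String, Long> global = mergeHistograms(List.of(shard1, shard2));
        System.out.println(global.get("conn")); // prints 3
    }
}
```

Pass 2 would then run the pattern extraction on each data node against this shared `global` histogram, so every node masks the same tokens as variables.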