[Feature Request] Aggregation `include`/`exclude` should support faster filtering on prefixes #14368
Is your feature request related to a problem? Please describe
I've been working with an ecommerce user of OpenSearch, where they've implemented something like Lucene's hierarchical facets by tagging products with `category:label` pairs (like `color:red`). These are collected using `terms` aggregations.

To avoid collecting facet labels that are irrelevant to the current search, they have some mechanism to identify the relevant facet categories for the query. They've been filtering these using a regular expression in the `include` parameter for their `terms` aggregation, like `"include": "(color|size|brand|material|machine\\ washable|sleeve\\ length|...):.+"`, in order to filter the `category:label` pairs based on a specific set of categories. Overall, prefix filtering on terms aggregations seems like a fairly reasonable thing to want to do.

Unfortunately, this really slows down search requests, as the `IncludeExclude` class tries to step through all possible `category:label` values (in the global ordinals) that match the expression. The commit from 2015 that added the current automaton-based behavior (which was a significant speedup from what came before) mentions this problem in its commit message.

Indeed, there's a `TODO` in place calling out that prefix matching should be a special case:

OpenSearch/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/IncludeExclude.java
Line 344 in cf2c31f
https://github.com/msfroh/OpenSearch/blob/cf2c31fffe844f78f17cf1c2a780198b9b6258d4/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/IncludeExclude.java#L344
Describe the solution you'd like
I would like to specialize the `IncludeExclude.AutomatonBackedOrdinalsFilter` to handle prefixes better.

Ideally, I would like to make it fully transparent to the user -- essentially, I'd like to address the `TODO` that I listed above, where we simply handle prefixes as a special case of regexp `include`/`exclude`.

In order to do that, I need to learn more about Lucene's automaton matching, which is something that I would like to wrap my head around anyway. (I learned a little bit as a result of #13461, but that only scratched the surface. I want to know more.)
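To make "prefixes as a special case of regexp" a bit more concrete, here is a minimal sketch that recognizes only the simplest shape, `(a|b|c)suffix.+`, by scanning the pattern string. It is not the automaton-level analysis described above, just a plain string check. `PrefixRegexSketch`, `extractPrefixes`, and the list of operator characters are hypothetical names and assumptions for illustration, and the code conservatively gives up on anything it does not recognize.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class PrefixRegexSketch {
    // Assumed, deliberately conservative list of characters treated as
    // regex operators when they appear unescaped.
    private static final String META = ".?+*|{}[]()\"\\#@&<>~^$";

    /**
     * Returns the literal prefixes if the pattern has the shape
     * "(alt1|alt2|...)literal.+" or "literal.+"; otherwise returns empty.
     * Note that ".+" requires at least one character after the prefix.
     */
    static Optional<List<String>> extractPrefixes(String regex) {
        int n = regex.length();
        int[] pos = {0};
        List<String> alternatives = new ArrayList<>();
        if (n > 0 && regex.charAt(0) == '(') {
            pos[0] = 1;
            while (true) {
                String literal = readLiteral(regex, pos);
                if (literal == null || pos[0] >= n) {
                    return Optional.empty();
                }
                alternatives.add(literal);
                char stop = regex.charAt(pos[0]++);
                if (stop == ')') {
                    break;
                }
                if (stop != '|') {
                    return Optional.empty(); // nested group or other operator
                }
            }
        } else {
            alternatives.add("");
        }
        String suffix = readLiteral(regex, pos);
        // The only thing allowed after the literal part is a trailing ".+".
        if (suffix == null || !regex.startsWith(".+", pos[0]) || pos[0] + 2 != n) {
            return Optional.empty();
        }
        List<String> prefixes = new ArrayList<>();
        for (String alt : alternatives) {
            prefixes.add(alt + suffix);
        }
        return Optional.of(prefixes);
    }

    // Reads literal characters from pos[0] until an unescaped operator or the
    // end of input. Returns null for escapes this sketch refuses to interpret.
    private static String readLiteral(String regex, int[] pos) {
        StringBuilder sb = new StringBuilder();
        while (pos[0] < regex.length()) {
            char c = regex.charAt(pos[0]);
            if (c == '\\') {
                if (pos[0] + 1 >= regex.length()) {
                    return null;
                }
                char escaped = regex.charAt(pos[0] + 1);
                if (Character.isLetterOrDigit(escaped)) {
                    return null; // could be a character class such as \d
                }
                sb.append(escaped);
                pos[0] += 2;
            } else if (META.indexOf(c) >= 0) {
                break;
            } else {
                sb.append(c);
                pos[0]++;
            }
        }
        return sb.toString();
    }
}
```

For a pattern like `(color|size|machine\ washable):.+` this yields the prefixes `color:`, `size:`, and `machine washable:`. A pattern with a wildcard in the middle, such as `col.r:.+`, is rejected and would fall back to the existing automaton path.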
Related component
Search:Aggregations
Describe alternatives you've considered
If the Lucene automaton rabbit hole turns out to be too dark and scary, another option could be to add `include_prefixes` and `exclude_prefixes` parameters that we can parse from the `IncludeExclude` class.

Once we have that, we could implement logic similar to

OpenSearch/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/IncludeExclude.java
Lines 340 to 350 in cf2c31f
The new logic would be something like:
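One way to sketch the idea, under stated assumptions: because global ordinals are assigned in sorted term order, all terms sharing a prefix occupy one contiguous ordinal range, so each prefix can be resolved with two binary searches instead of stepping through every matching term. In the example below a sorted `String[]` stands in for the global ordinals and `java.util.BitSet` stands in for the accepted-ordinals bit set. `PrefixOrdinalsSketch` and its methods are hypothetical; in OpenSearch the lookups would presumably go through something like `SortedSetDocValues.lookupTerm` and a `LongBitSet`, but that wiring is an assumption.

```java
import java.util.BitSet;
import java.util.List;

final class PrefixOrdinalsSketch {
    /**
     * sortedTerms stands in for the global ordinals: the array index is the
     * ordinal, and the terms are sorted ascending with String.compareTo.
     * An empty include list is treated as "accept everything".
     */
    static BitSet acceptedOrdinals(String[] sortedTerms,
                                   List<String> includePrefixes,
                                   List<String> excludePrefixes) {
        BitSet accepted = new BitSet(sortedTerms.length);
        if (includePrefixes.isEmpty()) {
            accepted.set(0, sortedTerms.length);
        }
        for (String prefix : includePrefixes) {
            int from = lowerBound(sortedTerms, prefix);
            accepted.set(from, endOfPrefix(sortedTerms, prefix, from));
        }
        // Excludes are applied last so they win over includes.
        for (String prefix : excludePrefixes) {
            int from = lowerBound(sortedTerms, prefix);
            accepted.clear(from, endOfPrefix(sortedTerms, prefix, from));
        }
        return accepted;
    }

    // First ordinal whose term is >= key (terms.length if there is none).
    static int lowerBound(String[] terms, String key) {
        int lo = 0;
        int hi = terms.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (terms[mid].compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // First ordinal at or after 'from' whose term does not start with prefix.
    // Binary search is valid because, starting at lowerBound(prefix), the
    // matching terms form one contiguous run followed only by non-matches.
    static int endOfPrefix(String[] terms, String prefix, int from) {
        int lo = from;
        int hi = terms.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (terms[mid].startsWith(prefix)) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

With this approach the cost per prefix is logarithmic in the number of unique terms plus one range update on the bit set, rather than proportional to the number of matching terms.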
Additional context
No response