[SPARK-50589][SQL] Avoid extra expression duplication when push filter #49202
base: master
Conversation
@cloud-fan Can you first take a look at whether the overall design is feasible? It involves a lot of UTs, so I will modify them step by step.
I'm very confused. When pushing predicates through Project, the problem we hit is that we need to rewrite the attributes in the predicate with the corresponding expressions in the Project. If an attribute appears more than once in the predicate, and the expression from the Project is expensive, filter pushdown may make the query slower. The solution is to wrap the predicate with With.
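To make the problem concrete, here is a hypothetical repro sketch, assuming a spark-shell session (the UDF and column names are made up for illustration and are not taken from the PR):

import spark.implicits._
import org.apache.spark.sql.functions.udf

// stand-in for a non-cheap expression produced by a Project
val expensive = udf((i: Long) => { Thread.sleep(1); i })

val df = spark.range(100)
  .select(expensive($"id").as("x"))
  .where($"x" > 10 && $"x" < 50)

// Without the fix, pushing the filter through the Project rewrites x back to the expensive
// expression and evaluates it twice per row:
//   Filter (expensive(id) > 10 AND expensive(id) < 50)
// With the proposed rewrite, the pushed-down condition is wrapped in a With so the common
// expression is evaluated only once, roughly:
//   With(c = expensive(id)) { c > 10 AND c < 50 }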
Filters support partial pushdown. If a common expression exists in two conditions, one condition may be pushed down one level while the other is pushed down multiple levels. Therefore, after splitting by And, each filter needs to generate a With, and they need to share the CommonExpressionDef. Only the lowest CommonExpressionDef will generate a common-expression Project. In addition, the child of the original common expression's Alias also needs to be rewritten with the new common expression attribute.
Why do we care about common expressions in the original predicate? This is an optimization opportunity, but what we should focus on is avoiding perf regressions when pushing filters through projects that duplicate expensive expressions. Maybe we should discuss this with a concrete example.
As in the UT in this PR, I hope that the common expression will be calculated only once.
Do you want to make it simpler: the CommonExpressionDef is not shared, and the original alias is ignored? That way the common expression will be evaluated multiple times, but fewer than before, especially in nested cases.
OK, let's make the scope clear. We do not aim to find all duplicated expressions in the query plan tree and optimize them with With.
OK, I'll do it that way, thank you.
 * @param replaceMap Replaced attributes and common expressions
 */
def apply(expr: Expression, replaceMap: Map[Attribute, Expression]): With = {
  val commonExprDefsMap = replaceMap.map(m => m._1 -> CommonExpressionDef(m._2))
why do we create a map if we never look up from it?
It's convenient for generating commonExprRefsMap.
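Presumably the refs map is then derived from the defs map along these lines (a sketch of the idea, not the exact PR code):

val commonExprRefsMap = commonExprDefsMap.map { case (attr, exprDef) =>
  attr -> new CommonExpressionRef(exprDef)
}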
@@ -65,6 +65,8 @@ class SparkOptimizer(
      RewriteDistinctAggregates),
    Batch("Pushdown Filters from PartitionPruning", fixedPoint,
      PushDownPredicates),
    Batch("Rewrite With expression", fixedPoint,
shall we just rewrite With once at the very end of the optimizer?
Let me try.
We can't push a With filter into the scan, so we rewrite With at the end of operatorOptimizationBatch.
How about the rule PushPredicateThroughJoin?
There is no replaceAlias logic in PushPredicateThroughJoin.
Ah, OK. Shall we implement the better predicate split-and-combine in this PR to make it more useful? Otherwise we can't handle common cases like ...
@cloud-fan It seems that the change in filter order causes the execution plans to mismatch. Is there any good solution?
@@ -107,6 +107,14 @@ trait PredicateHelper extends AliasHelper with Logging {
    }
  }

  /**
   * First split the predicates by And, then combine the predicates with the same references.
I think there is a misunderstanding. What I proposed is that splitConjunctivePredicates should split AND/OR under With: With(c1, c2, c1 + c1 > 0 AND c2 + c2 > 0 AND c1 + c2 > 0) can be split into With(c1, c1 + c1 > 0), With(c2, c2 + c2 > 0), With(c1, c2, c1 + c2 > 0). When combining the predicates later, we also recognize the With and merge them to avoid duplicated expressions.
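A rough sketch of the AND part of that split, reusing the existing With, CommonExpressionDef and CommonExpressionRef classes and PredicateHelper's splitConjunctivePredicates (illustrative only, not the final implementation; the OR case is elided):

def splitWithByAnd(w: With): Seq[Expression] = {
  splitConjunctivePredicates(w.child).map { branch =>
    // keep only the defs this branch actually references, without refreshing their ids
    val referencedIds = branch.collect { case r: CommonExpressionRef => r.id }.toSet
    val usedDefs = w.defs.filter(d => referencedIds.contains(d.id))
    if (usedDefs.isEmpty) branch else With(branch, usedDefs)
  }
}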
Why split OR?
Merging With comes at a certain cost. It seems that we only need to merge once, before rewriting With.
note: when splitting the predicates under With (by AND), we don't refresh the CTE ids, so when merging the With back, it should be pretty easy.
It is unnecessary to continuously split and merge With. We only need to split by And to generate the With, and finally merge the With before rewriting With.
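For illustration, merging two such Withs that kept their original def ids could look roughly like this (a sketch, not the actual MergeWithExpression rule):

def mergeByAnd(left: With, right: With): With = {
  // defs with the same id come from the same original common expression and are kept only once
  val mergedDefs = (left.defs ++ right.defs).groupBy(_.id).values.map(_.head).toSeq
  With(And(left.child, right.child), mergedDefs)
}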
@@ -112,6 +112,22 @@ object With {
    With(replaced(commonExprRefs), commonExprDefs)
  }

  /**
   * Helper function to create a [[With]] statement when push down filter.
Since this is very specific to filter pushdown, shall we put this method in the filter pushdown rule?
OK, wait for CI to finish running.
@@ -1831,8 +1841,8 @@ object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelper
      }

      if (pushDown.nonEmpty) {
        val pushDownPredicate = pushDown.reduce(And)
        val replaced = replaceAlias(pushDownPredicate, aliasMap)
        val replacedByWith = rewriteConditionByWith(pushDown.reduce(And), aliasMap)
rewriteConditionByWith will split the predicate anyway, why do we combine them with And here?
To calculate the number of times each common expression is used across the entire condition.
val pushDownPredicate = pushDown.reduce(And)
val replaced = replaceAlias(pushDownPredicate, aliasMap)
val replacedByWith = rewriteConditionByWith(pushDown.reduce(And), aliasMap)
val replaced = replaceAlias(replacedByWith, aliasMap)
doesn't rewriteConditionByWith already replace the aliases?
No, rewriteConditionByWith only rewrites common attributes to common expression refs.
@@ -96,14 +96,17 @@ object RewriteWithExpression extends Rule[LogicalPlan] {
    val defs = w.defs.map(rewriteWithExprAndInputPlans(_, inputPlans, isNestedWith = true))
    val refToExpr = mutable.HashMap.empty[CommonExpressionId, Expression]
    val childProjections = Array.fill(inputPlans.length)(mutable.ArrayBuffer.empty[Alias])
    val refsCount = child.collect { case r: CommonExpressionRef => r }
we should only collect references to the current With, not the nested With.
Suggested change:
val refsCount = child.collect { case r: CommonExpressionRef => r }
val refsCount = child.collect { case r: CommonExpressionRef if defs.exists(_.id == r.id) => r }
val newDefs = left.defs.toBuffer
val replaceMap = mutable.HashMap.empty[CommonExpressionId, CommonExpressionRef]
right.defs.foreach { rDef =>
  val index = left.defs.indexWhere(lDef => rDef.child.fastEquals(lDef.child))
plan comparison can be expensive. Since it's a very special case for filter pushdown, shall we use the same CommonExpressionDef ids for the different Withs in the split predicates?
CI failure is not relevant.
if (!SQLConf.get.getConf(SQLConf.ALWAYS_INLINE_COMMON_EXPR)) {
  val canRewriteConf = cond.filter(canRewriteByWith)
  if (canRewriteConf.nonEmpty) {
    val replaceWithMap = canRewriteConf.reduce(And)
we can do canRewriteConf.flatMap(_.collect...)
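i.e. something along these lines (a sketch of the suggested simplification; attrCounts and the exact collect predicate are illustrative, assuming the rule's aliasMap is in scope):

val attrCounts = canRewriteConf
  .flatMap(_.collect { case a: Attribute if aliasMap.contains(a) => a })
  .groupBy(identity)
  .transform((_, v) => v.size)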
// With does not support inline subquery
private def canRewriteByWith(expr: Expression): Boolean = {
  !expr.containsPattern(PLAN_EXPRESSION)
is this check too strong? We only require the common expression to not contain subqueries.
It seems not feasible, https://github.com/zml1206/spark/actions/runs/12491712984/job/34858074548
Both rewriting subqueries and pushing down predicates are in the batch "operator optimization before inferring filters". Pushing down predicates may cause a SubqueryExpression to contain common refs, and then rewriting subqueries cannot replace those common refs.
.groupBy(identity)
.transform((_, v) => v.size)
.filter(m => aliasMap.contains(m._1) && m._2 > 1)
.map(m => m._1 -> trimAliases(aliasMap.getOrElse(m._1, m._1)))
So we've done the work of def replaceAlias here, why do we need to call replaceAlias?
Here we just get the map of attributes to common expressions in order to generate the With, without rewriting the condition.
val defs = mutable.HashSet.empty[CommonExpressionDef]
val replaced = expr.transform {
  case a: Attribute if refsMap.contains(a) =>
    defs.add(defsMap.get(a).get)
Suggested change:
defs.add(defsMap.get(a).get)
defs.add(defsMap(a))
val refsCount = child.collect {
  case r: CommonExpressionRef
    if defs.exists {
      case d: CommonExpressionDef => d.id == r.id
nit: we can create an id set to avoid repeated linear search.
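For example (a small sketch of that nit applied to the snippet above):

val defIds = defs.collect { case d: CommonExpressionDef => d.id }.toSet
val refsCount = child.collect { case r: CommonExpressionRef if defIds.contains(r.id) => r }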
Batch("Replace CTE with Repartition", Once, ReplaceCTERefWithRepartition))) | ||
Batch("Replace CTE with Repartition", Once, ReplaceCTERefWithRepartition), | ||
Batch("Merge With expression", fixedPoint, MergeWithExpression), | ||
Batch("Rewrite With expression", fixedPoint, |
how many places do we need to put this batch?
Before "Extract Python UDFs", before inferring filters, and at the end of each optimizer.
Input [4]: [w_warehouse_name#12, i_item_id#7, inv_before#19, inv_after#20]
Condition : (CASE WHEN (inv_before#19 > 0) THEN ((cast(inv_after#20 as double) / cast(inv_before#19 as double)) >= 0.666667) END AND CASE WHEN (inv_before#19 > 0) THEN ((cast(inv_after#20 as double) / cast(inv_before#19 as double)) <= 1.5) END)
Ah, so this is a positive plan change: we avoid computing the CASE WHEN multiple times in the Filter.
What changes were proposed in this pull request?
Pushing predicates through Project/Aggregate introduces extra expression duplication; this causes the execution plan to become larger and may also cause a performance regression if the common expression is not cheap. This PR uses the With expression to rewrite conditions that contain attributes which are not cheap and are consumed multiple times; each predicate generates one or zero With. Before rewriting With, extra expression duplication is reduced again through the new rule MergeWithExpression. In addition, the rewrite With rule is optimized so that common expressions which are only used once are not pre-evaluated.
Why are the changes needed?
Does this PR introduce any user-facing change?
No.
How was this patch tested?
UT.
Was this patch authored or co-authored using generative AI tooling?
No.