This repository has been archived by the owner on Sep 18, 2023. It is now read-only.
Describe the bug
When running the following SQL, the number of result rows increases: performing a FULL JOIN of two identical tables, each containing 10,000 rows, yields 13,200 rows instead of 10,000. In addition, with an INNER JOIN the row count decreases, and with a LEFT JOIN a coredump occurs. We believe the condition for reproducing this error is using the MAX aggregate function on a string field before the SortMergeJoin, because the error does not reproduce when aggregating non-string fields or when using SUM and other aggregate functions on strings.
To Reproduce
Here's the sql:

```sql
SELECT t1.value AS value1, t2.value AS value2, t1.data1a AS data1a
FROM (SELECT value, MAX(data1) AS data1a FROM gy_orc.test_smj3 t1 GROUP BY value) t1
FULL JOIN (SELECT value, MAX(data1) AS data1b FROM gy_orc.test_smj4 t2 GROUP BY value) t2
ON t1.value = t2.value
```
Notes: The data type of `data1` is string.
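To make the expected behavior concrete, here is a minimal sketch in plain Python (not Gazelle code) of the FULL JOIN semantics the query above relies on: since `GROUP BY value` makes `value` unique on each side, a full outer join of two identical 10,000-row tables must return exactly 10,000 rows, one per distinct key.

```python
# Plain-Python model of FULL OUTER JOIN on a unique key.
# Illustrative only; table contents are made up.
def full_outer_join(left, right):
    """Full outer join of two {key: row} dicts; None where a side has no match."""
    keys = set(left) | set(right)
    return [(k, left.get(k), right.get(k)) for k in keys]

# Two identical tables of 10,000 unique keys, as produced by GROUP BY value.
t1 = {f"k{i}": f"max_data1_{i}" for i in range(10_000)}
t2 = dict(t1)

result = full_outer_join(t1, t2)
assert len(result) == 10_000  # one output row per distinct key, never 13,200
```

Any deviation from this count (13,200 here) means the join operator itself is misbehaving rather than the query being ambiguous.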
@zhouyuan We have found the cause of this problem: we configured spark.oap.sql.columnar.hashagg.supportString=true. With this setting, aggregation on String-type fields is converted to the ColumnarHashAggregate operator, which causes the Sort subtree to be removed and therefore produces incorrect SortMergeJoin results. How can we correctly insert a Sort operator before the SortMergeJoin?
Thanks for the detailed and clear log. This is a bug in the "hashagg for string" support - the implementation does not fit the "sortagg + SMJ" case.
A quick fix is to disallow hashagg in the "sortagg + SMJ" case; however, in that case Gazelle would need to fall back to Vanilla Spark, as Gazelle does not have a "SortAgg" implementation. I'll generate a quick patch for you.
Thanks, -yuan