[SPARK-50189][SQL] Upgrade ICU4J to 76.1
#48721
Let me update the benchmark result of
@stefankandic @uros-db Could you take a look at the PR, please.
- test("invalid collationId") {
+ ignore("invalid collationId") {
why are we making this change in this PR?
+1 for the above question.
- test("CollationKey generates correct collation key for collated string") {
+ ignore("CollationKey generates correct collation key for collated string") {
why are we making this change in this PR?
It is not the final 'answer' to this PR, I am still investigating. 😄
I wouldn't expect to see any changes here. Do we know why these hashes have changed?
The reason for this change is that the CollationKey values returned by Collator.getCollationKey(...) are different. As for why they are different, I am still investigating.
CollationKeys#writeSortKeyUpToQuaternary
- Let's use the following code to reproduce it (collation: UNICODE):
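import com.ibm.icu.text.Collator
import com.ibm.icu.util.ULocale
import org.apache.spark.unsafe.types.UTF8String

object CollationKeySuite {
  def main(args: Array[String]): Unit = {
    // Build the root locale with tertiary strength ("ks" = "level3").
    val builder = new ULocale.Builder
    builder.setLocale(ULocale.ROOT)
    builder.setUnicodeLocaleKeyword("ks", "level3")
    val resultLocale = builder.build
    val collator = Collator.getInstance(resultLocale)
    // Freeze the collator so it can no longer be modified.
    collator.freeze
    val s = UTF8String.fromString("aa")
    // Hash of the ICU collation key for "aa"; this is the value that differs between versions.
    val hash = collator.getCollationKey(s.toValidString).hashCode()
    println(hash)
  }
}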
- ICU4J 76.1, result: 10628395
- ICU4J 75.1, result: 10381418
- Through debugging, it was found that different versions of icu4j load different versions of the underlying data resource files, such as nfc.nrm:
  A. ICU4J 76.1 -> 16.0.0.0
  B. ICU4J 75.1 -> 15.1.0.0
- I guess it is related to the PRs below (Unicode 15.1 -> Unicode 16):
ICU-22707 Unicode 16 alpha unicode-org/icu#2930
ICU-22707 Unicode 16 beta jun04 unicode-org/icu#3028
ICU-22707 Unicode 16 aug16 unicode-org/icu#3110
ICU-22707 adjust UTS46 for Unicode 16 unicode-org/icu#3130
ICU-22769 Rename of the ICU4J data folder to not contain a version unicode-org/icu#3000
- Reference docs:
https://unicode-org.github.io/icu/download/76.html#release-overview
https://unicode-org.github.io/icu/download/76.html#common-changes
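To compare the two versions side by side, it may help to print the raw key bytes and the version metadata next to the hash. The following is only a sketch, assuming ICU4J is on the classpath; the object name `CollationKeyDump` and the helpers `unicodeCollator` and `hex` are hypothetical names introduced for illustration. Running it once against 75.1 and once against 76.1 should show which bytes of the key differ.

```scala
import com.ibm.icu.lang.UCharacter
import com.ibm.icu.text.Collator
import com.ibm.icu.util.{ULocale, VersionInfo}

// Hypothetical helper object, not part of Spark or of this PR.
object CollationKeyDump {
  // Builds the same root-locale collator with "ks" = "level3" as the reproduction above.
  def unicodeCollator(): Collator = {
    val locale = new ULocale.Builder()
      .setLocale(ULocale.ROOT)
      .setUnicodeLocaleKeyword("ks", "level3")
      .build()
    Collator.getInstance(locale).freeze()
  }

  // Formats a byte array as space-separated, two-digit uppercase hex values.
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02X").mkString(" ")

  def main(args: Array[String]): Unit = {
    val collator = unicodeCollator()
    // Version metadata reported by the ICU4J jar that is actually loaded.
    println(s"ICU4J version:   ${VersionInfo.ICU_VERSION}")
    println(s"Unicode version: ${UCharacter.getUnicodeVersion}")
    println(s"UCA version:     ${collator.getUCAVersion}")
    // Raw sort key bytes and their hash for the string used in the reproduction.
    val key = collator.getCollationKey("aa")
    println(s"key bytes: ${hex(key.toByteArray)}")
    println(s"key hash:  ${key.hashCode()}")
  }
}
```

No expected output is listed here on purpose, since it depends on the ICU4J version in use.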
@@ -168,6 +168,7 @@ class CollationExpressionSuite extends SparkFunSuite with ExpressionEvalHelper {
   }

   test("CollationKey generates correct collation key for collated string") {
+    val b: Byte = 0x2B
This has the same cause: the CollationKey (m_key_) is different.
- In version 75.1, its value is 0x2A (42),
- while in version 76.1, its value is 0x2B (43).
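As a rough illustration of why a one-byte difference in the key is enough to change every expected hash in the tests, here is a minimal sketch using only the standard library. The byte arrays are made-up stand-ins that differ only in the 0x2A / 0x2B byte, not real ICU4J output, and `scala.util.hashing.MurmurHash3` with an arbitrary seed is used as a stand-in for Spark's own Murmur3 implementation, so the numbers it produces will not match the test expectations.

```scala
import java.util.Arrays
import scala.util.hashing.MurmurHash3

// Hypothetical sketch object; the key bytes are illustrative, not real ICU4J sort keys.
object KeyByteHashSketch {
  val keyOld: Array[Byte] = Array[Byte](0x2A, 0x01, 0x05, 0x00) // stand-in for the 75.1 key
  val keyNew: Array[Byte] = Array[Byte](0x2B, 0x01, 0x05, 0x00) // stand-in for the 76.1 key

  // Arbitrary seed chosen for this sketch.
  val seed: Int = 42

  def main(args: Array[String]): Unit = {
    // A plain array hash already differs when a single byte changes.
    println(Arrays.hashCode(keyOld) == Arrays.hashCode(keyNew))
    // The same holds for a Murmur3-style hash over the key bytes.
    println(MurmurHash3.bytesHash(keyOld, seed) == MurmurHash3.bytesHash(keyNew, seed))
  }
}
```

Both comparisons print `false`, which is consistent with the hard-coded expected hashes needing an update whenever the underlying collation key bytes change.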
This is also related to CollationKey.
Murmur3HashTestCase("SQL ", "UNICODE_RTRIM", -1923567940),
Murmur3HashTestCase("SQL", "UNICODE_CI", 1029527950),
Murmur3HashTestCase("SQL ", "UNICODE_CI_RTRIM", 1029527950)
Murmur3HashTestCase("SQL", "UNICODE", 1483684981),
This is also related to CollationKey.
@uros-db @dongjoon-hyun @stefankandic @MaxGekk
What changes were proposed in this pull request?
This PR aims to upgrade ICU4J from 75.1 to 76.1.
Why are the changes needed?
The full release notes:
https://github.com/unicode-org/icu/releases/tag/release-76-1
https://unicode-org.github.io/icu/download/76.html
We need to keep the version up-to-date.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Pass GA.
Was this patch authored or co-authored using generative AI tooling?
No.