[SPARK-50189][SQL] Upgrade ICU4J to 76.1 #48721

Open
wants to merge 5 commits into
base: master
from


panbingkun
Contributor

@panbingkun panbingkun commented Oct 31, 2024

What changes were proposed in this pull request?

This PR upgrades ICU4J from 75.1 to 76.1.

Why are the changes needed?

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass GA.

Was this patch authored or co-authored using generative AI tooling?

No.

@panbingkun
Contributor Author

panbingkun commented Oct 31, 2024

Let me update the benchmark results of CollationBenchmark and CollationNonASCIIBenchmark.

Member

@MaxGekk MaxGekk left a comment

@stefankandic @uros-db Could you take a look at the PR, please?

- test("invalid collationId") {
+ ignore("invalid collationId") {
Contributor

Why are we making this change in this PR?

Member

+1 for the above question.

- test("CollationKey generates correct collation key for collated string") {
+ ignore("CollationKey generates correct collation key for collated string") {
Contributor

Why are we making this change in this PR?

Contributor Author

This is not the final 'answer' for this PR; I am still investigating. 😄

Contributor

I wouldn't expect to see any changes here. Do we know why these hashes have changed?

Contributor Author

The hashes changed because the CollationKey returned by Collator.getCollationKey(...) is different between the two versions. I am still investigating why it differs.

Contributor Author

CollationKeys#writeSortKeyUpToQuaternary

Contributor Author

@panbingkun panbingkun Nov 1, 2024

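  • Let's use the following code to reproduce it (collation: UNICODE):
import com.ibm.icu.text.Collator
import com.ibm.icu.util.ULocale
import org.apache.spark.unsafe.types.UTF8String

object CollationKeySuite {

  def main(args: Array[String]): Unit = {
    // Build the root locale with tertiary strength ("ks" = "level3").
    val builder = new ULocale.Builder
    builder.setLocale(ULocale.ROOT)
    builder.setUnicodeLocaleKeyword("ks", "level3")
    val resultLocale = builder.build
    val collator = Collator.getInstance(resultLocale)
    collator.freeze
    // Hash the collation key of "aa"; this is the value that differs between ICU4J versions.
    val s = UTF8String.fromString("aa")
    val hash = collator.getCollationKey(s.toValidString).hashCode()
    println(hash)
  }
}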
  • ICU4J 76.1, result:
10628395
  • ICU4J 75.1, result:
10381418
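
A hash only shows that the key changed, not where. To locate the differing byte, the raw sort-key bytes can be printed and diffed between the two ICU4J versions. The following is a minimal sketch, not part of the PR: the object name `CollationKeyBytes` and the helper `hex` are hypothetical, and a plain `String` replaces `UTF8String` so that only ICU4J is needed on the classpath.

```scala
import com.ibm.icu.text.Collator
import com.ibm.icu.util.ULocale

object CollationKeyBytes {
  // Hypothetical helper: render bytes as space-separated upper-case hex.
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02X").mkString(" ")

  def main(args: Array[String]): Unit = {
    // Same locale settings as the reproduction: root locale, tertiary strength.
    val locale = new ULocale.Builder()
      .setLocale(ULocale.ROOT)
      .setUnicodeLocaleKeyword("ks", "level3")
      .build()
    val collator = Collator.getInstance(locale)
    val key = collator.getCollationKey("aa")
    // Print the sort-key bytes and the hash; run once per ICU4J version and diff.
    println(hex(key.toByteArray))
    println(key.hashCode())
  }
}
```

Running this against each ICU4J version and comparing the two hex lines should show which position in the key changed.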

@@ -168,6 +168,7 @@ class CollationExpressionSuite extends SparkFunSuite with ExpressionEvalHelper {
}

test("CollationKey generates correct collation key for collated string") {
val b: Byte = 0x2B
Contributor Author

@panbingkun panbingkun Nov 1, 2024

This has the same cause: the CollationKey (m_key_) is different.

  • In version 75.1, its value is 0x2A (42),
  • while in version 76.1, its value is 0x2B (43)

Contributor Author

This is also related to CollationKey.

Murmur3HashTestCase("SQL ", "UNICODE_RTRIM", -1923567940),
Murmur3HashTestCase("SQL", "UNICODE_CI", 1029527950),
Murmur3HashTestCase("SQL ", "UNICODE_CI_RTRIM", 1029527950)
Murmur3HashTestCase("SQL", "UNICODE", 1483684981),
Contributor Author

This is also related to CollationKey.
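
Because the expected hash values depend on the exact key bytes, they can shift when the library changes its key encoding. What should stay stable is the equality relation between keys. The sketch below checks that relation directly with ICU4J. It is not code from the PR: `CollationKeyInvariants`, `collatorFor`, and `sameKey` are hypothetical names, and mapping the case-insensitive collation to `"ks" = "level2"` is an assumption about how Spark configures it.

```scala
import com.ibm.icu.text.Collator
import com.ibm.icu.util.ULocale

object CollationKeyInvariants {
  // Hypothetical helper: root-locale collator at the given "ks" strength keyword.
  def collatorFor(strength: String): Collator =
    Collator.getInstance(
      new ULocale.Builder()
        .setLocale(ULocale.ROOT)
        .setUnicodeLocaleKeyword("ks", strength)
        .build())

  // True when both strings produce equal collation keys under the collator.
  def sameKey(c: Collator, a: String, b: String): Boolean =
    c.getCollationKey(a) == c.getCollationKey(b)

  def main(args: Array[String]): Unit = {
    val tertiary = collatorFor("level3")  // case differences are significant
    val secondary = collatorFor("level2") // assumed case-insensitive setting
    assert(!sameKey(tertiary, "SQL", "sql"))
    assert(sameKey(secondary, "SQL", "sql"))
  }
}
```

Checks of this shape do not replace the fixed-value hash tests, but they help separate a change in key encoding from a change in collation behavior.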

@panbingkun panbingkun marked this pull request as ready for review November 1, 2024 11:36
@panbingkun
Contributor Author

@uros-db @dongjoon-hyun @stefankandic @MaxGekk
The detailed explanation has been updated, and this PR is ready for review.
Thank you very much for reviewing, if you have free time. ❤️


4 participants