Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Java performance improvements from Apache Lucene #195

Merged
merged 6 commits into from
Mar 27, 2024

Conversation

rmuir
Copy link
Contributor

@rmuir rmuir commented Mar 25, 2024

This is long overdue...

  • Remove StringBuilder and use simple char[] to reduce allocations and overhead.
  • Allow use of stemmers without unnecessary memory allocations via additional methods
  • Use MethodHandle.invokeExact() instead of reflection.

The changes mean that users will need java 7 at a minimum, but since java 7 version is long EOL, I don't think it will cause anyone grief.

Users can still use setCurrent(String) ... stem() ... getCurrent(), but this adds support for a higher-performance approach: setCurrent(char[], int), getCurrentBuffer(), stem() ... getCurrentBuffer()/getCurrentBufferLength()

This avoids many per-word object allocations:

  • Caller no longer forced to create a new String for every input word.
  • The backing byte[]/char[] of that String they were forced to create.
  • The StringBuilder previously created by snowball for every word.
  • The backing char[] for that StringBuilder.
  • Resulting String when they retrieve the result.

When indexing documents, all these per-word allocations put too much pressure on garbage collection and bottleneck performance.

compiler/generator_java.c Outdated Show resolved Hide resolved
compiler/generator_java.c Show resolved Hide resolved
java/org/tartarus/snowball/SnowballProgram.java Outdated Show resolved Hide resolved
java/org/tartarus/snowball/SnowballProgram.java Outdated Show resolved Hide resolved
@ojwb
Copy link
Member

ojwb commented Mar 26, 2024

Thanks, mostly this looks good but I've spotted some issues and noted them above (mostly minor - the incorrect implementation of len and size really needs addressing though).

It would also be helpful to update java/org/tartarus/snowball/TestApp.java to demonstrate the new best practice instead of creating a new String for each input word - could you do that? (I could try but I don't write much Java so the result would probably be suboptimal.)

I tried to keep it simple here, but doing a rough measurement of `time make check_java` x 3 gives the idea:

```
before:

real  0m24.713s
user  0m39.545s
sys 0m1.962s

real  0m24.992s
user  0m39.570s
sys 0m2.082s

real  0m24.166s
user  0m38.705s
sys 0m1.842s

after:

real  0m20.365s
user  0m31.869s
sys 0m1.781s

real  0m20.593s
user  0m32.733s
sys 0m1.858s

real  0m20.576s
user  0m32.992s
sys 0m1.763s
```

This mini-benchmark of the tests isn't indicative of typical performance, it is bottlenecked by JVM startup cost for each language.
@rmuir
Copy link
Contributor Author

rmuir commented Mar 27, 2024

I updated the TestApp in edaa2fa with some very timing data...

I will followup for your other comments one-by-one, thank you for looking into this.

@ojwb
Copy link
Member

ojwb commented Mar 27, 2024

Thanks for the updates - I'm happy to merge now (and I can then tweak so methodObject only gets generated when required) unless you've anything else you're about to add?

@rmuir
Copy link
Contributor Author

rmuir commented Mar 27, 2024

Yes if you could help with the conditional emit of methodObject that would be very helpful, I think it might be beyond my abilities at the moment. I can followup with separate PRs for any other things, but this one is the big one! Thank you again for your help.

@ojwb ojwb merged commit 87827d4 into snowballstem:master Mar 27, 2024
17 checks passed
ojwb added a commit that referenced this pull request Mar 27, 2024
@ojwb
Copy link
Member

ojwb commented Mar 27, 2024

Yes if you could help with the conditional emit of methodObject that would be very helpful, I think it might be beyond my abilities at the moment.

Done in 34f3612.

I can followup with separate PRs for any other things, but this one is the big one! Thank you again for your help.

It'd be good to aim to have Lucene able to use an unmodified version of Snowball - not just because then all Java users can benefit from improvements, but also it's caused confusion for multiple users over the years who've generated Java code for a new stemmer with upstream Snowball then tried to add it to their Lucene source tree.

@ojwb
Copy link
Member

ojwb commented Mar 27, 2024

I will also document the new Java 7 requirement (I agree it's not problematic but we should state version requirements up front).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants