Interest in a ruby implementation? #7

Open
dgollahon opened this issue May 5, 2024 · 3 comments

Comments

@dgollahon

Hi,

I am interested in using rapidfuzz-rs from Ruby through magnus. I have no problem doing this for just myself (it's very straightforward), but I was wondering if it would make sense to open-source a project for others. I am happy to release it under my own GitHub account or "donate" it to this organization if that is desirable/helpful. I don't want to squat on the rapidfuzz gem name if this group or someone else would like to own it.

Thanks!
Daniel

@maxbachmann
Member

I think placing it in the rapidfuzz organisation would make sense, so that people can find it more easily. For the gem, it would probably make sense to use a trusted publishing system via GitHub Actions, similar to what is done for the Python version of the library.

There are a couple of things that I did differently in the Python version compared to the C++/Rust version to make it more useful for Python users:

  • there is a pure Python fallback implementation for platforms on which the faster C++/Rust based solution can't be compiled (e.g. because no compiler is present)
  • the preferred implementation is the compiled one
  • performing individual comparisons from Python is relatively slow. To speed this up, I provide the rapidfuzz.process module, which lets the user perform comparisons over complete datasets, e.g. process.extractOne to find the best match in a 1 x many comparison. This is generally faster, since it avoids interpreter overhead and, in Python, the global interpreter lock.
  • the cached scorer structs are not available from Python. Their speedup is simply too small in comparison to the function call overhead. Instead they are used under the hood by the process functions. This is done by tagging each scorer with an attribute giving access to these lowered functions.
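
The fallback and 1 x many ideas in the list above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the rapidfuzz code: `_fuzz_native` is a hypothetical name for a compiled extension module, `extract_one` is a hypothetical helper that only mimics the shape of process.extractOne, and it ranks by distance (lower is better) to keep the example short.

```python
from typing import Callable, Optional, Sequence, Tuple

try:
    # Hypothetical compiled module; preferred whenever it can be imported.
    from _fuzz_native import levenshtein  # type: ignore
except ImportError:
    # Pure Python fallback for platforms without a compiled build.
    def levenshtein(s1: str, s2: str) -> int:
        """Classic Levenshtein distance using two rows of the DP table."""
        previous = list(range(len(s2) + 1))
        for i, c1 in enumerate(s1, start=1):
            current = [i]
            for j, c2 in enumerate(s2, start=1):
                cost = 0 if c1 == c2 else 1
                current.append(min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                ))
            previous = current
        return previous[-1]


def extract_one(
    query: str,
    choices: Sequence[str],
    scorer: Callable[[str, str], int] = levenshtein,
) -> Optional[Tuple[str, int, int]]:
    """Return (choice, distance, index) for the closest choice, or None.

    Illustrative only: a compiled implementation would run this loop in
    native code, which is where the speedup over per-call use comes from.
    """
    best: Optional[Tuple[str, int, int]] = None
    for index, choice in enumerate(choices):
        distance = scorer(query, choice)
        if best is None or distance < best[1]:
            best = (choice, distance, index)
    return best
```

With this sketch, `extract_one("aple", ["banana", "apple", "grape"])` selects `"apple"`. The loop here still runs in the interpreter, so it shows the interface only; the benefit described above comes from moving the whole loop into compiled code.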

I have never used Ruby myself, so I can't help with any Ruby-specific questions, but I would be more than happy to help with any questions regarding the library.

@dgollahon
Author

dgollahon commented May 7, 2024

Ok, that makes sense.

I think a native Ruby fallback is probably something I don't have time to implement, but I think a relatively "dumb" port using the magnus tooling I mentioned above would not be a heavy lift. I'm not sure exactly when I'll get to this, but I plan to put up a draft repo at some point, possibly reserve the relevant gem name, and then figure out the publishing lifecycle later on.

I think the overhead for functions bound via magnus (indirectly the C APIs) should be reasonable for most use cases. Using the osa_distance function, I found some minor test workloads to be 5-150 times as fast as a similar C-based gem in the ecosystem.
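
For readers unfamiliar with the metric behind osa_distance, here is a minimal pure Python reference sketch of the optimal string alignment (OSA) distance. It uses the textbook dynamic-programming formulation and is an assumption about what the function computes, not the rapidfuzz-rs implementation, which is far more optimized.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Optimal string alignment distance (textbook DP, O(len(s1) * len(s2))).

    Counts insertions, deletions, substitutions, and transpositions of two
    adjacent characters, where no substring is edited more than once.
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of s1[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of s2[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

A swap of neighbouring characters such as `"ab"` to `"ba"` costs 1 under this definition. The pair `"CA"` and `"ABC"` costs 3, which is where OSA differs from the unrestricted Damerau-Levenshtein distance.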

@maxbachmann
Member

Yes, I started out without all of these things in the Python version as well and added them as I had time and need for them.

Wrapping the API using something like magnus is probably not too much work, since most of the functions share a similar interface.


2 participants