Add a FastText encoder #1047
Labels: enhancement (New feature or request)
The problem with fasttext is that you basically need to depend on fasttext, AFAIK, and it only provides this model.
I was more considering the SentenceTransformer way, which would provide much more options.
I'm open to discussion, of course :)
Isn't FastText archived at this point? This is why I dropped it in embetter.
@koaning as long as there is no numpy 3, we should be fine 😉 More seriously, if we are fine with an optional torch dependency and its CI, I'm all for it.
> More seriously, if we are fine with an optional torch dependency and its CI, I'm all for it.
Long term, I think it is important to implement the patterns in https://arxiv.org/abs/2312.09634, where "diverse entries" get encoded differently than "dirty categories".
Problem Description

When encoding long text on small datasets, https://arxiv.org/abs/2312.09634 has shown that embeddings improve prediction performance over string models like `MinHashEncoder`. More recently, CARTE performed well using FastText to initialize column-name and category embeddings.

Feature Description

Create an encoder that downloads FastText weights, loads them during `fit`, and applies them during `transform`. Note that FastText's only dependencies are `["pybind11>=2.2", "numpy"]`.
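To make the proposed `fit`/`transform` shape concrete, here is a minimal sketch in pure numpy. It does not load real FastText weights; instead it hashes character n-grams into a random bucket table, which mirrors the subword-summing idea FastText is built on. All names here (`NGramHashEncoder`, `n_buckets`, `dim`) are hypothetical, not an existing skrub API.

```python
import numpy as np

class NGramHashEncoder:
    """Toy stand-in for the proposed FastText encoder (hypothetical API)."""

    def __init__(self, dim=32, n_buckets=2048, ngram=3, seed=0):
        self.dim = dim
        self.n_buckets = n_buckets
        self.ngram = ngram
        self.seed = seed

    def fit(self, X, y=None):
        # A real encoder would download and load FastText weights here;
        # we substitute a random bucket table for illustration.
        rng = np.random.default_rng(self.seed)
        self.table_ = rng.standard_normal((self.n_buckets, self.dim))
        return self

    def transform(self, X):
        out = np.zeros((len(X), self.dim))
        for i, text in enumerate(X):
            s = f"<{text}>"  # FastText wraps tokens in boundary markers
            grams = [s[j:j + self.ngram] for j in range(len(s) - self.ngram + 1)]
            for g in grams:
                # Hash each character n-gram into a bucket and sum the vectors.
                out[i] += self.table_[hash(g) % self.n_buckets]
            if grams:
                out[i] /= len(grams)
        return out

enc = NGramHashEncoder().fit(["paris", "london"])
emb = enc.transform(["paris", "parisian"])
```

Because the heavy lifting is table lookups and sums, this also illustrates why the real encoder would need nothing beyond numpy (plus pybind11 for the compiled FastText bindings).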
Alternative Solutions

Instead, we could create a transformer using `SentenceTransformer`, which would download weights from HuggingFace. The issue is that although these models provide more powerful embeddings than FastText, this solution would require installing `torch`, `transformers`, and finally `sentence-transformers`. Also, running these models is markedly slower than using FastText.

Additional Context
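The optional-dependency concern raised in the comments (an optional torch dependency and its CI) could be handled with a lazy import guard: the package imports fine without `sentence-transformers`, and a clear error is raised only when the encoder is actually used. A sketch, with a hypothetical class name and an example model name, not a committed design:

```python
class SentenceEncoder:
    """Hypothetical wrapper; sentence-transformers is an optional extra."""

    def __init__(self, model_name="all-MiniLM-L6-v2"):
        # "all-MiniLM-L6-v2" is just an example checkpoint name.
        self.model_name = model_name

    def fit(self, X, y=None):
        try:
            # Deferred import: torch is only pulled in here, not at package import.
            from sentence_transformers import SentenceTransformer
        except ImportError as e:
            raise ImportError(
                "SentenceEncoder requires the optional 'sentence-transformers' "
                "package (which installs torch): "
                "pip install sentence-transformers"
            ) from e
        self.model_ = SentenceTransformer(self.model_name)
        return self

    def transform(self, X):
        return self.model_.encode(list(X))
```

CI could then run a small extra job with the optional dependency installed, while the main test matrix stays torch-free.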
No response