Add a FastText encoder #1047
Labels: enhancement (New feature or request)
The problem with fasttext is that you basically need to depend on fasttext, AFAIK, and it only provides this model.
I was more considering the SentenceTransformer way, which would provide much more options.
I'm open to discussion, of course :)
Isn't FastText archived at this point? This is why I dropped it in embetter.
@koaning as long as there is no numpy 3, we should be fine 😉 More seriously, if we are fine with an optional torch dependency and its CI, I'm all for it.
> More seriously, if we are fine with an optional torch dependency and its CI, I'm all for it.
Long term, I think it is important to implement the patterns in https://arxiv.org/abs/2312.09634, where "diverse entries" get encoded differently than "dirty categories".
Problem Description

When encoding long text on small datasets, https://arxiv.org/abs/2312.09634 has shown that embeddings improve prediction performance over string models like `MinHashEncoder`. More recently, CARTE performed well using FastText to initialize column-name and category embeddings.

Feature Description

Create an encoder that downloads FastText weights, loads them during `fit`, and applies them during `transform`. Note that FastText's only dependencies are `["pybind11>=2.2", "numpy"]`.
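To make the proposed `fit`/`transform` shape concrete, here is a minimal sketch in pure numpy. It does not load real FastText weights; instead it hashes character n-grams into a random bucket table, which mirrors the subword-summing idea FastText is built on. All names here (`NGramHashEncoder`, `n_buckets`, `dim`) are hypothetical, not an existing skrub API.

```python
import numpy as np

class NGramHashEncoder:
    """Toy stand-in for the proposed FastText encoder (hypothetical API)."""

    def __init__(self, dim=32, n_buckets=2048, ngram=3, seed=0):
        self.dim = dim
        self.n_buckets = n_buckets
        self.ngram = ngram
        self.seed = seed

    def fit(self, X, y=None):
        # A real encoder would download and load FastText weights here;
        # we substitute a random bucket table for illustration.
        rng = np.random.default_rng(self.seed)
        self.table_ = rng.standard_normal((self.n_buckets, self.dim))
        return self

    def transform(self, X):
        out = np.zeros((len(X), self.dim))
        for i, text in enumerate(X):
            s = f"<{text}>"  # FastText wraps tokens in boundary markers
            grams = [s[j:j + self.ngram] for j in range(len(s) - self.ngram + 1)]
            for g in grams:
                # Hash each character n-gram into a bucket and sum the vectors.
                out[i] += self.table_[hash(g) % self.n_buckets]
            if grams:
                out[i] /= len(grams)
        return out

enc = NGramHashEncoder().fit(["paris", "london"])
emb = enc.transform(["paris", "parisian"])
```

Because the heavy lifting is table lookups and sums, this also illustrates why the real encoder would need nothing beyond numpy (plus pybind11 for the compiled FastText bindings).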
Alternative Solutions

Instead, we could create a transformer using `SentenceTransformer`, which would download weights from HuggingFace. The issue is that although these models provide more powerful embeddings than FastText, this solution would require installing `torch`, `transformers`, and finally `sentence-transformers`. Also, running these models is markedly slower than using FastText.

Additional Context
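The optional-dependency concern raised in the comments (an optional torch dependency and its CI) could be handled with a lazy import guard: the package imports fine without `sentence-transformers`, and a clear error is raised only when the encoder is actually used. A sketch, with a hypothetical class name and an example model name, not a committed design:

```python
class SentenceEncoder:
    """Hypothetical wrapper; sentence-transformers is an optional extra."""

    def __init__(self, model_name="all-MiniLM-L6-v2"):
        # "all-MiniLM-L6-v2" is just an example checkpoint name.
        self.model_name = model_name

    def fit(self, X, y=None):
        try:
            # Deferred import: torch is only pulled in here, not at package import.
            from sentence_transformers import SentenceTransformer
        except ImportError as e:
            raise ImportError(
                "SentenceEncoder requires the optional 'sentence-transformers' "
                "package (which installs torch): "
                "pip install sentence-transformers"
            ) from e
        self.model_ = SentenceTransformer(self.model_name)
        return self

    def transform(self, X):
        return self.model_.encode(list(X))
```

CI could then run a small extra job with the optional dependency installed, while the main test matrix stays torch-free.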
No response