Integrate ChemTEB #1585
Comments
Since it is a fork (https://github.com/basf/chemteb), it should be relatively easy to integrate; cc @HSILA in case you are interested in opening a PR :)
Thank you for recognizing our work on ChemTEB, and apologies for the delayed response. I can complete the tasks' metadata and open a pull request. ChemTEB currently has over 35 tasks; is it okay to integrate all of them? The performance in bitext mining tasks is around zero. I think I should exclude them so they don't affect the models' average scores. What do you think? Also, a quick question: in PairClassification tasks, we can have a task with multiple subsets (for example, in
Thanks for getting back! That would be amazing! I think all of them are fine as long as the near-zero Bitext Mining performance is due to the models being bad and not to the task being unsolvable or random. (cc @KennethEnevoldsen in case of thoughts)
Sounds possible to me but not sure about the details 🤔
Thank you for your encouraging words. Regarding the Bitext Mining tasks (and some PairClassification tasks), the performance around zero is likely because they involve matching chemical compound names, descriptions, or formulas with their corresponding SMILES codes. These are highly domain-specific challenges that general-purpose embedding models don’t seem to be trained to handle. While they are not entirely random, they appear unsolvable by generic models.
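For readers unfamiliar with this kind of matching task, here is a minimal sketch of how it can be scored: each name embedding is paired with its nearest SMILES embedding by cosine similarity, and the score is the fraction of correct pairs. This is a simplified stand-in, not the ChemTEB or mteb implementation, and the benchmark's actual metric may differ. The function names and the toy vectors are hypothetical and do not come from any real model or dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def matching_accuracy(source_vecs, target_vecs):
    """Fraction of source items whose nearest target shares their index.

    Hypothetical helper: source_vecs[i] and target_vecs[i] are assumed to be
    embeddings of the same compound (e.g. its name and its SMILES string).
    """
    hits = 0
    for i, src in enumerate(source_vecs):
        nearest = max(range(len(target_vecs)),
                      key=lambda j: cosine(src, target_vecs[j]))
        hits += nearest == i
    return hits / len(source_vecs)


# Toy embeddings (made up for illustration only).
name_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A model that places each SMILES string near its compound's name.
aligned_smiles = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.8]]

# A model whose SMILES embeddings bear no relation to the names:
# the same vectors, assigned to the wrong compounds.
unaligned_smiles = [aligned_smiles[1], aligned_smiles[2], aligned_smiles[0]]

aligned_score = matching_accuracy(name_vecs, aligned_smiles)
unaligned_score = matching_accuracy(name_vecs, unaligned_smiles)
```

With a large candidate pool, a model that has not learned the name-to-SMILES correspondence finds the right match only rarely, so a score near zero is consistent with a well-defined task that generic models simply cannot solve.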
I see; I think these are fine to have then! Probably of high interest for people training chemistry-specific embedding models!
https://arxiv.org/abs/2412.00532v1