About training on Chinese formulas #40

Open

ignore1999 opened this issue Nov 7, 2024 · 1 comment

ignore1999 commented Nov 7, 2024

When setting up training for Chinese formulas, does the method differ in any way? And are any extra tricks needed?

Member

wangbinDL commented Nov 7, 2024

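Build a large-scale, high-quality Chinese formula dataset;
mBART's Chinese support is mediocre; if you only have a small amount of Chinese formula data, consider switching to a stronger decoder, such as Qwen2.5 or InternVL2.5;
The tokenizer currently in use is Nougat's, and it is unclear how well it handles Chinese; consider replacing it with a different tokenizer.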
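
One rough way to check the tokenizer concern above is to measure how the tokenizer treats the Chinese characters that actually occur in your formula data. The sketch below is only an illustration and is not part of this project: `chinese_coverage`, `toy_tokenize`, and the sample strings are hypothetical names and data. It tokenizes each Chinese character in isolation, which is a crude proxy, and reports the share of unknown tokens and the average number of tokens per character.

```python
from typing import Callable, Iterable, List, Tuple


def is_cjk(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def chinese_coverage(
    samples: Iterable[str],
    tokenize: Callable[[str], List[str]],
    unk_token: str = "<unk>",
) -> Tuple[float, float]:
    """Return (unk_rate, tokens_per_char) over the CJK characters in samples.

    Each CJK character is tokenized on its own, so this is only a rough
    proxy for how the tokenizer behaves on full formula strings.
    """
    total_chars = 0
    total_tokens = 0
    unk_tokens = 0
    for text in samples:
        for ch in text:
            if not is_cjk(ch):
                continue
            tokens = tokenize(ch)
            total_chars += 1
            total_tokens += len(tokens)
            unk_tokens += sum(1 for t in tokens if t == unk_token)
    if total_chars == 0 or total_tokens == 0:
        return 0.0, 0.0
    return unk_tokens / total_tokens, total_tokens / total_chars


# Hypothetical stand-in tokenizer: knows only two Chinese characters.
TOY_VOCAB = {"速", "度"}


def toy_tokenize(text: str) -> List[str]:
    return [ch if ch in TOY_VOCAB else "<unk>" for ch in text]


samples = ["速度 v = \\frac{s}{t}", "面积 S = ab"]
unk_rate, tokens_per_char = chinese_coverage(samples, toy_tokenize)
print(f"unk rate: {unk_rate:.2f}, tokens per Chinese char: {tokens_per_char:.2f}")
```

To try this on a real tokenizer, pass its own string-to-tokens function and its unknown-token string in place of `toy_tokenize` and `"<unk>"`. A high unknown rate suggests the vocabulary is missing the characters outright; for byte-level tokenizers that never emit an unknown token, the tokens-per-character figure is the more informative number.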
