About Chinese formula training #40

Open
ignore1999 opened this issue Nov 7, 2024 · 1 comment

Comments

@ignore1999

When setting up training for Chinese formulas, does the method differ in any way? And are any extra tricks needed?

@wangbinDL
Member

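  1. Build a large-scale, high-quality dataset of Chinese formulas;
  2. mBART's support for Chinese is only mediocre. If you have little Chinese formula data, consider switching to a stronger decoder, such as Qwen2.5 or InternVL2.5;
  3. The tokenizer currently in use is nougat's. It is unclear whether this tokenizer handles Chinese well, so consider replacing it with a different tokenizer;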
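
For point 3, one rough way to judge a tokenizer before replacing it is to measure how many of the Chinese characters in your formula labels its vocabulary can represent at all. The sketch below is only an illustration: `is_cjk`, `cjk_coverage`, and the toy vocabulary and sample are hypothetical and not part of this project. It also assumes the vocabulary stores tokens as readable text, so it will not be meaningful for byte-level vocabularies. With a Hugging Face tokenizer, the vocabulary could be taken from the keys of `tokenizer.get_vocab()`.

```python
from typing import Iterable


def is_cjk(ch: str) -> bool:
    """True for characters in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def cjk_coverage(vocab: Iterable[str], samples: Iterable[str]) -> float:
    """Fraction of distinct CJK characters in `samples` that occur in some vocab token.

    Assumption: vocab tokens are plain readable strings (not byte-level encodings).
    """
    vocab_chars = {ch for token in vocab for ch in token if is_cjk(ch)}
    corpus_chars = {ch for text in samples for ch in text if is_cjk(ch)}
    if not corpus_chars:
        # No Chinese characters to cover, so nothing is missing.
        return 1.0
    return len(corpus_chars & vocab_chars) / len(corpus_chars)


# Toy data for illustration only (hypothetical vocabulary and formula label).
toy_vocab = {"\\frac", "{", "}", "=", "速", "度"}
toy_samples = ["\\frac{路程}{时间} = 速度"]

coverage = cjk_coverage(toy_vocab, toy_samples)
print(f"CJK character coverage: {coverage:.2f}")
```

A low coverage value on real labels would suggest that many Chinese characters fall back to unknown or heavily fragmented tokens, which supports switching the tokenizer. A high value does not by itself show the tokenizer is good, since it says nothing about how efficiently the text is segmented.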
