Words get split when a paper title mixes font sizes #942

Closed
gcy0926 opened this issue Nov 13, 2024 · 7 comments
Labels
bug Something isn't working

Comments

gcy0926 commented Nov 13, 2024

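Description of the bug

Words are split apart when the font is inconsistent within a word.

How to reproduce the bug

As the title says: when text in a PDF has inconsistent font sizes, words come out split in the parsed output.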

Operating system

MacOS

Python version

3.10

Software version (magic-pdf --version)

0.9.x

Device mode

cpu

gcy0926 added the bug label Nov 13, 2024
Author

gcy0926 commented Nov 13, 2024

Collaborator

myhloli commented Nov 13, 2024

Uploading RAG评估-A unified evaluation.pdf…

The upload did not go through. Could you upload it again?

Author

gcy0926 commented Nov 13, 2024

Collaborator

myhloli commented Nov 13, 2024

[image]
This issue was fixed in 0.9. You are using the stable-version demo, which lags one release behind.

@myhloli myhloli closed this as completed Nov 13, 2024
Collaborator

myhloli commented Nov 13, 2024

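I retested the v1 of your paper. It now merges as much as it can, but it still cannot merge everything.
[image]
[image]
Because the fonts are not uniform, the text is already cut apart at span extraction. The stitching step does not use very complex rules to merge spans. Currently only single-character merging is implemented, so if multiple characters are split off there is no way to merge them back.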
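
For readers unfamiliar with the limitation described here, the following is a minimal sketch of what a single-character merge over extracted spans could look like. It is not the project's actual implementation: the `Span` record, the `merge_single_char_spans` function, and the `max_gap_ratio` threshold are all hypothetical names and values chosen for illustration, and the spans are assumed to lie on one text line.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical span record: text, horizontal extent, and font size."""
    text: str
    x0: float
    x1: float
    size: float


def merge_single_char_spans(spans, max_gap_ratio=0.15):
    """Join neighbouring spans when one side is a single character and the
    horizontal gap is small relative to the larger of the two font sizes.

    Assumption: all spans belong to the same text line.
    """
    merged = []
    for span in sorted(spans, key=lambda s: s.x0):
        if merged:
            prev = merged[-1]
            gap = span.x0 - prev.x1
            limit = max_gap_ratio * max(prev.size, span.size)
            one_is_single = len(prev.text) == 1 or len(span.text) == 1
            if one_is_single and gap <= limit:
                # Fold the current span into its left neighbour.
                merged[-1] = Span(prev.text + span.text, prev.x0, span.x1,
                                  max(prev.size, span.size))
                continue
        merged.append(span)
    return merged


# A title set in two font sizes: a large initial followed by smaller letters.
title_spans = [
    Span("R", 0.0, 10.0, 14.0),
    Span("ETRIEVAL", 10.5, 60.0, 11.0),   # single leading char: merged
    Span("Au", 70.0, 85.0, 14.0),
    Span("GMENTED", 85.2, 130.0, 11.0),   # two leading chars: stays split
]
merged_texts = [s.text for s in merge_single_char_spans(title_spans)]
```

In this sketch the first word is repaired because only one character was split off, while the second word stays in two pieces, which mirrors the behaviour described in the comment above.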

Author

gcy0926 commented Nov 13, 2024

Which version of the code should I pull to use this?

Collaborator

myhloli commented Nov 13, 2024

The dev branch.
