Parsing fails with a 500 error #903

Open
zhongxin129 opened this issue Nov 8, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@zhongxin129

Description of the bug

An error is raised while parsing a PDF:
app-1 | 2024-11-06 10:42:24.790 | INFO | magic_pdf.model.pdf_extract_kit:call:490 - table time: 0.0
app-1 | │ │ │ │ └ b'%PDF-1.7\n%\xe4\xe3\xcf\xd2\n4 0 obj\n<</Type/XObject\n/Subtype/Form\n/FormType 1\n/Matrix[1 0 0 1 0 0]\n/BBox[0 0 595 841]...
app-1 | │ │ │ └ <magic_pdf.pipe.OCRPipe.OCRPipe object at 0x7f19c9e5f370>
app-1 | │ │ └ <function doc_analyze at 0x7f1b4175c160>
app-1 | │ └ []
app-1 | └ <magic_pdf.pipe.OCRPipe.OCRPipe object at 0x7f19c9e5f370>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 166, in doc_analyze
app-1 | result = custom_model(img)
app-1 | │ └ array([[[255, 255, 255],
app-1 | │ [255, 255, 255],
app-1 | │ [255, 255, 255],
app-1 | │ ...,
app-1 | │ [255, 255, 255],
app-1 | │ [255...
app-1 | └ <magic_pdf.model.pdf_extract_kit.CustomPEKModel object at 0x7f19c9e5dcc0>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 468, in call
app-1 | html_code = self.table_model.img2html(new_image)
app-1 | │ │ │ └ <PIL.Image.Image image mode=RGB size=1283x457 at 0x7F19C9E5FA60>
app-1 | │ │ └ <function ppTableModel.img2html at 0x7f1a144c48b0>
app-1 | │ └ <magic_pdf.model.ppTableModel.ppTableModel object at 0x7f19e1bfbb20>
app-1 | └ <magic_pdf.model.pdf_extract_kit.CustomPEKModel object at 0x7f19c9e5dcc0>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/ppTableModel.py", line 42, in img2html
app-1 | pred_res, _ = self.table_sys(image)
app-1 | │ │ └ array([[[255, 255, 255],
app-1 | │ │ [255, 255, 255],
app-1 | │ │ [255, 255, 255],
app-1 | │ │ ...,
app-1 | │ │ [ 67, 67, 67],
app-1 | │ │ [ 67...
app-1 | │ └ <paddleocr.ppstructure.table.predict_table.TableSystem object at 0x7f19e1bfbd00>
app-1 | └ <magic_pdf.model.ppTableModel.ppTableModel object at 0x7f19e1bfbb20>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/predict_table.py", line 100, in call
app-1 | pred_html = self.match(structure_res, dt_boxes, rec_res)
app-1 | │ │ │ │ └ []
app-1 | │ │ │ └ array([], dtype=float64)
app-1 | │ │ └ (['', '', '

', '', '', '', '', '', '', '', '</e...
app-1 | │ └ <ppstructure.table.table_master_match.TableMasterMatcher object at 0x7f19c9d3fb80>
app-1 | └ <paddleocr.ppstructure.table.predict_table.TableSystem object at 0x7f19e1bfbd00>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/table_master_match.py", line 949, in call
app-1 | match_results = self.match()
app-1 | │ └ <function Matcher.match at 0x7f1a1448cd30>
app-1 | └ <ppstructure.table.table_master_match.TableMasterMatcher object at 0x7f19c9d3fb80>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/table_master_match.py", line 769, in match
app-1 | get_bboxes_list(end2end_result, structure_master_result)
app-1 | │ │ └ {'text': ',,,,,,,,,,,,<e...
app-1 | │ └ []
app-1 | └ <function get_bboxes_list at 0x7f1a1448c3a0>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/table_master_match.py", line 302, in get_bboxes_list
app-1 | xywh_bbox = xyxy2xywh(src_bboxes)
app-1 | │ └ array([], dtype=float64)
app-1 | └ <function xyxy2xywh at 0x7f1a14693d00>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/table_master_match.py", line 71, in xyxy2xywh
app-1 | new_bboxes[0] = bboxes[0] + (bboxes[2] - bboxes[0]) / 2
app-1 | │ │ │ └ array([], dtype=float64)
app-1 | │ │ └ array([], dtype=float64)
app-1 | │ └ array([], dtype=float64)
app-1 | └ array([], dtype=float64)
app-1 |
app-1 | IndexError: index 0 is out of bounds for axis 0 with size 0
app-1 | INFO: 10.0.104.3:53724 - "POST /pdf_parse?parse_method=ocr&is_json_md_dump=True&output_dir=output HTTP/1.1" 500 Internal Server Error
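
For readers following the traceback: the log shows `dt_boxes` as `array([], dtype=float64)` and `rec_res` as `[]`, so the table matcher appears to have been called with no detected text boxes, and `xyxy2xywh` then indexed into an empty array. The sketch below reproduces that `IndexError` with NumPy and shows one possible guard. `xyxy2xywh_sketch` is a hypothetical, simplified stand-in written for illustration; it is not PaddleOCR's actual implementation, and the right fix upstream may differ.

```python
import numpy as np


def xyxy2xywh_sketch(bboxes: np.ndarray) -> np.ndarray:
    """Convert (x1, y1, x2, y2) boxes to (center_x, center_y, width, height).

    Hypothetical, simplified stand-in for the function named in the traceback;
    it is NOT PaddleOCR's actual implementation.
    """
    bboxes = np.asarray(bboxes, dtype=np.float64)
    if bboxes.size == 0:
        # No boxes were detected: return an empty (0, 4) array
        # instead of indexing into an empty input.
        return np.empty((0, 4), dtype=np.float64)
    bboxes = bboxes.reshape(-1, 4)
    out = np.empty_like(bboxes)
    out[:, 0] = bboxes[:, 0] + (bboxes[:, 2] - bboxes[:, 0]) / 2  # center x
    out[:, 1] = bboxes[:, 1] + (bboxes[:, 3] - bboxes[:, 1]) / 2  # center y
    out[:, 2] = bboxes[:, 2] - bboxes[:, 0]  # width
    out[:, 3] = bboxes[:, 3] - bboxes[:, 1]  # height
    return out


# Reproduce the failure from the log: indexing an empty array raises IndexError.
empty = np.array([], dtype=np.float64)
try:
    empty[0]
except IndexError as exc:
    print(exc)

# With the guard, an empty input yields an empty result rather than an exception.
print(xyxy2xywh_sketch(empty).shape)
print(xyxy2xywh_sketch(np.array([0, 0, 10, 20])))
```

A guard like this only avoids the crash; whether a table crop with no detected text should be skipped earlier in the pipeline is a separate question for the maintainers.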

How to reproduce the bug

image
image
These are two screenshots of the PDF that needs to be parsed; it is not convenient for me to upload the whole file.

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.9.x

Device mode

cuda

@zhongxin129 zhongxin129 added the bug Something isn't working label Nov 8, 2024
@myhloli
Collaborator

myhloli commented Nov 13, 2024

We need the PDF document to reproduce this. Could you extract just these two pages into a new PDF and upload it?
