Parsing fails with a 500 error #903

zhongxin129 · 2024-11-08T06:24:51Z

Description of the bug

Parsing a PDF fails with the following error:
app-1 | 2024-11-06 10:42:24.790 | INFO | magic_pdf.model.pdf_extract_kit:__call__:490 - table time: 0.0
app-1 | │ │ │ │ └ b'%PDF-1.7\n%\xe4\xe3\xcf\xd2\n4 0 obj\n<</Type/XObject\n/Subtype/Form\n/FormType 1\n/Matrix[1 0 0 1 0 0]\n/BBox[0 0 595 841]...
app-1 | │ │ │ └ <magic_pdf.pipe.OCRPipe.OCRPipe object at 0x7f19c9e5f370>
app-1 | │ │ └ <function doc_analyze at 0x7f1b4175c160>
app-1 | │ └ []
app-1 | └ <magic_pdf.pipe.OCRPipe.OCRPipe object at 0x7f19c9e5f370>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 166, in doc_analyze
app-1 | result = custom_model(img)
app-1 | │ └ array([[[255, 255, 255],
app-1 | │ [255, 255, 255],
app-1 | │ [255, 255, 255],
app-1 | │ ...,
app-1 | │ [255, 255, 255],
app-1 | │ [255...
app-1 | └ <magic_pdf.model.pdf_extract_kit.CustomPEKModel object at 0x7f19c9e5dcc0>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 468, in __call__
app-1 | html_code = self.table_model.img2html(new_image)
app-1 | │ │ │ └ <PIL.Image.Image image mode=RGB size=1283x457 at 0x7F19C9E5FA60>
app-1 | │ │ └ <function ppTableModel.img2html at 0x7f1a144c48b0>
app-1 | │ └ <magic_pdf.model.ppTableModel.ppTableModel object at 0x7f19e1bfbb20>
app-1 | └ <magic_pdf.model.pdf_extract_kit.CustomPEKModel object at 0x7f19c9e5dcc0>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/model/ppTableModel.py", line 42, in img2html
app-1 | pred_res, _ = self.table_sys(image)
app-1 | │ │ └ array([[[255, 255, 255],
app-1 | │ │ [255, 255, 255],
app-1 | │ │ [255, 255, 255],
app-1 | │ │ ...,
app-1 | │ │ [ 67, 67, 67],
app-1 | │ │ [ 67...
app-1 | │ └ <paddleocr.ppstructure.table.predict_table.TableSystem object at 0x7f19e1bfbd00>
app-1 | └ <magic_pdf.model.ppTableModel.ppTableModel object at 0x7f19e1bfbb20>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/predict_table.py", line 100, in __call__
app-1 | pred_html = self.match(structure_res, dt_boxes, rec_res)
app-1 | │ │ │ │ └ []
app-1 | │ │ │ └ array([], dtype=float64)
app-1 | │ │ └ (['', '', '', '', '', '', '', '', '', '', '</e...
app-1 | │ └ <ppstructure.table.table_master_match.TableMasterMatcher object at 0x7f19c9d3fb80>
app-1 | └ <paddleocr.ppstructure.table.predict_table.TableSystem object at 0x7f19e1bfbd00>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/table_master_match.py", line 949, in __call__
app-1 | match_results = self.match()
app-1 | │ └ <function Matcher.match at 0x7f1a1448cd30>
app-1 | └ <ppstructure.table.table_master_match.TableMasterMatcher object at 0x7f19c9d3fb80>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/table_master_match.py", line 769, in match
app-1 | get_bboxes_list(end2end_result, structure_master_result)
app-1 | │ │ └ {'text': ',,,,,,,,,,,,<e...
app-1 | │ └ []
app-1 | └ <function get_bboxes_list at 0x7f1a1448c3a0>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/table_master_match.py", line 302, in get_bboxes_list
app-1 | xywh_bbox = xyxy2xywh(src_bboxes)
app-1 | │ └ array([], dtype=float64)
app-1 | └ <function xyxy2xywh at 0x7f1a14693d00>
app-1 | File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/ppstructure/table/table_master_match.py", line 71, in xyxy2xywh
app-1 | new_bboxes[0] = bboxes[0] + (bboxes[2] - bboxes[0]) / 2
app-1 | │ │ │ └ array([], dtype=float64)
app-1 | │ │ └ array([], dtype=float64)
app-1 | │ └ array([], dtype=float64)
app-1 | └ array([], dtype=float64)
app-1 |
app-1 | IndexError: index 0 is out of bounds for axis 0 with size 0
app-1 | INFO: 10.0.104.3:53724 - "POST /pdf_parse?parse_method=ocr&is_json_md_dump=True&output_dir=output HTTP/1.1" 500 Internal Server Error
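Reading the traceback: `dt_boxes` is `array([], dtype=float64)` and `rec_res` is `[]`, so it looks like no text boxes were detected inside the table crop, and `xyxy2xywh` then indexes into an empty array. Below is a minimal sketch that reproduces the same `IndexError` with NumPy. `xyxy2xywh_1d` is a simplified stand-in written for illustration (only its first assignment is taken from the traceback), and `safe_xyxy2xywh_1d` is a hypothetical guard, not the actual paddleocr code or an official fix.

```python
import numpy as np


def xyxy2xywh_1d(bbox):
    """Convert one [x1, y1, x2, y2] box to [cx, cy, w, h].

    Simplified stand-in for illustration; the first assignment is the
    line shown in the traceback.
    """
    new_bbox = np.empty_like(bbox)
    new_bbox[0] = bbox[0] + (bbox[2] - bbox[0]) / 2  # center x
    new_bbox[1] = bbox[1] + (bbox[3] - bbox[1]) / 2  # center y
    new_bbox[2] = bbox[2] - bbox[0]                  # width
    new_bbox[3] = bbox[3] - bbox[1]                  # height
    return new_bbox


def safe_xyxy2xywh_1d(bbox):
    """Hypothetical guard: return empty input unchanged instead of indexing it."""
    bbox = np.asarray(bbox, dtype=float)
    if bbox.size == 0:
        return bbox
    return xyxy2xywh_1d(bbox)


if __name__ == "__main__":
    empty = np.array([])  # same value the traceback shows for the boxes
    try:
        xyxy2xywh_1d(empty)
    except IndexError as exc:
        print("IndexError:", exc)
    print(safe_xyxy2xywh_1d(empty).size)
```

If this reading is right, any table image in which text detection returns nothing would hit the same path, whatever the rest of the PDF contains.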

How to reproduce the bug

These are two screenshots of the PDF I need to parse; I can't conveniently upload the whole file.

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.9.x

Device mode

cuda

myhloli · 2024-11-13T04:22:45Z

We need the PDF document itself to reproduce this. Could you extract just those two pages into a new PDF and upload it?
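One way to cut out the two pages is sketched below, assuming PyMuPDF (`fitz`) is installed in the environment. The function name `extract_pages`, the file names, and the page indices are placeholders for illustration; page numbers are 0-based.

```python
def extract_pages(src_path, dst_path, page_numbers):
    """Copy the given 0-based pages of src_path into a new PDF at dst_path.

    Returns the number of pages written.
    """
    # PyMuPDF is imported here so this module still loads where it is missing.
    import fitz

    with fitz.open(src_path) as src, fitz.open() as dst:
        for n in page_numbers:
            # insert_pdf copies the inclusive page range from_page..to_page.
            dst.insert_pdf(src, from_page=n, to_page=n)
        dst.save(dst_path)
        return dst.page_count


if __name__ == "__main__":
    # Placeholder paths and page indices; adjust to the failing document.
    extract_pages("original.pdf", "two_pages.pdf", [0, 1])
```

Before uploading, it is worth running the extracted file through the same parse to confirm it still triggers the error.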

zhongxin129 added the bug label on Nov 8, 2024
