Fix IndexError in para_split_v3.py for empty line handling #916

Closed

Contributor

@hyastar hyastar commented Nov 9, 2024

This PR resolves an IndexError in the __is_list_or_index_block function that occurs when lines_text_list contains an empty string. Changes include:

  • Added an empty string check before accessing the last character.
  • Replaced direct indexing with safer slicing (text[-1:] instead of text[-1]).
  • Enhanced handling for empty or whitespace-only lines.

Sample input that caused the error: ['a.', 'b.', 'c.', 'd.', 'e.', 'f.', '']
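
A minimal reproduction of the failure described above, as a sketch only. The contents of `LIST_END_FLAG` here are illustrative and not copied from para_split_v3.py, and the three helper functions are hypothetical names. The sketch contrasts direct indexing with the two safer alternatives listed in the changes (a guard and a slice).

```python
# Illustrative stand-in; the real LIST_END_FLAG in para_split_v3.py may differ.
LIST_END_FLAG = ('.', '。', ';', ';')

# Sample input from the PR description: the last entry is an empty string.
lines_text_list = ['a.', 'b.', 'c.', 'd.', 'e.', 'f.', '']


def count_end_flags_unsafe(lines):
    """Direct indexing: text[-1] raises IndexError when text is ''."""
    return sum(1 for text in lines if text[-1] in LIST_END_FLAG)


def count_end_flags_guarded(lines):
    """Guard first: empty strings are skipped and never indexed."""
    return sum(1 for text in lines if text and text[-1] in LIST_END_FLAG)


def count_end_flags_sliced(lines):
    """Slice instead of index: ''[-1:] evaluates to '' and never raises."""
    return sum(1 for text in lines if text[-1:] in LIST_END_FLAG)
```

One caveat on the slicing variant: it is only safe when `LIST_END_FLAG` is a tuple, list, or set. If it were a plain string, `'' in LIST_END_FLAG` would be `True`, and empty lines would be counted as ending with a flag.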

myhloli and others added 27 commits November 6, 2024 18:04
docs(README): update badges
- Implement xycut algorithm to sort blocks when layoutreader fails
- Add recursive_xy_cut function to perform the xycut algorithm
- Update pdf_parse_union_core_v2.py to use xycut when layoutreader fails
- Modify draw_bbox.py to handle cases where layoutreader fails to sort blocks
feat(model): add xycut algorithm for block sorting
- Decrease the maximum line count from 512 to 316 for layoutreader
- Lower the line count threshold from 316 to 200 to ensure compatibility
- This change aims to prevent potential issues with layoutreader's maximum line support
refactor(pdf_parse): adjust line count threshold for layoutreader
- Add RapidTable model support for table recognition
- Update table model configuration and initialization
- Modify table recognition process to use RapidTable when specified
- Add RapidTable dependency to setup.py
- Change the default table model from TABLE_MASTER to RAPID_TABLE
feat(table): integrate RapidTable model for table recognition
- Add missing '.jpg' file type to the list of allowed file types for upload
fix(gradio-app): add missing file type in upload
… output

- Add orig_model_list parameter to maintain original model data
- Deep copy model_json and pipe.model_list to preserve data integrity
- Update json_md_dump function call to include orig_model_list
- Improve condition check for empty model_json
refactor(magic_pdf_parse_main): optimize model data handling and JSON output
Modify the test directory
- Update test_image2html to use unittest framework
- Add more assertions
test(table): improve ppTableModel test coverage
- Integrate RapidOCR with RapidTable model for table recognition
- Improve memory management for devices with <= 8GB VRAM
- Update table recognition process to use RapidOCR for RapidTable
- Add rapidocr-paddle dependency in setup.py
feat(table): add RapidOCR support for RapidTable model
Contributor

github-actions bot commented Nov 9, 2024


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a Pull Request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


1 out of 3 committers have signed the CLA.
✅ (hyastar)[https://github.com/hyastar]
@xu rui
@DTwz
xu rui seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@hyastar
Contributor Author

hyastar commented Nov 10, 2024

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Nov 10, 2024
@hyastar
Contributor Author

hyastar commented Nov 10, 2024

Additional Information

When processing large PDF files (280MB or larger) with the magic-pdf library, an IndexError frequently occurs in the __is_list_or_index_block function in para_split_v3.py. It is raised when lines_text_list contains an empty string, because indexing the last character of an empty string is out of range. The error was observed on a machine with an RTX 4090 GPU, 100GB of system memory, and 24GB of GPU memory.

Test Environment

  • GPU: NVIDIA RTX 4090
  • System Memory: 100GB
  • GPU Memory: 24GB

Collaborator

@myhloli myhloli left a comment

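We also suggest basing your changes on the dev branch and opening the PR against dev, because all of our development happens on dev and is only synced to master at release time.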

@@ -102,7 +102,9 @@ def __is_list_or_index_block(block):
if span_type == ContentType.Text:
line_text += span['content'].strip()

lines_text_list.append(line_text)
# Only append non-empty text
Collaborator

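Empty lines must be appended too, because the length of lines_text_list has to stay equal to that of block['lines']. If the lengths differ, the index-based matching later on will be misaligned.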
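
The alignment concern in this comment can be illustrated with a small sketch. The `block` structure and the `collect_line_texts` helper below are hypothetical simplifications (the real function also filters spans by `ContentType.Text`, which is omitted here); the point is only that dropping empty strings shifts every later index.

```python
# Hypothetical block shaped like the dicts iterated over in the diff.
block = {
    'lines': [
        {'spans': [{'content': '1. first'}]},
        {'spans': [{'content': '   '}]},       # whitespace-only line
        {'spans': [{'content': '2. second'}]},
    ]
}


def collect_line_texts(block, keep_empty=True):
    """Build one stripped text per line; optionally drop empty results."""
    texts = []
    for line in block['lines']:
        line_text = ''.join(span['content'].strip() for span in line['spans'])
        if keep_empty or line_text:
            texts.append(line_text)
    return texts


# Same length as block['lines'], so texts[i] describes block['lines'][i].
aligned = collect_line_texts(block, keep_empty=True)

# One entry shorter: filtered[1] now holds the text of block['lines'][2].
filtered = collect_line_texts(block, keep_empty=False)
```

With `keep_empty=False`, a later loop over `enumerate(block['lines'])` would read the wrong text for every line after the first empty one, and would run past the end of the list on the last line.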

if len(line_text) > 0:
if line_text[-1] in LIST_END_FLAG:
if len(line_text) > 0:  # Extra check to make sure the string is not empty
if line_text and line_text[-1] in LIST_END_FLAG:
Collaborator

When len(line_text) > 0 already holds, there is no need to check for emptiness again with if line_text.


if num_start_count / len(lines_text_list) >= 0.8 or num_end_count / len(lines_text_list) >= 0.8:
line_num_flag = True
total_valid_lines = len(lines_text_list)
Collaborator

This code sits inside if len(lines_text_list) > 0:, so the length of lines_text_list is guaranteed to be greater than 0 and there is no need to check whether it is 0.
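
A sketch of the ratio test discussed here, under stated assumptions: `has_line_numbers` is a hypothetical helper, and the way `num_start_count` and `num_end_count` are computed (lines beginning or ending with a digit) is a guess, since the diff does not show it. Only the 0.8 threshold and the outer length guard come from the diff.

```python
def has_line_numbers(lines_text_list, threshold=0.8):
    """Return True when enough lines start or end with a digit."""
    if len(lines_text_list) > 0:
        # Inside the guard the length is at least 1, so dividing is safe
        # and no second zero check is needed.
        total = len(lines_text_list)
        # Assumed counting rule; empty strings are guarded, not indexed.
        num_start_count = sum(
            1 for t in lines_text_list if len(t) > 0 and t[0].isdigit()
        )
        num_end_count = sum(
            1 for t in lines_text_list if len(t) > 0 and t[-1].isdigit()
        )
        return (num_start_count / total >= threshold
                or num_end_count / total >= threshold)
    return False
```

Note that empty strings still count toward `total`, which is consistent with keeping lines_text_list the same length as block['lines'].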

@@ -176,7 +180,7 @@ def __is_list_or_index_block(block):
# Case where most line items have an end marker: split items by the end marker
elif line_end_flag:
for i, line in enumerate(block['lines']):
if lines_text_list[i][-1] in LIST_END_FLAG:
if i < len(lines_text_list) and lines_text_list[i] and lines_text_list[i][-1] in LIST_END_FLAG:
Collaborator

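The original method is missing a check for whether lines_text_list[i] is empty here, which can raise an error. It can simply be changed to:

if len(lines_text_list[i]) > 0 and lines_text_list[i][-1] in LIST_END_FLAG:

Because lines_text_list has the same length as block['lines'], there is no need to check i < len(lines_text_list).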

line[ListLineTag.IS_LIST_START_LINE] = True
if lines_text_list[i][-1] in LIST_END_FLAG:
line[ListLineTag.IS_LIST_END_LINE] = True
if i < len(lines_text_list) and lines_text_list[i]:
Collaborator

Likewise, this check is not needed.

line[ListLineTag.IS_LIST_END_LINE] = True
if i < len(lines_text_list) and lines_text_list[i]:

if lines_text_list[i][0].isdigit():
Collaborator

Similar to the above: the check for empty content is missing here. Adding a len check is enough:

if len(lines_text_list[i]) > 0 and lines_text_list[i][0].isdigit():
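
Putting the review suggestions together, a guarded version of the per-line checks might look like the sketch below. It is not the project's full tagging logic: `LIST_END_FLAG` holds illustrative values, `IS_LIST_END_LINE` is a hypothetical string key standing in for the `ListLineTag` member, and both helper functions are invented for the example.

```python
LIST_END_FLAG = ('.', '。', ';', ';')  # illustrative values only

# Hypothetical key standing in for ListLineTag.IS_LIST_END_LINE.
IS_LIST_END_LINE = 'is_list_end_line'


def tag_end_lines(lines, lines_text_list):
    """Mark lines whose text ends with an end flag; skip empty texts."""
    # lines_text_list is assumed to have the same length as lines,
    # so no i < len(lines_text_list) check is needed.
    for i, line in enumerate(lines):
        text = lines_text_list[i]
        if len(text) > 0 and text[-1] in LIST_END_FLAG:
            line[IS_LIST_END_LINE] = True
    return lines


def starts_with_digit(text):
    """Length check first, so an empty string is never indexed."""
    return len(text) > 0 and text[0].isdigit()
```

The `and` operator short-circuits, so the index expression on the right is evaluated only when the length check on the left has passed.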

@hyastar hyastar force-pushed the fix-indexerror-in-para_split_v3 branch from 133ff5e to e75076b Compare November 11, 2024 06:44
@myhloli myhloli closed this Nov 11, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Nov 11, 2024
@hyastar hyastar deleted the fix-indexerror-in-para_split_v3 branch November 11, 2024 08:25
@hyastar hyastar restored the fix-indexerror-in-para_split_v3 branch November 11, 2024 08:25
@hyastar hyastar deleted the fix-indexerror-in-para_split_v3 branch November 11, 2024 08:25
3 participants