Fix IndexError in para_split_v3.py for empty line handling #916
docs(README): update badges
- Implement xycut algorithm to sort blocks when layoutreader fails - Add recursive_xy_cut function to perform the xycut algorithm - Update pdf_parse_union_core_v2.py to use xycut when layoutreader fails - Modify draw_bbox.py to handle cases where layoutreader fails to sort blocks
feat(model): add xycut algorithm for block sorting
- Decrease the maximum line count from 512 to 316 for layoutreader
- Lower the line count threshold from 316 to 200 to ensure compatibility - This change aims to prevent potential issues with layoutreader's maximum line support
refactor(pdf_parse): adjust line count threshold for layoutreader
Feat/add en docs
feat: using next_docs
- Add RapidTable model support for table recognition - Update table model configuration and initialization - Modify table recognition process to use RapidTable when specified - Add RapidTable dependency to setup.py
- Change the default table model from TABLE_MASTER to RAPID_TABLE
feat(table): integrate RapidTable model for table recognition
- Add missing '.jpg' file type to the list of allowed file types for upload
fix(gradio-app): add missing file type in upload
… output - Add orig_model_list parameter to maintain original model data - Deep copy model_json and pipe.model_list to preserve data integrity - Update json_md_dump function call to include orig_model_list - Improve condition check for empty model_json
refactor(magic_pdf_parse_main): optimize model data handling and JSON output
Modify the test directory
- Update test_image2html to use unittest framework - Add more assertions
test(table): improve ppTableModel test coverage
- Integrate RapidOCR with RapidTable model for table recognition - Improve memory management for devices with <= 8GB VRAM - Update table recognition process to use RapidOCR for RapidTable - Add rapidocr-paddle dependency in setup.py
feat(table): add RapidOCR support for RapidTable model
Additional Information
When processing large PDF files (280MB or larger) using the Test Environment
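Also, we recommend basing your work on the dev branch and opening the PR against dev, because all of our development happens on dev and is only synced to master at release time.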
magic_pdf/para/para_split_v3.py
@@ -102,7 +102,9 @@ def __is_list_or_index_block(block):
if span_type == ContentType.Text:
line_text += span['content'].strip()

lines_text_list.append(line_text)
# Only add non-empty text
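Empty lines must be appended too, because the length of lines_text_list has to stay the same as block['lines']. If the lengths differ, the index-based matching done later will be misaligned.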
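To illustrate the misalignment described above, here is a minimal sketch. The names `raw_lines`, `filtered`, `aligned`, and `text_for_line` are hypothetical stand-ins, not code from para_split_v3.py; `raw_lines` plays the role of the per-line text extracted from block['lines'].

```python
# Hypothetical stand-in for the text of each entry in block['lines'].
raw_lines = ["1. first", "", "2. second"]

# Variant A: skip empty lines (the approach the review comment argues against).
filtered = [t.strip() for t in raw_lines if t.strip()]

# Variant B: keep empty lines so that index i refers to the same line in both lists.
aligned = [t.strip() for t in raw_lines]


def text_for_line(texts, i):
    """Return the text stored at index i, or None if i is out of range."""
    return texts[i] if i < len(texts) else None


# With the filtered list, line 1 (which is empty) is matched to the text of line 2.
print(text_for_line(filtered, 1))
# With the aligned list, line 1 is correctly reported as empty.
print(repr(text_for_line(aligned, 1)))
```

Keeping the empty entries means the later loops can index lines_text_list with the same `i` they use for block['lines'], and only need to guard against an empty string.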
magic_pdf/para/para_split_v3.py
if len(line_text) > 0:
if line_text[-1] in LIST_END_FLAG:
if len(line_text) > 0: # Extra check to make sure it is not an empty string
if line_text and line_text[-1] in LIST_END_FLAG:
Once len(line_text) > 0 has been checked, there is no need for an additional `if line_text` emptiness test.
magic_pdf/para/para_split_v3.py
if num_start_count / len(lines_text_list) >= 0.8 or num_end_count / len(lines_text_list) >= 0.8:
line_num_flag = True
total_valid_lines = len(lines_text_list)
This code is inside `if len(lines_text_list) > 0:`, so the length of lines_text_list is guaranteed to be greater than 0 and there is no need to check whether it is 0.
magic_pdf/para/para_split_v3.py
@@ -176,7 +180,7 @@ def __is_list_or_index_block(block):
# This is the case where most line items have an end flag; items are separated by the end flag
elif line_end_flag:
for i, line in enumerate(block['lines']):
if lines_text_list[i][-1] in LIST_END_FLAG:
if i < len(lines_text_list) and lines_text_list[i] and lines_text_list[i][-1] in LIST_END_FLAG:
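The original code here was missing a check for whether lines_text_list[i] is empty, which can raise an error. It can be changed directly to
if len(lines_text_list[i]) > 0 and lines_text_list[i][-1] in LIST_END_FLAG:
Because lines_text_list has the same length as block['lines'], there is no need to check i < len(lines_text_list).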
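The suggested guard can be sketched as a small standalone loop. This is an illustration only: `tag_end_lines`, the `"is_list_end_line"` key, and the values in `LIST_END_FLAG` are assumptions made for the example, not the definitions used in para_split_v3.py.

```python
LIST_END_FLAG = (".", ";")  # assumed values, for illustration only


def tag_end_lines(lines, lines_text_list):
    """Mark each line dict whose text ends with a list-end flag."""
    # Both lists are built in lockstep, so every index i is valid for both
    # and no `i < len(lines_text_list)` check is needed inside the loop.
    assert len(lines) == len(lines_text_list)
    for i, line in enumerate(lines):
        text = lines_text_list[i]
        # Test for emptiness before indexing: ''[-1] raises IndexError.
        if len(text) > 0 and text[-1] in LIST_END_FLAG:
            line["is_list_end_line"] = True  # hypothetical tag name
    return lines


lines = tag_end_lines([{}, {}, {}], ["a.", "", "b"])
print(lines)
```

An alternative that avoids indexing altogether is `text.endswith(LIST_END_FLAG)`, since `str.endswith` accepts a tuple of suffixes and returns False for an empty string; whether that fits depends on how LIST_END_FLAG is actually defined in the module.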
magic_pdf/para/para_split_v3.py
line[ListLineTag.IS_LIST_START_LINE] = True
if lines_text_list[i][-1] in LIST_END_FLAG:
line[ListLineTag.IS_LIST_END_LINE] = True
if i < len(lines_text_list) and lines_text_list[i]:
Likewise, this check is not needed.
line[ListLineTag.IS_LIST_END_LINE] = True
if i < len(lines_text_list) and lines_text_list[i]:

if lines_text_list[i][0].isdigit():
As above, the check for empty content is missing here; adding a len check is enough:
if len(lines_text_list[i]) > 0 and lines_text_list[i][0].isdigit():
133ff5e to e75076b
This PR resolves an IndexError in the __is_list_or_index_block function that occurs when lines_text_list contains an empty string. Changes include:
Sample input that caused the error: ['a.', 'b.', 'c.', 'd.', 'e.', 'f.', '']
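A minimal sketch that reproduces the failure with the sample input above, next to a guarded version. The two helper functions and the single-character `LIST_END_FLAG` are assumptions for illustration and do not mirror the actual body of __is_list_or_index_block.

```python
LIST_END_FLAG = (".",)  # assumed value, for illustration only

lines_text_list = ['a.', 'b.', 'c.', 'd.', 'e.', 'f.', '']


def count_end_flags_unguarded(texts):
    """Count lines ending with a flag; fails on '' because ''[-1] is invalid."""
    return sum(1 for t in texts if t[-1] in LIST_END_FLAG)


def count_end_flags_guarded(texts):
    """Same count, but empty strings are skipped before indexing."""
    return sum(1 for t in texts if len(t) > 0 and t[-1] in LIST_END_FLAG)


try:
    count_end_flags_unguarded(lines_text_list)
except IndexError as exc:
    print("unguarded version failed:", exc)

num_end_count = count_end_flags_guarded(lines_text_list)
# The empty line stays in the list, so it still counts toward the denominator.
print(num_end_count / len(lines_text_list) >= 0.8)
```

Note that the empty string remains in lines_text_list here, in line with the review feedback that the list must keep the same length as block['lines'].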