Fix IndexError in para_split_v3.py for empty line handling #916

hyastar · 2024-11-09T13:23:24Z

This PR resolves an IndexError in the __is_list_or_index_block function that occurs when lines_text_list contains an empty string. Changes include:

- Added an empty string check before accessing the last character.
- Replaced direct indexing with safer slicing (text[-1:] instead of text[-1]).
- Enhanced handling for empty or whitespace-only lines.

Sample input that caused the error: ['a.', 'b.', 'c.', 'd.', 'e.', 'f.', '']
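
The failure mode and the slicing-based fix can be reproduced in isolation. The sketch below is not the actual body of __is_list_or_index_block, which is not shown in this PR description. The two helper functions and the `end_chars` parameter are hypothetical names used only to illustrate why `text[-1]` raises on an empty string while `text[-1:]` does not.

```python
# Sample input from the PR description: the trailing '' triggers the error.
lines_text_list = ['a.', 'b.', 'c.', 'd.', 'e.', 'f.', '']


def count_ending_unsafe(lines, end_chars=('.',)):
    """Hypothetical helper: direct indexing, fails on empty strings."""
    # ''[-1] raises IndexError: string index out of range
    return sum(1 for text in lines if text[-1] in end_chars)


def count_ending_safe(lines, end_chars=('.',)):
    """Hypothetical helper: skips blank lines and uses slicing."""
    count = 0
    for text in lines:
        stripped = text.strip()
        if not stripped:
            # Empty or whitespace-only line: nothing to inspect.
            continue
        # stripped[-1:] is '' for an empty string instead of raising.
        # end_chars is a tuple, so '' is never a member of it.
        if stripped[-1:] in end_chars:
            count += 1
    return count


try:
    count_ending_unsafe(lines_text_list)
except IndexError as exc:
    print(f"unsafe version failed: {exc}")

print(count_ending_safe(lines_text_list))  # 6
```

Note that `end_chars` is a tuple on purpose: with a plain string such as `'.'`, the expression `'' in '.'` evaluates to True, so slicing alone would silently count empty lines as matches.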

docs(README): update badges

- Implement xycut algorithm to sort blocks when layoutreader fails
- Add recursive_xy_cut function to perform the xycut algorithm
- Update pdf_parse_union_core_v2.py to use xycut when layoutreader fails
- Modify draw_bbox.py to handle cases where layoutreader fails to sort blocks

feat(model): add xycut algorithm for block sorting

- Decrease the maximum line count from 512 to 316 for layoutreader

- Lower the line count threshold from 316 to 200 to ensure compatibility
- This change aims to prevent potential issues with layoutreader's maximum line support

refactor(pdf_parse): adjust line count threshold for layoutreader

Feat/add en docs

feat: using next_docs

- Add RapidTable model support for table recognition
- Update table model configuration and initialization
- Modify table recognition process to use RapidTable when specified
- Add RapidTable dependency to setup.py

- Change the default table model from TABLE_MASTER to RAPID_TABLE

feat(table): integrate RapidTable model for table recognition

- Add missing '.jpg' file type to the list of allowed file types for upload

fix(gradio-app): add missing file type in upload

… output
- Add orig_model_list parameter to maintain original model data
- Deep copy model_json and pipe.model_list to preserve data integrity
- Update json_md_dump function call to include orig_model_list
- Improve condition check for empty model_json

refactor(magic_pdf_parse_main): optimize model data handling and JSON output

Modify the test directory

- Update test_image2html to use unittest framework
- Add more assertions

test(table): improve ppTableModel test coverage

- Integrate RapidOCR with RapidTable model for table recognition
- Improve memory management for devices with <= 8GB VRAM
- Update table recognition process to use RapidOCR for RapidTable
- Add rapidocr-paddle dependency in setup.py

feat(table): add RapidOCR support for RapidTable model

github-actions · 2024-11-09T13:23:41Z

Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.

I have read the CLA Document and I hereby sign the CLA

1 out of 3 committers have signed the CLA.
✅ [hyastar](https://github.com/hyastar)
❌ @xu rui
❌ @DTwz
xu rui seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

hyastar · 2024-11-10T01:32:42Z

I have read the CLA Document and I hereby sign the CLA

hyastar · 2024-11-10T01:38:29Z

Additional Information

When processing large PDF files (280MB or larger) with the magic-pdf library, an IndexError frequently occurs in the __is_list_or_index_block function in para_split_v3.py. The error is raised when lines_text_list contains an empty string, because indexing the last character of an empty string is out of range. It was observed in the environment listed below.

Test Environment

GPU: NVIDIA RTX 4090
System Memory: 100GB
GPU Memory: 24GB

myhloli

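Also, we recommend basing your changes on the dev branch and opening the PR against dev. All of our development happens on dev, and it is only synced to master at release time.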

myhloli · 2024-11-11T03:17:38Z