refactor(para_split_v3): refine list block detection in paragraph splitting

- Update list block detection logic to require at least 2 numeric start lines
- Ensure the number of numeric start lines matches the number of end lines
- Remove detection of non-border starting lines for simplicity
myhloli committed Oct 15, 2024
1 parent 244b868 commit 81b9fd7
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion magic_pdf/para/para_split_v3.py
@@ -166,7 +166,7 @@ def __is_list_or_index_block(block):
             line[ListLineTag.IS_LIST_END_LINE] = True
             line_start_flag = True
     # A special indented ordered list: start lines are not flush with the left edge and begin with a digit; end lines are tagged IS_LIST_END_LINE and their count matches the start lines
-    elif num_start_count == flag_end_count:  # Keep it simple for now: ignore the case where the left side is not flush with the edge
+    elif num_start_count >= 2 and num_start_count == flag_end_count:  # Keep it simple for now: ignore the case where the left side is not flush with the edge
         for i, line in enumerate(block['lines']):
             if lines_text_list[i][0].isdigit():
                 line[ListLineTag.IS_LIST_START_LINE] = True
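The effect of the added `num_start_count >= 2` guard can be illustrated with a minimal standalone sketch. The helper names below and the way the two counts are derived are assumptions made for illustration only; the diff does not show how `num_start_count` and `flag_end_count` are actually computed in `para_split_v3.py`.

```python
# Hypothetical helpers: they only mirror the two conditions from the diff.
# How the real code computes the counts is NOT shown in the commit, so the
# counting below is an assumption made for this sketch.

def count_numeric_starts(lines_text_list):
    """Count lines whose first character is a digit."""
    return sum(1 for text in lines_text_list if text and text[0].isdigit())


def old_condition(lines_text_list, end_flags):
    """Condition before the commit: the two counts only need to match."""
    num_start_count = count_numeric_starts(lines_text_list)
    flag_end_count = sum(1 for flag in end_flags if flag)
    return num_start_count == flag_end_count


def new_condition(lines_text_list, end_flags):
    """Condition after the commit: at least 2 numeric starts, and counts match."""
    num_start_count = count_numeric_starts(lines_text_list)
    flag_end_count = sum(1 for flag in end_flags if flag)
    return num_start_count >= 2 and num_start_count == flag_end_count


# A single line that happens to start with a digit and is flagged as an end line.
single_lines = ["2024 was a busy year."]
single_flags = [True]

# Two numbered items, each flagged as an end line.
list_lines = ["1. First item.", "2. Second item."]
list_flags = [True, True]

print(old_condition(single_lines, single_flags), new_condition(single_lines, single_flags))
print(old_condition(list_lines, list_flags), new_condition(list_lines, list_flags))
```

Under these assumptions, the single-line block satisfies the old condition (1 == 1) but not the new one, while the two-item block satisfies both. The guard also excludes the degenerate case where both counts are zero.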
