Extraction skipping every alternate line and also not extracting headers #4017

chelsud123 · 2024-11-04T14:31:02Z

Description of the bug

In the attached PDF, I am trying to extract the last table. PyMuPDF finds the table, but the extraction skips every other row and also omits the column headers. The following items are attached:

Problem_document.pdf : The PDF from which the table has been extracted
Table_error.png : The erroneous extraction

How to reproduce the bug

import pandas as pd
import pymupdf
from IPython.display import display

doc = pymupdf.open('Problem_document.pdf')
idx = 0  # placeholder: index of the page holding the table (not given in the report)
page = doc[idx]
tabs = page.find_tables(add_lines=None)
print(f"{len(tabs.tables)} table(s) found on page {idx}")
for tab in tabs.tables:
    # Build a DataFrame from the raw cell text of each detected table
    df_test = pd.DataFrame(tab.extract())
    display(df_test)

PyMuPDF version

1.24.13

Operating system

Windows

Python version

3.9
