Similar to #826 Still not resolved #2565

Closed
sagard21 opened this issue Jul 30, 2023 · 5 comments
Labels: example required, not a bug (not a bug / user error / unable to reproduce)

Comments


sagard21 commented Jul 30, 2023

Please provide all mandatory information!

Describe the bug (mandatory)

An ellipse / oval shape is drawn instead of a rectangle. The issue occurs on Ubuntu 22.04 but not on Mac. It appears when OCR text is extracted and its bounding boxes are used for the annotation.

However, when the annotation is created on Mac using the same coordinates, the rectangle looks correct.

To Reproduce (mandatory)

Explain the steps to reproduce the behavior, For example, include a minimal code snippet, example files, etc.

import fitz

doc = fitz.open(filepath)

# Iterate over each page
for page_object in doc.pages():
    # Set the last known text offset
    prev_offset = 0
    # Placeholder to save the text coordinates for entire page
    coordinates = []
    # Placeholder to save the text content
    page = []

    # Get OCR text
    res = page_object.get_textpage_ocr(
        tessdata="/usr/share/tesseract-ocr/4.00/tessdata/", full=True, dpi=300
    )

    # Iterate over every word in the page
    for block in page_object.get_text("dict", textpage=res, sort=False)[
        "blocks"
    ]:
        for line in block["lines"]:
            line_ls = []
            for span in line["spans"]:
                bbox = fitz.Rect(span["bbox"]).irect
                word = span["text"]
                line_ls.append(word)
                start = prev_offset
                end = start + len(word)
                prev_offset = end + 1
                coordinates.append(
                    {
                        "word": word,
                        "start": start,
                        "end": end,
                        "x0": bbox[0],
                        "y0": bbox[1],
                        "x1": bbox[2],
                        "y1": bbox[3],
                    }
                )
            page.append(" ".join(line_ls))

    # Get the entire page text
    text_ocr = "\n".join(page)

    # Run NER on this page's text.
    # ner_function, bounding_box_identifier and color are not defined in this report.
    ner_predictions = ner_function(text_ocr)

    # List of coordinates, e.g. [(121, 35, 144, 42), ...]
    ner_bounding_boxes = bounding_box_identifier(ner_predictions)

    for b in ner_bounding_boxes:
        # Create a rectangle
        rect = fitz.Rect(b[0], b[1], b[2], b[3])
        # Add annotation to the page
        annot = page_object.add_highlight_annot(rect)
        # Update the color
        annot.set_colors(stroke=color)
        # Update the annotation for changes to take effect
        annot.update()

doc.save('/path/to/location.pdf')

Expected behavior (optional)

Describe what you expected to happen (if not obvious).

Screenshots (optional)

If applicable, add screenshots to help explain your problem.
[image: screenshot attached by the reporter]

Your configuration (mandatory)

  • Operating system: Ubuntu 22.04 (problem occurs here)
  • Python version: 3.10
  • PyMuPDF version: 1.22.5

For example, the output of print(sys.version, "\n", sys.platform, "\n", fitz.__doc__) would be sufficient (for the first two bullets).

Additional context (optional)

Works on Mac.

@JorjMcKie
Collaborator

Please provide a reproducing page. This is mandatory for bug reports.

@sagard21
Author

@JorjMcKie I provided the script I used for generating the highlighted PDF. One more update: out of 20 PDF files processed, only 2 had this issue. The rest of the PDF files were fine. This just got even weirder :(

@JorjMcKie
Collaborator

> @JorjMcKie I provided the script I used for generating the highlighted PDF. One more update: out of 20 PDF files processed, only 2 had this issue. The rest of the PDF files were fine. This just got even weirder :(

Of course I saw your code! I have been asking for at least one example PDF page exhibiting your problem. In other words, that page must be exactly the same on both platforms (Mac / Linux) and yet be handled differently by the same app on each platform.

As an aside:
Your picture does not show ellipses / ovals. Instead, your annotation rectangles are specified such that they are interpreted as being rotated by 90°. So the left or right rect edge is interpreted as the bottom edge, and the top / bottom edges become left and right.
When you consult the documentation, you will see that highlighting is internally based on quadrilaterals ("quads"), which allows marking text of any orientation. The sequence of the quad's corners defines what is left, right, top and bottom. Change that sequence, and you will see awkward-looking highlights.
For ease of use, PyMuPDF also accepts rectangles and converts them to quads.
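
To make the point about corner order concrete, here is a minimal sketch. It is not taken from the reporter's script: the helper names `rect_to_corners`, `resequence_quarter_turn` and `write_demo`, the coordinates, and the output path are all made up for illustration. It assumes the corner order `ul, ur, ll, lr` used by `fitz.Quad` and that `Page.add_highlight_annot` accepts a quad. The exact look of the second highlight depends on the viewer, so no particular rendering is claimed.

```python
from typing import Tuple

Point = Tuple[float, float]
Corners = Tuple[Point, Point, Point, Point]


def rect_to_corners(x0: float, y0: float, x1: float, y1: float) -> Corners:
    """Corners of an upright rectangle in the order ul, ur, ll, lr."""
    return ((x0, y0), (x1, y0), (x0, y1), (x1, y1))


def resequence_quarter_turn(corners: Corners) -> Corners:
    """Same four points, listed as if the text ran bottom-to-top."""
    ul, ur, ll, lr = corners
    return (ll, ul, lr, ur)


def write_demo(out_path: str) -> None:
    """Highlight two areas: one upright, one with re-sequenced corners."""
    import fitz  # PyMuPDF; imported lazily so the helpers above work without it

    doc = fitz.open()  # new, empty PDF
    page = doc.new_page()
    upright = rect_to_corners(72, 72, 272, 92)
    shifted = rect_to_corners(72, 122, 272, 142)
    # Corner order as converted from a plain rectangle
    page.add_highlight_annot(fitz.Quad(*upright))
    # Identical geometry type, but the corner sequence says "rotated text"
    page.add_highlight_annot(fitz.Quad(*resequence_quarter_turn(shifted)))
    doc.save(out_path)
    doc.close()
```

Calling `write_demo("quad_order_demo.pdf")` and opening the result should show whether the second highlight resembles the shapes in the screenshot.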

All of this is code that is not influenced by the platform running it. Therefore, I don't believe you can get different behavior between Mac and Linux, given the same files and PyMuPDF versions.
The only explanation I can imagine: the input files have a different orientation (rotation) on the two platforms.
Maybe you OCR your files separately on each platform?
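
One way to test the rotation hypothesis is to record each page's rotation on both machines and compare the results. The following is only a sketch under that assumption: `page_rotations` and `differing_pages` are hypothetical helper names, and the file names in the usage note are placeholders. The PyMuPDF attributes used are `Page.number` and `Page.rotation`.

```python
from typing import Dict, List


def page_rotations(pdf_path: str) -> Dict[int, int]:
    """Map page number -> rotation in degrees for one PDF."""
    import fitz  # PyMuPDF; imported lazily so the comparison helper works without it

    with fitz.open(pdf_path) as doc:
        return {page.number: page.rotation for page in doc}


def differing_pages(a: Dict[int, int], b: Dict[int, int]) -> List[int]:
    """Page numbers whose rotation differs, or that exist in only one file."""
    return sorted(p for p in set(a) | set(b) if a.get(p) != b.get(p))
```

For example, `differing_pages(page_rotations("from_mac.pdf"), page_rotations("from_linux.pdf"))` returns an empty list when both copies agree on every page's rotation. A non-empty list would point at the pages worth attaching as a reproducing example.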

@JorjMcKie
Collaborator

Reminder:
Please provide us with a PDF page as described, so we can reproduce the problem.

@JorjMcKie
Collaborator

Going to close this because of missing data to reproduce the problem.

JorjMcKie added the "not a bug" label (not a bug / user error / unable to reproduce) on Aug 3, 2023