I need to search for multiple keywords in a pdf document and extract the specific pdf_pages, please kindly help me out in solving this. #693

Closed
sucanthudu opened this issue Oct 20, 2020 · 16 comments

@sucanthudu

pdf_document = fitz.open(pdf_file_path)
search_item = "MATHS, CALCULATIONS, GEOMETRY, ANALYTICAL, VALUES".replace(", ", "|")
pages = []
for this_page in range(len(pdf_document)):
    page = pdf_document.loadPage(this_page)
    if page.searchFor(search_item):
        pages.append(this_page)
        print("%s found on page_no %i" % (search_item, this_page))

print(pages)

@JorjMcKie
Collaborator

I think you mean to search for each word independently of the others. I also assume you want to ignore upper or lower case, and that you are not interested in (1) the position of the words on each page, or (2) how many times a word occurs on each page.

If these assumptions are correct, you have several options, the easiest ones being page.searchFor() and page.getText("words").
The best performance in your case is possible with page.getText("words").

import fitz

search_words = ("pixmap", "alpha", "outline", "rgb")  # words to consider, choose one of lower or upper case
results = {}
doc = fitz.open("v110-changes.pdf")
for page in doc:
    words = [w[4].lower() for w in page.getText("words")]  # all words on page (upper, lower depending on above!)
    for sword in search_words:  # loop through search list
        if sword in words:  # occurs as one of the words
            pages = results.get(sword, set())  # get set of page numbers so far
            pages.add(page.number)  # add this page number
            results[sword] = pages  # write back to results

# report the results
for word in results:
    result = list(map(str, results[word]))  # turn set of page numbers...
    page_list = ", ".join(result)  # to a comma-separated string
    print("Word '%s' occurs on pages %s." % (word, page_list))

Example output:

Word 'pixmap' occurs on pages 0.
Word 'alpha' occurs on pages 0.
Word 'outline' occurs on pages 0, 1.
Word 'rgb' occurs on pages 2.

This script only counts a search word if it appears completely "isolated", meaning with a space before and after it. If you also want to count cases where the search word is, for example, followed by punctuation or combined with other words, modify the script like this:

import fitz

search_words = ("pixmap", "alpha", "outline", "rgb")
results = {}
doc = fitz.open("v110-changes.pdf")
for page in doc:
    words = [w[4].lower() for w in page.getText("words")]
    for sword in search_words:
        for word in words:
            if sword in word:  # search word is *part* of a word on the page
                pages = results.get(sword, set())
                pages.add(page.number)
                results[sword] = pages

for word in results:
    result = list(map(str, results[word]))
    page_list = ", ".join(result)
    print("Word '%s' occurs on pages %s." % (word, page_list))

and the result is

Word 'pixmap' occurs on pages 0.
Word 'alpha' occurs on pages 0.
Word 'outline' occurs on pages 0, 1.
Word 'rgb' occurs on pages 0, 2.

So "rgb" also occurs in combination with other characters.
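
A possible middle ground between the two scripts above, sketched under the assumption that you want "rgb," or "(rgb)" to count but not "argb": strip leading and trailing punctuation from each token before comparing. The helper names `normalize` and `pages_by_word` are hypothetical, and the input is assumed to be a dict of page number to word strings, built for example from `[w[4] for w in page.getText("words")]`.

```python
import string


def normalize(token):
    """Lower-case a token and strip leading/trailing punctuation."""
    return token.strip(string.punctuation).lower()


def pages_by_word(page_words, search_words):
    """Map each search word to the sorted page numbers where it occurs
    as a whole token, ignoring case and surrounding punctuation.

    page_words: dict {page_number: list of word strings} (assumed input shape).
    """
    wanted = {w.lower() for w in search_words}
    results = {}
    for pno, tokens in page_words.items():
        # search words present on this page after normalization
        found = wanted & {normalize(t) for t in tokens}
        for word in found:
            results.setdefault(word, set()).add(pno)
    return {word: sorted(pnos) for word, pnos in results.items()}


# Made-up sample data standing in for real page content
sample = {0: ["The", "pixmap,", "uses", "(alpha)"], 1: ["argb", "Outline."]}
print(pages_by_word(sample, ("pixmap", "alpha", "outline", "rgb")))
```

With this approach "argb" does not count as a hit for "rgb", because only punctuation at the ends of a token is removed.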

@sucanthudu
Author

pdf_document = fitz.open(pdf_file_path)
search_item = "MATHS, CALCULATIONS, GEOMETRY, ANALYTICAL, VALUES".replace(", ", "|")
pages = []
for this_page in range(len(pdf_document)):
    page = pdf_document.loadPage(this_page)
    if page.searchFor(search_item):
        pages.append(this_page)
        print("%s found on page_no %i" % (search_item, this_page))
print(pages)

pdf = PdfFileReader(pdf_file_path)
pdfWriter = PdfFileWriter()
for page_num in pages:
    pdfWriter.addPage(pdf.getPage(page_num))
with open(pdf_output_path, 'wb') as f:  # the with block closes the file
    pdfWriter.write(f)

Thanks for the reply, that was great. But my objective is to search for a single string, for example “MATHS”, “CALCULATIONS”, “GEOMETRY”, “ANALYTICAL” or “VALUES”, across many large PDF documents of more than 300 pages each. Once that string is found, I need to extract only the page that contains it.

For example, in one PDF document a page may contain “MATHS”; the pages containing that string should be extracted.
In the same way, in another PDF document a page may contain “GEOMETRY”; that page should be extracted using this search string.

The search should match any of “MATHS”, “CALCULATIONS”, “GEOMETRY” or “ANALYTICAL”, work across multiple PDF files, and extract the pages containing these strings without duplicating pages.

@JorjMcKie
Collaborator

I think I do not quite understand. Is it this:
Take MATHS and take every page from every PDF where it occurs and put that page in a new MATH pdf.
Same with all the other search words?

@JorjMcKie
Collaborator

If your question is, how to copy pages between PDFs: use doc.insertPDF(sourcePDF, from_page=n, to_page=n) to copy a page from sourcePDF that contains a desired content.

@JorjMcKie
Collaborator

For example:

import fitz

pdf_list = ("file1.pdf", "file2.pdf", ...)
searchword = "MATH"
mathpdf = fitz.open()  # make empty PDF for containing MATH pages
for pdf in pdf_list:
    src = fitz.open(pdf)
    for page in src:
        for word in page.getText("words"):
            if searchword in word[4]:  # word[4] is the text of the word
                mathpdf.insertPDF(src, from_page=page.number, to_page=page.number)
                break  # copy each page only once
mathpdf.save("math.pdf")

@JorjMcKie
Collaborator

JorjMcKie commented Oct 20, 2020

or

import fitz

pdf_list = ("file1.pdf", "file2.pdf", ...)
searchword = "MATH"
mathpdf = fitz.open()  # make empty PDF for containing MATH pages
for pdf in pdf_list:
    src = fitz.open(pdf)
    for page in src:
        if page.searchFor(searchword) != []:  # at least one hit on this page
            mathpdf.insertPDF(src, from_page=page.number, to_page=page.number)

mathpdf.save("math.pdf")
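
The snippets in this thread handle one search word. For the "any of several keywords, without duplicate pages" case raised earlier, a minimal sketch might look like the following. The function names `pages_with_any` and `collect_pages` are hypothetical, matching is a plain case-insensitive substring test on the page text, and the code uses the snake_case method names of recent PyMuPDF releases (`get_text`, `insert_pdf`); the version discussed in this thread spells them `getText` and `insertPDF`.

```python
def pages_with_any(doc, keywords):
    """Return the page numbers of `doc` that contain at least one keyword.

    Each page is tested once, so it can appear at most once in the result,
    no matter how many keywords it contains.
    """
    lowered = [k.lower() for k in keywords]
    hits = []
    for page in doc:
        text = page.get_text().lower()  # plain text of the page
        if any(k in text for k in lowered):
            hits.append(page.number)
    return hits


def collect_pages(pdf_paths, keywords, out_path):
    """Copy every matching page from all source PDFs into one output PDF."""
    import fitz  # PyMuPDF; imported here so pages_with_any works on any page-like objects

    out = fitz.open()  # empty output PDF
    for path in pdf_paths:
        src = fitz.open(path)
        for pno in pages_with_any(src, keywords):
            out.insert_pdf(src, from_page=pno, to_page=pno)
        src.close()
    if out.page_count > 0:  # skip saving when nothing matched
        out.save(out_path)
    return out.page_count
```

A call such as `collect_pages(["file1.pdf", "file2.pdf"], ["MATHS", "GEOMETRY"], "subset.pdf")` would then write all matching pages to one file; the file names here are placeholders.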

@sucanthudu
Author

sucanthudu commented Oct 20, 2020

I think I do not quite understand. Is it this:
Take MATHS and take every page from every PDF where it occurs and put that page in a new MATH pdf.
Same with all the other search words?

Yes, correct, that was exactly my question, but with some changes, so I am stating it again.
"MATHS" will be present as a unique keyword in "file1.pdf", "GEOMETRY" will be present as a unique keyword in "file2.pdf", and so on. My search words will be ("MATHS","GEOMETRY",......). Now I need a new "Maths.pdf" output document that contains only the pages with content related to the "MATHS" keyword.

(Please note: "file1.pdf" may also contain a search keyword like "GEOMETRY", which is not of interest here. The new "Maths.pdf" output should contain only pages with table contents related to the "MATHS" keyword; pages containing the "GEOMETRY" keyword should be skipped and must not appear in "Maths.pdf".)

In the same way, I need a new "Geometry.pdf" output document that contains only the pages with table information related to the "GEOMETRY" keyword. (Similarly, pages containing the "MATHS" keyword should be skipped and must not appear in "Geometry.pdf".)

@JorjMcKie
Collaborator

Well, based on my various snippets above, that is easy now, isn't it? For MATHS, take file1.pdf only and copy the respective pages to MATH.pdf; for GEOMETRY, walk through file2.pdf and do the same thing as sketched above; and so on for the other keywords.

The key things to know are two: (1) how to detect whether the keyword is present on a page, and (2) how to copy an eligible page to the subset PDF.

@sucanthudu
Author

The key things to know are two: (1) how to detect whether the keyword is present on a page, and (2) how to copy an eligible page to the subset PDF.

Yes, exactly, thanks for the help. One last thing: I have edited my previous comment; kindly give suggestions.

@JorjMcKie
Collaborator

  1. Detecting a keyword on a page is done by checking page.searchFor(keyword) != []. If this condition is met, you could additionally make sure that page.searchFor(other) == [] for all other keywords if this is what you want.
  2. Once you know you want a page, copy it via (example) doc.insertPDF(sourcepdf, from_page=page.number, to_page=page.number), where doc is the document, which you will later save as MATH.pdf.
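
The two steps above could be combined roughly as follows. This is only a sketch under some assumptions: the names `exclusive_keyword` and `split_by_keyword` are hypothetical, the output files are named after the keyword, matching is a case-insensitive substring test on the page text rather than a `searchFor` call, and the method names are those of recent PyMuPDF releases (`get_text`, `insert_pdf`) instead of the older `getText` and `insertPDF` used in this thread.

```python
def exclusive_keyword(text, keywords):
    """Return the one keyword that occurs in `text`, or None if zero or
    more than one of the keywords occur (case-insensitive)."""
    lowered = text.lower()
    found = [k for k in keywords if k.lower() in lowered]
    return found[0] if len(found) == 1 else None


def split_by_keyword(src_path, keywords):
    """Copy each page that contains exactly one keyword into that keyword's PDF."""
    import fitz  # PyMuPDF; imported here so exclusive_keyword needs no dependency

    src = fitz.open(src_path)
    outputs = {k: fitz.open() for k in keywords}  # one empty PDF per keyword
    for page in src:
        key = exclusive_keyword(page.get_text(), keywords)
        if key is not None:
            outputs[key].insert_pdf(src, from_page=page.number, to_page=page.number)
    for key, out in outputs.items():
        if out.page_count > 0:  # only write files that received pages
            out.save(f"{key}.pdf")
    src.close()
```

Pages that mention both "MATHS" and "GEOMETRY" are skipped entirely in this sketch; whether that is the desired behaviour depends on the documents.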

@utmcontent

You can use
```python
def function():
    return 1
```

def a():
    return 1

to show your code; plain-text code is hard to read.

@JorjMcKie
Collaborator

@sucanthudu - how are things going? Are you all set? Need more help?

@sucanthudu
Author

Thanks for the help and support. All the above suggestions and snippets are really helping me. Things are going well now, and if any doubts arise I will come back here.

@sucanthudu
Author

@JorjMcKie Hope you are doing great. I need to match two or three keywords on a PDF page and extract that page from the document; please help me solve this.
keyword_list = ['Profit', 'Loss', 'Income', 'Expense', 'Savings']
For example: PDF page 1 contains Profit and Loss; using the two keywords 'Profit' and 'Loss', I need to extract page 1.
PDF page 2 contains Income, Expense and Savings; using all three keywords 'Income', 'Expense' and 'Savings', I need to extract page 2.
In this way I have a bag-of-words pattern for each page, and I need to extract pages based on that pattern.
Please suggest an approach.

pdf_document = fitz.open(pdf_file_path)
keyword_list_set = 'Profit' and 'Loss', 'Income' and 'Expense' and 'Savings'
pages = []
for this_page in range(len(pdf_document)):
    page = pdf_document.loadPage(this_page)
    if page.searchFor(keyword_list_set):
        pages.append(this_page)

pdf = PdfFileReader(pdf_file_path)
pdfWriter = PdfFileWriter()
for page_num in pages:
    pdfWriter.addPage(pdf.getPage(page_num))
with open(pdf_output_path, 'wb') as f:  # the with block closes the file
    pdfWriter.write(f)
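
One way the "all keywords of a set must be on the page" requirement might be expressed is sketched below. The search call takes a single string, so the expression `'Profit' and 'Loss', ...` in the snippet above does not describe keyword sets; here each set is a tuple and a page qualifies when every keyword of at least one set occurs in its text. The names `matching_sets`, `pages_matching_sets` and `extract_pages` are hypothetical, matching is a case-insensitive substring test, the page copy uses PyMuPDF instead of PyPDF2, and the method names are those of recent PyMuPDF releases (`get_text`, `insert_pdf`).

```python
def matching_sets(text, keyword_sets):
    """Return the keyword sets whose keywords ALL occur in `text` (case-insensitive)."""
    lowered = text.lower()
    return [ks for ks in keyword_sets if all(k.lower() in lowered for k in ks)]


def pages_matching_sets(doc, keyword_sets):
    """Return the page numbers where at least one keyword set matches completely."""
    return [page.number for page in doc
            if matching_sets(page.get_text(), keyword_sets)]


def extract_pages(src_path, out_path, keyword_sets):
    """Copy the matching pages of one PDF into a new PDF."""
    import fitz  # PyMuPDF; imported here so the helpers above need no dependency

    src = fitz.open(src_path)
    out = fitz.open()  # empty output PDF
    for pno in pages_matching_sets(src, keyword_sets):
        out.insert_pdf(src, from_page=pno, to_page=pno)
    if out.page_count > 0:  # skip saving when nothing matched
        out.save(out_path)
    src.close()


# Keyword sets taken from the example in the comment above
keyword_sets = [("Profit", "Loss"), ("Income", "Expense", "Savings")]
```

A page containing only "Profit" would not be selected here, because neither set is complete on that page.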

@PoojaAkolkar

I need to search keywords and get only small summary on this keyword related by large pdf.

@JorjMcKie
Collaborator

I need to search keywords and get only small summary on this keyword related by large pdf.

@PoojaAkolkar - please be more specific. I don't understand what your problem is.
