I need to search for multiple keywords in a pdf document and extract the specific pdf_pages, please kindly help me out in solving this. #693

sucanthudu · 2020-10-20T08:13:59Z

pdf_document = fitz.open(pdf_file_path)
search_item = "MATHS, CALCULATIONS, GEOMETRY, ANALYTICAL, VALUES".replace(", ", "|")
pages = []
for this_page in range(len(pdf_document)):
    page = pdf_document.loadPage(this_page)
    if page.searchFor(search_item):
        pages.append(this_page)
        print("%s found on page_no %i" % (search_item, this_page))

print(pages)

JorjMcKie · 2020-10-20T09:29:10Z

I think you mean to search for each word independently of the others. I also assume you want to ignore upper or lower case, and that you are not interested in (1) the position of the words on each page, or (2) how many times a word occurs on each page.

If these assumptions are correct, you have several options, the easiest ones being page.searchFor() and page.getText("words").
The best performance in your case is possible with page.getText("words").

import fitz

search_words = ("pixmap", "alpha", "outline", "rgb")  # words to consider, choose one of lower or upper case
results = {}
doc = fitz.open("v110-changes.pdf")
for page in doc:
    words = [w[4].lower() for w in page.getText("words")]  # all words on page (upper, lower depending on above!)
    for sword in search_words:  # loop through search list
        if sword in words:  # occurs as one of the words
            pages = results.get(sword, set())  # get set of page numbers so far
            pages.add(page.number)  # add this page number
            results[sword] = pages  # write back to results

# report the results
for word in results:
    result = list(map(str, results[word]))  # turn set of page numbers...
    page_list = ", ".join(result)  # to a comma-separated string
    print("Word '%s' occurs on pages %s." % (word, page_list))

Example output:

Word 'pixmap' occurs on pages 0.
Word 'alpha' occurs on pages 0.
Word 'outline' occurs on pages 0, 1.
Word 'rgb' occurs on pages 2.

This script only counts a search word if it appears completely "isolated", meaning with a space before and after it. If you also want to count cases where the search word is, e.g., followed by punctuation or combined with other words, modify the script like this:

import fitz

search_words = ("pixmap", "alpha", "outline", "rgb")
results = {}
doc = fitz.open("v110-changes.pdf")
for page in doc:
    words = [w[4].lower() for w in page.getText("words")]
    for sword in search_words:
        for word in words:
            if sword in word:  # search word is *part* of a word on the page
                pages = results.get(sword, set())
                pages.add(page.number)
                results[sword] = pages

for word in results:
    result = list(map(str, results[word]))
    page_list = ", ".join(result)
    print("Word '%s' occurs on pages %s." % (word, page_list))

and the result is

Word 'pixmap' occurs on pages 0.
Word 'alpha' occurs on pages 0.
Word 'outline' occurs on pages 0, 1.
Word 'rgb' occurs on pages 0, 2.

So "rgb" also occurs in combination with other characters.
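
A possible middle ground between the two scripts above is to keep matching whole words but tolerate adjoining punctuation. The following is only a minimal sketch under that assumption: the helper names `normalize` and `found_words` and the sample word list are made up for illustration, and the input is a plain list of strings such as the `w[4]` items from page.getText("words").

```python
import string


def normalize(word):
    """Lower-case a word and strip leading/trailing punctuation."""
    return word.strip(string.punctuation).lower()


def found_words(page_words, search_words):
    """Return the search words that occur as whole words on a page."""
    normalized = {normalize(w) for w in page_words}
    return {s for s in search_words if s.lower() in normalized}


# Hypothetical words, as they might come from page.getText("words")
page_words = ["The", "Pixmap,", "uses", "(RGB)", "or", "rgba", "data."]
print(sorted(found_words(page_words, ("pixmap", "alpha", "rgb"))))
# ['pixmap', 'rgb']
```

Here "Pixmap," and "(RGB)" count as hits, while "rgba" does not, because only punctuation at the ends of a word is removed.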

sucanthudu · 2020-10-20T11:26:53Z

pdf_document = fitz.open(pdf_file_path)
search_item = "MATHS, CALCULATIONS, GEOMETRY, ANALYTICAL, VALUES".replace(", ", "|")
pages = []
for this_page in range(len(pdf_document)):
    page = pdf_document.loadPage(this_page)
    if page.searchFor(search_item):
        pages.append(this_page)
        print("%s found on page_no %i" % (search_item, this_page))
print(pages)

pdf = PdfFileReader(pdf_file_path)
pdfWriter = PdfFileWriter()
for page_num in pages:
    pdfWriter.addPage(pdf.getPage(page_num))
with open(pdf_output_path, 'wb') as f:
    pdfWriter.write(f)
f.close()

Thanks for the reply, that was great. But my objective is to search for just one particular string, for example “MATHS” or “CALCULATIONS” or “GEOMETRY” or “ANALYTICAL” or “VALUES”, in many huge PDF documents of more than 300 pages each and, after finding that one string, to extract just the page that contains it from those documents.

For example, in one PDF document a page may contain “MATHS”; using that search string, the matching pages should be extracted from the document.
In the same way, in another PDF document a page may contain “GEOMETRY”, and that particular page should be extracted using this search string.

I should be able to search for either “MATHS” or “CALCULATIONS” or “GEOMETRY” or “ANALYTICAL” across multiple PDF files and extract the pages containing these strings, without duplicate pages.

JorjMcKie · 2020-10-20T11:37:49Z

I think I do not quite understand. Is it this:
Take MATHS and take every page from every PDF where it occurs and put that page in a new MATH pdf.
Same with all the other search words?

JorjMcKie · 2020-10-20T11:50:22Z

If your question is, how to copy pages between PDFs: use doc.insertPDF(sourcePDF, from_page=n, to_page=n) to copy a page from sourcePDF that contains a desired content.

JorjMcKie · 2020-10-20T11:57:07Z

For example:

pdf_list = ("file1.pdf", "file2.pdf", ...)
searchword = "MATH"
mathpdf = fitz.open()  # make empty PDF for containing MATH pages
for pdf in pdf_list:
    src = fitz.open(pdf)
    for page in src:
        for word in page.getText("words"):
            if searchword in word[4]:
                mathpdf.insertPDF(src, from_page=page.number, to_page=page.number)
                break  # one hit is enough, so each page is copied only once
mathpdf.save("math.pdf")

JorjMcKie · 2020-10-20T12:00:13Z

or

pdf_list = ("file1.pdf", "file2.pdf", ...)
searchword = "MATH"
mathpdf = fitz.open()  # make empty PDF for containing MATH pages
for pdf in pdf_list:
    src = fitz.open(pdf)
    for page in src:
        if page.searchFor(searchword) != []:
            mathpdf.insertPDF(src, from_page=page.number, to_page=page.number)

mathpdf.save("math.pdf")

sucanthudu · 2020-10-20T14:37:04Z

> I think I do not quite understand. Is it this:
> Take MATHS and take every page from every PDF where it occurs and put that page in a new MATH pdf.
> Same with all the other search words?

Yes, correct, that was exactly my question, but with some changes, so I am stating it again:
"MATHS" will be present as a unique keyword in "file1.pdf", "GEOMETRY" will be present as a unique keyword in "file2.pdf", and so on. My search words will be ("MATHS","GEOMETRY",......). Now I need a new "Maths.pdf" output document which contains only the pages with content related to the "MATHS" keyword.

(Please note: "file1.pdf" may also contain a search keyword like "GEOMETRY", but those pages are not of interest in this case. So the new "Maths.pdf" output document should contain only pages with table contents related to the "MATHS" keyword; "GEOMETRY" should be skipped, and pages containing the "GEOMETRY" keyword should not appear in "Maths.pdf".)

In the same way, I need a new "Geometry.pdf" output document which contains only the pages with table information related to the "GEOMETRY" keyword. (Similarly, pages with the "MATHS" keyword should be skipped and should not appear in "Geometry.pdf".)

JorjMcKie · 2020-10-20T14:45:20Z

Well, based on my various snippets above, that is easy now, isn't it? For MATHS, take file1.pdf only and copy the respective pages to MATH.pdf; for GEOMETRY, walk through file2.pdf and do the same thing as sketched above; and so on for the other keywords.

The key things to know are two: (1) how to detect whether the keyword is present on a page, and (2) how to copy an eligible page to the subset PDF.

sucanthudu · 2020-10-20T14:57:17Z

> The key things to know are two: (1) how to detect whether the keyword is present on a page, and (2) how to copy an eligible page to the subset PDF.

Yes, exactly, thanks for the help. One last thing: I have edited my previous comment, so please give your suggestions on it.

JorjMcKie · 2020-10-20T15:05:07Z

Detecting a keyword on a page is done by checking page.searchFor(keyword) != []. If this condition is met, you could additionally make sure that page.searchFor(other) == [] for all other keywords if this is what you want.
Once you know you want a page, copy it via (example) doc.insertPDF(sourcepdf, from_page=page.number, to_page=page.number), where doc is the document, which you will later save as MATH.pdf.
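
The selection rule described here (keyword present, all other keywords absent) can be sketched in plain Python. This is only an illustration under stated assumptions: `wanted_pages` and the sample dictionary are hypothetical names, and the function works on word lists rather than on a PDF. It also uses substring matching on words, which may not behave exactly like page.searchFor().

```python
def wanted_pages(page_words, keyword, other_keywords=()):
    """Return page numbers that contain `keyword` but none of `other_keywords`.

    page_words: dict mapping page number -> list of words on that page
    (hypothetical input, e.g. built from page.getText("words")).
    """
    keyword = keyword.lower()
    others = [k.lower() for k in other_keywords]
    selected = []
    for pno in sorted(page_words):
        words = [w.lower() for w in page_words[pno]]
        has_keyword = any(keyword in w for w in words)
        has_other = any(o in w for o in others for w in words)
        if has_keyword and not has_other:
            selected.append(pno)
    return selected


sample = {
    0: ["MATHS", "table", "one"],
    1: ["MATHS", "and", "GEOMETRY"],
    2: ["GEOMETRY", "only"],
}
print(wanted_pages(sample, "MATHS", ["GEOMETRY"]))  # [0]
print(wanted_pages(sample, "GEOMETRY", ["MATHS"]))  # [2]
```

With the calls used in this thread, `page_words` could be built as `{page.number: [w[4] for w in page.getText("words")] for page in src}`, and each returned page number would then be passed to insertPDF as both from_page and to_page.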

utmcontent · 2020-10-20T16:18:38Z

You can use
```python
def function():
    return 1
```

def a():
    return 1

to show your code; code pasted as plain text is just hard to read.

JorjMcKie · 2020-10-21T13:51:32Z

@sucanthudu - how are things going? Are you all set? Need more help?

sucanthudu · 2020-10-22T12:49:13Z

Thanks for the help and support. All the above suggestions and snippets are really helping me. It is going well now, and if any doubts arise I will ask here.

sucanthudu · 2023-02-27T10:29:26Z

@JorjMcKie Hope you are doing great. I need to find and match two or three keywords on a PDF page and extract that page from the PDF document. Please help me solve this.
keyword_list = ['Profit', 'Loss', 'Income', 'Expense', 'Savings']
For example: PDF page 1 will contain Profit and Loss; using these two keywords, 'Profit' and 'Loss', I need to extract page 1.
PDF page 2 will contain Income, Expense and Savings; using all three keywords, 'Income', 'Expense' and 'Savings', I need to extract page 2.
In this way I have a bag-of-words pattern for each page, and I need to extract pages based on those patterns.
Please suggest.

pdf_document = fitz.open(pdf_file_path)
keyword_list_set = 'Profit' and 'Loss', 'Income' and 'Expense' and 'Savings'
pages = []
for this_page in range(len(pdf_document)):
    page = pdf_document.loadPage(this_page)
    if page.searchFor(keyword_list_set):
        pages.append(this_page)

pdf = PdfFileReader(pdf_file_path)
pdfWriter = PdfFileWriter()
for page_num in pages:
    pdfWriter.addPage(pdf.getPage(page_num))
with open(pdf_output_path, 'wb') as f:
    pdfWriter.write(f)
f.close()
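
One way to approach the "all keywords of a set must be on the page" requirement is sketched below. It is not taken from an answer in this thread. Note that in Python the expression 'Profit' and 'Loss' evaluates to just 'Loss', so the keyword sets need to be kept as separate tuples. The function name `pages_matching_all` and the sample texts are hypothetical, and the input is assumed to be plain page text, such as page.getText() might return.

```python
def pages_matching_all(page_texts, keyword_sets):
    """Return page numbers whose text contains every keyword of at least one set.

    page_texts: dict mapping page number -> page text (hypothetical input).
    keyword_sets: iterable of keyword tuples; a page qualifies if all
    keywords of any one tuple occur in its text (case-insensitive).
    """
    selected = []
    for pno in sorted(page_texts):
        text = page_texts[pno].lower()
        for keywords in keyword_sets:
            if all(k.lower() in text for k in keywords):
                selected.append(pno)
                break  # avoid adding the same page twice
    return selected


keyword_sets = [("Profit", "Loss"), ("Income", "Expense", "Savings")]
sample = {
    0: "Statement of Profit and Loss",
    1: "Income, Expense and Savings overview",
    2: "Income and Profit summary",
}
print(pages_matching_all(sample, keyword_sets))  # [0, 1]
```

Page 2 is skipped because it completes neither set. The returned page numbers could then be fed to the page-copying step already shown in this thread.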

PoojaAkolkar · 2023-08-02T06:32:09Z

I need to search keywords and get only small summary on this keyword related by large pdf.

JorjMcKie · 2023-08-03T11:53:01Z

> I need to search keywords and get only small summary on this keyword related by large pdf.

@PoojaAkolkar - please be more specific. I don't understand what your problem is.

sucanthudu added the question label Oct 20, 2020

sucanthudu assigned JorjMcKie Oct 20, 2020

JorjMcKie closed this as completed Oct 22, 2020
