I need to search for multiple keywords in a pdf document and extract the specific pdf_pages, please kindly help me out in solving this. #693
I think you mean to search for each word independently of the others. I also assume you want to ignore upper or lower case, and that you are not interested in (1) the position of the words on each page, or (2) how many times a word occurs on each page. If these assumptions are correct, you have several options, the easiest one being:
import fitz

search_words = ("pixmap", "alpha", "outline", "rgb")  # words to look for, all in lower case
results = {}  # maps each search word to the set of page numbers it occurs on
doc = fitz.open("v110-changes.pdf")
for page in doc:
    # all words on the page, lowercased to match the search list above
    words = [w[4].lower() for w in page.getText("words")]
    for sword in search_words:  # loop through search list
        if sword in words:  # occurs as one of the words
            pages = results.get(sword, set())  # get set of page numbers so far
            pages.add(page.number)  # add this page number
            results[sword] = pages  # write back to results

# report the results
for word in results:
    result = list(map(str, sorted(results[word])))  # turn set of page numbers...
    page_list = ", ".join(result)  # ...into a comma-separated string
    print("Word '%s' occurs on pages %s." % (word, page_list))
Example output:
This script only counts a search word if it appears completely "isolated", meaning with a space before and after it. If you also want to count it when the search word is e.g. followed by punctuation, or is combined with other words, modify the script like this:
and the result is
So "rgb" also occurs in combination with other characters.
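The modified script and its output did not survive in this thread. As a rough sketch of the idea only (not necessarily the change that was posted), the whole-word test can be swapped for a substring test. The function name `pages_per_word` and the `words_by_page` input shape are assumptions made for this illustration; with PyMuPDF the word lists would come from `page.getText("words")` as in the script above.

```python
def pages_per_word(words_by_page, search_words, substring=False):
    """Map each search word to the sorted page numbers it occurs on.

    words_by_page: {page_number: [word, ...]}, a hypothetical input shape
    standing in for what page.getText("words") delivers per page.
    substring=False: the search word must equal a whole word on the page.
    substring=True:  the search word may be part of a longer word,
                     e.g. "rgb" inside "csRGB," or "pixmap" inside "pixmap."
    """
    results = {}
    for pno, words in words_by_page.items():
        lowered = [w.lower() for w in words]  # ignore upper / lower case
        for sword in search_words:
            target = sword.lower()
            if substring:
                hit = any(target in w for w in lowered)
            else:
                hit = target in lowered
            if hit:
                results.setdefault(sword, set()).add(pno)
    return {word: sorted(pages) for word, pages in results.items()}


# made-up sample data, not taken from any real PDF
sample = {0: ["The", "RGB", "pixmap."], 1: ["colorspace", "csRGB,", "alpha"]}
print(pages_per_word(sample, ("pixmap", "alpha", "rgb")))
print(pages_per_word(sample, ("pixmap", "alpha", "rgb"), substring=True))
```

With the substring test, "pixmap." (trailing period) now counts as a hit for "pixmap", which the whole-word test misses.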
pdf_document = fitz.open(pdf_file_path)
pdf = PdfFileReader(pdf_file_path)
For example, a page in a PDF document may contain “MATHS” as a search string, and the pages containing that string should be extracted from the document. I need to search for any of “MATHS”, “CALCULATIONS”, “GEOMETRY” or “ANALYTICAL”, in a way that works for multiple PDF files, and extract the pages containing these strings without duplicating pages.
I think I do not quite understand. Is it this:
If your question is how to copy pages between PDFs: use
For example:
pdf_list = ("file1.pdf", "file2.pdf", ...)
searchword = "MATH"
mathpdf = fitz.open()  # make empty PDF for containing MATH pages
for pdf in pdf_list:
    src = fitz.open(pdf)
    for page in src:
        for word in page.getText("words"):
            if searchword in word[4]:
                # copy this page to the output PDF
                mathpdf.insertPDF(src, from_page=page.number, to_page=page.number)
                break  # page is copied, skip its remaining words
mathpdf.save("math.pdf")
or
pdf_list = ("file1.pdf", "file2.pdf", ...)
searchword = "MATH"
mathpdf = fitz.open()  # make empty PDF for containing MATH pages
for pdf in pdf_list:
    src = fitz.open(pdf)
    for page in src:
        if page.searchFor(searchword):  # non-empty list of hit rectangles
            mathpdf.insertPDF(src, from_page=page.number, to_page=page.number)
mathpdf.save("math.pdf")
Yes, correct, this was exactly my question, but with some changes, so I am quoting it again. (Please note: "file1.pdf" may also contain a search keyword like "GEOMETRY", which is not of interest in this case. The new "Maths.pdf" output should contain only pages with table contents related to the "MATHS" keyword; pages containing the "GEOMETRY" keyword should be skipped and must not appear in "Maths.pdf".) In the same way, I need a new "Geometry.pdf" output that contains only the pages with table information related to the "GEOMETRY" keyword. (Similarly, pages containing the "MATHS" keyword should be skipped and must not appear in "Geometry.pdf".)
Well, based on my various snippets above, that is easy now, isn't it? For MATHS take file1.pdf only and copy the resp. pages to MATH.pdf; for GEOMETRY, walk through file2.pdf and do the same thing as sketched above, etc. for the other keywords. The key things to know are two: (1) how to detect whether the keyword is present on a page, and (2) how to copy an eligible page to the subset PDF.
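Putting the two key things together, a minimal sketch of the per-keyword split could look like this. The helper names (`keyword_on_page`, `plan_subsets`, `write_subsets`) and the choice to name each output after its keyword are assumptions for illustration, not something from the snippets above. The PyMuPDF calls are the ones already used in this thread (`getText`, `insertPDF`); newer PyMuPDF releases spell them `get_text` and `insert_pdf`.

```python
def keyword_on_page(words, keyword):
    """(1) Detect: True if keyword occurs inside any word, ignoring case."""
    key = keyword.lower()
    return any(key in w.lower() for w in words)


def plan_subsets(words_by_page, keywords):
    """Map each keyword to the page numbers containing it.

    words_by_page: {page_number: [word, ...]} (hypothetical input shape).
    Each page number appears at most once per keyword, in page order.
    """
    plan = {kw: [] for kw in keywords}
    for pno in sorted(words_by_page):
        for kw in keywords:
            if keyword_on_page(words_by_page[pno], kw):
                plan[kw].append(pno)
    return plan


def write_subsets(pdf_path, keywords):
    """(2) Copy: write one subset PDF per keyword, e.g. "MATHS.pdf".

    Requires PyMuPDF; imported here so the helpers above work without it.
    """
    import fitz  # PyMuPDF

    src = fitz.open(pdf_path)
    words_by_page = {p.number: [w[4] for w in p.getText("words")] for p in src}
    for kw, pages in plan_subsets(words_by_page, keywords).items():
        if not pages:
            continue  # nothing found, do not try to save an empty PDF
        out = fitz.open()  # empty PDF for this keyword
        for pno in pages:
            out.insertPDF(src, from_page=pno, to_page=pno)
        out.save(kw + ".pdf")


# made-up sample data to show the planning step on its own
demo = {0: ["MATHS", "table"], 1: ["GEOMETRY", "shapes"],
        2: ["Maths,", "again"], 3: ["nothing"]}
print(plan_subsets(demo, ("MATHS", "GEOMETRY")))
```

Note that a page containing both keywords ends up in both outputs. If the subsets must be mutually exclusive, filter such pages out of the plan before copying.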
Yes, exactly, thanks for the help. One last thing: I have edited the previous comment, kindly give suggestions.
You can use a code block, like
def a():
    return 1
to show your code; code pasted as plain text is just hard to read.
@sucanthudu - how are things going? Are you all set? Need more help?
Thanks for the help and support. All the above suggestions and snippets are really helping me. It is going well now, and if any doubts arise I will come back here.
@JorjMcKie Hope you are doing great. I need to find and match two or three keywords on a PDF page and extract that page from the PDF document. Please kindly help me out in solving this.
pdf_document = fitz.open(pdf_file_path)
pdf = PdfFileReader(pdf_file_path)
I need to search for keywords in a large PDF and get only a short summary related to each keyword.
@PoojaAkolkar - please be more specific. I don't understand what your problem is.
pdf_document = fitz.open(pdf_file_path)
search_item = "MATHS, CALCULATIONS, GEOMETRY, ANALYTICAL, VALUES".replace(", ", "|")
pages = []
for this_page in range(len(pdf_document)):
    page = pdf_document.loadPage(this_page)
    if page.searchFor(search_item):
        pages.append(this_page)
        print("%s found on page_no %i" % (search_item, this_page))
print(pages)