A few simple Python scripts to extract text from PDF files

damianodamiani/PDF-to-Text


PDF-to-Text

A few simple Python scripts to extract text from text-based or OCR-ed PDF files:

PDF-to-Text-A

This script searches only the specified directory (not its subdirectories) for PDF files, extracts their text, and saves each file's text as an individual text file in the specified output directory.
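As a rough illustration of this behaviour, here is a minimal sketch. It is not the repository's actual script. It assumes a PyPDF2 version that provides `PdfReader` and `page.extract_text()`, and the function names `extract_pdf_text`, `list_pdfs` and `convert_each` are hypothetical.

```python
import os


def extract_pdf_text(pdf_path):
    """Return the text of every page in one PDF, joined by newlines."""
    # Imported lazily; assumes PyPDF2 is installed (pip install PyPDF2).
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may yield nothing for image-only pages, hence `or ""`.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def list_pdfs(pdf_directory):
    """List PDF files directly inside pdf_directory (no subdirectories)."""
    return sorted(
        os.path.join(pdf_directory, name)
        for name in os.listdir(pdf_directory)
        if name.lower().endswith(".pdf")
        and os.path.isfile(os.path.join(pdf_directory, name))
    )


def convert_each(pdf_directory, output_directory, extract=extract_pdf_text):
    """Write one .txt file per PDF and return the paths that were written."""
    os.makedirs(output_directory, exist_ok=True)
    written = []
    for pdf_path in list_pdfs(pdf_directory):
        base = os.path.splitext(os.path.basename(pdf_path))[0]
        out_path = os.path.join(output_directory, base + ".txt")
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(extract(pdf_path))
        written.append(out_path)
    return written
```

Calling `convert_each(pdf_directory, output_directory)` would then process every PDF at the top level of `pdf_directory`. The `extract` parameter exists only so the file handling can be tried without real PDFs.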

PDF-to-Text-B

This script searches only the specified directory (not its subdirectories) for PDF files, extracts their text, and combines it into a single text file saved in the specified output directory.
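The combining variant could look roughly like the sketch below. Again this is an assumption-laden illustration, not the repository's code: `combine_directory` is a hypothetical name, PDFs are appended in alphabetical order (the original ordering is not documented), and the PyPDF2 calls assume a version with `PdfReader`.

```python
import os


def extract_pdf_text(pdf_path):
    """Return the text of every page in one PDF, joined by newlines."""
    # Imported lazily; assumes PyPDF2 is installed (pip install PyPDF2).
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def combine_directory(pdf_directory, output_directory,
                      combined_name="combined_text.txt",
                      extract=extract_pdf_text):
    """Append the text of each top-level PDF to one combined file."""
    os.makedirs(output_directory, exist_ok=True)
    combined_path = os.path.join(output_directory, combined_name)
    # Alphabetical order is an assumption made for reproducible output.
    pdf_names = sorted(
        n for n in os.listdir(pdf_directory) if n.lower().endswith(".pdf")
    )
    with open(combined_path, "w", encoding="utf-8") as out:
        for name in pdf_names:
            out.write(extract(os.path.join(pdf_directory, name)))
            out.write("\n")  # keep documents from running together
    return combined_path
```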

PDF-to-Text-C

This script searches the specified directory and all its subdirectories for PDF files, extracts their text, and saves each file's text as an individual text file in the specified output directory.
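The recursive search can be sketched with `os.walk`, as below. The helper names are hypothetical. The naming scheme in `output_path_for` is also an assumption, added because two PDFs with the same name in different subdirectories would otherwise overwrite each other in a single output directory. How the actual script handles that case is not stated here.

```python
import os


def find_pdfs_recursive(pdf_directory):
    """Return every PDF under pdf_directory, including subdirectories."""
    found = []
    for root, _dirs, files in os.walk(pdf_directory):
        for name in files:
            if name.lower().endswith(".pdf"):
                found.append(os.path.join(root, name))
    return sorted(found)


def output_path_for(pdf_path, pdf_directory, output_directory):
    """Build a .txt path whose name encodes the PDF's relative location."""
    rel = os.path.relpath(pdf_path, pdf_directory)
    # "sub/report.pdf" becomes "sub_report.txt" (hypothetical convention).
    stem = os.path.splitext(rel)[0].replace(os.sep, "_")
    return os.path.join(output_directory, stem + ".txt")
```

Each path returned by `find_pdfs_recursive` can then be passed to a PyPDF2-based extraction function, with the result written to `output_path_for(...)`.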

PDF-to-Text-D

This script searches the specified directory and all its subdirectories for PDF files, extracts their text, and combines it into a single text file saved in the specified output directory.
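For comparison, a recursive combine can also be written with `pathlib`. The sketch below is illustrative only: `combine_recursive` is a hypothetical name, and the `=====` header line written before each document is an added assumption (the original script may not write one). The PyPDF2 usage assumes a version with `PdfReader`.

```python
from pathlib import Path


def extract_pdf_text(pdf_path):
    """Return the text of every page in one PDF, joined by newlines."""
    # Imported lazily; assumes PyPDF2 is installed (pip install PyPDF2).
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def combine_recursive(pdf_directory, output_directory,
                      combined_text_file_name="combined_text.txt",
                      extract=extract_pdf_text):
    """Combine the text of all PDFs under pdf_directory into one file."""
    out_dir = Path(output_directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    combined_path = out_dir / combined_text_file_name
    # Filter on the lower-cased suffix so ".PDF" files are also picked up.
    pdfs = sorted(
        p for p in Path(pdf_directory).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    with combined_path.open("w", encoding="utf-8") as out:
        for pdf in pdfs:
            # Header marking where each document starts (an assumption).
            rel = pdf.relative_to(pdf_directory).as_posix()
            out.write(f"===== {rel} =====\n")
            out.write(extract(pdf))
            out.write("\n")
    return combined_path
```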

How to use

PDF-to-Text-A

  1. Open the Python script in your code editor.
  2. In pdf_directory = '/path/to/pdf/files' replace /path/to/pdf/files with the actual directory path.
  3. In output_directory = '/path/to/output/directory' replace /path/to/output/directory with the desired output directory path.
  4. Save the script and you're ready to go.

PDF-to-Text-B

  1. Open the Python script in your code editor.
  2. In pdf_directory = '/path/to/pdf/files' replace /path/to/pdf/files with the actual directory path.
  3. In output_directory = '/path/to/output/directory' replace /path/to/output/directory with the desired output directory path.
  4. Rename the output file 'combined_text.txt' as desired.
  5. Save the script and you're ready to go.

PDF-to-Text-C

  1. Open the Python script in your code editor.
  2. In pdf_directory = '/path/to/pdf/files' replace /path/to/pdf/files with the actual directory path.
  3. In output_directory = '/path/to/output/directory' replace /path/to/output/directory with the desired output directory path.
  4. Save the script and you're ready to go.

PDF-to-Text-D

  1. Open the Python script in your code editor.
  2. In pdf_directory = '/path/to/pdf/files' replace /path/to/pdf/files with the actual directory path.
  3. In output_directory = '/path/to/output/directory' replace /path/to/output/directory with the desired output directory path.
  4. In combined_text_file_name = 'combined_text.txt' rename the output file as desired.
  5. Save the script and you're ready to go.
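After editing, the settings at the top of a script would look something like the placeholders below. The `check_settings` helper is a hypothetical addition, not part of the original scripts. It is shown only as one way to catch a mistyped path before any PDF is processed.

```python
import os

# Placeholder values; replace them with your own paths and file name.
pdf_directory = '/path/to/pdf/files'
output_directory = '/path/to/output/directory'
combined_text_file_name = 'combined_text.txt'  # used by scripts B and D only


def check_settings(pdf_dir, out_dir, combined_name):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    if not os.path.isdir(pdf_dir):
        problems.append(f"PDF directory does not exist: {pdf_dir}")
    if os.path.isfile(out_dir):
        problems.append(f"Output path is a file, not a directory: {out_dir}")
    # The combined file name should be a bare name, not a path.
    if not combined_name or os.path.basename(combined_name) != combined_name:
        problems.append(f"Not a plain file name: {combined_name!r}")
    return problems
```

Running `check_settings(pdf_directory, output_directory, combined_text_file_name)` and printing the result before the extraction step gives an early warning if a path was left at its placeholder value.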

Requirements

To run any of these Python scripts, you need the PyPDF2 library installed. You can install it from your terminal using pip: pip install PyPDF2.
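To confirm the library is available before running a script, a small check such as the following can be used. It relies only on the standard library, and the helper name `has_module` is hypothetical.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be imported."""
    return importlib.util.find_spec(name) is not None


if not has_module("PyPDF2"):
    print("PyPDF2 is not installed. Install it with: pip install PyPDF2")
```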

Scripts written with the help of GPT-3.5.
