Refactoring of pdf_extract.py script #114

Open
wants to merge 8 commits into
base: main
from

AdevGarcia

Description:
This PR refactors the pdf_extract.py script to improve the readability and maintainability of the code.
To avoid touching the existing code, the refactoring lives in a new app.py script and a new app_tools library.
app.py runs the same pipeline as pdf_extract.py.
The app_tools library contains the refactored implementation of each pipeline step.

app_tools
|- pdf.py
|- layout_analysis.py
|- formula_analysis.py
|- ocr_analysis.py
|- table_analysis.py
|- visualize.py
|- config.py
|- utils.py

If you find it useful, you can replace pdf_extract.py with app.py.

Motivation:
I love the project and would like to thank you for the great work.
This refactoring is a first step toward building an API with FastAPI and Docker.

Main changes:

  • The app.py script has been created; it runs the same pipeline as pdf_extract.py.
  • The app_tools library has been created; it contains the classes and methods that perform each step of the pipeline.
  • pdf.py: Provides a set of tools for working with PDF files.
  • layout_analysis.py: Detects the layout of each page image in a document.
  • formula_analysis.py: Handles formula detection and recognition in images.
  • ocr_analysis.py: OCR processor, responsible for performing OCR recognition.
  • table_analysis.py: Table processor, used for table recognition in documents.
  • visualize.py: Generates visualizations of the document layout.
  • config.py: Configures model parameters and logging.
  • utils.py: Saves results as JSON.
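To make the module split easier to picture, here is a minimal sketch of how the steps could be chained. The class names `LayoutAnalyzer`, `FormulaProcessor`, `OCRProcessor`, and `TableProcessor` come from the commit messages in this PR, but the method names (`analyze`, `process`), the result dictionary, and the stub bodies are assumptions made for illustration only. They are not the actual app_tools API.

```python
from typing import Any, Dict, List


# Hypothetical stand-ins for the app_tools classes. The real classes load
# models; these stubs only show how the steps could be composed.
class LayoutAnalyzer:
    def analyze(self, page: str) -> Dict[str, Any]:
        # A real implementation would run layout detection on the page image.
        return {"page": page, "regions": []}


class FormulaProcessor:
    def process(self, result: Dict[str, Any]) -> Dict[str, Any]:
        result["formulas"] = []  # placeholder for detected formulas
        return result


class OCRProcessor:
    def process(self, result: Dict[str, Any]) -> Dict[str, Any]:
        result["text"] = []  # placeholder for recognized text spans
        return result


class TableProcessor:
    def process(self, result: Dict[str, Any]) -> Dict[str, Any]:
        result["tables"] = []  # placeholder for recognized tables
        return result


def run_pipeline(pages: List[str]) -> List[Dict[str, Any]]:
    """Run layout analysis first, then each later step, for every page."""
    layout = LayoutAnalyzer()
    steps = [FormulaProcessor(), OCRProcessor(), TableProcessor()]
    results = []
    for page in pages:
        result = layout.analyze(page)
        for step in steps:
            result = step.process(result)
        results.append(result)
    return results
```

Keeping each step behind the same small interface is one way to let a future API endpoint reuse individual steps without running the whole pipeline.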

Functionality impact: No change to existing functionality is expected, as the refactoring does not introduce new features or modify existing ones.

Instructions for Reviewers:

  • Review the app.py and app_tools scripts to ensure that the logic has been ported correctly.
  • Verify that there are no observable changes in the system's behavior when running the tests.
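One way to check for behavioral changes is to run both scripts on the same PDF and compare the JSON files they write. The sketch below assumes each script saves its results as a JSON file; the helper names `load_results` and `results_match` are hypothetical and are not part of this PR.

```python
import json
from pathlib import Path
from typing import Any, Union

PathLike = Union[str, Path]


def load_results(path: PathLike) -> Any:
    """Load one JSON results file."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def results_match(old_path: PathLike, new_path: PathLike) -> bool:
    """Return True when both result files decode to equal JSON values."""
    return load_results(old_path) == load_results(new_path)
```

An exact comparison can be too strict if the models produce slightly different floating-point scores between runs. In that case, a tolerance-based comparison of numeric fields may be more appropriate.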

Example of Use:

python app.py --pdf 1706.03762.pdf
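The commit notes mention that argparse handles the command line arguments. A minimal sketch of how the `--pdf` flag from the example above could be parsed is shown here. The `build_parser` helper, the description, and the help text are assumptions, and the real app.py may define more options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the --pdf flag shown in the usage example."""
    parser = argparse.ArgumentParser(
        description="Run the PDF extraction pipeline (illustrative sketch)."
    )
    # --pdf appears in the PR's usage example; making it required is an assumption.
    parser.add_argument("--pdf", required=True, help="Path to the input PDF file")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch can be
# run anywhere without command line input.
args = build_parser().parse_args(["--pdf", "1706.03762.pdf"])
print(args.pdf)
```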

Added detailed logging configurations to improve visibility and debugging. Refactored PDF handling and processing into separate utility functions for better code organization and maintainability.
Relocated logging configuration into utils/config.py and moved model initialization functions to utils/model_tools.py. Additionally, separated detection and recognition functionality into distinct modules to enhance code readability and modularity.
Separated OCR recognition and table recognition into distinct functions. This improves code readability and maintainability by isolating each recognition task, enabling easier debugging and future enhancements.
Replaced standalone functions in `pdf_tools.py` with a new `PDFProcessor` class to encapsulate PDF processing logic. Adjusted `app.py` to use the new `PDFProcessor` class methods, improving code organization and maintainability.
Deleted redundant utility files and integrated functionality into new, focused modules under `app_tools`. Introduced `TableProcessor`, `LayoutAnalyzer`, `FormulaProcessor`, and `OCRProcessor` classes to handle specific operations. Updated `app.py` to reflect these changes and streamline the process flow.
Refactored several Python modules to simplify documentation strings and improve readability. Added argparse to app.py for better handling of command line arguments. Improved error handling and logging in several files. Revised documentation.
Updated library versions for consistency and reproducibility. Added new dependencies: torch, torchvision, numpy, opencv-python, Pillow, PyYAML, and pytz.