Refactoring of pdf_extract.py script
#114
Open
Description:
This PR refactors the pdf_extract.py script to improve the readability and maintainability of the code. In order not to affect the current code, the app.py script and the app_tools library have been created. app.py performs the same process as pdf_extract.py.
The app_tools library incorporates the refactorings of the different steps.
app_tools
|- pdf.py
|- layout_analysis.py
|- formula_analysis.py
|- ocr_analysis.py
|- table_analysis.py
|- visualize.py
|- config.py
|- utils.py
If you find it useful, app.py can take the place of pdf_extract.py.
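Before swapping one script for the other, a reviewer might want to confirm that both produce the same output for the same input. The following is only a sketch of such a check, assuming each script writes its results to a JSON file; the helper names `load_json` and `diff_paths` and the file names in the usage note are hypothetical and not part of this PR.

```python
import json
from pathlib import Path


def load_json(path):
    """Read a JSON result file (hypothetical helper, not part of app_tools)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def diff_paths(old, new, prefix="$"):
    """Return the JSON paths at which two parsed results differ."""
    if isinstance(old, dict) and isinstance(new, dict):
        diffs = []
        for key in sorted(set(old) | set(new)):
            here = f"{prefix}.{key}"
            if key not in old or key not in new:
                # Key present on one side only.
                diffs.append(here)
            else:
                diffs.extend(diff_paths(old[key], new[key], here))
        return diffs
    if isinstance(old, list) and isinstance(new, list):
        if len(old) != len(new):
            # Lists of different length are reported as a whole.
            return [prefix]
        diffs = []
        for i, (a, b) in enumerate(zip(old, new)):
            diffs.extend(diff_paths(a, b, f"{prefix}[{i}]"))
        return diffs
    # Scalars, or values of mismatched types.
    return [] if old == new else [prefix]
```

With two result files on disk, `diff_paths(load_json("old.json"), load_json("new.json"))` returning an empty list would suggest the refactored pipeline matches the original on that document. Floating-point scores from the models may need a tolerance rather than exact equality.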
Motivation:
I love the project, and I would like to thank you for the great work done.
The refactoring is a first step toward building an API with FastAPI and Docker.
Main changes:
app.py has been created with the pipeline of pdf_extract.py.
app_tools has been created; it contains the classes and methods that perform each step of the pipeline.
pdf.py: Provides a set of tools for working with PDF files.
layout_analysis.py: Analyzes document layout by detecting the layout of each page in a document image.
formula_analysis.py: Handles formula detection and recognition in images.
ocr_analysis.py: OCR processor, responsible for performing OCR recognition.
table_analysis.py: Table processor used for table recognition in documents.
visualize.py: Generates visualizations of the document layout.
config.py: Configures model parameters and logs.
utils.py: Saves results in JSON.
Functionality impact: No change to existing functionality is expected, as the refactoring does not introduce new features or modify existing ones.
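To picture how the modules listed under Main changes could fit together, here is a minimal sketch of a step-based pipeline. It is not the code in this PR: the `Pipeline` class, the placeholder step functions, and `save_json` are hypothetical stand-ins, pages are plain dicts rather than images, and the step order is simply assumed from the order of the module list.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable

# A page is a plain dict here; the real pipeline works on page images.
Page = dict
Step = Callable[[Page], Page]


@dataclass
class Pipeline:
    """Runs a fixed sequence of named steps over every page (hypothetical)."""
    steps: list = field(default_factory=list)

    def add(self, name: str, step: Step) -> "Pipeline":
        self.steps.append((name, step))
        return self

    def run(self, pages):
        results = []
        for page in pages:
            for _name, step in self.steps:
                page = step(page)  # each step returns a new dict
            results.append(page)
        return results


# Placeholder steps standing in for the app_tools modules.
def detect_layout(page: Page) -> Page:      # layout_analysis.py
    return {**page, "layout": []}


def detect_formulas(page: Page) -> Page:    # formula_analysis.py
    return {**page, "formulas": []}


def run_ocr(page: Page) -> Page:            # ocr_analysis.py
    return {**page, "text": ""}


def recognize_tables(page: Page) -> Page:   # table_analysis.py
    return {**page, "tables": []}


def build_pipeline() -> Pipeline:
    # Step order is an assumption taken from the module list.
    return (Pipeline()
            .add("layout", detect_layout)
            .add("formula", detect_formulas)
            .add("ocr", run_ocr)
            .add("table", recognize_tables))


def save_json(results, path) -> None:       # utils.py
    text = json.dumps(results, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")
```

Keeping each step as a function from page to page means a single step can be tested or replaced without touching the others, which is the kind of separation the refactoring aims for.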
Instructions for Reviewers:
Review the app.py and app_tools scripts to ensure that the logic has been ported correctly.
Example of Use: