feat: Reduce Unstructured IO image size (to speed up document processing) (#557)
charles-marion authored Sep 4, 2024
1 parent cc88f74 commit 0204f91
Showing 2 changed files with 19 additions and 3 deletions.
1 change: 1 addition & 0 deletions lib/shared/file-import-batch-job/requirements.txt
@@ -16,3 +16,4 @@ beautifulsoup4==4.12.2
 requests==2.32.2
 attrs==23.1.0
 feedparser==6.0.11
+PyJWT==2.9.0
21 changes: 18 additions & 3 deletions lib/shared/file-import-dockerfile
@@ -1,9 +1,24 @@
-FROM quay.io/unstructured-io/unstructured:0.11.2
+FROM quay.io/unstructured-io/unstructured:0.15.6 as source

+USER root
+# Remove training data
+RUN rm -rf /usr/local/share/tessdata

+#Remove large packages that are not used. Docker image does not support GPUs.
+#Related ticket https://github.com/Unstructured-IO/unstructured/issues/2976
+RUN pip uninstall -y `pip freeze | grep torch` && pip uninstall -y `pip freeze | grep nvidia`
+# Torch is needed for image analysis in pdfs (using CPU version)
+RUN pip install torch==2.3.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

+# Remove previous layers to create a smaller image
+FROM scratch
+COPY --from=source / /

+USER notebook-user

 WORKDIR /app
 COPY file-import-batch-job/requirements.txt requirements.txt
-RUN pip install -r requirements.txt

+RUN pip install -r requirements.txt && rm -rf example-docs test_unstructured
 COPY layers/python-sdk/python/ .
 COPY file-import-batch-job/main.py ./main.py
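
The uninstall step in the Dockerfile selects packages by filtering `pip freeze` output with `grep`. The sketch below illustrates what that filter picks up, using a made-up package list rather than the real contents of the base image. The function names `sample_freeze` and `select_gpu_packages`, and every version number in the sample except the two that appear in this commit, are assumptions for illustration only.

```shell
#!/bin/sh
# Hypothetical `pip freeze` output; the real list depends on the base image.
sample_freeze() {
cat <<'EOF'
numpy==1.26.4
nvidia-cudnn-cu12==8.9.2.26
torch==2.3.0
torchvision==0.18.0
unstructured==0.15.6
EOF
}

# Same substring match as the Dockerfile, with both patterns combined:
# keeps every line that contains "torch" or "nvidia".
select_gpu_packages() {
  grep -E 'torch|nvidia'
}

# Print the packages that would be passed to `pip uninstall -y`.
sample_freeze | select_gpu_packages
```

Because the match is on a substring, any package whose name contains `torch` (for example `torchvision`, if it is installed) is selected as well, while the Dockerfile reinstalls only `torch` itself. If a pattern matches nothing, `pip uninstall -y` is called without arguments and exits with an error, so the step assumes both patterns match at least one installed package.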
