Getting frequent 504s with medium sized PDFs and ocr_only and hi_res strategies #393

Open
mennthor opened this issue Mar 18, 2024 · 8 comments

Comments

@mennthor

mennthor commented Mar 18, 2024

Describe the bug
I am sending a single PDF (this one: https://arxiv.org/abs/2310.12931, embedded text, 39 pages) to the self-hosted API, with either the ocr_only or the hi_res strategy.
The server has 32GB RAM, 8 CPU cores and a CUDA-enabled GPU; resource usage stays below 20% CPU and 5% RAM while processing the PDF.
The Unstructured API version is v0.0.61.

The server responds with 504 after some 20 to 30s, and the client call via partition_via_api keeps retrying for some time.
In the server logs I can see that each new request is properly worked on, printing out '[...] unstructured INFO Processing entire page OCR [...]', and the server does not crash.
However, the client has already detached by then, so when the server is done processing, the result is just discarded.

On the client side I get

INFO: Response status code: 504 Retry attempt #1. Sleeping 1.4 seconds before retry.
INFO: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>

leaving the process running on the server and triggering a new one, which gets detached in a similar fashion.

I tested the same call on the officially hosted API and the result was there after about 5m 30s.

Question: Is there some setting I can change on the server side to avoid that? It obviously works on the officially hosted service, so the answer should be yes. Could you point me in the right direction?

To Reproduce
Calling the API like this:

from unstructured.partition.api import partition_via_api
API_BASE_URL = "....."

result = partition_via_api(
    "files/2310.12931.pdf",
    strategy="ocr_only",
    api_url=f"{API_BASE_URL}/general/v0/general/",
)
  • Filetype: PDF (see above for exact file)
  • Any additional API parameters: The server runs with UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB="4096"

Environment:

  • Using the hosted API or self hosting?
    • using the self-hosted API
  • How are you calling the API? (Langchain, SDKs, cUrl, etc.)
    • Calling via Unstructured API SDK partition_via_api

Additional context
Update: The same happens when calling the API with cURL like so:

curl -X "POST" "$API_BASE_URL/general/v0/general" \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F files=@files/2310.12931.pdf \
-F 'strategy=ocr_only'

minus the retry part, because cURL simply stops after a single attempt.
On the server side the behaviour stays the same, the process keeps running until properly finished and is then discarded because the client detached.

@mennthor
Author

Hi :)
does someone have an idea what might be going wrong here?

@awalker4
Collaborator

awalker4 commented May 4, 2024

Hi, sorry for the delay here. I think the first place to check would be your nginx config if you have access to it. There's usually some server timeout value that can be increased. In general, our advice for hi_res or ocr pdfs has been to split up the file and send smaller pages in parallel, since these long lived requests can be all sorts of trouble as you're seeing. Our client code supports split_pdf_page=True, which should also work in partition_via_api. More details in the python-client readme. Let me know if this works!
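
For anyone splitting manually instead of relying on split_pdf_page, here is a minimal sketch of the "split and send in parallel" idea using only the standard library. The names page_ranges, partition_in_parallel and send_chunk are hypothetical and not part of the Unstructured SDK; send_chunk stands in for whatever code extracts the given pages and posts them to the API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Tuple


def page_ranges(total_pages: int, pages_per_chunk: int) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (first, last) page ranges covering the document."""
    if total_pages < 0 or pages_per_chunk < 1:
        raise ValueError("total_pages must be >= 0 and pages_per_chunk >= 1")
    return [
        (first, min(first + pages_per_chunk - 1, total_pages))
        for first in range(1, total_pages + 1, pages_per_chunk)
    ]


def partition_in_parallel(
    ranges: List[Tuple[int, int]],
    send_chunk: Callable[[int, int], Any],  # hypothetical: uploads pages first..last
    max_workers: int = 4,
) -> List[Any]:
    """Call send_chunk(first, last) for every range concurrently.

    pool.map yields results in input order, so the chunks stay in page order
    even if later chunks finish first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: send_chunk(*r), ranges))
```

For the 39-page PDF from this issue, page_ranges(39, 10) gives four chunks, each of which should finish well before a short proxy timeout that the full document would hit.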

@mennthor
Author

mennthor commented May 6, 2024

Thx :)
I'll try it and give feedback

@mennthor
Author

Just a short update.
I'm having trouble using the split option, but only because of certificate errors (our network uses custom certificates).
I tried both the linked version and another one that splits manually and sends the parts via a concurrent.futures thread pool, but neither attempt works.
This definitely has nothing to do with Unstructured, and I'll try to figure this out so I can properly test the suggested PDF splitting.
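
In case it helps others on networks with a private CA: a minimal sketch, using only the standard library, of building a TLS context that trusts a custom CA bundle. build_tls_context and the ca_bundle path are hypothetical, and whether the SDK's HTTP layer can be handed such a context depends on the client version, so treat this as a way to test connectivity rather than a fix for the SDK itself.

```python
import ssl
from typing import Optional


def build_tls_context(ca_bundle: Optional[str] = None) -> ssl.SSLContext:
    """Create a certificate-verifying TLS context.

    ca_bundle is a hypothetical path to a PEM file containing the internal CA.
    When it is None, the system's default trust store is used instead.
    """
    # create_default_context enables hostname checking and requires a valid
    # certificate chain, so verification stays on rather than being disabled.
    return ssl.create_default_context(cafile=ca_bundle)
```

The resulting context can be passed to urllib.request.urlopen(request, context=context) to check whether the API endpoint is reachable with the custom CA.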

@Bryson14

I'm having the same issue. A 504 means your proxy server is timing out the HTTP request because the unstructured server hasn't responded in time. You can increase the idle timeout of the proxy server, or you can give the unstructured container more CPU power. I'm looking into using a GPU docker container because the logs show "lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML", but I'll have to figure that out.

@bmolnar95

@awalker4 Hi. I have the same problem: without split_pdf_page=True, PDF partitioning via partition_via_api() is very slow. When I do use it, the page break elements are lost and every element is reported on the first page (in the metadata), even though my files have more than one page. How can I resolve this?
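
One possible workaround when splitting manually, sketched under assumptions: if each chunk is partitioned on its own, its page numbers restart at 1, so the chunk's starting page has to be added back when merging. The function restore_page_numbers is hypothetical, and the element shape (a dict with metadata.page_number) is assumed from the API's JSON output. It does not explain or fix what split_pdf_page does internally.

```python
import copy
from typing import Any, Dict, List, Tuple

Element = Dict[str, Any]


def restore_page_numbers(chunks: List[Tuple[int, List[Element]]]) -> List[Element]:
    """Merge per-chunk results and shift page numbers back to document pages.

    chunks holds (first_page, elements) pairs, where first_page is the 1-based
    page of the original PDF at which the chunk starts and each element's
    metadata.page_number is assumed to be relative to its chunk.
    """
    merged: List[Element] = []
    for first_page, elements in chunks:
        for element in elements:
            fixed = copy.deepcopy(element)  # leave the caller's data untouched
            metadata = fixed.setdefault("metadata", {})
            local_page = metadata.get("page_number") or 1  # missing -> page 1
            metadata["page_number"] = first_page + local_page - 1
            merged.append(fixed)
    return merged
```

With single-page chunks this reduces to stamping each element with the page it was cut from, which matches the case where everything comes back labelled as page 1.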

@JOSHMT0744

@mennthor did you ever get anywhere with testing whether the PDF splitting fixes the 504 errors?

@mennthor
Author

@JOSHMT0744
Sorry for taking so long.
I actually tried it last week: I updated the server to v0.0.76, used the split_pdf_page keyword and increased all timeouts in the nginx ingress to 10 minutes.
This solved all of the 504 errors so far.
Thx for the help 👍


5 participants