Getting frequent 504s with medium sized PDFs and ocr_only and hi_res strategies #393
Comments
Hi, sorry for the delay here. The first place to check would be your nginx config, if you have access to it. There's usually a server timeout value that can be increased. In general, our advice for hi_res or ocr_only PDFs has been to split up the file and send smaller batches of pages in parallel, since these long-lived requests can cause all sorts of trouble, as you're seeing. Our client code supports splitting via split_pdf_page as well.
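For the nginx case mentioned above, the relevant knobs are the proxy timeout directives. A sketch, assuming nginx is reverse-proxying to the Unstructured container on port 8000 (location path, upstream name, and timeout values are illustrative, not from this thread):

```nginx
# Raise the proxy timeouts for the route that forwards to the
# Unstructured container, so long-running hi_res/ocr_only requests
# aren't cut off at the default 60s.
location /general/v0/general {
    proxy_pass http://unstructured:8000;
    proxy_read_timeout    600s;  # wait up to 10 min for the response
    proxy_send_timeout    600s;
    proxy_connect_timeout 60s;
}
```

A 504 from nginx typically means proxy_read_timeout expired while waiting on the upstream, which matches the behaviour described below.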
I'm having the same issues. A 504 means your proxy server is timing out the HTTP request because the unstructured server hasn't responded. You can increase the idle timeout period for the proxy server, or you can give the unstructured container more CPU power. I'm looking into using a GPU Docker container, because the logs show "lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML". But I'll have to figure that out.
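Giving the container more CPU, as suggested above, is a docker run resource flag. A deployment sketch (the image name, tag, and resource values are assumptions, not confirmed in this thread):

```shell
# Run the Unstructured API container with explicit CPU/memory limits.
# Image name/tag and limits are illustrative.
docker run --rm -p 8000:8000 --cpus=8 --memory=16g \
  quay.io/unstructured-io/unstructured-api:latest
```

Note that more CPU only shortens processing time; if a single request still exceeds the proxy's timeout, the 504 will come back regardless.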
@awalker4 Hi. I have the same problem: without split_pdf_page=True via partition_via_api() the PDF partitioning is very slow. If I use it, I found that we lose the page break elements and all elements end up on the first page (in the metadata), even though my files are longer than one page. How can I resolve this?
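One workaround for the lost page numbers is to split the PDF yourself, send each chunk in parallel, and shift every element's page_number by the chunk's page offset when merging. A minimal sketch of the bookkeeping, assuming elements come back as dicts with a metadata field; the actual upload is abstracted behind a send_chunk callable, and all function names here are mine, not the library's:

```python
from concurrent.futures import ThreadPoolExecutor

def page_ranges(total_pages, chunk_size):
    """Split pages 1..total_pages into (start, end) chunks, 1-indexed inclusive."""
    return [(s, min(s + chunk_size - 1, total_pages))
            for s in range(1, total_pages + 1, chunk_size)]

def renumber(elements, offset):
    """Shift each element's page_number by the chunk's page offset,
    since the server numbers every chunk's pages starting from 1."""
    for el in elements:
        md = el.setdefault("metadata", {})
        if "page_number" in md:
            md["page_number"] += offset
    return elements

def partition_all(path, total_pages, chunk_size, send_chunk):
    """send_chunk(path, start, end) -> element dicts for pages start..end.
    Chunks are sent in parallel; results are renumbered and merged in order."""
    ranges = page_ranges(total_pages, chunk_size)
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = pool.map(
            lambda r: renumber(send_chunk(path, *r), r[0] - 1), ranges
        )
    return [el for chunk in results for el in chunk]

# Example: a 39-page PDF in 10-page chunks
print(page_ranges(39, 10))  # [(1, 10), (11, 20), (21, 30), (31, 39)]
```

Each chunk could be written out with a PDF library and posted as its own request, which sidesteps the proxy timeout and keeps correct page metadata in the merged result.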
@mennthor did you ever get anywhere with testing how to get the pdf splitting to fix the 504 errors? |
Describe the bug
Sending a single PDF (this one: https://arxiv.org/abs/2310.12931, embedded text, 39 pages) to the self-hosted API, with either the ocr_only or hi_res strategy.
The server has 32GB RAM, 8 CPU cores and a CUDA-enabled GPU; resources stay below 20% CPU and 5% RAM while processing the PDF.
The Unstructured API version is v0.0.61.
The server responds with 504 after some 20 to 30 seconds, and the client calling via partition_via_api retries for a while. In the server logs I can see that each new request is properly worked on, printing '[...] unstructured INFO Processing entire page OCR [...]', and the server does not crash. However, the client detaches and discards the request, so when the server is done processing, the result is simply thrown away.
On the client side I get
leaving the process running on the server and triggering a new one, which gets detached in the same fashion.
I tested the same call against the officially hosted API and the result arrived after about 5m30s.
Question: Is there a setting I can change on the server side to avoid this? The officially hosted service obviously handles it, so the answer should be yes. Could you point me in the right direction?
To Reproduce
Calling the API like this:
Environment:
partition_via_api
Additional context
Update: The same happens when calling the API with cURL like so:
minus the retry part, because cURL simply stops after a single attempt.
On the server side the behaviour stays the same: the process keeps running until it finishes properly, and the result is then discarded because the client detached.
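For reference, a cURL reproduction would be of roughly this shape, using the API's documented multipart endpoint (the host, port, and filename here are placeholders, not the original command):

```shell
# Single-shot request; cURL gives up after the proxy returns 504.
curl -X POST 'http://localhost:8000/general/v0/general' \
  -H 'accept: application/json' \
  -F 'files=@2310.12931.pdf' \
  -F 'strategy=hi_res'
```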