UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 10: invalid continuation byte #267

Open
sentry-io bot opened this issue Oct 3, 2023 · 6 comments

@sentry-io

sentry-io bot commented Oct 3, 2023

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 10: invalid continuation byte
(23 additional frame(s) were not displayed)
...
  File "prepline_general/api/general.py", line 686, in pipeline_1
    list(response_generator(is_multipart=False))[0] if len(files) == 1 else join_responses(list(response_generator(is_multipart=False)))
  File "prepline_general/api/general.py", line 607, in response_generator
    response = pipeline_api(
  File "prepline_general/api/general.py", line 418, in pipeline_api
    raise e
  File "prepline_general/api/general.py", line 396, in pipeline_api
    elements = partition(
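
The traceback does not show which file triggered the error, so the following is only a sketch of one common way this exact message arises: text saved in a single-byte encoding such as Latin-1 and then decoded as UTF-8. The sample string and the `try_decode` helper are hypothetical and are not taken from the reported file or from the library.

```python
# Hypothetical sample: ten ASCII characters followed by "Ärger", stored as Latin-1.
# In Latin-1 the letter "Ä" (U+00C4) is the single byte 0xC4.
raw = "0123456789\u00c4rger".encode("latin-1")


def try_decode(data: bytes, encoding: str = "utf-8") -> str:
    """Return the decoded text, or the decoder's error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return str(exc)


# In UTF-8, 0xC4 starts a two-byte sequence, so the next byte must be a
# continuation byte (0x80-0xBF). Here it is "r" (0x72), so decoding fails.
print(try_decode(raw))
# Decoding with the encoding the bytes were actually written in succeeds.
print(try_decode(raw, "latin-1"))
```

If the failing file looks like this, passing the file's real encoding (or re-saving it as UTF-8) would be the thing to try first.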
@Krishna2709

Hi, I am facing the same error. Please let me know if you resolved it.

@awalker4
Collaborator

Hi there! Do you have a file that reproduces the issue that you're able to share?

@andrePankraz

Same problem via unstructured-python-client:
Failed to process a request due to API server error with status code 500. Attempting retry number 1 after sleep.
unstructured-client: 36 - log_retries()] Server message - {"detail":"'utf-8' codec can't decode byte 0xff in position 0: invalid start byte"}

If I try to send a file in another encoding, e.g. UTF-16, it does not work.
The encoding parameter is set correctly, as can be seen in unstructured-client/general.py:
req = client.prepare_request(requests_http.Request('POST', url, params=query_params, data=data, files=form, headers=headers))

I'm not sure if the issue is with the unstructured-python-client not encoding the form-post correctly or setting the accept header correctly, or if it's a problem with the server API.
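
The byte 0xff at position 0 in the server message is consistent with a UTF-16 little-endian byte order mark (FF FE) being read as UTF-8. A minimal sketch of that situation, using a made-up CSV string rather than the actual file from this report:

```python
import codecs

# Hypothetical CSV content; the real file from the report is not available.
text = "name,city\nJ\u00f6rg,K\u00f6ln\n"

# Build the bytes explicitly as BOM + UTF-16 LE so the result does not
# depend on the byte order of the machine running the example.
payload = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

try:
    payload.decode("utf-8")
    message = "decoded without error"
except UnicodeDecodeError as exc:
    # 0xFF can never start a UTF-8 sequence, so decoding fails on the first byte.
    message = str(exc)

print(message)

# The "utf-16" codec reads the BOM, picks the byte order, and strips the BOM.
decoded = payload.decode("utf-16")
print(decoded == text)
```

This only shows that the bytes are being decoded as UTF-8 somewhere; it does not say whether that happens in the client or on the server.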

@awalker4
Collaborator

Hi @andrePankraz, can you clarify how you're making the API call? The server does take an encoding param (shown in the table here) that defaults to utf-8. I suspect this file will work if you send encoding='utf-16'.

@andrePankraz

Have you really tested it with a UTF-16 file?

curl -X 'POST' \
    'http://ai1.dev.init:8004/general/v0/general' \
    -H 'accept: application/json' \
    -H 'Content-Type: multipart/form-data' \
    -F 'files=@data/documents/CSV_UTF_16.csv' \
    -F 'strategy=hi_res' \
    -F 'languages=deu' \
    -F 'encoding=utf-16'

{"detail":"'utf-8' codec can't decode byte 0xff in position 0: invalid start byte"}
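
If the encoding field does not take effect, as in the request above, one possible client-side workaround is to transcode the file to UTF-8 before uploading it. This is an assumption on my part and not something confirmed by the maintainers in this thread; `sniff_encoding` and `to_utf8` are hypothetical helpers, and the detection only looks at byte order marks.

```python
import codecs

# BOMs are checked longest-first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so UTF-32 has to be tested before UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess a codec name from a leading BOM; fall back to `default`."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default


def to_utf8(data: bytes) -> bytes:
    """Decode using the sniffed codec and re-encode as BOM-less UTF-8."""
    return data.decode(sniff_encoding(data)).encode("utf-8")


# Usage with an in-memory stand-in for a UTF-16 CSV file.
sample = codecs.BOM_UTF16_LE + "a;b\n1;2\n".encode("utf-16-le")
print(to_utf8(sample))
```

Files without a BOM fall through to the default, so a Latin-1 or BOM-less UTF-16 file would still need its encoding supplied explicitly.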

@Krishna2709

> Hi there! Do you have a file that reproduces the issue that you can share?

Hey @awalker4, my file got corrupted while I was formatting it. There's no issue with the library.
