Server crash with exceed context | lib version >= v0.2.81 #1759

Open
4 tasks done
carlostomazin opened this issue Sep 25, 2024 · 0 comments · May be fixed by #1796

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I'm sending a prompt whose context exceeds the configured context window.
I expect the server to return an 'exceed context' error and remain responsive.

Current Behavior

The server crashes on the 'exceed context' error and no longer responds to other requests.

Environment and Context

Windows 11
CUDA 12.6
Python 3.11.9

Failure Information (for bugs)

  • using llama.cpp python version <= v0.2.77 works fine.
  • using llama.cpp python version >= v0.2.81 works bad.
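
To confirm which behavior to expect, the installed version can be checked directly (a minimal sketch; llama_cpp.__version__ is the attribute I'm assuming here):

import llama_cpp  # assumed to expose __version__

# <= 0.2.77: error handled gracefully; >= 0.2.81: server hangs after the error
print(llama_cpp.__version__)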

Steps to Reproduce

python -m venv .venv-ai
.\.venv-ai\Scripts\activate
set CMAKE_ARGS=-DGGML_CUDA=on
pip install llama-cpp-python[server]==v0.2.90 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
python -B -m llama_cpp.server --model ./openhermes-2.5-mistral-7b.Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n_gpu_layers -1 --chat_format chatml --n_ctx 32

import json
import requests

def invoke(prompt):
  payload = {
    "messages": [
        {"role": "user", "content": prompt}
    ],
  }

  resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload),
    timeout=10  # on >= v0.2.81 the second call never returns and hits this timeout
  )
  return resp.json()

invoke("text context > 32 tokens")  # first call: server returns a 400 error as expected
invoke("text context > 32 tokens")  # second call: no response, the server hangs
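
To show that the server is actually hung rather than just slow, a quick liveness probe after the second call makes the difference obvious (a minimal sketch; I'm assuming the server's OpenAI-compatible GET /v1/models endpoint here):

import requests  # already imported above; repeated so the probe is self-contained

# Hypothetical liveness probe: on <= v0.2.77 this returns immediately,
# on >= v0.2.81 it times out because the request worker is stuck.
try:
    r = requests.get("http://localhost:8080/v1/models", timeout=5)
    print("server alive:", r.status_code)
except requests.exceptions.Timeout:
    print("server is hung")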

Failure Logs

The expected error is logged, but the server then stops responding to other requests:
Exception: Requested tokens (76) exceed context window of 32
Traceback (most recent call last):
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\llama_cpp\server\errors.py", line 171, in custom_route_handler
    response = await original_route_handler(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\fastapi\routing.py", line 301, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\fastapi\routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\llama_cpp\server\app.py", line 513, in create_chat_completion
    ] = await run_in_threadpool(llama.create_chat_completion, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\starlette\concurrency.py", line 39, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\anyio\to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\anyio\_backends\_asyncio.py", line 2405, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\anyio\_backends\_asyncio.py", line 914, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\llama_cpp\llama.py", line 1898, in create_chat_completion
    return handler(
           ^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\llama_cpp\llama_chat_format.py", line 637, in chat_completion_handler
    completion_or_chunks = llama.create_completion(
                           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\llama_cpp\llama.py", line 1732, in create_completion
    completion: Completion = next(completion_or_chunks)  # type: ignore
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\llama_cpp\llama.py", line 1169, in _create_completion
    raise ValueError(
ValueError: Requested tokens (76) exceed context window of 32
INFO:     127.0.0.1:59463 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
If I force-quit the server, my terminal crashes with:
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "C:\Program Files\Python311\Lib\asyncio\runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\asyncio\base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "C:\Program Files\Python311\Lib\asyncio\windows_events.py", line 321, in run_forever
    super().run_forever()
  File "C:\Program Files\Python311\Lib\asyncio\base_events.py", line 608, in run_forever
    self._run_once()
  File "C:\Program Files\Python311\Lib\asyncio\base_events.py", line 1936, in _run_once
    handle._run()
  File "C:\Program Files\Python311\Lib\asyncio\events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\uvicorn\server.py", line 68, in serve
    with self.capture_signals():
  File "C:\Program Files\Python311\Lib\contextlib.py", line 144, in __exit__
    next(self.gen)
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\uvicorn\server.py", line 328, in capture_signals
    signal.raise_signal(captured_signal)
  File "C:\Program Files\Python311\Lib\asyncio\runners.py", line 157, in _on_sigint
    raise KeyboardInterrupt()
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\uvicorn\protocols\http\h11_impl.py", line 406, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\uvicorn\middleware\proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\fastapi\applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\starlette\applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\starlette\middleware\errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\starlette\middleware\cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\starlette_context\middleware\raw_middleware.py", line 92, in __call__
    await self.app(scope, receive, send_wrapper)
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\starlette\middleware\exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\starlette\_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\starlette\routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\starlette\routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\starlette\routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\starlette\routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\starlette\_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\starlette\routing.py", line 73, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\llama_cpp\server\errors.py", line 171, in custom_route_handler
    response = await original_route_handler(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\fastapi\routing.py", line 291, in app
    solved_result = await solve_dependencies(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\fastapi\dependencies\utils.py", line 624, in solve_dependencies
    solved = await solve_generator(
             ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\fastapi\dependencies\utils.py", line 550, in solve_generator
    return await stack.enter_async_context(cm)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\contextlib.py", line 650, in enter_async_context
    result = await _enter(cm)
             ^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python311\Lib\contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\fastapi\concurrency.py", line 27, in contextmanager_in_threadpool
    yield await run_in_threadpool(cm.__enter__)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\starlette\concurrency.py", line 39, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\anyio\to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\carlo\workspace\genAI\carlos-ai\.venv-ai\Lib\site-packages\anyio\_backends\_asyncio.py", line 2405, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError