Describe the bug

When I try to serve a Llama 3.1 8B-4bit model with OpenLLM, it says that "This model's maximum context length is 2048 tokens".
On https://huggingface.co/meta-llama/Meta-Llama-3.1-8B, the maximum context length is stated as 128k tokens.
Why is there this difference?
To reproduce

openllm serve llama3.1:8b-4bit

Then, in a Python console with the openai client installed:
from openai import OpenAI

openai_client = OpenAI(api_key="test", base_url="http://localhost:3000/v1")
openai_client.chat.completions.create(
    model='hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4',
    messages=[{"role": "user", "content": "This is a test"}],
    presence_penalty=0.,
    frequency_penalty=0.,
    stream=False,
    temperature=0.,
    max_tokens=2048,
)
Logs

On the client side:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\_utils\_utils.py", line 277, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\resources\chat\completions.py", line 590, in create
    return self._post(
           ^^^^^^^^^^^
  File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\_base_client.py", line 1240, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\_base_client.py", line 921, in request
    return self._request(
           ^^^^^^^^^^^^^^
  File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\_base_client.py", line 1020, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 2048 tokens. However, you requested 2087 tokens (39 in the messages, 2048 in the completion). Please reduce the length of the messages or completion.", 'type': 'BadRequestError', 'param': None, 'code': 400}
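
The numbers in the error add up as follows: 39 prompt tokens plus the requested 2048 completion tokens give 2087, which exceeds the 2048-token window. Until the limit is configurable, the same request goes through if max_tokens is reduced so that prompt plus completion stays within 2048. A minimal client-side sketch against the same local server (the 2000 value is only an example that leaves a little headroom, not a recommended setting):

from openai import OpenAI

openai_client = OpenAI(api_key="test", base_url="http://localhost:3000/v1")
# Keep prompt tokens + max_tokens <= 2048, the engine's current default window:
# 2048 - 39 prompt tokens = 2009, so 2000 leaves some headroom.
response = openai_client.chat.completions.create(
    model='hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4',
    messages=[{"role": "user", "content": "This is a test"}],
    temperature=0.,
    stream=False,
    max_tokens=2000,
)
print(response.choices[0].message.content)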
Environment

System information

bentoml: 1.3.5
python: 3.11.8
platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.37
uid_gid: 1000:1000
conda: 24.3.0
in_conda_env: True
conda_packages:
pip_packages:
transformers version: 4.44.2

System information (Optional)

No response

Hi. The default max tokens is set for minimal GPU memory usage. We are working on a parameterization feature, so with the next minor version you might be able to run openllm serve llama3.1:8B --arg vllm.engine.max_tokens=131072.
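
In the meantime, since OpenLLM runs on a vLLM engine (the proposed flag references vllm.engine.max_tokens), one possible workaround, if loading the checkpoint directly with vLLM is an option, is to set the context window yourself. A minimal sketch, assuming a recent vLLM with AWQ support and enough GPU memory for the chosen length; this uses vLLM's own interface, not OpenLLM's:

# Sketch only: load the same AWQ checkpoint directly with vLLM and pick the
# context window explicitly; max_model_len and quantization are standard
# vLLM engine arguments, not OpenLLM options.
from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    quantization="awq",
    max_model_len=8192,  # trades context length against GPU memory
)
outputs = llm.generate(
    ["This is a test"],
    SamplingParams(temperature=0.0, max_tokens=2048),
)
print(outputs[0].outputs[0].text)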