Describe the bug

When I try to serve a Llama 3.1 8B-4bit model with OpenLLM, it says that "This model's maximum context length is 2048 tokens".
On https://huggingface.co/meta-llama/Meta-Llama-3.1-8B, the maximum context length is stated as 128k tokens.
Why is there this difference?
To reproduce

openllm serve llama3.1:8b-4bit

Then, in a Python console with the openai client installed:
from openai import OpenAI

openai_client = OpenAI(api_key="test", base_url="http://localhost:3000/v1")
openai_client.chat.completions.create(
    model='hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4',
    messages=[{"role": "user", "content": "This is a test"}],
    presence_penalty=0.,
    frequency_penalty=0.,
    stream=False,
    temperature=0.,
    max_tokens=2048,
)
Logs

On the client side:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\_utils\_utils.py", line 277, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\resources\chat\completions.py", line 590, in create
    return self._post(
           ^^^^^^^^^^^
  File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\_base_client.py", line 1240, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\_base_client.py", line 921, in request
    return self._request(
           ^^^^^^^^^^^^^^
  File "C:\Users\gaeta\anaconda3\envs\microservice\Lib\site-packages\openai\_base_client.py", line 1020, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 2048 tokens. However, you requested 2087 tokens (39 in the messages, 2048 in the completion). Please reduce the length of the messages or completion.", 'type': 'BadRequestError', 'param': None, 'code': 400}
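
The numbers in the error add up as follows: 39 prompt tokens plus the requested 2048 completion tokens give 2087, which exceeds the 2048-token window. Until the limit is configurable, the same request goes through if max_tokens is reduced so that prompt plus completion stays within 2048. A minimal client-side sketch against the same local server (the 2000 value is only an example that leaves a little headroom, not a recommended setting):

from openai import OpenAI

openai_client = OpenAI(api_key="test", base_url="http://localhost:3000/v1")
# Keep prompt tokens + max_tokens <= 2048, the engine's current default window:
# 2048 - 39 prompt tokens = 2009, so 2000 leaves some headroom.
response = openai_client.chat.completions.create(
    model='hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4',
    messages=[{"role": "user", "content": "This is a test"}],
    temperature=0.,
    stream=False,
    max_tokens=2000,
)
print(response.choices[0].message.content)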
Environment

System information

bentoml: 1.3.5
python: 3.11.8
platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.37
uid_gid: 1000:1000
conda: 24.3.0
in_conda_env: True
conda_packages:
pip_packages:
transformers version: 4.44.2

System information (Optional)

No response

Hi. The default max tokens is set for minimal GPU memory usage. We are working on a parameterization feature, so with the next minor version you might be able to run openllm serve llama3.1:8B --arg vllm.engine.max_tokens=131072.
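
In the meantime, since OpenLLM runs on a vLLM engine (the proposed flag references vllm.engine.max_tokens), one possible workaround, if loading the checkpoint directly with vLLM is an option, is to set the context window yourself. A minimal sketch, assuming a recent vLLM with AWQ support and enough GPU memory for the chosen length; this uses vLLM's own interface, not OpenLLM's:

# Sketch only: load the same AWQ checkpoint directly with vLLM and pick the
# context window explicitly; max_model_len and quantization are standard
# vLLM engine arguments, not OpenLLM options.
from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    quantization="awq",
    max_model_len=8192,  # trades context length against GPU memory
)
outputs = llm.generate(
    ["This is a test"],
    SamplingParams(temperature=0.0, max_tokens=2048),
)
print(outputs[0].outputs[0].text)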