
[Bug] Speculative decoding small draft doesn't work on macOS #2907

Open
vlbosch opened this issue Sep 16, 2024 · 4 comments
Labels
bug Confirmed bugs

Comments

@vlbosch

vlbosch commented Sep 16, 2024

🐛 Bug

I tried to use Mistral Small 7B Instruct v0.3 as a draft model for Mistral Large 2407. When the engine is not served with "--mode server", the models never respond; I think that's because only the CPU is used instead of the GPU. When serving with "--mode server", I see the first token streamed in the frontend, but then I get the following error: Check failed: (!mstates[i]->draft_output_tokens.empty()) is false.

To Reproduce

Steps to reproduce the behavior:

  1. Download Mistral Large 2407
  2. Quantize model and gen config
  3. Run Mistral Large to see if it works standalone
  4. Run the speculative decoding with: python -m mlc_llm serve /Users/USER/LLM/Mistral-Large-Instruct-2407-MLC --additional-models "HF://mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC" --speculative-mode small_draft --port 9999 --device metal --mode server
  5. First token is streamed, then error message

USER@MBPM3MVLB ~ % python -m mlc_llm serve /Users/USER/LLM/Mistral-Large-Instruct-2407-MLC --additional-models "HF://mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC" --speculative-mode small_draft --port 9999 --device metal --mode server
[2024-09-16 08:50:13] INFO auto_device.py:79: Found device: metal:0
[2024-09-16 08:50:13] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-09-16 08:50:13] INFO jit.py:158: Using cached model lib: /Users/USER/.cache/mlc_llm/model_lib/3826dfed383847636248c8e5e540102b.dylib
[2024-09-16 08:50:13] INFO download_cache.py:227: Downloading model from HuggingFace: HF://mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC
[2024-09-16 08:50:13] INFO download_cache.py:29: MLC_DOWNLOAD_CACHE_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-09-16 08:50:13] INFO download_cache.py:166: Weights already downloaded: /Users/USER/.cache/mlc_llm/model_weights/hf/mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC
[2024-09-16 08:50:13] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-09-16 08:50:13] INFO jit.py:158: Using cached model lib: /Users/USER/.cache/mlc_llm/model_lib/7bbcaf068957bbf173dbd8ad644faea6.dylib
[2024-09-16 08:50:13] INFO engine_base.py:192: The selected engine mode is server. We use as much GPU memory as possible (within the limit of gpu_memory_utilization).
[2024-09-16 08:50:13] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2024-09-16 08:50:13] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 2048.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "server". So max batch size is 80, max KV cache token capacity is 32768, prefill chunk size is 2048.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 86697.674 MB (Parameters: 69664.656 MB. KVCache: 15602.123 MB. Temporary buffer: 1430.894 MB). The actual usage might be slightly larger than the estimated number.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/engine.cc:365: Warning: Hybrid prefill mode fallbacks to chunked prefill, due to speculative mode is enabled and not implemented with hybrid prefill yet.
INFO: Started server process [69315]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:9999 (Press CTRL+C to quit)
INFO: 127.0.0.1:58406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
libc++abi: terminating due to uncaught exception of type tvm::runtime::InternalError: [08:50:41] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/engine_actions/batch_draft.cc:151: InternalError: Check failed: (!mstates[i]->draft_output_tokens.empty()) is false:
Stack trace:

zsh: abort python -m mlc_llm serve /Users/USER/LLM/Mistral-Large-Instruct-2407-MLC
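For completeness, the request that triggers the crash is an ordinary streaming chat completion against the OpenAI-compatible endpoint shown in the log. A minimal client sketch (the model name and prompt are placeholders; any streaming request against the engine reproduces it):

import requests

payload = {
    # Placeholder: the model path passed to `mlc_llm serve`
    "model": "/Users/USER/LLM/Mistral-Large-Instruct-2407-MLC",
    "messages": [{"role": "user", "content": "Write a haiku about autumn."}],
    "stream": True,
}
with requests.post(
    "http://127.0.0.1:9999/v1/chat/completions", json=payload, stream=True
) as resp:
    for line in resp.iter_lines():
        if line:
            print(line.decode())  # the first chunk arrives, then the server aborts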

Expected behavior

The model streams the output to the provided prompt.

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): Macbook Pro
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): macOS Sequoia
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): M3 Max
  • How you installed MLC-LLM (conda, source): conda with pip install
  • How you installed TVM-Unity (pip, source): pip
  • Python version (e.g. 3.10): 3.12
  • GPU driver version (if applicable): -
  • CUDA/cuDNN version (if applicable): -
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
    "USE_NVTX: OFF
    USE_GTEST: AUTO
    SUMMARIZE: OFF
    TVM_DEBUG_WITH_ABI_CHANGE: OFF
    USE_IOS_RPC: OFF
    USE_MSC: OFF
    USE_ETHOSU:
    CUDA_VERSION: NOT-FOUND
    USE_LIBBACKTRACE: AUTO
    DLPACK_PATH: 3rdparty/dlpack/include
    USE_TENSORRT_CODEGEN: OFF
    USE_THRUST: OFF
    USE_TARGET_ONNX: OFF
    USE_AOT_EXECUTOR: ON
    BUILD_DUMMY_LIBTVM: OFF
    USE_CUDNN: OFF
    USE_TENSORRT_RUNTIME: OFF
    USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR: OFF
    USE_CCACHE: AUTO
    USE_ARM_COMPUTE_LIB: OFF
    USE_CPP_RTVM:
    USE_OPENCL_GTEST: /path/to/opencl/gtest
    TVM_LOG_BEFORE_THROW: OFF
    USE_MKL: OFF
    USE_PT_TVMDSOOP: OFF
    MLIR_VERSION: NOT-FOUND
    USE_CLML: OFF
    USE_STACKVM_RUNTIME: OFF
    USE_GRAPH_EXECUTOR_CUDA_GRAPH: OFF
    ROCM_PATH: /opt/rocm
    USE_DNNL: OFF
    USE_MSCCL: OFF
    USE_VITIS_AI: OFF
    USE_MLIR: OFF
    USE_RCCL: OFF
    USE_LLVM: llvm-config --link-static
    USE_VERILATOR: OFF
    USE_TF_TVMDSOOP: OFF
    USE_THREADS: ON
    USE_MSVC_MT: OFF
    BACKTRACE_ON_SEGFAULT: OFF
    USE_GRAPH_EXECUTOR: ON
    USE_NCCL: OFF
    USE_ROCBLAS: OFF
    GIT_COMMIT_HASH: 2685d6ace64c30a077c1b3f6893d2e38589be7bb
    USE_VULKAN: OFF
    USE_RUST_EXT: OFF
    USE_CUTLASS: OFF
    USE_CPP_RPC: OFF
    USE_HEXAGON: OFF
    USE_CUSTOM_LOGGING: OFF
    USE_UMA: OFF
    USE_FALLBACK_STL_MAP: OFF
    USE_SORT: ON
    USE_RTTI: ON
    GIT_COMMIT_TIME: 2024-09-07 15:18:06 -0400
    USE_HIPBLAS: OFF
    USE_HEXAGON_SDK: /path/to/sdk
    USE_BLAS: none
    USE_ETHOSN: OFF
    USE_LIBTORCH: OFF
    USE_RANDOM: ON
    USE_CUDA: OFF
    USE_COREML: OFF
    USE_AMX: OFF
    BUILD_STATIC_RUNTIME: OFF
    USE_CMSISNN: OFF
    USE_KHRONOS_SPIRV: OFF
    USE_CLML_GRAPH_EXECUTOR: OFF
    USE_TFLITE: OFF
    USE_HEXAGON_GTEST: /path/to/hexagon/gtest
    PICOJSON_PATH: 3rdparty/picojson
    USE_OPENCL_ENABLE_HOST_PTR: OFF
    INSTALL_DEV: OFF
    USE_PROFILER: ON
    USE_NNPACK: OFF
    LLVM_VERSION: 17.0.1
    USE_MRVL: OFF
    USE_OPENCL: OFF
    COMPILER_RT_PATH: 3rdparty/compiler-rt
    RANG_PATH: 3rdparty/rang/include
    USE_SPIRV_KHR_INTEGER_DOT_PRODUCT: OFF
    USE_OPENMP: OFF
    USE_BNNS: OFF
    USE_FLASHINFER:
    USE_CUBLAS: OFF
    USE_METAL: ON
    USE_MICRO_STANDALONE_RUNTIME: OFF
    USE_HEXAGON_EXTERNAL_LIBS: OFF
    USE_ALTERNATIVE_LINKER: AUTO
    USE_BYODT_POSIT: OFF
    USE_NVSHMEM: OFF
    USE_HEXAGON_RPC: OFF
    USE_MICRO: OFF
    DMLC_PATH: 3rdparty/dmlc-core/include
    INDEX_DEFAULT_I64: ON
    USE_RELAY_DEBUG: OFF
    USE_RPC: ON
    USE_TENSORFLOW_PATH: none
    TVM_CLML_VERSION:
    USE_MIOPEN: OFF
    USE_ROCM: OFF
    USE_PAPI: OFF
    USE_CURAND: OFF
    TVM_CXX_COMPILER_PATH: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
    HIDE_PRIVATE_SYMBOLS: ON"
  • Any other relevant information: I tried it with both the Mistral Small model from the mlc-ai HF repo and a locally quantized version; both yield the same error.

Additional context

Both models work fine separately.

@vlbosch vlbosch added the bug Confirmed bugs label Sep 16, 2024
@MasterJH5574
Member

Thank you @vlbosch. We also ran into this and fixed it in #2906. The nightly packages are being built and will be ready in a few hours. I'll report back when the nightly build is done.

@MasterJH5574
Member

Hi @vlbosch, the nightly wheel has been updated. Could you please try upgrading?
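(For reference, on macOS the nightly wheels are typically upgraded with something along these lines, assuming the standard mlc.ai wheel index; please check the install docs for the exact package names:)

python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly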

@vlbosch
Author

vlbosch commented Sep 17, 2024

@MasterJH5574 Thanks for the quick response! I just updated to the latest nightly and retried. Small-draft mode does work now; however, running with the small draft is slower than running Mistral Large alone. I thought the baseline would be the regular speed of the large model? Or does that only hold for the other speculative modes like EAGLE and Medusa?

@MasterJH5574
Member

Hi @vlbosch, thanks for following up and sorry for the late response. We have not yet benchmarked speculative decoding performance on Metal, so it's possible that spec decoding on Metal needs more optimization. It also depends on the speculative decoding acceptance rate of the 7B draft model.
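For intuition, whether speculative decoding helps depends on how many drafted tokens the target model accepts per verification step versus the relative cost of running the draft model. A rough back-of-the-envelope sketch (an illustrative model only, not MLC's actual scheduler or the exact formula from the speculative decoding papers):

def estimated_speedup(gamma: int, acceptance_rate: float, draft_cost: float) -> float:
    # gamma: number of tokens drafted per verification step
    # acceptance_rate: average fraction of drafted tokens the target model accepts
    # draft_cost: cost of one draft-model forward pass relative to one target-model pass
    accepted = gamma * acceptance_rate
    tokens_per_step = accepted + 1          # accepted drafts plus one token from the target pass
    cost_per_step = gamma * draft_cost + 1  # gamma draft passes plus one verification pass
    return tokens_per_step / cost_per_step

# With a low acceptance rate or a relatively expensive draft model,
# speculative decoding can end up slower than the target model alone:
print(estimated_speedup(gamma=4, acceptance_rate=0.8, draft_cost=0.06))  # > 1: faster
print(estimated_speedup(gamma=4, acceptance_rate=0.3, draft_cost=0.35))  # < 1: slower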
