
[Bug] Speculative decoding small draft doesn't work on macOS #2907

Open
vlbosch opened this issue Sep 16, 2024 · 4 comments
Labels
bug Confirmed bugs

Comments

@vlbosch

vlbosch commented Sep 16, 2024

🐛 Bug

I tried to use Mistral Small 7B Instruct v0.3 as a draft model for Mistral Large 2407. When the engine is not served with "--mode server", the models never respond; I think that's because only the CPU is used instead of the GPU. When serving with "--mode server", I see the first token streamed in the frontend, but then I get the following error: Check failed: (!mstates[i]->draft_output_tokens.empty()) is false.

To Reproduce

Steps to reproduce the behavior:

  1. Download Mistral Large 2407
  2. Quantize model and gen config
  3. Run Mistral Large to see if it works standalone
  4. Run the speculative decoding with: python -m mlc_llm serve /Users/USER/LLM/Mistral-Large-Instruct-2407-MLC --additional-models "HF://mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC" --speculative-mode small_draft --port 9999 --device metal --mode server
  5. First token is streamed, then error message

USER@MBPM3MVLB ~ % python -m mlc_llm serve /Users/USER/LLM/Mistral-Large-Instruct-2407-MLC --additional-models "HF://mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC" --speculative-mode small_draft --port 9999 --device metal --mode server
[2024-09-16 08:50:13] INFO auto_device.py:79: Found device: metal:0
[2024-09-16 08:50:13] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-09-16 08:50:13] INFO jit.py:158: Using cached model lib: /Users/USER/.cache/mlc_llm/model_lib/3826dfed383847636248c8e5e540102b.dylib
[2024-09-16 08:50:13] INFO download_cache.py:227: Downloading model from HuggingFace: HF://mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC
[2024-09-16 08:50:13] INFO download_cache.py:29: MLC_DOWNLOAD_CACHE_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-09-16 08:50:13] INFO download_cache.py:166: Weights already downloaded: /Users/USER/.cache/mlc_llm/model_weights/hf/mlc-ai/Mistral-7B-Instruct-v0.3-q4f16_1-MLC
[2024-09-16 08:50:13] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-09-16 08:50:13] INFO jit.py:158: Using cached model lib: /Users/USER/.cache/mlc_llm/model_lib/7bbcaf068957bbf173dbd8ad644faea6.dylib
[2024-09-16 08:50:13] INFO engine_base.py:192: The selected engine mode is server. We use as much GPU memory as possible (within the limit of gpu_memory_utilization).
[2024-09-16 08:50:13] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2024-09-16 08:50:13] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 2048.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "server". So max batch size is 80, max KV cache token capacity is 32768, prefill chunk size is 2048.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 86697.674 MB (Parameters: 69664.656 MB. KVCache: 15602.123 MB. Temporary buffer: 1430.894 MB). The actual usage might be slightly larger than the estimated number.
[08:50:13] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/engine.cc:365: Warning: Hybrid prefill mode fallbacks to chunked prefill, due to speculative mode is enabled and not implemented with hybrid prefill yet.
INFO: Started server process [69315]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:9999 (Press CTRL+C to quit)
INFO: 127.0.0.1:58406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
libc++abi: terminating due to uncaught exception of type tvm::runtime::InternalError: [08:50:41] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/engine_actions/batch_draft.cc:151: InternalError: Check failed: (!mstates[i]->draft_output_tokens.empty()) is false:
Stack trace:

zsh: abort python -m mlc_llm serve /Users/USER/LLM/Mistral-Large-Instruct-2407-MLC
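For completeness, the request that triggers the crash is an ordinary streaming chat completion against the OpenAI-compatible endpoint shown in the log. A minimal client sketch (the model name and prompt are placeholders; any streaming request against the engine reproduces it):

import requests

payload = {
    # Placeholder: the model path passed to `mlc_llm serve`
    "model": "/Users/USER/LLM/Mistral-Large-Instruct-2407-MLC",
    "messages": [{"role": "user", "content": "Write a haiku about autumn."}],
    "stream": True,
}
with requests.post(
    "http://127.0.0.1:9999/v1/chat/completions", json=payload, stream=True
) as resp:
    for line in resp.iter_lines():
        if line:
            print(line.decode())  # the first chunk arrives, then the server aborts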

Expected behavior

The model streams the output to the provided prompt.

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): Macbook Pro
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): macOS Sequoia
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): M3 Max
  • How you installed MLC-LLM (conda, source): conda with pip install
  • How you installed TVM-Unity (pip, source): pip
  • Python version (e.g. 3.10): 3.12
  • GPU driver version (if applicable): -
  • CUDA/cuDNN version (if applicable): -
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
    "USE_NVTX: OFF
    USE_GTEST: AUTO
    SUMMARIZE: OFF
    TVM_DEBUG_WITH_ABI_CHANGE: OFF
    USE_IOS_RPC: OFF
    USE_MSC: OFF
    USE_ETHOSU:
    CUDA_VERSION: NOT-FOUND
    USE_LIBBACKTRACE: AUTO
    DLPACK_PATH: 3rdparty/dlpack/include
    USE_TENSORRT_CODEGEN: OFF
    USE_THRUST: OFF
    USE_TARGET_ONNX: OFF
    USE_AOT_EXECUTOR: ON
    BUILD_DUMMY_LIBTVM: OFF
    USE_CUDNN: OFF
    USE_TENSORRT_RUNTIME: OFF
    USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR: OFF
    USE_CCACHE: AUTO
    USE_ARM_COMPUTE_LIB: OFF
    USE_CPP_RTVM:
    USE_OPENCL_GTEST: /path/to/opencl/gtest
    TVM_LOG_BEFORE_THROW: OFF
    USE_MKL: OFF
    USE_PT_TVMDSOOP: OFF
    MLIR_VERSION: NOT-FOUND
    USE_CLML: OFF
    USE_STACKVM_RUNTIME: OFF
    USE_GRAPH_EXECUTOR_CUDA_GRAPH: OFF
    ROCM_PATH: /opt/rocm
    USE_DNNL: OFF
    USE_MSCCL: OFF
    USE_VITIS_AI: OFF
    USE_MLIR: OFF
    USE_RCCL: OFF
    USE_LLVM: llvm-config --link-static
    USE_VERILATOR: OFF
    USE_TF_TVMDSOOP: OFF
    USE_THREADS: ON
    USE_MSVC_MT: OFF
    BACKTRACE_ON_SEGFAULT: OFF
    USE_GRAPH_EXECUTOR: ON
    USE_NCCL: OFF
    USE_ROCBLAS: OFF
    GIT_COMMIT_HASH: 2685d6ace64c30a077c1b3f6893d2e38589be7bb
    USE_VULKAN: OFF
    USE_RUST_EXT: OFF
    USE_CUTLASS: OFF
    USE_CPP_RPC: OFF
    USE_HEXAGON: OFF
    USE_CUSTOM_LOGGING: OFF
    USE_UMA: OFF
    USE_FALLBACK_STL_MAP: OFF
    USE_SORT: ON
    USE_RTTI: ON
    GIT_COMMIT_TIME: 2024-09-07 15:18:06 -0400
    USE_HIPBLAS: OFF
    USE_HEXAGON_SDK: /path/to/sdk
    USE_BLAS: none
    USE_ETHOSN: OFF
    USE_LIBTORCH: OFF
    USE_RANDOM: ON
    USE_CUDA: OFF
    USE_COREML: OFF
    USE_AMX: OFF
    BUILD_STATIC_RUNTIME: OFF
    USE_CMSISNN: OFF
    USE_KHRONOS_SPIRV: OFF
    USE_CLML_GRAPH_EXECUTOR: OFF
    USE_TFLITE: OFF
    USE_HEXAGON_GTEST: /path/to/hexagon/gtest
    PICOJSON_PATH: 3rdparty/picojson
    USE_OPENCL_ENABLE_HOST_PTR: OFF
    INSTALL_DEV: OFF
    USE_PROFILER: ON
    USE_NNPACK: OFF
    LLVM_VERSION: 17.0.1
    USE_MRVL: OFF
    USE_OPENCL: OFF
    COMPILER_RT_PATH: 3rdparty/compiler-rt
    RANG_PATH: 3rdparty/rang/include
    USE_SPIRV_KHR_INTEGER_DOT_PRODUCT: OFF
    USE_OPENMP: OFF
    USE_BNNS: OFF
    USE_FLASHINFER:
    USE_CUBLAS: OFF
    USE_METAL: ON
    USE_MICRO_STANDALONE_RUNTIME: OFF
    USE_HEXAGON_EXTERNAL_LIBS: OFF
    USE_ALTERNATIVE_LINKER: AUTO
    USE_BYODT_POSIT: OFF
    USE_NVSHMEM: OFF
    USE_HEXAGON_RPC: OFF
    USE_MICRO: OFF
    DMLC_PATH: 3rdparty/dmlc-core/include
    INDEX_DEFAULT_I64: ON
    USE_RELAY_DEBUG: OFF
    USE_RPC: ON
    USE_TENSORFLOW_PATH: none
    TVM_CLML_VERSION:
    USE_MIOPEN: OFF
    USE_ROCM: OFF
    USE_PAPI: OFF
    USE_CURAND: OFF
    TVM_CXX_COMPILER_PATH: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
    HIDE_PRIVATE_SYMBOLS: ON"
  • Any other relevant information: I tried it with both the Mistral Small model from the mlc-ai HF repo and a locally quantized version; both yield the same error.

Additional context

Both models work fine separately.

@vlbosch vlbosch added the bug Confirmed bugs label Sep 16, 2024
@MasterJH5574
Member

Thank you @vlbosch. We also ran into this and fixed it in #2906. The nightly packages are being built and will be ready in a few hours. I'll report back when the nightly build is done.

@MasterJH5574
Member

Hi @vlbosch, the nightly wheel has been updated. Could you please try upgrading?
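(For reference, on macOS the nightly wheels are typically upgraded with something along these lines, assuming the standard mlc.ai wheel index; please check the install docs for the exact package names:)

python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly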

@vlbosch
Author

vlbosch commented Sep 17, 2024

@MasterJH5574 Thanks for the quick response! I just updated to the latest nightly and retried. Small-draft mode does work now; however, running with the small draft is slower than running Mistral Large alone. I thought the baseline would be the regular speed of the large model? Or does that only hold for the other speculative modes like EAGLE and Medusa?

@MasterJH5574
Member

Hi @vlbosch, thanks for following up and sorry for the late response. We have not yet benchmarked speculative decoding performance on Metal, so it's possible that spec decoding on Metal needs more optimization. It also depends on the speculative decoding acceptance rate of the 7B draft model.
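For intuition, whether speculative decoding helps depends on how many drafted tokens the target model accepts per verification step versus the relative cost of running the draft model. A rough back-of-the-envelope sketch (an illustrative model only, not MLC's actual scheduler or the exact formula from the speculative decoding papers):

def estimated_speedup(gamma: int, acceptance_rate: float, draft_cost: float) -> float:
    # gamma: number of tokens drafted per verification step
    # acceptance_rate: average fraction of drafted tokens the target model accepts
    # draft_cost: cost of one draft-model forward pass relative to one target-model pass
    accepted = gamma * acceptance_rate
    tokens_per_step = accepted + 1          # accepted drafts plus one token from the target pass
    cost_per_step = gamma * draft_cost + 1  # gamma draft passes plus one verification pass
    return tokens_per_step / cost_per_step

# With a low acceptance rate or a relatively expensive draft model,
# speculative decoding can end up slower than the target model alone:
print(estimated_speedup(gamma=4, acceptance_rate=0.8, draft_cost=0.06))  # > 1: faster
print(estimated_speedup(gamma=4, acceptance_rate=0.3, draft_cost=0.35))  # < 1: slower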
