Metal Decoder Inference, K-Quants added, Sync'd with latest GGML #1074
Conversation
Hey, I realised in the end that the tiny model doesn't work: it has n_audio_state = 384, and k_quants at the time required each layer dimension to be a multiple of 256 (base through large all are), as this is the block size. There has been an update since, and k_quants can now be done with a block size of 64, so it should be quite easy to update this pull request if anyone is requiring a k_quant'd tiny model. Also, nice work on the mel_spec, you've done some really good work there!
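To make that constraint concrete, here is a tiny standalone check of the divisibility rule; the helper name is made up for illustration and is not part of whisper.cpp:

#include <cstdint>
#include <cstdio>

// Returns true if a row of n elements splits into whole k-quant blocks.
static bool is_k_quantizable(int64_t n_per_row, int64_t block_size) {
    return n_per_row % block_size == 0;
}

int main() {
    // tiny model: n_audio_state = 384
    std::printf("block 256: %s\n", is_k_quantizable(384, 256) ? "ok" : "not ok"); // not ok (384 % 256 = 128)
    std::printf("block  64: %s\n", is_k_quantizable(384,  64) ? "ok" : "not ok"); // ok (384 = 6 * 64)
}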
Update: added a few things to this pull request.
I had high hopes for the Metal inference, but unfortunately it's fallen a bit flat. It might be that my implementation isn't optimal or that there's an error in my code, but in testing, Metal inference runs slower than the CPU (see the examples below). I'm wondering if it's because the matrices involved are relatively small compared with Llama, so the overhead of using the GPU isn't offset by its speed, especially compared with the speed-up we're already getting from the Accelerate framework. The gap does appear to close on the large models, but not enough for the GPU to overtake the CPU.

In implementing this I also sync'd up Whisper's GGML with Llama's. Non-Metal inference runs as normal.

For anyone wanting to test or use this, be aware that to run on the GPU I had to make a small model change and convert the decoder.positional_embedding layer (dpe) to FP16, as GGML Metal doesn't currently support FP32. I've made a change to the model loading code so that when WHISPER_USE_METAL=1 it expects dpe to be FP16, and when WHISPER_USE_METAL=0 it expects dpe to be FP32. I've also updated the quantization code so that when you quantize a model it converts dpe to FP16. Be aware, though, that because of these changes, models quantized from this pull request won't run when WHISPER_USE_METAL=0 unless you change the non-Metal path to expect dpe in FP16 (a small sketch of this type selection follows the test listings below).

The results below are from running this on an M2 MacBook Pro with 16 GB.
See below for details of some test runs. One other thing to note is that on the smaller models (<= small) the output is slightly different; the GPU output seems to really like exclamation points for some reason.

Small Q5_K CPU
Small Q5_K GPU
Medium Q5_K CPU
Medium Q5_K GPU
Large Q5_K CPU
Large Q5_K GPU
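As mentioned above, here is a minimal sketch of the WHISPER_USE_METAL / dpe type rule; the function and enum names are made up for illustration and are not the PR's actual loader code:

#include <cstdio>

enum class dpe_type { f32, f16 };

// Sketch of the rule described above: the Metal build expects
// decoder.positional_embedding (dpe) in F16, the non-Metal build in F32.
static dpe_type expected_dpe_type(bool whisper_use_metal) {
    return whisper_use_metal ? dpe_type::f16 : dpe_type::f32;
}

int main() {
    std::printf("WHISPER_USE_METAL=1 -> dpe expected as %s\n",
                expected_dpe_type(true)  == dpe_type::f16 ? "F16" : "F32");
    std::printf("WHISPER_USE_METAL=0 -> dpe expected as %s\n",
                expected_dpe_type(false) == dpe_type::f16 ? "F16" : "F32");
}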
Hi @RRUK01 - thank you for the contribution. Nice work! Will be looking into this PR in a week or two. Sorry for the delay. The slower Metal runs are surprising.
I could be doing something wrong though. Btw, you can easily add an F32 kernel like this:

diff --git a/ggml-metal.m b/ggml-metal.m
index ee77252..4ff8ad0 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -53,6 +53,7 @@ struct ggml_metal_context {
GGML_METAL_DECL_KERNEL(gelu);
GGML_METAL_DECL_KERNEL(soft_max);
GGML_METAL_DECL_KERNEL(diag_mask_inf);
+ GGML_METAL_DECL_KERNEL(get_rows_f32);
GGML_METAL_DECL_KERNEL(get_rows_f16);
GGML_METAL_DECL_KERNEL(get_rows_q4_0);
GGML_METAL_DECL_KERNEL(get_rows_q4_1);
@@ -170,6 +171,7 @@ struct ggml_metal_context * ggml_metal_init(int n_cb) {
GGML_METAL_ADD_KERNEL(gelu);
GGML_METAL_ADD_KERNEL(soft_max);
GGML_METAL_ADD_KERNEL(diag_mask_inf);
+ GGML_METAL_ADD_KERNEL(get_rows_f32);
GGML_METAL_ADD_KERNEL(get_rows_f16);
GGML_METAL_ADD_KERNEL(get_rows_q4_0);
GGML_METAL_ADD_KERNEL(get_rows_q4_1);
@@ -893,6 +895,7 @@ void ggml_metal_graph_compute(
}
switch (src0->type) {
+ case GGML_TYPE_F32: [encoder setComputePipelineState:ctx->pipeline_get_rows_f32]; break;
case GGML_TYPE_F16: [encoder setComputePipelineState:ctx->pipeline_get_rows_f16]; break;
case GGML_TYPE_Q4_0: [encoder setComputePipelineState:ctx->pipeline_get_rows_q4_0]; break;
case GGML_TYPE_Q4_1: [encoder setComputePipelineState:ctx->pipeline_get_rows_q4_1]; break;
@@ -1127,4 +1130,4 @@ void ggml_metal_graph_compute(
GGML_ASSERT(false);
}
}
-}
\ No newline at end of file
+}
diff --git a/ggml-metal.metal b/ggml-metal.metal
index 9e9e5f4..857a47c 100644
--- a/ggml-metal.metal
+++ b/ggml-metal.metal
@@ -219,6 +219,22 @@ kernel void kernel_diag_mask_inf(
}
}
+kernel void kernel_get_rows_f32(
+ device const float * src0,
+ device const int * src1,
+ device float * dst,
+ constant int64_t & ne00,
+ constant uint64_t & nb01,
+ constant uint64_t & nb1,
+ uint tpig[[thread_position_in_grid]]) {
+ const int i = tpig;
+ const int r = ((device int32_t *) src1)[i];
+
+ for (int j = 0; j < ne00; j++) {
+ dst[i*nb1 + j] = ((device float *) ((device char *) src0 + r*nb01))[j];
+ }
+}
+
kernel void kernel_get_rows_f16(
device const void * src0,
device const int * src1,
@@ -1969,4 +1985,4 @@ kernel void kernel_mul_mat_q6_K_f32(
if (tiisg == 0) {
dst[r1*ne0 + row] = tot;
}
-}
\ No newline at end of file
+}
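For reference, the added kernel_get_rows_f32 follows the same pattern as the existing kernel_get_rows_f16 in ggml-metal.metal: each thread copies one row of src0, selected through the indices in src1, into dst, just reading float instead of half.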
Hey @ggerganov, excellent project! I've been tinkering with it for the past couple of months and wanted to contribute something back.
It's nothing major, but I think I've got the k_quants working. I've followed the llama implementation in terms of keeping it behind a flag (WHISPER_K_QUANTS), which is on by default.
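As a rough illustration of what such a flag can gate at quantization time (a sketch only: the enum values and helper below are made up, not whisper.cpp's actual types, and the real wiring in the PR may differ):

#include <stdexcept>

// Illustrative quantization types; the k-quants are the *_k entries.
enum class qtype { q4_0, q5_0, q8_0, q2_k, q3_k, q4_k, q5_k, q6_k };

static bool is_k_quant(qtype t) {
    return t == qtype::q2_k || t == qtype::q3_k || t == qtype::q4_k ||
           t == qtype::q5_k || t == qtype::q6_k;
}

// When the build flag is off, reject k-quant targets up front.
static void check_quant_type(qtype t) {
#ifndef WHISPER_K_QUANTS
    if (is_k_quant(t)) {
        throw std::runtime_error("k-quants disabled: rebuild with WHISPER_K_QUANTS");
    }
#else
    (void) t; // flag is on (the default): all types allowed
#endif
}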
I've tested it for Q2_K - Q6_K for tiny -> large and it's working across all combinations except for the tiny models, where the model fails to evaluate.
I'm new to contributing to open source (this is my first ever pull request) and C++ so if I've made any mistakes or you have any suggestions just let me know!
New model sizes: