Metal Decoder Inference, K-Quants added, Sync'd with latest GGML #1074
Conversation
Hey, I realised in the end that the tiny model doesn't work: it has n_audio_state = 384, and k_quants at the time required each layer dimension to be a multiple of 256 (base through large all are), as this is the block size. There has been an update since, and k_quants can now be done with a block size of 64, so it should be quite easy to update this pull request if anyone is requiring a k_quant'd tiny model. Also, nice work on the mel_spec, you've done some really good work there!
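To make that constraint concrete, here is a tiny standalone check of the divisibility rule; the helper name is made up for illustration and is not part of whisper.cpp:

#include <cstdint>
#include <cstdio>

// Returns true if a row of n elements splits into whole k-quant blocks.
static bool is_k_quantizable(int64_t n_per_row, int64_t block_size) {
    return n_per_row % block_size == 0;
}

int main() {
    // tiny model: n_audio_state = 384
    std::printf("block 256: %s\n", is_k_quantizable(384, 256) ? "ok" : "not ok"); // not ok (384 % 256 = 128)
    std::printf("block  64: %s\n", is_k_quantizable(384,  64) ? "ok" : "not ok"); // ok (384 = 6 * 64)
}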
Update: added a few things to this pull request.
I had high hopes for the Metal inference, but unfortunately it's fallen a bit flat. It might be that my implementation isn't optimal or that there's an error in my code, but in testing, Metal inference runs slower than the CPU (see the examples below). I'm wondering if it's because the matrices involved are relatively small compared with Llama, so the overhead of using the GPU isn't offset by its speed, especially compared with the speed-up we're already getting from the Accelerate framework. The gap does appear to close on the large models, but not enough for the GPU to overtake the CPU.

In implementing this I also sync'd up Whisper's GGML with Llama's. Non-Metal inference runs as normal.

For anyone wanting to test or use this, be aware that to run on the GPU I had to make a small model change and convert the decoder.positional_embedding layer (dpe) to FP16, as GGML Metal doesn't currently support FP32. I've made a change to the model loading code so that when WHISPER_USE_METAL=1 it expects dpe to be FP16, and when WHISPER_USE_METAL=0 it expects dpe to be FP32. I've also updated the quantization code so that when you quantize a model it converts dpe to FP16. Be aware, though, that because of these changes, models quantized from this pull request won't run when WHISPER_USE_METAL=0 unless you change the non-Metal path to expect dpe in FP16 (a small sketch of this type selection follows the test listings below).

The results below are from running this on an M2 MacBook Pro with 16 GB.
See below for details of some test runs. One other thing to note is that on the smaller models (<= small) the output is slightly different; the GPU output seems to really like exclamation points for some reason.

Small Q5_K CPU
Small Q5_K GPU
Medium Q5_K CPU
Medium Q5_K GPU
Large Q5_K CPU
Large Q5_K GPU
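As mentioned above, here is a minimal sketch of the WHISPER_USE_METAL / dpe type rule; the function and enum names are made up for illustration and are not the PR's actual loader code:

#include <cstdio>

enum class dpe_type { f32, f16 };

// Sketch of the rule described above: the Metal build expects
// decoder.positional_embedding (dpe) in F16, the non-Metal build in F32.
static dpe_type expected_dpe_type(bool whisper_use_metal) {
    return whisper_use_metal ? dpe_type::f16 : dpe_type::f32;
}

int main() {
    std::printf("WHISPER_USE_METAL=1 -> dpe expected as %s\n",
                expected_dpe_type(true)  == dpe_type::f16 ? "F16" : "F32");
    std::printf("WHISPER_USE_METAL=0 -> dpe expected as %s\n",
                expected_dpe_type(false) == dpe_type::f16 ? "F16" : "F32");
}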
Hi @RRUK01 - thank you for the contribution. Nice work! Will be looking into this PR in a week or two. Sorry for the delay. The slower Metal runs are surprising.
I could be doing something wrong though. Btw, you can easily add an F32 kernel like this:

diff --git a/ggml-metal.m b/ggml-metal.m
index ee77252..4ff8ad0 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -53,6 +53,7 @@ struct ggml_metal_context {
GGML_METAL_DECL_KERNEL(gelu);
GGML_METAL_DECL_KERNEL(soft_max);
GGML_METAL_DECL_KERNEL(diag_mask_inf);
+ GGML_METAL_DECL_KERNEL(get_rows_f32);
GGML_METAL_DECL_KERNEL(get_rows_f16);
GGML_METAL_DECL_KERNEL(get_rows_q4_0);
GGML_METAL_DECL_KERNEL(get_rows_q4_1);
@@ -170,6 +171,7 @@ struct ggml_metal_context * ggml_metal_init(int n_cb) {
GGML_METAL_ADD_KERNEL(gelu);
GGML_METAL_ADD_KERNEL(soft_max);
GGML_METAL_ADD_KERNEL(diag_mask_inf);
+ GGML_METAL_ADD_KERNEL(get_rows_f32);
GGML_METAL_ADD_KERNEL(get_rows_f16);
GGML_METAL_ADD_KERNEL(get_rows_q4_0);
GGML_METAL_ADD_KERNEL(get_rows_q4_1);
@@ -893,6 +895,7 @@ void ggml_metal_graph_compute(
}
switch (src0->type) {
+ case GGML_TYPE_F32: [encoder setComputePipelineState:ctx->pipeline_get_rows_f32]; break;
case GGML_TYPE_F16: [encoder setComputePipelineState:ctx->pipeline_get_rows_f16]; break;
case GGML_TYPE_Q4_0: [encoder setComputePipelineState:ctx->pipeline_get_rows_q4_0]; break;
case GGML_TYPE_Q4_1: [encoder setComputePipelineState:ctx->pipeline_get_rows_q4_1]; break;
@@ -1127,4 +1130,4 @@ void ggml_metal_graph_compute(
GGML_ASSERT(false);
}
}
-}
\ No newline at end of file
+}
diff --git a/ggml-metal.metal b/ggml-metal.metal
index 9e9e5f4..857a47c 100644
--- a/ggml-metal.metal
+++ b/ggml-metal.metal
@@ -219,6 +219,22 @@ kernel void kernel_diag_mask_inf(
}
}
+kernel void kernel_get_rows_f32(
+ device const float * src0,
+ device const int * src1,
+ device float * dst,
+ constant int64_t & ne00,
+ constant uint64_t & nb01,
+ constant uint64_t & nb1,
+ uint tpig[[thread_position_in_grid]]) {
+ const int i = tpig;
+ const int r = ((device int32_t *) src1)[i];
+
+ for (int j = 0; j < ne00; j++) {
+ dst[i*nb1 + j] = ((device float *) ((device char *) src0 + r*nb01))[j];
+ }
+}
+
kernel void kernel_get_rows_f16(
device const void * src0,
device const int * src1,
@@ -1969,4 +1985,4 @@ kernel void kernel_mul_mat_q6_K_f32(
if (tiisg == 0) {
dst[r1*ne0 + row] = tot;
}
-}
\ No newline at end of file
+}
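For reference, the added kernel_get_rows_f32 follows the same pattern as the existing kernel_get_rows_f16 in ggml-metal.metal: each thread copies one row of src0, selected through the indices in src1, into dst, just reading float instead of half.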
Hey @ggerganov, excellent project! I've been tinkering with it for the past couple of months and wanted to contribute something back.
It's nothing major, but I think I've got the k_quants working. I've followed the llama implementation in terms of keeping it behind a flag (WHISPER_K_QUANTS), which is on by default.
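As a rough illustration of what such a flag can gate at quantization time (a sketch only: the enum values and helper below are made up, not whisper.cpp's actual types, and the real wiring in the PR may differ):

#include <stdexcept>

// Illustrative quantization types; the k-quants are the *_k entries.
enum class qtype { q4_0, q5_0, q8_0, q2_k, q3_k, q4_k, q5_k, q6_k };

static bool is_k_quant(qtype t) {
    return t == qtype::q2_k || t == qtype::q3_k || t == qtype::q4_k ||
           t == qtype::q5_k || t == qtype::q6_k;
}

// When the build flag is off, reject k-quant targets up front.
static void check_quant_type(qtype t) {
#ifndef WHISPER_K_QUANTS
    if (is_k_quant(t)) {
        throw std::runtime_error("k-quants disabled: rebuild with WHISPER_K_QUANTS");
    }
#else
    (void) t; // flag is on (the default): all types allowed
#endif
}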
I've tested it for Q2_K - Q6_K for tiny -> large and it's working across all combinations except for the tiny models, where the model fails to evaluate.
I'm new to contributing to open source (this is my first ever pull request) and C++ so if I've made any mistakes or you have any suggestions just let me know!
New model sizes: