
Metal Decoder Inference, K-Quants added, Sync'd with latest GGML #1074

Closed
wants to merge 13 commits

Conversation

RRUK01

@RRUK01 RRUK01 commented Jul 1, 2023

Hey @ggerganov, excellent project! I've been tinkering with it for the past couple of months and wanted to contribute something back.

It's nothing major but I think I've got the k_quants working. I've followed the llama.cpp implementation in keeping it behind a flag (WHISPER_K_QUANTS), which is on by default.

I've tested Q2_K through Q6_K for tiny -> large and it works across all combinations except for the tiny models, where the model fails to evaluate.

I'm new to contributing to open source (this is my first ever pull request) and C++ so if I've made any mistakes or you have any suggestions just let me know!

New model sizes:

model    k_quant   size (MB)
base     Q2_K        29.9
base     Q3_K        37.1
base     Q4_K        46.5
base     Q5_K        55.3
base     Q6_K        64.7
small    Q2_K        89.7
small    Q3_K       113.8
small    Q4_K       145.5
small    Q5_K       175.2
small    Q6_K       206.8
medium   Q2_K       266.9
medium   Q3_K       343.9
medium   Q4_K       444.5
medium   Q5_K       539.2
medium   Q6_K       639.9
large    Q2_K       529.3
large    Q3_K       685.1
large    Q4_K       888.9
large    Q5_K      1080.8
large    Q6_K      1284.6

@bobqianic
Collaborator

@RRUK01 Hi, could you please check if pull request #1148 resolves the issue with evaluating the tiny models? I suspect that the issue may be due to an error in the generation of the log-mel spectrogram.

@RRUK01
Author

RRUK01 commented Aug 5, 2023

> @RRUK01 Hi, could you please check if pull request #1148 resolves the issue with evaluating the tiny models? I suspect that the issue may be due to an error in the generation of the log-mel spectrogram.

Hey, I realised in the end that the tiny model doesn't work because it has n_audio_state = 384, and k_quants at the time required each layer dimension to be a multiple of 256 (the 'block size'), which base -> large all satisfy. There has since been an update allowing k_quants with a block size of 64, so it should be fairly easy to update this pull request if anyone needs a k-quantized tiny model.
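The block-size constraint above can be sketched in a few lines. This is a hypothetical illustration, not code from the PR; the n_audio_state values are the standard Whisper model dimensions, and QK_K stands in for the k-quant super-block size:

```python
# Hypothetical sketch: why the tiny model fails with 256-wide k-quant blocks.
QK_K = 256  # k-quant super-block size at the time of this PR

n_audio_state = {
    "tiny":   384,   # not a multiple of 256 -> can't be k-quantized
    "base":   512,
    "small":  768,
    "medium": 1024,
    "large":  1280,
}

for name, dim in n_audio_state.items():
    # A tensor row can only be split into whole super-blocks if its
    # width divides evenly by QK_K.
    ok = dim % QK_K == 0
    print(f"{name:7s} n_audio_state={dim:5d} quantizable={ok}")
```

With the later block-size-64 update, 384 % 64 == 0, which is why tiny becomes quantizable.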

Also nice work on the mel_spec, you've done some really good work there!

@RRUK01 RRUK01 changed the title Adding k_quants Metal Decoder Inference, K-Quants added, Sync'd with latest GGML Aug 6, 2023
@RRUK01
Author

RRUK01 commented Aug 6, 2023

Update

Added a few things to this pull request:

I had high hopes for the Metal inference but unfortunately it's fallen a bit flat. It might be that my implementation isn't optimal or there's an error in my code, but in testing, Metal inference runs slower than the CPU (see examples below). I'm wondering if it's because the matrices involved are relatively small compared with Llama, so the overhead of using the GPU isn't offset by its speed, especially compared with the speed-up we're already getting from the Accelerate framework. The gap does appear to close on larger models, but not enough for Metal to overtake the CPU.

In implementing this I also sync'd Whisper's GGML with Llama's. Non-Metal inference runs as normal.

For anyone wanting to test/use this, be aware that to run on the GPU I had to make a small model change: the decoder.positional_embedding layer (dpe) is converted to FP16, as GGML Metal doesn't currently support FP32 for it. I've changed the model-loading code so that when WHISPER_USE_METAL=1 it expects dpe to be FP16, and when WHISPER_USE_METAL=0 it expects dpe to be FP32. I've also updated the quantization code so that quantizing a model converts dpe to FP16. Be aware though that because of these changes, models quantized from this pull request won't run when WHISPER_USE_METAL=0 unless you change the non-Metal path to expect dpe in FP16.
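The FP32 -> FP16 conversion described above can be sketched with the standard library's half-precision pack format. This is a hypothetical illustration of the idea only — the actual PR does this in C++ via GGML's tensor types, and the values here are stand-ins, not real positional-embedding data:

```python
# Hypothetical sketch of the dpe idea: storing float values as IEEE 754
# half precision (FP16) so the Metal path can consume them.
import struct

def floats_to_f16_bytes(values):
    # Pack each float as 2-byte half precision ('e' format, Python >= 3.6).
    return struct.pack(f"{len(values)}e", *values)

def f16_bytes_to_floats(buf):
    # Unpack half-precision bytes back to Python floats.
    n = len(buf) // 2
    return list(struct.unpack(f"{n}e", buf))

pos_emb = [0.125, -0.5, 1.0, 3.14159]   # stand-in for decoder.positional_embedding
half = floats_to_f16_bytes(pos_emb)
print(len(half))  # 2 bytes per element instead of 4 -> prints 8
print(f16_bytes_to_floats(half))
```

Note the precision loss on round-trip (3.14159 comes back as roughly 3.1406): acceptable for positional embeddings, which is presumably why the conversion is safe here.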

The results below are from an M2 MacBook Pro (16 GB).

Model           CPU decode (ms/run)   Metal decode (ms/run)
Small (Q5_K)     4.24                  7.00
Medium (Q5_K)   11.47                 14.43
Large (Q5_K)    25.80                 28.37
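The "gap closes on larger models" observation can be checked directly from the numbers above. A quick hypothetical calculation (not part of the PR):

```python
# Metal vs CPU decode time per run (ms), from the table above.
timings = {
    "small (Q5_K)":  (4.24, 7.00),
    "medium (Q5_K)": (11.47, 14.43),
    "large (Q5_K)":  (25.80, 28.37),
}

for model, (cpu_ms, metal_ms) in timings.items():
    # Ratio > 1.0 means Metal is slower; the ratio shrinks as the
    # matrices get bigger and GPU dispatch overhead is amortised.
    print(f"{model:14s} Metal/CPU = {metal_ms / cpu_ms:.2f}x")
```

The slowdown falls from about 1.65x (small) to about 1.10x (large), consistent with per-launch overhead dominating on small matrix multiplies.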

See below for details of some test runs. One other thing to note is that on the smaller models (<= small) the output is slightly different; the GPU output seems to really like exclamation points for some reason.

Small Q5_K CPU
whisper_init_from_file_no_state: loading model from 'models/ggml-small_q5_k.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 13
whisper_model_load: qntvr         = 2
whisper_model_load: type          = 3
whisper_model_load: mem required  =  453.00 MB (+   16.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  166.10 MB
whisper_model_load: model size    =  165.87 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 | METAL = 0 | K_QUANTS = 1 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =    97.75 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    33.77 ms
whisper_print_timings:   sample time =    10.94 ms /    28 runs (    0.39 ms per run)
whisper_print_timings:   encode time =   986.92 ms /     1 runs (  986.92 ms per run)
whisper_print_timings:   decode time =   118.77 ms /    28 runs (    4.24 ms per run)
whisper_print_timings:    total time =  1258.26 ms

Small Q5_K GPU
whisper_init_from_file_no_state: loading model from 'models/ggml-small_q5_k.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 13
whisper_model_load: qntvr         = 2
whisper_model_load: type          = 3
whisper_model_load: mem required  =  453.00 MB (+   16.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  166.10 MB
whisper_model_load: model size    =  165.87 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loaded kernel_add                            0x13881aa60
ggml_metal_init: loaded kernel_add_row                        0x13881cc20
ggml_metal_init: loaded kernel_mul                            0x13881c080
ggml_metal_init: loaded kernel_mul_row                        0x13881d6a0
ggml_metal_init: loaded kernel_scale                          0x13881e100
ggml_metal_init: loaded kernel_silu                           0x13881e940
ggml_metal_init: loaded kernel_relu                           0x13881c480
ggml_metal_init: loaded kernel_gelu                           0x13881f240
ggml_metal_init: loaded kernel_soft_max                       0x138820280
ggml_metal_init: loaded kernel_diag_mask_inf                  0x1388214e0
ggml_metal_init: loaded kernel_get_rows_f16                   0x138821740
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x1073044f0
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x1073050a0
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x107305610
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x107305f30
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x107306980
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x1073072d0
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x107307ad0
ggml_metal_init: loaded kernel_rms_norm                       0x107308540
ggml_metal_init: loaded kernel_norm                           0x107309640
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x10730a240
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x10730adc0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x10730b790
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x10730c140
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x10730cc50
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x10730d580
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x10730df60
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x10730e930
ggml_metal_init: loaded kernel_rope                           0x10730eb90
ggml_metal_init: loaded kernel_alibi_f32                      0x10730f5c0
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x107310140
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x138822560
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x130372a50
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
whisper_init_from_file: max tensor size =    26.12 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =   176.00 MB, (  176.45 / 10922.67)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =    56.00 MB, (  232.45 / 10922.67)
ggml_metal_add_buffer: allocated 'kvself          ' buffer, size =    16.00 MB, (  248.45 / 10922.67)
ggml_metal_add_buffer: allocated 'kvcross         ' buffer, size =    53.00 MB, (  301.45 / 10922.67)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   120.00 MB, (  421.45 / 10922.67)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =    36.00 MB, (  457.45 / 10922.67)
ggml_metal_add_buffer: allocated 'scr2            ' buffer, size =     6.00 MB, (  463.45 / 10922.67)
ggml_metal_add_buffer: allocated 'scr3            ' buffer, size =     6.00 MB, (  469.45 / 10922.67)

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 | METAL = 1 | K_QUANTS = 1 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.000]   And! So, my fellow Americans, ask not what your country can do!
[00:00:07.000 --> 00:00:11.000]   Ask what! you can do for your country!


whisper_print_timings:     load time =   150.77 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    34.22 ms
whisper_print_timings:   sample time =    11.23 ms /    30 runs (    0.37 ms per run)
whisper_print_timings:   encode time =  1105.81 ms /     1 runs ( 1105.81 ms per run)
whisper_print_timings:   decode time =   210.00 ms /    30 runs (    7.00 ms per run)
whisper_print_timings:    total time =  1559.63 ms

Medium Q5_K CPU
whisper_init_from_file_no_state: loading model from 'models/ggml-medium_q5_k.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 13
whisper_model_load: qntvr         = 2
whisper_model_load: type          = 4
whisper_model_load: mem required  =  975.00 MB (+   43.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  513.23 MB
whisper_model_load: model size    =  512.77 MB
whisper_init_state: kv self size  =   42.00 MB
whisper_init_state: kv cross size =  140.62 MB

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 | METAL = 0 | K_QUANTS = 1 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.800]   And so, my fellow Americans, ask not what your country can do for you.
[00:00:07.800 --> 00:00:10.800]   Ask what you can do for your country.


whisper_print_timings:     load time =   182.82 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    34.30 ms
whisper_print_timings:   sample time =    11.66 ms /    30 runs (    0.39 ms per run)
whisper_print_timings:   encode time =  3035.53 ms /     1 runs ( 3035.53 ms per run)
whisper_print_timings:   decode time =   344.22 ms /    30 runs (   11.47 ms per run)
whisper_print_timings:    total time =  3630.39 ms
Medium Q5_K GPU
whisper_init_from_file_no_state: loading model from 'models/ggml-medium_q5_k.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 13
whisper_model_load: qntvr         = 2
whisper_model_load: type          = 4
whisper_model_load: mem required  =  975.00 MB (+   43.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  513.23 MB
whisper_model_load: model size    =  512.77 MB
whisper_init_state: kv self size  =   42.00 MB
whisper_init_state: kv cross size =  140.62 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loaded kernel_add                            0x12462d840
ggml_metal_init: loaded kernel_add_row                        0x12462fa00
ggml_metal_init: loaded kernel_mul                            0x12462ee60
ggml_metal_init: loaded kernel_mul_row                        0x124630480
ggml_metal_init: loaded kernel_scale                          0x124630ee0
ggml_metal_init: loaded kernel_silu                           0x1246316f0
ggml_metal_init: loaded kernel_relu                           0x12462f260
ggml_metal_init: loaded kernel_gelu                           0x124632030
ggml_metal_init: loaded kernel_soft_max                       0x124632f60
ggml_metal_init: loaded kernel_diag_mask_inf                  0x1246342d0
ggml_metal_init: loaded kernel_get_rows_f16                   0x124634530
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x124635740
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x124634bc0
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x124634e20
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x124636140
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x124636ac0
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x124637280
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x124637b90
ggml_metal_init: loaded kernel_rms_norm                       0x124638670
ggml_metal_init: loaded kernel_norm                           0x124639790
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x124639b90
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x12463ad90
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x12463b750
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x12463c140
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x12463cc60
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x12463d720
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x12463df70
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x12463e940
ggml_metal_init: loaded kernel_rope                           0x12463f3e0
ggml_metal_init: loaded kernel_alibi_f32                      0x12463ffa0
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x124640e30
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x1246419e0
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x124638fb0
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
whisper_init_from_file: max tensor size =    34.82 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =   540.00 MB, (  540.45 / 10922.67)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =    74.00 MB, (  614.45 / 10922.67)
ggml_metal_add_buffer: allocated 'kvself          ' buffer, size =    43.00 MB, (  657.45 / 10922.67)
ggml_metal_add_buffer: allocated 'kvcross         ' buffer, size =   141.00 MB, (  798.45 / 10922.67)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   158.00 MB, (  956.45 / 10922.67)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =    48.00 MB, ( 1004.45 / 10922.67)
ggml_metal_add_buffer: allocated 'scr2            ' buffer, size =     7.00 MB, ( 1011.45 / 10922.67)
ggml_metal_add_buffer: allocated 'scr3            ' buffer, size =     7.00 MB, ( 1018.45 / 10922.67)

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 | METAL = 1 | K_QUANTS = 1 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.800]   And so, my fellow Americans, ask not what your country can do for you.
[00:00:07.800 --> 00:00:10.800]   Ask what you can do for your country.


whisper_print_timings:     load time =   179.97 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    33.94 ms
whisper_print_timings:   sample time =    13.76 ms /    30 runs (    0.46 ms per run)
whisper_print_timings:   encode time =  2981.41 ms /     1 runs ( 2981.41 ms per run)
whisper_print_timings:   decode time =   432.81 ms /    30 runs (   14.43 ms per run)
whisper_print_timings:    total time =  3695.76 ms
ggml_metal_free: deallocating
Large Q5_K CPU
whisper_init_from_file_no_state: loading model from 'models/ggml-large_q5_k.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 13
whisper_model_load: qntvr         = 2
whisper_model_load: type          = 5
whisper_model_load: mem required  = 1686.00 MB (+   71.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     = 1029.58 MB
whisper_model_load: model size    = 1028.97 MB
whisper_init_state: kv self size  =   70.00 MB
whisper_init_state: kv cross size =  234.38 MB

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 | METAL = 0 | K_QUANTS = 1 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   479.21 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    40.84 ms
whisper_print_timings:   sample time =    10.74 ms /    27 runs (    0.40 ms per run)
whisper_print_timings:   encode time =  5745.24 ms /     1 runs ( 5745.24 ms per run)
whisper_print_timings:   decode time =   696.49 ms /    27 runs (   25.80 ms per run)
whisper_print_timings:    total time =  7029.23 ms
Large Q5_K GPU
whisper_init_from_file_no_state: loading model from 'models/ggml-large_q5_k.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 13
whisper_model_load: qntvr         = 2
whisper_model_load: type          = 5
whisper_model_load: mem required  = 1686.00 MB (+   71.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     = 1029.58 MB
whisper_model_load: model size    = 1028.97 MB
whisper_init_state: kv self size  =   70.00 MB
whisper_init_state: kv cross size =  234.38 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loaded kernel_add                            0x149ca6390
ggml_metal_init: loaded kernel_add_row                        0x149ca8d40
ggml_metal_init: loaded kernel_mul                            0x149ca8120
ggml_metal_init: loaded kernel_mul_row                        0x149ca97c0
ggml_metal_init: loaded kernel_scale                          0x149caa230
ggml_metal_init: loaded kernel_silu                           0x149caaa70
ggml_metal_init: loaded kernel_relu                           0x149ca8520
ggml_metal_init: loaded kernel_gelu                           0x149cab3a0
ggml_metal_init: loaded kernel_soft_max                       0x149cac2e0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x149cad650
ggml_metal_init: loaded kernel_get_rows_f16                   0x149cad8b0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x149caeb10
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x155f597c0
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x155f5a550
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x155f5aa70
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x155f5b4a0
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x1498774f0
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x149878260
ggml_metal_init: loaded kernel_rms_norm                       0x149878870
ggml_metal_init: loaded kernel_norm                           0x14987bab0
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x14987c6b0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x149e04a30
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x149e05020
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x155f5be20
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x155f5cf80
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x14987c910
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x14987d620
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x14987dba0
ggml_metal_init: loaded kernel_rope                           0x155e192c0
ggml_metal_init: loaded kernel_alibi_f32                      0x155e19f00
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x155e1af30
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x14987f5c0
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x149878d10
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
whisper_init_from_file: max tensor size =    43.53 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  1081.00 MB, ( 1081.45 / 10922.67)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =    94.00 MB, ( 1175.45 / 10922.67)
ggml_metal_add_buffer: allocated 'kvself          ' buffer, size =    71.00 MB, ( 1246.45 / 10922.67)
ggml_metal_add_buffer: allocated 'kvcross         ' buffer, size =   235.00 MB, ( 1481.45 / 10922.67)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   198.00 MB, ( 1679.45 / 10922.67)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =    60.00 MB, ( 1739.45 / 10922.67)
ggml_metal_add_buffer: allocated 'scr2            ' buffer, size =     9.00 MB, ( 1748.45 / 10922.67)
ggml_metal_add_buffer: allocated 'scr3            ' buffer, size =     9.00 MB, ( 1757.45 / 10922.67)

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 | METAL = 1 | K_QUANTS = 1 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   657.45 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    34.17 ms
whisper_print_timings:   sample time =    11.60 ms /    27 runs (    0.43 ms per run)
whisper_print_timings:   encode time =  6002.63 ms /     1 runs ( 6002.63 ms per run)
whisper_print_timings:   decode time =   766.03 ms /    27 runs (   28.37 ms per run)
whisper_print_timings:    total time =  7651.89 ms
ggml_metal_free: deallocating

@ggerganov
Owner

Hi @RRUK01 - thank you for the contribution. Nice work!

Will be looking into this PR in a week or two. Sorry for the delay.

The slower Metal runs are surprising.
I tried to do a quick Metal run, but I get the error:

 16:03:33 ▶ ⚓ v1.4.2-79-gf5c5888 ▶ $ ▶WHISPER_USE_METAL=1 make -j &&  ./main -m models/ggml-small.en.bin -f samples/gb0.wav 
I whisper.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_DARWIN_C_SOURCE -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_DARWIN_C_SOURCE -pthread -DGGML_USE_K_QUANTS -DWHISPER_USE_METAL 
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders -framework CoreGraphics
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_DARWIN_C_SOURCE -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -c ggml-metal.m -o ggml-metal.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_DARWIN_C_SOURCE -pthread -DGGML_USE_K_QUANTS -DWHISPER_USE_METAL  examples/main/main.cpp examples/common.cpp examples/common-ggml.cpp ggml.o k_quants.o whisper.o ggml-metal.o -o main  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders -framework CoreGraphics
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_DARWIN_C_SOURCE -pthread -DGGML_USE_K_QUANTS -DWHISPER_USE_METAL  examples/bench/bench.cpp ggml.o k_quants.o whisper.o ggml-metal.o -o bench  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders -framework CoreGraphics
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_DARWIN_C_SOURCE -pthread -DGGML_USE_K_QUANTS -DWHISPER_USE_METAL  examples/quantize/quantize.cpp examples/common.cpp examples/common-ggml.cpp ggml.o k_quants.o whisper.o ggml-metal.o -o quantize  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders -framework CoreGraphics
./main -h

usage: ./main [options] file0.wav file1.wav ...

options:
  -h,        --help              [default] show this help message and exit
  -t N,      --threads N         [4      ] number of threads to use during computation
  -p N,      --processors N      [1      ] number of processors to use during computation
  -ot N,     --offset-t N        [0      ] time offset in milliseconds
  -on N,     --offset-n N        [0      ] segment index offset
  -d  N,     --duration N        [0      ] duration of audio to process in milliseconds
  -mc N,     --max-context N     [-1     ] maximum number of text context tokens to store
  -ml N,     --max-len N         [0      ] maximum segment length in characters
  -sow,      --split-on-word     [false  ] split on word rather than on token
  -bo N,     --best-of N         [2      ] number of best candidates to keep
  -bs N,     --beam-size N       [-1     ] beam size for beam search
  -wt N,     --word-thold N      [0.01   ] word timestamp probability threshold
  -et N,     --entropy-thold N   [2.40   ] entropy threshold for decoder fail
  -lpt N,    --logprob-thold N   [-1.00  ] log probability threshold for decoder fail
  -su,       --speed-up          [false  ] speed up audio by x2 (reduced accuracy)
  -tr,       --translate         [false  ] translate from source language to english
  -di,       --diarize           [false  ] stereo audio diarization
  -tdrz,     --tinydiarize       [false  ] enable tinydiarize (requires a tdrz model)
  -nf,       --no-fallback       [false  ] do not use temperature fallback while decoding
  -otxt,     --output-txt        [false  ] output result in a text file
  -ovtt,     --output-vtt        [false  ] output result in a vtt file
  -osrt,     --output-srt        [false  ] output result in a srt file
  -olrc,     --output-lrc        [false  ] output result in a lrc file
  -owts,     --output-words      [false  ] output script for generating karaoke video
  -fp,       --font-path         [/System/Library/Fonts/Supplemental/Courier New Bold.ttf] path to a monospace font for karaoke video
  -ocsv,     --output-csv        [false  ] output result in a CSV file
  -oj,       --output-json       [false  ] output result in a JSON file
  -of FNAME, --output-file FNAME [       ] output file path (without file extension)
  -ps,       --print-special     [false  ] print special tokens
  -pc,       --print-colors      [false  ] print colors
  -pp,       --print-progress    [false  ] print progress
  -nt,       --no-timestamps     [false  ] do not print timestamps
  -l LANG,   --language LANG     [en     ] spoken language ('auto' for auto-detect)
  -dl,       --detect-language   [false  ] exit after automatically detecting language
             --prompt PROMPT     [       ] initial prompt
  -m FNAME,  --model FNAME       [models/ggml-base.en.bin] model path
  -f FNAME,  --file FNAME        [       ] input WAV file path
  -oved D,   --ov-e-device DNAME [CPU    ] the OpenVINO device used for encode inference

whisper_init_from_file_no_state: loading model from 'models/ggml-small.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3
whisper_model_load: mem required  =  743.00 MB (+   16.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  464.68 MB
whisper_model_load: model size    =  464.44 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/ggerganov/development/github/whisper.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x123e0b400
ggml_metal_init: loaded kernel_add_row                        0x123e0bbb0
ggml_metal_init: loaded kernel_mul                            0x123e0c0f0
ggml_metal_init: loaded kernel_mul_row                        0x123e0c740
ggml_metal_init: loaded kernel_scale                          0x123e0cc80
ggml_metal_init: loaded kernel_silu                           0x123e0d1c0
ggml_metal_init: loaded kernel_relu                           0x123e0d700
ggml_metal_init: loaded kernel_gelu                           0x123e0dc40
ggml_metal_init: loaded kernel_soft_max                       0x123e0e310
ggml_metal_init: loaded kernel_diag_mask_inf                  0x123e0e990
ggml_metal_init: loaded kernel_get_rows_f32                   0x123e0f1b0
ggml_metal_init: loaded kernel_get_rows_f16                   0x123f05130
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x123f05730
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x123f05dd0
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x123f06470
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x123f06b10
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x123f071b0
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x123f07850
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x123f07ef0
ggml_metal_init: loaded kernel_rms_norm                       0x123f08860
ggml_metal_init: loaded kernel_norm                           0x123f08f30
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x123f09800
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x123f09ee0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x123f0a740
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x123f0ae20
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x123f0b500
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x123f0bbe0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x123f0c4a0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x123f0cb60
ggml_metal_init: loaded kernel_rope                           0x123f0d0a0
ggml_metal_init: loaded kernel_alibi_f32                      0x123f0dbe0
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x123f0e490
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x123f0ec20
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x123f0f3b0
ggml_metal_init: recommendedMaxWorkingSetSize = 147456.00 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
whisper_init_from_file: max tensor size =    75.97 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =   466.00 MB, (  466.45 / 147456.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =    56.00 MB, (  522.45 / 147456.00)
ggml_metal_add_buffer: allocated 'kvself          ' buffer, size =    16.00 MB, (  538.45 / 147456.00)
ggml_metal_add_buffer: allocated 'kvcross         ' buffer, size =    53.00 MB, (  591.45 / 147456.00)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   120.00 MB, (  711.45 / 147456.00)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =    36.00 MB, (  747.45 / 147456.00)
ggml_metal_add_buffer: allocated 'scr2            ' buffer, size =     6.00 MB, (  753.45 / 147456.00)
ggml_metal_add_buffer: allocated 'scr3            ' buffer, size =     6.00 MB, (  759.45 / 147456.00)

system_info: n_threads = 4 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 | METAL = 1 | K_QUANTS = 1 | 

main: processing 'samples/gb0.wav' (2037686 samples, 127.4 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil

I could be doing something wrong though.

Btw, you can easily add an F32 kernel like this:

diff --git a/ggml-metal.m b/ggml-metal.m
index ee77252..4ff8ad0 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -53,6 +53,7 @@ struct ggml_metal_context {
     GGML_METAL_DECL_KERNEL(gelu);
     GGML_METAL_DECL_KERNEL(soft_max);
     GGML_METAL_DECL_KERNEL(diag_mask_inf);
+    GGML_METAL_DECL_KERNEL(get_rows_f32);
     GGML_METAL_DECL_KERNEL(get_rows_f16);
     GGML_METAL_DECL_KERNEL(get_rows_q4_0);
     GGML_METAL_DECL_KERNEL(get_rows_q4_1);
@@ -170,6 +171,7 @@ struct ggml_metal_context * ggml_metal_init(int n_cb) {
         GGML_METAL_ADD_KERNEL(gelu);
         GGML_METAL_ADD_KERNEL(soft_max);
         GGML_METAL_ADD_KERNEL(diag_mask_inf);
+        GGML_METAL_ADD_KERNEL(get_rows_f32);
         GGML_METAL_ADD_KERNEL(get_rows_f16);
         GGML_METAL_ADD_KERNEL(get_rows_q4_0);
         GGML_METAL_ADD_KERNEL(get_rows_q4_1);
@@ -893,6 +895,7 @@ void ggml_metal_graph_compute(
                             }
 
                             switch (src0->type) {
+                                case GGML_TYPE_F32:  [encoder setComputePipelineState:ctx->pipeline_get_rows_f32]; break;
                                 case GGML_TYPE_F16:  [encoder setComputePipelineState:ctx->pipeline_get_rows_f16]; break;
                                 case GGML_TYPE_Q4_0: [encoder setComputePipelineState:ctx->pipeline_get_rows_q4_0]; break;
                                 case GGML_TYPE_Q4_1: [encoder setComputePipelineState:ctx->pipeline_get_rows_q4_1]; break;
@@ -1127,4 +1130,4 @@ void ggml_metal_graph_compute(
             GGML_ASSERT(false);
         }
     }
-}
\ No newline at end of file
+}
diff --git a/ggml-metal.metal b/ggml-metal.metal
index 9e9e5f4..857a47c 100644
--- a/ggml-metal.metal
+++ b/ggml-metal.metal
@@ -219,6 +219,22 @@ kernel void kernel_diag_mask_inf(
     }
 }
 
+kernel void kernel_get_rows_f32(
+        device const float * src0,
+        device const   int * src1,
+        device       float * dst,
+        constant   int64_t & ne00,
+        constant  uint64_t & nb01,
+        constant  uint64_t & nb1,
+        uint tpig[[thread_position_in_grid]]) {
+    const int i = tpig;
+    const int r = ((device int32_t *) src1)[i];
+
+    for (int j = 0; j < ne00; j++) {
+        dst[i*nb1 + j] = ((device float *) ((device char *) src0 + r*nb01))[j];
+    }
+}
+
 kernel void kernel_get_rows_f16(
         device const  void * src0,
         device const   int * src1,
@@ -1969,4 +1985,4 @@ kernel void kernel_mul_mat_q6_K_f32(
     if (tiisg == 0) {
         dst[r1*ne0 + row] = tot;
     }
-}
\ No newline at end of file
+}

@RRUK01 RRUK01 closed this by deleting the head repository Nov 6, 2023