
A few observations from working with LLamaSharp & KernelMemory #923

Open
aropb opened this issue Sep 21, 2024 · 0 comments

Description

  1. Wherever possible, it is better not to create a LLamaContext, since each context allocates additional memory.
    For example, you can call:
    weights.Tokenize()
    instead of:
    context.Tokenize()
    (see the sketch just below)
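
A minimal sketch of what I mean (the model path is a placeholder, and the exact Tokenize parameters may differ between LLamaSharp versions):

using System.Text;
using LLama;
using LLama.Common;

var modelParams = new ModelParams("path/to/model.gguf");
using var weights = LLamaWeights.LoadFromFile(modelParams);

// Preferred: tokenize straight from the weights; no context is allocated.
var tokens = weights.Tokenize("Hello!", true, false, Encoding.UTF8);

// Avoided: creating a context first costs extra (KV cache) memory.
// using var context = weights.CreateContext(modelParams);
// var tokens2 = context.Tokenize("Hello!");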

  2. A multithreading problem. It occurs when embeddings are being generated at the same moment that a question is sent to the model (the Executor). This is a big problem, and I don't think it is caused by KernelMemory itself: it seems that some native API calls need to be protected from simultaneous invocation, along the lines of the sketch below.
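
This is the kind of guard I have in mind (my own sketch, not an existing LLamaSharp API): one semaphore that every embedding or inference call passes through, so the native llama.cpp code is never entered concurrently.

using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: serializes all native-backed operations behind a single gate.
public static class NativeCallGuard
{
    private static readonly SemaphoreSlim Gate = new(1, 1);

    public static async Task<T> RunAsync<T>(Func<Task<T>> nativeCall, CancellationToken ct = default)
    {
        await Gate.WaitAsync(ct);
        try
        {
            return await nativeCall();
        }
        finally
        {
            Gate.Release();
        }
    }
}

The obvious cost is that embedding generation and inference can no longer overlap on the GPU.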

The first error:

Error: CUDA error: the function failed to launch on the GPU
current device: 0, in function ggml_cuda_mul_mat_batched_cublas at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda.cu:1889
cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), CUDA_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), CUDA_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
CUDA error: operation not permitted when stream is capturing
current device: 0, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda.cu:535
cudaDeviceSynchronize()
SafeLLamaContextHandle.llama_new_context_with_model

And here is the error that appears when the calls happen at the same time:

...
LLamaStatelessExecutor executor = new(Weights, ModelParams);
...
await foreach (string text in executor.InferAsync(prompt, DefaultInferenceParams, cancellationToken))
{
    sb.Append(text);
}
...

CUDA error: operation failed due to a previous error during capture
current device: 0, in function ggml_backend_cuda_graph_compute at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda.cu:2632
cudaStreamEndCapture(cuda_ctx->stream(), &cuda_ctx->cuda_graph->graph)
SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode => <>c__DisplayClass17_0.b__0
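
For completeness, the simultaneous-call pattern that triggers these errors looks roughly like this. It is a fragment in the same setup as the snippet above (Weights, ModelParams, prompt and DefaultInferenceParams are the members used there), and the LLamaEmbedder method names may differ between LLamaSharp versions:

var embedder = new LLamaEmbedder(Weights, ModelParams);
LLamaStatelessExecutor executor = new(Weights, ModelParams);

// Embedding generation and inference start at the same time,
// so both drive the native CUDA backend concurrently.
var embeddingTask = Task.Run(() => embedder.GetEmbeddings("some document chunk"));
var inferenceTask = Task.Run(async () =>
{
    var sb = new StringBuilder();
    await foreach (string text in executor.InferAsync(prompt, DefaultInferenceParams))
        sb.Append(text);
    return sb.ToString();
});

await Task.WhenAll(embeddingTask, inferenceTask);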

How can all these problems be solved?
