
A few observations from working with LLamaSharp & KernelMemory #923

Open
aropb opened this issue Sep 21, 2024 · 0 comments

Description

  1. Wherever possible, it is better not to create a LLamaContext, since each context allocates additional memory.
    For example, you can call:
    weights.Tokenize()
    instead of:
    context.Tokenize()
    (see the sketch just below)
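
A minimal sketch of what I mean (the model path is a placeholder, and the exact Tokenize parameters may differ between LLamaSharp versions):

using System.Text;
using LLama;
using LLama.Common;

var modelParams = new ModelParams("path/to/model.gguf");
using var weights = LLamaWeights.LoadFromFile(modelParams);

// Preferred: tokenize straight from the weights; no context is allocated.
var tokens = weights.Tokenize("Hello!", true, false, Encoding.UTF8);

// Avoided: creating a context first costs extra (KV cache) memory.
// using var context = weights.CreateContext(modelParams);
// var tokens2 = context.Tokenize("Hello!");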

  2. A multithreading problem. It occurs when embeddings are being generated at the same moment that a question is sent to the model (the Executor). This is a big problem, and I don't think it is caused by KernelMemory itself: it seems that some native API calls need to be protected from simultaneous invocation, along the lines of the sketch below.
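
This is the kind of guard I have in mind (my own sketch, not an existing LLamaSharp API): one semaphore that every embedding or inference call passes through, so the native llama.cpp code is never entered concurrently.

using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: serializes all native-backed operations behind a single gate.
public static class NativeCallGuard
{
    private static readonly SemaphoreSlim Gate = new(1, 1);

    public static async Task<T> RunAsync<T>(Func<Task<T>> nativeCall, CancellationToken ct = default)
    {
        await Gate.WaitAsync(ct);
        try
        {
            return await nativeCall();
        }
        finally
        {
            Gate.Release();
        }
    }
}

The obvious cost is that embedding generation and inference can no longer overlap on the GPU.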

The first error:

Error: CUDA error: the function failed to launch on the GPU
current device: 0, in function ggml_cuda_mul_mat_batched_cublas at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda.cu:1889
cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), CUDA_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), CUDA_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
CUDA error: operation not permitted when stream is capturing
current device: 0, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda.cu:535
cudaDeviceSynchronize()
SafeLLamaContextHandle.llama_new_context_with_model

And here is the error that appears when the calls happen at the same time:

...
LLamaStatelessExecutor executor = new(Weights, ModelParams);
...
await foreach (string text in executor.InferAsync(prompt, DefaultInferenceParams, cancellationToken))
{
    sb.Append(text);
}
...

CUDA error: operation failed due to a previous error during capture
current device: 0, in function ggml_backend_cuda_graph_compute at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda.cu:2632
cudaStreamEndCapture(cuda_ctx->stream(), &cuda_ctx->cuda_graph->graph)
SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode => <>c__DisplayClass17_0.b__0
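
For completeness, the simultaneous-call pattern that triggers these errors looks roughly like this. It is a fragment in the same setup as the snippet above (Weights, ModelParams, prompt and DefaultInferenceParams are the members used there), and the LLamaEmbedder method names may differ between LLamaSharp versions:

var embedder = new LLamaEmbedder(Weights, ModelParams);
LLamaStatelessExecutor executor = new(Weights, ModelParams);

// Embedding generation and inference start at the same time,
// so both drive the native CUDA backend concurrently.
var embeddingTask = Task.Run(() => embedder.GetEmbeddings("some document chunk"));
var inferenceTask = Task.Run(async () =>
{
    var sb = new StringBuilder();
    await foreach (string text in executor.InferAsync(prompt, DefaultInferenceParams))
        sb.Append(text);
    return sb.ToString();
});

await Task.WhenAll(embeddingTask, inferenceTask);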

How can all these problems be solved?
