Deploying HuggingFace TGI for Open Source LLMs in K8s / OpenShift with GitOps

The aim of this repository is to easily deploy our OpenSource LLMs in OpenShift or Kubernetes clusters using GitOps:

Overview

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and T5.

This repo will deploy HuggingFace Text Generation Inference server deployments in K8s/OpenShift with GitOps:

With this we can easily deploy different Open Source LLMs such as Llama2, Falcon, Mistral or FlanT5-XL among others in our OpenShift / Kubernetes clusters to be consumed as another application:

Requirements

ROSA or OpenShift Clusters (can be also deployed in K8s with some tweaks)
GPU available (24gb vRAM recommended)
Node Feature Discovery Operator
NVIDIA GPU Operator
ArgoCD / OpenShift GitOps

Tested with A10G (g5.2xlarge) with Spot Instances using a ROSA cluster with 4.13 version and RHODS with 2.14.0

Models available to deploy using GitOps

Mistral-7B-Instruct

kubectl apply -k gitops/mistral

Flan-T5-XXL

kubectl apply -k gitops/flant5xxl

Falcon-7B-Instruct

kubectl apply -k gitops/falcon

Llama2-7B-Chat-HF

kubectl apply -k gitops/llama2

NOTE: this model needs to set the HUGGING_FACE_HUB_TOKEN_BASE64 in a Secret to be downloaded.

export HUGGING_FACE_HUB_TOKEN_BASE64=$(echo -n 'your-token-value' | base64)
envsubst < hg-tgi/overlays/llama2-7b/hf-token-secret-template.yaml > /tmp/hf-token-secret.yaml
kubectl apply -f /tmp/hf-token-secret.yaml -n llms

CodeLlama-7b-Instruct-hf

kubectl apply -k gitops/codellama

StarCoder

kubectl apply -k gitops/starcoder

NOTE: this model needs to set the HF_TOKEN in a Secret to be downloaded.

export HUGGING_FACE_HUB_TOKEN_BASE64=$(echo -n 'your-token-value' | base64)
envsubst < hg-tgi/overlays/llama2-7b/hf-token-secret-template.yaml > /tmp/hf-token-secret.yaml
kubectl apply -f /tmp/hf-token-secret.yaml -n llms

Inference to the LLMs

Check the Inference Guide to test your LLM deployed with Hugging Face Text Generation Inference

FrontEnd Gradio ChatBot powered by HF-TGI

We will deploy alongside the HF-TGI a Gradio ChatBot application with Memory powered by LangChain.

This FrontEnd will be using the HF-TGI deployed as a backend, powering and fueling the AI NPL Chat capabilities of this FrontEnd Chatbot App.

Once the Gradio ChatBot is deployed, will access directly to the HF-TGI Server that serves the LLM of your choice (see section below), and will answer your questions:

NOTE: If you want to know more, check the original source rh-aiservices-bu repository.

Extra Notes

Repo is heavily based in the llm-on-openshift repo. Kudos to the team!

Name		Name	Last commit message	Last commit date
Latest commit History 91 Commits
accelerator		accelerator
argocd-clusterrole		argocd-clusterrole
assets		assets
gitops		gitops
hg-tgi		hg-tgi
inference		inference
ui		ui
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
clean.sh		clean.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Deploying HuggingFace TGI for Open Source LLMs in K8s / OpenShift with GitOps

Overview

Requirements

Models available to deploy using GitOps

Mistral-7B-Instruct

Flan-T5-XXL

Falcon-7B-Instruct

Llama2-7B-Chat-HF

CodeLlama-7b-Instruct-hf

StarCoder

Inference to the LLMs

FrontEnd Gradio ChatBot powered by HF-TGI

Extra Notes

About

Releases

Packages

Languages

License

erkerc/hftgi-llms

Folders and files

Latest commit

History

Repository files navigation

Deploying HuggingFace TGI for Open Source LLMs in K8s / OpenShift with GitOps

Overview

Requirements

Models available to deploy using GitOps

Inference to the LLMs

FrontEnd Gradio ChatBot powered by HF-TGI

Extra Notes

About

Resources

License

Stars

Watchers

Forks

Languages