diff --git a/Gemma/Using_Gemma_with_Llamafile.ipynb b/Gemma/Using_Gemma_with_Llamafile.ipynb new file mode 100644 index 0000000..a380fe9 --- /dev/null +++ b/Gemma/Using_Gemma_with_Llamafile.ipynb @@ -0,0 +1,481 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "Tce3stUlHN0L" + }, + "source": [ + "##### Copyright 2024 Google LLC." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "tuOe1ymfHZPu" + }, + "outputs": [], + "source": [ + "# @title Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dfsDR_omdNea" + }, + "source": [ + "# Getting Started with Gemma 2 and Llamafile\n", + "\n", + "[Gemma](https://ai.google.dev/gemma) is a family of lightweight, state-of-the-art open language models from Google. Built from the same research and technology used to create the Gemini models, Gemma models are text-to-text, decoder-only large language models (LLMs), available in English, with open weights, pre-trained variants, and instruction-tuned variants.\n", + "\n", + "Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.\n", + "\n", + "[Llamafile](https://github.com/Mozilla-Ocho/llamafile) is a tool that simplifies the distribution and execution of open Large Language Models (LLMs) by packaging them into a single-file executable called a \"llamafile.\" By combining llama.cpp with Cosmopolitan Libc, it consolidates the complexity of LLMs into one framework that runs locally on most computers without any installation. The goal is to make open LLMs more accessible to both developers and end users.\n", + "\n", + "In this tutorial, you will learn how to run the Gemma 2 model from Google using Llamafile.\n", + "\n", + "\n", + " \n", + "
\n", + " Run in Google Colab\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FaqZItBdeokU" + }, + "source": [ + "## Setup\n", + "\n", + "### Select the Colab runtime\n", + "To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:\n", + "\n", + "1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.\n", + "2. Select **Change runtime type**.\n", + "3. Under **Hardware accelerator**, select **T4 GPU**.\n", + "\n", + "### Gemma setup\n", + "\n", + "**Before you dive into the tutorial, let's get you set up with Gemma:**\n", + "\n", + "1. **Hugging Face Account:** If you don't already have one, you can create a free Hugging Face account by clicking [here](https://huggingface.co/join).\n", + "2. **Gemma Model Access:** Head over to the [Gemma model page](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) and accept the usage conditions.\n", + "3. **Colab with Gemma Power:** For this tutorial, you'll need a Colab runtime with enough resources to handle the Gemma 2B model. Choose an appropriate runtime when starting your Colab session.\n", + "4. **Hugging Face Token:** Generate a Hugging Face access (preferably `write` permission) token by clicking [here](https://huggingface.co/settings/tokens). You'll need this token later in the tutorial.\n", + "\n", + "**Once you've completed these steps, you're ready to move on to the next section where we'll set up environment variables in your Colab environment.**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CY2kGtsyYpHF" + }, + "source": [ + "### Configure your HF token\n", + "\n", + "Add your Hugging Face token to the Colab Secrets manager to securely store it.\n", + "\n", + "1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. \"The\n", + "2. Create a new secret with the name `HF_TOKEN`.\n", + "3. Copy/paste your token key into the Value input box of `HF_TOKEN`.\n", + "4. Toggle the button on the left to allow notebook access to the secret.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7-1PYEuJuJyN" + }, + "outputs": [], + "source": [ + "import os\n", + "from google.colab import userdata\n", + "\n", + "# Note: `userdata.get` is a Colab API. 
If you're not using Colab, set the env\n", + "# vars as appropriate for your system.\n", + "os.environ[\"HF_TOKEN\"] = userdata.get(\"HF_TOKEN\")" ] }, { "cell_type": "markdown", "metadata": { "id": "iwjo5_Uucxkw" }, "source": [ "### Install dependencies\n", "You'll need to install a few Python packages and dependencies to interact with Hugging Face and run the model.\n", "\n", "Run the following cell to install or upgrade these dependencies:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5STiMkJQt4gF" }, "outputs": [], "source": [ "# The huggingface_hub library allows us to download models and other files from Hugging Face.\n", "!pip install --upgrade -q huggingface_hub\n", "\n", "# Download the latest Llamafile binary (https://github.com/Mozilla-Ocho/llamafile/releases)\n", "!wget -O llamafile https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.13/llamafile-0.8.13\n", "\n", "# Make the binary executable\n", "!chmod +x llamafile\n", "\n", "# Download the zipalign binary (https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.4/zipalign-0.8.4)\n", "!wget -O zipalign https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.4/zipalign-0.8.4\n", "\n", "# Make the binary executable\n", "!chmod +x zipalign" ] }, { "cell_type": "markdown", "metadata": { "id": "2_bahJBmwvSp" }, "source": [ "### Logging into Hugging Face Hub\n", "\n", "Next, you'll need to log in to the Hugging Face Hub using your access token. This will allow you to download the Gemma model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ztSoQDMnt4ii" }, "outputs": [], "source": [ "from huggingface_hub import login\n", "\n", "login(os.environ[\"HF_TOKEN\"])" ] }, { "cell_type": "markdown", "metadata": { "id": "_-NAW6VXOeBt" }, "source": [ "### Downloading the Gemma 2 Model\n", "Once you're logged in, you can download the Gemma 2 model files from Hugging Face. The [Gemma 2 model](https://huggingface.co/google/gemma-2-2b-GGUF) is available in **GGUF** format, which is optimized for use with `llama.cpp` and compatible tools like Llamafile." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "D4fq8ha_t4k-" }, "outputs": [], "source": [ "from huggingface_hub import hf_hub_download\n", "\n", "# Specify the repository and filename\n", "repo_id = 'google/gemma-2-2b-GGUF' # Repository containing the GGUF model\n", "filename = '2b_pt_v2.gguf' # The GGUF model file\n", "\n", "# Download the model file to the current directory\n", "hf_hub_download(repo_id=repo_id, filename=filename, local_dir='.')" ] }, { "cell_type": "markdown", "metadata": { "id": "Pjq2dOx90I6e" }, "source": [ "### Creating a Gemma 2 Llamafile\n", "\n", "With Llamafile, you can run the web server using a simple command like:\n", "\n", "```bash\n", "./gemma2.llamafile ...\n", "```\n", "\n", "To do this, you can package both the model weights and a special `.args` file that specifies the default arguments. Start by creating a file named `.args` with the following content:\n", "\n", "- `-m`: Specifies the model file to use.\n", "- `--host`: Specifies the hostname the server binds to.\n", "- `-ngl`: Sets the number of GPU layers to offload.
Setting it to `9999` offloads as many layers as possible to the GPU.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "VAFGtj3U3wVa" }, "outputs": [], "source": [ "%%writefile .args\n", "-m\n", "2b_pt_v2.gguf\n", "--host\n", "127.0.0.1\n", "-ngl\n", "9999\n", "..." ] }, { "cell_type": "markdown", "metadata": { "id": "cKCjnZMEGElt" }, "source": [ "As shown above, the .args file contains one argument per line. The `...` placeholder marks where any additional command-line arguments provided by the user at run time will be inserted. Now, let's pack both the model weights and the argument file into the executable using `zipalign`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tdTrHf4N0MAT" }, "outputs": [], "source": [ "!cp llamafile gemma2.llamafile\n", "!./zipalign \\\n", " gemma2.llamafile \\\n", " 2b_pt_v2.gguf \\\n", " .args" ] }, { "cell_type": "markdown", "metadata": { "id": "sHD-eCDvPIPh" }, "source": [ "### Run Llamafile\n", "Let's now run Llamafile in server mode, which allows us to interact with it via HTTP requests.\n", "\n", "- `nohup`: Keeps the command running even after the terminal session ends (immune to hangups).\n", "- `./gemma2.llamafile`: Runs the Gemma 2 llamafile you just created.\n", "- `--server`: Starts Llamafile in server mode.\n", "- `--nobrowser`: Prevents Llamafile from opening a browser window.\n", "- `--port`: Specifies the port number to use.\n", "- `> llamafile.log`: Redirects the output to a log file.\n", "- `&`: Runs the process in the background.\n", "\n", "**Note:** The `llamafile.log` file will contain logs that can help you troubleshoot any issues." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pApwODy89e-L" }, "outputs": [], "source": [ "!nohup ./gemma2.llamafile --server --nobrowser --port 8081 > llamafile.log &\n", "# Here we add a delay to let the server warm up before we make any requests\n", "!sleep 60" ] }, { "cell_type": "markdown", "metadata": { "id": "e0_eRbS4q1YT" }, "source": [ "To quickly test the API, let's use cURL to make a simple HTTP request." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2pac5IBOuc_d" }, "outputs": [], "source": [ "%%bash\n", "sleep 2\n", "curl -X POST http://localhost:8081/completion \\\n", "-H \"Content-Type: application/json\" \\\n", "-d '{\n", " \"system_prompt\": {\n", " \"prompt\": \"You are an AI assistant. Don'\\''t make things up.\",\n", " \"anti_prompt\": \"User:\",\n", " \"assistant_name\": \"Assistant:\"\n", " },\n", " \"stream\": false,\n", " \"n_predict\": 128,\n", " \"temperature\": 0.7,\n", " \"stop\": [\"User:\", \"Assistant:\"],\n", " \"api_key\": \"\",\n", " \"prompt\": \"User: What is the capital of France?\\\\n\\\\nAssistant:\"\n", "}' | python3 -c '\n", "import json\n", "import sys\n", "\n", "try:\n", " response = json.load(sys.stdin)\n", " print(json.dumps(response, indent=2))\n", "except json.JSONDecodeError as e:\n", " print(\"JSONDecodeError:\", e)\n", " print(\"No valid JSON data received.\")\n", "'" ] }, { "cell_type": "markdown", "metadata": { "id": "XJl2l_Cc5u4N" }, "source": [ "### Creating a simple function to interact with Gemma 2\n", "\n", "The `get_completion()` function sends a prompt to an AI language model and retrieves a generated response.
This allows you to interact with the model by providing input text and receiving an AI-generated completion.\n", + "\n", + "- **Prompt**: The main input text or question you want the AI to answer.\n", + "- **System Prompt**: Sets the context for the AI, instructing it on how to behave (e.g., \"You are an AI assistant. Don't make things up.\").\n", + "- **Parameters**:\n", + " - `temperature`: Controls the creativity of the response (lower values = more deterministic).\n", + " - `n_predict`: The maximum number of tokens (words or pieces of words) to generate.\n", + " - `stop`: Sequences where the AI should stop generating further text.\n", + " \n", + "\n", + "This function simplifies the process of communicating with an AI language model, making it easier for you to experiment with generating text completions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xDAwv_0y5zJD" + }, + "outputs": [], + "source": [ + "import requests\n", + "import json\n", + "import time\n", + "\n", + "def get_completion(prompt, stream=False, n_predict=128, temperature=0.7,\n", + " stop=[\"User:\", \"Assistant:\"],\n", + " url='http://localhost:8081/completion'):\n", + " \"\"\"\n", + " Sends a POST request to the AI completion API with the given parameters.\n", + "\n", + " Args:\n", + " prompt: The prompt or question to send to the API.\n", + " stream (bool): Whether to stream the response.\n", + " n_predict (int): Number of tokens to predict.\n", + " temperature (float): Controls the randomness of the predictions.\n", + " stop (list): List of stop sequences.\n", + " url (str): The API endpoint URL.\n", + "\n", + " Returns:\n", + " requests.Response: The HTTP response object from the API,\n", + " or None if an error occurs.\n", + " \"\"\"\n", + " headers = {\n", + " 'Content-Type': 'application/json'\n", + " }\n", + "\n", + " payload = {\n", + " \"system_prompt\": {\n", + " \"prompt\": \"You are an AI assistant. 
Don't make things up.\",\n", " \"anti_prompt\": \"User:\",\n", " \"assistant_name\": \"Assistant:\"\n", " },\n", " \"stream\": stream,\n", " \"n_predict\": n_predict,\n", " \"temperature\": temperature,\n", " \"stop\": stop,\n", " \"prompt\": prompt\n", " }\n", "\n", " try:\n", " response = requests.post(url, headers=headers, json=payload)\n", " # Raises HTTPError for bad responses (4xx or 5xx)\n", " response.raise_for_status()\n", " return response\n", " except requests.exceptions.RequestException as e:\n", " print(f\"An error occurred while making the request: {e}\")\n", " return None" ] }, { "cell_type": "markdown", "metadata": { "id": "T6ctSiGjQGik" }, "source": [ "You can now interact with the Gemma 2 model through the Llamafile server by calling `get_completion` with your desired prompt to get a response from the AI model.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tZX4jjxp_wob" }, "outputs": [], "source": [ "# Define your prompt and parameters\n", "prompt = \"User: What is the capital of France?\n\nAssistant:\"\n", "n_predict = 128\n", "temperature = 0.7\n", "\n", "# Call the get_completion function with your parameters\n", "time.sleep(2)\n", "response = get_completion(\n", " prompt=prompt,\n", " n_predict=n_predict,\n", " temperature=temperature\n", ")\n", "\n", "# Print the response\n", "print(response.json()['content'])" ] }, { "cell_type": "markdown", "metadata": { "id": "QGSoEOU1QlZ5" }, "source": [ "Congratulations! You've successfully set up the Gemma 2 model using Llamafile in a Colab environment. You can now experiment with the model, generate text, and explore its capabilities." ] } ], "metadata": { "accelerator": "GPU", "colab": { "name": "Using_Gemma_with_Llamafile.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 +} diff --git a/README.md b/README.md index 2650872..48efebb 100644 --- a/README.md +++ b/README.md @@ -36,6 +36,7 @@ You can find the Gemma models on GitHub, Hugging Face models, Kaggle, Google Clo | [Prompt_chaining.ipynb](Gemma/Prompt_chaining.ipynb) | Illustrate prompt chaining and iterative generation with Gemma. | | [Advanced_Prompting_Techniques.ipynb](Gemma/Advanced_Prompting_Techniques.ipynb) | Illustrate advanced prompting techniques with Gemma. | | [Run_with_Ollama.ipynb](Gemma/Run_with_Ollama.ipynb) | Run Gemma models using [Ollama](https://www.ollama.com/). | +| [Using_Gemma_with_Llamafile.ipynb](Gemma/Using_Gemma_with_Llamafile.ipynb) | Run Gemma models using [Llamafile](https://github.com/Mozilla-Ocho/llamafile/). | | [Aligning_DPO_Gemma_2b_it.ipynb](Gemma/Aligning_DPO_Gemma_2b_it.ipynb) | Demonstrate how to align a Gemma model using DPO (Direct Preference Optimization) with [Hugging Face TRL](https://huggingface.co/docs/trl/en/index). | | [Deploy_with_vLLM.ipynb](Gemma/Deploy_with_vLLM.ipynb) | Deploy a Gemma model using [vLLM](https://github.com/vllm-project/vllm). | | [Deploy_Gemma_in_Vertex_AI.ipynb](Gemma/Deploy_Gemma_in_Vertex_AI.ipynb) | Deploy a Gemma model using [Vertex AI](https://cloud.google.com/vertex-ai).
| diff --git a/WISHLIST.md b/WISHLIST.md index c08507c..55e8002 100644 --- a/WISHLIST.md +++ b/WISHLIST.md @@ -2,7 +2,6 @@ A wish list of cookbooks showcasing: * Inference * Integration with [Google GenKit](https://firebase.google.com/products/genkit) - * Llamafile demo * llama.cpp demo * HF local-gemma demo * ElasticSearch integration