From f574d154b1d9f1e2438c6659a2bbbc5fa12a035e Mon Sep 17 00:00:00 2001
From: Kinar R <42828719+kinarr@users.noreply.github.com>
Date: Fri, 20 Sep 2024 17:54:25 +0530
Subject: [PATCH 1/9] Added Gemma 2 Llamafile Demo Colab
---
Gemma/Using_Gemma_with_Lllamafile.ipynb | 320 ++++++++++++++++++++++++
1 file changed, 320 insertions(+)
create mode 100644 Gemma/Using_Gemma_with_Lllamafile.ipynb
diff --git a/Gemma/Using_Gemma_with_Lllamafile.ipynb b/Gemma/Using_Gemma_with_Lllamafile.ipynb
new file mode 100644
index 0000000..d11f235
--- /dev/null
+++ b/Gemma/Using_Gemma_with_Lllamafile.ipynb
@@ -0,0 +1,320 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Tce3stUlHN0L"
+ },
+ "source": [
+ "##### Copyright 2024 Google LLC."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "cellView": "form",
+ "id": "tuOe1ymfHZPu"
+ },
+ "outputs": [],
+ "source": [
+ "# @title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+ "# you may not use this file except in compliance with the License.\n",
+ "# You may obtain a copy of the License at\n",
+ "#\n",
+ "# https://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing, software\n",
+ "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+ "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+ "# See the License for the specific language governing permissions and\n",
+ "# limitations under the License."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dfsDR_omdNea"
+ },
+ "source": [
+ "# Getting Started with Gemma 2 and Llamafile\n",
+ "\n",
+ "[Gemma](https://ai.google.dev/gemma) is a family of lightweight, state-of-the-art open language models from Google. Built from the same research and technology used to create the Gemini models, Gemma models are text-to-text, decoder-only large language models (LLMs), available in English, with open weights, pre-trained variants, and instruction-tuned variants.\n",
+ "\n",
+ "Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.\n",
+ "\n",
+ "[Llamafile](https://github.com/Mozilla-Ocho/llamafile) is a tool that simplifies the distribution and execution of open Large Language Models (LLMs) by packaging them into a single-file executable called a \"llamafile.\" By combining llama.cpp with Cosmopolitan Libc, it consolidates the complexity of LLMs into one framework that runs locally on most computers without any installation. The goal is to make open LLMs more accessible to both developers and end users.\n",
+ "\n",
+ "In this tutorial, you will learn how to run the Gemma 2 model from Google using Llamafile.\n",
+ "\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "FaqZItBdeokU"
+ },
+ "source": [
+ "## Setup\n",
+ "\n",
+ "### Select the Colab runtime\n",
+ "To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:\n",
+ "\n",
+ "1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.\n",
+ "2. Select **Change runtime type**.\n",
+ "3. Under **Hardware accelerator**, select **T4 GPU**.\n",
+ "\n",
+ "### Gemma setup\n",
+ "\n",
+ "**Before we dive into the tutorial, let's get you set up with Gemma:**\n",
+ "\n",
+ "1. **Hugging Face Account:** If you don't already have one, you can create a free Hugging Face account by clicking [here](https://huggingface.co/join).\n",
+ "2. **Gemma Model Access:** Head over to the [Gemma model page](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) and accept the usage conditions.\n",
+ "3. **Colab with Gemma Power:** For this tutorial, you'll need a Colab runtime with enough resources to handle the Gemma 2B model. Choose an appropriate runtime when starting your Colab session.\n",
+ "4. **Hugging Face Token:** Generate a Hugging Face access (preferably `write` permission) token by clicking [here](https://huggingface.co/settings/tokens). You'll need this token later in the tutorial.\n",
+ "\n",
+ "**Once you've completed these steps, you're ready to move on to the next section where we'll set up environment variables in your Colab environment.**\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "CY2kGtsyYpHF"
+ },
+ "source": [
+ "### Configure your HF token\n",
+ "\n",
+ "Add your Hugging Face token to the Colab Secrets manager to securely store it.\n",
+ "\n",
+ "1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. \n",
+ "2. Create a new secret with the name `HF_TOKEN`.\n",
+ "3. Copy/paste your token key into the Value input box of `HF_TOKEN`.\n",
+ "4. Toggle the button on the left to allow notebook access to the secret.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "7-1PYEuJuJyN"
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "from google.colab import userdata\n",
+ "\n",
+ "# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env\n",
+ "# vars as appropriate for your system.\n",
+ "os.environ[\"HF_TOKEN\"] = userdata.get(\"HF_TOKEN\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iwjo5_Uucxkw"
+ },
+ "source": [
+ "### Install dependencies\n",
+ "You'll need to install a few Python packages to interact with Hugging Face and run the model.\n",
+ "\n",
+ "Run the following cell to install or upgrade it:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "5STiMkJQt4gF"
+ },
+ "outputs": [],
+ "source": [
+ "# The huggingface_hub library allows us to download models and other files from Hugging Face.\n",
+ "!pip install --upgrade -q huggingface_hub"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "2_bahJBmwvSp"
+ },
+ "source": [
+ "### Logging into Hugging Face Hub\n",
+ "\n",
+ "Next, you'll have to log into the Hugging Face Hub using your access token. This will allow us to download the Gemma model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ztSoQDMnt4ii"
+ },
+ "outputs": [],
+ "source": [
+ "from huggingface_hub import login\n",
+ "\n",
+ "login(os.environ[\"HF_TOKEN\"])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "_-NAW6VXOeBt"
+ },
+ "source": [
+ "### Downloading the Gemma 2 Model\n",
+ "Once you're logged in, you can download the Gemma 2 model files from Hugging Face. The [Gemma 2 model](https://huggingface.co/google/gemma-2-2b-GGUF) is available in **GGUF** format, which is optimized for use with `llama.cpp` and compatible tools like Llamafile."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "D4fq8ha_t4k-"
+ },
+ "outputs": [],
+ "source": [
+ "from huggingface_hub import hf_hub_download\n",
+ "\n",
+ "# Specify the repository and filename\n",
+ "repo_id = 'google/gemma-2-2b-GGUF' # Repository containing the GGUF model\n",
+ "filename = '2b_pt_v2.gguf' # The GGUF model file\n",
+ "\n",
+ "# Download the model file to the current directory\n",
+ "hf_hub_download(repo_id=repo_id, filename=filename, local_dir='.')"
+ ]
+ },
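+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Optionally, you can run a quick sanity check that the model file is now in the working directory. This is a minimal sketch that only assumes the `2b_pt_v2.gguf` filename used above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Optional: confirm the GGUF model file was downloaded and check its size.\n",
+ "!ls -lh 2b_pt_v2.gguf"
+ ]
+ },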
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iv9YdKDrrR5y"
+ },
+ "source": [
+ "### Installing Llamafile\n",
+ "Llamafile is a tool that allows you to run Llama and Llama-like models efficiently.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "FicMgdefrQs6"
+ },
+ "outputs": [],
+ "source": [
+ "# Download the latest Llamafile binary (https://github.com/Mozilla-Ocho/llamafile/releases)\n",
+ "!wget -O llamafile https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.13/llamafile-0.8.13\n",
+ "\n",
+ "# Make the binary executable\n",
+ "!chmod +x llamafile"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "sHD-eCDvPIPh"
+ },
+ "source": [
+ "### Run Llamafile\n",
+ "Let's now run Llamafile in server mode, which allows us to interact with it via HTTP requests."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "pApwODy89e-L"
+ },
+ "outputs": [],
+ "source": [
+ "!nohup ./llamafile --server --nobrowser --port 8081 -ngl 9999 -m 2b_pt_v2.gguf > llamafile.log &"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Xsr4qmGHPYaL"
+ },
+ "source": [
+ "Let's break down the command:\n",
+ "\n",
+ "- `nohup`: Runs the command in the background, immune to hangups.\n",
+ "- `./llamafile`: Runs the Llamafile binary.\n",
+ "- `--server`: Starts Llamafile in server mode.\n",
+ "- `--nobrowser`: Prevents Llamafile from opening a browser window.\n",
+ "- `--port`: Specifies the port number to use.\n",
+ "- `-ngl`: Sets the number of GPU layers to offload. Setting it to `9999` offloads as many layers as possible to the GPU.\n",
+ "- `-m`: Specifies the model file to use.\n",
+ "- `> llamafile.log`: Redirects the output to a log file.\n",
+ "- `&`: Runs the process in the background.\n",
+ "\n",
+ "**Note:** The `llamafile.log` file will contain logs that can help you troubleshoot any issues."
+ ]
+ },
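+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Because the server loads the model in the background, it may take a little while before it starts answering. The following optional, minimal sketch polls the port until the server responds; it only assumes the server is listening on `http://localhost:8081` as configured above. If it never responds, inspect `llamafile.log`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "import requests\n",
+ "\n",
+ "# Poll the server until it responds (loading the model can take a while).\n",
+ "server_ready = False\n",
+ "for _ in range(30):\n",
+ "    try:\n",
+ "        requests.get(\"http://localhost:8081\", timeout=2)\n",
+ "        server_ready = True\n",
+ "        break\n",
+ "    except requests.exceptions.RequestException:\n",
+ "        time.sleep(2)\n",
+ "\n",
+ "print(\"Server is ready.\" if server_ready else \"Server not ready yet; check llamafile.log.\")"
+ ]
+ },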
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "HbhOjxm4PrRc"
+ },
+ "source": [
+ "### Accessing the Llamafile Server\n",
+ "Since you'll be running this in a Colab environment, we need to set up port forwarding to access the Llamafile server.\n",
+ "\n",
+ "Use Colab's `eval_js` function to create a proxy URL for port `8081`.\n",
+ "\n",
+ "Click on the link provided in the output above to open the Llamafile web interface. From there, you can enter prompts and receive responses from the model.\n",
+ "\n",
+ "1. Enter a prompt: In the **input box**, type a question or a statement you would like the model to respond to. \n",
+ "2. Submit the prompt: Click on **Send** to query the model. \n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "G-QMy0ImMY5x"
+ },
+ "outputs": [],
+ "source": [
+ "from google.colab.output import eval_js\n",
+ "print(eval_js(\"google.colab.kernel.proxyPort(8081)\"))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "T6ctSiGjQGik"
+ },
+ "source": [
+ "You can now interact with the Gemma 2 model through the Llamafile server.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "QGSoEOU1QlZ5"
+ },
+ "source": [
+ "Congratulations! You've successfully set up the Gemma 2 model using Llamafile in a Colab environment. You can now experiment with the model, generate text, and explore its capabilities."
+ ]
+ }
+ ],
+ "metadata": {
+ "accelerator": "GPU",
+ "colab": {
+ "name": "Using_Gemma_with_Lllamafile.ipynb",
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "name": "python3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
From 7c20cba79641dce2bc3826c60070ee120764c4bf Mon Sep 17 00:00:00 2001
From: kinarr
Date: Sat, 21 Sep 2024 16:09:46 +0530
Subject: [PATCH 2/9] Updated Colab to use an HTTP request for querying the
model
Additional notes: Creates a Gemma 2 Llamafile executable binary
---
Gemma/Using_Gemma_with_Lllamafile.ipynb | 173 +++++++++++++++++++-----
1 file changed, 142 insertions(+), 31 deletions(-)
diff --git a/Gemma/Using_Gemma_with_Lllamafile.ipynb b/Gemma/Using_Gemma_with_Lllamafile.ipynb
index d11f235..453dfdb 100644
--- a/Gemma/Using_Gemma_with_Lllamafile.ipynb
+++ b/Gemma/Using_Gemma_with_Lllamafile.ipynb
@@ -120,7 +120,7 @@
},
"source": [
"### Install dependencies\n",
- "You'll need to install a few Python packages to interact with Hugging Face and run the model.\n",
+ "You'll need to install a few Python packages and dependencies to interact with Hugging Face and run the model.\n",
"\n",
"Run the following cell to install or upgrade it:"
]
@@ -134,7 +134,19 @@
"outputs": [],
"source": [
"# The huggingface_hub library allows us to download models and other files from Hugging Face.\n",
- "!pip install --upgrade -q huggingface_hub"
+ "!pip install --upgrade -q huggingface_hub\n",
+ "\n",
+ "# Download the latest Llamafile binary (https://github.com/Mozilla-Ocho/llamafile/releases)\n",
+ "!wget -O llamafile https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.13/llamafile-0.8.13\n",
+ "\n",
+ "# Make the binary executable\n",
+ "!chmod +x llamafile\n",
+ "\n",
+ "# Download the zipalign binary (https://github.com/Mozilla-Ocho/llamafile/releases/download//zipalign-)\n",
+ "!wget -O zipalign https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.4/zipalign-0.8.4\n",
+ "\n",
+ "# Make the binary executable\n",
+ "!chmod +x zipalign"
]
},
{
@@ -192,97 +204,172 @@
{
"cell_type": "markdown",
"metadata": {
- "id": "iv9YdKDrrR5y"
+ "id": "Pjq2dOx90I6e"
},
"source": [
- "### Installing Llamafile\n",
- "Llamafile is a tool that allows you to run Llama and Llama-like models efficiently.\n"
+ "### Creating a Gemma 2 Llamafile\n",
+ "\n",
+ "With Llamafile you can run the web server using a simple command like:\n",
+ "\n",
+ "```bash\n",
+ "./gemma2.llamafile ...\n",
+ "```\n",
+ "\n",
+ "To do this you can package both the model weights and a special `.args` file that specifies the default arguments. Start by creating a file named `.args` with the following content:\n",
+ "\n",
+ "- `-m`: Specifies the model file to use.\n",
+ "- `--host`: Specifies the hostname\n",
+ "- `-ngl`: Sets the number of GPU layers to offload. Setting it to `9999` offloads as many layers as possible to the GPU.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
- "id": "FicMgdefrQs6"
+ "id": "VAFGtj3U3wVa"
},
"outputs": [],
"source": [
- "# Download the latest Llamafile binary (https://github.com/Mozilla-Ocho/llamafile/releases)\n",
- "!wget -O llamafile https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.13/llamafile-0.8.13\n",
- "\n",
- "# Make the binary executable\n",
- "!chmod +x llamafile"
+ "%%writefile .args\n",
+ "-m\n",
+ "2b_pt_v2.gguf\n",
+ "--host\n",
+ "0.0.0.0\n",
+ "-ngl\n",
+ "9999\n",
+ "..."
]
},
{
"cell_type": "markdown",
"metadata": {
- "id": "sHD-eCDvPIPh"
+ "id": "cKCjnZMEGElt"
},
"source": [
- "### Run Llamafile\n",
- "Let's now run Llamafile in server mode, which allows us to interact with it via HTTP requests."
+ "As shown above, the .args file contains one argument per line. The `...` placeholder optionally indicates where any additional command-line arguments provided by the user will be inserted. Now, let's include both the model weights and the argument file into the executable using `zipalign`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
- "id": "pApwODy89e-L"
+ "id": "tdTrHf4N0MAT"
},
"outputs": [],
"source": [
- "!nohup ./llamafile --server --nobrowser --port 8081 -ngl 9999 -m 2b_pt_v2.gguf > llamafile.log &"
+ "!cp llamafile gemma2.llamafile\n",
+ "!./zipalign \\\n",
+ " gemma2.llamafile \\\n",
+ " 2b_pt_v2.gguf \\\n",
+ " .args"
]
},
{
"cell_type": "markdown",
"metadata": {
- "id": "Xsr4qmGHPYaL"
+ "id": "sHD-eCDvPIPh"
},
"source": [
- "Let's break down the command:\n",
+ "### Run Llamafile\n",
+ "Let's now run Llamafile in server mode, which allows us to interact with it via HTTP requests.\n",
"\n",
"- `nohup`: Runs the command in the background, immune to hangups.\n",
"- `./llamafile`: Runs the Llamafile binary.\n",
"- `--server`: Starts Llamafile in server mode.\n",
"- `--nobrowser`: Prevents Llamafile from opening a browser window.\n",
"- `--port`: Specifies the port number to use.\n",
- "- `-ngl`: Sets the number of GPU layers to offload. Setting it to `9999` offloads as many layers as possible to the GPU.\n",
- "- `-m`: Specifies the model file to use.\n",
"- `> llamafile.log`: Redirects the output to a log file.\n",
"- `&`: Runs the process in the background.\n",
"\n",
"**Note:** The `llamafile.log` file will contain logs that can help you troubleshoot any issues."
]
},
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "pApwODy89e-L"
+ },
+ "outputs": [],
+ "source": [
+ "!nohup ./gemma2.llamafile --server --nobrowser --port 8081 > llamafile.log &"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {
- "id": "HbhOjxm4PrRc"
+ "id": "XJl2l_Cc5u4N"
},
"source": [
- "### Accessing the Llamafile Server\n",
- "Since you'll be running this in a Colab environment, we need to set up port forwarding to access the Llamafile server.\n",
+ "### Creating a simple function to interact with Gemma 2\n",
"\n",
- "Use Colab's `eval_js` function to create a proxy URL for port `8081`.\n",
+ "The `get_completion()` function sends a prompt to an AI language model and retrieves a generated response. This allows you to interact with the model by providing input text and receiving an AI-generated completion.\n",
"\n",
- "Click on the link provided in the output above to open the Llamafile web interface. From there, you can enter prompts and receive responses from the model.\n",
+ "- **Prompt**: The main input text or question you want the AI to answer.\n",
+ "- **System Prompt**: Sets the context for the AI, instructing it on how to behave (e.g., \"You are an AI assistant. Don't make things up.\").\n",
+ "- **Parameters**:\n",
+ " - `temperature`: Controls the creativity of the response (lower values = more deterministic).\n",
+ " - `n_predict`: The maximum number of tokens (words or pieces of words) to generate.\n",
+ " - `stop`: Sequences where the AI should stop generating further text.\n",
+ " \n",
"\n",
- "1. Enter a prompt: In the **input box**, type a question or a statement you would like the model to respond to. \n",
- "2. Submit the prompt: Click on **Send** to query the model. \n"
+ "This function simplifies the process of communicating with an AI language model, making it easier for you to experiment with generating text completions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
- "id": "G-QMy0ImMY5x"
+ "id": "xDAwv_0y5zJD"
},
"outputs": [],
"source": [
- "from google.colab.output import eval_js\n",
- "print(eval_js(\"google.colab.kernel.proxyPort(8081)\"))"
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "def get_completion(prompt, stream=False, n_predict=128, temperature=0.7,\n",
+ " stop=[\"\", \"User:\", \"Assistant:\"],\n",
+ " url='http://localhost:8081/completion'):\n",
+ " \"\"\"\n",
+ " Sends a POST request to the AI completion API with the given parameters.\n",
+ "\n",
+ " Args:\n",
+ " prompt: The prompt or question to send to the API.\n",
+ " stream (bool): Whether to stream the response.\n",
+ " n_predict (int): Number of tokens to predict.\n",
+ " temperature (float): Controls the randomness of the predictions.\n",
+ " stop (list): List of stop sequences.\n",
+ " url (str): The API endpoint URL.\n",
+ "\n",
+ " Returns:\n",
+ " requests.Response: The HTTP response object from the API,\n",
+ " or None if an error occurs.\n",
+ " \"\"\"\n",
+ " headers = {\n",
+ " 'Content-Type': 'application/json'\n",
+ " }\n",
+ "\n",
+ " payload = {\n",
+ " \"system_prompt\": {\n",
+ " \"prompt\": \"You are an AI assistant. Don't make things up.\",\n",
+ " \"anti_prompt\": \"User:\",\n",
+ " \"assistant_name\": \"Assistant:\"\n",
+ " },\n",
+ " \"stream\": stream,\n",
+ " \"n_predict\": n_predict,\n",
+ " \"temperature\": temperature,\n",
+ " \"stop\": stop,\n",
+ " \"prompt\": prompt\n",
+ " }\n",
+ "\n",
+ " try:\n",
+ " response = requests.post(url, headers=headers, json=payload)\n",
+ " # Raises HTTPError for bad responses (4xx or 5xx)\n",
+ " response.raise_for_status()\n",
+ " return response\n",
+ " except requests.exceptions.RequestException as e:\n",
+ " print(f\"An error occurred while making the request: {e}\")\n",
+ " return None"
]
},
{
@@ -291,7 +378,31 @@
"id": "T6ctSiGjQGik"
},
"source": [
- "You can now interact with the Gemma 2 model through the Llamafile server.\n"
+ "You can now interact with the Gemma 2 model through the Llamafile server by calling `get_completion` with your desired prompt to get a response from the AI model.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "tZX4jjxp_wob"
+ },
+ "outputs": [],
+ "source": [
+ "# Define your prompt and parameters\n",
+ "prompt = \"User: What is the capital of France?.\\n\\nAssistant:\"\n",
+ "n_predict = 128\n",
+ "temperature = 0.7\n",
+ "\n",
+ "# Call the get_completion function with your parameters\n",
+ "response = get_completion(\n",
+ " prompt=prompt,\n",
+ " n_predict=n_predict,\n",
+ " temperature=temperature\n",
+ ")\n",
+ "\n",
+ "# Print the response\n",
+ "print(response.json()['content'])"
]
},
{
From 27b20ab9ded6478ed23431c3ef77078fd50c3a07 Mon Sep 17 00:00:00 2001
From: kinarr
Date: Mon, 23 Sep 2024 08:19:22 +0530
Subject: [PATCH 3/9] Added a simple curl CLI request before making the Python
request
---
Gemma/Using_Gemma_with_Lllamafile.ipynb | 36 +++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/Gemma/Using_Gemma_with_Lllamafile.ipynb b/Gemma/Using_Gemma_with_Lllamafile.ipynb
index 453dfdb..cb7907b 100644
--- a/Gemma/Using_Gemma_with_Lllamafile.ipynb
+++ b/Gemma/Using_Gemma_with_Lllamafile.ipynb
@@ -316,6 +316,42 @@
"This function simplifies the process of communicating with an AI language model, making it easier for you to experiment with generating text completions."
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "e0_eRbS4q1YT"
+ },
+ "source": [
+ "To quickly test the API let's use cURL for making a simple HTTP request."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "2pac5IBOuc_d"
+ },
+ "outputs": [],
+ "source": [
+ "%%bash\n",
+ "\n",
+ "curl http://localhost:8081/completion \\\n",
+ "-H \"Content-Type: application/json\" \\\n",
+ "-d '{\n",
+ " \"stream\": false,\n",
+ " \"n_predict\": 128,\n",
+ " \"temperature\": 0.7,\n",
+ " \"stop\": [\"\", \"User:\", \"Assistant:\"],\n",
+ " \"api_key\": \"\",\n",
+ " \"prompt\": \"User: What is the capital of France?.\\\\n\\\\nAssistant:\"\n",
+ "}' | python3 -c '\n",
+ "import json\n",
+ "import sys\n",
+ "json.dump(json.load(sys.stdin), sys.stdout, indent=2)\n",
+ "print()\n",
+ "'"
+ ]
+ },
{
"cell_type": "code",
"execution_count": null,
From 941bef2aea420ae680a2a6277b4cfa1e914dbe0f Mon Sep 17 00:00:00 2001
From: kinarr
Date: Mon, 23 Sep 2024 08:31:44 +0530
Subject: [PATCH 4/9] Removed an unused stop sequence and added a delay between
requests to avoid hitting rate limits
---
Gemma/Using_Gemma_with_Lllamafile.ipynb | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/Gemma/Using_Gemma_with_Lllamafile.ipynb b/Gemma/Using_Gemma_with_Lllamafile.ipynb
index cb7907b..a64254e 100644
--- a/Gemma/Using_Gemma_with_Lllamafile.ipynb
+++ b/Gemma/Using_Gemma_with_Lllamafile.ipynb
@@ -334,14 +334,14 @@
"outputs": [],
"source": [
"%%bash\n",
- "\n",
+ "sleep 2\n",
"curl http://localhost:8081/completion \\\n",
"-H \"Content-Type: application/json\" \\\n",
"-d '{\n",
" \"stream\": false,\n",
" \"n_predict\": 128,\n",
" \"temperature\": 0.7,\n",
- " \"stop\": [\"\", \"User:\", \"Assistant:\"],\n",
+ " \"stop\": [\"User:\", \"Assistant:\"],\n",
" \"api_key\": \"\",\n",
" \"prompt\": \"User: What is the capital of France?.\\\\n\\\\nAssistant:\"\n",
"}' | python3 -c '\n",
@@ -349,7 +349,7 @@
"import sys\n",
"json.dump(json.load(sys.stdin), sys.stdout, indent=2)\n",
"print()\n",
- "'"
+ "'\n"
]
},
{
@@ -362,9 +362,10 @@
"source": [
"import requests\n",
"import json\n",
+ "import time\n",
"\n",
"def get_completion(prompt, stream=False, n_predict=128, temperature=0.7,\n",
- " stop=[\"\", \"User:\", \"Assistant:\"],\n",
+ " stop=[\"User:\", \"Assistant:\"],\n",
" url='http://localhost:8081/completion'):\n",
" \"\"\"\n",
" Sends a POST request to the AI completion API with the given parameters.\n",
@@ -431,6 +432,7 @@
"temperature = 0.7\n",
"\n",
"# Call the get_completion function with your parameters\n",
+ "time.sleep(2)\n",
"response = get_completion(\n",
" prompt=prompt,\n",
" n_predict=n_predict,\n",
From e8891aeb1e5e7d9f725dbb907d203edae5e0d79a Mon Sep 17 00:00:00 2001
From: kinarr
Date: Mon, 23 Sep 2024 09:31:54 +0530
Subject: [PATCH 5/9] Reformatted notebook and added a delay before making any
requests to the server
---
Gemma/Using_Gemma_with_Lllamafile.ipynb | 68 +++++++++++++++----------
1 file changed, 40 insertions(+), 28 deletions(-)
diff --git a/Gemma/Using_Gemma_with_Lllamafile.ipynb b/Gemma/Using_Gemma_with_Lllamafile.ipynb
index a64254e..8bb5a19 100644
--- a/Gemma/Using_Gemma_with_Lllamafile.ipynb
+++ b/Gemma/Using_Gemma_with_Lllamafile.ipynb
@@ -234,7 +234,7 @@
"-m\n",
"2b_pt_v2.gguf\n",
"--host\n",
- "0.0.0.0\n",
+ "127.0.0.1\n",
"-ngl\n",
"9999\n",
"..."
@@ -292,28 +292,9 @@
},
"outputs": [],
"source": [
- "!nohup ./gemma2.llamafile --server --nobrowser --port 8081 > llamafile.log &"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "XJl2l_Cc5u4N"
- },
- "source": [
- "### Creating a simple function to interact with Gemma 2\n",
- "\n",
- "The `get_completion()` function sends a prompt to an AI language model and retrieves a generated response. This allows you to interact with the model by providing input text and receiving an AI-generated completion.\n",
- "\n",
- "- **Prompt**: The main input text or question you want the AI to answer.\n",
- "- **System Prompt**: Sets the context for the AI, instructing it on how to behave (e.g., \"You are an AI assistant. Don't make things up.\").\n",
- "- **Parameters**:\n",
- " - `temperature`: Controls the creativity of the response (lower values = more deterministic).\n",
- " - `n_predict`: The maximum number of tokens (words or pieces of words) to generate.\n",
- " - `stop`: Sequences where the AI should stop generating further text.\n",
- " \n",
- "\n",
- "This function simplifies the process of communicating with an AI language model, making it easier for you to experiment with generating text completions."
+ "!nohup ./gemma2.llamafile --server --nobrowser --port 8081 > llamafile.log &\n",
+ "# Here we add a delay to let the server warm up before we make any requests\n",
+ "!sleep 60"
]
},
{
@@ -335,21 +316,52 @@
"source": [
"%%bash\n",
"sleep 2\n",
- "curl http://localhost:8081/completion \\\n",
+ "curl -X POST http://localhost:8081/completion \\\n",
"-H \"Content-Type: application/json\" \\\n",
"-d '{\n",
+ " \"system_prompt\": {\n",
+ " \"prompt\": \"You are an AI assistant. Don'\\''t make things up.\",\n",
+ " \"anti_prompt\": \"User:\",\n",
+ " \"assistant_name\": \"Assistant:\"\n",
+ " },\n",
" \"stream\": false,\n",
" \"n_predict\": 128,\n",
" \"temperature\": 0.7,\n",
" \"stop\": [\"User:\", \"Assistant:\"],\n",
" \"api_key\": \"\",\n",
- " \"prompt\": \"User: What is the capital of France?.\\\\n\\\\nAssistant:\"\n",
+ " \"prompt\": \"User: What is the capital of France?\\\\n\\\\nAssistant:\"\n",
"}' | python3 -c '\n",
"import json\n",
"import sys\n",
- "json.dump(json.load(sys.stdin), sys.stdout, indent=2)\n",
- "print()\n",
- "'\n"
+ "\n",
+ "try:\n",
+ " response = json.load(sys.stdin)\n",
+ " print(json.dumps(response, indent=2))\n",
+ "except json.JSONDecodeError as e:\n",
+ " print(\"JSONDecodeError:\", e)\n",
+ " print(\"No valid JSON data received.\")\n",
+ "'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XJl2l_Cc5u4N"
+ },
+ "source": [
+ "### Creating a simple function to interact with Gemma 2\n",
+ "\n",
+ "The `get_completion()` function sends a prompt to an AI language model and retrieves a generated response. This allows you to interact with the model by providing input text and receiving an AI-generated completion.\n",
+ "\n",
+ "- **Prompt**: The main input text or question you want the AI to answer.\n",
+ "- **System Prompt**: Sets the context for the AI, instructing it on how to behave (e.g., \"You are an AI assistant. Don't make things up.\").\n",
+ "- **Parameters**:\n",
+ " - `temperature`: Controls the creativity of the response (lower values = more deterministic).\n",
+ " - `n_predict`: The maximum number of tokens (words or pieces of words) to generate.\n",
+ " - `stop`: Sequences where the AI should stop generating further text.\n",
+ " \n",
+ "\n",
+ "This function simplifies the process of communicating with an AI language model, making it easier for you to experiment with generating text completions."
]
},
{
From 198cba463181a5c461c4b18843499ab6289dafdf Mon Sep 17 00:00:00 2001
From: kinarr
Date: Mon, 23 Sep 2024 14:13:23 +0530
Subject: [PATCH 6/9] Updated the README and WISHLIST files
- Adds Llamafile demo to README
- Removes Llamafile demo from WISHLIST
---
README.md | 1 +
WISHLIST.md | 1 -
2 files changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 2650872..baa7831 100644
--- a/README.md
+++ b/README.md
@@ -43,6 +43,7 @@ You can find the Gemma models on GitHub, Hugging Face models, Kaggle, Google Clo
| [Minimal_RAG.ipynb](Gemma/Minimal_RAG.ipynb) | Minimal example of building a RAG system with Gemma using [Google UniSim](https://github.com/google/unisim) and [Hugging Face](https://huggingface.co/). |
| [RAG_PDF_Search_in_multiple_documents_on_Colab.ipynb](Gemma/RAG_PDF_Search_in_multiple_documents_on_Colab.ipynb) | RAG PDF Search in multiple documents using Gemma 2 2B on Google Colab. |
| [Using_Gemma_with_LangChain.ipynb](Gemma/Using_Gemma_with_LangChain.ipynb) | Examples to demonstrate using Gemma with [LangChain](https://www.langchain.com/). |
+| [Using_Gemma_with_Llamafile.ipynb](Gemma/Using_Gemma_with_Llamafile.ipynb) | An examples to demonstrate using Gemma with [Llamafile](https://github.com/Mozilla-Ocho/llamafile/). |
| [Gemma_RAG_LlamaIndex.ipynb](Gemma/Gemma_RAG_LlamaIndex.ipynb) | RAG example with [LlamaIndex](https://www.llamaindex.ai/) using Gemma. |
| [Integrate_with_Mesop.ipynb](Gemma/Integrate_with_Mesop.ipynb) | Integrate Gemma with [Google Mesop](https://google.github.io/mesop/). |
| [Integrate_with_OneTwo.ipynb](Gemma/Integrate_with_OneTwo.ipynb) | Integrate Gemma with [Google OneTwo](https://github.com/google-deepmind/onetwo). |
diff --git a/WISHLIST.md b/WISHLIST.md
index c08507c..55e8002 100644
--- a/WISHLIST.md
+++ b/WISHLIST.md
@@ -2,7 +2,6 @@ A wish list of cookbooks showcasing:
* Inference
* Integration with [Google GenKit](https://firebase.google.com/products/genkit)
- * Llamafile demo
* llama.cpp demo
* HF local-gemma demo
* ElasticSearch integration
From 5e4a81b11ed2902dbec9230a5cd66ffaefa11fb0 Mon Sep 17 00:00:00 2001
From: kinarr
Date: Mon, 23 Sep 2024 14:15:10 +0530
Subject: [PATCH 7/9] Change we to you in the Colab documentation
---
Gemma/Using_Gemma_with_Lllamafile.ipynb | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Gemma/Using_Gemma_with_Lllamafile.ipynb b/Gemma/Using_Gemma_with_Lllamafile.ipynb
index 8bb5a19..a380fe9 100644
--- a/Gemma/Using_Gemma_with_Lllamafile.ipynb
+++ b/Gemma/Using_Gemma_with_Lllamafile.ipynb
@@ -71,7 +71,7 @@
"\n",
"### Gemma setup\n",
"\n",
- "**Before we dive into the tutorial, let's get you set up with Gemma:**\n",
+ "**Before you dive into the tutorial, let's get you set up with Gemma:**\n",
"\n",
"1. **Hugging Face Account:** If you don't already have one, you can create a free Hugging Face account by clicking [here](https://huggingface.co/join).\n",
"2. **Gemma Model Access:** Head over to the [Gemma model page](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) and accept the usage conditions.\n",
From 915853af17db9662c874174457e1b7efab6579a9 Mon Sep 17 00:00:00 2001
From: kinarr
Date: Mon, 23 Sep 2024 14:17:14 +0530
Subject: [PATCH 8/9] Update README.md
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index baa7831..24e64d7 100644
--- a/README.md
+++ b/README.md
@@ -43,7 +43,7 @@ You can find the Gemma models on GitHub, Hugging Face models, Kaggle, Google Clo
| [Minimal_RAG.ipynb](Gemma/Minimal_RAG.ipynb) | Minimal example of building a RAG system with Gemma using [Google UniSim](https://github.com/google/unisim) and [Hugging Face](https://huggingface.co/). |
| [RAG_PDF_Search_in_multiple_documents_on_Colab.ipynb](Gemma/RAG_PDF_Search_in_multiple_documents_on_Colab.ipynb) | RAG PDF Search in multiple documents using Gemma 2 2B on Google Colab. |
| [Using_Gemma_with_LangChain.ipynb](Gemma/Using_Gemma_with_LangChain.ipynb) | Examples to demonstrate using Gemma with [LangChain](https://www.langchain.com/). |
-| [Using_Gemma_with_Llamafile.ipynb](Gemma/Using_Gemma_with_Llamafile.ipynb) | An examples to demonstrate using Gemma with [Llamafile](https://github.com/Mozilla-Ocho/llamafile/). |
+| [Using_Gemma_with_Llamafile.ipynb](Gemma/Using_Gemma_with_Llamafile.ipynb) | An example to demonstrate using Gemma with [Llamafile](https://github.com/Mozilla-Ocho/llamafile/). |
| [Gemma_RAG_LlamaIndex.ipynb](Gemma/Gemma_RAG_LlamaIndex.ipynb) | RAG example with [LlamaIndex](https://www.llamaindex.ai/) using Gemma. |
| [Integrate_with_Mesop.ipynb](Gemma/Integrate_with_Mesop.ipynb) | Integrate Gemma with [Google Mesop](https://google.github.io/mesop/). |
| [Integrate_with_OneTwo.ipynb](Gemma/Integrate_with_OneTwo.ipynb) | Integrate Gemma with [Google OneTwo](https://github.com/google-deepmind/onetwo). |
From 03ab8abfb87d685ff6649d0a99d1276a622d08e2 Mon Sep 17 00:00:00 2001
From: Kinar R <42828719+kinarr@users.noreply.github.com>
Date: Mon, 23 Sep 2024 19:44:53 +0530
Subject: [PATCH 9/9] Updated README.md to reposition notebook order in the
list
Moved this up to below the Ollama notebook
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 24e64d7..48efebb 100644
--- a/README.md
+++ b/README.md
@@ -36,6 +36,7 @@ You can find the Gemma models on GitHub, Hugging Face models, Kaggle, Google Clo
| [Prompt_chaining.ipynb](Gemma/Prompt_chaining.ipynb) | Illustrate prompt chaining and iterative generation with Gemma. |
| [Advanced_Prompting_Techniques.ipynb](Gemma/Advanced_Prompting_Techniques.ipynb) | Illustrate advanced prompting techniques with Gemma. |
| [Run_with_Ollama.ipynb](Gemma/Run_with_Ollama.ipynb) | Run Gemma models using [Ollama](https://www.ollama.com/). |
+| [Using_Gemma_with_Llamafile.ipynb](Gemma/Using_Gemma_with_Llamafile.ipynb) | Run Gemma models using [Llamafile](https://github.com/Mozilla-Ocho/llamafile/). |
| [Aligning_DPO_Gemma_2b_it.ipynb](Gemma/Aligning_DPO_Gemma_2b_it.ipynb) | Demonstrate how to align a Gemma model using DPO (Direct Preference Optimization) with [Hugging Face TRL](https://huggingface.co/docs/trl/en/index). |
| [Deploy_with_vLLM.ipynb](Gemma/Deploy_with_vLLM.ipynb) | Deploy a Gemma model using [vLLM](https://github.com/vllm-project/vllm). |
| [Deploy_Gemma_in_Vertex_AI.ipynb](Gemma/Deploy_Gemma_in_Vertex_AI.ipynb) | Deploy a Gemma model using [Vertex AI](https://cloud.google.com/vertex-ai). |
@@ -43,7 +44,6 @@ You can find the Gemma models on GitHub, Hugging Face models, Kaggle, Google Clo
| [Minimal_RAG.ipynb](Gemma/Minimal_RAG.ipynb) | Minimal example of building a RAG system with Gemma using [Google UniSim](https://github.com/google/unisim) and [Hugging Face](https://huggingface.co/). |
| [RAG_PDF_Search_in_multiple_documents_on_Colab.ipynb](Gemma/RAG_PDF_Search_in_multiple_documents_on_Colab.ipynb) | RAG PDF Search in multiple documents using Gemma 2 2B on Google Colab. |
| [Using_Gemma_with_LangChain.ipynb](Gemma/Using_Gemma_with_LangChain.ipynb) | Examples to demonstrate using Gemma with [LangChain](https://www.langchain.com/). |
-| [Using_Gemma_with_Llamafile.ipynb](Gemma/Using_Gemma_with_Llamafile.ipynb) | An example to demonstrate using Gemma with [Llamafile](https://github.com/Mozilla-Ocho/llamafile/). |
| [Gemma_RAG_LlamaIndex.ipynb](Gemma/Gemma_RAG_LlamaIndex.ipynb) | RAG example with [LlamaIndex](https://www.llamaindex.ai/) using Gemma. |
| [Integrate_with_Mesop.ipynb](Gemma/Integrate_with_Mesop.ipynb) | Integrate Gemma with [Google Mesop](https://google.github.io/mesop/). |
| [Integrate_with_OneTwo.ipynb](Gemma/Integrate_with_OneTwo.ipynb) | Integrate Gemma with [Google OneTwo](https://github.com/google-deepmind/onetwo). |