AI Annotator is designed primarily for prototyping and testing annotation or classification tasks with Large Language Models (LLMs). It does not aim for extensive functionality or customizability; instead, it provides a streamlined setup for quick experimentation. It also wraps a vector database (currently ChromaDB), which makes it suitable for any task that leverages retrieval-augmented generation (RAG).

It supports both LLMs and embedding models. Locally run backends such as Hugging Face Transformers and Ollama are included, and custom models can be integrated easily. For a more lightweight setup, API-based providers such as OpenAI and Mistral are supported, so you can get started without deploying anything locally. Standardized task/instruction formats and automated parsing of model outputs are also included.
## Installation

### Clone the Repository

Clone the repository to your local machine:

```bash
git clone https://github.com/nsschw/ai_annotator.git
```
### Install the Package

Install the package using pip:

```bash
pip install -e ai_annotator
```

or, if you want to use locally hosted models (Hugging Face Transformers):

```bash
pip install -e "ai_annotator[local]"
```
## Usage

### Import the Necessary Modules

Import the relevant classes and functions from `ai_annotator`:

```python
from ai_annotator import AnnotationProject, OllamaModel, HuggingFaceEmbeddingModel, AnnotationConfig
```
### Define Your Task

Create a task description that defines what you are annotating or classifying. For example:

```python
task = """
You will be given an abstract of a study. Your task is to determine whether the study is valid based on the following criteria:
1. The study must be a meta-analysis.
2. The study must examine the association between life satisfaction, well-being, or subjective well-being and any other variable.

Structure your feedback as follows:
Feedback:
Evaluation: (Your reasoning for whether this is a valid article or not)
Valid: (0 if not valid, 1 if valid)
"""
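For reference, a model response that follows this structure might look like the following (purely illustrative, not output from any particular model):

```text
Feedback:
Evaluation: The abstract describes a meta-analysis examining the association between subjective well-being and income, so both criteria are met.
Valid: 1
```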
### Configure Models

Set up the LLM and the embedding model. This example uses an Ollama-hosted LLM together with a Hugging Face embedding model:

```python
model = OllamaModel(host="http://ollama:11434", model="llama3.1:7b")
emb_model = HuggingFaceEmbeddingModel("Alibaba-NLP/gte-Qwen2-1.5B-instruct")
```
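If you would rather avoid local deployment, the introduction mentions API-based providers such as OpenAI and Mistral. Assuming the package exposes wrappers analogous to `OllamaModel` (the class name `OpenAIModel` and its parameters below are assumptions — check the package for the exact interface), the setup might look like this:

```python
# Hypothetical API-based setup; OpenAIModel and its arguments are
# assumed by analogy to OllamaModel above, not taken from the package docs.
model = OpenAIModel(api_key="sk-...", model="gpt-4o-mini")
```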
### Set the Project Configuration

Define the configuration for the annotation project using the `AnnotationConfig` class. Specify the database path, the task description, and the models to use:

```python
project_config = AnnotationConfig(
    db_path="SecondOrderMetaStudy",
    task_description=task,
    embedding_model=emb_model,
    model=model,
)
```
### Create the Annotation Project

Initialize the `AnnotationProject` with your configuration and add data from a CSV file:

```python
ap = AnnotationProject(config=project_config)
ap.add_data_from_csv("abstracts.csv", column_mapping={"input": "notes_abstract", "output": "valid_abstract"})
```
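Given the column mapping above, the CSV is expected to contain one column with the model input and one with the gold label. A minimal `abstracts.csv` might look like this (illustrative rows, not real data):

```csv
notes_abstract,valid_abstract
"This meta-analysis of 40 studies examines the association between life satisfaction and income...",1
"A single randomized controlled trial on sleep quality in adolescents...",0
```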
### Generate Reasoning

Use a reasoning prompt to generate reasoning for each data point:

```python
ap.generate_reasoning(
    reasoning_prompt="What are the clues that lead to: [{output}] being correct in the document: [{input}] with the task being: [{task_description}]."
)
```
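The `{output}`, `{input}`, and `{task_description}` placeholders are presumably filled in per record before the prompt is sent to the model, in the way Python's `str.format` would resolve them:

```python
# Illustration of how the placeholders are likely resolved for one record;
# the actual substitution happens inside generate_reasoning.
template = "What are the clues that lead to: [{output}] being correct in the document: [{input}] with the task being: [{task_description}]."
filled = template.format(output=1, input="This meta-analysis of 40 studies examines...", task_description=task)
```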
### Run Predictions

Finally, run predictions on the test dataset:

```python
test_cases = ap.predict(["Test_Case_1", "Test_Case_2", ...], number_demonstrations=3, use_reasoning=True)
```
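The return format of `predict` isn't documented here; the sketch below assumes it yields one raw response string per test case and extracts the `Valid:` field from the structure defined in the task (a hedged example, not part of the package API):

```python
import re

def extract_valid_flag(response: str):
    """Pull the 0/1 after 'Valid:' out of a model response, if present."""
    match = re.search(r"Valid:\s*([01])", response)
    return int(match.group(1)) if match else None

# Assumes predict returned a list of raw response strings.
labels = [extract_valid_flag(r) for r in test_cases]
```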
## TODOs

- Figure out a way to use structured output (similar to OpenAI's) for task definition, model output, and evaluation; see also jsonformer, instructor, and outlines (a sketch follows this list)
- Train a simple PEFT model
- Switch to lazy loading of models
- Explore deploying a model jury for the annotation project
- Add a hyperparameter tuner for comparing different models
- Parse JSON output
- Enable model offloading
- Enable passing a `bnb_config` to the Hugging Face models
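As a starting point for the structured-output item above, one option is to describe the expected feedback as a Pydantic schema and let a library such as instructor or outlines constrain the model to it. The sketch below only defines a schema for the example task; how it would be wired into `AnnotationProject` is an open design question:

```python
from pydantic import BaseModel, Field

class Feedback(BaseModel):
    """Structured version of the example task's output format."""
    evaluation: str = Field(description="Reasoning for why the article is or is not valid")
    valid: int = Field(ge=0, le=1, description="0 if not valid, 1 if valid")
```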