AI Annotator

AI Annotator is primarily designed for prototyping and testing annotation and classification tasks with Large Language Models (LLMs). While it doesn't offer extensive functionality or customizability, it provides a streamlined solution for quick experimentation. It also serves as a wrapper around a vector database (currently ChromaDB), making it adaptable to any task that leverages retrieval-augmented generation (RAG).

It supports a range of models, both LLMs and embedding models. This includes locally run options such as Hugging Face Transformers and Ollama, with the flexibility to integrate custom models easily. For a more lightweight setup, API-based providers such as OpenAI and Mistral are supported, offering a simpler way to get started without local deployment. Standardized task/instruction formats and automated parsing of model outputs are also included.
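
A minimal sketch of that interchangeability, assuming the package follows the pattern of the classes shown in the walkthrough below (the API-backed class name here is hypothetical; check the package source for the actual one):

    from ai_annotator import OllamaModel

    # Locally hosted LLM via Ollama, as used in the walkthrough below:
    model = OllamaModel(host="http://ollama:11434", model="llama3.1:8b")

    # Hypothetical API-backed alternative; the real class name for the
    # OpenAI/Mistral support mentioned above is not shown in this README:
    # model = OpenAIModel(model="gpt-4o-mini")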

Installation

  1. Clone the Repository
    Clone the repository to your local machine:

    git clone https://github.com/nsschw/ai_annotator.git
  2. Install the Package
    Install the package using pip:

    pip install -e ai_annotator

    or, if you want to use locally hosted models (Hugging Face Transformers):

    pip install -e "ai_annotator[local]"

How to Use

  1. Import Necessary Modules
    Import the relevant classes and functions from ai_annotator or other libraries:

    from ai_annotator import AnnotationProject, OllamaModel, HuggingFaceEmbeddingModel, AnnotationConfig
  2. Define Your Task
    Create a task description to define what you're annotating or classifying. For example:

    task = """
    You will be given an abstract of a study. Your task is to determine whether the study is valid based on the following criteria:
    1. The study must be a meta-analysis.
    2. The study must examine the association between life satisfaction, well-being, or subjective well-being and any other variable.
    
    Structure your feedback as follows:
    
    Feedback::
    Evaluation (Your reasoning about whether this is a valid article)
    Valid: (0 if not valid, 1 if valid)
    """
  3. Configure Models
    Set up the LLM and embedding models. This example uses an Ollama-hosted LLM and a Hugging Face embedding model:

    model = OllamaModel(host="http://ollama:11434", model="llama3.1:8b")
    emb_model = HuggingFaceEmbeddingModel("Alibaba-NLP/gte-Qwen2-1.5B-instruct")
  4. Set Project Configuration
    Define the configuration for the annotation project using the AnnotationConfig class. Specify the data path, task description, and models to use:

    project_config = AnnotationConfig(
        db_path="SecondOrderMetaStudy",
        task_description=task,
        embedding_model=emb_model,
        model=model
    )
  5. Create Annotation Project
    Initialize the AnnotationProject with your configuration and add data from a CSV file:

    ap = AnnotationProject(config=project_config)
    ap.add_data_from_csv("abstracts.csv", column_mapping={"input": "notes_abstract", "output": "valid_abstract"})
  6. Generate Reasoning
    Use a reasoning prompt to generate reasoning for each data point:

    ap.generate_reasoning(reasoning_prompt="What are the clues that lead to: [{output}] being correct in the document: [{input}] with the task being: [{task_description}].")
  7. Run Predictions
    Finally, run predictions on the test dataset:

    test_cases = ap.predict(["Test_Case_1", "Test_Case_2", ...], number_demonstrations=3, use_reasoning=True)
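
The return format of predict isn't documented here; assuming it yields one prediction per input, in order, inspecting the results might look like this:

    # Assumption: predict() returns one parsed prediction per input document.
    for document, prediction in zip(["Test_Case_1", "Test_Case_2"], test_cases):
        print(f"{document[:50]!r} -> {prediction}")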

ToDo

  • Figure out a way to use structured output (similar to OpenAI) for task definition, model output, and evaluation (see the sketch after this list). See also: jsonformer, instructor, outlines

  • Train a simple PEFT model

  • Switch to lazy loading of models

  • Explore deploying a jury of models for the annotation project

  • Add a hyperparameter tuner for comparing different models

  • Parse JSON output

  • Enable model offloading

  • Enable passing a bnb_config (bitsandbytes) to the Hugging Face models
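
For the structured-output item above, a minimal sketch using instructor (one of the libraries listed; the schema, field names, and model name are assumptions for illustration, not part of ai_annotator):

    import instructor
    from openai import OpenAI
    from pydantic import BaseModel

    # Hypothetical schema mirroring the Feedback format from the task above.
    class Verdict(BaseModel):
        evaluation: str  # the model's reasoning
        valid: int       # 0 if not valid, 1 if valid

    client = instructor.from_openai(OpenAI())
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=Verdict,  # instructor parses and validates the output
        messages=[{"role": "user", "content": task + "\n\nAbstract: ..."}],
    )
    print(verdict.valid, verdict.evaluation)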