
LaSR: Library augmented Symbolic Regression

LibraryAugmentedSymbolicRegression.jl (LaSR.jl) accelerates the search for symbolic expressions using library learning.


LaSR is integrated with SymbolicRegression.jl. Check out PySR for a Python frontend.


Benchmarking

If you'd like to compare with LaSR, we've archived the code used in the paper in the lasr-experiments branch. Clone this repository and run:

$ git switch lasr-experiments

to switch to the branch, then follow the instructions in the README to reproduce our results. The branch contains the data and code for running and evaluating LaSR on the following datasets:

  • Feynman Equations dataset
  • Synthetic equations dataset
    • and generation code
  • Bigbench experiments
    • and evaluation code

Note

The code in the lasr-experiments branch directly modifies a 'frozen' version of SymbolicRegression.jl and PySR. While we gradually work on integrating LaSR into the main PySR repository, you can still use LaSR within Python by installing the pip package in this branch.

Quickstart

Install in Julia with:

using Pkg
Pkg.add("LibraryAugmentedSymbolicRegression")
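
If you want the development version, you can also install it directly from this repository (standard Pkg workflow; the URL is simply the repository shown above):

using Pkg
Pkg.add(url="https://github.com/trishullab/LibraryAugmentedSymbolicRegression.jl")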

LaSR uses the same interface as SymbolicRegression.jl and is integrated into SymbolicRegression.jl through the SymbolicRegressionLaSRExt extension. However, LaSR can also be used directly with MLJ. The only difference is that you need to pass an LLMOptions object to the LaSRRegressor constructor.

For example, we can modify the example.jl from the SymbolicRegression.jl documentation to use LaSR as follows:

Note

LaSR searches for the LLM query prompts in a directory called prompts/ at the location where you start Julia. You can download and extract the prompts.zip folder from here to the desired location. If you wish to use a different location, pass a different prompts_dir argument to the LLMOptions object.

import LibraryAugmentedSymbolicRegression: LaSRRegressor, LLMOptions, LLMWeights
import MLJ: machine, fit!, predict, report

# Dataset with two named features:
X = (a = rand(500), b = rand(500))

# and one target:
y = @. 2 * cos(X.a * 23.5) - X.b ^ 2

# with some noise:
y = y .+ randn(500) .* 1e-3

model = LaSRRegressor(
    niterations=50,
    binary_operators=[+, -, *],
    unary_operators=[cos],
    llm_options=LLMOptions(
        active=true,
        weights=LLMWeights(llm_mutate=0.1, llm_crossover=0.1, llm_gen_random=0.1),
        prompt_evol=true,
        prompt_concepts=true,
        api_key="token-abc123",
        prompts_dir="prompts/",
        llm_recorder_dir="lasr_runs/debug_0/",
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        api_kwargs=Dict("url" => "http://localhost:11440/v1"),
        var_order=Dict("a" => "angle", "b" => "bias"),
        llm_context="We believe the function to be a trigonometric function of the angle and a quadratic function of the bias.",
    )
)
mach = machine(model, X, y)

# ensure ./prompts/ exists. If not, download and extract the prompts.zip file from the repository.
fit!(mach)
# open ./lasr_runs/debug_0/llm_calls.txt to see the LLM interactions.
report(mach)
predict(mach, X)
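
As in SymbolicRegression.jl, the fitted machine's report exposes the discovered Pareto front of expressions. A minimal sketch for inspecting it (field names such as equation_strings and best_idx are assumed to match SymbolicRegression.jl's SRRegressor report):

r = report(mach)
# Assumed to mirror SymbolicRegression.jl's report: a list of discovered equations
# of increasing complexity, plus the index of the best trade-off expression.
println(r.equation_strings[r.best_idx])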

Search options

Other than LLMOptions, we support the same search options as SymbolicRegression.jl. See https://astroautomata.com/SymbolicRegression.jl/stable/api/#Options
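
For example, the usual search parameters can be combined freely with llm_options. A sketch (the non-LLM parameter names below are the standard SymbolicRegression.jl options):

model = LaSRRegressor(
    niterations=100,                       # number of search iterations
    populations=20,                        # populations evolved in parallel
    population_size=50,                    # expressions per population
    maxsize=30,                            # maximum expression complexity
    binary_operators=[+, -, *, /],
    unary_operators=[cos, exp],
    llm_options=LLMOptions(active=false),  # disable LLM calls for a plain run
)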

LLM Options

LaSR uses PromptingTools.jl for zero-shot prompting. If you wish to change the prompting behavior, you can pass an LLMOptions object to the LaSRRegressor constructor. The available options are:

llm_options = LLMOptions(
    active=true,                                                                # Whether to use LLM inference or not
    weights=LLMWeights(llm_mutate=0.1, llm_crossover=0.1, llm_gen_random=0.1),  # Probability of using LLM for mutation, crossover, and random generation
    num_pareto_context=5,                                                       # Number of equations to sample from the Pareto frontier for summarization.
    prompt_evol=true,                                                           # Whether to evolve natural language concepts through LLM calls.
    prompt_concepts=true,                                                       # Whether to use natural language concepts in the search.
    api_key="token-abc123",                                                     # API key to OpenAI API compatible server.
    model="meta-llama/Meta-Llama-3-8B-Instruct",                                # LLM model to use.
    api_kwargs=Dict("url" => "http://localhost:11440/v1"),                      # Keyword arguments passed to server.
    http_kwargs=Dict("retries" => 3, "readtimeout" => 3600),                    # Keyword arguments passed to HTTP requests.
    prompts_dir="prompts/",                                                      # Directory to look for zero shot prompts to the LLM.
    llm_recorder_dir="lasr_runs/debug_0/",                                       # Directory to log LLM interactions.
    llm_context="",                                                             # Natural language concept to start with. You should also be able to initialize with a list of concepts.
    var_order=nothing,                                                          # Dict(variable_name => new_name).
    idea_threshold=30,                                                          # Number of concepts to keep track of.
    is_parametric=false,                                                        # This is a special flag to allow sampling parametric equations from LaSR. This won't be needed for most users.
)

Best Practices

  1. Always make sure you cannot find a satisfactory solution with active=false before using LLM guidance.
  2. Start with an OpenAI-compatible LLM server running on your local machine before moving on to paid services. There are many online resources for setting up a local LLM server.
  3. If you are using an LLM, do a back-of-the-envelope calculation to estimate the cost of running it on your problem. Each iteration makes around 60k calls to the LLM. With the default prompts (in prompts/), each call usually generates 250 to 1000 tokens, giving an upper bound of roughly 60M tokens per iteration at p=1.00. Hence, running the model at p=0.01 for 40 iterations works out to about 24M tokens per equation search (see the sketch after this list).
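
As a concrete version of that estimate (same numbers as above; adjust for your own prompts and sampling probability):

# Back-of-the-envelope token estimate for LLM-guided search.
calls_per_iteration = 60_000      # approximate LLM calls per iteration at p = 1.0
tokens_per_call     = 1_000       # upper bound with the default prompts
p                   = 0.01        # fraction of operations routed to the LLM
niterations         = 40
total_tokens = calls_per_iteration * tokens_per_call * p * niterations
println(total_tokens)             # 2.4e7, i.e. ~24M tokens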

Organization

LibraryAugmentedSymbolicRegression.jl development is kept independent of the main SymbolicRegression.jl codebase. However, to ensure LaSR can be used easily, it is integrated into SymbolicRegression.jl via the ext/SymbolicRegressionLaSRExt extension module, which in turn is loaded into PySR. The cartoon below summarizes the interaction between the different packages:

[Figure: LibraryAugmentedSymbolicRegression.jl organization]

Note

The ext/SymbolicRegressionLaSRExt module is not yet available in the released version of SymbolicRegression.jl. It will be available in release vX.X.X of SymbolicRegression.jl.
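
Once that release is out, the extension should activate automatically whenever both packages are loaded in the same session. This is the standard Julia package-extension mechanism, not LaSR-specific API; a quick way to check:

using SymbolicRegression
using LibraryAugmentedSymbolicRegression
# Package extensions load automatically once both trigger packages are present;
# this returns the extension module, or `nothing` if it is not available.
Base.get_extension(SymbolicRegression, :SymbolicRegressionLaSRExt)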

Running with Ollama

LaSR can be paired with any LLM server that is compatible with OpenAI's API. Ollama is a free and open-source LLM server geared towards running LLMs on commodity laptops. You can download and set up Ollama from here. After that, run:

$ ollama help
Large language model runner

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.
$ ollama pull llama3.1
# This downloads a 4GB-ish file that contains the Llama3.1 8B model.
# Ollama, by default, runs on port 11434 of your local machine. Let's try a debug query to make sure we can connect to Ollama.
$ curl http://localhost:11434/v1/models
{"object":"list","data":[{"id":"llama3.1:latest","object":"model","created":1730973855,"owned_by":"library"},{"id":"mistral:latest","object":"model","created":1697556753,"owned_by":"library"},{"id":"wizard-math:latest","object":"model","created":1697556753,"owned_by":"library"},{"id":"codellama:latest","object":"model","created":1693414395,"owned_by":"library"},{"id":"nous-hermes-llama2:latest","object":"model","created":1691000950,"owned_by":"library"}]}

$ curl http://localhost:11434/v1/completions -H "Content-Type: application/json"   -H "Authorization: Bearer token-abc123"   -d '{
    "model": "llama3.1:latest",
    "prompt": "Once upon a time,",
    "max_tokens": 50,
    "temperature": 0.7
  }'

{"id":"cmpl-626","object":"text_completion","created":1730977391,"model":"llama3.1:latest","system_fingerprint":"fp_ollama","choices":[{"text":"...in a far-off kingdom, hidden behind a veil of sparkling mist and whispering leaves, there existed a magical realm unlike any other.","index":0,"finish_reason":"stop"}],"usage":{"prompt_tokens":15,"completion_tokens":29,"total_tokens":44}}

Now, we can run the simple example in Julia, setting model to llama3.1:latest and the API URL to http://localhost:11434/v1:

import LibraryAugmentedSymbolicRegression: LaSRRegressor, LLMOptions, LLMWeights
import MLJ: machine, fit!, predict, report

# Dataset with two named features:
X = (a = rand(500), b = rand(500))

# and one target:
y = @. 2 * cos(X.a * 23.5) - X.b ^ 2

# with some noise:
y = y .+ randn(500) .* 1e-3

model = LaSRRegressor(
    niterations=50,
    binary_operators=[+, -, *],
    unary_operators=[cos],
    llm_options=LLMOptions(
        active=true,
        weights=LLMWeights(llm_mutate=0.1, llm_crossover=0.1, llm_gen_random=0.1),
        prompt_evol=true,
        prompt_concepts=true,
        api_key="token-abc123",
        prompts_dir="prompts/",
        llm_recorder_dir="lasr_runs/debug_0/",
        model="llama3.1:latest",
        api_kwargs=Dict("url" => "http://127.0.0.1:11434/v1"),
        var_order=Dict("a" => "angle", "b" => "bias"),
        llm_context="We believe the function to be a trigonometric function of the angle and a quadratic function of the bias."
    )
)

mach = machine(model, X, y)
fit!(mach)
# julia> fit!(mach)
# [ Info: Training machine(LaSRRegressor(binary_operators = Function[+, -, *], …), …).
# ┌ Warning: You are using multithreading mode, but only one thread is available. Try starting julia with `--threads=auto`.
# └ @ LibraryAugmentedSymbolicRegression ~/Desktop/projects/004_scientific_discovery/LibraryAugmentedSymbolicRegression.jl/src/Configure.jl:55
# [ Info: Tokens: 476 in 22.4 seconds
# [ Info: Started!
# [ Info: Tokens: 542 in 49.2 seconds
# [ Info: Tokens: 556 in 51.1 seconds
# [ Info: Tokens: 573 in 53.2 seconds
report(mach)
predict(mach, X)
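
After the run, you can inspect the recorded LLM interactions from Julia as well (llm_calls.txt is the file mentioned in the Quickstart; the directory comes from llm_recorder_dir above):

# Print the logged LLM interactions for this run.
print(read("lasr_runs/debug_0/llm_calls.txt", String))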
