Make BFCL User-Friendly and Easy to Extend #510

Open: wants to merge 35 commits into `main`
Commits:
* `b1b67b1` Move `*.jsonl` files from `eval_checker` to `data` dir (devanshamin, Jul 7, 2024)
* `1a0a932` Add `pyproject.toml` file (devanshamin, Jul 7, 2024)
* `ae89841` Ignore `poetry.lock` and `.cache` dir (devanshamin, Jul 7, 2024)
* `76e1bde` Add `.env.example` containing all the env vars (devanshamin, Jul 7, 2024)
* `e11240d` Remove `changelog` from `README` (devanshamin, Jul 7, 2024)
* `ebc2142` Refactor `model_handler` (devanshamin, Jul 7, 2024)
* `4121e51` Move `eval_checker` to `bfcl/eval_checker` (devanshamin, Jul 7, 2024)
* `12bdeed` Add benchmark module (devanshamin, Jul 7, 2024)
* `837c767` Remove `eval_data_compilation` (devanshamin, Jul 7, 2024)
* `1e0004f` Remove `poetry.lock` (devanshamin, Jul 8, 2024)
* `e52d531` Add hugging face hub token (devanshamin, Jul 8, 2024)
* `f0833ed` Update build system (devanshamin, Jul 8, 2024)
* `1e8da5a` Move `functionary` from `oss_model` to `proprietary_model` (devanshamin, Jul 8, 2024)
* `34a170a` Fix type error (devanshamin, Jul 8, 2024)
* `f736521` Remove test category (devanshamin, Jul 8, 2024)
* `893c9af` Make `eval_checker` consistent with `main` branch by merging (#496) (devanshamin, Jul 8, 2024)
* `88e8462` Standardize test groups (devanshamin, Jul 9, 2024)
* `cb7349a` Improve test data downloading and saving model responses (devanshamin, Jul 9, 2024)
* `a4a1c4f` Support benchmarking of proprietary models (devanshamin, Jul 10, 2024)
* `795d959` Replaced with `bfcl/benchmark.py` (devanshamin, Jul 10, 2024)
* `c7c5167` Add relevance evaluator (devanshamin, Jul 11, 2024)
* `1605012` Rename `benchmark` to `llm_generation` (devanshamin, Jul 11, 2024)
* `90a6bde` Rename `evaluate` to `evaluation` (devanshamin, Jul 11, 2024)
* `fb0a599` Update sub-commands (devanshamin, Jul 11, 2024)
* `a42fd29` Add evaluation for executable group (devanshamin, Jul 12, 2024)
* `fa2694a` Standardize checker result (devanshamin, Jul 12, 2024)
* `09384e3` Convert checker from module to directory (devanshamin, Jul 12, 2024)
* `7bd671e` Add evaluation for ast group (devanshamin, Jul 13, 2024)
* `7c65495` Remove `eval_checker` dir (devanshamin, Jul 13, 2024)
* `159039d` Generate bfcl leaderboard result csv file (devanshamin, Jul 13, 2024)
* `707e2bd` Fix issue of incorrect test category comparison (devanshamin, Jul 13, 2024)
* `e85ca86` Update comments (devanshamin, Jul 13, 2024)
* `3f73201` Add new readme (devanshamin, Jul 13, 2024)
* `15b9c6a` Fix evaluation section (devanshamin, Jul 13, 2024)
* `e0645b1` update package dependency version (HuanzhiMao, Jul 24, 2024)
1 change: 1 addition & 0 deletions .gitignore
@@ -55,3 +55,4 @@ berkeley-function-call-leaderboard/score/

.direnv/
.venv
.cache
23 changes: 23 additions & 0 deletions berkeley-function-call-leaderboard/.env.example
@@ -0,0 +1,23 @@
# [OPTIONAL] Required for downloading gated hugging face models
HUGGING_FACE_HUB_TOKEN=

# [OPTIONAL] Required for LLM generation step
# Provide the API key for the model(s) you intend to use
OPENAI_API_KEY=sk-XXXXXX
MISTRAL_API_KEY=
FIREWORKS_API_KEY=
ANTHROPIC_API_KEY=
NVIDIA_API_KEY=nvapi-XXXXXX
GEMINI_GCP_PROJECT_ID=

COHERE_API_KEY=
USE_COHERE_OPTIMIZATION=False # True/False

DATABRICKS_API_KEY=
DATABRICKS_AZURE_ENDPOINT_URL=

# [OPTIONAL] Required for evaluation of `executable` test group
RAPID_API_KEY=
EXCHANGERATE_API_KEY=
OMDB_API_KEY=
GEOCODE_API_KEY=
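
For context, here is a minimal sketch of how a `.env` file based on this example could be loaded before running generation or evaluation. It assumes the `python-dotenv` package and only standard environment-variable access; the loading code is illustrative, not necessarily how the package wires it up internally:

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

# Read key=value pairs from a local .env file (copied from .env.example) into os.environ.
load_dotenv()

# Only the keys for the providers you actually benchmark need to be set.
openai_key = os.getenv("OPENAI_API_KEY")
if openai_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; copy .env.example to .env and fill it in.")
```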
42 changes: 42 additions & 0 deletions berkeley-function-call-leaderboard/CHANGELOG.md
@@ -0,0 +1,42 @@
# Changelog

All notable changes to this project will be documented in this file.

* [July 3, 2024] [#489](https://github.com/ShishirPatil/gorilla/pull/489): Add new model `nvidia/nemotron-4-340b-instruct` to the leaderboard.
* [June 18, 2024] [#470](https://github.com/ShishirPatil/gorilla/pull/470): Add new model `firefunction-v2-FC` to the leaderboard.
* [June 15, 2024] [#437](https://github.com/ShishirPatil/gorilla/pull/437): Fix prompting issues for `Nexusflow-Raven-v2 (FC)`.
* [June 7, 2024] [#407](https://github.com/ShishirPatil/gorilla/pull/407), [#462](https://github.com/ShishirPatil/gorilla/pull/462): Update the AST evaluation logic to allow the use of `int` values for Python parameters expecting `float` values. This is to accommodate the Python auto-conversion feature from `int` to `float`.
* [May 14, 2024] [#426](https://github.com/ShishirPatil/gorilla/pull/426):
- Add the following new models to the leaderboard:
+ `gpt-4o-2024-05-13`
+ `gpt-4o-2024-05-13-FC`
+ `gemini-1.5-pro-preview-0514`
+ `gemini-1.5-flash-preview-0514`
- Update price for the following models:
+ All Gemini Series
+ `Claude-2.1 (Prompt)` and `Claude-instant-1.2 (Prompt)`
+ `Mistral-large` and `Mistral-Small`
+ `GPT-3.5-Turbo-0125`
* [May 8, 2024] [#406](https://github.com/ShishirPatil/gorilla/pull/406) and [#421](https://github.com/ShishirPatil/gorilla/pull/421): Update the `gemini_handler.py` to better handle parallel function calls for Gemini models.
* [May 6, 2024] [#412](https://github.com/ShishirPatil/gorilla/pull/412): Bug fix in evaluation dataset for AST categories. This includes updates to both prompts and function docs.
* [May 2, 2024] [#405](https://github.com/ShishirPatil/gorilla/pull/405): Bug fix in the possible answers for the AST Simple evaluation dataset. Prompt and function docs are not affected.
* [April 28, 2024] [#397](https://github.com/ShishirPatil/gorilla/pull/397): Add new model `snowflake/arctic` to the leaderboard. Note that there are multiple ways to inference the model, and we choose to do it via Nvidia API catalog.
* [April 27, 2024] [#390](https://github.com/ShishirPatil/gorilla/pull/390): Bug fix in cost and latency calculation for open-source models, which are now all calculated when serving the model with [vLLM](https://github.com/vllm-project/vllm) using 8 V100 GPUs for consistency. $$\text{Cost} = \text{Latency per 1000 function call} * (\text{8xV100 azure-pay-as-you-go-price per hour / 3600})$$
* [April 25, 2024] [#386](https://github.com/ShishirPatil/gorilla/pull/386): Add 5 new models to the leaderboard: `meta-llama/Meta-Llama-3-8B-Instruct`, `meta-llama/Meta-Llama-3-70B-Instruct`, `gemini-1.5-pro-preview-0409`, `command-r-plus`, `command-r-plus-FC`.
* [April 19, 2024] [#377](https://github.com/ShishirPatil/gorilla/pull/377):
- Bug fix for the evaluation dataset in the executable test categories. This includes updates to both prompts and function docs.
- The `evaluation_result` field has been removed to accommodate the variability in API execution results across different evaluation runs. Instead, a human-verified `ground_truth` is now included for the executable test categories. During each evaluation run, `evaluation_result` is generated anew using the `ground_truth`, and then compared against the model output.
- A stricter metric has been adopted when using the `structural_match` (aka. type match) evaluation criteria ---- For `list` results, the lengths are compared; for `dict` results, the keys are matched. This is to account for the fast-changing nature of some of the real-time API results while ensuring the evaluation remains meaningful.
- Added another evaluation criteria `real_time_match` for the executable category, which is a looser form of `exact_match` specifically for numerical execution results. The execution result must be within a certain percentage threshold (20%) from the expected result to accommodate the live updates of API responses. User can change this threshold value in `eval_checker_constant.py`.
* [April 18, 2024] [#375](https://github.com/ShishirPatil/gorilla/pull/375): A more comprehensive API sanity check is included; the APIs that are invoked during the non-REST executable evaluation process will also be checked for their availability before running the evaluation. Also, add support for the shortcut `-s` for the `--skip-api-sanity-check` flag, based on the community feedback.
* [April 16, 2024] [#366](https://github.com/ShishirPatil/gorilla/pull/366): Switch to use Anthropic's new Tool Use Beta `tools-2024-04-04` when generating Claude 3 FC series data. `gpt-4-turbo-2024-04-09` and `gpt-4-turbo-2024-04-09-FC` are also added to the leaderboard.
* [April 11, 2024] [#347](https://github.com/ShishirPatil/gorilla/pull/347): Add the 95th percentile latency to the leaderboard statistics. This metric is useful for understanding the latency distribution of the models, especially the worst-case scenario.
* [April 10, 2024] [#339](https://github.com/ShishirPatil/gorilla/pull/339): Introduce REST API sanity check for the REST executable test category. It ensures that all the API endpoints involved during the execution evaluation process are working properly. If any of them are not behaving as expected, the evaluation process will be stopped by default as the result will be inaccurate. Users can choose to bypass this check by setting the `--skip-api-sanity-check` flag or `-s` for short.
* [April 9, 2024] [#338](https://github.com/ShishirPatil/gorilla/pull/338): Bug fix in the evaluation datasets (including both prompts and function docs). Bug fix for possible answers as well.
* [April 8, 2024] [#330](https://github.com/ShishirPatil/gorilla/pull/330): Fixed an oversight that was introduced in [#299](https://github.com/ShishirPatil/gorilla/pull/299). For function-calling (FC) models that cannot take `float` type in input, when the parameter type is a `float`, the evaluation procedure will convert that type to `number` in the model input and mention in the parameter description that `This is a float type value.`. An additional field `format: float` will also be included in the model input to make it clear about the type. Updated the model handler for Claude, Mistral, and OSS to better parse the model output.
* [April 8, 2024] [#327](https://github.com/ShishirPatil/gorilla/pull/327): Add new model `NousResearch/Hermes-2-Pro-Mistral-7B` to the leaderboard.
* [April 3, 2024] [#309](https://github.com/ShishirPatil/gorilla/pull/309): Bug fix for evaluation dataset possible answers. Implement **string standardization** for the AST evaluation pipeline, i.e. removing white spaces and a subset of punctuations (`,./-_*^`) to make the AST evaluation more robust and accurate. Fixed AST evaluation issue for type `tuple`. Add 2 new models `meetkai/functionary-small-v2.4 (FC)`, `meetkai/functionary-medium-v2.4 (FC)` to the leaderboard.
* [April 1, 2024] [#299](https://github.com/ShishirPatil/gorilla/pull/299): Leaderboard update with new models (`Claude-3-Haiku`, `Databrick-DBRX-Instruct`), more advanced AST evaluation procedure, and updated evaluation datasets. Cost and latency statistics during evaluation are also measured. We also released the manual that our evaluation procedure is based on, available [here](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html#metrics).
* [Mar 11, 2024] [#254](https://github.com/ShishirPatil/gorilla/pull/254): Leaderboard update with 3 new models: `Claude-3-Opus-20240229 (Prompt)`, `Claude-3-Sonnet-20240229 (Prompt)`, and `meetkai/functionary-medium-v2.2 (FC)`
* [Mar 5, 2024] [#237](https://github.com/ShishirPatil/gorilla/pull/237) and [238](https://github.com/ShishirPatil/gorilla/pull/238): leaderboard update resulting from [#223](https://github.com/ShishirPatil/gorilla/pull/223); 3 new models: `mistral-large-2402`, `gemini-1.0-pro`, and `gemma`.
* [Feb 29, 2024] [#223](https://github.com/ShishirPatil/gorilla/pull/223): modifications to REST evaluation.
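
To make the April 19, 2024 entry above concrete, here is a minimal sketch of the two checking ideas it describes. The function names and the interpretation of the 20% threshold as a fraction of the expected value are assumptions for illustration; this is not the leaderboard's actual checker code:

```python
def structural_match(expected, result) -> bool:
    """Type-level match: for lists compare lengths, for dicts compare key sets."""
    if isinstance(expected, list):
        return isinstance(result, list) and len(result) == len(expected)
    if isinstance(expected, dict):
        return isinstance(result, dict) and set(result.keys()) == set(expected.keys())
    return type(result) is type(expected)


def real_time_match(expected: float, result: float, threshold: float = 0.20) -> bool:
    """Looser numeric match: result must lie within `threshold` of the expected value."""
    if expected == 0:
        return result == 0
    return abs(result - expected) / abs(expected) <= threshold


# Example: a live exchange-rate API returns 1.08 where the ground truth recorded 1.00.
assert real_time_match(1.00, 1.08)                       # within 20% -> match
assert not real_time_match(1.00, 1.30)                   # 30% off -> no match
assert structural_match({"rate": 1.0}, {"rate": 1.08})   # same keys -> match
```

In this sketch, a live API value that drifts by up to 20% from the recorded ground truth still counts as a match, which is the stated intent of `real_time_match` for executable test results.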