What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks (Jupyter Notebook; updated Jul 26, 2024)
Python SDK for experimenting with, testing, evaluating, and monitoring LLM-powered applications - Parea AI (YC S23)
BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models
How good are LLMs at chemistry?
Language Model for Mainframe Modernization
CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
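To make the pairwise format concrete, here is a minimal sketch of what a CompBench-style comparative item and an exact-match scorer could look like. The field names and file names are assumptions for illustration, not CompBench's actual schema.

```python
# Hypothetical record shape for one pairwise comparative-reasoning item.
# Field names ("image_pair", "dimension", ...) are illustrative assumptions.
item = {
    "image_pair": ["photo_a.jpg", "photo_b.jpg"],
    "dimension": "quantity",  # one of the 8 relative-comparison dimensions
    "question": "Which image contains more animals?",
    "answer": "left",
}

def check_prediction(item: dict, prediction: str) -> bool:
    """Exact-match scoring: normalize case/whitespace, then compare."""
    return prediction.strip().lower() == item["answer"]

print(check_prediction(item, "Left"))   # normalization makes this a match
print(check_prediction(item, "right"))  # wrong side of the pair
```

Exact-match on a small closed answer set ("left"/"right") keeps scoring deterministic, which matters when comparing MLLMs across 40K items.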
The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".
Restore safety in fine-tuned language models through task arithmetic
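The general idea behind task arithmetic can be sketched in a few lines: a "task vector" is the per-parameter difference between a fine-tuned checkpoint and its base model, and adding or subtracting scaled task vectors steers model behavior. The sketch below uses toy NumPy "checkpoints" and is an illustration of the technique in general, not the repository's exact method.

```python
import numpy as np

def task_vector(finetuned: dict, base: dict) -> dict:
    """Per-parameter difference between a fine-tuned and a base checkpoint."""
    return {name: finetuned[name] - base[name] for name in base}

def apply_task_vector(weights: dict, vector: dict, alpha: float) -> dict:
    """Add a scaled task vector to a set of weights (alpha < 0 negates it)."""
    return {name: weights[name] + alpha * vector[name] for name in weights}

# Toy one-layer 'checkpoints'; values are arbitrary for illustration.
base = {"w": np.array([1.0, 2.0])}
finetuned = {"w": np.array([1.5, 1.0])}  # hypothetical fine-tune that drifted

tv = task_vector(finetuned, base)                        # {"w": [0.5, -1.0]}
restored = apply_task_vector(finetuned, tv, alpha=-1.0)  # subtract the drift
print(restored["w"])  # recovers the base weights: [1. 2.]
```

With alpha = -1.0 the task vector is fully negated, recovering the base weights exactly; in practice alpha would be tuned to remove unwanted behavior while preserving the fine-tune's useful capabilities.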
Training and Benchmarking LLMs for Code Preference.
Code and data for Koo et al's ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.
The MERIT Dataset is a fully synthetic, labeled dataset for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, an area we are actively working on. The repository is actively maintained, with new features added continuously.
Benchmark that evaluates LLMs using 436 NYT Connections puzzles
Official repository for "RoMath: A Mathematical Reasoning Benchmark in 🇷🇴 Romanian 🇷🇴"
Awesome Mixture of Experts (MoE): A Curated List of Mixture of Experts (MoE) and Mixture of Multimodal Experts (MoME)
Official codebase for MEGAVERSE (published at NAACL 2024)
Fine-Tuning and Evaluating a Falcon 7B Model for generating HTML code from input prompts.
A platform that enables users to perform private benchmarking of machine learning models. The platform facilitates the evaluation of models based on different trust levels between the model owners and the dataset owners.
This repository contains a list of benchmarks used by big orgs to evaluate their LLMs.