What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks (Jupyter Notebook; updated Jul 26, 2024)
Python SDK for experimenting with, testing, evaluating, and monitoring LLM-powered applications - Parea AI (YC S23)
BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models
How good are LLMs at chemistry?
Language Model for Mainframe Modernization
CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
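To make the pairwise format concrete, here is a minimal sketch of what a CompBench-style comparative item and an exact-match scorer could look like. The field names and file names are assumptions for illustration, not CompBench's actual schema.

```python
# Hypothetical record shape for one pairwise comparative-reasoning item.
# Field names ("image_pair", "dimension", ...) are illustrative assumptions.
item = {
    "image_pair": ["photo_a.jpg", "photo_b.jpg"],
    "dimension": "quantity",  # one of the 8 relative-comparison dimensions
    "question": "Which image contains more animals?",
    "answer": "left",
}

def check_prediction(item: dict, prediction: str) -> bool:
    """Exact-match scoring: normalize case/whitespace, then compare."""
    return prediction.strip().lower() == item["answer"]

print(check_prediction(item, "Left"))   # normalization makes this a match
print(check_prediction(item, "right"))  # wrong side of the pair
```

Exact-match on a small closed answer set ("left"/"right") keeps scoring deterministic, which matters when comparing MLLMs across 40K items.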
The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".
Restore safety in fine-tuned language models through task arithmetic
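The general idea behind task arithmetic can be sketched in a few lines: a "task vector" is the per-parameter difference between a fine-tuned checkpoint and its base model, and adding or subtracting scaled task vectors steers model behavior. The sketch below uses toy NumPy "checkpoints" and is an illustration of the technique in general, not the repository's exact method.

```python
import numpy as np

def task_vector(finetuned: dict, base: dict) -> dict:
    """Per-parameter difference between a fine-tuned and a base checkpoint."""
    return {name: finetuned[name] - base[name] for name in base}

def apply_task_vector(weights: dict, vector: dict, alpha: float) -> dict:
    """Add a scaled task vector to a set of weights (alpha < 0 negates it)."""
    return {name: weights[name] + alpha * vector[name] for name in weights}

# Toy one-layer 'checkpoints'; values are arbitrary for illustration.
base = {"w": np.array([1.0, 2.0])}
finetuned = {"w": np.array([1.5, 1.0])}  # hypothetical fine-tune that drifted

tv = task_vector(finetuned, base)                        # {"w": [0.5, -1.0]}
restored = apply_task_vector(finetuned, tv, alpha=-1.0)  # subtract the drift
print(restored["w"])  # recovers the base weights: [1. 2.]
```

With alpha = -1.0 the task vector is fully negated, recovering the base weights exactly; in practice alpha would be tuned to remove unwanted behavior while preserving the fine-tune's useful capabilities.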
Training and Benchmarking LLMs for Code Preference.
Code and data for Koo et al's ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.
The MERIT Dataset is a fully synthetic, labeled dataset for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, an area we are actively working on. The repository is actively maintained, with new features added continuously.
Benchmark that evaluates LLMs using 436 NYT Connections puzzles
Official repository for "RoMath: A Mathematical Reasoning Benchmark in 🇷🇴 Romanian 🇷🇴"
Awesome Mixture of Experts (MoE): A Curated List of Mixture of Experts (MoE) and Mixture of Multimodal Experts (MoME)
Official codebase for MEGAVERSE (published at NAACL 2024)
Fine-Tuning and Evaluating a Falcon 7B Model for generating HTML code from input prompts.
A platform that enables users to perform private benchmarking of machine learning models. The platform facilitates the evaluation of models based on different trust levels between the model owners and the dataset owners.
This repository contains a list of benchmarks used by big orgs to evaluate their LLMs.