Language Model Evaluation Harness with Medical Specialities Classification

This is a fork of the official lm-evaluation-harness repo that includes the medical_specialities task, which classifies questions into their medical specialities.

To run it, follow the official lm-evaluation-harness guide and select the medical_specialities task:

lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks medical_specialities \
    --device cuda:0 \
    --batch_size 8

This produces a list of metrics per speciality, which can help you identify whether a category is underrepresented or whether there are other biases.

Tasks	Filter	Metric		Value		Stderr
Allergy	none	acc	↑	0.3200	±	0.0469
Anatomy	none	acc	↑	0.2862	±	0.0186
Anesthesiology	none	acc	↑	0.2577	±	0.0344
Biochemistry	none	acc	↑	0.2388	±	0.0104
Cardiology	none	acc	↑	0.2659	±	0.0211
Chemistry	none	acc	↑	0.2587	±	0.0193
Dermatology	none	acc	↑	0.2660	±	0.0323
Emergency	none	acc	↑	0.2871	±	0.0319
Endocrinology	none	acc	↑	0.2456	±	0.0216
Gastroenterology	none	acc	↑	0.2364	±	0.0207
Genetics	none	acc	↑	0.2776	±	0.0192
Geriatrics	none	acc	↑	0.2609	±	0.0532
Gynecology	none	acc	↑	0.3015	±	0.0395
Hematology	none	acc	↑	0.2220	±	0.0184
Microbiology	none	acc	↑	0.2576	±	0.0141
Nephrology	none	acc	↑	0.2747	±	0.0271
Neurology	none	acc	↑	0.2801	±	0.0210
Nursing	none	acc	↑	0.2374	±	0.0303
Obstetrics	none	acc	↑	0.2655	±	0.0235
Odontology	none	acc	↑	0.3337	±	0.0149
Oncology	none	acc	↑	0.2367	±	0.0272
Ophthalmology	none	acc	↑	0.2500	±	0.0367
Orthopedics	none	acc	↑	0.3180	±	0.0317
Otorhinolaryngology	none	acc	↑	0.2775	±	0.0310
Pathology	none	acc	↑	0.2680	±	0.0452
Pediatrics	none	acc	↑	0.2959	±	0.0267
Pharmacology	none	acc	↑	0.2772	±	0.0158
Physiology	none	acc	↑	0.2559	±	0.0254
Psychiatry	none	acc	↑	0.2601	±	0.0143
Psychology	none	acc	↑	0.2686	±	0.0202
Radiology	none	acc	↑	0.3371	±	0.0504
Respiratory	none	acc	↑	0.2600	±	0.0235
Rheumatology	none	acc	↑	0.2110	±	0.0393
Surgery	none	acc	↑	0.2697	±	0.0334
Urology	none	acc	↑	0.2727	±	0.0427
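
One way to check for underrepresented categories is to estimate how many questions sit behind each row from its accuracy and standard error. The sketch below is illustrative and not part of this repo. It assumes the reported Stderr is the standard error of a mean over 0/1 scores computed with the sample standard deviation, i.e. stderr = sqrt(acc * (1 - acc) / (n - 1)). The helper names `estimate_sample_count` and `rank_by_sample_count` are hypothetical, and only a few rows of the table are copied in. Because the table values are rounded to four decimals, the counts are rough estimates.

```python
# (speciality, accuracy, stderr) copied from a few rows of the results table.
ROWS = [
    ("Allergy", 0.3200, 0.0469),
    ("Biochemistry", 0.2388, 0.0104),
    ("Geriatrics", 0.2609, 0.0532),
    ("Odontology", 0.3337, 0.0149),
    ("Radiology", 0.3371, 0.0504),
]


def estimate_sample_count(acc: float, stderr: float) -> int:
    """Approximate the number of questions behind one table row.

    ASSUMPTION: stderr = sqrt(acc * (1 - acc) / (n - 1)), so
    n = acc * (1 - acc) / stderr**2 + 1. Hypothetical helper.
    """
    return round(acc * (1.0 - acc) / stderr**2 + 1)


def rank_by_sample_count(rows):
    """Return (speciality, estimated n) pairs, smallest estimate first."""
    estimates = [(name, estimate_sample_count(acc, se)) for name, acc, se in rows]
    return sorted(estimates, key=lambda pair: pair[1])


if __name__ == "__main__":
    for name, n in rank_by_sample_count(ROWS):
        print(f"{name:<15} ~{n} questions")
```

Rows with a small estimated count also have a wide standard error, so their accuracy should be compared with other specialities cautiously.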

More info about the datasets: https://huggingface.co/datasets/HPAI-BSC/medical-specialities
More info about the code to classify the questions: https://github.com/HPAI-BSC/medical-specialities
Notebook with usage example: link
