| Embedding Model | Embedding Dimension | Hit Rate @ 20 | MRR @ 20 | Link to Experiment |
|---|---|---|---|---|
| jina-v2-base-en | 768 | 0.82 | 0.72 | here |
| bge-base-en-v1.5 | 768 | 0.81 | 0.70 | here |
| text-embedding-ada-002 | 1536 | 0.81 | 0.69 | here |
| text-embedding-3-small | 512 | 0.84 | 0.73 | here |
| text-embedding-3-small | 1536 | 0.84 | 0.73 | here |
| text-embedding-3-large | 256 | 0.82 | 0.68 | here |
| text-embedding-3-large | 3072 | 0.85 | 0.73 | here |

| Embedding Model | Embedding Dimension | Hit Rate @ 20 | MRR @ 20 | Link to Experiment |
|---|---|---|---|---|
| jina-v2-base-en | 768 | 0.42 | 0.32 | here |
| bge-base-en-v1.5 | 768 | 0.50 | 0.37 | here |
| text-embedding-ada-002 | 1536 | 0.57 | 0.44 | here |
| text-embedding-3-small | 512 | 0.59 | 0.46 | here |
| text-embedding-3-small | 1536 | 0.59 | 0.46 | here |
| text-embedding-3-large | 256 | 0.61 | 0.44 | here |
| text-embedding-3-large | 3072 | 0.64 | 0.48 | here |
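For context, hit rate @ 20 here is taken to mean the share of queries whose expected chunk appears among the top 20 retrieved results, and MRR @ 20 the mean reciprocal rank of that chunk (0 when it is missing). A minimal sketch of both metrics under those standard definitions (function and variable names are illustrative, not taken from the benchmark code):

```python
def hit_rate_at_k(retrieved_ids: list[list[str]], expected_ids: list[str], k: int = 20) -> float:
    """Fraction of queries whose expected chunk appears in the top-k results."""
    hits = [expected in retrieved[:k] for retrieved, expected in zip(retrieved_ids, expected_ids)]
    return sum(hits) / len(hits)


def mrr_at_k(retrieved_ids: list[list[str]], expected_ids: list[str], k: int = 20) -> float:
    """Mean reciprocal rank of the expected chunk within the top-k results (0 if missing)."""
    reciprocal_ranks = []
    for retrieved, expected in zip(retrieved_ids, expected_ids):
        if expected in retrieved[:k]:
            rank = retrieved[:k].index(expected) + 1  # 1-based rank
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```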
While evaluating retrieval during experimentation is easy because the dataset is labeled, doing the same for retrieval used in production is often not possible. This motivates defining an evaluation method that is reference-free, i.e., one that doesn't have access to the correct answer.
The LLM eval metric was tested on 400 randomly-sampled examples for each dataset & embedding model.
The easiest way to create an LLM-based eval metric is a zero-shot evaluation with GPT-3.5-turbo-0125. The prompt instructs the model to assess whether the answer to a given question is among the 20 retrieved results. It uses JSON mode and instructs the model to return a field called thoughts (which gives the model the ability to think before deciding) and a field called final_verdict (which is used to parse the decision of the LLM). This is encapsulated in Parea's pre-built LLM evaluation (Link to Python implementation and docs).
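A minimal sketch of such a zero-shot judge is shown below. It assumes the OpenAI Python client; the prompt wording and function name are simplified placeholders, not the exact prompt used in Parea's pre-built eval:

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set


def llm_judge_hit(question: str, retrieved_chunks: list[str]) -> bool:
    """Reference-free, zero-shot check: is the answer to `question` contained
    in the retrieved results? Returns the parsed boolean verdict."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    prompt = (
        "Given a question and 20 retrieved results, decide whether the answer to the "
        "question is contained in the results. Respond in JSON with the keys "
        '"thoughts" (your reasoning first) and "final_verdict" ("yes" or "no").\n\n'
        f"Question: {question}\n\nRetrieved results:\n{context}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        response_format={"type": "json_object"},  # JSON mode
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = json.loads(response.choices[0].message.content)
    return verdict["final_verdict"].strip().lower() == "yes"
```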
To improve the accuracy of the LLM-based eval metric, few-shot examples were used (a sketch of how such examples can be wired into the judge prompt follows the list below). Concretely, the few-shot examples were:
- few-shot example 1: an example where jina-v2-base-en didn't retrieve the right answer for a Q&A task; indicated by false_1/false_sample_1 in the evaluation metric name
- few-shot example 2: an example where bge-base-en-v1.5 didn't retrieve the right answer for a paraphrasing task; indicated by false_2/false_sample_2 in the evaluation metric name
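One common way to add such few-shot examples is to prepend worked user/assistant turns before the actual query. The sketch below illustrates that pattern; the example contents, prompt wording, and names are placeholders rather than the actual samples from evals.py:

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

SYSTEM_PROMPT = (
    "Given a question and 20 retrieved results, decide whether the answer to the "
    "question is contained in the results. Respond in JSON with the keys "
    '"thoughts" and "final_verdict" ("yes" or "no").'
)

# Hypothetical few-shot turns: a question plus retrieved results where the answer
# was NOT contained, paired with the verdict the judge should have produced.
FEW_SHOT_MESSAGES = [
    {
        "role": "user",
        "content": "Question: <question the retriever failed on>\n\n"
        "Retrieved results:\n<20 chunks that do not contain the answer>",
    },
    {
        "role": "assistant",
        "content": json.dumps(
            {"thoughts": "None of the retrieved chunks contain the answer.", "final_verdict": "no"}
        ),
    },
]


def llm_judge_hit_few_shot(question: str, retrieved_chunks: list[str]) -> bool:
    """Reference-free judge with few-shot failure examples prepended to the prompt."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            *FEW_SHOT_MESSAGES,
            {"role": "user", "content": f"Question: {question}\n\nRetrieved results:\n{context}"},
        ],
    )
    verdict = json.loads(response.choices[0].message.content)
    return verdict["final_verdict"].strip().lower() == "yes"
```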
The implementation of the eval metrics is in evals.py.
The results are presented as heatmaps measuring the accuracy, false positive rate (fpr) & false negative rate (fnr) of the evaluation metric relative to hit rate.
- the 0-shot evaluation metric has great overlap with hit rate: 81-88% accuracy
- the 1-shot evaluation metric based on few-shot example 1 consistently improves the accuracy of the eval metric with respect to hit rate: 83-88%
- the 1-shot eval metric based on few-shot example 2 degrades performance
- combining both few-shot examples degrades performance by significantly increasing the false negative rate
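For reference, these agreement numbers treat each sample's hit-rate label as ground truth and the LLM verdict as the prediction. A minimal sketch of that comparison (names are illustrative):

```python
def agreement_with_hit_rate(llm_verdicts: list[bool], hits: list[bool]) -> dict[str, float]:
    """Accuracy, false positive rate and false negative rate of the LLM eval
    metric, treating the per-sample hit-rate label as ground truth."""
    tp = sum(v and h for v, h in zip(llm_verdicts, hits))
    tn = sum((not v) and (not h) for v, h in zip(llm_verdicts, hits))
    fp = sum(v and (not h) for v, h in zip(llm_verdicts, hits))
    fn = sum((not v) and h for v, h in zip(llm_verdicts, hits))
    return {
        "accuracy": (tp + tn) / len(hits),
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```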
Links to experiments:
- jina-v2-base-en Q&A 400 samples
- bge-base-en-v1.5 Q&A 400 samples
- text-embedding-3-small 512-dims Q&A 400 samples
- text-embedding-3-small 1536-dims Q&A 400 samples
- text-embedding-3-large 256-dims Q&A 400 samples
- text-embedding-3-large 3072-dims Q&A 400 samples
- the 0-shot evaluation metric has low accuracy
- both 1-shot eval metrics improve upon the 0-shot one
- combining both examples into a 2-shot eval metric yields synergistic effects when the order is few-shot example 1 followed by few-shot example 2
- combining both examples in the reverse order doesn't always improve over the 1-shot eval metrics
Links to experiments:
- jina-v2-base-en Paraphrasing 400 samples
- bge-base-en-v1.5 Paraphrasing 400 samples
- text-embedding-3-small 512-dims Paraphrasing 400 samples
- text-embedding-3-small 1536-dims Paraphrasing 400 samples
- text-embedding-3-large 256-dims Paraphrasing 400 samples
- text-embedding-3-large 3072-dims Paraphrasing 400 samples