# Asclepius-synthetic-clincal-notes-retrieval-benchmark

## Retrieval Results

### Question Answering Task

| Embedding Model | Embedding Dimension | Hit Rate @ 20 | MRR @ 20 | Link to Experiment |
| --- | --- | --- | --- | --- |
| jina-v2-base-en | 768 | 0.82 | 0.72 | here |
| bge-base-en-v1.5 | 768 | 0.81 | 0.70 | here |
| text-embedding-ada-002 | 1536 | 0.81 | 0.69 | here |
| text-embedding-3-small | 512 | 0.84 | 0.73 | here |
| text-embedding-3-small | 1536 | 0.84 | 0.73 | here |
| text-embedding-3-large | 256 | 0.82 | 0.68 | here |
| text-embedding-3-large | 3072 | 0.85 | 0.73 | here |

### Paraphrasing Task

| Embedding Model | Embedding Dimension | Hit Rate @ 20 | MRR @ 20 | Link to Experiment |
| --- | --- | --- | --- | --- |
| jina-v2-base-en | 768 | 0.42 | 0.32 | here |
| bge-base-en-v1.5 | 768 | 0.50 | 0.37 | here |
| text-embedding-ada-002 | 1536 | 0.57 | 0.44 | here |
| text-embedding-3-small | 512 | 0.59 | 0.46 | here |
| text-embedding-3-small | 1536 | 0.59 | 0.46 | here |
| text-embedding-3-large | 256 | 0.61 | 0.44 | here |
| text-embedding-3-large | 3072 | 0.64 | 0.48 | here |
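
For reference, the two metrics above can be computed per query as sketched below. This is a minimal sketch rather than the repository's benchmarking code, the function names are illustrative, and it assumes each query has exactly one correct document id and a ranked list of retrieved ids; dataset-level numbers are the mean over all queries.

```python
from typing import Sequence


def hit_rate_at_k(correct_id: str, retrieved_ids: Sequence[str], k: int = 20) -> float:
    """1.0 if the correct document appears among the top-k retrieved results, else 0.0."""
    return float(correct_id in retrieved_ids[:k])


def mrr_at_k(correct_id: str, retrieved_ids: Sequence[str], k: int = 20) -> float:
    """Reciprocal of the 1-based rank of the correct document within the top-k, else 0.0."""
    for rank, doc_id in enumerate(retrieved_ids[:k], start=1):
        if doc_id == correct_id:
            return 1.0 / rank
    return 0.0
```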

## Using an LLM to measure hit rate without access to the correct label

Evaluating retrieval during experimentation is easy because the evaluation is typically run on a labeled dataset, but this is often not possible when evaluating retrieval in production. This motivates defining a reference-free evaluation method, i.e., one that does not have access to the correct answer.

The LLM eval metric was tested on 400 randomly sampled examples for each dataset & embedding model.

### Creation of the LLM metric

The simplest way to create an LLM-based eval metric is a zero-shot evaluation with GPT-3.5-turbo-0125. The prompt instructs the model to assess whether the answer to a given question is among the 20 retrieved results. JSON mode is used to instruct the model to return a field called `thoughts` (which gives the model room to reason before deciding) and a field called `final_verdict` (which is parsed to obtain the LLM's decision). This is encapsulated in Parea's pre-built LLM evaluation (Link to Python implementation and docs).
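
The metric used here is Parea's pre-built evaluation linked above; the snippet below is only a minimal sketch of the same idea (a zero-shot GPT-3.5-turbo-0125 judge in JSON mode). It assumes the openai Python SDK (v1.x), and the prompt wording and the function name `llm_hit_eval` are illustrative, not the actual implementation.

```python
import json

from openai import OpenAI

client = OpenAI()


def llm_hit_eval(question: str, retrieved_contexts: list[str]) -> bool:
    """Ask the model whether any of the retrieved results answers the question."""
    contexts = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_contexts))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        response_format={"type": "json_object"},  # JSON mode
        messages=[
            {
                "role": "system",
                "content": (
                    "You judge whether the answer to a question is contained in the "
                    "retrieved results. Respond as JSON with a field 'thoughts' "
                    "(your reasoning) and a field 'final_verdict' ('yes' or 'no')."
                ),
            },
            {
                "role": "user",
                "content": f"Question:\n{question}\n\nRetrieved results:\n{contexts}",
            },
        ],
    )
    verdict = json.loads(response.choices[0].message.content)
    return verdict["final_verdict"].strip().lower() == "yes"
```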

### Using Few-Shot Examples

To improve the accuracy of the LLM-based eval metric, few-shot examples were used (a sketch of how such examples can be spliced into the prompt follows below). Concretely, the few-shot examples were:

- few-shot example 1: an example where jina-v2-base-en didn't retrieve the right answer for a Q&A task
  - indicated by `false_1` / `false_sample_1` in the evaluation metric name
- few-shot example 2: an example where bge-base-en-v1.5 didn't retrieve the right answer for a paraphrasing task
  - indicated by `false_2` / `false_sample_2` in the evaluation metric name

The implementations of the eval metrics are in `evals.py`.
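
As a sketch of how few-shot examples can be added to such a metric, each example can be spliced into the chat as a prior user/assistant turn between the system prompt and the new query. The placeholder contents below are illustrative; the real examples live in `evals.py`.

```python
import json

# Illustrative few-shot turns (placeholders, not the real examples from evals.py).
# Each example is a prior user message plus the desired JSON reply from the assistant.
false_sample_1 = [
    {"role": "user", "content": "Question:\n<Q&A question jina-v2-base-en missed>\n\nRetrieved results:\n<20 contexts>"},
    {"role": "assistant", "content": json.dumps({"thoughts": "None of the results contain the answer.", "final_verdict": "no"})},
]
false_sample_2 = [
    {"role": "user", "content": "Question:\n<paraphrasing query bge-base-en-v1.5 missed>\n\nRetrieved results:\n<20 contexts>"},
    {"role": "assistant", "content": json.dumps({"thoughts": "No retrieved note matches the input.", "final_verdict": "no"})},
]

# A 1-shot metric splices one example between the system prompt and the new query,
# e.g. messages = [system_msg, *false_sample_1, user_msg]; a 2-shot metric concatenates
# both examples, and the two possible orderings give the two 2-shot variants compared below.
```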

### Results

The results are presented as heatmaps measuring the accuracy, false positive rate (FPR) & false negative rate (FNR) of each evaluation metric relative to the hit rate (treated as ground truth).
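
These agreement statistics can be computed as sketched below, assuming one boolean LLM verdict and one ground-truth hit label per sample (the function name is illustrative, not the repository's aggregation code):

```python
def agreement_stats(llm_verdicts: list[bool], hits: list[bool]) -> dict[str, float]:
    """Compare LLM verdicts against ground-truth hit labels (hit rate @ 20)."""
    tp = sum(v and h for v, h in zip(llm_verdicts, hits))
    tn = sum(not v and not h for v, h in zip(llm_verdicts, hits))
    fp = sum(v and not h for v, h in zip(llm_verdicts, hits))
    fn = sum(not v and h for v, h in zip(llm_verdicts, hits))
    return {
        "accuracy": (tp + tn) / len(hits),
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,  # LLM says "hit" but it was a miss
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,  # LLM says "miss" but it was a hit
    }
```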

#### Question Answering Task

- the 0-shot evaluation metric has high agreement with the hit rate: 81-88% accuracy
- the 1-shot evaluation metric based on few-shot example 1 consistently improves agreement with the hit rate: 83-88% accuracy
- the 1-shot eval metric based on few-shot example 2 degrades performance
- combining both few-shot examples degrades performance by significantly increasing the false negative rate

Heatmap

Link to experiments:

#### Paraphrasing Task

- the 0-shot evaluation metric has low accuracy
- both 1-shot eval metrics improve upon the 0-shot metric
- combining both examples into a 2-shot eval metric yields synergistic effects when the order is few-shot example 1 followed by few-shot example 2
- combining the examples in the reverse order does not always improve upon the 1-shot eval metrics

Heatmap

Link to experiments:
