| Embedding Model | Embedding Dimension | Hit Rate @ 20 | MRR @ 20 | Link to Experiment |
|---|---|---|---|---|
| jina-v2-base-en | 768 | 0.82 | 0.72 | here |
| bge-base-en-v1.5 | 768 | 0.81 | 0.70 | here |
| text-embedding-ada-002 | 1536 | 0.81 | 0.69 | here |
| text-embedding-3-small | 512 | 0.84 | 0.73 | here |
| text-embedding-3-small | 1536 | 0.84 | 0.73 | here |
| text-embedding-3-large | 256 | 0.82 | 0.68 | here |
| text-embedding-3-large | 3072 | 0.85 | 0.73 | here |

| Embedding Model | Embedding Dimension | Hit Rate @ 20 | MRR @ 20 | Link to Experiment |
|---|---|---|---|---|
| jina-v2-base-en | 768 | 0.42 | 0.32 | here |
| bge-base-en-v1.5 | 768 | 0.50 | 0.37 | here |
| text-embedding-ada-002 | 1536 | 0.57 | 0.44 | here |
| text-embedding-3-small | 512 | 0.59 | 0.46 | here |
| text-embedding-3-small | 1536 | 0.59 | 0.46 | here |
| text-embedding-3-large | 256 | 0.61 | 0.44 | here |
| text-embedding-3-large | 3072 | 0.64 | 0.48 | here |
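For context, hit rate @ 20 here is taken to mean the share of queries whose expected chunk appears among the top 20 retrieved results, and MRR @ 20 the mean reciprocal rank of that chunk (0 when it is missing). A minimal sketch of both metrics under those standard definitions (function and variable names are illustrative, not taken from the benchmark code):

```python
def hit_rate_at_k(retrieved_ids: list[list[str]], expected_ids: list[str], k: int = 20) -> float:
    """Fraction of queries whose expected chunk appears in the top-k results."""
    hits = [expected in retrieved[:k] for retrieved, expected in zip(retrieved_ids, expected_ids)]
    return sum(hits) / len(hits)


def mrr_at_k(retrieved_ids: list[list[str]], expected_ids: list[str], k: int = 20) -> float:
    """Mean reciprocal rank of the expected chunk within the top-k results (0 if missing)."""
    reciprocal_ranks = []
    for retrieved, expected in zip(retrieved_ids, expected_ids):
        if expected in retrieved[:k]:
            rank = retrieved[:k].index(expected) + 1  # 1-based rank
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```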
While evaluating retrieval during experimentation is easy because the dataset is labeled, doing the same for retrieval used in production is often not possible. This motivates defining an evaluation method that is reference-free, i.e., one that doesn't have access to the correct answer.
The LLM eval metric was tested on 400 randomly-sampled examples for each dataset & embedding model.
The easiest way to create an LLM-based eval metric is a zero-shot evaluation with GPT-3.5-turbo-0125. The prompt instructs the model to assess whether the answer to a given question is among the 20 retrieved results. It uses JSON mode and instructs the model to return a field called thoughts (which gives the model the ability to think before deciding) and a field called final_verdict (which is used to parse the decision of the LLM). This is encapsulated in Parea's pre-built LLM evaluation (Link to Python implementation and docs).
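A minimal sketch of such a zero-shot judge is shown below. It assumes the OpenAI Python client; the prompt wording and function name are simplified placeholders, not the exact prompt used in Parea's pre-built eval:

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set


def llm_judge_hit(question: str, retrieved_chunks: list[str]) -> bool:
    """Reference-free, zero-shot check: is the answer to `question` contained
    in the retrieved results? Returns the parsed boolean verdict."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    prompt = (
        "Given a question and 20 retrieved results, decide whether the answer to the "
        "question is contained in the results. Respond in JSON with the keys "
        '"thoughts" (your reasoning first) and "final_verdict" ("yes" or "no").\n\n'
        f"Question: {question}\n\nRetrieved results:\n{context}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        response_format={"type": "json_object"},  # JSON mode
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = json.loads(response.choices[0].message.content)
    return verdict["final_verdict"].strip().lower() == "yes"
```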
To improve the accuracy of the LLM-based eval metric, few-shot examples were used (a sketch of how such examples can be wired into the judge prompt follows the list below). Concretely, the few-shot examples were:
- few-shot example 1: an example where jina-v2-base-en didn't retrieve the right answer for a Q&A task; indicated by false_1/false_sample_1 in the evaluation metric name
- few-shot example 2: an example where bge-base-en-v1.5 didn't retrieve the right answer for a paraphrasing task; indicated by false_2/false_sample_2 in the evaluation metric name
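One common way to add such few-shot examples is to prepend worked user/assistant turns before the actual query. The sketch below illustrates that pattern; the example contents, prompt wording, and names are placeholders rather than the actual samples from evals.py:

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

SYSTEM_PROMPT = (
    "Given a question and 20 retrieved results, decide whether the answer to the "
    "question is contained in the results. Respond in JSON with the keys "
    '"thoughts" and "final_verdict" ("yes" or "no").'
)

# Hypothetical few-shot turns: a question plus retrieved results where the answer
# was NOT contained, paired with the verdict the judge should have produced.
FEW_SHOT_MESSAGES = [
    {
        "role": "user",
        "content": "Question: <question the retriever failed on>\n\n"
        "Retrieved results:\n<20 chunks that do not contain the answer>",
    },
    {
        "role": "assistant",
        "content": json.dumps(
            {"thoughts": "None of the retrieved chunks contain the answer.", "final_verdict": "no"}
        ),
    },
]


def llm_judge_hit_few_shot(question: str, retrieved_chunks: list[str]) -> bool:
    """Reference-free judge with few-shot failure examples prepended to the prompt."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            *FEW_SHOT_MESSAGES,
            {"role": "user", "content": f"Question: {question}\n\nRetrieved results:\n{context}"},
        ],
    )
    verdict = json.loads(response.choices[0].message.content)
    return verdict["final_verdict"].strip().lower() == "yes"
```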
The implementation of the eval metrics is in evals.py.
The results are presented as heatmaps measuring the accuracy, false positive rate (fpr) & false negative rate (fnr) of the evaluation metric relative to hit rate.
- the 0-shot evaluation metric has great overlap with hit rate: 81-88% accuracy
- the 1-shot evaluation metric based on few-shot example 1 consistently improves the accuracy of the eval metric with respect to hit rate: 83-88%
- the 1-shot eval metric based on few-shot example 2 degrades performance
- combining both few-shot examples degrades performance by significantly increasing the false negative rate
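For reference, these agreement numbers treat each sample's hit-rate label as ground truth and the LLM verdict as the prediction. A minimal sketch of that comparison (names are illustrative):

```python
def agreement_with_hit_rate(llm_verdicts: list[bool], hits: list[bool]) -> dict[str, float]:
    """Accuracy, false positive rate and false negative rate of the LLM eval
    metric, treating the per-sample hit-rate label as ground truth."""
    tp = sum(v and h for v, h in zip(llm_verdicts, hits))
    tn = sum((not v) and (not h) for v, h in zip(llm_verdicts, hits))
    fp = sum(v and (not h) for v, h in zip(llm_verdicts, hits))
    fn = sum((not v) and h for v, h in zip(llm_verdicts, hits))
    return {
        "accuracy": (tp + tn) / len(hits),
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```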
Links to experiments:
- jina-v2-base-en Q&A 400 samples
- bge-base-en-v1.5 Q&A 400 samples
- text-embedding-3-small 512-dims Q&A 400 samples
- text-embedding-3-small 1536-dims Q&A 400 samples
- text-embedding-3-large 256-dims Q&A 400 samples
- text-embedding-3-large 3072-dims Q&A 400 samples
- the 0-shot evaluation metric has low accuracy
- both 1-shot eval metrics improve upon the 0-shot one
- combining both examples into a 2-shot eval metric yields synergistic effects when the order is few-shot example 1 followed by few-shot example 2
- combining both examples in the reverse order doesn't always improve over the 1-shot eval metrics
Links to experiments:
- jina-v2-base-en Paraphrasing 400 samples
- bge-base-en-v1.5 Paraphrasing 400 samples
- text-embedding-3-small 512-dims Paraphrasing 400 samples
- text-embedding-3-small 1536-dims Paraphrasing 400 samples
- text-embedding-3-large 256-dims Paraphrasing 400 samples
- text-embedding-3-large 3072-dims Paraphrasing 400 samples