
Added extra experiments - mainly around macro chunking #16

Merged
guenthermi merged 32 commits into jina-ai:main on Oct 2, 2024

Conversation

dannyjameswilliams (Contributor)

Overview

Experiments added:

  • LongEmbed Examples against chunk size (nDCG@10 and mAP@10)
  • Macro chunking approach vs 'hard' boundary approach with 0 overlap
  • Example with Anthropic's contextual retrieval

LongEmbed Examples against chunk size (nDCG@10 and mAP@10)

Similar to run_chunked_eval.py, run_chunked_eval_with_macro_chunks.py can be run from the command line, e.g.

python3 run_chunked_eval_with_macro_chunks.py --task-name LEMBWikimQARetrievalChunked

To reproduce easily, I recommend the bash script below:

#!/bin/bash

# Define an array of task names
names=(LEMBWikimQARetrievalChunked LEMBQMSumRetrievalChunked LEMBNarrativeQARetrievalChunked LEMBSummScreenFDRetrievalChunked)

# Loop over each task name and run the evaluation
for name in "${names[@]}"; do
  echo "$name"
  python3 run_chunked_eval_with_macro_chunks.py --task-name "$name"
done

to run them all at once. The results can then be displayed graphically in a matplotlib plot by running plot_chunk_size_experiments.py.
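
For orientation, the plot is along these lines; the chunk sizes and score values below are placeholders for illustration only, not real results and not the script's actual data-loading logic:

# Illustrative sketch only: dummy numbers, not actual experiment output.
import matplotlib.pyplot as plt

chunk_sizes = [128, 256, 512, 1024]  # hypothetical chunk sizes
scores = {  # placeholder nDCG@10 values per task
    "LEMBWikimQARetrievalChunked": [0.50, 0.55, 0.58, 0.57],
    "LEMBQMSumRetrievalChunked": [0.30, 0.33, 0.35, 0.34],
}

for task, ndcg in scores.items():
    plt.plot(chunk_sizes, ndcg, marker="o", label=task)
plt.xlabel("Chunk size (tokens)")
plt.ylabel("nDCG@10")
plt.legend()
plt.title("Retrieval score vs. chunk size")
plt.show()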

Macro chunking approach vs 'hard' boundary approach with 0 overlap

Similar to the above, this compares macro chunking to non-macro chunking; the experiment file is run_macro_chunking_experiments.py and the plot file is plot_macro_chunking_experiments.py.
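
For intuition only, here is a minimal sketch of the difference between the two strategies; the function names and sizes are made up for illustration and are not the repository's API:

# Illustration of the two chunking strategies compared; names and sizes are assumptions.

def hard_chunks(tokens, chunk_size=512, overlap=0):
    # 'Hard' boundary chunking: each window is later embedded in isolation,
    # so a chunk never sees tokens outside its own boundaries.
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

def macro_chunks(tokens, macro_size=8192, chunk_size=512):
    # Macro chunking (long late chunking): a long macro chunk is encoded in one
    # forward pass, and the token embeddings are then pooled per small chunk,
    # so each small chunk's embedding still reflects the macro-level context.
    for start in range(0, len(tokens), macro_size):
        macro = tokens[start:start + macro_size]
        yield [macro[i:i + chunk_size] for i in range(0, len(macro), chunk_size)]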

Example with Anthropic's contextual retrieval

You can run explanatory_contextual_retrieval.py to see a comparison between Anthropic's contextual retrieval (which manually adds context to each chunk), late chunking, and naive chunking. The comparison runs on a generated document that deliberately omits context in later sentences (using 'Its' instead of the company name). Chunks are compared via cosine similarities between their embeddings, using jina-embeddings-v2-base-en.
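
As a rough sketch of that kind of cosine-similarity check (the document text and query below are invented for illustration, only the model name comes from the description above, and only the naive-chunking case is shown):

# Sketch only: made-up chunks and query; not the actual script's contents.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-en", trust_remote_code=True
)

chunks = [
    "AcmeCorp was founded in 2010.",
    "Its revenue grew by 50% last year.",  # 'Its' is ambiguous without context
]
query = "How much did AcmeCorp's revenue grow?"

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

chunk_embeddings = model.encode(chunks)    # each chunk embedded in isolation
query_embedding = model.encode([query])[0]

for chunk, emb in zip(chunks, chunk_embeddings):
    print(f"{cosine(query_embedding, emb):.3f}  {chunk}")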

dannyjameswilliams marked this pull request as ready for review on September 26, 2024 at 19:47.
guenthermi (Member) left a comment:


Thank you for contributing all the code; I added some comments. It would be nice if you could address them before we merge this.

Resolved review comments (outdated) on: chunked_pooling/chunked_eval_tasks.py, chunked_pooling/chunking.py, plot_chunk_size_experiments.py, plot_macro_chunking_experiments.py
# overwrite_results=True,
# batch_size=BATCH_SIZE,
# encode_kwargs={'batch_size': BATCH_SIZE},
# )
guenthermi (Member):

I think this script needs a bit of a cleanup; maybe we can also integrate it into the main script.

dannyjameswilliams (Contributor, Author):

I have integrated it into the main script and also renamed macro chunking to long late chunking. Long late chunking is off by default, but it is enabled when defining e.g. --long-late-chunking-embed-size 8192. I hope this is what you meant.
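
A hypothetical invocation (assuming the flag was added to the main run_chunked_eval.py script; the exact CLI is not shown in this thread) might look like:

python3 run_chunked_eval.py --task-name LEMBWikimQARetrievalChunked --long-late-chunking-embed-size 8192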

Resolved review comment (outdated) on: run_macro_chunking_experiments.py
dannyjameswilliams (Contributor, Author):

All comments hopefully addressed.

guenthermi (Member) left a comment:

just added some minor comments

Comment on lines 97 to 100
# self.llm = pipeline(
# "text-generation", model=llm_name, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto",
# max_length = 1000
# )
guenthermi (Member):

Suggested change (remove the commented-out lines):
# self.llm = pipeline(
# "text-generation", model=llm_name, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto",
# max_length = 1000
# )

# to late chunking to see if the similarities are similar (which they appear to be)
#
# pip requirements:
# accelerate?
guenthermi (Member):

Can you add this to the pyproject.toml if necessary?

""".strip().replace("\n", "")


# llm_model_name = "microsoft/Phi-3.5-mini-instruct"
guenthermi (Member):

Suggested change (remove the commented-out line):
# llm_model_name = "microsoft/Phi-3.5-mini-instruct"

guenthermi merged commit 7c2bc57 into jina-ai:main on Oct 2, 2024.