
Added extra experiments - mainly around macro chunking #16

Merged
guenthermi merged 32 commits into jina-ai:main on Oct 2, 2024

Conversation

dannyjameswilliams (Contributor)

Overview

Experiments added:

  • LongEmbed Examples against chunk size (nDCG@10 and mAP@10)
  • Macro chunking approach vs 'hard' boundary approach with 0 overlap
  • Example with Anthropic's contextual retrieval

LongEmbed Examples against chunk size (nDCG@10 and mAP@10)

Similar to run_chunked_eval.py, run_chunked_eval_with_macro_chunks.py can be run from the command line, e.g.

python3 run_chunked_eval_with_macro_chunks.py --task-name LEMBWikimQARetrievalChunked

To reproduce easily, I recommend the bash script below:

#!/bin/bash

# Define an array of task names
names=(LEMBWikimQARetrievalChunked LEMBQMSumRetrievalChunked LEMBNarrativeQARetrievalChunked LEMBSummScreenFDRetrievalChunked)

# Loop over each task name and run the evaluation
for name in "${names[@]}"; do
  echo "$name"
  python3 run_chunked_eval_with_macro_chunks.py --task-name "$name"
done

to run them all at once. The results can then be displayed graphically in a matplotlib plot by running plot_chunk_size_experiments.py.
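
For orientation, the plot is along these lines; the chunk sizes and score values below are placeholders for illustration only, not real results and not the script's actual data-loading logic:

# Illustrative sketch only: dummy numbers, not actual experiment output.
import matplotlib.pyplot as plt

chunk_sizes = [128, 256, 512, 1024]  # hypothetical chunk sizes
scores = {  # placeholder nDCG@10 values per task
    "LEMBWikimQARetrievalChunked": [0.50, 0.55, 0.58, 0.57],
    "LEMBQMSumRetrievalChunked": [0.30, 0.33, 0.35, 0.34],
}

for task, ndcg in scores.items():
    plt.plot(chunk_sizes, ndcg, marker="o", label=task)
plt.xlabel("Chunk size (tokens)")
plt.ylabel("nDCG@10")
plt.legend()
plt.title("Retrieval score vs. chunk size")
plt.show()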

Macro chunking approach vs 'hard' boundary approach with 0 overlap

Similar to the above, this compares macro chunking to non-macro chunking; the experiment file is run_macro_chunking_experiments.py and the plot file is plot_macro_chunking_experiments.py.
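
For intuition only, here is a minimal sketch of the difference between the two strategies; the function names and sizes are made up for illustration and are not the repository's API:

# Illustration of the two chunking strategies compared; names and sizes are assumptions.

def hard_chunks(tokens, chunk_size=512, overlap=0):
    # 'Hard' boundary chunking: each window is later embedded in isolation,
    # so a chunk never sees tokens outside its own boundaries.
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

def macro_chunks(tokens, macro_size=8192, chunk_size=512):
    # Macro chunking (long late chunking): a long macro chunk is encoded in one
    # forward pass, and the token embeddings are then pooled per small chunk,
    # so each small chunk's embedding still reflects the macro-level context.
    for start in range(0, len(tokens), macro_size):
        macro = tokens[start:start + macro_size]
        yield [macro[i:i + chunk_size] for i in range(0, len(macro), chunk_size)]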

Example with Anthropic's contextual retrieval

You can run explanatory_contextual_retrieval.py to see a comparison between Anthropic's contextual retrieval (which manually adds context to each chunk), late chunking, and naive chunking. The comparison runs on a generated document that deliberately omits context in later sentences (using 'Its' instead of the company name). Chunks are compared via cosine similarities between their embeddings, using jina-embeddings-v2-base-en.
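
As a rough sketch of that kind of cosine-similarity check (the document text and query below are invented for illustration, only the model name comes from the description above, and only the naive-chunking case is shown):

# Sketch only: made-up chunks and query; not the actual script's contents.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-en", trust_remote_code=True
)

chunks = [
    "AcmeCorp was founded in 2010.",
    "Its revenue grew by 50% last year.",  # 'Its' is ambiguous without context
]
query = "How much did AcmeCorp's revenue grow?"

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

chunk_embeddings = model.encode(chunks)    # each chunk embedded in isolation
query_embedding = model.encode([query])[0]

for chunk, emb in zip(chunks, chunk_embeddings):
    print(f"{cosine(query_embedding, emb):.3f}  {chunk}")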

dannyjameswilliams marked this pull request as ready for review on September 26, 2024 at 19:47.
guenthermi (Member) left a comment:


Thank you for contributing all the code; I added some comments. It would be nice if you could address them before we merge this.

Resolved review comments (outdated) on: chunked_pooling/chunked_eval_tasks.py, chunked_pooling/chunking.py, plot_chunk_size_experiments.py, plot_macro_chunking_experiments.py
# overwrite_results=True,
# batch_size=BATCH_SIZE,
# encode_kwargs={'batch_size': BATCH_SIZE},
# )
guenthermi (Member):

I think this script needs a bit of a cleanup; maybe we can also integrate it into the main script.

dannyjameswilliams (Contributor, Author):

I have integrated it into the main script and also renamed macro chunking to long late chunking. Long late chunking is off by default, but it is enabled when defining e.g. --long-late-chunking-embed-size 8192. I hope this is what you meant.
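
A hypothetical invocation (assuming the flag was added to the main run_chunked_eval.py script; the exact CLI is not shown in this thread) might look like:

python3 run_chunked_eval.py --task-name LEMBWikimQARetrievalChunked --long-late-chunking-embed-size 8192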

Resolved review comment (outdated) on: run_macro_chunking_experiments.py
dannyjameswilliams (Contributor, Author):

All comments hopefully addressed.

guenthermi (Member) left a comment:

just added some minor comments

Comment on lines 97 to 100
# self.llm = pipeline(
# "text-generation", model=llm_name, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto",
# max_length = 1000
# )
guenthermi (Member):

Suggested change (remove the commented-out lines):
# self.llm = pipeline(
# "text-generation", model=llm_name, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto",
# max_length = 1000
# )

# to late chunking to see if the similarities are similar (which they appear to be)
#
# pip requirements:
# accelerate?
guenthermi (Member):

Can you add this to the pyproject.toml if necessary?

""".strip().replace("\n", "")


# llm_model_name = "microsoft/Phi-3.5-mini-instruct"
guenthermi (Member):

Suggested change (remove the commented-out line):
# llm_model_name = "microsoft/Phi-3.5-mini-instruct"

guenthermi merged commit 7c2bc57 into jina-ai:main on Oct 2, 2024.