Determine embedding size with Titan Embedding v2 model #39

Merged
merged 1 commit into langchain-ai:main on May 17, 2024

Conversation

bigbernnn
Contributor

Adds support for Titan Embedding v2 with configurable embedding sizes, based on this blog post. This change lets the user specify the embedding dimension and whether the output vector is normalized.

from langchain_aws.embeddings import BedrockEmbeddings

prompt_data = """Priority should be funding retirement through ROTH/IRA/401K over HSA extra.  
You need to fund your HSA for reasonable and expected medical expenses. """

bedrock_embedding_model_id = "amazon.titan-embed-text-v2:0"

embed_model = BedrockEmbeddings(model_id=bedrock_embedding_model_id)
# The second and third arguments set the embedding dimension and whether the
# returned vector is normalized (Titan v2 otherwise defaults to a normalized
# 1024-dimensional vector).
response = embed_model.embed_documents([prompt_data], 256, True)
response = embed_model.embed_query(prompt_data, 512, True)

Collaborator

@3coins 3coins left a comment

Looks good!

@3coins 3coins merged commit 622756c into langchain-ai:main May 17, 2024
12 checks passed
@rsgrewal-aws

Looks good

@trendafil-gechev

@bigbernnn @3coins Hi all,
I think this change breaks the use of model_kwargs when passed to the BedrockEmbeddings instance. Consider this example:

from langchain_aws import BedrockEmbeddings

model_kwargs = {
    "dimensions": 256,
    "normalize": False
}

embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0",
                               region_name="eu-central-1", model_kwargs=model_kwargs)

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
embedded = embeddings.embed_documents(texts=[text])

print(embedded[0])
print(len(embedded[0]))

This still results in a normalized 1024-dimensional vector, because embed_documents falls back to the default values of its dim and norm parameters and ignores the model_kwargs.
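
A possible interim workaround for direct calls (a sketch only, using the positional dim/norm arguments from the PR example above rather than model_kwargs) is to pass the size and normalization flag explicitly:

from langchain_aws import BedrockEmbeddings

embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0",
                               region_name="eu-central-1")

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

# Pass the dimension and normalization flag directly to embed_documents,
# as in the PR description, instead of relying on model_kwargs.
embedded = embeddings.embed_documents([text], 256, False)
print(len(embedded[0]))  # expected: 256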

This is also problematic when using BedrockEmbeddings with a vector store, e.g.:

from langchain_aws import BedrockEmbeddings
from langchain_postgres.vectorstores import PGVector
import os

DB_USER = os.getenv('PGUSER')
DB_PASSWORD = os.getenv('PGPASSWORD')
DB_HOST = os.getenv('PGHOST')
DB_PORT = os.getenv('PGPORT')
DB_NAME = os.getenv('PGDATABASE')

model_kwargs = {
    "dimensions": 256,
    "normalize": False
}

connection = f"postgresql+psycopg://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}"

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."

embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0",
                               region_name="eu-central-1", model_kwargs=model_kwargs)

pgvector_store = PGVector(
    connection=connection, embeddings=embeddings)

pgvector_store.add_texts(texts=[text])

The add_texts method calls embed_documents under the hood, which again produces a vector whose size and normalization are determined by the dim and norm parameters rather than by the model_kwargs supplied to BedrockEmbeddings.
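
Until the defaults honor model_kwargs, one possible workaround for the vector-store path (a sketch only; it assumes the positional dim/norm signature shown in the PR description, with defaults of 1024 and normalization enabled) is a thin subclass that forwards the values from model_kwargs whenever embed_documents is called without them:

from langchain_aws import BedrockEmbeddings


class ModelKwargsBedrockEmbeddings(BedrockEmbeddings):
    """Hypothetical helper that forwards dimensions/normalize from model_kwargs."""

    def embed_documents(self, texts):
        kwargs = self.model_kwargs or {}
        # Forward the values positionally, matching the call shown in the
        # PR description: embed_documents([text], 256, True).
        return super().embed_documents(
            texts,
            kwargs.get("dimensions", 1024),
            kwargs.get("normalize", True),
        )


embeddings = ModelKwargsBedrockEmbeddings(
    model_id="amazon.titan-embed-text-v2:0",
    region_name="eu-central-1",
    model_kwargs={"dimensions": 256, "normalize": False},
)

# PGVector.add_texts calls embed_documents(texts) with no extra arguments,
# so the subclass supplies them from model_kwargs instead.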
