Integration with Vector Databases #1

Open — wants to merge 5 commits into `main`. Showing changes from all commits.
16 changes: 12 additions & 4 deletions SimplerLLM/language/llm.py
@@ -1,7 +1,8 @@
import SimplerLLM.language.llm_providers.openai_llm as openai_llm
import SimplerLLM.language.llm_providers.gemini_llm as gemini_llm
import SimplerLLM.language.llm_providers.anthropic_llm as anthropic_llm
from SimplerLLM.prompts.messages_template import MessagesTemplate
import os
from dotenv import load_dotenv
from SimplerLLM.tools.vector_db import VectorDB
from SimplerLLM.language.llm_providers.openai_llm import generate_response as openai_generate_response
from SimplerLLM.language.llm_providers.openai_llm import generate_response_async as openai_generate_response_async
from enum import Enum


@@ -23,6 +24,8 @@ def __init__(
self.model_name = model_name
self.temperature = temperature
self.top_p = top_p
self.vector_db = VectorDB()

Comment on lines +27 to +28
Review of the LLM class constructor.

The constructor hard-codes a VectorDB instance. Allowing dependency injection makes the class easier to test and more flexible. With `vector_db=None` added to the constructor's parameter list, the assignment becomes:

```diff
-        self.vector_db = VectorDB()
+        self.vector_db = vector_db if vector_db is not None else VectorDB()
```

Committable suggestion — carefully review the code before committing: ensure it accurately replaces the highlighted code, has correct indentation, and is tested.

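To illustrate the dependency-injection pattern the review suggests, here is a minimal, hypothetical sketch. `FakeVectorDB` and `LLMWithInjection` are illustrative stand-ins, not the project's actual classes; in the real `LLM` class the fallback would be `VectorDB()` rather than the fake used here.

```python
class FakeVectorDB:
    """In-memory stand-in for VectorDB, useful in unit tests."""
    def __init__(self):
        self.stored = []

    def store_vectors(self, texts):
        self.stored.extend(texts)

    def query_similar(self, text):
        # Naive similarity: return stored texts sharing a word with the query.
        words = set(text.lower().split())
        return [t for t in self.stored if words & set(t.lower().split())]


class LLMWithInjection:
    """Illustrates accepting an optional vector_db dependency."""
    def __init__(self, vector_db=None):
        # In the real class this would fall back to VectorDB(); here we use
        # the fake so the sketch is self-contained.
        self.vector_db = vector_db if vector_db is not None else FakeVectorDB()

    def store_response_as_vector(self, texts):
        self.vector_db.store_vectors(texts)

    def find_similar_responses(self, text):
        return self.vector_db.query_similar(text)


fake = FakeVectorDB()
llm = LLMWithInjection(vector_db=fake)
llm.store_response_as_vector(["machine learning basics"])
print(llm.find_similar_responses("what is machine learning?"))  # → ['machine learning basics']
```

Because the dependency is injected, a test can verify storage and retrieval without touching a real Chroma database.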

@staticmethod
def create(
@@ -52,6 +55,11 @@ def prepare_params(self, model_name, temperature, top_p):
"temperature": temperature if temperature else self.temperature,
"top_p": top_p if top_p else self.top_p,
}
def store_response_as_vector(self, texts):
self.vector_db.store_vectors(texts)

def find_similar_responses(self, text):
return self.vector_db.query_similar(text)


class OpenAILLM(LLM):
27 changes: 27 additions & 0 deletions SimplerLLM/tools/vector_db.py
@@ -0,0 +1,27 @@
import os
import chromadb
from chromadb.utils import embedding_functions

class VectorDB:
def __init__(self):
persistence_directory = "./chroma_db"
self.client = chromadb.PersistentClient(path=persistence_directory)
self.embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
self.collection = self.client.get_or_create_collection(
name="responses",
embedding_function=self.embedding_function
)
Comment on lines +6 to +13
Review of the VectorDB class constructor.

The constructor initializes the PersistentClient and sets up a collection with an embedding function. The hard-coded database path (`"./chroma_db"`) could be made configurable to support different environments:

```diff
-    def __init__(self):
-        persistence_directory = "./chroma_db"
+    def __init__(self, persistence_directory="./chroma_db"):
         self.client = chromadb.PersistentClient(path=persistence_directory)
```

Committable suggestion was skipped due to low confidence.

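A small sketch of how the configurable path might be resolved. The `CHROMA_DB_PATH` environment-variable name and the helper function are assumptions for illustration, not part of the project:

```python
import os

def resolve_persistence_dir(explicit=None, default="./chroma_db"):
    """Pick the persistence directory: explicit argument > CHROMA_DB_PATH env var > default."""
    return explicit or os.environ.get("CHROMA_DB_PATH") or default

print(resolve_persistence_dir("/tmp/my_vectors"))  # → /tmp/my_vectors
```

The explicit argument keeps tests deterministic, while the environment variable lets deployments relocate the database without code changes.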

def store_vectors(self, texts):
self.collection.add(documents=texts, ids=[f"id_{i}" for i in range(len(texts))])

def query_vectors(self, query_text):
results = self.collection.query(query_texts=[query_text], n_results=5)
return results['documents'][0]

def store_response(self, text):
self.collection.add(documents=[text], ids=[f"id_{self.collection.count()}"])
Comment on lines +22 to +23
Review of the store_response method.

The method adds a single document to the collection. Using the collection's count as an ID is risky: two concurrent writers can read the same count and produce colliding IDs. A UUID is a more robust unique identifier (and `import uuid` belongs at the top of the module rather than inside the method):

```diff
     def store_response(self, text):
-        self.collection.add(documents=[text], ids=[f"id_{self.collection.count()}"])
+        self.collection.add(documents=[text], ids=[str(uuid.uuid4())])
```

Committable suggestion — carefully review the code before committing: ensure it accurately replaces the highlighted code, has correct indentation, and is tested.

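A quick sketch of the race the review warns about, and why `uuid4` avoids it. The writer variables are illustrative:

```python
import uuid

# Count-based IDs: two writers that both read count == 5 before either
# inserts would each generate "id_5" and collide.
count = 5
writer_a_id = f"id_{count}"
writer_b_id = f"id_{count}"
print(writer_a_id == writer_b_id)  # → True — collision

# uuid4 IDs are generated independently and are unique for practical purposes.
id_a = str(uuid.uuid4())
id_b = str(uuid.uuid4())
print(id_a == id_b)  # → False
```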

def query_similar(self, query_text):
return self.query_vectors(query_text)

Binary files added under `chroma_db/` (including `chroma_db/chroma.sqlite3`) along with two empty files; binary contents not shown.
67 changes: 67 additions & 0 deletions new.py
@@ -0,0 +1,67 @@
from SimplerLLM.language.llm import LLM, LLMProvider
from dotenv import load_dotenv
import os
import time

load_dotenv()

def test_vector_storage_and_retrieval():
llm = LLM(provider=LLMProvider.OPENAI, model_name="gpt-3.5-turbo")

prompts = [
"What is artificial intelligence and how does it differ from human intelligence?",
"Explain the process of machine learning and its key components.",
"Describe the architecture of deep neural networks and their layers.",
"What are the applications of natural language processing in everyday technology?",
"How does computer vision work and what are its real-world applications?",
"Explain the concept of reinforcement learning and its use in robotics.",
"What are the ethical concerns surrounding AI development and deployment?",
"How does transfer learning accelerate AI model development?",
"Describe the differences between supervised, unsupervised, and semi-supervised learning.",
"What is the role of big data in advancing AI capabilities?",
"Explain the concept of explainable AI and why it's important.",
"How do genetic algorithms work in optimization problems?",
"What are the challenges in developing artificial general intelligence (AGI)?",
"Describe the impact of AI on healthcare diagnostics and treatment.",
"How does AI contribute to autonomous vehicle technology?"
]

print("Storing responses as vectors...")
start_time = time.time()
try:
llm.store_response_as_vector(prompts)
except Exception as e:
print("Error occurred:", e)
end_time = time.time()
print(f"Responses stored successfully. Time taken: {end_time - start_time:.2f} seconds")

query_prompts = [
"What are the fundamental principles of AI?",
"How do machines learn from data?",
"Explain the inner workings of neural networks.",
"What are some practical applications of NLP?",
"How is AI changing the automotive industry?",
"What are the moral implications of using AI in decision-making?",
"How is AI transforming the healthcare sector?",
"What are the key differences between AI learning paradigms?",
"How does AI handle complex optimization problems?",
"What are the challenges in making AI systems more transparent?"
]

print("\nQuerying for similar responses:")
for query_prompt in query_prompts:
print(f"\nQuery: {query_prompt}")
start_time = time.time()
similar_responses = llm.find_similar_responses(query_prompt)
end_time = time.time()
print(f"Time taken: {end_time - start_time:.2f} seconds")
print("Similar responses:")
for i, response in enumerate(similar_responses, 1):
print(f"{i}. {response}")

Comment on lines +51 to +61
Add error handling to the querying process.

The loop for querying similar responses is clear and straightforward, but wrapping the lookup in a try/except (and skipping failed queries) would make the test more robust:

```diff
         start_time = time.time()
-        similar_responses = llm.find_similar_responses(query_prompt)
+        try:
+            similar_responses = llm.find_similar_responses(query_prompt)
+        except Exception as e:
+            print("Error occurred:", e)
+            continue
         end_time = time.time()
```

Committable suggestion — carefully review the code before committing: ensure it accurately replaces the highlighted code, has correct indentation, and is tested.
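The try/except/continue pattern proposed above can be sketched in isolation. `flaky_lookup` is a hypothetical stub standing in for `llm.find_similar_responses`:

```python
def flaky_lookup(query):
    """Stub: raises for one query to simulate a backend failure."""
    if query == "bad query":
        raise RuntimeError("backend unavailable")
    return [f"match for {query!r}"]

results = []
for query in ["good query", "bad query", "another query"]:
    try:
        similar = flaky_lookup(query)
    except Exception as e:
        print("Error occurred:", e)
        continue  # skip timing and printing for the failed query
    results.append((query, similar))

print(len(results))  # → 2 — the failing query was skipped, the rest completed
```

The `continue` matters: without it, the loop would fall through and reference a stale (or undefined) `similar_responses` after a failure.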

def main():
print("Starting vector storage and retrieval test...")
test_vector_storage_and_retrieval()

if __name__ == "__main__":
main()
3 changes: 2 additions & 1 deletion requirements.txt
@@ -11,4 +11,5 @@ python_docx==1.1.0
pytube==15.0.0
Requests==2.31.0
youtube_transcript_api==0.6.2

sentence-transformers==3.0.1
chromadb==0.5.3