Faster Memory._add_to_vector_store #1888
Kowalskiexe started this conversation in Ideas · Replies: 1 comment
-
Hi, I'm currently working on a chatbot, so response times are crucial.
I've been looking into the source code of mem0, and I'm curious about one step in Memory._add_to_vector_store.
In short, it makes a single LLM call that decides what to do with the newly extracted memories and assigns each of them an action (roughly the pattern sketched below). In my application this step can easily take over 3 seconds.
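For context, here is a rough paraphrase of that pattern. Everything in it is a placeholder: the `decide_actions` name, the `llm.generate` call, and the action labels are assumptions for illustration, not mem0's actual API.

```python
import json

# Placeholder paraphrase of the single-call pattern described above -- NOT mem0's real code.
# One LLM call sees every extracted memory at once and must emit an action for each,
# so the number of output tokens (and hence latency) grows with the batch size.
def decide_actions(llm, existing_memories: list[str], new_facts: list[str]) -> list[dict]:
    prompt = (
        "Existing memories:\n" + json.dumps(existing_memories) + "\n"
        "New facts:\n" + json.dumps(new_facts) + "\n"
        'For EACH new fact, output JSON: [{"fact": ..., "action": "ADD|UPDATE|DELETE|NONE"}]'
    )
    # llm.generate is a stand-in for whatever client mem0 is configured with
    return json.loads(llm.generate(prompt))
```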
My question / idea is: since LLM response time is roughly proportional to the number of generated tokens, wouldn't this step complete much faster if we made a separate, concurrent LLM call for every extracted memory? As far as I can tell, the memories don't need to be processed sequentially.
Such an approach would consume more input tokens, but the wall-clock time would be capped by the slowest of the concurrent calls, and each call would finish much faster than the single call being made currently, since it generates far fewer output tokens. A sketch of this fan-out is below.
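A minimal sketch of the fan-out idea, assuming an OpenAI-style async client and a placeholder model name (mem0 abstracts over providers, so the real wiring would differ):

```python
import asyncio
from openai import AsyncOpenAI  # any async-capable LLM client works the same way

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set in the environment

async def decide_one(existing: list[str], fact: str) -> str:
    # One small call per extracted memory; the output is a single action word,
    # so each call finishes quickly regardless of how many facts were extracted.
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                "Existing memories:\n" + "\n".join(existing) + "\n"
                f"New fact: {fact}\n"
                "Reply with exactly one action: ADD, UPDATE, DELETE, or NONE."
            ),
        }],
        max_tokens=4,
    )
    return resp.choices[0].message.content.strip()

async def decide_all(existing: list[str], facts: list[str]) -> list[str]:
    # Fire all calls concurrently; asyncio.gather preserves input order,
    # so each returned action still lines up with its fact. Total latency
    # is bounded by the slowest single call rather than the sum of all calls.
    return await asyncio.gather(*(decide_one(existing, f) for f in facts))

# Example usage:
# actions = asyncio.run(decide_all(existing_memories, extracted_facts))
```

The trade-off is re-sending the existing-memory context in every call (more input tokens) in exchange for output lengths that no longer scale with the number of extracted facts.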
-
Great suggestion @Kowalskiexe. I will look into this.