[DERCBOT-1173] RAG Evaluation - langfuse

theopenconversationkit · Nov 6, 2024 · 3c14892
1 parent 29de731
commit 3c14892

Showing 11 changed files with 1,593 additions and 1,216 deletions.
diff --git a/gen-ai/orchestrator-server/src/main/python/server/poetry.lock b/gen-ai/orchestrator-server/src/main/python/server/poetry.lock
diff --git a/gen-ai/orchestrator-server/src/main/python/server/pyproject.toml b/gen-ai/orchestrator-server/src/main/python/server/pyproject.toml
@@ -21,7 +21,7 @@ colorlog = "^6.8.2"
 boto3 = "^1.35.37"
 urllib3 = "^2.2.3"
 jinja2 = "^3.1.4"
-langfuse = "^2.52.0"
+langfuse = "2.36.2"
 httpx-auth-awssigv4 = "^0.1.4"
 langchain-postgres = "^0.0.12"
 google-cloud-secret-manager = "^2.20.2"
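
Note on the hunk above: the `langfuse` constraint goes from a caret range (`^2.52.0`) to an exact pin (`2.36.2`), and the pinned version lies below the previous range, so this is a downgrade. A minimal sketch of the caret semantics follows, assuming plain `X.Y.Z` numeric versions only. The helper names `parse`, `caret_bounds` and `satisfies` are hypothetical and are not part of Poetry.

```python
from __future__ import annotations


def parse(version: str) -> tuple[int, ...]:
    """Turn 'X.Y.Z' into a tuple of ints (no pre-release or build tags)."""
    return tuple(int(part) for part in version.split("."))


def caret_bounds(constraint: str) -> tuple[tuple[int, ...], tuple[int, ...]]:
    """Return (inclusive lower, exclusive upper) bounds for a '^X.Y.Z' constraint."""
    lower = parse(constraint.lstrip("^"))
    # The caret allows changes that keep the leftmost non-zero component fixed.
    pivot = next((i for i, n in enumerate(lower) if n != 0), len(lower) - 1)
    upper = lower[:pivot] + (lower[pivot] + 1,) + (0,) * (len(lower) - pivot - 1)
    return lower, upper


def satisfies(candidate: str, constraint: str) -> bool:
    """Check a candidate version against a caret range or an exact pin."""
    if not constraint.startswith("^"):
        # A bare version is treated as an exact pin.
        return parse(candidate) == parse(constraint)
    lower, upper = caret_bounds(constraint)
    return lower <= parse(candidate) < upper


if __name__ == "__main__":
    print(caret_bounds("^2.52.0"))
    print(satisfies("2.36.2", "^2.52.0"))
    print(satisfies("2.36.2", "2.36.2"))
```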

diff --git a/gen-ai/orchestrator-server/src/main/python/tock-llm-indexing-tools/.env.exemple b/gen-ai/orchestrator-server/src/main/python/tock-llm-indexing-tools/.env.exemple
@@ -0,0 +1,11 @@
+# for LangFuse dataset provider
+LANGFUSE_SECRET_KEY=
+LANGFUSE_PUBLIC_KEY=
+LANGFUSE_HOST=
+
+# for LangSmith dataset_provider
+LANGCHAIN_API_KEY=
+
+# for smarttribune_consumer.py script
+API_KEY=
+API_SECRET=
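
Note on the `.env.exemple` file above: it groups variables by dataset provider. A minimal sketch of a pre-flight check is shown below, assuming the variables are read from the process environment. The mapping `REQUIRED_VARS` and the function `missing_env_vars` are hypothetical helpers for illustration and do not exist in the commit.

```python
from __future__ import annotations

import os
from typing import Mapping

# Variable names are taken from .env.exemple; the grouping follows its comments.
REQUIRED_VARS: dict[str, tuple[str, ...]] = {
    "langfuse": ("LANGFUSE_SECRET_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_HOST"),
    "langsmith": ("LANGCHAIN_API_KEY",),
}


def missing_env_vars(provider: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the variables that are unset or empty for the given provider."""
    try:
        required = REQUIRED_VARS[provider]
    except KeyError:
        raise ValueError(f"unknown dataset provider: {provider!r}") from None
    # An empty value (as shipped in the example file) counts as missing.
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    for provider in REQUIRED_VARS:
        missing = missing_env_vars(provider)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{provider}: {status}")
```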
diff --git a/gen-ai/orchestrator-server/src/main/python/tock-llm-indexing-tools/README.md b/gen-ai/orchestrator-server/src/main/python/tock-llm-indexing-tools/README.md
@@ -205,12 +205,12 @@ To configure the default vector store, you can use the following environment var
 
 ### generate_dataset.py
 
-Generates a testing dataset based on an input file. The input file should have the correct format (see generate_datset_input.xlsx for sample). The generated dataset can be saved on filesystem, using the --csv-output option, on langsmith, using the --langsmith-dataset-name option, or both.
+Generates a testing dataset based on an input file. The input file must follow the expected format (see generate_datset_input.xlsx for a sample). The generated dataset can be saved to the filesystem with the --csv-output option, to LangSmith with the --langsmith-dataset-name option, to LangFuse with the --langfuse-dataset-name option, or to any combination of these.
 
 ```
 Usage:
-    generate_dataset.py [-v] <input_excel> --range=<s> [--csv-output=<path>] [ --langsmith-dataset-name=<name> ] [--locale=<locale>] [--no-answer=<na>]
-    generate_dataset.py [-v] <input_excel> --sheet=<n>... [--csv-output=<path>] [ --langsmith-dataset-name=<name> ] [--locale=<locale>] [--no-answer=<na>]
+    generate_dataset.py [-v] <input_excel> --range=<s> [--csv-output=<path>] [ --langsmith-dataset-name=<name> ] [ --langfuse-dataset-name=<name> ] [--locale=<locale>] [--no-answer=<na>]
+    generate_dataset.py [-v] <input_excel> --sheet=<n>... [--csv-output=<path>] [ --langsmith-dataset-name=<name> ] [ --langfuse-dataset-name=<name> ] [--locale=<locale>] [--no-answer=<na>]
 
 Arguments:
     input_excel path to the input excel file
@@ -220,22 +220,22 @@ Options:
     --sheet=<n>                     Sheet numbers to be parsed. Indices are 0-indexed.
     --csv-output=<path>             Output path of csv file to be generated.
     --langsmith-dataset-name=<name> Name of the dataset to be saved on langsmith.
+    --langfuse-dataset-name=<name>  Name of the dataset to be saved on langfuse.
     --locale=<locale>               Locale to be included in de dataset. [default: French]
     --no-answer=<na>                Label of no_answer to be included in the dataset. [default: NO_RAG_SENTENCE]
     -h --help                       Show this screen
     --version                       Show version
     -v                              Verbose output for debugging (without this option, script will be silent but for errors)
-
-Generates a testing dataset based on an input file. The input file should have the correct format (see generate_datset_input.xlsx for sample). The generated dataset can be saved on filesystem, using the --csv-output option, on langsmith, using the --langsmith-dataset-name option, or both.
+Generates a testing dataset based on an input file. The input file must follow the expected format (see generate_datset_input.xlsx for a sample). The generated dataset can be saved to the filesystem with the --csv-output option, to LangSmith with the --langsmith-dataset-name option, to LangFuse with the --langfuse-dataset-name option, or to any combination of these.
 ```
 
 ### rag_testing_tool.py
 
-Retrieval-Augmented Generation (RAG) endpoint settings testing tool based on LangSmith's SDK: runs a specific RAG Settings configuration against a reference dataset.
+Retrieval-Augmented Generation (RAG) endpoint settings testing tool based on LangSmith's or LangFuse's SDK: runs a specific RAG Settings configuration against a reference dataset.
 
 ```
 Usage:
-    rag_testing_tool.py [-v] <rag_query> <dataset_name> <test_name> [<delay>]
+    rag_testing_tool.py [-v] <rag_query> <dataset_provider> <dataset_name> <test_name> [<delay>]
     rag_testing_tool.py -h | --help
     rag_testing_tool.py --version
 
@@ -245,6 +245,7 @@ Arguments:
                     provider, indexation session's unique id, and 'k', i.e. nb
                     of retrieved docs (question and chat history are ignored,
                     as they will come from the dataset)
+    dataset_provider the dataset provider (langsmith or langfuse)
     dataset_name    the reference dataset name
     test_name       name of the test run
 
@@ -256,7 +257,7 @@ Options:
                 be silent but for errors)
 ```
 
-Build a RAG (Lang)chain from the RAG Query and runs it against the provided LangSmith dataset. The chain is created anew for each entry of the dataset, and if a delay is provided each chain creation will be delayed accordingly.
+Builds a RAG (Lang)chain from the RAG Query and runs it against the provided LangSmith or LangFuse dataset. The chain is created anew for each entry of the dataset; if a delay is provided, each chain creation is delayed accordingly.
 ### export_run_results.py
 
 Export a LangSmith dataset run results, in csv format.
@@ -280,3 +281,27 @@ The exported CSV file will have these columns :
 'Reference input'|'Reference output'|'Response 1'|'Sources 1'|...|'Response N'|'Sources N'
 NB: There will be as many responses as run sessions
 ```
+
+### export_run_results_langfuse.py
+
+Exports the results of one or more LangFuse dataset runs in CSV format.
+
+```
+Usage:
+    export_run_results_langfuse.py [-v] <dataset_name> <runs_names>...
+    export_run_results_langfuse.py -h | --help
+    export_run_results_langfuse.py --version
+
+Arguments:
+    dataset_name    dataset id
+    runs_names      list of session ids
+
+Options:
+    -h --help   Show this screen
+    --version   Show version
+    -v          Verbose output for debugging
+
+The exported CSV file will have these columns :
+'Reference input'|'Reference output'|'Response 1'|'Sources 1'|...|'Response N'|'Sources N'
+NB: There will be as many responses as run sessions
+```
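
Note on the export format documented above: the CSV has two reference columns followed by one `Response`/`Sources` pair per run. A minimal sketch of that layout is shown below, using only the standard library. The functions `build_header` and `rows_to_csv` are hypothetical, and the `|` delimiter is an assumption taken from how the README displays the columns; the actual script may use a different separator.

```python
from __future__ import annotations

import csv
import io
from typing import Iterable, Sequence


def build_header(n_runs: int) -> list[str]:
    """Column names for a dataset compared across n_runs run sessions."""
    header = ["Reference input", "Reference output"]
    for i in range(1, n_runs + 1):
        header += [f"Response {i}", f"Sources {i}"]
    return header


def rows_to_csv(rows: Iterable[Sequence[str]], n_runs: int, delimiter: str = "|") -> str:
    """Serialize rows under the documented header; delimiter is an assumption."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter)
    writer.writerow(build_header(n_runs))
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder values only; real rows would come from the dataset runs.
    sample = [["question", "expected answer", "answer from run 1", "source A"]]
    print(rows_to_csv(sample, n_runs=1))
```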
diff --git a/...rator-server/src/main/python/tock-llm-indexing-tools/docs/rag_testing_tools.png b/...rator-server/src/main/python/tock-llm-indexing-tools/docs/rag_testing_tools.png