
[DERCBOT-1168] Indexing improvements #1766

Open
assouktim wants to merge 3 commits into master

Conversation

assouktim (Contributor)

No description provided.

Benvii (Member) left a comment:

Thanks for this PR.

@@ -162,7 +162,6 @@ Documents will be indexed in OpenSearch DB under index_name index (index_name sh
| id | a uuid for each document (one per line in the input file) |
| chunk | the nb of the chunk if the original document was splitted: 'n/N' |
| title | the 'title' column from original input CSV |
| url | the 'url' column from original input CSV |
Benvii (Member):

Should this be replaced by source, rather than just removed?

@@ -29,39 +28,42 @@
vector_store_json_config path to a vector store configuration file (JSON format)
(shall describe settings for one of OpenSearch or PGVector store)
chunks_size size of the embedded chunks of documents

ignore_source To ignore source
Benvii (Member):

Maybe add that sources should be valid URLs.

Suggested change
ignore_source To ignore source
ignore_source To ignore source, useful if sources aren't valid URLs

df['source'] = df['source'].replace('UNKNOWN', None)
loader = DataFrameLoader(df, page_content_column='text')
if bool(args['<ignore_source>']):
df_filtered['source'] = None
Benvii (Member):

With this applied to all PDF chunks, we will completely lose the file name / location each chunk came from, which will make any debugging / analysis of RAG traces a real nightmare.

It should at least be kept in the metadata.
Do you have an explanation for why we couldn't keep a file path as a source?
Is it because of the AnyUrl here?

The AnyUrl type is based on the URL rust crate; it supports file URLs (but only absolute ones). For instance, the following example works:

from pydantic import AnyUrl

# absolute file:// URLs pass AnyUrl validation in pydantic v2
file_url = AnyUrl('file:///tmp/ah.gif')

Why not keep the URL using the file scheme? If needed, we could fix Goulven's original PDF parsing tool script.
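
A minimal sketch of the metadata idea, assuming the df_filtered frame and docopt args from this PR; the source_path column name is hypothetical, not code from this PR. It relies on langchain's DataFrameLoader copying every column other than page_content_column into each document's metadata:

if bool(args['<ignore_source>']):
    # hypothetical: preserve the original file name / location in a metadata
    # column for RAG trace debugging before clearing the source field
    df_filtered['source_path'] = df_filtered['source']
    df_filtered['source'] = None
loader = DataFrameLoader(df_filtered, page_content_column='text')
# each loaded document now carries source_path in its metadata even when
# the source itself is ignored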

assouktim (Contributor, author):

I don't know if you remember, but we discussed the fact that the PDF URLs point to Goulven's personal folder, and we can't consider that a valid link for end users, since they don't have access to that path. So we decided to remove them and treat the PDF documents as unsourced.

I see two things:

  • Yes, I'm in favor of keeping this information in the metadata.
  • We need to modify the code that processes the PDFs so we have the Google Drive URL of each PDF, which can be exposed to the end user.

assouktim (Contributor, author):

And yes, the source can be a URI (file:///tmp/file.pdf); this flag lets you choose whether the source is ignored during indexing.

if bool(args['<ignore_source>']):
df_filtered['source'] = None
else:
df_filtered['source'] = df_filtered['source'].fillna('UNKNOWN')
Benvii (Member):

Why not use fillna(value=None) directly? It seems possible according to its documentation, as None is the default value.

Suggested change
df_filtered['source'] = df_filtered['source'].fillna('UNKNOWN')
df_filtered['source'] = df_filtered['source'].fillna(value=None)

assouktim (Contributor, author):

No, I can't do that. This line of code no longer exists.

@@ -29,39 +28,42 @@
vector_store_json_config path to a vector store configuration file (JSON format)
(shall describe settings for one of OpenSearch or PGVector store)
chunks_size size of the embedded chunks of documents

ignore_source To ignore source
TODO MASS
Benvii (Member):

What is this for? The doc seems OK; maybe you can remove it.

Suggested change
TODO MASS

vector_store = vector_store_factory.get_vector_store()

await embedding_and_indexing(splitted_docs, vector_store)

# Return indexing details
return index_name, session_uuid, len(docs), len(splitted_docs)
return IndexingDetails(
Benvii (Member):

Good to have a clean format. It could be used to output this script's result as JSON to an S3 bucket, for instance, when the ingestion is finished, so that the pipeline could easily fetch the indexing session ID.
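
A minimal sketch of what that could look like; the field names are inferred from the old tuple return (index_name, session_uuid, len(docs), len(splitted_docs)), and the dataclass body and JSON step are assumptions, not code from this PR:

from dataclasses import asdict, dataclass
import json

@dataclass
class IndexingDetails:
    # hypothetical fields, inferred from the old tuple return
    index_name: str
    session_uuid: str
    documents_count: int
    chunks_count: int

details = IndexingDetails('my-index', 'some-session-uuid', 120, 480)
print(json.dumps(asdict(details)))  # a payload the pipeline could fetch from S3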

return await index_documents(args)

if __name__ == '__main__':
# Parse command-line arguments
cli_args = docopt(__doc__, version='Webscraper 0.1.0')
Benvii (Member):

Suggested change
cli_args = docopt(__doc__, version='Webscraper 0.1.0')
cli_args = docopt(__doc__, version='Document Indexing tool')

log_dir = Path('logs')
log_dir.mkdir(exist_ok=True)

log_file_name = log_dir / f"index_documents_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
Benvii (Member):

It should be documented in the README.md that logs are now written to this folder. Thanks for adding it 👍️
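
For reference, a minimal sketch of such a setup, assuming the script relies on the standard logging module; the handler configuration below is an assumption, not code from this PR:

import logging
from datetime import datetime
from pathlib import Path

# create the logs folder next to the script, as in this PR
log_dir = Path('logs')
log_dir.mkdir(exist_ok=True)
log_file_name = log_dir / f"index_documents_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"

# assumption: route INFO-level messages to the timestamped file
logging.basicConfig(filename=log_file_name, level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
logging.info('Indexing session started')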

assouktim changed the title from Indexing improvements to [DERCBOT-1168] Indexing improvements on Nov 12, 2024
assouktim marked this pull request as ready for review on November 12, 2024 at 13:50
Status: In progress
2 participants