
[DERCBOT-1168] Indexing improvements #1766

Open
assouktim wants to merge 3 commits into master

Conversation

assouktim (Contributor)

No description provided.

Benvii (Member) left a comment:

Thanks for this PR.

@@ -162,7 +162,6 @@ Documents will be indexed in OpenSearch DB under index_name index (index_name sh
| id | a uuid for each document (one per line in the input file) |
| chunk | the nb of the chunk if the original document was splitted: 'n/N' |
| title | the 'title' column from original input CSV |
| url | the 'url' column from original input CSV |
Benvii (Member):

Should this be replaced by source, rather than just removed?

@@ -29,39 +28,42 @@
vector_store_json_config path to a vector store configuration file (JSON format)
(shall describe settings for one of OpenSearch or PGVector store)
chunks_size size of the embedded chunks of documents

ignore_source To ignore source
Benvii (Member):

Maybe add that sources should be valid URLs.

Suggested change
ignore_source To ignore source
ignore_source To ignore source, useful if sources aren't valid URLs

df['source'] = df['source'].replace('UNKNOWN', None)
loader = DataFrameLoader(df, page_content_column='text')
if bool(args['<ignore_source>']):
df_filtered['source'] = None
Benvii (Member):

With this applied to all PDF chunks, we will completely lose the file name / location each chunk came from, which will make any debugging / analysis of RAG traces a real nightmare.

It should at least be kept in the metadata.
Do you have an explanation for why we couldn't keep a file path as a source?
Is it because of the AnyUrl here?

The AnyUrl type is based on the URL rust crate; it supports file URLs (but only absolute ones). For instance, the following example works:

from pydantic import AnyUrl

# absolute file:// URLs pass AnyUrl validation in pydantic v2
file_url = AnyUrl('file:///tmp/ah.gif')

Why not keep the URL using the file scheme? If needed, we could fix Goulven's original PDF parsing tool script.
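
A minimal sketch of the metadata idea, assuming the df_filtered frame and docopt args from this PR; the source_path column name is hypothetical, not code from this PR. It relies on langchain's DataFrameLoader copying every column other than page_content_column into each document's metadata:

if bool(args['<ignore_source>']):
    # hypothetical: preserve the original file name / location in a metadata
    # column for RAG trace debugging before clearing the source field
    df_filtered['source_path'] = df_filtered['source']
    df_filtered['source'] = None
loader = DataFrameLoader(df_filtered, page_content_column='text')
# each loaded document now carries source_path in its metadata even when
# the source itself is ignored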

assouktim (Contributor, author):

I don't know if you remember, but we discussed the fact that the PDF URLs point to Goulven's personal folder, and we can't consider that a valid link for end users, since they don't have access to that path. So we decided to remove them and treat the PDF documents as unsourced.

I see two things:

  • Yes, I'm in favor of keeping this information in the metadata.
  • We need to modify the code that processes the PDFs so we have the Google Drive URL of each PDF, which can be exposed to the end user.

assouktim (Contributor, author):

And yes, the source can be a URI (file:///tmp/file.pdf); this flag lets you choose whether the source is ignored during indexing.

if bool(args['<ignore_source>']):
df_filtered['source'] = None
else:
df_filtered['source'] = df_filtered['source'].fillna('UNKNOWN')
Benvii (Member):

Why not use fillna(value=None) directly? It seems possible according to its documentation, as None is the default value.

Suggested change
df_filtered['source'] = df_filtered['source'].fillna('UNKNOWN')
df_filtered['source'] = df_filtered['source'].fillna(value=None)

assouktim (Contributor, author):

No, I can't do that. This line of code no longer exists.

@@ -29,39 +28,42 @@
vector_store_json_config path to a vector store configuration file (JSON format)
(shall describe settings for one of OpenSearch or PGVector store)
chunks_size size of the embedded chunks of documents

ignore_source To ignore source
TODO MASS
Benvii (Member):

What is this for? The doc seems OK; maybe you can remove it.

Suggested change
TODO MASS

vector_store = vector_store_factory.get_vector_store()

await embedding_and_indexing(splitted_docs, vector_store)

# Return indexing details
return index_name, session_uuid, len(docs), len(splitted_docs)
return IndexingDetails(
Benvii (Member):

Good to have a clean format. It could be used to output this script's result as JSON to an S3 bucket, for instance, when the ingestion is finished, so that the pipeline could easily fetch the indexing session ID.
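
A minimal sketch of what that could look like; the field names are inferred from the old tuple return (index_name, session_uuid, len(docs), len(splitted_docs)), and the dataclass body and JSON step are assumptions, not code from this PR:

from dataclasses import asdict, dataclass
import json

@dataclass
class IndexingDetails:
    # hypothetical fields, inferred from the old tuple return
    index_name: str
    session_uuid: str
    documents_count: int
    chunks_count: int

details = IndexingDetails('my-index', 'some-session-uuid', 120, 480)
print(json.dumps(asdict(details)))  # a payload the pipeline could fetch from S3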

return await index_documents(args)

if __name__ == '__main__':
# Parse command-line arguments
cli_args = docopt(__doc__, version='Webscraper 0.1.0')
Benvii (Member):

Suggested change
cli_args = docopt(__doc__, version='Webscraper 0.1.0')
cli_args = docopt(__doc__, version='Document Indexing tool')

log_dir = Path('logs')
log_dir.mkdir(exist_ok=True)

log_file_name = log_dir / f"index_documents_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
Benvii (Member):

It should be documented in the README.md that logs are now written to this folder. Thanks for adding it 👍️
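
For reference, a minimal sketch of such a setup, assuming the script relies on the standard logging module; the handler configuration below is an assumption, not code from this PR:

import logging
from datetime import datetime
from pathlib import Path

# create the logs folder next to the script, as in this PR
log_dir = Path('logs')
log_dir.mkdir(exist_ok=True)
log_file_name = log_dir / f"index_documents_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"

# assumption: route INFO-level messages to the timestamped file
logging.basicConfig(filename=log_file_name, level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
logging.info('Indexing session started')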

assouktim changed the title from Indexing improvements to [DERCBOT-1168] Indexing improvements on Nov 12, 2024
assouktim marked this pull request as ready for review on November 12, 2024 at 13:50
Status: In progress
2 participants