[DERCBOT-1168] Indexing improvements #1766
base: master
Conversation
Thanks for this PR
@@ -162,7 +162,6 @@ Documents will be indexed in OpenSearch DB under index_name index (index_name sh
  | id    | a uuid for each document (one per line in the input file)        |
  | chunk | the nb of the chunk if the original document was splitted: 'n/N' |
  | title | the 'title' column from original input CSV                       |
- | url   | the 'url' column from original input CSV                         |
Should it be replaced by source, rather than just removed?
@@ -29,39 +28,42 @@
      vector_store_json_config   path to a vector store configuration file (JSON format)
                                 (shall describe settings for one of OpenSearch or PGVector store)
      chunks_size                size of the embedded chunks of documents
+     ignore_source              To ignore source
Maybe add that sources should be valid URLs.
- ignore_source              To ignore source
+ ignore_source              To ignore source, useful if sources aren't valid URLs
df['source'] = df['source'].replace('UNKNOWN', None)
loader = DataFrameLoader(df, page_content_column='text')
if bool(args['<ignore_source>']):
    df_filtered['source'] = None
Having this for all PDF chunks, we will completely lose the file name / location the chunk came from, which will make any debugging / analysis of RAG traces a real nightmare.
It should at least be kept in the metadata.
Do you have an explanation why we couldn't keep a file path as a source?
Is it because of the AnyUrl here?
The AnyUrl type is based on the URL Rust crate; it supports file URLs (but only absolute URLs). For instance, the following example works:
from pydantic import AnyUrl
file_url = AnyUrl('file:///tmp/ah.gif')
Why not keep the URL using the file scheme? If needed, we could fix Goulven's original pdf parsing tool script.
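A minimal sketch of that idea, assuming the paths are absolute (the report.pdf path is hypothetical): pathlib's as_uri() produces exactly the file:// form that AnyUrl accepts.

```python
from pathlib import Path
from pydantic import AnyUrl

# Hypothetical example path; Path.as_uri() requires an absolute path.
pdf_path = Path('/tmp/report.pdf')
source = AnyUrl(pdf_path.as_uri())  # file:///tmp/report.pdf
print(source)
```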
I don't know if you remember, but we discussed the fact that the pdf urls point to Goulven's personal folder, and we can't consider that a valid link for end users, since they don't have access to that path. So we decided to remove them and record the pdf documents as unsourced.
I see two things:
- Yes, I'm in favor of keeping this information in the metadata (see the sketch below).
- We need to modify the code that processes the pdf to have the Google Drive url of the PDF, which can be given/exposed to the end user.
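A minimal sketch of keeping the path in metadata while clearing the user-facing source; the source_path column name and the import path are assumptions, not this PR's actual code:

```python
from langchain_community.document_loaders import DataFrameLoader

# Assumed column name: preserve the original location for RAG-trace
# debugging, then clear the user-facing source.
df['source_path'] = df['source']
df['source'] = None

# DataFrameLoader copies every non-content column, including source_path,
# into each Document's metadata.
loader = DataFrameLoader(df, page_content_column='text')
docs = loader.load()
```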
And yes, the source can be a URI (file:///tmp/file.pdf); this flag lets you choose whether or not to ignore the source during indexing.
if bool(args['<ignore_source>']):
    df_filtered['source'] = None
else:
    df_filtered['source'] = df_filtered['source'].fillna('UNKNOWN')
Why not use fillna(value=None) directly? It seems possible according to its documentation, since None is the default value:
- df_filtered['source'] = df_filtered['source'].fillna('UNKNOWN')
+ df_filtered['source'] = df_filtered['source'].fillna(value=None)
No, I can't do that. This line of code no longer exists.
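For reference, a minimal sketch of the behavior in question (the column is just an example): pandas treats fillna(value=None) as "no fill value given" and raises a ValueError, so mapping NaN to None needs replace instead.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', np.nan])
# s.fillna(value=None) raises a ValueError, because None is fillna's
# "not provided" sentinel rather than a usable fill value.
s = s.replace({np.nan: None})
print(s.tolist())  # ['a', None]
```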
@@ -29,39 +28,42 @@
      vector_store_json_config   path to a vector store configuration file (JSON format)
                                 (shall describe settings for one of OpenSearch or PGVector store)
      chunks_size                size of the embedded chunks of documents
+     ignore_source              To ignore source
+     TODO MASS
What for? The doc seems OK; maybe you can just remove it:
- TODO MASS
    vector_store = vector_store_factory.get_vector_store()

    await embedding_and_indexing(splitted_docs, vector_store)

    # Return indexing details
-   return index_name, session_uuid, len(docs), len(splitted_docs)
+   return IndexingDetails(
Good to have a clean format. It could be used to output the result of this script as JSON to an S3 bucket, for instance, once the ingestion is finished, so that the pipeline could easily fetch the indexing session ID.
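A rough sketch of what that could look like; the IndexingDetails field names, the bucket, and the key layout are all assumptions, not part of this PR:

```python
import json

import boto3


def publish_indexing_report(details, bucket: str = 'rag-ingestion-reports') -> None:
    """Upload the indexing session summary as JSON for downstream pipelines."""
    payload = {
        'index_name': details.index_name,          # assumed field names
        'session_uuid': str(details.session_uuid),
        'documents_count': details.documents_count,
        'chunks_count': details.chunks_count,
    }
    boto3.client('s3').put_object(
        Bucket=bucket,
        Key=f"indexing/{payload['session_uuid']}.json",
        Body=json.dumps(payload).encode('utf-8'),
        ContentType='application/json',
    )
```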
    return await index_documents(args)


if __name__ == '__main__':
    # Parse command-line arguments
    cli_args = docopt(__doc__, version='Webscraper 0.1.0')
- cli_args = docopt(__doc__, version='Webscraper 0.1.0')
+ cli_args = docopt(__doc__, version='Document Indexing tool')
log_dir = Path('logs')
log_dir.mkdir(exist_ok=True)

log_file_name = log_dir / f"index_documents_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
It should be documented in the README.md that we now have logs output in this folder. Thanks for adding it 👍️
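A possible README addition (a draft; the script name is taken from the log file pattern above):

```markdown
## Logs

Each run writes a timestamped log file to the `logs/` folder, e.g.
`logs/index_documents_20240101_120000.log`. The folder is created
automatically if it does not exist.
```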
Force-pushed from 2e769e4 to 29de731