Add a notebook demonstrating the use of DPK connector for RAG #740

Qiragg · 2024-10-24T09:18:46Z

Why are these changes needed?

These changes will allow us to demonstrate an end-to-end pipeline starting from the targeted acquisition of crawled content.

This PR will also update the requirements.txt and utils to support certain functions used during the acquisition.

The uploaded notebook demonstrates downloading only the research papers (PDF files) published at NeurIPS 2017, using path_focus and mime_type extraction. The crawled PDFs can then be fed into our RAG pipeline; the remaining steps are unchanged, starting from Step 2.2 of https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/rag/rag_1A_dpk_process_python.ipynb
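The PDF-only filtering described above can be sketched as a download callback. This is a minimal, self-contained illustration and not the notebook's code: the `(url, body, headers)` callback shape, the `make_pdf_saver` helper, and the `max_files` cap are assumptions made for this example.

```python
import os
import tempfile
from urllib.parse import urlparse


def make_pdf_saver(output_dir, max_files=20):
    """Return (callback, saved), where callback stores PDF responses only.

    Assumption: the crawler calls callback(url, body, headers) once per
    downloaded page, with `body` as bytes and `headers` as a str -> str dict.
    """
    os.makedirs(output_dir, exist_ok=True)
    saved = []  # paths of the files written so far

    def on_downloaded(url, body, headers):
        # Header names are case-insensitive, so normalise them before lookup.
        lowered = {k.lower(): v for k, v in headers.items()}
        mime_type = lowered.get("content-type", "").split(";")[0].strip().lower()
        if mime_type != "application/pdf":
            return  # skip HTML index pages and any other content type
        if len(saved) >= max_files:
            return  # respect the (hypothetical) download cap
        name = os.path.basename(urlparse(url).path) or "download.pdf"
        path = os.path.join(output_dir, name)
        with open(path, "wb") as f:
            f.write(body)
        saved.append(path)

    return on_downloaded, saved


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    callback, saved = make_pdf_saver(out_dir, max_files=2)
    callback("https://example.org/paper/a.pdf", b"%PDF-1.4 a",
             {"Content-Type": "application/pdf"})
    callback("https://example.org/paper/index.html", b"<html></html>",
             {"Content-Type": "text/html; charset=utf-8"})
    print(len(saved))  # prints 1
```

Keeping the cap inside the callback means that raising or removing the limit (for example, to fetch all ~700 PDFs) is a one-argument change.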

The only issue I see is that the notebook https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/rag/rag_2B_llamaindex_query.ipynb runs queries on the attention mechanism and the granite model. The papers pertaining to those are not downloaded in this notebook, but I can change the crawl so that they are. If I make that change and don't limit the number of downloads, about 700 PDFs will be downloaded. I don't think that will cause hiccups during the rest of the RAG steps, but I haven't tested it yet.

Let's use this PR to discuss how this notebook can fit with the rest of the RAG pipeline and whether any further changes are needed.

Related issue number (if any).

#739

Added notebook for DPK connector, requirements and utils

Qiragg · 2024-10-24T09:25:56Z

I noticed that there is a minor bug in the PR.

This is being executed

    if saved_pages<20:
        print(f"Visited url: {url}")

even when the visited URL count is greater than 20. I will fix that in the next commit, after I get feedback on the notebook.
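One possible shape for the fix, sketched under the assumption that the log message should be capped by the number of visited URLs rather than by the number of saved pages. The names `CrawlStats`, `visited_pages`, and `MAX_LOGGED_URLS` are hypothetical and do not come from the PR.

```python
MAX_LOGGED_URLS = 20  # hypothetical cap on how many visited URLs get printed


class CrawlStats:
    """Track visited and saved counts separately so each gates its own logic."""

    def __init__(self):
        self.visited_pages = 0
        self.saved_pages = 0
        self.logged = []  # URLs that were actually printed

    def on_visit(self, url, was_saved):
        self.visited_pages += 1
        if was_saved:
            self.saved_pages += 1
        # Gate the message on the visited counter, not on saved_pages.
        if self.visited_pages <= MAX_LOGGED_URLS:
            print(f"Visited url: {url}")
            self.logged.append(url)
```

With a single shared counter, the message keeps printing for as long as fewer than 20 pages have been saved, no matter how many URLs have been visited; separate counters avoid that coupling.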

shahrokhDaijavad · 2024-10-24T16:33:04Z

Thanks, @Qiragg. Nice job! I tested your branch by running the notebook and got the expected output: 58 pages were retrieved, and 18 were downloaded to the input subdirectory. Together with the "attention" and "granite" papers, it is the 20-limit bug that you noted above. This input directory is now ready to be used by @sujee's process notebook. Of course, the top part of the notebook (the overview section, including the picture) needs to be changed.

hmtbr

Good job. One minor comment.

hmtbr · 2024-10-25T01:06:55Z

examples/notebooks/rag/requirements.txt

@@ -2,7 +2,7 @@

 data-prep-toolkit-transforms==0.2.1
 data-prep-toolkit-transforms-ray==0.2.1
-
+data-prep-connector==0.2.2.dev1


We may want to use the official release version.

Suggested change

-data-prep-connector==0.2.2.dev1
+data-prep-connector==0.2.2
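To catch pins like this before merging, a small check can scan a requirements file for development or pre-release versions. The sketch below is illustrative only: `find_prerelease_pins` is a hypothetical helper, and the regular expressions cover only the common PEP 440 suffixes (`.devN`, `aN`, `bN`, `rcN`) on `==` pins.

```python
import re

# Matches "name==version" at the start of a line (leading whitespace allowed).
_PIN = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*==\s*([^\s;#]+)")
# Matches a version that ends in a dev, alpha, beta, or release-candidate suffix.
_PRERELEASE = re.compile(r"(\.?dev\d*|(a|b|rc)\d+)$", re.IGNORECASE)


def find_prerelease_pins(requirements_text):
    """Return (package, version) pairs whose pinned version looks like a pre-release."""
    flagged = []
    for line in requirements_text.splitlines():
        match = _PIN.match(line)
        if match and _PRERELEASE.search(match.group(2)):
            flagged.append((match.group(1), match.group(2)))
    return flagged
```

Run against the three pins in the diff above, this would flag only `data-prep-connector==0.2.2.dev1`.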

sujee · 2024-10-25T03:55:37Z

My feedback is:

- This is a great utility!
- Move it out of the RAG example (I prefer to keep that example simple).
- Release it as another example that downloads HTML files and feeds them to the html2pq pipeline.

shahrokhDaijavad · 2024-10-25T15:26:58Z

OK, @sujee. I appreciate that you want to keep the RAG example simple. @Qiragg We can discuss what to call the example, where to put it (parallel to the RAG directory), and what to add (pdf2parquet or html2parquet alone, or some other transform as well) to do something meaningful after the crawl.

Add dpk connector

c66c579

Added notebook for DPK connector, requirements and utils

hmtbr reviewed Oct 25, 2024
