Add a notebook demonstrating the use of DPK connector for RAG #740

Open
wants to merge 1 commit into
base: dev
from



@Qiragg Qiragg commented Oct 24, 2024

Why are these changes needed?

These changes will allow us to demonstrate an end-to-end pipeline that starts with the targeted acquisition of crawled content.

This PR will also update the requirements.txt and utils to support certain functions used during the acquisition.

The uploaded notebook demonstrates downloading only the research papers (PDF files) published at NeurIPS 2017, using path_focus and mime_type extraction. The crawled PDFs can then be fed into our RAG pipeline; the remaining steps are unchanged, starting from Step 2.2 of https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/rag/rag_1A_dpk_process_python.ipynb
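To make the two filters concrete, here is a minimal standard-library sketch of what a path-focused crawl with MIME-type filtering does conceptually. It is not the connector's API: the helper names `within_path_focus` and `is_allowed_mime` and the `example.org` URLs are hypothetical, and the assumption is that path focus means "stay under the seed URL's path on the same host".

```python
from urllib.parse import urlparse


def within_path_focus(seed_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url is on the seed's host and under the seed's path.

    Assumption: this mirrors the idea behind a path-focused crawl; the real
    connector may apply different or additional rules.
    """
    seed, cand = urlparse(seed_url), urlparse(candidate_url)
    if seed.netloc.lower() != cand.netloc.lower():
        return False
    # Add a trailing slash so "/papers/2017" does not match "/papers/20177".
    base = seed.path if seed.path.endswith("/") else seed.path + "/"
    return cand.path == seed.path or cand.path.startswith(base)


def is_allowed_mime(content_type: str, allowed=("application/pdf",)) -> bool:
    """Return True if the Content-Type header value names an allowed MIME type."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in allowed


# Hypothetical seed URL, used only for illustration.
SEED = "https://example.org/papers/2017"

if __name__ == "__main__":
    print(within_path_focus(SEED, "https://example.org/papers/2017/file/abc.pdf"))
    print(is_allowed_mime("application/pdf; charset=binary"))
```

With both checks in place, a crawler would follow only links under the seed path and keep only responses whose content type is PDF, which is the behaviour the notebook relies on.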

The only issue I see is that the notebook https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/rag/rag_2B_llamaindex_query.ipynb runs queries on the attention mechanism and the granite model. Papers on those topics are not downloaded in this notebook, but I can change the crawl so that they are. If I make that change and don't limit the number of downloads, about 700 PDFs will be downloaded. I don't think that will cause hiccups in the rest of the RAG steps, but I haven't tested it yet.

Let's use this PR to discuss how we can make this notebook fit with the rest of the RAG pipeline and whether any further changes are needed.

Related issue number (if any).

#739

Added notebook for DPK connector, requirements and utils
@Qiragg
Author

Qiragg commented Oct 24, 2024

I noticed that there is a minor bug in the PR.

This is being executed

    if saved_pages<20:
        print(f"Visited url: {url}")

even when the visited url count is greater than 20. I will fix that in the next commit after I get feedback on the notebook.
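One possible fix, sketched under the assumption that the intent is to log only the first 20 visited URLs: keep separate counters for visited and saved pages and gate the print on the visited count. The class name `CrawlLogger`, its `on_page` method, and the `example.org` URLs are hypothetical and are not taken from the notebook.

```python
class CrawlLogger:
    """Counts visited and saved pages separately and caps the log output."""

    def __init__(self, max_logged: int = 20):
        self.max_logged = max_logged
        self.visited = 0
        self.saved = 0
        self.logged_urls = []

    def on_page(self, url: str, was_saved: bool) -> None:
        """Record one visited page; was_saved says whether it was written to disk."""
        self.visited += 1
        if was_saved:
            self.saved += 1
        # Gate on the visited count, not the saved count, so the log stops
        # after max_logged URLs even if few pages have been saved so far.
        if self.visited <= self.max_logged:
            print(f"Visited url: {url}")
            self.logged_urls.append(url)


# Simulate a crawl that visits 58 pages and saves the last 18 of them.
logger = CrawlLogger(max_logged=20)
for i in range(58):
    logger.on_page(f"https://example.org/page/{i}", was_saved=(i >= 40))
```

In this simulation the saved count never reaches 20, so a check on `saved_pages` alone would print all 58 URLs, while the visited-count check stops after 20.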

@shahrokhDaijavad
Member

Thanks, @Qiragg. Nice job! I tested your branch by running the Notebook and got the expected output: 58 pages were retrieved, and 18 were downloaded to the input subdirectory. Together with the "attention" and "granite" papers, that makes 20, which is consistent with the 20-limit bug you noted above. This input directory is now ready to be used by @sujee's process Notebook. Of course, the top part of the Notebook (the overview section, including the picture) needs to be changed.

Collaborator

@hmtbr hmtbr left a comment


Good job. One minor comment.

@@ -2,7 +2,7 @@

data-prep-toolkit-transforms==0.2.1
data-prep-toolkit-transforms-ray==0.2.1

data-prep-connector==0.2.2.dev1
Collaborator


We may want to use the official release version.

Suggested change
- data-prep-connector==0.2.2.dev1
+ data-prep-connector==0.2.2

@sujee
Contributor

sujee commented Oct 25, 2024

My feedback is:

  • This is a great utility!
  • Move it out of the RAG example (I prefer to keep that example simple).
  • Release it as another example that downloads HTML files and feeds them to the html2pq pipeline.

@shahrokhDaijavad
Member

OK, @sujee. I appreciate that you want to keep the RAG example simple. @Qiragg, we can discuss what to call the example, where to put it (parallel to the RAG directory), and what to add (pdf2parquet or html2parquet alone, or some other transform as well) to do something meaningful after the crawl.


4 participants