Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crawler transform #797

Open
wants to merge 14 commits into
base: dev
Choose a base branch
from
Open

Crawler transform #797

wants to merge 14 commits into from

Conversation

touma-I
Copy link
Collaborator

@touma-I touma-I commented Nov 13, 2024

Why are these changes needed?

Implement crawler transforms using the dpi-connector API. This is based on the work done by the data sift but also had to add CLI in order to integrate with python runtime. This implementation uses the new layout for the transform using module name dpk_web2parquet

Related issue number (if any).

#751

@touma-I touma-I requested a review from hmtbr November 13, 2024 02:23
Copy link
Collaborator

@hmtbr hmtbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@touma-I Thank you very much for making this change! This simple implementation looks good to me. I added several comments, but most of them are nitpicking.

transforms/universal/web2parquet/dpk_web2parquet/config.py Outdated Show resolved Hide resolved
transforms/universal/web2parquet/dpk_web2parquet/config.py Outdated Show resolved Hide resolved
transforms/universal/web2parquet/dpk_web2parquet/config.py Outdated Show resolved Hide resolved
transforms/universal/web2parquet/dpk_web2parquet/local.py Outdated Show resolved Hide resolved
if self.folder:
dao=DataAccessLocal(local_config={'output_folder':self.folder,'input_folder':'.'})
for x in self.docs:
dao.save_file(self.folder+'/'+x['filename'], x['contents'])
Copy link
Collaborator

@hmtbr hmtbr Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: How does this DAO handle a file which has the same filename as one saved in the folder before? e.g. We crawl test.com and get test.com/path1/doc.pdf and test.com/path2/doc.pdf. It looks the two have the same filename value: doc.pdf.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hmtbr I think we are using the full URL path to form the filename (i.e. the filename for test.com/path1/doc.pdf becomes path1_doc.pdf and test.com/path2/doc.pdf becomes path2_doc.pdf). This filename generation is not handled by the DAO but by the transform itself.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should work well when we build a filename from the request URL. My concern is the filename when we build it from the value of the Content-Disposition header. If the two URLs I gave return the same Content-Disposition header value, it looks the two have the same filename as a result. We may also want to add the path info to the filename in the same way as the case of no Content-Disposition. Correct me if I'm wrong.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hmtbr I don't know if we should over-engineer this one. If that situation occurs, we can address it based on a concrete example.

transforms/universal/web2parquet/dpk_web2parquet/utils.py Outdated Show resolved Hide resolved
transforms/universal/web2parquet/dpk_web2parquet/utils.py Outdated Show resolved Hide resolved
Signed-off-by: Maroun Touma <[email protected]>
@touma-I touma-I requested review from hmtbr and daw3rd November 14, 2024 12:40
@touma-I touma-I marked this pull request as ready for review November 14, 2024 12:54
transforms/.make.modules Outdated Show resolved Hide resolved
transforms/.make.modules Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants