Crawler transform #797

touma-I · 2024-11-13T02:23:29Z

Why are these changes needed?

Implement crawler transforms using the dpi-connector API. This is based on the work done by the data sift, but also adds a CLI in order to integrate with the Python runtime. The implementation uses the new transform layout, with the module name dpk_web2parquet.

Related issue number (if any).

#751

hmtbr

@touma-I Thank you very much for making this change! This simple implementation looks good to me. I added several comments, but most of them are nitpicking.

transforms/universal/web2parquet/dpk_web2parquet/config.py

transforms/universal/web2parquet/dpk_web2parquet/local.py

transforms/universal/web2parquet/dpk_web2parquet/transform.py

hmtbr · 2024-11-14T06:34:17Z

transforms/universal/web2parquet/dpk_web2parquet/transform.py

+        if self.folder:
+            dao=DataAccessLocal(local_config={'output_folder':self.folder,'input_folder':'.'})
+            for x in self.docs:
+                dao.save_file(self.folder+'/'+x['filename'], x['contents'])


Q: How does this DAO handle a file that has the same filename as one already saved in the folder? For example, if we crawl test.com and get test.com/path1/doc.pdf and test.com/path2/doc.pdf, it looks like both have the same filename value: doc.pdf.

@hmtbr I think we are using the full URL path to form the filename (i.e. the filename for test.com/path1/doc.pdf becomes path1_doc.pdf and test.com/path2/doc.pdf becomes path2_doc.pdf). This filename generation is not handled by the DAO but by the transform itself.

Yes, it should work well when we build the filename from the request URL. My concern is the case where we build it from the value of the Content-Disposition header. If the two URLs I gave return the same Content-Disposition header value, it looks like both end up with the same filename. We may also want to add the path info to the filename, the same way we do when there is no Content-Disposition header. Correct me if I'm wrong.

@hmtbr I don't know if we should over-engineer this one. If that situation occurs, we can address it based on a concrete example.
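
For reference, the collision discussed in this thread can be sketched as follows. This is a minimal illustration and not the transform's actual code: the helper names `filename_from_url` and `filename_with_path` are hypothetical, and the `_` separator is assumed from the `path1_doc.pdf` example above. The second helper shows one possible way to keep the path info when the name comes from the Content-Disposition header.

```python
import posixpath
from email.message import Message
from typing import Optional
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Hypothetical helper: join the URL path segments with '_'."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return "_".join(segments) or "index"


def filename_with_path(url: str, content_disposition: Optional[str]) -> str:
    """Hypothetical helper: prefer the Content-Disposition filename,
    but prefix it with the parent path segments of the request URL."""
    if not content_disposition:
        return filename_from_url(url)

    # Parse the header with the standard library instead of a regex.
    msg = Message()
    msg["Content-Disposition"] = content_disposition
    name = msg.get_filename()
    if not name:
        return filename_from_url(url)

    # Drop any directory part the server may have put in the filename.
    name = posixpath.basename(name)

    # Parent segments of the URL path (everything except the last one).
    parents = [s for s in urlparse(url).path.split("/") if s][:-1]
    return "_".join(parents + [name])


header = 'attachment; filename="doc.pdf"'
a = filename_with_path("https://test.com/path1/doc.pdf", header)
b = filename_with_path("https://test.com/path2/doc.pdf", header)
print(a, b)
```

With the path prefix, the two URLs from the example map to different filenames even though the header value is identical. Two URLs with the same parent path and the same header value would still collide, so this only narrows the case raised above.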

transforms/universal/web2parquet/dpk_web2parquet/utils.py

transforms/universal/web2parquet/dpk_web2parquet/local_python.py

transforms/.make.modules

transforms/universal/web2parquet/dpk_web2parquet/transform.py

touma-I added 5 commits November 8, 2024 08:34

first implementation of web2parquet for crawling/downloading from see…

41bed68

…dURLs Signed-off-by: Maroun Touma <[email protected]>

use makefile template

cf516b5

Signed-off-by: Maroun Touma <[email protected]>

complete full implementation and testing with python runtime

acc35cd

Signed-off-by: Maroun Touma <[email protected]>

identified current requirements for web2parquet module

3e05f30

Signed-off-by: Maroun Touma <[email protected]>

relaxed dependencies

5710653

Signed-off-by: Maroun Touma <[email protected]>

touma-I requested a review from hmtbr November 13, 2024 02:23

touma-I added 4 commits November 13, 2024 13:02

added build target

80e4ebe

Signed-off-by: Maroun Touma <[email protected]>

Merge branch 'dev' into crawler-transform

cf20268

added licence block

4dcebb6

Signed-off-by: Maroun Touma <[email protected]>

Merge branch 'dev' into crawler-transform

137d92c

hmtbr reviewed Nov 14, 2024

fix filename issue

d2404f4

Signed-off-by: Maroun Touma <[email protected]>

touma-I requested review from hmtbr and daw3rd November 14, 2024 12:40

touma-I marked this pull request as ready for review November 14, 2024 12:54

hmtbr approved these changes Nov 14, 2024

touma-I added 2 commits November 14, 2024 08:23

generate cicd workflow for new transform

1e810d0

Signed-off-by: Maroun Touma <[email protected]>

build image only if a Dockerfile is defined

fcbcc0a

Signed-off-by: Maroun Touma <[email protected]>

daw3rd requested changes Nov 14, 2024

touma-I added 2 commits November 14, 2024 15:19

Ignore page content as long as we get the right count

b5031c9

Signed-off-by: Maroun Touma <[email protected]>

rename make.cicd.target

9ad3d18

Signed-off-by: Maroun Touma <[email protected]>
