Crawler transform #797
base: dev
…dURLs Signed-off-by: Maroun Touma <[email protected]>
@touma-I Thank you very much for making this change! This simple implementation looks good to me. I added several comments, but most of them are nitpicks.
if self.folder:
    dao = DataAccessLocal(local_config={'output_folder': self.folder, 'input_folder': '.'})
    for x in self.docs:
        dao.save_file(self.folder + '/' + x['filename'], x['contents'])
Q: How does this DAO handle a file that has the same filename as one already saved in the folder? For example, we crawl test.com and get test.com/path1/doc.pdf and test.com/path2/doc.pdf. It looks like the two have the same filename value: doc.pdf.
@hmtbr I think we are using the full URL path to form the filename (i.e. the filename for test.com/path1/doc.pdf becomes path1_doc.pdf and test.com/path2/doc.pdf becomes path2_doc.pdf). This filename generation is not handled by the DAO but by the transform itself.
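A minimal sketch of the naming scheme described in this comment, for illustration only. `filename_from_url` is a hypothetical helper, not the transform's actual code, and the fallback name `index` for an empty path is an assumption.

```python
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Hypothetical helper: build a filename by joining URL path segments with '_'."""
    # urlparse only separates host from path when a scheme is present,
    # so assume https for bare inputs such as "test.com/path1/doc.pdf".
    if "://" not in url:
        url = "https://" + url
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Assumption: fall back to a fixed name when the URL has no path.
    return "_".join(segments) if segments else "index"


print(filename_from_url("test.com/path1/doc.pdf"))
print(filename_from_url("test.com/path2/doc.pdf"))
```

With this scheme the two example URLs map to path1_doc.pdf and path2_doc.pdf, so they do not overwrite each other in the output folder.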
Yes, it should work well when we build the filename from the request URL. My concern is the filename when we build it from the value of the Content-Disposition header. If the two URLs I gave return the same Content-Disposition header value, it looks like the two end up with the same filename. We may also want to add the path info to the filename, in the same way as in the case without Content-Disposition. Correct me if I'm wrong.
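One way the suggestion above could look, as a rough sketch rather than the transform's actual behavior. `filename_for_response` is a hypothetical helper. Dropping the last URL segment when a Content-Disposition filename is present is an assumption. The header is parsed with the standard library's `email.message.Message.get_filename`.

```python
from email.message import Message
from typing import Optional
from urllib.parse import urlparse


def filename_for_response(url: str, content_disposition: Optional[str] = None) -> str:
    """Hypothetical helper: prefix the header-provided filename with the URL path."""
    segments = [s for s in urlparse(url).path.split("/") if s]

    name = None
    if content_disposition:
        # Let the stdlib parse the header's filename parameter.
        msg = Message()
        msg["Content-Disposition"] = content_disposition
        name = msg.get_filename()

    if name is None:
        # No usable header: fall back to the URL-based name.
        return "_".join(segments) if segments else "index"

    # Assumption: the last URL segment is replaced by the header filename,
    # and the parent segments are kept as a disambiguating prefix.
    return "_".join(segments[:-1] + [name])


header = 'attachment; filename="doc.pdf"'
print(filename_for_response("https://test.com/path1/doc.pdf", header))
print(filename_for_response("https://test.com/path2/doc.pdf", header))
```

Under these assumptions, two URLs that return the same Content-Disposition value still get distinct filenames as long as their paths differ.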
@hmtbr I don't know if we should over-engineer this one. If that situation occurs, we can address it based on a concrete example.
Why are these changes needed?
Implement crawler transforms using the dpi-connector API. This is based on the work done by the data sift, but a CLI also had to be added in order to integrate with the Python runtime. This implementation uses the new layout for transforms, with the module name dpk_web2parquet.
Related issue number (if any).
#751