
Scraping URLs from JavaScript and CSS #142

Open
marchellodev opened this issue Jul 7, 2021 · 3 comments

Comments

@marchellodev
Contributor

Currently, the main issue preventing suckit from downloading large websites with a lot of JavaScript is its inability to scrape links from JS and CSS in addition to HTML.

I think JS (as well as CSS) is too complicated to parse fully, so we could just use regular expressions to find URLs, add them to the queue, and then rewrite them to the local format.
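As a rough illustration of the extraction half of this idea, here is a std-only sketch (`extract_css_urls` is hypothetical, not part of suckit; a real implementation might use the `regex` crate instead of manual scanning):

```rust
// Hypothetical sketch (not suckit code): extract `url(...)` references from
// CSS text by plain string scanning over the source.
fn extract_css_urls(css: &str) -> Vec<String> {
    let mut urls = Vec::new();
    let mut rest = css;
    while let Some(start) = rest.find("url(") {
        let after = &rest[start + 4..];
        let Some(end) = after.find(')') else { break };
        // Strip surrounding whitespace and optional quotes around the URL.
        let raw = after[..end].trim().trim_matches(|c: char| c == '"' || c == '\'');
        if !raw.is_empty() {
            urls.push(raw.to_string());
        }
        rest = &after[end + 1..];
    }
    urls
}

fn main() {
    let css = r#"body { background: url("https://example.com/bg.png"); }
                 .icon { background-image: url(/img/icon.svg); }"#;
    println!("{:?}", extract_css_urls(css));
}
```

The same scan could then drive the second half: replacing each match with its local path.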

Thanks for such an amazing piece of software!

Related to #68 #70

@Skallwar
Owner

Skallwar commented Jul 7, 2021

Thanks for reporting this. Regex might be the way to go indeed.
If you are interested to do it, feel free to assign yourself and open a PR 😃

> Thanks for such an amazing piece of software!

Thanks a lot, this means a lot to @CohenArthur and me.

@marchellodev
Contributor Author

@Skallwar I'd love to!

It seems like `find_urls_as_strings()` returns a mutable list of all the URLs; any change to those strings is instantly reflected in the DOM via the kuchiki library. I'm trying to select every `<script>` element and then find URLs inside via regex, but I'm not sure how to do the same thing here, i.e. return a mutable string that stays attached to the DOM.

I think we should refactor the code a bit. Instead of returning all the strings, filtering them, and then changing them to the paths of the downloaded files, we should probably find a single URL, check whether it belongs to the domain, and change it right away, all in the same method, even before adding the URL to the queue. This would offer much more flexibility for getting URLs not only from HTML but also from CSS, JS, and other file types. What do you think?
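A minimal sketch of that single-pass idea, under the assumption of a hypothetical helper (`rewrite_css_urls` is illustrative, not suckit's actual API): a closure decides per URL whether it belongs to the domain and what local path to substitute, and matched URLs are rewritten and queued in one traversal:

```rust
// Hypothetical sketch (not suckit code): find each `url(...)` reference,
// let the caller's closure decide whether to rewrite it, and collect the
// rewritten URLs for the download queue, all in a single pass.
fn rewrite_css_urls<F>(css: &str, mut map_url: F) -> (String, Vec<String>)
where
    F: FnMut(&str) -> Option<String>, // Some(local path) => rewrite + enqueue
{
    let mut out = String::new();
    let mut queue = Vec::new();
    let mut rest = css;
    while let Some(start) = rest.find("url(") {
        let after = &rest[start + 4..];
        let Some(end) = after.find(')') else { break };
        let raw = after[..end].trim().trim_matches(|c: char| c == '"' || c == '\'');
        out.push_str(&rest[..start + 4]);
        match map_url(raw) {
            Some(local) => {
                queue.push(raw.to_string()); // enqueue for download
                out.push_str(&local);        // rewrite in place
            }
            None => out.push_str(&after[..end]), // foreign domain: keep as-is
        }
        out.push(')');
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    (out, queue)
}

fn main() {
    let css = "a { background: url(https://example.com/a.png); } \
               b { background: url(https://other.org/b.png); }";
    let (rewritten, queue) = rewrite_css_urls(css, |url| {
        url.starts_with("https://example.com/")
            .then(|| url.replace("https://example.com/", "./"))
    });
    println!("{rewritten}");
    println!("queued: {queue:?}");
}
```

The same closure-driven shape could be reused for HTML attributes and JS string literals, with only the scanning step differing per file type.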

@Skallwar
Owner

Skallwar commented Jul 8, 2021

I'm not sure regex is the way to go. It's slow, and matching both absolute and relative URLs will be quite hard.

I think there are some good CSS parsers out there based on Servo: this one or this one

JS might be trickier.
