
Scraping URLs from JavaScript and CSS #142

Open
marchellodev opened this issue Jul 7, 2021 · 3 comments

Comments

@marchellodev
Contributor

Currently, the main issue preventing suckit from downloading large websites with a lot of JavaScript is its inability to scrape links from JS and CSS in addition to HTML.

I think JS (as well as CSS) is too complicated to parse fully, so we could just use regular expressions to find URLs, add them to the queue, and then rewrite them to the local format.
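As a rough illustration of the extraction half of this idea, here is a std-only sketch (`extract_css_urls` is hypothetical, not part of suckit; a real implementation might use the `regex` crate instead of manual scanning):

```rust
// Hypothetical sketch (not suckit code): extract `url(...)` references from
// CSS text by plain string scanning over the source.
fn extract_css_urls(css: &str) -> Vec<String> {
    let mut urls = Vec::new();
    let mut rest = css;
    while let Some(start) = rest.find("url(") {
        let after = &rest[start + 4..];
        let Some(end) = after.find(')') else { break };
        // Strip surrounding whitespace and optional quotes around the URL.
        let raw = after[..end].trim().trim_matches(|c: char| c == '"' || c == '\'');
        if !raw.is_empty() {
            urls.push(raw.to_string());
        }
        rest = &after[end + 1..];
    }
    urls
}

fn main() {
    let css = r#"body { background: url("https://example.com/bg.png"); }
                 .icon { background-image: url(/img/icon.svg); }"#;
    println!("{:?}", extract_css_urls(css));
}
```

The same scan could then drive the second half: replacing each match with its local path.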

Thanks for such an amazing piece of software!

Related to #68 #70

@Skallwar
Owner

Skallwar commented Jul 7, 2021

Thanks for reporting this. Regex might be the way to go indeed.
If you are interested to do it, feel free to assign yourself and open a PR 😃

> Thanks for such an amazing piece of software!

Thanks a lot, this means a lot to @CohenArthur and me.

@marchellodev
Contributor Author

@Skallwar I'd love to!

It seems like `find_urls_as_strings()` returns a mutable list of all the URLs; any change to those strings is instantly reflected in the DOM via the kuchiki library. I'm trying to select every `<script>` element and then find URLs inside via regex, but I'm not sure how to do the same thing here, i.e. return a mutable string that stays attached to the DOM.

I think we should refactor the code a bit. Instead of returning all the strings, filtering them, and then changing them to the paths of the downloaded files, we should probably find a single URL, check whether it belongs to the domain, and change it right away, all in the same method, even before adding the URL to the queue. This would offer much more flexibility for getting URLs not only from HTML but also from CSS, JS, and other file types. What do you think?
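A minimal sketch of that single-pass idea, under the assumption of a hypothetical helper (`rewrite_css_urls` is illustrative, not suckit's actual API): a closure decides per URL whether it belongs to the domain and what local path to substitute, and matched URLs are rewritten and queued in one traversal:

```rust
// Hypothetical sketch (not suckit code): find each `url(...)` reference,
// let the caller's closure decide whether to rewrite it, and collect the
// rewritten URLs for the download queue, all in a single pass.
fn rewrite_css_urls<F>(css: &str, mut map_url: F) -> (String, Vec<String>)
where
    F: FnMut(&str) -> Option<String>, // Some(local path) => rewrite + enqueue
{
    let mut out = String::new();
    let mut queue = Vec::new();
    let mut rest = css;
    while let Some(start) = rest.find("url(") {
        let after = &rest[start + 4..];
        let Some(end) = after.find(')') else { break };
        let raw = after[..end].trim().trim_matches(|c: char| c == '"' || c == '\'');
        out.push_str(&rest[..start + 4]);
        match map_url(raw) {
            Some(local) => {
                queue.push(raw.to_string()); // enqueue for download
                out.push_str(&local);        // rewrite in place
            }
            None => out.push_str(&after[..end]), // foreign domain: keep as-is
        }
        out.push(')');
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    (out, queue)
}

fn main() {
    let css = "a { background: url(https://example.com/a.png); } \
               b { background: url(https://other.org/b.png); }";
    let (rewritten, queue) = rewrite_css_urls(css, |url| {
        url.starts_with("https://example.com/")
            .then(|| url.replace("https://example.com/", "./"))
    });
    println!("{rewritten}");
    println!("queued: {queue:?}");
}
```

The same closure-driven shape could be reused for HTML attributes and JS string literals, with only the scanning step differing per file type.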

@Skallwar
Owner

Skallwar commented Jul 8, 2021

I'm not sure regex is the way to go. It's slow, and matching both absolute and relative URLs will be quite hard.

I think there are some good CSS parsers out there based on Servo: this one or this one

JS might be trickier.
