Scraping URLs from JavaScript and CSS #142
Comments
Thanks for reporting this. Regex might be the way to go indeed.
Thanks a lot, this means a lot to @CohenArthur and me.
@Skallwar I'd love to! I think we should refactor the code a bit. Instead of returning all the strings, then filtering them out, and then changing them to the paths of the downloaded files, we should find each URL, check whether it belongs to the domain, and rewrite it right away, all in the same method, even before adding the URL to the queue. This would offer much more flexibility for extracting URLs not only from HTML, but also from CSS, JS, and other types of files. What do you think?
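The per-URL flow proposed above could look roughly like this. This is only a hedged sketch, not suckit's actual API: the names `process_url` and `LOCAL_PREFIX` are hypothetical, and the domain check is deliberately simplistic.

```rust
// Hypothetical sketch of the proposed flow: for each URL found in a page,
// decide in one place whether it is on-domain, rewrite it to the local path
// of the (future) downloaded file, and report whether it should be queued.
// `process_url` and `LOCAL_PREFIX` are illustrative names, not suckit's API.
const LOCAL_PREFIX: &str = "local"; // assumed output directory

/// Returns the rewritten URL and whether it should be added to the queue.
fn process_url(url: &str, domain: &str) -> (String, bool) {
    let on_domain = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .map(|rest| rest.starts_with(domain))
        .unwrap_or(false);
    if on_domain {
        // Rewrite right away, before queueing, as proposed above.
        let path = url.splitn(4, '/').nth(3).unwrap_or("index.html");
        (format!("{LOCAL_PREFIX}/{path}"), true)
    } else {
        // Off-domain URLs are left untouched and never queued.
        (url.to_string(), false)
    }
}

fn main() {
    let (rewritten, queue) = process_url("https://example.com/js/app.js", "example.com");
    println!("{rewritten} queue={queue}");
}
```

Doing the rewrite and the queueing decision in one method means the same code path works regardless of whether the URL came out of an HTML attribute, a CSS `url(...)`, or a JS string literal.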
Currently, the main issue that prevents suckit from downloading large websites with a lot of JavaScript is its inability to scrape links from JS and CSS files, not just from HTML.
I think JS (as well as CSS) is too complicated to parse fully, so we could just use regular expressions to find URLs, add them to the queue, and then replace them with the local format.
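A minimal sketch of what such URL extraction could look like. A real implementation would likely use the `regex` crate; this illustration uses only a plain stdlib substring scan, and `extract_urls` is a hypothetical name, not anything in suckit.

```rust
// Hypothetical sketch: pull absolute URLs out of a JS/CSS source string.
// A production version would use a proper regex; here we scan for the
// "http://" / "https://" prefixes with stdlib string ops only.
fn extract_urls(source: &str) -> Vec<String> {
    let mut urls = Vec::new();
    for prefix in ["http://", "https://"] {
        let mut start = 0;
        while let Some(pos) = source[start..].find(prefix) {
            let begin = start + pos;
            // Treat the URL as ending at the first quote, whitespace,
            // paren, or angle bracket after the prefix.
            let end = source[begin..]
                .find(|c: char| c.is_whitespace() || "\"'()<>".contains(c))
                .map(|e| begin + e)
                .unwrap_or(source.len());
            urls.push(source[begin..end].to_string());
            start = end;
        }
    }
    urls
}

fn main() {
    let css = "body { background: url(\"https://example.com/bg.png\"); }";
    for url in extract_urls(css) {
        println!("{url}");
    }
}
```

The hard part in practice is the replacement step: relative URLs, protocol-relative `//host/path` forms, and URLs built by string concatenation in JS won't match a simple pattern like this, which is why a regex-based approach is a heuristic rather than a full parse.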
Thanks for such an amazing piece of software!
Related to #68 #70