From what I understand, based on the source code, the parser doesn't extract embedded URLs from JavaScript files.
Is there any particular reason for not supporting this? Maybe because wget doesn't support that feature?
I feel it's a simple add-on and can significantly improve the fidelity of statically crawled pages.
But yeah, "grepping" out static URLs with a regex could be an option. Is this what you are thinking of? Or do you have something else in mind?
That is correct. Simple regex-based grepping could significantly increase the set of URLs fetched. I am running some analysis of my own and am happy to report back with empirical data, but to give an example, here is a code snippet from www.nytimes.com:
e = "https://static01.nyt.com/ads/tpc-check.html",
a = document.body,
(r = document.createElement("iframe")).src = e,
As you can see, the entire URL is statically embedded and doesn't require any JS execution to construct.
As for which regex to use, here is how the Internet Archive greps for such URLs, though I am not sure what false positives and false negatives they get with it. If I am able to determine a better regex as part of my analysis, I will share it here.
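To make the idea concrete, here is a minimal sketch of the kind of grepping I mean. It is not the Internet Archive's regex and not this project's parser: the pattern `URL_IN_STRING` and the helper `extract_static_urls` are hypothetical names I made up for illustration, and the pattern only catches absolute `http(s)` URLs that sit whole inside a quoted string literal.

```python
import re

# Hypothetical pattern (an assumption, not taken from any existing crawler):
# an absolute http(s) URL enclosed in matching single or double quotes.
URL_IN_STRING = re.compile(r"""(["'])(https?://[^"'\s\\]+)\1""")


def extract_static_urls(js_source: str) -> list:
    """Return absolute URLs found in quoted JS string literals.

    Results keep first-seen order and contain no duplicates.
    """
    seen = set()
    urls = []
    for match in URL_IN_STRING.finditer(js_source):
        url = match.group(2)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


snippet = '''
e = "https://static01.nyt.com/ads/tpc-check.html",
a = document.body,
(r = document.createElement("iframe")).src = e,
'''

print(extract_static_urls(snippet))
# prints ['https://static01.nyt.com/ads/tpc-check.html']
```

This deliberately skips relative paths, protocol-relative URLs (`//host/path`), and URLs assembled by string concatenation, so it will miss some resources. It can also pick up URLs that are never actually fetched, which is where the false positive question comes in.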