Extract embedded URLs from JavaScript files #279

goelayu · 2023-03-23T23:23:10Z

From what I understand, based on the source code, the parser doesn't extract embedded URLs from JavaScript files.
Is there any particular reason for not supporting this? Maybe because wget doesn't support that feature?
I feel it's a simple add-on and can significantly improve the fidelity of statically crawled pages.

rockdaboot · 2023-03-24T11:59:33Z

We had this kind of request before and rejected it, because JS often produces URLs dynamically and we don't want to run external code (e.g. via V8).

But yeah, "grepping" out static URLs with a regex could be an option. Is this what you are thinking of? Or do you have something else in mind?

goelayu · 2023-03-24T12:20:22Z

> But yeah, "grepping" out static URLs with a regex could be an option. Is this what you are thinking of? Or do you have something else in mind?

That is correct. Simple regex-based grepping could significantly increase the set of URLs fetched. I am running some analysis of my own and am happy to report back with empirical data. To give an example, here is a code snippet from www.nytimes.com:

e = "https://static01.nyt.com/ads/tpc-check.html",
a = document.body,
(r = document.createElement("iframe")).src = e,

As you can see, the entire URL is statically embedded and doesn't require any JS execution to construct.

Regarding what kind of regex to use, here is how the Internet Archive greps for such URLs, though I am not sure what kind of false positives/negatives they get with it. If I can determine a better regex as part of my analysis, I will share it here.
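
To make the idea concrete, here is a minimal sketch of the kind of grepping I mean, written in Python only for illustration. The pattern and the function name `extract_static_urls` are my own assumptions, not the Internet Archive's regex and not anything that exists in wget2. It only looks for absolute http(s) URLs that sit entirely inside one quoted string literal.

```python
import re

# Assumed pattern (illustrative only): an absolute http(s) URL that is fully
# contained in a single- or double-quoted string literal. The URL body may not
# contain quotes, whitespace, or backslashes.
URL_IN_STRING = re.compile(r"""(["'])(https?://[^"'\s\\]+)\1""")


def extract_static_urls(js_source):
    """Return quoted absolute URLs in order of first appearance, without duplicates."""
    seen = set()
    urls = []
    for match in URL_IN_STRING.finditer(js_source):
        url = match.group(2)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# The nytimes.com excerpt quoted above.
snippet = '''
e = "https://static01.nyt.com/ads/tpc-check.html",
a = document.body,
(r = document.createElement("iframe")).src = e,
'''

print(extract_static_urls(snippet))
```

On the excerpt above this finds the one static URL. A URL assembled at runtime, such as `"https://" + host + "/api"`, is deliberately not matched, which is the limitation mentioned earlier in the thread. This sketch would also miss URLs with escaped slashes (`https:\/\/...`) and relative URLs, so it should be read as a lower bound on what a real pattern would need to handle.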
