Extract embedded URLs from JavaScript files #279

Open
goelayu opened this issue Mar 23, 2023 · 2 comments

@goelayu

goelayu commented Mar 23, 2023

From what I understand, based on the source code, the parser doesn't extract embedded URLs from JavaScript files.
Is there any particular reason for not supporting this? Maybe because wget doesn't support that feature?
I feel it's a simple add-on and can significantly improve the fidelity of statically crawled pages.

@rockdaboot
Owner

We have had this kind of request before and rejected it, because JS often produces URLs dynamically and we don't want to run external code (e.g. via V8).

But yeah, "grepping" out static URLs with a regex could be an option. Is this what you are thinking of? Or do you have something else in mind?

@goelayu
Author

goelayu commented Mar 24, 2023

> But yeah, "grepping" out static URLs with a regex could be an option. Is this what you are thinking of? Or do you have something else in mind?

That is correct. Simple regex-based grepping could significantly increase the set of URLs fetched. I am running some analysis of my own and am happy to report back with empirical data. To give an example, here is a code snippet from www.nytimes.com:

e = "https://static01.nyt.com/ads/tpc-check.html",
a = document.body,
(r = document.createElement("iframe")).src = e,

As you can see, the entire URL is statically embedded and doesn't require any JS execution to construct.

Regarding what kind of regex to use, here is how the Internet Archive greps for such URLs, though I am not sure what kind of false positives/negatives they get with this. If I am able to determine a better regex as part of my analysis, I will share it here.
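
To make the idea concrete, here is a minimal sketch of what such grepping could look like, written in Python for brevity. The pattern and the function name `extract_static_urls` are hypothetical: this is not the Internet Archive's regex and not anything from the existing parser. It assumes we only care about absolute `http(s)` URLs that sit entirely inside a quoted string literal, so relative paths, template literals and concatenated strings are all missed.

```python
import re

# Hypothetical pattern (an assumption, not taken from any existing tool):
# an absolute http(s) URL that fills a single- or double-quoted string literal.
# The backreference \1 requires the closing quote to match the opening one.
URL_IN_STRING = re.compile(r"""(["'])(https?://[^"'\s\\]+)\1""")


def extract_static_urls(js_source):
    """Return absolute URLs found in quoted string literals.

    URLs are returned in order of first appearance, without duplicates.
    No JavaScript is executed; this is plain text matching.
    """
    seen = set()
    urls = []
    for match in URL_IN_STRING.finditer(js_source):
        url = match.group(2)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# The nytimes.com snippet quoted above, used as sample input.
snippet = '''
e = "https://static01.nyt.com/ads/tpc-check.html",
a = document.body,
(r = document.createElement("iframe")).src = e,
'''

print(extract_static_urls(snippet))
# → ['https://static01.nyt.com/ads/tpc-check.html']
```

Note that `"iframe"` is skipped because it does not start with a scheme. Requiring the scheme keeps false positives down at the cost of missing relative URLs, which is exactly the trade-off that would need empirical data.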
