
minimalcss is unable to extract CSS from many web archive snapshots #308

Open · layoutanalysis opened this issue Mar 11, 2019 · 3 comments

layoutanalysis commented Mar 11, 2019

I would like to use minimalcss to extract the used CSS from http://web.archive.org/ snapshots of a webpage (e.g. http://web.archive.org/web/20110310061818/http://www.bloomberg.com/) and compare the results over time, to find out how often the layout/appearance of a webpage has changed in the past.

Unfortunately this is not so easy with minimalcss, because it stops working whenever a stylesheet cannot be fetched (404 error). 404s are very common on web.archive.org, as many captures are incomplete. I can partially work around them with the skippable function, but it only lets me skip a request upfront; I cannot react to response errors. My preferred behaviour would be to output the used CSS to stdout while logging the unretrievable stylesheet URLs to stderr.
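
To make the request concrete: a failure callback would cover this use case. A sketch of what it could look like (the onRequestFailure option is made up for illustration; it does not exist in minimalcss today):

const minimalcss = require('minimalcss');

minimalcss
  .minimize({
    urls: ['http://web.archive.org/web/20110310061818/http://www.bloomberg.com/'],
    // Hypothetical option, NOT part of the current minimalcss API:
    // report a failed stylesheet fetch and keep going, instead of
    // aborting the whole run on the first 404.
    onRequestFailure: request => {
      console.error(`Unretrievable stylesheet: ${request.url()}`);
      return true; // continue despite the error
    }
  })
  .then(result => console.log(result.finalCss));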

Another issue is the mandatory CSSO optimisation, which crashes on certain CSS property values. I can mitigate some of the crashes by setting cssoOptions: {restructure: false}, but it would be nicer if I could disable the optimisation altogether.
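
For context, cssoOptions is presumably handed straight to csso's minify call. The effect of the restructure flag can be seen by calling csso directly (a sketch, assuming csso 3.x):

const csso = require('csso');

const input = '.a { color: red } .b { color: red }';

// Default: the restructuring phase merges rules aggressively; this is
// the phase that appears to trip over certain property values.
console.log(csso.minify(input).css);
// -> .a,.b{color:red}

// restructure: false limits csso to safe per-declaration compression:
console.log(csso.minify(input, { restructure: false }).css);
// -> .a{color:red}.b{color:red}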

I'm aware that my use case is somewhat uncommon for minimalcss, but perhaps the library could be extended to support it?

layoutanalysis commented Mar 11, 2019

I also noticed that minimalcss times out on certain web.archive.org snapshots:

const minimalcss = require('minimalcss');

minimalcss
  .minimize({
    urls: ['http://web.archive.org/web/20161001001006/https://www.theguardian.com/us'],
    ignoreJSErrors: true,
    withoutjavascript: true,
    ignoreCSSErrors: true,
    loadimages: false,
    enableServiceWorkers: true,
    timeout: 90000,
    cssoOptions: { restructure: false },
    skippable: request => {
      // skip everything that is not a theguardian.com resource
      return request.url().indexOf('theguardian.com') === -1;
    }
  })
  .then(result => {
    console.log(result.finalCss);
  })
  .catch(error => {
    console.error(`Failed to minimize CSS: ${error}`);
  });

results in the error

Failed to minimize CSS: TimeoutError: Navigation Timeout Exceeded: 90000ms exceeded
Tracked URLs that have not finished: http://web.archive.org/web/20161001001006/https://www.theguardian.com/us, http://web.archive.org/web/20161001001006/https://www.theguardian.com/us-news/series/politics-for-humans/rss

This error also happens with timeout: 560000 (a timeout of over nine minutes).
Maybe it would make sense to stop all pending requests at start_time + (timeout - 10%) and use the remaining time to calculate the used CSS and return it?
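
In other words, a soft deadline. A rough sketch of the idea with a plain Promise.race (the helper is made up for illustration and is not how minimalcss is implemented):

// Resolve with a sentinel at 90% of the time budget instead of letting
// the navigation reject with a TimeoutError. Illustration only.
function withSoftDeadline(navigation, timeoutMs) {
  const deadline = new Promise(resolve =>
    setTimeout(() => resolve('deadline-reached'), timeoutMs * 0.9)
  );
  return Promise.race([navigation, deadline]);
}

// If the deadline wins, the still-pending requests could be aborted and
// the remaining 10% of the budget spent computing the used CSS from the
// stylesheets that did finish loading.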

stereobooster (Collaborator) commented

The timeout could be a puppeteer bug. Related: #112

peterbe (Owner) commented Mar 12, 2019

What @stereobooster said is true.

But I wonder, why do you have enableServiceWorkers: true in there?
