
How to read this document

It is highly recommended that you view this page by clicking the Help button in the extension’s own UI. Doing that will make this page interactive: the settings popup will be displayed on the right side of this page, and hovering over or clicking on any links pointing to popup.html will highlight the corresponding elements in the popup.

See screenshots if you want to see how it will look.

You can still read this page outside of the extension’s UI, but be prepared for all links pointing to popup.html to be useless. Also, the version hosted on the author’s web site is superior to what GitHub’s web UI renders (this page is written in the org-mode markup language; converting it to GitHub Markdown would make things much harder, since it uses a lot of advanced org-mode markup features to simplify things, and GitHub does not render org-mode files very well at the moment).

What?

Hoardy-Web is a browser extension (add-on) that passively captures and collects dumps of HTTP requests and responses as you browse the web, and then archives them using one or more of the following methods:

To view your archived data, see the accompanying hoardy-web CLI tool (also there).

Glossary

  • A reqres (REQuest + RESponse) is an internal object containing captured information about an HTTP request and its response, including their headers and data, and some meta-information (whether it originates from an extension, tabId it originates from, its state, etc).
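For orientation, here is a hedged sketch of roughly what a reqres carries; the field names below are illustrative guesses, not Hoardy-Web’s actual internal names, and the real structure contains more fields.

#+begin_src python
# Illustrative only: field names are assumptions, not Hoardy-Web's real ones.
from dataclasses import dataclass, field

@dataclass
class Reqres:
    state: str                    # e.g. "in_flight", "finished", "in_limbo"
    tab_id: int                   # the tab this reqres originates from
    from_extension: bool          # whether an extension made the request
    method: str                   # e.g. "GET", "POST", or "DOM" for snapshots
    url: str
    request_headers: list[tuple[str, str]] = field(default_factory=list)
    request_body: bytes = b""
    response_headers: list[tuple[str, str]] = field(default_factory=list)
    response_body: bytes = b""
    errors: list[str] = field(default_factory=list)  # raw browser error codes
#+end_src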

General operation

State Diagram

Reqres change their internal states according to the following state diagram (which is explained below):

(start) -> (request sent) -> (nIO) -> (headers received) -> (nIO) --> (body received)
   |                           |                              |             |
   |                           v                              v             v
   |                     (no_response)                   (incomplete)   (complete)
   |                           |                              |             |
   |                           \                              |             |
   |\---> (canceled) ----\      \                             |             |
   |                      \      \                            \             |
   |\-> (incomplete_fc) ---\      \                            \            v
   |                        >------>---------------------------->-----> (finished)
   |\--> (complete_fc) ----/                                             /  |
   |                      /                                             /   |
   \----> (snapshot) ----/       /- (collected) <--------- (picked) <--/    |
                                /        ^                     |            |
               (stashIO?) <----/         |                     v            v
                   |                     \-- (in_limbo) <- (stashIO?) <- (dropped)
                   v                              |                         |
                (queued) <------------------\     |                         |
                / |  ^ \                     \    \-----> (discarded) <-----/
  (exported) <-/  |  |  \----------------\    \                ^
      |           |  |                    \    \               |
      |       /---/  \-----------------\   \    \              |
      |       |                        |    \    \             |
      |       v                        |     \    \            |
      |\-> (srvIO) -> (stashIO?) -> (failed) |     \           |
      |       |                        ^     /      \          |
      |       v                        |    v        |         |
      |   (submitted) --------------> (saveIO) --> (saved)     | {{!saving}}
      |       \                                                |
      \-------->-----------------------------------------------/

Step 1: Tracking

Hoardy-Web attaches to your browser’s runtime and tracks progress of HTTP requests and their responses, capturing both their request and response headers and data at appropriate times in the browser’s request and response processing pipeline.

Whether Hoardy-Web will track a given request depends on the Track new reqres toggles in the settings popup, e.g.:

  • this toggle allows you to disable tracking of newly spawned HTTP requests globally, thus essentially disabling Hoardy-Web,
  • this one controls whether Hoardy-Web will track new reqres originating from the currently active tab,
  • this one controls whether it will track new reqres originating from new tabs opened from the currently active tab (aka “children tabs”, e.g. via middle mouse click, context menu, etc),
  • while this one controls whether it will track new reqres originating from new tabs opened via browser’s “New Tab” browser action (i.e. the plus sign in the tab bar, Control+T, menu item, etc),
  • and so forth for the others (press ? symbols to see a tooltip explaining what each of them does).

Disabling any of these toggles does not stop tracking of already initiated requests; it only stops new requests controlled by that toggle from being tracked.

The networking states of the State Diagram

As shown on the above diagram, a new reqres proceeds through the following networking states:

  • start: the starting state;
  • request sent, (response) headers received, (response) body received: these are the normal HTTP request stages (stages of the =webRequest= sub-API of the WebExtensions API);
  • nIO: normal network IO performed by the browser in between HTTP request stages;
  • canceled: request was canceled before it was sent
    • you, the user, canceled it manually, via the Stop button;
    • by the browser when redirecting an http:// URL to an https:// URL in HTTPS-only mode;
    • by an ad-blocking extension like uBlock Origin;
    • etc;

    unsent would have probably been a better name for this, but all browsers call it canceled internally, so Hoardy-Web follows that convention;

  • no_response: request was sent, but no response was received:
    • you canceled it manually via the Stop button before it got a response;
    • connection to the server was rejected;
    • the server decided to ignore the request completely;
    • network timeout was reached;
    • etc;
  • incomplete: request was sent, response headers were received, but then the loading was interrupted before all of the response body was received;
  • incomplete_fc: only on Firefox-based browsers: the browser loaded the response data of this reqres directly from its cache, but did not give it to Hoardy-Web;

    this is just how Firefox handles things sometimes; usually, this only happens for images;

    this is a separate state, because usually this means this URL was successfully archived before; if it was not, reload the page with Control+F5;

  • complete: request was completed successfully;
  • complete_fc: request was completed successfully from browser’s cache;
  • snapshot: this reqres was produced by taking a DOM (Document Object Model) snapshot (using one of the appropriate-buttons in the popup), i.e. it was produced by capturing the raw HTML or XML of the current state of the tab/frame, not by capturing a network request;
  • finished: the terminal state of this step, no new events for this reqres will come from the browser.

The states after the finished state

In principle, upon reaching the finished state the reqres can be serialized and saved to disk, but Hoardy-Web provides more states and UI for convenience and to work around limitations of various browser APIs (a WebExtensions API function call that writes a data chunk into a file on the local file system while reporting out-of-disk-space errors does not exist).

Glossary

  • An /in-flight reqres/ (current tab) is a reqres that has not reached the finished state yet; in the history-log such reqres will be shown as being in the in_flight state.

    These two stats are represented as sums of two numbers:

    • the number of reqres that are still being tracked via webRequest or debugger API; and
    • the number of reqres that have finished being tracked and are now waiting for all their events to finish processing.

    On Firefox, nothing should ever get stuck; if something seems to be stuck in the in_flight state, it’s probably still loading (or it is a bug in the browser, which does happen, very rarely).

    On Chromium, limitations of Chromium’s debugging interface mean a request can get stuck among the reqres represented by the first number above. If the first number is zero, however, then the second should also rapidly become zero, at most after two times this many seconds.

    If some reqres got stuck in one of the in_flight states, you can forcefully move them out of that state using this and/or that popup buttons.

  • A finished reqres is a reqres that reached the finished state.
  • Final networking state is the last state a reqres had before it finished: i.e. complete, incomplete, canceled, etc.

Step 2: Classification

On reaching the finished state, Hoardy-Web performs reqres classification controlled by =Pick reqres for archival when they finish= and =Mark reqres as ‘problematic’ when they finish= settings. The former set decides whether the reqres in question should be picked or dropped, which influences the actions Hoardy-Web will perform in the next step. The latter set decides if the reqres in question should be marked as problematic.

Problematic reqres

The problematic reqres status is a flag (NOT a state) that does not influence archival or any actions discussed in the later steps. It exists because browsers provide no indication when some parts of the page failed to load properly — they expect you to actually look at the page with your eyes to notice something looking broken (and reload it manually) instead — which is counterproductive when you want to be sure that the whole page with all its resources was archived.

After all, parts of a dynamically loaded page might simply silently fail to be rendered by the associated JavaScript because some of the HTTP requests that the JavaScript made in the background failed, or, on a static web page, layout and `CSS` might have made some of the incompletely loaded parts of the page invisible (by design or by accident).

So, to provide an indicator for such cases, Hoardy-Web keeps the log of problematic reqres and displays the number of elements in the log in its toolbar button’s badge.

By default, HTTP requests that failed to get a response, those that have incomplete response bodies, and those for which the browser reported potentially problematic errors but then Hoardy-Web picked them anyway, will be marked as problematic.

Problematic errors are errors like

  • “this request failed because of a networking issue”,
  • “this request was aborted because the JavaScript function making it decided to cancel it when you moved your mouse cursor away from a video thumbnail it was needed for”,
  • and similar things that probably imply some part of the page was left unfetched,

but NOT errors like

  • “fetching of this request was aborted because the server redirected it to a URL blocked by uBlock Origin”,
  • “the browser decided against rendering of this data”,
  • “the browser failed to render this data because this image file is broken”,
  • and similar errors where the data was properly fetched.

(In principle, Hoardy-Web could have been designed to never record the errors of the latter category in the first place, thus simplifying the above a bit, but Hoardy-Web is designed to follow the philosophy of “collect everything as the browser gives it, as raw as possible, do all the post-processing logic separately, allow for no logic at all, if the user asks for it”.)

The raw error strings reported by the browser for each reqres can be seen in the history-log.
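To illustrate the distinction, here is a hedged and heavily simplified sketch of this kind of classification, using error-code prefixes described in the “Errors recorded in reqres” section below; this is not Hoardy-Web’s actual code, and the real rules are more nuanced and configurable via the popup settings.

#+begin_src python
# A simplified illustration, not Hoardy-Web's actual logic.
def looks_problematic(final_state: str, picked: bool, errors: list[str]) -> bool:
    # no response at all, or an incomplete response body
    if final_state in ("no_response", "incomplete"):
        return True
    # networking-stack failures reported by the browser (see the error-code
    # section below); blocked-by-extension and rendering errors do not count
    networking_failure = any(
        e.startswith("webRequest::NS_ERROR_NET")
        or e.startswith("webRequest::net::ERR_FAILED")
        or e.startswith("debugger::net::ERR_FAILED")
        for e in errors)
    return picked and networking_failure
#+end_src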

If this option is enabled, Hoardy-Web will generate a desktop notification each time a new problematic reqres gets produced. If you don’t care about the problematic flag and it annoys you, you should disable that option, not the options under the =Mark reqres as ‘problematic’ when they finish= settings. This way you can still see the number of problematic reqres in the extension’s toolbar button’s badge.

Glossary

Step 3: Collection, Discarding, and Limbo

On exit from the finished state each reqres gets split into

  • a loggable, which is a hollow reqres structure without any request or response data, i.e. it only keeps the metadata used by history-log, and
  • a dump, which is a serialized CBOR-formatted dump of the original reqres structure.

Since those tuples can be reconstructed back into the original reqres structures, the following will continue to refer to them as plain reqres whenever the fact that they are internally represented by such tuples is not relevant.
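As a rough illustration of this split, here is a minimal Python sketch, assuming a reqres is represented as a plain dict and using the third-party cbor2 library; it is not Hoardy-Web’s actual code.

#+begin_src python
import cbor2  # third-party: pip install cbor2

def split_reqres(reqres: dict) -> tuple[dict, bytes]:
    # the dump keeps the whole structure, serialized as CBOR
    dump = cbor2.dumps(reqres)
    # the loggable keeps only the metadata the history log needs
    loggable = {k: v for k, v in reqres.items()
                if k not in ("request_body", "response_body")}
    return loggable, dump
#+end_src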

Normally, picked reqres proceed to the collected state and get queued for archival while dropped reqres proceed to being discarded from memory.

When =Archive ‘collected’ reqres by= toggle is enabled, those queued reqres proceed directly to the next step.

Limbo mode

However, sometimes you might want to actually look at a web page before deciding if you want to archive it or not. The naive way to do this would be to load the page with capture disabled first, look at it, and then, if you want to save it, enable capture and reload the page with the browser’s cache disabled via Control+F5 (and it has to be Control+F5, not just F5, because otherwise, on Firefox, some URLs might produce reqres in the incomplete_fc state, while on Chromium their fetching could be silently skipped).

Obviously, this is both annoying and forces you to fetch everything twice.

Which is why Hoardy-Web implements “limbo mode”. With one of the limbo mode options enabled, Hoardy-Web will instead capture everything as normal, but then, instead of sending the reqres in question to the collected or discarded states immediately, it will put them into the in_limbo state, where they will linger until you collect or discard them manually by pressing the appropriate-buttons, or until the =Automatic actions for recently closed tabs= options make a decision semi-automatically for you.

A picked reqres will be put into in_limbo when =Pick into limbo= setting is enabled in the currently active tab or when one-of-the-other settings is enabled for other reqres sources.

Similarly, a dropped reqres will be put into in_limbo when =Drop into limbo= setting is enabled in the currently active tab or when one-of-the-other settings is enabled for other reqres sources. (This latter option mainly exists for debugging.)

If this option is enabled and there are more than this number of reqres in_limbo, or the total size of all dumps in_limbo is more than this size (in MiB), Hoardy-Web will complain to remind you to collect or discard some of them so that your browser does not waste too much memory (and so that you won’t lose too much data if something crashes while the =Stash ‘collected’ reqres into local storage= option discussed below is disabled).

Glossary

Step 3.5: Stashing

The stashed reqres status is, essentially, a flag that says this reqres was temporarily backed up to browser’s local storage.

When =Archive ‘collected’ reqres by= option is disabled but =Stash ‘collected’ reqres into local storage= option is enabled, instead of archiving newly queued reqres, Hoardy-Web will stash their (loggable, dump) tuples into browser’s local storage.

Similarly, when both the =Stash ‘collected’ reqres into local storage= option and the =Stash ‘in_limbo’ reqres= option (or one-of-the-other similar options) are enabled, newly generated in_limbo reqres will also get immediately stashed into browser’s local storage.

Moreover, the following section will discuss how Hoardy-Web will try stashing failed-to-archive reqres into browser’s local storage too.

In other words, stashing exists to prevent loss of successfully captured but yet unarchived data in situations where

  • you quit or restart your browser,
  • Hoardy-Web crashes or gets reloaded unexpectedly, or
  • your computer unexpectedly loses power,

before you collected or discarded everything from in_limbo or Hoardy-Web has successfully archived everything from its archiving queue.

Note, however, that even with stashing enabled, Hoardy-Web will skip disk IO whenever possible: e.g., if both the =Archive ‘collected’ reqres by= and =Archive ‘collected’ reqres by > … submitting them via ‘HTTP’= options discussed below are enabled, Hoardy-Web will first try to archive each newly collected reqres straight from memory to the archiving server, and only if that process fails will it attempt stashing them to local storage instead.

Meaning that

  • stashing of non-=in_limbo= reqres is usually completely free and so you should probably keep that option always enabled;
  • stashing of in_limbo reqres via one of those options is not free, so if you almost never archive from limbo, keeping those options enabled will waste disk IO; you might want to disable at least some of them in that case;

The above also implies that, technically, stashing is not a silver bullet against data loss. To try and make it such would mean unconditional immediate stashing of all captured data, which would waste a lot of disk IO on most Hoardy-Web configurations.

When both =Archive ‘collected’ reqres by= option and =Stash ‘collected’ reqres into local storage= option are disabled, then, after a new reqres gets queued, Hoardy-Web will generate a new desktop notification complaining about it, unless that option is disabled too.

You can also forcefully stash all currently queued, in_limbo, and failed reqres by pressing this button. It stashes everything immediately and unconditionally, ignoring all other stashing settings.

Glossary

  • A stuck queued reqres is a queued reqres that got stuck in the archival queue, e.g. because it got queued while =Archive ‘collected’ reqres by= option was disabled.
  • A /stashed reqres/ is a reqres that was temporarily stashed (backed-up) into browser’s local storage while it is still being kept in Hoardy-Web’s memory. I.e., the stash is a persistent on-disk backup for in-memory reqres.

Step 3.75: Logging

On entering collected or discarded state, loggable metadata of each reqres is copied into the recent reqres history-log and is kept there until the size of the log reaches this many elements, at which point the older elements of the log start being elided automatically.

You can also ask Hoardy-Web to forget all history manually by pressing this button, or to forget history of reqres generated by the currently active tab by pressing that button instead, or do the same by using similar buttons in the-log. Using the-log will also allow the use of reqres filtering options for doing this, allowing you to selectively forget parts of history.

Note, however, that problematic reqres will not get automatically elided from the log, nor forgotten by using the above buttons. To forget about them, you will have to unset the problematic flag on the respective reqres via this button, or that button, or use similar buttons in the-log.
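The following is a hedged sketch of this eviction behaviour (oldest entries elided once the limit is reached, problematic entries kept); the names and the data structure are illustrative, not Hoardy-Web’s actual implementation.

#+begin_src python
from collections import deque

def elide_old_entries(log: "deque[dict]", limit: int) -> None:
    kept = deque()
    # walk from the oldest end, dropping non-problematic entries
    # until the log fits into the limit again
    while log and len(log) + len(kept) > limit:
        entry = log.popleft()
        if entry.get("problematic"):
            kept.append(entry)  # problematic entries are never auto-elided
    # put the kept problematic entries back at the front, in their old order
    log.extendleft(reversed(kept))
#+end_src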

Step 4: Archival

When the =Archive ‘collected’ reqres by= toggle is enabled, Hoardy-Web will pop queued reqres from the archival queue one by one and then perform one or more of the following (in the order they are listed):

You can enable more than one archival method at the same time. For a given loggable, Hoardy-Web will remember and skip previously successful archival methods if the loggable ever returns to the archival queue again (e.g., when one of the archival methods fails and you later ask Hoardy-Web to retry the archival, or when you re-queue a reqres from local storage from the Saved in Local Storage page).

Note the difference between stashed and saved reqres:

  • stashed reqres are kept in memory until they get successfully archived by all configured archival methods (or until you manually discard them, in case they were stashed in_limbo);
  • saved reqres get dumped into browser’s local storage and, if that succeeds, discarded from memory (until you manually load them back from there).

Buckets

Sometimes you might want to split your archivals into separate buckets to simplify future hoarding and sharing of collected archives. E.g., say, by default you might want to put everything into the “default” bucket, but then you might want to put reqres produced by a specific tab where you have just logged into your personal account into the “private” bucket instead.

To implement this, for each reqres in the archival queue, Hoardy-Web computes a bucket parameter from the appropriate “Bucket” setting, e.g.

  • this one will be used for requests originating from the currently active tab,
  • this one will be used for requests originating from new child tabs opened from the currently active tab (e.g. via middle mouse click, context menu, etc),
  • while this one will be used for new tabs opened via browser’s “New Tab” browser action (i.e. the plus sign in the tab bar, Control+T, menu item, etc),
  • and so forth for the others (press ? symbols to see a tooltip explaining what each of them does).

Evaluation of the bucket parameter is done just before each archival attempt, so if the queue is not yet empty, and you disable =Archive ‘collected’ reqres by=, edit some of the “Bucket” settings, and enable it again, Hoardy-Web will start using the new setting immediately.

When exporting via saveAs, the bucket value will be used in the file name of the generated fake-Download .wrrb file, and the dumps will be split into separate fake-Download files by said bucket. I.e., internally, the bundle discussed above is actually a set of per-=bucket= bundles.

When submitting to an HTTP server, Hoardy-Web will specify bucket as a query parameter (named “profile”, for historical reasons) to each HTTP POST request.
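For illustration, here is a hedged Python sketch of such a submission; the server URL and the Content-Type header are placeholders and assumptions rather than documented specifics, the key point being the ?profile=<bucket> query parameter and the raw dump in the POST body.

#+begin_src python
import requests

def submit_dump(dump: bytes, bucket: str, server_url: str) -> None:
    # server_url stands for whatever =Server URL= is set to in the popup;
    # the Content-Type value here is an assumption, not a documented fact
    resp = requests.post(server_url,
                         params={"profile": bucket},  # the bucket
                         data=dump,                   # the CBOR dump
                         headers={"Content-Type": "application/cbor"})
    resp.raise_for_status()  # only a 200 OK counts as successfully archived
#+end_src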

When stashing or saving to local storage, Hoardy-Web will record the value of bucket into each loggable before saving data to disk. If you restart your browser, thus starting a new Hoardy-Web session, Hoardy-Web will use the old stashed/saved bucket values for all new attempted archivals of old reqres generated by previous sessions.

Glossary

  • An /exported reqres/ is a reqres that was successfully exported by generating a fake-Download containing its dump.
  • A /submitted reqres/ is a reqres that was successfully submitted to the archiving server and thus was discarded from memory.
  • A /saved reqres/ is a reqres that was successfully saved by being archived into browser’s local storage.
  • An archived reqres is either exported, submitted, or saved reqres.

Handling of Failures

As noted above, if any of the archival methods fail, the reqres in question will be moved into the failed state.

Submissions of reqres that failed because of networking issues will be retried automatically every 60 seconds. Archivals of reqres rejected by the archiving server or those that failed to be saved to browser’s local storage will not be retried automatically as those usually happen when there is no space left on the device you are archiving to.
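As a hedged sketch of this retry policy (written as a blocking loop purely for clarity; the extension itself does this asynchronously):

#+begin_src python
import time

def archive_with_retries(submit, dump, is_network_error) -> None:
    # Simplified illustration only: retry network failures every 60 seconds,
    # but leave other failures (usually "no space left on device") alone
    # until the user presses the retry button.
    while True:
        try:
            submit(dump)
            return
        except Exception as exc:
            if not is_network_error(exc):
                raise          # stays in the "failed" state
            time.sleep(60)
#+end_src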

You can retry all failed archivals by pressing this button. You can also use it to nudge the archiving sub-process awake if some things got stuck in the queue by accident. E.g., after the extension got reloaded with a non-empty queue, or if you previously quit your browser before everything was archived.

If this option is enabled and a new reqres was recently moved to the failed state, a new desktop notification will be generated. If this option is enabled, a new desktop notification will be generated when the archival queue becomes empty for the very first time or after any failures.

Glossary

Shortcuts

Hoardy-Web provides a bunch of keyboard and context menu shortcuts to allow using it more efficiently.

  • On Firefox-based browsers, you can see and edit all keyboard shortcuts via Add-ons and themes (about:addons) -> the gear icon -> Manage Extension Shortcuts.
  • On Chromium-based browsers, you can see and edit all keyboard shortcuts via the menu -> Extensions -> Manage Extensions (chrome://extensions/) -> Keyboard shortcuts (on the left).

Keyboard shortcuts

Hoardy-Web provides shortcuts to:

Context menu actions

Hoardy-Web provides context menu actions to:

  • open a given link in a new tab with currently active tab’s tracking in children tabs setting negated. I.e.,
    • right-mouse clicking while pointing at a link and
    • selecting Hoardy-Web > Open Link in New Tracked/Untracked Tab menu item,

    is equivalent to

    • toggling this,
    • middle-mouse clicking a link,
    • toggling this again.
  • do the same thing, but opening it in a new window.

Quirks and Bugs

Known extension issues

  • When the extension is (re-)loaded, all tabs inherit the values of these settings.
  • Hoardy-Web does not implement collection of WebSockets data on any of the supported browsers.

    (Firefox does not support it. Chromium does support it, in theory, but I have not tried using that API, so I have no idea how well it works.)

    This is a low-priority issue since you can simply take a DOM snapshot instead of capturing and later replaying WebSocket messages to in-page JavaScript. Also, capturing and archiving a DOM snapshot will free you from needing to run any JavaScript at all when you decide to return to view the archived page later, which is nice.

  • On Chromium, response data of background requests and requests made by other extensions does not get collected, since there’s no tab to attach a debugger to, and I have not figured out how to attach the debugger to other things yet.
  • On Firefox, fetches that spawn new downloads will be marked as problematic by default, since Firefox’s implementation of webRequest.filterResponseData API does not provide their contents to the extension and I have not figured out how to distinguish them from other fetches yet.

Relevant issues of all browsers

  • On all supported browsers, if an HTTP server supplies the same header multiple times — which happens sometimes, most commonly with Set-Cookie headers — then the archived response headers will usually become weird, with multiple headers squished into a single value, separated by newline symbols.

    This is just the way both Firefox (usually) and Chromium (always) supply those headers to extensions and Hoardy-Web does not try to undo it.

Relevant issues of Firefox-based desktop browsers: Firefox, Tor Browser, LibreWolf, etc

  • On Firefox-based browsers, without the patch (also there), the browser only supplies formData to webRequest.onBeforeRequest handlers, thus making it impossible to recover the actual request body of a POST request.

    Hoardy-Web will mark such requests as having a “partial request body” and try its best to recover the data from formData structure, but if a POST request was uploading files, they won’t be recoverable from formData (in fact, it is not even possible to tell if there were any files attached there), and so your archived request data will be incomplete even after Hoardy-Web did its best.

    Disabling this toggle will disable archiving of such broken requests. This is not recommended, however, as archiving some data is usually better than archiving none.

    With the above patch applied, small POST requests will be archived completely and correctly. POST requests that upload large files, and only those, will be marked as having a “partial request body”.

  • If-Modified-Since and If-None-Match headers never get archived, because the browser never supplies them to the extensions. Thus, you can get 304 Not Modified reqres response to a seemingly normal GET request.
  • Reqres of already cached media files (images, audio, video, except for svg and favicons) will end in incomplete_fc state because webRequest.filterResponseData API does not provide response bodies for such requests. This toggle controls if such reqres should be picked.

    By default, Hoardy-Web will drop them. Usually this is not a problem since such media will be archived on first (non-cached) access. But if you want to force everything on the page to be archived, you can reload the page without the cache with Control+F5.

  • Firefox fails to run onstop method for webRequest.filterResponseData filter for the very first HTTP/2 request the browser makes after you start it, thus making the reqres of that request incomplete. If this option is enabled, Hoardy-Web will transparently work around this bug by redirecting the very first navigation request to about:blank and then reloading the tab with its original URL.
  • Firefox-based browsers provide no API for archiving WebSockets data at the moment, unfortunately.

Relevant issues of Firefox-based mobile browsers: Fenix aka Firefox for Android, Fennec, Mull, etc

All of the above apply, moreover:

Relevant issues of Chromium-based desktop browsers: Chromium, Chrome, etc

On Chromium-based browsers, there is no way to get HTTP response data without attaching Chromium’s debugger to the tab from which a request originates. This makes things a bit tricky, for instance:

  • With this and this option enabled, new tabs will be reset to this value (about:blank by default) because the default of chrome://newtab/ does not allow attaching debugger to the tabs with chrome: URLs.
  • Requests made before the debugger is attached will get canceled by Hoardy-Web. So, for instance, when you middle-click a link, Chromium will open a new tab, but Hoardy-Web will block the requests from there until the debugger gets attached, and then automatically reload the tab afterwards. As a side effect of this, Chromium will show a Request blocked page until the debugger is attached and the page is reloaded, meaning it will get visually stuck on the Request blocked page if fetching the request ended up spawning a download instead of showing a page. The download will proceed as normal, though.
  • You will get an annoying notification bar constantly displayed in the browser while =Hoardy-Web= is enabled. Closing that notification will detach the debugger. Hoardy-Web will reattach it immediately because it assumes you don’t want to lose data and closing that notification on accident is, unfortunately, quite easy.

    However, closing the notification will make all in-flight requests lose their response data.

    All alternatives to Hoardy-Web that work with Chromium suffer from the same issue.

    If you disable this option the debuggers will get detached only after all requests finish. But even if there are no requests in-flight the notification will not disappear immediately. Chromium takes its time updating the UI after the debugger is detached.

Moreover, Chromium has the following long-standing issues/bugs making things difficult:

  • Chromium will automatically detach a debugger from a tab if it tries to save too much data into its debugger state. Which means that a tab that loads too much data too fast will get its debugger detached. Chromium does this to try and save memory, but this, among other issues, means that large images will fail to be properly archived, and any page that loads such files is likely to fail to be archived too.

    This is a design limitation of Chromium debugging interface, there appears to be no work-around for this at the moment.

    Meanwhile, on Firefox, Hoardy-Web uses the webRequest.filterResponseData API (not available on Chromium, because it greatly enhances a browser’s ad-blocking capabilities), which does not suffer from this problem.

  • Chromium will occasionally detach debuggers from some tabs at random. It just happens. Fortunately, Hoardy-Web will mark the resulting broken reqres as problematic by default as they match the conditions of at least one of this, this, or that options.
  • Chromium’s handling of media files (audio and video) within its debugging interface is very strange. When Chromium encounters a media file, it immediately loads the first few frames of it, then cancels the rest of the download, generates a networking error debugging event but forgets to attach the already loaded data to that event, and then, when the user clicks the play button, continues the download by requesting the rest of the file as normal. Thus, on Chromium, for media files Hoardy-Web will only ever get 206 Partial Content HTTP responses with the first few kilobytes of file data missing. This bug has no good workaround; all alternatives to Hoardy-Web that work with Chromium work around it by silently re-downloading the file a second time in the background.
  • Similarly to unpatched Firefox, Chromium-based browsers do not supply contents of files in POST request data. They do, however, provide a way to see if files were present in the request, so Hoardy-Web will mark such and only such requests as having a “partial request body”. There is no patch for Chromium to fix this, nor do I plan to make one (feel free to contribute one, though).

    As with Firefox, disabling this toggle will disable archiving of such broken requests. This is not recommended, however, as archiving some data is usually better than archiving none.

  • Chromium fails to provide openerTabId to tabs created with the chrome.tabs.create API, so in the unlikely case of opening two or more new tabs/windows in rapid succession via Hoardy-Web context menu actions and not giving them time to initialize, Hoardy-Web could end up mixing up settings between the newly created tabs/windows. This bug is impossible to trigger unless your system is very slow or you are clicking things with automation tools like AutoHotKey or xnee.
  • To properly collect all the data about a reqres, Hoardy-Web has to use both the data generated by webRequest API and Chromium’s own debugging API events, using only one of those is usually insufficient. But Chromium generates different request IDs for events generated by these two different APIs and also generates those events in arbitrary order. Therefore, Hoardy-Web tracks reqres generated by both sets of APIs separately and then matches those two lists against each other heuristically, merging matching reqres together. Which is ugly enough. But then Chromium sometimes generates debugging API events and forgets to produce the corresponding webRequest API events, or vice versa, thus leaving some of those reqres unmatched.

    To work around that, Hoardy-Web waits this many seconds for new events to arrive, and if none do, forcefully finishes all unmatched but network-complete in_flight reqres. Yes, this means that some minor metadata fields (like document_url) of those reqres might be missing, but waiting more time usually won’t fix it, so Hoardy-Web can’t do anything else there.

  • However, sometimes Chromium forgets to generate both loading-complete and loading-failed debugging events. This usually happens when a request gets started and then canceled by a page’s JavaScript, or when you navigate between pages too fast.

    In that case, Hoardy-Web can’t tell if a reqres is just slow at being loaded or if Chromium forgot about it, so those reqres will get stuck in the in_flight state indefinitely, at least until their originator tab gets closed, or until you press one of this or that buttons.

    Hoardy-Web might get another workaround for this bug later.

Error messages and codes

Desktop notifications

  • Failed to archive <N> items in the queue because Hoardy-Web can't establish a connection to the archiving server at <URL>

    Are you running the archiving server script?

  • Failed to archive <N> items in the queue because requests to the archiving server failed with: <STATUS> <REASON>: <RESPONSE>

    Your archiving server is returning HTTP errors when Hoardy-Web is trying to archive data to it. See your archiving server’s console for more information.

    Some common reasons it could be failing:

    • No space left on the device you are archiving to.
    • It’s a bug.
  • Failed to open/create a database via `IndexedDB` API, all data persistence will be done via `storage.local` API instead. This is not ideal, but not particularly bad. However, the critical issue is that it appears Hoardy-Web previously used `IndexedDB` for archiving and/or stashing reqres.

    So, it worked before, but why doesn’t it work now? The most likely reason is: you are running Hoardy-Web under a browser based on an older version of Firefox and you have recently enabled Always use private browsing mode setting in your browser’s config. Older versions of Firefox forbid the use of IndexedDB API when that setting is set.

    To make archives currently saved in IndexedDB accessible to Hoardy-Web under Always use private browsing mode you need to:

    All old data should be available from the Saved in Local Storage page now.

Errors recorded in reqres, as seen in the-log

Most error codes are produced by attaching one of the following prefixes to the raw error code given by the browser:

  • webRequest:: prefix is prepended to errors produced by the code working with webRequest API;
  • debugger:: prefix is prepended to errors produced by the code working with Chromium’s Debugger API;
  • filterResponseData:: prefix is prepended to errors produced by webRequest.filterResponseData API (these can usually be ignored, since Firefox generates normal webRequest:: codes for those reqres too, when it was an actual error; but Hoardy-Web still collects them, adhering to “collect everything as browser gives it, when possible” philosophy).

In particular, webRequest::NS_ prefix on Firefox, and webRequest::net:: and debugger::net:: prefixes on Chromium signify various issues produced by the networking stacks of those browsers. For instance:

  • webRequest::NS_ERROR_ABORT on Firefox and webRequest::net::ERR_ABORTED on Chromium signify that this request was aborted before it finished, e.g. because the originator tab was closed before it was fully loaded; Firefox also uses this code to mean what Chromium signifies with various BLOCKED codes;
  • webRequest::net::ERR_BLOCKED_BY_CLIENT on Chromium signifies that an extension blocked it;
  • debugger::net::ERR_BLOCKED:: is a prefix for other errors when the request was blocked, e.g. by CSP;
  • webRequest::NS_ERROR_NET prefix on Firefox and webRequest::net::ERR_FAILED error on Chromium signify various networking issues.
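In other words, the composition is purely mechanical; a trivial illustration follows (the function is hypothetical, only the resulting strings mirror the codes listed in this section):

#+begin_src python
def prefixed_error(api: str, raw_code: str) -> str:
    # api is "webRequest", "debugger", or "filterResponseData"
    return f"{api}::{raw_code}"

assert prefixed_error("webRequest", "NS_ERROR_ABORT") == "webRequest::NS_ERROR_ABORT"
assert prefixed_error("debugger", "net::ERR_ABORTED") == "debugger::net::ERR_ABORTED"
#+end_src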

The exception to the above rule of keeping everything as raw as possible are webRequest::capture:: and debugger::capture:: prefixes which signify various errors produced by Hoardy-Web itself in its webRequest- or debugger-handling code, respectively. In particular:

  • webRequest::capture::EMIT_FORCED::BY_USER and debugger::capture::EMIT_FORCED::BY_USER are produced when you forcefully advance a reqres from in-flight state by pressing this or that button;
  • debugger::capture::EMIT_FORCED::BY_DETACHED_DEBUGGER is produced when Chromium debugger gets detached from its tab while a reqres inside that tab is still in flight;
  • debugger::capture::EMIT_FORCED::BY_CLOSED_TAB is produced when a tab gets closed while a reqres inside of it is still in flight;
  • debugger::capture::NO_RESPONSE_BODY:: is a prefix for errors produced when getting request’s response body from Chromium’s debugger fails for various reasons;
  • webRequest::capture::CANCELED::NO_DEBUGGER is produced when a non-main-frame request is canceled by Hoardy-Web because no debugger is available to capture it; in the case of a main frame request, Hoardy-Web will cancel the request and reload the tab, as discussed above, so this error will not be produced; but it can happen if a page tries to load a sub-frame (like iframe) while the debugger for the tab (and, thus, the main frame) did not attach yet (which only happens for pages where Chromium disallows debugging, or when Hoardy-Web gets enabled after the page in question already started loading, e.g. the very first page after the browser starts); also, this can happen when the debugger gets detached after the main frame was captured but its resources are still loading.
  • webRequest::capture::RESPONSE::BROKEN is produced when some response metadata is unavailable.

    At the moment, this only appears to happen on Firefox when a request gets fulfilled by a service or shared worker after Firefox had already sent it to the server. Firefox then interrupts the networking code and generates an NS_ERROR_NET_ON_* error about the event failing to supply the response metadata generated by the service/shared worker.

Frequently Asked Questions

Does Hoardy-Web send any of my captured web browsing data anywhere?

Hoardy-Web only ever sends your data to the archiving =Server URL= you specify when the =Archive 'collected' reqres by > ... submitting them via 'HTTP'= option is enabled.

Nowhere else. Never else.

Does Hoardy-Web collect and send any telemetry anywhere?

For your convenience, Hoardy-Web saves some global stats across restarts (e.g., the Collected, Discarded, Picked, and Dropped lines).

However, none of those are ever sent anywhere and you can reset them at any time.

Will the answers to the above two questions ever change in a future version of Hoardy-Web?

No. I (the author) hate non-consensual data collection.

In fact, as you might have noticed, Hoardy-Web, unlike most other browser extensions, is almost trivial to reproducible-build from source on a POSIX-compliant system with a Nix package manager installed, and it has a privately operated source code mirror.

This is by design: I expect a chunk of Hoardy-Web users to be paranoid enough to only ever build it from source and install the results manually into their LibreWolf or some such, leaving zero telemetry fingerprints anywhere.

Hoardy-Web asks for a lot of permissions, what does it use all those permissions for?

  • <all_urls> permission is used so that Hoardy-Web could capture all URLs.
  • webRequest and webRequestBlocking permissions are used to track and capture HTTP requests and responses; on Chromium the latter also requires the debugger permission, which Hoardy-Web also asks for there.
  • tabs permission is used for tracking per-tab state and stats, making Hoardy-Web’s toolbar icon show per-tab state, taking DOM snapshot of all tabs, buttons switching to a related tab in the-log, etc.
  • storage permission is used to save extension config and stats.
  • unlimitedStorage permission is used for archival of captured data to browser’s local storage.
  • menus (contextMenus on Chromium) permission is used to add context-menu shortcut actions for links.
  • notifications permission is used to send desktop notifications, which are mostly used for reporting various issues.

Can I use Hoardy-Web to capture web pages while my browser runs with JavaScript disabled?

Yes.

Can I use Hoardy-Web to capture web pages that use a lot of JavaScript?

This is why the DOM-snapshot buttons exist; see the following question.

In principle, Hoardy-Web will capture everything your browser fetches from the network as you browse the web, except for, at the moment, WebSockets data. So, web pages using only simple UI-related JavaScript code will work fine when you start replaying them “from scratch” via =hoardy-web export mirror= (also there) or some such.

However, in the most general case, “from scratch” replay of pages dynamically generated via JavaScript is not guaranteed. For example, consider a web page with a JavaScript code that generates a random number, then queries a remote server with that number, and then renders the result somehow. Obviously, such a web page can not be replayed “from scratch” since it will generate a new random number and your archive probably won’t have the corresponding server’s response for it.

Can I use Hoardy-Web to capture a web page as it currently is, after all JavaScript was run, not as it was when it was last fetched from the network?

Yes, you can capture DOM (Document Object Model) snapshots of all frames of the currently active tab by pressing this button in the popup.

Doing that will generate and capture snapshots of the raw HTML or XML of each frame contained in the currently active tab. (Reqres-wise, they will be 200 OK responses, but with the protocol set to SNAPSHOT and the method set to DOM.)

You can also do that for all open tabs at once by pressing that button.

How can I make Hoardy-Web capture a web page completely, especially when parts of it are loaded lazily?

In the most general case, you will have to scroll the page around and click random buttons and media elements.

Hoardy-Web has no “autopilot” for doing this, nor will it ever get one, at least as part of Hoardy-Web extension, since “autopiloting” is very website-specific. So, at the moment, the most general semi-automated solution is to run a website-specific UserScript via Tampermonkey or some such, wait until everything finishes loading, and then take a snapshot. (Hoardy-Web will get an integration for automating that, eventually.)

On the other hand, if you

  • run Hoardy-Web under Firefox,
  • just want to load all lazily-loaded images the page already has (NOT load more stuff), and
  • the page in question uses modern HTML5 lazy loading attributes instead of using JavaScript to do the same,

then you can simply go to about:config and toggle dom.image-lazy-loading.enabled to false. All images will start being loaded eagerly after that.

Can I use Hoardy-Web to capture a web page without archiving it, look at it, decide if I want to save it, and archive it only if I do, all without reloading the page a second time?

Yes. This is why =Pick into limbo= setting exists. See above for more info.

In combination with =Automatic actions for recently closed tabs= options you can implement any of the following workflows:

  • archive everything by default, but allow to exclude some things by manually discarding them from limbo;
  • only archive things that are explicitly manually collected, discard everything else by default.

Why can pages under https://addons.mozilla.org/ and https://chromewebstore.google.com/ not be captured by Hoardy-Web?

Browsers prevent extensions from running on extension store pages to prevent them from manipulating ratings, reviews, and other such things. However, you can archive https://addons.mozilla.org/ pages by running Hoardy-Web under Chromium and https://chromewebstore.google.com/ pages by running Hoardy-Web under Firefox.

When running Hoardy-Web under Chromium, a lot of my captures fail with debugger::capture::EMIT_FORCED::BY_DETACHED_DEBUGGER, debugger::capture::NO_RESPONSE_BODY::DETACHED_DEBUGGER, webRequest::capture::CANCELED::NO_DEBUGGER, and similar errors. What do I do?

You are either

  • pressing the Cancel or Close (cross) buttons in the Chromium’s popup-toolbar telling you about the debugger being enabled, and so Chromium detaches it, breaking everything (see above);
  • pressing Space or Escape keyboard keys when doing things in Chromium’s UI, but nothing at that particular moment reacts to the key you pressed, except there is that popup-toolbar… and so Chromium decides it must mean you want to press the Cancel button there … and detaches the debugger, breaking everything (again);

    yes, this is really annoying, and this is a common problem for me, since I usually page-down using Space and press Escape a lot (usually to cancel selection, but sometimes also as a trauma of a long-time Vim user);

    the only solution to this I know of is to just not touch the keyboard at all, at least while things are still loading; i.e. just click on stuff using the mouse/track-point/touch-pad/touchscreen/etc, wait for the T (“Tracking”) to vanish from the extension’s badge, and only then let your (grabby and impatient for exercise via keyboard shortcuts) fingers touch the keyboard;

    even then, Chromium will detach debuggers from time to time seemingly at random, but at least it will be rare enough that you won’t need to reload much;

  • trying to capture large or media files; as discussed above, this has no workaround, run Hoardy-Web under Firefox instead.

Also, Chromium will occasionally detach its debugger at random; it just happens.

When running Hoardy-Web under Firefox, some of my captures fail with webRequest::capture::RESPONSE::BROKEN. What do I do?

This is a rare error caused by a race condition between webpage’s service/shared worker and browser’s networking code.

Usually, you can ignore this error, since loading another related page is likely to fulfill the same URL.

However, if this happens a lot to you, or if it annoys you, you can go to about:config, toggle dom.serviceWorkers.enabled to false, and restart the browser. Alternatively, you can use NoScript or some such extension to disable JavaScript, and thus the offending service/shared workers, on the page in question.

Why does a (specific) URL or some part of it fail to be properly captured by Hoardy-Web?

Did you read the notes on the bugs of the browser you are using above?

Most notably:

  • both Firefox- and Chromium-based browsers in their default builds fail to properly supply POST request data to their extensions; for Firefox-based browsers there exists a patch that fixes it, mostly; Chromium users are out of luck at the moment;
  • on a Chromium-based browser, because of limitations of Chromium’s debugging interface, it is impossible to properly capture media files (both audio and video) and large files in general; this issue has no good work-around and, AFAIK, all alternatives to Hoardy-Web running on Chromium-based browsers suffer from it (and work around it by silently re-downloading said files a second time in the background); try using Hoardy-Web under a Firefox-based browser instead.

The documentation claims that all Hoardy-Web archival methods except for submission via =HTTP= are unsafe. Why?

Archival by exporting using =saveAs= (generation of fake-Downloads) can fail and **lose a bit of your collected data at a time** if you press a wrong button in your browser’s UI, mis-reconfigure your browser a bit, or your disk runs out of space unexpectedly.

Archival to browser’s local storage (which is what Hoardy-Web is doing by default) can **lose all your collected data at the same time** if you uninstall the extension by accident.

Meanwhile, archival by submission via =HTTP= has none of these problems:

  • Hoardy-Web will keep each reqres in memory until the archiving server responds with 200 OK for that reqres;
  • the archiving server will only respond with 200 OK to Hoardy-Web after the dump is written and fsync-ed to disk (see the sketch after this list);
  • the archiving server never deletes any of your archived data; by using an archiving server, you can only lose your archived data if you go to its directory and delete some of it yourself, or if your disk dies, or if your file system gets corrupted; all of those problems are solved by regular backups.
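A minimal sketch of the write-then-fsync-then-respond guarantee mentioned in the list above, assuming a plain file path; this is not the actual archiving server’s code.

#+begin_src python
import os

def save_dump_durably(dump: bytes, path: str) -> None:
    with open(path, "wb") as f:
        f.write(dump)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes actually hit the disk
    # only after this returns does the server reply with 200 OK,
    # and only then does Hoardy-Web drop that reqres from memory
#+end_src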

Archival to browser’s local storage was added because it was very easy to implement after the stash was added. It is the default because it usually works fine, it properly reports errors, has the most consistent behaviour across all browsers, and does not require the user to install any Python code, which helps with on-boarding.

In the ideal world, browsers would provide a better saveAs API which would have a less annoying UI for the user and would return out-of-disk-space errors to the extension, in which case exporting via =saveAs= would be the default.

As it is now, the only way to be absolutely sure your data is properly saved to disk forever when the extension reports it archived is to use submission via =HTTP=.

When running Hoardy-Web under Firefox, enabling export via =saveAs= makes the browser’s UI quite annoying. Can it be fixed?

Yes, go to about:config and toggle browser.download.alwaysOpenPanel to false.

This page does not answer my question. What do I do?

If the whole content of this page (not just this section, did you try searching for stuff with Control+F? there’s a lot of info here) does not explain your problem, open an issue on GitHub or get in touch otherwise.