It is highly recommended that you view this page by clicking the Help button in the extension’s own UI.
Doing that will make this page interactive: the settings popup will be displayed on the right of this page, and hovering over or clicking on any links pointing to popup.html will highlight those elements in the popup.
See screenshots if you want to see how it will look.
You can still read this page outside of the extension’s UI, but be prepared for all links pointing to popup.html to be useless.
Also, the version hosted on the author’s web site is superior to what GitHub’s web UI renders (this page is written in the org-mode markup language; converting it to GitHub Markdown would make things much harder, since it uses a lot of advanced markup features of org-mode to simplify things, and GitHub does not render org-mode files very well at the moment).
Hoardy-Web is a browser extension (add-on) that passively captures and collects dumps of HTTP requests and responses as you browse the web, and then archives them using one or more of the following methods:
- by generating fake-Downloads containing either
  - separate dumps (one dump of an HTTP request+response per file, also there) or
  - bundles of them (many dumps in a single file),
- by archiving separate dumps to your own private archiving server (like the hoardy-web-sas simple archiving server, also there), or
- by archiving separate dumps to your browser’s local storage.
To view your archived data, see the accompanying hoardy-web CLI tool (also there).
- A reqres (REQuest + RESponse) is an internal object containing captured information about an HTTP request and its response, including their headers and data, and some meta-information (whether it originates from an extension, the tabId it originates from, its state, etc).
Reqres change their internal states according to the following state diagram (which is explained below):
(start) -> (request sent) -> (nIO) -> (headers received) -> (nIO) --> (body received) | | | | | v v v | (no_response) (incomplete) (complete) | | | | | \ | | |\---> (canceled) ----\ \ | | | \ \ \ | |\-> (incomplete_fc) ---\ \ \ v | >------>---------------------------->-----> (finished) |\--> (complete_fc) ----/ / | | / / | \----> (snapshot) ----/ /- (collected) <--------- (picked) <--/ | / ^ | | (stashIO?) <----/ | v v | \-- (in_limbo) <- (stashIO?) <- (dropped) v | | (queued) <------------------\ | | / | ^ \ \ \-----> (discarded) <-----/ (exported) <-/ | | \----------------\ \ ^ | | | \ \ | | /---/ \-----------------\ \ \ | | | | \ \ | | v | \ \ | |\-> (srvIO) -> (stashIO?) -> (failed) | \ | | | ^ / \ | | v | v | | | (submitted) --------------> (saveIO) --> (saved) | {{!saving}} | \ | \-------->-----------------------------------------------/
Hoardy-Web attaches to your browser’s runtime and tracks progress of HTTP requests and their responses, capturing both their request and response headers and data at appropriate times in the browser’s request and response processing pipeline.
Whether Hoardy-Web will track a given request depends on the Track new reqres toggles in the settings popup, e.g.:
- this toggle allows you to disable tracking of newly spawned HTTP requests globally, thus essentially disabling Hoardy-Web,
- this one controls whether Hoardy-Web will track new reqres originating from the currently active tab,
- this one controls whether it will track new reqres originating from new tabs opened from the currently active tab (aka “children tabs”, e.g. via middle mouse click, context menu, etc),
- while this one controls whether it will track new reqres originating from new tabs opened via browser’s “New Tab” browser action (i.e. the plus sign in the tab bar, Control+T, menu item, etc),
- and so forth for the others (press ? symbols to see a tooltip explaining what each of them does).
Disabling any of these toggles does not stop tracking of already initiated requests; it only stops new requests controlled by that toggle from being tracked.
As shown on the above diagram, a new reqres proceeds through the following networking states:
- start: the starting state;
- request sent, (response) headers received, (response) body received: these are the normal HTTP request stages (stages of =webRequest= sub-API of WebExtensions API);
- nIO: normal network IO performed by the browser in between HTTP request stages;
- canceled: request was canceled before it was sent
  - by you, the user, manually, via the Stop button;
  - by the browser when redirecting an http:// URL to an https:// URL in HTTPS-only mode;
  - by an ad-blocking extension like uBlock Origin;
  - etc;
  unsent would have probably been a better name for this, but all browsers call it canceled internally, so Hoardy-Web follows that convention;
- no_response: request was sent, but no response was received:
  - you canceled it manually via the Stop button before it got a response;
  - connection to the server was rejected;
  - the server decided to ignore the request completely;
  - network timeout was reached;
  - etc;
- incomplete: request was sent, response headers were received, but then the loading was interrupted before all of the response body was received;
- incomplete_fc: only on Firefox-based browsers: the browser loaded the response data of this reqres directly from its cache, but did not give it to Hoardy-Web;
  this is just how Firefox handles things sometimes; usually, this only happens for images;
  this is a separate state, because usually this means this URL was successfully archived before; if it was not, reload the page with Control+F5;
- complete: request was completed successfully;
- complete_fc: request was completed successfully from browser’s cache;
- snapshot: this reqres was produced by taking a DOM (Document Object Model) snapshot (using one of the appropriate buttons in the popup), i.e. it was produced by capturing raw HTML or XML of the current state of the tab/frame, not by capturing a network request;
- finished: the terminal state of this step, no new events for this reqres will come from the browser.
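The networking states above can be read as a small state machine. The following is a minimal Python sketch of that reading, not the extension’s actual code: the `Reqres` class, its `advance` and `finish` methods, and the two constants are hypothetical names introduced for illustration, and only the state names come from the text.

```python
# Hypothetical sketch; only the state names come from the text above.
REQUEST_STAGES = ["start", "request sent", "headers received", "body received"]

# States a reqres can have just before it becomes `finished`.
FINAL_NETWORKING_STATES = {
    "canceled", "no_response", "incomplete", "incomplete_fc",
    "complete", "complete_fc", "snapshot",
}


class Reqres:
    def __init__(self):
        self.state = "start"
        self.final_networking_state = None

    def advance(self):
        """Move to the next normal HTTP request stage."""
        if self.state not in REQUEST_STAGES[:-1]:
            raise ValueError(f"cannot advance from {self.state!r}")
        self.state = REQUEST_STAGES[REQUEST_STAGES.index(self.state) + 1]

    def finish(self, final_state):
        """Record the final networking state and enter `finished`."""
        if self.state == "finished":
            raise ValueError("reqres is already finished")
        if final_state not in FINAL_NETWORKING_STATES:
            raise ValueError(f"unknown final networking state {final_state!r}")
        self.final_networking_state = final_state
        self.state = "finished"
```

For example, a request whose headers arrived but whose body was interrupted would call `advance()` twice and then `finish("incomplete")`; after that, no further transitions are accepted.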
In principle, on reaching the finished state the reqres can be serialized and saved to disk, but Hoardy-Web provides more states and UI for convenience and to work around limitations of various browser APIs (a WebExtensions API function call that writes a data chunk into a file on a local file system while reporting out-of-disk-space errors does not exist).
- An in-flight reqres (current tab) is a reqres that did not reach the finished state yet; in the history-log such reqres will be shown to be in the in_flight state.
  These two stats are represented as sums of two numbers:
  - the number of reqres that are still being tracked via the webRequest or debugger API; and
  - the number of reqres that have finished being tracked and are now waiting for all their events to finish processing.
  On Firefox, nothing should ever get stuck; if something seems to be stuck in the in_flight state, it’s probably still loading (or it is a bug in the browser, which does happen, very rarely).
  On Chromium, limitations of Chromium’s debugging interface mean a request can get stuck among the reqres represented by the first number above. If the first number is zero, however, then the second should also rapidly become zero, at most after two times this many seconds.
  If some reqres got stuck in one of the in_flight states, you can forcefully move them out of that state using this and/or that popup buttons.
- A finished reqres is a reqres that reached the finished state.
- Final networking state is the last state a reqres had before it finished: i.e. complete, incomplete, canceled, etc.
On reaching the finished state, Hoardy-Web performs reqres classification controlled by =Pick reqres for archival when they finish= and =Mark reqres as ‘problematic’ when they finish= settings.
The former set decides whether the reqres in question should be picked or dropped, which influences the actions Hoardy-Web will perform in the next step.
The latter set decides if the reqres in question should be marked as problematic.
The problematic reqres status is a flag (NOT a state) that does not influence archival or any actions discussed in the later steps.
It exists because browsers provide no indication when some parts of the page failed to load properly: they expect you to actually look at the page with your eyes to notice something looking broken (and reload it manually) instead, which is counterproductive when you want to be sure that the whole page with all its resources was archived.
After all, parts of a dynamically loaded page might simply silently fail to be rendered by associated JavaScript because some of the HTTP requests that JavaScript did in background failed, or, on a static web page, layout and CSS might have made some of the incompletely loaded parts of the page invisible (by design or by accident).
So, to provide an indicator for such cases, Hoardy-Web keeps a log of problematic reqres and displays the number of elements in the log in its toolbar button’s badge.
By default, HTTP requests that failed to get a response, those that have incomplete response bodies, and those for which the browser reported potentially problematic errors but then Hoardy-Web picked them anyway, will be marked as problematic.
Problematic errors are errors like
- “this request failed because of a networking issue”,
- “this request was aborted because the JavaScript function making it decided to cancel it when you moved your mouse cursor away from a video thumbnail it was needed for”,
- and similar things that probably imply some part of the page was left unfetched,
but NOT errors like
- “fetching of this request was aborted because the server redirected it to a URL blocked by uBlock Origin”,
- “the browser decided against rendering of this data”,
- “the browser failed to render this data because this image file is broken”,
- and similar errors where the data was properly fetched.
(In principle, Hoardy-Web could have been designed to never record the errors of the latter category in the first place, thus simplifying the above bit, but Hoardy-Web is designed to follow the philosophy of “collect everything as browser gives it, as raw as possible, do all the post-processing logic separately, allow for no logic at all, if the user asks for it”.)
The raw error strings reported by the browser for each reqres can be seen in the history-log.
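The default marking rule described above can be sketched as a small predicate. This is an illustration under stated assumptions, not the extension’s implementation: the `FinishedReqres` type, the `error_kinds` field, and the category labels in `PROBLEMATIC_ERROR_KINDS` are hypothetical stand-ins for the raw browser error strings that Hoardy-Web actually inspects.

```python
from dataclasses import dataclass, field

# Assumed labels for the "problematic" error category; the real extension
# works with raw error strings reported by the browser.
PROBLEMATIC_ERROR_KINDS = {"network_failure", "aborted_by_script"}


@dataclass
class FinishedReqres:
    final_state: str   # final networking state, e.g. "complete"
    picked: bool       # True if picked, False if dropped
    error_kinds: list = field(default_factory=list)


def is_problematic(reqres):
    """Default rule: no response, incomplete body, or a problematic
    error on a reqres that was picked anyway."""
    if reqres.final_state in ("no_response", "incomplete"):
        return True
    has_problematic_error = any(
        kind in PROBLEMATIC_ERROR_KINDS for kind in reqres.error_kinds
    )
    return reqres.picked and has_problematic_error
```

Note that an error such as a redirect blocked by an ad-blocker falls outside the assumed problematic set, so a reqres carrying only that error is not marked.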
If this option is enabled, Hoardy-Web will generate a desktop notification each time a new problematic reqres gets produced.
If you don’t care about the problematic flag and it annoys you, you should disable that option, not options under =Mark reqres as ‘problematic’ when they finish= settings.
This way you can still see the number of problematic reqres in the extension’s toolbar button’s badge.
- A picked reqres (current tab) is a finished reqres that satisfied the conditions controlled by =Pick reqres for archival when they finish= settings on entering the finished state.
- A dropped reqres (current tab) is a finished reqres that did NOT satisfy the conditions controlled by =Pick reqres for archival when they finish= settings on entering the finished state.
- A problematic reqres (current tab) is a finished reqres that satisfies the conditions set by =Mark reqres as ‘problematic’ when they finish= settings.
On exit from the finished state each reqres gets split into
- a loggable, which is a hollow reqres structure without any request or response data, i.e. it only keeps the metadata used by history-log, and
- a dump, which is a serialized CBOR-formatted dump of the original reqres structure.
Since those (loggable, dump) tuples can be reconstructed back into the original reqres structures, the following will continue to refer to them as if nothing changed whenever the fact that they are now internally represented by those tuples is not relevant.
Normally, picked reqres proceed to the collected state and get queued for archival while dropped reqres proceed to being discarded from memory.
When =Archive ‘collected’ reqres by= toggle is enabled, those queued reqres proceed directly to the next step.
However, sometimes you might want to actually look at a web page before deciding if you want to archive it or not.
The naive way to do it would be to load a page with capture disabled first, look at it, and then, if you want to save it, enable capture and reload the page with browser’s cache disabled via Control+F5 (and it has to be Control+F5, not just F5, because otherwise some URLs might, on Firefox, produce reqres in the incomplete_fc state, and, on Chromium, have their fetching silently skipped).
Obviously, this is both annoying and will force you to fetch everything twice.
Which is why Hoardy-Web implements “limbo mode”.
With one of the limbo mode options enabled, Hoardy-Web will instead capture everything as normal, but then, instead of sending the reqres in question to collected or discarded states immediately, it will put them into the in_limbo state where they will linger until you collect or discard them manually by pressing the appropriate buttons, or until =Automatic actions for recently closed tabs= options make a decision semi-automatically for you.
A picked reqres will be put into in_limbo when =Pick into limbo= setting is enabled in the currently active tab or when one-of-the-other settings is enabled for other reqres sources.
Similarly, a dropped reqres will be put into in_limbo when =Drop into limbo= setting is enabled in the currently active tab or when one-of-the-other settings is enabled for other reqres sources.
(This latter option mainly exists for debugging.)
If this option is enabled and there are more than this number of reqres in_limbo or the total size of all dumps in_limbo is more than this size (in MiB), Hoardy-Web will complain to remind you to collect or discard some of them so that your browser does not waste too much memory (and so that you won’t lose too much data if something crashes while =Stash ‘collected’ reqres into local storage= option discussed below is disabled).
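The limbo routing and the reminder condition described above can be sketched as two small functions. All names here (`next_state`, `should_complain_about_limbo`, and their parameters) are hypothetical and only mirror the settings described in the text; this is not how the extension is necessarily structured.

```python
MIB = 1024 * 1024


def next_state(picked, pick_into_limbo, drop_into_limbo):
    """Where a finished reqres goes next, given the limbo settings."""
    if picked:
        return "in_limbo" if pick_into_limbo else "collected"
    return "in_limbo" if drop_into_limbo else "discarded"


def should_complain_about_limbo(dump_sizes, max_count, max_size_mib, enabled=True):
    """True when limbo holds more reqres, or more bytes, than allowed.

    `dump_sizes` is a list with the size in bytes of each dump in limbo.
    """
    if not enabled:
        return False
    too_many = len(dump_sizes) > max_count
    too_big = sum(dump_sizes) > max_size_mib * MIB
    return too_many or too_big
```

For instance, `next_state(True, True, False)` gives `"in_limbo"`, while the same picked reqres with limbo mode off goes straight to `"collected"`.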
- A collected reqres (current tab) is a reqres that was (either automatically or manually) sent to the collected state.
- A discarded reqres (current tab) is a reqres that was (either automatically or manually) sent to the discarded state.
- An in-limbo reqres (current tab) is a reqres that is being held =in_limbo= until you manually collect or discard it.
- A queued reqres (displayed on the Queued/Failed line) is a collected reqres that is still queued for archival.
The stashed reqres status is, essentially, a flag that says this reqres was temporarily backed up to browser’s local storage.
When =Archive ‘collected’ reqres by= option is disabled but =Stash ‘collected’ reqres into local storage= option is enabled, instead of archiving newly queued reqres, Hoardy-Web will stash their (loggable, dump) tuples into browser’s local storage.
Similarly, when both =Stash ‘collected’ reqres into local storage= option and =Stash ‘in_limbo’ reqres= option (or one-of-the-other similar options) are enabled, then newly generated in_limbo reqres will also get immediately stashed into browser’s local storage.
Moreover, the following section will discuss how Hoardy-Web will also try stashing reqres that failed to be archived into browser’s local storage.
In other words, stashing exists to prevent loss of successfully captured but not yet archived data in situations where
- you quit or restart your browser, or Hoardy-Web crashes or gets reloaded unexpectedly,
- your computer unexpectedly loses power,
before you collected or discarded everything from in_limbo or Hoardy-Web has successfully archived everything from its archiving queue.
Note, however, that even with stashing enabled Hoardy-Web will skip disk IO whenever possible: e.g., if both =Archive ‘collected’ reqres by= and =Archive ‘collected’ reqres by > … submitting them via ‘HTTP’= options discussed below are enabled, Hoardy-Web will first try to archive each new collected reqres straight from memory to the archiving server, and only if that process fails will it attempt stashing them to local storage instead.
Meaning that
- stashing of non-=in_limbo= reqres is usually completely free, so you should probably keep that option always enabled;
- stashing of in_limbo reqres via one of those options is not free, so if you almost never archive from limbo then keeping those options enabled will waste disk IO, and you might want to disable at least some of them in that case.
The above also implies that, technically, stashing is not a silver bullet against data loss.
To try and make it such would mean unconditional immediate stashing of all captured data, which would waste a lot of disk IO on most Hoardy-Web configurations.
When both =Archive ‘collected’ reqres by= option and =Stash ‘collected’ reqres into local storage= option are disabled, then, after a new reqres gets queued, Hoardy-Web will generate a new desktop notification complaining about it, unless that option is disabled too.
You can also forcefully stash all currently queued, in_limbo, and failed reqres by pressing this button.
It stashes everything immediately and unconditionally, ignoring all other stashing settings.
- A stuck queued reqres is a queued reqres that got stuck in the archival queue, e.g. because it got queued while =Archive ‘collected’ reqres by= option was disabled.
- A stashed reqres is a reqres that was temporarily stashed (backed-up) into browser’s local storage while it is still being kept in Hoardy-Web’s memory. I.e., the stash is a persistent on-disk backup for in-memory reqres.
On entering collected or discarded state, loggable metadata of each reqres is copied into the recent reqres history-log and is kept there until the size of the log reaches this many elements, at which point the older elements of the log start being elided automatically.
You can also ask Hoardy-Web to forget all history manually by pressing this button, or to forget history of reqres generated by the currently active tab by pressing that button instead, or do the same by using similar buttons in the log.
Using the log will also allow the use of reqres filtering options for doing this, allowing you to selectively forget parts of history.
Note, however, that problematic reqres will not get automatically elided from the log, nor forgotten by using the above buttons. To forget about them, you will have to unset the problematic flag on the respective reqres via this button, or that button, or use similar buttons in the log.
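The retention behaviour described above can be sketched as a bounded log that never elides problematic entries. The `HistoryLog` class and its methods are hypothetical, and the sketch assumes that problematic entries count towards the size limit, which the text does not specify.

```python
class HistoryLog:
    """Bounded log of loggables that never drops problematic entries."""

    def __init__(self, max_elements):
        self.max_elements = max_elements
        self.entries = []  # oldest first

    def append(self, loggable):
        self.entries.append(loggable)
        self._elide()

    def _elide(self):
        # Drop the oldest non-problematic entries until the log fits
        # (or until only problematic entries remain to be dropped).
        excess = len(self.entries) - self.max_elements
        if excess <= 0:
            return
        kept = []
        for entry in self.entries:
            if excess > 0 and not entry.get("problematic"):
                excess -= 1
                continue
            kept.append(entry)
        self.entries = kept

    def forget_all(self):
        """Manual 'forget history': problematic entries survive."""
        self.entries = [e for e in self.entries if e.get("problematic")]

    def unmark_problematic(self):
        """Unset the flag so the entries can be elided or forgotten later."""
        for entry in self.entries:
            entry["problematic"] = False
```

With a limit of two, appending a problematic entry `a` followed by ordinary entries `b` and `c` leaves `a` and `c` in the log, since `b` is the oldest entry that is allowed to be elided.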
When =Archive ‘collected’ reqres by= toggle is enabled, Hoardy-Web will pop queued reqres from the archival queue one by one and then perform one or more of the following (in the order they are listed):
- if =Archive ‘collected’ reqres by > … exporting them using ‘saveAs’= option is enabled, Hoardy-Web will export the dump by generating a fake-Download containing it (either as a separate file or as part of a bundle);
- if =Archive ‘collected’ reqres by > … submitting them via ‘HTTP’= option is enabled, Hoardy-Web will submit the dump to the archiving server at =Server URL= setting by making an HTTP POST request with the dump as request body (which is denoted by srvIO states on the diagram above);
- if any of the above fails, Hoardy-Web will
  - move the reqres into the failed state,
  - if =Stash ‘collected’ reqres into local storage= option is enabled, try stashing the (loggable, dump) tuple into browser’s local storage (which is denoted by stashIO states on the diagram above) and record but ignore any errors produced while doing that, and
  - stop processing this reqres;
- otherwise, if =Archive ‘collected’ reqres by > … saving them into local storage= option is enabled, Hoardy-Web will
  - try to save the (loggable, dump) tuple into browser’s local storage (which is denoted by saveIO states on the diagram above),
  - if saving fails, move the reqres into the failed state instead, and stop processing this reqres;
- finally, if =Archive ‘collected’ reqres by > … saving them into local storage= option is disabled or if saving to local storage succeeds, Hoardy-Web will discard the reqres from memory.
You can enable more than one archival method at the same time.
For a given loggable, Hoardy-Web will remember and skip previously successful archival methods if the loggable ever returns to the archival queue again (e.g., when one of the archival methods fails and you later ask Hoardy-Web to retry the archival, or when you re-queue a reqres from local storage from the Saved in Local Storage page).
Note the difference between stashed and saved reqres:
- stashed reqres are kept in memory until they get successfully archived by all configured archival methods (or until you manually discard them, in case they were stashed in_limbo);
- saved reqres get dumped into browser’s local storage and, if that succeeds, discarded from memory (until you manually load them back from there).
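The ordering of archival steps, together with the skipping of previously successful methods, can be sketched as follows. This is a simplified illustration, not the extension’s code: the `archive_one` function, the configuration keys, the `archived_via` field, and the callback signatures (each returning `True` on success) are all assumptions, and the sketch presumes at least one archival method is enabled.

```python
def archive_one(reqres, config, export, submit, save, stash):
    """Run the enabled archival methods in order; return the new state."""
    # Methods that already succeeded for this reqres in earlier attempts.
    done = reqres.setdefault("archived_via", set())

    for name, action in (("export", export), ("submit", submit)):
        if not config.get(name) or name in done:
            continue
        if action(reqres):
            done.add(name)
            continue
        # A method failed: mark as failed, try stashing, stop processing.
        reqres["state"] = "failed"
        if config.get("stash"):
            try:
                stash(reqres)
            except Exception as err:
                # Stashing errors are recorded but otherwise ignored.
                reqres["stash_error"] = str(err)
        return reqres["state"]

    if config.get("save") and "save" not in done:
        if not save(reqres):
            reqres["state"] = "failed"
            return reqres["state"]
        done.add("save")

    # At this point the reqres can be discarded from memory.
    reqres["state"] = "archived"
    return reqres["state"]
```

When a failed reqres is retried later, the methods recorded in `archived_via` are skipped, so a successful export is not repeated just because the submission failed the first time.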
Sometimes you might want to split your archivals into separate buckets to simplify future hoarding and sharing of collected archives. E.g., by default you might want to put everything into the “default” bucket, but then you might want to put reqres produced by a select tab where you just logged into your personal account into the “private” bucket instead.
To implement this, for each reqres in the archival queue, Hoardy-Web computes a bucket parameter from the appropriate “Bucket” setting, e.g.
- this one will be used for requests originating from the currently active tab,
- this one will be used for requests originating from new child tabs opened from the currently active tab (e.g. via middle mouse click, context menu, etc),
- while this one will be used for new tabs opened via browser’s “New Tab” browser action (i.e. the plus sign in the tab bar, Control+T, menu item, etc),
- and so forth for the others (press ? symbols to see a tooltip explaining what each of them does).
Evaluation of the bucket parameter is done just before each archival attempt, so if the queue is not yet empty, and you disable =Archive ‘collected’ reqres by=, edit some of the “Bucket” settings, and enable it again, Hoardy-Web will start using the new setting immediately.
When exporting via saveAs, the bucket value will be used in the file name of the generated fake-Download .wrrb file and the dumps will be split into separate fake-Download files by said bucket.
I.e., internally, the bundle discussed above is actually a set of per-bucket bundles.
When submitting to an HTTP server, Hoardy-Web will specify bucket as a query parameter (named “profile”, for historical reasons) to each HTTP POST request.
When stashing or saving to local storage, Hoardy-Web will record the value of bucket into each loggable before saving data to disk.
If you restart your browser, thus starting a new Hoardy-Web session, Hoardy-Web will use the old stashed/saved bucket values for all new attempted archivals of old reqres generated by previous sessions.
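As an illustration of the HTTP submission case, the sketch below builds (but does not send) a POST request that carries the bucket in the `profile` query parameter and the dump as the request body. The helper names and the example server URL are made up for this sketch; the actual endpoint path is whatever the =Server URL= setting points at.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import Request


def submission_url(server_url, bucket):
    """Append the bucket as the `profile` query parameter."""
    parts = urlsplit(server_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("profile", bucket))
    return urlunsplit(parts._replace(query=urlencode(query)))


def build_submission(server_url, bucket, dump):
    """Build an HTTP POST request with the dump as its body (not sent here)."""
    return Request(submission_url(server_url, bucket), data=dump, method="POST")
```

Because the URL is rebuilt on each call, this mirrors the behaviour described above: changing a “Bucket” setting between attempts changes the `profile` value used by the next attempt.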
- An exported reqres is a reqres that was successfully exported by generating a fake-Download containing its dump.
- A submitted reqres is a reqres that was successfully submitted to the archiving server and thus was discarded from memory.
- A saved reqres is a reqres that was successfully saved by being archived into browser’s local storage.
- An archived reqres is either an exported, submitted, or saved reqres.
As noted above, if any of the archival methods fail, the reqres in question will be moved into the failed state.
Submissions of reqres that failed because of networking issues will be retried automatically every 60 seconds.
Archivals of reqres rejected by the archiving server or those that failed to be saved to browser’s local storage will not be retried automatically, as those usually happen when there is no space left on the device you are archiving to.
You can retry all failed archivals by pressing this button. You can also use it to nudge the archiving sub-process awake if some things got stuck in the queue by accident, e.g., after the extension got reloaded with a non-empty queue, or if you previously quit your browser before everything was archived.
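The retry policy above distinguishes failures by their cause. A minimal sketch of that decision, with hypothetical failure-kind labels and field names, might look like this:

```python
RETRY_INTERVAL_SECONDS = 60  # from the text: retried every 60 seconds


def should_auto_retry(failure_kind):
    """Only networking failures are retried automatically.

    Server rejections and local storage failures are left for a manual
    retry, since they usually mean the target is out of space.
    """
    return failure_kind == "network"


def due_for_retry(failures, now):
    """Select failed reqres whose automatic retry is due at time `now`."""
    return [
        f for f in failures
        if should_auto_retry(f["kind"])
        and now - f["failed_at"] >= RETRY_INTERVAL_SECONDS
    ]
```

A manual retry, by contrast, would simply re-queue every failed reqres regardless of its failure kind.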
If this option is enabled and a new reqres recently moved to the failed state, a new desktop notification will be generated.
If this option is enabled, a new desktop notification will be generated when the archival queue becomes empty for the very first time or after any failures.
- A failed to archive reqres (displayed on the Queued/Failed line) is a reqres that failed to be archived by one of the enabled archival methods.
Hoardy-Web provides a bunch of keyboard and context menu shortcuts to allow using it in more efficient ways.
- On Firefox-based browsers, you can see and edit all keyboard shortcuts via Add-ons and themes (about:addons) -> the gear icon -> Manage Extension Shortcuts.
- On Chromium-based browsers, you can see and edit all keyboard shortcuts via the menu -> Extensions -> Manage Extensions (chrome://extensions/) -> Keyboard shortcuts (on the left).
Hoardy-Web provides shortcuts to:
- open the Internal State and Logs page, {{{shortcut(showState)}}};
- open the Internal State and Logs page, scrolled to the end of the log, {{{shortcut(showLog)}}};
- open the Internal State and Logs page narrowed to the currently active tab’s data, {{{shortcut(showTabState)}}};
- open the Internal State and Logs page narrowed to the currently active tab’s data, scrolled to the end of the log, {{{shortcut(showTabLog)}}};
- toggle tracking of newly spawned HTTP requests in the currently active tab and set tracking in its children to the same value at the same time, {{{shortcut(toggleTabConfigTracking)}}};
- toggle tracking of newly spawned HTTP requests in the currently active tab’s children, {{{shortcut(toggleTabConfigChildrenTracking)}}};
- toggle limbo mode in the currently active tab and set limbo mode in its children to the same value at the same time, {{{shortcut(toggleTabConfigLimbo)}}};
- toggle limbo mode in the currently active tab’s children, {{{shortcut(toggleTabConfigChildrenLimbo)}}};
- unmark all problematic reqres, {{{shortcut(unmarkAllProblematic)}}};
- unmark all current tab’s problematic reqres, {{{shortcut(unmarkAllTabProblematic)}}};
- collect all reqres from limbo, {{{shortcut(collectAllInLimbo)}}};
- collect all reqres from limbo for the currently active tab, {{{shortcut(collectAllTabInLimbo)}}};
- discard all reqres from limbo, {{{shortcut(discardAllInLimbo)}}};
- discard all reqres from limbo for the currently active tab, {{{shortcut(discardAllTabInLimbo)}}};
- take DOM snapshot of all tabs for which =Track newly generated requests= setting is enabled, {{{shortcut(snapshotAll)}}};
- take DOM snapshot of the currently active tab, {{{shortcut(snapshotTab)}}}.
Hoardy-Web provides context menu actions to:
- open a given link in a new tab with currently active tab’s tracking in children tabs setting negated.
  I.e.,
  - right-mouse clicking while pointing at a link and
  - selecting Hoardy-Web > Open Link in New Tracked/Untracked Tab menu item,
is equivalent to
- do the same thing, but opening it in a new window.
- When the extension is (re-)loaded, all tabs inherit the values of these settings.
- Hoardy-Web does not implement collection of WebSockets data on any of the supported browsers.
  (Firefox does not support it. Chromium does support it, in theory, but I have not tried using that API, so I have no idea how well it works.)
  This is a low-priority issue since you can simply take a DOM snapshot instead of capturing and later replaying WebSocket messages to in-page JavaScript. Also, capturing and archiving a DOM snapshot will free you from needing to run any JavaScript at all when you decide to return to view the archived page later, which is nice.
- On Chromium, response data of background requests and requests made by other extensions does not get collected, since there’s no tab to attach a debugger to, and I have not figured out how to attach the debugger to other things yet.
- On Firefox, fetches that spawn new downloads will be marked as problematic by default, since Firefox’s implementation of the webRequest.filterResponseData API does not provide their contents to the extension and I have not figured out how to distinguish them from other fetches yet.
- On all supported browsers, if an HTTP server supplies the same header multiple times (which happens sometimes, most commonly with Set-Cookie headers), then the archived response headers will usually become weird, with multiple headers squished into a single value, separated by newline symbols.
  This is just the way both Firefox (usually) and Chromium (always) supply those headers to extensions and Hoardy-Web does not try to undo it.
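If you post-process archived headers yourself, the squished values can be split back into separate headers. The following sketch shows one way to do that; the `split_squished_headers` helper and the list-of-pairs representation are assumptions made for illustration, and this is not something Hoardy-Web itself does.

```python
def split_squished_headers(headers):
    """Split newline-joined header values back into separate headers.

    `headers` is a list of (name, value) pairs; the order is preserved.
    """
    result = []
    for name, value in headers:
        for part in value.split("\n"):
            result.append((name, part))
    return result
```

For example, a single `Set-Cookie` entry with the value `"a=1\nb=2"` becomes two `Set-Cookie` entries, while headers without newlines pass through unchanged.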
- On Firefox-based browsers, without the patch (also there), the browser only supplies formData to webRequest.onBeforeRequest handlers, thus making it impossible to recover the actual request body for a POST request.
  Hoardy-Web will mark such requests as having a “partial request body” and try its best to recover the data from the formData structure, but if a POST request was uploading files, they won’t be recoverable from formData (in fact, it is not even possible to tell if there were any files attached there), and so your archived request data will be incomplete even after Hoardy-Web did its best.
  Disabling this toggle will disable archiving of such broken requests. This is not recommended, however, as archiving some data is usually better than archiving none.
  With the above patch applied, small POST requests will be archived completely and correctly. POST requests that upload large files, and only those, will be marked as having a “partial request body”.
- If-Modified-Since and If-None-Match headers never get archived, because the browser never supplies them to the extensions. Thus, you can get a 304 Not Modified reqres response to a seemingly normal GET request.
- Reqres of already cached media files (images, audio, video, except for svg and favicons) will end in the incomplete_fc state because the webRequest.filterResponseData API does not provide response bodies for such requests. This toggle controls if such reqres should be picked.
  By default, Hoardy-Web will drop them. Usually this is not a problem since such media will be archived on first (non-cached) access. But if you want to force everything on the page to be archived, you can reload the page without the cache with Control+F5.
- Firefox fails to run the onstop method of the webRequest.filterResponseData filter for the very first HTTP/2 request the browser makes after you start it, thus making the reqres of that request incomplete. If this option is enabled, Hoardy-Web will transparently work around this bug by redirecting the very first navigation request to about:blank and then reloading the tab with its original URL.
- Firefox-based browsers provide no API for archiving WebSockets data at the moment, unfortunately.
All of the above apply, moreover:
- Archival by exporting using =saveAs= is not supported at the moment because of this bug.
On Chromium-based browsers, there is no way to get HTTP response data without attaching Chromium’s debugger to the tab a request originates from.
This makes things a bit tricky, for instance:
- With this and this option enabled, new tabs will be reset to this value (about:blank by default) because the default of chrome://newtab/ does not allow attaching the debugger to tabs with chrome: URLs.
- Requests made before the debugger is attached will get canceled by Hoardy-Web. So, for instance, when you middle-click a link, Chromium will open a new tab, but Hoardy-Web will block the requests from there until the debugger gets attached and then automatically reload the tab after. As a side-effect of this, Chromium will show the Request blocked page until the debugger is attached and the page is reloaded, meaning it will get visually stuck on the Request blocked page if fetching the request ended up spawning a download instead of showing a page. The download will proceed as normal, though.
- You will get an annoying notification bar constantly displayed in the browser while =Hoardy-Web= is enabled.
  Closing that notification will detach the debugger. Hoardy-Web will reattach it immediately because it assumes you don’t want to lose data and closing that notification by accident is, unfortunately, quite easy.
  However, closing the notification will make all in-flight requests lose their response data.
  All alternatives to Hoardy-Web that work with Chromium suffer from the same issue.
  If you disable this option, the debuggers will get detached only after all requests finish. But even if there are no requests in-flight, the notification will not disappear immediately. Chromium takes its time updating the UI after the debugger is detached.
Moreover, Chromium has the following long-standing issues/bugs making things difficult:
- Chromium will automatically detach a debugger from a tab if it tries to save too much data into its debugger state.
Which means that a tab that loads too much data too fast will get its debugger detached.
Chromium does this to try and save memory, but this, among other issues, means that large images will fail to be properly archived, and any page that loads such files is likely to fail to be archived too.
This is a design limitation of Chromium debugging interface, there appears to be no work-around for this at the moment.
Meanwhile, on Firefox, Hoardy-Web uses webRequest.filterResponseData API (not available on Chromium, because it greatly enhances browser’s ad-blocking capabilities) which does not suffer from this problem.
- Chromium will occasionally detach debuggers from some tabs at random.
It just happens.
Fortunately, Hoardy-Web will mark the resulting broken reqres as problematic by default as they match the conditions of at least one of this, this, or that options.
- Chromium handling of media files (audio and video) within its debugging interface is very strange.
When Chromium encounters a media file, it immediately loads the first few frames of it, then cancels the rest of the download, generates a networking error debugging event, but forgets to give the already loaded data to it, and then, when the user clicks the play button, continues the download by requesting the rest of the file as normal.
Thus, on Chromium, for media files Hoardy-Web will only ever get 206 Partial Content HTTP responses with the first few kilobytes of file data missing. This bug has no good workaround, all alternatives to Hoardy-Web that work with Chromium work around it by silently re-downloading the file the second time in background.
- Similarly to unpatched Firefox, Chromium-based browsers do not supply contents of files in POST request data. They do, however, provide a way to see if files were present in the request, so Hoardy-Web will mark such and only such requests as having a “partial request body”. There is no patch for Chromium to fix this, nor do I plan to make one (feel free to contribute one, though).
As with Firefox, disabling this toggle will disable archiving of such broken requests. This is not recommended, however, as archiving some data is usually better than archiving none.
- Chromium fails to provide openerTabId to tabs created with chrome.tabs.create API, so in the unlikely case of opening two or more new tabs/windows in rapid succession via Hoardy-Web context menu actions and not giving them time to initialize, Hoardy-Web could end up mixing up settings between the newly created tabs/windows. This bug is impossible to trigger unless your system is very slow or you are clicking things with automation tools like AutoHotKey or xnee.
- To properly collect all the data about a reqres,
Hoardy-Web has to use both the data generated by webRequest API and Chromium’s own debugging API events, using only one of those is usually insufficient. But Chromium generates different request IDs for events generated by these two different APIs and also generates those events in arbitrary order. Therefore, Hoardy-Web tracks reqres generated by both sets of APIs separately and then matches those two lists against each other heuristically, merging matching reqres together. Which is ugly enough. But then Chromium sometimes generates debugging API events and forgets to produce the corresponding webRequest API events, or vice versa, thus leaving some of those reqres unmatched.
To work around that, Hoardy-Web waits this many seconds for new events to arrive, and if none do, forcefully finishes all unmatched but network-complete in_flight reqres. Yes, this means that some minor metadata fields (like document_url) of those reqres might be missing, but waiting more time usually won’t fix it, so Hoardy-Web can’t do anything else there.
- However, sometimes Chromium forgets to generate both
loading-complete and loading-failed debugging events. This usually happens when a request gets started and then canceled by a page’s JavaScript, or when you navigate between pages too fast.
In that case, Hoardy-Web can’t tell if a reqres is just slow at being loaded or if Chromium forgot about it, so those reqres will get stuck in the in_flight state indefinitely, at least until their originator tab gets closed, or until you press one of this or that buttons.
Hoardy-Web might get another workaround for this bug later.
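The idea of matching two event streams and then giving up on the stragglers after a timeout can be sketched roughly as follows. This is only an illustration, not Hoardy-Web's actual heuristic: the `Event` fields, the choice of `(method, url)` as the matching key, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A simplified record of one request as seen by one API (hypothetical shape)."""
    source: str             # "webRequest" or "debugger"
    request_id: str         # IDs differ between the two APIs, so they are useless for matching
    method: str
    url: str
    network_complete: bool  # did the network side of this request finish?
    last_seen: float        # time of the last event for this request, in seconds


def match_events(web_events, debugger_events):
    """Pair events from both APIs by (method, url), in arrival order.

    Returns (matched_pairs, unmatched_events).
    """
    pending = {}
    for ev in debugger_events:
        pending.setdefault((ev.method, ev.url), []).append(ev)

    matched, unmatched = [], []
    for ev in web_events:
        candidates = pending.get((ev.method, ev.url))
        if candidates:
            matched.append((ev, candidates.pop(0)))
        else:
            unmatched.append(ev)
    for leftovers in pending.values():
        unmatched.extend(leftovers)
    return matched, unmatched


def force_finish(unmatched, now, timeout):
    """Split unmatched events into (finished, waiting).

    An event is finished forcefully when it is network-complete and nothing
    new arrived for it for at least `timeout` seconds. Everything else keeps
    waiting, which models reqres that stay in flight.
    """
    finished, waiting = [], []
    for ev in unmatched:
        if ev.network_complete and now - ev.last_seen >= timeout:
            finished.append(ev)
        else:
            waiting.append(ev)
    return finished, waiting
```

In this model, an event that never becomes network-complete stays in `waiting` forever, which mirrors the stuck in_flight reqres described in the last item above.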
Failed to archive <N> items in the queue because Hoardy-Web can't establish a connection to the archiving server at <URL>
Are you running the archiving server script?
Failed to archive <N> items in the queue because requests to the archiving server failed with: <STATUS> <REASON>: <RESPONSE>
Your archiving server is returning HTTP errors when Hoardy-Web is trying to archive data to it. See your archiving server’s console for more information.
Some common reasons it could be failing:
- No space left on the device you are archiving to.
- It’s a bug.
Failed to open/create a database via `IndexedDB` API, all data persistence will be done via `storage.local` API instead. This is not ideal, but not particularly bad. However, the critical issue is that it appears Hoardy-Web previously used `IndexedDB` for archiving and/or stashing reqres.
So, it worked before, but why doesn’t it work now? The most likely reason is: you are running Hoardy-Web under a browser based on an older version of Firefox and you have recently enabled Always use private browsing mode setting in your browser’s config. Older versions of Firefox forbid the use of IndexedDB API when that setting is set.
To make archives currently saved in IndexedDB accessible to Hoardy-Web under Always use private browsing mode you need to:
- Disable Always use private browsing mode browser setting and restart the browser, thus allowing Hoardy-Web access to IndexedDB again.
- Ensure =Prefer ‘IndexedDB’ API= setting is disabled.
- Ensure =Archive ‘collected’ reqres by > … saving them into local storage= option is enabled.
- Ensure =Archive ‘collected’ reqres by= is enabled.
- Open the Saved in Local Storage page.
- Set In 'storage.local' filter there to false (red).
- Press Re-queue button there to re-archive all those saved reqres from IndexedDB to storage.local.
- Now, you can re-enable the Always use private browsing mode browser setting and restart your browser again.
All old data should be available from the Saved in Local Storage page now.
Most error codes are produced by attaching one of the following prefixes to the raw error code given by the browser:
- webRequest:: prefix is prepended to errors produced by the code working with webRequest API;
- debugger:: prefix is prepended to errors produced by the code working with Chromium’s Debugger API;
- filterResponseData:: prefix is prepended to errors produced by webRequest.filterResponseData API (these can usually be ignored, since Firefox generates normal webRequest:: codes for those reqres too, when it was an actual error; but Hoardy-Web still collects them, adhering to “collect everything as browser gives it, when possible” philosophy).
In particular, webRequest::NS_ prefix on Firefox, and webRequest::net:: and debugger::net:: prefixes on Chromium signify various issues produced by the networking stacks of those browsers.
For instance:
- webRequest::NS_ERROR_ABORT on Firefox and webRequest::net::ERR_ABORTED on Chromium signify that this request was aborted before it finished, e.g. because the originator tab was closed before it was fully loaded; Firefox also uses this code to mean what Chromium signifies with various BLOCKED codes;
- webRequest::net::ERR_BLOCKED_BY_CLIENT on Chromium signifies that an extension blocked it;
- debugger::net::ERR_BLOCKED:: is a prefix for other errors when the request was blocked, e.g. by CSP;
- webRequest::NS_ERROR_NET prefix on Firefox and webRequest::net::ERR_FAILED error on Chromium signify various networking issues.
The exception to the above rule of keeping everything as raw as possible are webRequest::capture:: and debugger::capture:: prefixes which signify various errors produced by Hoardy-Web itself in its webRequest- or debugger-handling code, respectively.
In particular:
- webRequest::capture::EMIT_FORCED::BY_USER and debugger::capture::EMIT_FORCED::BY_USER are produced when you forcefully advance a reqres from in-flight state by pressing this or that button;
- debugger::capture::EMIT_FORCED::BY_DETACHED_DEBUGGER is produced when Chromium debugger gets detached from its tab while a reqres inside that tab is still in flight;
- debugger::capture::EMIT_FORCED::BY_CLOSED_TAB is produced when a tab gets closed while a reqres inside of it is still in flight;
- debugger::capture::NO_RESPONSE_BODY:: is a prefix for errors produced when getting request’s response body from Chromium’s debugger fails for various reasons;
- webRequest::capture::CANCELED::NO_DEBUGGER is produced when a non-main-frame request is canceled by Hoardy-Web because no debugger is available to capture it; in the case of a main frame request, Hoardy-Web will cancel the request and reload the tab, as discussed above, so this error will not be produced; but it can happen if a page tries to load a sub-frame (like iframe) while the debugger for the tab (and, thus, the main frame) did not attach yet (which only happens for pages where Chromium disallows debugging, or when Hoardy-Web gets enabled after the page in question already started loading, e.g. the very first page after the browser starts); also, this can happen when the debugger gets detached after the main frame was captured but its resources are still loading.
- webRequest::capture::RESPONSE::BROKEN is produced when some response metadata is unavailable.
At the moment, this only appears to happen on Firefox when a request gets fulfilled by a service or shared worker after Firefox had already sent it to the server. Firefox then interrupts the networking code and generates NS_ERROR_NET_ON_* error about the event failing to supply the response metadata generated by the service/shared worker.
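When sifting through archived reqres, the prefix scheme described above can be taken apart mechanically. The following is a minimal sketch under the assumption that error codes are plain `::`-separated strings exactly as shown above; the function name and the returned dictionary keys are made up for illustration and are not part of any Hoardy-Web API.

```python
KNOWN_SOURCES = ("webRequest", "debugger", "filterResponseData")


def parse_error_code(code):
    """Split an error code like 'webRequest::net::ERR_ABORTED' into parts.

    Returns a dict with:
      source -- which API-handling code produced it, or None if unprefixed
      origin -- 'hoardy-web' for webRequest::capture:: and debugger::capture:: codes,
                'network' for browser networking-stack codes,
                'other' for everything else
      raw    -- the remainder after the source prefix
    """
    source, sep, rest = code.partition("::")
    if not sep or source not in KNOWN_SOURCES:
        return {"source": None, "origin": "other", "raw": code}

    if source in ("webRequest", "debugger") and rest.startswith("capture::"):
        origin = "hoardy-web"
    elif source == "webRequest" and rest.startswith("NS_"):
        origin = "network"  # Firefox networking stack
    elif source in ("webRequest", "debugger") and rest.startswith("net::"):
        origin = "network"  # Chromium networking stack
    else:
        origin = "other"
    return {"source": source, "origin": origin, "raw": rest}
```

For example, `parse_error_code("debugger::capture::EMIT_FORCED::BY_CLOSED_TAB")` reports `origin` as `'hoardy-web'`, while `parse_error_code("webRequest::NS_ERROR_ABORT")` reports `'network'`.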
Hoardy-Web only ever sends your data to the archiving =Server URL= you specify when the =Archive 'collected' reqres by > ... submitting them via 'HTTP'= option is enabled.
Nowhere else. Never else.
For your convenience, Hoardy-Web saves some global stats across restarts (e.g., the Collected, Discarded, Picked, and Dropped lines).
However, none of those are ever sent anywhere and you can reset them at any time.
No. I (the author) hate non-consensual data collection.
In fact, as you might have noticed, Hoardy-Web, unlike most other browser extensions, is almost trivial to build reproducibly from source on a POSIX-compliant system with the Nix package manager installed, and it has a privately operated source code mirror.
This is by design: I expect a chunk of Hoardy-Web users to be paranoid enough to only ever build it from source and install the results manually into their LibreWolf or some such, leaving zero telemetry fingerprints anywhere.
- <all_urls> permission is used so that Hoardy-Web could capture all URLs.
- webRequest and webRequestBlocking permissions are used to track and capture HTTP requests and responses; on Chromium the latter also requires the debugger permission, which Hoardy-Web also asks for there.
- tabs permission is used for tracking per-tab state and stats, making Hoardy-Web’s toolbar icon show per-tab state, taking DOM snapshot of all tabs, buttons switching to a related tab in the log, etc.
- storage permission is used to save extension config and stats.
- unlimitedStorage permission is used for archival of captured data to browser’s local storage.
- menus (contextMenus on Chromium) permission is used to add context-menu shortcut actions for links.
- notifications permission is used to send desktop notifications, which are mostly used for reporting various issues.
Yes.
This is why DOM-snapshot buttons exist, see the following question.
In principle, Hoardy-Web will capture everything your browser fetches from the network as you browse the web, except for, at the moment, WebSockets data.
So, web pages using only simple UI-related JavaScript code will work fine when you start replaying them “from scratch” via =hoardy-web export mirror= (also there) or some such.
However, in the most general case, “from scratch” replay of pages dynamically generated via JavaScript is not guaranteed.
For example, consider a web page with JavaScript code that generates a random number, then queries a remote server with that number, and then renders the result somehow.
Obviously, such a web page can not be replayed “from scratch” since it will generate a new random number and your archive probably won’t have the corresponding server’s response for it.
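The random-number example can be reduced to a toy model. Everything here is hypothetical (the URL, the function names, the idea of an archive as a plain dictionary keyed by URL); it only illustrates why a replay that draws a different number finds nothing to serve.

```python
def query_url(n):
    """URL the hypothetical page requests for a given random number."""
    return f"https://example.org/api?n={n}"


def replay(archive, n):
    """Look up the response the page would need when it draws the number `n`."""
    return archive.get(query_url(n))


# During the original visit the page drew 17, so only that response was archived.
archive = {query_url(17): "result for 17"}

# A replay that happens to draw the same number finds the archived response,
# but the page draws a fresh number each time, and that one was never fetched.
same = replay(archive, 17)
fresh = replay(archive, 42)
```

Here `same` holds the archived response, while `fresh` is `None` because no request for that number was ever captured.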
Can I use Hoardy-Web to capture a web page as it currently is, after all JavaScript was run, not as it was when it was last fetched from the network?
Yes, you can capture DOM (Document Object Model) snapshots of all frames of the currently active tab by pressing this button in the popup.
Doing that will generate and capture snapshots of raw HTML or XML for each frame contained in the currently active tab.
(Reqres-wise they will be 200 OK responses, but with protocol set to SNAPSHOT and method set to DOM.)
You can also do that for all open tabs at once by pressing that button.
How can I make Hoardy-Web capture a web page completely, especially when parts of it are loaded lazily?
In the most general case, you will have to scroll the page around and click random buttons and media elements.
Hoardy-Web has no “autopilot” for doing this, nor will it ever get one, at least as part of Hoardy-Web extension, since “autopiloting” is very website-specific.
So, at the moment, the most general semi-automated solution is to run a website-specific UserScript via Tampermonkey or some such, wait until everything finishes loading, and then take a snapshot.
(Hoardy-Web will get an integration for automating that, eventually.)
On the other hand, if you
- run Hoardy-Web under Firefox,
- just want to load all lazily-loaded images the page already has (NOT load more stuff), and
- the page in question uses modern HTML5 lazy loading attributes instead of using JavaScript to do the same,
then you can simply go to about:config and toggle dom.image-lazy-loading.enabled to false.
All images will start being loaded eagerly after that.
Can I use Hoardy-Web to capture a web page without archiving it, look at it, decide if I want to save it, and archive it only if I do, all without reloading the page a second time?
Yes. This is why =Pick into limbo= setting exists. See above for more info.
In combination with =Automatic actions for recently closed tabs= options you can implement any of the following workflows:
- archive everything by default, but allow to exclude some things by manually discarding them from limbo;
- only archive things that are explicitly manually collected, discard everything else by default.
Why can’t pages under https://addons.mozilla.org/ and https://chromewebstore.google.com/ be captured by Hoardy-Web?
Browsers prevent extensions from running on extension store pages to prevent them from manipulating ratings, reviews, and such.
However, you can archive https://addons.mozilla.org/ pages by running Hoardy-Web under Chromium and https://chromewebstore.google.com/ pages by running Hoardy-Web under Firefox.
When running Hoardy-Web under Chromium, a lot of my captures fail with debugger::capture::EMIT_FORCED::BY_DETACHED_DEBUGGER, debugger::capture::NO_RESPONSE_BODY::DETACHED_DEBUGGER, webRequest::capture::CANCELED::NO_DEBUGGER, and similar errors. What do I do?
You are either
- pressing the Cancel or Close (cross) buttons in the Chromium’s popup-toolbar telling you about the debugger being enabled, and so Chromium detaches it, breaking everything (see above);
- pressing Space or Escape keyboard keys when doing things in Chromium’s UI, but nothing at that particular moment reacts to the key you pressed, except there is that popup-toolbar… and so Chromium decides it must mean you want to press Cancel button there … and detaches the debugger, breaking everything (again);
yes, this is really annoying, and this is a common problem for me, since I usually page-down using Space and press Escape a lot (usually to cancel selection, but sometimes also as a trauma of a long-time Vim user);
the only solution to this I know of is to just not touch the keyboard at all, at least while things are still loading; i.e. just click on stuff using the mouse/track-point/touch-pad/touchscreen/etc, wait for the T (“Tracking”) to vanish from the extension’s badge, and only then let your (grabby and impatient for exercise via keyboard shortcuts) fingers touch the keyboard;
even then, Chromium will detach debuggers from time to time seemingly at random, but at least it will be rare enough that you won’t need to reload much;
- trying to capture large or media files; as discussed above, this has no workaround, run Hoardy-Web under Firefox instead.
Also, Chromium will occasionally detach its debugger at random, it just happens.
When running Hoardy-Web under Firefox, some of my captures fail with webRequest::capture::RESPONSE::BROKEN. What do I do?
This is a rare error caused by a race condition between webpage’s service/shared worker and browser’s networking code.
Usually, you can ignore this error, since loading another related page is likely to fulfill the same URL.
However, if this happens a lot to you, or if it annoys you, you can go to about:config, toggle dom.serviceWorkers.enabled to false, and restart the browser.
Alternatively, you can use NoScript or some such extension to disable JavaScript, and thus the offending service/shared workers, on the page in question.
Did you read the notes on the bugs of the browser you are using above?
Most notably:
- both Firefox- and Chromium-based browsers in their default builds fail to properly supply POST request data to their extensions; for Firefox-based browsers there exists a patch that fixes it, mostly; Chromium users are out of luck at the moment;
- on a Chromium-based browser, because of limitations of the Chromium’s debugging interface, it is impossible to properly capture media files (both audio and video) and large files in general; this issue has no good work-around and, AFAIK, all alternatives to Hoardy-Web running on Chromium-based browser suffer from it (and work around it by silently re-downloading said files the second time in background); try using Hoardy-Web under a Firefox-based browser instead.
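If you want to check whether a given media capture is one of those partial ones, the status code and the standard Content-Range response header are usually enough. The sketch below assumes you already have those two values at hand for a captured response; how to extract them from an archive is not shown, and the function name is made up for illustration.

```python
import re

# Matches the "bytes <start>-<end>/<total>" form of the Content-Range header,
# where <total> may be "*" when the full size is unknown.
_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def covers_whole_file(status, content_range):
    """Guess if a response carries the complete file.

    `status` is the HTTP status code and `content_range` is the value of the
    Content-Range header, or None when the header is absent. The guess is
    based on these two values only.
    """
    if status == 200:
        return True
    if status != 206 or content_range is None:
        return False
    m = _CONTENT_RANGE.match(content_range.strip())
    if m is None:
        return False
    start, end, total = m.group(1), m.group(2), m.group(3)
    if total == "*":
        return False  # total size unknown, so completeness can't be confirmed
    return int(start) == 0 and int(end) == int(total) - 1
```

A capture like `covers_whole_file(206, "bytes 4096-9999/10000")` returns `False`, which is the "first few kilobytes missing" situation described above.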
The documentation claims that all Hoardy-Web archival methods except for submission via =HTTP= are unsafe. Why?
Archival by exporting using =saveAs= (generation of fake-Downloads) can fail and **lose a bit of your collected data at a time** if you press a wrong button in your browser’s UI, mis-reconfigure your browser a bit, or your disk runs out of space unexpectedly.
Archival to browser’s local storage (which is what Hoardy-Web is doing by default) can **lose all your collected data at the same time** if you uninstall the extension by accident.
Meanwhile, archival by submission via =HTTP= has none of these problems:
- Hoardy-Web will keep each reqres in memory until the archiving server responds with 200 OK for that reqres;
- the archiving server will only respond with 200 OK response to Hoardy-Web after the dump is written and fsync-ed to disk;
- the archiving server never deletes any of your archived data; by using an archiving server, you can only lose your archived data if you go to its directory and delete some of it yourself, or if your disk dies, or if your file system gets corrupted; all of those problems are solved by regular backups.
Archival to browser’s local storage was added because it was very easy to implement after the stash was added. It is the default because it usually works fine, it properly reports errors, has the most consistent behaviour across all browsers, and does not require the user to install any Python code, which helps with on-boarding.
In the ideal world, browsers would provide a better saveAs API which would have a less annoying UI for the user and would return out-of-disk-space errors to the extension, in which case exporting via =saveAs= would be the default.
As it is now, the only way to be absolutely sure your data is properly saved to disk forever when the extension reports it archived is to use submission via =HTTP=.
When running Hoardy-Web under Firefox, enabling export via =saveAs= makes the browser’s UI quite annoying. Can it be fixed?
Yes, go to about:config and toggle browser.download.alwaysOpenPanel to false.
If the whole content of this page (not just this section, did you try searching for stuff with Control+F? there’s a lot of info here) does not explain your problem, open an issue on GitHub or get in touch otherwise.