Releases: davidfstr/Crystal-Web-Archiver

v1.9.0b

23 Jun 01:55

This release contains error-handling improvements and bug fixes
in preparation for Crystal to exit beta status.

  • Parsing improvements

    • Can identify 'https://' inside JavaScript as a URL reference,
      which helps download/serve sites using Disqus.
    • Can identify URL references inside <style> elements.
    • Can identify URL references inside <* style="...">,
      which helps download/serve sites based on phpBB.
    • Can identify data: URL references inside <* srcset="...">.
    • Can rewrite URL references that use [Subresource Integrity].
  • Error handling improvements

    • Crashes now provide tracebacks with more context,
      back to the start of the thread in which the crash occurred.
  • Major fixes

    • Fix crash when dynamically downloading a served URL which is a member of
      a group being actively downloaded.
  • Minor fixes

    • Do not show a progress dialog if the related operation completes quickly,
      fixing a flickering effect especially noticeable on Windows.
    • Can now save projects named with url-unsafe characters like #.
    • Fix closing a project to no longer have a race condition that
      could cause use-after-free of wxPython objects and corrupt memory,
      potentially crashing Crystal later.

v1.8.0b

18 Mar 02:49

This release contains many workflow improvements, error-handling improvements,
and bug fixes in preparation for Crystal to exit beta status.

It is faster than ever before to define the structure of a site using the UI,
with support for loose browser-style URL entry, better guesses for names and
sources of entities, and the ability to rename entities after creation.

  • First-time-run experience improvements

    • New Root URL Dialog:
      • Accept URLs in loose format, similar to what regular web browsers accept.
      • Better error message when trying to create a duplicate root URL.
      • Disallow creation of an empty root URL.
    • New Group Dialog:
      • Improve suggested source when creating a new group.
    • New Root URL and New Group Dialogs:
      • Improve suggested name when creating a new root URL or group.
      • Make it optional to provide a name.
      • Rearrange fields to deemphasize the name field.
      • Update the selected node in the Entity Tree intelligently after
        creating or forgetting a root URL or a group.
      • Allow resizing.
      • Ignore leading and trailing whitespace in URLs and URL patterns.
    • Main Window
      • Prevent resizing the window to be too small.
    • Use ⚓️ and 📁 icons consistently in the UI to refer to
      Root URLs and Groups respectively.
  • Workflow improvements

    • Can now edit the name and source of Root URLs and Groups after creation.
    • Can mark a resource group as "do not download" to prevent its members
      from being downloaded when in an embedded context.
    • By default the Default Domain will be set to match the first Root URL
      created, enabling reliable serving of more-modern websites with
      client-side URL routing.
  • Improved support for Default URL Prefixes

    • Can now set a Default Domain when serving a downloaded project.
      Previously only a Default Directory could be set.
    • The Default Domain/Directory can be set to match a Root URL that
      is being created or edited.
    • A top-level menuitem can now be used to set the Default Domain/Directory
      to match an existing Root URL. Previously it was necessary to use a
      right-click menuitem instead.
    • Hovering over a URL or Group in the Entity Tree always shows the
      full URL or URL Pattern for the entity, even if a Default Domain
      or Default Directory is set.
  • Downloading improvements

    • Remaining time is now reported while downloading groups whose members
      are slow to download, taking >7 seconds each.
    • Don't crash when downloading a group in which some member URLs
      were already downloaded.
      • This crash bug was introduced in v1.7.0b with the new strategy of
        creating member download tasks on demand rather than upfront.
    • Don't crash when trying to download a URL that is already downloading.
    • Don't crash when trying to download a group that has no member URLs.
  • Crawling improvements

    • Download the implicit favicon referenced by the root page of any domain.
  • Parsing improvements

    • Can identify URL references to images inside <source srcset="...">.
    • Gracefully handle references to invalid URLs like "//*[@id='"
      rather than crashing.
    • Parse links from RSS and Atom feeds advertised with a specialized XML MIME type
      like application/rss+xml or application/atom+xml.
  • Serving improvements

    • XML files like Atom feeds and RSS feeds are now served correctly,
      without introducing an invalid <script> tag.
    • The log showing HTTP requests made to the served project now always
      displays inside the main window, rather than attempting to appear as an
      attached drawer on macOS and Windows.
      • Drawers are a concept not natively supported by any OS except
        macOS, and even there they are deprecated.
      • Drawers have never worked properly on Linux, due to Wayland not
        providing APIs to position windows precisely relative to each other.
      • The old drawer mode didn't stay attached to the main window properly
        when using Mission Control on macOS.
  • Error handling improvements

    • If a task crashes, show it as crashed in the UI and allow it to be dismissed.
    • If the scheduler thread crashes, show it as crashed in the UI and allow it to be restarted.
    • If an update to the entity tree crashes,
      show the crash in the UI and allow the entity tree to be refreshed.
  • Testing improvements

    • Waits now use a soft timeout in addition to a hard timeout,
      which makes it easier to tune/bump timeout durations as needed.
      • Triggering a soft timeout causes a warning to be logged.
    • Warnings logged during a test run are collected and reported
      at the end of the test run.
    • Warnings logged during a test run are reported to GitHub Actions
      as warning annotations.
    • A screenshot is taken automatically whenever a timeout error occurs
      and whenever a rich assertion method (from asserts.py) fails.
    • A terminal bell sound is played automatically when tests finish running.
    • When an abort() or SIGABRT occurs while running tests during continuous integration,
      print a stack trace using faulthandler.
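
The soft/hard timeout scheme described under Testing improvements could look roughly like the sketch below. It is not Crystal's test code: the `wait_for` function, its parameters, and this `WaitTimedOut` definition are assumptions made for illustration. The idea is that crossing the soft timeout only logs a warning, while crossing the hard timeout fails the wait.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


class WaitTimedOut(Exception):
    """Raised when the hard timeout elapses (illustrative definition)."""


def wait_for(
        condition: Callable[[], bool],
        *, soft_timeout: float,
        hard_timeout: float,
        period: float = 0.01,
        ) -> None:
    """Polls condition() until it is true or hard_timeout seconds elapse."""
    if soft_timeout > hard_timeout:
        raise ValueError('soft_timeout must not exceed hard_timeout')
    start = time.monotonic()
    soft_exceeded = False
    while True:
        if condition():
            return
        elapsed = time.monotonic() - start
        if elapsed >= hard_timeout:
            raise WaitTimedOut(f'Timed out after {hard_timeout}s')
        if not soft_exceeded and elapsed >= soft_timeout:
            # Warn once: the wait is slower than expected but has not failed yet
            soft_exceeded = True
            logger.warning(
                'Soft timeout of %ss exceeded; still waiting', soft_timeout)
        time.sleep(period)
```

With this shape, a warning that shows up repeatedly for the same wait suggests its soft timeout should be raised before the hard timeout ever starts failing tests.
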
  • Major fixes

    • Fixed multiple cases where code updating the task tree accessed
      the task hierarchy without synchronizing with the scheduler thread,
      which could cause crashes when downloading groups containing members
      that were already downloaded in the same session.
      • Access to the task hierarchy is now protected with @scheduler_affinity
        and explicit is_synced_with_scheduler_thread() checks.
      • This issue was first introduced in v1.7.0b and is now fixed.
    • For projects on a non-SSD drive, fix issue where newly created groups
      did not find any member URLs that were discovered since the project was opened.
      It was previously necessary to reopen a project to reliably find all
      members of a recently created group.
      • This issue was first introduced in v1.7.0b and is now fixed.
  • Minor fixes

    • Prevent system idle sleep while tasks are running,
      in more situations on macOS.
    • Fix disappearance of error nodes when a new root URL or group is added.
    • When trying to create a group with an empty URL pattern, show an error dialog
      rather than silently failing.

v1.7.0b

19 Dec 02:54

This release features further improvements to downloading large websites
(up to 10 million URLs). Projects open in constant time. Memory usage
while downloading large groups remains constant.

Additionally .crystalproj documents now have an appropriate icon and can be
easily opened by double-clicking that icon on all supported operating systems.

There has also been a major change to the .crystalproj format: Revisions are
now stored in a hierarchy of directories rather than as a single flat directory.
Crystal continues to be able to read and write projects of all major versions.

  • First-time-run experience improvements

    • App name, logo, and icon fixes
      • macOS: Fix application menu title and title of its menuitems
      • Windows: Add app icon and Windows-friendly title to main window
      • Linux: Fix app title and icon in dock to be correct
    • .crystalproj package changes
      • .crystalproj packages now contain a README so that users on computers
        without Crystal are informed about what a .crystalproj package is
        and how to open it
      • Windows/Linux: .crystalproj packages now have an icon
      • Windows: .crystalproj packages that are double-clicked open in Crystal
      • Windows/Linux: .crystalproj packages now contain a .crystalopen file
        so that it is easy to open a .crystalproj from a file browser
      • macOS: Hide .crystalproj and .crystalopen file extensions
  • Large project improvements (with 3,000,000 - 11,000,000 URLs)

    • Large projects now open immediately because URLs and group members
      are now loaded on demand rather than upfront.
    • Large groups now start downloading faster because member
      download tasks are now created on demand rather than upfront.
    • The .crystalproj format has a new major change:
      • .crystalproj format now stores revisions in a hierarchy of nested
        directories rather than all of them inside a single directory,
        to provide faster performance on filesystems which behave poorly
        when a single directory has very many files.
      • .crystalproj format now stores revisions in lexicographic order
        in the filesystem so that when a project is copied to a new location,
        the order of revisions on disk is preserved in the new copy.
      • Projects using this new format have a major_version of 2
        in the project_property table.
    • Projects whose database is on a solid state drive (SSD)
      will use less memory because such projects will now prefer to
      load group members via a database query rather than loading
      all project URLs into memory.
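
To illustrate how a hierarchy of nested directories can keep revisions in lexicographic order, here is a minimal sketch. The 15-digit hexadecimal ID, the 3-character directory names, and the `revision_relpath` helper are all assumptions chosen for the example; the notes above do not specify the actual on-disk layout.

```python
def revision_relpath(revision_id: int, *, digits: int = 15, chunk: int = 3) -> str:
    """
    Maps a revision ID to a nested relative path.

    Assumed layout (hypothetical): the ID is written as zero-padded
    lowercase hex and split into fixed-size chunks, one per path component.
    """
    if not (0 <= revision_id < 16 ** digits):
        raise ValueError('revision_id out of range')
    hex_id = format(revision_id, f'0{digits}x')
    parts = [hex_id[i:i + chunk] for i in range(0, digits, chunk)]
    return '/'.join(parts)


print(revision_relpath(1))       # 000/000/000/000/001
print(revision_relpath(0x1000))  # 000/000/000/001/000
```

Because every path has the same length and is zero-padded, sorting the paths as strings gives the same order as sorting the IDs numerically, and no single directory in this sketch holds more than 16³ = 4,096 entries.
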
  • Support changes

    • Add Kubuntu as a supported Linux distribution
    • Add KDE as a supported desktop environment, in addition to GNOME
    • Drop support for macOS 10.14. macOS 12+ remains supported.
  • Parsing improvements

    • Can identify <link rel="preload"> references as embedded.
    • Can identify URLs inside <script> blocks with a trailing ?... query.
    • Improved reporting of unknown types of <link rel="...">.
  • Serving improvements

    • Multiple projects can be open and serving URLs at the same time.
  • Error handling improvements

    • When expanding a URL in the Entity Tree that downloaded with an error,
      display an error node appropriately.
    • When expanding an undownloaded URL in the Entity Tree that could not be
      downloaded because the disk is full or the project has too many revisions,
      display an error node appropriately.
    • When expanding a URL in the Entity Tree whose revision body has been
      deleted, try to redownload it automatically.
  • Critical fixes

    • Fix continuous integration to regularly run UI tests on macOS once again.
    • Fix continuous integration to reliably fail if UI tests fail.
  • Minor fixes

    • Eliminated race condition where scheduler thread could try to read from
      the root task's children list concurrently with a different thread
      adding a new child to it.
    • Hide .crystalproj extension in main window title if the extension is hidden
      in the file browser
  • Backward-incompatible API changes

    • fg_call_later, fg_call_and_wait, bg_call_later:
      • Keyword arguments are now required for all optional parameters.
      • Arguments previously passed in the format *_call_*(callable, ...)
        must now be passed as *_call_*(callable, args=(...)).
      • no_profile= is replaced with profile=.
      • force= is renamed to force_later=.
    • The OpenProjectProgressListener interface has substantially changed
      to reflect the new strategy for opening projects.
      • Additionally, a new LoadUrlsProgressListener interface is
        introduced to allow monitoring of when a project decides to
        load its URLs. It can be provided to Project.__init__.
    • Project.title has been removed.
      Calculate a reasonable title from Project.path instead.
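
The new calling convention for the *_call_* functions can be tried out with the sketch below. The `fg_call_later` defined here is a hypothetical stand-in that only queues calls; it is not Crystal's implementation, and the default values for `profile` and `force_later` are assumptions. Only the parameter names and the keyword-only rule come from the notes above.

```python
from collections import deque
from typing import Any, Callable, Deque, Tuple

# Hypothetical stand-in, NOT Crystal's implementation: it only queues calls
# so that the new calling convention can be exercised outside the app.
_pending: Deque[Tuple[Callable[..., Any], tuple]] = deque()


def fg_call_later(
        callable: Callable[..., Any],
        *,  # everything after this marker must be passed by keyword
        args: tuple = (),
        profile: bool = True,       # assumed default; replaces no_profile=
        force_later: bool = False,  # assumed default; renamed from force=
        ) -> None:
    _pending.append((callable, args))


def run_pending() -> list:
    """Runs the queued calls in order and returns their results."""
    results = []
    while _pending:
        (func, args) = _pending.popleft()
        results.append(func(*args))
    return results


def greet(name: str, punctuation: str) -> str:
    return f'Hello, {name}{punctuation}'


# Old style (no longer accepted): fg_call_later(greet, 'Crystal', '!')
# New style: positional arguments travel inside args=(...)
fg_call_later(greet, args=('Crystal', '!'))
fg_call_later(greet, args=('world', '?'), profile=False, force_later=True)
```

Calling the stand-in in the old style, `fg_call_later(greet, 'Crystal', '!')`, raises a `TypeError`, which is the kind of failure callers migrating to the new API should expect to see.
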

v1.6.0b

04 Sep 18:58

This release features significant improvements to downloading large websites
that have about 10 million URLs. Projects open and close faster. The UI is faster.
Downloads are faster. Progress bars are shown for all slow operations.
Estimated time remaining is shown when downloading groups.

  • Large project improvements (with 3,000,000 - 11,000,000 URLs)

    • Open projects containing many URLs in about 50% as much time as before:
      • Approximate the URL count when loading a project in O(1) time
        rather than getting an exact URL count in O(r) time,
        where r = the number of URLs in the project
      • Decrease the time to load groups from O(r·g) to about O(r + g·log(r)),
        where r = the number of URLs in the project and
        g = the number of groups in the project
      • Defer creation of Entity Tree nodes corresponding to group members
        until the group is actually expanded
    • Close projects with very many queued tasks (such as download tasks)
      in O(1) time rather than O(t) time, where t = the number of queued tasks
    • Speed up interacting with the Entity Tree and Task Tree when
      there are very many URLs in a project:
      • Entity Tree: Speed up expanding URL nodes when large groups exist,
        now in O(k) time rather than O(r·k) time,
        where k = the number of links originating from the URL node and
        r = the number of URLs in the project.
      • Entity Tree: Load only the first 100 members of each group, on demand
      • Task Tree: Show only up to 100 children when downloading a group
    • Speed up interacting with the Add Group dialog when
      there are very many URLs in a project:
      • When typing each character of a new URL pattern and no wildcard
        has yet been typed, perform an O(1) search for matching URLs
        in the preview pane.
      • When typing each character of a new URL pattern and at least one
        wildcard has been typed, perform an O(log(r)) search for matching URLs
        in the preview pane, where r = the number of URLs in the project.
      • Previously an O(r) search was performed in both of the above cases.
    • Show progress while upgrading project with many URLs
    • Show progress dialog when starting to download a large group
    • Show elapsed time in all progress dialogs
    • Prevent system idle sleep while tasks are running (on macOS and Windows)
    • Print large numbers with comma separators or whatever the appropriate
      separator is for the current locale
    • Minimize memory use when there are very many URLs in a project
      by shrinking in-memory Resource, Task, TaskTreeNode, and NodeView objects
      by defining explicit __slots__
    • Minimize memory growth while downloading URLs in a project for
      multiple hours or days
    • If free disk space drops too low then refuse to download further resources
    • Quit immediately even when a project with many resources was open recently
    • Open preferences dialog significantly faster for projects containing many URLs
    • Significantly speed up creation of tasks that have many children,
      such as tasks that download groups with very many members
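
The memory saving from explicit `__slots__` mentioned above comes from dropping the per-instance `__dict__`. The sketch below shows the general Python technique; `PlainNode` and `SlottedNode` are hypothetical classes for illustration, not Crystal's Resource or Task classes.

```python
class PlainNode:
    """Ordinary class: each instance carries its own __dict__."""

    def __init__(self, url: str, parent_id: int) -> None:
        self.url = url
        self.parent_id = parent_id


class SlottedNode:
    """Same fields, stored in fixed slots with no per-instance __dict__."""

    __slots__ = ('url', 'parent_id')

    def __init__(self, url: str, parent_id: int) -> None:
        self.url = url
        self.parent_id = parent_id


plain = PlainNode('https://example.com/', 1)
slotted = SlottedNode('https://example.com/', 1)

print(hasattr(plain, '__dict__'))    # True
print(hasattr(slotted, '__dict__'))  # False

try:
    slotted.title = 'Home'  # not declared in __slots__
except AttributeError:
    print('slotted instances reject undeclared attributes')
```

The trade-off is that slotted instances cannot grow new attributes at runtime, which is usually acceptable for small objects that exist in the millions.
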
  • First-time-run experience improvements

    • Improve defaults
      • New/Open Project Dialog: Default to creating a new project rather
        than opening an existing one.
      • New Group Dialog: Expand "Preview Members" by default.
    • Polish user interface
      • Use consistent words to refer to common concepts
        • {Create, Add} -> New
        • {URL, Root URL} -> Root URL
      • Add menus
      • macOS: Add proxy icon to the project window, making it easier to navigate
        to the project in the Finder.
      • Add app name to version label in lower-left corner of project window.
    • Add keyboard shortcuts everywhere
    • Groups without a source can now be downloaded, as one would expect.
    • Task Tree: Remove top-level tasks that complete periodically,
      rather than waiting for all of them to complete first
  • Critical fixes

    • Linux: Fix dialog that appears on app launch to be sized correctly.
    • Linux: Fix View button to open browser even if Crystal is run from a read-only volume.
    • Linux: Fix most other dialogs to be sized correctly.
    • macOS: Fix issue where dialogs could appear at unusual locations,
      including offscreen.
  • Crawling improvements

    • Don't recurse infinitely if a resource identifies an alias of itself as an
      embedded resource.
  • Downloading improvements

    • Show estimated time remaining and speed when downloading groups and URLs
    • Download faster
      • Reinstate the ASSUME_RESOURCES_DOWNLOADED_IN_SESSION_WILL_ALWAYS_REMAIN_FRESH
        optimization that was disabled in v1.4.0b, which significantly speeds up
        downloading groups of HTML pages that link to similar URLs
      • Support immediate early completion of download tasks for URLs
        that were downloaded in the current session or a recent session
      • Record links while downloading faster by writing all of them to the
        project in bulk rather than one by one
      • Open the project's underlying SQLite database
        in Write-Ahead Logging (WAL) mode
        which is faster than the default mode
      • Change delay between downloads to be inserted after each HTML page downloads
        (with its embedded resources), rather than after each single resource downloads.
        This new behavior simulates user browsing more closely and results in
        much faster downloading of HTML pages with many images
        (or other embedded resources).
      • Parallelize download of URLs from origin server with writes to local
        database where possible.
      • Avoid querying the database for revisions of a URL if it is already
        known that there are no revisions because of other information
        cached in memory
      • Precompile XPath selectors used to parse links from HTML
      • Use an optimized version of shutil.copyfileobj that avoids
        repeatedly allocating intermediate buffers
      • Maximum download speed increased from 1 item/sec to 2 items/sec
    • Autopopulate an HTTP Date header when downloading if none provided
      by origin server, as per RFC 7231 §7.1.1.2.
    • Load HTTPS CA certificates from certifi on Windows,
      in addition to from the system CA store.
    • Load HTTPS CA certificates from $SSL_CERT_FILE if specified.
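
Opening a SQLite database in Write-Ahead Logging mode, as described under "Download faster" above, takes a single pragma. The sketch below uses Python's standard `sqlite3` module; the `open_with_wal` helper, the `database.sqlite` filename, and the `resource` table are assumptions for the example and may not match what Crystal does internally.

```python
import os
import sqlite3
import tempfile


def open_with_wal(db_path: str) -> sqlite3.Connection:
    """Opens (or creates) a SQLite database and requests WAL journaling."""
    conn = sqlite3.connect(db_path)
    # The pragma returns the journal mode actually in effect, which can
    # differ from the request (for example, for in-memory databases).
    (mode,) = conn.execute('PRAGMA journal_mode=WAL').fetchone()
    if mode != 'wal':
        conn.close()
        raise RuntimeError(f'WAL mode unavailable; journal mode is {mode!r}')
    return conn


with tempfile.TemporaryDirectory() as project_dir:
    db_path = os.path.join(project_dir, 'database.sqlite')  # hypothetical name
    conn = open_with_wal(db_path)
    try:
        conn.execute('CREATE TABLE resource (id INTEGER PRIMARY KEY, url TEXT)')
        with conn:  # commits on success
            conn.execute(
                'INSERT INTO resource (url) VALUES (?)',
                ('https://example.com/',))
    finally:
        conn.close()
```

WAL mode is persistent: once set, it stays in effect for later connections to the same database file until it is changed again.
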
  • Parsing improvements

    • Links are parsed in about 18% as much time as before.
    • Can identify URL references inside <img srcset="...">.
    • Skip parsing links in downloaded files known to be binary files.
  • Serving improvements

    • Server logs are now displayed in a UI drawer.
    • Links to anchors on the same page are no longer rewritten,
      for better compatibility with JavaScript libraries that
      treat such links specially.
    • Archived pages are read from disk about 45% faster by avoiding an
      unnecessary os.stat call.
    • Archived pages are served faster and more efficiently by using
      the os.sendfile primitive when supported by the operating system.
    • Don't warn about unknown X- HTTP headers.
  • CLI improvements

    • Profiling warnings:
      • Several foreground tasks are optimized so that they
        no longer print slow foreground task warnings
      • Slow garbage collection operations now print a profiling warning
      • Slow "Recording links" operations now print a profiling warning
    • Include guppy module for manual memory leak profiling.
    • A $PYTHONSTARTUP file can be defined that is run automatically
      at the beginning of a shell session.
  • Error handling improvements

    • When attempting to download a previously-downloaded revision that is
      missing a body file on disk, delete & redownload the old revision.
  • Testing improvements

    • An entire test module can now be run with --test, in addition to
      individual test functions.
  • Minor fixes

    • Clear completed root tasks in all cases, even in the rare case where
      all tasks except the first one are complete
    • When deleting a ResourceRevision, don't delete revision body if project
      is read-only and also properly mark related Resource as no longer being
      downloaded this session
    • When querying a ResourceRevision's size, don't crash with a traceback
    • When running as a macOS .app, log stdout and stderr to files correctly
      once more
  • Backward-incompatible API changes

    • Resource.revisions() now returns Iterable[ResourceRevision] instead
      of List[ResourceRevision] to support streaming results.
      • If the old behavior is desired, wrap calls to Resource.revisions()
        inside of a list(...) expression.
    • MainWindow.frame is no longer public.
    • ResourceRevision.load() has been renamed to
      ResourceRevision._load_from_data() and privatized.
      • A replacement ResourceRevision.load() method now exists that loads
        an existing revision given an ID.
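
The change to Resource.revisions() can be illustrated with a small stand-in. `StubResource` below is hypothetical and yields plain integers instead of ResourceRevision objects; it only shows how code that relied on list behavior can adapt by wrapping the call in `list(...)`.

```python
from typing import Iterable, List


class StubResource:
    """Hypothetical stand-in for a resource with stored revisions."""

    def __init__(self, revision_ids: List[int]) -> None:
        self._revision_ids = revision_ids

    def revisions(self) -> Iterable[int]:
        # A generator streams one revision at a time instead of
        # building the whole list up front.
        for revision_id in self._revision_ids:
            yield revision_id


resource = StubResource([101, 102, 103])

# Streaming style: take the first revision without loading the rest
first = next(iter(resource.revisions()), None)

# Old list-style behavior: wrap the call in list(...)
all_revisions = list(resource.revisions())
latest = all_revisions[-1]

print(first, latest, len(all_revisions))  # 101 103 3
```

Indexing and `len()` are the operations most likely to break after the change, since a plain iterable supports neither.
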

v1.5.0b

02 Apr 14:37

This release focuses on making it easy to install Crystal from PyPI,
adds support for running on Linux from source (but not from a binary),
and fixes many bugs with the built-in CLI shell.

Additionally items in the main window are easier to understand
because icons and tooltips have been added for all tree nodes.

  • Distribution improvements

    • Can install Crystal using pipx and pip, from PyPI:
      • pipx install crystal-web
      • crystal
    • Can run Crystal using crystal binary:
      • poetry run crystal
    • Can run Crystal using python -m crystal:
      • poetry run python -m crystal
    • Add support for Linux platform (Ubuntu 22.04, Fedora 37)
  • CLI improvements

    • Fixed shell to not hang if exited before UI exited, under certain circumstances.
    • Fixed {help, exit, quit} functions to be available when Crystal runs as an .app or .exe.
    • Altered the exit message shown while windows are open to be more accurate.
    • Pinned the public API of Project and MainWindow.
  • Testing improvements

    • Tests are much faster now that download delays are minimized while running tests.
    • Failure messages are improved whenever a WaitTimedOut occurs.
    • A screenshot is taken whenever a test fails.
    • Several race conditions related to accessing the foreground thread are fixed.
  • UI Improvements

    • Icons and tooltips added to all tree nodes in the main window,
      clarifying the different types of entities, links, and tasks that exist.
      • Easy to distinguish between URLs and groups.
      • Easy to see whether a URL was downloaded,
        and whether it was downloaded successfully.
    • URL clusters now show in their title how many members they contain.
    • Fixed "Offsite" cluster nodes to update children appropriately whenever
      the Default URL Prefix is changed.
    • Fixed right-click on non-URL node to no longer print a traceback.
    • Fixed attempt to download a group with no source to no longer print a traceback.

v1.4.0b

23 Aug 14:16

This release adds early support for incrementally redownloading sites
with new page versions.

It is also now possible to download sites requiring login from the UI
and a tutorial has been added showing how to do that.

There are also many stability improvements, with fewer wxPython-related
Segmentation Faults and dramatically improved automated test coverage.

For more information see the release notes.

v1.3.0b

11 Jul 12:47

This release allows more kinds of advanced sites to be downloaded,
including sites requiring login and sites relying on JSON APIs,
especially those with infinitely scrolling pages.

Projects can now be opened in a read-only mode such that
browsing existing downloaded content will never attempt to
dynamically download additional content.

Advanced manipulation of projects can now be done from a
shell launched from the command-line interface.

Last but not least, Substack-based sites are now recognized specially and can be
downloaded effectively without creating an explosion of URL combinations.

For more information see the release notes.

v1.2.0b

14 Apr 03:06

This release primarily features better support for large projects and groups.
Downloads of large groups are dramatically faster and now only require a
constant amount of memory no matter how large the group is. Also a progress bar
is now displayed when opening a large project.

A few more link types in CSS and <script> tags are now recognized.

Last but not least, phpBB forums are now recognized specially and can be
downloaded effectively without creating an explosion of URL combinations.
phpBB support is still experimental and likely requires additional tuning.

For more information see the release notes.

v1.1.1b

03 Apr 05:00

Several first-time-launch issues were fixed. And domains are now recognized
in a case-insensitive fashion, eliminating duplicate URLs within some sites.
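
As a rough illustration of why case-insensitive domain handling removes duplicates, the sketch below lowercases only the host portion of a URL using the standard `urllib.parse` module. The `normalize_domain_case` helper is hypothetical and is not taken from Crystal's source.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_domain_case(url: str) -> str:
    """Lowercases the host (and port) part of a URL, leaving the rest as-is."""
    parts = urlsplit(url)  # urlsplit already lowercases the scheme
    # Keep any "user:password@" prefix untouched, since it is case-sensitive
    (userinfo, at, hostport) = parts.netloc.rpartition('@')
    netloc = userinfo + at + hostport.lower()
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment))


urls = [
    'https://Example.com/About',
    'https://EXAMPLE.COM/About',
    'https://example.com/About',
]
unique = {normalize_domain_case(u) for u in urls}
print(unique)  # {'https://example.com/About'}
```

The path is left alone because, unlike the domain, it may be case-sensitive on the origin server.
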

For more information see the release notes.

v1.1.0b

23 Mar 06:30

Our first beta release brings support for downloading more complex static sites,
recognizing vastly more link types than ever before. It also supports various
kinds of dynamic link-rewriting, beyond the usual static link-rewriting.

Additionally the code has been modernized to work properly on the latest
operating systems and use newer versions of the BeautifulSoup parser and
the wxWidgets UI library. Unfortunately this has meant dropping support for
some older macOS versions and Windows XP.

For more information see the release notes.