icaew-digital-archive/digital-archiving-scripts

digital-archiving-scripts

A collection of scripts to help with various digital archiving tasks.

archived scripts

Contains various scripts for ad-hoc tasks that may or may not be repeated in the future.

browsertrix-crawler files

Contains scripts relating to browsertrix-crawler.

downloading items from the Internet Archive

Contains a script that reformats the JSON response from the Internet Archive's CDX API and provides better duplicate removal. The result is written to a .txt file.
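
As a rough illustration of the idea, here is a minimal sketch and not the script in this folder. It assumes the CDX API was queried with `output=json`, which returns a list of rows whose first row is the header. The function name `cdx_rows_to_urls` and the choice of `digest` as the de-duplication key are assumptions made for the example.

```python
import json


def cdx_rows_to_urls(cdx_json_text, dedupe_field="digest"):
    """Turn a CDX JSON response into a de-duplicated list of Wayback URLs.

    Assumption: the first row is the header (e.g. urlkey, timestamp,
    original, mimetype, statuscode, digest, length) and every later row
    is one capture. The de-duplication key is a hypothetical choice.
    """
    rows = json.loads(cdx_json_text)
    if not rows:
        return []

    header, *records = rows
    seen = set()
    urls = []
    for record in records:
        item = dict(zip(header, record))
        key = item[dedupe_field]
        if key in seen:
            # Skip captures whose key has already been emitted.
            continue
        seen.add(key)
        urls.append(
            f"https://web.archive.org/web/{item['timestamp']}/{item['original']}"
        )
    return urls


def write_urls(urls, path):
    """Write one URL per line to a plain-text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls))
```

Keying on `digest` drops captures with identical content; passing `dedupe_field="original"` would instead keep one capture per URL.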

opex scripts

Contains scripts to partially automate the production of OPEX XML files for use with Preservica.

pypreservica scripts

Contains various scripts that utilise Preservica's API using pyPreservica.

semaphore-helper.py

Uses Semaphore's CLSClient to auto-classify documents and sort them by topic score.

sitemap tools

Contains a script to produce a plain list of URLs from an XML sitemap (outputs to .txt, .html, or terminal).
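
The core of such a tool can be sketched with the standard library alone. This is an illustrative example, not the script in this folder. The function name `sitemap_urls` is hypothetical, and the sketch assumes the sitemap lists its URLs in `<loc>` elements, as the sitemaps.org protocol specifies.

```python
import xml.etree.ElementTree as ET


def sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap, in document order.

    The namespace prefix is stripped from each tag, so the function works
    whether or not the sitemap declares the sitemaps.org namespace.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


def write_url_list(urls, path):
    """Write the URLs to a plain-text file, one per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls))
```

Because a sitemap index file also uses `<loc>` for its child sitemaps, the same function returns the child sitemap URLs when given an index.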

warc_reader.py

A script that reads a folder of WARC files and cross-references their content against a list of URLs. It also uses Beautiful Soup (BS4) to search the HTML content for specific HTML elements.
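
The cross-referencing step can be sketched as follows. This is a dependency-free illustration, not warc_reader.py itself. It assumes the WARC response records have already been read into `(url, html)` pairs, and it swaps in the standard library's `html.parser` for BS4. The names `TagFinder` and `cross_reference` are hypothetical.

```python
from html.parser import HTMLParser


class TagFinder(HTMLParser):
    """Record whether a given tag name appears in the HTML fed to it."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method.
        if tag == self.tag_name:
            self.found = True


def cross_reference(records, wanted_urls, tag_name):
    """Compare archived pages against a URL list and search them for a tag.

    `records` is an iterable of (url, html) pairs, standing in for the
    response records of one or more WARC files (an assumption).
    """
    wanted = set(wanted_urls)
    archived = set()
    with_tag = set()
    for url, html in records:
        if url not in wanted:
            continue
        archived.add(url)
        finder = TagFinder(tag_name)
        finder.feed(html)
        if finder.found:
            with_tag.add(url)
    return {
        "found": sorted(archived),
        "missing": sorted(wanted - archived),
        "with_tag": sorted(with_tag),
    }
```

The `missing` list shows which URLs from the list were not captured, and `with_tag` narrows the captured pages to those containing the element of interest.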
