Importing Works from the Internet Archive
FromThePage supports transcribing books hosted on the Internet Archive. This is a great way to explore documents that have already been digitized, and it's actually easier to set up than uploading scans directly to FromThePage.
To import a single book by its URL:
- Log in as a user who is authorized to own works.
- Click the Dashboard link (located next to the login link)
- On the left side of the screen, you'll see an area called "Owner Actions".
- Click on "Import a Book from the Internet Archive"
- Copy and paste the URL of the Internet Archive page for the book you want to import, then press "next".
- See the section "Import an Internet Archive Book" below
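
The URL pasted in the step above normally has the form `https://archive.org/details/<identifier>`, where the identifier names the book on the Internet Archive. The following is a minimal sketch of how such a URL could be validated and reduced to its identifier before an import. It is an illustration only, not FromThePage's own code; the function name `archive_identifier` and the example URL are hypothetical.

```python
from urllib.parse import urlparse


def archive_identifier(url: str) -> str:
    """Return the item identifier from an archive.org details-page URL.

    Hypothetical helper for illustration; FromThePage may parse URLs differently.
    """
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host != "archive.org" and not host.endswith(".archive.org"):
        raise ValueError(f"not an Internet Archive URL: {url!r}")

    # Item pages look like /details/<identifier>, optionally followed by
    # viewer state such as /page/n5/mode/2up.
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2 or parts[0] != "details":
        raise ValueError(f"not an item details page: {url!r}")
    return parts[1]


if __name__ == "__main__":
    # The identifier below is made up for the example.
    print(archive_identifier("https://archive.org/details/exampleitem00test"))
```

Checking the URL up front gives a clear error for a pasted search-results or home-page link, rather than a failure partway through the import.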
To import works from your institution's collection through Archive.org's OAI sets:
- Log in as a user who is authorized to own works.
- Click the Dashboard link (located next to the login link)
- On the left side of the screen, you'll see an area called "Owner Actions".
- Click "Explore OAI Repositories".
- Click "Show All Sets" next to the Archive.org link.
- Wait, possibly several minutes, while FromThePage queries Archive.org for all of its OAI sets. The resulting list is very long.
- Search the page for your institution.
- Click "Save for future use" next to the spec.
- This should redirect you to the dashboard again. There should now be a link in the owner's section saying "List works to import from your collection".
- Click the "List works to import" link. This will query Archive.org for the works it has in that OAI set.
- Click the Import button beside the work you want to import (for example, one of the field notes).
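
The "Show All Sets" and "Search the page for your institution" steps above correspond to an OAI-PMH `ListSets` request followed by a text search over the returned set names. The sketch below shows that search applied to a small, made-up response. It assumes only the standard OAI-PMH 2.0 response layout (`set`, `setSpec`, `setName`); the sample data and the function name `find_sets` are hypothetical, and this is not how FromThePage itself is implemented.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample of a ListSets response; real responses are far longer.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set>
      <setSpec>collection:example_university</setSpec>
      <setName>Example University Library</setName>
    </set>
    <set>
      <setSpec>collection:sample_museum</setSpec>
      <setName>Sample Museum Field Notes</setName>
    </set>
  </ListSets>
</OAI-PMH>
""".strip()


def find_sets(xml_text: str, query: str) -> list[tuple[str, str]]:
    """Return (setSpec, setName) pairs whose name or spec contains `query`."""
    root = ET.fromstring(xml_text)
    needle = query.lower()
    matches = []
    for node in root.findall("oai:ListSets/oai:set", OAI_NS):
        spec = node.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        name = node.findtext("oai:setName", default="", namespaces=OAI_NS)
        if needle in name.lower() or needle in spec.lower():
            matches.append((spec, name))
    return matches


if __name__ == "__main__":
    for spec, name in find_sets(SAMPLE_RESPONSE, "museum"):
        print(spec, "|", name)
```

The `setSpec` value is what "Save for future use" would need to remember, since a later `ListRecords` or `ListIdentifiers` request can be restricted to that set.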
Import an Internet Archive Book
- Purge Delete Scans: Archive.org classifies some scanned leaves as type="Delete". These appear to be items such as color calibration cards, and Archive.org never displays them. They should be purged, so press this button.
- Retitle from OCR: This applies only to pre-printed 20th-century daybooks. For these materials, the OCR has done a pretty good job of parsing the date that's printed at the top of each page. I've written code to re-title the numeric page numbers (which are really leaf titles) based on these parsed OCR entries. Press this button and wait a few minutes for the parsing to happen. The pages will be re-titled from OCR, and while they may need correction, this can save a lot of effort for diaries and journals.
- Convert to FromThePage: This converts an Archive.org-imported book and its leaves into a FromThePage work with corresponding pages. This is the final piece of the IA book importer. It may also take a few minutes to run, so please be patient.
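
To make the two cleanup options above concrete, here is a minimal sketch of the ideas behind them: dropping leaves typed "Delete", then retitling the remaining leaves from a date found near the top of their OCR text. Everything here is assumed for illustration: the `Leaf` class, the function names, and the "Month day, year" date format are hypothetical, and FromThePage's actual implementation may differ.

```python
import re
from dataclasses import dataclass


@dataclass
class Leaf:
    """Hypothetical stand-in for one scanned leaf of an imported book."""
    number: int
    page_type: str   # e.g. "Normal" or "Delete"
    ocr_text: str
    title: str       # starts out as the numeric leaf number


MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Assumed header format, e.g. "MONDAY, JANUARY 5, 1925".
DATE_RE = re.compile(rf"\b({MONTHS})\s+(\d{{1,2}}),?\s+(\d{{4}})\b", re.IGNORECASE)


def purge_delete_leaves(leaves):
    """Return only the leaves that are not typed "Delete"."""
    return [leaf for leaf in leaves if leaf.page_type != "Delete"]


def retitle_from_ocr(leaves, header_lines=3):
    """Retitle each leaf from a date in its first few OCR lines, if one is found."""
    for leaf in leaves:
        header = "\n".join(leaf.ocr_text.splitlines()[:header_lines])
        match = DATE_RE.search(header)
        if match:
            month, day, year = match.groups()
            leaf.title = f"{month.title()} {int(day)}, {year}"
        # Otherwise the numeric title is kept for manual correction.
    return leaves


if __name__ == "__main__":
    book = [
        Leaf(0, "Delete", "color calibration card", "0"),
        Leaf(1, "Normal", "MONDAY, JANUARY 5, 1925\nWent to town.", "1"),
        Leaf(2, "Normal", "smudged and unreadable", "2"),
    ]
    for leaf in retitle_from_ocr(purge_delete_leaves(book)):
        print(leaf.number, leaf.title)
```

Leaving the numeric title in place when no date is recognized matches the note above that retitled pages may still need correction by hand.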
Once the converter is finished, you can access the work from the dashboard. Move on to Preparing a Work for Transcription.