Migrate data from Sufia 6 to Sufia 7.2 [Work In Progress]

hackmastera edited this page Jan 27, 2017 · 6 revisions

Please note that these scripts are not fully finalized, but they have been used for a real, live migration at CHF. Do plenty of testing and check the outstanding issues.

This is a two-step process. First, from a Sufia 6 application, export the metadata of all the GenericFiles and Collections to a set of JSON files. Then, from a Sufia 7 application, read the JSON files and create GenericWorks/FileSet/Files, followed by Collections.

Step 1 (from a Sufia 6 application)

  • Upgrade your app to the latest Sufia 6 release, and then pin to sufia-6.x to get the most recent export code.
  • You'll want to put your application in read-only or maintenance mode.
  • First run the survey script; this writes the ID of every collection and generic file to the database:
      $ RAILS_ENV=production bundle exec sufia_survey -v
  • Then run the export; see options via --help as there are ways to override the exported fields if you have customized your data model:
      $ bundle exec sufia_export --help
      $ RAILS_ENV=production bundle exec sufia_export
    • The export writes the metadata for each file to a JSON file, but not the actual binary of the file. The binary will be read from Fedora at the time of the import.
  • Make sure your Fedora port is open to whatever server will be running the import.
  • Check your JSON files. If fedora.yml was set up with '127.0.0.1' and you will be migrating on a different server, you will need to replace all instances of 127.0.0.1 with a real IP address or domain name.
  • Move all of those JSON files to the import location. Also migrate your minter statefile, your Postgres database, and (at some point) your Redis data.
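
The host replacement described above can be scripted. The following is a minimal sketch, not part of the Sufia export tooling: it assumes the exported files sit directly in one directory and end in .json, and it does a plain text substitution without parsing the JSON. The method name, directory, and host name are all hypothetical placeholders.

```ruby
require "fileutils"

# Replace every occurrence of old_host with new_host in the exported JSON
# files directly under export_dir. A ".bak" copy of each changed file is kept.
# Returns the paths of the files that were changed.
# (Hypothetical helper; not provided by Sufia.)
def rewrite_fedora_host(export_dir, old_host, new_host)
  Dir.glob(File.join(export_dir, "*.json")).sort.select do |path|
    original = File.read(path)
    updated = original.gsub(old_host, new_host) # literal string match, not a regex
    next false if updated == original

    FileUtils.cp(path, "#{path}.bak") # keep the untouched export
    File.write(path, updated)
    true
  end
end

# Example call; the directory and host name are placeholders:
# rewrite_fedora_host("tmp/export", "127.0.0.1", "fedora.example.org")
```

Because only the host string is swapped, any port or path that follows it in a URL is left as it was. Spot-check a few of the rewritten files before moving them to the import location.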

Step 2 (from a Sufia 7.2 application)

  • Pin to the 7.2-migration branch of Sufia.
  • Make sure you've configured fedora_sufia6_user and fedora_sufia6_password in config/application (or a new file in config/initializers) so you can reach your Fedora instance to retrieve the binaries.
  • In the Sufia 7 application, import the JSON files exported from the Sufia 6 application. Again, use --help to see options; you may want to create overrides if you have customized your data model:
      $ RAILS_ENV=production bundle exec sufia_import
    • In particular take a look at your rights data values; the default for "All rights reserved" has changed from a string to a URI and you may want to take this opportunity to migrate your data.
  • Derivatives will not be migrated; they will be re-generated by Sufia 7.
  • [Future step -- in development] Run the validation script, which checks that all the IDs recorded during the survey step have been migrated.
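
Until the validation script is available, the check it describes amounts to a set difference between the IDs recorded by the survey and the IDs present after the import. The sketch below illustrates only that idea; it is not the Sufia validation script, and the method name and sample IDs are hypothetical. How you collect the two ID lists (for example from the survey table and from the Sufia 7 repository) is left open.

```ruby
require "set"

# Return the surveyed IDs that are absent from the migrated IDs, sorted.
# (Hypothetical helper; not provided by Sufia.)
def missing_ids(surveyed_ids, migrated_ids)
  (Set.new(surveyed_ids) - Set.new(migrated_ids)).to_a.sort
end

surveyed = %w[abc123 def456 ghi789] # hypothetical IDs from the survey step
migrated = %w[abc123 ghi789]        # hypothetical IDs found in Sufia 7
puts missing_ids(surveyed, migrated).inspect # prints ["def456"]
```

An empty result would mean every surveyed ID was found after the import.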

Outstanding issues

This is not an exact list, because some of these issues relate to code migration rather than data migration: https://github.com/projecthydra/sufia/issues?q=is%3Aopen+is%3Aissue+label%3Amigration

Timing

As a rough test, here is a sample migration of approximately 71 GB of data from a Sufia 6 instance to a Sufia 7 instance.

The Sufia 6 instance is an AWS m4.large box (2 vCPU, 8 GB of RAM, with the data stored on a RAID 1 array of magnetic disk class storage). The Sufia 6 instance is on Fedora 4.6 with a LevelDB database.

Two Sufia 7 instances were tested, both running Fedora 4.7 with a Postgres database.

Migration times:

  • An m4.large box (2 vCPU, 8 GB RAM, magnetic disk, non-RAID): 34 hours 23 minutes
  • A c4.2xlarge box (4 vCPU, 7.5 GB RAM, SSD storage): 7 hours 40 minutes

The CPU and disk I/O are the two factors most involved in the speed increase. Testing with the same CPU but a different disk should help isolate how much the disk I/O affects the transfer.
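
To put the two timings on a common scale, the arithmetic can be worked through as below. This is only a back-of-the-envelope sketch using the figures reported above; the 71 GB total is approximate, so the results are rough as well. The helper names are illustrative.

```ruby
# Convert an "H hours M minutes" duration to total minutes.
def minutes(hours, mins)
  hours * 60 + mins
end

# Average throughput in GB per hour for a given data size and duration.
def throughput_gb_per_hour(gigabytes, total_minutes)
  gigabytes / (total_minutes / 60.0)
end

data_gb = 71.0               # approximate size of the migrated data
m4_minutes = minutes(34, 23) # 2063 minutes on the m4.large box
c4_minutes = minutes(7, 40)  # 460 minutes on the c4.2xlarge box

speedup = m4_minutes.to_f / c4_minutes
puts "m4.large:   #{throughput_gb_per_hour(data_gb, m4_minutes).round(1)} GB/hour"
puts "c4.2xlarge: #{throughput_gb_per_hour(data_gb, c4_minutes).round(1)} GB/hour"
puts "speedup:    #{speedup.round(1)}x"
```

This works out to roughly 2.1 GB/hour on the m4.large box and 9.3 GB/hour on the c4.2xlarge box, about a 4.5x difference.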