git clone
Get Ruby (>= 2.0.0)
gpg --keyserver hkp:// --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
\curl -sSL | bash -s stable
rvm install 2.2.3
rvm --default use 2.2.3
Install bundler
gem install bundler
Install Dependencies
cd ratsinfo
bundle install
To start scraping, run in another console:
This will extract all documents to the path given by the environment variable DOWNLOAD_PATH (defaults to "data") and convert them to XML files containing the metadata and full text of the PDFs.
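As an illustration of the DOWNLOAD_PATH fallback described above, here is a minimal Ruby sketch. The helper name `download_path` is hypothetical and not taken from this project's Rakefile; only the variable name and its default "data" come from the text above.

```ruby
# Hypothetical helper (not part of this repository): resolves the download
# directory from the DOWNLOAD_PATH environment variable, falling back to
# "data" when the variable is not set.
def download_path(env = ENV)
  env.fetch("DOWNLOAD_PATH", "data")
end

# Prints the directory that would be used in the current shell.
puts download_path
```

To scrape into a different directory, set the variable before invoking rake, for example by prefixing the command with `DOWNLOAD_PATH=/some/dir`.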
To scrape an individual session, for example:
rake testmonth
Use this to download just a tiny set of data (session data only) for testing.
To display all tasks use:
rake -T
The download directory will have the following scheme:
each session has a directory, named after the session id
every document belonging to the session is extracted to this directory
additionally, a JSON file is created with the session id in its name. This is a machine-readable version of the index.htm file contained in the document archives
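The scheme above could be inspected with a short Ruby sketch like the following. The helper `list_sessions` is hypothetical, and the sketch assumes that the JSON file lives inside the session directory and ends in ".json"; the text above only states that the session id appears in its name, so adjust the match if the actual layout differs.

```ruby
# Hypothetical helper (not part of this repository): maps each session id
# (the directory name) to the files found in that session's directory.
def list_sessions(download_path)
  entries = Dir.entries(download_path) - %w[. ..]
  entries.sort.each_with_object({}) do |id, sessions|
    dir = File.join(download_path, id)
    next unless File.directory?(dir) # skip stray files at the top level

    files = (Dir.entries(dir) - %w[. ..]).sort
    # Assumption: the JSON index sits in the session directory and its
    # file name contains the session id and ends in ".json".
    json, documents = files.partition do |f|
      f.end_with?(".json") && f.include?(id)
    end
    sessions[id] = { json: json, documents: documents }
  end
end

# Example call, assuming the default download directory exists:
#   list_sessions(ENV.fetch("DOWNLOAD_PATH", "data"))
```

The result is a plain hash keyed by session id, which makes it easy to spot sessions that are missing their JSON file or have no documents.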
We now follow the OParl specification!
Deviations from the OParl spec:
- Numerical
everywhere, because we don't yet serve the data on HTTP URIs