SEP Archives

Jaimie Murdock edited this page Apr 13, 2018 · 10 revisions

See also: SEP Mirror

Once the SEP Mirror has finished running, topic models for each archive can be trained with the sep-corpus-builder.

To train an individual quarter, first fetch its documents with the script at ~inphosite/sep-corpus-builder/build.py, then zip them using zip $QUARTER.zip data_$QUARTER/*.
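The zip step can be wrapped in a small guard so that it fails early when the documents have not been fetched yet. This is only a sketch: the function name `zip_quarter` is hypothetical, and it assumes build.py has already written its output to `data_$QUARTER` in the current directory.

```shell
# Hypothetical helper around the zip step described above.
# Assumes build.py has already populated data_<quarter> in the current directory.
zip_quarter() {
  quarter="$1"
  if [ ! -d "data_$quarter" ]; then
    echo "data_$quarter not found; fetch the documents for $quarter first" >&2
    return 1
  fi
  # Same command as in the text, with the variables quoted
  zip "$quarter.zip" "data_$quarter"/*
}
```

It would be called with the quarter as its only argument, for example `zip_quarter spr2018`.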

#!/bin/bash
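# Quarter identifier: season abbreviation followed by a four-digit year, e.g. win2017
SEASONYEAR=$1
SEASON=${SEASONYEAR::-4}      # drop the trailing four-digit year
YEAR=${SEASONYEAR#$SEASON}    # drop the season prefix, leaving the year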

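case $SEASON in
  'win') SEASONDESC='Winter';;
  'spr') SEASONDESC='Spring';;
  'sum') SEASONDESC='Summer';;
  'fall') SEASONDESC='Fall';;
  # Stop on an unrecognized season rather than building an empty description
  *) echo "Unknown season: $SEASON" >&2; exit 1;;
esac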

DESC="Stanford Encyclopedia of Philosophy ($SEASONDESC $YEAR)"
INI="sep.$SEASONYEAR.ini"

# python build.py $SEASONYEAR
# $DESC contains spaces and parentheses, so it must be quoted
topicexplorer init --name "$DESC" "data_$SEASONYEAR" "$INI" -q
topicexplorer prep "$INI" --high-percent 45 --low-percent 5 --lang en --min-word-len 3 -q
# Train one model per topic count
topicexplorer train "$INI" -k 20 40 60 80 100 120 --iter 500 -p 24
topicexplorer export -o "/tmp/sep.$SEASONYEAR.tez" "$INI"
aws s3 cp "/tmp/sep.$SEASONYEAR.tez" "s3://hypershelf/sep.$SEASONYEAR.tez" --acl bucket-owner-full-control
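Because a training run is long, it can help to check how a quarter identifier will be interpreted before starting one. The sketch below isolates the script's parsing step. The function name `describe_quarter` and the sample identifiers are illustrative assumptions, and it uses `${var%????}` as a portable stand-in for the script's `${SEASONYEAR::-4}`.

```shell
#!/bin/bash
# Hypothetical helper: reproduces the parsing of an identifier such as
# "win2017" and prints the description the script would pass to --name.
describe_quarter() {
  local seasonyear="$1"
  # Require the identifier to end in four digits
  case $seasonyear in
    *[0-9][0-9][0-9][0-9]) ;;
    *) echo "expected e.g. win2017, got: $seasonyear" >&2; return 1 ;;
  esac
  local season="${seasonyear%????}"      # everything except the last four characters
  local year="${seasonyear#"$season"}"   # what is left after removing the season prefix
  local desc
  case $season in
    win)  desc='Winter' ;;
    spr)  desc='Spring' ;;
    sum)  desc='Summer' ;;
    fall) desc='Fall' ;;
    *)    echo "unknown season: $season" >&2; return 1 ;;
  esac
  echo "Stanford Encyclopedia of Philosophy ($desc $year)"
}

# Sample identifiers (assumed); nothing is trained or uploaded here
for q in win2017 spr2018; do
  describe_quarter "$q"
done
```

For the two sample identifiers this prints "Stanford Encyclopedia of Philosophy (Winter 2017)" and "Stanford Encyclopedia of Philosophy (Spring 2018)"; a malformed identifier or an unknown season returns a non-zero status instead.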