
SEP Archives

Colin Allen edited this page Apr 28, 2020 · 10 revisions

See also: SEP Mirror

Once the SEP Mirror has finished running, topic models for each archive can be trained using sep-corpus-builder.

To train a model for an individual quarter, run the script ~inphosite/sep-corpus-builder/build.py to get that quarter's documents, then zip them with:

zip $QUARTER.zip data_$QUARTER/*

Preparing the Corpus

To update the SEP models, we will use the sep-corpus-builder package.

This is installed on [email protected].

cd ~inphosite/sep-corpus-builder
python corpusbuilder.py
python build.py spr2018

This will create a directory data_spr2018. Verify that the document count matches what you expect:

ls data_spr2018 | wc
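
Note that ls | wc prints three numbers (lines, words, bytes); the first is the number of directory entries. As a minimal sketch, assuming each document is stored as one regular file somewhere under data_spr2018 (the directory layout is not documented here), a helper such as the hypothetical count_docs below prints only the file count:

```shell
# Hypothetical helper, not part of sep-corpus-builder.
# Prints the number of regular files under the given directory.
# Assumption: one file per document; file names contain no newlines.
count_docs() {
  if [ ! -d "$1" ]; then
    echo "no such directory: $1" >&2
    return 1
  fi
  # awk drops the padding some wc implementations add
  find "$1" -type f | wc -l | awk '{print $1}'
}
```

Running count_docs data_spr2018 then gives a single number to compare against the expected article count for that quarter.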

Next we need to train the models.

Training the model

Compress the corpus into a zip file, then initialize, prepare, and train the model:

zip -r data_spr2018.zip data_spr2018
topicexplorer init --name "Stanford Encyclopedia of Philosophy (Spring 2018)" data_spr2018 sep.ini
topicexplorer prep sep.ini --low-percent 5 --min-word-len 3 --high-percent 45 --lang en
topicexplorer train sep.ini -p 4 -k 20 40 60 80 100 120 --iter 500

The following script runs the whole process, including export and upload, for the quarter given as its first argument (e.g. spr2018):

#!/bin/bash
SEASONYEAR=$1
SEASON=${SEASONYEAR::-4}
YEAR=${SEASONYEAR#$SEASON}

case $SEASON in
  'win') SEASONDESC='Winter';;
  'spr') SEASONDESC='Spring';;
  'sum') SEASONDESC='Summer';;
  'fall') SEASONDESC='Fall';;
esac

DESC="Stanford Encyclopedia of Philosophy ($SEASONDESC $YEAR)"
INI="sep.$SEASONYEAR.ini"

cd ~inphosite/sep-corpus-builder
python corpusbuilder.py
python build.py $SEASONYEAR

cd /tmp

topicexplorer init --name "$DESC" ~inphosite/sep-corpus-builder/data_$SEASONYEAR $INI -q
topicexplorer prep $INI --high-percent 45 --low-percent 5 --lang en --min-word-len 3 -q
# do next two steps locally with downloaded zip file
topicexplorer train $INI -k 20 40 60 80 100 120 --iter 500 -p 24 
topicexplorer export -o /tmp/sep.$SEASONYEAR.tez $INI
aws s3 cp /tmp/sep.$SEASONYEAR.tez s3://hypershelf/sep.$SEASONYEAR.tez --acl bucket-owner-full-control
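
The case statement at the top of the script leaves SEASONDESC empty when the argument has an unrecognized season prefix, and the ${SEASONYEAR::-4} expansion needs a reasonably recent bash. As a hedged sketch, the hypothetical function season_desc below (not part of the original script) does the same parsing with portable suffix and prefix removal and rejects malformed codes:

```shell
# Hypothetical helper: turn a code such as "spr2018" into "Spring 2018".
# Prints the description on success; returns 1 for an unrecognized code.
# Note: the variables below are not local to the function.
season_desc() {
  code=$1
  # Require a four-digit year at the end of the code.
  case $code in
    *[0-9][0-9][0-9][0-9]) ;;
    *) echo "expected <season><year>, e.g. spr2018: $code" >&2; return 1 ;;
  esac
  season=${code%????}      # strip the four-digit year
  year=${code#"$season"}   # what remains after the season is the year
  case $season in
    win)  desc=Winter ;;
    spr)  desc=Spring ;;
    sum)  desc=Summer ;;
    fall) desc=Fall ;;
    *) echo "unknown season: $season" >&2; return 1 ;;
  esac
  echo "$desc $year"
}
```

For spr2018 this prints Spring 2018, which can be placed inside the parentheses of the description passed to topicexplorer init. Failing early on a bad code avoids building and uploading a model with an empty season in its name.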