SEP Archives
Colin Allen edited this page Apr 28, 2020 · 10 revisions
See also: SEP Mirror
Once the SEP Mirror has finished running, topic models for each archive can be trained using the sep-corpus-builder.
To train an individual quarter, use the script ~inphosite/sep-corpus-builder/build.py to get the documents, then zip them with zip $QUARTER.zip data_$QUARTER/*.
To update the SEP models, we will use the sep-corpus-builder package.
This is installed on [email protected].
cd ~inphosite/sep-corpus-builder
python corpusbuilder.py
python build.py spr2018
This will create a directory data_spr2018. Verify that the document count matches what you expect:
ls data_spr2018 | wc
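The count check above can be made explicit with a small helper. This is a minimal sketch, assuming each document is one entry directly inside the data directory; the function names count_docs and check_docs are hypothetical and not part of sep-corpus-builder.

```shell
# count_docs DIR: print the number of entries listed in DIR
# (same thing the "ls DIR | wc" check counts, as a bare number).
count_docs() {
    ls -1 "$1" | wc -l | tr -d ' '
}

# check_docs DIR EXPECTED: succeed only if DIR holds exactly EXPECTED entries.
# Hypothetical helper; cd_actual is a plain global because POSIX sh has no "local".
check_docs() {
    cd_actual=$(count_docs "$1")
    if [ "$cd_actual" -ne "$2" ]; then
        echo "expected $2 documents in $1, found $cd_actual" >&2
        return 1
    fi
}
```

For example, check_docs data_spr2018 1700 would print a message on stderr and return non-zero if the directory did not hold exactly 1700 entries; 1700 is a made-up figure used only for illustration.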
Next we need to train the models.
Archive the corpus into a zip file:
zip -r data_spr2018.zip data_spr2018
topicexplorer init --name "Stanford Encyclopedia of Philosophy (Spring 2018)" data_spr2018 sep.ini
topicexplorer prep sep.ini --low-percent 5 --min-word-len 3 --high-percent 45 --lang en
topicexplorer train sep.ini -p 4 -k 20 40 60 80 100 120 --iter 500
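The steps above can be combined into one script that takes the quarter code (for example spr2018) as its only argument:
#!/bin/bash
SEASONYEAR=$1
# Split the code into a season prefix and a four-digit year
# (the negative length needs bash 4.2 or later).
SEASON=${SEASONYEAR::-4}
YEAR=${SEASONYEAR#$SEASON}
case $SEASON in
'win') SEASONDESC='Winter';;
'spr') SEASONDESC='Spring';;
'sum') SEASONDESC='Summer';;
'fall') SEASONDESC='Fall';;
*) echo "Unknown season: $SEASON" >&2; exit 1;;
esac
DESC="Stanford Encyclopedia of Philosophy ($SEASONDESC $YEAR)"
INI="sep.$SEASONYEAR.ini"
# Build the corpus for the requested quarter.
cd ~inphosite/sep-corpus-builder
python corpusbuilder.py
python build.py "$SEASONYEAR"
# Initialize and prepare the model config in /tmp.
cd /tmp
# DESC contains spaces, so it must be quoted to reach --name as one argument.
topicexplorer init --name "$DESC" ~inphosite/sep-corpus-builder/data_"$SEASONYEAR" "$INI" -q
topicexplorer prep "$INI" --high-percent 45 --low-percent 5 --lang en --min-word-len 3 -q
# do next two steps locally with downloaded zip file
topicexplorer train "$INI" -k 20 40 60 80 100 120 --iter 500 -p 24
topicexplorer export -o "/tmp/sep.$SEASONYEAR.tez" "$INI"
aws s3 cp "/tmp/sep.$SEASONYEAR.tez" "s3://hypershelf/sep.$SEASONYEAR.tez" --acl bucket-owner-full-control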
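The season and year parsing in the script can be isolated into a function that also rejects malformed quarter codes. This is a minimal sketch that uses POSIX parameter expansion rather than the bash-only ${SEASONYEAR::-4}; the function name parse_quarter and its error messages are hypothetical.

```shell
# parse_quarter CODE: print "Season YEAR" for a code such as spr2018.
# Returns non-zero if the year is not four digits or the season is unknown.
# Hypothetical helper; variables carry a pq_ prefix because POSIX sh has no "local".
parse_quarter() {
    pq_season=${1%????}          # everything except the last four characters
    pq_year=${1#"$pq_season"}    # the last four characters
    case $pq_year in
        [0-9][0-9][0-9][0-9]) ;;
        *) echo "bad year in quarter code: $1" >&2; return 1 ;;
    esac
    case $pq_season in
        win)  pq_desc='Winter' ;;
        spr)  pq_desc='Spring' ;;
        sum)  pq_desc='Summer' ;;
        fall) pq_desc='Fall' ;;
        *) echo "unknown season in quarter code: $1" >&2; return 1 ;;
    esac
    echo "$pq_desc $pq_year"
}
```

A script could then build its description with DESC="Stanford Encyclopedia of Philosophy ($(parse_quarter "$1"))" || exit 1, so that a mistyped code stops the run before any corpus is built.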