- fetches Wikipedia documents (download)
- prepares the data
- stores it in the database
- creates an inverse index
- creates hashed n-grams of a provided size
WIKI_FILE=https://dumps.wikimedia.org/dewiki/..........-pages-articles-multistream.xml.bz2
HADOOP_PATH=hdfs://...
# download the bz2 dump and store it on HDFS
curl $WIKI_FILE | hadoop fs -put - $HADOOP_PATH/wiki.bz2
# decompress the dump to XML on HDFS
hadoop fs -text $HADOOP_PATH/wiki.bz2 | hadoop fs -put - $HADOOP_PATH/wiki.xml
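To sanity-check the uploads, you can list the target directory (optional; uses the standard hadoop fs -ls command and the same $HADOOP_PATH as above):

# check that wiki.bz2 and wiki.xml are in place
hadoop fs -ls $HADOOP_PATH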
sbt clean assembly
spark-submit $SPARK_OPTIONS wiki_data_fetcher.jar $OPTIONS
$SPARK_OPTIONS are:
--executor-memory 5G --driver-memory 2G
see the docs for further spark-submit options
$OPTIONS are:
Options:
-e,   --extract <hadoop_file>         parse the wiki XML file and save the entries to the database
-i,   --index                         use the database entries to create an inverse index
-n,   --ngrams <ngram>                use the database entries to create hashed n-grams of the provided size
-h,   --help                          print this help text
-mh,  --mongodb_host <host>           MongoDB Host
-mp,  --mongodb_port <port>           MongoDB Port
-mu,  --mongodb_user <user>           MongoDB User
-mpw, --mongodb_password <password>   MongoDB Password
-md,  --mongodb_database <database>   MongoDB Database
e.g.
wiki_data_fetcher.jar -e $HADOOP_PATH/wiki.xml -mh mongoHost -mp mongoPort -mu mongoUser -mpw mongoPassword -md mongoDatabase
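The index and n-gram steps can be run analogously once the extracted entries are in the database. A sketch, assuming the jar is submitted via spark-submit as shown above and that -n takes the n-gram size (here 3) as its argument:

# build the inverse index from the stored entries
spark-submit $SPARK_OPTIONS wiki_data_fetcher.jar -i -mh mongoHost -mp mongoPort -mu mongoUser -mpw mongoPassword -md mongoDatabase
# create hashed 3-grams from the stored entries
spark-submit $SPARK_OPTIONS wiki_data_fetcher.jar -n 3 -mh mongoHost -mp mongoPort -mu mongoUser -mpw mongoPassword -md mongoDatabase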
- /var/log/spark
- /run/spark/work