Wiki Data Fetcher

  • fetches Wikipedia documents (download)
  • prepares the data
  • stores them in the database
  • creates an inverse index
  • creates hashed n-grams of a given size (see the sketch below)
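
A minimal, hypothetical sketch of the n-gram step, assuming a naive tokenizer and MurmurHash3 as the hash function (the project's actual tokenizer and hash may differ):

import scala.util.hashing.MurmurHash3

// Hypothetical sketch: hashed n-grams of a given size.
object NGramSketch {
  // naive tokenizer: lower-case, split on non-word characters
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // one hash value per overlapping window of n tokens
  def hashedNGrams(text: String, n: Int): Seq[Int] =
    tokenize(text).sliding(n).map(gram => MurmurHash3.orderedHash(gram)).toSeq

  def main(args: Array[String]): Unit =
    println(hashedNGrams("Wikipedia is a free online encyclopedia", 3))
}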

Get XML Dump

WIKI_FILE=https://dumps.wikimedia.org/dewiki/..........-pages-articles-multistream.xml.bz2
HADOOP_PATH=hdfs://...

# download the compressed dump straight into HDFS
curl $WIKI_FILE | hadoop fs -put - $HADOOP_PATH/wiki.bz2

# decompress to plain XML (hadoop fs -text decompresses the .bz2 stream)
hadoop fs -text $HADOOP_PATH/wiki.bz2 | hadoop fs -put - $HADOOP_PATH/wiki.xml
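
Once the XML sits in HDFS, a quick sanity check can confirm the extraction, e.g. counting <page> elements with Spark. A minimal, hypothetical sketch (path and app name are placeholders, not from this repo):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sanity check: count <page> elements in the extracted dump.
object DumpCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dump-check"))
    val pages = sc.textFile("hdfs://.../wiki.xml")  // placeholder path
      .filter(_.contains("<page>"))                 // one opening tag per article
      .count()
    println(s"pages: $pages")
    sc.stop()
  }
}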

Build

sbt clean assembly
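
sbt-assembly builds the fat jar (referred to below as wiki_data_fetcher.jar); with default settings it lands under target/scala-<version>/. A minimal sketch of the plugin wiring in project/plugins.sbt, with an assumed plugin version (the repo's actual build files may differ):

// project/plugins.sbt -- assumed version, check the repo's own file
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")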

Commands

spark-submit $SPARK_OPTIONS wiki_data_fetcher.jar $OPTIONS

$SPARK_OPTIONS are, for example:

--executor-memory 5G --driver-memory 2G

(see the Spark documentation for further options)

$OPTIONS are:

Options:
 -e,   --extract <hadoop_file>         parses the wiki XML file and saves the documents to the database
 -i,   --index                         uses DB entries to create an inverse index
 -n,   --ngrams <ngram>                uses DB entries to create hashed n-grams of the given size
 -h,   --help
 -mh,  --mongodb_host     <host>       MongoDB Host
 -mp,  --mongodb_port     <port>       MongoDB Port
 -mu,  --mongodb_user     <user>       MongoDB User
 -mpw, --mongodb_password <password>   MongoDB Password
 -md,  --mongodb_database <database>   MongoDB Database

e.g.

spark-submit $SPARK_OPTIONS wiki_data_fetcher.jar -e $HADOOP_PATH/wiki.xml -mh mongoHost -mp mongoPort -mu mongoUser -mpw mongoPassword -md mongoDatabase
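
Conceptually, the --index step maps each token to the documents that contain it. A minimal, hypothetical Spark sketch with hard-coded (docId, text) pairs (the real job reads the documents from MongoDB):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch of the inverse-index step: token -> ids of containing documents.
object InverseIndexSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("inverse-index-sketch"))
    val docs = sc.parallelize(Seq(
      (1L, "alpha beta gamma"),
      (2L, "beta delta")
    ))
    val index = docs
      .flatMap { case (id, text) => text.toLowerCase.split("\\W+").map(tok => (tok, id)) }
      .distinct()                // each (token, doc) pair only once
      .groupByKey()              // token -> all documents containing it
    index.collect().foreach(println)
    sc.stop()
  }
}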

Spark Logs

  • /var/log/spark
  • /run/spark/work
