WikipediaXMLHadoopParser

Input:

Wikipedia dump metadata history XML files (XXwiki-latest-stub-meta-historyXX.xml.gz) from http://dumps.wikimedia.org/enwiki

Output:

Flat files with the following structure:
ArticleTitle + "\t" + EditorUsername + "\t" + EditorType + "\t" + ArticleSizeAfterModification + "\t" + IsMinorChange + "\t" + EditorComment + "\t" + ModificationTimestamp
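For example, a single output line might look like the following (every field value here is illustrative, not taken from a real dump):

Anarchism	WikiUser42	registered	12345	false	fixed a typo	2004-03-01T12:34:56Z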

One file is produced per year (the job uses a year partitioner), as sketched below.
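The per-year split can be implemented with a custom Hadoop Partitioner keyed on the revision year. The class below is a minimal sketch of that idea, assuming the mapper emits the year as an IntWritable key; the class and field names are illustrative, not the repository's actual code.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: routes every record of a given year to the
// same reducer, so each reducer writes exactly one year's output file.
public class YearPartitioner extends Partitioner<IntWritable, Text> {
    private static final int FIRST_YEAR = 2001; // year of Wikipedia's first edits

    @Override
    public int getPartition(IntWritable year, Text record, int numPartitions) {
        // 2001 -> partition 0, 2002 -> 1, ..., wrapped to the reducer count.
        return (year.get() - FIRST_YEAR) % numPartitions;
    }
}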

How to run the job:

hadoop jar WikipediaXmlHadoopParser.jar hadoop.wikipedia.parse.job.WikiParseDriver InputFolder OutputFolder
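A driver of this shape typically wires the job together as sketched below. This is an illustrative outline using the standard MapReduce API, not the repository's exact implementation; the commented-out mapper and partitioner class names are placeholders.

package hadoop.wikipedia.parse.job;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver outline: parse the XML under args[0] (InputFolder)
// and write the per-year flat files under args[1] (OutputFolder).
public class WikiParseDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "WikipediaXMLHadoopParser");
        job.setJarByClass(WikiParseDriver.class);
        // Real mapper/reducer/partitioner wiring lives in the repository, e.g.:
        // job.setMapperClass(WikiParseMapper.class);
        // job.setPartitionerClass(YearPartitioner.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}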

Using the output files with Hive:

hive> CREATE EXTERNAL TABLE wikipedia (title STRING, user STRING, typeUser STRING, size INT, typeModification STRING, comment STRING, date TIMESTAMP) PARTITIONED BY (year INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" LOCATION "/";
hive> ALTER TABLE wikipedia ADD PARTITION (year="2001");
...
hive> ALTER TABLE wikipedia ADD PARTITION (year="2014");

Then move each year's output files into the matching partition folder (hadoop dfs -mv /OutputFile/2001 /year=2001, and so on for each year).
