HerveRiviere/WikipediaXMLHadoopParser
Output line format (tab-separated):
ArticleTitle + "\t" + EditorUsername + "\t" + EditorType + "\t" + ArticleSizeAfterModification + "\t" + IsMinorChange + "\t" + EditorComment + "\t" + ModificationTimestamp
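To make the field order concrete, here is a minimal shell sketch that splits one record into its seven fields. The sample record is entirely hypothetical: its values, including the editor type, the minor-change flag, and the timestamp layout, are invented for illustration and are not taken from real parser output. `print_fields` and `sample_record` are illustrative names as well.

```shell
# Hypothetical sample record; every value here is invented for illustration.
# Field order follows the format above: title, username, editor type,
# size after modification, minor-change flag, comment, timestamp.
sample_record=$(printf 'Example_Article\tExampleUser\tregistered\t1234\tfalse\tfixed typo\t2010-05-01 12:00:00')

# Print each tab-separated field on its own line, prefixed by its position.
print_fields() {
  printf '%s\n' "$1" | awk -F'\t' '{ for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }'
}

print_fields "$sample_record"
```

Splitting on tabs only (rather than on any whitespace) matters because editor comments can contain spaces.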
The job writes one output file per year (year partitioner).

Run the parser:
hadoop jar WikipediaXmlHadoopParser hadoop.wikipedia.parse.job.WikiParseDriver InputFolder OutputFolder

Create the Hive table over the output:
hive> create EXTERNAL table wikipedia(title String, user String, typeUser String, size Int, typeModification String, comment String, date Timestamp) PARTITIONED BY (YEAR int) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" LOCATION "/";

Add one partition per year:
hive> ALTER TABLE WIKIPEDIA ADD PARTITION (YEAR="2001");
....
ALTER TABLE WIKIPEDIA ADD PARTITION (YEAR="2014");
Move each year's output file into the matching partition folder (hadoop dfs -mv /OutputFile/2001 /year=2001....)
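Repeating the partition statement and the move for every year is tedious, so the sketch below prints both commands for each year from 2001 to 2014. It is a dry run: nothing is executed, and the output is meant to be reviewed and then pasted into the Hive and Hadoop shells. It assumes the source and destination paths are exactly /OutputFile/<year> and /year=<year>, as in the 2001 example above. The names `FIRST_YEAR`, `LAST_YEAR`, and `print_partition_commands` are illustrative.

```shell
# Dry run: print the Hive statement and the HDFS move for every year
# instead of executing them, so they can be checked first.
# Assumption: paths follow the /OutputFile/<year> -> /year=<year> pattern.
FIRST_YEAR=2001
LAST_YEAR=2014

print_partition_commands() {
  year=$FIRST_YEAR
  while [ "$year" -le "$LAST_YEAR" ]; do
    echo "ALTER TABLE WIKIPEDIA ADD PARTITION (YEAR=\"$year\");"
    echo "hadoop dfs -mv /OutputFile/$year /year=$year"
    year=$((year + 1))
  done
}

print_partition_commands
```

On recent Hadoop releases `hadoop dfs` is deprecated; `hadoop fs -mv` or `hdfs dfs -mv` takes the same arguments.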