Analysis of big data with Spark

Case studies

  • Wikipedia: compare different methods for finding the most common keywords in an RDD
    • Data: http://alaska.epfl.ch/~dockermoocs/bigdata/wikipedia.dat (save to data/)
    • Run:
      • spark-shell --master local[*] -i wikipedia.scala
      • WikipediaAnalysis.Wikipedia.compareRankingMethods(sc)
  • StackOverflow: KMeans clustering of StackOverflow questions & answers
    • Data: http://alaska.epfl.ch/~dockermoocs/bigdata/stackoverflow.csv
    • Run:
      • spark-shell --master local[*] -i stackoverflow.scala
      • StackOverflowAnalysis.StackOverflow.clusterPostsUsingKMeans(sc)
  • Record Linkage [in progress]: deduplication of records
    • Data: https://archive.ics.uci.edu/ml/machine-learning-databases/00210/
    • Run: spark-shell -i linkage.scala
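
To give a feel for the Wikipedia case study, here is a minimal sketch of two ways to rank keywords by how many documents in an RDD mention them. It is an illustration only, not the code in wikipedia.scala: the object name `KeywordRankingSketch`, both method names, and the whitespace tokenization are assumptions made for this example.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical sketch: none of these names come from wikipedia.scala.
object KeywordRankingSketch {

  // Method 1: one filter-and-count job per keyword.
  // Simple to read, but scans the whole RDD once for every keyword.
  def rankByFilter(keywords: Seq[String], docs: RDD[String]): Seq[(String, Long)] =
    keywords
      .map(k => (k, docs.filter(doc => doc.split("\\s+").contains(k)).count()))
      .sortBy { case (_, n) => -n }

  // Method 2: a single pass that emits (keyword, 1) per matching document
  // and sums the pairs with reduceByKey.
  // Unlike method 1, keywords with zero matches do not appear in the result.
  def rankByReduce(keywords: Seq[String], docs: RDD[String]): Seq[(String, Long)] = {
    val wanted = keywords.toSet
    docs
      .flatMap { doc =>
        // distinct: count each keyword at most once per document
        doc.split("\\s+").distinct.filter(wanted.contains).map(k => (k, 1L))
      }
      .reduceByKey(_ + _)
      .collect()
      .toSeq
      .sortBy { case (_, n) => -n }
  }
}
```

One way to try it is to paste the object into spark-shell with `:paste` and call it on toy data, using the shell's `sc`:

```scala
val docs = sc.parallelize(Seq("scala spark", "spark", "java"))
val keywords = Seq("scala", "spark", "python")
KeywordRankingSketch.rankByFilter(keywords, docs).foreach(println)
KeywordRankingSketch.rankByReduce(keywords, docs).foreach(println)
```

The first method launches one Spark job per keyword, while the second shuffles only small (keyword, count) pairs after a single scan, which is the kind of trade-off a comparison of ranking methods would measure.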