This project performs an in-depth analysis of baseball statistics to identify potential Hall of Fame candidates who may have been overlooked. The analysis is implemented using various big data technologies and frameworks to demonstrate different approaches to handling and processing large datasets.
The project has been implemented using the following technologies:
- MapReduce (HDFS)
- Dask
- MongoDB
- PySpark
- SQL
Each implementation achieves the same result but leverages different tools and paradigms for big data processing.
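As a quick taste of how the first step looks in one of these paradigms, here is a minimal PySpark sketch that loads the batting data and aggregates career totals. It is illustrative only: the application name is arbitrary, and the column names assume the Lahman-style schema of `Batting.csv`.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; the application name is illustrative.
spark = SparkSession.builder.appName("hof-analysis").getOrCreate()

# Read the batting stats; inferSchema parses numeric columns as numbers.
batting = spark.read.csv("Batting.csv", header=True, inferSchema=True)

# Collapse season rows into per-player career totals, the raw material
# for a career wOBA.
career_totals = batting.groupBy("playerID").agg(
    F.sum("AB").alias("AB"),
    F.sum("BB").alias("BB"),
    F.sum("HR").alias("HR"),
)
career_totals.show(5)
```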
The analysis follows these general steps across all implementations (condensed pandas sketches of the pipeline appear after the list):
- Load and process data from multiple CSV files:
  - `woba.csv`: Weighted On-Base Average (wOBA) coefficients by year
  - `HallOfFame.csv`: Information about players inducted into the Hall of Fame
  - `Batting.csv`: Players' batting statistics
  - `Appearances.csv`: Players' game appearance data
- Calculate each player's career wOBA and determine their primary playing position.
- For each position (Catcher, Outfielder, First Base, Second Base, Third Base, Shortstop):
  - Identify players already inducted into the Hall of Fame
  - Calculate the 25th percentile wOBA among inducted players
  - Find non-inducted players whose wOBA exceeds this 25th percentile
  - Calculate the percentage of inducted players that each candidate's wOBA exceeds
- Output a list of potential Hall of Fame candidates, including their playerID, position, wOBA, and the percentage of inducted players they outperform.
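A minimal pandas sketch of the first two steps, illustrative rather than the project's actual code: `Batting.csv` is assumed to follow the Lahman schema, and the coefficient column names in `woba.csv` (`wBB`, `wHBP`, `w1B`, `w2B`, `w3B`, `wHR`, keyed by `yearID`) are assumptions about this project's file. It uses the standard wOBA formula, wOBA = (wBB·uBB + wHBP·HBP + w1B·1B + w2B·2B + w3B·3B + wHR·HR) / (AB + BB − IBB + SF + HBP).

```python
import pandas as pd

# Load batting seasons and the yearly wOBA coefficients.
batting = pd.read_csv("Batting.csv")
coeffs = pd.read_csv("woba.csv")  # assumed columns: yearID, wBB, wHBP, w1B, w2B, w3B, wHR

# Attach each season's coefficients to that season's batting line.
df = batting.merge(coeffs, on="yearID", how="inner")

# Singles are not stored directly: 1B = H - 2B - 3B - HR.
df["X1B"] = df["H"] - df["2B"] - df["3B"] - df["HR"]
df["uBB"] = df["BB"] - df["IBB"].fillna(0)  # unintentional walks

# Season-level numerator and denominator of the wOBA formula.
df["num"] = (df["wBB"] * df["uBB"] + df["wHBP"] * df["HBP"].fillna(0)
             + df["w1B"] * df["X1B"] + df["w2B"] * df["2B"]
             + df["w3B"] * df["3B"] + df["wHR"] * df["HR"])
df["den"] = (df["AB"] + df["BB"] - df["IBB"].fillna(0)
             + df["SF"].fillna(0) + df["HBP"].fillna(0))

# Career wOBA = summed numerators / summed denominators, so full seasons
# weigh more than brief ones (not a plain average of season wOBAs).
career = df.groupby("playerID")[["num", "den"]].sum()
career["wOBA"] = career["num"] / career["den"]
```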
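Continuing the sketch with the remaining steps: primary position is taken as the fielding slot with the most career games in `Appearances.csv` (Lahman-style `G_c`, `G_1b`, `G_2b`, `G_3b`, `G_ss`, `G_of` columns assumed), and induction status comes from `HallOfFame.csv` (`inducted == "Y"` with `category == "Player"`, again Lahman conventions). The `career` frame is the one built in the previous sketch.

```python
hof = pd.read_csv("HallOfFame.csv")
appearances = pd.read_csv("Appearances.csv")

# Players actually inducted as players (not managers, umpires, ...).
inducted = set(
    hof.loc[(hof["inducted"] == "Y") & (hof["category"] == "Player"), "playerID"]
)

# Primary position = the slot with the most career games.
pos_cols = {"G_c": "Catcher", "G_1b": "First Base", "G_2b": "Second Base",
            "G_3b": "Third Base", "G_ss": "Shortstop", "G_of": "Outfielder"}
games = appearances.groupby("playerID")[list(pos_cols)].sum()
primary = games.idxmax(axis=1).map(pos_cols)

career = career.join(primary.rename("position"), how="inner")

rows = []
for pos, grp in career.groupby("position"):
    in_woba = grp.loc[grp.index.isin(inducted), "wOBA"]
    if in_woba.empty:
        continue
    cutoff = in_woba.quantile(0.25)  # 25th percentile of inductees at this position
    mask = ~grp.index.isin(inducted) & (grp["wOBA"] > cutoff)
    for pid, woba in grp.loc[mask, "wOBA"].items():
        pct = (in_woba < woba).mean() * 100  # share of inductees this player out-hits
        rows.append((pid, pos, round(woba, 3), round(pct, 1)))

candidates = pd.DataFrame(
    rows, columns=["playerID", "position", "wOBA", "pctInducteesOutperformed"]
)
print(candidates.sort_values("pctInducteesOutperformed", ascending=False))
```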
The implementations require:
- Python 3.7+
Future work:
- Implement additional statistical analyses
- Create visualizations of the results
- Expand the dataset to include more recent years
- Optimize performance for each implementation