demo.py showcases the program's main data analysis capabilities.
The program aims to answer the following questions using the provided dataset:
- How do league-wide batting statistics (such as AVG, OBP, and SLG) change over the history of the MLB (from 1871 to 2015)?
- How do the same batting statistics change over a given player's individual career?
- Based on a set of features for a given player (past batting data, age, etc.) and given the number of at-bats in a season, what is a predicted batting average (AVG)?
The data analyzed by this program is The Baseball Databank, which provides information on baseball players, teams, and games from 1871 to 2015.
This program utilizes the Master and Batting tables.
import pandas as pd
# Load the CSV files into pandas DataFrames
# (forward slashes work on Windows, macOS, and Linux)
batting_data = pd.read_csv("data/Batting.csv")
master_data = pd.read_csv("data/Master.csv")
The Master table has 18,846 rows and contains information about individual players. The table looks like this:
playerID | birthYear | birthMonth | ... | nameFirst | nameLast | weight | height | bats |
---|---|---|---|---|---|---|---|---|
bondsba01 | 1964 | 7 | ... | Barry | Bonds | 185 | 73 | L |
ordonma01 | 1974 | 1 | ... | Magglio | Ordonez | 215 | 72 | R |
palmera01 | 1964 | 9 | ... | Rafael | Palmeiro | 180 | 72 | L |
The Batting table has 101,332 rows and contains batting data for individual players in a year. The table looks like this:
playerID | yearID | ... | AB | H | 2B | 3B | ... | BB |
---|---|---|---|---|---|---|---|---|
bondsba01 | 1995 | ... | 506 | 149 | 30 | 7 | ... | 120 |
bondsba01 | 1996 | ... | 517 | 159 | 27 | 3 | ... | 151 |
ordonma01 | 2004 | ... | 202 | 59 | 8 | 2 | ... | 16 |
palmera01 | 2004 | ... | 550 | 142 | 29 | 0 | ... | 86 |
For this data to be useful, the Batting and Master tables are joined with playerID as the key, and the desired columns are specified.
# Join the two tables on playerID, keep the desired columns,
# and sort the result by season and player name
pd.merge(master_data, \
         batting_data, \
         on=["playerID"]) \
    [["playerID", "nameLast", "nameFirst", "birthYear", "yearID", "AB", "H", "2B", "3B", "HR", "BB", "HBP", "SO", "SF"]] \
    .sort_values(by=["yearID", "nameLast", "nameFirst"])
The resulting table includes all the rows in the Batting table, but columns with player information from the Master table are added:
playerID | nameLast | nameFirst | yearID | AB | H | 2B | ... | BB |
---|---|---|---|---|---|---|---|---|
bondsba01 | Bonds | Barry | 1995 | 506 | 149 | 30 | ... | 120 |
bondsba01 | Bonds | Barry | 1996 | 517 | 159 | 27 | ... | 151 |
ordonma01 | Ordonez | Magglio | 2004 | 202 | 59 | 8 | ... | 16 |
palmera01 | Palmeiro | Rafael | 2004 | 550 | 142 | 29 | ... | 86 |
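The join can be tried out on a few rows before running it on the full tables. The sketch below is only an illustration: it reuses the Bonds and Ordonez rows shown above, adds a made-up player (`nobatter01`) with no batting rows, and passes two optional `pd.merge` arguments that the program itself does not use (`how` and `validate`).

```python
import pandas as pd

# Toy stand-ins for the Master and Batting tables.
# "nobatter01" is a hypothetical player added to show what an inner join drops.
master = pd.DataFrame({
    "playerID": ["bondsba01", "ordonma01", "nobatter01"],
    "nameLast": ["Bonds", "Ordonez", "Nobody"],
    "nameFirst": ["Barry", "Magglio", "Nick"],
})
batting = pd.DataFrame({
    "playerID": ["bondsba01", "bondsba01", "ordonma01"],
    "yearID": [1995, 1996, 2004],
    "AB": [506, 517, 202],
    "H": [149, 159, 59],
})

# how="inner" is the default: only playerIDs present in both tables survive.
# validate="one_to_many" raises an error if playerID is not unique in master.
merged = pd.merge(master, batting, on="playerID", how="inner", validate="one_to_many")

print(merged)
```

Every batting row keeps its values and gains the name columns, while the player without batting rows disappears from the result.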
The BattingDataDisplay class handles printing and displaying tables and graphs for different subsets of the data. All images and tables in this section are generated in this way.
How have league-wide batting statistics (such as AVG, OBP, and SLG) changed over the history of the MLB (from 1871 to 2015)?
We can identify trends in league-wide batting statistics over a period of almost 150 years with the following graph:
This graph is generated with the following method:
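import matplotlib.pyplot as plt

def graph_league_batting_statistics(self):
    # Get the league-wide statistics for each season
    league_data = self.batting_data.statistics_for_league()
    # Plot each statistic as a plain line with no point markers
    league_data.plot(marker="none")
    plt.title("MLB League Statistics 1871-2015")
    plt.xlabel("Season")
    plt.show()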
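The `statistics_for_league()` method itself is not shown here. As a rough sketch of what such a calculation could look like, the hypothetical helper below (`league_statistics`, not part of the program) sums the counting columns per season and applies the standard definitions of AVG, OBP, and SLG. The input rows are made-up numbers chosen to keep the arithmetic easy to follow.

```python
import pandas as pd

def league_statistics(batting: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: league-wide AVG, OBP, and SLG for each season."""
    counting = ["AB", "H", "2B", "3B", "HR", "BB", "HBP", "SF"]
    # Sum every player's totals within a season (missing values are skipped)
    totals = batting.groupby("yearID")[counting].sum()

    # Singles are the hits that were not doubles, triples, or home runs
    singles = totals["H"] - totals["2B"] - totals["3B"] - totals["HR"]
    total_bases = singles + 2 * totals["2B"] + 3 * totals["3B"] + 4 * totals["HR"]

    times_on_base = totals["H"] + totals["BB"] + totals["HBP"]
    plate_chances = totals["AB"] + totals["BB"] + totals["HBP"] + totals["SF"]

    return pd.DataFrame({
        "AVG": totals["H"] / totals["AB"],
        "OBP": times_on_base / plate_chances,
        "SLG": total_bases / totals["AB"],
    })

# Made-up rows for illustration only (not taken from the dataset)
toy_batting = pd.DataFrame({
    "yearID": [2000, 2000, 2001],
    "AB": [400, 600, 500],
    "H": [100, 200, 150],
    "2B": [20, 40, 30],
    "3B": [5, 5, 0],
    "HR": [10, 30, 20],
    "BB": [40, 60, 50],
    "HBP": [5, 5, 0],
    "SF": [5, 5, 0],
})

league_stats = league_statistics(toy_batting)
print(league_stats)
```

Summing before dividing weights each player by playing time, so a batter with 600 at-bats influences the league figure more than one with 40.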
Just as with league statistics, given a playerID, we can graph a player's statistics over his career. These graphs show trends and changes in a player's batting statistics over his individual career:
These graphs are generated with the following method given the playerID of any of the 18,000+ players in the dataset:
def graph_player_batting_statistics(self, player_id):
    player_data = self.batting_data.statistics_for_player(player_id)
    player_data.plot(marker="o", linestyle="dashed")
    plt.title(f"{self.batting_data.name_for_player(player_id)} Statistics")
    plt.xlabel("Season")
    plt.show()
Based on a set of features for a given player (past batting data, age, etc.) and given the number of at-bats in a season, what is a predicted batting average (AVG)?
First, a predictive model is created with the xgboost library, which implements a machine learning algorithm called Extreme Gradient Boosting. Rather than relying on a single decision tree, this algorithm builds many trees in sequence, with each new tree correcting the errors of the ones before it.
from xgboost import XGBRegressor
gradient_model = XGBRegressor(random_state=0, n_estimators=500, learning_rate=0.04)
With the model created, we use the for_predict_model() method of the BattingData class to generate a table that looks like this:
yearID | playerID | nameLast | nameFirst | age | AB | AVG | careerAVG |
---|---|---|---|---|---|---|---|
2015 | zimmery01 | Zimmerman | Ryan | 31.0 | 346.0 | 0.248555 | 0.284188 |
1988 | phelpke01 | Phelps | Ken | 34.0 | 190.0 | 0.284211 | 0.244003 |
2004 | kotsama01 | Kotsay | Mark | 29.0 | 606.0 | 0.313531 | 0.282828 |
This is the data that we will use to fit the predictive model.
As can be seen, this method creates new columns such as careerAVG and age based on other data in the tables. This is called feature engineering, and it is key to getting more accurate predictions. By engineering these features, we give the predictive model more information on which to base its decisions.
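The body of `for_predict_model()` is not shown, so the following is only a sketch of how such columns could be derived. It assumes age is `yearID - birthYear` (which matches the ages in the table above) and that careerAVG uses only earlier seasons; whether the program also counts the current season is not stated. The helper name `add_engineered_features` is hypothetical.

```python
import pandas as pd

def add_engineered_features(merged: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add age, AVG, and careerAVG columns."""
    df = merged.sort_values(["playerID", "yearID"]).copy()

    # Age during the season, assumed to be season year minus birth year
    df["age"] = df["yearID"] - df["birthYear"]
    # Batting average for the season itself
    df["AVG"] = df["H"] / df["AB"]

    # Running totals per player, minus the current row, give the totals
    # from earlier seasons only (assumption: current season is excluded)
    by_player = df.groupby("playerID")
    prior_hits = by_player["H"].cumsum() - df["H"]
    prior_at_bats = by_player["AB"].cumsum() - df["AB"]

    # A player's first season has no history, so 0 / 0 yields NaN
    df["careerAVG"] = prior_hits / prior_at_bats
    return df

# Two Barry Bonds rows from the tables above, treated as a complete career
# purely for illustration
toy_merged = pd.DataFrame({
    "playerID": ["bondsba01", "bondsba01"],
    "birthYear": [1964, 1964],
    "yearID": [1995, 1996],
    "AB": [506, 517],
    "H": [149, 159],
})

features = add_engineered_features(toy_merged)
print(features[["yearID", "age", "AVG", "careerAVG"]])
```

Excluding the current season from careerAVG keeps the feature from leaking part of the target (that season's AVG) into the model's inputs.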
The following code separates the data into the Prediction Target y and the Features X. The features are the columns that the predictive model will use to predict the target.
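# Create the feature set X and the prediction target y
batting_data = BattingData()
# Leave out the season being predicted (year) so the model is not fitted on it
X = batting_data.for_predict_model().query(f"yearID != {year}")[["age", "AB", "AVG", "careerAVG"]]
# Drop rows with no batting average, then split the target off from the features
# (reassigning instead of using inplace=True avoids modifying a slice of the original table)
X = X.dropna(axis=0, subset=["AVG"])
y = X["AVG"]
X = X.drop(columns=["AVG"])
# Fit the model to the data
gradient_model.fit(X, y)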
With the model fitted, we can make predictions for a random sample of playerIDs from 2015: ['morelmi01', 'beckhti01', 'swihabl01', 'gomezca01', 'crawfbr01', 'gardnbr01']
# Generate predictions with the model
predictions = pd.Series(gradient_model.predict(batting_data_year[["age", "AB", "careerAVG"]]))
This table compares the results of the predictive model with the actual batting averages for each player in 2015:
playerID | nameFirst | nameLast | real_AVG | predicted_AVG |
---|---|---|---|---|
beckhti01 | Tim | Beckham | 0.221675 | 0.246462 |
crawfbr01 | Brandon | Crawford | 0.256410 | 0.257204 |
gardnbr01 | Brett | Gardner | 0.259194 | 0.272702 |
gomezca01 | Carlos | Gomez | 0.262238 | 0.252414 |
gomezca01 | Carlos | Gomez | 0.241611 | 0.244528 |
morelmi01 | Mitch | Moreland | 0.278132 | 0.255698 |
swihabl01 | Blake | Swihart | 0.274306 | 0.253211 |
We can test the accuracy of the predictive model by randomly splitting the original data into two groups: training and validation. This is done using the sklearn (scikit-learn) library:
from sklearn.model_selection import train_test_split
# Split data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
Then, we fit the model on the training data only and test it on the validation data:
gradient_model.fit(X_train, y_train)
# Generate predictions with the model
predictions = pd.Series(gradient_model.predict(X_valid))
Lastly, we can check these predictions against the validation target data and get the Mean Absolute Error of the predictive model's predictions:
from sklearn.metrics import mean_absolute_error
# Compute the mean absolute error of the predictions
# (stored under a different name so the imported function is not overwritten)
mae = mean_absolute_error(y_valid, predictions)
The Mean Absolute Error for this predictive model was measured at 0.02592. In the context of this program, this means that on average, the model's prediction for a player's batting average differs from the actual batting average by 0.02592. Not bad :-)
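To make the metric concrete, the short sketch below computes the Mean Absolute Error by hand for the seven rows of the 2015 comparison table above. It uses plain Python rather than sklearn, and the variable names are illustrative only.

```python
# real_AVG and predicted_AVG columns from the 2015 comparison table
real_avg = [0.221675, 0.256410, 0.259194, 0.262238, 0.241611, 0.278132, 0.274306]
predicted_avg = [0.246462, 0.257204, 0.272702, 0.252414, 0.244528, 0.255698, 0.253211]

# Absolute error per row: how far each prediction is from the real value
absolute_errors = [abs(real - pred) for real, pred in zip(real_avg, predicted_avg)]

# Mean Absolute Error: the average of those distances
sample_mae = sum(absolute_errors) / len(absolute_errors)

print(f"MAE over {len(absolute_errors)} rows: {sample_mae:.5f}")
```

For these seven rows the result is about 0.0136, lower than the 0.02592 measured on the validation set. A sample this small says little on its own, which is why the validation split is the better measure.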
This program was developed using Visual Studio Code and Python 3.11.3.
The following Python libraries were used:
- pandas
- sklearn (scikit-learn)
- xgboost
- matplotlib
This project was an exercise in writing software to analyze complex and large datasets. I am not a data scientist, and therefore many improvements in the methods of analyzing this data are possible, such as:
- Engineer more effective features for the for_predict_model() table. This would increase the accuracy of the predictions.
- Improvements in how statistical data is displayed.
- More sophisticated methods of analyzing the data.
In terms of the software itself, there are many improvements and additions that could be made:
- A UI to interact with the data and the way in which it is displayed.
- Better processing and managing of individual pandas DataFrames to increase program efficiency and readability.