Overview

demo.py demonstrates the program's main data analysis features.

Program Objectives

The program aims to answer the following questions using the provided dataset:

How do league-wide batting statistics (such as AVG, OBP, and SLG) change over the history of the MLB (from 1871 to 2015)?
How do the same batting statistics change over a given player's individual career?
Based on a set of features for a given player (past batting data, age, etc.) and given the number of at-bats in a season, what is a predicted batting average (AVG)?

Dataset

The data analyzed by this program is The Baseball Databank, which provides information on baseball players, teams, and games from 1871 to 2015.

This program utilizes the Master and Batting tables.

import pandas as pd

# Load the CSV files into pandas DataFrames (forward slashes work on every platform)
batting_data = pd.read_csv("data/Batting.csv")
master_data = pd.read_csv("data/Master.csv")

The Master table has 18,846 rows and contains information about individual players. The table looks like this:

playerID	birthYear	birthMonth	...	nameFirst	nameLast	weight	height	bats
bondsba01	1964	7	...	Barry	Bonds	185	73	L
ordonma01	1974	1	...	Magglio	Ordonez	215	72	R
palmera01	1964	9	...	Rafael	Palmeiro	180	72	L

The Batting table has 101,332 rows and contains batting data for individual players in a year. The table looks like this:

playerID	yearID	...	AB	H	2B	3B	...	BB
bondsba01	1995	...	506	149	30	7	...	120
bondsba01	1996	...	517	159	27	3	...	151
ordonma01	2004	...	202	59	8	2	...	16
palmera01	2004	...	550	142	29	0	...	86

For this data to be useful, the Batting Table and Master Table are joined with playerID as the key index, and the desired columns are specified.

# Join the two tables on playerID, keep the desired columns, then sort
columns = ["playerID", "nameLast", "nameFirst", "birthYear", "yearID",
           "AB", "H", "2B", "3B", "HR", "BB", "HBP", "SO", "SF"]
joined_data = (
    pd.merge(master_data, batting_data, on="playerID")[columns]
    .sort_values(by=["yearID", "nameLast", "nameFirst"])
)

The resulting table includes all the rows in the Batting table, but columns with player information from the Master table are added:

playerID	nameLast	nameFirst	yearID	AB	H	2B	...	BB
bondsba01	Bonds	Barry	1995	506	149	30	...	120
bondsba01	Bonds	Barry	1996	517	159	27	...	151
ordonma01	Ordonez	Magglio	2004	202	59	8	...	16
palmera01	Palmeiro	Rafael	2004	550	142	29	...	86
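
To make the join concrete, here is a minimal sketch that uses tiny hand-built frames in place of the CSV files. The frames master and batting hold only a few of the columns shown above, and validate="one_to_many" is an optional safeguard added for illustration; the original code is not known to use it.

```python
import pandas as pd

# Tiny stand-ins for Master.csv and Batting.csv (values copied from the tables above)
master = pd.DataFrame({
    "playerID": ["bondsba01", "ordonma01"],
    "nameLast": ["Bonds", "Ordonez"],
    "nameFirst": ["Barry", "Magglio"],
})
batting = pd.DataFrame({
    "playerID": ["bondsba01", "bondsba01", "ordonma01"],
    "yearID": [1995, 1996, 2004],
    "AB": [506, 517, 202],
    "H": [149, 159, 59],
})

# One Master row matches many Batting rows; validate raises if playerID repeats in master
joined = pd.merge(master, batting, on="playerID", validate="one_to_many")
joined = joined.sort_values(by=["yearID", "nameLast", "nameFirst"]).reset_index(drop=True)
print(joined)
```

Every Batting row survives the join, and each one picks up the name columns of its matching Master row.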

Data Analysis Results

The BattingDataDisplay class handles printing and displaying tables and graphs for different subsets of the data. All images and tables in this section are generated in this way.

How have league-wide batting statistics (such as AVG, OBP, and SLG) changed over the history of the MLB (from 1871 to 2015)?

We can identify trends in league-wide batting statistics over a period of almost 150 years with the following graph:

This graph is generated with the following method:

def graph_league_batting_statistics(self):
    league_data = self.batting_data.statistics_for_league()
    league_data.plot(marker="none")
    plt.title("MLB League Statistics 1871-2015")
    plt.xlabel("Season")
    plt.show()
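
The implementation of statistics_for_league() is not shown here. As a rough sketch of what such a method could compute, the snippet below sums the counting stats per season and applies the standard definitions AVG = H / AB, OBP = (H + BB + HBP) / (AB + BB + HBP + SF), and SLG = total bases / AB. The function name league_rates and the sample numbers are made up for illustration.

```python
import pandas as pd

def league_rates(batting: pd.DataFrame) -> pd.DataFrame:
    """Sum counting stats per season, then derive AVG, OBP, and SLG."""
    totals = batting.groupby("yearID")[["AB", "H", "2B", "3B", "HR", "BB", "HBP", "SF"]].sum()
    singles = totals["H"] - totals["2B"] - totals["3B"] - totals["HR"]
    total_bases = singles + 2 * totals["2B"] + 3 * totals["3B"] + 4 * totals["HR"]
    return pd.DataFrame({
        "AVG": totals["H"] / totals["AB"],
        "OBP": (totals["H"] + totals["BB"] + totals["HBP"])
               / (totals["AB"] + totals["BB"] + totals["HBP"] + totals["SF"]),
        "SLG": total_bases / totals["AB"],
    })

# Made-up rows: two players in a single season
sample = pd.DataFrame({
    "yearID": [2000, 2000],
    "AB": [300, 200], "H": [90, 60], "2B": [20, 10], "3B": [3, 2],
    "HR": [12, 8], "BB": [30, 20], "HBP": [3, 2], "SF": [3, 2],
})
rates = league_rates(sample)
print(rates)
```

Summing the counting stats before dividing weights each player by playing time, which is not the same as averaging the players' individual rates.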

How have the same batting statistics changed over a given player's individual career?

Just as with league statistics, given a playerID, we can graph a player's statistics over his career. These graphs show trends and changes in a player's batting statistics over his individual career:

These graphs are generated with the following method given the playerID of any of the 18,000+ players in the dataset:

def graph_player_batting_statistics(self, player_id):
    player_data = self.batting_data.statistics_for_player(player_id)
    player_data.plot(marker="o", linestyle="dashed")
    plt.title(f"{self.batting_data.name_for_player(player_id)} Statistics")
    plt.xlabel("Season")
    plt.show()

Based on a set of features for a given player (past batting data, age, etc.) and given the number of at-bats in a season, what is a predicted batting average (AVG)?

The Predictive Model

First, a predictive model is built with Extreme Gradient Boosting (XGBoost), a machine learning algorithm provided by the xgboost library. Rather than relying on a single decision tree, it trains an ensemble of trees in sequence, with each new tree correcting the errors of the ones before it.

from xgboost import XGBRegressor

gradient_model = XGBRegressor(random_state=0, n_estimators=500, learning_rate=0.04)

With the model created, we use the for_predict_model() method of the BattingData class to generate a table that looks like this:

yearID	playerID	nameLast	nameFirst	age	AB	AVG	careerAVG
2015	zimmery01	Zimmerman	Ryan	31.0	346.0	0.248555	0.284188
1988	phelpke01	Phelps	Ken	34.0	190.0	0.284211	0.244003
2004	kotsama01	Kotsay	Mark	29.0	606.0	0.313531	0.282828

This is the data that we will use to fit the predictive model.

As can be seen, this method creates new columns such as careerAVG and age from other data in the tables. This is called feature engineering, and it is key to getting more accurate predictions: engineered features give the predictive model more information to base its predictions on.
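
How for_predict_model() derives these columns is not shown, so the following is only a sketch under stated assumptions: age is taken as yearID minus birthYear, and careerAVG as cumulative hits divided by cumulative at-bats over a player's earlier seasons. Whether the original includes the current season in careerAVG is unknown. The function add_features and the player doejo01 are hypothetical.

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add age, AVG, and careerAVG columns (definitions assumed for illustration)."""
    out = df.sort_values(["playerID", "yearID"]).copy()
    out["age"] = out["yearID"] - out["birthYear"]
    out["AVG"] = out["H"] / out["AB"]
    # Cumulative totals from earlier seasons only, so the target season does not leak in
    grouped = out.groupby("playerID")
    prior_h = grouped["H"].cumsum() - out["H"]
    prior_ab = grouped["AB"].cumsum() - out["AB"]
    out["careerAVG"] = prior_h / prior_ab  # NaN for a player's first season (0 / 0)
    return out

# Made-up player with three seasons
sample = pd.DataFrame({
    "playerID": ["doejo01"] * 3,
    "yearID": [2005, 2006, 2007],
    "birthYear": [1980] * 3,
    "AB": [400, 500, 100],
    "H": [100, 150, 20],
})
features = add_features(sample)
print(features[["yearID", "age", "AVG", "careerAVG"]])
```

Using only earlier seasons keeps the feature from containing the very value the model is asked to predict.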

The following code separates the data into the Prediction Target y and the Features X. The features are the columns that the predictive model will use to predict the target.

# Create X and y sets, leaving out the season we want to predict
batting_data = BattingData()
year = 2015  # season held out for prediction (2015 in this walkthrough)
data = batting_data.for_predict_model().query(f"yearID != {year}")[["age", "AB", "AVG", "careerAVG"]]
data = data.dropna(axis=0, subset=["AVG"])
y = data["AVG"]
X = data.drop(columns=["AVG"])

# Fit the model to the data
gradient_model.fit(X, y)

With the model fitted, we can make predictions for a random sample of playerIDs from 2015: ['morelmi01', 'beckhti01', 'swihabl01', 'gomezca01', 'crawfbr01', 'gardnbr01']

# Generate predictions with the model
predictions = pd.Series(gradient_model.predict(batting_data_year[["age", "AB", "careerAVG"]]))

This table compares the model's predictions with the actual batting averages for each player in 2015. Carlos Gomez appears twice because the data contains two 2015 rows for him.

playerID	nameFirst	nameLast	real_AVG	predicted_AVG
beckhti01	Tim	Beckham	0.221675	0.246462
crawfbr01	Brandon	Crawford	0.256410	0.257204
gardnbr01	Brett	Gardner	0.259194	0.272702
gomezca01	Carlos	Gomez	0.262238	0.252414
gomezca01	Carlos	Gomez	0.241611	0.244528
morelmi01	Mitch	Moreland	0.278132	0.255698
swihabl01	Blake	Swihart	0.274306	0.253211
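
As a quick check on this sample, the per-row absolute errors can be computed directly from the two columns. The sketch below uses plain Python with the values copied from the table; the variable names are illustrative only.

```python
# (playerID, real_AVG, predicted_AVG) copied from the comparison table above
rows = [
    ("beckhti01", 0.221675, 0.246462),
    ("crawfbr01", 0.256410, 0.257204),
    ("gardnbr01", 0.259194, 0.272702),
    ("gomezca01", 0.262238, 0.252414),
    ("gomezca01", 0.241611, 0.244528),
    ("morelmi01", 0.278132, 0.255698),
    ("swihabl01", 0.274306, 0.253211),
]

# Absolute difference between actual and predicted AVG for each row
abs_errors = [abs(real - predicted) for _, real, predicted in rows]
sample_mae = sum(abs_errors) / len(abs_errors)

for (player_id, _, _), error in zip(rows, abs_errors):
    print(f"{player_id}: {error:.6f}")
print(f"mean absolute error on this sample: {sample_mae:.6f}")
```

On these seven rows the mean absolute error comes out to roughly 0.0136, with the largest miss (Beckham) at about 0.025.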

Testing the Accuracy of the Predictive Model

We can test the accuracy of the predictive model by randomly splitting the original data into two groups: training and validation. This is done using the scikit-learn (sklearn) library:

from sklearn.model_selection import train_test_split

# Split data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

Then, we fit the model on the training data only and evaluate it on the validation data:

gradient_model.fit(X_train, y_train)

# Generate predictions with the model
predictions = pd.Series(gradient_model.predict(X_valid))

Lastly, we can compare these predictions with the validation target data to get the Mean Absolute Error of the model's predictions:

from sklearn.metrics import mean_absolute_error

# Compute the mean absolute error (a new name avoids shadowing the function)
mae = mean_absolute_error(y_valid, predictions)

The Mean Absolute Error for this predictive model was measured at 0.02592. In the context of this program, that means the model's prediction for a player's batting average differs from the actual batting average by about 0.026 on average, or roughly 26 points of batting average. Not bad :-)
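
The snippets in this section depend on the BattingData class, so they cannot run on their own. The sketch below walks through the same split, fit, predict, and score steps on synthetic data. It swaps in scikit-learn's GradientBoostingRegressor for XGBRegressor so that only scikit-learn is needed; the synthetic data and the hyperparameters are assumptions, and the resulting error says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature table (not real baseball data)
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "age": rng.integers(20, 41, size=n),
    "AB": rng.integers(50, 601, size=n),
    "careerAVG": rng.normal(0.26, 0.02, size=n),
})
# Season AVG is simulated as career AVG plus random noise
y = X["careerAVG"] + rng.normal(0.0, 0.03, size=n)

# Hold out 20% of the rows for validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0
)

# Stand-in for XGBRegressor: scikit-learn's own gradient boosting model
model = GradientBoostingRegressor(random_state=0, n_estimators=100, learning_rate=0.05)
model.fit(X_train, y_train)

# Score the model on rows it has never seen
predictions = model.predict(X_valid)
mae = mean_absolute_error(y_valid, predictions)
print(f"validation MAE: {mae:.5f}")
```

Scoring only on the held-out rows is what makes the error an estimate of performance on unseen seasons instead of a measure of how well the model memorized its training data.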

Development

This program was developed using Visual Studio Code and Python 3.11.3.

The following Python libraries were used:

pandas
scikit-learn (sklearn)
matplotlib
xgboost

Useful Resources

Kaggle

Future Work

This project was an exercise in writing software to analyze complex and large datasets. I am not a data scientist, and therefore many improvements in the methods of analyzing this data are possible, such as:

Engineering more effective features for the for_predict_model() table, which could increase the accuracy of the predictions.
Improving how statistical data is displayed.
Using more sophisticated methods of analyzing the data.

In terms of the software itself, there are many improvements and additions that could be made:

A UI to interact with the data and the way in which it is displayed.
Better processing and managing of individual pandas DataFrames to increase program efficiency and readability.
