This repo contains my analysis of the Premier League. I wanted to understand more about the history of the league, so I collected all the data about the games and players, cleaned it, analyzed it, and wrote an article on Medium about it.
The project is split into three sections (i.e., three Jupyter notebooks):
- Data collection: I used web scraping to collect all the data about the games, the players, and the events that happened in every game (goals, red/yellow cards, substitutions, etc.). The official Premier League website was the main data source. After navigating through it, I understood how the data was displayed and the best way of getting it:
  - First, collect the ID of each season from a dropdown menu on the games' pages
  - Loop through each season's page (https://www.premierleague.com/.../{season_id}) to collect each game ID
  - Loop through each game's page (https://www.premierleague.com/.../{match_id}) to collect each game's data. Luckily for me, the website stores the game's data (along with a lot more than I needed) as JSON readable in the HTML. I just had to flatten the JSON into tabular data, which I split into multiple files (games, events, players)
- Data cleaning: There was way too much information in those JSONs, so I removed some columns, reformatted others, dealt with missing values, and generally cleaned the data into a format suitable for the analysis (EDA).
- Data Analysis & Visualization: Definitely the most exciting and sexy part to read! I analyzed the data to find some interesting facts about the league. Not all insights have a visualization, but here are some where you can enjoy the interactivity, as they're all made with Plotly!
  - I couldn't manage to make the charts of the EDA notebook interactive on GitHub, so the only way to play with them is to run the EDA notebook or to check out some of them below.
  - The EDA has a lot of insights that don't have charts, so go take a look!
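
To give an idea of the "flatten the JSON and split it into multiple files" step from the data collection notebook, here is a minimal sketch using only the standard library. The field names (`id`, `ground`, `events`, `clock`) and the sample payload are made up for illustration and are not the real structure of the website's JSON; the notebook itself may do this differently.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def split_match(match):
    """Split one match payload into a game row and a list of event rows."""
    match = dict(match)  # shallow copy so the caller's dict is untouched
    events = match.pop("events", [])  # hypothetical key holding the events
    game_row = flatten(match)
    # Tag every event with its match ID so the two tables can be joined later
    event_rows = [{"match_id": match["id"], **flatten(e)} for e in events]
    return game_row, event_rows


# Hypothetical payload, NOT the real schema of the website
raw = """
{
  "id": 1,
  "ground": {"name": "Example Stadium", "city": "Example City"},
  "events": [
    {"type": "goal", "clock": {"secs": 600}},
    {"type": "yellow_card", "clock": {"secs": 1500}}
  ]
}
"""

game_row, event_rows = split_match(json.loads(raw))
```

Each `game_row` would go into the games file and each entry of `event_rows` into the events file; players could be handled the same way as events.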
If you don't want to run EDA.ipynb, check the charts below or in docs/images. You can open them by putting http://htmlpreview.github.io/? in front of the http in the URL.
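
As a small illustration of the prefix trick, the helper below builds the preview link from a chart's URL. The function name and the example URL (with its `user/repo` placeholder and `chart.html` file name) are hypothetical; replace them with the real link to a file in docs/images.

```python
PREVIEW_PREFIX = "http://htmlpreview.github.io/?"


def preview_url(chart_url):
    """Return the htmlpreview link for an HTML chart hosted on GitHub."""
    return PREVIEW_PREFIX + chart_url


# Placeholder URL: "user/repo" and "chart.html" are not real names
example = preview_url(
    "https://github.com/user/repo/blob/master/docs/images/chart.html"
)
print(example)
```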