# Almanac

A deep learning predictor for the outcomes of football (soccer) matches. The models created here are encoder-only transformers that use multi-headed self-attention over the time series of a team's previous matches to estimate its likelihood of winning a future match.
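
As a rough illustration, here is a minimal sketch of that kind of architecture in PyTorch. The dimensions, layer counts, and mean-pooling step are assumptions made for the example, not the repository's actual configuration:

```python
import torch
import torch.nn as nn

class MatchOutcomeEncoder(nn.Module):
    """Encoder-only transformer over a team's recent match history.

    Hypothetical sketch: feature sizes, layer counts, and the pooling
    strategy are assumptions, not taken from this repository's scripts.
    """

    def __init__(self, n_features=32, d_model=128, n_heads=8, n_layers=4):
        super().__init__()
        # Project raw per-match statistics into the model dimension.
        self.embed = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Three outcomes: win, draw, loss.
        self.head = nn.Linear(d_model, 3)

    def forward(self, match_history):
        # match_history: (batch, n_previous_matches, n_features)
        x = self.embed(match_history)
        x = self.encoder(x)   # multi-headed self-attention over the sequence
        x = x.mean(dim=1)     # pool the match sequence into one vector
        return self.head(x)   # logits; softmax gives win/draw/loss probabilities
```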

## How to train the model

1. Run `ScrapeData.py` to obtain the datasets of previous football matches and their statistics from the internet.

2. Run `TransformData.py` to create the training and test datasets for the model.

3. Run `EncoderPretraining.py`. This pretrains the transformer model and saves the learned weights to a `.pt` file.

4. Run `Training.py`. This fine-tunes the pretrained model to predict each team's percentage chance of winning a given match, and saves the fine-tuned model to a new `.pt` file (a sketch of the save/load flow follows this list).
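
The steps above hand weights between scripts via `.pt` files. Below is a minimal sketch of that save/load flow, assuming PyTorch; the file names, the stand-in layers, and the separate prediction head are hypothetical, since the real scripts define their own:

```python
import torch
import torch.nn as nn

# Hypothetical file names; the real paths are set inside the scripts.
PRETRAINED_WEIGHTS = "encoder_pretrained.pt"  # written by EncoderPretraining.py
FINETUNED_WEIGHTS = "almanac_finetuned.pt"    # written by Training.py

# Stand-in encoder; the real scripts build the transformer described above.
encoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 128))

# Step 3: after pretraining, the learned weights are saved to a .pt file.
torch.save(encoder.state_dict(), PRETRAINED_WEIGHTS)

# Step 4: fine-tuning starts from those saved weights and adds a head
# that outputs win / draw / loss logits.
model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 128))
model.load_state_dict(torch.load(PRETRAINED_WEIGHTS))
head = nn.Linear(128, 3)

# ... fine-tuning loop over the training set would go here ...

torch.save(
    {"encoder": model.state_dict(), "head": head.state_dict()},
    FINETUNED_WEIGHTS,
)
```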

## Current Performance

Model performance is limited by several factors, such as the number of features, the dataset size, and the model scale. Nevertheless, the predictions are well calibrated. The plot below compares the predicted probabilities (Predicted Accuracy) with the fraction of those predictions that actually came true (Model Accuracy), across win, draw, and loss outcomes, over an unseen test set of ~1000 matches. Definitions of both quantities are given below the plot.

*(Performance plot: Model Accuracy vs. Predicted Accuracy per probability bucket over the test set)*

You can see that the predicted probabilities match the real-world results quite closely, demonstrating the efficacy of the trained model.

Predicted Accuracy - For every match in the test set, the predicted probabilities of a win, a draw, and a loss are each rounded to the nearest 5% (with a 2.5% offset, giving buckets centred at 2.5%, 7.5%, ..., 97.5%). All predictions that fall into the same bucket are grouped together to form the data considered for that bar.

Model Accuracy - For each bar, the correct and incorrect predictions in that bucket are used to calculate the mean accuracy, i.e. the fraction of those predictions that actually came true. This value is the model accuracy. How close it lies to the predicted accuracy (the bucket's rounded value) indicates how well the model is performing.
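
To make both definitions concrete, here is a minimal sketch of the bucketing and per-bar accuracy computation, assuming NumPy; the function name, clipping choice, and sample data are illustrative rather than taken from the repository:

```python
import numpy as np

def calibration_buckets(probs, outcomes, width=0.05):
    """Group predictions into 5%-wide buckets and compare, per bucket,
    the bucket's nominal probability (predicted accuracy) with the
    fraction of predictions that came true (model accuracy).

    probs    -- predicted probability of each outcome (win, draw or loss)
    outcomes -- 1 if that outcome actually happened, else 0
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    # Rounding to the nearest 5% with a 2.5% offset gives bucket centres
    # at 2.5%, 7.5%, ..., 97.5%.
    centres = (np.floor(probs / width) + 0.5) * width
    centres = np.minimum(centres, 1.0 - width / 2)  # keep p == 1.0 in the top bucket
    results = {}
    for c in np.unique(centres):
        mask = centres == c
        results[round(float(c) * 100, 1)] = {
            "predicted_accuracy": 100 * float(c),            # the bar's nominal value
            "model_accuracy": 100 * float(outcomes[mask].mean()),  # observed hit rate
            "n_samples": int(mask.sum()),
        }
    return results

# Example: three hypothetical predictions, two of which came true.
buckets = calibration_buckets([0.61, 0.58, 0.12], [1, 1, 0])
```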

NOTE: Some predicted values (such as 97.5%) have no bar - this is because the model never assigned that probability to any outcome, so the number of samples in that bucket is zero.