emlov3-session-02

PyTorch Docker Assignment

Welcome to the PyTorch Docker Assignment. This assignment is designed to help you understand and work with Docker and PyTorch.

Assignment Overview

In this assignment, you will:

  1. Create a Dockerfile for a PyTorch (CPU version) environment (a minimal sketch follows this list).
  2. Keep the size of your Docker image under 1GB (uncompressed).
  3. Train any model on the MNIST dataset inside the Docker container.
  4. Save the trained model checkpoint to the host operating system.
  5. Add an option to resume model training from a checkpoint.
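
A minimal sketch of how the Dockerfile and run command could look is shown below. The base image, image name, tag, and mount paths are illustrative assumptions, not part of the assignment; installing CPU-only wheels from the PyTorch index is one common way to keep the uncompressed image small.

# Illustrative Dockerfile sketch; base image, paths, and the wheel index are assumptions.
FROM python:3.9-slim
WORKDIR /opt/src
# CPU-only wheels are much smaller than the default CUDA builds.
RUN pip install --no-cache-dir torch torchvision --index-url https://download.pytorch.org/whl/cpu
COPY train.py .
# Run from /workspace so a host volume mounted there receives the checkpoint.
WORKDIR /workspace
ENTRYPOINT ["python", "/opt/src/train.py"]

Building the image and running it with a host directory mounted for the checkpoint might then look like:

docker build -t mnist-docker .
docker run --rm -v "$(pwd):/workspace" mnist-docker --epochs 1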

Starter Code

The starter code in train.py provides a basic structure for loading data, defining a model, and running the training and testing loops. Complete the code at the locations marked with TODO: comments.

Submission

When you have completed the assignment, push your code to your GitHub repository. The GitHub Actions workflow will automatically build your Docker image, run your training script, and check whether the assignment requirements have been met. Check the GitHub Actions tab for the results of these checks, and make sure all checks pass before you submit the assignment.

Solution

This repository contains a PyTorch implementation to train and test a neural network on the MNIST dataset. The model architecture includes convolutional and fully connected layers designed to classify images of handwritten digits (0-9). The script allows for customizable training options via command-line arguments.

Features

  • Customizable model training with configurable batch size, epochs, learning rate, and more.
  • Model checkpointing for saving and resuming training from saved states.
  • Logging of training progress and performance metrics during each epoch.
  • Support for CUDA and macOS Metal (MPS) GPU acceleration.
  • Command-line argument parsing for ease of use.

Requirements

  • Python 3.7+
  • PyTorch 1.9+
  • Torchvision
  • argparse (part of the Python standard library; no separate installation needed)

Install the required dependencies with:

pip install torch torchvision

Usage

Running the Script

To train the model from scratch, use the following command:

python train.py --batch-size 64 --epochs 14 --lr 1.0

Command-Line Arguments

The following arguments are supported to customize the training process; a sketch of how they could be defined with argparse follows the list:

  • --batch-size (default: 64): Input batch size for training.
  • --test-batch-size (default: 1000): Input batch size for testing.
  • --epochs (default: 14): Number of epochs to train.
  • --lr (default: 1.0): Learning rate for the optimizer.
  • --gamma (default: 0.7): Learning rate step decay factor.
  • --no-cuda: Disable CUDA (GPU) training even if CUDA is available.
  • --no-mps: Disable macOS GPU training.
  • --dry-run: Run a quick single batch to check if the pipeline works.
  • --log-interval (default: 10): Number of batches to wait before logging training status.
  • --save-model (default: True): Save the model at each epoch.
  • --resume: Resume training from the last saved checkpoint.
  • --seed (default: 1): Seed for random number generation.
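
Below is a sketch of how these flags could be wired up with argparse, together with the device selection implied by --no-cuda and --no-mps. The defaults mirror the list above; the exact definitions in train.py may differ.

import argparse
import torch

parser = argparse.ArgumentParser(description="MNIST training")
parser.add_argument("--batch-size", type=int, default=64)
parser.add_argument("--test-batch-size", type=int, default=1000)
parser.add_argument("--epochs", type=int, default=14)
parser.add_argument("--lr", type=float, default=1.0)
parser.add_argument("--gamma", type=float, default=0.7)
parser.add_argument("--no-cuda", action="store_true")
parser.add_argument("--no-mps", action="store_true")
parser.add_argument("--dry-run", action="store_true")
parser.add_argument("--log-interval", type=int, default=10)
parser.add_argument("--save-model", action="store_true", default=True)
parser.add_argument("--resume", action="store_true")
parser.add_argument("--seed", type=int, default=1)
args = parser.parse_args()

torch.manual_seed(args.seed)

# Prefer CUDA, then MPS, then CPU, unless explicitly disabled.
# Note: the MPS availability check requires PyTorch 1.12 or newer.
use_cuda = not args.no_cuda and torch.cuda.is_available()
use_mps = not args.no_mps and torch.backends.mps.is_available()
device = torch.device("cuda" if use_cuda else "mps" if use_mps else "cpu")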

Resuming Training

To resume training from a saved checkpoint, use the --resume flag:

python train.py --resume

Ensure that the model_checkpoint.pth file is present in the current directory.
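
A minimal sketch of the resume logic, assuming the model, optimizer, args, and device are already in scope and that the checkpoint stores the model state, optimizer state, and last completed epoch under the key names shown (the actual keys in train.py may differ):

import torch

if args.resume:
    checkpoint = torch.load("model_checkpoint.pth", map_location=device)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    # Continue from the epoch after the last completed one.
    start_epoch = checkpoint["epoch"] + 1
else:
    start_epoch = 1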

Example

python train.py --batch-size 64 --epochs 10 --lr 0.1 --gamma 0.9 --log-interval 20 --save-model

Model Architecture

The model consists of the following layers; a sketch of the corresponding module follows the list:

  • Two convolutional layers (conv1 and conv2)
  • Two dropout layers (dropout1 and dropout2) to prevent overfitting
  • Fully connected layers (fc1 and fc2)
  • Log-softmax output for multi-class classification (paired with the NLL loss used in training)
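
A sketch of a module matching this description, using the layer sizes from the official PyTorch MNIST example (the exact channel and feature counts in train.py may differ):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        x = self.fc2(x)
        # Log-probabilities pair with the NLL loss used during training.
        return F.log_softmax(x, dim=1)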

Training and Testing

The model is trained using the Negative Log-Likelihood (NLL) loss function and optimized using the Adadelta optimizer. The script implements both a training loop and a testing loop to evaluate model performance on the MNIST test set after each epoch.

Training logs are printed periodically based on the --log-interval argument, showing the progress and loss for each batch.
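
An illustrative shape for these loops, assuming the model, device, data loaders, args, and start_epoch from the sketches above are in scope (the actual loops in train.py may differ):

import torch
import torch.nn.functional as F
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

optimizer = optim.Adadelta(model.parameters(), lr=args.lr)
scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)

for epoch in range(start_epoch, args.epochs + 1):
    # Training loop.
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = F.nll_loss(model(data), target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print(f"Epoch {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)}] "
                  f"loss: {loss.item():.6f}")
        if args.dry_run:
            break

    # Testing loop: evaluate on the MNIST test set after each epoch.
    model.eval()
    test_loss, correct = 0.0, 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction="sum").item()
            correct += (output.argmax(dim=1) == target).sum().item()
    print(f"Test loss: {test_loss / len(test_loader.dataset):.4f}, "
          f"accuracy: {correct}/{len(test_loader.dataset)}")

    scheduler.step()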

Checkpointing

The model, optimizer state, and current epoch are saved after each epoch in model_checkpoint.pth. This allows you to resume training from where you left off using the --resume flag.
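
A sketch of that save step, assuming torch, the model, optimizer, and args are in scope and using the same key names assumed in the resume sketch above (the actual keys in train.py may differ):

# Save model weights, optimizer state, and the current epoch after each epoch.
if args.save_model:
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        "model_checkpoint.pth",
    )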
