Google - Fast or Slow? Predict AI Model Runtime - 33rd Place Solution

Solution writeup: 33rd Solution Writeup and Discussion on GST

Overview

In this competition, competitors are challenged to estimate the runtime of DL computational graphs compiled with specific configurations. Concretely, the configurations with shorter runtimes should be ranked to the top. For implementation, I choose to use the off-the-shelf Graph Segment Training (GST) as the main training framework, aiming at reducing the memory consumption for model training.

How to Run

1. Download Dataset

You have to follow instructions on the data tab to download the dataset. Then, you need to unzip and put the raw data under the directory ./data/raw/. Under ./data/raw/, the folder structure looks like the follows,

npz_all/npz/
├── layout/
│   ├── nlp/
│   │   ├── default/
│   │   │   ├── test/
│   │   │   ├── train/
│   │   │   └── valid/
│   │   └── random/
│   │       ├── test/
│   │       ├── train/
│   │       └── valid/
│   └── xla/
│       ├── default/
│       │   ├── test/
│       │   ├── train/
│       │   └── valid/
│       └── random/
│           ├── test/
│           ├── train/
│           └── valid/
└── tile/
    └── xla/
        ├── test/
        ├── train/
        └── valid/

2. Prepare Data

To obtain the processed data, please run the following commands,

python -m data.preparation.gen_tile_df
python -m data.preparation.gen_layout_df
python -m data.preparation.dump_layout_node_config_feat

As can be observed in the folder structure of ./data/raw/, there are five data collections in total (e.g., layout-nlp-default), each of which has its own processed counterpart. Let's take layout-nlp-default as an example. Under ./data/processed/, the folder structure is shown as follows,

layout/nlp/default/
├── train.pkl
├── valid.pkl
├── test.pkl
└── node_config_feat/
    ├── test/
    ├── train/
    └── valid/

3. Train Models

Each experiment is controlled via 3 configuration files (somewhat annoying, which I'll improve in the next project), ./config/defaults.yaml, ./config/dp.yaml and ./config/<model_name>.yaml, where <mode_name> is the name of the model architecture. After setup, you can now train models by running,

# Note: If you don't want to track the experiment with wandb, please set it to False

# Train tile model
python -m tools.main --project-name <any_project_name> --model-name <model_name> --use-wandb True

# Train layout model
python -m tools.main_layout --project-name <any_project_name> --model-name <model_name> --use-wandb True

The output objects (e.g., model checkpoint model.pth, log file train_eval.log) will be dumped under the path ./output/<%m%d-%H_%M_%S>/, because we use timestamp to do a simple version control.

4. Run Inference

After models are trained, you can run inference on test set for the final submission,

# Run inference on tile
python -m tools.infer --exp-id <%m%d-%H_%M_%S> --data-split test --model-name <model_name> --mid last --seeds 0

# Infer layout
python -m tools.infer_layout --exp-id <%m%d-%H_%M_%S> --coll <collection> --data-split test --model-name <model_name> --mid <model_identifier> --seeds <seeds>

# options:
# --exp-id            experiment identifier represented as timestamp
# --coll              {xla-default, xla-random, nlp-default, nlp-random}
# --mid               model checkpoint identifier, always choose last to avoid overfitting on validation set
# --seeds             [0, 1, ...], feel free to ensemble with models trained on different random seeds

Then, submission.csv is automatically generated under the specified output path ./output/<%m%d-%H_%M_%S>/.

5. Experimental Results

The local CV score of my final best result is shown in the following table,

Collection	CV
tile	0.9551
xla-default	0.3188
xla-random	0.5569
nlp-default	0.5053
nlp-random	0.8845

The average score across all data collections are shown as follows,

	Average Score
CV	0.64412
Public LB	0.65282
Private LB	0.62695

References

[1] Learning Large Graph Property Prediction via Graph Segment Training
[2] TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
base		base
config		config
criterion		criterion
cv		cv
data		data
engine		engine
evaluating		evaluating
experiment		experiment
modeling		modeling
solver		solver
tools		tools
trainer		trainer
utils		utils
.gitignore		.gitignore
.isort.cfg		.isort.cfg
.pre-commit-config.yaml		.pre-commit-config.yaml
README.md		README.md
metadata.py		metadata.py
paths.py		paths.py
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Google - Fast or Slow? Predict AI Model Runtime - 33rd Place Solution

Overview

How to Run

1. Download Dataset

2. Prepare Data

3. Train Models

4. Run Inference

5. Experimental Results

References

About

Releases

Packages

Languages

JiangJiaWei1103/Google-Fast-or-Slow-33rd-Place-Solution

Folders and files

Latest commit

History

Repository files navigation

Google - Fast or Slow? Predict AI Model Runtime - 33rd Place Solution

Overview

How to Run

1. Download Dataset

2. Prepare Data

3. Train Models

4. Run Inference

5. Experimental Results

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages