flux-framework · vsoch · Apr 13, 2023
diff --git a/examples/launchers/merlin/singularity-openfoam/README.md b/examples/launchers/merlin/singularity-openfoam/README.md
@@ -17,7 +17,7 @@ $ kind create cluster --config ../../../kind-config.yaml
 And the Flux Operator namespace created:
 
 ```bash
-$ kubectl create -n flux-operator
+$ kubectl create namespace flux-operator
 ```
 
 And then generate the (separate) pods to run redis and rabbitmq in the flux-operator namespace.

diff --git a/examples/machine-learning/mlcommons-deepcam/README.md b/examples/machine-learning/mlcommons-deepcam/README.md
@@ -0,0 +1,82 @@
+# Deepcam
+
+> Deep Learning Climate Segmentation Benchmark
+
+This shows a PyTorch implementation of the climate segmentation benchmark, based on the
+Exascale Deep Learning for Climate Analytics paper: https://arxiv.org/abs/1810.01993.
+The workflow comes from [mlcommons/deepcam](https://github.com/mlcommons/hpc/tree/main/deepcam).
+
+## Create MiniCluster
+
+First, cd to the directory here, and create the minikube cluster (kind did not work to create a sandbox for the SIF):
+
+```bash
+$ minikube start
+```
+
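+Then mount the current directory into minikube, and pull and pre-load the container image (`minikube mount` keeps running, so run the remaining commands in a second terminal):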
+
+```bash
+$ minikube mount $PWD/:/tmp/workflow
+$ docker pull ghcr.io/rse-ops/singularity:tag-mamba
+$ minikube image load ghcr.io/rse-ops/singularity:tag-mamba
+```
+
+And create the Flux Operator namespace:
+
+```bash
+$ kubectl create namespace flux-operator
+```
+
+Then install the Flux Operator (from the repository here):
+
+```bash
+$ kubectl apply -f ../../dist/flux-operator.yaml
+```
+
+We don't want to create the minicluster quite yet! We want to prepare the data first.
+
+## Dataset
+
+You can read [more about the dataset here](https://github.com/mlcommons/hpc/tree/main/deepcam#dataset).
+You'll need to download the dataset from [this Globus endpoint](https://app.globus.org/file-manager?origin_id=0b226e2c-4de0-11ea-971a-021304b0cca7&origin_path=%2F) into the current directory.
+Note that I did this by setting up [Globus Connect Personal](https://www.globus.org/globus-connect-personal),
+downloading to a scoped location on my computer, and then moving the archive to the directory here.
+Then extract the data (make sure you have ~50GB of free space) and make the install script executable:
+
+```bash
+$ tar -xzvf deepcam-data-n512.tgz
+$ chmod +x install_mini_dataset.sh
+```
+
+This extracts the data to a directory, `deepcam-data-n512`, and then we can run the script to prepare it:
+
+```bash
+$ mkdir -p ./data
+$ ./install_mini_dataset.sh ./deepcam-data-n512 ./data
+```
+
+This copies the data over and creates the directory structure needed for training.
+It should look like this, with the data files under `train` and a copy of them under `validation`:
+
+```bash
+$ ls ./data
+stats.h5  train  validation
+```
+
+Note that the root directory here is bound to /tmp/workflow in our cluster, so it should
+show up as `/tmp/workflow/data`.
+
+## Training
+
+Now that we have our data ready, we can create the MiniCluster (which will pull the container to run the job).
+Note that we will use default parameters, but you can learn more about the defaults and parameters
+[in the repository](https://github.com/mlcommons/hpc/tree/main/deepcam).
+
+Then create the MiniCluster to use them! Let's hope your computer doesn't run out of space, or something like that.
+
+```bash
+$ kubectl apply -f minicluster.yaml
+```
+
+**WIP**: this will likely work, but still needs to be tested on a machine with a GPU. It will not work on a CPU-only machine.
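
Note on the README's "~50GB of space" step: a minimal sketch of checking free space before extracting the archive. `has_free_gb` is a hypothetical helper, not part of this PR, and it assumes a POSIX `df -Pk` whose fourth column is the available space in KiB (a filesystem name containing spaces would confuse the parsing).

```bash
# Sketch only: has_free_gb is a hypothetical helper, not part of the repository.
# Usage: has_free_gb DIRECTORY REQUIRED_GB
has_free_gb() {
    dir=$1
    need_gb=$2
    # POSIX output format: line 2, column 4 is the available space in KiB
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $(( need_gb * 1024 * 1024 )) ]
}
```

For example, `has_free_gb . 50 && tar -xzvf deepcam-data-n512.tgz` would only extract when the current filesystem reports at least 50GB available.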
diff --git a/examples/machine-learning/mlcommons-deepcam/install_mini_dataset.sh b/examples/machine-learning/mlcommons-deepcam/install_mini_dataset.sh
@@ -0,0 +1,35 @@
+#!/bin/bash
+
+# This script will take the downloaded small batch of data files,
+# make the required number of duplicates, and install in the specified
+# directory in train/ and validation/ subfolders.
+
+if [ $# -lt 2 ]; then
+    echo "Usage:"
+    echo "  $0 DOWNLOADED_DATA_DIR INSTALLATION_TARGET_DIR [NUM_COPIES]"
+    exit 1
+fi
+
+sourceDir=$1
+targetDir=$2
+numCopies=1
+if [ $# -ge 3 ]; then
+    numCopies=$3
+fi
+
+# First, prepare the train directory by duplicating every data file numCopies times
+mkdir -p "$targetDir/train" "$targetDir/validation"
+for path in "$sourceDir"/data-*.h5; do
+    f=$(basename "$path"); echo "$f"
+    for (( i=0; i<numCopies; i++ )); do
+        outFile="$targetDir/train/${f%.h5}-$i.h5"
+        echo "  $outFile"
+        cp "$path" "$outFile"
+    done
+done
+
+# Copy in the stats file
+cp "$sourceDir/stats.h5" "$targetDir/"
+
+# Copy the contents of train/ into validation/ (re-running does not nest directories)
+cp -r "$targetDir/train/." "$targetDir/validation/"
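
Note on the layout the README expects (`stats.h5  train  validation`): a minimal sketch of verifying it before applying `minicluster.yaml`. `check_dataset_layout` is a hypothetical helper, not part of this PR, and it assumes the installed files keep the `data-*.h5` naming that the install script produces.

```bash
# Sketch only: check_dataset_layout is a hypothetical helper, not part of the repository.
# Usage: check_dataset_layout DATA_DIR
check_dataset_layout() {
    dataDir=$1
    status=0
    if [ ! -f "$dataDir/stats.h5" ]; then
        echo "missing: $dataDir/stats.h5"
        status=1
    fi
    for sub in train validation; do
        # ls fails when the glob matches nothing, which means no data files are present
        if ! ls "$dataDir/$sub"/data-*.h5 > /dev/null 2>&1; then
            echo "missing: data-*.h5 files under $dataDir/$sub"
            status=1
        fi
    done
    return $status
}
```

One way to use it: `check_dataset_layout ./data && kubectl apply -f minicluster.yaml`, so the MiniCluster is only created once the data is in place.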
diff --git a/examples/machine-learning/mlcommons-deepcam/minicluster.yaml b/examples/machine-learning/mlcommons-deepcam/minicluster.yaml
@@ -0,0 +1,50 @@
+apiVersion: flux-framework.org/v1alpha1
+kind: MiniCluster
+metadata:
+  name: flux-sample
+  namespace: flux-operator
+spec:
+
+  # IMPORTANT: see the README.md for how to prepare the data first!
+  # You should have a local ./data folder with train/, validation/, and stats.h5
+  # Number of pods to create for MiniCluster
+  size: 2
+  tasks: 2
+  interactive: true
+
+  # Make this kind of persistent volume and claim available to pods
+  # This is a path in minikube (e.g., minikube ssh)
+  volumes:
+    data:
+      storageClass: hostpath
+      path: /tmp/workflow
+
+  # This is a list because a pod can support multiple containers
+  containers:
+    # The container URI to pull (currently needs to be public)
+    - image: ghcr.io/rse-ops/singularity:tag-mamba
+      cores: 4
+
+      # This will run with the defaults, targeting our ./data directory
+      command: singularity exec --pwd /opt/deepCam ./deepcam.sif /bin/bash /tmp/workflow/run_training.sh
+      workingDir: /tmp/workflow
+
+      # The broker pulls the container (once) into the shared volume at /tmp/workflow
+      commands:
+        pre: mkdir -p /tmp/workflow/output
+        brokerPre: |
+           if [[ ! -e "/tmp/workflow/deepcam.sif" ]]; then
+               singularity pull /tmp/workflow/deepcam.sif docker://ghcr.io/rse-ops/mlcommons-deepcam:tag-21.12-py3
+           fi
+
+      fluxUser:
+        name: fluxuser
+
+      # Container will be pre-pulled here only by the broker
+      volumes:
+        data:
+          path: /tmp/workflow
+
+      # Running a container in a container
+      securityContext:
+        privileged: true
diff --git a/examples/machine-learning/mlcommons-deepcam/run_training.sh b/examples/machine-learning/mlcommons-deepcam/run_training.sh
@@ -0,0 +1,52 @@
+#!/bin/bash
+
+# The MIT License (MIT)
+#
+# Copyright (c) 2020 NVIDIA CORPORATION. All rights reserved.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy of
+# this software and associated documentation files (the "Software"), to deal in
+# the Software without restriction, including without limitation the rights to
+# use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
+# the Software, and to permit persons to whom the Software is furnished to do so,
+# subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
+# FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+# IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+# parameters
+# IMPORTANT: use absolute paths
+data_dir="/tmp/workflow/data"
+output_dir="/tmp/workflow/output"
+run_tag="test_run"
+local_batch_size=2
+
+python /tmp/workflow/train.py \
+       --wireup_method "dummy" \
+       --run_tag "${run_tag}" \
+       --data_dir_prefix "${data_dir}" \
+       --output_dir "${output_dir}" \
+       --model_prefix "segmentation" \
+       --optimizer "LAMB" \
+       --start_lr 0.0055 \
+       --lr_schedule type="multistep",milestones="800",decay_rate="0.1" \
+       --lr_warmup_steps 400 \
+       --lr_warmup_factor 1. \
+       --weight_decay 1e-2 \
+       --logging_frequency 10 \
+       --save_frequency 0 \
+       --max_epochs 200 \
+       --max_inter_threads 4 \
+       --seed "$(date +%s)" \
+       --batchnorm_group_size 1 \
+       --local_batch_size "${local_batch_size}"
+
+# Removed (not an argument)
+#       --adam_eps 1e-6 \
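
Note on the "IMPORTANT: use absolute paths" comment in `run_training.sh`: a minimal sketch of making the two directories overridable while still enforcing that rule. `DEEPCAM_DATA_DIR`, `DEEPCAM_OUTPUT_DIR`, `require_absolute`, and `resolve_training_dirs` are all hypothetical names, not something this PR or `train.py` defines; the defaults are the paths already used in the script.

```bash
# Sketch only: every name below is hypothetical and not read by run_training.sh today.
require_absolute() {
    # Succeeds only for paths that start with "/"
    case "$1" in
        /*) return 0 ;;
        *)  echo "not an absolute path: $1" >&2; return 1 ;;
    esac
}

resolve_training_dirs() {
    # Fall back to the paths hard-coded in run_training.sh when no override is set
    data_dir="${DEEPCAM_DATA_DIR:-/tmp/workflow/data}"
    output_dir="${DEEPCAM_OUTPUT_DIR:-/tmp/workflow/output}"
    require_absolute "$data_dir" && require_absolute "$output_dir"
}
```

Calling `resolve_training_dirs || exit 1` before the `python` invocation would stop the job early on a relative path instead of failing inside the container.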