using bigstitcher with slurm #20
Comments
Hi Eugene,

In May 2022, for an I2K Workshop, @StephanPreibisch talked for 90 minutes about BigStitcher and I talked for 20 minutes about BigStitcher-Spark and the spark-janelia script library. Depending upon your interest/available time, that YouTube recording might be helpful. Stephan also has a bunch of other BigStitcher HowTo videos listed here.

If you'd prefer to go right to code/scripts on GitHub, we run Spark on Janelia's LSF cluster using this script library. Last summer, I worked with folks at MDC Berlin to adapt those scripts to support runs on SGE/Univa. We don't currently have anything for Slurm, though the core concepts/ideas are likely similar. If you are interested in adding support for Slurm to the spark-janelia scripts, I'm happy to help you integrate that - but I would need you to provide the core pieces, since we don't use Slurm at Janelia and I don't have access to a Slurm cluster.

Others (maybe @martinschorb at EMBL?) have also likely solved the problem of setting up a Spark cluster on top of a Slurm HPC cluster. Sorry I don't have a direct solution for you.
Hi, if you want to have a look, here is what the submission script looks like:

#!/bin/bash
#SBATCH --job-name=spark-master # create a short name for your job
#SBATCH --time=00:10:00 # total run time limit (HH:MM:SS)
#SBATCH -o sparkslurm-%j.out
#SBATCH -e sparkslurm-%j.err
# --- Master resources ---
#SBATCH --mem-per-cpu=2G
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
# --- Worker resources ---
#SBATCH hetjob
#SBATCH --job-name spark-worker
#SBATCH --nodes=4
#SBATCH --mem-per-cpu=4G
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=1
# import Parameters
module load Java
export DISPLAY=""
export LOGDIR=`pwd`
# this is where spark is installed.
# modify the directories if you call the spark executables from a module
export SPARK_HOME=$YOURSPARKDIR
JOB="$SLURM_JOB_NAME-$SLURM_JOB_ID"
export MASTER_URL="spark://$(hostname):7077"
export MASTER_HOST=$(hostname)
export MASTER_IP=$(host "$MASTER_HOST" | sed 's/^.*address //')
export MASTER_WEB="http://$MASTER_IP:8080"
mkdir -p "$LOGDIR/$JOB"
# SET UP ENV for the spark run
echo $MASTER_IP > $LOGDIR/$JOB/master
export SPARK_LOG_DIR="$LOGDIR/$JOB/logs"
export SPARK_WORKER_DIR="$LOGDIR/$JOB/worker"
export SPARK_LOCAL_DIRS="$TMPDIR/$JOB"
export SPARK_WORKER_CORES=$SLURM_CPUS_PER_TASK_HET_GROUP_1
export TOTAL_CORES=$(($SPARK_WORKER_CORES * $SLURM_JOB_NUM_NODES_HET_GROUP_1))
# export SPARK_DRIVER_MEM=$((4 * 1024))
export SPARK_MEM=$(( $SLURM_MEM_PER_CPU_HET_GROUP_1 * $SLURM_CPUS_PER_TASK_HET_GROUP_1))m
export SPARK_DAEMON_MEMORY=$SPARK_MEM
export SPARK_WORKER_MEMORY=$SPARK_MEM
# MAIN CALLS
#======================================
# start MASTER
$SPARK_HOME/sbin/start-master.sh
# wait for master to start
wait=1
while [ $wait -gt 0 ]
do
{ # try
curl --silent "$MASTER_WEB" > /dev/null && wait=0 && echo "Found spark master, will submit tasks."
} || { # catch
sleep 10 && echo "Waiting for spark master to become available."
}
done
# start workers on the second het-group
srun --het-group=1 $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker $MASTER_URL -d $SPARK_WORKER_DIR &
# again, sleep a tiny little bit
sleep 5s
# this is the general call we use together with command line parameters.
# sparksubmitcall="$SPARK_HOME/bin/spark-submit --master $MASTER_URL --driver-memory 2g --conf spark.default.parallelism=$TOTAL_CORES --conf spark.executor.cores=$SPARK_WORKER_CORES --executor-memory $SPARK_MEM --class $CLASS $JARFILE $PARAMS"
# this is the spark example to compute Pi
sparksubmitcall="$SPARK_HOME/bin/run-example --master $MASTER_URL --driver-memory 2g --conf spark.default.parallelism=$TOTAL_CORES --conf spark.executor.cores=$SPARK_WORKER_CORES --executor-memory $SPARK_MEM SparkPi"
echo $sparksubmitcall
$sparksubmitcall
# this keeps the master alive.
# You can also have the compute job write a file once done and exit the job upon its existence
sleep infinity
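For reference, here is a quick sanity check of the resource arithmetic the script exports. This is a standalone sketch: the SLURM_* variables are filled in by hand to match the worker het-group above (--nodes=4, --cpus-per-task=8, --mem-per-cpu=4G; Slurm reports per-CPU memory in MB), rather than being set by Slurm itself.

```shell
# Hand-set stand-ins for the variables Slurm would export for het-group 1
SLURM_CPUS_PER_TASK_HET_GROUP_1=8
SLURM_JOB_NUM_NODES_HET_GROUP_1=4
SLURM_MEM_PER_CPU_HET_GROUP_1=4096   # 4G per CPU, in MB

# Same arithmetic as the script above
SPARK_WORKER_CORES=$SLURM_CPUS_PER_TASK_HET_GROUP_1
TOTAL_CORES=$(( SPARK_WORKER_CORES * SLURM_JOB_NUM_NODES_HET_GROUP_1 ))
SPARK_MEM="$(( SLURM_MEM_PER_CPU_HET_GROUP_1 * SLURM_CPUS_PER_TASK_HET_GROUP_1 ))m"

echo "worker cores: $SPARK_WORKER_CORES"   # 8 cores per worker
echo "total cores:  $TOTAL_CORES"          # 32 cores across the cluster
echo "worker mem:   $SPARK_MEM"            # 32768m per worker
```

So each of the 4 workers advertises 8 cores and 32 GB to the master, and spark.default.parallelism ends up at 32.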
Just wanted to shout out the excellent nextflow-spark repo, which allows you to start a Spark cluster on Slurm (or Kubernetes, AWS, anything supported by Nextflow).
@trautmane also helped @bellonet at the MDC Berlin to set up Spark on their cluster ... not sure if she can help with some insights as well?
I also started to write better documentation on BigStitcher-Spark https://github.com/JaneliaSciComp/BigStitcher-Spark (which also links the YouTube video). It would be great if people could contribute small HowTos for how they set it up on their respective clusters: https://github.com/JaneliaSciComp/BigStitcher-Spark#installcluster to help other users ...
@StephanPreibisch I am in the same boat, trying to set up BigStitcher-Spark on our LSF cluster. A really naive question to start: can I run the "Define Dataset" step in a way that distributes the workload across different nodes and directly resaves TIFF files into N5/HDF5, rather than using an XML+HDF5 pair already created by BigStitcher? Or is BigStitcher-Spark only compatible with the later steps (i.e. stitching, alignment, interest points, and fusion)?
Hi everyone!
I am a sysadmin trying to help our users run BigStitcher on the HPC cluster. I don't necessarily know how BigStitcher works or what it does, and I am also not too familiar with Spark. I was hoping you could give us a few pointers on how to run this in distributed mode on Slurm.
Here is how I currently run it within a single node.

How can I tell affine-fusion to distribute across multiple compute nodes (once I request multiple nodes from Slurm)? Thanks!
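For context, the general pattern for distributing a Spark job is to point spark-submit at a standalone master (such as the one started by the sbatch script earlier in this thread) instead of running in local mode. The sketch below only builds and prints the command so the pieces are easy to inspect; the master URL, jar path, main class, and the -x/-o options are all assumptions for illustration - check the BigStitcher-Spark README for the exact invocation.

```shell
# Hypothetical sketch: submitting affine fusion to a standalone Spark master.
# All paths, the class name, and the -x/-o options are placeholders; verify
# them against the BigStitcher-Spark documentation before use.
SPARK_HOME=/opt/spark                 # wherever Spark is installed
MASTER_URL="spark://login01:7077"     # master started by the sbatch script

submit_cmd="$SPARK_HOME/bin/spark-submit \
  --master $MASTER_URL \
  --conf spark.executor.cores=8 \
  --executor-memory 32g \
  --class net.preibisch.bigstitcher.spark.AffineFusion \
  BigStitcher-Spark.jar -x dataset.xml -o fused.n5"

# Print the command rather than running it, since the cluster pieces
# (master, workers, jar) are not assumed to exist here.
echo "$submit_cmd"
```

The key difference from a single-node run is the --master URL: with `--master local[*]` everything stays on one machine, while `--master spark://...` lets the master farm tasks out to every registered worker.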