- Get access to a cluster. On each machine, install as many VMs as it can support, and ensure each VM has at least 20GB of disk space if you plan to run large tests. For our testing, we used Chameleon Cloud, and we have also compiled instructions specific to setting up instances on Chameleon Cloud.
- Ensure you have the private key to ssh to the created instances.
- Additionally, create a file called `live_nodes` and add the IPs of all the live instances you have, one per line (an example file appears in the first sketch after this list).
- Given a list of .cpp workload files, we need to distribute them evenly across all the VMs. Work out how many workloads should be copied to each VM by dividing the total number of workloads by the total number of VMs (a sketch of this calculation appears after this list). All the scripts referenced below can be found here.
- Use the script `segregate_workloads.sh` to divide the workloads evenly among the VMs. Run `./segregate_workloads.sh <num_per_vm> <start_workload_number> <max> <path_to_workload_files> <output_workload_path>`, where `num_per_vm` is the approximate number of workloads per VM, `max` is the total number of workloads, and the last two arguments are the path to the workloads to segregate and the path to the directory where the segregated files should be placed. Note that all the workload files are expected to be named `j-lang<workload_num>.cpp`. After running this script, you will see a set of subdirectories within the output directory, each containing an equal number of test files. For each VM `j` on the instance `i` with IP `ip`, the corresponding workloads are placed in the directory `<output_workload_path>/node<i>-<ip>/vm<j>`.
- Now, we need to get these segregated workloads onto the instances.
- By default, the workloads are scp'ed to `~/seq2` on the instances (this path is hardcoded in the script). Run `./scp_segregated_workloads.sh <start_serv_num> <end_serv_num> <output_workload_path>`, where `start_serv_num` and `end_serv_num` are integers between 1 and the number of live nodes you have (each points to the corresponding line in the `live_nodes` file).
- You need to delete any files under the `~/seq2` directory on all of the Chameleon instances first (by cssh'ing into all instances).
- The above script scp's the workloads serially, which is slow, so use `./scp_segregated_workloads_parallel.sh` to trigger the scp in parallel (as background processes).
- Note that the parallel script currently handles 64 nodes; if the number of live nodes changes, you need to update the parallel script as well.
- Also note that the scp is not complete until `ps aux | grep scp_segregated` yields empty results.
- If the instances have changed (been recreated), their entries have to be removed from the `~/.ssh/known_hosts` file, otherwise the scp will fail with a host verification error. So, if the instances have changed, remove `~/.ssh/known_hosts` and then rerun the scp script. (A sketch of this local-side sequence appears after this list.)
- Now, we need to get the remote scripts and the workloads from the instances onto the VMs. To do that, cssh into all instances and copy over the latest `vm_scripts/*` (as checked into the repo) from a local server. Set your username and password in all of the following scripts.
- SCP all the remote scripts to the VMs using `./scp_remote_scripts_to_vms.sh`.
- SCP the workload files to the VMs using `./scp_workloads_to_vms.sh`, which will scp them to the corresponding VMs under `~/seq2` (you don't have to explicitly delete the folder on the VMs as you did for the Chameleon instances).
- The final step is to trigger the workload using `./trigger_remote_script_trigger_xfsMonkey.sh <file_system>`, which will run `vm_remote_trigger_script.sh` on all the VMs. This will take some time (15-30 minutes) to complete. (A sketch of these instance-side commands appears after this list.)
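For reference, here is a minimal sketch of the bookkeeping the steps above assume: an example `live_nodes` file and the per-VM workload calculation. The IP addresses, the total workload count, and the number of VMs per instance are placeholders; substitute the values from your own cluster.

```bash
# Example live_nodes file: one IP per live Chameleon instance (placeholder IPs).
cat > live_nodes <<'EOF'
10.140.81.1
10.140.81.2
10.140.81.3
EOF

# Rough calculation of <num_per_vm>: total workloads divided by total VMs,
# rounded up. MAX and VMS_PER_INSTANCE are placeholders for your setup.
MAX=1000                              # <max>: total number of workloads
NUM_INSTANCES=$(wc -l < live_nodes)   # number of live Chameleon instances
VMS_PER_INSTANCE=4                    # VMs created on each machine
TOTAL_VMS=$(( NUM_INSTANCES * VMS_PER_INSTANCE ))
NUM_PER_VM=$(( (MAX + TOTAL_VMS - 1) / TOTAL_VMS ))   # ceiling division
echo "num_per_vm: $NUM_PER_VM"
```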
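The local-side steps (segregate, clear stale host keys, copy to the instances, wait for the copies) can be chained roughly as below. This is a sketch with placeholder paths and counts, not the exact invocation; in particular, the parallel copy script's arguments are not shown here, so check the script before using it.

```bash
# Sketch only: placeholder paths and counts.

# 1. Split workloads 1..1000 (about 16 per VM) from ~/workloads into
#    per-node/per-VM subdirectories under ~/segregated.
./segregate_workloads.sh 16 1 1000 ~/workloads ~/segregated

# 2. If the instances were recreated, drop stale host keys so scp does not
#    fail host verification.
rm -f ~/.ssh/known_hosts

# 3. Copy the segregated workloads to ~/seq2 on instances 1..N from
#    live_nodes. (scp_segregated_workloads_parallel.sh does the same job
#    with backgrounded scp jobs and is much faster.)
./scp_segregated_workloads.sh 1 "$(wc -l < live_nodes)" ~/segregated

# 4. When using the parallel script, wait until no scp jobs remain.
while ps aux | grep '[s]cp_segregated' > /dev/null; do sleep 30; done
```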
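On the instance side, the steps above boil down to roughly the following, typically typed once into a cssh session that broadcasts to every instance. The sketch assumes the latest `vm_scripts/*` has already been copied onto the instances and that the username/password fields in the scripts are filled in; the file system name is a placeholder.

```bash
# Run on every Chameleon instance (e.g. via cssh), from the directory holding
# the vm_scripts copied over from your local server.

# Push the remote scripts and the segregated workloads down to the VMs (~/seq2).
./scp_remote_scripts_to_vms.sh
./scp_workloads_to_vms.sh

# Kick off xfsMonkey on every VM for the chosen file system ("btrfs" is a
# placeholder; pass whichever file system you are testing).
./trigger_remote_script_trigger_xfsMonkey.sh btrfs
```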
Since all the workloads are running on the VMs, it is difficult to monitor the status of the runs. We have a few helper scripts for that.
- To check whether the workloads triggered on the VMs have completed execution, log in to the instances through cssh, pull the latest copy of `vm_scripts`, and run `./trigger_remote_script.sh vm_remote_check_running.sh`, which prints either 'Some run in progress' or 'No runs in progress' for each running VM.
- To check how many workloads are complete, cssh into the instances and run `./trigger_remote_script.sh vm_remote_progress_script.sh`, which prints the number of completed workloads for each VM. (A polling sketch combining these two checks follows this list.)
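If you would rather not rerun these checks by hand, the two scripts can be combined into a simple polling loop on the instances, along the lines of the sketch below (the 15-minute sleep interval is arbitrary).

```bash
# Run on an instance: poll until no VM reports a run in progress, printing
# per-VM progress counts on each pass.
while ./trigger_remote_script.sh vm_remote_check_running.sh | grep -q 'Some run in progress'; do
    ./trigger_remote_script.sh vm_remote_progress_script.sh   # completed workloads per VM
    sleep 900
done
echo "No runs in progress on any VM."
```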
Once the workloads are complete (which you can verify using the steps above), you can pull the log files and the diff files as follows:
- If the instances have changed (been recreated), their entries have to be removed from the `~/.ssh/known_hosts` file, otherwise the scp will fail with a host verification error. So, if the instances have changed, remove `~/.ssh/known_hosts` and then trigger the pull script.
- Run `./pull_diff_files_and_xfsmonkey_log.sh <dir> <start_vm> <end_vm>`, which creates a directory `dir` and pulls the log files and diff files from the servers in the range [`start_vm`, `end_vm`].
- The above is a serial script and will take a long time, so you can instead use `./pull_diff_files_and_xfsmonkey_log_parallel.sh <dir>`, which pulls the necessary files from all the VMs. As with the `scp_segregated_workloads_parallel.sh` script, you should also update this parallel script if the number of live nodes changes.
- Use `ps aux | grep pull_diff_files` to check whether the pull script has completed. Once it is complete, you will find all the log and diff files under the `dir` you specified. (A sketch of this pull sequence follows.)
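A sketch of that pull sequence on the local machine, with a placeholder output directory and VM range:

```bash
# If the instances were recreated, clear stale host keys first.
rm -f ~/.ssh/known_hosts

# Pull the log and diff files from all VMs in parallel into ./results.
# (Serial alternative: ./pull_diff_files_and_xfsmonkey_log.sh results 1 64)
./pull_diff_files_and_xfsmonkey_log_parallel.sh results

# The pull is only done once no pull_diff_files processes remain.
while ps aux | grep '[p]ull_diff_files' > /dev/null; do sleep 30; done

# All log and diff files are now under ./results.
ls results
```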