
CoLExT: Collaborative Learning Experimentation Testbed

CoLExT is a testbed built for machine learning researchers to realistically execute and profile Federated Learning (FL) algorithms on real edge devices and smartphones. This repo contains the software library developed to seamlessly deploy and monitor FL experiments compatible with the Flower Framework. Consider checking the CoLExT website for an overview of the project.

CoLExT contains 20 smartphones and 28 Single Board Computers (SBC) and is hosted at the King Abdullah University of Science and Technology (KAUST). If you’d like to experiment with CoLExT, please show your interest by filling out this form.

Warning

CoLExT supports both SBC and Android deployments; however, Android deployment is not yet available in this branch.

CoLExt Diagram

Using CoLExT

Note: Requires prior approval. Please fill out the show of interest form if you're interested.

  1. Access the CoLExT server (see "Accessing the CoLExT server" below).

  2. Install the CoLExT package in a local Python environment, e.g. with venv.

    $ python3 -m venv .colext_env && source .colext_env/bin/activate
    
    # The plotting extra automatically generates metric plots
    (.colext_env)$ python3 -m pip install "colext[plotting] @ git+ssh://git@github.com/sands-lab/colext.git"
  3. In the FL code, import the CoLExT decorators and wrap Flower's client and strategy classes. Note: outside the testbed, these decorators do not modify program behavior, so they can safely remain in the code.

    from colext import MonitorFlwrClient, MonitorFlwrStrategy
    
    @MonitorFlwrClient
    class FlowerClient(fl.client.NumPyClient):
      [...]
    @MonitorFlwrStrategy
    class FlowerStrategy(flwr.server.strategy.Strategy):
      [...]
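The note above implies the decorators are pass-throughs outside the testbed. A hypothetical sketch of that pattern, using the `COLEXT_ENV` variable documented later (the real implementation may differ):

```python
import os

def MonitorSketch(cls):
    """Hypothetical pass-through decorator: outside a CoLExT
    environment it returns the class unchanged."""
    if os.environ.get("COLEXT_ENV") != "True":
        return cls  # no-op outside the testbed
    # Inside the testbed, a subclass instrumented with metric
    # hooks would be returned instead (details omitted here).
    return cls
```

This is why the decorated code also runs unmodified on a development machine.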
  4. Create a CoLExT configuration file colext_config.yaml. CoLExT exposes bash environment variables with information about the experiment. Please remember to pass the FL server address to the clients. Here's an example configuration file using SBC devices:

    # colext_config.yaml
    project: colext_example
    
    code:
      client:
        # Assumes relative paths from the config file
        entrypoint: "./src/client.py"
        args: "--client_id=${COLEXT_CLIENT_ID} --server_addr=${COLEXT_SERVER_ADDRESS}"
      server:
        entrypoint: "./src/server.py"
        args: "--n_clients=${COLEXT_N_CLIENTS} --n_rounds=3"
    devices:
      - { device_type: LattePandaDelta3, count: 4 }
      - { device_type: OrangePi5B, count: 2 }
      - { device_type: JetsonOrinNano, count: 4 }
    monitoring:
      scraping_interval: 0.3  # in seconds
      push_to_db_interval: 10 # in seconds
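To match the `args` entries above, the client entrypoint needs to accept those flags. A minimal sketch of how `src/client.py` might parse them (flag names taken from the example config; the rest of the client is assumed):

```python
import argparse

def parse_args(argv=None):
    # Flags matching the `args` entries in colext_config.yaml
    parser = argparse.ArgumentParser(description="CoLExT FL client")
    parser.add_argument("--client_id", type=int, required=True)
    parser.add_argument("--server_addr", type=str, required=True)
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    # e.g. start the Flower client against args.server_addr here
```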
  5. Specify your Python dependencies using a requirements.txt file in the same directory as the CoLExT configuration file.

  6. Deploy, monitor the experiment in real time, and download the collected performance metrics as CSV files.

    # Execute in the directory with the colext_config.yaml
    $ colext_launch_job
    # Prints a job-id and a dashboard link
    
    # After the job finishes, retrieve metrics for job-id as CSV files
    $ colext_get_metrics --job_id <job-id>

    Dashboard example:

    CoLExt Dashboard

  7. To confirm the deployment is working, try launching an example:

      $ cd colext/examples/flwr_tutorial
      $ colext_launch_job

Continue reading for more information on the above steps, and check the tips section for the deployment type you're interested in.

CoLExT configuration file (colext_config.yaml)

This section describes the possible configuration options for the CoLExT configuration file.

Currently available devices

# SBCs
  - { device_type: JetsonAGXOrin,  count: 2 }
  - { device_type: JetsonOrinNano, count: 4 }
  - { device_type: JetsonXavierNX, count: 2 }
  - { device_type: JetsonNano, count: 6 }
  - { device_type: LattePandaDelta3, count: 6 }
  - { device_type: OrangePi5B, count: 8 }
# !!! Currently, this config file will not work with smartphones !!!
# Smartphones
  - { device_type: SamsungXCover6Pro, count: 3 }
  - { device_type: SamsungGalaxyM54, count: 2 }
  - { device_type: Xiaomi12, count: 2 }
  - { device_type: XiaomiPocoX5Pro, count: 2 }
  - { device_type: GooglePixel7, count: 5 }
  - { device_type: AsusRogPhone6, count: 2 }
  - { device_type: OnePlusNord2T5G, count: 2 }

Exposed environment variables

CoLExT exposes several bash environment variables that are passed to the execution environment. These can be used as arguments in the args section of the client and server code by expanding them as ${ENV_VAR}, as shown in the configuration example above.

| Name | Description |
| --- | --- |
| COLEXT_CLIENT_ID | ID of the client (0..n_clients) |
| COLEXT_N_CLIENTS | Number of clients in the experiment |
| COLEXT_DEVICE_TYPE | Hardware type of the client |
| COLEXT_SERVER_ADDRESS | Server address (host:port) |
| COLEXT_PYTORCH_DATASETS | PyTorch datasets caching path |
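Instead of threading everything through `args`, code can also read these variables directly. A small sketch with fallbacks so the same code runs outside the testbed (the default values here are illustrative assumptions):

```python
import os

def colext_settings():
    # Fall back to local defaults when not running inside CoLExT
    return {
        "client_id": int(os.environ.get("COLEXT_CLIENT_ID", "0")),
        "n_clients": int(os.environ.get("COLEXT_N_CLIENTS", "1")),
        "server_address": os.environ.get("COLEXT_SERVER_ADDRESS", "127.0.0.1:8080"),
    }
```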

CoLExT also exposes the following variables which are meant to be used internally by the colext package.

| Name | Description |
| --- | --- |
| COLEXT_ENV | True if inside a CoLExT environment |
| COLEXT_JOB_ID | Experiment job ID |
| COLEXT_CLIENT_DB_ID | Unique client ID in the database |
| COLEXT_MONITORING_LIVE_METRICS | True if metrics are pushed in real time |
| COLEXT_MONITORING_PUSH_INTERVAL | Interval between metric pushes to the DB |
| COLEXT_MONITORING_SCRAPE_INTERVAL | Interval between HW metric scrapes |

Performance monitoring options

monitoring:
  live_metrics: True # True/False: True if metrics are pushed in real-time
  push_interval: 10 # in seconds: Metric buffer time before pushing metrics to the DB
  scraping_interval: 0.3 # in seconds: Interval between metric scraping

Python version and deployers

Deployers:

  • sbc (default) - Deployer for SBC experiments.
  • local_py - Deployer that launches a local experiment. Clients and the server are launched on the CoLExT server.
  • android (pending merge) - This deployer has been developed but needs to be merged here.

Python versions: 3.10 (default) | 3.8.

deployer: local_py
code:
  python_version: "3.8"

Collected metrics

After calling colext_get_metrics --job_id <job-id>, CSV files prefixed with colext_<job_id>_ are downloaded to the current directory. Here are the contents of each CSV:

server_round_timings.csv

  • round_number: Number of the FL round
  • stage: Stage of the round: FIT or EVAL
  • start_time: Start of the round as measured by the FL server
  • end_time: End of the round as measured by the FL server
  • srv_accuracy: Accuracy of the server's global model (Strategy evaluate result, dict key "accuracy")
  • dist_accuracy: Aggregated 'evaluate' results from the clients (Strategy aggregate_evaluate result, dict key "accuracy")

client_info.csv

  • client_id: ID of the client
  • device_type: Device type as requested in the config file
  • device_name: Name of the device with the associated device type

client_round_timings.csv

  • client_id: ID of the client
  • round_number: Number of the FL round
  • stage: Stage of the round: FIT or EVAL
  • start_time: Start of the round as measured by the client
  • end_time: End of the round as measured by the client
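Per-round durations can be derived from the start_time and end_time columns of client_round_timings.csv; a post-processing sketch with the standard library (column names from the list above; the timestamp format is assumed to be ISO 8601, so adjust the parser if the export differs):

```python
import csv
import io
from datetime import datetime

def round_durations(csv_text):
    """Return {(client_id, round_number, stage): duration in seconds}.
    Assumes ISO 8601 timestamps in start_time/end_time."""
    out = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        start = datetime.fromisoformat(row["start_time"])
        end = datetime.fromisoformat(row["end_time"])
        key = (row["client_id"], row["round_number"], row["stage"])
        out[key] = (end - start).total_seconds()
    return out
```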

hw_metrics.csv:

  • client_id: ID of the client
  • time: Local timestamp when the metrics were collected
  • cpu_util: CPU utilization (%) - values above 100% indicate multiple cores in use
  • gpu_util: GPU Utilization (%)
  • mem_util: Memory usage (Bytes) - reported as RSS memory
  • n_bytes_sent: Number of bytes sent (Bytes)
  • n_bytes_rcvd: Number of bytes received (Bytes)
  • net_usage_out: Upload bandwidth usage (Bytes/s)
  • net_usage_in: Download bandwidth usage (Bytes/s)
  • power_consumption: Power consumption (Watts) - Reported power differs between devices
    • Nvidia Jetsons: Entire board power consumption
    • LattePandas: CPU power consumption
    • OrangePis: Can be measured using High Voltage Power Meter
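As a post-processing sketch, per-client average utilization can be computed from a hw_metrics.csv export with the standard library (column names from the list above):

```python
import csv
import io
from collections import defaultdict

def mean_cpu_util(csv_text):
    """Average cpu_util per client from a hw_metrics.csv export."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        sums[row["client_id"]] += float(row["cpu_util"])
        counts[row["client_id"]] += 1
    return {cid: sums[cid] / counts[cid] for cid in sums}
```

The same pattern applies to gpu_util, power_consumption, and the network columns.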

Summary data

Coming soon...

Tips for SBC deployment

The SBC deployment containerizes user code and deploys it using Kubernetes. Before starting the deployment process, it's recommended to make sure that a local deployment is working as expected. Local deployment can be used by changing the deployer to local_py; see the deployers section above.

Once this is verified, check that containerization is working as expected by running the colext_launch_job command in "prepare" mode. This mode performs checks and containerizes the application, making sure all dependencies can be installed in the container.

# Prepares app for deployment
$ colext_launch_job -p

Excluding files from CoLExT containerization

Files can be excluded from the automatic containerization by using a .dockerignore file in the same directory as the colext_config.yaml. For details on .dockerignore, check the docker documentation.
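For example, to keep local datasets and virtual environments out of the container image (these entries are hypothetical; adapt them to your project layout):

```
# .dockerignore (same directory as colext_config.yaml)
.colext_env/
data/
*.ckpt
.git/
```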

Debugging

Once the code is successfully deployed to CoLExT, it can be useful to debug issues using the Kubernetes CLI. In our deployment, we're using Microk8s with kubectl aliased to mk. Here are the most useful commands:

# See current pods deployed in the cluster
$ mk get pods -o wide
# See and (f)ollow the logs of a specific pod
$ mk logs <pod_name> -f
# Read logs of a failed pod
$ mk logs <pod_name> -p

Porting Poetry projects

Currently, CoLExT does not work with dependencies specified by Poetry. These can easily be converted to the supported requirements.txt format with these commands:

# Prevents Poetry resolver from being stuck waiting for a keyring that we don't need
export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring
poetry export --without-hashes -f requirements.txt --output requirements.txt

Tips for Android deployment

Coming soon...

Accessing the CoLExT server

Currently, the CoLExT server is not reachable through a public IP. To enable access to the server, we're using ZeroTier to create a virtual private network. This allows external users to interact with the server as if they were directly on the same private network.

Connect to CoLExT server:

  1. Install ZeroTier.
  2. Get your ZeroTier device ID. The ID is displayed at installation and can be retrieved later with:
    sudo zerotier-cli info | awk '{print $3}'
    
  3. Share your ZeroTier device ID and a public SSH key with your CoLExT contact point
  4. Add an SSH config. Note that you need to replace the username:
    # Add to ~/.ssh/config
    Host colext
        User <your_username>
        HostName 10.244.96.246
        ForwardAgent yes # Required to use local SSH keys in CoLExT
    
  5. Add the CoLExT server to your hosts file.
    # Add to /etc/hosts
    10.244.96.246 colext
    
  6. Wait for your device to be added to the ZeroTier network
  7. Test connection to CoLExT server:
    # Confirm connectivity
    $ ping colext
    # Confirm ssh access
    $ ssh colext

At this point, if you're just getting started, continue with the next step in the Using CoLExT section. If you have issues connecting to CoLExT, reach out to your CoLExT contact point.

Developing the CoLExT package

Install the CoLExT package locally with the --editable flag.

$ python3 -m pip install -e <root_dir>

Useful:

  • Experiment with launching an example using the local deployer
    # colext_config.yaml
    deployer: local_py
  • Prepare the deployment only by launching the job with the prepare flag:
    $ colext_launch_job -p

Repo overview

.
├── src/colext/     # Python package used to deploy user code and interact with results
│   ├── scripts/    # Folder with CoLExT CLI commands: launch_job + get_metrics
├── examples/       # Examples of Flower code integration with CoLExT
├── plotting/       # Plotting-related code
├── colext_setup/   # CoLExT setup automation
│   ├── ansible/            # Ansible playbooks that perform the initial configuration of SBC devices
│   ├── db_setup/           # DB schema and initial DB populate file
│   ├── base_docker_imgs/   # Base docker images used when containerizing the code

Testbed configuration notes (private):

Notion page

Maximizing jetson performance

jetson_clocks - maxes out the clocks and turns off dynamic voltage and frequency scaling (DVFS). It can be disabled with jetson_stats.

Nvidia Power Model Tool (NVP) models

Jetson power consumption measurements:

Total SBC power consumption.

  power_mw = jetson.power["tot"]["power"]

AGX Orin, Orin Nano - can separate CPU and GPU power from SOC power

Limitations:

  • Currently, CoLExT only directly supports FL code using Flower framework versions 1.5 and 1.6.
  • The tensorflow package does not work on LattePandas. TensorFlow builds from PyPI for the x86 architecture expect AVX instruction support, which the LattePandas lack.
  • Currently, Nvidia Jetsons default to PyTorch 2.2.0, except for Jetson Nanos, which only support up to PyTorch 1.13. Additional PyTorch versions can be supported upon request.
  • Currently, only Python versions 3.8 and 3.10 are supported.

POTENTIAL FUTURE UPDATES

  • Support more recent versions of Flower >= 1.7
  • Support the new way of using Flwr client-server - Super link + Super node
  • Look into the official Docker support for the new Flwr deployment option
  • Add our Android template as a Flwr template

Consider supporting conda-lock

Alternative to requirements.txt. conda-lock pins the conda environment to specific OS and platform targets.

conda env export --from-history | egrep -v "^(name|prefix): " > environment.yml
conda-lock -f environment.yml -p linux-64 -p linux-aarch64