Launch your own swarm

This tutorial will walk you through setting up your own private swarm for running inference and fine-tuning BLOOM. Please make sure you have already installed Petals and are familiar with the "Getting started" tutorial.

Before we begin:

  • This tutorial covers BLOOM-176B, which requires ~200 GB of combined GPU memory in 8-bit precision. If you want to try this on a smaller scale, use the bigscience/test-bloomd-6b3 model instead.
  • If something does not work for you, don't hesitate to reach out to us in the #running-a-server channel of our Discord.

Step 1: Set up the network

If you plan to work with unreliable GPU servers (e.g. spot instances), it is good practice to have a few non-GPU devices that are always online. These "backbone" peers can be used as --initial_peers to connect new GPU servers to the existing ones. They can also serve as relays for GPU servers that lack open ports.

If you have reliable GPU servers, you can skip this step entirely and use these servers as initial peers, like in the basic tutorial.

To start a non-GPU peer, run this line in a tmux / screen shell:

hivemind-dht --identity peer1.id --host_maddrs /ip4/0.0.0.0/tcp/8989

Once you run it, look at the output and find the following line:

Mon 00 01:23:45.678 [INFO] Running a DHT instance. To connect other peers to this one, use --initial_peers /ip4/YOUR_ADDRESS_HERE/tcp/8989/p2p/QmTPAIfThisIsMyAddressGoFindYoursnCfj

You can provide this address as --initial_peers to GPU servers or other backbone peers. If there is a risk that this peer goes down, you can launch additional hivemind-dht instances and provide multiple addresses. New peers will be able to join the swarm as long as at least one of their initial peers is alive.
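
For example, a second backbone peer could join through the first one and then be advertised alongside it. Here is a sketch with placeholder addresses (substitute the ones printed by your own peers):

hivemind-dht --identity peer2.id --host_maddrs /ip4/0.0.0.0/tcp/8990 --initial_peers /ip4/10.0.0.1/tcp/8989/p2p/QmPeerOneExampleAddr

New GPU servers can then list both backbone addresses in their --initial_peers, so they can still join even if one of the backbone peers goes down.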

Here are a few tips to help you set things up:

The --host_maddrs option takes one or more "multiaddresses", each combining an IP address, a port, and network protocols. Learn more about them in this guide.

  • The last part of the multiaddress defines the network port (8989 here), which should be reachable by other peers. You can set the port to 0 to choose a free one at random.
  • Depending on your network, you may need to specify your public IP address in the multiaddress manually to avoid connection issues, e.g. /ip4/12.34.56.78/tcp/8989 (see the sketch after this script). When running over the internet, you can auto-detect your IP with this script:
        export IPV4=$(dig -4 TXT +short o-o.myaddr.l.google.com @ns1.google.com |  tr -d '"')
        export IPV6=$(dig -6 TXT +short o-o.myaddr.l.google.com @ns1.google.com |  tr -d '"')
        echo "My IP v4: [ $IPV4 ] v6: [ $IPV6 ] - must be non-empty!"  # if IP is empty, the script has failed (e.g. no internet)
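
For example, assuming the detected IPv4 address is reachable from your other peers, you can plug it straight into the multiaddress (a sketch reusing the port and identity file from the command above):

        # bind the backbone peer to the detected public IPv4 address (assumes port 8989 is open to other peers)
        hivemind-dht --identity peer1.id --host_maddrs /ip4/$IPV4/tcp/8989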

The --identity option defines the "p2p/QmWhatever" part of your peer's address. Each peer's identity must be unique!

  • Set the --identity option to a file path (the file is created if missing) to ensure that your peer keeps the same identity each time you restart it.
  • If you omit this option, Petals will generate a new identity each time a process is started. This is fine for "temporary" peers.

Step 2: Start Petals servers

We will run bigscience/bloom-petals, the BLOOM-176B model converted to the Petals format. Use this command:

python -m petals.cli.run_server bigscience/bloom-petals --initial_peers $INITIAL_PEERS

Replace $INITIAL_PEERS with a space-separated list of multiaddresses of the initial peers you set up in Step 1. Alternatively, you can use the multiaddresses of previously joined servers.
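
For instance, with the two backbone peers from Step 1, this could look like the sketch below (the addresses are placeholders). Note that $INITIAL_PEERS is left unquoted so that each multiaddress is passed as a separate argument:

export INITIAL_PEERS="/ip4/10.0.0.1/tcp/8989/p2p/QmPeerOneExampleAddr /ip4/10.0.0.2/tcp/8990/p2p/QmPeerTwoExampleAddr"
python -m petals.cli.run_server bigscience/bloom-petals --initial_peers $INITIAL_PEERS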

Check out the server FAQ if you encounter any issues.

Step 3: Use the model

You can test that everything works using the same interface as in the README:

import torch
import torch.nn.functional as F
import transformers
from petals import DistributedBloomForCausalLM

initial_peers = [TODO_put_one_or_more_server_addresses_here]  # e.g. ["/ip4/127.0.0.1/tcp/more/stuff/here"]
tokenizer = transformers.BloomTokenizerFast.from_pretrained("bigscience/bloom-petals")
model = DistributedBloomForCausalLM.from_pretrained("bigscience/bloom-petals", initial_peers=initial_peers)

inputs = tokenizer("a cat sat", return_tensors="pt")["input_ids"]
remote_outputs = model.generate(inputs, max_length=10)
print(tokenizer.decode(remote_outputs[0]))

# "train" input embeddings by backprop through distributed transformer blocks
model.transformer.word_embeddings.weight.requires_grad = True
outputs = model.forward(input_ids=inputs)
loss = F.cross_entropy(outputs.logits.flatten(0, 1), inputs.flatten())
loss.backward()
print("Gradients (norm):", model.transformer.word_embeddings.weight.grad.norm())
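
If your swarm serves the smaller bigscience/test-bloomd-6b3 model mentioned above instead of the full BLOOM, the same snippet applies with the model name swapped (a sketch, assuming that repository also ships the tokenizer files):

tokenizer = transformers.BloomTokenizerFast.from_pretrained("bigscience/test-bloomd-6b3")  # smaller test model
model = DistributedBloomForCausalLM.from_pretrained("bigscience/test-bloomd-6b3", initial_peers=initial_peers)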

For a more advanced example, please see our notebook on "deep" prompt-tuning here: examples/prompt-tuning-personachat.ipynb.


If you encounter any issues or want to share feedback, please join the #running-a-server channel of our Discord.