Launch your own swarm
This tutorial walks you through setting up your own private swarm to run inference and fine-tune BLOOM. Please make sure you have already installed Petals and followed the basic example from the getting started section.
Before we begin:
- This tutorial covers BLOOM-176B. It requires ~200GB of combined GPU memory in 8-bit. If you want to try this on a smaller scale, use the bigscience/test-bloomd-6b3 model.
- If something does not work for you, please contact us by opening an issue.
If you plan to work with unreliable GPU servers (e.g. spot instances), it is a good practice to have a few non-GPU devices that are always online. These "backbone" peers can be used as --initial_peers to connect new GPU servers to the existing ones. They can also serve as relays for GPU servers that lack open ports.
If you have reliable GPU servers, you can skip this step entirely and use these servers as initial peers, like in the basic tutorial.
To start a non-GPU peer, run this line in a tmux / screen shell:
hivemind-dht --identity peer1.id --host_maddrs /ip4/0.0.0.0/tcp/8989
Once you run it, look at the output and find the following line:
Mon 00 01:23:45.678 [INFO] Running a DHT instance. To connect other peers to this one, use --initial_peers /ip4/YOUR_ADDRESS_HERE/tcp/8989/p2p/QmTPAIfThisIsMyAddressGoFindYoursnCfj
You can provide this address as --initial_peers to GPU servers or other backbone peers. If there is a risk that this peer goes down, you can launch additional hivemind-dht instances and provide multiple addresses. New peers will be able to join the swarm as long as at least one of their initial peers is alive.
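Before launching a server, it can help to check that at least one of the addresses you plan to pass as --initial_peers accepts connections. The following is a minimal sketch and not part of Petals or hivemind: the helper names tcp_endpoint and any_peer_reachable are hypothetical, it only handles TCP multi-addresses, and a successful TCP connection shows that the port is open, not that a healthy DHT peer is behind it.

```python
import socket
from typing import List, Tuple


def tcp_endpoint(multiaddr: str) -> Tuple[str, int]:
    """Extract (host, port) from an address like /ip4/1.2.3.4/tcp/8989/p2p/Qm..."""
    parts = multiaddr.strip("/").split("/")
    # parts looks like ["ip4", "1.2.3.4", "tcp", "8989", "p2p", "Qm..."]
    if len(parts) < 4 or parts[0] not in ("ip4", "ip6") or parts[2] != "tcp":
        raise ValueError(f"unsupported multi-address: {multiaddr}")
    return parts[1], int(parts[3])


def any_peer_reachable(initial_peers: List[str], timeout: float = 3.0) -> bool:
    """Return True as soon as one initial peer accepts a TCP connection."""
    for addr in initial_peers:
        host, port = tcp_endpoint(addr)
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            continue  # this peer is down or unreachable, try the next one
    return False
```

If any_peer_reachable returns False for every address in your list, a new server started with those --initial_peers is unlikely to join the swarm.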
Here are a few tips to help you set up:
The host_maddrs option takes "multi-addresses", each made of an IP address, a port and network protocols. Learn more about them in this guide.
- The last part of a multi-address defines the network port (8989), which should be accessible to other peers. You can set the port to 0 to choose it at random.
- Depending on your network, you may need to set your IP manually to avoid connection issues, e.g. /ip4/12.34.56.78/tcp/8989
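To make the structure of a multi-address concrete, here is a small illustrative parser. It is a sketch under simplifying assumptions, not the libp2p implementation: split_multiaddr and port_of are hypothetical helpers, and quic is the only value-less protocol it knows about because it is the only one used in this tutorial.

```python
from typing import List, Tuple

# Protocols that carry no value of their own (assumption: only those used in this tutorial).
VALUELESS = {"quic"}


def split_multiaddr(maddr: str) -> List[Tuple[str, str]]:
    """Split '/ip4/0.0.0.0/tcp/8989' into [('ip4', '0.0.0.0'), ('tcp', '8989')]."""
    if not maddr.startswith("/"):
        raise ValueError("a multi-address must start with '/'")
    tokens = maddr[1:].split("/")
    pairs = []
    i = 0
    while i < len(tokens):
        proto = tokens[i]
        if proto in VALUELESS:
            pairs.append((proto, ""))
            i += 1
        else:
            if i + 1 >= len(tokens):
                raise ValueError(f"protocol {proto!r} is missing a value")
            pairs.append((proto, tokens[i + 1]))
            i += 2
    return pairs


def port_of(maddr: str) -> int:
    """Return the tcp/udp port of a multi-address (0 means 'pick a random port')."""
    for proto, value in split_multiaddr(maddr):
        if proto in ("tcp", "udp"):
            return int(value)
    raise ValueError("no tcp/udp component found")
```

For example, port_of("/ip4/0.0.0.0/tcp/8989") returns the port that other peers must be able to reach.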
When running over the internet, you can auto-detect IP with this script:
export IPV4=$(dig -4 TXT +short o-o.myaddr.l.google.com @ns1.google.com | tr -d '"')
export IPV6=$(dig -6 TXT +short o-o.myaddr.l.google.com @ns1.google.com | tr -d '"')
echo "My IP v4: [ $IPV4 ] v6: [ $IPV6 ] - must be non-empty!" # if IP is empty, the script has failed (e.g. no internet)
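If you drive the launch from Python rather than the shell, the same "must be non-empty" check can be made explicit. This is a minimal sketch: announce_maddr is a hypothetical helper, and it assumes the detected address arrives as a string that may still carry the surrounding quotes and whitespace that the dig command prints.

```python
import ipaddress


def announce_maddr(raw_ip: str, port: int) -> str:
    """Validate a detected IP and turn it into a TCP multi-address."""
    ip_text = raw_ip.strip().strip('"')
    if not ip_text:
        raise ValueError("IP is empty: auto-detection failed, set the address manually")
    ip = ipaddress.ip_address(ip_text)  # raises ValueError on malformed input
    family = "ip4" if ip.version == 4 else "ip6"
    return f"/{family}/{ip}/tcp/{port}"
```

Failing early here is cheaper than discovering later that the server announced an unusable address.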
The identity defines the "p2p/QmWhatever" part of your peer's address. Each peer's identity must be unique!
- Set the --identity option to a file (created if missing) to ensure that your peer has the same identity each time you restart it.
- If you omit this option, Petals will generate a new identity each time a process is started. This is fine for "temporary" peers.
We will run bigscience/bloom-petals, the BLOOM-176B model that was converted to Petals format.
Here's the full script that we used to benchmark Petals over the internet (section 3.2 here). Don't worry, we'll explain everything.
export CUDA_VISIBLE_DEVICES="0" # choose one GPU index (e.g. "0") or leave blank to run on CPU
export NUM_BLOCKS=<TODO pick the number of blocks based on your GPU memory, see below>
export COMPRESSION=<use "NONE" when running locally or "BLOCKWISE_8BIT" to run over the internet>
export INITIAL_PEERS=<TODO add one or several multi-addresses to connect to>
export CACHE_SIZE="1.0GiB" # rule of thumb: 250MB per block, see notes
export PORT=6789 # select an open port
export IPV4=$(dig -4 TXT +short o-o.myaddr.l.google.com @ns1.google.com | tr -d '"')
echo "My IP v4: [ $IPV4 ] - must be non-empty!" # if IP is empty, you need to specify IP manually
python -m cli.run_server bigscience/bloom-petals --num_blocks $NUM_BLOCKS --throughput auto \
--torch_dtype float16 --load_in_8bit True --compression $COMPRESSION --attn_cache_size $CACHE_SIZE \
--host_maddrs /ip4/0.0.0.0/tcp/$PORT /ip4/0.0.0.0/udp/$PORT/quic --announce_maddrs /ip4/$IPV4/tcp/$PORT /ip4/$IPV4/udp/$PORT/quic \
--identity_path ./agirlhasnoname.id --initial_peers $INITIAL_PEERS
That's a lot of stuff. Let's cover it one parameter at a time:
- num_blocks depends on your GPU memory. A good rule of thumb is num_blocks = (gpu_memory_gb - 2) / 2.75.
- throughput measures your server's throughput for load-balancing. Currently, it runs speedtest. If it does not work, you can set throughput manually, e.g. --throughput=150
- torch_dtype: for BLOOM, pick bfloat16 for Ampere (e.g. RTX 3060, A100) or newer GPUs, float16 for other GPUs, and float32 for CPU.
- load_in_8bit: use LLM.int8() to fit more transformer blocks in the same memory. Remove this argument on older pre-Turing GPUs or when running on CPU.
- attn_cache_size: maximum memory used for generation (and only generation). If not specified, the server may run out of memory when processing too many inference queries. A good rule of thumb is 2 GB per 8 blocks, scaled proportionally.
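The rules of thumb above are easy to turn into a quick calculator. The sketch below only encodes the formulas quoted in this list; the function names are hypothetical, the result is rounded down to a whole number of blocks, and mapping "Ampere or newer" to an NVIDIA compute capability of 8 or higher is an assumption added here for illustration.

```python
import math
from typing import Optional


def suggest_num_blocks(gpu_memory_gb: float) -> int:
    """num_blocks = (gpu_memory_gb - 2) / 2.75, rounded down, never negative."""
    return max(0, math.floor((gpu_memory_gb - 2) / 2.75))


def suggest_attn_cache_gb(num_blocks: int) -> float:
    """2 GB per 8 blocks, scaled proportionally (0.25 GB per block)."""
    return num_blocks * 2 / 8


def suggest_torch_dtype(compute_capability_major: Optional[int]) -> str:
    """Pick a dtype name; pass None when running on CPU."""
    if compute_capability_major is None:
        return "float32"
    if compute_capability_major >= 8:  # assumption: Ampere or newer
        return "bfloat16"
    return "float16"
```

For a 24 GB GPU this suggests 8 blocks and a 2 GB attention cache. Treat the numbers as a starting point and adjust them if the server runs out of memory.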
The remaining parameters: --host_maddrs, --announce_maddrs, --identity and --initial_peers are discussed in the networking section above ("Step 1"). When running multiple processes per server, make sure each one has a unique identity and port.
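One way to keep identities and ports unique across several processes on one machine is to derive both from a process index. This is a sketch only: plan_servers is a hypothetical helper, the server<index>.id file naming is an arbitrary choice, and the flags are the ones used in the launch script above (TCP addresses only, for brevity).

```python
from typing import List


def plan_servers(count: int, base_port: int, ipv4: str, initial_peers: List[str]) -> List[List[str]]:
    """Build networking arguments for `count` server processes on one machine."""
    plans = []
    for index in range(count):
        port = base_port + index  # one port per process
        plans.append([
            "--host_maddrs", f"/ip4/0.0.0.0/tcp/{port}",
            "--announce_maddrs", f"/ip4/{ipv4}/tcp/{port}",
            "--identity_path", f"./server{index}.id",  # one identity file per process
            "--initial_peers", *initial_peers,
        ])
    return plans


def identity_of(args: List[str]) -> str:
    """Return the identity file from an argument list produced by plan_servers."""
    return args[args.index("--identity_path") + 1]
```

Each returned list can be appended to the rest of the server command line, with CUDA_VISIBLE_DEVICES set separately for each process.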
You can test that everything works using the same interface as in the README:
import torch
import torch.nn.functional as F
import transformers
from src import DistributedBloomForCausalLM
initial_peers = [TODO_put_one_or_more_server_addresses_here] # e.g. ["/ip4/127.0.0.1/tcp/more/stuff/here"]
tokenizer = transformers.BloomTokenizerFast.from_pretrained("bigscience/bloom-petals")
model = DistributedBloomForCausalLM.from_pretrained(
    "bigscience/bloom-petals", initial_peers=initial_peers, low_cpu_mem_usage=True, torch_dtype=torch.float32
)  # this model requires 14GB memory to load word embeddings (size: 14336 x 250k)
inputs = tokenizer("a cat sat", return_tensors="pt")["input_ids"]
remote_outputs = model.generate(inputs, max_length=10)
print(tokenizer.decode(remote_outputs[0]))
# "train" input embeddings by backprop through distributed transformer blocks
model.transformer.word_embeddings.weight.requires_grad = True
outputs = model.forward(input_ids=inputs)
loss = F.cross_entropy(outputs.logits.flatten(0, 1), inputs.flatten())
loss.backward()
print("Gradients (norm):", model.transformer.word_embeddings.weight.grad.norm())
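The 14GB figure in the comment on the from_pretrained call follows from the size of the embedding matrix. As a quick check, assuming a vocabulary of 250,880 tokens (BLOOM's published vocabulary size, which the comment rounds to 250k) and 4 bytes per float32 value:

```python
HIDDEN_SIZE = 14336        # embedding dimension from the comment above
VOCAB_SIZE = 250_880       # assumption: exact BLOOM vocabulary size
BYTES_PER_FLOAT32 = 4


def embedding_bytes(vocab_size: int, hidden_size: int, bytes_per_value: int) -> int:
    """Memory needed to hold a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size * bytes_per_value


total = embedding_bytes(VOCAB_SIZE, HIDDEN_SIZE, BYTES_PER_FLOAT32)
print(f"word embeddings: {total / 1e9:.1f} GB in float32")
```

The same matrix would take half as much memory in a 16-bit dtype.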
For a more advanced usage example, please see our example on "deep" prompt-tuning here: examples/prompt-tuning-personachat.ipynb.
If you encounter any issues or want to share feedback, please join the #running-a-server channel of our Discord.