WIP commit to test serving of the 125M model. Once it's ready, we can serve the larger models easily.
Important files:
- The Python serving script is in `H3/examples/serving_h3.py`.
- The bash serving script is in `serve.sh`.
- The Dockerfile is in `Together_dockerfile`.
- The config is in `local-cfg.yaml`.
By default, we expect the model checkpoint to be at `/home/user/.together/models/H3-125M/model.pt` inside the Docker container.
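One way to satisfy that path is to bind-mount the checkpoint directory when starting the container. A minimal sketch, assuming the image is built from `Together_dockerfile`; the `h3-serve` tag and the `--gpus all` flag are placeholders, not part of this repo:

```bash
# Build the serving image from the provided Dockerfile
# ("h3-serve" is a placeholder tag).
docker build -f Together_dockerfile -t h3-serve .

# Mount a host directory containing model.pt at the path
# the serving script expects inside the container.
docker run --gpus all \
  -v "$PWD/H3-125M:/home/user/.together/models/H3-125M" \
  h3-serve
```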
You can get it by running these commands:
```bash
git lfs install
git clone https://huggingface.co/danfu09/H3-125M
```
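If you are staging the checkpoint on the host instead of mounting it, you can copy it to a directory that mirrors the in-container path; a sketch, assuming the cloned repo has `model.pt` at its top level:

```bash
# Copy the downloaded checkpoint to the expected location
# (assumes the clone contains model.pt at its top level).
mkdir -p /home/user/.together/models/H3-125M
cp H3-125M/model.pt /home/user/.together/models/H3-125M/model.pt
```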
You may need to run these commands after the initial install to set up FlashAttention (we can add them to the Dockerfile):
```bash
# Pull in the flash-attention submodule and its own submodules,
# then install it in editable mode.
git submodule init
git submodule update
cd flash-attention
git submodule init
git submodule update
pip install -e .
```
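A quick import check (just a sanity check, not required by the serving script) confirms the build worked:

```bash
# Should print the installed flash-attn version if the build succeeded.
python -c "import flash_attn; print(flash_attn.__version__)"
```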