Added docker-file and edited the instructions for it #50

Open
wants to merge 2 commits into main

Conversation

@rpj09 commented Mar 16, 2023

Closes #40. I have added the Dockerfile and added the instructions for it.


# Create OpenChatKit environment
COPY environment.yml .
RUN conda env create -f environment.yml
Contributor

Perhaps we should use mamba here instead. This can take a very long time to build the image.
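A minimal sketch of that swap, assuming mamba is first installed into the base environment:

# Install mamba into the base env, then use it to solve environment.yml much faster
RUN conda install -y -n base -c conda-forge mamba
RUN mamba env create -f environment.yml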


This takes a while (forever) for me. Meanwhile it doesn't recognize conda yet; I have to use /opt/conda/bin/conda instead of conda.
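One possible fix for the PATH issue, assuming the Miniconda install prefix is /opt/conda:

# Put conda on PATH so later RUN steps and interactive shells can call it directly
ENV PATH=/opt/conda/bin:${PATH}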

COPY . .

# Prepare GPT-NeoX-20B model
RUN python pretrained/GPT-NeoX-20B/prepare.py
Contributor

This can be pretty large. It would be better to bind mount a directory and have the entrypoint call prepare.py.
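A sketch of that approach, assuming a hypothetical entrypoint.sh that defers the download to container start:

#!/bin/bash
# hypothetical entrypoint.sh: fetch the weights at runtime instead of at build time
python pretrained/GPT-NeoX-20B/prepare.py
exec "$@"

with the RUN line replaced in the Dockerfile by:

COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]

and the weights directory bind-mounted at run time so they persist across containers, e.g.:

docker run -it --volume $(pwd)/pretrained:/app/pretrained openchatkit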

Author

Hey @csris, I am quite new to this project and community. Can you help me out by explaining the project structure and what the requirements are in detail? I need a little guidance on this project.

@csris
Contributor

csris commented Mar 18, 2023

Thanks for the PR, @rpj09! Do you have any thoughts on how we should handle the datasets download? They can be very large, so having to redownload them every time the container is launched would be somewhat painful.

@@ -0,0 +1,35 @@
# Base image
FROM ubuntu:20.04

I think we should use the NVIDIA Container Toolkit so we can leverage NVIDIA GPUs.
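A hedged sketch of that change: start from a CUDA base image and run with the toolkit's GPU flag (the exact tag is illustrative):

# Base image with CUDA userspace libraries preinstalled
FROM nvidia/cuda:11.8.0-runtime-ubuntu20.04

# at run time, with the NVIDIA Container Toolkit installed on the host:
docker run --gpus all -it openchatkit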

Author

Sure, I will take care of that.

@orangetin
Member

orangetin commented Apr 22, 2023

@rpj09 are you still working on this? I'd like to help. I can make some time to work on it.

@rpj09
Author

rpj09 commented Apr 22, 2023

> @rpj09 are you still working on this? I'd like to help. I can make some time to work on it.

Yeah, sure. Actually, I got busy with semester exams.

@rpj09
Author

rpj09 commented Apr 22, 2023

> Thanks for the PR, @rpj09! Do you have any thoughts on how we should handle the datasets download? They can be very large, so having to redownload them every time the container is launched would be somewhat painful.

Hey @csris, apologies for replying this late.

When working with large and frequently updated datasets, having to redownload them every time a container is launched can be a real pain. Using a caching system can help make this process much smoother.

To implement this approach, you can create a separate container that is responsible for caching and storing the dataset. This container can use a caching system like docker-cache or Squid to ensure that the dataset is always available and up-to-date across all containers that use it.

To update the dataset, you can set up a script to run periodically, which checks for new data and downloads it if necessary. Then, when launching the main container, you can mount the dataset volume from the caching container into the main container. This way, the main container can access the cached dataset without having to download it again, making the process much faster and smoother.

Overall, using a caching system can make working with large and frequently updated datasets in Docker containers much more manageable. It minimizes the need for redownloads, ensures that the dataset is always available and up-to-date, and can help make the entire process much smoother and more efficient.
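A lighter-weight variant of the same idea is a named Docker volume shared across containers (the volume name and paths below are illustrative):

# Populate a named volume once with a throwaway container
docker volume create openchatkit-data
docker run --rm --volume openchatkit-data:/app/data openchatkit python data/OIG/prepare.py

# Later containers mount the same volume and reuse the cached dataset
docker run -it --volume openchatkit-data:/app/data openchatkit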

@orangetin
Member

orangetin commented Apr 22, 2023

> When working with large and frequently updated datasets, having to redownload them every time a container is launched can be a real pain. Using a caching system can help make this process much smoother. [...]

I feel like this approach is more complicated than it needs to be. I modified @kailust's Dockerfile. I've tested it and it works:

Dockerfile

# Base image
FROM ubuntu:20.04
VOLUME /app

# Set working directory
WORKDIR /app

# Update and install required packages
RUN apt-get update && \
    apt-get install git-lfs wget gcc -y && \
    rm -rf /var/lib/apt/lists/*

# Download and install Miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /app/conda && \
    rm Miniconda3-latest-Linux-x86_64.sh

ENV PATH=/app/conda/bin:${PATH}

# Create OpenChatKit environment
COPY environment.yml .
RUN conda install mamba -n base -c conda-forge
RUN mamba env create -f environment.yml 

# Set conda to automatically activate base environment on login
RUN echo ". /app/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
    echo "conda activate OpenChatKit" >> ~/.bashrc

# Copy OpenChatKit code
COPY . .

# Optional: prepare for finetuning
# RUN git lfs install

# Set entrypoint to bash shell
ENTRYPOINT ["/bin/bash"]

Right now it just starts a bash shell; we could modify it to directly start training/inference and select a model with optional args.
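A rough sketch of what that dispatch could look like in the entrypoint script (the subcommand names and script paths are assumptions, not the final script):

#!/bin/bash
# hypothetical entrypoint: dispatch on the first argument, default to inference
source /app/conda/etc/profile.d/conda.sh && conda activate OpenChatKit
case "$1" in
  prepare) shift; python pretrained/Pythia-Chat-Base-7B/prepare.py "$@" ;;
  train)   shift; bash training/finetune_Pythia-Chat-Base-7B.sh "$@" ;;
  *)       python inference/bot.py "$@" ;;
esac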

Want me to make a PR to your branch?

@rpj09
Author

rpj09 commented Apr 22, 2023

> I feel like this approach is more complicated than it needs to be. I modified @kailust's Dockerfile. I've tested it and it works. [...] Want me to make a PR to your branch?

Sure @orangetin

@orangetin
Member

orangetin commented Apr 22, 2023

I got the Dockerfile working using a somewhat different method. Instead of just opening up an empty shell, I wrote a bash script that executes when the Docker container is run, which then runs the required scripts for prepping the data, training, and/or command-line inference.

Plus, with the volume method mentioned above, it'll be easy to handle downloading the required datasets.

In the meantime, @csris, should I open a new PR for that or merge it in here?

Here's the branch: https://github.com/orangetin/OpenChatKit/tree/docker
It modifies the original dockerfile and adds a new bash script.

Build command:
sudo docker build -t openchatkit .

Sample run command:
sudo docker run -it --rm --volume $(pwd):/app openchatkit --model togethercomputer/Pythia-Chat-Base-7B

EDIT:
I've updated the files to use micromamba instead of miniconda/mamba because launching the container took forever with miniconda.

EDIT 2:
Other sample commands:
sudo docker run -it --rm openchatkit prepare --bitsandbytes # run prepare scripts and install bitsandbytes
sudo docker run -it --rm openchatkit train --model gpt-neox # train the gpt-neox model
sudo docker run -it --rm openchatkit train # defaults model to 'pythia'
sudo docker run -it --rm openchatkit --model togethercomputer/GPT-NeoXT-Chat-Base-20B

@orangetin
Copy link
Member

orangetin commented Apr 24, 2023

I've edited the branch mentioned above. I was able to shrink the image size from 20.5 GB to 13.8 GB by clearing the conda cache.
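The saving comes from a cleanup step along these lines at the end of the environment-creation layer (a sketch, assuming the micromamba setup from the earlier edit):

# Drop package tarballs and index caches so they don't bloat the image layer
RUN micromamba clean --all --yes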

Here are the updated commands:

Inference:

sudo docker create -i -t --name inference --volume $(pwd):/app --volume /TEMPCACHE:/root/.cache openchatkit --model togethercomputer/Pythia-Chat-Base-7B

sudo docker start inference -a

Prepare for training:
sudo docker run -it --rm --volume $(pwd):/app --volume /TEMPCACHE:/root/.cache openchatkit prepare
Train:
sudo docker run -it --rm --volume $(pwd):/app --volume /TEMPCACHE:/root/.cache openchatkit train -m pythia

^ The cache directory is where Hugging Face saves downloaded models. By mounting it as a volume, the downloaded model can be shared by multiple containers. This also makes it efficient to launch multiple containers (like multiple inference instances) concurrently without using more disk space.

This needs more testing though.

Edit: The above method successfully loads the model onto the GPU/CPU but does not produce outputs; it goes into an 'EOF' error loop. The reason is that Docker doesn't play nicely with bash inputs from a Python script inside a container. Making a Docker container should work for training, but performance may not be the best. I'd say we should wait until we have a working Gradio interface before continuing with this for inference.


Successfully merging this pull request may close these issues.

Build a docker image for openchatkit