Added docker-file and edited the instructions for it #50

Open
wants to merge 2 commits into main

Conversation

@rpj09 commented Mar 16, 2023

Closes #40. I have added the Dockerfile and added the instructions for it.


# Create OpenChatKit environment
COPY environment.yml .
RUN conda env create -f environment.yml
Contributor

Perhaps we should use mamba here instead. This can take a very long time to build the image.
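A minimal sketch of that swap, assuming mamba is first installed into the base environment:

# Install mamba into the base env, then use it to solve environment.yml much faster
RUN conda install -y -n base -c conda-forge mamba
RUN mamba env create -f environment.yml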


This takes a while (forever) for me. Meanwhile it doesn't recognize conda yet; I have to use /opt/conda/bin/conda instead of conda.
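One possible fix for the PATH issue, assuming the Miniconda install prefix is /opt/conda:

# Put conda on PATH so later RUN steps and interactive shells can call it directly
ENV PATH=/opt/conda/bin:${PATH}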

COPY . .

# Prepare GPT-NeoX-20B model
RUN python pretrained/GPT-NeoX-20B/prepare.py
Contributor

This can be pretty large. It would be better to bind mount a directory and have the entrypoint call prepare.py.
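A sketch of that approach, assuming a hypothetical entrypoint.sh that defers the download to container start:

#!/bin/bash
# hypothetical entrypoint.sh: fetch the weights at runtime instead of at build time
python pretrained/GPT-NeoX-20B/prepare.py
exec "$@"

with the RUN line replaced in the Dockerfile by:

COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]

and the weights directory bind-mounted at run time so they persist across containers, e.g.:

docker run -it --volume $(pwd)/pretrained:/app/pretrained openchatkit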

Author

Hey @csris, I am quite new to this project and community. Can you help me out by explaining the project structure and what the requirements are in detail? I need a little guidance on this project.

@csris
Contributor

csris commented Mar 18, 2023

Thanks for the PR, @rpj09! Do you have any thoughts on how we should handle the datasets download? They can be very large, so having to redownload them every time the container is launched would be somewhat painful.

@@ -0,0 +1,35 @@
# Base image
FROM ubuntu:20.04

I think we should use the NVIDIA Container Toolkit so we can leverage NVIDIA GPUs.
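A hedged sketch of that change: start from a CUDA base image and run with the toolkit's GPU flag (the exact tag is illustrative):

# Base image with CUDA userspace libraries preinstalled
FROM nvidia/cuda:11.8.0-runtime-ubuntu20.04

# at run time, with the NVIDIA Container Toolkit installed on the host:
docker run --gpus all -it openchatkit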

Author

Sure, I will take care of that.

@orangetin
Member

orangetin commented Apr 22, 2023

@rpj09 are you still working on this? I'd like to help. I can make some time to work on it.

@rpj09
Author

rpj09 commented Apr 22, 2023

> @rpj09 are you still working on this? I'd like to help. I can make some time to work on it.

Yeah, sure. Actually, I got busy with semester exams.

@rpj09
Author

rpj09 commented Apr 22, 2023

> Thanks for the PR, @rpj09! Do you have any thoughts on how we should handle the datasets download? They can be very large, so having to redownload them every time the container is launched would be somewhat painful.

Hey @csris, apologies for replying this late.

When working with large and frequently updated datasets, having to redownload them every time a container is launched can be a real pain. Using a caching system can help make this process much smoother.

To implement this approach, you can create a separate container that is responsible for caching and storing the dataset. This container can use a caching system like docker-cache or Squid to ensure that the dataset is always available and up-to-date across all containers that use it.

To update the dataset, you can set up a script to run periodically, which checks for new data and downloads it if necessary. Then, when launching the main container, you can mount the dataset volume from the caching container into the main container. This way, the main container can access the cached dataset without having to download it again, making the process much faster and smoother.

Overall, using a caching system can make working with large and frequently updated datasets in Docker containers much more manageable. It minimizes the need for redownloads, ensures that the dataset is always available and up-to-date, and can help make the entire process much smoother and more efficient.
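A lighter-weight variant of the same idea is a named Docker volume shared across containers (the volume name and paths below are illustrative):

# Populate a named volume once with a throwaway container
docker volume create openchatkit-data
docker run --rm --volume openchatkit-data:/app/data openchatkit python data/OIG/prepare.py

# Later containers mount the same volume and reuse the cached dataset
docker run -it --volume openchatkit-data:/app/data openchatkit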

@orangetin
Member

orangetin commented Apr 22, 2023

> When working with large and frequently updated datasets, having to redownload them every time a container is launched can be a real pain. Using a caching system can help make this process much smoother. [...]

I feel like this approach is more complicated than it needs to be. I modified @kailust's Dockerfile. I've tested it and it works:

Dockerfile

# Base image
FROM ubuntu:20.04
VOLUME /app

# Set working directory
WORKDIR /app

# Update and install required packages
RUN apt-get update && \
    apt-get install git-lfs wget gcc -y && \
    rm -rf /var/lib/apt/lists/*

# Download and install Miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /app/conda && \
    rm Miniconda3-latest-Linux-x86_64.sh

ENV PATH=/app/conda/bin:${PATH}

# Create OpenChatKit environment
COPY environment.yml .
RUN conda install mamba -n base -c conda-forge
RUN mamba env create -f environment.yml 

# Set conda to automatically activate base environment on login
RUN echo ". /app/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
    echo "conda activate OpenChatKit" >> ~/.bashrc

# Copy OpenChatKit code
COPY . .

# Optional: prepare for finetuning
# RUN git lfs install

# Set entrypoint to bash shell
ENTRYPOINT ["/bin/bash"]

Right now it just starts a bash shell; we could modify it to directly start training/inference and select a model with optional args.
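A rough sketch of what that dispatch could look like in the entrypoint script (the subcommand names and script paths are assumptions, not the final script):

#!/bin/bash
# hypothetical entrypoint: dispatch on the first argument, default to inference
source /app/conda/etc/profile.d/conda.sh && conda activate OpenChatKit
case "$1" in
  prepare) shift; python pretrained/Pythia-Chat-Base-7B/prepare.py "$@" ;;
  train)   shift; bash training/finetune_Pythia-Chat-Base-7B.sh "$@" ;;
  *)       python inference/bot.py "$@" ;;
esac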

Want me to make a PR to your branch?

@rpj09
Author

rpj09 commented Apr 22, 2023

> I feel like this approach is more complicated than it needs to be. I modified @kailust's Dockerfile. I've tested it and it works. [...] Want me to make a PR to your branch?

Sure @orangetin

@orangetin
Member

orangetin commented Apr 22, 2023

I got the Dockerfile working using a somewhat different method. Instead of just opening up an empty shell, I wrote a bash script that executes when the Docker container is run, which then runs the required scripts for prepping the data, training, and/or command-line inference.

Plus, with the volume method mentioned above, it'll be easy to handle downloading the required datasets.

In the meantime, @csris, should I open a new PR for that or merge it in here?

Here's the branch: https://github.com/orangetin/OpenChatKit/tree/docker
It modifies the original dockerfile and adds a new bash script.

Build command:
sudo docker build -t openchatkit .

Sample run command:
sudo docker run -it --rm --volume $(pwd):/app openchatkit --model togethercomputer/Pythia-Chat-Base-7B

EDIT:
I've updated the files to use micromamba instead of miniconda/mamba because launching the container took forever with miniconda.

EDIT 2:
Other sample commands:
sudo docker run -it --rm openchatkit prepare --bitsandbytes # run prepare scripts and install bitsandbytes
sudo docker run -it --rm openchatkit train --model gpt-neox # train the gpt-neox model
sudo docker run -it --rm openchatkit train # defaults model to 'pythia'
sudo docker run -it --rm openchatkit --model togethercomputer/GPT-NeoXT-Chat-Base-20B

@orangetin
Copy link
Member

orangetin commented Apr 24, 2023

I've edited the branch mentioned above. I was able to shrink the image size from 20.5 GB to 13.8 GB by clearing the conda cache.
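The saving comes from a cleanup step along these lines at the end of the environment-creation layer (a sketch, assuming the micromamba setup from the earlier edit):

# Drop package tarballs and index caches so they don't bloat the image layer
RUN micromamba clean --all --yes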

Here are the updated commands:

Inference:

sudo docker create -i -t --name inference --volume $(pwd):/app --volume /TEMPCACHE:/root/.cache openchatkit --model togethercomputer/Pythia-Chat-Base-7B

sudo docker start inference -a

Prepare for training:
sudo docker run -it --rm --volume $(pwd):/app --volume /TEMPCACHE:/root/.cache openchatkit prepare
Train:
sudo docker run -it --rm --volume $(pwd):/app --volume /TEMPCACHE:/root/.cache openchatkit train -m pythia

^ The cache directory is where Hugging Face saves downloaded models. By mounting it as a volume, the downloaded model can be shared by multiple containers. This also makes it efficient to launch multiple containers (like multiple inference instances) concurrently without using more disk space.

This needs more testing though.

Edit: The above method successfully loads the model onto the GPU/CPU but does not produce outputs; it goes into an 'EOF' error loop. The reason is that Docker doesn't play nicely with bash inputs from a Python script inside a container. Making a Docker container should work for training, but performance may not be the best. I'd say we should wait until we have a working Gradio interface before continuing with this for inference.


Successfully merging this pull request may close these issues.

Build a docker image for openchatkit