Added Dockerfile and edited the instructions for it #50
base: main
Conversation
```dockerfile
# Create OpenChatKit environment
COPY environment.yml .
RUN conda env create -f environment.yml
```
Perhaps we should use `mamba` here instead. This can take a very long time to build the image.
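A minimal sketch of that swap, assuming Miniconda is installed at `/opt/conda` (the actual paths in this PR may differ):

```dockerfile
# Install mamba into the base env, then use its much faster solver
# for the environment build. Paths here are assumptions.
RUN /opt/conda/bin/conda install -y -n base -c conda-forge mamba && \
    /opt/conda/bin/mamba env create -f environment.yml
```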
This takes a while (forever) for me. Also, the shell doesn't recognize `conda` yet; I have to use `/opt/conda/bin/conda` instead of `conda`.
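A common fix, assuming Miniconda is installed at `/opt/conda`, is to put it on `PATH` in the Dockerfile so plain `conda` resolves:

```dockerfile
# Assumption: Miniconda lives at /opt/conda
ENV PATH=/opt/conda/bin:$PATH
```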
```dockerfile
COPY . .

# Prepare GPT-NeoX-20B model
RUN python pretrained/GPT-NeoX-20B/prepare.py
```
This can be pretty large. It would be better to bind mount a directory and have the entrypoint call `prepare.py`.
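A rough sketch of that idea: keep the weights out of the image and run `prepare.py` at launch against a host-mounted directory (the `openchatkit` image name and `/app` working directory here are assumptions):

```bash
# Weights land in a host directory, so they survive container restarts
docker run -it \
  -v "$PWD/pretrained:/app/pretrained" \
  openchatkit \
  python pretrained/GPT-NeoX-20B/prepare.py
```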
Hey @csris, I am quite new to this project and community. Can you help me out by explaining the project structure and the requirements in detail? I need a bit of guidance on this project.
Thanks for the PR, @rpj09! Do you have any thoughts on how we should handle the datasets download? They can be very large, so having to redownload them every time the container is launched would be somewhat painful.
```diff
@@ -0,0 +1,35 @@
# Base image
FROM ubuntu:20.04
```
I think we should use the NVIDIA Container Toolkit so we can leverage NVIDIA GPUs.
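For example (a sketch, not the final Dockerfile; the CUDA tag is an assumption), the base image could come from NVIDIA's CUDA images:

```dockerfile
# Assumed CUDA base image in place of plain ubuntu:20.04
FROM nvidia/cuda:11.8.0-runtime-ubuntu20.04
```

The container would then be started with `docker run --gpus all ...`, which requires the NVIDIA Container Toolkit on the host.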
Sure, I will take care of that.
@rpj09, are you still working on this? I'd like to help; I can make some time to work on it.
Yeah, sure. Actually, I got busy with semester exams.
Hey @csris, apologies for replying this late. When working with large and frequently updated datasets, having to redownload them every time a container is launched can be a real pain. Using a caching system can make this process much smoother.

To implement this approach, you can create a separate container responsible for caching and storing the dataset. That container can use a caching system like docker-cache or Squid to ensure the dataset is always available and up to date across all containers that use it. To update the dataset, you can set up a script that runs periodically, checks for new data, and downloads it if necessary. Then, when launching the main container, you mount the dataset volume from the caching container into it, so the main container can access the cached dataset without downloading it again.

Overall, a caching system makes working with large, frequently updated datasets in Docker containers much more manageable: it minimizes redownloads, keeps the dataset available and up to date, and makes the whole process faster and more efficient.
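A rough sketch of the volume part of this idea, using a named volume; the `openchatkit` image name and the prepare-script path are hypothetical:

```bash
# Named volume persists datasets across container launches
docker volume create openchatkit-data

# One-off container populates/refreshes the cache (script path assumed)
docker run --rm -v openchatkit-data:/data openchatkit python data/OIG/prepare.py

# Main container mounts the same volume, so nothing is re-downloaded
docker run -it -v openchatkit-data:/data openchatkit
```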
I feel like this approach is more complicated than it needs to be. I modified @kailust's Dockerfile; I've tested it and it works.
Right now, it just starts a bash shell. We could modify it to directly start training/inference and select a model with optional args. Want me to make a PR to your branch?
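One way that could look (a hypothetical entrypoint script, not necessarily what's in the branch): dispatch on the first argument and fall back to a shell:

```bash
#!/bin/bash
# entrypoint.sh (hypothetical): choose a mode at `docker run` time.
# The Python script paths are assumptions, not the repo's actual layout.
set -e
case "$1" in
  train)     shift; exec python training/finetune.py "$@" ;;
  inference) shift; exec python inference/bot.py "$@" ;;
  *)         exec /bin/bash ;;
esac
```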
Sure, @orangetin.
I got the Dockerfile working using a somewhat different method. Instead of just opening up an empty shell, I wrote a bash script that executes when the Docker container is run, which then runs the required scripts for prepping the data, training, and/or command-line inference. Plus, with the volume method mentioned above, it'll be easy to handle downloading the required datasets. In the meantime, @csris, should I open a new PR for that or merge it in here? Here's the branch: https://github.com/orangetin/OpenChatKit/tree/docker
Build command:
EDIT:
I've edited the branch mentioned above. I was able to shrink the image size from 20.5 GB to 13.8 GB by clearing the conda cache. Here are the updated commands. Inference:
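For reference, that kind of shrink typically comes from clearing the cache in the same `RUN` layer that created the environment, so the package tarballs never land in the image (a sketch; the exact line in the branch may differ):

```dockerfile
RUN conda env create -f environment.yml && \
    conda clean --all --yes
```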
Prepare for training:
^ The cache directory is where Hugging Face saves downloaded models. By mounting it as a volume, the downloaded model can be shared by multiple containers. This also makes it efficient to launch multiple containers (like multiple inference instances) concurrently without using up more disk space. This needs more testing, though.
Edit: The above method successfully loads the model onto the GPU/CPU but does not produce outputs; it goes into an 'EOF' error loop. The reason is that Docker doesn't play nice with bash inputs read by a Python script inside a container. Making a Docker container should work for training, but performance may not be the best. I'd say we should wait until we have a working Gradio interface before continuing with this for inference.
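A sketch of the cache-mount idea described above, assuming the default Hugging Face cache location for root inside the container (the image name is hypothetical):

```bash
# One shared Hugging Face cache for any number of containers
docker volume create hf-cache
docker run -it -v hf-cache:/root/.cache/huggingface openchatkit
```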
Closes #40. I have added the Dockerfile and the instructions for it.