diff --git a/README.md b/README.md
index 04a5e42..f57cc2b 100644
--- a/README.md
+++ b/README.md
@@ -1,31 +1,46 @@
-# textbook
+# Textbook

-The goal of this project is to distill chatGPT coding ability into a smaller model down to 1b parameters. We focus on solving the coding exercises task and use the [HumanEval](https://github.com/openai/human-eval) benchmark to evaluate our model. While we are aware that benchmark is far from
-being perfect, we believe that it is a good starting point to prove our approach of distilling knowledge from a large model into a smaller one. Part of the idea were inspired by the [textbook](reproducing https://arxiv.org/abs/2306.11644) paper.
+The goal of this project is to distill ChatGPT's Python coding ability into a smaller model with only 1 billion parameters. Our focus is on training the smaller model to solve coding tasks described in natural language, and we use the [HumanEval](https://github.com/openai/human-eval) benchmark to evaluate our model. While we are aware that this benchmark is far from ideal, we believe that it is a good starting point to demonstrate the success of our approach to model distillation. We have drawn some inspiration from efforts to reproduce the results reported in the paper _Textbooks Are All You Need_ [(Gunasekar et al. 2023)](https://doi.org/10.48550/arXiv.2306.11644).
+This repository consists of two parts:

-The repo consist of two part:
+* Dataset Generation: The code that we used to generate a \~200 million token dataset of Python programming exercises from ChatGPT 3.5.
+* Model Fine-tuning: The code that we used to fine-tune the [Starcoder 1b model](https://github.com/bigcode-project/starcoder) using the generated dataset.

+The generated exercises dataset is composed of a diverse set of \~120k Python code exercises (\~120m total tokens) generated by ChatGPT 3.5. It follows the format of the [HumanEval benchmark](https://github.com/openai/human-eval): each training sample is split into a Python function signature with a descriptive docstring, and a solution to the exercise.
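+
+For illustration, a single training sample in this format might look like the following hypothetical exercise (it is not taken from the actual dataset):
+
+```python
+def count_even(numbers: list[int]) -> int:
+    """Return how many of the integers in `numbers` are even.
+
+    >>> count_even([1, 2, 3, 4])
+    2
+    """
+    # The solution part of the sample: everything after the docstring.
+    return sum(1 for n in numbers if n % 2 == 0)
+```
+
+The signature and docstring play the same role as a HumanEval prompt, and the function body is the completion the model learns to produce.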

-* Dataset generation: We generate a 200 Millions tokens dataset from chatGPT 3.5
-* Model finetuning: We finetune starcoder 1b on the generated dataset
-
+## Synthetic exercise creation
+Model distillation is the process of transferring some of the skilled performance of large models on specific classes of tasks to significantly smaller models. The purpose is to get performance comparable to the larger model, but at a fraction of the cost and at much greater speed. The general outline of this strategy is described (without technical implementation details) in [Textbooks Are All You Need](https://doi.org/10.48550/arXiv.2306.11644).

-## Synthetic exercise creation
+Key to the distillation process is the creation of synthetic data, generated by the larger AI model, to train the smaller model. We have applied this approach to Python programming tasks and are publishing a summary of our methods here along with the synthetic dataset.

-Distillation of LLM's describes the process of capturing parts of foundational language model in a significantly smaller model. This allows for similar performance at a fraction of the cost and at vastly quicker speed. In the “textbooks are all you need”(https://arxiv.org/abs/2306.11644) paper this approach is explained for a storytelling LLM. Sadly, the technical details of distillation are not explained. Key to distillation is the creation of the synthetic data, on which the smaller model is trained. We applied the distillation approach to a coding task and more importantly publish our approach on creating a synthetic dataset. This document explains how we created the 40M tokens of synthetic exercises.
+For fuller details and implementation code, see the [related GitHub repository](https://github.com/jina-ai/textbook).

### Diversity

-The main problem of any large synthetic dataset is its diversity. Repeatedly (~400.000x) asking a LLM for a Python exercise will result in high similarity between the results. One could filter the exercises afterwards, however, this would increase the cost as you have to create more exercises. Also, it is unclear whether this approach will create exercises that cover all areas of a topic. Therefore, a different method is required to ensure diversity.
+The main problem with model-generated synthetic data is ensuring its diversity. If we had constructed this dataset by giving ChatGPT 3.5 the same prompt several hundred thousand times, we would get many very similar, if not functionally identical, results. This would reduce the usefulness of the dataset for training. In principle, one might solve the problem by filtering the results for near duplicates, but this is a non-trivial problem, and even if it could be solved, it would be a wasteful and potentially expensive use of the larger model.

-### Knowledge tree
+And even then, we could not be sure the examples adequately covered the topic. To solve this problem, we introduced a novel scheme for systematically prompting large language models to produce diverse examples.

-How do we force a LLM to create exercises for different topics? By regarding Python knowledge as a tree structure we can create different subtopics of the broader topic. These subtopics are then used to create exercises. First we curated a list of 42 subtopics of Python. These are topics as “Data Structures” and “Sorting Algorithms”. For each of those topics we created 10 subtopics using a LLM. These subtopics are then split into 5 topics each again, leaving us with ~2000 topics. Assuming 100 tokens per exercise we would have 2000*100 = 200.000 tokens. This is a far cry from the required millions of tokens necessary for knowledge injection during fine-tuning. We therefore combine topics with each other. Each individual topic is combined with 200 other topics to create new, unique topics. In our experiments these combined topics ensure data diversity. By combining topics we can create 200.000 * 200 = 40M tokens of exercises.
+### Using a topic tree to build diverse prompts

-Another way to inject diversity is prompt engineering. By having random aspects in the prompt the LLM is primed to create different exercises. We created a list of 50 professions of which the LLM chose one randomly per exercise. For example: “Write this exercise for a baker” or “Write this exercise for a doctor”. By priming the model with different exercises, different types of exercises are created. A baker might be more associated with baking which is associated with creating objects (bread), whereas a doctor is associated with changing states of patients. Therefore different professions require different exercises. For each of the 200 combinations we randomly selected a profession. If one wanted to create more exercises, you could take the same combination and sample more professions. For example, by using 5 different professions per combination you can create 200M tokens, whilst maintaining diversity.
+We constructed a hierarchical model of subjects in Python programming, i.e. a topic tree. First, we manually identified 42 general topic areas in Python knowledge, for example, _data structures_ and _sorting algorithms_. We asked an LLM to propose 10 subtopics for each, and then for each of those 420 fine-grained topics, we asked the LLM to generate 5 even more fine-grained sub-subtopics. This resulted in roughly 2000 very fine-grained topics.
+
+We generated prompts by randomly selecting two of those roughly two thousand topics and combining them:
+
+```
+Create a code completion exercise on the intersection of {topic 1} and {topic 2}.
+```
+
+To increase randomness and diversity in the results, we also constructed a list of 40 professions, like _economist_, _engineer_, and _social worker_, and added them to the prompt:
+
+```
+Create a code completion exercise on the intersection of {topic 1} and {topic 2}.
+Write it for a {profession}.
+```
+
+In principle, there are approximately two million possible pairs of topics, and with 40 possible professions, this yields 80 million unique prompts. If the response to each prompt averages 100 tokens, this means our method can generate an 8 billion token synthetic dataset while maintaining a high degree of diversity. The dataset used here is only a small sample of the possible total.
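+
+To make the scheme concrete, here is a minimal sketch of how such prompts can be assembled. It is an illustration under simplifying assumptions rather than the exact generation code in this repository: the topic and profession lists are shortened placeholders, and in the real pipeline each generated prompt is sent to ChatGPT 3.5 to produce one exercise.
+
+```python
+import random
+
+# Placeholder stand-ins for the real lists: ~2000 fine-grained topics from the
+# topic tree and 40 professions.
+TOPICS = [
+    "singly linked lists",
+    "merge sort",
+    "regular expression groups",
+    "context managers",
+]
+PROFESSIONS = ["economist", "engineer", "social worker"]
+
+PROMPT_TEMPLATE = (
+    "Create a code completion exercise on the intersection of {topic_1} and {topic_2}.\n"
+    "Write it for a {profession}."
+)
+
+
+def make_prompt(rng: random.Random) -> str:
+    """Pair two distinct topics with a random profession to build one prompt."""
+    topic_1, topic_2 = rng.sample(TOPICS, 2)
+    profession = rng.choice(PROFESSIONS)
+    return PROMPT_TEMPLATE.format(topic_1=topic_1, topic_2=topic_2, profession=profession)
+
+
+if __name__ == "__main__":
+    rng = random.Random(42)  # fixed seed only to make the example reproducible
+    print(make_prompt(rng))
+
+    # Back-of-the-envelope scale estimate, mirroring the arithmetic above.
+    n_topics, n_professions, tokens_per_exercise = 2000, 40, 100
+    n_pairs = n_topics * (n_topics - 1) // 2       # roughly 2 million topic pairs
+    n_prompts = n_pairs * n_professions            # roughly 80 million unique prompts
+    print(n_prompts * tokens_per_exercise)         # roughly 8 billion possible tokens
+```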

## Install dependency