Colbert AI is a Deep Learning Language Model that generates text in the style of Stephen Colbert's famous monologues.
We took a state-of-the-art deep learning language model, OpenAI's GPT-2, and fine-tuned it on text from YouTube video captions.
- Transformers
- GPT-2
- PyTorch
- youtube_dl
- The playlist is specified by `PLAYLIST_URL` in `download.py`
- The `youtube_dl` module is used to download the captions of each video in the playlist, saving all of them in the `data/captions` folder
- We kept only the text where the speaker was Stephen Colbert
- Individual captions were merged into a single file, separated by an end-of-text marker
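
The merge step above can be sketched as follows. This is a minimal illustration, not the code in `caption_processing.py`: the function name `merge_captions` is hypothetical, and the marker is assumed to be GPT-2's standard `<|endoftext|>` token, which this README does not state explicitly.

```python
import re

# Assumption: the marker is GPT-2's standard end-of-text token.
END_OF_TEXT = "<|endoftext|>"


def merge_captions(captions, marker=END_OF_TEXT):
    """Join per-video caption texts into one corpus string.

    Whitespace inside each caption is collapsed, empty captions are
    dropped, and the marker is placed on its own line between captions.
    """
    cleaned = [re.sub(r"\s+", " ", text).strip() for text in captions]
    cleaned = [text for text in cleaned if text]
    return f"\n{marker}\n".join(cleaned)


if __name__ == "__main__":
    print(merge_captions(["Hello  there.\n", "", "Good evening."]))
```

Placing the marker between documents lets the model learn where one monologue ends and the next begins.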
- Clone this repository, using: `git clone https://github.com/NextTechLabAP/Colbert-AI.git`
- Install all requirements in `requirements.txt` using: `pip install -r requirements.txt`
- Run `python3 download.py` to download the captions
- Run `python3 caption_processing.py` to process the captions
- Open the `Colbert-AI-v2.ipynb` Jupyter Notebook
- Change the path to `captions.txt`
- Run all cells
- Clone this repository, using: `git clone https://github.com/NextTechLabAP/Colbert-AI.git`
- Open the `Colbert-AI-v2.ipynb` Jupyter Notebook
- Change the path from `captions.txt` to the custom text corpus file
- Run all cells
- GPT-2 Small (124M Model)
- GPT-2 Medium (345M Model)
- GPT-2 Large (774M Model)
- GPT-2 Extra Large (1558M Model)
We used GPT-2 Medium because we wanted a lighter model that we could fine-tune further.
- A function first selects the top N tokens from the probability list and then picks the next token based on the distribution over those N tokens
- At each prediction step, the GPT-2 model needs to know all of the previous sequence elements to predict the next one. Below is a function that tokenizes the starting input text and then, in a loop, predicts one new token at each step and appends it to the sequence, which is fed into the model in the next step. At the end, the token list is decoded back into text.
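
The two ideas above, top-N sampling and the token-by-token generation loop, can be sketched in plain Python. This is an illustrative sketch, not the notebook's implementation: `choose_from_top`, `generate_ids`, the toy vocabulary, and the toy model are all hypothetical stand-ins so the example runs without downloading GPT-2. In the real setup the probabilities would come from the model's output over its full vocabulary.

```python
import random
from typing import Callable, List, Sequence


def choose_from_top(probs: Sequence[float], n: int = 5) -> int:
    """Sample one token id from the n most probable entries of probs."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top_ids = ranked[:n]
    weights = [probs[i] for i in top_ids]
    # random.choices treats weights as relative, so no renormalization is needed.
    return random.choices(top_ids, weights=weights, k=1)[0]


def generate_ids(next_token_probs: Callable[[List[int]], Sequence[float]],
                 prompt_ids: Sequence[int],
                 max_new_tokens: int = 20,
                 n: int = 5) -> List[int]:
    """Extend prompt_ids one token at a time."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # The model sees the whole sequence so far at every step.
        probs = next_token_probs(ids)
        ids.append(choose_from_top(probs, n))
    return ids


# Toy stand-ins (hypothetical): a five-word vocabulary and a "model"
# that strongly favors the word following the last one.
VOCAB = ["artificial", "intelligence", "is", "not", "funny"]


def toy_next_token_probs(ids: List[int]) -> List[float]:
    favored = (ids[-1] + 1) % len(VOCAB)
    rest = 0.2 / (len(VOCAB) - 1)
    return [0.8 if i == favored else rest for i in range(len(VOCAB))]


def toy_generate_text(prompt: str, max_new_tokens: int = 4, n: int = 1) -> str:
    """Tokenize the prompt, run the loop, and decode back to text."""
    prompt_ids = [VOCAB.index(word) for word in prompt.lower().split()]
    ids = generate_ids(toy_next_token_probs, prompt_ids, max_new_tokens, n)
    return " ".join(VOCAB[i] for i in ids)


if __name__ == "__main__":
    print(toy_generate_text("artificial intelligence is", max_new_tokens=2))
```

Restricting sampling to the top N tokens keeps some randomness in the output while avoiding the long tail of unlikely tokens; with `n=1` the loop reduces to greedy decoding.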
Text can be generated using `generate_text`. One of the text samples generated using the prompt "Artificial Intelligence is ":
- Artificial general intelligence is the most likely future of the human race; it's a science which is not just possible but inevitable."
- The dataset is preprocessed and prepared in the `Text_Corpus` class.
- Variable hyperparameters:
BATCH_SIZE = (1)
EPOCHS = (30)
LEARNING_RATE = (1e-5)
WARMUP_STEPS = (10000)
MAX_SEQ_LEN = (550)
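
To illustrate how `LEARNING_RATE` and `WARMUP_STEPS` might interact, here is a minimal sketch of a warmup schedule. The exact scheduler is not stated in this README, so the linear ramp followed by a constant rate is an assumption, and `warmup_lr` is a hypothetical helper rather than a function from the notebook.

```python
LEARNING_RATE = 1e-5
WARMUP_STEPS = 10000


def warmup_lr(step, base_lr=LEARNING_RATE, warmup_steps=WARMUP_STEPS):
    """Learning rate at a given optimizer step.

    Assumed schedule: ramp linearly from 0 to base_lr over warmup_steps,
    then hold base_lr constant.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, step / warmup_steps)


if __name__ == "__main__":
    for step in (0, 2500, 5000, 10000, 20000):
        print(step, warmup_lr(step))
```

A gradual ramp like this is commonly used when fine-tuning pretrained models so that the first updates do not disturb the pretrained weights too abruptly.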
We trained the model, saving the model weights after each epoch, and then generated text samples from the saved weights.
- Now, there are some people out there who think trump's a bad person. For instance, this weekend, I watched the presidential candidate's first candidate round-up, and he was named "The man who can't get anything he wants to get right." ( cheers and applause ) that's a good quality. That's a good quality, because the only person who can't get anything right is Donald Trump. ( laughter ) and I'm not sure he's read the new book, "The man who can't get anything wrong."
- This is a big day for the president of the united states. Trump is about to be released from impala. (laughter) (applause) and this is huge news because this is a big week for him because the court has decided that he can no longer use the n-word, because, in a letter to his staff, the president said, "If I didn't use the n-word, then why are all the other white house staff members calling me a cuck?!" (laughter) (applause) (cheers and applause) (piano riff) and trump's not the only person who has been in jail for the "N-word." last week, Austin turns out to be a founder of "N-god," which was also the name of a movie. (cheers and applause) and now trump is going to have a new "N-god." (laughter) and, of course, the "N-god"
Mentions: