Large Language Models (LLMs) are changing the landscape of data science. Whether this represents a paradigm shift in the way information is produced, aggregated, and consumed, or is just another passing fad, the jury is still out. In any case, AI/LLMs as a tool - increasingly an indispensable one - are becoming the accepted norm, especially among programmers and data scientists.
This collection of notes is basically a brain dump for me - it captures my personal learnings as I try to make sense of this exploding field and keep myself sane.
- History
- Closed Foundation Models
- Open Foundation Models
- Datasets
- Frameworks and Ecosystems
- FAQs
- My Cookbook
These are the important milestones leading up to the current state of LLMs.
- Transformer:
- The OG paper: Attention Is All You Need
- A detailed anatomy of the transformer: Transformer from scratch
- The Python implementation of the original transformer: The Annotated Transformer
- The Zoo of LLMs
- An overview and history of LLMs: A nice review paper of LLMs.
- List of transformers: A GitHub repo covering all the transformer-based models, not just LLMs.
- A catalog of transformer models: This started as a blog post and was later turned into a nice paper.
- GPT-style decoder-only models
- BERT-style encoder-only models
- T5-style encoder-decoder models
- "Making it bigger" - scaling and emergent properties
- Scaling Law: Scaling Laws for Neural Language Models
- Switch Transformers: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Chinchilla: An empirical analysis of compute-optimal large language model training
- Emergent Abilities: Emergent Abilities of Large Language Models
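For reference, the parametric loss that the Chinchilla paper fits has the form below (the fitted constants are quoted from memory, so treat them as approximate):

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\; \alpha \approx 0.34,\; \beta \approx 0.28$$

where $N$ is the parameter count and $D$ is the number of training tokens; the practical takeaway is the roughly 20-tokens-per-parameter rule of thumb for compute-optimal training.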
- Instruction finetuning and alignment
- The birth of ChatGPT: the Cambrian explosion started here (Nov 2022)
- The race to catch up with ChatGPT in the open source community
The field is moving so fast that you absolutely need a hacker's mentality.
- Python/Numpy/Pandas: basic skills needed to code something up quickly. Fortunately, with tools like Copilot/ChatGPT/Replit, it's quite easy to get up to speed in this department, especially if you are a programmer to begin with. For instance, I came from a C++/R/Haskell background and made the switch from R to Python quite smoothly.
- PyTorch: Most transformer models are written in torch. Invest some time to get comfortable with both the tensor library and the building blocks for neural networks. Read a lot of library code to build a sound foundation. (A minimal sketch follows this list.)
- Git and GitHub: You will clone tons of repos to experiment with, so invest some time in building your own commands to get things done quickly.
- Huggingface: this is a must now. Not just the `transformers` library itself, but also `peft`, `accelerate`, etc. You will spend tons of time with HF.
- Linux, bash, and command line tools: Get a Mac and get comfortable with command line tools. Trust me, it is worth your time.
- GPUs: You can get away with CPUs for inference (GGML is really coming up fast), but you will have to use GPUs for training models. You can build your own box with an RTX 3090 (or 4090 if you have a few extra bucks), or rent online from one of the smaller providers: vast.ai, RunPod, or Azure/AWS if you are not paying the bills out of your own pocket. Stick with A100s if you have the budget - everything just works with A100s.
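As promised in the PyTorch item above, here is a minimal sketch of the two layers you touch constantly - raw tensors and `nn.Module` building blocks (the tiny feed-forward block and its sizes are made up purely for illustration):

```python
import torch
import torch.nn as nn

# Raw tensor ops: a batch of 4 "token embeddings" of size 8
x = torch.randn(4, 8)

# A tiny feed-forward block, the kind of building block transformers are made of
class FFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

ffn = FFN(d_model=8, d_hidden=32)
print(ffn(x).shape)  # torch.Size([4, 8])
```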
- OpenAI: Build stuff with OpenAI to get your feet wet, and read their cookbooks - they are really good. (A minimal API-call sketch follows this list.)
- Google: Have not really tried the Bard model yet.
- Anthropic: On my list; have not really played with Claude/Claude 2 yet.
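As referenced in the OpenAI item above, a minimal chat-completion sketch using the `openai` Python package as it looked around mid-2023 (pre-1.0 API); the model name and prompt are just placeholders:

```python
import openai

openai.api_key = "sk-..."  # your API key

# ChatCompletion-style call; gpt-3.5-turbo is just an example model
resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain RLHF in one sentence."},
    ],
    temperature=0.2,
)
print(resp["choices"][0]["message"]["content"])
```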
FB's release of LLaMA set off a wave of fine-tuned variants of the LLaMA 7/13/30/65B models, with plenty of fun name play on the llama family. See the Huggingface Open LLM Leaderboard for some of the notable models.
Now, with the release of Llama 2, the real competition and fun starts!
- Alpaca: This is the first such model, coming out of Stanford. It was trained on instructions generated from ChatGPT.
- Vicuna: LMSYS is behind this project. The model was finetuned on ShareGPT data, a crowd-sourced dataset collected via ChatGPT. It also comes with a fast inference engine - underneath it is a GPU-optimized inference engine called vLLM. They have other open source models as well, such as a T5-based one (FastChat-T5).
- WizardLM: This is from Microsoft Research. It is based on Evol-Instruct, a tree-based approach to evolving instructions.
- WizardVicuna: a combo of Wizard and Vicuna.
- Open Assistant: dataset, RLHF fine tuning, etc.
- QLoRA: This is a big deal for people with consumer-grade GPUs like the RTX series: you can fine-tune a sizeable model on a single GPU. It was trained with the Open Assistant dataset.
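A minimal sketch of the QLoRA-style loading step with `transformers` + `bitsandbytes` (4-bit NF4 quantization; the checkpoint name is a placeholder, and in practice you would still attach LoRA adapters via `peft` on top):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization config, the core trick behind QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",   # placeholder; any causal LM checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```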
LLaMA alternatives (updated 7/20; these are now less attractive with the release of Llama 2):
- Falcon
- OpenLLaMA
- TogetherComputer - RedPajama
- MosaicML
- StabilityAI
- BigCode Project/HuggingFace
- Replit
Alpaca and variants:
- Original Alpaca Dataset
- AlpacaDataCleaned
- Alpaca Chat
- Alpaca Chain of Thought: A collection of datasets
Vicuna: Assistant clone of OAI ChatGPT (lmsys.org/FastChat/vLLM/Vicuna)
Wizard: DFS and BFS evolution of instructions
Open Assistant
Guanaco
Orca: Why not give system prompts?
GPT4all:
ShareGPT:
Dolly:
Others:
- phi-1, Textbooks Are All You Need
- WebGLM-qa, Grounded QA
- LIMA dataset: less is more
- h2ogpt-fortune2000
- Google FLAN
- Huggingface: transformers/peft/accelerate/bitsandbytes
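A minimal sketch of how `peft` wraps a base model with a LoRA adapter (the checkpoint, rank, and target modules here are illustrative; target module names vary per architecture):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # small example checkpoint

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],   # attention projection in GPT-2; varies per model
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of weights are trainable
```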
- PyTorch Lightning: A more general high-level framework on top of PyTorch. Think of it as the Keras for PyTorch. It also has two repos: one is an open source implementation of GPT, and the other is a finetuning framework for open LLMs.
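A minimal sketch of the LightningModule shape, just to show what the framework handles for you (toy data and model, purely illustrative):

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class ToyRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Random toy data; Lightning handles the training loop, devices, logging, etc.
data = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
trainer = pl.Trainer(max_epochs=1, accelerator="auto")
trainer.fit(ToyRegressor(), DataLoader(data, batch_size=32))
```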
- GGML/Llama.cpp: A lot of attention here; this project will probably pave the road for LLMs without GPUs.
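A minimal sketch of CPU inference over a quantized model, assuming the `llama-cpp-python` bindings and a locally downloaded GGML model file (the path is a placeholder):

```python
from llama_cpp import Llama

# Path points at a quantized GGML model file you downloaded yourself
llm = Llama(model_path="./models/llama-7b.ggmlv3.q4_0.bin", n_ctx=2048)

out = llm("Q: Name three llama-themed model names.\nA:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```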
- GPT4all: Started as a thin wrapper around GGML, but it has diverged since.
- GPTQ/AutoGPTQ: An alternative to Int8 quantization (`bitsandbytes`).
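For comparison, the Int8 route through `bitsandbytes` is just a flag on `from_pretrained` (the checkpoint name is a placeholder), whereas GPTQ quantizes the weights offline and ships a pre-quantized checkpoint:

```python
from transformers import AutoModelForCausalLM

# Loads the weights in 8-bit via bitsandbytes; needs a CUDA GPU
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",   # placeholder checkpoint
    load_in_8bit=True,
    device_map="auto",
)
```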
- FlexGen: New kid on the block; have not really looked into it.
- LangChain: It's hot right now - a very convenient package for interfacing with LLMs and vector stores. Here are the key concepts and abstractions in LangChain:
- A `Chain` is just an LLM + a prompt template.
- An agent is made of an LLM chain and tools.
- An agent is usually wrapped in an Agent Executor, which itself is a type of chain.
- The key ingredient of an agent is the ability to `plan`, which literally is a method defined for each type of agent.
My take on LC is that if you want to spin up something quickly, it's a great tool to get you started. But once you've moved beyond building toy projects, you will probably need to build your own pipeline, or even your own abstractions. Treat LC as a huge cookbook: pick and choose whatever you need, especially the prompts. But keep in mind the field is moving lightning fast, and a lot of the prompts might not be necessary now, especially with strong models like GPT-4.
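A minimal sketch of the `Chain` = LLM + prompt template idea, using the LangChain API roughly as it looked in mid-2023 (assumes an OpenAI key in the environment; the prompt is just a toy):

```python
from langchain import LLMChain, PromptTemplate
from langchain.llms import OpenAI

prompt = PromptTemplate(
    input_variables=["animal"],
    template="Suggest a fun LLM project name involving a {animal}.",
)

# Chain = LLM + prompt template
chain = LLMChain(llm=OpenAI(temperature=0.7), prompt=prompt)
print(chain.run(animal="vicuna"))
```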
- LlamaIndex: It has some overlap with LangChain - it's a data framework. Same as LC, an awesome tool to get you started quickly.
- A document is split into chunks, or nodes.
- Chunks are wrapped into indices - those are the building blocks.
- Index + retrieval mode => retrievers
- retrievers + synthesizing methods + post processing => query engines
- query engines + memory => chat engines
- query engines + json/pydantic descriptions => tools
- tools + LLM => agents
I like the LlamaIndex codebase better - cleaner and better documented.
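A minimal sketch of the documents => nodes => index => query engine pipeline, using the LlamaIndex API roughly as of mid-2023 (the directory path is a placeholder, and OpenAI defaults are assumed for embeddings and the LLM):

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader

# Load documents, split them into nodes, and build a vector index
documents = SimpleDirectoryReader("./my_docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Index + retrieval mode => retriever; retriever + synthesis => query engine
query_engine = index.as_query_engine()
print(query_engine.query("What are these notes about?"))
```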
- DSP: Coming out of the Stanford NLP research group. A nice programming model for working with LLMs: demonstrate, search, and predict. Probably not as mature as LangChain and LlamaIndex, but worth checking out.
- Text generation web UI: A nice playground for experimenting with various LLMs.
- LocalAI: a drop-in replacement for OpenAI.
- Axolotl: A nice repo for finetuning.
- FastChat: Train/Eval/Deployment pipeline.
- Twitter: Follow people with real signal. I will post my Twitter list for AI later.
- If you have extra time, look at Reddit boards like LocalLlama.
- If you really have time, go to the Discord servers of the projects you are into.
From Karpathy:
- Why should I care about LLMs?
- What are the common use cases for LLMs?
- I am new to the field, where should I get started?
- Where do those weird names like llama/alpaca/vicuna/guanaco come from?