stewarthu/llm-notes

LLMs for working data scientists and machine learning practitioners

Large Language Models (LLMs) are changing the landscape of data science. Whether this represents a paradigm shift in the way information is produced, aggregated and consumed, or is just another passing fad, the jury is still out. In any case, AI/LLM tooling, increasingly indispensable, is becoming the accepted norm, especially among programmers and data scientists.

This collection of notes is basically a brain dump - it captures my personal learnings as I try to make sense of this exploding field and keep myself sane.

History

These are the important milestones leading up to the current state of LLMs.

Prerequisites

The field is moving so fast that you absolutely need a hacker's mentality to keep up.

  • Python/Numpy/Pandas: basic skills needed to code something up quickly. Fortunately, with tools like Copilot/ChatGPT/Replit, it's quite easy to get up to speed in this department, especially if you are a programmer to begin with. For instance, I came from a C++/R/Haskell background and made the switch from R to Python quite smoothly.

  • PyTorch: Most transformer models are written in torch. Invest some time to get yourself comfortable with both the tensor library and the building blocks for neural networks. Read a lot of library code to build a sound foundation.

  • Git and GitHub: You will clone tons of repos to experiment with, so invest some time in building your own commands to get things done quickly.

  • Huggingface: this is a must now. Not just the transformers library itself, but also peft, accelerate, etc. You will spend tons of time with HF.

  • Linux, bash, and command-line tools: Get a Mac and get comfortable with command-line tools. Trust me, it is worth your time.

  • GPUs: You can get away with CPUs for inference (GGML is coming up fast), but you will need GPUs for training models. You can build your own box with an RTX 3090 (or 4090 if you have a few extra bucks), or rent online from one of the smaller providers like vast.ai or RunPod, or from Azure/AWS if you are not paying the bills out of your own pocket. Stick with A100s if you have the budget - everything just works with A100s.
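Most of what you will meet in torch codebases reduces to a few building blocks. As a warm-up, here is scaled dot-product attention, the core operation inside every transformer, sketched in plain NumPy (a minimal illustration of the math, not production code):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V                  # weighted average of the values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Once this clicks in NumPy, the torch version (and the multi-head variant you will see in transformers code) is the same idea with batched tensors.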

Closed Foundation Models

  • OpenAI: Build stuff with OpenAI to get your feet wet, and read their cookbooks; they are really good.
  • Google: Have not really tried Bard model....
  • Anthropic: On my list, never really played with Claude/Claude2.

Open Foundation Models

LLaMA and finetuned variants

FB's release of LLaMA set off a wave of fine-tuned variants of the LLaMA 7/13/30/65B models, with some fun plays on Llama-family animal names. See the Huggingface Open LLM Leaderboard for some of the notable models.

Now with the release of Llama 2, the real competition and fun start!

  • Alpaca: This is the first such model, coming out of Stanford. It was trained on instructions generated from ChatGPT.

  • Vicuna: LMSYS is behind this project. The model was finetuned on ShareGPT data, a crowd-sourced dataset of ChatGPT conversations. It also comes with a fast inference engine - underneath is a GPU-optimized inference engine called vLLM. They also have other open-source models, like FastChat-T5.

  • WizardLM: This is from Microsoft Research. It is based on Evol-Instruct, a tree-based instruction-evolution method.

  • WizardVicuna: a combination of the Wizard and Vicuna approaches.

  • Open Assistant: dataset, RLHF fine tuning, etc.

  • QLoRA: This is a big deal for people with consumer-grade GPUs like the RTX series - you can finetune a sizeable model on a single GPU. The accompanying demo model was trained on the Open Assistant dataset.
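To see why (Q)LoRA fits on a single GPU, here is the low-rank idea sketched in NumPy: the pretrained weight stays frozen and only two small factors are trained. This is a conceptual sketch of the math, not the peft API, and the sizes here are made up for illustration:

```python
import numpy as np

d, r = 1024, 8                       # hidden size and LoRA rank (r << d)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # small trainable factor
B = np.zeros((d, r))                 # starts at zero, so the update begins as a no-op

def adapted_forward(x, alpha=16):
    # LoRA forward pass: y = x W^T + x (B A)^T * (alpha / r); only A and B train
    return x @ W.T + (x @ (B @ A).T) * (alpha / r)

full_params = W.size                 # a full finetune trains all of these
lora_params = A.size + B.size        # LoRA trains only these
print(full_params // lora_params)    # 64x fewer trainable parameters here
```

QLoRA pushes this further by also storing the frozen W in 4-bit precision, which is what makes a single RTX card enough for sizeable models.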

LLaMA alternatives (updated 7/20: these are now less attractive with the release of Llama 2....)

Datasets

Alpaca and variants:

Vicuna: Assistant clone of OpenAI ChatGPT (lmsys.org/FastChat/vLLM/Vicuna)

Wizard: DFS and BFS evolution of instructions

Open Assistant

Guanaco

Orca: Why not give system prompts?

GPT4all:

ShareGPT:

Dolly:

Others:

Frameworks and Ecosystems

  • Huggingface: transformers/peft/accelerate/bitsandbytes

  • PyTorch Lightning: A more general high-level framework on top of PyTorch. Think of it as the Keras for PyTorch. It also has two repos: one is an open-source implementation of GPT, and one is a finetuning framework for open LLMs.

  • GGML/Llama.cpp: A lot of attention here; this project will probably pave the road to running LLMs without GPUs.

  • GPT4all: Started as a thin wrapper around GGML, but it has diverged since.

  • GPTQ/AutoGPTQ: An alternative to int8 quantization (bitsandbytes).
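The basic idea behind weight quantization can be sketched in a few lines: rescale the float weights into the int8 range and round. This is only the naive absmax/round-to-nearest scheme for illustration; GPTQ itself uses a cleverer error-correcting procedure, and bitsandbytes handles outliers separately:

```python
import numpy as np

def quantize_int8(w):
    """Absmax quantization: rescale so the largest weight maps to +/-127."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

err = np.abs(w - dequantize(q, scale)).max()
print(w.nbytes // q.nbytes, err < scale)  # 4 True: 4x smaller, bounded rounding error
```

The memory math is why this matters: a 7B model at float16 needs ~14 GB, at 4 bits ~3.5 GB, which is the difference between needing an A100 and running on a laptop.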

  • FlexGen: New kid on the block; have not really looked into it yet.

  • LangChain: It's hot now - a very convenient package for interfacing with LLMs and vector stores. Here are the key concepts and abstractions in LangChain:

    • A Chain is just an LLM + a prompt template.
    • An agent is made of an LLM chain and tools.
    • An agent is usually wrapped in an AgentExecutor, which is itself a type of chain.
    • The key ingredient of an agent is the ability to plan, which is literally a method defined for each type of agent.

    My take on LC: if you want to spin up something quickly, it's a great tool to get you started. But once you've moved beyond building toy projects, you will probably need to build your own pipeline, or even your own abstractions. Treat LC as a huge cookbook - pick and choose whatever you need, especially the prompts. But keep in mind that the field is moving lightning fast; a lot of its prompts might not be necessary anymore, especially with strong models like GPT-4.
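The "a Chain is just an LLM + a prompt template" idea is worth internalizing. Here is a toy version in plain Python - a conceptual sketch, not LangChain's actual API, with a stand-in for the model call:

```python
# Toy version of a chain: an LLM plus a prompt template. fake_llm stands in
# for a real model call (OpenAI, a local Llama, ...).

def fake_llm(prompt: str) -> str:
    return f"[model answer to: {prompt}]"

class SimpleChain:
    def __init__(self, llm, template: str):
        self.llm = llm
        self.template = template  # e.g. "Answer in one sentence: {question}"

    def run(self, **kwargs) -> str:
        # fill the template, then hand the finished prompt to the model
        return self.llm(self.template.format(**kwargs))

chain = SimpleChain(fake_llm, "Answer in one sentence: {question}")
print(chain.run(question="What is a vector store?"))
```

An agent is the same pattern one level up: the LLM's output is parsed to decide which tool to call next, in a loop, until it decides to stop.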

  • LlamaIndex: It has some overlap with LangChain - it's a data framework. Same as LC, an awesome tool to get you started quickly.

    • A document is split into chunks, or nodes.
    • Chunks are wrapped into indices - those are the building blocks.
    • Index + retrieval mode => retrievers
    • retrievers + synthesizing methods + post processing => query engines
    • query engines + memory => chat engines
    • query engines + json/pydantic descriptions => tools
    • tools + LLM => agents

    I like the LlamaIndex codebase better - cleaner and better documented.
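The pipeline above can be sketched end to end in plain Python. This is purely conceptual: word overlap stands in for embedding retrieval, and string formatting stands in for LLM synthesis:

```python
# chunks -> index -> retriever -> query engine, with word overlap standing in
# for embedding similarity and string formatting standing in for LLM synthesis.

def split_into_chunks(text, size=6):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def tokenize(s):
    return {w.strip(".,?!").lower() for w in s.split()}

def retrieve(chunks, query, k=1):
    # score each chunk by word overlap with the query, keep the top k
    q = tokenize(query)
    return sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)[:k]

def query_engine(chunks, query):
    # a real query engine would synthesize an answer from the context via an LLM
    context = " ".join(retrieve(chunks, query))
    return f"[answer to '{query}' based on: {context}]"

doc = "LLaMA is a family of models from Meta. Vicuna is a finetuned LLaMA variant."
chunks = split_into_chunks(doc)
print(query_engine(chunks, "Who made Vicuna?"))
```

Swap the overlap score for embedding similarity, the chunk list for a vector store, and the format string for an LLM call, and you have the real thing.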

  • DSP: Coming out of the Stanford NLP research group. A nice programming model for working with LLMs: Demonstrate, Search and Predict. Probably not as mature as LangChain and LlamaIndex, but worth checking out.
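At a high level the three stages compose like this; a toy sketch with stand-ins for the retriever and the LLM (the names and data here are made up for illustration):

```python
# demonstrate: a few worked examples; search: retrieve supporting passages;
# predict: compose everything into the prompt an LLM would complete.

DEMOS = [("What is 2+2?", "4")]  # demonstrate: few-shot examples

CORPUS = [
    "Paris is the capital of France.",
    "The Seine flows through Paris.",
]

def search(question):
    # stand-in retriever: keep passages sharing any word with the question
    qw = set(question.lower().split())
    return [p for p in CORPUS if qw & set(p.lower().split())]

def predict(question, passages):
    # stand-in for the LLM call: build the final prompt from demos + context
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in DEMOS)
    context = " ".join(passages)
    return f"{demos}\nContext: {context}\nQ: {question}\nA:"

question = "What is the capital of France?"
print(predict(question, search(question)))
```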

  • Text generation web UI: Nice playground for experimenting with various LLMs.

  • LocalAI: a drop-in replacement for the OpenAI API.

  • Axolotl: Nice repo for finetuning.

  • FastChat: Train/Eval/Deployment pipeline.

Stay Current

  1. Twitter. Follow people with real signal. I will post my Twitter list for AI later.
  2. If you have extra time, look at Reddit boards like LocalLlama.
  3. If you really have time, join the Discord servers of the projects you are into.

From Karpathy:

  1. Weekly Papers
  2. Papers with Code
  3. Trending Github Repos

FAQs

  1. Why should I care about LLMs?
  2. What are the common use cases for LLMs?
  3. I am new to the field - where should I get started?
  4. Where do those weird names like Llama/Alpaca/Vicuna/Guanaco come from?
