
[libshortfin] Initial implementation of LLM inference server. #181

Merged
14 commits merged into main on Sep 24, 2024

Conversation

stellaraccident
Contributor

@stellaraccident stellaraccident commented Sep 11, 2024

This is very much a first draft and needs a fair bit of work to reach its final form. As the first real user of libshortfin's core APIs, it forced us to work out a number of rough spots just to get it to this first stage.

Just a few things that need attention:

  • Sampling is just a greedy decode loop right now, but this is easy to extend because decoding is done per item as linear (async) Python.
  • Scaffolding for multi-device execution in a single process is in place, but we're still working on the models to mate up with it.
  • A lot of configuration is hardcoded.
  • The batcher is a toy. The aim is to replace it with something like sglang's cache-aware batcher and radix attention; given that, no effort was made to make the batcher more than write-once code.
  • There are many opportunities to use different shortfin fibers, workers, and executors to better balance the host-side activity. For the moment, everything runs async on a single worker.
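The per-item greedy decode described above is simple enough to sketch in plain async Python. Everything here is illustrative: `fake_decode_step` is a hypothetical stand-in for the real (async) model invocation, and the vocabulary and logits are toy values, not anything from the actual server.

```python
import asyncio

# Hypothetical stand-in for a model invocation: returns logits over a
# tiny vocabulary given the token history. In a real server this would
# be an async call into the compiled model.
async def fake_decode_step(tokens: list[int]) -> list[float]:
    vocab_size = 8
    # Deterministic toy logits: favor (last_token + 1) mod vocab_size.
    logits = [0.0] * vocab_size
    logits[(tokens[-1] + 1) % vocab_size] = 1.0
    return logits

async def greedy_decode(prompt: list[int], eos: int, max_new: int) -> list[int]:
    """Greedy sampling: always take the argmax token; stop at EOS."""
    tokens = list(prompt)
    for _ in range(max_new):
        logits = await fake_decode_step(tokens)
        next_tok = max(range(len(logits)), key=logits.__getitem__)  # argmax
        tokens.append(next_tok)
        if next_tok == eos:
            break
    return tokens

result = asyncio.run(greedy_decode([3], eos=7, max_new=10))
print(result)  # [3, 4, 5, 6, 7]
```

Because each request decodes independently as linear Python, swapping greedy argmax for top-k or temperature sampling only means changing the one `next_tok` line per request.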

@stellaraccident stellaraccident force-pushed the shortfin_llm branch 3 times, most recently from 5b0b230 to de95c51 Compare September 18, 2024 15:02
pashu123 added a commit to pashu123/sharktank that referenced this pull request Sep 19, 2024
This patch moves logging functions from
nod-ai#181
pashu123 added a commit that referenced this pull request Sep 19, 2024
This patch moves logging functions from
#181
@stellaraccident stellaraccident force-pushed the shortfin_llm branch 5 times, most recently from 359eca8 to 2748783 Compare September 20, 2024 20:16
@stellaraccident stellaraccident changed the title [libshortfin] Implement LLM inference server. [libshortfin] Initial implementation of LLM inference server. Sep 24, 2024
@stellaraccident stellaraccident marked this pull request as ready for review September 24, 2024 02:32
Contributor

@monorimet monorimet left a comment


I don't have any deep insights on this yet; I've left a few nit comments for typos, but you've already noted the areas needing the most improvement.

I see that some of this could be abstracted out and reused for the next inference server implementation, but some of it, like the batcher, might not have a very substantial base class to carve out (and we might want to resolve some of your bugaboos about the current implementation before trying to reuse it).

Anyway, this seems to cover all the points I have in my head, and there's lots for us to iterate on. Thanks.

Some class naming conventions are a bit general, but it seems like an intentional choice, and I see no problem beyond a slight readability cost -- e.g. InferencePhase, InferenceExecRequest. I doubt this would ever make much of a difference in developer experience, and it saves us long, ugly class names, but it may be worth mentioning even if it's extremely subjective.

Contributor

@rsuderman rsuderman left a comment


Overall this looks good. I noticed a few TODOs which I assume are still WIP, but the structure felt right to me. My only high-level thought was having some utility/abstraction for data transfer. The direct invocations of the IREE h2d or d2h commands felt slightly out of place (but that's just a nit).

@stellaraccident
Contributor Author

Overall this looks good. I noticed a few TODOs which I assume are still WIP, but the structure felt right to me. My only high-level thought was having some utility/abstraction for data transfer. The direct invocations of the IREE h2d or d2h commands felt slightly out of place (but that's just a nit).

Yeah, I swallowed my own objections while typing those transfers. Doing them right needs more system configuration than I have yet (how you stage transfers and allocations is system- and use-case-specific), so I did the least bad thing and wrote them out longhand for now. That's why I'm building out the system config layer today -- that's where fixes for things like this are rooted.
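For illustration, the kind of data-transfer utility being discussed might look roughly like the sketch below. All names here (`FakeDevice`, `TransferHelper`, the `h2d`/`d2h` methods) are hypothetical stand-ins, not real shortfin or IREE APIs; the point is only that the staging policy would live in one configurable place rather than at each call site.

```python
class FakeDevice:
    """Illustrative stand-in for a device; 'device memory' is a dict."""
    def __init__(self):
        self.buffers = {}

    def h2d(self, name, host_data):
        # Simulate a host-to-device copy.
        self.buffers[name] = list(host_data)

    def d2h(self, name):
        # Simulate a device-to-host copy.
        return list(self.buffers[name])


class TransferHelper:
    """Centralizes staging decisions instead of scattering raw copies.

    A real implementation would consult system configuration here to
    choose a staging strategy (pinned buffers, async copies, pooling)
    per device and use case.
    """
    def __init__(self, device):
        self.device = device

    def upload(self, name, host_data):
        self.device.h2d(name, host_data)

    def download(self, name):
        return self.device.d2h(name)


dev = FakeDevice()
xfer = TransferHelper(dev)
xfer.upload("tokens", [1, 2, 3])
roundtrip = xfer.download("tokens")
print(roundtrip)  # -> [1, 2, 3]
```

With call sites going through one helper, the "root fix" described above (driving staging behavior from the system config layer) becomes a change to the helper's constructor rather than an edit at every transfer site.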

@stellaraccident stellaraccident merged commit 61eacac into main Sep 24, 2024
7 checks passed
@stellaraccident stellaraccident deleted the shortfin_llm branch September 24, 2024 22:12