
[libshortfin] Initial implementation of LLM inference server. #181

Merged
14 commits merged into main on Sep 24, 2024

Conversation

stellaraccident
Contributor

@stellaraccident stellaraccident commented Sep 11, 2024

This is very much a first draft and needs a fair bit of work to reach its final form. As the first real user of libshortfin's core APIs, it forced us to work out a number of rough spots just to get it to this first stage.

Just a few things that need attention:

  • Sampling is just a greedy decode loop right now, but this is easy to extend because decoding is done per item as linear (async) Python.
  • Scaffolding for multi-device execution in a single process is in place, but we're still working on the models to mate up with it.
  • A lot of configuration is hardcoded.
  • The batcher is a toy. The aim is to replace it with something like sglang's cache-aware batcher and radix attention; given that, no effort was made to make the batcher more than write-once code.
  • There are many opportunities to use different shortfin fibers, workers, and executors to better balance the host-side activity. For the moment, everything runs async on a single worker.
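The per-item greedy decode described above is simple enough to sketch in plain async Python. Everything here is illustrative: `fake_decode_step` is a hypothetical stand-in for the real (async) model invocation, and the vocabulary and logits are toy values, not anything from the actual server.

```python
import asyncio

# Hypothetical stand-in for a model invocation: returns logits over a
# tiny vocabulary given the token history. In a real server this would
# be an async call into the compiled model.
async def fake_decode_step(tokens: list[int]) -> list[float]:
    vocab_size = 8
    # Deterministic toy logits: favor (last_token + 1) mod vocab_size.
    logits = [0.0] * vocab_size
    logits[(tokens[-1] + 1) % vocab_size] = 1.0
    return logits

async def greedy_decode(prompt: list[int], eos: int, max_new: int) -> list[int]:
    """Greedy sampling: always take the argmax token; stop at EOS."""
    tokens = list(prompt)
    for _ in range(max_new):
        logits = await fake_decode_step(tokens)
        next_tok = max(range(len(logits)), key=logits.__getitem__)  # argmax
        tokens.append(next_tok)
        if next_tok == eos:
            break
    return tokens

result = asyncio.run(greedy_decode([3], eos=7, max_new=10))
print(result)  # [3, 4, 5, 6, 7]
```

Because each request decodes independently as linear Python, swapping greedy argmax for top-k or temperature sampling only means changing the one `next_tok` line per request.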

@stellaraccident stellaraccident force-pushed the shortfin_llm branch 3 times, most recently from 5b0b230 to de95c51 Compare September 18, 2024 15:02
pashu123 added a commit to pashu123/sharktank that referenced this pull request Sep 19, 2024
This patch moves logging functions from
nod-ai#181
pashu123 added a commit that referenced this pull request Sep 19, 2024
This patch moves logging functions from
#181
@stellaraccident stellaraccident force-pushed the shortfin_llm branch 5 times, most recently from 359eca8 to 2748783 Compare September 20, 2024 20:16
@stellaraccident stellaraccident changed the title [libshortfin] Implement LLM inference server. [libshortfin] Initial implementation of LLM inference server. Sep 24, 2024
@stellaraccident stellaraccident marked this pull request as ready for review September 24, 2024 02:32
Contributor

@monorimet monorimet left a comment


I don't have any deep insights on this yet; I've left a few nit comments for typos, but you've already noted the areas needing the most improvement.

I see that some of this could be abstracted out and reused for the next inference server implementation, but some of it, like the batcher, might not have a very substantial base class to carve out (and we might want to resolve some of your bugaboos about the current implementation before trying to reuse it).

Anyway, this seems to cover all the points I have in my head, and there's lots for us to iterate on. Thanks.

Some class naming conventions are a bit general, but it seems like an intentional choice, and I see no problem beyond a slight readability cost -- e.g. InferencePhase, InferenceExecRequest. I doubt this would ever make much of a difference in developer experience, and it saves us long, ugly class names, but it may be worth mentioning even if it's extremely subjective.

Contributor

@rsuderman rsuderman left a comment


Overall this looks good. I noticed a few TODOs which I assume are still WIP, but the structure felt right to me. My only high-level thought was having some utility/abstraction for data transfer. The direct invocations of the IREE h2d or d2h commands felt slightly out of place (but that's just a nit).

@stellaraccident
Contributor Author

Overall this looks good. I noticed a few TODOs which I assume are still WIP, but the structure felt right to me. My only high-level thought was having some utility/abstraction for data transfer. The direct invocations of the IREE h2d or d2h commands felt slightly out of place (but that's just a nit).

Yeah, I swallowed my own objections while typing those transfers. Doing them right needs more system configuration than I have yet (how you stage transfers and allocations is system- and use-case-specific), so I did the least bad thing and wrote them out longhand for now. That's why I'm building out the system config layer today -- that's where fixes for things like this are rooted.
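For illustration, the kind of data-transfer utility being discussed might look roughly like the sketch below. All names here (`FakeDevice`, `TransferHelper`, the `h2d`/`d2h` methods) are hypothetical stand-ins, not real shortfin or IREE APIs; the point is only that the staging policy would live in one configurable place rather than at each call site.

```python
class FakeDevice:
    """Illustrative stand-in for a device; 'device memory' is a dict."""
    def __init__(self):
        self.buffers = {}

    def h2d(self, name, host_data):
        # Simulate a host-to-device copy.
        self.buffers[name] = list(host_data)

    def d2h(self, name):
        # Simulate a device-to-host copy.
        return list(self.buffers[name])


class TransferHelper:
    """Centralizes staging decisions instead of scattering raw copies.

    A real implementation would consult system configuration here to
    choose a staging strategy (pinned buffers, async copies, pooling)
    per device and use case.
    """
    def __init__(self, device):
        self.device = device

    def upload(self, name, host_data):
        self.device.h2d(name, host_data)

    def download(self, name):
        return self.device.d2h(name)


dev = FakeDevice()
xfer = TransferHelper(dev)
xfer.upload("tokens", [1, 2, 3])
roundtrip = xfer.download("tokens")
print(roundtrip)  # -> [1, 2, 3]
```

With call sites going through one helper, the "root fix" described above (driving staging behavior from the system config layer) becomes a change to the helper's constructor rather than an edit at every transfer site.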

@stellaraccident stellaraccident merged commit 61eacac into main Sep 24, 2024
7 checks passed
@stellaraccident stellaraccident deleted the shortfin_llm branch September 24, 2024 22:12