CPU LLM Integration Test #373

Merged · merged 18 commits into nod-ai:main on Nov 5, 2024

Conversation

stbaione
Contributor

  • Adds integration test for CPU LLM Server (will make sure all CIs are passing before merge)
  • Small fix in shortfin/python/shortfin_apps/llm/components/service.py to re-enable llm serving after recent changes

@stbaione stbaione self-assigned this Oct 30, 2024
@renxida
Contributor

renxida commented Oct 30, 2024

https://github.com/nod-ai/SHARK-Platform/blob/fb4be2774500a6773e0db71ead7a039a0ca6f83e/.github/workflows/ci-sharktank.yml

Looks like we should combine the sharktank and shortfin CI YAMLs into a new environment for integration testing.

@ScottTodd need your opinion on this.

Maybe also make it nightly, because it is pretty slow to run.

But I kinda want to make sure nobody pushes something that breaks shortfin serving, and maybe we could make it fast enough by caching the HF downloads and tokenizer conversions.
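One way to keep a per-commit run fast enough would be to reuse the Hugging Face cache between runs; a minimal sketch below, assuming the test fetches its model files via `huggingface_hub` (the repo id, filename, and cache path are placeholders, not values from this PR):

```python
# Hypothetical sketch: keep Hugging Face downloads in a fixed cache directory so a
# CI cache step (or a persistent runner disk) can reuse them across runs.
# The cache path below is a placeholder, not the PR's actual configuration.
from pathlib import Path

from huggingface_hub import hf_hub_download

HF_CACHE = Path.home() / ".cache" / "shark-ci-hf"  # directory a CI cache step would restore


def fetch_model_file(repo_id: str, filename: str) -> str:
    # hf_hub_download only hits the network when the file is missing from cache_dir,
    # so warm runs skip the slow download entirely.
    return hf_hub_download(repo_id=repo_id, filename=filename, cache_dir=str(HF_CACHE))
```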

@stbaione
Contributor Author

> https://github.com/nod-ai/SHARK-Platform/blob/fb4be2774500a6773e0db71ead7a039a0ca6f83e/.github/workflows/ci-sharktank.yml
>
> Looks like we should combine the sharktank and shortfin CI YAMLs into a new environment for integration testing.
>
> @ScottTodd need your opinion on this.

Reached out to @ScottTodd this morning. I'm currently working on setting up a workflow file.

@renxida
Contributor

renxida commented Oct 30, 2024

Sounds good! Lmk when you set that up -- I also have tests that need to go in the same bucket.

@stbaione
Contributor Author

@ScottTodd Thanks for the feedback! Working on requested changes and setting up CI workflow

stbaione added a commit that referenced this pull request Oct 30, 2024
Small patch reflecting some recent changes in `sf.Program` and
`sf.ProgramFunction`.

This was originally included as part of this PR, which adds an integration
test for shortfin LLM serving:
#373

But I'm splitting it out, since that may take a little more time to make
adjustments/add the workflow file.

Without it, you get the following error when trying to launch the
server:

```text
[2024-10-30 11:59:09.939] [info] [manager.py:40] System manager command processor stopped
[2024-10-30 11:59:09.991] [error] [on.py:121] Traceback (most recent call last):
  File "/home/amd/stephen/repos/forks/SHARK-Platform/.venv/lib/python3.12/site-packages/starlette/routing.py", line 693, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/home/amd/.pyenv/versions/3.12.5/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/amd/stephen/repos/forks/SHARK-Platform/.venv/lib/python3.12/site-packages/shortfin_apps/llm/server.py", line 42, in lifespan
    service.start()
  File "/home/amd/stephen/repos/forks/SHARK-Platform/.venv/lib/python3.12/site-packages/shortfin_apps/llm/components/service.py", line 69, in start
    self.inference_program = sf.Program(
                             ^^^^^^^^^^^
TypeError: __new__(): incompatible function arguments. The following argument types are supported:
    1. __new__(cls: object, modules: collections.abc.Sequence[_shortfin_default.lib.local.ProgramModule], *, devices: collections.abc.Sequence[_shortfin_default.lib.local.Device], trace_execution: bool = False, isolation: _shortfin_default.lib.local.ProgramIsolation = ProgramIsolation.PER_FIBER) -> _shortfin_default.lib.local.Program

Invoked with types: nanobind.nb_type_0, kwargs = { modules: list, fiber: _shortfin_default.lib.local.Fiber, trace_execution: bool }

[2024-10-30 11:59:09.991] [error] [on.py:59] Application startup failed. Exiting.
```

With it, you're able to start the server, send requests, and receive responses.
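For reference, the shape of the fix implied by that error: `sf.Program` now takes a `devices=` keyword (plus an optional `isolation=`) instead of the old `fiber=` keyword. A minimal sketch, with the surrounding attribute names assumed rather than taken from the actual patch:

```python
# Minimal sketch of the change the error message implies (not the exact patch).
# `self.inference_modules`, `devices`, and `self.trace_execution` are assumed names.
self.inference_program = sf.Program(
    self.inference_modules,                    # modules, positional as before
    devices=devices,                           # replaces the removed fiber= keyword
    trace_execution=self.trace_execution,
    isolation=sf.ProgramIsolation.PER_FIBER,   # default named in the new signature
)
```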
@stbaione
Contributor Author

Pushing in pieces to keep better track of feedback.

@stbaione stbaione changed the title from "Fix LLM Service + CPU LLM Integration Test" to "CPU LLM Integration Test" Oct 30, 2024
@stbaione stbaione marked this pull request as ready for review October 31, 2024 21:48
@stbaione
Contributor Author

stbaione commented Nov 4, 2024

The `CI - shortfin - ASan` workflow is failing on the integration test, with the cause being a leak in the sentencepiece module:

```text
==5818==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x562fe591c1a3 in malloc (/home/runner/work/SHARK-Platform/SHARK-Platform/pyenv/versions/3.12.3/bin/python3.12+0xc71a3) (BuildId: 967a1d6b86eb9d5315fb009c5d4c54dd3a71a6cb)
    #1 0x7f9854031cd8 in _PyObject_New /tmp/python-build.20240924192752.4394/Python-3.12.3/Objects/object.c:319:33
    #2 0x7f983ba6f7f4  (/home/runner/work/SHARK-Platform/SHARK-Platform/pyenv/versions/3.12.3/lib/python3.12/site-packages/sentencepiece/_sentencepiece.cpython-312-x86_64-linux-gnu.so+0x6f7f4) (BuildId: 27ff3b63261f7a54f28902f49e11139b4d3b711f)
```

This module had to be added for the tokenizer to work in the integration test:

```text
E           ImportError:
E           LlamaTokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
E           installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
E           that match your environment. Please note that you may need to restart your runtime after installation.
```

The only test I see that uses AutoTokenizer is the CPU LLM integration test, which lives outside of the shortfin tests. The same tests pass ASan prior to sentencepiece being installed, which makes me think something under the hood at the build/installation layer is causing the memory leaks.
I wasn't sure how to proceed with this. I was thinking of filing an issue with sentencepiece and ignoring this specific failure in CI, if that's possible.
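If suppressing it is the route we take, LeakSanitizer does support a suppressions file; a minimal sketch below, where the file path and pattern are assumptions rather than anything already in the workflow:

```bash
# Hypothetical LSan suppressions file (e.g. shortfin/build_tools/lsan_suppressions.txt)
# containing a single pattern that matches frames from the sentencepiece extension:
#
#   leak:sentencepiece
#
# The ASan job would then point LeakSanitizer at it before running pytest:
export LSAN_OPTIONS=suppressions=shortfin/build_tools/lsan_suppressions.txt
pytest -v shortfin/tests/
```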

@stbaione stbaione requested a review from renxida November 4, 2024 18:38
Member

@ScottTodd ScottTodd left a comment

Nice! Mostly LGTM. A few comments about how this could continue to evolve.

Member

@ScottTodd ScottTodd left a comment

LGTM once we decide on which triggers (pull_request, push, schedule) are appropriate.

Comment on lines 9 to +14
```yaml
on:
  workflow_dispatch:
  schedule:
    # Weekdays at 13:00 UTC = 05:00 PST / 06:00 PDT.
    - cron: "5 4 * * 1-5"
  pull_request:
  push:
    branches:
      - main
```
Member

@saienduri is nodai-amdgpu-mi250-x86-64 available enough for the pull_request trigger here? Runners matching that label are currently busy running on-demand jobs in E2ESHARK Test Suite: https://github.com/nod-ai/SHARK-TestSuite/actions/runs/11673948346/job/32505667904

This job itself takes ~5 minutes today, but we shouldn't be queueing for 1h+.

Contributor

Ah, the job itself is a nightly, but maybe it's time to allocate an mi250 runner just for SHARK-Platform.

Contributor Author

Sounds good and makes sense as we scale these integration tests in the future. What needs to happen for us to enable that?

Add integration tests for llm server application on cpu

  • Moved `cpu_llm_server_test.py` to top-level testing directory
  • Added `iree-llvmcpu-target-cpu=host` in `cpu_settings`
  • Skip if transformers is not installed
  • Add `available_port` fixture to find an available port instead of hardcoding (see the sketch after this list)
  • Use `/health` route to check for server availability
  • Add `ci-shark-and-shortfin.yml` workflow
  • Remove `llama_export_compile_serve.sh` script
  • Remove `test.yml` workflow file
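A rough sketch of what the `available_port` fixture and the `/health` polling could look like (assumed names and illustrative timeouts; the actual `cpu_llm_server_test.py` likely differs):

```python
# Hypothetical sketch of the fixture and health check described above; not the
# actual test code from this PR.
import socket
import time

import pytest
import requests


@pytest.fixture
def available_port() -> int:
    # Bind to port 0 so the OS assigns a free port, then hand that number to the server.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def wait_for_server(port: int, timeout: float = 30.0) -> None:
    # Poll the /health route until the server reports ready or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"http://localhost:{port}/health", timeout=1).status_code == 200:
                return
        except requests.exceptions.ConnectionError:
            pass
        time.sleep(1)
    raise TimeoutError(f"Server on port {port} never became healthy")
```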
@stbaione stbaione merged commit 3dcca1f into nod-ai:main Nov 5, 2024
12 checks passed
@ScottTodd
Member

Something about this new test started failing overnight. The test might be flaky, or it might not be hermetic enough (e.g. by downloading dependencies that aren't pinned to specific versions).

https://github.com/nod-ai/SHARK-Platform/actions/workflows/ci-shark-platform.yml?query=branch%3Amain

The workflow initially succeeded at this run/commit, but then failed on a retry: https://github.com/nod-ai/SHARK-Platform/actions/runs/11704581609

@stbaione
Contributor Author

stbaione commented Nov 6, 2024

> Something about this new test started failing overnight. The test might be flaky, or it might not be hermetic enough (e.g. by downloading dependencies that aren't pinned to specific versions).
>
> https://github.com/nod-ai/SHARK-Platform/actions/workflows/ci-shark-platform.yml?query=branch%3Amain
>
> The workflow initially succeeded at this run/commit, but then failed on a retry: https://github.com/nod-ai/SHARK-Platform/actions/runs/11704581609

@ScottTodd

Yeah, for some reason the LLM Server started returning faulty responses:

```text
INFO     cpu_llm_server_test:cpu_llm_server_test.py:46 Prompt text:
INFO     cpu_llm_server_test:cpu_llm_server_test.py:47 1 2 3 4 5
INFO     cpu_llm_server_test:cpu_llm_server_test.py:50 Generate endpoint status code: 200
INFO     cpu_llm_server_test:cpu_llm_server_test.py:52 Generated text:
INFO     cpu_llm_server_test:cpu_llm_server_test.py:84 6000 1 10000 10000 1000
```

That caused an assertion failure. I'm seeing the exact same output for a failure here: https://github.com/nod-ai/SHARK-Platform/actions/runs/11706659706/job/32604263601?pr=435

The action does not pin IREE to a specific version, so my initial thought is that may be the culprit:

```bash
pip install --no-compile -f https://iree.dev/pip-release-links.html --src deps \
  -e "git+https://github.com/iree-org/iree-turbine.git#egg=iree-turbine"
```

Others specify an installation candidate:

```yaml
- name: Checkout IREE repo
  uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
  with:
    repository: iree-org/iree
    path: ${{ env.IREE_REPO_DIR }}
    submodules: false
    ref: candidate-20241104.1068
```

In a way this does highlight that there may be a bug between shortfin and the latest IREE, but it shouldn't block development. I followed the README's installation instructions for IREE, rather than pinning a specific candidate, precisely to catch stuff like this, but that's kind of annoying from a developer perspective.

I'm thinking we pin it to a specific version and we'll catch these kinds of issues when we bump IREE.
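For reference, a sketch of what pinning could look like; the package names and version string here are placeholders, not a verified candidate:

```bash
# Hypothetical pin of IREE to a known-good release instead of pulling the latest
# nightly. The exact package names (iree-compiler / iree-runtime vs. the iree-base-*
# variants) and the version string are assumptions; use whatever the chosen
# candidate actually publishes on https://iree.dev/pip-release-links.html.
pip install --no-compile -f https://iree.dev/pip-release-links.html \
    "iree-compiler==20241104.1068" \
    "iree-runtime==20241104.1068"
```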

@ScottTodd ScottTodd mentioned this pull request Nov 12, 2024