
ci: bump TF to 2.18, PT to 2.5 #4228

Draft: njzjz wants to merge 5 commits into devel

Conversation

njzjz (Member) commented Oct 17, 2024

Summary by CodeRabbit

  • New Features

    • Enhanced dependency management for CUDA and Python workflows.
    • Introduced new jobs for better organization of test duration handling.
  • Bug Fixes

    • Updated TensorFlow and Torch versions for improved compatibility and performance.
  • Documentation

    • Adjusted testing commands and linting configurations for clarity and compliance.
  • Chores

    • Streamlined caching mechanisms to optimize test duration tracking.

@njzjz added the Test CUDA label (which triggers the test CUDA workflow) Oct 17, 2024
@github-actions bot removed the Test CUDA label Oct 17, 2024
Signed-off-by: Jinzhe Zeng <[email protected]>
@njzjz changed the title from "ci(cuda): bump CUDA to 12.6, TF to 2.18, PT to 2.5" to "ci: bump CUDA to 12.6, TF to 2.18, PT to 2.5" Oct 17, 2024
coderabbitai bot (Contributor) commented Oct 17, 2024

📝 Walkthrough

The pull request updates the workflow configurations for testing CUDA and Python in the .github/workflows/test_cuda.yml and .github/workflows/test_python.yml files, respectively. The CUDA workflow changes the Docker image to nvidia/cuda:12.6-devel-ubuntu24.04, modifies TensorFlow and Torch versions, updates the libtorch download link, and skips a CUDA installation command. The Python workflow enhances dependency management, introduces new jobs, and simplifies installation commands. Additionally, the pyproject.toml file updates dependency versions, testing commands, and linting configurations.

Changes

  • .github/workflows/test_cuda.yml: Updated the Docker image to nvidia/cuda:12.6-devel-ubuntu24.04, modified the TensorFlow and Torch versions, updated the libtorch download link, and skipped the CUDA 12.3 installation command.
  • .github/workflows/test_python.yml: Enhanced the installation process, ignored specific branches, added concurrency settings, simplified the torch installation, and introduced new jobs for test duration management.
  • pyproject.toml: Updated PYTORCH_VERSION to 2.5.0, added TENSORFLOW_VERSION, modified test commands for Linux and Windows, and adjusted linting configurations.

Possibly related PRs

  • ci: speed up Python test  #3776: The changes in the .github/workflows/test_python.yml file involve modifications to the installation of the torch package, which is also updated in the main PR to a new version.
  • ci: fix test-python test_durations and its caches #3820: This PR also modifies the .github/workflows/test_python.yml file, which is relevant as it deals with caching and test durations that may be impacted by the changes in package versions in the main PR.
  • ci: pin ubuntu to 22.04 #4213: Although primarily focused on the CI environment, this PR's changes to the Ubuntu version may indirectly relate to the CUDA updates in the main PR, as compatibility with CUDA versions can be affected by the operating system environment.

Suggested reviewers

  • wanghan-iapcm

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed between commits 8b895ea and f116ebb.

📒 Files selected for processing (1)
  • .github/workflows/test_cuda.yml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/test_cuda.yml


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
.github/workflows/test_cuda.yml (3)

50-50: LGTM: TensorFlow and PyTorch versions updated

The update to TensorFlow 2.18.0rc2 and PyTorch 2.5.0 aligns with the PR objectives. The use of ~= for version specification is a good practice.

Consider using more specific version ranges to avoid potential issues with future releases:

-    - run: source/install/uv_with_retry.sh pip install --system "tensorflow~=2.18.0rc2" "torch~=2.5.0"
+    - run: source/install/uv_with_retry.sh pip install --system "tensorflow~=2.18.0,<2.19.0" "torch~=2.5.0,<2.6.0"

This ensures compatibility with patch releases while preventing an automatic jump to the next minor release.
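As a quick illustration of what these two specifiers actually admit, the check below can be run anywhere the packaging library is available (a sketch; packaging is not a dependency of this workflow, and pre-release handling follows the PEP 440 defaults):

python3 -c "
from packaging.specifiers import SpecifierSet
for spec in ('~=2.18.0rc2', '~=2.18.0,<2.19.0'):
    allowed = SpecifierSet(spec)
    # pre-releases are only admitted when the specifier itself names one
    print(spec, [v for v in ('2.18.0rc2', '2.18.0', '2.18.1', '2.19.0') if v in allowed])
"

The first specifier admits the release candidate and any later 2.18.x release; the second excludes the pre-release but otherwise behaves the same, so the choice mainly depends on whether 2.18.0rc2 should remain installable.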


Line range hint 41-47: Consider removing the commented-out CUDA installation step

The CUDA installation step has been correctly disabled as it's no longer needed with the NVIDIA Docker image. This is a good change.

For code cleanliness, consider removing this entire block instead of keeping it commented out. If you want to preserve this information for future reference, consider moving it to a separate documentation file or adding it as a comment in the workflow file's header.

You can remove these lines:

-    - run: |
-         wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb \
-         && sudo dpkg -i cuda-keyring_1.0-1_all.deb \
-         && sudo apt-get update \
-         && sudo apt-get -y install cuda-12-3 libcudnn8=8.9.5.*-1+cuda12.3
-      if: false  # skip as we use nvidia image
🧰 Tools
🪛 actionlint

51-51: shellcheck reported issue in this script: SC2155:warning:1:8: Declare and assign separately to avoid masking return values

(shellcheck)


51-51: shellcheck reported issue in this script: SC2155:warning:2:8: Declare and assign separately to avoid masking return values

(shellcheck)


51-51: shellcheck reported issue in this script: SC2102:info:3:61: Ranges can only match single chars (mentioned due to duplicates)

(shellcheck)


Line range hint 1-105: Summary of CUDA workflow update

The changes in this PR successfully update the CUDA testing workflow to use CUDA 12.6, TensorFlow 2.18, and PyTorch 2.5, aligning with the PR objectives. Here's a summary of the main points:

  1. The Docker image has been updated to use CUDA 12.6 and Ubuntu 24.04.
  2. TensorFlow and PyTorch versions have been updated as intended.
  3. The libtorch download link has been updated, but there's a potential CUDA version mismatch to verify.
  4. The redundant CUDA installation step has been disabled.

Please address the following points:

  1. Verify compatibility with Ubuntu 24.04 for all dependencies and scripts.
  2. Consider using more specific version ranges for TensorFlow and PyTorch.
  3. Check and update the libtorch download link to ensure CUDA 12.6 compatibility.
  4. Remove the commented-out CUDA installation step for code cleanliness.

Once these points are addressed, the PR will be ready for merge.
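For a quick look at what the workflow currently pins, something along these lines could be run from the repository root (a sketch; the file path is the one touched by this PR):

# inspect the image tag, framework pins, and libtorch link in the CUDA workflow
grep -n "nvidia/cuda" .github/workflows/test_cuda.yml
grep -nE "tensorflow~=|torch~=" .github/workflows/test_cuda.yml
grep -n "download.pytorch.org/libtorch" .github/workflows/test_cuda.yml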

🧰 Tools
🪛 actionlint

19-19: label "nvidia" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "macos-11", "macos-11.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed between commits cc4b23d and f1e974e.

📒 Files selected for processing (1)
  • .github/workflows/test_cuda.yml (3 hunks)
🧰 Additional context used
🔇 Additional comments (2)
.github/workflows/test_cuda.yml (2)

22-22: LGTM: Docker image updated to CUDA 12.6

The update to nvidia/cuda:12.6-devel-ubuntu24.04 aligns with the PR objective of bumping CUDA to 12.6. This change also includes an upgrade to Ubuntu 24.04.

Please ensure that all dependencies and scripts are compatible with Ubuntu 24.04. Run the following command to check the Ubuntu version in the workflow:
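A sketch of such a check, assuming it runs inside the updated container image:

grep PRETTY_NAME /etc/os-release   # should report Ubuntu 24.04 for the 12.6-devel-ubuntu24.04 image
nvcc --version                     # should report the CUDA 12.6 toolkit shipped with the devel image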


66-66: Verify CUDA compatibility for libtorch

The libtorch download link has been updated to version 2.5.0, which is consistent with the PyTorch version installed earlier. However, there's a potential version mismatch:

  • The Docker image uses CUDA 12.6
  • The libtorch download is for CUDA 12.4

Please verify that libtorch 2.5.0 built for CUDA 12.4 is compatible with CUDA 12.6, or whether a more appropriate build is available; the PyTorch website lists the latest compatible versions.

If a CUDA 12.6 compatible version is available, consider updating the download link.
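One way to sanity-check this at runtime (a sketch, assuming the job runs on the NVIDIA runner with the torch wheel already installed):

nvidia-smi | head -n 10   # shows the driver and the highest CUDA version it supports
python3 -c "import torch; print('torch', torch.__version__, 'built for CUDA', torch.version.cuda, 'available:', torch.cuda.is_available())"

In general, binaries built against CUDA 12.4 are expected to run under a 12.6 driver through CUDA's minor-version compatibility, but confirming torch.cuda.is_available() inside the actual image is the more direct check.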

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
.github/workflows/test_python.yml (1)

29-29: Simplified torch installation looks good, but consider version pinning.

The change to simplify the torch installation and ensure the CPU version is installed is a good improvement. It aligns with the PR objective of updating dependencies.

However, consider the following suggestions:

  1. To ensure reproducibility across different environments and times, it might be beneficial to pin the torch version. This prevents potential issues from unexpected changes in newer versions.
  2. Verify that this version of torch is compatible with the updated TensorFlow version (2.18 as per PR objectives) to avoid any integration issues.

Consider modifying the line to include a version pin:

-        source/install/uv_with_retry.sh pip install --system torch -i https://download.pytorch.org/whl/cpu
+        source/install/uv_with_retry.sh pip install --system torch==2.5.* -i https://download.pytorch.org/whl/cpu

This ensures you're using the latest patch version of PyTorch 2.5, which aligns with the PR objective of bumping to PT 2.5.
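A quick way to confirm that the pinned install resolves to a CPU-only build (a sketch, reusing the uv helper the workflow already calls):

source/install/uv_with_retry.sh pip install --system "torch==2.5.*" -i https://download.pytorch.org/whl/cpu
python3 -c "import torch; print(torch.__version__); assert torch.version.cuda is None, 'expected a CPU-only wheel'"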

pyproject.toml (1)

279-280: Overall assessment of version updates

The changes to TENSORFLOW_VERSION and PYTORCH_VERSION align with the PR objectives to bump CUDA, TensorFlow, and PyTorch versions. These updates are significant and may have far-reaching effects on the project.

To ensure a smooth transition to these new versions:

  1. Conduct thorough testing across the entire codebase, paying special attention to TensorFlow and PyTorch specific functionalities.
  2. Update any documentation or README files to reflect the new version requirements.
  3. Consider creating a migration guide for users of the project, highlighting any breaking changes or new features that may affect their usage.
  4. Monitor the CI/CD pipeline closely after these changes to catch any unforeseen issues early.
  5. Plan for a potential rollback strategy in case critical issues are discovered post-merge.

These version bumps represent a significant change. While they bring new features and improvements, they also introduce the risk of compatibility issues. Careful testing and monitoring will be crucial for a successful integration.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed between commits f1e974e and 6aca89a.

📒 Files selected for processing (2)
  • .github/workflows/test_python.yml (1 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
🔇 Additional comments (3)
.github/workflows/test_python.yml (1)

Line range hint 1-95: Overall workflow improvements look good.

The changes to this workflow file align well with the PR objectives and introduce several improvements:

  1. The torch installation has been simplified and ensures the CPU version is used.
  2. The workflow has been updated to ignore certain branches and include merge group events, which can help with CI efficiency.
  3. Concurrency settings have been refined to better manage parallel runs.
  4. New jobs for updating test durations have been added, which can help with better test distribution and performance tracking.

These changes should lead to a more efficient and maintainable CI process. Good job on the improvements!

🧰 Tools
🪛 actionlint

27-27: shellcheck reported issue in this script: SC2155:warning:3:8: Declare and assign separately to avoid masking return values

(shellcheck)


27-27: shellcheck reported issue in this script: SC2102:info:4:80: Ranges can only match single chars (mentioned due to duplicates)

(shellcheck)


27-27: shellcheck reported issue in this script: SC2102:info:4:102: Ranges can only match single chars (mentioned due to duplicates)

(shellcheck)

pyproject.toml (2)

279-279: Verify compatibility with TensorFlow 2.18.0rc2

The addition of TENSORFLOW_VERSION = "2.18.0rc2" aligns with the PR objective to bump TensorFlow to version 2.18. However, using a release candidate version in a CI/CD pipeline could potentially lead to instability or unexpected behavior.

Please ensure that:

  1. The codebase is compatible with TensorFlow 2.18.0rc2.
  2. Any breaking changes introduced in this version are addressed.
  3. The CI pipeline successfully runs with this version.

Consider adding a TODO comment to update to the stable 2.18.0 release when it becomes available.


280-280: Verify compatibility with PyTorch 2.5.0

The update of PYTORCH_VERSION from "2.3.1" to "2.5.0" is a significant version bump, which aligns with the PR objective. This update may introduce new features and potentially breaking changes.

Please ensure that:

  1. The codebase is compatible with PyTorch 2.5.0.
  2. Any breaking changes or deprecations between versions 2.3.1 and 2.5.0 are addressed.
  3. The CI pipeline successfully runs with this version.
  4. Performance improvements or new features in PyTorch 2.5.0 are leveraged where applicable.
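A small sketch for cross-checking the pin against what CI actually installs (assumes the commands run from the repository root):

grep -n "PYTORCH_VERSION" pyproject.toml
python3 -c "import torch; print('installed torch:', torch.__version__)"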

codecov bot commented Oct 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.53%. Comparing base (cc4b23d) to head (d3fe74a).

Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #4228      +/-   ##
==========================================
- Coverage   84.55%   84.53%   -0.03%     
==========================================
  Files         537      537              
  Lines       51237    51237              
  Branches     3047     3047              
==========================================
- Hits        43324    43312      -12     
  Misses       6965     6965              
- Partials      948      960      +12     


@njzjz added the Test CUDA label Oct 17, 2024
@github-actions bot removed the Test CUDA label Oct 17, 2024
Signed-off-by: Jinzhe Zeng <[email protected]>
@njzjz added the Test CUDA label Oct 17, 2024
@github-actions bot removed the Test CUDA label Oct 17, 2024
Signed-off-by: Jinzhe Zeng <[email protected]>
@njzjz added the Test CUDA label Oct 18, 2024
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
pyproject.toml (1)

Line range hint 1-580: Summary of changes and potential impact

The main changes in this PR involve updating the versions of key dependencies:

  1. TensorFlow updated to 2.18.0rc2 (release candidate)
  2. PyTorch updated to 2.5.0
  3. CUDNN constraint removed, allowing for newer versions

These updates align with the PR objectives and may bring performance improvements and new features. However, they also introduce potential risks:

  1. Using a TensorFlow release candidate in production
  2. Possible breaking changes with the significant PyTorch version jump
  3. Potential compatibility issues with unconstrained CUDNN versions

To mitigate these risks and ensure a smooth transition:

  1. Implement comprehensive integration tests covering critical functionality with these new versions.
  2. Consider a phased rollout or maintaining a fallback option to quickly revert if issues arise.
  3. Monitor performance metrics closely after deployment to identify any regressions or improvements.
  4. Update documentation to reflect any changes in API or behavior resulting from these version updates.

These steps will help maintain the stability and reliability of the project while benefiting from the latest improvements in the underlying libraries.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed between commits d3fe74a and 8b895ea.

📒 Files selected for processing (1)
  • pyproject.toml (2 hunks)
🧰 Additional context used
🔇 Additional comments (2)
pyproject.toml (2)

135-135: Consider the implications of removing CUDNN version constraint

The removal of the version constraint for nvidia-cudnn-cu12 (from <9 to no constraint) allows for the use of newer CUDNN versions. This change aligns with the CUDA update mentioned in the PR title. However, consider the following:

  1. Benefit: This change enables the use of the latest CUDNN optimizations and features.
  2. Risk: Removing the upper bound may lead to compatibility issues with future, untested CUDNN versions.

To ensure compatibility, run the following verification script:

#!/bin/bash
# Verify CUDNN version and compatibility
python3 << END
import torch
print(f"CUDNN version: {torch.backends.cudnn.version()}")
# Add any CUDNN-specific functionality tests here
END

This script will help confirm that the CUDNN version is correctly detected and basic functionality is maintained.


279-280: Verify compatibility with updated TensorFlow and PyTorch versions

The changes update the TensorFlow version to a release candidate (2.18.0rc2) and PyTorch to a newer stable version (2.5.0). While these updates align with the PR objectives, consider the following:

  1. Using a TensorFlow release candidate (2.18.0rc2) in a production environment may introduce instability. Ensure thorough testing is performed, especially for critical functionality.
  2. The PyTorch update to 2.5.0 is a significant version jump. While it likely brings performance improvements and new features, it may also introduce breaking changes.

To ensure compatibility, run the following verification script:
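A minimal sketch of such a check, assuming both frameworks are importable in the test environment:

#!/bin/bash
# Verify that the bumped TensorFlow and PyTorch versions import and run a trivial op
python3 << END
import tensorflow as tf
import torch
print(f"TensorFlow version: {tf.__version__}")
print(f"PyTorch version: {torch.__version__}")
print("tf sum:", float(tf.reduce_sum(tf.ones((2, 2)))))
print("torch sum:", torch.ones(2, 2).sum().item())
END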

This script will help confirm that the new versions are correctly installed and basic functionality is maintained.

@github-actions bot removed the Test CUDA label Oct 18, 2024
@njzjz changed the title from "ci: bump CUDA to 12.6, TF to 2.18, PT to 2.5" to "ci: bump TF to 2.18, PT to 2.5" Oct 18, 2024