add node READY state check for TAS #425

Merged (3 commits) on Dec 20, 2024

chengcongdu
Member

  • Add a node READY state check to TAS, so that nodes reporting NotReady are skipped during scheduling.
  • Remove the check for the compact placement tag from the original TAS under gpudirect-tcpxo/topology-scheduler.

Sample Log output:

2024-12-17 18:44:09,978 - root - INFO - Attempting to schedule job: my-sample-job with 8 pods 
2024-12-17 18:44:09,978 - root - INFO - Found schedulable pod: default/my-sample-job-0-csdv6, CPU: 0, Memory: 0, GPU: 8 Index: 0
2024-12-17 18:44:09,978 - root - INFO - Found schedulable pod: default/my-sample-job-1-khhv9, CPU: 0, Memory: 0, GPU: 8 Index: 1
2024-12-17 18:44:09,978 - root - INFO - Found schedulable pod: default/my-sample-job-2-98hhb, CPU: 0, Memory: 0, GPU: 8 Index: 2
2024-12-17 18:44:09,979 - root - INFO - Found schedulable pod: default/my-sample-job-3-mhkf6, CPU: 0, Memory: 0, GPU: 8 Index: 3
2024-12-17 18:44:09,979 - root - INFO - Found schedulable pod: default/my-sample-job-4-5zwqn, CPU: 0, Memory: 0, GPU: 8 Index: 4
2024-12-17 18:44:09,979 - root - INFO - Found schedulable pod: default/my-sample-job-5-5hnpj, CPU: 0, Memory: 0, GPU: 8 Index: 5
2024-12-17 18:44:09,979 - root - INFO - Found schedulable pod: default/my-sample-job-6-9t2sj, CPU: 0, Memory: 0, GPU: 8 Index: 6
2024-12-17 18:44:09,979 - root - INFO - Found schedulable pod: default/my-sample-job-7-pxzzz, CPU: 0, Memory: 0, GPU: 8 Index: 7
2024-12-17 18:44:09,979 - root - INFO - Skipping node gke-tii-on-gke-a3mega-a3mega-np-1-137c20ee-1gpl because it is NotReady
2024-12-17 18:44:09,980 - root - INFO - Node: gke-tii-on-gke-a3mega-a3mega-np-1-137c20ee-4rv9, CPU: 206.581, Memory: 1929444249344, GPU: 8, Topology: ('7d1ba2fc1fe47925a0ae8083c7e9f41b', '41efc99a47c263f8261d7461cd83b658', 'd96d5606058e20f94c5925058e3efef2')
2024-12-17 18:44:09,981 - root - INFO - Node: gke-tii-on-gke-a3mega-a3mega-np-1-137c20ee-4wl0, CPU: 206.581, Memory: 1929444245248, GPU: 8, Topology: ('7d1ba2fc1fe47925a0ae8083c7e9f41b', '134e1eea7c4f0be251da22331a797ad9', '019dbdb5f06e8b23fd4b9038fcdb0ff4')
2024-12-17 18:44:09,981 - root - INFO - Node: gke-tii-on-gke-a3mega-a3mega-np-1-137c20ee-c3b2, CPU: 206.581, Memory: 1929444245248, GPU: 8, Topology: ('7d1ba2fc1fe47925a0ae8083c7e9f41b', 'b3fee96ebe1261e9f5f998326efd4963', 'dd24710f564b53b53fceff51774a0c8c')
2024-12-17 18:44:09,982 - root - INFO - Node: gke-tii-on-gke-a3mega-a3mega-np-1-137c20ee-gxjk, CPU: 206.581, Memory: 1929444249344, GPU: 8, Topology: ('7d1ba2fc1fe47925a0ae8083c7e9f41b', '7be2d1e23bff97428301c2f2d4a24c43', '339145deef94c8781516b03b628b81a2')
2024-12-17 18:44:09,982 - root - INFO - Node: gke-tii-on-gke-a3mega-a3mega-np-1-137c20ee-rsdt, CPU: 206.581, Memory: 1929444249344, GPU: 8, Topology: ('7d1ba2fc1fe47925a0ae8083c7e9f41b', '52fa85f5ffd8ef1a76c303731e92ce6d', 'e2ea8d612f08a3bf15e32ca1b9fe1464')
2024-12-17 18:44:09,982 - root - INFO - Node: gke-tii-on-gke-a3mega-a3mega-np-1-137c20ee-x79z, CPU: 206.581, Memory: 1929444241152, GPU: 8, Topology: ('7d1ba2fc1fe47925a0ae8083c7e9f41b', '00e8fb6630bfcb9feff177c5643136b9', '456dc3a96a06ab5c5355ff3f26cdb909')
2024-12-17 18:44:09,983 - root - INFO - Node: gke-tii-on-gke-a3mega-a3mega-np-1-137c20ee-xp3p, CPU: 206.581, Memory: 1929444249344, GPU: 8, Topology: ('7d1ba2fc1fe47925a0ae8083c7e9f41b', '33025660a515f2790a3a883ba965e014', '76c0495936c89aea0f23976a88ab6897')
2024-12-17 18:44:09,984 - root - INFO - Node gke-tii-on-gke-a3mega-default-pool-02e2ac1c-2mrc does not have topology labels
2024-12-17 18:44:09,984 - root - INFO - Node: gke-tii-on-gke-a3mega-default-pool-02e2ac1c-2mrc, CPU: 5.677, Memory: 57298250496, GPU: 0, Topology: ()
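
For readers who want to see what such a check can look like, here is a minimal sketch; it is not the code from this PR. It assumes nodes are available as plain dictionaries shaped like the Kubernetes Node JSON, where `status.conditions` contains an entry with `type: "Ready"` and a `status` of `"True"`, `"False"`, or `"Unknown"`. The helper names `is_node_ready` and `filter_ready_nodes` are hypothetical.

```python
import logging
from typing import Any, Dict, List, Mapping


def is_node_ready(node: Mapping[str, Any]) -> bool:
    """Return True only if the node has a Ready condition whose status is "True".

    A missing Ready condition, or a status of "False" or "Unknown", is treated
    as NotReady (an assumption made for this sketch).
    """
    conditions = (node.get("status") or {}).get("conditions") or []
    for condition in conditions:
        if condition.get("type") == "Ready":
            return condition.get("status") == "True"
    return False


def filter_ready_nodes(nodes: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop NotReady nodes and log each skipped node by name."""
    ready = []
    for node in nodes:
        name = (node.get("metadata") or {}).get("name", "<unknown>")
        if not is_node_ready(node):
            logging.info("Skipping node %s because it is NotReady", name)
            continue
        ready.append(node)
    return ready
```

Filtering before any topology or capacity calculation keeps NotReady nodes out of the candidate set entirely, which matches the order of the log lines above: the skip message appears before the per-node resource summaries.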

@chengcongdu
Member Author

@dburl PTAL

Contributor

@thisSIDEofRANDOM left a comment

LGTM

@thisSIDEofRANDOM
Contributor

Please sign and submit, thanks!

@dburl
Contributor

dburl commented Dec 19, 2024

+1

@chengcongdu merged commit 04688ae into GoogleCloudPlatform:master on Dec 20, 2024
2 checks passed