VFs appear as additional InfiniBand devices, but obviously don't report temperatures. The logs are then flooded with:
Dec 17 00:00:53 penny sh[1151952]: mopen: Operation not supported
Dec 17 00:00:53 penny sh[1151947]: mellanox_hca_temp: Failed to get temperature from InfiniBand HCA 'mlx5_2'!
Dec 17 00:00:53 penny sh[1151953]: mopen: Operation not supported
Dec 17 00:00:53 penny sh[1151947]: mellanox_hca_temp: Failed to get temperature from InfiniBand HCA 'mlx5_3'!
and so on.
The only clue I could find to recognise them as virtual is that node_guid is 0000:0000:0000:0000. I'm not sure if this is supposed to change when setting the MAC address on the interfaces.
So far, with the virtual function interfaces unconfigured, the following patch suppresses the errors for me:
--- mellanox_hca_temp.orig 2021-06-27 08:55:33.406292246 +0200
+++ mellanox_hca_temp 2023-12-22 15:18:46.072149247 +0100
@@ -41,6 +41,10 @@
if test ! -d "$dev"; then
continue
fi
+ # node_guid is all zeros for Virtual Functions, which report no temp.
+ if [ "$(cat "$dev/node_guid")" = "0000:0000:0000:0000" ]; then
+ continue
+ fi
device="${dev##*/}"
# get temperature
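For anyone who wants to try the same check outside the plugin, here is a minimal sketch of it as a standalone function. The name `has_zero_node_guid` is hypothetical and not part of mellanox_hca_temp; the function also assumes that a device without a readable node_guid file should not be treated as a VF.

```shell
# Hypothetical helper, not part of mellanox_hca_temp.
# $1 is a device directory such as /sys/class/infiniband/mlx5_2.
# Succeeds only when node_guid is readable and reads as all zeros.
has_zero_node_guid() {
    guid_file="$1/node_guid"
    # Assumption: a missing or unreadable node_guid means "not a VF".
    [ -r "$guid_file" ] || return 1
    [ "$(cat "$guid_file")" = "0000:0000:0000:0000" ]
}
```

Inside the plugin's loop this could be used as `has_zero_node_guid "$dev" && continue`.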
I don't think that depending on an all-zeros node_guid is a reliable method, since the VF node_guid can be set by the user, e.g. echo 00:11:22:33:44:55:1:0 > /sys/class/infiniband/mlx5_0/device/sriov/0/node.
How about using the existence of the sriov directory as an indicator of whether the device is a VF or a PF?
I don't think that depending on an all-zeros node_guid is a reliable method, since the VF node_guid can be set by the user, e.g. echo 00:11:22:33:44:55:1:0 > /sys/class/infiniband/mlx5_0/device/sriov/0/node.
You are indeed right, this is not a reliable method. It also seems to depend on which driver one is using.
How about using the existence of the sriov directory as an indicator of whether the device is a VF or a PF?
I'm using only the in-kernel mlx5 driver (I didn't install and compile OFED), and there's no sriov directory. There are some sriov_* files there, though:
The VF driver directory instead only has one file, writable:
root@penny:/sys/class/infiniband/mlx5_2/device # ls -als sriov_vf_msix_count
0 --w------- 1 root root 4096 Dec 24 11:30 sriov_vf_msix_count
So, I guess we could use driver/sriov_numvfs as a check, but I'm not sure this is valid across different driver combinations, and I'm also not sure what happens if one boots with SR-IOV disabled in the BIOS. I can't test this right now.
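A rough sketch of what such a check might look like, untested on real hardware. It assumes the file lives at `device/sriov_numvfs` under each InfiniBand device directory (next to the `sriov_vf_msix_count` file shown above) and that only PFs expose it. The function names `is_sriov_pf` and `list_temp_candidates` are made up for illustration.

```shell
# Hypothetical helper: succeeds when the device exposes sriov_numvfs.
# Assumption: only SR-IOV capable PFs have this file; VFs do not.
is_sriov_pf() {
    [ -e "$1/device/sriov_numvfs" ]
}

# Print the names of devices that would be queried for a temperature.
# $1 is the sysfs root; it defaults to /sys/class/infiniband.
list_temp_candidates() {
    root="${1:-/sys/class/infiniband}"
    for dev in "$root"/*; do
        [ -d "$dev" ] || continue
        is_sriov_pf "$dev" || continue
        echo "${dev##*/}"
    done
}
```

Note the caveat already raised above: a PF booted with SR-IOV disabled might not expose sriov_numvfs at all, in which case this sketch would skip it too.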