Documentation for Warewulf 4 on Rocky 9.4 with Slurm on x86_64 #2048
073047f to e0ae216
Test Results: 18 files (-6), 18 suites (-6), 27s (-18s). Results for commit 076ad68, compared against base commit d938eda. This pull request removes 22 tests.
♻️ This comment has been updated with latest results. |
Thanks a lot. CI seems to be happy and looking at the GitHub Actions artefacts I can see an RPM with the documentation and recipe included. I will try it on one of our test clusters tomorrow and let you know. |
Running this in our CI system. I see the following errors:
and
and
Also, is this line still correct?
The recipe is also missing a line to wait for the compute nodes to get ready like with warewulf 3:
|
The dhcp setup looks wrong:
|
e0ae216 to a714518
I think I fixed this (a stupid typo; the fix somehow got reverted). One thing that needs to be defined (I assume it is) is |
There were some serious issues with generating |
This has been put in the code just after provisioning the nodes. |
Our test setup uses the following input file: https://github.com/adrianreber/ohpc-infrastructure/blob/main/ansible/roles/test/templates/lenovo.mapping (It is converted, but the names are pretty close to the ones used in the recipe). I do not see |
If munge and slurm are used, I strongly recommend using |
Thanks. I just had a look and |
I still see a lot of errors:
The path is wrong. I see that a file called The second path is also wrong. I see this file dhcpd.conf is still wrong:
The subnet definition is still wrong:
The variable The next error is:
and everything else seems to be triggered by that error. Your call to sleep to wait for provisioning is before the line to reboot the compute nodes. It should be after it. SSH keys do not seem to get imported into the compute node image. |
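The ordering point above (reset the nodes first, then wait) could also be handled with a polling loop instead of a fixed delay. The following is a minimal sketch, not the recipe's code: the recipe uses a plain `sleep ${provision_wait}`, whereas this swaps in a retry loop with a timeout. `wait_for_nodes` and the probe command are hypothetical names introduced for illustration, and what counts as "ready" (ping, SSH, a Slurm state check) is an assumption left to the caller.

```shell
#!/usr/bin/env bash
# Sketch only: poll until every node passes a readiness probe or a timeout expires.
# wait_for_nodes is a hypothetical helper, not part of the recipe or of Warewulf.
# Usage: wait_for_nodes <timeout_seconds> <probe_command> <node>...
wait_for_nodes() {
    local timeout=$1 probe=$2
    shift 2
    # SECONDS is bash's built-in counter of seconds since the shell started.
    local deadline=$(( SECONDS + timeout ))
    local node
    for node in "$@"; do
        # Retry the probe for this node until it succeeds or time runs out.
        until "${probe}" "${node}"; do
            if (( SECONDS >= deadline )); then
                echo "timed out waiting for ${node}" >&2
                return 1
            fi
            sleep 1
        done
    done
    return 0
}
```

One possible use, placed after the `ipmitool ... chassis power reset` loop, would be to define a probe such as `node_up() { ping -c 1 -W 1 "$1" >/dev/null 2>&1; }` and call `wait_for_nodes "${provision_wait}" node_up "${c_name[@]}"`. The timeout is shared across all nodes, so it plays the same role as the single `provision_wait` value.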
I was able to adapt the recipe locally to make it work a bit more:
--- /opt/ohpc/pub/doc/recipes/rocky9/x86_64/warewulf4/slurm/recipe.sh 2024-10-23 20:09:06.000000000 +0000
+++ /tmp/recipe.sh 2024-10-24 08:10:56.115799438 +0000
@@ -117,8 +117,8 @@
perl -pi -e "s/range start:.*/range start: ${c_ip[0]}/" /etc/warewulf/warewulf.conf
perl -pi -e "s/range end:.*/range end: ${c_ip[$((num_computes-1))]}/" /etc/warewulf/warewulf.conf
perl -pi -e "s/mount: false/mount: true/" /etc/warewulf/warewulf.conf
-perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/host/etc/hosts.ww
-perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/general/etc/hosts.ww
+perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/host/rootfs/etc/hosts.ww
+perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/generic/rootfs/etc/hosts.ww
systemctl enable --now warewulfd
wwctl overlay build
wwctl configure --all
@@ -126,7 +126,7 @@
# -------------------------------------------------
# Create compute image for Warewulf (Section 3.8.1)
# -------------------------------------------------
-wwctl container import docker://ghcr.io/warewulf/warewulf-rockylinux:BOSTAG rocky-9.4 --syncuser
+wwctl container import docker://ghcr.io/warewulf/warewulf-rockylinux:9 rocky-9.4 --syncuser
wwctl container exec rocky-9.4 /bin/bash <<- EOF
dnf -y install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm
dnf -y update
@@ -302,6 +302,7 @@
wwctl overlay import generic /etc/munge/munge.key
wwctl overlay chown generic /etc/munge/munge.key $(id -u munge) $(id -g munge)
wwctl overlay chown generic /etc/munge $(id -u munge) $(id -g munge)
+cat /root/.ssh/*.pub | wwctl overlay import generic <(cat) /root/.ssh/authorized_keys
if [[ ${enable_ipoib} -eq 1 ]];then
wwctl overlay mkdir generic /etc/sysconfig/network-scripts/
The sleep is still at the wrong place. The SSH key import into the compute nodes was a part of warewulf 3. Not sure if this is supposed to happen automatically in warewulf 4. But the compute nodes do not come up. If I see it correctly they are making a lot of DHCP requests and the DHCP server is not happy about that. The range which is configured now is only two hosts. I see the following log from DHCP:
The successful DHCP transaction is from the BIOS, it then fails in iPXE to get the same IP address a second time. Not sure how this is supposed to work. I tried switching to a static dhcp setup in One more change that is necessary for the MAC address to be added to dhcpd.conf:
I am not sure about the SSH keys in the previous comments, because the nodes never actually booted. The node now gets an IP address in the BIOS and also during iPXE, but it fails to download some file via iPXE. The error log on the warewulf side is:
Ah, there is a typo in the container name. Let me try something and then write another, maybe less confusing, comment. |
Okay, so this is the latest change for the recipe I have running:
--- /opt/ohpc/pub/doc/recipes/rocky9/x86_64/warewulf4/slurm/recipe.sh 2024-10-23 20:09:06.000000000 +0000
+++ /tmp/recipe.sh 2024-10-24 09:08:35.270878390 +0000
@@ -117,8 +117,8 @@
perl -pi -e "s/range start:.*/range start: ${c_ip[0]}/" /etc/warewulf/warewulf.conf
perl -pi -e "s/range end:.*/range end: ${c_ip[$((num_computes-1))]}/" /etc/warewulf/warewulf.conf
perl -pi -e "s/mount: false/mount: true/" /etc/warewulf/warewulf.conf
-perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/host/etc/hosts.ww
-perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/general/etc/hosts.ww
+perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/host/rootfs/etc/hosts.ww
+perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/generic/rootfs/etc/hosts.ww
systemctl enable --now warewulfd
wwctl overlay build
wwctl configure --all
@@ -126,7 +126,7 @@
# -------------------------------------------------
# Create compute image for Warewulf (Section 3.8.1)
# -------------------------------------------------
-wwctl container import docker://ghcr.io/warewulf/warewulf-rockylinux:BOSTAG rocky-9.4 --syncuser
+wwctl container import docker://ghcr.io/warewulf/warewulf-rockylinux:9 rocky-9.4 --syncuser
wwctl container exec rocky-9.4 /bin/bash <<- EOF
dnf -y install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm
dnf -y update
@@ -302,6 +302,7 @@
wwctl overlay import generic /etc/munge/munge.key
wwctl overlay chown generic /etc/munge/munge.key $(id -u munge) $(id -g munge)
wwctl overlay chown generic /etc/munge $(id -u munge) $(id -g munge)
+cat /root/.ssh/*.pub | wwctl overlay import generic <(cat) /root/.ssh/authorized_keys
if [[ ${enable_ipoib} -eq 1 ]];then
wwctl overlay mkdir generic /etc/sysconfig/network-scripts/
@@ -315,20 +316,14 @@
wwctl overlay build
# Add hosts to cluster
for ((i=0; i<$num_computes; i++)) ; do
- wwctl node add --discoverable=yes --container=rocky9.4 \
- --ipaddr=${c_ip[$i]} --netmask=${internal_netmask} ${compute_prefix}$i
+ wwctl node add --discoverable=yes --container=rocky-9.4 \
+ --ipaddr=${c_ip[$i]} --hwaddr=${c_mac[i]} --netmask=${internal_netmask} ${c_name[i]}
done
wwctl overlay build
wwctl configure --all
# Enable and start munge and slurmctld (Cont.)
systemctl enable --now munge
systemctl enable --now slurmctld
-
-# -------------------------------------------------------------
-# Allow for optional sleep to wait for provisioning to complete
-# -------------------------------------------------------------
-sleep ${provision_wait}
-
# Optionally, add arguments to bootstrap kernel
if [[ ${enable_kargs} -eq 1 ]]; then
wwctl node set --yes --kernelargs="${kargs}" "${compute_regex}"
@@ -338,9 +333,16 @@
# Boot compute nodes (Section 3.10)
# ---------------------------------
for ((i=0; i<${num_computes}; i++)) ; do
- ipmitool -E -I lanplus -H ${c_bmc[$i]} -U ${bmc_username} -P ${bmc_password} chassis power reset
+ # ignore this change ipmitool -E -I lanplus -H ${c_bmc[$i]} -U ${bmc_username} -P ${bmc_password} chassis power reset
done
+
+# -------------------------------------------------------------
+# Allow for optional sleep to wait for provisioning to complete
+# -------------------------------------------------------------
+sleep ${provision_wait}
+
+
# ---------------------------------------
# Install Development Tools (Section 4.1)
# ---------------------------------------
I also changed the warewulf configuration to a static DHCP setup. I still cannot boot the compute node. It fails with: |
Now the compute nodes are booting, but I cannot log in. Not sure why. How can I see the FS the compute nodes are seeing? ssh on the compute nodes doesn't accept any of the SSH keys. |
Okay, now I am able to ssh into the compute nodes. So the following changes are necessary to the recipe:
|
Also remove the
|
Is there a way to update a compute node's content without rebooting? Like running |
I believe I have addressed all the errors noted above. I have changed my local workflow to use
To update a compute node's content without rebooting (like /warewulf/bin/wwgetfiles), run:
wwctl overlay build
This rebuilds the overlays; the nodes pull the changes within about a minute and apply them.
Note: the nodes must be defined previously with
I dropped the
A total rebuild/configure is run by:
wwctl container build rocky-9.4
wwctl overlay build
wwctl configure --all
Fixes in this commit:
|
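The total rebuild/configure sequence quoted above could be wrapped in a small function so the three steps always run in the same order. This is only a sketch under stated assumptions: `rebuild_all` and the `RUN` dry-run hook are hypothetical names introduced here, and the only `wwctl` invocations used are the three quoted in the comment.

```shell
#!/usr/bin/env bash
# Sketch only: run the full rebuild sequence from the comment above as one unit.
# rebuild_all and the RUN hook are hypothetical; set RUN=echo for a dry run
# that prints the commands instead of executing them.
rebuild_all() {
    # Default to the container name used in the recipe.
    local container=${1:-rocky-9.4}
    # The && chain stops at the first failing step.
    ${RUN:-} wwctl container build "${container}" &&
        ${RUN:-} wwctl overlay build &&
        ${RUN:-} wwctl configure --all
}
```

Calling `RUN=echo rebuild_all rocky-9.4` prints the three commands without touching the system, which may be handy when checking the container name before a real run.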
I still see these errors:
Also for warewulf3 and confluent I can just |
Using this recipe with the mentioned changes I am able to run a complete test run successfully: Looks like we are really close. One thing I am doing, which is not strictly necessary for the recipe but which we need for our tooling, is the following change:
Our tooling relies on the warewulf3 entries in |
Also, warewulf3 used to do the following: |
Yes, this is still part of warewulf and is done by the host template |
Ah, thanks. I will just source the file in our CI run. |
I found another fundamental issue with my dev workflow. I've been testing against the release version of OpenHPC, not the latest development branch and hence have been developing against Warewulf 4.4.x, not 4.5.5. I've updated my workflow and noted some of the same issues indicated above (I was not seeing them before). I'll work on updating the docs in the next day or two. |
Updated workflow to build against the latest build (OpenHPC 3.2) and hence Warewulf 4.5.
I added a fix for a bug (I think) in Warewulf 4.5.5-320.ohpc.3.1 that does not have a Changelog: Target Warewulf 4.5
|
Here is how I set up the repos. Please let me know if I did this wrong...
|
% begin_ohpc_run
% ohpc_validation_newline
% ohpc_validation_comment Update /etc/hosts template to have ${hostname}.localdomain as the first host entry
% ohpc_command sed -e 's_\({{$node.Id.Get}}{{end}}\)_{{$node.Id.Get}}.localdomain \1_g' -i /srv/warewulf/overlays/host/rootfs/etc/hosts.ww
Could you run a second command here:
bash /etc/profile.d/ssh_setup.sh
The script only runs when a new login happens, and we don't do that during the run of the recipe. So setting up SSH early would help CI and our users.
Adding - will commit in a few min, just committed a documentation update and missed this comment, expect a new commit.
I'm currently using
but it extends outside the line on the page. Using a "\" at the end does not work properly since it's a one-liner and the doc parser expects space in front of the next line after "\", which is bad python. Any suggestions? |
Just had a look and I see it. Not sure how to solve it. Maybe make the script a part of the warewulf4 RPM and then use the script. Also not the best solution. A different approach could be:
$ dnf -y install netmask
$ netmask=$(netmask ${sms_ip}/${internal_network})
$ internal_network=${netmask%/*}
$ unset netmask |
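Another option that avoids installing an extra package would be to compute the network address in plain bash by AND-ing each octet of the address with the netmask. This is a minimal sketch, not something proposed in the thread: `network_address` is a hypothetical helper, it assumes dotted-quad IPv4 input without leading zeros, and the usage note assumes the mask is held in `internal_netmask`.

```shell
#!/usr/bin/env bash
# Sketch only: derive the IPv4 network address from an address and a netmask.
# network_address is a hypothetical helper; both arguments are dotted-quad strings.
network_address() {
    local ip=$1 mask=$2
    local -a i m
    # Split each dotted quad into its four octets.
    IFS=. read -r -a i <<< "${ip}"
    IFS=. read -r -a m <<< "${mask}"
    # Bitwise AND of matching octets yields the network address.
    printf '%d.%d.%d.%d\n' \
        "$(( i[0] & m[0] ))" "$(( i[1] & m[1] ))" \
        "$(( i[2] & m[2] ))" "$(( i[3] & m[3] ))"
}
```

In the recipe this might be used as `internal_network=$(network_address "${sms_ip}" "${internal_netmask}")`, which fits on one line and needs no temporary variable to unset afterwards.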
I believe I addressed all the issues raised in the comments. Ready for another round of testing/review.
|
This is ready to be merged. Thanks a lot. I would like to see the commits squashed. We do not need your fixup commits in the repository, and this is one logical unit of work that could all be part of a single commit. Do you want to squash the commits and add a descriptive commit message, or should I do it via the GitHub interface? Both approaches work for me. You could also add a |
Recipe for using Warewulf 4 as a provisioner.
Signed-off-by: Timothy Middelkoop <[email protected]>
Co-authored-by: GodloveD <[email protected]>
78726c5 to 076ad68
Commits squashed and added Co-authored-by (I did not know that was a thing!). Thanks for all the guidance and help |
|
@MiddelkoopT The compute nodes do not have a DNS server configured. Any ideas what is missing here? |
The default gateway is also missing. |
I missed that one. It seems that we may need to set some more input variables (confluent does) as my scripting hacks may get unreliable given interface ordering, number of resolvers, etc. This mostly works on my test environment (not fully tested yet - have some network issues to resolve first):
I used the same names as confluent does for the last two. And now we can set the network information (
This is all done in the This is only hand tested right now for feedback. Questions:
|
In the file
Please do.
Ah, I was not aware of
I don't know the history of that line. It always fails in my CI runs, because we use the same interface for the internal and external network in our CI setup. Not sure what to do about that, but I would not change it for now. |
Created Pull Request #2053 |
Documentation for Warewulf 4. Commands tested locally on a VM. Only basic installation was tested and not the optional sections (including IB). Big thanks to David Godlove for the initial draft (https://github.com/GodloveD/ohpc/tree/warewulf4_doc_update).