Documentation for Warewulf 4 on Rocky 9.4 with Slurm on x86_64 #2048

Merged
merged 1 commit into from
Oct 31, 2024

MiddelkoopT
Contributor

Documentation for Warewulf 4. Commands tested locally on a VM. Only basic installation was tested and not the optional sections (including IB). Big thanks to David Godlove for the initial draft (https://github.com/GodloveD/ohpc/tree/warewulf4_doc_update).

github-actions bot commented Oct 21, 2024

Test Results

18 files   -  6  18 suites   - 6   27s ⏱️ -18s
53 tests  - 22  49 ✅  - 22  4 💤 ±0  0 ❌ ±0 
66 runs   - 22  62 ✅  - 22  4 💤 ±0  0 ❌ ±0 

Results for commit 076ad68. ± Comparison against base commit d938eda.

This pull request removes 22 tests.
rm_execution ‑ [libs/PHDF5] MPI C binary runs under resource manager (slurm/gnu14/mpich)
rm_execution ‑ [libs/PHDF5] MPI C binary runs under resource manager (slurm/gnu14/openmpi5)
rm_execution ‑ [libs/PHDF5] MPI C binary runs under resource manager (slurm/intel/mpich)
rm_execution ‑ [libs/PHDF5] MPI C binary runs under resource manager (slurm/intel/openmpi5)
rm_execution ‑ [libs/PHDF5] Parallel Fortran binary runs under resource manager (slurm/gnu14/mpich)
rm_execution ‑ [libs/PHDF5] Parallel Fortran binary runs under resource manager (slurm/gnu14/openmpi5)
rm_execution ‑ [libs/PHDF5] Parallel Fortran binary runs under resource manager (slurm/intel/mpich)
rm_execution ‑ [libs/PHDF5] Parallel Fortran binary runs under resource manager (slurm/intel/openmpi5)
test_module ‑ [HDF5] Verify HDF5 module is loaded and matches rpm version (gnu14)
test_module ‑ [HDF5] Verify HDF5 module is loaded and matches rpm version (intel)
…

♻️ This comment has been updated with latest results.

@adrianreber
Member

Thanks a lot. CI seems to be happy, and looking at the GitHub Actions artefacts I can see an RPM with the documentation and recipe included. I will try it on one of our test clusters tomorrow and let you know.

@adrianreber
Member

Running this in our CI system, I see the following errors:

+ perl -pi -e 's/warewulf/${sms_name}/' /srv/warewulf/overlays/host/etc/hosts.ww
Can't open /srv/warewulf/overlays/host/etc/hosts.ww: No such file or directory.
+ perl -pi -e 's/warewulf/${sms_name}/' /srv/warewulf/overlays/general/etc/hosts.ww
Can't open /srv/warewulf/overlays/general/etc/hosts.ww: No such file or directory.
+ systemctl enable --now warewulfd
Created symlink /etc/systemd/system/multi-user.target.wants/warewulfd.service → /usr/lib/systemd/system/warewulfd.service.
+ wwctl overlay build
+ wwctl configure --all
Building overlay for openhpc-lenovo-jenkins-sms: host
Enabling and restarting the DHCP services
Job for dhcpd.service failed because the control process exited with error code.
See "systemctl status dhcpd.service" and "journalctl -xeu dhcpd.service" for details.
ERROR  : failed to start: failed to run start cmd: exit status 1
+ wwctl container import docker://ghcr.io/warewulf/warewulf-rockylinux:BOSTAG rocky-9.4 --syncuser
ERROR  : could not import image: reading manifest BOSTAG in ghcr.io/warewulf/warewulf-rockylinux: manifest unknown
ERROR: could not import image: reading manifest BOSTAG in ghcr.io/warewulf/warewulf-rockylinux: manifest unknown
+ '*containerinstall*'
/opt/ohpc/pub/doc/recipes/rocky9/x86_64/warewulf4/slurm/recipe.sh: line 130: *containerinstall*: command not found
+ export CHROOT=/srv/warewulf/chroots/rocky-9.4/rootfs
+ CHROOT=/srv/warewulf/chroots/rocky-9.4/rootfs
+ '*containerinstall*'
/opt/ohpc/pub/doc/recipes/rocky9/x86_64/warewulf4/slurm/recipe.sh: line 136: *containerinstall*: command not found
+ '*containerinstall*'
/opt/ohpc/pub/doc/recipes/rocky9/x86_64/warewulf4/slurm/recipe.sh: line 138: *containerinstall*: command not found
+ [[ 0 -eq 1 ]]
+ '[' 2 -gt 4 ']'
+ [[ 0 -eq 1 ]]
+ [[ 0 -eq 1 ]]
+ perl -pi -e 's/# End of file/\* soft memlock unlimited\n$&/s' /etc/security/limits.conf
+ perl -pi -e 's/# End of file/\* hard memlock unlimited\n$&/s' /etc/security/limits.conf
+ perl -pi -e 's/# End of file/\* soft memlock unlimited\n$&/s' /srv/warewulf/chroots/rocky-9.4/rootfs/etc/security/limits.conf
Can't open /srv/warewulf/chroots/rocky-9.4/rootfs/etc/security/limits.conf: No such file or directory.
+ perl -pi -e 's/# End of file/\* hard memlock unlimited\n$&/s' /srv/warewulf/chroots/rocky-9.4/rootfs/etc/security/limits.conf
Can't open /srv/warewulf/chroots/rocky-9.4/rootfs/etc/security/limits.conf: No such file or directory.
+ [[ '' -eq 1 ]]
+ [[ '' -eq 1 ]]
+ echo 'module(load="imudp")'
+ echo 'input(type="imudp" port="514")'
+ systemctl restart rsyslog
*.* action(type="omfwd" Target="10.241.58.134" Port="514" perl -pi -e s/^\*\.info/\#\*\.info/ /srv/warewulf/chroots/rocky-9.4/rootfs/etc/rsyslog.conf
+ echo '*.* action(type="omfwd" Target="10.241.58.134" Port="514" perl' -pi -e 's/^\*\.info/\#\*\.info/' /srv/warewulf/chroots/rocky-9.4/rootfs/etc/rsyslog.conf
+ perl -pi -e 's/^authpriv/\#authpriv/' /srv/warewulf/chroots/rocky-9.4/rootfs/etc/rsyslog.conf
Can't open /srv/warewulf/chroots/rocky-9.4/rootfs/etc/rsyslog.conf: No such file or directory.
+ perl -pi -e 's/^mail/\#mail/' /srv/warewulf/chroots/rocky-9.4/rootfs/etc/rsyslog.conf
Can't open /srv/warewulf/chroots/rocky-9.4/rootfs/etc/rsyslog.conf: No such file or directory.
+ perl -pi -e 's/^cron/\#cron/' /srv/warewulf/chroots/rocky-9.4/rootfs/etc/rsyslog.conf
Can't open /srv/warewulf/chroots/rocky-9.4/rootfs/etc/rsyslog.conf: No such file or directory.
+ perl -pi -e 's/^uucp/\#uucp/' /srv/warewulf/chroots/rocky-9.4/rootfs/etc/rsyslog.conf
Can't open /srv/warewulf/chroots/rocky-9.4/rootfs/etc/rsyslog.conf: No such file or directory.

and

+ dnf -y --installroot=/srv/warewulf/chroots/rocky-9.4/rootfs install nhc-ohpc
Unable to detect release version (use '--releasever' to specify release version)
AppStream                                       119 MB/s | 8.0 MB     00:00    
BaseOS                                           95 MB/s | 2.3 MB     00:00    
CRB                                              88 MB/s | 2.4 MB     00:00    
OpenHPC-3 - Base                                 44 kB/s | 3.6 MB     01:24    
OpenHPC-3 - Updates                              90 kB/s | 3.3 MB     00:37    
Rolling development build for 3.2               974 kB/s | 3.3 MB     00:03    
Extra Packages for Enterprise Linux $releasever  33 kB/s | 142 kB     00:04    
Errors during downloading metadata for repository 'epel':
  - Status code: 404 for https://mirrors.fedoraproject.org/metalink?repo=epel-$releasever&arch=x86_64&infra=$infra&content=pub/rocky (IP: 10.241.58.130)
  - Status code: 404 for https://mirrors.fedoraproject.org/metalink?repo=epel-$releasever&arch=x86_64&infra=$infra&content=pub/rocky&countme=1 (IP: 10.241.58.130)
Error: Failed to download metadata for repo 'epel': Cannot prepare internal mirrorlist: Status code: 404 for https://mirrors.fedoraproject.org/metalink?repo=epel-$releasever&arch=x86_64&infra=$infra&content=pub/rocky (IP: 10.241.58.130)
+ echo HealthCheckProgram=/usr/sbin/nhc
+ echo HealthCheckInterval=300
+ [[ 1 -eq 1 ]]
+ export 'kargs=acpi_pad.disable=1 intel_pstate=disable'
+ kargs='acpi_pad.disable=1 intel_pstate=disable'
+ [[ 1 -eq 1 ]]
+ dnf -y --installroot=/srv/warewulf/chroots/rocky-9.4/rootfs install kmod-msr-safe-ohpc
Unable to detect release version (use '--releasever' to specify release version)
Extra Packages for Enterprise Linux $releasever  33 kB/s | 142 kB     00:04    
Errors during downloading metadata for repository 'epel':
  - Status code: 404 for https://mirrors.fedoraproject.org/metalink?repo=epel-$releasever&arch=x86_64&infra=$infra&content=pub/rocky (IP: 10.241.58.130)
Error: Failed to download metadata for repo 'epel': Cannot prepare internal mirrorlist: Status code: 404 for https://mirrors.fedoraproject.org/metalink?repo=epel-$releasever&arch=x86_64&infra=$infra&content=pub/rocky (IP: 10.241.58.130)
+ dnf -y --installroot=/srv/warewulf/chroots/rocky-9.4/rootfs install msr-safe-ohpc
Unable to detect release version (use '--releasever' to specify release version)
Extra Packages for Enterprise Linux $releasever  33 kB/s | 142 kB     00:04    
Errors during downloading metadata for repository 'epel':
  - Status code: 404 for https://mirrors.fedoraproject.org/metalink?repo=epel-$releasever&arch=x86_64&infra=$infra&content=pub/rocky (IP: 10.241.58.130)
Error: Failed to download metadata for repo 'epel': Cannot prepare internal mirrorlist: Status code: 404 for https://mirrors.fedoraproject.org/metalink?repo=epel-$releasever&arch=x86_64&infra=$infra&content=pub/rocky (IP: 10.241.58.130)
+ dnf -y --installroot=/srv/warewulf/chroots/rocky-9.4/rootfs install msr-safe-slurm-ohpc
Unable to detect release version (use '--releasever' to specify release version)
Extra Packages for Enterprise Linux $releasever  28 kB/s | 142 kB     00:04    
Errors during downloading metadata for repository 'epel':
  - Status code: 404 for https://mirrors.fedoraproject.org/metalink?repo=epel-$releasever&arch=x86_64&infra=$infra&content=pub/rocky (IP: 10.241.58.130)
Error: Failed to download metadata for repo 'epel': Cannot prepare internal mirrorlist: Status code: 404 for https://mirrors.fedoraproject.org/metalink?repo=epel-$releasever&arch=x86_64&infra=$infra&content=pub/rocky (IP: 10.241.58.130)
+ wwctl overlay import generic /etc/subuid
+ wwctl overlay import generic /etc/subgid

and

+ wwctl container build rocky-9.4
Created image for VNFS container rocky-9.4: /srv/warewulf/provision/container/rocky-9.4.img
Compressed image for VNFS container rocky-9.4: /srv/warewulf/provision/container/rocky-9.4.img.gz
+ wwctl overlay build
+ (( i=0 ))
+ (( i<2 ))
+ wwctl node add --discoverable=yes --container=BOSVER --ipaddr=10.241.58.132 --netmask=255.255.255.240 c0
Added node: c0
+ (( i++ ))
+ (( i<2 ))
+ wwctl node add --discoverable=yes --container=BOSVER --ipaddr=10.241.58.133 --netmask=255.255.255.240 c1
Added node: c1
+ (( i++ ))
+ (( i<2 ))
+ wwctl overlay build
Building system overlays for c0: [wwinit]
Created image for overlay c0/[wwinit]: /srv/warewulf/provision/overlays/c0/__SYSTEM__.img
Compressed image for overlay c0/[wwinit]: /srv/warewulf/provision/overlays/c0/__SYSTEM__.img.gz
Building runtime overlays for c0: [generic]
WARN   : Template requires file(s) from non-existant container: BOSVER:/etc/group
WARN   : Template requires file(s) from non-existant container: BOSVER:/etc/passwd
Created image for overlay c0/[generic]: /srv/warewulf/provision/overlays/c0/__RUNTIME__.img
Compressed image for overlay c0/[generic]: /srv/warewulf/provision/overlays/c0/__RUNTIME__.img.gz
Building system overlays for c1: [wwinit]
Created image for overlay c1/[wwinit]: /srv/warewulf/provision/overlays/c1/__SYSTEM__.img
Compressed image for overlay c1/[wwinit]: /srv/warewulf/provision/overlays/c1/__SYSTEM__.img.gz
Building runtime overlays for c1: [generic]
WARN   : Template requires file(s) from non-existant container: BOSVER:/etc/group
WARN   : Template requires file(s) from non-existant container: BOSVER:/etc/passwd
Created image for overlay c1/[generic]: /srv/warewulf/provision/overlays/c1/__RUNTIME__.img
Compressed image for overlay c1/[generic]: /srv/warewulf/provision/overlays/c1/__RUNTIME__.img.gz
+ wwctl configure --all
Building overlay for openhpc-lenovo-jenkins-sms: host
Enabling and restarting the DHCP services
Job for dhcpd.service failed because the control process exited with error code.
See "systemctl status dhcpd.service" and "journalctl -xeu dhcpd.service" for details.
ERROR  : failed to start: failed to run start cmd: exit status 1
+ systemctl enable --now munge
Created symlink /etc/systemd/system/multi-user.target.wants/munge.service → /usr/lib/systemd/system/munge.service.
+ systemctl enable --now slurmctld
Created symlink /etc/systemd/system/multi-user.target.wants/slurmctld.service → /usr/lib/systemd/system/slurmctld.service.
+ useradd -m test
+ wwsh file resync passwd shadow group
/opt/ohpc/pub/doc/recipes/rocky9/x86_64/warewulf4/slurm/recipe.sh: line 400: wwsh: command not found

Also, is this line still correct?

pdsh -w ${compute_prefix}[1-${num_computes}] /warewulf/bin/wwgetfiles

The recipe is also missing a line to wait for the compute nodes to get ready like with warewulf 3:

# -------------------------------------------------------------
# Allow for optional sleep to wait for provisioning to complete
# -------------------------------------------------------------
sleep ${provision_wait}

@adrianreber
Member

The dhcp setup looks wrong:

subnet 10.0.0.0 netmask 255.255.255.240 {
    max-lease-time 120;
    range 10.241.58.133 10.0.1.255;
    next-server 10.241.58.134;
}
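
A minimal sketch of how such a mismatch could be caught before dhcpd is restarted: convert the addresses to integers and compare the masked values. The helper names `ip_to_int` and `in_subnet` are hypothetical and not part of the recipe or of Warewulf; the addresses are the ones from the fragment above.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the recipe): check whether an IPv4
# address belongs to a subnet given as network address + netmask.

# Convert a dotted-quad address to a 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 if address $1 lies inside subnet $2 with netmask $3.
in_subnet() {
    local ip net mask
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    (( (ip & mask) == (net & mask) ))
}

# Values taken from the dhcpd.conf fragment above.
if in_subnet 10.241.58.133 10.0.0.0 255.255.255.240; then
    echo "range start is inside the declared subnet"
else
    echo "range start is outside the declared subnet"
fi
```

With the values above the check reports the range start as outside the subnet, which is consistent with dhcpd refusing to start.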

@MiddelkoopT
Contributor Author

subnet 10.0.0.0 netmask 255.255.255.240 {
max-lease-time 120;
range 10.241.58.133 10.0.1.255;
next-server 10.241.58.134;
}

I think I fixed this (a stupid typo; the fix somehow got reverted).

One thing that needs to be defined (I assume it is) is internal_network, which is the subnet network address (i.e. 10.1.0.0 for a CIDR of 10.1.0.0/16), alongside internal_netmask (in the example, 255.255.0.0).

@MiddelkoopT
Contributor Author

There were some serious issues with generating recipe.sh (I realized I could easily generate this file and verify the output), and I fixed a bug in parse_doc.pl in the handling of HEREDOC lines. Most of the errors reported came from the document not being parsed correctly early in the install process. I think I have fixed these issues and pushed the fixes to the branch.

@MiddelkoopT
Contributor Author

MiddelkoopT commented Oct 23, 2024

# Allow for optional sleep to wait for provisioning to complete
# -------------------------------------------------------------
sleep ${provision_wait}

This has been put in the code just after provisioning the nodes.

@adrianreber
Member

Our test setup uses the following input file: https://github.com/adrianreber/ohpc-infrastructure/blob/main/ansible/roles/test/templates/lenovo.mapping (It is converted, but the names are pretty close to the ones used in the recipe).

I do not see internal_network anywhere so far.

@mslacken
Contributor

If munge and slurm are used, I strongly recommend using --syncuser at container import time. (I am typing this on my phone; if the flag is already there, you can ignore this comment.)

@adrianreber
Member

If munge and slurm are used, I strongly recommend using --syncuser at container import time. (I am typing this on my phone; if the flag is already there, you can ignore this comment.)

Thanks. I just had a look and --syncuser is part of the PR.

@adrianreber
Member

I still see a lot of errors:

+ perl -pi -e 's/warewulf/${sms_name}/' /srv/warewulf/overlays/host/etc/hosts.ww
Can't open /srv/warewulf/overlays/host/etc/hosts.ww: No such file or directory.
+ perl -pi -e 's/warewulf/${sms_name}/' /srv/warewulf/overlays/general/etc/hosts.ww
Can't open /srv/warewulf/overlays/general/etc/hosts.ww: No such file or directory.

The path is wrong. I see that a file called /srv/warewulf/overlays/host/rootfs/etc/hosts.ww exists (it contains an extra rootfs).

The second path is also wrong. I see this file /srv/warewulf/overlays/generic/rootfs/etc/hosts.ww.

dhcpd.conf is still wrong:

+ wwctl configure --all
Building overlay for openhpc-lenovo-jenkins-sms: host
Enabling and restarting the DHCP services
Job for dhcpd.service failed because the control process exited with error code.
See "systemctl status dhcpd.service" and "journalctl -xeu dhcpd.service" for details.
ERROR  : failed to start: failed to run start cmd: exit status 1

The subnet definition is still wrong:

subnet 10.0.0.0 netmask 255.255.255.240 {
    max-lease-time 120;
    range 10.241.58.132 10.241.58.133;
    next-server 10.241.58.134;
}

The variable internal_network does not exist anywhere in OpenHPC right now. If it is necessary, you should add it to docs/recipes/install/rocky9/input.local.template. That said, it feels like you should be able to calculate it from the SMS IP and the netmask.
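
A minimal sketch of that calculation in plain bash, assuming the recipe's `sms_ip` and `internal_netmask` variables hold dotted-quad values. The function name `network_address` is hypothetical; the network address is obtained by AND-ing each octet of the address with the matching octet of the netmask.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of OpenHPC): derive the network address
# from an IPv4 address and a dotted-quad netmask.
network_address() {
    local ip=$1 mask=$2
    local -a i m
    IFS=. read -r -a i <<< "$ip"
    IFS=. read -r -a m <<< "$mask"
    printf '%d.%d.%d.%d\n' \
        "$(( i[0] & m[0] ))" "$(( i[1] & m[1] ))" \
        "$(( i[2] & m[2] ))" "$(( i[3] & m[3] ))"
}

# Example values from the CI logs in this thread.
sms_ip=10.241.58.134
internal_netmask=255.255.255.240
internal_network=$(network_address "$sms_ip" "$internal_netmask")
echo "$internal_network"   # prints 10.241.58.128
```

This keeps the input file unchanged, at the cost of a few extra lines in the recipe.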

The next error is:

+ wwctl container import docker://ghcr.io/warewulf/warewulf-rockylinux:BOSTAG rocky-9.4 --syncuser
ERROR  : could not import image: reading manifest BOSTAG in ghcr.io/warewulf/warewulf-rockylinux: manifest unknown
ERROR: could not import image: reading manifest BOSTAG in ghcr.io/warewulf/warewulf-rockylinux: manifest unknown

and everything else seems to be triggered by that error.
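
Since a single unexpanded placeholder cascades into many later failures, a pre-flight check on the generated script might help. The sketch below is hypothetical and not part of parse_doc.pl; the token list only covers the placeholders that show up in the logs of this thread (BOSTAG, BOSVER and `*containerinstall*`).

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of OpenHPC): print the numbered
# lines of a generated script that still contain template placeholders.
# Returns 0 when at least one placeholder is found, like grep does.
find_placeholders() {
    grep -nE 'BOSTAG|BOSVER|\*containerinstall\*' "$1"
}

# Usage sketch:
#   if find_placeholders recipe.sh; then
#       echo "recipe.sh still contains unexpanded placeholders" >&2
#       exit 1
#   fi
```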

Your call to sleep to wait for provisioning is before the line to reboot the compute nodes. It should be after it.

SSH keys do not seem to get imported into the compute node image.

@adrianreber
Member

I was able to adapt the recipe locally to make it work a bit more:

--- /opt/ohpc/pub/doc/recipes/rocky9/x86_64/warewulf4/slurm/recipe.sh   2024-10-23 20:09:06.000000000 +0000
+++ /tmp/recipe.sh      2024-10-24 08:10:56.115799438 +0000
@@ -117,8 +117,8 @@
 perl -pi -e "s/range start:.*/range start: ${c_ip[0]}/" /etc/warewulf/warewulf.conf
 perl -pi -e "s/range end:.*/range end: ${c_ip[$((num_computes-1))]}/" /etc/warewulf/warewulf.conf
 perl -pi -e "s/mount: false/mount: true/" /etc/warewulf/warewulf.conf
-perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/host/etc/hosts.ww
-perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/general/etc/hosts.ww
+perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/host/rootfs/etc/hosts.ww
+perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/generic/rootfs/etc/hosts.ww
 systemctl enable --now warewulfd
 wwctl overlay build
 wwctl configure --all
@@ -126,7 +126,7 @@
 # -------------------------------------------------
 # Create compute image for Warewulf (Section 3.8.1)
 # -------------------------------------------------
-wwctl container import docker://ghcr.io/warewulf/warewulf-rockylinux:BOSTAG rocky-9.4 --syncuser
+wwctl container import docker://ghcr.io/warewulf/warewulf-rockylinux:9 rocky-9.4 --syncuser
 wwctl container exec rocky-9.4 /bin/bash <<- EOF
 dnf -y install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm
 dnf -y update
@@ -302,6 +302,7 @@
 wwctl overlay import generic /etc/munge/munge.key
 wwctl overlay chown generic /etc/munge/munge.key $(id -u munge) $(id -g munge)
 wwctl overlay chown generic /etc/munge $(id -u munge) $(id -g munge)
+cat /root/.ssh/*.pub | wwctl overlay import generic <(cat) /root/.ssh/authorized_keys

 if [[ ${enable_ipoib} -eq 1 ]];then
      wwctl overlay mkdir generic /etc/sysconfig/network-scripts/

The sleep is still in the wrong place. The SSH key import into the compute nodes was a part of warewulf 3. Not sure if this is supposed to happen automatically in warewulf 4.

But the compute nodes do not come up. If I see it correctly, they are making a lot of DHCP requests and the DHCP server is not happy about that. The range which is configured now is only two hosts. I see the following log from DHCP:

dhcpd[71950]: DHCPDISCOVER from f4:c7:aa:44:41:4a via ens2f0
dhcpd[71950]: ICMP Echo reply while lease 10.241.58.133 valid.
dhcpd[71950]: Abandoning IP address 10.241.58.133: pinged before offer
dhcpd[71950]: DHCPDISCOVER from f4:c7:aa:44:41:4a via ens2f0
dhcpd[71950]: DHCPOFFER on 10.241.58.132 to f4:c7:aa:44:41:4a via ens2f0
dhcpd[71950]: DHCPREQUEST for 10.241.58.132 (10.241.58.134) from f4:c7:aa:44:41:4a via ens2f0
dhcpd[71950]: DHCPACK on 10.241.58.132 to f4:c7:aa:44:41:4a via ens2f0
in.tftpd[72477]: Client ::ffff:10.241.58.132 finished /warewulf/ipxe-snponly-x86_64.efi
dhcpd[71950]: Reclaiming abandoned lease 10.241.58.133.
dhcpd[71950]: DHCPDISCOVER from f4:c7:aa:44:41:4a via ens2f0
dhcpd[71950]: ICMP Echo reply while lease 10.241.58.133 valid.
dhcpd[71950]: Abandoning IP address 10.241.58.133: pinged before offer
dhcpd[71950]: Reclaiming abandoned lease 10.241.58.133.
dhcpd[71950]: DHCPDISCOVER from f4:c7:aa:44:41:4a via ens2f0
dhcpd[71950]: ICMP Echo reply while lease 10.241.58.133 valid.
dhcpd[71950]: Abandoning IP address 10.241.58.133: pinged before offer
dhcpd[71950]: Reclaiming abandoned lease 10.241.58.133.
dhcpd[71950]: DHCPDISCOVER from f4:c7:aa:44:41:4a via ens2f0
dhcpd[71950]: ICMP Echo reply while lease 10.241.58.133 valid.
dhcpd[71950]: Abandoning IP address 10.241.58.133: pinged before offer
dhcpd[71950]: Reclaiming abandoned lease 10.241.58.133.
dhcpd[71950]: DHCPDISCOVER from f4:c7:aa:44:41:4a via ens2f0
dhcpd[71950]: ICMP Echo reply while lease 10.241.58.133 valid.
dhcpd[71950]: Abandoning IP address 10.241.58.133: pinged before offer

The successful DHCP transaction is from the BIOS; it then fails in iPXE to get the same IP address a second time. Not sure how this is supposed to work. I tried switching to a static DHCP setup in /etc/warewulf/warewulf.conf. That seems to work.
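
To see quickly which addresses dhcpd keeps abandoning, the log can be summarized per address. This is only a sketch: the function name `count_abandoned` is hypothetical, and it assumes log lines in the format shown above.

```bash
#!/usr/bin/env bash
# Hypothetical log helper: count "Abandoning IP address" events per address
# in dhcpd log lines read from stdin.
count_abandoned() {
    awk '/Abandoning IP address/ {
             for (f = 1; f < NF; f++) {
                 if ($f == "address") {
                     ip = $(f + 1)
                     sub(/:$/, "", ip)   # strip the trailing colon
                     seen[ip]++
                 }
             }
         }
         END { for (ip in seen) print ip, seen[ip] }'
}

# Usage sketch (assumes dhcpd logs to the systemd journal):
#   journalctl -u dhcpd --no-pager | count_abandoned
```

Searching for the word "address" rather than using a fixed field number keeps the helper working when the log lines carry a timestamp prefix.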

One more change that is necessary for the MAC address to be added to dhcpd.conf:

# Add hosts to cluster
for ((i=0; i<$num_computes; i++)) ; do
   wwctl node add --discoverable=yes --container=rocky9.4 \
   --ipaddr=${c_ip[$i]} --hwaddr=${c_mac[i]} --netmask=${internal_netmask} ${c_name[i]}
done
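
Because a missing or malformed MAC address only shows up later as a node that never gets a lease, the inputs could be checked before the loop runs. The sketch below is hypothetical; it assumes the recipe's `num_computes`, `c_name`, `c_ip` and `c_mac` variables are already set.

```bash
#!/usr/bin/env bash
# Hypothetical pre-check: verify that every compute node has a name, an IP
# and a well-formed MAC address before any node is added.

# Return 0 if $1 looks like a colon-separated 48-bit MAC address.
is_mac() {
    [[ $1 =~ ^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$ ]]
}

# Walk the first num_computes entries of the recipe arrays and stop at the
# first incomplete or malformed one.
check_node_inputs() {
    local i
    for (( i = 0; i < num_computes; i++ )); do
        if [[ -z ${c_name[i]:-} || -z ${c_ip[i]:-} ]]; then
            echo "node $i: missing name or IP" >&2
            return 1
        fi
        if ! is_mac "${c_mac[i]:-}"; then
            echo "node $i: invalid MAC '${c_mac[i]:-}'" >&2
            return 1
        fi
    done
}

# Usage sketch: run before the "Add hosts to cluster" loop.
#   check_node_inputs || exit 1
```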

I am not sure about the SSH keys in the previous comments, because the nodes never actually booted.

The node now gets an IP address in the BIOS and also during iPXE, but it fails to download some file via iPXE.

The error log on the warewulf side is:

[Thu Oct 24 08:53:18 UTC 2024] RECV   : hwaddr: f4:c7:aa:44:41:4a, ipaddr: 10.241.58.132:54600, stage: ipxe
[Thu Oct 24 08:53:18 UTC 2024] SERV   : stage_file '/etc/warewulf/ipxe/default.ipxe'
[Thu Oct 24 08:53:18 UTC 2024] SEND   :              c1: /etc/warewulf/ipxe/default.ipxe
[Thu Oct 24 08:53:18 UTC 2024] RECV   : hwaddr: f4:c7:aa:44:41:4a, ipaddr: 10.241.58.132:54600, stage: kernel
[Thu Oct 24 08:53:18 UTC 2024] ERROR  : No kernel found for container rocky9.4: could not find kernel version
[Thu Oct 24 08:53:18 UTC 2024] SERV   : stage_file ''
[Thu Oct 24 08:53:18 UTC 2024] ERROR  : No resource selected

[Screenshot; image not preserved.]

Ah, there is a typo in the container name. Let me try something and then write another, maybe less confusing, comment.

@adrianreber
Member

Okay, so this is the latest change for the recipe I have running:

--- /opt/ohpc/pub/doc/recipes/rocky9/x86_64/warewulf4/slurm/recipe.sh   2024-10-23 20:09:06.000000000 +0000
+++ /tmp/recipe.sh      2024-10-24 09:08:35.270878390 +0000
@@ -117,8 +117,8 @@
 perl -pi -e "s/range start:.*/range start: ${c_ip[0]}/" /etc/warewulf/warewulf.conf
 perl -pi -e "s/range end:.*/range end: ${c_ip[$((num_computes-1))]}/" /etc/warewulf/warewulf.conf
 perl -pi -e "s/mount: false/mount: true/" /etc/warewulf/warewulf.conf
-perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/host/etc/hosts.ww
-perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/general/etc/hosts.ww
+perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/host/rootfs/etc/hosts.ww
+perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/generic/rootfs/etc/hosts.ww
 systemctl enable --now warewulfd
 wwctl overlay build
 wwctl configure --all
@@ -126,7 +126,7 @@
 # -------------------------------------------------
 # Create compute image for Warewulf (Section 3.8.1)
 # -------------------------------------------------
-wwctl container import docker://ghcr.io/warewulf/warewulf-rockylinux:BOSTAG rocky-9.4 --syncuser
+wwctl container import docker://ghcr.io/warewulf/warewulf-rockylinux:9 rocky-9.4 --syncuser
 wwctl container exec rocky-9.4 /bin/bash <<- EOF
 dnf -y install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm
 dnf -y update
@@ -302,6 +302,7 @@
 wwctl overlay import generic /etc/munge/munge.key
 wwctl overlay chown generic /etc/munge/munge.key $(id -u munge) $(id -g munge)
 wwctl overlay chown generic /etc/munge $(id -u munge) $(id -g munge)
+cat /root/.ssh/*.pub | wwctl overlay import generic <(cat) /root/.ssh/authorized_keys
 
 if [[ ${enable_ipoib} -eq 1 ]];then
      wwctl overlay mkdir generic /etc/sysconfig/network-scripts/
@@ -315,20 +316,14 @@
 wwctl overlay build
 # Add hosts to cluster
 for ((i=0; i<$num_computes; i++)) ; do
-   wwctl node add --discoverable=yes --container=rocky9.4 \
-   --ipaddr=${c_ip[$i]} --netmask=${internal_netmask} ${compute_prefix}$i
+   wwctl node add --discoverable=yes --container=rocky-9.4 \
+   --ipaddr=${c_ip[$i]} --hwaddr=${c_mac[i]} --netmask=${internal_netmask} ${c_name[i]}
 done
 wwctl overlay build
 wwctl configure --all
 # Enable and start munge and slurmctld (Cont.)
 systemctl enable --now munge
 systemctl enable --now slurmctld
-
-# -------------------------------------------------------------
-# Allow for optional sleep to wait for provisioning to complete
-# -------------------------------------------------------------
-sleep ${provision_wait}
-
 # Optionally, add arguments to bootstrap kernel
 if [[ ${enable_kargs} -eq 1 ]]; then
 wwctl node set --yes --kernelargs="${kargs}" "${compute_regex}"
@@ -338,9 +333,16 @@
 # Boot compute nodes (Section 3.10)
 # ---------------------------------
 for ((i=0; i<${num_computes}; i++)) ; do
-   ipmitool -E -I lanplus -H ${c_bmc[$i]} -U ${bmc_username} -P ${bmc_password} chassis power reset
+  # ignore this change ipmitool -E -I lanplus -H ${c_bmc[$i]} -U ${bmc_username} -P ${bmc_password} chassis power reset
 done
 
+
+# -------------------------------------------------------------
+# Allow for optional sleep to wait for provisioning to complete
+# -------------------------------------------------------------
+sleep ${provision_wait}
+
+
 # ---------------------------------------
 # Install Development Tools (Section 4.1)
 # ---------------------------------------

I also changed the warewulf configuration to static dhcp setup.

I still cannot boot the compute node. It fails with:

[Screenshot of the failed boot; image not preserved.]

@adrianreber
Member

Now the compute nodes are booting, but I cannot log in. Not sure why. How can I see the FS the compute nodes are seeing? ssh on the compute nodes doesn't accept any of the SSH keys.

@adrianreber
Member

Okay, now I am able to ssh into the compute nodes. So, the following changes to the recipe are necessary:

-perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/host/etc/hosts.ww
-perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/general/etc/hosts.ww
+perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/host/rootfs/etc/hosts.ww
+perl -pi -e "s/warewulf/\${sms_name}/" /srv/warewulf/overlays/generic/rootfs/etc/hosts.ww

BOSTAG handling needs to be fixed.

-   wwctl node add --discoverable=yes --container=rocky9.4 \
-   --ipaddr=${c_ip[$i]} --netmask=${internal_netmask} ${compute_prefix}$i
+   wwctl node add --discoverable=yes --container=rocky-9.4 \
+   --ipaddr=${c_ip[$i]} --hwaddr=${c_mac[i]} --netmask=${internal_netmask} ${c_name[i]}

sleep ${provision_wait} after rebooting the compute nodes
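
As an alternative to a fixed sleep (not what the recipe does), the wait could poll until the nodes answer and give up after a timeout. The helper name `wait_until` is hypothetical; in the usage sketch, `provision_wait` is reused as the timeout and passwordless ssh to the node is assumed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or a timeout
# (in seconds) expires. Returns 0 on success, 1 on timeout.
wait_until() {
    local timeout=$1 interval=$2
    shift 2
    local deadline=$(( SECONDS + timeout ))
    until "$@"; do
        if (( SECONDS >= deadline )); then
            return 1
        fi
        sleep "$interval"
    done
}

# Usage sketch (after the ipmitool reset loop):
#   wait_until "${provision_wait}" 10 ssh -o ConnectTimeout=5 "${c_name[0]}" true \
#       || echo "${c_name[0]} did not come up in time" >&2
```

Polling returns as soon as the node is reachable, while the fixed sleep always waits the full time.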

@adrianreber
Member

adrianreber commented Oct 24, 2024

Also remove the \ from:

echo "server \${sms_ip} iburst" | wwctl overlay import generic <(cat) /etc/chrony.conf
wwctl overlay import generic <(echo SLURMD_OPTIONS="--conf-server \${sms_ip}") /etc/sysconfig/slurmd

@adrianreber
Member

Is there a way to update a compute node's content without rebooting, like running /warewulf/bin/wwgetfiles in warewulf 3?

@MiddelkoopT
Contributor Author

I believe I have addressed all the errors noted above. I have changed my local workflow to use recipe.sh. I believe I have addressed the rest of the comments/questions as follows:

To update a compute node's content without rebooting (like /warewulf/bin/wwgetfiles), run:

wwctl overlay build

This rebuilds the overlays; the changes will be pulled by the nodes within about a minute and applied. Note: the nodes must have been defined previously with wwctl node add.

I dropped the slurm_pam as Warewulf 4 should manage the ssh keys. The failures before may have caused ssh not to be configured properly. If one of the commands in wwctl configure --all fails, the remainder will not be run.

A total rebuild/configure is run by

wwctl container build rocky-9.4
wwctl overlay build
wwctl configure --all
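
Because a failure in one step can leave later steps undone, the three commands above could be wrapped so that the script stops at the first failure and says which step it was. This is a sketch only; the function name `rebuild_all` is hypothetical, and the image name defaults to the rocky-9.4 used in this recipe.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the three commands above: stop at the first
# failing step and report which one it was.
rebuild_all() {
    local image=${1:-rocky-9.4}
    wwctl container build "$image" || { echo "container build failed" >&2; return 1; }
    wwctl overlay build            || { echo "overlay build failed" >&2; return 1; }
    wwctl configure --all          || { echo "configure failed" >&2; return 1; }
}

# Usage sketch:
#   rebuild_all rocky-9.4 || exit 1
```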

Fixes in this commit:

  • Fixed hosts.ww overlay paths
  • Compute the netmask directly with Python3
  • Fix BOSTAG and remove its use from parse_doc.pl
  • wwctl configure --all should generate the authorized_keys, it will
    not if one of the commands fails (see wwctl configure --help)
  • Enabling static dhcp and remove --discoverable
  • Added hwaddr for wwctl node add and remove discoverable
  • Fixed extra \ escaping.
  • Moved the provisioning sleep to after resetting the nodes.
  • Fixed install
  • Fixed node name off-by-one
  • Issues with multi-line commands and replacing BOSVER in scripts,
    hardcoded rocky-9.4

@adrianreber
Member

I still see these errors:

+ perl -pi -e s/warewulf/openhpc-lenovo-jenkins-sms/ /srv/warewulf/overlays/host/etc/hosts.ww
Can't open /srv/warewulf/overlays/host/etc/hosts.ww: No such file or directory.
+ perl -pi -e s/warewulf/openhpc-lenovo-jenkins-sms/ /srv/warewulf/overlays/generic/etc/hosts.ww
Can't open /srv/warewulf/overlays/generic/etc/hosts.ww: No such file or directory.

Also for warewulf3 and confluent I can just ssh c1 and the host key is automatically handled. Is there an option to do this also with warewulf4?

@adrianreber
Member

Using this recipe with the mentioned changes I am able to run a complete test run successfully:

https://repos.openhpc.community/results/3/3.2/2024-10-26-22-53-14-PASS-OHPC-3.2-rocky9.2-warewulf4-x86_64-slurm-23/

Looks like we are really close. One thing I am doing, which is not strictly necessary for the recipe but which we need for our tooling, is the following change:

$ sed -e 's_\({{$node.Id.Get}}{{end}}\)_{{$node.Id.Get}}.localdomain \1_g' -i /srv/warewulf/overlays/host/rootfs/etc/hosts.ww

Our tooling relies on the warewulf3 entries in /etc/hosts which used to be: <IP> <node>.localdomain <node> .... I can change the recipe for our CI runs to have that format but if you would put it into the recipe we could remove one step from the CI specific changes.
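
To illustrate what the substitution does, the sketch below applies the same sed expression to a single sample line. The sample line is a made-up stand-in for an entry in hosts.ww (the real template content may differ), and the function name `add_localdomain` is hypothetical.

```bash
#!/usr/bin/env bash
# The sed expression is the one from the comment above, applied to stdin
# instead of editing the template in place.
add_localdomain() {
    sed -e 's_\({{$node.Id.Get}}{{end}}\)_{{$node.Id.Get}}.localdomain \1_g'
}

# Made-up sample line, NOT the real hosts.ww content.
sample='{{$netdev.Ipaddr.Get}} {{$node.Id.Get}}{{end}}'
printf '%s\n' "$sample" | add_localdomain
```

The expression inserts the node id with a `.localdomain` suffix in front of the existing node id, which is what yields the `<IP> <node>.localdomain <node>` ordering once the template is rendered.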

@adrianreber
Member

Also, warewulf3 used to do the following: echo -e "Host *\n StrictHostKeyChecking=no\n" > /root/.ssh/config. Is this also part of warewulf4 somehow? Somewhere?

@mslacken
Contributor

Also, warewulf3 used to do the following: echo -e "Host *\n StrictHostKeyChecking=no\n" > /root/.ssh/config. Is this also part of warewulf4 somehow? Somewhere?

Yes, this is still part of warewulf. It is done by the host template ssh_setup.sh.ww, which ends up under /etc/profile.d, which means the script is executed after root logs in.
So you have to log off and log in again for this to take effect.

@adrianreber
Member

Also, warewulf3 used to do the following: echo -e "Host *\n StrictHostKeyChecking=no\n" > /root/.ssh/config. Is this also part of warewulf4 somehow? Somewhere?

Yes, this is still part of warewulf. It is done by the host template ssh_setup.sh.ww, which ends up under /etc/profile.d, which means the script is executed after root logs in. So you have to log off and log in again for this to take effect.

Ah, thanks. I will just source the file in our CI run.
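
For a CI setup that prefers not to depend on the login script, the warewulf3 behaviour quoted above could also be reproduced by hand. The sketch below is hypothetical and is not the content of ssh_setup.sh.ww; it only writes the setting when the file does not already mention it, so it is safe to run more than once.

```bash
#!/usr/bin/env bash
# Hypothetical helper, NOT the content of ssh_setup.sh.ww: add the
# warewulf3-style SSH client setting unless the file already has it.
ensure_no_strict_host_checking() {
    local config=${1:-/root/.ssh/config}
    mkdir -p "$(dirname "$config")"
    # -s keeps grep quiet when the file does not exist yet.
    if ! grep -qs 'StrictHostKeyChecking' "$config"; then
        printf 'Host *\n    StrictHostKeyChecking=no\n' >> "$config"
    fi
}

# Usage sketch:
#   ensure_no_strict_host_checking /root/.ssh/config
```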

@MiddelkoopT
Contributor Author

I found another fundamental issue with my dev workflow. I've been testing against the release version of OpenHPC, not the latest development branch and hence have been developing against Warewulf 4.4.x, not 4.5.5. I've updated my workflow and noted some of the same issues indicated above (I was not seeing them before). I'll work on updating the docs in the next day or two.

@MiddelkoopT
Contributor Author

Updated workflow to build against the latest build (OpenHPC 3.2) and hence Warewulf 4.5.

  • ssh should just work, as the authorized_keys and StrictHostKeyChecking=no are in the template (worked for me)
  • Added an ohpc_command to update the hosts.ww to add .localdomain as first hosts entry

I added a fix for a bug (I think) in Warewulf 4.5.5-320.ohpc.3.1, which does not have a next-server in the dhcpd.conf.ww file for the configuration that we use.

Changelog:

Target Warewulf 4.5

 * Updated container name to use ${c_name[i]}
 * removed `/bin/false` in the container build to remove scary warning
 * Fixed duplicate `template: default`
 * Fixed bad paths for hosts overlay (target Warewulf 4.5, not 4.4)
 * Added servername.localdomain first for nodes in /etc/hosts for CI testing
 * Add workaround for dhcpd.conf.ww not including a `next-server` for tftp clients
 * ssh keys do work - no need for slurm.conf, no need for authorized keys

@MiddelkoopT
Contributor Author

Here is how I set up the repos. Please let me know if I did this wrong...

# Local: Enable dev (use for 3.x branch/Warewulf 4.5)
dnf config-manager --add-repo http://obs.openhpc.community:82/OpenHPC3:/3.2:/Factory/EL_9/

# 3.1 Enable OpenHPC repository
dnf install -y http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm

% begin_ohpc_run
% ohpc_validation_newline
% ohpc_validation_comment Update /etc/hosts template to have ${hostname}.localdomain as the first host entry
% ohpc_command sed -e 's_\({{$node.Id.Get}}{{end}}\)_{{$node.Id.Get}}.localdomain \1_g' -i /srv/warewulf/overlays/host/rootfs/etc/hosts.ww
Member

Could you run a second command here:

bash /etc/profile.d/ssh_setup.sh

The script only runs when a new login happens, and we don't do that during the run of the recipe. So if SSH is set up early, that would help CI and our users.

Contributor Author

Adding - will commit in a few min, just committed a documentation update and missed this comment, expect a new commit.

@MiddelkoopT
Contributor Author

The variable internal_network does not exist anywhere in OpenHPC right now. If this is necessary you should add it to docs/recipes/install/rocky9/input.local.template. Although it feels like you should be able to calculate it from the SMS IP and the netmask.

I'm currently using

internal_network=$(python3 -c "import ipaddress; print(str(ipaddress.IPv4Interface('$sms_ip/$internal_netmask').network.network_address))")

but it extends outside the line on the page. Using a "\" at the end does not work properly since it's a one-liner and the doc parser expects a space in front of the next line after the "\", which is bad Python. Any suggestions?

@adrianreber
Member

The variable internal_network does not exist anywhere in OpenHPC right now. If this is necessary you should add it to docs/recipes/install/rocky9/input.local.template. Although it feels like you should be able to calculate it from the SMS IP and the netmask.

I'm currently using

internal_network=$(python3 -c "import ipaddress; print(str(ipaddress.IPv4Interface('$sms_ip/$internal_netmask').network.network_address))")

but it extends outside the line on the page. Using a "\" at the end does not work properly since it's a one-liner and the doc parser expects a space in front of the next line after the "\", which is bad Python. Any suggestions?

Just had a look and I see it. Not sure how to solve it. Maybe make the script a part of the warewulf4 RPM and then use the script. Also not the best solution.

A different approach could be:

$ dnf -y install netmask
$ netmask=$(netmask ${sms_ip}/${internal_netmask})
$ internal_network=${netmask%/*}
$ unset netmask
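
For comparison, the ipaddress calculation from the one-liner can also be spelled out over several short lines, which sidesteps the page-width problem if it lives in a small helper script. This is only a sketch: the function name and the sample address and netmask are made up for illustration, while sms_ip and internal_netmask correspond to the recipe variables.

```python
import ipaddress


def internal_network(sms_ip, internal_netmask):
    """Return the network address for the SMS IP and its netmask."""
    # IPv4Interface accepts "address/netmask" as well as "address/prefixlen".
    iface = ipaddress.IPv4Interface(f"{sms_ip}/{internal_netmask}")
    return str(iface.network.network_address)


# Hypothetical values, not taken from input.local.
print(internal_network("192.168.1.1", "255.255.255.0"))
```

This avoids the extra netmask package at the cost of shipping a script, which is the trade-off discussed above.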

@MiddelkoopT
Contributor Author

I believe I addressed all the issues raised in the comments. Ready for another round of testing/review.

  • Add pam_slurm back into recipe
  • Revert provision_wait changes to slurm_startup.tex and reset_computes.tex
  • Revert some ordering changes
  • Add /etc/profile.d/ssh_setup.sh logic for CI for all.
  • Rewrite internal_network generation
  • Add 90 second sleep for CI on test user creation for login information to sync.

@adrianreber
Member

This is ready to be merged. Thanks a lot.

I would like to see the commits squashed. We do not need your fixup commits in the repository, and this is one logical unit of work which could all be part of a single commit. Do you want to squash the commits and add a descriptive commit message, or should I do it via the GitHub interface? Both approaches work for me.

You could also add a Co-authored-by: GodloveD <[email protected]> after your Signed-off-by:.

Recipe for using Warewulf 4 as a provisioner.

Signed-off-by: Timothy Middelkoop <[email protected]>
Co-authored-by: GodloveD <[email protected]>
@MiddelkoopT
Contributor Author

Commits squashed and added Co-authored-by (I did not know that was a thing!). Thanks for all the guidance and help

@adrianreber
Member

Commits squashed and added Co-authored-by (I did not know that was a thing!). Thanks for all the guidance and help

Co-authored-by is a nice way to list co authors and it might give you a GitHub badge. I think. GitHub tracks it somehow.

@adrianreber adrianreber merged commit f31ad45 into openhpc:3.x Oct 31, 2024
20 checks passed
@adrianreber
Member

@MiddelkoopT The compute nodes do not have a DNS server configured. Any ideas what is missing here?

@adrianreber
Member

The default gateway is also missing.

@MiddelkoopT MiddelkoopT deleted the tm-warewulf4-doc branch October 31, 2024 21:50
@MiddelkoopT
Contributor Author

I missed that one. It seems that we may need to set some more input variables (confluent does) as my scripting hacks may get unreliable given interface ordering, number of resolvers, etc. This mostly works on my test environment (not fully tested yet - have some network issues to resolve first):

internal_cidr=$(netmask ${sms_ip}/${internal_netmask})
internal_network=${internal_cidr%/*}
dns_servers=($(nmcli -e=no -g IP4.DNS dev show))
ipv4_gateway=$(ip -json route show default | jq -r '.[1].gateway')

echo $internal_cidr $internal_network $ipv4_gateway $dns_servers

I used the same names as confluent does for the last two.
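
As an aside on the interface-ordering concern: picking the gateway by list position depends on the order in which default routes are reported. A rough sketch of selecting it by device name instead is below. It assumes the JSON printed by `ip -json route show default` is a list of objects with "gateway" and "dev" keys; the function name and the sample data are hypothetical.

```python
import json


def default_gateway(routes_json, dev=None):
    """Return the gateway of the first default route, optionally for one device."""
    for route in json.loads(routes_json):
        if "gateway" not in route:
            continue  # e.g. a default route without a next hop
        if dev is None or route.get("dev") == dev:
            return route["gateway"]
    return None


# Hypothetical sample shaped like `ip -json route show default` output.
sample = (
    '[{"dst": "default", "gateway": "10.0.2.2", "dev": "eth0"},'
    ' {"dst": "default", "gateway": "192.168.1.254", "dev": "eth1"}]'
)
print(default_gateway(sample, dev="eth1"))
```

With no dev argument the function falls back to the first route that has a gateway, which mirrors positional indexing but skips entries without one.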

And now we can set the network information (resolv.conf and default gateway), and also set the netmask here instead of using the node add (I think that was a bug at some point)

wwctl profile set -y default --netmask=${internal_netmask}
wwctl profile set -y default --gateway=${ipv4_gateway}
wwctl profile set -y default --netdev=default --nettagadd=DNS=${dns_servers}

wwctl profile list -a
wwctl overlay build

This is all done in the warewulf4_setup_centos.tex file.

This is only hand tested right now for feedback.

Questions:

  1. Should I create a new PR for these fixes?
  2. Should I create new input variables in input.tex and the template for these values or is this a longer discussion?
  3. I'd like to get rid of the ip link set dev eth1 up as well. Not sure why it's there (copied over)

@adrianreber
Member

In the file docs/recipes/install/rocky9/input.local.template we already define dns_servers and ipv4_gateway. No need to figure it out from the live system.

Should I create a new PR for these fixes?

Please do.

Should I create new input variables in input.tex and the template for these values or is this a longer discussion?

Ah, I was not aware of inputs.tex but it would make sense to list dns_servers and ipv4_gateway in inputs.tex, right.

I'd like to get rid of the ip link set dev eth1 up as well. Not sure why it's there (copied over)

I don't know the history of that line. It always fails in my CI runs, because we use the same interface for the internal and external network in our CI setup. Not sure what to do about that, but I would not change it for now.

@MiddelkoopT
Contributor Author

Created Pull Request #2053
