Add support for RockyLinux9 #353

Merged
merged 38 commits into from
Mar 14, 2024

sjpb
Collaborator

@sjpb sjpb commented Jan 24, 2024

Make the appliance compatible with RockyLinux 9-based images.

Note that the CI and CaaS environments will continue to use RL8 for now. CI runs on RL9 only if the PR branch name starts with rl9, or if RL9 is selected when running CI workflows manually.
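
The selection rule above can be sketched as a small predicate. This only illustrates the stated logic and is not the workflow's actual implementation: the function name `ci_os_version`, its parameters and the return strings are hypothetical, and the branch-name match is assumed to be case-sensitive.

```python
from typing import Optional


def ci_os_version(branch: str, manual_choice: Optional[str] = None) -> str:
    """Return the OS version CI would target under the rule described above.

    Hypothetical helper: names and return values are illustrative only.
    """
    # A manual workflow run that explicitly selects RL9 uses RL9.
    if manual_choice == "RL9":
        return "RL9"
    # A PR branch whose name starts with "rl9" uses RL9
    # (assumption: the match is case-sensitive).
    if branch.startswith("rl9"):
        return "RL9"
    # Everything else stays on RL8 for now.
    return "RL8"
```

For example, a branch named `rl9_v2` would select RL9 under this sketch, while `main` would stay on RL8 unless RL9 is chosen manually.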

Additional notes:

  • This also removes the Packer template for "environment-specific" compute image builds. No deployments are using this functionality and it has limitations which would be better addressed by making the fatimage usable on boot via cloud-init.
  • Podman commands in systemd units are changed to use --cgroup-manager=cgroupfs. This was shown to be the default in RL8; in RL9 the default is systemd, which leads to log warnings like:
    The cgroupv2 manager is set to systemd but there is no systemd user session available"
    For using systemd, you may need to login using an user session"
    Alternatively, you can enable lingering with: `loginctl enable-linger 1001` (possibly as root)"
    Falling back to --cgroup-manager=cgroupfs"
    "unlinkat /run/podman/libpod/tmp: permission denied"
    
    However, enabling user lingering causes mysql to fail to start at all; the relevant error is probably:
    [email protected]: Got notification message from PID 5388, but reception only permitted for main PID 4270
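
For illustration, the change to the Podman invocations can be sketched as building an argument list with the cgroup manager pinned. `--cgroup-manager` is a global Podman option, so it is placed before the subcommand. The helper name, container name and image below are hypothetical and are not taken from the appliance's unit files.

```python
from typing import List

# Values accepted by podman's global --cgroup-manager option.
VALID_CGROUP_MANAGERS = ("cgroupfs", "systemd")


def podman_run_argv(name: str, image: str, cgroup_manager: str = "cgroupfs") -> List[str]:
    """Build a `podman run` argument list with an explicit cgroup manager.

    Hypothetical helper for illustration; a real unit file would carry the
    equivalent command line in its ExecStart= setting.
    """
    if cgroup_manager not in VALID_CGROUP_MANAGERS:
        raise ValueError(f"unsupported cgroup manager: {cgroup_manager!r}")
    return [
        "podman",
        # Global option: must come before the "run" subcommand.
        f"--cgroup-manager={cgroup_manager}",
        "run",
        "--name",
        name,
        image,
    ]


# Example with placeholder container and image names.
argv = podman_run_argv("mysql", "example.invalid/mysql:latest")
```

Setting the option explicitly avoids relying on a distribution default that differs between RL8 and RL9.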
    

Replaces #323

@sjpb
Collaborator Author

sjpb commented Jan 24, 2024

FAILED Fat image build: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/7643335283

image

Edit: the image type was set incorrectly in Glance

@sjpb sjpb mentioned this pull request Jan 24, 2024
@sjpb
Collaborator Author

sjpb commented Jan 24, 2024

Tests at 0982f41 on a "local" cluster:

  • deploy OK
  • reboot control: services OK
  • hpctests OK
  • OOD shell: OK
  • slurm jobs: OK
  • hpl timehistory: OK
  • slurm exporter: OK
  • ondemand exporter: some missing data, see below
  • OOD jupyter OK
  • OOD desktop: failed, no dbus-launch command

ondemand exporter:

[rocky@rl9-login-0 ~]$ systemctl status ondemand_exporter.service
Jan 24 16:26:24 rl9-login-0.rl9.invalid ondemand_exporter[36589]: ts=2024-01-24T16:26:24.868Z caller=collector.go:171 level=error msg="Error collecting apache information" err="Get \"http://localhost:81/server-status\":>

[rocky@rl9-login-0 ~]$ cat /usr/lib/systemd/system/ondemand_exporter.service
Environment="APACHE_STATUS_URL=http://localhost:81/server-status"

[rocky@rl9-login-0 ~]$ curl localhost:9301/metrics
# shows this is working at least
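
To check which Apache status URL the exporter is configured to scrape, the `Environment=` line from the unit file can be parsed as in the sketch below. The function name is hypothetical, and `shlex` only approximates systemd's quoting rules, so treat this as a debugging aid rather than a faithful unit-file parser.

```python
import shlex
from typing import Dict


def parse_environment_line(line: str) -> Dict[str, str]:
    """Parse a systemd `Environment=` line into a dict of variables.

    Hypothetical helper: shlex quoting is close to, but not identical to,
    the quoting systemd applies to this setting.
    """
    prefix = "Environment="
    if not line.startswith(prefix):
        raise ValueError("not an Environment= line")
    env: Dict[str, str] = {}
    # The value may hold several space-separated, optionally quoted assignments.
    for assignment in shlex.split(line[len(prefix):]):
        key, sep, value = assignment.partition("=")
        if not sep:
            raise ValueError(f"malformed assignment: {assignment!r}")
        env[key] = value
    return env


# The line shown in the unit file above.
unit_line = 'Environment="APACHE_STATUS_URL=http://localhost:81/server-status"'
status_url = parse_environment_line(unit_line)["APACHE_STATUS_URL"]
```

The resulting `status_url` is the endpoint the exporter failed to fetch in the log line above, so it is the one to test directly when debugging.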

@sjpb
Collaborator Author

sjpb commented Jan 25, 2024

Fat image build: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/7653662893

Edit: currently failing due to CVMFS repo 503-ing
Edit: repo appears up, retrying

@sjpb
Collaborator Author

sjpb commented Jan 25, 2024

Checked locally that e5608d9 works on both a) a cluster with existing non-system users and b) a fresh image.

@sjpb
Collaborator Author

sjpb commented Mar 5, 2024

NB: Currently CI doesn't get past the os-manila-mount install task b/c the rpm-reef URL at https://download.ceph.com/ has been broken/renamed.

Edit: see https://tracker.ceph.com/issues/64718

@sjpb
Collaborator Author

sjpb commented Mar 6, 2024

Repos fixed, let's try again

@sjpb sjpb marked this pull request as ready for review March 7, 2024 09:47
@sjpb sjpb requested a review from a team as a code owner March 7, 2024 09:47
@sjpb sjpb force-pushed the rl9_v2 branch 2 times, most recently from c7144a4 to 3b23cf9 March 8, 2024 14:10
@sjpb sjpb removed the build ("Automatically build images") label Mar 13, 2024
@sjpb
Collaborator Author

sjpb commented Mar 13, 2024

Rebuilding fat image: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/8263123087

Built

  • openhpc-RL8-240313-1028-15f9ab38
  • openhpc-RL9-240313-1057-15f9ab38

@sjpb sjpb changed the title WIP: Add support for RockLinux9 Add support for RockLinux9 Mar 13, 2024
@sjpb sjpb marked this pull request as ready for review March 13, 2024 15:27
@sjpb
Collaborator Author

sjpb commented Mar 13, 2024

Tests at 43d43f2 on "local" cluster:

  • deploy OK
  • hpctests OK
  • OOD shell: OK
  • slurm jobs: OK
  • hpl timehistory: OK, although it appears to be hitting ~100% memory usage
  • slurm exporter: OK
  • ondemand exporter: OK
  • OOD jupyter: OK
  • OOD desktop: initially failed because the partition name was wrong; fixed, now OK

m-bull
m-bull previously approved these changes Mar 14, 2024
Collaborator

@m-bull m-bull left a comment

Minor comment but otherwise LGTM

@m-bull m-bull changed the title Add support for RockLinux9 Add support for RockyLinux9 Mar 14, 2024
@sjpb sjpb merged commit a415036 into main Mar 14, 2024
2 checks passed
@sjpb sjpb deleted the rl9_v2 branch March 14, 2024 13:12
2 participants