Rocky 9.4 Base OS Confluent/Slurm Edition for Linux #2002
Conversation
Thanks, this is great. I will try it out on our CI systems.
We have …
That is also needed for all other recipes. So, no problem.
The resulting RPMs can be found in the GitHub Actions for the next 24 hours.
Test Results: 18 files (-6), 18 suites (-6), 27s ⏱️ (-1s). Results for commit 3abbea1, ± comparison against base commit 611b01f. This pull request removes 10 tests.
♻️ This comment has been updated with latest results.
Yes, /etc/hosts should be enough.
\input{common/install_ohpc_components_intro}

\subsection{Enable \OHPC{} repository for local use} \label{sec:enable_repo}
\input{common/enable_local_ohpc_repo_confluent}
I am not aware of the history behind this line from the xCAT recipe. In all other recipes we enable the OpenHPC repository by installing the OpenHPC release RPM, which enables a dnf repository pointing at the OpenHPC repository server. Hardcoding the download of the repository tar files feels unnecessary, especially as we do not do it at all in any of our current testing. Please try to work with the online repository if that works for you.
If you need it for your testing we should put it behind some variable, so that it can be disabled.
Is this strictly necessary for you or can you work with the online repository?
Sure, noted. I will look to work with the online repo.
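For reference, the online repository is enabled by installing the OpenHPC release RPM (the same URL that comes up later in this thread); a minimal sketch:
# dnf -y install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm
# dnf repolist | grep -i openhpc   # verify the OpenHPC repositories are now configured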
\subsubsection{Build initial BOS image} \label{sec:assemble_bos}
The following steps illustrate the process to build a minimal, default image for use with \Confluent{}. To begin, you will
first need to have a local copy of the ISO image available for the underlying OS. In this recipe, the relevant ISO image
is \texttt{Rocky-9.4-x86\_64-dvd1.iso} (available from the Rocky
The image I downloaded does not have a "1" in the file name. The filename should be a variable so that it can be easily updated.
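For illustration, the filename could be made a variable in input.local following the ${var:-default} pattern used elsewhere in that file (the variable name here is hypothetical):
# Path of the local ISO image for the base OS (hypothetical variable name)
iso_path="${iso_path:-/root/Rocky-9.4-x86_64-dvd.iso}"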
The main point which is currently unclear to me is whether Confluent comes with a DHCP server. I was running the script a couple of times and the two compute nodes were always waiting for DHCP answers in the PXE boot step of the firmware.
@tkucherera-lenovo Should I try again? Is there now a DHCP server configured, somehow? For the final merge you can squash the commits. For the main repository it makes no sense to keep your development history with fixups. If you want you can do separate commits for the … Please also add a …
Yes, you can try again. Confluent does have its own DHCP server and by default it will respond to DHCP requests. If an environment has its own DHCP server, it is possible to configure Confluent not to respond to DHCP requests. In this case, though, I believe there was a bug where the setting for allowing deployment using PXE was not being set because the variable needed was missing from the input.local file; I have added a fix for that now. Going forward I will squash all commits and also add the …
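For reference, a sketch of how PXE-based deployment is switched on via a node attribute (attribute and command names per Confluent's documentation; verify for your Confluent version):
# allow network (PXE) deployment for the compute group
# nodegroupattrib compute deployment.useinsecureprotocols=firmware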
@tkucherera-lenovo Is there an easy way to reset the host machine without reinstalling? Where does Confluent store its state? Is there a directory I can delete to start from scratch?
The state is stored …
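A sketch for starting over; the paths below are assumptions about Confluent's default locations, so verify against the Confluent documentation before deleting anything:
# systemctl stop confluent
# rm -rf /etc/confluent          # configuration database and keys (assumed location)
# rm -rf /var/lib/confluent      # deployment content and generated credentials (assumed location)
# systemctl start confluent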
Now I see that the compute nodes are trying to boot: …
But after that nothing seems to happen. On the console I see: … Any recommendations how to continue? Also, there seems to be no point during the installation where the script waits for the compute nodes to be ready, so most commands are run when the compute nodes are not available. All the customization fails with:
+ nodeshell compute echo '"10.241.58.134:/home' /home nfs nfsvers=3,nodev,nosuid 0 '0"' '>>' /etc/fstab
c1: ssh: connect to host c1 port 22: No route to host
c2: ssh: connect to host c2 port 22: No route to host
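A simple guard before the customization steps would be a reachability wait; a sketch (node names and timeout are illustrative):
# wait until all compute nodes accept SSH before running nodeshell commands
for n in c1 c2; do
  until ssh -o ConnectTimeout=2 -o StrictHostKeyChecking=no "$n" true 2>/dev/null; do
    sleep 10
  done
done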
Now the installation is working, but it fails in the post-installation scripts. I see the following error on the server: …
Hi Adrian, I don't know what state the management server and cluster are in. But usually the error that I am seeing happens when the automation SSH key is missing from the …
Additionally, just to help me with debugging: if you run the command …
that output is sometimes helpful in debug. Thanks.
[sms](*\#*) mkdir -p $epel_repo_dir_confluent
[sms](*\#*) (*\install*) dnf-plugins-core createrepo
# Download required EPEL packages | ||
[sms](*\#*) dnf download --destdir $epel_repo_dir_confluent fping libconfuse libunwind |
This seems strange; why don't we just enable EPEL on the compute nodes?
I just copied …
Now the compute nodes are provisioned, but I cannot log in: …
With Warewulf 3 provisioning, the SSH keys from /root/.ssh are automatically part of the compute nodes and SSH works. Can Confluent also use one of those existing keys and add it to the compute node? Also, the current recipe does not wait until the compute nodes are provisioned. It immediately continues, and all commands like … are run before the nodes are available.
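One way to make the recipe wait would be to poll the deployment state; a sketch (assuming completed nodes report a state containing 'completed', in the same way pending nodes report 'pending' in the nodedeploy output shown below):
# block until nodedeploy no longer reports any non-completed node
while nodedeploy compute | grep -qv 'completed:'; do
  sleep 30
done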
Ah, so the problem is that I have SSH keys in different formats and the last one in the list is using an unsupported algorithm. In … the following code change seems to work for me:
--- /opt/confluent/lib/python/confluent/sshutil.py	2023-11-15 16:30:46.000000000 +0000
+++ /opt/confluent/lib/python/confluent/sshutil.py.new	2024-08-12 09:10:48.601474767 +0000
@@ -214,10 +214,14 @@
     else:
         suffix = 'rootpubkey'
     for auth in authorized:
-        shutil.copy(
-            auth,
+        local_key = open(auth, 'r')
+        dest = open(
             '/var/lib/confluent/public/site/ssh/{0}.{1}'.format(
-                myname, suffix))
+                myname, suffix), 'a')
+        dest.write(local_key.read())
+        if os.path.exists(
+            '/var/lib/confluent/public/site/ssh/{0}.{1}'.format(
+                myname, suffix)):
         os.chmod('/var/lib/confluent/public/site/ssh/{0}.{1}'.format(
             myname, suffix), 0o644)
         os.chown('/var/lib/confluent/public/site/ssh/{0}.{1}'.format(
Instead of copying all the files and overwriting everything with the last file, this appends all public keys.
Now SSH works, but provisioning fails again: …
It makes kind of sense, because … Currently I am again stuck in provisioning:
# nodedeploy compute
c1: pending: rocky-9.4-x86_64-default
c2: pending: rocky-9.4-x86_64-default
# confluent_selfcheck -n c1
OS Deployment: Initialized
Confluent UUID: Consistent
Web Server: Running
Web Certificate: OK
Checking web download: Failed to download /confluent-public/site/confluent_uuid
Checking web API access: Failed access, if selinux is enabled, `setsebool -P httpd_can_network_connect=1`, otherwise check web proxy configuration
TFTP Status: OK
SSH root user public key: OK
Checking SSH Certificate authority: OK
Checking confluent SSH automation key: OK
Checking for blocked insecure boot: OK
Checking IPv6 enablement: OK
Performing node checks for 'c1'
Checking node attributes in confluent...
Checking network configuration for c1
c1 appears to have network configuration suitable for IPv4 deployment via: ens2f0
No issues detected with attributes of c1
Checking name resolution: OK
“Following code change seems to work for me.” A pull request is welcome for that one. It has come up, but we didn't quite get around to appending keys when dealing with multiple /root/.ssh/*.pub keys: https://github.com/xcat2/confluent/pulls
On the /etc/shadow issue: this is a consequence of confluent not being allowed to run as root, so for files like /etc/shadow, if that is desired, you would need one readable by the confluent user. We frequently support syncing /etc/passwd and 'stubbing out' shadow so that such accounts are password disabled, as an option.
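For illustration, a password-disabled stub of /etc/shadow could be generated like this before syncing (a sketch; the staging path is hypothetical, and a '!' in the password field disables password login while SSH key logins keep working):
# replace every password hash with '!' in a copy of /etc/shadow
sed 's/^\([^:]*\):[^:]*:/\1:!:/' /etc/shadow > /tmp/shadow.stub
chmod 600 /tmp/shadow.stub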
How could this best be automated in a recipe like the one we are trying to build here? Any recommendations?
I'd probably offer some example choices: … I think we were imagining the first option, that sync targets aren't interested in the passwords. Note that the root password is a node attribute and can be set in the confluent DB. The default is to disable the root password unless specified. If set during deploy, it will get that root password into shadow (though before syncfiles run).
As this recipe is contributed by you (upstream Confluent) I would let you decide how to design and implement it, with the proper warnings in the documentation. Whatever makes most sense for you. If the recipe results in a working cluster we are happy to include it. Maybe merge support makes sense, as we do not use passwords much (at all) anyway, or the blessed copy. I would defer this to you and your experience of what makes most sense.
With a … the following things need to be fixed at this point: …
For Warewulf we do: …
As Confluent first does the installation and then changes the running compute node, this approach will not work. On the running compute nodes this becomes:
# nodeshell compute dnf -y install epel-release
# nodeshell compute dnf -y install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm
The following commands are unnecessary or do not work:
# nodeshell compute dnf -y install ntp
# nodeshell compute dnf -y install --enablerepo=powertools lmod-ohpc   # powertools does not exist; it is called crb and is already enabled earlier
# nodeshell compute systemctl restart nfs
c1: Failed to restart nfs.service: Unit nfs.service not found.
c2: Failed to restart nfs.service: Unit nfs.service not found.
This is needed: …
The existing …
Besides the items mentioned here, we seem to be able to get a cluster with two compute nodes running. The nice thing for OpenHPC is that with this recipe we would finally have a stateful provisioned recipe again. When we used to have an xCAT stateful recipe, it was explicitly marked as stateful; I am not sure how you want to do this. Do you want one recipe which can do either stateful or stateless, or two recipes?
So if I'm understanding …
Changes to syncfiles to include: …
And in post.d, to install epel-release. For nfs-utils, we could add it to the pkglist, or add a 'dnf -y install nfs-utils' as a 'post.d' script (see the sketch below). For diskless, maybe a different recipe. It will be more 'warewulf'-like, with 'imgutil build' and 'imgutil exec'. There's also been a suggestion to make the 'installimage' script work for those instead of just clones.
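A sketch of such a post.d script; the profile scripts directory is an assumption based on Confluent's profile layout under /var/lib/confluent/public/os/<profile>, so verify the exact path for your install:
# create an OpenHPC bootstrap script in the deployment profile
profile=/var/lib/confluent/public/os/rocky-9.4-x86_64-default
cat > $profile/scripts/post.d/10-ohpc <<'EOF'
#!/bin/sh
dnf -y install epel-release nfs-utils
dnf -y install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm
EOF
chmod +x $profile/scripts/post.d/10-ohpc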
Either install the repo file, but this requires also copying the keys, or install the ohpc-release RPM via …
@adrianreber To go back: did you want to do a pull request for the ssh key handling change, or did you want it done on your behalf? I kind of like the idea of the pull request to keep it clear who did what, but I can just work from your comment if preferred.
I already did at xcat2/confluent#159
Thanks, sorry for not noticing sooner. I accepted and amended it just a tad (to empty out the file before writing, and to use 'with' to manage open/close of the files).
@adrianreber FYI, confluent 3.11.0 has been released including your change for ssh pubkey handling.
@adrianreber Since the compute nodes are provisioned without internet access, running commands like …
Hmm, I see. In our test setup all nodes have internet access, which is why I didn't really think about it. I would say we mention somewhere in the documentation that the nodes need internet access for all the steps, and leave it to the user to configure NAT or a proxy or whatever. That would be the easiest solution and would be acceptable for me. As we do not talk about network setup or securing the nodes or the head node, it sounds acceptable to me. What do you think? For our testing we actually set up a proxy server to reduce re-downloading of RPMs, so even with internet access we already change the network setup slightly.
Having the nodes set up to access the internet also works for me.
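A minimal NAT sketch on the SMS, assuming the external interface is eth0 and the internal cluster network is 172.16.0.0/16 (matching the netmask and gateway defaults that appear in input.local below):
# enable forwarding and masquerade traffic from the cluster network
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -o eth0 -s 172.16.0.0/16 -j MASQUERADE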
Force-pushed b48b5a8 to 10fe611 (Compare)
@adrianreber I have made some changes to include the mentioned discussions.
Note: the error you were getting with nfs.service not found could be because NFS is not installed on the master node. According to section 1.2 of the OpenHPC install guide, NFS is hosted on the master node, but I do not see where it is installed in the guides, Warewulf or xCAT. Is it assumed that it is already installed? Please advise.
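For reference, a typical NFS server setup on the SMS would look something like this (a sketch matching the /home mount used earlier in this thread; the export options are common OpenHPC defaults, adjust as needed):
# dnf -y install nfs-utils
# echo "/home *(rw,no_subtree_check,fsid=10,no_root_squash)" >> /etc/exports
# exportfs -a
# systemctl enable --now nfs-server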
So with the latest changes I am able to run a full test suite with no errors. I still have to do some minor changes. The following changes are currently still necessary: …
Oh, and please squash your commits. For a new feature like this it would make sense to have it all in one commit, without fixup commits.
@@ -203,7 +203,7 @@ \subsubsection{Add \OHPC{} components} \label{sec:add_components}
 [sms](*\#*) (*\chrootinstall*) kernel

 # Include modules user environment
-[sms](*\#*) (*\chrootinstall*) --enablerepo=powertools lmod-ohpc
+[sms](*\#*) (*\chrootinstall*) /usr/bin/crb enable
Without trying it, this is now missing the installation of lmod-ohpc.
@@ -156,7 +156,7 @@ \subsection{Enable \OHPC{} repository for local use} \label{sec:enable_repo}

 % begin_ohpc_run
 \begin{lstlisting}[language=bash,keywords={},basicstyle=\fontencoding{T1}\fontsize{8.0}{10}\ttfamily,literate={ARCH}{\arch{}}1 {-}{-}1]
-[sms](*\#*) (*\install*) epel-release
+[sms](*\#*) (*\install*) epel-release
The recommendation when installing epel-release is to run /usr/bin/crb enable as a second command. My recommendation would be to install epel-release on the SMS and on the compute nodes, as well as run /usr/bin/crb enable on the SMS and compute nodes, as sketched below.
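That is, something like (all commands as they appear elsewhere in this thread):
[sms]# dnf -y install epel-release
[sms]# /usr/bin/crb enable
# nodeshell compute dnf -y install epel-release
# nodeshell compute /usr/bin/crb enable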
Force-pushed 89cf968 to 09c9f85 (Compare)
Looks like something with the LaTeX content is broken. You probably have to escape underscores like … (in LaTeX source an underscore must be written as \_).
Force-pushed 09c9f85 to 714fede (Compare)
Can you do another squash and avoid the merge commit? Something like:
$ git pull --rebase
and then do the squashing? I will use this for one more test run, but it should be really close to ready, and smaller fixups can also be done later.
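Expanded a bit, the flow could look like this (a sketch; <n> and the branch name are placeholders):
$ git pull --rebase                  # replay your commits on top of upstream
$ git rebase -i HEAD~<n>             # mark all but the first commit as 'squash'
$ git push --force-with-lease origin <your-branch>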
@adrianreber Do you want me to squash those commits, including the merge commit, and have just one commit?
Yes, just a single commit and no merge commits.
Force-pushed 714fede to 9ac3f2b (Compare)
@@ -31,12 +34,25 @@ bmc_password="${bmc_password:-unknown}"
 # Additional time to wait for compute nodes to provision (seconds)
 provision_wait="${provision_wait:-180}"

-# Local domainname for cluster (xCAT recipe only)
+# DNS Local domainname for cluster (xCAT and Confluent recipe only)
 dns_servers="${dns_sersers:-172.30.0.254}"
Here is a typo. It says "sersers". Please fix.
@@ -21,6 +21,9 @@ sms_eth_internal="${sms_eth_internal:-eth1}"
 # Subnet netmask for internal cluster network
 internal_netmask="${internal_netmask:-255.255.0.0}"

+# ipv4 gateway
+ipv4_gateway="${ipv4_gateway:-172.16.0.2}
Closing "
missing.
Sorry for being pedantic, but could you also rework the commit message? Currently it is the result of the squash. Just write it as if it were a single commit. The more information the better, but not what it is now: it has multiple "Signed-off-by" lines and some fixup information.
So, another test shows that besides the mentioned typo, the missing …
Recipe to support using Confluent as a system manager and provisioner when setting up an OpenHPC cluster. Signed-off-by: tkucherera <[email protected]>
Force-pushed 9ac3f2b to 3abbea1 (Compare)
@adrianreber Made the change and added a much more descriptive commit message. Thanks.
Thank you so much for working with us. I will wait for CI to do a last check, but then I will merge it.
This is a recipe that uses Confluent for cluster provisioning.
Assumptions: …
Note: on the management node the deployment files live under /var/lib/confluent/public,
whereas on the compute nodes they can be reached via the web root confluent-public.
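For illustration, the mapping can be checked from a compute node (the host name is hypothetical; the path matches the confluent_selfcheck output earlier in this thread):
# a file placed under /var/lib/confluent/public/site/ on the SMS is fetched as:
curl -k https://sms-host/confluent-public/site/confluent_uuid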