Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nginx_stage nginx_clean: Also clean up PUNs of non-existing users #3879

Open
CSC-swesters opened this issue Oct 17, 2024 · 7 comments · May be fixed by #3942
Open

nginx_stage nginx_clean: Also clean up PUNs of non-existing users #3879

CSC-swesters opened this issue Oct 17, 2024 · 7 comments · May be fixed by #3942
Assignees
Labels
bug Existing functionality not working as expected component/nginx_stage
Milestone

Comments

@CSC-swesters
Copy link
Contributor

Currently, the nginx_stage program depends on a user existing in order to function properly. Previously, there were issues with it crashing if a single user didn't exist, but thankfully that's no longer an issue.

For the purpose of cleaning up inactive PUNs regularly, we're running nginx_stage nginx_clean once every 5 minutes. The issue is that PUNs belonging to a UID which doesn't have a valid user attached to it any longer won't be cleaned up.
Users accounts might get removed from the system at any time due to all of their projects being closed, or due to misconduct or other issues, and in these cases, it would be nice if nginx_stage nginx_clean could kill their PUN regardless of activity or inactivity.

As far as I can see, this could be done by checking all of the PUN config files, and gathering the ones which have no valid owner, and finally calling the nginx -c ${config_file} -s stop command, like this function does. From these config file paths, the previous usernames of the UIDs would be available and other possible clean-up can be done, if necessary.

It would probably be possible to implement a separate tool for this as well, but I think OOD might want to integrate it into nginx_stage nginx_clean for completeness. Let me know what you think. Thanks in advance!

@osc-bot osc-bot added this to the Backlog milestone Oct 17, 2024
@johrstrom
Copy link
Contributor

It would probably be possible to implement a separate tool for this as well, but I think OOD might want to integrate it into nginx_stage nginx_clean for completeness.

I agree. I'm not sure off the top why it wouldn't clean a PUN for a user that doesn't exist, so I'll have to investigate why it's not cleaning those PUNs. But I agree that it should.

@CSC-swesters
Copy link
Contributor Author

I'm not sure off the top why it wouldn't clean a PUN for a user that doesn't exist

The issue is that the code tries to resolve the username (based on the /var/run/ondemand-nginx/${user}/passenger.pid files that exist on the system) into a struct passwd, to use C terminology. If nothing is found, weird things happen.

I tested this on OOD 3.1.9 using the following methodology:

  • Log into one of our testing instances, which creates a PUN for my user
  • For the record, I was the only user logged in on the instance.
  • Remove the user data by renaming the passwd.db and group.db databases so that they aren't recognized automatically. (See nss_db for more info.)
  • Run /opt/ood/nginx_stage/sbin/nginx_stage nginx_clean

The outcome is quite interesting. I get an error message saying:

missing PID file: /var/run/ondemand-nginx/true/passenger.pid

and the nginx_clean exits with code 0. So the error is caught somewhere, but doesn't reflect the original error, being that my username didn't resolve into a valid user account.

I also strace:d the program while testing this and it didn't show any interesting errors on the syscall side. It basically checked what's inside the /var/run/ondemand-nginx/ directory, found my passenger.pid file, and tries to resolve the username into a user, which doesn't return anything. After this, there's an attempt to stat("/var/run/ondemand-nginx/true/passenger.pid"), which returns an ENOENT ("No such file or directory") and proceeds to print the warning message.

I suspect this is where the "all users do exist" assumption starts being a problem:

User.new v[/#{pun_pid_path(user: '(.+)')}/, 1]

@johrstrom johrstrom added component/nginx_stage bug Existing functionality not working as expected labels Oct 28, 2024
@johrstrom johrstrom modified the milestones: Backlog, 4.0 Oct 28, 2024
@johrstrom johrstrom self-assigned this Nov 5, 2024
@johrstrom
Copy link
Contributor

I'm trying to replicate the circumstances that would get us into this spot. Though I think I need ldap because I can't userdel in a container while the PUN is still active.

@johrstrom
Copy link
Contributor

johrstrom commented Nov 6, 2024

I got the hpc-toolset working to do this. Unfortunately, if it works correctly and gets sfoster (an account that doesn't exist anymore) nginx can't actually stop that process because it fails to getpwnam on that same user.

(I added one or two output statements here to see the nginx command)

[root@ondemand nginx_stage]# sbin/nginx_stage nginx_clean -f
looking at user sfoster
sfoster
/opt/ood/ondemand/root/usr/sbin/nginx ["-c", "/var/lib/ondemand-nginx/config/puns/sfoster.conf", "-s", "stop"]
nginx: [emerg] getpwnam("sfoster") failed in /var/lib/ondemand-nginx/config/puns/sfoster.conf:1
[root@ondemand nginx_stage]# vim lib/nginx_stage/generators/nginx_clean_generator.rb 

So I'm not really sure what to do here - save for searching for and kill -9ing the processes still running...

@johrstrom
Copy link
Contributor

The issue is that PUNs belonging to a UID which doesn't have a valid user attached to it any longer won't be cleaned up.

I'm now wondering why you delete the user instead of just disabling it? Seems like if you delete a user you'd have to do all sorts of cleanup tasks afterwards (like cleaning their HOME and so on). This may just be one more cleanup task you need to take care of.

@CSC-swesters
Copy link
Contributor Author

Thanks for looking into this!

nginx can't actually stop that process because it fails to getpwnam on that same user.

Okay, that's good to know.

I'm now wondering why you delete the user instead of just disabling it?

I won't go into details, but this is how our system is set up to work. Users who aren't authorized to compute won't exist either.

So I'm not really sure what to do here - save for searching for and kill -9ing the processes still running...

There's no need to go to SIGKILL immediately. In fact, when the nginx ... stop command successfully executes, it will simply look up the pid file of the given server configuration, and send a SIGTERM to that process. Below is an excerpt from a strace of nginx_stage nginx_clean -f, where the redacted_user still existed:

1900144 1730970218.810405 openat(AT_FDCWD, "/var/run/ondemand-nginx/redacted_user/passenger.pid", O_RDONLY) = 4
1900144 1730970218.810509 pread64(4, "127797\n", 22, 0) = 7
1900144 1730970218.810593 close(4)      = 0
1900144 1730970218.810692 kill(127797, SIGTERM) = 0
1900144 1730970218.811008 exit_group(0) = ?
1900144 1730970218.811979 +++ exited with 0 +++

I.e., the solution to this issue would in my opinion simply do something similar, without the /opt/ood/ondemand/root/usr/sbin/nginx program:

pid="$(cat /var/run/ondemand-nginx/${redacted_user:?}/passenger.pid)"
/usr/bin/kill -s TERM "$pid"
# Remove the empty PID file directory
rmdir /var/run/ondemand-nginx/${redacted_user:?}
# Remove the PUN config and secret_key_base (kind of hacky, please revisit)
find /var/lib/ondemand-nginx/config/puns/ -name "${redacted_user:?}.*" -delete

The PUN config and secret_key_base seems to be based on usernames only, so this should be easy to integrate at the same time:

self.pun_config_path = '/var/lib/ondemand-nginx/config/puns/%{user}.conf'
self.pun_secret_key_base_path = '/var/lib/ondemand-nginx/config/puns/%{user}.secret_key_base.txt'

@johrstrom johrstrom linked a pull request Nov 7, 2024 that will close this issue
@johrstrom
Copy link
Contributor

I have a PR in to also cleanup non existing PUNs/config files. Looking at the git blame, looks like we've been at it a while, you created #1833 😆.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Existing functionality not working as expected component/nginx_stage
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants