dashboard hung on file io #3790

Closed
johrstrom opened this issue Sep 9, 2024 · 3 comments
@johrstrom
Contributor

We're having issues at OSC with PUNs getting into bad states. I looked at a PUN that's been running for some time (over a week at this point). Running kill -3 on a process gave this stack trace, where the dashboard is waiting on File.lstat to return.

App 1759505 output: # Thread: #<Thread:0x00007f45f3229a08 /opt/ood/ondemand/root/usr/share/gems/3.1/ondemand/3.1.7-1/gems/actionpack-6.1.7.6/lib/action_controller/metal/live.rb:300 sleep>(Worker 1), alive = true
App 1759505 output: ------------------------------------------------------------
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:46:in `lstat'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:46:in `lstat'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:46:in `initialize'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:97:in `new'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:97:in `block in ls'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:96:in `each'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:96:in `each'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:96:in `map'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:96:in `ls'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/controllers/files_controller.rb:36:in `block (2 levels) in fs'
App 1759505 output:     /opt/ood/ondemand/root/usr/share/gems/3.1/ondemand/3.1.7-1/gems/actionpack-6.1.7.6/lib/action_controller/metal/mime_responds.rb:214:in `respond_to'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/controllers/files_controller.rb:18:in `fs'
@osc-bot osc-bot added this to the Backlog milestone Sep 9, 2024
@johrstrom
Contributor Author

Note that #3511 isn't causing this directly - but it could have uncovered it. If the dashboard hangs, it's not likely to have any more open files. At that point, the old implementation, which used lsof to check for apps, would have indicated there are no running apps, and the PUN could be restarted.

But since we started to use ps, ps still sees this app as running and therefore won't stop the PUN.

@johrstrom
Contributor Author

I'm able to replicate this in dev when the project NFS drives are behind a firewall (i.e., any attempt to access them hangs forever).

However, if I do this work in another thread - I cannot stop/kill that thread. (Doing the work in the main thread, a ctrl+c does not stop the main thread either.) So I'm trying to work out how I can in fact kill a thread when it's in this state.
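
For illustration only, here is a minimal sketch of the "do the work in another thread" approach with a bounded wait. The helper name with_deadline is hypothetical and not part of the dashboard. It assumes the blocking call releases Ruby's global lock so the waiting thread can keep running. It does not solve the problem described above: Thread#kill only requests termination, so a worker stuck inside a system call on an unreachable NFS mount will likely stay stuck. The caller just stops waiting for it.

```ruby
require 'timeout' # only for the Timeout::Error class

# Hypothetical helper: run the block in a worker thread and stop waiting
# for it after `seconds`. Exceptions raised by the block propagate to the
# caller through Thread#join.
def with_deadline(seconds)
  worker = Thread.new do
    # Let the caller handle errors instead of printing them to stderr.
    Thread.current.report_on_exception = false
    yield
  end

  # Thread#join returns nil when the thread is still running at the limit.
  if worker.join(seconds)
    worker.value
  else
    # This is only a request. A thread blocked in an uninterruptible
    # system call may not act on it until that call returns.
    worker.kill
    raise Timeout::Error, "gave up after #{seconds}s"
  end
end

begin
  stat = with_deadline(5) { File.lstat(Dir.pwd) }
  puts stat.directory?
rescue Timeout::Error => e
  warn e.message
end
```

With something like this the request could return an error instead of hanging, but each stuck worker thread would still linger in the process, so ps would still report the app as running.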

@johrstrom
Contributor Author

This is a duplicate of #240
