Ansible goes into hung state on windows servers #564
Comments
I managed to get a server to do it and found the following in the PowerShell operational event log: 9:40:10pm Context: User Data: 9:40:11pm Context: User Data: It looks like it gets stuck in a loop of starting up PowerShell, then erroring and starting again. |
I've transferred this to the [...]. What you can do is set the log_path option to a location on the remote host. This will give you a lot more information on the progress of the task and show whether it is stuck or not. Keep in mind the Windows Update agent can take a really long time depending on the updates, but if you can share the log/debug output from a task that seems stuck, that will very much help narrow down what might be the problem. |
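A minimal sketch of what setting log_path might look like, assuming the task uses the ansible.windows.win_updates module. The category list and the log file path are illustrative and not taken from the reporter's playbook:

```yaml
- name: Search for updates and log module progress on the Windows host
  ansible.windows.win_updates:
    category_names:
      - SecurityUpdates
      - CriticalUpdates
    state: searched
    # Assumed path; any location writable on the remote host should work.
    log_path: C:\Windows\Temp\ansible_wu.txt
```

If the log file is never created on a host that hangs, that suggests the hang happens before the module itself starts running.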
@jborean93 I'm not 100% convinced this is purely a win_updates issue, as I have seen the same thing happen when gathering facts. |
I would certainly try and narrow it down if you can, knowing whether it's something with the complex |
I am getting this with 2 servers out of around 20. All I'm doing on the remote machine is below:
I'm running the below: Collection Version ansible.eda 1.4.2 |
Still getting this issue. I added the log_path parameter, but when a machine hangs it doesn't write any logs or create a log file. It seems like Ansible establishes a WinRM connection and then initiates its task to run the update search. I haven't been able to see where it hangs; I'm guessing it's earlier in the process, given that it hasn't created a log file yet. I'm using this via GitLab, but have seen the same when running it directly on the command line. I will set up another job for win_ping and run it on a schedule to see if the issue comes back. Then I will move on to getting the return of a command. For reference, I've built a new server and get the same results: Ubuntu 22.04 |
I can confirm that this issue still occurs when running win_ping; it just happened to 3 servers out of 20. |
On my test server I noticed the server was running with around 400-500 MB of RAM free. I am running Zabbix proxy on this and using it in a centralized monitoring system, with Ansible/GitLab being used for orchestration. Zabbix uses memory for caching, and runs multiple pollers which can use a lot of memory. I checked the server where the issue originated, and I can see it only has 160 MB of RAM free. I upped the test server to 12 GB of RAM and the win_ping runs seem to work fine for 11 consecutive runs. I have done a few rounds of searching for updates, which have gone successfully. I've created a schedule to run this every 30 minutes to see if the issue happens again. |
I have run the update search 10 consecutive times without any issues and with a noticeable performance increase. This issue seems to occur when Ansible is running and the local server it's running on has swapping enabled. Although free memory was low, available memory was sufficient, but it seemed that it was going to swap rather than reclaiming from the buffer/cache. I'm at a loss as to why this would cause a remote Windows server to hang. @jborean93 - Sorry to ping directly, I wonder if you could shed some light on this? I think it must be related to one of the PowerShell scripts it passes in initially, which I think acts as a pipe for Ansible to send more scripts in? |
Issue came back this morning with no memory constraints. |
The processes are still in place on the Ansible box. I ran strace on these yesterday and it looked like they were just waiting for input from the remote server. |
ansible-search-updates-2024-01-09-04-03-10.log
I have managed to get the logs from one of the servers it failed on, which didn't completely hang, but we can't RDP to it and some applications (like Event Viewer) crash when you try to run them. This happens after a specific period of time/number of attempts against the same selection of servers. Still trying to find out what the commonality is between them that's causing this. |
Unfortunately I can't really help too much, as this is something specific to your environment. We are reliant on the WinRM service working properly, with things like the operational timeout, to avoid problems like this. If it isn't, then there's not much we can do about it, sorry. |
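For reference, the WinRM connection plugin exposes the operation and read timeouts as inventory variables. The following is only a sketch: the file name and the values are assumptions, and whether changing them affects the hang described in this issue is untested.

```yaml
# group_vars/windows.yml (hypothetical file name)
ansible_connection: winrm
# WS-Man operation timeout in seconds (Ansible uses 20 by default).
ansible_winrm_operation_timeout_sec: 60
# HTTP read timeout in seconds (Ansible uses 30 by default);
# keep it higher than the operation timeout.
ansible_winrm_read_timeout_sec: 90
```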
This sounds very similar to an issue I am dealing with. I am curious whether you get the same event log entry I do when it occurs. In my scenario, we will have a playbook hang on a task indefinitely. When I check the "System" event log, I see "Application popup: powershell.exe - Application Error: The application was unable to start correctly (0xc0000142). Click OK to close the application." |
SUMMARY
When running a job against between 1 and 20 Windows servers, it will hang on one or two servers. This causes the job to get stuck on that task, waiting for the hung server.
I have found that the process is not entirely hung on the Windows server. Using tcpdump I can see communication between the hung process and the Ansible server slowly going back and forth, but no real activity to indicate that something is happening. Killing all the processes running under the Ansible service account, or rebooting the server, fails the task for that Windows server and the job continues as normal.
I originally thought it was related to the resources available on the Windows server at the time Ansible connected, but when I last investigated and logged onto the servers, there was nothing I could see that would cause this hung state.
I'm struggling to figure out where this issue is, as I can't recreate it on demand.
There is nothing useful in the event log that I can see.
Another observation that may or may not be related is that there were Ansible processes running on a couple of servers many days after the previous round of patching. I presume these were hung jobs that weren't resolved when the jobs were canceled by the engineer. At that time I had not found the tcpdump method of identifying which server was the hung server.
ISSUE TYPE
COMPONENT NAME
ansible.windows
ANSIBLE VERSION
COLLECTION VERSION
CONFIGURATION
OS / ENVIRONMENT
Windows Server 2016, 2019, 2022, 2012R2
STEPS TO REPRODUCE
The playbook below is the one where we see this issue the most, but it's also the one we use the most.
On the last round of patching, the job was stuck on the task "Ping host to wake up", and for another job template it was also stuck on the first task. But I don't remember this always being the case. In the past, jobs have been found stuck further along.
EXPECTED RESULTS
The job doesn't get blocked, and the hung server is failed.
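One way this behaviour might be approximated is the task-level `timeout` keyword (available since ansible-core 2.10), which fails a task that runs longer than the given number of seconds. This is a sketch only: the task name comes from the description above, while the use of win_ping and the 120-second limit are assumptions, and it has not been confirmed that the keyword interrupts a task stuck in this particular hung state.

```yaml
- name: Ping host to wake up
  ansible.windows.win_ping:
  # Assumed limit in seconds; the task is failed for a host that exceeds it.
  timeout: 120
```

A host that fails this way is dropped from the rest of the play, while the remaining hosts carry on.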
ACTUAL RESULTS
No verbose logs at this time.