[BUG] [3007.1] TCP Transport stuck in a loop connecting to dead master

Description
Hello!
We have updated SaltStack to 3007.1 (hooray!).
Our masters are deployed in a way that makes them fairly ephemeral: they sometimes go offline or have their DNS records changed. However, if a minion picks a master whose record is outdated, or which is simply offline, it gets stuck in a loop and never tries to switch to another master:
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'minion_id', 'tcp://192.168.0.1:4506')
[DEBUG ] salt.crypt.get_rsa_key: Loading private key
[DEBUG ] salt.crypt._get_key_with_evict: Loading private key
[DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG ] salt.crypt.get_rsa_pub_key: Loading public key
[TRACE ] ReqChannel send clear load={'cmd': '_auth', 'id': 'minion_id', 'nonce': '0123456789abcdef0123456789abcdef', 'token': b'bytestring', 'pub': 'pubkey'}
[WARNING ] TCP Message Client encountered an exception while connecting to 192.168.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 192.168.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 192.168.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
...
That goes on and on.
This issue was tested at commit 2b26693 (I think? I lost track).
Setup
This configuration lists localhost, but no master should actually be running on localhost.
The point of this configuration is to emulate a dead master; the minion should eventually skip 127.0.0.1 and move on.
master:
- 127.0.0.1
- salt-master1.example.org
- salt-master2.example.org
master_type: str
random_master: False # Force to be in this order
master_alive_interval: 1
retry_dns: 0 # Failover should try masters on failure
transport: tcp
auth_timeout: 3
auth_tries: 2
acceptance_wait_time: 5
random_reauth_delay: 0
ping_interval: 5
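For completeness, the master side only needs the matching transport setting. A minimal sketch of the master config on the reachable masters (the file path and leaving everything else at defaults are assumptions here, not taken from this report):

# /etc/salt/master (minimal sketch)
transport: tcp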
on-prem machine
VM (Virtualbox, KVM, etc. please specify)
VM running on a cloud service, please be explicit and add details
container (Kubernetes, Docker, containerd, etc. please specify)
or a combination, please be explicit
jails if it is FreeBSD
classic packaging
onedir packaging
used bootstrap to install
Steps to Reproduce the behavior
1. Configure the TCP transport on both master and minion. The first entry in the minion's master list should point to a dead/nonexistent master, and master randomization should be disabled for determinism (a quick check that nothing is listening on 127.0.0.1:4506 is sketched after the log below);
2. Run salt-call -l debug state.test on the minion;
3. The minion will attempt to connect to the dead master first and will keep doing so forever:
[WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
...
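As a sanity check for step 1, it may help to confirm that nothing is actually listening on 127.0.0.1:4506 (the request port of the emulated dead master). A small stdlib-only sketch, not part of Salt and purely illustrative:

import socket

# Try to open a TCP connection to the "dead" master's request port.
# connect_ex returns 0 on success and an errno on failure, so a non-zero
# result confirms the port is closed and the master really is unreachable.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(2)
    result = sock.connect_ex(("127.0.0.1", 4506))

print("nothing listening on 127.0.0.1:4506" if result != 0 else "something is listening on 4506!")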
Expected behavior
The minion should instead eventually give up and try the next master in the list:
[WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[TRACE ] Failed to send msg SaltReqTimeoutError('Message timed out')
[DEBUG ] Closing AsyncReqChannel instance
[INFO ] Master 127.0.0.1 could not be reached, trying next master (if any)
[DEBUG ] "salt-master1.example.org" Not an IP address? Assuming it is a hostname.
[WARNING ] Master ip address changed from 127.0.0.1 to 192.168.0.1
[DEBUG ] Master URI: tcp://192.168.0.1:4506
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'minion_id', 'tcp://192.168.0.1:4506')
[DEBUG ] salt.crypt.get_rsa_key: Loading private key
Screenshots
Most likely inapplicable (big wall of warnings).
Versions Report
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt Version:
  Salt: 3007.1

Python Version:
  Python: 3.12.3 (main, Jun 3 2024, 10:10:13) [Clang 16.0.6 ]

Dependency Versions:
  cffi: 1.16.0
  cherrypy: Not Installed
  dateutil: 2.9.0.post0
  docker-py: 7.1.0
  gitdb: Not Installed
  gitpython: Not Installed
  Jinja2: 3.1.4
  libgit2: Not Installed
  looseversion: 1.3.0
  M2Crypto: 0.38.0
  Mako: 1.3.5
  msgpack: 1.0.8
  msgpack-pure: Not Installed
  mysql-python: 1.4.6
  packaging: 21.3
  pycparser: 2.22
  pycrypto: Not Installed
  pycryptodome: Not Installed
  pygit2: Not Installed
  python-gnupg: 0.5.2
  PyYAML: 5.4.1
  PyZMQ: 25.1.2
  relenv: Not Installed
  smmap: 5.0.1
  timelib: 0.3.0
  Tornado: 6.4.1
  ZMQ: 4.1.2

Salt Package Information:
  Package Type: Not Installed

System Versions:
  dist: ubuntu 22.04 jammy
  locale: utf-8
  machine: x86_64
  release: 6.8.0-40-generic
  system: Linux
  version: Ubuntu 22.04 jammy
Additional context
This issue is caused by the TCP transport attempting to connect before setting up the message-timeout handlers:
salt/transport/tcp.py, lines 1849 to 1857 at commit 2b26693
The solution seems to be as simple as removing those lines, since the TCP transport already tries connecting after setting up the proper timeout handlers.
However, these lines are probably there for a reason? Or is it safe to remove them?
They weren't there in the 3005.1 MessageClient, which is confusing :(
salt/transport/tcp.py, lines 753 to 757 at commit 6226b9c
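To make the ordering problem concrete, below is a deliberately simplified, self-contained sketch. It is not the salt/transport/tcp.py code; all names in it are made up, and it uses plain asyncio instead of the Tornado that Salt's transport is built on. It illustrates the point described above: if an endless reconnect loop is awaited before any send timeout is armed, the timeout can never fire, so a dead master blocks the call forever instead of producing a timeout that would let the minion move to the next master.

# Hypothetical illustration only, not the actual salt/transport/tcp.py code.
import asyncio


class ToyMessageClient:
    """Made-up stand-in for a TCP message client talking to one master."""

    def __init__(self, host, port, retry_delay=1):
        self.host = host
        self.port = port
        self.retry_delay = retry_delay

    async def _connect_forever(self):
        # Endless reconnect loop, like the repeated
        # "will reconnect in 1 seconds" warnings in the logs above.
        while True:
            try:
                return await asyncio.open_connection(self.host, self.port)
            except OSError as exc:
                print(f"connect to {self.host}:{self.port} failed ({exc}), retrying")
                await asyncio.sleep(self.retry_delay)

    async def send_broken(self, msg, timeout=3):
        # Buggy ordering: the connect loop is awaited before any timeout is
        # armed, so against a dead master this call never returns and the
        # timeout argument is never used.
        reader, writer = await self._connect_forever()
        writer.write(msg)
        await writer.drain()

    async def send_fixed(self, msg, timeout=3):
        # Fixed ordering: the timeout is armed around the connect attempt,
        # so the caller gets an exception and can fail over to the next master.
        reader, writer = await asyncio.wait_for(self._connect_forever(), timeout)
        writer.write(msg)
        await writer.drain()


async def main():
    client = ToyMessageClient("127.0.0.1", 4506)  # nothing is listening here
    try:
        await client.send_fixed(b"ping")
    except asyncio.TimeoutError:
        print("send timed out, caller can now try the next master")
    # await client.send_broken(b"ping") would instead retry forever.


if __name__ == "__main__":
    asyncio.run(main())

In the broken variant the call never returns against a dead master, matching the endless StreamClosedError warnings above; in the fixed variant the caller gets a timeout it can turn into failover to the next master.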