[BUG] [3007.1] TCP Transport stuck in a loop connecting to dead master #66966

Open

AppCrashExpress opened this issue Oct 14, 2024 · 1 comment
Labels: Bug (broken, incorrect, or confusing behavior), needs-triage

AppCrashExpress commented Oct 14, 2024

Description

Hello!

We have updated SaltStack to 3007.1 (hooray!).

Our masters are deployed in a fairly ephemeral way: they may occasionally go offline or have their DNS records changed. However, if the minion picks a master that turns out to be an outdated record or simply offline, it gets stuck in a loop and never tries to switch to another one:

[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'minion_id', 'tcp://192.168.0.1:4506')
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] salt.crypt._get_key_with_evict: Loading private key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
[TRACE   ] ReqChannel send clear load={'cmd': '_auth', 'id': 'minion_id', 'nonce': '0123456789abcdef0123456789abcdef', 'token': b'bytestring', 'pub': 'pubkey'}
[WARNING ] TCP Message Client encountered an exception while connecting to 192.168.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 192.168.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 192.168.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
...

That goes on and on.

This issue was tested on commit 2b26693 (I think? lost track).

Setup

This configuration lists localhost, but no master should actually be running on localhost.
The point of this configuration is to emulate a dead master, so the minion should eventually skip 127.0.0.1 and fail over.

master:
  - 127.0.0.1
  - salt-master1.example.org
  - salt-master2.example.org
master_type: str
random_master: False # Force to be in this order
master_alive_interval: 1
retry_dns: 0 # Failover should try masters on failure

transport: tcp
auth_timeout: 3
auth_tries: 2
acceptance_wait_time: 5
random_reauth_delay: 0
ping_interval: 5

Steps to Reproduce the behavior

  1. Configure the TCP transport on both master and minion. The first IP on the minion should point to a dead/nonexistent master, and there should be no master randomization (for determinism);
  2. On the minion, run salt-call -l debug state.test;
  3. The minion will attempt to connect to the dead master first and will continue to do so forever:
    [WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
    [WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
    [WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
    ...
    

Expected behavior
The minion should instead eventually give up and try the next master in the list:

[WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[TRACE   ] Failed to send msg SaltReqTimeoutError('Message timed out')
[DEBUG   ] Closing AsyncReqChannel instance
[INFO    ] Master 127.0.0.1 could not be reached, trying next master (if any)
[DEBUG   ] "salt-master1.example.org" Not an IP address? Assuming it is a hostname.
[WARNING ] Master ip address changed from 127.0.0.1 to 192.168.0.1
[DEBUG   ] Master URI: tcp://192.168.0.1:4506
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'minion_id', 'tcp://192.168.0.1:4506')
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key

Screenshots
Most likely inapplicable (big wall of warnings).

Versions Report

salt --versions-report
Salt Version:
          Salt: 3007.1
 
Python Version:
        Python: 3.12.3 (main, Jun  3 2024, 10:10:13) [Clang 16.0.6 ]
 
Dependency Versions:
          cffi: 1.16.0
      cherrypy: Not Installed
      dateutil: 2.9.0.post0
     docker-py: 7.1.0
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.4
       libgit2: Not Installed
  looseversion: 1.3.0
      M2Crypto: 0.38.0
          Mako: 1.3.5
       msgpack: 1.0.8
  msgpack-pure: Not Installed
  mysql-python: 1.4.6
     packaging: 21.3
     pycparser: 2.22
      pycrypto: Not Installed
  pycryptodome: Not Installed
        pygit2: Not Installed
  python-gnupg: 0.5.2
        PyYAML: 5.4.1
         PyZMQ: 25.1.2
        relenv: Not Installed
         smmap: 5.0.1
       timelib: 0.3.0
       Tornado: 6.4.1
           ZMQ: 4.1.2
 
Salt Package Information:
  Package Type: Not Installed
 
System Versions:
          dist: ubuntu 22.04 jammy
        locale: utf-8
       machine: x86_64
       release: 6.8.0-40-generic
        system: Linux
       version: Ubuntu 22.04 jammy

Additional context

This issue is caused by the TCP transport attempting to connect before setting up the handlers for the message timeout:

salt/salt/transport/tcp.py

Lines 1849 to 1857 in 2b26693

async def send(self, load, timeout=60):
    await self.connect()
    if self._closing:
        raise ClosingError()
    while not self._stream:
        await asyncio.sleep(0.03)
    message_id = self._message_id()
    header = {"mid": message_id}
    future = tornado.concurrent.Future()
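
To illustrate the pattern outside of Salt, here is a small standalone asyncio sketch (hypothetical class, not the actual Salt code): because send() awaits connect() before any of its own timeout handling is armed, a peer that never accepts the connection keeps send() blocked forever, which matches the endless reconnect warnings above.

import asyncio


class DeadPeerClient:
    """Standalone sketch of the failure mode; not Salt code."""

    def __init__(self):
        self._stream = None

    async def connect(self):
        # Stand-in for the reconnect loop that logs
        # "StreamClosedError ... will reconnect in 1 seconds" forever.
        while True:
            print("connect failed, will reconnect in 1 second")
            await asyncio.sleep(1)

    async def send(self, load, timeout=60):
        await self.connect()      # never returns while the peer is dead
        while not self._stream:   # so this, and the rest of send(), is never reached
            await asyncio.sleep(0.03)
        return "sent"


async def main():
    client = DeadPeerClient()
    try:
        # The 5-second cap is only here so the demo terminates;
        # the real send() would hang until the dead master comes back.
        await asyncio.wait_for(client.send({"cmd": "_auth"}), timeout=5)
    except asyncio.TimeoutError:
        print("send() hung: its own timeout handling was never armed")


asyncio.run(main())

Running it prints the reconnect message a few times until the outer wait_for gives up; without that outer cap, send() would simply never return.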

The solution seems to be as simple as removing these lines, since the TCP transport already tries connecting after the proper timeout handlers have been set up:

     async def send(self, load, timeout=60):
-        await self.connect()
         if self._closing:
             raise ClosingError()
-        while not self._stream:
-            await asyncio.sleep(0.03)
         message_id = self._message_id()
         header = {"mid": message_id}
         future = tornado.concurrent.Future()

However, these lines are probably here for a reason? Or is it safe to remove them?
They weren't in the 3005.1 MessageClient, which is confusing :(

salt/salt/transport/tcp.py

Lines 753 to 757 in 6226b9c

def send(self, msg, timeout=None, callback=None, raw=False):
    if self._closing:
        raise ClosingError()
    message_id = self._message_id()
    header = {"mid": message_id}
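
If those lines do need to stay (for example, to guarantee a stream exists before the request is written), an alternative, sketched below and untested, would be to bound the wait with the timeout that send() already receives, so a dead master surfaces as a SaltReqTimeoutError (the exception the failover path in the expected-behavior log reacts to). This assumes salt.exceptions.SaltReqTimeoutError is available in this module; the _connected helper is hypothetical:

     async def send(self, load, timeout=60):
-        await self.connect()
         if self._closing:
             raise ClosingError()
-        while not self._stream:
-            await asyncio.sleep(0.03)
+
+        async def _connected():
+            # hypothetical helper: connect, then wait for the stream;
+            # the overall deadline is enforced by wait_for() below
+            await self.connect()
+            while not self._stream:
+                await asyncio.sleep(0.03)
+
+        try:
+            await asyncio.wait_for(_connected(), timeout)
+        except asyncio.TimeoutError:
+            raise SaltReqTimeoutError("Message timed out")
         message_id = self._message_id()
         header = {"mid": message_id}
         future = tornado.concurrent.Future()

Either way, something has to let the timeout path run so the minion can move on to the next master in the list.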

AppCrashExpress added the Bug (broken, incorrect, or confusing behavior) and needs-triage labels on Oct 14, 2024
AppCrashExpress changed the title from "[BUG] [3007.1] TCP Transport stuck waiting for connection" to "[BUG] [3007.1] TCP Transport stuck in a loop connecting to dead master" on Oct 14, 2024
AppCrashExpress (Author) commented:

I think I broke the triage workflow by renaming the issue before it was done running. Sorry!
