[BUG] [3007.1] TCP Transport stuck in a loop connecting to dead master #66966

Open

AppCrashExpress opened this issue Oct 14, 2024 · 1 comment
Labels: Bug (broken, incorrect, or confusing behavior), needs-triage

AppCrashExpress commented Oct 14, 2024

Description

Hello!

We have updated SaltStack to 3007.1 (hooray!).

Our masters are deployed in a fairly ephemeral way: they may occasionally go offline or have their DNS records changed. However, if the minion picks a master that turns out to be an outdated record or simply offline, it gets stuck in a loop and never tries to switch to another one:

[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'minion_id', 'tcp://192.168.0.1:4506')
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] salt.crypt._get_key_with_evict: Loading private key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
[TRACE   ] ReqChannel send clear load={'cmd': '_auth', 'id': 'minion_id', 'nonce': '0123456789abcdef0123456789abcdef', 'token': b'bytestring', 'pub': 'pubkey'}
[WARNING ] TCP Message Client encountered an exception while connecting to 192.168.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 192.168.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 192.168.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
...

That goes on and on.

This issue was tested on commit 2b26693 (I think? lost track).

Setup

This configuration lists localhost, but no master should actually be running on localhost.
The point of this configuration is to emulate a dead master, so the minion should eventually skip 127.0.0.1 and fail over.

master:
  - 127.0.0.1
  - salt-master1.example.org
  - salt-master2.example.org
master_type: str
random_master: False # Force to be in this order
master_alive_interval: 1
retry_dns: 0 # Failover should try masters on failure

transport: tcp
auth_timeout: 3
auth_tries: 2
acceptance_wait_time: 5
random_reauth_delay: 0
ping_interval: 5

Steps to Reproduce the behavior

  1. Configure the TCP transport on both master and minion. The first IP on the minion should point to a dead/nonexistent master, and there should be no master randomization (for determinism);
  2. On the minion, run salt-call -l debug state.test;
  3. The minion will attempt to connect to the dead master first and will continue to do so forever:
    [WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
    [WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
    [WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
    ...
    

Expected behavior
The minion should instead eventually give up and try the next master in the list:

[WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[TRACE   ] Failed to send msg SaltReqTimeoutError('Message timed out')
[DEBUG   ] Closing AsyncReqChannel instance
[INFO    ] Master 127.0.0.1 could not be reached, trying next master (if any)
[DEBUG   ] "salt-master1.example.org" Not an IP address? Assuming it is a hostname.
[WARNING ] Master ip address changed from 127.0.0.1 to 192.168.0.1
[DEBUG   ] Master URI: tcp://192.168.0.1:4506
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'minion_id', 'tcp://192.168.0.1:4506')
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key

Screenshots
Most likely inapplicable (big wall of warnings).

Versions Report

salt --versions-report
Salt Version:
          Salt: 3007.1
 
Python Version:
        Python: 3.12.3 (main, Jun  3 2024, 10:10:13) [Clang 16.0.6 ]
 
Dependency Versions:
          cffi: 1.16.0
      cherrypy: Not Installed
      dateutil: 2.9.0.post0
     docker-py: 7.1.0
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.4
       libgit2: Not Installed
  looseversion: 1.3.0
      M2Crypto: 0.38.0
          Mako: 1.3.5
       msgpack: 1.0.8
  msgpack-pure: Not Installed
  mysql-python: 1.4.6
     packaging: 21.3
     pycparser: 2.22
      pycrypto: Not Installed
  pycryptodome: Not Installed
        pygit2: Not Installed
  python-gnupg: 0.5.2
        PyYAML: 5.4.1
         PyZMQ: 25.1.2
        relenv: Not Installed
         smmap: 5.0.1
       timelib: 0.3.0
       Tornado: 6.4.1
           ZMQ: 4.1.2
 
Salt Package Information:
  Package Type: Not Installed
 
System Versions:
          dist: ubuntu 22.04 jammy
        locale: utf-8
       machine: x86_64
       release: 6.8.0-40-generic
        system: Linux
       version: Ubuntu 22.04 jammy

Additional context

This issue is caused by the TCP transport attempting to connect before setting up the handlers for the message timeout:

salt/salt/transport/tcp.py

Lines 1849 to 1857 in 2b26693

async def send(self, load, timeout=60):
    await self.connect()
    if self._closing:
        raise ClosingError()
    while not self._stream:
        await asyncio.sleep(0.03)
    message_id = self._message_id()
    header = {"mid": message_id}
    future = tornado.concurrent.Future()
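
To illustrate the pattern outside of Salt, here is a small standalone asyncio sketch (hypothetical class, not the actual Salt code): because send() awaits connect() before any of its own timeout handling is armed, a peer that never accepts the connection keeps send() blocked forever, which matches the endless reconnect warnings above.

import asyncio


class DeadPeerClient:
    """Standalone sketch of the failure mode; not Salt code."""

    def __init__(self):
        self._stream = None

    async def connect(self):
        # Stand-in for the reconnect loop that logs
        # "StreamClosedError ... will reconnect in 1 seconds" forever.
        while True:
            print("connect failed, will reconnect in 1 second")
            await asyncio.sleep(1)

    async def send(self, load, timeout=60):
        await self.connect()      # never returns while the peer is dead
        while not self._stream:   # so this, and the rest of send(), is never reached
            await asyncio.sleep(0.03)
        return "sent"


async def main():
    client = DeadPeerClient()
    try:
        # The 5-second cap is only here so the demo terminates;
        # the real send() would hang until the dead master comes back.
        await asyncio.wait_for(client.send({"cmd": "_auth"}), timeout=5)
    except asyncio.TimeoutError:
        print("send() hung: its own timeout handling was never armed")


asyncio.run(main())

Running it prints the reconnect message a few times until the outer wait_for gives up; without that outer cap, send() would simply never return.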

The solution seems to be as simple as removing these lines, since the TCP transport already tries connecting after the proper timeout handlers have been set up:

     async def send(self, load, timeout=60):
-        await self.connect()
         if self._closing:
             raise ClosingError()
-        while not self._stream:
-            await asyncio.sleep(0.03)
         message_id = self._message_id()
         header = {"mid": message_id}
         future = tornado.concurrent.Future()

However, these lines are probably here for a reason? Or is it safe to remove them?
They weren't in the 3005.1 MessageClient, which is confusing :(

salt/salt/transport/tcp.py

Lines 753 to 757 in 6226b9c

def send(self, msg, timeout=None, callback=None, raw=False):
    if self._closing:
        raise ClosingError()
    message_id = self._message_id()
    header = {"mid": message_id}
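
If those lines do need to stay (for example, to guarantee a stream exists before the request is written), an alternative, sketched below and untested, would be to bound the wait with the timeout that send() already receives, so a dead master surfaces as a SaltReqTimeoutError (the exception the failover path in the expected-behavior log reacts to). This assumes salt.exceptions.SaltReqTimeoutError is available in this module; the _connected helper is hypothetical:

     async def send(self, load, timeout=60):
-        await self.connect()
         if self._closing:
             raise ClosingError()
-        while not self._stream:
-            await asyncio.sleep(0.03)
+
+        async def _connected():
+            # hypothetical helper: connect, then wait for the stream;
+            # the overall deadline is enforced by wait_for() below
+            await self.connect()
+            while not self._stream:
+                await asyncio.sleep(0.03)
+
+        try:
+            await asyncio.wait_for(_connected(), timeout)
+        except asyncio.TimeoutError:
+            raise SaltReqTimeoutError("Message timed out")
         message_id = self._message_id()
         header = {"mid": message_id}
         future = tornado.concurrent.Future()

Either way, something has to let the timeout path run so the minion can move on to the next master in the list.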

AppCrashExpress added the Bug (broken, incorrect, or confusing behavior) and needs-triage labels on Oct 14, 2024
AppCrashExpress changed the title from "[BUG] [3007.1] TCP Transport stuck waiting for connection" to "[BUG] [3007.1] TCP Transport stuck in a loop connecting to dead master" on Oct 14, 2024
AppCrashExpress (Author) commented:

I think I broke the triage workflow by renaming the issue before it was done running. Sorry!
