title | expires_at | tags |
---|---|---|
(routing-release-0.277.0) TCP Router Port Conflict | 2028-08-17 | |
- (routing-release-0.277.0) TCP Router Port Conflict
- 📑 Context
- 🔥 Affected Versions
- ✔️ Operator Checklist
- 🐛 Bug Variation 1 - TCP Router claims the port first
- 🐞 Bug Variation 2 - Internal component claims the port first
- 🧰 Fix
- 🗨️ FAQ
- 📝 Appendix A: Default System Component Ports
## 📑 Context
Each TCP route requires one port on the TCP Router VM. Ports for TCP routes are managed via router groups. Each router group has a list of `reservable_ports`.
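To see your router groups and their current port ranges, you can query the Routing API. A minimal sketch, assuming an admin-scoped `cf` login and that your Cloud Controller proxies Routing API requests (true for a default cf-deployment):

```bash
# List every router group with its reservable port ranges.
cf curl /routing/v1/router_groups
# Example response:
# [{"guid":"abc123","name":"default-tcp","type":"tcp","reservable_ports":"1024-1123"}]
```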
The Cloud Foundry documentation for "Enabling and Configuring TCP Routing" has the following warning and suggestions for valid port ranges:
Do not enter reservable_ports that conflict with other TCP router instances or ephemeral port ranges. Cloud Foundry recommends using port ranges within 1024-2047 and 18000-32767 on default installations.
These port suggestions do not overlap with any ports used by system components. However, there is nothing (until now) preventing users from expanding this range into ports that do overlap with ports used by system components.
This port conflict can result in two different buggy outcomes.
## 🔥 Affected Versions
- All versions of routing-release before 0.277.0
## ✔️ Operator Checklist
1. Read this doc.
2. Compare the listening ports on your TCP Router VM to the list below (see how here, and the sketch after this list).
3. Update your manifest to make `routing_api.reserved_system_component_ports` match the ports you found in step 2. See bosh property details here.
4. Upgrade to a version of routing-release with these fixes.
5. Look at the TCP Router logs to see if any existing router groups are invalid. See logs to look for here.
6. Fix invalid router groups. See routing-api documentation here.
7. Re-run the check to make sure all router groups are valid. See how here.
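For step 2, a sketch of gathering the listening ports from a jumpbox. The deployment name `cf` and instance group name `tcp-router` match a default cf-deployment; adjust both for your environment:

```bash
# List non-haproxy listeners on every TCP Router VM; these are the system
# component ports that must appear in reserved_system_component_ports.
bosh -d cf ssh tcp-router -c 'sudo netstat -tlpn | grep -v haproxy'
```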
## 🐛 Bug Variation 1 - TCP Router claims the port first
- Some bosh job on the TCP Router VM fails to start. This will likely cause a deployment to fail.
- There are logs for the failing job that say it was unable to bind to its port:
2020/10/13 22:12:20 Metrics server closing: listen tcp :14726: bind: address already in use
2020/10/13 22:12:20 stopping metrics-agent
- Run `netstat -tlpn | grep PORT` and see that haproxy is running on the port that the bosh job tried to bind to.
If a TCP route gets the port before the bosh job, then the job will fail to bind to its port.
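A quick way to spot this variation from a jumpbox, sketched with the bosh CLI (deployment name `cf` assumed):

```bash
# Show per-process health on every VM; the conflicting job will report
# "failing" on the TCP Router VM while haproxy holds its port.
bosh -d cf instances --ps | grep -i failing
```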
## 🐞 Bug Variation 2 - Internal component claims the port first
- You created a TCP route, but it doesn't work.
- Check the TCP Router logs and see that it failed to bind to the port for the TCP route:
{"timestamp":"2020-10-01T21:23:17.526206817Z","level":"info","source":"tcp-router","message":"tcp-router.writing-config","data":{"num-bytes":826}}
{"timestamp":"2020-10-01T21:23:17.526332658Z","level":"info","source":"tcp-router","message":"tcp-router.running-script","data":{}}
{"timestamp":"2020-10-01T21:23:19.581306843Z","level":"info","source":"tcp-router","message":"tcp-router.running-script","data":{"output":"[ALERT] 274/212317 (43) : Starting proxy listen_cfg_2822: cannot bind socket [0.0.0.0:2822]\n"}}
{"timestamp":"2020-10-01T21:23:19.581361142Z","level":"error","source":"tcp-router","message":"tcp-router.failed-to-run-script","data":{"error":"exit status 1"}}
- Run `netstat -tlpn | grep PORT` and see that some other process is running on the port that the TCP route is trying to use.
The TCP Router will fail to load the new config with the new TCP route, because something else is already bound to the conflicting port. This prevents ALL new TCP routes from working as long as the conflicting port is in the config. It will not cause the bosh job for the TCP Router to fail. This bug is dangerous because it is easy to miss and can affect many users.
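To confirm the conflict, you can inspect the rendered haproxy config on the TCP Router VM. A sketch; the config path below matches the tcp_router job in routing-release but may vary by version:

```bash
# Find the listen stanza for the conflicting port (2822 in the log above) ...
grep -A2 'listen_cfg_2822' /var/vcap/jobs/tcp_router/config/haproxy.conf
# ... then check what already owns that port. Expect a system component
# (monit, in the case of 2822), not haproxy.
netstat -tlpn | grep 2822
```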
## 🧰 Fix
The fix for this issue focuses on preventing the creation of router groups that conflict with system component ports. We have done this via:
- a runtime check when creating and updating router groups
- a deploytime check for existing router groups
These fixes are available in routing-release v0.277.0+. If you cannot upgrade at this time, you can fix your router groups manually. See here for instructions.
Bosh Property | Description | Default |
---|---|---|
routing_api.reserved_system_component_ports | Array of ports that are reserved for system components. Users will not be able to create router groups with ports that overlap with this value. See Appendix A in this document for which system components use these ports. If you run anything else on your TCP Router VM, you must add its port to this list, or you risk still running into this bug. | See Appendix A |
tcp_router.fail_on_router_port_conflicts | Fail the TCP Router if routing_api.reserved_system_component_ports conflicts with ports in existing router groups. We suggest giving your users a chance to update their router groups before setting this to true. | false |
routing_api.fail_on_router_port_conflicts | By default this is set to the same value as tcp_router.fail_on_router_port_conflicts. If true, API calls to create or update router groups will fail if the reservable_ports conflict with routing_api.reserved_system_component_ports. | false |
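A minimal sketch of a BOSH ops file that enables both checks. The deployment name `cf`, manifest name `cf.yml`, and instance group names `tcp-router` and `api` match a default cf-deployment; adjust for yours, and fix any invalid router groups first or the next deploy will fail:

```bash
cat > enable-port-conflict-checks.yml <<'EOF'
- type: replace
  path: /instance_groups/name=tcp-router/jobs/name=tcp_router/properties/tcp_router/fail_on_router_port_conflicts?
  value: true
- type: replace
  path: /instance_groups/name=api/jobs/name=routing-api/properties/routing_api/fail_on_router_port_conflicts?
  value: true
EOF
bosh -d cf deploy cf.yml -o enable-port-conflict-checks.yml
```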
If `routing_api.fail_on_router_port_conflicts` is true, then when a user tries to create or update a router group to include a port in `routing_api.reserved_system_component_ports`, they will get a status code 400 and the following error:
{"name":"ProcessRequestError","message":"Cannot process request: Invalid ports. Reservable ports must not include the following reserved system component ports: [2822 2825 3458 3459 3460 3461 8853 9100 14726 14727 14821 14822 14823 14824 14829 15821 17002 35095 39873 40177 42393 46567 53035 53080]."}
When the TCP Router starts, it checks all existing router groups against the `routing_api.reserved_system_component_ports` property. To re-run this check you can monit restart the TCP Router.
You will see the following in the TCP Router logs...
If there are invalid router groups and `tcp_router.fail_on_router_port_conflicts` is false:
- You will see `tcp-router.router-group-port-checker-error: WARNING! In the future this will cause a deploy failure.`
- Plus you will see a list of which router groups contain the conflicting ports.
{
"timestamp": "2021-05-03T20:59:43.127270911Z",
"level": "error",
"source": "tcp-router",
"message": "tcp-router.router-group-port-checker-error: WARNING! In the future this will cause a deploy failure.",
"data": {
"error": "The reserved ports for router group 'group-1' contains the following reserved system component port(s): '14726, 14727, 14821, 14822, 14823, 14824, 14829, 15821, 17002'. Please update your router group accordingly.\nThe reserved ports for router group 'group-2' contains the following reserved system component port(s): '40177'. Please update your router group accordingly."
}
}
If there are invalid router groups and `tcp_router.fail_on_router_port_conflicts` is true:
- You will see `tcp-router.router-group-port-checker-error: Exiting now.`
- Plus you will see a list of which router groups contain the conflicting ports.
- Then monit will report the tcp_router as failing.
{
"timestamp": "2021-05-03T21:04:02.507129979Z",
"level": "error",
"source": "tcp-router",
"message": "tcp-router.router-group-port-checker-error: Exiting now.",
"data": {
"error": "The reserved ports for router group 'group-1' contains the following reserved system component port(s): '14726, 14727, 14821, 14822, 14823, 14824, 14829, 15821, 17002'. Please update your router group accordingly.\nThe reserved ports for router group 'group-2' contains the following reserved system component port(s): '40177'. Please update your router group accordingly."
}
}
If the seeded router groups in `routing_api.router_groups` are invalid and `routing_api.fail_on_router_port_conflicts` is true:
- The routing-api job will cause the deployment to fail.
- You will see the following log in `routing-api.stdout.log`:
{
"timestamp": "2021-05-03T21:04:02.507129979Z",
"source": "routing-api",
"message": "routing-api.failed-load-config",
"log_level": 2,
"data": {
"error": "Invalid ports. Reservable ports must not include the following reserved system component ports: [2822 2825 3457 3458 3459 3460 3461 8853 9100 14726 14727 14821 14822 14823 14824 14829 14830 14920 14922 15821 17002 53035 53080]."
}
}
If there are no invalid router groups:
- You will see `tcp-router.router-group-port-checker-success: No conflicting router group ports.`
{
"timestamp": "2021-05-03T21:08:32.733453194Z",
"level": "info",
"source": "tcp-router",
"message": "tcp-router.router-group-port-checker-success: No conflicting router group ports.",
"data": {}
}
## 🗨️ FAQ
❓ Do I really need to check the ports running on my TCP Router VM?
Yes. You might have custom jobs running on your deployment. If you don't include all in-use ports, you risk running into this bug, which will break TCP routes.
❓ How can I see what ports are in use on my TCP Router VM?
- SSH onto your TCP Router VM and become root.
- Run `netstat -tlpn | grep -v haproxy`. Ignore haproxy, since those are TCP routes and we are looking for system components.
- To sort them all nicely, try this: `netstat -tlpn | grep -v haproxy | cut -d" " -f16 | cut -d":" -f2 | grep -v For | sort -n`
❓ I see something running on port 22! Why isn't that included in `routing_api.reserved_system_component_ports`?
Router groups have never been allowed to use ports 0-1023, so you don't need to specifically exclude them.
❓ Why aren't my ports for udp-forwarder and system-metrics-scraper included in `routing_api.reserved_system_component_ports`?
Currently these jobs choose any open ephemeral port when they start. This is problematic for this bug and will be fixed soon. You can track this issue for udp-forwarder here and for system-metrics-scraper here.
❓ I fixed my router groups. How can I rerun the check?
You can rerun the check by monit restarting the TCP Router, or you can wait for the next deploy, which will restart the TCP Router.
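A sketch of re-running the check; the job and log file names below match routing-release defaults but may vary by version:

```bash
# On the TCP Router VM, as root:
monit restart tcp_router
monit summary   # wait until tcp_router reports "running"
# Then look for the checker's verdict in the logs:
grep router-group-port-checker /var/vcap/sys/log/tcp_router/tcp_router.stdout.log | tail -n 3
```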
❓ In the logs it says that there is a conflicting port, but everything is running just fine. What's up with that?
Either (1) you don't have a system component running on that port and everything is fine, or (2) you have a ticking time bomb and will likely run into this bug soon.
To see if there is a system component using that port, run `netstat -tlpn | grep PORT` on the TCP Router VM. If there is no system component running there, then you are fine and you can remove the port from `routing_api.reserved_system_component_ports`. If there is a system component running there, then you should update your router group to not include that port ASAP.
❓ I can't upgrade yet. Is there another way I could check to see if there are invalid router groups?
Yes! You don't need our fancy automation; you can do it yourself. First grab all of the ports from the TCP Router VM (see instructions here). Then grab all of your router groups (see docs here). Then check all of the router groups to make sure they don't include any of the system component ports. A sketch of this comparison follows at the end of this answer.
You will also need to check the router groups seeded in the `routing_api.router_groups` property. Even though this property is only used to seed router groups on the very first deploy, it cannot contain invalid router groups. Either delete these seeded router groups from the manifest (this will have no effect on the already-created router groups) or fix the router groups to contain only valid ports.
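A sketch of the manual comparison, assuming `cf curl` can reach the Routing API and jq is installed:

```bash
# List each router group's name and port ranges ...
cf curl /routing/v1/router_groups | jq -r '.[] | "\(.name): \(.reservable_ports)"'
# ... then compare each range by eye against the system component ports you
# collected with netstat. For example, "1024-2047,18000-32767" does not
# overlap any default port in Appendix A.
```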
❓ Why can't you detect what is running on the VM and see what ports are used? Why is there a list configured at deploy time?
We wanted both a runtime and a deploytime check for misconfigured router groups. This way we can check all existing router groups as well as router groups that will be created or updated in the future. It is hard to determine at deploytime what will be running on a VM, so we determined that this was the easiest solution.
❓ Will I ever have to update this list?
Maybe, but not often. In release notes we will include instructions to update this list if a new system component starts running on the TCP Router VM. Of course, if you have your own custom deployment setup, we can't warn you when this happens.
❓ I got a `router-group-port-checker-error` in the TCP Router logs. What does that mean?
This error means that the port check was unable to determine whether your router groups contain ports that overlap with `routing_api.reserved_system_component_ports`. This can happen for a few reasons:
- The tcp_router client may not be authorized via UAA to view router groups. See this PR for an example of how to fix this, and the sketch after this list.
- There could be a problem connecting to UAA. Debug your network connection and then rerun the check.
- There could be a problem connecting to the routing-api. Debug your network connection and then rerun the check.
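For the first cause, a sketch of inspecting the client's authorities with the uaac CLI. The client name tcp_router is an assumption here; check which UAA client your tcp_router job is configured to use:

```bash
uaac target https://uaa.SYSTEM-DOMAIN --skip-ssl-validation
uaac token client get admin -s "$ADMIN_CLIENT_SECRET"
# The client needs routing.router_groups.read in its authorities:
uaac client get tcp_router
```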
## 📝 Appendix A: Default System Component Ports
This is a list of all of the system components for a default cf-deployment that might be running on the TCP Router VM, and their ports. These are the default ports used for the `routing_api.reserved_system_component_ports` property.
Some of these ports are configurable and may not match what is running on your deployment. You are responsible for checking this list against what is running on your deployment.
Note: Router Groups have never been allowed to use ports 0 - 1023, so you don't need to specifically exclude them.
Port | System Component or Job Name | Bosh Property Name | Bosh Link? | Note |
---|---|---|---|---|
2822 | monit | n/a | n/a | Not configurable. See code here. |
2825 | bosh agent | n/a | n/a | Not configurable. See code here. |
3457 | loggr-udp-forwarder-agent | listening_port | no | See bosh property here. |
3458 | loggr-forwarder-agent | grpc_port | no | See bosh property here. |
3459 | loggregator_agent | grpc_port | yes | See bosh property here. This is overwritten in the default CF-deployment here. |
3460 | loggr-syslog-agent | port | no | This is overwritten in the default CF-deployment here. |
3461 | metrics-agent | port | no | See bosh property here. |
8853 | bosh-dns-health | health.server.port | no | See bosh property here. |
9100 | otel-collector | ingress.grpc.port | no | Used by otel-collector as the main ingress port to receive OTLP over gRPC. This port was reclaimed from system-metrics-agent, which now uses 53035 everywhere. See bosh property here. |
14726 | metrics-agent | metrics_exporter_port | no | Prometheus endpoint. See bosh property here. |
14727 | metrics-agent | metrics.port | no | Agent's own metrics and debug. See bosh property here. |
14821 | prom-scaper | metrics.port | no | See bosh property here. |
14822 | loggr-syslog-agent | metrics.port | no | See bosh property here. |
14823 | loggr-forwarder-agent | metrics.port | no | See bosh property here. |
14824 | loggregator_agent | metrics.port | no | See bosh property here. |
14829 | loggr-udp-forwarder | metrics.port | no | See bosh property here. |
14830 | otel-collector | TBD | n/a | This port is used for the collector's metrics. This port was previously used by loggr-udp-forwarder, however it was disabled there. See this issue for more historical information. |
14920* | system-metrics-scraper | metrics_port | no | *This job does not run on the TCP Router or Gorouter! However, you should not use this port for an agent that will be deployed alongside that job. See bosh property here. |
 | | | | *This port was considered for a debug port, but it turns out it is in use by leadership-election, which does not run on the tcp-router. It is not reserved in the TCP Router. See this issue for more information. |
14922 | system-metrics-agent | debug_port | no | See bosh property here. |
15821 | metrics-discovery-registrar | metrics.port | no | See bosh property here. |
17002 | cf-tcp-router | tcp_router.debug_address | yes | See bosh property here. |
53035 | system-metrics-agent | metrics_port | no | This is the new default. See the bosh property here. This used to be configured by an ops file in CF-deployment. |
53080 | bosh-dns | api.port | no | See bosh property here. |