Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DEV-11497] Socket connection management #8

Open
wants to merge 54 commits into
base: master
Choose a base branch
from
Open

[DEV-11497] Socket connection management #8

wants to merge 54 commits into from

Conversation

kpsroka
Copy link

@kpsroka kpsroka commented Aug 7, 2024

Rewrites the PhoenixSocket class to simplify connection management and message processing.

While this PR improves socket handling in the Phoenix library, it is not guaranteed to resolve any particular problem. Its intention is to primarily make the socket flow easier to understand and debug.

The major code change is that PhoenixSocket has been split into several classes, mainly SocketConnectionManager and its supporting classes.

The SocketConnectionManager is responsible for keeping a connection alive - it has facilities to keep track of current WebSocket connection state, reconnect connection on failures, and provides communication interface to PhoenixSocket. The upstream (to PhoenixSocket class) interface consists of three callbacks supplied to its constructor:

  • onMessage(String) for receiving messages from the WebSocket connection,
  • onError(Object, [StackTrace]) for communicating errors with either initialization or operation of the connection.
  • onStateChange(WebSocketConnectionState) which communicates current state of the connection (initialized, ready, closing, and closed).

The downstream interface is the addMessage(String) method.

In addition there are several smaller classes working with SocketConnectionManager:

  • _WebSocketConnection which is a thin wrapper around WebSocketChannel, managing its stream/sink interface.
  • _ConnectionCallbacks which encapsulates the upstream callbacks, and performs several checks to prevent obsolete or invalid calls from propagating.
  • SocketConnectionAttempt which identifies connection attempt, and allows a bit of control over reconnection delay.
  • ConnectionInitializationException.

The PhoenixSocket class is now mostly responsible for messaging and keeping track of the channels. What changes in some event handling is that this class no longer attempts to control the underlying WebSocket connection in response to the socket's events. Instead, the internals of SocketConnectionManager are responsible for controlling connections and acting on errors and closes to keep the connection running.

This PR also removes PhoenixRawSocket, which is unused, and incompatible with the changes.

Tests are pending, but I've been running it successfully as Mac and iOS app in the past day.

@roughike
Copy link

roughike commented Aug 9, 2024

I read through the changes. Before addressing the tiny details, I'd like to talk about the big picture.

I agree with you that the WTFs/minute when reading the library's source code is a bigger number that it should be. It's supposed to closely follow the reference implementation, which at times, it does not do a very good job at. However, many of us are familiar with the twisted way the library works, as we've debugged and tried to understand it for the past few years as we've fixed issues.

If we merge your PR, and replace a lot of old code with your new code, we'd be throwing a lot of that knowledge away. For the near future, this PR might simplify things on the surface level by splitting things up, but at a cost of replacing "the devil we know" with a new saint that we don't know that well yet.

@kpsroka: While this PR improves socket handling in the Phoenix library, it is not guaranteed to resolve any particular problem. Its intention is to primarily make the socket flow easier to understand and debug.

This being the case, can we make very minimal (or zero?) changes to the existing code while achieving the same goal?

For example:

  • Increase logging without changing existing code.
  • Add more documentation comments about certain flows. For example, about the heartbeats (they're sent every 30s, if we receive server response, we effectively reset the heartbeat timeout), connection state changes, etc. Might be a double-edged sword if the documentation is not kept up to date.
  • Add inline code comments at places where code is hard to read. I was going to say the other option is to make minimal changes to make the code more readable in separate PRs, but maybe it's better to stay as close to the reference implementation as possible so that it's easy to compare the two.
  • If there are clear problems we need to improve, make purpose-built PRs for those.
  • Something else?

As this PR currently stands, it feels a bit risky to merge it, especially when it does not fix anything concrete. We also don't have that many tests to begin with, so more minimal changes with a clear goal would be easier to manage.

@kpsroka
Copy link
Author

kpsroka commented Aug 12, 2024

I read through the changes. Before addressing the tiny details, I'd like to talk about the big picture.

I agree with you that the WTFs/minute when reading the library's source code is a bigger number that it should be. It's supposed to closely follow the reference implementation, which at times, it does not do a very good job at. However, many of us are familiar with the twisted way the library works, as we've debugged and tried to understand it for the past few years as we've fixed issues.

If we merge your PR, and replace a lot of old code with your new code, we'd be throwing a lot of that knowledge away. For the near future, this PR might simplify things on the surface level by splitting things up, but at a cost of replacing "the devil we know" with a new saint that we don't know that well yet.

I think that while we might be familiar with the current code, it being hard to understand also makes us avoid changing it even if we suspect something is wrong with it - it is more pleasant to fix our own code instead. The other benefit of having it easier is that more people will be able to learn it, and detect errors. I understand that this relies on relative simplicity of the new solution, so I'm open to updating this PR to simplify it even more, but I disagree with the general sentiment of valuing the current knowledge over future use.

@kpsroka: While this PR improves socket handling in the Phoenix library, it is not guaranteed to resolve any particular problem. Its intention is to primarily make the socket flow easier to understand and debug.

This being the case, can we make very minimal (or zero?) changes to the existing code while achieving the same goal?

For example:

* Increase logging without changing existing code.

* Add more documentation comments about certain flows. For example, about the heartbeats (they're sent every 30s, if we receive server response, we effectively reset the heartbeat timeout), connection state changes, etc. Might be a double-edged sword if the documentation is not kept up to date.

* Add inline code comments at places where code is hard to read. I was going to say the other option is to make minimal changes to make the code more readable in separate PRs, but maybe it's better to stay as close to the reference implementation as possible so that it's easy to compare the two.

* If there are clear problems we need to improve, make purpose-built PRs for those.

* Something else?

I don't think these can fully replace the code change themselves, and the result with applying these changes to existing code without any structural change can make the end result less readable.

As this PR currently stands, it feels a bit risky to merge it, especially when it does not fix anything concrete. We also don't have that many tests to begin with, so more minimal changes with a clear goal would be easier to manage.

I probably should add more context: the intention for this PR was to resolve the duplicate connection problem by changes to the socket layer. But as with other PRs that we merged to resolve this problem, I cannot guarantee that it's going to fix it. From tests on my devices I didn't notice this problem surfacing*, but I'll introduce more tests for this, including probable causes.

I'm putting this PR into draft mode for the time being, I'll focus on some tests and additional changes to simplify this even further.

* I saw an instance of multiple rapid connection/disconnection though. I'll investigate what's causing this [probably some calls to connect get queued too many times, when the only last one prevails]).

@kpsroka kpsroka marked this pull request as draft August 12, 2024 11:33
@brian-superlist
Copy link

brian-superlist commented Aug 13, 2024

Summary of chat:

  • Naming and semantic meaning of ready and connected. Important for all of us to be on same page, would be good to hear from Iiro and Miguel about their opinion.
  • Want to ensure this version does not preclude us from using binary protobufs in the future. Didn't sound like this would create any blockers for that.

Neelansh-ns and others added 5 commits August 13, 2024 11:40
…h#86)

* add timeout to websocket connection and socket heartbeat

* added timeout to WebSocket ready future

* closing sink before assigning null to ws

* cancel heartbeat when socket is disposed
@kpsroka
Copy link
Author

kpsroka commented Aug 13, 2024

Summary of chat:

  • Naming and semantic meaning of ready and connected. Important for all of us to be on same page, would be good to hear from Iiro and Miguel about their opinion.
  • Want to ensure this version does not preclude us from using binary protobufs in the future. Didn't sound like this would create any blockers for that.

Replaced the PhoenixSocket._isConnected field with "_isOpen", which also is in line with the Phoenix state event naming.

Added some tests, and more comments. Will work on more tests tomorrow, but I think it's ready to review in detail now @roughike @miguelcmedeiros

@kpsroka kpsroka marked this pull request as ready for review August 13, 2024 17:04
Copy link

@brian-superlist brian-superlist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some inline comments around naming and also implementation.

lib/src/socket_connection_attempt.dart Outdated Show resolved Hide resolved
lib/src/socket_connection.dart Outdated Show resolved Hide resolved
lib/src/socket_connection.dart Outdated Show resolved Hide resolved
lib/src/socket_connection.dart Outdated Show resolved Hide resolved
lib/src/socket_connection.dart Outdated Show resolved Hide resolved
lib/src/socket_connection.dart Outdated Show resolved Hide resolved
lib/src/socket.dart Show resolved Hide resolved
lib/src/socket_state.dart Outdated Show resolved Hide resolved
@kpsroka
Copy link
Author

kpsroka commented Aug 20, 2024

@brian-superlist @miguelcmedeiros @roughike

Updated structurally how the code is split in various classes, with main intention being simplicity, and preservation of current behavior of PhoenixScoket.connect().

What changed:

  • SocketConnectionAttempt got renamed to DelayedCallback, and added a callback to be executed when the delay completes. This is now a more generic class, effectively serving as Future.delayed with control over delay. I wanted to preserve it as a dependency-less utility, so it has no connection (nominal or logical) to "sockets".
  • SocketConnectionManager is now unstoppable - I removed the "stop" facility, so effectively once started it can only be disposed. This allowed for simpler life-cycle of it. However, now it needs to be reinstantiated whenever PhoenixSocket closes (disposing of the manager) and reopens.

@brian-superlist
Copy link

Hey all, I'm sorry to leave this comment right before I leave, but I'm not sure I fully support merging this PR. Right now, the network connection is pretty stable and recovers well. We have not had users writing in reports about seeing the offline indicator, nor are we sure we are still running into double connection issues. Finally, it looks like we are failing on a few different integration tests now.

Therefore, if this PR does not solve a specific issue, while possibly causing some regressions, I'm not sure if it's worth it. Readability is in the eye of the beholder, and I personally think this is harder to reason about than the previous code. Miguel mentioned that I might have Stockholm syndrome 😅, and I actually sympathize with that description. However, to make such a large change to a library that is now quite stable, we should have a strong case for it: What specifically does it improve?

I don't want to block the merging of this PR if the team feels it's the right direction, but I wanted to note that even after these latest changes I'm also hesitant about this changeset and the possibility that it might cause us new headaches with connections right after we solved so many of them.

If we want to continue with the plan we have right now: Fix up the integration tests, merge early next week, and test, I can disagree and commit. However, I wanted to try to clarify that I'm still not fully sold on this set of changes.

@roughike
Copy link

roughike commented Aug 23, 2024

Just for the record - I echo Brian's comment and what I said 2 weeks ago.

Feels like the risk-to-reward ratio is off with this one. I don't feel like it's the right direction, I don't think the changes simplify anything, and I think actually makes things more hard to follow for me.

I don't want to merge this one. If you and @miguelcmedeiros feel strongly about merging this one, I will not block you, but I'll assume that you know what you're doing and know the risks.

If you do decide to merge, at least the failing integration tests must be fixed or there has to be a very good reason why not to fix them.

@MarcelKaeding
Copy link

Let's merge not before the 1.14.0 build was sent to Testlio. We can then leave it in internal testing for 1 cycle before we release it to users.

@kpsroka
Copy link
Author

kpsroka commented Aug 27, 2024

Just for the record - I echo Brian's comment and what I said 2 weeks ago.

Feels like the risk-to-reward ratio is off with this one. I don't feel like it's the right direction, I don't think the changes simplify anything, and I think actually makes things more hard to follow for me.

I can add additional tests over the ones that are here, and that I've added myself, to even further minimize the risk.

I don't want to merge this one. If you and @miguelcmedeiros feel strongly about merging this one, I will not block you, but I'll assume that you know what you're doing and know the risks.

That's why I'd like to have it tested internally on daily before pushing it to production.

If you do decide to merge, at least the failing integration tests must be fixed or there has to be a very good reason why not to fix them.

After a bit of digging I realized that the backend at example/backend is a backend for integration testing (maybe the backend). I removed two tests which were not passing since (at least) 658661b, and adjusted some - mostly to call socket.dispose() either instead of socket.close() or actually running socket.dispose() at all.

…-state-cannot-reconnect-a-disposed-socket

[part of DEV-11757] Makes failed heartbeat not attempt to reconnect after ConnectionManager is closed or disposed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants