Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework ElectrumClient #512

Merged
merged 16 commits into from
Sep 11, 2023
Merged

Rework ElectrumClient #512

merged 16 commits into from
Sep 11, 2023

Conversation

t-bast
Copy link
Member

@t-bast t-bast commented Jul 12, 2023

Our ElectrumClient was poorly documented and had unexpected behavior in various non-trivial edge cases (e.g. disconnections while RPC calls are pending). Since we rely on untrusted electrum servers, we need clearer requirements on what to do in case of failures, and a mechanism to automatically disconnect from malicious-looking or buggy servers.

The architecture of the coroutines spawned by the ElectrumClient was also confusing, with for example the cancel() method called sometimes on child jobs, sometimes on the main scope, which was really hard to reason about since supervision wasn't explicit.

I highly recommend reviewing commits independently, as they are all narrowly scoped to fix a specific issue. We should also spend some time testing this in Phoenix to see how it behaves in practice.

I removed the CoroutineExceptionHandler installed on TLS sockets, but that is unfortunately premature. While this new version of the ElectrumClient doesn't crash when the socket is remotely closed in unit tests (while the previous version did crash), the behavior is different on Android and the application will still crash: we need ktorio/ktor#3690 to be integrated into ktor before we can safely get rid of it. But since it may also have unintended side-effects, I'm not sure yet whether we should keep it or not in this PR.

This is only used when connecting, it doesn't need to be an argument of
the `ElectrumClient`.
TCP sockets with electrum servers require some non-trivial code to
reconstruct individual messages from the bytes received on the socket.

We process bytes in chunks, then reconstruct utf8 strings, and finally
split those strings at newline characters into individual messages.

We document that code, rename fields to make it easier to understand,
and add unit tests. We remove it from the TCP socket abstraction, since
this is specific to electrum connections.

We also change the behavior in case the socket is closed while we have
buffered a partial message: it doesn't make sense to emit it, as listeners
won't be able to decode it.
This is a pure 1:1 mapping from `connectionStatus`, there is no reason for
that duplication. Clients can trivially migrate or `map` using the
`toConnectionState` helper function.
We clean up the coroutine hierarchy: the `ElectrumClient` has a single
internal coroutine that is launched once the connection has been
established with the electrum server. This coroutine launches three child
coroutines and uses supervision to gracefully stop if the connection is
closed or an error is received.

Connection establishment happens in the context of the job that calls
`connect()` and doesn't throw exceptions.

We revert the introduction of a `CoroutineExceptionHandler` inside the
JVM socket code, as it shouldn't be needed anymore.
Otherwise we may be stuck in the `Connecting` state.
We just keep `getHeader` / `getHeaders` which can be handy.
On most RPC calls, we can gracefully handle server errors. Since we cannot
trust the electrum server anyway, and they may lie to us by omission, this
doesn't downgrade the security model.

The previous behavior was to throw an exception, which was never properly
handled and would just result in a crash of the wallet application.
We add an explicit timeout to RPC calls and a retry. If that retry also
fails, two strategies are available:

- handle the error and gracefully degrade (non-critical RPC calls)
- disconnect and retry with new electrum server (critical RPCs such as
  subscriptions)

The timeout can be updated by the application, for example when a slow
network is detected or when Tor is activated.
@t-bast t-bast requested review from pm47, sstone and dpad85 July 12, 2023 10:52
Copy link
Member

@sstone sstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As much as possible there should be a more specific description of the problems we are trying to solve otherwise it's hard to evaluate how much better these changes are (which is not saying that the electrum client could not use some refactoring).
It may also be easier to decompose this into a small set of PR that build upon each other: for example ios tests fail and it's difficult to see why or when they got broken.

}

suspend inline fun <reified T : ElectrumResponse> rpcCall(request: ElectrumRequest): T {
/** Send a request using timeouts, retrying once and giving up after that retry. */
private suspend inline fun <reified T : ElectrumResponse> rpcCall(request: ElectrumRequest): Either<ServerError, T> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a default timeout is a very good idea but I'm not that sure about automated retries.

Copy link
Member Author

@t-bast t-bast Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't really have a choice, I think we have to retry to mitigate the case where we failed simply because we get disconnected and reconnected (as described in the comment of the rpcCallWithTimeout function), otherwise we'll have a lot of easily avoidable failures (the behavior we have today is that the caller will hang forever waiting for a response).

@t-bast
Copy link
Member Author

t-bast commented Jul 25, 2023

As much as possible there should be a more specific description of the problems we are trying to solve otherwise it's hard to evaluate how much better these changes are

But that's exactly what I did in each commit message? I spent a lot of time creating a clean commit history, where each commit fixes specific issues. Can you tell me which commit is lacking details?

It may also be easier to decompose this into a small set of PR that build upon each other: for example ios tests fail and it's difficult to see why or when they got broken.

We could, but the split in many commits also achieves that. It's easy to locally test specific commits to see at what point tests started failing.

Note that the only tests that were failing are new tests that were added in this PR. I tried porting them to master, and they also fail there.

@t-bast t-bast force-pushed the electrum-client-rework branch 2 times, most recently from 4676981 to 885d82f Compare July 27, 2023 08:03
When connecting to an invalid address in tests on iOS, the error isn't
reported by the native socket which just hangs until the test times out.

We lower the timeout on the `connect` call, which previously had the same
value as the test timeout. We thus fail the connection attempt after
1 second and are able to reconnect to a different server.

The very low RPC timeout was flaky on iOS: because the clock used depends
on the actual dispatcher, it sometimes triggered only after receiving a
valid response from the server, which made the test fail. Setting it to
0ms ensures that the is immediately cancelled with a timeout exception
regardless of the dispatcher used.
We actually still need it until ktorio/ktor#3690
is integrated into a ktor release.
@t-bast t-bast marked this pull request as ready for review July 28, 2023 13:00
@t-bast
Copy link
Member Author

t-bast commented Jul 28, 2023

This is now ready for review and e2e testing!

This was forgotten after e2e tests.
Since `withTimeout` runs concurrently, it may throw the timeout exception
after we've established the TCP connection, so we need to release it
if we can (if we have a pointer to it).
sstone
sstone previously approved these changes Aug 10, 2023
Copy link
Member

@sstone sstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this on Android without problems.

sstone
sstone previously approved these changes Aug 14, 2023
Copy link
Member

@pm47 pm47 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was high time for cleaning up the coroutine hierarchy, and relying on simple suspend functions at connection establishment is much simpler to reason about.

My main concern is the potential side effects of the new behavior, especially timeout errors when making calls to ElectrumClient. But the calls do not throw, they either:

  • return a nullable value: type checks should protect us here
  • for getScriptHashHistory/getScriptHashUnspents: return an empty list, which seems the appropriate thing to do.

So, I think we are fine.

There are a few things that have an impact on phoenix integration @dpad85:

  • move socketBuilder to connect()
  • remove old connectionState flow
  • introduction of timeouts (15s by default)
  • getConfirmations now takes currentMinDepth: that one seems the most annoying to me, as it makes the api (public and also internal) less friendly

We can actually get it from the connection status, which simplifies the
way callers use those functions.
@t-bast t-bast merged commit fdba64a into master Sep 11, 2023
2 checks passed
@t-bast t-bast deleted the electrum-client-rework branch September 11, 2023 13:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants