Rework ElectrumClient
#512
Conversation
This is only used when connecting, it doesn't need to be an argument of the `ElectrumClient`.
TCP sockets with electrum servers require some non-trivial code to reconstruct individual messages from the bytes received on the socket. We process bytes in chunks, then reconstruct utf8 strings, and finally split those strings at newline characters into individual messages. We document that code, rename fields to make it easier to understand, and add unit tests. We remove it from the TCP socket abstraction, since this is specific to electrum connections. We also change the behavior in case the socket is closed while we have buffered a partial message: it doesn't make sense to emit it, as listeners won't be able to decode it.
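The chunk-reassembly logic described above can be sketched as follows. This is a hypothetical `MessageBuffer`, not the actual implementation; for simplicity it decodes each chunk eagerly, whereas the real code must also handle UTF-8 sequences split across chunk boundaries.

```kotlin
/**
 * Illustrative sketch: electrum messages are newline-delimited JSON, but TCP
 * delivers arbitrary byte chunks, so we buffer partial data until a complete
 * line is available. A trailing partial message is kept for the next chunk
 * and never emitted on its own (listeners couldn't decode it).
 */
class MessageBuffer {
    private val partial = StringBuilder()

    /** Feed a raw chunk and return the complete messages it unlocked. */
    fun receive(chunk: ByteArray): List<String> {
        partial.append(chunk.decodeToString())
        val messages = mutableListOf<String>()
        while (true) {
            val newline = partial.indexOf("\n")
            if (newline == -1) break
            messages.add(partial.substring(0, newline))
            partial.deleteRange(0, newline + 1)
        }
        return messages
    }
}
```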
This is a pure 1:1 mapping from `connectionStatus`, there is no reason for that duplication. Clients can trivially migrate or `map` using the `toConnectionState` helper function.
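The migration path can be sketched like this (types are simplified stand-ins, not the actual lightning-kmp definitions): since the removed flow was a pure projection of the status, clients recover it by mapping with a helper.

```kotlin
// Hedged sketch: connectionState was a 1:1 projection of connectionStatus,
// so a helper like this one lets clients map the status flow themselves.
sealed class ElectrumConnectionStatus {
    object Closed : ElectrumConnectionStatus()
    object Connecting : ElectrumConnectionStatus()
    data class Connected(val serverVersion: String) : ElectrumConnectionStatus()
}

enum class Connection { CLOSED, ESTABLISHING, ESTABLISHED }

fun ElectrumConnectionStatus.toConnectionState(): Connection = when (this) {
    is ElectrumConnectionStatus.Closed -> Connection.CLOSED
    is ElectrumConnectionStatus.Connecting -> Connection.ESTABLISHING
    is ElectrumConnectionStatus.Connected -> Connection.ESTABLISHED
}
```

A client that previously collected the dedicated flow would then collect something like `client.connectionStatus.map { it.toConnectionState() }` instead.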
We clean up the coroutine hierarchy: the `ElectrumClient` has a single internal coroutine that is launched once the connection has been established with the electrum server. This coroutine launches three child coroutines and uses supervision to gracefully stop if the connection is closed or an error is received. Connection establishment happens in the context of the job that calls `connect()` and doesn't throw exceptions. We revert the introduction of a `CoroutineExceptionHandler` inside the JVM socket code, as it shouldn't be needed anymore.
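The hierarchy described above can be sketched roughly as follows (child bodies are placeholders; the real loops read and write the electrum socket).

```kotlin
import kotlinx.coroutines.*

// Heavily simplified sketch: one parent coroutine per established connection,
// with three child coroutines underneath it. Structured concurrency guarantees
// that if one child fails, its siblings are cancelled and the parent completes.
suspend fun runSession(onStopped: () -> Unit) {
    try {
        coroutineScope {
            launch { /* process messages received on the socket */ }
            launch { /* send outgoing RPC requests */ }
            launch { /* periodically ping the electrum server */ }
        }
    } finally {
        onStopped() // runs on clean shutdown, error, or cancellation
    }
}
```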
Otherwise we may be stuck in the `Connecting` state.
We just keep `getHeader` / `getHeaders` which can be handy.
On most RPC calls, we can gracefully handle server errors. Since we cannot trust the electrum server anyway, and they may lie to us by omission, this doesn't downgrade the security model. The previous behavior was to throw an exception, which was never properly handled and would just result in a crash of the wallet application.
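The value-based error handling can be sketched like this, with a simplified `Either` (lightning-kmp has its own utility type) and a hypothetical decoder.

```kotlin
// Hedged sketch: server errors become values instead of exceptions, so a
// misbehaving electrum server can no longer crash the wallet application.
sealed class Either<out L, out R> {
    data class Left<L>(val value: L) : Either<L, Nothing>()
    data class Right<R>(val value: R) : Either<Nothing, R>()
}

data class ServerError(val message: String)

// Hypothetical example: a fee estimate response is either a number or an error
// that the caller must handle explicitly.
fun decodeFeeEstimate(payload: String): Either<ServerError, Long> =
    payload.toLongOrNull()?.let { Either.Right(it) }
        ?: Either.Left(ServerError("cannot decode '$payload'"))
```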
We add an explicit timeout to RPC calls and a retry. If that retry also fails, two strategies are available:
- handle the error and gracefully degrade (non-critical RPC calls)
- disconnect and retry with a new electrum server (critical RPCs such as subscriptions)

The timeout can be updated by the application, for example when a slow network is detected or when Tor is activated.
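The retry policy can be sketched as follows (names are illustrative, not the actual API): each attempt is bounded by a timeout, we retry once, and after that the caller decides whether to degrade gracefully or switch electrum servers.

```kotlin
import kotlinx.coroutines.*

// Hedged sketch: a null result means both the initial attempt and the single
// retry timed out; the caller then either degrades or disconnects.
suspend fun <T : Any> rpcCallWithRetry(timeoutMs: Long, call: suspend () -> T): T? {
    repeat(2) { // initial attempt + a single retry
        withTimeoutOrNull(timeoutMs) { call() }?.let { return it }
    }
    return null
}
```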
As much as possible, there should be a more specific description of the problems we are trying to solve, otherwise it's hard to evaluate how much better these changes are (which is not saying that the electrum client could not use some refactoring).
It may also be easier to decompose this into a small set of PRs that build upon each other: for example, iOS tests fail and it's difficult to see why or when they got broken.
```diff
 }

-suspend inline fun <reified T : ElectrumResponse> rpcCall(request: ElectrumRequest): T {
+/** Send a request using timeouts, retrying once and giving up after that retry. */
+private suspend inline fun <reified T : ElectrumResponse> rpcCall(request: ElectrumRequest): Either<ServerError, T> {
```
Adding a default timeout is a very good idea but I'm not that sure about automated retries.
We don't really have a choice, I think we have to retry to mitigate the case where we failed simply because we got disconnected and reconnected (as described in the comment of the `rpcCallWithTimeout` function), otherwise we'll have a lot of easily avoidable failures (the behavior we have today is that the caller will hang forever waiting for a response).
But that's exactly what I did in each commit message? I spent a lot of time creating a clean commit history, where each commit fixes specific issues. Can you tell me which commit is lacking details?
We could, but the split into many commits also achieves that. It's easy to locally test specific commits to see at what point tests started failing. Note that the only tests that were failing are new tests that were added in this PR. I tried porting them to…
Force-pushed from 4676981 to 885d82f
When connecting to an invalid address in tests on iOS, the error isn't reported by the native socket, which just hangs until the test times out. We lower the timeout on the `connect` call, which previously had the same value as the test timeout. We thus fail the connection attempt after 1 second and are able to reconnect to a different server.

The very low RPC timeout was flaky on iOS: because the clock used depends on the actual dispatcher, it sometimes triggered only after receiving a valid response from the server, which made the test fail. Setting it to 0ms ensures that the call is immediately cancelled with a timeout exception regardless of the dispatcher used.
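Bounding the connection attempt can be sketched like this (a hypothetical helper, not the actual code): a native socket that hangs on an invalid address now fails fast, letting us try the next electrum server instead of blocking until the test timeout.

```kotlin
import kotlinx.coroutines.*

// Hedged sketch: a null result means the connection attempt did not complete
// within the bound, so the caller can move on to a different server.
suspend fun <S : Any> connectOrNull(timeoutMs: Long, open: suspend () -> S): S? =
    withTimeoutOrNull(timeoutMs) { open() }
```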
We actually still need it until ktorio/ktor#3690 is integrated into a ktor release.
Force-pushed from 99d635d to 2b70f09
This is now ready for review and e2e testing!
This was forgotten after e2e tests.
Since `withTimeout` runs concurrently, it may throw the timeout exception after we've established the TCP connection, so we need to release it if we can (if we have a pointer to it).
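The race described above can be sketched as follows (a hypothetical helper and a dummy socket type): `withTimeout` runs concurrently with the connection attempt, so the timeout can fire after the TCP socket was actually opened, in which case we must close it instead of leaking it.

```kotlin
import kotlinx.coroutines.*

// Stand-in for a platform socket, used only to illustrate the cleanup.
class DummySocket { var closed = false; fun close() { closed = true } }

// Hedged sketch: keep a reference to the socket as soon as it exists, so the
// timeout handler can release it if we lost the race.
suspend fun connectWithCleanup(timeoutMs: Long, open: suspend () -> DummySocket): DummySocket? {
    var socket: DummySocket? = null
    return try {
        withTimeout(timeoutMs) { open().also { socket = it } }
    } catch (e: TimeoutCancellationException) {
        socket?.close() // release the socket if the timeout fired after connect
        null
    }
}
```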
I tested this on Android without problems.
It was high time for cleaning up the coroutine hierarchy, and relying on simple `suspend` functions at connection establishment is much simpler to reason about.

My main concern is the potential side effects of the new behavior, especially timeout errors when making calls to `ElectrumClient`. But the calls do not throw, they either:
- return a nullable value: type checks should protect us here
- for `getScriptHashHistory` / `getScriptHashUnspents`: return an empty list, which seems the appropriate thing to do.

So, I think we are fine.
There are a few things that have an impact on phoenix integration @dpad85:
- move `socketBuilder` to `connect()`
- remove old `connectionState` flow
- introduction of timeouts (15s by default)
- `getConfirmations` now takes `currentMinDepth`: that one seems the most annoying to me, as it makes the api (public and also internal) less friendly
We can actually get it from the connection status, which simplifies the way callers use those functions.
Our `ElectrumClient` was poorly documented and had unexpected behavior in various non-trivial edge cases (e.g. disconnections while RPC calls are pending). Since we rely on untrusted electrum servers, we need clearer requirements on what to do in case of failures, and a mechanism to automatically disconnect from malicious-looking or buggy servers.

The architecture of the coroutines spawned by the `ElectrumClient` was also confusing, with for example the `cancel()` method called sometimes on child jobs, sometimes on the main `scope`, which was really hard to reason about since supervision wasn't explicit.

I highly recommend reviewing commits independently, as they are all narrowly scoped to fix a specific issue. We should also spend some time testing this in Phoenix to see how it behaves in practice.

I removed the `CoroutineExceptionHandler` installed on TLS sockets, but that is unfortunately premature. While this new version of the `ElectrumClient` doesn't crash when the socket is remotely closed in unit tests (while the previous version did crash), the behavior is different on Android and the application will still crash: we need ktorio/ktor#3690 to be integrated into `ktor` before we can safely get rid of it. But since it may also have unintended side-effects, I'm not sure yet whether we should keep it or not in this PR.