
Replacing Tokio's TcpStream with async-std TcpStream #475

Merged
merged 13 commits into main from connection-issues-investigation on Jan 30, 2024

Conversation

@b-yap (Contributor) commented Jan 18, 2024

Somehow in K8s, Tokio's TcpStream gets stuck at "write ready" and never becomes "read ready".
Even after connecting with Rust's std TcpStream and then converting it to Tokio's, the stream never becomes "read ready".

That's when I decided to drop Tokio's TcpStream entirely and switch to the async-std library.
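For illustration (a hypothetical sketch, not code from this PR), readiness on Tokio's TcpStream can be probed with its `ready(Interest)` API; the trace logs mentioned in the commit list below may have looked roughly like this:

```rust
use tokio::io::Interest;
use tokio::net::TcpStream;

// Debugging sketch: wait for readiness and log what the OS reports.
// In the K8s deployments described above, writable was reported but
// readable reportedly never became true.
async fn probe_readiness(stream: &TcpStream) -> std::io::Result<()> {
    let ready = stream.ready(Interest::READABLE | Interest::WRITABLE).await?;
    tracing::trace!(
        readable = ready.is_readable(),
        writable = ready.is_writable(),
        "tcp stream readiness"
    );
    Ok(())
}
```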

In terms of code changes, these are the differences you will see:

|  | Tokio | Async-Std | Rust Std |
| --- | --- | --- | --- |
| Splitting the stream | Can split the stream into read/write halves, with the write half owned by a struct (`Connector`) | Provides an example, but `Connector` now owns the `TcpStream`, so I found no need to split | Can only try to clone (`try_clone()`) |
| Async | Everything is async | Async | Doesn't need async |
| Timeouts | Need to add timeout logic | Will potentially need it | Can set it in the stream |
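To make the timeout row concrete, here is a minimal sketch (my own illustration, with an assumed 30-second duration, not code from this PR):

```rust
use std::time::Duration;

// Rust std: the timeout is a property of the stream itself.
fn connect_std(addr: &str) -> std::io::Result<std::net::TcpStream> {
    let stream = std::net::TcpStream::connect(addr)?;
    stream.set_read_timeout(Some(Duration::from_secs(30)))?; // built in
    Ok(stream)
}

// async-std (and similarly Tokio): the stream has no timeout setting, so
// each operation is wrapped in a timeout future instead.
async fn read_with_timeout(
    stream: &mut async_std::net::TcpStream,
    buf: &mut [u8],
) -> std::io::Result<usize> {
    use async_std::io::ReadExt;
    async_std::io::timeout(Duration::from_secs(30), stream.read(buf)).await
}
```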

This also implies removing some tokio features that were added primarily for TCP reasons: `net` and `io-util`.


How to begin the review:

  1. clients/stellar-relay-lib/src/connection/connector/connector.rs
    • As mentioned above, the whole TcpStream is now owned by the Connector (see the sketch after this list):

      ```diff
      pub struct Connector {
          ...
      -   write_stream_overlay: OwnedWriteHalf,
      +   tcp_stream: TcpStream,
      }
      ```

    • renamed `fn new(...)` to `fn start(...)`, which also starts connecting to the Stellar Node
  2. clients/stellar-relay-lib/src/overlay.rs
    • StellarOverlayConnection's `fn connect(...)` will directly use Connector's `fn start(...)`. No more `fn create_stream(...)` calls.
    • renamed `fn disconnect()` to `fn stop()` to match the other structs (Agent, Connector).
  3. clients/stellar-relay-lib/src/connection/helper.rs
    • removal of `fn create_stream(...)`
  4. clients/stellar-relay-lib/src/connection/connector/message_reader.rs
    • every `r_stream: &mut tcp::OwnedReadHalf` is replaced with `mut stream: TcpStream`, removing async in these methods.
    • in `fn read_message_from_stellar(...)`:
      • instead of "peeking" first, it now actually "reads" the first 4 bytes, to avoid a double operation. The previous logic would 1. peek first and 2. then read; doing only 1 operation saves time (see the sketch after this list).
      • to pass the stream around, it has to be cloned; hence the call to `try_clone()`.
      • a reference to the Connector is passed around instead, which eliminates cloning.
  5. clients/vault/src/oracle/testing_utils.rs
    • created a helper function, `fn specific_stellar_relay_config(...)`, to connect to one specific config. There are only 3 choices, based on the index in its list:
      • testnet has: sdftest1, sdftest2, sdftest3
      • mainnet has: Iowa, Frankfurt, Singapore
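Below is a rough sketch tying items 1 and 4 together. It is my illustration of the shape described above, not the PR's exact code: everything beyond `Connector`, `tcp_stream`, `start`, and `read_message_from_stellar` is an assumed name, and the 4-byte prefix is treated as a plain big-endian length for simplicity.

```rust
use async_std::io::ReadExt;
use async_std::net::TcpStream;

pub struct Connector {
    // ... other fields elided ...
    tcp_stream: TcpStream, // owns the whole stream; no read/write halves
}

impl Connector {
    /// Replaces `fn new(...)`: builds the Connector and opens the TCP
    /// connection to the Stellar Node in one step.
    pub async fn start(address: &str) -> std::io::Result<Self> {
        let tcp_stream = TcpStream::connect(address).await?;
        Ok(Connector { tcp_stream })
    }

    /// async-std's TcpStream is cheaply cloneable (std's equivalent is
    /// `try_clone()`), so a reading loop can hold its own handle while
    /// the Connector keeps the original.
    pub fn stream_clone(&self) -> TcpStream {
        self.tcp_stream.clone()
    }
}

/// Reads the first 4 bytes directly instead of peeking and then reading,
/// so every message costs one operation fewer.
async fn read_message_from_stellar(mut stream: TcpStream) -> std::io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    stream.read_exact(&mut len_buf).await?;
    let len = u32::from_be_bytes(len_buf) as usize;
    let mut body = vec![0u8; len];
    stream.read_exact(&mut body).await?;
    Ok(body)
}
```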

@b-yap b-yap requested a review from a team January 18, 2024 07:44
@TorstenStueber (Member)

@b-yap as this could be a bug in Tokio's TCP implementation, would it make sense to report a bug ticket to Tokio?

@gianfra-t (Contributor) left a comment

Very nice that you were able to solve this issue @b-yap!! Super strange that the stream gets stuck and is unable to read only in Kubernetes.

I left some comments, generally about using the non-async version of the TcpStream and whether it may block the thread in a meaningful way, particularly when reading.

@bogdanS98 (Contributor) left a comment

Everything looks fine, except I'm not really sure whether synchronous execution is going to achieve the same performance as async. Is there any particular reason why `std::net::TcpStream` was used instead of `async_std::net::TcpStream`?

@b-yap (Contributor, Author) commented Jan 22, 2024

@bogdanS98 @gianfra-t
After Gian's comment about using `spawn_blocking(...)`, I am also finding that std is NOT the best solution.
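(For context, a rough sketch of what the `spawn_blocking` approach implies; hypothetical code, not from this PR. With std's blocking TcpStream inside a Tokio runtime, every read has to hop onto a blocking thread, and the stream has to be moved into and back out of the closure each time.)

```rust
use std::io::Read;
use std::net::TcpStream;

// Hypothetical: a blocking std read driven from async code via Tokio's
// spawn_blocking. The stream is moved into the closure and handed back.
async fn read_exact_blocking(
    stream: TcpStream,
    n: usize,
) -> std::io::Result<(TcpStream, Vec<u8>)> {
    tokio::task::spawn_blocking(move || {
        let mut stream = stream;
        let mut buf = vec![0u8; n];
        stream.read_exact(&mut buf)?;
        Ok((stream, buf))
    })
    .await
    .expect("blocking task panicked")
}
```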

I just checked async_std's repo and the last update was about 9 months ago?
I'm running through their issues and found:

b-yap and others added 13 commits January 26, 2024 13:12

- adding trace logs for the stream's readiness to read data
- async-std
- works, but still in progress
- working test without extra signals
- remove comments and re-add connector drop trait
- cleanup
- fix the failing test about current slot
- fix the failing test about current slot, by connecting to specifically different nodes
- update config files
- use a different account for testing
- fix rustfmt

Co-authored-by: Gianfranco <[email protected]>
@b-yap b-yap force-pushed the connection-issues-investigation branch from f88ef24 to c52db4f on January 26, 2024 05:13
@b-yap b-yap changed the title from "Replacing Tokio's TcpStream with Rust's std lib TcpStream" to "Replacing Tokio's TcpStream with async-std TcpStream" on Jan 26, 2024
@b-yap b-yap requested review from a team, bogdanS98 and gianfra-t January 26, 2024 06:40
@gianfra-t (Contributor) left a comment

All looks good to me @b-yap! Amazing that you found out how to solve this weird problem.

```diff
 let shutdown_sender = ShutdownSender::new();

 // We use a random secret key to avoid conflicts with other tests.
 let agent = start_oracle_agent(
-    get_test_stellar_relay_config(true),
+    specific_stellar_relay_config(true, 0),
```
Contributor:

Just curious, why do we always want to get the first node choice here, and then on the other test always the second?

@b-yap (Contributor, Author):

ah, sometimes an "already connected peer" error happens. I tried connecting to different nodes just to see if it works 🤞. The next best thing would be to connect with different Stellar accounts; currently we only have 1 test account to play with.

@ebma (Member) left a comment

Looks good to me too, great job 🎉 I think these changes even simplify the connectivity logic we had, because we no longer have to pass the two halves of the stream around, which is even better.

Did you already test with Zoltan whether this implementation works in Kubernetes @b-yap?

@bogdanS98 (Contributor) left a comment

Great! Looks good to me 👍🏼

@b-yap b-yap merged commit 56b9601 into main Jan 30, 2024
2 checks passed
@b-yap b-yap deleted the connection-issues-investigation branch January 30, 2024 05:00
@b-yap (Contributor, Author) commented Jan 30, 2024

Couldn't have done it without you guys, @bogdanS98 @gianfra-t.

@ebma yes, I did.
