Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

future: cass future wait timed #147

Merged
merged 7 commits into from
Oct 1, 2024

Conversation

muzarski
Copy link
Collaborator

ref: #42

This PR implements a cass_future_wait_timed API function which allows to wait for a future to resolve within a given period of time.

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have enabled appropriate tests in .github/workflows/build.yml in gtest_filter.
  • I have enabled appropriate tests in .github/workflows/cassandra.yml in gtest_filter.

scylla-rust-wrapper/src/future.rs Outdated Show resolved Hide resolved
scylla-rust-wrapper/src/future.rs Outdated Show resolved Hide resolved
scylla-rust-wrapper/src/future.rs Outdated Show resolved Hide resolved
scylla-rust-wrapper/src/future.rs Outdated Show resolved Hide resolved
@muzarski
Copy link
Collaborator Author

muzarski commented Aug 1, 2024

Rebased on master.

@muzarski
Copy link
Collaborator Author

muzarski commented Aug 1, 2024

The async test I enabled started to fail (even when I only rebased on master). I can't reproduce it locally:

  • for scylla tests, my local version of ccm for some reason can't create a 5.0.0 scylla cluster
  • for cassandra - the test passes

I'll update an integration test to make use of cass_future_wait instead of cass_future_wait_timed to see if the issue is with the timed version itself, or not.

@muzarski
Copy link
Collaborator Author

muzarski commented Aug 1, 2024

I changed wait_timed to wait in integration test. There are multiple client timeout errors:

 /home/runner/work/cpp-rust-driver/cpp-rust-driver/tests/src/integration/objects/future.hpp:152: Failure
      Expected: CASS_OK
To be equal to: wait_code
      Which is: CASS_ERROR_LIB_REQUEST_TIMED_OUT [Request timed out]
Request timed out: Request timeout: Request took longer than 12000ms: deadline has elapsed

Link to the job: https://github.com/scylladb/cpp-rust-driver/actions/runs/10201719963/job/28224244570?pr=153#step:7:8477.

Recently, we merged a PR that sets a default client timeout to 12000ms (which is huge!). I'm not sure why the timeouts appear - these are simple insert queries. Link to the test case: https://github.com/scylladb/cpp-rust-driver/blob/master/tests/src/integration/tests/test_async.cpp#L99-L111.

I'll try to investigate further, but it seems that the test failure is not related to cass_future_wait_timed. The test fails due to client request timeouts. As I mentioned previously, the test passes for me locally. Maybe it's some GH actions issue - query timing out after 12s is really weird.

cc: @Lorak-mmk @wprzytula

@Lorak-mmk
Copy link
Collaborator

The async test I enabled started to fail (even when I only rebased on master). I can't reproduce it locally:

* for scylla tests, my local version of ccm for some reason can't create a 5.0.0 scylla cluster

What is the error? also we should update to newer version / test with more than 1 version.

* for cassandra - the test passes

I'll update an integration test to make use of cass_future_wait instead of cass_future_wait_timed to see if the issue is with the timed version itself, or not.

@Lorak-mmk
Copy link
Collaborator

I changed wait_timed to wait in integration test. There are multiple client timeout errors:

 /home/runner/work/cpp-rust-driver/cpp-rust-driver/tests/src/integration/objects/future.hpp:152: Failure
      Expected: CASS_OK
To be equal to: wait_code
      Which is: CASS_ERROR_LIB_REQUEST_TIMED_OUT [Request timed out]
Request timed out: Request timeout: Request took longer than 12000ms: deadline has elapsed

Link to the job: https://github.com/scylladb/cpp-rust-driver/actions/runs/10201719963/job/28224244570?pr=153#step:7:8477.

Recently, we merged a PR that sets a default client timeout to 12000ms (which is huge!). I'm not sure why the timeouts appear - these are simple insert queries. Link to the test case: https://github.com/scylladb/cpp-rust-driver/blob/master/tests/src/integration/tests/test_async.cpp#L99-L111.

I'll try to investigate further, but it seems that the test failure is not related to cass_future_wait_timed. The test fails due to client request timeouts. As I mentioned previously, the test passes for me locally. Maybe it's some GH actions issue - query timing out after 12s is really weird.

cc: @Lorak-mmk @wprzytula

So previously there was no client request timeout at all? And now that you introduced it, queries are timing out?
Tests are running in valgrind, so it's possible for queries to take a longer time, but it shouldn't be this long.
When running locally, do you also use valgrind?

scylla-rust-wrapper/src/future.rs Outdated Show resolved Hide resolved
scylla-rust-wrapper/src/future.rs Outdated Show resolved Hide resolved
@muzarski
Copy link
Collaborator Author

muzarski commented Aug 4, 2024

So previously there was no client request timeout at all? And now that you introduced it, queries are timing out? Tests are running in valgrind, so it's possible for queries to take a longer time, but it shouldn't be this long. When running locally, do you also use valgrind?

Good thing that you noticed that. I missed it. Indeed, when I run the tests locally under valgrind, the timeouts appear.

@muzarski
Copy link
Collaborator Author

muzarski commented Aug 4, 2024

The async test I enabled started to fail (even when I only rebased on master). I can't reproduce it locally:

* for scylla tests, my local version of ccm for some reason can't create a 5.0.0 scylla cluster

What is the error?

Nvm, it's not ccm issue. I was just missing openjdk-11 locally which is needed to setup 5.0.0 cluster apparently.

also we should update to newer version / test with more than 1 version.

What version would you suggest? 5.4.8?

@Lorak-mmk
Copy link
Collaborator

The async test I enabled started to fail (even when I only rebased on master). I can't reproduce it locally:

* for scylla tests, my local version of ccm for some reason can't create a 5.0.0 scylla cluster

What is the error?

Nvm, it's not ccm issue. I was just missing openjdk-11 locally which is needed to setup 5.0.0 cluster apparently.

I recommend using ccm by nix, it removed a lot of my frustrations with ccm.

also we should update to newer version / test with more than 1 version.

What version would you suggest? 5.4.8?

6.0

@muzarski
Copy link
Collaborator Author

muzarski commented Aug 5, 2024

v2: addressed review comments

There are still timeouts appearing. The conclusion is that the timeouts appear due to tests being run under valgrind (reproducing locally as well). I see two solutions here:

  • Run the timeout-sensitive tests in a separate CI step without valgrind (right now, there is only one failing test due to timeouts).
  • Modify the failing test so it disables the global client timeout, and waits for the future indefinitely (change wait_timed to wait).

I think the former approach is better, as it allows us to test cass_future_wait_timed. WDYT? @Lorak-mmk @wprzytula

@Lorak-mmk
Copy link
Collaborator

Lorak-mmk commented Aug 5, 2024

v2: addressed review comments

There are still timeouts appearing. The conclusion is that the timeouts appear due to tests being run under valgrind (reproducing locally as well). I see two solutions here:

* Run the timeout-sensitive tests in a separate CI step without `valgrind` (right now, there is only one failing test due to timeouts).

* Modify the failing test so it disables the global client timeout, and waits for the future indefinitely (change `wait_timed` to `wait`).

I think the former approach is better, as it allows us to test cass_future_wait_timed. WDYT? @Lorak-mmk @wprzytula

In second approach, if you use wait instead of wait_timed, then how does it test wait_timed?

@muzarski
Copy link
Collaborator Author

muzarski commented Aug 5, 2024

v2: addressed review comments
There are still timeouts appearing. The conclusion is that the timeouts appear due to tests being run under valgrind (reproducing locally as well). I see two solutions here:

* Run the timeout-sensitive tests in a separate CI step without `valgrind` (right now, there is only one failing test due to timeouts).

* Modify the failing test so it disables the global client timeout, and waits for the future indefinitely (change `wait_timed` to `wait`).

I think the former approach is better, as it allows us to test cass_future_wait_timed. WDYT? @Lorak-mmk @wprzytula

In second approach, if you use wait instead of wait_timed, then how does it test wait_timed?

That's the point - it does not. That's why I prefer to run this specific test without valgrind.

@Lorak-mmk
Copy link
Collaborator

Sorry, I misread your message.
Separate step complicates the CI, but in this case it seems reasonable to me.

@muzarski
Copy link
Collaborator Author

muzarski commented Aug 5, 2024

v3: running the failing async test in a separate CI step without valgrind

I'll update Scylla version in CI in a separate PR.

@muzarski
Copy link
Collaborator Author

muzarski commented Aug 5, 2024

v3.1: deduplicate error message variable in callback test

@muzarski
Copy link
Collaborator Author

muzarski commented Aug 5, 2024

v3.2: fix CI step description for timeout-sensitive cassandra integration tests

@muzarski muzarski force-pushed the cass_future_wait_timed branch 4 times, most recently from 383d574 to 0202780 Compare August 7, 2024 06:48
@muzarski
Copy link
Collaborator Author

muzarski commented Aug 7, 2024

Rebased on master

@muzarski
Copy link
Collaborator Author

muzarski commented Aug 7, 2024

v4: added proper synchronization between cass_future_wait and cass_future_wait_timed callers.

@muzarski muzarski force-pushed the cass_future_wait_timed branch 3 times, most recently from 053c2b3 to c58ff91 Compare August 7, 2024 07:17
@muzarski muzarski requested a review from wprzytula August 7, 2024 07:37
scylla-rust-wrapper/src/future.rs Outdated Show resolved Hide resolved
scylla-rust-wrapper/src/future.rs Outdated Show resolved Hide resolved
scylla-rust-wrapper/src/future.rs Outdated Show resolved Hide resolved
scylla-rust-wrapper/src/future.rs Outdated Show resolved Hide resolved
scylla-rust-wrapper/src/future.rs Show resolved Hide resolved
scylla-rust-wrapper/src/future.rs Outdated Show resolved Hide resolved
scylla-rust-wrapper/src/future.rs Show resolved Hide resolved
@muzarski
Copy link
Collaborator Author

v4.1:

  • rebased on master -> adjusted to new CI
  • addressed @wprzytula comments

README.md Outdated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment regarding commit future: cass_future_wait_timed :
It says Next commit will fix that,, but the next commit is cargo: add futures 0.3 dependency. The commit after that is the one that fixes the problem.

muzarski and others added 7 commits October 1, 2024 13:46
pub(self) is equivalent to not providing a visibility specifier at all.
Note: this implementation does not work for `current-thread` tokio
runtime in some cases, causing a deadlock. Two following commits will
introduces changes which fix that.
We will introduce a synchronization between `cass_future_wait` and
`cass_future_wait_timed` callers.
Previously, there was a scenario when using `current-thread` tokio runtime,
that caused the deadlock.

The scenario described:
```
- we make use of a `current-thread` tokio runtime
  (`Builder::new_current_thread().enable_all().build()`)
- we create a future which sleeps, let's say, 2s and then sets the value
- we call `cass_future_wait_timed` on this future, with a timeout 1s.
  This consumes the handle, starts the future, times out after 1s (there is still a task to be polled until completion - the future from above which sets the value)
- now we call `cass_future_wait` on the same future.
  This waits on cond variable (blocking operation).
  Now current thread is blocked, it is waiting (on a cond variable) for a value to be set.
  But there is noone to poll the future that sets the value,
  since we are using a `current-thread` runtime - deadlock...
```

This commit fixes that, preventing such scenario from happening.

Co-authored-by: Karol Baryła <[email protected]>
This test needs to be run without valgrind, as it causes the client
request timeouts to appear.
@muzarski
Copy link
Collaborator Author

muzarski commented Oct 1, 2024

v4.2:

  • rebased on master
  • adjusted commit message that @Lorak-mmk mentioned

@muzarski muzarski merged commit 9ac038f into scylladb:master Oct 1, 2024
7 checks passed
@muzarski muzarski mentioned this pull request Oct 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants