[native] Implement Graceful Shutdown in Native worker #23517

Open
wants to merge 1 commit into
base: master

Conversation

anandamideShakyan

Description

Added PUT method in /v1/info/state for transitioning the server to the SHUTTING_DOWN state for graceful shutdown

Motivation and Context

The Prestissimo project currently lacks support for transitioning the server to the SHUTTING_DOWN state via the PUT method in the /v1/info/state endpoint. Implementing this feature allows for a graceful shutdown of the server, ensuring that ongoing processes are completed without abrupt termination.

Impact

This implementation introduces a new feature that enables the server to handle a graceful shutdown when the SHUTTING_DOWN state is triggered via a PUT request. By supporting this functionality, the server can now properly manage ongoing requests before shutting down, reducing the risk of data loss or inconsistency.
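The request handling described above can be sketched without any HTTP library. This is a minimal illustration, not the PR's code: `handleStatePut`, `trim`, and the local `NodeState` enum are hypothetical names, and the status codes are plain integers standing in for the real HTTP constants. It assumes the body must be exactly the JSON string "SHUTTING_DOWN", ignoring surrounding whitespace, and that anything else is a bad request.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for the worker's node state and HTTP status codes.
enum class NodeState { kActive, kShuttingDown };
constexpr int kHttpOk = 200;
constexpr int kHttpBadRequest = 400;

// Strips leading and trailing whitespace from a request body.
std::string trim(const std::string& s) {
  std::size_t begin = 0;
  std::size_t end = s.size();
  while (begin < end && std::isspace(static_cast<unsigned char>(s[begin]))) {
    ++begin;
  }
  while (end > begin && std::isspace(static_cast<unsigned char>(s[end - 1]))) {
    --end;
  }
  return s.substr(begin, end - begin);
}

// Accepts only the JSON string "SHUTTING_DOWN" and moves the node to the
// shutting-down state. Any other body leaves the state untouched.
int handleStatePut(const std::string& body, NodeState& state) {
  if (trim(body) != "\"SHUTTING_DOWN\"") {
    return kHttpBadRequest;
  }
  state = NodeState::kShuttingDown;
  return kHttpOk;
}
```

Keeping the validation separate from the transport makes the accept/reject rule easy to check on its own, independent of how the server is started.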

Test Plan

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* ... :pr:`12345`
* ... :pr:`12345`

Hive Connector Changes
* ... :pr:`12345`
* ... :pr:`12345`

If release note is NOT required, use:

== NO RELEASE NOTE ==

@anandamideShakyan anandamideShakyan requested a review from a team as a code owner August 25, 2024 12:19

linux-foundation-easycla bot commented Aug 25, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: anandamideShakyan / name: Shakyan Kushwaha (2879ca5)

Contributor

@aditi-pandit aditi-pandit left a comment

Thanks for this code. @anandamideShakyan: Can you please add a unit test (maybe in ServerOperationsTest or a new unit test file) for this functionality?

@anandamideShakyan anandamideShakyan changed the title [native] Added PUT method in /v1/info/state for Graceful Shutdown [native] Implement Graceful Shutdown in Native worker Aug 26, 2024
@anandamideShakyan
Author

@aditi-pandit Sure, I will add the unit test.

Contributor

@amitkdutta amitkdutta left a comment

@anandamideShakyan PrestoServer already handles graceful shutdown on SIGTERM, which is the standard way to shut down any process. Here is where the signal handler is registered (https://github.com/prestodb/presto/blob/master/presto-native-execution/presto_cpp/main/PrestoServer.cpp#L152)

Also, in the stop method, the server goes into graceful shutdown mode.

As a result, no explicit endpoint is required to ask the server to shut down. Just sending SIGTERM, as with any other process, works as expected. In Java, perhaps this kind of signal handling is not possible, which is why the endpoint concept was developed. The native worker even has an additional signal handling mechanism to capture debugging information in case of a segfault (https://github.com/prestodb/presto/blob/master/presto-native-execution/presto_cpp/main/PrestoMain.cpp#L25C27-L25C47)
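For readers unfamiliar with the SIGTERM approach, here is a minimal sketch using only the standard library. It is not how PrestoServer registers its handler (see the links above for that); `onSigterm`, `installSigtermHandler`, and `shutdownRequested` are hypothetical names. The handler only records the request in a flag, and it is assumed that the serving loop polls the flag and performs the actual shutdown outside the handler.

```cpp
#include <csignal>

// Set by the signal handler and polled by the serving loop.
volatile std::sig_atomic_t gShutdownRequested = 0;

// Only async-signal-safe work belongs here: record the request and return.
extern "C" void onSigterm(int /*signum*/) {
  gShutdownRequested = 1;
}

// Registers the handler for SIGTERM. Returns false if registration failed.
bool installSigtermHandler() {
  return std::signal(SIGTERM, onSigterm) != SIG_ERR;
}

// True once a SIGTERM has been delivered to the process.
bool shutdownRequested() {
  return gShutdownRequested != 0;
}
```

With this pattern, `kill -TERM <pid>` is enough to request shutdown, and no HTTP endpoint is involved.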

CC: @spershin

Contributor

@czentgr czentgr left a comment

Thank you for adding this.

@anandamideShakyan
Author

@amitkdutta

Please refer to this issue: #23299

@amitkdutta
Contributor

@anandamideShakyan Commented in the issue as well

#23299 (comment)

@@ -1404,6 +1413,28 @@ void PrestoServer::reportNodeStatus(proxygen::ResponseHandler* downstream) {
http::sendOkResponse(downstream, json(fetchNodeStatus()));
}

void PrestoServer::handleGracefulShutdown(const std::vector<std::unique_ptr<folly::IOBuf>>& body, proxygen::ResponseHandler* downstream){
Contributor

The format check is failing. Please fix this (see the Makefile having a format-fix target).

Author

Done.

@@ -1404,6 +1413,28 @@ void PrestoServer::reportNodeStatus(proxygen::ResponseHandler* downstream) {
http::sendOkResponse(downstream, json(fetchNodeStatus()));
}

void PrestoServer::handleGracefulShutdown(const std::vector<std::unique_ptr<folly::IOBuf>>& body, proxygen::ResponseHandler* downstream){
if (body.size()==1 && body[0]->moveToFbString() == "\"SHUTTING_DOWN\"") {
// Print message and initiate shutdown
Contributor

We probably don't need all these comments, as they explain what happens, which is already apparent from the code. We use the Velox comment guidelines: https://github.com/facebookincubator/velox/blob/main/CONTRIBUTING.md#coding-best-practices (see the "Code Comments" section).

Collaborator

I agree.

@majetideepak
Collaborator

Please add a unit test as Aditi requested above.

@anandamideShakyan
Copy link
Author

@majetideepak I have added the unit test GracefulShutdownTest.cpp.

@prestodb-ci prestodb-ci linked an issue Oct 18, 2024 that may be closed by this pull request
Collaborator

@majetideepak majetideepak left a comment

@anandamideShakyan can you check if you can use PrestoServerOperations.h?
See the test presto_cpp/main/tests/ServerOperationTest.cpp.
I see that Aditi referenced this as well.

@anandamideShakyan
Author

@majetideepak I had to add GracefulShutdownTest.h as a separate file because, if we run the Presto server after some other test cases, we get the error "The memory manager has already been set". The PrestoServer initializes the global memory manager, and the initializer checks whether an instance has already been created.
Also, the reason for adding it as a separate executable is that when the Presto server exits, the whole process exits, and I was getting SIGABRT.
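The failure mode described here, a process-wide object that refuses to be initialized twice, can be sketched as follows. This is an illustration of the general pattern only, not Velox's memory manager: `GlobalManager` and its methods are hypothetical, and `std::logic_error` stands in for the real exception type.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical stand-in for a process-wide manager that may be set only once.
class GlobalManager {
 public:
  // Creates the single instance. Throws if one already exists, mirroring an
  // "instance == nullptr" check in an initializer.
  static void initialize() {
    if (instance() != nullptr) {
      throw std::logic_error("The manager has already been set");
    }
    instance() = std::make_unique<GlobalManager>();
  }

  // True once initialize() has succeeded in this process.
  static bool initialized() {
    return instance() != nullptr;
  }

 private:
  // Function-local static so the slot exists before first use.
  static std::unique_ptr<GlobalManager>& instance() {
    static std::unique_ptr<GlobalManager> ptr;
    return ptr;
  }
};
```

Under this pattern, any test binary in which an earlier fixture or test has already called `initialize()` will see the second call throw, which is consistent with the error reported when a server is started inside an existing test suite.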

Collaborator

@majetideepak majetideepak left a comment

after some test cases we get error The memory manager has already been set

This should not happen. Can you share a reproducer?
Let's extend class PrestoServerOperations if you need to launch presto server in a separate thread.

// Give coordinator some time to receive our new node state and stop sending
// any tasks.
std::this_thread::sleep_for(std::chrono::seconds(shutdownOnsetSec));
std::thread shutdownThread([this, shutdownOnsetSec]() {
Collaborator

What is the motivation for a separate shutdown thread?

Author

If we didn't schedule the shutdown in a separate thread, we were not able to read the server status: the server was shutting down before we could read it.
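The idea of flipping the state immediately and deferring the slow part of the shutdown to another thread can be sketched like this. It is a simplified model under stated assumptions, not the PR's implementation: `MiniServer` and its members are hypothetical, the onset delay is a constructor argument rather than a config property, and the thread is joined instead of detached so the example stays deterministic.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical miniature of a server whose shutdown runs in the background.
class MiniServer {
 public:
  enum class State { kActive, kShuttingDown, kStopped };

  explicit MiniServer(std::chrono::milliseconds onset) : onset_(onset) {}

  ~MiniServer() {
    waitForStop();
  }

  // Called from a request handler. Changes the state right away, then hands
  // the delayed stop to a background thread so the caller can still send its
  // response. Returns true only for the call that started the shutdown.
  bool requestShutdown() {
    State expected = State::kActive;
    if (!state_.compare_exchange_strong(expected, State::kShuttingDown)) {
      return false;
    }
    stopper_ = std::thread([this]() {
      // Grace period during which the new state can still be observed.
      std::this_thread::sleep_for(onset_);
      state_.store(State::kStopped);
    });
    return true;
  }

  // Blocks until the background stop has finished, if one was started.
  void waitForStop() {
    if (stopper_.joinable()) {
      stopper_.join();
    }
  }

  State state() const {
    return state_.load();
  }

 private:
  const std::chrono::milliseconds onset_;
  std::atomic<State> state_{State::kActive};
  std::thread stopper_;
};
```

The compare-and-exchange ensures a repeated request does not start a second stop thread. A production handler would also need to decide who owns the thread's lifetime, which is the concern behind asking why the thread exists at all.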

Collaborator

Why do we need to read this status?

Author

We need it to compare it against the status codes in the test cases.

Collaborator

@majetideepak majetideepak Oct 28, 2024

Again not ideal to make changes only for testing.

@@ -157,7 +158,9 @@ PrestoServer::PrestoServer(const std::string& configDirectoryPath)

PrestoServer::~PrestoServer() {}

-void PrestoServer::run() {
+void PrestoServer::run(
+    std::function<void(proxygen::HTTPServer*)> onSuccess,
Collaborator

How are these arguments used?

Author

We use it inside PrestoServerWrapper.cpp to get the socket address of the server, which we need while running the test suite (see SetUpTestSuite()).

Collaborator

It is not ideal to make API changes just for testing. Let's figure out another way to test.

@anandamideShakyan
Author

anandamideShakyan commented Oct 27, 2024

This should not happen. Can you share a reproducer? Let's extend class PrestoServerOperations if you need to launch presto server in a separate thread.

@majetideepak
I extended ServerOperationTest.cpp to include the graceful shutdown test suite, and I get this:

[----------] 7 tests from ServerOperationTest
I20241027 17:00:11.820835 20054 Configs.cpp:120] [PRESTO_STARTUP] Registered properties from '../../../etc/config.properties': runtime-metrics-collection-enabled=true register-test-functions=true http-server.http.port=7777 shutdown-onset-sec=1 presto.version=testversion discovery.uri=http://127.0.0.1:49595
I20241027 17:00:11.821100 20054 Configs.cpp:120] [PRESTO_STARTUP] Registered properties from '../../../etc/node.properties': node.internal-address=127.0.0.1 node.location=testing-location node.environment=testing
I20241027 17:00:11.821332 20054 Configs.cpp:120] [PRESTO_STARTUP] Registered properties from '../../../etc/velox.properties': mutable-config=true
I20241027 17:00:11.822000 20054 Configs.cpp:900] [PRESTO_STARTUP] Updated in '../../../etc/velox.properties' from SystemProperties: max_partial_aggregation_memory=16777216 max_page_partitioning_buffer_size=33554432 max_output_buffer_size=33554432 presto.array_agg.ignore_nulls=false
I20241027 17:00:11.822346 20054 PrestoServer.cpp:799] [PRESTO_STARTUP] Starting with node memory 40GB
E20241027 17:00:11.822404 20054 Exceptions.h:66] Line: /Users/shakyan/opensource/presto/presto-native-execution/velox/velox/common/memory/Memory.cpp:177, Function:initialize, Expression: instance == nullptr The memory manager has already been set: Memory Manager[capacity UNLIMITED alignment 64B usedBytes 0B number of pools 2 List of root pools: __sys_root__ Memory Allocator[MALLOC capacity UNLIMITED allocated bytes 0 allocated pages 0 mapped pages 0] ARBIRTATOR[NOOP CAPACITY[UNLIMITED]]], Source: RUNTIME, ErrorCode: INVALID_STATE
libc++abi: terminating due to uncaught exception of type facebook::velox::VeloxRuntimeError: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: The memory manager has already been set: Memory Manager[capacity UNLIMITED alignment 64B usedBytes 0B number of pools 2 List of root pools: __sys_root__ Memory Allocator[MALLOC capacity UNLIMITED allocated bytes 0 allocated pages 0 mapped pages 0] ARBIRTATOR[NOOP CAPACITY[UNLIMITED]]]
Retriable: False
Expression: instance == nullptr
Function: initialize
File: /Users/shakyan/opensource/presto/presto-native-execution/velox/velox/common/memory/Memory.cpp
Line: 177
Process finished with exit code 134 (interrupted by signal 6:SIGABRT)

Steps to reproduce:
Add the following to ServerOperationTest.cpp

#include "presto_cpp/main/http/tests/HttpTestBase.h"
#include "presto_cpp/main/tests/PrestoServerWrapper.h"
.......
namespace facebook::presto {
static std::unique_ptr<facebook::presto::PrestoServer> getPrestoServer() {
    auto server = std::make_unique<PrestoServer>("../../../etc");
    return server;
}
........
class ServerOperationTest : public testing::Test {
protected:
    static facebook::presto::test::PrestoServerWrapper* wrapper;
    static folly::SocketAddress* socketAddress;
    static void SetUpTestSuite() {
        folly::SingletonVault::singleton()->registrationComplete();
#ifndef PRESTO_STATS_REPORTER_TYPE
        // Initialize singleton for the reporter.
        folly::Singleton<facebook::velox::BaseStatsReporter> reporter(
            []() { return new facebook::velox::DummyStatsReporter(); });
#endif
        static auto prestoServer = getPrestoServer();
        wrapper = new facebook::presto::test::PrestoServerWrapper(
            std::move(prestoServer));
        socketAddress = new folly::SocketAddress(wrapper->start().get());
    }

    static void TearDownTestSuite() {
        wrapper->stop();
    }
.........
};
facebook::presto::test::PrestoServerWrapper* ServerOperationTest::wrapper =
    nullptr;
folly::SocketAddress* ServerOperationTest::socketAddress = nullptr;
..........
TEST_F(ServerOperationTest, TestGetState) {
    auto memoryPool = memory::MemoryManager::getInstance()->addLeafPool("");
    HttpClientFactory clientFactory;
    auto client = clientFactory.newClient(
        *socketAddress,
        std::chrono::milliseconds(1'000),
        std::chrono::milliseconds(0),
        false,
        memoryPool);
    {
        auto response = sendGet(client.get(), "/v1/info/state").get();
        ASSERT_EQ(response->headers()->getStatusCode(), http::kHttpOk);
    }
}
TEST_F(ServerOperationTest, TestSendPutShuttingDown) {
    auto memoryPool = memory::MemoryManager::getInstance()->addLeafPool("");
    HttpClientFactory clientFactory;
    auto client = clientFactory.newClient(
        *socketAddress,
        std::chrono::milliseconds(1'000),
        std::chrono::milliseconds(0),
        false,
        memoryPool);
    {
        std::string emptyBody = "";
        auto response = sendPut(client.get(), "/v1/info/state", 0, emptyBody).get();
        ASSERT_EQ(
            response->headers()->getStatusCode(),
            http::kHttpBadRequest); // Assuming empty request is bad
        ASSERT_EQ(bodyAsString(*response, memoryPool.get()), "Bad Request");
    }
    {
        std::string invalidBody = "\"SHUTTING_DWN\"";
        auto response =
            sendPut(client.get(), "/v1/info/state", 0, invalidBody).get();
        ASSERT_EQ(response->headers()->getStatusCode(), http::kHttpBadRequest);
        ASSERT_EQ(bodyAsString(*response, memoryPool.get()), "Bad Request");
    }
    {
        std::string body = "\"SHUTTING_DOWN\"";
        auto response = sendPut(client.get(), "/v1/info/state", 0, body).get();
        ASSERT_EQ(response->headers()->getStatusCode(), http::kHttpOk);
    }
}
} // namespace facebook::presto

While working on the issue, I discussed the errors I was getting with @nmahadevuni, and he suggested making the graceful shutdown test a separate file and a separate executable. This solved the issue. Aditi also said that I can use a new file.

@anandamideShakyan
Author

@majetideepak, I can add the test suite as a mock test, but it still won't test the whole client->RESTEndpoint->response path.
I can move the test case to ServerOperationsTest.cpp and use serverOperationSetState with an actual server created, but we would still have that test case as a separate executable. Also, it doesn't test the REST endpoint; it calls the shutdown method directly. Do you want me to go ahead with this approach?

@majetideepak
Collaborator

test the whole client->RESTEndpoint->response

@anandamideShakyan We already have the setup for such e2e tests in the Java tests. Can we add a test there?

Collaborator

@majetideepak majetideepak left a comment

@anandamideShakyan This looks great! I have 2 minor comments.

folly::trimWhitespace(body[0]->moveToFbString()) == "\"SHUTTING_DOWN\"") {
LOG(INFO) << "Shutdown requested";
if (nodeState() == NodeState::kActive) {
std::thread([this]() { this->stop(); }).detach();
Collaborator

Why do we invoke stop in a separate thread? Please add a comment.

return PrestoNativeQueryRunnerUtils.createNativeQueryRunner(true);
}

private String getServerUri(QueryRunner queryRunner)
Collaborator

nit: do we need a function if it is used only once?

Successfully merging this pull request may close these issues.

[Native] Support Graceful Shutdown in Prestissimo
5 participants