Add RFC for supporting distributed procedure #12

hantangwangd · 2024-06-11T03:44:25Z

Propose design to expand the current procedure architecture in presto, support defining, registering and calling procedures which need to be executed in a distributed way.

Besides, in order to demonstrate as an example and to figure out the functional boundaries among the different architecture levels, also describe the design for Iceberg to support distributed procedure as well as the design of a specific distributed procedure rewrite_data_files.

rschlussel

This is an interesting feature. do you have thoughts on integrating this for cpp workers as well?

hantangwangd · 2024-06-21T17:57:30Z

@rschlussel Thanks for the comment. Yes, of course we should integrate this for cpp workers if it's proved to be useful and needed.

I think we can take a two-step approach for this. Firstly, we support it on java workers, confirm the feasible of the entire architecture and figure out the best functional boundary division. Then, after that, we can support it on cpp workers following the design and implementation path on java workers. What's your opinion?

aditi-pandit

Thanks @hantangwangd for this RFC. Had couple of comments mainly related to Native execution related impact.

aditi-pandit · 2024-06-25T22:13:53Z

RFC-0004-support-distributed-procedure.md

+5. Similar to statements such as `create table as`/`insert`/`delete`/`refresh material view` that involve distributed processing, two related SPI methods are defined for `call distributed procedure` statement in the metadata and connector metadata interfaces. `beginCallDistributedProcedure` is used for preparation work before the start of distributed scheduling. And `finishCallDistributedProcedure` is used for transaction commitment after the completion of distributed writing.
+
+
+6. As for a specific connector (such as Iceberg), in the implementation logic of method `beginCallDistributedProcedure` and `finishCallDistributedProcedure` in `ConnectorMetadata`, in addition to accomplish the common logic for this connector (such as starting a transaction, building a transaction context, committing a transaction, etc.), it should also resolve the specified distributed procedure and call its relevant method to execute the procedure-customized logic.


Native vs Java runtimes begin to diverge significantly at this point. For native runtimes its best if the distributed procedure call method is in C++. But that would mean older java procedures would need a rewrite. Do you have any particular ideas on how you will handle this for Native engine ?

Thanks for the comment. Actually, beginCallDistributedProcedure and finishCallDistributedProcedure in ConnectorMetadata are all invoked in coordinator, so native worker do not need to handle them, that is, there is no need in native worker end to resolve the distributed procedures and invoke them.

The non-coordinator worker's responsibility in call distributed procedure is similar to that of insert into or create table as. That is, local planning CallDistributedProcedureNode to a TableWriterOperator with ExecutionWriterTarget, getting corresponding ConnectorPageSink based on this writer target, executing the data writing and finally returning the fragments page to table finish stage.

So I think there is no need to consider this issue, what do you think? Any misunderstand please let me know.

aditi-pandit · 2024-06-25T22:16:30Z

RFC-0004-support-distributed-procedure.md

+            extends WriterTarget
+{
+    private final QualifiedObjectName procedureName;
+    private final Object[] procedureArguments;


Presume the procedureArguments have types. There could be a difference in the types supported by native vs java execution. Would be great to clarify about that.

As I understand, CallDistributedProcedureTarget is a subclass of WriterTarget which is used only in coordinator. That is, it's always used in java execution environment.

ExecuteProcedureHandle which is a subclass of ExecutionWriterTarget will be sent to worker. The handling of it is similar to InsertHandle or CreateHandle.

aditi-pandit · 2024-06-25T22:18:28Z

RFC-0004-support-distributed-procedure.md

+
+Among them, `TableScanNode -> FilterNode` defines the data to be processed. It's based on the target table determined by `schema` and `table_name` in the parameters, as well as the possibly existing filter conditions.
+
+The `CallDistributedProcedureNode -> TableFinishNode` structure is similar to the `TableWriterNode -> TableFinishNode` structure, used to perform distributed data manipulation and final unified submission behavior.


Would be great to give more details on what row/control messages are exchanged between the CallDistricutedProcedureNode and TableFinishNode. e.g. TableWriteNode provides protocol of PartitionUpdates and CommitContext that is shared between the two.

Sure, it's a good suggestion. I will supplement this part of the content.

aditi-pandit · 2024-06-25T22:22:35Z

RFC-0004-support-distributed-procedure.md

+
+#### 6. Iceberg connector's support for distributed procedure
+
+In Iceberg, we often need to record the original data files that have been scanned and rewritten during the execution of table data operations (including deleted files that have been fully applied), and in the final submission, combine the newly generated data files due to rewriting to make some changes and transaction submissions at the metadata level.


Iceberg in Native engine uses HiveConnector itself. The HiveConnector in Native engine handles both Hive and Iceberg splits. Does HiveConnector support distributed procedure ? How will it be reused or enhanced for Iceberg ?

As beginCallDistributedProcedure and finishCallDistributedProcedure in ConnectorMetadata are all invoked in coordinator, I think the main change for hive to support distributed procedures is in java implementation.

Hive should implement beginCallDistributedProcedure and finishCallDistributedProcedure in ConnectorMetadata, customize it's own preparation and submission logic, and then implement and register its own distributed procedures. All these jobs could be done in java only.

So I think the main change in worker end logic which need to support native c++ is that it should generate and provide ConnectorPageSink based on the newly added ConnectorDistributedProcedureHandle. Please let me know if there are any omissions.

jaystarshot · 2024-06-26T07:31:47Z

RFC-0004-support-distributed-procedure.md

+3. Add a new plan node type: `CallDistributedProcedureNode`. During the analyzing and logical planning phase, construct a logical plan with the following shape for `call distributed procedure` statement:
+
+    ```text
+    TableScanNode -> FilterNode -> CallDistributedProcedureNode -> TableFinishNode -> OutputNode


Not sure I understand correctly, but if we are going to support a general distributed procedure then why does it have to be just above TableScanNode or table finish?
If its not for a general case but specific to table layouts, then why not just a custom operator implementation?

Thanks for the comment. As described in this RFC, compared to existing coordinator-only procedures, these newly expanded kind of procedures which need to be executed distributively involve operations on table's data other than metadata. For example, these procedures can rewrite table data, merge small data files, sort table data, repartition table data etc. So the common action process of them includes reading data from the target table, doing some filtering and translation, writing the translated data to files and submitting the changes in metadata level. So I think the general distributed procedure could be planned to a plan tree like this. (Or adding some other plan nodes like JoinNode/AggregationNode etc. for functional expansion in future)

why not just a custom operator implementation?

After optimization and plan fragmentation, we should be able to shuffle data between stages to utilize the capabilities of the entire cluster.

jaystarshot · 2024-06-26T07:42:37Z

RFC-0004-support-distributed-procedure.md

+
+## Summary
+
+Propose design to expand the current procedure architecture in presto, support defining, registering and calling procedures which need to be executed in a distributed way.


Can you also define what you mean by procedure?

Thanks for the suggestion. Strictly speaking, maybe they should be called connector specific system procedures. I'll add some explanations here.

hantangwangd · 2024-07-31T19:37:58Z

This RFC is ready for review, the comments are all handled, and the relevant implementation can be viewed here: prestodb/presto#22659. Please take a look when convenient, thanks! cc: @rschlussel @aditi-pandit @jaystarshot @tdcmeehan @ZacBlanco @yingsu00

aditi-pandit · 2024-08-07T10:21:42Z