Poet 2.0 architecture and design #367

Open · Tracked by #257
pigmej opened this issue Nov 28, 2022 · 9 comments
pigmej (Member) commented Nov 28, 2022

To improve Poet's resiliency even further and to be able to scale better, we need to make architectural changes in Poet.

The idea currently proposed by @noamnelke looks as follows:
PoET Architecture

PoET Service Architecture.excalidraw.txt

noamnelke (Member) commented

This proposal also includes some architectural changes to the node. I'll explain it all in more detail when I have more time (since I don't think this is feasible before genesis), but I wanted to have this placeholder out.

@poszu poszu added this to the Poet v2 milestone Sep 1, 2023
@poszu poszu mentioned this issue Sep 1, 2023
@poszu poszu transferred this issue from spacemeshos/pm Sep 1, 2023
poszu (Contributor) commented Sep 1, 2023

@pigmej @noamnelke
An updated proposal of the new Poet architecture (the image has Excalidraw scene embedded).
poet-v2-arch

A short summary of the Registration service API as of now, plus plans (a rough node-side usage sketch follows the list):

source: https://github.com/spacemeshos/poet/blob/develop/rpc/api/v1/api.proto

  • /Info: serves poet runtime information:
    • open round ID (❗not used by the node),
    • executing round ID, if any (❗not used by the node),
    • poet PubKey (⚠️ not really useful, a node should probably get the pubkey along with poet URL in the config),
    • round information (phase shift and cycle gap) - not used currently
  • /PowParams - returns the proof of work parameters (challenge and difficulty) - needed for submitting
  • /Submit - submits a challenge/token
  • /Proofs/<round> - serves proofs for requested round
  • 🚧 planned /Members/<round> - serves members of the requested round (Add RPC API for querying the list of members for a given round. #200)
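
For orientation, below is a minimal sketch of how a node might consume these endpoints, assuming they are reachable over plain HTTP (e.g. via a gRPC gateway). It is illustrative only: the JSON field names, response shapes, and the base URL are assumptions made for the example, not the schema defined in api.proto.

```go
// Illustrative only: a node-side client for the Registration service endpoints
// listed above. The HTTP paths mirror the list; the JSON field names and the
// base URL are assumptions, not the real schema from api.proto.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type Info struct {
	OpenRoundID string `json:"openRoundId"` // assumed field name
	PubKey      string `json:"pubKey"`      // assumed field name
}

type PowParams struct {
	Challenge  string `json:"challenge"`  // assumed field name
	Difficulty uint32 `json:"difficulty"` // assumed field name
}

func getJSON(url string, out any) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}

func main() {
	base := "http://poet.example.com" // hypothetical registration service URL

	var info Info
	if err := getJSON(base+"/Info", &info); err != nil {
		panic(err)
	}
	fmt.Println("open round:", info.OpenRoundID)

	var pow PowParams
	if err := getJSON(base+"/PowParams", &pow); err != nil {
		panic(err)
	}
	// A real node would now solve the PoW, POST its challenge to /Submit,
	// and later fetch /Proofs/<round> for the round it was registered in.
}
```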

Open questions

Should we stick with LevelDB in the registration service?

Is it required to be able to horizontally scale instances of the Registration service?

It would require:

  • an externally served DB,
  • a centralized agent to schedule rounds (to decide when a round closes) and to feed the workers

I think it's not worth the effort, at least not at this stage. A single registration service will be faster and should easily be able to serve hundreds of thousands of /Submit requests within the cycle gap window.
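
For a rough sense of scale (illustrative numbers, not taken from any config): with a cycle gap of 12 hours, 500,000 /Submit requests spread over the window average about 500,000 / 43,200 s ≈ 12 requests per second, which a single gRPC server handles comfortably even with generous headroom for bursts near the deadline.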

dshulyak (Contributor) commented Sep 2, 2023

I think using Kafka or anything like it is gross over-engineering. Publishing a single message every two weeks can be done with gRPC or native HTTP/2 easily. If you want redundancy, e.g. getting the membership from more than one service, it can be achieved by connecting to more than one registration service.

If you insist on doing "Kafka", please support a simple mode for system tests and a standalone mode.

poszu (Contributor) commented Sep 4, 2023

@dshulyak

> I think using Kafka or anything like it is gross over-engineering.

What is gross about it? Please explain why you think it's not the right tool for the job.

> Publishing a single message every two weeks can be done with gRPC or native HTTP/2 easily.

We don't want to expose worker servers to the Internet at all. The workers should instead pull data from the Registration service.

> If you insist on doing "Kafka", please support a simple mode for system tests and a standalone mode.

Yes, support for a standalone mode is planned in #365:

> 💡 The option to run them together in a single process (aka a standalone mode) could probably stay for cloud deployments, go-sm unit- and system-tests etc.

dshulyak (Contributor) commented Sep 4, 2023

> What is gross about it? Please explain why you think it's not the right tool for the job.

Because the problem is very simple. So the requirement that everyone who wants a robust poet needs to maintain a Kafka cluster looks gross to me.

> The workers should instead pull data from the Registration service.

I don't understand the difference. Instead of "Kafka", the data can be pulled from this registration service. If you want to keep a single URL in the workers, this registration service can aggregate data from multiple frontends.

poszu (Contributor) commented Sep 5, 2023

> Because the problem is very simple. So the requirement that everyone who wants a robust poet needs to maintain a Kafka cluster looks gross to me.

It's a fair point. Using an MQ would complicate the deployment a little by requiring a message queue server to be set up. But is it that bad? Their deployment is usually straightforward. I proposed Kafka because it is an MQ I have worked with before, but perhaps there are better/simpler solutions.

> If you want to keep a single URL in the workers, this registration service can aggregate data from multiple frontends.

Isn't that re-inventing an MQ?

Assuming we shoot down the idea of an MQ, what other good options do we have?

  1. A gRPC API, with the registration service working as a server that provides a way to (roughly as sketched below):
    • pull the membership root for the next round to execute,
    • post a new proof when a round is finished.
  2. Do you have any other ideas?
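
To make option 1 concrete, here is a minimal sketch of the shape such an API could take, written as a Go interface rather than the eventual .proto definition. All names and signatures here are assumptions for illustration, not an agreed design.

```go
// Illustrative only: one possible shape for option 1 above. All names are
// assumptions made for the sketch.
package worker

import "context"

// RegistrationClient is what a worker would use to pull work from, and push
// results back to, the registration service. The workers dial out, so nothing
// on the worker side has to be exposed to the Internet.
type RegistrationClient interface {
	// NextRound blocks (long-polls) until the registration service closes a
	// round, then returns its ID together with the membership root to prove.
	NextRound(ctx context.Context) (roundID string, membershipRoot []byte, err error)

	// SubmitProof posts the finished proof for a round back to the
	// registration service, which then serves it to nodes via /Proofs/<round>.
	SubmitProof(ctx context.Context, roundID string, proof []byte) error
}

// Run is the worker's main loop under this sketch: pull a round, execute the
// PoSW over its membership root, and push the proof back.
func Run(ctx context.Context, client RegistrationClient, prove func([]byte) []byte) error {
	for {
		roundID, root, err := client.NextRound(ctx)
		if err != nil {
			return err
		}
		if err := client.SubmitProof(ctx, roundID, prove(root)); err != nil {
			return err
		}
	}
}
```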

noamnelke (Member) commented
I'm working on my own proposal based on the one by @poszu (mostly similar).

In the meantime, I'll say that I generally agree with @dshulyak that there must be an "easy mode" to run a PoET server, for tests but also for anyone who wants to easily run a private server for themselves.

I think we can satisfy everyone if we design PoET as a few building blocks that can be used in different ways:

  • PoSW module
  • Membership tree module
  • Scheduling module
  • Others? (I'll know more when I finish speccing my idea, still WIP)

Then we can use those in "standalone mode" by building some scaffolding around them that implements the API by calling into these modules from a single Go executable.

For "infra mode" we'll build different scaffolding that can use a MQ to communicate between several different executables and data stores.

Both of these "modes" will be used internally - standalone in tests and infra mode for the actual PoET service that Spacemesh operates. There's some risk that bugs will exist in one version and not the other, but we can try to minimize this by keeping the scope of the separate scaffolding minimal.

Does this make sense to you guys, or does it sound over-engineered and overly complex?

dshulyak (Contributor) commented Sep 6, 2023

I don't think that supporting 3 modes of operation is a good idea. If there can be one that is reliable and efficient, we should be using it everywhere.

> It's a fair point. Using an MQ would complicate the deployment a little by requiring a message queue server to be set up. But is it that bad?

It means that the Kafka-based mode will be the reliable one, and of everything else we will say that it may not be that reliable. So everyone who needs to help maintain poet will have to learn how to work with and debug Kafka, and whoever deploys it will also need a basic understanding of what can go wrong.

> A gRPC API, with the registration service working as a server that provides a way to:

This is what I have in mind. I don't think that downloading data periodically, or on notification, implies reinventing an MQ.

> Does this make sense to you guys, or does it sound over-engineered and overly complex?

It does seem unnecessarily complex to me.

pigmej (Member, Author) commented Sep 6, 2023

Here are a few requirements from my side that definitely should be considered:

  • There should be no direct dependency between the worker and the registration service. They should be separate entities with separate life cycles. (Obviously not every version will work with every version, but that's standard software development.)
  • Additionally, IMHO worker services should take the union of all accessible member lists and build the biggest possible tree (see the sketch after this list). That would lower the pressure to keep the registration services HA, reduce the impact of region failures, etc.
  • And VERY importantly, please keep it simple: no MQ. Simple long polling or even periodic requests are much better than any MQ for this. There is literally one (maybe a few) messages per two weeks. We really don't want overnight debugging of an MQ issue or a Kafka problem, or hotfixing those parts.
  • We also shouldn't then need more "workarounds" like "if 110 then 111".
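
A minimal sketch of the union idea, assuming a hypothetical fetchMembers function that retrieves a round's member list from one registration service:

```go
// Illustrative only: merging member lists fetched from several registration
// services before building the round's membership tree. The fetchMembers
// callback and the service URLs are hypothetical.
package worker

import (
	"context"
	"encoding/hex"
)

// unionMembers merges member lists from all reachable registration services,
// de-duplicating identical challenges, so the worker can build the biggest
// possible tree even if some services are unreachable.
func unionMembers(ctx context.Context, services []string,
	fetchMembers func(ctx context.Context, url string) ([][]byte, error)) [][]byte {

	seen := make(map[string]struct{})
	var union [][]byte
	for _, url := range services {
		members, err := fetchMembers(ctx, url)
		if err != nil {
			continue // a failed service only shrinks the tree, it doesn't stop the round
		}
		for _, m := range members {
			key := hex.EncodeToString(m)
			if _, ok := seen[key]; !ok {
				seen[key] = struct{}{}
				union = append(union, m)
			}
		}
	}
	return union
}
```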

@poszu poszu moved this from 🏗 Doing to 📋 Backlog in Dev team kanban Sep 27, 2023
@pigmej pigmej moved this from 📋 Backlog to On Hold in Dev team kanban Dec 18, 2024