Poet 2.0 architecture and design #367

Open · Tracked by #257
pigmej opened this issue Nov 28, 2022 · 9 comments
pigmej (Member) commented Nov 28, 2022

To improve Poet's resiliency even further and to be able to scale better, we need to make architectural changes in Poet.

The idea currently proposed by @noamnelke looks as follows:
PoET Architecture

PoET Service Architecture.excalidraw.txt

noamnelke (Member) commented

This proposal also includes some architectural changes to the node. I'll explain it all in more detail when I have more time (since I don't think this is feasible before genesis), but I wanted to have this placeholder out.

@poszu poszu added this to the Poet v2 milestone Sep 1, 2023
@poszu poszu mentioned this issue Sep 1, 2023
@poszu poszu transferred this issue from spacemeshos/pm Sep 1, 2023
poszu (Contributor) commented Sep 1, 2023

@pigmej @noamnelke
An updated proposal of the new Poet architecture (the image has Excalidraw scene embedded).
poet-v2-arch

A short summary of the Registration service API as of now, plus plans (a rough node-side usage sketch follows the list):

source: https://github.com/spacemeshos/poet/blob/develop/rpc/api/v1/api.proto

  • /Info: serves poet runtime information:
    • open round ID (❗not used by the node),
    • executing round ID, if any (❗not used by the node),
    • poet PubKey (⚠️ not really useful, a node should probably get the pubkey along with poet URL in the config),
    • round information (phase shift and cycle gap) - not used currently
  • /PowParams - returns the proof of work parameters (challenge and difficulty) - needed for submitting
  • /Submit - submits a challenge/token
  • /Proofs/<round> - serves proofs for requested round
  • 🚧 planned /Members/<round> - serves members of the requested round (Add RPC API for querying the list of members for a given round. #200)
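
For orientation, below is a minimal sketch of how a node might consume these endpoints, assuming they are reachable over plain HTTP (e.g. via a gRPC gateway). It is illustrative only: the JSON field names, response shapes, and the base URL are assumptions made for the example, not the schema defined in api.proto.

```go
// Illustrative only: a node-side client for the Registration service endpoints
// listed above. The HTTP paths mirror the list; the JSON field names and the
// base URL are assumptions, not the real schema from api.proto.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type Info struct {
	OpenRoundID string `json:"openRoundId"` // assumed field name
	PubKey      string `json:"pubKey"`      // assumed field name
}

type PowParams struct {
	Challenge  string `json:"challenge"`  // assumed field name
	Difficulty uint32 `json:"difficulty"` // assumed field name
}

func getJSON(url string, out any) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}

func main() {
	base := "http://poet.example.com" // hypothetical registration service URL

	var info Info
	if err := getJSON(base+"/Info", &info); err != nil {
		panic(err)
	}
	fmt.Println("open round:", info.OpenRoundID)

	var pow PowParams
	if err := getJSON(base+"/PowParams", &pow); err != nil {
		panic(err)
	}
	// A real node would now solve the PoW, POST its challenge to /Submit,
	// and later fetch /Proofs/<round> for the round it was registered in.
}
```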

Open questions

Should we stick with LevelDB in the registration service?

Is it required to be able to horizontally scale instances of the Registration service?

It would require:

  • an externally served DB,
  • a centralized agent to schedule rounds (to decide when a round closes) and to feed the workers

I think it's not worth the effort, at least not at this stage. A single registration service will be faster and should easily be able to serve hundreds of thousands of /Submit requests within the cycle gap window.
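
For a rough sense of scale (illustrative numbers, not taken from any config): with a cycle gap of 12 hours, 500,000 /Submit requests spread over the window average about 500,000 / 43,200 s ≈ 12 requests per second, which a single gRPC server handles comfortably even with generous headroom for bursts near the deadline.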

dshulyak (Contributor) commented Sep 2, 2023

I think using Kafka or anything like it is gross over-engineering. Publishing a single message every two weeks can be done with gRPC or native HTTP/2 easily. If you want redundancy, e.g. getting the membership from more than one service, it can be achieved by connecting to more than one registration service.

If you insist on doing "Kafka", please support a simple mode for system tests and a standalone mode.

poszu (Contributor) commented Sep 4, 2023

@dshulyak

> I think using Kafka or anything like it is gross over-engineering.

What is gross about it? Please explain why you think it's not the right tool for the job.

> Publishing a single message every two weeks can be done with gRPC or native HTTP/2 easily.

We don't want to expose worker servers to the Internet at all. The workers should instead pull data from the Registration service.

> If you insist on doing "Kafka", please support a simple mode for system tests and a standalone mode.

Yes, support for a standalone mode is planned in #365:

> 💡 The option to run them together in a single process (aka a standalone mode) could probably stay for cloud deployments, go-sm unit- and system-tests etc.

dshulyak (Contributor) commented Sep 4, 2023

> What is gross about it? Please explain why you think it's not the right tool for the job.

Because the problem is very simple. So the requirement that everyone who wants a robust poet needs to maintain a Kafka cluster looks gross to me.

> The workers should instead pull data from the Registration service.

I don't understand the difference. Instead of "Kafka", the data can be pulled from this registration service. If you want to keep a single URL in the workers, this registration service can aggregate data from multiple frontends.

poszu (Contributor) commented Sep 5, 2023

> Because the problem is very simple. So the requirement that everyone who wants a robust poet needs to maintain a Kafka cluster looks gross to me.

It's a fair point. Using an MQ would complicate the deployment a little by requiring a message queue server to be set up. But is it that bad? Their deployment is usually straightforward. I proposed Kafka because it is an MQ I have worked with before, but perhaps there are better/simpler solutions.

> If you want to keep a single URL in the workers, this registration service can aggregate data from multiple frontends.

Isn't that re-inventing an MQ?

Assuming we shoot down the idea of an MQ, what other good options do we have?

  1. A gRPC API, with the registration service working as a server that provides a way to (roughly as sketched below):
    • pull the membership root for the next round to execute,
    • post a new proof when a round is finished.
  2. Do you have any other ideas?
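
To make option 1 concrete, here is a minimal sketch of the shape such an API could take, written as a Go interface rather than the eventual .proto definition. All names and signatures here are assumptions for illustration, not an agreed design.

```go
// Illustrative only: one possible shape for option 1 above. All names are
// assumptions made for the sketch.
package worker

import "context"

// RegistrationClient is what a worker would use to pull work from, and push
// results back to, the registration service. The workers dial out, so nothing
// on the worker side has to be exposed to the Internet.
type RegistrationClient interface {
	// NextRound blocks (long-polls) until the registration service closes a
	// round, then returns its ID together with the membership root to prove.
	NextRound(ctx context.Context) (roundID string, membershipRoot []byte, err error)

	// SubmitProof posts the finished proof for a round back to the
	// registration service, which then serves it to nodes via /Proofs/<round>.
	SubmitProof(ctx context.Context, roundID string, proof []byte) error
}

// Run is the worker's main loop under this sketch: pull a round, execute the
// PoSW over its membership root, and push the proof back.
func Run(ctx context.Context, client RegistrationClient, prove func([]byte) []byte) error {
	for {
		roundID, root, err := client.NextRound(ctx)
		if err != nil {
			return err
		}
		if err := client.SubmitProof(ctx, roundID, prove(root)); err != nil {
			return err
		}
	}
}
```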

noamnelke (Member) commented
I'm working on my own proposal based on the one by @poszu (mostly similar).

In the meantime, I'll say that I generally agree with @dshulyak that there must be an "easy mode" to run a PoET server, for tests but also for anyone who wants to easily run a private server for themselves.

I think we can satisfy everyone if we design PoET as a few building blocks that can be used in different ways:

  • PoSW module
  • Membership tree module
  • Scheduling module
  • Others? (I'll know more when I finish speccing my idea, still WIP)

Then we can use those in "standalone mode" by building some scaffolding around them that implements the API by calling into these modules from a single Go executable.

For "infra mode" we'll build different scaffolding that can use a MQ to communicate between several different executables and data stores.

Both of these "modes" will be used internally - standalone in tests and infra mode for the actual PoET service that Spacemesh operates. There's some risk that bugs will exist in one version and not the other, but we can try to minimize this by keeping the scope of the separate scaffolding minimal.

Does this make sense to you guys, or does it sound over-engineered and overly complex?

dshulyak (Contributor) commented Sep 6, 2023

I don't think that supporting 3 modes of operation is a good idea. If there can be one that is reliable and efficient, we should be using it everywhere.

> It's a fair point. Using an MQ would complicate the deployment a little by requiring a message queue server to be set up. But is it that bad?

It means that the Kafka-based mode will be the reliable one, and of everything else we will say that it may not be that reliable. So everyone who needs to help maintain poet will have to learn how to work with and debug Kafka, and whoever deploys it will also need a basic understanding of what can go wrong.

> A gRPC API, with the registration service working as a server that provides a way to:

This is what I have in mind. I don't think that downloading data periodically, or on notification, implies reinventing an MQ.

> Does this make sense to you guys, or does it sound over-engineered and overly complex?

It does seem unnecessarily complex to me.

pigmej (Member, Author) commented Sep 6, 2023

Here are a few requirements from my side that definitely should be considered:

  • There should be no direct dependency between the worker and the registration service. They should be separate entities with separate life cycles. (Obviously not every version will work with every version, but that's standard software development.)
  • Additionally, IMHO worker services should take the union of all accessible member lists and build the biggest possible tree (see the sketch after this list). That would lower the pressure to keep the registration services HA, reduce the impact of region failures, etc.
  • And VERY importantly, please keep it simple: no MQ. Simple long polling or even periodic requests are much better than any MQ for this. There is literally one (maybe a few) messages per two weeks. We really don't want overnight debugging of an MQ issue or a Kafka problem, or hotfixing those parts.
  • We also shouldn't then need more "workarounds" like "if 110 then 111".
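
A minimal sketch of the union idea, assuming a hypothetical fetchMembers function that retrieves a round's member list from one registration service:

```go
// Illustrative only: merging member lists fetched from several registration
// services before building the round's membership tree. The fetchMembers
// callback and the service URLs are hypothetical.
package worker

import (
	"context"
	"encoding/hex"
)

// unionMembers merges member lists from all reachable registration services,
// de-duplicating identical challenges, so the worker can build the biggest
// possible tree even if some services are unreachable.
func unionMembers(ctx context.Context, services []string,
	fetchMembers func(ctx context.Context, url string) ([][]byte, error)) [][]byte {

	seen := make(map[string]struct{})
	var union [][]byte
	for _, url := range services {
		members, err := fetchMembers(ctx, url)
		if err != nil {
			continue // a failed service only shrinks the tree, it doesn't stop the round
		}
		for _, m := range members {
			key := hex.EncodeToString(m)
			if _, ok := seen[key]; !ok {
				seen[key] = struct{}{}
				union = append(union, m)
			}
		}
	}
	return union
}
```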

@poszu poszu moved this from 🏗 Doing to 📋 Backlog in Dev team kanban Sep 27, 2023
@pigmej pigmej moved this from 📋 Backlog to On Hold in Dev team kanban Dec 18, 2024