Check block cache across multiple rainbow instances #109

Closed
lidel opened this issue Mar 21, 2024 · 8 comments
Labels
dif/expert (Extensive knowledge (implications, ramifications) required) · dif/medium (Prior experience is likely helpful) · effort/days (Estimated to take multiple days, but less than a week) · kind/enhancement (A net-new feature or improvement to an existing feature) · P1 (High: Likely tackled by core team if no one steps up)

Comments

lidel (Member) commented Mar 21, 2024

Problem

At inbrowser.dev (backed by rainbow from the ipfs.io gateway, so this is a general problem in our infra), we see inconsistent page load times across regions, and sometimes across requests within the same region.

A user can get an instant response from one instance, and then on a subsequent page load or request get a stalled page load and a timeout, even though the data exists in the cache of one of the other rainbows in the global cluster. We also see inconsistency across subresources on a single page.

Scope

  • Rainbow users running multiple instances should have a way to "logically merge" their block caches
  • This should be an opt-in feature that requires manual configuration by the rainbow operator
  • (Open question) Do we want to run a bitswap server in rainbow, or an HTTP client, to avoid "the unsustainable manual peering trap"?
  • We don't want to invent any new protocols; use the HTTP stack if possible.

Solutions

A: Add HTTP Retrieval Client to Rainbow, leverage Cache-Control: only-if-cached

We know we need an HTTP retrieval client for Kubo to enable HTTP Gateway over Libp2p by default, and to make direct HTTP retrieval from service providers more feasible. We can't do that without a client and end-to-end tests. Prototyping one in Rainbow sounds like a good plan, improving multiple work streams at the same time.

The idea here is to introduce an HTTP client which runs in addition to, or in parallel with, bitswap retrieval.
Keep it simple, don't mix abstractions; do opportunistic block retrieval like bitswap, but over HTTP.

Using application/vnd.ipld.raw and the trustless gateway protocol is a good match here: it allows us to benefit from HTTP caching and middleware, making it more flexible than bitswap.

Rainbow could:

  • Have a list of other rainbow instances in the form of URLs with trustless gateway endpoints
    • In the case of the ipfs.io gateway, we could produce a list with shuffled same-region instances first, and the rest of the instances after them.
  • Make inexpensive block requests with Cache-Control: only-if-cached, going over the list in sequence.
    • This does not cost any expensive IO: if a rainbow does not have the block locally, it will instantly respond with HTTP 412.

This way, once a block lands in any of our rainbow caches, we will discover it, and requests won't time out after 1m in unlucky scenarios.
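
A minimal sketch of such a probe, assuming a plain net/http client; the gateway URL layout, the Accept header, and the 412 miss status come from the description above, everything else (names, timeout) is illustrative:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/ipfs/go-cid"
)

// fetchCachedBlock asks each sibling gateway for a raw block with
// Cache-Control: only-if-cached. A sibling with the block cached replies
// 200 with the raw bytes; one without it replies instantly (412) without
// doing any retrieval work. Callers should verify the bytes against the CID.
func fetchCachedBlock(ctx context.Context, gateways []string, c cid.Cid) ([]byte, error) {
	client := &http.Client{Timeout: 2 * time.Second} // keep probes cheap and bounded
	for _, gw := range gateways {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, fmt.Sprintf("%s/ipfs/%s", gw, c), nil)
		if err != nil {
			return nil, err
		}
		req.Header.Set("Accept", "application/vnd.ipld.raw") // trustless gateway, single raw block
		req.Header.Set("Cache-Control", "only-if-cached")    // never trigger remote retrieval
		resp, err := client.Do(req)
		if err != nil {
			continue // instance unreachable, try the next one
		}
		if resp.StatusCode == http.StatusOK {
			data, err := io.ReadAll(resp.Body)
			resp.Body.Close()
			return data, err
		}
		resp.Body.Close() // cache miss on this instance, move on
	}
	return nil, fmt.Errorf("block %s not cached on any sibling", c)
}
```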

Open questions:

  • Is a sequential, inexpensive HTTP check enough to avoid amplification attacks?
  • Is it OK to start at the same time as bitswap, or do we want to delay and act as a fallback when we are unable to find the block by regular means (for >10-30s)?

B: Set up a reverse proxy (nginx, lb) to try rainbows with Cache-Control: only-if-cached first

Writing this down just to have something other than (A); I don't personally believe (B) is feasible.

The idea here is to update the way our infrastructure proxies gateway requests to rainbow instances: first ask all upstream instances within the region for the resource with Cache-Control: only-if-cached, and if none of them has it, retry with a normal request that triggers p2p retrieval.

The downside is that this feels like an antipattern:

  • It overrides any user-provided Cache-Control
  • It creates cache hot spots: popular data is not distributed across rainbow instances, but always served by the specific instance which fetched it first.

C: Reuse the Bitswap client and server we already have

Right now, Rainbow runs Bitswap in read-only mode: it always says it does not have the data when asked over bitswap.

What we could do is a permissioned version of peering (sketched after this list):

  • libp2p preconnects to a safelisted set of peers and protects these peering connections from being closed
    • If Rainbow does not announce peer records to the DHT, we should require full /ip|dns*/.../p2p/peerid multiaddrs, otherwise we won't be able to dial the safelisted peers
  • (for now) allow serving data over bitswap to a safe-listed set of /p2p/ multiaddrs (quick and easy), leveraging existing peering config / libraries where possible (Add peering support #35)
  • (allows us to do more in the future) switch to HTTP retrieval (over libp2p or /http)
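
For illustration, a sketch of (C) using boxo's bitswap server, assuming its WithPeerBlockRequestFilter option (present in boxo at the time of writing); the wiring and names are hypothetical:

```go
package main

import (
	"context"

	bsnet "github.com/ipfs/boxo/bitswap/network"
	bsserver "github.com/ipfs/boxo/bitswap/server"
	"github.com/ipfs/boxo/blockstore"
	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p/core/peer"
)

// newPermissionedServer serves blocks over bitswap only to safelisted
// peers; everyone else keeps getting today's read-only behavior
// ("I don't have it").
func newPermissionedServer(ctx context.Context, network bsnet.BitSwapNetwork, bstore blockstore.Blockstore, safelist map[peer.ID]struct{}) *bsserver.Server {
	filter := func(p peer.ID, c cid.Cid) bool {
		_, ok := safelist[p] // true => this peer may be sent the block
		return ok
	}
	return bsserver.New(ctx, network, bstore, bsserver.WithPeerBlockRequestFilter(filter))
}
```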

D: ?

Ideas welcome.

lidel (Member, Author) commented Mar 21, 2024

cc @aschmahmann: (A) is a brain dump of the idea for how we could logically share the block caches I mentioned earlier this week. A sanity check would be appreciated.

lidel (Member, Author) commented Mar 28, 2024

@hacdias fysa: after discussing with @aschmahmann, it seems that option (C) is the easiest to wire up and get working today (we already have bitswap), while still allowing us to leverage HTTP in the future, once a client exists.

I imagine the end user would only need to set up a single list:
RAINBOW_PEERING_ADDRS=/dns4/peer1.example.com/tcp/4001/p2p/{peerid1},/dns4/peer2.example.com/tcp/4001/p2p/{peerid2}

This will both (see the sketch below):

  • ensure the connection with the passed peers is maintained
  • allow serving blocks over bitswap to any of the peerids listed
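
A sketch of the wiring, assuming boxo's peering package (github.com/ipfs/boxo/peering, the same connection-protection logic kubo uses); setupPeering is a hypothetical helper:

```go
package main

import (
	"strings"

	"github.com/ipfs/boxo/peering"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
	multiaddr "github.com/multiformats/go-multiaddr"
)

// setupPeering parses the comma-separated RAINBOW_PEERING_ADDRS value and
// keeps a protected connection to each listed peer, reconnecting on drops.
func setupPeering(h host.Host, addrsEnv string) (*peering.PeeringService, error) {
	ps := peering.NewPeeringService(h)
	for _, s := range strings.Split(addrsEnv, ",") {
		ma, err := multiaddr.NewMultiaddr(strings.TrimSpace(s))
		if err != nil {
			return nil, err
		}
		ai, err := peer.AddrInfoFromP2pAddr(ma) // requires the /p2p/{peerid} component
		if err != nil {
			return nil, err
		}
		ps.AddPeer(*ai)
	}
	return ps, ps.Start()
}
```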

hsanjuan (Contributor) commented

So one premise of rainbow vs. the older gateways was to avoid hosting: if data is not retrievable from somewhere in the ipfs network, then that is not rainbow's problem. This moves in the direction of actually using rainbow to host things, by relying on other rainbow peers' caches, which in turn assumes that rainbow peers do cache things for non-negligible times. Big baggage.

Regarding RAINBOW_PEERING_ADDRS: the idea of --seed and --seed-index was that each peer can autogenerate the other rainbow peers' addresses, look them up in the DHT, and auto-protect connections to them without any ad-hoc configuration like providing a list of peers, which is always a pain when rolling out and scaling things up and down. Maybe it's useful now.

lidel (Member, Author) commented Apr 3, 2024

The idea is to limit the baggage by enabling "hosting of cached things" only for safelisted peerids. This "cache sharing" requires mutual agreement and is opportunistic: it has no SLA for how long things are cached, and the default bitswap behavior for non-safelisted peers remains to always respond "I don't have it".

Perhaps we should rename this feature and move away from "peering" to "cache sharing" to set expectations closer to reality and avoid feature creep?

In the case of ipfs.io, "cache sharing" will be with other rainbow instances, but we have use cases where people self-host their own datasets and want to use rainbow as a dedicated gateway in front of kubo or ipfs-cluster. I hoped to create a config option which works for them too; that is why the explicit RAINBOW_PEERING_ADDRS was proposed, but it might be too flexible.


Reusing --seed and --seed-index + having peer routing announcements would allow us to do peering and cache sharing without having to configure peerids/multiaddrs. I agree, it feels "safer" for the ecosystem, and easier to maintain. By limiting cache sharing only to sibling rainbow instances, we don't bring baggage or allow for anti-patterns: it is only for "rainbow cache sharing" and still forces everyone to use regular / delegated routing for discovering "real" providers.

If we go with --seed, we could enable cache sharing via opt-in configuration. I guess we need to limit the number of peerids we generate for safelisting, so perhaps RAINBOW_CACHE_SHARE_ALLOW_INDEXES=a,b-c, where a is the seed-index of a specific rainbow instance we allow cache sharing with and b-c is a range? The ipfs.io infra would have a simple RAINBOW_CACHE_SHARE_ALLOW_INDEXES=0-n
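
For concreteness, a hypothetical parser for that a,b-c syntax, expanding the value into the set of safelisted seed indexes (names are illustrative):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseAllowIndexes expands values like "3,0-5" into the list of seed
// indexes whose derived peerids are allowed to share caches with us.
func parseAllowIndexes(spec string) ([]int, error) {
	var out []int
	for _, part := range strings.Split(spec, ",") {
		part = strings.TrimSpace(part)
		if lo, hi, isRange := strings.Cut(part, "-"); isRange {
			a, errA := strconv.Atoi(lo)
			b, errB := strconv.Atoi(hi)
			if errA != nil || errB != nil || a > b {
				return nil, fmt.Errorf("bad range %q", part)
			}
			for i := a; i <= b; i++ {
				out = append(out, i)
			}
			continue
		}
		n, err := strconv.Atoi(part)
		if err != nil {
			return nil, fmt.Errorf("bad index %q", part)
		}
		out = append(out, n)
	}
	return out, nil
}
```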

hsanjuan (Contributor) commented Apr 3, 2024

> so perhaps RAINBOW_CACHE_SHARE_ALLOW_INDEXES=a,b-c, where a is the seed-index of a specific rainbow instance we allow cache sharing with and b-c is a range? The ipfs.io infra would have a simple RAINBOW_CACHE_SHARE_ALLOW_INDEXES=0-n

I would separate the feature where rainbow peers autodiscover and connect to each other as something of its own... perhaps call it RAINBOW_SEEDS_PEERING, RAINBOW_SEEDS_SWARM, or RAINBOW_SEEDS_NETWORK. It's not only useful for caches. My original thought was to mount diverse functionality on top, in particular a distributed rate-limiting/quota system across a swarm of rainbows.

So after implementing that, on the side you could have RAINBOW_SEEDS_BITSWAP, or RAINBOW_SHARED_CACHE, or whatever, which relies on the rainbow swarm.

hsanjuan (Contributor) commented Apr 3, 2024

Regarding indexes: I would index 0-100 by default, then maybe have a MAX_INDEX option to go higher/lower. I fear having ranges may just be one more way for users to configure things wrong.

lidel (Member, Author) commented Apr 4, 2024

Sgtm. We can add opt-in config:

  • RAINBOW_SEEDS_PEERING=true|false
    • false by default
    • when set to true:
      • requires SEED and SEED_INDEX to be set, and errors if not
      • reads RAINBOW_SEEDS_PEERING_MAX_INDEX; if not set, uses 100 as the implicit default
      • sets up peering with the specific instances via boxo/peering
  • RAINBOW_PEERING_SHARED_CACHE=true|false
    • false by default
    • enabling it will enable sharing of the cache with peerids from peering.ListPeers()
      • the initial implementation will be (C): modify bitswap behavior and return locally cached data when asked by peered nodes

Then the infra at ipfs.io would run with the same SEED, RAINBOW_SEEDS_PEERING=true, and RAINBOW_PEERING_SHARED_CACHE=true.
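
A rough sketch of the startup validation this implies (hypothetical helper, not rainbow's actual code):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// loadSeedPeeringConfig mirrors the rules above: peering is opt-in,
// requires SEED and SEED_INDEX, and the max index defaults to 100.
func loadSeedPeeringConfig(seed string, seedIndex int) (peeringOn, sharedCache bool, maxIndex int, err error) {
	peeringOn = os.Getenv("RAINBOW_SEEDS_PEERING") == "true"
	sharedCache = os.Getenv("RAINBOW_PEERING_SHARED_CACHE") == "true"
	maxIndex = 100 // implicit default
	if v := os.Getenv("RAINBOW_SEEDS_PEERING_MAX_INDEX"); v != "" {
		if maxIndex, err = strconv.Atoi(v); err != nil {
			return false, false, 0, fmt.Errorf("invalid RAINBOW_SEEDS_PEERING_MAX_INDEX: %w", err)
		}
	}
	if peeringOn && (seed == "" || seedIndex < 0) {
		return false, false, 0, fmt.Errorf("RAINBOW_SEEDS_PEERING=true requires SEED and SEED_INDEX")
	}
	return peeringOn, sharedCache, maxIndex, nil
}
```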

@lidel added the need/triage, P1, dif/expert, dif/medium, effort/days, and kind/enhancement labels and removed the need/triage label on Apr 4, 2024
@lidel moved this to 🥞 Todo in IPFS Shipyard Team on Apr 4, 2024
hsanjuan (Contributor) commented Apr 4, 2024

> We can add opt-in config:

Sounds good. The only nitpick is that RAINBOW_PEERING_SHARED_CACHE should automatically enable RAINBOW_SEEDS_PEERING and require its preconditions, right? Or should it fail loudly if peering is false? I lean towards the former.

Also, I understand these become flags too, like everything else.

@hacdias self-assigned this on Apr 5, 2024
@github-project-automation bot moved this from 🥞 Todo to 🎉 Done in IPFS Shipyard Team on Apr 24, 2024