Crawler based DHT client #709

Merged: 29 commits, May 14, 2021

Conversation

@aschmahmann (Contributor) commented Mar 12, 2021

closes #619

Still very WIP. Should merge after the crawler gets merged.

TODO:

  • Extract crawler fixes into crawler branch and merge
  • Expose two implementations of the MessageSender interface that we already have (this will have to wait for another PR):
    • One for sending one off messages to peers
    • One that tries to reuse streams and keep them alive
  • Add support for options in constructor
  • Add docs + comments - we have some, but we'll need to come back for more once we're able to do more of the cleanup work

@whyrusleeping (Contributor)

What does the general extra load on the network look like for this? Crawling 'every peer' every hour seems... potentially expensive, especially as the network grows.

Additionally, there's added resource consumption to consider here; we should quantify that.

And finally, what specific metrics should we be gathering on the network to help us better understand these implications?

@aschmahmann (Contributor, Author)

All excellent questions.

What does the general extra load on the network look like for this?

I'm not sure, in that I haven't done large-scale testground tests to simulate the behavior here, or thought of a good way to test this at a smaller scale.

However, at a high level resource consumption should probably look like this:

  • Client resource usage
    • I mostly don't care about this at the moment. Local testing shows it uses more RAM and not a ton of CPU, but it's an experimental alternative client; if users don't like it they can turn it off
  • Server resource usage
    • Record storage (HDD, RAM, CPU)
      • I expect this number across the public IPFS network to increase pretty significantly since
        • I'm pretty sure that, due to issues with the standard DHT client and how go-ipfs uses it, only a fraction of people's records are actually making it out to the network
        • This may encourage more use of IPNS as well as IPFS in general
      • If this number really increases drastically, that's both problematic and really cool. A user with 1M CIDs should be storing about 1M * 20 (the bucket size) * 100 bytes (a CID is ~30 bytes, but let's assume a bunch of overhead) = 2GB of storage across the network (see the back-of-the-envelope sketch after this list)
        • If we assume something like 20k DHT server nodes, that's 100KB per node
          • So if we assume DHT server nodes would care about a 100MB increase in data storage, that would mean an extra billion CIDs getting onto the network, which would be pretty cool 😎
          • If this started becoming a problem we'd have to see how many CIDs came from a few major sources and then work to prevent abuse (e.g. a max number of records per peerID/IP address)
          • IPNS records and especially public keys are bigger than CIDs
    • Connections (CPU, RAM, Sockets)
      • The standard client
        • periodically queries subsets of the network to update its routing
        • queries the network and forms new connections every time the user/application performs a query (e.g. peer address discovery, provider records/CIDs, IPNS records) and then throws the data away once we use it (e.g. querying the same key twice in a row gives you very little savings)
      • The alternative client
        • periodically queries the entire network to update its routing (it currently does this once an hour in bulk, but could probably be spread out and a number that wasn't picked out of a hat could be chosen 😅)
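
As a rough illustration of the arithmetic above, here is a minimal back-of-the-envelope sketch in Go; the node count, per-record size, and the 100MB tolerance threshold are the assumed figures from this comment, not measured values.

package main

import "fmt"

// Assumed figures from the discussion above (rough estimates, not measurements).
const (
	cidsPerUser    = 1_000_000   // 1M CIDs provided by a single user
	replication    = 20          // the bucket size: each record lands on ~20 peers
	bytesPerRecord = 100         // a CID is ~30 bytes, padded generously for overhead
	serverNodes    = 20_000      // guessed number of DHT server nodes
	tolerableBytes = 100_000_000 // ~100MB of extra storage a server node might tolerate
)

func main() {
	totalBytes := int64(cidsPerUser) * replication * bytesPerRecord
	fmt.Printf("storage across the network for 1M CIDs: ~%d GB\n", totalBytes/1_000_000_000) // ~2 GB
	fmt.Printf("storage per server node: ~%d KB\n", totalBytes/serverNodes/1_000)            // ~100 KB

	// How many CIDs fit before each server node stores an extra ~100MB?
	cidsAt100MBPerNode := int64(tolerableBytes) * serverNodes / (replication * bytesPerRecord)
	fmt.Printf("CIDs needed to add ~100MB per node: ~%d\n", cidsAt100MBPerNode) // ~1 billion
}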

what specific metrics should we be gathering on the network to help us better understand these implications

The winning question! Here are some that are on my mind (but more are welcome, cc @petar @barath):

  • Churn rate on the network -> it can inform how frequently we need to crawl the network
  • Measure the number (and type) of queries a server node (or the group of hydra nodes) receives per hour, as well as how many peerIDs we learn about during that time -> we should be able to compare that with the number of queries we'd have in the new system (if you exclude the final Puts)
  • Measure resource usage on an individual node as it stores more CIDs -> so we can make some projections around stress per node as the network usage increases

@aschmahmann force-pushed the feat/crawler-client branch 3 times, most recently from 4c5ccc6 to 9fa7066, on April 13, 2021 19:50
@robertschris

This is an interesting DHT model I hadn't thought of before.

While I don't expect this to replace the default DHT mode, it does offer great potential for bulk providers for one reason: it gets more efficient the more query throughput is fed through it.

Over time, as short-lived and unreliable nodes get culled from the peer table, only long-running, high-quality peers should remain, and failure rates should drop quite quickly (if the query rate is high enough).

Some thoughts on this...

  1. You should add a Bloom filter (or similar) filled with all failed/unreliable peers.
    Adding a BF containing all the multiaddresses that you have failed to connect to in the past should offer a nice performance boost, as over time you will stop trying to use peers we already know are unreliable but keep getting told about. (A rough sketch of this idea follows at the end of this comment.)

  2. Adding the ability for trusted nodes (part of a cluster?) to delegate their put requests to these bulk-provider nodes should help, as this DHT mode is both expensive to keep up and gets better the more you use it, so using it like a supernode for put requests should be quite effective.

  3. The warmup time is problematic and will increase with network size. Some shortcuts I can think of, though:
    a. Save your peer table on shutdown and reload it on startup, unless it is too old and stale, in which case discard it.
    b. If you have trusted peers that are already running and stable, you can grab their peer table.

  4. The default table refresh strategy should be "trickle" based to spread out the upkeep load more evenly; full crawls are also expensive to the network as a whole and should be used sparingly. My first thought on this would be to periodically run a set of "star"-positioned requests and add (perhaps test first) all the peers we hear about while traversing these findprovs queries.

[image: starRefresh (diagram of the star-positioned refresh queries)]

This should net you a few hundred peers throughout the DHT to add to your table each time you run it; run it either on a fixed interval (every 30 seconds?) or dynamically, whenever the node thinks it needs to refresh the table.

Additionally, it may be necessary once in a while (once a week/month?) to clear the unreliable-peer BF and do a full crawl, both to keep the false-positive rate on the BF down and to give peers another shot at proving to be reliable.
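
To make suggestion 1 concrete, here is a minimal sketch of an unreliable-address filter backed by a small hand-rolled Bloom filter keyed on multiaddress strings. This is an illustration, not code from this PR: the type and function names are made up, and a real implementation would more likely reuse an existing Bloom filter library and reset the filter periodically as described above.

package main

import (
	"fmt"
	"hash/fnv"
)

// failedAddrFilter is a hypothetical Bloom filter over multiaddress strings of
// peers we previously failed to connect to. False positives are possible, so a
// peer can only be wrongly skipped, never wrongly trusted.
type failedAddrFilter struct {
	bits   []uint64
	m      uint64 // number of bits
	hashes int    // number of hash functions
}

func newFailedAddrFilter(mBits uint64, hashes int) *failedAddrFilter {
	return &failedAddrFilter{bits: make([]uint64, (mBits+63)/64), m: mBits, hashes: hashes}
}

// indexes derives the k bit positions for an address via double hashing.
func (f *failedAddrFilter) indexes(addr string) []uint64 {
	h1 := fnv.New64a()
	h1.Write([]byte(addr))
	h2 := fnv.New64()
	h2.Write([]byte(addr))
	a, b := h1.Sum64(), h2.Sum64()|1 // make b odd so probes spread across the table
	out := make([]uint64, f.hashes)
	for i := range out {
		out[i] = (a + uint64(i)*b) % f.m
	}
	return out
}

// MarkFailed records a multiaddress we failed to dial.
func (f *failedAddrFilter) MarkFailed(addr string) {
	for _, idx := range f.indexes(addr) {
		f.bits[idx/64] |= 1 << (idx % 64)
	}
}

// ProbablyFailed reports whether the address was (probably) marked failed before.
func (f *failedAddrFilter) ProbablyFailed(addr string) bool {
	for _, idx := range f.indexes(addr) {
		if f.bits[idx/64]&(1<<(idx%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	// ~1M bits (~128KB) and 4 hash functions: arbitrary sizing for the example.
	filter := newFailedAddrFilter(1<<20, 4)
	filter.MarkFailed("/ip4/192.0.2.1/tcp/4001")
	fmt.Println(filter.ProbablyFailed("/ip4/192.0.2.1/tcp/4001"))    // true
	fmt.Println(filter.ProbablyFailed("/ip4/198.51.100.7/tcp/4001")) // false (probably)
}

During a crawl, addresses for which ProbablyFailed returns true would simply be skipped; clearing the filter on the suggested weekly/monthly full crawl keeps the false-positive rate bounded.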

@aschmahmann (Contributor, Author)

@robertschris 👍 to basically all your suggestions.

Some thoughts on your suggestions:

  1. Keeping some peer history would probably be good, e.g. peers might only be online for 16hrs a day and then off for the other 8, which could be nice to track. Tracking the peers here doesn't seem the highest priority IMO; rather, it'd be great to update the table more gradually than only once per crawl.
  2. This is definitely a good thing to do and will make running these nodes as infrastructure really nice. However, it is either already doable or requires a protocol update depending on the circumstance.
  3. I agree these seem like good ideas that should be available
  • In the shorter term what I'll likely do here is bundle the existing client in with this one and use the standard client until this one is fully operational (a rough sketch of that fallback pattern follows after this list).
  4. Related to 3 and 1, the key with the network crawling is not so much that the whole network is crawled but that the routing table is a complete trie instead of being segmented into buckets of 20. Properly implementing accelerated lookups from the Kademlia paper (XOR-trie based routing table, #572) and filling the trie gradually instead of crawling the whole network at once seems reasonable. If it doesn't happen in this PR I would certainly like to see it happen in a follow-up one.
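
As a rough illustration of the "bundle the existing client" idea in point 3, here is a minimal sketch of a composite router that prefers the crawler-based client once it reports readiness and otherwise falls back to the standard DHT client. This is not code from this PR: the type name and readiness hook are made up, only two methods are shown, and the import paths assume the go-libp2p-core packages in use at the time.

package fallbackrouting

import (
	"context"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p-core/peer"
	"github.com/libp2p/go-libp2p-core/routing"
)

// fallbackRouter is a hypothetical wrapper that delegates to the crawler-based
// (full routing table) client once it is ready, and to the standard DHT client
// before that. A full routing.Routing implementation would forward the
// remaining methods the same way.
type fallbackRouter struct {
	full     routing.Routing // crawler-based client
	standard routing.Routing // standard DHT client
	ready    func() bool     // e.g. "has the first crawl completed?"
}

func (r *fallbackRouter) pick() routing.Routing {
	if r.ready() {
		return r.full
	}
	return r.standard
}

// FindPeer forwards a peer lookup to whichever client is currently preferred.
func (r *fallbackRouter) FindPeer(ctx context.Context, p peer.ID) (peer.AddrInfo, error) {
	return r.pick().FindPeer(ctx, p)
}

// Provide forwards a provide call to whichever client is currently preferred.
func (r *fallbackRouter) Provide(ctx context.Context, c cid.Cid, broadcast bool) error {
	return r.pick().Provide(ctx, c, broadcast)
}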

I'd like to clarify that there are 3 separate optimizations here that work independently of each other (although they're complementary):

  • XOR Trie based routing table (and crawling)
  • Not waiting on any particular peer to respond to us (e.g. waiting for some fraction of the target peers to respond instead of the closest X peers)
  • Allowing for bulk (provide) operations

Some of these could likely be added into the standard DHT client, and some things from the standard client (like the routing logic) could potentially be reused here in a more limited way to help with gradually filling the trie. Additionally, when this new client is deemed sufficiently stable I'd like to see it be useful as a DHT server as well since these nodes could accelerate lookups by less powerful peers that are using the standard client lookup logic.

I'm hoping to do a writeup this month about the work so far, time permitting 🤞. Generally, I'm hoping this shows how much progress client code alone can give us, so we can then push on where the protocol needs changes and on the impacts of some of the tunable client + protocol parameters (e.g. number of peers in a response, routing table refresh periods, how many peers to wait for responses from before considering an operation successful, periodicity of repeating provides/puts, etc.).

@aschmahmann force-pushed the feat/crawler-client branch 3 times, most recently from ff71d0a to f2ee9d0, on April 22, 2021 21:17
Base automatically changed from feat/crawler to master on April 22, 2021 21:58
@aschmahmann force-pushed the feat/crawler-client branch 2 times, most recently from a063d5f to 3080640, on April 22, 2021 22:50
Comment on lines 27 to 32
// QueryFilterFunc is a filter applied when considering peers to dial when querying
type QueryFilterFunc func(dht interface{}, ai peer.AddrInfo) bool

// RouteTableFilterFunc is a filter applied when considering connections to keep in
// the local route table.
type RouteTableFilterFunc func(dht interface{}, conns []network.Conn) bool
@aschmahmann (Contributor, Author)

Note: this is a breaking change, as these used to take a *IpfsDHT. We could have them take a thing that returns a host, since that's all that was being used, but this might be workable as well while we explore this space.
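
For illustration only, here is a minimal sketch of what filters could look like under the new interface{}-based signatures quoted above; the function and package names are made up, and the import paths assume the go-libp2p-core packages in use at the time.

package example

import (
	"github.com/libp2p/go-libp2p-core/network"
	"github.com/libp2p/go-libp2p-core/peer"
)

// Hypothetical query filter matching QueryFilterFunc: the first argument is now
// an opaque interface{} rather than a *IpfsDHT. This one simply skips peers
// that advertise no addresses at all.
func hasAddrsQueryFilter(_ interface{}, ai peer.AddrInfo) bool {
	return len(ai.Addrs) > 0
}

// Hypothetical route table filter matching RouteTableFilterFunc: keep a peer
// only while we still hold at least one live connection to it.
func keepConnectedRTFilter(_ interface{}, conns []network.Conn) bool {
	return len(conns) > 0
}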

@MichaelMure (Contributor)

I just wanted to say that this is pretty cool work, and it will likely be useful for Infura. 👍 👍

…g GetClosestPeers calls. Only keep the backup addresses for peers found during a crawl that we actually connected with. Properly clear out peermap between crawls
@aschmahmann marked this pull request as ready for review on May 11, 2021 20:01
Review comments were left on: fullrt/dht.go, dht_filters.go (3 threads), subscriber_notifee.go
Successfully merging this pull request may close: Massively parallel provider slows down over time