Handle blob objects larger than MessageSizeMax #18
Unfortunately, the issue is security. I wrote up the issue and some potential solutions here: https://discuss.ipfs.io/t/git-on-ipfs-links-and-references/730/4
This is something the ipns helper from git-remote-ipld will attempt to solve by mapping large git objects to ipfs files (so, a form of option 4). This has three disadvantages:
It would be great to have some standard way to feed ipfs a map 'CID <-> CID' saying 'I trust this mapping for fetching large objects'.
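For illustration, a minimal sketch of what such a trusted mapping could look like. The names here are hypothetical, not an existing go-ipfs or git-remote-ipld API:

```go
package main

import "fmt"

// TrustedMap sketches the 'CID <-> CID' mapping floated above: the key is the
// original (oversized) object's CID, the value is the CID of a chunked ipfs
// file the user asserts has identical contents.
type TrustedMap map[string]string

// Resolve returns the chunked stand-in for an oversized CID, if one has been
// declared; fetching it still means trusting whoever built the map.
func (m TrustedMap) Resolve(originalCID string) (string, bool) {
	chunked, ok := m[originalCID]
	return chunked, ok
}

func main() {
	m := TrustedMap{"baf...bigblob": "Qm...chunkedfile"} // placeholder CIDs
	fmt.Println(m.Resolve("baf...bigblob"))
}
```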
I know, security, but what's the threat model? Let's unwrap this a little more before giving it a wontfix stamp. There are two things I can think of:
Now what we want to do here is a pretty bitswap-specific thing: we want to send/receive blocks that are larger than MessageSizeMax. Everything else in IPFS works fine with blocks larger than that, so ideally a solution would not spill over into other parts of the stack.

Having slept over it, I think Bitswap could also transparently do a mix of options 3 and 4. It could fragment large blocks into smaller children and a parent on the fly, and reassemble the original block once parent and children have been received and validated. This has the slight disadvantage of ~double disk space usage, unless we teach the blockstore how to concatenate multiple blocks into one. (Which feels okay to me.)

This fragmentation is basically post-hoc chunking, and given that the fragments are valid blocks, we can fetch them from multiple nodes in parallel. It's block fragmentation using IPFS and IPLD as building blocks. The only thing we'd add to the data model is a new IPLD format which has different rules for hashing.

A thing that feels weird but is probably okay is that these child blocks can still be bogus data. This is true, but we still make sure they are valid blocks. Even with non-fragmented blocks, you can always receive bogus data, and bitswap rightfully doesn't care as long as it's a valid block. A valid block means we can store and garbage-collect it.
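To make that concrete, here is a minimal, self-contained sketch of the fragmentation/reassembly idea, with SHA-256 fragment hashes standing in for real CIDs and an assumed fragment size; it is not the actual bitswap wire format or a proposed IPLD encoding:

```go
package main

import (
	"bytes"
	"crypto/sha1"
	"crypto/sha256"
	"fmt"
)

const fragmentSize = 1 << 20 // assumed: anything safely below MessageSizeMax

// fragment splits an oversized block; the "parent" is just the ordered list of
// fragment hashes, which is roughly what a new IPLD format could encode.
func fragment(block []byte) (parent [][32]byte, fragments [][]byte) {
	for off := 0; off < len(block); off += fragmentSize {
		end := off + fragmentSize
		if end > len(block) {
			end = len(block)
		}
		frag := block[off:end]
		fragments = append(fragments, frag)
		parent = append(parent, sha256.Sum256(frag))
	}
	return parent, fragments
}

// reassemble validates each fragment against the parent, concatenates them,
// and finally re-checks the original content hash (here: git's SHA-1).
func reassemble(parent [][32]byte, fragments [][]byte, wantSHA1 [20]byte) ([]byte, error) {
	var buf bytes.Buffer
	for i, frag := range fragments {
		if sha256.Sum256(frag) != parent[i] {
			return nil, fmt.Errorf("fragment %d does not match parent link", i)
		}
		buf.Write(frag)
	}
	if sha1.Sum(buf.Bytes()) != wantSHA1 {
		return nil, fmt.Errorf("reassembled block does not match original hash")
	}
	return buf.Bytes(), nil
}

func main() {
	block := bytes.Repeat([]byte("x"), 5<<20) // pretend this is a >4 MiB git blob
	parent, frags := fragment(block)
	got, err := reassemble(parent, frags, sha1.Sum(block))
	fmt.Println(len(got), err)
}
```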
Doing some kind of web-of-trust there is a huge can of worms to open. This is git; it should be as simple as possible for others to build on top of. Just data structures. The UX with this solution will suffer badly, since solving this on the application layer also means it doesn't get solved in other applications, let alone other layers. I want to fetch large git blobs ("large" is an exaggeration really) through the gateway/CLI/API, link them into other data structures, transfer them, pin them, etc., all while retaining the original hash. I think this problem is worthy of a generalized solution - the exact same problem exists for ipld-torrent, and I'm sure we'll also see blockchain transactions bigger than 2 MiB. And we haven't even integrated with that many content-addressed systems yet.
The issue here is that I could ask some peer for a file that matches hash X, and that peer could keep on sending me data while claiming that the file will match the hash eventually. That's where all the discussion on the other thread concerning "trusted peers" (or even snarks) comes in. Even if I write the data directly to disk (chunked or not), I still need a way to verify it before I fill up my disk. Please read at least some of the discussion on the thread I linked.
Can't this issue be solved by including the size of the hashed thing in the CID as an additional parameter? This would be used as an upper bound on the amount of data that needs to be downloaded to verify that it's the correct data. A user could then configure the maximum block size, or the CLI could emit a warning if the size is over a certain threshold.
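As a rough illustration of that suggestion, here is a sketch of how an advertised size might be checked before fetching; the knobs are hypothetical, not existing go-ipfs configuration:

```go
package main

import (
	"fmt"
	"log"
)

// checkAdvertisedSize bounds how much data we are willing to download before
// verification, based on a size carried alongside (or inside) the CID.
// maxBlockSize and warnThreshold are assumed, user-configurable values.
func checkAdvertisedSize(advertised, maxBlockSize, warnThreshold uint64) error {
	if advertised > maxBlockSize {
		return fmt.Errorf("object advertises %d bytes, above the configured maximum of %d", advertised, maxBlockSize)
	}
	if advertised > warnThreshold {
		log.Printf("warning: object advertises %d bytes, above the %d byte threshold", advertised, warnThreshold)
	}
	return nil
}

func main() {
	// A 6 MiB blob: allowed, but above the 4 MiB warning threshold.
	fmt.Println(checkAdvertisedSize(6<<20, 1<<30, 4<<20))
}
```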
It could be. I'm hesitant to do that as we'd need to bump the CID version, again, but that is a decent solution.
Git objects don't store object sizes with references, so for internal nodes there wouldn't be any good source of that information.
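For context, a git tree entry only encodes a mode, a name, and the raw 20-byte SHA-1 of the referenced object, so there is no size to copy into a CID. A minimal sketch of that encoding:

```go
package main

import "fmt"

// treeEntry encodes one git tree entry: "<mode> <name>\0" followed by the raw
// 20-byte object id. Note that no length field appears anywhere in the entry.
func treeEntry(mode, name string, oid [20]byte) []byte {
	entry := []byte(mode + " " + name)
	entry = append(entry, 0)
	return append(entry, oid[:]...)
}

func main() {
	var oid [20]byte // placeholder object id
	fmt.Printf("% x\n", treeEntry("100644", "README.md", oid))
}
```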
How about keeping the default upper bound and allowing a different upper bound to be passed as an option in the API? That would be option 2 of the initial suggestions.
I am interested in this too. I like taking advantage of the streaming nature of SHA-1. Say we prefix each chunk with the hash state of its predecessor (i.e. the hash state prior to hashing the current block). This is like saying a large chunk hashed with a streaming hash function can be viewed as a Merkle tree with the shape of a singly linked list.

What's the security problem? Yes, the previous hash state provided with the block could be bogus, but that issue already exists with a regular Merkle intermediate node, right? One is always free to create bogus dangling references to get the hash they want, and then solve for that bogus reference, right?

Edit: I guess there is a qualitative difference in solving with a fixed vs. free initial state, even if the attacker has complete freedom over the message. On the other hand, it is quite useful that SHA-1 appends the length at the end of the message stream. That means any attacker needs to commit to a length up front, and cannot waste the target's resources indefinitely.

Edit 2: I elaborated more on this in https://discuss.ipfs.io/t/git-on-ipfs-links-and-references/730/24. Despite the fact that spoofing data and a previous hash state could be much easier, I think it is worth it just for git+sha-1 given this format's ubiquity; we can still ban the general construction for other hash algorithms that are actually secure.
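For illustration, a self-contained sketch of that chain-of-hash-states idea. It relies on the fact that Go's crypto/sha1 digest can marshal and unmarshal its internal state via encoding.BinaryMarshaler/BinaryUnmarshaler (standard library behaviour); the chunk size and the verification flow are assumptions, not an agreed design:

```go
package main

import (
	"bytes"
	"crypto/sha1"
	"encoding"
	"fmt"
)

const chunkSize = 1 << 20 // assumed fragment size, well under MessageSizeMax

// chainStates hashes data in chunks and records the marshaled SHA-1 state
// *before* each chunk, i.e. the "previous hash state" that would be shipped
// alongside each fragment.
func chainStates(data []byte) (states [][]byte, chunks [][]byte, final [20]byte) {
	h := sha1.New()
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		st, _ := h.(encoding.BinaryMarshaler).MarshalBinary()
		states = append(states, st)
		chunks = append(chunks, data[off:end])
		h.Write(data[off:end])
	}
	copy(final[:], h.Sum(nil))
	return
}

// verifyChunk checks one fragment in isolation: resume from the claimed prior
// state, absorb the fragment, and compare against the claimed next state (or
// against the final digest for the last fragment).
func verifyChunk(prevState, chunk, nextState, final []byte) bool {
	h := sha1.New()
	if err := h.(encoding.BinaryUnmarshaler).UnmarshalBinary(prevState); err != nil {
		return false
	}
	h.Write(chunk)
	if nextState != nil {
		st, _ := h.(encoding.BinaryMarshaler).MarshalBinary()
		return bytes.Equal(st, nextState)
	}
	return bytes.Equal(h.Sum(nil), final)
}

func main() {
	blob := bytes.Repeat([]byte("a"), 5<<20)
	states, chunks, final := chainStates(blob)
	last := len(chunks) - 1
	fmt.Println(verifyChunk(states[0], chunks[0], states[1], nil))
	fmt.Println(verifyChunk(states[last], chunks[last], nil, final[:]))
}
```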
See protocol/beyond-bitswap#29 and protocol/beyond-bitswap#30 for some progress designing fixes to this.
`go-libp2p-net.MessageSizeMax` puts an upper limit of ~4 MiB on the size of messages on a libp2p protocol stream: https://github.com/libp2p/go-libp2p-net/blob/70a8d93f2d8c33b5c1a5f6cc4d2aea21663a264c/interface.go#L20

That means Bitswap will refuse to transfer blocks that are bigger than `(1 << 22) - bitswapMsgHeaderLength`, while locally these blocks are usable just fine. In unixfs that limit is fine, because we apply chunking. In ipld-git however, we can't apply chunking because we must retain the object's original hash. It's quite common to have files larger than 4 MiB in a Git repository, so we should come up with a way forward pretty soon.

Here are three options:
Related issues: ipfs/kubo#4473 ipfs/kubo#3155 (and slightly less related ipfs/kubo#4280 ipfs/kubo#4378)