RFC|BB|L2-09/10 Handling arbitrarily large blocks and Treating files as large blocks #29

RFC/rfcBBL209/README.md (new file, +104 lines)

# RFC|BB|L2-09: Handle Arbitrary Block Sizes

* Status: `Brainstorm`

## Abstract

This RFC proposes adding a new type of data exchange to Bitswap for handling blocks arbitrarily larger than the 1MiB limit, by exploiting the fact that common hash functions can be paused partway through hashing a large object and resumed later.

## Shortcomings

Bitswap has a maximum block size of 1MiB, which means it cannot transfer all forms of content-addressed data. A prominent example is Git repositories: even though a repo can be represented as a content-addressed IPLD graph, it cannot necessarily be transferred over Bitswap if any object in the repo exceeds 1MiB.

## Description

The major hash functions work by taking some data `D`, chunking it into `n` pieces `P_0...P_n-1`, and then folding each piece in turn into an internal state `S`. This means there are points in the computation where we can pause processing and capture the state of the hash function so far. Bitswap can use this state to effectively break up large blocks into smaller ones.
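
A minimal sketch of this pause/resume behaviour (an illustration, not part of the RFC text), assuming Go's standard `crypto/sha256`, whose digest implements `encoding.BinaryMarshaler`/`BinaryUnmarshaler`:

```golang
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding"
	"fmt"
)

func main() {
	data := []byte("pretend this is a very large block")
	split := 10 // an arbitrary byte offset, deliberately not a multiple of 64

	// Hash the first part, then "pause" by exporting the internal state S.
	// The marshaled state carries any buffered partial 64-byte chunk, so the
	// split point does not have to fall on a hash-chunk boundary.
	h1 := sha256.New()
	h1.Write(data[:split])
	state, _ := h1.(encoding.BinaryMarshaler).MarshalBinary()

	// Later (possibly on another peer) "resume" from S and feed the rest.
	h2 := sha256.New()
	_ = h2.(encoding.BinaryUnmarshaler).UnmarshalBinary(state)
	h2.Write(data[split:])

	want := sha256.Sum256(data)
	fmt.Println(bytes.Equal(h2.Sum(nil), want[:])) // true: same result as one-shot hashing
}
```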

Note that breaking blocks on hash function chunk boundaries means breaking rabin (or any other content based) chunking, which is essential for efficient content deduplication. For this to support arbitrary content boundaries for IPFS blocks, the hash state will need to include more than just the accumulated hash value; it will also need the trailing partial hash chunk data.

Collaborator Author

the hash state will need to include more than just the accumulated hash value; it will also need the trailing partial hash chunk data.

Can you elaborate more on what you were thinking here? What do you mean by accumulated hash value (my guess is the internal State) and the trailing partial hash (not sure what this means).

I'm thinking there are three different chunkers involved here:

  1. The SHA-256 hash chunker which uses 64 byte boundaries
  2. The exchange layer (i.e. Bitswap) chunker which emits the intermediate state S along the way, which has the restriction of every state being on a 64 byte boundary (e.g. we could use a fixed 256KiB, or a more complex function as long as it always ended on a 64B boundary)
  3. The data storage layer where the data could be stored in any chunking fashion we want although we should store the exchange layer mappings for reuse when people download from us in the future

IIUC the thing you're getting at here is that it's unfortunate that, even if Rabin would have saved us bandwidth, it ends up only saving us disk storage because of the restrictions at the exchange layer.


I think there's potentially a way around this (which has tradeoffs) by allowing more data into the manifest. For example, when B responds to A giving them a manifest of blocks to download in addition to giving them a list of the intermediate States B could also send more information about the data corresponding to those states.

For example, B could send (State, large block start index, large block end index, []ConsecutiveSubBlock) where ConsecutiveSubBlock = (SubBlockMultiHash, subblock start index, subblock end index) and in this way A could decide whether to get the blocks corresponding to the State transition by either asking about bytes in the full block or by asking for the ConsecutiveSubBlocks and then using some of the bytes. This would allow B to tell A about a deduplicating chunking scheme they could use, but A 1) wouldn't be obligated to use it when downloading data 2) wouldn't be obligated to use it in their datastore.

WDYT?
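
A rough sketch of the manifest entry described above, assuming it would be carried in a new (hypothetical) Bitswap message; all type and field names here are illustrative:

```golang
// ConsecutiveSubBlock mirrors (SubBlockMultiHash, subblock start index, subblock end index).
type ConsecutiveSubBlock struct {
	SubBlockMultihash []byte // multihash of one of B's data-layer chunks (e.g. rabin)
	Start, End        uint64 // byte range of this sub-block within the large block
}

// ManifestEntry mirrors (State, large block start index, large block end index, []ConsecutiveSubBlock).
type ManifestEntry struct {
	State      []byte                // serialized intermediate hash state S_i
	BlockStart uint64                // start of the bytes covered by this state transition
	BlockEnd   uint64                // end of the bytes covered by this state transition
	SubBlocks  []ConsecutiveSubBlock // optional hint describing B's local chunking of the same bytes
}
```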


the hash state will need to include more than just the accumulated hash value; it will also need the trailing partial hash chunk data.

Can you elaborate more on what you were thinking here? What do you mean by accumulated hash value (my guess is the internal State) and the trailing partial hash (not sure what this means).

If you chunk on arbitrary byte boundaries, then the full "state" needed to resume hashing needs to include the tail data past the last 64B boundary that has not yet been included into the hash-chunker. It means the full state needed is a little larger.

I'm thinking there are three different chunkers involved here:

  1. The SHA-256 hash chunker which uses 64 byte boundaries

Note this can vary depending on the hash function used. I think some have >64 byte hash-chunkers.

  1. The exchange layer (i.e. Bitswap) chunker which emits the intermediate state S along the way, which has the restriction of every state being on a 64 byte boundary (e.g. we could use a fixed 256KiB, or a more complex function as long as it always ended on a 64B boundary)

This is the blocker; rabin chunkers must chunk on arbitrary byte boundaries and have variable sized chunks to work properly. The classic example is the large file modified by adding a single byte at the front; despite everything except the first byte being duplicated, chunkers that can't re-align on the 1-byte offset will fail to de-duplicate anything. Any chunker constrained to 64 byte boundaries will fail to find duplicates after any insert/delete that is not a nice multiple of 64 bytes.

  1. The data storage layer where the data could be stored in any chunking fashion we want although we should store the exchange layer mappings for reuse when people download from us in the future

This bit is internal client level and doesn't really matter from an API point of view. The important bit is 2.

IIUC the thing that you're getting at here is that it's unfortunate that if Rabin would've saved us bandwidth here that it's only saving us disk storage because of the restrictions at the exchange layer

Rabin can't be used at all if the chunks are constrained to 64 Byte boundaries. Without using some kind of content-based chunker like rabin, you get no de-duplication at all between slightly mutated data within files unless the data mutations are constrained to multiples of your block size; so no shifting of data except by multiples of the block size.

I think there's potentially a way around this (which has tradeoffs) by allowing more data into the manifest. For example, when B responds to A giving them a manifest of blocks to download in addition to giving them a list of the intermediate States B could also send more information about the data corresponding to those states.

For example, B could send (State, large block start index, large block end index, []ConsecutiveSubBlock) where ConsecutiveSubBlock = (SubBlockMultiHash, subblock start index, subblock end index) and in this way A could decide whether to get the blocks corresponding to the State transition by either asking about bytes in the full block or by asking for the ConsecutiveSubBlocks and then using some of the bytes. This would allow B to tell A about a deduplicating chunking scheme they could use, but A 1) wouldn't be obligated to use it when downloading data 2) wouldn't be obligated to use it in their datastore.

I don't think I fully understand this. Are you effectively creating virtual-large-blocks for each state transition out of a list of ranged sub-blocks in your manifest? And this is so that the sub-blocks can be created using a de-duplicating chunker, while the virtual-large-blocks can be 64byte aligned?

If yes, this sounds very similar to the "virtual DAG" idea I was proposing you could do using CID-aliases in the DHT. Note that you would not need to have a list of ConsecutiveSubBlocks, you could just have a CID+offset+length where the CID could be a reference to the merkle-dag node that contains the whole range. You can walk the DAG to find the leaf nodes needed from that.


If you want an overview of how chunking/encoding/etc. affect data deduplication, I think I've summarized everything about it here:

https://discuss.ipfs.io/t/draft-common-bytes-standard-for-data-deduplication/6813/10?u=dbaarda

Collaborator Author

Sorry this is a little long, especially since I think we're basically on the same page but figured it'd help to be more explicit.

If you chunk on arbitrary byte boundaries, then the full "state" needed to resume hashing needs to include the tail data past the last 64B boundary that has not yet been included into the hash-chunker. It means the full state needed is a little larger.
...
I don't think I fully understand this.

So I think what we're describing is just two ways of getting at the same idea. We're in agreement that the exchange layer chunker, which is verifiable, cannot just send over data-layer chunker information (e.g. rabin) since they won't line up on the same boundaries.

Overall we need something such that a client who has State_i can receive (State_i-1, bytes) and verifiably check that the given state transition is valid.

IIUC the difference is:

Your approach

I was having some difficulty describing what you were getting at here, but I think you were leaning towards:

Define transitions from (State_i-1, bytes) -> State_i as (SHA2_State_with_extra_data_i-1, CIDs for data layer chunking blocks (e.g. rabin)) to SHA2_State_with_extra_data_i where the extra data might include some bytes left over in between rounds of SHA2 chunking.

My approach

Define transitions from (State_i-1, bytes) -> State_i as (SHA2_State_i-1, []WayOfDescribingBytes) -> SHA2_State_i, where I've listed two implementations of WayOfDescribingBytes below.

Below we have three peers: Alice, the client; Bob, the server that sends the manifest and/or data; and Charlie, another server that can send a manifest and/or data.

  1. Byte offsets within the large block itself (works, but not great for deduplication)
  2. Multihashes (since the IPLD codecs aren't required) of the blocks Bob has used locally to chunk the data (e.g. rabin) along with the start offset of the first block and end offset of the last block. For example, [(MH1, start at byte 55), MH2, (MH3, end at byte 12)]. Note: In one of your other comments I think you allude to allowing for graphs here instead of just a list of blocks, IMO that seems like an excessive level of indirection since we already know what we want.

Both of these approaches describe a set of bytes and have their own advantages/disadvantages:

  1. Advantage: Works even if I ask Charlie for parts of the large block and Charlie has used a different chunker than Bob (e.g. buzzhash or fixed size). Disadvantage: Wastes bandwidth if Charlie and Bob used the same chunker, or if Alice had previously downloaded a large block (e.g. a file) that uses the same data chunks as Bob's chunking of the data.
  2. Advantage: If people are using the same chunker I get deduplication benefits. Disadvantage: Fails completely if someone has the same large block, but chunked up differently.

Are you effectively creating virtual-large-blocks for each state transition out of a list of ranged sub-blocks in your manifest?

Yes, although the "large" in this case isn't very big since the sum of the sizes should be less than the recommended Bitswap block size (i.e. 1MiB). We're just accounting for the case where Rabin chunking gives us 10 chunks that contain that 1MiB of data.

This isn't perfect: for example, if the data is a 1TiB file that a rabin chunker can express in only 1MiB of unique chunks, we'll still send metadata corresponding to a 1TiB file, since we want to account for someone using a different chunker. However, as the manifests should be very small compared to the data, and most real data likely won't have such extreme differences between chunked and non-chunked size, I suspect this is ok.

And this is so that the sub-blocks can be created using a de-duplicating chunker, while the virtual-large-blocks can be 64byte aligned?

Yes, and also handle the case where people have used different chunkers for the same data.

If you want a overview of how chunking/encoding/etc affect data deduplication

That's a great post 👍


This is a long comment thread, but I just want to add that we get common-prefix deduplication of stuff with the linear hash for free.


### Example: Merkle–Damgård constructions like SHA-1 or SHA-2

MD pseudo-code looks roughly like:

```golang
func Hash(D []byte) []byte {
	pieces := getChunks(D) // split D into the fixed-size pieces P_0...P_n-1

	var S state
	for _, p := range pieces {
		S = process(S, p) // call the state after piece i "S_i"
	}

	return finalize(S) // call this H, the final hash
}
```

From the above we can see that:

1. At any point in the process of hashing D we could stop, say after piece `j`, save the state `S_j` and then resume later
2. We can always calculate the final hash `H` given only `S_j` and all the pieces `P_j+1..P_n-1`

The implication for Bitswap is that if each piece is no larger than 1MiB then we can send the file **backwards** in 1MiB increments. In particular, a server can send `(S_n-2, P_n-1)` and the client can use that to verify that `P_n-1` is in fact the last part of the data associated with the final hash `H`. The server can then send `(S_n-3, P_n-2)` and the client can verify that `P_n-2` is the last block of `S_n-2` and therefore also the second to last block of `H`, and so on.
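
A hedged sketch of the client-side check just described; the helper name and the use of Go's serializable `crypto/sha256` state are assumptions for illustration, not something the RFC specifies:

```golang
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding"
	"fmt"
)

// verifyLastPiece restores the claimed intermediate state S_n-2, appends the
// received final piece P_n-1, and checks that the result finalizes to the
// trusted full-block hash H. Earlier pieces would be checked analogously,
// except against the (already verified) state from the following step rather
// than against H.
func verifyLastPiece(claimedState, lastPiece, trustedH []byte) bool {
	h := sha256.New()
	if err := h.(encoding.BinaryUnmarshaler).UnmarshalBinary(claimedState); err != nil {
		return false
	}
	h.Write(lastPiece)
	return bytes.Equal(h.Sum(nil), trustedH)
}

func main() {
	block := bytes.Repeat([]byte("x"), 300*1024) // stand-in for a block larger than the limit
	cut := len(block) - 128*1024                 // start of the final piece P_n-1

	// Server side: hash everything before the final piece and export S_n-2.
	srv := sha256.New()
	srv.Write(block[:cut])
	state, _ := srv.(encoding.BinaryMarshaler).MarshalBinary()

	// Client side: it only trusts H, the hash it originally asked for.
	h := sha256.Sum256(block)
	fmt.Println(verifyLastPiece(state, block[cut:], h[:])) // true
}
```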

#### Extension

This scheme requires linearly downloading a file, which is quite slow with even modest latencies. However, utilizing a scheme like [RFC|BB|L2 - Speed up Bitswap with GraphSync CID only query](https://github.com/protocol/beyond-bitswap/issues/25) (i.e. downloading metadata manifests up front) we can make this fast and parallelizable.

#### Security

In order for this scheme to be secure it must be true that only a single pair `(S_i-1, P_i)` can be produced to match a given `S_i`. If the pair were constrained to the form `(S_i-1, P_malicious)` (i.e. only the piece can vary) then this is certainly true, since otherwise one could create a collision on the overall hash function. However, given that there are two parameters to vary, it seems possible this could be computationally easier than finding a collision on the overall hash function.

Collaborator Author

I'd like some 👀 on this ideally from people who are more practiced with this type of cryptanalysis than I am.


I'm pretty sure that finding any kind of (S_i-1, P_i) pair that matches another (S_i-1, P_other) is at least as hard to crack as finding any arbitrary matching (S_foo, P_foo) (S_bar, P_bar) pair, which is the same as finding a hash collision between any two blocks. It is a birthday-attack, but secure hash functions have enough bits to be safe from birthday attacks.

Collaborator Author
@aschmahmann aschmahmann Jan 12, 2021

After some poking around (thanks @Stebalien for pointers) it seems that as long as the underlying compression function is not subject to freestart collisions then we should be fine and if not then things become trickier.

My understanding from this paper and its corresponding thesis is that SHA-256 is not yet subject to freestart collisions.


Even if we were subject to freestart collisions things may not necessarily be so bad, since the attacker would also need to be the creator of the file and would only be able to selectively give some people the data; the other people would not end up with different data, but would instead just waste some bandwidth and other resources, which on its face doesn't seem like a super worthwhile attack.

If so then what we're really trying to avoid here is approximately a pseudo-second-preimage attack on the compressor function (close to Definition 7 here). My understanding is that this would be even harder for an attacker to pull off and might even be reasonably safe for functions like SHA-1 which are no longer collision resistant (although pseudo-preimage attacks may of course be easier to pull off than full preimage attacks).


@dbaarda thanks for the feedback, it does seem like this is probably ok. However, I do think it's a little more subtle than "there are no collisions on SHA-2" implying there are no issues in this scheme.

(S_foo, P_foo) (S_bar, P_bar) pair, which is the same as finding a hash collision between any two blocks

My understanding is that this indicates a collision on the compressor function, but not on the overall hash function, since a hash collision means that, given some starting state IV, H(IV, P_good) = H(IV, P_bad) (Definition 1 in the paper linked above); unless you can chain S_foo and S_bar back to some common state IV there isn't a full hash collision.


Ah, I did not know "freestart collisions" had a name; I'm glad it does!


#### SHA-3

While SHA-3 is not a Merkle–Damgård construction, it follows the same pseudocode structure as above.

### Example: Tree constructions like Blake3, Kangaroo-Twelve, or ParallelHash

In tree constructions we are not restricted to downloading the file backwards and can instead download the parts of the file that we are looking for, which includes downloading the file forwards for sequential streaming.

There are details about how to do this for Blake3 in the [Blake3 paper](https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf), section 6.4, Verified Streaming.

### Implementation Plan

#### Bitswap changes

* When a server responds to a request for a block, if the block is too large it instead sends a traversal-order list of the block as defined by the particular hash function used (e.g. linear and backwards for SHA-1/2/3)
* Large Manifests
  * If the list is more than 1MiB long then only send the first 1MiB along with an indicator that the manifest is not complete
  * When the client is ready to process more of the manifest then it can send a request WANT_LARGE_BLOCK_MANIFEST containing the multihash of the entire large block and the last hash in the manifest

How about: Instead of special-casing the manifest file (and having to deal with large vs small manifests), recursively treat the manifest as a downloadable artifact:

If the manifest is small (<1MB), send the whole manifest in the response, otherwise send the manifest of the manifest.

* When requesting subblocks, send requests as `(full block multihash, start index, end index)`; rough message shapes are sketched after this list
* Process subblock responses separately from full block responses, verifying the results as they come in
* As in [RFC|BB|L2 - Speed up Bitswap with GraphSync CID only query](https://github.com/protocol/beyond-bitswap/issues/25) specify how much trust goes into a given manifest, examples include:
  * download at most 20 unverified blocks at a time from a given manifest
  * grow trust geometrically (e.g. 10 blocks, then if those are good 20, 40, ...)
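
Hypothetical Go shapes for the requests mentioned above (WANT_LARGE_BLOCK_MANIFEST is named in the bullets; the fields are illustrative, not an existing Bitswap message format):

```golang
// WantLargeBlockManifest asks for the continuation of a manifest that was
// truncated at 1MiB.
type WantLargeBlockManifest struct {
	LargeBlockMultihash []byte // multihash of the entire large block
	LastManifestHash    []byte // last hash the client already received from the manifest
}

// WantSubBlock asks for a byte range of a large block, keyed by the full
// block's multihash.
type WantSubBlock struct {
	LargeBlockMultihash []byte // full block multihash
	Start, End          uint64 // requested byte range within the large block
}
```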

#### Datastore

* Servers should cache/store a particular chunking for the traversal that is defined by the implementation for the particular hash function (e.g. 256 KiB segments for SHA-2)
* Once clients receive the full block they should process it and store the chunking, reusing the work from validating the block
* Clients and servers should have a way of aliasing large blocks as a concatenated set of smaller blocks (a rough sketch follows this list)
* Need to quarantine subblocks until the full block is verified as in [RFC|BB|L2 - Speed up Bitswap with GraphSync CID only query](https://github.com/protocol/beyond-bitswap/issues/25)
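
A minimal sketch of such an aliasing record (names are illustrative, assuming a multihash-keyed datastore):

```golang
// LargeBlockAlias records that the block with LargeBlockMultihash is the
// concatenation, in order, of smaller blocks already in the datastore.
type LargeBlockAlias struct {
	LargeBlockMultihash []byte
	SegmentMultihashes  [][]byte // multihashes of the smaller blocks, in concatenation order
}
```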

#### Hash function support

* Add support for SHA-1/2 (should be very close to the same)


SHA-1 is deprecated / not recommended at this point. It seems unclear whether it's valuable or safe to support it. Why do we want to?

Member

"git"

Collaborator Author

Mostly for Git support; however, with Git eventually moving to SHA-2, if it turned out SHA-1 was unworkable we could probably deal with it.


I would strongly prefer that support for sha-1 here be opt-in rather than on by default. See https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html which I propose adding in multiformats/multicodec#203 for why.


By the time this is usable, SHA-256 support in Git will likely be stabilized anyway, given it's already implemented AFAICT, so I don't see the point in making it opt-out.

* Make it possible for people to register new hash functions locally, but some should be built into the protocol

## Evaluation Plan

* IPFS file transfer benchmarks as in [RFC|BB|L2 - Speed up Bitswap with GraphSync CID only query](https://github.com/protocol/beyond-bitswap/issues/25)

## Prior Work

* This proposal is almost identical to the one @Stebalien proposed [here](https://discuss.ipfs.io/t/git-on-ipfs-links-and-references/730/6)
* Utilizes overlapping principles with [RFC|BB|L2 - Speed up Bitswap with GraphSync CID only query](https://github.com/protocol/beyond-bitswap/issues/25)

### Alternatives

An alternative way to deal with this problem would be a succinct and efficient cryptographic proof that shows the equivalence of two different DAG structures under some constraints. For example, showing that a single large block with a SHA-2 hash is equivalent to a tree whose concatenated leaf nodes give the single large block.

### References

This was largely taken from [this draft](https://hackmd.io/@adin/sha256-dag)

## Results

## Future Work

RFC/rfcBBL210/README.md (new file, +48 lines)

# RFC|BB|L2-10: UnixFS files identified using hash of the full content

Collaborator Author

cc @ribasushi I feel like you may have some thoughts on this 😄


* Status: `Brainstorm`

## Abstract

This RFC proposes that for UnixFS files we allow downloading data using a CID corresponding to the hash of the entire file, instead of just the CID of a particular UnixFS DAG (which depends on tree width, chunking, internal node hash function, etc.).

Note: This is really more about IPFS than Bitswap, but it's close by and dependent on another RFC.

## Shortcomings

There exists a large quantity of content on the internet that is already content addressable and yet not downloadable via IPFS and Bitswap. For example, many binaries, videos, archives, etc. that are distributed today have their SHA-256 listed alongside them so that users can run `sha256sum file` and compare the output with what they were expecting. When these files are added to IPFS they can be added as either: a) an application-specific DAG format for files (such as UnixFSv1), identified by a DAG root CID which is different from a CID of the multihash of the file data itself, or b) a single large raw block, which cannot be processed by Bitswap.

Additionally, for users using application-specific DAGs with some degree of flexibility (e.g. UnixFS, where there are multiple chunking strategies), two users who import the same data could end up with different CIDs for that data.

## Description

Utilizing the results of [RFCBBL209](../rfcBBL209/README.md) we can download arbitrarily sized raw blocks. We allow UnixFS files that have raw leaves to be stored internally as they are now but also aliased as a single virtual block.
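
As a hedged illustration (not existing go-ipfs behaviour), the whole-file alias CID could be derived as the raw codec plus the SHA-256 multihash of the full content, sitting alongside the usual UnixFS root CID; the go-cid/go-multihash calls below are used as I understand them:

```golang
package main

import (
	"fmt"
	"os"

	"github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

// wholeFileCID computes the CID a user could look up knowing only the file's
// SHA-256, independent of any UnixFS chunking choices.
func wholeFileCID(path string) (cid.Cid, error) {
	data, err := os.ReadFile(path) // a real implementation would stream rather than slurp
	if err != nil {
		return cid.Undef, err
	}
	sum, err := mh.Sum(data, mh.SHA2_256, -1) // multihash of the *entire* file content
	if err != nil {
		return cid.Undef, err
	}
	// cid.Raw marks "just bytes": no IPLD/UnixFS structure is implied by this CID.
	return cid.NewCidV1(cid.Raw, sum), nil
}

func main() {
	c, err := wholeFileCID("example.bin")
	fmt.Println(c, err)
}
```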

## Implementation plan

* Implement [RFCBBL209](../rfcBBL209/README.md)
* Add an option when doing `ipfs add` that creates a second aliased block in a segregated blockstore
* Add the second blockstore to the provider queue

## Impact

This scheme allows a given version of IPFS to have a canonical hash for files (e.g. the SHA-256 of the file data itself), which allows for independent chunking schemes and, by supporting the advertising/referencing of one or more common file hash schemes, allows people to find some hash on a random website and check whether it's discoverable in IPFS.

There are also some larger ecosystem wide impacts to consider here, including:

1. There's a lot of confusion around UnixFS CIDs not being derivable from the SHA-256 of a file; this approach may either tremendously help or cause even more confusion (especially as we move people from UnixFS to IPLD). See an example [thread](https://discuss.ipfs.io/t/cid-concept-is-broken/9733) about this
2. Storage overhead for multiple "views" on the same data and extra checking + advertising of the data
3. Are there any deduplication use-case issues we could run into here if users download data chunked not as the data creator chunked it, but based on how they want to chunk it (or, more likely, the default chunker)?

Decent deduplication requires using a consistent hash and chunking algorithm, and to deduplicate arbitrarily aligned data, the chunking algorithm must be content based with variably sized blocks of any byte-length between a min and max value. Every different way of chunking/hashing the data (and creating the merkle-tree) results in another copy of the data that cannot be deduplicated.

To do deduplication efficiently you want IPFS to have a fixed chunking/hashing algorithm and merkle-tree under the hood, and then support alternative "views" on top of this, that can present the underlying data as if it was chunked/hashed differently. I don't know how much demand there is for alternative tree views, but certainly a common use-case is the "one big file with this hash" view. This could be implemented as just an alternative DHT entry, similar to an IPNS entry, that is keyed by the whole-file hash and points to a list of CIDs (each different hash/chunker/tree option results in a different CID; ideally there is only one) for that file. These could be signed by the node that did the upload for verification purposes, but you would still need to download the whole file to verify the whole-file hash.

I don't know how much demand there is for alternative tree views of the data, but this could be implemented using an alternative merkle tree containing the desired hash-type for each node, and where the "raw" leaves are actually ranged references into the underlying native merkle tree nodes. I'm not sure exactly how validation of these alternative-view merkle nodes would work, but you would probably have to download the data-segment (by downloading the underlying merkle-tree-fragment) for each node to validate the hash. There might be ways to include signatures by the uploading peer-node, but you probably want to do this in a way that the same alternative view uploaded by different peers can share the same data. Perhaps an IPNS entry pointing at the alternative-view root merkle node is the best way that peers can sign that they've uploaded/verified that alternative view.


I think @dbaarda has a great idea: mapping a full-file hash (SHA-256 for example) to a CID right at the DHT layer seems like a clean way to add this functionality with no redesign of anything existing (other than a new DHT type), and just a small amount of new code.

Also it means any existing data already stored on IPFS doesn't need to be re-stored (re-added); anyone at any time could put its canonical hash (SHA-256) in the DHT and immediately it would be findable by everyone else.


If you extend this idea a tiny bit and make the alias value in the DHT a CID+range (offset + length) then you can add aliases to any piece of arbitrary data regardless of the underlying chunking/merkle tree structure. This would allow you to e.g. add a sha256 alias to a file that had been added inside an uncompressed tar file.


One of the tricks that @aschmahmann is trying to balance in this proposal, as I understand it, is being able to take a legacy hash, like the sha256 of a large file, and have some confidence that you're getting 'correct' data while you're downloading it.

If the DHT just holds a mapping to a CID, you don't know until you fully retrieve the file that it will hash to the original sha256 value you were interested in.


At least you only have to trust the peer that gave you the CID and then you can still pull data from all the other untrusted peers in the normal trustless way, where they can only fool you "1MB at a time (bitswap)" so to speak. Also if you get a DHT answer from at least 3 peers, you can go with whatever the consensus is about what the correct CID is, before you try it, but I'm not sure if DHT is designed to try to get 3 or more answers.


  1. IMO verifiable data is really important here, DHT's (and open p2p networks in general) are Sybil attackable/spammable so working with unverified data is tough and needs to be really worth the associated problems

Agree, but maybe that's the price you pay for "larger blocks"; you can't validate them until you've downloaded them, just like small blocks.

Right now people can't have content-addressable "blocks" of data larger than 2M (with 1M recommended) at all, unless they're OK with the key/hash not being of the raw data but of a merkle-tree node, where the hash depends on the chunking algorithm chosen. People might want to build applications on top of IPFS with larger blocks, and this would facilitate that.

Adding a simple CID alias to the DHT suddenly means you can have blocks of any size keyed by the block's content hash. Under the hood IPFS is chunking things into pieces optimized for its deduping/network/storage/etc. requirements, but you now optionally have an abstract "block" on top of that with fewer restrictions.

  1. Almost anything you can do with a custom DHT record type you can do with provider records + a custom protocol. The advantage of using the DHT is generally that someone can publish and then go offline and the record is still there (for a while, e.g. a day), however, by going the custom protocol route you can have things work even if a client doesn't have a DHT implementation (or it's been turned off)

I would have said that the big advantage of the DHT is you can find things with it. Any solution that doesn't put the hash/key/cid that you want to find the data by in a DHT is not going to be findable, at least not by that key. You need some kind of mapping from the key/hash you have to the CID the data is actually published under.

At least you only have to trust the peer that gave you the CID and then you can still pull data from all the other untrusted peers in the normal trustless way

Yes, that's true, but anyone can just put mappings in, which means you could easily be given a bogus CID (this may make the "best out of 3" approach not doable). To make things worse, this can be used by a malicious actor to attack a third party by getting you to try to download a large file from them, which wastes bandwidth for both of you.

Note this is true with current non-IPFS publishing of ISO images and their hash; you need to download the whole thing before you can verify it against the published hash.

I agree it would be good to have some way to validate that the CID alias for a large block actually does hash to the key it's indexed by, but I haven't really got a good solution. Signing by the publishing peer might help, but I haven't thought it through. Perhaps CID aliases should be published via IPNS to be trusted? Note you don't have to prove that each individual raw block (or maybe block fragment) is a valid part of the whole large block, just that whole data referred to by the CID reference has that hash, since the IPFS fetching of that CID will validate the individual raw blocks are part of that CID as they are fetched.

Note the DHT entry containing a CID alias reference can only be validated by downloading all the data pointed at by that reference, but this doesn't have to be the whole file

I think we're on the same page here, but a CID can be validated using the data corresponding to that block. In normal/current usage blocks are relatively small (<1MiB) and large data collections are established by using IPLD to create content-address-linked DAGs; this proposal is about the fact that there happens to be a pre-existing "format" for files where the data is just a single large block, and it'd be nice to be compatible with that.

What sorts of use cases are you envisioning where I can look up the SHA-256 of a large section of a single large block? How is anyone finding the reference to that subsection of a large block, and why wouldn't they just break the data into digestible pieces and make an IPLD DAG?

It allows you to create a virtual "block" keyed by its hash using any multihash, of any data in IPFS, regardless of how that data is already chunked and signed. This means you can do things like:

  1. Create a whole-file sha-256 CID-alias that points at a CID containing a single uploaded file. This means you can fetch the file using its whole-file sha-256 hash, instead of a hash that varies depending on the chunking algorithm chosen.

  2. Create whole-file sha-256 CID-aliases that point at each file inside a single CID that contains an uncompressed tar file.

  3. Add sha-256 CID-aliases for every node in an existing merkle-tree DAG, so that they can be referenced not only by the hash of the node, but by the sha-256 hash of all the data under that node.

  4. Create an IPLD DAG using a particular chunking and hash algorithm that is actually a "virtual view" of data already uploaded into IPFS with a completely different chunking and hash algorithm. The "leaf nodes" in this virtual-view DAG will be cid-aliases into the already uploaded data, and would not be limited by IPFS's 2M block size. Note all the data in these different DAGs will be fully de-duplicated.

I think 1. would be the main use-case, but once you have that capability people would figure out creative ways to use it. Note 1. and 3. only require CID-aliases that point at an existing CID, but 2. and 4. require CID-aliases that include ranges (offset+length).

These DHT CID alias entries can be used to build an alternative merkle tree "view" structure that has its own blocks/nodes/sums while referring to the same underlying data in the IPFS preferred format

Related to the above, which questions/use cases are you hoping to answer with the DHT CID aliases? I think having some concrete examples will make it easier to talk about this.

DHT ... so it definitely supports multiple answers per key

A bit, but not really: there is support for multiple provider records (i.e. a map from a key to a list of peers), but not for anything else. There could be, but it's not a trivial thing (more info here).

Ah... that's interesting... maybe the mechanisms for adding provider records could be used for adding CIDs to a CID-alias entry?


To clarify, I was making the assumption that any DHT entry that points a SHA-256 to a wrong CID (for whatever reason) would have a digital signature so the peer responsible for the "mistake" could be identified and blamed and then somehow 'down-ranked' (as a potential hostile actor) by the peer that discovered the mistake. Like you guys, I can't see a way to avoid having to trust that a CID given in the DHT is correct, until after the full download is complete.

Worst case scenario is a large number of hackers joining the DHT and publishing thousands of incorrect CIDs for popular SHA-256 files, and so designing to fight that scenario is critical and actually may not even be "doable". If IPFS is currently capable of stopping that kind of attack then definitely we don't want to introduce the first weak link in the chain. It may be the case that true content-addressing (based on a file hash) is simply impractical in a trustless-system, once you consider that it's synonymous with a big attack vector. If this is true (and I can't prove it isn't) then my apologies for even being one of proponents of the idea.

Collaborator Author

This is much longer than I wanted (sorry about that).

Separating this into two sections, one about how the DHT works and the other about whether I think this approach of storing CID mappings in the DHT is something we should do.

DHT Background Clarification

Almost anything you can do with a custom DHT record type you can do with provider records + a custom protocol. The advantage of using the DHT is generally that someone can publish and then go offline and the record is still there (for a while, e.g. a day), however, by going the custom protocol route you can have things work even if a client doesn't have a DHT implementation (or it's been turned off)

I would have said that the big advantage of the DHT is you can find things with it.

You're right I was being sloppy with my wording/explanation here. The point I was trying to make is that if you wanted to put some custom key-value pair (k,v) in the DHT generally you can get around the DHT not supporting your custom pair by doing this:

  1. Make a key k' = Multihash("MyKeyType:" + k) using some common hash function (e.g. SHA2)
  2. Put provider records in the DHT saying that you are a provider of k'
  3. Register a protocol handler that does whatever you want (e.g. returns v given k)
  4. Users who want to find the value for k calculate k', find the providers of k', connect to them using the custom protocol, and ask them for v

This strategy is how IPNS over PubSub works and manages to be faster than IPNS over the DHT even for first time lookups (relevant logic largely lives in go-libp2p-pubsub-router and go-libp2p-discovery).

The thing you lose compared to native DHT record support is the ability to publish, go offline for a few hours, and come back online later. This isn't to say we shouldn't figure out a way to make it easier to upgrade the DHT and support new record types, just that this other path works.
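
A hedged sketch of the k'-derivation step from the list above, using go-multihash/go-cid as I understand them (the "MyKeyType:" prefix is the placeholder from the comment, not a real protocol string):

```golang
package main

import (
	"fmt"

	"github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

// deriveProviderKey turns an application-level key k into the content key k'
// that is actually advertised in the DHT via provider records.
func deriveProviderKey(k []byte) (cid.Cid, error) {
	sum, err := mh.Sum(append([]byte("MyKeyType:"), k...), mh.SHA2_256, -1)
	if err != nil {
		return cid.Undef, err
	}
	// Peers that can answer "what is v for k?" advertise themselves as providers of this CID.
	return cid.NewCidV1(cid.Raw, sum), nil
}

func main() {
	c, _ := deriveProviderKey([]byte("some-application-key"))
	fmt.Println(c) // look up providers of c, then ask them for v over a custom protocol
}
```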

Worst case scenario is a large number of hackers joining the DHT and publishing thousands of incorrect CIDs for popular SHA-256 files, and so designing to fight that scenario is critical and actually may not even be "doable". If IPFS is currently capable of stopping that kind of attack then definitely we don't want to introduce the first weak link in the chain.

Generally open p2p systems are subject to Sybil attacks, and this includes IPFS. There are mitigations available (and we even do some of them), but overall the concern persists. The question to ask is what bad thing occurs/can occur if someone performs a Sybil attack. Thus far the only thing that happens to IPFS under a Sybil attack is basically a denial of service on the resource being attacked.

Thoughts on unverified relationships between two CIDs

Continuing from above, this proposal allows for different types of attacks than just a DoS on the Sybil-attacked resource, and without some reputation backing they aren't too difficult to pull off.

  • Attack: The adversary can say "Popular file with SHA2 X corresponds to DAG Y" where DAG Y is the wrong data
    • Now the user is forced to not just get an "error not found" but actually download a potentially large amount of data before realizing they've been duped
    • The adversary doesn't even need to waste resources since DAG Y can be hosted by someone else
    • The adversary can cause a DoS on the "someone else" hosting the data by overwhelming them with innocent peers thinking Y is really some popular content

It allows you to create a virtual "block" keyed by its hash using any multi-hash, of any data in IPFS, regardless of how that data is already chunked and signed. This means you can do things like;

I think proposal 09 covers most of these use cases at the network layer if clients are willing to make custom block stores locally. Zooming out if we have a primitive that lets us download arbitrarily sized blocks and we want to be able to download these blocks in parts from multiple peers who are transforming the data in various ways (chunking, compression, etc.) that's ok as long as peers present a virtual blockstore that presents the data in its canonical form. This might end up requiring computation/storage tradeoffs (e.g. for decompression), but it gives us verifiability which IMO is key.

  1. Create a whole-file sha-256 CID-alias that points at a CID containing a single uploaded file. This means you can fetch the file using it's whole-file sha-256 hash, instead of a hash that varies depending on the chunking algorithm chosen.

This proposal does exactly that: you just advertise a provider record in the DHT for the full-file SHA-2 and can then download without worrying about chunking schemes, etc., and it does so verifiably.

  1. Create whole-file sha-256 CID-aliases that point at each file inside a single CID that contains an uncompressed tar file.

This proposal doesn't cover this use case, but the idea of working with compression representations as if they're the data itself seems like a bit of a mine field with people wanting various different things out of compression.

One ask is "given that Bob has a compressed version of the file F how can Alice download it knowing only SHA2(F)?" and if you want to be able to download from multiple people who could have compressed the data differently then either way you'll need to be downloading based on the raw bytes in the file. If so, then Bob can have a virtual datastore where if someone asks him for F he'll decompress it before sending.

Another ask is to try and minimize download bandwidth, perhaps by using compression. That compression might come at the data layer, or could come at the transport layer (there's an RFC in this repo about compression in the Bitswap transport). If the question is "I want to download data from peers who have gzip(F) and all I know is SHA2(F)" then we run into verifiability problems. There might be value here, especially if I'm building a friend-to-friend network on IPFS and I know who I trust, but given the verifiability issues IMO this doesn't seem great. If downloading gzip(F) is really so important to the user then they could always pass around references to it as SHA2(gzip(F)).

  1. Add sha-256 CID-aliases for every node in an existing merkle-tree DAG, so that they can be referenced not only by the hash of the node, but by the sha-256 hash of all the data under that node.

Even assuming that by merkle-tree DAG you're referring only to UnixFS files I'm still confused. I'm not sure I see the immediate value here, but couldn't this proposal mostly let you do this? If an application wanted to treat a subset of a file as a new file then they could just do another add and this reduces to case 1.

  1. Create an IPLD DAG using a particular chunking and hash algorithm that is actually a "virtual view" of data already uploaded into IPFS with a completely different chunking and hash algorithm. The "leaf nodes" in this virtual-view DAG will be cid-aliases into the already uploaded data, and would not be limited by IPFS's 2M block size. Note all the data in these different DAGs will be fully de-duplicated.

This seems like it's really about clients having multiple virtual blocks backed by the same underlying bytes. As long as there is a canonical representation that people can advertise/search under and there is a canonical download representation (e.g. the raw bytes of the large block/file) then how you internally represent that single block is up to you.

@Clay-Ferguson Clay-Ferguson Jan 13, 2021

@aschmahmann I agree with everything you said, and you outlined the attack vector very well too. I also agree there's not really any value to identifying individual merkle nodes by their SHA2 just because the entire file is identified by SHA2, with all due respect to the person bringing that up. He is thinking "the right way" however, because normally in a tree-architecture when you have every node being "handled" identically to every other node (including the root itself) that's usually indicative of a good design, so it made sense to bring that up.


Apologies for the late and waaay too long post. I started writing this some time ago and RL kept interfering before I could post it, so each time I came back to it I added more, and now I'm struggling to edit it down. I guess the TLDR is:

I'm not against this proposal. The reverse-download, incremental-hash idea is sound, and a good way to validate the download incrementally so it can be rejected early if it's bad. It's probably the only way for non-homomorphic hash functions, which are the ones people currently use and care about. I just wanted to point out:

  1. You can store hash-state at arbitrary byte boundaries by also including the hash-chunker-tail-fragment in the hash state. This means you don't need to re-chunk the data into hash-chunker-aligned chunks, but can calculate the incremental hash for the existing merkle-dag data blocks with arbitrary alignment. Decent de-duplicating chunkers WILL chunk on arbitrary boundaries.

  2. If you do want to keep the hash-state a little smaller and/or optimize the chunk size for the incremental hash, a CID alias in the DHT could be used to efficiently build a "virtual merkle-tree" with its own chunking that refers to/reuses the data in an underlying existing "as-added" merkle-tree with different chunking. This feature would also be a useful building block for other uses.

A lot of the rest of my words are me thinking aloud and getting a bit off-topic... I'm sorry about that.

I would have said that the big advantage of the DHT is you can find things with it.

You're right I was being sloppy with my wording/explanation here. The point I was trying to make is that if you wanted to put some custom key-value pair (k,v) in the DHT generally you can get around the DHT not supporting your custom pair by doing this:

  1. Make a key k' = Multihash("MyKeyType:" + k) using some common hash function (e.g. SHA2)
  2. Put provider records in the DHT saying that you are a provider of k'
  3. Register a protocol handler that does whatever you want (e.g. returns v given k)
  4. Users who want to find the value for k calculate k', find the providers of k', connect to them using the custom protocol, and ask them for v

You are right that you can do this, but it's very inefficient and more unreliable when 'v' is smaller than a list of the peers hosting a bitswapped block containing 'v' would be. Putting 'v' directly in the DHT instead of in an IPFS block means you avoid more network transfers and round-trips (bitswap is avoided entirely), avoid storing as much data in the DHT ('v' is smaller than the peer records would be) and in peers' blockstores (you don't need peers to store 'v' in a block at all), and avoid relying on available peers providing the block.

I was proposing a 'v' record that contains only a CID, offset, length, and maybe a publishing-peer-signature, which would be smaller than a list of peers. This would be like a sym-link to any arbitrary piece of data in IPFS, referenced by a hash of that data. This would be a low-level efficient piece that other things could be built on top of. How you safely use this feature is probably something for upper layers using it to figure out.
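
A hypothetical layout of that small 'v' record (this is not an existing DHT record type; field names are illustrative):

```golang
// CIDAlias is the proposed DHT value: a pointer into existing IPFS data,
// keyed in the DHT by the hash of the aliased bytes.
type CIDAlias struct {
	Target    []byte // binary CID the alias points into
	Offset    uint64 // start of the aliased bytes within Target's data
	Length    uint64 // number of aliased bytes
	Signature []byte // optional signature by the publishing peer
}
```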

But you are right; this feature doesn't need to be added to the DHT to build proof-of-concept implementations using this idea. Adding it to the DHT would be an optimization that could be done later. I guess my main point is that it's worth designing things with this optimization in the back of your mind, so that you don't end up with something that's hard to migrate to use it later.

This strategy is how IPNS over PubSub works and manages to be faster than IPNS over the DHT even for first time lookups (relevant logic largely lives in go-libp2p-pubsub-router and go-libp2p-discovery).

I don't know much about pubsub (yet) but I'm betting pubsub has its own push notification/lookup system that bypasses the DHT and propagates changes faster than DHT writes propagate. I suspect it's this publishing path that makes it faster, not that pubsub puts its 'v' values in bitswapped blocks.

I'm willing to bet that a DHT lookup of a tiny 'v' record directly in the DHT will nearly always be faster than looking up a 'v' record in a block. The only exception might be when the 'v' record is already cached in the local blockstore, but even then caching in the DHT client should be just as fast. If it's not, then there is something wrong.

The things you lose over having native DHT record support are the ability to publish and then go offline for a few hours and then come back online later. This isn't to say we shouldn't figure out a way to make it easier to upgrade the DHT and support new record types, just that this other path works.

That's the obvious functional difference, but there's a big performance difference too.

Worst case scenario is a large number of hackers joining the DHT and publishing thousands of incorrect CIDs for popular SHA-256 files, and so designing to fight that scenario is critical and actually may not even be "doable". If IPFS is currently capable of stopping that kind of attack then definitely we don't want to introduce the first weak link in the chain.

Note bad clients can "add" CID entries that don't match the blocks they serve for them too. The only protection is these "bad blocks" are small and thus identified as bad before much is downloaded, and peers providing these bad blocks (presumably) get ignored eventually.

There are several different levels to what a CID alias could do, with increasing features and risks:

  1. Just provide an alternative CID using a different hash to an existing block. This could e.g. be used to provide a sha1 alias, as needed by git, to a block added using a sha256 hash. This only requires a CID, no offset+length range, and no peer-signature (it would not add anything). This has no greater risk and exactly the same protections as the existing CIDs, with the exception that obviously sha1 is a weaker hash. Note that this could be done without CID aliases by just re-adding the block using a different hash, but CID aliases mean the blocks and lists of providers are shared/de-duplicated, at the cost of an extra DHT lookup to de-reference the CID alias.

  2. Provide an alternative alias CID using a hash of all the "raw data" under that CID. This could be used to provide a sha256 alias to a whole file, or any node in a merkle-tree-DAG, by a hash of its content. This doesn't require an offset+length range, but it probably does require a peer-signature. It can only be fully validated after downloading all the "raw data", but note that each block under the CID is validated normally as part of that CID, and the final validation of the whole data is verifying that the CID alias points at a CID that has the correct overall hash. Before starting the download, the CID alias peer-signature can be used to check that it has been published by a trusted peer, and peers found to be publishing bad CID aliases can be blacklisted.

  3. Provide an alias to an arbitrary offset/length "large block" of data under a CID, keyed using the hash of that data. This is the same as 2. except it also requires an offset+length range. Its risks and mitigations are the same as 2., with the extra risk that degenerate CID aliases could point at a range straddling a block boundary, requiring downloading all the merkle-DAG nodes from the root down to and including the two raw nodes on either side of the boundary just to get a tiny piece of data. This is likely to be a minor inefficiency, but if it does look like a deliberate DOS attempt the signing peer can be blacklisted.

I agree that CID aliases are more vulnerable to attacks than CIDs and IPNS records, but that's largely because IPFS delegates and denies responsibility for the big risk part: that the CID or IPNS record points at the data that the person publishing says it does. A person can "add" a compromised binary and publicize the CID or even an IPNS record as the place to get it, and IPFS will happily verify that yep, you've downloaded the compromised binary exactly as it was "added", and will not tell you it's been compromised. Verifying the downloaded binary against an officially published hash is something that (currently) has to be done outside IPFS.

It allows you to create a virtual "block" keyed by its hash using any multi-hash, of any data in IPFS, regardless of how that data is already chunked and signed. This means you can do things like;

I think proposal 09 covers most of these use cases at the network layer if clients are willing to make custom block stores locally. Zooming out if we have a primitive that lets us download arbitrarily sized blocks and we want to be able to download these blocks in parts from multiple peers who are transforming the data in various ways (chunking, compression, etc.) that's ok as long as peers present a virtual blockstore that presents the data in its canonical form. This might end up requiring computation/storage tradeoffs (e.g. for decompression), but it gives us verifiability which IMO is key.

The bit about "if clients are willing to make custom block stores locally" worries me. I don't think this is necessary, and implies that a whole bunch of de-duplication at the blockstore and (more importantly) network layer will be broken.

I was thinking the blockstore and network layer would always use the underlying IPFS "as-added" native merkle-dag using the existing fetching/storing/caching mechanisms, and "large blocks" would be a higher-level abstraction for "reading" arbitrarily offset/sized "virtual blocks" from a CID. Under the hood it would just fetch/walk the original merkle tree and download the relevant "leaf" raw data blocks that encapsulate that larger virtual block. The addition of the DHT CID alias idea would give you a way to reference/search for these "virtual blocks". This would mean the local data store and network layers would be unchanged, and all the raw data for any large blocks would deduplicate/reuse the native merkle-tree data.

  1. Create whole-file sha-256 CID-aliases that point at each file inside a single CID that contains an uncompressed tar file.

This proposal doesn't cover this use case, but the idea of working with compression representations as if they're the data itself seems like a bit of a mine field with people wanting various different things out of compression.

One ask is "given that Bob has a compressed version of the file F how can Alice download it knowing only SHA2(F)?" and if you want to be able to download from multiple people who could have compressed the data differently then either way you'll need to be downloading based on the raw bytes in the file. If so, then Bob can have a virtual datastore where if someone asks him for F he'll decompress it before sending.

Note this would not work with compression, only concatenation, as is done by eg tar (NOT tar.gz). You could write a custom tar file uploader that not only gave you a CID for the whole tar file, but a CID alias for every file inside the tar file. This would be more efficient than doing "add" of the tar and each file individually UNLESS you had/used a tar-file aware chunker/dag-builder that could achieve the same thing by building the merkle-dag to reflect the tar-file contents.

  1. Add sha-256 CID-aliases for every node in an existing merkle-tree DAG, so that they can be referenced not only by the hash of the node, but by the sha-256 hash of all the data under that node.

Even assuming that by merkle-tree DAG you're referring only to UnixFS files I'm still confused. I'm not sure I see the immediate value here, but couldn't this proposal mostly let you do this? If an application wanted to treat a subset of a file as a new file then they could just do another add and this reduces to case 1.

This idea was just presented as an example that someone might see a use for.

  1. Create an IPLD DAG using a particular chunking and hash algorithm that is actually a "virtual view" of data already uploaded into IPFS with a completely different chunking and hash algorithm. The "leaf nodes" in this virtual-view DAG will be cid-aliases into the already uploaded data, and would not be limited by IPFS's 2M block size. Note all the data in these different DAGs will be fully de-duplicated.

This seems like it's really about clients having multiple virtual blocks backed by the same underlying bytes. As long as there is a canonical representation that people can advertise/search under and there is a canonical download representation (e.g. the raw bytes of the large block/file) then how you internally represent that single block is up to you.

This is actually about creating multiple virtual blocks you can advertise/search for, that clients can then fetch/store using the "as-added" merkle-tree native representation. It's not purely internal, because the virtual blocks are advertised/searched for by other clients. That the network/storage layers use the native merkle-tree representation means all the data is transmitted/deduplicated at that layer, and clients assemble them into the "large blocks" themselves as needed.

4. File identified using hash of the full content enables [validation of HTTP gateway responses](https://github.com/ipfs/in-web-browsers/issues/128) without running full IPFS stack, which allows for:
- user agents such as web browsers: display integrity indicator when HTTP response matched the CID from the request
- IoT devices: downloading firmware updates over HTTPS without the need for trusting a gateway or a CA

Member
@lidel lidel Jan 13, 2021

I believe this would make public gateways way more useful by removing MITM risk (cc @Gozala @autonome)


Collaborator Author

There are still issues with directories, but yes for files this can help, good idea 💡!

## Evaluation Plan

TBD

## Prior Work

## Results

## Future Work