Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

log and log trie multicodec types #223

Merged
merged 2 commits into from
Aug 19, 2021
Merged

Conversation

i-norden
Copy link
Contributor

@i-norden i-norden commented Aug 5, 2021

In our IPLD representation of the Ethereum Merkle data structure, we would like to compose the logs contained in a receipt into a "log trie" that is linked to from the receipt. We want to Merkleize them in a manner analogous to how transaction and receipt lists canonically are: composing them into a Modified Merkle Patricia Trie (MMPT), whereby the RLP encoded log is the value stored in a leaf node and the path to that leaf node is the RLP encoding of the log's index in the receipt.

We are not changing the binary encoding of a receipt IPLD- that remains the consensus encoding of a receipt- we only want to change the in-memory materialization of that binary data so that a receipt IPLD node links to the root node of a log trie.

Implementations of the new log and log trie codecs and the updated receipt codec can be found here.

The motivation for this is so that

  1. We can describe IPLD selectors that extend all the way down to a log node.
  2. We can generate (semi) conventional MMPT proofs for individual logs.

The primary issue here, I think, is that it does add additional overhead when unmarshalling receipt binary into an ipld.Node as we need to materialize the trie to calculate its root hash and generate the link. If need be, we can create a custom codec that does this and leave the normal receipt codec as is.

Some unrelated tweaks I threw in, that might be superfluous:

  1. Changed eth-block to eth-header, as it is only the header that is referenced by "block"hash and stored as the IPLD block. The header contains all the links to the uncles, tx, rct, state (and thereby storage) tries but the block body is not a part of the "eth-block" IPLD block.
  2. Changed "RLP" to "MarshalBinary" in the descriptions for transaction and receipt, because after EIP-2718 tx envelops were introduced the consensus encoding of a tx or rct is only the pure RLP encoding for the legacy types.

Also changing eth-block to eth-header, as it is actually only the header that is hash referenced by "block"hash and the header in turn contains all the roots to the tx, rct, state (and thereby storage) tries.
Also changed RLP to MarshalBinary in the description for tx and rct, because after EIP-2718 tx envelopers were introduced the consensus encoding of a tx or rct is only the pure RLP encoding for the legacy types.
@rvagg
Copy link
Member

rvagg commented Aug 10, 2021

The rename of eth-block to eth-header might be a problem, we have the same issue with the other chains that are listed here where the *-block is really just the header. But it's also not strictly wrong because in blockchain terminology the hash of that thing is commonly referred to as the block identifier. I'm not sure about a wholesale rename, as much as I'd prefer correctness. @Stebalien what's your take on this?

WRT to the rest, I'm not clear on the relationship between the receipt and the log entries in ETH—by placing a link to the log entries in the receipt you're suggesting that this can be derived from some binary that I give you that represents the "receipt", but I'm not sure that's the case, is it? The log entries are separate to the receipt, so you don't have a two-way relationship such that you could round-trip a "receipt" through IPLD and back again, either way, and this log trie root CID is intact? In unpackLogRootCID in your code you seem to need rct.Logs in order to derive the CID.

Maybe this is another one of those places where a separate utility type that exists outside of the chain would be a more appropriate choice rather than reading something into the chain that's not there?

@Stebalien
Copy link
Member

The rename of eth-block to eth-header might be a problem, we have the same issue with the other chains that are listed here where the *-block is really just the header. But it's also not strictly wrong because in blockchain terminology the hash of that thing is commonly referred to as the block identifier. I'm not sure about a wholesale rename, as much as I'd prefer correctness. @Stebalien what's your take on this?

Depends on who's using this table? This won't cause a problem in go, but yeah, I'd prefer to leave it as-is unless there's a good reason to change it.

@rvagg
Copy link
Member

rvagg commented Aug 10, 2021

@i-norden best leave the renaming out, but changing the note at the end to refer to the "header" is a good idea to be clear.

@i-norden
Copy link
Contributor Author

i-norden commented Aug 16, 2021

@i-norden best leave the renaming out, but changing the note at the end to refer to the "header" is a good idea to be clear.

Will do!

WRT to the rest, I'm not clear on the relationship between the receipt and the log entries in ETH—by placing a link to the log entries in the receipt you're suggesting that this can be derived from some binary that I give you that represents the "receipt", but I'm not sure that's the case, is it? The log entries are separate to the receipt, so you don't have a two-way relationship such that you could round-trip a "receipt" through IPLD and back again, either way, and this log trie root CID is intact? In unpackLogRootCID in your code you seem to need rct.Logs in order to derive the CID.

That's correct, our code needs the logs in order to derive the CID (we create a the log trie out of them, and derive the CID from the root hash of the trie). We are only placing the log trie CID link in the IPLD node representation of the receipt, we are not changing the binary encoding of a receipt IPLD- that remains the Etheruem consensus encoding of a Receipt.

So when we decode from binary we use the logs to generate the CID and when we encode back to binary we don't encode the CID, directly, but we encode those logs which can be used to generate it again. The CID itself is not persistent but the data used to generate it in a deterministic way is. This probably doesn't qualify as a proper round trip, but with the consensus encoding of a receipt we have everything we need to go both ways.

Considering how etheruem CID links are normally "encoded" it doesn't seem like too much of a stretch to me. We never actually persist an entire CID into the IPLD block binary- we instead have encoded a keccak256 hash that we use to derive, per the codec, a CID link when we materialize the IPLD node from its binary. Our use of map representation for these IPLD nodes when the binary is actually a struct/tuple encoding is another sort of materialization transformation like this. Although, this transformation is more complex and entirely specific to this type.

Maybe this is another one of those places where a separate utility type that exists outside of the chain would be a more appropriate choice rather than reading something into the chain that's not there?

Yeah I think that makes sense, although we would really like to be able to use IPLD selectors on individual logs. And, as I see it, we aren't so much reading something into the chain that's not there but rather applying a transformation on what already exists.

Copy link
Member

@rvagg rvagg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I think the use of this is pushing some boundaries of IPLD's data model <> encoding concepts to the limits, or maybe beyond limits, although it could very well be that this simply highlights limitations of what we're trying to circumscribe IPLD and maybe we need to be more flexible when it comes to existing formats that play fast and lose with content addressing for "linking"—or maybe more accurately, don't use hashing for "linking" in the sense that we want to be linking, since these blockchain formats really just want to use hashing for merkle inclusion but we want to use it for linking and navigating.

Either way though, I don't think these details preclude inclusion in the multicodec table of something that is a thing in the wild and can be represented in IPLD at some level.

@rvagg rvagg merged commit 22da894 into multiformats:master Aug 19, 2021
@i-norden
Copy link
Contributor Author

Thanks @rvagg! That makes sense, and I feel a bit apprehensive about making this change to the Receipt IPLD codec implementation myself because generating the LogTrie and calculating its root every time we materialize a Receipt ipld.Node from the Receipt binary will have some significant overhead if there are a lot of logs in the receipt. I can do some benchmarking to examine this further. Definitely agree it makes sense to merge these Log and LogTrie multicodecs, we can represent these objects as IPLD whether or not (or how) we decide to link to them from a Receipt.

Perhaps we could introduce a second Receipt multicodec type and corresponding codec implementation where we perform this LogTrie materialization and linking and leave the original Receipt codec as is? Is there any precedent for that, having two different multicodec types for the same binary data?

@rvagg
Copy link
Member

rvagg commented Aug 25, 2021

The main precedent is raw combined with any other codec, but that's a special case.

#207 has ongoing discussion about doubling-up but there's a lot of resistance to using the codec code in a CID as anything more than a "this is how to decode the binary data" signal. It continues to be tempting to want to use it to signal more than that! I think it argues for an extension of the CID format to include additional signalling options, codec should probably be as limited as possible.

There is another almost-precedent which perhaps has some relationship here—bitcoin-witness-commitment is used for hash tuple blocks, like bitcoin-tx when it's working through the merkle tree (bitcoin-tx does double duty for the transactions at the base of the merkle tree). So when a bitcoin-tx block is 64-bytes long, bot it and bitcoin-witness-commitment decode using the same initial algorithm, but because of the loss of information as we navigate through the tree if we treat the witness commitment the same then our DAG doesn't hold otgether. What saves us from this being a strict double-up is that bitcoin-witness-commitment in the data model is [CID, Bytes] (second element is a nonce) whereas bitcoin-tx in the merkle tree is [CID, CID] so we can justify it being distinct.

I don't know if that helps at all, but the general rule is something like: codec code in a CID should only be used to signal the decode (and encode) algorithm used to turn the block's bytes into data model form. But there's some flexibility around the edges to deal with the messy real world.

@rvagg
Copy link
Member

rvagg commented Aug 25, 2021

Sorry, #203 was the other relevant discussion, #207 is just related to it. In the case of #203 we have a block format that is strictly git-raw but there's additional information about the origin that they don't want to lose along the way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants