-
Notifications
You must be signed in to change notification settings - Fork 275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PoC(tap): Snapshot merkle tree #1113
Conversation
@mnm678 please could you mark this PR as a draft until it's ready for review? |
97f0fdb
to
561428c
Compare
@joshuagl I just fixed a typo and rebased, so this should be ready for review now. |
The linked document appears to be an old version, I found this working copy much more informative. Particularly this section:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really cool work here @mnm678! Apologies for the delayed review, it took me quite a bit of time to understand the design (and I'm still not 100% clear that I do). IIRC the purpose of this PR was to PoC the concept in before working on a TAP? A TAP would be really helpful for fully reviewing this submission.
On the above understanding, I've avoided reviewing the code in too much detail. Though I have made various comments and recommendations when reading through. The major code structure issue I'd like to see addressed if this were intended for a merge is to remove the amount of branching/special-casing for handling the Merkle tree related logic. The code will be much easier to follow, and therefore maintain, with the Merkle logic existing mostly in single-purpose methods (i.e. a generate_snapshot_merkle_metadata
and generate_snapshot_metadata
vs a combined generate_snapshot_metadata
which takes a boolean indicating whether to generate Merkle metadata).
From a PoC/TAP authoring/review perspective I'd like to see the following questions addressed:
- Do we have a feeling for how the proposed mechanisms perform? It would be good to create something like @lukpueh's scripts that test the performance of delegating to hashed bins to get a feel for the performance of the Merkle tree builder on large repositories like PyPI or Dockerhub
- Do you expect the tree building algorithm to be part of the specification? Can clients/other implementations be compatible without that knowledge?
- For large repositories this will generate a lot of files, what's the garbage collection story for Merkle tree files? Should we consider formalising guidelines around storing them to make cleanup easier? For example, it might make sense to use a directory per tree with the directory named for the root? We could look at other systems with similar tree structures for inspiration (i.e. Git Objects)
- I'd appreciate some time in a TAP being explaining why the Merkle files are different to all of the other metadata files we create (they don't use the signable format and are not signed), though perhaps that's obvious to most of the TAP editors?
metadata_role: | ||
The name of the metadata role. This should not include a file extension. | ||
<Exceptions> | ||
tuf.exceptions.RepositoryError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: I'm not sure this is the right exception, but which exceptions to use where is somewhat of an open question...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have one you think would fit better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Sorry, I wasn't very clear. I meant this more as an FYI for later and had intended to link to #1131
tuf/client/updater.py
Outdated
merkle_root = self.metadata['current']['timestamp']['merkle_root'] | ||
|
||
# Download Merkle path | ||
self._update_metadata(metadata_role + '-snapshot', 1000, snapshot_merkle=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the length be a constant? How did we come up with 1000?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this constant to tuf.settings to avoid the magic number here, but the choice of 1000 could probably use more discussion.
tuf/repository_lib.py
Outdated
# this path to the client for verification | ||
return root, leaves | ||
|
||
def write_merkle_paths(root, leaves, storage_backend, merkle_directory): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def write_merkle_paths(root, leaves, storage_backend, merkle_directory): | |
def _write_merkle_paths(root, leaves, storage_backend, merkle_directory): |
Internal methods should add an underscore prefix to the method name.
tuf/client/updater.py
Outdated
# If merkle root is set, do not update snapshot metadata. Instead, | ||
# download the relevant merkle path when downloading a target. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# If merkle root is set, do not update snapshot metadata. Instead, | |
# download the relevant merkle path when downloading a target. | |
# If merkle root is set, do not update snapshot metadata. Instead, | |
# we will download the relevant merkle path later when downloading | |
# a target. |
tuf/client/updater.py
Outdated
updated_metadata_object = metadata_signable['signed'] | ||
if snapshot_merkle: | ||
# Snaphot merkle files are not signed | ||
updated_metadata_object=metadata_signable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated_metadata_object=metadata_signable | |
updated_metadata_object = metadata_signable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated_metadata
might be a signable, or not. Having the same variable be different types is often not a good sign. See my comment above about making a separate method for updating merkle files.
tuf/repository_lib.py
Outdated
def _print_merkle_tree(node, level): | ||
""" | ||
Recursive function used by print_merkle_tree | ||
""" | ||
print('--'* level + node.hash()) | ||
if not node.isLeaf(): | ||
_print_merkle_tree(node.left(), level + 1) | ||
_print_merkle_tree(node.right(), level + 1) | ||
else: | ||
print('--' * (level+1) + node.name()) | ||
|
||
|
||
|
||
def print_merkle_tree(root): | ||
""" | ||
Helper function to print merkle tree contents for demos and verification | ||
of the Merkle tree contents | ||
""" | ||
print('') | ||
_print_merkle_tree(root, 0) | ||
|
||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want/need to merge these demo/debug helpers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These might be useful for saving the full tree for auditing purposes. I need to think more about the exact process for merkle tree auditing, but I'll include something in the TAP and we can decide then if these are needed.
leaf_contents = SCHEMA.OneOf([VERSIONINFO_SCHEMA, | ||
METADATA_FILEINFO_SCHEMA]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me why the leaf_contents
can be one of two types? Aren't VERSIONINFO_SCHEMA
objects already conformant to METADATA_FILEINFO_SCHEMA
, because in the latter the hashes
and length
fields are optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same definition used in the FILEINFODICT_SCHEMA
used by snapshot metadata. I can remove the VERSIONINFO_SCHEMA
from both places if it is redundant.
Maybe @lukpueh knows why they are both listed for snapshot?
tuf/client/updater.py
Outdated
merkle_root = self.metadata['current']['timestamp']['merkle_root'] | ||
|
||
# Download Merkle path | ||
self._update_metadata(metadata_role + '-snapshot', 1000, snapshot_merkle=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, rather than overload _update_metadata
and special-case that function for snapshot Merkle files, it might be a bit cleaner to have a separate method for updating snapshot Merkle files?
tuf/repository_lib.py
Outdated
def right(self): | ||
return self._right | ||
|
||
def isLeaf(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not a very pythonic function name. is_leaf
or just a boolean variable leaf
. This should probably be a property of Node
?
Thanks @joshuagl!
Yes, the goal is to turn this feature into a TAP before finalizing anything. I made some initial design decisions in this pr, but we can revisit them as needed.
I did some experiments locally and the performance seemed pretty good, but I agree that formalizing these in test cases is a good next step. This feature is only really needed for large repositories, so we have to make sure it scales well.
We can discuss this in the TAP, but IMO we should include the tree building algorithm in the POUF. Implementations will need to know the algorithm in order to ensure that the hashes match, but it's not fundamental for achieving the security properties.
The merkle tree files do not need to be accessed once new files are generated, so it makes sense to store/delete these together. I don't think garbage collection is currently addressed in the specification, but we can include a description of this either there or in the TAP.
Based on my initial understanding, they do not have to be signed because all the data in the Merkle files is validated using the Merkle root (which is signed by timestamp). I will certainly include a more detailed explanation of this in the TAP, and we can discuss whether a signature is actually needed. |
That sounds reasonable to me.
Yeah. It's not addressed in the specification at present, and unlikely to be added there as we move away from encoding notions of files in the specification (theupdateframework/specification#103). It would be good to address this in the TAP for the purposes of the reference implementation and adding something to the proposed secondary literature (theupdateframework/specification#91).
Of course, timestamp is signed and the nodes in the Merkle tree are protected by cryptographic hashes that are verified (contents match hash) as the files are downloaded. I'd certainly appreciate more explanation of this in the TAP, thank you. |
I added an initial TAP for this feature at theupdateframework/taps#125. |
I've converted this PoC to a draft PR, hope you don't mind |
Generate a snapshot merkle tree when writing snapshot metadata and use the root hash in timestamp. This is a work in progress commit. Signed-off-by: marinamoore <[email protected]>
Signed-off-by: marinamoore <[email protected]>
This commit adds some clarity to the merkle tree generation by adding additional commments, breaking operations into different functions, and adding a print_merkle_tree function for easier validation of the Merkle tree. Signed-off-by: marinamoore <[email protected]>
Signed-off-by: marinamoore <[email protected]>
… merkle paths. snapshot merkle filenames should be of the form role-snapshot.json. This commit ensures that they will not have an additional .json in the middle (role.json-snapshot.json). In addition, this commit updates the snapshot merkle files to use a dictionary for the merkle path to preserve the order of elements. Signed-off-by: marinamoore <[email protected]>
To do so, it adds a verify_merkle_path function that verifies a merkle tree using a snapshot merkle file and the root hash from timestamp. In addition, this commit ensures that, when present, the snapshot merkle paths will be used in place of snapshot.json Signed-off-by: marinamoore <[email protected]>
Signed-off-by: marinamoore <[email protected]>
…unctionality. Signed-off-by: marinamoore <[email protected]>
Signed-off-by: marinamoore <[email protected]>
Signed-off-by: marinamoore <[email protected]>
…presentation when hashing snapshot information. By using the canonical json representation, this commit ensures that the hash will be consistent across different python versions. this commit additionally updates the test data to use the correct hashes. Signed-off-by: marinamoore <[email protected]>
Signed-off-by: marinamoore <[email protected]>
This commit ensures that when snapshot merkle trees are used, the client is not looking for version information in the snapshot file, but instead waiting to download version information from the snapshot merkle tree. Another approach could be to download the snapshot merkle file early to ensure that it exists before continuing the verification. Signed-off-by: marinamoore <[email protected]>
Signed-off-by: marinamoore <[email protected]>
Signed-off-by: marinamoore <[email protected]>
The verification_fn moves the verification for signable metadata into a separate function. Files that are not signable, such as snapshot merkle files, do not use the verification for signable metadata. Signed-off-by: marinamoore <[email protected]>
This creates a separate function for updating merkle metadata to avoid all of the merkle tree special cases in _update_metadata. In addition, this cleans up the calling of _update_merkle_metadata by creating a MERKLE_FILELENGTH in tuf.settings Signed-off-by: marinamoore <[email protected]>
Remove the getters and setters in Node, InternalNode, and Leaf to simplify the code. This also improves consistency by always using 'digest' instead of 'hash' for merkle tree variable names Signed-off-by: marinamoore <[email protected]>
To ensure that generate_snapshot_metadata always returns the same number of variables, move the snaphot merkle tree generation to _generate_and_write_metadata. Alternatively, the snapshot merkle tree generation could be included in a generate_snapshot_merkle_metadata function. Signed-off-by: marinamoore <[email protected]>
Signed-off-by: marinamoore <[email protected]>
Signed-off-by: marinamoore <[email protected]>
In order to support third party or client auditing of Merkle trees, auditors need to be able to access previous versions of the snapshot merkle metadata (at least since the last timestamp key replacement). This commit allows this by writing snapshot Merkle files with consistent snapshots so that previous versions can be accessed. Signed-off-by: marinamoore <[email protected]>
342b58f
to
588f417
Compare
The auditor verifies all nodes in the snapshot merkle tree to check for rollback attacks. This PoC implementation re-uses a lot of functionality from the updater. It may be better to move some of the re-used functionality to a separate place (ie separating file downloading from update logic). The auditor implementation currently requires there to be snapshot metadata on the repository in order to iterate over all of the nodes. In the future, if the verification succeeds, the auditor should add a signature to timestamp metadata. Signed-off-by: Marina Moore <[email protected]>
588f417
to
ec82eb8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have the full context of TUF in my comments here, so don't place too much value on them. My comments were instead more based on how this could be applied to storing metadata in container registries where we want to avoid too many round-trips and can store a long lists of manifests (top and intermediate level nodes) in an index, and a long list of blobs (leaf nodes) in an artifact, and each of those list entries can include an annotation to note a min or max key value of the referenced child node(s).
self.left = left | ||
self.right = right |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we flatten the tree by allowing more than two children per internal node? And would it make sense to have a min or max key field included with each child to know how to search the tree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a key field, I'm thinking something like "the timestamp of the target + content hash" which is unique while sorting nicely, older more static data to the earlier leaves and newer frequently changing data to the later leaves. That would hopefully have clients able to cache more of the data from previous queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we flatten the tree by allowing more than two children per internal node? And would it make sense to have a min or max key field included with each child to know how to search the tree?
I think we need not have an RTT per level of the tree. This may be an implementation detail in the prototype we could clean up The repository could bundle the full path to the leaf node and send it all in a single reply.
Using a structure like you mention (which is like a B-Tree instead of a binary tree), will use more bandwidth, but require fewer signatures (on tree creation) and fewer blocks read from disk. Our sense is that the bandwidth was the more important concern, but feel free to correct us!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the context of container registries, the round trips are an issue, and the non-leaves (index/manifests) tend to be small relative to the leaf nodes (layers/blobs). My thoughts are very much along the lines of a Merkle + B-tree since that maps so closely with container registries, but I'm missing the TUF context to know how that would result in significantly larger download sizes.
# Write the path to the merkle_directory | ||
file_contents = tuf.formats.build_dict_conforming_to_schema( | ||
tuf.formats.SNAPSHOT_MERKLE_SCHEMA, | ||
leaf_contents=l.contents, | ||
merkle_path=merkle_path, | ||
path_directions=path_directions) | ||
file_content = _get_written_metadata(file_contents) | ||
file_object = tempfile.TemporaryFile() | ||
file_object.write(file_content) | ||
filename = os.path.join(merkle_directory, l.name + '-snapshot.json') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this directory scale relative to the sum of the leaf nodes? Would it still be required if we included a key field with every child pointer so a search through the tree would be used rather than a directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This directory is for storing all of the nodes on disc (which might look different for a non-filesystem storage mechanism). The key field might help with discovery of the nodes, but they'd still all need to be stored somewhere.
With the legacy code gone this PR will need quite a re-write. Although the PR has been pending for quite a while (it actually predates the TAP it implements, which was added later) there is some recent technical discussion, not directly related to the legacy code. @mnm678, how would you like to proceed? Should we keep the PR for updates or close and submit a fresh one? |
Yeah, we'll want to re-write these PoC prs using the new code. I think the existing code will have limited direct applicability, so we should close this. The discussion will remain visible after we close it, and is linked in the TAP issue. |
This pr adds the ability to use a snapshot merkle tree instead of the snapshot.json metadata. More details about the snapshot merkle design are available in the Notary v2 design proposal: https://docs.google.com/document/d/1w8PFELVxt4p1aMk5oJv0RbDyd5J4OyvwguNSNZ1sJNw/edit# or the TAP.
This implementation includes creation and verification of snapshot merkle metadata, as well as an example auditor implementation.