Instead of hashing trees by mtime and size, hash by file content. #25

Open · wants to merge 8 commits into master
Conversation

timmfin commented Oct 2, 2014

This allows the cache to be hit far more often, especially when you modify a file's content back and forth (e.g. frequent undo/redo).

Also, added an optional `digestCache` option to hashTree(...), which is used to prevent unnecessary file reads (if a file's mtime and size haven't changed, assume we don't need to re-read and hash its contents).
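A minimal sketch of the idea (the helper name and cache shape here are illustrative, not the exact code in this PR):

```js
var fs = require('fs');
var crypto = require('crypto');

// Hypothetical helper: returns the content digest for a file, re-reading
// it only when the stats suggest the content may have changed.
function digestFor(fullPath, digestCache) {
  var stats = fs.statSync(fullPath);
  var cached = digestCache[fullPath];

  // If mtime and size are unchanged, assume the content is unchanged too.
  if (cached && cached.mtime === stats.mtime.getTime() && cached.size === stats.size) {
    return cached.digest;
  }

  var digest = crypto
    .createHash('sha1')
    .update(fs.readFileSync(fullPath))
    .digest('hex');

  digestCache[fullPath] = {
    mtime: stats.mtime.getTime(),
    size: stats.size,
    digest: digest
  };

  return digest;
}
```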
timmfin added a commit to timmfin/broccoli-filter that referenced this pull request Oct 2, 2014
…tchen-sink-helpers change to hash trees based on file content instead of just size and mtime: broccolijs/broccoli-kitchen-sink-helpers#25

Also, simplifies the logic by always writing to the cache dir and reusing `symlinkOrCopyFromCache` for both the non-cached and cached code paths (like broccoli-caching-writer does). FYI, this has the side effect of making files in the cache have full filenames/paths (instead of just being named 0, 1, 2, …).

Plus, added tests.
if (digest === undefined) {
  // Cache miss: read the file and hash its contents
  digest = crypto
    .createHash('sha1')
    .update(fs.readFileSync(fullPath))
    .digest('hex');
}
Member

Reading the files here is likely to cause a relatively significant increase in memory usage and total I/O.

Author

Yup, I understand that. My assumption is that this cost is more than made up for by doing far fewer rebuilds (and/or is significantly mitigated by solid-state disks). Plus, although anecdotal, I do know that there are other build systems, like the Sprockets/Rails Asset Pipeline, that do things this way... so it isn't completely crazy (but I do understand your hesitance).

Member

I guess what is not clear to me is why what we are doing now is unacceptable. Are there actual scenarios where stat.mode, stat.mtime, stat.size, and relativePath do not already provide sufficient caching? Especially when we have symlinks (and we do not need to use fs.utimes to ensure copied files have the same mtime).

Also, in general, I would prefer to be better than the asset pipeline if at all possible (not shadow it).
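For reference, what we do now boils down to hashing metadata only; roughly (a simplified sketch, not the module's exact code):

```js
// Simplified sketch: a stat-based key involves no file reads at all,
// but a touched-yet-identical file still produces a new key.
function statKey(relativePath, stats) {
  return [relativePath, stats.mode, stats.size, stats.mtime.getTime()].join('\x00');
}
```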

Author

Got it. Just to make sure that we're on the same page: this PR is definitely experimental, and the conversation is just as (if not more) important than the code. And as I've said before, I'm new to broccoli, so I apologize if I'm rehashing old thoughts, missing the point, etc.

So why do I think content-based caching is better...

With mtime-based caching you can only really have one cache: whatever the last state of a file/directory/tree was. It is a binary thing; either you use the last file/tree or you generate a new file/tree (and blow away the previous cache).

However, with content-based caching, you can have several previous states saved to re-use if needed. That is particularly helpful if you undo/revert one or more changes (pretty common in the front-end refresh-the-browser workflow) or when you do things like switch to a new branch, make a fix, and then swap back to your previous branch.

In both of those cases (undoing a few changes and switching back to a previous branch), content-based caching lets the results be pulled directly from the cache without triggering a rebuild. And to me that is huge, because our environment deals with pretty large and complicated projects: for example, 500-1000+ files and different projects/git repositories that depend on each other (and those dependencies still get modified pretty frequently, so you can't/don't want to copy them, pin them to an explicit version, etc.). Right now, I'm looking at a 20-30 second fresh-cache build time, and while I think I can improve that quite a bit, I don't think I'll be able to reduce it by an order of magnitude.

That is why preventing rebuilds in common cases is huge to me, especially given how you can nest trees inside one another in Broccoli: a rebuild in one tree can force a later tree to rebuild, which can trigger another rebuild, and so on.

That brings up another useful point: what if a filter/plugin's output is identical to that of a previous rebuild even though the input was different? Content-based caching deals very well with this; mtime caching less so (unless you can ensure that every plugin is written very carefully, uses its cache well, and preserves mtimes in all possible scenarios).

Whew, so those are my thoughts. Thanks for listening. And I'm glad to talk more about our environment/team/project sizes if that is helpful. I realize that our slightly more "enterprise-y" (shudder) use-cases can be pretty different than some are used to.


ps: Maybe there is a way to try to save multiple "cache states" with a file's mtime and size stats, but I can't think of a way to do it well (or without lots of complicated hoops to ensure mtimes get reset to exact previous values).

pps: I do realize that content-based caching is more complicated, requires sane heuristics so the cache doesn't grow forever, etc. But IMHO, this is a case where that complexity is well worth it, at least at the scale and complexity I'm dealing with (and that other large projects/engineering teams are dealing with).
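ppps: One way to picture the difference (a hypothetical sketch, not actual broccoli code):

```js
// Hypothetical content-addressed cache: many previous states can coexist,
// so undoing a change or switching back to an earlier branch hits an old
// entry instead of triggering a rebuild.
var buildCache = {};

function buildWithCache(contentDigest, build) {
  if (!(contentDigest in buildCache)) {
    buildCache[contentDigest] = build(); // rebuild only for genuinely new content
  }
  return buildCache[contentDigest];
}

// An mtime-based cache is effectively a single slot: any change evicts it,
// even a change back to content that was built moments ago.
var lastBuild = null;
```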

Plus, cleaned it up by passing an options object around instead of a direct reference to the digest cache. This also made it easier for me to attach other things to the options object, like cumulative file I/O duration, for basic profiling.
g13013 commented Oct 8, 2014

@timmfin I'm not sure this is an issue with broccoli itself; it's a plugin issue. relativePath, mtime, and size guarantee a fast check. I think it's the plugin's responsibility to implement a file cache.

timmfin commented Oct 8, 2014

@g13013 yes, I agree about this being a plugin concern. I didn't even realize that the Broccoli watcher uses it internally.

However, I wanted to at least try to be ambitious enough to make this the default and/or recommended hashing mechanism for plugins. Sure, we could implement this as another method or in a different module, but then all existing plugins would need to learn about the change and adopt it to take advantage.

That being said, I do realize that having separate hashing functions (and/or additional forks of broccoli-filter and broccoli-caching-writer that use a content-based hash) may very well be the most sensible path forward. I can take that route if preferred.

ps: I still think it would be a good idea to have a conversation about what the default expected plugin behavior should be (mtime vs. content hashing, etc.).

g13013 commented Oct 8, 2014

@timmfin I agree with you that at least broccoli-writer should export some API for dealing with files.

@rwjblue an example would be to export a file walk/stat/copy/move API that implements the best practices Broccoli recommends.
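Something like this exported surface is what I have in mind (purely illustrative; the names and bodies are just my assumptions about what such an API could offer):

```js
var fs = require('fs');
var path = require('path');

// Purely illustrative sketch; none of these helpers exist today.
function walk(dir) {
  var results = [];
  fs.readdirSync(dir).forEach(function (entry) {
    var fullPath = path.join(dir, entry);
    if (fs.statSync(fullPath).isDirectory()) {
      results = results.concat(walk(fullPath)); // recurse into subdirectories
    } else {
      results.push(fullPath);
    }
  });
  return results;
}

function copy(src, dest) {
  fs.writeFileSync(dest, fs.readFileSync(src));
}

function move(src, dest) {
  fs.renameSync(src, dest);
}

module.exports = { walk: walk, copy: copy, move: move };
```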

…content

* upstream/master:
  Release version 0.2.6
  Add node-glob lockdown to CHANGELOG.md.
  Lock glob to specific version.
timmfin commented Apr 22, 2015

Btw, I updated this to make the content-based cache optional and off by default. To use it, you can call hashTreeWithContents or pass the { hashContent: true } option to hashTree (done in this commit).

(Though you should probably pass in the digestCache option as well. It's an optimization so that a file's content digest is only regenerated when its mtime has changed, preventing unnecessary file reads.)
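Usage would look roughly like this (assuming the option and function names from this branch; exact signatures may differ):

```js
var helpers = require('broccoli-kitchen-sink-helpers');

// Keep the digest cache around across rebuilds so files whose mtime and
// size are unchanged are never re-read.
var digestCache = {};

var hash = helpers.hashTree('some/dir', {
  hashContent: true,
  digestCache: digestCache
});

// Or via the convenience wrapper added in this branch:
var sameHash = helpers.hashTreeWithContents('some/dir', { digestCache: digestCache });
```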

timmfin added a commit to timmfin/broccoli-filter that referenced this pull request Apr 22, 2015
…ter-emit-symlinks-and-content-cache

* commit '6f864b965d86f2532cb9b684e5703c2b793af386':
  Update to use optional content-based caching (via my fork of broccoli-kitchen-sink-helpers: broccolijs/broccoli-kitchen-sink-helpers#25)
  Tests (originally from d075ee0)
  Fix issue where symlinks pointing to the outputDir were broken (in 0.13.x, because the outputDir would be deleted after every build).
  Merge remote-tracking branch 'upstream/master' into make-filter-emit-symlinks

Conflicts:
	index.js
	package.json
	tests/index.js
Needed to be able to correctly set initialRelativePath when hashing individual files (and not starting from a directory). Otherwise, the relative path will not be correct by the time it gets to digestOfFileContents.