Instead of hashing trees by mtime and size, hash by file content. #25

Open · wants to merge 8 commits into master
Conversation

timmfin commented Oct 2, 2014

This allows the cache to be hit far more often, especially when you modify a file's content back and forth (e.g. frequent undo/redo).

Also, added an optional `digestCache` option to hashTree(...), which is used to prevent unnecessary file reads (if a file's mtime and size haven't changed, assume we don't need to re-read and hash its contents).
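A minimal sketch of the idea (the helper name and cache shape here are illustrative, not the exact code in this PR):

```js
var fs = require('fs');
var crypto = require('crypto');

// Hypothetical helper: returns the content digest for a file, re-reading
// it only when the stats suggest the content may have changed.
function digestFor(fullPath, digestCache) {
  var stats = fs.statSync(fullPath);
  var cached = digestCache[fullPath];

  // If mtime and size are unchanged, assume the content is unchanged too.
  if (cached && cached.mtime === stats.mtime.getTime() && cached.size === stats.size) {
    return cached.digest;
  }

  var digest = crypto
    .createHash('sha1')
    .update(fs.readFileSync(fullPath))
    .digest('hex');

  digestCache[fullPath] = {
    mtime: stats.mtime.getTime(),
    size: stats.size,
    digest: digest
  };

  return digest;
}
```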
timmfin added a commit to timmfin/broccoli-filter that referenced this pull request Oct 2, 2014
…tchen-sink-helpers change to hash trees based on file content instead of just size and mtime: broccolijs/broccoli-kitchen-sink-helpers#25

Also, simplifies the logic by always writing to the cache dir and reusing `symlinkOrCopyFromCache` for both the non-cached and cached code paths (like broccoli-caching-writer does). FYI, this has the side effect of making files in the cache have full filenames/paths (instead of just being named 0, 1, 2, …).

Plus, added tests.
if (digest === undefined) {
  // Cache miss: read the file and hash its contents
  digest = crypto
    .createHash('sha1')
    .update(fs.readFileSync(fullPath))
    .digest('hex');
}
Member

Reading the files here is likely to cause a relatively significant increase in memory usage and total I/O.

Author

Yup, I understand that. My assumption is that this cost is more than made up for by doing far fewer rebuilds (and/or is significantly mitigated by solid-state disks). Plus, although anecdotal, I do know that there are other build systems, like the Sprockets/Rails Asset Pipeline, that do things this way... so it isn't completely crazy (but I do understand your hesitance).

Member

I guess what is not clear to me is why what we are doing now is unacceptable. Are there actual scenarios where stat.mode, stat.mtime, stat.size, and relativePath do not already provide sufficient caching? Especially when we have symlinks (and we do not need to use fs.utimes to ensure copied files have the same mtime).

Also, in general, I would prefer to be better than the asset pipeline if at all possible (not shadow it).
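For reference, what we do now boils down to hashing metadata only; roughly (a simplified sketch, not the module's exact code):

```js
// Simplified sketch: a stat-based key involves no file reads at all,
// but a touched-yet-identical file still produces a new key.
function statKey(relativePath, stats) {
  return [relativePath, stats.mode, stats.size, stats.mtime.getTime()].join('\x00');
}
```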

Author

Got it. Just to make sure that we're on the same page: this PR is definitely experimental, and the conversation is just as (if not more) important than the code. And as I've said before, I'm new to broccoli, so I apologize if I'm rehashing old thoughts, missing the point, etc.

So why do I think content-based caching is better...

With mtime-based caching you can only really have one cache: whatever the last state of a file/directory/tree was. It is a binary thing; either you use the last file/tree or you generate a new file/tree (and blow away the previous cache).

However, with content-based caching, you can have several previous states saved to re-use if needed. That is particularly helpful if you undo/revert one or more changes (pretty common in the front-end refresh-the-browser workflow) or when you do things like switch to a new branch, make a fix, and then swap back to your previous branch.

In both of those cases (undoing a few changes and switching back to a previous branch), content-based caching lets the results be pulled directly from the cache without triggering a rebuild. And to me that is huge, because our environment deals with pretty large and complicated projects: for example, 500-1000+ files and different projects/git repositories that depend on each other (and those dependencies still get modified pretty frequently, so you can't/don't want to copy them, pin them to an explicit version, etc.). Right now, I'm looking at a 20-30 second fresh-cache build time, and while I think I can improve that quite a bit, I don't think I'll be able to reduce it by an order of magnitude.

That is why preventing rebuilds in common cases is huge to me, especially given how you can nest trees inside one another in Broccoli: a rebuild in one tree can force a later tree to rebuild, which can trigger another rebuild, and so on.

That brings up another useful point: what if a filter/plugin's output is identical to that of a previous rebuild even though the input was different? Content-based caching deals very well with this; mtime caching less so (unless you can ensure that every plugin is written very carefully, uses its cache well, and preserves mtimes in all possible scenarios).

Whew, so those are my thoughts. Thanks for listening. And I'm glad to talk more about our environment/team/project sizes if that is helpful. I realize that our slightly more "enterprise-y" (shudder) use-cases can be pretty different than some are used to.


ps: Maybe there is a way to try to save multiple "cache states" with a file's mtime and size stats, but I can't think of a way to do it well (or without lots of complicated hoops to ensure mtimes get reset to exact previous values).

pps: I do realize that content-based caching is more complicated, requires sane heuristics so the cache doesn't grow forever, etc. But IMHO, this is a case where that complexity is well worth it, at least at the scale and complexity I'm dealing with (and that other large projects/engineering teams are dealing with).
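ppps: One way to picture the difference (a hypothetical sketch, not actual broccoli code):

```js
// Hypothetical content-addressed cache: many previous states can coexist,
// so undoing a change or switching back to an earlier branch hits an old
// entry instead of triggering a rebuild.
var buildCache = {};

function buildWithCache(contentDigest, build) {
  if (!(contentDigest in buildCache)) {
    buildCache[contentDigest] = build(); // rebuild only for genuinely new content
  }
  return buildCache[contentDigest];
}

// An mtime-based cache is effectively a single slot: any change evicts it,
// even a change back to content that was built moments ago.
var lastBuild = null;
```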

Plus, cleaned it up by passing an options object around instead of a direct reference to the digest cache. This also made it easier for me to attach other things to the options object, like cumulative file I/O duration, for basic profiling.
g13013 commented Oct 8, 2014

@timmfin I'm not sure this is an issue with broccoli itself; it's a plugin issue. relativePath, mtime, and size guarantee a fast check. I think it's the plugin's responsibility to implement a file cache.

timmfin commented Oct 8, 2014

@g13013 yes, I agree about this being a plugin concern. I didn't even realize that the Broccoli watcher uses it internally.

However, I wanted to at least try to be ambitious enough to make this the default and/or recommended hashing mechanism for plugins. Sure, we could implement this as another method or in a different module, but then all existing plugins would need to learn about the change and adopt it to take advantage.

That being said, I do realize that having separate hashing functions (and/or additional forks of broccoli-filter and broccoli-caching-writer that use a content-based hash) may very well be the most sensible path forward. I can take that route if preferred.

ps: I still think it would be a good idea to have a conversation about what the default expected plugin behavior should be (mtime vs. content hashing, etc.).

g13013 commented Oct 8, 2014

@timmfin I agree with you that at least broccoli-writer should export some API for dealing with files.

@rwjblue an example would be to export a file walk/stat/copy/move API that implements the best practices Broccoli recommends.
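Something like this exported surface is what I have in mind (purely illustrative; the names and bodies are just my assumptions about what such an API could offer):

```js
var fs = require('fs');
var path = require('path');

// Purely illustrative sketch; none of these helpers exist today.
function walk(dir) {
  var results = [];
  fs.readdirSync(dir).forEach(function (entry) {
    var fullPath = path.join(dir, entry);
    if (fs.statSync(fullPath).isDirectory()) {
      results = results.concat(walk(fullPath)); // recurse into subdirectories
    } else {
      results.push(fullPath);
    }
  });
  return results;
}

function copy(src, dest) {
  fs.writeFileSync(dest, fs.readFileSync(src));
}

function move(src, dest) {
  fs.renameSync(src, dest);
}

module.exports = { walk: walk, copy: copy, move: move };
```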

…content

* upstream/master:
  Release version 0.2.6
  Add node-glob lockdown to CHANGELOG.md.
  Lock glob to specific version.
timmfin commented Apr 22, 2015

Btw, I updated this to make the content-based cache optional and off by default. To use it, you can call hashTreeWithContents or pass the { hashContent: true } option to hashTree (done in this commit).

(Though you should probably pass in the digestCache option as well. It's an optimization so that a file's content digest is only regenerated when its mtime has changed, preventing unnecessary file reads.)
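Usage would look roughly like this (assuming the option and function names from this branch; exact signatures may differ):

```js
var helpers = require('broccoli-kitchen-sink-helpers');

// Keep the digest cache around across rebuilds so files whose mtime and
// size are unchanged are never re-read.
var digestCache = {};

var hash = helpers.hashTree('some/dir', {
  hashContent: true,
  digestCache: digestCache
});

// Or via the convenience wrapper added in this branch:
var sameHash = helpers.hashTreeWithContents('some/dir', { digestCache: digestCache });
```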

timmfin added a commit to timmfin/broccoli-filter that referenced this pull request Apr 22, 2015
…ter-emit-symlinks-and-content-cache

* commit '6f864b965d86f2532cb9b684e5703c2b793af386':
  Update to use optional content-based caching (via my fork of broccoli-kitchen-sink-helpers: broccolijs/broccoli-kitchen-sink-helpers#25)
  Tests (originally from d075ee0)
  Fix issue where symlinks pointing to the outputDir were broken (in 0.13.x, because the outputDir would be deleted after every build).
  Merge remote-tracking branch 'upstream/master' into make-filter-emit-symlinks

Conflicts:
	index.js
	package.json
	tests/index.js
Needed to be able to correctly set initialRelativePath when hashing individual files (and not starting from a directory). Otherwise, the relative path will not be correct by the time it gets to digestOfFileContents.