Add AC hit rate metrics with prometheus labels #350
Conversation
This is a draft, not ready to be merged.

Summary of what is added by this commit:
- Prometheus counter for the cache hit ratio of AC requests only.
- Support for prometheus labels based on custom HTTP and gRPC headers.

The cache hit ratio for CAS entries is easily misinterpreted. For example, a typical action cache hit often involves 3 or more HTTP requests:

GET AC 200
GET CAS 200 (.o file)
GET CAS 200 (.d file)
...

But a cache miss for the same action is typically a single HTTP request:

GET AC 404

The ratio between all HTTP GET 200 and HTTP GET 404 responses above does not represent the cache hit ratio experienced by the user for actions. The ratio of AC requests alone is easier to reason about, especially when AC requests check the existence of CAS dependencies. The number of AC hits and misses can be compared directly against the numbers printed at the end of each build by the bazel client, and against other prometheus counters produced by remote execution systems for executed actions.

An understanding of the reasons for cache misses is necessary to improve the cache hit ratio. It could be that the system has been configured in a way that prevents artifacts from being reused between different operating systems, or that the cache is only populated by CI jobs on master, potentially resulting in cache misses for other users, etc. Such patterns become easier to notice if the cache hit ratio can be calculated for different categories of builds. Categories can be set as custom headers via bazel flags such as --remote_header=branch=master and applied as prometheus labels. The mapping of headers to prometheus labels is controlled in bazel-remote's config file.

The ratio between cache uploads and cache misses is also relevant, as a view of which categories are not uploading their results. The ratio of cache uploads can also indicate whether much is uploaded but seldom requested, e.g. does it make sense to populate central caches from interactive builds, or only from CI?
Categories and custom headers could also be set for an overview of:
- Which bazel versions are using a cache instance?
- How much separate organizations are using a cache instance?
- From which networks traffic originates?
- Which products are built using the cache?
- Whether the traffic comes via a proxy adding its own headers?
- Distinguishing dummy requests for monitoring the cache from real requests?
- ...
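A minimal sketch of the proposed mechanism, using only the standard library (a real implementation would use a prometheus.CounterVec from client_golang; the function names, label set, and allowed values below are my illustrative assumptions, not code from this PR):

```go
package main

import (
	"fmt"
	"net/http"
)

// allowedValues mirrors the proposed config: label name -> set of
// allowed values. Anything else is reported as "other" to keep the
// number of prometheus time series bounded. (Values are illustrative.)
var allowedValues = map[string]map[string]struct{}{
	"os":     {"rhel7": {}, "rhel8": {}, "ubuntu16-04": {}, "ubuntu18-04": {}},
	"branch": {"master": {}},
	"user":   {"ci": {}},
}

// labelValue returns the header value if it is allowed, else "other".
func labelValue(category, value string) string {
	if _, ok := allowedValues[category][value]; ok {
		return value
	}
	return "other"
}

// acRequests stands in for a prometheus CounterVec with labels
// (status, os, branch, user).
var acRequests = map[string]int{}

// countACRequest increments the counter for one AC lookup, deriving
// label values from the request headers set via --remote_header.
func countACRequest(status string, hdr http.Header) {
	key := fmt.Sprintf("status=%s,os=%s,branch=%s,user=%s",
		status,
		labelValue("os", hdr.Get("os")),
		labelValue("branch", hdr.Get("branch")),
		labelValue("user", hdr.Get("user")))
	acRequests[key]++
}

func main() {
	hdr := http.Header{}
	hdr.Set("os", "ubuntu18-04")
	hdr.Set("branch", "feature-x") // not in the allowed list -> "other"
	hdr.Set("user", "ci")
	countACRequest("hit", hdr)
	countACRequest("miss", hdr)
	countACRequest("hit", hdr)
	for k, v := range acRequests {
		fmt.Println(k, v)
	}
}
```

Mapping unknown header values to "other" is what keeps the label cardinality bounded, which is essential for prometheus.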
I will need to set aside some time to look through this properly (perhaps tomorrow, but more likely Monday).
I agree that AC hit rates are more useful than CAS or combined AC+CAS hit rates; I wonder if it would be sufficient to modify the existing metrics in the disk cache layer?
@@ -21,6 +21,8 @@ const (
	// used for HTTP when running with the --disable_http_ac_validation
	// commandline flag.
	RAW

	UNKNOWN
I'm not sure that we need this; if we can't figure out the blob type, that's probably a basic error that needs fixing on the client side, not something the server needs metrics for IMO.
I agree it is not relevant for server-side metrics, and it does not end up in the metrics since only AC is handled. The reason for adding it is only to have something to pass in here:
Lines 222 to 225 in 25e244e
kind, hash, instance, err := parseRequestURL(r.URL.Path, h.validateAC)
if err != nil {
	http.Error(w, err.Error(), http.StatusBadRequest)
	h.logResponse(http.StatusBadRequest, r, cache.UNKNOWN)
If we handle metrics separately from logging, then we should be able to drop this.
# Allows mapping HTTP and gRPC headers to prometheus
# labels. Headers can be set by bazel client as:
# --remote_header=os=ubuntu18-04. Not all counters are
# affected.
#metrics:
#  categories:
#    os:
#      - rhel7
#      - rhel8
#      - ubuntu16-04
#      - ubuntu18-04
#    branch:
#      - master
#    user:
#      - ci
I wonder if it makes more sense to track this on the client side, e.g. the client runs a build, extracts build statistics, then uploads the data somewhere for tracking/analysis. Fixes generally need to happen on the client side anyway.
My reasoning is that:
- It is convenient to have this information in the same prometheus database as all other cache and remote execution related metrics, to correlate them and include them in the same dashboards. And since prometheus actively scrapes instead of passively receiving pushes, a more complex solution with a proxy in between would be needed if the client pushed.
- Adding --remote_header flags is much easier than somehow extracting build statistics and sending them separately to a server. I'm not sure if BES could be used, but my company doesn't have such infrastructure in place currently.
- The client can't provide statistics per cache instance when there is a hierarchy of caches, e.g. bazel-remote instances on localhosts acting as proxies for a central bazel-remote instance. (However, that use case might also require whatever proxy or load balancer is in between to add its own headers or forward headers from the client.)
There is a limit on how much information can be stored as prometheus labels without causing too many time series. For some use cases it makes more sense to aggregate information on the client side until the build finishes, and then push it to Elasticsearch or similar. There are pros and cons to both approaches, and I think they can complement each other.
I agree with that reasoning. If you want highly detailed build statistics, doing tracking client side makes sense. However, I want a general picture of the health of a lot of bazel clients using a set of caches (questions like: what category of clients just started causing unexpectedly high load?)
Thanks Mostyn! No problem waiting until Monday.
If the existing counters are modified (also or instead), then consider:
# Allows mapping HTTP and gRPC headers to prometheus
# labels. Headers can be set by bazel client as:
# --remote_header=os=ubuntu18-04. Not all counters are
# affected.
#metrics:
#  categories:
#    os:
#      - rhel7
#      - rhel8
#      - ubuntu16-04
#      - ubuntu18-04
#    branch:
#      - master
#    user:
#      - ci
// It would have been nice if bazel-remote could reload the set
// of white listed values from updated configuration file, by
// SIGHUP signal instead of having to restart bazel-remote.
return "other"
I wonder if this should be a word that gives more of a hint that it has masked a real user-defined header value. I'm not sure of a good one, though.
Please tell me if you find a good word!
I agree that "other" can be confusing for someone trying to set a header.
But I hope the word "other" would make sense for most people looking at a dashboard, without knowing that there is a concept of allowed values involved.
Should this function return an error if the label isn't allowed?
Do you mean reject the HTTP request with a 4xx error? I think that would be too strict. See my long comment from today where I'm trying to elaborate on use cases.
I have two high-level comments that don't have an obvious line to be added:
- I agree that the AC hit rate is the most interesting metric, especially if we're doing validation/completeness checking. If we're not doing those checks then maybe the AC+CAS hit rate is more interesting? Should we just track both AC and CAS separately for flexibility? That way it might be possible to extract useful data in more scenarios.
- I think we should break this down into at least a few steps (no guarantees that we'll decide to go forward with each stage):
  a) Replace the existing hit rate metrics with just AC, or separate AC and CAS metrics as discussed above.
  b) Add support for client-specified labels (while doing so in a way that will make the next step easier, e.g. keeping these in a new/separate config file).
  c) Add support for updating the set of allowed labels without restarting bazel-remote. This could get complicated; we should spend some time thinking about it before attempting to implement anything.
categoryValues := make(map[string]map[string]struct{})

if config != nil && config.Categories != nil {
	for categoryName, whiteListedValues := range config.Categories {
Let's try to use the term "allowedValues", it's more inclusive.
yes, good point!
}

type metrics struct {
	categoryValues map[string]map[string]struct{}
I wonder if we should consider simplifying this, and just use a list of allowed labels? e.g. ubuntu1804_ci_trunk?
Isn't there a risk that this would complicate it rather than simplify it? I.e. introducing special cases like "ubuntu18" when only "ubuntu1804" is allowed, handling underscores in values, etc.?
However, I'm certainly not a fan of the golang syntax map[string]struct{} for a set of strings...
I don't quite understand this map; how would it help if a client says "ubuntu18" instead of "ubuntu1804"?
I figure that a list is simpler to implement, and probably easier to understand, at the cost of needing to enumerate all possible labels. But since we don't want too many labels, that number shouldn't be large anyway.
I think I misinterpreted your example of "ubuntu1804_ci_trunk" as a string with underscore as separator. That's why I mentioned handling of underscores etc.
categoryValues is a map with label/header names (e.g. "os" and "user") as keys. Each key maps to a set of strings containing the allowed values for that label/header (e.g. {"ubuntu18-04", "ubuntu16-04"}).
Perhaps categoryValues isn't a good name for such a map?
See also my long comment from today where I’m trying to elaborate on use cases.
Does it make sense?
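To make the structure concrete, here is a small sketch of how such a map could be built from the config (function and variable names are my own illustration, not code from the PR); map[string]struct{} is the usual Go idiom for a set of strings, since struct{} occupies zero bytes and only the keys matter:

```go
package main

import "fmt"

// buildCategoryValues converts the config's lists of allowed values
// into sets, so that membership checks are O(1) lookups.
func buildCategoryValues(categories map[string][]string) map[string]map[string]struct{} {
	categoryValues := make(map[string]map[string]struct{})
	for name, allowed := range categories {
		set := make(map[string]struct{}, len(allowed))
		for _, v := range allowed {
			set[v] = struct{}{}
		}
		categoryValues[name] = set
	}
	return categoryValues
}

func main() {
	cv := buildCategoryValues(map[string][]string{
		"os":   {"ubuntu16-04", "ubuntu18-04"},
		"user": {"ci"},
	})
	// Only exact matches are allowed; "ubuntu18" is not "ubuntu18-04",
	// so such a client would be reported under the label value "other".
	_, ok := cv["os"]["ubuntu18"]
	fmt.Println(ok) // false
}
```

Note that lookups are exact-match only: a near-miss like "ubuntu18" is not special-cased, it simply falls through to "other".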
Personally I would be most interested in only AC, regardless of validation/completeness checking, since I think it is too complex to draw conclusions from AC+CAS or CAS hit rate. But others might see use cases that I don't realize.
I don't have anything against that. But I suggest not adding client-specific labels when counting CAS requests, since I'm afraid of the potential performance overhead. E.g. there is a huge number of CAS HEAD requests in remote execution scenarios.
In disk.go or http.go/grpc*.go?
In disk.go or http.go/grpc*.go?
I agree that it could get complicated and we should hold off on (c). But I think it should also be considered for some more parameters, see #352. Should all parameters that are allowed to be reloaded be placed in a separate file?
I'm elaborating on use cases with examples. Bazel clients configured to always add flags like:
And bazel clients invoked via CI can in addition add headers like:
The value sets for user and branch are not bounded and can therefore not be stored as-is in Prometheus. But in many cases it is good enough to distinguish whether it was the ci user or a non-ci user, or the master branch or a non-master branch. That could be achieved by configuring bazel-remote with:
and any non-allowed branches or users would get the Prometheus label value "other". The number of alternative operating systems ("os" header) is typically bounded, and allowed values could be listed in a bazel-remote configuration file like:
From time to time operating systems come and go, and then the list in the configuration can be updated. That does not happen very frequently, so it might be acceptable to have to restart bazel-remote, but avoiding a restart would be preferable. And if there are a few rare ones, it could be OK to have them reported by bazel-remote as "other". The set of labels and header names (branch, user, os, ...) would typically be more static than the set of values each of them might have (rhel7, rhel8, ...). Therefore I think it is OK to have to restart bazel-remote if the set of header names changes, but preferably not when their allowed set of values changes. Example of Prometheus queries on the metrics, using the labels.
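The queries themselves did not survive this scrape. As an illustration only, assuming a counter named bazel_remote_ac_requests_total with a status label plus the category labels discussed above (the metric and label names are my assumption, not taken from the PR), an AC hit-ratio query could look like:

```promql
# Hypothetical: AC hit ratio for CI builds on master, over the last day.
sum(rate(bazel_remote_ac_requests_total{status="hit", user="ci", branch="master"}[1d]))
/
sum(rate(bazel_remote_ac_requests_total{user="ci", branch="master"}[1d]))
```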
Let's just report hit rate metrics for AC lookups to begin with.
If we're not differentiating between http and grpc cache lookups, I think it probably makes sense to do this in disk.go - unless there's something I'm missing?
See above. It might be worth merging this step with (a), depending on the size of the changes.
Good question. If possible I would prefer an upgrade path that does not require manual updating of config files (unless you want to make use of new features). Which other parameters would we want to be reloadable? One argument in favour of putting reloadable settings in a separate file is that we don't need to figure out what to do if non-reloadable settings change. I suppose we could just log warnings that those settings have not been reloaded.
I became inspired by #355 and will play around with what happens if disk.go implements an interface like:
Then http.go and grpc*.go could use that interface instead of accessing disk.go directly. And it would be possible to have additional implementations of the CasAcCache interface that wrap disk.go and handle metrics. That should avoid the question of whether counters should be placed in disk.go or in http.go/grpc*.go. And it could allow alternative implementations for non-prometheus metrics as well. RequestContext would provide some way of accessing http/grpc headers.
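A rough sketch of the decorator idea described above (the CasAcCache name comes from the comment, but the method signatures, the EntryKind type, and the in-memory stand-in for disk.go are my assumptions):

```go
package main

import "fmt"

// EntryKind distinguishes action cache entries from CAS blobs.
type EntryKind int

const (
	AC EntryKind = iota
	CAS
)

// CasAcCache is the hypothetical interface that disk.go would
// implement, and that http.go/grpc*.go would depend on.
type CasAcCache interface {
	Get(kind EntryKind, hash string) ([]byte, bool)
	Put(kind EntryKind, hash string, data []byte)
}

// metricsDecorator wraps any CasAcCache and counts AC hits/misses,
// leaving the wrapped cache unaware of metrics.
type metricsDecorator struct {
	wrapped          CasAcCache
	acHits, acMisses int
}

func (m *metricsDecorator) Get(kind EntryKind, hash string) ([]byte, bool) {
	data, ok := m.wrapped.Get(kind, hash)
	if kind == AC { // only AC lookups feed the hit-ratio metric
		if ok {
			m.acHits++ // a real implementation would increment a prometheus counter here
		} else {
			m.acMisses++
		}
	}
	return data, ok
}

func (m *metricsDecorator) Put(kind EntryKind, hash string, data []byte) {
	m.wrapped.Put(kind, hash, data)
}

// mapCache is a trivial in-memory stand-in for disk.go.
type mapCache struct{ m map[string][]byte }

func (c *mapCache) Get(kind EntryKind, hash string) ([]byte, bool) {
	data, ok := c.m[hash]
	return data, ok
}

func (c *mapCache) Put(kind EntryKind, hash string, data []byte) { c.m[hash] = data }

func main() {
	var cache CasAcCache = &metricsDecorator{wrapped: &mapCache{m: map[string][]byte{}}}
	cache.Put(AC, "abc", []byte("result"))
	_, hit := cache.Get(AC, "abc")
	_, miss := cache.Get(AC, "missing")
	fmt.Println(hit, miss) // true false
}
```

Because the decorator satisfies the same interface, it can be stacked or swapped out, which is what would allow non-prometheus metrics implementations as well.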
This is a draft, not ready to be merged. I'll wait with test cases until getting feedback about the main functionality.

Same functionality as in buchgr#350:
- Prometheus counter for cache hit ratio of only AC requests.
- Support for prometheus labels based on custom HTTP and gRPC headers.

but implemented in an alternative way:
- disk.go implements two interfaces: cache.CasAcCache and cache.Stats.
- cache.metricsdecorator is a decorator for cache.CasAcCache and provides prometheus metrics.

Pros of this alternative implementation:
- Should allow non-prometheus metrics as requested in buchgr#355.
- Avoids the question of whether the counters should be placed in disk.go or http.go/grpc*.go. If placed in disk.go, there are issues about how to avoid incrementing a counter twice for the same request (both in Get and in GetValidatedActionResult) while still counting a found AC with missing CAS dependencies as a cache miss.
- Makes headers available also for logic in disk.go, which could be useful for other functionality in the future.
- Metrics can be separated from logging, and still not require injecting counter increment invocations in tons of places. Incrementing from a few well-defined places minimizes the risk of bugs in the metrics.
Issues if placing in disk.go:
I tried an alternative implementation in #358, with metrics in separate decorator instead.
Use cases listed in #352. I'm mainly interested in reloadable settings for: