Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AC hit rate metrics with prometheus labels #350

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ulrfa
Copy link
Contributor

@ulrfa ulrfa commented Sep 16, 2020

This is a draft, not ready to be merged. I wait with test cases until getting feedback about the main functionality.

Summary of what is added by this commit:

  • Prometheus counter for cache hit ratio of only AC requests.
  • Support for prometheus labels based on custom HTTP and gRPC headers.

Cache hit ratio for CAS entries is easily misinterpreted. Example:

A typical action cache hit often involves 3 or more HTTP requests:

GET AC 200
GET CAS 200 (.o file)
GET CAS 200 (.d file)
...

But a cache miss for the same action is typically a single HTTP request:

GET AC 404

The ratio between all HTTP GET 200 vs HTTP GET 404 above does not represent
the cache hit ratio experienced by the user for actions. The ratio of only
AC requests is easier to reason about, especially when AC requests checks
existence of CAS dependencies.

The number of AC hits and misses can be directly compared against numbers
printed in the end of each build by bazel client. And against other
prometheus counters produced by remote execution systems for executed actions.

An understanding about the reason for cache misses is necessary to improve the
cache hit ratio. It could be that the system has been configured in a way
that prevent artifacts from being reused between different OS. Or that the
cache is only populated by CI jobs on master, potentially resulting in cache
misses for other users, etc. It becomes easier to notice such patterns, if
cache hit ratio could be calculated for different categories of builds.
Such categories can be set as custom headers via bazel flags
--remote_header=branch=master and applied as prometheus labels. Mapping of
headers to prometheus labels are controlled in bazel-remote's config file.

The ratio between cache uploads and cache misses is also relevant, as an
view about which categories are not uploading their result. The ratio of cache
uploads can also indicate if much is uploaded but seldom requested. E.g. does it
make sense to populate central caches from interactive builds or only from CI?

Categories and custom headers, could also be set for an overview about:

  • Bazel versions using a cache instance?
  • How much separate organizations are using a cache instance?
  • From which network traffic originates?
  • Which products are built using the cache?
  • If the traffic comes via proxy adding its own headers?
  • Distinguish dummy requests for monitoring the cache, from real requests?
  • ...

This is a draft, not ready to be merged.

Summary of what is added by this commit:

 - Prometheus counter for cache hit ratio of only AC requests.
 - Support for prometheus labels based on custom HTTP and gRPC headers.

Cache hit ratio for CAS entries is easily misinterpreted. Example:

  A typical action cache hit often involves 3 or more HTTP requests:

    GET AC 200
    GET CAS 200 (.o file)
    GET CAS 200 (.d file)
    ...

  But a cache miss for the same action is typically a single HTTP request:

    GET AC 404

The ratio between all HTTP GET 200 vs HTTP GET 404 above does not represent
the cache hit ratio experienced by the user for actions. The ratio of only
AC requests is easier to reason about, especially when AC requests checks
existence of CAS dependencies.

The number of AC hits and misses can be directly compared against numbers
printed in the end of each build by bazel client. And against other
prometheus counters produced by remote execution systems for executed actions.

An understanding about the reason for cache misses is necessary to improve the
cache hit ratio. It could be that the system has been configured in a way
that prevent artifacts from being reused between different OS. Or that the
cache is only populated by CI jobs on master, potentially resulting in cache
misses for other users, etc. It becomes easier to notice such patterns, if
cache hit ratio could be calculated for different categories of builds.
Such categories can be set as custom headers via bazel flags
--remote_header=branch=master and applied as prometheus labels. Mapping of
headers to prometheus labels are controlled in bazel-remote's config file.

The ratio between cache uploads and cache misses is also relevant, as an
view about which categories are not uploading their result. The ratio of cache
uploads can also indicate if much is uploaded but seldom requested. E.g. does it
make sense to populate central caches from interactive builds or only from CI?

Categories and custom headers, could also be set for an overview about:
 - Bazel versions using a cache instance?
 - How much separate organizations are using a cache instance?
 - From which network traffic originates?
 - Which products are built using the cache?
 - If the traffic comes via proxy adding its own headers?
 - Distinguish dummy requests for monitoring the cache, from real requests?
 - ...
Copy link
Collaborator

@mostynb mostynb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will need to set aside some time to look through this properly (perhaps tomorrow, but more likely monday).

I agree that AC hit rates are more useful than CAS or combined AC+CAS hit rates- I wonder if it would be sufficient to modify the existing metrics in the disk cache layer?

@@ -21,6 +21,8 @@ const (
// used for HTTP when running with the --disable_http_ac_validation
// commandline flag.
RAW

UNKNOWN
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that we need this- if we can't figure out the blob type that's a probably a basic error that needs fixing on the client side, not something the server needs metrics for IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is not relevant for server side metrics and it does not end up in the metrics since only AC is handled. The reason for adding it is only to have something to pass in here:

bazel-remote/server/http.go

Lines 222 to 225 in 25e244e

kind, hash, instance, err := parseRequestURL(r.URL.Path, h.validateAC)
if err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
h.logResponse(http.StatusBadRequest, r, cache.UNKNOWN)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we handle metrics separately from logging, then we should be able to drop this.

Comment on lines +227 to +241
# Allows mapping HTTP and gRPC headers to prometheus
# labels. Headers can be set by bazel client as:
# --remote_header=os=ubuntu18-04. Not all counters are
# affected.
#metrics:
# categories:
# os:
# - rhel7
# - rhel8
# - ubuntu16-04
# - ubuntu18-04
# branch:
# - master
# user:
# - ci
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it makes more sense to track this on the client side. eg the client makes a build, extracts build statistics, then uploads the data somewhere for tracking/analysis. Fixes generally need to happen on the client side anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning is that:

  1. It is convenient to have this information in same prometheus databas as all other cache and remote execution related metrics, to correlate them and include in same dashboard. And since promethues actively fetches, instead of passively receive pushes, a more complex solution with proxy in between, would be needed, if client would push.
  2. Adding --remote_header flags is much easier than somehow extracting build statistics and send separately to a server. I’m not sure if BES could be used, but my company don’t have such infrastructure in place currently.
  3. The client can’t provide statistics per cache instance, when there is an hierarcy of caches. E.g. bazel-remote instances on localhosts, as proxy for central bazel-remote instance. (However that use case might also require a proxy, a load balancer, or whatever is between, to add own headers or forward headers from client)

There is a limit on what information could be stored as prometheus labels, without causing too many time series. For some use cases it makes more sense to aggregate information on client side until after build finish, and then push to Elasticsearch, or similar. There are pros and cons with both ways, and I think they can complement each other.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that reasoning. If you want highly detailed build statistics, doing tracking client side makes sense. However, I want a general picture of the health of a lot of bazel clients using a set of caches (questions like: what category of clients just started causing unexpectedly high load?)

@ulrfa
Copy link
Contributor Author

ulrfa commented Sep 17, 2020

I will need to set aside some time to look through this properly (perhaps tomorrow, but more likely monday).

Thanks Mostyn! No problem waiting to monday.

I agree that AC hit rates are more useful than CAS or combined AC+CAS hit rates- I wonder if it would be sufficient to modify the existing metrics in the disk cache layer?

If the existing counters are modified (also or instead), then consider:

  • If validating is enabled, then report AC hit only if all referenced CAS entries are found.
  • If counter is incremented in disk.GetValidatedActionResult, then disk.Get should not also increment it for the same request. (One of the reasons to why I choosed to increment from http.go/grpc_ac.go instead of disk.go).
  • Also count uploads.
  • CAS or CAS+AC statistics might still also be interesting in some rare scenarios? (but I don't look at them myself)

Comment on lines +227 to +241
# Allows mapping HTTP and gRPC headers to prometheus
# labels. Headers can be set by bazel client as:
# --remote_header=os=ubuntu18-04. Not all counters are
# affected.
#metrics:
# categories:
# os:
# - rhel7
# - rhel8
# - ubuntu16-04
# - ubuntu18-04
# branch:
# - master
# user:
# - ci
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that reasoning. If you want highly detailed build statistics, doing tracking client side makes sense. However, I want a general picture of the health of a lot of bazel clients using a set of caches (questions like: what category of clients just started causing unexpectedly high load?)

// It would have been nice if bazel-remote could reload the set
// of white listed values from updated configuration file, by
// SIGHUP signal instead of having to restart bazel-remote.
return "other"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should be a word that gives more of a hint that it's masked a real user defined header value. I'm not sure of a good one, though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please tell if you find a good word!

I agree that ”other” can be confusing for someone trying to set a header.

But I hope the word ”other” would make sense for most people looking at a dashboard, without knowing that there is a concept of allowed values involved.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this function return an error if the label isn't allowed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean reject the http request with an 4xx error? I think that would be too strict. See my long comment from today where I’m trying to elaborate on use cases.

Copy link
Collaborator

@mostynb mostynb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two high-level comments that don't have an obvious line to be added:

  1. I agree that the AC hit rate is the most interesting metric, especially if we're doing validation/completeness checking. If we're not doing those checks then maybe the AC+CAS hit rate is more interesting? Should we just track both AC and CAS separately for flexibility? That way it might be possible to extract useful data in more scenarios.

  2. I think we should break this down into at least a few steps (no guarantees that we'll decide to go forward with each stage):

a) Replace the existing hit rate metrics with just AC or separate AC and CAS metrics as discussed above.
b) Add support for client-specified labels (while doing so in a way that will make the next step easier, eg keeping these in a new/separate config file).
c) Add support for updating the set of allowed labels without restarting bazel-remote. This could get complicated, we should spend some time thinking about this before attempting to implement anything.

categoryValues := make(map[string]map[string]struct{})

if config != nil && config.Categories != nil {
for categoryName, whiteListedValues := range config.Categories {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to use the term "allowedValues", it's more inclusive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, good point!

utils/metrics/metrics.go Show resolved Hide resolved
utils/metrics/metrics.go Show resolved Hide resolved
}

type metrics struct {
categoryValues map[string]map[string]struct{}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should consider simplifying this, and just use a list of allowed labels? eg ubuntu1804_ci_trunk ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn’t there a risk that would complicate it rather than simplifying?
I.e. introducing special cases like: ”ubuntu18” when only ”ubuntu1804” is allowed, handling underscore in values, etc?

However, I’m certainly not a fan of the golang syntax map[string]struct{} for a set of strings...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand this map, how would it help if a client says ”ubuntu18” instead of ”ubuntu1804” ?

I figure that a list is simpler to implement, and probably easier to understand, at the cost of needing to enumerate all possible labels. But since we don't want too many labels, that number shouldn't be large anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I missinterpreted your example of "ubuntu1804_ci_trunk" as a string with underscore as separator. That's why I mentioned handling of underscore etc.

categoryValues is a map, with label names/header names (e.g. "os" and "user") as keys. The keys are mapped to values by the map, and each such value is a set of strings, containing allowed values for that label/header (e.g. {"ubuntu18-04", "ubuntu16-04"})

Perhaps categoryValues isn't a good name for such a map?

See also my long comment from today where I’m trying to elaborate on use cases.

Does it make sense?

@ulrfa
Copy link
Contributor Author

ulrfa commented Sep 22, 2020

If we're not doing those checks then maybe the AC+CAS hit rate is more interesting?

Personally I would be most interested in only AC, regardless of validation/completeness checking, since I think it is too complex to draw conclusions from AC+CAS or CAS hit rate. But others might see use cases that I don't realize.

Should we just track both AC and CAS separately for flexibility? That way it might be possible to extract useful data in more scenarios.

I don't have anything against that. But I suggest not adding client specific labels when counting CAS requests, since I'm afraid of potential performance overhead of that. E.g. there is a huge amount of CAS HEAD requests for remote execution scenarios.

a) Replace the existing hit rate metrics with just AC or separate AC and CAS metrics as discussed above.

In disk.go or http.go/grpc*.go?

b) Add support for client-specified labels (while doing so in a way that will make the next step easier, eg keeping these in a new/separate config file).

In disk.go or http.go/grpc*.go?

c) Add support for updating the set of allowed labels without restarting bazel-remote. This could get complicated, we should spend some time thinking about this before attempting to implement anything.

I agree that it could get complicated and we should wait with c. But I think it should be considered also for some more parmeters, see #352. Should all parameters allowed to be reloaded, be placed in a separate file?

@ulrfa
Copy link
Contributor Author

ulrfa commented Sep 24, 2020

I’m elaborating about use cases with examples:

Bazel clients configured to always add flags like:

--remote_header=user=$USER
--remote_header=host=$HOST
--remote_header=os=\`get_os_name.sh\`

And bazel clients invoked via CI can in addition add headers like:

--remote_header=branch=$BRANCH_NAME
--remote_header=product=$PRODUCT_NAME
--remote_header=pipeline=$CI_PIPELINE_NAME

The value set for user and branch is not bounded and could therefore not be stored as is in Prometheus. But in many cases it is good enough to distinguish between if it was ci user or non-ci user, or if it was master or non-master branch. That could be achieved by configuring bazel-remote with:

branch:
   - master
user:
   - ci

and any non allowed branches or users would get the Prometheus label value “other”.

The number of alternative operating systems (“os” header) is typically bounded, and allowed values could be listed in a bazel-remote configuration file like:

os:
   - rhel7
   - rhel8
   - ubuntu16-04
   - ubuntu18-04

From time to time operating system comes and goes, and then the list in the configuration could be updated. It does not happen very frequently, so it might be acceptable having to restart bazel-remote, but it would be preferable if possible without restart. And if there are a few rare ones, it could be OK to have them reported by bazel-remote as “other”.

The set of labels and header names (branch, user, os, …) would typically be more static, than the set of values each of them might have (rhel7, rhel8, …) . Therefore I think it is OK having to restart bazel-remote if the set of header names changes, but preferably not when their allowed set of values changes.

Example of Prometheus queries on the metrics, using the labels.

# Total number of AC requests from user ci on rhel8
sum (bazel_remote_incoming_requests_completed_total{kind="AC",user="ci",os="rhel8"})

# Total number of AC requests grouped by user "ci" and "other"
sum by (user) (bazel_remote_incoming_requests_completed_total{kind="AC"})

# Total number of AC requests grouped by: "rhel7", "rhel8", "ubuntu16-04", "ubuntu18-04" and "other"
sum by (os) (bazel_remote_incoming_requests_completed_total{kind="AC"})

# Total number of AC requests grouped operating systems, but only including those performed by user ci.
sum by (os) (bazel_remote_incoming_requests_completed_total{kind="AC",user="ci"})

# Total number of AC cache misses
sum (bazel_remote_incoming_requests_completed_total{kind="AC",method=~"GET|HEAD",status="NotFound"})

# Total number of AC cache misses, grouped per OS
sum by (os) (bazel_remote_incoming_requests_completed_total{kind="AC",method=~"GET|HEAD",status="NotFound"})

@mostynb
Copy link
Collaborator

mostynb commented Sep 25, 2020

If we're not doing those checks then maybe the AC+CAS hit rate is more interesting?

Personally I would be most interested in only AC, regardless of validation/completeness checking, since I think it is too complex to draw conclusions from AC+CAS or CAS hit rate. But others might see use cases that I don't realize.

Should we just track both AC and CAS separately for flexibility? That way it might be possible to extract useful data in more scenarios.

I don't have anything against that. But I suggest not adding client specific labels when counting CAS requests, since I'm afraid of potential performance overhead of that. E.g. there is a huge amount of CAS HEAD requests for remote execution scenarios.

Let's just report hit rate metrics for AC lookups to begin with.

a) Replace the existing hit rate metrics with just AC or separate AC and CAS metrics as discussed above.

In disk.go or http.go/grpc*.go?

If we're not differentiating between http and grpc cache lookups, I think it probably makes sense to do this in disk.go - unless there's something I'm missing?

b) Add support for client-specified labels (while doing so in a way that will make the next step easier, eg keeping these in a new/separate config file).

See above. It might be worth merging this step with (a), depending on the size of the changes.

c) Add support for updating the set of allowed labels without restarting bazel-remote. This could get complicated, we should spend some time thinking about this before attempting to implement anything.

I agree that it could get complicated and we should wait with c. But I think it should be considered also for some more parmeters, see #352. Should all parameters allowed to be reloaded, be placed in a separate file?

Good question. If possible I would prefer an upgrade path that does not require manual updating of config files (unless you want to make use of new features).

Which other parameters would we want to be reloadable? One argument in favour of putting reloadable settings in a separate file is that we don't need to figure out what to do if non-reloadable settings change. I suppose we could just log warnings that those settings have not been reloaded.

@ulrfa
Copy link
Contributor Author

ulrfa commented Sep 25, 2020

I become inspired by #355 and will play around with what happens if disk.go would implement an interface like:

type CasAcCache interface {
    Put(kind EntryKind, hash string, size int64, rdr io.Reader, context RequestContext) (error)  // TODO or io.ReadCloser ?
    Get(kind EntryKind, hash string, size int64, context RequestContext) (io.ReadCloser, int64, error)
    Contains(kind EntryKind, hash string, size int64, context RequestContext) (bool, int64)
    GetValidatedActionResult(hash string, context RequestContext) (*pb.ActionResult, []byte, error)
}

Then http.go and grpc*.go could use that interface instead of accessing disk.go directly.

And it would be possible to have additional implementations of CasAcCache interface, that can wrap disk.go and handle metrics.

That should avoid the problem about if counters should be placed in disk.go or in http.go/grpc*.go.

And it could allow alternative implementations for non-prometheus metrics also.

RequestContext would provide some way for accessing http/grpc headers.

ulrfa added a commit to ulrfa/bazel-remote that referenced this pull request Sep 26, 2020
This is a draft, not ready to be merged. I wait with test cases until
getting feedback about the main functionality.

Same functionality as in buchgr#350

 - Prometheus counter for cache hit ratio of only AC requests.

 - Support for prometheus labels based on custom HTTP and gRPC headers.

but implemented in an alternative way:

 - disk.io implements two interfaces: cache.CasAcCache, cache.Stats

 - cache.metricsdecorator is decorator for cache.CasAcCache and
   provides prometheus metrics.

Pros with this alternative implementation:

 - Should allow non-prometheus metrics as requested in
   buchgr#355

 - Avoid the question about if the counters should be placed in
   disk.go or http.go/grpc*.go. If placing in disk.go, there are
   issues about how to avoid incrementing counter twice for the
   same request (both in Get and in GetValidatedActionResult)
   and at the same time count found AC but missing CAS, as cache miss.

 - Makes headers available also for logic in disk.go, which could be
   useful for other functionality in the future.

 - Metrics can be separated from logging, and still not require
   injecting counter increment invocations in tons of places.
   Incrementing from a few well defined places minimize the risk
   for bugs in the metrics.
@ulrfa
Copy link
Contributor Author

ulrfa commented Sep 27, 2020

In disk.go or http.go/grpc*.go?

If we're not differentiating between http and grpc cache lookups, I think it probably makes sense to do this in disk.go - unless there's something I'm missing?

Issues if placing in disk.go:

  • How to avoid incrementing a counter twice for the same request? (incrementing both in disk.Get and disk.GetValidatedActionResult for the same AC request)

  • Needs access to gRPC/HTTP headers in disk.go

I tried an alternative implementation in #358, with metrics in separate decorator instead.

Which other parameters would we want to be reloadable? One argument in favour of putting reloadable settings in a separate file is that we don't need to figure out what to do if non-reloadable settings change. I suppose we could just log warnings that those settings have not been reloaded.

Use cases listed in #352. I'm mainly interested in reloadable settings for:

  • metrics configuration (add additional allowed values)
  • http proxy settings (url and perhaps others)
  • enable/disable access log (no such configuration parameter currently)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants