Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AC hit rate metrics with prometheus labels #350

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ go_library(
"//config:go_default_library",
"//server:go_default_library",
"//utils/idle:go_default_library",
"//utils/metrics:go_default_library",
"//utils/rlimit:go_default_library",
"@com_github_abbot_go_http_auth//:go_default_library",
"@com_github_grpc_ecosystem_go_grpc_prometheus//:go_default_library",
Expand Down
17 changes: 17 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -223,6 +223,23 @@ host: localhost

# If true, enable experimental remote asset API support:
#experimental_remote_asset_api: true

# Allows mapping HTTP and gRPC headers to prometheus
# labels. Headers can be set by bazel client as:
# --remote_header=os=ubuntu18-04. Not all counters are
# affected.
#metrics:
# categories:
# os:
# - rhel7
# - rhel8
# - ubuntu16-04
# - ubuntu18-04
# branch:
# - master
# user:
# - ci
Comment on lines +227 to +241
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it makes more sense to track this on the client side. eg the client makes a build, extracts build statistics, then uploads the data somewhere for tracking/analysis. Fixes generally need to happen on the client side anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning is that:

  1. It is convenient to have this information in same prometheus databas as all other cache and remote execution related metrics, to correlate them and include in same dashboard. And since promethues actively fetches, instead of passively receive pushes, a more complex solution with proxy in between, would be needed, if client would push.
  2. Adding --remote_header flags is much easier than somehow extracting build statistics and send separately to a server. I’m not sure if BES could be used, but my company don’t have such infrastructure in place currently.
  3. The client can’t provide statistics per cache instance, when there is an hierarcy of caches. E.g. bazel-remote instances on localhosts, as proxy for central bazel-remote instance. (However that use case might also require a proxy, a load balancer, or whatever is between, to add own headers or forward headers from client)

There is a limit on what information could be stored as prometheus labels, without causing too many time series. For some use cases it makes more sense to aggregate information on client side until after build finish, and then push to Elasticsearch, or similar. There are pros and cons with both ways, and I think they can complement each other.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that reasoning. If you want highly detailed build statistics, doing tracking client side makes sense. However, I want a general picture of the health of a lot of bazel clients using a set of caches (questions like: what category of clients just started causing unexpectedly high load?)


```

## Docker
Expand Down
7 changes: 6 additions & 1 deletion cache/cache.go
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,8 @@ const (
// used for HTTP when running with the --disable_http_ac_validation
// commandline flag.
RAW

UNKNOWN
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that we need this- if we can't figure out the blob type that's a probably a basic error that needs fixing on the client side, not something the server needs metrics for IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is not relevant for server side metrics and it does not end up in the metrics since only AC is handled. The reason for adding it is only to have something to pass in here:

bazel-remote/server/http.go

Lines 222 to 225 in 25e244e

kind, hash, instance, err := parseRequestURL(r.URL.Path, h.validateAC)
if err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
h.logResponse(http.StatusBadRequest, r, cache.UNKNOWN)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we handle metrics separately from logging, then we should be able to drop this.

)

func (e EntryKind) String() string {
Expand All @@ -30,7 +32,10 @@ func (e EntryKind) String() string {
if e == CAS {
return "cas"
}
return "raw"
if e == RAW {
return "raw"
}
return "unknown"
}

// Logger is designed to be satisfied by log.Logger.
Expand Down
8 changes: 8 additions & 0 deletions config/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,11 @@ type HTTPBackendConfig struct {
BaseURL string `yaml:"url"`
}

// Metrics stores configuration for prometheus metrics.
type Metrics struct {
Categories map[string][]string `yaml:"categories"`
}

// Config holds the top-level configuration for bazel-remote.
type Config struct {
Host string `yaml:"host"`
Expand All @@ -55,6 +60,7 @@ type Config struct {
DisableGRPCACDepsCheck bool `yaml:"disable_grpc_ac_deps_check"`
EnableACKeyInstanceMangling bool `yaml:"enable_ac_key_instance_mangling"`
EnableEndpointMetrics bool `yaml:"enable_endpoint_metrics"`
Metrics *Metrics `yaml:"metrics"`
ExperimentalRemoteAssetAPI bool `yaml:"experimental_remote_asset_api"`
HTTPReadTimeout time.Duration `yaml:"http_read_timeout"`
HTTPWriteTimeout time.Duration `yaml:"http_write_timeout"`
Expand All @@ -73,6 +79,7 @@ func New(dir string, maxSize int, host string, port int, grpcPort int,
disableGRPCACDepsCheck bool,
enableACKeyInstanceMangling bool,
enableEndpointMetrics bool,
metrics *Metrics,
experimentalRemoteAssetAPI bool,
httpReadTimeout time.Duration,
httpWriteTimeout time.Duration) (*Config, error) {
Expand All @@ -95,6 +102,7 @@ func New(dir string, maxSize int, host string, port int, grpcPort int,
DisableGRPCACDepsCheck: disableGRPCACDepsCheck,
EnableACKeyInstanceMangling: enableACKeyInstanceMangling,
EnableEndpointMetrics: enableEndpointMetrics,
Metrics: metrics,
ExperimentalRemoteAssetAPI: experimentalRemoteAssetAPI,
HTTPReadTimeout: httpReadTimeout,
HTTPWriteTimeout: httpWriteTimeout,
Expand Down
8 changes: 5 additions & 3 deletions main.go
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ import (
"github.com/buchgr/bazel-remote/config"
"github.com/buchgr/bazel-remote/server"
"github.com/buchgr/bazel-remote/utils/idle"
"github.com/buchgr/bazel-remote/utils/metrics"
"github.com/buchgr/bazel-remote/utils/rlimit"

grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
Expand Down Expand Up @@ -283,6 +284,7 @@ func main() {
ctx.Bool("disable_grpc_ac_deps_check"),
ctx.Bool("enable_ac_key_instance_mangling"),
ctx.Bool("enable_endpoint_metrics"),
nil,
ctx.Bool("experimental_remote_asset_api"),
ctx.Duration("http_read_timeout"),
ctx.Duration("http_write_timeout"),
Expand Down Expand Up @@ -311,6 +313,7 @@ func main() {

accessLogger := log.New(os.Stdout, "", logFlags)
errorLogger := log.New(os.Stderr, "", logFlags)
metrics := metrics.NewMetrics(c.Metrics)

var proxyCache cache.Proxy
if c.GoogleCloudStorage != nil {
Expand Down Expand Up @@ -344,8 +347,7 @@ func main() {
}

validateAC := !c.DisableHTTPACValidation
h := server.NewHTTPCache(diskCache, accessLogger, errorLogger, validateAC, c.EnableACKeyInstanceMangling, gitCommit)

h := server.NewHTTPCache(diskCache, accessLogger, errorLogger, metrics, validateAC, c.EnableACKeyInstanceMangling, gitCommit)
var htpasswdSecrets auth.SecretProvider
cacheHandler := h.CacheHandler
if c.HtpasswdFile != "" {
Expand Down Expand Up @@ -444,7 +446,7 @@ func main() {
validateAC,
c.EnableACKeyInstanceMangling,
enableRemoteAssetAPI,
diskCache, accessLogger, errorLogger)
diskCache, accessLogger, errorLogger, metrics)
if err3 != nil {
log.Fatal(err3)
}
Expand Down
1 change: 1 addition & 0 deletions server/BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ go_library(
"//cache:go_default_library",
"//cache/disk:go_default_library",
"//utils/idle:go_default_library",
"//utils/metrics:go_default_library",
"@com_github_abbot_go_http_auth//:go_default_library",
"@com_github_bazelbuild_remote_apis//build/bazel/remote/asset/v1:go_default_library",
"@com_github_bazelbuild_remote_apis//build/bazel/remote/execution/v2:go_default_library",
Expand Down
10 changes: 6 additions & 4 deletions server/grpc.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ import (

"github.com/buchgr/bazel-remote/cache"
"github.com/buchgr/bazel-remote/cache/disk"

"github.com/buchgr/bazel-remote/utils/metrics"
_ "github.com/mostynb/go-grpc-compression/snappy" // Register snappy
_ "github.com/mostynb/go-grpc-compression/zstd" // and zstd support.
)
Expand All @@ -39,6 +39,7 @@ type grpcServer struct {
errorLogger cache.Logger
depsCheck bool
mangleACKeys bool
metrics metrics.Metrics
}

// ListenAndServeGRPC creates a new gRPC server and listens on the given
Expand All @@ -48,27 +49,28 @@ func ListenAndServeGRPC(addr string, opts []grpc.ServerOption,
validateACDeps bool,
mangleACKeys bool,
enableRemoteAssetAPI bool,
c *disk.Cache, a cache.Logger, e cache.Logger) error {
c *disk.Cache, a cache.Logger, e cache.Logger, m metrics.Metrics) error {

listener, err := net.Listen("tcp", addr)
if err != nil {
return err
}

return serveGRPC(listener, opts, validateACDeps, mangleACKeys, enableRemoteAssetAPI, c, a, e)
return serveGRPC(listener, opts, validateACDeps, mangleACKeys, enableRemoteAssetAPI, c, a, e, m)
}

func serveGRPC(l net.Listener, opts []grpc.ServerOption,
validateACDepsCheck bool,
mangleACKeys bool,
enableRemoteAssetAPI bool,
c *disk.Cache, a cache.Logger, e cache.Logger) error {
c *disk.Cache, a cache.Logger, e cache.Logger, m metrics.Metrics) error {

srv := grpc.NewServer(opts...)
s := &grpcServer{
cache: c, accessLogger: a, errorLogger: e,
depsCheck: validateACDepsCheck,
mangleACKeys: mangleACKeys,
metrics: m,
}
pb.RegisterActionCacheServer(srv, s)
pb.RegisterCapabilitiesServer(srv, s)
Expand Down
12 changes: 12 additions & 0 deletions server/grpc_ac.go
Original file line number Diff line number Diff line change
Expand Up @@ -13,10 +13,12 @@ import (
pb "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"
"github.com/golang/protobuf/proto"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/metadata"
"google.golang.org/grpc/peer"
"google.golang.org/grpc/status"

"github.com/buchgr/bazel-remote/cache"
"github.com/buchgr/bazel-remote/utils/metrics"
)

var (
Expand Down Expand Up @@ -63,6 +65,7 @@ func (s *grpcServer) GetActionResult(ctx context.Context,
}
if rdr == nil || sizeBytes <= 0 {
s.accessLogger.Printf("%s %s %s", logPrefix, req.ActionDigest.Hash, "NOT FOUND")
s.incAcRequestMetrics(metrics.METHOD_GET, metrics.NOT_FOUND, ctx)
return nil, status.Error(codes.NotFound,
fmt.Sprintf("%s not found in AC", req.ActionDigest.Hash))
}
Expand All @@ -82,6 +85,7 @@ func (s *grpcServer) GetActionResult(ctx context.Context,
}

s.accessLogger.Printf("%s %s OK", logPrefix, req.ActionDigest.Hash)
s.incAcRequestMetrics(metrics.METHOD_GET, metrics.OK, ctx)
return result, nil
}

Expand All @@ -93,6 +97,7 @@ func (s *grpcServer) GetActionResult(ctx context.Context,

if result == nil {
s.accessLogger.Printf("%s %s NOT FOUND", logPrefix, req.ActionDigest.Hash)
s.incAcRequestMetrics(metrics.METHOD_GET, metrics.NOT_FOUND, ctx)
return nil, status.Error(codes.NotFound,
fmt.Sprintf("%s not found in AC", req.ActionDigest.Hash))
}
Expand Down Expand Up @@ -129,6 +134,7 @@ func (s *grpcServer) GetActionResult(ctx context.Context,
}

s.accessLogger.Printf("GRPC AC GET %s OK", req.ActionDigest.Hash)
s.incAcRequestMetrics(metrics.METHOD_GET, metrics.OK, ctx)

return result, nil
}
Expand Down Expand Up @@ -290,6 +296,7 @@ func (s *grpcServer) UpdateActionResult(ctx context.Context,
}

s.accessLogger.Printf("GRPC AC PUT %s OK", req.ActionDigest.Hash)
s.incAcRequestMetrics(metrics.METHOD_PUT, metrics.OK, ctx)

// Trivia: the RE API wants us to return the ActionResult from the
// request, in order to follow this standard method style guide:
Expand Down Expand Up @@ -331,3 +338,8 @@ func addWorkerMetadataGRPC(ctx context.Context, ar *pb.ActionResult) {

ar.ExecutionMetadata.Worker = worker
}

func (s *grpcServer) incAcRequestMetrics(method metrics.Method, status metrics.Status, ctx context.Context) {
headers, _ := metadata.FromIncomingContext(ctx)
s.metrics.IncomingRequestCompleted(metrics.AC, method, status, headers, metrics.GRPC)
}
3 changes: 2 additions & 1 deletion server/grpc_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,7 @@ func TestMain(m *testing.M) {

accessLogger := testutils.NewSilentLogger()
errorLogger := testutils.NewSilentLogger()
metrics := testutils.NewMetricsStub()

listener = bufconn.Listen(bufSize)

Expand All @@ -87,7 +88,7 @@ func TestMain(m *testing.M) {
validateAC,
mangleACKeys,
enableRemoteAssetAPI,
diskCache, accessLogger, errorLogger)
diskCache, accessLogger, errorLogger, metrics)
if err2 != nil {
fmt.Println(err2)
os.Exit(1)
Expand Down
Loading