Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Adds Prometheus metrics to ODIS combiner daemon (Odis Combiner to k8s) #49

Merged
merged 42 commits into from
Oct 13, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
d79b6ae
WIP
alvarof2 Sep 26, 2023
0e4de57
endpoint
alvarof2 Sep 26, 2023
9ae7ff6
cloudbuild.yaml
alvarof2 Sep 26, 2023
1335a90
Add some counters and latency histogram
alvarof2 Sep 27, 2023
4b5ac14
Less buckets for Histograms
alvarof2 Sep 27, 2023
e2bb448
status code label in responses metric
alvarof2 Sep 27, 2023
256df86
More metrics
alvarof2 Sep 28, 2023
d65cff7
Latency Histograms names
alvarof2 Sep 28, 2023
8571d8a
errorsCaughtInEndpointHandler name
alvarof2 Sep 28, 2023
9501f50
Add increments to combinerErrors
alvarof2 Sep 28, 2023
6c1e09b
Add fullNodeLatency Histogram
alvarof2 Sep 29, 2023
78f0f19
Remove metric for all error types logged
alvarof2 Sep 29, 2023
63e6bc4
Merge branch 'alecps/odisCombinerk8s' into alvarof/prom-odisCombinerk8s
aaronmgdr Sep 29, 2023
547b564
Remove metric for all error types logged in handlers.ts
alvarof2 Sep 29, 2023
90199ca
Merge branch 'alecps/odisCombinerk8s' into alvarof/prom-odisCombinerk8s
alvarof2 Oct 3, 2023
4973730
Dockerfile
alvarof2 Oct 3, 2023
84c24c8
Merge branch 'main' into alvarof/prom-odisCombinerk8s
alvarof2 Oct 3, 2023
ec7e561
Dockerfiles and sync to main
alvarof2 Oct 3, 2023
e351617
workflows
alvarof2 Oct 5, 2023
553954d
Merge branch 'main' into alvarof/prom-odisCombinerk8s
alvarof2 Oct 6, 2023
2170002
GH Action combiner
alvarof2 Oct 6, 2023
0302719
ODIS signer GH Action
alvarof2 Oct 6, 2023
1faa676
Lint
alvarof2 Oct 6, 2023
dd083eb
More metrics
alvarof2 Oct 6, 2023
59b271c
sigInconsistenciesErrors name
alvarof2 Oct 6, 2023
3f17970
Merge branch 'main' into alvarof/prom-odisCombinerk8s
alvarof2 Oct 6, 2023
9a436fb
Merge branch 'alecps/odisCombinerk8s' into alvarof/prom-odisCombinerk8s
alvarof2 Oct 6, 2023
c698d6f
Merge branch 'alecps/odisCombinerk8s' into alvarof/prom-odisCombinerk8s
alvarof2 Oct 6, 2023
4c00727
odis-signer-container
alvarof2 Oct 6, 2023
30a2c6a
K8S test URL
alvarof2 Oct 9, 2023
2153393
K8S contextName for tests
alvarof2 Oct 9, 2023
1be5b0d
Cloud Function proxy
alvarof2 Oct 9, 2023
f89d30f
Merge branch 'alecps/odisCombinerk8s' into alvarof/prom-odisCombinerk8s
alvarof2 Oct 10, 2023
b2249b6
Branch name
alvarof2 Oct 10, 2023
3d806f0
Merge branch 'alecps/odisCombinerk8s' into alvarof/prom-odisCombinerk8s
alvarof2 Oct 10, 2023
85baf6c
Total errors metric
alvarof2 Oct 10, 2023
e434409
Remove alfajoresstaging-k8s loadtest case
alvarof2 Oct 11, 2023
0085c0b
Merge branch 'alecps/odisCombinerk8s' into alvarof/prom-odisCombinerk8s
alvarof2 Oct 11, 2023
b51040a
Remove sha from GH Action name
alvarof2 Oct 11, 2023
169a8e4
alfajoresstaging = K8S testing
alvarof2 Oct 11, 2023
b123201
Alfajores K8S URL
alvarof2 Oct 12, 2023
d04bc9a
Mainnet K8S URL
alvarof2 Oct 12, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
44 changes: 44 additions & 0 deletions .github/workflows/odis-combiner-container.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
---
name: Build ODIS combiner image

on:
push:
paths:
- 'dockerfiles/phone-number-privacy/Dockerfile-combiner'
- 'packages/phone-number-privacy/combiner/**'
branches:
- main
pull_request:
paths:
- 'dockerfiles/phone-number-privacy/Dockerfile-combiner'
- 'packages/phone-number-privacy/combiner/**'
workflow_dispatch:

jobs:
odis-combiner-build-dev:
uses: celo-org/reusable-workflows/.github/workflows/[email protected]
name: Build us-west1-docker.pkg.dev/devopsre/dev-images/odis-combiner
if: |
github.ref != 'refs/heads/main'
with:
workload-id-provider: projects/1094498259535/locations/global/workloadIdentityPools/gh-social-connect/providers/github-by-repos
service-account: '[email protected]'
artifact-registry: us-west1-docker.pkg.dev/devopsre/dev-images/odis-combiner
tag: ${{ github.sha }}
context: .
file: dockerfiles/phone-number-privacy/Dockerfile-combiner
trivy: true

odis-combiner-build:
uses: celo-org/reusable-workflows/.github/workflows/[email protected]
name: Build us-west1-docker.pkg.dev/devopsre/social-connect/odis-combiner
if: |
github.ref == 'refs/heads/main'
with:
workload-id-provider: projects/1094498259535/locations/global/workloadIdentityPools/gh-social-connect-main/providers/github-by-repos
service-account: '[email protected]'
artifact-registry: us-west1-docker.pkg.dev/devopsre/social-connect/odis-combiner
tag: ${{ github.sha }}
context: .
file: dockerfiles/phone-number-privacy/Dockerfile-combiner
trivy: true
42 changes: 42 additions & 0 deletions dockerfiles/phone-number-privacy/Dockerfile-combiner
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
##### Gathering dependencies
FROM scratch AS packages

# Copy phone-number-privacy package and its dependency closure.
# Assemble all dependencies into the packages folder so the second stage can select whether to
# include all packages, or just the phone-number-privacy packages.
WORKDIR /celo-phone-number-privacy/
COPY packages/phone-number-privacy/combiner packages/phone-number-privacy/combiner
COPY packages/phone-number-privacy/common packages/phone-number-privacy/common
COPY packages/sdk/encrypted-backup packages/sdk/encrypted-backup
COPY packages/sdk/identity packages/sdk/identity

##### Main stage
FROM node:18
LABEL org.opencontainers.image.authors="[email protected]"

WORKDIR /celo-phone-number-privacy/

# Copy monorepo settings
COPY lerna.json package.json yarn.lock ./

# Makes build fail if it doesn't copy git, will be removed after build
COPY .git .git

# Setting ONLY_PUBLISHED_DEPENDENCIES to true or any non-empty string results in only the
# phone-number-privacy package being copied into the image, and therefore it will only build using
# published dependencies. Setting ONLY_PUBLISHED_DEPENDENCIES to "" will copy in all dependecies.
ARG ONLY_PUBLISHED_DEPENDENCIES=""
ARG PACKAGE_SELECTOR=${ONLY_PUBLISHED_DEPENDENCIES:+phone-number-privacy/combiner}
COPY --from=packages celo-phone-number-privacy/packages/${PACKAGE_SELECTOR} packages/${PACKAGE_SELECTOR}

# Install dependencies and build.
RUN yarn install --network-timeout 100000 --frozen-lockfile && yarn cache clean
RUN yarn build

RUN rm -r .git

# Setup and run the combiner application.
ENV NODE_ENV production
WORKDIR /celo-phone-number-privacy/packages/phone-number-privacy/combiner
EXPOSE 8080
ENTRYPOINT ["yarn", "start:docker"]
1 change: 1 addition & 0 deletions packages/phone-number-privacy/combiner/package.json
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,7 @@
"lru-cache": "^10.0.1",
"node-fetch": "^2.6.9",
"pg": "^8.2.1",
"prom-client": "12.0.0",
"uuid": "^7.0.3"
},
"devDependencies": {
Expand Down
35 changes: 27 additions & 8 deletions packages/phone-number-privacy/combiner/src/common/combine.ts
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ import { Request } from 'express'
import * as t from 'io-ts'
import { PerformanceObserver } from 'perf_hooks'
import { fetchSignerResponseWithFallback, SignerResponse } from './io'
import { Counters, Histograms, newMeter } from './metrics'

export interface Signer {
url: string
Expand Down Expand Up @@ -85,17 +86,23 @@ export async function thresholdCallToSigners<R extends OdisRequest>(
await Promise.all(
signers.map(async (signer) => {
try {
const signerFetchResult = await fetchSignerResponseWithFallback(
signer,
endpoint,
keyVersionInfo.keyVersion,
request,
logger,
// @ts-ignore
abortSignal
const _meter = newMeter(Histograms.signerLatency, signer.url, request.url)
const signerFetchResult = await _meter(() =>
fetchSignerResponseWithFallback(
signer,
endpoint,
keyVersionInfo.keyVersion,
request,
logger,
// @ts-ignore
abortSignal
)
)

// used for log based metrics
Counters.sigResponses
.labels(signerFetchResult.status.toString(), signer.url, request.url)
.inc()
logger.info({
message: 'Received signerFetchResult',
signer: signer.url,
Expand All @@ -105,6 +112,9 @@ export async function thresholdCallToSigners<R extends OdisRequest>(
if (!signerFetchResult.ok) {
const responseData = await signerFetchResult.json()
// used for log based metrics
Counters.sigResponsesErrors
.labels(signerFetchResult.status.toString(), signer.url, request.url)
.inc()
logger.info({
message: 'Received signerFetchResult on unsuccessful signer response',
res: responseData,
Expand Down Expand Up @@ -156,11 +166,18 @@ export async function thresholdCallToSigners<R extends OdisRequest>(
}
} catch (err) {
if (isTimeoutError(err)) {
Counters.sigRequestErrors
.labels(signer.url, request.url, ErrorMessage.TIMEOUT_FROM_SIGNER)
.inc()
logger.error({ signer }, ErrorMessage.TIMEOUT_FROM_SIGNER)
} else if (isAbortError(err)) {
Counters.warnings.labels(request.url, WarningMessage.CANCELLED_REQUEST_TO_SIGNER).inc()
logger.info({ signer }, WarningMessage.CANCELLED_REQUEST_TO_SIGNER)
} else {
// Logging the err & message simultaneously fails to log the message in some cases
Counters.sigRequestErrors
.labels(signer.url, request.url, ErrorMessage.SIGNER_REQUEST_ERROR)
.inc()
logger.error({ signer }, ErrorMessage.SIGNER_REQUEST_ERROR)
logger.error({ signer, err })

Expand All @@ -180,6 +197,8 @@ export async function thresholdCallToSigners<R extends OdisRequest>(

if (errorCodes.size > 0) {
if (errorCodes.size > 1) {
Counters.errors.labels(request.url).inc()
Counters.sigInconsistenciesErrors.labels(request.url).inc()
logger.error(
{ errorCodes: JSON.stringify([...errorCodes]) },
ErrorMessage.INCONSISTENT_SIGNER_RESPONSES
Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
import { ErrorMessage } from '@celo/phone-number-privacy-common'
import threshold_bls from 'blind-threshold-bls'
import Logger from 'bunyan'
import { Counters } from '../../common/metrics'
import { CryptoClient, ServicePartialSignature } from './crypto-client'

function flattenSigsArray(sigs: Uint8Array[]) {
Expand Down Expand Up @@ -75,6 +76,8 @@ export class BLSCryptographyClient extends CryptoClient {
// We move it to the verified set so that we don't need to re-verify in the future
this.verifiedSignatures.push(unverifiedSignature)
} else {
Counters.errors.labels(unverifiedSignature.url).inc()
Counters.blsComputeErrors.labels(unverifiedSignature.url).inc()
logger.error({ url: unverifiedSignature.url }, ErrorMessage.VERIFY_PARITAL_SIGNATURE_ERROR)
}
}
Expand Down
92 changes: 53 additions & 39 deletions packages/phone-number-privacy/combiner/src/common/handlers.ts
Original file line number Diff line number Diff line change
Expand Up @@ -13,8 +13,10 @@ import { SemanticAttributes } from '@opentelemetry/semantic-conventions'
import Logger from 'bunyan'
import { Request, Response } from 'express'
import { performance, PerformanceObserver } from 'perf_hooks'
import * as client from 'prom-client'
import { getCombinerVersion } from '../config'
import { OdisError } from './error'
import { Counters, newMeter } from './metrics'

const tracer = opentelemetry.trace.getTracer('combiner-tracer')

Expand All @@ -37,10 +39,14 @@ export function catchErrorHandler<R extends OdisRequest>(
const logger: Logger = res.locals.logger
logger.error(ErrorMessage.CAUGHT_ERROR_IN_ENDPOINT_HANDLER)
logger.error(err)
Counters.errors.labels(req.url).inc()
Counters.errorsCaughtInEndpointHandler.labels(req.url).inc()
if (!res.headersSent) {
if (err instanceof OdisError) {
sendFailure(err.code, err.status, res, req.url)
} else {
Counters.errors.labels(req.url).inc()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is double counting I believe

Counters.unknownErrors.labels(req.url).inc()
sendFailure(ErrorMessage.UNKNOWN_ERROR, 500, res, req.url)
}
} else {
Expand Down Expand Up @@ -84,50 +90,57 @@ export function tracingHandler<R extends OdisRequest>(
}

export function meteringHandler<R extends OdisRequest>(
histogram: client.Histogram<string>,
handler: PromiseHandler<R>
): PromiseHandler<R> {
return async (req, res) => {
const logger: Logger = res.locals.logger

// used for log based metrics
logger.info({ req: req.body }, 'Request received')
return async (req, res) =>
newMeter(
histogram,
req.url
)(async () => {
const logger: Logger = res.locals.logger

const eventLoopLagMeasurementStart = Date.now()
setTimeout(() => {
const eventLoopLag = Date.now() - eventLoopLagMeasurementStart
logger.info({ eventLoopLag }, 'Measure event loop lag')
})
// TODO:(soloseng): session ID may not always exist
const startMark = `Begin ${req.url}/${req.body.sessionID}`
const endMark = `End ${req.url}/${req.body.sessionID}`
const entryName = `${req.url}/${req.body.sessionID} latency`

const obs = new PerformanceObserver((list) => {
const entry = list.getEntriesByName(entryName)[0]
if (entry) {
logger.info({ latency: entry }, 'e2e response latency measured')
// used for log based metrics
logger.info({ req: req.body }, 'Request received')

const eventLoopLagMeasurementStart = Date.now()
setTimeout(() => {
const eventLoopLag = Date.now() - eventLoopLagMeasurementStart
logger.info({ eventLoopLag }, 'Measure event loop lag')
})
// TODO:(soloseng): session ID may not always exist
const startMark = `Begin ${req.url}/${req.body.sessionID}`
const endMark = `End ${req.url}/${req.body.sessionID}`
const entryName = `${req.url}/${req.body.sessionID} latency`

const obs = new PerformanceObserver((list) => {
const entry = list.getEntriesByName(entryName)[0]
if (entry) {
logger.info({ latency: entry }, 'e2e response latency measured')
}
})
obs.observe({ entryTypes: ['measure'], buffered: false })

performance.mark(startMark)

try {
Counters.requests.labels(req.url).inc()
await handler(req, res)
if (res.headersSent) {
// used for log based metrics
logger.info({ res }, 'Response sent')
Counters.responses.labels(req.url, res.statusCode.toString()).inc()
}
} finally {
performance.mark(endMark)
performance.measure(entryName, startMark, endMark)

performance.clearMeasures(entryName)
performance.clearMarks(startMark)
performance.clearMarks(endMark)
obs.disconnect()
}
})
obs.observe({ entryTypes: ['measure'], buffered: false })

performance.mark(startMark)

try {
await handler(req, res)
if (res.headersSent) {
// used for log based metrics
logger.info({ res }, 'Response sent')
}
} finally {
performance.mark(endMark)
performance.measure(entryName, startMark, endMark)

performance.clearMeasures(entryName)
performance.clearMarks(startMark)
performance.clearMarks(endMark)
obs.disconnect()
}
}
}

export function timeoutHandler<R extends OdisRequest>(
Expand All @@ -154,6 +167,7 @@ export async function disabledHandler<R extends OdisRequest>(
req: Request<{}, {}, R>,
response: Response<OdisResponse<R>, Locals>
): Promise<void> {
Counters.warnings.labels(req.url, WarningMessage.API_UNAVAILABLE).inc()
sendFailure(WarningMessage.API_UNAVAILABLE, 503, response, req.url)
}

Expand Down
Loading