This repository has been archived by the owner on Mar 27, 2021. It is now read-only.

Commit

Implement new Bigtable timeout & retry settings (#733)
* heroic-730 Implement new Bigtable timeout & retry settings

Signed-off-by: Peter Kingswell <[email protected]>

* heroic-733 lowered timeouts, added more block comments.

- passing all tests
- next is to refactor the 5 parameters into a class

Signed-off-by: Peter Kingswell <[email protected]>

* heroic-733 updated failing unit test

Signed-off-by: Peter Kingswell <[email protected]>

* heroic-733 attempting to refactor connection settings into one class

so that they're not repeated several times over

Signed-off-by: Peter Kingswell <[email protected]>

* updated timeout settings as agreed with @SergeyR on Slack

* heroic-733 small settings & comment corrections

Signed-off-by: Peter Kingswell <[email protected]>

* heroic-733 small settings & comment corrections

Signed-off-by: Peter Kingswell <[email protected]>

* fixed runtime IT error in console & syntax warnings

Signed-off-by: Peter Kingswell <[email protected]>

* added tagValuesTruncatedSuggestMany to exercise the difference between CI

and localhost IT test runs, where it works locally but not in CI.

Signed-off-by: Peter Kingswell <[email protected]>

* fixed checkstyle error.

Signed-off-by: Peter Kingswell <[email protected]>

* fixed checkstyle error.

Signed-off-by: Peter Kingswell <[email protected]>

* fixes logging in unit & integration tests.

Signed-off-by: Peter Kingswell <[email protected]>

* fixes inconsistent tagValuesTruncatedSuggest behaviour

Signed-off-by: Peter Kingswell <[email protected]>

* made the tests in *SuggestBackend*IT.java deterministic

Signed-off-by: Peter Kingswell <[email protected]>

* increased time available for deleteSeries() to complete

down the road all tests of this kind (timer-based) will
need redoing as they're lame and brittle.

Signed-off-by: Peter Kingswell <[email protected]>

* refactored 6 retry & timeout settings into new POD class

which is then added as a field to BigtableBackend, Module etc.

Signed-off-by: Peter Kingswell <[email protected]>

* reduced coupling between modules by switching implementations

from MetricsConnectionSettingsModule to MetricsConnectionSettings

Signed-off-by: Peter Kingswell <[email protected]>

* moved MetricsConnectionSettingsModule to Bigtable module since...

that's where it belongs, basically.

Signed-off-by: Peter Kingswell <[email protected]>

* *actually* moved MetricsConnectionSettingsModule to Bigtable module

since...that's where it belongs, basically.

Signed-off-by: Peter Kingswell <[email protected]>

* added {@link ...} javadoc

Signed-off-by: Peter Kingswell <[email protected]>

* fixed comment

Signed-off-by: Peter Kingswell <[email protected]>

* implemented review feedback

- removed useless/confusing @JSON annotations
- changed fields from Integer to int
- also fixed code analysis warnings

* Rename .java to .kt

* PR feedback - refactored POD BT Java class to Kotlin

* implemented a workaround for @Inject

not working as documented (seemingly)

* improved comments, removed unnecessary subclass

* io.grpc -> 1.35.0 & bigtable-client-core -> 1.19.0

* increased heroic system test startup timeout by 30s

to prevent the intermittent timeouts that are being observed.

* fixed Could not find policy 'pick_first' exception.

- occurred when a timeout happened
- full message: java.lang.IllegalStateException: Could not find policy
'pick_first'. Make sure its implementation is either registered to
LoadBalancerRegistry or included in
META-INF/services/io.grpc.LoadBalancerProvider from your jar files.

* adds integration test for Bigtable timeouts

* report the failing test's actual content

* fix testBackendTimesOutCorrectly assertion

* unit test fix attempt #2

* set Google-default timeout and retry settings; tidied up docs too

sming authored Mar 4, 2021
1 parent 2e63167 commit c6b5a6f
Showing 37 changed files with 1,291 additions and 577 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -40,3 +40,4 @@ gradle-app.setting
 
 # direnv config file - https://direnv.net/
 /.envrc
+logs
5 changes: 3 additions & 2 deletions build.gradle
@@ -165,7 +165,7 @@ allprojects {
 
     dependency 'io.netty:netty-tcnative-boringssl-static:2.0.28.Final'
 
-    dependencySet(group: 'io.grpc', version: '1.16.1') {
+    dependencySet(group: 'io.grpc', version: '1.35.0') {
         entry 'grpc-auth'
         entry 'grpc-core'
         entry 'grpc-netty'
@@ -200,7 +200,7 @@ allprojects {
         entry 'google-cloud-core'
         entry 'google-cloud-core-grpc'
     }
-    dependency 'com.google.cloud.bigtable:bigtable-client-core:1.12.1'
+    dependency 'com.google.cloud.bigtable:bigtable-client-core:1.19.0'
 
     dependency 'com.addthis:stream-lib:3.0.0'
     dependency 'org.xerial.snappy:snappy-java:1.1.7.2'
@@ -313,6 +313,7 @@ subprojects {
     }
 
     test {
+        testLogging.showStandardStreams = true
         testLogging {
             events "passed", "skipped", "failed", "standardOut", "standardError"
             outputs.upToDateWhen { false }
49 changes: 43 additions & 6 deletions docs/content/_docs/config.md
@@ -132,41 +132,47 @@ Precedence for each flag is defined as the following:
The following features are available:

#### com.spotify.heroic.deterministic_aggregations

{:.no_toc}

Enable feature to only perform aggregations that can be performed with limited resources. Disabled by default.

Aggregations are commonly performed per-shard, and the result concatenated. This enables experimental support for distributed aggregations which behave transparently across shards.

#### com.spotify.heroic.distributed_aggregations

{:.no_toc}

Enable feature to perform distributed aggregations. Disabled by default.

Aggregations are commonly performed per-shard, and the result concatenated. This enables experimental support for distributed aggregations which behave transparently across shards. Typically this will cause more data to be transported across shards for each request.

#### com.spotify.heroic.shift_range

{:.no_toc}

Enable feature to cause range to be rounded on the current cadence. Enabled by default.

This will assert that there are data outside of the range queried for and that the range is aligned to the queried cadence, which is useful when using a dashboarding system.

#### com.spotify.heroic.sliced_data_fetch

{:.no_toc}

Enable feature to cause data to be fetched in slices. Enabled by default.

This will cause data to be fetched and consumed by the aggregation framework in pieces, avoiding having to load all data into memory before starting to consume it.

#### com.spotify.heroic.end_bucket_stategy

{:.no_toc}

Enabled by default.

Use the legacy bucket strategy by default, where the resulting value is placed at the end timestamp of the bucket.

#### com.spotify.heroic.cache_query

{:.no_toc}

Disabled by default.
@@ -492,6 +498,39 @@ batchSize: <int>
# If set, the Bigtable client will be configured to use this address as a Bigtable emulator.
# Default CBT emulator runs at: "localhost:8086"
emulatorEndpoint: <string>

# Reference: https://cloud.google.com/bigtable/docs/hbase-client/javadoc/com/google/cloud/bigtable/config/CallOptionsConfig.Builder
# The number of milliseconds to wait before issuing a client-side timeout for mutation remote
# procedure calls.
# In other words, if timeouts are set, how many milliseconds should pass before a
# DEADLINE_EXCEEDED is thrown. The Google default is 600_000 ms (10 minutes).
# Currently, this feature is experimental.
mutateRpcTimeoutMs: int

# The number of milliseconds to wait before issuing a client-side timeout for readRows
# streaming remote procedure calls. From
# https://github.com/hegemonic/cloud-bigtable-client/blob/master/bigtable-client-core-parent/bigtable-client-core/src/main/java/com/google/cloud/bigtable/config/CallOptionsConfig.java :
# "The default duration to wait before timing out read stream RPC (default value: 12 hours)."
readRowsRpcTimeoutMs: int

# The number of milliseconds to wait before issuing a client-side timeout for short remote
# procedure calls (default value: 60 seconds), from
# https://cloud.google.com/bigtable/docs/hbase-client/javadoc/com/google/cloud/bigtable/config/CallOptionsConfig#SHORT_TIMEOUT_MS_DEFAULT
shortRpcTimeoutMs: int

# The maximum number of times to retry after a scan timeout (default value: 3).
# https://cloud.google.com/bigtable/docs/hbase-client/javadoc/com/google/cloud/bigtable/config/RetryOptions.html#getmaxscantimeoutretries
maxScanTimeoutRetries: int

# The maximum amount of time to keep retrying a failing operation (default value: 60 seconds).
# For example, if this is 5,000 ms and we retry every 2,000 ms, we would do 2 retries.
maxElapsedBackoffMs: int
```
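
For reference, here is a minimal, illustrative snippet combining the five settings above. The values are examples rather than recommendations; any key left out falls back to the Google default documented for it:

```yaml
# Illustrative values only; omitted keys keep the Google defaults.
mutateRpcTimeoutMs: 30000       # 30 s instead of the 10 min default
readRowsRpcTimeoutMs: 300000    # 5 min instead of the 12 h default
shortRpcTimeoutMs: 10000        # 10 s instead of the 60 s default
maxScanTimeoutRetries: 3        # the documented default
maxElapsedBackoffMs: 60000      # 60 s, the documented default
```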
##### `<bigtable_credentials>`
@@ -542,18 +581,16 @@ Maximum number of distinct groups a single result group may contain.

##### seriesLimit

Maximum number of time series a single request is allowed to fetch, per cluster (if federated).

A note: when using resource identifiers this limit only applies to the number of series found in the metadata backend, *not* the total series returned.

It is therefore possible for a low limit *not* to be exceeded by the number of series found in metadata, yet for far more series to be returned from the metrics backend once resource identifiers are taken into account (which may trigger additional limits).

##### failOnLimits

When true, any limits applied will be reported as a failure.

### [`<metadata_backend>`](#metadata_backend)

Metadata acts as the index to time series data; it is the driving force behind our [Query Language](docs/query_language).
@@ -717,7 +754,6 @@ sniff: <bool> default = false
nodeSamplerInterval: <duration> default = 30s
```
#### [Memory](#memory)
An in-memory datastore. This is intended only for testing and is definitely not something you should run in production.
@@ -970,6 +1006,7 @@ level: <string> default = TRACE
```

#### Query log output

{:.no_toc}

Each successful query will result in several output entries in the query log, with entries from different stages of the query. Example output:
@@ -996,6 +1033,7 @@ Each successful query will result in several output entries in the query log.
| `data` | Contains data relevant to this query stage. This might for example be the original query, a partial response or the final response.

#### Contextual information

{:.no_toc}

It's possible to supply contextual information in the query. This information will then be included in the query log, making it easier to map a performed query to its query log output.
@@ -1033,7 +1071,6 @@ Enable distributed tracing output of Heroic's operations.

A few tags are added to incoming requests such as the java version. If running on GCP, zone and region tags are added as well.


```yaml
# Probability, between 0.0 and 1.0, of sampling each trace.
probability: <float> default = 0.01
132 changes: 132 additions & 0 deletions MetricsConnectionSettings.kt
@@ -0,0 +1,132 @@
/*
 * Copyright (c) 2015 Spotify AB.
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
package com.spotify.heroic.metric

import com.spotify.heroic.metric.consts.ApiQueryConsts
import org.apache.commons.lang3.builder.ToStringBuilder
import org.apache.commons.lang3.builder.ToStringStyle
import java.util.Optional

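/**
 * Plain settings holder (a "POD" class) for the six Bigtable write-batch,
 * timeout and retry settings, carried as a field by BigtableBackend, its
 * module and related classes.
 */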
open class MetricsConnectionSettings(
    maxWriteBatchSize: Optional<Int>,
    mutateRpcTimeoutMs: Optional<Int>,
    readRowsRpcTimeoutMs: Optional<Int>,
    shortRpcTimeoutMs: Optional<Int>,
    maxScanTimeoutRetries: Optional<Int>,
    maxElapsedBackoffMs: Optional<Int>
) {
    /**
     * See [ApiQueryConsts.DEFAULT_MUTATE_RPC_TIMEOUT_MS]
     */
    @JvmField
    var mutateRpcTimeoutMs: Int

    /**
     * See [ApiQueryConsts.DEFAULT_READ_ROWS_RPC_TIMEOUT_MS]
     */
    @JvmField
    var readRowsRpcTimeoutMs: Int

    /**
     * See [ApiQueryConsts.DEFAULT_SHORT_RPC_TIMEOUT_MS]
     */
    @JvmField
    var shortRpcTimeoutMs: Int

    /**
     * See [ApiQueryConsts.DEFAULT_MAX_SCAN_TIMEOUT_RETRIES]
     */
    @JvmField
    var maxScanTimeoutRetries: Int

    /**
     * See [ApiQueryConsts.DEFAULT_MAX_ELAPSED_BACKOFF_MILLIS]
     */
    @JvmField
    var maxElapsedBackoffMs: Int

    /**
     * See [MetricsConnectionSettings.DEFAULT_MUTATION_BATCH_SIZE]
     */
    @JvmField
    var maxWriteBatchSize: Int

    protected constructor() : this(
        Optional.of<Int>(MAX_MUTATION_BATCH_SIZE),
        Optional.of<Int>(ApiQueryConsts.DEFAULT_MUTATE_RPC_TIMEOUT_MS),
        Optional.of<Int>(ApiQueryConsts.DEFAULT_READ_ROWS_RPC_TIMEOUT_MS),
        Optional.of<Int>(ApiQueryConsts.DEFAULT_SHORT_RPC_TIMEOUT_MS),
        Optional.of<Int>(ApiQueryConsts.DEFAULT_MAX_SCAN_TIMEOUT_RETRIES),
        Optional.of<Int>(ApiQueryConsts.DEFAULT_MAX_ELAPSED_BACKOFF_MILLIS)
    )

    override fun toString(): String {
        return ToStringBuilder(this, ToStringStyle.MULTI_LINE_STYLE)
            .append("maxWriteBatchSize", maxWriteBatchSize)
            .append("mutateRpcTimeoutMs", mutateRpcTimeoutMs)
            .append("readRowsRpcTimeoutMs", readRowsRpcTimeoutMs)
            .append("shortRpcTimeoutMs", shortRpcTimeoutMs)
            .append("maxScanTimeoutRetries", maxScanTimeoutRetries)
            .append("maxElapsedBackoffMs", maxElapsedBackoffMs)
            .toString()
    }

    companion object {
        /**
         * Default number of Cells for each batch mutation.
         */
        const val DEFAULT_MUTATION_BATCH_SIZE = 1000

        /**
         * Maximum possible number of Cells for each batch mutation.
         */
        const val MAX_MUTATION_BATCH_SIZE = 100000

        /**
         * Minimum possible number of Cells supported for each batch mutation.
         */
        const val MIN_MUTATION_BATCH_SIZE = 10

        @JvmStatic
        fun createDefault(): MetricsConnectionSettings {
            return MetricsConnectionSettings()
        }
    }

    init {
        // Basically make sure that maxWriteBatchSize, if set, is sane.
        var maxWriteBatch = maxWriteBatchSize.orElse(DEFAULT_MUTATION_BATCH_SIZE)
        maxWriteBatch = maxWriteBatch.coerceAtLeast(MIN_MUTATION_BATCH_SIZE)
        maxWriteBatch = maxWriteBatch.coerceAtMost(MAX_MUTATION_BATCH_SIZE)
        this.maxWriteBatchSize = maxWriteBatch

        this.mutateRpcTimeoutMs =
            mutateRpcTimeoutMs.orElse(ApiQueryConsts.DEFAULT_MUTATE_RPC_TIMEOUT_MS)
        this.readRowsRpcTimeoutMs =
            readRowsRpcTimeoutMs.orElse(ApiQueryConsts.DEFAULT_READ_ROWS_RPC_TIMEOUT_MS)
        this.shortRpcTimeoutMs =
            shortRpcTimeoutMs.orElse(ApiQueryConsts.DEFAULT_SHORT_RPC_TIMEOUT_MS)
        this.maxScanTimeoutRetries =
            maxScanTimeoutRetries.orElse(ApiQueryConsts.DEFAULT_MAX_SCAN_TIMEOUT_RETRIES)
        this.maxElapsedBackoffMs =
            maxElapsedBackoffMs.orElse(ApiQueryConsts.DEFAULT_MAX_ELAPSED_BACKOFF_MILLIS)
    }
}
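
For illustration, a minimal sketch of how this settings class might be used, assuming the class above and ApiQueryConsts are on the classpath. The values are arbitrary and the wiring into BigtableBackend and its module is omitted:

```kotlin
import com.spotify.heroic.metric.MetricsConnectionSettings
import java.util.Optional

fun main() {
    // All six settings at their defaults (Google defaults for the RPC settings).
    val defaults = MetricsConnectionSettings.createDefault()
    println(defaults)

    // Override selected settings; empty Optionals fall back to the defaults, and
    // out-of-range batch sizes are coerced into
    // [MIN_MUTATION_BATCH_SIZE, MAX_MUTATION_BATCH_SIZE].
    val custom = MetricsConnectionSettings(
        Optional.of(1_000_000), // maxWriteBatchSize; coerced down to 100000
        Optional.empty(),       // mutateRpcTimeoutMs -> Google default
        Optional.empty(),       // readRowsRpcTimeoutMs -> Google default
        Optional.of(10_000),    // shortRpcTimeoutMs: 10 seconds
        Optional.empty(),       // maxScanTimeoutRetries -> Google default
        Optional.empty()        // maxElapsedBackoffMs -> Google default
    )
    println(custom)
}
```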