Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a node id to all spans #896

Conversation

gregschohn
Copy link
Collaborator

@gregschohn gregschohn commented Aug 13, 2024

The value is either an ECS taskId, a Kubernetes hostname, or random UUID (as per env variables). Adding something more dynamic like shardId isn't possible to do as a resource as those need to be defined at the time that the SDK is initialized.

Put that worker/node id into all of the spans as a resource and continue to use it to thread the MDC environment for the work coordinator.

Description

  • Category Enhancement
  • Why these changes are required? Better traceability
  • What is the old behavior before changes and new behavior after changes? Spans now say what machine they were running on. Those values correlate to actual hardware.

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-1940

Testing

gradle tests

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…r random UUID (as per env variables).

Put that worker/node id into all of the spans as a resource and continue to use it to thread the MDC environment for the work coordinator.

Signed-off-by: Greg Schohn <[email protected]>
@gregschohn gregschohn force-pushed the NodeIdAsAnInstrumentationResource branch from 6a2d28e to be81e7d Compare August 13, 2024 23:04
Copy link

codecov bot commented Aug 13, 2024

Codecov Report

Attention: Patch coverage is 32.72727% with 37 lines in your changes missing coverage. Please review.

Project coverage is 78.58%. Comparing base (82d5748) to head (8476607).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
...opensearch/migrations/tracing/RootOtelContext.java 15.38% 16 Missing and 6 partials ⚠️
...ion/src/main/java/com/rfs/RfsMigrateDocuments.java 0.00% 5 Missing ⚠️
.../opensearch/migrations/replay/TrafficReplayer.java 0.00% 5 Missing ⚠️
...Snapshot/src/main/java/com/rfs/CreateSnapshot.java 0.00% 2 Missing ⚠️
...a/org/opensearch/migrations/MetadataMigration.java 0.00% 2 Missing ⚠️
...rg/opensearch/migrations/utils/ProcessHelpers.java 87.50% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #896      +/-   ##
============================================
- Coverage     78.66%   78.58%   -0.09%     
- Complexity     2515     2517       +2     
============================================
  Files           386      387       +1     
  Lines         14981    15012      +31     
  Branches        920      923       +3     
============================================
+ Hits          11785    11797      +12     
- Misses         2636     2651      +15     
- Partials        560      564       +4     
Flag Coverage Δ
gradle-test 75.09% <32.72%> (-0.11%) ⬇️
python-test 88.63% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@peternied
Copy link
Member

@gregschohn Thanks for doing this - just in time for me to use. I'm adding this to my review queue.

var workCoordinator = new OpenSearchWorkCoordinator(
new CoordinateWorkHttpClient(connectionContext),
TOLERABLE_CLIENT_SERVER_CLOCK_DIFFERENCE_SECONDS,
workerId
);
MDC.put(LOGGING_MDC_WORKER_ID, workerId); // I don't see a need to clean this up since we're in main
TryHandlePhaseFailure.executeWithTryCatch(() -> {
log.info("Running RfsMigrateDocuments with workerId = " + workerId);
log.info("Running RfsMigrateDocuments");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we can't still log the workerId, even if it's the nodeInstanceName?

Comment on lines 227 to 235
// @Test
// public void testDocumentMigrationForBigMonolithicShardWorks() throws Exception {
// testDocumentMigration(1,
// SearchClusterContainer.OS_V2_14_0.getImageName(),
// SearchClusterContainer.OS_V2_14_0,
// GENERATOR_BASE_IMAGE,
// new String[]{"tail", "-f", "/dev/null"});
// }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this here/commented out?

Comment on lines 229 to 230
final var workerId = UUID.randomUUID().toString();
System.err.println("Starting Traffic Replayer with id=" + workerId);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we doing UUID workerIds here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we should use the ProcessHelpers.getNodeInstanceName() to get the id

@@ -18,7 +20,15 @@ public class RootWorkCoordinationContext extends RootOtelContext {
public final WorkCoordinationContexts.AcquireNextWorkItemContext.MetricInstruments acquireNextWorkMetrics;

public RootWorkCoordinationContext(OpenTelemetry sdk, IContextTracker contextTracker) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets remove this overload to prevent null from leaking out

Comment on lines 229 to 230
final var workerId = UUID.randomUUID().toString();
System.err.println("Starting Traffic Replayer with id=" + workerId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we should use the ProcessHelpers.getNodeInstanceName() to get the id

)
.setMeterProvider(
SdkMeterProvider.builder().setResource(serviceResource).registerMetricReader(metricReader).build()
SdkMeterProvider.builder()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 The new format is much easier to read.

Give me newlines + nesting that make it easy to see what an object is composed of.

@@ -35,13 +35,9 @@ public class RootOtelContext implements IRootOtelContext {

public static OpenTelemetry initializeOpenTelemetryForCollector(
@NonNull String collectorEndpoint,
@NonNull String serviceName
@NonNull String serviceName,
String nodeName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why isn't this nonnull too?

Comment on lines 11 to 31
public static String getNodeInstanceName() {
String id = null;
try {
id = System.getenv("ECS_TASK_ID"); // for ECS deployments
if (id != null) {
return id;
}
id = System.getenv("HOSTNAME"); // for any kubernetes deployed pod
if (id != null) {
return id;
}
// add additional fallbacks here
id = DEFAULT_NODE_ID;
return id;
} finally {
if (id != null) {
String finalId = id;
log.atInfo().setMessage(() -> "getNodeInstanceName()=" + finalId).log();
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit

Suggested change
public static String getNodeInstanceName() {
String id = null;
try {
id = System.getenv("ECS_TASK_ID"); // for ECS deployments
if (id != null) {
return id;
}
id = System.getenv("HOSTNAME"); // for any kubernetes deployed pod
if (id != null) {
return id;
}
// add additional fallbacks here
id = DEFAULT_NODE_ID;
return id;
} finally {
if (id != null) {
String finalId = id;
log.atInfo().setMessage(() -> "getNodeInstanceName()=" + finalId).log();
}
}
}
public static String getNodeInstanceName() {
var nodeId = Optional.of("ECS_TASK_ID").map(System::getenv)
.or(() -> Optional.of("HOSTNAME").map(System::getenv))
.orElseGet(DEFAULT_NODE_ID);
log.atInfo().setMessage(() -> "getNodeInstanceName()=" + nodeId).log();
return nodeId;
}

@peternied
Copy link
Member

Signed-off-by: Greg Schohn <[email protected]>

# Conflicts:
#	DocumentsFromSnapshotMigration/src/main/java/com/rfs/RfsMigrateDocuments.java
#	DocumentsFromSnapshotMigration/src/test/java/com/rfs/ParallelDocumentMigrationsTest.java
#	MetadataMigration/src/main/java/com/rfs/MetadataMigration.java
@gregschohn gregschohn force-pushed the NodeIdAsAnInstrumentationResource branch from 9cca6d1 to 8847c03 Compare September 10, 2024 13:01
Also removed the empty MetadataMigration file that was a result of the last merge (the file was moved)

Signed-off-by: Greg Schohn <[email protected]>
@gregschohn gregschohn force-pushed the NodeIdAsAnInstrumentationResource branch from 3655888 to fd9b306 Compare September 10, 2024 17:34
@gregschohn gregschohn merged commit 97c173f into opensearch-project:main Sep 10, 2024
14 of 15 checks passed
@gregschohn gregschohn deleted the NodeIdAsAnInstrumentationResource branch September 10, 2024 18:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants