RFS refactoring into separate component programs + new work coordination mechanisms #739
Conversation
Signed-off-by: Greg Schohn <[email protected]>
…o contention that we need to mitigate. If there is contention, there will be multiple snapshot requests and multiple processes will poll and proceed once the snapshot has been taken. Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
# Conflicts:
#	RFS/src/main/java/com/rfs/RunRfsWorker.java
#	RFS/src/main/java/com/rfs/worker/Runner.java
#	RFS/src/test/java/com/rfs/worker/SnapshotRunnerTest.java
Signed-off-by: Greg Schohn <[email protected]>
… quality of the code. Signed-off-by: Greg Schohn <[email protected]>
…uction ready interface that can be used to coordinate work. The code doesn't compile, the requests aren't fully featured, and the test is in the process of being shelled and rewritten (let alone clocked up to stress the interfaces). Signed-off-by: Greg Schohn <[email protected]>
…se works Signed-off-by: Greg Schohn <[email protected]>
Amp the unit test for WorkCoordinator up to stress this more. Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
…e leases expire or exceptions happen. The code compiles, but I haven't written tests for a lot of those pieces yet. Signed-off-by: Greg Schohn <[email protected]>
…ut any Coordinated Lease Locking. Leases aren't required for the first phases of snapshot migration since they're run for a short period of time and are expected to be orchestrated externally. If that decision were to reverse, it should be straightforward to wrap the methods in a ScopedWorkCoordinatorHelper. Tests that mocked internal details were deleted because those details have changed. A future commit will replace those with tests further up the stack. Signed-off-by: Greg Schohn <[email protected]>
…kCoordinator class. Signed-off-by: Greg Schohn <[email protected]>
Some files came into being or went out of being across the merge, so some of the changes may be greater than I would have liked for a merge.
Signed-off-by: Greg Schohn <[email protected]>
# Conflicts:
#	RFS/src/main/java/com/rfs/RunRfsWorker.java
#	RFS/src/main/java/com/rfs/worker/DocumentsRunner.java
#	RFS/src/main/java/com/rfs/worker/DocumentsStep.java
#	RFS/src/main/java/com/rfs/worker/IndexStep.java
#	RFS/src/test/java/com/rfs/worker/DocumentsRunnerTest.java
#	RFS/src/test/java/com/rfs/worker/DocumentsStepTest.java
Signed-off-by: Greg Schohn <[email protected]>
# Conflicts:
#	CreateSnapshot/src/main/java/com/rfs/CreateSnapshot.java
#	DocumentsFromSnapshotMigration/src/main/java/com/rfs/RfsMigrateDocuments.java
#	MetadataMigration/src/main/java/com/rfs/MetadataMigration.java
…RFS JUnit test. Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>
…d have the unpacker provide the filepath to it. Comment out some tests/parts of tests that are currently problematic/flaky. Signed-off-by: Greg Schohn <[email protected]>
force-pushed from dbb93a4 to cede889
…ove it to DocumentFromSnapshotMigration Signed-off-by: Greg Schohn <[email protected]>
force-pushed from cede889 to a9b6709
Whew! Done with my initial pass; I need to sleep on this change and revisit tomorrow. Lots of changes here - great work
@@ -9,6 +9,12 @@
    <Root level="debug">
        <AppenderRef ref="Console"/>
    </Root>
    <Logger name="org.apache.hc.client5.http.wire" level="debug" additivity="false">
Do we need this level of logging? Could we have it only enabled on specific tests where this would be useful?
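One way to do that (an assumption on my part, not something already in this PR): Log4j2 gives a `log4j2-test.xml` on the test classpath precedence over the main `log4j2.xml`, so the wire logger could live in a test-only config. A minimal sketch, assuming the appender is named "Console" as in the quoted config:

```xml
<!-- Sketch of a test-only src/test/resources/log4j2-test.xml; picked up by
     Log4j2 ahead of log4j2.xml when present on the test classpath. -->
<Configuration status="WARN">
    <Appenders>
        <Console name="Console" target="SYSTEM_OUT">
            <PatternLayout pattern="%d %-5p %c - %m%n"/>
        </Console>
    </Appenders>
    <Loggers>
        <Logger name="org.apache.hc.client5.http.wire" level="debug" additivity="false">
            <AppenderRef ref="Console"/>
        </Logger>
        <Root level="info">
            <AppenderRef ref="Console"/>
        </Root>
    </Loggers>
</Configuration>
```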
"}\n";

var response = httpClient.makeJsonRequest(PUT_METHOD, INDEX_NAME, null, body);
if ((response.getStatusCode() / 100) != 2) {
This is cute, but hard to understand at first glance. If we are going through the trouble of making our own http client abstraction, it could at the very least have a helper method for functionality like this.
E.g. response.wasNotSuccessful()
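A minimal sketch of the helper being suggested; `HttpResponse` and the method names here are illustrative stand-ins, not the actual RFS client API:

```java
// Hedged sketch: wraps the (statusCode / 100) != 2 trick behind named helpers.
public class HttpResponse {
    private final int statusCode;

    public HttpResponse(int statusCode) {
        this.statusCode = statusCode;
    }

    public int getStatusCode() {
        return statusCode;
    }

    // Any 2xx status counts as success, matching the original division check.
    public boolean isSuccessful() {
        return statusCode / 100 == 2;
    }

    public boolean wasNotSuccessful() {
        return !isSuccessful();
    }
}
```

Call sites would then read as `if (response.wasNotSuccessful()) { ... }` instead of repeating the arithmetic.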
if (resultFromUpdate == DocumentModificationResult.CREATED) {
    return new WorkItemAndDuration(workItemId, startTime.plus(leaseDuration));
} else {
    final var httpResponse = httpClient.makeJsonRequest(GET_METHOD, INDEX_NAME + "/_doc/" + workItemId,
Why do we need a second request? I feel like I'm missing something
if (resultFromUpdate == DocumentModificationResult.CREATED) {
    return new WorkItemAndDuration(workItemId, startTime.plus(leaseDuration));
} else {
Since you return in the if statement, it would be easier to read if you got rid of the else block so all the DocumentModificationResult checks were at the same vertical indent.
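A sketch of the early-return shape being suggested; the enum and record below are stand-ins for the real RFS types, and the non-CREATED branch is a placeholder for the follow-up GET path:

```java
import java.time.Duration;
import java.time.Instant;

// Hedged sketch, not the actual RFS code: early return removes the else block.
public class AcquireSketch {
    public enum DocumentModificationResult { CREATED, UPDATED, IGNORED }

    public record WorkItemAndDuration(String workItemId, Instant leaseExpiration) {}

    public static WorkItemAndDuration afterUpdate(DocumentModificationResult resultFromUpdate,
                                                  String workItemId,
                                                  Instant startTime,
                                                  Duration leaseDuration) {
        if (resultFromUpdate == DocumentModificationResult.CREATED) {
            return new WorkItemAndDuration(workItemId, startTime.plus(leaseDuration));
        }
        // No else block needed: every remaining check sits at the same indent.
        if (resultFromUpdate == DocumentModificationResult.UPDATED) {
            return null; // placeholder for the second-request path in the real code
        }
        return null;
    }
}
```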
log.error("Next work item picked=" + nextWorkItem);
Assertions.assertNotNull(nextWorkItem);
Assertions.assertNotNull(nextWorkItem.workItemId);
Assertions.assertTrue(nextWorkItem.leaseExpirationTime.isAfter(Instant.now()));
Doesn't seem to make sense to compare 'local' time against the cluster's timestamp; that seems like a source of random failures in the future.
Right now, the clients (document migration workers) handle leases by the 'honor system'. If the clock on a client is way off, we'll have multiple parties stepping on each other's toes. This is a simple sanity check to build in some protection against things completely falling apart.
I'd like to base the client time for the kill-switch off of the cluster time, as it's returned in the http headers of each target request. The responses may come from different nodes and there could be skew there, so that could be problematic too.
There is the general problem of a single source of truth for time, but I don't think we need to verify that scenario in these tests. What would we do about a failure at this line? It seems like an unactionable failure to me.
I'd rather remove the validation of this concept from these tests and bake it into separate, more focused test cases that assure the time management/coordination is working as expected. This seems like a good candidate for a follow up (and I don't think it is a code requirement for the moment). What do you think?
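One middle ground, sketched below as an assumption rather than anything in the PR: make the sanity check skew-tolerant, so a modest client/cluster clock difference doesn't fail spuriously. The `allowedSkew` budget is illustrative:

```java
import java.time.Duration;
import java.time.Instant;

// Hedged sketch of a skew-tolerant lease check along the lines discussed above.
public class LeaseSanityCheck {
    public static boolean leaseLooksLive(Instant leaseExpiration, Instant localNow, Duration allowedSkew) {
        // Treat the lease as live if it expires after (localNow - allowedSkew),
        // absorbing small clock differences between the client and the cluster.
        return leaseExpiration.isAfter(localNow.minus(allowedSkew));
    }
}
```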
DocumentsFromSnapshotMigration/src/test/java/com/rfs/FullTest.java
@BeforeAll
static void setupSourceContainer() {
    esSourceContainer.start();
I think you need to put some things in the source container that you expect to be migrated.
The source container is the elasticsearch_rfs_source container, which is being built automatically by gradle (every time still :( ). For now, that includes our 4 OpenSearch Benchmark test runs of 1000 docs / index. I'd like to improve our cacheable image generation and to include different variants (7.17, 6.8; many more shards; fewer shards), etc.
That will probably be in future PRs though.
Added methods to the work coordinator to show if there are work items remaining and another for the count. FullTest is beginning to make some use of those to confirm that scaling down measures can work. That test still doesn't work though as more features are being added to it. Signed-off-by: Greg Schohn <[email protected]>
… all of our acquisition calls, etc. Signed-off-by: Greg Schohn <[email protected]>
Make sure that the indices match between target and source. Swallow exceptions that might have come out of the migration run and let the run start over again, just as orchestrated containers would do. Signed-off-by: Greg Schohn <[email protected]>
…CoordinationWork
Signed-off-by: Greg Schohn <[email protected]>
# Conflicts:
#	DocumentsFromSnapshotMigration/build.gradle
#	DocumentsFromSnapshotMigration/src/test/java/com/rfs/FullTest.java
#	RFS/src/main/java/com/rfs/common/LuceneDocumentsReader.java
Signed-off-by: Greg Schohn <[email protected]>
# Conflicts:
#	CreateSnapshot/src/main/java/com/rfs/CreateSnapshot.java
#	RFS/build.gradle
Signed-off-by: Greg Schohn <[email protected]>
…pull the right image for the slowTest target. buildDockerImage_elasticsearchRFSSource is run completely every time. The DockerBuildImage stuff for dockerSolution makes good use of inputs/outputs to do more minimal, dependency-driven builds. This pattern uses multiple containers, so inputs and outputs will need to be managed externally. Signed-off-by: Greg Schohn <[email protected]>
…ion" for CDK & README Signed-off-by: Greg Schohn <[email protected]>
task slowTest(type: Test) {
    useJUnitPlatform()
    dependsOn buildDockerImage_elasticsearchRFSSource
This slows down the slow tests even more. Why can't we use test containers and populate the documents during the test? This decoupling of test setup and test validation will slow us down substantially over time, IMO.
Right now, I think that the time is roughly the same. It takes about 2 minutes to load the test data that I'm using. The hope with the prebuilt image is that if I'm running repeatedly, I don't need to keep regenerating my source data. I'll work on making the prebuilt stuff cacheable (it isn't now, which admittedly sucks), but the additional tax is just a couple extra docker operations (and an extra startup cycle of ES).
log.info("Snapshot not finished yet; sleeping for 1 seconds...");
Thread.sleep(1000);
This or something like it to prevent the log from going out of sync
Suggested change:
- log.info("Snapshot not finished yet; sleeping for 1 seconds...");
- Thread.sleep(1000);
+ var waitPeriod = Duration.ofSeconds(1);
+ log.info("Snapshot not finished yet; sleeping for " + waitPeriod.toMillis() + "ms...");
+ Thread.sleep(waitPeriod.toMillis());
whoops - I had copied that code in. Values are fixed now
seenWorkerItems.put(nextWorkItem.workItemId, nextWorkItem.workItemId);
return null;
return workCoordinator.acquireNextWorkItem(expirationWindow).visit(
    new IWorkCoordinator.WorkAcquisitionOutcomeVisitor<>() {
Nit: this would be a good use case for a mock so you only need to handle the onAcquireWork method
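Mockito is one option; a dependency-free alternative is a visitor with throwing defaults, so a test overrides only the callback it cares about. A sketch with stand-in types (the real visitor's method names may differ):

```java
// Hedged sketch: defaults throw, so a test implementation only has to supply
// the one outcome it expects; any other callback fails the test loudly.
public class VisitorAdapterSketch {
    public interface WorkAcquisitionOutcomeVisitor<T> {
        default T onAcquiredWork(String workItemId) {
            throw new IllegalStateException("unexpected onAcquiredWork");
        }
        default T onNoAvailableWorkToBeDone() {
            throw new IllegalStateException("unexpected onNoAvailableWorkToBeDone");
        }
    }
}
```

A test would then write `new WorkAcquisitionOutcomeVisitor<String>() { @Override public String onAcquiredWork(String id) { ... } }` and nothing else.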
class NoAvailableWorkToBeDone implements WorkAcquisitionOutcome {
    @Override
    public <T> T visit(WorkAcquisitionOutcomeVisitor<T> v) throws IOException {
        return null;
This looks like a bug; shouldn't this be v.onNoAvailableWorkToBeDone?
👍
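For clarity, a sketch of the agreed fix: the no-work outcome dispatches to the visitor instead of returning null. Types here are stand-ins for the RFS ones:

```java
import java.io.IOException;

// Hedged sketch, not the actual RFS classes: visit() now calls the matching
// visitor callback rather than silently returning null.
public class OutcomeSketch {
    public interface WorkAcquisitionOutcomeVisitor<T> {
        T onNoAvailableWorkToBeDone() throws IOException;
        T onAcquiredWork(String workItemId) throws IOException;
    }

    public interface WorkAcquisitionOutcome {
        <T> T visit(WorkAcquisitionOutcomeVisitor<T> v) throws IOException;
    }

    public static class NoAvailableWorkToBeDone implements WorkAcquisitionOutcome {
        @Override
        public <T> T visit(WorkAcquisitionOutcomeVisitor<T> v) throws IOException {
            return v.onNoAvailableWorkToBeDone(); // was: return null
        }
    }
}
```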
import java.time.Duration;
import java.time.Instant;

public interface IWorkCoordinator extends AutoCloseable {
So there is the problem of allocating work, and then there is the problem of defining the work to be allocated. Having a single interface for both makes this harder to understand; it seems like this would be better split into two interfaces. In the case where something is both a consumer and a producer, it could just implement both. What do you think?
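Something like the following, purely as a sketch; names, signatures, and the in-memory implementation are illustrative assumptions, not the RFS API:

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Queue;

// Hedged sketch of the two-interface split: one side defines work, the other
// allocates it; a dual producer/consumer just implements both.
public class SplitSketch {
    // "Defining the work to be allocated" - the producer side.
    public interface IWorkItemCreator {
        void createUnassignedWorkItem(String workItemId);
    }

    // "Allocating work" - the consumer side.
    public interface IWorkItemAcquirer {
        String acquireNextWorkItem(Duration leaseDuration); // null when no work remains
    }

    public interface IWorkCoordinator extends IWorkItemCreator, IWorkItemAcquirer, AutoCloseable {}

    // Toy in-memory coordinator just to show the combined shape.
    public static class InMemoryCoordinator implements IWorkCoordinator {
        private final Queue<String> pending = new ArrayDeque<>();

        @Override public void createUnassignedWorkItem(String workItemId) { pending.add(workItemId); }
        @Override public String acquireNextWorkItem(Duration leaseDuration) { return pending.poll(); }
        @Override public void close() {}
    }
}
```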
}

public interface WorkItemGetter {
    @NonNull IWorkCoordinator.WorkAcquisitionOutcome apply(IWorkCoordinator wc);
Or maybe tryAcquire(...) is a better fit?
I know we are pushing towards a deadline, and I think we've made huge strides with this PR. What I would advise, to keep moving fast, is to merge as is and then, rather than creating follow-up items, just keep working on the outstanding comments on this PR until they are all resolved. What do you think?
I've just done a pass of them and there are far fewer that we need to resolve to put this set of changes to 'rest'. Making github issues or jira tasks seems like low-value overhead when we've got the context in this PR.
Signed-off-by: Greg Schohn <[email protected]>
force-pushed from 9aead15 to 6f50219
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@ Coverage Diff @@
##              main     #739       +/-   ##
=============================================
+ Coverage   69.82%   89.37%   +19.55%
=============================================
  Files         269       46      -223
  Lines       11815     2702     -9113
  Branches      772        0      -772
=============================================
- Hits         8250     2415     -5835
+ Misses       3158      287     -2871
+ Partials      407        0      -407
Signed-off-by: Greg Schohn <[email protected]>
Description
[Describe what this change achieves]
Issues Resolved
[List any issues this PR will resolve]
Is this a backport? If so, please add backport PR # and/or commits #
Testing
[Please provide details of testing done: unit testing, integration testing and manual testing]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.