
[RFS] Collection of Minor Changes #717

Merged · 6 commits · Jun 13, 2024

Conversation

@chelma (Member) commented Jun 11, 2024

Description

  • Updated the RFS Docker Compose setup to run Elasticsearch 7.17 by default as a source
  • Updated RFS to accept a maximum shard size it will attempt to migrate; the default is 50 GB. If it encounters a shard larger than the max, it increments the attempt count without downloading or migrating the shard, so the shard is eventually marked as FAILED.
  • Updated RFS to delete the snapshot blobfiles downloaded from S3 after unpacking them into their Lucene index. We do not delete the metadata files: they are unlikely to cause disk-space problems (~5 kB/shard), we can add that behavior later with the abstractions I've introduced, and I'm currently on a tight schedule.
  • Updated RFS to delete each shard's unpacked Lucene index from disk after we're done with it (whether we successfully migrated it or not). Does not delete the parent directory that the Lucene indices were housed in.
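The max-shard-size behavior described above could be sketched as follows. This is a minimal illustration under assumptions: the class and method names (`ShardSizeGate`, `withinLimit`) are hypothetical stand-ins, not the actual RFS worker code; only the `maxShardSizeBytes` parameter and the 50 GB default come from this PR.

```java
// Hypothetical sketch of the max-shard-size gate; not the actual RFS code.
public class ShardSizeGate {
    // Default chosen in this PR: 50 GB
    public static final long DEFAULT_MAX_SHARD_SIZE_BYTES = 50L * 1024 * 1024 * 1024;

    /**
     * Returns true if the shard should be migrated. Oversized shards are skipped
     * without downloading anything; the caller still increments the attempt count,
     * so an oversized shard is eventually marked FAILED.
     */
    public static boolean withinLimit(long shardSizeBytes, long maxShardSizeBytes) {
        return shardSizeBytes <= maxShardSizeBytes;
    }

    public static void main(String[] args) {
        long smallShard = 10L * 1024 * 1024 * 1024; // 10 GB
        long hugeShard = 60L * 1024 * 1024 * 1024;  // 60 GB
        System.out.println(withinLimit(smallShard, DEFAULT_MAX_SHARD_SIZE_BYTES)); // true
        System.out.println(withinLimit(hugeShard, DEFAULT_MAX_SHARD_SIZE_BYTES));  // false
    }
}
```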

Issues Resolved

Testing

  • Added/updated unit tests
  • Ran a migration using the Docker Compose setup locally. Confirmed that shards larger than the specified max were rejected without downloading any files, that the blobfiles downloaded from S3 were cleaned up, and that the Lucene index directories were cleaned out.

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


codecov bot commented Jun 11, 2024

Codecov Report

Attention: Patch coverage is 32.52033% with 83 lines in your changes missing coverage. Please review.

Project coverage is 64.00%. Comparing base (669f0f9) to head (30826c7).
Report is 3 commits behind head on main.

Files Patch % Lines
RFS/src/main/java/com/rfs/RunRfsWorker.java 0.00% 36 Missing ⚠️
...ain/java/com/rfs/common/SnapshotShardUnpacker.java 35.71% 18 Missing ⚠️
...va/com/rfs/common/EphemeralSourceRepoAccessor.java 0.00% 14 Missing ⚠️
...c/main/java/com/rfs/common/SourceRepoAccessor.java 40.00% 9 Missing ⚠️
RFS/src/main/java/com/rfs/ReindexFromSnapshot.java 0.00% 4 Missing ⚠️
...java/com/rfs/common/DefaultSourceRepoAccessor.java 60.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #717      +/-   ##
============================================
+ Coverage     63.24%   64.00%   +0.76%     
- Complexity     1578     1585       +7     
============================================
  Files           220      222       +2     
  Lines          9156     9083      -73     
  Branches        793      771      -22     
============================================
+ Hits           5791     5814      +23     
+ Misses         2956     2858      -98     
- Partials        409      411       +2     
Flag Coverage Δ
unittests 64.00% <32.52%> (+0.76%) ⬆️

Flags with carried forward coverage won't be shown.


}
}

public static class CouldNotLoadRepoFile extends RuntimeException {
Member

Move this out into an Exceptions package, so we don't need class-specific versions of the same-named exception between this class and DefaultSourceRepoAccessor

Member Author

Good catch, done.

public DocumentsRunner(GlobalState globalState, CmsClient cmsClient, String snapshotName, long maxShardSizeBytes,
        ShardMetadata.Factory shardMetadataFactory, SnapshotShardUnpacker unpacker, LuceneDocumentsReader reader,
        DocumentReindexer reindexer) {
    this.members = new DocumentsStep.SharedMembers(globalState, cmsClient, snapshotName, metadataFactory, shardMetadataFactory, unpacker, reader, reindexer);
Member

This is nigh impossible to read - let's move to using @RequiredArgsConstructor as a quick stopgap, but I think we need better data modeling so we don't have to pass every configuration option around; clustering by functionality or having a settings object that can be peeked into would handle this

Member Author

I don't find it hard to read, but feel free to update it how you'd like.

Collaborator

In general, I go back and forth on this one in other areas of Java code. Doing the 'right' thing with builders is cumbersome and it only amounts to a runtime check.
Please try to put these on separate lines and order them to tell the best story possible. This is basically dependency injection and everything is required, right?

Member Author

@gregschohn Yeah, this is just dependency injection. I can put them on new lines if that makes it a bit clearer, but the bigger issue is that there's something that feels off with how the SharedMembers object is created in an intermediate scope between the main() and the Step classes that need it. It feels like the SharedMembers definition might not actually belong in the Step class?

Signed-off-by: Chris Helma <[email protected]>
Signed-off-by: Chris Helma <[email protected]>
Collaborator @gregschohn left a comment

I'm a bit confused at what the deletion model looks like. Please help to clarify it so that I can understand and approve this.

@@ -0,0 +1,22 @@
FROM docker.elastic.co/elasticsearch/elasticsearch:7.17.21 AS base
Collaborator

I know that dynamic dockerfiles aren't ideal, but generating these docker images with an array of different base images programmatically and en-masse could be an improvement for down the line.

Member Author

Probably; we'll want something like that for testing purposes once we have a bunch of different source/target versions to support. I'm unclear on how generalizable the Dockerfiles will be across major versions though.

@@ -27,6 +27,7 @@
import com.rfs.common.SnapshotRepo;
import com.rfs.common.SnapshotShardUnpacker;
import com.rfs.common.ClusterVersion;
import com.rfs.common.DefaultSourceRepoAccessor;
Collaborator

Is there a reason to keep this demo program around? Can it be deleted or moved to a testFixture?

Member Author

This one, definitely not; I've been keeping it around "just in case," but now that you've pointed it out I can't think of a case where I'd want it. Will delete.

@@ -329,7 +329,8 @@ public static void main(String[] args) throws InterruptedException {
} else {
bufferSize = ElasticsearchConstants_ES_7_10.BUFFER_SIZE_IN_BYTES;
}
SnapshotShardUnpacker unpacker = new SnapshotShardUnpacker(repo, luceneDirPath, bufferSize);
DefaultSourceRepoAccessor repoAccessor = new DefaultSourceRepoAccessor(repo);
Collaborator

Same question on this class. Can we delete it now?

Member Author

We can once we integrate the new version of this stuff into the Migration Assistant, which currently has a dependency on this file.

@@ -132,6 +137,7 @@ public static void main(String[] args) throws Exception {
String targetPass = arguments.targetPass;
List<String> indexTemplateAllowlist = arguments.indexTemplateAllowlist;
List<String> componentTemplateAllowlist = arguments.componentTemplateAllowlist;
long maxShardSizeBytes = arguments.maxShardSizeBytes;
Collaborator

just curious, why do you copy the arg values as non-final locals?

Member Author

I guess they should be finals, huh? Will start doing that. I make the intermediate copies to shorten the names when referring to them, and to make a conceptual split between the argument passed in and the thing we're using in the program. It makes more sense when we're doing validation or other processing first, but I tend to apply the pattern generally.

Will make them final per your suggestion, though.

Collaborator @gregschohn Jun 12, 2024

My question was more - why setup the aliases in the first place?

@@ -165,10 +171,11 @@ public static void main(String[] args) throws Exception {
indexWorker.run();

ShardMetadata.Factory shardMetadataFactory = new ShardMetadataFactory_ES_7_10(repoDataProvider);
SnapshotShardUnpacker unpacker = new SnapshotShardUnpacker(sourceRepo, luceneDirPath, ElasticsearchConstants_ES_7_10.BUFFER_SIZE_IN_BYTES);
DeletingSourceRepoAccessor repoAccessor = new DeletingSourceRepoAccessor(sourceRepo);
Collaborator

DeletingSourceRepoAccessor sounds really weird. Do you want to say EphemeralSourceRepoAccessor or SourceRepoCachingAccessor?

Member Author

Happy to rename. EphemeralSourceRepoAccessor tickles my brain a bit better, so I'll go with that.

}
}

public static class DeletingFileInputStream extends FileInputStream {
Collaborator

This seems really scary and not maintainable. If a user needs to read the file again from the beginning and just makes a new stream, they'll be surprised if they had already closed the previous stream. It's one thing to have a side effect reverse another side effect that your class was responsible for (you open a file descriptor on line 23, so you close it on line 40). In this case, you're passing ownership of the underlying file (not just the descriptor) to the DeletingFileInputStream.

What happens if some files aren't being consumed by the readers - what if there was an exception before you opened all of the streams? What if there are future refactorings?

A better approach might be to make a scratch directory for each migration-shard run (that way you can manage any number of sessions simultaneously within one process). As you finish running each of those sessions, just blow away everything that you had downloaded.

Member Author @chelma Jun 12, 2024

Probably easier to talk this one out, but: the way the SourceRepoAccessor works is that it wraps an underlying SourceRepo's calls, which return the Path to a particular file within the repo, and converts them to a Stream. In the case of an S3SourceRepo, when you get the Path to the file, it downloads it to disk if it's not already there. So if you wrap an S3SourceRepo in an EphemeralSourceRepoAccessor, you are returned an InputStream to a file that is downloaded if it doesn't already exist, and that file is then deleted when the Stream is closed.

It seems like any code that returns a stream would have the same problems you mentioned. I'm not opposed to doing it differently, but that would entail a wider refactoring and I was hoping to avoid that given we have more urgent things to be tackling right now (IMO).
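The delete-on-close behavior under discussion can be illustrated with a small sketch. This is a hypothetical simplification (the real DeletingFileInputStream wraps files surfaced by a SourceRepoAccessor, not arbitrary temp files), showing only the core idea of tying file deletion to stream close:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a delete-on-close stream; simplified, not the actual RFS class.
public class DeleteOnCloseDemo {
    public static class DeletingFileInputStream extends FileInputStream {
        private final Path path;

        public DeletingFileInputStream(Path path) throws IOException {
            super(path.toFile());
            this.path = path;
        }

        @Override
        public void close() throws IOException {
            try {
                super.close();              // release the file descriptor first
            } finally {
                Files.deleteIfExists(path); // then remove the backing file from disk
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("blobfile", ".dat");
        Files.write(tmp, new byte[] {1, 2, 3});
        try (DeletingFileInputStream in = new DeletingFileInputStream(tmp)) {
            System.out.println(in.read());     // 1
        }
        System.out.println(Files.exists(tmp)); // false: deleted on close
    }
}
```

This also makes the reviewer's objection concrete: once the stream is closed, reopening the "same" file fails, because ownership of the file (not just the descriptor) was handed to the stream.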

Member Author

After discussion, we decided to keep this around in case we need it in the future but use the Default (non-deleting) version in RFS for now. If the disk fills up from our s3 downloads, it should kill the process and naturally free up disk space by getting a new container.

@@ -12,24 +12,23 @@
*/

public class PartSliceStream extends InputStream {
Collaborator

from the javadoc - what's the purpose of this class? I don't understand what the 'special sauce' is.

Member Author

The special sauce is that it provides a single stream-like object that seamlessly reads through multiple snapshot blobfiles that have been split into multiple parts, per the Elasticsearch/OpenSearch convention of not having any individual file bigger than ~1 GB.

We don't have to do things this way, but it's what the ES/OS code did and I haven't had a reason to change it.

Collaborator

Is it like the SequenceInputStream, or maybe like that with a bit more support through it?

Member Author

Looks like "kinda"? Basically, you have a base blobfile name (foo), and if it's larger than 1 GB it's split into multiple files (foo.part0, foo.part1, foo.part2). The PartSliceStream creates a stream that reads across them as if they were a single file.
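The SequenceInputStream comparison can be made concrete with a short sketch. This is an in-memory simplification, not the real PartSliceStream (which reads actual foo.partN files from the snapshot repo); it only shows the chaining idea:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Collections;
import java.util.List;

// Sketch of reading split blobfile parts (foo.part0, foo.part1, ...) as one
// logical stream, using the stdlib SequenceInputStream. In-memory stand-in,
// not the actual PartSliceStream.
public class PartStreamDemo {
    static InputStream concat(List<InputStream> parts) {
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    public static void main(String[] args) throws IOException {
        // Two "parts" of a logically single blobfile.
        InputStream part0 = new ByteArrayInputStream("hello ".getBytes());
        InputStream part1 = new ByteArrayInputStream("world".getBytes());
        try (InputStream whole = concat(List.of(part0, part1))) {
            System.out.println(new String(whole.readAllBytes())); // hello world
        }
    }
}
```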

@Override
public void close() {
try {
Path luceneIndexDir = Paths.get(luceneFilesBasePath + "/" + shardMetadata.getIndexName() + "/" + shardMetadata.getShardId());
Collaborator

It looks like you're deleting a session directory here. Why did you need to delete files above when tied to the stream?

Member Author

Not sure I understand the question. Can you re-phrase?

Collaborator

Is this redundant to what you are doing with the DeletingFileInputStream above? This place feels like where all of your cached files should be deleted - or are there other files that are created outside of this class?

Member Author

Per discussion - there are two things we need to clean up: the raw snapshot files we download from S3, and the files we convert those into, which Lucene actually cares about. This is for cleaning up the latter.
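The per-shard directory cleanup in this close() could be sketched as below. The class name and paths here are hypothetical; the real code builds the directory path from shard metadata (index name and shard ID), as the diff above shows. Note that, as in the PR, only the shard's own directory is removed, not the base directory housing all the Lucene indices:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;

// Hypothetical sketch of recursively deleting a shard's unpacked Lucene
// directory; not the actual RFS cleanup code.
public class ShardDirCleanup {
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) return;
        try (var walk = Files.walk(dir)) {
            // Reverse order so children are deleted before their parents.
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("lucene");
        Path shardDir = Files.createDirectories(base.resolve("my-index").resolve("0"));
        Files.write(shardDir.resolve("segments_1"), new byte[] {0});
        deleteRecursively(shardDir);                // delete the shard's unpacked index...
        System.out.println(Files.exists(shardDir)); // false
        System.out.println(Files.exists(base));     // true: the housing directory is kept
    }
}
```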

Collaborator

As per the Jira in the log message added above, we should eventually get this down to no extra file buffers, which reduces the number of resources we're managing and should make the code more efficient and easier to maintain.

}

} catch (Exception e) {
throw new CouldNotCleanUpShard("Could not clean up shard: Index " + shardMetadata.getIndexId() + ", Shard " + shardMetadata.getShardId(), e);
Collaborator

Should/does this kill the process? Seems like it probably should since it would be safer to recycle the process, especially since you were at the end of working (if you have other pending work, maybe it should not take more on and flush out the current work items).

Member Author

Makes sense. With the way things are currently implemented, I believe it will kill the process.

Collaborator

Do you have a test to confirm that? I ask because making sure that processes die, especially through refactorings, can be pretty tricky to maintain.


Signed-off-by: Chris Helma <[email protected]>
Collaborator @gregschohn left a comment

Responded - trying to figure out where all of the ephemeral files are written and who all has responsibilities to delete them.


@chelma chelma merged commit 055e07a into opensearch-project:main Jun 13, 2024
6 of 7 checks passed
@chelma chelma deleted the MIGRATIONS-1749 branch June 13, 2024 13:14