From daada647a6ee25a1186f834545015562ee4964aa Mon Sep 17 00:00:00 2001
From: Chris Helma <25470211+chelma@users.noreply.github.com>
Date: Fri, 16 Aug 2024 10:28:10 -0500
Subject: [PATCH] Improved comments on RFS LuceneDocumentsReader (#908)

Signed-off-by: Chris Helma
---
 .../com/rfs/common/LuceneDocumentsReader.java | 42 ++++++++++++-------
 1 file changed, 28 insertions(+), 14 deletions(-)

diff --git a/RFS/src/main/java/com/rfs/common/LuceneDocumentsReader.java b/RFS/src/main/java/com/rfs/common/LuceneDocumentsReader.java
index c45ae7bba..70da58188 100644
--- a/RFS/src/main/java/com/rfs/common/LuceneDocumentsReader.java
+++ b/RFS/src/main/java/com/rfs/common/LuceneDocumentsReader.java
@@ -32,23 +32,37 @@ public static Function getFactory(boolean softDelet
     /**
      * There are a variety of states the documents in our Lucene Index can be in; this method extracts those documents
-     * that would be considered "live" from the ElasticSearch/OpenSearch perspective.
+     * that would be considered "live" from the ElasticSearch/OpenSearch perspective. The most important thing to know is
+     * that Lucene segments are immutable. For additional context, it is highly recommended to read this section of the
+     * Lucene docs for a high level overview of the topics involved:
      *
-     * For context, when ElasticSearch/OpenSearch deletes a document, it doesn't actually remove it from the Lucene Index.
-     * Instead, what happens is that the document is marked as "deleted" in the Lucene Index, but it is still present in the
-     * Lucene segment on disk. The next time that segment is merged, the deleted documents are removed from the Lucene Index.
-     * A similar thing happens when a document is updated; the old document is marked as "deleted" and a new document is
-     * added in a new Lucene segment. This means that from an ES/OS perspective, you could have a single document that has
-     * been created, deleted, recreated, updated, etc. multiple times and only a single version of the doc would exist when
-     * you queried the ES/OS Index - but every single iteration of that doc might still exist in the Lucene Segments on disk,
-     * all of which have the same _id (from the ES/OS perspective).
+     * https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/codecs/lucene80/package-summary.html
+     *
+     * When ElasticSearch/OpenSearch deletes a document, it doesn't actually remove it from the Lucene Index. Instead, what
+     * happens is that the document is marked as "deleted" in the Lucene Index, but it is still present in the Lucene segment
+     * on disk. The next time a merge occurs, that segment will be deleted, and the deleted documents in it are thereby
+     * removed from the Lucene Index. A similar thing happens when a document is updated; the old document is marked as
+     * "deleted" in the Lucene segment and the new version of the document is added in a new Lucene segment. Until a merge
+     * occurs, both the old and new versions of the document will exist in the Lucene Index in different segments, though only
+     * the new version will be returned in search results. This means that from an ES/OS perspective, you could have a single
+     * document that has been created, deleted, recreated, updated, etc. multiple times at the Elasticsearch/OpenSearch level
+     * and only a single version of the doc would exist when you queried the ES/OS Index - but every single iteration of that
+     * document might still exist in the Lucene segments on disk, all of which have the same _id (from the ES/OS perspective).
      *
      * Additionally, Elasticsearch 7 introduced a feature called "soft deletes" which allows you to mark a document as
-     * "deleted" in the Lucene Index without actually removing it from the Lucene Index. This works by having the
-     * application writing the Lucene Index define a field that is used to mark a document as "soft deleted" or not. When
-     * a document is marked as "soft deleted", it is not returned in search results, but it is still present in the Lucene
-     * Index. The status of whether any given document is "soft deleted" or not is stored in the Lucene Index itself. By
-     * default, Elasticsearch 7+ Indices have soft deletes enabled; this is an Index-level setting.
+     * "deleted" in the Lucene Index without actually removing it from the Lucene Index. From what I can gather, soft deletes
+     * are an optimization to reduce the likelihood of needing to re-download full shards when a node drops out of the cluster,
+     * loses synchronization, and re-joins. They make it more likely the cluster can just replay the missed operations. You
+     * can read a bit more about soft deletes here:
+     *
+     * https://www.elastic.co/guide/en/elasticsearch/reference/7.10/index-modules-history-retention.html
+     *
+     * Soft deletes work by having the application writing the Lucene Index define a field that is used to mark a document as
+     * "soft deleted" or not. When a document is marked as "soft deleted", it is not returned in search results, but it is
+     * still present in the Lucene Index. The status of whether any given document is "soft deleted" or not is stored in the
+     * Lucene Index itself. By default, Elasticsearch 7+ Indices have soft deletes enabled; this is an Index-level setting.
+     * Just like deleted documents and old versions of updated documents, we don't want to reindex them against the target
+     * cluster.
      *
      * In order to retrieve only those documents that would be considered "live" in ES/OS, we use a few tricks:
      * 1. We make sure we use the latest Lucene commit point on the Lucene Index. A commit is a Lucene abstraction that