Improved comments on RFS LuceneDocumentsReader (#908)
Signed-off-by: Chris Helma <[email protected]>
chelma authored Aug 16, 2024
1 parent 86ae153 commit daada64
Showing 1 changed file with 28 additions and 14 deletions.
42 changes: 28 additions & 14 deletions RFS/src/main/java/com/rfs/common/LuceneDocumentsReader.java
@@ -32,23 +32,37 @@ public static Function<Path, LuceneDocumentsReader> getFactory(boolean softDelet

  /**
  * There are a variety of states the documents in our Lucene Index can be in; this method extracts those documents
- * that would be considered "live" from the ElasticSearch/OpenSearch perspective.
+ * that would be considered "live" from the ElasticSearch/OpenSearch perspective. The most important thing to know is
+ * that Lucene segments are immutable. For additional context, it is highly recommended to read this section of the
+ * Lucene docs for a high-level overview of the topics involved:
  *
- * For context, when ElasticSearch/OpenSearch deletes a document, it doesn't actually remove it from the Lucene Index.
- * Instead, what happens is that the document is marked as "deleted" in the Lucene Index, but it is still present in the
- * Lucene segment on disk. The next time that segment is merged, the deleted documents are removed from the Lucene Index.
- * A similar thing happens when a document is updated; the old document is marked as "deleted" and a new document is
- * added in a new Lucene segment. This means that from an ES/OS perspective, you could have a single document that has
- * been created, deleted, recreated, updated, etc. multiple times and only a single version of the doc would exist when
- * you queried the ES/OS Index - but every single iteration of that doc might still exist in the Lucene Segments on disk,
- * all of which have the same _id (from the ES/OS perspective).
+ * https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/codecs/lucene80/package-summary.html
+ *
+ * When ElasticSearch/OpenSearch deletes a document, it doesn't actually remove it from the Lucene Index. Instead, what
+ * happens is that the document is marked as "deleted" in the Lucene Index, but it is still present in the Lucene segment
+ * on disk. The next time a merge occurs, that segment will be deleted, and the deleted documents in it are thereby
+ * removed from the Lucene Index. A similar thing happens when a document is updated; the old document is marked as
+ * "deleted" in the Lucene segment and the new version of the document is added in a new Lucene segment. Until a merge
+ * occurs, both the old and new versions of the document will exist in the Lucene Index in different segments, though only
+ * the new version will be returned in search results. This means that from an ES/OS perspective, you could have a single
+ * document that has been created, deleted, recreated, updated, etc. multiple times at the Elasticsearch/OpenSearch level
+ * and only a single version of the doc would exist when you queried the ES/OS Index - but every single iteration of that
+ * document might still exist in the Lucene segments on disk, all of which have the same _id (from the ES/OS perspective).
  *
  * Additionally, Elasticsearch 7 introduced a feature called "soft deletes" which allows you to mark a document as
- * "deleted" in the Lucene Index without actually removing it from the Lucene Index. This works by having the
- * application writing the Lucene Index define a field that is used to mark a document as "soft deleted" or not. When
- * a document is marked as "soft deleted", it is not returned in search results, but it is still present in the Lucene
- * Index. The status of whether any given document is "soft deleted" or not is stored in the Lucene Index itself. By
- * default, Elasticsearch 7+ Indices have soft deletes enabled; this is an Index-level setting.
+ * "deleted" in the Lucene Index without actually removing it from the Lucene Index. From what I can gather, soft deletes
+ * are an optimization to reduce the likelihood of needing to re-download full shards when a node drops out of the cluster,
+ * loses synchronization, and re-joins. They make it more likely the cluster can just replay the missed operations. You
+ * can read a bit more about soft deletes here:
+ *
+ * https://www.elastic.co/guide/en/elasticsearch/reference/7.10/index-modules-history-retention.html
+ *
+ * Soft deletes work by having the application writing the Lucene Index define a field that is used to mark a document as
+ * "soft deleted" or not. When a document is marked as "soft deleted", it is not returned in search results, but it is
+ * still present in the Lucene Index. The status of whether any given document is "soft deleted" or not is stored in the
+ * Lucene Index itself. By default, Elasticsearch 7+ Indices have soft deletes enabled; this is an Index-level setting.
+ * Just like deleted documents and old versions of updated documents, we don't want to reindex them against the target
+ * cluster.
  *
  * In order to retrieve only those documents that would be considered "live" in ES/OS, we use a few tricks:
  * 1. We make sure we use the latest Lucene commit point on the Lucene Index. A commit is a Lucene abstraction that
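
The delete, update, and merge behavior described in the comment can be illustrated with a toy model in plain Java. This is only a sketch under stated assumptions: `ToyIndex` and `Segment` are hypothetical names invented here, not Lucene or RFS classes, and each write lands in its own one-document segment purely to keep the example short. It is meant to show why several copies of one `_id` can sit on disk while only one is "live".

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: hypothetical classes, not Lucene APIs. */
class ToyIndex {
    /** Document contents never change after creation; only the "live" bitmap does. */
    static final class Segment {
        final List<String[]> docs; // each entry is {_id, source}
        final BitSet live;         // bit i is set while docs.get(i) is live

        Segment(List<String[]> docs) {
            this.docs = docs;
            this.live = new BitSet(docs.size());
            this.live.set(0, docs.size());
        }
    }

    final List<Segment> segments = new ArrayList<>();

    /** Create or update: mark any older copy deleted, then write the new version to a new segment. */
    void index(String id, String source) {
        delete(id);
        List<String[]> docs = new ArrayList<>();
        docs.add(new String[] {id, source});
        segments.add(new Segment(docs));
    }

    /** Delete: clear the live bit; the document itself stays in its segment. */
    void delete(String id) {
        for (Segment s : segments) {
            for (int i = 0; i < s.docs.size(); i++) {
                if (s.docs.get(i)[0].equals(id)) {
                    s.live.clear(i);
                }
            }
        }
    }

    /** Merge: copy only live docs into one new segment and drop the old segments. */
    void merge() {
        List<String[]> merged = new ArrayList<>();
        for (Segment s : segments) {
            for (int i = s.live.nextSetBit(0); i >= 0; i = s.live.nextSetBit(i + 1)) {
                merged.add(s.docs.get(i));
            }
        }
        segments.clear();
        if (!merged.isEmpty()) {
            segments.add(new Segment(merged));
        }
    }

    /** Number of document copies physically stored, live or not. */
    int storedCount() {
        int n = 0;
        for (Segment s : segments) {
            n += s.docs.size();
        }
        return n;
    }

    /** What a search would see: _id -> source, for live docs only. */
    Map<String, String> liveDocs() {
        Map<String, String> out = new LinkedHashMap<>();
        for (Segment s : segments) {
            for (int i = s.live.nextSetBit(0); i >= 0; i = s.live.nextSetBit(i + 1)) {
                out.put(s.docs.get(i)[0], s.docs.get(i)[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        idx.index("1", "v1");
        idx.delete("1");
        idx.index("1", "v2");
        idx.index("1", "v3");
        System.out.println(idx.storedCount() + " stored, live=" + idx.liveDocs()); // 3 stored, live={1=v3}
        idx.merge();
        System.out.println(idx.storedCount() + " stored, live=" + idx.liveDocs()); // 1 stored, live={1=v3}
    }
}
```

Before the merge, three copies of `_id` 1 are stored but only `v3` is live, which is the situation a reader of the raw segments has to filter for. Soft deletes are not modeled here.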