[Remote Storage][BUG] Don't fail remote store recovery for shards when there is actually no data in remote store. #16443

skumawat2025 opened this issue Oct 23, 2024
Describe the bug

During index creation, some shards can fail to initialize, for example because a node has breached its disk usage thresholds. Despite this partial failure, the index creation is treated as successful and we proceed to upload the remote index path file. The unassigned shards leave the cluster RED. If we then attempt a remote store recovery in this state, the restore API reports a successful recovery for every shard even though some shards are still unassigned:

Response of the Remote Restore API:

curl -X POST "localhost:9200/_remotestore/_restore" -H 'Content-Type: application/json' -d' 
{
  "indices": ["split-index-after-full-storage"]
}'
{"remote_store":{"snapshot":"remote_store","indices":["split-index-after-full-storage"],"shards":{"total":2,"failed":0,"successful":2}}}

We see this stack trace in the master logs.

Caused by: [split-index-after-full-storage/1010112HQmAwMwSHmpvIuU8D6myQ][[split-index-after-full-storage][1]] IndexShardRecoveryException[failed recovery]; nested: TranslogCorruptedException[translog from source [/es/data/nodes/0/indices/1010112HQmAwMwSHmpvIuU8D6myQ/1/translog/translog.ckp] is corrupted]; nested: NoSuchFileException[/es/data/nodes/0/indices/1010112HQmAwMwSHmpvIuU8D6myQ/1/translog/translog.ckp];
        ... 12 more
Caused by: TranslogCorruptedException[translog from source [/es/data/nodes/0/indices/1010112HQmAwMwSHmpvIuU8D6myQ/1/translog/translog.ckp] is corrupted]; nested: NoSuchFileException[/es/data/nodes/0/indices/1010112HQmAwMwSHmpvIuU8D6myQ/1/translog/translog.ckp];
        at org.opensearch.index.translog.Checkpoint.read(Checkpoint.java:212)
        at org.opensearch.index.shard.StoreRecovery.recoverFromRemoteStore(StoreRecovery.java:558)
        at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromRemoteStore$1(StoreRecovery.java:135)
        at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344)
        ... 9 more
Caused by: java.nio.file.NoSuchFileException: /es/data/nodes/0/indices/1010112HQmAwMwSHmpvIuU8D6myQ/1/translog/translog.ckp
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:224)
        at java.nio.channels.FileChannel.open(FileChannel.java:309)
        at java.nio.channels.FileChannel.open(FileChannel.java:369)
        at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:78)
        at org.opensearch.index.translog.Checkpoint.read(Checkpoint.java:204)
        ... 12 more
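
The reason the shard stays unassigned can be confirmed with the cluster allocation explain API (standard API; shard 1 matches the shard in the stack trace above):

curl -s -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "split-index-after-full-storage",
  "shard": 1,
  "primary": true
}'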

Related component

Storage:Durability

To Reproduce

  1. Create a cluster with two data nodes.
  2. Create an index test-index1.
  3. Drive one of the data nodes, say node1, to high disk usage.
  4. Split test-index1 into a new index split-index-after-full-storage with at least 2 primary shards.
  5. One primary shard of split-index-after-full-storage is initialized on node2, while the shard destined for node1 remains unassigned, leaving the cluster RED.
  6. Attempt to restore the red index from the remote store; the stack trace above appears in the master logs. A minimal command sketch of these steps follows below.
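
A command sketch of the steps above. The index names match the report; the watermark values are illustrative and assume node1 has less free space than node2, so lowering the flood-stage watermark stands in for actually filling node1's disk:

# 2. Create the source index with a single primary
curl -X PUT "localhost:9200/test-index1" -H 'Content-Type: application/json' -d'
{ "settings": { "index.number_of_shards": 1, "index.number_of_replicas": 0 } }'

# 3. Simulate high disk usage on node1 by tightening the disk watermarks
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{ "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "5gb",
    "cluster.routing.allocation.disk.watermark.high": "3gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "2gb" } }'

# 4. Block writes on the source, then split it into 2 primary shards
curl -X PUT "localhost:9200/test-index1/_settings" -H 'Content-Type: application/json' -d'
{ "index.blocks.write": true }'
curl -X POST "localhost:9200/test-index1/_split/split-index-after-full-storage" -H 'Content-Type: application/json' -d'
{ "settings": { "index.number_of_shards": 2 } }'

# 6. Attempt the remote store restore for the red index
curl -X POST "localhost:9200/_remotestore/_restore" -H 'Content-Type: application/json' -d'
{ "indices": ["split-index-after-full-storage"] }'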

Expected behavior

Recovery should not fail with the stack trace above. When there is no data in the remote store for a shard, the recovery path should detect that and log (and ideally surface in the restore response) that there is nothing on the remote to recover, rather than failing on a missing translog checkpoint.
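
On versions that expose it, the remote store stats API can confirm that nothing was uploaded for the unassigned shard; that absence is the signal recovery could check before attempting to download a translog checkpoint that was never written (assumes a recent OpenSearch version where this API is available):

curl -s "localhost:9200/_remotestore/stats/split-index-after-full-storage?pretty"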

Additional Details

Plugins
Please list all plugins currently enabled.

Screenshots
If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

  • OS: [e.g. iOS]
  • Version [e.g. 22]

Additional context
Add any other context about the problem here.
