[Remote Storage][BUG] Don't fail remote store recovery for shards when there is actually no data in remote store. #16443

skumawat2025 opened this issue Oct 23, 2024
Describe the bug

During index creation, some shards can fail to initialize, for example because a node has breached its disk usage thresholds. Despite this partial failure, the index creation is treated as successful and we proceed to upload the remote index path file. The unassigned shards leave the cluster RED. If we then attempt a remote store recovery in this state, the restore API reports a successful recovery for every shard even though some shards are still unassigned:

Response of the Remote Restore API:

curl -X POST "localhost:9200/_remotestore/_restore" -H 'Content-Type: application/json' -d' 
{
  "indices": ["split-index-after-full-storage"]
}'
{"remote_store":{"snapshot":"remote_store","indices":["split-index-after-full-storage"],"shards":{"total":2,"failed":0,"successful":2}}}

We see this stack trace in the master logs.

Caused by: [split-index-after-full-storage/1010112HQmAwMwSHmpvIuU8D6myQ][[split-index-after-full-storage][1]] IndexShardRecoveryException[failed recovery]; nested: TranslogCorruptedException[translog from source [/es/data/nodes/0/indices/1010112HQmAwMwSHmpvIuU8D6myQ/1/translog/translog.ckp] is corrupted]; nested: NoSuchFileException[/es/data/nodes/0/indices/1010112HQmAwMwSHmpvIuU8D6myQ/1/translog/translog.ckp];
        ... 12 more
Caused by: TranslogCorruptedException[translog from source [/es/data/nodes/0/indices/1010112HQmAwMwSHmpvIuU8D6myQ/1/translog/translog.ckp] is corrupted]; nested: NoSuchFileException[/es/data/nodes/0/indices/1010112HQmAwMwSHmpvIuU8D6myQ/1/translog/translog.ckp];
        at org.opensearch.index.translog.Checkpoint.read(Checkpoint.java:212)
        at org.opensearch.index.shard.StoreRecovery.recoverFromRemoteStore(StoreRecovery.java:558)
        at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromRemoteStore$1(StoreRecovery.java:135)
        at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344)
        ... 9 more
Caused by: java.nio.file.NoSuchFileException: /es/data/nodes/0/indices/1010112HQmAwMwSHmpvIuU8D6myQ/1/translog/translog.ckp
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:224)
        at java.nio.channels.FileChannel.open(FileChannel.java:309)
        at java.nio.channels.FileChannel.open(FileChannel.java:369)
        at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:78)
        at org.opensearch.index.translog.Checkpoint.read(Checkpoint.java:204)
        ... 12 more
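
The reason the shard stays unassigned can be confirmed with the cluster allocation explain API (standard API; shard 1 matches the shard in the stack trace above):

curl -s -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "split-index-after-full-storage",
  "shard": 1,
  "primary": true
}'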

Related component

Storage:Durability

To Reproduce

  1. Create a cluster with two data nodes.
  2. Create an index test-index1.
  3. Drive one of the data nodes, say node1, to high disk usage.
  4. Split test-index1 into a new index split-index-after-full-storage with at least 2 primary shards.
  5. One primary shard of split-index-after-full-storage is initialized on node2, while the shard destined for node1 remains unassigned, leaving the cluster RED.
  6. Attempt to restore the red index from the remote store; the stack trace above appears in the master logs. A minimal command sketch of these steps follows below.
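
A command sketch of the steps above. The index names match the report; the watermark values are illustrative and assume node1 has less free space than node2, so lowering the flood-stage watermark stands in for actually filling node1's disk:

# 2. Create the source index with a single primary
curl -X PUT "localhost:9200/test-index1" -H 'Content-Type: application/json' -d'
{ "settings": { "index.number_of_shards": 1, "index.number_of_replicas": 0 } }'

# 3. Simulate high disk usage on node1 by tightening the disk watermarks
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{ "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "5gb",
    "cluster.routing.allocation.disk.watermark.high": "3gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "2gb" } }'

# 4. Block writes on the source, then split it into 2 primary shards
curl -X PUT "localhost:9200/test-index1/_settings" -H 'Content-Type: application/json' -d'
{ "index.blocks.write": true }'
curl -X POST "localhost:9200/test-index1/_split/split-index-after-full-storage" -H 'Content-Type: application/json' -d'
{ "settings": { "index.number_of_shards": 2 } }'

# 6. Attempt the remote store restore for the red index
curl -X POST "localhost:9200/_remotestore/_restore" -H 'Content-Type: application/json' -d'
{ "indices": ["split-index-after-full-storage"] }'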

Expected behavior

Recovery should not fail with the stack trace above. When there is no data in the remote store for a shard, the recovery path should detect that and log (and ideally surface in the restore response) that there is nothing on the remote to recover, rather than failing on a missing translog checkpoint.
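
On versions that expose it, the remote store stats API can confirm that nothing was uploaded for the unassigned shard; that absence is the signal recovery could check before attempting to download a translog checkpoint that was never written (assumes a recent OpenSearch version where this API is available):

curl -s "localhost:9200/_remotestore/stats/split-index-after-full-storage?pretty"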

Additional Details

Plugins
Please list all plugins currently enabled.

Screenshots
If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

  • OS: [e.g. iOS]
  • Version [e.g. 22]

Additional context
Add any other context about the problem here.
