[BUG FIX] Fix bwc failure in neural sparse search #696


@zhichao-aws zhichao-aws commented Apr 18, 2024

Description

We recently found some BWC (backward-compatibility) failures in neural sparse search in the rolling-upgrade case (upgrading from 2.x to 3.0).
ref:
https://github.com/opensearch-project/neural-search/actions/runs/8649948821/job/23718643879?pr=683
https://github.com/opensearch-project/neural-search/actions/runs/8728772234/job/23957283115?pr=694

The cause: we introduced max_token_score in 2.11 and merged that PR directly into 2.x, so the main branch never received it. Although we deprecated this field after the upgrade to Lucene 9.8, it is still a breaking difference between main and 2.x. It causes a serialization/deserialization inconsistency between 3.0 nodes and 2.x nodes. This inconsistency fails the search request on the shard, and we then get the empty search response seen in the BWC tests above.

In this PR we add the max_token_score parsing logic to NeuralSparseQueryBuilder. We parse this field but ignore it in doToQuery. This logic is consistent with what 2.x does after the field was deprecated. Once this is merged, the max_token_score field parsing and UT logic in main will be consistent with 2.x.
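
To illustrate why a field that only one side knows about breaks node-to-node communication, here is a minimal sketch. It uses plain java.io data streams rather than the OpenSearch stream classes, and the class name, method names, field order, and sample values are all assumptions made for the example, not the plugin's actual wire format. One method reads the payload without knowing about the extra float and becomes misaligned; the other reads the float and discards it, which is the "parse but ignore" idea described above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch: not the real NeuralSparseQueryBuilder serialization.
class WireMismatchSketch {

    // Writer that includes the extra float field (stands in for max_token_score).
    static byte[] writeWithMaxTokenScore(String fieldName, String queryText, float maxTokenScore) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(fieldName);
            out.writeUTF(queryText);
            out.writeFloat(maxTokenScore); // the field only one side knows about
            out.writeUTF("model-id");      // whatever comes next on the wire
        }
        return bytes.toByteArray();
    }

    // Reader that does not know about the float: it treats the float's bytes
    // as the start of the next string and becomes misaligned.
    static String readModelIdWithoutMaxTokenScore(byte[] payload) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            in.readUTF();
            in.readUTF();
            return in.readUTF();
        }
    }

    // Reader that consumes the float and ignores its value, so the following
    // fields stay aligned.
    static String readModelIdSkippingMaxTokenScore(byte[] payload) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            in.readUTF();
            in.readUTF();
            in.readFloat(); // parsed, then deliberately unused
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = writeWithMaxTokenScore("passage_embedding", "hello world", 3.5f);
        System.out.println("aligned read: " + readModelIdSkippingMaxTokenScore(payload));
        try {
            readModelIdWithoutMaxTokenScore(payload);
        } catch (IOException e) {
            System.out.println("misaligned read failed: " + e);
        }
    }
}
```

The point of the design is that the reader must consume exactly the bytes the writer produced, even for a field whose value it no longer uses.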

Issues Resolved

#688

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: zhichao-aws <[email protected]>
@zhichao-aws
Member Author

https://github.com/opensearch-project/neural-search/actions/runs/8734284147/job/23965092639?pr=696
The BWC run failed due to a different exception, and I found that an issue already tracks it: opensearch-project/ml-commons#2333

REPRODUCE WITH: ./gradlew ':qa:restart-upgrade:testAgainstNewCluster' --tests "org.opensearch.neuralsearch.bwc.SemanticSearchIT.testTextEmbeddingProcessor_E2EFlow" -Dtests.seed=A0EB48412B9D6A40 -Dtests.security.manager=false -Dtests.bwc.version=2.10.0 -Dtests.locale=sr-Latn -Dtests.timezone=Australia/Eucla -Druntime.java=21

org.opensearch.neuralsearch.bwc.SemanticSearchIT > testTextEmbeddingProcessor_E2EFlow FAILED
    org.opensearch.client.ResponseException: method [DELETE], host [http://127.0.0.1:40357/], URI [/_plugins/_ml/models/FZY88I4BM2JwIbQlryn-], status line [HTTP/1.1 500 Internal Server Error]
    {"error":{"root_cause":[{"type":"null_pointer_exception","reason":"Cannot invoke \"java.lang.Boolean.booleanValue()\" because \"isHidden\" is null"}],"type":"null_pointer_exception","reason":"Cannot invoke \"java.lang.Boolean.booleanValue()\" because \"isHidden\" is null"},"status":500}

Suite: Test class org.opensearch.neuralsearch.bwc.SemanticSearchIT
  2> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
  2> SLF4J: Defaulting to no-operation (NOP) logger implementation
  2> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
  2> REPRODUCE WITH: ./gradlew ':qa:restart-upgrade:testAgainstNewCluster' --tests "org.opensearch.neuralsearch.bwc.SemanticSearchIT.testTextEmbeddingProcessor_E2EFlow" -Dtests.seed=A0EB48412B9D6A40 -Dtests.security.manager=false -Dtests.bwc.version=2.10.0 -Dtests.locale=sr-Latn -Dtests.timezone=Australia/Eucla -Druntime.java=21
  2> org.opensearch.client.ResponseException: method [DELETE], host [http://127.0.0.1:40357/], URI [/_plugins/_ml/models/FZY88I4BM2JwIbQlryn-], status line [HTTP/1.1 500 Internal Server Error]
    {"error":{"root_cause":[{"type":"null_pointer_exception","reason":"Cannot invoke \"java.lang.Boolean.booleanValue()\" because \"isHidden\" is null"}],"type":"null_pointer_exception","reason":"Cannot invoke \"java.lang.Boolean.booleanValue()\" because \"isHidden\" is null"},"status":500}
        at __randomizedtesting.SeedInfo.seed([A0EB48412B9D6A40:B0771E0F66EF3A19]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:385)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:355)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:330)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.makeRequest(BaseNeuralSearchIT.java:905)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.makeRequest(BaseNeuralSearchIT.java:878)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.deleteModel(BaseNeuralSearchIT.java:937)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.wipeOfTestResources(BaseNeuralSearchIT.java:1231)
        at app//org.opensearch.neuralsearch.bwc.SemanticSearchIT.testTextEmbeddingProcessor_E2EFlow(SemanticSearchIT.java:46)

@zhichao-aws
Member Author

Just found another flaky BWC test (ref). It looks just like the flaky neural sparse test: we ingest a doc in the old cluster but cannot search it in the mixed cluster.

org.opensearch.neuralsearch.bwc.HybridSearchIT > testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow FAILED
    java.lang.AssertionError: expected:<1> but was:<0>
        at __randomizedtesting.SeedInfo.seed([15108A61038B5F36:BCA3B86F76A79AD2]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.opensearch.neuralsearch.bwc.HybridSearchIT.validateTestIndexOnUpgrade(HybridSearchIT.java:99)
        at org.opensearch.neuralsearch.bwc.HybridSearchIT.testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow(HybridSearchIT.java:62)
REPRODUCE WITH: ./gradlew ':qa:rolling-upgrade:testAgainstOneThirdUpgradedCluster' --tests "org.opensearch.neuralsearch.bwc.HybridSearchIT.testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow" -Dtests.seed=15108A61038B5F36 -Dtests.security.manager=false -Dtests.bwc.version=2.14.0-SNAPSHOT -Dtests.locale=hr -Dtests.timezone=Africa/Maseru -Druntime.java=11
  1> [2024-04-18T10:14:02,796][INFO ][o.o.n.b.HybridSearchIT   ] [testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow] before test
  1> [2024-04-18T10:14:03,023][INFO ][o.o.n.b.HybridSearchIT   ] [testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow] initializing REST clients against [http://[::1]:44971, http://127.0.0.1:46107/, http://[::1]:34955, http://127.0.0.1:37089/, http://[::1]:45727, http://127.0.0.1:37209/]
  1> [2024-04-18T10:14:07,199][INFO ][o.o.n.b.HybridSearchIT   ] [testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow] There are still tasks running after this test that might break subsequent tests [indices:data/read/search, indices:data/read/search[phase/query], indices:data/write/bulk, indices:data/write/bulk[s], indices:data/write/bulk[s][p], indices:data/write/bulk[s][r]].

Suite: Test class org.opensearch.neuralsearch.bwc.HybridSearchIT
  2> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
  2> SLF4J: Defaulting to no-operation (NOP) logger implementation
  2> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
  2> REPRODUCE WITH: ./gradlew ':qa:rolling-upgrade:testAgainstOneThirdUpgradedCluster' --tests "org.opensearch.neuralsearch.bwc.HybridSearchIT.testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow" -Dtests.seed=15108A61038B5F36 -Dtests.security.manager=false -Dtests.bwc.version=2.14.0-SNAPSHOT -Dtests.locale=hr -Dtests.timezone=Africa/Maseru -Druntime.java=11
  2> java.lang.AssertionError: expected:<1> but was:<0>
        at __randomizedtesting.SeedInfo.seed([15108A61038B5F36:BCA3B86F76A79AD2]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.opensearch.neuralsearch.bwc.HybridSearchIT.validateTestIndexOnUpgrade(HybridSearchIT.java:99)
        at org.opensearch.neuralsearch.bwc.HybridSearchIT.testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow(HybridSearchIT.java:62)
  2> NOTE: leaving temporary files on disk at: /home/runner/work/neural-search/neural-search/qa/rolling-upgrade/build/testrun/testAgainstOneThirdUpgradedCluster/temp/org.opensearch.neuralsearch.bwc.HybridSearchIT_15108A61038B5F36-001
  2> NOTE: test params are: codec=Asserting(Lucene99): {}, docValues:{}, maxPointsInLeafNode=1523, maxMBSortInHeap=6.857239323607784, sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=hr, timezone=Africa/Maseru
  2> NOTE: Linux 6.5.0-1018-azure amd64/Azul Systems, Inc. 11.0.23 (64-bit)/cpus=4,threads=3,free=457856512,total=536870912
  2> NOTE: All tests run in this JVM: [HybridSearchIT]
  1> [2024-04-18T10:14:07,226][INFO ][o.o.n.b.HybridSearchIT   ] [testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow] after test

I then ran some further experiments. I created a cluster with one 2.14 node and one 3.0 node, then created an index with 2 shards (one shard on each node) and wrote 10 docs to it. When I call the cat index API on each node, stream-IO-related errors appear in the node log, and the API returns only the doc count of the local shard (4 and 6, respectively).
image
I also ran a simple search through the search API on each node. The request failed to broadcast between nodes and returned only the doc count of the local shard.
image
The search response looks like this. The search request fails to broadcast between nodes due to a stream i_o_exception:

{'took': 11,
 'timed_out': False,
 '_shards': {'total': 2,
  'successful': 1,
  'skipped': 0,
  'failed': 1,
  'failures': [{'shard': 0,
    'index': 'test',
    'node': 'GrRxwmwZRnu7WJPbuafscA',
    'reason': {'type': 'i_o_exception',
     'reason': 'Invalid vInt ((ffffffef & 0x7f) << 28) | 51ce6d'}}]},
 'hits': {' ...

In those failed BWC tests we ingest only one document. If that doc lands on the other node, we get an empty search response (which is where the BWC test fails).
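
The shape of the "Invalid vInt" message above suggests a variable-length integer decoder that hit a fifth byte with its continuation bit still set, which is what tends to happen when a reader is misaligned with the bytes the writer produced. The following is a hedged sketch of such a decoder, assuming the usual 7-bits-per-byte little-endian encoding. The class and method names are hypothetical, and the error text is modeled on the log line rather than copied from any library source.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical sketch of a variable-length int codec (7 payload bits per byte,
// high bit = "more bytes follow"). Not the OpenSearch implementation.
class VIntSketch {

    static byte[] encodeVInt(int value) {
        byte[] buf = new byte[5];
        int n = 0;
        while ((value & ~0x7F) != 0) {
            buf[n++] = (byte) ((value & 0x7F) | 0x80); // low 7 bits plus continuation bit
            value >>>= 7;
        }
        buf[n++] = (byte) value; // final byte, continuation bit clear
        return Arrays.copyOf(buf, n);
    }

    static int decodeVInt(byte[] bytes) throws IOException {
        int result = 0;
        for (int i = 0; i < 5; i++) {
            if (i >= bytes.length) {
                throw new EOFException("ran out of bytes while reading a vInt");
            }
            byte b = bytes[i];
            // A 32-bit value needs at most 5 bytes, so the fifth byte must end the number.
            if (i == 4 && (b & 0x80) != 0) {
                throw new IOException(
                    "Invalid vInt ((" + Integer.toHexString(b) + " & 0x7f) << 28) | " + Integer.toHexString(result)
                );
            }
            result |= (b & 0x7F) << (7 * i);
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalStateException("unreachable: the loop returns or throws by the fifth byte");
    }

    public static void main(String[] args) throws IOException {
        System.out.println(decodeVInt(encodeVInt(300)));
        // Bytes chosen for illustration so that the error text matches the log line above.
        byte[] misaligned = { (byte) 0xED, (byte) 0x9C, (byte) 0xC7, (byte) 0x82, (byte) 0xEF };
        try {
            decodeVInt(misaligned);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Bytes that were never written as a vInt (for example, part of a float or a string) will often look like an endless run of continuation bytes to this decoder, which is why a single unexpected field can surface as this kind of error far from its cause.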

@zane-neo
Collaborator

How can we confirm this is the same issue as #688? The error logs look totally different.

@zane-neo
Collaborator


Is it possible to identify the incompatible object between 2.14 and 3.0? If this is a breaking change, we might be able to skip this test.

@zhichao-aws
Member Author

How can we confirm this is the same issue as #688? The error logs look totally different.

We now have 3 flaky tests here: neural sparse search; hybrid search with a match query; and a test whose error log reads: Cannot invoke "java.lang.Boolean.booleanValue()" because "isHidden" is null. This PR fixes the first one.

For the first flaky test, there are 2 possible test cases: one is org.opensearch.neuralsearch.bwc.NeuralQueryEnricherProcessorIT.testNeuralQueryEnricherProcessor_NeuralSparseSearch_E2EFlow; the other is org.opensearch.neuralsearch.bwc.NeuralSparseSearchIT.testSparseEncodingProcessor_E2EFlow. The underlying reason is the same: the neural sparse search request fails to broadcast between nodes and returns only the local result.
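
Because a partially failed search still returns a response, a test that asserts only on the hit count reports "expected:<1> but was:<0>" and hides the shard failure behind it. As a hedged sketch, a helper like the one below could surface the `_shards` section of a parsed search response first. The class and method names are hypothetical, and it assumes the response has already been parsed into nested maps shaped like the response shown earlier in this thread.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper: summarizes shard failures from a parsed search response.
class ShardFailureCheck {

    /** Returns a description of shard failures, or an empty string if none are reported. */
    @SuppressWarnings("unchecked")
    static String describeShardFailures(Map<String, Object> searchResponse) {
        Map<String, Object> shards = (Map<String, Object>) searchResponse.get("_shards");
        if (shards == null) {
            return "";
        }
        int failed = ((Number) shards.getOrDefault("failed", 0)).intValue();
        if (failed == 0) {
            return "";
        }
        StringBuilder summary = new StringBuilder(failed + " shard(s) failed");
        List<Map<String, Object>> failures =
            (List<Map<String, Object>>) shards.getOrDefault("failures", List.of());
        for (Map<String, Object> failure : failures) {
            Map<String, Object> reason = (Map<String, Object>) failure.get("reason");
            summary.append("; shard ")
                .append(failure.get("shard"))
                .append(": ")
                .append(reason == null ? "unknown" : reason.get("type"));
        }
        return summary.toString();
    }
}
```

A test could fail fast with this summary whenever it is non-empty, so that a log would point at the i_o_exception rather than at an unexpected hit count.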

@zhichao-aws
Member Author

zhichao-aws commented Apr 19, 2024

Is it possible to identify the incompatible object between 2.14 and 3.0? If this is a breaking change, we might be able to skip this test.

I just learned that we need to keep BWC between the latest 2.x and 3.0, so we cannot skip this test. The BWC issue in core should be considered a bug. (So far I've found that the match query and the cat index API have BWC issues.)

@zane-neo
Collaborator


I don't see that this PR fixes the issue listed under Issues Resolved, so let's merge this but not close issue #688.

@vibrantvarun
Member

LGTM

@vibrantvarun vibrantvarun requested a review from yuye-aws April 20, 2024 06:27
@zhichao-aws zhichao-aws merged commit 7b0229d into opensearch-project:main Apr 20, 2024
70 checks passed
conggguan pushed a commit to conggguan/neural-search that referenced this pull request Apr 22, 2024
* Adding integ tests for scenario of hybrid query with aggregations (opensearch-project#632)

* Adding tests and params to ignore tests if needed

Signed-off-by: Martin Gaievski <[email protected]>

* [BUG FIX] Fix bwc failure in neural sparse search (opensearch-project#696)

* fix comments

Signed-off-by: zhichao-aws <[email protected]>

---------

Signed-off-by: Martin Gaievski <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Co-authored-by: Martin Gaievski <[email protected]>