Fix bug where ingestion failed for input document containing list of nested objects #1040

yizheliu-amazon · 2024-12-22T00:51:55Z

Description

Fix a bug where ingestion failed for an input document containing a list of nested objects.

Related Issues

Resolves #1024

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

heemin32 · 2024-12-24T18:59:14Z

Can we have an integration test (IT) for this?

heemin32 · 2024-12-24T18:57:50Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

+        int nestedElementIndex
+    ) {
+        if (processorKey == null || sourceAndMetadataMap == null || sourceValue == null) return;
+        if (sourceValue instanceof Map) {


Isn't sourceValue always an instance of Map?

No. For a document with a list of nested objects, sourceValue becomes a List in the last recursive call. See lines 505-506.

I mean, this method is called only when sourceValue is an instance of Map.

No. It is called from putNLPResultToSourceMapForMapType() when sourceValue is a Map, but sourceValue can be a nested object with a list inside. During the recursive calls of putNLPResultToSingleSourceMapInList(), it may reach a level where the value is a List.

You're right. I missed the recursive aspect. I have a follow-up comment:

Is there a limit on the number of recursive calls that can occur? There was a security campaign emphasizing the need to avoid unbounded recursion. We should either impose a limit on the recursion depth or refactor the logic to use an iterative approach.

Yes. We already validate the fieldMap depth in ProcessorDocumentUtils.validateMapTypeValue(). Since the recursion depth of this method is the same as the fieldMap depth, it should be fine not to validate the recursion depth again here.

Please feel free to let me know your thoughts.
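To illustrate the point about bounded recursion, here is a minimal sketch of a depth check over nested Map and List values. It is not the implementation of ProcessorDocumentUtils.validateMapTypeValue(); the class name, the method name, and the choice to count lists as a nesting level are assumptions made for this example only.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the neural-search plugin.
final class NestedDepthSketch {

    /**
     * Walks nested Map/List values and throws once a container sits deeper
     * than maxDepth. The root container is at depth 1. Counting lists as a
     * level is an assumption of this sketch.
     */
    static void validateDepth(Object value, int depth, int maxDepth) {
        boolean isContainer = value instanceof Map || value instanceof List;
        if (!isContainer) {
            return; // leaves (strings, numbers, null) end the recursion
        }
        if (depth > maxDepth) {
            throw new IllegalArgumentException("nesting depth exceeds " + maxDepth);
        }
        Iterable<?> children = value instanceof Map ? ((Map<?, ?>) value).values() : (List<?>) value;
        for (Object child : children) {
            validateDepth(child, depth + 1, maxDepth);
        }
    }

    public static void main(String[] args) {
        // map (depth 1) -> list (depth 2) -> map (depth 3)
        Map<String, Object> doc = Map.of("a", List.of(Map.of("b", "text")));
        validateDepth(doc, 1, 3); // passes
        try {
            validateDepth(doc, 1, 2);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

If a check like this runs before the recursive write-back, the later recursion cannot go deeper than the validated structure, which is the argument made in the reply above.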

heemin32 · 2024-12-24T18:58:40Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

+        int nestedElementIndex
+    ) {
+        if (processorKey == null || sourceAndMetadataMap == null || sourceValue == null) return;
+        if (sourceValue instanceof Map) {


Suggested change

-        if (sourceValue instanceof Map) {
+        assert sourceValue instanceof Map : "sourceValue should be an instance of Map";

martin-gaievski · 2024-12-24T18:36:57Z

src/test/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessorTests.java

+        */
+        Map<String, Object> child1Level2 = buildObjMapWithSingleField(CHILD_1_TEXT_FIELD, TEXT_VALUE_1);
+        Map<String, Object> child1Level1 = buildObjMapWithSingleField(CHILD_FIELD_LEVEL_1, child1Level2);
+        Map<String, Object> child2Level2 = buildObjMapWithSingleField(CHILD_1_TEXT_FIELD, TEXT_VALUE_1);


Is it critical for the test case to have identical values for both nested fields? In real-life scenarios the values will usually differ. Can we edit this method or add a new test case with two or more different fields?

Sure. I can add different fields so that the objects in the list are not identical.
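As a rough illustration of the kind of fixture being requested, the sketch below builds a document whose list holds two nested objects with different text values. The helper singleField() only imitates the role of buildObjMapWithSingleField() from the test class; the field names and text values are placeholders chosen for this example, not the constants used in TextEmbeddingProcessorTests.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical fixture builder, written only for this example.
final class NestedDocFixtureSketch {

    // Stand-in for the test helper that wraps one field in a map.
    static Map<String, Object> singleField(String field, Object value) {
        Map<String, Object> map = new HashMap<>();
        map.put(field, value);
        return map;
    }

    // Builds {"nested_passages": [{"level_1": {"text": ...}}, {"level_1": {"text": ...}}]}
    static Map<String, Object> buildDoc() {
        Map<String, Object> first = singleField("level_1", singleField("text", "hello world"));
        Map<String, Object> second = singleField("level_1", singleField("text", "goodbye world"));
        return singleField("nested_passages", List.of(first, second));
    }

    public static void main(String[] args) {
        System.out.println(buildDoc());
    }
}
```

With distinct values per element, a test can also check that each embedding is written back to the element it was computed from, which identical values cannot reveal.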

martin-gaievski · 2024-12-24T18:57:08Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

+                        List<Map<String, Object>> nestedElementList = (List<Map<String, Object>>) sourceAndMetadataMap.get(processorKey);
+
+                        IntStream.range(0, nestedElementList.size()).forEach(nestedElementIndex -> {
+                            Map<String, Object> nestedElement = nestedElementList.get(nestedElementIndex);


How about the following version? Getting elements by index can be suboptimal for some list implementations, such as a linked list:

Iterator<Map<String, Object>> iterator = nestedElementList.iterator();
Stream.iterate(0, i -> i + 1)
    .limit(nestedElementList.size())
    .forEach(index -> {
        Map<String, Object> nestedElement = iterator.next();
        putNLPResultToSingleSourceMapInList(
            entryKey,
            entryValue,
            results,
            indexWrapper,
            nestedElement,
            index
        );
    });

That's a good idea. I will change to this.
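For comparison, one other way to get both the element and its index in a single pass, without calling List.get(int), is a ListIterator. This is only a sketch of that alternative and not what the PR adopted; the class and method names are made up, and the "text" key is a placeholder.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;
import java.util.Map;

// Hypothetical example class, not part of InferenceProcessor.
final class IndexedIterationSketch {

    // Pairs each element with its position using one sequential pass.
    static List<String> describe(List<Map<String, Object>> nestedElementList) {
        List<String> out = new ArrayList<>();
        ListIterator<Map<String, Object>> it = nestedElementList.listIterator();
        while (it.hasNext()) {
            int index = it.nextIndex(); // index of the element next() will return
            Map<String, Object> element = it.next();
            out.add(index + ":" + element.get("text"));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> elements = new LinkedList<>();
        elements.add(Map.of("text", "a"));
        elements.add(Map.of("text", "b"));
        System.out.println(describe(elements)); // prints [0:a, 1:b]
    }
}
```

Because the iterator advances sequentially, the cost stays linear for a LinkedList, whereas repeated get(i) calls on a linked list are quadratic overall.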

martin-gaievski · 2024-12-24T18:58:36Z

src/main/java/org/opensearch/neuralsearch/processor/InferenceProcessor.java

                        }
+                    } else if (inputNestedMapEntry.getValue() instanceof Map) {


Can you please refactor the logic for each type into a separate method? This should make the code cleaner:

if (entryValue instanceof List) {
    processListTypeEntry(entryKey, (List<Object>) entryValue, processorKey, results, indexWrapper, sourceAndMetadataMap);
} else if (entryValue instanceof Map) {
    processMapTypeEntry(entryKey, entryValue, processorKey, results, indexWrapper, sourceAndMetadataMap);
}

I would rather have the same method name.

if (entryValue instanceof List) {
    processEntry(entryKey, (List<Object>) entryValue, processorKey, results, indexWrapper, sourceAndMetadataMap);
} else if (entryValue instanceof Map) {
    processEntry(entryKey, (Map<String, Object>) entryValue, processorKey, results, indexWrapper, sourceAndMetadataMap);
}

private void processEntry(..., List<Object> entryValue, ...){...}
private void processEntry(..., Map<String, Object> entryValue, ...){...}

Thanks for sharing the ideas. I may go with @martin-gaievski's suggestion, since I prefer specific method names for readability.

In Java, using the same method name with different parameter signatures (overloading) is a common and encouraged pattern.

Thanks @heemin32. This file already follows the pattern of naming separate methods for List and Map types, such as putNLPResultToSourceMapForMapType() and buildNLPResultForListType(), so I may keep this pattern. Please feel free to let me know if you have any concerns or ideas. Thank you.

In putNLPResultToSourceMapForMapType(), the map type refers to sourceValue, which is an Object. As such, it's explicitly mentioned in the method name. However, I believe a better name would be putNLPResultToSourceMap() and that the method should use a more specific type instead of accepting a generic Object.

Consistency with other parts of the code isn't strictly necessary, especially when there's a clearly better approach.

That said, I won't insist further. Thanks!

Cool. That makes sense to me. Thank you @heemin32
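A small sketch may help show why both styles end up with the same instanceof checks. Java picks an overload at compile time from the static type of the argument, so a value typed as Object still has to be tested and cast before either overload can be called. The class, method names, and return strings below are invented for this illustration and simplify the real signatures.

```java
import java.util.List;
import java.util.Map;

// Hypothetical example class, not part of InferenceProcessor.
final class EntryDispatchSketch {

    // Overload chosen when the argument's static type is List<Object>.
    static String processEntry(List<Object> entryValue) {
        return "list of " + entryValue.size();
    }

    // Overload chosen when the argument's static type is Map<String, Object>.
    static String processEntry(Map<String, Object> entryValue) {
        return "map of " + entryValue.size();
    }

    // The runtime type check and cast are still required for an Object value.
    @SuppressWarnings("unchecked")
    static String dispatch(Object entryValue) {
        if (entryValue instanceof List) {
            return processEntry((List<Object>) entryValue);
        }
        if (entryValue instanceof Map) {
            return processEntry((Map<String, Object>) entryValue);
        }
        return "unsupported";
    }

    public static void main(String[] args) {
        System.out.println(dispatch(List.of("a", "b"))); // prints list of 2
        System.out.println(dispatch(Map.of("k", "v")));  // prints map of 1
    }
}
```

So the choice between overloads and type-specific names is mainly about readability at the call site; the dispatch logic is the same in both.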

yizheliu-amazon · 2024-12-26T18:26:00Z

Can we have an integration test (IT) for this?

Thanks for the review. I tried adding an IT for it, but found a new issue in the case of a document containing a list of nested objects with multiple dots (.): issue #1042. The ingest pipeline example in issue #1042 actually comes from the config file of an existing IT. So, with the pipeline config of the existing IT, a new IT for this change would fail. That issue is not related to this bug fix PR; it concerns ingesting a document that contains a list of nested objects with multiple dots (.). The existing ITs pass because that case is not covered.

To work around it, we can either:

1. Fix the current bug, then fix issue #1042. In the PR for issue #1042, I can add an IT for ingesting a document with a list of nested objects.
2. Create a new pipeline configuration for the IT, like the one below, that works with this PR. This may seem unnecessary because the new pipeline is very similar to the existing one. If an IT for this PR can pass with the existing pipeline, it can also pass with the pipeline below.

{
  "description": "text embedding pipeline for hybrid",
  "processors": [
    {
      "text_embedding": {
        "model_id": "%s",
        "field_map": {
          "title": "title_knn",
          "favor_list": "favor_list_knn",
          "favorites": {
            "game": "game_knn",
            "movie": "movie_knn"
          },
          "nested_passages": "level_1_embedding"
        }
      }
    }
  ]
}

I prefer option 1, since option 2 seems unnecessary to me.

heemin32 · 2024-12-26T18:30:47Z

@yizheliu-amazon Thanks for the detailed explanation. I will leave it to you to decide the next step between the two options. Thanks!

Fix bug where ingestion failed for input document containing list of …

5b13778

…nested objects Signed-off-by: Yizhe Liu <[email protected]>

github-actions bot added the bug label Dec 22, 2024

This was referenced Dec 22, 2024

Fix bug where ingestion failed for input document with list of nested objects #1039

Closed

Fix bug where ingestion failed for input document has list of nested objects #1038

Closed

yizheliu-amazon marked this pull request as ready for review December 22, 2024 00:53

yizheliu-amazon requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, sean-zheng-amazon, model-collapse, zane-neo, vibrantvarun, zhichao-aws, yuye-aws and minalsha as code owners December 22, 2024 00:53

heemin32 reviewed Dec 24, 2024

heemin32 added the backport 2.x label Dec 24, 2024

martin-gaievski reviewed Dec 24, 2024

yizheliu-amazon mentioned this pull request Dec 26, 2024

[BUG] Fail to generate embedding for ingest document with nested field defined in field map #1042

Open
