Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug where ingestion failed for input document containing list of nested objects #1040

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

yizheliu-amazon
Copy link
Contributor

Description

Fix bug where ingestion failed for input document containing list of nested objects

Related Issues

Resolves #1024

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@heemin32
Copy link
Collaborator

Can we have IT test for this?

int nestedElementIndex
) {
if (processorKey == null || sourceAndMetadataMap == null || sourceValue == null) return;
if (sourceValue instanceof Map) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't sourceValue always an instance of Map?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. In the case of doc with list of nested objects, sourceValue will become type of List in the last recursive call. You may check line 505 - 506

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, this method is called only when the sourceValue is instance of Map.

Copy link
Contributor Author

@yizheliu-amazon yizheliu-amazon Dec 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is called in putNLPResultToSourceMapForMapType() when sourceValue is Map, but sourceValue is a nested object with list inside. During the recursive call of putNLPResultToSingleSourceMapInList(), it may reach to the level of list type

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I missed the recursive aspect. I have a follow-up comment:

Is there a limit on the number of recursive calls that can occur? There was a security campaign emphasizing the need to avoid unlimited recursive calls. We should either impose a limit on the recursion depth or refactor the logic to use an iterative approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We have done fieldMap depth validation in ProcessorDocumentUtils.validateMapTypeValue(). Given the recursion depth of this method is same as fieldMap depth, it should be fine for us to not do recursion depth validation again here.

Pleas feel free to let me know your thoughts.

int nestedElementIndex
) {
if (processorKey == null || sourceAndMetadataMap == null || sourceValue == null) return;
if (sourceValue instanceof Map) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (sourceValue instanceof Map) {
assert sourceValue instanceof Map, "sourceValue should be an instance of Map"

@heemin32 heemin32 added the backport 2.x Label will add auto workflow to backport PR to 2.x branch label Dec 24, 2024
*/
Map<String, Object> child1Level2 = buildObjMapWithSingleField(CHILD_1_TEXT_FIELD, TEXT_VALUE_1);
Map<String, Object> child1Level1 = buildObjMapWithSingleField(CHILD_FIELD_LEVEL_1, child1Level2);
Map<String, Object> child2Level2 = buildObjMapWithSingleField(CHILD_1_TEXT_FIELD, TEXT_VALUE_1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this critical for test case to have all identical values for both nested fields? In real life scenario most of the times values will be different, can we edit this method or add a new test case with 2+ different fields?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I can add different fields so that objects in the list are not identical.

List<Map<String, Object>> nestedElementList = (List<Map<String, Object>>) sourceAndMetadataMap.get(processorKey);

IntStream.range(0, nestedElementList.size()).forEach(nestedElementIndex -> {
Map<String, Object> nestedElement = nestedElementList.get(nestedElementIndex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about following version, get by index from list can be not optimal in case of some list implementations like linked list:

Iterator<Map<String, Object>> iterator = nestedElementList.iterator();
Stream.iterate(0, i -> i + 1)
    .limit(nestedElementList.size())
    .forEach(index -> {
        Map<String, Object> nestedElement = iterator.next();
        putNLPResultToSingleSourceMapInList(
            entryKey,
            entryValue,
            results,
            indexWrapper,
            nestedElement,
            index
        );
    });

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good idea. Will change to this.

}
} else if (inputNestedMapEntry.getValue() instanceof Map) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please refactor logic for each type in a separate method, this should make code cleaner:

    if (entryValue instanceof List) {
        processListTypeEntry(entryKey, (List<Object>) entryValue, processorKey, 
                           results, indexWrapper, sourceAndMetadataMap);
    } else if (entryValue instanceof Map) {
        processMapTypeEntry(entryKey, entryValue, processorKey, 
                          results, indexWrapper, sourceAndMetadataMap);
    }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather have same method name.

if (entryValue instanceof List) { 
  processEntry(entryKey, (List<Object>) entryValue, processorKey, results, indexWrapper, sourceAndMetadataMap); 
} 
else if (entryValue instanceof Map) { 
  processEntry(entryKey, (Map<String, Object>) entryValue, processorKey, results, indexWrapper, sourceAndMetadataMap); 
}


private void processEntry(..., List<Object> entryValue, ...){...}
private void processEntry(..., Map<String, Object> entryValue, ...){...}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing the ideas. I may go with @martin-gaievski 's suggestion since I prefer specific method name so that it is more readable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In java, having same method name with different parameter signature is common pattern that is encouraged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @heemin32 . Given this file already has such pattern of separating methods with name for List/Map type, such as putNLPResultToSourceMapForMapType(), buildNLPResultForListType. I may keep this pattern. Please feel free to let me know if you have any concerns or ideas. Thank you.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. In putNLPResultToSourceMapForMapType(), the map type refers to sourceValue, which is an Object. As such, it's explicitly mentioned in the method name. However, I believe a better name would be putNLPResultToSourceMap() and that the method should use a more specific type instead of accepting a generic Object.
  2. Consistency with other parts of the code isn't strictly necessary, especially when there's a clearly better approach.

That said, I won't insist further. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. That makes sense to me. Thank you @heemin32

@yizheliu-amazon
Copy link
Contributor Author

yizheliu-amazon commented Dec 26, 2024

Can we have IT test for this?

Thanks for the review. I tried adding IT test for it, but found a new issue in the case of doc containing list of nested objects with multiple dots .: issue #1042 . The ingest pipeline example in issue #1042 is actually from config file of existing IT. That being said, given pipeline config of existing IT in the code, new IT test for this change will fail. Such issue is not related to this bug fix PR, but related to case where doc containing list of nested objects with multiple dots . is being ingested. Existing ITs can pass because such case is not covered.

To work around it, we can either

  1. fix current bug, then fix issue [BUG] Fail to generate embedding for ingest document with nested field defined in field map #1042; in the PR for issue [BUG] Fail to generate embedding for ingest document with nested field defined in field map #1042, I can add IT for the case of ingestion of doc with list of nested objects.
  2. create a new pipeline configuration like below for IT which is working for this PR, but this may seem unnecessary because such new pipeline is very similar to existing one. If IT for this PR can pass given existing pipeline, it can also pass for below pipeline.
{
  "description": "text embedding pipeline for hybrid",
  "processors": [
    {
      "text_embedding": {
        "model_id": "%s",
        "field_map": {
          "title": "title_knn",
          "favor_list": "favor_list_knn",
          "favorites": {
            "game": "game_knn",
            "movie": "movie_knn"
          },
          "nested_passages": "level_1_embedding"
        }
      }
    }
  ]
}

I may prefer option 1 since option 2 seems unnecessary to me.

@heemin32
Copy link
Collaborator

@yizheliu-amazon Thanks for the detail explanation. I will leave it to you to decided for the next step among the two option. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Label will add auto workflow to backport PR to 2.x branch bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Fail to ingest document with nested list into text_embedding processor
3 participants