-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug where ingestion failed for input document containing list of nested objects #1040
base: main
Are you sure you want to change the base?
Fix bug where ingestion failed for input document containing list of nested objects #1040
Conversation
…nested objects Signed-off-by: Yizhe Liu <[email protected]>
Can we have IT test for this? |
int nestedElementIndex | ||
) { | ||
if (processorKey == null || sourceAndMetadataMap == null || sourceValue == null) return; | ||
if (sourceValue instanceof Map) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't sourceValue always an instance of Map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. In the case of doc with list of nested objects, sourceValue will become type of List in the last recursive call. You may check line 505 - 506
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, this method is called only when the sourceValue is instance of Map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it is called in putNLPResultToSourceMapForMapType()
when sourceValue
is Map, but sourceValue
is a nested object with list
inside. During the recursive call of putNLPResultToSingleSourceMapInList()
, it may reach to the level of list
type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I missed the recursive aspect. I have a follow-up comment:
Is there a limit on the number of recursive calls that can occur? There was a security campaign emphasizing the need to avoid unlimited recursive calls. We should either impose a limit on the recursion depth or refactor the logic to use an iterative approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We have done fieldMap
depth validation in ProcessorDocumentUtils.validateMapTypeValue()
. Given the recursion depth of this method is same as fieldMap
depth, it should be fine for us to not do recursion depth validation again here.
Pleas feel free to let me know your thoughts.
int nestedElementIndex | ||
) { | ||
if (processorKey == null || sourceAndMetadataMap == null || sourceValue == null) return; | ||
if (sourceValue instanceof Map) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (sourceValue instanceof Map) { | |
assert sourceValue instanceof Map, "sourceValue should be an instance of Map" |
*/ | ||
Map<String, Object> child1Level2 = buildObjMapWithSingleField(CHILD_1_TEXT_FIELD, TEXT_VALUE_1); | ||
Map<String, Object> child1Level1 = buildObjMapWithSingleField(CHILD_FIELD_LEVEL_1, child1Level2); | ||
Map<String, Object> child2Level2 = buildObjMapWithSingleField(CHILD_1_TEXT_FIELD, TEXT_VALUE_1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this critical for test case to have all identical values for both nested fields? In real life scenario most of the times values will be different, can we edit this method or add a new test case with 2+ different fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I can add different fields so that objects in the list are not
identical.
List<Map<String, Object>> nestedElementList = (List<Map<String, Object>>) sourceAndMetadataMap.get(processorKey); | ||
|
||
IntStream.range(0, nestedElementList.size()).forEach(nestedElementIndex -> { | ||
Map<String, Object> nestedElement = nestedElementList.get(nestedElementIndex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about following version, get by index from list can be not optimal in case of some list implementations like linked list:
Iterator<Map<String, Object>> iterator = nestedElementList.iterator();
Stream.iterate(0, i -> i + 1)
.limit(nestedElementList.size())
.forEach(index -> {
Map<String, Object> nestedElement = iterator.next();
putNLPResultToSingleSourceMapInList(
entryKey,
entryValue,
results,
indexWrapper,
nestedElement,
index
);
});
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's good idea. Will change to this.
} | ||
} else if (inputNestedMapEntry.getValue() instanceof Map) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please refactor logic for each type in a separate method, this should make code cleaner:
if (entryValue instanceof List) {
processListTypeEntry(entryKey, (List<Object>) entryValue, processorKey,
results, indexWrapper, sourceAndMetadataMap);
} else if (entryValue instanceof Map) {
processMapTypeEntry(entryKey, entryValue, processorKey,
results, indexWrapper, sourceAndMetadataMap);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather have same method name.
if (entryValue instanceof List) {
processEntry(entryKey, (List<Object>) entryValue, processorKey, results, indexWrapper, sourceAndMetadataMap);
}
else if (entryValue instanceof Map) {
processEntry(entryKey, (Map<String, Object>) entryValue, processorKey, results, indexWrapper, sourceAndMetadataMap);
}
private void processEntry(..., List<Object> entryValue, ...){...}
private void processEntry(..., Map<String, Object> entryValue, ...){...}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing the ideas. I may go with @martin-gaievski 's suggestion since I prefer specific method name so that it is more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In java, having same method name with different parameter signature is common pattern that is encouraged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @heemin32 . Given this file already has such pattern of separating methods with name for List/Map type, such as putNLPResultToSourceMapForMapType()
, buildNLPResultForListType
. I may keep this pattern. Please feel free to let me know if you have any concerns or ideas. Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- In
putNLPResultToSourceMapForMapType()
, the map type refers tosourceValue
, which is an Object. As such, it's explicitly mentioned in the method name. However, I believe a better name would beputNLPResultToSourceMap()
and that the method should use a more specific type instead of accepting a generic Object. - Consistency with other parts of the code isn't strictly necessary, especially when there's a clearly better approach.
That said, I won't insist further. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. That makes sense to me. Thank you @heemin32
Thanks for the review. I tried adding IT test for it, but found a new issue in the case of doc containing list of nested objects with multiple dots To work around it, we can either
I may prefer option 1 since option 2 seems unnecessary to me. |
@yizheliu-amazon Thanks for the detail explanation. I will leave it to you to decided for the next step among the two option. Thanks! |
Description
Fix bug where ingestion failed for input document containing list of nested objects
Related Issues
Resolves #1024
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.