-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add vector_search_by_key method to sync and async clients vec-330 #53
base: dev
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev #53 +/- ##
======================================
Coverage ? 71.24%
======================================
Files ? 25
Lines ? 2271
Branches ? 0
======================================
Hits ? 1618
Misses ? 653
Partials ? 0 ☔ View full report in Codecov by Sentry. |
key_namespace: str, | ||
vector_field: str, | ||
limit: int, | ||
set_name: Optional[str] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be called "key_set" to be consistent with other arguments? Is it easier to understand that this is the set the record resides in if it is called "key_set"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think namespace and set should be consistent, they are both key information and should be uniform. Was never a fan of the "set_name" choice.
Instead of index_namespace
, it should be search_namespace
. This way, it is clear that you find the key in key_namespace
, and you search in search_namespace
. Index_namespace
could mean where the index information is stored rather than where the vector data is stored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good feedback, these make sense to me. Changed
…_vector_search is true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some questions and comments. Looked good otherwise!
exclude_fields=test_case.exclude_fields, | ||
) | ||
|
||
assert list.sort(results) == list.sort(test_case.expected_results) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment: Odd that the server doesn't return this in order, but I guess that isn't how the KNN algorithm works. I suppose it is a tad faster to send unsorted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This posed problems for me when testing but it could have been that my test case data was in a different order. Either way I think being able to sort neighbors is useful.
src/aerospike_vector_search/types.py
Outdated
if not isinstance(other, Neighbor): | ||
return NotImplemented | ||
|
||
return self.distance >= other.distance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation will lead to inconsistent sorting for records that are equal in distance. Consider the example:
neighbor1.distance = 4, neighbor2.distance = 4, neighbor3.distance = 0
array1 = [neighbor1, neighbor2, neighbor3]
array2 = [neighbor2, neighbor1, neighbor3]
the outcome of list.sort(array1)
will equal: [neighbor3, neighbor1, neighbor2,]
the outcome of list.sort(array2)
will equal: [neighbor3, neighbor2, neighbor1,]
This is due to the python list.sort being a stable sort implementation: See documentation here
If a user is sorting, they likely want a consistent output, and the current implementation won't provide that.
Sorting a key name in the event of distance equality will solve this issue.
It won't affect most, but the current behavior is sub-optimal for some use cases.
Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I added your suggestion. The key is compared if distances are the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a problem... comparing keys of different types fails
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I settled on using the set and string representation of the keys to tie break, let me know what you think
vector = rec_and_key.fields[vector_field] | ||
|
||
neighbors = self.vector_search( | ||
namespace=index_namespace, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure we want to search the index_namespace
rather than the key_namespace
?
Doesn't the index_namespace
just have index info on it, while key_namespace
has all the records?
A test case should be added to verify that this behavior is sound. All test cases use the same value for both key_namespace
and index_namespace
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left the comment above before commenting on the conversation below. I understand that you might search a different namespace than the one you found the key in. Changing index_namespace
to search_namespace
could clear up the confusion.
A valid test case should still be added for this situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed index_namespace
to search_namespace
I'll see about the test case. I think I'll have to add another namespace to the test Aerospike configs etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a test case where the record and searchspace are in separate namespaces
key_namespace: str, | ||
vector_field: str, | ||
limit: int, | ||
set_name: Optional[str] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think namespace and set should be consistent, they are both key information and should be uniform. Was never a fan of the "set_name" choice.
Instead of index_namespace
, it should be search_namespace
. This way, it is clear that you find the key in key_namespace
, and you search in search_namespace
. Index_namespace
could mean where the index information is stored rather than where the vector data is stored.
…ex_namespace to search_namespace
… if distance is equal
…different namespaces
This feature is a convenience method that enables HNSW searches by Aerospike record primary key instead of by literal vectors. The client does this by reaching out to AVS to get the vector data for the record the user wishes to search on. Some extra arguments are required for this search, namely the record set name and the vector field name.
This PR is based on the vec-373 PRs hence the old commits about ci etc