Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add vector_search_by_key method to sync and async clients vec-330 #53

Open
wants to merge 23 commits into
base: dev
Choose a base branch
from

Conversation

dwelch-spike
Copy link
Contributor

@dwelch-spike dwelch-spike commented Oct 2, 2024

This feature is a convenience method that enables HNSW searches by Aerospike record primary key instead of by literal vectors. The client does this by reaching out to AVS to get the vector data for the record the user wishes to search on. Some extra arguments are required for this search, namely the record set name and the vector field name.

This PR is based on the vec-373 PRs hence the old commits about ci etc

@codecov-commenter
Copy link

codecov-commenter commented Oct 2, 2024

Codecov Report

Attention: Patch coverage is 70.00000% with 9 lines in your changes missing coverage. Please review.

Please upload report for BASE (dev@dd73d96). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/aerospike_vector_search/types.py 55.00% 9 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##             dev      #53   +/-   ##
======================================
  Coverage       ?   71.24%           
======================================
  Files          ?       25           
  Lines          ?     2271           
  Branches       ?        0           
======================================
  Hits           ?     1618           
  Misses         ?      653           
  Partials       ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

key_namespace: str,
vector_field: str,
limit: int,
set_name: Optional[str] = None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be called "key_set" to be consistent with other arguments? Is it easier to understand that this is the set the record resides in if it is called "key_set"?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think namespace and set should be consistent, they are both key information and should be uniform. Was never a fan of the "set_name" choice.

Instead of index_namespace, it should be search_namespace. This way, it is clear that you find the key in key_namespace, and you search in search_namespace. Index_namespace could mean where the index information is stored rather than where the vector data is stored.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good feedback, these make sense to me. Changed

Copy link
Collaborator

@DomPeliniAerospike DomPeliniAerospike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some questions and comments. Looked good otherwise!

exclude_fields=test_case.exclude_fields,
)

assert list.sort(results) == list.sort(test_case.expected_results)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment: Odd that the server doesn't return this in order, but I guess that isn't how the KNN algorithm works. I suppose it is a tad faster to send unsorted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This posed problems for me when testing but it could have been that my test case data was in a different order. Either way I think being able to sort neighbors is useful.

if not isinstance(other, Neighbor):
return NotImplemented

return self.distance >= other.distance
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation will lead to inconsistent sorting for records that are equal in distance. Consider the example:

neighbor1.distance = 4, neighbor2.distance = 4, neighbor3.distance = 0

array1 = [neighbor1, neighbor2, neighbor3]

array2 = [neighbor2, neighbor1, neighbor3]

the outcome of list.sort(array1) will equal: [neighbor3, neighbor1, neighbor2,]

the outcome of list.sort(array2) will equal: [neighbor3, neighbor2, neighbor1,]

This is due to the python list.sort being a stable sort implementation: See documentation here

If a user is sorting, they likely want a consistent output, and the current implementation won't provide that.

Sorting a key name in the event of distance equality will solve this issue.

It won't affect most, but the current behavior is sub-optimal for some use cases.

Let me know what you think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I added your suggestion. The key is compared if distances are the same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a problem... comparing keys of different types fails

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I settled on using the set and string representation of the keys to tie break, let me know what you think

vector = rec_and_key.fields[vector_field]

neighbors = self.vector_search(
namespace=index_namespace,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure we want to search the index_namespace rather than the key_namespace?

Doesn't the index_namespace just have index info on it, while key_namespace has all the records?

A test case should be added to verify that this behavior is sound. All test cases use the same value for both key_namespace and index_namespace.

Copy link
Collaborator

@DomPeliniAerospike DomPeliniAerospike Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left the comment above before commenting on the conversation below. I understand that you might search a different namespace than the one you found the key in. Changing index_namespace to search_namespace could clear up the confusion.

A valid test case should still be added for this situation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed index_namespace to search_namespace I'll see about the test case. I think I'll have to add another namespace to the test Aerospike configs etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test case where the record and searchspace are in separate namespaces

key_namespace: str,
vector_field: str,
limit: int,
set_name: Optional[str] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think namespace and set should be consistent, they are both key information and should be uniform. Was never a fan of the "set_name" choice.

Instead of index_namespace, it should be search_namespace. This way, it is clear that you find the key in key_namespace, and you search in search_namespace. Index_namespace could mean where the index information is stored rather than where the vector data is stored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants