feat: Add vector_search_by_key method to sync and async clients vec-330 #53

dwelch-spike · 2024-10-02T00:55:45Z

This feature is a convenience method that enables HNSW searches by Aerospike record primary key instead of by literal vectors. The client does this by reaching out to AVS to get the vector data for the record the user wishes to search on. Some extra arguments are required for this search, namely the record set name and the vector field name.

This PR is based on the vec-373 PRs, hence the old commits about CI.
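For readers skimming the thread, the two-step flow described above (fetch the record's vector by key, then search with it) can be sketched roughly as follows. This is a toy, in-memory illustration only: `ToyClient`, `Hit`, and the `store` layout are hypothetical, the search is brute force rather than HNSW, and the argument names follow the ones agreed on later in this review (`key_set`, `search_namespace`). It is not the client's actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Vector = List[float]
# Hypothetical storage layout: (namespace, set, primary key) -> record fields.
Store = Dict[Tuple[str, Optional[str], str], Dict[str, Vector]]


@dataclass
class Hit:
    key: str
    distance: float


class ToyClient:
    """Hypothetical in-memory stand-in; not the real AVS client."""

    def __init__(self, store: Store) -> None:
        self.store = store

    def get(self, namespace: str, key: str, key_set: Optional[str] = None) -> Dict[str, Vector]:
        return self.store[(namespace, key_set, key)]

    def vector_search(self, namespace: str, query: Vector, field: str, limit: int) -> List[Hit]:
        # Brute-force squared Euclidean distance, standing in for an HNSW search.
        hits = [
            Hit(key, sum((a - b) ** 2 for a, b in zip(fields[field], query)))
            for (ns, _set, key), fields in self.store.items()
            if ns == namespace and field in fields
        ]
        return sorted(hits, key=lambda h: (h.distance, h.key))[:limit]

    def vector_search_by_key(
        self,
        search_namespace: str,
        key: str,
        key_namespace: str,
        vector_field: str,
        limit: int,
        key_set: Optional[str] = None,
    ) -> List[Hit]:
        # Step 1: look up the record by primary key to obtain its vector.
        vector = self.get(key_namespace, key, key_set)[vector_field]
        # Step 2: run an ordinary vector search with that vector.
        return self.vector_search(search_namespace, vector, vector_field, limit)


store: Store = {
    ("test", "demo", "a"): {"vec": [0.0, 0.0]},
    ("test", "demo", "b"): {"vec": [1.0, 0.0]},
    ("test", "demo", "c"): {"vec": [5.0, 5.0]},
}
client = ToyClient(store)
hits = client.vector_search_by_key("test", "a", "test", "vec", limit=2, key_set="demo")
```

Note that in this sketch the record being searched on comes back as its own nearest neighbor, since nothing filters it out.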

codecov-commenter · 2024-10-02T01:17:23Z

Codecov Report

Attention: Patch coverage is 70.00000% with 9 lines in your changes missing coverage. Please review.

Please upload report for BASE (dev@dd73d96). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
src/aerospike_vector_search/types.py	55.00%	9 Missing ⚠️

Additional details and impacted files

@@          Coverage Diff           @@
##             dev      #53   +/-   ##
======================================
  Coverage       ?   71.24%           
======================================
  Files          ?       25           
  Lines          ?     2271           
  Branches       ?        0           
======================================
  Hits           ?     1618           
  Misses         ?      653           
  Partials       ?        0

dwelch-spike · 2024-10-10T21:10:13Z

src/aerospike_vector_search/aio/client.py

+        key_namespace: str,
+        vector_field: str,
+        limit: int,
+        set_name: Optional[str] = None,


Should this be called "key_set" to be consistent with other arguments? Is it easier to understand that this is the set the record resides in if it is called "key_set"?

I think namespace and set should be consistent, they are both key information and should be uniform. Was never a fan of the "set_name" choice.

Instead of index_namespace, it should be search_namespace. This way, it is clear that you find the key in key_namespace, and you search in search_namespace. Index_namespace could mean where the index information is stored rather than where the vector data is stored.

Good feedback, these make sense to me. Changed

DomPeliniAerospike

Left some questions and comments. Looked good otherwise!

DomPeliniAerospike · 2024-10-11T00:01:03Z

tests/standard/sync/test_vector_client_search_by_key.py

+        exclude_fields=test_case.exclude_fields,
+    )
+
+    assert list.sort(results) == list.sort(test_case.expected_results)


Comment: Odd that the server doesn't return this in order, but I guess that isn't how the KNN algorithm works. I suppose it is a tad faster to send unsorted.

This posed problems for me when testing but it could have been that my test case data was in a different order. Either way I think being able to sort neighbors is useful.
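One detail worth checking in the assertion quoted above: `list.sort` sorts in place and returns `None`, so comparing the return values of two `list.sort` calls compares `None` with `None`. A small standalone sketch (plain integers stand in for the neighbor results here, which is an assumption for illustration only):

```python
results = [3, 1, 2]
expected = [9, 9, 9]

# list.sort mutates the list and returns None, so this assertion
# passes even though the two lists hold different elements.
assert list.sort(results) == list.sort(expected)

# sorted() returns a new sorted list, so the comparison is meaningful.
same_elements = sorted(results) == sorted(expected)
```

If the intent is an order-insensitive comparison, `sorted(...)` on both sides is one way to express it.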

DomPeliniAerospike · 2024-10-11T00:24:26Z

src/aerospike_vector_search/types.py

+        if not isinstance(other, Neighbor):
+            return NotImplemented
+
+        return self.distance >= other.distance


This implementation will lead to inconsistent sorting for records that are equal in distance. Consider the example:

neighbor1.distance = 4, neighbor2.distance = 4, neighbor3.distance = 0

array1 = [neighbor1, neighbor2, neighbor3]

array2 = [neighbor2, neighbor1, neighbor3]

The outcome of list.sort(array1) will be: [neighbor3, neighbor1, neighbor2]

The outcome of list.sort(array2) will be: [neighbor3, neighbor2, neighbor1]

This is because Python's list.sort is a stable sort, so neighbors with equal distance keep their input order: See documentation here

If a user is sorting, they likely want a consistent output, and the current implementation won't provide that.

Breaking ties on the key when distances are equal will solve this issue.

It won't affect most, but the current behavior is sub-optimal for some use cases.

Let me know what you think.
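The example above can be reproduced with a stripped-down stand-in for the type. The `Neighbor` class below is hypothetical (a `name` field replaces the real key) and compares on distance only, as in the diff under review:

```python
from dataclasses import dataclass


@dataclass
class Neighbor:
    name: str        # hypothetical label, standing in for the record key
    distance: float

    def __lt__(self, other):
        if not isinstance(other, Neighbor):
            return NotImplemented
        # Distance-only comparison: ties are left unresolved.
        return self.distance < other.distance


neighbor1 = Neighbor("neighbor1", 4)
neighbor2 = Neighbor("neighbor2", 4)
neighbor3 = Neighbor("neighbor3", 0)

array1 = [neighbor1, neighbor2, neighbor3]
array2 = [neighbor2, neighbor1, neighbor3]

# Sorting is stable, so the two tied neighbors keep their input order.
order1 = [n.name for n in sorted(array1)]
order2 = [n.name for n in sorted(array2)]
```

The same set of neighbors therefore sorts into two different orders depending on how the input happened to be arranged.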

Good point. I added your suggestion. The key is compared if distances are the same

We have a problem... comparing keys of different types fails

I settled on using the set and string representation of the keys to tie break, let me know what you think
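A rough sketch of the tie-break described above, for illustration. The `Key` and `Neighbor` classes are simplified, hypothetical versions of the real types, and the exact fields compared in the PR may differ. It shows why comparing raw keys of mixed types fails, and how falling back to the set plus the string form of the key gives an input-independent order:

```python
from dataclasses import dataclass
from functools import total_ordering
from typing import Optional, Union


@dataclass(frozen=True)
class Key:
    namespace: str
    set_name: Optional[str]
    key: Union[int, str]


@total_ordering
@dataclass(eq=False)
class Neighbor:
    key: Key
    distance: float

    def _sort_key(self):
        # Distance first; ties fall back to the set and str(key), which are
        # always comparable even when the raw keys have different types.
        return (self.distance, self.key.set_name or "", str(self.key.key))

    def __eq__(self, other):
        if not isinstance(other, Neighbor):
            return NotImplemented
        return self._sort_key() == other._sort_key()

    def __lt__(self, other):
        if not isinstance(other, Neighbor):
            return NotImplemented
        return self._sort_key() < other._sort_key()


# Comparing raw keys of different types raises TypeError in Python 3.
try:
    1 < "a"
    mixed_keys_comparable = True
except TypeError:
    mixed_keys_comparable = False

n1 = Neighbor(Key("test", "demo", 1), 4.0)
n2 = Neighbor(Key("test", "demo", "a"), 4.0)
n3 = Neighbor(Key("test", "demo", 7), 0.0)

sorted_a = sorted([n1, n2, n3])
sorted_b = sorted([n2, n1, n3])
```

One caveat with this approach: an integer key `1` and a string key `"1"` in the same set have the same string form, so they would still tie.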

DomPeliniAerospike · 2024-10-11T00:41:02Z

src/aerospike_vector_search/client.py

+        vector = rec_and_key.fields[vector_field]
+
+        neighbors = self.vector_search(
+            namespace=index_namespace,


Are you sure we want to search the index_namespace rather than the key_namespace?

Doesn't the index_namespace just have index info on it, while key_namespace has all the records?

A test case should be added to verify that this behavior is sound. All test cases use the same value for both key_namespace and index_namespace.

I left the comment above before commenting on the conversation below. I understand that you might search a different namespace than the one you found the key in. Changing index_namespace to search_namespace could clear up the confusion.

A valid test case should still be added for this situation.

Renamed index_namespace to search_namespace. I'll see about the test case; I think I'll have to add another namespace to the test Aerospike configs, etc.

Added a test case where the record and search space are in separate namespaces.

dwelch-spike added 6 commits October 1, 2024 14:18

ci: split extensive vector search tests into another file

be35723

ci: trigger extensive vector search tests on push to dev and main only

4650019

trigger tests

f0b3428

ci: remove extensive vector tests from normal integration test workflow

17bf383

feat: add vector_search_by_key() client method

fc2d7ea

feat: add vector_search_by_key async client method

143c7fe

dwelch-spike added 9 commits October 2, 2024 13:34

fix vector search by key test case bins

a6c9cfe

use test case set in vecto serach by key

8ae04d8

set limit correctly in vector searh by key tet case

9ff9e05

chore: define __repr__ for Neighbor and Key types

db89cf0

remove breakpoint

f816744

feat: define lt, le, gt, ge for Neighbor type

67f6ed2

add missing set_name arg to vector search by key tests

2f3eae3

add key_namespace to vector_search_by_key

88f5003

merge dev into vec-330

1ed18c5

dwelch-spike commented Oct 10, 2024

remove incorrect field_name docstring

17cf07f

dwelch-spike requested review from DomPeliniAerospike and hev October 10, 2024 21:13

ci: only run test_vector_search_with_set_same_as_index when extensive_vector_search is true

b413ea2

DomPeliniAerospike requested changes Oct 11, 2024

dwelch-spike added 6 commits October 11, 2024 09:53

check key when sorting Neighbors if distance is equal

2381d9a

rename vector_search_by_key argument from set_name to key_set and index_namespace to search_namespace

9fec62f

try a test run without sorting vector search results

6a30658

change neighbor comparison methods to check str representation of key if distance is equal

a6468f8

add test case for search by key where search space and record are in different namespaces

3cbce0c

add a test for search by key with data and search records in different namespaces

9c0d290

dwelch-spike requested a review from DomPeliniAerospike October 11, 2024 22:13
