Add bulk import #386

jhamon · 2024-08-22T15:31:13Z

Problem

Implement the following new methods:

start_import
describe_import
list_imports
cancel_import

Solution

Code generation changes

Since these features are in prerelease, they exist only in the spec for the upcoming 2024-10 API version. This required modifying the codegen script, which is now run as:

./codegen/build-oas.sh 2024-07 false && ./codegen/build-oas.sh 2024-10 true

The second argument, a boolean, tells the codegen script whether the generated code should be stored in a new pinecone/core_ea subpackage. In the future we should probably do more to hide this complexity from the developer, but for now it is good enough.

Code organization

I have put the bespoke bits of the implementation that wrap the generated code into a new class, ImportFeatureMixin, that the Index class inherits from. These functions could all have been implemented directly in the Index class, but I thought it a bit tidier to keep them in a separate spot rather than dump everything into one giant file.
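
A minimal sketch of the mixin layout described above. The method bodies, the `import_id` parameter name, the status strings, and the in-memory `_imports` store are hypothetical stand-ins for illustration; the real ImportFeatureMixin wraps generated API client code that is not shown here.

```python
import uuid
from typing import Dict, Optional


class ImportFeatureMixin:
    """Hypothetical stand-in that groups the import-related methods in one place."""

    def __init__(self) -> None:
        # Illustrative in-memory store; the real mixin would call the generated client.
        self._imports: Dict[str, dict] = {}

    def start_import(self, uri: str, integration: Optional[str] = None) -> dict:
        import_id = uuid.uuid4().hex
        self._imports[import_id] = {"id": import_id, "uri": uri, "status": "Pending"}
        return {"id": import_id}

    def describe_import(self, import_id: str) -> dict:
        return self._imports[import_id]

    def cancel_import(self, import_id: str) -> None:
        self._imports[import_id]["status"] = "Cancelled"


class Index(ImportFeatureMixin):
    """Index gains the import methods through inheritance, keeping its own file smaller."""

    def __init__(self, host: str) -> None:
        super().__init__()
        self.host = host
```

The design choice here is purely organizational: callers still see one `Index` object, while the import logic lives in its own module.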

Overridden repr representation on generated objects

The default print output in the generated classes comes from pprint, and it looks quite poor for large objects. So I installed overrides that dump the objects in a formatted JSON style instead. I had previously done something similar for describe_index and related methods, so for this PR it was just a matter of cleaning up that logic a bit and moving it somewhere it could be reused.

So far, I haven't tweaked the generated classes to use this approach across the board because it doesn't work well for long arrays of vector values.
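
One way such a repr override could look, as a rough sketch. `ImportModel` and `install_json_repr` are hypothetical names, not the helpers added in this PR, and the model is assumed to expose a `to_dict()` method.

```python
import json
from datetime import date, datetime


def _json_default(value):
    # datetime/date are not JSON serializable by default; emit ISO 8601 strings.
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def install_json_repr(cls):
    """Replace cls.__repr__ with an indented JSON dump of to_dict()."""

    def _repr(self):
        return json.dumps(self.to_dict(), indent=4, default=_json_default)

    cls.__repr__ = _repr
    return cls


@install_json_repr
class ImportModel:
    # Hypothetical model used only to demonstrate the override.
    def __init__(self, id, status, created_at):
        self.id = id
        self.status = status
        self.created_at = created_at

    def to_dict(self):
        return {"id": self.id, "status": self.status, "created_at": self.created_at}
```

With this in place, `print(ImportModel(...))` shows indented JSON rather than pprint output. A long list of vector values would still print one element per line, which matches the caveat above.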

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update
Infrastructure change (CI configs, etc)
Non-code change (docs, etc)
None of the above: (explain here)

Test Plan

Manual testing with a dev release is in this demo notebook

jhamon · 2024-09-17T20:51:16Z

.github/workflows/alpha-release.yaml

+  # unit-tests:
+  #   uses: './.github/workflows/testing-unit.yaml'
+  #   secrets: inherit
+  # integration-tests:
+  #     uses: './.github/workflows/testing-integration.yaml'
+  #     secrets: inherit
+  # dependency-tests:
+  #   uses: './.github/workflows/testing-dependency.yaml'
+  #   secrets: inherit

  pypi:
    uses: './.github/workflows/publish-to-pypi.yaml'
-    needs: 
-      - unit-tests
-      - integration-tests
-      - dependency-tests
+    # needs: 
+    #   - unit-tests
+    #   - integration-tests
+    #   - dependency-tests


I disabled these to speed up the process of creating dev releases, but I will uncomment them before merging. Eventually some sort of menu toggle for skipping tests would be nice, but I didn't want to yak shave on that too much.

jhamon · 2024-09-17T20:55:27Z

pinecone/core/openapi/shared/model_utils.py

+    elif isinstance(input_value, datetime):
+        # this must be higher than the date check because
+        # isinstance(datetime_instance, date) == True
+        return datetime
+    elif isinstance(input_value, date):
+        return date


The date/datetime changes in these generated files reflect adjustments I had to make in the template code to handle dates in API responses correctly, since the bulk import feature introduces date fields for the first time. Previously we had all of this commented out as a quick-and-dirty fix for an unrelated issue: untyped user metadata strings stored with vectors were sometimes coerced into datetime when fetched if the content looked date-ish.

There's a unit test covering that case to make sure we haven't regressed on that issue while still enabling us to interact with datetime objects when that is our intention.
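
The ordering constraint in the diff above can be checked in isolation. This is a small sketch; `simple_class_of` is a hypothetical helper name, not the function in model_utils.py.

```python
from datetime import date, datetime


def simple_class_of(value):
    """Return datetime or date for temporal values, else the value's own type."""
    if isinstance(value, datetime):
        # Must come first: datetime is a subclass of date, so the date
        # branch would otherwise match datetime instances too.
        return datetime
    elif isinstance(value, date):
        return date
    return type(value)
```

A plain string such as "2024-09-17" stays a `str` in this sketch, which is the behavior wanted for untyped metadata.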

Answers my previous question about the templates, thanks.

jhamon · 2024-09-17T22:09:10Z

tests/unit/data/test_bulk_import.py

@@ -0,0 +1,172 @@
+import pytest


Here's where new unit tests were added for the main functionality

jhamon · 2024-09-17T22:12:34Z

pinecone/config/openapi.py

@@ -24,6 +24,8 @@ def build(cls, api_key: str, host: Optional[str] = None, **kwargs):
        openapi_config.host = host
        openapi_config.ssl_ca_cert = certifi.where()
        openapi_config.socket_options = cls._get_socket_options()
+        openapi_config.discard_unknown_keys = True


Discovered this along the way. Seems like a better default behavior than erroring when unexpected data is returned.

Looks sweet
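
To illustrate the behavior being enabled, here is a sketch of what discarding unknown keys means when parsing a response. `ImportStatus` and `parse_response` are hypothetical; the generated client implements this internally and differently.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class ImportStatus:
    # Hypothetical response model with two known fields.
    id: str
    status: str


def parse_response(payload: Dict[str, Any], discard_unknown_keys: bool = True) -> ImportStatus:
    known = {f.name for f in fields(ImportStatus)}
    unknown = set(payload) - known
    if unknown and not discard_unknown_keys:
        # Strict mode: a field the client does not know about is an error.
        raise ValueError(f"Unexpected keys in response: {sorted(unknown)}")
    # Lenient mode: silently drop anything the model does not declare.
    return ImportStatus(**{k: v for k, v in payload.items() if k in known})
```

In lenient mode, a server that starts returning an extra field does not break older clients.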

jhamon · 2024-09-17T22:12:58Z

pinecone/data/features/bulk_import.py

@@ -0,0 +1,193 @@
+from enum import Enum


Majority of the new code is in this file.

aulorbe · 2024-09-17T23:04:22Z

Total nit that obviously you can ignore if you want, but maybe we should change the title of this PR from "Early access bulk import" to "Add early access bulk import", just so users later on know that this PR added the functionality (instead of iterated/removed/etc.)

aulorbe · 2024-09-17T23:07:06Z

CONTRIBUTING.md

+
+Prerequisites:
+- You must be an employee with access to private Pinecone repositories
+- You must have [Docker Desktop](https://www.docker.com/products/docker-desktop/) installed and running. Our code generation script uses a dockerized version of the openapi CLI.


Nit: you might want to capitalize OpenAPI

aulorbe · 2024-09-17T23:09:18Z

Makefile

@@ -11,7 +11,7 @@ develop:

 test-unit:
 	@echo "Running tests..."
-	poetry run pytest --cov=pinecone --timeout=120 tests/unit
+	poetry run pytest --cov=pinecone --timeout=120 tests/unit -s -vv


Just making sure: you want to permanently add these flags here? I just know that they're usually added for debugging, so want to make sure they weren't left accidentally

Yeah, I pretty much always want these flags.

aulorbe · 2024-09-17T23:10:05Z

codegen/build-oas.sh

-		git fetch
-		git checkout main
-		git pull
+		# git fetch


OOC why'd you remove these?

Nice catch, I didn't mean to leave these commented out. I disabled the pull because I was experimenting with some local changes to get the datetime parsing working properly.

aulorbe · 2024-09-17T23:26:08Z

pinecone/data/features/bulk_import.py

+        error_mode: Optional[Literal["CONTINUE", "ABORT"]] = "CONTINUE",
+    ) -> StartImportResponse:
+        """Import data from a storage provider into an index. The uri must start with the scheme of a supported
+        storage provider. For buckets that are not publicly readable, you will also need to separately configure


Maybe, so users have a better understanding of what we mean by "storage provider," you could write "e.g. S3" or something?

Currently only S3 is supported, but I don't want to write that explicitly because I think it will quickly go out of date. The error message the API sends back is relatively helpful if you get this wrong.

aulorbe · 2024-09-17T23:26:17Z

pinecone/data/features/bulk_import.py

+    ) -> StartImportResponse:
+        """Import data from a storage provider into an index. The uri must start with the scheme of a supported
+        storage provider. For buckets that are not publicly readable, you will also need to separately configure
+        a storage integration and pass the integration name.


Should this read "integration ID"? (not name)

aulorbe · 2024-09-17T23:26:36Z

pinecone/data/features/bulk_import.py

+
+        Args:
+            uri (str): The URI of the data to import. The URI must start with the scheme of a supported storage provider.
+            integration (Optional[str], optional): If your bucket requires authentication to access, you need to pass the name of your storage integration using this property. Defaults to None.


Should this be integration ID not name?

aulorbe · 2024-09-17T23:28:13Z

pinecone/data/features/bulk_import.py

+            print(op)
+        ```
+
+        You can convert the generator into a list by wrapping the generator in a call to the built-in `list` function:


Since we bumped the version of Node and some of our TypeScript config stuff, we may be able to use generators in the TypeScript codebase. 🤔

I had written one for the List endpoint but ended up not shipping it.
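
The generator pattern discussed in this thread might look like the following sketch. The paginated backend (`_PAGES`, `fetch_page`) is a hypothetical stand-in for the real API calls, and this `list_imports` is a simplified illustration rather than the method added in the PR.

```python
from typing import Dict, Iterator, List, Optional

# Hypothetical paginated backend: each page holds items and an optional next token.
_PAGES: Dict[Optional[str], dict] = {
    None: {"data": ["op-1", "op-2"], "next": "t1"},
    "t1": {"data": ["op-3"], "next": None},
}
calls: List[Optional[str]] = []


def fetch_page(token: Optional[str]) -> dict:
    calls.append(token)  # Record each simulated network call.
    return _PAGES[token]


def list_imports() -> Iterator[str]:
    """Yield operations one at a time, fetching the next page only when needed."""
    token = None
    while True:
        page = fetch_page(token)
        yield from page["data"]
        token = page["next"]
        if token is None:
            return
```

Iterating lazily makes one call per page as it is consumed, whereas `list(list_imports())` walks every page up front and holds all results in memory.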

aulorbe · 2024-09-17T23:28:37Z

pinecone/data/features/bulk_import.py

+        ```
+
+        You should be cautious with this approach because it will fetch all operations at once, which could be a large number
+        network calls and a lot of memory to hold the results.


I think you're missing an "of" here (large number OF network calls)

aulorbe

lgtm! just some nits/language tweaks

austin-denoble

LGTM, really nice work getting all of this in place. Very thoughtful approach. Do you think it would make sense to do something similar in other repos? I recently created a release candidate branch to work against in Go.

Really nice work, Jen.

austin-denoble · 2024-09-18T02:04:22Z

CONTRIBUTING.md

@@ -142,3 +142,26 @@ Hello, from your virtualenv!
 ```

 If you experience any issues please [file a new issue](https://github.com/pinecone-io/pinecone-python-client/issues/new).
+
+
+## Consuming API version upgrades


thought: We should definitely put something like this in other repos where we're using a /codegen/ submodule; we had someone external asking about codegen in the Rust codebase. It would be good to be explicit until we've got the specs published, and it would better document how the codegen actually works in each repo.

austin-denoble · 2024-09-18T02:06:24Z

codegen/build-oas.sh

+is_early_access=$2 # e.g. true
+
+# if is_early_access is true, add the "ea" module
+if [ "$is_early_access" = "true" ]; then


thought: Clever way of going about this, nice. 👍

austin-denoble · 2024-09-18T02:22:39Z

codegen/build-oas.sh

+	# Hack to prevent coercion of strings into datetimes within "object" types while still
+	# allowing datetime parsing for fields that are explicitly typed as datetime


Cool, does this replace the templating we used to have to do to fix the previous datetime problem?

Yeah, exactly. I rolled back the old changes to the templates and replaced with this. It's only this one spot where the datetime stuff was problematic.

austin-denoble · 2024-09-18T02:23:15Z

pinecone/config/openapi.py

@@ -24,6 +24,8 @@ def build(cls, api_key: str, host: Optional[str] = None, **kwargs):
        openapi_config.host = host
        openapi_config.ssl_ca_cert = certifi.where()
        openapi_config.socket_options = cls._get_socket_options()
+        openapi_config.discard_unknown_keys = True


austin-denoble · 2024-09-18T02:24:32Z

pinecone/core/openapi/shared/model_utils.py

+    elif isinstance(input_value, datetime):
+        # this must be higher than the date check because
+        # isinstance(datetime_instance, date) == True
+        return datetime
+    elif isinstance(input_value, date):
+        return date


Answers my previous question about the templates, thanks.

austin-denoble · 2024-09-18T02:35:41Z

pinecone/data/index.py

@@ -64,7 +66,7 @@ def parse_query_response(response: QueryResponse):
    return response


-class Index:
+class Index(ImportFeatureMixin):


Very cool, I really like the mixin approach.

austin-denoble · 2024-09-18T02:36:26Z

pinecone/data/features/bulk_import.py

+            print(op)
+        ```
+
+        You can convert the generator into a list by wrapping the generator in a call to the built-in `list` function:


Since we bumped the version of Node and some of our TypeScript config stuff, we may be able to use generators in the TypeScript codebase. 🤔

I had written one for the List endpoint but ended up not shipping it.

jhamon force-pushed the jhamon/import-feature branch from c784fe8 to ca2bf2b on August 29, 2024 22:53

jhamon added 21 commits September 17, 2024 11:52

WIP on bulk import

87aec08

Fix unit tests

62c95e0

WIP

764822e

WIP on tests

10cfc28

Add on_error

97b0027

Update error_mode unit test

2a87b8a

Some operationId > id fixes

fe7f93f

Lint fixes

82d7905

Add skipTests for rc release

bc09023

Add skipTests for rc release

8118ed3

Revert skipTests

686538d

Improve types and json formatting

3ec5838

Fix json print

6a1c7a1

Fix issue with InProgress enum value

b0bc0f3

Adjust list imports iterator

e9158ae

Docstring updates

70ac40d

Rename from completed_at to finished_at

9857a96

Passing error mode

b8494c0

Fix prerelease decorator

7649b00

Fix integrationId casing, add tests

1a7f4b0

Adjust warning message

984c8b0

jhamon force-pushed the jhamon/import-feature branch from 31ee39a to 984c8b0 on September 17, 2024 15:52

jhamon added 4 commits September 17, 2024 12:45

Custom warning class

cfbd722

Cleanup

f3b2bab

Fix mypy issue

b19cce2

Remove warnings

5711249

jhamon commented Sep 17, 2024

Merge branch 'main' into jhamon/import-feature

6d5ab6e

jhamon marked this pull request as ready for review September 17, 2024 22:13

aulorbe requested review from aulorbe and austin-denoble September 17, 2024 22:59

aulorbe reviewed Sep 17, 2024

aulorbe approved these changes Sep 17, 2024

Audrey feedback

5b900c2

jhamon changed the title ~~Early access bulk import~~ Add bulk import Sep 18, 2024

austin-denoble approved these changes Sep 18, 2024

jhamon merged commit ff7b81d into main Sep 18, 2024
84 checks passed

jhamon deleted the jhamon/import-feature branch September 18, 2024 04:44
