Add bulk import #386

Merged
merged 27 commits into from
Sep 18, 2024

@jhamon jhamon commented Aug 22, 2024

Problem

Implement the following new methods:

  • start_import
  • describe_import
  • list_imports
  • cancel_import

Solution

Code generation changes

Since these features are in prerelease, they only exist in the spec for the upcoming 2024-10 API version. This required me to modify the codegen script, which is now run as:

./codegen/build-oas.sh 2024-07 false && ./codegen/build-oas.sh 2024-10 true

The second argument, a boolean, tells the codegen script whether the generated code should be stored in a new pinecone/core_ea subpackage. In the future we should probably do more to hide this complexity from the developer, but for now it is good enough.

Code organization

I have put the bespoke bits of the implementation that wrap the generated code into a new class, ImportFeatureMixin, that the Index class inherits from. These functions could all have been implemented directly in the Index class, but I thought it a bit tidier to keep them in a separate place rather than dump everything into one giant file.
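
A minimal sketch of the mixin layout described above, for readers unfamiliar with the pattern. The class bodies are placeholders rather than the actual client code: `_FakeImportApi`, its return values, and the constructor arguments are hypothetical stand-ins for the generated API client.

```python
class _FakeImportApi:
    """Hypothetical stand-in for the generated bulk-import API client."""

    def start_import(self, uri):
        return {"id": "1", "uri": uri}

    def describe_import(self, import_id):
        # "Pending" is an illustrative value, not taken from the API spec
        return {"id": import_id, "status": "Pending"}


class ImportFeatureMixin:
    """Groups the import-related wrapper methods in one place."""

    def __init__(self, **kwargs):
        # Assumption: the mixin holds its own handle to the generated API
        self._import_api = _FakeImportApi()

    def start_import(self, uri):
        return self._import_api.start_import(uri)

    def describe_import(self, id):
        return self._import_api.describe_import(id)


class Index(ImportFeatureMixin):
    """Inherits the import methods instead of defining them inline."""

    def __init__(self, host, **kwargs):
        super().__init__(**kwargs)
        self.host = host
```

With this layout, `Index(host="example").start_import("s3://bucket/path")` resolves to the method defined on the mixin, so the import code can live in its own file.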

Overridden repr on generated objects

The default print output of the generated classes comes from pprint, and it looks quite poor for large objects. So I installed overrides that dump the objects as formatted JSON instead. I had previously done something similar for describe_index and related methods, so for this PR it was just a matter of cleaning up that logic a bit and moving it somewhere it could be reused.

So far, I haven't applied this approach to the generated classes across the board because it doesn't work well for long arrays of vector values.
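
One way such an override could look, as a rough sketch only. `GeneratedModel`, `to_dict`, and `install_json_repr` are hypothetical names chosen for illustration; the PR's actual helper may be structured differently.

```python
import json
from datetime import date, datetime


class GeneratedModel:
    """Hypothetical stand-in for a generated model class."""

    def __init__(self, **fields):
        self._data = dict(fields)

    def to_dict(self):
        return dict(self._data)


def _json_default(value):
    # date and datetime are not JSON serializable by default,
    # so emit them as ISO 8601 strings
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def install_json_repr(cls):
    """Replace the class's repr with an indented JSON dump of to_dict()."""

    def _repr(self):
        return json.dumps(self.to_dict(), indent=4, default=_json_default)

    cls.__repr__ = _repr
    cls.__str__ = _repr
    return cls


install_json_repr(GeneratedModel)
```

After this, `print(GeneratedModel(id="1"))` produces indented JSON, one key per line.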

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Infrastructure change (CI configs, etc)
  • Non-code change (docs, etc)
  • None of the above: (explain here)

Test Plan

Manual testing with a dev release is in this demo notebook

Comment on lines 27 to 42
# unit-tests:
# uses: './.github/workflows/testing-unit.yaml'
# secrets: inherit
# integration-tests:
# uses: './.github/workflows/testing-integration.yaml'
# secrets: inherit
# dependency-tests:
# uses: './.github/workflows/testing-dependency.yaml'
# secrets: inherit

pypi:
uses: './.github/workflows/publish-to-pypi.yaml'
needs:
- unit-tests
- integration-tests
- dependency-tests
# needs:
# - unit-tests
# - integration-tests
# - dependency-tests
Collaborator Author

I disabled these to speed up the process of creating dev releases, but I will uncomment them before merging. Eventually some sort of menu toggle for skipping tests would be nice, but I didn't want to yak-shave on that too much.

Comment on lines +743 to +748
elif isinstance(input_value, datetime):
# this must be higher than the date check because
# isinstance(datetime_instance, date) == True
return datetime
elif isinstance(input_value, date):
return date
Collaborator Author

The date/datetime changes in these generated files reflect some adjustments I had to make in the template code to get correct handling of dates in API responses since the bulk import feature introduces date fields for the first time. Previously we had all of this commented out as a quick-and-dirty fix for an unrelated issue where untyped user metadata strings stored with vectors were sometimes coerced into datetime when fetched if the content looked date-ish.

There's a unit test covering that case to make sure we haven't regressed on that issue while still enabling us to interact with datetime objects when that is our intention.
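
The ordering concern in the snippet above follows from `datetime` being a subclass of `date` in the standard library. A small sketch of the effect; `classify` and `classify_wrong_order` are hypothetical helpers, not functions from the generated code.

```python
from datetime import date, datetime


def classify(value):
    """Return the most specific date-like type for a value, or None."""
    # datetime is a subclass of date, so it has to be tested first
    if isinstance(value, datetime):
        return datetime
    elif isinstance(value, date):
        return date
    return None


def classify_wrong_order(value):
    """Same checks in the opposite order, to show the failure mode."""
    if isinstance(value, date):
        return date
    elif isinstance(value, datetime):
        return datetime  # never reached: every datetime is also a date
    return None
```

With the checks reversed, a `datetime` instance is reported as a plain `date`, which is the case the code comment warns about.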

Contributor

Answers my previous question about the templates, thanks.

@@ -0,0 +1,172 @@
import pytest
Collaborator Author

Here's where new unit tests were added for the main functionality

@@ -24,6 +24,8 @@ def build(cls, api_key: str, host: Optional[str] = None, **kwargs):
openapi_config.host = host
openapi_config.ssl_ca_cert = certifi.where()
openapi_config.socket_options = cls._get_socket_options()
openapi_config.discard_unknown_keys = True
Collaborator Author

Discovered this along the way. Seems like a better default behavior than erroring when unexpected data is returned.

Looks sweet

Contributor

Good catch

@@ -0,0 +1,193 @@
from enum import Enum
Collaborator Author

Majority of the new code is in this file.

@jhamon jhamon marked this pull request as ready for review September 17, 2024 22:13
aulorbe commented Sep 17, 2024

Total nit that obviously you can ignore if you want, but maybe we should change the title of this PR from "Early access bulk import" to "Add early access bulk import", just so users later on know that this PR added the functionality (instead of iterated/removed/etc.)

CONTRIBUTING.md Outdated

Prerequisites:
- You must be an employee with access to private Pinecone repositories
- You must have [Docker Desktop](https://www.docker.com/products/docker-desktop/) installed and running. Our code generation script uses a dockerized version of the openapi CLI.

Nit: you might want to capitalize OpenAPI

@@ -11,7 +11,7 @@ develop:

test-unit:
@echo "Running tests..."
poetry run pytest --cov=pinecone --timeout=120 tests/unit
poetry run pytest --cov=pinecone --timeout=120 tests/unit -s -vv

Just making sure: you want to permanently add these flags here? I just know that they're usually added for debugging, so want to make sure they weren't left accidentally

Collaborator Author

Yeah, I pretty much always want these flags.

git fetch
git checkout main
git pull
# git fetch

OOC why'd you remove these?

Collaborator Author

Nice catch, I didn't mean to leave these commented out. I disabled the pull because I was experimenting with some local changes to get the datetime parsing working properly.

error_mode: Optional[Literal["CONTINUE", "ABORT"]] = "CONTINUE",
) -> StartImportResponse:
"""Import data from a storage provider into an index. The uri must start with the scheme of a supported
storage provider. For buckets that are not publicly readable, you will also need to separately configure

Maybe, so users have a better understanding of what we mean by "storage provider," you could write "e.g. S3" or something?

Collaborator Author

Currently only S3 is supported, but I don't want to write that explicitly because I think it will quickly go out of date. The error message the API sends back is relatively helpful if you get this wrong.

) -> StartImportResponse:
"""Import data from a storage provider into an index. The uri must start with the scheme of a supported
storage provider. For buckets that are not publicly readable, you will also need to separately configure
a storage integration and pass the integration name.

Should this read "integration ID"? (not name)


Args:
uri (str): The URI of the data to import. The URI must start with the scheme of a supported storage provider.
integration (Optional[str], optional): If your bucket requires authentication to access, you need to pass the name of your storage integration using this property. Defaults to None.

Should this be integration ID not name?

print(op)
```

You can convert the generator into a list by wrapping the generator in a call to the built-in `list` function:

noice

Contributor

Since we bumped the version of Node and some of our TypeScript config stuff, we may be able to use generators in the TypeScript codebase. 🤔

I had written one for the List endpoint but ended up not shipping it.

```

You should be cautious with this approach because it will fetch all operations at once, which could be a large number
network calls and a lot of memory to hold the results.

I think you're missing an "of" here (large number OF network calls)
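
To illustrate the caution in the quoted docs, here is a rough sketch of a paginated generator. It is not the client's implementation: `fetch_page`, `list_operations`, the `PAGES` data, and the token scheme are all hypothetical, and `calls` exists only to count simulated network requests.

```python
from typing import Iterator, List, Optional, Tuple

# Hypothetical paginated data: token -> (items, next_token)
PAGES = {
    None: (["op-1", "op-2"], "t1"),
    "t1": (["op-3", "op-4"], "t2"),
    "t2": (["op-5"], None),
}
calls: List[Optional[str]] = []


def fetch_page(token: Optional[str]) -> Tuple[List[str], Optional[str]]:
    """Stand-in for one network call returning (items, next_token)."""
    calls.append(token)
    return PAGES[token]


def list_operations() -> Iterator[str]:
    """Yield items one at a time, fetching the next page only when needed."""
    token = None
    while True:
        items, token = fetch_page(token)
        yield from items
        if token is None:
            return
```

Iterating lazily stops fetching as soon as the caller stops consuming, whereas `list(list_operations())` walks every page and holds all results in memory at once.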

@aulorbe aulorbe left a comment

lgtm! just some nits/language tweaks

@jhamon jhamon changed the title Early access bulk import Add bulk import Sep 18, 2024
Contributor

@austin-denoble austin-denoble left a comment

LGTM, really nice work getting all of this in place. Very thoughtful approach. Do you think it would make sense to do something similar in other repos? I recently created a release candidate branch to work against in Go.

Really nice work, Jen.

@@ -142,3 +142,26 @@ Hello, from your virtualenv!
```

If you experience any issues please [file a new issue](https://github.com/pinecone-io/pinecone-python-client/issues/new).


## Consuming API version upgrades
Contributor

thought: We should definitely put something like this in the other repos where we're using a /codegen/ submodule. We had someone external asking about codegen in the Rust codebase. It would be good to be explicit until we've got the specs published, and to better document how the codegen actually works in each repo.

is_early_access=$2 # e.g. true

# if is_early_access is true, add the "ea" module
if [ "$is_early_access" = "true" ]; then
Contributor

thought: Clever way of going about this, nice. 👍

Comment on lines +90 to +91
# Hack to prevent coercion of strings into datetimes within "object" types while still
# allowing datetime parsing for fields that are explicitly typed as datetime
Contributor

Cool, does this replace the templating we used to have to do to fix the previous datetime problem?

Collaborator Author

Yeah, exactly. I rolled back the old changes to the templates and replaced them with this. It's only this one spot where the datetime stuff was problematic.
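
The idea in the quoted comment can be sketched roughly as follows: parse a string into a `datetime` only when the field is explicitly declared as one, and leave untyped values alone. This is an illustration under that assumption, not the generated template code; `deserialize` and `declared_type` are hypothetical names.

```python
from datetime import datetime


def deserialize(value, declared_type):
    """Convert a raw JSON value according to its declared type."""
    if declared_type is datetime and isinstance(value, str):
        # Explicitly typed field: parsing into datetime is intended
        return datetime.fromisoformat(value)
    # Untyped values (for example user metadata) pass through unchanged,
    # even when the string happens to look like a date
    return value
```

Under this scheme a metadata string such as "2024-09-18T04:44:00" stays a string unless the schema marks the field as a datetime.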

@@ -64,7 +66,7 @@ def parse_query_response(response: QueryResponse):
return response


class Index:
class Index(ImportFeatureMixin):
Contributor

Very cool, I really like the mixin approach.

@jhamon jhamon merged commit ff7b81d into main Sep 18, 2024
84 checks passed
@jhamon jhamon deleted the jhamon/import-feature branch September 18, 2024 04:44