-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add bulk import #386
Add bulk import #386
Conversation
c784fe8
to
ca2bf2b
Compare
31ee39a
to
984c8b0
Compare
.github/workflows/alpha-release.yaml
Outdated
# unit-tests: | ||
# uses: './.github/workflows/testing-unit.yaml' | ||
# secrets: inherit | ||
# integration-tests: | ||
# uses: './.github/workflows/testing-integration.yaml' | ||
# secrets: inherit | ||
# dependency-tests: | ||
# uses: './.github/workflows/testing-dependency.yaml' | ||
# secrets: inherit | ||
|
||
pypi: | ||
uses: './.github/workflows/publish-to-pypi.yaml' | ||
needs: | ||
- unit-tests | ||
- integration-tests | ||
- dependency-tests | ||
# needs: | ||
# - unit-tests | ||
# - integration-tests | ||
# - dependency-tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disabled these to speed up the process of creating dev releases. But will uncomment before merging. Eventually some sort of menu toggle for skipping tests would be nice, but I didn't want to yak shave on that too much.
elif isinstance(input_value, datetime): | ||
# this must be higher than the date check because | ||
# isinstance(datetime_instance, date) == True | ||
return datetime | ||
elif isinstance(input_value, date): | ||
return date |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The date/datetime changes in these generated files reflect some adjustments I had to make in the template code to get correct handling of dates in API responses since the bulk import feature introduces date fields for the first time. Previously we had all of this commented out as a quick-and-dirty fix for an unrelated issue where untyped user metadata strings stored with vectors were sometimes coerced into datetime when fetched if the content looked date-ish.
There's a unit test covering that case to make sure we haven't regressed on that issue while still enabling us to interact with datetime objects when that is our intention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Answers my previous question about the templates, thanks.
@@ -0,0 +1,172 @@ | |||
import pytest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's where new unit tests were added for the main functionality
@@ -24,6 +24,8 @@ def build(cls, api_key: str, host: Optional[str] = None, **kwargs): | |||
openapi_config.host = host | |||
openapi_config.ssl_ca_cert = certifi.where() | |||
openapi_config.socket_options = cls._get_socket_options() | |||
openapi_config.discard_unknown_keys = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discovered this along the way. Seems like a better default behavior than erroring when unexpected data is returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks sweet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch
@@ -0,0 +1,193 @@ | |||
from enum import Enum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Majority of the new code is in this file.
Total nit that obviously you can ignore if you want, but maybe we should change the title of this PR from "Early access bulk import" to "Add early access bulk import", just so users later on know that this PR added the functionality (instead of iterated/removed/etc.) |
CONTRIBUTING.md
Outdated
|
||
Prerequisites: | ||
- You must be an employee with access to private Pinecone repositories | ||
- You must have [Docker Desktop](https://www.docker.com/products/docker-desktop/) installed and running. Our code generation script uses a dockerized version of the openapi CLI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: you might want to capitalize OpenAPI
@@ -11,7 +11,7 @@ develop: | |||
|
|||
test-unit: | |||
@echo "Running tests..." | |||
poetry run pytest --cov=pinecone --timeout=120 tests/unit | |||
poetry run pytest --cov=pinecone --timeout=120 tests/unit -s -vv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just making sure: you want to permanently add these flags here? I just know that they're usually added for debugging, so want to make sure they weren't left accidentally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I pretty much always want these flags.
codegen/build-oas.sh
Outdated
git fetch | ||
git checkout main | ||
git pull | ||
# git fetch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OOC why'd you remove these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, I didn't mean to leave these commented. I disabled the pull because I was experimenting with some local changes to get the datetime parsing properly.
error_mode: Optional[Literal["CONTINUE", "ABORT"]] = "CONTINUE", | ||
) -> StartImportResponse: | ||
"""Import data from a storage provider into an index. The uri must start with the scheme of a supported | ||
storage provider. For buckets that are not publicly readable, you will also need to separately configure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe so users have a better understanding of what we mean when we say "storage provider," you write "e.g. "S3"" or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently only S3 is supported, but I don't want to write that explicitly because I think it will quickly go out of date. The error message the API sends back is relatively helpful if you get this wrong.
) -> StartImportResponse: | ||
"""Import data from a storage provider into an index. The uri must start with the scheme of a supported | ||
storage provider. For buckets that are not publicly readable, you will also need to separately configure | ||
a storage integration and pass the integration name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this read "integration ID"? (not name)
|
||
Args: | ||
uri (str): The URI of the data to import. The URI must start with the scheme of a supported storage provider. | ||
integration (Optional[str], optional): If your bucket requires authentication to access, you need to pass the name of your storage integration using this property. Defaults to None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be integration ID not name?
print(op) | ||
``` | ||
|
||
You can convert the generator into a list by wrapping the generator in a call to the built-in `list` function: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we bumped the version of Node and some of our TypeScript config stuff, we may be able to use generators in the TypeScript codebase. 🤔
I had written one for the List
endpoint but ended up not shipping it.
``` | ||
|
||
You should be cautious with this approach because it will fetch all operations at once, which could be a large number | ||
network calls and a lot of memory to hold the results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're missing an "of" here (large number OF network calls)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm! just some nits/language tweaks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, really nice work getting all of this in place. Very thoughtful approach. Do you think it would make sense doing something similar in other repos? I recently created a release candidate branch to work against in Go.
Really nice work, Jen.
@@ -142,3 +142,26 @@ Hello, from your virtualenv! | |||
``` | |||
|
|||
If you experience any issues please [file a new issue](https://github.com/pinecone-io/pinecone-python-client/issues/new). | |||
|
|||
|
|||
## Consuming API version upgrades |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: We should definitely put something like this in other repos where we're using a /codegen/
submodule in other repos, we had someone external asking about codegen in the Rust codebase. Would be good to be explicit until we've got the specs published, plus for better documenting how the codegen actually works in each repo.
is_early_access=$2 # e.g. true | ||
|
||
# if is_early_access is true, add the "ea" module | ||
if [ "$is_early_access" = "true" ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: Clever way of going about this, nice. 👍
# Hack to prevent coercion of strings into datetimes within "object" types while still | ||
# allowing datetime parsing for fields that are explicitly typed as datetime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, does this replace the templating we used to have to do to fix the previous datetime
problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, exactly. I rolled back the old changes to the templates and replaced with this. It's only this one spot where the datetime stuff was problematic.
@@ -24,6 +24,8 @@ def build(cls, api_key: str, host: Optional[str] = None, **kwargs): | |||
openapi_config.host = host | |||
openapi_config.ssl_ca_cert = certifi.where() | |||
openapi_config.socket_options = cls._get_socket_options() | |||
openapi_config.discard_unknown_keys = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch
elif isinstance(input_value, datetime): | ||
# this must be higher than the date check because | ||
# isinstance(datetime_instance, date) == True | ||
return datetime | ||
elif isinstance(input_value, date): | ||
return date |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Answers my previous question about the templates, thanks.
@@ -64,7 +66,7 @@ def parse_query_response(response: QueryResponse): | |||
return response | |||
|
|||
|
|||
class Index: | |||
class Index(ImportFeatureMixin): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool, I really like the mixin approach.
print(op) | ||
``` | ||
|
||
You can convert the generator into a list by wrapping the generator in a call to the built-in `list` function: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we bumped the version of Node and some of our TypeScript config stuff, we may be able to use generators in the TypeScript codebase. 🤔
I had written one for the List
endpoint but ended up not shipping it.
Problem
Implement the following new methods:
start_import
describe_import
list_imports
cancel_import
Solution
Code generation changes
Since these features are in prerelease, they only exist in the spec for the upcoming 2024-10 API version. This required me to make modifications to the codegen script that is now run as:
The second boolean argument is used to tell the codegen script whether the generated code should be stored in a new
pinecone/core_ea
subpackage. In the future we should probably do more to hide this complexity from the developer, but for now it is good enough.Code organization
For the bespoke bits of the implementation that wrap the generated code, I have put them into a new class,
ImportFeatureMixin
, that theIndex
class inherits from. These functions could have all been implemented directly in theIndex
class, but I thought it a bit tidier to segregate these into a separate spot than just dump everything into one giant file.Overridden repr representation on generated objects
The default print output in the generated classes comes from pprint and it looks quite poor for large objects. So I installed overrides that dump the objects into a formatted json style instead. I had previously done something similar for describe_index, etc, methods, so for this PR it was just a matter of cleaning up that logic a bit and moving it somewhere it could be reused.
So far, I haven't tweaked the generated classes to do this approach across the board because it doesn't work well for long arrays of vector values.
Type of Change
Test Plan
Manual testing with a dev release is in this demo notebook