
feat: sharded read rows #766

Merged 405 commits into googleapis:v3 on Jun 23, 2023
Conversation

@daniel-sanche (Contributor) commented Apr 17, 2023

Blocked on Read Rows PR: #762

This PR adds query sharding.

  • Adds query.shard(), to split one large query into multiple smaller ones
  • Adds client.sample_keys, to get a list of "sample points" for a table, that are used to efficiently shard
  • Adds client.read_rows_sharded, to execute a set of sharded queries in parallel, and return the results as a single list
  • Makes some usability changes to the RowRange and Query classes
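Taken together, the pieces above can be sketched in miniature. This is a hypothetical illustration of the flow, not the client's actual code: the real API is async, and `shard()` here is a stand-in for `query.shard()` with `sample_keys` standing in for the result of `client.sample_keys`.

```python
def shard(query_range, sample_keys):
    """Split a (start, end) key range into sub-ranges along sample keys."""
    start, end = query_range
    interior = [k for k in sample_keys if start < k < end]
    edges = [start, *interior, end]
    return list(zip(edges, edges[1:]))

sample_keys = [b"g", b"n", b"t"]           # "sample points" for the table
shards = shard((b"a", b"z"), sample_keys)  # one large query -> smaller ones
assert shards == [(b"a", b"g"), (b"g", b"n"), (b"n", b"t"), (b"t", b"z")]
# read_rows_sharded would then run each shard in parallel and merge results
```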

google/cloud/bigtable/read_rows_query.py (review threads, resolved):
Raises:
- ShardedReadRowsExceptionGroup: if any of the queries failed
- ValueError: if the query_list is empty
"""
Contributor:

I don't think we need an error. Also, the rows will be de-duplicated on the server side.

google/cloud/bigtable/client.py (review threads, resolved):
operation_timeout: int | float | None = 60,
per_sample_timeout: int | float | None = 10,
per_request_timeout: int | float | None = None,
operation_timeout: int | float | None = None,
Contributor:

Why no attempt timeout?

Contributor (Author):

I didn't implement retries since I noticed it wasn't on go/cbt-client-debugging-steps#client-retry-settings

I added retries, using the same retryable exceptions as mutate_rows. Let me know if that works
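As a rough illustration of that approach (a minimal sketch under assumptions, not the client's implementation): the retryable codes shared with mutate_rows would typically be transient gRPC statuses such as DEADLINE_EXCEEDED and UNAVAILABLE, stood in for here by stdlib exceptions.

```python
import time

# Stand-ins for the transient gRPC error codes treated as retryable
RETRYABLE = (TimeoutError, ConnectionError)

def call_with_retries(fn, attempts=3, backoff=0.1):
    """Retry fn on retryable errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except RETRYABLE:
            if attempt == attempts - 1:
                raise  # retries exhausted; surface the error
            time.sleep(backoff * (2 ** attempt))

calls = []
def flaky():
    calls.append(1)
    if len(calls) < 2:
        raise ConnectionError("transient")  # fails once, then succeeds
    return "ok"

assert call_with_retries(flaky) == "ok"
assert len(calls) == 2
```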

Contributor:

go/cbt-client-debugging-steps#client-retry-settings needs to be updated :P

google/cloud/bigtable/client.py (review threads, resolved):
# then add start_segment back to get the correct index
cropped_pts = split_points[start_segment:]
end_segment = (
bisect_left(cropped_pts, this_range.end.key) + start_segment
Contributor:

Do we need to check this_range.end.is_inclusive here also? If it's exclusive we should do bisect_left, but if it's inclusive I think it should use bisect_right?

Contributor (Author):

Not for the end segment, since the split_points mark the inclusive ends of each segment. Whether the range includes or excludes the split point, the results will be the same (left side)

Do you think I should add a comment to call this out?
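A small demonstration of why the left side is always correct when split points mark inclusive segment ends (the keys here are illustrative, not from the PR):

```python
from bisect import bisect_left

# Split points mark the *inclusive* end of each segment:
# segment 0: keys <= b"d"; segment 1: b"d" < keys <= b"m"; segment 2: keys > b"m"
split_points = [b"d", b"m"]

# A range ending at b"d" belongs to segment 0 whether the end is inclusive
# (b"d" itself is in segment 0) or exclusive (everything below b"d" is too),
# so bisect_left gives the right answer in both cases:
assert bisect_left(split_points, b"d") == 0
assert bisect_left(split_points, b"e") == 1  # past the split point -> next segment
```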

]
if exception_list:
# if any sub-request failed, raise an exception instead of returning results
raise ShardedReadRowsExceptionGroup(exception_list, len(query_list))
Contributor:

Since you are taking the effort to let all of the shards finish despite the error, you might as well add the partial results in the exception

Contributor (Author):

Good point, I added a successful_rows field to the exception
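The resulting shape might look like this. A minimal sketch with illustrative field names, not the library's actual class definition:

```python
class ShardedReadRowsExceptionGroup(Exception):
    """Raised when one or more sharded sub-requests fail."""

    def __init__(self, exceptions, total_queries, successful_rows):
        self.exceptions = exceptions            # failures from individual shards
        self.successful_rows = successful_rows  # partial results already read
        super().__init__(
            f"{len(exceptions)} of {total_queries} sharded queries failed"
        )

# Callers can still recover the partial data from the raised exception:
try:
    raise ShardedReadRowsExceptionGroup(
        [ValueError("shard 2 failed")], 3, ["row-a", "row-b"]
    )
except ShardedReadRowsExceptionGroup as e:
    partial = e.successful_rows
assert partial == ["row-a", "row-b"]
```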

Comment on lines 328 to 330
# use binary search to find the segment the end key belongs to.
# optimization: remove keys up to start_segment from searched list,
# then add start_segment back to get the correct index
Contributor:

why not just pass lo=start_segment to bisect_left?

Contributor (Author):

good catch
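The two forms are equivalent: bisect_left's lo parameter restricts the search to the suffix directly, with no cropping or index arithmetic (values below are illustrative):

```python
from bisect import bisect_left

split_points = [b"c", b"f", b"k", b"p"]
start_segment = 2
end_key = b"m"

# original approach: crop the searched list, then add the offset back
cropped_pts = split_points[start_segment:]
end_a = bisect_left(cropped_pts, end_key) + start_segment

# simpler: let bisect_left skip the prefix itself
end_b = bisect_left(split_points, end_key, lo=start_segment)

assert end_a == end_b == 3
```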

@daniel-sanche daniel-sanche merged commit ec2b983 into googleapis:v3 Jun 23, 2023
daniel-sanche added a commit that referenced this pull request Feb 5, 2024
* feat: add new v3.0.0 API skeleton (#745)

* feat: improve rows filters (#751)

* feat: read rows query model class (#752)

* feat: implement row and cell model classes (#753)

* feat: add pooled grpc transport (#748)

* feat: implement read_rows (#762)

* feat: implement mutate rows (#769)

* feat: literal value filter (#767)

* feat: row_exists and read_row (#778)

* feat: read_modify_write and check_and_mutate_row (#780)

* feat: sharded read rows (#766)

* feat: ping and warm with metadata (#810)

* feat: mutate rows batching (#770)

* chore: restructure module paths (#816)

* feat: improve timeout structure (#819)

* fix: api errors apply to all bulk mutations

* chore: reduce public api surface (#820)

* feat: improve error group tracebacks on < py11 (#825)

* feat: optimize read_rows (#852)

* chore: add user agent suffix (#842)

* feat: optimize retries (#854)

* feat: add test proxy (#836)

* chore(tests): add conformance tests to CI for v3 (#870)

* chore(tests): turn off fast fail for conformance tests (#882)

* feat: add TABLE_DEFAULTS enum for table method arguments (#880)

* fix: pass None for retry in gapic calls (#881)

* feat: replace internal dictionaries with protos in gapic calls (#875)

* chore: optimize gapic calls (#863)

* feat: expose retryable error codes to users (#879)

* chore: update api_core submodule (#897)

* chore: merge main into experimental_v3 (#900)

* chore: pin conformance tests to v0.0.2 (#903)

* fix: bulk mutation eventual success (#909)

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Labels
api: bigtable Issues related to the googleapis/python-bigtable API. size: xl Pull request size is extra large.