
Version 3 with cached cross chunk edges #454

Open: wants to merge 105 commits into main from pcgv3

Conversation

@akhileshh (Contributor) commented Aug 6, 2023

  • Adds a new column family for cached cross chunk edges.
  • Adds a MaxAgeGCRule to the existing column family that stores supervoxel cross chunk edges; these edges are only needed during ingest and are eventually garbage-collected (see the sketch below this list).
  • Edits make use of cached cross chunk edges.
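
Below is a minimal sketch of setting up a column family with a max-age garbage collection rule using the standard `google-cloud-bigtable` client; the project, instance, table, and column family names here are placeholders (not the ones used by pychunkedgraph), and the 30-day age is an arbitrary illustration.

```python
import datetime

from google.cloud import bigtable
from google.cloud.bigtable import column_family

client = bigtable.Client(project="example-project", admin=True)
table = client.instance("example-instance").table("example-chunkedgraph")

# Cells older than max_age become *eligible* for deletion; BigTable's
# background garbage collection removes them eventually, not immediately.
gc_rule = column_family.MaxAgeGCRule(datetime.timedelta(days=30))
table.column_family("supervoxel_cross_edges", gc_rule=gc_rule).create()
```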

Summary of changes in pychunkedgraph.ingest:

  • Layer 2 creation is mostly unchanged; it stores cross chunk edges with supervoxels.
    • The column family used to store these edges now has a max-age garbage collection rule.
    • During ingest, these edges are used to cache higher-layer cross chunk edges; they are eventually deleted by BigTable's garbage collection routines.
  • When ingesting layer 3, cross chunk edges for children (layer 2) are updated and "lifted" using the supervoxel cross chunk edges mentioned above; the lifted edges live in a different column family and are therefore retained forever.
    • At the same time, cross chunk edges for parents at layer 3 are created by merging the cross edges of their children; these are intermediate and will be lifted when ingesting the next parent layer.
  • For each layer > 3 until root layer:
    • Update children cross chunk edges by "lifting" the edges created during the previous layer ingest.
    • Add parent cross chunk edges by merging children cross chunk edges; they will be updated when ingesting the next layer.

This assumes all chunks at the lower layer have been created before the current layer is created, so we can no longer queue parent chunk jobs automatically when their child chunks are complete.

We must now ingest/create one layer at a time.
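
A rough sketch of this per-layer lift-then-merge flow is below; the helper names on `cg` (`get_children`, `read_cached_cross_edges`, `get_parents_at_layer`, `write_cached_cross_edges`) are hypothetical stand-ins for the actual pychunkedgraph calls, not the real API.

```python
import numpy as np

def ingest_parent_layer(cg, layer: int, parent_ids) -> None:
    # Illustrative sketch only: assumes cached cross chunk edges are stored per
    # node as {edge_layer: (N, 2) array of (node, partner)}.
    for parent in parent_ids:
        merged = {}
        for child in cg.get_children(parent):
            cached = cg.read_cached_cross_edges(child)
            lifted = {}
            for edge_layer, edges in cached.items():
                # "Lift": replace partner nodes (column 1) with their parents at
                # the children's layer, so edges reference peers of `child`.
                partners = cg.get_parents_at_layer(edges[:, 1], layer - 1)
                lifted[edge_layer] = np.column_stack((edges[:, 0], partners))
                merged.setdefault(edge_layer, []).append(lifted[edge_layer])
            cg.write_cached_cross_edges(child, lifted)

        # The parent's intermediate cross edges are the union of its children's
        # lifted edges; they get lifted again when the next layer is ingested.
        parent_edges = {}
        for edge_layer, chunks in merged.items():
            edges = np.concatenate(chunks)
            edges[:, 0] = parent  # the new parent now owns these cross edges
            parent_edges[edge_layer] = np.unique(edges, axis=0)
        cg.write_cached_cross_edges(parent, parent_edges)
```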

Summary of changes in pychunkedgraph.graph.edits:

  • Edits are expected to be faster now; going to layer 2 to extract cross chunk edges is no longer necessary since they're cached at each layer.
  • During an edit, these cached cross chunk edges must be updated in both directions - to and from the newly created nodes and their existing neighbors (see the sketch after this list).
    • Most changes in this module are to handle this step.
    • Caching these edges has also made the edits logic simpler and cleaner.
    • When updating new cross edges, we need to ensure descendants get replaced by the highest parent.
    • For splits, we need to filter out inactive cross edges after the local graph is read from bucket storage.
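
A self-contained sketch of the two bookkeeping steps above: replacing outdated descendants with their highest parent, and keeping the cache symmetric between a new node and its neighbors. The plain-dict cache and helper names here are illustrative, not the module's real internals.

```python
import numpy as np

def lift_partners(edges: np.ndarray, latest_parent: dict) -> np.ndarray:
    # Replace partner IDs (column 1) that refer to outdated descendants with
    # their current highest parent; `latest_parent` maps old ID -> new parent ID.
    lifted = edges.copy()
    lifted[:, 1] = [latest_parent.get(int(p), int(p)) for p in edges[:, 1]]
    return np.unique(lifted, axis=0)

def symmetric_update(cache: dict, new_id: int, partners_by_layer: dict) -> None:
    # Update cached cross chunk edges in both directions: the new node records
    # its partners, and every partner records an edge back to the new node.
    cache[new_id] = partners_by_layer
    for layer, edges in partners_by_layer.items():
        for partner in np.unique(edges[:, 1]):
            partner_d = cache.setdefault(int(partner), {})
            back = np.array([[partner, new_id]], dtype=np.uint64)
            prev = partner_d.get(layer, np.empty((0, 2), dtype=np.uint64))
            partner_d[layer] = np.unique(np.concatenate([prev, back]), axis=0)
```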

@akhileshh requested a review from sdorkenw on August 6, 2023 20:04
@akhileshh changed the title from "WIP" to "WIP V3" on Aug 11, 2023
@akhileshh marked this pull request as ready for review on August 23, 2023 22:56
@akhileshh changed the title from "WIP V3" to "Version 3 with cached cross chunk edges" on Aug 23, 2023
@akhileshh requested a review from fcollman on August 24, 2023 00:43

def parents_multiple(self, node_ids: np.ndarray, *, time_stamp: datetime = None):
node_ids = np.array(node_ids, dtype=NODE_ID)

@nkemnitz (Contributor) commented Sep 6, 2023

Just saw this here (and in some other places), same as in #458: np.array will create a copy by default. np.asarray avoids the copy if the requirements are already met.
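
A small standalone illustration of this point (not code from the PR):

```python
import numpy as np

ids = np.arange(5, dtype=np.uint64)

a = np.array(ids, dtype=np.uint64)    # always allocates a new array
b = np.asarray(ids, dtype=np.uint64)  # no copy: dtype already matches

print(a is ids, np.shares_memory(a, ids))  # False False
print(b is ids, np.shares_memory(b, ids))  # True True
```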


@sdorkenw (Contributor) left a comment

Overall this looks good besides the one point - a tricky one though - that I marked

new_cx_edges_d[layer] = edges
assert np.all(edges[:, 0] == new_id)
cg.cache.cross_chunk_edges_cache[new_id] = new_cx_edges_d
entries = _update_neighbor_cross_edges(

@sdorkenw (Contributor) commented Sep 8, 2023

I think this here can introduce problems if a neighboring node is a neighbor to multiple new_l2_ids.

_update_neighbor_cross_edges looks right to me. It writes a complete new set of L2 edges for a node. But if the same node is updated multiple times, then only the last update is reflected. Maybe the logic here takes care of this somehow, but then it still introduces multiple unnecessary writes.

So, if I am correct about this, the solution would be to consolidate this call across all new_l2_ids to only make one call per neighboring node id.
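
A minimal sketch of the suggested consolidation, assuming neighbor updates can be accumulated first and then written once per neighboring node; `get_new_cx_edges` and `write_neighbor_cross_edges` are hypothetical stand-ins (the PR's real helper is `_update_neighbor_cross_edges`, whose exact signature isn't shown in this excerpt), and the returned mutation entries are likewise an assumption.

```python
from collections import defaultdict
import numpy as np

def consolidated_neighbor_updates(new_l2_ids, get_new_cx_edges, write_neighbor_cross_edges):
    # Collect back-edges per neighboring node across *all* new L2 IDs first ...
    pending = defaultdict(lambda: defaultdict(list))  # neighbor -> layer -> [edge arrays]
    for new_id in new_l2_ids:
        for layer, edges in get_new_cx_edges(new_id).items():
            for neighbor in np.unique(edges[:, 1]):
                pending[int(neighbor)][layer].append(
                    np.array([[neighbor, new_id]], dtype=np.uint64)
                )

    # ... then write each neighbor exactly once, instead of once per
    # (neighbor, new_id) pair where a later write clobbers an earlier one.
    entries = []
    for neighbor, by_layer in pending.items():
        merged = {
            layer: np.unique(np.concatenate(chunks), axis=0)
            for layer, chunks in by_layer.items()
        }
        entries.extend(write_neighbor_cross_edges(neighbor, merged))
    return entries
```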

new_cx_edges_d[layer] = edges
assert np.all(edges[:, 0] == new_id)
cg.cache.cross_chunk_edges_cache[new_id] = new_cx_edges_d
entries = _update_neighbor_cross_edges(

same issue as above

akhileshh and others added 24 commits December 5, 2024 20:32
* feat: convert edges to ocdbt

* feat: worker function to convert edges to ocdbt

* feat: ocdbt option, consolidate ingest cli

* fix(ingest): move fn to utils

* fix(ingest): move ocdbt setup to a fn

* add tensorstore req, fix build kaniko cache

* feat: copy fake_edges to column family 4

* feat: upgrade atomic chunks

* fix: rename abstract module to parent

* feat: upgrade higher layers, docs

* feat: upgrade cli, move common fns to utils

* add copy_fake_edges in upgrade fn

* handle earliest_timestamp, add test flag to upgrade

* fix: fake_edges serialize np.uint64

* add get_operation method, fix timestamp in repair, check for parent

* check for l2 ids invalidated by edit retries

* remove unnecessary parent assert

* remove unused vars

* ignore invalid ids, assert parent after earliest_ts

* check for ids invalidated by retries in higher layers

* parallelize update_cross_edges

* overwrite graph version, create col family 4

* improve status print formatting

* remove unused code, consolidate small common module

* efficient check for chunks not done

* check for empty chunks, use get_parents

* efficient get_edit_ts call by batching all children

* reduce earliest_ts calls

* combine bigtable calls, use numpy unique

* add completion rate command

* fix: ignore children without cross edges

* add span option to rate calculation

* reduce mem usage with global vars

* optimize cross edge reading

* use existing layer var

* limit cx edge reading above given layer

* fix: read for earliest_ts check only if true

* filter cross edges fn with timestamps

* remove git from dockerignore, print stats

* shuffle for better distribution of ids

* fix: use different var name for layer

* increase bigtable read timeout

* add message with assert

* fix: make span option int

* handle skipped connections

* fix: read cross edges at layer >= node_layer

* handle another case of skipped nodes

* check for unique parent count

* update graph_id in meta

* uncomment line

* make repair easier to use

* add sanity check for edits

* add sanity check for each layer

* add layers flag for cx edges

* use better names for functions and vars, update types, fix docs
* feat(ingest): use temporarily cached cross chunk edges

* fix: switch to using partners vector instead of 2d edges array

* fix(edits): l2 - use and store cx edges that become relevant only at l2

* chore: rename counterpart to partner

* fix: update partner cx edges

* feat(edits): use layer relevant partners

* fix tests

* persist cross chunk layers with each node

* fix: update cross chunk layers in edits

* fix: update cross layer from old ids in l2

* update deprecated utcnow

* fix split tests

* Bump version: 3.0.0 → 3.0.1

* fix: missed timestamp arg

* update docs, remove unnecessary methods

* revert structural changes

* fix new tests; revert bumpversion.cfg
@akhileshh force-pushed the pcgv3 branch 2 times, most recently from 033ba89 to 47f2d2f on December 16, 2024 20:33