Support caching in Apache Beam by using relative co_filename paths. (#32979)

* The motivation for this change is to support caching in Apache Beam
for Google.

Apache Beam does the following:
- Pickle Python code
- Send the pickled source code to "worker" VMs
- The workers unpickle and execute the code

In the environment where these Beam pipelines execute, the source code
is in a temporary directory whose name is random and changes. The source
code paths relative to the temporary directory are constant. Using
absolute paths prevents pickled code from being cached, because the
absolute path keeps changing. Using relative paths enables this caching
and promises significant resource savings and speed-ups.

Additionally, the absolute paths leak information about the directory
structure of the machine pickling the source code. When the pickled code
is passed across the network to another machine with a different
directory structure, the absolute paths may no longer be valid.

Relative paths are used, rather than omitting the path entirely, because
Python uses the co_filename attribute to create stack traces.
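
The caching argument above can be illustrated with a small sketch. It is not
Beam's caching code: `cache_key` is a hypothetical key function, and the
staging paths are made up. It only shows that the same source compiled under
two different temporary directories yields different keys until co_filename
is rewritten to the stable relative path.

```python
import hashlib


def cache_key(code):
  """Hypothetical cache key: digest of the bytecode and recorded filename."""
  digest = hashlib.sha256()
  digest.update(code.co_code)
  digest.update(code.co_filename.encode("utf-8"))
  return digest.hexdigest()


SOURCE = "def double(x):\n  return 2 * x\n"

# Same source, staged under two differently named temporary directories
# (both directory names are invented for this example).
run_a = compile(SOURCE, "/tmp/stage-a1b2/pipeline/transforms.py", "exec")
run_b = compile(SOURCE, "/tmp/stage-c3d4/pipeline/transforms.py", "exec")

# The absolute paths differ between runs, so the keys differ.
absolute_keys_match = cache_key(run_a) == cache_key(run_b)

# Rewriting co_filename to the constant relative path makes the keys agree.
relative = "pipeline/transforms.py"
relative_keys_match = (
    cache_key(run_a.replace(co_filename=relative)) ==
    cache_key(run_b.replace(co_filename=relative)))

print(absolute_keys_match, relative_keys_match)
```

`CodeType.replace` requires Python 3.8 or later.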

* Simplify.

---------

Co-authored-by: Robert Bradshaw <[email protected]>
kushmiD and robertwb authored Nov 1, 2024
1 parent 0987e79 commit 02ab9da
Showing 2 changed files with 18 additions and 6 deletions.
19 changes: 13 additions & 6 deletions sdks/python/apache_beam/internal/dill_pickler.py
@@ -46,9 +46,15 @@

 settings = {'dill_byref': None}

-if sys.version_info >= (3, 10) and dill.__version__ == "0.3.1.1":
-  # Let's make dill 0.3.1.1 support Python 3.11.
+patch_save_code = sys.version_info >= (3, 10) and dill.__version__ == "0.3.1.1"


+def get_normalized_path(path):
+  """Returns a normalized path. This function is intended to be overridden."""
+  return path


+if patch_save_code:
   # The following function is based on 'save_code' from 'dill'
   # Author: Mike McKerns (mmckerns @caltech and @uqfoundation)
   # Copyright (c) 2008-2015 California Institute of Technology.
@@ -66,6 +72,7 @@

   @dill.register(CodeType)
   def save_code(pickler, obj):
+    co_filename = get_normalized_path(obj.co_filename)
     if hasattr(obj, "co_endlinetable"):  # python 3.11a (20 args)
       args = (
           obj.co_argcount,
@@ -78,7 +85,7 @@ def save_code(pickler, obj):
           obj.co_consts,
           obj.co_names,
           obj.co_varnames,
-          obj.co_filename,
+          co_filename,
           obj.co_name,
           obj.co_qualname,
           obj.co_firstlineno,
@@ -100,7 +107,7 @@ def save_code(pickler, obj):
           obj.co_consts,
           obj.co_names,
           obj.co_varnames,
-          obj.co_filename,
+          co_filename,
           obj.co_name,
           obj.co_qualname,
           obj.co_firstlineno,
@@ -120,7 +127,7 @@ def save_code(pickler, obj):
           obj.co_consts,
           obj.co_names,
           obj.co_varnames,
-          obj.co_filename,
+          co_filename,
           obj.co_name,
           obj.co_firstlineno,
           obj.co_linetable,
@@ -138,7 +145,7 @@ def save_code(pickler, obj):
           obj.co_consts,
           obj.co_names,
           obj.co_varnames,
-          obj.co_filename,
+          co_filename,
           obj.co_name,
           obj.co_firstlineno,
           obj.co_lnotab,
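
The default get_normalized_path above returns the path unchanged and is
documented as intended to be overridden. A minimal sketch of one possible
override follows. The helper name `make_relative_normalizer` and the staging
directory are hypothetical and not part of this commit; installing the
override by assigning to the module attribute is likewise an assumption about
how a caller might hook in.

```python
import os


def make_relative_normalizer(root):
  """Returns a function mapping absolute paths under `root` to relative ones."""
  root = os.path.abspath(root)

  def normalize(path):
    # Names such as "<stdin>" or "<string>" are not absolute; keep them as-is.
    if not os.path.isabs(path):
      return path
    # Paths outside the root (for example installed packages) stay absolute.
    if os.path.commonpath([root, path]) != root:
      return path
    return os.path.relpath(path, root)

  return normalize


normalize = make_relative_normalizer("/tmp/stage-a1b2")
print(normalize("/tmp/stage-a1b2/pipeline/transforms.py"))

# Assumed way to install the override (not shown in this commit):
#   from apache_beam.internal import dill_pickler
#   dill_pickler.get_normalized_path = make_relative_normalizer(staging_root)
```

Per the diff, the hook is only consulted when patch_save_code is true, that
is, on Python 3.10 or later with dill 0.3.1.1.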
5 changes: 5 additions & 0 deletions sdks/python/apache_beam/internal/pickler_test.py
@@ -94,6 +94,11 @@ def test_pickle_rlock(self):

     self.assertIsInstance(loads(dumps(rlock_instance)), rlock_type)

+  def test_save_paths(self):
+    f = loads(dumps(lambda x: x))
+    co_filename = f.__code__.co_filename
+    self.assertTrue(co_filename.endswith('pickler_test.py'))

   @unittest.skipIf(NO_MAPPINGPROXYTYPE, 'test if MappingProxyType introduced')
   def test_dump_and_load_mapping_proxy(self):
     self.assertEqual(
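
The test above checks that co_filename survives a pickle round trip. The
reason the filename is kept at all is that tracebacks read it from the code
object. The sketch below uses only the standard library; the relative path
and the source string are made up for illustration.

```python
import traceback

SOURCE = "def fail():\n  raise ValueError('boom')\n\nfail()\n"

# Compile with a relative filename, as a normalized code object would carry.
code = compile(SOURCE, "pipeline/transforms.py", "exec")

try:
  exec(code, {})
except ValueError as error:
  frames = traceback.extract_tb(error.__traceback__)

# The innermost frame reports the relative filename stored in co_filename.
reported_files = [frame.filename for frame in frames]
print(reported_files[-1], frames[-1].name, frames[-1].lineno)
```

An empty or missing filename would still execute, but the traceback would no
longer point at a recognizable source file.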