try dill to support lambda function? #22

Closed
slimtom95 opened this issue Apr 26, 2019 · 5 comments
Labels
enhancement New feature or request

Comments

slimtom95 commented Apr 26, 2019

Greetings.

I know pickle doesn't support lambda function serialization, but another serialization library, dill, does. There is also a multiprocessing library, multiprocess, which uses dill in place of pickle.
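For illustration, a minimal sketch of the difference (not taken from this issue; the `square` name is just an example): dill can round-trip a lambda that the standard pickle module rejects.

import pickle
import dill

square = lambda x: x * x

# The standard library pickle cannot serialize a lambda (the exact
# exception type varies across Python versions).
try:
    pickle.dumps(square)
except Exception as exc:
    print("pickle failed:", exc)

# dill serializes the function by value and restores a working copy.
restored = dill.loads(dill.dumps(square))
print(restored(4))  # prints 16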

I'm new here, so there may be reasons lambda functions aren't supported, for example constraints from an upstream dependency. I just want to mention these libraries in case they haven't been considered before.

Regards


lores commented Jul 17, 2019

+1. Adding lambda support would allow identical syntax between the regular and parallel versions of apply, and in my experiments it was as simple as @slimtom95 says.

nalepae self-assigned this Jul 22, 2019
nalepae added the enhancement (New feature or request) label Jul 22, 2019
nalepae (Owner) commented Jul 22, 2019

Pandarallel uses concurrent.futures.ProcessPoolExecutor, which itself uses pickle, and lambda functions are not picklable...

A good option could be to use pathos, but as of today pathos does not support concurrent.futures: uqfoundation/pathos#90.
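A minimal sketch of the limitation described above (an illustrative example, not pandarallel code; `double` is a made-up name): ProcessPoolExecutor pickles the callable before shipping it to a worker, so a plain module-level function works while a lambda fails.

from concurrent.futures import ProcessPoolExecutor

def double(x):
    return 2 * x

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as executor:
        # A module-level function is picklable, so this works.
        print(list(executor.map(double, range(4))))  # [0, 2, 4, 6]

        # A lambda is not picklable; on recent Python versions the
        # pickling error surfaces when the results are collected
        # (exact behavior depends on the Python version).
        try:
            print(list(executor.map(lambda x: 2 * x, range(4))))
        except Exception as exc:
            print("lambda failed:", exc)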

nalepae (Owner) commented Jul 23, 2019

Pandarallel 1.3.0 now supports lambda functions.
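A minimal usage sketch, assuming the standard pandarallel.initialize() entry point; the DataFrame and its "ip_long" column are only illustrative. The lambda is passed exactly as it would be to map.

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()

df = pd.DataFrame({"ip_long": [1, 2, 3, 4]})

# The lambda is serialized and shipped to the worker processes.
print(df["ip_long"].parallel_map(lambda x: x * 2))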

nalepae closed this as completed Jul 23, 2019

lores commented Jul 23, 2019

Thanks! I've made it work with lambdas from Jupyter. One issue, though: does it take some sort of snapshot of the state of the code the first time it runs?
If I do

def myfunc(x):
    pass
df['ip_long'].parallel_map(lambda x: myfunc(x))

it works fine. Then if I re-run the cell with:

def myfunc2(x):
    pass
df['ip_long'].parallel_map(lambda x: myfunc2(x))

I get

RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/multiprocess/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/multiprocess/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandarallel/series.py", line 23, in worker_map
    res = getattr(series[chunk], map_func)(arg, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/core/series.py", line 3382, in map
    arg, na_action=na_action)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/core/base.py", line 1218, in _map_values
    new_values = map_f(values, mapper)
  File "pandas/_libs/lib.pyx", line 2217, in pandas._libs.lib.map_infer
  File "<ipython-input-16-96ab38ec54e8>", line 34, in <lambda>
NameError: name 'myfunc2' is not defined
"""

The above exception was the direct cause of the following exception:

NameError                                 Traceback (most recent call last)
<ipython-input-16-96ab38ec54e8> in <module>()
     32 
     33 
---> 34 df['ip_long'].parallel_map(lambda x: myfunc2(x))

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandarallel/utils.py in wrapper(*args, **kwargs)
     61             """Please see the docstring of this method without `parallel`"""
     62             try:
---> 63                 return func(*args, **kwargs)
     64 
     65             except _PlasmaStoreFull:

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandarallel/series.py in closure(data, arg, **kwargs)
     36 
     37             with ProcessingPool(nb_workers) as pool:
---> 38                 result_workers = pool.map(Series.worker_map, workers_args)
     39 
     40             result = pd.concat([

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pathos/multiprocessing.py in map(self, f, *args, **kwds)
    135         AbstractWorkerPool._AbstractWorkerPool__map(self, f, *args, **kwds)
    136         _pool = self._serve()
--> 137         return _pool.map(star(f), zip(*args)) # chunksize
    138     map.__doc__ = AbstractWorkerPool.map.__doc__
    139     def imap(self, f, *args, **kwds):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/multiprocess/pool.py in map(self, func, iterable, chunksize)
    264         in a list that is returned.
    265         '''
--> 266         return self._map_async(func, iterable, mapstar, chunksize).get()
    267 
    268     def starmap(self, func, iterable, chunksize=None):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/multiprocess/pool.py in get(self, timeout)
    642             return self._value
    643         else:
--> 644             raise self._value
    645 
    646     def _set(self, i, obj):

NameError: name 'myfunc2' is not defined
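The NameError suggests the worker processes resolve `myfunc2` by name in their own copy of __main__, which does not contain the newly defined helper. A hedged workaround sketch (not an official fix, and whether it suffices depends on how the worker pool is reused): bind the helper into the lambda so it is serialized along with it instead of being looked up by name in the worker.

def myfunc2(x):
    pass

# Binding myfunc2 as a default argument makes it part of the lambda's
# serialized state rather than a global name the worker must resolve.
df['ip_long'].parallel_map(lambda x, f=myfunc2: f(x))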


manfye commented Oct 5, 2022

Same error here.
