try dill to support lambda function? #22

Closed
slimtom95 opened this issue Apr 26, 2019 · 5 comments
Labels
enhancement New feature or request

Comments

slimtom95 commented Apr 26, 2019

Greetings.

I know pickle doesn't support lambda function serialization, but another serialization library, dill, does. There is also a multiprocessing library, multiprocess, which uses dill in place of pickle.
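For illustration, a minimal sketch of the difference (not taken from this issue; the `square` name is just an example): dill can round-trip a lambda that the standard pickle module rejects.

import pickle
import dill

square = lambda x: x * x

# The standard library pickle cannot serialize a lambda (the exact
# exception type varies across Python versions).
try:
    pickle.dumps(square)
except Exception as exc:
    print("pickle failed:", exc)

# dill serializes the function by value and restores a working copy.
restored = dill.loads(dill.dumps(square))
print(restored(4))  # prints 16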

I'm new here, so there may be reasons lambda functions aren't supported, for example constraints from an upstream dependency. I just want to mention these libraries in case they haven't been considered before.

Regards


lores commented Jul 17, 2019

+1. Adding lambda support would allow identical syntax between the regular and parallel versions of apply, and in my experiments it was as simple as @slimtom95 says.

nalepae self-assigned this Jul 22, 2019
nalepae added the enhancement (New feature or request) label Jul 22, 2019
nalepae (Owner) commented Jul 22, 2019

Pandarallel uses concurrent.futures.ProcessPoolExecutor, which itself uses pickle, and lambda functions are not picklable...

A good option could be to use pathos, but as of today pathos does not support concurrent.futures: uqfoundation/pathos#90.
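A minimal sketch of the limitation described above (an illustrative example, not pandarallel code; `double` is a made-up name): ProcessPoolExecutor pickles the callable before shipping it to a worker, so a plain module-level function works while a lambda fails.

from concurrent.futures import ProcessPoolExecutor

def double(x):
    return 2 * x

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as executor:
        # A module-level function is picklable, so this works.
        print(list(executor.map(double, range(4))))  # [0, 2, 4, 6]

        # A lambda is not picklable; on recent Python versions the
        # pickling error surfaces when the results are collected
        # (exact behavior depends on the Python version).
        try:
            print(list(executor.map(lambda x: 2 * x, range(4))))
        except Exception as exc:
            print("lambda failed:", exc)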

nalepae (Owner) commented Jul 23, 2019

Pandarallel 1.3.0 now supports lambda functions.
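A minimal usage sketch, assuming the standard pandarallel.initialize() entry point; the DataFrame and its "ip_long" column are only illustrative. The lambda is passed exactly as it would be to map.

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()

df = pd.DataFrame({"ip_long": [1, 2, 3, 4]})

# The lambda is serialized and shipped to the worker processes.
print(df["ip_long"].parallel_map(lambda x: x * 2))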

nalepae closed this as completed Jul 23, 2019

lores commented Jul 23, 2019

Thanks! I've made it work with lambdas from Jupyter. One issue, though: does it take some sort of snapshot of the state of the code the first time it runs?
If I do

def myfunc(x):
    pass
df['ip_long'].parallel_map(lambda x: myfunc(x))

it works fine. Then if I re-run the cell with:

def myfunc2(x):
    pass
df['ip_long'].parallel_map(lambda x: myfunc2(x))

I get

RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/multiprocess/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/multiprocess/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandarallel/series.py", line 23, in worker_map
    res = getattr(series[chunk], map_func)(arg, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/core/series.py", line 3382, in map
    arg, na_action=na_action)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/core/base.py", line 1218, in _map_values
    new_values = map_f(values, mapper)
  File "pandas/_libs/lib.pyx", line 2217, in pandas._libs.lib.map_infer
  File "<ipython-input-16-96ab38ec54e8>", line 34, in <lambda>
NameError: name 'myfunc2' is not defined
"""

The above exception was the direct cause of the following exception:

NameError                                 Traceback (most recent call last)
<ipython-input-16-96ab38ec54e8> in <module>()
     32 
     33 
---> 34 df['ip_long'].parallel_map(lambda x: myfunc2(x))

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandarallel/utils.py in wrapper(*args, **kwargs)
     61             """Please see the docstring of this method without `parallel`"""
     62             try:
---> 63                 return func(*args, **kwargs)
     64 
     65             except _PlasmaStoreFull:

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandarallel/series.py in closure(data, arg, **kwargs)
     36 
     37             with ProcessingPool(nb_workers) as pool:
---> 38                 result_workers = pool.map(Series.worker_map, workers_args)
     39 
     40             result = pd.concat([

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pathos/multiprocessing.py in map(self, f, *args, **kwds)
    135         AbstractWorkerPool._AbstractWorkerPool__map(self, f, *args, **kwds)
    136         _pool = self._serve()
--> 137         return _pool.map(star(f), zip(*args)) # chunksize
    138     map.__doc__ = AbstractWorkerPool.map.__doc__
    139     def imap(self, f, *args, **kwds):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/multiprocess/pool.py in map(self, func, iterable, chunksize)
    264         in a list that is returned.
    265         '''
--> 266         return self._map_async(func, iterable, mapstar, chunksize).get()
    267 
    268     def starmap(self, func, iterable, chunksize=None):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/multiprocess/pool.py in get(self, timeout)
    642             return self._value
    643         else:
--> 644             raise self._value
    645 
    646     def _set(self, i, obj):

NameError: name 'myfunc2' is not defined
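The NameError suggests the worker processes resolve `myfunc2` by name in their own copy of __main__, which does not contain the newly defined helper. A hedged workaround sketch (not an official fix, and whether it suffices depends on how the worker pool is reused): bind the helper into the lambda so it is serialized along with it instead of being looked up by name in the worker.

def myfunc2(x):
    pass

# Binding myfunc2 as a default argument makes it part of the lambda's
# serialized state rather than a global name the worker must resolve.
df['ip_long'].parallel_map(lambda x, f=myfunc2: f(x))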


manfye commented Oct 5, 2022

Same error here.
