dask_cudf, when OOM or illegal access, hangs #6279

Closed
pseudotensor opened this issue Oct 22, 2020 · 4 comments

@pseudotensor
Contributor

See #6232 for setup details.

Running dask_cudf in a way very similar to rapidsai/ucx-py#655.

illegal.txt.zip

fragment:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  device free failed: an illegal memory access was encountered
*** Aborted
Register dump:

 RAX: 0000000000000000   RBX: 000056407911a540   RCX: 000014e7e3ca7f47
 RDX: 0000000000000000   RSI: 000014e739694620   RDI: 0000000000000002
 RBP: 000014e5bc063738   R8 : 0000000000000000   R9 : 000014e739694620
 R10: 0000000000000008   R11: 0000000000000246   R12: 000014e6552a0aa0
 R13: 0000000000000000   R14: 000014e5dbdc8933   R15: 0000000000000000
 RSP: 000014e739694620

 RIP: 000014e7e3ca7f47   EFLAGS: 00000246

 CS: 0033   FS: 0000   GS: 0000

 Trap: 0000000e   Error: 00000004   OldMask: 00000004   CR2: 0109e008

 FPUCW: 0000037f   FPUSW: 00000420   TAG: 000014e7
 RIP: a5d5306c   RDP: 00000000

 ST(0) ffff 8000000000000000   ST(1) 0000 0000000000000000
 ST(2) 0000 0000000000000000   ST(3) ffff f800000000000000
 ST(4) ffff 81ceb32c4b43fcf5   ST(5) ffff 8000000000000000
 ST(6) ffff 8000000000000000   ST(7) 8000 8000000000000000
 mxcsr: 1fa0
 XMM0:  000000000000000000000000ffffffff XMM1:  000000000000000000000000ffffffff
 XMM2:  000000000000000000000000ffffffff XMM3:  000000000000000000000000ffffffff
 XMM4:  000000000000000000000000ffffffff XMM5:  000000000000000000000000ffffffff
 XMM6:  000000000000000000000000ffffffff XMM7:  000000000000000000000000ffffffff
 XMM8:  000000000000000000000000ffffffff XMM9:  000000000000000000000000ffffffff
 XMM10: 000000000000000000000000ffffffff XMM11: 000000000000000000000000ffffffff
 XMM12: 000000000000000000000000ffffffff XMM13: 000000000000000000000000ffffffff
 XMM14: 000000000000000000000000ffffffff XMM15: 000000000000000000000000ffffffff

Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x14e7e3ca7f47]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x14e7e3ca98b1]
/home/jon/minicondadai/lib/python3.6/site-packages/cupy/core/../../../../libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xbc)[0x14e7a5d3984a]
/home/jon/minicondadai/lib/python3.6/site-packages/cupy/core/../../../../libstdc++.so.6(+0xabf47)[0x14e7a5d37f47]
/home/jon/minicondadai/lib/python3.6/site-packages/cupy/core/../../../../libstdc++.so.6(+0xab3a5)[0x14e7a5d373a5]
/home/jon/minicondadai/lib/python3.6/site-packages/cupy/core/../../../../libstdc++.so.6(__gxx_personality_v0+0x348)[0x14e7a5d37bd8]
/home/jon/minicondadai/lib/python3.6/site-packages/numpy/core/../../../.././libgcc_s.so.1(+0xcadc)[0x14e7dfa56adc]
/home/jon/minicondadai/lib/python3.6/site-packages/numpy/core/../../../.././libgcc_s.so.1(_Unwind_RaiseException+0xe6)[0x14e7dfa56dda]
/home/jon/minicondadai/lib/python3.6/site-packages/cupy/core/../../../../libstdc++.so.6(__cxa_throw+0x42)[0x14e7a5d3814d]
/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x3d7dd0)[0x14e5c427fdd0]
/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x3d88c6)[0x14e5c42808c6]
/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost16HostDeviceVectorIfED1Ev+0x89)[0x14e5c4283469]
/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost6common13HistogramCutsD1Ev+0x11)[0x14e5c404eba1]
/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost15EllpackPageImplC2EPNS_7DMatrixERKNS_10BatchParamE+0x555)[0x14e5c42abf45]
/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost11EllpackPageC2EPNS_7DMatrixERKNS_10BatchParamE+0x2e)[0x14e5c42ac01e]
/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost4data13SimpleDMatrix17GetEllpackBatchesERKNS_10BatchParamE+0xa5)[0x14e5c40c3375]
/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost4tree23GPUHistMakerSpecialisedINS_6detail20GradientPairInternalIdEEE12InitDataOnceEPNS_7DMatrixE+0x1af)[0x14e5c43b116f]
/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost4tree23GPUHistMakerSpecialisedINS_6detail20GradientPairInternalIdEEE6UpdateEPNS_16HostDeviceVectorINS3_IfEEEEPNS_7DMatrixERKSt6vectorIPNS_7RegTreeESaISE_EE+0x26b)[0x14e5c43ba01b]
/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost3gbm6GBTree13BoostNewTreesEPNS_16HostDeviceVectorINS_6detail20GradientPairInternalIfEEEEPNS_7DMatrixEiPSt6vectorISt10unique_ptrINS_7RegTreeESt14default_deleteISC_EESaISF_EE+0xc0d)[0x14e5c40f980d]
/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost3gbm6GBTree7DoBoostEPNS_7DMatrixEPNS_16HostDeviceVectorINS_6detail20GradientPairInternalIfEEEEPNS_20PredictionCacheEntryE+0x10c)[0x14e5c40fb0ec]
/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost11LearnerImpl13UpdateOneIterEiSt10shared_ptrINS_7DMatrixEE+0x39d)[0x14e5c413337d]
/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x54)[0x14e5c402cac4]
/home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c)[0x14e7e253d630]
/home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d)[0x14e7e253cfed]
/home/jon/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce)[0x14e7e2553f9e]
/home/jon/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x139d5)[0x14e7e25549d5]
dask-worker [ucx://127.0.0.1:50751](_PyObject_FastCallDict+0x8b)[0x564078f0b00b]
dask-worker [ucx://127.0.0.1:50751](+0x1a179e)[0x564078f9979e]

This happens when I use dask_cudf and fit over and over again; all of those fits work. Then one more fit in a slightly different Python context (same fork/thread, though) leads to this. It doesn't always happen, and I'll try to make an MRE, but maybe something is clear from the backtrace.

@trivialfis
Member

That's a really weird backtrace. From xgboost to cupy to numpy then to cupy. And from libstdc++ to libgcc then back to libstdc++ ..

@pseudotensor
Contributor Author

pseudotensor commented Oct 23, 2020

> That's a really weird backtrace. From xgboost to cupy to numpy then to cupy. And from libstdc++ to libgcc then back to libstdc++ ..

Yes, I noticed that too, but didn't know whether it was odd. I would guess it is a cupy issue, just doing some super basic numpy things that don't use CPU data, but I'm not sure.

@trivialfis
Member

To avoid hanging, the best approach is to fix the segfault itself; a proper Python exception is fine and should not lead to a hang. Another option is to let RABIT detect whether the current allreduce is consistent with the rest of the workers, which is quite difficult to implement at the moment.
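
To make the distinction between a Python exception and a native abort concrete, here is a minimal sketch that involves neither xgboost nor dask. The two snippets and the `run_step` helper are hypothetical stand-ins. One child interpreter raises an ordinary exception and the other calls `os.abort()`, which roughly mimics the `terminate called ... *** Aborted` path in the trace above. The sketch assumes a POSIX system and Python 3.7+.

```python
import subprocess
import sys

# Hypothetical stand-ins for a training step; neither touches xgboost.
RAISES = "raise RuntimeError('simulated out of memory')"
ABORTS = "import os; os.abort()"  # the process dies without unwinding Python frames


def run_step(snippet, timeout=30):
    """Run `snippet` in a child interpreter and classify how it ended."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child neither finished nor died within the deadline.
        return "timeout"
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX a negative return code means the child was killed by a signal.
        return "killed by signal {}".format(-proc.returncode)
    # An uncaught Python exception makes the interpreter exit with a positive status.
    return "python exception"


if __name__ == "__main__":
    print(run_step(RAISES))
    print(run_step(ABORTS))
```

In the exception case the child still shuts down in an orderly way, so whoever is waiting on it can be told what went wrong. In the abort case the process simply disappears. A peer that is blocked in a collective call and has no exit code or timeout to watch would presumably keep waiting, which is consistent with the hang reported here.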

@trivialfis
Member

I don't think we can handle a segfault with fault tolerance. If you have a specific example of a segfault, please share it; we will do our best to address it.
