problem with Tesla V100 #130

Open
ZimingLu opened this issue Oct 15, 2018 · 2 comments

Comments

@ZimingLu

I am trying to run some tests on a Tesla V100 and ran into the error shown below.
The error does not occur on a 1080 Ti, so I wonder whether it is specific to the Tesla V100. I hope someone can help me solve it. Thanks a lot!

terminate called after throwing an instance of 'dmlc::Error'
what(): [11:50:29] src/engine/./threaded_engine.h:379: Error: compute_ctc_loss, stat = execution failed
A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 9 entries:
[bt] (0) python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace()+0x3d) [0x7f651b5c354d]
[bt] (1) python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x1a) [0x7f651b5c39da]
[bt] (2) python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xb26) [0x7f651e5536a6]
[bt] (3) python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xd3) [0x7f651e5656d3]
[bt] (4) python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>)+0x3e) [0x7f651e56590e]
[bt] (5) python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x3b) [0x7f651e55283b]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb6970) [0x7f65b9ddd970]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8064) [0x7f65ceddf064]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f65ce1f163d]
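
For anyone who wants to follow the hint in the error message above, here is a minimal sketch of one way to rerun a script with MXNET_ENGINE_TYPE set to NaiveEngine. The helper name `naive_engine_env` and the script name `train.py` are made up for illustration. The sketch assumes the variable has to be in the environment before MXNet starts, which is why it builds the environment for a child process rather than changing it after import.

```python
import os


def naive_engine_env(base=None):
    """Return a copy of an environment mapping with MXNET_ENGINE_TYPE=NaiveEngine.

    `base` defaults to the current process environment. The input mapping is
    not modified, so the setting only applies to whatever process is started
    with the returned mapping.
    """
    env = dict(os.environ if base is None else base)
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    return env


# Hypothetical usage ("train.py" stands in for whatever script fails):
#
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"], env=naive_engine_env())
```

Because only the child process sees the variable, nothing needs to be reset afterwards, which covers the message's reminder to set MXNET_ENGINE_TYPE back to empty after debugging.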

@ThomasDelteil

I think I have the same problem: I can't run the GPU tests after building with CUDA 9.2.

./test_gpu
Running GPU tests
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error: compute_ctc_loss in small_test, stat = execution failed
Aborted (core dumped)

@ThomasDelteil

The fix is here: #118
