Crash tracker spawning long-running child processes? #3954

luc-financeit · 2024-09-25T15:17:29Z

Current behaviour

Upgrading the datadog gem from 2.2.0 to 2.3.0 causes a spec in our test suite to hang. The code being tested does something like

6.times do
  Process.fork { ... }
end

Process.waitall # Hangs

Looking at ps output, there's a libdatadog-crashtracking-receiver child process which our waitall is hung up on — manually killing the crashtracker process, or disabling the crashtracker entirely (via DD_CRASHTRACKING_ENABLED) allows the test to complete successfully.

F   UID     PID    PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
4     0     328       0  20   0  16460  7680 -      Ss   ?          0:00 bash -l
4     0     557     328  20   0 190964 29656 -      Sl   ?          0:00  \_ ruby /usr/local/rvm/gems/ruby-3.1.6/bin/parallel_rspec --verbose spec
4     0     559     557  20   0 1414052 518752 -    Sl   ?          1:18      \_ ruby bin/rspec -O .rspec_parallel spec/the_hanging_spec.rb
0     0     560     559  20   0  24128  3844 -      S    ?          0:00          \_ /usr/local/rvm/gems/ruby-3.1.6/gems/libdatadog-11.0.0.1.0-x86_64-linux/vendor/libdatadog-11.0.0/x86_64-linux/libdatadog-x86_64-unknown-linux-gnu/bin/libdatadog-crashtracking-receiver

Expected behaviour

No hang.

Steps to reproduce

I'm afraid I don't have a stand-alone reproduction — the actual code is significantly more complicated, and I haven't tried slimming it down to see if there's additional triggers. I can dig in to it more if the above isn't enough to go on.

Environment

datadog version: 2.3.0
Configuration block (Datadog.configure ...): See gist.
Ruby version: 3.1.6
Operating system: Amazon Linux 2
Relevant library versions: N/A

The text was updated successfully, but these errors were encountered:

anmarchenko · 2024-09-26T06:12:07Z

Hi @luc-financeit! Thanks for reporting this and sorry you've run into this issue.

You are absolutely correct, the crashtracker starts an external process that sends out crash report: if Ruby VM crashes, it is unable to send the report to us, so we need an additional process to send out these reports. I think there is really no other way of doing this.

Given that this is not critical for gem datadog or datadog-ci to work correctly, disabling crashtracker via DD_CRASHTRACKING_ENABLED=0 looks like a sensible thing to do. Would this workaround work for you?

delner · 2024-09-27T01:35:28Z

Thanks for reporting @luc-financeit ! And to @anmarchenko for a quick response and work around.

I'm sharing this with our team, see if we can diagnose/plan on what to do about this. We will share updates here as we have them.

TonyCTHsu · 2024-09-27T06:40:56Z

👋 @luc-financeit Thanks for reporting.

I can reproduce it with

require "bundler/inline"
gemfile { gem "datadog", "2.3.0", source: "https://rubygems.org" }

require "datadog"
Datadog.configure {}

Process.waitall

Indeed, Process.waitall is hanging when crashtracking is enabled. In addition, I have found that this behaviour does not applied to other of methods from Process (.wait, wait2, waitpid and waitpid2) when provided with specific pid.

For example:

pid = Process.fork do
  puts Process.pid
end

Process.wait(pid)

We will be looking into this issue and I suggest circumvent this by setting DD_CRASHTRACKING_ENABLED=false temporarily in your test suite.

Crashtracking has been identified as causing issues with `ECHILD` and `waitpid`, and possibly anything involving `SIGCHLD` with obscure failure modes of hard to anticipate consequences. Disable by default as a cautionary measure. Mitigates #3954

p-datadog · 2024-10-03T15:07:25Z

waitall waits for all child processes, and will attempt to wait for child processes spawned by crash tracker.

It is convenient to use but does make the assumption that there aren't any processes spawned in the background by dependent libraries or the core runtime. A more robust solution is to keep track of pids for processes spawned by the application and wait for them explicitly using one of the other wait* methods.

luc-financeit · 2024-10-03T15:25:02Z

Sure, I agree that our code doing waitall is questionable, for several reasons. But this was still frustrating to debug, given that there is, as far as I can tell, zero documentation about the crash tracker — even the env var to turn it off isn't documented in the additional configuration section? (I should also note that turning it off via code-based configuration didn't seem to work for us — if I had to guess, probably similar load-order reasons as startup logs.)

"Don't use waitall" or "turn off the crash tracker if you do" are both reasonable solutions, but perhaps that tradeoff can be documented somewhere?

delner · 2024-10-04T14:20:22Z

Hey @luc-financeit : we're going to disable this crash tracker feature by default until we can test and deploy a suitable alternative. Look for this in our next release. Sorry for the frustration and thanks for your patience!

Crashtracking has been identified as causing issues with `ECHILD` and `waitpid`, and possibly anything involving `SIGCHLD` with obscure failure modes of hard to anticipate consequences. Disable by default as a cautionary measure. Mitigates #3954

luc-financeit added bug Involves a bug community Was opened by a community member labels Sep 25, 2024

TonyCTHsu self-assigned this Sep 27, 2024

TonyCTHsu linked a pull request Sep 27, 2024 that will close this issue

[WIP] Fix Process.waitall hanging when crashtracking enabled #3959

Closed

lloeki mentioned this issue Oct 3, 2024

Disable crashtracking by default #3970

Merged

lloeki mentioned this issue Oct 9, 2024

Fix crashtracking specs #3983

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crash tracker spawning long-running child processes? #3954

Crash tracker spawning long-running child processes? #3954

luc-financeit commented Sep 25, 2024

anmarchenko commented Sep 26, 2024

delner commented Sep 27, 2024

TonyCTHsu commented Sep 27, 2024 •

edited

Loading

p-datadog commented Oct 3, 2024

luc-financeit commented Oct 3, 2024

delner commented Oct 4, 2024

Crash tracker spawning long-running child processes? #3954

Crash tracker spawning long-running child processes? #3954

Comments

luc-financeit commented Sep 25, 2024

anmarchenko commented Sep 26, 2024

delner commented Sep 27, 2024

TonyCTHsu commented Sep 27, 2024 • edited Loading

p-datadog commented Oct 3, 2024

luc-financeit commented Oct 3, 2024

delner commented Oct 4, 2024

TonyCTHsu commented Sep 27, 2024 •

edited

Loading