Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash tracker spawning long-running child processes? #3954

Open
luc-financeit opened this issue Sep 25, 2024 · 6 comments
Open

Crash tracker spawning long-running child processes? #3954

luc-financeit opened this issue Sep 25, 2024 · 6 comments
Assignees
Labels
bug Involves a bug community Was opened by a community member

Comments

@luc-financeit
Copy link

Current behaviour

Upgrading the datadog gem from 2.2.0 to 2.3.0 causes a spec in our test suite to hang. The code being tested does something like

6.times do
  Process.fork { ... }
end

Process.waitall # Hangs

Looking at ps output, there's a libdatadog-crashtracking-receiver child process which our waitall is hung up on — manually killing the crashtracker process, or disabling the crashtracker entirely (via DD_CRASHTRACKING_ENABLED) allows the test to complete successfully.

F   UID     PID    PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
4     0     328       0  20   0  16460  7680 -      Ss   ?          0:00 bash -l
4     0     557     328  20   0 190964 29656 -      Sl   ?          0:00  \_ ruby /usr/local/rvm/gems/ruby-3.1.6/bin/parallel_rspec --verbose spec
4     0     559     557  20   0 1414052 518752 -    Sl   ?          1:18      \_ ruby bin/rspec -O .rspec_parallel spec/the_hanging_spec.rb
0     0     560     559  20   0  24128  3844 -      S    ?          0:00          \_ /usr/local/rvm/gems/ruby-3.1.6/gems/libdatadog-11.0.0.1.0-x86_64-linux/vendor/libdatadog-11.0.0/x86_64-linux/libdatadog-x86_64-unknown-linux-gnu/bin/libdatadog-crashtracking-receiver 

Expected behaviour

No hang.

Steps to reproduce

I'm afraid I don't have a stand-alone reproduction — the actual code is significantly more complicated, and I haven't tried slimming it down to see if there's additional triggers. I can dig in to it more if the above isn't enough to go on.

Environment

  • datadog version: 2.3.0
  • Configuration block (Datadog.configure ...): See gist.
  • Ruby version: 3.1.6
  • Operating system: Amazon Linux 2
  • Relevant library versions: N/A
@luc-financeit luc-financeit added bug Involves a bug community Was opened by a community member labels Sep 25, 2024
@anmarchenko
Copy link
Member

Hi @luc-financeit! Thanks for reporting this and sorry you've run into this issue.

You are absolutely correct, the crashtracker starts an external process that sends out crash report: if Ruby VM crashes, it is unable to send the report to us, so we need an additional process to send out these reports. I think there is really no other way of doing this.

Given that this is not critical for gem datadog or datadog-ci to work correctly, disabling crashtracker via DD_CRASHTRACKING_ENABLED=0 looks like a sensible thing to do. Would this workaround work for you?

@delner
Copy link
Contributor

delner commented Sep 27, 2024

Thanks for reporting @luc-financeit ! And to @anmarchenko for a quick response and work around.

I'm sharing this with our team, see if we can diagnose/plan on what to do about this. We will share updates here as we have them.

@TonyCTHsu
Copy link
Contributor

TonyCTHsu commented Sep 27, 2024

👋 @luc-financeit Thanks for reporting.

I can reproduce it with

require "bundler/inline"
gemfile { gem "datadog", "2.3.0", source: "https://rubygems.org" }

require "datadog"
Datadog.configure {}

Process.waitall

Indeed, Process.waitall is hanging when crashtracking is enabled. In addition, I have found that this behaviour does not applied to other of methods from Process (.wait, wait2, waitpid and waitpid2) when provided with specific pid.

For example:

pid = Process.fork do
  puts Process.pid
end

Process.wait(pid)

We will be looking into this issue and I suggest circumvent this by setting DD_CRASHTRACKING_ENABLED=false temporarily in your test suite.

@TonyCTHsu TonyCTHsu self-assigned this Sep 27, 2024
lloeki added a commit that referenced this issue Oct 3, 2024
Crashtracking has been identified as causing issues with `ECHILD` and
`waitpid`, and possibly anything involving `SIGCHLD` with obscure
failure modes of hard to anticipate consequences.

Disable by default as a cautionary measure.

Mitigates #3954
lloeki added a commit that referenced this issue Oct 3, 2024
Crashtracking has been identified as causing issues with `ECHILD` and
`waitpid`, and possibly anything involving `SIGCHLD` with obscure
failure modes of hard to anticipate consequences.

Disable by default as a cautionary measure.

Mitigates #3954
@p-datadog
Copy link
Contributor

waitall waits for all child processes, and will attempt to wait for child processes spawned by crash tracker.

It is convenient to use but does make the assumption that there aren't any processes spawned in the background by dependent libraries or the core runtime. A more robust solution is to keep track of pids for processes spawned by the application and wait for them explicitly using one of the other wait* methods.

@luc-financeit
Copy link
Author

Sure, I agree that our code doing waitall is questionable, for several reasons. But this was still frustrating to debug, given that there is, as far as I can tell, zero documentation about the crash tracker — even the env var to turn it off isn't documented in the additional configuration section? (I should also note that turning it off via code-based configuration didn't seem to work for us — if I had to guess, probably similar load-order reasons as startup logs.)

"Don't use waitall" or "turn off the crash tracker if you do" are both reasonable solutions, but perhaps that tradeoff can be documented somewhere?

@delner
Copy link
Contributor

delner commented Oct 4, 2024

Hey @luc-financeit : we're going to disable this crash tracker feature by default until we can test and deploy a suitable alternative. Look for this in our next release. Sorry for the frustration and thanks for your patience!

lloeki added a commit that referenced this issue Oct 9, 2024
Crashtracking has been identified as causing issues with `ECHILD` and
`waitpid`, and possibly anything involving `SIGCHLD` with obscure
failure modes of hard to anticipate consequences.

Disable by default as a cautionary measure.

Mitigates #3954
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Involves a bug community Was opened by a community member
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants