
Fix corner case of mismatching GC_suspend_ack_sem value and n_live_threads #631

Closed

Conversation

delguoqing
Contributor

The cause of this problem

This is a corner case I found on Android, but it applies to other platforms as well, as long as their threads can receive signals other than the GC's own and be paused by them.

When the user or the system wants to create a bugreport, it sends a SIGQUIT to each process. Because Android apps are forked from zygote, they all have a Signal Catcher thread that handles SIGQUIT. The Signal Catcher thread in turn sends SIGRT_1 to all of the other threads to pause them, gets their ucontext, and then does a stack unwind.

1. When GC_stopping_thread has finished collecting and wants to restart the world, it first increments GC_stop_count by one.

2. Just before GC_stopping_thread has a chance to raise signals to the other threads by calling GC_restart_all, those threads might be awakened by a signal other than GC_sig_thr_restart (in my case, SIGRT_1). Because of the updated value of GC_stop_count, the following loop condition no longer holds:

  do {
      sigsuspend(&suspend_handler_mask);
      /* Iterate while not restarting the world or thread is suspended. */
  } while (ao_load_acquire_async(&GC_stop_count) == my_stop_count
#          ifdef GC_ENABLE_SUSPEND_THREAD
             || ((suspend_cnt & 1) != 0
                 && (word)ao_load_async(&(me -> ext_suspend_cnt))
                    == suspend_cnt)
#          endif
          );
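
The key point is that sigsuspend returns whenever any caught, unblocked signal is delivered, not only GC_sig_thr_restart. Here is a minimal standalone sketch (my own illustration, not bdwgc code; SIGUSR1 and SIGUSR2 stand in for GC_sig_thr_restart and Android's SIGRT_1):

  #include <signal.h>
  #include <stdio.h>

  static volatile sig_atomic_t got_sig;

  static void handler(int sig) { got_sig = sig; }

  int main(void) {
    struct sigaction sa;
    sigset_t block_all, wait_mask;

    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGUSR1, &sa, NULL); /* stands in for GC_sig_thr_restart */
    sigaction(SIGUSR2, &sa, NULL); /* stands in for Android's SIGRT_1  */

    sigfillset(&block_all);
    sigprocmask(SIG_SETMASK, &block_all, NULL);
    sigemptyset(&wait_mask); /* everything unblocked while waiting */

    raise(SIGUSR2);          /* the "wrong" signal; pending while blocked */
    sigsuspend(&wait_mask);  /* returns as soon as SIGUSR2 is delivered,
                                even though we meant to wait for SIGUSR1 */
    printf("woken by signal %d\n", got_sig);
    return 0;
  }

Once sigsuspend returns this way, the loop re-checks GC_stop_count, which was already incremented in step 1, so the condition fails and the thread leaves the loop even though no restart signal was ever sent to it.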

Such threads can then go straight to the end of the suspend handler and update last_stop_count:

  {
    /* If the RESTART signal loss is possible (though it should be      */
    /* less likely than losing the SUSPEND signal as we do not do       */
    /* much between the first sem_post and sigsuspend calls), more      */
    /* handshaking is provided to work around it.                       */
    sem_post(&GC_suspend_ack_sem);
    /* Set the flag that the thread has been restarted. */
    if (GC_retry_signals)
      ao_store_release_async(&(me -> last_stop_count),
                             my_stop_count | THREAD_RESTARTED);
  }
  RESTORE_CANCEL(cancel_state);
3. GC_stopping_thread then continues and calls GC_restart_all in order to send GC_sig_thr_restart to all of the other threads. However, it sees that some threads have already updated their last_stop_count and decides that it will not raise the signal to them "again". So n_live_threads does not count the threads that ran too fast in step 2:
    for (i = 0; i < THREAD_TABLE_SZ; i++) {
      for (p = GC_threads[i]; p != NULL; p = p -> tm.next) {
        if (!THREAD_EQUAL(p -> id, self)) {
          if ((p -> flags & (FINISHED | DO_BLOCKING)) != 0) continue;
#         ifdef GC_ENABLE_SUSPEND_THREAD
              if ((p -> ext_suspend_cnt & 1) != 0) continue;
#         endif
          if (GC_retry_signals
                && AO_load(&(p -> last_stop_count)) == GC_stop_count)
              continue; /* The thread has been restarted. */
          n_live_threads++;
#         ifdef DEBUG_THREADS
            GC_log_printf("Sending restart signal to %p\n", (void *)p->id);
#         endif
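          /* (excerpt truncated; the restart signal is then raised to p) */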
4. In the end, n_live_threads can be smaller than the number of posts made to GC_suspend_ack_sem. So we call sem_wait only n_live_threads times, leaving GC_suspend_ack_sem with a non-zero value after this round of GC.
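
As a toy model of that arithmetic (plain C, not bdwgc code; the counts are invented for illustration):

  #include <semaphore.h>
  #include <stdio.h>

  int main(void) {
    sem_t ack;
    int i, leftover = 0;
    int posted = 5;         /* threads that reached sem_post in the handler */
    int n_live_threads = 3; /* threads counted by GC_restart_all: two were
                               skipped because their last_stop_count was
                               already updated (the fast threads of step 2) */

    sem_init(&ack, 0, 0);
    for (i = 0; i < posted; i++) sem_post(&ack);
    for (i = 0; i < n_live_threads; i++) sem_wait(&ack);

    sem_getvalue(&ack, &leftover);
    printf("leftover ack count: %d\n", leftover); /* prints 2, not 0 */
    sem_destroy(&ack);
    return 0;
  }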

My fix for this problem

I suggest that we raise the signal to all of the suspended threads on the first pass, when we are not actually retrying.
I added a boolean argument to GC_suspend_all/GC_restart_all that indicates whether we are actually doing the retry. If not, we ignore last_stop_count and always raise the signal to the suspended threads.
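
A rough sketch of the idea against the GC_restart_all excerpt quoted above (the GC_bool parameter name is mine; the actual patch may differ in details):

  GC_INNER void GC_restart_all(GC_bool is_retry)
  {
    /* ... same loop over GC_threads as quoted above ... */
          /* Trust the last_stop_count shortcut only when actually      */
          /* retrying: on the first pass, a thread woken by an          */
          /* unrelated signal may have updated last_stop_count already, */
          /* yet its sem_post still has to be matched by a sem_wait,    */
          /* so it must be counted (and harmlessly signaled) here.      */
          if (GC_retry_signals && is_retry
                && AO_load(&(p -> last_stop_count)) == GC_stop_count)
              continue; /* The thread has been restarted. */
          n_live_threads++;
    /* ... */
  }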
