Fix corner case of mismatching GC_suspend_ack_sem value and n_live_threads #631
The cause of this problem
This is a corner case I found on Android, but it also applies to any other platform where threads may receive signals that pause them.
When the user or the system wants to create a bug report, it sends SIGQUIT to each process. Because Android apps are forked from zygote, they all have a Signal Catcher thread that handles SIGQUIT. The Signal Catcher thread in turn sends SIGRT_1 to all of the other threads to pause them, gets the ucontext, and then does a stack unwind.
1. When the GC_stopping_thread has finished collecting and wants to restart the world, it first changes GC_stop_count to GC_stop_count + 1.
2. Just before the GC_stopping_thread has a chance to raise signals to the other threads by calling GC_restart_all, those threads might be awakened by a signal other than GC_sig_thr_restart (in my case SIGRT_1). Because GC_stop_count has already been updated, the loop condition in the suspend handler no longer holds.
The threads can then go straight to the end of the suspend_handler and update last_stop_count.
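The race can be illustrated with a small single-threaded model. This is only a sketch under assumptions: the struct names, the `world_is_stopped` flag, and the exact value written to `last_stop_count` are hypothetical stand-ins, not the collector's actual code. It shows that once the stop count has been bumped, any wake-up, not just the restart signal, lets a thread fall out of its wait loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state described above; not the bdwgc source. */
typedef struct {
    unsigned stop_count;     /* models GC_stop_count */
    bool world_is_stopped;   /* assumed "world is stopped" flag */
} gc_model;

typedef struct {
    unsigned my_stop_count;   /* stop count seen when the thread suspended */
    unsigned last_stop_count; /* models the per-thread last_stop_count */
    bool waiting;             /* true while the thread is in the wait loop */
} thread_model;

/* The thread enters the suspend handler and starts waiting. */
void model_suspend(thread_model *t, const gc_model *gc)
{
    t->my_stop_count = gc->stop_count;
    t->waiting = true;
}

/* Models one return from the signal wait. ANY delivered signal causes
   this, not only the restart signal. Returns true if the thread leaves
   the loop and finishes the handler. */
bool model_wakeup(thread_model *t, const gc_model *gc)
{
    bool keep_waiting = gc->world_is_stopped
                        && gc->stop_count == t->my_stop_count;
    if (keep_waiting)
        return false;
    /* Assumption: the handler records the current count on exit. */
    t->last_stop_count = gc->stop_count;
    t->waiting = false;
    return true;
}
```

In this model, a wake-up before step 1 keeps the thread waiting, while the same wake-up after `stop_count` is incremented releases it even though no restart signal was ever sent.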
My fix for this problem
I suggest that we raise the signal to all of the threads on the first attempt, that is, when we are not actually retrying.
I added a boolean argument to GC_suspend_all/GC_restart_all to indicate whether we are actually retrying. If not, we ignore last_stop_count and always raise signals to the threads.
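A minimal sketch of the idea follows, using a hypothetical `restart_all_model` function and `model_thread` struct rather than the real GC_restart_all signature. The assumption is that a thread whose `last_stop_count` already equals the current stop count is skipped only on a retry pass; on the first pass every thread is signalled, so the returned count matches the number of acknowledgements to wait for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-thread record; not the collector's real structure. */
typedef struct {
    unsigned last_stop_count; /* models the per-thread last_stop_count */
    int signals_received;     /* counts signals "raised" to this thread */
} model_thread;

/* Models the restart pass. Returns how many threads were signalled,
   which stands in for n_live_threads. */
int restart_all_model(model_thread *threads, size_t n,
                      unsigned stop_count, bool is_retry)
{
    int n_live = 0;
    for (size_t i = 0; i < n; i++) {
        /* Only a retry pass trusts last_stop_count to skip threads. */
        if (is_retry && threads[i].last_stop_count == stop_count)
            continue;
        threads[i].signals_received++;
        n_live++;
    }
    return n_live;
}
```

With two threads, one of which was woken early and already recorded the new stop count, a first pass (`is_retry` false) signals both, while a retry pass signals only the one that has not yet caught up.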