draft: implement the prepare fiber canceling #2126
Closed
When applying a clusterwide config in a highly loaded cluster, a timeout error often occurs.
In some cases, the error is caused by a timeout at the prepare phase on some instance.
cartridge/cartridge/twophase.lua
Lines 467 to 470 in 97a4629
After that happens, we get a "locked" instance (see the repro in the linked issue).
It happens because we abort (i.e. release the lock) only on prepared instances.
cartridge/cartridge/twophase.lua
Lines 521 to 528 in 97a4629
This error handling works well when the "prepare phase" didn't start or failed before the lock was taken.
But in case of a timeout, the "prepare phase" may have started on an instance without any answer coming back.
Later, when the procedure finishes on that instance, it takes the lock and blocks all subsequent config updates.
For now, fixing this situation requires calling the cartridge.twophase.force_reapply method.
But it seems to me that we can resolve this case automatically.
To do so, we have to abort (i.e. release the two-phase commit lock) on all instances
that have started executing the "prepare phase" procedure.
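A minimal sketch of the difference in abort targets, written as plain Lua. It is a model and not the actual twophase.lua code: the function name select_abort_targets, the attempted/prepared tables, and the URIs are all hypothetical.

```lua
-- Hypothetical model, not the real twophase.lua logic.
-- attempted: list of instance URIs the prepare request was sent to.
-- prepared:  set of URIs that answered the prepare request successfully.
local function select_abort_targets(attempted, prepared, abort_all_attempted)
    local targets = {}
    for _, uri in ipairs(attempted) do
        -- Current behavior: abort only instances known to be prepared.
        -- Proposed behavior: abort every instance where prepare was started.
        if abort_all_attempted or prepared[uri] then
            table.insert(targets, uri)
        end
    end
    return targets
end

local attempted = {'s1:3301', 's2:3301', 's3:3301'}
-- s3 timed out: prepare may still be running there, but it is not in the set.
local prepared = {['s1:3301'] = true, ['s2:3301'] = true}

print(table.concat(select_abort_targets(attempted, prepared, false), ', '))
-- s1:3301, s2:3301
print(table.concat(select_abort_targets(attempted, prepared, true), ', '))
-- s1:3301, s2:3301, s3:3301
```

In the first call the timed-out instance never receives an abort, which is how it can end up holding the lock.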
For this, we need to do 2 things:
My solution is implemented by stopping the fiber that performs the preparation procedure.
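The idea can be sketched with Tarantool's fiber module. This is an illustration under assumptions, not the code from this PR: the lock table, the prepare function, and run_prepare are hypothetical stand-ins, and the timings are arbitrary.

```lua
-- Run under Tarantool, e.g.: tarantool sketch.lua
local fiber = require('fiber')

-- Hypothetical stand-in for the per-instance two-phase commit lock.
local lock = {taken = false}

-- Hypothetical slow "prepare" handler: works for a while, then takes the lock.
local function prepare(work_time)
    fiber.sleep(work_time) -- yields; wakes up early if the fiber is cancelled
    fiber.testcancel()     -- raises an error if the fiber was cancelled
    lock.taken = true
end

-- Starts prepare in its own fiber and stops waiting for an answer after
-- `timeout` seconds. Returns whether the lock ended up taken.
local function run_prepare(work_time, timeout, cancel_on_timeout)
    lock.taken = false
    local f = fiber.new(prepare, work_time)
    f:set_joinable(true)
    fiber.sleep(timeout)   -- the caller stops waiting here
    if cancel_on_timeout then
        f:cancel()         -- the proposed fix: stop the prepare fiber
    end
    f:join()               -- wait until the prepare fiber has finished
    return lock.taken
end

print(run_prepare(0.2, 0.05, false)) -- true: the late prepare took the lock
print(run_prepare(0.2, 0.05, true))  -- false: the cancelled fiber never took it
os.exit(0)
```

Without cancellation the prepare fiber outlives the timeout and takes the lock afterwards. With cancellation it terminates at its next yield point, before the lock is taken.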
I didn't forget about
Close #2119