draft: implement the prepare fiber canceling #2126

Closed
wants to merge 1 commit into from

Contributor

@palage4a palage4a commented Jul 25, 2023

When a clusterwide config is applied in a highly loaded cluster, a timeout error often occurs.
In some cases, the error is caused by a timeout in the prepare phase on one of the instances:

local retmap, errmap = pool.map_call(opts.fn_prepare, {upload_id}, {
    uri_list = opts.uri_list,
    timeout = vars.options.validate_config_timeout,
})

After that, we end up with a "locked" instance (see the repro in the linked issue).
This happens because we abort (i.e. release the lock on) only the instances that were successfully prepared:

::abort::
do
    log.warn('(2PC) %s abort phase...', activity_name)
    local retmap, errmap = pool.map_call(opts.fn_abort, nil, {
        uri_list = abortion_list,
        timeout = vars.options.netbox_call_timeout,
    })

This error handling works well when the "prepare phase" did not start or failed before the lock was acquired.
But in case of a timeout, the "prepare phase" may have started on an instance without any reply reaching the caller.
When the procedure later finishes on that instance, it acquires the lock and blocks subsequent config updates.
For now, the only way to fix this is to call cartridge.twophase.force_reapply.
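
To make the failure mode concrete, here is a minimal toy model in plain Lua. It is not Cartridge code: `new_instance`, `prepare`, `abort` and `run_2pc` are hypothetical names, and the calls run sequentially, whereas `pool.map_call` talks to the instances over the network. It only shows how aborting just the instances that answered leaves the timed-out one locked.

```lua
-- Toy model of the failure mode; all names are hypothetical.

local function new_instance(name, reply_lost)
    return {name = name, locked = false, reply_lost = reply_lost}
end

-- The instance takes its lock. If the reply is lost (timeout), the
-- coordinator sees a failure even though the lock is already held.
local function prepare(inst)
    inst.locked = true
    if inst.reply_lost then
        return nil, 'timeout'
    end
    return true
end

local function abort(inst)
    inst.locked = false
end

-- Coordinator mirroring the behaviour described above:
-- only instances that answered "prepared" get aborted.
local function run_2pc(instances)
    local abortion_list = {}
    local failed = false
    for _, inst in ipairs(instances) do
        local ok = prepare(inst)
        if ok then
            table.insert(abortion_list, inst)
        else
            failed = true
        end
    end
    if failed then
        for _, inst in ipairs(abortion_list) do
            abort(inst)
        end
    end
    return not failed
end

local a = new_instance('a', false)
local b = new_instance('b', true) -- prepare runs, but the reply times out

local ok = run_2pc({a, b})
-- 'a' was prepared and then aborted; 'b' never made it into
-- abortion_list, so its lock is still held.
print(ok, a.locked, b.locked)
```

In this model the run fails, `a` is unlocked again, and `b` keeps its lock, which corresponds to the "locked" instance from the linked issue.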

But it seems to me that we can resolve this case automatically.
To do so, we have to abort (i.e. release the two-phase commit lock on) all instances
that have started executing the "prepare phase" procedure.
This requires two things:

  • abort all instances;
  • make abort function idempotent.
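
The two points above can be sketched in plain Lua as follows. This is only an illustration under assumptions: `new_instance`, `prepare`, `abort` and `abort_all` are hypothetical names, and the lock is modelled as a single `lock_owner` field rather than the real Cartridge state.

```lua
-- Toy model: abort every instance, with an abort that is safe to repeat.
-- All names are hypothetical.

local function new_instance(name)
    return {name = name, lock_owner = nil}
end

local function prepare(inst, upload_id)
    if inst.lock_owner ~= nil then
        return nil, 'Two-phase commit is locked'
    end
    inst.lock_owner = upload_id
    return true
end

-- Idempotent abort: it succeeds whether or not the lock is held,
-- and calling it twice has the same effect as calling it once.
local function abort(inst)
    inst.lock_owner = nil
    return true
end

-- Abort all instances, not only those that reported "prepared".
local function abort_all(instances)
    for _, inst in ipairs(instances) do
        abort(inst)
    end
end

local a = new_instance('a')
local b = new_instance('b')
local c = new_instance('c') -- never reached the prepare phase

prepare(a, 'upload-1')
prepare(b, 'upload-1')

abort_all({a, b, c})
abort_all({a, b, c}) -- repeating the abort is harmless

-- A later config update can take the lock again.
local ok = prepare(b, 'upload-2')
print(ok, b.lock_owner)
```

Because `abort` never fails on an instance that holds no lock, the coordinator does not need to know which instances actually started preparing.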

My implementation stops the fiber that performs the preparation procedure.
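
The idea can be modelled roughly with plain Lua coroutines standing in for Tarantool fibers. This is a sketch under assumptions, not the code from this PR: `state`, `checkpoint`, `prepare_body`, `start_prepare` and `abort` are hypothetical names, and in Tarantool the cancellation would presumably go through the `fiber` module (for example `fiber_object:cancel()` and `fiber.testcancel()`) instead of a flag.

```lua
-- Toy model: abort cancels an in-flight prepare and releases the lock.
-- Coroutines stand in for fibers; all names are hypothetical.

local state = {locked = false, prepare_co = nil, cancelled = false}

-- A yield point inside prepare. After being resumed, the routine
-- checks whether it was cancelled in the meantime.
local function checkpoint()
    coroutine.yield()
    if state.cancelled then
        error('prepare cancelled')
    end
end

local function prepare_body()
    state.locked = true
    checkpoint() -- stands for a long step, e.g. config validation
    checkpoint() -- stands for another long step
    return 'prepared'
end

local function start_prepare()
    state.cancelled = false
    state.prepare_co = coroutine.create(prepare_body)
    coroutine.resume(state.prepare_co) -- runs until the first checkpoint
end

-- Idempotent abort: stop a running prepare (if any), then drop the lock.
local function abort()
    local co = state.prepare_co
    if co ~= nil and coroutine.status(co) == 'suspended' then
        state.cancelled = true
        coroutine.resume(co) -- the body raises at its checkpoint and dies
    end
    state.prepare_co = nil
    state.locked = false
    return true
end

start_prepare()
local co = state.prepare_co
local was_locked = state.locked -- prepare is in flight and holds the lock

abort()
abort() -- a second abort finds nothing to cancel

print(was_locked, state.locked, coroutine.status(co))
```

Since the cancelled routine can never run to completion, it cannot acquire or keep the lock after the coordinator has already given up on it.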

I didn't forget about

  • Tests
  • Changelog
  • Documentation

Close #2119

@palage4a palage4a force-pushed the palage4a/2pc-is-locked-fix branch 9 times, most recently from 24116f8 to 39dda59 Compare July 27, 2023 13:11
@filonenko-mikhail
Collaborator

Gently closing. No plans to review.

Successfully merging this pull request may close these issues.

"Two-phase commit is locked" after prepare phase timeout