tcmur_device: add priv lock support #667
I'm confused about the scope of the issue. The description says that the problem is that the status update handler could encounter a NULL state, but the PR goes on to change all handlers, including I/O handlers such as read and write. Would tcmu-runner core really initiate I/O on a closed image?
Could we instead look at something like TCMUR_DEV_FLAG_IS_OPEN? Wrapping all handlers with rdev->priv_lock seems too heavyweight to me.
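For illustration, the lighter-weight alternative raised in this comment could look roughly like the sketch below: each handler checks an "is open" flag and bails out early, rather than every handler taking a lock. Everything here is hypothetical (struct demo_dev, DEMO_FLAG_IS_OPEN as a stand-in for TCMUR_DEV_FLAG_IS_OPEN, the helper names, and the negative errno return); none of it is taken from the tcmu-runner sources.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for TCMUR_DEV_FLAG_IS_OPEN; the real value may differ. */
#define DEMO_FLAG_IS_OPEN (1u << 0)

struct demo_dev {
	atomic_uint flags; /* device state bits */
	void *priv;        /* handler private data, NULL while closed */
};

/* Publish the private data first, then mark the device open. */
static void demo_open(struct demo_dev *dev, void *priv)
{
	assert(priv != NULL);
	dev->priv = priv;
	atomic_fetch_or(&dev->flags, DEMO_FLAG_IS_OPEN);
}

/* Clear the flag first, then drop the private data. */
static void demo_close(struct demo_dev *dev)
{
	atomic_fetch_and(&dev->flags, ~DEMO_FLAG_IS_OPEN);
	dev->priv = NULL;
}

/* Returns 0 if the handler may proceed, -EBUSY if the device is closed. */
static int demo_handle_read(struct demo_dev *dev)
{
	if (!(atomic_load(&dev->flags) & DEMO_FLAG_IS_OPEN))
		return -EBUSY;
	/* A real handler would use dev->priv here. */
	return 0;
}
```

Note that a flag check alone leaves a window between the check and the later use of priv, so whether it is sufficient depends on how close and the handlers are serialized.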
rbd.c
Outdated
@@ -115,11 +115,12 @@ static darray(char *) blacklist_caches;
#ifdef LIBRADOS_SUPPORTS_SERVICES

#ifdef RBD_LOCK_ACQUIRE_SUPPORT
/* rdev->priv_lock is held_*/
- /* rdev->priv_lock is held_*/
+ /* rdev->priv_lock is held */
here and everywhere else
Will fix it.
rbd.c
Outdated
struct tcmur_device *rdev = tcmu_dev_get_private(dev);
struct tcmu_rbd_state *state = tcmur_dev_get_private(dev);
This is a purely cosmetic change, right?
This may have been introduced by my previous changes: there were other changes in this function that were later removed again. On its own this change makes no sense and can be dropped.
rbd.c
Outdated
bool has_lock)
{
	return 0;
Squash "rbd: remove possible warning" commit into an earlier commit that made the void -> int change.
I would suggest splitting unrelated bug fixes ("rbd: fix use-after-free of addr", "rbd: fix memory leak when fails to get the address" and "rbd: fix and add more debug logs") into a separate PR.
Yeah, it's possible.
Let me look into this further and give it a try.
Done.
96199cb to 1d948c0
@idryomov Please take a look, thanks.
tcmur_device.c
Outdated
ret = rhandler->report_event(dev);
if (ret)
	tcmu_dev_err(dev, "Could not report events. Error %d.\n", ret);
pthread_mutex_unlock(&rdev->state_lock);

pthread_mutex_unlock(&rdev->rdev_lock);
Was this code path tested?
- pthread_mutex_unlock(&rdev->rdev_lock);
+ pthread_mutex_lock(&rdev->rdev_lock);
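One way to catch a mismatched lock/unlock pair like this during testing is an error-checking mutex, which reports the mistake instead of leaving the behavior undefined. The sketch below is a generic POSIX illustration, not tcmu-runner code, and demo_unlock_twice is a hypothetical helper.

```c
/* Needed for PTHREAD_MUTEX_ERRORCHECK under strict -std= modes. */
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <pthread.h>

/*
 * Lock an error-checking mutex once, then unlock it twice.
 * Returns the result of the second, unbalanced unlock.
 */
static int demo_unlock_twice(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t lock;
	int ret;

	ret = pthread_mutexattr_init(&attr);
	assert(ret == 0);
	ret = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	assert(ret == 0);
	ret = pthread_mutex_init(&lock, &attr);
	assert(ret == 0);
	pthread_mutexattr_destroy(&attr);

	pthread_mutex_lock(&lock);
	pthread_mutex_unlock(&lock);       /* balanced unlock */
	ret = pthread_mutex_unlock(&lock); /* mutex is not held: error is reported */

	pthread_mutex_destroy(&lock);
	return ret;
}
```

With a default mutex the second unlock is undefined behavior, which is why a bug of this kind can pass local testing unnoticed.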
Was this code path tested?
Yes, it was tested. I checked: this had already been fixed, but the fix was not amended into the commit.
In my setups, one node had the fix and the other did not.
I did not hit any issue in local testing.
}

rdev->flags |= TCMUR_DEV_FLAG_REPORTING_EVENT;
pthread_mutex_unlock(&rdev->rdev_lock);
Why is rdev_lock released here? ->report_event() used to be called with it held.
The rdev_lock should always be released when calling the handler's hooks. I think we need to pass a has_lock boolean parameter.
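The pattern under discussion (mark the device as reporting while holding the lock, drop the lock for the duration of the handler hook, then retake it to clear the mark) might be sketched as follows. All names here (struct demo_rdev, DEMO_FLAG_REPORTING_EVENT, demo_report_event, demo_hook) are hypothetical stand-ins modelled on the snippets quoted in this review, not the actual tcmu-runner definitions.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEMO_FLAG_REPORTING_EVENT (1u << 0) /* hypothetical flag value */

struct demo_rdev {
	pthread_mutex_t rdev_lock;        /* protects flags */
	pthread_cond_t report_event_cond; /* signalled when reporting ends */
	unsigned int flags;
	int (*report_event)(struct demo_rdev *rdev); /* handler hook, may sleep */
};

static int demo_report_event(struct demo_rdev *rdev)
{
	int ret;

	assert(rdev->report_event != NULL);

	pthread_mutex_lock(&rdev->rdev_lock);
	rdev->flags |= DEMO_FLAG_REPORTING_EVENT;
	pthread_mutex_unlock(&rdev->rdev_lock);

	/* The hook runs without rdev_lock held, so it may block or take the lock. */
	ret = rdev->report_event(rdev);

	pthread_mutex_lock(&rdev->rdev_lock);
	rdev->flags &= ~DEMO_FLAG_REPORTING_EVENT;
	pthread_cond_broadcast(&rdev->report_event_cond);
	pthread_mutex_unlock(&rdev->rdev_lock);

	return ret;
}

/*
 * Example hook that takes rdev_lock itself. It returns -1 if the caller
 * still holds the lock, -2 if the reporting flag is not set, 0 otherwise.
 */
static int demo_hook(struct demo_rdev *rdev)
{
	int reporting;

	if (pthread_mutex_trylock(&rdev->rdev_lock) != 0)
		return -1;
	reporting = (rdev->flags & DEMO_FLAG_REPORTING_EVENT) != 0;
	pthread_mutex_unlock(&rdev->rdev_lock);

	return reporting ? 0 : -2;
}
```

The flag is what lets other threads see that a report is in flight even though the lock itself is not held during the hook.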
tcmur_device.h
Outdated
	pthread_cond_t report_event_cond;

	pthread_spinlock_t cmds_list_lock; /* protects cmds_list */
	struct list_head cmds_list;
};
You seem to have checked in a kmod-devel-25-16.el8.x86_64.rpm binary by mistake.
Yeah, I have removed it.
I think I'm still missing something. I'm going to ignore the renames for the moment and speak in terms of what is currently in master. From my reading of the description, the problem is that
The reopen and event report will be run in two different threads. The reopen will be split into In case that just after the The use-after-free bug still exists... We need to let the reopen thread wait a bit to be sure that the event report thread has finished.
If the device is in recovery, we can defer reporting the event until the recovery reopens the device. If the device is stopped or stopping, we can just skip it.

Wait for the report event to finish when recovering the device, because the recovery closes and then reopens the device, during which the private data may be released. That may cause a use-after-free crash in the report event routine.

Signed-off-by: Xiubo Li <[email protected]>
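The waiting step described in this commit message could be sketched with a condition variable as below. This is only an illustration under assumed names (struct demo_rdev, DEMO_FLAG_REPORTING_EVENT, and the demo_* functions are hypothetical); the real recovery path in tcmu-runner does considerably more.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEMO_FLAG_REPORTING_EVENT (1u << 0) /* hypothetical flag value */

struct demo_rdev {
	pthread_mutex_t rdev_lock;        /* protects flags and priv */
	pthread_cond_t report_event_cond; /* signalled when a report ends */
	unsigned int flags;
	void *priv;                       /* released on close */
};

/* Called by the event report thread once it no longer needs rdev->priv. */
static void demo_report_event_done(struct demo_rdev *rdev)
{
	pthread_mutex_lock(&rdev->rdev_lock);
	rdev->flags &= ~DEMO_FLAG_REPORTING_EVENT;
	pthread_cond_broadcast(&rdev->report_event_cond);
	pthread_mutex_unlock(&rdev->rdev_lock);
}

/* Called by the recovery thread before it releases the private data. */
static void demo_close_for_recovery(struct demo_rdev *rdev)
{
	pthread_mutex_lock(&rdev->rdev_lock);
	/* Loop rather than a single wait, to tolerate spurious wakeups. */
	while (rdev->flags & DEMO_FLAG_REPORTING_EVENT)
		pthread_cond_wait(&rdev->report_event_cond, &rdev->rdev_lock);
	assert(!(rdev->flags & DEMO_FLAG_REPORTING_EVENT));
	rdev->priv = NULL; /* no report is in flight, so this is safe */
	pthread_mutex_unlock(&rdev->rdev_lock);
}

/* Thread body for the demo: finish an in-flight report. */
static void *demo_reporter(void *arg)
{
	demo_report_event_done(arg);
	return NULL;
}

/* Runs both sides once; returns 0 if the close completed after the report ended. */
static int demo_run(void)
{
	int data = 42;
	struct demo_rdev rdev = {
		.rdev_lock = PTHREAD_MUTEX_INITIALIZER,
		.report_event_cond = PTHREAD_COND_INITIALIZER,
		.flags = DEMO_FLAG_REPORTING_EVENT, /* a report is in flight */
		.priv = &data,
	};
	pthread_t tid;

	if (pthread_create(&tid, NULL, demo_reporter, &rdev) != 0)
		return -1;
	demo_close_for_recovery(&rdev);
	pthread_join(tid, NULL);

	return (rdev.priv == NULL && rdev.flags == 0) ? 0 : -1;
}
```

Because the flag is checked under the same mutex that the reporter takes to clear it, the wakeup cannot be missed regardless of which thread runs first.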
I ran the following test for 2 hours and it worked fine for me.
In master, You are removing
As I remember from a discussion long ago, the rule is that the Or possibly the handler's hooks will sleep, so holding the The current code in master has been buggy since it began to support the event report feature.
When tcmu-runner detects that the lock is lost, it queues a work
event to reopen the image and, at the same time, queues a work event
to update the service status. The reopen is not atomic: there is a
gap between image close and image open during which the rbd image's
state resource is released. If the update status event fires in that
gap, we hit the crash bug.
This commit adds an rdev->priv_lock to protect the private data in
the rdev struct. The service status updating code simply skips the
update if it runs in the reopen gap. All other IOs return EBUSY so
that the client retries.
Signed-off-by: Xiubo Li [email protected]
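As a rough sketch of the behavior this description outlines, the code below guards a private-data pointer with a lock, lets the status update skip silently during the reopen gap, and makes I/O fail with a retryable error. The names (struct demo_rdev and the demo_* functions) are hypothetical, and returning a negative errno is an assumption of this sketch rather than what the tcmu-runner handlers actually return.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

struct demo_rdev {
	pthread_mutex_t priv_lock; /* protects priv */
	void *priv;                /* NULL during the close/open gap */
};

/* Reopen path: called with NULL on close and with new state on open. */
static void demo_set_priv(struct demo_rdev *rdev, void *priv)
{
	assert(rdev != NULL);
	pthread_mutex_lock(&rdev->priv_lock);
	rdev->priv = priv;
	pthread_mutex_unlock(&rdev->priv_lock);
}

/* I/O path: ask the client to retry if the image is mid-reopen. */
static int demo_handle_io(struct demo_rdev *rdev)
{
	int ret = 0;

	pthread_mutex_lock(&rdev->priv_lock);
	if (!rdev->priv)
		ret = -EBUSY; /* assumed convention: negative errno */
	/* Otherwise a real handler would submit the I/O using rdev->priv. */
	pthread_mutex_unlock(&rdev->priv_lock);

	return ret;
}

/* Status update path: returns 1 if it updated, 0 if it skipped. */
static int demo_update_status(struct demo_rdev *rdev)
{
	int updated = 0;

	pthread_mutex_lock(&rdev->priv_lock);
	if (rdev->priv)
		updated = 1; /* real code would report the service status here */
	pthread_mutex_unlock(&rdev->priv_lock);

	return updated;
}
```

Holding the lock across both the NULL check and the use of priv is what closes the window that a bare check would leave open.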