The Redis global locking approach used by waitlock.py may not be entirely reliable #3
Comments
I suspect this was caused by the server being under extremely high load and the lock expiry being set to 5 seconds, which under normal circumstances is plenty of time for the `EXTEND` command to be sent to Redis frequently, and allows us to move on to the next container, or retry, more quickly after an actual crash/failure. Under extremely high load, it's possible that the thread sending `EXTEND` commands was delayed for more than 5 seconds, causing it to fail with an error about the lock not having been acquired or having already expired. I think we should increase this to 60 seconds and see if the problem resurfaces. https://github.com/ixc/ixc-django-docker/blob/master/ixc_django_docker/bin/waitlock.py#L81
It might also be worth making the waitlock.py script more resilient to the `NotAcquired` error, e.g. by retrying.
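To illustrate the expiry/renewal interplay described above, here is a minimal sketch using python-redis-lock. It is an illustration only, not the actual waitlock.py code; the lock name, Redis host, and `still_running()` helper are assumptions, and whether the renewal loop is waitlock.py's own or the library's `auto_renewal` thread, the failure mode is the same.

```python
# Sketch only: illustrates why a starved renewal loop leads to NotAcquired.
import time

import redis
import redis_lock

conn = redis.StrictRedis(host="redis")  # assumed connection details

# The lock key expires after `expire` seconds unless it is extended in time.
lock = redis_lock.Lock(conn, "migrate", expire=60)  # 60s as suggested above; was 5
lock.acquire(blocking=True)
try:
    while still_running():  # hypothetical check on the wrapped command
        # This must run more often than `expire`. If the loop is starved for
        # longer than `expire` (e.g. the node is overloaded), Redis deletes
        # the key and the next extend() raises NotAcquired.
        lock.extend(expire=60)
        time.sleep(5)
finally:
    lock.release()
```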
@jmurty I think this error won't happen when acquiring a lock, only when attempting to extend a lock while it is already acquired and the command is executing. I think it's probably OK to just expect the caller (Python/Bash script or Docker Cloud) to retry on failure. Docker services should already be configured to restart on failure, so any retry option in waitlock.py itself may be redundant.
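A hedged sketch of the "leave retries to the caller" approach described here: the wrapper simply lets a `NotAcquired` failure turn into a non-zero exit status, and the Docker service's restart policy re-runs the container. The command handling, lock name, and connection details below are assumptions, not the real waitlock.py script.

```python
# Sketch only, not the actual waitlock.py: on NotAcquired, exit non-zero and
# let the Docker restart policy (or the calling script) handle the retry.
import subprocess
import sys

import redis
import redis_lock


def main(command):
    conn = redis.StrictRedis(host="redis")  # assumed connection details
    lock = redis_lock.Lock(conn, "migrate", expire=60, auto_renewal=True)
    try:
        with lock:  # blocks until the lock is acquired
            return subprocess.call(command)
    except redis_lock.NotAcquired:
        # Lock expired or was lost mid-run; fail fast so the caller retries.
        return 1


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```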
When deploying environments with multiple Django-based containers, all of which will try to do startup tasks like running DB migrations at the same time, we use Redis to acquire "global" locks before running the commands. This ensures that only one container at a time will run DB migrations, or other jobs that only need to be done once.
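As a rough illustration of that pattern (not the actual waitlock.py implementation), each container wraps its one-off startup command in a named global lock, something like the following; the lock name, expiry, and Redis connection details are assumptions:

```python
# Sketch of the "global lock around one-off startup jobs" pattern described
# above. Every container contends for the same named key in Redis, so only
# one of them runs the job at a time; the rest block at the `with` statement.
import subprocess

import redis
import redis_lock

conn = redis.StrictRedis(host="redis")  # assumed connection details

with redis_lock.Lock(conn, "django-migrate", expire=60, auto_renewal=True):
    subprocess.check_call(["python", "manage.py", "migrate", "--noinput"])
```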
On a recent update of a staging environment the Redis global lock mechanism failed, causing four containers to try to run DB migrations at the same time, which put a large load on the underlying node (since DB migration processing is expensive) and brought everything to a near standstill.
The specific Redis global lock failures were `NotAcquired` errors from the waitlock.py helper script.

We (IC) worked around the problem by stopping the less important containers in the sfmoma-staging stack that were also running migrations in Docker Cloud (celery, celeryflower, celerybeat), to allow just the one django container to do the work.
The root cause seemed to be failures within the `redis_lock` library when attempting to apply an `EXTEND` operation to extend an existing lock.

Here is the `extend()` method in `redis_lock`: https://github.com/ionelmc/python-redis-lock/blob/369e95bb5e26284ef0944e551f93d9f2596e5345/src/redis_lock/__init__.py#L243

Ultimately, for some reason the `EXTEND` operation applied via scripting to the Redis locking mechanism returned an error code value of 1. I do not know why, and have been unable to find any useful details or explanation in preliminary research.

See https://github.com/sfmoma/sfmoma/issues/263