Limit CLUSTER_CANT_FAILOVER_DATA_AGE log to 10 times the relog period

Once a replica's data age becomes too old, it cannot trigger a failover, and currently it cannot recover from this state automatically. We print a log every CLUSTER_CANT_FAILOVER_RELOG_PERIOD, which is every second. Until the primary recovers or a manual failover is performed, this log floods the log file.

In this case, limit the log frequency to 10 times the period, which is 10 seconds in our code. In this data-age-too-old stage, the repeated logs can also stand for the progress of the failover.

See also valkey-io#780 for more details.

Signed-off-by: Binbin <[email protected]>
enjoy-binbin · Oct 18, 2024 · 2fb5558
1 parent a62d1f1
Showing 1 changed file with 6 additions and 0 deletions.
diff --git a/src/cluster_legacy.c b/src/cluster_legacy.c
@@ -4439,6 +4439,12 @@ void clusterLogCantFailover(int reason) {
         time(NULL) - lastlog_time < CLUSTER_CANT_FAILOVER_RELOG_PERIOD)
         return;
 
+    /* If data age is too old, this log may be printed repeatedly since it
+     * can not be automatically recovered. In this case, limit its frequency. */
+    if (reason == server.cluster->cant_failover_reason && reason == CLUSTER_CANT_FAILOVER_DATA_AGE &&
+        time(NULL) - lastlog_time < 10 * CLUSTER_CANT_FAILOVER_RELOG_PERIOD)
+        return;
+
     server.cluster->cant_failover_reason = reason;
 
     switch (reason) {
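
The throttling added in this hunk can be sketched in isolation. The following is a simplified model, not the code in cluster_legacy.c: `relog_state`, `should_log`, `count_logged`, and the `REASON_*` and `RELOG_PERIOD` names are hypothetical stand-ins, the clock is passed in as a parameter so the behavior is easy to check, and the last-log time is updated at the point of the decision, which may differ from where the real function updates it.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-ins for the real constants; the 1-second value is
 * taken from the commit message, the reason codes are made up. */
#define RELOG_PERIOD 1 /* seconds */
#define REASON_NONE 0
#define REASON_DATA_AGE 1
#define REASON_OTHER 2

/* Tracks the last logged reason and when it was logged. */
typedef struct {
    int last_reason;
    time_t last_log_time;
} relog_state;

/* Returns 1 if a message for `reason` should be logged at time `now`,
 * 0 if it should be suppressed. */
static int should_log(relog_state *st, int reason, time_t now) {
    time_t elapsed = now - st->last_log_time;

    /* Same reason within one period: suppress. */
    if (reason == st->last_reason && elapsed < RELOG_PERIOD) return 0;

    /* Same data-age reason within ten periods: suppress as well. */
    if (reason == st->last_reason && reason == REASON_DATA_AGE &&
        elapsed < 10 * RELOG_PERIOD)
        return 0;

    st->last_reason = reason;
    st->last_log_time = now;
    return 1;
}

/* Counts how many messages pass when the same reason is reported once
 * per second for `seconds` seconds, starting from an arbitrary time. */
static int count_logged(int reason, int seconds) {
    assert(seconds >= 0);
    relog_state st = {REASON_NONE, 0};
    int logged = 0;
    for (int t = 0; t < seconds; t++) {
        logged += should_log(&st, reason, (time_t)(1000 + t));
    }
    return logged;
}
```

Under these assumptions, reporting the data-age reason once per second for 30 seconds lets 3 messages through, while any other reason reported at the same rate still produces 30.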