
MASTER_AUTO_POSITION being reset to 0 after graceful-master-takeover #508

Closed
shlomi-noach opened this issue May 20, 2018 · 2 comments · Fixed by #509

@shlomi-noach
Collaborator

On behalf of @almeida-pythian, cross post from outbrain-inc/orchestrator#304

Hi @shlomi-noach, I think I might have found a bug in the graceful-master-takeover process.
Prior to graceful-master-takeover starting, all slaves have Auto_Position: 1. However, after graceful-master-takeover takes place, Auto_Position is set to 0, and further graceful failovers do not work until I set it back to 1.

My test scenario is below:

[root@po-proxysql1 orchestrator]# orchestrator-client -c topology -i po-mysql1:53306
po-mysql1:53306     [0s,ok,5.7.21-21-log,rw,MIXED,>>,GTID]
+ po-mysql2:53306   [0s,ok,5.7.21-21-log,ro,MIXED,>>,GTID]
+ po-mysql3:53306   [0s,ok,5.7.21-21-log,ro,MIXED,>>,GTID]
  + po-mysql4:53306 [0s,ok,5.7.21-21-log,ro,MIXED,>>,GTID]
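
For reference, the Auto_Position value comes from SHOW SLAVE STATUS; here is a quick sketch for checking it on any one of these replicas (it assumes the mysql client can log in from wherever you run it):

# Check Auto_Position on a single replica from the topology above.
mysql -h po-mysql2 -P 53306 -e "SHOW SLAVE STATUS\G" | grep Auto_Position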

I wrote a post graceful-master-takeover hook, which does the following:

  1. Restarts the slave threads on the old master (now a slave)
  2. Gets a list of all secondary slaves from the new master (for now this is hard-coded, as you can see below; this is a proof of concept)
  3. Moves the secondary slaves under the old master (now a slave) after the graceful failover
  4. Starts the slave threads on the secondary slaves

#!/bin/bash
# Post graceful-master-takeover hook (proof of concept).
echo "Restarting slave threads on old master ${ORC_FAILED_HOST}:${ORC_FAILED_PORT}"
orchestrator -c start-slave -i "${ORC_FAILED_HOST}:${ORC_FAILED_PORT}"

echo "Getting list of secondary slaves from new master"
SEC_SLAVES=()
# Hard-coded filter on po-mysql4 while this is a proof of concept.
for secondary_slave in $(orchestrator-client -c which-replicas -i "${ORC_SUCCESSOR_HOST}:${ORC_SUCCESSOR_PORT}" | grep po-mysql4)
do
  SEC_SLAVES+=("${secondary_slave}")
done

for ancillary_slave in "${SEC_SLAVES[@]}"
do
  echo "Making SECONDARY SLAVE ${ancillary_slave} a SLAVE of ${ORC_FAILED_HOST}"
  orchestrator -c relocate -i "${ancillary_slave}" -d "${ORC_FAILED_HOST}:${ORC_FAILED_PORT}"
  orchestrator -c start-slave -i "${ancillary_slave}"
done

Notice this only worked after I ran the following on the old master once graceful-master-takeover had finished:

STOP SLAVE; CHANGE MASTER TO MASTER_AUTO_POSITION = 1; START SLAVE;
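
If one wanted to automate that statement rather than run it by hand, a minimal sketch as an extra post-takeover step might look like this (my own workaround, not something orchestrator does; it assumes the mysql client can reach the demoted master from the orchestrator host, and it reuses the ORC_FAILED_* variables that the hooks above already receive):

#!/bin/bash
# restore_auto_position.sh -- hypothetical helper for PostGracefulTakeoverProcesses.
# Re-enables MASTER_AUTO_POSITION on the demoted master (ORC_FAILED_HOST/PORT).
mysql -h "${ORC_FAILED_HOST}" -P "${ORC_FAILED_PORT}" \
  -e "STOP SLAVE; CHANGE MASTER TO MASTER_AUTO_POSITION = 1; START SLAVE;"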

The screenshots below show the before and after:

[Screenshot: before, 2018-05-17 14-45-08]

[Screenshot: after, 2018-05-17 14-46-58]

Here's my config:

[root@po-proxysql1 orchestrator]# cat /etc/orchestrator.conf.json
{
  "Debug": false,
  "EnableSyslog": false,
  "ListenAddress": ":3000",
  "BackendDB": "sqlite",
  "SQLite3DataFile": "/usr/local/orchestrator/orchestrator.db",
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "orchestrator_password",
  "MySQLTopologyCredentialsConfigFile": "",
  "MySQLTopologySSLPrivateKeyFile": "",
  "MySQLTopologySSLCertFile": "",
  "MySQLTopologySSLCAFile": "",
  "MySQLTopologySSLSkipVerify": true,
  "MySQLTopologyUseMutualTLS": false,
  "MySQLOrchestratorHost": "127.0.0.1",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "MySQLOrchestratorUser": "orchestrator",
  "MySQLOrchestratorPassword": "orchestrator_password",
  "MySQLOrchestratorCredentialsConfigFile": "",
  "MySQLOrchestratorSSLPrivateKeyFile": "",
  "MySQLOrchestratorSSLCertFile": "",
  "MySQLOrchestratorSSLCAFile": "",
  "MySQLOrchestratorSSLSkipVerify": true,
  "MySQLOrchestratorUseMutualTLS": false,
  "MySQLConnectTimeoutSeconds": 1,
  "DefaultInstancePort": 3306,
  "DiscoverByShowSlaveHosts": true,
  "InstancePollSeconds": 5,
  "UnseenInstanceForgetHours": 240,
  "SnapshotTopologiesIntervalHours": 0,
  "InstanceBulkOperationsWaitTimeoutSeconds": 10,
  "HostnameResolveMethod": "default",
  "MySQLHostnameResolveMethod": "@@hostname",
  "SkipBinlogServerUnresolveCheck": true,
  "ExpiryHostnameResolvesMinutes": 60,
  "RejectHostnameResolvePattern": "",
  "ReasonableReplicationLagSeconds": 10,
  "ProblemIgnoreHostnameFilters": [],
  "VerifyReplicationFilters": false,
  "ReasonableMaintenanceReplicationLagSeconds": 20,
  "CandidateInstanceExpireMinutes": 60,
  "AuditLogFile": "",
  "AuditToSyslog": false,
  "RemoveTextFromHostnameDisplay": ".:53306",
  "ReadOnly": false,
  "AuthenticationMethod": "",
  "HTTPAuthUser": "",
  "HTTPAuthPassword": "",
  "AuthUserHeader": "",
  "PowerAuthUsers": [
    "*"
  ],
  "SlaveLagQuery": "",
  "DetectClusterAliasQuery": "SELECT SUBSTRING_INDEX(@@hostname, '.', 1)",
  "DetectClusterDomainQuery": "",
  "DetectInstanceAliasQuery": "",
  "DetectPromotionRuleQuery": "",
  "DataCenterPattern": "[.]([^.]+)[.][^.]+[.]mydomain[.]com",
  "PhysicalEnvironmentPattern": "[.]([^.]+[.][^.]+)[.]mydomain[.]com",
  "PromotionIgnoreHostnameFilters": [],
  "DetectSemiSyncEnforcedQuery": "",
  "ServeAgentsHttp": false,
  "AgentsServerPort": ":3001",
  "AgentsUseSSL": false,
  "AgentsUseMutualTLS": false,
  "AgentSSLSkipVerify": false,
  "AgentSSLPrivateKeyFile": "",
  "AgentSSLCertFile": "",
  "AgentSSLCAFile": "",
  "AgentSSLValidOUs": [],
  "UseSSL": false,
  "UseMutualTLS": false,
  "SSLSkipVerify": false,
  "SSLPrivateKeyFile": "",
  "SSLCertFile": "",
  "SSLCAFile": "",
  "SSLValidOUs": [],
  "URLPrefix": "",
  "StatusEndpoint": "/api/status",
  "StatusSimpleHealth": true,
  "StatusOUVerify": false,
  "AgentPollMinutes": 60,
  "UnseenAgentForgetHours": 6,
  "StaleSeedFailMinutes": 60,
  "SeedAcceptableBytesDiff": 8192,
  "PseudoGTIDPattern": "",
  "PseudoGTIDPatternIsFixedSubstring": false,
  "PseudoGTIDMonotonicHint": "asc:",
  "DetectPseudoGTIDQuery": "",
  "BinlogEventsChunkSize": 10000,
  "SkipBinlogEventsContaining": [],
  "ReduceReplicationAnalysisCount": true,
  "FailureDetectionPeriodBlockMinutes": 60,
  "RecoveryPeriodBlockSeconds": 3600,
  "RecoveryIgnoreHostnameFilters": [],
  "RecoverMasterClusterFilters": [
    "*"
  ],
  "RecoverIntermediateMasterClusterFilters": [
    "*"
  ],
   "OnFailureDetectionProcesses": [
    "echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}' >> /tmp/recovery.log"
  ],
  "PreGracefulTakeoverProcesses": [
    "echo 'Planned takeover about to take place on {failureCluster}. Master will switch to read_only' >> /tmp/recovery.log",
    "/usr/local/orchestrator/pregracefulfailover.sh >> /tmp/recovery.log"
  ],
  "PreFailoverProcesses": [
    "echo 'Will recover from {failureType} on {failureCluster}' >> /tmp/recovery.log"
  ],
  "PostFailoverProcesses": [
    "echo '(for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostUnsuccessfulFailoverProcesses": [],
  "PostMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Promoted: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostIntermediateMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostGracefulTakeoverProcesses": [
    "echo 'Planned takeover complete' >> /tmp/recovery.log",
    "/usr/local/orchestrator/postgracefulfailover.sh >> /tmp/recovery.log"
  ],
  "CoMasterRecoveryMustPromoteOtherCoMaster": true,
  "DetachLostSlavesAfterMasterFailover": true,
  "ApplyMySQLPromotionAfterMasterFailover": true,
  "MasterFailoverDetachSlaveMasterHost": false,
  "MasterFailoverLostInstancesDowntimeMinutes": 0,
  "PostponeSlaveRecoveryOnLagMinutes": 0,
  "OSCIgnoreHostnameFilters": [],
  "GraphiteAddr": "",
  "GraphitePath": "",
  "GraphiteConvertHostnameDotsToUnderscores": true
}

Thanks for your help.

@shlomi-noach
Collaborator Author

Thank you @almeida-pythian, I can confirm I'm able to reproduce this.

To be more specific: the demoted master, now returned as a replica, is set with auto_position=0 even if the topology is all using auto_position=1. The rest of the failover is fine and other replicas maintain their auto_position setting.

I'll look into it, but it's worth noting that the way Oracle implemented GTID invites some confusion. Each replica chooses whether or not to use auto_position, so we can have a hybrid topology. And the master itself? It's not replicating; so does it use GTID, or does it not?

What if the master had one replica with auto_position=0 and one with auto_position=1? What should happen after failover?

Sigh. Yet another "try and think like a human" for orchestrator here, and yet another "no single solution to satisfy all cases and all users".
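
To make the hybrid case concrete, here is a rough sketch (reusing which-replicas and plain mysql access, both assumed from the scenario above) that surveys Auto_Position across one master's replicas:

#!/bin/bash
# Hypothetical survey: print Auto_Position for every replica of a given master.
# Assumes the mysql client can log in to each replica without a password prompt.
MASTER="po-mysql1:53306"
for replica in $(orchestrator-client -c which-replicas -i "${MASTER}")
do
  host="${replica%:*}"
  port="${replica#*:}"
  auto_pos=$(mysql -h "${host}" -P "${port}" -e "SHOW SLAVE STATUS\G" | awk '/Auto_Position/ {print $2}')
  echo "${replica} Auto_Position=${auto_pos}"
done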

@almeida-pythian

Hi @shlomi-noach, thanks for looking into this. I'm sorry I posted to the other site (outbrain); my brain must have been out and I did not realize I was in the wrong place :-) Thanks for moving it here.
