
MASTER_AUTO_POSITION being reset to 0 after graceful-master-takeover #508

Closed
shlomi-noach opened this issue May 20, 2018 · 2 comments · Fixed by #509

@shlomi-noach
Collaborator

On behalf of @almeida-pythian, cross post from outbrain-inc/orchestrator#304

Hi @shlomi-noach, I think I might have found a bug in the graceful-master-takeover process.
Prior to graceful-master-takeover starting, all slaves have Auto_Position: 1. However, after graceful-master-takeover takes place, Auto_Position is set to 0, and further graceful failovers do not work until I set it back to 1.

My test scenario is below:

[root@po-proxysql1 orchestrator]# orchestrator-client -c topology -i po-mysql1:53306
po-mysql1:53306     [0s,ok,5.7.21-21-log,rw,MIXED,>>,GTID]
+ po-mysql2:53306   [0s,ok,5.7.21-21-log,ro,MIXED,>>,GTID]
+ po-mysql3:53306   [0s,ok,5.7.21-21-log,ro,MIXED,>>,GTID]
  + po-mysql4:53306 [0s,ok,5.7.21-21-log,ro,MIXED,>>,GTID]
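
For reference, the Auto_Position value comes from SHOW SLAVE STATUS; here is a quick sketch for checking it on any one of these replicas (it assumes the mysql client can log in from wherever you run it):

# Check Auto_Position on a single replica from the topology above.
mysql -h po-mysql2 -P 53306 -e "SHOW SLAVE STATUS\G" | grep Auto_Position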

I wrote a post graceful-master-takeover hook, which does the following:

  1. Restarts the slave threads on the old master (now a slave)
  2. Gets a list of all secondary slaves from the new master (for now this is hard-coded, as you can see below; this is a proof of concept)
  3. Moves the secondary slaves under the old master (now a slave) after the graceful failover
  4. Starts the slave threads on the secondary slaves

#!/bin/bash
# Post graceful-master-takeover hook (proof of concept).
echo "Restarting slave threads on old master ${ORC_FAILED_HOST}:${ORC_FAILED_PORT}"
orchestrator -c start-slave -i "${ORC_FAILED_HOST}:${ORC_FAILED_PORT}"

echo "Getting list of secondary slaves from new master"
SEC_SLAVES=()
# Hard-coded filter on po-mysql4 while this is a proof of concept.
for secondary_slave in $(orchestrator-client -c which-replicas -i "${ORC_SUCCESSOR_HOST}:${ORC_SUCCESSOR_PORT}" | grep po-mysql4)
do
  SEC_SLAVES+=("${secondary_slave}")
done

for ancillary_slave in "${SEC_SLAVES[@]}"
do
  echo "Making SECONDARY SLAVE ${ancillary_slave} a SLAVE of ${ORC_FAILED_HOST}"
  orchestrator -c relocate -i "${ancillary_slave}" -d "${ORC_FAILED_HOST}:${ORC_FAILED_PORT}"
  orchestrator -c start-slave -i "${ancillary_slave}"
done

Notice this only worked after I ran the following on the old master once graceful-master-takeover had finished:

STOP SLAVE; CHANGE MASTER TO MASTER_AUTO_POSITION = 1; START SLAVE;
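
If one wanted to automate that statement rather than run it by hand, a minimal sketch as an extra post-takeover step might look like this (my own workaround, not something orchestrator does; it assumes the mysql client can reach the demoted master from the orchestrator host, and it reuses the ORC_FAILED_* variables that the hooks above already receive):

#!/bin/bash
# restore_auto_position.sh -- hypothetical helper for PostGracefulTakeoverProcesses.
# Re-enables MASTER_AUTO_POSITION on the demoted master (ORC_FAILED_HOST/PORT).
mysql -h "${ORC_FAILED_HOST}" -P "${ORC_FAILED_PORT}" \
  -e "STOP SLAVE; CHANGE MASTER TO MASTER_AUTO_POSITION = 1; START SLAVE;"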

The screenshots below show the before and after:

[Screenshot: before, 2018-05-17 14-45-08]

[Screenshot: after, 2018-05-17 14-46-58]

Here's my config:

[root@po-proxysql1 orchestrator]# cat /etc/orchestrator.conf.json
{
  "Debug": false,
  "EnableSyslog": false,
  "ListenAddress": ":3000",
  "BackendDB": "sqlite",
  "SQLite3DataFile": "/usr/local/orchestrator/orchestrator.db",
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "orchestrator_password",
  "MySQLTopologyCredentialsConfigFile": "",
  "MySQLTopologySSLPrivateKeyFile": "",
  "MySQLTopologySSLCertFile": "",
  "MySQLTopologySSLCAFile": "",
  "MySQLTopologySSLSkipVerify": true,
  "MySQLTopologyUseMutualTLS": false,
  "MySQLOrchestratorHost": "127.0.0.1",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "MySQLOrchestratorUser": "orchestrator",
  "MySQLOrchestratorPassword": "orchestrator_password",
  "MySQLOrchestratorCredentialsConfigFile": "",
  "MySQLOrchestratorSSLPrivateKeyFile": "",
  "MySQLOrchestratorSSLCertFile": "",
  "MySQLOrchestratorSSLCAFile": "",
  "MySQLOrchestratorSSLSkipVerify": true,
  "MySQLOrchestratorUseMutualTLS": false,
  "MySQLConnectTimeoutSeconds": 1,
  "DefaultInstancePort": 3306,
  "DiscoverByShowSlaveHosts": true,
  "InstancePollSeconds": 5,
  "UnseenInstanceForgetHours": 240,
  "SnapshotTopologiesIntervalHours": 0,
  "InstanceBulkOperationsWaitTimeoutSeconds": 10,
  "HostnameResolveMethod": "default",
  "MySQLHostnameResolveMethod": "@@hostname",
  "SkipBinlogServerUnresolveCheck": true,
  "ExpiryHostnameResolvesMinutes": 60,
  "RejectHostnameResolvePattern": "",
  "ReasonableReplicationLagSeconds": 10,
  "ProblemIgnoreHostnameFilters": [],
  "VerifyReplicationFilters": false,
  "ReasonableMaintenanceReplicationLagSeconds": 20,
  "CandidateInstanceExpireMinutes": 60,
  "AuditLogFile": "",
  "AuditToSyslog": false,
  "RemoveTextFromHostnameDisplay": ".:53306",
  "ReadOnly": false,
  "AuthenticationMethod": "",
  "HTTPAuthUser": "",
  "HTTPAuthPassword": "",
  "AuthUserHeader": "",
  "PowerAuthUsers": [
    "*"
  ],
  "SlaveLagQuery": "",
  "DetectClusterAliasQuery": "SELECT SUBSTRING_INDEX(@@hostname, '.', 1)",
  "DetectClusterDomainQuery": "",
  "DetectInstanceAliasQuery": "",
  "DetectPromotionRuleQuery": "",
  "DataCenterPattern": "[.]([^.]+)[.][^.]+[.]mydomain[.]com",
  "PhysicalEnvironmentPattern": "[.]([^.]+[.][^.]+)[.]mydomain[.]com",
  "PromotionIgnoreHostnameFilters": [],
  "DetectSemiSyncEnforcedQuery": "",
  "ServeAgentsHttp": false,
  "AgentsServerPort": ":3001",
  "AgentsUseSSL": false,
  "AgentsUseMutualTLS": false,
  "AgentSSLSkipVerify": false,
  "AgentSSLPrivateKeyFile": "",
  "AgentSSLCertFile": "",
  "AgentSSLCAFile": "",
  "AgentSSLValidOUs": [],
  "UseSSL": false,
  "UseMutualTLS": false,
  "SSLSkipVerify": false,
  "SSLPrivateKeyFile": "",
  "SSLCertFile": "",
  "SSLCAFile": "",
  "SSLValidOUs": [],
  "URLPrefix": "",
  "StatusEndpoint": "/api/status",
  "StatusSimpleHealth": true,
  "StatusOUVerify": false,
  "AgentPollMinutes": 60,
  "UnseenAgentForgetHours": 6,
  "StaleSeedFailMinutes": 60,
  "SeedAcceptableBytesDiff": 8192,
  "PseudoGTIDPattern": "",
  "PseudoGTIDPatternIsFixedSubstring": false,
  "PseudoGTIDMonotonicHint": "asc:",
  "DetectPseudoGTIDQuery": "",
  "BinlogEventsChunkSize": 10000,
  "SkipBinlogEventsContaining": [],
  "ReduceReplicationAnalysisCount": true,
  "FailureDetectionPeriodBlockMinutes": 60,
  "RecoveryPeriodBlockSeconds": 3600,
  "RecoveryIgnoreHostnameFilters": [],
  "RecoverMasterClusterFilters": [
    "*"
  ],
  "RecoverIntermediateMasterClusterFilters": [
    "*"
  ],
   "OnFailureDetectionProcesses": [
    "echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}' >> /tmp/recovery.log"
  ],
  "PreGracefulTakeoverProcesses": [
    "echo 'Planned takeover about to take place on {failureCluster}. Master will switch to read_only' >> /tmp/recovery.log",
    "/usr/local/orchestrator/pregracefulfailover.sh >> /tmp/recovery.log"
  ],
  "PreFailoverProcesses": [
    "echo 'Will recover from {failureType} on {failureCluster}' >> /tmp/recovery.log"
  ],
  "PostFailoverProcesses": [
    "echo '(for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostUnsuccessfulFailoverProcesses": [],
  "PostMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Promoted: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostIntermediateMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostGracefulTakeoverProcesses": [
    "echo 'Planned takeover complete' >> /tmp/recovery.log",
    "/usr/local/orchestrator/postgracefulfailover.sh >> /tmp/recovery.log"
  ],
  "CoMasterRecoveryMustPromoteOtherCoMaster": true,
  "DetachLostSlavesAfterMasterFailover": true,
  "ApplyMySQLPromotionAfterMasterFailover": true,
  "MasterFailoverDetachSlaveMasterHost": false,
  "MasterFailoverLostInstancesDowntimeMinutes": 0,
  "PostponeSlaveRecoveryOnLagMinutes": 0,
  "OSCIgnoreHostnameFilters": [],
  "GraphiteAddr": "",
  "GraphitePath": "",
  "GraphiteConvertHostnameDotsToUnderscores": true
}

Thanks for your help.

@shlomi-noach
Collaborator Author

Thank you @almeida-pythian, I can confirm I'm able to reproduce this.

To be more specific: the demoted master, now returned as a replica, is set with auto_position=0 even if the topology is all using auto_position=1. The rest of the failover is fine and other replicas maintain their auto_position setting.

I'll look into it, but it's worth noting that the way Oracle implemented GTID invites some confusion. Each replica chooses whether or not to use auto_position, so we can have a hybrid topology. And the master itself? It's not replicating; so does it use GTID, or does it not?

What if the master had one replica with auto_position=0 and one with auto_position=1? What should happen after failover?

Sigh. Yet another "try and think like a human" for orchestrator here, and yet another "no single solution to satisfy all cases and all users".
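
To make the hybrid case concrete, here is a rough sketch (reusing which-replicas and plain mysql access, both assumed from the scenario above) that surveys Auto_Position across one master's replicas:

#!/bin/bash
# Hypothetical survey: print Auto_Position for every replica of a given master.
# Assumes the mysql client can log in to each replica without a password prompt.
MASTER="po-mysql1:53306"
for replica in $(orchestrator-client -c which-replicas -i "${MASTER}")
do
  host="${replica%:*}"
  port="${replica#*:}"
  auto_pos=$(mysql -h "${host}" -P "${port}" -e "SHOW SLAVE STATUS\G" | awk '/Auto_Position/ {print $2}')
  echo "${replica} Auto_Position=${auto_pos}"
done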

@almeida-pythian

Hi @shlomi-noach, thanks for looking into this. I'm sorry I posted to the other site (outbrain); my brain must have been out and I did not realize I was in the wrong place :-) Thanks for moving it here.
