一、部署MHA前提要求(必须满足)
1. SSH公钥认证
基本上MHA manager,MNA node,以及二次检测的节点,都需要互相信任。如果slave比较多,实例比较多,最好提高下 /etc/ssh/sshd_config MaxStartups 的值(默认是10)。
2. 操作系统
仅在Linux上测试过。
3. 单写master和多slave或者只读master
打从一开始,MHA就是为了解决数据一致性而出生,所以,最好是多个slave。如果你只有一个slave,根本就碰不到数据一致性问题,也就不需要mha了
如果是一个slave,用半同步复制也能解决。从mha 0.52开始,就支持多master的复制架构了,下面列举了多master环境下注意点:
- 多master,但是只允许单点写入。
- 默认情况下,只支持2层复制架构。
4. 三层或者多层复制环境
默认情况下,MHA是不支持3层或多层复制架构的(Master1 -> Master2 -> Slave3)。MHA可以恢复Master2,但是不能恢复Slave3,因为Master2,Slave3有不同的master。为了让MHA支持以上架构,可以参考如下配置:
- 在配置文件中,只配置两层(master1 and master2)。
- 使用 “multi_tier_slave=1” 参数,然后设置所有hosts。
5. MySQL版本必须是5.0或者高于5.0
- MySQL版本必须大于等于5.0。
- 尽量使用高版本的MySQL。
6. 使用mysqlbinlog 5.1+ 支持MySQL 5.1+
- MHA使用mysqlbinlog来应用日志到目标slave上的。
- 如果MySQL master设置的是row格式,那么MySQL必须是大于等于5.1版本,因为5.0不支持row。
- mysqlbinlog版本可以这样被检测,mysqlbinlog –version。
- 如果你使用的是MySQL5.1,那么mysqlbinlog必须大于等于3.3。
- 如果mysqlbinlog的版本是3.2,而mysql的版本是5.1,那么mha manager会报错,且停止monitoring。
7. log-bin必须在候选master上开启
- 如果当前slave没有设置log-bin,那么很显然它不能成为提升为new master。
- 如果没有任何机器设置了log-bin,那么mha会报错且停止failover。
8. binlog,relay-log主从环境必须全部一致
- 对于主从两边的配置最好一模一样(不需要配置中设置read_only),主要是主库要有从库相关的复制参数配置。
- 复制过滤规则(binlog-do-db, replicate-ignore-db等等)必须全部一致,不然检查到会退出。
9. 主库复制用户和业务用户必须在候选master上要存在
- 切换完成后,所有业务连接到新的主库,所以旧的主库用到的业务账号也必须在新主库有一份,不然就有大问题。
- 切换完成后,所有slave都必须执行change master命令。在new master上复制用户必须有(REPLICATEION SLAVE权限)。
10. 使用purge_relay_logs来定期删除relay logs(两个节点则不必关心此条)
默认情况下,如果SQL线程执行完relay-log,relay logs就会被自动删除。但是这些relay-logs也许还会用来恢复其他的slave,所以你需要关闭自动删除relay-logs的purge线程,然后自己阶段性的来删除。如果是你自己来删的话,必须考虑复制延迟问题,最好让slave删除relay log不要在同一时间点,假如需要恢复,那么这个时间点所有relay logs都被删除了就不好了。
11. 不要在SBR的环境中使用load data infile
不管是SBR,还是RBR,最好不要使用load data。
二、部署MHA
1. 环境准备
接下来部署MHA,具体的搭建环境如下(所有操作系统均为centos 7.2 64bit,不是必须,mysql02和mysql03是mysql01的从,复制环境搭建后面会简单演示,但是相关的安全复制不会详细说明,需要的童鞋请参考前面的文章,MySQL Replication需要注意的问题):
角色 | IP地址 | 主机名 | server_id | 类型 |
mha manager | 10.99.73.7 | mha | – | mha |
mysql master | 10.99.73.9 | mysql01 | 100 | 写入 |
mysql slave | 10.99.73.10 | mysql02 | 200 | 读(candidate_master) |
mysql slave | 10.99.73.11 | mysql03 | 300 | 读(no_master) |
其中master对外提供写服务,备选master(实际的slave,主机名mysql02)提供读服务,slave也提供相关的读服务,一旦master宕机,将会把备选master提升为新的master,slave指向新的master。
关闭各个主机iptables加selinux
1 2 |
$ systemctl stop firewalld $ setenforce 0 |
各个主机配置hosts
1 2 3 4 5 |
$ cat /etc/hosts 10.99.73.7 mha 10.99.73.9 mysql01 10.99.73.10 mysql02 10.99.73.11 mysql03 |
各个主机配置SSH互信,基本上MHA manager,MHA node,以及二次检测的节点,都需要互相信任。如下示例:
1 2 3 4 5 |
$ ssh-keygen $ ssh-copy-id root@10.99.73.7 $ ssh-copy-id root@10.99.73.9 $ ssh-copy-id root@10.99.73.10 $ ssh-copy-id root@10.99.73.11 |
2. 安装MHA Node
1)在所有节点安装MHA node所需的perl模块,需要epel源,使用yum安装。
1 |
$ yum install epel-release perl-DBD-MySQL perl-CPAN -y |
2)在所有的节点安装mha node
1 2 3 4 |
$ tar xvf mha4mysql-node-0.57.tar.gz $ cd mha4mysql-node-0.57 $ perl Makefile.PL $ make && make install |
安装完成后会在/usr/local/bin目录下生成以下脚本文件:
1 2 3 4 5 6 |
$ ll /usr/local/bin/ total 44 -r-xr-xr-x 1 root root 16381 Feb 8 14:02 apply_diff_relay_logs -r-xr-xr-x 1 root root 4807 Feb 8 14:02 filter_mysqlbinlog -r-xr-xr-x 1 root root 8261 Feb 8 14:02 purge_relay_logs -r-xr-xr-x 1 root root 7525 Feb 8 14:02 save_binary_logs |
关于上面脚本的功能,上面已经介绍过了,这里不再重复了。
3. 安装MHA Manager
MHA Manager中主要包括了几个管理员的命令行工具,例如master_manger,master_master_switch等。MHA Manger也依赖于perl模块,具体如下:
1)安装MHA Node软件包之前需要安装依赖。我这里使用yum完成,没有epel源的可以使用上面提到的脚本(epel源安装也简单)。注意:在MHA Manager的主机也是需要安装MHA Node。
1 |
$ yum install epel-release perl-DBD-MySQL perl-CPAN -y |
安装MHA Node软件包,和上面的方法一样,如下:
1 2 3 4 |
$ tar xvf mha4mysql-node-0.57.tar.gz $ cd mha4mysql-node-0.57 $ perl Makefile.PL $ make && make install |
2)安装MHA Manager,首先安装MHA Manger依赖的perl模块(我这里使用yum安装):
1 |
$ yum install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes -y |
安装MHA Manager软件包:
1 2 3 4 |
$ tar xvf mha4mysql-manager-0.57.tar.gz $ cd mha4mysql-manager-0.57 $ perl Makefile.PL $ make && make install |
安装完成后会在/usr/local/bin目录下面生成以下脚本文件,前面已经说过这些脚本的作用,这里不再重复。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
[root@mha ~]# ll /usr/local/bin/ total 164 -r-xr-xr-x 1 root root 16381 Feb 8 14:09 apply_diff_relay_logs -r-xr-xr-x 1 root root 4807 Feb 8 14:09 filter_mysqlbinlog -r-xr-xr-x 1 root root 1995 Feb 8 14:10 masterha_check_repl -r-xr-xr-x 1 root root 1779 Feb 8 14:10 masterha_check_ssh -r-xr-xr-x 1 root root 1865 Feb 8 14:10 masterha_check_status -r-xr-xr-x 1 root root 3201 Feb 8 14:10 masterha_conf_host -r-xr-xr-x 1 root root 2517 Feb 8 14:10 masterha_manager -r-xr-xr-x 1 root root 2165 Feb 8 14:10 masterha_master_monitor -r-xr-xr-x 1 root root 2373 Feb 8 14:10 masterha_master_switch -r-xr-xr-x 1 root root 5171 Feb 8 14:10 masterha_secondary_check -r-xr-xr-x 1 root root 1739 Feb 8 14:10 masterha_stop -r-xr-xr-x 1 root root 8261 Feb 8 14:09 purge_relay_logs -r-xr-xr-x 1 root root 7525 Feb 8 14:09 save_binary_logs -rwxr-xr-x 1 root root 80213 Sep 30 2009 sendEmail |
复制相关脚本到/usr/local/bin目录(软件包解压缩后就有了,不是必须,因为这些脚本不完整,需要自己修改,这是软件开发着留给我们自己发挥的,如果开启下面的任何一个脚本对应的参数,而对应这里的脚本又没有修改,则会抛错,自己被坑的很惨)
1 2 3 4 5 6 7 |
[root@mha ~]# cp -frp mha4mysql-manager-0.57/samples/scripts/* /usr/local/bin/ [root@mha ~]# ll mha4mysql-manager-0.57/samples/scripts/ total 32 -rwxr-xr-x 1 root root 3648 May 31 2015 master_ip_failover -rwxr-xr-x 1 root root 9870 May 31 2015 master_ip_online_change -rwxr-xr-x 1 root root 1186 May 31 2015 power_manager -rwxr-xr-x 1 root root 1360 May 31 2015 send_report |
master_ip_failover:故障自动切换时对vip管理的脚本,不是必须。如果我们使用keepalived的,我们可以自己编写脚本完成对vip的管理,比如监控mysql,如果mysql异常,我们停止keepalived就行,这样vip就会自动漂移。
master_ip_online_change:在线切换时对vip的管理,不是必须,同样可以自行编写简单的shell完成。
power_manager:故障发生后关闭主机的脚本,不是必须。
send_report:因故障切换后发送报警的脚本,不是必须,可自行编写简单的shell完成。
4. 搭建主从复制环境(binlog模式异步复制)
注意:所有slave上的replicate-ignore-db设置必须相同,MHA在启动时候会检测过滤规则,如果过滤规则不同,MHA不启动监控和故障转移。
在整个复制集群中,都是使用基于binlog模式的复制,关于复制构建请看MySQL基于日志点做主从复制(二)。下面只给出一些复制相关操作。
1)在mysql01上创建复制用户
1 2 |
grant replication slave on *.* to 'mysql_slave'@'%' identified by '123456'; flush privileges; |
2)在mysql02上连接mysql01
1 2 |
change master to master_host='10.99.73.9',master_port=3306,master_log_file='mysql-bin.000001',master_log_pos=0,master_user='mysql_slave',master_password='123456'; start slave; |
3)在mysql03上连接mysql01
1 2 |
change master to master_host='10.99.73.9',master_port=3306,master_log_file='mysql-bin.000001',master_log_pos=0,master_user='mysql_slave',master_password='123456'; start slave; |
4)两台slave服务器设置read_only(从库对外提供读服务,只所以没有写进配置文件,是因为随时slave会提升为master)
1 |
set global read_only=1; |
5)创建监控用户(在master上执行,也就是10.99.73.9)
1 2 |
grant all privileges on *.* to 'mha'@'%' identified by '123456'; flush privileges; |
到这里整个复制集群环境已经搭建完毕,剩下的就是配置MHA软件了。
5. 配置MHA Manager
1)创建MHA的工作目录,并且创建相关配置文件(在软件包解压后的目录里面有样例配置文件)。
1 2 3 |
[root@mha ~]# mkdir /etc/masterha -p [root@mha ~]# cp ./mha4mysql-manager-0.57/samples/conf/masterha_default.cnf /etc/ [root@mha ~]# cp ./mha4mysql-manager-0.57/samples/conf/app1.cnf /etc/masterha/ |
修改masterha_default.cnf全局配置文件,适用于所有app.cnf。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
[root@mha ~]# cat /etc/masterha_default.cnf [server default] # 设置监控用户mha user=mha # 设置mha用户的密码 password=123456 # 设置复制环境中的复制用户名 repl_user=mysql_slave # 设置复制用户的密码 repl_password=123456 # 设置ssh的登录用户名 ssh_user=root # 设置ssh的登录端口(不写默认22端口) ssh_port=22 # 设置监控主库,发送ping包的时间间隔,默认是3秒,尝试三次没有回应的时候自动进行failover ping_interval=3 # 设置mysql master保存binlog的目录,以便MHA可以找到master的二进制日志 master_binlog_dir=/data/mysql/log/binlog/ # 设置mysql master在发生切换时保存binlog的目录,在mysql master上创建这个目录(不写默认为/var/tmp) remote_workdir=/data/mysql/mha # 一旦MHA到mysql01的监控之间出现问题,MHA Manager将会尝试从mysql02,mysql03登录到mysql01 secondary_check_script=/usr/local/bin/masterha_secondary_check -s mysql02 -s mysql03 --user=root --port=22 --master_host=mysql01 --master_ip=10.99.73.9 --master_port=3306 --ping_type=SELECT # 设置自动failover时候的切换脚本(脚本有瑕疵,需要自行修改) #master_ip_failover_script=/usr/local/bin/master_ip_failover # 设置手动切换时候的切换脚本(脚本有瑕疵,需要自行修改) #master_ip_online_change_script=/usr/local/bin/master_ip_online_change # 设置发生切换后发送的报警的脚本(可自行编写) #report_script=/usr/local/bin/send_report # 设置故障发生后关闭故障主机脚本(该脚本的主要作用是关闭主机放在发生脑裂,这里没有使用) #shutdown_script="" |
Tips:对于masterha_secondary_check二次检测脚本,master_host和master_ip参数一般都设置最好,没有主机名可以都设置为同一个IP即可。另外,对于二次检测方式ping_type的参数可以有 SELECT 和 INSERT,很容易看出,一个是用读来检测,一个是用写来检测(通过创建库,有风险)。
如果只修改app1.cnf配置文件,只针对单个应用生效,但是app1.cnf的配置参数优先级高于masterha_default.cnf,一般都会在app1.cnf包含masterha_default.cnf所有参数,而不使用masterha_default.cnf文件。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 |
[root@mha ~]# cat /etc/masterha/app1.cnf [server default] # 设置manager的工作目录和日志目录 manager_workdir=/var/log/masterha/app1 manager_log=/var/log/masterha/app1/manager.log user=mha password=123456 repl_user=mysql_slave repl_password=123456 ssh_user=root ssh_port=22 ping_interval=3 master_binlog_dir=/data/mysql/log/binlog/ remote_workdir=/data/mysql/mha report_script=/usr/local/bin/send_report secondary_check_script=/usr/local/bin/masterha_secondary_check -s mysql02 -s mysql03 --user=root --port=22 --master_host=mysql01 --master_ip=10.99.73.9 --master_port=3306 --ping_type=SELECT [server1] hostname=10.99.73.9 port=3306 candidate_master=1 [server2] hostname=10.99.73.10 port=3306 candidate_master=1 [server3] hostname=10.99.73.11 port=3306 no_master=1 |
当candidate_master设置为1时,表示为候选master,如果设置该参数以后,发生主从切换以后将会将此从库提升为主库,即使这个主库不是集群中事件最新的slave。默认情况下如果一个slave落后master 100M的relay logs的话,MHA将不会选择该slave作为一个新的master,因为对于这个slave的恢复需要花费很长时间,通过设置check_repl_delay=0,MHA触发切换在选择一个新的master的时候将会忽略复制延时,这个参数对于设置了candidate_master=1的主机非常有用,因为这个候选主在切换的过程中一定是新的master check_repl_delay=0。
你也可以设置no_master=1,表示此节点不参选master。
2)设置relay log的清除方式(在每个slave节点上)
1 |
mysql> set global relay_log_purge=0; |
注意:MHA在发生切换的过程中,从库的恢复过程中依赖于relay log的相关信息,所以这里要将relay log的自动清除设置为OFF,采用手动清除relay log的方式。在默认情况下,从服务器上的中继日志会在SQL线程执行完毕后被自动删除。但是在MHA环境中,这些中继日志在恢复其他从服务器时可能会被用到,因此需要禁用中继日志的自动删除功能。定期清除中继日志需要考虑到复制延时的问题。在ext3的文件系统下,删除大的文件需要一定的时间,会导致严重的复制延时。为了避免复制延时,需要暂时为中继日志创建硬链接,因为在Linux系统中通过硬链接删除大文件速度会很快。(在mysql数据库中,删除大表时,通常也采用建立硬链接的方式)
MHA节点中包含了pure_relay_logs命令工具,它可以为中继日志创建硬链接,执行SET GLOBAL relay_log_purge=1,等待几秒钟以便SQL线程切换到新的中继日志,再执行SET GLOBAL relay_log_purge=0。
pure_relay_logs脚本参数如下所示:
1 2 3 4 5 |
--user mysql #用户名; --password mysql #密码; --port #端口号; --workdir #指定创建relay log的硬链接的位置,默认是/var/tmp,由于系统不同分区创建硬链接文件会失败,故需要执行硬链接具体位置,成功执行脚本后,硬链接的中继日志文件被删除; --disable_relay_log_purge #默认情况下,如果relay_log_purge=1,脚本会什么都不清理,自动退出,通过设定这个参数,当relay_log_purge=1的情况下会将relay_log_purge设置为0。清理relay log之后,最后将参数设置为OFF; |
3)设置定期清理relay脚本(两台slave服务器)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
$ cat purge_relay_log.sh #!/bin/bash user=root passwd=123456 port=3306 log_dir='/data/masterha/log' work_dir='/data' purge='/usr/local/bin/purge_relay_logs' if [ ! -d $log_dir ];then mkdir $log_dir -p fi $purge --user=$user --password=$passwd --disable_relay_log_purge --port=$port --workdir=$work_dir >> $log_dir/purge_relay_logs.log 2>&1 |
添加到crontab定期执行
1 2 |
$ crontab -l 0 4 * * * /bin/bash /root/purge_relay_log.sh |
purge_relay_logs脚本删除中继日志不会阻塞SQL线程,下面我们手动执行看看什么情况。
1 |
$ purge_relay_logs --user=root --password=123456 --port=3306 -disable_relay_log_purge --workdir=/data/ |
6. 检查SSH配置(masterha_check_ssh)
检查MHA Manger到所有MHA Node的SSH连接状态。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
[root@mha ~]# masterha_check_ssh --conf=/etc/masterha/app1.cnf Mon Feb 20 15:11:07 2017 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Mon Feb 20 15:11:07 2017 - [info] Reading application default configuration from /etc/masterha/app1.cnf.. Mon Feb 20 15:11:07 2017 - [info] Reading server configuration from /etc/masterha/app1.cnf.. Mon Feb 20 15:11:07 2017 - [info] Starting SSH connection tests.. Mon Feb 20 15:11:07 2017 - [debug] Mon Feb 20 15:11:07 2017 - [debug] Connecting via SSH from root@10.99.73.9(10.99.73.9:22) to root@10.99.73.10(10.99.73.10:22).. Mon Feb 20 15:11:07 2017 - [debug] ok. Mon Feb 20 15:11:07 2017 - [debug] Connecting via SSH from root@10.99.73.9(10.99.73.9:22) to root@10.99.73.11(10.99.73.11:22).. Mon Feb 20 15:11:07 2017 - [debug] ok. Mon Feb 20 15:11:08 2017 - [debug] Mon Feb 20 15:11:07 2017 - [debug] Connecting via SSH from root@10.99.73.10(10.99.73.10:22) to root@10.99.73.9(10.99.73.9:22).. Mon Feb 20 15:11:07 2017 - [debug] ok. Mon Feb 20 15:11:07 2017 - [debug] Connecting via SSH from root@10.99.73.10(10.99.73.10:22) to root@10.99.73.11(10.99.73.11:22).. Mon Feb 20 15:11:08 2017 - [debug] ok. Mon Feb 20 15:11:08 2017 - [debug] Mon Feb 20 15:11:08 2017 - [debug] Connecting via SSH from root@10.99.73.11(10.99.73.11:22) to root@10.99.73.9(10.99.73.9:22).. Mon Feb 20 15:11:08 2017 - [debug] ok. Mon Feb 20 15:11:08 2017 - [debug] Connecting via SSH from root@10.99.73.11(10.99.73.11:22) to root@10.99.73.10(10.99.73.10:22).. Mon Feb 20 15:11:08 2017 - [debug] ok. Mon Feb 20 15:11:08 2017 - [info] All SSH connection tests passed successfully. |
可以看见各个节点ssh验证都是ok的。如果你使用了masterha_secondary_check做二次验证,那么这里面提到的机器也必须SSH互信。
MHA恢复slave是并行的,也就意味着恢复需要开启多个ssh通道,所以为了安全和谨慎,在/etc/ssh/sshd_config(默认10)提高MaxStartups的数量,然后重启sshd,或者reload。
7. 检查整个复制环境状况(masterha_check_repl)
暂时先注释配置文件中master_ip_failover_script= /usr/local/bin/master_ip_failover这个选项,不然这个检查过不去的。后面引入keepalived后和修改该脚本以后再开启该选项就ok了。
通过masterha_check_repl脚本查看整个集群的状态。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 |
[root@mha ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf Mon Feb 20 15:11:25 2017 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Mon Feb 20 15:11:25 2017 - [info] Reading application default configuration from /etc/masterha/app1.cnf.. Mon Feb 20 15:11:25 2017 - [info] Reading server configuration from /etc/masterha/app1.cnf.. Mon Feb 20 15:11:25 2017 - [info] MHA::MasterMonitor version 0.57. Mon Feb 20 15:11:25 2017 - [info] GTID failover mode = 0 Mon Feb 20 15:11:25 2017 - [info] Dead Servers: Mon Feb 20 15:11:25 2017 - [info] Alive Servers: Mon Feb 20 15:11:25 2017 - [info] 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:11:25 2017 - [info] 10.99.73.10(10.99.73.10:3306) Mon Feb 20 15:11:25 2017 - [info] 10.99.73.11(10.99.73.11:3306) Mon Feb 20 15:11:25 2017 - [info] Alive Slaves: Mon Feb 20 15:11:25 2017 - [info] 10.99.73.10(10.99.73.10:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:11:25 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:11:25 2017 - [info] Primary candidate for the new Master (candidate_master is set) Mon Feb 20 15:11:25 2017 - [info] 10.99.73.11(10.99.73.11:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:11:25 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:11:25 2017 - [info] Not candidate for the new Master (no_master is set) Mon Feb 20 15:11:25 2017 - [info] Current Alive Master: 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:11:25 2017 - [info] Checking slave configurations.. Mon Feb 20 15:11:25 2017 - [info] Checking replication filtering settings.. Mon Feb 20 15:11:25 2017 - [info] binlog_do_db= , binlog_ignore_db= Mon Feb 20 15:11:25 2017 - [info] Replication filtering check ok. Mon Feb 20 15:11:25 2017 - [info] GTID (with auto-pos) is not supported Mon Feb 20 15:11:25 2017 - [info] Starting SSH connection tests.. Mon Feb 20 15:11:27 2017 - [info] All SSH connection tests passed successfully. Mon Feb 20 15:11:27 2017 - [info] Checking MHA Node version.. Mon Feb 20 15:11:27 2017 - [info] Version check ok. Mon Feb 20 15:11:27 2017 - [info] Checking SSH publickey authentication settings on the current master.. Mon Feb 20 15:11:27 2017 - [info] HealthCheck: SSH to 10.99.73.9 is reachable. Mon Feb 20 15:11:27 2017 - [info] Master MHA Node version is 0.57. Mon Feb 20 15:11:27 2017 - [info] Checking recovery script configurations on 10.99.73.9(10.99.73.9:3306).. Mon Feb 20 15:11:27 2017 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql/log/binlog --output_file=/data/mysql/mha/save_binary_logs_test --manager_version=0.57 --start_file=mysql-bin.000002 Mon Feb 20 15:11:27 2017 - [info] Connecting to root@10.99.73.9(10.99.73.9:22).. Creating /data/mysql/mha if not exists.. ok. Checking output directory is accessible or not.. ok. Binlog found at /data/mysql/log/binlog, up to mysql-bin.000002 Mon Feb 20 15:11:28 2017 - [info] Binlog setting check done. Mon Feb 20 15:11:28 2017 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers.. Mon Feb 20 15:11:28 2017 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='mha' --slave_host=10.99.73.10 --slave_ip=10.99.73.10 --slave_port=3306 --workdir=/data/mysql/mha --target_version=5.7.17-log --manager_version=0.57 --relay_dir=/data/mysql/3306/log/relaylog --current_relay_log=relay-log.000002 --slave_pass=xxx Mon Feb 20 15:11:28 2017 - [info] Connecting to root@10.99.73.10(10.99.73.10:22).. Checking slave recovery environment settings.. Relay log found at /data/mysql/3306/log/relaylog, up to relay-log.000003 Temporary relay log file is /data/mysql/3306/log/relaylog/relay-log.000003 Testing mysql connection and privileges..mysql: [Warning] Using a password on the command line interface can be insecure. done. Testing mysqlbinlog output.. done. Cleaning up test file(s).. done. Mon Feb 20 15:11:28 2017 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='mha' --slave_host=10.99.73.11 --slave_ip=10.99.73.11 --slave_port=3306 --workdir=/data/mysql/mha --target_version=5.7.17-log --manager_version=0.57 --relay_dir=/data/mysql/3306/log/relaylog --current_relay_log=relay-log.000002 --slave_pass=xxx Mon Feb 20 15:11:28 2017 - [info] Connecting to root@10.99.73.11(10.99.73.11:22).. Checking slave recovery environment settings.. Relay log found at /data/mysql/3306/log/relaylog, up to relay-log.000003 Temporary relay log file is /data/mysql/3306/log/relaylog/relay-log.000003 Testing mysql connection and privileges..mysql: [Warning] Using a password on the command line interface can be insecure. done. Testing mysqlbinlog output.. done. Cleaning up test file(s).. done. Mon Feb 20 15:11:28 2017 - [info] Slaves settings check done. Mon Feb 20 15:11:28 2017 - [info] 10.99.73.9(10.99.73.9:3306) (current master) +--10.99.73.10(10.99.73.10:3306) +--10.99.73.11(10.99.73.11:3306) Mon Feb 20 15:11:28 2017 - [info] Checking replication health on 10.99.73.10.. Mon Feb 20 15:11:28 2017 - [info] ok. Mon Feb 20 15:11:28 2017 - [info] Checking replication health on 10.99.73.11.. Mon Feb 20 15:11:28 2017 - [info] ok. Mon Feb 20 15:11:28 2017 - [warning] master_ip_failover_script is not defined. Mon Feb 20 15:11:28 2017 - [warning] shutdown_script is not defined. Mon Feb 20 15:11:28 2017 - [info] Got exit code 0 (Not master dead). MySQL Replication Health is OK. |
如果发现如下错误:
1 2 3 |
Can't exec "mysqlbinlog": No such file or directory at /usr/local/share/perl5/MHA/BinlogManager.pm line 99. mysqlbinlog version not found! Testing mysql connection and privileges..sh: mysql: command not found |
提示很明显,找不到mysqlbinlog和mysql,添加一个环境变量就可以解决了。
如果检测到失败,会返回错误:
* 复制过滤规则不同,SQL线程有错误,复制环境中存在多个master 。
* IO线程stopped状态。
* SQL线程stopped状态。
* 复制延迟太长(–seconds_behind_master=(seconds))。
* 配置文件有问题。
masterha_check_repl的一些参数,可以加到配置文件中。
1 2 |
--seconds_behind_master=(seconds) #masterha_check_repl会检测延迟,如果超过这个描述,会报错,默认是30秒; --skip_check_ssh #通过masterha_check_ssh去检测ssh的环境,如果设置了该参数,可以忽略SSH的检测; |
8. 检查MHA Manager的状态(masterha_check_status)
通过master_check_status脚本查看Manager的状态。
1 2 |
[root@mha ~]# masterha_check_status --conf=/etc/masterha/app1.cnf app1 is stopped(2:NOT_RUNNING). |
注意:如果正常,会显示”PING_OK”,否则会显示”NOT_RUNNING”,这代表MHA监控没有开启。
9. 开启MHA Manager监控(masterha_manager)
1 |
[root@mha ~]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover & |
启动参数介绍:
--conf
--remove_dead_master_conf
设置了这个参数后,如果MHA failover结束后,MHA Manager会自动在配置文件中删除dead master的相关项。如果不设置,由于dead master的配置还存在文件中,那么当MHA failover后,当再次restart MHA manager后,会报错(there is a dead slave previous dead master)。
--manger_log
日志存放位置。
--ignore_last_failover
在缺省情况下,如果MHA检测到连续发生宕机,且两次宕机间隔不足8小时的话,则不会进行Failover,之所以这样限制是为了避免ping-pong效应。该参数代表忽略上次MHA触发切换产生的文件,默认情况下,MHA发生切换后会在日志目录,也就是上面我设置的/data产生app1.failover.complete文件,下次再次切换的时候如果发现该目录下存在该文件将不允许触发切换,除非在第一次切换后收到删除该文件,为了方便,这里设置为–ignore_last_failover。
--wait_on_monitor_error=(seconds)
这个功能主要用在自动监控和failover上,如果在监控期间发生了错误,masterha_manager会等待wait_on_monitor_error=N 秒,然后退出,默认是0
。如果遇到错误,在restart监控前最好还是等等。
--ignore_fail_on_start
默认情况下,如果一个或者多个slave down掉了,master monitor进程就会停掉,就算你设置了ignore_fail。如果设置了–ignore_fail_on_start参数,ignore_fail标记了slave挂掉也不会让master monitor进程停掉。
查看MHA Manager监控是否正常:
1 2 |
[root@mha ~]# masterha_check_status --conf=/etc/masterha/app1.cnf app1 (pid:30768) is running(0:PING_OK), master:10.99.73.9 |
可以看见已经在监控了,而且master的主机为10.99.73.9。
PS:如果起不来,注释master_ip_failover_script。
10. 查看启动日志
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 |
Mon Feb 20 15:09:42 2017 - [info] MHA::MasterMonitor version 0.57. Mon Feb 20 15:09:42 2017 - [info] GTID failover mode = 0 Mon Feb 20 15:09:42 2017 - [info] Dead Servers: Mon Feb 20 15:09:42 2017 - [info] Alive Servers: Mon Feb 20 15:09:42 2017 - [info] 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:09:42 2017 - [info] 10.99.73.10(10.99.73.10:3306) Mon Feb 20 15:09:42 2017 - [info] 10.99.73.11(10.99.73.11:3306) Mon Feb 20 15:09:42 2017 - [info] Alive Slaves: Mon Feb 20 15:09:42 2017 - [info] 10.99.73.10(10.99.73.10:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:09:42 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:09:42 2017 - [info] Primary candidate for the new Master (candidate_master is set) Mon Feb 20 15:09:42 2017 - [info] 10.99.73.11(10.99.73.11:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:09:42 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:09:42 2017 - [info] Not candidate for the new Master (no_master is set) Mon Feb 20 15:09:42 2017 - [info] Current Alive Master: 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:09:42 2017 - [info] Checking slave configurations.. Mon Feb 20 15:09:42 2017 - [info] read_only=1 is not set on slave 10.99.73.10(10.99.73.10:3306). Mon Feb 20 15:09:42 2017 - [warning] relay_log_purge=0 is not set on slave 10.99.73.10(10.99.73.10:3306). Mon Feb 20 15:09:42 2017 - [info] read_only=1 is not set on slave 10.99.73.11(10.99.73.11:3306). Mon Feb 20 15:09:42 2017 - [warning] relay_log_purge=0 is not set on slave 10.99.73.11(10.99.73.11:3306). Mon Feb 20 15:09:42 2017 - [info] Checking replication filtering settings.. Mon Feb 20 15:09:42 2017 - [info] binlog_do_db= , binlog_ignore_db= Mon Feb 20 15:09:42 2017 - [info] Replication filtering check ok. Mon Feb 20 15:09:42 2017 - [info] GTID (with auto-pos) is not supported Mon Feb 20 15:09:42 2017 - [info] Starting SSH connection tests.. Mon Feb 20 15:09:43 2017 - [info] All SSH connection tests passed successfully. Mon Feb 20 15:09:43 2017 - [info] Checking MHA Node version.. Mon Feb 20 15:09:44 2017 - [info] Version check ok. Mon Feb 20 15:09:44 2017 - [info] Checking SSH publickey authentication settings on the current master.. Mon Feb 20 15:09:44 2017 - [info] HealthCheck: SSH to 10.99.73.9 is reachable. Mon Feb 20 15:09:44 2017 - [info] Master MHA Node version is 0.57. Mon Feb 20 15:09:44 2017 - [info] Checking recovery script configurations on 10.99.73.9(10.99.73.9:3306).. Mon Feb 20 15:09:44 2017 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql/log/binlog --output_file=/data/mysql/mha/save_binary_logs_test --manager_version=0.57 --start_file=mysql-bin.000002 Mon Feb 20 15:09:44 2017 - [info] Connecting to root@10.99.73.9(10.99.73.9:22).. Creating /data/mysql/mha if not exists.. ok. Checking output directory is accessible or not.. ok. Binlog found at /data/mysql/log/binlog, up to mysql-bin.000002 Mon Feb 20 15:09:44 2017 - [info] Binlog setting check done. Mon Feb 20 15:09:44 2017 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers.. Mon Feb 20 15:09:44 2017 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='mha' --slave_host=10.99.73.10 --slave_ip=10.99.73.10 --slave_port=3306 --workdir=/data/mysql/mha --target_version=5.7.17-log --manager_version=0.57 --relay_dir=/data/mysql/3306/log/relaylog --current_relay_log=relay-log.000002 --slave_pass=xxx Mon Feb 20 15:09:44 2017 - [info] Connecting to root@10.99.73.10(10.99.73.10:22).. Checking slave recovery environment settings.. Relay log found at /data/mysql/3306/log/relaylog, up to relay-log.000003 Temporary relay log file is /data/mysql/3306/log/relaylog/relay-log.000003 Testing mysql connection and privileges..mysql: [Warning] Using a password on the command line interface can be insecure. done. Testing mysqlbinlog output.. done. Cleaning up test file(s).. done. Mon Feb 20 15:09:45 2017 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='mha' --slave_host=10.99.73.11 --slave_ip=10.99.73.11 --slave_port=3306 --workdir=/data/mysql/mha --target_version=5.7.17-log --manager_version=0.57 --relay_dir=/data/mysql/3306/log/relaylog --current_relay_log=relay-log.000002 --slave_pass=xxx Mon Feb 20 15:09:45 2017 - [info] Connecting to root@10.99.73.11(10.99.73.11:22).. Checking slave recovery environment settings.. Relay log found at /data/mysql/3306/log/relaylog, up to relay-log.000003 Temporary relay log file is /data/mysql/3306/log/relaylog/relay-log.000003 Testing mysql connection and privileges..mysql: [Warning] Using a password on the command line interface can be insecure. done. Testing mysqlbinlog output.. done. Cleaning up test file(s).. done. Mon Feb 20 15:09:45 2017 - [info] Slaves settings check done. Mon Feb 20 15:09:45 2017 - [info] 10.99.73.9(10.99.73.9:3306) (current master) +--10.99.73.10(10.99.73.10:3306) +--10.99.73.11(10.99.73.11:3306) Mon Feb 20 15:09:45 2017 - [warning] master_ip_failover_script is not defined. Mon Feb 20 15:09:45 2017 - [warning] shutdown_script is not defined. Mon Feb 20 15:09:45 2017 - [info] Set master ping interval 3 seconds. Mon Feb 20 15:09:45 2017 - [info] Set secondary check script: /usr/local/bin/masterha_secondary_check -s mysql02 -s mysql03 --user=root --master_host=mysql01 --master_ip=10.99.73.9 --master_port=3306 Mon Feb 20 15:09:45 2017 - [info] Starting ping health check on 10.99.73.9(10.99.73.9:3306).. Mon Feb 20 15:09:45 2017 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond.. |
其中”Ping(SELECT) succeeded, waiting until MySQL doesn’t respond..”说明整个系统已经开始监控了。
11. 关闭MHA Manage监控
关闭很简单,使用masterha_stop命令完成。
1 2 |
[root@mha ~]# masterha_stop --conf=/etc/masterha/app1.cnf Stopped app1 successfully. |
12. 简单的自动failover测试
在整个MHA加复制集群都搞定的情况下,MHA正常监控下,就可以进行failover测试了。
最好可以先对mysql master进行压测,打入大量的数据(我测试过程中发现如果没有数据直接切换MHA会卡主),使用sysbench生成测试数据(使用yum快速安装)
1 |
$ yum install sysbench -y |
在主库(10.99.73.9)上进行sysbench数据生成,在sbtest库下生成sbtest表,共100W记录。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
$ sysbench /usr/share/sysbench/oltp_read_only.lua \ --mysql-host=127.0.0.1 \ --mysql-port=3306 \ --mysql-user=mha \ --mysql-password=123456 \ --mysql-socket=/data/mysql/3306/mysql.sock \ --mysql-db=sbtest \ --db-driver=mysql \ --tables=1 \ --table-size=1000000 \ --report-interval=10 \ --threads=128 \ --time=120 \ prepare |
1 2 3 4 5 6 |
mysql> select count(*) from sbtest.sbtest; +----------+ | count(*) | +----------+ | 1000000 | +----------+ |
然后可以选择直接关闭mysql master,看mha的切换。
1 2 |
[root@mysql01 ~]# mysqladmin -S /data/mysql/3306/mysql.sock shutdown [1]+ Done nohup mysqld --defaults-file=/data/mysql/3306/conf/my.cnf |
测试过程中,发现当mha进行一次切换后,其进程就自动退出了。观察其配置文件,会发现mysql01的配置文件被删除了,是为了防止误操作。
可以看一下整个failover过程,日志信息如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 |
[root@mha ~]# cat /var/log/masterha/app1/manager.log Mon Feb 20 15:22:21 2017 - [warning] Got error on MySQL select ping: 2013 (Lost connection to MySQL server during query) Mon Feb 20 15:22:21 2017 - [info] Executing secondary network check script: /usr/local/bin/masterha_secondary_check -s mysql02 -s mysql03 --user=root --master_host=mysql01 --master_ip=10.99.73.9 --master_port=3306 --user=root --master_host=10.99.73.9 --master_ip=10.99.73.9 --master_port=3306 --master_user=mha --master_password=123456 --ping_type=SELECT Mon Feb 20 15:22:21 2017 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql/log/binlog --output_file=/data/mysql/mha/save_binary_logs_test --manager_version=0.57 --binlog_prefix=mysql-bin Mon Feb 20 15:22:21 2017 - [info] HealthCheck: SSH to 10.99.73.9 is reachable. Monitoring server mysql02 is reachable, Master is not reachable from mysql02. OK. Monitoring server mysql03 is reachable, Master is not reachable from mysql03. OK. Mon Feb 20 15:22:21 2017 - [info] Master is not reachable from all other monitoring servers. Failover should start. Mon Feb 20 15:22:24 2017 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.99.73.9' (111)) Mon Feb 20 15:22:24 2017 - [warning] Connection failed 2 time(s).. Mon Feb 20 15:22:27 2017 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.99.73.9' (111)) Mon Feb 20 15:22:27 2017 - [warning] Connection failed 3 time(s).. Mon Feb 20 15:22:30 2017 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.99.73.9' (111)) Mon Feb 20 15:22:30 2017 - [warning] Connection failed 4 time(s).. Mon Feb 20 15:22:30 2017 - [warning] Master is not reachable from health checker! Mon Feb 20 15:22:30 2017 - [warning] Master 10.99.73.9(10.99.73.9:3306) is not reachable! Mon Feb 20 15:22:30 2017 - [warning] SSH is reachable. Mon Feb 20 15:22:30 2017 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/masterha/app1.cnf again, and trying to connect to all servers to check server status.. Mon Feb 20 15:22:30 2017 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Mon Feb 20 15:22:30 2017 - [info] Reading application default configuration from /etc/masterha/app1.cnf.. Mon Feb 20 15:22:30 2017 - [info] Reading server configuration from /etc/masterha/app1.cnf.. Mon Feb 20 15:22:30 2017 - [info] GTID failover mode = 0 Mon Feb 20 15:22:30 2017 - [info] Dead Servers: Mon Feb 20 15:22:30 2017 - [info] 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:30 2017 - [info] Alive Servers: Mon Feb 20 15:22:30 2017 - [info] 10.99.73.10(10.99.73.10:3306) Mon Feb 20 15:22:30 2017 - [info] 10.99.73.11(10.99.73.11:3306) Mon Feb 20 15:22:30 2017 - [info] Alive Slaves: Mon Feb 20 15:22:30 2017 - [info] 10.99.73.10(10.99.73.10:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:22:30 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:30 2017 - [info] Primary candidate for the new Master (candidate_master is set) Mon Feb 20 15:22:30 2017 - [info] 10.99.73.11(10.99.73.11:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:22:30 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:30 2017 - [info] Not candidate for the new Master (no_master is set) Mon Feb 20 15:22:30 2017 - [info] Checking slave configurations.. Mon Feb 20 15:22:30 2017 - [info] Checking replication filtering settings.. Mon Feb 20 15:22:30 2017 - [info] Replication filtering check ok. Mon Feb 20 15:22:30 2017 - [info] Master is down! Mon Feb 20 15:22:30 2017 - [info] Terminating monitoring script. Mon Feb 20 15:22:30 2017 - [info] Got exit code 20 (Master dead). Mon Feb 20 15:22:30 2017 - [info] MHA::MasterFailover version 0.57. Mon Feb 20 15:22:30 2017 - [info] Starting master failover. Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] * Phase 1: Configuration Check Phase.. Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] GTID failover mode = 0 Mon Feb 20 15:22:30 2017 - [info] Dead Servers: Mon Feb 20 15:22:30 2017 - [info] 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:30 2017 - [info] Checking master reachability via MySQL(double check)... Mon Feb 20 15:22:30 2017 - [info] ok. Mon Feb 20 15:22:30 2017 - [info] Alive Servers: Mon Feb 20 15:22:30 2017 - [info] 10.99.73.10(10.99.73.10:3306) Mon Feb 20 15:22:30 2017 - [info] 10.99.73.11(10.99.73.11:3306) Mon Feb 20 15:22:30 2017 - [info] Alive Slaves: Mon Feb 20 15:22:30 2017 - [info] 10.99.73.10(10.99.73.10:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:22:30 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:30 2017 - [info] Primary candidate for the new Master (candidate_master is set) Mon Feb 20 15:22:30 2017 - [info] 10.99.73.11(10.99.73.11:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:22:30 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:30 2017 - [info] Not candidate for the new Master (no_master is set) Mon Feb 20 15:22:30 2017 - [info] Starting Non-GTID based failover. Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] ** Phase 1: Configuration Check Phase completed. Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] * Phase 2: Dead Master Shutdown Phase.. Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] Forcing shutdown so that applications never connect to the current master.. Mon Feb 20 15:22:30 2017 - [warning] master_ip_failover_script is not set. Skipping invalidating dead master IP address. Mon Feb 20 15:22:30 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master. Mon Feb 20 15:22:30 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed. Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] * Phase 3: Master Recovery Phase.. Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase.. # 寻找所有slave中最新的slave,为10.99.73.11 - mysql-bin.000002:97142702,但它被设置为不能成为最新的master # 寻找所有slave中最老的slave,为10.99.73.10 - mysql-bin.000002:90008622,它被设置为可以成为最新的master Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] The latest binary log file/position on all slaves is mysql-bin.000002:97142702 Mon Feb 20 15:22:30 2017 - [info] Latest slaves (Slaves that received relay log files to the latest): Mon Feb 20 15:22:30 2017 - [info] 10.99.73.11(10.99.73.11:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:22:30 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:30 2017 - [info] Not candidate for the new Master (no_master is set) Mon Feb 20 15:22:30 2017 - [info] The oldest binary log file/position on all slaves is mysql-bin.000002:90008622 Mon Feb 20 15:22:30 2017 - [info] Oldest slaves: Mon Feb 20 15:22:30 2017 - [info] 10.99.73.10(10.99.73.10:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:22:30 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:30 2017 - [info] Primary candidate for the new Master (candidate_master is set) Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase.. # 保存最新的slave 10.99.73.11与dead master 10.99.73.9之间的binlog日志 # save_binary_logs --command=save --start_file=mysql-bin.000002 --start_pos=97142702 Mon Feb 20 15:22:30 2017 - [info] Mon Feb 20 15:22:30 2017 - [info] Fetching dead master's binary logs.. Mon Feb 20 15:22:30 2017 - [info] Executing command on the dead master 10.99.73.9(10.99.73.9:3306): save_binary_logs --command=save --start_file=mysql-bin.000002 --start_pos=97142702 --binlog_dir=/data/mysql/log/binlog --output_file=/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.57 Creating /data/mysql/mha if not exists.. ok. Concat binary/relay logs from mysql-bin.000002 pos 97142702 to mysql-bin.000002 EOF into /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog .. Binlog Checksum enabled Dumping binlog format description event, from position 0 to 154.. ok. Dumping effective binlog data from /data/mysql/log/binlog/mysql-bin.000002 position 97142702 to tail(123299186).. ok. Binlog Checksum enabled Concat succeeded. Mon Feb 20 15:22:31 2017 - [info] scp from root@10.99.73.9:/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog to local:/var/log/masterha/app1/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog succeeded. Mon Feb 20 15:22:31 2017 - [info] HealthCheck: SSH to 10.99.73.10 is reachable. Mon Feb 20 15:22:32 2017 - [info] HealthCheck: SSH to 10.99.73.11 is reachable. Mon Feb 20 15:22:32 2017 - [info] Mon Feb 20 15:22:32 2017 - [info] * Phase 3.3: Determining New Master Phase.. # 确认新的master为10.99.73.10 # 在最新的slave 10.99.73.11上的relay log中寻找最新的slave与最老的slave之间差异的binlog日志 # apply_diff_relay_logs --command=find --latest_mlf=mysql-bin.000002 --latest_rmlp=97142702 --target_mlf=mysql-bin.000002 --target_rmlp=90008622 Mon Feb 20 15:22:32 2017 - [info] Mon Feb 20 15:22:32 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves.. Mon Feb 20 15:22:32 2017 - [info] Checking whether 10.99.73.11 has relay logs from the oldest position.. Mon Feb 20 15:22:32 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=mysql-bin.000002 --latest_rmlp=97142702 --target_mlf=mysql-bin.000002 --target_rmlp=90008622 --server_id=103309 --workdir=/data/mysql/mha --timestamp=20170220152230 --manager_version=0.57 --relay_dir=/data/mysql/3306/log/relaylog --current_relay_log=relay-log.000003 : Relay log found at /data/mysql/3306/log/relaylog, up to relay-log.000003 Fast relay log position search succeeded. Target relay log file/position found. start_file:relay-log.000003, start_pos:90008835. Target relay log FOUND! Mon Feb 20 15:22:32 2017 - [info] OK. 10.99.73.11 has all relay logs. Mon Feb 20 15:22:32 2017 - [info] Searching new master from slaves.. Mon Feb 20 15:22:32 2017 - [info] Candidate masters from the configuration file: Mon Feb 20 15:22:32 2017 - [info] 10.99.73.10(10.99.73.10:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:22:32 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:32 2017 - [info] Primary candidate for the new Master (candidate_master is set) Mon Feb 20 15:22:32 2017 - [info] Non-candidate masters: Mon Feb 20 15:22:32 2017 - [info] 10.99.73.11(10.99.73.11:3306) Version=5.7.17-log (oldest major version between slaves) log-bin:enabled Mon Feb 20 15:22:32 2017 - [info] Replicating from 10.99.73.9(10.99.73.9:3306) Mon Feb 20 15:22:32 2017 - [info] Not candidate for the new Master (no_master is set) Mon Feb 20 15:22:32 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events.. Mon Feb 20 15:22:32 2017 - [info] Not found. Mon Feb 20 15:22:32 2017 - [info] Searching from all candidate_master slaves.. Mon Feb 20 15:22:32 2017 - [info] New master is 10.99.73.10(10.99.73.10:3306) Mon Feb 20 15:22:32 2017 - [info] Starting master failover.. Mon Feb 20 15:22:32 2017 - [info] From: 10.99.73.9(10.99.73.9:3306) (current master) +--10.99.73.10(10.99.73.10:3306) +--10.99.73.11(10.99.73.11:3306) To: 10.99.73.10(10.99.73.10:3306) (new master) +--10.99.73.11(10.99.73.11:3306) Mon Feb 20 15:22:32 2017 - [info] Mon Feb 20 15:22:32 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase.. # 新master与最新slave之间差异binlog的生成,从最新slave的relay log中提取 # apply_diff_relay_logs --command=generate_and_send --latest_mlf=mysql-bin.000002 --latest_rmlp=97142702 --target_mlf=mysql-bin.000002 --target_rmlp=90008622 Mon Feb 20 15:22:32 2017 - [info] Mon Feb 20 15:22:32 2017 - [info] Server 10.99.73.10 received relay logs up to: mysql-bin.000002:90008622 Mon Feb 20 15:22:32 2017 - [info] Need to get diffs from the latest slave(10.99.73.11) up to: mysql-bin.000002:97142702 (using the latest slave's relay logs) Mon Feb 20 15:22:32 2017 - [info] Connecting to the latest slave host 10.99.73.11, generating diff relay log files.. Mon Feb 20 15:22:32 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=10.99.73.10 --latest_mlf=mysql-bin.000002 --latest_rmlp=97142702 --target_mlf=mysql-bin.000002 --target_rmlp=90008622 --server_id=103309 --diff_file_readtolatest=/data/mysql/mha/relay_from_read_to_latest_10.99.73.10_3306_20170220152230.binlog --workdir=/data/mysql/mha --timestamp=20170220152230 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.57 --relay_dir=/data/mysql/3306/log/relaylog --current_relay_log=relay-log.000003 Mon Feb 20 15:22:33 2017 - [info] Relay log found at /data/mysql/3306/log/relaylog, up to relay-log.000003 Fast relay log position search succeeded. Target relay log file/position found. start_file:relay-log.000003, start_pos:90008835. Concat binary/relay logs from relay-log.000003 pos 90008835 to relay-log.000003 EOF into /data/mysql/mha/relay_from_read_to_latest_10.99.73.10_3306_20170220152230.binlog .. Binlog Checksum enabled Binlog Checksum enabled Dumping binlog format description event, from position 0 to 367.. ok. Dumping effective binlog data from /data/mysql/3306/log/relaylog/relay-log.000003 position 90008835 to tail(97142915).. ok. Binlog Checksum enabled Binlog Checksum enabled Concat succeeded. Generating diff relay log succeeded. Saved at /data/mysql/mha/relay_from_read_to_latest_10.99.73.10_3306_20170220152230.binlog . scp mysql03:/data/mysql/mha/relay_from_read_to_latest_10.99.73.10_3306_20170220152230.binlog to root@10.99.73.10(22) succeeded. Mon Feb 20 15:22:33 2017 - [info] Generating diff files succeeded. Mon Feb 20 15:22:33 2017 - [info] Sending binlog.. Mon Feb 20 15:22:34 2017 - [info] scp from local:/var/log/masterha/app1/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog to root@10.99.73.10:/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog succeeded. Mon Feb 20 15:22:34 2017 - [info] Mon Feb 20 15:22:34 2017 - [info] * Phase 3.4: Master Log Apply Phase.. # 新master应用与最新slave生成的差异binlog - relay_from_read_to_latest_10.99.73.10_3306_20170220152230.binlog # 新master应用最新slave与dead master生成的差异binlog - saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog # 获取新master binlog and position信息 Mon Feb 20 15:22:34 2017 - [info] Mon Feb 20 15:22:34 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed. Mon Feb 20 15:22:34 2017 - [info] Starting recovery on 10.99.73.10(10.99.73.10:3306).. Mon Feb 20 15:22:34 2017 - [info] Generating diffs succeeded. Mon Feb 20 15:22:34 2017 - [info] Waiting until all relay logs are applied. Mon Feb 20 15:22:34 2017 - [info] done. Mon Feb 20 15:22:34 2017 - [info] Getting slave status.. Mon Feb 20 15:22:34 2017 - [info] This slave(10.99.73.10)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000002:90008622). No need to recover from Exec_Master_Log_Pos. Mon Feb 20 15:22:34 2017 - [info] Connecting to the target slave host 10.99.73.10, running recover script.. Mon Feb 20 15:22:34 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='mha' --slave_host=10.99.73.10 --slave_ip=10.99.73.10 --slave_port=3306 --apply_files=/data/mysql/mha/relay_from_read_to_latest_10.99.73.10_3306_20170220152230.binlog,/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog --workdir=/data/mysql/mha --target_version=5.7.17-log --timestamp=20170220152230 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.57 --slave_pass=xxx Mon Feb 20 15:22:37 2017 - [info] Concat all apply files to /data/mysql/mha/total_binlog_for_10.99.73.10_3306.20170220152230.binlog .. Copying the first binlog file /data/mysql/mha/relay_from_read_to_latest_10.99.73.10_3306_20170220152230.binlog to /data/mysql/mha/total_binlog_for_10.99.73.10_3306.20170220152230.binlog.. ok. Dumping binlog head events (rotate events), skipping format description events from /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog.. Binlog Checksum enabled dumped up to pos 154. ok. /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog has effective binlog events from pos 154. Dumping effective binlog data from /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog position 154 to tail(26156638).. ok. Concat succeeded. All apply target binary logs are concatinated at /data/mysql/mha/total_binlog_for_10.99.73.10_3306.20170220152230.binlog . MySQL client version is 5.7.17. Using --binary-mode. Applying differential binary/relay log files /data/mysql/mha/relay_from_read_to_latest_10.99.73.10_3306_20170220152230.binlog,/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog on 10.99.73.10:3306. This may take long time... Applying log files succeeded. Mon Feb 20 15:22:37 2017 - [info] All relay logs were successfully applied. Mon Feb 20 15:22:37 2017 - [info] Getting new master's binlog name and position.. Mon Feb 20 15:22:37 2017 - [info] mysql-bin.000002:62292198 Mon Feb 20 15:22:37 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.99.73.10', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000002', MASTER_LOG_POS=62292198, MASTER_USER='mysql_slave', MASTER_PASSWORD='xxx'; Mon Feb 20 15:22:37 2017 - [warning] master_ip_failover_script is not set. Skipping taking over new master IP address. Mon Feb 20 15:22:37 2017 - [info] Setting read_only=0 on 10.99.73.10(10.99.73.10:3306).. Mon Feb 20 15:22:37 2017 - [info] ok. Mon Feb 20 15:22:37 2017 - [info] ** Finished master recovery successfully. Mon Feb 20 15:22:37 2017 - [info] * Phase 3: Master Recovery Phase completed. Mon Feb 20 15:22:37 2017 - [info] Mon Feb 20 15:22:37 2017 - [info] * Phase 4: Slaves Recovery Phase.. Mon Feb 20 15:22:37 2017 - [info] Mon Feb 20 15:22:37 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase.. # 多线程生成其余slave与最新slave的差异binlog Mon Feb 20 15:22:37 2017 - [info] Mon Feb 20 15:22:37 2017 - [info] -- Slave diff file generation on host 10.99.73.11(10.99.73.11:3306) started, pid: 9926. Check tmp log /var/log/masterha/app1/10.99.73.11_3306_20170220152230.log if it takes time.. Mon Feb 20 15:22:37 2017 - [info] Mon Feb 20 15:22:37 2017 - [info] Log messages from 10.99.73.11 ... Mon Feb 20 15:22:37 2017 - [info] Mon Feb 20 15:22:37 2017 - [info] This server has all relay logs. No need to generate diff files from the latest slave. Mon Feb 20 15:22:37 2017 - [info] End of log messages from 10.99.73.11. Mon Feb 20 15:22:37 2017 - [info] -- 10.99.73.11(10.99.73.11:3306) has the latest relay log events. Mon Feb 20 15:22:37 2017 - [info] Generating relay diff files from the latest slave succeeded. Mon Feb 20 15:22:37 2017 - [info] Mon Feb 20 15:22:37 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase.. # 多线程恢复与最新slave生成的差异binlog,此时所有slave数据应该是一致的了 # 从manager节点发送最新slave与dead master生成的差异binlog到所有slave节点,并且恢复 Mon Feb 20 15:22:37 2017 - [info] Mon Feb 20 15:22:37 2017 - [info] -- Slave recovery on host 10.99.73.11(10.99.73.11:3306) started, pid: 9928. Check tmp log /var/log/masterha/app1/10.99.73.11_3306_20170220152230.log if it takes time.. Mon Feb 20 15:22:40 2017 - [info] Mon Feb 20 15:22:40 2017 - [info] Log messages from 10.99.73.11 ... Mon Feb 20 15:22:40 2017 - [info] Mon Feb 20 15:22:37 2017 - [info] Sending binlog.. Mon Feb 20 15:22:37 2017 - [info] scp from local:/var/log/masterha/app1/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog to root@10.99.73.11:/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog succeeded. Mon Feb 20 15:22:37 2017 - [info] Starting recovery on 10.99.73.11(10.99.73.11:3306).. Mon Feb 20 15:22:37 2017 - [info] Generating diffs succeeded. Mon Feb 20 15:22:37 2017 - [info] Waiting until all relay logs are applied. Mon Feb 20 15:22:37 2017 - [info] done. Mon Feb 20 15:22:37 2017 - [info] Getting slave status.. Mon Feb 20 15:22:37 2017 - [info] This slave(10.99.73.11)'s Exec_Master_Log_Pos(mysql-bin.000002:96173537) does not equal to Read_Master_Log_Pos(mysql-bin.000002:97142702). It is likely that relay log was cut during transaction. Need to recover from Exec_Master_Log_Pos. Mon Feb 20 15:22:37 2017 - [info] Saving local relay logs from exec pos to read pos on 10.99.73.11: from relay-log.000003:96173750 to the end of the relay log.. Mon Feb 20 15:22:37 2017 - [info] Executing command : save_binary_logs --command=save --start_file=relay-log.000003 --start_pos=96173750 --output_file=/data/mysql/mha/relay_from_exec_to_read_10.99.73.11_3306_20170220152230.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.57 --binlog_dir=/data/mysql/3306/log/relaylog Mon Feb 20 15:22:38 2017 - [info] Creating /data/mysql/mha if not exists.. ok. Concat binary/relay logs from relay-log.000003 pos 96173750 to relay-log.000003 EOF into /data/mysql/mha/relay_from_exec_to_read_10.99.73.11_3306_20170220152230.binlog .. Binlog Checksum enabled Binlog Checksum enabled Dumping binlog format description event, from position 0 to 367.. ok. Dumping effective binlog data from /data/mysql/3306/log/relaylog/relay-log.000003 position 96173750 to tail(97142915).. ok. Binlog Checksum enabled Binlog Checksum enabled Concat succeeded. Mon Feb 20 15:22:38 2017 - [info] Connecting to the target slave host 10.99.73.11, running recover script.. Mon Feb 20 15:22:38 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='mha' --slave_host=10.99.73.11 --slave_ip=10.99.73.11 --slave_port=3306 --apply_files=/data/mysql/mha/relay_from_exec_to_read_10.99.73.11_3306_20170220152230.binlog,/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog --workdir=/data/mysql/mha --target_version=5.7.17-log --timestamp=20170220152230 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.57 --slave_pass=xxx Mon Feb 20 15:22:40 2017 - [info] Concat all apply files to /data/mysql/mha/total_binlog_for_10.99.73.11_3306.20170220152230.binlog .. Copying the first binlog file /data/mysql/mha/relay_from_exec_to_read_10.99.73.11_3306_20170220152230.binlog to /data/mysql/mha/total_binlog_for_10.99.73.11_3306.20170220152230.binlog.. ok. Dumping binlog head events (rotate events), skipping format description events from /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog.. Binlog Checksum enabled dumped up to pos 154. ok. /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog has effective binlog events from pos 154. Dumping effective binlog data from /data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog position 154 to tail(26156638).. ok. Concat succeeded. All apply target binary logs are concatinated at /data/mysql/mha/total_binlog_for_10.99.73.11_3306.20170220152230.binlog . MySQL client version is 5.7.17. Using --binary-mode. Applying differential binary/relay log files /data/mysql/mha/relay_from_exec_to_read_10.99.73.11_3306_20170220152230.binlog,/data/mysql/mha/saved_master_binlog_from_10.99.73.9_3306_20170220152230.binlog on 10.99.73.11:3306. This may take long time... Applying log files succeeded. Mon Feb 20 15:22:40 2017 - [info] All relay logs were successfully applied. Mon Feb 20 15:22:40 2017 - [info] Resetting slave 10.99.73.11(10.99.73.11:3306) and starting replication from the new master 10.99.73.10(10.99.73.10:3306).. Mon Feb 20 15:22:40 2017 - [info] Executed CHANGE MASTER. Mon Feb 20 15:22:40 2017 - [info] Slave started. Mon Feb 20 15:22:40 2017 - [info] End of log messages from 10.99.73.11. Mon Feb 20 15:22:40 2017 - [info] -- Slave recovery on host 10.99.73.11(10.99.73.11:3306) succeeded. Mon Feb 20 15:22:40 2017 - [info] All new slave servers recovered successfully. Mon Feb 20 15:22:40 2017 - [info] Mon Feb 20 15:22:40 2017 - [info] * Phase 5: New master cleanup phase.. # 新master最后准备阶段,开始对外提供服务 Mon Feb 20 15:22:40 2017 - [info] Mon Feb 20 15:22:40 2017 - [info] Resetting slave info on the new master.. Mon Feb 20 15:22:40 2017 - [info] 10.99.73.10: Resetting slave info succeeded. Mon Feb 20 15:22:40 2017 - [info] Master failover to 10.99.73.10(10.99.73.10:3306) completed successfully. Mon Feb 20 15:22:40 2017 - [info] Deleted server1 entry from /etc/masterha/app1.cnf . Mon Feb 20 15:22:40 2017 - [info] ----- Failover Report ----- app1: MySQL Master failover 10.99.73.9(10.99.73.9:3306) to 10.99.73.10(10.99.73.10:3306) succeeded Master 10.99.73.9(10.99.73.9:3306) is down! Check MHA Manager logs at mha:/var/log/masterha/app1/manager.log for details. Started automated(non-interactive) failover. The latest slave 10.99.73.11(10.99.73.11:3306) has all relay logs for recovery. Selected 10.99.73.10(10.99.73.10:3306) as a new master. 10.99.73.10(10.99.73.10:3306): OK: Applying all logs succeeded. 10.99.73.11(10.99.73.11:3306): This host has the latest relay log events. Generating relay diff files from the latest slave succeeded. 10.99.73.11(10.99.73.11:3306): OK: Applying all logs succeeded. Slave started, replicating from 10.99.73.10(10.99.73.10:3306) 10.99.73.10(10.99.73.10:3306): Resetting slave info succeeded. Master failover to 10.99.73.10(10.99.73.10:3306) completed successfully. Mon Feb 20 15:22:40 2017 - [info] Sending mail.. |
由mha切换日志可以看出,整个故障切换已经完成了。整个过程各个阶段核心切换逻辑简化后如下:
Phase 1:配置文件检查
Phase 2:非存活Master关闭服务
Phase 3:Master恢复
Phase 3.1:获取与Master延迟最小的Slave节点和延迟最大的Slave节点
Phase 3.2:生成Master与延迟最小的Slave节点的差异binlog并保存到manager节点
Phase 3.3:找出新的New Master,并且在延迟最小的Slave的Relay log中寻找延迟最小Slave与延迟最大Slave之间差异的binlog日志是否存在
Phase 3.3:如果New Master不是最新的Slave节点,那么需要从最新Slave的Relay log中生成它们之间的差异Relay log
Phase 3.4:New Master恢复差异Relay log和差异binlog日志,随后获取Master binlog位点信息
Phase 4:Slaves恢复
Phase 4.1:多线程生成延迟最小的Slave节点与其他一个或多个Slave差异Relay log
Phase 4.2:多线程恢复Slave节点与延迟最小的Slave之间的差异Relay log,并且恢复manager节点保存的差异binlog,然后change master到NEW MASTER节点
Phase 5:New Master清理Slave信息,并删除掉MHA配置文件中的选主信息防止误操作等
然后查看mysql02是否为我们所设置的那样切换为主了(对于mysql02设置了candidate_master=1)。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
mysql> show slave status; Empty set (0.00 sec) mysql> select count(*) from sbtest.sbtest; +----------+ | count(*) | +----------+ | 1000000 | +----------+ mysql> show global variables like 'read_only'; +---------------+-------+ | Variable_name | Value | +---------------+-------+ | read_only | OFF | +---------------+-------+ 1 row in set (0.00 sec) |
由这三个条件,可以看出切换已经成功了。show slave status时没有slave信息,并且前面我们设置的read_only给关闭了;另外,mysql03成为了mysql02的slave。