
Case study: a Redis Cluster that failed to restart after its node servers went down

Here is a situation I ran into myself:
The Redis Cluster consists of three node servers running 6 redis instances in total, two ports per node: three masters and three slaves.
Redis is deployed under /data/redis-4.0.1, and the cluster layout is:


172.16.50.245:7000  master

172.16.50.245:7001  slave of 172.16.50.246:7002

172.16.50.246:7002  master

172.16.50.246:7003  slave of 172.16.50.247:7004

172.16.50.247:7004  master

172.16.50.247:7005  slave of 172.16.50.245:7000

As this layout shows, the three masters sit on three different servers, and so do the three slaves.

These three node servers are virtual machines, and all three VMs lived on the same physical host. One day that host shut down abruptly because of a hardware failure, taking the redis services on all three nodes down with it.

Here is how the redis restart went on the three nodes:

1) The commands used to restart redis on each node:


172.16.50.245

[root@50.245-server ~]# ps -ef|grep redis|grep -v grep

[root@50.245-server ~]#

[root@50.245-server ~]# for((i=0;i<=1;i++)); do /data/redis-4.0.1/src/redis-server /data/redis-4.0.1/redis-cluster/700$i/redis.conf; done

[root@50.245-server ~]# ps -ef|grep redis

root      2059     1  0 22:29 ?        00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.245:7000 [cluster]                  

root      2061     1  0 22:29 ?        00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.245:7001 [cluster]                  

root      2092  1966  0 22:29 pts/0    00:00:00 grep redis

[root@50.245-server ~]# /data/redis-4.0.1/src/redis-cli -h 172.16.50.245 -c -p 7000

172.16.50.245:7000> cluster nodes

678211b78a4eb15abf27406d057900554ff70d4d :7000@17000 myself,master - 0 0 0 connected

  

172.16.50.246

[root@50.246-server ~]# ps -ef|grep redis|grep -v grep

[root@50.246-server ~]#

[root@50.246-server ~]# for((i=2;i<=3;i++)); do /data/redis-4.0.1/src/redis-server /data/redis-4.0.1/redis-cluster/700$i/redis.conf; done

[root@50.246-server ~]# ps -ef|grep redis

root      1985     1  0 22:29 ?        00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.246:7002 [cluster]                  

root      1987     1  0 22:29 ?        00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.246:7003 [cluster]                  

root      2016  1961  0 22:29 pts/0    00:00:00 grep redis

[root@50.246-server ~]# /data/redis-4.0.1/src/redis-cli -h 172.16.50.246 -c -p 7002

172.16.50.246:7002> cluster nodes

2ebe8bbecddae0ba0086d1b8797f52556db5d3fd 172.16.50.246:7002@17002 myself,master - 0 0 0 connected

  

172.16.50.247

[root@50.247-server ~]# ps -ef|grep redis|grep -v grep

[root@50.247-server ~]#

[root@50.247-server ~]# for((i=4;i<=5;i++)); do /data/redis-4.0.1/src/redis-server /data/redis-4.0.1/redis-cluster/700$i/redis.conf; done

[root@50.247-server ~]# ps -ef|grep redis

root      1987     1  0 22:29 ?        00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.247:7004 [cluster]                  

root      1989     1  0 22:29 ?        00:00:00 /data/redis-4.0.1/src/redis-server 172.16.50.247:7005 [cluster]                  

root      2018  1966  0 22:29 pts/0    00:00:00 grep redis

[root@50.247-server ~]# /data/redis-4.0.1/src/redis-cli -h 172.16.50.247 -c -p 7004

172.16.50.247:7004> cluster nodes

ccd4ed6ad27eeb9151ab52eb5f04bcbd03980dc6 172.16.50.247:7004@17004 myself,master - 0 0 0 connected

172.16.50.247:7004>

  

As the output above shows, after the restart the redis instances did not rejoin the Redis Cluster on their own; each node only knows about itself.

2) Because the cluster's node servers went down (that is, their redis services restarted), part of the slot assignment was lost. Checking the cluster state produced the error below.
Note: redis-trib.rb has to be run on a machine with the gem tooling installed; here that is the 172.16.50.245 node (the other two node servers do not have the gem tooling installed, so it cannot be run there).


[root@50.245-server ~]# /data/redis-4.0.1/src/redis-trib.rb check 172.16.50.245:7000

........

[ERR] Not all 16384 slots are covered by nodes.

  

================================================================================

Cause: the total number of assigned slots falls short of 16384; in other words, the slot distribution is broken. The redis instances on the other two node servers were in the same state.
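That [ERR] means the slot ranges advertised by the masters no longer add up to the full 0-16383 range. As a sketch of what the check verifies (my own helper, not redis-trib code), coverage can be computed straight from `cluster nodes` output:

```python
FULL = set(range(16384))

def covered_slots(cluster_nodes: str) -> set:
    """Collect every slot claimed in `cluster nodes` output (fields 9+ of each line)."""
    slots = set()
    for line in cluster_nodes.strip().splitlines():
        for field in line.split()[8:]:
            if field.startswith("["):        # slot mid-migration, e.g. [93->-<id>]; skip
                continue
            if "-" in field:                 # a range such as "0-5460"
                lo, hi = field.split("-")
                slots.update(range(int(lo), int(hi) + 1))
            else:                            # a single slot such as "4096"
                slots.add(int(field))
    return slots

# Two masters covering only slots 0-10922: the third range is missing.
nodes = """\
678211b78a4eb15abf27406d057900554ff70d4d 172.16.50.245:7000@17000 myself,master - 0 0 1 connected 0-5460
2ebe8bbecddae0ba0086d1b8797f52556db5d3fd 172.16.50.246:7002@17002 master - 0 0 2 connected 5461-10922
"""
missing = FULL - covered_slots(nodes)
print(len(missing))  # 5461 -- slots 10923-16383 are uncovered
```

When `missing` is non-empty, this is exactly the condition redis-trib reports as "Not all 16384 slots are covered by nodes."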

  

Workaround:

The officially recommended tool is redis-trib.rb fix. (cluster nodes also showed that the 7001 instance had been dropped from the cluster.) Repair as follows:

[root@50.245-server ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.245:7000

......

Fix these slots by covering with a random node? (type 'yes' to accept): yes

  

After the repair, check reports a healthy cluster again:

[root@50.245-server ~]# /data/redis-4.0.1/src/redis-trib.rb check 172.16.50.245:7000

>>> Performing Cluster Check (using node 172.16.50.245:7000)

M: 678211b78a4eb15abf27406d057900554ff70d4d 172.16.50.245:7000

   slots:0-16383 (16384 slots) master

   0 additional replica(s)

[OK] All nodes agree about slots configuration.

>>> Check for open slots...

>>> Check slots coverage...

[OK] All 16384 slots covered.

  

Repair the other 5 redis instances the same way:

[root@50.245-server ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.245:7001

[root@50.245-server ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.246:7002

[root@50.245-server ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.246:7003

[root@50.245-server ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.247:7004

[root@50.245-server ~]# /data/redis-4.0.1/src/redis-trib.rb fix 172.16.50.247:7005

=================================================================================

3) Next, add the three redis nodes back into a cluster (that is, re-create the Redis Cluster; here the master and slave nodes are added explicitly).
Note: as in step 2, this has to be run on the 172.16.50.245 node, the only one with the gem tooling installed.


First create the cluster from the three redis masters:

[root@50.245-server ~]# /data/redis-4.0.1/src/redis-trib.rb create  172.16.50.245:7000 172.16.50.246:7002 172.16.50.247:7004

Using 3 masters:

172.16.50.245:7000

172.16.50.246:7002

172.16.50.247:7004

M: 678211b78a4eb15abf27406d057900554ff70d4d 172.16.50.245:7000

   slots:0-16383 (16384 slots) master

M: 2ebe8bbecddae0ba0086d1b8797f52556db5d3fd 172.16.50.246:7002

   slots:0-16383 (16384 slots) master

M: ccd4ed6ad27eeb9151ab52eb5f04bcbd03980dc6 172.16.50.247:7004

   slots:0-16383 (16384 slots) master

Can I set the above configuration? (type 'yes' to accept):  yes   # type yes

     

==============================================================================

If instead you hit errors like these:

[ERR] Node 172.16.50.245:7000 is not empty. Either the node already knows other nodes (check with CLUSTER NODES) or

contains some key in database 0.

     

or

[ERR] Node 172.16.50.246:7002 is not empty. Either the node already knows other nodes (check with CLUSTER NODES) or

contains some key in database 0.

     

or

[ERR] Node 172.16.50.247:7004 is not empty. Either the node already knows other nodes (check with CLUSTER NODES) or

contains some key in database 0.

     

or

Can I set the above configuration? (type 'yes' to accept): yes

/usr/local/rvm/gems/ruby-2.3.1/gems/redis-4.0.1/lib/redis/client.rb:119:in `call': ERR Slot 0 is already busy (Redis::CommandError)

  from /usr/local/rvm/gems/ruby-2.3.1/gems/redis-4.0.1/lib/redis.rb:2764:in `block in method_missing'

  from /usr/local/rvm/gems/ruby-2.3.1/gems/redis-4.0.1/lib/redis.rb:45:in `block in synchronize'

  from /usr/local/rvm/rubies/ruby-2.3.1/lib/ruby/2.3.0/monitor.rb:214:in `mon_synchronize'

  from /usr/local/rvm/gems/ruby-2.3.1/gems/redis-4.0.1/lib/redis.rb:45:in `synchronize'

  from /usr/local/rvm/gems/ruby-2.3.1/gems/redis-4.0.1/lib/redis.rb:2763:in `method_missing'

  from /data/redis-4.0.1/src/redis-trib.rb:212:in `flush_node_config'

  from /data/redis-4.0.1/src/redis-trib.rb:776:in `block in flush_nodes_config'

  from /data/redis-4.0.1/src/redis-trib.rb:775:in `each'

  from /data/redis-4.0.1/src/redis-trib.rb:775:in `flush_nodes_config'

  from /data/redis-4.0.1/src/redis-trib.rb:1296:in `create_cluster_cmd'

  from /data/redis-4.0.1/src/redis-trib.rb:1700:in `<main>'

     

Workaround (these errors mean the nodes still carry cluster state from before the crash: the nodes_*.conf files and old data remember the previous membership and slot ownership):

a) On each of the three nodes (172.16.50.245, 172.16.50.246, 172.16.50.247), remove the aof/rdb dumps and nodes_*.conf files under the redis directory. Back them up first (or simply mv them out of the way):

[root@50.245-server ~]# cd /data/redis-4.0.1/redis-cluster/

[root@50.245-server redis-cluster]# ls

7000  7001  appendonly.aof  dump.rdb  nodes_7000.conf  nodes_7001.conf

[root@50.245-server redis-cluster]# mv appendonly.aof /opt/

[root@50.245-server redis-cluster]# mv dump.rdb /opt/

[root@50.245-server redis-cluster]# mv nodes_7000.conf /opt/

[root@50.245-server redis-cluster]# mv nodes_7001.conf /opt/

     

b) Log in to each redis instance and run "flushdb" to clear the data:

[root@50.245-server ~]# /data/redis-4.0.1/src/redis-cli -c -h 172.16.50.245 -p 7000

172.16.50.245:7000> flushdb

OK

172.16.50.245:7000>

     

c) Restart the redis service:

[root@50.245-server ~]# pkill -9 redis

[root@50.245-server ~]# for((i=0;i<=1;i++)); do /data/redis-4.0.1/src/redis-server /data/redis-4.0.1/redis-cluster/700$i/redis.conf; done

     

d) Re-running the cluster create now succeeds without errors:

[root@50.245-server ~]# /data/redis-4.0.1/src/redis-trib.rb create  172.16.50.245:7000 172.16.50.246:7002 172.16.50.247:7004

=========================================================================

     

Then add the matching slave for each of the 3 masters:

[root@50.245-server ~]# /data/redis-4.0.1/src/redis-trib.rb add-node --slave 172.16.50.247:7005 172.16.50.245:7000

[root@50.245-server ~]# /data/redis-4.0.1/src/redis-trib.rb add-node --slave 172.16.50.245:7001 172.16.50.246:7002

[root@50.245-server ~]# /data/redis-4.0.1/src/redis-trib.rb add-node --slave 172.16.50.246:7003 172.16.50.247:7004

     

Then inspect the cluster (this command can be run from any of the redis node servers):

[root@localhost redis-cluster]# /data/redis-4.0.1/src/redis-cli -c -h 172.16.50.245 -p 7000

172.16.50.245:7000> cluster nodes

1032fedd0c1ca7ac12be3041a34232593bd82343 172.16.50.245:7000@17000 myself,master - 0 1531150844000 1 connected 0-5460

1f389da1c7db857a7b72986289d7a2132e30b879 172.16.50.247:7004@17004 master - 0 1531150846000 3 connected 10923-16383

778c92cdf73232864e1455edb7c2e1b07c6d067e 172.16.50.247:7005@17005 slave 1032fedd0c1ca7ac12be3041a34232593bd82343 0 1531150846781 1 connected

e6649e6abf3ab496ca8a32896c5d72858d36dce9 172.16.50.246:7002@17002 master - 0 1531150845779 2 connected 5461-10922

d0f4c92f0b6bcc5f257a7546d4d121c602c78df8 172.16.50.246:7003@17003 slave 1f389da1c7db857a7b72986289d7a2132e30b879 0 1531150844777 3 connected

846943fd218aeafe6e2b4ec96505714fab2e1861 172.16.50.245:7001@17001 slave e6649e6abf3ab496ca8a32896c5d72858d36dce9 0 1531150847784 2 connected

     

172.16.50.245:7000> cluster info

cluster_state:ok

cluster_slots_assigned:16384

cluster_slots_ok:16384

cluster_slots_pfail:0

cluster_slots_fail:0

cluster_known_nodes:6

cluster_size:3

cluster_current_epoch:3

cluster_my_epoch:1

cluster_stats_messages_ping_sent:769

cluster_stats_messages_pong_sent:711

cluster_stats_messages_meet_sent:1

cluster_stats_messages_sent:1481

cluster_stats_messages_ping_received:706

cluster_stats_messages_pong_received:770

cluster_stats_messages_meet_received:5

cluster_stats_messages_received:1481
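The slot ranges in the `cluster nodes` output above (0-5460, 5461-10922, 10923-16383) are how a key finds its master: Redis Cluster hashes the key with CRC16 (the XMODEM variant) modulo 16384, honoring `{...}` hash tags. A minimal sketch of that mapping (helper names are mine):

```python
def crc16(data: bytes) -> int:
    """CRC16/XMODEM (polynomial 0x1021, init 0), the variant Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to one of the 16384 hash slots, honoring {...} hash tags."""
    k = key.encode()
    brace = k.find(b"{")
    if brace != -1:
        close = k.find(b"}", brace + 1)
        if close > brace + 1:          # hash only the non-empty tag between { and }
            k = k[brace + 1:close]
    return crc16(k) % 16384

# Keys sharing a hash tag land in the same slot, hence on the same master.
print(key_slot("{user1000}.following") == key_slot("{user1000}.followers"))  # True
```

Whichever master's range contains `key_slot(key)` serves that key, which is why losing one range (one master and its slave) makes part of the keyspace unreachable.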

  

======================================================

A few takeaways (after a restart, a former master instance comes back as a slave, and the slave that used to replicate it becomes the master):

1) Master instances are best spread across different servers, and so are slave instances.

2) A master and its own slave should also live on different servers, never on the same one.

   If the Redis Cluster nodes are virtual machines, those VMs should not share a single physical host either.

   This avoids losing data, and failing to bring the cluster back up, when one server (or an entire master/slave pair) dies.

3) As long as no more than half of the cluster's machines go down (strictly speaking, a majority of master nodes must stay reachable), the cluster keeps serving and remains usable.

   But once a downed node restarts and its redis instances rejoin the cluster, its former master instance comes back as a slave (the slave that used to replicate it has meanwhile been promoted to master);

   in other words, after the restart both redis instances on that node are slaves (one master plus one slave before, two slaves after).
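The "half of the machines" threshold in point 3 reflects Redis Cluster's majority requirement: a failed master can only be detected and its slave promoted while a strict majority of master nodes still agrees. A toy check (the function name is mine):

```python
def masters_have_quorum(total_masters: int, reachable_masters: int) -> bool:
    """Redis Cluster can mark a master as failed (and promote its slave)
    only while a strict majority of the master nodes remains reachable."""
    return reachable_masters >= total_masters // 2 + 1

# With 3 masters, losing 1 is survivable; losing 2 is not.
print(masters_have_quorum(3, 2), masters_have_quorum(3, 1))  # True False
```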

4) If more than half of the machines go down, restarting their redis services is not enough; the nodes also have to be added back into a re-created Redis Cluster.

   When re-creating the cluster, keep the master/slave pairing the same as before.

5) Keep firmly in mind that a restarting node is first evicted from the cluster: its master and slave instances disappear from the membership, and the slave of its former master is promoted to master. Once the node is back up it rejoins

   automatically, with both of its instances ending up as slaves. Because a window exists between eviction and automatic rejoin, the cluster's nodes must not be restarted at the same time, or even in quick succession;

   doing so can crash the whole cluster and lose its data. As a rule, restart one node, wait until you have verified it has successfully rejoined the cluster, and only then

   restart the next one.
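The one-node-at-a-time rule in point 5 can be automated by gating each restart on the output of `cluster info`; a sketch (my own helper, not a redis command):

```python
def cluster_is_healthy(cluster_info: str) -> bool:
    """Parse `cluster info` output and decide whether it is safe
    to proceed with restarting the next node."""
    info = {}
    for line in cluster_info.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value.strip()   # strip trailing whitespace/\r
    return (info.get("cluster_state") == "ok"
            and info.get("cluster_slots_ok") == "16384"
            and info.get("cluster_slots_fail") == "0")

healthy = """cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_fail:0"""
print(cluster_is_healthy(healthy))  # True
```

A restart script would poll this between nodes and refuse to continue until the gate returns True.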

 

When redis serves purely as a cache, losing its data is generally invisible to the business: nothing breaks, and the cache is repopulated on subsequent writes. But when redis is used as a primary data store, data loss hits the business hard.

That said, in typical setups the primary store is mysql, oracle or mongodb instead.

 
