Rancher 2.0 etcd disaster recovery

This doc shows how to restore to a single node etcd cluster after a 3, 5 or 7 node cluster has lost quorum.

Ideally with these sorts of failures you want to try your best to get the original etcd hosts back up.

This is also done at your own risk, I have no association with Rancher nor am I a Rancher professional. It is also highly recommended to test this in a staging environment first. I will NOT be responsible for the loss of all your or your company’s data; which is exactly what will happen if this procedure fails.

With that out of the way; please read on.

This doc assumes you have
1. rancher_cli installed on your local machine
2. a working internet connection on the surviving etcd host

1. Login to the surviving host

rancher context switch
rancher ssh <surviving_etcd>

At this point you may want to do a docker inspect etcd to ensure the the following two directories are bind-mounted

...
        "Mounts": [
            {
                "Type": "bind",
                "Source": "/var/lib/etcd",
                "Destination": "/var/lib/rancher/etcd",
                "Mode": "z",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "bind",
                "Source": "/etc/kubernetes",
                "Destination": "/etc/kubernetes",
                "Mode": "z",
                "RW": true,
                "Propagation": "rprivate"
            }
        ],

If you do not see the above.. Stop.

2. check the health of the cluster

docker exec -it etcd etcdctl member list
docker exec -it etcd etcdctl endpoint health

You should see unhealthy cluster

3. Take a snapshot of cluster

This ensures that if for any reason this operation fails, you have not lost all your data. We will store our snapshot in the /etc/kubernetes dir which is bind-mounted onto the same path on the host

mkdir -p /etc/kubernetes/etcd-snapshots/etcd-$(date +%Y%m%d)
docker exec -it etcd etcdctl snapshot save /etc/kubernetes/etcd-snapshots/etcd-$(date +%Y%m%d)/snapshot.db

4. Get deploy command

Lavie (https://github.com/lavie/runlike) has this great tool which approximates the deploy command used to put up a docker container. We will use it to get out etcd configuration. Run the following:

docker run --rm -v /var/run/docker.sock:/var/run/docker.sock assaflavie/runlike etcd

the output should be a pretty long docker run type string. Save it in a safe place for later

5. Destroy/Rename the old etcd container

docker stop etcd
docker rename etcd etcd_old

6. Start the new etcd container

  1. Edit the --initial-cluster area of the command from step 4, leaving only the surviving container.
  2. Append --force-new-cluster at the end of the command

Use this new string to deploy a new container.

7. Delete old nodes

In the rancher UI. You should now be able to access your cluster again. Delete the pools of the nodes that died. (This will take a while as rancher will redeploy etcd)

You are now free to continue using your cluster or create new nodes to expand your etcd cluster

END


Extra

In case everything went to hell, we can use the snapshot taken in step 3…

docker exec -it etcd etcdctl snapshot --data-dir=/var/lib/rancher/etcd/snapshot restore /etc/kubernetes/etcd-snapshots/etcd-$(date +%Y%m%d)/snapshot.db

docker stop etcd
mv /var/lib/etcd/member /var/lib/etcd/member_old
mv /var/lib/etcd/snapshot/member /var/lib/etcd/member
rmdir /var/lib/etcd/snapshot
docker start etcd

The above restores the snapshot to /var/lib/rancher/etcd/snapshot
We then stop etcd, archive the messed up etcd data (member_old) and replace it with the restored data