Rancher 2.0 etcd disaster recovery
This doc shows how to recover a 3-, 5- or 7-node etcd cluster that has lost quorum by restoring it as a single-node etcd cluster.
Ideally, with these sorts of failures, you should first try your best to get the original etcd hosts back up.
This is also done at your own risk. I have no association with Rancher, nor am I a Rancher professional. It is highly recommended to test this in a staging environment first. I will NOT be responsible for the loss of all your or your company’s data, which is exactly what will happen if this procedure fails.
With that out of the way, please read on.
This doc assumes you have:
1. the Rancher CLI (rancher) installed on your local machine
2. a working internet connection on the surviving etcd host
1. Log in to the surviving host
rancher context switch
rancher ssh <surviving_etcd>
At this point you may want to run docker inspect etcd to ensure that the following two directories are bind-mounted:
...
"Mounts": [
{
"Type": "bind",
"Source": "/var/lib/etcd",
"Destination": "/var/lib/rancher/etcd",
"Mode": "z",
"RW": true,
"Propagation": "rprivate"
},
{
"Type": "bind",
"Source": "/etc/kubernetes",
"Destination": "/etc/kubernetes",
"Mode": "z",
"RW": true,
"Propagation": "rprivate"
}
],
If you do not see the above, stop.
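The mount check can also be scripted instead of reading the JSON by eye. What follows is only a minimal sketch and not part of the original procedure: the function names has_mount, etcd_mounts and check_etcd_mounts are made up for illustration, and it assumes the container is named etcd as in the commands above.

```shell
# Hypothetical helpers (illustrative names, not Rancher tooling).

# Succeeds if stdin has a "source destination" line matching $1 and $2.
has_mount() {
    awk -v src="$1" -v dst="$2" \
        '$1 == src && $2 == dst { found = 1 } END { exit !found }'
}

# Prints one "source destination" pair per mount of the etcd container.
etcd_mounts() {
    docker inspect --format \
        '{{range .Mounts}}{{println .Source .Destination}}{{end}}' etcd
}

# Succeeds only if both bind mounts this doc relies on are present.
check_etcd_mounts() {
    mounts=$(etcd_mounts) || return 1
    printf '%s\n' "$mounts" | has_mount /var/lib/etcd /var/lib/rancher/etcd &&
        printf '%s\n' "$mounts" | has_mount /etc/kubernetes /etc/kubernetes
}
```

On the surviving host, check_etcd_mounts && echo ok should print ok only when both mounts exist. Paths containing spaces would confuse this simple parser, so treat a failure as a prompt to inspect the output manually.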
2. Check the health of the cluster
docker exec -it etcd etcdctl member list
docker exec -it etcd etcdctl endpoint health
You should see the cluster reported as unhealthy.
3. Take a snapshot of the cluster
This ensures that if this operation fails for any reason, you have not lost all your data. We will store our snapshot in the /etc/kubernetes dir, which is bind-mounted onto the same path on the host.
mkdir -p /etc/kubernetes/etcd-snapshots/etcd-$(date +%Y%m%d)
docker exec -it etcd etcdctl snapshot save /etc/kubernetes/etcd-snapshots/etcd-$(date +%Y%m%d)/snapshot.db
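Because the two commands above each evaluate $(date +%Y%m%d), it can help to compute the directory once and to confirm that a non-empty file actually landed on the host. The sketch below is one hedged way to do that; take_snapshot is a hypothetical helper name, and it assumes the container is called etcd and that /etc/kubernetes is bind-mounted as shown in step 1.

```shell
# Hypothetical helper: take_snapshot is an illustrative name, not a Rancher tool.
# $1 is the snapshot directory under /etc/kubernetes (the same path inside
# the container and on the host because of the bind mount).
take_snapshot() {
    snap_dir="$1"
    mkdir -p "$snap_dir" || return 1
    # No -it here, so the call also works from a non-interactive script.
    docker exec etcd etcdctl snapshot save "$snap_dir/snapshot.db" || return 1
    # Treat a missing or zero-byte file as a failed snapshot.
    [ -s "$snap_dir/snapshot.db" ]
}

# Example call, computing the dated directory exactly once:
#   SNAP_DIR="/etc/kubernetes/etcd-snapshots/etcd-$(date +%Y%m%d)"
#   take_snapshot "$SNAP_DIR" && echo "snapshot saved in $SNAP_DIR"
```

Keeping the directory in one variable also means the same path can be reused later if the snapshot has to be restored on a different day.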
4. Get deploy command
Lavie (https://github.com/lavie/runlike) has a great tool, runlike, which approximates the deploy command used to start a docker container. We will use it to get our etcd configuration. Run the following:
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock assaflavie/runlike etcd
The output should be a pretty long docker run type string. Save it in a safe place for later.
5. Stop and rename the old etcd container
docker stop etcd
docker rename etcd etcd_old
6. Start the new etcd container
- Edit the --initial-cluster area of the command from step 4, leaving only the surviving container.
- Append --force-new-cluster at the end of the command.
Use this new string to deploy a new container.
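The two edits can be sketched as a small text transformation on the saved string. This is only an illustration under stated assumptions: rewrite_etcd_cmd is a hypothetical helper, the saved command is assumed to be a single line in which the flag appears as --initial-cluster=name=url,name=url with no quoting, and the member name etcd-node1 in the example is made up. The real runlike output may differ, so always review the result by hand before running it.

```shell
# Hypothetical helper (illustrative name). Assumes a single-line command and
# an unquoted --initial-cluster=name=url,name=url argument.
# $1: the saved `docker run ...` string, $2: name of the surviving member.
rewrite_etcd_cmd() {
    printf '%s\n' "$1" | awk -v member="$2" '
    {
        flag = "--initial-cluster="
        start = index($0, flag)
        if (start == 0) {
            print "no --initial-cluster= found" > "/dev/stderr"; exit 1
        }
        # Split the line into: text before the flag, its value, text after.
        rest = substr($0, start + length(flag))
        end = index(rest, " ")
        value = (end == 0) ? rest : substr(rest, 1, end - 1)
        tail = (end == 0) ? "" : substr(rest, end)
        # Keep only the peer entry that belongs to the surviving member.
        n = split(value, peers, ",")
        keep = ""
        for (i = 1; i <= n; i++)
            if (index(peers[i], member "=") == 1) keep = peers[i]
        if (keep == "") {
            print "member not found in --initial-cluster" > "/dev/stderr"; exit 1
        }
        print substr($0, 1, start - 1) flag keep tail " --force-new-cluster"
    }'
}

# Example (file and member names are hypothetical):
#   rewrite_etcd_cmd "$(cat etcd_run.txt)" etcd-node1
```

The function only prints the edited command. It does not start anything, which leaves room to compare it with the original string first.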
7. Delete old nodes
In the Rancher UI, you should now be able to access your cluster again. Delete the pools of the nodes that died. (This will take a while, as Rancher will redeploy etcd.)
You are now free to continue using your cluster or create new nodes to expand your etcd cluster.
END
Extra
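In case everything went to hell, we can use the snapshot taken in step 3. Note that $(date +%Y%m%d) expands to today's date, so if the snapshot was taken on an earlier day, substitute the directory name that was created in step 3.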
docker exec -it etcd etcdctl snapshot restore /etc/kubernetes/etcd-snapshots/etcd-$(date +%Y%m%d)/snapshot.db --data-dir=/var/lib/rancher/etcd/snapshot
docker stop etcd
mv /var/lib/etcd/member /var/lib/etcd/member_old
mv /var/lib/etcd/snapshot/member /var/lib/etcd/member
rmdir /var/lib/etcd/snapshot
docker start etcd
The above restores the snapshot to /var/lib/rancher/etcd/snapshot inside the container, which is /var/lib/etcd/snapshot on the host.
We then stop etcd, set aside the messed up etcd data (as member_old) and replace it with the restored data.
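The directory swap is the part where a typo hurts most, so here is a hedged sketch that wraps the three file operations with a couple of sanity checks. swap_restored_data is a hypothetical helper name; it takes the host-side etcd data directory (/var/lib/etcd in this doc) and is meant to run between docker stop etcd and docker start etcd.

```shell
# Hypothetical helper (illustrative name).
# $1: host-side etcd data directory, /var/lib/etcd in this doc.
swap_restored_data() {
    etcd_dir="$1"
    # Refuse to touch anything unless the restored data is really there.
    [ -d "$etcd_dir/snapshot/member" ] || {
        echo "no restored data in $etcd_dir/snapshot" >&2; return 1
    }
    # Avoid nesting the old data inside an existing member_old directory.
    [ ! -e "$etcd_dir/member_old" ] || {
        echo "$etcd_dir/member_old already exists" >&2; return 1
    }
    mv "$etcd_dir/member" "$etcd_dir/member_old" &&
        mv "$etcd_dir/snapshot/member" "$etcd_dir/member" &&
        rmdir "$etcd_dir/snapshot"
}

# Example call on the host, with etcd stopped:
#   swap_restored_data /var/lib/etcd
```

If either check fails, nothing is moved, so the original member directory stays in place.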