
Fun experiment with Kubernetes: live migration of a cluster from a cloud provider to another


One of the many benefits of Kubernetes that I often hear people talk about is portability. Kubernetes makes it possible to migrate workloads from one cluster to another easily, which can be useful if, for example, a company wants to change infrastructure provider due to cost or any other reason.

While there are various ways to migrate workloads from one cluster to another, how to do it actually depends on where and how the clusters are provisioned (e.g. whether with the same provider or not). In most cases the easiest option is to back everything up on the old cluster and restore it in the new cluster before making the switch. This implies a certain amount of downtime, depending on the apps and data to migrate. I have done this several times using Velero, a free, popular backup tool for Kubernetes. Migrating with zero downtime is certainly a more complex task.
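For reference, the Velero flow basically boils down to a backup in the old cluster followed by a restore in the new one; a minimal sketch, assuming Velero is installed in both clusters and pointed at the same backup storage location (the backup name is just an example):

velero backup create full-backup                   # run in the old cluster; all namespaces are included by default
velero restore create --from-backup full-backup    # run in the new cluster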

I was bored, so I did a fun experiment doing a live migration (as opposed to a backup followed by a restore) of a cluster from Hetzner Cloud (referral link, we both receive credits) to DigitalOcean (referral link, we both receive credits), but where your nodes are actually hosted doesn't really matter if you do the migration the way I did.

To design a cluster in such a way that it can be migrated live to other infrastructure, even a totally different one, you need to take some limitations into account. In particular, the cluster cannot use features that are specific to a cloud provider, such as load balancers and block storage; you also cannot use private networks, because nodes in two different environments would of course not be able to communicate with each other - unless you set up some kind of VPN or similar.

What I used for this test instead is my favorite Kubernetes tool: Rancher. If you've happened to read this blog before you probably know that I absolutely love Rancher. It makes deploying and managing Kubernetes a lot easier, especially for a team of one like me.

In a previous post I wrote about how to set up a highly available load balancer with haproxy and keepalived (see also the article I wrote for the Hetzner Community on how to automate the same thing with Ansible), so I set up load balancers that way. For the sake of simplicity, I will skip that part in this post and just use a regular ingress connecting directly to a node.

For storage, I used another project by Rancher, Longhorn. Longhorn is awesome in that it offers many features like a great dashboard, backups to S3/NFS, and disaster recovery, and it's open source and totally free. Longhorn also makes it incredibly easy to move volume replicas between nodes, so it's a great fit for this experiment.

Prerequisites

For this post, I assume you

  • are already familiar with Kubernetes, so if you are looking for an introduction to it this isn't the right post :) 
  • are already familiar with Rancher. If not, you can still set it up easily by following the official documentation; after that, the graphical UI is very intuitive
  • have an account with a couple of cloud providers, or at least access to two different infrastructures where you can create some instances. Which providers you use doesn't really matter.

Creating the first cluster in Hetzner Cloud

To get started, head to the Hetzner Cloud console (you can use the CLI if you prefer) and create three servers so we can test with an HA cluster. I called my nodes hc-node1..3. Choose Ubuntu 18.04 or another OS supported by Rancher, and ensure that the servers have 2-3 cores and 2-4 GB of RAM each.
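If you go the CLI route, something along these lines should work; the server type and SSH key name here are just examples, so adjust them to your account:

hcloud server create --name hc-node1 --type cx21 --image ubuntu-18.04 --ssh-key my-key
hcloud server create --name hc-node2 --type cx21 --image ubuntu-18.04 --ssh-key my-key
hcloud server create --name hc-node3 --type cx21 --image ubuntu-18.04 --ssh-key my-key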

Once the servers are up and running, SSH into them, update the system packages, configure the firewall, etc. if you wish (you may not need this if you are just testing this migration).

We have only two requirements concerning the software installed on the nodes. The first is Docker, which you can install with the following command:

curl -fsSL https://get.docker.com | sh

The second requirement is open-iscsi, which is required by Longhorn:

apt install open-iscsi -y
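Depending on the image, you may also want to make sure the iscsid service is enabled and running (assuming systemd):

systemctl enable --now iscsid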

We can now proceed with provisioning Kubernetes with Rancher. From the "Global" view, click "Add Cluster" and select the custom nodes option:

Give the cluster a name such as "test" and choose Weave as the network provider:

You don't have to use Weave for this test, but in a real-life scenario you should use a CNI plugin that enables encryption between the nodes, since we are going to connect them to each other via the public Internet.

Click "Next", then select all the roles for the nodes and copy the provided Docker command:

Then run that Docker command on each node. After a few seconds, the UI will report "3 new nodes have registered". Click "Done", then click on the name of the cluster from the list, then click on the "Nodes" tab to see the provisioning progress. 
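For reference, the registration command Rancher generates looks roughly like the sketch below; the agent version, server URL, token and checksum are specific to your installation, so always copy the actual command from the UI:

docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:<version> \
  --server https://<your-rancher-url> --token <token> --ca-checksum <checksum> \
  --etcd --controlplane --worker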

Preparing the nodes in DigitalOcean

The provisioning of Kubernetes will take several minutes, so in the meantime you can start preparing the nodes in DigitalOcean, since the objective of this test is to migrate the cluster there later.

So go ahead and create three droplets with at least 2 cores and Ubuntu 18.04 as the OS.
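As with Hetzner Cloud, you can do this from the UI or with the doctl CLI; a rough equivalent, where the size, region and SSH key fingerprint are just examples:

doctl compute droplet create do-node1 do-node2 do-node3 \
  --size s-2vcpu-4gb --image ubuntu-18-04-x64 --region fra1 --ssh-keys <key-fingerprint>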

Install Docker and open-iscsi as explained earlier. The nodes are now ready.

Installing Longhorn

When Kubernetes provisioning is complete, the nodes will be marked as "Active" like in the picture below:

Now from the top menu click on "test" (or whatever name you gave to the cluster), then "System", then click on the "Apps" tab.

Click on "Launch", search for Longhorn and click on it. Launch with the default settings, and wait - the installation will take a couple of minutes and you will see the progress bar. When green, Longhorn should be ready to provision storage. Double check by clicking on "Resources" > "Workloads" and scrolling to the "longhorn-system" namespace. You will see the same resources as in the picture below:

Click again on the "Apps" tab and you'll notice a link to "/index.html" in the Longhorn card. Click on that link to securely access the Longhorn dashboard:

Now go to the "Node" tab and ensure that all the nodes are schedulable:

By default, Longhorn will use the available storage on the nodes to provision persistent volumes, so you don't need to configure or install anything else.
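If you want to double-check from the command line that Longhorn is up and that its storage class is the default, something like this should do (assuming kubectl is configured for the cluster):

kubectl -n longhorn-system get pods    # all pods should be Running
kubectl get storageclass               # the longhorn class should be marked as (default)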

Click on the "Volume" tab and keep it open so we can check later when volumes are created.

Installing an example workload - Magento

To test the migration, we'll deploy Magento as an example workload that uses two persistent volumes, one for Magento itself and the other for MariaDB. 

So from the top menu go to "test" > "Default" > "Apps" and search for Magento. Click on the only result, configure the username and password you'll use to access the app, and make sure persistent volumes are enabled for both Magento and MariaDB like in the picture:

You don't need to select a storage class since Longhorn's is set as default.

Set "Magento Host URL" to "magento.<ip of a node>.xip.io" so we don't have to configure some DNS somewhere. Any hostname for the xip.io domain that includes an IP address will resolve to that IP address, which is handy for tests.

Under "Hostname" click on "Specify a hostname to use" and enter the same hostname as for the previous step, then click "Launch".

Now go back to the "Volume" tab of the Longhorn UI and check that the volumes have been created. It may take a little while for the Magento volume since the container image is a little large.

When the Magento and MariaDB pods are up and running, open the hostname you configured earlier in the browser and ensure that the app works as expected.

Migrating the cluster live

So far we have a working Kubernetes cluster in Hetzner Cloud, with a real-life workload that uses persistent volumes. Let's now migrate the cluster to DigitalOcean.

From the top menu in Rancher click on the project title, then click on "Edit Cluster". Scroll to the bottom, and you will see a familiar form: select all the roles for the new nodes, copy the Docker command and this time run that command on the DigitalOcean nodes that you prepared earlier.

Click on "Save" then on the "Nodes" tab and wait for the nodes to be "Active". It's gonna take several minutes and at the end you will have six active nodes:

We need to migrate both the persistent volumes and the workloads from the Hetzner Cloud nodes to the DigitalOcean nodes, so select the Hetzner Cloud nodes (hc-node1..3) and then click on "Cordon". Go to the Longhorn "Node" tab and verify that the Hetzner Cloud nodes are no longer schedulable for the storage.
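If you prefer the command line, cordoning can also be done with kubectl, assuming the Kubernetes node names match the hostnames:

kubectl cordon hc-node1
kubectl cordon hc-node2
kubectl cordon hc-node3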

Click on the "Volume" tab, then click on the first volume. You should see the current three replicas of the volume on the Hetzner Cloud volumes:

Click on the small drop-down at the top right corner of the "Replicas" section and click on "Update Replicas Count". Set it to 6 and confirm. Longhorn will now create a new replica on each of the only schedulable nodes, which are the DigitalOcean ones. Wait until the new replicas are ready (blue background and "Running" status), then set the replica count back to 3 and delete the replicas on the Hetzner Cloud nodes, so that only the replicas on the DigitalOcean nodes remain.
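If you'd rather keep an eye on the volumes and replicas from the command line as well, Longhorn exposes them as custom resources; the exact API group depends on the Longhorn version, so check with kubectl api-resources first:

kubectl api-resources | grep -i longhorn             # find the exact resource names for your Longhorn version
kubectl -n longhorn-system get volumes.longhorn.io   # the API group may differ on older releases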

Ensure the status of the volume on the left is "Healthy", and repeat the same process with the second volume.

Once all the volume replicas have been migrated, ensure that Magento is still working. At this point we have the pods in Hetzner Cloud and the data in DigitalOcean. Magento should work exactly as before, just a little slower due to the higher latency.

Next, go to the "Nodes" tab in Rancher, select the Hetzner Cloud nodes and click on "Drain" with the options configured like in the picture to speed up the draining:

Wait until the Hetzner Cloud nodes are marked as "Drained", then check that the Magento and MariaDB pods have been rescheduled on the DigitalOcean nodes. If the pods don't get recreated because the volumes cannot be attached, scale the deployments to 0 and back to 1. Sometimes it can take a little while to reattach the volumes. Check that Magento is still working.
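Both the draining and the scale-down/up workaround can also be done with kubectl if you prefer; note that on newer kubectl versions the --delete-local-data flag is called --delete-emptydir-data, and the namespace and deployment names below are placeholders, so check what the Magento chart actually created:

kubectl drain hc-node1 --ignore-daemonsets --delete-local-data
kubectl drain hc-node2 --ignore-daemonsets --delete-local-data
kubectl drain hc-node3 --ignore-daemonsets --delete-local-data

kubectl -n <namespace> scale deployment <magento-deployment> --replicas=0
kubectl -n <namespace> scale deployment <magento-deployment> --replicas=1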

Go to the "Apps" tab, click on the three dots in the Magento card and then click "Upgrade". At the bottom of the form change the hostname so that it points to the IP of one of the DigitalOcean nodes. Give it a few seconds, then try the new hostname in the browser. Magento should still work.

Back to the "Nodes" tab, you can now delete the Hetzner Cloud nodes; wait until the "The cluster is currently updating" message is gone, then delete the actual servers from the Hetzner Cloud console too.

The cluster has now been migrated to DigitalOcean and Magento should still work.

Conclusion

This was a fun little experiment. I have been using Rancher for a while to deploy Kubernetes, so I knew this was possible with the "Custom nodes" provisioning, but I hadn't tried it before. For this experiment I didn't install Magento in HA mode, so there was a brief downtime while the pods were being rescheduled in DigitalOcean. In a production environment you would of course install workloads in HA mode, so there wouldn't be any downtime during the migration :) Also, like I said, I skipped the load balancing part here, but setting up a load balancer with haproxy is easy, and that would also avoid downtime since we could just update the DNS records once the migration is complete.

The question now is, should you set up your clusters this way? I did this just for fun, but it's an option. If you think you might need to change infrastructure, then you could go this route. But you would lose some benefits, such as load balancers, block storage and private networks, that you could use otherwise. Personally, I keep my cluster in Hetzner Cloud for now, deployed with Rancher of course, at least until Hetzner Cloud offers a managed Kubernetes service. Hetzner Cloud has both a cloud controller manager and a CSI driver on GitHub which are super easy to install, so I can use their load balancers and block storage. I can also use the private networks, which means I can use Canal instead of Weave and get better performance. But it's nice to know that there are tools that make it possible to migrate a Kubernetes cluster from one infrastructure to another "live" and without any downtime if everything is installed in HA mode. What do you think about this approach? Let me know in the comments :)
