Provisioning, backing up and restoring a highly available Rancher cluster with Terraform


(Update Aug 9, 2021: I have built a tool that can deploy a Kubernetes cluster - also highly available if needed - in Hetzner Cloud very, very quickly. It's perhaps the easiest and fastest way to deploy Kubernetes to Hetzner Cloud, so I would suggest you check this option too before proceeding with alternatives proposed on this blog. You can find more info here.)
Main Terraform configuration
I keep the code for each environment in a directory called terraform/environments/. So for Rancher I have a directory named terraform/environments/rancher. Inside this directory I have the following structure:
terraform/environments/rancher
├── kube_config_rancher.yml
├── main.tf
├── modules
│   ├── rancher
│   │   ├── main.tf
│   │   └── variables.tf
│   ├── rke
│   │   ├── main.tf
│   │   ├── output.tf
│   │   └── variables.tf
│   └── servers
│       ├── main.tf
│       ├── output.tf
│       └── variables.tf
├── rke_data
│   ├── cluster.rkestate
│   ├── cluster.yml
│   └── kube_config_cluster.yml
├── terraform.tfvars
└── variables.tf
I will skip going through the servers module because it’s very similar to what I described in the previous post about creating servers in Hetzner Cloud. In the root of the rancher directory we have main.tf, which contains the definitions of the providers and modules in use; terraform.tfvars, which contains some non-secret configuration variables; and variables.tf, which defines the variables that can be customised. The modules directory contains a few modules I use to avoid keeping all the code in one big file. The kube_config_rancher.yml file and the rke_data directory are automatically generated by Terraform when creating or updating the cluster. These are quite important because they allow us to take backups of the etcd data or restore from an existing snapshot using the RKE binary directly, since the (unofficial) RKE provider for Terraform doesn’t seem to offer a native way to do these.
main.tf
This is the main file, and contains the following sections:
terraform {
  backend "s3" {
    skip_requesting_account_id  = true
    skip_credentials_validation = true
    skip_get_ec2_platforms      = true
    skip_metadata_api_check     = true
  }
}
Tells Terraform that we want to store the state in an s3-compatible bucket instead of on the local disk. This is recommended.
provider "hcloud" {
  token = var.hetzner_cloud_token
}

module "servers" {
  source                      = "./modules/servers"
  environment_name            = var.environment_name
  node_count                  = var.node_count
  server_type                 = var.server_type
  ansible_playbook_path       = var.ansible_playbook_path
  ansible_vault_password_path = var.ansible_vault_password_path
  location                    = var.server_location
}
Tells Terraform that we want to use the hcloud provider for Hetzner Cloud, and refers to a module that takes care of creating the servers and provisioning them with Ansible (like I said, I will skip this part here because it’s very similar to what I’ve described in the previous post). As you can see, we just specify the name of the environment (which is rancher in this case), the number of servers we want the cluster to include, the server type (i.e. the Hetzner Cloud plan) and the location of the servers. The Ansible-specific settings are used when provisioning the newly created servers.
module "rke" {
  source                        = "./modules/rke"
  nodes                         = module.servers.nodes
  environment_name              = var.environment_name
  private_ips                   = module.servers.private_ips
  etcd_snapshots_s3_access_key  = var.etcd_snapshots_s3_access_key
  etcd_snapshots_s3_secret_key  = var.etcd_snapshots_s3_secret_key
  etcd_snapshots_s3_bucket_name = var.etcd_snapshots_s3_bucket_name
  etcd_snapshots_s3_region      = var.etcd_snapshots_s3_region
  etcd_snapshots_s3_endpoint    = var.etcd_snapshots_s3_endpoint
}
Here we include the rke module which will create a Kubernetes cluster using the nodes created by the servers module, so we need to pass the data concerning the servers/nodes. Because of the way the Hetzner Cloud provider for Terraform works, I couldn’t get the IPs of the private network directly from the nodes data, so in the servers module I save the IPs in a variable I can refer to in the rke module. But again we won’t look into this now. We also configure etcd so that recurring snapshots are stored in an s3 bucket.
provider "kubernetes" {
  host                   = module.rke.cluster.api_server_url
  client_certificate     = module.rke.cluster.client_cert
  client_key             = module.rke.cluster.client_key
  cluster_ca_certificate = module.rke.cluster.ca_crt
}
We’ll use the Kubernetes provider to create a service account and a cluster role binding for Tiller, Helm’s server side component. We’ll use Helm to install cert-manager (for the TLS certificate) and Rancher.
module "rancher" {
  source             = "./modules/rancher"
  kubeconfig_file    = module.rke.kubeconfig_file
  kubernetes_cluster = module.rke.cluster
  rancher_hostname   = var.rancher_hostname
  rancher_version    = var.rancher_version
  letsencrypt_email  = var.rancher_letsencrypt_email
}
Here we reference the rancher module, which will take care of installing Rancher and its dependencies.
variables.tf
# Required
variable "hetzner_cloud_token" { }
variable "environment_name" { }
variable "node_count" { }
variable "ansible_playbook_path" { }
variable "ansible_vault_password_path" { }
variable "server_type" { }
variable "etcd_snapshots_s3_access_key" { }
variable "etcd_snapshots_s3_secret_key" { }
variable "etcd_snapshots_s3_bucket_name" { }
variable "etcd_snapshots_s3_region" { }
variable "etcd_snapshots_s3_endpoint" { }
variable "rancher_hostname" { }
variable "rancher_letsencrypt_email" { }
variable "rancher_version" { }

# Optional
variable "server_location" {
  default = "nbg1"
}
In this file we declare the variables that the root configuration accepts; only server_location has a default, everything else must be provided.
terraform.tfvars
This file contains non-secret configuration settings:
environment_name            = "rancher"
ansible_playbook_path       = "../../../ansible/provision.yml"
ansible_vault_password_path = "~/.secrets/ansible-vault-pass"
server_type                 = "cx31"
server_location             = "hel1"
node_count                  = 1
rancher_hostname            = "rancher.mydomain.com"
rancher_version             = "v2.3.1"
rancher_letsencrypt_email   = "[email protected]"
As you can see I am starting with a single node, but if I later want to expand the cluster I can just update node_count and apply again.
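For example, moving to a three-node HA setup should just be a matter of changing this value in terraform.tfvars and re-running terraform apply:

node_count = 3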
The modules
Let’s see how I’ve split the code into modules now.
servers
This is where I create the servers with Hetzner Cloud, so the content of this module depends on the cloud provider in use.
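I won’t reproduce the module here, but to give a rough idea of its shape, this is a minimal sketch of the kind of resources involved (resource names, the image and the IP ranges below are placeholders of my choosing, and the real module from the previous post also runs the Ansible playbook against the new servers):

resource "hcloud_server" "node" {
  count       = var.node_count
  name        = "${var.environment_name}-${count.index + 1}"
  image       = "ubuntu-18.04"
  server_type = var.server_type
  location    = var.location
}

resource "hcloud_network" "private" {
  name     = var.environment_name
  ip_range = "10.0.0.0/16"
}

resource "hcloud_network_subnet" "private" {
  network_id   = hcloud_network.private.id
  type         = "server"
  network_zone = "eu-central"
  ip_range     = "10.0.1.0/24"
}

resource "hcloud_server_network" "node" {
  count     = var.node_count
  server_id = hcloud_server.node[count.index].id
  subnet_id = hcloud_network_subnet.private.id
}

# output.tf: what the rke module consumes
output "nodes" {
  value = hcloud_server.node
}

output "private_ips" {
  # map of server id => private IP, which the rke module reads with lookup()
  value = { for net in hcloud_server_network.node : net.server_id => net.ip }
}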
rke
This module creates the Kubernetes cluster using the nodes created by the servers module. You may need to tweak some settings depending on your cloud provider.
modules/rke/main.tf
resource "rke_cluster" "rancher-cluster" {
  cluster_name = var.environment_name

  dynamic "nodes" {
    for_each = var.nodes

    content {
      address           = nodes.value.ipv4_address
      hostname_override = nodes.value.name
      internal_address  = lookup(var.private_ips, nodes.value.id, 0)
      user              = var.ssh_user
      role              = ["controlplane", "etcd", "worker"]
      ssh_key           = file(var.private_ssh_key_file)
    }
  }

  services_etcd {
    snapshot = var.etcd_snapshots_enabled

    backup_config {
      interval_hours = var.etcd_snapshots_interval_hours
      retention      = var.etcd_snapshots_retention

      s3_backup_config {
        access_key  = var.etcd_snapshots_s3_access_key
        secret_key  = var.etcd_snapshots_s3_secret_key
        bucket_name = var.etcd_snapshots_s3_bucket_name
        region      = var.etcd_snapshots_s3_region
        endpoint    = var.etcd_snapshots_s3_endpoint
      }
    }
  }

  network {
    plugin = "canal"
  }
}

resource "local_file" "kubeconfig-yaml" {
  filename = "${path.root}/kube_config_${var.environment_name}.yml"
  content  = rke_cluster.rancher-cluster.kube_config_yaml
}

resource "local_file" "rke_state" {
  filename = "${path.root}/rke_data/cluster.rkestate"
  content  = rke_cluster.rancher-cluster.rke_state
}

resource "local_file" "rke_cluster_yaml" {
  filename = "${path.root}/rke_data/cluster.yml"
  content  = rke_cluster.rancher-cluster.rke_cluster_yaml
}
In this file, we create a cluster with RKE using the nodes passed as a variable, and configure recurring etcd backups. I am only ever going to use 1 or 3 nodes for Rancher, so I have hardcoded all the roles here for each node. I won’t need to have separate workers etc. Note that I am telling Terraform to create some files with the cluster data, namely the kubeconfig file that I can use with kubectl and helm commands, and the RKE state data files which I can later use to take snapshots of etcd manually or restore from an existing snapshot, which we’ll look into later.
modules/rke/output.tf
output "cluster" {
  value = rke_cluster.rancher-cluster
}

output "kubeconfig_file" {
  value = local_file.kubeconfig-yaml.filename
}
Here we “emit” the cluster data and the path of the kubeconfig file so that they can be read by the main module and passed to the rancher module to set up Rancher.
modules/rke/variables.tf
variable "nodes" { }
variable "private_ips" { }

variable "ssh_user" {
  default = "deploy"
}

variable "private_ssh_key_file" {
  default = "~/.ssh/id_rsa"
}

variable "environment_name" { }

variable "etcd_snapshots_interval_hours" {
  default = 6
}

variable "etcd_snapshots_retention" {
  default = 28
}

variable "etcd_snapshots_enabled" {
  default = true
}

variable "etcd_snapshots_s3_access_key" { }
variable "etcd_snapshots_s3_secret_key" { }
variable "etcd_snapshots_s3_bucket_name" { }
variable "etcd_snapshots_s3_region" { }
variable "etcd_snapshots_s3_endpoint" { }
Here we list the variables that can be customised for this module.
rancher
This is the module that actually installs Rancher.
modules/rancher/variables.tf
variable "kubeconfig_file" { }
variable "kubernetes_cluster" { }
variable "rancher_hostname" { }
variable "letsencrypt_email" { }
variable "rancher_version" { }
First, we require the location of the kubeconfig file as well as the data of the RKE cluster, which we’ll need to authenticate to the cluster with the kubernetes and helm providers. We also need to specify the hostname/domain name for Rancher, the version of the Rancher chart to install, and the email to use with Let’s Encrypt to request a TLS certificate for Rancher.
modules/rancher/main.tf
Before we can install Rancher, we need to install cert-manager, which will handle issuing and renewing the TLS certificate. Both cert-manager and Rancher will be installed with Helm, so we also need Tiller in the cluster. Because of RBAC, we need to create a service account and a cluster role binding for Tiller, which we can do using the Kubernetes provider for Terraform:
resource "kubernetes_service_account" "tiller" {
  metadata {
    name      = "tiller"
    namespace = "kube-system"
  }
}

resource "kubernetes_cluster_role_binding" "tiller" {
  depends_on = [kubernetes_service_account.tiller]

  metadata {
    name = "tiller"
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = "cluster-admin"
  }

  subject {
    kind      = "ServiceAccount"
    name      = "tiller"
    namespace = "kube-system"
  }
}
There isn’t a dedicated step to install Tiller. Instead, we tell Terraform that we want to use the Helm provider, and to install Tiller if it’s not installed yet:
provider "helm" {
  kubernetes {
    host                   = var.kubernetes_cluster.api_server_url
    client_certificate     = var.kubernetes_cluster.client_cert
    client_key             = var.kubernetes_cluster.client_key
    cluster_ca_certificate = var.kubernetes_cluster.ca_crt
  }

  install_tiller  = true
  service_account = "tiller"
}
As you can see we authenticate to the cluster using the data taken from the rke module and passed to the rancher module by the parent module.
Next, we need to install cert-manager. To do this we need to install some CRDs first which requires applying a manifest; I couldn’t find a way to install a manifest directly with the kubernetes provider, so for now I am doing it as follows:
resource "null_resource" "cert-manager-crds" {
  depends_on = [kubernetes_cluster_role_binding.tiller]

  provisioner "local-exec" {
    command = "kubectl --kubeconfig ${var.kubeconfig_file} apply --validate=false -f https://raw.githubusercontent.com/jetstack/cert-manager/release-0.9/deploy/manifests/00-crds.yaml"
  }
}
As you can see we just use kubectl for this as instructed by the cert-manager docs. We can then install the cert-manager chart:
data "helm_repository" "jetstack" {
  name = "jetstack"
  url  = "https://charts.jetstack.io"
}

resource "helm_release" "cert-manager" {
  depends_on = [null_resource.cert-manager-crds]

  name       = "cert-manager"
  namespace  = "cert-manager"
  repository = "${data.helm_repository.jetstack.metadata.0.name}"
  chart      = "cert-manager"
  version    = "v0.9.1"
  wait       = true
}
So we first add the Jetstack repository and then install the chart, pinned to version v0.9.1, since that’s the latest version known to work with Rancher at the moment; newer versions exist but don’t work with Rancher yet.
Finally, we can install Rancher:
data "helm_repository" "rancher-latest" {
  name = "rancher-latest"
  url  = "https://releases.rancher.com/server-charts/latest"
}

resource "helm_release" "rancher" {
  depends_on = [helm_release.cert-manager]

  name       = "rancher"
  namespace  = "cattle-system"
  repository = "${data.helm_repository.rancher-latest.metadata.0.name}"
  chart      = "rancher"
  version    = var.rancher_version
  wait       = true

  set {
    name  = "hostname"
    value = var.rancher_hostname
  }

  set {
    name  = "ingress.tls.source"
    value = "letsEncrypt"
  }

  set {
    name  = "letsEncrypt.email"
    value = var.letsencrypt_email
  }
}
Like with cert-manager, we add the Rancher repository and install the Rancher chart. I’m using the latest repository, which at the moment offers Rancher 2.3.1; you may want to use the stable repository and version 2.2.9 instead, do as you prefer.
Applying the configuration to create the cluster
Before we can create the cluster, we need to initialise the Terraform state. We need to create a file with the required secrets, preferably outside the code repository; I keep such files in ~/.secrets/terraform/. For the configuration above to work, we need a file with the following secrets:
export ENVIRONMENT_NAME=rancher
export TF_STATE_S3_BUCKET=
export TF_STATE_S3_REGION=
export TF_STATE_S3_ENDPOINT=
export TF_STATE_S3_ACCESS_KEY=
export TF_STATE_S3_SECRET_KEY=
export TF_VAR_hetzner_cloud_token=
export TF_VAR_etcd_snapshots_s3_access_key=
export TF_VAR_etcd_snapshots_s3_secret_key=
export TF_VAR_etcd_snapshots_s3_bucket_name=
export TF_VAR_etcd_snapshots_s3_region=
export TF_VAR_etcd_snapshots_s3_endpoint=
Make sure you set the correct values or else everything will fail miserably :D
Now we can initialise the state:
cd terraform/environments/rancher
source ~/.secrets/terraform/rancher

terraform init \
  -backend-config="bucket=$TF_STATE_S3_BUCKET" \
  -backend-config="region=$TF_STATE_S3_REGION" \
  -backend-config="endpoint=$TF_STATE_S3_ENDPOINT" \
  -backend-config="access_key=$TF_STATE_S3_ACCESS_KEY" \
  -backend-config="secret_key=$TF_STATE_S3_SECRET_KEY" \
  -backend-config="key=terraform/terraform.tfstate"
This will create a terraform/environments/rancher/.terraform directory that contains secrets, so please remember to add it to your .gitignore file.
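If you don’t have an entry for it yet, a one-liner like this (run from the environment directory) is enough:

echo ".terraform/" >> .gitignore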
One last step before we can create the cluster: all the providers in use here are official and will be downloaded automatically by the init command, apart from the RKE provider. You need to download it from here and save it into terraform/environments/rancher/.terraform/plugins/darwin_amd64 if you are using a Mac, or into the equivalent plugins directory for your platform.
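On a Mac it could look something like this (the exact file name depends on the release you downloaded, so treat the paths below as placeholders):

cd terraform/environments/rancher
mkdir -p .terraform/plugins/darwin_amd64
cp ~/Downloads/terraform-provider-rke_v<version> .terraform/plugins/darwin_amd64/
chmod +x .terraform/plugins/darwin_amd64/terraform-provider-rke_v<version>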
Boom! We can now create the cluster with:
terraform apply
Confirm by entering ‘yes’. If all went well, a Kubernetes cluster will be provisioned with RKE and Rancher will be installed into it, ready for use. Once Terraform has done its thing, use kubectl with the generated kubeconfig file and wait for the pods to be ready, then open the Rancher URL in your browser and make sure it is working. Let me know in the comments if something went wrong. Done!
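For example, something along these lines should do (assuming you run it from the environment directory, where the kubeconfig file is generated):

export KUBECONFIG=$(pwd)/kube_config_rancher.yml
kubectl get nodes
# wait until the cert-manager and rancher pods are Running
kubectl -n cert-manager get pods
kubectl -n cattle-system get pods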
Backups and restores
Before closing, I wanted to share how I have been testing backups and restores of etcd with a cluster deployed this way. Unfortunately, while the RKE binary makes it easy to take snapshots and restore from a snapshot, the RKE provider for Terraform doesn’t seem to offer a native way to do the same operations (if I’m missing something please let me know!), but I found a workaround: using the RKE binary directly instead of Terraform, only when I need to back up or restore. To operate on the cluster, the RKE binary requires the cluster.rkestate file as well as cluster.yml. As you may remember, in our code we tell the rke module to save these files to disk, in the rke_data directory, so we can just use the RKE binary with them! You just need to make sure you use a version of the RKE binary that matches the version of RKE used by the Terraform provider, otherwise ...well, you may screw up your cluster or something. :)
In the provider’s releases page you can see the RKE version for the release you are using. The latest version of the provider at the moment uses RKE 0.2.8, so head over to the right RKE release page and download the binary somewhere.
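For example, on a Mac (the asset name below is what the RKE releases page offers for that version; adjust it for your platform):

curl -LO https://github.com/rancher/rke/releases/download/v0.2.8/rke_darwin-amd64
chmod +x rke_darwin-amd64
export RKE_BINARY=$(pwd)/rke_darwin-amd64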
Backups
You can take a backup/snapshot manually with the following commands:
export RKE_BINARY=<location of the binary you have downloaded>

cd terraform/environments/rancher/rke_data

$RKE_BINARY etcd snapshot-save \
  --config cluster.yml \
  --name <name of the snapshot> \
  --s3 \
  --access-key <s3 access key> \
  --secret-key <s3 secret key> \
  --bucket-name <s3 bucket> \
  --s3-endpoint <s3 endpoint>
This will back up the etcd data in a zip file in the s3 bucket specified.
Restores
Restoring is equally easy:
$RKE_BINARY etcd snapshot-restore \
  --config cluster.yml \
  --name <name of the snapshot> \
  --s3 \
  --access-key <s3 access key> \
  --secret-key <s3 secret key> \
  --bucket-name <s3 bucket> \
  --s3-endpoint <s3 endpoint>
I would recommend you test backups and restores. For example you can create the cluster, take a backup, create a user in Rancher, restore from the backup and make sure the user is gone. Or something like that.
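Roughly, the test loop looks like this (same s3 flags as in the commands above; the snapshot name is just an example):

# take a snapshot before making any changes
$RKE_BINARY etcd snapshot-save --config cluster.yml --name before-test \
  --s3 --access-key <s3 access key> --secret-key <s3 secret key> \
  --bucket-name <s3 bucket> --s3-endpoint <s3 endpoint>

# create a test user in the Rancher UI, then restore the snapshot
$RKE_BINARY etcd snapshot-restore --config cluster.yml --name before-test \
  --s3 --access-key <s3 access key> --secret-key <s3 secret key> \
  --bucket-name <s3 bucket> --s3-endpoint <s3 endpoint>

# finally, check in the Rancher UI that the test user is gone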
Final words
The more I use tools such as Rancher and Terraform, the happier I am, because I am learning to automate almost everything. This is nice because, with my current setup, if a major disaster happens and I need to recreate both the Rancher cluster and my app cluster, I can quickly provision both clusters with Terraform first and then restore apps and data from a Velero backup. At the moment I don’t have much going on because I am not in production yet, but it’s nice to know that, as things stand currently, I can be up and running again in less than an hour from when a major outage starts :) Hope you find this post useful. Let me know in the comments if you run into issues.