
From zero to Kubernetes in Hetzner Cloud with Terraform, Ansible and Rancher


(Update Aug 9, 2021: I have built a tool that can deploy a Kubernetes cluster - also highly available if needed - in Hetzner Cloud very, very quickly. It's perhaps the easiest and fastest way to deploy Kubernetes to Hetzner Cloud, so I would suggest you check this option too before proceeding with alternatives proposed on this blog. You can find more info here.)


In this post I would like to share how I currently set up my Kubernetes clusters. I use Hetzner Cloud (referral link, we both receive credits) as my cloud provider since it’s much more affordable than the popular ones (half the price for twice as much memory!) and is very reliable, with excellent performance. So far I’ve been using Ansible to “prepare” my nodes with some basic security configuration and then Rancher to set up Kubernetes. These steps were manual: I first had to purchase the servers from Hetzner Cloud, then run my Ansible playbook, and finally install Kubernetes by creating a cluster in Rancher and running a Docker command (provided by Rancher for that cluster) on each node. It worked, but it was a manual process, slow and error prone. Then, just a few days ago, I “discovered” Terraform by HashiCorp, and it blew my mind. I had heard of it before but hadn’t looked into it yet. It’s a very powerful tool that enables you to implement almost any sort of automation thanks to the many providers available, allowing for so-called “infrastructure as code”. Terraform happens to have a provider for Hetzner Cloud and one for Rancher, plus there is a provisioner that integrates Ansible with Terraform and can be installed manually. Together they make for an amazing combination of powerful tools, thanks to which I can now spin up a new cluster by creating a couple of configuration files and running a single terraform apply command - all the steps mentioned above are performed automatically.

OK, but why?

There are many ways of deploying Kubernetes, and certainly the easiest is by just using a managed Kubernetes service like Google GKE, Amazon EKS, Microsoft AKS, or even DigitalOcean (referral link, we both receive credits) - these are the most popular ones, but there are many others. These services allow you to spin up a cluster in minutes with a few clicks, but they are generally a bit expensive for my pockets, even if some of them charge only for the workers. Like I mentioned, with Hetzner Cloud I typically have twice as much memory for half the price of DigitalOcean, and the difference is even bigger when compared to the other providers. So one reason for setting up Kubernetes myself is cost. Another reason is that this way I can have more control and flexibility with my clusters, even if this means that I have to take care of the control plane and all the rest.

The easiest and most convenient way I’ve found so far to deploy Kubernetes myself is Rancher’s RKE. You can use their CLI tool to set up a cluster from the command line, or you can use the Rancher UI directly, which I prefer because it’s nicer and I like to store the configurations for my clusters in a central place with proper backups (with the CLI, the state of the cluster is instead saved to your local machine).

Using Terraform, I can now spin up a cluster in Hetzner Cloud in minutes, just like with the managed services. I should mention that there are various node drivers for Rancher - including one for Hetzner Cloud - that allow Rancher to create servers directly with those providers before deploying Kubernetes, so you don’t have to do that manually. Unfortunately Rancher doesn’t set up any firewall, harden SSH or anything like that when creating servers, so Kubernetes is left exposed to the Internet. I tried using some cloud-init configuration with the node drivers to do these things, but if these steps take a little while (like updating the system upon server creation, for example) then Rancher seems to “think” that there’s a problem with the servers, and therefore it deletes and recreates them in a sort of loop. So the cloud-init option to set up servers with Rancher doesn’t work well. For this reason, until now I was manually running an Ansible playbook that a) configures a firewall to allow communication for Kubernetes only between the nodes, and opens a few selected ports; b) installs fail2ban to stop small brute force attacks; c) disables root/password authentication with SSH; d) installs Docker. Then, like I said, I would run the Rancher Docker command on each node to deploy Kubernetes.

So over the past couple of days I’ve learnt the basics of writing Terraform modules, and using the aforementioned providers I have Terraform create the servers in Hetzner Cloud, prepare them with Ansible and then configure the Kubernetes cluster with Rancher, all automatically.

OK, how?

What this post is: an opinionated way of setting up Kubernetes in Hetzner Cloud, with Terraform, Ansible and Rancher.

What this post is not: a detailed introduction to Terraform, Ansible or Rancher. You can refer to the documentation of each tool for a proper introduction. Anyway, in short:

Terraform

Terraform is an automation tool that leverages so-called “providers” to interact with many services via APIs and perform various tasks automatically. You write some code in a simple language, and the nice thing is that you can track the history of your code - and thus of your infrastructure - with git, with obvious benefits. I no longer need to keep notes somewhere; everything is documented in code. Terraform can also provision the servers it creates in a number of ways, and I found this provisioner for Ansible that can automatically execute an Ansible playbook on each server. I also use a provisioner to run the Docker command required by Rancher so that the node can join the Kubernetes cluster. Terraform is very clever: at each run of the apply command, it figures out what needs to be changed in the infrastructure in order to achieve the desired state. The code is mostly organized in modules, and a module can include/refer to another module.
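
For reference, the basic Terraform workflow consists of three commands (later in this post we’ll use a small wrapper around them):

terraform init    # download the required providers and modules
terraform plan    # preview the changes needed to reach the desired state
terraform apply   # apply those changes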

Ansible

Ansible is a configuration management tool. Similarly to Terraform, you define the configuration in simple YAML files that you can track in a git repo. The two may seem to overlap in some areas, but I think they work well together: Terraform to provision infrastructure, and Ansible to configure servers. Ansible “playbooks” can be run any number of times, and they are typically written in such a way that any action performed on the servers is idempotent. A playbook is made of one or more “roles”, each of which takes care of a specific part of the configuration of a server.

Rancher

Rancher is an amazing tool to provision and manage Kubernetes clusters. It can provision a cluster using any of the major managed Kubernetes services, or by interacting with cloud providers to create servers directly, or by letting you use “custom” nodes you provision yourself. You can even import existing clusters! It truly is amazing software and I feel lucky I can use it for free, even though I cannot afford support yet and need to figure out everything myself - having said that, Rancher is very easy to use! You can install Rancher either as a single node (just a Docker command) or in HA mode, which basically means that Rancher runs itself in a Kubernetes cluster (you can, for example, create this Kubernetes cluster with the Rancher RKE CLI, and then manage other clusters with Rancher).
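
For example, a single node installation of Rancher really is just one Docker command; this is the standard command from the Rancher docs (depending on the Rancher version you may need extra flags, and you’ll want to pin a specific version tag):

docker run -d --restart=unless-stopped -p 80:80 -p 443:443 rancher/rancher:latest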

Setting up the configuration

To make things easier for those who are just starting, I have shared the code that I use with Terraform and Ansible on GitHub. First things first, I like to have a repository where I keep all of the stuff I need to manage clusters. In this repo I have - among other things - a directory for Ansible and a directory for Terraform.

The Terraform part

To configure a new cluster, I like to create a subdirectory called terraform/clusters/<cluster name>. In this directory I have four files:

└── terraform
    └── clusters
        ├── <cluster name>
        │   ├── main.tf
        │   ├── output.tf
        │   ├── terraform.tfvars
        │   └── variables.tf

main.tf

This file defines the Terraform providers and modules to use, and the variables (=configuration settings) that each module needs. At the moment, this file looks like this:

terraform {
  backend "s3" {
    skip_requesting_account_id  = true
    skip_credentials_validation = true
    skip_get_ec2_platforms      = true
    skip_metadata_api_check     = true
  }
}

provider "rancher2" {
  api_url   = "${var.rancher_api_url}"
  token_key = "${var.rancher_api_token}"
}

module "rancher" {
  source = "github.com/vitobotta/terraform-rancher"

  cluster_name        = var.cluster_name
  etcd_s3_access_key  = var.etcd_s3_access_key
  etcd_s3_secret_key  = var.etcd_s3_secret_key
  etcd_s3_region      = var.etcd_s3_region
  etcd_s3_endpoint    = var.etcd_s3_endpoint
  etcd_s3_bucket_name = var.etcd_s3_bucket_name
}


provider "hcloud" {
  token = var.hcloud_token
}

module "hcloud" {
  source = "github.com/vitobotta/terraform-hcloud"

  servers                     = var.servers
  cluster_name                = var.cluster_name
  ssh_private_key             = var.ssh_private_key
  ssh_public_key              = var.ssh_public_key
  ansible_playbook_path       = var.ansible_playbook_path
  ansible_vault_password_path = var.ansible_vault_password_path
  rancher_node_command        = module.rancher.node_command
}

In the first block I tell Terraform that I want to use an S3 bucket to store the “state” of the cluster (any S3 compatible storage will do - I use Wasabi - but there are other backends available). Using a remote backend for the state is recommended; if no backend is specified, Terraform will store it in the local filesystem. Either way, it is important that you keep a copy of this state at all times or you will lose the ability to update an existing cluster.
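
As an extra precaution, you can pull a local copy of the remote state at any time with Terraform’s standard state pull command, for example (using the tf wrapper introduced later in this post; the backup path is just an example):

tf --cluster <cluster name> state pull > ~/backups/<cluster name>.tfstate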

The following blocks tell Terraform that I want to use the Hetzner Cloud and Rancher providers. For the Rancher provider, the only configuration settings required are the URL of the API (at the moment in the format rancher_url/v3) and the token to authenticate with Rancher, which you can easily create from the Rancher UI. I am also using a custom module, so I am defining which variables to use for its configuration. In the source variable you can see the URL of the repository where I uploaded the module’s code. Check it out, it’s very easy to understand, and from the variables.tf file you can see all the available configuration options. The module creates a cluster in Rancher, specifying if/where to back up the etcd state, and some options for monitoring. From the code you can see that I am using Canal as the CNI, and that the cloud-provider for kubelet is set to external. This is something we’ll need later to use a floating IP with Hetzner Cloud. The way we are going to set up the cluster, the floating IP is automatically assigned by Kubernetes to a healthy node, so we can refer to this IP in the DNS settings instead of the IPs of nodes that may go down or be removed altogether.

Then there’s the Hetzner Cloud section: I am specifying a variable for the token to use to authenticate with the API. Then I am telling Terraform that I want to use a custom module, which you can also find at the specified GitHub repo. This module will create the servers specified in the servers variable, create a private network and attach every server to it, and create a floating IP. By default, the module will also run an Ansible playbook on each server upon creation, and finally run the join command generated for the Rancher cluster. Again, there’s no point in replicating all the code here - just read it in the repo, since it’s very easy to understand. Here I’m just listing the steps required to create a cluster.

output.tf

output "server_ips" {
  value = module.hcloud.server_ips
}

output "floating_ip" {
  value = module.hcloud.floating_ip
}

output "kube_config" {
  value = module.rancher.kube_config
}

In this file we define some “outputs”, that is, information that will be displayed when Terraform is done setting up the cluster, including the kubeconfig file needed to manage the cluster (note: this kubeconfig will proxy access to the cluster through your Rancher instance).
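
You can also read these outputs again at any time after the cluster has been created, without re-running apply. For example, to print the floating IP to point your DNS records at (using the tf wrapper introduced later in this post):

tf --cluster <cluster name> output floating_ip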

terraform.tfvars

Here we set the configuration options that are required. There are more options, which you can see in the repos for the modules, but this is the minimum required to create a cluster:

# Shared configuration
cluster_name    = "test"
ssh_private_key = "~/.ssh/id_rsa"
ssh_public_key  = "~/.ssh/id_rsa.pub"

# Rancher configuration
rancher_api_url     = "..."
etcd_s3_bucket_name = "..."
etcd_s3_endpoint    = "..."
etcd_s3_region      = "..."

# Hetzner Cloud configuration
ansible_playbook_path       = "../../../ansible/provision.yml"
ansible_vault_password_path = "~/.secrets/ansible-vault-pass"

servers = {
  1 = {
    name               = "test-master1"
    private_ip_address = "10.0.0.2"
    server_type        = "cx41"
    image              = "centos-7"
    location           = "nbg1"
    backups            = true
    roles              = "--worker --etcd --controlplane"
  },

  2 = {
    name               = "test-master2"
    private_ip_address = "10.0.0.3"
    server_type        = "cx41"
    image              = "centos-7"
    location           = "nbg1"
    backups            = true
    roles              = "--worker --etcd --controlplane"
  },

  3 = {
    name               = "test-master3"
    private_ip_address = "10.0.0.4"
    server_type        = "cx41"
    image              = "centos-7"
    location           = "nbg1"
    backups            = true
    roles              = "--worker --etcd --controlplane"
  },
}

So we give the cluster a name, tell Terraform where to find our SSH keys to set up the servers, the Rancher API URL, the location of the Ansible playbook as well as the location of the password file used to decrypt the Ansible vault with the secrets; finally, the list of the servers to create. As you can see, for each server we give a name and specify the IP address to assign to the private network interface (we’ll set up Kubernetes to use the private network so as to minimise the risk of someone intercepting the traffic between the nodes). We also specify the model of server to create (you can see the available options here), the OS image to use (all this stuff expects CentOS 7 by the way), the location of the server, whether to enable backups, and finally the roles of the server in Kubernetes, according to the format of the Rancher join commands. I currently manage small clusters; if a cluster has three nodes, then each node has all the roles, otherwise 3 nodes have the etcd and controlplane roles and the remaining nodes have only the worker role.
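
For example, to add a worker-only node to a larger cluster you would just append another entry to the servers map, with only the worker role (the name and private IP below are of course just an example):

  4 = {
    name               = "test-worker1"
    private_ip_address = "10.0.0.5"
    server_type        = "cx41"
    image              = "centos-7"
    location           = "nbg1"
    backups            = true
    roles              = "--worker"
  }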

variables.tf

This file defines which configuration options must be set. As you can see from the code in the repos of the modules, there are more settings available; those not specified in this file have a default value - check them out to customise things for your clusters.

variable "cluster_name" {}
variable "hcloud_token" {}
variable "ssh_private_key" {}
variable "ssh_public_key" {}
variable "rancher_api_url" {}
variable "rancher_api_token" {}
variable "etcd_s3_access_key" {}
variable "etcd_s3_secret_key" {}
variable "etcd_s3_bucket_name" {}
variable "etcd_s3_endpoint" {}
variable "etcd_s3_region" {}
variable "servers" {}
variable "ansible_playbook_path" {}
variable "ansible_vault_password_path" {}

The Ansible part

The content of my ansible directory is simple: there is no roles subdirectory because, to make things easier, I’ve shared the roles I am using in repos on GitHub.

ansible
├── ansible.cfg
├── group_vars
│   └── all
│       ├── vars.yml
│       └── vault.yml
├── provision.yml
└── requirements.yml

ansible.cfg

[defaults]
nocows = 1
vault_password_file = ~/.secrets/ansible-vault-pass
host_key_checking = False

[ssh_connection]
pipelining = True

vault.yml

Create the ~/.secrets/ansible-vault-pass file with the password for your Ansible vault in it. We’ll use a vault to keep secrets in an encrypted format, so we can commit them into git. We’ll create the vault in group_vars/all/vault.yml; all means that the configuration files in this directory will be applied to each server provisioned with Ansible. To create the vault, run the following command from inside the ansible directory:

ansible-vault create group_vars/all/vault.yml

This will open your default editor and in it you can specify secrets in YAML format. At the moment, the secrets required for my Ansible roles are:

---
user_password: ...
smtp_host: ...
smtp_port: ...
smtp_username: ...
smtp_password: ...
smtp_hostname: ...
kernelcare_key: ...

The first is used to configure a password for the deploy user on each server; this is needed because the users role will disable root authentication with SSH. Then we have some settings for the SMTP server to use as a relay for Postfix, to ensure we receive alerts from the servers (such as Fail2ban notifications when an IP is banned because of brute force attempts, notifications of successful logins, and notifications when updates are available); the last setting is the Kernelcare key, which is optional. I use Kernelcare to keep the kernel patched (against security vulnerabilities) to minimise the need for reboots.
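
If you need to change any of these secrets later, you can reopen the vault with the edit command, which decrypts it in your editor and re-encrypts it when you save:

ansible-vault edit group_vars/all/vault.yml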

vars.yml

---
# SSH
ssh_public_key_files:
  - ~/.ssh/id_rsa.pub

# firewalld
private_network_subnet: "10.0.0.0/16"
whitelisted_ips:
  - "<your ip>/32"
open_services:
  - ssh
  - http
  - https

# Email
admin_email: <sender email address from which the server will send alerts/notifications>
notifications_email: <your email address>

# Fail2ban
fail2ban_sender: <sender address for Fail2ban notifications>
fail2ban_ignoreips:
  - <your ip>
fail2ban_destemail: <your email address>

provision.yml

---
- name: Node provisioning
  hosts: all
  remote_user: deploy
  become: yes
  become_method: sudo
  roles:
    - role: ansible-bootstrap-role
      tags: bootstrap
    - role: ansible-postfix-role
      tags: postfix
    - role: ansible-users-role
      tags: users
    - role: ansible-fail2ban-role
      tags: fail2ban
    - role: ansible-hcloud-floating-ip-role
      tags: floating_ip
    - role: ansible-firewalld-role
      tags: firewalld
    - role: ansible-docker-role
      tags: docker
    - role: ansible-kernelcare-role
      tags: kernelcare

It’s just a list of the roles. You can see that the user used to connect to the servers with SSH is deploy here; this makes it possible to update the server configuration by running Ansible manually after a server has already been provisioned when created with Terraform. For the first run triggered by Terraform, the user is set to root, as you can see here.
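
For example, to re-run just the firewall role against an existing cluster you can run something like the following from the ansible directory; this assumes you maintain an inventory file listing the nodes’ IPs, which is not covered in this post:

ansible-playbook -i inventory provision.yml --tags firewalld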

requirements.yml

This file specifies the Ansible roles to install as dependencies for our playbook. These are all repositories I have pushed to Github for ease of use.

- src: https://github.com/vitobotta/ansible-bootstrap-role.git
- src: https://github.com/vitobotta/ansible-docker-role.git
- src: https://github.com/vitobotta/ansible-fail2ban-role.git
- src: https://github.com/vitobotta/ansible-firewalld-role.git
- src: https://github.com/vitobotta/ansible-hcloud-floating-ip-role.git
- src: https://github.com/vitobotta/ansible-kernelcare-role.git
- src: https://github.com/vitobotta/ansible-postfix-role.git
- src: https://github.com/vitobotta/ansible-users-role.git

To install those roles, run from the ansible directory:

ansible-galaxy install -r requirements.yml

Creating the cluster

The configuration is basically done. It takes longer to write about the files than to actually create them. It’s just some small configuration that you need to write once and can then duplicate and customise for each new cluster, provided these clusters share the way they are set up.

Before we can create the cluster, let’s write a simple wrapper around the terraform command so we don’t have to cd into the cluster directory and load its secrets each time. Create a file named tf in a directory in your PATH with this content:

#!/bin/bash

POSITIONAL=()
while [[ $# -gt 0 ]]
do
key="$1"

case $key in
  --cluster)
    CLUSTER="$2"
    shift # past argument
    shift # past value
  ;;
  *)    # unknown option
    POSITIONAL+=("$1") # save it in an array for later
    shift # past argument
  ;;
esac
done
set -- "${POSITIONAL[@]}" # restore positional parameters

[[ -z "$CLUSTER" ]] && { echo "Please specify the name of the cluster with '--cluster'" ; exit 1; }

cd terraform/clusters/$CLUSTER && source ~/.secrets/terraform-$CLUSTER.sh && terraform "$@"

Create the file ~/.secrets/terraform-<cluster name>.sh with the following secrets:

export TF_VAR_hcloud_token=
export TF_VAR_rancher_api_token=
export TF_VAR_etcd_s3_access_key=
export TF_VAR_etcd_s3_secret_key=

Next we need to initialise the cluster state with Terraform. As mentioned earlier, we are keeping this state in an S3 bucket. To initialise the state, run:

tf --cluster <cluster name> init \
  -backend-config="bucket=<s3 bucket>" \
  -backend-config="region=<s3 region>" \
  -backend-config="endpoint=<s3 endpoint, not needed with Amazon S3" \
  -backend-config="access_key=<s3 access key>" \
  -backend-config="secret_key=<s3 secret key>" \
  -backend-config="key=terraform/terraform.tfstate"

The command above will:

  1. Create a state file in the terraform folder in the S3 bucket
  2. Create the directory terraform/clusters/<cluster name>/.terraform on your machine and download the Rancher and Hetzner Cloud providers into it
  3. Create the config file terraform/clusters/<cluster name>/.terraform/terraform.tfstate; this file contains the secrets needed to connect to the S3 bucket, so remember to add it to your .gitignore file

Please note that Terraform automatically downloads the providers, but not the Ansible provisioner. You need to download it from here and save it as terraform/clusters/<cluster name>/.terraform/plugins/darwin_amd64/terraform-provisioner-ansible_v2.3 - change the darwin_amd64 directory if you are not using macOS.
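
For example, after downloading the build of the provisioner for your platform, you can put it in place with something like this (the download filename and the darwin_amd64 directory are just examples - adjust them to the release and platform you are using):

PLUGINS_DIR=terraform/clusters/<cluster name>/.terraform/plugins/darwin_amd64
mkdir -p "$PLUGINS_DIR"
cp ~/Downloads/terraform-provisioner-ansible "$PLUGINS_DIR/terraform-provisioner-ansible_v2.3"
chmod +x "$PLUGINS_DIR/terraform-provisioner-ansible_v2.3"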

Now that we have initialised the state and have the provisioner and providers in place, we can run the following command from the root directory to see what Terraform is going to do with our cluster:

tf --cluster <cluster name> plan

Terraform will figure out that it’s a new cluster, so it will say that it’s going to create all the required resources, from the servers to the network and the floating IP to the Rancher cluster.

Finally, to actually create the cluster run

tf --cluster <cluster name> apply

and confirm by entering yes.

What happens next? You’ll just have to wait 10-15 minutes, depending on how fast the servers you chose are. When Terraform is done, you’ll first see the kubeconfig for the Kubernetes cluster, the IPs of the servers and the floating IP. You can then follow the progress of the Kubernetes deployment from the Rancher UI.
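
For example, to save the kubeconfig from the Terraform output and point kubectl at the new cluster (depending on your Terraform version you may need to strip surrounding quotes from the output):

tf --cluster <cluster name> output kube_config > ~/.kube/config-<cluster name>.yaml
export KUBECONFIG=~/.kube/config-<cluster name>.yaml
kubectl get nodes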

Once Rancher reports that the cluster has been deployed, you’ll need one last step to enable the automatic assignment of the floating IP to a healthy node (in the meantime it won’t be possible to schedule pods on the nodes). First, install the Hetzner Cloud controller manager (I assume you have kubectl available and that you have already saved the kubeconfig printed by Terraform to a file and exported its path as the KUBECONFIG environment variable):

HETZNER_CLOUD_TOKEN=...

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: hcloud
  namespace: kube-system
stringData:
  token: "$HETZNER_CLOUD_TOKEN"
  network: "default"
EOF

kubectl apply -f https://raw.githubusercontent.com/hetznercloud/hcloud-cloud-controller-manager/master/deploy/v1.4.0-networks.yaml

Then, you need to install the floating IP controller, which will do the actual assignment of the floating IP:

FLOATING_IP=<taken from the Terraform output>

kubectl -n kube-system patch ds canal --type json -p '[{"op":"add","path":"/spec/template/spec/tolerations/-","value":{"key":"node.cloudprovider.kubernetes.io/uninitialized","value":"true","effect":"NoSchedule"}}]'

kubectl create namespace fip-controller
kubectl apply -f https://raw.githubusercontent.com/cbeneke/hcloud-fip-controller/master/deploy/rbac.yaml
kubectl apply -f https://raw.githubusercontent.com/cbeneke/hcloud-fip-controller/master/deploy/deployment.yaml

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: fip-controller-config
  namespace: fip-controller
data:
  config.json: |
    {
      "hcloudFloatingIPs": [ "$FLOATING_IP" ],
      "nodeAddressType": "external"
    }
---
apiVersion: v1
kind: Secret
metadata:
  name: fip-controller-secrets
  namespace: fip-controller
stringData:
  HCLOUD_API_TOKEN: $HETZNER_CLOUD_TOKEN
EOF

Done! The cluster is now ready and pods can be scheduled on the nodes. As a bonus, you may want to use Hetzner Cloud Volumes as storage backend for your persistent volumes. To do this, you can install their CSI driver. First create a secret:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: hcloud-csi
  namespace: kube-system
stringData:
  token: $HETZNER_CLOUD_TOKEN
EOF

Then install:

kubectl apply -f https://raw.githubusercontent.com/hetznercloud/csi-driver/master/deploy/kubernetes/hcloud-csi.yml

This will create the storage class.
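
To verify that everything works, you can create a test persistent volume claim against the storage class created by the CSI driver (at the time of writing it is named hcloud-volumes - double check with kubectl get storageclass):

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: hcloud-volumes
EOF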

Wrapping up

I’m very pleased with this way of setting up my Kubernetes clusters. I save a ton of money (well, for my pockets) and, once you have created the configuration files, it takes minutes to spin up a new cluster from scratch. It took more effort to write this post than it takes to actually do this stuff, especially considering that I am sharing ready-made Ansible roles and Terraform modules :)

Please let me know in the comments if you run into any issues and I will be happy to help. Hope you find this useful :)
