
Storing Grafana Loki log data in a Google Cloud Storage bucket


I am a big fan of Grafana Loki for log collection in Kubernetes because it's simple and lightweight, and it makes both collecting logs and querying them from Grafana very easy.

Until recently I always configured Loki to use persistent volumes to store log data. This works well, but the problem with persistent volumes is that it's difficult to predict how large they need to be, since that depends on the amount of logs and the configured retention. So, to simplify things and stop worrying about volume size, I decided to switch the storage to a Google Cloud Storage bucket instead.

Since our apps are hosted in Google Kubernetes Engine in the same location, performance is still pretty good and we can store an unlimited amount of logs indefinitely if we want. 

In this post I am going to describe how to configure Loki to use a Google Cloud Storage bucket to store log data.

In a terminal, set the following environment variables so we can remove some duplication from the commands we need to run:

ENVIRONMENT=production
PROJECT=brella-$ENVIRONMENT
BUCKET_NAME=$PROJECT-loki
SA_NAME=loki-logging
NS=logging
SA=${SA_NAME}@${PROJECT}.iam.gserviceaccount.com

These basically set the GCP project name (we at Brella have a project for each environment, named with the convention `brella-<environment>`, so for production the project is simply `brella-production`). You can of course customize the names of the project, the bucket, the service account (which we'll use to manipulate objects in the bucket) and the Kubernetes namespace in which we'll install Loki.

Creating the storage bucket

To create the storage bucket from the terminal run this command:

gsutil mb -b on -l eu -p ${PROJECT} gs://${BUCKET_NAME}/ 

In my case I chose "eu" as the location to have the log data stored in multiple regions inside Europe, but if you prefer you can choose a single region to save on costs. To double check that the bucket has been created:

gsutil ls gs://${BUCKET_NAME}/

 

Creating the service account

Once installed in Kubernetes, Loki's components need to be able to access the bucket and manipulate the objects inside it. The best way to authenticate deployments in GKE and grant them the required permissions to modify the contents of the bucket is Workload Identity (do read that page, because you need to enable Workload Identity for your cluster before proceeding). This basically links a GCP service account that has the required permissions with a regular Kubernetes service account, so that the latter "inherits" the permissions of the former. This makes things easier than, for example, mounting a service account key file in the pods.
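
For reference, this is roughly how you can check whether Workload Identity is already enabled on an existing cluster, and enable it if it isn't. The cluster name and location below are placeholders for your own values, and note that existing node pools may also need the GKE metadata server enabled (see the Workload Identity docs):

gcloud container clusters describe <cluster-name> --region <region> --project ${PROJECT} \
    --format="value(workloadIdentityConfig.workloadPool)"

gcloud container clusters update <cluster-name> --region <region> --project ${PROJECT} \
    --workload-pool=${PROJECT}.svc.id.goog

The first command prints the workload pool if Workload Identity is enabled and nothing otherwise; the second enables it.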

To create the service account run:

gcloud iam service-accounts create ${SA_NAME} --project ${PROJECT} --display-name="Service account for Loki"

gcloud projects add-iam-policy-binding ${PROJECT} \
    --member="serviceAccount:${SA_NAME}@${PROJECT}.iam.gserviceaccount.com" \
    --project ${PROJECT} \
    --role="roles/storage.objectAdmin"

As you can see, here we assign a role to the service account that allows it to modify the content of storage buckets.
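
Note that roles/storage.objectAdmin granted at the project level applies to all buckets in the project. If you prefer to scope the permission to just the Loki bucket, granting the role on the bucket itself should also work; I used the project-level binding in my setup, so treat this as an untested alternative:

gsutil iam ch serviceAccount:${SA}:roles/storage.objectAdmin gs://${BUCKET_NAME}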

Deploying Loki

We'll deploy Loki with Helm. There are various official charts from Grafana for this, but the one I prefer is loki-distributed, because it's simple to manage and allows for a scalable deployment in which each component can be scaled separately according to your needs.

First, we need to create a file containing the configuration options for the Helm chart. Run the following command to create it as /tmp/loki.yaml (or wherever you prefer):

cat > /tmp/loki.yaml <<EOF
fullnameOverride: loki
distributor:
  replicas: 3
  maxUnavailable: 1
gateway:
  replicas: 3
  maxUnavailable: 1
  basicAuth:
    enabled: false
customParams:
  gcsBucket: loki
ingester:
  replicas: 3
  maxUnavailable: 1
  persistence:
    enabled: true
querier:
  replicas: 3
  maxUnavailable: 1
  persistence:
    enabled: true
serviceAccount:
  annotations:
    iam.gke.io/gcp-service-account: ${SA}
loki:
  config: |
      auth_enabled: false
      server:
        http_listen_port: 3100
      distributor:
        ring:
          kvstore:
            store: memberlist
      memberlist:
        join_members:
          - {{ include "loki.fullname" . }}-memberlist
      schema_config:
        configs:
          - from: 2020-09-07
            store: boltdb-shipper
            object_store: gcs
            schema: v11
            index:
              prefix: loki_index_
              period: 24h
      ingester:
        lifecycler:
          ring:
            kvstore:
              store: memberlist
            replication_factor: 1
        chunk_idle_period: 10m
        chunk_block_size: 262144
        chunk_encoding: snappy
        chunk_retain_period: 1m
        max_transfer_retries: 0
        wal:
          dir: /var/loki/wal
      limits_config:
        enforce_metric_name: false
        reject_old_samples: true
        reject_old_samples_max_age: 168h
        max_cache_freshness_per_query: 10m
        retention_period: 2160h
        split_queries_by_interval: 15m
      storage_config:
        gcs:
          bucket_name: {{ .Values.customParams.gcsBucket }}
        boltdb_shipper:
          active_index_directory: /var/loki/boltdb-shipper-active
          cache_location: /var/loki/boltdb-shipper-cache
          cache_ttl: 24h
          shared_store: gcs
      chunk_store_config:
        max_look_back_period: 0s
      table_manager:
        retention_deletes_enabled: true
        retention_period: 2160h
      query_range:
        align_queries_with_step: true
        max_retries: 5
        cache_results: true
        results_cache:
          cache:
            enable_fifocache: true
            fifocache:
              max_size_items: 1024
              validity: 24h
      frontend_worker:
        frontend_address: loki-query-frontend:9095
      frontend:
        log_queries_longer_than: 5s
        compress_responses: true
        tail_proxy_url: http://loki-querier:3100
      compactor:
        shared_store: gcs
        retention_enabled: true
        retention_delete_delay: 2h
        retention_delete_worker_count: 150
        compaction_interval: 10m
      ruler:
        storage:
          type: local
          local:
            directory: /etc/loki/rules
        ring:
          kvstore:
            store: memberlist
        rule_path: /tmp/loki/scratch
EOF

Most of these settings should be straightforward if you are familiar with how Loki works. If not, please read about its architecture as well as its configuration.

I just want to highlight one small bit:

serviceAccount:
  annotations:
    iam.gke.io/gcp-service-account: ${SA}

This annotation on the Kubernetes service account used by Loki's components is what actually links it to the GCP service account so that Workload Identity works (this is the Kubernetes-to-GCP side of the link; the GCP-to-Kubernetes side is the IAM binding we'll add in the next steps).
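
Once Loki is installed later on, you can double check that the annotation ended up on the Kubernetes service account (named loki because of the fullnameOverride) with something like:

kubectl -n ${NS} get serviceaccount loki -o yaml | grep iam.gke.io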

Next steps:

- Create the namespace:

kubectl create ns ${NS}

- Enable Workload Identity for the Kubernetes service account (loki here is the name of the service account that the Helm chart will create, thanks to the fullnameOverride):

gcloud iam service-accounts add-iam-policy-binding --project ${PROJECT} ${SA} \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:${PROJECT}.svc.id.goog[${NS}/loki]"

- Add the Grafana Helm repository:

helm repo add grafana https://grafana.github.io/helm-charts/
helm repo update

- Install Loki distributed:

helm upgrade --install loki grafana/loki-distributed \
-f /tmp/loki.yaml \
--set customParams.gcsBucket=${BUCKET_NAME} \
--version 0.67.1 \
--namespace ${NS} 

- We also need to install Promtail, which runs as a DaemonSet on each node and ships the logs to Loki (the tolerations below allow it to run on spot nodes too):

helm upgrade --install \
--namespace ${NS} \
--set "loki.serviceName=loki-query-frontend" \
--set "tolerations[0].operator=Exists,promtail.tolerations[0].effect=NoSchedule,promtail.tolerations[0].key=cloud.google.com/gke-spot" \
--set "config.clients[0].url=http://loki-gateway/loki/api/v1/push" \
promtail grafana/promtail
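
At this point you can check that all the pods are starting up correctly:

kubectl -n ${NS} get pods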

That's really it for a basic setup that stores Loki logs in a Google Cloud Storage bucket. Once all the pods are up and running, add a new Loki data source to Grafana with the URL http://loki-query-frontend.logging:3100 and confirm that the logs are coming through. 
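
If you provision Grafana data sources from configuration files rather than the UI, a minimal entry for this setup could look like the following sketch (adjust the URL if you used a different namespace):

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-query-frontend.logging:3100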

To confirm that the logs are also being stored in the bucket, wait 10-15 minutes and then run:

gsutil ls gs://${BUCKET_NAME}/

You should see a JSON file as well as a couple of directories that store the log chunks and the index.
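
You can also keep an eye on how much log data is accumulating in the bucket over time with:

gsutil du -s -h gs://${BUCKET_NAME}/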

You can also wait for a while to collect some logs, then restart all of the Loki components (you'll see a few Deployments as well as a couple of StatefulSets, plus the Promtail DaemonSet) and confirm that you can still search the old logs, which proves that the data is retrieved from the bucket and not from local storage. In my case, since it was the first time I set up Loki with a Google bucket, I wanted to be 100% sure about this, so I tested by completely uninstalling Loki and making sure I could still search through the old logs after reinstalling it. It works nicely, and so far I haven't seen any difference in query speed compared to persistent volumes. Perhaps this will change as the amount of logs grows; I don't know yet.
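
For reference, restarting everything could look roughly like this; the label selector assumes the standard Helm labels that the Grafana charts apply, and that the releases are named loki and promtail as above:

kubectl -n ${NS} rollout restart deployment -l app.kubernetes.io/instance=loki
kubectl -n ${NS} rollout restart statefulset -l app.kubernetes.io/instance=loki
kubectl -n ${NS} rollout restart daemonset promtail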

If you already use Loki and have the same problem with persistent volumes that I had, or if you are looking to use Loki as a lightweight log collector in Kubernetes and also use GCP, then I hope this post was useful. Please let me know in the comments if you run into any issues.
