Scaling Rails web sockets in Kubernetes with AnyCable

One of the things I love the most about Rails is how easy it makes implementing many features with very little setup on my part. One example is ActionCable for web sockets. Before ActionCable, whenever I needed to build realtime features I had to resort to separate tools, or to a dedicated hosted service if scale was an issue (for example, we have been using Sendbird for years at Brella).

ActionCable made everything so much easier in terms of feature implementation, but with a catch: it really doesn't perform all that well with many concurrent clients. As soon as you have a few thousand clients active at the same time, latency goes beyond what can be considered "realtime".

This is where AnyCable comes in. It's a fantastic project by the awesome team at Evil Martians (FOSS, but with optional paid support and features, so check those out) that brilliantly solves the scalability issue that makes ActionCable less appealing for serious web sockets setups.

The key component is a Go process that takes care of actually handling the web sockets; since Go is much better at concurrency than Ruby, performance with many concurrently connected clients improves dramatically. CPU and memory usage are also much lower than what ActionCable requires with the same number of clients.

The Go process only handles the web sockets connections; it isn't aware of any of the business logic, which stays in your Rails app. The link between the two is an additional process for your Rails app (the RPC server), which runs your channel logic: authenticating connections, handling subscriptions, managing data and whatever else your app does. The two communicate via the gRPC protocol, so you must take this into account when designing the system for load balancing, as we'll see later.

In this post we'll see:

  • how to switch from regular ActionCable to AnyCable
  • how to deploy the required components to Kubernetes
  • how to solve an issue with load balancing
  • how to perform some benchmarks

Let's dive in.

Setting up AnyCable in Rails

The basic setup for AnyCable simply requires you to add the anycable-rails gem and run the 

bundle exec rails g anycable:setup

command to create/update some config files. One of these files is config/cable.yml, which is the default config file for ActionCable. Here you need to change the broadcasting adapter to any_cable, i.e.

production:
  adapter: any_cable

Then you need to specify the URL for your Redis instance (required for pub/sub) in config/anycable.yml:

production:
  redis_url: redis://....:6379

Next, you need to edit the config file for the target environment and change the cable URL:

# config/environments/production.rb
config.action_cable.url = "wss://domain.com/ws-path"

It's up to you whether you want to use a separate hostname for the web sockets or just a dedicated path in your app's main domain. For the chat feature we are building for Brella I have opted for a path in the same domain, as we'll see later when talking about ingress.

With this, you can now start the Rails part of AnyCable (the RPC process) with

RAILS_ENV=production bundle exec anycable

Of course, you also need to run the Go process to complete the setup. First, you need to install it (e.g. `brew install anycable-go` on macOS - also see this), then run 

anycable-go --host=localhost --port=8080

to start it. AnyCable-Go will connect to the RPC process on port 50051 by default, but you can customize this with the `--rpc_host ip:port` argument.
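
For example, to make the defaults explicit during a local test, you could start it like this (assuming the RPC process from the previous step is running on the same machine):

anycable-go --host=localhost --port=8080 --rpc_host localhost:50051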

That's basically it for a quick setup to test AnyCable. Web sockets connections will now be handled by AnyCable instead of regular ActionCable, as long as the clients use the new cable URL.

In most cases you won't need to change much else in order to swap ActionCable for AnyCable. However, while compatibility is great, there are still a few differences. For example, you cannot use regular instance variables in a channel class, because of the way connection objects are handled by AnyCable. So instead of something like this

class MyChannel < ApplicationCable::Channel
  def subscribed
    @somevar = ...
  end

  def send_message
    do_something_with @somevar
  end
end

You'll have to use state_attr_accessor:

class MyChannel < ApplicationCable::Channel
  state_attr_accessor :somevar

  def subscribed
    self.somevar = ...
  end

  def send_message
    do_something_with somevar
  end
end

Just a small thing to remember. For the channels in the Brella backend I didn't need to change anything else, but take a look at this page for more info on other differences you may encounter.

Deployment in Kubernetes


1. The Go process

The easiest way to deploy AnyCable-Go is with the official Helm chart. Take a look at the README for details on which configuration options you can set. In my case I am installing the chart with these settings:

helm repo add anycable https://helm.anycable.io/

helm upgrade --install \
--create-namespace \
--namespace myapp \
--set anycable-go.replicas=3 \
--set anycable-go.env.anycablePath=/ \
--set anycable-go.env.anycableRedisUrl=redis://....:6379 \
--set anycable-go.env.anycableRpcHost=myapp-rpc:50051 \
--set anycable-go.env.anycableLogLevel=debug \
--set anycable-go.serviceMonitor.enabled=true \
--set anycable-go.ingress.enable=false \
--set anycable-go.env.anycableHeaders='authorization\,origin\,cookie' \
anycable-go anycable/anycable-go
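
Once the release is installed, a quick sanity check is to list the pods in the namespace and confirm the three anycable-go replicas are running:

kubectl get pods --namespace myapp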

2. The RPC process

The RPC can be a regular Kubernetes deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-rpc
  labels:
    app.kubernetes.io/name: myapp-rpc
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 50%
      maxUnavailable: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: myapp-rpc
  template:
    metadata:
      labels:
        app.kubernetes.io/name: myapp-rpc
    spec:
      containers:
        - name: myapp
          image: ...
          command:
          - bundle
          - exec
          - anycable
          - --rpc-host=0.0.0.0:50051
          - --http-health-port=54321
          securityContext:
            allowPrivilegeEscalation: false
          ports:
            - name: rpc
              containerPort: 50051
              protocol: TCP
          readinessProbe:
            httpGet:
              path: "/health"
              port: 54321
              scheme: HTTP
            initialDelaySeconds: 25
            timeoutSeconds: 2
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 10
          env:
            - name: ANYCABLE_RPC_POOL_SIZE
              value: "40"
            - name: MAX_THREADS
              value: "50"

This is the relevant part taken from the manifest I use for our Helm chart. The important bits are:
  • the command: here we are running the anycable RPC process, binding it to 0.0.0.0 so it can be reached from outside the pod; we also specify the port for the health checks;
  • the port, which is by default 50051
  • the readiness probe, which uses the health check to ensure the pod is available to process requests only when ready to do so
  • a couple of environment variables to configure AnyCable's own thread pool size, as well as the database pool size (typically database.yml sets the db pool size to the number specified by the MAX_THREADS env variable, especially with Puma based apps; adjust the variable name if needed; see the sketch after this list). Make sure the db pool size is greater than the thread pool size.
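
As a reference, here is a minimal database.yml sketch following that convention (the postgresql adapter and the fallback value of 5 are just placeholders for illustration):

# config/database.yml
production:
  adapter: postgresql
  # db pool sized from the MAX_THREADS env variable set in the deployment above
  pool: <%= ENV.fetch("MAX_THREADS") { 5 } %>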

Next we need a service:

apiVersion: v1
kind: Service
metadata:
  name: myapp-rpc
  labels:
    app.kubernetes.io/name: myapp-svc-rpc
spec:
  ports:
    - port: 50051
      targetPort: 50051
      protocol: TCP
      name: rpc
  selector:
    app.kubernetes.io/name: myapp-rpc

This is what AnyCable Go will be connecting to.
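
To verify that the service has picked up the RPC pods, you can list its endpoints and check that there is one address per replica:

kubectl get endpoints myapp-rpc --namespace myapp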

3. Ingress

You can, if you want, enable a separate ingress for AnyCable Go with the settings of its Helm chart, but I prefer to keep web sockets under a path in the app's main domain:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  labels:
    app.kubernetes.io/name: myapp
spec:
  tls:
  - hosts:
    - domain.com
    secretName: domain-tls
  rules:
  - host: domain.com
    http:
      paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: "myapp-web"
              port:
                number: 3000
        - path: /ws
          pathType: Prefix
          backend:
            service:
              name: "myapp-anycable-go"
              port:
                number: 8080

This will ensure that all the requests going to domain.com/ws are routed to the AnyCable Go service; make sure this path matches the one you used in config.action_cable.url earlier.

gRPC Load Balancing

If you try this setup, it should work, but if you look closely you will notice that only one of the RPC pods actually handles the requests, while the others sit idle.

This is because gRPC clients open long-lived HTTP/2 connections and multiplex all their requests over them, while Kubernetes' default service types only load balance at the connection level; each AnyCable-Go instance therefore keeps talking to the single RPC pod it first connected to, when what you actually need is per-request load balancing. One workaround could be a headless service, but this only works if the gRPC client can perform DNS based load balancing by itself. In any case, I prefer using a service mesh: not only does it fix the load balancing issue with gRPC, it also improves the observability and security of your deployments, so it's useful to have anyway.
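
For reference, a headless service is just a regular service with clusterIP set to None, so that the service's DNS name resolves to the individual pod IPs instead of a single virtual IP. A minimal sketch based on the RPC service above (the myapp-rpc-headless name is only an example):

apiVersion: v1
kind: Service
metadata:
  name: myapp-rpc-headless
spec:
  clusterIP: None # headless: DNS returns the pod IPs directly
  ports:
    - port: 50051
      targetPort: 50051
      protocol: TCP
      name: rpc
  selector:
    app.kubernetes.io/name: myapp-rpc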

In this case I use Linkerd, which is a very fast and lightweight service mesh. The way it works is that Linkerd automatically injects a proxy container into the RPC pods, and this proxy intercepts all the requests to the pod. Since the proxy understands gRPC and can balance individual requests, the load balancing problem goes away.

Setting up Linkerd for this is very easy. First, you need to install the CLI:

curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh

and verify that you can install it in your cluster:

linkerd check --pre

Assuming that all the checks are green, you can proceed with the installation:

linkerd install | kubectl apply -f -

And ensure again that everything went smoothly:

linkerd check

Then you need to add an annotation to the spec template for the RPC pods:

      annotations:
        linkerd.io/inject: enabled
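
In the deployment manifest shown earlier, this goes under the pod template's metadata, next to the labels:

  template:
    metadata:
      labels:
        app.kubernetes.io/name: myapp-rpc
      annotations:
        linkerd.io/inject: enabled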

Also, you need to reinstall AnyCable Go with this additional setting:

--set "pod.annotations.linkerd\.io\/inject=enabled"

That's it! This will make sure that when the pods are created they are injected with the Linkerd proxy containers, and gRPC load balancing will work as expected.

You don't have to, but I recommend installing the awesome viz extension for observability of the "meshed" services:

linkerd viz install | kubectl apply -f -
linkerd check

Then run:

linkerd viz dashboard

to open the dashboard and see lots of useful metrics in realtime.
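
If you prefer the terminal, the viz extension also exposes these metrics via the CLI; for example, to see success rates, request rates and latencies for the deployments in the namespace used earlier:

linkerd viz stat deploy -n myapp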

Bonus: private GKE clusters


At Brella we use private GKE clusters for our staging and production environments. "Private" means that neither the control plane nor the nodes are reachable directly from the Internet, which is awesome for security. In order to access these clusters we use a bastion host with a proxy for the Kubernetes API; the bastion host is also private, so we can only access it with authenticated connections via the Google Identity Aware Proxy. This is a bit out of scope for this article, but I wanted to mention that if you, too, use private GKE clusters, you'll run into an issue with Linkerd not working properly. I won't go into details here, but I recommend you read this page on how to set up a firewall rule that fixes the issue.

Benchmarking web sockets

Depending on your use case, you may want to run some benchmarks in order to understand whether your needs warrant the switch from ActionCable to AnyCable. I took inspiration from this page by the authors of AnyCable (though I used smaller instances for my tests) and installed websocket-bench.

To use the benchmark as in the examples, I created a test channel at app/channels/benchmark_channel.rb:

class BenchmarkChannel < ApplicationCable::Channel
  STREAMS = (1..10).to_a

  def subscribed
    Rails.logger.info "a client subscribed"
    stream_from "all#{STREAMS.sample if ENV['SAMPLED']}"
  end

  def echo(data)
    transmit data
  end

  def broadcast(data)
    ActionCable.server.broadcast "all#{STREAMS.sample if ENV['SAMPLED']}", data
    data["action"] = "broadcastResult"
    transmit data
  end
end


Then I ran the benchmark while monitoring logs and Linkerd's dashboard for both the Go and RPC pods:

websocket-bench broadcast $WS_URL --concurrent 8 --sample-size 100 --step-size 1000 --payload-padding 200 --total-steps 10 --server-type=actioncable 

This connects clients in batches until it reaches 10K connections, all sending some messages. From my initial experimentation, I found that with a few replicas for both the Go process and the RPC process I was able to see a median RTT of 200ms with 1K clients and 700ms with 10K clients. This was with fairly slow E2 instances, so performance should be better with the C2 instances we use in production.


Wrapping up

I wrote this post quickly since a few people asked me about it. All in all, I am very impressed with the progress made with AnyCable so far (I used it in the past, in its early days, and it's much better today). It requires some setup compared to the zero setup of regular ActionCable, but it solves the web sockets performance and scalability problems while allowing you to keep everything in-house.

I love it; it's currently one of my favorite projects. Having said that, given the additional setup involved with AnyCable, I recommend you always consider whether your web sockets needs really require more than what regular ActionCable can offer.

In our case, we are an event management platform and networking is our killer feature: being able to exchange messages with other attendees before meeting them at an event is a must-have for us. We also wanted to remove the dependency on a third party service, while at the same time being able to scale web sockets when big events cause significant spikes in chat usage. Your mileage may vary, so always evaluate whether ActionCable is good enough for your use case.