How deep is your connection pool? A concurrency gotcha with Puma and Sidekiq.

Problem

Our team supports a webhook that allows iOS apps to push events to us. Occasionally it receives a 100 rps spike in traffic.

Our infrastructure should be able to eat 100 rps for breakfast. This endpoint is backed by 3 web dynos each running 40 Puma threads. But what’s this?

ConnectionPool::TimeoutError waited 1 sec

Our web dynos would intermittently throw these timeout errors. These errors would cascade, causing downtime.

Workaround

Since we can’t have downtime and the issue wasn’t obvious, we needed a temporary workaround. Restarting the dynos is always a good first approach, and indeed that worked here. The restarted dynos stopped throwing timeout errors and the app went back up.

Manually restarting is slow, error-prone, and causes availability issues. Who’s going to restart these dynos at 3 AM? Not me.

Root Cause

The endpoint has a simple implementation. All it does is enqueue a job that will process the event asynchronously.

We can see that the error is thrown when that job is enqueued:

Sidekiq::Client.enqueue(OnPush, @service.name, params_body)

Rails is trying to enqueue the job with Sidekiq, which is backed by a Redis queue. The error is thrown when Sidekiq fails to connect to the Redis service.
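For context, the whole action is roughly this shape (the controller and response here are illustrative; only the enqueue call above is our actual code):

# Illustrative sketch of the webhook action. Only the enqueue line is
# taken from our real code; @service is assumed to be set elsewhere,
# e.g. by an authentication before_action.
class PushEventsController < ApplicationController
  def create
    params_body = request.raw_post

    # Hand the event off to Sidekiq and return immediately.
    Sidekiq::Client.enqueue(OnPush, @service.name, params_body)

    head :ok
  end
end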

My first thought was that this was an issue of scale: that we were exceeding the number of concurrent connections offered by our Redis service. But that was a dead end. We’re using Heroku’s Redis Enterprise Cloud add-on, which offers unlimited connections.

Our CTO Alessio suggested that the issue was not of scale, but of concurrency.

After grepping Sidekiq’s source, I found that it uses the connection_pool gem to hold a pool of Redis connections. How deep is that pool? 5! Five Redis connections were being shared between 40 Puma threads.

When traffic spiked, the threads would compete over that pool. Occasionally a thread would wait for an entire second to be allocated a connection, causing a timeout.
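The contention is easy to reproduce in isolation with the connection_pool gem. This is a standalone sketch, not our application code: a pool of 5 shared by 40 threads that each hold a connection longer than the 1 second checkout timeout raises the same error.

require "connection_pool"

# A pool of 5 with a 1 second checkout timeout, mirroring the defaults we were hitting.
pool = ConnectionPool.new(size: 5, timeout: 1) { Object.new }

threads = 40.times.map do
  Thread.new do
    begin
      # Hold a connection longer than the checkout timeout, like a slow
      # Redis round trip during a traffic spike.
      pool.with { |_conn| sleep 2 }
    rescue ConnectionPool::TimeoutError => e
      puts e.message # e.g. "Waited 1 sec..."
    end
  end
end

threads.each(&:join)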

Why is 5 the default size? It isn’t, really. Here’s how Sidekiq determines the size:

size = if symbolized_options[:size]
         symbolized_options[:size]
       elsif Sidekiq.server?
         # Give ourselves plenty of connections. pool is lazy
         # so we won't create them until we need them.
         Sidekiq.options[:concurrency] + 5
       elsif ENV["RAILS_MAX_THREADS"]
         Integer(ENV["RAILS_MAX_THREADS"])
       else
         5
       end

First it attempts to use a passed-in size. I couldn’t find any documentation on where to pass that in, so I suspect it’s only used internally.

Then it checks Sidekiq.server?, which returns true when the process is a worker responsible for processing jobs. That’s false here, since our web dynos only act as Sidekiq clients.

Then it checks ENV["RAILS_MAX_THREADS"] before finally giving up and defaulting to 5.

What’s ENV["RAILS_MAX_THREADS"]? This is the environment variable that the community has standardized on to set the number of threads that your webserver uses to process requests concurrently.

Since we use Puma, config/puma.rb should set the thread count to that envvar. Let’s take a look at that file:

max_threads = ENV["PUMA_MAX_THREADS"]

Oh no! We’re using an older envvar that was the previous Puma-specific standard. Had we been using ENV["RAILS_MAX_THREADS"], Sidekiq would have set our connection pool size to our thread count, ensuring every thread always had a Redis connection available.

Short-term patch

The immediate solution was to change config/puma.rb to reference the newer envvar and set that on our environments. Now each thread has its own Redis connection—no more timeouts!
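The relevant lines of the patched config/puma.rb look roughly like this (a sketch, not the full file):

# config/puma.rb (sketch of the relevant lines)
# Read the thread count from the community-standard env var so Puma and
# Sidekiq's client-side connection pool agree on the same number.
max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads max_threads, max_threads

With RAILS_MAX_THREADS=40 set on the web dynos, Sidekiq’s third branch kicks in and sizes the pool at 40.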

You might be thinking ‘what if I don’t have unlimited Redis connections? I can’t afford to have a giant pool of Redis connections that go unused 99% of the time’.

Don’t worry, the connection_pool gem is pretty smart. It lazily creates the connections in the pool as needed. Our system will continue to use ~5 Redis connections until traffic spikes.
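You can see that laziness directly: connections are only built on first checkout, so an idle pool of 40 costs nothing up front. A quick sketch:

require "connection_pool"

created = 0
pool = ConnectionPool.new(size: 40, timeout: 1) do
  created += 1
  Object.new # stand-in for Redis.new
end

puts created        # => 0, nothing is created when the pool is built
pool.with { |_c| }  # the first checkout lazily builds a single connection
puts created        # => 1; the other 39 only appear under concurrent demand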

Long-term fix

The default thread count for Puma running on MRI is 5.

The Heroku documentation recommends 5 threads for Puma.

Sidekiq falls back to 5.

Why then are we using 40? Eventually I would love to lower our thread count to 5 to align with the community standard.

Evidation Health is hiring!

Would you like to work on problems like this? We’re hiring!
