How deep is your connection pool? A concurrency gotcha with Puma and Sidekiq.
Our team supports a webhook that allows iOS apps to push events to us. Occasionally it receives a 100 rps spike in traffic.
Our infrastructure should be able to eat 100 rps for breakfast. This endpoint is backed by 3 web dynos each running 40 Puma threads. But what’s this?
ConnectionPool::TimeoutError waited 1 sec
Our web dynos would intermittently throw these timeout errors, which cascaded and caused downtime.
Since we can’t have downtime and the issue wasn’t obvious, we needed a temporary workaround. Restarting the dynos is always a good first approach, and indeed that worked here. The restarted dynos stopped throwing timeout errors and the app went back up.
Manually restarting is slow, error-prone, and causes availability issues. Who’s going to restart these dynos at 3 AM? Not me.
The endpoint has a simple implementation. All it does is enqueue a job that will process the event asynchronously.
We can see that the error is thrown when that job is enqueued:
Sidekiq::Client.enqueue(OnPush, @service.name, params_body)
Rails is trying to enqueue the job with Sidekiq, which is backed by a Redis queue. The error is thrown when Sidekiq fails to connect to the Redis service.
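In context, the handler is little more than that one enqueue call. Here's a hedged sketch of the controller (the class name, the `head :accepted` response, and the `params_body` helper are illustrative, not our exact code):

```ruby
class PushEventsController < ApplicationController
  # Enqueue and return immediately; OnPush does the real work
  # later on a Sidekiq worker, off the request path.
  def create
    Sidekiq::Client.enqueue(OnPush, @service.name, params_body)
    head :accepted
  end
end
```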
My first thought was that this was an issue of scale: that we were exceeding the number of concurrent connections offered by our Redis service. But that was a dead end. We’re using Heroku’s Redis Enterprise Cloud add-on, which offers unlimited connections.
Our CTO Alessio suggested that the issue was not of scale, but of concurrency.
After grepping Sidekiq’s source, I found that it uses the connection_pool gem to hold a pool of Redis connections. How deep is that pool? 5! Five Redis connections were being shared between 40 Puma threads.
When traffic spiked, the threads would compete over that pool. Occasionally a thread would wait for an entire second to be allocated a connection, causing a timeout.
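You can reproduce the contention with nothing but the Ruby stdlib. TinyPool below is a toy, not Sidekiq’s actual pool, and the sizes and sleeps are illustrative: 40 threads compete over 5 connections, and any thread that waits too long for a checkout raises a timeout, just like ConnectionPool::TimeoutError.

```ruby
require "timeout"

# A toy fixed-size pool: checkouts block until a connection is free,
# and raise Timeout::Error if they wait longer than `wait` seconds.
class TinyPool
  def initialize(size)
    @queue = Queue.new
    size.times { |i| @queue << "conn-#{i}" }
  end

  def with(wait:)
    conn = Timeout.timeout(wait) { @queue.pop }
    yield conn
  ensure
    @queue << conn if conn
  end
end

pool = TinyPool.new(5)      # five connections...
timeouts = 0
lock = Mutex.new

threads = 40.times.map do   # ...shared by forty threads
  Thread.new do
    begin
      pool.with(wait: 0.15) { |_conn| sleep 0.1 } # hold a connection briefly
    rescue Timeout::Error
      lock.synchronize { timeouts += 1 }
    end
  end
end
threads.each(&:join)

puts "threads that timed out waiting: #{timeouts}"
```

Only the first couple of “waves” of threads get a connection in time; the rest hit the timeout, which is exactly the behavior we saw during traffic spikes.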
Why is 5 the default size? It’s not really… Here’s how Sidekiq determines that size:
size = if symbolized_options[:size]
  symbolized_options[:size]
elsif Sidekiq.server?
  # Give ourselves plenty of connections. pool is lazy
  # so we won't create them until we need them.
  Sidekiq.options[:concurrency] + 5
elsif ENV["RAILS_MAX_THREADS"]
  Integer(ENV["RAILS_MAX_THREADS"])
else
  5
end
First it attempts to use a passed-in size. I couldn’t find any documentation on where to pass that in, so I suspect it’s only used internally.
Then it checks Sidekiq.server?, which returns true only when running on a worker responsible for processing jobs. That’s not the case here, since our web dyno is a Sidekiq client.
Then it checks ENV["RAILS_MAX_THREADS"] before finally giving up and defaulting to 5.
ENV["RAILS_MAX_THREADS"]? This is the environment variable the community has standardized on to set the number of threads your web server uses to process requests concurrently.
Since we use Puma, config/puma.rb should set the thread count from that envvar. Let’s take a look at that file:
max_threads = ENV["PUMA_MAX_THREADS"]
Oh no! We’re using an older envvar that was the previous Puma-specific standard. Had we been using ENV["RAILS_MAX_THREADS"], Sidekiq would have set our connection pool size to our thread count, ensuring every thread always had a Redis connection available.
The immediate solution was to change config/puma.rb to reference the newer envvar and set it in our environments. Now each thread has its own Redis connection, and the timeouts are gone!
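For reference, the fixed config/puma.rb looks roughly like this (a sketch: the port and environment lines are assumptions about a typical Heroku setup, not our exact file):

```ruby
# config/puma.rb -- read the community-standard envvar, defaulting to 5
max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads max_threads, max_threads

port ENV.fetch("PORT", 3000)
environment ENV.fetch("RACK_ENV", "development")
```

With RAILS_MAX_THREADS set to 40 in the environment, both Puma’s thread count and Sidekiq’s client pool size now come from the same number.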
You might be thinking, ‘What if I don’t have unlimited Redis connections? I can’t afford a giant pool of Redis connections that goes unused 99% of the time.’
Don’t worry, the connection_pool gem is pretty smart. It lazily creates the connections in the pool as needed. Our system will continue to use ~5 Redis connections until traffic spikes.
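That laziness can be mimicked with a stdlib-only sketch. LazyPool is illustrative (the real gem also enforces the size cap and checkout timeout), but it shows the key property: the declared size is only an upper bound, and connections are created on first checkout, not up front.

```ruby
# A toy lazy pool: the factory runs only when a checkout finds
# no idle connection to reuse (mirroring connection_pool's laziness).
class LazyPool
  attr_reader :created

  def initialize(size, &factory)
    @size = size          # upper bound; sketch does not enforce it
    @factory = factory
    @mutex = Mutex.new
    @idle = []
    @created = 0
  end

  def with
    conn = checkout
    yield conn
  ensure
    checkin(conn) if conn
  end

  private

  def checkout
    @mutex.synchronize do
      @idle.pop || begin
        @created += 1
        @factory.call
      end
    end
  end

  def checkin(conn)
    @mutex.synchronize { @idle << conn }
  end
end

pool = LazyPool.new(40) { Object.new }
pool.with { }                        # first checkout creates connection #1
pool.with { }                        # reuses it; nothing new created
pool.with { pool.with { } }          # two checkouts overlap -> a second is created
puts pool.created
```

Declared size 40, but only two connections ever exist, because at most two were checked out at once.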
Sidekiq falls back to 5 because that matches the community-standard thread count. Why, then, are we running 40 Puma threads? Eventually I’d love to lower our thread count to 5 to align with that standard.
Evidation Health is hiring!
Would you like to work on problems like this? We’re hiring!