Ever seen this from EngineYard?
The site we were working on had recently had a jump in traffic on the order of 15x. The site held up just fine. Things seemed a bit slower, but generally functional.
One of the services the site offers to users uses Delayed::Job to message/email folks about things. And to keep everyone abreast of the status of the site, we have a status page which reports, among other things, whether the jobs are getting backed up. This page does a seemingly simple query of the Delayed::Jobs table and finds jobs that haven’t completed in a reasonable amount of time – it finds the delayed Delayed::Jobs. As it turns out, that query is fairly expensive: it’s a full table scan, and with more users using the service, that table has gotten big. With a few services checking the system for ‘up’ status (EngineYard, 100Pulse, etc.), that query was being run pretty often and ended up loading down our db server.
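For context, the status check's predicate looks something like the sketch below. This is our own illustration, not the app's actual code – the method name, threshold, and plain-hash jobs are assumptions – but it shows the shape of the condition that, run as a Delayed::Job.find with no supporting index, forces the database to scan every row:

```ruby
# Hypothetical sketch of the "delayed Delayed::Jobs" predicate -- the
# name and threshold are assumptions, not the app's actual code.
REASONABLE_WAIT = 30 * 60  # seconds; whatever "reasonable" means to you

def pending_too_long?(job, now = Time.now)
  # a job is backed up if it should have run a while ago and hasn't failed
  job[:failed_at].nil? && job[:run_at] < now - REASONABLE_WAIT
end

jobs = [
  { :id => 1, :run_at => Time.now - 3600, :failed_at => nil },  # overdue
  { :id => 2, :run_at => Time.now - 60,   :failed_at => nil },  # on time
]
jobs.select { |j| pending_too_long?(j) }.map { |j| j[:id] }  # => [1]
```

The database does the equivalent of this `select` over the whole table on every status check, which is cheap with hundreds of rows and painful with millions.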
Using EngineYard's performance graphs (to confirm that the load was consistently bad) and New Relic's code-level server monitoring, we narrowed the load down to the Delayed::Job#find called by the #get_pending_jobs method behind our status page.
How do we solve it? Caching, of course.
I’m a big fan of memcache. I’ve seen it work wonders on huge systems with piles of traffic. But we wanted a quicker fix: setting up a new memcached machine was a little more work (though not much) than we were willing to do for this first cut. Rails caching (we’re still on 2.3.5), by default, uses ActiveSupport::Cache::MemoryStore – a super-simple, memcache-like store (cached in memory, no persistence) that runs inside the app process. Perfect.
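There's nothing to set up for this – in Rails 2.3, MemoryStore is already the default – but if you want the choice to be explicit in your config, it looks like this (with the caveat that each app process keeps its own independent copy of the cache):

```ruby
# config/environment.rb (Rails 2.3) -- MemoryStore is the default cache
# store, so this line only makes the choice explicit. Entries live in the
# app process's memory: no persistence, no sharing across processes.
config.cache_store = :memory_store
```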
We wrapped our get_pending_jobs call like so:
  def fetch_delayed_job_status
    jobs_cache = Rails.cache.read(JOB_CACHE_KEY)
    if !jobs_cache || !jobs_cache[:expires_at] || (jobs_cache[:expires_at].to_i < Time.now.to_i)
      # cache miss or stale entry -- run the expensive query and re-cache it
      jobs_cache = {
        :jobs       => get_pending_jobs,
        :expires_at => (Time.now + JOB_CACHE_EXPIRY).to_i
      }
      Rails.cache.write(JOB_CACHE_KEY, jobs_cache)
    else
      # we got it from the cache
    end
    jobs_cache
  end
and we’re off and caching. Because we do want to get notified when something is wrong, we don’t want the cache time to be too long. Since our most frequent test service checks in every 5 minutes, we went with a 5 minute cache. So, worst case, things could be screwy for 10 minutes before we get notified. But we can live with that.
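The 10-minute worst case is just the cache expiry plus the check interval; a quick sketch of that arithmetic (the constant names are made up):

```ruby
CACHE_TTL      = 5 * 60  # our cache expiry, in seconds
CHECK_INTERVAL = 5 * 60  # the most frequent monitor checks every 5 minutes

# Worst case: a problem begins just after a fresh, healthy result is
# cached. The cache stays "green" for up to CACHE_TTL, and the next
# monitor check after it goes stale can be up to CHECK_INTERVAL later.
worst_case_seconds = CACHE_TTL + CHECK_INTERVAL
worst_case_seconds / 60  # => 10
```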
This quick and dirty solution worked great. We got immediate feedback. The email stopped coming. The load graphs on the server took a drastic dive. And New Relic showed us that the amount of time spent doing Delayed::Job#find was reduced by about 30%. Not too bad for 10ish lines of code.
One of the annoying things about this is that the cache expiry had to be managed by hand in this little method. In Rails 2, only ActiveSupport::Cache::MemCacheStore accepts and honors the :expires_in option for cache entry expiration. In Rails 3, all the bundled Cache::Store implementations honor the :expires_in param.
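To make the :expires_in behavior concrete, here's a minimal pure-Ruby sketch of a TTL'd in-memory store – a toy of our own, not ActiveSupport's actual implementation – showing what the store does for you so the caller no longer tracks :expires_at by hand:

```ruby
# Toy in-memory cache with :expires_in semantics (illustration only,
# not ActiveSupport::Cache::MemoryStore).
class TinyMemoryStore
  Entry = Struct.new(:value, :expires_at)

  def initialize
    @data = {}
  end

  def write(key, value, options = {})
    ttl = options[:expires_in]
    @data[key] = Entry.new(value, ttl ? Time.now + ttl : nil)
  end

  # Returns nil on a miss or an expired entry; expired entries are evicted.
  def read(key, now = Time.now)
    entry = @data[key]
    return nil unless entry
    if entry.expires_at && entry.expires_at <= now
      @data.delete(key)
      nil
    else
      entry.value
    end
  end

  # Read-through helper: on a miss, run the block and cache its result.
  # (Simplification: a cached nil is treated as a miss.)
  def fetch(key, options = {})
    hit = read(key)
    return hit unless hit.nil?
    value = yield
    write(key, value, options)
    value
  end
end

store = TinyMemoryStore.new
store.write('jobs', [1, 2], :expires_in => 300)
store.read('jobs')                   # => [1, 2]
store.read('jobs', Time.now + 600)   # => nil -- the entry has expired
```

With a store like this (or Rails 3's real ones), the whole fetch_delayed_job_status dance collapses into a single fetch call with :expires_in.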