A few weeks ago, we experienced a major problem in production. Our normal background processes were stalling and taking hours and timing out.
The first problem was detection: we have monitoring for processes starting, but we don't track that they finish (it hadn't been a problem before, and sometimes processes take a long time as normal behavior). All our processes were starting normally!
The problem was they were taking a long time. It took us a couple hours to notice this. We also noticed our database CPU was much higher than normal (although other metrics, strangely, were the same).
We checked current connections on the db and noticed queries against a particular table (one of our largest) were taking longer than usual. These queries hadn't changed - some had been running regularly for years. As a quick fix we shut down functionality that ran these queries (we restarted these after we fixed the issue) so the rest of the system wouldn't be affected.
First we checked out recent actions we'd taken, but those turned out to be irrelevant. We thought maybe we had a corrupted index or something. We checked the explains. Something jumped out: despite having the values for a particular index in the where clause, the query optimizer was selecting a different index that didn't make any sense (from a human perspective - we knew what the best index was already). Our solution was to force the use of the correct index for these queries. This sort of requirement is pretty far outside the normal requirements for Rails - you have to use a bit of Arel do a .from("table_name FORCE INDEX(index_name)")
.
So why did this happen? Well, it turns out that MySQL automatically runs statistics by default on tables each time they grow/change by 10%. This is normally a good thing! You want the optimizer to select the right indexes and query plan, and you assume it will do a better job than you will. In this case, it hurt us tremendously. We are doing what is probably sufficient sampling (already a significant increase from defaults), but we don't quite understand why the optimizer chose the index it did. We could do a deep dive on the optimizer algorithm, but since the problem is solved and we're about to do a major version upgrade anyway, we let this one lie.
tl;dr: automated statistics can potentially cause queries to suddenly become problems with no changes from dev.