This weekend we experienced a lot of trouble with My Opera due to the servers being unable to connect to the databases. After debugging some, we saw that two particular tables were experiencing a lot of table locking, being ancient MyISAM tables created long before I started working for Opera.
Initial signs
Being called out at 11 PM, I identified the problem and disabled two of the non-vital update queries causing the lockups. I monitored the system for a while and everything seemed to be good. I went to bed.
The next day I checked in on the system when I got up, and noticed that we were having a small replication lag, that is the time it takes for the changes to propagate from the main database to the search databases, across the services even though the queries were disabled. I killed some hanging queries after noting them down, and they didn't come back. Until later that day when it came back and bit us again.
Round two
This time I converted the affected, huge tables to InnoDB to eliminate the possibility of table locks when updating. I've wanted to do that for a couple of years now, so why not do that when My Opera was done to begin with since it would solve the issue. After converting the two tables, a third table was locking up and we did that too. Things were starting to look good, and I drove home while Good Guy Andreas looked after the system, even though he wasn't on the technician on call at the time and was sick at home.
When I came home, Andreas reported that another table locked up. We fixed it. Many queries were still hanging. We killed them off, and watched the system. The replication lag was increasing.
Trying tons of stuff, we dug through the code and logs and eventually found some places that could cause problems, although we didn't know exactly why they started acting up lately. There were no seen problems, but the replication was hanging forever on "freeing items".
A couple of hotfixes were sent live and at 5 AM the system was looking good. It was slow, but good. I went to bed.
Hell breaks loose
Monday morning arrives. I was supposed to be in a meeting at 10:30, but woke up at 10:20. Good start. Seeing that My Opera was burning as bad as before, Good Guy Andreas was commanding the salvage operation from home, still being sick. He had discovered us being crawled by some ominous search engine, which isn't unusual, but it could enhance the problems if we got even more strain on the poor databases.
We blocked off the search engine for the time being, and it eased the load a bit. During the next hours, many different things were attempted, such as changing the innodb_thread_concurrence setting, but still the replication was lagging behind with "freeing items".
Finally a solution
After Googling like a boss, I saw that people were saying that Percona, a drop in replacement for mysql, had seemingly improved this problem. As we have been thinking of using Percona for quite a while, I agreed with myself and a couple of others that this was a perfect time. The change was done, and everything was looking good. Replication lag to 0 again.
Later that night, we experienced a couple of slow queries, but they were fixed and My Opera was once again stable.
I have throughout the Monday and Tuesday upgraded the servers one by one to run the latest kernels and Percona with mysql 5.5, up from stock mysql 5.1. Only the master database remains, but we'll perform a master switch later today to get it up to speed too. The poor master database has got almost 1000 days of uptime, so it's very much needed.
So, what was the problem?
Our current theory is that multiple, random happenstances caused the database trouble. Something caused a lot of table writes, locking the table, while being crawled causing lots of reads throughout the system and slowly but steadily bringing the database servers to their knees.
How can we guard ourselves from this in the future?
We have removed the possibility for easy locks by converting many of the much-written-to tables to InnoDB instead of MyISAM. It was about time; they were created years and years ago and had grown too large for doing this without causing trouble to the system.
Instead of simply doing something like:
ALTER TABLE foo ENGINE=InnoDB;
We used a well known less-locking method:
CREATE TABLE foo_innodb LIKE foo;
ALTER TABLE foo_innodb ENGINE=InnoDB;
INSERT INTO foo_innodb SELECT * FROM foo WHERE id >= X AND id < Y;
RENAME TABLE foo TO foo_myisam;
RENAME TABLE foo_innodb TO foo;
DROP TABLE foo_myisam;
Basically creating a copy of the table, changing its engine and then inserting chunks of data that were manageable for the system without locking it for too long, then repeating the step until all is done and finally renaming the tables and replacing the MyISAM table with the InnoDB one.
When everything's looking good again, foo_myisam can be dropped. You should do the last INSERT INTO chained with the RENAME TABLE command so you don't lose data.
Effectively we reduced the locks from 10 minutes of fully locking the table with an ALTER TABLE, to several smaller chunks of INSERT INTO with SELECTS that blocked the table for smaller periods of time, so the change was much more transparent.
Updates
-
Wednesday: We are aware of the issue with the erroneous timestamps for anonymous comments, and are working on a solution to restore them to their former state.
-
Thursday: The database master has once again been restored to our bigdog, watching over us.
-
Thursday: If you are experiencing any problems with things that haven't been up to date (such as watches, etc) the past days, please try to see if the issues can be reproduced now, as it could have been due to delays in data propagation throughout the system caused by the problems experienced.