It shouldn’t happen at all, but it does more often than you’d expect: developers or DBAs going about their normal work sometimes drop a table in a production database. In the case below the mistake was noticed immediately; in others, it can take a while before anyone realizes something is amiss. It is at times like these that best practices, and quality consultants, prove their value.
It Started as a Phone Call
The hero of our story, Mr. A, was the consultant on duty the morning the customer called in a slight panic: “My application is completely down. One of my developers dropped an important table in the production database. Can you help? This is going to cost me a bundle.”
Table Terror
Mr. A rushed to the customer site and was immediately approached by a nervous chief systems administrator. A short explanation revealed the issue: a developer had intended to restore a dump into the staging environment but accidentally used the IP address of the production master instead. The dump file contained “drop table” and “create table” statements, so because of the IP mix-up the table was dropped on the production master, and the drop statement was replicated to the slaves.
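For readers unfamiliar with why a routine restore can be so destructive: a dump taken with mysqldump’s default options begins each table with a DROP TABLE IF EXISTS followed by CREATE TABLE, so pointing the restore at the wrong host is enough to wipe the live copy. The host names, schema, and table below are purely illustrative, not the client’s actual setup.

```bash
# Illustrative only: mysqldump's defaults (--add-drop-table) put a
# DROP TABLE IF EXISTS before every CREATE TABLE in the dump file.
mysqldump -h staging-db -u app -p shop orders > orders.sql

# The restore the developer meant to run against staging...
mysql -h staging-db -u app -p shop < orders.sql

# ...but run with the production master's IP instead, it first drops the live
# table, and that DROP replicates to every slave.
mysql -h 10.0.0.12 -u app -p shop < orders.sql
```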
As soon as the developer realized that he had made a mistake, he stopped the script, but it was too late. The app in question was an eCommerce shopping app with around 10,000 weekly users. The countdown to app Armageddon had begun.
The Bad News
The dropped table was essential to the application. As soon as it was gone, the app started throwing errors and rapidly losing stability. Eventually, it stopped working altogether, and the downtime clock started ticking. For many companies, this is what really hurts: the median cost of downtime for enterprises is about $300,000 per hour. Add to that reputational damage and service-level-agreement penalties, and you’re looking at a hefty price for one small accidental deletion.
A quick restore from the nightly backup alone was not an option: the table was around 10 GB and frequently updated, so the backup was already missing hours of data.
The Good News
Fortunately, the client was running full database backups at midnight with Percona XtraBackup. The “drop table” incident happened around 9:30 a.m., so the latest backup was about 9.5 hours behind. The client also had binary logs that recorded every change made to the database after that backup.
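As a rough idea of what such a setup looks like (the schedule, paths, and backup account below are assumptions, not the client’s configuration): a nightly XtraBackup job plus binary logging on the master is all that point-in-time recovery requires.

```bash
# Hypothetical nightly job, run from cron at midnight; paths and the backup
# user are illustrative. innobackupex is the XtraBackup wrapper commonly used
# with MySQL 5.6 and creates a timestamped full backup per run.
innobackupex --user=backup --password='***' /backups/mysql/

# Point-in-time recovery also depends on binary logging being enabled on the
# master (log_bin set in my.cnf) and the binlogs being retained long enough.
mysql -e "SHOW MASTER STATUS\G"
```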
To the Rescue
Mr. A decided to use the last XtraBackup copy and the binary logs for point-in-time recovery. Luckily, the customer had a spare server with enough resources for the job. Mr. A restored the XtraBackup files on the spare server, decoded the latest binary log file from the master, and found the exact position where the “drop table” statement had been executed. Finally, he replayed the binary logs against the running MySQL instance, starting from the position recorded by XtraBackup and stopping just before the “drop table” event.
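In outline, the recovery flow might look like the sketch below. All directory names, binlog file names, and positions are placeholders; the real values came from the client’s backup directory and from scanning the decoded binlog for the offending statement.

```bash
# 1. Prepare the midnight XtraBackup copy and restore it on the spare server.
innobackupex --apply-log /backups/latest/
innobackupex --copy-back /backups/latest/
chown -R mysql:mysql /var/lib/mysql
service mysql start

# 2. XtraBackup records the binlog file and position the backup corresponds to.
cat /backups/latest/xtrabackup_binlog_info
#    mysql-bin.000123   107               <- placeholder values

# 3. Decode the binlog from the master and locate the DROP TABLE event, noting
#    the position just before it (assumed to be 98321 here). If several binlog
#    files cover the 9.5-hour gap, pass them all to mysqlbinlog in order.
mysqlbinlog mysql-bin.000123 | grep -i -B 5 'DROP TABLE'

# 4. Replay everything between the backup position and the DROP against the
#    spare instance, bringing the table to its state just before 9:30 a.m.
mysqlbinlog --start-position=107 --stop-position=98321 mysql-bin.000123 \
  | mysql -u root -p
```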
After the client reviewed and confirmed the recovered table, it had to be copied back to the production instance. Simply dumping a 10 GB table and restoring it into the production database would have been another time-consuming step and added more downtime. Instead, Mr. A used the “flush tables for export” statement from the InnoDB transportable tablespaces feature of MySQL 5.6 to copy the table to the production master.
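A minimal sketch of the transportable-tablespace copy, assuming the table lives in a schema called shop and is named orders (both placeholders): the recovery server exports the tablespace, the files are copied over, and the production master imports them, avoiding a full logical dump and reload.

```bash
# On the recovery server: quiesce the table and write its .cfg metadata file.
mysql -e "FLUSH TABLES shop.orders FOR EXPORT;"
scp /var/lib/mysql/shop/orders.ibd /var/lib/mysql/shop/orders.cfg prod-master:/tmp/
mysql -e "UNLOCK TABLES;"

# On the production master: recreate the (empty) table with the same
# definition, detach its tablespace, drop in the copied files, and import.
mysql -e "ALTER TABLE shop.orders DISCARD TABLESPACE;"
cp /tmp/orders.ibd /tmp/orders.cfg /var/lib/mysql/shop/
chown mysql:mysql /var/lib/mysql/shop/orders.*
mysql -e "ALTER TABLE shop.orders IMPORT TABLESPACE;"
```

Both servers need innodb_file_per_table enabled for this to work, which is the default in MySQL 5.6, and the table definition on the destination must match the source exactly.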
This final step got the application running again. Mr. A also configured one slave with a four-hour replication delay, so that a similar mistake could be recovered from much faster if it ever recurred.
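Delayed replication is built into MySQL 5.6, so the four-hour buffer is a one-line change on the chosen slave (14,400 seconds simply mirrors the four hours mentioned above):

```bash
# On the designated slave: keep it four hours (14400 s) behind the master, so
# an accidental DROP can be caught and the intact table pulled from this copy
# before the statement is ever applied there.
mysql -e "STOP SLAVE;
          CHANGE MASTER TO MASTER_DELAY = 14400;
          START SLAVE;"
```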
Prevention is the Best Medicine
Mr. A’s experience and knowledge led to a very satisfied customer and minimal downtime for a highly active app. However, the whole situation could have been easily avoided.
- Policy enforcement: If the company had enforced policies on database scripts, the system could have automatically prevented this human error. Policies prohibiting “drop table” statements, table truncation, or daytime modifications to production could all have stopped this crisis before it happened.
- Roles and access: Should a developer even have access to the production database? Restricting privileges on the production master would have stopped the accidental drop at the door (a rough sketch of both ideas follows below).
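Neither measure requires exotic tooling. As a rough illustration of both points (the account name, host pattern, and the check itself are assumptions, not a recommendation of any specific product):

```bash
# Roles and access: developer accounts on production get only the privileges
# their work actually needs; without the DROP privilege, this accident cannot happen.
mysql -e "GRANT SELECT, INSERT, UPDATE, DELETE ON shop.* TO 'dev_user'@'10.1.%';"

# Policy enforcement: a crude pre-flight check that refuses to run a script
# containing DROP TABLE (or TRUNCATE) against a production host.
if grep -qiE '^\s*(DROP TABLE|TRUNCATE)' restore.sql; then
  echo "Refusing to run restore.sql against production: destructive statements found" >&2
  exit 1
fi
mysql -h "$PROD_HOST" shop < restore.sql
```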
Was the developer following best practices? Was management providing oversight? Should the company have run backups more often, or set tighter recovery objectives? It would be interesting to know whether Mr. A’s client made any real changes to their systems after this crash. Judging from the number of similar horror stories, enterprises still have a long way to go in learning the lessons that Mr. A’s client learned the hard way.