XXXXXX, a midsize retail company with 760 employees, faced an unforeseen challenge on February 21st, 2022.
At 10:15 am, a few resellers and buyers informed the company’s CEO that they could not place orders due to a technical issue. Later that day, the CIO reported that an important portal application, one that streamlined most of the company’s business, had stopped working, and that the entire R&D team was dedicated to resolving the issue.
The portal application eventually went back online, but only after being offline for 18 hours.
The direct revenue loss was assessed at $2.1M. In addition, there were other significant losses that were difficult to quantify, such as the frustration of ecosystem partners who were unable to do business. Several resellers had already expressed their dissatisfaction with the retailer in angry email threads to the CEO.
Here is a more detailed chronology of the events and the challenge around this database release:
An R&D team member recognized the problem at 11:20 am, and several angry calls from the VP of Sales followed. After a short inquiry, the R&D team found a suspect: one of the application’s scheduled upgrade tasks. The App team started a deep dive into the upgrade procedure, looking for the root cause of the shutdown.
What made the deep dive harder to trace was that the portal shutdown was caused by a major release (typically made once a quarter). A major release typically consists of several hundred application code changes and many dozens of database code changes.
Looking back at the period leading up to the event, it had been a hectic six months in which changes grew dramatically in both frequency and scope. Plenty of major and minor releases were applied, and they were all documented in the corporate JIRA system. The suspected major release itself came after a significant investment of resources and effort focused on several high-priority requirements from the sales organization.
This massive number of changes touched the corporate product catalog table, the customer table, the ordering system table, and many others. At XXXXX, database code changes are released, managed, monitored, and executed manually. The company employs two full-time DBAs, and they are the only ones authorized to handle these database schema changes. For each new release, the DBAs maintain an ever-growing library of database code change files categorized by date. They started digging through these files looking for potentially destructive code. The R&D team also assigned a group of developers to track the recent changes in the corporate ticketing system, JIRA, looking for more clues related to the crash.
What did we find?
After a short investigation, it was found that the bulk database code changes had successfully passed through the pre-production environment. The release fully complied with the company’s change validation procedure and received all the required approvals, yet the application failed to run in production.
To make a long story short, the problem was found after 18 hours and a sleepless night. It turned out that a few months earlier (and several rollouts ago), an urgent bug fix had been introduced to the code running in the production database and was never documented (Not allowed to share the details). Three months later, a planned change was introduced to the same code, overriding and contradicting the first one and breaking its logic. This led the VP of R&D to ask the DBA team leader several questions:
Why did the code change perform well in the QA environment and fail in production?
Why did it take 18 hours to resolve, even with the attention of the CEO and the management team?
How can we prevent such issues from happening again?
Managing, monitoring, and executing database code changes are essential tasks in software development, just like with application code. Here are some reasons why:
Consistency: Application code and database code are closely related, and changes on one side can affect the other. Managing and monitoring database code changes with the same rigor as application code ensures that changes are consistent and do not negatively impact the application’s functionality.
Risk Mitigation: Database code changes can introduce risks, such as data loss or corruption, performance degradation, or security vulnerabilities. Treating database code changes with the same care as application code helps mitigate these risks and prevent potential issues from occurring.
Collaboration: Collaboration between developers and database administrators is critical in software development. Treating database code changes as part of the overall codebase and development process facilitates collaboration, improves communication, and ensures all team members are on the same page.
Quality Assurance: There’s no difference between database change quality assurance and application code quality assurance. Both include testing, debugging, and code review. Treating database code changes with the same level of attention and scrutiny as application code helps ensure that the code is of high quality and meets the organization’s standards.
Change Management: Effective change management is crucial in software development to prevent errors, avoid unexpected consequences, and ensure compliance with regulations and industry standards. Managing database code changes with the same level of rigor as application code helps ensure that changes are well-documented, tracked, and approved before implementation.
Back to the VP of R&D’s questions: the answer to the first question was that, because of its urgency, the bug fix was introduced directly to the higher environments (pre-prod and production). It wasn’t part of the regular release cycle (starting at development and being documented in source control). Nothing indicated that the planned follow-up change was about to conflict with an urgent bug fix.
The proper practice and well-defined process weren’t followed due to the demand for a prompt response.
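A fix applied directly to a higher environment leaves that environment out of step with source control, which is often called drift. The following is a minimal sketch of how such drift could be detected, assuming object definitions can be exported as text from both source control and the live database. The function names, the procedure name, and the SQL bodies are hypothetical and do not describe the retailer’s actual code or tooling.

```python
import hashlib


def fingerprint(definition: str) -> str:
    """Hash a definition after crude normalization (case and whitespace).

    Note: lowercasing also affects string literals, so this is only a sketch.
    """
    normalized = " ".join(definition.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_drift(source_control: dict[str, str], live: dict[str, str]) -> dict[str, str]:
    """Compare object definitions by name and report every mismatch."""
    drift = {}
    for name in sorted(set(source_control) | set(live)):
        if name not in live:
            drift[name] = "missing in live environment"
        elif name not in source_control:
            drift[name] = "undocumented object in live environment"
        elif fingerprint(source_control[name]) != fingerprint(live[name]):
            drift[name] = "definition differs from source control"
    return drift


# Hypothetical data: the procedure name and bodies are invented for illustration.
source_control = {
    "calc_order_total": "CREATE PROCEDURE calc_order_total AS "
                        "SELECT price * qty FROM orders",
}
production = {
    # Simulates an urgent fix applied directly in production and never documented.
    "calc_order_total": "CREATE PROCEDURE calc_order_total AS "
                        "SELECT price * qty * (1 - discount) FROM orders",
}

report = find_drift(source_control, production)
for name, reason in report.items():
    print(f"{name}: {reason}")
```

Running a check like this before each release would flag an object that was changed outside the regular cycle, so a planned change to the same code could be reviewed against the hotfix instead of silently overriding it.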
The answer to the second question was that the enormous number of database code changes produced this year, all documented manually, made scanning for issues difficult and time-consuming. The conflict with the undocumented fix was found more by chance than through a methodical process.
The answer to the third question was already apparent from the previous two. Manual database change releases dramatically increase the risk of downtime, and the risk grows with the sheer volume of database changes.
Here are some reasons why manual database releases can result in downtime:
Human Error: Manual database releases are often error-prone and can introduce mistakes, such as incorrect syntax, schema changes, or data loss, which can cause downtime or system failure.
Lack of Visibility: Manual database releases lack the visibility and transparency that are necessary for tracking changes, monitoring performance, and detecting potential issues before they become critical.
Dependency Management: Modern database environments are complex, with many interdependent components, which makes managing database releases manually challenging. A single mistake in one release sometimes cascades into a chain of problems that results in downtime.
Security Risks: Manual database releases can also introduce security vulnerabilities, such as exposing sensitive data, granting incorrect permissions, or failing to patch known vulnerabilities.
To mitigate the risks of downtime in database releases, organizations should consider implementing processes and tools that standardize, track, and automate the database change release process. By automating the release process, organizations can reduce human error, increase visibility, manage dependencies more efficiently, and enhance security. Automated release management can also provide detailed audit trails, documentation, and reporting to help organizations meet regulatory compliance requirements.
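As an illustration of what a tracked, automated release can look like, here is a minimal sketch of a versioned migration runner. It uses Python’s built-in sqlite3 module only to stay self-contained. The schema_history table, the apply_migrations function, and the sample migrations are assumptions made for this example, not a description of any specific product.

```python
import hashlib
import sqlite3


def apply_migrations(conn, migrations):
    """Apply pending (version, sql) migrations in order and record each one."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_history ("
        "version TEXT PRIMARY KEY, "
        "checksum TEXT NOT NULL, "
        "applied_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    applied = dict(conn.execute("SELECT version, checksum FROM schema_history"))
    newly_applied = []
    for version, sql in sorted(migrations):
        checksum = hashlib.sha256(sql.encode("utf-8")).hexdigest()
        if version in applied:
            # Refuse to continue if an already-released script was edited.
            if applied[version] != checksum:
                raise RuntimeError(f"migration {version} changed after it was applied")
            continue
        conn.executescript(sql)
        conn.execute(
            "INSERT INTO schema_history (version, checksum) VALUES (?, ?)",
            (version, checksum),
        )
        conn.commit()
        newly_applied.append(version)
    return newly_applied


# Hypothetical migrations; zero-padded versions keep the sort order stable.
conn = sqlite3.connect(":memory:")
migrations = [
    ("001", "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);"),
    ("002", "ALTER TABLE customer ADD COLUMN email TEXT;"),
]
first_run = apply_migrations(conn, migrations)
second_run = apply_migrations(conn, migrations)
print(first_run, second_run)
```

The history table doubles as an audit trail: every change that reaches the database has a version, a checksum, and a timestamp, and a second run applies nothing. A change that bypasses the runner leaves no record, which is exactly the gap a drift check is meant to catch.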