Quite a bit of hubbub over WordPress’s recent outage. A number of high profile blogs including Techcrunch, GigaOm, CNN, and your very own SmoothSpan use WordPress. Matt Mullenweg told Read/WriteWeb:
“The cause of the outage was a very unfortunate code change that overwrote some key options in the options table for a number of blogs. We brought the site down to prevent damage and have been bringing blogs back after we’ve verified that they’re 100% okay.”
Apparently, WordPress has three data centers, 1300 servers, and is home to on the order of 10 million blogs. Techcrunch is back and talking about it, but as I write this, GigaOm is still out. Given the nature of the outage, WordPress presumably has to hand tweak that option information back in for all the blogs that got zapped. If it is restoring from backup, that can be painful too.
While one can lay blame at the doorstep of whatever programmer made the mistake, the reality is that programmers make mistakes. It is unavoidable. The important question is what has been done from an Operations and Architecture standpoint that either mitigates or compounds the likelihood that such mistakes cause a problem. In this case, I blame multitenancy. When a single code change can zap all your customers this quickly, your architecture had to have helped you pull it off.
Don’t get me wrong, I’m all for multitenancy. In fact, it’s essential for many SaaS operations. But, companies need to have a plan to manage the risks inherent in multitenancy. The primary risk is the rapidity with which rolling out a change can affect your customer base. When operations are set up so that every tenant is in the same “hotel”, this problem is compounded, because it means everyone gets hit.
What to do?
First, your architecture needs to support multiple hotels, and it needs to include tools that make it easy for your operations personnel to manage which tenants are in which hotels, which codelines run on which hotels (more on that one in a minute), and to rapidly rehost tenants to a different hotel, if desired. These capabilities pave the way for a tremendous increase in operational flexibility that makes it far easier to do all sorts of things and possible to do some things that are completely impossible with a single hotel.
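As a rough illustration of the kind of tooling described above (all names here are hypothetical stand-ins, not any vendor’s actual API), a tenant-to-hotel registry with per-hotel codelines and a rehosting operation might look something like this:

```python
# Hypothetical sketch: a registry mapping tenants to "hotels" (shared instances),
# with each hotel pinned to a codeline, plus a rehost operation for moving tenants.

class HotelRegistry:
    def __init__(self):
        self.hotels = {}   # hotel name -> codeline running on that hotel
        self.tenants = {}  # tenant id  -> hotel name

    def add_hotel(self, name, codeline):
        self.hotels[name] = codeline

    def assign(self, tenant, hotel):
        if hotel not in self.hotels:
            raise KeyError(f"unknown hotel: {hotel}")
        self.tenants[tenant] = hotel

    def rehost(self, tenant, new_hotel):
        # In this sketch, rehosting is just a mapping change; in practice it
        # also means migrating data and draining in-flight requests.
        self.assign(tenant, new_hotel)

    def codeline_for(self, tenant):
        return self.hotels[self.tenants[tenant]]

registry = HotelRegistry()
registry.add_hotel("hotel-a", "release-4.1")
registry.add_hotel("hotel-b", "release-4.2-beta")
registry.assign("bigcustomer", "hotel-a")
registry.rehost("bigcustomer", "hotel-b")
print(registry.codeline_for("bigcustomer"))  # release-4.2-beta
```

The key design point is that the tenant-to-hotel mapping is data, not code, so operations staff can move a tenant without a deployment.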
Second, I highly encourage the use of a Cloud data center, such as Amazon Web Services. Here again, the reason is operational flexibility. Spinning up more servers rapidly for any number of reasons is easy to do, and you take the cost of temporarily having a lot more servers (for example, to give your customers a beta test of a new release) off the table because it is so cheap to temporarily have a lot of extra servers.
Last step: use a feathered release cycle. When you roll out a code change, no matter how well tested it is, don’t deploy to all the hotels at once. A feathered release cycle delivers the code change to one hotel at a time and waits an appropriate length of time to see that nothing catastrophic has occurred. It’s amazing what a difference a day makes in understanding the potential pitfalls of a new release. Given the operational flexibility of being able to manage multiple hotels, you can adopt all sorts of release feathering strategies. Starting with smaller customers, brand new customers, freemium customers, or beta-testing customers are all possibilities that can result in considerable risk mitigation for the majority of your customer base.
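To make the idea concrete, here is a minimal sketch of a feathered rollout loop; the hotel names and the deploy/health-check hooks are hypothetical stand-ins for whatever your operations tooling provides:

```python
# Hypothetical sketch of release feathering: deploy a new codeline to one hotel
# at a time, soak, check health, and halt the rollout at the first failure.
import time

def feathered_rollout(hotels, new_codeline, deploy, healthy, soak_seconds=0):
    """Deploy to each hotel in order; stop if any hotel looks unhealthy."""
    for hotel in hotels:
        deploy(hotel, new_codeline)
        time.sleep(soak_seconds)         # in practice: hours, or a full day
        if not healthy(hotel):
            return f"halted at {hotel}"  # remaining hotels are untouched
    return "complete"

# Toy example: the third hotel reports a failure after deployment, so the
# fourth hotel never receives the bad release.
deployed = []
result = feathered_rollout(
    ["free-tier", "new-customers", "hotel-3", "hotel-4"],
    "release-4.2",
    deploy=lambda hotel, codeline: deployed.append(hotel),
    healthy=lambda hotel: hotel != "hotel-3",
)
print(result)    # halted at hotel-3
print(deployed)  # ['free-tier', 'new-customers', 'hotel-3']
```

Ordering the hotel list by risk tolerance (freemium first, largest customers last) is exactly the feathering strategy described above.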
If you’re a customer looking at SaaS solutions, ask about their capacity for multiple hotels and release feathering. It just may save you considerable pain.

My take on this one is slightly different, Bob. While one simple code commit shouldn’t take out an entire chunk of the web, the fact is that it did, and WordPress’s vulnerability is no different from that of just about any other system out there. What is needed is a fundamental rethink of how systems are designed and built. This problem has been around for decades, and it is only getting worse as systems grow larger and more complex.
http://thewellrunsite.com/2010/06/11/wordpress-outage-woodpeckers-web-sites-weinberg-oh-my/
Michael, of course, that is the point of my post, that you are going to do very high risk things. The point was also to suggest some architectural and operational means of mitigating the risks.
You’ve called for a fundamental rethink in your post, but you don’t really prescribe any remedies. Hand-wringing is all well and good, but we need treatment as much as, or more than, diagnosis.
Cheers,
BW
Calling for a rethink was the point of my post, not necessarily prescribing remedies. People must realize the uncomfortable fact that, at their deepest levels, most every computing system built since ENIAC’s power switch was first flipped is vulnerable to the same house-of-cards design flaws. Treatment, such as you’ve suggested in your post, is fine, and I would advocate many of the same things you’ve written. But the fundamental problem with treatment is that the underlying flaw remains lurking, waiting to happen again on some other system, perhaps with more severe consequences.
I think it’s time we start considering how systems can be designed better, so they don’t suffer from these flaws. As long as humans continue designing systems in this same way, human error will occasionally bring them down. I’m not advocating any particular approach, not yet at least. But after watching things blow up like this for more than three decades, and considering the increasing centralization of computing services and their importance to society as a whole, the impact and the risks are becoming greater still.
I would argue that WordPress is NOT truly multi-tenant… as I understand their architecture, it’s all shared resources right down to the blogging software. In a true multi-tenant environment, each instance would be silo’d off from the others, so a simple change in code on one would not affect the other 10 million.
Also, as I understand it, it’s not a cloud, it’s an ISP host which is a huge difference.
Greg, many, if not most, multitenant architectures even commingle the instances in the same tables. That’s not very silo’d at all.
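To illustrate the risk being described here (a toy sketch, not WordPress’s actual schema), consider a shared options table where every tenant’s rows are commingled; a single UPDATE missing its tenant predicate rewrites everyone at once:

```python
# Hypothetical sketch of why commingled tenant rows are risky: an UPDATE that
# omits the tenant_id predicate silently rewrites every tenant's options.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE options (tenant_id TEXT, name TEXT, value TEXT)")
db.executemany("INSERT INTO options VALUES (?, ?, ?)", [
    ("blog-a", "siteurl", "https://a.example.com"),
    ("blog-b", "siteurl", "https://b.example.com"),
])

# Intended: fix one tenant's option.  Actual: with no WHERE tenant_id clause,
# the statement overwrites the option for every tenant in the shared table.
db.execute(
    "UPDATE options SET value = 'https://a.example.com' WHERE name = 'siteurl'"
)

rows = db.execute(
    "SELECT tenant_id, value FROM options ORDER BY tenant_id"
).fetchall()
print(rows)  # both tenants now point at a.example.com
```

With one hotel per tenant group, the same mistake would have damaged only the hotel it ran against.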
Cheers,
BW
Hi Bob
It’s a great post you wrote there, highlighting certain areas where WP needs to work, indeed.
Nevertheless, as far as my technical understanding goes, I wouldn’t mind calling it another human error made by a certain group of programmers, which could be termed unavoidable in certain cases.
Now, the point is that even if the programmer made the mistake, I heard MATT himself clarifying that the reason behind the outage or disruption of service was a single code error which made the servers act strangely on the options table in the WordPress platform itself.
Well, I wonder if we know something called SOFTWARE TESTING?! Or maybe QUALITY TESTING? Before making changes to the live blogs, not to mention the numbers (10 million of them).
Something fishy in there, indeed.
But rather than focusing on the past, it is better to figure out a future strategy to avoid such mishaps, where big tech blogs and research honchos can go blank in a blink as a consequence.
Implementing the CLOUD to host WP isn’t a bad idea. On top of that, I think MATT and his team must already be in the process of thinking about or scripting out a WPCLOUD.
I think, given the growing number of demands from users/customers, stability and 24/7 backup are necessary requirements for the current generation and will play an extensive role in real-time applications of cloud computing for the next few years or so.
Nevertheless, being a WP user myself, I admire MATT for his innovative thinking behind WP and its subsidiaries.
Long live the CLOUD.
=]
[…] an example, check out how a simple software error affected tens of millions of users of WordPress (WordPress and the dark side of multitenancy.) While we’re talking about a different layer in the stack, the issue is the […]
[…] Bob Warfield, serial entrepreneur and experienced SaaS executive posted though, there are better ways now to design a SaaS infrastructure that won’t be as vulnerable […]
[…] through a variety of sources coming through my Twitter stream, and made the switch anyway. Bob Warfield’s post on Enterprise Irregulars crossed my plate repeatedly as […]