Entries matching “"failure happens"” from O'Reilly Radar

Thu, Apr 10, 2008

Velocity preview at Web2.0 Expo

by Jesse Robbins (@jesserobbins) | comments: 2

At the Web2.0 Expo this month we have a small preview of some of the topics and speakers at the Velocity Web Performance & Operations conference.  (Radar readers get a 20% discount by using "vel08js" as a discount code... and yes it works with the $300 early registration discount!).

Failure Happens
Friday @ 11:00 am, Room 2009

Artur Bergman and I will kick off the day with an entertaining/informative/eye-opening review of the year’s biggest failures, disasters, and painful lessons learned.

We'll review incidents by underlying root cause, with a focus on what could have been done to prevent them. We promise not to be too harsh on anybody, although we will give special attention to particularly ironic failures or those that are "entertainingly coupled" to absurd marketing claims.

(Hint: Send your boss to this talk if they don't understand why you and your whole team need to go to Velocity.)

Even Faster Web Sites
Friday @ 1:30 pm, Room 2012

Steve Souders is the co-chair of Velocity and author of the bestselling book High Performance Web Sites. At the Expo last year Steve gave an incredibly popular talk on the 14 best practices he developed while working as the Chief Performance Yahoo!.

(continue reading)

tags: open source, operations, performance, platform plays, upcoming appearances, velocity, velocity08, web 2.0, web 2.0 expo, web2expo, webops | comments: 2

Thu, Mar 27, 2008

Amazon improves EC2 (by embracing failure)

by Jesse Robbins (@jesserobbins) | comments: 5

Amazon just announced two big improvements to EC2:

  • Multiple Locations
    Amazon EC2 now provides the ability to place instances in multiple locations. Amazon EC2 locations are composed of regions and Availability Zones. Regions are geographically dispersed and will be in separate geographic areas or countries. Currently, Amazon EC2 exposes only a single region. Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same region. Regions consist of one or more Availability Zones. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location.

  • Elastic IP Addresses
    Elastic IP addresses are static IP addresses designed for dynamic cloud computing. An Elastic IP address is associated with your account not a particular instance, and you control that address until you choose to explicitly release it. Unlike traditional static IP addresses, however, Elastic IP addresses allow you to mask instance or Availability Zone failures by programmatically remapping your public IP addresses to any instance in your account. Rather than waiting on a data technician to reconfigure or replace your host, or waiting for DNS to propagate to all of your customers, Amazon EC2 enables you to engineer around problems with your instance or software by quickly remapping your Elastic IP address to a replacement instance.

Datacenters and geographic regions are Single Points of Failure (SPOF) too. Failure Happens, and it's far better (and cheaper) to build services that are resilient to failures than to try to prevent them from happening. This is a big step in the right direction.
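
To make the remapping idea concrete, here is a minimal sketch of a failover step. It is not Amazon's API: `fail_over`, `pick_replacement`, the `is_healthy` check, and the `remap` callable are all hypothetical names, and the instance IDs and the 203.0.113.10 address are made up. In a real deployment `remap` would wrap whatever call your provider offers for re-associating an Elastic IP, and the standby would ideally live in a different Availability Zone.

```python
from typing import Callable, Optional, Sequence


def pick_replacement(
    candidates: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy instance ID, or None if none is healthy."""
    for instance_id in candidates:
        if is_healthy(instance_id):
            return instance_id
    return None


def fail_over(
    elastic_ip: str,
    current: str,
    standbys: Sequence[str],
    is_healthy: Callable[[str], bool],
    remap: Callable[[str, str], None],
) -> str:
    """Keep elastic_ip pointed at a healthy instance and return that instance's ID."""
    if is_healthy(current):
        return current  # the current instance is fine, so leave the mapping alone
    replacement = pick_replacement(standbys, is_healthy)
    if replacement is None:
        raise RuntimeError("no healthy standby available")
    # Hypothetical wrapper around the provider's address-remapping call.
    remap(elastic_ip, replacement)
    return replacement


if __name__ == "__main__":
    # In-memory stand-ins for a health check and for the remap call.
    down = {"i-primary"}
    mapping = {"203.0.113.10": "i-primary"}

    def is_healthy(instance_id: str) -> bool:
        return instance_id not in down

    def remap(ip: str, instance_id: str) -> None:
        mapping[ip] = instance_id

    active = fail_over("203.0.113.10", "i-primary", ["i-standby-b"], is_healthy, remap)
    print(active, mapping)
```

Passing the health check and the remap call in as functions keeps the decision logic testable without touching a real account.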

Update: RightScale posted an excellent overview of how this works.

tags: amazon, aws, ec2, failure happens, infrastructure, internet policy, mysql conference, operations, platform plays, velocity08 | comments: 5

Sat, Feb 2, 2008

Failure Happens: Transcontinental fiber-optic submarine cables

by Jesse Robbins (@jesserobbins) | comments: 9

The Guardian published a summary of the ongoing impact from the transcontinental fiber-optic submarine cable cuts along with a map from Telegeography.com:

According to reports, the internet blackout, which has left 75 million people with only limited access, was caused by a ship that tried to moor off the coast of Egypt in bad weather on Wednesday. Since then phone and internet traffic has been severely reduced across a huge swath of the region, slashed by as much as 70% in countries including India, Egypt and Dubai. [...]

"It will depend on how bad the damage is, but they'll find the sections in question and bring them up onto a ship for repair before sinking them again," said Mauldin. "It could take a week or possibly two weeks."

The fibre optic wires in question - called Flag Europe-Asia and Sea-Me-We 4 - are some of the most vital information pipelines between Europe and the east. The latter, which runs in an uninterrupted line from western Europe to Singapore, had only recently been opened after a mammoth £500m, three-year installation project. Between them, the two lines are responsible for around 75% of all connectivity in the Middle East and south Asia.

[Image: map of the affected submarine cables, from Telegeography.com via The Guardian]

(continue reading)

tags: geo, internet policy, operations, platform plays, web 2.0, worries | comments: 9

Mon, Nov 12, 2007

Failure Happens: An SLA is just a contract & Data Centers are single points of failure too

by Jesse Robbins (@jesserobbins) | comments: 9

Rackspace just had a Data Center failure, as Scott Beale, TechCrunch, and Valleywag are reporting. Rackspace has been one of the most reliable infrastructure providers for many years, and it has done much to live up to the slogan of providing "fanatical support". Unfortunately, many people misinterpret its "zero downtime network" marketing as a promise that it will not fail.

Rackspace does not promise that its system will not fail. Instead, it establishes a Service Level Agreement (SLA), which specifies how customers will be compensated when failure happens. The Rackspace SLA is actually one of the clearest in the industry:

Rackspace's SLA is a contract between you, the customer, and Rackspace. It defines the terms of our responsibility and the money back guarantees if our responsibilities are not met. We want our customers to feel at ease with their decision to move their site to Rackspace, and knowing that Rackspace takes your site's uptime as seriously as you do is imperative. [...]

Rackspace guarantees that the critical infrastructure systems will be available 100% of the time in a given month, excluding scheduled maintenance. Critical infrastructure includes functioning of all power and HVAC infrastructure including UPSs, PDUs and cabling, but does not include the power supplies on customers' servers. Infrastructure downtime exists when a particular server is shut down due to power or heat problems and is measured from the time the trouble ticket is opened to the time the problem is resolved and the server is powered back on.

Rackspace Guarantee: Upon experiencing downtime, Rackspace will credit the customer 5% of the monthly fee for each 30 minutes of downtime (up to 100% of customer's monthly fee for the affected server).
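
The guarantee quoted above is simple enough to write down as arithmetic. This is only a sketch of my reading of it, not Rackspace's billing logic: the function names are mine, and I am assuming that only complete 30-minute blocks earn a credit, since the quoted text does not say how partial blocks are treated.

```python
def sla_credit_percent(downtime_minutes: float) -> int:
    """Credit as a percentage of the monthly fee for the affected server.

    Assumption: only complete 30-minute blocks of downtime count.
    """
    if downtime_minutes < 0:
        raise ValueError("downtime cannot be negative")
    complete_blocks = int(downtime_minutes // 30)
    # 5% per block, capped at 100% of the monthly fee.
    return min(100, 5 * complete_blocks)


def sla_credit_amount(monthly_fee: float, downtime_minutes: float) -> float:
    """Credit in the same currency as monthly_fee."""
    return monthly_fee * sla_credit_percent(downtime_minutes) / 100


if __name__ == "__main__":
    for minutes in (29, 30, 95, 600, 1440):
        print(minutes, "minutes ->", sla_credit_percent(minutes), "% credit")
```

Under this reading the cap is reached after 10 hours of downtime (20 blocks of 30 minutes), and the credit never exceeds one month's fee for that server, however much the outage cost you. That is the point: an SLA is just a contract.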

Please remember that Data Centers are single points of failure too. (See Artur Bergman's post and my followup after the 365 Main outage.)


Incidentally, the Recovery Oriented Computing project is an exceptional resource for those interested in building resilient systems:

The Recovery-Oriented Computing (ROC) project is a joint Berkeley/Stanford research project that is investigating novel techniques for building highly-dependable Internet services. In a significant divergence from traditional fault-tolerance approaches, ROC emphasizes recovery from failures rather than failure-avoidance. This philosophy is motivated by the observation that even the most robust systems still occasionally encounter failures due to human operator error, transient or permanent hardware failure, and software anomalies resulting from "Heisenbugs" or software aging.

The ROC approach takes the following three assumptions as its basic tenets:
* failure rates of both software and hardware are non-negligible and increasing
* systems cannot be completely modeled for reliability analysis, and thus their failure modes cannot be predicted in advance
* human error by system operators and during system maintenance is a major source of system failures
These assumptions, while running counter to most existing work in dependable and fault-tolerant systems, are all strongly supported by field evidence from modern production Internet service environments.

Update: "Rackspace outage was third in two days" (Valleywag), "Truck Crash Knocks Rackspace Offline" (Data Center Knowledge)

tags: operations, web 2.0 | comments: 9

Sat, Nov 3, 2007

Failure Happens: Taser-wielding thieves steal servers, attack staff, and cause outages at Chicago colocation facility

by Jesse Robbins (@jesserobbins) | comments: 7

Dan Goodin at The Register reports that C I Hosts' Chicago facility was robbed last month for the third time... fourth time... second time (the other two times were merely "break-ins where things were stolen")

In the most recent incident, "at least two masked intruders entered the suite after cutting into the reinforced walls with a power saw," according to a letter C I Host officials sent customers. "During the robbery, C I Host's night manager was repeatedly tazered and struck with a blunt instrument. After violently attacking the manager, the intruders stole equipment belonging to C I Host and its customers." At least 20 data servers were stolen, said Patrick Camden, deputy director of news affairs for the Chicago Police Department.

The Chicago location has been hit by similar breaches in the past, according to police reports. One report detailing an occurrence on September 23, 2005, recounts a "hole cut through the wall coming out onto the hallway of third floor." During a September 20, 2006 incident, an intruder "placed a silver + blk handgun to [victim's] head and stated 'lay down on the floor.'" The victim, a C I Host employee, was then blindfolded, bound with black tape and struck on the head with a weapon, according to the report.

Wow... I hope that everybody is now okay. There is some interesting discussion by affected customers over on the WebHostingTalk forums.

I'll be doing a post-incident report using the Simple Availability Report format I introduced last week. (If you would like to contribute, please post in the comments or email me directly at jesse AT oreilly.com.)

Updated: Anastasia Tubanos (theWHIR.com) has posted her interview and followup with James Eckles, chief corporate counsel for CI Host.

"There's no resolution really," he says. "We're dealing with the situation on a customer-by-customer basis. We've got nothing to hide, even though people have been saying otherwise online. The forums have been a bed of misinformation - extortion compounded with defamation. One of the biggest mistakes is that people are talking about four robberies. A robbery means than property has been seized through violence or intimidation. C I Host has technically only been robbed twice in two years. The other two were break-ins where things were stolen, but not robberies."

tags: operations, web 2.0, worries | comments: 7

Fri, Jul 27, 2007

Failure Happens: A summary of the power outage at 365 Main

by Jesse Robbins (@jesserobbins) | comments: 19

Datacenter provider 365 Main released their initial report on Tuesday's power failure, which affected Craigslist, Technorati, Yelp, TypePad, LiveJournal, Vox, and others. This outage is an excellent example of complex systems failure, and so I'll be using it as the basis for my next few posts on Operations. This is my own analysis using publicly available data.

The 365 Main site does not have a typical battery backup system. Instead they rely on Continuous Power Supplies (CPS), which use a flywheel-driven alternator to generate electricity. The flywheel is connected to both a large diesel motor and an electric motor which runs on utility power. The flywheel is normally turned by the electric motor, and stores enough kinetic energy to power the alternator for up to 15 seconds. When utility power fails, the diesel motor is supposed to start in under 5 seconds, well before the flywheel's kinetic energy is exhausted, providing uninterrupted electrical power.
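
The timing budget in that description can be sketched in a few lines. This is a back-of-the-envelope illustration, not a model of the actual 365 Main equipment: the function names are mine, it takes the "up to 15 seconds" figure at face value, and it assumes start attempts happen back to back with no time lost detecting a failed start.

```python
def ride_through_margin_s(flywheel_s: float, diesel_start_s: float) -> float:
    """Seconds of stored energy left over once the diesel is running."""
    return flywheel_s - diesel_start_s


def start_attempts_within_ride_through(flywheel_s: float, diesel_start_s: float) -> int:
    """How many back-to-back start attempts fit before the flywheel runs down."""
    if flywheel_s < 0 or diesel_start_s <= 0:
        raise ValueError("times must be positive")
    return int(flywheel_s // diesel_start_s)


if __name__ == "__main__":
    # Figures from the description above: 15 s of flywheel energy, 5 s worst-case start.
    print(ride_through_margin_s(15, 5))
    print(start_attempts_within_ride_through(15, 5))
```

With those figures the budget is tight: a diesel that needs more than a few attempts to start will outlast the flywheel.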

The advantage of a CPS over a battery-based system is that the power going to the datacenter is decoupled from the utility power. This eliminates the complex electrical switching required by most battery-based systems, making many CPS systems simpler and sometimes more reliable.

In this incident, latent defects caused three generators to fail during start-up. No customers were affected until a fourth generator failed 30 seconds later, which overloaded the surviving backup system and caused power failures to 3 of 8 customer areas.

What's most interesting is that the redundant design of the system is what caused it to fail so completely. The failure of the fourth generator should have only brought down one area instead of three. This kind of cascade failure is common in complex & tightly coupled systems. In my experience, these sorts of failure modes are often identified and then promptly dismissed as being "nearly impossible". Unfortunately, the impossible often becomes reality.

To put it another way... Failure Happens.

Next week we'll dive into building resilient websites and take a look at a few of the sites that went down. Artur and I are both excited to be writing about this, and welcome your comments, suggestions, and war stories!

tags: operations | comments: 19

Wed, Jul 25, 2007

Failure happens

by Artur Bergman | comments: 8

What an exciting day, as services for hundreds of thousands of users and millions of readers disappeared from the internet. In a stunning but unsurprising event, repeated power cycling caused by a blown power station disrupted the 365 Main datacenter, which lost all power to two colocation rooms.

I jokingly refer to 365 Main as the "Web 2.0" datacenter; of course, there is nothing Web 2.0 about the datacenter itself. But it does host a remarkable number of such properties, including Craigslist, Technorati, and Red Envelope. Someone could make a lot of VCs cry by taking it out, or so the running joke goes. And ironically enough, this morning 365 Main (together with Red Envelope) put out a press release announcing 2 years of 100% uptime; one may also note that they have now removed the press release from their site, as http://www.365main.com/press_releases/pr_7_24_07_red_envelope.html returns a file not found.

We can draw the conclusion from today's event that the return to the mainframe world is precarious. Maintaining our own systems is harder than paying someone else to do it; trading on this fact, we expect reliability in exchange for cost. Having entrusted our data, services, and therefore income to these companies, we trust them to keep it safe and available. When this trust is broken by extended downtime, the arrangement is clearly not sustainable.

But that doesn't mean that running your own operations is the solution either. A while ago I wrote about disaster recovery with Amazon Web Services, and some of you astutely pointed out that problems can occur if you run your own operations. And indeed they can! And they do! Clearly, disaster recovery plans are just as important if you are hosting it yourself. These plans exist on a different continuum, affecting not just operations but also your entire organisation's response to disasters.

Planning a response to disasters cannot be avoided, particularly given San Francisco's position between two quite dangerous fault lines. An earthquake is a question of when, not if. Are the startups ready for this? How long will we expect them to be gone? Possible answers to these questions look grim given a cursory analysis of today's events. Several of the world's largest websites went down. None of them were ready for a datacenter outage. None of them had backup datacenters or failover that worked. None even had a coherent strategy for communicating the situation to the rest of the world.

Google, Yahoo, Amazon and eBay are companies that have invested money in redundancy. Because of this investment they can build systems that survive outages, a significant comparative advantage for these companies. This advantage presents a tradeoff I am willing to make: even if I don't trust Google with my data's privacy, I do trust that they will keep my data safe. (Or at least safer than I or most people could.) But the tradeoff doesn't pay off if the company cannot do that, and my trust in a whole swath of companies dropped completely.

I wrote earlier about the discussion around the Open Source Developer Toolkit and the need for frameworks, tools and patterns that improve scalability. But in light of recent events, perhaps reliable and fault-tolerant systems are more important than scalable ones. Events like this make people suddenly realise that MySQL replication is not an adequate disaster recovery strategy. Or that tight coupling between the database and the app might cause a bit of a problem when your database might be moved to another city. Or that the memcache you are accessing is suddenly several milliseconds away. There is a small group of people who know this; some may call them jaded and cynical, others may call them experienced. But the vast majority of developers and operations people are completely out of their depth here.
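
To see why a cache that is "several milliseconds away" hurts, a rough calculation helps. The sketch below is illustrative only: `cache_wait_ms` is a hypothetical helper, and the 50 lookups per page and the 0.5 ms and 5 ms round-trip times are made-up figures, not measurements from any of the sites involved. It assumes lookups are issued one after another unless they are batched into multi-key requests.

```python
import math


def cache_wait_ms(lookups: int, round_trip_ms: float, batch_size: int = 1) -> float:
    """Time a page spends waiting on sequential cache round trips.

    batch_size is how many keys are fetched per request (1 means one key per round trip).
    """
    if lookups < 0 or batch_size < 1:
        raise ValueError("lookups must be >= 0 and batch_size >= 1")
    round_trips = math.ceil(lookups / batch_size)
    return round_trips * round_trip_ms


if __name__ == "__main__":
    # Hypothetical page that makes 50 cache lookups.
    print(cache_wait_ms(50, 0.5))                # cache in the same datacenter
    print(cache_wait_ms(50, 5.0))                # cache in another city
    print(cache_wait_ms(50, 5.0, batch_size=50)) # same distance, one batched request
```

An app written with a chatty, one-key-at-a-time access pattern pays the full distance on every lookup, which is one concrete way tight coupling to a nearby service turns into a problem after a move.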

I want to welcome Jesse Robbins to Radar. We are kicking off a series of articles exploring the depths of the dark and forgotten world of operations. Operations has too long been hiding in the shadows, treated as the poor cousin to engineering and development. It is time to share our horror stories, experiences and ideas in hopes of collectively pushing our profession to a higher level.

tags: operations | comments: 8