Entries tagged with “outages” from O'Reilly Radar

Fri

Apr 10
2009

Jesse Robbins

AT&T Fiber cuts remind us: Location is a Basket too!

by Jesse Robbins@jesserobbinscomments: 3

The fiber cuts affecting much of the San Francisco Bay Area this week are similar to the outages in the Middle East last year (radar post), although far more limited in scope and impact.   What I said last year still holds true and is repeated below: 

From an operations perspective these kinds of outages are nothing new, and underscore why having "many eggs in few baskets" is such a problem. I believe we will see similar incidents when we have the first multi-datacenter failures where multiple providers lose significant parts of their infrastructure in a single geographic area.

Remember: Don't put all your eggs in one basket... and Location is a basket too!

To really understand the issue, I recommend Neal Stephenson's incredible (and lengthy) Wired article from 1996 entitled "Mother Earth Mother Board":

[...] It sometimes seems as though every force of nature, every flaw in the human character, and every biological organism on the planet is engaged in a competition to see which can sever the most cables. The Museum of Submarine Telegraphy in Porthcurno, England, has a display of wrecked cables bracketed to a slab of wood. Each is labeled with its cause of failure, some of which sound dramatic, some cryptic, some both: trawler maul, spewed core, intermittent disconnection, strained core, teredo worms, crab's nest, perished core, fish bite, even "spliced by Italians." The teredo worm is like a science fiction creature, a bivalve with a rasp-edged shell that it uses like a buzz saw to cut through wood - or through submarine cables. Cable companies learned the hard way, early on, that it likes to eat gutta-percha, and subsequent cables received a helical wrapping of copper tape to stop it.

[...] There is also the obvious threat of sabotage by a hostile government, but, surprisingly, this almost never happens. When cypherpunk Doug Barnes was researching his Caribbean project, he spent some time looking into this, because it was exactly the kind of threat he was worried about in the case of a data haven. Somewhat to his own surprise and relief, he concluded that it simply wasn't going to happen. "Cutting a submarine cable," Barnes says, "is like starting a nuclear war. It's easy to do, the results are devastating, and as soon as one country does it, all of the others will retaliate."

As the capacity of optical fibers climbs, so does the economic damage caused when the cable is severed. FLAG makes its money by selling capacity to long-distance carriers, who turn around and resell it to end users at rates that are increasingly determined by what the market will bear. If FLAG gets chopped, no calls get through. The carriers' phone calls get routed to FLAG's competitors (other cables or satellites), and FLAG loses the revenue represented by those calls until the cable is repaired. The amount of revenue it loses is a function of how many calls the cable is physically capable of carrying, how close to capacity the cable is running, and what prices the market will bear for calls on the broken cable segment. In other words, a break between Dubai and Bombay might cost FLAG more in revenue loss than a break between Korea and Japan if calls between Dubai and Bombay cost more.

The rule of thumb for calculating revenue loss works like this: for every penny per minute that the long distance market will bear on a particular route, the loss of revenue, should FLAG be severed on that route, is about $3,000 a minute. So if calls on that route are a dime a minute, the damage is $30,000 a minute, and if calls are a dollar a minute, the damage is almost a third of a million dollars for every minute the cable is down. Upcoming advances in fiber bandwidth may push this figure, for some cables, past the million-dollar-a-minute mark. [Link]

It's also worth mentioning the outages to multiple service providers hosted in a single colocation facility when the FBI sized all the equipment in the facility, the big outage at 365 Main from two years ago, and many others (see: Radar posts & comprehensive coverage at Data Center Knowledge).

(If Web Operations & Infrastructure is your interest or passion, you should attend Velocity 2009.  You can use the code "vel09cmb" for a 15% discount)

velocity2009.gif
(Image source: http://www.flickr.com/photos/mundane_joy/2301368102/)

tags: at&t, cloud, failure, failure happens, fiber, infrastructure, operations, outages, velocity, velocity09, web infrastructure, web operations, web2.0, webops, worriescomments: 3
submit: Reddit Digg stumbleupon   

 

Mon

Jun 23
2008

Jesse Robbins

Hyperic CloudStatus service dashboard launches at Velocity!

by Jesse Robbins@jesserobbinscomments: 6

Javier Soltero just launched CloudStatus during his Hyperic sponsor session today at Velocity. CloudStatus is a public health dashboard for web services like Amazon's EC2/S3, and Google's App Engine.

CloudStatus-Hyperic.png

Javier called to tell me about this last week after I declared that "Service Monitoring Dashboards are mandatory". This comes right after Amazon and Google had visible outages, and couldn't have happened at a better time. I'm really excited to see this idea take off, as it's something that is critical to the broad adoption of web services and cloud computing.

tags: cloudstatus, hyperic, monitoring, operations, outages, platform plays, specialized services, startups, velocity, velocity08, web 2.0, webopscomments: 6
submit: Reddit Digg stumbleupon   

 

Tue

Jun 17
2008

Jesse Robbins

Service Monitoring Dashboards are mandatory for production services!

by Jesse Robbins@jesserobbinscomments: 6

Google App Engine went down earlier today. GAE is still a developer preview release, and currently lacks a public monitoring dashboard. Unfortunately this means that many people either found out from their app and/or admin consoles being unavailable or from Mike Arrington's post on TechCrunch.

Google has a strong Web Operations culture, and there are numerous internal monitoring tools in use across the company, along with a smaller set available to customers. It's suprising that Google launched a developer platform without providing something beyond an email group, although they are by no means the first to do so.

google-app-engine-needs-a-dashboard.png

Service Monitoring Dashboards are mandatory for production services and platforms!

  • If you launch a platform that people pay you money for, you need to have a real time service dashboard. Ideally this should be decoupled from the rest of your infrastructure.
  • Don't rely on platforms that lack service monitoring dashboards for production.

Many companies are initially reluctant to provide this kind of monitoring to the public, and only do so in reaction to an outage. However, it seems that every company that offers such a dashboard uses it as a source of competitive advantage.

The best example of this is trust.salesforce.com which they launched after series of outages in 2006. Amazon (eventually) launched a status dashboard for AWS, and added RSS feeds for specific services which I think is pretty cool.

AWS Service Health Dashboard - Jun 17, 2008.png

Javier Soltero at Hyperic points out

1. The reports of service outages arrive long after anyone who depends on the services can possibly do anything to mitigate their effect.
2. The services themselves seem incapable of providing any visibility into the circumstances that might lead to future outages.

[...]Even TechCrunch points out that the Google Apps blog doesn’t even mention the outage. Other clouds rely on blogs such as this one, this one, or maybe even this one (from our good friends at Mosso). These are all places where outages can be discussed, but not the right means for people to find out whether it their application that crashed, or the cloud that it depends on.

(Updated:Niall Kennedy pointed out that GAE is still a preview release, and I agree that my original wording was wrong. My intent is to emphasize the importance of providing a public service dashboard and so I've edited accordingly.)

tags: failure happens, google app engine, infrastructure, internet policy, monitoring, operations, outages, platform plays, platforms, saas, velocity, web 2.0, web services, webopscomments: 6
submit: Reddit Digg stumbleupon