Entries tagged with “failure happens” from O'Reilly Radar
Four Short Links: 25 August 2009
Reverse Search, PDF Stripping, Flash Visualization, Failure
by Nat Torkington | @gnat | comments: 1
- Tineye -- reverse search engine; you upload an image and they find you similar images so you know where else it's used. Check out their cool searches.
- PDF Pirate -- upload a PDF and this web site will give it back to you minus the restrictions on copying/printing/etc.
- Flare -- an ActionScript library for creating visualizations that run in the Adobe Flash Player. BSD-licensed, modelled on Prefuse. When there's a visualisation library for every platform, will we start to get people who know how to make them?
- The Importance of Failure (Marco Tabini) -- This is a point that I don't often hear made when people talk about failure; the moral behind a failure-related story is usually about preventing it, or dealing with the aftermath, but not about the fact that sometimes things go bad despite your best efforts, and all the careful risk management and contingency planning won't keep you from going down in flames. This is important, because it forces every person to establish a risk threshold that they are willing to accept in every one of their life efforts.
tags: drm, failure, failure happens, flash, publishing, search, visualization
AT&T Fiber cuts remind us: Location is a Basket too!
by Jesse Robbins | @jesserobbins | comments: 3
The fiber cuts affecting much of the San Francisco Bay Area this week are similar to the outages in the Middle East last year (radar post), although far more limited in scope and impact. What I said last year still holds true and is repeated below:
From an operations perspective these kinds of outages are nothing new, and underscore why having "many eggs in few baskets" is such a problem. I believe we will see similar incidents when we have the first multi-datacenter failures where multiple providers lose significant parts of their infrastructure in a single geographic area.
Remember: Don't put all your eggs in one basket... and Location is a basket too!
It's also worth mentioning the outages to multiple service providers hosted in a single colocation facility when the FBI seized all the equipment in the facility, the big outage at 365 Main from two years ago, and many others (see: Radar posts & comprehensive coverage at Data Center Knowledge).
To really understand the issue, I recommend Neal Stephenson's incredible (and lengthy) Wired article from 1996 entitled "Mother Earth Mother Board":
[...] It sometimes seems as though every force of nature, every flaw in the human character, and every biological organism on the planet is engaged in a competition to see which can sever the most cables. The Museum of Submarine Telegraphy in Porthcurno, England, has a display of wrecked cables bracketed to a slab of wood. Each is labeled with its cause of failure, some of which sound dramatic, some cryptic, some both: trawler maul, spewed core, intermittent disconnection, strained core, teredo worms, crab's nest, perished core, fish bite, even "spliced by Italians." The teredo worm is like a science fiction creature, a bivalve with a rasp-edged shell that it uses like a buzz saw to cut through wood - or through submarine cables. Cable companies learned the hard way, early on, that it likes to eat gutta-percha, and subsequent cables received a helical wrapping of copper tape to stop it.
[...] There is also the obvious threat of sabotage by a hostile government, but, surprisingly, this almost never happens. When cypherpunk Doug Barnes was researching his Caribbean project, he spent some time looking into this, because it was exactly the kind of threat he was worried about in the case of a data haven. Somewhat to his own surprise and relief, he concluded that it simply wasn't going to happen. "Cutting a submarine cable," Barnes says, "is like starting a nuclear war. It's easy to do, the results are devastating, and as soon as one country does it, all of the others will retaliate."
As the capacity of optical fibers climbs, so does the economic damage caused when the cable is severed. FLAG makes its money by selling capacity to long-distance carriers, who turn around and resell it to end users at rates that are increasingly determined by what the market will bear. If FLAG gets chopped, no calls get through. The carriers' phone calls get routed to FLAG's competitors (other cables or satellites), and FLAG loses the revenue represented by those calls until the cable is repaired. The amount of revenue it loses is a function of how many calls the cable is physically capable of carrying, how close to capacity the cable is running, and what prices the market will bear for calls on the broken cable segment. In other words, a break between Dubai and Bombay might cost FLAG more in revenue loss than a break between Korea and Japan if calls between Dubai and Bombay cost more.
The rule of thumb for calculating revenue loss works like this: for every penny per minute that the long distance market will bear on a particular route, the loss of revenue, should FLAG be severed on that route, is about $3,000 a minute. So if calls on that route are a dime a minute, the damage is $30,000 a minute, and if calls are a dollar a minute, the damage is almost a third of a million dollars for every minute the cable is down. Upcoming advances in fiber bandwidth may push this figure, for some cables, past the million-dollar-a-minute mark. [Link]
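The rule of thumb in the excerpt above is simple enough to check by hand, but a minimal sketch makes the scaling explicit. The function and constant names here are illustrative assumptions; only the $3,000-per-cent figure comes from the quoted article.

```python
# Rule of thumb from the quoted article: every cent per minute that the
# market will bear on a route corresponds to about $3,000 per minute of
# lost revenue while the cable is severed on that route.
LOSS_PER_CENT_PER_MINUTE = 3_000  # dollars; figure taken from the excerpt


def revenue_loss(price_cents_per_minute: float, outage_minutes: float = 1.0) -> float:
    """Estimated revenue loss in dollars for an outage on a single route."""
    return LOSS_PER_CENT_PER_MINUTE * price_cents_per_minute * outage_minutes


if __name__ == "__main__":
    # A penny, a dime, and a dollar a minute, as in the excerpt.
    for cents in (1, 10, 100):
        print(f"{cents:>3} cents/min -> ${revenue_loss(cents):,.0f} lost per minute")
```

At a dollar a minute this gives $300,000 per minute, which is the "almost a third of a million dollars" in the excerpt.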
tags: at&t, cloud, failure, failure happens, fiber, infrastructure, operations, outages, velocity, velocity09, web infrastructure, web operations, web2.0, webops, worries
Service Monitoring Dashboards are mandatory for production services!
by Jesse Robbins | @jesserobbins | comments: 6
Google App Engine went down earlier today. GAE is still a developer preview release, and currently lacks a public monitoring dashboard. Unfortunately this means that many people found out either from their app and/or admin consoles being unavailable or from Mike Arrington's post on TechCrunch.
Google has a strong Web Operations culture, and there are numerous internal monitoring tools in use across the company, along with a smaller set available to customers. It's surprising that Google launched a developer platform without providing something beyond an email group, although they are by no means the first to do so.

Service Monitoring Dashboards are mandatory for production services and platforms!
- If you launch a platform that people pay you money for, you need to have a real time service dashboard. Ideally this should be decoupled from the rest of your infrastructure.
- Don't rely on platforms that lack service monitoring dashboards for production.
Many companies are initially reluctant to provide this kind of monitoring to the public, and only do so in reaction to an outage. However, it seems that every company that offers such a dashboard uses it as a source of competitive advantage.
The best example of this is trust.salesforce.com, which they launched after a series of outages in 2006. Amazon (eventually) launched a status dashboard for AWS, and added RSS feeds for specific services, which I think is pretty cool.
Javier Soltero at Hyperic points out:
1. The reports of service outages arrive long after anyone who depends on the services can possibly do anything to mitigate their effect.
2. The services themselves seem incapable of providing any visibility into the circumstances that might lead to future outages.
[...] Even TechCrunch points out that the Google Apps blog doesn’t even mention the outage. Other clouds rely on blogs such as this one, this one, or maybe even this one (from our good friends at Mosso). These are all places where outages can be discussed, but not the right means for people to find out whether it [is] their application that crashed, or the cloud that it depends on.
(Updated: Niall Kennedy pointed out that GAE is still a preview release, and I agree that my original wording was wrong. My intent is to emphasize the importance of providing a public service dashboard and so I've edited accordingly.)
tags: failure happens, google app engine, infrastructure, internet policy, monitoring, operations, outages, platform plays, platforms, saas, velocity, web 2.0, web services, webops
You Become what You Disrupt - (part two)
by Jesse Robbins | @jesserobbins | comments: 10
Google's GrandCentral (Radar coverage) was down over the weekend, resulting in missed calls and other phone problems for its users.
This is very similar to the two-day Skype outage last year, where I said that "You Become what You Disrupt". I've spoken about this issue several times, most recently at the Princeton CITP "Computing in the Cloud" workshop.
The problem is that it's not particularly clear at what point a disruptive innovation becomes a utility. As innovators it's important that we recognize that this point will arrive and prepare for it. I believe that we have a responsibility to be good stewards of the technologies we create, and to take responsibility for protecting people who come to rely on those technologies to live their daily lives. When we fail to do that, we may find ourselves being cast as either fools or villains who must be regulated and controlled.
Ultimately, I think we will evolve a set of safety standards very similar to building codes. For instance, it appears that a multi-datacenter strategy would have prevented the GrandCentral outage. (As I've said many times before: Datacenters are a Single Point of Failure!)
Cofounder Craig Walker writes: "I wanted to write a quick note to all the GC users and apologize for the service interruption this morning. We had a power issue at our current colo facility and it knocked us off line for a few hours. Unfortunately I’ve been up in the mountains with the family this weekend and had no cell/internet coverage so couldn’t respond earlier. I did want to let you know that we were able to restore the service by noon today and are working extremely diligently to make sure this won’t occur in the future. We’ll do a better job keeping you informed in the future, not only about service related issues but also about upcoming features, soliciting your feedback, and generally making sure that you, the GC user, is well informed as to what’s going on with the service."
Will better industry standards, best practices, and independent certifying authorities emerge for these new utilities without innovation-stifling regulation? I hope so.
Amazon improves EC2 (by embracing failure)
by Jesse Robbins | @jesserobbins | comments: 5
Amazon just announced two big improvements to EC2:
- Multiple Locations
Amazon EC2 now provides the ability to place instances in multiple locations. Amazon EC2 locations are composed of regions and Availability Zones. Regions are geographically dispersed and will be in separate geographic areas or countries. Currently, Amazon EC2 exposes only a single region. Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same region. Regions consist of one or more Availability Zones. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location.
- Elastic IP Addresses
Elastic IP addresses are static IP addresses designed for dynamic cloud computing. An Elastic IP address is associated with your account not a particular instance, and you control that address until you choose to explicitly release it. Unlike traditional static IP addresses, however, Elastic IP addresses allow you to mask instance or Availability Zone failures by programmatically remapping your public IP addresses to any instance in your account. Rather than waiting on a data technician to reconfigure or replace your host, or waiting for DNS to propagate to all of your customers, Amazon EC2 enables you to engineer around problems with your instance or software by quickly remapping your Elastic IP address to a replacement instance.
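The two features work together: instances are spread across Availability Zones, and the public address is moved when one of them fails. The sketch below models that failover in memory. It is a toy under stated assumptions, not the EC2 API: the `Account` and `Instance` classes, the `fail_over` method, and the zone names are all hypothetical, and the address comes from a documentation-only range.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    instance_id: str
    zone: str
    healthy: bool = True


@dataclass
class Account:
    """Toy in-memory model of an account; hypothetical, not the EC2 API."""
    instances: dict = field(default_factory=dict)    # instance_id -> Instance
    address_map: dict = field(default_factory=dict)  # elastic IP -> instance_id

    def add_instance(self, inst: Instance) -> None:
        self.instances[inst.instance_id] = inst

    def associate(self, ip: str, instance_id: str) -> None:
        # The address belongs to the account, so it can point at any instance.
        if instance_id not in self.instances:
            raise KeyError(instance_id)
        self.address_map[ip] = instance_id

    def fail_over(self, ip: str) -> str:
        """Remap ip to a healthy instance, preferring a different zone."""
        current = self.instances[self.address_map[ip]]
        if current.healthy:
            return current.instance_id
        candidates = [i for i in self.instances.values() if i.healthy]
        if not candidates:
            raise RuntimeError("no healthy instance available")
        # Stable sort: instances outside the failed zone (key False) come first.
        candidates.sort(key=lambda i: i.zone == current.zone)
        self.associate(ip, candidates[0].instance_id)
        return candidates[0].instance_id


if __name__ == "__main__":
    account = Account()
    account.add_instance(Instance("web-1", "zone-a"))
    account.add_instance(Instance("web-2", "zone-b"))
    account.associate("192.0.2.10", "web-1")
    account.instances["web-1"].healthy = False  # simulate a failure
    print(account.fail_over("192.0.2.10"))
```

Preferring a replacement in another zone is a design choice made here for illustration: if the whole zone is failing, a same-zone replacement is likely to fail too.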
Datacenters and geographic regions are Single Points of Failure (SPOF) too. Failure Happens, and it's far better (and cheaper) to build services that are resilient to failures than to try to prevent them from happening. This is a big step in the right direction.
Update: RightScale posted an excellent overview of how this works.
tags: amazon, aws, ec2, failure happens, infrastructure, internet policy, mysql conference, operations, platform plays, velocity08