Entries tagged with “sla” from O'Reilly Radar

Fri, Oct 24, 2008

Jesse Robbins

Amazon's new EC2 SLA

by Jesse Robbins (@jesserobbins) | comments: 7

Amazon announced a new SLA for EC2, similar to the one for S3. This is a notable step for Amazon and cloud computing as a whole, as it establishes a new bar for utility computing services.

Amazon is committing to 99.95% availability for the EC2 service on a yearly basis, which corresponds to approximately four hours and twenty-three minutes of downtime per year. It's important to remember that an SLA is just a contract that provides a commitment to a certain level of performance and some form of compensation when a provider fails to meet it.
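As a rough check on that figure, here is a minimal sketch of the arithmetic. It assumes a 365-day Service Year; the function name `allowed_downtime` is only illustrative.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float,
                     period: timedelta = timedelta(days=365)) -> timedelta:
    """Return the downtime budget implied by an availability percentage."""
    unavailable_fraction = (100.0 - availability_pct) / 100.0
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


budget = allowed_downtime(99.95)
# Round to whole minutes before splitting into hours and minutes.
total_minutes = round(budget.total_seconds() / 60)
hours, minutes = divmod(total_minutes, 60)
print(f"{hours} h {minutes} min")  # 4 h 23 min (0.05% of 8,760 hours)
```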

Here's the summary of the EC2 SLA (emphasis added):
Service Commitment: AWS will use commercially reasonable efforts to make Amazon EC2 available with an Annual Uptime Percentage (defined below) of at least 99.95% during the Service Year. In the event Amazon EC2 does not meet the Annual Uptime Percentage commitment, you will be eligible to receive a Service Credit as described below. [...]
  • “Annual Uptime Percentage” is calculated by subtracting from 100% the percentage of 5 minute periods during the Service Year in which Amazon EC2 was in the state of “Region Unavailable.” If you have been using Amazon EC2 for less than 365 days, your Service Year is still the preceding 365 days but any days prior to your use of the service will be deemed to have had 100% Region Availability [...]
  • “Unavailable” means that all of your running instances have no external connectivity during a five minute period and you are unable to launch replacement instances. [...]
To receive a Service Credit, you must submit a request by sending an e-mail message to aws-sla-request @ amazon.com. To be eligible, the credit request must [...] include your server request logs that document the errors and corroborate your claimed outage (any confidential or sensitive information in these logs should be removed or replaced with asterisks)
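The "Annual Uptime Percentage" definition quoted above can be sketched as a small calculation. This is only my reading of the summary, assuming a 365-day Service Year divided into five-minute periods; the function names are hypothetical and this is not Amazon's actual measurement code.

```python
PERIOD_MINUTES = 5
# Number of five-minute periods in a 365-day Service Year.
PERIODS_PER_YEAR = 365 * 24 * 60 // PERIOD_MINUTES  # 105,120


def annual_uptime_percentage(unavailable_periods: int) -> float:
    """100% minus the percentage of five-minute periods marked unavailable."""
    if not 0 <= unavailable_periods <= PERIODS_PER_YEAR:
        raise ValueError("unavailable_periods must fit within the Service Year")
    return 100.0 - 100.0 * unavailable_periods / PERIODS_PER_YEAR


def meets_commitment(unavailable_periods: int,
                     commitment_pct: float = 99.95) -> bool:
    """True if the uptime percentage is at least the committed level."""
    return annual_uptime_percentage(unavailable_periods) >= commitment_pct


# 52 bad periods (4 h 20 min) stays within the commitment; 53 does not.
print(meets_commitment(52), meets_commitment(53))
```

Under these assumptions the commitment tolerates at most 52 unavailable periods in a year, since 0.05% of 105,120 periods is 52.56.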

This new SLA does not appear to address the reliability of server instances individually or in aggregate. For example, if half of a customer's EC2 instances lose their connections or die every 6 minutes, EC2 would still be considered "available" even if it is essentially unusable.
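To make that gap concrete, here is a hedged sketch of the quoted "Unavailable" definition as a predicate over a single five-minute period. The `PeriodObservation` type and its fields are hypothetical, and treating a customer with zero running instances as "not unavailable" is my assumption, since the summary does not say.

```python
from dataclasses import dataclass


@dataclass
class PeriodObservation:
    """What one customer saw during one five-minute period (hypothetical)."""
    instances_running: int
    instances_without_connectivity: int
    can_launch_replacements: bool


def counts_as_unavailable(obs: PeriodObservation) -> bool:
    """True only if ALL running instances lost connectivity AND no
    replacement instances could be launched, per the quoted definition."""
    all_disconnected = (
        obs.instances_running > 0
        and obs.instances_without_connectivity == obs.instances_running
    )
    return all_disconnected and not obs.can_launch_replacements


# Half the fleet is down, yet the period does not count against the SLA.
half_down = PeriodObservation(10, 5, can_launch_replacements=True)
# Everything is down and nothing can be launched: this one counts.
full_outage = PeriodObservation(10, 10, can_launch_replacements=False)
print(counts_as_unavailable(half_down), counts_as_unavailable(full_outage))
```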

If the entire EC2 service is down for more than a cumulative four hours and twenty-three minutes, customers must furnish proof of the outage to Amazon to be eligible for the 10% credit. This seems like an onerous process for very little compensation, and isn't in line with Amazon's famous "Relentless Customer Obsession". Amazon takes monitoring very seriously and should take the lead by tracking, reporting, and proactively compensating customers when it lets them down.

tags: amazon, availability, cloud computing, ec2, operations, s3, sla, webops

Sun, Apr 13, 2008

Jesse Robbins

You Become what You Disrupt - (part two)

by Jesse Robbins (@jesserobbins) | comments: 10

Google's GrandCentral (Radar coverage) was down over the weekend, resulting in missed calls and other phone problems for its users.

This is very similar to the two-day Skype outage last year, when I said that "You Become what You Disrupt". I've spoken about this issue several times, most recently at the Princeton CITP "Computing in the Cloud" workshop.

The problem is that it's not particularly clear at what point a disruptive innovation becomes a utility. As innovators it's important that we recognize that this point will arrive and prepare for it. I believe that we have a responsibility to be good stewards of the technologies we create, and to take responsibility for protecting people who come to rely on those technologies to live their daily lives. When we fail to do that, we may find ourselves being cast as either fools or villains who must be regulated and controlled.

Ultimately, I think we will evolve a set of safety standards very similar to building codes. For instance, it appears that a multi-datacenter strategy would have prevented the GrandCentral outage. (As I've said many times before: Datacenters are a Single Point of Failure!)

Cofounder Craig Walker writes: "I wanted to write a quick note to all the GC users and apologize for the service interruption this morning. We had a power issue at our current colo facility and it knocked us off line for a few hours. Unfortunately I’ve been up in the mountains with the family this weekend and had no cell/internet coverage so couldn’t respond earlier. I did want to let you know that we were able to restore the service by noon today and are working extremely diligently to make sure this won’t occur in the future. We’ll do a better job keeping you informed in the future, not only about service related issues but also about upcoming features, soliciting your feedback, and generally making sure that you, the GC user, is well informed as to what’s going on with the service."

Will better industry standards, best practices, and independent certifying authorities emerge for these new utilities without innovation-stifling regulation? I hope so.

(continue reading)

tags: building codes, emerging telephony, failure happens, failures, google, grandcentral, internet policy, news from the past, open source, operations, operations webops, skype, sla, thought provoking, videos, voip, web 2.0

Mon, Mar 10, 2008

Jesse Robbins

Paging systems and Conference Bridges for startups & small teams

by Jesse Robbins (@jesserobbins) | comments: 17

Early registration for the Velocity Web Performance & Operations Conference has opened. To help spread the word, I've written this "simplest thing that will work" hack for a common operations need: paging systems and conference bridges.

Step 1: Establish a team contact list with SMS email addresses

Use a Google Spreadsheet to create a team roster like this one. My recommendation is to let people enter and manage their own information. Most cell providers have an email-to-SMS gateway of some kind. In the US, these are:

  • ATT: phonenumber@txt.att.net
  • Nextel: phonenumber@messaging.nextel.com
  • Sprint: phonenumber@messaging.sprintpcs.com
  • T-Mobile: phonenumber@tmomail.net
  • Verizon: phonenumber@vtext.com
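If the roster stores a phone number and carrier for each person, the SMS address can be derived rather than typed by hand. The sketch below uses only the gateway domains listed above (which may have changed since this was written); the carrier keys, the `sms_address` helper, and the 10-digit US number check are my assumptions.

```python
# Gateway domains copied from the list above; keys are hypothetical labels.
SMS_GATEWAYS = {
    "att": "txt.att.net",
    "nextel": "messaging.nextel.com",
    "sprint": "messaging.sprintpcs.com",
    "tmobile": "tmomail.net",
    "verizon": "vtext.com",
}


def sms_address(phone_number: str, carrier: str) -> str:
    """Build an email-to-SMS address from a US phone number and carrier."""
    digits = "".join(ch for ch in phone_number if ch.isdigit())
    if len(digits) != 10:  # assumption: 10-digit US numbers, no country code
        raise ValueError(f"expected a 10-digit number, got {phone_number!r}")
    # Normalize labels such as "T-Mobile" or "AT&T" to the dictionary keys.
    key = "".join(ch for ch in carrier.lower() if ch.isalnum())
    if key not in SMS_GATEWAYS:
        raise ValueError(f"no known gateway for carrier {carrier!r}")
    return f"{digits}@{SMS_GATEWAYS[key]}"


print(sms_address("(555) 010-0123", "T-Mobile"))  # 5550100123@tmomail.net
```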

Step 2: Set up a notification email list

Set up an email alias and add people by email address and SMS gateway address. If you don't have a way of creating an alias, you can use a mailing list provider such as Google Groups.
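Once the alias exists, a page is just a short email sent to it. Here is a minimal sketch using Python's standard library. The addresses, the local SMTP relay, and the 160-character truncation (a common single-SMS length, though gateways differ) are all assumptions to adapt.

```python
import smtplib
from email.message import EmailMessage

PAGE_ALIAS = "oncall-page@example.com"  # hypothetical notification alias
SENDER = "monitoring@example.com"       # hypothetical sender address
SMS_LIMIT = 160                         # assumed per-message length limit


def build_page(summary: str) -> EmailMessage:
    """Build a short plain-text page addressed to the notification alias."""
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = PAGE_ALIAS
    msg["Subject"] = "PAGE"
    # Keep the body short so it is more likely to fit in one text message.
    msg.set_content(summary[:SMS_LIMIT])
    return msg


def send_page(msg: EmailMessage, smtp_host: str = "localhost") -> None:
    """Send the page through an SMTP relay (assumed to accept local mail)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)


page = build_page("db01 unreachable since 14:05, dial in on the Red Line")
print(page["To"], "|", page.get_content().strip())
```

Calling `send_page(page)` delivers the message to everyone on the alias, including their SMS gateway addresses.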

Step 3: Set up the Conference Bridges

I am really happy with FreeConferenceCall.com which, amazingly, provides free conference call bridges. I recommend setting up three different bridges, and naming them by color so you can refer to them as the "Red Line", "Blue Line", etc.

Step 4: Test your notification & conference bridges

Test your notification system to make sure people get the pages and can dial in and use the conference bridges as expected. I've found that it's easier just to give everybody the "host code" instead of having some people using the "participant code". Your mileage may vary. Once you have verified that people are getting pages and can dial into the conference bridges you should...

(continue reading)

tags: hacks, itil, itoperations, life hacks, mtbf, mttr, operations, sla, startups, velocity, velocity08, web 2.0, webops, webopshack

Tue, Oct 30, 2007

Jesse Robbins

WebOps Hack #1: Simple Availability Report for Busy Teams

by Jesse Robbins (@jesserobbins) | comments: 2

I created this spreadsheet for tracking availability and "days since last outage". Along with the availability and uptime calculations, it asks the following questions:

  • What broke?
  • Why?
  • What fixed it?
  • What did we learn?
  • How can we prevent recurrence?
  • Who owns follow-up?

I've found this to be the "simplest thing that could possibly work" for identifying problems and tracking issues before a formal incident tracking system is in place, or with vendors or other teams who you want to keep honest. Please let me know if it's helpful for you and how it might be improved. (Feel free to improve upon it yourself too -- it's Creative Commons Attribution Share Alike.)

Link to the Google doc is here. You need to "Copy to a new spreadsheet" to be able to use it.
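For teams that would rather script the same numbers, here is a rough sketch of the two calculations. It is not the spreadsheet's actual formulas: the `Outage` record mirrors the questions above with hypothetical field names, and outages are assumed not to overlap.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Outage:
    start: datetime
    end: datetime
    what_broke: str
    why: str = ""
    what_fixed_it: str = ""
    what_we_learned: str = ""
    prevention: str = ""
    follow_up_owner: str = ""


def availability_pct(outages: List[Outage],
                     window_start: datetime,
                     window_end: datetime) -> float:
    """Percentage of the window not covered by (non-overlapping) outages."""
    window = (window_end - window_start).total_seconds()
    down = sum(
        # Clip each outage to the reporting window before measuring it.
        (min(o.end, window_end) - max(o.start, window_start)).total_seconds()
        for o in outages
        if o.end > window_start and o.start < window_end
    )
    return 100.0 * (1 - down / window)


def days_since_last_outage(outages: List[Outage],
                           now: datetime) -> Optional[int]:
    """Whole days since the most recent outage ended, or None if none."""
    if not outages:
        return None
    return (now - max(o.end for o in outages)).days


log = [Outage(datetime(2007, 10, 12, 14, 0), datetime(2007, 10, 12, 16, 0),
              what_broke="database primary", follow_up_owner="ops")]
october = (datetime(2007, 10, 1), datetime(2007, 11, 1))
print(round(availability_pct(log, *october), 2))                # 99.73
print(days_since_last_outage(log, datetime(2007, 10, 30, 16)))  # 18
```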

tags: hacks, itil, itoperations, mtbf, mttr, operations, sla, velocity, velocity08, webops, webopshack