Entries tagged with “webops” from O'Reilly Radar
John Adams on Fixing Twitter: Improving the Performance and Scalability of the World's Most Popular Micro-blogging Site
by Jesse Robbins | @jesserobbins | comments: 2
Twitter is suffering outages today as they fend off a Denial of Service attack, and so I thought it would be helpful to post John Adams’ exceptional Velocity session about Operations at Twitter.
Good luck today John & team… I know it’s going to be a long day!
Update: Apparently Facebook & LiveJournal have had similar attacks today. Rich Miller from Data Center Knowledge reminds us that this is just the latest in a series of major attacks.
tags: attacks, critical infrastructure, infrastructure, operations, performance, security, twitter, velocity, velocity09, velocityconf, video, web2.0, webops
Velocity and the Bottom Line
by Steve Souders | comments: 3
Velocity 2009 took place last week in San Jose, with Jesse Robbins and me serving as co-chairs. Back in November 2008, while we were planning Velocity, I said I wanted to highlight "best practices in performance and operations that improve the user experience as well as the company's bottom line." Much of my work focuses on the how of improving performance - tips developers use to create even faster web sites. What's been missing is the why. Why is it important for companies to focus on performance?
That question was answered at Velocity last week by speakers from AOL, Google, Microsoft, and Shopzilla.
- Eric Schurman (Bing) and Jake Brutlag (Google Search) co-presented results from latency experiments conducted independently on each site. Bing found that a 2 second slowdown changed queries/user by -1.8% and revenue/user by -4.3%. Google Search found that a 400 millisecond delay resulted in a -0.59% change in searches/user. What's more, even after the delay was removed, these users still had 0.21% fewer searches, indicating that a slower user experience affects long term behavior. (video, slides)
- Dave Artz from AOL presented several performance suggestions. He concluded with statistics that show page views drop off as page load times increase. Users in the top decile of page load times view ~7.5 pages/visit. This drops to ~6 pages/visit in the 3rd decile, and bottoms out at ~5 pages/visit for users with the slowest page load times. (slides)
- Marissa Mayer shared several performance case studies from Google. One experiment increased the number of search results per page from 10 to 30, with a corresponding increase in page load times from 400 milliseconds to 900 milliseconds. This resulted in a 25% dropoff in first result page searches. Adding the checkout icon (a shopping cart) to search results made the page 2% slower with a corresponding 2% drop in searches/user. (Watch the video to see the clever workaround they found.) Image optimizations in Google Maps made the page 2-3x faster, with significant increase in user interaction with the site. (video, slides)
- Phil Dixon, from Shopzilla, had the most takeaway statistics about the impact of performance on the bottom line. A year-long performance redesign resulted in a 5 second speed up (from ~7 seconds to ~2 seconds). This resulted in a 25% increase in page views, a 7-12% increase in revenue, and a 50% reduction in hardware. This last point shows the win-win of performance improvements, increasing revenue while driving down operating costs. (video, slides)
These case studies provide real world numbers that show the benefits of making your site faster. Other Velocity sessions share techniques for implementing performance improvements, including sessions from me, Doug Crockford, and the Facebook and Google frontend teams. But what about the user experience? In his session, Matt Mullenweg (of WordPress fame) makes sure we remember the importance of how the user feels while interacting with our site:
That's why [performance] is important and why we should be obsessed and not be discouraged when it doesn't change the funnel. My theory here is when an interface is faster, you feel good. And ultimately what that comes down to is you feel in control. The web app isn't controlling me, I'm controlling it. Ultimately that feeling of control translates to happiness in everyone. In order to increase the happiness in the world, we all have to keep working on this.
Thanks to the Velocity speakers & their organizations for overcoming the many challenges required to present this data for the first time. We're now equipped with the financial justification, the technical know-how, and the visceral motivation to go out and make the Web a faster place. We'll have more performance success stories next year. Your company could be one of them! Capture your performance improvements and bottom line impact. We'd love to hear from you at Velocity 2010.
tags: operations, velocity09, velocityconf, web2.0, webops
Jonathan Heiliger on Web Performance, Operations, and Culture
by Jesse Robbins | @jesserobbins | comments: 0
We were honored to have Jonathan Heiliger, Facebook’s VP of Technology Operations, as our opening keynote speaker at Velocity. Jonathan is one of the most accomplished leaders in our field, and is a master of the craft.
Here is his keynote in its entirety:
Note: Other videos from Velocity are being posted to VelocityConference.blip.tv
tags: development, executive, facebook, jonathan heiliger, leadership, operations, performance, velocity, velocityconf, web2.0, webops
Announcing: Spike Night at Velocity
by Scott Ruthfield | @scottru | comments: 5
Guest blogger Scott Ruthfield is a Program Committee member of the O'Reilly Velocity: Web Performance & Operations Conference.
- Chris Bissell, Chief Software Architect at MySpace, and members of the MySpace team will demonstrate a massive, real increase in traffic, and will manage it on-stage. MySpace already deals with tens of thousands of hits each second - we can't throw enough traffic at them to cause any harm - so they'll cause their own harm and then show how they work through it.
- Ryan Nelson, Operations Director for MLB Advanced Media and MLB.com, will walk us through a combination of war stories and live traffic management to show what happens when millions of baseball fans all want to see what's happened after the commercial break at the exact same time. Between their very popular desktop apps and their newly-announced iPhone game streaming, the MLB is a true leader in technology innovation with a rabid fan base that goes well beyond the Web 2.0 echo chamber.
tags: cloud, infrastructure, operations, performance, scalability, scale, spikenight, velocity, velocity09, velocityconf, web2.0, webops
Ignite! comes to San Jose June 22nd - Submit your talks now!
by Jesse Robbins | @jesserobbins | comments: 0
Ignite! is coming to San Jose on Monday June 22, 2009 at 8:00 pm, attached to the Velocity Conference. Admission is free, open to all, and there will be a cash bar.
The deadline for talks is May 11th, so submit your talks now!
As with all Ignites each speaker will only get 20 slides that each auto-advance every 15 seconds for a total of five minutes. We'll be looking for fun geek topics like hacks, how-to's, and insights. (Talks don't have to be Velocity-related!) If you're not sure what an Ignite talk looks like check out the Ignite Show.
tags: events, ignite, operations, san jose, velocity, velocityconf, web2.0, webops
Velocity Preview - The Greatest Good for the Greatest Number at Microsoft
by James Turner | comments: 4
You may also download this file. Running time: 00:20:26
Subscribe to this podcast series via iTunes. Or, visit the O'Reilly Media area at iTunes to find other podcasts from O'Reilly.
The psychology of engineering user experiences on the web can be difficult. How much rich content can you place up on a page before the load time drives away your visitors? Get the answer wrong, and you can end up with a ghost town; get it right and you're a star. Eric Schurman knows this well, since he is responsible for just those kind of trade-off decisions on some of Microsoft's highest traffic pages. He'll be speaking at O'Reilly's Velocity Conference in June, and he recently talked with us about how Microsoft tests different user experiences on small groups of visitors.
James Turner: Why don't you start by describing what your gig at Microsoft is now and what your career path has been there?
Eric Schurman: I'm a principal dev lead for Live Search, what used to be MSN Search. And I started at Microsoft back in the late 90s working in Microsoft's Press organization, where we actually were developing training software that would emulate new Microsoft products, but didn't require those products to be on a user's machine. So, for example, if you had an organization that was running Windows 95, we would have a training system for Windows 98 that would emulate a bunch of the functionality of Windows 98 so that you could deploy it to your people. They could train their people on how to use Windows 98 before they actually deployed it.
I then moved on to the Microsoft Press website, where I became the dev lead for it. I made a few other moves and ended up going to Microsoft.com, where I ran the download center, the Microsoft.com homepage, the product catalog, and a bunch of other places from a dev perspective.
I then moved to what was then MSN Search, back in about 2005, and was there through the MSN to Live transition. At the time, I wasn't working on performance; I was just working on the Live Search application. And it became very obvious that we had some major performance problems. Performance has always been one of my really strong interests, so I took on addressing a lot of those. And when we addressed them, we had very significant improvements in our business metrics. That really surfaced how important performance was to the organization, and I moved into a role where I was really focusing just on performance. I've been in that role now for about two years.
JT: You've worked on at least three very different parts of the Microsoft website. The homepage has lots of hits, fairly static. The download page is a lot of data for long periods of time. Live Search is high volume, but there's also a lot of backend on that. In what ways do you need to architect them differently? And where can you reuse the same lessons?
ES: That's a great question. On the web, you've got different concerns from what you have for client apps. The main things that tend to impact end-user perceived performance on the web are often things about how you've designed your application from a network perspective. So how many different HTTP GET requests are you making? How are those GET requests structured? So, for example, are they serialized? Did you have a JavaScript file that then gets returned to the browser that requests another JavaScript file and another JavaScript file and then some content and then it finally gets rendered? So the number of assets that you request, that's going to be something that's important no matter what product you're doing.
There are other things, like how much script do you have on the page, how much CSS you have on the page, how much actual content you're rendering to the page, etcetera. There are tricks that you can use like combining many different graphics into a single tiled image and sending that down to the browser. It's much faster to send one image to the browser than, say, 20 images. Even if you end up sending the same overall graphics, but combined into one, it's still much faster to send it as one request.
There are also different data volume concerns. They're also different from a business perspective. A lot of what we were sending out from the download center was extremely time critical. We would have an update go out, and we needed to make sure that update was going to be available anywhere in the world within a certain time frame, which required us to handle very high bandwidth, and a very high volume of requests coming into the site that were transferring lots of bits. So that required something totally different than something like the Microsoft.com homepage.
It's also interesting looking at the volume of traffic and how that traffic reflects real users. So, for example, one of the problems that you end up with on both the Microsoft homepage and Live Search is that we have a huge number of bots that are trying to hit the system, lots of people trying to do SEO work are trying to hit search engines to gather information about their site, about competitor sites, about all sorts of things. On the Microsoft.com homepage, it's always under distributed denial of service attacks. It's not a question of how frequently does it happen; it's just what is the rate right now? Also, the Microsoft.com homepage has historically had such a high up-time rate that it's actually hit by a lot of hardware devices simply to check for connectivity to the internet. And so you'd want to treat a request from that kind of "user" very differently from a request that's coming from a real user.
So that's kind of a long, rambling answer to your question. Do you have any areas that you want me to drill in or maybe talk about something else?
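Schurman's point about request counts can be illustrated with a toy model. The sketch below is written in Python and assumes that every request pays a fixed round-trip overhead, that the browser opens a small number of parallel connections, and that bandwidth is constant. The function name, the default latency and bandwidth figures, and the 40 KB payload are illustrative assumptions, not numbers from the interview.

```python
# Toy model of page asset loading. Every figure here is an illustrative
# assumption, not a measurement from Microsoft or any real site.

def load_time_ms(num_requests, total_bytes, rtt_ms=100.0,
                 bandwidth_bytes_per_ms=125.0, parallel=2):
    """Estimate the time to fetch total_bytes split across num_requests."""
    # Requests go out in waves of `parallel` connections;
    # each wave is assumed to cost one network round trip.
    waves = -(-num_requests // parallel)  # ceiling division
    overhead_ms = waves * rtt_ms
    # Transfer time depends only on the total payload, not on how it is split.
    transfer_ms = total_bytes / bandwidth_bytes_per_ms
    return overhead_ms + transfer_ms

# The same 40 KB of graphics, sent as 20 separate images or one tiled image.
separate_images_ms = load_time_ms(num_requests=20, total_bytes=40_000)
tiled_image_ms = load_time_ms(num_requests=1, total_bytes=40_000)

print(f"20 images: {separate_images_ms:.0f} ms, one tiled image: {tiled_image_ms:.0f} ms")
```

Under these assumptions the payload transfer time is identical in both cases; the whole difference comes from the per-request round trips, which is why combining assets helps even when the total bytes stay the same.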
tags: interviews, microsoft, operations, velocity09, velocityconf, web2.0, webops
Velocity Preview - Keeping Twitter Tweeting
by James Turner | comments: 3
You may also download this file. Running time: 00:10:46
Subscribe to this podcast series via iTunes. Or, visit the O'Reilly Media area at iTunes to find other podcasts from O'Reilly.
If there's a site that exemplifies explosive growth, it has to be Twitter. It seems like everywhere you look, someone is Tweeting, or talking about Tweeting, or Tweeting about Tweeting. Keeping the site responsive under that type of increase is no easy job, but it's one that John Adams has to deal with every day, working in Twitter Operations. He'll be talking about that work at O'Reilly's Velocity Conference, in a session entitled Fixing Twitter: Improving the Performance and Scalability of the World's Most Popular Micro-blogging Site, and he spent some time with us to talk about what is involved in keeping the site alive.
James Turner: Can you start by describing the platforms and technologies that make Twitter run today?
John Adams: Twitter currently runs on Ruby on Rails. And we also use a combination of Java and Scala, and a number of homegrown scripts that run the site. We also use a lot of open-source tools like Apache, MySQL, memcached.
JT: What type of hardware are you running on?
JA: It's all Linux, so a lot of x86 hardware. I can't tell you the brands or how many.
JT: Do you make any kind of attempt to stay homogeneous in that?
JA: Yes, we do. All of our hardware is very consistent. It makes deployment of new software very easy. And we also use a number of configuration management tools like Puppet to deliver software to those machines.
JT: As anyone can see, Twitter has had a pretty explosive growth, especially recently. Were you prepared for this kind of ramp up?
JA: I don't think so. I mean we're growing week over week in enormous numbers. And we spend a lot of time calculating the growth and scalability of the site to make sure that we can handle the upcoming load.
JT: I mean obviously there are events like Oprah decides she's going to Tweet that are going to be spikes. Do you try to get warning of that stuff?
JA: Yeah. And frequently we know of major events happening. Major events are very predictable like Macworld, even any massive amount of media interaction, we have some fair warning beforehand.
tags: interviews, operations, twitter, velocity, velocity09, velocityconf, web2.0, webops
AT&T Fiber cuts remind us: Location is a Basket too!
by Jesse Robbins | @jesserobbins | comments: 3
The fiber cuts affecting much of the San Francisco Bay Area this week are similar to the outages in the Middle East last year (radar post), although far more limited in scope and impact. What I said last year still holds true and is repeated below:
From an operations perspective these kinds of outages are nothing new, and underscore why having "many eggs in few baskets" is such a problem. I believe we will see similar incidents when we have the first multi-datacenter failures where multiple providers lose significant parts of their infrastructure in a single geographic area.
Remember: Don't put all your eggs in one basket... and Location is a basket too!
It's also worth mentioning the outages to multiple service providers hosted in a single colocation facility when the FBI seized all the equipment in the facility, the big outage at 365 Main from two years ago, and many others (see: Radar posts & comprehensive coverage at Data Center Knowledge).
To really understand the issue, I recommend Neal Stephenson's incredible (and lengthy) Wired article from 1996 entitled "Mother Earth Mother Board":
[...] It sometimes seems as though every force of nature, every flaw in the human character, and every biological organism on the planet is engaged in a competition to see which can sever the most cables. The Museum of Submarine Telegraphy in Porthcurno, England, has a display of wrecked cables bracketed to a slab of wood. Each is labeled with its cause of failure, some of which sound dramatic, some cryptic, some both: trawler maul, spewed core, intermittent disconnection, strained core, teredo worms, crab's nest, perished core, fish bite, even "spliced by Italians." The teredo worm is like a science fiction creature, a bivalve with a rasp-edged shell that it uses like a buzz saw to cut through wood - or through submarine cables. Cable companies learned the hard way, early on, that it likes to eat gutta-percha, and subsequent cables received a helical wrapping of copper tape to stop it.
[...] There is also the obvious threat of sabotage by a hostile government, but, surprisingly, this almost never happens. When cypherpunk Doug Barnes was researching his Caribbean project, he spent some time looking into this, because it was exactly the kind of threat he was worried about in the case of a data haven. Somewhat to his own surprise and relief, he concluded that it simply wasn't going to happen. "Cutting a submarine cable," Barnes says, "is like starting a nuclear war. It's easy to do, the results are devastating, and as soon as one country does it, all of the others will retaliate."
As the capacity of optical fibers climbs, so does the economic damage caused when the cable is severed. FLAG makes its money by selling capacity to long-distance carriers, who turn around and resell it to end users at rates that are increasingly determined by what the market will bear. If FLAG gets chopped, no calls get through. The carriers' phone calls get routed to FLAG's competitors (other cables or satellites), and FLAG loses the revenue represented by those calls until the cable is repaired. The amount of revenue it loses is a function of how many calls the cable is physically capable of carrying, how close to capacity the cable is running, and what prices the market will bear for calls on the broken cable segment. In other words, a break between Dubai and Bombay might cost FLAG more in revenue loss than a break between Korea and Japan if calls between Dubai and Bombay cost more.
The rule of thumb for calculating revenue loss works like this: for every penny per minute that the long distance market will bear on a particular route, the loss of revenue, should FLAG be severed on that route, is about $3,000 a minute. So if calls on that route are a dime a minute, the damage is $30,000 a minute, and if calls are a dollar a minute, the damage is almost a third of a million dollars for every minute the cable is down. Upcoming advances in fiber bandwidth may push this figure, for some cables, past the million-dollar-a-minute mark. [Link]
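The rule of thumb quoted above reduces to a single multiplication. Here is a minimal sketch in Python; the constant and function names are illustrative, and the only figure taken from the article is the $3,000 per minute per penny.

```python
# Rule of thumb from the quoted article: for every cent per minute the
# long-distance market will bear on a route, a cable cut on that route
# costs about $3,000 per minute in lost revenue.
LOSS_DOLLARS_PER_MINUTE_PER_CENT = 3_000

def revenue_loss_per_minute(call_price_cents_per_minute):
    """Estimated revenue lost (in dollars) for each minute the cable is down."""
    return call_price_cents_per_minute * LOSS_DOLLARS_PER_MINUTE_PER_CENT

def outage_cost(call_price_cents_per_minute, outage_minutes):
    """Estimated total revenue lost (in dollars) over an outage of the given length."""
    return revenue_loss_per_minute(call_price_cents_per_minute) * outage_minutes

print(revenue_loss_per_minute(10))   # calls at a dime a minute
print(revenue_loss_per_minute(100))  # calls at a dollar a minute
```

The two calls reproduce the article's examples: $30,000 a minute at a dime a minute, and $300,000 a minute (the "almost a third of a million dollars") at a dollar a minute.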
tags: at&t, cloud, failure, failure happens, fiber, infrastructure, operations, outages, velocity, velocity09, web infrastructure, web operations, web2.0, webops, worries
Understanding Web Operations Culture - the Graph & Data Obsession
by Jesse Robbins | @jesserobbins | comments: 8
We’re quite addicted to data pr0n here at Flickr. We’ve got graphs for pretty much everything, and add graphs all of the time.
-John Allspaw, Operations Engineering Manager at Flickr & author of The Art of Capacity Planning
One of the most interesting parts of running a large website is watching the effects of unrelated events affecting user traffic in aggregate. Web traffic is something that companies typically keep very secret, and often the only time engineers can talk about it is late at night, at a bar, and very much off the record.
There are many good reasons for keeping this kind of information confidential, particularly for publicly traded companies with complicated disclosure requirements. There are also downsides, the biggest being that it is difficult for peers to learn from each other and compare notes.
John Allspaw recently created a WebOps Visualizations group on Flickr for sharing these kinds of graphs with the confidential information removed. Here’s an example of a traffic drop seen both by Flickr & by Last.FM that coincided with President Obama’s inauguration.
Similar traffic drop on Last.FM seen on the right
Google saw a similar drop as well
Was it because everybody went to Twitter?
Besides being an interesting story, sharing these kinds of graphs helps people build better monitoring tools and processes. As just one example: How should the WebOps team respond to this dip in traffic? Is it an outage? The inauguration was a very well known event and so it’s easy to explain the drop in traffic… what happens when a similar drop in traffic occurs? Should the WebOps team be looking at CNN (or trends in Twitter) along with everything else?
How do you tell when that unexpected 10% drop in traffic is really just people with something more important to do than browse your site?
(Note: Updated since original posting to add Google & Twitter graphs and annotations, and to switch the Last.FM graphic with an annotated one after I got permission.)
tags: big data, culture, enterprise 2.0, flickr, infovis, john allspaw, last.fm, metrics, monitoring, operations, velocity, velocity09, web2.0, webops
Velocity 2009: Themes, ideas, and call for participation...
by Jesse Robbins | @jesserobbins | comments: 0
Last year's Velocity conference was an incredible success. We expected around 400 people and we ended up maxing out the facility with over 600. This year we're moving the conference to a bigger space and extending it to 3 days to accommodate workshops and longer sessions.
Velocity 2009 will be on June 22-24th, 2009 at the Fairmont Hotel in San Jose, CA.
This year's conference will be especially important. I've said many times that Web Performance and Operations is critical to the success of every company that depends on the web. In the current economic situation, it's becoming a matter of survival. The competitive advantage comes from the ability to do two things:
Our Velocity 2009 mantra is "Fast, Scalable, Efficient, Available", a slight change from last year. (We've replaced "Resilient" with "Efficient" to make the focus clear.)
I'm excited to announce that joining Steve Souders & me on this year's program committee are John Allspaw, Artur Bergman, Scott Ruthfield, Eric Schurman, and Mandi Walls. We've already started working on the program, and have just opened the Call for Participation.
tags: artur bergman, conferences, Eric Schurman, John Allspaw, mandi walls, operations, performance, scott ruthfield, steve souders, velocity, velocity09, web2.0, webops
DisasterTech: "Decisions for Heroes"
by Jesse Robbins | @jesserobbins | comments: 2
One of the most interesting DisasterTech projects I've been following is "Decisions for Heroes" led by developer and Irish Coast Guard volunteer Robin Blandford.
Decisions is like Basecamp for volunteer Search & Rescue teams. The focus is on providing "just enough" process to complement the real-world workflow of a rescue team, without unnecessary complexity. One of Robin's design goals is that:
User requirements are nil. Nobody likes reading manuals - if we have to write one, we've gotten too complicated.
This is the winning approach for building systems that "serve those that serve others", and is echoed by InSTEDD's design philosophy and the Sahana disaster management system.
Teams begin by entering their responses to incidents and training exercises. They then tag them with things like the weather conditions, the tools and skills required, and who from the team was deployed.
As a team's incident database grows this information can be used to show heatmaps, and provide powerful insight on the locations, weather conditions, and times of year that various incidents occur. Over time this kind of data could be analyzed in aggregate across multiple teams and regions and create an incredibly powerful resource for Emergency Managers. This is very similar to what Wesabe does for consumers with financial transaction data today (disclosure: OATV investment).
Rescue team members enter training dates and levels. The system tracks certification expiration dates and prompts team members & leaders to plan classes and remain current. This is a huge issue for volunteers who have to manage professional-level training requirements with the demands of a regular career.
As more incidents are entered into the system, it compares the skills required for each of the rescues with the team training exercises. This allows teams to identify areas to focus, train, and develop new skills.

tags: disaster tech, disastertech, emergency management, firefighting, humanitarian aid, ict, innovation, operations, rescue, social networking, web 2.0, webops
Sprint blocking Cogent network traffic...
by Jesse Robbins | @jesserobbins | comments: 3
It appears that Sprint has stopped routing traffic (called "depeering") from Cogent as a result of some sort of legal dispute. Sprint customers cannot reach Cogent customers, and vice versa. The effect is similar to what would happen if Sprint were to block voice phonecalls to AT&T customers.
Here's a graph that shows the outage, courtesy of Keynote:

Rich Miller at DataCenterKnowledge has a great summary of the issues behind the incident, which has happened with Cogent before. Rich says:
At the heart of it, peering disputes are really loud business negotiations, and angry customers can be used as leverage by either side. This one will end as they always do, with one side agreeing to pay up or manage their traffic differently.
I think this is particularly Radar-worthy because it provides an example of the complex issues around Net Neutrality. In this case customers are harmed and most (especially Sprint wireless customers) will have no immediate recourse.
tags: cloud computing, cogent, disruption, innovation, internet policy, network neutrality, operations, sprint, utilities, utility computing, webops
Amazon's new EC2 SLA
by Jesse Robbins | @jesserobbins | comments: 7
Amazon announced a new SLA for EC2, similar to the one for S3. This is a notable step for Amazon and cloud computing as a whole, as it establishes a new bar for utility computing services.
Amazon is committing to 99.95% availability for the EC2 service on a yearly basis, which corresponds to approximately four hours and twenty three minutes of downtime per year. It's important to remember that an SLA is just a contract that provides a commitment to a certain level of performance and some form of compensation when a provider fails to meet it.
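The downtime figure follows directly from the availability percentage. As a quick check, here is a minimal sketch in Python; the function name is illustrative, and it assumes a 365-day year to match the SLA's Service Year.

```python
def allowed_downtime_minutes(availability_percent, days=365):
    """Minutes of downtime per year permitted by an availability commitment."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100

minutes = allowed_downtime_minutes(99.95)
hours, remainder = divmod(minutes, 60)

# 0.05% of a 365-day year is about 262.8 minutes,
# i.e. roughly four hours and twenty three minutes.
print(f"{minutes:.1f} minutes = {int(hours)} h {remainder:.0f} min")
```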
Here's the summary of the EC2 SLA (emphasis added):
Service Commitment
AWS will use commercially reasonable efforts to make Amazon EC2 available with an Annual Uptime Percentage (defined below) of at least 99.95% during the Service Year. In the event Amazon EC2 does not meet the Annual Uptime Percentage commitment, you will be eligible to receive a Service Credit as described below. [...]
To receive a Service Credit, you must submit a request by sending an e-mail message to aws-sla-request @ amazon.com. To be eligible, the credit request must [...] include your server request logs that document the errors and corroborate your claimed outage (any confidential or sensitive information in these logs should be removed or replaced with asterisks)
- “Annual Uptime Percentage” is calculated by subtracting from 100% the percentage of 5 minute periods during the Service Year in which Amazon EC2 was in the state of “Region Unavailable.” If you have been using Amazon EC2 for less than 365 days, your Service Year is still the preceding 365 days but any days prior to your use of the service will be deemed to have had 100% Region Availability [...]
- “Unavailable” means that all of your running instances have no external connectivity during a five minute period and you are unable to launch replacement instances. [...]
This new SLA does not appear to address the reliability of server instances individually or in aggregate. For example, if half of a customer's EC2 instances lose their connections or die every 6 minutes, EC2 would still be considered "available" even if it is essentially unusable.
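The definitions quoted above can be sketched in a few lines of Python to show why this scenario does not count against the SLA. This is a simplified reading of the quoted text, not Amazon's actual measurement method. The function names are hypothetical, and each instance is reduced to a single flag per five-minute period (True if it had external connectivity at any point during that period).

```python
# Five-minute periods in a 365-day Service Year.
PERIODS_PER_YEAR = 365 * 24 * 12

def period_is_unavailable(instances_connected, can_launch_replacements):
    """Simplified reading of "Unavailable": every running instance lacked
    external connectivity for the period AND no replacement could be launched."""
    return not any(instances_connected) and not can_launch_replacements

def annual_uptime_percentage(unavailable_periods, total_periods=PERIODS_PER_YEAR):
    """100% minus the percentage of five-minute periods that were unavailable."""
    return 100.0 - 100.0 * unavailable_periods / total_periods

# Half of a customer's instances are down, but replacements can be launched:
# under this reading the period still counts as available.
half_down = period_is_unavailable([True, False, True, False],
                                  can_launch_replacements=True)
print(half_down)

# A year made up entirely of such periods still scores 100% uptime.
print(annual_uptime_percentage(unavailable_periods=0))
```

Under this reading a period only counts when everything is down at once, so partial or flapping failures never reduce the Annual Uptime Percentage.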
If the entire EC2 service is down for more than a cumulative four hours and twenty three minutes, customers must furnish proof of the outage to Amazon to be eligible for the 10% credit. This seems like an onerous process for very little compensation, and isn't in line with Amazon's famous "Relentless Customer Obsession". Amazon takes monitoring very seriously and should take the lead by tracking, reporting, and proactively compensating customers when it lets them down.
tags: amazon, availability, cloud computing, ec2, operations, s3, sla, webops
Hyperic CloudStatus service dashboard launches at Velocity!
by Jesse Robbins | @jesserobbins | comments: 6
Javier Soltero just launched CloudStatus during his Hyperic sponsor session today at Velocity. CloudStatus is a public health dashboard for web services like Amazon's EC2/S3, and Google's App Engine.
Javier called to tell me about this last week after I declared that "Service Monitoring Dashboards are mandatory". This comes right after Amazon and Google had visible outages, and couldn't have happened at a better time. I'm really excited to see this idea take off, as it's something that is critical to the broad adoption of web services and cloud computing.
tags: cloudstatus, hyperic, monitoring, operations, outages, platform plays, specialized services, startups, velocity, velocity08, web 2.0, webops
Service Monitoring Dashboards are mandatory for production services!
by Jesse Robbins | @jesserobbins | comments: 6
Google App Engine went down earlier today. GAE is still a developer preview release, and currently lacks a public monitoring dashboard. Unfortunately this means that many people either found out from their app and/or admin consoles being unavailable or from Mike Arrington's post on TechCrunch.
Google has a strong Web Operations culture, and there are numerous internal monitoring tools in use across the company, along with a smaller set available to customers. It's surprising that Google launched a developer platform without providing something beyond an email group, although they are by no means the first to do so.

Service Monitoring Dashboards are mandatory for production services and platforms!
- If you launch a platform that people pay you money for, you need to have a real time service dashboard. Ideally this should be decoupled from the rest of your infrastructure.
- Don't rely on platforms that lack service monitoring dashboards for production.
Many companies are initially reluctant to provide this kind of monitoring to the public, and only do so in reaction to an outage. However, it seems that every company that offers such a dashboard uses it as a source of competitive advantage.
The best example of this is trust.salesforce.com, which Salesforce launched after a series of outages in 2006. Amazon (eventually) launched a status dashboard for AWS, and added RSS feeds for specific services, which I think is pretty cool.
Javier Soltero at Hyperic points out:
1. The reports of service outages arrive long after anyone who depends on the services can possibly do anything to mitigate their effect.
2. The services themselves seem incapable of providing any visibility into the circumstances that might lead to future outages. [...] Even TechCrunch points out that the Google Apps blog doesn’t even mention the outage. Other clouds rely on blogs such as this one, this one, or maybe even this one (from our good friends at Mosso). These are all places where outages can be discussed, but not the right means for people to find out whether it is their application that crashed, or the cloud that it depends on.
(Updated: Niall Kennedy pointed out that GAE is still a preview release, and I agree that my original wording was wrong. My intent is to emphasize the importance of providing a public service dashboard, and so I've edited accordingly.)
tags: failure happens, google app engine, infrastructure, internet policy, monitoring, operations, outages, platform plays, platforms, saas, velocity, web 2.0, web services, webops
Two new open source projects at Velocity
by Jesse Robbins | @jesserobbins | comments: 3
At Velocity next week there will be two significant open source projects debuting. The first is the Jiffy: Open Source Performance Measurement and Instrumentation tool created by Scott Ruthfield and his team at Whitepages.com.
Most tools for measuring web performance come in two flavors:
- Developer-installed tools (Firebug, Fiddler, etc.) that allow individuals to closely trace single sessions
- Third-party performance monitoring systems (Gomez, Keynote, etc.) that will hit your site occasionally and report back component-level metrics (for a fee)
Neither of these tools gives you real-world information on what’s actually happening with your clients: how long are pages really taking to load, what’s the real cost of client-side execution, and what’s the impact of your loading or dependency chain. This is even more important when you don’t host all of your own assets, such as when you load ads or JavaScript from third parties and need to monitor their performance.
Thus we built Jiffy—an end-to-end system for instrumenting your web pages, capturing client-side timings for any event that you determine, and storing and reporting on those timings. You run Jiffy yourself, so you aren’t dependent on the performance characteristics, inflexibility, or costs of third-party hosted services.
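The reporting half of a system like this can be sketched in a few lines. The Python below is not Jiffy's code or schema; it is a minimal illustration, assuming client-side timings arrive as `(event_name, elapsed_ms)` pairs, of how they might be rolled up per event. The event names and the choice of median and 95th percentile are assumptions.

```python
import math
from collections import defaultdict


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a non-empty ascending list (0 < pct <= 100)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(measurements):
    """Group (event_name, elapsed_ms) pairs by event and summarize each group."""
    by_event = defaultdict(list)
    for event_name, elapsed_ms in measurements:
        by_event[event_name].append(elapsed_ms)

    report = {}
    for event_name, values in by_event.items():
        values.sort()
        report[event_name] = {
            "count": len(values),
            "median_ms": percentile(values, 50),
            "p95_ms": percentile(values, 95),
        }
    return report


if __name__ == "__main__":
    # Hypothetical timings as they might be reported by instrumented pages.
    sample = [
        ("page_load", 800), ("page_load", 1200),
        ("page_load", 950), ("page_load", 4100),
        ("ad_script", 300), ("ad_script", 2200),
    ]
    for event, stats in sorted(summarize(sample).items()):
        print(event, stats)
```

Reporting a high percentile alongside the median is a deliberate choice in this sketch: slow third-party assets tend to show up in the tail long before they move the typical case.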
The second project is EUCALYPTUS, the Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems, presented by Rich Wolski from UCSB. This project has already started getting attention. (Many thanks to Surj Patel of Structure08/GigaOM for connecting us!)
Eucalyptus is an open-source software infrastructure for implementing "cloud computing" on clusters. The current interface to EUCALYPTUS is compatible with Amazon's EC2 interface, but the infrastructure is designed to support multiple client-side interfaces. EUCALYPTUS is implemented using commonly-available Linux tools and basic Web-service technologies making it easy to install and maintain.
The talk will focus on the design, the implementation tradeoffs we have identified in implementing Eucalyptus as an exploratory tool, and the ways in which we have chosen to address these tradeoffs in the first version of the software.
tags: cloud, cloud computing, ec2, gomez, jiffy, keynote, metrics, open source, operations, performance, platform plays, startups, structure08, velocity, velocity08, web 2.0, web monitoring, webops
Understanding Web Operations Culture (Part 1)
by Jesse Robbins | @jesserobbins | comments: 11
“You don’t choose the moment, the moment chooses you. You only choose how prepared you are when it does.” - Fire Chief Mike Burtch
(Note: I became a Firefighter-1 and EMT in 2000. My experiences in the fire service profoundly influence my efforts in technology. Much of my work over the past few years has been translating and distilling my knowledge from these two worlds, teaching others, and finding ways to apply it in the service of both.)
Last week I came upon a truck vs. scooter accident on my way home. I could hear a woman yelling in pain from underneath the truck (a good sign!) and could see a guy in the cab looking panicked and touching his controls. I stopped my car and “surveyed the scene” looking for things that might kill me (traffic, hazmat, downed power lines) or make the situation worse if undetected (additional victims, deflating tires, fires).
It looked like the driver was about to move his truck, which would have definitely made things worse. I used my ‘command voice’ to yell “Put it in park! Stop your engine! Set your brake! Get out and wait!” as I approached the truck.
A city crew came over, and one of them told me “We’ve called 911 and they are on their way.”
I asked them to handle traffic control as I approached my patient. I then introduced myself and asked her if I could help. (I have to obtain consent before assisting an injured person, and a response means I know they still have their Airway, Breathing, and Circulation intact.)
Her legs were entangled in her scooter which was trapped underneath the truck. While she probably had broken her leg, it didn’t look all that bad. She was still wearing her helmet and it wasn't seriously damaged which meant her head was probably okay too. I did a quick check for bleeding and other serious injuries and did a “mental status check” by asking her name, where she was (“on my way to school”), and what had happened (“I was riding and that a**hole RAN OVER ME!”). This meant she was alert and oriented, which was good.
Now that I was sure there weren’t any other life threatening injuries, I prepared to hold her head for c-spine stabilization. (Once you start holding stabilization, you cannot move again until you are ready to put the patient on a backboard.)
As I positioned myself on the ground and took hold of her head, I explained “I’m going to hold your head now to protect your neck and back. Once the fire department gets here, they are going to get your legs unstuck and then we’ll get you on a backboard. Your job is to keep still and keep talking to us. There will be a lot of commotion and noise around you, and that’s okay. Everyone will be watching out for you and so there is no reason to be scared. We’ve got you.”
tags: culture, education, ems, executive, firefighting, leadership, mainstream acceptance, management, medicine, operations, startups, velocity, velocity08, web 2.0, webops
CloudCamp gathering after Velocity
by Jesse Robbins | @jesserobbins | comments: 2
On Tuesday after Velocity closes there will be a CloudCamp gathering at Microsoft's San Francisco Office. I'll be going (unless I'm too exhausted to stand).
CloudCamp was formed in order to provide a common ground for the introduction and advancement of cloud computing. Through a series of local cloudcamp events, attendees can exchange ideas, knowledge and information in a creative and supporting environment, advancing the current state of cloud computing and related technologies. As an informal, member-supported gathering, we rely entirely on volunteers to help with meeting content, speakers, meeting locations, equipment and membership recruitment. We also have corporate sponsors that provide financial assistance with venues, software, books, discounts, and other valuable donations. To become a member, simply register for an upcoming event. Anyone may attend a meeting, there are no fees or dues.
It looks like there is now a London CloudCamp being planned for July 16th as well.
(PS: If you still haven't registered for Velocity and want to attend, you can use my 20% discount code "vel08js".)
tags: barcamp, cloud, cloud computing, cloudcamp, ec2, open source, operations, performance, startups, velocity, web 2.0, webops
Bill Coleman to keynote Velocity
by Jesse Robbins | @jesserobbins | comments: 0
Bill Coleman has twice transformed our industry, and I'm excited to announce that he will keynote Velocity later this month. Bill is most famous for being the "B" in BEA and for leading the creation of Solaris while at Sun. He is now the CEO of his new startup, Cassatt, which "makes Data Centers more efficient".
Bill is awesome and I'm really looking forward to his keynote. He is changing the way we think about and manage Data Centers and the software that runs within them.
When we spoke earlier this week he explained how vacuum tubes created the fear of powering down servers, and how funny it is that that fear persists with people that have never seen them. (I've never made that connection as I'm "part of the problem" ;-)
At Velocity, Bill will likely talk about virtualization & efficiency, where he thinks we're headed, and the questions we need to be asking now to get there.
(Many thanks to Tim for suggesting this to Bill and making the introduction.)
tags: bill coleman, datacenter, energy, green datacenter, operations, platform plays, power management, velocity, velocity08, virtualization, web 2.0, web operations, webops
TLS Report grades and reports on site security
by Jesse Robbins | @jesserobbins | comments: 7
My friend Ben Black just released TLS Report, a free (ad-supported) tool that evaluates SSL/TLS configurations across websites and assigns letter grades. In the example below, Facebook gets a D because it accepts several keys that are shorter than 128 bits and relies on MD5:

Ben explains: Cryptography is arcane and complex. Cryptography is also the basis for the various protocols that secure online commerce, ensure privacy of communication, and provide for integrity of data. Transport Layer Security (TLS), formerly SSL, is the de-facto standard for secure communication on the web, and it, naturally, relies on some rather sophisticated cryptographic techniques. Properly implemented, TLS all but guarantees the security of the communication channel.
It's that properly implemented part that catches folks out. Whether from poor defaults in software, poor understanding of best practices, or a weak grasp on the various trade-offs between security and performance, TLS, as most often deployed on the web, is in a sorry state. We hope to change that.
The tls report delivers the tools, information, and visibility to reveal problems in TLS configurations and offer better alternatives so folks can improve their security posture and make sure it stays improved. Everybody wins.
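To give a feel for how letter grades can fall out of a configuration scan, here is a minimal Python sketch. The rubric is invented for illustration and is not TLS Report's actual scoring: it simply penalizes the two problems mentioned above (keys under 128 bits and reliance on MD5). The `CipherSuite` type and the sample suites are likewise assumptions, not any real site's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CipherSuite:
    name: str      # suite name as a scanner might report it
    key_bits: int  # effective symmetric key strength in bits
    mac: str       # digest used for message integrity, e.g. "MD5" or "SHA1"


def grade(suites):
    """Assign a letter grade using a made-up rubric (not TLS Report's)."""
    if not suites:
        return "F"  # the server accepted nothing the scanner offered
    penalties = 0
    if any(s.key_bits < 128 for s in suites):
        penalties += 2  # at least one weak key size is accepted
    if any(s.mac.upper() == "MD5" for s in suites):
        penalties += 1  # at least one suite relies on MD5
    return "ABCD"[penalties]


if __name__ == "__main__":
    # Hypothetical scan result mixing strong and weak suites.
    accepted = [
        CipherSuite("AES256-SHA", 256, "SHA1"),
        CipherSuite("RC4-MD5", 128, "MD5"),
        CipherSuite("DES-CBC-SHA", 56, "SHA1"),
    ]
    print(grade(accepted))
```

Note that the grade depends on the weakest suite a server will accept, not the strongest one it offers, which is why a site with otherwise solid settings can still score poorly.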
Ben has received a few early complaints from sites getting low grades. This seems to be common with most new rating systems, and I think the discussion is often more important than the scores themselves. You can check out the top/bottom 20 sites, search, and add new ones to be included in the report.
tags: compliance, dss, operations, pci, pcidss, security, ssl, tls, tlsreport, velocity, velocity08, web 2.0, webops