Entries tagged with “operations” from O'Reilly Radar

Thu

Oct 1
2009

Jesse Robbins

More on how web performance impacts revenue...

by Jesse Robbins (@jesserobbins) | comments: 9

At Velocity this year Microsoft, Google and Shopzilla each presented data on how web performance directly impacts revenue.

Their data showed that slow sites get fewer search queries per user, less revenue per visitor, fewer clicks, fewer searches, and lower search engine rankings. They found that in some cases, even after site performance was improved, users continued to interact as if it were slow. Bad experiences have a lasting influence on customer behavior.

What about smaller websites that aren't yet at this scale?

Alistair Croll and Sean Power, the authors of the new book Complete Web Monitoring, have continued this research for sites at smaller scale.

They used a Strangeloop Networks web acceleration appliance to optimize half the sessions to a smaller production website, tagging optimized and unoptimized visitors so they could be analyzed in Google Analytics. The Strangeloop device applies many of Steve Souders' performance rules to an existing site automatically (a kind of "Steve-in-a-Box" ;-).

The results of their analysis show how significant a reduction in page latency can be. In addition to reducing bounce rates and increasing pages per visit and time on site, they found a 16.07% increase in conversion rates and a 5.50% increase in average order value.
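The split-and-tag approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual Strangeloop or Google Analytics setup: the function names, the hash-based assignment, and the sample order and visit counts are all assumptions made for the example.

```python
import hashlib


def assign_bucket(session_id: str) -> str:
    """Deterministically place a session in one of two roughly equal groups.

    Hashing the session ID (an assumed identifier) keeps a visitor in the
    same group on every request, so the tag stays consistent in analytics.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return "optimized" if digest[0] % 2 == 0 else "unoptimized"


def conversion_rate(orders: int, visits: int) -> float:
    """Fraction of visits that ended in an order."""
    return orders / visits if visits else 0.0


def relative_lift(control: float, treatment: float) -> float:
    """Relative change of the treatment over the control, in percent."""
    return (treatment - control) / control * 100.0


# Hypothetical numbers, chosen only to show the arithmetic.
control = conversion_rate(orders=200, visits=10_000)
treatment = conversion_rate(orders=230, visits=10_000)
print(f"Conversion lift: {relative_lift(control, treatment):.2f}%")
```

A reported figure such as "a 16.07% increase in conversion rates" is a relative lift of this kind: the optimized group's rate compared against the unoptimized group's rate, not a difference in percentage points.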

conversion-rate-and-order-value.png

Check out the full post on the Watching Websites blog.

tags: alistair croll, book related, operations, performance, velocity, velocityconf, watching websites, web monitoring

 

Fri

Sep 4
2009

Nat Torkington

Four short links: 4 September 2009

Flood Maps, Govt Permalinks, Ops, and Security

by Nat Torkington (@gnat) | comments: 1

  1. Flood Maps -- what the world will look like when the oceans rise. Interactive, so you can dial up your preferred level of environmental horror. (via Hans Nowak)
  2. Citability -- making government accessible, reliable, and transparent with advanced permalinks, as Government websites are ever changing and cannot be cited. Content changes without notice or accountability.
  3. Bootstrapping EC2 Images as Puppet Clients -- This is a post on how to get to the point of using Puppet in an EC2 environment, by automatically configuring EC2 instances as Puppet clients once they're launched. I've been learning that if you're using a cloud hosting service, you need an automated admin tool. (via Grig Gheorghiu). See also the APT repository for Chef.
  4. USB Snoop Stick -- Trojan in a convenient form factor, malware on a stick, back doors in your pocket ... and best of all, it's sold to consumers.

tags: climate change, environment, gov 2.0, operations, security, web, web monitoring

 

Mon

Aug 17
2009

Carl Hewitt

Is intimate personal information a toxic asset in cloud datacenters?

by Carl Hewitt | comments: 15

Guest blogger Carl Hewitt, Emeritus at MIT in the Electrical Engineering and Computer Science department, is known for his research on strongly paraconsistent logic, privacy-friendly client cloud computing, norms and commitments for organizational computing, and concurrent programming languages, models, and theories.

Aggregators (Google, Yahoo, Microsoft, Facebook, etc.) tend to believe that personal information is a valuable asset for several reasons. It is valuable to advertisers because it enables greater relevance for their ads. It is valuable to users because it can be used to enrich their lives. And it is valuable to aggregators because they can use personal information to make more money by selling (anonymous?) versions and by using it to bring together advertisers and customers. Recency and intimacy can add value to information. Current and recent information tends to be more relevant than older information. Intimate psychological, physiological, sociological, geographical, medical, etc. information can be used to personalize interactions.

Intimate current personal information is also valuable for government security because it can be critical to taking security countermeasures. Already in the UK, the previous two years of everyone's email, web browsing, and telephone calls are becoming available to government officials at varying levels of detail. For example, detectives will be required to consider accessing telephone and internet records during every investigation under new plans to increase police use of communications data.

But that's only the beginning. As Jim Gray noted in "Distributed Computing Economics" (MSR-TR-2003-24) there is a growing imbalance between the computation power of billions of cores in aggregator datacenters and the relatively feeble fiber optic communications coming out of aggregator datacenters. This problem has now become so severe that Amazon has been forced to introduce a commercial service that lets users of their cloud import and export data through the post--as in, put it on storage devices and ship it by land, sea, or air. Soon even this stopgap will become impractical for government security agencies because whole shipping containers would have to be transferred--the functional equivalent of shipping large pieces of an aggregator datacenter. Consequently, to be effective, future government security software will have to be tightly integrated with aggregator datacenters. The most effective security measures will require aggregator datacenters to be heavily regulated, i.e., analogous to nuclear power plants.

Semantic Integration, an emerging technological capability to bring together all kinds of information in a semantic engine, will greatly intensify all of the above issues (see "A historical perspective on developing foundations for privacy-friendly client cloud computing: The Paradigm Shift from 'Inconsistency Denial' to 'Practical Semantic Integration™'", ArXiv 0901.4934). The following kinds of information can be semantically integrated: calendars and to-do lists, email, SMS and Twitter archives, presence information (including physical, psychological and social), maps (including firms, points of interest, traffic, parking, and weather), events (including alerts and status), documents (including presentations, spreadsheets, proposals, job applications, health records, photos, videos, gift lists, memos, purchasing, contracts, articles), contacts (including social graphs and reputation) and search results (including rankings and ratings).

Two critical technologies are the foundation of Practical Semantic Integration: The first is Lightly Structured Natural Language™ interfaces that allow information to be easily found and organized. The second is many-core semantic engines (see "ActorScript™: Industrial strength integration of local and nonlocal concurrency for Client-cloud Computing", ArXiv 0907.3330) that rapidly process information in ways that are tolerant of inconsistency (see "Common sense for concurrency and strong paraconsistency using unstratified inference and reflection", ArXiv 0812.4852).

To be effective, government security Semantic Integration systems will need to be joined with those of aggregators. Thus Semantic Integration of personal information on aggregator datacenters will require additional government regulation of aggregators. Will government regulation prove toxic to the ability of aggregators to innovate?

This is a future that we expect most readers would find distasteful. There is an alternative: A client cloud is a local cloud controlled by a client, e.g., a family cloud might consist of the cell phones, computers, security cameras, home entertainment centers, Wi-Fi access points, etc. of a family. Semantic Integration could be performed in clients' clouds so that clients by default store their information in cloud datacenters in a way that it can be decrypted only by using a client's secret key.

Semantic Integration using clients' clouds has some important advantages. Client responsiveness can be faster by not requiring communication with datacenters. Aggregator capital, operating and communication costs can be lower because Semantic Integration is performed in clients' clouds instead of aggregator datacenters.

By performing Semantic Integration in clients' clouds, aggregators can make far more money than they do now by doing an even better job of matching up customers with merchants in a way that is more pleasing to both. Aggregators can provide software that runs in the clients' clouds (although it may have to be audited by third parties). The aggregator's software can volunteer high-level information to the aggregator's datacenters about the kind of merchant information that might be relevant. Within clients' clouds, the merchant information can then be tailored to the specific requirements of clients.

For the reasons above, an aggregator can do better by performing clients’ Semantic Integration using their clouds rather than relying entirely on the aggregator's cloud. And using clients’ clouds could lessen the degree of government regulation because the government would have to subpoena clients to obtain their most intimate personal information. If the information in an aggregator’s datacenters is sufficiently anonymous, then it would not become necessary for government security agencies to regulate them so heavily.

The question is: "What are the aggregators going to do about intimate personal information?" If one of them initiates a project to develop a Semantic Integration product that operates in clients' clouds, then the others will rapidly follow suit.

tags: emerging tech, operations

 

Thu

Aug 6
2009

Jesse Robbins

John Adams on Fixing Twitter: Improving the Performance and Scalability of the World's Most Popular Micro-blogging Site

by Jesse Robbins (@jesserobbins) | comments: 2

Twitter is suffering outages today as they fend off a Denial of Service attack, and so I thought it would be helpful to post John Adams’ exceptional Velocity session about Operations at Twitter.

Good luck today John & team… I know it’s going to be a long day!

Update: Apparently Facebook & LiveJournal have had similar attacks today. Rich Miller from Data Center Knowledge reminds us that this is just the latest in a series of major attacks.

tags: attacks, critical infrastructure, infrastructure, operations, performance, security, twitter, velocity, velocity09, velocityconf, video, web2.0, webops

 

Wed

Jul 1
2009

Steve Souders

Velocity and the Bottom Line

by Steve Souders | comments: 3

Velocity 2009 took place last week in San Jose, with Jesse Robbins and me serving as co-chairs. Back in November 2008, while we were planning Velocity, I said I wanted to highlight "best practices in performance and operations that improve the user experience as well as the company's bottom line." Much of my work focuses on the how of improving performance: tips developers use to create even faster web sites. What's been missing is the why. Why is it important for companies to focus on performance?

That question was answered at Velocity last week by speakers from AOL, Google, Microsoft, and Shopzilla.

  • Eric Schurman (Bing) and Jake Brutlag (Google Search) co-presented results from latency experiments conducted independently on each site. Bing found that a 2 second slowdown changed queries/user by -1.8% and revenue/user by -4.3%. Google Search found that a 400 millisecond delay resulted in a -0.59% change in searches/user. What's more, even after the delay was removed, these users still had 0.21% fewer searches, indicating that a slower user experience affects long-term behavior. (video, slides)
  • Dave Artz from AOL presented several performance suggestions. He concluded with statistics that show page views drop off as page load times increase. Users in the top (fastest) decile of page load times view ~7.5 pages/visit. This drops to ~6 pages/visit in the 3rd decile, and bottoms out at ~5 pages/visit for users with the slowest page load times. (slides)
  • Marissa Mayer shared several performance case studies from Google. One experiment increased the number of search results per page from 10 to 30, with a corresponding increase in page load times from 400 milliseconds to 900 milliseconds. This resulted in a 25% dropoff in first result page searches. Adding the checkout icon (a shopping cart) to search results made the page 2% slower with a corresponding 2% drop in searches/user. (Watch the video to see the clever workaround they found.) Image optimizations in Google Maps made the page 2-3x faster, with a significant increase in user interaction with the site. (video, slides)
  • Phil Dixon, from Shopzilla, had the most takeaway statistics about the impact of performance on the bottom line. A year-long performance redesign resulted in a 5 second speed up (from ~7 seconds to ~2 seconds). This resulted in a 25% increase in page views, a 7-12% increase in revenue, and a 50% reduction in hardware. This last point shows the win-win of performance improvements, increasing revenue while driving down operating costs. (video, slides)

These case studies provide real world numbers that show the benefits of making your site faster. Other Velocity sessions share techniques for implementing performance improvements, including sessions from me, Doug Crockford, and the Facebook and Google frontend teams. But what about the user experience? In his session, Matt Mullenweg (of WordPress fame) makes sure we remember the importance of how the user feels while interacting with our site:

That's why [performance] is important and why we should be obsessed and not be discouraged when it doesn't change the funnel. My theory here is when an interface is faster, you feel good. And ultimately what that comes down to is you feel in control. The web app isn't controlling me, I'm controlling it. Ultimately that feeling of control translates to happiness in everyone. In order to increase the happiness in the world, we all have to keep working on this.

Thanks to the Velocity speakers & their organizations for overcoming the many challenges required to present this data for the first time. We're now equipped with the financial justification, the technical know-how, and the visceral motivation to go out and make the Web a faster place. We'll have more performance success stories next year. Your company could be one of them! Capture your performance improvements and bottom line impact. We'd love to hear from you at Velocity 2010.

tags: operations, velocity09, velocityconf, web2.0, webops

 

Mon

Jun 29
2009

Nat Torkington

Four short links: 29 June 2009

Sysadmin Wiki, Physics, National Archives, and Reinventing the British Government

by Nat Torkington (@gnat) | comments: 1

  1. Server Fault -- Wikipedia-like sysadmin guide, built by the Stack Overflow team, who are branching out to reach a more general IT Professional audience. (via Brady in email)
  2. Sixty Symbols -- 5-minute videos about the symbols of physics and astronomy. Great stuff! (via Glutnix on Twitter)
  3. US National Archives launches YouTube Channel -- a mixture of archives-nerd stuff (directors of Presidential Libraries talking about their favourite items) and wider-interest collections (such as Touring 1930s America).
  4. Open House in Westminster -- the ever-insightful Tom Steinberg from MySociety has an article in the Independent about British plans to reinvent government. Now the talk of Westminster is all about democratic reform. By my count there are over 50 different ideas for changing the way our democracy works being touted by different pundits at the moment. [...] What all these ideas, though, have in common is that they propose structural reforms that could have been achieved any time in the last 200 years.[...] My view is that these proposals are all interesting, and some may be quite critical for a better democracy. But I am also concerned that they do not see Parliament and the process of making laws as a native to the internet would. They don’t ask: “What reforms are possible that just weren’t conceivable ten years ago?”

tags: gov2.0, government, mysociety, operations, science, science education

 

Wed

Jun 24
2009

Jesse Robbins

Jonathan Heiliger on Web Performance, Operations, and Culture

by Jesse Robbins (@jesserobbins) | comments: 0

We were honored to have Jonathan Heiliger, Facebook’s VP of Technology Operations, as our opening keynote speaker at Velocity. Jonathan is one of the most accomplished leaders in our field, and is a master of the craft.

Here is his keynote in its entirety:

Note: Other videos from Velocity are being posted to VelocityConference.blip.tv

tags: development, executive, facebook, jonathan heiliger, leadership, operations, performance, velocity, velocityconf, web2.0, webops

 

Fri

Jun 19
2009

Scott Ruthfield

Announcing: Spike Night at Velocity

by Scott Ruthfield (@scottru) | comments: 5

Guest blogger Scott Ruthfield is a Program Committee member of the O'Reilly Velocity: Web Performance & Operations Conference. 


Web Operations is not for the casual observer: it's for a particular kind of adrenaline junkie that's motivated by graphs and servers spinning out of control.  Jumping in, on-your-feet analysis, and experience-based-experimentation are all part of solving new problems caused by unexpected user and machine behavior, and keeping a clear head when service owners and executives are panicking is part of the job. 

A core part of operations leadership is spike management - what you do when you see a significantly larger amount of load than you've had before. Sometimes this is predictable months out (Amazon knows, for example, that the first or second Monday of December will be their biggest day each year), sometimes days out (Twitter knew Oprah was coming), and sometimes not at all (what we still call the Slashdot Effect). Every web ops professional deals with some kind of spike - even intranets manage paydays and employee review days - and if you're into it, well, spikes can be fun. Of course, maybe you use EC2 Auto-Scaling, and so (in theory) don't have to worry about it, although of course bottlenecks come in many forms.

So at Velocity this year, we're trying out something new: Spike Night.

Spike Night is a chance to see and learn about how real, high-traffic websites deal with massive increases in load, either expected or unexpected. We'll see real-world management of traffic increases - graphs, tools, the whole shebang.

Now, it turns out that when I called up lots of people on the phone and said "can we throw massive load at your website so you can stand on stage and brag about it," many web ops folks were excited, but then they started worrying about little things like "what if something goes wrong and everyone blogs about it" or "do I have to ask somebody in a PR department" and then calls went unreturned.

Fortunately, two parties have stepped up, and I can't wait to see what they have to show:
  • Chris Bissell, Chief Software Architect at MySpace, and members of the MySpace team will demonstrate a massive, real increase in traffic, and will manage it on-stage. MySpace already deals with tens of thousands of hits each second - we can't throw enough traffic at them to cause any harm - so they'll cause their own harm and then show how they work through it.
  • Ryan Nelson, Operations Director for MLB Advanced Media and MLB.com, will walk us through a combination of war stories and live traffic management to show what happens when millions of baseball fans all want to see what's happened after the commercial break at the exact same time. Between their very popular desktop apps and their newly-announced iPhone game streaming, the MLB is a true leader in technology innovation with a rabid fan base that goes well beyond the Web 2.0 echo chamber.

Spike Night is meant to be a fun event, taking place Tuesday June 23rd @ 7:30PM at Velocity, and open to the larger web community - a Velocity conference pass is not required to attend. I'm looking forward to hosting interesting demos and a fun Q&A, and hope to see all of you there!

tags: cloud, infrastructure, operations, performance, scalability, scale, spikenight, velocity, velocity09, velocityconf, web2.0, webops

 

Mon

Jun 8
2009

Jesse Robbins

Ignite! comes to San Jose June 22nd - Submit your talks now!

by Jesse Robbins (@jesserobbins) | comments: 0

Ignite! is coming to San Jose on Monday June 22, 2009 at 8:00 pm, attached to the Velocity Conference. Admission is free, open to all, and there will be a cash bar.

The deadline for talks is May 11th, so submit your talks now!

As with all Ignites, each speaker gets only 20 slides that auto-advance every 15 seconds, for a total of five minutes. We'll be looking for fun geek topics like hacks, how-to's, and insights. (Talks don't have to be Velocity-related!) If you're not sure what an Ignite talk looks like, check out the Ignite Show.

You can RSVP for the event on Upcoming or Facebook.

tags: events, ignite, operations, san jose, velocity, velocityconf, web2.0, webops

 

Mon

May 18
2009

James Turner

Velocity Preview - The Greatest Good for the Greatest Number at Microsoft

by James Turner | comments: 4

You may also download this file. Running time: 00:20:26

Subscribe to this podcast series via iTunes. Or, visit the O'Reilly Media area at iTunes to find other podcasts from O'Reilly.

The psychology of engineering user experiences on the web can be difficult. How much rich content can you place up on a page before the load time drives away your visitors? Get the answer wrong, and you can end up with a ghost town; get it right and you're a star. Eric Schurman knows this well, since he is responsible for just those kinds of trade-off decisions on some of Microsoft's highest traffic pages. He'll be speaking at O'Reilly's Velocity Conference in June, and he recently talked with us about how Microsoft tests different user experiences on small groups of visitors.

James Turner: Why don't you start by describing what your gig at Microsoft is now and what your career path has been there?

Eric Schurman: I'm a principal dev lead for Live Search, what used to be MSN Search. And I started at Microsoft back in the late 90s working in Microsoft's Press organization, where we actually were developing training software that would emulate new Microsoft products, but didn't require those products to be on a user's machine. So, for example, if you had an organization that was running Windows 95, we would have a training system for Windows 98 that would emulate a bunch of the functionality of Windows 98 so that you could deploy it to your people. They could train their people on how to use Windows 98 before they actually deployed it.

I then moved on to the Microsoft Press website, where I became the dev lead for it. I made a few other moves and ended up going to Microsoft.com, where I ran the download center, the Microsoft.com homepage, the product catalog, and a bunch of other places from a dev perspective.

I then moved to what was then MSN Search, back in about 2005, and was there through the MSN to Live transition. At the time, I wasn't working on performance; I was just working on the Live Search application. And it became very obvious that we had some major performance problems. Performance has always been one of my really strong interests, so I took on addressing a lot of those. And when we addressed them, we had very significant improvements in our business metrics. That really surfaced how important performance was to the organization, and I moved into a role where I was really focusing just on performance. I've been in that role now for about two years.

JT: You've worked on at least three very different parts of the Microsoft website. The homepage has lots of hits, fairly static. The download page is a lot of data for long periods of time. Live Search is high volume, but there's also a lot of backend on that. In what ways do you need to architect them differently? And where can you reuse the same lessons?

ES: That's a great question. On the web, you've got different concerns from what you have for client apps. The main things that tend to impact end-user perceived performance on the web are often things about how you've designed your application from a network perspective. So how many different HTTP GET requests are you making? How are those GET requests structured? So, for example, are they serialized? Did you have a JavaScript file that then gets returned to the browser that requests another JavaScript file and another JavaScript file and then some content and then it finally gets rendered? So the number of assets that you request, that's going to be something that's important no matter what product you're doing.

There are other things, like how much script do you have on the page, how much CSS you have on the page, how much actual content you're rendering to the page, et cetera. There are tricks that you can use like combining many different graphics into a single tiled image and sending that down to the browser. It's much faster to send one image to the browser than, say, 20 images. Even if you end up sending the same overall graphics, but combined into one, it's still much faster to send it as one request.

There are also different data volume concerns. They're also different from a business perspective. A lot of what we were sending out from the download center was extremely time critical. We would have an update go out, and we needed to make sure that update was going to be available anywhere in the world within a certain time frame, which required us to handle very high bandwidth, and a very high volume of requests coming into the site that were transferring lots of bits. So that required something totally different than something like the Microsoft.com homepage.

It's also interesting looking at the volume of traffic and how that traffic reflects real users. So, for example, one of the problems that you end up with on both the Microsoft homepage and Live Search is that we have a huge number of bots that are trying to hit the system, lots of people trying to do SEO work are trying to hit search engines to gather information about their site, about competitor sites, about all sorts of things. On the Microsoft.com homepage, it's always under distributed denial of service attacks. It's not a question of how frequently does it happen; it's just what is the rate right now? Also, the Microsoft.com homepage has historically had such a high up-time rate that it's actually hit by a lot of hardware devices simply to check for connectivity to the internet. And so you'd want to treat a request from that kind of "user" very differently from a request that's coming from a real user.

So that's kind of a long, rambling answer to your question. Do you have any areas that you want me to drill in or maybe talk about something else?
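Schurman's point about request counts and serialized requests can be illustrated with a deliberately simple latency model. This is a sketch under stated assumptions, not a measurement: it charges every request one network round trip plus transfer time, ignores parallel connections, caching, and connection setup, and the function names, image sizes, round-trip time, and bandwidth are all hypothetical.

```python
def request_time(size_kb: float, rtt_ms: float, kb_per_ms: float) -> float:
    """Time for one HTTP request: one round trip plus transfer time."""
    return rtt_ms + size_kb / kb_per_ms


def serialized_time(sizes_kb, rtt_ms: float, kb_per_ms: float) -> float:
    """Requests issued strictly one after another, each paying a round trip."""
    return sum(request_time(size, rtt_ms, kb_per_ms) for size in sizes_kb)


def combined_time(sizes_kb, rtt_ms: float, kb_per_ms: float) -> float:
    """The same bytes delivered as a single combined asset (e.g. a tiled image)."""
    return request_time(sum(sizes_kb), rtt_ms, kb_per_ms)


# Hypothetical page: twenty 2 KB images, 80 ms round trip, 0.5 KB per ms.
images = [2.0] * 20
print("serialized:", serialized_time(images, rtt_ms=80.0, kb_per_ms=0.5), "ms")
print("combined:  ", combined_time(images, rtt_ms=80.0, kb_per_ms=0.5), "ms")
```

In this model the bytes transferred are identical in both cases. The difference comes entirely from paying the round-trip cost twenty times instead of once, which is why combining assets helps even when the total payload does not shrink.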

(continue reading)

tags: interviews, microsoft, operations, velocity09, velocityconf, web2.0, webops

 

Fri

May 8
2009

Jesse Robbins

Velocity 2009 - Big Ideas (early registration deadline)

by Jesse Robbins (@jesserobbins) | comments: 7

what-is-velocityconf.png

(tag cloud created from Velocity session & speaker information using wordle.net)

My favorite interview question to ask candidates is: "What happens when you type www.(amazon|google|yahoo).com in your browser and press return?"

While the actual process of serving and rendering a page takes seconds to complete, describing it in real detail can take an hour. A good answer spans every part of the Internet from the client browser & operating system, DNS, through the network, to load balancers, servers, services, storage, down to the operating system & hardware, and all the way back again to the browser. It requires an understanding of TCP/IP, HTTP, & SSL deep enough to describe how connections are managed, how load-balancers work, and how certificates are exchanged and validated... and that's just the first request!
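A small part of that answer, the client side of the first request, can be sketched with Python's standard library. This is a minimal illustration that assumes an HTTPS URL on the default port and glosses over redirects, caching, HTTP/2, and everything that happens behind the load balancer. The function names `build_request` and `fetch_status_line` are made up for this example.

```python
import socket
import ssl
from urllib.parse import urlsplit


def build_request(url: str) -> bytes:
    """Build the bytes of a minimal HTTP/1.1 GET request for the URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {parts.hostname}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_status_line(url: str, timeout: float = 5.0) -> str:
    """Walk the client-side steps by hand and return the response status line."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 443  # assumption: the URL uses https

    # Step 1: DNS. Resolve the hostname to a socket address.
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]

    # Step 2: TCP. Open a connection (the three-way handshake happens here).
    with socket.socket(family, socktype, proto) as raw:
        raw.settimeout(timeout)
        raw.connect(addr)

        # Step 3: TLS. Negotiate encryption and validate the server certificate.
        context = ssl.create_default_context()
        with context.wrap_socket(raw, server_hostname=host) as tls:
            # Step 4: HTTP. Send the request and read until the first line ends.
            tls.sendall(build_request(url))
            response = b""
            while b"\r\n" not in response:
                chunk = tls.recv(4096)
                if not chunk:
                    break
                response += chunk

    return response.split(b"\r\n", 1)[0].decode("iso-8859-1")
```

Calling `fetch_status_line("https://www.example.com/")` on a machine with network access performs all four steps for a single request. A real browser repeats variations of this for every asset on the page, and each step hides further layers (resolver caches, routing, load balancers, certificate chains) that a good interview answer would explore.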

Web Performance & Operations is an emerging discipline which requires incredible breadth, focusing less on specific technologies and more on how the entire system works together. While people often specialize in particular components, great engineers always think of that component in relation to the whole. The best engineers are able to fly to the 50,000 foot view and see the entire system in motion and then zoom in to microscopic levels and examine the tiny movements of an individual part.

John Allspaw recently described this interconnectedness on his blog:

With websites, the introduction of change (for example, a bad database query) can affect (in a bad way) the entire system, not just the component(s) that saw the change. Adding handfuls of milliseconds to a query that’s made often, and you’re now holding page requests up longer. The same thing applies to optimizations as well. Break that [bad] query into two small fast ones, and watch how usage can change all over the system pretty quickly. Databases respond a bit faster, pages get built quicker, which means users click on more links, etc. This second-order effect of optimization is probably pretty familiar to those of us running sites of decent scale.

Working with these systems requires an understanding not only of the way technology interacts, but the way that people do as well. The structure, operation, and development of a website mirrors the organization that creates it, which is why so many people in WebOps focus on understanding and improving management culture & process.

Organizing a conference like Velocity is a wonderful challenge because it requires the same sort of thinking. We focus on the big concepts that everyone needs to know and then go deep into the technologies that change our understanding of the system. We find ways to share the unique experience that can only be gained by operating at scale. We make it safe to share as much of the "Secret Sauce" as we can.

Please join us at Velocity this year; we have an amazing lineup of speakers & participants. Early registration ends on Monday, May 11th at 11:59 PM Pacific. (Radar readers can use "vel09cmb" for an additional 15% discount.)

Velocity, the Web Performance and Operations Conference 2009

tags: cloud, data, infrastructure, operations, scale, velocity, velocity09, velocityconf, web, web2.0

 

Thu

May 7
2009

James Turner

Velocity Preview - Keeping Twitter Tweeting

by James Turner | comments: 3

You may also download this file. Running time: 00:10:46

Subscribe to this podcast series via iTunes. Or, visit the O'Reilly Media area at iTunes to find other podcasts from O'Reilly.

If there's a site that exemplifies explosive growth, it has to be Twitter. It seems like everywhere you look, someone is Tweeting, or talking about Tweeting, or Tweeting about Tweeting. Keeping the site responsive under that type of increase is no easy job, but it's one that John Adams has to deal with every day, working in Twitter Operations. He'll be talking about that work at O'Reilly's Velocity Conference, in a session entitled Fixing Twitter: Improving the Performance and Scalability of the World's Most Popular Micro-blogging Site, and he spent some time with us to talk about what is involved in keeping the site alive.

James Turner: Can you start by describing the platforms and technologies that make Twitter run today?

John Adams: Twitter currently runs on Ruby on Rails. And we also use a combination of Java and Scala, and a number of homegrown scripts that run the site. We also use a lot of open-source tools like Apache, MySQL, memcached.

JT: What type of hardware are you running on?

JA: It's all Linux, so a lot of x86 hardware. I can't tell you the brands or how many.

JT: Do you make any kind of attempt to stay homogeneous in that?

JA: Yes, we do. All of our hardware is very consistent. It makes deployment of new software very easy. And we also use a number of configuration management tools like Puppet to deliver software to those machines.

JT: As anyone can see, Twitter has had a pretty explosive growth, especially recently. Were you prepared for this kind of ramp up?

JA: I don't think so. I mean we're growing week over week in enormous numbers. And we spend a lot of time calculating the growth and scalability of the site to make sure that we can handle the upcoming load.

JT: I mean obviously there are events like Oprah decides she's going to Tweet that are going to be spikes. Do you try to get warning of that stuff?

JA: Yeah. And frequently we know of major events happening. Major events are very predictable like Macworld, even any massive amount of media interaction, we have some fair warning beforehand.

(continue reading)

tags: interviews, operations, twitter, velocity, velocity09, velocityconf, web2.0, webops | comments: 3

 

Fri

Apr 10
2009

Jesse Robbins

AT&T Fiber cuts remind us: Location is a Basket too!

by Jesse Robbins @jesserobbins | comments: 3

The fiber cuts affecting much of the San Francisco Bay Area this week are similar to the outages in the Middle East last year (Radar post), although far more limited in scope and impact. What I said last year still holds true and is repeated below:

From an operations perspective these kinds of outages are nothing new, and underscore why having "many eggs in few baskets" is such a problem. I believe we will see similar incidents when we have the first multi-datacenter failures where multiple providers lose significant parts of their infrastructure in a single geographic area.

Remember: Don't put all your eggs in one basket... and Location is a basket too!

To really understand the issue, I recommend Neal Stephenson's incredible (and lengthy) Wired article from 1996 entitled "Mother Earth Mother Board":

[...] It sometimes seems as though every force of nature, every flaw in the human character, and every biological organism on the planet is engaged in a competition to see which can sever the most cables. The Museum of Submarine Telegraphy in Porthcurno, England, has a display of wrecked cables bracketed to a slab of wood. Each is labeled with its cause of failure, some of which sound dramatic, some cryptic, some both: trawler maul, spewed core, intermittent disconnection, strained core, teredo worms, crab's nest, perished core, fish bite, even "spliced by Italians." The teredo worm is like a science fiction creature, a bivalve with a rasp-edged shell that it uses like a buzz saw to cut through wood - or through submarine cables. Cable companies learned the hard way, early on, that it likes to eat gutta-percha, and subsequent cables received a helical wrapping of copper tape to stop it.

[...] There is also the obvious threat of sabotage by a hostile government, but, surprisingly, this almost never happens. When cypherpunk Doug Barnes was researching his Caribbean project, he spent some time looking into this, because it was exactly the kind of threat he was worried about in the case of a data haven. Somewhat to his own surprise and relief, he concluded that it simply wasn't going to happen. "Cutting a submarine cable," Barnes says, "is like starting a nuclear war. It's easy to do, the results are devastating, and as soon as one country does it, all of the others will retaliate."

As the capacity of optical fibers climbs, so does the economic damage caused when the cable is severed. FLAG makes its money by selling capacity to long-distance carriers, who turn around and resell it to end users at rates that are increasingly determined by what the market will bear. If FLAG gets chopped, no calls get through. The carriers' phone calls get routed to FLAG's competitors (other cables or satellites), and FLAG loses the revenue represented by those calls until the cable is repaired. The amount of revenue it loses is a function of how many calls the cable is physically capable of carrying, how close to capacity the cable is running, and what prices the market will bear for calls on the broken cable segment. In other words, a break between Dubai and Bombay might cost FLAG more in revenue loss than a break between Korea and Japan if calls between Dubai and Bombay cost more.

The rule of thumb for calculating revenue loss works like this: for every penny per minute that the long distance market will bear on a particular route, the loss of revenue, should FLAG be severed on that route, is about $3,000 a minute. So if calls on that route are a dime a minute, the damage is $30,000 a minute, and if calls are a dollar a minute, the damage is almost a third of a million dollars for every minute the cable is down. Upcoming advances in fiber bandwidth may push this figure, for some cables, past the million-dollar-a-minute mark. [Link]
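
The rule of thumb quoted above reduces to a single multiplication. Here is a minimal sketch in Python; the function and constant names are illustrative, and the only figure taken from the article is the $3,000 per minute for each penny per minute of call price.

```python
# Rule of thumb from the quoted article: each penny per minute that a route's
# long-distance market will bear corresponds to about $3,000 of lost revenue
# for every minute the cable is down.
LOSS_PER_MINUTE_PER_CENT = 3_000  # dollars; figure taken from the article


def revenue_loss_per_minute(price_cents_per_minute: int) -> int:
    """Estimated dollars of revenue lost per minute of outage on a route."""
    return price_cents_per_minute * LOSS_PER_MINUTE_PER_CENT


if __name__ == "__main__":
    for label, cents in [("a dime", 10), ("a dollar", 100)]:
        loss = revenue_loss_per_minute(cents)
        print(f"{label} a minute: ${loss:,} lost per minute of outage")
```

At a dollar a minute the estimate is $300,000, which the article rounds to "almost a third of a million dollars".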

It's also worth mentioning the outages to multiple service providers hosted in a single colocation facility when the FBI seized all the equipment in the facility, the big outage at 365 Main from two years ago, and many others (see: Radar posts & comprehensive coverage at Data Center Knowledge).

(If Web Operations & Infrastructure is your interest or passion, you should attend Velocity 2009. You can use the code "vel09cmb" for a 15% discount.)

velocity2009.gif
(Image source: http://www.flickr.com/photos/mundane_joy/2301368102/)

tags: at&t, cloud, failure, failure happens, fiber, infrastructure, operations, outages, velocity, velocity09, web infrastructure, web operations, web2.0, webops, worries | comments: 3

 

Thu

Feb 26
2009

Simon Wardley

Karmic Koalas Love Eucalyptus

by Simon Wardley | comments: 7

Guest blogger Simon Wardley, a geneticist with a love of mathematics and a fascination for economics, is the Software Services Manager for Canonical, helping define future cloud computing strategies for Ubuntu. Simon is a passionate advocate and researcher in the fields of open source, commoditization, innovation, and cybernetics.

Mark Shuttleworth recently announced that the release of Ubuntu 9.10 will be code-named Karmic Koala. Whilst many of the developments around Ubuntu 9.10 are focused on the desktop, a significant effort is being made on the server release to bring Ubuntu into the cloud computing space. The cloud effort begins with 9.04 and the launch of a technology preview of Eucalyptus, an open sourced system for creating Amazon EC2-like clouds, on Ubuntu.

I thought I'd discuss some of the reasoning behind Ubuntu's Cloud Computing strategy. Rather than just give a definition of cloud computing, I'll start with a closer look at its underlying causes.

The computing stack is composed of many layers, from the applications we write, to the platforms we develop in and the infrastructure we build upon. Some activities at various layers of this stack have become so ubiquitous and well defined that they are now suitable for service provision through volume operations. This has led to the growth of the 'as a Service' industries, with providers like Amazon EC2 and Force.com.

Information Technology's shift from a product to a service-based economy brings with it both advantage and disruption. On the one hand, the shift offers numerous benefits including economies of scale (through volume operations), focus on core activities (outsourcing), acceleration in innovation (componentisation), and pay per use (utility charging). On the other hand, many concerns remain, some relating to the transitional nature of this shift (management, security and trust), while others pertain to the general outsourcing of any common activity (second sourcing options, competitive pricing pressures and lock-in). These concerns create significant adoption barriers for the cloud.

At Canonical, the company that sponsors and supports Ubuntu, we intend to provide our users with the ability to build their own clouds whilst promoting standards for the cloud computing space. We want to encourage the formation of competitive marketplaces for cloud services with users having choice, freedom, and portability between providers. In a nutshell, and with all due apologies to Isaac Asimov, our aim is to enable our users with 'Three Rules Happy' cloud computing. That is to say:

  • Rule 1: I want to run the service on my own infrastructure.

  • Rule 2: I want to easily migrate the service from my infrastructure to a cloud provider and vice versa with a few clicks of a button.

  • Rule 3: I want to easily migrate the service from one cloud provider to another with a few clicks of a button.

(continue reading)

tags: cloud computing, open source, operations, ubuntu | comments: 7

 

Thu

Feb 12
2009

Artur Bergman

Cloud Computing defined by Berkeley RAD Labs

by Artur Bergman | comments: 5

I am pleased to finally have found a paper that manages to bring together the different aspects of cloud computing in a coherent fashion, and suggests the requirements for it to develop further.

Written by the Berkeley RAD Lab (UC Berkeley Reliable Adaptive Distributed Systems Laboratory), the paper succinctly brings together Software as a Service and Utility Computing to arrive at a workable definition of Cloud Computing. It is a recommended read.

The services themselves have long been referred to as Software as a Service (SaaS). The datacenter hardware and software is what we will call a Cloud. When a Cloud is made available in a pay-as-you-go manner to the general public, we call it a Public Cloud; the service being sold is Utility Computing. We use the term Private Cloud to refer to internal datacenters of a business or other organization, not made available to the general public. Thus, Cloud Computing is the sum of SaaS and Utility Computing, but does not include Private Clouds.

Exploring the range from the raw service of Amazon EC2 to the high-level, web-centered Google App Engine, the paper's highlights are:

  • Insight into the pay-as-you-go aspect with no commitments
  • Analysis of cost with regards to peak and elasticity in face of unknown demand
  • Cost of data transfers versus processing time
  • Seamless migration of user to cloud processing
  • Limits and problems with I/O on shared hardware

They raise the following obstacles and opportunities, echoing Tim's posts on Open Source and Cloud Computing and Web 2.0 and Cloud Computing:

  • Availability of Service
  • Data Lock-In
  • Data Confidentiality and Auditability
  • Data Transfer Bottlenecks
  • Performance Unpredictability
  • Scalable Storage
  • Bugs in Large-Scale Distributed Systems
  • Scaling Quickly
  • Reputation Fate Sharing
  • Software Licensing

I find the analysis of transportation cost versus computing cost particularly interesting: when is it more efficient to use EC2 than your own individual processing? I predict that the speed of light and the availability of raw transfer capacity are going to become an even larger obstacle. (This applies inside computers, between them on LANs, and on WANs.)
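
To make that trade-off concrete, here is a rough sketch in Python. It is not the paper's model: it compares elapsed time only (dollar cost is left out), ignores protocol overhead and latency, and every number in the example run is hypothetical.

```python
def transfer_seconds(data_bytes: float, link_bits_per_second: float) -> float:
    """Time to move the data over a link, ignoring overhead and latency."""
    return data_bytes * 8 / link_bits_per_second


def cloud_is_faster(data_bytes: float, link_bits_per_second: float,
                    local_seconds: float, cloud_seconds: float) -> bool:
    """True when uploading and processing remotely beats processing locally."""
    upload = transfer_seconds(data_bytes, link_bits_per_second)
    return upload + cloud_seconds < local_seconds


if __name__ == "__main__":
    # Hypothetical workload: 1 TB of input, 48 hours on local hardware,
    # 2 hours on rented instances once the data has arrived.
    one_tb = 1e12
    for mbit in (10, 100):
        link = mbit * 1e6
        hours = transfer_seconds(one_tb, link) / 3600
        faster = cloud_is_faster(one_tb, link, 48 * 3600, 2 * 3600)
        print(f"{mbit} Mbit/s uplink: upload {hours:.1f} h, cloud faster: {faster}")
```

With these made-up inputs the answer flips purely on link speed, which is the point: the transfer term can dominate however much compute you rent.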

The paper reinforces my belief in the cloud, and my belief that we need open source cloud environments and a larger ecosystem of providers.

Read more on the Above the Clouds blog.

tags: cloud computing, operations, web2.0 | comments: 5

 

Thu

Feb 5
2009

Jesse Robbins

Understanding Web Operations Culture - the Graph & Data Obsession

by Jesse Robbins @jesserobbins | comments: 8

We’re quite addicted to data pr0n here at Flickr. We’ve got graphs for pretty much everything, and add graphs all of the time.

-John Allspaw, Operations Engineering Manager at Flickr & author of The Art of Capacity Planning

One of the most interesting parts of running a large website is watching how unrelated events affect user traffic in aggregate. Web traffic is something that companies typically keep very secret, and often the only time engineers can talk about it is late at night, at a bar, and very much off the record.

There are many good reasons for keeping this kind of information confidential, particularly for publicly traded companies with complicated disclosure requirements. There are also downsides, the biggest being that it is difficult for peers to learn from each other and compare notes.

John Allspaw recently created a WebOps Visualizations group on Flickr for sharing these kinds of graphs with the confidential information removed. Here’s an example of a traffic drop seen both by Flickr & by Last.FM that coincided with President Obama’s inauguration.

John Allspaw shows drop in web traffic to Flickr during Obama inauguration

Similar traffic drop on Last.FM seen on the right

Traffic Drop to Last.FM during Obama inauguration on right

Google saw a similar drop as well

Traffic Drop to Google during Obama Inauguration

Was it because everybody went to Twitter?

Traffic Spike on Twitter during Obama Inauguration

Besides being an interesting story, sharing these kinds of graphs helps people build better monitoring tools and processes. As just one example: How should the WebOps team respond to this dip in traffic? Is it an outage? The inauguration was a very well known event and so it’s easy to explain the drop in traffic… what happens when a similar drop in traffic occurs without such an obvious cause? Should the WebOps team be looking at CNN (or trends on Twitter) along with everything else?

How do you tell when that unexpected 10% drop in traffic is really just people with something more important to do than browse your site?
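
One way to start answering that question is to check a dip against a calendar of known events before paging anyone. The Python sketch below is only an illustration: the 10% threshold echoes the question above, the event calendar is hypothetical and hand-maintained, and all function names are made up for this example.

```python
from datetime import date

# Hypothetical calendar of widely watched events; a real one would need a feed.
KNOWN_EVENTS = {date(2009, 1, 20): "presidential inauguration"}


def relative_drop(baseline: float, observed: float) -> float:
    """Fraction by which observed traffic falls below the baseline."""
    return (baseline - observed) / baseline


def classify_drop(day, baseline, observed, known_events=KNOWN_EVENTS, threshold=0.10):
    """Label a traffic reading as normal, an explainable dip, or one to investigate."""
    if relative_drop(baseline, observed) < threshold:
        return "normal"
    if day in known_events:
        return f"dip, possibly explained by: {known_events[day]}"
    return "dip, unexplained: investigate"


if __name__ == "__main__":
    print(classify_drop(date(2009, 1, 20), baseline=1000, observed=850))
    print(classify_drop(date(2009, 2, 3), baseline=1000, observed=850))
```

A match in the calendar does not prove the dip is harmless; it only gives the on-call engineer a plausible explanation to confirm or rule out.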

(Note: Updated since original posting to add Google & Twitter graphs and annotations, and to switch the Last.FM graphic with an annotated one after I got permission.)

tags: big data, culture, enterprise 2.0, flickr, infovis, john allspaw, last.fm, metrics, monitoring, operations, velocity, velocity09, web2.0, webops | comments: 8

 

Sat

Nov 29
2008

Jesse Robbins

Data Center Power Efficiency

by Jesse Robbins @jesserobbins | comments: 8

James Hamilton is one of the smartest and most accomplished engineers I know. He now leads Microsoft's Data Center Futures Team, and has been pushing the opportunities in data center efficiency and internet scale services both inside & outside Microsoft. His most recent post explores misconceptions about the Cost of Power in Large-Scale Data Centers:


I’m not sure how many times I’ve read or been told that power is the number one cost in a modern mega-data center, but it has been a frequent refrain. And, like many stories that get told and retold, there is an element of truth to it. Power is absolutely the fastest growing operational cost of a high-scale service. Except for server hardware costs, power and costs functionally related to power usually do dominate.

However, it turns out that power alone isn’t anywhere close to the most significant cost. Let’s look at this more deeply. If you amortize power distribution and cooling systems infrastructure over 15 years and amortize server costs over 3 years, you can get a fair comparative picture of how server costs compare to infrastructure (power distribution and cooling). But how to compare the capital costs of servers, and of power and cooling infrastructure, with that monthly bill for power?

The approach I took is to convert everything into a monthly charge. [...]
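
As a rough illustration of that conversion (not Hamilton's actual model), the Python sketch below spreads each capital cost evenly over its amortization period so it can sit next to the monthly power bill. It assumes straight-line amortization with no cost of money. The 3-year and 15-year periods come from the quote above; the dollar amounts are hypothetical round numbers, not figures from his post.

```python
def monthly_charge(capital_cost: float, amortization_years: int) -> float:
    """Straight-line monthly charge for a capital purchase (no cost of money)."""
    return capital_cost / (amortization_years * 12)


def monthly_breakdown(server_capex: float, infra_capex: float,
                      monthly_power_bill: float) -> dict:
    """Put servers, power/cooling infrastructure, and power on one monthly basis."""
    return {
        "servers": monthly_charge(server_capex, 3),          # 3 years, per the quote
        "infrastructure": monthly_charge(infra_capex, 15),   # 15 years, per the quote
        "power": monthly_power_bill,                         # already a monthly figure
    }


if __name__ == "__main__":
    # Hypothetical inputs chosen only to make the arithmetic easy to follow.
    costs = monthly_breakdown(server_capex=36_000_000,
                              infra_capex=90_000_000,
                              monthly_power_bill=300_000)
    total = sum(costs.values())
    for name, amount in costs.items():
        print(f"{name}: ${amount:,.0f}/month ({amount / total:.0%} of total)")
```

Once everything is expressed per month, the shares can be compared directly, which is what makes the "power is the number one cost" claim testable.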

James Hamilton explains Datacenter Costs

[link]

tags: cloud computing, energy, james hamilton, microsoft, operations, performance, platforms, utilities, utility computing, velocity, velocity09, web2.0 | comments: 8

 

Tue

Nov 25
2008

Jim Stogdill

My Web Doesn't Like Your Enterprise, at Least While it's More Fun

by Jim Stogdill @jstogdill | comments: 20

The other day Jesse posted a call for participation for the next Velocity Web Operations Conference. My background is in the enterprise space, so, despite Velocity's web focus, I wondered if there might not be interest in a bit of enterprise participation. After all, enterprise data centers deal with the same "Fast, Scalable, Efficient, and Available" imperatives. I figured there might be some room for the two communities to learn from each other. So, I posted to the internal Radar authors' list to see what everyone else thought.

Mostly silence. Until Artur replied with this quote from one of his friends employed at a large enterprise: "What took us a weekend to do, has taken 18 months here." That concise statement seems to sum up the view of the enterprise, and I'm not surprised. For nearly six years I've been swimming in the spirit-sapping molasses that is the Department of Defense IT Enterprise so I'm quite familiar with the sentiment. I often express it myself.

We've had some of this conversation before at Radar. In his post on Enterprise Rules, Nat used contrasting frames of reference to describe the web as your loving dear old API-provisioning Dad, while the enterprise is the belt-wielding standing-in-the-front-door-when-you-come-home-after-curfew stepfather.

While I agree that the enterprise is about control and the web is about emergence (I've made the same argument here at Radar), I don't think this negative characterization of the enterprise is all that useful. It seems to imply that the enterprise's orientation toward control springs fully formed from the minds of an army of petty controlling middle managers. I don't think that's the case.

I suspect it's more likely the result of large scale system dynamics, where the culture of control follows from other constraints. If multiverse advocates are right and there are infinite parallel universes, I bet most of them have IT enterprises just like ours; at least in those shards that have similar corporate IT boundary conditions. Once you have GAAP, Sarbox, domain-specific regulation like HIPAA, quarterly expectations from "The Street," decades of MIS legacy, and the talent acquisition realities that mature companies in mature industries face, the strange attractors in the system will pull most of those shards to roughly the same place. In other words, the IT enterprise is about control because large businesses in mature industries are about control. On the other hand, the web is about emergence because in this time, place, and with this technology discontinuity, emergence is the low energy state.

Also, as Artur acknowledged in a follow-up email to the list, no matter what business you're in, it's always more fun to be delivering the product than to be tucked away in a cost center. On the web, bits are the product. In the enterprise, bits are squirreled away in a supporting cost center that always needs to be ten percent smaller next year.

(continue reading)

tags: operations, web2.0 | comments: 20

 

Thu

Nov 20
2008

Jesse Robbins

Velocity 2009: Themes, ideas, and call for participation...

by Jesse Robbins @jesserobbins | comments: 0

Last year's Velocity conference was an incredible success. We expected around 400 people and we ended up maxing out the facility with over 600. This year we're moving the conference to a bigger space and extending it to 3 days to accommodate workshops and longer sessions. Velocity 2009 will be on June 22-24th, 2009 at the Fairmont Hotel in San Jose, CA.

This year's conference will be especially important. I've said many times that Web Performance and Operations is critical to the success of every company that depends on the web. In the current economic situation, it's becoming a matter of survival. The competitive advantage comes from the ability to do two things:

  1. Generate more revenue with fewer resources
  2. Respond quickly to change

Our Velocity 2009 mantra is "Fast, Scalable, Efficient, Available", a slight change from last year. (We've replaced "Resilient" with "Efficient" to make the focus clear.)

I'm excited to announce that joining Steve Souders & me on this year's program committee are John Allspaw, Artur Bergman, Scott Ruthfield, Eric Schurman, and Mandi Walls. We've already started working on the program, and have just opened the Call for Participation.

(continue reading)

tags: artur bergman, conferences, Eric Schurman, John Allspaw, mandi walls, operations, performance, scott ruthfield, steve souders, velocity, velocity09, web2.0, webops | comments: 0

 

Sat

Nov 1
2008

Jesse Robbins

DisasterTech: "Decisions for Heroes"

by Jesse Robbins @jesserobbins | comments: 2

One of the most interesting DisasterTech projects I've been following is "Decisions for Heroes" led by developer and Irish Coast Guard volunteer Robin Blandford.

Decisions is like Basecamp for volunteer Search & Rescue teams. The focus is on providing "just enough" process to complement the real-world workflow of a rescue team, without unnecessary complexity. One of Robin's design goals is that:

User requirements are nil. Nobody likes reading manuals - if we have to write one, we've gotten too complicated.

This is the winning approach for building systems that "serve those that serve others", and is echoed by InSTEDD's design philosophy and the Sahana disaster management system.

Teams begin by entering their responses to incidents and training exercises. They then tag them with things like the weather conditions, the tools and skills required, and who from the team was deployed.

As a team's incident database grows, this information can be used to show heatmaps and provide powerful insight into the locations, weather conditions, and times of year at which various incidents occur. Over time this kind of data could be analyzed in aggregate across multiple teams and regions to create an incredibly powerful resource for Emergency Managers. This is very similar to what Wesabe does for consumers with financial transaction data today (disclosure: OATV investment).


Rescue team members enter training dates and levels. The system tracks certification expiration dates and prompts team members & leaders to plan classes and remain current. This is a huge issue for volunteers who have to manage professional-level training requirements with the demands of a regular career.

As more incidents are entered into the system, it compares the skills required for each of the rescues with the team training exercises. This allows teams to identify areas to focus, train, and develop new skills.
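
The comparison described above amounts to a tally of required skills minus the set of skills covered in training. The Python sketch below is a guess at the idea, not the project's implementation: the record layout, the skill tags, and the function name are all hypothetical.

```python
from collections import Counter

# Hypothetical records: each incident or training session carries skill tags.
INCIDENTS = [
    {"name": "cliff fall", "skills": ["cliff rescue", "first aid"]},
    {"name": "stranded walker", "skills": ["cliff rescue", "navigation"]},
    {"name": "capsized kayak", "skills": ["boat handling", "first aid"]},
]
TRAININGS = [
    {"name": "winter refresher", "skills": ["first aid", "navigation"]},
    {"name": "harbour drill", "skills": ["boat handling"]},
]


def skill_gaps(incidents, trainings):
    """Count skills that incidents required but no training session covered."""
    required = Counter(skill for incident in incidents for skill in incident["skills"])
    trained = {skill for session in trainings for skill in session["skills"]}
    return {skill: count for skill, count in required.items() if skill not in trained}


if __name__ == "__main__":
    for skill, count in skill_gaps(INCIDENTS, TRAININGS).items():
        print(f"{skill}: needed in {count} incidents, never trained")
```

Keeping the count alongside each missing skill gives a team a simple way to decide which training to schedule first.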


This is an innovative project with tremendous potential, and hopefully an early signal of coming changes in Emergency Management.

(Note: "How to Serve those that Serve Others" will be the theme of my "High Order Bit" session at the Web2.0 Summit. I'll be sure to post video/slides/notes when they are available.)

tags: disaster tech, disastertech, emergency management, firefighting, humanitarian aid, ict, innovation, operations, rescue, social networking, web 2.0, webops | comments: 2