Entries tagged with “internet” from O'Reilly Radar
Bing's Sanaz Ahari on Query Level Categorization (1 of 2)
by Brady Forrest | @brady | comments: 0
A couple of weeks ago Bing had a small search summit for analysts, bloggers, SEO experts, entrepreneurs and advertisers. It was held in Bellevue; they put us up in the hotel and fed us. While there we received demos from Bing project teams. I was able to snag an interview with Sanaz Ahari, Lead PM on Bing. She led the team that developed the categories you see on a Bing web search. The interview was based on the slides from her presentation at the event. I have posted the significant images from her slides. The first portion of the interview focuses on how the Bing team handles Query level categorization and some of the problems they face. The second portion (up shortly) focuses on the systems used to generate the categorization.
Disclosure: I was on the MSN Search team (now the Bing team) from 2004 to March 2006. I knew Sanaz at that time.
Brady Forrest: Hi, this is Brady Forrest with O'Reilly Radar, and I'm here with Sanaz Ahari, Lead PM on Bing Search. And she's going to lead us through the categorization process that you see on every page. Hey, Sanaz.
Sanaz Ahari: Hey, Brady. So I'm going to walk you through basically kind of just the journey that we went through for coming up with our categorized experience. And so the categorized experience is basically the left rail experience that you see on Bing today. It doesn't show up for every single query today, but when it does show up, it's really about helping the users complete their task essentially. So just to take a step back, when we started on the project, we had done a lot of analysis on queries just in vacuum. And queries are always a part of users completing a task. And in a lot of the analysis we did, we noticed that a lot of the tasks are common. And it's really just common sense. When you're looking for a car, you're either researching it; you already own it; you want to buy one. When you're looking for a musician, you want to see if they're on tour; you want lyrics, songs, albums, et cetera.
And so our challenge was can we apply some of that essentially structured aspect to queries. And this is really similar to what you see on sites like Amazon, IMDB, et cetera. They do just a really kick ass job of categorizing their content. The challenge is that A, those sites are really about one domain. And then B, those sites are really operating on top of already structured data. And so the challenge that we have with search is that A, we are a general purpose search engine, and then B, the data that we have is not structured. So the goal that we started out with was we wanted to start very simple. And categorization and clustering, et cetera, are nothing really new in the search space. There are a lot of people that have been working in this area for years in the research and computer science space.
So what we started out with was one of the key things that we wanted were two principles. One of them was A, can we achieve aspects and categories that were really, really user intuitive. And B, can we achieve this across a query class. One of the things that we really wanted was in order for us to build a habit for our users, we needed to deliver a predictable consistent experience across a query class. So if I went and told my dad, "Hey, Dad, try any car," I really want him to get a categorized experience for any car. So those are the two kind of constraints that we really set for ourselves. We said, "Unless we meet these two criteria, it's not really successful." And so we started out with a lot of prototyping around, "Hey, can we actually extract intent from queries?" So we started from the intent aspect. And I'll walk you through an example just to show you a simplistic view and how it gets very easily complicated.
So in the example that you see here, we started out with musicians. So with musicians as a whole, the categories and the tasks essentially that the users do generically are fairly straightforward, you know, people want lyrics, songs, tabs, tour dates, ring tones, et cetera, and the list goes on.
Brady Forrest: And are musicians judged as a category?
Sanaz Ahari: Yes, so musicians here is, for example, a category. Yes. Now this is fairly -- what I would say, it's a fairly meaty high-level category though, because as you dig in deep, there are a lot of different attributes about musicians. So the three different examples I have here are -- well, two of them are my favorite bands, but not J Lo exactly. And they kind of cover a wide range. So you've got Jennifer Lopez (Bing search) and she's a pop musician, but she's one of those people that does a whole bunch of other things as well. You've got Gotan Project (Bing search), a little bit more tail. And they're a trip hop band. And then third, you've got Rodrigo y Gabriela (Bing search) who are more of rhythmic guitarists. And you can think about all different sorts of attributes. You've got musicians that may not be alive anymore, et cetera. So there's all sorts of different attributes that fall out of even just a single musician's class. And so in this example, ideally, you should nail the right categories that apply to these three different examples.
So in one case, you've got the guitarists; ideally for this case, you know, tabs are pretty relevant. Lyrics definitely don't make a whole lot of sense. And then you've got J Lo and she is multifaceted, and we should really try and capture most of her facets. She's a fashion designer. She's an actress, and she's a musician, et cetera. So this shows you kind of the types of problems that we have to solve. A is a query might fall under different classes. B is that even if you're under a single class, the intent from that class, it may not be the same. And then there's the problem of head queries and tail queries, ones where we have a lot of data and ones where we don't have a lot of data. So from here on, we go on to basically our approach for solving this problem. I should say that this is an area where we had a brilliant set of folks working on it. We collaborated pretty closely with research. We had a brilliant set of engineers working on it. And the model that we converged on is one where we basically do category level inference as well as query level inference. So in this case, in the category level, we want to figure out -- given a class of queries that are all similar, what are the top things that users are interested in?
In this case, our algorithms basically we used a whole bunch of different features, everything from query clustering, query clicks, session analysis, document extraction, contextual analysis, et cetera. And all of these things were things that we -- the features that we added were based on -- we did a lot of quick iteration to figure out what is good; what is bad and then where do we fall short to figure out what are the extra things that we really need to add in to our algorithm. So measurements was a very key process into our system because we really, really wanted to achieve categories that the users could make a lot of sense out of.
So algorithms don't often give you things that users really understand. So we really, really wanted to deliver things that it made sense to the users. And then on the second level, we really wanted to understand everything about just a query standalone as much as possible. And this is to balance the whole, "Okay. What are the top things people care about in a whole category?" If I've got this bag of categories that users care about, now how do I pick the right ones that only apply to this query? And that is why we had an approach at a category level and also at our query level. Lastly, we did a lot of work around determining if we know that a query is in a category, is that actually the primary intent for that query. So, I don't know, like traffic may be a movie, but a lot of users when they type in traffic, they actually are just looking for how bad is the traffic right now. And that's an example of a query, even though belonging to a category, it may be an obscure intent.
Lastly, we have our ranking model. And our ranking model basically takes all of the different inputs at the category level and at the query level in order to do some modeling around what are the top intents that apply to our re-query. And, of course, we have a very tight feedback loop system from what are the things that users engage with to feed back into the ranking of the categories as well as discovering new ones.
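The two-level idea described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the interview does not say how Bing's ranking model works, so the linear blend below is a stand-in, the only feature used is a count of query refinements (Bing drew on many more), and every name (`CLASS_LOGS`, `category_level_intents`, `rank_intents`, `category_weight`, `min_score`) and every number is hypothetical.

```python
from collections import Counter

# Hypothetical refinement logs for one query class ("musicians").
# Each query maps to the follow-up intents users showed for it.
CLASS_LOGS = {
    "jennifer lopez": ["songs", "songs", "lyrics", "tour dates", "lyrics"],
    "rodrigo y gabriela": ["tabs", "tabs", "tabs", "tour dates"],
    "gotan project": [],  # tail query: no data of its own
}


def category_level_intents(class_logs, top_n=5):
    """Category level: what do users want across the whole class?"""
    counts = Counter()
    for refinements in class_logs.values():
        counts.update(refinements)
    total = sum(counts.values())
    return {intent: n / total for intent, n in counts.most_common(top_n)}


def rank_intents(query, class_logs, category_weight=0.3, min_score=0.15):
    """Query level: pick the class intents that fit this particular query."""
    prior = category_level_intents(class_logs)
    query_counts = Counter(class_logs.get(query, []))
    query_total = sum(query_counts.values())
    # A tail query with no data falls back entirely on the category prior.
    weight = category_weight if query_total else 1.0
    scores = {}
    for intent, class_share in prior.items():
        query_share = query_counts[intent] / query_total if query_total else 0.0
        scores[intent] = weight * class_share + (1 - weight) * query_share
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Drop intents that score too low to be worth showing for this query.
    return [intent for intent, score in ranked if score >= min_score]
```

With this toy data, `rank_intents("rodrigo y gabriela", CLASS_LOGS)` keeps tabs and tour dates and drops lyrics, while the tail query `"gotan project"` simply inherits the class-wide intents. A real system would also need the primary-intent check mentioned above (the "traffic" case), which this sketch leaves out.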
Brady Forrest: And how fast do you have to make this calculation for each query?
Sanaz Ahari: I mean it's all pretty fast because we are scaling through millions of queries. So there's a combination of things for performance optimizations, we do some things offline and we do some things online. For things that don't change a lot and it makes sense for us to do it offline, we try to optimize it. But it's definitely a combination of the two. And our goal is with users, performance is just an expectation. So that's something that we can't compromise on. So everything happens in a matter of milliseconds basically for all of our computations.
Brady Forrest: And how much are you able to cache in case suddenly a query starts to trend up?
Sanaz Ahari: Right. For a lot of our head queries, we definitely do a lot of caching, et cetera. And for real time spiky things, we have invested in an entirely different system where we're constantly monitoring for spiky trends. So it's basically the two systems are basically kind of optimized both individually so that we always are aware of what are the things that are all of a sudden spiking a lot. And then being smart about the things that have already been -- you know, that are head queries, that people are re-querying for.
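As a rough illustration of the split described here (precomputed results for stable head queries, plus a separate monitor for sudden spikes), the sketch below shows one simple way such a pairing could look. It is not Bing's design: the windowed-count heuristic, the thresholds, and the names `SpikeMonitor`, `categories_for`, `offline_table`, and `compute_online` are all assumptions made for the example.

```python
from collections import Counter, deque


class SpikeMonitor:
    """Flags queries whose count in the current window far exceeds
    their average over the previous few windows (a hypothetical heuristic)."""

    def __init__(self, history=3, ratio=3.0, min_count=5):
        self.past = deque(maxlen=history)  # counts from earlier windows
        self.current = Counter()           # counts for the open window
        self.ratio = ratio
        self.min_count = min_count

    def record(self, query):
        self.current[query] += 1

    def close_window(self):
        """Close the current window and return the set of spiking queries."""
        spiking = set()
        if self.past:  # with no history yet there is nothing to compare against
            for query, count in self.current.items():
                baseline = sum(w[query] for w in self.past) / len(self.past)
                if count >= self.min_count and count > self.ratio * max(baseline, 1.0):
                    spiking.add(query)
        self.past.append(self.current)
        self.current = Counter()
        return spiking


def categories_for(query, offline_table, spiking, compute_online):
    """Serve precomputed categories unless the query is missing or spiking."""
    if query in offline_table and query not in spiking:
        return offline_table[query]
    return compute_online(query)
```

In use, the serving path would call `record` for each incoming query, call `close_window` on a timer, and pass the returned set to `categories_for` so that a suddenly popular query bypasses stale precomputed results.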
The second portion of this interview will be posted shortly.
tags: bing, internet, sanaz ahari, search
Four short links: 25 May 2009
by Nat Torkington | @gnat | comments: 6
- China is Logging On -- blogging 5x more popular in China than in USA, email 1/3 again as popular in USA as China. These figures are per-capita of Internet users, and make eye-opening reading. (via Glyn Moody)
- The Economics of Google (Wired) -- the money graf is Google even uses auctions for internal operations, like allocating servers among its various business units. Since moving a product's storage and computation to a new data center is disruptive, engineers often put it off. "I suggested we run an auction similar to what the airlines do when they oversell a flight. They keep offering bigger vouchers until enough customers give up their seats," Varian says. "In our case, we offer more machines in exchange for moving to new servers. One group might do it for 50 new ones, another for 100, and another won't move unless we give them 300. So we give them to the lowest bidder—they get their extra capacity, and we get computation shifted to the new data center."
- Why Washington Doesn't Get New Media -- Things eventually improved, but despite the stunning advances in communications technology, most of federal Washington has still failed to grasp the meaning of Government 2.0. Indeed, much is mired in Government 1.5. Government 1.5? That’s a term of art for the vast virtual ecosystem taking root in Washington that has set up the trappings of 2.0 — the blogs, the Facebook pages, the Twitter accounts — but lacks any intellectual heartbeat. Too many aides in official Washington are setting up blogs and social media pages because they understand that is what they are supposed to do. All the while, many are sweating the possibility that they might actually have to say something substantive or engage the public directly. It is the nature of midlevel know-nothings to grinfuck any idea that would force them to substantially change their behaviour. We incentivize this when we talk about "you must have a blog" (ok, I'll get comms to write it), or "put up a wiki for this" (ok, but there'll be no moderation so it'll be ignorable chaos). Describe the behaviour you want and not a tool that might produce it. (via timoreilly on Twitter)
- On the Information Armageddon (Mind Hacks) -- Vaughan points out that the much-linked-to New York Magazine article on attention is a crock. I didn't like it because it was wordy and self-indulgent, Vaughan because it didn't actually cite any studies other than one which was described incorrectly. History has taught us that we worry about widespread new technology and this is usually expressed in society in terms of its negative impact on our minds and social relationships. If you're really concerned about cognitive abilities, look after your cardiovascular health (eat well and exercise), cherish your relationships, stay mentally active and experience diverse and interesting things. All of which have been shown to maintain mental function, especially as we age.
tags: attention, brain, china, democracy, economics, google, government, internet, web
It's Really Just a Series of Tubes
by Jesse Robbins | @jesserobbins | comments: 12
Molly Wright Steenson hit the Ignite jackpot at Etech this year with her explanation of the steam powered network of pneumatic tubes of the 1800s. If you're someone that, like me, has a somewhat obsessive relationship with Internet Infrastructure, you must watch this talk.
tags: etech, ignite, ignite show, infrastructure, internet, steam, steampunk, tubes, velocity, velocity09, velocityconf, web2.0
Technology, Politics and Democracy
by Joshua-Michéle Ross | @jmichele | comments: 3
Recently I spoke with Jascha Franklin-Hodge, CTO and co-founder of Blue State Digital about how technology is affecting politics and democracy in the U.S.
Blue State Digital was born out of Jascha's experience helping Howard Dean’s seminal run for the White House in ’04 and is the technology and strategic services company powering Barack Obama (and many other Democratic leaders and social justice causes like Save Darfur and We Can Solve It).
These videos (there are three total) are timely in light of the staggering September figures from the Obama campaign:
- 630,000 new donors (bringing total donors to 3.1 million)
- 150 million dollars raised
- Average contribution: $86
Here are a few observations I took away from our conversation:
Online U.S. political communities will morph from a campaign fundraising role to a governing role. Regardless of whether Obama or McCain wins in November, every 2012 political campaign, even the laggards, will be as sophisticated as Obama's is today, and any campaign with that much momentum won’t be able to stop community participation at the White House door or the Capitol steps (“thanks for all the money and support, I‘ll see you in four years”). Online communities will follow politicians into their governing roles. This became clear this summer, when Obama experienced the FISA revolt within his own MyBarackObama community. This has far more transformative potential than the fundraising juggernaut we are seeing now. Powerful communities may come to dominate the agenda of incumbent politicians, providing feedback, direction and policy input.
tags: future at work, internet, politics, videos
The Corporation's Two Bodies
by Mike Loukides | @mikeloukides | comments: 16
The New York Times quotes Laura Martin of Soleil Securities as saying "This is management putting its employees and its job security ahead of current Yahoo shareholders' interest." The sense of horror here--that management could actually put the interests of employees ahead of the interests of investors--is interesting, to say the least. It raises an important question that's really almost theological in nature. It is most certainly theological in, as Lawrence Ferlinghetti wrote, "the promised land where every coin is marked In God We Trust, but the dollar bills do not have it being gods unto themselves." ("Autobiography," A Coney Island of the Mind, 1958, New Directions)
tags: ferlinghetti, finance, internet, microsoft, web, web 2.0, yahoo