Entries tagged with “search” from O'Reilly Radar

Fri, Nov 20, 2009

Robots.Txt and the .Gov TLD

by Carl Malamud (@CarlMalamud), comments: 4

I'm on the board of CommonCrawl.Org, a nonprofit corporation that is attempting to provide a web crawl for use by all. An interesting report just got sent to us about how sites within the .Gov Top Level Domain use robots.txt files, the mechanism defined by the Robots Exclusion Standard.

In examining about 32,000 subdomains in .gov, it turns out that at least 1,188 of these have a robots.txt file with a "global disallow," meaning robots are excluded from indexing this content. Even more curious, on 175 of these sites, while there is a global disallow, there is a specific bypass that allows the Googlebot to index the data. You can look at the raw data on Factual.

At Public.Resource.Org, we've always felt that a robots.txt file on a government site should be used only for purposes of security and integrity of the site, not because some webmaster arbitrarily decides they don't want to be indexed. Indeed, on several occasions we have deliberately ignored government-imposed robots.txt files because we felt this was an arbitrary and illegal attempt to keep the public out.

And, needless to say, it doesn't make any sense at all to let in some webcrawlers and not let in others. If this is a reaction to a security/integrity issue, such as limited capacity, the proper thing to do is include in the robots.txt file a comment that explains to other bots what is going on. For example, it could be perfectly reasonable for a government group faced with limited capacity to ask a robot to limit crawls to a certain number of queries per second and only whitelist crawlers that agree to that condition.
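To make the two patterns concrete, here is a minimal sketch in Python using the standard library's urllib.robotparser. The two sample robots.txt bodies, the function name classify_robots, and the probe agent name ExampleCrawler are illustrative assumptions, not data from the report, and the check only probes the site root, so treat it as a rough heuristic rather than the method the report used. Crawl-delay is a widely recognized but non-standard extension to the Robots Exclusion Standard.

```python
from urllib.robotparser import RobotFileParser


def classify_robots(robots_txt, generic_agent="ExampleCrawler"):
    """Summarize how a robots.txt body treats a generic crawler versus Googlebot.

    Only the site root ("/") is probed, which is a simplifying assumption.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    generic_ok = parser.can_fetch(generic_agent, "/")
    googlebot_ok = parser.can_fetch("Googlebot", "/")

    return {
        # A crawler with no rule of its own is shut out of the whole site.
        "global_disallow": not generic_ok,
        # Everyone is shut out except Googlebot.
        "googlebot_bypass": (not generic_ok) and googlebot_ok,
        # Requested delay in seconds for the generic crawler, if any.
        "crawl_delay": parser.crawl_delay(generic_agent),
    }


# Hypothetical file: global disallow with a Googlebot-only exception.
DISCRIMINATORY = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

# Hypothetical file: same rules for every crawler, with the reason stated.
NON_DISCRIMINATORY = """\
# Limited server capacity: please keep to one request every 10 seconds.
User-agent: *
Crawl-delay: 10
Disallow: /cgi-bin/
"""

if __name__ == "__main__":
    print(classify_robots(DISCRIMINATORY))
    print(classify_robots(NON_DISCRIMINATORY))
```

The first sample is the kind of file that would be counted among the 175 sites; the second shows the non-discriminatory alternative, with one rule set for all crawlers, a comment explaining the capacity problem, and a rate limit any well-behaved bot can honor.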

Government webmasters should use the robots.txt file sparingly, and should do so in a non-discriminatory fashion.

tags: gov2.0, open source, search

Fri, Nov 20, 2009

Four short links: 20 November 2009

Social Network Search for Morons, Bulking Up Bio Data, Better E-Mail, Better Standards

by Nat Torkington (@gnat), comments: 1

  1. Spokeo -- abysmal indictment of society, first prize in mankind's race to the bottom. Uncover personal photos, videos, and secrets ... GUARANTEED! Spokeo deep searches within 48 major social networks to find truly mouth-watering news about friends and coworkers. PS, anybody who gives their gmail username and password to a site that specializes in dishing dirt can only be described as a fucking idiot. (via Jim Stogdill, who was equally disappointed in our species)
  2. Biologists rally to sequence 'neglected' microbes (Nature) -- The Genomic Encyclopedia of Bacteria and Archaea is a project to sequence genomes from more branches of the evolutionary tree of life. Eisen's team selected and sequenced more than 100 'neglected' species that lacked close relatives among the 1,000 genomes already in GenBank. The researchers reported earlier this year at the JGI's Fourth Annual User Meeting that even mapping the first 56 of these microbes' genomes increased the rate of discovery of new gene and protein families with new biological properties. It also improved the researchers' ability to predict the role of genes with unknown functions in already sequenced organisms. (via Jonathan Eisen)
  3. Mail Learning: The What and the How (Simon Cozens) -- a few things that a really good mail analysis tool needs to do. I hope that my mail client and server do these out of the box in the next five years.
  4. Introducing the Open Web Foundation Agreement -- The Open Web Foundation Agreement itself establishes the copyright and patent rights for a specification, ensuring that downstream consumers may freely implement and reuse the licensed specification without seeking further permission. In addition to the agreement itself, we also created an easy-to-read "Deed" that provides a high level overview of the agreement. Applying the open source approach to better standards.

tags: bio, data, email, genomics, idiots, opensource, search, social graph, social software, standards

Tue, Oct 13, 2009

Real Time Search with Wowd: A Conversation with CEO Mark Drummond

by Joshua-Michéle Ross (@jmichele), comments: 0

During last year's Summit I had the good fortune to interview Kevin Kelly (see Technology is the Seventh Kingdom of Life). In the interview Kevin made the case that we have only scratched the surface on how to coordinate group activities on the web: there must be hundreds of effective methods to run an auction, crowdsource products etc. We have only scratched the surface so why stop at eBay and Threadless?

So too in the area of search and discovery. As the web moves real-time, it exposes the limitations of reference-based search. Wowd is among the new crop of companies looking to find ways to implement search and discovery in a real-time context.

While Google measures relevance based on PageRank and Digg measures topical relevance based on explicit user action (promoting pieces of news), Wowd is trying to measure attention across the web in real time. Attention can be an implicit indicator of interest and another form of harnessing collective intelligence.


Mark will be doing a High Order Bit on "A Conversational Approach to Search" at the upcoming Web 2.0 Summit.

tags: search

Tue, Aug 25, 2009

Four Short Links: 25 August 2009

Reverse Search, PDF Stripping, Flash Visualization, Failure

by Nat Torkington (@gnat), comments: 1

  1. Tineye -- reverse search engine; you upload an image and they find you similar images so you know where else it's used. Check out their cool searches.
  2. PDF Pirate -- upload a PDF and this web site will give it back to you minus the restrictions on copying/printing/etc.
  3. Flare -- an ActionScript library for creating visualizations that run in the Adobe Flash Player. BSD-licensed, modelled on Prefuse. When there's a visualisation library for every platform, will we start to get people who know how to make them?
  4. The Importance of Failure (Marco Tabini) -- This is a point that I don't often hear made when people talk about failure; the moral behind a failure-related story is usually about preventing it, or dealing with the aftermath, but not about the fact that sometimes things go bad despite your best efforts, and all the careful risk management and contingency planning won't keep you from going down in flames. This is important, because it forces every person to establish a risk threshold that they are willing to accept in every one of their life efforts.

tags: drm, failure, failure happens, flash, publishing, search, visualization

Fri, Aug 7, 2009

Four short links: 7 August 2009

Recovery.gov, Meme tracking, RFID Scans, Open Source Search Engines

by Nat Torkington (@gnat), comments: 1

  1. Defragging the Stimulus -- each [recovery] site has its own silo of data, and no site is complete. What we need is a unified point of access to all sources of information: firsthand reports from Recovery.gov and state portals, commentary from StimulusWatch and MetaCarta, and more. Suggests that Recovery.gov should be the hub for this presently-decentralised pile of recovery data.
  2. Memetracker -- site accompanying the research written up by the New York Times as Researchers at Cornell, using powerful computers and clever algorithms, studied the news cycle by looking for repeated phrases and tracking their appearances on 1.6 million mainstream media sites and blogs [...] For the most part, the traditional news outlets lead and the blogs follow, typically by 2.5 hours [...] a relative handful of blog sites are the quickest to pick up on things that later gain wide attention on the Web. Confirming that blogs and traditional media have a symbiotic relationship, not a parasitic one. (via Stats article in NY Times)
  3. Feds at DefCon Alarmed After RFIDs Scanned (Wired) -- RFID badges make for convenient security, and for convenient attack. Black hats can read your security cards from 2 or 3 feet away, and few in government are aware of the attack vector. To help prevent surreptitious readers from siphoning RFID data, a company named DIFRWear was doing brisk business at DefCon selling leather Faraday-shielded wallets and passport holders lined with material that prevents readers from sniffing RFID chips in proximity cards.
  4. A Comparison of Open Source Search Engines and Indexing Twitter -- Detailed write-up of the open source search options and how they stack up on a pile of Tweets. While researching for the Software section, I was quite surprised by the number of open source vertical search solutions I found: Lucene (Nutch, Solr, Hounder), Sphinx, zettair, Terrier, Galago, Minnion, MG4J, Wumpus, RDBMS (mysql, sqlite), Indri, Xapian, grep … And I was even more surprised by the lack of comparisons between these solutions. Many of these platforms advertise their performance benchmarks, but they are in isolation, use different data sets, and seem to be more focused on speed as opposed to say relevance. (via joshua on Delicious)

tags: big data, gov2.0, meme wars, open source, privacy, rfid, search, security, transparency, twitter, visualization

Tue, Jul 14, 2009

Four short links: 14 July 2009

Twenty Questions, CC Pix, INSERT INTO WEB, and Wash Your Hands!

by Nat Torkington (@gnat), comments: 2

  1. Twenty Questions about GPLv3 (Jacob Kaplan-Moss) -- twenty very challenging questions about the GPLv3. foo.js is a JavaScript library released under the GPLv3. bar.js is a library with all rights reserved. For performance reasons, I would like to minimize all my site’s JavaScript into a single compressed file called foobar.js. If I distribute this file, must I also distribute bar.js under the GPL?
  2. CC Searching within Google Image Search -- what it seems. (via waxy)
  3. YQL INSERT INTO -- insert into {table} (status,username,password) values ("new tweet from YQL", "twitterusernamehere","twitterpasswordhere"). That's too cool. (via Simon Willison)
  4. CleanWell -- very low-cost recyclable enviro-friendly antimicrobials to battle third-world disease. Met the founder at Sci Foo. He said women wash hands more than men, because women enter bathrooms in pairs. Single easiest way to increase handwashing compliance is to put sinks and basins outside the room, in public view.

tags: copyright, creative commons, google, licensing, medicine, opensource, psychology, search, software, yahoo, yql

Fri, Jul 10, 2009

Four short links: 10 July 2009

Network File System, Internet Use, Lovelace Comic, Search User Interfaces

by Nat Torkington (@gnat), comments: 0

  1. Ceph -- open source distributed filesystem from UCSC. Ceph is built from the ground up to seamlessly and gracefully scale from gigabytes to petabytes and beyond. Scalability is considered in terms of workload as well as total storage. Ceph is designed to handle workloads in which tens of thousands of clients or more simultaneously access the same file, or write to the same directory: usage scenarios that bring typical enterprise storage systems to their knees. (via joshua on delicious)
  2. Daily Internet Activities, 2000-2009 -- Pew Charitable Trust's Internet usage survey. We've finally broken 50% of Americans using the Internet daily. Twitter is almost a rounding error. (via dhowell on Twitter)
  3. The Thrilling Adventures of Lovelace and Babbage -- fantastic comic, with end-notes that explain how Babbage and Lovelace's lives and works are reflected in the action of the comic. (via suw on Twitter)
  4. Search User Interfaces -- full text of this book about the different (successful and un-) interfaces to search. (via sebchan on Twitter)

tags: history, scale, search, twitter, ui

Mon, Jun 29, 2009

Bing's Sanaz Ahari on Query Level Categorization (1 of 2)

by Brady Forrest (@brady), comments: 0

A couple of weeks ago Bing had a small search summit for analysts, bloggers, SEO experts, entrepreneurs and advertisers. It was held in Bellevue; they put us up in the hotel and fed us. While there we received demos from Bing project teams. I was able to snag an interview with Sanaz Ahari, Lead PM on Bing. She led the team that developed the categories you see on a Bing web search. The interview was based on the slides from her presentation at the event. I have posted the significant images from her slides. The first portion of the interview focuses on how the Bing team handles Query level categorization and some of the problems they face. The second portion (up shortly) focuses on the systems used to generate the categorization.

Disclosure: I was on the MSN Search team (now the Bing team) from 2004 to March 2006. I knew Sanaz at that time.

[Image: bing j-lo]

[Image: bing musicians]

Brady Forrest: Hi, this is Brady Forrest with O'Reilly Radar, and I'm here with Sanaz Ahari, Lead PM on Bing Search. And she's going to lead us through the categorization process that you see on every page. Hey, Sanaz.

Sanaz Ahari: Hey, Brady. So I'm going to walk you through basically kind of just the journey that we went through for coming up with our categorized experience. And so the categorized experience is basically the left rail experience that you see on Bing today. It doesn't show up for every single query today, but when it does show up, it's really about helping the users complete their task essentially. So just to take a step back, when we started on the project, we had done a lot of analysis on queries just in vacuum. And queries are always a part of users completing a task. And in a lot of the analysis we did, we noticed that a lot of the tasks are common. And it's really just common sense. When you're looking for a car, you're either researching it; you already own it; you want to buy one. When you're looking for a musician, you want to see if they're on tour; you want lyrics, songs, albums, et cetera.

And so our challenge was can we apply some of that essentially structured aspect to queries. And this is really similar to what you see on sites like Amazon, IMDB, et cetera. They do just a really kick ass job of categorizing their content. The challenge is that A, those sites are really about one domain. And then B, those sites are really operating on top of already structured data. And so the challenge that we have with search is that A, we are a general purpose search engine, and then B, the data that we have is not structured. So the goal that we started out with was we wanted to start very simple. And categorization and clustering, et cetera, are nothing really new in the search space. There are a lot of people for years that have been working around the space in the research and computer science space.

So what we started out with was one of the key things that we wanted were two principles. One of them was A, can we achieve aspects and categories that were really, really user intuitive. And B, can we achieve this across a query class. One of the things that we really wanted was in order for us to build a habit for our users, we needed to deliver a predictable consistent experience across a query class. So if I went and told my dad, "Hey, Dad, try any car," I really want him to get a categorized experience for any car. So those are the two kind of constraints that we really set for ourselves. We said, "Unless we meet these two criteria, it's not really successful." And so we started out with a lot of prototyping around, "Hey, can we actually extract intent from queries?" So we started from the intent aspect. And I'll walk you through an example just to show you a simplistic view and how it gets very easily complicated.

So in the example that you see here, we started out with musicians. So with musicians as a whole, the categories and the tasks essentially that the users do generically are fairly straightforward, you know, people want lyrics, songs, tabs, tour dates, ring tones, et cetera, and the list goes on.

Brady Forrest: And are musicians judged as a category?

Sanaz Ahari: Yes, so musicians here is, for example, a category. Yes. Now this is fairly -- what I would say, it's a fairly meaty high-level category though, because as you dig in deep, there are a lot of different attributes about musicians. So the three different examples I have here are -- well, two of them are my favorite bands, but not J Lo exactly. And they kind of cover a wide range. So you've got Jennifer Lopez (Bing search) and she's a pop musician, but she's one of those people that does a whole bunch of other things as well. You've got Gotan Project (Bing search), a little bit more tail. And they're a trip hop band. And then third, you've got Rodrigo y Gabriela (Bing search) who are more of rhythmic guitarists. And you can think about all different sorts of attributes. You've got musicians that may not be alive anymore, et cetera. So there's all sorts of different attributes that fall out of even just a single musician's class. And so in this example, ideally, you should nail the right categories that apply to these three different examples.

So in one case, you've got the guitarists. Ideally for this case, you know, tabs are pretty relevant. Lyrics definitely don't make a whole lot of sense. And then you've got J Lo and she is multifaceted, and we should really try and capture most of her facets. She's a fashion designer. She's an actress, and she's a musician, et cetera. So this shows you kind of the types of problems that we have to solve. A is a query might fall under different classes. B is that even if you're under a single class, the intent from that class, it may not be the same. And then there's the problem of head queries and tail queries, ones where we have a lot of data for and ones where we don't have a lot of data for. So from here on, we go on to basically our approach for solving this problem. I should say that this is an area where we had a brilliant set of folks working on it. We collaborated pretty closely with research. We had a brilliant set of engineers working on it. And the model that we converged on is one where we basically do category level inference as well as query levels. So in this case, in the category level, we want to figure out -- I've given a class of queries that are all similar. What are the top things that users are interested in?

[Image: bing gotan project]

In this case, our algorithms basically we used a whole bunch of different features, everything from query clustering, query clicks, session analysis, document extraction, contextual analysis, et cetera. And all of these things were things that we -- the features that we added were based on -- we did a lot of quick iteration to figure out what is good; what is bad and then where do we fall short to figure out what are the extra things that we really need to add in to our algorithm. So measurements was a very key process into our system because we really, really wanted to achieve categories that the users could make a lot of sense out of.

So algorithms don't often give you things that users really understand. So we really, really wanted to deliver things that it made sense to the users. And then on the second level, we really wanted to understand everything about just a query standalone as much as possible. And this is to balance the whole, "Okay. What are the top things people care about in a whole category?" If I've got this bag of categories that users care about, now how do I pick the right ones that only apply to this query? And that is why we had an approach at a category level and also at our query level. Lastly, we did a lot of work around determining if we know that a query is in a category, is that actually the primary intent for that query. So, I don't know, like traffic may be a movie, but a lot of users when they type in traffic, they actually are just looking for how bad is the traffic right now. And that's an example of a query, even though belonging to a category, it may be an obscure intent.

[Image: bing rodrigo]

Lastly, we have our ranking model. And our ranking model basically takes all of the different inputs at the category level and at the query level in order to do some modeling around what are the top intents that apply to our re-query. And, of course, we have a very tight feedback loop system from what are the things that users engage with to feed back into the ranking of the categories as well as discovering new ones.

Brady Forrest: And how fast do you have to make this calculation for each query?

Sanaz Ahari: I mean it's all pretty fast because we are scaling through millions of queries. So there's a combination of things for performance optimizations, we do some things offline and we do some things online. For things that don't change a lot and it makes sense for us to do it offline, we try to optimize it. But it's definitely a combination of the two. And our goal is with users, performance is just an expectation. So that's something that we can't compromise on. So everything happens in a matter of milliseconds basically for all of our computations.

Brady Forrest: And how much are you able to cache in case suddenly a query starts to trend up?

Sanaz Ahari: Right. For a lot of our head queries, we definitely do a lot of caching, et cetera. And for real time spiky things, we have invested in an entire different system where we're constantly monitoring for spiky trends. So it's basically the two systems are basically kind of optimized both individually so that we always are aware of what are the things that are all of a sudden spiking a lot. And then being smart about the things that have already been -- you know, that are head queries, that people are re-querying for.

The second portion of this interview will be posted shortly.

tags: bing, internet, sanaz ahari, search

Thu, Jun 18, 2009

Four short links: 18 June 2009

Weaker Copyright Good, YQL.gov, GeoSPARQL, Happiness

by Nat Torkington (@gnat), comments: 3

  1. Harvard Study Finds Weaker Copyright Protection Has Benefited Society (Michael Geist) -- Given the increase in artistic production along with the greater public access, the authors conclude that "weaker copyright protection, it seems, has benefited society." This is consistent with the authors' view that weaker copyright is "unambiguously desirable if it does not lessen the incentives of artists and entertainment companies to produce new works." (read the original paper)
  2. Using Public Data for Good With the Power of YQL -- The first part is a new batch of YQL tables providing data on the U.S. government, earthquake data, and the non-profit micro-lender Kiva. The second part is an incredibly easy way to render YQL queries on websites. After all, what good is data that no one can see?
  3. GeoSPARQL -- RDF meets geo goodness. SELECT ?s ?p ?o WHERE { ?s gn:name "Dallas" . ?s ?p ?o } (via the geowanking mailing list)
  4. How To Be Happy in Business -- this Venn diagram makes me happy. (via Ned Batchelder)
[Image: happyinbiz.jpg]

tags: copyright, geodata, gov2.0, lifehacks, location, open data, search, semantic web, yahoo

Thu, Jun 11, 2009

Search for Developers

by Brady Forrest (@brady), comments: 0

Vanessa Fox just posted her slides from her talk Diagnosing Technical Issues With Search Engine Optimization. They are packed with handy SEO/SEM suggestions, checklists and resources. It's worth going through at least once.

tags: search, seo

Wed, Jun 3, 2009

Google Squared is an Exponential Improvement in Search

by James Turner, comments: 19

One of the things I've learned about Google is that the most amazing things will come out of them with barely a whisper of fanfare. Such is the case with Google Squared, a new Google Labs tool that was released today. What does Google Squared do? It organizes and tables information from searches for you in a way that makes it much more useful.

For example, the first thing I put into Google Squared was [science fiction conventions], and I got back:
[Image: Picture 5.png]

Not too bad right off the bat, and by clicking on the X boxes, you can remove columns or rows that don't fit. It works even better for things that are very well defined, like [atomic weights of elements]:
[Image: Picture 6.png]

(continue reading)

tags: google, google squared, search

Tue, May 12, 2009

Google Engineering Explains Microformat Support in Searches

by James Turner, comments: 8

Today, Google is releasing support for parsing and display of microformat data in their search results. While the initial launch will be limited to a specific set of partners (including LinkedIn, Yelp and CNet reviews), the intent is that very quickly, anyone who marks their pages up with the appropriate microformat data will be able to make their information understandable by Google. This technology would allow you to explicitly search, for example, for only printers that had an average customer review of 3 stars or higher. Initial support will include things such as:

  • Review Ratings
  • Product Prices
  • Personal Details

We talked this morning with Othar Hansson and RV Guha, two of the Google engineers responsible for the new functionality, and you can listen to them discuss it in this exclusive O'Reilly interview.


JAMES TURNER: Why don't you guys start by introducing yourselves?

OTHAR HANSSON: Sure. I'm Othar Hansson, and I'm a tech lead on this project. And I'm in Google's Search UI Group.

RV GUHA: My name is Guha. I'm an engineer at Google and I do stuff across the board.

JT: So can you describe briefly, to start off, exactly what it is you're releasing today?

RVG: Okay. We are asking webmasters who have pieces of data like reviews or people profiles, and in an experimental form, things like information about organizations and products, to put the structured data representing the content on the webpage in a machine-understandable form on the webpage. Typically, what happens is that if you take a website and having created opinions, I can talk about the context of opinions. You would typically have a database in the back-end which has lots of information about products. People write reviews about them. And you get information such as the number of reviews, the average rating of the reviews, the price of the product, who sells it, et cetera, et cetera, et cetera. It's stored in a structured database in your back-end. You then use some scripts to format it into HTML as per the site's design. Now going from the structured data to the HTML is quite straight-forward. But going from the HTML back to the structured data in a fashion which works across sites is very, very, very hard. Now our search engine doesn't -- it's very difficult for a search engine to understand -- to sort of get back the structured data for all of the sites. Now if it were to understand that, if it were to understand that this is a review site where the product being reviewed is such and such and it has 30 reviews with an average rating of 3.2 and so on and so forth, we could do a better job of the search. In particular, we could do a better job of presenting the two or three lines of text that appeared as part of the search result so that the user has a better idea of what to expect on that page. And from our experiments, it seemed that giving the user a better idea of what to expect on the page increases the click-through rate on the search results. So if the webmasters do this, it's really good for them. They get more traffic. It's good for users because they have a better idea of what to expect on the page. And, overall, it's good for the web.

JT: So in some ways, that's in the same way that right now for certain sites, you'll give the internal structure of the site as part of the search result or for shopping results, you'll give price ranges and things like this. This is just, again, enriching and providing more structured -- more than just a snippet, giving more of a structured display of the information on that page?

RVG: Yes. If we have structured data, we can do lots of things. We're starting off by improving the snippets. It's an absolute no-brainer. It seems to be helping everybody. And, as you know us, we keep playing it on with different ideas and different things. As structured data becomes more prevalent, there's a ton of ideas, both inside Google and outside Google, on how you might improve search.

(continue reading)

tags: google, interviews, microformats, search, seo

Wed, Apr 29, 2009

Four short links: 29 Apr 2009

4chan, urban redesign, 3d printing, python

by Nat Torkington (@gnat), comments: 4

  1. Moot Wins, Time Inc. Loses -- summary of how the 4chan group Anonymous rigged the voting in Time's 100 Most Influential poll to not just put their man at the top, but also spell an in-joke with the initial letters of the first 21 people. Time tried weakly to prevent the vote-rigging, and ReCAPTCHA gave the Internet scallywags their biggest setback, but check out how they automated as much as possible so that human effort was targeted most effectively. It's the same mindset that built Google's project management, ops, and dev systems. Notice how they tried to game ReCAPTCHA, a collective intelligence app whose users train the system to read OCRed words, by essentially outvoting genuine users so that every word was read as "penis". Collective intelligence should never be the only security/discovery/etc. feature because such apps are often vulnerable to coordinated action.
  2. The old mint in downtown SF painted by 7 perfectly mapped HD projectors -- looks absolutely spectacular. I love the combination of permanent and fleeting, architecture and infotexture. (via BoingBoing)
  3. 3-D Printing Hits Rock-bottom Prices With Homemade Ceramics Mix (Science Daily) -- University of Washington researchers invent, and give away, a new 3D printer supply mix that costs under a dollar a pound (versus current commercial mixes of $30-50/pound).
  4. Haystack and Whoosh Notes (Richard Crowley) -- notes on installing the search framework Haystack and the search back-end Whoosh, both pure Python. It's a quick get-up-and-go so you can add quite sophisticated search to your Django apps. (via Simon Willison)

tags: 3d printing, architecture, collective intelligence, programming, python, search, security

Wed, Apr 15, 2009

Practical Tips for Government Web Sites (And Everyone Else!) To Improve Their Findability in Search

by Vanessa Fox, comments: 18

In an earlier post, I said that key to government opening its data to citizens, being more transparent, and improving the relationship between citizens and government in light of our web 2.0 world was ensuring content on government sites could be easily found in search engines. Architecting sites to be search engine friendly, particularly sites with as much content and legacy code as those the government manages, can be a resource-intensive process that takes careful long-term planning. But two keys are:

  • Assessing who the audience is and what they're searching for
  • Ensuring the site architecture is easily crawlable


Crawlability Quick Wins
This post is about quick wins in crawlability. In many cases, ensuring crawlability also ensures accessibility (particularly access via screen readers). From this standpoint, many government web sites have an advantage over other sites since they already build in many accessibility features. Creating search-friendly sites also improves usability and user access from mobile devices and slow connections. So forget everything you may have heard about how you have to sacrifice user experience for SEO. SEO done right facilitates deeper audience engagement, makes it easier for visitors to navigate and find information on the site, and provides access to a wider variety of users.

Use XML Sitemaps
Create XML Sitemaps that list all the pages on the site and submit them to the major search engines.

Why is this important? Many government sites have poor information architecture. Ideally each page of the site should have at least one link to it. This helps users navigate the site and helps search engines find all of the pages. Long term, these sites should revamp their navigational structure so that at least one link exists to every page. Since that may take some time to implement, an XML Sitemap can function in the meantime to provide a list of all pages for search engines to crawl.
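As a rough illustration of what such a file contains, here is a minimal sketch in Python that builds an XML Sitemap from a list of page URLs. The function name `build_sitemap` and the example.gov URLs are placeholders invented for this example; a real site would pull its URL list from a CMS or a crawl, and might add optional fields such as last-modified dates.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return an XML Sitemap document (as a string) listing each URL in `urls`."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in the URL for us.
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical pages; replace with the site's real URL list.
example_pages = [
    "https://www.example.gov/",
    "https://www.example.gov/reports?id=1&year=2009",
]
sitemap_xml = build_sitemap(example_pages)
```

The resulting string can be saved as a file such as sitemap.xml at the site root and then submitted to the search engines.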

Government sites have already made great progress in search by using XML Sitemaps.

The Energy Department's Office of Scientific and Technical Information (OSTI) implemented the XML Sitemaps protocol with great success. "The first day that Yahoo offered up our material for search, our traffic increased so much that we could not keep up with it," said Walt Warnick, OSTI's director.

If possible, provide an HTML sitemap as well, which provides a browsable navigation to site visitors. Below is a good example of a browsable HTML sitemap on nih.gov:

[Image: nih_sitemap.jpg]

Don't block access to content
Make all content available outside of a login, registration form, or other input mechanism. Search engine crawlers can't access content behind a login or registration. If the content requires the visitor to enter an email address or otherwise provide input before accessing it, it won't show up in search results.

(continue reading)

tags: google, search, xml
comments: 18

 

Thu

Apr 2
2009

Nat Torkington

Four short links: 2 Apr 2009

by Nat Torkington (@gnat)
comments: 1

Predictions, PDF, source code control, and recommendation engines:

  1. Wrong Tomorrow -- track pundits' predictions and see how accurate they really are. From the ever-awesome Maciej Ceglowski.
  2. PDFMiner -- Unlike other PDF-related tools, it allows to obtain the exact location of texts in a page, as well as other layout information such as font size or font name, which could be useful for analyzing the document. It also infers text running within a page by using clustering technique. Entirely written in Python.
  3. Migrating from svn to a Distributed VCS -- to decide which distributed VCS to use, Brett Cannon gathered Python use cases and then showed how they'd be done with different dvcses. The result is a very useful comparison document for svn, bzr, git, and hg.
  4. Online Monoculture and the End of the Niche -- interesting post summarising and explaining research into recommendation engines, drawing the conclusion that although Internet World recommendation engines show everybody lots of new stuff, we're all seeing the same new stuff and the end result is less the "riches of niches" Long Tail fantasy and more a monoculture.

tags: collective intelligence, future, open source, programming, python, search
comments: 1

 

Mon

Feb 16
2009

Nat Torkington

Four short links: 16 Feb 2009

by Nat Torkington (@gnat)
comments: 2

A lot of Python and databases today, with some hardware and Twitter pranking/security worries to taste:

  1. Free Telephony Project, Open Telephony Hardware -- professionally-designed mass-manufactured hardware for telephony projects. E.g., IP04 runs Asterisk and has four phone jacks and removable Flash storage. Software, schematics, and PCB files released under GPL v2 or later.
  2. Don't Click Prank Explained -- inside the Javascript prank going around Twitter. Transparent overlays would appear to be dangerous.
  3. Tokyo Cabinet: A Modern Implementation of DBM -- ok, so there's definitely something going on with these alternative databases. Here's the 1979 BTree library reinvented for the modern age, then extended with PyTyrant, a database server for Tokyo Cabinet that offers HTTP REST, memcached, and a simple binary protocol. Cabinet is staggeringly fast, as this article makes clear. And if that wasn't enough wow for one day, Tokyo Dystopia is the full-text search engine. The Tyrant tutorial shows you how to get the server up and running. And what would technology be without a Slideshare presentation? (via Stinky)
  4. Whoosh -- a pure Python fulltext search library.

tags: big data, hardware, javascript, opensource, python, search, security, voip
comments: 2

 

Thu

Jan 22
2009

Vanessa Fox

Making Site Architecture Search-Friendly: Lessons From whitehouse.gov

by Vanessa Fox
comments: 11

Guest blogger Vanessa Fox is co-chair of the new O'Reilly conference Found: Search Acquisition and Architecture. Find more from Vanessa at ninebyblue.com and janeandrobot.com. Vanessa is also entrepreneur in residence at Ignition Partners, and Features Editor at Search Engine Land.

Yesterday, as President-elect Obama became president Obama, we geeky types filled the web with chatter about change. That change of change.gov becoming whitehouse.gov, that is. The new whitehouse.gov robots.txt file opens everything up to search engines while the previous one had 2400 lines! The site has a blog! The fonts are Mac-friendly! That Obama administration sure is online savvy.

Or is it?

An amazing amount of customer acquisition can come from search (a 2007 Jupiter research study found that 92% of online Americans search monthly and over half search daily). Whitehouse.gov likely doesn't need the kind of visibility that most sites need in search, but when people search for information about today's issues, such as the economy, the Obama administration surely wants the whitehouse.gov pages that explain their position to show up.

The site has a blog, which is awesome, but the title tag, the most important tag on the page, has only the text "blog". Nothing else. Which might help the page rank well for people doing a search for blog, but that's probably not what they're going for. This doesn't just hurt them in search of course. It's also what shows up in the browser tab and bookmarks.
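A check for this kind of thin title tag is easy to automate. The sketch below uses Python's standard-library HTML parser to pull out a page's title and flag it when it has very few words. The class and function names, and the three-word threshold, are assumptions made for illustration, not an established SEO rule.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text found inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def title_is_too_generic(html, min_words=3):
    """Return True when the title has fewer than `min_words` words.

    The threshold is an arbitrary illustration; tune it for your own site.
    """
    parser = TitleExtractor()
    parser.feed(html)
    return len(parser.title.split()) < min_words
```

Run against a page whose title is just "blog", the function returns True; a descriptive title of three or more words passes.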

The site runs on IIS 6.0. Does the site developer know about the tricky configuration that makes the redirects search engine-friendly?

Search engines are text-based, so they can't read text hidden in images. Some whitehouse.gov pages get around this issue well, by making the text look image-like, but leaving it as text, such as below.

[Image: whitehouse.gov text example]

However, other pages have text in images and don't use ALT text to describe them. (This, of course, is an accessibility issue as well, as it keeps screen readers from being able to access the text in the images.) An example of this is the home page, which may be part of why whitehouse.gov doesn't show up on the first page in a search for President Obama.

[Image: whitehouse.gov image example]
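Images without ALT text can be found with a short script. This is a minimal sketch using Python's standard-library HTML parser; `MissingAltFinder` and `images_missing_alt` are names made up for this example. It treats an empty ALT attribute as missing, which is an assumption: an empty ALT can be legitimate for purely decorative images, so the results need a human look.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Record the src of every <img> that has no usable alt text."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")
        # Assumption: blank alt text is reported too, even though it can be
        # intentional for decorative images.
        if alt is None or not alt.strip():
            self.missing.append(attributes.get("src") or "(no src)")


def images_missing_alt(html):
    """Return the src values of images in `html` that lack alt text."""
    finder = MissingAltFinder()
    finder.feed(html)
    return finder.missing
```

Feeding it the HTML of a page returns a list of image sources to review, which doubles as an accessibility checklist.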

There are all kinds of technical issues, big and small, that impact whether your site can be found in search results for what you want to be found for. (For example, whitehouse.gov uses underscores rather than dashes in URLs, and the meta descriptions are the same on every page.) Probably the biggest issue in this case is the lack of 301 redirects between the old site and the new site. When you change domains and move content to the new domain, you don't want to have to rebuild the audience and links all over again. (Not that Obama or whitehouse.gov will have a problem with attracting an audience, but we can't all be president!) When you use a 301 redirect, both visitors and search engines know to replace the old page with the new one.

In the case of change.gov, it's unclear if they intend to maintain the old site. The home page asks people to join them at whitehouse.gov, but all the old pages still exist (even the old home page at http://change.gov/content/home).

[Image: change.gov example]

And in many cases, the same content exists at both change.gov and whitehouse.gov (see, for instance, http://change.gov/agenda/iraq_agenda/ and http://www.whitehouse.gov/agenda/iraq/).
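To make the 301 idea concrete, here is a minimal sketch of page-to-page permanent redirects using only Python's standard library. It is not how change.gov or whitehouse.gov is actually served (the post notes the site runs on IIS); the handler and function names are invented for this example, and the only mapping shown is the Iraq agenda pair mentioned above. A real migration would list every moved page.

```python
from http.server import BaseHTTPRequestHandler

# Old path on the retired site -> equivalent page on the new site.
# Only this one pair comes from the post; the rest would be filled in by hand
# or generated from the content migration.
REDIRECTS = {
    "/agenda/iraq_agenda/": "http://www.whitehouse.gov/agenda/iraq/",
}


def redirect_target(path):
    """Return the new URL for an old path, or None if no mapping exists."""
    return REDIRECTS.get(path)


class PermanentRedirectHandler(BaseHTTPRequestHandler):
    """Answer requests for moved pages with 301 Moved Permanently."""

    def do_GET(self):
        target = redirect_target(self.path)
        if target is None:
            # Unknown page: report it as missing rather than guessing.
            self.send_error(404)
            return
        self.send_response(301)
        self.send_header("Location", target)
        self.end_headers()

    # HEAD requests get the same status and headers.
    do_HEAD = do_GET
```

To try it locally, pass `PermanentRedirectHandler` to `http.server.HTTPServer` and request an old path. Mapping each old page to its direct equivalent, rather than sending everything to the new home page, is what lets existing links carry over to the right content.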

As Matt Cutts, Googler extraordinaire, pointed out, give them a few days to relax before worrying so much about SEO. And I certainly think the site is an excellent step towards better communication between the president and the American people. But not everyone has the luxury of having one of the most well-known names and sites in the world, so the technical details are more important for the rest of us.

If you want to know more about technical issues that can keep your site from being found in search and tips for making sure that you don't lose visibility in a site move, join us for the O'Reilly Found conference June 9-11 in Burlingame. And if you're in Mountain View tomorrow night (Thursday, January 22nd), stop by Ooyala from 6pm to 9pm for our webdev/seo meetup, and get all your search questions answered. Hope to see you there! (Macon Phillips and the whitehouse.gov webmasters are welcome, but my guess is that they're a little busy.)

tags: publishing, search, seo, web 2.0, whitehouse.gov
comments: 11

 

Sat

Jun 14
2008

Tim O'Reilly

Why Arrington is Wrong about Yahoo!-Google Deal

by Tim O'Reilly (@timoreilly)
comments: 56

I was inspired by Fred Wilson's excellent piece on the subject to add my own two cents to Mike Arrington's rant about how Yahoo!'s deal with Google is bad for the industry. I wrote the following in Arrington's comment stream, and will reproduce it here:

Let me weigh in as well on why I don't think Google's dominance in search is going to cause the problems you imagine.

1. Search is only one way to find things. It's the most easily monetizable, so it gets the lion's share of the attention. But take a look at (and report on) what percentage of TechCrunch's traffic comes from search. For the O'Reilly Radar blog, it's about 35%. Significant, sure, but hardly a sign of lack of competition. If Google absorbed both Yahoo! and Microsoft, the share of our visits coming from search would still be below 40%. (That tells you what a small share of our search traffic comes from the other guys today.) And that's just the web traffic. Count in RSS (which is much bigger than web for most blogs, including ours) and the search share of traffic goes down to a much smaller amount. So there's not much worry about people not being able to find information.

2. You specifically raise the specter of Google taking a bigger share of the search dollar absent competition. I'd be interested to know if you have concrete case studies of better deals because of competitive pressure. Seems to me that if Google does this, they will undermine the virtuous circle that drives their success. Maybe they will do this, but if they do, attention and value will migrate elsewhere, as eventually happened with Microsoft.

At O'Reilly, we always say "Create more value than you capture." All successful companies do this. Once they start capturing more value than they create, their market position erodes, and someone displaces them. It may take a while but it happens eventually. If Google takes too much of the pie, it will be a great opening for a new competitor. Right now, because Google is creating the most value for the ecosystem, competitors continue to lose share. If they started taking a lot more of the revenue, Microsoft's share would go up, plus new startups would have an opening that they don't have now.

3. The real source of my argument for this position, which you linked to in your piece, but I'll point to again here, is that Web 2.0, the internet operating system we're building, is much bigger than search. Search is an incredibly powerful subsystem of that OS, but it is just a subsystem. There is lots of competition across the system as a whole, and we're a LONG way from the concentration of power that represents monopoly when we take that into consideration.

4. The landscape is changing so fast. To take only one axis, consider mobile. Google doesn't dominate mobile/local search. That's a whole new game.... Again, there's lots of competition.

tags: competition, google, search, web 2.0, yahoo
comments: 56

 

Sun

May 25
2008

Tim O'Reilly

Why search competition isn't the point

by Tim O'Reilly (@timoreilly)
comments: 39

This morning, in response to my Microhoo: Corporate Penis Envy? piece, Michael Arrington wrote The importance of a competitive search market.

First, let's be clear. I agree with Michael that competition is a good thing, and that there's a real risk that, absent competition, Google will become "evil," as "absolute power corrupts absolutely." Nonetheless, I thought I'd take a few moments to explore why Michael got it wrong, despite the fundamental appeal of his assertion, especially to people who grew up learning the lessons of the Microsoft desktop monopoly.

  1. To focus on search is to miss the big picture. Web 2.0 (or whatever the fullness of the Internet Operating System ends up being called) is far bigger than search. Yes, search is currently the most valuable and monetizable Web 2.0 application--or perhaps better-named, subsystem. But look back at 1984: Lotus was bigger and more valuable than Microsoft ($153 million in revenues to Microsoft's $100 million, and growing faster -- Lotus had tripled in size, while Microsoft had only doubled.) But we now know that Microsoft had the stronger position. As I've said in my Web 2.0 talks from the very beginning, a platform beats an application every time.
The key question is what kind of platform we're collectively building.

There is strong evidence that the platform that's emerging is more like Linux than it is like Windows. That is, no one player is going to own all the pieces. But that could change if someone owned enough of the pieces that everyone else became dependent on them. So I'd be much more concerned about a single player rolling up unrelated and complementary pieces of the larger internet OS till they owned critical mass in multiple areas than I would be about a single player owning a best of breed application in one area or another.

The sooner we start getting serious about interoperability between best-of-breed services (the next step up from first generation mashups), the safer we'll be against a single dominant player turning their subsystem into the "one ring that rules them all."

  2. I think Google understands the need for interoperability better than Microsoft. When Eric Schmidt says "don't fight the internet," I believe he means it. Google seems to be doing their best to balance competitive advantage with giving back and the overall health of the internet ecosystem.

  3. Even if Google does achieve true monopoly status, that monopoly will be short-lived. Just as Microsoft stumbled at what appeared to be the peak of its power, so too will Google. The pace of technology is increasing, and it's rare for a company that led with one generation of technology to also win at the next. Take mobile: as hopmojo notes, and as I wrote myself in Static on the Dream Phone, mobile is going to be a make-or-break transition for Google.

  4. Many Web 2.0 applications tend naturally to monopoly, precisely because they harness network effects. In fact, one of my short definitions of Web 2.0 is the design of systems that get better the more people use them. Network effects apply to the Web 2.0 system as a whole as well as to any individual subsystem. In What is Web 2.0?, I wrote:

    The race is on to own certain classes of core data: location, identity, calendaring of public events, product identifiers and namespaces. In many cases, where there is significant cost to create the data, there may be an opportunity for an Intel Inside style play, with a single source for the data. In others, the winner will be the company that first reaches critical mass via user aggregation, and turns that aggregated data into a system service.

    The critical point is whether or not, having achieved critical mass, you take the next step and turn that aggregated data into a system service. If Google doesn't do that, and the rest of us have done our homework, then someone else will beat them in search because the network effect of the entire system will be greater than the network effect of the search ecosystem alone. If Microsoft understood this, they'd be competing with Google by making search services that are more open, re-usable and re-deployable than Google's search services. Since they aren't operating this way, they ought to throw in the towel.

  5. We're still so early! There's so much yet to invent. Take what Amazon is doing with S3 and EC2. They broke new ground and took a leadership position in an emerging category, while A9, their attempt at incremental innovation in search, got them nowhere. If Microsoft and Yahoo! want to compete with Google, go where they aren't!

True search innovation will come from something that doesn't look like search. Google's video search efforts foundered, while YouTube took off. (Google was smart enough to buy YouTube quickly.) Facebook took off in an area that could be characterized as "people search." Tweetspace is becoming a hidden transmission channel for information, one that Google doesn't yet search. Everything Microsoft (and other explicit search competitors, including most specialized search startups) is doing is incremental innovation. Google's search dominance will be toppled by a disruptive innovation that changes the game, not by playing catch-up at the same game. The challenges that keep Google on their toes, innovating in search, will come from outside the current system.

tags: google, microhoo, microsoft, open source, search, the long view, web 2.0
comments: 39

 

Sat

May 24
2008

Tim O'Reilly

MicroHoo: corporate penis envy?

by Tim O'Reilly (@timoreilly)
comments: 35

After reading endless pieces about Microsoft's obsession with search, I am forced to offer the following theory: penis envy (from Wikipedia):

I worked with Freud in Vienna. We broke over the concept of penis envy. He thought it should be limited to women - Woody Allen in Zelig

While not the same kind of penis envy as that typically referred to in psychoanalysis, the phrase "penis envy" or "small penis syndrome" is also sometimes used to describe the envy of a male over another male's penis. Although this subconscious or conscious envy may solely be based on the idea that a larger penis is universally more satisfying and appealing to a sexual partner, other implications arise from the fact that a large penis has been seen in many cultures as a symbol of high masculinity, dominance and power.

Microsoft was once motivated by its own Big Hairy Audacious Goal: "a computer on every desk and in every home." They achieved that goal, and ever since, they've drifted. Now their only goal seems to be to stay on top of the heap. They need to stop focusing on eating other people's lunch and start thinking deeply about what kind of goals might stretch the company once again. "Organize all the world's information" is already taken, but there are a lot of other things that need doing, that Microsoft is uniquely capable of, and that would energize the creativity and passion of Microsoft's employees. What's more (as I'll get to later) there is a much bigger game afoot, and one that Microsoft would be far wiser to focus on.

Meanwhile, Yahoo! has let itself be defined by the same kind of penis envy. Here is a business that has beaten Google in area after area, that is unquestionably the #1 media company on the net, and yet has let itself be defined by the one area in which it is #2 -- and where it could be much more profitable and successful by partnering with #1 than by competing with them.

So, my advice to Yahoo!: continue with your plan to outsource search to Google, just like you did before 2002, and plow those increased profits and reduced costs into your own innovation, strengthening the areas where you are #1, exploring new ideas that will make YOUR users insanely happy, and generally focusing on what makes Yahoo! great, rather than on what doesn't. That is, unless Microsoft makes you so good a deal for your search assets that you just can't say no. But either way, let yourself be quit of the destructive competition and focus on adding real value for your users.

My advice to Microsoft: outsource your search to Google too!

Of course, I don't really expect you to do that. As Todd Bishop wrote in a piece called The Rest of the Motto back in 2004:

...people who have been following the company since the early days point out that the stated goal was actually, "A computer on every desk and in every home, running Microsoft software." In an October 1995 interview with Fortune magazine, for example, Bill Gates said: " ... I still believe in our vision -- a computer on every desk and in every home, all running Microsoft software."

Microsoft has long operated on the model of platform as lever for lock-in and competitive advantage, or as Tolkien put it, "One ring to rule them all." But as I've been saying in my advocacy about Web 2.0 from the beginning, there is another model, represented by both Linux and the Internet, that might be called small pieces loosely joined. (I don't use the term in quite the same way as in David Weinberger's eponymous book (linked just previously), but the phrase is just too right to ignore.) The Unix philosophy, laid out so brilliantly in The Unix Programming Environment, is of a network of cooperating tools, each doing one thing well. This philosophy took root on the internet as well, and has proven to be an enormous engine of innovation.

I believe that we're collectively working on an Internet Operating System, and that it will ultimately look more like Unix than it looks like Windows. That is, it will be an aggregate of best of breed tools produced by an army of independent actors, all playing by the same rules so that those tools work together to produce a whole greater than the sum of the parts.

Fighting over search is a bit like the Free Software Foundation re-implementing cat, ls, sort, and all the other Unix utilities that were already available in the Berkeley distributions of Unix. The real problem was solved by someone outside the FSF, when Linus Torvalds wrote a kernel, a missing piece that became the gravitational center of Linux, the center around which all of the other projects could coalesce, which made them more valuable not by competing with them but by completing them.

Don't get me wrong: I'm not saying that there isn't enormous room for competition, and that competition isn't good for the market. Compete where you have ideas that can really change the game, but don't play me-too.

So let's assume that Google has won at search, or close enough to make no difference. Is Microsoft better off trying to reimplement cat and ls, or trying to figure out what's still missing from the Internet Operating System? While they are locked in penis envy, all the really cute girls are going out with startups :-)

So think hard about the future internet OS: ubiquitous computing, with a computer not just on every desktop and in every home, or even every phone and every camera, but in everyday devices, clothing, shopping carts, cars, pens, toys, buildings, roads, the power grid, even human bodies--and yes, lots of server farms. An infrastructure of real-time data services across that "network of networks," with search (and search-based-advertising) only one of many such services. As David Stutz once wrote: Useful software written above the level of the single device will command high margins for a long time to come.

This is the real Web 2.0, the web as platform. Search and its advertising economy is only one subsystem of that platform.

I know there are lots of people at Microsoft Research already working on key parts of this vision. Get behind them. Pour resources into the future, not the past. Meanwhile, there will be a lot of client devices in our connected future. And the connected devices that are the most successful will be the ones that are most open, the ones that are best for using ALL the services that are popping up out there, not just the ones that are part of a single vendor stack. The internet operating system is still in its infancy, perhaps at the level of Windows 3.1 (a simple, flawed GUI skin over DOS) or the GNU system without the Linux kernel. Microsoft has an enormous opportunity to build client software and devices that are ideal citizens of the software cooperative that the internet is becoming.

(Aside: Apple's apparent success with an "own the stack, from the device to cloud" strategy is misleading. With both the iPod and the iPhone, a key element of success is precisely the device's openness to what Apple does not own. Imagine an iPod where you could only buy music from the Apple music store instead of ripping your own CDs (this is Amazon's mistake with the Kindle). Imagine an iPhone without the Safari browser (opening a world of web apps to the phone) or the Google Maps application. Apple owns key elements of the stack, but it's a permeable stack, and getting more so.)

Learn from the best, partner with the best, fill in the gaps, and build for the future. Above all, remember that great companies have "big, hairy audacious goals." Energize Microsoft by pursuing a seemingly impossible goal that can change the world for the better.

tags: google, microhoo, microsoft, open source, penis envy, search, web 2.0, yahoo
comments: 35