ETech Day 4: One of these is not like the other
Related link: http://conferences.oreillynet.com/cs/et2006/view/e_sess/7676
Today is the last day of ETech and things are starting to wind down. The hotel broke the network and moved us to the third floor, so the smooth flow of geeks milling about is feeling a little disjointed today. Regardless, the sessions continue and maybe people are paying a little more attention since they are not being distracted by their laptops (as much).
The most interesting session I attended this morning was Meredith Patterson's "One of these is not like the others" talk on her Query by Example extension to Postgres. Since I am a MySQL refugee who now uses Postgres every day, I was intrigued by this extension. Meredith also had an interesting observation on MySQL: "The Postgres back-end is really nice, unlike MySQL. I'm sorry all you MySQL users, but your code f*cking sucks." Of course, that elicited a chuckle from the crowd.
Meredith started out by talking about some basic ideas and challenges of data mining. She outlined how in large data sets patterns and trends emerge and how clustering techniques attempt to find data points with similar traits by identifying locations where data points form clusters. Classification is another data mining tool where all the data points that lie on one side of a line (for 2D datasets) belong to one class and the other points belong to the other.
Her Postgres extension uses the classification technique as part of a support vector machine (SVM) -- fortunately she didn't go into much detail on how these work. They use far too much math to think about in the morning -- if you want to delve deeper into SVMs, you might start with this Wikipedia entry.
Meredith did describe the characteristics of SVM machines a little more:
- can be used to query by example
- can be used to order results by example
- can learn complicated functions quickly
- fast execution
- lightweight and portable
- can be used for both classification and ranking
So far, all of this has been a little bit abstract -- let's put this all into perspective by showing you what cool things you can do with this nifty extension:
SELECT title
FROM songs
WHERE EXAMPLE KEY title
LIKE ("Canon in D", "Moonlight Sonata", "Air on the G String")
NOT LIKE ("Closer", "Take On Me", "Sell Out")
This example query selects titles from the songs table that are similar to the three mentioned classical pieces of music, but not like the three pop music pieces. I must admit -- this example is a little confusing since she mentioned that text isn't supported yet in the postgres extension -- so far the extension only supports real numbers. Regardless, the example query shows the power of her extension. Postgres' LIKE operator is pretty inflexible, quite slow and overall not very useful -- Meredith's extension improves on this by making the LIKE operator much more flexible and fuzzy. But the real power of her extension comes from the ability to specify items that should not be like the given examples. SQL can't do that without this extension.
Now consider this ranking example:
SELECT title
FROM songs
ORDER BY EXAMPLE KEY int_id (((1, 2, 3) > (4, 5)), ((6,7) > (8, 9, 10)))
This standard SQL query with a funky ORDER BY clause uses the order by example concept. The two tuples of data ((1, 2, 3) > (4, 5)) and ((6,7) > (8, 9, 10)), indicate which items of data are more important than other items of data. In this case (1,2,3) is more important than (4,5). Then, (6,7) are more important than (8,9,10). Again, the actual data values may be a little insignificant, but you can see how ordering by example can be a really powerful concept.
Meridith says that she will continue to work on this project and once she rounds out some rough edges she will release the extension under the GPL. I think this extension is going to be a really useful extension to postgres -- the postgres team has already said they are interested in including it in Postgres' contrib section. I will keep an eye on this extension and once she posts the source code, I will post a link here.
Do you think query by example is going to be useful in Postgres?
Categories
WebRead More Entries by Robert Kaye.

Text values as keys
I must admit -- this example is a little confusing since she mentioned that text isn't supported yet in the postgres extension -- so far the extension only supports real numbers.
Hey Robert,
Just a quick clarification -- the KEY field you specify can be of any data type, because it's just an index into the table. In that example, I'm (crassly) assuming that all the other fields in that "songs" table are numeric, and that the "title" field uniquely identifies the examples you're using in a human-readable way. The KEY field isn't actually used in training.
Sorry for the confusion!
The code is available
http://pgfoundry.org/projects/qbe/