Entries tagged with “machine learning” from O'Reilly Radar
Four short links: 15 October 2009
Open Access, Right to Broadband, Machine Learning Textbook, Javascript Performance Art
by Nat Torkington | @gnat | comments: 1
- Open Access Week -- world-wide, dedicated to raising awareness of open access to research. (via Creative Commons Aotearoa).
- 1Mb Broadband Access Becomes Legal Right -- Starting next July, every person in Finland will have the right to a one-megabit broadband connection.
- The Elements of Statistical Learning 2ed -- classic book (I have the 1st edition) that is now available as a free PDF download. (via Hacker News)
- vi in Javascript -- yup, someone's written a vi clone in Javascript. (via monkchips on Twitter)
tags: book related, broadband, finland, javascript, machine learning, science, science commons
Four short links: 12 October 2009
DSL for NLP Task, Insider Tradespotting, Outsource Fail, Cloud Fail
by Nat Torkington | @gnat | comments: 3
- Snowball -- a small string processing language designed for creating stemming algorithms for use in Information Retrieval. (via straup on delicious)
- Insider Trades -- a Yahoo! Hack Day app that turned out to be worth continuing. Scans SEC systems every 30 seconds and alerts you if the stock you track has been traded by an insider. (via straup on delicious)
- Air New Zealand Slams IBM -- central point of failure in the outsourced IT. "In my 30-year working career, I am struggling to recall a time where I have seen a supplier so slow to react to a catastrophic system failure such as this and so unwilling to accept responsibility and apologise to its client and its client's customers" is not the glowing endorsement you want.
- Danger/Microsoft Loses Sidekick Customers' Data -- Regrettably, based on Microsoft/Danger's latest recovery assessment of their systems, we must now inform you that personal information stored on your device - such as contacts, calendar entries, to-do lists or photos - that is no longer on your Sidekick almost certainly has been lost as a result of a server failure at Microsoft/Danger. This cloud had a brown lining.
tags: cloud, failures, finance, hacks, machine learning, microsoft, programming, yahoo
Big Data and Real-time Structured Data Analytics
by Ben Lorica | @dliman | comments: 9
The emergence of sensors as sources of Big Data highlights the need for real-time analytic tools. Popular web apps like Twitter, Facebook, and blogs are also faced with having to analyze (mostly unstructured) data in near real-time. But as Truviso founder and UC Berkeley CS Professor Michael Franklin recently noted, there are mountains of structured data generated by web apps that lend themselves to real-time analysis:
The information stream driving the data analytics challenge is orders of magnitude larger than the streams of tweets, blog posts, etc. that are driving interest in searching the real-time web. Most tweets, for example, are created manually by people at keyboards or touchscreens, 140 characters at a time. Multiply that by the millions of active users and the result is indeed an impressive amount of information. The data driving the data analytics tsunami, on the other hand, is automatically generated. Every page view, ad impression, ad click, video view, etc. done by every user on the web generates thousands of bytes of log information. Add in the data automatically generated by the underlying infrastructure (CDNs, servers, gateways, etc.) and you can quickly find yourself dealing with petabytes of data.
In our report on Big Data, we listed some tools that can turn SQL data warehouses into real-time intelligence systems. The typical data warehouse usually reports on data that are a day, week, or even a month old. Not every company requires real-time reports, alerts, or exception tracking, but some domains may benefit from dramatically reducing latency. To supplement the typical post-campaign reports generated by traditional (static) data warehouses, advertisers and content providers could track and make adjustments to their campaigns in real-time. Web applications that rely on data generated by sensors (e.g. smart grids, location-aware mobile apps, logistics & supply-chain tracking, environmental sensors) would be able to display reports that are continuously updated in real-time. Web site performance and security reports are also natural candidates for real-time analytics.
If you desire (near) real-time analysis, traditional SQL databases and MapReduce systems are batch-oriented (load all the data, then analyze), and might not be able to deliver the low latency you're seeking. Fortunately, there are tools that allow structured data sets (such as data warehouses) to be easily analyzed in real-time.
Recognizing that "data is moving until it gets stored", the idea behind many real-time analytic engines is to start applying the same analytic techniques to moving (streams) and static (stored) data. Truviso separates the processing and analysis of data, and performs both in real-time. End-users and business analysts can access/query real-time data and historical data using SQL: in Truviso's case the underlying Postgres engine and optimizer have been extended to include an embedded stream processor to handle "live data" in any SQL statement's FROM clause. To specify how "live data" is to be processed by a database engine, most real-time analytic vendors provide SQL extensions that allow users to specify the time windows to be analyzed. As data flows continuously into the system, the results of queries involving "live data" are continuously updated in real-time. Leveraging a popular database such as Postgres means structured data warehouses can be ported and made real-time with Truviso.
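To make the idea of a time window over "live data" concrete, here is a minimal Python sketch. It is not Truviso's engine or its SQL extensions; the function name `windowed_counts`, the event tuples, and the 60-second window are all assumptions made for illustration. It counts events per key in fixed, non-overlapping windows and emits each window's result as soon as the stream moves past it, assuming events arrive in timestamp order.

```python
from collections import Counter


def windowed_counts(stream, window_seconds):
    """Yield (window_start, {key: count}) for fixed, non-overlapping windows.

    `stream` is any iterable of (timestamp_seconds, key) pairs and is assumed
    to arrive in timestamp order. A window is emitted as soon as an event
    from a later window shows up; empty windows are not emitted.
    """
    current_start = None
    counts = Counter()
    for ts, key in stream:
        start = (ts // window_seconds) * window_seconds
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The stream has moved on: publish the finished window.
            yield current_start, dict(counts)
            current_start, counts = start, Counter()
        counts[key] += 1
    if current_start is not None:
        # End of input: publish whatever is in the last open window.
        yield current_start, dict(counts)


# Hypothetical log events: (seconds since start, event type).
events = [(0, "ad_click"), (12, "page_view"), (59, "ad_click"), (61, "page_view")]
results = list(windowed_counts(events, 60))
```

Because the function is a generator, a consumer sees each window's counts while later events are still flowing in, which is the basic difference from a load-then-analyze batch job.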
A major challenge facing stream databases is what to do with out-of-order data. Streams are timestamped data sets, and most systems expect data to arrive in the correct time sequence. Unfortunately, things happen when data flows in from multiple sources, and it is not uncommon for timestamped data to arrive out-of-order. While some real-time analytic systems simply drop out-of-order data (potentially leading to misleading query results), Truviso has developed algorithms that look for contiguous data and produce query results that correctly handle out-of-order data.
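Truviso's algorithms for this aren't described here, so the following is a swapped-in, generic technique rather than their method: a small reorder buffer in Python. It holds events until they are older than the newest timestamp seen minus an allowed delay, then releases them in timestamp order. The names `reorder`, `max_delay`, and `late` are hypothetical. Events that arrive later than the allowed delay are set aside instead of being silently dropped.

```python
import heapq
from itertools import count


def reorder(stream, max_delay, late=None):
    """Yield (timestamp, value) pairs in timestamp order.

    Events may arrive up to `max_delay` time units out of order. Anything
    older than the current watermark (newest timestamp seen minus
    `max_delay`) is too late to place correctly and is appended to `late`
    if a list is supplied.
    """
    heap = []
    tie = count()  # tie-breaker so equal timestamps never compare values
    watermark = float("-inf")
    for ts, value in stream:
        if ts < watermark:
            if late is not None:
                late.append((ts, value))
            continue
        heapq.heappush(heap, (ts, next(tie), value))
        watermark = max(watermark, ts - max_delay)
        # Everything at or below the watermark can no longer be overtaken.
        while heap and heap[0][0] <= watermark:
            t, _, v = heapq.heappop(heap)
            yield t, v
    # End of input: flush the remaining buffered events in order.
    while heap:
        t, _, v = heapq.heappop(heap)
        yield t, v


# Hypothetical events from several sources, slightly shuffled in transit.
incoming = [(1, "a"), (3, "b"), (2, "c"), (10, "d"), (4, "e"), (11, "f")]
too_late = []
ordered = list(reorder(incoming, max_delay=5, late=too_late))
```

The trade-off is latency versus completeness: a larger `max_delay` catches more stragglers but holds results back longer.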
What about real-time analysis of unstructured data? Truviso hasn't focused on unstructured data, preferring instead to target companies with existing data warehouses. After all, the general notion is that unstructured data doesn't quite fit into SQL databases like Truviso. But the perception that unstructured data isn't for relational databases may be changing slightly. Recently, a team at UC Berkeley used a SQL database to perform entity-extraction. They took unstructured text, passed it through a Conditional Random Fields algorithm (coded in SQL), and turned it into structured data.
We recently had the chance to meet with the founders of Truviso. There are many other real-time analytic solutions including StreamBase and SQLstream.
In Truviso's system, "live data" or streams can be created (CREATE stream) and accessed in SQL much like static database tables.
tags: analytics, big data, machine learning, real-time, sensors, streams
Mechanical Turk Best Practices
by Ben Lorica | @dliman | comments: 8
Last night, Dolores Labs hosted what was billed as the first-ever Mechanical Turk meetup, and I was fortunate enough to have been able to squeeze into what turned out to be a great series of presentations. While Amazon was the pioneer and remains the largest provider in the space, other services like Dolores Labs and Nathan Eagle's txteagle have emerged to expand the pool of users and turks.
In the past, we've turned to Dolores Labs when we needed (machine-learning) training sets and were unable to quickly find reliable ones. To increase the quality of the output we receive from turks, we try to get multiple turks to perform an individual task and aggregate their work into a single answer. (We jokingly refer to this as the wisdom of micro-crowds.) Working on problems quite different from the ones we tackle, the first set of speakers presented research results confirming that this form of aggregation actually works. Rion Snow of Stanford's AI Lab presented results that suggest that for a large set of tasks, the aggregate work of 4-6 turks compares favorably to the work of a single (domain) expert. Working primarily in the area of NLP and computational linguistics, Bob Carpenter of alias-i presented similar results when evaluating turk-generated training sets against gold-standard ones. (It's hard enough when turks disagree, but as Bob Carpenter highlighted, disagreements among experts make it difficult to arrive at a gold standard.) Bob has found that in certain situations an iterative approach works best ("code-a-little", "learn-a-little") and tools that allow you to start suggesting "answers" to a new set of turks would help immensely. Coincidentally, one of the speakers presented a toolkit that allows users to do just that: Greg Little's TurKit is a JavaScript API for running iterative tasks in Mechanical Turk.
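The simplest version of aggregating several turks' work into a single answer is a majority vote. The Python sketch below is only an illustration of that idea, not the method used by us, Dolores Labs, or any of the speakers; the function name `aggregate_labels`, the `(task_id, worker_id, label)` layout, and the sample judgments are assumptions. It also reports how strongly the turks agreed, which is a cheap signal for which items deserve another look.

```python
from collections import Counter


def aggregate_labels(judgments):
    """Collapse redundant judgments into one label per task by majority vote.

    `judgments` is an iterable of (task_id, worker_id, label) tuples.
    Returns {task_id: (winning_label, agreement)}, where agreement is the
    fraction of that task's judgments matching the winning label.
    Ties go to whichever tied label was seen first.
    """
    by_task = {}
    for task_id, _worker_id, label in judgments:
        by_task.setdefault(task_id, []).append(label)

    aggregated = {}
    for task_id, labels in by_task.items():
        label, votes = Counter(labels).most_common(1)[0]
        aggregated[task_id] = (label, votes / len(labels))
    return aggregated


# Hypothetical judgments: five turks label img1, four label img2.
judgments = [
    ("img1", "w1", "cat"), ("img1", "w2", "cat"), ("img1", "w3", "dog"),
    ("img1", "w4", "cat"), ("img1", "w5", "cat"),
    ("img2", "w1", "dog"), ("img2", "w2", "dog"),
    ("img2", "w3", "cat"), ("img2", "w6", "dog"),
]
answers = aggregate_labels(judgments)
```

A plain vote treats every turk as equally reliable; weighting votes by each worker's track record is a natural refinement, but it is beyond this sketch.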
Another set of speakers talked about the emergence of mechanical turks as a research tool. Social scientists Aaron Shaw and John Horton spoke favorably of their experience using turks for research experiments in economics and paired surveys. Among other things, they've conducted studies on the turk labor market by testing demand for tasks of varying difficulty (something Bob Carpenter also talked about), and by evaluating demand for follow-on tasks at lower wages. Alexander Sorokin of UIUC presented work on using turks to annotate training sets for computer vision and robotics. For those interested in using turks to annotate images, Alex has a toolkit ready to go.
For most users of Mechanical Turk (us included), it has become an API call that fits smoothly within their workflow. (Or as someone at the meetup wryly suggested, turk is a Remote Person Call.) The last pair of speakers, Lilly Irani and Six Silberman, reminded us that behind Mechanical Turk lie thousands of workers ("the crowd in the cloud") working without (health care) benefits, oftentimes at extremely low hourly wages. Irani and Silberman suggested that rather than abstracting Mechanical Turk services as mere API calls, users should start thinking of the plight of the turks ("Mechanical Turk Bill of Rights") behind the service. As a first step they have released a Firefox plugin that aims to narrow the information asymmetry between turks (those performing tasks) and requesters (those posting tasks). While requesters can see ratings for turks, requesters aren't rated: Turkopticon lets turks rate requesters. They need more turks to download and start using Turkopticon, so if you know any mechanical turks please encourage them to do so.
According to Amazon representatives in the audience, a majority of turks are in the U.S. That may change in the future, once Amazon is able to get approval for other payment systems. Because of the possibility of money-laundering, services like AMT are subject to strict KYC controls.
tags: big data, machine learning, mechanical turk, meetup


