Entries tagged with “analytics” from O'Reilly Radar

Wed

Nov 11
2009

Ben Lorica

Counting Unique Users in Real-time with Streaming Databases

by Ben Lorica (@dliman), comments: 6

As the web increasingly becomes real-time, marketers and publishers need analytic tools that can produce real-time reports. As an example, the basic task of calculating the number of unique users is typically done in batch mode (e.g. daily) and in many cases using a random sample from relevant log files. If unique user counts can be accurately computed in real-time, publishers and marketers can mount A/B tests or referral analysis to dynamically adjust their campaigns.

In a previous post I described SQL databases designed to handle data streams. In their latest release, Truviso announced technology that allows companies to track unique users in real-time. Truviso uses the same basic idea I described in my earlier post:

Recognizing that "data is moving until it gets stored", the idea behind many real-time analytic engines is to start applying the same analytic techniques to moving (streams) and static (stored) data.

Truviso uses (compressed) bitmaps and set theory to compute the number of unique customers in real-time. In the process they are able to handle the standard SQL queries associated with these types of problems: counting the number of distinct users, for any given set of demographic filters. Bitmaps are built as data streams into the system and use the same underlying technology that allows Truviso to handle massive data sets from high-traffic web sites.
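To make the bitmap idea concrete, here is a minimal Python sketch. It is not Truviso's implementation: the class name, the method names, and the use of plain (uncompressed) Python integers as bitmaps are all assumptions made for illustration. Each demographic value gets its own bitmap, and a distinct count under several filters becomes a bitwise AND followed by a bit count.

```python
from collections import defaultdict


class UniqueUserBitmaps:
    """Hypothetical helper: one bitmap per (attribute, value); bit i marks user i as seen."""

    def __init__(self):
        self.user_index = {}              # user id -> bit position
        self.bitmaps = defaultdict(int)   # (attribute, value) -> bitmap stored as an int
        self.all_users = 0                # bitmap of every user seen so far

    def observe(self, user_id, **attributes):
        """Update the bitmaps as one event streams in."""
        bit = 1 << self.user_index.setdefault(user_id, len(self.user_index))
        self.all_users |= bit
        for key, value in attributes.items():
            self.bitmaps[(key, value)] |= bit

    def count_distinct(self, **filters):
        """Distinct users matching all filters, computed by intersecting bitmaps."""
        result = self.all_users
        for key, value in filters.items():
            result &= self.bitmaps.get((key, value), 0)
        return bin(result).count("1")


tracker = UniqueUserBitmaps()
tracker.observe("u1", country="US", gender="f")
tracker.observe("u2", country="US", gender="m")
tracker.observe("u1", country="US", gender="f")   # repeat visit, not double counted
tracker.observe("u3", country="CA", gender="f")

print(tracker.count_distinct())                          # 3
print(tracker.count_distinct(country="US"))              # 2
print(tracker.count_distinct(country="US", gender="f"))  # 1
```

Because the bitmaps are updated on every incoming event, the counts are always current and no batch pass over the logs is needed. A production system would presumably use a compressed bitmap format rather than raw integers.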

Once companies can do simple counts and averages in real-time, the next step is to use real-time information for more sophisticated analyses. Truviso has customers using their system for "on-the-fly predictive modeling".

The other main enhancement in this release is Truviso's move towards parallel processing. Their new execution engine processes runs or blocks of data in parallel in multi-core systems or multi-node environments. Using Truviso's parallel execution engine is straightforward on a single multi-core server, but on a multi-node cluster it may require considerable attention to configuration.

[For my previous posts on real-time analytic tools see here and here.]

tags: a/b testing, analytics, big data, real-time, sensors, sql, streams

 

Tue

Oct 20
2009

Ben Lorica

Pipelining and Real-time Analytics with MapReduce Online

by Ben Lorica (@dliman), comments: 2

Most of the news related to the real-time web these days centers around the adoption of decentralized, push-oriented† protocols (pubsubhubbub, rsscloud) designed to reduce latency in web publishing. Less discussed are the analytic tools that are capable of crunching through data in real-time. As more of the web moves towards these types of publishing tools, data-driven organizations will demand low-latency analytic tools.

Some organizations create their own real-time analysis tools, while others turn to specialized solutions††. The Huffington Post developed in-house tools that let editors optimize headlines in near real-time. In some domains, the need for real-time analytics isn't new and companies have moved in with targeted products: SF-based Splunk is a popular real-time analytic tool for IT organizations.

In a previous post, I highlighted SQL-based real-time analytic tools that can handle large amounts of data. Tools like Truviso (based on the Postgres database) and streambase are attractive in that they require little adjustment for developers already familiar with SQL. In the same post, I noted that other big data management systems such as MPP databases and MapReduce/Hadoop were too batch-oriented (load all the data, then analyze) to deliver analysis in near real-time.

At least for MapReduce/Hadoop systems things may have changed slightly since my last post. A group of researchers from UC Berkeley and Yahoo recently modified MapReduce to allow for pipelining between operators. Rather than waiting for a Map or Reduce operator to complete (or "materialize to stable storage") before kicking off a subsequent operation, their solution is to modify MapReduce to allow intermediate data to be pipelined between operators. As they noted in their paper, pipelining holds several advantages:

A downstream dataflow element can begin consuming data before a producer element has finished execution, which can increase opportunities for parallelism, improve utilization, and reduce response time.

Since reducers begin processing data as soon as it is produced by mappers, they can generate and refine an approximation of their final answer during the course of execution. This technique, known as online aggregation, can reduce the turnaround time for data analysis by several orders of magnitude.

Pipelining widens the domain of problems to which MapReduce can be applied. This allows MapReduce to be applied to domains such as system monitoring and stream processing.
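The online aggregation idea in the second point can be sketched in a few lines of Python. This is an illustrative toy, not HOP code: the function names, the sample records, and the choice of a running mean are assumptions. Generators stand in for pipelined operators, so the reducer starts consuming mapper output before the mapper has finished and reports a refined estimate as it goes.

```python
def mapper(records):
    """Yield (key, value) pairs one at a time instead of materializing them all first."""
    for user, bytes_sent in records:
        yield user, bytes_sent


def online_reducer(pairs, total_records, report_every):
    """Consume mapper output as it arrives and yield a running estimate of the final mean."""
    seen, running_sum = 0, 0
    for _, value in pairs:
        seen += 1
        running_sum += value
        if seen % report_every == 0 or seen == total_records:
            yield seen / total_records, running_sum / seen   # (progress, estimate)


records = [("a", 10), ("b", 30), ("a", 20), ("c", 40)]
snapshots = list(online_reducer(mapper(records), len(records), report_every=2))
for progress, estimate in snapshots:
    print(f"{progress:.0%} of input seen, estimated mean = {estimate}")
```

Here the first snapshot (mean 20.0 at 50% of the input) is available well before the final answer (25.0), which is the trade-off online aggregation offers: an early approximation that is refined as more data flows through.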

Much like the stream databases I described previously, their approach to pipelining allows MapReduce jobs to "run continuously" and analyze new data as it arrives, enabling MapReduce/Hadoop to handle real-time monitoring and analysis tasks. The kicker is that their method of pipelining preserves the fault-tolerance and programming interfaces developers have come to associate with MapReduce frameworks. As an example, users of their Hadoop Online Prototype (or HOP) can continue using Hive or Pig.

In a recent conversation with lead authors Tyson Condie and Neil Conway, they highlighted a few other features of HOP that would make it attractive to current Hadoop users. First, HOP not only preserves Hadoop's public interfaces, it also allows for jobs to be co-scheduled and pipelined, thus reducing the need to write results to HDFS. Second, pipelining leads to preliminary results and early feedback, resulting in faster debugging cycles. Upon seeing early results, a developer can either kill a task, or toggle between pipeline and block mode. Third, HOP does a better job of handling stragglers (slow-running tasks) by using previous results to kick off smart restarts. Finally, they are currently incorporating a continuous and adaptive optimizer that, for a given task, will let HOP converge to the optimal degree of parallelism. The optimizer will allow HOP to scale up/down, dynamically adding/dropping mappers & reducers, based on data being pipelined. In preliminary experiments, they found that superior cluster utilization via pipelining can mean substantial reductions in job completion times.

For those interested in performing real-time analytics within Hadoop, Tyson and Neil informed us that they will make the HOP code publicly available within a month. When asked if HOP can handle large data sets, they confirmed that researchers inside Yahoo have ongoing (successful) experiments using HOP on "Hadoop scale" data. Over the long-term, they predict some form of pipelining will become standard within Hadoop.

So how does HOP compare with the real-time SQL databases I described in an earlier post? For domains where the latency required is on the order of (sub) milliseconds (e.g. algorithmic trading), HOP probably won't help. OTOH, solutions like Truviso and streambase have shown they can handle those types of problems. But for a broader class of problems where a delay of a few seconds is acceptable, HOP will be a suitable analytic engine. In terms of usability, tools like Truviso and streambase look and work like standard SQL, making them fairly accessible to a broad class of users. To make HOP more accessible, Tyson and Neil noted that one interesting side project is to modify equivalent MapReduce tools (Hive and Pig) to incorporate "continuous and real-time queries".

UPDATE (11/12/2009): Neil Conway just announced that the source code for HOP (Hadoop Online Prototype) is now available.

(†) Traditional pull-oriented systems require subscribers to nag publishers regularly ("Do you have something new?"). Push models deliver content to clients automatically as soon as new content is published ("Don't call us, we'll call you.").
(††) For real-time structured data analysis, enterprises favor the term complex event-processing (CEP). An example is TIBCO's CEP software.

tags: analytics, big data, cep, hadoop, hive, mapreduce, mpp, real-time, streams

 

Tue

Sep 29
2009

Nat Torkington

Four short links: 29 September 2009

Bletchley Park No Longer Blech, Contest Mania, Palm Process Fails For Free Software, Open Source Web Analytics

by Nat Torkington (@gnat), comments: 0

  1. Bletchley Park May Have a Future -- the UK birthplace of modern computing, where Alan Turing worked during WW II breaking German codes, is dilapidated and in need of major repair. They appear to have a supporter in the UK National Lottery, who have given them a grant to begin work and prepare for further grants. It should be secured for the future as a place of significant historical merit in the development of computing. (See also The Geek Atlas)
  2. Google Opens Voting on Ideas to Change the World -- there are a lot of contests at the moment: Project 10^100, Apps for Democracy, Apps for America, a plethora of X Prizes, the Netflix prize, and more. I wonder whether contests are like communities: you need a manager to cultivate and boost interest, or else your contest withers on the vine.
  3. My ongoing Kafka-esque nightmare of dealing with Palm and their App Catalog submission process (jwz) -- This is my story about attempting to simply distribute this free software that I have written, and how Palm has so far completely prevented me from doing so. Epic Palm fail. (via Hacker News)
  4. Piwik -- Piwik aims to be an open source alternative to Google Analytics. GPL-licensed.

tags: analytics, collective intelligence, history, open source, palm, uk, web

 

Thu

Aug 13
2009

Ben Lorica

Big Data and Real-time Structured Data Analytics

by Ben Lorica (@dliman), comments: 9

The emergence of sensors as sources of Big Data highlights the need for real-time analytic tools. Popular web apps like Twitter, Facebook, and blogs are also faced with having to analyze (mostly unstructured) data in near real-time. But as Truviso founder and UC Berkeley CS Professor Michael Franklin recently noted, there are mountains of structured data generated by web apps that lend themselves to real-time analysis:

The information stream driving the data analytics challenge is orders of magnitude larger than the streams of tweets, blog posts, etc. that are driving interest in searching the real-time web. Most tweets, for example, are created manually by people at keyboards or touchscreens, 140 characters at a time. Multiply that by the millions of active users and the result is indeed an impressive amount of information. The data driving the data analytics tsunami, on the other hand, is automatically generated. Every page view, ad impression, ad click, video view, etc. done by every user on the web generates thousands of bytes of log information. Add in the data automatically generated by the underlying infrastructure (CDNs, servers, gateways, etc.) and you can quickly find yourself dealing with petabytes of data.

In our report on Big Data, we listed some tools that can turn SQL data warehouses into real-time intelligence systems. The typical data warehouse usually reports on data that are a day, week, or even a month old. Not every company requires real-time reports, alerts, or exception tracking, but some domains may benefit from dramatically reducing latency. To supplement the typical post-campaign reports generated by traditional (static) data warehouses, advertisers and content providers could track and make adjustments to their campaigns in real-time. Web applications that rely on data generated by sensors (e.g. smart grids, location-aware mobile apps, logistics & supply-chain tracking, environmental sensors) would be able to display reports that are continuously updated in real-time. Web site performance and security reports are also natural candidates for real-time analytics.

Traditional SQL databases and MapReduce systems are batch-oriented (load all the data, then analyze), so they might not deliver the low latency that (near) real-time analysis requires. Fortunately, there are tools that allow structured data sets (such as data warehouses) to be easily analyzed in real-time.

Recognizing that "data is moving until it gets stored", the idea behind many real-time analytic engines is to start applying the same analytic techniques to moving (streams) and static (stored) data. Truviso separates the processing and analysis of data, and performs both in real-time. End-users and business analysts can access/query real-time data and historical data using SQL: in Truviso's case the underlying Postgres engine and optimizer have been extended to include an embedded stream processor to handle "live data" in any SQL statement's FROM clause††. To specify how "live data" is to be processed by a database engine, most real-time analytic vendors provide SQL extensions that allow users to specify the time windows to be analyzed. As data flows continuously into the system, the results of queries involving "live data" are continuously updated in real-time. Leveraging a popular database such as Postgres means structured data warehouses can be ported and made real-time with Truviso.
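The notion of a time window over "live data" can be illustrated outside of SQL. The Python sketch below is not Truviso's syntax or engine: the class name, the 60-second window, and the page-view events are assumptions chosen for the example. It keeps a per-key count over the most recent window and returns an updated result every time a new event arrives, which is roughly what a continuously updated windowed query does.

```python
from collections import deque


class SlidingWindowCount:
    """Hypothetical helper: counts events per key over the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, key), assumed to arrive in time order
        self.counts = {}

    def add(self, timestamp, key):
        """Record one event and return the current per-key counts for the window."""
        self.events.append((timestamp, key))
        self.counts[key] = self.counts.get(key, 0) + 1
        # Expire events that have fallen out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]
        return dict(self.counts)


window = SlidingWindowCount(window_seconds=60)
print(window.add(0, "/home"))     # {'/home': 1}
print(window.add(30, "/buy"))     # {'/home': 1, '/buy': 1}
print(window.add(70, "/home"))    # {'/home': 1, '/buy': 1}
print(window.add(100, "/home"))   # {'/home': 2}
```

Note that this sketch assumes events arrive in timestamp order, which real streams do not always guarantee.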

A major challenge facing stream databases is what to do with out-of-order data. Streams are timestamped data sets, and most systems expect data to arrive in the correct time sequence. Unfortunately, things happen when data flows in from multiple sources and it is not uncommon for timestamped data to arrive out-of-order. While some real-time analytic systems simply drop out-of-order data (potentially leading to misleading query results), Truviso has developed algorithms that look for contiguous data and produce query results that correctly handle out-of-order data.
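Truviso's algorithms are not described here, so the following Python sketch shows a different, generic technique for the same problem: a reorder buffer with a bounded allowed delay. The function name, the `max_delay` parameter, and the sample events are assumptions. Events are held in a min-heap until no earlier event can still arrive within the allowed delay, and anything later than that is set aside rather than silently dropped.

```python
import heapq


def reorder(events, max_delay, late):
    """Yield events in timestamp order; events more than max_delay behind go to `late`."""
    buffer = []                    # min-heap of (timestamp, arrival_number, payload)
    watermark = float("-inf")      # everything at or before this time has been released
    for arrival, (timestamp, payload) in enumerate(events):
        if timestamp < watermark:
            late.append((timestamp, payload))   # too late to slot in; keep it for the caller
            continue
        heapq.heappush(buffer, (timestamp, arrival, payload))
        watermark = max(watermark, timestamp - max_delay)
        while buffer and buffer[0][0] <= watermark:
            released, _, data = heapq.heappop(buffer)
            yield released, data
    while buffer:                  # end of stream: flush whatever is still buffered
        released, _, data = heapq.heappop(buffer)
        yield released, data


events = [(1, "a"), (3, "c"), (2, "b"), (6, "f"), (4, "d"), (0, "z")]
late = []
ordered = list(reorder(events, max_delay=2, late=late))
print(ordered)   # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (6, 'f')]
print(late)      # [(0, 'z')]
```

The design choice is a trade-off: a larger `max_delay` tolerates more disorder but delays results, which is why a system that aims for both correctness and low latency needs something more sophisticated than a fixed buffer.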

What about real-time analysis of unstructured data? Truviso hasn't focused on unstructured data, preferring instead to target companies with existing data warehouses. After all, the general notion is that unstructured data doesn't quite fit into SQL databases like Truviso. But the perception that unstructured data isn't for relational databases may be changing slightly. Recently, a team at UC Berkeley used a SQL database to perform entity-extraction. They took unstructured text, passed it through a Conditional Random Fields algorithm (coded in SQL), and turned it into structured data.

(†) We recently had the chance to meet with the founders of Truviso. There are many other real-time analytic solutions including streambase and SQLstream.
(††) In Truviso's system, "live data" or streams can be created (CREATE stream) and accessed in SQL much like static database tables.

tags: analytics, big data, machine learning, real-time, sensors, streams

 

Thu

Jul 23
2009

Nat Torkington

Four short links: 23 July 2009

Wave Fed, Fake Steve, Vanish and Reconnoiter

by Nat Torkington (@gnat), comments: 0

  1. Google Wave Federation Protocol -- the interesting part of Wave for me is the system for keeping databases coherent. There's a reference implementation.
  2. I shouldn't have yelled at that Chinese guy so much -- the post that redeemed Fake Steve Jobs in my eyes. We all know that there's no fucking way in the world we should have microwave ovens and refrigerators and TV sets and everything else at the prices we're paying for them. There's no way we get all this stuff and everything is done fair and square and everyone gets treated right. No way. And don't be confused -- what we're talking about here is our way of life. Our standard of living. You want to "fix things in China," well, it's gonna cost you. Because everything you own, it's all done on the backs of millions of poor people whose lives are so awful you can't even begin to imagine them, people who will do anything to get a life that is a tiny bit better than the shitty one they were born into, people who get exploited and treated like shit and, in the worst of all cases, pay with their lives.
  3. Vanish -- time-limited encryption in a Firefox plugin.
  4. Reconnoiter -- holy cow web console and analytics for data centers, from the magic Theo Schlossnagle. He built the screenshots for his OSCON presentation, graphing streams of live performance data from dozens of data centers, while on a Virgin America flight.

tags: analytics, china, data center, encryption, google wave, opensource, privacy

 

Wed

Jun 17
2009

Nat Torkington

Four short links: 17 June 2009

Word Mining, Open Ideas, Power Meter BotNet, and Realtime Web Traffic Tracking

by Nat Torkington (@gnat), comments: 0

  1. NY Times Mines Its Data To Identify Words That Readers Find Abstruse -- the feature that lets you highlight a word on a NY Times web page and get more information about it is something that irritates me. I'm fascinated by the analysis of their data: boggling that sumptuary is less perplexing than solipsistic. Louche (#3 on the list) has been my favourite word for two years, by the way, since I heard Dylan Moran toss it out in that uniquely facile way the Irish have with words. I think Irish citizens get this incredible competence with the English language for free, along with staggering house prices and beer you can walk on.
  2. Open Ideas -- Alex Payne's blog of Concepts in the public domain, awaiting collaboration and appropriation.
  3. Buggy 'smart meters' open door to power-grid botnet (The Register) -- Paul Graham said that we've found what we get when we cross a television with a computer: a computer. Similarly, intelligent power meters are computers, computers that apparently haven't been well-secured. To prove his point, Davis and his IOActive colleagues designed a worm that self-propagates across a large number of one manufacturer's smart meter. Once infected, the device is under the control of the malware developers in much the way infected PCs are under the spell of bot herders. Attackers can then send instructions that cause its software to turn power on or off and reveal power usage or sensitive system configuration settings.
  4. Chartbeat -- the sexiest web analytics ever. It gives realtime count of users, whether they're reading or writing (based on whether focus is in a form element), where they're from, mentions on Twitter, and more and more and more. This is a different form of analytics than Google Analytics, which tells you trends and historical access. Love this for the pure sex appeal of a heads-up dashboard that can tell you exactly how many people are on your site and exactly what they're doing. (via Artur)

tags: analytics, crowdsourcing, data, energy, innovation, lazyweb, mining, security

 

Mon

May 4
2009

Ben Lorica

Big Data: SSD's, R, and Linked Data Streams

by Ben Lorica (@dliman), comments: 4

The Solid State Storage Revolution: If you haven't seen it, I recommend you watch Andy Bechtolsheim's keynote at the recent Mysqlconf. We covered SSD's in our just published report on Big Data management technologies. Since then, we've gotten additional signals from our network of alpha geeks and our interest in them remains high.

R and Linked Data Streams: I had a chance to visit with Dataspora founder and blogger Mike Driscoll, an enthusiastic advocate for the use of the open source statistical computing language, R. After founding and leading online retailer CustomInk.com, Mike went back to grad school and earned a doctorate in Bioinformatics. He has applied data analysis and programming in a variety of domains including retail, biotech, academia, and government projects.

Having been an avid user of S/S-Plus in the 1990's, I seamlessly switched over to R in the early 2000's. To this day, I consider the S/S-Plus user manuals to be the best reference and introductory books on the R programming language. (Mike wholeheartedly agrees.) R has been popular in the statistics community for many years, but I've been noticing that its visualization and analytic capabilities are attracting interest from developers. Moreover, recent efforts by the R community to improve its ability to scale to large data sets (see brief update from Jay Emerson) will strengthen R's place in the Big Data stack.

(continue reading)

tags: analytics, big data, r, ssd, statistics, video

 

Tue

Apr 28
2009

Ben Lorica

How Big Data Impacts Analytics

by Ben Lorica (@dliman), comments: 9

Research for our just-published report on Big Data management technologies included conversations with teams who are at the forefront of analyzing massive data sets. We were particularly impressed with the work being produced by Linkedin's analytics team. [We have more details on Linkedin's analytics team in an article in the upcoming issue of Release 2.0.]

At the second Social Web Foo camp, I had a chance to visit with Linkedin's Chief Scientist DJ Patil. As a mathematician specializing in dynamical systems and chaos theory, DJ began his career as a weather forecaster working for the Federal government. Years later, he ended up in an analytics role at Ebay where his prior experience with massive data sets came in handy. In the short video below, DJ shares his observations on how analytics has changed in recent years, especially as Big Data increasingly becomes common. Companies are casting a wider net, and are hiring scientists from fields not traditionally known as fertile recruiting grounds for data intelligence teams.

DJ also talks about his personal journey from mathematics to e-commerce and social networks. Among his previous stints, DJ worked with the DOD and used "... social network analysis to identify terrorists."

Other short videos from Social Web Foo camp:

  • Ty Ahmad-Taylor on the Challenges Facing Television
  • Steve Ganz' observations midway through Social Web Foo Camp Year 2

tags: analytics, big data, foo camp, hadoop, social networking, social web, swfoo, video