Entries tagged with “mapreduce” from O'Reilly Radar

Tue

Oct 20
2009

Ben Lorica

Pipelining and Real-time Analytics with MapReduce Online

by Ben Lorica@dlimancomments: 2

Most of the news related to the real-time web these days centers around the adoption of decentralized, push-oriented protocols (pubsubhubbub, rsscloud) designed to reduce latency in web publishing. Less discussed are the analytic tools that can are capable of crunching through data in real-time. As more of the web moves towards these types of publishing tools, data-driven organizations will demand low latency analytic tools.

Some organizations create their own real-time analysis tools, while others turn to specialized solutions††. The Huffington Post developed in-house tools that let editors optimize headlines in near real-time. In some domains, the need for real-time analytics isn't new and companies have moved in with targeted products: SF-based Splunk is a popular real-time analytic tool for IT organizations.

In a previous post, I highlighted SQL-based real-time analytic tools that can handle large amounts of data. Tools like Truviso (based on the Postgres database) and streambase are attractive in that they require little adjustment for developers already familiar with SQL. In the same post, I noted that other big data management systems such as MPP databases and MapReduce/Hadoop were too batch-oriented (load all the data, then analyze) to deliver analysis in near real-time.

At least for MapReduce/Hadoop systems things may have changed slightly since my last post. A group of researchers from UC Berkeley and Yahoo recently modified MapReduce to allow for pipelining between operators. Rather than waiting for a Map or Reduce operator to complete (or "materialize to stable storage") before kicking off a subsequent operation, their solution is to modify MapReduce to allow intermediate data to be pipelined between operators. As they noted in their paper, pipelining holds several advantages:

A downstream dataflow element can begin consuming data before a producer element has finished execution, which can increase opportunities for parallelism, improve utilization, and reduce response time.

Since reducers begin processing data as soon as it is produced by mappers, they can generate and refine an approximation of their final answer during the course of execution. This technique, known as online aggregation, can reduce the turnaround time for data analysis by several orders of magnitude.

Pipelining widens the domain of problems to which MapReduce can be applied. This allows MapReduce to be applied to domains such as system monitoring and stream processing.

Much like the stream databases I described previously, their approach to pipelining allows MapReduce jobs to "run continuously" and analyze new data as it arrives, enabling MapReduce/Hadoop to handle real-time monitoring and analysis tasks. The kicker is that their method of pipelining preserves the fault-tolerance and programming interfaces developers have come to associate with MapReduce frameworks. As an example, users of their Hadoop Online Prototype (or HOP) can continue continue using Hive or Pig.

In a recent conversation with lead authors Tyson Condie and Neil Conway, they highlighted a few other features of HOP that would make it attractive to current Hadoop users. First, HOP not only preserves Hadoop's public interfaces, it also allows for jobs to be co-scheduled and pipelined, thus reducing the need to write results to HDFS. Second, pipelining leads to preliminary results and early feedback, resulting in faster debugging cycles. Upon seeing early results, a developer can either kill a task, or toggle between pipeline and block mode. Third, HOP does a better job of handling stragglers (slow running tasks) by using previous results to kick-off smart re-starts. Finally, they are currently incorporating a continuous and adaptive optimizer that for a given task, will let HOP converge to the optimal degree of parallelism. The optimizer will allow HOP to scale up/down, dynamically adding/dropping mappers & reducers, based on data being pipelined. In preliminary experiments, they found that superior cluster utilization via pipelining can mean substantial reductions in job completion times.

For those interested in performing real-time analytics within Hadoop, Tyson and Neil informed us that they will make the HOP code publicly available within a month. When asked if HOP can handle large data sets, they confirmed that researchers inside Yahoo have ongoing (successful) experiments using HOP on "Hadoop scale" data. Over the long-term, they predict some form of pipelining will become standard within Hadoop.

So how does HOP compare with the real-time SQL databases I described in an earlier post? For domains where the latency required is in the order of (sub) milliseconds (e.g. algorithmic trading), HOP probably won't help. OTOH, solutions like Truviso and streambase have shown they can handle those types of problems. But for a broader class of problems where a delay of a few seconds is acceptable, HOP will be a suitable analytic engine. In terms of usability, tools like Truviso and streambase look and work like standard SQL, making them fairly accessible to a broad class of users. To make HOP more accessible, Tyson and Neil noted that one interesting side project is to modify equivalent MapReduce tools (Hive and Pig) to incorporate "continuous and real-time queries".

UPDATE (11/12/2009): Neil Conway just announced that the source code for HOP (Hadoop Online Prototype) is now available.

(†) Traditional pull-oriented sytems require subscribers to nag publishers regularly ("Do you have something new?"). Push models deliver content to clients automatically as soon as new content is published ("Don't call us, we'll call you.").
(††) For real-time structured data analysis, enterprises favor the term complex event-processing (CEP). An example is TIBCO's CEP software.

tags: analytics, big data, cep, hadoop, hive, mapreduce, mpp, real-time, streamscomments: 2
submit: Reddit Digg stumbleupon   

 

Tue

Jul 28
2009

Ben Lorica

HadoopDB: An Open Source Parallel Database

by Ben Lorica@dlimancomments: 0

The growing need to manage and make sense of Big Data, has led to a surge in demand for analytic databases, which many companies are attempting to fill (Teradata, Netezza, Vertica, DATAllegro, Greenplum, Aster Data, Infobright, Kognitio, Kickfire, Dataupia, ParAccel, Exasol, ...). As an alternative to current shared-nothing analytic databases, HadoopDB is a hybrid that combines parallel databases with scalable and fault-tolerant Hadoop/MapReduce systems.

HadoopDB is comprised of Postgres on each node (database layer), Hadoop/MapReduce as a communication layer that coordinates the multiple nodes each running Postgres, and Hive as the translation layer. The result is a shared-nothing parallel database, that business analysts can interact with using a SQL-like language. [Technical details can be found in the following paper.]

We recently spent an hour discussing Big Data and HadoopDB with Yale CS Professor (and HadoopDB co-creator) Daniel Abadi. One of the main motivations for building HadoopDB was the desire to make available an open source parallel database. While some analytic database vendors have built parallel systems using open source databases (e.g. Aster Data and Greenplum use Postgres), the resulting products aren't open source.

By taking advantage of Hadoop (particularly HDFS, scheduling, and job-tracking), HadoopDB distinguishes itself from many of the current parallel databases by dynamically monitoring and adjusting for slow nodes and node failures to optimize performance in heterogenous clusters. Especially in cloud computing environments, where there might be wild fluctuations in the performance and availability of individual nodes, fault-tolerance and the ability to perform in heterogeneous environments are critical. Given that the performance of current parallel databases scale (near linearly) as more nodes are added, vendors strive to develop systems that can be easily deployed on large clusters. Current parallel databases have been deployed mostly on systems with less than a hundred nodes. OTOH, the use of Hadoop technology allows HadoopDB to easily scale to hundreds (if not thousands) of nodes.

Generally speaking, Professor Abadi places HadoopDB somewhere between Hadoop and parallel databases when it comes to the trade-off between load (data loads are slower than Hadoop, but faster than parallel databases) and runtime (on structured data, HadoopDB is faster than Hadoop but slower than parallel databases). Below are some graphs from a series of tests conducted by the HadoopDB team:

Performance on Data Loads

pathint

Performance on Analytic Tasks

pathint

In our report on Big Data Management Technologies, we highlighted that (given the lack of upfront relational data modeling) Hadoop and other simple key-value databases encouraged experimentation that could lead to quick insights. But as query patterns emerge, " ... more refined data structures, data transformation, and data access processes can be built (including interfaces to relational RDBMSs) that make subsequent inquiries easy to repeat." In practice this means throwing data into Hadoop, observing how users interact with the data, then building relational data marts accordingly. The vision of the HadoopDB development team fits perfectly into this workflow. Over time, the HadoopDB team envisions their system to initially load all the data into HDFS, then take advantage of query patterns to dynamically load the right data slices into relational data structures.

Admittedly, the HadoopDB team needs to release tools to make their system easier to use/deploy. The HadoopDB development team is comprised entirely of Yale CS Department members, although Professor Abadi is hoping that open source developers will start contributing to the project. But if a paid gig is what you're after, the good news is that they're in search of a Chief Hacker.

[For more on Big Data, check out our report and follow @bigdata.]

(†) We were among the first users of Greenplum. In partnership with SimplyHired and Greenplum, we actively maintain a data warehouse that contains most U.S. online job postings dating back to mid-2005.

tags: big data, hadoop, hive, mapreduce, mpp, postgrescomments: 0
submit: Reddit Digg stumbleupon   

 

Mon

Mar 23
2009

Ben Lorica

Big Data: Technologies and Techniques for Large-Scale Data

by Ben Lorica@dlimancomments: 3

Our belief that proficiency in managing and analyzing large amounts of data distinguishes market leading companies, led to a recent report designed to help users understand the different large-scale data management techniques. Our report on Big Data Technologies was the result of interviews with over thirty experts, including research scientists, (open-source) hackers, vendors, data analysts, and entrepreneurs. Rather than endorse specific vendors and technologies, we provide a framework to help readers navigate the wide variety of options available. (NOTE: If you're interested in purchasing the report as a single-issue of Release 2.0, we can provide you with a DISCOUNT CODE. Contact information is at the end of the video clip below.)

I recently sat down with my co-author, Roger Magoulas (Director of Research at O'Reilly), who agreed talk about our report and Big Data in general. Roger begins by speaking passionately of the importance of data management and analysis. He proceeds to highlight what we believe to be the key technology dimensions for evaluating data management solutions. The video ends with a glimpse into future technologies and general advice to organizations interested in improving their proficiency in handling data.

The full program is available in four extended clips:

  • What is Big Data and why is it important? (3:33 minutes)
  • Big Data Technologies (1:35 minutes)
  • Key Technology Dimensions (4:52 minutes)
  • A Look Into The Future and Closing Summary (3:42 minutes)
  • [ Head over to O'Reilly Media's Youtube channel for other interesting videos. ]

    tags: big data, hadoop, key-value, mapreduce, mpp, report, sql, videocomments: 3
    submit: Reddit Digg stumbleupon