Entries tagged with “hadoop” from O'Reilly Radar

Thu

Nov 12
2009

Nat Torkington

Four short links: 12 November 2009

CRM on Rails, Data Mining on Hadoop, Disappointing Keynotes, The Teapot Effect

by Nat Torkington@gnatcomments: 1

  1. Fat Free CRM -- open source (Affero GPL) Ruby on Rails CRM system.
  2. Bixo -- open source data mining toolkit that runs as a series of pipes on top of Hadoop. Built on Cascading workflow system for Hadoop that hides MapReduce. (via kdnuggets)
  3. Andy Kessler's Keynote at Defrag Stank (Pete Warden) -- I'm sorry to hear it, because I loved Andy's book How We Got Here about the intersecting histories of economics, finance, and technology. Read the book instead of reading about the disappointing keynote.
  4. The Teapot Effect -- the thing I love about geeks is how their passion causes them to explore, ruthlessly and quantitatively, the everyday phenomena that the rest of us take for granted. Such as dribbling teapots: “Previous studies have shown that dribbling is the result of flow separation where the layer of fluid closest to the boundary becomes detached from it. When that happens, the fluid flows smoothly over the lip. But as the flow rate decreases, the boundary layer re-attaches to the surface causing dribbling.” Read the post and the research it talks about to learn how to prevent Dribbling Teapot Syndrome ....

tags: CRM, data mining, economics, finance, hadoop, history, open source, rails, research, sciencecomments: 1
submit: Reddit Digg stumbleupon   

 

Tue

Oct 20
2009

Ben Lorica

Pipelining and Real-time Analytics with MapReduce Online

by Ben Lorica@dlimancomments: 2

Most of the news related to the real-time web these days centers around the adoption of decentralized, push-oriented protocols (pubsubhubbub, rsscloud) designed to reduce latency in web publishing. Less discussed are the analytic tools that can are capable of crunching through data in real-time. As more of the web moves towards these types of publishing tools, data-driven organizations will demand low latency analytic tools.

Some organizations create their own real-time analysis tools, while others turn to specialized solutions††. The Huffington Post developed in-house tools that let editors optimize headlines in near real-time. In some domains, the need for real-time analytics isn't new and companies have moved in with targeted products: SF-based Splunk is a popular real-time analytic tool for IT organizations.

In a previous post, I highlighted SQL-based real-time analytic tools that can handle large amounts of data. Tools like Truviso (based on the Postgres database) and streambase are attractive in that they require little adjustment for developers already familiar with SQL. In the same post, I noted that other big data management systems such as MPP databases and MapReduce/Hadoop were too batch-oriented (load all the data, then analyze) to deliver analysis in near real-time.

At least for MapReduce/Hadoop systems things may have changed slightly since my last post. A group of researchers from UC Berkeley and Yahoo recently modified MapReduce to allow for pipelining between operators. Rather than waiting for a Map or Reduce operator to complete (or "materialize to stable storage") before kicking off a subsequent operation, their solution is to modify MapReduce to allow intermediate data to be pipelined between operators. As they noted in their paper, pipelining holds several advantages:

A downstream dataflow element can begin consuming data before a producer element has finished execution, which can increase opportunities for parallelism, improve utilization, and reduce response time.

Since reducers begin processing data as soon as it is produced by mappers, they can generate and refine an approximation of their final answer during the course of execution. This technique, known as online aggregation, can reduce the turnaround time for data analysis by several orders of magnitude.

Pipelining widens the domain of problems to which MapReduce can be applied. This allows MapReduce to be applied to domains such as system monitoring and stream processing.

Much like the stream databases I described previously, their approach to pipelining allows MapReduce jobs to "run continuously" and analyze new data as it arrives, enabling MapReduce/Hadoop to handle real-time monitoring and analysis tasks. The kicker is that their method of pipelining preserves the fault-tolerance and programming interfaces developers have come to associate with MapReduce frameworks. As an example, users of their Hadoop Online Prototype (or HOP) can continue continue using Hive or Pig.

In a recent conversation with lead authors Tyson Condie and Neil Conway, they highlighted a few other features of HOP that would make it attractive to current Hadoop users. First, HOP not only preserves Hadoop's public interfaces, it also allows for jobs to be co-scheduled and pipelined, thus reducing the need to write results to HDFS. Second, pipelining leads to preliminary results and early feedback, resulting in faster debugging cycles. Upon seeing early results, a developer can either kill a task, or toggle between pipeline and block mode. Third, HOP does a better job of handling stragglers (slow running tasks) by using previous results to kick-off smart re-starts. Finally, they are currently incorporating a continuous and adaptive optimizer that for a given task, will let HOP converge to the optimal degree of parallelism. The optimizer will allow HOP to scale up/down, dynamically adding/dropping mappers & reducers, based on data being pipelined. In preliminary experiments, they found that superior cluster utilization via pipelining can mean substantial reductions in job completion times.

For those interested in performing real-time analytics within Hadoop, Tyson and Neil informed us that they will make the HOP code publicly available within a month. When asked if HOP can handle large data sets, they confirmed that researchers inside Yahoo have ongoing (successful) experiments using HOP on "Hadoop scale" data. Over the long-term, they predict some form of pipelining will become standard within Hadoop.

So how does HOP compare with the real-time SQL databases I described in an earlier post? For domains where the latency required is in the order of (sub) milliseconds (e.g. algorithmic trading), HOP probably won't help. OTOH, solutions like Truviso and streambase have shown they can handle those types of problems. But for a broader class of problems where a delay of a few seconds is acceptable, HOP will be a suitable analytic engine. In terms of usability, tools like Truviso and streambase look and work like standard SQL, making them fairly accessible to a broad class of users. To make HOP more accessible, Tyson and Neil noted that one interesting side project is to modify equivalent MapReduce tools (Hive and Pig) to incorporate "continuous and real-time queries".

UPDATE (11/12/2009): Neil Conway just announced that the source code for HOP (Hadoop Online Prototype) is now available.

(†) Traditional pull-oriented sytems require subscribers to nag publishers regularly ("Do you have something new?"). Push models deliver content to clients automatically as soon as new content is published ("Don't call us, we'll call you.").
(††) For real-time structured data analysis, enterprises favor the term complex event-processing (CEP). An example is TIBCO's CEP software.

tags: analytics, big data, cep, hadoop, hive, mapreduce, mpp, real-time, streamscomments: 2
submit: Reddit Digg stumbleupon   

 

Fri

Oct 16
2009

Nat Torkington

Four short links: 16 October 2009

Audio Geotagging, SF Open Data Stories, Wave Use Cases, Hadooped Genomes

by Nat Torkington@gnatcomments: 0

  1. Wiimote Audio Geotagging -- match audio with the map movement and annotations made with an IR pen and a Wiimote. Very cool! (and from New Zealand)
  2. San Francisco: Open For Data -- Two months after it launched, the project is already reaping rewards from San Francisco's huge community of programmers. Applications using the data include Routesy, which offers directions based on real-time city transport feeds; and EcoFinder, which points you to the nearest recycling site for a given item.
  3. Google Wave's Best Use Cases (Lifehacker) -- not cases where people are using Wave, but where they want to. Read this as "the Web has not provided all the tools to solve these problems". Something will solve them, and Wave is trying to. (via Jim Stogdill)
  4. Analyzing Human Genomes with Hadoop -- case study from the Cloudera blog. Performs alignment and genotyping on the 100GB of data you get when you sequence a human's genome in about three hours for less than $100 using a 40-node, 320-core cluster rented from Amazon’s EC2. (via mndoci on Twitter)

tags: bio, ec2, geo, google wave, gov2.0, hacks, hadoop, hardwarecomments: 0
submit: Reddit Digg stumbleupon   

 

Wed

Oct 14
2009

Nat Torkington

Four short links: 14 October 2009

Multitouch Demo, Secrets Site Secrets, Hadoop Futures, Becoming Lucky

by Nat Torkington@gnatcomments: 0

  1. 10Gui Video -- demo of a new take on multitouch, a tablet and new GUI conventions. (via titine on Twitter)
  2. Behind the Scenes at WhatDoTheyKnow -- numbers and stories from the MySociety project, which provides a public place for Official Information Act requests and responses. The fact information is subject to copyright and restrictions on re-use does not exempt it from disclosure under the Freedom of Information Act (though there is a closely related exemption relating to “commercial interest”). Occasionally public bodies will offer to reply to a request, but in order to deter wider dissemination of the material they will refuse to reply via WhatDoTheyKnow.com. Southampton University have released information in protected PDF documents and the House of Commons has refused to release information via WhatDoTheyKnow.com which it has said it would be prepared to send to an individual directly.
  3. The View from HadoopWorld (RedMonk) -- fascinating glimpse into the Hadoop user and developer world. Hadoop can be used with a variety of languages, from Perl to Python to Ruby, but as Doug Cutting admitted today, they’re all second class citizens relative to Java. The plan, however, is for that to change. Which can’t happen soon enough, in my view. It’s not that there’s anything intrinsically wrong with Java, or its audience. The point, rather, is that there are lots and lots of dynamic language developers out there that would be far more productive working in their native tongue versus translating into Java.
  4. Be Lucky, It's an Easy Skill to Learn (Telegraph) -- this one resonated with me, as it ties into some life hacking I've been doing lately. And so it is with luck - unlucky people miss chance opportunities because they are too focused on looking for something else. They go to parties intent on finding their perfect partner and so miss opportunities to make good friends. They look through newspapers determined to find certain types of job advertisements and as a result miss other types of jobs. Lucky people are more relaxed and open, and therefore see what is there rather than just what they are looking for. (via Hacker News)

tags: gov2.0, hadoop, lifehacks, multicore, multitouch, mysociety, politics, ui, webcomments: 0
submit: Reddit Digg stumbleupon   

 

Tue

Jul 28
2009

Ben Lorica

HadoopDB: An Open Source Parallel Database

by Ben Lorica@dlimancomments: 0

The growing need to manage and make sense of Big Data, has led to a surge in demand for analytic databases, which many companies are attempting to fill (Teradata, Netezza, Vertica, DATAllegro, Greenplum, Aster Data, Infobright, Kognitio, Kickfire, Dataupia, ParAccel, Exasol, ...). As an alternative to current shared-nothing analytic databases, HadoopDB is a hybrid that combines parallel databases with scalable and fault-tolerant Hadoop/MapReduce systems.

HadoopDB is comprised of Postgres on each node (database layer), Hadoop/MapReduce as a communication layer that coordinates the multiple nodes each running Postgres, and Hive as the translation layer. The result is a shared-nothing parallel database, that business analysts can interact with using a SQL-like language. [Technical details can be found in the following paper.]

We recently spent an hour discussing Big Data and HadoopDB with Yale CS Professor (and HadoopDB co-creator) Daniel Abadi. One of the main motivations for building HadoopDB was the desire to make available an open source parallel database. While some analytic database vendors have built parallel systems using open source databases (e.g. Aster Data and Greenplum use Postgres), the resulting products aren't open source.

By taking advantage of Hadoop (particularly HDFS, scheduling, and job-tracking), HadoopDB distinguishes itself from many of the current parallel databases by dynamically monitoring and adjusting for slow nodes and node failures to optimize performance in heterogenous clusters. Especially in cloud computing environments, where there might be wild fluctuations in the performance and availability of individual nodes, fault-tolerance and the ability to perform in heterogeneous environments are critical. Given that the performance of current parallel databases scale (near linearly) as more nodes are added, vendors strive to develop systems that can be easily deployed on large clusters. Current parallel databases have been deployed mostly on systems with less than a hundred nodes. OTOH, the use of Hadoop technology allows HadoopDB to easily scale to hundreds (if not thousands) of nodes.

Generally speaking, Professor Abadi places HadoopDB somewhere between Hadoop and parallel databases when it comes to the trade-off between load (data loads are slower than Hadoop, but faster than parallel databases) and runtime (on structured data, HadoopDB is faster than Hadoop but slower than parallel databases). Below are some graphs from a series of tests conducted by the HadoopDB team:

Performance on Data Loads

pathint

Performance on Analytic Tasks

pathint

In our report on Big Data Management Technologies, we highlighted that (given the lack of upfront relational data modeling) Hadoop and other simple key-value databases encouraged experimentation that could lead to quick insights. But as query patterns emerge, " ... more refined data structures, data transformation, and data access processes can be built (including interfaces to relational RDBMSs) that make subsequent inquiries easy to repeat." In practice this means throwing data into Hadoop, observing how users interact with the data, then building relational data marts accordingly. The vision of the HadoopDB development team fits perfectly into this workflow. Over time, the HadoopDB team envisions their system to initially load all the data into HDFS, then take advantage of query patterns to dynamically load the right data slices into relational data structures.

Admittedly, the HadoopDB team needs to release tools to make their system easier to use/deploy. The HadoopDB development team is comprised entirely of Yale CS Department members, although Professor Abadi is hoping that open source developers will start contributing to the project. But if a paid gig is what you're after, the good news is that they're in search of a Chief Hacker.

[For more on Big Data, check out our report and follow @bigdata.]

(†) We were among the first users of Greenplum. In partnership with SimplyHired and Greenplum, we actively maintain a data warehouse that contains most U.S. online job postings dating back to mid-2005.

tags: big data, hadoop, hive, mapreduce, mpp, postgrescomments: 0
submit: Reddit Digg stumbleupon   

 

Wed

Jun 24
2009

Nat Torkington

Four short links: 24 June 2009

Open Source Kids, Crowdsourcing Lessons, Flickr Secrets, Hadoop Spatial Joins

by Nat Torkington@gnatcomments: 0

  1. The Digital Open -- The Digital Open is an online technology community and competition for youth around the world, age 17 and under. Building a community of young open source hackers.
  2. Four Crowdsoucing Lessons from the Guardian's Spectacular Expenses Scandal Experiment -- Your workers are unpaid, so make it fun. How to lure them? By making it feel like a game. "Any time that you’re trying to get people to give you stuff, to do stuff for you, the most important thing is that people know that what they’re doing is having an effect," Willison said. "It’s kind of a fundamental tenet of social software. … If you’re not giving people the ‘I rock’ vibe, you’re not getting people to stick around." (via migurski on delicious)
  3. 10+ Deploys/Day: Dev & Ops Cooperation at Flickr -- John Allspaw and Paul Hammond's talk from Velocity. You tell any mainstream company in the world "10 deploys/day" and you'll be met with disbelief.
  4. Reproducing Spatial Joins using Hadoop and EC2 -- bit by bit the techniques for emulating important operations from trad databases are being discovered and shared in the new database scene. (via straup on delicious)

tags: crowdsourcing, django, ec2, flickr, geo, geodata, hadoop, journalism, opensource, velocitycomments: 0
submit: Reddit Digg stumbleupon   

 

Thu

Jun 11
2009

Nat Torkington

Four short links: 11 June 2009

Trends, Graffiti, Games, and Streaming Video

by Nat Torkington@gnatcomments: 1

  1. Trending Topics -- full source code for trendingtopics.org, Wikipedia trend analysis. Rails app running on the Cloudera Hadoop Distribution on EC2. (via mattb on Delicious)
  2. Graffiti from Pompeii -- I can't help but read these as Tweets. Herculaneum (on the exterior wall of a house); 10619: Apollinaris, the doctor of the emperor Titus, defecated well here (see also olde style Twitter) (via OvidPerl on Twitter)
  3. Online Games Dominate Beijing Startonomics -- presentations from sessions on Chinese game business at Startonomics conference. Though there are many differences between the US and China games market, the one that stands out most is China’s ability to massively monetize games. Tencent, a leading Chinese web portal, social network and game developer, famously announced revenue of over $1 billion earlier this year, much of it coming from their avatar service. (via TinaTranT on Twitter)
  4. Ustream's Audience for Apple iPhone Announcement Greater Than Cable News -- Ustream is amazing, you can take a consumer handycam and video broadcast live to a greater audience than many TV shows get.

tags: china, ec2, games, hadoop, media, programming, trends, video, web 2.0comments: 1
submit: Reddit Digg stumbleupon   

 

Mon

Jun 1
2009

Ben Lorica

Most Hadoop Jobs Are In California

by Ben Lorica@dlimancomments: 4

Given the recent buzz surrounding Hadoop and MapReduce, I was curious if employers were beginning to mention either term in their job postings. Fortunately I have access to a massive job data warehouse dating back to mid-2005. In partnership with SimplyHired and Greenplum, we maintain a data warehouse that contains most of the online job postings in the U.S.

While the percentage of job postings that mention either Hadoop or MapReduce remains miniscule, the number of such postings is growing steadily:

pathint

The number of Hadoop/MapReduce job postings (during the Feb/Apr 2009 period) grew 49% compared to 2008. In contrast, the tough economic environment has translated to significantly fewer job postings: the total number of online job postings declined 40% during the same period.

How mainstream is Hadoop? While researching our report on Big Data, we talked to a (database) vendor who jokingly claimed that nobody outside of the West & East coast cared about Hadoop. Analysis of recent job postings seems to support that perspective. During the three most recent months, employers in 18 states posted Hadoop/MapReduce jobs online, but 60% of those were in California. The top 5 states (CA, MD, NY, MA, WA) accounted for 87% of the Hadoop/MapReduce job postings:

pathint

Looking at the same period last year, 72% of the job postings were in California, and the top 5 states (CA, WA, TX, PA, VA) accounted for 79%.

Given the presence of large (Google, Yahoo!, Facebook) and small companies (Cloudera, Greenplum, Aster, ...) who are leaders in the use of Hadoop/MapReduce, it's no surprise that at this early stage, a large share of jobs are in California. While the share of California job postings remains high (60%), it's down from 72% last year. As mentioned above, the percentage of job postings that mention either Hadoop or MapReduce remains miniscule, so I caution against reading too much into the geographic distributions. Nevertheless, it's clear that California employers are expressing interest in Hadoop skills ahead of their peers in other states.

tags: big data, hadoop, jobscomments: 4
submit: Reddit Digg stumbleupon   

 

Tue

Apr 28
2009

Ben Lorica

How Big Data Impacts Analytics

by Ben Lorica@dlimancomments: 9

Research for our just published report on Big Data management technologies, included conversations with teams who are at the forefront of analyzing massive data sets. We were particularly impressed with the work being produced by Linkedin's analytics team. [We have more details on Linkedin's analytics team, in an article in the upcoming issue of Release 2.0.]

At the second Social Web Foo camp, I had a chance to visit with Linkedin's Chief Scientist DJ Patil. As a mathematician specializing in dynamical systems and chaos theory, DJ began his career as a weather forecaster working for the Federal government. Years later, he ended up in an analytics role at Ebay where his prior experience with massive data sets came in handy. In the short video below, DJ shares his observations on how analytics has changed in recent years, especially as Big Data increasingly becomes common. Companies are casting a wider net, and are hiring scientists from fields not traditionally known as fertile recruiting grounds for data intelligence teams.

DJ also talks about his personal journey from mathematics to e-commerce and social networks. Among his previous stints, DJ worked with the DOD and used "... social network analysis to identify terrorists."

Other short videos from Social Web Foo camp:

  • Ty Ahmad-Taylor on the Challenges Facing Television
  • Steve Ganz' observations midway through Social Web Foo Camp Year 2
  • tags: analytics, big data, foo camp, hadoop, social networking, social web, swfoo, videocomments: 9
    submit: Reddit Digg stumbleupon   

     

    Mon

    Mar 23
    2009

    Ben Lorica

    Big Data: Technologies and Techniques for Large-Scale Data

    by Ben Lorica@dlimancomments: 3

    Our belief that proficiency in managing and analyzing large amounts of data distinguishes market leading companies, led to a recent report designed to help users understand the different large-scale data management techniques. Our report on Big Data Technologies was the result of interviews with over thirty experts, including research scientists, (open-source) hackers, vendors, data analysts, and entrepreneurs. Rather than endorse specific vendors and technologies, we provide a framework to help readers navigate the wide variety of options available. (NOTE: If you're interested in purchasing the report as a single-issue of Release 2.0, we can provide you with a DISCOUNT CODE. Contact information is at the end of the video clip below.)

    I recently sat down with my co-author, Roger Magoulas (Director of Research at O'Reilly), who agreed talk about our report and Big Data in general. Roger begins by speaking passionately of the importance of data management and analysis. He proceeds to highlight what we believe to be the key technology dimensions for evaluating data management solutions. The video ends with a glimpse into future technologies and general advice to organizations interested in improving their proficiency in handling data.

    The full program is available in four extended clips:

  • What is Big Data and why is it important? (3:33 minutes)
  • Big Data Technologies (1:35 minutes)
  • Key Technology Dimensions (4:52 minutes)
  • A Look Into The Future and Closing Summary (3:42 minutes)
  • [ Head over to O'Reilly Media's Youtube channel for other interesting videos. ]

    tags: big data, hadoop, key-value, mapreduce, mpp, report, sql, videocomments: 3
    submit: Reddit Digg stumbleupon