Entries tagged with “big data” from O'Reilly Radar

Wed

Nov 11
2009

Ben Lorica

Counting Unique Users in Real-time with Streaming Databases

by Ben Lorica (@dliman) | comments: 6

As the web increasingly becomes real-time, marketers and publishers need analytic tools that can produce real-time reports. As an example, the basic task of calculating the number of unique users is typically done in batch mode (e.g. daily) and in many cases using a random sample from relevant log files. If unique user counts can be accurately computed in real-time, publishers and marketers can mount A/B tests or referral analysis to dynamically adjust their campaigns.

In a previous post I described SQL databases designed to handle data streams. In their latest release, Truviso announced technology that allows companies to track unique users in real-time. Truviso uses the same basic idea I described in my earlier post:

Recognizing that "data is moving until it gets stored", the idea behind many real-time analytic engines is to start applying the same analytic techniques to moving (streams) and static (stored) data.

Truviso uses (compressed) bitmaps and set theory to compute the number of unique customers in real-time. In the process they are able to handle the standard SQL queries associated with these types of problems: counting the number of distinct users, for any given set of demographic filters. Bitmaps are built as data streams into the system and use the same underlying technology that allows Truviso to handle massive data sets from high-traffic web sites.
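To make the bitmap idea concrete, here is a minimal Python sketch. It is not Truviso's implementation: the class name, the method names, and the use of plain (uncompressed) Python integers as bitmaps are all assumptions made for illustration. Each user gets one bit position, each (attribute, value) pair gets its own bitmap, and a filtered distinct count is a bitwise AND followed by a count of set bits.

```python
from collections import defaultdict


class UniqueUserBitmaps:
    """Toy bitmap index (hypothetical): one integer bitmap per (attribute, value)."""

    def __init__(self):
        self.user_bit = {}               # user id -> bit position
        self.bitmaps = defaultdict(int)  # (attribute, value) -> bitmap of users
        self.all_users = 0               # bitmap of every user seen so far

    def observe(self, user_id, **attributes):
        """Update the bitmaps as one event streams in."""
        bit = self.user_bit.setdefault(user_id, len(self.user_bit))
        mask = 1 << bit
        self.all_users |= mask
        for key, value in attributes.items():
            self.bitmaps[(key, value)] |= mask

    def count_distinct(self, **filters):
        """Count unique users matching every filter (set intersection via AND)."""
        result = self.all_users
        for key, value in filters.items():
            result &= self.bitmaps.get((key, value), 0)
        return bin(result).count("1")


index = UniqueUserBitmaps()
index.observe("u1", country="US", gender="f")
index.observe("u2", country="US", gender="m")
index.observe("u1", country="US", gender="f")  # repeat visit, same bit
index.observe("u3", country="CA", gender="f")

print(index.count_distinct())                           # all unique users
print(index.count_distinct(country="US"))               # unique users in the US
print(index.count_distinct(country="US", gender="f"))   # both filters applied
```

Because a repeat visit sets a bit that is already set, the counts stay exact without rescanning any logs. A production system would presumably use compressed bitmaps rather than unbounded integers.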

Once companies can do simple counts and averages in real-time, the next step is to use real-time information for more sophisticated analyses. Truviso has customers using their system for "on-the-fly predictive modeling".

The other main enhancement in this release is Truviso's move towards parallel processing. Their new execution engine processes runs or blocks of data in parallel in multi-core systems or multi-node environments. Using Truviso's parallel execution engine is straightforward on a single multi-core server, but on a multi-node cluster it may require considerable attention to configuration.

[For my previous posts on real-time analytic tools see here and here.]

tags: a/b testing, analytics, big data, real-time, sensors, sql, streams

 

Wed

Nov 4
2009

Nat Torkington

Four short links: 4 November 2009

Electronics Hacking FAQs, Speech-To-Text Democracy, Open Source Column Database, Massive Online Analysis

by Nat Torkington (@gnat) | comments: 1

  1. ChipHacker -- collaborative FAQ site for electronics hacking. Based on the same StackExchange software as RedMonk's FOSS FAQ for open source software.
  2. Democracy Live -- BBC launch searchable coverage of parliamentary discussion, using speech-to-text. "One aspect we're particularly proud of is that we've managed to deliver good results for speech-to-text in Welsh, which, we're told, is unique." I think of this as the start of a They Work For You for video coverage. I'd love to be able to scale this to local government coverage, which is disappearing as local newspapers turn into delivery mechanisms for real estate advertisements.
  3. InfiniDB: Open Source Column Database -- hooks into MySQL, uses MySQL for SQL parsing, security, etc. The commercial enterprise version has multi-server support (parallel scale-out). (via Brian Aker)
  4. Massive Online Analysis -- MOA is a framework for data stream mining. Includes tools for evaluation and a collection of machine learning algorithms. Related to the WEKA project, also written in Java, while scaling to more demanding problems. (via joshua on Delicious)

tags: big data, collective intelligence, databases, democracy, gov2.0, hardware, maker, open source

 

Tue

Oct 20
2009

Ben Lorica

Pipelining and Real-time Analytics with MapReduce Online

by Ben Lorica (@dliman) | comments: 2

Most of the news related to the real-time web these days centers around the adoption of decentralized, push-oriented protocols† (pubsubhubbub, rsscloud) designed to reduce latency in web publishing. Less discussed are the analytic tools capable of crunching through data in real-time. As more of the web moves towards these types of publishing tools, data-driven organizations will demand low latency analytic tools.

Some organizations create their own real-time analysis tools, while others turn to specialized solutions††. The Huffington Post developed in-house tools that let editors optimize headlines in near real-time. In some domains, the need for real-time analytics isn't new and companies have moved in with targeted products: SF-based Splunk is a popular real-time analytic tool for IT organizations.

In a previous post, I highlighted SQL-based real-time analytic tools that can handle large amounts of data. Tools like Truviso (based on the Postgres database) and streambase are attractive in that they require little adjustment for developers already familiar with SQL. In the same post, I noted that other big data management systems such as MPP databases and MapReduce/Hadoop were too batch-oriented (load all the data, then analyze) to deliver analysis in near real-time.

At least for MapReduce/Hadoop systems things may have changed slightly since my last post. A group of researchers from UC Berkeley and Yahoo recently modified MapReduce to allow for pipelining between operators. Rather than waiting for a Map or Reduce operator to complete (or "materialize to stable storage") before kicking off a subsequent operation, their solution is to modify MapReduce to allow intermediate data to be pipelined between operators. As they noted in their paper, pipelining holds several advantages:

A downstream dataflow element can begin consuming data before a producer element has finished execution, which can increase opportunities for parallelism, improve utilization, and reduce response time.

Since reducers begin processing data as soon as it is produced by mappers, they can generate and refine an approximation of their final answer during the course of execution. This technique, known as online aggregation, can reduce the turnaround time for data analysis by several orders of magnitude.

Pipelining widens the domain of problems to which MapReduce can be applied. This allows MapReduce to be applied to domains such as system monitoring and stream processing.
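The first two advantages can be illustrated with a small single-process Python sketch. This is not HOP code: the function names, the `latency_ms` field, and the `report_every` parameter are assumptions made for illustration. Generators stand in for pipelined operators, so the downstream "reducer" consumes each value as soon as the "mapper" produces it and emits a running estimate before the input is exhausted.

```python
def mapper(records):
    """Emit one value per record as soon as the record is read (no materialization)."""
    for record in records:
        yield record["latency_ms"]


def online_mean(values, report_every=1000):
    """Yield (records_seen, running_mean) snapshots while the stream is still flowing."""
    count, total = 0, 0.0
    for value in values:
        count += 1
        total += value
        if count % report_every == 0:
            yield count, total / count  # early, approximate answer
    if count and count % report_every != 0:
        yield count, total / count      # final answer once input ends


# Hypothetical input: a lazily generated stream of log records.
records = ({"latency_ms": 100 + (i % 7)} for i in range(2500))

for seen, estimate in online_mean(mapper(records), report_every=1000):
    print(f"after {seen} records, mean latency is about {estimate:.2f} ms")
```

Each snapshot is an approximation that is refined as more data arrives, which is the essence of online aggregation. A batch-style version would have to read all 2,500 records before printing anything.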

Much like the stream databases I described previously, their approach to pipelining allows MapReduce jobs to "run continuously" and analyze new data as it arrives, enabling MapReduce/Hadoop to handle real-time monitoring and analysis tasks. The kicker is that their method of pipelining preserves the fault-tolerance and programming interfaces developers have come to associate with MapReduce frameworks. As an example, users of their Hadoop Online Prototype (or HOP) can continue using Hive or Pig.

In a recent conversation with lead authors Tyson Condie and Neil Conway, they highlighted a few other features of HOP that would make it attractive to current Hadoop users. First, HOP not only preserves Hadoop's public interfaces, it also allows for jobs to be co-scheduled and pipelined, thus reducing the need to write results to HDFS. Second, pipelining leads to preliminary results and early feedback, resulting in faster debugging cycles. Upon seeing early results, a developer can either kill a task, or toggle between pipeline and block mode. Third, HOP does a better job of handling stragglers (slow running tasks) by using previous results to kick off smart restarts. Finally, they are currently incorporating a continuous and adaptive optimizer that, for a given task, will let HOP converge to the optimal degree of parallelism. The optimizer will allow HOP to scale up/down, dynamically adding/dropping mappers & reducers, based on data being pipelined. In preliminary experiments, they found that superior cluster utilization via pipelining can mean substantial reductions in job completion times.

For those interested in performing real-time analytics within Hadoop, Tyson and Neil informed us that they will make the HOP code publicly available within a month. When asked if HOP can handle large data sets, they confirmed that researchers inside Yahoo have ongoing (successful) experiments using HOP on "Hadoop scale" data. Over the long-term, they predict some form of pipelining will become standard within Hadoop.

So how does HOP compare with the real-time SQL databases I described in an earlier post? For domains where the latency required is on the order of (sub) milliseconds (e.g. algorithmic trading), HOP probably won't help. OTOH, solutions like Truviso and streambase have shown they can handle those types of problems. But for a broader class of problems where a delay of a few seconds is acceptable, HOP will be a suitable analytic engine. In terms of usability, tools like Truviso and streambase look and work like standard SQL, making them fairly accessible to a broad class of users. To make HOP more accessible, Tyson and Neil noted that one interesting side project is to modify equivalent MapReduce tools (Hive and Pig) to incorporate "continuous and real-time queries".

UPDATE (11/12/2009): Neil Conway just announced that the source code for HOP (Hadoop Online Prototype) is now available.

(†) Traditional pull-oriented systems require subscribers to nag publishers regularly ("Do you have something new?"). Push models deliver content to clients automatically as soon as new content is published ("Don't call us, we'll call you.").
(††) For real-time structured data analysis, enterprises favor the term complex event-processing (CEP). An example is TIBCO's CEP software.

tags: analytics, big data, cep, hadoop, hive, mapreduce, mpp, real-time, streams

 

Tue

Aug 18
2009

Nat Torkington

Four short links: 18 August 2009

iPhone App Backstory, Cookie Resurrection, The Entrepreneuralism Lickmus test, and An Interesting Database

by Nat Torkington (@gnat) | comments: 2

  1. The Making of the NPR News iPhone App -- interesting behind-the-scenes look, with sketches and all. "Station streams, however, presented a larger challenge. To begin with, NPR didn't have direct stream links for any of its stations, so we built a Web spider that identified and captured more than 300 iPhone-compatible station streams. After that first pass, we worked with our station representatives to manually test each stream. In the process they found enough new streams to double our database. All of these streams are delivered to the app from NPR's Station Finder API." (via mattb on Twitter)
  2. You Deleted Your Cookies? Think Again (Wired) -- Flash keeps its own cookies, which are harder to delete. Several services even use the surreptitious data storage to reinstate traditional cookies that a user deleted, which is called ‘re-spawning’ in homage to video games where zombies come back to life even after being “killed,” the report found. So even if a user gets rid of a website’s tracking cookie, that cookie’s unique ID will be assigned back to a new cookie again using the Flash data as the “backup.” (via Simon Willison)
  3. Would You Lick It? (Rowan Simpson) -- clever example of what it takes to be an entrepreneur.
  4. FluidDB -- a shared "in the cloud" database built around tags: an object is a container for a set of tags which are name:value pairs, tag names have simple namespaces (e.g., "gnat/review" is the "review" tag in my namespace), all objects are world readable and writable but there are ACLs for tags, values can be any type (string, number, URL, Excel spreadsheet), and there's a simple query language. I'm curious to see what applications spring up around shared data. They're in limited alpha, controlling the # of users, so register now to play before everyone else.

tags: big data, databases, design, flash, iphone app, news, npr, privacy, security, startups

 

Thu

Aug 13
2009

Ben Lorica

Big Data and Real-time Structured Data Analytics

by Ben Lorica (@dliman) | comments: 9

The emergence of sensors as sources of Big Data highlights the need for real-time analytic tools. Popular web apps like Twitter, Facebook, and blogs are also faced with having to analyze (mostly unstructured) data in near real-time. But as Truviso founder and UC Berkeley CS Professor Michael Franklin recently noted, there are mountains of structured data generated by web apps that lend themselves to real-time analysis:

The information stream driving the data analytics challenge is orders of magnitude larger than the streams of tweets, blog posts, etc. that are driving interest in searching the real-time web. Most tweets, for example, are created manually by people at keyboards or touchscreens, 140 characters at a time. Multiply that by the millions of active users and the result is indeed an impressive amount of information. The data driving the data analytics tsunami, on the other hand, is automatically generated. Every page view, ad impression, ad click, video view, etc. done by every user on the web generates thousands of bytes of log information. Add in the data automatically generated by the underlying infrastructure (CDNs, servers, gateways, etc.) and you can quickly find yourself dealing with petabytes of data.

In our report on Big Data, we listed some tools that can turn SQL data warehouses into real-time intelligence systems. The typical data warehouse usually reports on data that are a day, week, or even a month old. Not every company requires real-time reports, alerts, or exception tracking, but some domains may benefit from dramatically reducing latency. To supplement the typical post-campaign reports generated by traditional (static) data warehouses, advertisers and content providers could track and make adjustments to their campaigns in real-time. Web applications that rely on data generated by sensors (e.g. smart grids, location-aware mobile apps, logistics & supply-chain tracking, environmental sensors) would be able to display reports that are continuously updated in real-time. Web site performance and security reports are also natural candidates for real-time analytics.

Traditional SQL databases and MapReduce systems are batch-oriented (load all the data, then analyze), so if you want (near) real-time analysis they might not be able to deliver the low latency you're seeking. Fortunately, there are tools that allow structured data sets (such as data warehouses) to be easily analyzed in real-time.

Recognizing that "data is moving until it gets stored", the idea behind many real-time analytic engines is to start applying the same analytic techniques to moving (streams) and static (stored) data. Truviso separates the processing and analysis of data, and performs both in real-time. End-users and business analysts can access/query real-time data and historical data using SQL: in Truviso's case the underlying Postgres engine and optimizer have been extended to include an embedded stream processor to handle "live data" in any SQL statement's FROM clause††. To specify how "live data" is to be processed by a database engine, most real-time analytic vendors provide SQL extensions that allow users to specify the time windows to be analyzed. As data flows continuously into the system, the results of queries involving "live data" are continuously updated in real-time. Leveraging a popular database such as Postgres means structured data warehouses can be ported and made real-time with Truviso.
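The time-window idea can be sketched outside of SQL. The Python below is only an illustration of a sliding window whose query result is kept current as events arrive; it does not reproduce any vendor's SQL extension, and the class name, method names, and window semantics are assumptions. It also assumes events arrive in timestamp order.

```python
from collections import deque


class SlidingWindowCounter:
    """Hypothetical sliding window: counts events per key over the last N seconds."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, key), assumed to arrive in time order

    def add(self, timestamp, key):
        """Admit a new event and expire everything that fell out of the window."""
        self.events.append((timestamp, key))
        while self.events and self.events[0][0] <= timestamp - self.window:
            self.events.popleft()

    def counts(self):
        """Current answer to 'count events per key over the window'."""
        result = {}
        for _, key in self.events:
            result[key] = result.get(key, 0) + 1
        return result


window = SlidingWindowCounter(window_seconds=60)
for timestamp, page in [(0, "home"), (30, "pricing"), (59, "home"), (61, "home")]:
    window.add(timestamp, page)
    print(timestamp, window.counts())
```

The result changes as each event arrives, which mirrors how a continuous query over "live data" is updated rather than rerun.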

A major challenge facing stream databases is what to do with out-of-order data. Streams are timestamped data sets, and most systems expect data to arrive in the correct time sequence. Unfortunately, things happen when data flows in from multiple sources and it is not uncommon for timestamped data to arrive out-of-order. While some real-time analytic systems simply drop out-of-order data (potentially leading to misleading query results), Truviso has developed algorithms that look for contiguous data and produce query results that correctly handle out-of-order data.
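The post does not describe Truviso's algorithms, so the following Python sketch shows a different, generic technique for the same problem: buffering events in a min-heap and releasing them only once they are older than a bounded delay. The function name and the `max_delay` parameter are assumptions made for illustration.

```python
import heapq


def reorder(events, max_delay):
    """Yield (timestamp, payload) in timestamp order, tolerating bounded lateness.

    An event is held back until the newest timestamp seen so far is at least
    max_delay ahead of it, giving stragglers a chance to arrive first.
    """
    heap = []
    latest = float("-inf")
    seq = 0  # tie-breaker so payloads are never compared
    for timestamp, payload in events:
        latest = max(latest, timestamp)
        heapq.heappush(heap, (timestamp, seq, payload))
        seq += 1
        while heap and heap[0][0] <= latest - max_delay:
            ts, _, item = heapq.heappop(heap)
            yield ts, item
    while heap:  # input ended: flush whatever is still buffered
        ts, _, item = heapq.heappop(heap)
        yield ts, item


arrivals = [(1, "a"), (3, "c"), (2, "b"), (6, "e"), (5, "d")]
print(list(reorder(arrivals, max_delay=2)))
```

The trade-off is latency for correctness: results lag by up to `max_delay`, and an event that arrives even later than that would still be emitted out of order, so a real system would have to drop it or revise earlier results.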

What about real-time analysis of unstructured data? Truviso hasn't focused on unstructured data, preferring instead to target companies with existing data warehouses. After all, the general notion is that unstructured data doesn't quite fit into SQL databases like Truviso. But the perception that unstructured data isn't for relational databases may be changing slightly. Recently, a team at UC Berkeley used a SQL database to perform entity-extraction. They took unstructured text, passed it through a Conditional Random Fields algorithm (coded in SQL), and turned it into structured data.

(†) We recently had the chance to meet with the founders of Truviso. There are many other real-time analytic solutions including streambase and SQLstream.
(††) In Truviso's system, "live data" or streams can be created (CREATE stream) and accessed in SQL much like static database tables.

tags: analytics, big data, machine learning, real-time, sensors, streams

 

Fri

Aug 7
2009

Nat Torkington

Four short links: 7 August 2009

Recovery.gov, Meme tracking, RFID Scans, Open Source Search Engines

by Nat Torkington (@gnat) | comments: 1

  1. Defragging the Stimulus -- "each [recovery] site has its own silo of data, and no site is complete. What we need is a unified point of access to all sources of information: firsthand reports from Recovery.gov and state portals, commentary from StimulusWatch and MetaCarta, and more." Suggests that Recovery.gov should be the hub for this presently-decentralised pile of recovery data.
  2. Memetracker -- site accompanying the research written up by the New York Times as "Researchers at Cornell, using powerful computers and clever algorithms, studied the news cycle by looking for repeated phrases and tracking their appearances on 1.6 million mainstream media sites and blogs [...] For the most part, the traditional news outlets lead and the blogs follow, typically by 2.5 hours [...] a relative handful of blog sites are the quickest to pick up on things that later gain wide attention on the Web." Confirming that blogs and traditional media have a symbiotic relationship, not a parasitic one. (via Stats article in NY Times)
  3. Feds at DefCon Alarmed After RFIDs Scanned (Wired) -- RFID badges make for convenient security, and for convenient attack. Black hats can read your security cards from 2 or 3 feet away, and few in government are aware of the attack vector. To help prevent surreptitious readers from siphoning RFID data, a company named DIFRWear was doing brisk business at DefCon selling leather Faraday-shielded wallets and passport holders lined with material that prevents readers from sniffing RFID chips in proximity cards.
  4. A Comparison of Open Source Search Engines and Indexing Twitter -- Detailed write-up of the open source search options and how they stack up on a pile of Tweets. "While researching for the Software section, I was quite surprised by the number of open source vertical search solutions I found: Lucene (Nutch, Solr, Hounder), Sphinx, zettair, Terrier, Galago, Minnion, MG4J, Wumpus, RDBMS (mysql, sqlite), Indri, Xapian, grep … And I was even more surprised by the lack of comparisons between these solutions. Many of these platforms advertise their performance benchmarks, but they are in isolation, use different data sets, and seem to be more focused on speed as opposed to say relevance." (via joshua on Delicious)

tags: big data, gov2.0, meme wars, open source, privacy, rfid, search, security, transparency, twitter, visualization

 

Wed

Aug 5
2009

Ben Lorica

The US Online Job Market Improved Slightly in July

by Ben Lorica (@dliman) | comments: 12

Measured in terms of online job postings, the U.S. job market improved slightly in July. Here are two views of the number of job postings per day: note the slight uptick in July 2009 in both graphs.

[Two charts of online job postings per day; images not preserved.]

The worst year-over-year decline occurred in April; the online job market subsequently shed fewer postings in May and June. Given that July was an improvement over May/June, one would hope that the stage is set for a sustained upward trend. But with 45% fewer job postings in Jul-09 compared to Jul-08, the U.S. online job market remains far from the levels seen in previous years.

Alternatively, it may take a long period before job postings return to 2008 levels. Instead of looking for green shoots, we may have to brace ourselves and adjust to the New Normal: a stretch of time when job postings remain significantly below the levels of the 2006-2008 period.

There were fewer online job postings in every state, with losses ranging from 36-37% in VA, MD, OK, AK, to 55-58% in DE, WY, MN, WI. The two largest states (CA, TX) had 50% and 43% fewer job postings in 2009 compared to the same period in 2008:

[Chart of the year-over-year change in job postings by state; image not preserved.]

(†) In partnership with SimplyHired and Greenplum, we maintain a data warehouse that contains most U.S. online job postings dating back to mid-2005. Data for this post was through 7/31/2009.

tags: big data, economy, jobs

 

Fri

Jul 31
2009

Nat Torkington

Four short links: 31 July 2009

NoSQL, Goldman Sachs, Yahoo! Developer Products and Bing, and Alternate Reality

by Nat Torkington (@gnat) | comments: 3

On this day in history, Mt Fuji exploded (781), Daniel Defoe was put in the stocks for seditious libel but was pelted with flowers (1703), the first U.S. patent was issued (1790), and the radio show The Shadow aired for the first time (1930).

  1. Tokyo Cabinet: Beyond Key-Value Store -- description of Tokyo Cabinet and code examples in Ruby. More on the nosql move to leave relational databases behind for certain modern problems (such as scaling).
  2. The Great American Bubble Machine (Rolling Stone) -- I know it's old hat, but read it for the poetry if for nothing else. "The first thing you need to know about Goldman Sachs is that it's everywhere. The world's most powerful investment bank is a great vampire squid wrapped around the face of humanity, relentlessly jamming its blood funnel into anything that smells like money."
  3. Yahoo!'s Developer Program and Bing -- note from Yahoo! to developers, saying that YQL, YUI, and Pipes are safe. For SearchMonkey and BOSS they currently do not have anything concrete to tell you. I assume (and hope) that Delicious is a top-level product, not something under "search". (via Simon Willison)
  4. Preparing Us for AR -- (Schulze & Webb) round up of some apps and toys that show what AR might be, unfettered by current day technological constraints.

tags: alternate reality, big data, bing, finance, financial crisis, nosql, yahoo, yahoo pipes

 

Tue

Jul 28
2009

Ben Lorica

HadoopDB: An Open Source Parallel Database

by Ben Lorica (@dliman) | comments: 0

The growing need to manage and make sense of Big Data has led to a surge in demand for analytic databases, which many companies are attempting to fill (Teradata, Netezza, Vertica, DATAllegro, Greenplum, Aster Data, Infobright, Kognitio, Kickfire, Dataupia, ParAccel, Exasol, ...). As an alternative to current shared-nothing analytic databases, HadoopDB is a hybrid that combines parallel databases with scalable and fault-tolerant Hadoop/MapReduce systems.

HadoopDB consists of Postgres on each node (database layer), Hadoop/MapReduce as a communication layer that coordinates the multiple nodes each running Postgres, and Hive as the translation layer. The result is a shared-nothing parallel database that business analysts can interact with using a SQL-like language. [Technical details can be found in the following paper.]

We recently spent an hour discussing Big Data and HadoopDB with Yale CS Professor (and HadoopDB co-creator) Daniel Abadi. One of the main motivations for building HadoopDB was the desire to make available an open source parallel database. While some analytic database vendors have built parallel systems using open source databases (e.g. Aster Data and Greenplum use Postgres), the resulting products aren't open source.

By taking advantage of Hadoop (particularly HDFS, scheduling, and job-tracking), HadoopDB distinguishes itself from many of the current parallel databases by dynamically monitoring and adjusting for slow nodes and node failures to optimize performance in heterogeneous clusters. Especially in cloud computing environments, where there might be wild fluctuations in the performance and availability of individual nodes, fault-tolerance and the ability to perform in heterogeneous environments are critical. Given that the performance of current parallel databases scales (near linearly) as more nodes are added, vendors strive to develop systems that can be easily deployed on large clusters. Current parallel databases have been deployed mostly on systems with fewer than a hundred nodes. OTOH, the use of Hadoop technology allows HadoopDB to easily scale to hundreds (if not thousands) of nodes.

Generally speaking, Professor Abadi places HadoopDB somewhere between Hadoop and parallel databases when it comes to the trade-off between load (data loads are slower than Hadoop, but faster than parallel databases) and runtime (on structured data, HadoopDB is faster than Hadoop but slower than parallel databases). Below are some graphs from a series of tests conducted by the HadoopDB team:

Performance on Data Loads

[Graph not preserved.]

Performance on Analytic Tasks

[Graph not preserved.]

In our report on Big Data Management Technologies, we highlighted that (given the lack of upfront relational data modeling) Hadoop and other simple key-value databases encouraged experimentation that could lead to quick insights. But as query patterns emerge, " ... more refined data structures, data transformation, and data access processes can be built (including interfaces to relational RDBMSs) that make subsequent inquiries easy to repeat." In practice this means throwing data into Hadoop, observing how users interact with the data, then building relational data marts accordingly. The vision of the HadoopDB development team fits perfectly into this workflow. Over time, the HadoopDB team envisions their system initially loading all the data into HDFS, then taking advantage of query patterns to dynamically load the right data slices into relational data structures.

Admittedly, the HadoopDB team needs to release tools to make their system easier to use/deploy. The HadoopDB development team is comprised entirely of Yale CS Department members, although Professor Abadi is hoping that open source developers will start contributing to the project. But if a paid gig is what you're after, the good news is that they're in search of a Chief Hacker.

[For more on Big Data, check out our report and follow @bigdata.]

(†) We were among the first users of Greenplum. In partnership with SimplyHired and Greenplum, we actively maintain a data warehouse that contains most U.S. online job postings dating back to mid-2005.

tags: big data, hadoop, hive, mapreduce, mpp, postgres

 

Tue

Jul 21
2009

Nat Torkington

Four short links: 21 July 2009

Semweb, Comedy Java, Mobile Spyware, Crypto

by Nat Torkington (@gnat) | comments: 0

  1. On Data Reconciliation Strategies and Their Impact on the Web of Data -- "For years, I’ve been a fairly vocal advocate for the elegance and scalability of a-posteriori reconciliation via equivalence mappings as a superior mechanism (scale-wise) to a-priori reconciliation efforts… but this started to change very rapidly once I started working for Metaweb and saw first hand how much more effective a-priori reconciliation can be, even if drastically more expensive and limiting in the data acquisition front." (via straup on Delicious)
  2. Java Spring's Biggus Dickus Effect -- Nonstop administrative debris as dadaist poetry. Écriture automatique of the programming office manager or his parrot. (via mattb on Delicious)
  3. Arabic Blackberry Spyware -- update pushed out to Arabic Blackberries CC:ed all email to the authorities. A powerful case for multi-distro platforms, which reduce the size of the market captured when one distro is pwned like this.
  4. NaCl - Networking and Cryptography Library -- open source high-level crypto library. NaCl (pronounced "salt") is a new easy-to-use high-speed software library for network communication, encryption, decryption, signatures, etc. NaCl's goal is to provide all of the core operations needed to build higher-level cryptographic tools. Of course, other libraries already exist for these core operations. NaCl advances the state of the art by improving security, by improving usability, and by improving speed. Creator of qmail is one of the developers. (via Simon Willison)

tags: big data, cryptography, mobile, opensource, security, semantic web

 

Wed

Jul 1
2009

Ben Lorica

The US Online Job Market Was (still) Down Big In June 2009

by Ben Lorica (@dliman) | comments: 5

Updating my post from early June, the U.S. online job market still hasn't shown signs of recovering from steady declines that began in September of last year. Compared to the same period last year, there were 50% fewer job postings in June 2009.

An alternate view highlights the start of the downward trend, as well as the smaller than expected seasonal bounce from Dec-08 to Jan/Feb 2009. In a normal year, the number of postings declines in December (as employers table job searches for after the holidays) and recovers sharply the following Jan/Feb. While job postings did bounce back in Jan/Feb 2009, the seasonal bump was less than half of what occurred in previous years.

[Chart of the alternate view of job postings; image not preserved.]

No geographic region has been exempt from the downturn in online job postings. There have been sharp declines in all states, ranging from -59% in DE, WY, and MN, to -38% in MD, OK, VA.

In closing, we still haven't detected the green shoots that some forecasters have been crowing about over the last few months. If one were to take an optimistic perspective, the worst year-over-year decline occurred in April. OTOH, we are still staring at a 50% decline in June 2009. So while we may have hit the bottom in April, we need a few strong(er) months before we can comfortably announce the arrival of green shoots.

(†) In partnership with SimplyHired and Greenplum, we maintain a data warehouse that contains most U.S. online job postings dating back to mid-2005. Data for this post was through 6/28/2009.

tags: big data, economy, jobs

 

Thu

Jun 11
2009

Ben Lorica

Mechanical Turk Best Practices

by Ben Lorica (@dliman) | comments: 8

Last night, Dolores Labs hosted what was billed as the first-ever Mechanical Turk meetup, and I was fortunate enough to have been able to squeeze into what turned out to be a great series of presentations. While Amazon was the pioneer and remains the largest provider in the space, other services like Dolores Labs and Nathan Eagle's txteagle have emerged to expand the pool of users and turks.

In the past, we've turned to Dolores Labs when we needed (machine-learning) training sets and were unable to quickly find reliable ones. To increase the quality of the output we receive from turks, we try to get multiple turks to perform an individual task and aggregate their work into a single answer. (We jokingly refer to this as the wisdom of micro-crowds.) Working on problems quite different from the ones we tackle, the first set of speakers presented research results confirming that this form of aggregation actually works. Rion Snow of Stanford's AI Lab presented results that suggest that for a large set of tasks, the aggregate work of 4-6 turks compares favorably to the work of a single (domain) expert. Working primarily in the area of NLP and computational linguistics, Bob Carpenter of alias-i presented similar results when evaluating turk-generated against gold standard training sets. (It's hard enough when turks disagree, but as Bob Carpenter highlighted, disagreements among experts make it difficult to arrive at a gold standard.) Bob has found that in certain situations an iterative approach works best ("code-a-little", "learn-a-little") and tools that allow you to start suggesting "answers" to a new set of turks would help immensely. Coincidentally, one of the speakers presented a toolkit that allows users to do just that: Greg Little's TurKit is a JavaScript API for running iterative tasks in mechanical turk.
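The simplest way to aggregate several turks' work into a single answer is a majority vote, sketched below in Python. This is an assumption about one reasonable aggregation scheme, not a description of what Dolores Labs or the speakers actually use; the function name and the `(item_id, worker_id, label)` layout are hypothetical.

```python
from collections import Counter


def aggregate_labels(judgments):
    """Combine redundant judgments into one label per item by majority vote.

    judgments: iterable of (item_id, worker_id, label) tuples.
    Returns {item_id: (winning_label, share_of_votes)}.
    """
    labels_by_item = {}
    for item_id, _worker_id, label in judgments:
        labels_by_item.setdefault(item_id, []).append(label)

    result = {}
    for item_id, labels in labels_by_item.items():
        # On a tie, Counter returns the label that was seen first.
        label, votes = Counter(labels).most_common(1)[0]
        result[item_id] = (label, votes / len(labels))
    return result


judgments = [
    ("tweet-1", "w1", "positive"), ("tweet-1", "w2", "positive"),
    ("tweet-1", "w3", "negative"), ("tweet-1", "w4", "positive"),
    ("tweet-1", "w5", "positive"),
    ("tweet-2", "w1", "negative"), ("tweet-2", "w2", "negative"),
    ("tweet-2", "w3", "positive"),
]
for item, (label, agreement) in aggregate_labels(judgments).items():
    print(item, label, f"{agreement:.0%} agreement")
```

The agreement share is a cheap signal for deciding which items deserve another round of judgments, which fits the iterative approach described above.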

Another set of speakers talked about the emergence of mechanical turks as a research tool. Social scientists Aaron Shaw and John Horton spoke favorably of their experience using turks for research experiments in economics and paired surveys. Among other things, they've conducted studies on the turk labor market by testing demand for tasks of varying difficulty (something Bob Carpenter also talked about), and by evaluating demand for follow-on tasks at lower wages. Alexander Sorokin of UIUC presented work on using turks to annotate training sets for computer vision and robotics. For those interested in using turks to annotate images, Alex has a toolkit ready to go.

For most users of Mechanical Turk (us included), it has become an API call that fits smoothly within their workflow. (Or as someone at the meetup wryly suggested, turk is a Remote Person Call.) The last pair of speakers, Lilly Irani and Six Silberman, reminded us that behind Mechanical Turk lie thousands of workers ("the crowd in the cloud") working without (health care) benefits, oftentimes at extremely low hourly wages. Irani and Silberman suggested that rather than abstracting Mechanical Turk services as mere API calls, users should start thinking of the plight of the turks ("Mechanical Turk Bill of Rights") behind the service. As a first step they have released a Firefox plugin that aims to narrow the information asymmetry between turks (those performing tasks) and requesters (those posting tasks). While requesters can see ratings for turks, requesters aren't rated: Turkopticon lets turks rate requesters. They need more turks to download and start using Turkopticon, so if you know any mechanical turks please encourage them to do so.

(†) According to Amazon representatives in the audience, a majority of turks are in the U.S. That may change in the future, once Amazon is able to get approval for other payment systems. Because of the possibility of money-laundering, services like AMT are subject to strict KYC controls.

tags: big data, machine learning, mechanical turk, meetup
comments: 8

 

Wed

Jun 3
2009

Ben Lorica

The Economic Crisis and the US Online Job Market

by Ben Lorica (@dliman)
comments: 11

In my previous post, I noted that despite the large decline in the total number of job postings, the number of Hadoop/MapReduce job postings increased by 49%. What is the current state of the online job market? The financial crisis that began in the Fall of 2008 has had a lasting negative effect on the U.S. online job market. Since late 2008, there have been significantly fewer jobs posted online.

Using data from SimplyHired and a few charts, I'll quickly highlight the impact of the global economic crisis on the U.S. online job market. To quantify the sudden drop in U.S. online job postings, I calculated the average number of job posts per day:

[Chart: average number of U.S. online job posts per day]

The number of posts declined 49% from Jan/May 2008 to Jan/May 2009. While there has been a downward trend since April 2008, the financial crisis in September 2008 marked the start of even larger reductions. In particular, the relatively small number of job postings in Nov/Dec 2008 has carried over into the first five months of 2009. The sharp seasonal rebound that occurs in Jan/Feb of each year was practically non-existent in 2009. While some forecasters are seeing signs of a recovery, at least through the first five months of 2009, we haven't detected "green shoots" in the U.S. online job market.
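
For readers who want to reproduce this kind of comparison on their own data, the following Python sketch shows the arithmetic: an average number of posts per day over a date range, then the percentage change between two periods. The function names are hypothetical and the totals are placeholders chosen to make the arithmetic easy to follow; they are not SimplyHired figures.

```python
from datetime import date

def average_per_day(total_posts, start, end):
    """Average posts per day over the inclusive date range [start, end]."""
    days = (end - start).days + 1
    return total_posts / days

def percent_change(old, new):
    """Percentage change from `old` to `new` (negative means a decline)."""
    return (new - old) / old * 100.0

# Placeholder totals for illustration only, NOT actual job-posting counts.
avg_2008 = average_per_day(1_520_000, date(2008, 1, 1), date(2008, 5, 31))
avg_2009 = average_per_day(755_000, date(2009, 1, 1), date(2009, 5, 31))
print(f"{percent_change(avg_2008, avg_2009):.1f}%")
```

Working with a per-day average instead of raw totals keeps periods of slightly different lengths comparable (Jan/May 2008 includes a leap day).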

(continue reading)

tags: big data, jobs
comments: 11

 

Mon

Jun 1
2009

Ben Lorica

Most Hadoop Jobs Are In California

by Ben Lorica (@dliman)
comments: 4

Given the recent buzz surrounding Hadoop and MapReduce, I was curious if employers were beginning to mention either term in their job postings. Fortunately I have access to a massive job data warehouse dating back to mid-2005. In partnership with SimplyHired and Greenplum, we maintain a data warehouse that contains most of the online job postings in the U.S.

While the percentage of job postings that mention either Hadoop or MapReduce remains minuscule, the number of such postings is growing steadily:

[Chart: number of Hadoop/MapReduce job postings over time]

The number of Hadoop/MapReduce job postings (during the Feb/Apr 2009 period) grew 49% compared to 2008. In contrast, the tough economic environment has translated to significantly fewer job postings: the total number of online job postings declined 40% during the same period.

How mainstream is Hadoop? While researching our report on Big Data, we talked to a (database) vendor who jokingly claimed that nobody outside of the West & East coast cared about Hadoop. Analysis of recent job postings seems to support that perspective. During the three most recent months, employers in 18 states posted Hadoop/MapReduce jobs online, but 60% of those were in California. The top 5 states (CA, MD, NY, MA, WA) accounted for 87% of the Hadoop/MapReduce job postings:

[Chart: share of Hadoop/MapReduce job postings by state]

Looking at the same period last year, 72% of the job postings were in California, and the top 5 states (CA, WA, TX, PA, VA) accounted for 79%.

Given the presence of large (Google, Yahoo!, Facebook) and small companies (Cloudera, Greenplum, Aster, ...) who are leaders in the use of Hadoop/MapReduce, it's no surprise that at this early stage, a large share of jobs are in California. While the share of California job postings remains high (60%), it's down from 72% last year. As mentioned above, the percentage of job postings that mention either Hadoop or MapReduce remains minuscule, so I caution against reading too much into the geographic distributions. Nevertheless, it's clear that California employers are expressing interest in Hadoop skills ahead of their peers in other states.
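
The state-level percentages quoted above are simple shares of a count. As a hedged sketch of how such numbers can be derived, the Python below computes each state's share and the combined share of the top N states from a list of postings. The function names and the ten-posting sample are invented for illustration and do not come from the job data warehouse.

```python
from collections import Counter

def state_shares(states):
    """Map each state to its percentage share of all postings."""
    counts = Counter(states)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {state: 100.0 * n / total for state, n in counts.items()}

def top_n_share(states, n=5):
    """Combined percentage share of the n states with the most postings."""
    counts = Counter(states)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * sum(c for _, c in counts.most_common(n)) / total

# Ten made-up postings, one state code per posting (illustration only).
postings = ["CA"] * 6 + ["NY"] * 2 + ["WA", "TX"]
print(state_shares(postings)["CA"], top_n_share(postings, n=2))
```

With so few postings overall, a handful of new listings in one state can move these shares noticeably, which is one reason for the caution above.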

tags: big data, hadoop, jobs
comments: 4

 

Tue

May 26
2009

Nat Torkington

Four short links: 26 May 2009

Databases, Sensors, Visualization, and Patents

by Nat Torkington (@gnat)
comments: 0

  1. Flare -- dynamically partitioning and reconstructing key-value server. Currently built on Tokyo Cabinet, but backend is theoretically pluggable. (via joshua on delicious)
  2. Implantable Device Offers Continuous Cancer Monitoring -- the sensor network begins to extend into our bodies. The cylindrical, 5-millimeter implant contains magnetic nanoparticles coated with antibodies specific to the target molecules. Target molecules enter the implant through a semipermeable membrane, bind to the particles and cause them to clump together. That clumping can be detected by MRI (magnetic resonance imaging). The device is made of a polymer called polyethylene, which is commonly used in orthopedic implants. The semipermeable membrane, which allows target molecules to enter but keeps the magnetic nanoparticles trapped inside, is made of polycarbonate, a compound used in many plastics. (via FreakLabs)
  3. Visualizing Data source -- the source code to examples in Visualizing Data.
  4. The First Software Patent (Wired) -- was issued on this day in 1981, for a complex full-text storage and retrieval system. Tellingly, the business strategy of the owner of the first software patent was ... to become a patent lawyer. A day that will linger in irritation, if not live in infamy. (via glynmoody on Twitter)

tags: big data, book related, databases, history, law, medicine, patent, sensors, visualization
comments: 0

 

Tue

May 19
2009

Joshua-Michéle Ross

Captivity of the Commons

by Joshua-Michéle Ross (@jmichele)
comments: 25

This post is part two of the series, “The Question Concerning Social Technology”. Part one is here. These posts will be opened to live discussion in an upcoming webcast on May 27.

In January 2002 DARPA launched the Information Awareness Office. The mission was to “imagine, develop, apply, integrate, demonstrate and transition information technologies, components and prototype, closed-loop, information systems that will counter asymmetric threats by achieving total information awareness” (emphasis added). The notion of a government agency achieving total information awareness was too Orwellian to ignore. Under criticism that this “awareness” could quickly migrate to a mass surveillance system, the program was defunded.

Fast-forward to last week and my near-purchase of Libbey Duratuff Gibraltar Glasses (the perfect bourbon glass, one might speculate). Over the course of the next few days I was peppered with exact-match ads for Libbey Duratuff glassware on several other websites: a small example of information awareness at work.

Personal data is the currency of Web 2.0. Knowing what we watch, buy, click, own, what we think, intend and ultimately do confers competitive advantage. Facebook possesses your social graph, your personal interests and your full profile (age, location, relationship status, etc.), not to mention your daily (or hourly) answer to their persistent question, “what’s on your mind?” Reviewing the “25 Surprising Things Google Knows About You” should give anyone pause. And it’s not just the Web 2.0 set. Credit Card Companies, Telcos, Insurance, Pharma… all are collecting vast stores of personal data. If you watch the trendline, it is moving toward more data and more analytic capability - not less.

So why is it that we seem to have more comfort when the capacity for total information awareness lies with corporations as opposed to government? Experience shows that there is a very thin barrier between the two. To wit, the release of thousands of phone records to the U.S. government - and, conveniently, government immunity for those same corporations after the breach. Google and Yahoo! and Microsoft have all been accused of cooperating with the Chinese government to aid censorship and repression of free speech. What happens if/when we encounter the next version of the Bush administration that sees no problem abrogating civil rights in pursuit of “evildoers”?

What's more, when we deliver our personal information over to corporations we are giving this data over to an institution that is amoral. Companies are not yet structured to deliver moral or ethical results - they are encouraged to grow and deliver “shareholder value” (read money) which is a numb and narrow measure of value. Do I want my data to be managed by an amoral institution?

To be clear - I want the convenience and miracles that modern technology brings. I love the Internet and I am willing to give over lots of data in the trade. But I want two fundamental protections:

First, change the corporation. The structure of the corporation continues to be driven by 20th century hard goals of efficiency and scale - not by more complex measures of environmental sustainability, value creation and the commonweal. These are simply not adequately factored into any structural, organizational, incentive or taxation systems of business today. Profit and profit motive are fine - but hiding social and environmental costs is no longer acceptable. I want to deal with institutions capable of morality. This is no small task - but if we can build the Internet….

Second, we need a right to privacy that matches the 21st century reality. As a friend of mine likes to say, “privacy is now a responsibility - not a right.” While it is pithy (and perhaps true), the reason we grant rights - and laws to enforce those rights in society - is the simple fact that people do not generally have the wherewithal to protect themselves from large, institutional interests. In the same way that regulatory structures are needed to keep a financial system in balance (alas even the Ayn Rand acolyte Greenspan finally agrees with this truism), we need new rights and regulations governing the use of our personal data - and simple sets of controls over who has access to it.

The true work of the 21st century lies not in refining our technology - this we will achieve without any political will. The work lies in re-imagining our institutions.

tags: big data, social graph, social media, social networking, social web
comments: 25

 

Mon

May 4
2009

Ben Lorica

Big Data: SSD's, R, and Linked Data Streams

by Ben Lorica (@dliman)
comments: 4

The Solid State Storage Revolution: If you haven't seen it, I recommend you watch Andy Bechtolsheim's keynote at the recent Mysqlconf. We covered SSD's in our just-published report on Big Data management technologies. Since then, we've gotten additional signals from our network of alpha geeks, and our interest in them remains high.

R and Linked Data Streams: I had a chance to visit with Dataspora founder and blogger Mike Driscoll, an enthusiastic advocate for the use of the open source statistical computing language, R. After founding and leading online retailer CustomInk.com, Mike went back to grad school and earned a doctorate in Bioinformatics. He has applied data analysis and programming in a variety of domains including retail, biotech, academia, and government projects.

Having been an avid user of S/S-Plus in the 1990's, I seamlessly switched over to R in the early 2000's. To this day, I consider the S/S-Plus user manuals to be the best reference and introductory books on the R programming language. (Mike wholeheartedly agrees.) R has been popular in the statistics community for many years, but I've been noticing that its visualization and analytic capabilities are attracting interest from developers. Moreover, recent efforts by the R community to improve its ability to scale to large data sets (see brief update from Jay Emerson) will strengthen R's place in the Big Data stack.

(continue reading)

tags: analytics, big data, r, ssd, statistics, video
comments: 4

 

Tue

Apr 28
2009

Ben Lorica

How Big Data Impacts Analytics

by Ben Lorica (@dliman)
comments: 9

Research for our just-published report on Big Data management technologies included conversations with teams who are at the forefront of analyzing massive data sets. We were particularly impressed with the work being produced by LinkedIn's analytics team. [We have more details on LinkedIn's analytics team in an article in the upcoming issue of Release 2.0.]

At the second Social Web Foo camp, I had a chance to visit with LinkedIn's Chief Scientist DJ Patil. As a mathematician specializing in dynamical systems and chaos theory, DJ began his career as a weather forecaster working for the Federal government. Years later, he ended up in an analytics role at eBay where his prior experience with massive data sets came in handy. In the short video below, DJ shares his observations on how analytics has changed in recent years, especially as Big Data becomes increasingly common. Companies are casting a wider net, and are hiring scientists from fields not traditionally known as fertile recruiting grounds for data intelligence teams.

DJ also talks about his personal journey from mathematics to e-commerce and social networks. Among his previous stints, DJ worked with the DOD and used "... social network analysis to identify terrorists."

Other short videos from Social Web Foo camp:

  • Ty Ahmad-Taylor on the Challenges Facing Television
  • Steve Ganz' observations midway through Social Web Foo Camp Year 2
tags: analytics, big data, foo camp, hadoop, social networking, social web, swfoo, video
comments: 9

     

    Thu

    Apr 16
    2009

    Nat Torkington

    Four short links: 16 Apr 2009

    by Nat Torkington (@gnat)
    comments: 1

    China, databases, storage, and git:

    1. China's Complicated Internet Culture (Ethan Zuckerman) -- summary of Rebecca MacKinnon's talk at the Berkman Internet Center. Democracy is complex and hard to transition to, online democracy doubly so. Rebecca questions the widespread but unjustified belief that the Great Firewall of China is all that separates Chinese citizens from the empowered liberty of the West, and lays out the tangled state of affairs in China's political Internet. Despite the rise of web video, “no one has managed to organized an opposition party on the web,” Rebecca points out. “There’s no Lech Walenza, no religious movement - Falun Gong has been squished pretty thoroughly.” (via cshirky's delicious stream)
    2. Drop ACID and Think About Data -- Bob Ippolito's talk from PyCon about the things you can do easily when you forsake the promises of ACID. More in the ongoing reinvention of databases for the needs of modern web systems. (via cesther's Twitter stream)
    3. The Pogoplug -- The Pogoplug connects your external hard drive to the Internet so you can easily share and access your files from anywhere. We're accumulating terabytes of storage at home, where it's very useful to all the computers in the home. This offers an easy way for non-technical civilians to make these drives useful outside the home as well. There are many possibilities for Interesting Things in the massive storage we're accumulating. (via joshua's delicious stream)
    4. Gitorious -- open source (AGPLv3) clone of GitHub. (via edd's delicious stream)

    tags: big data, china, databases, democracy, hardware, open source, politics, programming
    comments: 1

     

    Mon

    Mar 23
    2009

    Ben Lorica

    Big Data: Technologies and Techniques for Large-Scale Data

    by Ben Lorica (@dliman)
    comments: 3

    Our belief that proficiency in managing and analyzing large amounts of data distinguishes market-leading companies led to a recent report designed to help users understand the different large-scale data management techniques. Our report on Big Data Technologies was the result of interviews with over thirty experts, including research scientists, (open-source) hackers, vendors, data analysts, and entrepreneurs. Rather than endorse specific vendors and technologies, we provide a framework to help readers navigate the wide variety of options available. (NOTE: If you're interested in purchasing the report as a single issue of Release 2.0, we can provide you with a DISCOUNT CODE. Contact information is at the end of the video clip below.)

    I recently sat down with my co-author, Roger Magoulas (Director of Research at O'Reilly), who agreed to talk about our report and Big Data in general. Roger begins by speaking passionately of the importance of data management and analysis. He proceeds to highlight what we believe to be the key technology dimensions for evaluating data management solutions. The video ends with a glimpse into future technologies and general advice to organizations interested in improving their proficiency in handling data.

    The full program is available in four extended clips:

  • What is Big Data and why is it important? (3:33 minutes)
  • Big Data Technologies (1:35 minutes)
  • Key Technology Dimensions (4:52 minutes)
  • A Look Into The Future and Closing Summary (3:42 minutes)
    [ Head over to O'Reilly Media's YouTube channel for other interesting videos. ]

    tags: big data, hadoop, key-value, mapreduce, mpp, report, sql, video
    comments: 3