Archive | mapreduce

Mahout, There It Is! Open Source Algorithms Remake | Wired Enterprise

The tale highlights the benefit of a blue-sky R&D operation. Overstock was founded in 1999 and went public in 2002, and Byrne — the company’s swashbuckling chief exec — created O Labs about a year ago to feed a bit more of the entrepreneurial ethos back into the company. “We’re saving $2 million a year with Mahout, and that never would have happened if not for the sort of experimental stuff we’re doing in the labs,” says Bagley. “We’re discovering things that can then have benefit across the company.” But it also shows how Hadoop and related open source tools continue to evolve and push even further across the web and into businesses. Mahout — which was specifically built for use with Hadoop — is little more than 3 years old, and it has already attracted the attention of several big-name web operations, including not only Overstock, but AOL, Foursquare, Yahoo, Twitter, and even Amazon.

(Full Story: Mahout, There It Is! Open Source Algorithms Remake | Wired Enterprise)


Why Hadoop MapReduce needs Scala

A look at Scoobi and Scalding Scala DSLs for Hadoop

(Full Story: Why Hadoop MapReduce needs Scala)

MapReduce translations, from skyscrapers to Hadoop clusters

In our skyscraper analogy, the work the fire wardens did in getting the per-suite, per-platform handset count would be the Map step in our job, and the work the fire wardens on the 10th, 20th and 30th floors did, in calculating their final platform tallies, would be the Reduce step. A Map step and a Reduce step constitute a MapReduce job. Got that?

Let’s keep going. For the building, the collection of suite numbers and smartphone types in each would represent the keys and values in our input file. We split/partitioned that file into a smaller one for each floor which, just like the original input file, would have suite number as the key and smartphone platform data as the value. For each suite, our mappers (the fire wardens on each floor) created output data with the smartphone platform name as key and the count of handsets for that suite and platform as the value. So the mappers produce output which eventually becomes the Reduce step’s input.
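The analogy above can be sketched in a few lines of Python. This is only an illustration, not Hadoop code: the suite numbers, the handset lists, and the function names (map_suite, shuffle, reduce_platform, run_job) are all made up for the example, and the per-floor split of the input is left out to keep it short.

```python
from collections import defaultdict

# Hypothetical input file: key = suite number, value = handsets seen in that suite.
suites = [
    (101, ["iOS", "Android", "iOS"]),
    (102, ["Android"]),
    (201, ["BlackBerry", "iOS"]),
    (202, ["Android", "Android"]),
]

def map_suite(suite_number, handsets):
    """Map step: for one suite, emit (platform, count) pairs."""
    counts = defaultdict(int)
    for platform in handsets:
        counts[platform] += 1
    return list(counts.items())

def shuffle(mapped_pairs):
    """Group every mapper output value under its platform key."""
    grouped = defaultdict(list)
    for platform, count in mapped_pairs:
        grouped[platform].append(count)
    return grouped

def reduce_platform(platform, counts):
    """Reduce step: sum the per-suite counts into one tally per platform."""
    return platform, sum(counts)

def run_job(records):
    """Run the Map step, group by key, then run the Reduce step."""
    mapped = [
        pair
        for suite_number, handsets in records
        for pair in map_suite(suite_number, handsets)
    ]
    grouped = shuffle(mapped)
    return dict(reduce_platform(p, c) for p, c in grouped.items())

tallies = run_job(suites)
print(tallies)
```

The mapper output uses the platform name as its key, which is what lets the grouping step hand each reducer all the counts for a single platform.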

(Full Story: MapReduce translations, from skyscrapers to Hadoop clusters)

Crunch: Easy MapReduce Pipelines for Hadoop

Crunch is a Java library that aims to make writing, testing, and running MapReduce pipelines easy, efficient, and even fun. Crunch’s design is modeled after Google’s FlumeJava, focusing on a small set of simple primitive operations and lightweight user-defined functions that can be combined to create complex, multi-stage pipelines. At runtime, Crunch compiles the pipeline into a sequence of MapReduce jobs and manages their execution.
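To give a feel for the "small set of primitives, deferred execution" idea, here is a toy sketch in Python. It is not Crunch's API and it runs no MapReduce jobs: the Pipeline class and its parallel_do, group_by_key, plan, and run methods are hypothetical names chosen for the example. The pipeline only records its stages as they are chained, and nothing executes until run() is called.

```python
class Pipeline:
    """Toy deferred pipeline: records stages now, executes them on run()."""

    def __init__(self, source, stages=()):
        self.source = source
        self.stages = tuple(stages)

    def parallel_do(self, fn):
        # fn is a user-defined function: one record in, zero or more records out.
        return Pipeline(self.source, self.stages + (("parallel_do", fn),))

    def group_by_key(self):
        # Expects (key, value) records and produces (key, [values]) records.
        return Pipeline(self.source, self.stages + (("group_by_key", None),))

    def plan(self):
        """Return the recorded stage names without executing anything."""
        return [name for name, _ in self.stages]

    def run(self):
        """Execute the recorded stages in order, in memory."""
        records = list(self.source)
        for name, fn in self.stages:
            if name == "parallel_do":
                records = [out for rec in records for out in fn(rec)]
            else:
                grouped = {}
                for key, value in records:
                    grouped.setdefault(key, []).append(value)
                records = list(grouped.items())
        return records


lines = ["a b a", "b c"]
word_counts = (
    Pipeline(lines)
    .parallel_do(lambda line: [(word, 1) for word in line.split()])
    .group_by_key()
    .parallel_do(lambda kv: [(kv[0], sum(kv[1]))])
)
print(word_counts.plan())
print(dict(word_counts.run()))
```

A real planner would presumably inspect the recorded stages and decide how to pack them into jobs; this sketch simply walks them one by one.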

(Full Story: Crunch: Easy MapReduce Pipelines for Hadoop)

MapReduce Patterns, Algorithms, and Use Cases

Basic MapReduce Patterns, Not-So-Basic MapReduce Patterns, Relational MapReduce Patterns, Machine Learning and Math MapReduce Algorithms

(Full Story: MapReduce Patterns, Algorithms, and Use Cases)

Tenzing A SQL Implementation On The MapReduce Framework

Tenzing is a query engine built on top of MapReduce for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility. Tenzing is currently used internally at Google by 1000+ employees and serves 10000+ queries per day over 1.5 petabytes of compressed data. In this paper, we describe the architecture and implementation of Tenzing, and present benchmarks of typical analytical queries.

(Full Story: Tenzing A SQL Implementation On The MapReduce Framework)

Building data science teams – O’Reilly Radar

It’s hard to overstate the sophistication of the tools needed to instrument, track, move, and process data at scale. The development and implementation of these technologies is the responsibility of the data engineering and infrastructure team. The technologies have evolved tremendously over the past decade, with an incredible amount of collaboration taking place through open source projects.

(Full Story: Building data science teams – O’Reilly Radar)

LexisNexis open sources Hadoop challenger

HPCC uses LexisNexis’ own data-centric declarative programming language, known as ECL. Developed 10 years ago, it compiles to C++. HPCC includes two data-crunching platforms: the Thor Data Refinery Cluster and the Roxie Rapid Data Delivery Cluster.
LexisNexis senior vice president and chief technology officer Armando Escalante says Thor is analogous to Hadoop, while Roxie is the component that Hadoop is currently missing. Since it’s written in C++, he says, the system is also faster than Hadoop, which is written in Java.
“We’ve been 10 years perfecting it,” Escalante said, “and we tweaked it up the wazoo to get all the performance we can. We can add more use cases and make it better.”
“We are four times faster than Hadoop on the Thor side. If Hadoop needs 1,000 nodes we can do it with 250 – that means less cooling and data center space.”

(Full Story: LexisNexis open sources Hadoop challenger)

Twitter open sources Storm a mapreduce framework

I’ve only scratched the surface on Storm. The “stream” concept at the core of Storm can be taken so much further than what I’ve shown here — I didn’t talk about things like multi-streams, implicit streams, or direct groupings. I showed two of Storm’s main abstractions, spouts and bolts, but I didn’t talk about Storm’s third, and possibly most powerful abstraction, the “state spout”. I didn’t show how you do distributed RPC over Storm, and I didn’t discuss Storm’s awesome automated deploy that lets you create a Storm cluster on EC2 with just the click of a button.

(Full Story: Twitter open sources Storm a mapreduce framework)

Map / Reduce – A visual explanation

Map/Reduce is a term commonly thrown about these days. In essence, it is just a way to take a big task and divide it into discrete tasks that can be done in parallel. A common use case for Map/Reduce is in document databases, which is why I found myself thinking deeply about this.
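As a rough illustration of "divide a big task into discrete tasks that can be done in parallel", the Python sketch below counts comments per blog post over a handful of made-up documents. The document shape, the chunk size, and the helper names (chunked, map_chunk, reduce_counts, count_comments) are assumptions for the example, not taken from any particular document database.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical documents: each comment records the blog post it belongs to.
comments = [
    {"post": "intro", "author": "ann"},
    {"post": "scaling", "author": "bob"},
    {"post": "intro", "author": "cy"},
    {"post": "faq", "author": "dee"},
    {"post": "scaling", "author": "eve"},
    {"post": "intro", "author": "fay"},
]

def chunked(items, size):
    """Split the big task into discrete, independent pieces."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def map_chunk(chunk):
    """Map: count comments per post within a single chunk."""
    return Counter(doc["post"] for doc in chunk)

def reduce_counts(partials):
    """Reduce: merge the partial counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def count_comments(docs, chunk_size=2, workers=4):
    """Map the chunks concurrently, then reduce the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunked(docs, chunk_size)))
    return reduce_counts(partials)

totals = count_comments(comments)
print(dict(totals))
```

Threads keep the sketch simple. For CPU-bound work in standard CPython, swapping in ProcessPoolExecutor from the same module would be the more likely choice.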

(Full Story: Map / Reduce – A visual explanation)