Tag Archives: hadoop

Why Hadoop MapReduce needs Scala

A look at Scoobi and Scalding Scala DSLs for Hadoop

(Full Story: Why Hadoop MapReduce needs Scala)

MapReduce translations, from skyscrapers to Hadoop clusters

In our skyscraper analogy, the work the fire wardens did in getting the per-suite, per-platform handset count would be the Map step in our job, and the work the fire wardens on the 10th, 20th and 30th floors did, in calculating their final platform tallies, would be the Reduce step. A Map step and a Reduce step constitute a MapReduce job. Got that?

Let’s keep going. For the building, the collection of suite numbers and smartphone types in each would represent the keys and values in our input file. We split/partitioned that file into a smaller one for each floor which, just like the original input file, would have suite number as the key and smartphone platform data as the value. For each suite, our mappers (the fire wardens on each floor) created output data with the smartphone platform name as key and the count of handsets for that suite and platform as the value. So the mappers produce output which eventually becomes the Reduce step’s input.
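The flow described above can be sketched in a few lines of plain Python. This is only a minimal illustration of the analogy, not Hadoop code: the suite data, the `suite // 100` floor rule, and the function names (`split_by_floor`, `map_suite`, `shuffle`, `reduce_platform`, `run_job`) are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical input file: suite number (key) -> handset counts per platform (value).
suites = {
    101: {"iPhone": 3, "Android": 2},
    102: {"Android": 4, "BlackBerry": 1},
    201: {"iPhone": 1, "Windows Phone": 2},
    202: {"Android": 1, "iPhone": 5},
}

def split_by_floor(records):
    """Partition the input into one smaller input per floor.
    Assumes the floor is the suite number divided by 100."""
    floors = defaultdict(dict)
    for suite, handsets in records.items():
        floors[suite // 100][suite] = handsets
    return floors

def map_suite(suite, handsets):
    """Map step: for one suite, emit (platform, count) pairs."""
    for platform, count in handsets.items():
        yield platform, count

def shuffle(pairs):
    """Group mapper output by key so each platform's counts end up together."""
    grouped = defaultdict(list)
    for platform, count in pairs:
        grouped[platform].append(count)
    return grouped

def reduce_platform(platform, counts):
    """Reduce step: compute the final tally for one platform."""
    return platform, sum(counts)

def run_job(records):
    """Run the Map step on every floor, then the Reduce step on the grouped output."""
    mapped = []
    for floor_records in split_by_floor(records).values():
        for suite, handsets in floor_records.items():
            mapped.extend(map_suite(suite, handsets))
    return dict(reduce_platform(p, c) for p, c in shuffle(mapped).items())

tallies = run_job(suites)
```

In a real cluster the per-floor splits would be processed in parallel on different nodes and the grouping would happen across the network; here everything runs in one process so the data flow is easy to follow.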

(Full Story: MapReduce translations, from skyscrapers to Hadoop clusters)

MapReduce Patterns, Algorithms, and Use Cases

Basic MapReduce Patterns, Not-So-Basic MapReduce Patterns, Relational MapReduce Patterns, Machine Learning and Math MapReduce Algorithms

(Full Story: MapReduce Patterns, Algorithms, and Use Cases)

Understanding Microsoft’s big-picture plans for Hadoop and Project Isotope | ZDNet

With codename “Project Isotope,” Microsoft is packaging up analytics tools and services for its coming Hadoop on Windows Azure and Windows Server distributions and making them available to users of all kinds.

(Full Story: Understanding Microsoft’s big-picture plans for Hadoop and Project Isotope | ZDNet)

How Yahoo Spawned Hadoop, the Future of Big Data

Last year, eBay erected a Hadoop cluster spanning 530 servers. Now it’s five times that large, and it helps with everything from analyzing inventory data to building customer profiles using real live online behavior. “We got tremendous value, tremendous value, out of it, so we’ve expanded to 2,500 nodes,” says Bob Page, vice president of analytics at eBay. “Hadoop is an amazing technology stack. We now depend on it to run eBay.”

(Full Story: How Yahoo Spawned Hadoop, the Future of Big Data)

Building data science teams – O’Reilly Radar

It’s hard to understate the sophistication of the tools needed to instrument, track, move, and process data at scale. The development and implementation of these technologies is the responsibility of the data engineering and infrastructure team. The technologies have evolved tremendously over the past decade, with an incredible amount of collaboration taking place through open source projects.

(Full Story: Building data science teams – O’Reilly Radar)

LexisNexis open sources Hadoop challenger

HPCC uses LexisNexis’ own data-centric declarative programming language, known as ECL. Developed 10 years ago, it compiles to C++. HPCC includes two data-crunching platforms: the Thor Data Refinery Cluster and the Roxie Rapid Data Delivery Cluster.
LexisNexis senior vice president and chief technology officer Armando Escalante says Thor is analogous to Hadoop, while Roxie is the component that Hadoop is currently missing. Since it’s written in C++, he says, the system is also faster than Hadoop, which is written in Java.
“We’ve been 10 years perfecting it,” Escalante said, “and we tweaked it up the wazoo to get all the performance we can. We can add more use cases and make it better.”
“We are four times faster than Hadoop on the Thor side. If Hadoop needs 1,000 nodes we can do it with 250 – that means less cooling and data center space.”

(Full Story: LexisNexis open sources Hadoop challenger)

Twitter open sources Storm, a MapReduce framework

I’ve only scratched the surface on Storm. The “stream” concept at the core of Storm can be taken so much further than what I’ve shown here — I didn’t talk about things like multi-streams, implicit streams, or direct groupings. I showed two of Storm’s main abstractions, spouts and bolts, but I didn’t talk about Storm’s third, and possibly most powerful abstraction, the “state spout”. I didn’t show how you do distributed RPC over Storm, and I didn’t discuss Storm’s awesome automated deploy that lets you create a Storm cluster on EC2 with just the click of a button.

(Full Story: Twitter open sources Storm, a MapReduce framework)

