You can’t fully understand databases, NoSQL stores, key-value stores, replication, Paxos, Hadoop, version control, or almost any software system without understanding logs; and yet, most software engineers are not familiar with them. I’d like to change that. In this post, I’ll walk you through everything you need to know about logs: what a log is, and how to use logs for data integration, real-time processing, and system building.
Walmart started using big data even before the term became known in the industry, and in 2012 they moved from an experimental 10-node Hadoop cluster to a 250-node Hadoop cluster. At the same time, they developed new tools to migrate their existing data from Oracle, Netezza, and Greenplum hardware to their own systems. The objective was to consolidate 10 different websites into one and store all incoming data in the new Hadoop cluster. Since then, they have taken big steps in integrating big data into the DNA of Walmart.
In a nutshell, Corona represents a new system for scheduling Hadoop jobs that makes better use of a cluster’s resources and also makes it more amenable to multitenant environments like the one Facebook operates. Ching et al. explain the problems and the solution in some detail, but the short explanation is that Hadoop’s JobTracker node is responsible for both cluster management and job scheduling, and has a hard time keeping up with both tasks as clusters grow and the number of jobs sent to them increases.
Further, job-scheduling in Hadoop involves an inherent delay, which is problematic for small jobs that need fast results. And a fixed configuration of “map” and “reduce” slots means Hadoop clusters run inefficiently when jobs don’t fit into the remaining slots or when they’re not MapReduce jobs at all.
Corona resolves some of these problems by creating individual job trackers for each job and a cluster manager focused solely on tracking nodes and the amount of available resources.
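The split described above can be sketched in a few lines of plain Python. This is purely illustrative — the class and method names (`ClusterManager`, `JobTracker`, `request`, `release`) are my own assumptions, not Corona’s actual API — but it shows the shape of the design: the manager only does resource bookkeeping, while each job gets its own tracker.

```python
# Illustrative sketch of Corona's split (names are assumptions, not
# Facebook's API): a cluster manager that only tracks free resources,
# and a per-job tracker that requests them and schedules its own tasks.

class ClusterManager:
    def __init__(self, total_slots):
        # Only bookkeeping: how many slots are free across the cluster.
        self.free_slots = total_slots

    def request(self, n):
        # Grant up to n slots; the per-job tracker does the scheduling.
        granted = min(n, self.free_slots)
        self.free_slots -= granted
        return granted

    def release(self, n):
        self.free_slots += n


class JobTracker:
    """One tracker per job, instead of one global JobTracker for everything."""
    def __init__(self, manager, tasks):
        self.manager = manager
        self.tasks = tasks

    def run(self):
        slots = self.manager.request(len(self.tasks))
        try:
            # Run whatever fits in the granted slots.
            return ["ran " + t for t in self.tasks[:slots]]
        finally:
            self.manager.release(slots)


cm = ClusterManager(total_slots=4)
result = JobTracker(cm, ["map-1", "map-2"]).run()
print(result)  # ['ran map-1', 'ran map-2']
```

Because each job carries its own tracker, a failure or slowdown in one job’s scheduling no longer stalls every other job in the cluster — which is the multitenancy win the article describes.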
The first is Corona, and it lets you run myriad tasks across a vast collection of Hadoop servers without running the risk of crashing the entire cluster. But the second is more intriguing. It’s called Prism, and it’s a way of running a Hadoop cluster so large that it spans multiple data centers across the globe.
Facebook eliminated the single point of failure in the HDFS platform using a creation it calls AvatarNode, and the open source Hadoop project has followed with a similar solution known as HA NameNode (for high availability). But that still left the single point of failure in MapReduce. Now, with Corona, Facebook has solved this as well.
But Prism will change that. In short, it automatically replicates and moves data wherever it’s needed across a vast network of computing facilities.
AvatarNode isn’t a panacea for Hadoop availability, however. Ryan notes that only 10 percent of Facebook’s unplanned downtime would have been preventable with AvatarNode in place, but the architecture will allow Facebook to eliminate an estimated 50 percent of future planned downtime.
(Full Story: How Facebook keeps 100 petabytes of Hadoop data online)
In our skyscraper analogy, the work the fire wardens did in getting the per-suite, per-platform handset count would be the Map step in our job, and the work the fire wardens on the 10th, 20th and 30th floors did, in calculating their final platform tallies, would be the Reduce step. A Map step and a Reduce step constitute a MapReduce job. Got that?
Let’s keep going. For the building, the collection of suite numbers and smartphone types in each would represent the keys and values in our input file. We split/partitioned that file into a smaller one for each floor which, just like the original input file, would have suite number as the key and smartphone platform data as the value. For each suite, our mappers (the fire wardens on each floor) created output data with the smartphone platform name as key and the count of handsets for that suite and platform as the value. So the mappers produce output which eventually becomes the Reduce step’s input.
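The fire-warden analogy above maps directly onto code. Here is a minimal, self-contained sketch of that job in plain Python (no Hadoop required); the data and function names are illustrative, not from any real deployment:

```python
# The fire-warden analogy as a tiny MapReduce job.
from collections import defaultdict

# Input file: (suite number, smartphone platform) pairs.
records = [
    (1001, "iOS"), (1002, "Android"), (1003, "iOS"),
    (2001, "Android"), (2002, "Android"), (3001, "iOS"),
]

def mapper(suite, platform):
    # Map step: each floor warden emits (platform, 1) per handset,
    # turning the suite-keyed input into platform-keyed output.
    yield (platform, 1)

def reducer(platform, counts):
    # Reduce step: the tallying wardens sum the counts per platform.
    return (platform, sum(counts))

# Shuffle: group the mappers' output by key (platform) so each
# reducer sees all the values for one key.
grouped = defaultdict(list)
for suite, platform in records:
    for key, value in mapper(suite, platform):
        grouped[key].append(value)

totals = dict(reducer(k, v) for k, v in grouped.items())
print(totals)  # {'iOS': 3, 'Android': 3}
```

The grouping step in the middle is the "shuffle" that Hadoop performs between Map and Reduce: it is what guarantees every count for a given platform lands at the same reducer, just as every floor tally for a platform reaches the same warden.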
With codename “Project Isotope,” Microsoft is packaging up analytics tools and services for its coming Hadoop on Windows Azure and Windows Server distributions and making them available to users of all kinds.
Last year, eBay erected a Hadoop cluster spanning 530 servers. Now it’s five times that large, and it helps with everything from analyzing inventory data to building customer profiles using real live online behavior. “We got tremendous value — tremendous value — out of it, so we’ve expanded to 2,500 nodes,” says Bob Page, vice president of analytics at eBay. “Hadoop is an amazing technology stack. We now depend on it to run eBay.”
(Full Story: How Yahoo Spawned Hadoop, the Future of Big Data)