


How To Analyse Youtube Data Using Mapreduce? | A Hadoop Mapreduce Use Case

As the amount of captured data increases over the years, so do our storage needs. Companies are realizing that "data is king," but how do we analyze it? Through Hadoop. In this article, the first of a three-part series, Steven Haines presents an overview of Hadoop's architecture and demonstrates, at a high level, how to build a MapReduce application.


In the evolution of data processing, we moved from flat files to relational databases and from relational databases to NoSQL databases. Essentially, as the amount of captured data increased, so did our needs, and traditional patterns no longer sufficed. The databases of old worked well with data that measured in megabytes and gigabytes, but now that companies realize "data is king," the amount of captured data is measured in terabytes and petabytes. Even with NoSQL data stores, the question remains: How do we analyze that amount of data?

The most popular answer to this is: Hadoop. Hadoop is an open-source framework for developing and executing distributed applications that process very large amounts of data. Hadoop is meant to run on large clusters of commodity machines, which can be machines in your data center that you're not using or even Amazon EC2 images. The danger, of course, in running on commodity machines is how to handle failure. Hadoop is architected with the assumption that hardware will fail and, as such, it can gracefully handle most failures. Furthermore, its architecture allows it to scale nearly linearly, so as processing capacity demands increase, the only constraint is the amount of budget you have to add more machines to your cluster.

This article presents an overview of Hadoop's architecture to describe how it can achieve these bold claims, and it demonstrates, at a high level, how to build a MapReduce application.

Hadoop Architecture

At a high level, Hadoop operates on the philosophy of pushing analysis code close to the data it is intended to analyze rather than requiring code to read data across a network. As such, Hadoop provides its own file system, aptly named Hadoop File System or HDFS. When you upload your data to HDFS, Hadoop will partition your data across the cluster (keeping multiple copies of it in case your hardware fails), and then it can deploy your code to the machine that contains the data upon which it is intended to operate.

Like many NoSQL databases, HDFS organizes data by keys and values rather than relationally. In other words, each piece of data has a unique key and a value associated with that key. Relationships between keys, if they exist, are defined in the application, not by HDFS. And in practice, you're going to have to think about your problem domain a bit differently in order to realize the full power of Hadoop (see the next section on MapReduce).

The components that comprise Hadoop are:

  • HDFS: The Hadoop file system is a distributed file system designed to hold huge amounts of data across multiple nodes in a cluster (where huge can be defined as files that are 100+ terabytes in size!). Hadoop provides both an API and a command-line interface for interacting with HDFS.
  • MapReduce Application: The next section reviews the details of MapReduce, but in short, MapReduce is a functional programming paradigm for analyzing a single record in your HDFS. It then assembles the results into a consumable solution. The Mapper is responsible for the data processing step, while the Reducer receives the output from the Mappers and sorts the data that applies to the same key.
  • Partitioner: The partitioner is responsible for dividing a particular analysis problem into workable chunks of data for use by the various Mappers. The HashPartitioner is a partitioner that divides work up by "rows" of data in HDFS, but you are free to create your own custom partitioner if you need to divide your data up differently.
  • Combiner: If, for some reason, you want to perform a local reduce that combines data before sending it back to Hadoop, then you'll need to create a combiner. A combiner performs the reduce step, which groups values together with their keys, but on a single node before returning the key/value pairs to Hadoop for proper reduction.
  • InputFormat: Most of the time the default readers will work fine, but if your data is not formatted in a standard way, such as "key, value" or "key [tab] value", then you will need to create a custom InputFormat implementation.
  • OutputFormat: Your MapReduce applications will read data in some InputFormat and then write data out through an OutputFormat. Standard formats, such as "key [tab] value", are supported out of the box, but if you want to do something else, then you need to create your own OutputFormat implementation.
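To make the partitioner's role concrete, the logic of Hadoop's default hash partitioner can be approximated in a few lines of plain Java. This is a simplified stand-in for illustration only, not the actual Hadoop class: a key is hashed, and the remainder modulo the number of reducers decides which reducer receives it.

```java
public class SimplePartitioner {
    // Simplified stand-in for a hash-based partitioner: mask off the sign
    // bit so the hash is non-negative, then take the remainder modulo the
    // number of reducers to pick a partition.
    public static int getPartition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        for (String key : new String[] { "apple", "banana", "cherry" }) {
            System.out.println(key + " -> reducer " + getPartition(key, 4));
        }
    }
}
```

The important property is that the same key always lands on the same reducer, so all values for one key are reduced together.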

Additionally, Hadoop applications are deployed to an infrastructure that supports its high level of scalability and resilience. These components include:

  • NameNode: The NameNode is the master of HDFS that controls the slave DataNode daemons; it understands where all of your data is stored, how the data is broken into blocks, what nodes those blocks are deployed to, and the overall health of the distributed filesystem. In short, it is the most important node in the entire Hadoop cluster. Each cluster has one NameNode, and the NameNode is a single point of failure in a Hadoop cluster.
  • Secondary NameNode: The Secondary NameNode monitors the state of the HDFS cluster and takes "snapshots" of the data contained in the NameNode. If the NameNode fails, then the Secondary NameNode can be used in place of the NameNode. This does require human intervention, however, so there is no automatic failover from the NameNode to the Secondary NameNode, but having the Secondary NameNode will help ensure that data loss is minimal. Like the NameNode, each cluster has a single Secondary NameNode.
  • DataNode: Each slave node in your Hadoop cluster will host a DataNode. The DataNode is responsible for performing data management: It reads its data blocks from HDFS, manages the data on each physical node, and reports back to the NameNode with data management status.
  • JobTracker: The JobTracker daemon is your liaison between your application and Hadoop itself. There is one JobTracker configured per Hadoop cluster and, when you submit your code to be executed on the Hadoop cluster, it is the JobTracker's responsibility to build an execution plan. This execution plan includes determining the nodes that contain data to operate on, arranging nodes to correspond with data, monitoring running tasks, and relaunching tasks if they fail.
  • TaskTracker: Similar to how data storage follows the master/slave architecture, code execution also follows the master/slave architecture. Each slave node will have a TaskTracker daemon that is responsible for executing the tasks sent to it by the JobTracker and communicating the status of the job (and a heartbeat) with the JobTracker.

Figure 1 tries to put all of these components together in one pretty crazy diagram.

Figure 1 Hadoop application and infrastructure interactions

Figure 1 shows the relationships between the master node and the slave nodes. The master node contains two important components: the NameNode, which manages the cluster and is in charge of all data, and the JobTracker, which manages the code to be executed and all of the TaskTracker daemons. Each slave node has both a TaskTracker daemon as well as a DataNode: the TaskTracker receives its instructions from the JobTracker and executes map and reduce processes, while the DataNode receives its data from the NameNode and manages the data contained on the slave node. And of course there is a Secondary NameNode listening to updates from the NameNode.
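The JobTracker's "push code to the data" idea can be illustrated with a toy scheduler in plain Java. This is not the Hadoop API, and the block/node names are made up for the example; it only shows the core scheduling decision: assign each map task to the node that already holds the corresponding data block.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LocalityScheduler {
    // Toy illustration of data-locality scheduling: given a mapping of
    // data blocks to the nodes that store them, place the map task for
    // each block on that same node, so code moves to the data.
    public static Map<String, String> schedule(Map<String, String> blockLocations) {
        Map<String, String> plan = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : blockLocations.entrySet()) {
            plan.put("map-task-for-" + e.getKey(), e.getValue());
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, String> blocks = new LinkedHashMap<>();
        blocks.put("block-1", "slave-node-a");
        blocks.put("block-2", "slave-node-b");
        System.out.println(schedule(blocks));
    }
}
```

A real JobTracker also handles replicas, rack awareness, and task failure, but the locality decision above is the heart of the execution plan.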

MapReduce

MapReduce is a functional programming paradigm that is well suited to handling parallel processing of huge data sets distributed across a large number of computers, or in other words, MapReduce is the application paradigm supported by Hadoop and the infrastructure presented in this article. MapReduce, as its name implies, works in two steps:

1. Map: The map step essentially solves a small problem: Hadoop's partitioner divides the problem into small workable subsets and assigns those to map processes to solve.
2. Reduce: The reducer combines the results of the mapping processes and forms the output of the MapReduce operation.

My Map definition purposely used the word "essentially" because one of the things that gives the Map step its name is its implementation. While it does solve small workable problems, the way that it does so is that it maps specific keys to specific values. For example, if we were to count the number of times each word appears in a book, our MapReduce application would output each word as a key and the value as the number of times it is seen. Or more specifically, the book would probably be broken up into sentences or paragraphs, and the Map step would return each word mapped either to the number of times it appears in the sentence (or to "1" for each occurrence of every word), and then the reducer would combine the keys by adding their values together.
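The two emission strategies just described can be sketched in plain Java, with no Hadoop involved. The method names here are invented for the illustration: one variant emits a "1" per word occurrence, and the other pre-counts within the sentence, which is essentially what a combiner would do.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapStep {
    // Strategy 1: emit a (word, 1) pair for every occurrence of every word.
    public static List<Map.Entry<String, Integer>> emitOnes(String sentence) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : sentence.toLowerCase().split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Strategy 2: pre-count occurrences within the sentence before emitting,
    // the local reduction a combiner would perform.
    public static Map<String, Integer> emitCounts(String sentence) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : sentence.toLowerCase().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(emitOnes("the cat and the hat"));
        System.out.println(emitCounts("the cat and the hat"));
    }
}
```

Either way, the reducer's job is the same: add up all of the values that arrive for a given key.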

Listing 1 shows a Java/pseudocode example of how the map and reduce functions might work to solve this problem.

Listing 1 - Java/Pseudocode for MapReduce

public void map( String name, String sentence, OutputCollector output ) {
  for( String word : sentence ) {
    output.collect( word, 1 );
  }
}

public void reduce( String word, Iterator values, OutputCollector output ) {
  int sum = 0;
  while( values.hasNext() ) {
    sum += values.next().get();
  }
  output.collect( word, sum );
}

Listing 1 does not contain code that actually works, but it does illustrate from a high level how such a task would be implemented in a handful of lines of code. Prior to submitting your job to Hadoop, you would first load your data into Hadoop. It would then distribute your data, in blocks, to the various slave nodes in its cluster. Then when you did submit your job to Hadoop, it would distribute your code to the slave nodes and have each map and reduce task process data on that slave node. Your map task would iterate over every word in the data block passed to it (assuming a sentence in this example), and output the word as the key and the value as "1". The reduce task would then receive all instances of values mapped to a particular key; for example, it may have 1,000 values of "1" mapped to the word "apple", which would mean that there are 1,000 apples in the text. The reduce task sums up all of the values and outputs that as its result. Then your Hadoop job would be set up to handle all of the output from the various reduce tasks.
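The whole map / shuffle / reduce cycle described above can be simulated in ordinary Java on a single machine, which is a useful way to internalize the data flow before touching a real cluster. This is a toy simulation under the assumption of one in-memory "cluster", not the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSimulation {
    public static Map<String, Integer> mapReduce(List<String> sentences) {
        // Map step + "shuffle": each sentence yields (word, 1) pairs, and
        // grouping by key collects all values emitted for the same word.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String sentence : sentences) {
            for (String word : sentence.toLowerCase().split("\\s+")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        // Reduce step: sum the list of values for each key.
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("the quick brown fox", "the lazy dog and the fox");
        System.out.println(mapReduce(text));
    }
}
```

On a real cluster the map loop runs in parallel on the nodes holding each block, and the grouping is performed by Hadoop's shuffle phase, but the key-to-values data flow is exactly this.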

This style of thinking is quite a bit different from how you might have approached the problem without using MapReduce, but it will become clearer in the next article on writing MapReduce applications, in which we build several working examples.

Summary

This article described what Hadoop is and presented an overview of its architecture. Hadoop is an open-source framework for developing and executing distributed applications that process very large amounts of data. It provides the infrastructure that distributes data across a multitude of machines in a cluster and that pushes analysis code to nodes closest to the data being analyzed. Your job is to write MapReduce applications that leverage this infrastructure to analyze your data.

The next article in this series, Building a MapReduce Application with Hadoop, will demonstrate how to set up a development environment and build MapReduce applications, which should give you a good feel for how this new paradigm works. Then the final installment in this series will walk you through setting up and managing a Hadoop production environment.

Source: https://www.informit.com/articles/article.aspx?p=2008905
