Overview of Hadoop - Understanding MapReduce

As mentioned above, MapReduce and HDFS are the core components of Hadoop. This article briefly translates parts of the paper *MapReduce: Simplified Data Processing on Large Clusters* and walks through the API of the org.apache.hadoop.mapreduce package in Hadoop to outline the ideas behind MapReduce.

1. Translated excerpts from the MapReduce paper

Concept: MapReduce is a programming model for processing and producing large datasets. Programs written with this model are automatically executed in parallel on a large cluster. The runtime system is responsible for splitting the input data, distributing the program to run on different machines, handling machine failures, and managing communication between machines. MapReduce lets programmers without any distributed or parallel experience easily use the resources of a large distributed system.

**Use:** The user must specify two functions: a map function and a reduce function. map processes a key/value pair to produce a set of intermediate key/value pairs; reduce merges all intermediate values that share the same intermediate key. In other words, map receives a key/value pair and generates a set of intermediate key/value pairs; the MapReduce library groups together all intermediate values with the same key; reduce takes an intermediate key and its corresponding set of values and merges these values to form a possibly smaller set of values. Typically each call to reduce produces zero or one output value. The intermediate values are supplied to the reduce function through an iterator.

**In short:** The input of an entire MapReduce computation is a set of key/value pairs, and the output is also a set of key/value pairs. The system (Hadoop) is responsible for splitting the data, grouping and sorting the values for each key, and transferring data between machines. The user is only responsible for writing the map and reduce functions.

Expressed abstractly: map (k1,v1) -> list(k2,v2); reduce (k2,list(v2)) -> list(v2).

A concrete example: count the number of times each URL is accessed. map receives web request logs and outputs a series of <url,1> pairs, such as <google.com,1> <baidu.com,1> <google.com,1>.

reduce adds up the values with the same URL, yielding <url, total count>, such as <google.com,2> <baidu.com,1>.
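As a minimal sketch (not from the original post), the two functions for this URL-count example could look like the following in Hadoop's Java API. The class names and the assumption that the URL is the first whitespace-separated field of each log line are made up for illustration; the imports are the same as in the full word-count example in section 4.

public static class UrlCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text logLine, Context context)
                        throws IOException, InterruptedException {
                // Assumption: the requested URL is the first whitespace-separated field of the line.
                String url = logLine.toString().trim().split("\\s+")[0];
                context.write(new Text(url), new IntWritable(1));
        }
}

public static class UrlCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text url, Iterable<IntWritable> counts, Context context)
                        throws IOException, InterruptedException {
                // Sum the 1s emitted for this URL, yielding <url, total count>.
                int total = 0;
                for (IntWritable c : counts) {
                        total += c.get();
                }
                context.write(url, new IntWritable(total));
        }
}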

2. API explanation

The commonly used classes are Job, Mapper, and Reducer in the **org.apache.hadoop.mapreduce** package.

The Job class represents a job submitted to the cluster for execution. Through a Job you can configure, submit, and control the job's execution, and also query its state. Note: a Job can only be configured before it is submitted!

Common methods:

void setJarByClass(Class) - sets the job's jar by finding the jar that contains the given class;
void setMapperClass(Class) - specifies the Mapper class for the job;
void setReducerClass(Class) - specifies the Reducer class;
void setOutputKeyClass(Class) - specifies the data type of the final output key;
void setOutputValueClass(Class) - specifies the data type of the final output value;
void submit() - submits the job and returns immediately;
boolean waitForCompletion(boolean verbose) - submits the job and waits for it to finish; returns true if the job succeeded. verbose indicates whether the job's progress is printed (true = print).
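To illustrate the difference between the two submission methods, here is a minimal sketch (it assumes a fully configured Job object named job; exception handling is omitted):

// Blocking: submit, wait for the job to finish, and print progress (verbose = true).
boolean ok = job.waitForCompletion(true);

// Non-blocking: submit() returns immediately; poll the cluster for completion.
job.submit();
while (!job.isComplete()) {
        Thread.sleep(5000);   // check the job status every five seconds
}
boolean succeeded = job.isSuccessful();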

The Mapper class maps input key/value pairs to intermediate key/value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.

Main method: protected void map(KEYIN key, VALUEIN value, Mapper.Context context) - called once for each key/value pair of the input. You typically write a new class that extends Mapper and overrides this method.

The Reducer class reduces the set of intermediate values that share a key to a smaller set of values. It has three main phases: shuffle - the relevant output of the Mappers is fetched over the network via HTTP; sort - the inputs are grouped by key (values with the same key are merged), with sorting proceeding while the shuffle is still fetching data; reduce - the reduce function is applied to each group of values.

Main method: protected void reduce(KEYIN key, Iterable<VALUEIN> values, Reducer.Context context) - called once for each key. You override this method by extending Reducer. The (key, collection of values) passed in is the result of the sort phase; the output values of the Reducer are not re-sorted.

Mapper.Context, Reducer.Context, and Job all inherit from JobContext. Taken literally, this is the "context" of the job: the data structure that holds the job's various pieces of information.
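For example, a Mapper can read job-level settings through its Context. The following is only a sketch, and the configuration key "wordcount.min.length" is an invented example rather than a standard Hadoop property:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfiguredMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private int minLength;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
                // The Context exposes the job's Configuration (among other job information).
                minLength = context.getConfiguration().getInt("wordcount.min.length", 1);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                // Emit <word, 1> only for words at least minLength characters long.
                for (String token : value.toString().split("\\s+")) {
                        if (token.length() >= minLength) {
                                context.write(new Text(token), new IntWritable(1));
                        }
                }
        }
}

The driver would set such a value before submitting, e.g. job.getConfiguration().setInt("wordcount.min.length", 3).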

3. Flowchart

4. Examples

Mapper class

import java.io.IOException;
import java.util.StringTokenizer;
 
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
 
public class CountWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                // Split the line into tokens and emit <word, 1> for each token.
                String text = value.toString();
                StringTokenizer tokens = new StringTokenizer(text);
                while (tokens.hasMoreTokens()) {
                        String str = tokens.nextToken();
                        context.write(new Text(str), new IntWritable(1));
                }
        }
}

Reducer class

import java.io.IOException;
import java.util.Iterator;
 
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
 
public class CountWordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                // Sum the counts emitted by the mappers for this word.
                int total = 0;
                Iterator<IntWritable> it = values.iterator();
                while (it.hasNext()) {
                        total += it.next().get();
                }
                context.write(key, new IntWritable(total));
        }
}

Main class: define Job

import java.io.IOException;
 
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 
public class CountWord {
        public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
                Job job = new Job();
                // Set the job's jar by finding which jar a given class came from.
                job.setJarByClass(CountWord.class);
                // Add a Path to the list of inputs for the map-reduce job, and set the output directory.
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));

                job.setMapperClass(CountWordMapper.class);
                job.setReducerClass(CountWordReducer.class);

                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                // Submit the job, wait for it to finish, and exit with 0 on success.
                System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
}

For Hadoop programming you only need to add hadoop-core.jar to the project. To actually run a job, however, the Map/Reduce code must be bundled into a jar file. I exported it with Eclipse; in the jar export options, select "Seal the JAR" and specify the class that contains the main function. The input file and output directory can then be specified at runtime.

Usage: hadoop jar <jar file> [mainClass] args...

 hadoop jar CountWord.jar test.hadoop.CountWord input output1

Result: the original post included screenshots of the input file and of the output, but those images are no longer available.

Note: before running the job, start Hadoop with start-all.sh. To use the hadoop command directly, add Hadoop's bin directory to the PATH environment variable.

Also notice that the output keys are in ascending alphabetical order, thanks to the shuffle/sort phase that runs before the reduce operation.

Reprinted from: https://www.cnblogs.com/whuqin/archive/2011/07/29/4982068.html

