Catalogue of series articles
Hadoop Chapter 1: Environment Construction
Hadoop Chapter 2: cluster construction (Part 1)
Hadoop Chapter 2: cluster construction (Part 2)
Hadoop Chapter 2: cluster construction (Part 3)
Hadoop Chapter 3: Shell commands
Hadoop Chapter 4: Client
Hadoop Chapter 4: Client 2.0
Hadoop Chapter 5: word frequency statistics
preface
Due to the pressure of schoolwork, this series has been on hold for two months. It resumes from now on, with at least one article per week.
Previously, we used the word count example jar that ships with Hadoop to do word frequency statistics. This time we will write one by hand to get a deeper understanding of Hadoop.
1, Create project
I am using the latest version of IDEA. Project creation was already shown in earlier articles, but let's demonstrate it again here.
2, Basic environment construction
1. Add dependency
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.2</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.13.2</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.36</version>
    </dependency>
</dependencies>
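A small note: the hadoop-client version used here (3.3.2) should generally match the Hadoop version running on your cluster; if your cluster uses a different release, adjust the version number accordingly.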
2. Create a log
Create a log4j.properties file (typically under src/main/resources):
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
3. Create packages and classes
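For reference, a layout roughly like the following is assumed in this article (the package name matches the code below; use your own naming if you prefer):

src/main/java
    com.atguigu.mapreduce.wordcount
        WordCountMapper.java
        WordCountReducer.java
        WordCountDriver.java
src/main/resources
    log4j.properties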
3, Write the program
1.WordCountMapper
package com.atguigu.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // To save space, define the output k-v outside the map function
    private Text outK = new Text();
    private IntWritable outV = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Get one line of input data
        String line = value.toString();
        // Segment the data
        String[] words = line.split(" ");
        // Loop over each word and emit a k-v pair
        for (String word : words) {
            outK.set(word);
            // Pass the pair on towards the reducer
            context.write(outK, outV);
        }
    }
}
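To make the mapper's behaviour concrete: for an input line such as hello world hello (illustrative content only), the map method emits (hello, 1), (world, 1), (hello, 1). The framework then groups these pairs by key before they reach the reducer.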
2.WordCountReducer
package com.atguigu.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Reusable output value object
    private IntWritable outV = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Set up a counter
        int sum = 0;
        // Count the occurrences of the word
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Convert the result type
        outV.set(sum);
        // Output the result
        context.write(key, outV);
    }
}
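Continuing the example above, the reducer receives the key hello with the values [1, 1], sums them to 2, and writes (hello, 2); the key world arrives with [1] and is written out as (world, 1).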
3.WordCountDriver
package com.atguigu.mapreduce.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // 1. Get a job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2. Set the jar package path
        job.setJarByClass(WordCountDriver.class);
        // 3. Associate the mapper and reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // 4. Set the map output k-v types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5. Set the final output k-v types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6. Set the input path and output path
        FileInputFormat.setInputPaths(job, new Path("D:\\learn\\hadoop\\wordcount\\input"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\learn\\hadoop\\wordcount\\output"));
        // 7. Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
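Two things worth noting before running it: the input directory must exist and contain at least one text file, and the output directory must not exist yet, otherwise FileOutputFormat will fail the job with a FileAlreadyExistsException.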
4, Local operation
The local Hadoop environment was configured in an earlier article, so that will not be repeated here.
Run WordCountDriver directly in IDEA; if the job completes and the process exits with code 0, the run was successful.
Go to the output directory you set in the driver to view the results.
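The output directory will contain a _SUCCESS marker file and one part-r-00000 file per reducer (one by default); the part file holds the word counts as tab-separated word and count pairs.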
5, Cluster operation
1. Add dependencies and package
To run the program on the cluster we need to package it into a jar that can be uploaded, so first add the build plugins used for packaging.
Add the following to pom.xml:
<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.6.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
The code above hard-codes the input and output paths, which is obviously inconvenient on a cluster. So let's create another package (wordcount2), copy the previous three classes into it, and make one small modification: read the paths from the command-line arguments, as shown in the sketch below.
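A minimal sketch of the change, assuming the copied driver lives in the com.atguigu.mapreduce.wordcount2 package. Only step 6 of the driver changes; everything else stays the same:

// 6. Take the input and output paths from the command-line arguments
//    args[0] = input path, args[1] = output path
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));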
Now let's package the project.
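You can do this from the Maven panel in IDEA (Lifecycle -> package), or from the command line in the project root with mvn clean package.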
After packaging completes, you can see the generated jars in the target directory.
There are two jar packages, which differ greatly in size. The difference: the first jar contains only our code, so the required Hadoop runtime environment must already be available wherever it runs; the second jar (the one ending in jar-with-dependencies) bundles the required dependencies and can be run directly.
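For example, if the project's artifactId is wordcount and the version is 1.0-SNAPSHOT (adjust to whatever your pom actually says), the two files would be named roughly wordcount-1.0-SNAPSHOT.jar and wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar.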
2. Start the cluster and upload the jar package
You can rename the jar to something shorter before uploading it; the command below assumes it has been renamed to wc.jar.
Upload the test data to HDFS on the cluster.
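Assuming the test file is called word.txt (a hypothetical name; use your own file), it can be placed in the /wcinput directory used below with something like:
hadoop fs -mkdir /wcinput
hadoop fs -put word.txt /wcinput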
Run the job on the cluster. The general form of the command is:
hadoop jar <jar package> <fully qualified driver class> <input path> <output path>
The fully qualified class name of the driver can be copied from IDEA:
com.atguigu.mapreduce.wordcount2.WordCountDriver
hadoop jar wc.jar com.atguigu.mapreduce.wordcount2.WordCountDriver /wcinput /wcoutput
View results
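Once the job finishes, the output can be inspected directly from HDFS, for example with hadoop fs -cat /wcoutput/part-r-00000, or through the NameNode web UI's file browser.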
summary
At this point, the word frequency statistics project is basically complete, and I will start catching up on the articles I still owe.