Hadoop Chapter 5: word frequency statistics

Catalogue of series articles

Hadoop Chapter 1: Environment Construction
Hadoop Chapter 2: cluster construction (Part 1)
Hadoop Chapter 2: cluster construction (Part 2)
Hadoop Chapter 2: cluster construction (Part 3)
Hadoop Chapter 3: Shell commands
Hadoop Chapter 4: Client
Hadoop Chapter 4: Client 2.0
Hadoop Chapter 5: word frequency statistics

Preface

Due to the pressure of schoolwork, this series has been on hold for two months. It resumes from now on, with at least one article per week.
Previously, we did word frequency statistics with the example jar that ships with Hadoop. Now we will write the program by hand to gain a deeper understanding of Hadoop.

1, Create project

The blogger uses the latest version of IDEA, which differs a little from when the project was first created in earlier chapters, so the project creation is demonstrated again here.

2, Basic environment construction

1. Add dependencies

<dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.3.2</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.13.2</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.36</version>
        </dependency>
    </dependencies>

2. Create the log configuration

Create a log4j.properties file:

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

3. Create packages and classes
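Based on the code in the next section, the project needs one package, com.atguigu.mapreduce.wordcount, containing three classes (shown here assuming the standard Maven layout):

src/main/java
└── com/atguigu/mapreduce/wordcount
    ├── WordCountMapper.java
    ├── WordCountReducer.java
    └── WordCountDriver.java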



3, Write the code

1. WordCountMapper

package com.atguigu.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reuse the output key/value objects instead of creating new ones for every record
    private Text outK = new Text();
    private IntWritable outV = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get one line of input data
        String line = value.toString();
        // Split the line into words
        String[] words = line.split(" ");
        // Emit a (word, 1) pair for every word,
        // e.g. the line "hello hello world" produces (hello,1), (hello,1), (world,1)
        for (String word : words) {
            outK.set(word);
            // Pass the pair on to the shuffle/reduce phase
            context.write(outK, outV);
        }
    }
}

2. WordCountReducer

package com.atguigu.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Reuse the output value object across reduce calls
    private IntWritable outV = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Counter for this word
        int sum = 0;
        // Add up all the 1s emitted by the mappers for this key,
        // e.g. key "hello" with values (1, 1) sums to 2
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Wrap the result in the Writable output type
        outV.set(sum);
        // Emit (word, total count)
        context.write(key, outV);
    }
}

3. WordCountDriver

package com.atguigu.mapreduce.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // 1. Get a Job instance
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2. Set the jar by the driver class so Hadoop can locate it
        job.setJarByClass(WordCountDriver.class);

        // 3. Associate the mapper and reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // 4. Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5. Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6. Set the input and output paths (the output directory must not already exist)
        FileInputFormat.setInputPaths(job, new Path("D:\\learn\\hadoop\\wordcount\\input"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\learn\\hadoop\\wordcount\\output"));

        // 7. Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);

        System.exit(result ? 0 : 1);
    }
}

4, Local operation

The local Hadoop environment was configured in an earlier chapter, so that setup will not be repeated here; just run the main method of WordCountDriver.

If the program exits normally and the console log shows the job completed successfully, the run worked.

Go to the output directory you configured to view the results; the word counts are in the part-r-00000 file, alongside an empty _SUCCESS marker file.
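As a purely hypothetical illustration (the actual content depends on your own test file), an input file containing:

hello world
hello hadoop

would produce a part-r-00000 file containing:

hadoop	1
hello	2
world	1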

5, Cluster operation

1. Add the packaging plugins and package the project

To run the program on the cluster we need to package it into a jar, so first add the build plugins for packaging.
Add the following to pom.xml:

<build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

In the code above, the input and output paths are hard-coded, which is obviously inconvenient on a cluster. So create another package (wordcount2), copy the previous three classes into it, and make a small change so that the driver reads the paths from the command-line arguments, as shown in the sketch below.
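A minimal sketch of the change in the copied WordCountDriver (only step 6 changes; it assumes the input and output paths are passed as the first and second program arguments, as they will be in the hadoop jar command later):

// 6. Read the input and output paths from the command-line arguments
// args[0] = HDFS input path, args[1] = HDFS output path (must not exist yet)
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));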


Now, let's package the project (for example by double-clicking package under Lifecycle in IDEA's Maven panel, or by running mvn clean package).

After it completes, you can see the generated jars in the project's target directory.

You can see there are two jar packages of very different sizes. The difference: the first, smaller jar contains only our code and relies on the Hadoop runtime already being available on the cluster; the second, whose name ends in jar-with-dependencies, bundles all required dependencies and can be run on its own.

2. Start the cluster and upload the jar package


You can rename the uploaded jar to something shorter, for example wc.jar (the name used in the command below).
Upload the test data to HDFS, for example with hadoop fs -mkdir /wcinput followed by hadoop fs -put <your test file> /wcinput.



Run the job on the cluster. The command format is:
hadoop jar <jar file> <fully qualified driver class> <input path> <output path>
The fully qualified class name can be copied from IDEA; here it is:
com.atguigu.mapreduce.wordcount2.WordCountDriver

hadoop jar wc.jar com.atguigu.mapreduce.wordcount2.WordCountDriver /wcinput /wcoutput


View the results, for example with hadoop fs -cat /wcoutput/* or through the NameNode web UI.

Summary

That wraps up the word frequency statistics project. Next, I will start catching up on the blog posts I still owe.
