1, Requirement description
Hadoop comprehensive operation requirements:
1. Upload the files to be analyzed (no less than 10000 English words) to HDFS.
2. Call MapReduce to count the occurrence times of each word in the file.
3. Download the statistical results locally.
4. Write a blog to describe your analysis process and results.
For this assignment, we call MapReduce to count the number of occurrences of each word in the file. All of the above operations are to be carried out on a Linux system: first install Ubuntu, then install the JDK and configure the Java environment. Ubuntu provides a robust and feature-rich computing environment.
2, Environment introduction
First install VMware Workstation Pro under Windows, then install Ubuntu 18.04 as a virtual machine in VMware Workstation Pro. In the Ubuntu 18.04 virtual machine, install the JDK and configure the Java environment, then install Hadoop and configure it in pseudo-distributed mode. Download the Linux version of Eclipse from the official Eclipse website, upload it to Ubuntu, and install it. After installation, upload the file to be analyzed to HDFS, create a MapReduce project in Eclipse, and add the required JAR packages to the project. Finally, run the word-frequency statistics code, package it into an executable JAR, and use it to count the word frequencies of the target file.
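The detailed installation steps are not repeated here. As a rough sketch (the JDK path below is only an example and must match where your JDK is actually installed), the Java environment is configured by appending something like the following to ~/.bashrc and reloading it:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # example OpenJDK 8 path; adjust to your JDK location
export PATH=$PATH:$JAVA_HOME/bin

source ~/.bashrc
java -version        # should print the installed JDK version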
3, Data source and data upload
Prepare an English text file of at least 10,000 words to be analyzed, named syj.txt. The content of the file is the first two chapters of an English text (as shown in Figure 1). syj.txt is uploaded to the virtual machine by means of a shared folder.
Figure 1 document content
The file has been uploaded to the /home/yaco/tools/hadoop/standby directory.
Figure 2 uploading to the Ubuntu virtual machine
4, View data upload results
Start Hadoop and upload syj.txt to HDFS (typical commands for this step are sketched after Figure 5):
Figure 3 starting HDFS
Figure 4 checking the processes with jps and confirming that they are running
Check the HDFS input directory to confirm that syj.txt is there:
Figure 5 viewing syj.txt uploaded to HDFS
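The commands for this step are roughly as follows (a sketch assuming Hadoop is installed under /home/yaco/tools/hadoop and that the HDFS input directory is named input, matching the paths used later in this post):

cd /home/yaco/tools/hadoop
./sbin/start-dfs.sh                           # start the NameNode and DataNode daemons
jps                                           # confirm NameNode, DataNode and SecondaryNameNode are running
./bin/hdfs dfs -mkdir -p input                # create the input directory in the user's HDFS home
./bin/hdfs dfs -put ./standby/syj.txt input   # upload syj.txt to HDFS
./bin/hdfs dfs -ls input                      # check that syj.txt is present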
5, Description of data processing process
1. Open Eclipse:
Figure 6 opening Eclipse
2. Create a MapReduce project in Eclipse
Figure 7 creating a MapReduce project
3. Select the "File -> New -> Java Project" menu to start creating a Java project; the interface shown in the following figure is displayed.
Figure 8 creating a java project
The "Use default location" option controls where the project is saved; clear it if you want to choose your own save path, then click Next.
4. Add the dependent JAR packages to the project, as shown in Figure 9: select "Add External JARs" and add the following (a command-line alternative is sketched after Figure 10):
- hadoop-common-3.1.3.jar and hadoop-nfs-3.1.3.jar in the directory "/usr/local/hadoop/share/hadoop/common";
- all JAR packages under the directory "/usr/local/hadoop/share/hadoop/common/lib";
- all JAR packages under the "/usr/local/hadoop/share/hadoop/mapreduce" directory, excluding the jdiff, lib, lib-examples and sources directories;
- all JAR packages in the "/usr/local/hadoop/share/hadoop/mapreduce/lib" directory.
Figure 9 select add external JAR package
Figure 10 adding jar package
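If you prefer not to manage the JAR dependencies through Eclipse, the class can also be compiled and packaged from the command line. The following is only a sketch, assuming Hadoop is installed under /usr/local/hadoop and WordCount.java is in the current directory:

mkdir wordcount_classes
javac -classpath "$(/usr/local/hadoop/bin/hadoop classpath)" -d wordcount_classes WordCount.java
jar -cvf WordCount.jar -C wordcount_classes .
# Without a Main-Class manifest entry, pass the class name explicitly when running:
# /usr/local/hadoop/bin/hadoop jar WordCount.jar WordCount input output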
5. Create a class named "WordCount"
Figure 11 create a WordCount class
6. Clear the existing code in the class and paste in the following code, as shown in Figure 12.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }

        // Configure the job: mapper, combiner, reducer and the output key/value types.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // All arguments except the last one are input paths; the last one is the output path.
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: split each input line into tokens and emit (word, 1) for every token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sum up the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
Figure 12 copy word frequency statistics code
7. Click the Run button to run the code, as shown in Figure 13. (The program expects at least two arguments, one or more input paths followed by an output path; when running inside Eclipse these can be supplied as program arguments in the run configuration.)
Figure 13 running code
8. Package the project into an executable JAR file.
Select "File -> Export" to enter the interface shown in Figure 14.
Figure 14 Export
9. Specify the export path for the JAR file and click Finish to generate the executable JAR under the specified path.
Figure 15 specifying the path of the executable JAR
6, Download and command line display of processing results
1. Change to the directory where the executable JAR was just exported and check that it exists:
cd /home/yaco/tools/hadoop/myapp
ls
Figure 16 confirms that the executable has been successfully exported
2. Enter the Hadoop installation directory and use the executable JAR to perform the word-frequency count:
./bin/hadoop jar ./myapp/WordCount.jar input output
Figure 17 word frequency statistics using executable file
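Note that MapReduce refuses to start a job whose output directory already exists in HDFS; before re-running the word count, delete the old output first:

./bin/hdfs dfs -rm -r output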
3. Download the processing results to the local file system:
bin/hdfs dfs -get output/* ./myapp/
Figure 18 downloading the processing results to the local file system
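Alternatively, the result can be inspected directly in HDFS without downloading it:

./bin/hdfs dfs -cat output/part-r-00000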
4. Display the statistical results on the command line:
cd myapp/
cat part-r-00000
Figure 19 display of statistical results
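The output consists of tab-separated (word, count) pairs. To see the most frequent words first, the downloaded file can also be sorted locally, for example:

sort -k2,2 -nr part-r-00000 | head -20   # top 20 most frequent words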
7, Experience summary
Through this assignment, I learned a lot. When installing Hadoop in pseudo-distributed mode, there are two small problems that may not be mentioned in the book. One is that the start-up scripts need to specify the user that runs each daemon, and the other is that Hadoop's environment script (hadoop-env.sh) needs to declare the JAVA_HOME variable again. Even if the pseudo-distributed installation succeeds, if it was carried out as the root user, the following problem will also occur during the MapReduce experiment:
There was no output when running the WordCount JAR. After careful investigation, I finally found the problem: when I first installed Hadoop, I did not create a dedicated hadoop user but used the root user, so the input file has to be placed under the /user/root/input path in HDFS. In general, the word-frequency results are only produced successfully if the HDFS paths correspond to the user under which Hadoop was installed and is running.
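For example, when HDFS is used as the root user, a relative path such as input resolves to /user/root/input, so the upload and the job have to target that path. A sketch of the corrected commands (run from the Hadoop installation directory):

./bin/hdfs dfs -mkdir -p /user/root/input
./bin/hdfs dfs -put ./standby/syj.txt /user/root/input
./bin/hadoop jar ./myapp/WordCount.jar /user/root/input /user/root/output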