Chapter 1 Configuring Hadoop
Preface
For this Python + big data assignment I decided to submit a Hadoop + Python implementation, which I had time to finish alongside the recent exams. This time we use Hadoop and operate it with Python; the first step is to configure our virtual machines.
Introduction: MapReduce is a computing model, framework, and platform for parallel processing of big data. The name carries three meanings:
(1) MapReduce is a cluster-based, high-performance parallel computing platform (cluster infrastructure). It allows ordinary commodity servers to form a distributed, parallel computing cluster of dozens, hundreds, or even thousands of nodes.
(2) MapReduce is a software framework for parallel computing. It provides a large but well-designed framework that automatically parallelizes computing tasks: it partitions the data and the work, assigns and executes tasks on cluster nodes, and collects the results, while handling low-level details such as distributed data storage, data communication, and fault tolerance. This greatly reduces the burden on software developers.
(3) MapReduce is a parallel programming model and methodology. Borrowing design ideas from the functional programming language Lisp, it offers a simple way to write parallel programs: basic parallel computing tasks are expressed with the Map and Reduce functions, and abstract operations and parallel programming interfaces make it straightforward to program and process large-scale data.
In short, MapReduce is a computing engine that abstracts computation over large amounts of data into two subtasks, map and reduce, so that the desired results can be obtained faster.
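To make the model concrete, here is a minimal local Python sketch of the two phases (plain Python for illustration only, not Hadoop code): map turns each line into (word, 1) pairs, and reduce sums the counts per word.

```python
# Local illustration of the MapReduce model (no Hadoop involved).
from collections import defaultdict

lines = ["hadoop is good", "spark is fast", "spark is better"]

# Map phase: emit a (word, 1) pair for every word in every line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/Reduce phase: group the pairs by word and sum the counts
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))
# {'hadoop': 1, 'is': 3, 'good': 1, 'spark': 2, 'fast': 1, 'better': 1}
```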
Now let's prepare our Hadoop virtual machine, which is based on CentOS 7.5.
Cluster construction
2.1 template virtual machine environment preparation
0) Install the template virtual machine with IP address 192.168.10.100, host name hadoop100, 4 GB of memory, and a 50 GB hard disk
1) The configuration requirements for the hadoop100 virtual machine are as follows (the Linux system in this article is CentOS-7.5-x86-1804)
(1) Installing with yum requires that the virtual machine can access the Internet. Test network connectivity before installing packages with yum:
[root@hadoop100 ~]# ping www.baidu.com
PING www.baidu.com (14.215.177.39) 56(84) bytes of data.
64 bytes from 14.215.177.39 (14.215.177.39): icmp_seq=1 ttl=128 time=8.60 ms
64 bytes from 14.215.177.39 (14.215.177.39): icmp_seq=2 ttl=128 time=7.72 ms
(2) Install EPEL release
Note: Extra Packages for Enterprise Linux (EPEL) is an additional package repository for the Red Hat family of operating systems, applicable to RHEL, CentOS, and Scientific Linux. It provides many rpm packages that cannot be found in the official repositories.
[root@hadoop100 ~]# yum install -y epel-release
(3) Note: if you installed the minimal version of the Linux system, the following tools need to be installed; if you installed the Linux Desktop Standard Edition, the following steps are not necessary
net-tools: a collection of network utilities, including ifconfig and other commands
[root@hadoop100 ~]# yum install -y net-tools
vim: Editor
[root@hadoop100 ~]# yum install -y vim
2) Turn off the firewall and disable it from starting at boot
[root@hadoop100 ~]# systemctl stop firewalld
[root@hadoop100 ~]# systemctl disable firewalld.service
Note: during enterprise development, the firewall on individual servers is usually turned off; the company sets up a secure firewall at the network perimeter instead.
3) Create folders in the /opt directory and modify their owner and group
(1) Create the module and software folders in the /opt directory
[root@hadoop100 ~]# mkdir /opt/module
[root@hadoop100 ~]# mkdir /opt/software
(2) Change the owner and group of the module and software folders to the atguigu user
[root@hadoop100 ~]# chown atguigu:atguigu /opt/module
[root@hadoop100 ~]# chown atguigu:atguigu /opt/software
(3) View the owner and group of the module and software folders
[root@hadoop100 ~]# cd /opt/
[root@hadoop100 opt]# ll
total 12
drwxr-xr-x. 2 atguigu atguigu 4096 May 28 17:18 module
drwxr-xr-x. 2 root    root    4096 Sep  7 2017  rh
drwxr-xr-x. 2 atguigu atguigu 4096 May 28 17:18 software
4) Uninstall the JDK that comes with the virtual machine
Note: if your virtual machine is minimized, you do not need to perform this step.
[root@hadoop100 ~]# rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps
rpm -qa: query all installed rpm packages
grep -i: ignore case
xargs -n1: pass only one argument at a time to the following command
rpm -e --nodeps: uninstall the package without checking dependencies
5) Restart the virtual machine
[root@hadoop100 ~]# reboot
With this, our base virtual machine is ready. From it we clone two more machines, hadoop101 and hadoop102.
2.2 cloning virtual machines
1) Using the template machine hadoop100, clone two virtual machines: hadoop101 and hadoop102
Note: shut down hadoop100 before cloning
2) Modify the IP of the cloned machine. The following takes hadoop100 as an example
(1) Modify the static IP of the cloned virtual machine
[root@hadoop100 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens33
Change to
DEVICE=ens33
TYPE=Ethernet
ONBOOT=yes
BOOTPROTO=static
NAME="ens33"
IPADDR=192.168.10.100
PREFIX=24
GATEWAY=192.168.10.2
DNS1=192.168.10.2
(2) In VMware, check the virtual network editor: Edit -> Virtual Network Editor -> VMnet8
(3) View the IP address of Windows system adapter VMware Network Adapter VMnet8
(4) Make sure the IP settings in the Linux ifcfg-ens33 file, the address configured in the virtual network editor, and the VMware Network Adapter VMnet8 address on Windows are all consistent (same subnet and gateway).
3) Modify the host name of the cloned machine. The following takes hadoop100 as an example
(1) Modify host name
[root@hadoop100 ~]# vim /etc/hostname
hadoop100
(2) Configure the host name mapping by editing the hosts file; open /etc/hosts
[root@hadoop100 ~]# vim /etc/hosts
Add the following
192.168.10.100 hadoop100
192.168.10.101 hadoop101
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
4) Restart the cloned machine
[root@hadoop100 ~]# reboot
5) Modify the Windows host mapping file (hosts file)
(1) If the operating system is Windows 7, you can modify it directly
(a) Go to the C:\Windows\System32\drivers\etc directory
(b) Open the hosts file, add the following contents, and then save
192.168.10.100 hadoop100
192.168.10.101 hadoop101
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
192.168.10.105 hadoop105
192.168.10.106 hadoop106
192.168.10.107 hadoop107
192.168.10.108 hadoop108
(2) If the operating system is Windows 10, copy the hosts file out first, modify and save it, and then copy it back to overwrite the original
(a) Go to the C:\Windows\System32\drivers\etc directory
(b) Copy hosts file to desktop
(c) Open the desktop hosts file and add the following
192.168.10.100 hadoop100
192.168.10.101 hadoop101
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
192.168.10.105 hadoop105
192.168.10.106 hadoop106
192.168.10.107 hadoop107
192.168.10.108 hadoop108
(d) Copy the desktop hosts file back to C:\Windows\System32\drivers\etc, overwriting the original hosts file
2.3 Install the JDK on hadoop100
1) Uninstall existing JDK
Note: before installing the JDK, be sure to remove the JDK that ships with the virtual machine. For detailed steps, see the JDK uninstall step in Section 2.1 above.
2) Use the XShell file transfer tool to import the JDK into the software folder under the /opt directory
3) Check in the Linux system whether the package was imported successfully into the /opt/software directory
[atguigu@hadoop100 ~]$ ls /opt/software/
See the following results:
jdk-8u212-linux-x64.tar.gz
4) Unzip the JDK to the /opt/module directory
[atguigu@hadoop100 software]$ tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/
5) Configure JDK environment variables
(1) Create a new /etc/profile.d/my_env.sh file
[atguigu@hadoop100 ~]$ sudo vim /etc/profile.d/my_env.sh
Add the following
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin
(2) Exit after saving
:wq
(3) source the /etc/profile file to make the new PATH environment variable take effect
[atguigu@hadoop100 ~]$ source /etc/profile
6) Test whether the JDK is installed successfully
[atguigu@hadoop100 ~]$ java -version
If you can see the following results, the Java installation is successful.
java version "1.8.0_212"
Note: restart the machine if needed (if java -version already works, there is no need to restart)
[atguigu@hadoop100 ~]$ sudo reboot
2.4 Install Hadoop on hadoop100
Hadoop download address: https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/
1) Use the XShell file transfer tool to import hadoop-3.1.3.tar.gz into the software folder under the /opt directory
2) Enter the Hadoop installation package path
[atguigu@hadoop100 ~]$ cd /opt/software/
3) Unzip the installation file under / opt/module
[atguigu@hadoop100 software]$ tar -zxvf hadoop-3.1.3.tar.gz -C /opt/module/
4) Check whether the decompression was successful
[atguigu@hadoop100 software]$ ls /opt/module/
hadoop-3.1.3
5) Add Hadoop to environment variable
(1) Get Hadoop installation path
[atguigu@hadoop100 hadoop-3.1.3]$ pwd
/opt/module/hadoop-3.1.3
(2) Open the /etc/profile.d/my_env.sh file
[atguigu@hadoop100 hadoop-3.1.3]$ sudo vim /etc/profile.d/my_env.sh
Add the following at the end of the my_env.sh file (press Shift+G to jump to the end):
#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
Save and exit: :wq
(3) Make the modified file take effect
[atguigu@hadoop100 hadoop-3.1.3]$ source /etc/profile
6) Test whether the installation is successful
[atguigu@hadoop100 hadoop-3.1.3]$ hadoop version
Hadoop 3.1.3
7) Restart (restart the virtual machine if the Hadoop command cannot be used)
[atguigu@hadoop100 hadoop-3.1.3]$ sudo reboot
2.5 Hadoop directory structure
1) View Hadoop directory structure
[root@hadoop100 hadoop-3.1.3]# ll
total 184
drwxr-xr-x. 2 1    1       183 Sep 12 2019  bin
drwxr-xr-x. 4 root root     37 Nov 20 20:34 data
drwxr-xr-x. 3 1    1        20 Sep 12 2019  etc
drwxr-xr-x. 2 1    1       106 Sep 12 2019  include
drwxr-xr-x. 3 1    1        20 Sep 12 2019  lib
drwxr-xr-x. 4 1    1       288 Sep 12 2019  libexec
-rw-rw-r--. 1 1    1    147145 Sep  4 2019  LICENSE.txt
drwxr-xr-x. 3 root root   4096 Nov 28 16:12 logs
-rw-rw-r--. 1 1    1     21867 Sep  4 2019  NOTICE.txt
-rw-rw-r--. 1 1    1      1366 Sep  4 2019  README.txt
drwxr-xr-x. 3 1    1      4096 Nov 18 23:11 sbin
drwxr-xr-x. 4 1    1        31 Sep 12 2019  share
drwxr-xr-x. 2 root root     22 Nov 18 23:00 wcinput
drwxr-xr-x. 2 root root     88 Nov 18 23:01 wcoutput
-rw-r--r--. 1 root root     46 Nov 18 23:24 word.txt
2) Important catalogue
(1) bin directory: stores scripts that operate Hadoop related services (hdfs, yarn, mapred)
(2) etc Directory: Hadoop configuration file directory, which stores Hadoop configuration files
(3) lib directory: Hadoop's native libraries (used for compressing and decompressing data)
(4) sbin Directory: stores scripts for starting or stopping Hadoop related services
(5) share Directory: stores the dependent jar packages, documents, and official cases of Hadoop
Hadoop operation mode
1) Hadoop official website: http://hadoop.apache.org/
2) Hadoop operation modes include: local mode, pseudo distributed mode and fully distributed mode.
Local mode: runs on a single machine, just for demonstrating the official examples. Not used in production.
**Pseudo-distributed mode:** also runs on a single machine, but has all the functions of a Hadoop cluster; one server simulates a distributed environment. Some companies on a tight budget use it for testing; it is not used in production.
**Fully distributed mode:** multiple servers form a distributed environment. Used in production.
3.1 local operation mode (official WordCount)
1) Create a wcinput folder under the hadoop-3.1.3 directory
[atguigu@hadoop102 hadoop-3.1.3]$ mkdir wcinput
2) Create a word.txt file under the wcinput directory
[atguigu@hadoop102 hadoop-3.1.3]$ cd wcinput
3) Edit the word.txt file
[atguigu@hadoop102 wcinput]$ vim word.txt
Enter the following in the file
bingbing bingbing lanlan lanlan lanlan xiaozhao
Save and exit: :wq
4) Go back to Hadoop directory / opt/module/hadoop-3.1.3
5) Execution procedure
[atguigu@hadoop100 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount wcinput wcoutput
6) View results
[atguigu@hadoop100 hadoop-3.1.3]$ cat wcoutput/part-r-00000
See the following results:
bingbing 2
lanlan 3
xiaozhao 1
3.2 fully distributed operation mode (development focus)
analysis:
1) Prepare 3 clients (turn off firewall, static IP, host name)
2) Install JDK
3) Configure environment variables
4) Install Hadoop
5) Configure environment variables
6) Configure cluster
7) Single point start
8) Configure ssh
9) Start the whole cluster and test it
3.2.1 virtual machine preparation
See sections 2.1 and 2.2 for details.
3.2.2 writing cluster distribution script xsync
1) scp (secure copy)
(1) scp definition
scp can copy data between servers. (from server1 to server2)
(2) Basic grammar
scp -r $pdir/$fname $user@$host:$pdir/$fname
(command) (recursive) (path/name of the file to copy) (destination user@host:destination path/name)
(3) Case practice
Prerequisite: the /opt/module and /opt/software directories have already been created on hadoop102, hadoop103 and hadoop104, and the owner and group of both directories have been changed to atguigu:atguigu
[atguigu@hadoop102 ~]$ sudo chown atguigu:atguigu -R /opt/module
(a) On hadoop102, copy the /opt/module/jdk1.8.0_212 directory on hadoop102 to hadoop103.
[atguigu@hadoop102 ~]$ scp -r /opt/module/jdk1.8.0_212 atguigu@hadoop103:/opt/module
(b) On hadoop103, copy the /opt/module/hadoop-3.1.3 directory from hadoop102 to hadoop103.
[atguigu@hadoop103 ~]$ scp -r atguigu@hadoop102:/opt/module/hadoop-3.1.3 /opt/module/
(c) On hadoop103, copy all directories under the /opt/module directory on hadoop102 to hadoop104.
[atguigu@hadoop103 opt]$ scp -r atguigu@hadoop102:/opt/module/* atguigu@hadoop104:/opt/module
2) rsync remote synchronization tool
rsync is mainly used for backup and mirroring. It has the advantages of high speed, avoiding copying the same content and supporting symbolic links.
Difference between rsync and scp: copying files with rsync is faster than with scp, because rsync only transfers files that differ, while scp copies everything.
(1) Basic grammar
rsync -av $pdir/$fname $user@$host:$pdir/$fname
(command) (options) (path/name of the file to copy) (destination user@host:destination path/name)

Option description:

| Option | Function |
| ------ | -------- |
| -a | archive copy |
| -v | show the copy process |
(2) Case practice
(a) Delete /opt/module/hadoop-3.1.3/wcinput on hadoop103
[atguigu@hadoop103 hadoop-3.1.3]$ rm -rf wcinput/
(b) Synchronize /opt/module/hadoop-3.1.3 on hadoop102 to hadoop103
[atguigu@hadoop102 module]$ rsync -av hadoop-3.1.3/ atguigu@hadoop103:/opt/module/hadoop-3.1.3/
3) xsync cluster distribution script
(1) Requirement: copy files to the same directory of all nodes in a circular way
(2) Demand analysis:
(a) Original copy of rsync command:
rsync -av /opt/module atguigu@hadoop103:/opt/
(b) Expected script:
xsync name of the file to synchronize
(c) It is expected that the script can be used in any path (the script is placed in the path where the global environment variable is declared)
[atguigu@hadoop102 ~]$ echo $PATH
/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/atguigu/.local/bin:/home/atguigu/bin:/opt/module/jdk1.8.0_212/bin
(3) Script implementation
(a) Create an xsync file in the / home/atguigu/bin directory
[atguigu@hadoop102 opt]$ cd /home/atguigu
[atguigu@hadoop102 ~]$ mkdir bin
[atguigu@hadoop102 ~]$ cd bin
[atguigu@hadoop102 bin]$ vim xsync
Write the following code in this file
#!/bin/bash

# 1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo Not Enough Arguments!
    exit;
fi

# 2. Traverse all machines in the cluster
for host in hadoop102 hadoop103 hadoop104
do
    echo ==================== $host ====================
    # 3. Traverse all files/directories and send them one by one
    for file in $@
    do
        # 4. Check whether the file exists
        if [ -e $file ]
            then
                # 5. Get the parent directory
                pdir=$(cd -P $(dirname $file); pwd)
                # 6. Get the name of the current file
                fname=$(basename $file)
                ssh $host "mkdir -p $pdir"
                rsync -av $pdir/$fname $host:$pdir
            else
                echo $file does not exist!
        fi
    done
done
(b) Give the xsync script execute permission
[atguigu@hadoop102 bin]$ chmod +x xsync
(c) Test script
[atguigu@hadoop102 ~]$ xsync /home/atguigu/bin
(d) Copy the script to / bin for global invocation
[atguigu@hadoop102 bin]$ sudo cp xsync /bin/
(e) Synchronize environment variable configuration (root owner)
[atguigu@hadoop102 ~]$ sudo ./bin/xsync /etc/profile.d/my_env.sh
Note: if sudo is used, you must give the full path to xsync.
Make environment variables effective
[atguigu@hadoop103 bin]$ source /etc/profile
[atguigu@hadoop104 opt]$ source /etc/profile
3.2.3 SSH non secret login configuration
1) Configure ssh
(1) Basic grammar
ssh <IP address or host name of the other machine>
(2) Solution to Host key verification failed during ssh connection
[atguigu@hadoop102 ~]$ ssh hadoop103
If the following appears
Are you sure you want to continue connecting (yes/no)
Type yes and press Enter
(3) Return to Hadoop 102
[atguigu@hadoop103 ~]$ exit
2) Passwordless key configuration
(1) Principle of passwordless login: the client generates a key pair and copies its public key into the server's authorized_keys file; when logging in, the server issues a challenge that only the holder of the matching private key can answer, so no password is needed.
(2) Generate public and private keys
[atguigu@hadoop102 .ssh]$ pwd
/home/atguigu/.ssh
[atguigu@hadoop102 .ssh]$ ssh-keygen -t rsa
Then press Enter three times, and two files will be generated: id_rsa (the private key) and id_rsa.pub (the public key)
(3) Copy the public key to the target machine for password free login
[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop102
[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop103
[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop104
Note:
You also need to configure the atguigu account on hadoop103 so that it can log in to hadoop102, hadoop103 and hadoop104 without a password.
You also need to configure the atguigu account on hadoop104 so that it can log in to hadoop102, hadoop103 and hadoop104 without a password.
You also need to configure the root account on hadoop102 for passwordless login to hadoop102, hadoop103 and hadoop104.
3) Files under the ssh folder (~/.ssh)

| File | Purpose |
| ---- | ------- |
| known_hosts | records the public keys of hosts accessed via ssh |
| id_rsa | the generated private key |
| id_rsa.pub | the generated public key |
| authorized_keys | stores the authorized public keys for passwordless login to this server |
3.3 cluster configuration
1) Cluster deployment planning
Note:
The NameNode and the SecondaryNameNode should not be installed on the same server.
The ResourceManager also consumes a lot of memory; do not configure it on the same machine as the NameNode or the SecondaryNameNode.
|      | hadoop100           | hadoop101                    | hadoop102                   |
| ---- | ------------------- | ---------------------------- | --------------------------- |
| HDFS | NameNode, DataNode  | DataNode                     | SecondaryNameNode, DataNode |
| YARN | NodeManager         | ResourceManager, NodeManager | NodeManager                 |
2) Configuration file description
Hadoop configuration files come in two types: default configuration files and custom configuration files. Only when you want to change a default value do you need to modify the corresponding property in a custom configuration file.
(1) Default configuration files:

| Default file | Location in the Hadoop jars |
| ------------ | --------------------------- |
| core-default.xml | hadoop-common-3.1.3.jar/core-default.xml |
| hdfs-default.xml | hadoop-hdfs-3.1.3.jar/hdfs-default.xml |
| yarn-default.xml | hadoop-yarn-common-3.1.3.jar/yarn-default.xml |
| mapred-default.xml | hadoop-mapreduce-client-core-3.1.3.jar/mapred-default.xml |
(2) Custom configuration files:
The four files core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml are stored under $HADOOP_HOME/etc/hadoop; users can modify their configuration according to project requirements.
3) Configure cluster
(1) Core profile
Configure core-site.xml
[root@hadoop100 ~]$ cd $HADOOP_HOME/etc/hadoop
[root@hadoop100 hadoop]$ vim core-site.xml
The contents of the document are as follows:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- Specify the address of the NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop100:8020</value>
    </property>

    <!-- Specify the directory where Hadoop stores its data -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>

    <!-- Configure the static user for HDFS web UI login (root here) -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>root</value>
    </property>
</configuration>
(2) HDFS profile
Configure hdfs-site.xml
[root@hadoop100 ~]$ vim hdfs-site.xml
The contents of the document are as follows:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- NameNode web UI access address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop100:9870</value>
    </property>

    <!-- SecondaryNameNode web UI access address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop102:9868</value>
    </property>
</configuration>
(3) YARN profile
Configure yarn-site.xml
[root@hadoop100 ~]$ vim yarn-site.xml
The contents of the document are as follows:
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
    <!-- Classpath for YARN applications -->
    <property>
        <name>yarn.application.classpath</name>
        <value>
            /opt/module/hadoop-3.1.3/etc/hadoop,
            /opt/module/hadoop-3.1.3/share/hadoop/common/lib/*,
            /opt/module/hadoop-3.1.3/share/hadoop/common/*,
            /opt/module/hadoop-3.1.3/share/hadoop/hdfs,
            /opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/*,
            /opt/module/hadoop-3.1.3/share/hadoop/hdfs/*,
            /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/lib/*,
            /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/*,
            /opt/module/hadoop-3.1.3/share/hadoop/yarn,
            /opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/*,
            /opt/module/hadoop-3.1.3/share/hadoop/yarn/*,
        </value>
    </property>

    <!-- Specify that MapReduce uses the shuffle service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <!-- Specify the address of the ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop101</value>
    </property>

    <!-- Environment variable inheritance -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
(4) MapReduce profile
Configure mapred-site.xml
[root@hadoop100 ~]$ vim mapred-site.xml
The contents of the document are as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- Specify that MapReduce programs run on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
4) Distribute the configured Hadoop configuration file on the cluster
[root@hadoop100 ~]$ xsync /opt/module/hadoop-3.1.3/etc/hadoop/
5) Go to hadoop101 and hadoop102 and check that the files were distributed
[atguigu@hadoop101 ~]$ cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml
[atguigu@hadoop102 ~]$ cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml
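Because the site files are plain XML, a few lines of Python can also be used instead of cat to print just the properties that were set. A small sketch, assuming the installation path used in this article:

```python
# Print the property names and values set in a Hadoop site file.
# The path below assumes the installation layout used in this article.
import xml.etree.ElementTree as ET

site_file = "/opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml"
root = ET.parse(site_file).getroot()

for prop in root.findall("property"):
    print(prop.findtext("name"), "=", prop.findtext("value"))
```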
3.4 Starting the cluster
1) Configure workers
[root@hadoop100 hadoop]$ vim /opt/module/hadoop-3.1.3/etc/hadoop/workers
Add the following contents to the document:
hadoop100
hadoop101
hadoop102
Note: no space is allowed at the end of the content added in the file, and no blank line is allowed in the file.
Synchronize the configuration files to all nodes
[root@hadoop100 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc
2) Start cluster
(1) If the cluster is being started for the first time, the NameNode must be formatted on the hadoop100 node. (Note: formatting the NameNode generates a new cluster ID. If the NameNode is reformatted while old data remains, the cluster IDs of the NameNode and the DataNodes will no longer match and the cluster cannot find its past data. If the cluster reports errors and the NameNode needs to be reformatted, first stop the namenode and datanode processes and delete the data and logs directories on all machines, then format.)
[root@hadoop100 hadoop-3.1.3]$ hdfs namenode -format
(2) Start HDFS
[root@hadoop100 hadoop-3.1.3]$ sbin/start-dfs.sh
(3) Start YARN on the node where the ResourceManager is configured
[root@hadoop100 hadoop-3.1.3]$ sbin/start-yarn.sh
(4) View the NameNode of HDFS on the Web side
(a) Enter in the browser: http://hadoop100:9870
(b) View data information stored on HDFS
(5) View YARN's ResourceManager on the Web
(a) Enter in the browser: http://hadoop101:8088
(b) View Job information running on YARN
3) Basic cluster test
(1) Upload files to cluster
Upload small files
[root@hadoop100 ~]$ hadoop fs -mkdir /input
[root@hadoop100 ~]$ hadoop fs -put $HADOOP_HOME/wcinput/word.txt /input
Upload large files
[root@hadoop100 ~]$ hadoop fs -put /opt/software/jdk-8u212-linux-x64.tar.gz /
(2) After uploading the file, check where the file is stored
View HDFS file storage path
[root@hadoop100 subdir0]$ pwd
/opt/module/hadoop-3.1.3/data/dfs/data/current/BP-1436128598-192.168.10.102-1610603650062/current/finalized/subdir0/subdir0
View the contents of files stored on disk by HDFS
[root@hadoop100 subdir0]$ cat blk_1073741825
hadoop yarn
hadoop mapreduce
atguigu
atguigu
(3) Splicing
-rw-rw-r--. 1 atguigu atguigu 134217728 May 23 16:01 blk_1073741836
-rw-rw-r--. 1 atguigu atguigu   1048583 May 23 16:01 blk_1073741836_1012.meta
-rw-rw-r--. 1 atguigu atguigu  63439959 May 23 16:01 blk_1073741837
-rw-rw-r--. 1 atguigu atguigu    495635 May 23 16:01 blk_1073741837_1013.meta
[root@hadoop100 subdir0]$ cat blk_1073741836 >> tmp.tar.gz
[root@hadoop100 subdir0]$ cat blk_1073741837 >> tmp.tar.gz
[root@hadoop100 subdir0]$ tar -zxvf tmp.tar.gz
(4) Download
[root@hadoop102 software]$ hadoop fs -get /jdk-8u212-linux-x64.tar.gz ./
(5) Execute the wordcount program
[root@hadoop100 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output
3.5 configuring the history server
In order to view the historical operation of the program, you need to configure the history server. The specific configuration steps are as follows:
1) Configure mapred-site.xml
[atguigu@hadoop102 hadoop]$ vim mapred-site.xml
Add the following configuration to this file.
<!-- History server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop102:10020</value>
</property>

<!-- History server web UI address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop100:19888</value>
</property>
2) Distribution configuration
[atguigu@hadoop102 hadoop]$ xsync $HADOOP_HOME/etc/hadoop/mapred-site.xml
3) Start the history server on hadoop102
[atguigu@hadoop102 hadoop]$ mapred --daemon start historyserver
4) Check whether the history server is started
[atguigu@hadoop102 hadoop]$ jps
5) View JobHistory
http://hadoop100:19888/jobhistory
Finally
With this, our Hadoop cluster is configured. Running jps on each node shows the expected processes:
[root@hadoop100 hadoop-3.1.3]# jps
3488 DataNode
3858 NodeManager
32405 Jps
3341 NameNode
[root@hadoop101 hadoop-3.1.3]# jps
3364 NodeManager
8132 Jps
3210 ResourceManager
2990 DataNode
[root@hadoop102 hadoop-3.1.3]# jps
3056 DataNode
3154 SecondaryNameNode
3336 NodeManager
7230 Jps
Common errors and Solutions
1) Firewall is not closed or YARN is not started
INFO client.RMProxy: Connecting to ResourceManager at hadoop108/192.168.10.108:8032
2) Host name configuration error
3) IP address configuration error
4) ssh is not configured properly
5) The cluster was started partly as root and partly as the atguigu user
6) Careless modification of configuration file
7) Unrecognized host name
java.net.UnknownHostException: hadoop102: hadoop102
at java.net.InetAddress.getLocalHost(InetAddress.java:1475)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:146)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
Solution:
(1) Add 192.168.10.102 hadoop102 to the /etc/hosts file
(2) Do not use hadoop, hadoop000 or other special names as the host name
8) Only one of the DataNode and NameNode processes runs at a time (usually because the NameNode was reformatted and the cluster IDs no longer match).
9) A command does not take effect: when commands are pasted from Word, ordinary hyphens and long dashes are easy to confuse, which makes the command fail.
Solution: avoid pasting commands written in Word.
10) jps shows that a process is not running, but when the cluster is restarted it reports that the process is already started.
The reason is that stale files of previously started processes remain in the Linux /tmp directory. Delete the cluster-related files there, then restart the cluster.
11) jps does not take effect
Reason: the Hadoop and Java environment variables have not taken effect in the current shell. Solution: run source /etc/profile.
12) 8088 port cannot be connected
[atguigu@hadoop102 Desktop]$ cat /etc/hosts
Comment out the following code
#127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1 hadoop102
Chapter 2 Installing Python 3 on Linux
1. First check the location of python in the system
whereis python
Python 2.7 is installed under /usr/bin by default. Switch to /usr/bin/ and check:
cd /usr/bin/
ll python*
From the output we can see that python points to python2, and python2 points to python2.7. So we can install Python 3, point python to python3, and keep python2 pointing to python2.7; then the two versions of Python can coexist.
2. Before downloading the Python 3 package, install the dependencies needed to download and compile Python 3:
yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc make
After running the above command, the related dependencies used to compile Python 3 are installed
3. CentOS 7 does not install pip by default. Add the EPEL extension repository first
yum -y install epel-release
4. Install pip
yum install python-pip
5. Install wget with pip
pip install wget
6. Use wget to download the Python 3 source package, or download it yourself and upload it to the server before installing. If the network is fast, you can download it directly on the server:
wget https://www.python.org/ftp/python/3.6.8/Python-3.6.8.tar.xz
7. Decompress the Python 3 source package
xz -d Python-3.6.8.tar.xz
tar -xf Python-3.6.8.tar
8. Enter the decompressed directory and execute the following commands successively for manual compilation
cd Python-3.6.8
./configure --prefix=/usr/local/python3
make && make install
9. Install the zlib and zlib-devel dependencies
yum install zlib zlib-devel
10. Finally, if there is no error message, the installation succeeded. There will be a python3 directory under /usr/local/
11. Back up the original python link
mv /usr/bin/python /usr/bin/python.bak
12. Add soft links to Python 3
ln -s /usr/local/python3/bin/python3.6 /usr/bin/python
13. Test whether the installation is successful
python -V
14. Modify the yum configuration, because yum requires Python 2 to run; otherwise yum will not work properly
vi /usr/bin/yum
15. Change the first line #!/usr/bin/python to the following
#! /usr/bin/python2
16. There is another place that needs to be revised
vi /usr/libexec/urlgrabber-ext-down
17. Change the first line #!/usr/bin/python to the following
#! /usr/bin/python2
18. Start python2
python2
19. Start Python 3
python
Reference: CentOS installation of Python 3 detailed tutorial - CSDN blog
Chapter 3 Python integration
Task: count the number of occurrences of each word
Preparation:
# Create word.txt
vi word.txt

# Add a few lines of words to word.txt
hadoop is good
spark is fast
spark is better
python is basics
java also good
hbase is nosql
mysql is relational database
mongdb is nosql
relational database or nosql is good

# Upload word.txt to the /data folder on HDFS
hadoop dfs -put word.txt /data
map code
#!/usr/bin/env python
import sys

# Read lines from standard input and split each line into words
words = []
for i in sys.stdin:
    i = i.strip()
    word = i.split(" ")
    words.append(word)

# Emit a "word 1" pair for every word
for i in words:
    for j in i:
        print(j, 1)
reduce code
#!/usr/bin/env python
import sys

words = []
index = -1
for i in sys.stdin:
    # Each input line is a "word count" pair produced by map.py
    word = i.strip()
    word = word.split(" ")
    word[1] = int(word[1])
    # Look for this word among the words counted so far
    for i in range(len(words)):
        if words[i][0] == word[0]:
            index = i
    if index == -1:
        # First time this word is seen: remember it
        words.append(word)
    if index != -1:
        # Word already seen: increase its count
        words[index][1] += 1
    index = -1

for i in words:
    print(i)

Test the two scripts locally with a pipe:

cat word.txt | python map.py | python reduce.py
['hadoop', 1]
['is', 8]
['good', 3]
['spark', 2]
['fast', 1]
['better', 1]
['python', 1]
['basics', 1]
['java', 1]
['also', 1]
['hbase', 1]
['nosql', 3]
['mysql', 1]
['relational', 2]
['database', 2]
['mongdb', 1]
['or', 1]
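The reduce.py above rescans the whole word list for every incoming pair. For larger inputs a dictionary-based reducer is the more idiomatic choice; the following is only a sketch of that alternative (not the script used in the run above), producing the same output format:

```python
#!/usr/bin/env python
# Alternative reducer sketch: accumulate counts in a dict instead of scanning a list.
import sys

counts = {}
for line in sys.stdin:
    parts = line.strip().split(" ")
    if len(parts) != 2:
        continue  # skip malformed lines
    word, n = parts[0], int(parts[1])
    counts[word] = counts.get(word, 0) + n

for word in counts:
    print([word, counts[word]])
```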
Running MapReduce on Hadoop
Note: MapReduce supports development in languages other than Java; this requires the hadoop-streaming jar (hadoop-streaming-3.1.3.jar in this installation) to run the MapReduce task.
# Run the mapreduce task
hadoop jar share/hadoop/tools/lib/hadoop-streaming-3.1.3.jar -files /root/map.py,/root/reduce.py -mapper "python /root/map.py" -reducer "python /root/reduce.py" -input /input -output /out_python

# Parameter description
hadoop jar + path of the hadoop-streaming jar
-files     + paths of the map and reduce code files
-mapper    + command that runs the map script
-reducer   + command that runs the reduce script
-input     + HDFS path of the input files
-output    + HDFS path of the output directory
After finishing my homework, I uploaded the document to my personal blog:
[Using Python to integrate Hadoop for parallel computing.md - Xiao Lan](http://8.142.109.15:8090/archives/%E5%88%A9%E7%94%A8python%E9%9B%86%E6%88%90hadoop%E5%AE%9E%E7%8E%B0%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97md)
Reference documents:
[MapReduce (Python development) - Xiaoshuang 123's blog - CSDN](https://blog.csdn.net/qq_45014844/article/details/117438600)
[Linux: installing a Python 3 environment (detailed) - CSDN](https://blog.csdn.net/L_15156024189/article/details/84831045)
[Hadoop installation and use - CSDN](https://blog.csdn.net/qq_45021180/article/details/104640540)