(Super detailed) Calling Hadoop from Python with zero background: cloud computing

Chapter 1 Configuring Hadoop

Preface

This is an extra write-up attached to our Python + big data assignment.

It is implemented with Hadoop + Python; I finished it in the time left around the recent test.

This time we use Hadoop and operate it from Python. First, we need to configure our virtual machines.

Introduction: MapReduce is a computing model, framework, and platform for parallel processing of big data. It carries the following three meanings:

(1) MapReduce is a cluster-based high-performance parallel computing platform (cluster infrastructure). It allows ordinary commodity servers to form a distributed, parallel computing cluster of dozens, hundreds, or even thousands of nodes.
(2) MapReduce is a software framework for parallel computing and execution (software framework). It provides a large but well-designed framework that automatically parallelizes computing tasks: it partitions the data and the work, assigns and executes tasks on cluster nodes, and collects the results, while leaving low-level details such as distributed data storage, data communication, and fault tolerance to the system. This greatly reduces the burden on software developers.
(3) MapReduce is a parallel programming model and methodology. Borrowing design ideas from the functional programming language Lisp, it offers a simple way to program in parallel: basic parallel computing tasks are expressed with the Map and Reduce functions, and abstract operations and parallel programming interfaces are provided, so that large-scale data processing can be programmed simply and conveniently.

In short: MapReduce is a computing engine that abstracts a computation over large volumes of data into two subtasks, map and reduce, so that the desired result is obtained faster.
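To make this concrete, here is a tiny self-contained Python sketch of the idea (plain Python, not Hadoop code; the input lines are made-up examples): the map step turns each line into (word, 1) pairs, the pairs are grouped by word, and the reduce step sums the counts per word.

from itertools import groupby
from operator import itemgetter

lines = ["hadoop is good", "spark is fast", "spark is better"]

# Map: emit a (word, 1) pair for every word of every line
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the pairs by key (the word)
pairs.sort(key=itemgetter(0))

# Reduce: sum the counts for each word
for word, group in groupby(pairs, key=itemgetter(0)):
    print(word, sum(count for _, count in group))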

Now let's prepare our Hadoop virtual machines, which are based on CentOS 7.5.

2 Cluster construction

2.1 template virtual machine environment preparation

0) Install the template virtual machine with IP address 192.168.10.100, host name hadoop100, 4 GB of memory, and a 50 GB hard disk.

1) The configuration requirements for the hadoop100 virtual machine are as follows (this article uses CentOS-7.5-x86_64-1804 as the Linux system)

(1) Installing with yum requires that the virtual machine can access the Internet normally. Test the virtual machine's connectivity before using yum.

[root@hadoop100 ~]# ping www.baidu.com

PING www.baidu.com (14.215.177.39) 56(84) bytes of data.

64 bytes from 14.215.177.39 (14.215.177.39): icmp_seq=1 ttl=128 time=8.60 ms

64 bytes from 14.215.177.39 (14.215.177.39): icmp_seq=2 ttl=128 time=7.72 ms

(2) Install EPEL release

Note: Extra Packages for Enterprise Linux (EPEL) is an additional package repository for Red Hat-family operating systems, applicable to RHEL, CentOS, and Scientific Linux. As a software repository, it provides many rpm packages that cannot be found in the official repository.

[root@hadoop100 ~]# yum install -y epel-release

(3) Note: if a minimal Linux system was installed, the following tools need to be installed; if you installed the Linux desktop standard edition, you can skip the following operations.

net-tools: a collection of network tools, including ifconfig and other commands

[root@hadoop100 ~]# yum install -y net-tools 

vim: Editor

[root@hadoop100 ~]# yum install -y vim

2) Turn off the firewall and disable it from starting on boot

[root@hadoop100 ~]# systemctl stop firewalld

[root@hadoop100 ~]# systemctl disable firewalld.service

Note: in enterprise development, the firewall on an individual server is usually turned off; the company sets up a highly secure firewall at the perimeter instead.

3) Create folders in the /opt directory and modify their owner and group

(1) Create the module and software folders in the /opt directory

[root@hadoop100 ~]# mkdir /opt/module

[root@hadoop100 ~]# mkdir /opt/software

(2) Change the owner and group of the module and software folders to the atguigu user

[root@hadoop100 ~]# chown atguigu:atguigu /opt/module 

[root@hadoop100 ~]# chown atguigu:atguigu /opt/software

(3) View the owner and group of the module and software folders

[root@hadoop100 ~]# cd /opt/

[root@hadoop100 opt]# ll
total 12
drwxr-xr-x. 2 atguigu atguigu 4096 May 28 17:18 module
drwxr-xr-x. 2 root    root    4096 Sep  7  2017 rh
drwxr-xr-x. 2 atguigu atguigu 4096 May 28 17:18 software

4) Uninstall the JDK that comes with the virtual machine

Note: if your virtual machine is minimized, you do not need to perform this step.

[root@hadoop100 ~]# rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps 

rpm -qa: query all installed rpm packages

grep -i: ignore case

xargs -n1: indicates that only one parameter is passed at a time

rpm -e --nodeps: force uninstall of the package (ignore dependencies)

5) Restart the virtual machine

[root@hadoop100 ~]# reboot

In this way, our base virtual machine is ready. From it, we clone two more machines: hadoop101 and hadoop102.

2.2 cloning virtual machines

1) Using the template machine hadoop100, clone two virtual machines: hadoop101 and hadoop102

Note: shut down hadoop100 before cloning

2) Modify the cloned machine's IP (hadoop100 is used as the example below)

(1) Modify the static IP of the cloned virtual machine

[root@hadoop100 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens33

Change to

DEVICE=ens33

TYPE=Ethernet

ONBOOT=yes

BOOTPROTO=static

NAME="ens33"

IPADDR=192.168.10.100

PREFIX=24

GATEWAY=192.168.10.2

DNS1=192.168.10.2

(2) Open the VMware Virtual Network Editor: Edit -> Virtual Network Editor -> VMnet8


(3) View the IP address of Windows system adapter VMware Network Adapter VMnet8


(4) Ensure that the IP address in the Linux ifcfg-ens33 file, the address in the virtual network editor, and the VMware Network Adapter VMnet8 address on the Windows side are all on the same network segment.

3) Modify the host name of the cloned machine (hadoop100 is used as the example below)

(1) Modify host name

[root@hadoop100 ~]# vim /etc/hostname

hadoop100

(2) Configure the Linux clone host name mapping hosts file and open / etc/hosts

[root@hadoop100 ~]# vim /etc/hosts

Add the following

192.168.10.100 hadoop100

192.168.10.101 hadoop101

192.168.10.102 hadoop102

192.168.10.103 hadoop103

192.168.10.104 hadoop104

4) Restart the cloned machine

[root@hadoop100 ~]# reboot

5) Modify the Windows host mapping file (the hosts file)

(1) If the operating system is Windows 7, you can modify it directly

(a) Go to the C:\Windows\System32\drivers\etc directory

(b) Open the hosts file, add the following contents, and then save

192.168.10.100 hadoop100

192.168.10.101 hadoop101

192.168.10.102 hadoop102

192.168.10.103 hadoop103

192.168.10.104 hadoop104

192.168.10.105 hadoop105

192.168.10.106 hadoop106

192.168.10.107 hadoop107

192.168.10.108 hadoop108

(2) If the operating system is Windows 10, copy the hosts file out first, modify and save it, and then copy it back to overwrite the original

(a) Go to the C:\Windows\System32\drivers\etc directory

(b) Copy hosts file to desktop

(c) Open the desktop hosts file and add the following

192.168.10.100 hadoop100

192.168.10.101 hadoop101

192.168.10.102 hadoop102

192.168.10.103 hadoop103

192.168.10.104 hadoop104

192.168.10.105 hadoop105

192.168.10.106 hadoop106

192.168.10.107 hadoop107

192.168.10.108 hadoop108

(d) Copy the desktop hosts file back to C:\Windows\System32\drivers\etc, overwriting the original hosts file

2.3 Install JDK on hadoop100

1) Uninstall existing JDK

Note: before installing the JDK, be sure to delete any JDK that ships with the virtual machine. For detailed steps, see the JDK uninstall step in Section 2.1 above.

2) Use the Xshell file transfer tool to import the JDK into the software folder under the /opt directory


3) In the Linux /opt/software directory, check whether the package was imported successfully

[atguigu@hadoop100 ~]$ ls /opt/software/

See the following results:

jdk-8u212-linux-x64.tar.gz

4) Unzip the JDK to the /opt/module directory

[atguigu@hadoop100 software]$ tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/

5) Configure JDK environment variables

(1) Create a new /etc/profile.d/my_env.sh file

[atguigu@hadoop100 ~]$ sudo vim /etc/profile.d/my_env.sh

Add the following

#JAVA_HOME

export JAVA_HOME=/opt/module/jdk1.8.0_212

export PATH=$PATH:$JAVA_HOME/bin

(2) Exit after saving

:wq

(3) source the /etc/profile file to make the new PATH environment variable take effect

[atguigu@hadoop100 ~]$ source /etc/profile

6) Test whether the JDK is installed successfully

[atguigu@hadoop100 ~]$ java -version

If you can see the following results, the Java installation is successful.

java version "1.8.0_212"

Note: reboot the machine if needed (if the java -version command works, there is no need to reboot)

[atguigu@hadoop100 ~]$ sudo reboot

2.4 Install Hadoop on hadoop100

Hadoop download address: https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/

1) Use the Xshell file transfer tool to import hadoop-3.1.3.tar.gz into the software folder under the /opt directory


2) Enter the Hadoop installation package path

[atguigu@hadoop100 ~]$ cd /opt/software/

3) Unzip the installation file to /opt/module

[atguigu@hadoop100 software]$ tar -zxvf hadoop-3.1.3.tar.gz -C /opt/module/

4) Check whether the decompression is successful

[atguigu@hadoop100 software]$ ls /opt/module/

hadoop-3.1.3

5) Add Hadoop to environment variable

(1) Get Hadoop installation path

[atguigu@hadoop100 hadoop-3.1.3]$ pwd

/opt/module/hadoop-3.1.3

(2) Open the /etc/profile.d/my_env.sh file

[atguigu@hadoop100 hadoop-3.1.3]$ sudo vim /etc/profile.d/my_env.sh

Add the following at the end of the my_env.sh file (press Shift+G to jump to the end):

#HADOOP_HOME

export HADOOP_HOME=/opt/module/hadoop-3.1.3

export PATH=$PATH:$HADOOP_HOME/bin

export PATH=$PATH:$HADOOP_HOME/sbin

Save and exit: :wq

(3) Make the modified document effective

[atguigu@hadoop100 hadoop-3.1.3]$ source /etc/profile

6) Test whether the installation is successful

[atguigu@hadoop100 hadoop-3.1.3]$ hadoop version

Hadoop 3.1.3

7) Restart (restart the virtual machine if the Hadoop command cannot be used)

[atguigu@hadoop100 hadoop-3.1.3]$ sudo reboot

2.5 Hadoop directory structure

1) View Hadoop directory structure

[root@hadoop100 hadoop-3.1.3]# ll
total 184
drwxr-xr-x. 2 1    1       183 Sep 12  2019 bin
drwxr-xr-x. 4 root root     37 Nov 20 20:34 data
drwxr-xr-x. 3 1    1        20 Sep 12  2019 etc
drwxr-xr-x. 2 1    1       106 Sep 12  2019 include
drwxr-xr-x. 3 1    1        20 Sep 12  2019 lib
drwxr-xr-x. 4 1    1       288 Sep 12  2019 libexec
-rw-rw-r--. 1 1    1    147145 Sep  4  2019 LICENSE.txt
drwxr-xr-x. 3 root root   4096 Nov 28 16:12 logs
-rw-rw-r--. 1 1    1     21867 Sep  4  2019 NOTICE.txt
-rw-rw-r--. 1 1    1      1366 Sep  4  2019 README.txt
drwxr-xr-x. 3 1    1      4096 Nov 18 23:11 sbin
drwxr-xr-x. 4 1    1        31 Sep 12  2019 share
drwxr-xr-x. 2 root root     22 Nov 18 23:00 wcinput
drwxr-xr-x. 2 root root     88 Nov 18 23:01 wcoutput
-rw-r--r--. 1 root root     46 Nov 18 23:24 word.txt

2) Important directories

(1) bin directory: stores scripts that operate Hadoop related services (hdfs, yarn, mapred)

(2) etc Directory: Hadoop configuration file directory, which stores Hadoop configuration files

(3) lib directory: Hadoop's native libraries (used for compressing and decompressing data)

(4) sbin Directory: stores scripts for starting or stopping Hadoop related services

(5) share Directory: stores the dependent jar packages, documents, and official cases of Hadoop
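For example, the official example jar used by the WordCount demo later in this article sits under the share directory (the path assumes the /opt/module/hadoop-3.1.3 installation above); a quick way to confirm it is there:

[atguigu@hadoop100 hadoop-3.1.3]$ ls share/hadoop/mapreduce/ | grep examples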

3 Hadoop operation modes

1) Hadoop official website: http://hadoop.apache.org/

2) Hadoop operation modes include: local mode, pseudo distributed mode and fully distributed mode.

**Local mode:** single-machine operation, only used to demonstrate the official examples. Not used in production.

**Pseudo-distributed mode:** also single-machine operation, but it has all the functions of a Hadoop cluster; one server simulates a distributed environment. Companies short on funds use it for testing; it is not used in production.

**Fully distributed mode:** multiple servers form a distributed environment. Used in production.

3.1 local operation mode (official WordCount)

1) Create a wcinput folder under the hadoop-3.1.3 directory

[atguigu@hadoop102 hadoop-3.1.3]$ mkdir wcinput

2) Create a word.txt file in the wcinput folder

[atguigu@hadoop102 hadoop-3.1.3]$ cd wcinput

3) Edit the word.txt file

[atguigu@hadoop102 wcinput]$ vim word.txt

Enter the following in the file

bingbing bingbing

lanlan lanlan lanlan

xiaozhao

Save and exit: :wq

4) Go back to the Hadoop directory /opt/module/hadoop-3.1.3

5) Execute the program

[atguigu@hadoop100 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount wcinput wcoutput

6) View results

[atguigu@hadoop100 hadoop-3.1.3]$ cat wcoutput/part-r-00000

See the following results:

bingbing 2

lanlan 3

xiaozhao 1

3.2 fully distributed operation mode (development focus)

analysis:

1) Prepare 3 clients (turn off firewall, static IP, host name)

2) Install JDK

3) Configure environment variables

4) Install Hadoop

5) Configure environment variables

6) Configure cluster

7) Single point start

8) Configure ssh

9) Start the whole cluster together and test it

3.2.1 virtual machine preparation

See sections 2.1 and 2.2 for details.

3.2.2 writing cluster distribution script xsync

1) scp (secure copy)

(1) scp definition

scp can copy data between servers. (from server1 to server2)

(2) Basic grammar

scp -r $pdir/$fname $user@$host:$pdir/$fname

(command) (recursive) (path/name of the file to copy) (destination user@host:destination path/name)

(3) Case practice

Premise: the /opt/module and /opt/software directories have already been created on hadoop102, hadoop103, and hadoop104, and the owner and group of both directories have been changed to atguigu:atguigu

[atguigu@hadoop102 ~]$ sudo chown atguigu:atguigu -R /opt/module

(a) On hadoop102, copy the /opt/module/jdk1.8.0_212 directory on hadoop102 to hadoop103.

[atguigu@hadoop102 ~]$ scp -r /opt/module/jdk1.8.0_212 atguigu@hadoop103:/opt/module

(b) On hadoop103, copy the /opt/module/hadoop-3.1.3 directory from hadoop102 to hadoop103.

[atguigu@hadoop103 ~]$ scp -r atguigu@hadoop102:/opt/module/hadoop-3.1.3 /opt/module/

(c) On hadoop103, copy all directories under /opt/module on hadoop102 to hadoop104.

[atguigu@hadoop103 opt]$ scp -r atguigu@hadoop102:/opt/module/* atguigu@hadoop104:/opt/module

2) rsync remote synchronization tool

rsync is mainly used for backup and mirroring. It has the advantages of high speed, avoiding copying the same content and supporting symbolic links.

Difference between rsync and scp: copying files with rsync is faster than with scp; rsync only transfers files that differ, while scp copies all files.

(1) Basic grammar

rsync -av $pdir/$fname $user@$host:$pdir/$fname

(command) (options) (path/name of the file to copy) (destination user@host:destination path/name)

Option description:

| Option | Function |
| --- | --- |
| -a | Archive copy |
| -v | Show the copy process |

(2) Case practice

(a) Delete /opt/module/hadoop-3.1.3/wcinput on hadoop103

[atguigu@hadoop103 hadoop-3.1.3]$ rm -rf wcinput/

(b) Synchronize /opt/module/hadoop-3.1.3 on hadoop102 to hadoop103

[atguigu@hadoop102 module]$ rsync -av hadoop-3.1.3/ atguigu@hadoop103:/opt/module/hadoop-3.1.3/

3) xsync cluster distribution script

(1) Requirement: copy a file to the same directory on all nodes in a loop

(2) Requirement analysis:

(a) Original copy of rsync command:

rsync -av /opt/module atguigu@hadoop103:/opt/

(b) Expected script:

xsync name of the file to synchronize

(c) It is expected that the script can be used in any path (the script is placed in the path where the global environment variable is declared)

[atguigu@hadoop102 ~]$ echo $PATH

/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/atguigu/.local/bin:/home/atguigu/bin:/opt/module/jdk1.8.0_212/bin

(3) Script implementation

(a) Create an xsync file in the / home/atguigu/bin directory

[atguigu@hadoop102 opt]$ cd /home/atguigu

[atguigu@hadoop102 ~]$ mkdir bin

[atguigu@hadoop102 ~]$ cd bin

[atguigu@hadoop102 bin]$ vim xsync

Write the following code in this file

#!/bin/bash

#1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo Not Enough Arguments!
    exit;
fi

#2. Traverse all machines in the cluster
for host in hadoop102 hadoop103 hadoop104
do
    echo ====================  $host  ====================
    #3. Traverse all files/directories and send them one by one
    for file in $@
    do
        #4. Check whether the file exists
        if [ -e $file ]
            then
                #5. Get the parent directory
                pdir=$(cd -P $(dirname $file); pwd)
                #6. Get the name of the current file
                fname=$(basename $file)
                ssh $host "mkdir -p $pdir"
                rsync -av $pdir/$fname $host:$pdir
            else
                echo $file does not exist!
        fi
    done
done

(b) Give the xsync script execute permission

[atguigu@hadoop102 bin]$ chmod +x xsync

(c) Test script

[atguigu@hadoop102 ~]$ xsync /home/atguigu/bin

(d) Copy the script to / bin for global invocation

[atguigu@hadoop102 bin]$ sudo cp xsync /bin/

(e) Synchronize environment variable configuration (root owner)

[atguigu@hadoop102 ~]$ sudo ./bin/xsync /etc/profile.d/my_env.sh

Note: if sudo is used, the full path of xsync must be given.

Make environment variables effective

[atguigu@hadoop103 bin]$ source /etc/profile

[atguigu@hadoop104 opt]$ source /etc/profile

3.2.3 SSH passwordless login configuration

1) Configure ssh

(1) Basic grammar

ssh IP address of another computer

(2) Solution to Host key verification failed during ssh connection

[atguigu@hadoop102 ~]$ ssh hadoop103

If the following appears

Are you sure you want to continue connecting (yes/no)

Type yes and press Enter

(3) Return to Hadoop 102

[atguigu@hadoop103 ~]$ exit

2) Passwordless key configuration

(1) Passwordless login principle
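Roughly: the local machine keeps the private key, and the target machine keeps a copy of the matching public key in ~/.ssh/authorized_keys; at login, sshd challenges the client to prove it holds the private key, so no password is sent. The ssh-copy-id commands below automate copying the public key; a manual equivalent (shown only as an illustration, assuming the key pair already exists) would be:

cat ~/.ssh/id_rsa.pub | ssh atguigu@hadoop103 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"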

(2) Generate public and private keys

[atguigu@hadoop102 .ssh]$ pwd

/home/atguigu/.ssh

[atguigu@hadoop102 .ssh]$ ssh-keygen -t rsa

Then press Enter three times; two files will be generated: id_rsa (the private key) and id_rsa.pub (the public key).

(3) Copy the public key to the target machines to enable passwordless login

[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop102

[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop103

[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop104

Note:

You also need to configure the atguigu account on hadoop103 to log in to the hadoop102, hadoop103, and hadoop104 servers without a password;

You also need to configure the atguigu account on hadoop104 to log in to the hadoop102, hadoop103, and hadoop104 servers without a password;

You also need to configure the root account on hadoop102 to log in to hadoop102, hadoop103, and hadoop104 without a password.

3) Files under the .ssh folder (~/.ssh) and their functions

| File | Function |
| --- | --- |
| known_hosts | Records the public keys of computers accessed via ssh |
| id_rsa | The generated private key |
| id_rsa.pub | The generated public key |
| authorized_keys | Stores the authorized public keys for passwordless login to this server |

3.3 cluster configuration

1) Cluster deployment plan

Note:

The NameNode and the SecondaryNameNode should not be installed on the same server.

The ResourceManager also consumes a lot of memory; do not configure it on the same machine as the NameNode or the SecondaryNameNode.

|      | hadoop100 | hadoop101 | hadoop102 |
| --- | --- | --- | --- |
| HDFS | NameNode, DataNode | DataNode | SecondaryNameNode, DataNode |
| YARN | NodeManager | ResourceManager, NodeManager | NodeManager |

2) Configuration file description

Hadoop configuration files come in two types: default configuration files and custom (site) configuration files. Only when users want to change a default value do they need to modify the custom configuration file and override the corresponding property.

(1) Default configuration files:

| Default file | Location of the file in the Hadoop jar packages |
| --- | --- |
| core-default.xml | hadoop-common-3.1.3.jar/core-default.xml |
| hdfs-default.xml | hadoop-hdfs-3.1.3.jar/hdfs-default.xml |
| yarn-default.xml | hadoop-yarn-common-3.1.3.jar/yarn-default.xml |
| mapred-default.xml | hadoop-mapreduce-client-core-3.1.3.jar/mapred-default.xml |

(2) Custom configuration files:

core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml are stored in the $HADOOP_HOME/etc/hadoop path; users can modify their configuration according to project requirements.
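Once the site files below are in place, a quick way to check which value actually took effect (default or override) is the hdfs getconf tool. For example, for the NameNode address set in the core-site.xml below:

[root@hadoop100 ~]$ hdfs getconf -confKey fs.defaultFS

With the configuration below this should print hdfs://hadoop100:8020.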

3) Configure cluster

(1) Core configuration file

Configure core-site.xml

[root@hadoop100 ~]$ cd $HADOOP_HOME/etc/hadoop


[root@hadoop100 ~]$ vim core-site.xml

The contents of the document are as follows:

  <?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- Specify the address of the NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop100:8020</value>
    </property>
    <!-- Specify the directory where Hadoop stores its data -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>
    <!-- Configure the static user for HDFS web UI login (root here) -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>root</value>
    </property>
</configuration>

(2) HDFS configuration file

Configure hdfs-site.xml

[root@hadoop100 ~]$ vim hdfs-site.xml

The contents of the document are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- NameNode web UI access address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop100:9870</value>
    </property>
    <!-- SecondaryNameNode (2nn) web UI access address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop102:9868</value>
    </property>
</configuration>

(3) YARN configuration file

Configure yarn-site.xml

[root@hadoop100 ~]$ vim yarn-site.xml

The contents of the document are as follows:

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
    <!-- Classpath for YARN applications -->
    <property>
        <name>yarn.application.classpath</name>
        <value>
            /opt/module/hadoop-3.1.3/etc/hadoop,
            /opt/module/hadoop-3.1.3/share/hadoop/common/lib/*,
            /opt/module/hadoop-3.1.3/share/hadoop/common/*,
            /opt/module/hadoop-3.1.3/share/hadoop/hdfs,
            /opt/module/hadoop-3.1.3/share/hadoop/hdfs/lib/*,
            /opt/module/hadoop-3.1.3/share/hadoop/hdfs/*,
            /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/lib/*,
            /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/*,
            /opt/module/hadoop-3.1.3/share/hadoop/yarn,
            /opt/module/hadoop-3.1.3/share/hadoop/yarn/lib/*,
            /opt/module/hadoop-3.1.3/share/hadoop/yarn/*
        </value>
    </property>
    <!-- Specify that MR uses the shuffle auxiliary service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Specify the hostname of the ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop101</value>
    </property>
    <!-- Environment variable inheritance -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

(4) MapReduce configuration file

Configure mapred-site.xml

[root@hadoop100 ~]$ vim mapred-site.xml

The contents of the document are as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- Specify that MapReduce programs run on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

4) Distribute the configured Hadoop configuration file on the cluster

[root@hadoop100 ~]$ xsync /opt/module/hadoop-3.1.3/etc/hadoop/

5) Go to hadoop101 and hadoop102 to check that the files were distributed

[atguigu@hadoop101 ~]$ cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml

[atguigu@hadoop102 ~]$ cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml

3.4 Starting the cluster

1) Configure workers

[root@hadoop100 hadoop]$ vim /opt/module/hadoop-3.1.3/etc/hadoop/workers

Add the following contents to the document:

hadoop100

hadoop101

hadoop102

Note: no space is allowed at the end of the content added in the file, and no blank line is allowed in the file.

Synchronize all node profiles

[root@hadoop100 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc

2) Start cluster

(1) If the cluster is being started for the first time, the NameNode needs to be formatted on the hadoop100 node. (Note: formatting the NameNode generates a new cluster ID. If you format again later, the cluster IDs of the NameNode and the DataNodes will no longer match, and the cluster will not find its past data. If the cluster reports errors while running and the NameNode really needs to be reformatted, be sure to stop the NameNode and DataNode processes first, delete the data and logs directories on all machines, and only then format.)

[root@hadoop100 hadoop-3.1.3]$ hdfs namenode -format
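For reference, if you later do need to reformat, the sequence described in the note above could look like this (a rough sketch; run the delete step on every node and adjust the paths to your own installation):

sbin/stop-yarn.sh     # stop YARN first (on the node running the ResourceManager)
sbin/stop-dfs.sh      # stop HDFS (on the NameNode node)
rm -rf /opt/module/hadoop-3.1.3/data /opt/module/hadoop-3.1.3/logs    # on EVERY node
hdfs namenode -format     # only on the NameNode node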

(2) Start HDFS

[root@hadoop100 hadoop-3.1.3]$ sbin/start-dfs.sh

(3) Start YARN on the node where the ResourceManager is configured

[root@hadoop100 hadoop-3.1.3]$ sbin/start-yarn.sh

(4) View the NameNode of HDFS on the Web side

(a) Enter in the browser: http://hadoop100:9870

(b) View data information stored on HDFS

(5) View YARN's ResourceManager on the Web

(a) Enter in the browser: http://hadoop101:8088

(b) View Job information running on YARN

3) Basic cluster test

(1) Upload files to cluster

Upload small files

[root@hadoop100 ~]$ hadoop fs -mkdir /input

[root@hadoop100 ~]$ hadoop fs -put $HADOOP_HOME/wcinput/word.txt /input

Upload large files

[root@hadoop100 ~]$ hadoop fs -put  /opt/software/jdk-8u212-linux-x64.tar.gz  /

(2) After uploading the file, check where the file is stored

View HDFS file storage path

[root@hadoop100 subdir0]$ pwd

/opt/module/hadoop-3.1.3/data/dfs/data/current/BP-1436128598-192.168.10.102-1610603650062/current/finalized/subdir0/subdir0

View the contents of files stored on disk by HDFS

[root@hadoop100 subdir0]$ cat blk_1073741825

hadoop yarn

hadoop mapreduce 

atguigu

atguigu

(3) Splicing

-rw-rw-r--. 1 atguigu atguigu 134217728 May 23 16:01 blk_1073741836
-rw-rw-r--. 1 atguigu atguigu   1048583 May 23 16:01 blk_1073741836_1012.meta
-rw-rw-r--. 1 atguigu atguigu  63439959 May 23 16:01 blk_1073741837
-rw-rw-r--. 1 atguigu atguigu    495635 May 23 16:01 blk_1073741837_1013.meta

[root@hadoop100 subdir0]$ cat blk_1073741836>>tmp.tar.gz

[root@hadoop100 subdir0]$ cat blk_1073741837>>tmp.tar.gz

[root@hadoop100 subdir0]$ tar -zxvf tmp.tar.gz

(4) Download

[root@hadoop102 software]$ hadoop fs -get /jdk-8u212-linux-x64.tar.gz ./

(5) Execute the wordcount program

[root@hadoop100 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output

3.5 configuring the history server

In order to view the historical operation of the program, you need to configure the history server. The specific configuration steps are as follows:

1) Configure mapred-site.xml

[atguigu@hadoop102 hadoop]$ vim mapred-site.xml

Add the following configuration to this file.

<!-- History server address -->

<property>

   <name>mapreduce.jobhistory.address</name>

   <value>hadoop102:10020</value>

</property>

 

<!-- History server web UI address -->

<property>

   <name>mapreduce.jobhistory.webapp.address</name>

   <value>hadoop100:19888</value>

</property>

2) Distribution configuration

[atguigu@hadoop102 hadoop]$ xsync $HADOOP_HOME/etc/hadoop/mapred-site.xml

3) Start the history server in Hadoop 102

[atguigu@hadoop102 hadoop]$ mapred --daemon start historyserver

4) Check whether the history server is started

[atguigu@hadoop102 hadoop]$ jps

5) View JobHistory

http://hadoop100:19888/jobhistory

Finally

With this, our Hadoop cluster is configured. The jps output on each node:

[root@hadoop100 hadoop-3.1.3]# jps
3488 DataNode
3858 NodeManager
32405 Jps
3341 NameNode

[root@hadoop101 hadoop-3.1.3]# jps
3364 NodeManager
8132 Jps
3210 ResourceManager
2990 DataNode

[root@hadoop102 hadoop-3.1.3]# jps
3056 DataNode
3154 SecondaryNameNode
3336 NodeManager
7230 Jps

Common errors and Solutions

1) Firewall is not closed or YARN is not started

INFO client.RMProxy: Connecting to ResourceManager at hadoop108/192.168.10.108:8032

2) Host name configuration error

3) IP address configuration error

4) ssh is not configured properly

5) The cluster was started inconsistently, sometimes as root and sometimes as the atguigu user

6) Careless modification of configuration file

7) Unrecognized host name

java.net.UnknownHostException: hadoop102: hadoop102

at java.net.InetAddress.getLocalHost(InetAddress.java:1475)

at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:146)

at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)

at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

Solution:

(1) Add 192.168.10.102 hadoop102 to the /etc/hosts file

(2) Do not use hadoop, hadoop000, or other special names as the host name

8) Only one of the DataNode and NameNode processes will run at a time (usually caused by the cluster ID mismatch described in the note in Section 3.4).

9) A command does not take effect after execution: when commands are pasted from Word, ordinary hyphens and long dashes are not distinguished, which causes the command to fail.

Solution: try not to paste code through Word.

10) jps shows that a process is not started, but when restarting the cluster it reports that the process is already started.

The reason is that the /tmp directory under the Linux root file system still contains temporary files of previously started processes. Delete the files related to the cluster processes and restart the cluster.

11) The jps command does not work.

Reason: the global environment variables for Hadoop/Java have not taken effect. Solution: run source /etc/profile.

12) Port 8088 cannot be connected

[atguigu@hadoop102 Desktop]$ cat /etc/hosts

Comment out the following code

#127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4

#::1 hadoop102

Chapter 2 Installing Python 3 on Linux

1. First check the location of python in the system

whereis python


Python 2.7 is installed by default in the /usr/bin directory. Switch to /usr/bin/:

cd /usr/bin/

ll python*

From the output of ll python* we can see that python points to python2 and python2 points to python2.7. So we can install a Python 3, point python to python3, and keep python2 pointing to python2.7; the two versions of Python can then coexist.
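The target layout after the following steps looks roughly like this (a sketch; the exact python3.6 path depends on the version installed below):

ls -l /usr/bin/python*
# expected after the change:
# /usr/bin/python  -> /usr/local/python3/bin/python3.6   (python now runs Python 3)
# /usr/bin/python2 -> python2.7                          (yum keeps using Python 2)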

2. Before downloading the Python 3 package, install the dependencies needed to download and compile Python 3:

yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc make

After running the above command, the related dependencies used to compile Python 3 are installed

3. CentOS 7 does not install pip by default. Add the EPEL extension repository first:

yum -y install epel-release


4. Install pip

yum install python-pip


5. Install wget with pip

pip install wget

6. Download the Python 3 source package with wget, or download it yourself and upload it to the server before installing. If the network is fast, you can download and install directly:

wget https://www.python.org/ftp/python/3.6.8/Python-3.6.8.tar.xz

7. Decompress the Python 3 source package:

xz -d Python-3.6.8.tar.xz
tar -xf Python-3.6.8.tar

8. Enter the decompressed directory and execute the following commands successively for manual compilation

cd Python-3.6.8

./configure --prefix=/usr/local/python3

make && make install

9. Install the zlib and zlib-devel dependencies

yum install zlib zlib-devel

10. Finally, if there is no error message, the installation succeeded; there will be a python3 directory under /usr/local/

11. Back up the original python link

mv /usr/bin/python /usr/bin/python.bak

12. Add soft links to Python 3

ln -s /usr/local/python3/bin/python3.6 /usr/bin/python

13. Test whether the installation is successful

python -V

14. Modify the yum configuration, because yum needs Python 2 to run; otherwise yum will not work properly

vi /usr/bin/yum

15. Change its first line from #!/usr/bin/python to:

#!/usr/bin/python2

16. There is another place that needs to be revised

vi /usr/libexec/urlgrabber-ext-down

17. Likewise, change its first line from #!/usr/bin/python to:

#!/usr/bin/python2

18. Start python2

python2

19. Start Python 3

python


Reference: "CentOS installs Python 3: a detailed tutorial", Unity of Knowledge and Practice, CSDN blog

Chapter 3 python integration

Task purpose: count the number of words

preparation:

# Create word.txt

vi word.txt

# Add a few lines of words to word.txt

hadoop is good
spark is fast
spark is better
python is basics
java also good
hbase is nosql
mysql is relational database
mongdb is nosql
relational database or nosql is good

# Upload word.txt to the /data folder in HDFS

hadoop dfs -put word.txt /data

map code (map.py)

#!/usr/bin/env python
import sys

# Read lines from standard input, split each line into words,
# and emit a "word 1" pair for every word
words = []
for i in sys.stdin:
    i = i.strip()
    word = i.split(" ")
    words.append(word)
for i in words:
    for j in i:
        print(j, 1)

reduce code (reduce.py)

#!/usr/bin/env python
import sys

# Each input line is "word 1". Keep a list of [word, count] pairs:
# if the word was seen before, add 1 to its count (every incoming
# count is 1), otherwise append the new pair.
words = []
index = -1
for i in sys.stdin:
    word = i.strip().split(" ")
    word[1] = int(word[1])
    for i in range(len(words)):
        if words[i][0] == word[0]:
            index = i
    if index == -1:
        words.append(word)
    if index != -1:
        words[index][1] += 1
        index = -1
for i in words:
    print(i)


cat word.txt |python map.py|python reduce.py 


['hadoop', 1]
['is', 8]
['good', 3]
['spark', 2]
['fast', 1]
['better', 1]
['python', 1]
['basics', 1]
['java', 1]
['also', 1]
['hbase', 1]
['nosql', 3]
['mysql', 1]
['relational', 2]
['database', 2]
['mongdb', 1]
['or', 1]
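The reduce.py above produces correct counts because every record emitted by map.py carries a count of 1, so adding 1 per repeated key is the same as summing the counts. For comparison, a more compact reducer using a dictionary (a hypothetical alternative sketch, not the script used in the run above) could look like this:

#!/usr/bin/env python
import sys

# Sum the counts per word with a dictionary keyed by the word
counts = {}
for line in sys.stdin:
    parts = line.strip().split(" ")
    if len(parts) != 2:
        continue  # skip malformed lines
    word, count = parts[0], int(parts[1])
    counts[word] = counts.get(word, 0) + count

for word, count in counts.items():
    print(word, count)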

Running MapReduce on Hadoop

Note: MapReduce supports development in languages other than Java; this requires the hadoop-streaming jar (hadoop-streaming-3.1.3.jar in this installation) as the framework that runs the MapReduce task.

# Run mapreduce task

hadoop jar share/hadoop/tools/lib/hadoop-streaming-3.1.3.jar -files /root/map.py,/root/reduce.py -mapper "python /root/map.py" -reducer "python /root/reduce.py" -input /input -output /out_python
Parameter description: hadoop jar + path of the hadoop-streaming jar; -files + paths of the written map and reduce code files; -mapper + command that executes the map file; -reducer + command that executes the reduce file; -input + path of the input files in HDFS; -output + path of the output directory in HDFS


After finishing the homework, I uploaded the document to my personal blog:

[Using Python to integrate Hadoop for parallel computing.md - Xiaolan](http://8.142.109.15:8090/archives/%E5%88%A9%E7%94%A8python%E9%9B%86%E6%88%90hadoop%E5%AE%9E%E7%8E%B0%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97md)

Reference documents:

[MapReduce (Python development) - Xiaoshuang 123's blog, CSDN](https://blog.csdn.net/qq_45014844/article/details/117438600)

[Installing the Python 3 environment on Linux (super detailed) - L's blog, CSDN](https://blog.csdn.net/L_15156024189/article/details/84831045)

[Hadoop installation and use - CSDN blog](https://blog.csdn.net/qq_45021180/article/details/104640540?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522163964121216780274173287%2522%252C%2522scm%2522%253A%252220140713.130102334..%2522%257D&request_id=163964121216780274173287&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~all~top_click~default-1-104640540.pc_search_insert_es_download&utm_term=%E5%AE%89%E8%A3%85hadoop&spm=1018.2226.3001.4187)

