Hadoop cluster configuration

Pseudo-distributed cluster installation

Configuration Environment
Linux system: CentOS 7
Virtual Machine: VMware Workstation 16 Pro

A single Linux machine (also called a node) with a JDK environment installed.

A Hadoop cluster starts the following processes: NameNode, SecondaryNameNode, and DataNode belong to the HDFS service, while ResourceManager and NodeManager belong to the YARN service. MapReduce has no process of its own because it is a computing framework; once the Hadoop cluster is installed, MapReduce programs can be executed on it.

Before installing the cluster, you need to download the Hadoop installation package; here we use Hadoop 3.2.0.

On the Hadoop official website there is a download button; behind it you will find the Apache release archive link, which lists installation packages for every version.

Note: if downloading from this overseas address is slow, you can use a domestic mirror address instead. However, the mirrors may not carry every version; if the version we need is not there, download it from the official website.
These domestic mirrors host not only Hadoop installation packages but also the packages of most other Apache projects.

Address 1
address 2
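
As an example, assuming the standard Apache release archive layout (the mirror sites follow the same directory structure), the 3.2.0 package can be fetched directly with wget (install wget first if it is missing):

[root@bigdata01 ~]# wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz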

After the installation package is downloaded, we can start installing the pseudo-distributed cluster.
The bigdata01 machine is used here.
First configure the basic environment:

ip, hostname, firewalld, ssh password-free login, JDK

ip : set static ip

[root@bigdata01 ~]# vi /etc/sysconfig/network-scripts/ifcfg-ens33
TYPE="Ethernet"
PROXY_METHOD="none"
BROWSER_ONLY="no"
BOOTPROTO="static"
DEFROUTE="yes"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_FAILURE_FATAL="no"
IPV6_ADDR_GEN_MODE="stable-privacy"
NAME="ens33"
UUID="de575261-534b-4049-bad0-6a6d55a5f4f0"
DEVICE="ens33"
ONBOOT="yes"
IPADDR=192.168.10.130
GATEWAY=192.168.10.2
DNS1=8.8.8.8
[root@bigdata01 ~]# service network restart
Restarting network (via systemctl):                        [  OK  ]
[root@bigdata01 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:76:da:a0 brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.130/24 brd 192.168.10.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::567e:a2a:64b8:ccab/64 scope link 
       valid_lft forever preferred_lft forever

hostname: Set temporary hostname and permanent hostname

[root@bigdata01 ~]# hostname bigdata01
[root@bigdata01 ~]# vi /etc/hostname
bigdata01
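
On CentOS 7 the permanent hostname can also be set in a single step with hostnamectl, if you prefer:

[root@bigdata01 ~]# hostnamectl set-hostname bigdata01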


Note: it is recommended to configure the mapping between the ip and the hostname in the /etc/hosts file. Append the following line to /etc/hosts; do not delete the existing content of the file!


[root@bigdata01 ~]# vi /etc/hosts
192.168.10.130 bigdata01

firewalld: Temporarily turn off the firewall + permanently turn off the firewall

[root@bigdata01 ~]# systemctl stop firewalld
[root@bigdata01 ~]# systemctl disable firewalld
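
You can confirm that the firewall is stopped and will stay off after a reboot; systemctl is-enabled should report disabled:

[root@bigdata01 ~]# systemctl status firewalld
[root@bigdata01 ~]# systemctl is-enabled firewalld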

ssh password-free login

Here we need to briefly explain ssh. ssh, the secure shell, lets you log in to a remote Linux machine over an encrypted connection.

A Hadoop cluster relies on ssh: when we start the cluster, we only need to run the start command on one machine; Hadoop then connects to the other machines through ssh and starts the corresponding processes on them.

There is one problem, though: when we connect to other machines with ssh, we are asked for a password, so we need to set up ssh password-free login.
You may wonder why this is needed for a pseudo-distributed cluster with only one machine.

Note that no matter how many machines the cluster has, the startup steps are the same: they are all driven through ssh remote connections. Even with a single machine, ssh is used to connect to that machine itself, and right now even connecting to ourselves requires a password.
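
You can see the problem for yourself before configuring anything: connecting to the local machine over ssh still asks for a password (illustrative session; you may first be asked to accept the host key):

[root@bigdata01 ~]# ssh bigdata01
root@bigdata01's password: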

How ssh password-free login works

ssh, the secure (encrypted) shell, uses asymmetric encryption. There are two types of encryption, symmetric and asymmetric; with asymmetric encryption, data encrypted with one key cannot be decrypted without the matching key, which makes this method relatively secure.

Asymmetric encryption produces a key pair consisting of a public key and a private key: the public key is given out to others, while the private key is kept by its owner.

The ssh exchange then works roughly like this: the first machine hands its public key to the second machine in advance. When the first machine wants to connect, it sends the second machine a random string; the second machine encrypts that string with the public key, while the first machine encrypts the same string with its own private key and sends the result over as well.

At this point the second machine holds two encrypted versions of the string: one produced with the public key it was given and one produced with the first machine's private key. Because the public and private keys are related through the underlying algorithm, the second machine can check whether the two results match. If they match, it regards the first machine as trusted and allows the login; if not, the machine is treated as illegitimate.

Now let's configure ssh password-free login for real. Since we are configuring password-free login to ourselves, the "first machine" and the "second machine" are the same machine here.

First execute ssh-keygen -t rsa on bigdata01

rsa represents an encryption algorithm

Note: after running this command you need to press the Enter key 4 times (typing nothing each time) before you are returned to the Linux command line.

[root@bigdata01 ~]# ssh-keygen -t rsa

After it finishes, the corresponding public and private key files are created in the ~/.ssh directory

[root@bigdata01 ~]# ll ~/.ssh/
total 12
-rw-------. 1 root root 1679 Apr  7 16:39 id_rsa
-rw-r--r--. 1 root root  396 Apr  7 16:39 id_rsa.pub

The next step is to copy the public key to the machine that requires password-free login

[root@bigdata01 ~]# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Then you can log in to the bigdata01 machine without password through ssh

[root@bigdata01 ~]# ssh bigdata01
Last login: Tue Apr 7 15:05:55 2020 from 192.168.182.1
[root@bigdata01 ~]#

JDK
Let's start installing the JDK

According to the development process in normal work, it is recommended to put all the software installation packages in the /data/soft directory.

We don't have a new disk here, so manually create the /data/soft directory

[root@bigdata01 ~]# mkdir -p /data/soft

Upload the JDK installation package to the /data/soft directory and unzip it

[root@bigdata01 soft]# tar -zxvf jdk-8u202-linux-x64.tar.gz

Rename the jdk directory

[root@bigdata01 soft]# mv jdk1.8.0_202 jdk1.8

Configure environment variable JAVA_HOME

[root@bigdata01 soft]# vi /etc/profile
.....
export JAVA_HOME=/data/soft/jdk1.8
export PATH=.:$JAVA_HOME/bin:$PATH

verify

[root@bigdata01 soft]# source /etc/profile
[root@bigdata01 soft]# java -version
java version "1.8.0_202"
Java(TM) SE Runtime Environment (build 1.8.0_202-b08)
Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)

The basic environment is ready; now let's install Hadoop.

1: First upload the hadoop installation package to the /data/soft directory

[root@bigdata01 soft]# ll
total 527024
-rw-r--r--. 1 root root 345625475 Jul 19  2019 hadoop-3.2.0.tar.gz
drwxr-xr-x. 7   10  143       245 Dec 16  2018 jdk1.8
-rw-r--r--. 1 root root 194042837 Apr  6 23:14 jdk-8u202-linux-x64.tar.gz

2. Unzip the hadoop installation package

[root@bigdata01 soft]# tar -zxvf hadoop-3.2.0.tar.gz

There are two important directories under the hadoop directory, one is the bin directory and the other is the sbin directory

[root@bigdata01 soft]# cd hadoop-3.2.0
[root@bigdata01 hadoop-3.2.0]# ll
total 184
drwxr-xr-x. 2 1001 1002    203 Jan  8  2019 bin
drwxr-xr-x. 3 1001 1002     20 Jan  8  2019 etc
drwxr-xr-x. 2 1001 1002    106 Jan  8  2019 include
drwxr-xr-x. 3 1001 1002     20 Jan  8  2019 lib
drwxr-xr-x. 4 1001 1002   4096 Jan  8  2019 libexec
-rw-rw-r--. 1 1001 1002 150569 Oct 19  2018 LICENSE.txt
-rw-rw-r--. 1 1001 1002  22125 Oct 19  2018 NOTICE.txt
-rw-rw-r--. 1 1001 1002   1361 Oct 19  2018 README.txt
drwxr-xr-x. 3 1001 1002   4096 Jan  8  2019 sbin
drwxr-xr-x. 4 1001 1002     31 Jan  8  2019 share

Let's take a look at the bin directory. There are scripts such as hdfs and yarn. These scripts are mainly used to operate the hdfs and yarn components in the hadoop cluster.

Let's take a look at the sbin directory. There are many scripts starting with start and stop. These scripts are responsible for starting or stopping components in the cluster.
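
For a concrete impression, here is an abbreviated look at both directories (only the most commonly used entries are shown):

[root@bigdata01 hadoop-3.2.0]# ls bin/
hadoop  hdfs  mapred  yarn  ...
[root@bigdata01 hadoop-3.2.0]# ls sbin/
start-all.sh  start-dfs.sh  start-yarn.sh  stop-all.sh  stop-dfs.sh  stop-yarn.sh  ...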

In fact there is one more important directory: etc/hadoop, which holds Hadoop's configuration files. These are what matter most; installing Hadoop mainly comes down to modifying the files in this directory.

Because we will use some scripts under the bin directory and the sbin directory, for convenience, we need to configure environment variables.

[root@bigdata01 hadoop-3.2.0]# vi /etc/profile
.......
export JAVA_HOME=/data/soft/jdk1.8
export HADOOP_HOME=/data/soft/hadoop-3.2.0
export PATH=.:$JAVA_HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$PATH
[root@bigdata01 hadoop-3.2.0]# source /etc/profile
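
A quick way to confirm that the environment variables took effect is to run the hadoop command from any directory; the first line of its output should report the version (remaining output omitted here):

[root@bigdata01 hadoop-3.2.0]# hadoop version
Hadoop 3.2.0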

3: Modify Hadoop related configuration files
Go to the directory where the configuration file is located

[root@bigdata1 hadoop-3.2.0]# cd etc/hadoop/
[root@bigdata01 hadoop]#

Mainly modify the following files:

hadoop-env.sh
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
workers

First, modify the hadoop-env.sh file by appending the following environment variables to the end of the file:
JAVA_HOME: the JDK installation location
HADOOP_LOG_DIR: the directory where hadoop logs are stored

[root@bigdata01 hadoop]# vi hadoop-env.sh
.......
export JAVA_HOME=/data/soft/jdk1.8
export HADOOP_LOG_DIR=/data/hadoop_repo/logs/hadoop

Modify the core-site.xml file

Note that the hostname in the fs.defaultFS property must match the hostname you configured.

[root@bigdata01 hadoop]# vi core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://bigdata01:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop_repo</value>
   </property>
</configuration>

Modify the hdfs-site.xml file and set the number of file replicas in hdfs to 1, because the pseudo-distributed cluster has only one node

[root@bigdata01 hadoop]# vi hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Modify mapred-site.xml to set the resource scheduling framework used by mapreduce

[root@bigdata01 hadoop]# vi mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Modify yarn-site.xml to set the whitelist of services and environment variables that support running on yarn

[root@bigdata01 hadoop]# vi yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

Modify the workers file and set the hostnames of the slave nodes in the cluster. There is only one node here, so just fill in bigdata01.

[root@bigdata01 hadoop]# vi workers
bigdata01

At this point the configuration files have all been modified, but the cluster cannot be started yet: HDFS in Hadoop is a distributed file system, and a file system must be formatted before it is used, just like a newly bought disk has to be formatted before an operating system can be installed on it.

4: Format HDFS

[root@bigdata01 hadoop]# cd /data/soft/hadoop-3.2.0
[root@bigdata01 hadoop-3.2.0]# bin/hdfs namenode -format

If you see the message "successfully formatted", the formatting succeeded.
If an error is reported, it is usually caused by a problem in a configuration file; analyze it based on the specific error message.
Note: the format operation should only be performed once. If formatting fails, you can fix the configuration files and format again; but once it has succeeded, do not run it again, otherwise the cluster will run into problems.
If you really do need to format again, delete everything in the /data/hadoop_repo directory first and then run the format command, as shown below.
Think of it like buying a new disk to install an operating system on: you format it once before first use, but you would not format it again for no reason, because afterwards the operating system would have to be reinstalled.
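
If you do end up needing to format again, the sequence described above looks like this; note that it wipes everything HDFS has stored in /data/hadoop_repo:

[root@bigdata01 hadoop-3.2.0]# rm -rf /data/hadoop_repo/*
[root@bigdata01 hadoop-3.2.0]# bin/hdfs namenode -format
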
5: Start a pseudo-distributed cluster
Use the start-all.sh script in the sbin directory

[root@bigdata01 hadoop-3.2.0]# sbin/start-all.sh

When it runs, a number of ERROR messages appear, saying that some HDFS and YARN user variables are not defined.
The fix is as follows:
Modify the two scripts start-dfs.sh and stop-dfs.sh in the sbin directory and add the following lines at the top of each file:

[root@bigdata01 hadoop-3.2.0]# cd sbin/
[root@bigdata01 sbin]# vi start-dfs.sh
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
[root@bigdata01 sbin]# vi stop-dfs.sh
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

Modify the two scripts start-yarn.sh and stop-yarn.sh in the sbin directory and add the following lines at the top of each file:

[root@bigdata01 sbin]# vi start-yarn.sh
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
[root@bigdata01 sbin]# vi stop-yarn.sh
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root

restart the cluster

[root@bigdata01 sbin]# cd /data/soft/hadoop-3.2.0
[root@bigdata01 hadoop-3.2.0]# sbin/start-all.sh

6: Verify cluster process information
Execute the jps command to view the cluster's processes. Besides the Jps process itself, five processes must be present for the cluster to be considered started correctly (a sample of the expected output follows the command below).

[root@bigdata01 hadoop-3.2.0]# jps
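
For reference, a correctly started pseudo-distributed cluster shows the five processes named earlier plus Jps itself; the output looks roughly like the following (the process IDs are illustrative and will differ on your machine):

2785 NameNode
2906 DataNode
3104 SecondaryNameNode
3365 ResourceManager
3477 NodeManager
3809 Jps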

You can also verify that the cluster services are working through the web UIs:

HDFS web UI: http://192.168.10.130:9870
YARN web UI: http://192.168.10.130:8088

If you want to access them by hostname, you need to modify the hosts file on the Windows machine.
The file is located at C:\Windows\System32\drivers\etc\hosts
Add the following line to the file; it is simply the ip and hostname of the Linux virtual machine. With this mapping in place, you can access the Linux virtual machine by hostname from the Windows machine.

192.168.10.130 bigdata01

Note: if this file cannot be saved, it is usually a permissions problem; open the editor in administrator mode.
7: Stop the cluster
If you have modified the cluster configuration files or need to stop the cluster for any other reason, use the following command:

[root@bigdata01 hadoop-3.2.0]# sbin/stop-all.sh

Distributed cluster installation

Environment Preparation: Three Nodes
bigdata01 192.168.10.130
bigdata02 192.168.10.131
bigdata03 192.168.10.132

Note: the basic environment of every node must be configured first: ip, hostname, firewalld, ssh password-free login, and JDK. Since there are not enough nodes yet, create the additional nodes by cloning, as covered in the first week. Also remove the Hadoop installed earlier on bigdata01: delete the unpacked directory and undo the related environment variables.

Note: on the bigdata01 node we need to delete the hadoop_repo directory under /data and the hadoop-3.2.0 directory under /data/soft to restore the node to a clean state, because they still hold data from the previous pseudo-distributed cluster.

[root@bigdata01 ~]# rm -rf /data/soft/hadoop-3.2.0
[root@bigdata01 ~]# rm -rf /data/hadoop_repo

Suppose we now have three linux machines, all with brand new environments.
Let's get started.
Note: The configuration steps for basic environments such as ip, hostname, firewalld, and JDK for these three machines are no longer recorded here.
bigdata01
bigdata02
bigdata03

The ip, hostname, firewalld, ssh password-free login, and JDK of these three machines have now been configured.
That is not everything, though; a few more settings still need to be completed.

configure /etc/hosts

Because the master node needs to connect to the two slave nodes remotely, it must be able to recognize the slave nodes' hostnames and use them for remote access. By default only ip-based access works; to use hostnames, the ip and hostname of the relevant machines must be configured in each node's /etc/hosts file.

So we configure the following entries in bigdata01's /etc/hosts file. It is best to include the current node's own entry as well, so that the file is universal and can be copied as-is to the two slave nodes.

[root@bigdata01 ~]# vi /etc/hosts
192.168.10.130 bigdata01
192.168.10.131 bigdata02
192.168.10.132 bigdata03

Modify the /etc/hosts file of bigdata02

[root@bigdata02 ~]# vi /etc/hosts
192.168.10.130 bigdata01
192.168.10.131 bigdata02
192.168.10.132 bigdata03

Modify the /etc/hosts file of bigdata03

[root@bigdata03 ~] # vi /etc/hosts
192.168.10.130 bigdata01
192.168.10.131 bigdata02
192.168.10.132 bigdata03

Time synchronization between cluster nodes

As soon as a cluster involves multiple nodes, their clocks need to be kept in sync. If the time differs too much between nodes, cluster stability suffers and the cluster may even malfunction.

First operate on the bigdata01 node
Use ntpdate -u ntp.sjtu.edu.cn to synchronize the time; however, running it reports that the ntpdate command cannot be found

[root@bigdata01 ~]# ntpdate -u ntp.sjtu.edu.cn
-bash: ntpdate: command not found

The ntpdate command is not available by default; install it online with yum by running yum install -y ntpdate

[root@bigdata01 ~]# yum install -y ntpdate

Then manually execute ntpdate -u ntp.sjtu.edu.cn to confirm whether it can be executed normally

[root@bigdata01 ~]# ntpdate -u ntp.sjtu.edu.cn

It is recommended to add this time-synchronization command to the Linux crontab so that it runs every minute; a sample entry is shown below

[root@bigdata01 ~]# vi /etc/crontab
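
The crontab content itself is not shown above; a typical entry that re-syncs every minute looks like the following (this assumes ntpdate was installed to /usr/sbin/ntpdate, the usual location for the yum package):

* * * * * root /usr/sbin/ntpdate -u ntp.sjtu.edu.cn

The same entry is added on bigdata02 and bigdata03 in the steps below.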

Then configure time synchronization on bigdata02 and bigdata03 nodes
Operate on the bigdata02 node

[root@bigdata02 ~]# yum install -y ntpdate
[root@bigdata02 ~]# vi /etc/crontab

Operate on the bigdata03 node

[root@bigdata03 ~]# yum install -y ntpdate
[root@bigdata03 ~]# vi /etc/crontab

Complete ssh password-free login

Note: so far, password-free login only works from each node to itself. In the end, the master node must be able to log in to all nodes without a password, so the password-free login setup needs to be completed.
First, execute the following commands on the bigdata01 machine to copy its public key to the two slave nodes

[root@bigdata01 ~]# scp ~/.ssh/authorized_keys bigdata02:~/
[root@bigdata01 ~]# scp ~/.ssh/authorized_keys bigdata03:~/

Then execute on bigdata02 and bigdata03
bigdata02:

[root@bigdata02 ~]# cat ~/authorized_keys  >> ~/.ssh/authorized_keys

bigdata03:

[root@bigdata03 ~]# cat ~/authorized_keys  >> ~/.ssh/authorized_keys

To verify, use ssh on the bigdata01 node to connect to the two slave nodes. If no password is required, the setup succeeded and the master node can now log in to every node without a password.

[root@bigdata01 ~]# ssh bigdata02
[root@bigdata02 ~]# exit
[root@bigdata01 ~]# ssh bigdata03
[root@bigdata03 ~]# exit

install hadoop

First install it on the bigdata01 node.
1: Upload the hadoop-3.2.0.tar.gz installation package to the /data/soft directory of the linux machine

[root@bigdata01 soft]# ll

2. Unzip the hadoop installation package

[root@bigdata01 soft]# tar -zxvf hadoop-3.2.0.tar.gz 

3. Modify hadoop related configuration files
Go to the directory where the configuration file is located

[root@bigdata01 soft]# cd hadoop-3.2.0/etc/hadoop/
[root@bigdata01 hadoop]# 

First modify the hadoop-env.sh file and add environment variable information at the end of the file

[root@bigdata01 hadoop]# vi hadoop-env.sh
export JAVA_HOME=/data/soft/jdk1.8
export HADOOP_LOG_DIR=/data/hadoop_repo/logs/hadoop

Modify the core-site.xml file, note that the hostname in the fs.defaultFS property needs to be consistent with the hostname of the master node

[root@bigdata01 hadoop]# vi core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://bigdata01:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop_repo</value>
   </property>
</configuration>

Modify the hdfs-site.xml file: set the number of file replicas in hdfs to 2 (at most 2, since the cluster now has two slave nodes), and specify the node on which the secondaryNamenode process runs

[root@bigdata01 hadoop]# vi hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>bigdata01:50090</value>
    </property>
</configuration>

Modify mapred-site.xml to set the resource scheduling framework used by mapreduce

[root@bigdata01 hadoop]# vi mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Modify yarn-site.xml to set the whitelist of services and environment variables that support running on yarn
Note that for distributed clusters, the hostname of the resourcemanager needs to be set in this configuration file, otherwise the nodemanager cannot find the resourcemanager node.

[root@bigdata01 hadoop]# vi yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
	<property>
		<name>yarn.resourcemanager.hostname</name>
		<value>bigdata01</value>
	</property>
</configuration>

Modify the workers file and add the hostnames of all slave nodes, one per line

[root@bigdata01 hadoop]# vi workers
bigdata02
bigdata03

Modify the startup script
Modify the two scripts start-dfs.sh and stop-dfs.sh and add the following lines at the top of each file:

[root@bigdata01 hadoop]# cd /data/soft/hadoop-3.2.0/sbin
[root@bigdata01 sbin]# vi start-dfs.sh
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
[root@bigdata01 sbin]# vi stop-dfs.sh
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

Modify the two scripts start-yarn.sh and stop-yarn.sh and add the following lines at the top of each file:

[root@bigdata01 sbin]# vi start-yarn.sh
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
[root@bigdata01 sbin]# vi stop-yarn.sh
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root

4: Copy the installation package with the modified configuration on the bigdata01 node to the other two slave nodes

[root@bigdata01 sbin]# cd /data/soft/
[root@bigdata01 soft]# scp -rq hadoop-3.2.0 bigdata02:/data/soft/
[root@bigdata01 soft]# scp -rq hadoop-3.2.0 bigdata03:/data/soft/

5. Format HDFS on the bigdata01 node

[root@bigdata01 soft]# cd /data/soft/hadoop-3.2.0
[root@bigdata01 hadoop-3.2.0]# bin/hdfs namenode -format

6. Start the cluster and execute the following command on the bigdata01 node

[root@bigdata01 hadoop-3.2.0]# sbin/start-all.sh

7. Verify the cluster
Execute the jps command on the three machines respectively, and the process information is as follows:
Execute on the bigdata01 node

[root@bigdata01 hadoop-3.2.0]# jps

Execute on the bigdata02 node

[root@bigdata02]# jps

Execute on the bigdata03 node

[root@bigdata03]# jps
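
For reference, given this configuration (NameNode and ResourceManager on the master, bigdata02 and bigdata03 listed in workers), the processes should be distributed roughly as follows; the actual jps output will also include the Jps process itself:

bigdata01: NameNode, SecondaryNameNode, ResourceManager
bigdata02: DataNode, NodeManager
bigdata03: DataNode, NodeManager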

8. Stop the cluster
Execute the stop command on the bigdata01 node

[root@bigdata01 hadoop-3.2.0]# sbin/stop-all.sh

At this point, the hadoop distributed cluster has been successfully installed
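
As an optional smoke test, you can run one of the MapReduce example jobs that ships with Hadoop 3.2.0 (here the pi estimator with 2 map tasks and 4 samples per map); if it completes and prints an estimate of pi, HDFS and YARN are both working:

[root@bigdata01 hadoop-3.2.0]# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar pi 2 4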

These are my first written-up notes; if there are any mistakes in the article, feel free to point them out in the comments.
