Fully distributed Hadoop cluster configuration under Linux

Files for this case: https://pan.baidu.com/s/1zABhjj2umontXe2CYBW_DQ
Extraction code: 1123 (if the link fails, leave a comment below and I will update it promptly)

Table of contents

(1) Configure the master node of the Hadoop cluster

1. Enter the configuration file:

2. Modify the hadoop-env.sh file

3. Configure the core-site.xml file

4. Configure the hdfs-site.xml file

5. Configure the mapred-site.xml file

6. Configure the yarn-site.xml file

7. Set the slave nodes, that is, modify the workers file

(2) Distribute the configuration files of the cluster master node to the other nodes

(3) Format the file system

(4) Start the Hadoop cluster

(5) View the Web interface

1. Execute the following commands on all three virtual machines to turn off the firewall and disable it from starting on boot

2. Add the IP mapping of the cluster service on the Windows host

(1) Configure the master node of the Hadoop cluster

1. Enter the configuration file:

cd /usr/local/hadoop/etc/hadoop

2. Modify the hadoop-env.sh file

Find the JAVA_HOME parameter and set it to the JDK installation path.

sudo vi hadoop-env.sh
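For reference, the line to add (or uncomment) in hadoop-env.sh would look like the following, assuming the JDK is installed under /usr/local/jdk, the same path used when distributing the JDK to the slave nodes later in this tutorial; adjust the path to match your actual installation.

# Point Hadoop at the JDK installation (path assumed; change if yours differs)
export JAVA_HOME=/usr/local/jdk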

3. Configure the core-site.xml file

sudo vi core-site.xml

Edit core-site.xml file content

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
</configuration>

This configures the host that runs the HDFS NameNode process (i.e. the master node of the Hadoop cluster) and the temporary directory for data generated while the Hadoop cluster is running.

4. Configure the hdfs-site.xml file

This file is used to configure the two HDFS processes, the NameNode and the DataNode

Open the hdfs-site.xml file

sudo vi hdfs-site.xml

Edit hdfs-site.xml file content

<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>slave01:50090</value>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>master:50070</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>

The configuration above specifies the host and port of the SecondaryNameNode and of the NameNode web UI, the number of replicas for HDFS data blocks (the default value is 3), and the local directories where the NameNode and DataNode store their data.

5. Configure the mapred-site.xml file

This file specifies the framework on which MapReduce jobs run and is the core MapReduce configuration file.
Open the mapred-site.xml file with the following command.

sudo vi mapred-site.xml

Edit mapred-site.xml file content

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.address</name>
                <value>master:10020</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.webapp.address</name>
                <value>master:19888</value>
        </property>
        <property>
                <name>yarn.app.mapreduce.am.env</name>
                <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
        </property>
        <property>
                <name>mapreduce.map.env</name>
                <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
        </property>
        <property>
                <name>mapreduce.reduce.env</name>
                <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
        </property>
</configuration>

6. Configure the yarn-site.xml file

This file is used to specify the resource manager of the YARN cluster.
Open the yarn-site.xml file

sudo vi yarn-site.xml

Edit yarn-site.xml file content

<configuration>

<!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>


This sets master as the host that runs the ResourceManager; the NodeManager auxiliary service must be set to mapreduce_shuffle, otherwise default MapReduce programs cannot run properly.

7. Set the slave nodes, that is, modify the workers file

This file lists all the slave nodes of the Hadoop cluster, so that the startup scripts can start the slave nodes with a single command
Open the workers file

sudo vi workers

Delete the existing content of the file and add the new content below, one host name per line

master
slave01
slave02

(2) Distribute the configuration files of the cluster master node to the other nodes

The master executes the following commands:

   sudo scp /etc/profile slave01:/etc/profile
   sudo scp /etc/profile slave02:/etc/profile
   sudo scp -r /usr/local/hadoop slave01:/usr/local
   sudo scp -r /usr/local/hadoop slave02:/usr/local
   sudo scp -r /usr/local/jdk slave01:/usr/local
   sudo scp -r /usr/local/jdk slave02:/usr/local
   scp ~/.bashrc slave01:~/
   scp ~/.bashrc slave02:~/

After the above commands have been executed, run the following on slave01 and slave02 respectively.

source /etc/profile
source ~/.bashrc

Execute the following commands on slave01 and slave02 to modify the folder permissions

cd /usr/local
sudo chown -R hadoop ./hadoop

(3) Format the file system

Execute the following command on the master node:

hdfs namenode -format

If output like the figure below appears, formatting was successful:

After the format command runs, the message "has been successfully formatted" indicates that the HDFS file system was formatted successfully and the cluster can be started. Otherwise, check whether the command was typed correctly and whether the earlier Hadoop installation and configuration steps were carried out properly. Also note that the format command only needs to be executed once, before the first start of the Hadoop cluster; it should not be run again on subsequent restarts.
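As an optional sanity check (a minimal sketch, based on the dfs.namenode.name.dir value configured in hdfs-site.xml above), you can confirm on the master that the NameNode metadata directory was created:

ls /usr/local/hadoop/tmp/dfs/name/current
# A successful format creates metadata files here, including a VERSION file and an fsimage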

(4) Start the Hadoop cluster

start-dfs.sh
start-yarn.sh

As shown in the figure above, the Hadoop cluster is now fully started
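To double-check which daemons are running on each node, the JDK's jps tool can be used (a rough guide; the exact process list depends on the configuration above, where master is also listed in workers and the SecondaryNameNode was placed on slave01):

jps
# Expected on master:  NameNode, DataNode, ResourceManager, NodeManager
# Expected on slave01: DataNode, NodeManager, SecondaryNameNode
# Expected on slave02: DataNode, NodeManager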

(5) View the Web interface

1. Execute the following commands on all three virtual machines to turn off the firewall and disable it from starting on boot

sudo service iptables stop
sudo chkconfig iptables off
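The two commands above apply to distributions that manage iptables through SysV service tools (for example CentOS 6). If your virtual machines run a systemd-based distribution that uses firewalld instead (an assumption about your environment), the equivalent commands would be:

sudo systemctl stop firewalld       # stop the firewall service now
sudo systemctl disable firewalld    # keep it from starting on boot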

2. Add the IP mapping of the cluster service on the Windows host

The path for Windows10 and Windows7 operating systems is C:\Windows\System32\drivers\etc\hosts
Add the following content:
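The exact entries depend on the IP addresses assigned to your virtual machines; as an illustration only (the addresses below are placeholders), the hosts entries take this form:

192.168.121.134 master
192.168.121.135 slave01
192.168.121.136 slave02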

After performing the above operations, you can open http://master:50070 and http://master:8088 in a browser on the Windows host or on any of the three virtual machines to view the status of the HDFS cluster and the YARN cluster, respectively. The effect is as follows:
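If you prefer to verify the cluster from the command line rather than the web UI, the standard Hadoop and YARN tools can be used, for example:

hdfs dfsadmin -report    # shows live DataNodes and HDFS capacity
yarn node -list          # shows NodeManagers registered with the ResourceManager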

Friends interested in big data are welcome to join the QQ group for discussion: 249300637
