Hive - common tuning methods and two SQL exercises

As a common data warehouse component in the big data ecosystem, Hive deserves attention to efficiency during design and development. Hive's efficiency is affected not only by large data volumes: data skew, data redundancy, too many jobs, excessive I/O, and unreasonable MapReduce allocation all hurt performance. Hive tuning therefore covers not only the HiveQL statements themselves, but also Hive configuration items and MapReduce settings.

This article approaches tuning from three aspects: architecture optimization, parameter optimization, and SQL optimization.

1. Architecture optimization

For the execution engine, choose the one best suited to your platform's resources, such as MR, Tez, or Spark.

If the Tez engine is selected, vectorized execution can be enabled, and the cost-based optimizer (CBO) can also be turned on. The configurations are as follows:

set hive.vectorized.execution.enabled = true; -- default false
set hive.vectorized.execution.reduce.enabled = true; -- default false
SET hive.cbo.enable=true; --from v0.14.0 default
SET hive.compute.query.using.stats=true; -- default false
SET hive.stats.fetch.column.stats=true; -- default false
SET hive.stats.fetch.partition.stats=true; -- default true 

Optimize the table design: choose partitioned tables, bucketed tables, and an appropriate storage format. To reduce data transfer, enable compression. Here are some of the relevant parameters (see the official documentation for more):
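As an illustration of these table-design choices, here is a minimal sketch of a partitioned, bucketed ORC table; the table and column names are hypothetical, not from the original text:

```sql
-- Hypothetical example: partition by date, bucket by user_id, store as compressed ORC
CREATE TABLE user_events (
  user_id    BIGINT,
  event_type STRING,
  event_time TIMESTAMP
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```

Partition pruning on dt then limits scans to the requested dates, and bucketing on user_id enables bucket map joins on that key.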

-- Intermediate result compression
SET hive.exec.compress.intermediate=true;
-- Output result compression
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec; -- Snappy as an example codec

2. Parameter optimization

The second part is parameter optimization. Some of the architecture choices above are in fact also controlled by parameters. This part mainly covers the following aspects:

Local mode, strict mode, JVM reuse, parallel execution, speculative execution, merging small files, Fetch mode

2.1 local mode

When the amount of data is small, starting a distributed job is slow and takes longer than simply running locally. Use the following parameters to enable local mode:

SET hive.exec.mode.local.auto=true; -- default false
SET hive.exec.mode.local.auto.inputbytes.max=134217728; -- run locally only if the input is smaller than this (default 128MB)
SET hive.exec.mode.local.auto.input.files.max=4; -- run locally only if the number of map tasks is below this (default 4)

2.2 strict mode

Strict mode is essentially a switch. When it is enabled, the following three kinds of statements fail; when it is disabled, they execute normally:

 -- Queries on a partitioned table that do not filter on the partition column;
 -- Joins that produce a Cartesian product;
 -- Statements that use order by without a limit clause.
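In classic Hive releases this switch is the hive.mapred.mode parameter (newer releases split it into separate hive.strict.checks.* options):

```sql
SET hive.mapred.mode=strict; -- default nonstrict
```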

2.3 Jvm reuse

In MR, execution is process-based: each task runs in its own JVM, so jobs with many short-lived tasks pay the JVM startup cost repeatedly. JVM reuse lets subsequent tasks reuse these processes quickly. Its drawback is that reused task slots are held until the whole job finishes, which is more noticeable when data is skewed. Turn this on with the following parameter:

SET mapreduce.job.jvm.numtasks=5;

2.4 parallel execution

A Hive query is transformed into stages. Stages that do not depend on each other can be executed in parallel, using the following parameters:

SET hive.exec.parallel=true; -- default false
SET hive.exec.parallel.thread.number=16; -- Default 8

2.5 speculative execution

Speculative execution trades extra resources for a faster final result. For example, when some tasks run very slowly because of uneven network or resource conditions, backup tasks are started to process the same data, and the first attempt to finish successfully provides the final result.

set mapreduce.reduce.speculative=true;
set hive.mapred.reduce.tasks.speculative.execution=true;

2.6 merge small files

Before map execution, merge small input files to reduce the number of maps (this input format is the default in current Hive versions):

SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

Merge small files at the end of the task:

-- Merge small files at the end of a map-only task. Default true
SET hive.merge.mapfiles = true;
-- Merge small files at the end of a map-reduce task. Default false
SET hive.merge.mapredfiles = true;
-- Size of the merged files, 256MB by default
SET hive.merge.size.per.task = 268435456;
-- When the average size of the output files is below this value (16MB by default), start an independent map-reduce task to merge them
SET hive.merge.smallfiles.avgsize = 16777216;

2.7 Fetch mode

Finally, fetch mode avoids running MapReduce at all in some cases, such as selecting a few columns, simple field lookups, or limit queries.
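Fetch mode is controlled by the following parameter (more is the default from Hive 0.14 on; earlier versions default to minimal):

```sql
SET hive.fetch.task.conversion=more;
```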


3. SQL optimization

This part is more involved and may run into data skew, which has always been an unavoidable problem in big data processing; there are many ways to handle it.

3.1 SQL optimization

SQL optimization is the part developers can control most directly. It is largely a matter of experience; the common techniques are summarized below.

Select only the needed columns and partitions; use sort by instead of order by; use group by instead of count(distinct); pre-aggregate in group by (controlled by parameters); use the skew configuration items; use map join; filter null values separately; and adjust the number of maps and reduces appropriately. Almost all of these come up in daily work, so apply them wherever possible.
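Two of the parameter-controlled techniques above, map join and map-side group by pre-aggregation, can be sketched as:

```sql
-- Map join: broadcast the small table to every mapper instead of shuffling both sides
SET hive.auto.convert.join=true;                            -- default true in recent versions
SET hive.auto.convert.join.noconditionaltask.size=10000000; -- small-table size threshold, in bytes

-- Group by pre-aggregation and skew balancing
SET hive.map.aggr=true;           -- map-side partial aggregation, default true
SET hive.groupby.skewindata=true; -- default false; adds a first job that distributes keys randomly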

3.2 Skew join configuration item

This configuration is analogous to the skew-balancing item for group by. It is enabled through hive.optimize.skewjoin (default false). When enabled, during a join Hive temporarily writes to files the rows whose key exceeds the threshold hive.skewjoin.key (100000 rows by default), and then starts another job that produces their result with a map join. The hive.skewjoin.mapjoin.map.tasks parameter controls the number of mappers of that second job (10000 by default).
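The skew join settings can be written together as:

```sql
SET hive.optimize.skewjoin=true;           -- default false
SET hive.skewjoin.key=100000;              -- rows per key beyond which the key is treated as skewed
SET hive.skewjoin.mapjoin.map.tasks=10000; -- mappers for the follow-up map join job
```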

3.3 Handling skewed keys separately

If the skewed keys are meaningful, there are usually only a few of them. They can then be extracted separately: store the corresponding rows in a temporary table, prefix the key with a small random number (for example 0-9) to spread the work, and aggregate the results at the end. Also, do not write too many joins in one SELECT statement; be sure to understand the business and the data. Split a large statement (A0-A9) into several statements executed step by step (for example A0-A4, then A5-A9), and perform the joins between the large table and the small tables first.
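A minimal sketch of the random-prefix idea for a skewed aggregation, assuming a hypothetical table s(key, val) whose key values contain no underscore (the table name, salt range, and separator are illustrative, not from the original text):

```sql
-- Step 1: attach a random 0-9 salt to the key and pre-aggregate,
--         so one hot key is split across up to 10 reducers.
-- Step 2: strip the salt and aggregate to the final result.
SELECT split(salted_key, '_')[0] AS key,
       sum(cnt)                  AS cnt
  FROM (SELECT salted_key, count(1) AS cnt
          FROM (SELECT concat(key, '_', cast(floor(rand() * 10) AS string)) AS salted_key
                  FROM s) salted
         GROUP BY salted_key) pre
 GROUP BY split(salted_key, '_')[0];
```

The salt must be computed in the inner subquery so that the grouping expression sees one stable value per row.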

4. Two SQL exercises

4.1 Find all teams that have won three consecutive championships

team, year
piston, 1990
bull, 1991
bull, 1992

 -- 1 ranking
select team, year, 
row_number() over (partition by team order by year) as rank
  from t1;

-- 2 Get group id
select team, year, 
row_number() over (partition by team order by year) as rank,
(year -row_number() over (partition by team order by year)) as groupid
  from t1;

-- 3 Group solution
select team, count(1) years
  from (select team, 
        (year -row_number() over (partition by team order by year)) as groupid
          from t1
       ) tmp
group by team, groupid
having count(1) >= 3;

4.2 find out all peaks and troughs of each id in one day

Crest: the value at this time > the value at the previous time and > the value at the next time.
Trough: the value at this time < the value at the previous time and < the value at the next time.

Sample data (id, time, price; lag gives the previous value, lead the next):
sh66688, 9:35, 29.48
sh66688, 9:40, 28.72
sh66688, 9:45, 27.74
sh66688, 9:50, 26.75
sh66688, 9:55, 27.13
sh66688, 10:00, 26.30
sh66688, 10:05, 27.09
sh66688, 10:10, 26.46
sh66688, 10:15, 26.11
sh66688, 10:20, 26.88
sh66688, 10:25, 27.49
sh66688, 10:30, 26.70
sh66688, 10:35, 27.57
sh66688, 10:40, 28.26
sh66688, 10:45, 28.03

-- Idea: the key is to find the characteristics of peaks and troughs
-- Characteristics of wave crest: Greater than the value of the previous time period and the later time period
-- Characteristics of trough: Less than the value of the previous time period and the next time period
-- Find this feature SQL Just write it

select id, time, price,
       case when price > beforeprice and price > afterprice then "peak"
            when price < beforeprice and price < afterprice then "trough"
       end as feature
  from (select id, time, price,
               lag(price) over (partition by id order by time) beforeprice,
               lead(price) over (partition by id order by time) afterprice
          from t2
       ) tmp
 where (price > beforeprice and price > afterprice) or
       (price < beforeprice and price < afterprice);



Posted by shank888 on Sat, 16 Apr 2022 18:23:14 +0930