Hive is a common data warehouse component in the big data field, and efficiency needs attention already at the design and development stage. Hive's efficiency is affected not only by the sheer volume of data: data skew, data redundancy, too many jobs or too much I/O, and unreasonable MapReduce allocation all have an impact as well. Hive tuning therefore covers not only the optimization of the HiveQL statements themselves, but also the adjustment of Hive configuration items and of MapReduce.
This article looks at tuning from three aspects: architecture optimization, parameter optimization, and SQL optimization.
1. Architecture
For the execution engine, choose the engine that best suits the resources of your platform, such as MR, Tez, or Spark.
If the Tez engine is selected, vectorized execution can be enabled in the optimizer, and the cost-based optimizer (CBO) can be turned on as well. The configuration is as follows:
SET hive.vectorized.execution.enabled = true;         -- default false
SET hive.vectorized.execution.reduce.enabled = true;  -- default false
SET hive.cbo.enable = true;                           -- since v0.14.0, default true
SET hive.compute.query.using.stats = true;            -- default false
SET hive.stats.fetch.column.stats = true;             -- default false
SET hive.stats.fetch.partition.stats = true;          -- default true
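The engine itself is also switched through a parameter. As a sketch, assuming Tez is installed on the cluster:

SET hive.execution.engine=tez;  -- mr (default), tez or spark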
Optimize the table design: choose partitioned tables, bucketed tables, and an appropriate storage format. To reduce data transfer, compression can be used. Some relevant parameters are listed below (see the official documentation for more), followed by a table-design sketch.
-- Intermediate result compression
SET hive.exec.compress.intermediate = true;
SET hive.intermediate.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
-- Output result compression
SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;
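As a minimal sketch of the table-design points above (partitioning, bucketing, columnar storage, compression), a table could be declared as follows; the table and column names are purely illustrative:

CREATE TABLE IF NOT EXISTS user_events (
    user_id BIGINT,
    event   STRING,
    amount  DOUBLE
)
PARTITIONED BY (dt STRING)                  -- partition by day so queries can prune whole partitions
CLUSTERED BY (user_id) INTO 32 BUCKETS      -- bucketing helps joins and sampling
STORED AS ORC                               -- columnar storage format
TBLPROPERTIES ('orc.compress' = 'SNAPPY');  -- compress the stored data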
2. Parameter optimization
The second part is parameter optimization. In fact, some of the architecture items above are also controlled by parameters. The parameter tuning in this part mainly covers the following aspects:
Local mode, strict mode, JVM reuse, parallel execution, speculative execution, merging small files, Fetch mode
2.1 Local mode
When the amount of data is small, starting distributed processing is slow and takes longer than simply running in local mode. Use the following parameters to adjust this:
SET hive.exec.mode.local.auto=true;                     -- default false
SET hive.exec.mode.local.auto.inputbytes.max=50000000;  -- local mode is used only if the input size is smaller than this value
SET hive.exec.mode.local.auto.input.files.max=5;        -- default 4; local mode is used only if the number of map tasks is smaller than this value
2.2 Strict mode
This is essentially a switch. When it is turned on, the following three kinds of statements fail; when it is off, they execute normally.
SET hive.mapred.mode=strict;  -- default nonstrict
-- With strict mode on, the following statements fail:
-- 1. Queries on a partitioned table that do not filter on the partition column
-- 2. Joins that produce a Cartesian product
-- 3. ORDER BY without a LIMIT clause
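For illustration, assuming a table t partitioned by dt with an id column (hypothetical names), strict mode behaves roughly like this:

SELECT * FROM t;                                               -- fails: no filter on the partition column
SELECT * FROM t WHERE dt = '2021-06-01';                       -- ok
SELECT * FROM t WHERE dt = '2021-06-01' ORDER BY id;           -- fails: ORDER BY without LIMIT
SELECT * FROM t WHERE dt = '2021-06-01' ORDER BY id LIMIT 10;  -- ok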
2.3 JVM reuse
In MR, execution is process-based: each task runs in its own JVM, which behaves like a short-lived job. Reusing these JVMs avoids the repeated startup cost, but the drawback is that the occupied task slots are not released until the whole job finishes, which is more noticeable when the data is skewed. Enable it with the following parameter:
SET mapreduce.job.jvm.numtasks=5;
2.4 Parallel execution
A Hive query is transformed into stages. Stages that are not dependent on each other can be executed in parallel, using the following parameters:
SET hive.exec.parallel=true;              -- default false
SET hive.exec.parallel.thread.number=16;  -- default 8
2.5 Speculative execution
These parameters trade extra resources for a shorter time to the final result. For example, when some tasks run very slowly because of uneven network or resource conditions, a backup task is started to process the same data, and the result of whichever copy finishes first is taken as the final result.
set mapreduce.map.speculative=true;
set mapreduce.reduce.speculative=true;
set hive.mapred.reduce.tasks.speculative.execution=true;
2.6 Merge small files
Merge small files before the map phase to reduce the number of map tasks:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
Merge small files at the end of the task
-- Merge small files at the end of a map-only task; default true
SET hive.merge.mapfiles = true;
-- Merge small files at the end of a map-reduce task; default false
SET hive.merge.mapredfiles = true;
-- Size of the merged files; default 256MB
SET hive.merge.size.per.task = 268435456;
-- When the average size of the output files is below this value, start a separate map-reduce task to merge them
SET hive.merge.smallfiles.avgsize = 16777216;
2.7 Fetch mode
Finally, Fetch mode tries to avoid running MapReduce at all in certain cases, such as reading the whole table, selecting specific columns, or queries with LIMIT:
SET hive.fetch.task.conversion=more;
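For example, with the setting above, simple single-table queries such as the following (hypothetical table t) can be answered by a fetch task without launching MapReduce:

SELECT * FROM t LIMIT 10;                      -- limit query
SELECT id, price FROM t;                       -- column selection
SELECT id, price FROM t WHERE id = 'sh66688';  -- simple filter, no aggregation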
3. SQL optimization
This part is more complex and may involve data skew. Data skew has always been an unavoidable problem in big data processing, and there are many ways to deal with it.
3.1 SQL optimization
SQL optimization is the part developers can control most easily, and it is usually done from experience. The common techniques are summarized below.
Column pruning and partition pruning, sort by instead of order by, group by instead of count(distinct), pre-aggregation for group by (controlled by parameters), skew-handling configuration items, map joins, filtering null values separately, and adjusting the number of mappers and reducers appropriately. Almost all of these come up in day-to-day work, so optimize for them whenever possible; two of them are sketched below.
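As hedged sketches with made-up table and column names, here are two of the techniques above: replacing count(distinct) with a group by subquery, and using sort by where a global order is not required.

-- group by instead of count(distinct): the distinct counting is no longer forced into a single reducer
SELECT count(1)
FROM (SELECT user_id FROM t1 GROUP BY user_id) tmp;

-- sort by instead of order by: each reducer sorts its own output instead of doing one global sort
SELECT user_id, amount
FROM t1
SORT BY amount DESC;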
3.2 Skew-balancing configuration items
This configuration is similar to the skew-balancing configuration for group by and is enabled through hive.optimize.skewjoin (default false). When it is enabled, during the join Hive temporarily writes the rows whose join key exceeds the threshold hive.skewjoin.key (100000 by default) to files, and then starts another job that produces the result for those rows with a map join. The hive.skewjoin.mapjoin.map.tasks parameter controls the number of mappers of this second job (1000 by default).
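Put together, a sketch of the settings described above (the values shown are the defaults mentioned in the text, apart from the switch itself):

SET hive.optimize.skewjoin=true;           -- default false; enable skew join handling
SET hive.skewjoin.key=100000;              -- row-count threshold for a join key to be treated as skewed
SET hive.skewjoin.mapjoin.map.tasks=1000;  -- number of mappers for the follow-up map join job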
3.3 Handling skewed keys separately
If the skewed keys have real business meaning, there are usually only a few of them. In that case they can be extracted separately: store the corresponding rows in a temporary table, prefix the key with a small random number (for example 0-9), and aggregate at the end. Do not write too many joins in a single SELECT statement; be sure to understand the business and the data. For example, instead of joining A0-A9 in one statement, split it into several statements and execute them step by step (A0-A4, then A5-A9), and perform the joins between the large table and the small tables first. A sketch of the random-prefix approach follows.
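A minimal sketch of the random-prefix (salting) idea, assuming a table t1 with a skewed column named key and a simple count per key (all names are illustrative, and the key is assumed not to contain an underscore):

SELECT split(salted_key, '_')[1] AS key,
       sum(cnt)                  AS total_cnt
FROM (
    -- first-stage aggregation on the salted key spreads the hot key across ~10 groups
    SELECT salted_key, count(1) AS cnt
    FROM (
        SELECT concat(cast(floor(rand() * 10) AS STRING), '_', key) AS salted_key
        FROM t1
    ) s
    GROUP BY salted_key
) agg
-- second-stage aggregation strips the prefix and produces the final counts
GROUP BY split(salted_key, '_')[1];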
4. Two SQL exercises
4.1 Find all teams that have won the championship three years in a row
team, year
piston, 1990
bull, 1991
bull, 1992
-- 1. Ranking
select team, year,
       row_number() over (partition by team order by year) as rank
from t1;

-- 2. Get the group id
select team, year,
       row_number() over (partition by team order by year) as rank,
       (year - row_number() over (partition by team order by year)) as groupid
from t1;

-- 3. Group and solve
select team, count(1) as years
from (
    select team,
           (year - row_number() over (partition by team order by year)) as groupid
    from t1
) tmp
group by team, groupid
having count(1) >= 3;
4.2 Find all peaks and troughs for each id within one day
Peak: the value at this time is greater than both the value at the previous time and the value at the next time.
Trough: the value at this time is less than both the value at the previous time and the value at the next time.

Sample data (the previous value comes from lag, the next value from lead):

id, time, price
sh66688, 9:35, 29.48
sh66688, 9:40, 28.72
sh66688, 9:45, 27.74
sh66688, 9:50, 26.75
sh66688, 9:55, 27.13
sh66688, 10:00, 26.30
sh66688, 10:05, 27.09
sh66688, 10:10, 26.46
sh66688, 10:15, 26.11
sh66688, 10:20, 26.88
sh66688, 10:25, 27.49
sh66688, 10:30, 26.70
sh66688, 10:35, 27.57
sh66688, 10:40, 28.26
sh66688, 10:45, 28.03
-- Idea: the key is to find the defining feature of peaks and troughs
-- Peak: greater than both the previous value and the next value
-- Trough: smaller than both the previous value and the next value
-- Once the feature is found, the SQL almost writes itself
select id, time, price,
       case when price > beforeprice and price > afterprice then "peak"
            when price < beforeprice and price < afterprice then "trough"
       end as feature
from (
    select id, time, price,
           lag(price)  over (partition by id order by time) as beforeprice,
           lead(price) over (partition by id order by time) as afterprice
    from t2
) tmp
where (price > beforeprice and price > afterprice)
   or (price < beforeprice and price < afterprice);