How does MySQL's histogram affect execution plan generation?
What are the considerations for creating a histogram?
What is the difference between the histogram and the index in the selection of the optimizer, and how to choose?
How to judge the impact of the histogram on the execution plan?
This article on the official MySQL blog answers this series of questions with very specific examples, let's go.
The original address is https://dev.mysql.com/blog-archive/histogram-statistics-in-mysql/ , the following is the translation:
what is a histogram
- How many rows are there in each table?
- How many distinct values are there in each column?
- How is the data distributed in each column?
CREATE TABLE bedtime ( person_id INT, time_of_day TIME);
1) SELECT * FROM bedtime WHERE time_of_day BETWEEN "22:00:00" AND "23:59:00" 2) SELECT * FROM bedtime WHERE time_of_day BETWEEN "12:00:00" AND "14:00:00"
How to create and delete histogram statistics
To manage histogram statistics, we extend ANALYZE TABLE with two new clauses:
ANALYZE TABLE tbl_name UPDATE HISTOGRAM ON col_name [, col_name] WITH N BUCKETS; ANALYZE TABLE tbl_name DROP HISTOGRAM ON col_name [, col_name];
mysql> ANALYZE TABLE payment UPDATE HISTOGRAM ON amount WITH 32 BUCKETS; +----------------+-----------+----------+---------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+---------------------------------------------------+ | sakila.payment | histogram | status | Histogram statistics created for column 'amount'. | +----------------+-----------+----------+---------------------------------------------------+ 1 row in set (0.27 sec) mysql> ANALYZE TABLE payment UPDATE HISTOGRAM ON amount, payment_date WITH 32 BUCKETS; +----------------+-----------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+---------------------------------------------------------+ | sakila.payment | histogram | status | Histogram statistics created for column 'amount'. | | sakila.payment | histogram | status | Histogram statistics created for column 'payment_date'. | +----------------+-----------+----------+---------------------------------------------------------+
However, after a certain number of buckets, (continuing to increase the amount of data in the buckets) the effect on improving accuracy is quite low. So we recommend starting with a lower number, such as 32, and increasing it if you find it doesn't meet your needs.
mysql> ANALYZE TABLE payment DROP HISTOGRAM ON payment_date; +----------------+-----------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+---------------------------------------------------------+ | sakila.payment | histogram | status | Histogram statistics removed for column 'payment_date'. | +----------------+-----------+----------+---------------------------------------------------------+
mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_day, c_foobar, c_birth_month WITH 32 BUCKETS; +----------------+-----------+----------+----------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+----------------------------------------------------------+ | tpcds.customer | histogram | status | Histogram statistics created for column 'c_birth_day'. | | tpcds.customer | histogram | status | Histogram statistics created for column 'c_birth_month'. | | tpcds.customer | histogram | Error | The column 'c_foobar' does not exist. | +----------------+-----------+----------+----------------------------------------------------------+ 3 rows in set (0.15 sec)
How is the histogram created inside the database?
If you have read the MySQL manual, you may have seen the new system variable histogram_generation_max_mem_size . This variable will control the amount of memory (in bytes) the server is allowed to use when generating histogram statistics. So why would you want to control it?
When you specify that you want to build a histogram, the server will read all the data into memory and perform all work (including sorting) in memory. If you want to generate a histogram on a very large table, you run the risk of reading hundreds of megabytes of data into memory, which may not be desirable. So, to handle this, MySQL will calculate how many rows of data it can fit into memory given the amount of memory specified by the system variable histogram_generation_max_mem_size. If it realizes that it can only fit a subset of rows within a given memory limit, it will resort to sampling. This can be observed by looking at the property "sampling rate":
mysql> SET histogram_generation_max_mem_size = 1000000; Query OK, 0 rows affected (0.00 sec) mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_country WITH 16 BUCKETS; +----------------+-----------+----------+------------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+------------------------------------------------------------+ | tpcds.customer | histogram | status | Histogram statistics created for column 'c_birth_country'. | +----------------+-----------+----------+------------------------------------------------------------+ 1 row in set (0.22 sec) mysql> SELECT histogram->>'$."sampling-rate"' -> FROM information_schema.column_statistics -> WHERE table_name = "customer" -> AND column_name = "c_birth_country"; +---------------------------------+ | histogram->>'$."sampling-rate"' | +---------------------------------+ | 0.048743243211626014 | +---------------------------------+
Query examples
So, what can be gained using histogram statistics? Let's look at a few queries from the TPC-DS Benchmark Suite where adding a histogram can make a big difference in query execution time. Below we will use TPC-DS with a scale factor of 1, which means that the database size is about 1GB. The machine is an Intel Core i7-4770 running Debian Stretch and MySQL 8.0 RC1. This configuration is pretty standard, except that innodb_buffer_pool_size is increased to 2G so that we can fit the entire database into the buffer pool.
In order for the optimizer to actually use the statistics provided by the histogram, you just need to ensure that the optimizer switch " condition_fanout_filter ” is turned on. Note that this is turned on by default.
Query 90
mysql> SELECT CAST(amc AS DECIMAL(15, 4)) / CAST(pmc AS DECIMAL(15, 4)) am_pm_ratio -> FROM (SELECT COUNT(*) amc -> FROM web_sales, -> household_demographics, -> time_dim, -> web_page -> WHERE ws_sold_time_sk = time_dim.t_time_sk -> AND ws_ship_hdemo_sk = household_demographics.hd_demo_sk -> AND ws_web_page_sk = web_page.wp_web_page_sk -> AND time_dim.t_hour BETWEEN 9 AND 9 + 1 -> AND household_demographics.hd_dep_count = 2 -> AND web_page.wp_char_count BETWEEN 5000 AND 5200) at, -> (SELECT COUNT(*) pmc -> FROM web_sales, -> household_demographics, -> time_dim, -> web_page -> WHERE ws_sold_time_sk = time_dim.t_time_sk -> AND ws_ship_hdemo_sk = household_demographics.hd_demo_sk -> AND ws_web_page_sk = web_page.wp_web_page_sk -> AND time_dim.t_hour BETWEEN 15 AND 15 + 1 -> AND household_demographics.hd_dep_count = 2 -> AND web_page.wp_char_count BETWEEN 5000 AND 5200) pt -> ORDER BY am_pm_ratio -> LIMIT 100; +-------------+ | am_pm_ratio | +-------------+ | 1.27619048 | +-------------+ 1 row in set (1.48 sec)
mysql> ANALYZE TABLE web_page UPDATE HISTOGRAM ON wp_char_count WITH 8 BUCKETS; +----------------+-----------+----------+----------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+----------------------------------------------------------+ | tpcds.web_page | histogram | status | Histogram statistics created for column 'wp_char_count'. | +----------------+-----------+----------+----------------------------------------------------------+ 1 row in set (0.06 sec) mysql> SELECT ... +-------------+ | am_pm_ratio | +-------------+ | 1.27619048 | +-------------+ 1 row in set (0.50 sec)
mysql> SELECT -> (SELECT COUNT(*) FROM web_page WHERE web_page.wp_char_count BETWEEN 5000 AND 5200) -> / -> (SELECT COUNT(*) FROM web_page) AS ratio; +--------+ | ratio | +--------+ | 0.0167 | +--------+ 1 row in set (0.00 sec)
Query 61
mysql> SELECT promotions, -> total, -> CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 -> FROM (SELECT SUM(ss_ext_sales_price) promotions -> FROM store_sales, -> store, -> promotion, -> date_dim, -> customer, -> customer_address, -> item -> WHERE ss_sold_date_sk = d_date_sk -> AND ss_store_sk = s_store_sk -> AND ss_promo_sk = p_promo_sk -> AND ss_customer_sk = c_customer_sk -> AND ca_address_sk = c_current_addr_sk -> AND ss_item_sk = i_item_sk -> AND ca_gmt_offset = -5 -> AND i_category = 'Home' -> AND ( p_channel_dmail = 'Y' -> OR p_channel_email = 'Y' -> OR p_channel_tv = 'Y' ) -> AND s_gmt_offset = -5 -> AND d_year = 2000 -> AND d_moy = 12) promotional_sales, -> (SELECT SUM(ss_ext_sales_price) total -> FROM store_sales, -> store, -> date_dim, -> customer, -> customer_address, -> item -> WHERE ss_sold_date_sk = d_date_sk -> AND ss_store_sk = s_store_sk -> AND ss_customer_sk = c_customer_sk -> AND ca_address_sk = c_current_addr_sk -> AND ss_item_sk = i_item_sk -> AND ca_gmt_offset = -5 -> AND i_category = 'Home' -> AND s_gmt_offset = -5 -> AND d_year = 2000 -> AND d_moy = 12) all_sales -> ORDER BY promotions, -> total -> LIMIT 100; +------------+------------+--------------------------------------------------------------------------+ | promotions | total | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 | +------------+------------+--------------------------------------------------------------------------+ | 3213210.07 | 5966836.78 | 53.85114741 | +------------+------------+--------------------------------------------------------------------------+ 1 row in set (2.78 sec)
mysql> ANALYZE TABLE store UPDATE HISTOGRAM ON s_gmt_offset WITH 8 BUCKETS; +-------------+-----------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +-------------+-----------+----------+---------------------------------------------------------+ | tpcds.store | histogram | status | Histogram statistics created for column 's_gmt_offset'. | +-------------+-----------+----------+---------------------------------------------------------+ 1 row in set (0.06 sec) mysql> SELECT ... +------------+------------+--------------------------------------------------------------------------+ | promotions | total | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 | +------------+------------+--------------------------------------------------------------------------+ | 3213210.07 | 5966836.78 | 53.85114741 | +------------+------------+--------------------------------------------------------------------------+ 1 row in set (1.37 sec)
But, why not an index?
mysql> CREATE INDEX s_gmt_offset_idx ON store (s_gmt_offset); Query OK, 0 rows affected (0.53 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> SELECT ... +------------+------------+--------------------------------------------------------------------------+ | promotions | total | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 | +------------+------------+--------------------------------------------------------------------------+ | 3213210.07 | 5966836.78 | 53.85114741 | +------------+------------+--------------------------------------------------------------------------+ 1 row in set (1.41 sec)
However, there are two reasons why you might consider histograms instead of indices:
1. There is a cost to maintaining the index. If you have an index, every INSERT/UPDATE/DELETE will cause the index to be updated. It's not free and will hit your performance. A histogram, on the other hand, is created only once and is never updated unless you explicitly ask for it. So it won't hurt your insert/update/delete performance.
2. If you have an index, the optimizer will perform what we call "index dives" to estimate the number of records in a given range. This also has a cost, which may be too high if you have very long in-lists in your query. In this case, histogram statistics are much cheaper and thus probably more appropriate.
Translator's Note: Simply understand, index dives means that when MySQL generates an execution plan for a statement such as where id in (***, ***, ...), it scans index pages to estimate eligible data The potential problem of this method is that if there are a lot of data pages that meet the conditions, it is only necessary to scan a large number of data pages when the execution plan is evaluated and truncated for a long time, which may cause a certain performance loss. If you change the evaluation method , that is, to make an evaluation based on statistical information, which can avoid the potential situation of scanning a large number of index pages (but the estimation based on statistical information is not perfect, and the biggest problem is that it is not accurate enough). The parameter of index dives is eq_range_index_dive_limit, the default is 200.
Check histogram statistics
Histogram statistics are stored in the data dictionary as JSON objects, which makes them both flexible and readable. For example, you can use the built-in JSON functions to extract information from a histogram. Say you want to know when your histogram was created/updated for the "amount" column in the "payment" table. You can easily use the JSON backreference extraction operator to find this information:mysql> SELECT -> HISTOGRAM->>'$."last-updated"' AS last_updated -> FROM INFORMATION_SCHEMA.COLUMN_STATISTICS -> WHERE -> SCHEMA_NAME = "sakila" -> AND TABLE_NAME = "payment" -> AND COLUMN_NAME = "amount"; +----------------------------+ | last_updated | +----------------------------+ | 2017-09-15 11:54:25.000000 | +----------------------------+
mysql> SELECT -> TABLE_NAME, -> COLUMN_NAME, -> HISTOGRAM->>'$."number-of-buckets-specified"' AS num_buckets_specified, -> JSON_LENGTH(HISTOGRAM, '$.buckets') AS num_buckets_created -> FROM INFORMATION_SCHEMA.COLUMN_STATISTICS -> WHERE -> SCHEMA_NAME = "sakila"; +------------+--------------+-----------------------+---------------------+ | TABLE_NAME | COLUMN_NAME | num_buckets_specified | num_buckets_created | +------------+--------------+-----------------------+---------------------+ | payment | amount | 32 | 19 | | payment | payment_date | 32 | 32 | +------------+--------------+-----------------------+---------------------+
We refer to the manual for more information on what information can be extracted from the histogram.
optimizer trace
If you want to know the estimates made by the histogram, the easiest way is to look at the EXPLAIN output:mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day BETWEEN 1 AND 10; +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | 1 | SIMPLE | customer | NULL | ALL | NULL | NULL | NULL | NULL | 98633 | 11.11 | Using where | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ 1 row in set, 1 warning (0.00 sec) mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_day WITH 32 BUCKETS; +----------------+-----------+----------+--------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+--------------------------------------------------------+ | tpcds.customer | histogram | status | Histogram statistics created for column 'c_birth_day'. | +----------------+-----------+----------+--------------------------------------------------------+ 1 row in set (0.10 sec) mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day BETWEEN 1 AND 10; +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | 1 | SIMPLE | customer | NULL | ALL | NULL | NULL | NULL | NULL | 98633 | 32.12 | Using where | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ 1 row in set, 1 warning (0.00 sec)
mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day <= 20 AND c_birth_year = 1967; +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | 1 | SIMPLE | customer | NULL | ALL | NULL | NULL | NULL | NULL | 98633 | 6.38 | Using where | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ 1 row in set, 1 warning (0.00 sec)
mysql> SET OPTIMIZER_TRACE = "enabled=on"; Query OK, 0 rows affected (0.00 sec) mysql> SET OPTIMIZER_TRACE_MAX_MEM_SIZE = 1000000; Query OK, 0 rows affected (0.00 sec) mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day <= 20 AND c_birth_year = 1967; mysql> SELECT JSON_EXTRACT(TRACE, "$**.filtering_effect") FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE; +----------------------------------------------------------------------------------------+ | JSON_EXTRACT(TRACE, "$**.filtering_effect") | +----------------------------------------------------------------------------------------+ | [[{"condition": "(`customer`.`c_birth_day` <= 20)", "histogram_selectivity": 0.6376}]] | +----------------------------------------------------------------------------------------+ 1 row in set (0.00 sec)
mysql> SELECT -> (SELECT count(*) FROM customer WHERE c_birth_day <= 20) -> / -> (SELECT COUNT(*) FROM customer) AS ratio; +--------+ | ratio | +--------+ | 0.6376 | +--------+ 1 row in set (0.03 sec)