(Translation) Histogram statistics in MySQL

What is a histogram and how to create a histogram in MySQL 8.0.3?
How does MySQL's histogram affect execution plan generation?
What are the considerations for creating a histogram?
What is the difference between the histogram and the index in the selection of the optimizer, and how to choose?
How to judge the impact of the histogram on the execution plan?
This article on the official MySQL blog answers this series of questions with very specific examples, let's go.
The original address is https://dev.mysql.com/blog-archive/histogram-statistics-in-mysql/ , the following is the translation:
 
 
Beginning with MySQL 8.0.3, you can create histogram statistics to provide more statistics to the optimizer. In this blog post, we'll look at how to create histogram statistics and explain when it might be useful to use histogram statistics.

what is a histogram

The query optimizer is the part of the database responsible for converting SQL queries into execution plans that are as efficient as possible. Sometimes the query optimizer is unable to find the most efficient plan and ends up spending more time executing queries than necessary. The main reason for this is usually that the optimizer does not have sufficient knowledge of the data distribution it is querying:
  • How many rows are there in each table?
  • How many distinct values ​​are there in each column?
  • How is the data distributed in each column?
An example where the query optimizer is missing important information, a table of people's daily sleep times:
CREATE TABLE bedtime (
  person_id INT,
  time_of_day TIME);
For the "time_of_day" field, most values ​​are likely to be around 11:00PM, because most people go to bed at this time. So the first query below returns more data rows than the second query.
1) SELECT * FROM bedtime WHERE time_of_day BETWEEN "22:00:00" AND "23:59:00"
2) SELECT * FROM bedtime WHERE time_of_day BETWEEN "12:00:00" AND "14:00:00"
In the absence of any statistics available, the optimizer will assume that the values ​​in "time_of_day" are evenly distributed (i.e., a person is as likely to go to bed at around 3pm as they are at around 11pm). How can the query optimizer be made aware of this skewness in the data? One solution to this is to create a histogram statistic for the column.
A histogram is an approximation of the distribution of a column of data. It can tell you with considerable accuracy if your data is skewed, which in turn will help the database server understand the nature of the data it contains. There are many different flavors of histograms, and in MySQL we have chosen to support two different types: "singleton (equal width)" histograms and "equal height" histograms. What all histogram types have in common is that they divide the data set into a set of "buckets". MySQL automatically divides the values ​​into buckets and automatically decides what type of histogram to create.

How to create and delete histogram statistics

To manage histogram statistics, we extend ANALYZE TABLE with two new clauses:

ANALYZE TABLE tbl_name UPDATE HISTOGRAM ON col_name [, col_name] WITH N BUCKETS;
ANALYZE TABLE tbl_name DROP HISTOGRAM ON col_name [, col_name];
The first syntax allows you to create histogram statistics for one or more columns at once:
mysql> ANALYZE TABLE payment UPDATE HISTOGRAM ON amount WITH 32 BUCKETS;
+----------------+-----------+----------+---------------------------------------------------+
| Table          | Op        | Msg_type | Msg_text                                          |
+----------------+-----------+----------+---------------------------------------------------+
| sakila.payment | histogram | status   | Histogram statistics created for column 'amount'. |
+----------------+-----------+----------+---------------------------------------------------+
1 row in set (0.27 sec)
 
mysql> ANALYZE TABLE payment UPDATE HISTOGRAM ON amount, payment_date WITH 32 BUCKETS;
+----------------+-----------+----------+---------------------------------------------------------+
| Table          | Op        | Msg_type | Msg_text                                                |
+----------------+-----------+----------+---------------------------------------------------------+
| sakila.payment | histogram | status   | Histogram statistics created for column 'amount'.       |
| sakila.payment | histogram | status   | Histogram statistics created for column 'payment_date'. |
+----------------+-----------+----------+---------------------------------------------------------+
Note that the number of buckets must be specified and can be in the range 1 to 1024. How many buckets you should choose for your dataset depends on several factors; how many distinct values ​​you have, how skewed your dataset is, how high accuracy you need, etc.
However, after a certain number of buckets, (continuing to increase the amount of data in the buckets) the effect on improving accuracy is quite low. So we recommend starting with a lower number, such as 32, and increasing it if you find it doesn't meet your needs.
 
In the example above, we can see that we have built the histogram twice for the column "amount". In the first query, a new histogram is created. In the second query, the histogram for "amount" is automatically overwritten.
If you want to delete any histogram statistics you created, you can simply use the DROP histogram syntax:
mysql> ANALYZE TABLE payment DROP HISTOGRAM ON payment_date;
+----------------+-----------+----------+---------------------------------------------------------+
| Table          | Op        | Msg_type | Msg_text                                                |
+----------------+-----------+----------+---------------------------------------------------------+
| sakila.payment | histogram | status   | Histogram statistics removed for column 'payment_date'. |
+----------------+-----------+----------+---------------------------------------------------------+
As with UPDATE HISTOGRAM, you can specify multiple columns in the same command. One feature worth noting is that the ANALYZE TABLE command will attempt to perform as much work as possible, even if errors occur during command execution. Suppose you specify three columns, but the second column does not exist. The server will still create and store histograms for the first and third columns:
mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_day, c_foobar, c_birth_month WITH 32 BUCKETS;
+----------------+-----------+----------+----------------------------------------------------------+
| Table          | Op        | Msg_type | Msg_text                                                 |
+----------------+-----------+----------+----------------------------------------------------------+
| tpcds.customer | histogram | status   | Histogram statistics created for column 'c_birth_day'.   |
| tpcds.customer | histogram | status   | Histogram statistics created for column 'c_birth_month'. |
| tpcds.customer | histogram | Error    | The column 'c_foobar' does not exist.                    |
+----------------+-----------+----------+----------------------------------------------------------+
3 rows in set (0.15 sec)

How is the histogram created inside the database?

If you have read the MySQL manual, you may have seen the new system variable histogram_generation_max_mem_size . This variable will control the amount of memory (in bytes) the server is allowed to use when generating histogram statistics. So why would you want to control it?

When you specify that you want to build a histogram, the server will read all the data into memory and perform all work (including sorting) in memory. If you want to generate a histogram on a very large table, you run the risk of reading hundreds of megabytes of data into memory, which may not be desirable. So, to handle this, MySQL will calculate how many rows of data it can fit into memory given the amount of memory specified by the system variable histogram_generation_max_mem_size. If it realizes that it can only fit a subset of rows within a given memory limit, it will resort to sampling. This can be observed by looking at the property "sampling rate":

mysql> SET histogram_generation_max_mem_size = 1000000;
Query OK, 0 rows affected (0.00 sec)
 
mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_country WITH 16 BUCKETS;
+----------------+-----------+----------+------------------------------------------------------------+
| Table          | Op        | Msg_type | Msg_text                                                   |
+----------------+-----------+----------+------------------------------------------------------------+
| tpcds.customer | histogram | status   | Histogram statistics created for column 'c_birth_country'. |
+----------------+-----------+----------+------------------------------------------------------------+
1 row in set (0.22 sec)
 
mysql> SELECT histogram->>'$."sampling-rate"'
    -> FROM   information_schema.column_statistics
    -> WHERE  table_name = "customer"
    ->        AND column_name = "c_birth_country";
+---------------------------------+
| histogram->>'$."sampling-rate"' |
+---------------------------------+
| 0.048743243211626014            |
+---------------------------------+
Here we can see that the optimizer created a histogram by reading about 4.8% of the data in the "c_birth_country" column. It's worth noting that sampling is non-deterministic, so two subsequent calls to "ANALYZE TABLE tbl UPDATE HISTOGRAM..." on the same dataset may give you two different histograms if you use sampling.

Query examples

So, what can be gained using histogram statistics? Let's look at a few queries from the TPC-DS Benchmark Suite where adding a histogram can make a big difference in query execution time. Below we will use TPC-DS with a scale factor of 1, which means that the database size is about 1GB. The machine is an Intel Core i7-4770 running Debian Stretch and MySQL 8.0 RC1. This configuration is pretty standard, except that innodb_buffer_pool_size is increased to 2G so that we can fit the entire database into the buffer pool.

In order for the optimizer to actually use the statistics provided by the histogram, you just need to ensure that the optimizer switch " condition_fanout_filter ” is turned on. Note that this is turned on by default.

Query 90

Benchmark Suite describes this query as "What is the ratio of the number of items sold on the Internet in the morning to the number of items sold in the evening by a customer with a specified number of dependents. Only sites with a large amount of content are considered.
mysql> SELECT CAST(amc AS DECIMAL(15, 4)) / CAST(pmc AS DECIMAL(15, 4)) am_pm_ratio
    -> FROM   (SELECT COUNT(*) amc
    ->         FROM   web_sales,
    ->                household_demographics,
    ->                time_dim,
    ->                web_page
    ->         WHERE  ws_sold_time_sk = time_dim.t_time_sk
    ->                AND ws_ship_hdemo_sk = household_demographics.hd_demo_sk
    ->                AND ws_web_page_sk = web_page.wp_web_page_sk
    ->                AND time_dim.t_hour BETWEEN 9 AND 9 + 1
    ->                AND household_demographics.hd_dep_count = 2
    ->                AND web_page.wp_char_count BETWEEN 5000 AND 5200) at,
    ->        (SELECT COUNT(*) pmc
    ->         FROM   web_sales,
    ->                household_demographics,
    ->                time_dim,
    ->                web_page
    ->         WHERE  ws_sold_time_sk = time_dim.t_time_sk
    ->                AND ws_ship_hdemo_sk = household_demographics.hd_demo_sk
    ->                AND ws_web_page_sk = web_page.wp_web_page_sk
    ->                AND time_dim.t_hour BETWEEN 15 AND 15 + 1
    ->                AND household_demographics.hd_dep_count = 2
    ->                AND web_page.wp_char_count BETWEEN 5000 AND 5200) pt
    -> ORDER  BY am_pm_ratio
    -> LIMIT  100;
+-------------+
| am_pm_ratio |
+-------------+
|  1.27619048 |
+-------------+
1 row in set (1.48 sec)
As we can see, it takes about 1.5 seconds to execute the query. This doesn't seem like much, but by adding a histogram on a single column, we can make this query run three times faster (query truncated for readability);
mysql> ANALYZE TABLE web_page UPDATE HISTOGRAM ON wp_char_count WITH 8 BUCKETS;
+----------------+-----------+----------+----------------------------------------------------------+
| Table          | Op        | Msg_type | Msg_text                                                 |
+----------------+-----------+----------+----------------------------------------------------------+
| tpcds.web_page | histogram | status   | Histogram statistics created for column 'wp_char_count'. |
+----------------+-----------+----------+----------------------------------------------------------+
1 row in set (0.06 sec)
 
mysql> SELECT ...
+-------------+
| am_pm_ratio |
+-------------+
|  1.27619048 |
+-------------+
1 row in set (0.50 sec)
For this histogram, the query now takes about 0.5 seconds. Why? The main reason can be found by looking at the predicate "web_page". wp_char_count BETWEEN 5000 AND 5200 ". In the absence of any statistics available, the optimizer assumes that 11.11% of the rows in the table "web_page" match the given predicate. However, this is wrong. By inspecting the table, we can see to only 1.6% matching this predicate (one row out of 60):
mysql> SELECT
    ->   (SELECT COUNT(*) FROM web_page WHERE web_page.wp_char_count BETWEEN 5000 AND 5200)
    ->   /
    ->   (SELECT COUNT(*) FROM web_page) AS ratio;
+--------+
| ratio  |
+--------+
| 0.0167 |
+--------+
1 row in set (0.00 sec)
With the histogram statistics, the optimizer now knows this and pushes the table earlier in the join order. The principle of small tables driving large tables), so as to generate execution plans, and the execution speed is increased by three times.

Query 61

The query is described as "Find the proportion of sales of an item with and without a sale for a given month and year". Only products of a certain category sold to customers living in a certain time zone will be considered. "This is a large complex query with multiple joins:
mysql> SELECT promotions,                                                                                                                                                                                                ->        total,
    ->        CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100
    -> FROM   (SELECT SUM(ss_ext_sales_price) promotions
    ->         FROM   store_sales,
    ->                store,
    ->                promotion,
    ->                date_dim,
    ->                customer,
    ->                customer_address,
    ->                item
    ->         WHERE  ss_sold_date_sk = d_date_sk
    ->                AND ss_store_sk = s_store_sk
    ->                AND ss_promo_sk = p_promo_sk
    ->                AND ss_customer_sk = c_customer_sk
    ->                AND ca_address_sk = c_current_addr_sk
    ->                AND ss_item_sk = i_item_sk
    ->                AND ca_gmt_offset = -5
    ->                AND i_category = 'Home'
    ->                AND ( p_channel_dmail = 'Y'
    ->                       OR p_channel_email = 'Y'
    ->                       OR p_channel_tv = 'Y' )
    ->                AND s_gmt_offset = -5
    ->                AND d_year = 2000
    ->                AND d_moy = 12) promotional_sales,
    ->        (SELECT SUM(ss_ext_sales_price) total
    ->         FROM   store_sales,
    ->                store,
    ->                date_dim,
    ->                customer,
    ->                customer_address,
    ->                item
    ->         WHERE  ss_sold_date_sk = d_date_sk
    ->                AND ss_store_sk = s_store_sk
    ->                AND ss_customer_sk = c_customer_sk
    ->                AND ca_address_sk = c_current_addr_sk
    ->                AND ss_item_sk = i_item_sk
    ->                AND ca_gmt_offset = -5
    ->                AND i_category = 'Home'
    ->                AND s_gmt_offset = -5
    ->                AND d_year = 2000
    ->                AND d_moy = 12) all_sales
    -> ORDER  BY promotions,
    ->           total
    -> LIMIT  100;
+------------+------------+--------------------------------------------------------------------------+
| promotions | total      | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 |
+------------+------------+--------------------------------------------------------------------------+
| 3213210.07 | 5966836.78 |                                                              53.85114741 |
+------------+------------+--------------------------------------------------------------------------+
1 row in set (2.78 sec)
As you can see from the output, it takes about 2.8 seconds to execute the query. However, the query optimizer does not realize that there is only one distinct value in the column "s_gmt_offset". In the absence of any statistics available, the optimizer uses some Hardcoded estimate, this assumes 10% of rows will match the predicate "ca_gmt_offset = -5" . If we add a histogram for this column, the optimizer now knows that all rows in the table will satisfy the condition, giving us a better execution plan (query truncated for better readability):
mysql> ANALYZE TABLE store UPDATE HISTOGRAM ON s_gmt_offset WITH 8 BUCKETS;
+-------------+-----------+----------+---------------------------------------------------------+
| Table       | Op        | Msg_type | Msg_text                                                |
+-------------+-----------+----------+---------------------------------------------------------+
| tpcds.store | histogram | status   | Histogram statistics created for column 's_gmt_offset'. |
+-------------+-----------+----------+---------------------------------------------------------+
1 row in set (0.06 sec)
 
mysql> SELECT ...
+------------+------------+--------------------------------------------------------------------------+
| promotions | total      | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 |
+------------+------------+--------------------------------------------------------------------------+
| 3213210.07 | 5966836.78 |                                                              53.85114741 |
+------------+------------+--------------------------------------------------------------------------+
1 row in set (1.37 sec)
With this histogram, the query execution time dropped to less than 1.4 seconds, a 2x improvement. The reason is that in the first plan, the optimizer selects the first derived table to perform a full table scan on the table store, and then performs a primary key lookup in <item, store_sales, date_dim, customer, and customer_address respectively. However, when it realizes that Table Store will return more rows than it expected, and no histogram statistics are available, the optimizer chooses to perform a full table scan on the table entries and perform a full table scan on store_sales , store , date_dim , customer and finally perform a primary key lookup in customer_address.

But, why not an index?

Some of you are probably thinking by now that an index could do just as well, and it does:
mysql> CREATE INDEX s_gmt_offset_idx ON store (s_gmt_offset);
Query OK, 0 rows affected (0.53 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> SELECT ...
+------------+------------+--------------------------------------------------------------------------+
| promotions | total      | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 |
+------------+------------+--------------------------------------------------------------------------+
| 3213210.07 | 5966836.78 |                                                              53.85114741 |
+------------+------------+--------------------------------------------------------------------------+
1 row in set (1.41 sec)

However, there are two reasons why you might consider histograms instead of indices:

1. There is a cost to maintaining the index. If you have an index, every INSERT/UPDATE/DELETE will cause the index to be updated. It's not free and will hit your performance. A histogram, on the other hand, is created only once and is never updated unless you explicitly ask for it. So it won't hurt your insert/update/delete performance.

2. If you have an index, the optimizer will perform what we call "index dives" to estimate the number of records in a given range. This also has a cost, which may be too high if you have very long in-lists in your query. In this case, histogram statistics are much cheaper and thus probably more appropriate.

Translator's Note: Simply understand, index dives means that when MySQL generates an execution plan for a statement such as where id in (***, ***, ...), it scans index pages to estimate eligible data The potential problem of this method is that if there are a lot of data pages that meet the conditions, it is only necessary to scan a large number of data pages when the execution plan is evaluated and truncated for a long time, which may cause a certain performance loss. If you change the evaluation method , that is, to make an evaluation based on statistical information, which can avoid the potential situation of scanning a large number of index pages (but the estimation based on statistical information is not perfect, and the biggest problem is that it is not accurate enough). The parameter of index dives is eq_range_index_dive_limit, the default is 200.

 

Check histogram statistics

Histogram statistics are stored in the data dictionary as JSON objects, which makes them both flexible and readable. For example, you can use the built-in JSON functions to extract information from a histogram. Say you want to know when your histogram was created/updated for the "amount" column in the "payment" table. You can easily use the JSON backreference extraction operator to find this information:
mysql> SELECT
    ->   HISTOGRAM->>'$."last-updated"' AS last_updated
    -> FROM INFORMATION_SCHEMA.COLUMN_STATISTICS
    -> WHERE
    ->   SCHEMA_NAME = "sakila"
    ->   AND TABLE_NAME = "payment"
    ->   AND COLUMN_NAME = "amount";
+----------------------------+
| last_updated               |
+----------------------------+
| 2017-09-15 11:54:25.000000 |
+----------------------------+
Or suppose you want to find out how many buckets are in the histogram compared to the number of buckets you specified in the ANALYZE TABLE statement:
mysql> SELECT
    ->   TABLE_NAME,
    ->   COLUMN_NAME,
    ->   HISTOGRAM->>'$."number-of-buckets-specified"' AS num_buckets_specified,
    ->   JSON_LENGTH(HISTOGRAM, '$.buckets') AS num_buckets_created
    -> FROM INFORMATION_SCHEMA.COLUMN_STATISTICS
    -> WHERE
    ->   SCHEMA_NAME = "sakila";
+------------+--------------+-----------------------+---------------------+
| TABLE_NAME | COLUMN_NAME  | num_buckets_specified | num_buckets_created |
+------------+--------------+-----------------------+---------------------+
| payment    | amount       | 32                    |                  19 |
| payment    | payment_date | 32                    |                  32 |
+------------+--------------+-----------------------+---------------------+

We refer to the manual for more information on what information can be extracted from the histogram.

optimizer trace

If you want to know the estimates made by the histogram, the easiest way is to look at the EXPLAIN output:
mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day BETWEEN 1 AND 10;
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type | table    | partitions | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | customer | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 98633 |    11.11 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
 
mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_day WITH 32 BUCKETS;
+----------------+-----------+----------+--------------------------------------------------------+
| Table          | Op        | Msg_type | Msg_text                                               |
+----------------+-----------+----------+--------------------------------------------------------+
| tpcds.customer | histogram | status   | Histogram statistics created for column 'c_birth_day'. |
+----------------+-----------+----------+--------------------------------------------------------+
1 row in set (0.10 sec)
 
mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day BETWEEN 1 AND 10;
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type | table    | partitions | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | customer | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 98633 |    32.12 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
 
If you look at the "Filter" column, you'll see that it went from the default 11.11% to a more precise 32.12%. However, if you have multiple conditions where some columns have histogram statistics and others do not, it will be difficult to know what the optimizer has estimated:
mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day <= 20 AND c_birth_year = 1967;
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type | table    | partitions | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | customer | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 98633 |     6.38 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
If you want to see the estimates made by the histogram in more detail, you can look at the query's trace:
mysql> SET OPTIMIZER_TRACE = "enabled=on";
Query OK, 0 rows affected (0.00 sec)
 
mysql> SET OPTIMIZER_TRACE_MAX_MEM_SIZE = 1000000;
Query OK, 0 rows affected (0.00 sec)
 
mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day <= 20 AND c_birth_year = 1967;
 
mysql> SELECT JSON_EXTRACT(TRACE, "$**.filtering_effect") FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;
+----------------------------------------------------------------------------------------+
| JSON_EXTRACT(TRACE, "$**.filtering_effect")                                            |
+----------------------------------------------------------------------------------------+
| [[{"condition": "(`customer`.`c_birth_day` <= 20)", "histogram_selectivity": 0.6376}]] |
+----------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Here we use JSON_EXTRACT function to extract relevant parts from trace output. Here we can see that for each condition using a histogram we can see the estimated selectivity. In this example, the selectivity for only one of the conditions (c_birth_day <= 20) was estimated and 63.76% of the rows in the estimated column of the histogram would match this condition. In fact, this matches the actual data in the column:
mysql> SELECT
    ->   (SELECT count(*) FROM customer WHERE c_birth_day <= 20)
    ->   /
    ->   (SELECT COUNT(*) FROM customer) AS ratio;
+--------+
| ratio  |
+--------+
| 0.6376 |
+--------+
1 row in set (0.03 sec)

 


 

Posted by vishal99 on Mon, 05 Dec 2022 13:52:51 +1030