This article is shared from Huawei Cloud Community " GaussDB(DWS) fuzzy query performance optimization ", Author: The Wind of Dawn.
When using GaussDB(DWS), fuzzy query through like sometimes encounters the problem of slow query performance.
(1) LIKE fuzzy query
A typical query statement is as follows:
select * from t1 where c1 like 'A123%';
When the amount of data in table t1 is large, use like to perform fuzzy query, and the query speed is very slow.
View the query plan generated by the statement through explain:
test=# explain select * from t1 where c1 like 'A123%'; QUERY PLAN ----------------------------------------------------------------------------- id | operation | E-rows | E-memory | E-width | E-costs ----+------------------------------+--------+----------+---------+--------- 1 | -> Streaming (type: GATHER) | 1 | | 8 | 16.25 2 | -> Seq Scan on t1 | 1 | 1MB | 8 | 10.25 Predicate Information (identified by plan id) --------------------------------------------- 2 --Seq Scan on t1 Filter: (c1 ~~ 'A123%'::text)
The query plan shows that a full table scan is performed on table t1, so the execution speed will be slower when the amount of data in table t1 is large.
The fuzzy matching condition 'A123%' of the above query, we call it post-fuzzy matching. In this scenario, query performance can be improved by creating a BTREE index.
When building an index, you need to set the operator corresponding to the index according to the field data type. For text, varchar, and char, set text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops respectively.
For example, the type of column c1 in the above example is text, and text_pattern_ops is added when creating an index. The statement for creating an index is as follows:
CREATE INDEX ON t1 (c1 text_pattern_ops);
Print the query plan after adding the index:
test=# explain select * from t1 where c1 like 'A123%'; QUERY PLAN ---------------------------------------------------------------------------------------- id | operation | E-rows | E-memory | E-width | E-costs ----+-----------------------------------------+--------+----------+---------+--------- 1 | -> Streaming (type: GATHER) | 1 | | 8 | 14.27 2 | -> Index Scan using t1_c1_idx on t1 | 1 | 1MB | 8 | 8.27 Predicate Information (identified by plan id) ---------------------------------------------------------------------- 2 --Index Scan using t1_c1_idx on t1 Index Cond: ((c1 ~>=~ 'A123'::text) AND (c1 ~<~ 'A124'::text)) Filter: (c1 ~~ 'A123%'::text)
After creating the index, you can see that the previously created index will be used when the statement is executed, and the execution speed will become faster.
The query condition used in the previous problem is the fuzzy query of the suffix. If the fuzzy query of the prefix is used, we can check whether the query plan uses the index.
test=# explain select * from t1 where c1 like '%A123'; QUERY PLAN ----------------------------------------------------------------------------- id | operation | E-rows | E-memory | E-width | E-costs ----+------------------------------+--------+----------+---------+--------- 1 | -> Streaming (type: GATHER) | 1 | | 8 | 16.25 2 | -> Seq Scan on t1 | 1 | 1MB | 8 | 10.25 Predicate Information (identified by plan id) --------------------------------------------- 2 --Seq Scan on t1 Filter: (c1 ~~ '%A123'::text)
As shown in the figure above, when the query condition becomes a prefixed fuzzy query, the previously built index cannot be used, and the entire table is scanned when the query is executed.
In this case, we can use the reverse function (reverse) to build an index to support the pre-fuzzy query. The index building statement is as follows:
CREATE INDEX ON t1 (reverse(c1) text_pattern_ops);
After rewriting the conditions of the query statement using the reverse function, output the query plan:
test=# explain select * from t1 where reverse(c1) like 'A123%'; QUERY PLAN ------------------------------------------------------------------------------------------ id | operation | E-rows | E-memory | E-width | E-costs ----+-------------------------------+--------+----------+---------+--------- 1 | -> Streaming (type: GATHER) | 5 | | 8 | 14.06 2 | -> Bitmap Heap Scan on t1 | 5 | 1MB | 8 | 8.06 3 | -> Bitmap Index Scan | 5 | 1MB | 0 | 4.28 Predicate Information (identified by plan id) ---------------------------------------------------------------------------------------- 2 --Bitmap Heap Scan on t1 Filter: (reverse(c1) ~~ 'A123%'::text) 3 --Bitmap Index Scan Index Cond: ((reverse(c1) ~>=~ 'A123'::text) AND (reverse(c1) ~<~ 'A124'::text))
After the statement is rewritten, it can go to the index, and the query performance is improved.
(2) Specify collate to create an index
If you use the default index ops class, to make the b-tree index support fuzzy queries, you need to specify collate="C" when querying and indexing.
Note: The index can only be used when the collate of the index and the query condition are consistent.
The statement to create an index is:
CREATE INDEX ON t1 (c1 collate "C");
The collate setting needs to be added to the where condition of the query statement:
test=# explain select * from t1 where c1 like 'A123%' collate "C"; QUERY PLAN ---------------------------------------------------------------------------------------- id | operation | E-rows | E-memory | E-width | E-costs ----+-----------------------------------------+--------+----------+---------+--------- 1 | -> Streaming (type: GATHER) | 1 | | 8 | 14.27 2 | -> Index Scan using t1_c1_idx on t1 | 1 | 1MB | 8 | 8.27 Predicate Information (identified by plan id) ------------------------------------------------------------------ 2 --Index Scan using t1_c1_idx on t1 Index Cond: ((c1 >= 'A123'::text) AND (c1 < 'A124'::text)) Filter: (c1 ~~ 'A123%'::text COLLATE "C")
(3) GIN inverted index
GIN (Generalized Inverted Index) general inverted index. It is designed to handle the case where the index item is a combination value, and the query needs to search for a specific element value that appears in the combination value through the index. For example, a document is composed of multiple words, and it is necessary to find out specific words contained in the document.
The following example illustrates how to use the GIN index:
create table gin_test_data(id int, chepai varchar(10), shenfenzheng varchar(20), duanxin text) distribute by hash (id); create index chepai_idx on gin_test_data using gin(to_tsvector('ngram', chepai)) with (fastupdate=on);
The above statement creates a GIN inverted index on the license plate column.
If you want to perform fuzzy query based on the license plate, you can use the following statement:
select count(*) from gin_test_data where to_tsvector('ngram', chepai) @@ to_tsquery('ngram', 'Xiang F');
The query plan for this statement is as follows:
test=# explain select count(*) from gin_test_data where to_tsvector('ngram', chepai) @@ to_tsquery('ngram', 'Xiang F'); QUERY PLAN ------------------------------------------------------------------------------------------------ id | operation | E-rows | E-memory | E-width | E-costs ----+------------------------------------------------+--------+----------+---------+--------- 1 | -> Aggregate | 1 | | 8 | 18.03 2 | -> Streaming (type: GATHER) | 1 | | 8 | 18.03 3 | -> Aggregate | 1 | 1MB | 8 | 12.03 4 | -> Bitmap Heap Scan on gin_test_data | 1 | 1MB | 0 | 12.02 5 | -> Bitmap Index Scan | 1 | 1MB | 0 | 8.00 Predicate Information (identified by plan id) ---------------------------------------------------------------------------------------------- 4 --Bitmap Heap Scan on gin_test_data Recheck Cond: (to_tsvector('ngram'::regconfig, (chepai)::text) @@ '''Xiang f'''::tsquery) 5 --Bitmap Index Scan Index Cond: (to_tsvector('ngram'::regconfig, (chepai)::text) @@ '''Xiang f'''::tsquery)
The inverted index is used in the query, so it has relatively good execution performance.
Click to follow and learn about Huawei Cloud's fresh technologies for the first time~