ClickHouse from entry to mastery
ClickHouse beginner level chapter
Getting started with ClickHouse
- ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP), open-sourced by Russia's Yandex. It is written in C++ and supports real-time SQL queries. Thanks to its efficiency on analytical queries over large data sets, ClickHouse has attracted global attention since it was open-sourced in 2016, at one point even topping GitHub's trending charts.
- OLAP: ClickHouse is designed for OLAP data processing. Compared with OLTP (online transaction processing), ClickHouse focuses on computation and analysis over massive data sets, emphasizing data throughput, query speed, and computing performance. It is not well suited to frequent modification of data, so ClickHouse is typically used to build real-time or offline back-end data warehouses.
1. Features of ClickHouse
Column storage
- ClickHouse is a true columnar storage database. Traditional databases store data row by row; for example, MySQL keeps a complete row of data in the leaf nodes of its B+ tree.
- The advantage of row storage is that when you query a single record, you can get the complete row with one disk seek plus a sequential read.
- ClickHouse stores data by column: different columns are stored separately, each column as its own file.
- Benefits of column storage:
- For column-level aggregations such as counting and summing, it performs much better than row storage;
- Since all values in a column share the same data type, the data compresses better; a well-suited compression algorithm can be chosen per column, greatly improving the compression ratio;
- Because the compression ratio is better, it saves disk space and also leaves more room for caching.
DBMS functionality
- It covers most of standard SQL syntax, including DDL and DML, as well as various supporting functions, user management, permission management, and data backup and recovery.
Diverse table engines
- Like MySQL, ClickHouse makes the table-level storage engine pluggable. Different storage engines can be chosen according to each table's requirements; there are more than 20 engines across four families, including MergeTree, Log, integration, and special engines.
High throughput write capability
- ClickHouse adopts an LSM-tree-like structure: data is written sequentially and compacted in the background at intervals. With this structure, imports are sequential appends, written data parts are immutable, and during background compaction multiple parts are merge-sorted and written back sequentially. Sequential writes make full use of disk throughput, giving excellent write performance even on HDDs.
- Official benchmarks show a write throughput of 50 MB/s to 200 MB/s; at an estimated 100 bytes per row, that is roughly 500,000 to 2,000,000 rows per second.
Data partitioning and thread level parallelism
- ClickHouse divides data into multiple partitions, each further divided into index granules, and then distributes the work across CPU cores so that each core processes part of the data in parallel. Under this design a single query can use all the CPUs of the machine, and this extreme parallelism greatly reduces query latency.
- ClickHouse can therefore split a query over a large data set into parts and process them in parallel. The downside is that dedicating many CPU cores to a single query hurts the ability to serve many concurrent queries, so high-QPS query services are not ClickHouse's strong point, as sketched below.
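- As a rough illustration (a sketch; numbers_mt is ClickHouse's multi-threaded variant of the numbers table function), per-query parallelism can be capped with the max_threads setting:

-- Sum one billion generated rows; max_threads limits how many cores the query may use
select sum(number) from numbers_mt(1000000000) settings max_threads = 8;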
2. Performance comparison
- Single table query
- Association query
- Like many OLAP databases, ClickHouse performs far better on single-table queries than on joins, and the gap between the two is pronounced.
ClickHouse installation
- ClickHouse supports a variety of installation methods; see the installation and deployment guide: https://clickhouse.com/docs/zh/getting-started/install/
1. Preparation
- Confirm that the firewall is turned off:
systemctl status firewalld.service
- CentOS cancels the limit on the number of open files
# vim /etc/security/limits.conf
* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072

# vim /etc/security/limits.d/20-nproc.conf
* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072
2. Single machine installation
- Download the installation packages from https://packages.clickhouse.com/rpm/stable/ and select the latest version, 22.3.3.44
- Installation:
sudo rpm -ivh *.rpm
# Check the installation
sudo rpm -qa | grep clickhouse
3. docker installation
docker run -di --name docker_ch -p 8123:8123 -p 9000:9000 --ulimit nofile=262144:262144 --volume=/Volumes/D/dokcer_data/clickhouse:/var/lib/clickhouse clickhouse/clickhouse-server
Data types
- Official reference documents: https://clickhouse.com/docs/zh/sql-reference/data-types/
1. Integer
- Fixed-length integers, signed or unsigned; the signed range is $[-2^{n-1},\ 2^{n-1}-1]$.
category | type | range | Java type
---|---|---|---
signed | Int8 | [-128, 127] | byte
signed | Int16 | [-32768, 32767] | short
signed | Int32 | [-2147483648, 2147483647] | int
signed | Int64 | [-9223372036854775808, 9223372036854775807] | long
unsigned | UInt8 | [0, 255] | |
unsigned | UInt16 | [0, 65535] | |
unsigned | UInt32 | [0, 4294967295] | |
unsigned | UInt64 | [0, 18446744073709551615] | |
- Applicable scenario: counts, quantities, and stored IDs.
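- As a quick check (a sketch; run in clickhouse-client), ClickHouse assigns a literal the smallest integer type that fits it:

-- Expected: UInt8, UInt16, Int8
select toTypeName(255), toTypeName(256), toTypeName(-1);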
2. Floating point type
- Float32 corresponds to Java's float, Float64 to double.
- It is recommended to store such data in integer form whenever possible, e.g. convert fixed-precision numbers to integers and express times in milliseconds, because floating-point calculations may introduce rounding errors (see the example below).
- Applicable scenario: values that are small, not involved in heavy statistical calculation, and not precision-critical, such as the weight of goods.
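- A minimal demonstration of the rounding error mentioned above:

-- Returns 0.09999999999999998 rather than 0.1, because the literals are Float64
select 1 - 0.9;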
3. Boolean
- There is no separate type for Boolean values; use UInt8 with values restricted to 0 or 1.
4. Decimal type
- A signed fixed-point number that keeps precision during addition, subtraction, and multiplication. For division, the least significant digits are discarded (not rounded); see the demonstration below.
- There are three declarations, where s is the number of decimal places:
- Decimal32(s): equivalent to Decimal(9, s), with up to 9 significant digits
- Decimal64(s): equivalent to Decimal(18, s), with up to 18 significant digits
- Decimal128(s): equivalent to Decimal(38, s), with up to 38 significant digits
- Applicable scenario: fields such as amounts, exchange rates, and interest rates are stored as Decimal in order to preserve precision.
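- A small demonstration of the truncating division (this mirrors the example in the official docs):

select toDecimal32(2, 4) as x, x / 3;
-- x = 2.0000, x / 3 = 0.6666: the lost digits are discarded, not rounded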
5. Character type
- String: the string can be of any length. It can contain any byte set, including empty bytes.
- FixedString(N): a string of fixed length N, where N must be a strictly positive natural number. When the server reads a string shorter than N, it is padded to N bytes by appending null bytes (demonstrated below); when it reads a string longer than N, an error is returned.
- Compared with String, FixedString is rarely used because it is not very convenient to use.
- Applicable scenario: names, text descriptions, character codes. Fixed length suits content of stable size, such as codes or gender flags, but given the risk of change the benefit is rarely clear enough, so fixed-length strings are of limited use.
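- A quick look at the padding behavior (a sketch):

select toFixedString('ab', 5) as s, length(s);
-- length(s) = 5: the value is padded to N bytes with null bytes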
6. Enumeration type
- Enum8 and Enum16 store a mapping of 'string' = integer.
- Usage demonstration:
-- Create a table
create table t_enum (
    x Enum8('hello' = 1, 'world' = 2)
) ENGINE = TinyLog;

-- Insert data
insert into t_enum values ('hello'), ('world'), ('hello');

-- Query
select * from t_enum;

-- Trying to insert any other value raises an error
insert into t_enum values ('hehe');

-- View the underlying integer value of each row
select cast(x, 'Int8') from t_enum;
- Applicable scenario: a space optimization and data constraint for status/type fields. In practice, however, changes in the data content often raise maintenance costs, or even cause data loss, so use it with caution.
7. Time type
type | description | example
---|---|---
Date | 'year-month-day' | 2022-04-12
Datetime | 'year-month-day hour:minute:second' | 2022-04-12 12:12:12
Datetime64 | 'year-month-day hour:minute:second.millisecond' | 2022-04-12 12:12:12.123
- Date type, stored in two bytes, represents the date value from 1970-01-01 (unsigned) to the current date.
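- A few quick conversions showing the three types (a sketch):

select toDate('2022-04-12') as d,
       toDateTime('2022-04-12 12:12:12') as dt,
       toDateTime64('2022-04-12 12:12:12.123', 3) as dt64;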
8. Array
- Array(T): an array of elements of type T.
- T can be any type, including array type. However, multi-dimensional arrays are not recommended. ClickHouse has limited support for multi-dimensional arrays. For example, you cannot store multidimensional arrays in the MergeTree table.
- Usage demonstration:
-- Method 1: create an array with the array function
select array(1, 2) as x, toTypeName(x);

-- Method 2: use square brackets
select [1, 2] as x, toTypeName(x);
9. Nullable
- Most basic types can be made to accept null values by wrapping them in a Nullable() declaration. For example, a Nullable(Int8) column stores Int8 values, and rows without a value store NULL.
- Nullable fields cannot be part of a table index, and using Nullable almost always hurts performance, so avoid it when designing schemas. For example, replace NULL with an empty string for string data, or with a number that carries no business meaning, such as -1, for integer data. A short demonstration follows.
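- A minimal demonstration (it mirrors the example in the official docs):

create table t_null (x Int8, y Nullable(Int8)) engine = TinyLog;
insert into t_null values (1, NULL), (2, 3);
-- The first row yields NULL: any arithmetic involving NULL is NULL
select x + y from t_null;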
10. Other types
- ClickHouse also provides many special data types, such as Geo, Map, Tuple, and UUID.
Database engines
- To use tables, you must first create a database. ClickHouse provides several database engines for declaring databases in different scenarios.
1. Atomic database engine
- Atomic is ClickHouse's default database engine; databases created without an explicit engine use it. You can also declare it when creating a database:
CREATE DATABASE test [ENGINE = Atomic]
- An Atomic database is managed entirely by ClickHouse itself. Each database corresponds to a subdirectory under /var/lib/clickhouse/data/, each table in the database is assigned a unique UUID, and the data is stored under /var/lib/clickhouse/store/.
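- You can inspect the UUID and storage paths of each table through system.tables (a sketch, assuming a database named test):

select name, uuid, data_paths
from system.tables
where database = 'test';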
2. MySQL database engine
- As a data warehouse, ClickHouse also provides many database engines that integrate with other databases, the most common being MySQL.
- The MySQL engine maps the tables of a remote MySQL server into ClickHouse and lets you run INSERT and SELECT queries against them, making it easy to exchange data between ClickHouse and MySQL. The engine translates its queries into MySQL syntax and sends them to the MySQL server, so you can also run operations such as SHOW TABLES or SHOW CREATE TABLE.
- The MySQL engine can save many ETL processes. For example, the following statement creates a database mysqldb in ClickHouse:
CREATE DATABASE IF NOT EXISTS mysqldb ENGINE = MySQL('node01:3306', 'test', 'root', '123456');
- Operations on the mysqldb database are translated into MySQL syntax and sent to MySQL for execution. You can then insert, delete, and so on as if they were ClickHouse's own tables, but RENAME, CREATE TABLE, and ALTER are not supported.
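- For example (a sketch; t_user stands for a hypothetical table living in the mapped MySQL database):

use mysqldb;
show tables;
select * from t_user limit 10;
insert into t_user (id, name) values (1, 'tom');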
- With this database engine, ClickHouse itself stores no data; it simply forwards requests to MySQL. ClickHouse provides similar database engines for PostgreSQL and SQLite.
- Request forwarding alone not cool enough, performance not high enough? ClickHouse also provides materialized engines that store data themselves: MaterializedMySQL for MySQL and MaterializedPostgreSQL for PostgreSQL. Both make the ClickHouse server act as a replica of the corresponding database, synchronizing data from the primary in real time via its replication log (binlog for MySQL, WAL for PostgreSQL). However, both engines are still experimental: worth trying, but not recommended for production.
- Please refer to the official document for specific usage: https://clickhouse.com/docs/zh/engines/database-engines/materialized-mysql/
- In fact, in most scenarios the default engine plus a custom ETL process is enough for us; still, these engines show how much ClickHouse offers out of the box.
Table engine
- Official documents: https://clickhouse.com/docs/zh/engines/table-engines/
- The table engine is a major feature of ClickHouse. It can be said that the table engine determines how the data in a table is stored, including:
- Storage mode and location of data, where to write and where to read data;
- Which queries are supported and how to support them;
- Concurrent data access;
- Use of index (if any);
- Whether multi-threaded requests can be executed;
- Data replication parameters
- To use a table engine, declare the engine and its parameters explicitly when creating the table.
- Note: Engine names are case sensitive
1. TinyLog
- Data is saved on disk as column files, with no indexing and no concurrency control. It is meant for small tables holding small amounts of data; it has limited use in production but works well for everyday practice and tests.
CREATE TABLE t_tinylog (id Int32, name String) ENGINE = TinyLog;
2. Memory
- With the Memory engine, data is kept in memory in uncompressed raw form and disappears when the server restarts. Reads and writes do not block each other, and indexes are not supported. Simple queries achieve very high performance (over 10 GB/s).
- It is not used in many places: besides testing, it suits scenarios demanding very high performance over data volumes that are not too large (roughly up to 100 million rows).
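- Minimal usage (a sketch):

create table t_memory (id UInt32, name String) engine = Memory;
insert into t_memory values (1, 'clickhouse');
select * from t_memory; -- the data is gone after a server restart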
3. MergeTree
- The most powerful table engines in ClickHouse are MergeTree and the other engines in its family (*MergeTree). MergeTree supports indexing and partitioning; its status in ClickHouse is comparable to that of InnoDB in MySQL.
- Create table statement:
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
    INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,
    INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2
) ENGINE = MergeTree()
ORDER BY expr
[PARTITION BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[TTL expr [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'], ...]
[SETTINGS name=value, ...]
- MergeTree has many parameters, but three matter most: partition by, primary key, and order by.
-- Create an order table
create table t_order (
    id UInt32,
    sku_id String,
    total_amount Decimal(16, 2),
    create_time Datetime
) engine = MergeTree
partition by toYYYYMMDD(create_time)
primary key (id)
order by (id, sku_id);

-- Insert data
insert into t_order values
(101, 'sku_001', 1000.00, '2022-04-12 11:00:00'),
(102, 'sku_002', 1500.00, '2022-04-12 10:30:00'),
(102, 'sku_003', 2500.00, '2022-04-12 12:30:00'),
(102, 'sku_002', 1500.00, '2022-04-12 13:00:00'),
(102, 'sku_002', 10000.00, '2022-04-12 13:00:00'),
(102, 'sku_002', 800.00, '2022-04-13 12:00:00');
partition by (optional)
- Purpose: reduce the range of data scanned and speed up queries. In the example the table is partitioned by create_time; when the where condition specifies a date, only that date's partition is scanned instead of the whole table.
- Once partitions exist, queries that cross partitions are processed in parallel, one partition per unit. partition by is optional in ClickHouse; if omitted, the table effectively uses a single partition. A partition-pruning sketch follows.
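- A sketch of partition pruning against the t_order table above (EXPLAIN with indexes = 1 is available on recent versions):

explain indexes = 1
select count() from t_order
where toYYYYMMDD(create_time) = 20220412;
-- the plan shows that only parts of the 20220412 partition are selected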
- Partition directories: a MergeTree table consists of column files + index files + a table definition file. With partitioning enabled, these files are grouped into per-partition directories.
[root@centos128 t_order]# cd /var/lib/clickhouse/data/test/t_order/
[root@centos128 t_order]# tree .
.
├── 20220412_1_1_0
│   ├── checksums.txt                  # Checksum file: stores the size and hash of each file to verify correctness
│   ├── columns.txt                    # Table structure information
│   ├── count.txt                      # Row count of this partition, which is why counting table rows is extremely fast in ClickHouse
│   ├── data.bin                       # Data file
│   ├── data.mrk3                      # Mark file: the bridge between the primary.idx index file and the data.bin data file
│   ├── default_compression_codec.txt
│   ├── minmax_create_time.idx         # Min/max values of the partition key
│   ├── partition.dat
│   └── primary.idx                    # Primary key index file, used to speed up queries
├── 20220413_2_2_0
│   ├── checksums.txt
│   ├── columns.txt
│   ├── count.txt
│   ├── data.bin
│   ├── data.mrk3
│   ├── default_compression_codec.txt
│   ├── minmax_create_time.idx
│   ├── partition.dat
│   └── primary.idx
├── detached
└── format_version.txt
- Directory naming: PartitionId_MinBlockNum_MaxBlockNum_Level, i.e. partition ID_minimum block number_maximum block number_merge level. For example, 20220412_1_1_0 is partition 20220412 with min and max block number 1 and merge level 0.
- PartitionId: data partition ID. the generation rule is determined by the PARTITION BY partition key.
- Undefined partition key: a data partition named all is generated by default, and all data is stored in the all directory;
- Integer partition key: directly use the string form of the integer value as the partition ID;
- Date-type partition key (or a string convertible to a date): the formatted date, e.g. 20220412, is used as the partition ID;
- Other types of partition keys: String and Float take their Hash value as partition ID through 128 bit Hash algorithm.
- MinBlockNum: minimum block number, self-incrementing from 1 within the table; each newly created part directory gets the next number.
- MaxBlockNum: the maximum partition block number. The newly created partition MinBlockNum is equal to MaxBlockNum.
- Level: merge level, i.e. the number of times the partition has been merged; more merges mean a higher level value.
- Parallelism: after partitioning, ClickHouse processes query statistics that cross partitions in parallel, partition by partition.
- Data writing and partition merging: every batch write produces a new temporary part that is not attached to any existing partition. Some time after the write (roughly 10-15 minutes later), ClickHouse automatically merges the temporary part into the proper partition; instead of waiting, you can also trigger the merge manually:
OPTIMIZE TABLE ${table_name} FINAL;
- Watching partition merges in the log:
# pwd
/var/log/clickhouse-server
# less clickhouse-server.log | grep 'c145f3c8-0833-4fa3-9dc1-17adf3ba65b7' -C 10
Primary key (optional)
- The primary key in ClickHouse differs from other databases: it only provides a primary index over the data, not a unique constraint. This means rows with the same primary key can coexist.
- The primary key should be chosen based on the where conditions of your queries. Using the condition, a form of binary search over the primary index locates the matching index granules, avoiding a full table scan.
- Index granularity: the interval of rows between two adjacent entries of the sparse index. MergeTree defaults to 8192. The official advice is not to change it unless the column contains very many duplicates, e.g. only one distinct value per tens of thousands of rows in a partition.
- Sparse index: a small amount of index data locates a large range of rows, at the cost of only finding the first row of an index granule; a short scan inside the granule then finds the exact rows. See the sketch below.
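- For illustration (a sketch), the granularity can also be set explicitly when creating a table:

create table t_granule (
    id UInt32,
    name String
) engine = MergeTree
order by id
settings index_granularity = 8192; -- the default: one primary index entry per 8192 rows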
Order by (required)
- It specifies the field order in which the data in the partition is saved in order.
- Order by is the only required clause in MergeTree, even more important than the primary key: when no primary key is set, many operations (e.g. the deduplication and summing discussed later) are performed according to the order by fields.
- Note: the primary key must be a prefix of the order by fields.
- For example, with order by (id, sku_id), the primary key must be id or (id, sku_id).
Secondary index
- The secondary (data-skipping) index feature was marked experimental before v20.1.2.4; since that version it is enabled by default.
- On older versions, this setting must be enabled before using secondary indexes:
set allow_experimental_data_skipping_indices=1;
- Create a secondary index; GRANULARITY N sets the granularity of the secondary index in units of primary index granules:
ALTER TABLE t_order ADD INDEX idx_total_amount total_amount TYPE minmax GRANULARITY 5;
- Query test: the secondary index takes effect for queries on non-primary-key fields:
clickhouse-client --send_logs_level=trace <<< 'select * from test.t_order where total_amount > toDecimal32(1200., 2)' --password
Data TTL
- TTL, Time To Live: MergeTree can manage the life cycle of data at the table or column level.
- Column level TTL:
-- Create a test table
create table t_order2 (
    id UInt32,
    sku_id String,
    total_amount Decimal(16, 2) TTL create_time + interval 10 SECOND,
    create_time Datetime
) engine = MergeTree
partition by toYYYYMMDD(create_time)
primary key (id)
order by (id, sku_id);

-- Insert data
insert into t_order2 values
(101, 'sku_001', 1000.00, '2022-04-15 09:00:00'),
(102, 'sku_002', 1500.00, '2022-04-15 10:30:00'),
(102, 'sku_003', 2500.00, '2022-04-15 12:30:00'),
(102, 'sku_002', 1500.00, '2022-04-15 13:00:00'),
(102, 'sku_002', 10000.00, '2022-04-15 13:00:00'),
(102, 'sku_002', 800.00, '2022-04-15 12:00:00');
- Table level TTL:
- You can set one expression to remove expired rows, and several expressions to automatically move data parts between disks or volumes:
TTL expr
    [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'][, DELETE|TO DISK 'aaa'|TO VOLUME 'bbb'] ...
    [WHERE conditions]
    [GROUP BY key_expr [SET v1 = aggr_func(v1) [, v2 = aggr_func(v2) ...]]]

-- The TTL rule type follows each TTL expression and determines what happens once the expression is satisfied (the specified time is reached):
-- DELETE - delete expired rows (the default);
-- TO DISK 'aaa' - move the data part to disk aaa;
-- TO VOLUME 'bbb' - move the data part to volume bbb;
-- GROUP BY - aggregate expired rows.

-- For example: with the following statement, whole rows expire 10 seconds after create_time
ALTER TABLE t_order2 MODIFY TTL create_time + INTERVAL 10 SECOND;
- The column referenced by a TTL expression cannot be a primary key field and must be of Date or Datetime type; using the Date field that drives partitioning is recommended. The available time units are:
- SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER, YEAR.
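- To see the column-level TTL from the t_order2 example in action (a sketch): wait ten-odd seconds after the insert, force a merge, and query again; expired total_amount values are reset to the column default (0):

optimize table t_order2 final;
select * from t_order2;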
4. ReplacingMergeTree
- A variant of MergeTree: its storage fully inherits MergeTree's features, with deduplication added on top. MergeTree lets you set a primary key, but that key has no unique constraint; if you want to get rid of duplicate data, ReplacingMergeTree can do it.
Deduplication mechanism
- Deduplication only happens during merges. Merging runs in the background at an unknown time, so you cannot plan around it, and some data may remain unprocessed.
Deduplication scope
- If the table is partitioned, deduplication is performed only within a partition, never across partitions.
- ReplacingMergeTree's capability is therefore limited: it suits cleaning up duplicate data in the background to save space, but it does not guarantee the absence of duplicates.
Case demonstration
- Create table
create table t_order_rmt (
    id UInt32,
    sku_id String,
    total_amount Decimal(16, 2),
    create_time Datetime
) engine = ReplacingMergeTree(create_time)
partition by toYYYYMMDD(create_time)
primary key (id)
order by (id, sku_id);
- The parameter of ReplacingMergeTree() is the version column: among duplicate rows, the one with the largest version value is kept. If no version column is given, the last row in insertion order is kept by default.
- Insert data:
insert into t_order_rmt values
(101, 'sku_001', 1000.00, '2022-04-15 11:00:00'),
(102, 'sku_002', 2000.00, '2022-04-15 12:00:00'),
(102, 'sku_003', 2500.00, '2022-04-15 11:20:00'),
(102, 'sku_002', 2000.00, '2022-04-15 13:00:00'),
(102, 'sku_002', 12000.00, '2022-04-16 12:00:00'),
(102, 'sku_002', 800.00, '2022-04-15 13:30:00');
- Insert the data again and execute the query:
select * from t_order_rmt;

-- Manual merge
OPTIMIZE TABLE t_order_rmt FINAL;

-- Query again
select * from t_order_rmt;
Conclusion
- The order by fields are in effect used as the unique key;
- Deduplication cannot cross partitions;
- Deduplication happens only when the same batch is inserted (in newer versions) or when parts are merged;
- Among duplicates, the row with the largest version value is kept;
- If the version values are equal, the last row in insertion order is kept.
5. SummingMergeTree
- For scenarios that do not query details and only care about aggregates grouped by dimensions: with plain MergeTree, both the storage overhead and the cost of aggregating temporarily at query time would be considerable.
- ClickHouse provides the SummingMergeTree engine, which "pre-aggregates", for this scenario.
Case demonstration
- Create table:
create table t_order_smt (
    id UInt32,
    sku_id String,
    total_amount Decimal(16, 2),
    create_time Datetime
) engine = SummingMergeTree(total_amount)
partition by toYYYYMMDD(create_time)
primary key (id)
order by (id, sku_id);
- Insert data:
insert into t_order_smt values
(101, 'sku_001', 1000.00, '2022-04-15 11:00:00'),
(102, 'sku_002', 2000.00, '2022-04-15 12:00:00'),
(102, 'sku_003', 2500.00, '2022-04-15 11:20:00'),
(102, 'sku_002', 2000.00, '2022-04-15 13:00:00'),
(102, 'sku_002', 12000.00, '2022-04-16 12:00:00'),
(102, 'sku_002', 800.00, '2022-04-15 13:30:00');
- Insert the data again and execute the query:
select * from t_order_smt;

-- Manual merge
OPTIMIZE TABLE t_order_smt FINAL;

-- Query again
select * from t_order_smt;
Conclusion
- The column passed to SummingMergeTree() becomes the summing column;
- Multiple columns may be listed, and all must be numeric; if none is given, every numeric column that is not a dimension column is used as a summing column;
- The order by columns serve as the dimension columns;
- For all other columns, the first row in insertion order is kept;
- Rows in different partitions are not aggregated together;
- Aggregation happens only when the same batch is inserted (in newer versions) or when parts are merged.
Development suggestions and issues
- When designing an aggregation table, unique keys and sequence numbers can be dropped; every field should be a dimension, a measure, or a timestamp.
Can you directly execute the following SQL to get the summary value?
select total_amount from t_order_smt where sku_id = 'xxx' and create_time = 'yyy';
- No: the result may include detail rows that have not been aggregated yet.
- To get the summary value you still need to aggregate with sum(). SummingMergeTree improves efficiency to a degree, but since ClickHouse already stores data by column, the improvement is limited and not especially dramatic.
select sum(total_amount) from t_order_smt where sku_id = 'xxx' and create_time = 'yyy';