Apache Doris is a modern MPP analytical database. It returns query results with sub-second response times, effectively supporting real-time data analysis. Apache Doris's distributed architecture is simple, easy to operate and maintain, and can support datasets larger than 10 PB.
Apache Doris can meet a variety of data analysis needs, such as fixed historical reports, real-time data analysis, interactive data analysis, and exploratory data analysis, making your data analysis simpler and more efficient!
-
Doris FE start and stop:
./start_fe.sh --daemon
./stop_fe.sh
-
View FE status:
SHOW PROC '/frontends';
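As a quick sanity check (a minimal sketch, assuming the cluster is up), SHOW FRONTENDS returns the same node list in tabular form:
# One row per FE node; Alive should be true for every node,
# and exactly one node should have IsMaster = true.
SHOW FRONTENDS;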
-
Doris BE start and stop:
./start_be.sh --daemon
./stop_be.sh
-
View BE status:
SHOW PROC '/backends';
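Likewise, a minimal sketch for BE nodes using SHOW BACKENDS:
# One row per BE node; the Alive column should be true,
# and TabletNum gives a quick view of data distribution across nodes.
SHOW BACKENDS;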
-
Add and delete FE nodes:
-
Precautions for FE expansion:
- The number of Follower FEs (including the Leader) must be odd. It is recommended to deploy at most 3 of them to form a high availability (HA) mode.
- When FE is deployed in high availability mode (one Leader and two Followers), we recommend adding Observer FEs to expand the FE's read service capacity. Of course, more Follower FEs can be added, but this is almost never necessary.
- Generally, one FE node can handle 10-20 BE nodes. It is recommended that the total number of FE nodes be less than 10; three are usually sufficient for most needs.
- The --helper option cannot point to the FE itself; it must point to one or more existing Master/Follower FEs that are running normally.
# Add a Follower
ALTER SYSTEM ADD FOLLOWER "follower_host:edit_log_port";
# Add an Observer
ALTER SYSTEM ADD OBSERVER "observer_host:edit_log_port";
# follower_host and observer_host are the IP addresses of the nodes where the Follower or Observer is located, and edit_log_port comes from its configuration file fe.conf.
# Configure and start the Follower or Observer. The configuration of a Follower or Observer is the same as that of the Leader. When starting it for the first time, execute:
./bin/start_fe.sh --helper leader_fe_host:edit_log_port --daemon
# Delete an FE node
ALTER SYSTEM DROP FOLLOWER[OBSERVER] "fe_host:edit_log_port";
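As a concrete end-to-end sketch, assuming a new node at the hypothetical address 10.0.0.12 using the default edit_log_port 9010:
# Register the new Follower with the cluster (10.0.0.12 is a hypothetical host):
ALTER SYSTEM ADD FOLLOWER "10.0.0.12:9010";
# Then start the new FE on 10.0.0.12 for the first time, pointing --helper at the existing Leader:
# ./bin/start_fe.sh --helper leader_fe_host:9010 --daemon
# Verify that the node has joined and is alive:
SHOW PROC '/frontends';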
Precautions for FE shrinkage:
- When deleting Follower FEs, ensure that the number of remaining Follower (including Leader) nodes is odd.
-
Add and delete BE nodes:
Precautions for BE expansion:
- After BE expansion, Doris automatically rebalances the data according to the load, without affecting normal use during the process.
# Add a BE
ALTER SYSTEM ADD BACKEND "be_host:be_heartbeat_service_port";
# Delete a BE node
ALTER SYSTEM DECOMMISSION BACKEND "be_host:be_heartbeat_service_port";
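Note that DECOMMISSION is asynchronous: Doris first migrates the node's tablets to other BEs and only removes the node afterwards. A sketch for monitoring and, if necessary, aborting it:
# Watch the TabletNum column of the decommissioning BE drop toward 0:
SHOW PROC '/backends';
# The decommission can be aborted before it completes:
CANCEL DECOMMISSION BACKEND "be_host:be_heartbeat_service_port";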
-
Create user:
# Sign in
mysql -h FE_HOST -P9030 -uroot
# Change the root password:
SET PASSWORD FOR 'root' = PASSWORD('your_password');
# Create a new user
CREATE USER 'test' IDENTIFIED BY 'test_passwd';
# Account authorization
# After example_db has been created, read and write permissions on example_db can be granted to an ordinary account, such as test, through the root/admin account. After authorization, the test account can operate on the example_db database once logged in.
GRANT ALL ON example_db TO test;
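The result of the grant can be verified afterwards; a minimal sketch:
# List the privileges currently held by the test account:
SHOW GRANTS FOR 'test';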
-
Create database
CREATE DATABASE example_db;
# View the databases
SHOW DATABASES;
All commands can use 'HELP command;' to see detailed syntax help. For example: HELP CREATE DATABASE;
If you do not know the full name of a command, you can use 'HELP' plus a keyword for fuzzy matching. For example, typing 'HELP CREATE' matches commands such as CREATE DATABASE, CREATE TABLE, and CREATE USER.
-
Create database table
HELP CREATE TABLE;
# Switch database
USE example_db;
# Single-partition table creation
CREATE TABLE table1
(
    siteid INT DEFAULT '10',
    citycode SMALLINT,
    username VARCHAR(32) DEFAULT '',
    pv BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(siteid, citycode, username)
DISTRIBUTED BY HASH(siteid) BUCKETS 10
PROPERTIES("replication_num" = "1");
# Multi-partition table creation
CREATE TABLE table2
(
    event_day DATE,
    siteid INT DEFAULT '10',
    citycode SMALLINT,
    username VARCHAR(32) DEFAULT '',
    pv BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(event_day, siteid, citycode, username)
PARTITION BY RANGE(event_day)
(
    PARTITION p201706 VALUES LESS THAN ('2017-07-01'),
    PARTITION p201707 VALUES LESS THAN ('2017-08-01'),
    PARTITION p201708 VALUES LESS THAN ('2017-09-01')
)
DISTRIBUTED BY HASH(siteid) BUCKETS 10
PROPERTIES("replication_num" = "1");
# Dynamic partition table creation
# Dynamic partitioning must be enabled: dynamic_partition.enable = true
# Dynamic partitioning only supports Range partitions.
# dynamic_partition.time_unit: the dynamic partition scheduling unit. Can be specified as HOUR, DAY, WEEK, or MONTH, meaning partitions are created or deleted by hour, day, week, or month respectively.
#   When specified as HOUR, the suffix of the dynamically created partition name has the format yyyyMMddHH, for example 2020032501. The partition column cannot be of type DATE when partitioning by hour.
#   When specified as DAY, the suffix of the dynamically created partition name has the format yyyyMMdd, for example 20200325.
#   When specified as WEEK, the suffix of the dynamically created partition name has the format yyyy_ww, i.e. the week of the year that the date belongs to. For example, the partition created for 2020-03-25 has the suffix 2020_13, indicating the 13th week of 2020.
#   When specified as MONTH, the suffix of the dynamically created partition name has the format yyyyMM, for example 202003.
# dynamic_partition.time_zone: the time zone of the dynamic partition. If not filled in, it defaults to the system time zone of the current machine, such as Asia/Shanghai. For the currently supported time zone settings, refer to https://en.wikipedia.org/wiki/List_of_tz_database_time_zones .
# dynamic_partition.start: the starting offset of the dynamic partition, a negative number. Depending on the time_unit attribute, partitions whose range lies before this offset, based on the current day (week/month), will be deleted. If not filled in, it defaults to -2147483648, i.e. historical partitions are never deleted.
# dynamic_partition.end: the end offset of the dynamic partition, a positive number. Depending on the time_unit attribute, partitions of the corresponding range are created in advance, based on the current day (week/month).
# dynamic_partition.prefix: the prefix of dynamically created partition names.
# dynamic_partition.buckets: the number of buckets for dynamically created partitions.
# dynamic_partition.replication_num: the number of replicas for dynamically created partitions. If not filled in, it defaults to the number of replicas specified when the table was created.
# dynamic_partition.start_day_of_week: when time_unit is WEEK, this parameter specifies the starting day of each week. Values are 1 to 7, where 1 means Monday and 7 means Sunday. The default is 1, meaning each week starts on Monday.
# dynamic_partition.start_day_of_month: when time_unit is MONTH, this parameter specifies the start date of each month. Values are 1 to 28, where 1 means the 1st of each month and 28 means the 28th. The default is 1, meaning each month starts on the 1st. Starting on the 29th, 30th, or 31st is not supported, to avoid ambiguity caused by leap years or leap months.
# dynamic_partition.create_history_partition: the default is false. When set to true, Doris automatically creates all historical partitions; see the specific creation rules below. At the same time, the FE parameter max_dynamic_partition_num limits the total number of partitions, to avoid creating too many partitions at once. When the number of partitions to be created exceeds the value of max_dynamic_partition_num, the operation is prohibited. This parameter does not take effect when the start attribute is not specified.
# dynamic_partition.history_partition_num: when create_history_partition is true, this parameter specifies the number of historical partitions to create. The default value is -1, i.e. not set.
# dynamic_partition.hot_partition_num: specifies how many of the latest partitions are hot partitions, i.e. all partitions of the past hot_partition_num days plus all future partitions. For hot partitions, the system automatically sets their storage_medium parameter to SSD and sets storage_cooldown_time.
# dynamic_partition.reserved_history_periods: the time ranges of historical partitions to retain. When dynamic_partition.time_unit is DAY, WEEK, or MONTH, use the format [yyyy-MM-dd,yyyy-MM-dd],[...,...]. When it is HOUR, use the format [yyyy-MM-dd HH:mm:ss,yyyy-MM-dd HH:mm:ss],[...,...]. If not set, it defaults to "NULL".
# The partition column k1 of table tbl1 is of type DATE. Create a dynamic partition rule: partition by day, retain only the partitions of the last 7 days, and create the partitions of the next 3 days in advance.
CREATE TABLE tbl1
(
    k1 DATE,
    ...
)
PARTITION BY RANGE(k1) ()
DISTRIBUTED BY HASH(k1)
PROPERTIES
(
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-7",
    "dynamic_partition.end" = "3",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.buckets" = "32"
);
# The partition column k1 of table tbl1 is of type DATETIME. Create a dynamic partition rule: partition by week, retain only the partitions of the last 2 weeks, and create the partitions of the next 2 weeks in advance.
CREATE TABLE tbl1
(
    k1 DATETIME,
    ...
)
PARTITION BY RANGE(k1) ()
DISTRIBUTED BY HASH(k1)
PROPERTIES
(
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "WEEK",
    "dynamic_partition.start" = "-2",
    "dynamic_partition.end" = "2",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.buckets" = "8"
);
# The partition column k1 of table tbl1 is of type DATE. Create a dynamic partition rule: partition by month, never delete historical partitions, and create the partitions of the next 2 months in advance. At the same time, set the 3rd day of each month as the starting date.
CREATE TABLE tbl1
(
    k1 DATE,
    ...
)
PARTITION BY RANGE(k1) ()
DISTRIBUTED BY HASH(k1)
PROPERTIES
(
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "MONTH",
    "dynamic_partition.end" = "2",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.buckets" = "8",
    "dynamic_partition.start_day_of_month" = "3"
);
# Modify dynamic partition properties
ALTER TABLE tbl1 SET
(
    "dynamic_partition.prop1" = "value1",
    ...
);
# View the scheduling of dynamic partition tables
# You can view the scheduling of all dynamic partition tables in the current database with the following command:
SHOW DYNAMIC PARTITION TABLES;
# LastUpdateTime: the time when the dynamic partition properties were last modified
# LastSchedulerTime: the last time dynamic partition scheduling was executed
# State: the state of the last execution of dynamic partition scheduling
# LastCreatePartitionMsg: error message of the last dynamic add-partition scheduling
# LastDropPartitionMsg: error message of the last dynamic drop-partition scheduling
+-----------+--------+----------+-------------+------+--------+---------+-----------+----------------+---------------------+--------+------------------------+----------------------+-------------------------+
| TableName | Enable | TimeUnit | Start       | End  | Prefix | Buckets | StartOf   | LastUpdateTime | LastSchedulerTime   | State  | LastCreatePartitionMsg | LastDropPartitionMsg | ReservedHistoryPeriods  |
+-----------+--------+----------+-------------+------+--------+---------+-----------+----------------+---------------------+--------+------------------------+----------------------+-------------------------+
| d3        | true   | WEEK     | -3          | 3    | p      | 1       | MONDAY    | N/A            | 2020-05-25 14:29:24 | NORMAL | N/A                    | N/A                  | [2021-12-01,2021-12-31] |
| d5        | true   | DAY      | -7          | 3    | p      | 32      | N/A       | N/A            | 2020-05-25 14:29:24 | NORMAL | N/A                    | N/A                  | NULL                    |
| d4        | true   | WEEK     | -3          | 3    | p      | 1       | WEDNESDAY | N/A            | 2020-05-25 14:29:24 | NORMAL | N/A                    | N/A                  | NULL                    |
| d6        | true   | MONTH    | -2147483648 | 2    | p      | 8       | 3rd       | N/A            | 2020-05-25 14:29:24 | NORMAL | N/A                    | N/A                  | NULL                    |
| d2        | true   | DAY      | -3          | 3    | p      | 32      | N/A       | N/A            | 2020-05-25 14:29:24 | NORMAL | N/A                    | N/A                  | NULL                    |
| d7        | true   | MONTH    | -2147483648 | 5    | p      | 8       | 24th      | N/A            | 2020-05-25 14:29:24 | NORMAL | N/A                    | N/A                  | NULL                    |
+-----------+--------+----------+-------------+------+--------+---------+-----------+----------------+---------------------+--------+------------------------+----------------------+-------------------------+
7 rows in set (0.02 sec)
# The dynamic partition thread runs every 600 seconds (10 minutes) by default, i.e. scheduling is performed every 10 minutes. You can modify dynamic_partition_check_interval_seconds in fe.conf and restart FE for it to take effect, or modify it at runtime with the following command:
ADMIN SET FRONTEND CONFIG ("dynamic_partition_check_interval_seconds" = "7200");
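One practical consequence of dynamic partitioning: while it is enabled on a table, manual partition operations such as ADD PARTITION are rejected. A sketch, assuming table tbl1 from above:
# Temporarily disable dynamic partitioning to allow manual partition operations:
ALTER TABLE tbl1 SET ("dynamic_partition.enable" = "false");
# ... perform manual partition operations here ...
# Re-enable dynamic scheduling afterwards:
ALTER TABLE tbl1 SET ("dynamic_partition.enable" = "true");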
-
Table structure change
# Add a new column uv with type BIGINT, aggregation type SUM, and default value 0:
ALTER TABLE table1 ADD COLUMN uv BIGINT SUM DEFAULT '0' AFTER pv;
# Check the job progress. When the job state is FINISHED, the job is complete and the new schema has taken effect.
SHOW ALTER TABLE COLUMN;
# View the latest schema
DESC table1;
# You can cancel the currently executing job with the following command:
CANCEL ALTER TABLE COLUMN FROM table1;
# More:
HELP ALTER TABLE;
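Other changes follow the same ALTER TABLE pattern; a hedged sketch against the table1 schema above:
# Drop the uv column added above:
ALTER TABLE table1 DROP COLUMN uv;
# Rename the table (table1_renamed is a hypothetical name):
ALTER TABLE table1 RENAME table1_renamed;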
-
Add Rollup
A Rollup can be understood as a materialized index structure of a table: materialized, because its data is stored physically and independently; index, because a Rollup can reorder columns to increase the hit rate of the prefix index, and reduce the key columns to increase data aggregation.
# In table1, the detailed data columns are siteid, citycode, and username, which together form a set of keys over which the pv field is aggregated. If the business side often needs to look at total pv per city, a Rollup containing only citycode and pv can be created.
ALTER TABLE table1 ADD ROLLUP rollup_city(citycode, pv);
# View Rollup job progress: when the job state is FINISHED, the job is complete.
SHOW ALTER TABLE ROLLUP;
# Cancel the currently executing job:
CANCEL ALTER TABLE ROLLUP FROM table1;
# View the Rollup information of the table
DESC table1 ALL;
# Check whether a SQL query hits the Rollup
EXPLAIN your_sql;
# More:
HELP ALTER TABLE;
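As a concrete hit check (a sketch; the exact plan text varies by version), a query that aggregates pv by citycode should be served by rollup_city:
# The rollup chosen by the planner appears in the OlapScanNode section of the plan:
EXPLAIN SELECT citycode, SUM(pv) FROM table1 GROUP BY citycode;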
-
Querying data tables
Memory limit:
# To prevent a single user's query from consuming too much memory, queries are memory-controlled. By default, one query task may use no more than 2GB of memory on a single BE node.
# If a user encounters a "Memory limit exceeded" error, the memory limit has generally been exceeded.
# Display the query memory limit
SHOW VARIABLES LIKE "%mem_limit%";
+---------------+------------+
| Variable_name | Value      |
+---------------+------------+
| exec_mem_limit| 2147483648 |
+---------------+------------+
1 row in set (0.00 sec)
# The unit of exec_mem_limit is bytes. Its value can be changed with the SET command, for example to 8GB:
SET exec_mem_limit = 8589934592;
SHOW VARIABLES LIKE "%mem_limit%";
+---------------+------------+
| Variable_name | Value      |
+---------------+------------+
| exec_mem_limit| 8589934592 |
+---------------+------------+
1 row in set (0.00 sec)
# The change above is session-level and only valid in the currently connected session; after disconnecting and reconnecting, the value reverts to the default.
# To modify the global variable, set it like this: SET GLOBAL exec_mem_limit = 8589934592;. After setting it, disconnect the session and log in again; the parameter then takes effect permanently.
Query timeout:
# Display the current query timeout (in seconds)
SHOW VARIABLES LIKE "%query_timeout%";
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| QUERY_TIMEOUT | 300   |
+---------------+-------+
1 row in set (0.00 sec)
# Modify the timeout to 1 minute:
SET query_timeout = 60;
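As with exec_mem_limit, the change above is session-level; a sketch of both scopes:
# Session-level: affects only the current connection
SET query_timeout = 60;
# Global: persists and applies to new sessions
SET GLOBAL query_timeout = 60;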
Broadcast/Shuffle Join:
The system provides two join implementations: broadcast join and shuffle join (partitioned join).
In a broadcast join, the small table, after conditional filtering, is broadcast to each node holding the large table's data to build an in-memory hash table; the large table's data is then streamed through for the hash join.
In a shuffle join, both the small and the large table are hashed by the join key and then joined in a distributed fashion.
When the small table's data volume is small, broadcast join performs better; otherwise, shuffle join performs better.
The system automatically attempts a broadcast join, and you can also explicitly specify the implementation of each join operator. The system provides the configurable session variable auto_broadcast_join_threshold, which specifies the upper limit on the proportion of overall execution memory that the broadcast join's hash table may use. Its value ranges from 0 to 1, with a default of 0.8. When the memory the system would need for the hash table exceeds this limit, the join automatically falls back to a shuffle join.
When auto_broadcast_join_threshold is set to a value less than or equal to 0, all joins use shuffle join.
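The threshold is an ordinary session variable and can be tuned directly; a minimal sketch:
# Inspect the current threshold (default 0.8):
SHOW VARIABLES LIKE "auto_broadcast_join_threshold";
# Lower it so broadcast join is chosen less aggressively:
SET auto_broadcast_join_threshold = 0.5;
# A value <= 0 forces shuffle join for all joins:
SET auto_broadcast_join_threshold = 0;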
Automatically select join mode (default):
select sum(table1.pv) from table1 join table2 where table1.siteid = 2;
+--------------------+
| sum(`table1`.`pv`) |
+--------------------+
|                 10 |
+--------------------+
1 row in set (0.20 sec)

Use Broadcast Join (specify explicitly):

select sum(table1.pv) from table1 join [broadcast] table2 where table1.siteid = 2;
+--------------------+
| sum(`table1`.`pv`) |
+--------------------+
|                 10 |
+--------------------+
1 row in set (0.20 sec)

Use Shuffle Join:

select sum(table1.pv) from table1 join [shuffle] table2 where table1.siteid = 2;
+--------------------+
| sum(`table1`.`pv`) |
+--------------------+
|                 10 |
+--------------------+
1 row in set (0.15 sec)