Installing and using HBase on macOS

Install HBase

brew install hbase
# Installation directory: /usr/local/Cellar/hbase/2.**/
# Configure the Java path
vi /usr/local/Cellar/hbase/x.x.x/libexec/conf/

export JAVA_HOME="xxxx"

Configure hbase-site.xml

vi /usr/local/Cellar/hbase/x.x.x/libexec/conf/hbase-site.xml

    // hbase.rootdir: the directory where HBase stores its files
    // hbase.zookeeper.property.dataDir: the directory where the built-in ZooKeeper stores its files



/usr/local/Cellar/hbase/x.x.x/bin/hbase shell

Basic HBase shell operations




create_namespace 'Namespace'
drop_namespace 'Namespace'
# A namespace must be empty before it can be dropped
# create 'namespace:table name', 'column family', 'column family'
# A table must have at least one column family
# drop 'table name'
# The table must be disabled first (disable 'table name') before it can be dropped
# alter 'table name', {NAME => 'column family', VERSIONS => x}
# describe 'table name'


# put 'namespace:table name', 'rowkey', 'column family:column', 'value'
# Cells written with the same rowkey belong to the same row, so a put to an existing rowkey and column is treated as a modification
# delete 'namespace:table name', 'rowkey', 'column family:column'
# scan 'namespace:table name' [, {STARTROW => 'rowkey', STOPROW => 'rowkey'}]
# The range is closed on the left and open on the right: the start row is included and the stop row is excluded. Rowkeys are compared in lexicographic (dictionary) order
# Without a range, scan reads the full table

# get 'namespace:table name', 'rowkey' [, 'column family:column']
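
The half-open, lexicographically ordered scan range described above can be illustrated without a running cluster. The following is only a minimal Python sketch that imitates the ordering on an in-memory list; the helper name scan and the sample rowkeys are made up for illustration and are not part of HBase.

```python
from bisect import bisect_left

def scan(sorted_rowkeys, start_row=None, stop_row=None):
    """Return the rowkeys in [start_row, stop_row), compared byte by byte."""
    lo = 0 if start_row is None else bisect_left(sorted_rowkeys, start_row)
    hi = len(sorted_rowkeys) if stop_row is None else bisect_left(sorted_rowkeys, stop_row)
    return sorted_rowkeys[lo:hi]

# Rowkeys are kept in lexicographic order, so b"row10" sorts before b"row2".
rowkeys = sorted([b"row1", b"row2", b"row3", b"row10"])
print(rowkeys)

# The start row is included, the stop row is excluded.
print(scan(rowkeys, b"row1", b"row2"))

# No range given: every row is returned, like a full table scan.
print(scan(rowkeys))
```

Note that b"row10" falls inside the range from 'row1' to 'row2', which is a common surprise when numeric suffixes are not zero-padded.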
Column family design

The guiding principle is to keep the number of column families small: at present, the official HBase recommendation is no more than 2 to 3 column families per table. When one column family flushes, its adjacent column families are also triggered to flush by association, which results in more I/O in the system.

The optimal design is to put all strongly correlated key-values under the same column family, which not only gives the highest query efficiency but also keeps access to different disk files to a minimum.

Taking user information as an example, you can store the necessary basic information in one column family, while additional information can be placed in another column family.

Three principles of Rowkey design

The rowkey is the unique identifier of a row of data. The region in which a row is stored depends on which pre-split region its rowkey falls into. The main purpose of rowkey design is to distribute data evenly across all regions and to prevent data skew as far as possible. The common rowkey design principles are described next.

rowkey length principle

A rowkey is a binary byte stream. Many developers suggest a rowkey length of 10 to 100 bytes. However, the shorter the better, and preferably no more than 16 bytes. It is stored as a byte[] array and is generally designed to be of fixed length.

The reasons are as follows:

1. HFile, the data persistence file, stores data as KeyValues, and every KeyValue carries the rowkey. If the rowkey is too long, such as 100 bytes, then for 10 million cells the rowkeys alone occupy 100 * 10 million = 1 billion bytes, nearly 1 GB, which greatly reduces the storage efficiency of HFile;

2. MemStore caches part of the data in memory. If the rowkey is too long, the effective utilization of memory drops and the system cannot cache as much data, which reduces retrieval efficiency. Therefore, the shorter the rowkey, the better.

3. Current operating systems are mostly 64-bit, with memory aligned on 8-byte boundaries. Keeping the rowkey at 16 bytes, an integer multiple of 8 bytes, takes best advantage of this characteristic of the operating system.
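
The arithmetic in point 1 and the fixed-length recommendation can be checked with a short Python sketch. The two-field layout (an 8-byte user ID followed by an 8-byte timestamp) is purely an assumption chosen to reach 16 bytes; the function names are hypothetical.

```python
import struct

def rowkey_overhead_bytes(rowkey_len, num_cells):
    """Bytes spent on rowkeys alone, since every stored cell repeats its rowkey."""
    return rowkey_len * num_cells

for length in (100, 16):
    total = rowkey_overhead_bytes(length, 10_000_000)
    print(f"{length}-byte rowkey, 10 million cells: {total:,} bytes ({total / 2**30:.2f} GiB)")

def fixed_rowkey(user_id, timestamp_ms):
    """Hypothetical fixed-length key: two big-endian unsigned 64-bit fields, 16 bytes."""
    return struct.pack(">QQ", user_id, timestamp_ms)

key = fixed_rowkey(42, 1650077410000)
print(len(key))
```

Big-endian packing is used so that, for non-negative integers, the byte order of the keys matches their numeric order.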

rowkey hashing principle

If the rowkey increases with the timestamp, do not put the time at the front of the key. It is recommended to use the high-order part of the rowkey as a hash field, generated cyclically by the program, and the low-order part as the time field. This raises the probability that data is distributed evenly across the RegionServers and that load is balanced. Without a hash field, the first field is the time information itself, which leads to a hotspot: all new data piles up on one RegionServer. The load is then also concentrated on individual RegionServers during data retrieval, which reduces query efficiency.
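
One way to picture a hash field in the high-order position is a small prefix that cycles through a fixed number of buckets. The sketch below is only an illustration under that assumption: NUM_BUCKETS, salted_rowkey and the one-byte prefix are invented for this example and are not an HBase API.

```python
import itertools
import struct
from collections import Counter

NUM_BUCKETS = 8  # assumed number of buckets, e.g. one per pre-split region
_salt_cycle = itertools.cycle(range(NUM_BUCKETS))

def salted_rowkey(timestamp_ms):
    """High-order byte: cyclic salt. Low-order 8 bytes: big-endian timestamp."""
    salt = next(_salt_cycle)
    return struct.pack(">BQ", salt, timestamp_ms)

# Sixteen consecutive timestamps no longer share a common key prefix.
keys = [salted_rowkey(1650077410000 + i) for i in range(16)]
per_bucket = Counter(k[0] for k in keys)
print(dict(per_bucket))
```

The trade-off of this kind of prefix is that reading a time range back requires one scan per bucket, since rows for adjacent timestamps are now spread across all prefixes.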

Rowkeys are stored in dictionary order, so the rowkey design should make full use of this sorting feature: store data that is often read together in one place, and put data that is likely to be accessed soon in one place.

For example, if the data most recently written to the HBase table is the most likely to be accessed, consider using the timestamp as part of the rowkey. Because sorting is lexicographic, you can use Long.MAX_VALUE - timestamp as the rowkey to ensure that newly written data is hit quickly when reading.
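
The effect of Long.MAX_VALUE - timestamp on sort order can be sketched in Python as follows. The constant LONG_MAX stands in for Java's Long.MAX_VALUE, and the function name is hypothetical.

```python
import struct

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE

def reverse_ts_rowkey(timestamp_ms):
    """Encode LONG_MAX - timestamp as 8 big-endian bytes so newer rows sort first."""
    return struct.pack(">Q", LONG_MAX - timestamp_ms)

older = reverse_ts_rowkey(1650077410000)
newer = reverse_ts_rowkey(1650077411000)

# In lexicographic byte order the newer row comes before the older one,
# so a scan starting from the beginning reaches the latest data first.
print(sorted([older, newer]) == [newer, older])
```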

rowkey uniqueness principle

The rowkey must be unique by design. Rowkeys are stored in dictionary order, so when designing rowkeys we should make full use of this sorting characteristic: store frequently read data in one place, and put data that is likely to be accessed soon in one place.
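
Because a put to an existing rowkey is treated as a modification, a key that is not unique silently overwrites earlier data. The sketch below imitates this with a plain Python dict standing in for a table; the device_id field is an assumed distinguishing attribute, not something prescribed by HBase.

```python
import struct

def rowkey(timestamp_ms, device_id=None):
    """Timestamp-only key, optionally extended with a hypothetical 4-byte device ID."""
    key = struct.pack(">Q", timestamp_ms)
    if device_id is not None:
        key += struct.pack(">I", device_id)
    return key

# Two records with the same timestamp collide: the second write replaces the first.
colliding = {}
colliding[rowkey(1650077410000)] = "reading from device 1"
colliding[rowkey(1650077410000)] = "reading from device 2"
print(len(colliding))

# Adding a distinguishing field keeps both records.
unique = {}
unique[rowkey(1650077410000, 1)] = "reading from device 1"
unique[rowkey(1650077410000, 2)] = "reading from device 2"
print(len(unique))
```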

Posted by ecko on Sat, 16 Apr 2022 12:20:10 +0930