- The original content of the GreatSQL community is not allowed to be used without authorization, please contact the editor and indicate the source for reprinting.
- GreatSQL is a domestic branch version of MySQL, which is consistent with MySQL in use.
- Author: Huajiashe
- Article source: GreatSQL community original

## Previous review

Annotation: cstack maintains a simple sqlite-like database implementation on github. Through this simple project, you can have a good understanding of how the database works. This article is the tenth article, mainly to realize the splitting of leaf nodes in B-tree

## Part 10 leaf node split

Our B-Tree has only one node, which doesn't look like a standard tree. In order to solve this problem, some code is needed to implement the split leaf node. After that, an internal node needs to be created to be the parent of the two new leaf nodes.

Basically, our goal for this series of articles starts here:

one-node btree

to this:

two-level btree

First of all, firstly, remove the processing node full error:

void leaf_node_insert(Cursor* cursor, uint32_t key, Row* value) { void* node = get_page(cursor->table->pager, cursor->page_num); uint32_t num_cells = *leaf_node_num_cells(node); if (num_cells >= LEAF_NODE_MAX_CELLS) { // Node full - printf("Need to implement splitting a leaf node.\n"); - exit(EXIT_FAILURE); + leaf_node_split_and_insert(cursor, key, value); + return; } ExecuteResult execute_insert(Statement* statement, Table* table) { void* node = get_page(table->pager, table->root_page_num); uint32_t num_cells = (*leaf_node_num_cells(node)); - if (num_cells >= LEAF_NODE_MAX_CELLS) { - return EXECUTE_TABLE_FULL; - } Row* row_to_insert = &(statement->row_to_insert); uint32_t key_to_insert = row_to_insert->id;

## Splitting Algorithm

The easy part is over. Here is a description of what we need to do (from: SQLite Database System: Design and Implementation)

Original: If there is no space on the leaf node, we would split the existing entries residing there and the new one (being inserted) into two equal halves: lower and upper halves. (Keys on the upper half are strictly greater than those on the lower half.) We allocate a new leaf node, and move the upper half into the new node.

Translation: If there is no space left in the leaf node, we need to split the existing entry residing in the node and the new entry (which is being inserted) into two equal halves: the lower half and the upper half (in the upper half keys are strictly greater than those in the lower half). We allocate a new node and move the upper half entries into the new node.

Now to process the old node and create a new one:

+void leaf_node_split_and_insert(Cursor* cursor, uint32_t key, Row* value) { + /* + Create a new node and move half the cells over. + Insert the new value in one of the two nodes. + Update parent or create a new parent. + */ + + void* old_node = get_page(cursor->table->pager, cursor->page_num); + uint32_t new_page_num = get_unused_page_num(cursor->table->pager); + void* new_node = get_page(cursor->table->pager, new_page_num); + initialize_leaf_node(new_node);

Next, copy each cell in the node to its new location:

+ /* + All existing keys plus new key should be divided + evenly between old (left) and new (right) nodes. + Starting from the right, move each key to correct position. + */ + for (int32_t i = LEAF_NODE_MAX_CELLS; i >= 0; i--) { + void* destination_node; + if (i >= LEAF_NODE_LEFT_SPLIT_COUNT) { + destination_node = new_node; + } else { + destination_node = old_node; + } + uint32_t index_within_node = i % LEAF_NODE_LEFT_SPLIT_COUNT; + void* destination = leaf_node_cell(destination_node, index_within_node); + + if (i == cursor->cell_num) { + serialize_row(value, destination); + } else if (i > cursor->cell_num) { + memcpy(destination, leaf_node_cell(old_node, i - 1), LEAF_NODE_CELL_SIZE); + } else { + memcpy(destination, leaf_node_cell(old_node, i), LEAF_NODE_CELL_SIZE); + } + }

Update the number of cells marked in the head of the node (update node's header)

+ /* Update cell count on both leaf nodes */ + *(leaf_node_num_cells(old_node)) = LEAF_NODE_LEFT_SPLIT_COUNT; + *(leaf_node_num_cells(new_node)) = LEAF_NODE_RIGHT_SPLIT_COUNT;

Then we need to update the node's parent. If the original node is a root node (root node), then it has no parent node. In this case, create a new root node as its parent. Here is another stub (not implemented first):

+ if (is_node_root(old_node)) { + return create_new_root(cursor->table, new_page_num); + } else { + printf("Need to implement updating parent after split\n"); + exit(EXIT_FAILURE); + } +}

## Allocating New Pages

Let's go back and define some functions and constants. When we create a new leaf node, we put it into a page determined (returned) by the get_unused_page_num() function.

+/* +Until we start recycling free pages, new pages will always +go onto the end of the database file +*/ +uint32_t get_unused_page_num(Pager* pager) { return pager->num_pages; }

Now, we assume that there are N data pages in a database, and pages numbered from 0 to N-1 have been allocated. So we can always assign a page N code for a new page. After we finally implement the delete (data) operation, some pages may become empty pages, and their page numbers may not be used. To be more efficient, we reclaim these free pages.

## The size of the leaf node (Leaf Node Sizes)

To keep the tree balanced, we distribute the cells equally between the two new nodes. If a leaf node can hold N cells, then during the split we need to distribute N + 1 cells between the two nodes (N original cells and one newly inserted cell). If N+1 is odd, I arbitrarily choose the cell with the most nodes on the left.

+const uint32_t LEAF_NODE_RIGHT_SPLIT_COUNT = (LEAF_NODE_MAX_CELLS + 1) / 2; +const uint32_t LEAF_NODE_LEFT_SPLIT_COUNT = + (LEAF_NODE_MAX_CELLS + 1) - LEAF_NODE_RIGHT_SPLIT_COUNT;

## Creating a New Root

Here is the process described by "SQLite Database System" to create a new root node:

Original text: Let N be the root node. First allocate two nodes, say L and R. Move lower half of N into L and the upper half into R. Now N is empty. Add 〈L, K,R〉 in N, where K is the max key in L. Page N remains the root. Note that the depth of the tree has increased by one, but the new tree remains height balanced without violating any B+-tree property.

Translation: Let N be the root node. First allocate two nodes, say L and R. Move the lower half of the entries in N to L and the upper half of them to R. Now N is empty. Add 〈L, K,R〉 to N, where K is the largest key in L. Page N is still the root node. Note that the depth of the tree has been increased by one layer, but without violating any B-Tree properties, the new tree still maintains a balance in height.

At this point, we have allocated the right child node and moved the upper half of the entries to this child node. Our function takes this right child as input and allocates a new page to hold the left child.

+void create_new_root(Table* table, uint32_t right_child_page_num) { + /* + Handle splitting the root. + Old root copied to new page, becomes left child. + Address of right child passed in. + Re-initialize root page to contain the new root node. + New root node points to two children. + */ + + void* root = get_page(table->pager, table->root_page_num); + void* right_child = get_page(table->pager, right_child_page_num); + uint32_t left_child_page_num = get_unused_page_num(table->pager); + void* left_child = get_page(table->pager, left_child_page_num);

The old root node has been copied to the left child, so we can reuse the root node (without reassigning):

+ /* Left child has data copied from old root */ + memcpy(left_child, root, PAGE_SIZE); + set_node_root(left_child, false);

Finally we initialize the root node as a new internal node with two children.

+ /* Root node is a new internal node with one key and two children */ + initialize_internal_node(root); + set_node_root(root, true); + *internal_node_num_keys(root) = 1; + *internal_node_child(root, 0) = left_child_page_num; + uint32_t left_child_max_key = get_node_max_key(left_child); + *internal_node_key(root, 0) = left_child_max_key; + *internal_node_right_child(root) = right_child_page_num; +}

## Internal Node Format

Now that we have finally created the internal node, we have to define its layout. It starts with the general header, then the number of key s it contains, followed by the page number of its right child. An internal node always has one more child pointer than its key. This child node pointer is stored in the header.

+/* + * Internal Node Header Layout + */ +const uint32_t INTERNAL_NODE_NUM_KEYS_SIZE = sizeof(uint32_t); +const uint32_t INTERNAL_NODE_NUM_KEYS_OFFSET = COMMON_NODE_HEADER_SIZE; +const uint32_t INTERNAL_NODE_RIGHT_CHILD_SIZE = sizeof(uint32_t); +const uint32_t INTERNAL_NODE_RIGHT_CHILD_OFFSET = + INTERNAL_NODE_NUM_KEYS_OFFSET + INTERNAL_NODE_NUM_KEYS_SIZE; +const uint32_t INTERNAL_NODE_HEADER_SIZE = COMMON_NODE_HEADER_SIZE + + INTERNAL_NODE_NUM_KEYS_SIZE + + INTERNAL_NODE_RIGHT_CHILD_SIZE;

The body of an internal node is an array of cells, and each cell contains a child pointer and a key. Each key must be the largest key contained in its left child.

+/* + * Internal Node Body Layout + */ +const uint32_t INTERNAL_NODE_KEY_SIZE = sizeof(uint32_t); +const uint32_t INTERNAL_NODE_CHILD_SIZE = sizeof(uint32_t); +const uint32_t INTERNAL_NODE_CELL_SIZE = + INTERNAL_NODE_CHILD_SIZE + INTERNAL_NODE_KEY_SIZE;

Based on these constants, here's what the internal node layout looks like:

Our internal node format

Notice our huge branching factor (aka fanout). Because each child pointer / key pair is so small, we can fit 510 keys and 511 child pointers in each internal node (that is, each internal node can have 510 children). This means we never have to traverse many levels of the tree when looking up a key.

# internal node layers | max # leaf nodes | Size of all leaf nodes |
---|---|---|

0 | 511^0 = 1 | 4 KB |

1 | 511^1 = 512 | ~2 MB |

2 | 511^2 = 261,121 | ~1 GB |

3 | 511^3 = 133,432,831 | ~550 GB |

In fact, we cannot store full 4KB of data in each leaf node, because of the overhead of storing header s and keys and the waste of space. But we can retrieve about 500G of data by loading 4 pages from disk (the tree is four levels high, and each level only needs to retrieve one page). That's why B-Tree is a useful data structure for databases.

Here's how to read and write to an internal node:

+uint32_t* internal_node_num_keys(void* node) { + return node + INTERNAL_NODE_NUM_KEYS_OFFSET; +} + +uint32_t* internal_node_right_child(void* node) { + return node + INTERNAL_NODE_RIGHT_CHILD_OFFSET; +} + +uint32_t* internal_node_cell(void* node, uint32_t cell_num) { + return node + INTERNAL_NODE_HEADER_SIZE + cell_num * INTERNAL_NODE_CELL_SIZE; +} + +uint32_t* internal_node_child(void* node, uint32_t child_num) { + uint32_t num_keys = *internal_node_num_keys(node); + if (child_num > num_keys) { + printf("Tried to access child_num %d > num_keys %d\n", child_num, num_keys); + exit(EXIT_FAILURE); + } else if (child_num == num_keys) { + return internal_node_right_child(node); + } else { + return internal_node_cell(node, child_num); + } +} + +uint32_t* internal_node_key(void* node, uint32_t key_num) { + return internal_node_cell(node, key_num) + INTERNAL_NODE_CHILD_SIZE; +}

For an internal node, the maximum key is always its right key. For a leaf node, the largest key is the largest index key.

+uint32_t get_node_max_key(void* node) { + switch (get_node_type(node)) { + case NODE_INTERNAL: + return *internal_node_key(node, *internal_node_num_keys(node) - 1); + case NODE_LEAF: + return *leaf_node_key(node, *leaf_node_num_cells(node) - 1); + } +}

## Keeping Track of the Root

We finally have the is_root field in the generic node header. The callback is what we use to decide how to split a leaf node:

if (is_node_root(old_node)) { return create_new_root(cursor->table, new_page_num); } else { printf("Need to implement updating parent after split\n"); exit(EXIT_FAILURE); } }

Here are the getters & setters:

+bool is_node_root(void* node) { + uint8_t value = *((uint8_t*)(node + IS_ROOT_OFFSET)); + return (bool)value; +} + +void set_node_root(void* node, bool is_root) { + uint8_t value = is_root; + *((uint8_t*)(node + IS_ROOT_OFFSET)) = value; +}

Initializing both types of nodes (internal nodes & leaf nodes) should default to setting is_root to false.

void initialize_leaf_node(void* node) { set_node_type(node, NODE_LEAF); + set_node_root(node, false); *leaf_node_num_cells(node) = 0; } +void initialize_internal_node(void* node) { + set_node_type(node, NODE_INTERNAL); + set_node_root(node, false); + *internal_node_num_keys(node) = 0; +}

We need to set is_root to true when creating the first node of the table.

// New database file. Initialize page 0 as leaf node. void* root_node = get_page(pager, 0); initialize_leaf_node(root_node); + set_node_root(root_node, true); } return table;

## Printing the Tree

To help us visualize the state of our database, we should update our .btree metacommand to print a multilevel tree.

I'm replacing the current print_leaf_node() function:

-void print_leaf_node(void* node) { - uint32_t num_cells = *leaf_node_num_cells(node); - printf("leaf (size %d)\n", num_cells); - for (uint32_t i = 0; i < num_cells; i++) { - uint32_t key = *leaf_node_key(node, i); - printf(" - %d : %d\n", i, key); - } -}

Implement a recursive function that takes any node and prints it and its children. It accepts an indentation level as an argument, which is incremented on each recursion. I also added a tiny helper function to the indentation.

+void indent(uint32_t level) { + for (uint32_t i = 0; i < level; i++) { + printf(" "); + } +} + +void print_tree(Pager* pager, uint32_t page_num, uint32_t indentation_level) { + void* node = get_page(pager, page_num); + uint32_t num_keys, child; + + switch (get_node_type(node)) { + case (NODE_LEAF): + num_keys = *leaf_node_num_cells(node); + indent(indentation_level); + printf("- leaf (size %d)\n", num_keys); + for (uint32_t i = 0; i < num_keys; i++) { + indent(indentation_level + 1); + printf("- %d\n", *leaf_node_key(node, i)); + } + break; + case (NODE_INTERNAL): + num_keys = *internal_node_num_keys(node); + indent(indentation_level); + printf("- internal (size %d)\n", num_keys); + for (uint32_t i = 0; i < num_keys; i++) { + child = *internal_node_child(node, i); + print_tree(pager, child, indentation_level + 1); + + indent(indentation_level + 1); + printf("- key %d\n", *internal_node_key(node, i)); + } + child = *internal_node_right_child(node); + print_tree(pager, child, indentation_level + 1); + break; + } +}

And update the call to the print function to pass an indentation level of zero.

} else if (strcmp(input_buffer->buffer, ".btree") == 0) { printf("Tree:\n"); - print_leaf_node(get_page(table->pager, 0)); + print_tree(table->pager, 0, 0); return META_COMMAND_SUCCESS;

Here is a test case for the new print function

+ it 'allows printing out the structure of a 3-leaf-node btree' do + script = (1..14).map do |i| + "insert #{i} user#{i} person#{i}@example.com" + end + script << ".btree" + script << "insert 15 user15 person15@example.com" + script << ".exit" + result = run_script(script) + + expect(result[14...(result.length)]).to match_array([ + "db > Tree:", + "- internal (size 1)", + " - leaf (size 7)", + " - 1", + " - 2", + " - 3", + " - 4", + " - 5", + " - 6", + " - 7", + " - key 7", + " - leaf (size 7)", + " - 8", + " - 9", + " - 10", + " - 11", + " - 12", + " - 13", + " - 14", + "db > Need to implement searching an internal node", + ]) + end

The new format is a bit simplified, so we need to update our existing .btree tests:

"db > Executed.", "db > Executed.", "db > Tree:", - "leaf (size 3)", - " - 0 : 1", - " - 1 : 2", - " - 2 : 3", + "- leaf (size 3)", + " - 1", + " - 2", + " - 3", "db > " ]) end

Here's the .btree output from the new test itself:

Tree: - internal (size 1) - leaf (size 7) - 1 - 2 - 3 - 4 - 5 - 6 - 7 - key 7 - leaf (size 7) - 8 - 9 - 10 - 11 - 12 - 13 - 14

At the least indented level, we see the root node (an internal node). It outputs a size of 1 because it has a key . Indented one level, we see leaf nodes, a key , and a leaf node. The key (7) in the root node is the largest key in the first left child. Each key greater than 7 is stored in the second child node.

## A major problem (A Major Problem)

If you've been paying close attention, you may have noticed that we missed some big things. See what happens if we try to insert an extra row:

db > insert 15 user15 person15@example.com Need to implement searching an internal node

Oh ho! Who writes TODO messages? (The author is playing tricks! Obviously he stubs the internal node search function in the table_find() function!)

Next time we will continue the epic B-tree saga by implementing search on multilevel trees.

Enjoy GreatSQL :)

## About GreatSQL

GreatSQL is a MySQL branch maintained by Wanli Database. It focuses on improving the reliability and performance of MGR, supports InnoDB parallel query features, and is a MySQL branch version suitable for financial-level applications.

Related Links: GreatSQL Community Gitee GitHub Bilibili

## GreatSQL community:

Details of community blog prize call for papers: https://greatsql.cn/thread-10...

## Technical exchange group:

WeChat: Scan the QR code to add WeChat friends of GreatSQL Community Assistant, and send verification information to join the group.