introduce
Most users have worked with hierarchical data in SQL databases, and there is no doubt that hierarchical data management is not intended for relational databases. The tables in a relational database are not hierarchical (like XML), but are a simple list. Hierarchical data has parent-child relationships that are not naturally represented in relational database tables.
For our purposes, hierarchical data is a data collection where each item has a parent and zero or more children (except for the root item, which has no parent). Hierarchical data can be found in a variety of database applications, including forums and mailing list threads, business organization diagrams, content management categories, and product categories. For our purposes, we will use the following product category hierarchy in a fictional e-store:
The way these categories form hierarchies is very similar to the other examples mentioned above. In this article, we will explore two models for processing hierarchical data in MySQL, starting with the traditional adjacency table model.
1. Adjacency List Model
Typically, the example categories shown above will be stored in a table, as shown below (I include complete create and insert statements so you can follow them):
CREATE TABLE category( category_id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(20) NOT NULL, parent INT DEFAULT NULL ); INSERT INTO category VALUES(1,'ELECTRONICS',NULL),(2,'TELEVISIONS',1),(3,'TUBE',2), (4,'LCD',2),(5,'PLASMA',2),(6,'PORTABLE ELECTRONICS',1),(7,'MP3 PLAYERS',6),(8,'FLASH',7), (9,'CD PLAYERS',6),(10,'2 WAY RADIOS',6); SELECT * FROM category ORDER BY category_id; +-------------+----------------------+--------+ | category_id | name | parent | +-------------+----------------------+--------+ | 1 | ELECTRONICS | NULL | | 2 | TELEVISIONS | 1 | | 3 | TUBE | 2 | | 4 | LCD | 2 | | 5 | PLASMA | 2 | | 6 | PORTABLE ELECTRONICS | 1 | | 7 | MP3 PLAYERS | 6 | | 8 | FLASH | 7 | | 9 | CD PLAYERS | 6 | | 10 | 2 WAY RADIOS | 6 | +-------------+----------------------+--------+ 10 rows in set (0.00 sec)
In the adjacency table model, each item in the table contains a pointer to its parent. The topmost element, in this case electronics, has an empty parent. The advantage of the adjacency table model is that it is easy to see that FLASH is a child of the MP3 PLAYERS player, MP3 PLAYERS is a child of PORTABLE ELECTRONICS, and PORTABLE ELECTRONICS is a child of ELECTRONICS. Although the adjacency list model can be easily handled in client code, it can be more problematic to use in pure SQL.
Retrieve Full Tree
When working with hierarchical data, the first common task is to display the entire tree, usually in some form of indentation. In pure SQL, the most common implementation is to use self-join:
SELECT t1.name AS lev1, t2.name as lev2, t3.name as lev3, t4.name as lev4 FROM category AS t1 LEFT JOIN category AS t2 ON t2.parent = t1.category_id LEFT JOIN category AS t3 ON t3.parent = t2.category_id LEFT JOIN category AS t4 ON t4.parent = t3.category_id WHERE t1.name = 'ELECTRONICS'; +-------------+----------------------+--------------+-------+ | lev1 | lev2 | lev3 | lev4 | +-------------+----------------------+--------------+-------+ | ELECTRONICS | TELEVISIONS | TUBE | NULL | | ELECTRONICS | TELEVISIONS | LCD | NULL | | ELECTRONICS | TELEVISIONS | PLASMA | NULL | | ELECTRONICS | PORTABLE ELECTRONICS | MP3 PLAYERS | FLASH | | ELECTRONICS | PORTABLE ELECTRONICS | CD PLAYERS | NULL | | ELECTRONICS | PORTABLE ELECTRONICS | 2 WAY RADIOS | NULL | +-------------+----------------------+--------------+-------+ 6 rows in set (0.00 sec)
Find all leaf nodes
By using the left join query, we can find all the leaf nodes (without children) in the tree:
SELECT t1.name FROM category AS t1 LEFT JOIN category as t2 ON t1.category_id = t2.parent WHERE t2.category_id IS NULL; +--------------+ | name | +--------------+ | TUBE | | LCD | | PLASMA | | FLASH | | CD PLAYERS | | 2 WAY RADIOS | +--------------+
Retrieve a single path (from leaf node up)
Self-join also allows us to see the full path in the hierarchy:
SELECT t1.name AS lev1, t2.name as lev2, t3.name as lev3, t4.name as lev4 FROM category AS t1 LEFT JOIN category AS t2 ON t2.parent = t1.category_id LEFT JOIN category AS t3 ON t3.parent = t2.category_id LEFT JOIN category AS t4 ON t4.parent = t3.category_id WHERE t1.name = 'ELECTRONICS' AND t4.name = 'FLASH'; +-------------+----------------------+-------------+-------+ | lev1 | lev2 | lev3 | lev4 | +-------------+----------------------+-------------+-------+ | ELECTRONICS | PORTABLE ELECTRONICS | MP3 PLAYERS | FLASH | +-------------+----------------------+-------------+-------+ 1 row in set (0.01 sec)
The main limitation of this approach is that each level in the hierarchy requires a self-join, and as the complexity of the join increases, performance naturally decreases with each level.
Limitations of the adjacency table model
Using the adjacency table model in pure SQL can be difficult. Before we can see the full path of a category, we must know its level. In addition, special care must be taken when deleting a node, because in the process the entire subtree may be orphaned (deleting the portable electronics category makes all its subtrees orphaned). Some of these limitations can be addressed by using client code or stored procedures. Using process language, we can iterate up from the bottom of the tree to return a complete tree or a single path. We can also use procedural programming to delete nodes without isolating the entire subtree by promoting a child element and reordering the remaining child elements to point to the new parent element.
2. Nested Set Model
I would like to highlight another approach in this article, commonly referred to as the Nested Set Model. In a nested set model, we can view hierarchies in a new way, not as nodes and rows, but as nested containers. Try to describe our electronics category in the following ways:
Note that as parent categories wrap up their children, our hierarchy is still maintained. We represent this form of hierarchy in a table by using left and right values to represent the nesting of nodes:
CREATE TABLE nested_category ( category_id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(20) NOT NULL, lft INT NOT NULL, rgt INT NOT NULL ); INSERT INTO nested_category VALUES(1,'ELECTRONICS',1,20),(2,'TELEVISIONS',2,9),(3,'TUBE',3,4), (4,'LCD',5,6),(5,'PLASMA',7,8),(6,'PORTABLE ELECTRONICS',10,19),(7,'MP3 PLAYERS',11,14),(8,'FLASH',12,13), (9,'CD PLAYERS',15,16),(10,'2 WAY RADIOS',17,18); SELECT * FROM nested_category ORDER BY category_id; +-------------+----------------------+-----+-----+ | category_id | name | lft | rgt | +-------------+----------------------+-----+-----+ | 1 | ELECTRONICS | 1 | 20 | | 2 | TELEVISIONS | 2 | 9 | | 3 | TUBE | 3 | 4 | | 4 | LCD | 5 | 6 | | 5 | PLASMA | 7 | 8 | | 6 | PORTABLE ELECTRONICS | 10 | 19 | | 7 | MP3 PLAYERS | 11 | 14 | | 8 | FLASH | 12 | 13 | | 9 | CD PLAYERS | 15 | 16 | | 10 | 2 WAY RADIOS | 17 | 18 | +-------------+----------------------+-----+-----+
We use lft and rgt because left and right are reserved words in MySQL. For information, see http://dev.mysql.com/doc/mysql/en/reserved-words.html . Keep a complete list of words.
So how do we determine the left and right values? We numbered from the far left until the right:
This design can also be applied to typical trees:
When working with trees, we go from left to right, one level at a time, down to the child nodes of each node, assign the numbers to the right, and move to the right. This method is called an improved pre-sorted tree traversal algorithm.
Retrieve Full Tree
We can retrieve the complete tree by using a self-join that links the parent node to the node, based on the fact that the LFT value of the node will always appear between the **lft ** and rgt values of its parent node:
SELECT node.name FROM nested_category AS node, nested_category AS parent WHERE node.lft BETWEEN parent.lft AND parent.rgt AND parent.name = 'ELECTRONICS' ORDER BY node.lft; +----------------------+ | name | +----------------------+ | ELECTRONICS | | TELEVISIONS | | TUBE | | LCD | | PLASMA | | PORTABLE ELECTRONICS | | MP3 PLAYERS | | FLASH | | CD PLAYERS | | 2 WAY RADIOS | +----------------------+
Unlike our previous example of the adjacency table model, this query will be independent of the depth of the tree. We don't care about the rgt value of the node in the BETWEEN clause because the rgt value always belongs to the same parent node as the lft value.
Find all leaf nodes
Finding all leaf nodes in a nested set model is even easier than the left join method used in the adjacency table model. If you look at nested_category table, you may notice that the LFT and RGT values of leaf nodes are continuous numbers. To find the leaf node, we look for the node rgt=lft+1:
SELECT name FROM nested_category WHERE rgt = lft + 1; +--------------+ | name | +--------------+ | TUBE | | LCD | | PLASMA | | FLASH | | CD PLAYERS | | 2 WAY RADIOS | +--------------+
Retrieving a single path
Using the nested set model, we can retrieve a single path without multiple self-connections:
SELECT parent.name FROM nested_category AS node, nested_category AS parent WHERE node.lft BETWEEN parent.lft AND parent.rgt AND node.name = 'FLASH' ORDER BY parent.lft; +----------------------+ | name | +----------------------+ | ELECTRONICS | | PORTABLE ELECTRONICS | | MP3 PLAYERS | | FLASH | +----------------------+
Depth of finding nodes
We've learned how to display the entire tree, but what if we want to show the depth of each node in the tree so that we can better identify where each node is in the hierarchy? This can be achieved by adding COUNT functions and GROUP BY clauses to an existing query to display the entire tree:
SELECT node.name, (COUNT(parent.name) - 1) AS depth FROM nested_category AS node, nested_category AS parent WHERE node.lft BETWEEN parent.lft AND parent.rgt GROUP BY node.name ORDER BY node.lft; +----------------------+-------+ | name | depth | +----------------------+-------+ | ELECTRONICS | 0 | | TELEVISIONS | 1 | | TUBE | 2 | | LCD | 2 | | PLASMA | 2 | | PORTABLE ELECTRONICS | 1 | | MP3 PLAYERS | 2 | | FLASH | 3 | | CD PLAYERS | 2 | | 2 WAY RADIOS | 2 | +----------------------+-------+
We can also indent our category names with depth values using CONCAT and REPEAT string functions:
SELECT CONCAT( REPEAT(' ', COUNT(parent.name) - 1), node.name) AS name FROM nested_category AS node, nested_category AS parent WHERE node.lft BETWEEN parent.lft AND parent.rgt GROUP BY node.name ORDER BY node.lft; +-----------------------+ | name | +-----------------------+ | ELECTRONICS | | TELEVISIONS | | TUBE | | LCD | | PLASMA | | PORTABLE ELECTRONICS | | MP3 PLAYERS | | FLASH | | CD PLAYERS | | 2 WAY RADIOS | +-----------------------+
Of course, in client applications, you are more likely to use depth values directly to display hierarchies. Web developers can iterate through the tree, adding <li></li> and <ul></ul> tags as the number of depths increases and decreases.
Subtree Depth
When we need depth information for a subtree, we cannot restrict the node table or parent table in a self-join because it can destroy our results. Instead, we added a third self-join and a subquery to determine the depth at which the subtree will start:
SELECT node.name, (COUNT(parent.name) - (sub_tree.depth + 1)) AS depth FROM nested_category AS node, nested_category AS parent, nested_category AS sub_parent, ( SELECT node.name, (COUNT(parent.name) - 1) AS depth FROM nested_category AS node, nested_category AS parent WHERE node.lft BETWEEN parent.lft AND parent.rgt AND node.name = 'PORTABLE ELECTRONICS' GROUP BY node.name ORDER BY node.lft )AS sub_tree WHERE node.lft BETWEEN parent.lft AND parent.rgt AND node.lft BETWEEN sub_parent.lft AND sub_parent.rgt AND sub_parent.name = sub_tree.name GROUP BY node.name ORDER BY node.lft; +----------------------+-------+ | name | depth | +----------------------+-------+ | PORTABLE ELECTRONICS | 0 | | MP3 PLAYERS | 1 | | FLASH | 2 | | CD PLAYERS | 1 | | 2 WAY RADIOS | 1 | +----------------------+-------+
This function can be used with any node name, including the root node. The depth value is always relative to the specified node.
Finding direct descendants of a node
Suppose you display an electronic product category on a retailer's website. When a user clicks on a category, you may want to display the product of that category and list its direct subcategories, but not the entire category tree underneath it. To do this, we need to show the node and its immediate children, not the entire lower part of the tree. For example, when displaying the PORTABLE ELECTRONICS category, we want to display MP3 PLAYERS, CD PLAYERS, and 2 WAY RADIOS, but not FLASH.
This can be easily accomplished by adding a HAVING clause to the previous query:
SELECT node.name, (COUNT(parent.name) - (sub_tree.depth + 1)) AS depth FROM nested_category AS node, nested_category AS parent, nested_category AS sub_parent, ( SELECT node.name, (COUNT(parent.name) - 1) AS depth FROM nested_category AS node, nested_category AS parent WHERE node.lft BETWEEN parent.lft AND parent.rgt AND node.name = 'PORTABLE ELECTRONICS' GROUP BY node.name ORDER BY node.lft )AS sub_tree WHERE node.lft BETWEEN parent.lft AND parent.rgt AND node.lft BETWEEN sub_parent.lft AND sub_parent.rgt AND sub_parent.name = sub_tree.name GROUP BY node.name HAVING depth <= 1 ORDER BY node.lft; +----------------------+-------+ | name | depth | +----------------------+-------+ | PORTABLE ELECTRONICS | 0 | | MP3 PLAYERS | 1 | | CD PLAYERS | 1 | | 2 WAY RADIOS | 1 | +----------------------+-------+
If you do not want to display the parent node, change the HAVING depth <= 1 line to HAVING depth = 1.
Aggregation function in nested set
Let's add a product table to demonstrate the summary function:
CREATE TABLE product ( product_id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(40), category_id INT NOT NULL ); INSERT INTO product(name, category_id) VALUES('20" TV',3),('36" TV',3), ('Super-LCD 42"',4),('Ultra-Plasma 62"',5),('Value Plasma 38"',5), ('Power-MP3 5gb',7),('Super-Player 1gb',8),('Porta CD',9),('CD To go!',9), ('Family Talk 360',10); SELECT * FROM product; +------------+-------------------+-------------+ | product_id | name | category_id | +------------+-------------------+-------------+ | 1 | 20" TV | 3 | | 2 | 36" TV | 3 | | 3 | Super-LCD 42" | 4 | | 4 | Ultra-Plasma 62" | 5 | | 5 | Value Plasma 38" | 5 | | 6 | Power-MP3 128mb | 7 | | 7 | Super-Shuffle 1gb | 8 | | 8 | Porta CD | 9 | | 9 | CD To go! | 9 | | 10 | Family Talk 360 | 10 | +------------+-------------------+-------------+
Now let's generate a query that retrieves our category tree and the number of products in each category:
SELECT parent.name, COUNT(product.name) FROM nested_category AS node , nested_category AS parent, product WHERE node.lft BETWEEN parent.lft AND parent.rgt AND node.category_id = product.category_id GROUP BY parent.name ORDER BY node.lft; +----------------------+---------------------+ | name | COUNT(product.name) | +----------------------+---------------------+ | ELECTRONICS | 10 | | TELEVISIONS | 5 | | TUBE | 2 | | LCD | 1 | | PLASMA | 2 | | PORTABLE ELECTRONICS | 5 | | MP3 PLAYERS | 2 | | FLASH | 1 | | CD PLAYERS | 2 | | 2 WAY RADIOS | 1 | +----------------------+---------------------+
This is our typical whole tree query, which adds COUNT and GROUP BY as well as references to product tables and joins between nodes and product tables in the WHERE clause. As you can see, each category has a count, and the count of subcategories is reflected in the parent category.
Add a new node
Now that we know how to query a tree, we should see how to update it by adding new nodes. Let's take a look at our nested set diagram again:
1: Add a sibling node if there are already children under the parent node
If we want to add a new node between the "TELEVISIONS" and "PORTABLE ELECTRONICS" nodes, the "lft" and "rgt" values of the new node are 10 and 11, respectively, and the "lft" and "rgt" values of all the nodes on the right of the new node will be increased by 2. We will then add the new node using the appropriate "lft" and "rgt" values. Although you can use stored procedures in MySQL 5 to do this, for now I will assume that most readers will use 4.1 because it is the latest stable version, and I will use the LOCK TABLES statement to isolate queries:
LOCK TABLE nested_category WRITE; SELECT @myRight := rgt FROM nested_category WHERE name = 'TELEVISIONS'; UPDATE nested_category SET rgt = rgt + 2 WHERE rgt > @myRight; UPDATE nested_category SET lft = lft + 2 WHERE lft > @myRight; INSERT INTO nested_category(name, lft, rgt) VALUES('GAME CONSOLES', @myRight + 1, @myRight + 2); UNLOCK TABLES;
Then we can check the nesting using an indented tree query:
SELECT CONCAT( REPEAT( ' ', (COUNT(parent.name) - 1) ), node.name) AS name FROM nested_category AS node, nested_category AS parent WHERE node.lft BETWEEN parent.lft AND parent.rgt GROUP BY node.name ORDER BY node.lft; +-----------------------+ | name | +-----------------------+ | ELECTRONICS | | TELEVISIONS | | TUBE | | LCD | | PLASMA | | GAME CONSOLES | | PORTABLE ELECTRONICS | | MP3 PLAYERS | | FLASH | | CD PLAYERS | | 2 WAY RADIOS | +-----------------------+
2: Add a child node without a child node under the parent node
Conversely, if we want to add a node as a child of a node that has no existing child nodes, we need to modify it slightly. Let's add a new FRS node under the "2 WAY RADIOS" node:
LOCK TABLE nested_category WRITE; SELECT @myLeft := lft FROM nested_category WHERE name = '2 WAY RADIOS'; UPDATE nested_category SET rgt = rgt + 2 WHERE rgt > @myLeft; UPDATE nested_category SET lft = lft + 2 WHERE lft > @myLeft; INSERT INTO nested_category(name, lft, rgt) VALUES('FRS', @myLeft + 1, @myLeft + 2); UNLOCK TABLES;
In this example, we expand everything to the right of the number on the left side of the new parent node and place the node on the right of the value on the left. As you can see, our new nodes are now nested correctly:
SELECT CONCAT( REPEAT( ' ', (COUNT(parent.name) - 1) ), node.name) AS name FROM nested_category AS node, nested_category AS parent WHERE node.lft BETWEEN parent.lft AND parent.rgt GROUP BY node.name ORDER BY node.lft; +-----------------------+ | name | +-----------------------+ | ELECTRONICS | | TELEVISIONS | | TUBE | | LCD | | PLASMA | | GAME CONSOLES | | PORTABLE ELECTRONICS | | MP3 PLAYERS | | FLASH | | CD PLAYERS | | 2 WAY RADIOS | | FRS | +-----------------------+
Delete Node
The last basic task involved in using nested sets is to delete nodes. The action taken to delete a node depends on its location in the hierarchy. Deleting leaf nodes is easier than deleting nodes with children because we have to deal with isolated nodes.
When deleting a leaf node, this process is the opposite of adding a new node. We delete the node and its width from the right side of each node:
LOCK TABLE nested_category WRITE; SELECT @myLeft := lft, @myRight := rgt, @myWidth := rgt - lft + 1 FROM nested_category WHERE name = 'GAME CONSOLES'; DELETE FROM nested_category WHERE lft BETWEEN @myLeft AND @myRight; UPDATE nested_category SET rgt = rgt - @myWidth WHERE rgt > @myRight; UPDATE nested_category SET lft = lft - @myWidth WHERE lft > @myRight; UNLOCK TABLES;
Again, we perform an indentation tree query to confirm that our nodes have been deleted without breaking the hierarchy:
SELECT CONCAT( REPEAT( ' ', (COUNT(parent.name) - 1) ), node.name) AS name FROM nested_category AS node, nested_category AS parent WHERE node.lft BETWEEN parent.lft AND parent.rgt GROUP BY node.name ORDER BY node.lft; +-----------------------+ | name | +-----------------------+ | ELECTRONICS | | TELEVISIONS | | TUBE | | LCD | | PLASMA | | PORTABLE ELECTRONICS | | MP3 PLAYERS | | FLASH | | CD PLAYERS | | 2 WAY RADIOS | | FRS | +-----------------------+
The same applies to deleting a node and all its children:
LOCK TABLE nested_category WRITE; SELECT @myLeft := lft, @myRight := rgt, @myWidth := rgt - lft + 1 FROM nested_category WHERE name = 'MP3 PLAYERS'; DELETE FROM nested_category WHERE lft BETWEEN @myLeft AND @myRight; UPDATE nested_category SET rgt = rgt - @myWidth WHERE rgt > @myRight; UPDATE nested_category SET lft = lft - @myWidth WHERE lft > @myRight; UNLOCK TABLES;
Query again to see if the entire subtree has been successfully deleted:
SELECT CONCAT( REPEAT( ' ', (COUNT(parent.name) - 1) ), node.name) AS name FROM nested_category AS node, nested_category AS parent WHERE node.lft BETWEEN parent.lft AND parent.rgt GROUP BY node.name ORDER BY node.lft; +-----------------------+ | name | +-----------------------+ | ELECTRONICS | | TELEVISIONS | | TUBE | | LCD | | PLASMA | | PORTABLE ELECTRONICS | | CD PLAYERS | | 2 WAY RADIOS | | FRS | +-----------------------+
Another scenario we have to deal with is to delete the parent node but not the child node. In some cases, you may want to change the name to just one placeholder until an alternative name appears, such as when the supervisor is dismissed. In other cases, all child nodes should be moved to the level of the deleted parent node:
LOCK TABLE nested_category WRITE; SELECT @myLeft := lft, @myRight := rgt, @myWidth := rgt - lft + 1 FROM nested_category WHERE name = 'PORTABLE ELECTRONICS'; DELETE FROM nested_category WHERE lft = @myLeft; UPDATE nested_category SET rgt = rgt - 1, lft = lft - 1 WHERE lft BETWEEN @myLeft AND @myRight; UPDATE nested_category SET rgt = rgt - 2 WHERE rgt > @myRight; UPDATE nested_category SET lft = lft - 2 WHERE lft > @myRight; UNLOCK TABLES;
In this example, we subtract 2 from all the elements on the right side of the node (because there are no child nodes, its width will be 2) and 1 from the node that is its child (to make up for the difference caused by the loss of the left value of the parent node). Again, our elements have been elevated:
SELECT CONCAT( REPEAT( ' ', (COUNT(parent.name) - 1) ), node.name) AS name FROM nested_category AS node, nested_category AS parent WHERE node.lft BETWEEN parent.lft AND parent.rgt GROUP BY node.name ORDER BY node.lft; +---------------+ | name | +---------------+ | ELECTRONICS | | TELEVISIONS | | TUBE | | LCD | | PLASMA | | CD PLAYERS | | 2 WAY RADIOS | | FRS | +---------------+
Other scenarios for deleting a node include elevating a child node to the location of the parent node and moving the child node to a sibling node of the parent node, but for space reasons, these scenarios are not covered in this article.
Last thought
Although I hope the information in this article will be useful to you, the concept of nested sets in SQL has existed for more than a decade, and there is much more information in books and on the Internet. In my view, the most comprehensive source of information about the management hierarchy is a book called Joe Celko's Trees and Hierarchies in SQL for Smarties Written by Joe Celko, a respected author in the advanced SQL field. Joe Celko is often known as a nested set model and is by far the most prolific author on this topic. I find Celko's book a valuable resource in my own learning, so I strongly recommend it. This book covers advanced topics not covered in this article and provides methods for managing hierarchical data in addition to adjacent tables and nested set models.
In the References/Resources section below, I list some Web resources that can be used for your research on managing hierarchical data, including a pair of PHP-related resources, including pre-built PHP libraries for working with nested sets in MySQL. Those who currently use the adjacency table model and want to use the nested set model are listed below Storing Hierarchical Data in a Database Sample code for converting between the two was found in the resource.
References/Resources
- Joe Celko's Trees and Hierarchies in SQL for Smarties – ISBN 1-55860-920-2
- Storing Hierarchical Data in a Database
- A Look at SQL Trees
- SQL Lessons
- Nontraditional Databases
- Trees in SQL
- Trees in SQL (2)
- Converting an adjacency list model to a nested set model
- Nested Sets in PHP Demonstration (German)
- A Nested Set Library in PHP
Realization
Later, I'm going to write an implementation of MyBatis.