Recently I wrote an overview of a common data structure known as a tree and I mentioned that there were in fact many variants of trees being used for different purposes in production. Today I would like to talk a bit about a particularly powerful tree known as the <a class=’blog-post-link’ href=’ https://en.wikipedia.org/wiki/B-tree’>B-tree. Since the b-tree is a comparatively complicated and nuanced tree I am going to avoid diving into code examples and instead focus on the two main things which anyone interested needs to know: what defines a b-tree and why is it important? So without further ado …
What is a B-Tree?
A b-tree is a tree that is sorted and self-balancing. The tree is sorted from least to greatest, increasing order. This means that a depth first traversal, one which travels down the branch before moving over to the next branch, should reach and act upon each value in the tree in increasing order. The model by which a b-tree balances is very strict. First and foremost, it is always of equal depth throughout the tree, that is, no branch should be longer or shorter than any other. This is not just because it creates a clean mental model, though it does, but because this allows it to achieve its primary purpose which we will talk about shortly.
In addition to its evenness in depth a b-tree should be balanced by a few other rules all dictated by the trees minimum degree which is declared at the time the tree is created. The tree is built and balanced by the minimum degree, which we will call d, in the following ways:
- All nodes should have at least d-1 and at most 2d-1 values
- The root is the only exception which may have 1 value
- Each node should have 1 child more than it has values. Ex: 2 values gets 3 children
- This means that a node should have at least d children, if any, and 2d at most
In putting all of this together I find this visual to be very helpful.
This example is a b-tree which has been declared with a minimum degree, d, of 2. You can see how each node has at least 1 value because d-1 = (2) - 1 = 1 and at most 3 because 2d - 1 = 2(2) - 1 = 3. Similarly each node has exactly 1 child more than it has values, thus at least 2.
I particularly like how this displays visually the relationship of the values in each child array to those in the parent. That is, it shows through it's alignment how the first child of the root contains sorted values which are strictly less than the first value in the root, which is 21. The second child, in turn, contains values that are strictly greater than 21 but less than the roots second value, 48. This, I think makes it much easier to see how the information should be traversed so as to maintain its increasing order.
As I said the evenness in depth of the tree is central to allowing it to achieve its primary objective and that objective is speed. Imagine that we filled the b-tree above with values and were searching for a particular value, say 46. Begining at the root we can see that 46 is greater than the first value 21 so could 46 possibly be in the left most branch? Nope. All of those will be less than 21 because we know this tree is sorted.
Sweet that saved some work! Now we look at the second value in the root, 48. Hmmm, 46 is less than 48 so it must be somewhere down the second branch. Awesome! No need to look down branches 3 or 4. Now we can move down to the first child on that second branch and we have immediately cut work involved in our search down by a factor of 4! So what happens when we continue the search on this branch? The exact same thing! We can again identify which child to continue our search in by comparing our target to the values at our current node and again divide our potential work by 4. That's a huge savings!
This division of work which allows a search to operate faster has the exact same effect on inserting and deleting new values into a b-tree. So, lets talk time complexity for a moment. What is the time complexity of operations on a b-tree? Well in this example it would be O(log4(n)) because we have 4 children at each level dividing our total options. In general however a b-tree is O(log2d(n)) every node in the tree has at most 2d children.
Hopefully this has given you a general understanding of the form and function of a b-tree. Like I said I am not going to dive into code examples here since the fact is that there are several awesome resources for defining and visualizing this very cool data structure readily available.
Note: This repo is still a work in progress. If you find a bug or think of a test you believe ought to be included I welcome any pull requests.