If you download this page, note that you should also download the subdirectory b-trees to be able to see the graphical images. One way to do that all at once is to download 06BTree.zip.

Dynamic data structures allow lookups and change based on their use: in hash tables and B-trees there can be additions and deletions.

Another form of dynamic data structure in a union-find tree. This structure has a rather specialized use, and I delay discussing it until after we introduce a greedy graph algorithm that can use it, and give it context.

WARNING! Very misleading terminology in the book: What I will call chained hashing, the book call "Separate chaining" or "open hashing".

Completely different are the versions of open address hashing, which the book also calls "closed hashing", so open is used in totally different ways in the book. Henceforth, I will only use "open" in connection with open address hashing. As in the book, I distinguish the two general approaches to open address hashing as "linear probing" or "double hashing". For short, open linear and open double.

For open address hashing, the analysis is much more difficult and was carried out exactly by Knuth (see the references at the end of the chapter) in the case that there are only additions, not deletions from the table. In this case the average number of comparisons is

Open address hashing with linear probing: 0.5 + 0.5/(1 - α)

Open address hashing with double hashing: -(1/α)ln(1 - α)

Average Number of Hash Table Comparisons

----------------------------------------

Load Open Open Chained

Factor Linear Double

0.25 1.17 1.15 1.13

0.50 1.50 1.39 1.25

0.75 2.50 1.85 1.38

0.90 5.50 2.56 1.45

0.95 10.50 3.15 1.48

0.99 50.50 4.65 1.50

1.00 ---- ---- 1.50

1.50 ---- ---- 1.75

2.00 ---- ---- 2.00

2.50 ---- ---- 2.25

This table was printed using hashAvgComp.py, in the handouts on the web. This table is somewhat misleading, as your homework problem E gets you to explore.

You should understand the algorithms for each of these three versions of hashing, and be able to show the sequence of changes to a hash table made for any example listing the size of the array and the values of the hash codes for each key added or removed (plus for double hashing, second hash value -- the increment used to resolve collisions).

The book mentioned that prime length hash tables are a good idea in case you have an imperfect hash function. (For instance if your hash function ends up always being even, and the hash table has and even length n, then the target index h mod n will always be even - wasting half the hash table. The common factor is the issue. A prime length avoids that. For double hashing there are only n-1 places you want to look with the second hash (since you assume a collision at the first place). You could make the increment be the original hash value h mod (n-1), but n-1 will alway be even if n is prime. Not good. Knuth noted, however, there are an infinite number of double primes pairs p, p+2 (3:5, 5:7, 11:13, ...). If the prime N is the larger of a double prime pair (2 apart), then if h is the hash value for a key, the index for the first try is the usual h mod N. For the double hash, you want an easily calculated, apparently independent increment 1, 2, .... N-1. An easy an excellent choice uses the already calculated h: 1 + h mod (N-2). The only defect here is that it takes on one fewer value (N-1 is impossible). I use a double hash increment chosen in this way for the examples below.

**Hash Table Algorithm illustrations**

Assume a hash table of size n = 13, to which we want
to add a number
of names as keys. The arithmetic for a sequence of names is
shown.
Hash codes given are the actual Java String hashCodes. The
index
is
Math.abs(key.hashCode()) % n. Also shown for double hashing
are
the
increments used after collisions 1+ Math.abs(key.hashCode()) %
(n-2). Note the use of absolute value. Hash results are
usually arbitrary machine integers, half or which are negative.
-13 mod 5 is -3, not a good index. The abs avoids that
problem. (You could also use an if statement and add n if the
modular result is negative.)

` Key
HashCode Index Increment`

`----- -------- -----
--------`

`
Phu
113485
8 10`

`
Amy
93139
7 3`

` Miao
4047726
7 2`

` Jose
3904637
9 1`

`
Liz
108051
8 10`

`
Eve
98928
11 6`

` Xiao
4604909
10 2`

`
Abe
92712
9 5`

Consider adding just the first three names to the
hash table, in the
order given. I show three columns, one for each of the three
hashing
methods: open addressing, double hashing, and chained
hashing.
'/' is used to mean null or nil for the linked lists. `
open
double chained`

**Hash Example a**: Add the
next three names to
the hash table in the same format. Answer
a

**b.** Now imagine that Liz is
removed. How would the
table look then? Answer b

**c.** Now add Xiao and
Abe. How would the table look
then? Answer c

`//Simple Hashtable example with use of an iterator`

`import java.util.Hashtable; `

Run this program and note the order that the numbers are printed out! When the keys get hashed, their order depends on the hash codes and the hashing algorithm used, which is implementation dependent!

`Action Key Index Increment`

`------ --- ----- ---------`

`Add
A 0
3`

`Add `` B
5
1`

Add `C
0
2`

`Add ``D
5
4`

`Add E
6
6
Delete D`

Add F 5 2

Binary Search Trees

Search in a Binary Search Tree is O(tree depth). The depth can be O(lg n) but it can also be O(n) if earlier insertions and deletions in the binary tree are done naively. There are several variations on search trees that keep the depth at O(lg n): Balanced trees, AVL trees, and Red-Black trees. Red-Black trees are a very popular version for trees held entirely in memory. The red-black algorithms are hard to motivate directly (for me), but they are related to the much more accessible and separately useful B-trees.

All of these approaches have trade offs, but not of space and time as the book suggests in its chapter title, but between time to maintain the data structure vs time to do searches, insertions, and deletions.

B-trees are not binary trees. They typically have multiple data values in each node, x

Generalized B-tree of height h >= 0

The only generalized B-tree of height 0 is nil.

A generalized B-tree of height h > 0 has a root node which may have zero or more data values. This number of data values is its size; denote it by k. Name the data values x

Each T_{i}
is a general B-tree
of height h-1.

The values x_{i} are sorted in nondecreasing order.

The data in non-nil subtrees is interleaved by the data in the node: If L_{i} and H_{i}
are the lowest and
highest data value in the whole subtree T_{i},

H_{i-1}
≤ x_{i} ≤ L_{i}
for i = 1 to k.

The values x

The data in non-nil subtrees is interleaved by the data in the node: If L

H

An m-M tree, or just B-tree if m and M are known in the present context, is a generalized B-tree with the additional restrictions:

m and M are integers with 2 ≤ m, and 2m-1 ≤ M

If the tree is non-nil, each node except the root satisfies m ≤ number of children ≤ M, or equivalently m-1 ≤ size ≤ M-1, and the root node satisfies 2 ≤ number of children ≤ M, or equivalently 1 ≤ size < M.

In some definitions M is required to equal 2m-1. If the value of M is small, all the possible number of children may be listed in the name, as in 2,3,4 tree rather than 2-4 tree.

The recursive definition gives an equal depth to all child subtrees. Hence all leaf nodes are at the exact same depth.

Note that a B-tree may be efficiently searched. If a desired target is not in the root node, there is just one subtree that need to be searched recursively to find it, similar to a binary search tree.

We wish to discuss algorithms for inserting elements into and deleting from B-trees. At intermediate points, generalized B-trees may be created. There are two basic transformations (plus their inverses) on generalized B-trees that need to be considered:

promote and split and its inverse demote and join.

shift right and its inverse shift left

In the shifts, note that subtree between the two moved data nodes switches between the two subtrees (to stay between the two moved data nodes).

For generalized B-trees, when the red data exists, these transformations can always be made.

boolean insert(target)

- If the tree is empty, make the tree have one node with just the target as data, and return true.
- Otherwise, keep track of the subtree that the target might be found in. If it is not found in the root via binary search, continue in the one subtree that it could be found, until
- It is
found, return false

- Else a node at the bottom of the tree is reached and the target is not in the node.
- Initially add the target to the bottom node, inserting it in order.
- While the node with the insertion or promotion is not the root
- If the node is still of size less than M, return true.
- Else
the size of the node is M. Since M >= 2m-1,
promote
the middle element and split the node in two, both of which will have
legal size, at least (2m-1 - 1)/2 = m-1. To make each choice
unique when demonstrating the algorithm, you can insist that
if the two parts are of
unequal size, the left sibling node ends up one smaller than the right
sibling node. The loop now focuses on the node with the promotion.

- If this
point is reached, the root has size M.
Create a node with
no data above the root, and promote the old root's middle item into
it. Then this new root has size 1, which is always legal for
the
root, and the new
root's two children will have size at least m-1 as in 4b, so the the
tree is an
m-M tree with height one more than previously. Return true.

Here is an initial 2-3 tree, then the generalized tree with 32 inserted, and then the final 2-3 tree after three promotions, including the last which creates a new root.

The only way the height of the tree increases is when step 5 is reached and there is a promotion out of the original root node, as in the last example. Extending at the top means all the leaf nodes still have an equal depth, but now one more than it was before.

boolean delete(target)

- Search for the target. If it is not found, return false.
- If the target is found, there are two cases
- The target
was found at the bottom level or

- The target was found at a higher level. In this case
- Traverse
the tree down to find the successor of the target
at
the bottom of the tree by going to the subtree to the right of the
target, and following the tree down its leftmost subtree to the
leftmost element of a leaf node.

- Overwrite the element to be deleted with this successor
- Switch so the copy of the successor at the bottom of the tree is the element to be removed
- Remove the element to be deleted at the
bottom
level.

- While the node with an element removed or demoted is not the root with size < m-1: fix the possibly small size node:
- If its
node still has size >= m-1, return true.

At this point the node is too small, of size m-2. Since the node is not the root, it must have an adjacent sibling. - If an adjacent sibling has size more than m-1,

use the left or right shift transformation to reduce the size of the adjacent sibling by one (and it will still have size >= m-1) and increase the size of the deficient node, making it have size m-1, so this node is fixed without making a deficiency elsewhere. Return true. To make each choice unique when demonstrating the algorithm, you can insist on shifting out of the larger sibling node or out of the right sibling if both are of the same size.

- Else (an adjacent sibling has size m-1): perform the demote and join transformation, joining the deficient node (with size m-2) and the adjacent sibling (of size m-1), adding the demoted element, so the joined node has size m-2 + m-1 + 1 = 2m-2 < M, so this node has a legal size, but now the parent is the possibly small node. To make each choice unique when demonstrating the algorithm, you can insist on joining with the right sibling if there are two adjacent siblings of size m-1.
- If this point is reached, the possibly small node is the root. It only needs to be fixed if it has no elements. In that case the node has only one subtree, and the empty root can be removed and let its subtree be the new root. Return true.

Example 2:

Hence all of the operations search, insert, delete are O(log n).

Binary trees do have a certain simplicity. A 2-4 tree can be implemented as a binary tree: If there is a single data value and two subtrees, they can be stored via a single binary tree node. If the 2-4 node size is two or three, it can be modeled with 2 or three binary tree nodes. The one complication is that in a traversal down from the top of the tree, some nodes indicate the start of a 2-4 tree node, but there are extra binary tree nodes corresponding to a 2-4 node with 2 or three data values, and it is not clear where a B-tree 2-4 node starts and ends. To keep the correspondence with 2-4 trees, one extra bit of information is needed for each binary node, typically indicated as a choice between red or black coloring, where black indicates the start of a new level of the 2-4 tree, which may be connected to one or two red nodes that would be part of the same 2-4 node. The figures below indicate the correspondences for the different sizes of 2-4 node. In the pictures in this page, red is replaced by white.

Full tree example

A 2-4 tree modeled in this way is a red-black tree. The book has a rather different looking definition! Comparing terms, we see the black height of a red black tree is the corresponding height of the B-tree. The red-black trees are generally drawn with all the nodes modeling one B-tree node at the same level, reinforcing the correspondence with 2-4 trees.

Since a red-black tree is actually a binary search tree, searching is the same as for any binary search tree (the color information is not needed). Algorithms for insertion and deletion could be worked out as direct translations of the B-tree algorithms given above, and they would be still be O(log n). In fact, the algorithms can be made somewhat more efficient (though not always easier to follow!) by taking further advantage of the particular red-black binary tree implementation.

I am skipping the lengthy red-black tree algorithms. We have outlined how they are binary trees with a color attribute that allow searches, insertions, and deletions in O(log n). I do hold you responsible for the B-tree algorithms, and the relationship between 2-4 trees and red black trees, specifically how you can convert between a 2-4 tree and a red black tree, in either direction.

I'm not sure why this section is under space-time tradeoffs in the text. There is a space-time tradeoff in one common use case: where the data does not fit in memory. In that case slow disk reads are decreased by having each node bigger, but this takes up more space in memory.

Exercises.

- Show the result of this insertion algorithm after insertion into the 2-3 tree above. To save time you only need to show the nodes that change and child links of those nodes, with "..." under children whose subtree does not change:
- Insert 193
- Insert 52
- Show the
result of this deletion algorithm starting from the
original 2-3 tree above. Again, you may just show a partial
tree.

- Delete 80
- Delete 250
- The delete algorithm needed to include a case which does a right or left shift. Right and left shifts are not necessary, but possibly helpful in the insert algorithm. Modify the insertion algorithm to possibly terminate the loop that goes through ancestors by doing a shift. You do not need to totally rewrite the algorithm, but clearly refer to how the parts you write out fit into the parts that you are keeping from the original algorithm.
- Prove my claim that more than half the elements of a non-nil B-tree are in the lowest level, for any legal values of m and M. This is a challenge for most students. It depends strongly on the fact that nodes not at maximal depth have one more subtree than data elements.
- Draw the B-tree corresponding to the red-black tree below. I did not use graphics; The Black nodes have B before the numeric data. The dots are just spacers, so if you view the diagram in a fixed width font it looks right..

.............B30On to basic graph traversals.

........../........\

.......B15.........B50

...../....\......./....\

..10.....B18....40......75

../ \.........../ \...../ \

B5..B14......B35 B42..B60 B90

..\......................./ \

...7.....................80 95

` open----- double chain`

`0
/`

`1
/`

`2
/`

`3
/`

`4
/`

`5
Liz
/`

`6
/`

`7 Amy Amy
-> Miao ->
Amy -> /`

`8 Phu Phu
-> Liz ->
Phu -> /`

`9 Miao Miao
-> Jose -> /`

`10 Jose Jose
/`

`11 Liz Eve
1 -> Eve -> /`

`12 Eve
/`

` open----- double chain`

`0
/`

`1
/`

`2
/`

`3
/`

`4
/`

`5
vacated
/`

`6
/`

`7 Amy Amy
-> Miao ->
Amy -> /`

`8 Phu Phu
-> Phu -> /`

`9 Miao Miao
-> Jose -> /`

`10 Jose Jose
/`

`11 vacated Eve
-> Eve -> /`

`12 Eve
/`

You might have used some other notation, but the
point is that the
vacated
spaces must not appear exactly the same as the spaces that were never
occupied!

Imagine you are searching for Eve now. In the open
addressing,
you would not want to stop at empty location 11 and decide that Eve was
not present.

On to basic graph traversals.