Sets - Purely Functional Data Structures

A set is one of the most fundamental data structure. It allows to keep a unique, unordered enumeration of elements.

Ordered Lists

We could use an ordered list to implement a set data structure.

The problem: these are still linear.

Binary Trees

Sets can be implemented using the list as the underlying data structure, but important operations such as member will be performed in linear time. We can attempt to improve complexity by using binary trees. As we don't have to make any assumptions regarding the ordering of data, we can use recursive binary division to find and alter the data structure in logarithmic time. In order to achieve the maximum benefit of binary trees, we need to make them balanced. This aim of this section is to get you started, so we will not go into binary tree variations like AVL or Red-Black. If you are not familiar with the principles behind binary trees, the principle is quite simple: there is a root element which contains a value, the values on the left hand are always smaller to that value and values on the right are always greater than that value.

Our data type Set will be defined in terms of data type Tree. The tree can either be empty (Null) or have node (Node). The node itself contains the value and two subtrees on the left and right (which are also of type Tree, because we are dealing with recursion). If the node is the last one in the whole tree, then left (or right) will evaluate to Null.

With these data types, we can already start thinking about the instance of SetSpec. An empty set, will simply be Null since there are no trees. If we want to create a single value (without worrying about left and right trees), we can use singleton x. This function will create a node which contains only the value x and both subtress would be Null. Finally, we can also calculate the size of the tree, which simply recursively enumerates through through the subtrees on the left and on the right of the root element (+1 for the actual root element).

Before we go into the glory details of how to implement various operations, we need to consider how to change the list representation to a binary tree one. After all, we don't even have a way of combining two trees together (like ++ for lists). Let us, therefore, define a function cat which takes two trees and puts them together with the values in the left subtree are lesser than the values in the right subtree. Joining two trees requires a parent node. What if we don't have one? The solution is to pick the extreme value from one of the trees - minimum from the right subtree (used in this example) or maximum from the left tree. Once we extracted the value, we also need to destroy this node. The function min takes a tree and returns that minimum value. delMin takes a tree and traverses through to left branches to find the minimum element. This is done, by looking for a Null value. Once encountered, we simply replace it with the Tree on the right-hand side.

Now that we know how to merge two (ordered) trees, we can implement member, add and delete. The implementation of the member is quite simple: look for the Node which has that specific value by comparing value a with the current Node value b.

Adding element to a tree is also quite simple. We traverse through the tree until we find appropriate node to add the node.

Finally, the delete function. This is where we make use of the previously defined cat function. We traverse through the tree in the search for the node which contains a value. When the value is found we apply the cat function which simply joins the two trees by looking for an minimum value in the right hand tree.

Now that the basic operations are pretty efficient, let's see how we can implement operations such as union. Merging an ordered list is quite simple: just compare the head of each list, but in a binary tree the root doesn't really convey all the information we need. Instead what we need to do is take the root of one tree r and split the other tree into two: a tree of values before r (i.e. less then) and beyond (i.e. greater than) r.

There are of course two base cases: if one of the trees is Null, then the other value is returned. The inductive case creates a new tree Node y' b z' with before and beyond functions.

Yey! So that's that... Well... Kind of... There is one small detail: we need to ensure that the trees are balanced, otherwise, this solution may not work out so well. There is a trade-off between the benefits of balance and the costs of rebalancing the trees.

In order to avoid computing the size of each tree, we will cache this value (this is what the Int in the definition stands for. We will limit the ratio of sizes of sibling subtrees thereby limiting the maximum difference in height.

We will create a smart constructor which will

A smart constructor (unlike the standard "dumb" one), carries out the extra computation.