Algorithms, Web Page 5
More on Sorting:
HeapSort, Radix Sort, Sort Comparison, Sort Analysis
Back to More Sorts and Divide
and Conquer.
Heapsort (6.4)
In class: Heap definition, reason for shape - array implementation.
Assume D is the depth of the heap; n elements; D = ceiling(lg
(n+1))
FixHeap: if have a heap, you know the largest element, and can
remove it.
Easy way to make a new heap?
Plug the hole with the last element, fix the order with
fixheap
comparisons with i left in
heap: 2 lg i + O(1):
total
|
[2 lg i + O(1)]
= 2n lg
n + O(n)
|
Initial heap creation with heapify: fix from bottom up
number of key comparisons?
obviously element placed
in <= 2D
steps so all in n*2D = 2n lg n
but D steps possible in only 1
case; look
closer
We could also create the heap from top down, keeping on adding one more
element at the bottom, which may need to bubble up. Each step in
bubbleUp only takes one comparison to a parent, rather than the 2 in
fixHeap to settle a parent down. All n nodes might need to be
fixed, and the maximum number of comparisons for any one is now 1*lg
n. Which approach is better?
A java version is below. A similar Python version is here.
public class Heapsort
{
public static void heapsort(int[ ] data, int n){
// sort first n elements of data array via
heapsort; 0
<= n < data.length
// first make heap
for (int r = (n-2)/2; r
>= 0; r -= 1)
//
work backwards from last parent node = (last index - 1)/2
heapify(data, r, n);
// sort heap by swapping
biggest element to end of list,
// and
make heap again of the rest
while (n > 1){
n -= 1;
int
temp = data[n]; // swap last unsorted element with top
data[n] = data[0];
data[0] = temp;
heapify(data, 0, n);
}
}
private static void heapify(int[ ] data, int root,
int n)
// let out-of-place element at index root
// settle down
through tree to make a
legal heap
// start assuming any children of root are at tops
of
legal heaps
// all indices of heap elements have index
< n
{
int left = 2*root + 1;
while(left < n) {
// while root is not a leaf
int
right = left+1;
int
rootdata = data[root]; // need a copy for swaps
if
(right == n) { // right past end, only one leaf
if (data[left] > rootdata) {// swap if smaller at root
data[root] = data[left];
data[left] = rootdata;
}
return; // already fixed child, which is the final leaf
}
else
{ // two leaves
int bigger = right; // find bigger of two children
if (data[left] > data[bigger])
bigger = left;
if (rootdata >= data[bigger])
return; // OK, now is legal
heap below
// else root out of place
data[root] = data[bigger]; // swap root with bigger
child
data[bigger] = rootdata;
root = bigger; // descend in heap for next loop
left = 2*root + 1;
}
}
}
There are a variety of online animations for all the basic sorts, for
example
https://www.toptal.com/developers/sorting-algorithms
Count comparisons (comp.) in heap creation:
For simplicity assume complete tree. n = 2D+1
- 1:
depth D 2D
elements move 0 levels,
0 comp. max for each, tot 0
depth D-1 2D-1
elements move 1 levels,
1*2 comp. max for each, tot 2*1*2D-1
depth D-2 2D-2
elements move 2 levels,
2*2 comp. max for each, tot 2*2*2D-2
depth D-3 2D-3
elements move 3 levels,
3*2 comp. max for each, tot 2*3*2D-3
...
depth 0 2D-D
elements move D
levels, D*2 comp. max for each, tot 2*D*20
grand total
|
[2D2d
- 2d2d]
= 2D(2D - 1) - 2*[(D-2)2D
+ 2] = 4*2D
- 2D - 4 = 2n - O(lg n)
|
!! Smaller order than n lg n
Total comparisons for sort, to build heap and take apart is 2n lg n +
O(n)
Creating the heap with the bubbleUp approach - could that worst case
actually be O(n), too? No: the bottom most full row by
itself has about n/2 elements, and each might have to bubble up
through lg n - 1 levels, producing at least Θ(n lg n).
Fine point - Accelerated Heapsort - can skip:
Accelerated Heapsort, accelerated fixHeap after max of the heap is
deleted, and the nth element needs to be inserted somewhere. This
is a curious variation on binary search:
Call the height of the tree h = ⌈lg(n+1)⌉
guess the nth element to be inserted fits in d =
h/2 steps
below the root, promote larger of two siblings until that far down
(1 comparison per level, h/2 total)
if go too low, bubble up at most h/2
levels, total
h/2 down, h/2 up = h
else do accelerated heapsort
on subtree half
way down, of height h/2
worse case is when it needs to go all the way to the bottom, down h
levels, and after each promotion, checking to see if you are done
requires a second comparison on that level, one more than accounted for
so far. The worse case involves promotions all the way to the
bottom, calling promotion at most lg h times, so the worst case total
number of comparison to delete the maximum element and fix the heap is
h + lg h ≈ lg(n+1) + lg lg (n+1) times
Repeating this extraction of the max for all n items in the heap, the
time is <= n * comparisons for one deletion = n (lg n
+
lg lg n) + O(n)
This sort can be done in place: all recursion can be removed,
since the action follows a single branch down the tree each time.
In accelerated heapsort the number of comparisons in deleteMax is
always at least h/2, whereas in the original heapsort, there could be
fewer. Still I believe the accelerated version improves the
average as well as the worst case, since the element to be inserted are
always taken from the bottom of the heap, and most of the time the
insertion will be more than 1/2 way down the tree, when the original
verison would have done more comparisons than the maximum possible with
the accelerated version.
(This fixHeapFast could also be used in the initial heap
generation.
It is less important, because this does not effect the highest order
term. Also, in this case the comment made about average
behavior
of
the original and accelerated fixHeap do not apply completely:
The
elements to be bubbled down are random, not ones originally on the
bottom of the heap, so they are not likely to sink down as
far.
Still,
since half the elements are within one level of the bottom, they are
still likely to go pretty far down.)
The worse case number of data moves is comparable to the
number
of comparisons.
A quicksort with a highly improved pivot finding routine might do fewer
data moves (A lower bound for the average with
quicksort
would be with the pivot in the middle and half the elements on the
wrong side, requiring only half the elements to be moved at each level,
or a total of .5n lg n with the original partition routine.)
I do not find it productive to analyse theoretically past here, where
the relative times for comparison, data moves, and index moves on a
particular machine all effect the leading coefficient of n lg
n.
At some point it is time to do test runs with realistic data in the
real environment the program is going to run.
Bucket Partition (not in book)
This approach take a list, preferably a linked list, and in O(n)
time splits the list into an array of sublists (easiest if each sublist
is also a linked list), basing the array index of the sublist on a
corresponding part
of each key. The name comes from the idea of physically tossing
keys into one of a number of buckets.
If partitioning is based on a portion of the key that can have integer
values 0, ... k-1:
Create or clear an array A of k empty lists.
For each key in the original list:
calculate the part of the key examined.
Suppose it has integer value i
add the key to sublist A[i]
Since array lookup is O(1), one pass, partitioning a whole list into
sublists is O(n).
For instance ASCII string keys can be partitioned into an array of 128
lists, where each key is put at the end of the sublist whose array
index is the binary code in the most significant byte of the key.
This can be used as a first step combined with other sorting methods if
the part of the key chosen is the most
significant part. After the sublists are sorted, concatenate
sublists A[0], A[1], ... A[k-1]. It can also be turned into a
full sort (radix sort below), using one section of the keys after
another, but in an order that you might not guess:
Radix sort (not in book)
The approach involves repeated bucket partitions using different
portions of the key. A key is needed that is (or can be converted
to) an integer of fixed size. This obviously works
with size limited integer types. It ls works with size limited
ASCII
strings padded with nulls or blanks, ordered by the underlying codes,
if the raw binary data is treated as an integer. For
instance with a
3 byte key, you could partition on any of 6 different
4 bit sections. A partition on one 4-bit section would result in
an array of 24 = 16 sublists.
One naive way to use only bucket partitions for a complete sort is to
take sublists which have the same common
starting part of the key and partition them into smaller lists based on
the next portion
of the keys until you reach lists of length 1, and then concatenate the
lists back together. This recursively generates an enornmous
unwieldy collection of lists that you must track.
Instead radix sort follow an alternate, efficient but
non-intuitive approach.
Radix sort starts by partitioning on the least
significant part of the key first and recombining all the lists immediately
after each completed partition.
Radix sort, given list L
For the least significant through most significant portion of the key:
Bucket partition L into array A
concatenate the sublists in A in order of the
array indices to create a revised list L
The final version of L is sorted.
For instance if keys were 5
digit
numbers, the sort could first be on the 1's digit, then the 10's digit,
... and finally the most significant digit, the 10,000's
digit.
The only requirement to have this work is that the splitting and
recombining operations make keys that go in the same bucket end up in
the same relative order as the started (i.e. be stable).
Here is a small example, partitioning at individual digits. For
simplicity there are only 3 possible values in each place, and 3 digits:
Initial data
|
first pass
|
second pass
|
last pass
|
combined
|
123
222
312
113
231
111
333
121
|
231
111
121
+
222
312
+
123
113
333
|
111
312
113
+
121
222
123
+
231
333
|
111
113
121
123
+
222
231
+
312
333
|
111
113
121
123
222
231
312
333 |
Thus if there are m pieces of the key on which partitions are done
for a list of n items, the sort is of order O(mn). In various
texts radix sort is referred to as a form of bucket sort. It is
my term, bucket
partition, which is analogous to the partition in quicksort, but
with more parts, as only a step of some sorts.
Proof that radix sort works
After partittion i, the list is correctly sorted, in a stable sort, on
the last i portions of the key.
Basis: After one
partition, the list is directly sorted on the one portion, and key with
the same last portion are added to the sublist in the same order as
they appeared in the original list, so the sort is stable.
Induction step:
Suppose partitions and merges are done on the last i portions of
the key, and the resulting list is sorted on the last i portions of the
key. Show the next sort will produce a stable sort on the last
i+1 portions of the key. Elements that differ in the latest part
of the key considered will clearly be in order. Elements with the
latest part of the key the same appear in the same order as they were
in the previous list, when they were in the proper order by the
induction hypothesis.
Done.
Radix sort works well for massive datasets that must be stored in the
file system (assuming keys are not too long).
Counting Sort (7.1)
If a large lecture has a quiz with possible integer scores 0-10, how
might you sort the scores? Easier than mergesort or quicksort
would be just counting the number of students with each
possible score. You need an array with indices 0-10 (length 11),
initialized to 0's. To count each quiz, you just increment the
array element corresponding to the score. After you have the
counts, put the calculated number of 0's at the beginning of the sorted
grade array, then the calculated number of 1's, .... In
general, if all the values are know lie in a range low <= value <
low + m, and there are
n numbers in the list, the time for the sort is O(m+n).
This algorithm is an example where a separate subsidiary data
structure is introduced that speeds up the algorithm and requires extra
storage so it is a form of space-time
tradeoff. We look further
into space-time tradeoffs in various dynamic data structures.
Decision trees and a lower bound for many sorts (11.2)
An alternate source to the text section 11.2 is DasGupta ch 2 8th page,
showing that any sort where the only
use of keys is to test if key1 < key2, then whatever
the sorting method, the number of these comparisons is bounded below by
an expression with highest order term n lg n. This is the same
leading term as with mergesort (not just order, but also the same
constant
multiple).
Questions:
- If a list of 5 digit integers of length n are sorted using
radix
sort on the digits, the algorithm is O(5n) , which is less than n lg n
for large n. Is this a contradiction with the lower bound above?
- If a list of numbers is sorted with counting sort, the order is
O(m+n). In particular, if m=n, this is O(n). Is this a
contradiction?
Sort Comparisons
Be aware of the advantages and disadvantages of each sort! Expect questions on this!
Selection Sort
+ short code
+ low overhead for very small datasets
+ reqires limited moving of
data
- bad for large datasets
- never better than O(n2)
Insertion Sort
+ short code
+ low overhead for small datasets
+ best case O(n)
- worse case O(n2) is common
Quicksort:
+ generally the fastest key comparison sort on average
+ essentially in place
- worst case O(n2)
- best case only O(n lg n)
Mergesort:
+ worse case is still fast, n lg n + O(n) comparisons
+ some implementations have O(n) best case (Timsort)
+/- quite fast on average
+ if the list happens to be implemented as a mutable linked list, no
extra storage needed
+ works via files as last steps for datasets that are too big for memory
- need O(n) extra storage in array implementations
Heapsort
+ reasonable fast in worst and average cases, n lg n + O(n) comparisons
+ in place in array
- best case still n lg n
Radix sort
+ extremely fast O(n) if keys short
+ space efficient if data in linked list or for a massive dataset using
files for storage
- not useful if keys long
- needs O(n) extra storage if in array in memory.
Counting sort
+ O(n) when it is useful
- Only useful when keys in a small known range that is O(n).
- Uses extra space based on the possible range of values.
Examples. Choose an appropriate sort:
- linked list with long keys
- linked list with relatively small numeric keys
- in array, with real time bound on allowed time, space constraints
- list only slightly out of order for sure
- list often not far out of order, but sometime quite random
- sorting the letters in a book
- short list but with large items of data to sort.
- long keys, data too big for memory.
On to Amortizing for dynamic data
structures