Comp 363: More Sophisticated Sorts; Divide and Conquer (Ch 5: 5.1-5.4)

Back to recurrences.

Stable sorts:

It sounds like "stable sort" was not defined generally in 271 (a sort is stable if elements with equal keys keep the relative order they had in the input), and the example in the book turns out to be pretty poor, so here is a more complete one:

Consider 2 implementations of selection sort for a linked list.  I use letter suffixes to distinguish different entries with the same value.

unsorted: 4a 5a  2  5b 4b     sorted:

If I remember the location of the last copy of the largest element and pull it out and move it to the beginning of a new sorted list, and keep repeating, I would get

unsorted: 4a 5a 2 4b   sorted: 5b
unsorted: 4a  2 4b   sorted: 5a 5b
unsorted: 4a 2    sorted: 4b 5a 5b
unsorted: 2   sorted: 4a 4b 5a 5b
unsorted:    sorted: 2 4a 4b 5a 5b

5a comes before 5b, and 4a before 4b, as they did in the unsorted list.

If I pull out the *first* maximal element, and otherwise do the same thing, I get

unsorted: 4a 2 5b 4b   sorted: 5a
unsorted: 4a  2 4b   sorted: 5b 5a
unsorted: 2 4b   sorted: 4a 5b 5a
unsorted: 2   sorted: 4b 4a 5b 5a
unsorted:    sorted: 2 4b 4a 5b 5a

The order of the 4's and of the 5's is reversed.  That means this version is not stable.  To be unstable, we only need an example with at least one pair of equal values reversed, not all pairs.  This particular implementation happens to reverse the order of every pair of equal elements.

You can easily show that the first version will be stable in general. 
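
A minimal Python sketch of the stable version, on a list of (value, letter) pairs like the trace above (my illustration, not the book's code):

def stable_selection_sort(items, key=lambda x: x):
    '''Stable selection sort sketch: repeatedly remove the LAST copy of the
    largest remaining element and prepend it to the sorted result.'''
    unsorted = list(items)              # copy, so the input is not destroyed
    result = []
    while unsorted:
        best = 0
        for i in range(1, len(unsorted)):
            if key(unsorted[i]) >= key(unsorted[best]):   # >= finds the LAST max
                best = i
        result.insert(0, unsorted.pop(best))   # move to front of sorted part
    return result

pairs = [(4, 'a'), (5, 'a'), (2, ''), (5, 'b'), (4, 'b')]
print(stable_selection_sort(pairs, key=lambda p: p[0]))
# [(2, ''), (4, 'a'), (4, 'b'), (5, 'a'), (5, 'b')]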

5.1: Mergesort

The basic recursive algorithm should be familiar.  We can do a pretty direct translation; see the particularly concise Python 3.1 code in mergesort.py.
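
For reference, a minimal recursive sketch along those lines (mergesort.py itself may differ in details):

def merge(left, right):
    '''Merge two sorted lists into one sorted list.'''
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:         # <= keeps the merge stable
            result.append(left[i]); i += 1
        else:
            result.append(right[j]); j += 1
    result.extend(left[i:])             # one of these two is empty
    result.extend(right[j:])
    return result

def mergesort(a):
    '''Return a new sorted list; O(n lg n) comparisons in the worst case.'''
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    return merge(mergesort(a[:mid]), mergesort(a[mid:]))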

There are also much more sophisticated algorithms that handle sorted or nearly sorted data better.  Anything recursive can be converted to a stack-based algorithm.  A more elaborate version looks for natural monotonic runs, and decides when to push a run or pop and merge based on the runs' relative lengths.  This is significant for non-random data, where you may have long monotonic runs.  The algorithm used by Python has interesting optimizations that make it efficient both on totally random data and on data with long monotonic runs.  I have copied a description of the algorithm from the Python source repository. (Not required but interesting.)
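
To illustrate just the run idea (Python's actual algorithm, timsort, is far more elaborate), a "natural" mergesort might find maximal nondecreasing runs and merge neighbors until one run remains.  This reuses merge from the sketch above:

def natural_mergesort(a):
    '''Natural mergesort sketch: split into maximal nondecreasing runs, then
    repeatedly merge adjacent runs.  Already-sorted input is one run: O(n).
    (A fuller version would also reverse descending runs.)'''
    if not a:
        return []
    runs = []
    start = 0
    for i in range(1, len(a)):
        if a[i] < a[i - 1]:             # the current run ends at i - 1
            runs.append(a[start:i])
            start = i
    runs.append(a[start:])
    while len(runs) > 1:                # merge pass by pass
        merged = [merge(runs[i], runs[i + 1])
                  for i in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            merged.append(runs[-1])     # odd run out carries over
        runs = merged
    return runs[0]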

5.2: Quicksort

On to quicksort.  Everyone familiar with it?
The worst case, as in the book, is when the pivot always falls at one end or the other; then it is very slow, O(n²).  This O(n²) behavior is always possible with quicksort, even if a larger (but limited) number of elements is examined first to choose the pivot.

Improvements in finding the pivot are to use the middle element, or a random element, or the median of a few elements - simplest is the median of the first, last, and middle.  A larger group can also be used, leading to a secondary problem of finding the median efficiently.  How do you do that?

We discussed in class that you can do a variant on quicksort: after the partition, decide which of the two parts the median must be in, and hence recurse only once at each level.
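
A minimal sketch of that one-sided recursion, often called quickselect (the list-based partition here is my simplification, not the book's in-place code):

import random

def quickselect(a, k):
    '''Return the k-th smallest element of a (k = 0 gives the minimum).
    Partition around a random pivot, then recurse only into the part
    that must contain position k.  Average time O(n).'''
    pivot = random.choice(a)
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    if k < len(less):
        return quickselect(less, k)
    if k < len(less) + len(equal):
        return pivot
    return quickselect(greater, k - len(less) - len(equal))

def median(a):
    '''Lower median via quickselect.'''
    return quickselect(a, (len(a) - 1) // 2)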

See 04AvgQSort.html for the details of the calculation of a bound on the average number of comparisons in Quicksort.  The result is:

A(n) ≤ 2 n ln n, for all n > 0.

In terms of the more common log in algorithms, lg, or log base 2:

A(n) ≤ (2 ln 2) n lg n ≈ 1.386 n lg n

For mergesort, the coefficient is 1, but quicksort is generally faster.

Examples worked out in class: 
1.  approach to median
2.  counting the number of inversions, text problem 5.1 #9 (a merge-based sketch follows this list)
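
For item 2, a hedged sketch of the standard merge-based approach (one way to do it, not necessarily exactly what we did in class): whenever the merge takes an element from the right half, that element jumps ahead of everything remaining in the left half, and each such pair is an inversion.

def count_inversions(a):
    '''Return (sorted copy of a, number of inversions), where an inversion
    is a pair i < j with a[i] > a[j].  O(n lg n), piggybacked on mergesort.'''
    if len(a) <= 1:
        return list(a), 0
    mid = len(a) // 2
    left, inv_left = count_inversions(a[:mid])
    right, inv_right = count_inversions(a[mid:])
    merged, inv = [], inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
            inv += len(left) - i        # right[j] beats all remaining left elements
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inv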

Quicksort Variations and Improvements

It is not essential to separate the pivot from the rest of the elements, as in
AltQSort.cs.  In that form, however, there are technical problems with a proof of the average time.

An algorithm is in place if it requires only O(1) additional memory.

Is the book's quicksort in place?

Easier analysis of the non-recursive version: put the start and end indices of the parts left to do on a stack.

The worst case for time is when one part is always small, but which part goes on the stack first is significant for the stack size. Suppose the first (small) part goes on the stack first; I illustrate:

Initial:

0 n

Pop, split, push back

3 n < top
0 1

Pop, split, push back

6 n < top
3 4
0 1

Pop, split, push back

9 n < top
6 7
3 4
0 1

Pop, split, push back

12 n < top
9 10
6 7
3 4
0 1

How big will the stack get?


Compare if the second part goes on the stack first:

Initial:

0 n

Pop, split, push back

0 1 < top
3 n

Pop 0 1, completely finish this base case

3 n < top


Pop, split, push back

3 4 < top
6 n

Pop 3 4, completely finish

6 n < top


Pop, split, push back

6 7 < top
9 n

How big will the stack get?


How do we make something closer to the latter case always happen?

How big can the stack get then, in general? If we always push the larger part first, so that the smaller part is popped and finished first, the bounds on the entry sizes halve as we go up the stack:

Size n

Size < n
size < n/2

Size < n
size < n/2
size < n/4

Since each entry's size bound is at most half the one below it, the stack size is < lg n + 1.
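
A minimal non-recursive Python sketch of this discipline (my illustration; the book's partition differs in detail):

def partition(a, lo, hi):
    '''Lomuto-style partition sketch: pivot a[hi]; return its final index.'''
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def quicksort_stack(a):
    '''Non-recursive quicksort: the stack holds (lo, hi) pairs of parts
    left to do.  Pushing the larger part FIRST (so the smaller part is
    popped and finished first) keeps the stack below lg n + 1 entries.'''
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue                    # size 0 or 1: nothing to do
        p = partition(a, lo, hi)
        if (p - lo) > (hi - p):         # left part is larger: push it first
            stack.append((lo, p - 1)); stack.append((p + 1, hi))
        else:
            stack.append((p + 1, hi)); stack.append((lo, p - 1))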

Another general feature of divide-and-conquer algorithms is that the method for a large dataset may not be efficient for a small one. In the case of quicksort, a simpler sort makes sense for small datasets. In general it could be any of several simpler sorts, applied to each part that is small enough, but switching to that algorithm takes some time each time. Knuth noted that the time spent by insertion sort is bounded in terms of the maximum distance any element needs to move. After partitioning down to a small size, no element needs to move far, so the insertion sort can be deferred until the end and done in one sweep through the entire array.
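
A hedged sketch of that idea, built on the stack version above (the cutoff value is illustrative, not tuned): stop partitioning once a part is smaller than the cutoff, then finish everything with a single insertion sort pass. Since no element is more than about cutoff positions from its final place, the final pass costs only O(n * cutoff).

def quicksort_with_cutoff(a, cutoff=10):
    '''Quicksort sketch with a deferred insertion sort: parts smaller than
    cutoff are left unsorted, and one insertion sort sweep finishes up.'''
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo + 1 < cutoff:
            continue                    # small part: leave it for the sweep
        p = partition(a, lo, hi)        # partition from the sketch above
        if (p - lo) > (hi - p):         # larger part first, as before
            stack.append((lo, p - 1)); stack.append((p + 1, hi))
        else:
            stack.append((p + 1, hi)); stack.append((lo, p - 1))
    insertion_sort(a)                   # one sweep; nothing moves far

def insertion_sort(a):
    '''Standard insertion sort: O(n * d) if no element is more than d away
    from its final position.'''
    for i in range(1, len(a)):
        x, j = a[i], i
        while j > 0 and a[j - 1] > x:
            a[j] = a[j - 1]
            j -= 1
        a[j] = x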

See my Java code, where you are just comparing integers. It includes the elaboration of choosing the pivot by sorting three elements initially, is non-recursive, limits the stack size, and has a parameter for switching to insertion sort.  I also include a testing class, where you can interactively choose the best value for the parameter giving the size of sequences left for the final insertion sort, plus see the behavior on various sorts of lists. The links in the text here are to .txt files that are more likely to display well in a browser. There are also copies ending in .java in the handouts.

Another issue for homework: The standard partition stops at pivot values in both the upward and the downward sweep, so it ends up swapping pivot values unnecessarily. Why not just swap the pivot values one way?

More Divide and Conquer - Not Sorts

4.4: Binary search

You should be familiar with binary search. The worst case number of 3-way comparisons is floor(lg n) + 1.

Aside: why is a 3-way comparison particularly reasonable? Do you know Java's compareTo or C#'s CompareTo?
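
A minimal sketch using a 3-way comparison function in the style of compareTo (the cmp helper here is my assumption, not a library function): one comparison per level decides found, go left, or go right.

def binary_search(a, target, cmp=lambda x, y: (x > y) - (x < y)):
    '''Binary search sketch with a 3-way comparison: cmp returns a
    negative, zero, or positive value.  Returns an index, or -1.'''
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        c = cmp(target, a[mid])
        if c == 0:
            return mid                  # found
        if c < 0:
            hi = mid - 1                # target must be in the left half
        else:
            lo = mid + 1                # target must be in the right half
    return -1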

Now let's look at the average time for binary search, assuming the element is found, and all are equally likely. To keep things easy, assume n = 2^k − 1.

1st time: the search ends if we are looking for the middle element of the 2^k − 1:
1 element found in 1 comparison

next, problems of size 2^(k−1) − 1:
2 elements found in 2 comparisons

next, problems of size 2^(k−2) − 1:
4 elements found in 3 comparisons

finally, when the sizes are just 1:
2^(k−1) elements found in k comparisons

Adding up and dividing by the total number of possibilities:

avg = [1 + 2·2 + 3·2^2 + 4·2^3 + … + k·2^(k−1)] / (2^k − 1)

The sum is almost the one in the back of the book, #6. If you multiply by 2 (and compensate by dividing by 2), the numerator matches it perfectly, and you get

avg = [(k−1)·2^k + 1] / (2^k − 1)

Adjusting the first part to include a factor of 2^k − 1, to make dividing through by 2^k − 1 easier:

avg = [(k−1)(2^k − 1) + k] / (2^k − 1) = k − 1 + k/(2^k − 1)

The maximum is k, so the average is within 1 of the maximum. For general n, it is within 2.
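
A quick numerical sanity check of the result (my own verification, not from the book):

def avg_comparisons(k):
    '''Exact average for successful search on n = 2^k - 1 elements:
    2^(j-1) of them are found on the j-th comparison, j = 1..k.'''
    n = 2**k - 1
    return sum(j * 2**(j - 1) for j in range(1, k + 1)) / n

for k in range(2, 6):
    n = 2**k - 1
    print(k, avg_comparisons(k), k - 1 + k / n)   # the two values agree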

The book lists this technique under "divide by a constant factor", which is just divide and conquer when only one smaller problem needs to be considered.

Powers

Before 4.1 in Chapter 4, rather than in 4.4, the book has a major example, calculating integral powers:

def power(a, n):
   '''Assume integer n >= 0 for simplicity'''
   if n == 0:
      return 1
   b = power(a, n//2)   # one recursive call, on half the exponent
   if n % 2 == 0:
      return b*b
   return b*b*a         # odd n needs one more factor of a

The recursion goes to depth about lg n, so only O(log n) multiplications are needed.  This is useful in cryptography and various other places.
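
The cryptographic uses actually compute powers modulo m; a sketch of the same idea with a reduction at every step (my addition, not the book's):

def modpower(a, n, m):
    '''Divide-and-conquer power with all products reduced mod m, so the
    intermediate values stay small.  Assumes integers n >= 0 and m >= 1.'''
    if n == 0:
        return 1 % m
    b = modpower(a, n // 2, m)
    if n % 2 == 0:
        return (b * b) % m
    return (b * b * a) % m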

5.4: Multiplication

Large integers: The base 2 version I took from the beginning of DasGupta, chapter 2, is directly applicable to computers, with their shift operations. The base 10 version in Levitin is still completely analogous.

Both the integer algorithm (first page in DG ch 2) and the matrix algorithm (12th page) are useful, and are good examples of applying the Master Theorem. They are not such good models of creative thinking for this class – we are not likely to see their match anywhere! I have no idea how they were dreamed up.
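
A hedged Python sketch of the integer algorithm (Karatsuba); DasGupta's version works on bit strings with explicit shifts, while this one leans on Python's built-in big integers, and it assumes nonnegative operands:

def karatsuba(x, y):
    '''Karatsuba sketch: three recursive multiplications of half-size
    numbers instead of four, giving O(n^lg 3), about O(n^1.585).'''
    if x < 16 or y < 16:                # small numbers: ordinary multiply
        return x * y
    m = max(x.bit_length(), y.bit_length()) // 2
    xh, xl = x >> m, x & ((1 << m) - 1) # split: x = xh*2^m + xl
    yh, yl = y >> m, y & ((1 << m) - 1)
    a = karatsuba(xh, yh)               # high * high
    b = karatsuba(xl, yl)               # low * low
    c = karatsuba(xh + xl, yh + yl)     # = a + b + (cross terms)
    return (a << (2 * m)) + ((c - a - b) << m) + b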

Class problem (DasGupta): Given a sorted array of distinct integers A[1, . . . , n], you want to find out whether there is an index i for which A[i] = i. Give a divide-and-conquer algorithm that runs in time O(log n).

5.5: Geometric Algorithms

These are useful algorithms if you are dealing with geometric situations.   Be aware of them.

Class problem (DasGupta): An array A[1 . . . n] is said to have a majority element if more than half of its entries are the same. Given an array, the task is to design an efficient algorithm to tell whether the array has a majority element, and, if so, to find that element. The elements of the array are not necessarily from some ordered domain like the integers, and so there can be no comparisons of the form “is A[i] > A[ j ]?”. However you can answer questions of the form: “is A[i] = A[ j ]?” in constant time.

(a) Show how to solve this problem in O(n log n) time. (Hint: Split the array A into two arrays A1 and A2 of about half the size. Does knowing the majority elements of A1 and A2 help you figure out the majority element of A? If so, you can use a divide-and-conquer approach.)

(b) Can you give a linear-time algorithm? (Hint: this is more a reduce-by-a-constant-factor algorithm, Ch 4.4.)

Next:  More Sorts and Sort Analysis