## December 2008 solution

This month's riddle reaped a large number of different solutions.

All solutions received this month had in common that the basic algorithm sorted non-negative integers, rather than general integers. For this reason, solutions typically included a preparation phase, where the input was made non-negative. This can be done as follows in O(n):

```def sorting(array):
if len(array)==0:
return []
minimum=min(array)
nnarray=[x-minimum for x in array]
return [x+minimum for x in sorting_nonnegative(nnarray)]
```
(For clarity, code shown throughout is in Python shorthand, but it should be clear how to implement all commands in the desired complexity in the target computational model.)

Another commonality to all solutions is that they utilized the fact that manipulation of large integers can be done in O(1). They did this by packing the information from many integers into a single integer, then used arithmetic on the new integer as a form of performing parallel computation on all the integers that served as its source. We will call this packing procedure "encoding" and its reverse "decoding".

At this point, solutions divided into two major approaches, depending on the basic type of encoding used. One way was to encode the set of numbers {xi} by the cumulative OR of 1<<xi. I call this "the set-based approach". The other way was to encode the vector of values x0, ..., xn-1 by the sum of xi2i*width, where width is chosen to be a large enough number so that "max(xi)" is a number that has at most width bits. This is "the vector-based approach".

The set-based approach further requires, at its core, uniqueness of the set elements, so most solutions kept a count of the number of times each integer appears in the input array. One way to do this is by use of an additional array of size "max(array)+1". This is similar to what is done in Bucket sorting.

The set-based approach lends itself naturally to very elegant solutions, as soon as one learns the basics of how to use it. For example, note that "B&-B" is a way to calculate the position of the least '1' in "B". Using this simple function, one can encode the original set of numbers in O(n), then read them out, smallest to largest, again in O(n). This solution is implemented here.

Using vector encoding the basic parallel function is "ge" (greater or equal). Given two arrays, (ai) and (bi), we wish to calculate, in O(1), the vector that is 1 in every position where aibi and 0 everywhere else. To do this, consider first that if A is the integer encoding (ai) and B is the integer encoding (bi), then A+B is the integer encoding their element-by-element sum. Furthermore, if we wish to calculate their element by element difference, we can do so by first calculating the integer corresponding to (-bi). This can be done using the 2's-complement formula: "-b=~b+1". In order to do this for an entire vector, we simply need to switch the "1" for an all-ones vector that we will pre-encode in advance. After that, "~B+ones_vector" yields the desired result.

The only problem with "~B+ones_vector" is that the addition may introduce unwanted overflows between vector elements. To avoid this, we choose the width parameter of the vector to be at least one larger than is needed to store the largest integer in the vector. This gives us an additional "flag" bit. After an addition, it is a carry bit. After subtraction, it can be used as a sign bit. Furthermore, at any time we can zero it by AND-ing with an appropriate mask. For this purpose we will pre-encode two additional vectors, "flagmask_vector" and "datamask_vector". The former will have ones only in the flag-bit position, the latter only in the data-bit positions.

Pulling all this together, we can subtract two vectors and read the sign of the result, giving us the desired greater-or-equal function:

```def ge(a,b,ones_vector,datamask_vector,flagmask_vector,width):
return ((a+(~b & datamask_vector)+ones_vector) & flagmask_vector) >> (width-1)
```
Here is an example solution given the vector-based approach that uses "ge" to sort in O(n). The way it works is by first encoding the input array, then comparing it to cyclic shifts of itself. In n-1 such comparisons, each element is compared to all other elements. By summing the result of the comparisons, we know how many elements are smaller or equal to each element. This tells us where to place each element in the sorted array.

Both solutions above can be improved considerably. They are presented as they are for clarity. As an example, width, in the vector-based solution, can be chosen to be as small as floor(log2 m)+2, where m is the maximum value we may wish to store in the vector. Instead, we pick width to be much larger, namely m+1, just so as not to go into unnecessary details. Given a few optimizations, the real-life performance of the vector-based solution above is comparable to that of Bubble sorting (where, by "real life performance" I mean the computational model in which addition, subtraction, bit-shifts and bitwise operations all take time-complexity proportional to the length of their inputs and outputs).

This demonstrates an important property of the vector-based solutions: what they lack in elegance they make up for in practicality and wide-applicability. A good demonstration of this is the vector-based solution sent by Sen Gu. It is a sorting algorithm imported directly from parallel computation theory. Using the parallelized "ge", this solution implements in O(1) a single round of Odd-Even Transposition Sort, a classic sorting algorithm which is a parallelized version of Bubble sort. The entire sort takes n such rounds and so can be performed in O(n). Sen Gu simply substituted the O(n) simultaneous comparisons normally used in Odd-Even sort with a single vector comparison.

In all solution examples given above, the key ingredient in achieving the desired parallelism is a mix of bitwise operations and arithmetic operations. The last solution I present here comes from Yingjie Xu, and is worth seeing because it demonstrates that this mix in operation types is not an essential ingredient. This solution uses addition, subtraction and bit-shifts only. Using these, it counts how many values are smaller/larger than a specific element, doing so in the set-based approach. It is available here.

One last point: I was told by one reader that the fact that we allow unlimited integers here is a trick, and that the sorting procedures I ask people to come up with would not work in the more realistic case of bounded integers. This is true. If you want to sort n bounded integers in O(n) time, use Radix sort.