First, you will note that "ones_vector", "flagmask_vector" and "datamask_vector" have all been replaced by functions that compute their values on the fly in O(1). These are "ones", "flagmask" and "datamask", respectively. The idea is that each of these is computable as a geometric sum

*C* + *C**2^{width} + ... + *C**2^{(size-1)width}

for which the closed-form formula is well known. I've also added the convenience function "mask", which returns a number of the desired length that is all ones in binary. With it, one can compute, for example, the "flagmask" of the previous solution.
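In Python, these O(1) helpers can be sketched as follows. (This is a sketch only; in particular, the placement of the flag bit at the top of each width-bit slot is an assumption about the layout, following the spirit of the previous solution.)

```python
def mask(bits):
    # `bits` ones in binary: 2**bits - 1
    return (1 << bits) - 1

def ones(size, width):
    # the geometric sum 1 + 2**width + ... + 2**((size-1)*width),
    # via its closed form (2**(size*width) - 1) / (2**width - 1)
    return mask(size * width) // mask(width)

def flagmask(size, width):
    # the top bit of every slot (flag-bit placement is an assumption)
    return ones(size, width) << (width - 1)

def datamask(size, width):
    # every bit of every slot except the flag bit
    return flagmask(size, width) ^ mask(size * width)
```

For instance, "ones(3, 4)" is 0x111 and "flagmask(3, 4)" is 0x888.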

Second, the function "ge" (greater or equals) is now complemented by a second function, "lt" (less than). Once we know how to compute "ge", additional comparators, such as "lt" or "equals", are trivial extensions of it.
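One standard way to build such comparators is the borrow trick. The sketch below assumes each element keeps its top bit clear for use as a flag; the details of the actual "ge" may differ.

```python
def mask(bits):
    return (1 << bits) - 1

def ones(size, width):
    return mask(size * width) // mask(width)

def flagmask(size, width):
    return ones(size, width) << (width - 1)

def ge(a, b, size, width):
    # flag bit set in slot i exactly where a_i >= b_i; assumes every
    # element fits in width-1 bits, so the subtraction cannot borrow
    # across slot boundaries
    f = flagmask(size, width)
    return ((a | f) - b) & f

def lt(a, b, size, width):
    # "less than" is just the elementwise negation of "ge"
    return ge(a, b, size, width) ^ flagmask(size, width)
```

Here, comparing [4, 9, 7] against [7, 7, 7] at width 5, "ge" raises the flags of the last two slots and "lt" the flag of the first.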

Next, and more importantly, the first solution presented had a rather awkward
handling of numbers repeated in the input. This solution uniquifies the
input beforehand by multiplying all numbers by a large enough factor (here,
the size of the input) and adding a counter. This ensures that the numbers
are distinct modulo *n*, and are hence distinct. After sorting, we can
retrieve the original numbers by simple integer division by *n*.
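On a plain Python list, the uniquification round-trip looks like this:

```python
def uniquify(arr):
    # multiply by n and add a counter: the results are distinct modulo n
    # (each has a different remainder), and are hence distinct
    n = len(arr)
    return [x * n + i for i, x in enumerate(arr)]

def original(arr, n):
    # integer division by n recovers the original values
    return [x // n for x in arr]
```

For example, "uniquify([5, 3, 5, 2])" gives [20, 13, 22, 11]; sorting and dividing by 4 yields [2, 3, 5, 5].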

Lastly, and most notably, the comparison by a different cyclic shift in each
iteration has been replaced. We are still performing all O(*n*^{2})
comparisons, and are still doing so in O(*n*) iterations, but the order
is now different. Now, in the *i*'th iteration we compare all array
members to the *i*'th element in the input array. To do this, we compare
the vector encoding the original array with a vector representing *n*
copies of the *i*'th element. This is, again, an easy-to-compute
geometric series: it is simply "ones(size,width)*array[i]".

As in the original solution, we sum over the results of all comparisons between
"array[i]" and all other elements. Here this means taking the sum over all
the elements in an entire vector (the vector of comparisons computed in the
*i*'th step). Importantly, summing over the elements of an entire vector is
tractable in O(1) using the following observation:

*X*_{0}*2^{0*width} + *X*_{1}*2^{1*width} + ... + *X*_{size-1}*2^{(size-1)*width} =
*X*_{0} + *X*_{1} + ... + *X*_{size-1} (mod 2^{width}-1)

In our case, all we need to do is take the integer housing the result
of the "less than" operation modulo 2^{width}-1. As long as
we know that the total sum is less than 2^{width}-1, this
produces the same answer as summing the elements individually.
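As a sketch (valid only while the true sum is below 2^{width}-1):

```python
def vector_sum(v, width):
    # sum the elements of a packed vector: since 2**width = 1 modulo
    # 2**width - 1, every slot folds onto the units position
    return v % ((1 << width) - 1)
```

Packing [3, 1, 2] at width 8 and taking the result modulo 255 gives 6.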

Once we know how many numbers are smaller than "array[i]", we know where to
put it in the sorted array, completing this implementation of O(*n*)
"sorting".

In trying to parallelize this algorithm, the first problem that arises is that we packed the input array a little too well. By choosing "width" to be "int(log(max(maximum,1))/log(2))+1", we left no room for a "flag" bit, so it isn't possible to use "ge" as-is. This was actually a hint for readers to look for a "change_width" function, by which I mean a function, working in O(1), that takes an input

*a*=*a*_{0}+*a*_{1}*2^{width}+
... + *a*_{size-1}*2^{(size-1)width}

and returns the output

*a*'=*a*_{0}+*a*_{1}*2^{width'}+
... + *a*_{size-1}*2^{(size-1)width'}

where all *a*_{i} are in the range 0≤*a*_{i}<2^{min(width,width')}.

It turns out that this function is not difficult to create. If
*width*'≥(*size*+1)**width*, we can multiply the original
array by "ones(size,width'-width)". This places copies of the original array
at regular intervals, chosen so that a single additional AND masks
out the superfluous elements and leaves us with the desired array.

If we want to choose a new width that is much narrower than the original width
(if *width*≥*width*'**size*, to be exact), we can do so
even more easily, by reducing the original array modulo
"mask(width-new_width)". This is very similar to the trick we used when summing
array elements with a modulo operation, only this time the modulus is much
larger, because we want each summand to be an entire vector, not a single element.

This solves two extreme cases: making "width" extremely wide or extremely narrow. To solve all the intermediate cases, we simply widen the array to an intermediate width "temp_width=(size+1)*max(width,width')", then narrow it back again. The full implementation of "change_width" (as well as all of the code for this month's solution) is available in Python code here.
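In sketch form, "change_width" might look like this. (I use the slightly stricter narrowing condition *width*≥(*size*+1)**width*' so that the folded vector is always strictly below the modulus; the linked code is the authoritative version.)

```python
def mask(bits):
    return (1 << bits) - 1

def ones(size, width):
    return mask(size * width) // mask(width)

def change_width(a, size, width, new_width):
    # re-encode a packed vector from width-bit to new_width-bit slots;
    # every element is assumed to fit in min(width, new_width) bits
    if new_width == width:
        return a
    if new_width >= (size + 1) * width:
        # widening: the product places copies of a at intervals of
        # new_width - width bits; the AND keeps only the diagonal copies
        spread = a * ones(size, new_width - width)
        return spread & (mask(width) * ones(size, new_width))
    if width >= (size + 1) * new_width:
        # narrowing: 2**width = 2**new_width modulo 2**(width-new_width) - 1,
        # so the modulo folds each element onto its new position
        return a % mask(width - new_width)
    # intermediate cases: widen to a temporary width, then narrow back
    temp = (size + 1) * max(width, new_width)
    return change_width(change_width(a, size, width, temp), size, temp, new_width)
```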

As a side note, I'd like to mention that "change_width" is an extremely powerful
function to have. A person using it together with addition, subtraction and
bit shifts can implement any bitwise binary function. Proving this is left as
an exercise to the reader. Another powerful consequence of having
"change_width" is that we never need to use arrays again. We have seen that
integers that house vector-encoded arrays (which we'll henceforth refer to
simply as "vectors") carry the same information as the original arrays, and
it's not difficult to see that O(1) "set" and "get" functions can be implemented
for these, making element access to vectors as easy as to the original arrays.
So far, however, we had the problem of bounded element size: a person wanting
to set an array position to a value greater than 2^{width}-1 would have been
stumped by the limits of the vector encoding. With "change_width", we can
re-encode the vector to a new width at any time, making all usage of arrays
completely superfluous: vectors do everything arrays do, and - as we have seen
in the previous month's riddle - they can also do much more.

Now, we can begin parallelizing our existing solution. The original first step
was to
multiply all array elements by *n* as part of the uniquification process.
We replace this by a multiplication by the larger value 2^{n}.
This makes both the multiplication and the subsequent division very easy: each
is accomplished by a simple bit shift.

Next we need to add a counter. This is equivalent to adding the number

0+1*2^{width}+ ...
+(*size*-1)*2^{(size-1)width}

which, too, is a number easy to compute in O(1) if one remembers one's geometric series. The function "counting" calculates this number in the code.
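One way to get this in O(1) is the standard closed form for the sum of *i***x*^{i} with *x*=2^{width}; a sketch:

```python
def counting(size, width):
    # 0 + 1*2**width + ... + (size-1)*2**((size-1)*width), via the closed
    # form x*(1 + (size-1)*x**size - size*x**(size-1)) / (x-1)**2;
    # the division is exact, so integer division is safe
    x = 1 << width
    return x * (1 + (size - 1) * x ** size - size * x ** (size - 1)) // (x - 1) ** 2
```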

This handles uniquification of the input array. Next, we want to perform all
O(*n*^{2}) comparisons together. We do so by producing two vectors,
"repeated_array" and "wide_array". To make the explanation clearer, we will
follow it with an example. Let us say that the original unique array is
[4, 9, 7]. "repeated_array" will be [4, 9, 7, 4, 9, 7, 4, 9, 7], whereas
"wide_array" will be [4, 4, 4, 9, 9, 9, 7, 7, 7]. "repeated_array" is easy
to produce by making copies of the entire array through multiplication by
the appropriate "ones" vector. "wide_array" can be produced by first using
"change_width" to widen it into [4, 0, 0, 9, 0, 0, 7, 0, 0], then multiplying
by a "ones" vector.
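The construction can be sketched as follows (self-contained, with a sketch of "change_width" inlined; conditions as discussed in the text):

```python
def mask(bits):
    return (1 << bits) - 1

def ones(size, width):
    return mask(size * width) // mask(width)

def change_width(a, size, width, new_width):
    # re-encode between slot widths (a sketch; elements are assumed
    # to fit in min(width, new_width) bits)
    if new_width == width:
        return a
    if new_width >= (size + 1) * width:
        return (a * ones(size, new_width - width)) & (mask(width) * ones(size, new_width))
    if width >= (size + 1) * new_width:
        return a % mask(width - new_width)
    temp = (size + 1) * max(width, new_width)
    return change_width(change_width(a, size, width, temp), size, temp, new_width)

def repeated_and_wide(a, size, width):
    # repeated_array: size copies of the whole array,
    # e.g. [4, 9, 7] -> [4, 9, 7, 4, 9, 7, 4, 9, 7]
    repeated = a * ones(size, size * width)
    # wide_array: widen to [4, 0, 0, 9, 0, 0, 7, 0, 0], then smear every
    # element over its empty slots: [4, 4, 4, 9, 9, 9, 7, 7, 7]
    wide = change_width(a, size, width, size * width) * ones(size, width)
    return repeated, wide
```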

Having performed the "less than" operation, we now need to sum, for each
array element, all the comparison results relevant to it. So, once we have
compared where "wide_array" is less than "repeated_array", we have, in our
example, the vector [0, 1, 1, 0, 0, 0, 0, 1, 0], and we want to produce from it
the vector [0, 2, 1] by summing into position *i* all the elements whose
original positions are *i* modulo 3. (The result is
[0+0+0, 1+0+1, 1+0+0].) In the original solution, we calculated each of these
values separately by use of a modulo. To calculate all of them together we
simply use a larger modulo, instead. Taking the modulo by "mask(size*width)"
folds the length-9 vector into the appropriate length-3 vector.
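The fold itself is a single modulo operation; a sketch (valid while each position's sum still fits in its slot):

```python
def fold(v, size, width):
    # sum into position i all elements whose position is i modulo size:
    # 2**(size*width) = 1 modulo 2**(size*width) - 1, so every block of
    # size elements folds onto the first block
    return v % ((1 << (size * width)) - 1)
```

Folding the packed [0, 1, 1, 0, 0, 0, 0, 1, 0] indeed gives the packed [0, 2, 1].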

The final step in the original sort is array look-up. The original line is
"sorted_array[num_less_than]=array[i]". When performing this in parallel what
we really need is a function that receives two vectors, *a* and *b*,
in our case [4, 9, 7] and [0, 2, 1], and produces a vector, *result*,
such that *result*[*b*[*i*]]=*a*[*i*].
This sounds more
difficult than it really is. First, let's use our counter to create the vector
[0, 1, 2] and multiply it by a "ones" vector to create the repeated version
[0, 1, 2, 0, 1, 2, 0, 1, 2]. We will also need the "wide" version of vector
*b*, which is [0, 0, 0, 2, 2, 2, 1, 1, 1]. Now, we find where these two
vectors are equal ([1, 0, 0, 0, 0, 1, 0, 1, 0]), use this as a mask into the
original "wide_array" (the "wide" version of vector *a*), and get the
vector [4, 0, 0, 0, 0, 9, 0, 7, 0]. Lastly, a simple modulo operation folds
this into the desired sorted array [4, 7, 9].
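The complete look-up step, sketched end to end. (Self-contained; widths are assumed to leave every slot's top bit free for comparison flags, and the counter is built with a loop here for brevity, although "counting" does it in O(1).)

```python
def mask(bits):
    return (1 << bits) - 1

def ones(size, width):
    return mask(size * width) // mask(width)

def change_width(a, size, width, new_width):
    # re-encode between slot widths (a sketch; elements are assumed
    # to fit in min(width, new_width) bits)
    if new_width == width:
        return a
    if new_width >= (size + 1) * width:
        return (a * ones(size, new_width - width)) & (mask(width) * ones(size, new_width))
    if width >= (size + 1) * new_width:
        return a % mask(width - new_width)
    temp = (size + 1) * max(width, new_width)
    return change_width(change_width(a, size, width, temp), size, temp, new_width)

def ge(a, b, size, width):
    # flag bit set in slot i exactly where a_i >= b_i
    f = ones(size, width) << (width - 1)
    return ((a | f) - b) & f

def lookup_store(a, b, size, width):
    # produce result such that result[b[i]] = a[i]
    n2 = size * size
    counter = sum(i << (i * width) for i in range(size))      # [0, 1, ..., size-1]
    rep_c = counter * ones(size, size * width)                # e.g. [0,1,2,0,1,2,0,1,2]
    wide = lambda v: change_width(v, size, width, size * width) * ones(size, width)
    wide_a, wide_b = wide(a), wide(b)                         # "wide" versions of a and b
    eq = ge(rep_c, wide_b, n2, width) & ge(wide_b, rep_c, n2, width)  # equality flags
    full = (eq >> (width - 1)) * mask(width)                  # expand flags to full-slot masks
    return (wide_a & full) % mask(size * width)               # mask, then fold
```

On the example vectors *a*=[4, 9, 7] and *b*=[0, 2, 1] at width 5, this returns the packed [4, 7, 9].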

Again, the full code for this solution is available here.

(Incidentally, this was "reverse" array look-up, but clearly implementing the
"forward" version, *result*[*i*]=*a*[*b*[*i*]], is
equally straightforward.)

What is the "real life" complexity of the algorithm? Well, that depends on what
you mean by "real life".
This algorithm works in the desired O(1) time in the described computational
model. It does so by using only a fixed number of atomic operations.
If we wish to consider the complexity of the algorithm under more stringent
computational models we should therefore consider the time it takes to accomplish
the "heaviest" of these operations. This is (arguably)
the multiplication of a vector housing *n* numbers
with a vector housing *n*^{2} numbers. If one were to perform this
on vectors with minimized values of *width* (which we don't), it would have
been possible to compute the result of this single atomic operation by
O(*n*^{3}) multiplication/addition operations on integers of
size comparable to that of the integers in the input array. I would normally
refer to this
refer to this
as an O(*n*^{3}) operation, but, of course, the actual complexity
is a function of the computational model used and different people will prefer
to use different computational models, so there is no absolute answer to this
question.