Nevertheless, both of this month's problems can be solved without resorting to questionable means.
The trick to Part 1 is to find another data structure, not an array, that can store many values, and can be initialised in constant time.
A list is such a structure: suppose you want to store a list of n values using an array, you can place all values in the first n positions of the array. To add an element, simply place it after the last used place and increase n. To empty the list, simply set n to zero.
What we want to do is to use such a list to indicate which positions in our simulated array have already been assigned a value. We will use a separate array, an array of reals, to store the actual values, as would be done with a normal array of real values, but when we need to read the value of a particular array position we will first consult with our list of assigned indices. If the position queried is not in the list then this position of the array has never been written to, so we will return for it the default value on a GET_VALUE call. If it is in the list, we'll simply use the value that is in the array of reals.
The problem with this plan is that checking if an element is in a list takes O(n) time. To speed this up, we use yet another auxiliary array, this one containing hints of where to find particular elements in our list. If we place at list position L the value P to indicate that P is a written-to position of our array, our auxiliary array will store at its P position the value L, to help us find the correct list element in constant time.
A full implementation of an initialiseable array class is given, in Python, here.
The algorithm was first described by Aho, Hopcroft and Ullman in The Design and Analysis of Computer Algorithms.
As in the case of the constant-time-initialiseable array, knowing that this is possible is a big step towards a solution. In fact, the biggest hint towards a solution is that it is not just the median that is found in linear time. The k-statistic (the k-smallest element) can be found in O(n) time independently of k. The method we will describe takes k as an additional input.
The basic method we wish to employ -- which was known long before Blum et al.'s paper -- is divide-and-conquer. Namely, we will choose a pivot point, p. Then, in O(n) steps move all elements smaller than p to one side of the array and the remainder to the other. This will tell us, as a side-effect, what statistic p is in the array. Given this information, we now know in which of the two sides our target statistic resides, and what statistic it is within the sub-array.
Continuing now recursively, we can find our target element.
For all this to happen in linear time, we must ensure that each sub-array, after each pivoting, has at least a constant fraction of the number of elements in the original (before pivoting) array.
Blum et al.'s innovation was in a good way to choose a pivot. The method is known as Median-of-Medians, and ensures that each sub-array contains at least 30% of the total.
The way to do this is to divide the array into sub-arrays of size 5 each and find for each of these (in constant time) its median. The median of the resulting array, of size n/5, will satisfy this property. And this we can find recursively.
Altogether, we claim that this algorithm works in at most cn time for some constant c.
To see this, consider that the two recursive calls require at most 70% cn and 20% cn time, respectively. The rest of the execution time, which is clearly linear, accounts for the remaining 10%.
A full Python implementation can be found here.
Why the magic number 5? No reason. Other constants would have worked as well, and I have a suspicion that 9 would work even better than 5.
Both algorithms described in this month's riddle are classic algorithms, and can be found as building blocks inside many other algorithms. However, despite my personal best attempts, I have not yet found even a single case where either one of these algorithms is of any practical use. While their time complexities are wonderful, in practice they are always slower than simpler alternative algorithms, with worse complexities but better time constants. (This is true whether you choose 5 or 9 as your magic number.)
Back to riddle
Back to main page