If you've ever used file names with a numerical component (such as
"FILE1.TXT" through "FILE1000.TXT") you know this problem: when listing the
content of a directory or otherwise trying to access these files sequentially,
you find that "FILE10.TXT", for example, is before "FILE2.TXT". The reason
for this is that file names are generally sorted lexicographically, not
according to their numerical values. Here is a
description of how lexicographic sorting works to find the order in which any
two non-identical digit strings, A and B, should be placed:
The problem is that if we use the straightforward encoding of numbers to digit
strings, with 1 becoming "1", 2 becoming "2" and 1000 becoming "1000", the
resulting lexicographic sorting differs from sorting by numerical values.
Normally, what one does to overcome this problem is to switch to a different
encoding scheme where both sorting criteria coincide.
For example, if we were to encode the number
1 by the digit-string "0001" and likewise pad all numbers with leading zeros
to four-digit length, we will find that "FILE0001.TXT" through "FILE1000.TXT"
are all sorted properly now.
- First, if either A or B is the empty string, it is first.
- Otherwise, the order of A and B is according to the
numerical value of their first digit.
- If their first digits are identical, their order is what it would have been
had the first digit been omitted from both strings (recursively).
This encoding has two major flaws:
This month's question: describe an encoding of numbers into
digit strings that preserves numerical ordering and
should work for all naturals 1 and up, and both the encoding and decoding
algorithms should be reasonably easy to implement.
- It can only be used for numbers in a limited range (in this case, 0 through
9999). We want an encoding that can be used for all naturals, 1 and up.
- It can bloat some of the numbers (in this case, the small ones)
considerably. Let's say that the length of the original number, n,
in its standard decimal representation, is U(n), and that the
length of its encoded version is E(n). We can say that
R(n) is the bloating factor. We know that, in terms of orders of
magnitude, U(n) is of length O(log(n)). We'd like
R(n) to be less than that.
As a bonus question, prove that the encoding you found has an
R(n) of minimal complexity.
List of solvers:
Oded Margalit (2 March 04:41)
Hongcheng Zhu (2 March 13:48)
Omer Angel (3 March 01:38)
Mark Tilford (3 March 05:10)
Itsik Horovitz (3 March 20:41)
Albert Stadler (3 March 21:00)
Ross Millikan (6 March 03:06)
Brian Kemper (8 March 09:43)
Rani Hod (19 March 05:07)
Ananda Raidu (30 March 17:19)