## Saturday, August 31, 2013

From time to time I like spending time on some of my favorite problems.  Recently it's been the turn of a good old friend, 3x+1.  I have a few things to comment on.  First, you should check out this book.  There is also this awesome page with tons of information in it.  Eric Roosendaal (quoted in the above mentioned book) passed me the source code for the program he uses to hunt for delay records.  According to Eric, the program had been optimized quite a bit.  I confirm: yes it was optimized quite a bit.  But, you know... optimization and math together are such an inviting and exciting challenge for me that I couldn't help it.  What follows is an experience report.

First, some definitions.  The 3x+1 problem tends to be better studied using the T(x) form as follows:
• T(n) = n/2, n even
• T(n) = (3n+1)/2, n odd
The above definition allows the following beautiful identity.  First, break n by dividing it by a power of two 2^k

n = q2^k + r

using the usual division algorithm, and requiring that the remainder r is always non-negative.  Then,

T^k(n) = T^k(q2^k + r) = q3^j + T^k(r)

where j is the number of times T^k(r) chose the odd branch.  In addition,

T^k(2^k - 1) = 3^k - 1

Isn't that awesome?  Now let's put that identity to good use.  Consider T^3(8q+4):
1. T(8q+4) = 4q+2
2. T(4q+2) = 2q+1
3. T(2q+1) = 3q+2
Now look at T^3(8q+5):
1. T(8q+5) = 12q+8
2. T(12q+8) = 6q+4
3. T(6q+4) = 3q+2
So T^3(8q+4) = T^3(8q+5).  If you're checking for T(n) records, that is, the least values of n such that something interesting happens, then you'd be well served by ignoring all 5 mod 8 numbers because checking 4 mod 8 already takes care of 5 mod 8.

But who said you can only do this with 2^3?  To begin with,  Eric's program calculates a 2^20 sieve based on the awesome identity.  The process consists of finding all j, T^k(r) pairs for all 0 <= r < 2^20, and then keeping the lowest values of r for all pairs that repeat.  Eric's observation for doing this was that, if you are going to get a duplicate pair for some r1, any value of r2 > r1 that results in the same j, T^k(r) pair is not far away from r1.  For example, for the 2^20 sieve, r2 - r1 <= 359.  This distance is called the sieve gap.  Well, since the gap is small and can be calculated once by doing the linear search thing, in practice you just need a circular buffer and then you can recalculate the sieve at will.  Now, obviously, calculating gaps by brute force is incredibly costly.  Since you don't know what the gap is, you are forced to do linear search through all the results seen so far during sieve generation.  This limitation basically forbids checking gaps for sieves much larger than 2^20.  Larger sieves mean doing less work later though, so it's worth looking into this issue.  So, that became my first goal: enable sieve analysis on much larger sieves, and try to make the sieve calculation faster if possible.

The first change was to replace the circular buffer linear search with a hashed table.  The lookup key is the pair j, T^k(r), and the stored value is r (storing r allows to calculate the sieve gap on the fly, which although now unnecessary is still interesting on its own).  I packed all that information in 16 bytes per entry.  Slightly tighter packings are possible at the cost of additional complexity.  In any case, this approach slashed the processing time for the 2^20 sieve from several minutes to 0.2 seconds.  I approximated the final hashed collection size with something that looks like 1/x, so I stopped growths and rehashing for larger sieves while still being efficient space-wise.  This strategy allowed this machine to calculate the full 2^31 sieve using just 6gb of memory and 8 minutes of 5-year old CPU time.  Here are some interesting results (obviously the sieves skip all even numbers).
• 2^20 sieve: gap 359 (e.g. 271105), skip rate 87.371%.
• 2^21 sieve: gap 442 (e.g. 361473), skip rate 87.968%.
• 2^22 sieve: gap 878 (e.g. 774721), skip rate 88.528%.
• 2^23 sieve: gap 878 (e.g. 774721), skip rate 89.053%.
• 2^24 sieve: gap 1210 (e.g. 10905985), skip rate 89.546%.
• 2^25 sieve: gap 2360 (e.g. 21811969), skip rate 90.011%.
• 2^26 sieve: gap 2360 (e.g. 21811969), skip rate 90.449%.
• 2^27 sieve: gap 3146 (e.g. 29082625), skip rate 90.862%.
• 2^28 sieve: gap 3146 (e.g. 29082625), skip rate 91.252%.
• 2^29 sieve: gap 4195 (e.g. 38776833), skip rate 91.621%.
• 2^30 sieve: gap 8237 (e.g. 922014465), skip rate 91.971%.
• 2^31 sieve: gap 8237 (e.g. 922014465), skip rate 92.303%.
So, this new code enables on the fly calculation of sieves much larger than 2^20.  The best part is that, as you can see, going from a 2^20 sieve to a 2^31 sieve makes you work about 7.7% of the time, instead of 12.6% of the time.  In other words, that's about 40% less work!  The extra setup sieve overhead does pay off, because even 1% less work is very significant.  As you can see in Eric's page here, work is distributed in 20*10^12 number chunks.  Each chunk takes on the order of a week with the original program.  1% of a week is much more than the 8 minutes it takes to figure out the 2^31 sieve, so you win.

After the sieve is calculated in full, it can be represented as a bitfield that also skips all even numbers.  So, a 2^31 sieve requires just 2^27 bytes of memory.  But that's not all the sieving that can be done.  We can also skip 3q+2, 9q+4 and 81q+10 numbers precisely because these values can be easily derived from others such as 8q+4.  The original program issued, on average, 5/3 divisions per number checked to remove the first two cases.  In other words, there was code along the lines of

if (x % 3 !=2 && x % 9 != 4)

Integer multiplication has become very fast in modern processors, but division is still comparatively slow.  This is particularly so when operating on largish values.  I removed these divisions, and added the x % 81 != 10 case as well, by using the Chinese Remainder Theorem.  First I rewrote the program's iteration so that instead of scanning a block of numbers in sequential fashion, the program will scan the block using a set of linear combs.  In other words, instead of scanning e.g. [32, 47] like this:

32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47

the program breaks that in classes of equivalence.  If the program was using a 2^2 sieve, it would scan like this:

32, 36, 40, 44
33, 37, 41, 45
34, 38, 42, 46
35, 39, 43, 47

The linear comb traversal reduces the sieve bit field's accesses tremendously.  Imagine scanning a block of 20*10^12 integers by checking every (odd) number against the sieve: that's 10*10^12 sieve bit field operations.  With the linear comb approach, a sieve of 2^k will require exactly 2^{k-1} accesses, regardless of how many numbers are being checked.  Since generally k is much less than 40, these computation savings are significant.  On top of this, I also used a streaming approach for the sieve.  Thus, the sieve bits are only accessed once every 64 bits.

The next thing now is to use the Chinese Remainder Theorem.  If we make 9 copies of the 2^k sieve, then every single combination of 2^k, 3 and 9 moduli will appear in the sieve.  Similarly, if we make 81 copies, the sieve will cover every combination of 2^k, 3, 9 and 81 moduli.  Then, instead of making 81 copies, we just keep a running set of mod 3, 9 and 81 counters (which are incremented with an addition and a compare to reset their tiny values as needed, instead of requiring a division of a number with at least 50 bits these days) and cycle through the sieve 81 times.  Rough speed measurements and estimates suggest this change alone deleted a significant double digit percentage fraction off the execution time in the program.

The original program used a mix of 64 and 32 bit counters.  I rewrote all that using 64 bit arithmetic only.  When multiprecision numbers are needed (T(n) can grow as large as about n^2), the new program uses 60 bit digits instead (except the top digit, which is used in full as possible).  This allows up to 3 overflow bits (see below) to accumulate safely without needing to continuously check for overflow, and also allows a simplified base 10 printing routine to work by using the 4 bit overflow area.

Both programs use a 4 way switch to determine which computation to use.
• n = 4q+0: shift 2 bits to the right.
• n = 4q+1: calculate 3q+1 by doing n/2 + n/4 + 1, in integers.
• n = 4q+2: calculate 3q+2 by doing n/2 + n/4 + 1, in integers.
• n = 4q+3: calculate 9q+8 by doing 2n + n/4 + 2, in integers (and check against potential overflow and switch to multiprecision arithmetic if needed).
That is, the program calculates more than one iteration of T(n) in each step.  As you can see, cases 4q+1 and 4q+2 are handled the same way.  Hence, the C switch has 3 cases: 3, 2 and 1, and 0.  Alas, the original program did not have default cases in the switches, and GCC 4.2 wasn't smart enough to realize the switches were handling every x & 3 case that way (maybe some spec detail prevents GCC from applying such smarts, or in practice it's handled as a case of garbage in, garbage out).  The result is that, according to C semantics, the compiler generated code for what would happen if x & 3 wasn't handled in a switch.  After adding a default case to the switches, the compiler removed a bunch of cmp, jmp, ja etc instructions from many places.  A 1060 byte function lost about 50 bytes' worth of those unnecessary instructions, and the program became about 7% faster too.

Disassembly of the GCC emitted code does not necessarily suggest that rewriting the calculation code directly in e.g. 64 bit Intel assembler will result in very significant speedups at this time because the parts that might benefit from e.g. adc instructions aren't that long.  Moreover, using assembler would allow using 64 bit digits even during multiprecision calculations.  However, the resulting code would need numerous e.g. adc instructions, whereas now the code just deals with numerous such instructions with just two instructions: and, and shl/shr.  In addition, the current C code is portable, while the assembly code is not.

Speaking of the and instructions, I also wrote the code carefully so the arguments always fit within a byte.  Consequently, the compiler can use the compact Intel encodings.  I did the same with the sieve hashed table packing by choosing where to put what chunks of information, as well as with the streaming sieve accesses (both reads during use, as well as with writes during creation).  I also ensured the single and multiprecision computation order allowed the compiler to use the lea instruction when possible.  Using lea also eliminates the need for some add, shl, and mov instructions.

I automated the GCC builds with makefiles.  In fact, I also automated the process of running a first compile of the program with instrumented branches, then recompiling the program using the resulting information.  Doing so gained another 7% or so.

In addition, as of a while ago, I also multithreaded the main calculation section of the program using POSIX threads.  Using 4 threads and a 2^30 sieve, I estimate the program will now be able to crunch through 60*10^12 numbers in a bit less than 2 days.  Compare that to the following:
• First version of the new program, single threaded, 2^20 sieve size, 60*10^12 numbers: about 11 days' CPU time.
• Estimated original program, single threaded, 2^20 sieve size, 20*10^12 numbers: about 7 days' CPU time.
Bang!

So what's next?  The program still needs a rigorous shakeout and a significant amount of audit to ensure it is actually correct in every case, despite the fact that it already can reproduce previously verified results.  We'll see what after that work is complete.  More to come...