### About 3x+1

From time to time I like spending time on some of my favorite problems. Recently it's been the turn of a good old friend, 3x+1. I have a few things to comment on. First, you should check out this book. There is also this awesome page with tons of information in it. Eric Roosendaal (quoted in the above mentioned book) passed me the source code for the program he uses to hunt for delay records. According to Eric, the program had been optimized quite a bit. I confirm: yes, it was optimized quite a bit. But, you know... optimization and math together are such an inviting and exciting challenge for me that I couldn't help it. What follows is an experience report. Throughout, T is the usual accelerated version of the 3x+1 map:

- T(n) = n/2, n even
- T(n) = (3n+1)/2, n odd
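
A minimal C sketch of T and its iterates (function names are mine, not Eric's) is handy for checking the identities that follow:

```c
#include <stdint.h>

/* One step of the accelerated Collatz map T. */
static uint64_t T(uint64_t n)
{
    return (n & 1) ? (3 * n + 1) / 2 : n / 2;
}

/* k applications of T. */
static uint64_t T_pow(uint64_t n, int k)
{
    while (k-- > 0)
        n = T(n);
    return n;
}
```

For example, T_pow(7, 3) returns 26, which is the k = 3 case of the identity T^k(2^k - 1) = 3^k - 1 discussed below.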

Now write any n as

n = q2^k + r

using the usual division algorithm, with the remainder r satisfying 0 <= r < 2^k. Then,

T^k(n) = T^k(q2^k + r) = q3^j + T^k(r)

where j is the number of times the iteration T^k(r) took the odd branch. In particular, for r = 2^k - 1 every one of the k steps takes the odd branch, so

T^k(2^k - 1) = 3^k - 1

Isn't that awesome? Now let's put that identity to good use. Consider T^3(8q+4):

- T(8q+4) = 4q+2
- T(4q+2) = 2q+1
- T(2q+1) = 3q+2

Now compare with T^3(8q+5):

- T(8q+5) = 12q+8
- T(12q+8) = 6q+4
- T(6q+4) = 3q+2

Both trajectories land on 3q+2 after three steps, and both take the odd branch exactly once, so numbers of the form 8q+5 never need to be checked separately: they behave exactly like 8q+4 from here on.

But who said you can only do this with 2^3? To begin with, Eric's program calculates a 2^20 sieve based on the awesome identity. The process consists of finding the (j, T^k(r)) pair for every 0 <= r < 2^20, and then keeping only the lowest r among those that produce the same pair. Eric's observation was that if some r1 is going to get a duplicate pair, then any r2 > r1 that results in the same (j, T^k(r)) pair is not far away from r1. For the 2^20 sieve, for example, r2 - r1 <= 359. This distance is called the sieve gap. Since the gap is small and can be calculated once by linear search, in practice you just need a circular buffer and then you can recalculate the sieve at will.

Now, obviously, calculating gaps by brute force is incredibly costly: since you don't know the gap in advance, you are forced to do a linear search through all the results seen so far during sieve generation. This limitation basically forbids checking gaps for sieves much larger than 2^20. Larger sieves mean doing less work later, though, so it's worth looking into this issue. So that became my first goal: enable sieve analysis on much larger sieves, and make the sieve calculation faster if possible.
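
As a toy illustration of the pair computation (my sketch, not Eric's code), each residue's (j, T^k(r)) pair can be computed and packed into a single 64 bit key like this:

```c
#include <stdint.h>

/* Compute j (number of odd branches) and T^k(r) for residue r under a
   2^k sieve, packed into one 64 bit key.  Sketch only: the packing
   assumes T^k(r) fits in 56 bits, which holds comfortably for small k. */
static uint64_t sieve_key(uint64_t r, int k)
{
    uint64_t j = 0;
    for (int i = 0; i < k; i++) {
        if (r & 1) { r = (3 * r + 1) / 2; j++; }
        else       { r /= 2; }
    }
    return (j << 56) | r;
}
```

With k = 3, sieve_key(4, 3) and sieve_key(5, 3) produce the same key, which is exactly the 8q+4 / 8q+5 collision shown earlier.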

The first change was to replace the circular buffer linear search with a hashed table. The lookup key is the (j, T^k(r)) pair, and the stored value is r (storing r allows calculating the sieve gap on the fly, which, although now unnecessary, is still interesting on its own). I packed all that information in 16 bytes per entry; slightly tighter packings are possible at the cost of additional complexity. In any case, this approach slashed the processing time for the 2^20 sieve from several minutes to 0.2 seconds. I approximated the final hashed collection size with something that looks like 1/x, so I could stop growths and rehashing for larger sieves while still being efficient space-wise. This strategy allowed this machine to calculate the full 2^31 sieve using just 6 GB of memory and 8 minutes of 5-year-old CPU time. Here are some interesting results (obviously the sieves skip all even numbers).
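
The hashed table can be as simple as an open-addressing array keyed by the packed pair, storing the first r seen for it; returning the stored r lets the caller compute the gap on the fly. This is a hypothetical sketch of the idea, not the actual implementation:

```c
#include <stdint.h>

#define TAB_BITS 16
#define TAB_SIZE (1u << TAB_BITS)

static uint64_t keys[TAB_SIZE], vals[TAB_SIZE];
static uint8_t  used[TAB_SIZE];

/* Insert (key, r); if the key was already present, return the earlier r
   (so r - old_r is a gap candidate), else return UINT64_MAX. */
static uint64_t table_put(uint64_t key, uint64_t r)
{
    /* Fibonacci hashing: multiply and keep the top TAB_BITS bits. */
    uint32_t h = (uint32_t)((key * 0x9E3779B97F4A7C15ULL) >> (64 - TAB_BITS));
    while (used[h]) {
        if (keys[h] == key)
            return vals[h];
        h = (h + 1) & (TAB_SIZE - 1);   /* linear probing */
    }
    used[h] = 1; keys[h] = key; vals[h] = r;
    return UINT64_MAX;
}
```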

- 2^20 sieve: gap 359 (e.g. 271105), skip rate 87.371%.
- 2^21 sieve: gap 442 (e.g. 361473), skip rate 87.968%.
- 2^22 sieve: gap 878 (e.g. 774721), skip rate 88.528%.
- 2^23 sieve: gap 878 (e.g. 774721), skip rate 89.053%.
- 2^24 sieve: gap 1210 (e.g. 10905985), skip rate 89.546%.
- 2^25 sieve: gap 2360 (e.g. 21811969), skip rate 90.011%.
- 2^26 sieve: gap 2360 (e.g. 21811969), skip rate 90.449%.
- 2^27 sieve: gap 3146 (e.g. 29082625), skip rate 90.862%.
- 2^28 sieve: gap 3146 (e.g. 29082625), skip rate 91.252%.
- 2^29 sieve: gap 4195 (e.g. 38776833), skip rate 91.621%.
- 2^30 sieve: gap 8237 (e.g. 922014465), skip rate 91.971%.
- 2^31 sieve: gap 8237 (e.g. 922014465), skip rate 92.303%.

After the sieve is calculated in full, it can be represented as a bitfield that also skips all even numbers. So, a 2^31 sieve requires just 2^27 bytes of memory. But that's not all the sieving that can be done. We can also skip 3q+2, 9q+4 and 81q+10 numbers precisely because these values can be easily derived from others such as 8q+4. The original program issued, on average, 5/3 divisions per number checked to remove the first two cases. In other words, there was code along the lines of

```c
if (x % 3 != 2 && x % 9 != 4)
```

Integer multiplication has become very fast in modern processors, but division is still comparatively slow, particularly when operating on largish values. I removed these divisions, and added the x % 81 != 10 case as well, by using the Chinese Remainder Theorem. First I rewrote the program's iteration so that, instead of scanning a block of numbers sequentially, the program scans the block using a set of linear combs. In other words, instead of scanning e.g. [32, 47] like this:

```
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
```

the program breaks that into equivalence classes. If the program was using a 2^2 sieve, it would scan like this:

```
32, 36, 40, 44
33, 37, 41, 45
34, 38, 42, 46
35, 39, 43, 47
```

The linear comb traversal reduces accesses to the sieve bit field tremendously. Imagine scanning a block of 20*10^12 integers by checking every odd number against the sieve: that's 10*10^12 sieve bit field operations. With the linear comb approach, a 2^k sieve requires exactly 2^{k-1} accesses, regardless of how many numbers are being checked. Since k is generally much less than 40, the savings are significant. On top of this, I also used a streaming approach for the sieve, so the sieve bit field is read just once per 64-bit word.
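
The comb order can be sketched like this (toy code with names of my choosing; lo is assumed aligned to 2^k):

```c
#include <stdint.h>
#include <stddef.h>

/* Scan [lo, hi] in residue classes mod 2^k (the linear comb order),
   writing the visit order into out[] and returning the count.  The
   single sieve lookup per class would go at the top of the outer loop. */
static size_t comb_scan(uint64_t lo, uint64_t hi, int k, uint64_t *out)
{
    size_t m = 0;
    uint64_t step = 1ULL << k;
    for (uint64_t r = 0; r < step; r++)      /* one class per iteration */
        for (uint64_t n = lo + r; n <= hi; n += step)
            out[m++] = n;
    return m;
}
```

Running comb_scan(32, 47, 2, out) reproduces the 2^2 scan order shown above.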

The next thing is to use the Chinese Remainder Theorem. If we make 9 copies of the 2^k sieve, then every combination of the 2^k, 3 and 9 moduli will appear in the sieve. Similarly, if we make 81 copies, the sieve will cover every combination of the 2^k, 3, 9 and 81 moduli. Then, instead of actually making 81 copies, we just keep a running set of mod 3, 9 and 81 counters (incremented with an addition and a compare-and-reset of their tiny values, instead of a division of a number that these days is at least 50 bits wide) and cycle through the sieve 81 times. Rough speed measurements and estimates suggest this change alone shaved a double-digit percentage off the program's execution time.
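
A sketch of the running counters under my own naming (the skip conditions are the 3q+2, 9q+4 and 81q+10 classes from above; the step is 2 because only odd numbers are visited):

```c
#include <stdint.h>

struct crt { uint32_t m3, m9, m81; };

/* One-time initialization: the only divisions ever performed. */
static void crt_init(struct crt *c, uint64_t n)
{
    c->m3 = n % 3; c->m9 = n % 9; c->m81 = n % 81;
}

/* Advance n by 2 (next odd number): just an add and a compare-reset. */
static void crt_step2(struct crt *c)
{
    c->m3  += 2; if (c->m3  >= 3)  c->m3  -= 3;
    c->m9  += 2; if (c->m9  >= 9)  c->m9  -= 9;
    c->m81 += 2; if (c->m81 >= 81) c->m81 -= 81;
}

/* 1 if n can be skipped: n = 3q+2, 9q+4 or 81q+10. */
static int crt_skip(const struct crt *c)
{
    return c->m3 == 2 || c->m9 == 4 || c->m81 == 10;
}
```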

The original program used a mix of 64 and 32 bit counters. I rewrote all that using 64 bit arithmetic only. When multiprecision numbers are needed (T(n) can grow as large as about n^2), the new program uses 60 bit digits rather than full 64 bit ones (except the top digit, which is used as fully as possible). This allows up to 3 overflow bits (see below) to accumulate safely without continuously checking for overflow, and it also allows a simplified base 10 printing routine that works by using the 4 bit overflow area.
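
The digit layout can be illustrated with a toy addition routine (my sketch; the real code defers normalization to exploit the spare bits, whereas this normalizes every step):

```c
#include <stdint.h>

#define DIGIT_BITS 60
#define DIGIT_MASK ((1ULL << DIGIT_BITS) - 1)

/* Add two little-endian numbers of n 60-bit digits into r.  Assuming
   both inputs are normalized (each digit <= DIGIT_MASK), the per-digit
   sum cannot overflow 64 bits, and the carry is extracted with just
   an and plus a shr. */
static void mp_add(uint64_t *r, const uint64_t *a, const uint64_t *b, int n)
{
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t s = a[i] + b[i] + carry;
        r[i]  = s & DIGIT_MASK;        /* and  */
        carry = s >> DIGIT_BITS;       /* shr  */
    }
    /* carry out of the top digit is ignored in this sketch */
}
```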

Both programs use a 4-way switch to determine which computation to use:

- n = 4q+0: shift 2 bits to the right.
- n = 4q+1: calculate 3q+1 by doing n/2 + n/4 + 1, in integers.
- n = 4q+2: calculate 3q+2 by doing n/2 + n/4 + 1, in integers.
- n = 4q+3: calculate 9q+8 by doing 2n + n/4 + 2, in integers (and check against potential overflow and switch to multiprecision arithmetic if needed).
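
A single-precision sketch of that switch (overflow handling omitted; the function name is mine):

```c
#include <stdint.h>

/* Advance by the 4-way switch on n mod 4 described above. */
static uint64_t T_switch(uint64_t n)
{
    switch (n & 3) {
    case 0:  return n >> 2;               /* T^2(4q)   = q    */
    case 1:  return n / 2 + n / 4 + 1;    /* T^2(4q+1) = 3q+1 */
    case 2:  return n / 2 + n / 4 + 1;    /* T^2(4q+2) = 3q+2 */
    default: return 2 * n + n / 4 + 2;    /* T^2(4q+3) = 9q+8 */
    }
}
```

For instance, T_switch(11) gives 26, matching the trajectory 11 -> 17 -> 26.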

Disassembly of the GCC-emitted code does not suggest that rewriting the calculation code directly in, say, 64 bit Intel assembler would yield very significant speedups at this time, because the sequences that might benefit from adc instructions aren't that long. Assembler would also allow using 64 bit digits during multiprecision calculations; however, the resulting code would need numerous adc instructions, whereas the current code handles carries with just two instructions: an and plus a shl/shr. In addition, the current C code is portable, while assembly code is not.

Speaking of the and instructions, I also wrote the code carefully so the arguments always fit within a byte; consequently, the compiler can use the compact Intel encodings. I did the same with the sieve hashed table packing, by choosing where to put what chunks of information, as well as with the streaming sieve accesses (both reads during use and writes during creation). I also ensured the single and multiprecision computation order allowed the compiler to use the lea instruction when possible. Using lea also eliminates the need for some add, shl, and mov instructions.

I automated the GCC builds with makefiles. In fact, I also automated the process of running a first compile of the program with instrumented branches, then recompiling the program using the resulting information. Doing so gained another 7% or so.

In addition, as of a while ago, I also multithreaded the main calculation section of the program using POSIX threads. Using 4 threads and a 2^30 sieve, I estimate the program will now be able to crunch through 60*10^12 numbers in a bit less than 2 days. Compare that to the following:

- First version of the new program, single threaded, 2^20 sieve size, 60*10^12 numbers: about 11 days' CPU time.
- Estimated original program, single threaded, 2^20 sieve size, 20*10^12 numbers: about 7 days' CPU time.

**Bang!**

So what's next? The program still needs a rigorous shakeout and a significant amount of auditing to ensure it is actually correct in __every__ case, despite the fact that it can already reproduce previously verified results. We'll see what comes after that work is complete. More to come...