From time to time I like spending time on some of my favorite problems. Recently it's been the turn of a good old friend, 3x+1. I have a few things to comment on. First, you should check out this book. There is also this awesome page with tons of information in it. Eric Roosendaal (quoted in the above mentioned book) passed me the source code for the program he uses to hunt for delay records. According to Eric, the program had been optimized quite a bit. I confirm: yes it was optimized quite a bit. But, you know... optimization and math together are such an inviting and exciting challenge for me that I couldn't help it. What follows is an experience report.
- T(n) = n/2, n even
- T(n) = (3n+1)/2, n odd
using the usual division algorithm, and requiring that the remainder r is always non-negative. Then,
where j is the number of times T^k(r) chose the odd branch. In addition,
Isn't that awesome? Now let's put that identity to good use. Consider T^3(8q+4):
- T(8q+4) = 4q+2
- T(4q+2) = 2q+1
- T(2q+1) = 3q+2
- T(8q+5) = 12q+8
- T(12q+8) = 6q+4
- T(6q+4) = 3q+2
But who said you can only do this with 2^3? To begin with, Eric's program calculates a 2^20 sieve based on the awesome identity. The process consists of finding all j, T^k(r) pairs for all 0 <= r < 2^20, and then keeping the lowest values of r for all pairs that repeat. Eric's observation for doing this was that, if you are going to get a duplicate pair for some r1, any value of r2 > r1 that results in the same j, T^k(r) pair is not far away from r1. For example, for the 2^20 sieve, r2 - r1 <= 359. This distance is called the sieve gap. Well, since the gap is small and can be calculated once by doing the linear search thing, in practice you just need a circular buffer and then you can recalculate the sieve at will. Now, obviously, calculating gaps by brute force is incredibly costly. Since you don't know what the gap is, you are forced to do linear search through all the results seen so far during sieve generation. This limitation basically forbids checking gaps for sieves much larger than 2^20. Larger sieves mean doing less work later though, so it's worth looking into this issue. So, that became my first goal: enable sieve analysis on much larger sieves, and try to make the sieve calculation faster if possible.
The first change was to replace the circular buffer linear search with a hashed table. The lookup key is the pair j, T^k(r), and the stored value is r (storing r allows to calculate the sieve gap on the fly, which although now unnecessary is still interesting on its own). I packed all that information in 16 bytes per entry. Slightly tighter packings are possible at the cost of additional complexity. In any case, this approach slashed the processing time for the 2^20 sieve from several minutes to 0.2 seconds. I approximated the final hashed collection size with something that looks like 1/x, so I stopped growths and rehashing for larger sieves while still being efficient space-wise. This strategy allowed this machine to calculate the full 2^31 sieve using just 6gb of memory and 8 minutes of 5-year old CPU time. Here are some interesting results (obviously the sieves skip all even numbers).
- 2^20 sieve: gap 359 (e.g. 271105), skip rate 87.371%.
- 2^21 sieve: gap 442 (e.g. 361473), skip rate 87.968%.
- 2^22 sieve: gap 878 (e.g. 774721), skip rate 88.528%.
- 2^23 sieve: gap 878 (e.g. 774721), skip rate 89.053%.
- 2^24 sieve: gap 1210 (e.g. 10905985), skip rate 89.546%.
- 2^25 sieve: gap 2360 (e.g. 21811969), skip rate 90.011%.
- 2^26 sieve: gap 2360 (e.g. 21811969), skip rate 90.449%.
- 2^27 sieve: gap 3146 (e.g. 29082625), skip rate 90.862%.
- 2^28 sieve: gap 3146 (e.g. 29082625), skip rate 91.252%.
- 2^29 sieve: gap 4195 (e.g. 38776833), skip rate 91.621%.
- 2^30 sieve: gap 8237 (e.g. 922014465), skip rate 91.971%.
- 2^31 sieve: gap 8237 (e.g. 922014465), skip rate 92.303%.
After the sieve is calculated in full, it can be represented as a bitfield that also skips all even numbers. So, a 2^31 sieve requires just 2^27 bytes of memory. But that's not all the sieving that can be done. We can also skip 3q+2, 9q+4 and 81q+10 numbers precisely because these values can be easily derived from others such as 8q+4. The original program issued, on average, 5/3 divisions per number checked to remove the first two cases. In other words, there was code along the lines of
Both programs use a 4 way switch to determine which computation to use.
- n = 4q+0: shift 2 bits to the right.
- n = 4q+1: calculate 3q+1 by doing n/2 + n/4 + 1, in integers.
- n = 4q+2: calculate 3q+2 by doing n/2 + n/4 + 1, in integers.
- n = 4q+3: calculate 9q+8 by doing 2n + n/4 + 2, in integers (and check against potential overflow and switch to multiprecision arithmetic if needed).
Disassembly of the GCC emitted code does not necessarily suggest that rewriting the calculation code directly in e.g. 64 bit Intel assembler will result in very significant speedups at this time because the parts that might benefit from e.g. adc instructions aren't that long. Moreover, using assembler would allow using 64 bit digits even during multiprecision calculations. However, the resulting code would need numerous e.g. adc instructions, whereas now the code just deals with numerous such instructions with just two instructions: and, and shl/shr. In addition, the current C code is portable, while the assembly code is not.
- First version of the new program, single threaded, 2^20 sieve size, 60*10^12 numbers: about 11 days' CPU time.
- Estimated original program, single threaded, 2^20 sieve size, 20*10^12 numbers: about 7 days' CPU time.