I’ve recently had the terrible idea to start looking at the Collatz conjecture on a hunch that there may be an interesting equivalent statement in proof theory. While looking through the Wikipedia article, I found an elegant algorithm that finds Collatz sequences for odd numbers represented as binary strings. The part about repeatedly dividing by two caught my eye, and the whole thing looked like a fun optimization exercise to see if I could beat a C compiler for once.
Of course, this is not a novel trick — it comes down to the `BSF` instruction, and I’m sure the writers of the serious Collatz search programs are using even more advanced solutions. This post is purely a fun record of my own optimization.

## Collatz Iteration
The following is a simple Collatz iteration function that, given any number n, will return the length of n’s Collatz sequence.
The actual number of iterations is irrelevant; we really just care that n eventually reaches 1, and how quickly this happens (faster is better). If n never reached 1, the program would enter an infinite loop and we would have successfully refuted the Collatz conjecture. This is, however, impossible, as every single `unsigned long` (64-bit representable) number has already been experimentally verified to reach 1.
```c
int collatzBasic(unsigned long n){
    int rv = 0;
    for(; n != 1; rv++){
        n = n % 2 ? 3 * n + 1 : n / 2;
    }
    return rv;
}
```
One common optimization is to use a shortcut version of the Collatz function, based on parity results, that converges faster by immediately dividing the odd case by 2 — since 3n + 1 is always even when n is odd, the two steps can be fused into one.
```c
int collatzShortcut(unsigned long n){
    int rv = 0;
    for(; n != 1; rv++){
        n = n % 2 ? (3 * n + 1) / 2 : n / 2;
    }
    return rv;
}
```
Now for a bad implementation of the Wikipedia algorithm for odd numbers. Why divide by two once when you can spam it until you can’t any more?
```c
int collatzLoopCut(unsigned long n){
    int rv = 0;
    for(; n != 1; rv++){
        n = 3 * n + 1;
        while(n % 2 == 0){
            n = n / 2;
        }
    }
    return rv;
}
```
The insight here is that this repeated division by two is really just throwing out all trailing zeros in the binary representation of n, and that this can be done extremely efficiently with the right x86 assembly instructions. In particular, `BSF` (Bit Scan Forward) finds the index of the lowest set bit — which equals the number of trailing zeros — feeding into a `SHR` (Shift Logical Right) that removes all of those zeros in a single shift.
```c
int collatzLoopCutAsm(unsigned long n){
    int rv = 0;
    for(; n != 1; rv++){
        n = 3 * n + 1;
        asm(
            "bsfq %0, %%rcx \n" /* rcx = number of trailing zeros of n */
            "shrq %%cl, %0  \n" /* shift them all out at once */
            : "+r"(n)           /* read-write: in and out share one register */
            :
            : "%rcx"
        );
    }
    return rv;
}
```
Finally, it’s time to test how much of an improvement this is. I add some benchmarking boilerplate that counts how long it takes to sum the lengths of the Collatz sequences of the first half million odd numbers.
```c
#include <stdio.h>
#include <time.h>

#define BENCH(C){\
    unsigned long cycles = 0;\
    clock_t s = clock();\
    for(unsigned long i = 1; i < 1000000; i += 2)\
        cycles += C(i);\
    double e = (double)(clock() - s) / CLOCKS_PER_SEC;\
    printf("%s, %lu cycles, %lf secs\n", #C, cycles, e);\
}

int main(){
    BENCH(collatzBasic)
    BENCH(collatzShortcut)
    BENCH(collatzLoopCut)
    BENCH(collatzLoopCutAsm)
}
```
Compiling on my hardware with GCC at various optimization levels, I did 5 runs and took the lowest run-time (in seconds) for each configuration.
| Algo | O0 | O1 | O2 | O3 |
| --- | --- | --- | --- | --- |
| collatzBasic | 0.104199 | 0.161694 | 0.114190 | 0.101471 |
| collatzShortcut | 0.095399 | 0.082176 | 0.085050 | 0.088420 |
| collatzLoopCut | 0.086829 | 0.093545 | 0.083542 | 0.081439 |
| collatzLoopCutAsm | 0.060716 | 0.036049 | 0.038896 | 0.035135 |
It should be clear that the inline-assembly version substantially outperforms the rest, even when the compiler is trying as hard as it can to optimize.

Per johnfound on Stack Overflow, regarding a variant of this problem:

> The human always can make the code better than the compiler can, and this particular situation is a good illustration of this claim.
While I find this claim to be perhaps overstated, I think this is a good case study showing that, if you know what you’re doing, inline assembly can be quite a tool.