### Thursday, July 15, 2004

## Bit Twiddling Hacks

1.

Bit Twiddling Hacks

By Sean Eron Anderson seander@cs.stanford.edu

From http://graphics.stanford.edu/~seander/bithacks.html

Contents

About the operation counting methodology

Compute the sign of an integer

Compute the integer absolute value (abs) without branching

Compute the minimum (min) or maximum (max) of two integers without branching

Determining if an integer is a power of 2

Sign extending

Sign extending from a constant bit width

Sign extending from a variable bit width

Sign extending from a variable bit width in 3 operations

Conditionally set or clear bits without branching

Counting bits set

Counting bits set, naive way

Counting bits set by lookup table

Counting bits set, Brian Kernighan's way

Counting bits set in 12, 24, or 32-bit words using 64-bit instructions

Counting bits set, in parallel

Computing parity (1 if an odd number of bits set, 0 otherwise)

Compute parity of a word the naive way

Compute parity by lookup table

Compute parity of a byte using 64-bit multiply and modulus division

Compute parity in parallel

Swapping Values

Swapping values with XOR

Swapping individual bits with XOR

Reversing bit sequences

Reverse bits the obvious way

Reverse bits in word by lookup table

Reverse the bits in a byte with 3 operations (64-bit muliply and modulus division)

Reverse the bits in a byte with 4 operations (64-bit multiply, no division)

Reverse the bits in a byte with 7 operations (no 64-bit, only 32)

Reverse an N-bit quantity in parallel with 5 * lg(N) operations

Modulus division (aka computing remainders)

Computing modulus division by 1 << s without a division operation (obvious)

Computing modulus division by (1 << s) - 1 without a division operation

Computing modulus division by (1 << s) - 1 in parallel without a division operation

Finding integer log base 2 of an integer (aka the position of the highest bit set)

Find the log base 2 of an integer with the MSB N set in O(N) operations (the obvious way)

Find the log base 2 of an integer with a lookup table

Find the log base 2 of an N-bit integer in O(lg(N)) operations

Find integer log base 2 of a 32-bit IEEE float

Find integer log base 2 of the pow(2, r)-root of a 32-bit IEEE float (for unsigned integer r)

Count the number of consecutive zero bits (trailing) on the right in parallel

Count the consecutive zero bits (trailing) on the right with modulus division and lookup

Round up to the next highest power of 2

Interleaving bits (aka computing Morton Numbers)

Interleave bits the obvious way

Interleave bits by table lookup

Interleave bits with 64-bit multiply

Interleave bits by Binary Magic Numbers

Determine if a word has a zero byte

--------------------------------------------------------------------------------

About the operation counting methodology

When totaling the number of operations for algorithms here, any C operator is counted as one operation. Intermediate assignments, which need not be written to RAM, are not counted. Of course, this operation counting approach only serves as an approximation of the actual number of machine instructions and CPU time. All operations are assumed to take the same amount of time, which is not true in reality, but CPUs have been heading increasingly in this direction over time. There are many nuances that determine how fast a system will run a given sample of code, such as cache sizes, memory bandwidths, instruction sets, etc. In the end, benchmarking is the best way to determine whether one method is really faster than another, so consider the techniques below as possibilities to test on your target architecture.

--------------------------------------------------------------------------------

Compute the sign of an integer

int v; // we want to find the sign of v

int sign; // the result goes here

sign = v >> (sizeof(int) * 8 - 1); // if v < 0 then -1, else 0

The above expression evaluates to sign = v >> 31 for 32-bit integers. This is one operation faster than the obvious way, sign = -(v > 0). This trick works because when integers are shifted right, the value of the far left bit is copied to the other bits. The far left bit is 1 when the value is negative and 0 otherwise; all 1 bits is -1.

Alternatively, if you prefer the result be either -1 or +1, then use:

sign = +1 | (v >> (sizeof(int) * 8 - 1)); // if v < 0 then -1, else +1

Alternatively, if you prefer the result be either -1, 0, or +1, then use:

sign = (v != 0) | (v >> (sizeof(int) * 8 - 1)); // -1, 0, or +1

Caveat: On March 7, 2003, Angus Duggan pointed out that the 1989 ANSI C specification leaves the result of signed right-shift implementation-defined, so on some systems this hack might not work.

--------------------------------------------------------------------------------

Compute the integer absolute value (abs) without branching

int v; // we want to find the absolute value of v

int r; // the result goes here

r = (v ^ (v >> (sizeof(int) * 8 - 1))) - (v >> (sizeof(int) * 8 - 1));

Some CPUs don't have an integer absolute value instruction (or the compiler fails to use them). On machines where branching is expensive, the above expression can be faster than the obvious approach, r = (v < 0) ? -v : v, even though the number of operations is the same.

Caveats: On March 7, 2003, Angus Duggan pointed out that the 1989 ANSI C specification leaves the result of signed right-shift implementation-defined, so on some systems this hack might not work. I've read that ANSI C does not require values to be represented as two's complement, so it may not work for that reason as well (on a diminishingly small number of old machines that still use one's complement). On March 14, 2004, Keith H. Duggar sent me the solution above; it is superior to the one I initially came up with, r=(+1|(v>>(sizeof(int)*8-1)))*v, because a multiply is not used. Unfortunately, this method has been patented in the USA on June 6, 2000 by Vladimir Yu Volkonsky and assigned to Sun Microsystems. (For an unpatented non-multiplying alternative, consider (v ^ (v >> (sizeof(int) * 8 - 1))) + ((unsigned) v >> (sizeof(int) * 8 - 1)), suggested by Thorbjørn Willoch on June 21, 2004 or my variation, (v ^ (v >> (sizeof(int) * 8 - 1))) + (v < 0), both of which take one more operation than the patented one, though it may be computed in parallel on modern CPUs.)

--------------------------------------------------------------------------------

Compute the minimum (min) or maximum (max) of two integers without branching

int x; // we want to find the minimum of x and y

int y;

int r; // the result goes here

r = y + ((x - y) & ((x - y) >> (sizeof(int) * 8 - 1)); // min(x, y)

On machines where branching is expensive, the above expression can be faster than the obvious approach, r = (x < y) ? x : y, even though it involves one more instruction. (Notice that (x - y) only needs to be evaluated once.) It works because if x < y, then (x - y) >> 31 will be all ones (on a 32-bit integer machine), so r = y + (x - y) & ~0 = y + x - y = x. Otherwise, if x >= y, then (x - y) >> 31 will be all zeros, so r = y + (x - y) & 0 = y.

To find the maximum, use:

r = x - ((x - y) & ((x - y) >> (sizeof(int) * 8 - 1)); // max(x, y)

Caveat: On March 7, 2003, Angus Duggan pointed out that the 1989 ANSI C specification leaves the result of signed right-shift implementation-defined, so on some systems this hack might not work.

--------------------------------------------------------------------------------

Determining if an integer is a power of 2

unsigned int v; // we want to see if v is a power of 2

bool f; // the result goes here

f = (v & (v - 1)) == 0;

Note that 0 is incorrectly considered a power of 2 here. To remedy this, use:

f = !(v & (v - 1)) && v;

--------------------------------------------------------------------------------

Sign extending from a constant bit width

Sign extension is automatic for built-in types, such as chars and ints. But suppose you have a signed two's complement number, x, that is stored using only b bits. Moreover, suppose you want to convert x to an int, which has more than b bits. A simple copy will work if x is positive, but if negative, the sign must be extended. For example, if we have only 4 bits to store a number, then -3 is represented as 1101 in binary. If we have 8 bits, then -3 is 11111101. The most significant bit of the 4-bit representation is replicated sinistrally to the fill in the destination when we convert to a representation with more bits; this is sign exteneding. In C, sign extension from a constant bit width is trivial, since bit fields may be specified in structs or unions. For example, to convert from 5 bits to an full integer:

int x; // convert this from using 5 bits to a full int

int r; // resulting sign extended number goes here

struct {int x:5;} s;

r = s.x = x;

The following is a C++ template function that uses the same language feature to convert from B bits in one operation (though the compiler is generating more, of course).

template

inline T signextend(const T x)

{

struct {T x:B;} s;

return s.x = x;

}

int r = signextend(x); // sign extend 5 bit number x to r

--------------------------------------------------------------------------------

Sign extending from a variable bit width

Sometimes we need to extend the sign of a number but we don't know a priori the number of bits, b, in which it is represented. (Or we could be programming in Java, which lacks bitfields.)

unsigned b; // number of bits representing the number in x

int x; // sign extend this b-bit number to r

int r; // resulting sign-extended number

int const m = 1 << (b - 1); // mask can be pre-computed if b is fixed

r = -(x & m) | x;

The code above requires five operations. Sean A. Irvine suggested that I add sign extension methods to this page on June 13, 2004, and he provided m = (1 << (b - 1)) - 1; r = -(x & ~m) | x; as a starting point from which I optimized.

--------------------------------------------------------------------------------

Sign extending from a variable bit width in 3 operations

The following may be slow on some machines, due to the effort required for multiplication and division.

unsigned b; // number of bits representing the number in x

int x; // sign extend this b-bit number to r

int r; // resulting sign-extended number

#define M(B) (1 << ((sizeof(x) << 3) - B))

int const multipliers[] =

{

0, M(1), M(2), M(3), M(4), M(5), M(6), M(7),

M(8), M(9), M(10), M(11), M(12), M(13), M(14), M(15),

M(16), M(17), M(18), M(19), M(20), M(21), M(22), M(23),

M(24), M(25), M(26), M(27), M(28), M(29), M(30), M(31),

M(32)

}; // (add more for 64 bits)

#undef M

const int m = multipliers[b];

r = (x * m) / m;

The following variation is not portable, but on architectures that employ an arithmetic right-shift, maintaining the sign, it should be fast.

const int s = -b; // OR: sizeof(x) * 8 - b;

r = (x << s) >> s;

--------------------------------------------------------------------------------

Conditionally set or clear bits without branching

bool f; // conditional flag

unsigned int m; // the bit mask

unsigned int w; // the word to modify: if (f) w |= m; else w &= ~m;

w ^= (-f ^ w) & m;

On some architectures, the lack of branching can more than make up for what appears to be twice as many operations. For instance, informal speed tests on an AMD Athlon™ XP 2100+ indicated it was 5-10% faster. Glenn Slayden informed me of this expression on December 11, 2003.

--------------------------------------------------------------------------------

Counting bits set (naive way)

unsigned int v; // count the number of bits set in v

unsigned int c; // c accumulates the total bits set in v

for (c = 0; v; v >>= 1)

{

c += v & 1;

}

The naive approach requires one iteration per bit, until no more bits are set. So on a 32-bit word with only the high set, it will go through 32 iterations.

--------------------------------------------------------------------------------

Counting bits set by lookup table

const unsigned char BitsSetTable256[] =

{

0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,

1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,

1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,

2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,

1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,

2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,

2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,

3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,

1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,

2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,

2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,

3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,

2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,

3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,

3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,

4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8

};

unsigned int v; // count the number of bits set in 32-bit value v

unsigned int c; // c is the total bits set in v

// Option 1:

c = BitsSetTable256[v & 0xff] +

BitsSetTable256[(v >> 8) & 0xff] +

BitsSetTable256[(v >> 16) & 0xff] +

BitsSetTable256[v >> 24];

// Option 2:

unsigned char * p = (unsigned char *) &v;

c = BitsSetTable256[p[0]] +

BitsSetTable256[p[1]] +

BitsSetTable256[p[2]] +

BitsSetTable256[p[3]];

--------------------------------------------------------------------------------

Counting bits set, Brian Kernighan's way

unsigned int v; // count the number of bits set in v

unsigned int c; // c accumulates the total bits set in v

for (c = 0; v; c++)

{

v &= v - 1; // clear the least significant bit set

}

Brian Kernighan's method goes through as many iterations as there are set bits. So if we have a 32-bit word with only the high bit set, then it will only go once through the loop.

--------------------------------------------------------------------------------

Counting bits set in 12, 24, or 32-bit words using 64-bit instructions

unsigned int v; // count the number of bits set in v

unsigned int c; // c accumulates the total bits set in v

// option 1, for at most 12-bit values in v:

c = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

// option 2, for at most 24-bit values in v:

c = (v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL;

c += ((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL;

c %= 0x1f;

// option 3, for at most 32-bit values in v:

c = (((v & 0xfff) *

0x1001001001001ULL & 0x84210842108421ULL) +

(((v & 0xfff000) >> 12) *

0x1001001001001ULL & 0x84210842108421ULL)) % 0x1f;

c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

This method requires a 64-bit CPU with fast modulus division to be efficient. The first option takes only 3 operations; the second option takes 9; and the third option takes 14.

Rich Schroeppel originally created a 9-bit version, similiar to option 1; see the Programming Hacks section of Beeler, M., Gosper, R. W., and Schroeppel, R. HAKMEM. MIT AI Memo 239, Feb. 29, 1972. His method was the inspiration for the variants above, devised by Sean Anderson.

After reading this and pondering it, Roshan James wrote a web page explaining how it works and extending it to 16 bits on March 8, 2004.

--------------------------------------------------------------------------------

Counting bits set, in parallel

unsigned int v; // count bits set in this (32-bit value)

unsigned int c; // store the total here

const int S[] = {1, 2, 4, 8, 16}; // Magic Binary Numbers

const int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};

c = v;

c = ((c >> S[0]) & B[0]) + (c & B[0]);

c = ((c >> S[1]) & B[1]) + (c & B[1]);

c = ((c >> S[2]) & B[2]) + (c & B[2]);

c = ((c >> S[3]) & B[3]) + (c & B[3]);

c = ((c >> S[4]) & B[4]) + (c & B[4]);

The B array, expressed as binary, is:

B[0] = 01010101 01010101 01010101 01010101

B[1] = 00110011 00110011 00110011 00110011

B[2] = 00001111 00001111 00001111 00001111

B[3] = 00000000 11111111 00000000 11111111

B[4] = 00000000 00000000 11111111 11111111

We can adjust the method for larger word sizes by continuing with the patterns for the Binary Magic Numbers, B and S. If there are k bits, then we need the arrays S and B to be ceil(lg(k)) elements long, and we must compute the same number of expressions for c as S or B are long.

See Ian Ashdown's nice newsgroup post for more information on counting the number of bits set (also known as sideways addition).

--------------------------------------------------------------------------------

Computing parity the naive way

unsigned int v; // word value to compute the parity of

bool parity = false; // parity will be the parity of b

while (v)

{

parity = !parity;

v = v & (v - 1);

}

The above code uses an approach like Brian Kernigan's bit counting, above. The time it takes is proportional to the number of bits set.

--------------------------------------------------------------------------------

Compute parity of a byte using 64-bit multiply and modulus division

unsigned char b; // byte value to compute the parity of

bool parity = (((b * 0x0101010101010101ULL) & 0x8040201008040201) % 0x1FF) & 1;

The method above takes around 4 operations, but only works on bytes.

--------------------------------------------------------------------------------

Compute parity in parallel

unsigned int v; // word value to compute the parity of

v ^= v >> 16;

v ^= v >> 8;

v ^= v >> 4;

v &= 0xf;

return (0x6996 >> v) & 1;

The method above takes around 9 operations, and works for 32-bit words. It may be optimized to work just on bytes in 5 operations by removing the two lines immediately following "unsigned int v;". The method first shifts and XORs the eight nibbles of the 32-bit value together, leaving the result in the lowest nibble of v. Next, the binary number 0110 1001 1001 0110 (0x6996 in hex) is shifted to the right by the value represented in the lowest nibble of v. This number is like a miniature 16-bit parity-table indexed by the low four bits in v. The result has the parity of v in bit 1, which is masked and returned.

Thanks to Mathew Hendry for pointing out the shift-lookup idea at the end on Dec. 15, 2002. That optimization shaves two operations off using only shifting and XORing to find the parity.

--------------------------------------------------------------------------------

Compute parity by lookup table

const bool ParityTable[] =

{

0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,

0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,

0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,

0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,

0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,

0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,

0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,

0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0

};

unsigned char b; // byte value to compute the parity of

bool parity = parity_table[b];

// OR, for 32-bit words:

unsigned int v; // word value to compute the parity of

bool parity = ParityTable[(v & 0x000000ff)] ^

ParityTable[(v & 0x0000ff00) >> 8] ^

ParityTable[(v & 0x00ff0000) >> 16] ^

ParityTable[(v & 0xff000000) >> 24];

--------------------------------------------------------------------------------

Swapping values with XOR

#define SWAP(a, b) ((a ^= b), (b ^= a), (a ^= b))

This is an old trick to exchange the values of the variables a and b without using extra space for a temporary variable.

On January 20, 2005, Iain A. Fleming pointed out that the macro above doesn't work when you swap with the same memory location, such as SWAP(a[i], a[j]) with i == j. So if that may occur, consider defining the macro as ((a == b) || ((a ^= b), (b ^= a), (a ^= b))).

--------------------------------------------------------------------------------

Swapping individual bits with XOR

unsigned int i, j; // positons of bit sequences to swap

unsigned int n; // number of consecutive bits in each sequence

unsigned int b; // bits to swap reside in b

unsigned int r; // bit-swapped result goes here

int x = ((b >> i) ^ (b >> j)) & ((1 << n) - 1); // XOR temporary

r = b ^ ((x << i) | (x << j));

For example, suppose we have have b = 00101111 (expressed in binary) and we want to swap the n = 3 consecutive bits starting at i = 1 (the second bit from the right) with the 3 consecutive bits starting at j = 5; the result would be r = 11100011 (binary).

This method of swapping is similar to the general purpose XOR swap trick, but intended for operating on individual bits. The variable x stores the result of XORing the pairs of bit values we want to swap, and then the bits are set to the result of themselves XORed with x. Of course, the result is undefined if the sequences overlap.

--------------------------------------------------------------------------------

Reverse bits the obvious way

unsigned int v; // reverse the bits in this

unsigned int t = v; // t will have the reversed bits of v

int i;

for (i = sizeof(v) * 8 - 1; i; i--)

{

t |= v & 1;

t <<= 1;

v >>= 1;

}

t |= v & 1;

On October 15, 2004, Michael Hoisie pointed out a bug in the original version.

--------------------------------------------------------------------------------

Reverse bits in word by lookup table

const unsigned char BitReverseTable256[] =

{

0x00, 0x80, 0x40, 0xC0, 0x20, 0xA0, 0x60, 0xE0, 0x10, 0x90, 0x50, 0xD0, 0x30, 0xB0, 0x70, 0xF0,

0x08, 0x88, 0x48, 0xC8, 0x28, 0xA8, 0x68, 0xE8, 0x18, 0x98, 0x58, 0xD8, 0x38, 0xB8, 0x78, 0xF8,

0x04, 0x84, 0x44, 0xC4, 0x24, 0xA4, 0x64, 0xE4, 0x14, 0x94, 0x54, 0xD4, 0x34, 0xB4, 0x74, 0xF4,

0x0C, 0x8C, 0x4C, 0xCC, 0x2C, 0xAC, 0x6C, 0xEC, 0x1C, 0x9C, 0x5C, 0xDC, 0x3C, 0xBC, 0x7C, 0xFC,

0x02, 0x82, 0x42, 0xC2, 0x22, 0xA2, 0x62, 0xE2, 0x12, 0x92, 0x52, 0xD2, 0x32, 0xB2, 0x72, 0xF2,

0x0A, 0x8A, 0x4A, 0xCA, 0x2A, 0xAA, 0x6A, 0xEA, 0x1A, 0x9A, 0x5A, 0xDA, 0x3A, 0xBA, 0x7A, 0xFA,

0x06, 0x86, 0x46, 0xC6, 0x26, 0xA6, 0x66, 0xE6, 0x16, 0x96, 0x56, 0xD6, 0x36, 0xB6, 0x76, 0xF6,

0x0E, 0x8E, 0x4E, 0xCE, 0x2E, 0xAE, 0x6E, 0xEE, 0x1E, 0x9E, 0x5E, 0xDE, 0x3E, 0xBE, 0x7E, 0xFE,

0x01, 0x81, 0x41, 0xC1, 0x21, 0xA1, 0x61, 0xE1, 0x11, 0x91, 0x51, 0xD1, 0x31, 0xB1, 0x71, 0xF1,

0x09, 0x89, 0x49, 0xC9, 0x29, 0xA9, 0x69, 0xE9, 0x19, 0x99, 0x59, 0xD9, 0x39, 0xB9, 0x79, 0xF9,

0x05, 0x85, 0x45, 0xC5, 0x25, 0xA5, 0x65, 0xE5, 0x15, 0x95, 0x55, 0xD5, 0x35, 0xB5, 0x75, 0xF5,

0x0D, 0x8D, 0x4D, 0xCD, 0x2D, 0xAD, 0x6D, 0xED, 0x1D, 0x9D, 0x5D, 0xDD, 0x3D, 0xBD, 0x7D, 0xFD,

0x03, 0x83, 0x43, 0xC3, 0x23, 0xA3, 0x63, 0xE3, 0x13, 0x93, 0x53, 0xD3, 0x33, 0xB3, 0x73, 0xF3,

0x0B, 0x8B, 0x4B, 0xCB, 0x2B, 0xAB, 0x6B, 0xEB, 0x1B, 0x9B, 0x5B, 0xDB, 0x3B, 0xBB, 0x7B, 0xFB,

0x07, 0x87, 0x47, 0xC7, 0x27, 0xA7, 0x67, 0xE7, 0x17, 0x97, 0x57, 0xD7, 0x37, 0xB7, 0x77, 0xF7,

0x0F, 0x8F, 0x4F, 0xCF, 0x2F, 0xAF, 0x6F, 0xEF, 0x1F, 0x9F, 0x5F, 0xDF, 0x3F, 0xBF, 0x7F, 0xFF

};

unsigned int v; // reverse 32-bit value, 8 bits at time

unsigned int c; // c will get v reversed

// Option 1:

c = (BitReverseTable256[v & 0xff] << 24) |

(BitReverseTable256[(v >> 8) & 0xff] << 16) |

(BitReverseTable256[(v >> 16) & 0xff] << 8) |

(BitReverseTable256[(v >> 24) & 0xff]);

// Option 2:

unsigned char * p = (unsigned char *) &v;

unsigned char * q = (unsigned char *) &c;

q[3] = BitReverseTable256[p[0]];

q[2] = BitReverseTable256[p[1]];

q[1] = BitReverseTable256[p[2]];

q[0] = BitReverseTable256[p[3]];

The first method takes about 17 operations, and the second takes about 8, assuming your CPU can load and store bytes easily.

--------------------------------------------------------------------------------

Reverse the bits in a byte with 3 operations (64-bit multiply and modulus division):

unsigned char b; // reverse this byte

b = (b * 0x0202020202ULL & 0x010884422010ULL) % 1023;

The multiply operation creates five separate copies of the 8-bit byte pattern to fan-out into a 64-bit value. The AND operation selects the bits that are in the correct (reversed) positions, relative to each 10-bit groups of bits. The multiply and the AND operations copy the bits from the original byte so they each appear in only one of the 10-bit sets. The reversed positions of the bits from the original byte coincide with their relative positions within any 10-bit set. The last step, which involves modulus division by 2^10 - 1, has the effect of merging together each set of 10 bits (from positions 0-9, 10-19, 20-29, ...) in the 64-bit value. They do not overlap, so the addition steps underlying the modulus division behave like or operations.

This method was attributed to Rich Schroeppel in the Programming Hacks section of Beeler, M., Gosper, R. W., and Schroeppel, R. HAKMEM. MIT AI Memo 239, Feb. 29, 1972.

--------------------------------------------------------------------------------

Reverse the bits in a byte with 4 operations (64-bit multiply, no division):

unsigned char b; // reverse this byte

b = ((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;

The following shows the flow of the bit values with the boolean variables a, b, c, d, e, f, g, and h, which comprise an 8-bit byte. Notice how the first multiply fans out the bit pattern to multiple copies, while the last multiply combines them in the fifth byte from the right.

abcd efgh (-> hgfe dcba)

* 1000 0000 0010 0000 0000 1000 0000 0010 (0x80200802)

-------------------------------------------------------------------------------------------------

0abc defg h00a bcde fgh0 0abc defg h00a bcde fgh0

& 0000 1000 1000 0100 0100 0010 0010 0001 0001 0000 (0x0884422110)

-------------------------------------------------------------------------------------------------

0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000

* 0000 0001 0000 0001 0000 0001 0000 0001 0000 0001 (0x0101010101)

-------------------------------------------------------------------------------------------------

0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000

0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000

0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000

0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000

0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000

-------------------------------------------------------------------------------------------------

0000 d000 h000 dc00 hg00 dcb0 hgf0 dcba hgfe dcba hgfe 0cba 0gfe 00ba 00fe 000a 000e 0000

>> 32

-------------------------------------------------------------------------------------------------

0000 d000 h000 dc00 hg00 dcb0 hgf0 dcba hgfe dcba

& 1111 1111

-------------------------------------------------------------------------------------------------

hgfe dcba

Note that the last two steps can be combined on some processors because the registers can be accessed as bytes; just multiply so that a register stores the upper 32 bits of the result and the take the low byte. Thus, it may take only 6 operations.

Devised by Sean Anderson, July 13, 2001.

--------------------------------------------------------------------------------

Reverse the bits in a byte with 7 operations (no 64-bit):

b = ((b * 0x0802LU & 0x22110LU) | (b * 0x8020LU & 0x88440LU)) * 0x10101LU >> 16

Devised by Sean Anderson, July 13, 2001. Typo spotted correction supplied by Mike Keith, January 3, 2002.

--------------------------------------------------------------------------------

Reverse an N-bit quantity in parallel in 5 * lg(N) operations:

unsigned int v; // 32 bit word to reverse bit order

int const S[] = {1, 2, 4, 8, 16}; // Magic Binary Numbers

int const B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};

v = ((v >> S[0]) & B[0]) | ((v << S[0]) & ~B[0]); // swap odd and even bits

v = ((v >> S[1]) & B[1]) | ((v << S[1]) & ~B[1]); // swap consecutive pairs

v = ((v >> S[2]) & B[2]) | ((v << S[2]) & ~B[2]); // swap nibbles ...

v = ((v >> S[3]) & B[3]) | ((v << S[3]) & ~B[3]);

v = ((v >> S[4]) & B[4]) | ((v << S[4]) & ~B[4]);

This method is best suited to situations where N is large.

Any reasonable optimizing C compiler should treat the dereferences of B, ~B, and S as constants, requiring no evaluation other than perhaps a load operation for some of the B and ~B references.

See Dr. Dobb's Journal 1983, Edwin Freed's article on Binary Magic Numbers for more information.

--------------------------------------------------------------------------------

Compute modulus division by 1 << s without a division operator

const unsigned int n; // numerator

const unsigned int s;

const unsigned int d = 1 << s; // So d will be one of: 1, 2, 4, 8, 16, 32, ...

unsigned int m; // m will be n % d

m = n & (d - 1);

Most programmers learn this trick early, but it was included for the sake of completeness.

--------------------------------------------------------------------------------

Compute modulus division by (1 << s) - 1 without a division operator

unsigned int n; // numerator

const unsigned int s; // s > 0

const unsigned int d = (1 << s) - 1; // so d is either 1, 3, 7, 15, 31, ...).

unsinged int m; // n % d goes here.

for (m = n; n > d; n = m)

{

for (m = 0; n; n >>= s)

{

m += n & d;

}

}

// Now m is a value from 0 to d, but since with modulus division

// we want m to be 0 when it is d.

m = m == d ? 0 : m;

This method of modulus division by an integer that is one less than a power of 2 takes at most 5 + (4 + 5 * ceil(N / s)) * ceil(lg(N / s)) operations, where N is the number of bits in the numerator. In other words, it takes at most O(N * lg(N)) time.

Devised by Sean Anderson, August 15, 2001. Before Sean A. Irvine corrected me on June 17, 2004, I mistakenly commented that we could alternatively assign m = ((m + 1) & d) - 1; at the end.

--------------------------------------------------------------------------------

Compute modulus division by (1 << s) - 1 in parallel without a division operator

// The following is for a word size of 32 bits!

const unsigned int M[] =

{

0x00000000, 0x55555555, 0x33333333, 0xc71c71c7,

0x0f0f0f0f, 0xc1f07c1f, 0x3f03f03f, 0xf01fc07f,

0x00ff00ff, 0x07fc01ff, 0x3ff003ff, 0xffc007ff,

0xff000fff, 0xfc001fff, 0xf0003fff, 0xc0007fff,

0x0000ffff, 0x0001ffff, 0x0003ffff, 0x0007ffff,

0x000fffff, 0x001fffff, 0x003fffff, 0x007fffff,

0x00ffffff, 0x01ffffff, 0x03ffffff, 0x07ffffff,

0x0fffffff, 0x1fffffff, 0x3fffffff, 0x7fffffff

};

const unsigned int Q[][6] =

{

{ 0, 0, 0, 0, 0, 0}, {16, 8, 4, 2, 1, 1}, {16, 8, 4, 2, 2, 2},

{15, 6, 3, 3, 3, 3}, {16, 8, 4, 4, 4, 4}, {15, 5, 5, 5, 5, 5},

{12, 6, 6, 6 , 6, 6}, {14, 7, 7, 7, 7, 7}, {16, 8, 8, 8, 8, 8},

{ 9, 9, 9, 9, 9, 9}, {10, 10, 10, 10, 10, 10}, {11, 11, 11, 11, 11, 11},

{12, 12, 12, 12, 12, 12}, {13, 13, 13, 13, 13, 13}, {14, 14, 14, 14, 14, 14},

{15, 15, 15, 15, 15, 15}, {16, 16, 16, 16, 16, 16}, {17, 17, 17, 17, 17, 17},

{18, 18, 18, 18, 18, 18}, {19, 19, 19, 19, 19, 19}, {20, 20, 20, 20, 20, 20},

{21, 21, 21, 21, 21, 21}, {22, 22, 22, 22, 22, 22}, {23, 23, 23, 23, 23, 23},

{24, 24, 24, 24, 24, 24}, {25, 25, 25, 25, 25, 25}, {26, 26, 26, 26, 26, 26},

{27, 27, 27, 27, 27, 27}, {28, 28, 28, 28, 28, 28}, {29, 29, 29, 29, 29, 29},

{30, 30, 30, 30, 30, 30}, {31, 31, 31, 31, 31, 31}

};

const unsigned int R[][6] =

{

{0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000},

{0x0000ffff, 0x000000ff, 0x0000000f, 0x00000003, 0x00000001, 0x00000001},

{0x0000ffff, 0x000000ff, 0x0000000f, 0x00000003, 0x00000003, 0x00000003},

{0x00007fff, 0x0000003f, 0x00000007, 0x00000007, 0x00000007, 0x00000007},

{0x0000ffff, 0x000000ff, 0x0000000f, 0x0000000f, 0x0000000f, 0x0000000f},

{0x00007fff, 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f},

{0x00000fff, 0x0000003f, 0x0000003f, 0x0000003f, 0x0000003f, 0x0000003f},

{0x00003fff, 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f},

{0x0000ffff, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff},

{0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff},

{0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff},

{0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff},

{0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff},

{0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff},

{0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff},

{0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff},

{0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff},

{0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff},

{0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff},

{0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff},

{0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff},

{0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff},

{0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff},

{0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff},

{0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff},

{0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff},

{0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff},

{0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff},

{0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff},

{0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff},

{0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff},

{0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff}

};

unsigned int n; // numerator

const unsigned int s; // s > 0

const unsigned int d = (1 << s) - 1; // so d is either 1, 3, 7, 15, 31, ...).

unsinged int m; // n % d goes here.

m = (n & M[s]) + ((n & ~M[s]) >> s);

for (const unsigned int * q = &Q[s][0], * r = &R[s][0]; m > d; q++, r++)

{

m = (m >> *q) + (m & *r);

}

m = m == d ? 0 : m; // Or: m = ((m + 1) & d) - 1;

This method of finding modulus division by an integer that is one less than a power of 2 takes at most O(lg(N)) time, where N is the number of bits in the numerator (32 bits, for the code above). The number of operations is at most 12 + 9 * ceil(lg(N)). The tables may be removed if you know the denominator at compile time; just extract the few relevent entries and unroll the loop. It may be easily extended to more bits.

It finds the result by summing the values in base (1 << s) in parallel. First every other base (1 << s) value is added to the previous one. Imagine that the result is written on a piece of paper. Cut the paper in half, so that half the values are on each cut piece. Align the values and sum them onto a new piece of paper. Repeat by cutting this paper in half (which will be a quarter of the size of the previous one) and summing, until you cannot cut further. After performing lg(N/s/2) cuts, we cut no more; just continue to add the values and put the result onto a new piece of paper as before, while there is at least two values.

Devised by Sean Anderson, August 20, 2001.

--------------------------------------------------------------------------------

Find the log base 2 of an integer with the MSB N set in O(N) operations (the obvious way)

unsigned int v; // 32-bit word to find the log base 2 of

unsigned c = 0; // c will be lg(v)

while (v >>= 1) // unroll for more speed...

{

c++;

}

The log base 2 of an integer is the same as the position of the highest bit set (or most significant bit set, MSB). The following log base 2 methods are faster than this one.

--------------------------------------------------------------------------------

Find the log base 2 of an integer with a lookup table

const char LogTable256[] =

{

0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,

4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,

5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,

5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,

6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,

6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,

6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,

6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,

7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,

7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,

7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,

7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,

7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,

7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,

7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,

7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7

};

unsigned int v; // 32-bit word to find the log of

unsigned c = 0; // c will be lg(v)

register unsigned int t, tt; // temporaries

if (tt = v >> 16)

{

c = (t = v >> 24) ? 24 + LogTable256[t] : 16 + LogTable256[tt & 0xFF];

}

else

{

c = (t = v & 0xFF00) ? 8 + LogTable256[t >> 8] : LogTable256[v & 0xFF];

}

The lookup table method takes only about 7 operations to find the log of a 32-bit value. If extended for 64-bit quantities, it would take roughly 9 operations.

--------------------------------------------------------------------------------

Find the log base 2 of an N-bit integer in O(lg(N)) operations

unsigned int v; // 32-bit value to find the log2 of

const unsigned int b[] = {0x2, 0xC, 0xF0, 0xFF00, 0xFFFF0000};

const unsigned int S[] = {1, 2, 4, 8, 16};

int i;

register unsigned int c = 0; // result of log2(v) will go here

for (i = 4; i >= 0; i--) // unroll for speed...

{

if (v & b[i])

{

v >>= S[i];

c |= S[i];

}

}

// OR (IF YOU KNOW v IS A POWER OF 2):

const unsigned int b[] = {0xAAAAAAAA, 0xCCCCCCCC, 0xF0F0F0F0, 0xFF00FF00,

0xFFFF0000};

register unsigned int c = (v & b[0]) != 0;

for (i = 4; i >= 1; i--) // unroll for speed...

{

c |= ((v & b[i]) != 0) << i;

}

Of course, to extend the code to find the log of a 33- to 64-bit number, we would append another element, 0xFFFFFFFF00000000, to b, append 32 to S, and loop from 5 to 0. The second variation was suggested to me by John Owens on April 24, 2002; it's faster, but it is only suitable when the input is known to be a power of 2. On May 25, 2003, Ken Raeburn suggested improving the general case by using smaller numbers for b[], which load faster on some architectures (for instance if the word size is 16 bits, then only one load instruction may be needed). These values work for the general version, but not for the special-case version below it, where v is a power of 2; Glenn Slayden brought this oversight to my attention on December 12, 2003.

--------------------------------------------------------------------------------

Find integer log base 2 of a 32-bit IEEE float

const float v; // find int(log2(v)), where v > 0.0 && finite(v) && isnormal(v)

int c; // c gets the result;

c = ((*(int *) &v) >> 23) - 127;

The above is fast, but many architectures utilize subnormal floating point numbers. These have the exponent bits set to zero (signifying pow(2,-127)), and the mantissa is not normalized, so it contains leading zeros and thus the log2 must be computed from the mantissa. To accomidate for subnormal numbers, use the following:

const float v; // find int(log2(v)), where v > 0.0 && finite(v)

const int x = *(int *) &v;

int c = x >> 23; // c gets the result;

if (c)

{

c -= 127;

}

else

{ // subnormal, so recompute using mantissa: c = intlog2(x) - 149;

register unsigned int t; // temporary

// Note that LogTable256 was defined earlier

if (t = x >> 16)

{

c = LogTable256[t] - 133;

}

else

{

c = (t = x >> 8) ? LogTable256[t] - 141 : LogTable256[x] - 149;

}

}

On June 20, 2004, Sean A. Irvine suggested that I include code to handle subnormal numbers.

--------------------------------------------------------------------------------

Find integer log base 2 of the pow(2, r)-root of a 32-bit IEEE float (for unsigned integer r)

const int r;

const float v; // find int(log2(pow(v, 1. / pow(2, r))), where isnormal(v)

int c; // c gets the result;

c = (((((*(int *) &v) - 0x3f800000) >> r) + 0x3f800000) >> 23) - 127;

So, if r is 0, for example, we have c = int(log2(v)). If r is 1, then we have c = int(log2(sqrt(r))). If r is 2, then we have c = int(log2(pow(r, 1./4))).

--------------------------------------------------------------------------------

Count the consecutive zero bits (trailing) on the right in parallel

unsigned int v; // 32-bit word input to count zero bits on right

register unsigned int c = 32; // c will be the number of zero bits on the right,

// so if v is 01101000 (base 2), then c will be 3 as output.

const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};

const unsigned int S[] = {1, 2, 4, 8, 16}; // Our Magic Binary Numbers

if (v & B[4])

{

v <<= S[4];

c -= S[4];

}

if (v & B[3])

{

v <<= S[3];

c -= S[3];

}

if (v & B[2])

{

v <<= S[2];

c -= S[2];

}

if (v & B[1])

{

v <<= S[1];

c -= S[1];

}

if (v & B[0])

{

v <<= S[0];

c -= S[0];

}

if (v)

{

c--;

}

Here, we are basically doing the same operations as finding the log base 2 in parallel, but the values of b are inverted (in order to count from the right rather than the left), we shift v up rather than down, and c starts at the maximum and is decreased. We also have the additional step at the end, decrementing c if there is anything left in v. The number of operations is at most 4 * lg(N) + 2, roughly, for N bit words.

--------------------------------------------------------------------------------

Count the consecutive zero bits (trailing) on the right with modulus division and lookup

unsigned int v; // find the number of trailing zeros in v

int r; // put the result in r

const int Mod37BitPosition[] = // maps a bit value mod 37 to its position

{

32,0,1,26,2,23,27,0,3,16,24,30,28,11,0,13,4,

7,17,0,25,22,31,15,29,10,12,6,0,21,14,9,5,

20,8,19,18

};

r = Mod37BitPosition[(-v & v) % 37];

The code above finds the number of zeros that are trailing on the right, so binary 0100 would produce 2. It makes use of the fact that the first 32 bit position values are relatively prime with 37, so performing a modulus division with 37 gives a unique number from 0 to 36 for each. These numbers may then be mapped to the number of zeros using a small lookup table. It uses only 4 operations, however indexing into a table and performing modulus division may make it unsuitable for some situations. I came up with this independently and then searched for a subsequence of the table values, and found it was invented earlier by Reiser, according to Hacker's Delight.

--------------------------------------------------------------------------------

Round up to the next highest power of 2

unsigned int v; // compute the next highest power of 2 of 32-bit v

v--;

v |= v >> 1;

v |= v >> 2;

v |= v >> 4;

v |= v >> 8;

v |= v >> 16;

v++;

In 12 operations, this code computes the next highest power of 2 for a 32-bit integer. The result may be expressed by the formula 1 << (lg(v - 1) + 1). It would be faster by 2 operations to use the formula and the log base 2 methed that uses a lookup table, but in some situations, lookup tables are not suitable, so the above code may be best. It works by copying the highest set bit to all of the lower bits, and then adding one, which results in carries that set all of the lower bits to 0 and one bit beyond the highest set bit to 1. If the original number was a power of 2, then the decrement will reduce it to one less, so that we round up to the same original value.

Devised by Sean Anderson, Sepember 14, 2001.

--------------------------------------------------------------------------------

Interleave bits the obvious way

unsigned short x; // Interleave bits of x and y, so that all of the

unsigned short y; // bits of x are in the even positions and y in the odd;

unsigned int z = 0; // z gets the resulting 32-bit Morton Number.

for (int i = 0; i < sizeof(x) * 8; i++) // unroll for more speed...

{

z |= (x & 1 << i) << i | (y & 1 << i) << (i + 1);

}

Interleaved bits (aka Morton numbers) are useful for linearizing 2D integer coordinates, so x and y are combined into a single number that can be compared easily and has the property that a number is usually close to another if their x and y values are close.

--------------------------------------------------------------------------------

Interleave bits by table lookup

const unsigned short MortonTable256[] =

{

0x0000, 0x0001, 0x0004, 0x0005, 0x0010, 0x0011, 0x0014, 0x0015,

0x0040, 0x0041, 0x0044, 0x0045, 0x0050, 0x0051, 0x0054, 0x0055,

0x0100, 0x0101, 0x0104, 0x0105, 0x0110, 0x0111, 0x0114, 0x0115,

0x0140, 0x0141, 0x0144, 0x0145, 0x0150, 0x0151, 0x0154, 0x0155,

0x0400, 0x0401, 0x0404, 0x0405, 0x0410, 0x0411, 0x0414, 0x0415,

0x0440, 0x0441, 0x0444, 0x0445, 0x0450, 0x0451, 0x0454, 0x0455,

0x0500, 0x0501, 0x0504, 0x0505, 0x0510, 0x0511, 0x0514, 0x0515,

0x0540, 0x0541, 0x0544, 0x0545, 0x0550, 0x0551, 0x0554, 0x0555,

0x1000, 0x1001, 0x1004, 0x1005, 0x1010, 0x1011, 0x1014, 0x1015,

0x1040, 0x1041, 0x1044, 0x1045, 0x1050, 0x1051, 0x1054, 0x1055,

0x1100, 0x1101, 0x1104, 0x1105, 0x1110, 0x1111, 0x1114, 0x1115,

0x1140, 0x1141, 0x1144, 0x1145, 0x1150, 0x1151, 0x1154, 0x1155,

0x1400, 0x1401, 0x1404, 0x1405, 0x1410, 0x1411, 0x1414, 0x1415,

0x1440, 0x1441, 0x1444, 0x1445, 0x1450, 0x1451, 0x1454, 0x1455,

0x1500, 0x1501, 0x1504, 0x1505, 0x1510, 0x1511, 0x1514, 0x1515,

0x1540, 0x1541, 0x1544, 0x1545, 0x1550, 0x1551, 0x1554, 0x1555,

0x4000, 0x4001, 0x4004, 0x4005, 0x4010, 0x4011, 0x4014, 0x4015,

0x4040, 0x4041, 0x4044, 0x4045, 0x4050, 0x4051, 0x4054, 0x4055,

0x4100, 0x4101, 0x4104, 0x4105, 0x4110, 0x4111, 0x4114, 0x4115,

0x4140, 0x4141, 0x4144, 0x4145, 0x4150, 0x4151, 0x4154, 0x4155,

0x4400, 0x4401, 0x4404, 0x4405, 0x4410, 0x4411, 0x4414, 0x4415,

0x4440, 0x4441, 0x4444, 0x4445, 0x4450, 0x4451, 0x4454, 0x4455,

0x4500, 0x4501, 0x4504, 0x4505, 0x4510, 0x4511, 0x4514, 0x4515,

0x4540, 0x4541, 0x4544, 0x4545, 0x4550, 0x4551, 0x4554, 0x4555,

0x5000, 0x5001, 0x5004, 0x5005, 0x5010, 0x5011, 0x5014, 0x5015,

0x5040, 0x5041, 0x5044, 0x5045, 0x5050, 0x5051, 0x5054, 0x5055,

0x5100, 0x5101, 0x5104, 0x5105, 0x5110, 0x5111, 0x5114, 0x5115,

0x5140, 0x5141, 0x5144, 0x5145, 0x5150, 0x5151, 0x5154, 0x5155,

0x5400, 0x5401, 0x5404, 0x5405, 0x5410, 0x5411, 0x5414, 0x5415,

0x5440, 0x5441, 0x5444, 0x5445, 0x5450, 0x5451, 0x5454, 0x5455,

0x5500, 0x5501, 0x5504, 0x5505, 0x5510, 0x5511, 0x5514, 0x5515,

0x5540, 0x5541, 0x5544, 0x5545, 0x5550, 0x5551, 0x5554, 0x5555

};

unsigned short x; // Interleave bits of x and y, so that all of the

unsigned short y; // bits of x are in the even positions and y in the odd;

unsigned int z; // z gets the resulting 32-bit Morton Number.

z = (MortonTable256[y & 0xFF] << 1) | MortonTable256[x & 0xFF] |

(MortonTable256[y >> 8] << 1 | MortonTable256[x >> 8]) << 16;

--------------------------------------------------------------------------------

Interleave bits with 64-bit multiply

In 11 operations, this version interleaves bits of two bytes (rather than shorts, as in the other versions), but many of the operations are 64-bit multiplies so it isn't appropriate for all machines.

unsigned char x; // Interleave bits of x and y, so that all of the

unsigned char y; // bits of x are in the even positions and y in the odd;

unsigned short z; // z gets the resulting 16-bit Morton Number.

z = ((x * 0x0101010101010101ULL & 0x8040201008040201ULL) *

0x0102040810204081ULL >> 49) & 0x5555 |

((y * 0x0101010101010101ULL & 0x8040201008040201ULL) *

0x0102040810204081ULL >> 48) & 0xAAAA;

Holger Bettag was inspired to suggest this technique on October 10, 2004 after reading the multiply-based bit reversals here.

--------------------------------------------------------------------------------

Interleave bits by Binary Magic Numbers

const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF};

const unsigned int S[] = {1, 2, 4, 8};

unsigned int x; // Interleave lower 16 bits of x and y, so the bits of x

unsigned int y; // are in the even positions and bits from y in the odd;

unsigned int z; // z gets the resulting 32-bit Morton Number.

x = (x | (x << S[3])) & B[3];

x = (x | (x << S[2])) & B[2];

x = (x | (x << S[1])) & B[1];

x = (x | (x << S[0])) & B[0];

y = (y | (y << S[3])) & B[3];

y = (y | (y << S[2])) & B[2];

y = (y | (y << S[1])) & B[1];

y = (y | (y << S[0])) & B[0];

z = x | (y << 1);

--------------------------------------------------------------------------------

Determine if a word has a zero byte

// Fewer operations:

unsigned int v; // 32-bit word to check if any 8-bit byte in it is 0

bool hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);

The code above may be useful when doing a fast string copy in which a word is copied at a time; it uses 5 operations. On the other hand, testing for a null byte in the obvious ways (which follow) have at least 7 operations (when counted in the most sparing way), and at most 12.

// More operations:

bool hasNoZeroByte = ((v & 0xff) && (v & 0xff00) && (v & 0xff0000) && (v & 0xff000000))

// OR:

unsigned char * p = (unsigned char *) &v;

bool hasNoZeroByte = *p && *(p + 1) && *(p + 2) && *(p + 3);

The code at the beginning of this section (labeled "Fewer operations") works by first zeroing the high bits of the 4 bytes in the word. Subsequently, it adds a number that will result in an overflow to the high bit of a byte if any of the low bits were initialy set. Next the high bits of the original word are ORed with these values; thus, the high bit of a byte is set iff any bit in the byte was set. Finally, we determine if any of these high bits are zero by ORing with ones everywhere except the high bits and inverting the result. Extending to 64 bits is trivial; simply increase the constants to be 0x7F7F7F7F7F7F7F7F.

For an additional improvement, a fast pretest that requires only 4 operations may be performed to determine if the word may have a zero byte. The test also returns true if the high byte is 0x80, so there are occasional false positives, but the slower and more reliable version above may then be used on candidates for an overall increase in speed with correct output.

bool mayHaveZeroByte = ((v + 0x7efefeff) ^ ~v) & 0x81010100;

if (mayHaveZeroByte) // or may just have 0x80 in the high byte

{

bool hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);

}

Paul Messmer suggested the fast pretest improvement on October 2, 2004.

2.

Fast Bit Counting Routines

Compiled from various sources by Gurmeet Singh Manku

From http://www-db.stanford.edu/~manku/bitcount/bitcount.html

A common problem asked in job interviews is to count the number of bits that are on in an unsigned integer. Here are seven solutions to this problem. Source code in C is available.

Iterated Count

int bitcount (unsigned int n)

{

int count=0;

while (n)

{

count += n & 0x1u ;

n >>= 1 ;

}

return count ;

}

Sparse Ones

int bitcount (unsigned int n)

{

int count=0 ;

while (n)

{

count++ ;

n &= (n - 1) ;

}

return count ;

}

Dense Ones

int bitcount (unsigned int n)

{

int count = 8 * sizeof(int) ;

n ^= (unsigned int) -1 ;

while (n)

{

count-- ;

n &= (n - 1) ;

}

return count ;

}

Precompute_8bit

// static int bits_in_char [256] ;

int bitcount (unsigned int n)

{

// works only for 32-bit ints

return bits_in_char [n & 0xffu]

+ bits_in_char [(n >> 8) & 0xffu]

+ bits_in_char [(n >> 16) & 0xffu]

+ bits_in_char [(n >> 24) & 0xffu] ;

}

Iterated Count runs in time proportional to the total number of bits. It simply loops through all the bits, terminating slightly earlier because of the while condition. Useful if 1's are sparse and among the least significant bits. Sparse Ones runs in time proportional to the number of 1 bits. The line n &= (n - 1) simply sets the rightmost 1 bit in n to 0. Dense Ones runs in time proportional to the number of 0 bits. It is the same as Sparse Ones, except that it first toggles all bits (n ~= -1), and continually subtracts the number of 1 bits from sizeof(int). Precompute_8bit assumes an array bits_in_char such that bits_in_char[i] contains the number of 1 bits in the binary representation for i. It repeatedly updates count by masking out the last eight bits in n, and indexing into bits_in_char.

Precompute_16bit

// static char bits_in_16bits [0x1u << 16] ;

int bitcount (unsigned int n)

{

// works only for 32-bit ints

return bits_in_16bits [n & 0xffffu]

+ bits_in_16bits [(n >> 16) & 0xffffu] ;

}

Precompute_16bit is a variant of Precompute_8bit in that an array bits_in_16bits[] stores the number of 1 bits in successive 16 bit numbers (shorts).

Parallel Count

#define TWO(c) (0x1u << (c))

#define MASK(c) (((unsigned int)(-1)) / (TWO(TWO(c)) + 1u))

#define COUNT(x,c) ((x) & MASK(c)) + (((x) >> (TWO(c))) & MASK(c))

int bitcount (unsigned int n)

{

n = COUNT(n, 0) ;

n = COUNT(n, 1) ;

n = COUNT(n, 2) ;

n = COUNT(n, 3) ;

n = COUNT(n, 4) ;

/* n = COUNT(n, 5) ; for 64-bit integers */

return n ;

}

Parallel Count carries out bit counting in a parallel fashion. Consider n after the first line has finished executing. Imagine splitting n into pairs of bits. Each pair contains the number of ones in those two bit positions in the original n. After the second line has finished executing, each nibble contains the number of ones in those four bits positions in the original n. Continuing this for five iterations, the 64 bits contain the number of ones among these sixty-four bit positions in the original n. That is what we wanted to compute.

Nifty Parallel Count

#define MASK_01010101 (((unsigned int)(-1))/3)

#define MASK_00110011 (((unsigned int)(-1))/5)

#define MASK_00001111 (((unsigned int)(-1))/17)

int bitcount (unsigned int n)

{

n = (n & MASK_01010101) + ((n >> 1) & MASK_01010101) ;

n = (n & MASK_00110011) + ((n >> 2) & MASK_00110011) ;

n = (n & MASK_00001111) + ((n >> 4) & MASK_00001111) ;

return n % 255 ;

}

Nifty Parallel Count works the same way as Parallel Count for the first three iterations. At the end of the third line (just before the return), each byte of n contains the number of ones in those eight bit positions in the original n. A little thought then explains why the remainder modulo 255 works.

MIT HACKMEM Count

int bitcount(unsigned int n)

{

/* works for 32-bit numbers only */

/* fix last line for 64-bit numbers */

register unsigned int tmp;

tmp = n - ((n >> 1) & 033333333333)

- ((n >> 2) & 011111111111);

return ((tmp + (tmp >> 3)) & 030707070707) % 63;

}

MIT Hackmem Count is funky. Consider a 3 bit number as being 4a+2b+c. If we shift it right 1 bit, we have 2a+b. Subtracting this from the original gives 2a+b+c. If we shift the original 2 bits right we get a, and so with another subtraction we have a+b+c, which is the number of bits in the original number. How is this insight employed? The first assignment statement in the routine computes tmp. Consider the octal representation of tmp. Each digit in the representation is simply the number of 1's in the corresponding three bit positions in n. The last return statement sums these octal digits to produce the final answer. The key idea is to add adjacent pairs of octal digits together and then compute the remainder modulus 63. This is accomplished by right-shifting tmp by three bits, adding it to tmp itself and ANDing with a suitable mask. This yields a number in which groups of six adjacent bits (starting from the LSB) contain the number of 1's among those six positions in n. This number modulo 63 yields the final answer. For 64-bit numbers, we would have to add triples of octal digits and use modulus 1023. This is HACKMEM 169, as used in X11 sources. Source: MIT AI Lab memo, late 1970's.

No Optimization Some Optimization Heavy Optimization

Precomp_16 52.94 Mcps Precomp_16 76.22 Mcps Precomp_16 80.58 Mcps

Precomp_8 29.74 Mcps Precomp_8 49.83 Mcps Precomp_8 51.65 Mcps

Parallel 19.30 Mcps Parallel 36.00 Mcps Parallel 38.55 Mcps

MIT 16.93 Mcps MIT 17.10 Mcps Nifty 31.82 Mcps

Nifty 12.78 Mcps Nifty 16.07 Mcps MIT 29.71 Mcps

Sparse 5.70 Mcps Sparse 15.01 Mcps Sparse 14.62 Mcps

Dense 5.30 Mcps Dense 14.11 Mcps Dense 14.56 Mcps

Iterated 3.60 Mcps Iterated 3.84 Mcps Iterated 9.24 Mcps

Mcps = Million counts per second

Which of the several bit counting routines is the fastest? Results of speed trials on an i686 are summarized in the table on left. "No Optimization" was compiled with plain gcc. "Some Optimizations" was gcc -O3. "Heavy Optimizations" corresponds to gcc -O3 -mcpu=i686 -march=i686 -fforce-addr -funroll-loops -frerun-cse-after-loop -frerun-loop-opt -malign-functions=4.

Thanks to Seth Robertson who suggested performing speed trials by extending bitcount.c. Seth also pointed me to MIT_Hackmem routine. Thanks to Denny Gursky who suggested the idea of Precompute_11bit. That would require three sums (11-bit, 11-bit and 10-bit precomputed counts). I then tried Precompute_16bit which turned out to be even faster.

Bit Twiddling Hacks

By Sean Eron Anderson seander@cs.stanford.edu

From http://graphics.stanford.edu/~seander/bithacks.html

Contents

About the operation counting methodology

Compute the sign of an integer

Compute the integer absolute value (abs) without branching

Compute the minimum (min) or maximum (max) of two integers without branching

Determining if an integer is a power of 2

Sign extending

Sign extending from a constant bit width

Sign extending from a variable bit width

Sign extending from a variable bit width in 3 operations

Conditionally set or clear bits without branching

Counting bits set

Counting bits set, naive way

Counting bits set by lookup table

Counting bits set, Brian Kernighan's way

Counting bits set in 12, 24, or 32-bit words using 64-bit instructions

Counting bits set, in parallel

Computing parity (1 if an odd number of bits set, 0 otherwise)

Compute parity of a word the naive way

Compute parity by lookup table

Compute parity of a byte using 64-bit multiply and modulus division

Compute parity in parallel

Swapping Values

Swapping values with XOR

Swapping individual bits with XOR

Reversing bit sequences

Reverse bits the obvious way

Reverse bits in word by lookup table

Reverse the bits in a byte with 3 operations (64-bit muliply and modulus division)

Reverse the bits in a byte with 4 operations (64-bit multiply, no division)

Reverse the bits in a byte with 7 operations (no 64-bit, only 32)

Reverse an N-bit quantity in parallel with 5 * lg(N) operations

Modulus division (aka computing remainders)

Computing modulus division by 1 << s without a division operation (obvious)

Computing modulus division by (1 << s) - 1 without a division operation

Computing modulus division by (1 << s) - 1 in parallel without a division operation

Finding integer log base 2 of an integer (aka the position of the highest bit set)

Find the log base 2 of an integer with the MSB N set in O(N) operations (the obvious way)

Find the log base 2 of an integer with a lookup table

Find the log base 2 of an N-bit integer in O(lg(N)) operations

Find integer log base 2 of a 32-bit IEEE float

Find integer log base 2 of the pow(2, r)-root of a 32-bit IEEE float (for unsigned integer r)

Count the number of consecutive zero bits (trailing) on the right in parallel

Count the consecutive zero bits (trailing) on the right with modulus division and lookup

Round up to the next highest power of 2

Interleaving bits (aka computing Morton Numbers)

Interleave bits the obvious way

Interleave bits by table lookup

Interleave bits with 64-bit multiply

Interleave bits by Binary Magic Numbers

Determine if a word has a zero byte

--------------------------------------------------------------------------------

About the operation counting methodology

When totaling the number of operations for algorithms here, any C operator is counted as one operation. Intermediate assignments, which need not be written to RAM, are not counted. Of course, this operation counting approach only serves as an approximation of the actual number of machine instructions and CPU time. All operations are assumed to take the same amount of time, which is not true in reality, but CPUs have been heading increasingly in this direction over time. There are many nuances that determine how fast a system will run a given sample of code, such as cache sizes, memory bandwidths, instruction sets, etc. In the end, benchmarking is the best way to determine whether one method is really faster than another, so consider the techniques below as possibilities to test on your target architecture.

--------------------------------------------------------------------------------

Compute the sign of an integer

int v; // we want to find the sign of v

int sign; // the result goes here

sign = v >> (sizeof(int) * 8 - 1); // if v < 0 then -1, else 0

The above expression evaluates to sign = v >> 31 for 32-bit integers. This is one operation faster than the obvious way, sign = -(v > 0). This trick works because when integers are shifted right, the value of the far left bit is copied to the other bits. The far left bit is 1 when the value is negative and 0 otherwise; all 1 bits is -1.

Alternatively, if you prefer the result be either -1 or +1, then use:

sign = +1 | (v >> (sizeof(int) * 8 - 1)); // if v < 0 then -1, else +1

Alternatively, if you prefer the result be either -1, 0, or +1, then use:

sign = (v != 0) | (v >> (sizeof(int) * 8 - 1)); // -1, 0, or +1

Caveat: On March 7, 2003, Angus Duggan pointed out that the 1989 ANSI C specification leaves the result of signed right-shift implementation-defined, so on some systems this hack might not work.

--------------------------------------------------------------------------------

Compute the integer absolute value (abs) without branching

int v; // we want to find the absolute value of v

int r; // the result goes here

r = (v ^ (v >> (sizeof(int) * 8 - 1))) - (v >> (sizeof(int) * 8 - 1));

Some CPUs don't have an integer absolute value instruction (or the compiler fails to use them). On machines where branching is expensive, the above expression can be faster than the obvious approach, r = (v < 0) ? -v : v, even though the number of operations is the same.

Caveats: On March 7, 2003, Angus Duggan pointed out that the 1989 ANSI C specification leaves the result of signed right-shift implementation-defined, so on some systems this hack might not work. I've read that ANSI C does not require values to be represented as two's complement, so it may not work for that reason as well (on a diminishingly small number of old machines that still use one's complement). On March 14, 2004, Keith H. Duggar sent me the solution above; it is superior to the one I initially came up with, r=(+1|(v>>(sizeof(int)*8-1)))*v, because a multiply is not used. Unfortunately, this method has been patented in the USA on June 6, 2000 by Vladimir Yu Volkonsky and assigned to Sun Microsystems. (For an unpatented non-multiplying alternative, consider (v ^ (v >> (sizeof(int) * 8 - 1))) + ((unsigned) v >> (sizeof(int) * 8 - 1)), suggested by Thorbjørn Willoch on June 21, 2004 or my variation, (v ^ (v >> (sizeof(int) * 8 - 1))) + (v < 0), both of which take one more operation than the patented one, though it may be computed in parallel on modern CPUs.)

--------------------------------------------------------------------------------

Compute the minimum (min) or maximum (max) of two integers without branching

int x; // we want to find the minimum of x and y

int y;

int r; // the result goes here

r = y + ((x - y) & ((x - y) >> (sizeof(int) * 8 - 1)); // min(x, y)

On machines where branching is expensive, the above expression can be faster than the obvious approach, r = (x < y) ? x : y, even though it involves one more instruction. (Notice that (x - y) only needs to be evaluated once.) It works because if x < y, then (x - y) >> 31 will be all ones (on a 32-bit integer machine), so r = y + (x - y) & ~0 = y + x - y = x. Otherwise, if x >= y, then (x - y) >> 31 will be all zeros, so r = y + (x - y) & 0 = y.

To find the maximum, use:

r = x - ((x - y) & ((x - y) >> (sizeof(int) * 8 - 1)); // max(x, y)

Caveat: On March 7, 2003, Angus Duggan pointed out that the 1989 ANSI C specification leaves the result of signed right-shift implementation-defined, so on some systems this hack might not work.

--------------------------------------------------------------------------------

Determining if an integer is a power of 2

unsigned int v; // we want to see if v is a power of 2

bool f; // the result goes here

f = (v & (v - 1)) == 0;

Note that 0 is incorrectly considered a power of 2 here. To remedy this, use:

f = !(v & (v - 1)) && v;

--------------------------------------------------------------------------------

Sign extending from a constant bit width

Sign extension is automatic for built-in types, such as chars and ints. But suppose you have a signed two's complement number, x, that is stored using only b bits. Moreover, suppose you want to convert x to an int, which has more than b bits. A simple copy will work if x is positive, but if negative, the sign must be extended. For example, if we have only 4 bits to store a number, then -3 is represented as 1101 in binary. If we have 8 bits, then -3 is 11111101. The most significant bit of the 4-bit representation is replicated sinistrally to the fill in the destination when we convert to a representation with more bits; this is sign exteneding. In C, sign extension from a constant bit width is trivial, since bit fields may be specified in structs or unions. For example, to convert from 5 bits to an full integer:

int x; // convert this from using 5 bits to a full int

int r; // resulting sign extended number goes here

struct {int x:5;} s;

r = s.x = x;

The following is a C++ template function that uses the same language feature to convert from B bits in one operation (though the compiler is generating more, of course).

template

inline T signextend(const T x)

{

struct {T x:B;} s;

return s.x = x;

}

int r = signextend

--------------------------------------------------------------------------------

Sign extending from a variable bit width

Sometimes we need to extend the sign of a number but we don't know a priori the number of bits, b, in which it is represented. (Or we could be programming in Java, which lacks bitfields.)

unsigned b; // number of bits representing the number in x

int x; // sign extend this b-bit number to r

int r; // resulting sign-extended number

int const m = 1 << (b - 1); // mask can be pre-computed if b is fixed

r = -(x & m) | x;

The code above requires five operations. Sean A. Irvine suggested that I add sign extension methods to this page on June 13, 2004, and he provided m = (1 << (b - 1)) - 1; r = -(x & ~m) | x; as a starting point from which I optimized.

--------------------------------------------------------------------------------

Sign extending from a variable bit width in 3 operations

The following may be slow on some machines, due to the effort required for multiplication and division.

unsigned b; // number of bits representing the number in x

int x; // sign extend this b-bit number to r

int r; // resulting sign-extended number

#define M(B) (1 << ((sizeof(x) << 3) - B))

int const multipliers[] =

{

0, M(1), M(2), M(3), M(4), M(5), M(6), M(7),

M(8), M(9), M(10), M(11), M(12), M(13), M(14), M(15),

M(16), M(17), M(18), M(19), M(20), M(21), M(22), M(23),

M(24), M(25), M(26), M(27), M(28), M(29), M(30), M(31),

M(32)

}; // (add more for 64 bits)

#undef M

const int m = multipliers[b];

r = (x * m) / m;

The following variation is not portable, but on architectures that employ an arithmetic right-shift, maintaining the sign, it should be fast.

const int s = -b; // OR: sizeof(x) * 8 - b;

r = (x << s) >> s;

--------------------------------------------------------------------------------

Conditionally set or clear bits without branching

bool f; // conditional flag

unsigned int m; // the bit mask

unsigned int w; // the word to modify: if (f) w |= m; else w &= ~m;

w ^= (-f ^ w) & m;

On some architectures, the lack of branching can more than make up for what appears to be twice as many operations. For instance, informal speed tests on an AMD Athlon™ XP 2100+ indicated it was 5-10% faster. Glenn Slayden informed me of this expression on December 11, 2003.

--------------------------------------------------------------------------------

Counting bits set (naive way)

unsigned int v; // count the number of bits set in v

unsigned int c; // c accumulates the total bits set in v

for (c = 0; v; v >>= 1)

{

c += v & 1;

}

The naive approach requires one iteration per bit, until no more bits are set. So on a 32-bit word with only the high set, it will go through 32 iterations.

--------------------------------------------------------------------------------

Counting bits set by lookup table

const unsigned char BitsSetTable256[] =

{

0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,

1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,

1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,

2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,

1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,

2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,

2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,

3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,

1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,

2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,

2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,

3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,

2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,

3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,

3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,

4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8

};

unsigned int v; // count the number of bits set in 32-bit value v

unsigned int c; // c is the total bits set in v

// Option 1:

c = BitsSetTable256[v & 0xff] +

BitsSetTable256[(v >> 8) & 0xff] +

BitsSetTable256[(v >> 16) & 0xff] +

BitsSetTable256[v >> 24];

// Option 2:

unsigned char * p = (unsigned char *) &v;

c = BitsSetTable256[p[0]] +

BitsSetTable256[p[1]] +

BitsSetTable256[p[2]] +

BitsSetTable256[p[3]];

--------------------------------------------------------------------------------

Counting bits set, Brian Kernighan's way

unsigned int v; // count the number of bits set in v

unsigned int c; // c accumulates the total bits set in v

for (c = 0; v; c++)

{

v &= v - 1; // clear the least significant bit set

}

Brian Kernighan's method goes through as many iterations as there are set bits. So if we have a 32-bit word with only the high bit set, then it will only go once through the loop.

--------------------------------------------------------------------------------

Counting bits set in 12, 24, or 32-bit words using 64-bit instructions

unsigned int v; // count the number of bits set in v

unsigned int c; // c accumulates the total bits set in v

// option 1, for at most 12-bit values in v:

c = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

// option 2, for at most 24-bit values in v:

c = (v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL;

c += ((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL;

c %= 0x1f;

// option 3, for at most 32-bit values in v:

c = (((v & 0xfff) *

0x1001001001001ULL & 0x84210842108421ULL) +

(((v & 0xfff000) >> 12) *

0x1001001001001ULL & 0x84210842108421ULL)) % 0x1f;

c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

This method requires a 64-bit CPU with fast modulus division to be efficient. The first option takes only 3 operations; the second option takes 9; and the third option takes 14.

Rich Schroeppel originally created a 9-bit version, similiar to option 1; see the Programming Hacks section of Beeler, M., Gosper, R. W., and Schroeppel, R. HAKMEM. MIT AI Memo 239, Feb. 29, 1972. His method was the inspiration for the variants above, devised by Sean Anderson.

After reading this and pondering it, Roshan James wrote a web page explaining how it works and extending it to 16 bits on March 8, 2004.

--------------------------------------------------------------------------------

Counting bits set, in parallel

unsigned int v; // count bits set in this (32-bit value)

unsigned int c; // store the total here

const int S[] = {1, 2, 4, 8, 16}; // Magic Binary Numbers

const int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};

c = v;

c = ((c >> S[0]) & B[0]) + (c & B[0]);

c = ((c >> S[1]) & B[1]) + (c & B[1]);

c = ((c >> S[2]) & B[2]) + (c & B[2]);

c = ((c >> S[3]) & B[3]) + (c & B[3]);

c = ((c >> S[4]) & B[4]) + (c & B[4]);

The B array, expressed as binary, is:

B[0] = 01010101 01010101 01010101 01010101

B[1] = 00110011 00110011 00110011 00110011

B[2] = 00001111 00001111 00001111 00001111

B[3] = 00000000 11111111 00000000 11111111

B[4] = 00000000 00000000 11111111 11111111

We can adjust the method for larger word sizes by continuing with the patterns for the Binary Magic Numbers, B and S. If there are k bits, then we need the arrays S and B to be ceil(lg(k)) elements long, and we must compute the same number of expressions for c as S or B are long.

See Ian Ashdown's nice newsgroup post for more information on counting the number of bits set (also known as sideways addition).

--------------------------------------------------------------------------------

Computing parity the naive way

unsigned int v; // word value to compute the parity of

bool parity = false; // parity will be the parity of b

while (v)

{

parity = !parity;

v = v & (v - 1);

}

The above code uses an approach like Brian Kernigan's bit counting, above. The time it takes is proportional to the number of bits set.

--------------------------------------------------------------------------------

Compute parity of a byte using 64-bit multiply and modulus division

unsigned char b; // byte value to compute the parity of

bool parity = (((b * 0x0101010101010101ULL) & 0x8040201008040201) % 0x1FF) & 1;

The method above takes around 4 operations, but only works on bytes.

--------------------------------------------------------------------------------

Compute parity in parallel

unsigned int v; // word value to compute the parity of

v ^= v >> 16;

v ^= v >> 8;

v ^= v >> 4;

v &= 0xf;

return (0x6996 >> v) & 1;

The method above takes around 9 operations, and works for 32-bit words. It may be optimized to work just on bytes in 5 operations by removing the two lines immediately following "unsigned int v;". The method first shifts and XORs the eight nibbles of the 32-bit value together, leaving the result in the lowest nibble of v. Next, the binary number 0110 1001 1001 0110 (0x6996 in hex) is shifted to the right by the value represented in the lowest nibble of v. This number is like a miniature 16-bit parity-table indexed by the low four bits in v. The result has the parity of v in bit 1, which is masked and returned.

Thanks to Mathew Hendry for pointing out the shift-lookup idea at the end on Dec. 15, 2002. That optimization shaves two operations off using only shifting and XORing to find the parity.

--------------------------------------------------------------------------------

Compute parity by lookup table

const bool ParityTable[] =

{

0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,

0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,

0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,

0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,

0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,

0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,

0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,

1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,

0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0

};

unsigned char b; // byte value to compute the parity of

bool parity = parity_table[b];

// OR, for 32-bit words:

unsigned int v; // word value to compute the parity of

bool parity = ParityTable[(v & 0x000000ff)] ^

ParityTable[(v & 0x0000ff00) >> 8] ^

ParityTable[(v & 0x00ff0000) >> 16] ^

ParityTable[(v & 0xff000000) >> 24];

--------------------------------------------------------------------------------

Swapping values with XOR

#define SWAP(a, b) ((a ^= b), (b ^= a), (a ^= b))

This is an old trick to exchange the values of the variables a and b without using extra space for a temporary variable.

On January 20, 2005, Iain A. Fleming pointed out that the macro above doesn't work when you swap with the same memory location, such as SWAP(a[i], a[j]) with i == j. So if that may occur, consider defining the macro as ((a == b) || ((a ^= b), (b ^= a), (a ^= b))).

--------------------------------------------------------------------------------

Swapping individual bits with XOR

unsigned int i, j; // positons of bit sequences to swap

unsigned int n; // number of consecutive bits in each sequence

unsigned int b; // bits to swap reside in b

unsigned int r; // bit-swapped result goes here

int x = ((b >> i) ^ (b >> j)) & ((1 << n) - 1); // XOR temporary

r = b ^ ((x << i) | (x << j));

For example, suppose we have have b = 00101111 (expressed in binary) and we want to swap the n = 3 consecutive bits starting at i = 1 (the second bit from the right) with the 3 consecutive bits starting at j = 5; the result would be r = 11100011 (binary).

This method of swapping is similar to the general purpose XOR swap trick, but intended for operating on individual bits. The variable x stores the result of XORing the pairs of bit values we want to swap, and then the bits are set to the result of themselves XORed with x. Of course, the result is undefined if the sequences overlap.

--------------------------------------------------------------------------------

Reverse bits the obvious way

unsigned int v; // reverse the bits in this

unsigned int t = v; // t will have the reversed bits of v

int i;

for (i = sizeof(v) * 8 - 1; i; i--)

{

t |= v & 1;

t <<= 1;

v >>= 1;

}

t |= v & 1;

On October 15, 2004, Michael Hoisie pointed out a bug in the original version.

--------------------------------------------------------------------------------

Reverse bits in word by lookup table

const unsigned char BitReverseTable256[] =

{

0x00, 0x80, 0x40, 0xC0, 0x20, 0xA0, 0x60, 0xE0, 0x10, 0x90, 0x50, 0xD0, 0x30, 0xB0, 0x70, 0xF0,

0x08, 0x88, 0x48, 0xC8, 0x28, 0xA8, 0x68, 0xE8, 0x18, 0x98, 0x58, 0xD8, 0x38, 0xB8, 0x78, 0xF8,

0x04, 0x84, 0x44, 0xC4, 0x24, 0xA4, 0x64, 0xE4, 0x14, 0x94, 0x54, 0xD4, 0x34, 0xB4, 0x74, 0xF4,

0x0C, 0x8C, 0x4C, 0xCC, 0x2C, 0xAC, 0x6C, 0xEC, 0x1C, 0x9C, 0x5C, 0xDC, 0x3C, 0xBC, 0x7C, 0xFC,

0x02, 0x82, 0x42, 0xC2, 0x22, 0xA2, 0x62, 0xE2, 0x12, 0x92, 0x52, 0xD2, 0x32, 0xB2, 0x72, 0xF2,

0x0A, 0x8A, 0x4A, 0xCA, 0x2A, 0xAA, 0x6A, 0xEA, 0x1A, 0x9A, 0x5A, 0xDA, 0x3A, 0xBA, 0x7A, 0xFA,

0x06, 0x86, 0x46, 0xC6, 0x26, 0xA6, 0x66, 0xE6, 0x16, 0x96, 0x56, 0xD6, 0x36, 0xB6, 0x76, 0xF6,

0x0E, 0x8E, 0x4E, 0xCE, 0x2E, 0xAE, 0x6E, 0xEE, 0x1E, 0x9E, 0x5E, 0xDE, 0x3E, 0xBE, 0x7E, 0xFE,

0x01, 0x81, 0x41, 0xC1, 0x21, 0xA1, 0x61, 0xE1, 0x11, 0x91, 0x51, 0xD1, 0x31, 0xB1, 0x71, 0xF1,

0x09, 0x89, 0x49, 0xC9, 0x29, 0xA9, 0x69, 0xE9, 0x19, 0x99, 0x59, 0xD9, 0x39, 0xB9, 0x79, 0xF9,

0x05, 0x85, 0x45, 0xC5, 0x25, 0xA5, 0x65, 0xE5, 0x15, 0x95, 0x55, 0xD5, 0x35, 0xB5, 0x75, 0xF5,

0x0D, 0x8D, 0x4D, 0xCD, 0x2D, 0xAD, 0x6D, 0xED, 0x1D, 0x9D, 0x5D, 0xDD, 0x3D, 0xBD, 0x7D, 0xFD,

0x03, 0x83, 0x43, 0xC3, 0x23, 0xA3, 0x63, 0xE3, 0x13, 0x93, 0x53, 0xD3, 0x33, 0xB3, 0x73, 0xF3,

0x0B, 0x8B, 0x4B, 0xCB, 0x2B, 0xAB, 0x6B, 0xEB, 0x1B, 0x9B, 0x5B, 0xDB, 0x3B, 0xBB, 0x7B, 0xFB,

0x07, 0x87, 0x47, 0xC7, 0x27, 0xA7, 0x67, 0xE7, 0x17, 0x97, 0x57, 0xD7, 0x37, 0xB7, 0x77, 0xF7,

0x0F, 0x8F, 0x4F, 0xCF, 0x2F, 0xAF, 0x6F, 0xEF, 0x1F, 0x9F, 0x5F, 0xDF, 0x3F, 0xBF, 0x7F, 0xFF

};

unsigned int v; // reverse 32-bit value, 8 bits at time

unsigned int c; // c will get v reversed

// Option 1:

c = (BitReverseTable256[v & 0xff] << 24) |

(BitReverseTable256[(v >> 8) & 0xff] << 16) |

(BitReverseTable256[(v >> 16) & 0xff] << 8) |

(BitReverseTable256[(v >> 24) & 0xff]);

// Option 2:

unsigned char * p = (unsigned char *) &v;

unsigned char * q = (unsigned char *) &c;

q[3] = BitReverseTable256[p[0]];

q[2] = BitReverseTable256[p[1]];

q[1] = BitReverseTable256[p[2]];

q[0] = BitReverseTable256[p[3]];

The first method takes about 17 operations, and the second takes about 8, assuming your CPU can load and store bytes easily.

--------------------------------------------------------------------------------

Reverse the bits in a byte with 3 operations (64-bit multiply and modulus division):

unsigned char b; // reverse this byte

b = (b * 0x0202020202ULL & 0x010884422010ULL) % 1023;

The multiply operation creates five separate copies of the 8-bit byte pattern to fan-out into a 64-bit value. The AND operation selects the bits that are in the correct (reversed) positions, relative to each 10-bit groups of bits. The multiply and the AND operations copy the bits from the original byte so they each appear in only one of the 10-bit sets. The reversed positions of the bits from the original byte coincide with their relative positions within any 10-bit set. The last step, which involves modulus division by 2^10 - 1, has the effect of merging together each set of 10 bits (from positions 0-9, 10-19, 20-29, ...) in the 64-bit value. They do not overlap, so the addition steps underlying the modulus division behave like or operations.

This method was attributed to Rich Schroeppel in the Programming Hacks section of Beeler, M., Gosper, R. W., and Schroeppel, R. HAKMEM. MIT AI Memo 239, Feb. 29, 1972.

--------------------------------------------------------------------------------

Reverse the bits in a byte with 4 operations (64-bit multiply, no division):

unsigned char b; // reverse this byte

b = ((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;

The following shows the flow of the bit values with the boolean variables a, b, c, d, e, f, g, and h, which comprise an 8-bit byte. Notice how the first multiply fans out the bit pattern to multiple copies, while the last multiply combines them in the fifth byte from the right.

abcd efgh (-> hgfe dcba)

* 1000 0000 0010 0000 0000 1000 0000 0010 (0x80200802)

-------------------------------------------------------------------------------------------------

0abc defg h00a bcde fgh0 0abc defg h00a bcde fgh0

& 0000 1000 1000 0100 0100 0010 0010 0001 0001 0000 (0x0884422110)

-------------------------------------------------------------------------------------------------

0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000

* 0000 0001 0000 0001 0000 0001 0000 0001 0000 0001 (0x0101010101)

-------------------------------------------------------------------------------------------------

0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000

0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000

0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000

0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000

0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000

-------------------------------------------------------------------------------------------------

0000 d000 h000 dc00 hg00 dcb0 hgf0 dcba hgfe dcba hgfe 0cba 0gfe 00ba 00fe 000a 000e 0000

>> 32

-------------------------------------------------------------------------------------------------

0000 d000 h000 dc00 hg00 dcb0 hgf0 dcba hgfe dcba

& 1111 1111

-------------------------------------------------------------------------------------------------

hgfe dcba

Note that the last two steps can be combined on some processors because the registers can be accessed as bytes; just multiply so that a register stores the upper 32 bits of the result and the take the low byte. Thus, it may take only 6 operations.

Devised by Sean Anderson, July 13, 2001.

--------------------------------------------------------------------------------

Reverse the bits in a byte with 7 operations (no 64-bit):

b = ((b * 0x0802LU & 0x22110LU) | (b * 0x8020LU & 0x88440LU)) * 0x10101LU >> 16

Devised by Sean Anderson, July 13, 2001. Typo spotted correction supplied by Mike Keith, January 3, 2002.

--------------------------------------------------------------------------------

Reverse an N-bit quantity in parallel in 5 * lg(N) operations:

unsigned int v; // 32 bit word to reverse bit order

int const S[] = {1, 2, 4, 8, 16}; // Magic Binary Numbers

int const B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};

v = ((v >> S[0]) & B[0]) | ((v << S[0]) & ~B[0]); // swap odd and even bits

v = ((v >> S[1]) & B[1]) | ((v << S[1]) & ~B[1]); // swap consecutive pairs

v = ((v >> S[2]) & B[2]) | ((v << S[2]) & ~B[2]); // swap nibbles ...

v = ((v >> S[3]) & B[3]) | ((v << S[3]) & ~B[3]);

v = ((v >> S[4]) & B[4]) | ((v << S[4]) & ~B[4]);

This method is best suited to situations where N is large.

Any reasonable optimizing C compiler should treat the dereferences of B, ~B, and S as constants, requiring no evaluation other than perhaps a load operation for some of the B and ~B references.

See Dr. Dobb's Journal 1983, Edwin Freed's article on Binary Magic Numbers for more information.

--------------------------------------------------------------------------------

Compute modulus division by 1 << s without a division operator

const unsigned int n; // numerator

const unsigned int s;

const unsigned int d = 1 << s; // So d will be one of: 1, 2, 4, 8, 16, 32, ...

unsigned int m; // m will be n % d

m = n & (d - 1);

Most programmers learn this trick early, but it was included for the sake of completeness.

--------------------------------------------------------------------------------

Compute modulus division by (1 << s) - 1 without a division operator

unsigned int n; // numerator

const unsigned int s; // s > 0

const unsigned int d = (1 << s) - 1; // so d is either 1, 3, 7, 15, 31, ...).

unsinged int m; // n % d goes here.

for (m = n; n > d; n = m)

{

for (m = 0; n; n >>= s)

{

m += n & d;

}

}

// Now m is a value from 0 to d, but since with modulus division

// we want m to be 0 when it is d.

m = m == d ? 0 : m;

This method of modulus division by an integer that is one less than a power of 2 takes at most 5 + (4 + 5 * ceil(N / s)) * ceil(lg(N / s)) operations, where N is the number of bits in the numerator. In other words, it takes at most O(N * lg(N)) time.

Devised by Sean Anderson, August 15, 2001. Before Sean A. Irvine corrected me on June 17, 2004, I mistakenly commented that we could alternatively assign m = ((m + 1) & d) - 1; at the end.

--------------------------------------------------------------------------------

Compute modulus division by (1 << s) - 1 in parallel without a division operator

// The following is for a word size of 32 bits!

const unsigned int M[] =

{

0x00000000, 0x55555555, 0x33333333, 0xc71c71c7,

0x0f0f0f0f, 0xc1f07c1f, 0x3f03f03f, 0xf01fc07f,

0x00ff00ff, 0x07fc01ff, 0x3ff003ff, 0xffc007ff,

0xff000fff, 0xfc001fff, 0xf0003fff, 0xc0007fff,

0x0000ffff, 0x0001ffff, 0x0003ffff, 0x0007ffff,

0x000fffff, 0x001fffff, 0x003fffff, 0x007fffff,

0x00ffffff, 0x01ffffff, 0x03ffffff, 0x07ffffff,

0x0fffffff, 0x1fffffff, 0x3fffffff, 0x7fffffff

};

const unsigned int Q[][6] =

{

{ 0, 0, 0, 0, 0, 0}, {16, 8, 4, 2, 1, 1}, {16, 8, 4, 2, 2, 2},

{15, 6, 3, 3, 3, 3}, {16, 8, 4, 4, 4, 4}, {15, 5, 5, 5, 5, 5},

{12, 6, 6, 6 , 6, 6}, {14, 7, 7, 7, 7, 7}, {16, 8, 8, 8, 8, 8},

{ 9, 9, 9, 9, 9, 9}, {10, 10, 10, 10, 10, 10}, {11, 11, 11, 11, 11, 11},

{12, 12, 12, 12, 12, 12}, {13, 13, 13, 13, 13, 13}, {14, 14, 14, 14, 14, 14},

{15, 15, 15, 15, 15, 15}, {16, 16, 16, 16, 16, 16}, {17, 17, 17, 17, 17, 17},

{18, 18, 18, 18, 18, 18}, {19, 19, 19, 19, 19, 19}, {20, 20, 20, 20, 20, 20},

{21, 21, 21, 21, 21, 21}, {22, 22, 22, 22, 22, 22}, {23, 23, 23, 23, 23, 23},

{24, 24, 24, 24, 24, 24}, {25, 25, 25, 25, 25, 25}, {26, 26, 26, 26, 26, 26},

{27, 27, 27, 27, 27, 27}, {28, 28, 28, 28, 28, 28}, {29, 29, 29, 29, 29, 29},

{30, 30, 30, 30, 30, 30}, {31, 31, 31, 31, 31, 31}

};

const unsigned int R[][6] =

{

{0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000},

{0x0000ffff, 0x000000ff, 0x0000000f, 0x00000003, 0x00000001, 0x00000001},

{0x0000ffff, 0x000000ff, 0x0000000f, 0x00000003, 0x00000003, 0x00000003},

{0x00007fff, 0x0000003f, 0x00000007, 0x00000007, 0x00000007, 0x00000007},

{0x0000ffff, 0x000000ff, 0x0000000f, 0x0000000f, 0x0000000f, 0x0000000f},

{0x00007fff, 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f},

{0x00000fff, 0x0000003f, 0x0000003f, 0x0000003f, 0x0000003f, 0x0000003f},

{0x00003fff, 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f},

{0x0000ffff, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff},

{0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff},

{0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff},

{0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff},

{0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff},

{0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff},

{0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff},

{0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff},

{0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff},

{0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff},

{0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff},

{0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff},

{0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff},

{0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff},

{0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff},

{0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff},

{0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff},

{0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff},

{0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff},

{0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff},

{0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff},

{0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff},

{0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff},

{0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff}

};

unsigned int n; // numerator

const unsigned int s; // s > 0

const unsigned int d = (1 << s) - 1; // so d is either 1, 3, 7, 15, 31, ...).

unsinged int m; // n % d goes here.

m = (n & M[s]) + ((n & ~M[s]) >> s);

for (const unsigned int * q = &Q[s][0], * r = &R[s][0]; m > d; q++, r++)

{

m = (m >> *q) + (m & *r);

}

m = m == d ? 0 : m; // Or: m = ((m + 1) & d) - 1;

This method of finding modulus division by an integer that is one less than a power of 2 takes at most O(lg(N)) time, where N is the number of bits in the numerator (32 bits, for the code above). The number of operations is at most 12 + 9 * ceil(lg(N)). The tables may be removed if you know the denominator at compile time; just extract the few relevent entries and unroll the loop. It may be easily extended to more bits.

It finds the result by summing the values in base (1 << s) in parallel. First every other base (1 << s) value is added to the previous one. Imagine that the result is written on a piece of paper. Cut the paper in half, so that half the values are on each cut piece. Align the values and sum them onto a new piece of paper. Repeat by cutting this paper in half (which will be a quarter of the size of the previous one) and summing, until you cannot cut further. After performing lg(N/s/2) cuts, we cut no more; just continue to add the values and put the result onto a new piece of paper as before, while there is at least two values.

Devised by Sean Anderson, August 20, 2001.

--------------------------------------------------------------------------------

Find the log base 2 of an integer with the MSB N set in O(N) operations (the obvious way)

unsigned int v; // 32-bit word to find the log base 2 of

unsigned c = 0; // c will be lg(v)

while (v >>= 1) // unroll for more speed...

{

c++;

}

The log base 2 of an integer is the same as the position of the highest bit set (or most significant bit set, MSB). The following log base 2 methods are faster than this one.

--------------------------------------------------------------------------------

Find the log base 2 of an integer with a lookup table

const char LogTable256[] =

{

0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,

4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,

5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,

5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,

6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,

6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,

6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,

6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,

7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,

7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,

7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,

7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,

7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,

7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,

7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,

7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7

};

unsigned int v; // 32-bit word to find the log of

unsigned c = 0; // c will be lg(v)

register unsigned int t, tt; // temporaries

if (tt = v >> 16)

{

c = (t = v >> 24) ? 24 + LogTable256[t] : 16 + LogTable256[tt & 0xFF];

}

else

{

c = (t = v & 0xFF00) ? 8 + LogTable256[t >> 8] : LogTable256[v & 0xFF];

}

The lookup table method takes only about 7 operations to find the log of a 32-bit value. If extended for 64-bit quantities, it would take roughly 9 operations.

--------------------------------------------------------------------------------

Find the log base 2 of an N-bit integer in O(lg(N)) operations

unsigned int v; // 32-bit value to find the log2 of

const unsigned int b[] = {0x2, 0xC, 0xF0, 0xFF00, 0xFFFF0000};

const unsigned int S[] = {1, 2, 4, 8, 16};

int i;

register unsigned int c = 0; // result of log2(v) will go here

for (i = 4; i >= 0; i--) // unroll for speed...

{

if (v & b[i])

{

v >>= S[i];

c |= S[i];

}

}

// OR (IF YOU KNOW v IS A POWER OF 2):

const unsigned int b[] = {0xAAAAAAAA, 0xCCCCCCCC, 0xF0F0F0F0, 0xFF00FF00,

0xFFFF0000};

register unsigned int c = (v & b[0]) != 0;

for (i = 4; i >= 1; i--) // unroll for speed...

{

c |= ((v & b[i]) != 0) << i;

}

Of course, to extend the code to find the log of a 33- to 64-bit number, we would append another element, 0xFFFFFFFF00000000, to b, append 32 to S, and loop from 5 to 0. The second variation was suggested to me by John Owens on April 24, 2002; it's faster, but it is only suitable when the input is known to be a power of 2. On May 25, 2003, Ken Raeburn suggested improving the general case by using smaller numbers for b[], which load faster on some architectures (for instance if the word size is 16 bits, then only one load instruction may be needed). These values work for the general version, but not for the special-case version below it, where v is a power of 2; Glenn Slayden brought this oversight to my attention on December 12, 2003.

--------------------------------------------------------------------------------

Find integer log base 2 of a 32-bit IEEE float

const float v; // find int(log2(v)), where v > 0.0 && finite(v) && isnormal(v)

int c; // c gets the result;

c = ((*(int *) &v) >> 23) - 127;

The above is fast, but many architectures utilize subnormal floating point numbers. These have the exponent bits set to zero (signifying pow(2,-127)), and the mantissa is not normalized, so it contains leading zeros and thus the log2 must be computed from the mantissa. To accomidate for subnormal numbers, use the following:

const float v; // find int(log2(v)), where v > 0.0 && finite(v)

const int x = *(int *) &v;

int c = x >> 23; // c gets the result;

if (c)

{

c -= 127;

}

else

{ // subnormal, so recompute using mantissa: c = intlog2(x) - 149;

register unsigned int t; // temporary

// Note that LogTable256 was defined earlier

if (t = x >> 16)

{

c = LogTable256[t] - 133;

}

else

{

c = (t = x >> 8) ? LogTable256[t] - 141 : LogTable256[x] - 149;

}

}

On June 20, 2004, Sean A. Irvine suggested that I include code to handle subnormal numbers.

--------------------------------------------------------------------------------

Find integer log base 2 of the pow(2, r)-root of a 32-bit IEEE float (for unsigned integer r)

const int r;

const float v; // find int(log2(pow(v, 1. / pow(2, r))), where isnormal(v)

int c; // c gets the result;

c = (((((*(int *) &v) - 0x3f800000) >> r) + 0x3f800000) >> 23) - 127;

So, if r is 0, for example, we have c = int(log2(v)). If r is 1, then we have c = int(log2(sqrt(r))). If r is 2, then we have c = int(log2(pow(r, 1./4))).

--------------------------------------------------------------------------------

Count the consecutive zero bits (trailing) on the right in parallel

unsigned int v; // 32-bit word input to count zero bits on right

register unsigned int c = 32; // c will be the number of zero bits on the right,

// so if v is 01101000 (base 2), then c will be 3 as output.

const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};

const unsigned int S[] = {1, 2, 4, 8, 16}; // Our Magic Binary Numbers

if (v & B[4])

{

v <<= S[4];

c -= S[4];

}

if (v & B[3])

{

v <<= S[3];

c -= S[3];

}

if (v & B[2])

{

v <<= S[2];

c -= S[2];

}

if (v & B[1])

{

v <<= S[1];

c -= S[1];

}

if (v & B[0])

{

v <<= S[0];

c -= S[0];

}

if (v)

{

c--;

}

Here, we are basically doing the same operations as finding the log base 2 in parallel, but the values of b are inverted (in order to count from the right rather than the left), we shift v up rather than down, and c starts at the maximum and is decreased. We also have the additional step at the end, decrementing c if there is anything left in v. The number of operations is at most 4 * lg(N) + 2, roughly, for N bit words.

--------------------------------------------------------------------------------

Count the consecutive zero bits (trailing) on the right with modulus division and lookup

unsigned int v; // find the number of trailing zeros in v

int r; // put the result in r

const int Mod37BitPosition[] = // maps a bit value mod 37 to its position

{

32,0,1,26,2,23,27,0,3,16,24,30,28,11,0,13,4,

7,17,0,25,22,31,15,29,10,12,6,0,21,14,9,5,

20,8,19,18

};

r = Mod37BitPosition[(-v & v) % 37];

The code above finds the number of zeros that are trailing on the right, so binary 0100 would produce 2. It makes use of the fact that the first 32 bit position values are relatively prime with 37, so performing a modulus division with 37 gives a unique number from 0 to 36 for each. These numbers may then be mapped to the number of zeros using a small lookup table. It uses only 4 operations, however indexing into a table and performing modulus division may make it unsuitable for some situations. I came up with this independently and then searched for a subsequence of the table values, and found it was invented earlier by Reiser, according to Hacker's Delight.

--------------------------------------------------------------------------------

Round up to the next highest power of 2

unsigned int v; // compute the next highest power of 2 of 32-bit v

v--;

v |= v >> 1;

v |= v >> 2;

v |= v >> 4;

v |= v >> 8;

v |= v >> 16;

v++;

In 12 operations, this code computes the next highest power of 2 for a 32-bit integer. The result may be expressed by the formula 1 << (lg(v - 1) + 1). It would be faster by 2 operations to use the formula and the log base 2 methed that uses a lookup table, but in some situations, lookup tables are not suitable, so the above code may be best. It works by copying the highest set bit to all of the lower bits, and then adding one, which results in carries that set all of the lower bits to 0 and one bit beyond the highest set bit to 1. If the original number was a power of 2, then the decrement will reduce it to one less, so that we round up to the same original value.

Devised by Sean Anderson, Sepember 14, 2001.

--------------------------------------------------------------------------------

Interleave bits the obvious way

unsigned short x; // Interleave bits of x and y, so that all of the

unsigned short y; // bits of x are in the even positions and y in the odd;

unsigned int z = 0; // z gets the resulting 32-bit Morton Number.

for (int i = 0; i < sizeof(x) * 8; i++) // unroll for more speed...

{

z |= (x & 1 << i) << i | (y & 1 << i) << (i + 1);

}

Interleaved bits (aka Morton numbers) are useful for linearizing 2D integer coordinates, so x and y are combined into a single number that can be compared easily and has the property that a number is usually close to another if their x and y values are close.

--------------------------------------------------------------------------------

Interleave bits by table lookup

const unsigned short MortonTable256[] =

{

0x0000, 0x0001, 0x0004, 0x0005, 0x0010, 0x0011, 0x0014, 0x0015,

0x0040, 0x0041, 0x0044, 0x0045, 0x0050, 0x0051, 0x0054, 0x0055,

0x0100, 0x0101, 0x0104, 0x0105, 0x0110, 0x0111, 0x0114, 0x0115,

0x0140, 0x0141, 0x0144, 0x0145, 0x0150, 0x0151, 0x0154, 0x0155,

0x0400, 0x0401, 0x0404, 0x0405, 0x0410, 0x0411, 0x0414, 0x0415,

0x0440, 0x0441, 0x0444, 0x0445, 0x0450, 0x0451, 0x0454, 0x0455,

0x0500, 0x0501, 0x0504, 0x0505, 0x0510, 0x0511, 0x0514, 0x0515,

0x0540, 0x0541, 0x0544, 0x0545, 0x0550, 0x0551, 0x0554, 0x0555,

0x1000, 0x1001, 0x1004, 0x1005, 0x1010, 0x1011, 0x1014, 0x1015,

0x1040, 0x1041, 0x1044, 0x1045, 0x1050, 0x1051, 0x1054, 0x1055,

0x1100, 0x1101, 0x1104, 0x1105, 0x1110, 0x1111, 0x1114, 0x1115,

0x1140, 0x1141, 0x1144, 0x1145, 0x1150, 0x1151, 0x1154, 0x1155,

0x1400, 0x1401, 0x1404, 0x1405, 0x1410, 0x1411, 0x1414, 0x1415,

0x1440, 0x1441, 0x1444, 0x1445, 0x1450, 0x1451, 0x1454, 0x1455,

0x1500, 0x1501, 0x1504, 0x1505, 0x1510, 0x1511, 0x1514, 0x1515,

0x1540, 0x1541, 0x1544, 0x1545, 0x1550, 0x1551, 0x1554, 0x1555,

0x4000, 0x4001, 0x4004, 0x4005, 0x4010, 0x4011, 0x4014, 0x4015,

0x4040, 0x4041, 0x4044, 0x4045, 0x4050, 0x4051, 0x4054, 0x4055,

0x4100, 0x4101, 0x4104, 0x4105, 0x4110, 0x4111, 0x4114, 0x4115,

0x4140, 0x4141, 0x4144, 0x4145, 0x4150, 0x4151, 0x4154, 0x4155,

0x4400, 0x4401, 0x4404, 0x4405, 0x4410, 0x4411, 0x4414, 0x4415,

0x4440, 0x4441, 0x4444, 0x4445, 0x4450, 0x4451, 0x4454, 0x4455,

0x4500, 0x4501, 0x4504, 0x4505, 0x4510, 0x4511, 0x4514, 0x4515,

0x4540, 0x4541, 0x4544, 0x4545, 0x4550, 0x4551, 0x4554, 0x4555,

0x5000, 0x5001, 0x5004, 0x5005, 0x5010, 0x5011, 0x5014, 0x5015,

0x5040, 0x5041, 0x5044, 0x5045, 0x5050, 0x5051, 0x5054, 0x5055,

0x5100, 0x5101, 0x5104, 0x5105, 0x5110, 0x5111, 0x5114, 0x5115,

0x5140, 0x5141, 0x5144, 0x5145, 0x5150, 0x5151, 0x5154, 0x5155,

0x5400, 0x5401, 0x5404, 0x5405, 0x5410, 0x5411, 0x5414, 0x5415,

0x5440, 0x5441, 0x5444, 0x5445, 0x5450, 0x5451, 0x5454, 0x5455,

0x5500, 0x5501, 0x5504, 0x5505, 0x5510, 0x5511, 0x5514, 0x5515,

0x5540, 0x5541, 0x5544, 0x5545, 0x5550, 0x5551, 0x5554, 0x5555

};

unsigned short x; // Interleave bits of x and y, so that all of the

unsigned short y; // bits of x are in the even positions and y in the odd;

unsigned int z; // z gets the resulting 32-bit Morton Number.

z = (MortonTable256[y & 0xFF] << 1) | MortonTable256[x & 0xFF] |

(MortonTable256[y >> 8] << 1 | MortonTable256[x >> 8]) << 16;

--------------------------------------------------------------------------------

Interleave bits with 64-bit multiply

In 11 operations, this version interleaves bits of two bytes (rather than shorts, as in the other versions), but many of the operations are 64-bit multiplies so it isn't appropriate for all machines.

unsigned char x; // Interleave bits of x and y, so that all of the

unsigned char y; // bits of x are in the even positions and y in the odd;

unsigned short z; // z gets the resulting 16-bit Morton Number.

z = ((x * 0x0101010101010101ULL & 0x8040201008040201ULL) *

0x0102040810204081ULL >> 49) & 0x5555 |

((y * 0x0101010101010101ULL & 0x8040201008040201ULL) *

0x0102040810204081ULL >> 48) & 0xAAAA;

Holger Bettag was inspired to suggest this technique on October 10, 2004 after reading the multiply-based bit reversals here.

--------------------------------------------------------------------------------

Interleave bits by Binary Magic Numbers

const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF};

const unsigned int S[] = {1, 2, 4, 8};

unsigned int x; // Interleave lower 16 bits of x and y, so the bits of x

unsigned int y; // are in the even positions and bits from y in the odd;

unsigned int z; // z gets the resulting 32-bit Morton Number.

x = (x | (x << S[3])) & B[3];

x = (x | (x << S[2])) & B[2];

x = (x | (x << S[1])) & B[1];

x = (x | (x << S[0])) & B[0];

y = (y | (y << S[3])) & B[3];

y = (y | (y << S[2])) & B[2];

y = (y | (y << S[1])) & B[1];

y = (y | (y << S[0])) & B[0];

z = x | (y << 1);

--------------------------------------------------------------------------------

Determine if a word has a zero byte

// Fewer operations:

unsigned int v; // 32-bit word to check if any 8-bit byte in it is 0

bool hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);

The code above may be useful when doing a fast string copy in which a word is copied at a time; it uses 5 operations. On the other hand, testing for a null byte in the obvious ways (which follow) have at least 7 operations (when counted in the most sparing way), and at most 12.

// More operations:

bool hasNoZeroByte = ((v & 0xff) && (v & 0xff00) && (v & 0xff0000) && (v & 0xff000000))

// OR:

unsigned char * p = (unsigned char *) &v;

bool hasNoZeroByte = *p && *(p + 1) && *(p + 2) && *(p + 3);

The code at the beginning of this section (labeled "Fewer operations") works by first zeroing the high bits of the 4 bytes in the word. Subsequently, it adds a number that will result in an overflow to the high bit of a byte if any of the low bits were initialy set. Next the high bits of the original word are ORed with these values; thus, the high bit of a byte is set iff any bit in the byte was set. Finally, we determine if any of these high bits are zero by ORing with ones everywhere except the high bits and inverting the result. Extending to 64 bits is trivial; simply increase the constants to be 0x7F7F7F7F7F7F7F7F.

For an additional improvement, a fast pretest that requires only 4 operations may be performed to determine if the word may have a zero byte. The test also returns true if the high byte is 0x80, so there are occasional false positives, but the slower and more reliable version above may then be used on candidates for an overall increase in speed with correct output.

bool mayHaveZeroByte = ((v + 0x7efefeff) ^ ~v) & 0x81010100;

if (mayHaveZeroByte) // or may just have 0x80 in the high byte

{

bool hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);

}

Paul Messmer suggested the fast pretest improvement on October 2, 2004.

2.

Fast Bit Counting Routines

Compiled from various sources by Gurmeet Singh Manku

From http://www-db.stanford.edu/~manku/bitcount/bitcount.html

A common problem asked in job interviews is to count the number of bits that are on in an unsigned integer. Here are seven solutions to this problem. Source code in C is available.

Iterated Count

int bitcount (unsigned int n)

{

int count=0;

while (n)

{

count += n & 0x1u ;

n >>= 1 ;

}

return count ;

}

Sparse Ones

int bitcount (unsigned int n)

{

int count=0 ;

while (n)

{

count++ ;

n &= (n - 1) ;

}

return count ;

}

Dense Ones

int bitcount (unsigned int n)

{

int count = 8 * sizeof(int) ;

n ^= (unsigned int) -1 ;

while (n)

{

count-- ;

n &= (n - 1) ;

}

return count ;

}

Precompute_8bit

// static int bits_in_char [256] ;

int bitcount (unsigned int n)

{

// works only for 32-bit ints

return bits_in_char [n & 0xffu]

+ bits_in_char [(n >> 8) & 0xffu]

+ bits_in_char [(n >> 16) & 0xffu]

+ bits_in_char [(n >> 24) & 0xffu] ;

}

Iterated Count runs in time proportional to the total number of bits. It simply loops through all the bits, terminating slightly earlier because of the while condition. Useful if 1's are sparse and among the least significant bits. Sparse Ones runs in time proportional to the number of 1 bits. The line n &= (n - 1) simply sets the rightmost 1 bit in n to 0. Dense Ones runs in time proportional to the number of 0 bits. It is the same as Sparse Ones, except that it first toggles all bits (n ~= -1), and continually subtracts the number of 1 bits from sizeof(int). Precompute_8bit assumes an array bits_in_char such that bits_in_char[i] contains the number of 1 bits in the binary representation for i. It repeatedly updates count by masking out the last eight bits in n, and indexing into bits_in_char.

Precompute_16bit

// static char bits_in_16bits [0x1u << 16] ;

int bitcount (unsigned int n)

{

// works only for 32-bit ints

return bits_in_16bits [n & 0xffffu]

+ bits_in_16bits [(n >> 16) & 0xffffu] ;

}

Precompute_16bit is a variant of Precompute_8bit in that an array bits_in_16bits[] stores the number of 1 bits in successive 16 bit numbers (shorts).

Parallel Count

#define TWO(c) (0x1u << (c))

#define MASK(c) (((unsigned int)(-1)) / (TWO(TWO(c)) + 1u))

#define COUNT(x,c) ((x) & MASK(c)) + (((x) >> (TWO(c))) & MASK(c))

int bitcount (unsigned int n)

{

n = COUNT(n, 0) ;

n = COUNT(n, 1) ;

n = COUNT(n, 2) ;

n = COUNT(n, 3) ;

n = COUNT(n, 4) ;

/* n = COUNT(n, 5) ; for 64-bit integers */

return n ;

}

Parallel Count carries out bit counting in a parallel fashion. Consider n after the first line has finished executing. Imagine splitting n into pairs of bits. Each pair contains the number of ones in those two bit positions in the original n. After the second line has finished executing, each nibble contains the number of ones in those four bits positions in the original n. Continuing this for five iterations, the 64 bits contain the number of ones among these sixty-four bit positions in the original n. That is what we wanted to compute.

Nifty Parallel Count

#define MASK_01010101 (((unsigned int)(-1))/3)

#define MASK_00110011 (((unsigned int)(-1))/5)

#define MASK_00001111 (((unsigned int)(-1))/17)

int bitcount (unsigned int n)

{

n = (n & MASK_01010101) + ((n >> 1) & MASK_01010101) ;

n = (n & MASK_00110011) + ((n >> 2) & MASK_00110011) ;

n = (n & MASK_00001111) + ((n >> 4) & MASK_00001111) ;

return n % 255 ;

}

Nifty Parallel Count works the same way as Parallel Count for the first three iterations. At the end of the third line (just before the return), each byte of n contains the number of ones in those eight bit positions in the original n. A little thought then explains why the remainder modulo 255 works.

MIT HACKMEM Count

int bitcount(unsigned int n)

{

/* works for 32-bit numbers only */

/* fix last line for 64-bit numbers */

register unsigned int tmp;

tmp = n - ((n >> 1) & 033333333333)

- ((n >> 2) & 011111111111);

return ((tmp + (tmp >> 3)) & 030707070707) % 63;

}

MIT Hackmem Count is funky. Consider a 3 bit number as being 4a+2b+c. If we shift it right 1 bit, we have 2a+b. Subtracting this from the original gives 2a+b+c. If we shift the original 2 bits right we get a, and so with another subtraction we have a+b+c, which is the number of bits in the original number. How is this insight employed? The first assignment statement in the routine computes tmp. Consider the octal representation of tmp. Each digit in the representation is simply the number of 1's in the corresponding three bit positions in n. The last return statement sums these octal digits to produce the final answer. The key idea is to add adjacent pairs of octal digits together and then compute the remainder modulus 63. This is accomplished by right-shifting tmp by three bits, adding it to tmp itself and ANDing with a suitable mask. This yields a number in which groups of six adjacent bits (starting from the LSB) contain the number of 1's among those six positions in n. This number modulo 63 yields the final answer. For 64-bit numbers, we would have to add triples of octal digits and use modulus 1023. This is HACKMEM 169, as used in X11 sources. Source: MIT AI Lab memo, late 1970's.

No Optimization Some Optimization Heavy Optimization

Precomp_16 52.94 Mcps Precomp_16 76.22 Mcps Precomp_16 80.58 Mcps

Precomp_8 29.74 Mcps Precomp_8 49.83 Mcps Precomp_8 51.65 Mcps

Parallel 19.30 Mcps Parallel 36.00 Mcps Parallel 38.55 Mcps

MIT 16.93 Mcps MIT 17.10 Mcps Nifty 31.82 Mcps

Nifty 12.78 Mcps Nifty 16.07 Mcps MIT 29.71 Mcps

Sparse 5.70 Mcps Sparse 15.01 Mcps Sparse 14.62 Mcps

Dense 5.30 Mcps Dense 14.11 Mcps Dense 14.56 Mcps

Iterated 3.60 Mcps Iterated 3.84 Mcps Iterated 9.24 Mcps

Mcps = Million counts per second

Which of the several bit counting routines is the fastest? Results of speed trials on an i686 are summarized in the table on left. "No Optimization" was compiled with plain gcc. "Some Optimizations" was gcc -O3. "Heavy Optimizations" corresponds to gcc -O3 -mcpu=i686 -march=i686 -fforce-addr -funroll-loops -frerun-cse-after-loop -frerun-loop-opt -malign-functions=4.

Thanks to Seth Robertson who suggested performing speed trials by extending bitcount.c. Seth also pointed me to MIT_Hackmem routine. Thanks to Denny Gursky who suggested the idea of Precompute_11bit. That would require three sums (11-bit, 11-bit and 10-bit precomputed counts). I then tried Precompute_16bit which turned out to be even faster.