Tuesday, July 11, 2006
Fast Bit Counting Routines
This seems to have vanished from its stanford site. Thought I'd archive it here for my reference. Very interesting concepts.
Compiled from various sources by Gurmeet Singh Manku
Iterated Count runs in time
proportional to the total number of bits. It simply loops through
all the bits, terminating slightly earlier because of the while
condition. Useful if 1's are sparse and among the least significant
bits. Sparse Ones runs in time
proportional to the number of 1 bits. The line n &= (n - 1) simply
sets the rightmost 1 bit in n to 0. Dense
Ones runs in time proportional to the number of 0 bits.
It is the same as Sparse Ones, except that it first toggles all bits
(n ~= -1), and continually subtracts the number of 1 bits from
sizeof(int). Precompute_8bit
assumes an array bits_in_char such that bits_in_char[i] contains the
number of 1 bits in the binary representation for i. It repeatedly
updates count by masking out the last eight bits in
n, and indexing
into bits_in_char.
Precompute_16bit is a variant of
Precompute_8bit in that an array bits_in_16bits[] stores the number
of 1 bits in successive 16 bit numbers (shorts).
Parallel Count carries out bit
counting in a parallel fashion. Consider n after the first line has
finished executing. Imagine splitting n into pairs of bits. Each
pair contains the number of ones in those two bit positions
in the original n. After the second line has finished executing,
each nibble contains the number of ones in those four bits
positions in the original n. Continuing this for five iterations,
the 64 bits contain the number of ones among these sixty-four bit
positions in the original n. That is what we wanted to compute.
Nifty Parallel Count works the same
way as Parallel Count for the first three iterations. At the end of
the third line (just before the return), each byte of n contains the
number of ones in those eight bit positions in the original n. A
little thought then explains why the remainder modulo 255 works.
MIT Hackmem Count is funky.
Consider a 3 bit number as being 4a+2b+c. If we shift it right 1
bit, we have 2a+b. Subtracting this from the original gives 2a+b+c.
If we right-shift the original 3-bit number by two bits, we get a,
and so with another subtraction we have a+b+c, which is the number of
bits in the original number. How is this insight employed? The
first assignment statement in the routine computes tmp.
Consider the octal representation of tmp. Each digit in the
octal representation is simply the number of 1's in the corresponding
three bit positions in n. The last return statement sums
these octal digits to produce the final answer. The key idea is to
add adjacent pairs of octal digits together and then compute the
remainder modulus 63. This is accomplished by right-shifting
tmp by three bits, adding it to tmp itself and
ANDing with a suitable mask. This yields a number in which groups of
six adjacent bits (starting from the LSB) contain the number of 1's
among those six positions in n. This number modulo 63
yields the final answer. For 64-bit numbers, we would have to add
triples of octal digits and use modulus 1023. This is HACKMEM 169,
as used in X11 sources. Source: MIT AI Lab memo, late 1970's.
Which of the several bit counting routines is the fastest? Results
of speed trials on an i686 are summarized in the table on left. "No
Optimization" was compiled with plain gcc. "Some
Optimizations" was gcc -O3. "Heavy Optimizations"
corresponds to gcc -O3 -mcpu=i686 -march=i686 -fforce-addr
-funroll-loops -frerun-cse-after-loop -frerun-loop-opt
-malign-functions=4.
Thanks to Seth
Robertson who suggested performing speed trials by extending href="bitcount.c">bitcount.c. Seth also pointed me to
MIT_Hackmem routine. Thanks to href="mailto:dennyrolling@hotmail.com"> Denny Gursky who
suggested the idea of Precompute_11bit. That would require three
sums (11-bit, 11-bit and 10-bit precomputed counts). I then tried
Precompute_16bit which turned out to be even faster.
Gurmeet Singh Manku
Last update: 27 Jul 2002
Fast Bit Counting Routines
Compiled from various sources by Gurmeet Singh Manku
A common problem asked in job interviews is to count the number of
bits that are on in an unsigned integer. Here are
seven solutions to
this problem. Source code in C is available.
|
| ||||
|
|
Iterated Count runs in time
proportional to the total number of bits. It simply loops through
all the bits, terminating slightly earlier because of the while
condition. Useful if 1's are sparse and among the least significant
bits. Sparse Ones runs in time
proportional to the number of 1 bits. The line n &= (n - 1) simply
sets the rightmost 1 bit in n to 0. Dense
Ones runs in time proportional to the number of 0 bits.
It is the same as Sparse Ones, except that it first toggles all bits
(n ~= -1), and continually subtracts the number of 1 bits from
sizeof(int). Precompute_8bit
assumes an array bits_in_char such that bits_in_char[i] contains the
number of 1 bits in the binary representation for i. It repeatedly
updates count by masking out the last eight bits in
n, and indexing
into bits_in_char.
Precompute_16bit |
|
Precompute_16bit is a variant of
Precompute_8bit in that an array bits_in_16bits[] stores the number
of 1 bits in successive 16 bit numbers (shorts).
Parallel Count |
|
Parallel Count carries out bit
counting in a parallel fashion. Consider n after the first line has
finished executing. Imagine splitting n into pairs of bits. Each
pair contains the number of ones in those two bit positions
in the original n. After the second line has finished executing,
each nibble contains the number of ones in those four bits
positions in the original n. Continuing this for five iterations,
the 64 bits contain the number of ones among these sixty-four bit
positions in the original n. That is what we wanted to compute.
Nifty Parallel Count |
|
Nifty Parallel Count works the same
way as Parallel Count for the first three iterations. At the end of
the third line (just before the return), each byte of n contains the
number of ones in those eight bit positions in the original n. A
little thought then explains why the remainder modulo 255 works.
MIT HACKMEM Count |
|
MIT Hackmem Count is funky.
Consider a 3 bit number as being 4a+2b+c. If we shift it right 1
bit, we have 2a+b. Subtracting this from the original gives 2a+b+c.
If we right-shift the original 3-bit number by two bits, we get a,
and so with another subtraction we have a+b+c, which is the number of
bits in the original number. How is this insight employed? The
first assignment statement in the routine computes tmp.
Consider the octal representation of tmp. Each digit in the
octal representation is simply the number of 1's in the corresponding
three bit positions in n. The last return statement sums
these octal digits to produce the final answer. The key idea is to
add adjacent pairs of octal digits together and then compute the
remainder modulus 63. This is accomplished by right-shifting
tmp by three bits, adding it to tmp itself and
ANDing with a suitable mask. This yields a number in which groups of
six adjacent bits (starting from the LSB) contain the number of 1's
among those six positions in n. This number modulo 63
yields the final answer. For 64-bit numbers, we would have to add
triples of octal digits and use modulus 1023. This is HACKMEM 169,
as used in X11 sources. Source: MIT AI Lab memo, late 1970's.
No Optimization Some Optimization Heavy Optimization |
Which of the several bit counting routines is the fastest? Results
of speed trials on an i686 are summarized in the table on left. "No
Optimization" was compiled with plain gcc. "Some
Optimizations" was gcc -O3. "Heavy Optimizations"
corresponds to gcc -O3 -mcpu=i686 -march=i686 -fforce-addr
-funroll-loops -frerun-cse-after-loop -frerun-loop-opt
-malign-functions=4.
Thanks to Seth
Robertson who suggested performing speed trials by extending href="bitcount.c">bitcount.c. Seth also pointed me to
MIT_Hackmem routine. Thanks to href="mailto:dennyrolling@hotmail.com"> Denny Gursky who
suggested the idea of Precompute_11bit. That would require three
sums (11-bit, 11-bit and 10-bit precomputed counts). I then tried
Precompute_16bit which turned out to be even faster.
If you have niftier solutions up your sleeves, please href="mailto:manku@cs.stanford.edu">send me an e-mail
Gurmeet Singh Manku
Last update: 27 Jul 2002