LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 1488 - bithacks: optimize code that counts bits
Summary: bithacks: optimize code that counts bits
Status: RESOLVED LATER
Alias: None
Product: libraries
Classification: Unclassified
Component: Scalar Optimizations (show other bugs)
Version: trunk
Hardware: All All
: P enhancement
Assignee: Unassigned LLVM Bugs
URL: http://graphics.stanford.edu/~seander...
Keywords: code-quality
: 1792 (view as bug list)
Depends on:
Blocks: 17101
  Show dependency tree
 
Reported: 2007-06-02 20:35 PDT by Nick Lewycky
Modified: 2013-09-23 17:45 PDT (History)
5 users (show)

See Also:
Fixed By Commit(s):


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Nick Lewycky 2007-06-02 20:35:20 PDT
The common idiom for counting the number of bits set can be optimized.

unsigned countbits_slow(unsigned v)
{
  unsigned c;
  for (c = 0; v; v >>= 1)
  {
    c += v & 1;
  }
  return c;
}

This can be rewritten as such:

unsigned countbits_fast(unsigned v)
{
  unsigned c;
  for (c = 0; v; c++)
  {
    v &= v - 1; // clear the least significant bit set
  }
  return c;
}

There are other ways to rewrite the bit-counting code, but they involve other
tradeoffs such as constant arrays. Also, the naive method may be written such
that it starts with the high bit or low bit for performance, if the programmer
happens to know about the distribution of the bits in advance. countbits_fast
will equal or outperform either implementation.

With stock GCC, countbits_slow has a 63% longer run-time in a tight loop, while
with llvm-gcc, countbits_slow has a 45% longer run-time. (Sadly, llvm-gcc is
slower than GCC at either function on x86 Linux, but that's another bug.)
Comment 1 Nick Lewycky 2007-06-02 20:37:08 PDT
I forgot to mention that this is also the llvm.ctpop.* intrinsic. We should
match the code better and, at least on x86, emit it better.
Comment 2 Chris Lattner 2007-06-02 20:37:24 PDT
The right way to do this is to recognize the loop and turn it into a call to llvm.ctpop
Comment 3 Chris Lattner 2007-06-02 20:48:14 PDT
fwiw, 186.crafty contains this loop:

  int PopCnt(register BITBOARD a)
  {
    register int c=0;
    while(a) {
      c++;
      a &= a - 1;
    }
    return(c);
  }

BITBOARD = unsigned long long

If you're interested in matching these patterns, matching this one to llvm.popcnt would result in a huge 
speedup for crafty.

-Chris
Comment 4 Nick Lewycky 2007-11-12 17:10:55 PST
I just got this on IRC:

  unsigned int popcount(unsigned int input) {
    unsigned int count = 0;
    for (unsigned int i =  0; i < 4 * 8; i++)
      count += (input >> i) & i;
    return count;
  }

with the question, "why doesn't loop-unroll unroll my loop?".
Comment 5 Nick Lewycky 2007-11-12 17:12:37 PST
*** Bug 1792 has been marked as a duplicate of this bug. ***
Comment 6 Eli Friedman 2008-05-18 21:58:04 PDT
(In reply to comment #3)
> If you're interested in matching these patterns, matching this one to llvm.popcnt would result in a huge 
> speedup for crafty.

For crafty in particular, it actually has asm popcount/ctlz/cttx functions (if you pass in the right flags on X86).  And the asm implementations are faster than what LLVM currently generates.
Comment 7 Nadav Rotem 2008-05-19 11:39:22 PDT
(In reply to comment #6)
> (In reply to comment #3)
> > If you're interested in matching these patterns, matching this one to llvm.popcnt would result in a huge 
> > speedup for crafty.
> 
> For crafty in particular, it actually has asm popcount/ctlz/cttx functions (if
> you pass in the right flags on X86).  And the asm implementations are faster
> than what LLVM currently generates.
> 

This requires SSE4 support:
  http://en.wikipedia.org/wiki/SSE4#SSE4.2

Comment 8 Chris Lattner 2008-05-19 13:23:48 PDT
I'm not aware of any shipping cpu with SSE4.2.  Penryn etc only have 4.1
Comment 9 Eli Friedman 2008-05-19 14:04:59 PDT
(In reply to comment #7)
> This requires SSE4 support:
>   http://en.wikipedia.org/wiki/SSE4#SSE4.2

crafty is actually just using a simple loop for popcount on X86; I think it ends up being faster than __builtin_popcount because the popcount is usually small.
Comment 10 Chris Lattner 2008-10-15 11:02:28 PDT
I moved these ideas into lib/Target/README.txt
Comment 11 Shuxin Yang 2012-12-13 17:28:45 PST
We are now able to recognize following patten.  Since Popcount recognization is architecture-dependent 
optimization. It is enabled only if the underlying architecture has HW support. ScalarTargetTransformInfo::getPopcntHwSupport() serves as the interface between the LoopIdiomRecognize pass and HW support. 

On x8864, currently popcount optimization is trigged if -msse4 is turned on (or other flags that implies sse4).  The CodeGen will emit POPCNT instruction (in SSE4 instruction set). 

According to "http://www.strchr.com/crc32_popcnt", the POPCNT instruction is not as faster as the SSE3 implementation. 

In the near future, we need to implement popcount HW support using SSE3 instruction. 



 int PopCnt(register BITBOARD a)
  {
    register int c=0;
    while(a) {
      c++;
      a &= a - 1;
    }
    return(c);
  }