I'll run it for you on the same machine Wojciech is using and report back shortl...

zwegner · on Jan 6, 2019

Ah, just figured it out. I accidentally deleted this line:

            len = 64 - __builtin_popcountll(mask);

So the destination pointer never got updated. Mental debugging is kind of annoying :)

Thanks!

nkurz · on Jan 6, 2019

OK, I'm still up and just saw your fix. Yes, that makes the ./unittest work. I have numbers for ./benchmark; let me grab and compile his original for comparison. Do you know which result you'd expect to see the best difference for?

Results: Yes, if I'm reading things right (and assuming the unittest caught any issues) you are much faster at large cardinalities! Your approach seems to be a constant .7 cycle/op, while Wojciech's varies from about .4 for card=1 to 4.0 for card=64. The breakeven point is card=7, with your approach faster at all higher numbers. Nice improvement!

Send me email if you'd like to pursue this further tomorrow. Also, you probably want to read dragontamer's post here closely and mentally consider whether his approach might make even more sense. I haven't thought about it yet, but if he thinks he knows the best way, he may well be right. Good night!

Edit: Just noticed that my numbers seem slightly faster than the ones Wojciech reported for large cardinalities. My guess would be that this is because I added "-march=native" to both compilations, although maybe there's something more subtle happening too. He's in Europe, so probably asleep now, but will likely be by to offer his thoughts in a few hours.

zwegner · on Jan 6, 2019

Awesome, thanks! I'll play around a bit with the algorithms and see if I come up with anything neat. I'll email you if I do.

wjnc · on Jan 6, 2019

Just favorited this thread for algorithmic awesomeness. Thanks guys.

BeeOnRope · on Jan 6, 2019

Keep in mind that the microbenchmark uses exactly fixed cardinalities, with no randomness, so the branches in the original algorithm will be perfectly predicted and hence will tend to inflate the reported performance versus a case where the number of spaces has an expected cardinality but with enough variance that the loop mispredicts once for every block.

So I suspect the cutover point for the pext approach to be better than the microbenchmark suggestions for typical real-world inputs.

It would also be interesting to see a lookup based approach, using multiple lookups (probably with gather) to build the shuffle control for a single block. 0.7 cycles per element means ~45 cycles per 64 byte block, so that's still quite a large budget per block, and the load ports have little pressure.