Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I'll run it for you on the same machine Wojciech is using and report back shortly. I'm getting a compiler error for your call to _pext_u64(), but I think it's just a matter of adjusting the compiler flags. OK, adding "-march=native" works (was only -mavx512vbmi). Oops, now ./unittest fails:

  [nate@nuc avx512-remove-spaces]$ ./unittest
  test 1 gap
  FAILED; len_ref=63, len=0
   input: [ bcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789#@]
   ref: [bcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789#@]
  result: []
Did you maybe flip the args to _pext_u64() like I often do? I'll check back in a few minutes and see if you've committed a fix.

Edit: Getting to my bedtime here. Send me email (in profile) if you'd like me to follow up with it tomorrow.



Ah, just figured it out. I accidentally deleted this line:

            len = 64 - __builtin_popcountll(mask);
So the destination pointer never got updated. Mental debugging is kind of annoying :)

Thanks!


OK, I'm still up and just saw your fix. Yes, that makes the ./unittest work. I have numbers for ./benchmark; let me grab and compile his original for comparison. Do you know which result you'd expect to see the best difference for?

Results: Yes, if I'm reading things right (and assuming the unittest caught any issues) you are much faster at large cardinalities! Your approach seems to be a constant .7 cycle/op, while Wojciech's varies from about .4 for card=1 to 4.0 for card=64. The breakeven point is card=7, with your approach faster at all higher numbers. Nice improvement!

Send me email if you'd like to pursue this further tomorrow. Also, you probably want to read dragontamer's post here closely and mentally consider whether his approach might make even more sense. I haven't thought about it yet, but if he thinks he knows the best way, he may well be right. Good night!

Edit: Just noticed that my numbers seem slightly faster than the ones Wojciech reported for large cardinalities. My guess would be that this is because I added "-march=native" to both compilations, although maybe there's something more subtle happening too. He's in Europe, so probably asleep now, but will likely be by to offer his thoughts in a few hours.


Awesome, thanks! I'll play around a bit with the algorithms and see if I come up with anything neat. I'll email you if I do.


Just favorited this thread for algorithmic awesomeness. Thanks guys.


Keep in mind that the microbenchmark uses exactly fixed cardinalities, with no randomness, so the branches in the original algorithm will be perfectly predicted and hence will tend to inflate the reported performance versus a case where the number of spaces has an expected cardinality but with enough variance that the loop mispredicts once for every block.

So I suspect the cutover point for the pext approach to be better than the microbenchmark suggestions for typical real-world inputs.

It would also be interesting to see a lookup based approach, using multiple lookups (probably with gather) to build the shuffle control for a single block. 0.7 cycles per element means ~45 cycles per 64 byte block, so that's still quite a large budget per block, and the load ports have little pressure.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: