I would be surprised if prefetching helped at all, given that the memory access ...

floody-berry · on April 8, 2013

I made a quick test for the C versions to try out flags and such: https://gist.github.com/floodyberry/5335542

I found out gcc supports -fprefetch-loop-arrays, although it isn't guaranteed to have a positive effect. In this case, it does appear to run faster than without prefetching. The AVX versions also run faster than their standard SSE counterparts, even when limited to 128-bit registers, and using ymm registers is faster still than xmm.
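
For anyone who wants to compare the compiler-inserted prefetching against a hand-written hint, here is a minimal sketch. It is not the code from the gist: the function names and the PREFETCH_AHEAD distance are made up for illustration, and the distance would need tuning per machine. The only non-standard piece is `__builtin_prefetch`, which is a gcc builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical look-ahead distance, in elements. An assumption for
 * illustration only; the right value depends on the CPU and the loop. */
#define PREFETCH_AHEAD 64

/* Plain loop. Built with something like
 *   gcc -O3 -fprefetch-loop-arrays
 * the compiler may (or may not) add prefetch instructions here itself. */
uint64_t sum_plain(const uint32_t *data, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Same loop with an explicit prefetch hint for a later element.
 * Arguments: address, 0 = prefetch for read, 3 = keep in all cache levels. */
uint64_t sum_prefetch(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3);
        total += data[i];
    }
    return total;
}
```

Both functions return the same sum; any difference would only show up in timing, and for a simple sequential walk like this the hardware prefetcher may already hide most of the latency, so the explicit hint is not guaranteed to help either.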

gcc is actually faster than icc, and icc runs faster with the plain C version than with the vectorized versions. I also can't figure out how to make icc prefetch; the switch that controls it doesn't seem to have any effect.

jules · on April 8, 2013

Nice! What were the results?