@avx does a ton more than just use AVX instructions. It'll reorder and unroll loops when advantageous, swap out some functions for more vectorizable versions of those functions, and apply a few other tricks.
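For concreteness, here's roughly what using the macro looks like: a plain reduction loop annotated with @avx from LoopVectorization.jl (the macro was later renamed @turbo in newer releases). The function name `dot_avx` is just for illustration.

```julia
using LoopVectorization  # provides @avx (later renamed @turbo)

# A plain dot-product loop; @avx rewrites it to use SIMD registers,
# unrolling and reordering as it sees fit.
function dot_avx(a, b)
    s = zero(eltype(a))
    @avx for i in eachindex(a)
        s += a[i] * b[i]
    end
    return s
end
```

Without the macro the compiler can often vectorize this anyway, but @avx is much more aggressive about unrolling and about handling reductions.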
Julia uses AVX instructions by default if your code is amenable to it.
Sorry, I see how what I wrote could be interpreted that way; part of it was written out of ignorance. While I didn't say so explicitly, I assumed the macro was doing something equivalent to the template metaprogramming I've heard C++ uses for similar things.
What I didn't see was how I could use the AVX2 instructions myself.
Checking now: since Julia's count_ones() maps to the LLVM popcount intrinsic, and recent clang versions know how to optimize that fixed-length sequence in C even for AVX-512, the Julia equivalent of the code I wrote should have good performance.
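To illustrate the count_ones() point: a minimal sketch of the kind of loop in question, a total popcount over a vector of words. `count_ones` lowers to LLVM's `llvm.ctpop` intrinsic, so the whole reduction is exactly the fixed-length bit-counting sequence the compiler knows how to vectorize (the function name here is hypothetical).

```julia
# Total number of set bits across a vector of 64-bit words.
# count_ones compiles to LLVM's ctpop intrinsic, which LLVM can
# vectorize (e.g. with VPOPCNTDQ on AVX-512 hardware that has it).
popcount_sum(words::AbstractVector{UInt64}) = sum(count_ones, words)
```

You can check what the compiler actually emits with `@code_native popcount_sum(rand(UInt64, 64))`.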
There are a few optimizations (keeping one AVX register loaded with a constant byte string, and using prefetch instructions) which might be missing. I'll be talking with the conference participant who brought up Julia to work this out in more detail.
> What I didn't see was how I could use the AVX2 instructions myself.
If you ever find yourself in a situation where you want manual control over vectorization, the package SIMD.jl [1] is pretty good for handwritten vectorization. There's also VectorizationBase.jl [2], which LoopVectorization.jl uses. Which of these two packages is most appropriate just kinda depends on what sort of interface you prefer.
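As a rough sketch of what SIMD.jl's interface looks like: an explicitly vectorized sum using its `Vec{N,T}` type and `vload`, with a scalar tail loop for the leftover elements. The function name `simd_sum` is made up for this example, and the details are from my reading of the package, so treat it as a sketch rather than canonical usage.

```julia
using SIMD

# Sum a Float64 vector four lanes at a time using explicit SIMD.
function simd_sum(a::Vector{Float64})
    V = Vec{4,Float64}
    acc = zero(V)
    i = 1
    @inbounds while i + 3 <= length(a)
        acc += vload(V, a, i)  # load lanes a[i:i+3] into one register
        i += 4
    end
    s = sum(acc)               # horizontal reduction of the lanes
    @inbounds while i <= length(a)
        s += a[i]              # scalar tail for the remainder
        i += 1
    end
    return s
end
```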
Thanks for the pointers! The first looks like it gives a language for SIMD operations, but not all of the intrinsics. The second is for vectorization, which also doesn't include all of the intrinsics.
However! The second shows an example with Base.llvmcall, which lets me write LLVM IR directly. That should let me do anything I want.
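A small sketch of the llvmcall route, calling the `llvm.ctpop` intrinsic by hand. On recent Julia versions, calling a declared intrinsic requires passing a tuple of (full IR module, entry function name); the module text and the `popcnt` wrapper name here are my own, so check against the Base.llvmcall docs for your Julia version.

```julia
# Full LLVM IR module: declare the intrinsic and define an entry point.
const POPCNT_IR = """
declare i64 @llvm.ctpop.i64(i64)

define i64 @entry(i64 %x) #0 {
    %r = call i64 @llvm.ctpop.i64(i64 %x)
    ret i64 %r
}
attributes #0 = { alwaysinline }
"""

# Count set bits in a 64-bit integer via the intrinsic.
popcnt(x::Int64) = Base.llvmcall((POPCNT_IR, "entry"), Int64, Tuple{Int64}, x)
```

The same mechanism extends to vector types (`<8 x i64>` and so on), which is how you'd reach specific AVX2/AVX-512 operations that have no Julia-level spelling.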
However however - the stuff I actually do doesn't need all that, and could likely be implemented with LoopVectorization.
From its benchmarks (https://chriselrod.github.io/LoopVectorization.jl/latest/exa...), a 9-line naive matrix multiplication routine in Julia + LV slightly edges out Intel's MKL.
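For reference, the benchmarked routine has roughly this shape: a naive triple loop with @avx doing all the work (the function name `mygemm!` is mine; see the linked docs for the exact benchmark code).

```julia
using LoopVectorization

# Naive matrix multiply C = A*B; @avx handles tiling-like unrolling,
# reordering, and SIMD across the m/n/k loops.
function mygemm!(C, A, B)
    @avx for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end
```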