Hacker News

Check out LoopVectorization: https://github.com/chriselrod/LoopVectorization.jl

From its benchmarks (https://chriselrod.github.io/LoopVectorization.jl/latest/exa...), a 9-line naive matrix multiplication routine in Julia + LV slightly edges out Intel's MKL.



It looks like that @avx macro just tells Julia that it's okay to use the AVX instructions?

My specific question is, how do I tell Julia that I want to compute the popcount of the intersection of two byte strings of length 256 bytes?

A reference C implementation is at http://www.dalkescientific.com/writings/diary/archive/2020/1... in byte_intersect_256() and threshold_bin_tanimoto_search(); my blog post shows the important parts and links to the full definitions of the C code.
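The core operation described above — the popcount of the intersection of two 256-byte strings — can be sketched in a few lines of C. This is a minimal illustration of what a function like the linked byte_intersect_256() computes, not the post's actual code; the real signature and loop structure may differ:

```c
/* Popcount of the bytewise intersection (AND) of two 256-byte
 * fingerprints. A sketch of the operation described in the post;
 * the linked byte_intersect_256() may be structured differently.
 * __builtin_popcount is a GCC/clang builtin that counts set bits. */
static int byte_intersect_256(const unsigned char *a,
                              const unsigned char *b) {
    int total = 0;
    for (int i = 0; i < 256; i++) {
        total += __builtin_popcount(a[i] & b[i]);
    }
    return total;
}
```

Since the length is a compile-time constant, an optimizing compiler has a chance to unroll and vectorize this loop rather than emit 256 scalar popcounts.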


@avx does a lot more than just use AVX instructions. It will reorder and unroll loops when advantageous, swap out some functions for more vectorizable versions, and apply a few other tricks.

Julia uses AVX instructions by default if your code is amenable to it.


Sorry, I see how what I wrote could be interpreted that way, and part of it was written out of ignorance. While I didn't say so, I assumed the macro was doing something equivalent to the template metaprogramming I've heard is used in C++ for similar things.

What I didn't see was how I could use the AVX2 instructions myself.

Checking now: since Julia's count_ones() maps to LLVM's ctpop (popcount) intrinsic, and recent clang versions know how to optimize that fixed-length sequence in C even for AVX-512, the Julia equivalent of the code I wrote should have good performance.
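The fixed-length C pattern in question, processing the 256 bytes as 32 64-bit words, looks roughly like this (a sketch under my own naming, not the post's exact code):

```c
#include <stdint.h>
#include <string.h>

/* Intersection popcount over 64-bit words instead of bytes. The trip
 * count (32) is a compile-time constant, which is the kind of
 * fixed-length popcount loop recent clang versions can auto-vectorize,
 * including with AVX-512 when targeted. A sketch; the function name is
 * mine and the post's actual code may differ. */
static int intersect_popcount_256(const unsigned char *a,
                                  const unsigned char *b) {
    int total = 0;
    for (int i = 0; i < 256; i += 8) {
        uint64_t wa, wb;
        memcpy(&wa, a + i, sizeof wa);  /* alias-safe unaligned loads */
        memcpy(&wb, b + i, sizeof wb);
        total += __builtin_popcountll(wa & wb);
    }
    return total;
}
```

The memcpy loads compile down to plain (possibly unaligned) moves while staying within strict-aliasing rules, so the compiler sees a clean reduction it can vectorize.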

There are a few optimizations (keeping one AVX register loaded with a constant byte string, and using prefetch instructions) which might be missing. I'll be talking with the conference participant who brought up Julia to work this out in more detail.

Thanks for the comment!


Ah I see, yeah I misunderstood you.

> What I didn't see was how I could use the AVX2 instructions myself.

If you ever find yourself in a situation where you want manual control over vectorization, the package SIMD.jl [1] is pretty good for manual, handwritten vectorization. There's also VectorizationBase.jl [2], which LoopVectorization.jl builds on. Which of these two packages is most appropriate just kinda depends on what sort of interface you prefer.

[1] https://github.com/eschnett/SIMD.jl

[2] https://github.com/chriselrod/VectorizationBase.jl


Thanks for the pointers! The first looks like it provides a language for SIMD operations, but not all of the intrinsics. The second is for vectorization and also doesn't expose all of the intrinsics.

However! The second shows an example with Base.llvmcall, which lets me write LLVM IR directly. That should let me do anything I want.

However however - the stuff I actually do doesn't need all that, and could likely be implemented with LoopVectorization.



