Java and SIMD (prestodb.rocks)
178 points by mmastrac on June 27, 2017 | 37 comments


Auto-vectorization is hard. Very hard. E.g. even in C/C++, the compiler (e.g. GCC C/C++ or MS VC/VC++) is unable to vectorize loops unless you help it a lot, and in most cases you end up writing SIMD "intrinsics" (e.g. [1]) in order to get optimal results. From my experience, although auto-vectorization is better than it was 10 years ago, it is still very far from optimizing code properly without lots of tuning (e.g. you can try building any graphics processing library and see the vectorization warnings, i.e. why vectorization was not possible).
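To make the contrast concrete, here is a minimal sketch of a scalar loop versus a hand-written SSE version of the same thing (illustration only; it assumes n is a multiple of 4 to keep it short):

  #include <xmmintrin.h>  /* SSE intrinsics */

  /* Plain C: the compiler may or may not vectorize this. */
  void add_scalar(float *dst, const float *a, const float *b, int n) {
      for (int i = 0; i < n; i++)
          dst[i] = a[i] + b[i];
  }

  /* Explicit SSE: four floats per iteration, no guessing needed. */
  void add_sse(float *dst, const float *a, const float *b, int n) {
      for (int i = 0; i < n; i += 4) {
          __m128 va = _mm_loadu_ps(a + i);
          __m128 vb = _mm_loadu_ps(b + i);
          _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
      }
  }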

Ten years ago, in the SSE2/Altivec times, I thought it would only be a matter of time before much smarter compilers made graphics/pixel-processing code much faster, but no. The JIT case is no better, because it is essentially the same problem: even with runtime information available, the auto-vectorization phase is equivalent. I would love to see smarter compilers that actually understand the code, many steps beyond today's hardwired, pattern-matching-based optimizations.

[1] https://software.intel.com/sites/landingpage/IntrinsicsGuide...


A few years back I decided to entertain myself by testing how smart today's smart compilers really are when it comes to auto-vectorization.

I had this small and simple C application I'd written years earlier that tried to find inputs whose corresponding MD5 hashes started with certain bytes. It was a good base because it was obviously vectorizable.

At first, enabling the vectorizer didn't result in any changes to the binaries. I then correctly guessed that (potentially) calling printf inside a hot loop might confuse it. After slight refactoring I got the compiler to output SSE instructions, which resulted in a nice 2.5× improvement in testing speed over the original (incidentally, even without auto-vectorization the refactored code resulted in faster binaries, which is not all that surprising).

Anyway, I also rewrote the application to use intrinsics. I hadn't used them before myself, but it didn't really take much time at all to familiarize myself with them and write the code, and it was indeed quite a bit faster than what the compiler was capable of: the resulting binary was 14× faster than the original, or over 5× faster than what the compiler could achieve without explicit hints from intrinsics.

Edit: added back a few words I had accidentally removed when rearranging sentences, which had caused a confusing incomplete sentence. Also corrected the figures so the comparisons are like for like.


I’ve had similar experiences. If I want vectorised code, I just write it myself using intrinsics or assembly. It’s fine if the compiler can autovectorise something I didn’t feel like doing by hand, but I’m not going to rely on heuristic voodoo to get the machine code I want for a hot loop. I wouldn’t mind a slightly nicer wrapper API for the intrinsics, though, something like glsl-sse2[1].

And that’s more or less what I’m planning to do in a programming language I’m working on, actually—if you use a SIMD-compatible array type, the compiler will try to keep it in a vector register, and some operations will be faster (e.g., “+” on two Float32^4 values will compile to an addps) but it’s up to the programmer to use the instructions they actually want, or tell the compiler with a macro “please vectorise this loop or warn me about why you can’t”.

[1] https://github.com/LiraNuna/glsl-sse2



That matches my experience. With auto-vectorization you get some speed-up by helping the compiler (in ways that are not always obvious, often requiring simple +1 increments, etc.), but for full speed you need hand-written SIMD intrinsics. I would like the compiler to reach at least 50% of the optimum without intrinsics (and then use intrinsics only for the most critical code).


Would be interesting to see the difference in the compiled assembly.


As most people mentioned, the issue in C/C++ is aliasing, which is why the compiler cannot safely generate SIMD instructions. However, having worked in scientific computing for a few years on CFD solvers (written in C++), I have seen a very large advancement in what the compiler can do (Intel compilers especially). Our approach is to use OpenMP 4.0 directives, as they also give you data-scoping and alignment clauses. It's probably the best of both worlds: there is some much-needed input from the programmer, and then the compiler takes care of the rest.

On our benchmark, a compute-bound kernel would perform perhaps 10% better with hand-tuned intrinsics vs the autovectorized version via OpenMP, and that is mostly because hand tuning removes some intermediary loads and stores of short vectors in the autovectorized version, which are usually served from a higher-level cache anyway. In terms of portability, well, you can see why going for the OpenMP version is the safer bet.

One important aspect to take into account is that, for efficient autovectorization, you need to rewrite a lot of your kernels to use a vector programming paradigm (basically use short vectors of the SIMD size, which can be changed at compile time depending on the uarch) and, trust me, then the compiler will be in its comfort zone. I've seen the compiler generate instructions for conditional branching as good as hand-tuned intrinsics (AVX/AVX2 and AVX512). It can also do transposition from AoS data structures to SoA in flight; however, our hand-tuned kernels still outperform these since we know to use the lower-latency instructions.

From a performance perspective, without SIMD we lose on average 2-3× in our simulation turnaround time, but getting our whole solver vectorized was a pretty mammoth task.
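To give a flavour of the OpenMP 4.0 style (a toy saxpy, not one of our kernels): the simd directive plus the aligned clause hand the compiler guarantees it cannot infer on its own.

  /* Toy example of the OpenMP 4.0 SIMD style (not a real CFD kernel).
     Build with -fopenmp or -fopenmp-simd (GCC) / -qopenmp-simd (Intel). */
  void saxpy(float *restrict y, const float *restrict x, float a, int n) {
      #pragma omp simd aligned(x, y : 64)
      for (int i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }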


"Auto-vectorization is hard. Very hard. "

This varies very heavily depending on the programming language :)

C/C++ is not a language that easily enables one to guarantee things about aliasing or loop dependence.


but you can write it in a way that's conducive to better codegen, at the cost of developer sanity.
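For example (just a minimal sketch, nothing clever): restrict-qualified pointers, a plain counted loop, and no opaque calls inside it will usually get GCC/Clang to vectorize at -O3, and GCC's -fopt-info-vec will tell you whether it did.

  /* "Conducive" style: restrict promises no aliasing, the trip count is
     trivially computable, and nothing in the loop can clobber memory. */
  void scale(float *restrict dst, const float *restrict src, float k, int n) {
      for (int i = 0; i < n; i++)
          dst[i] = k * src[i];
  }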


Or you can design a language that is built all around it. Halide is pretty interesting, basically being a non-Turing complete DSL that lets you separate algorithm from scheduling:

http://halide-lang.org/

Something like that could perhaps also be useful for auto-vectorization.


This is why I like the SIMT model used in CUDA and OpenCL. Just give me a unified programming model for both SIMD vectors and multicore. Let me program each ALU as a separate thread. This requires CPU makers to wake up and give us branching intrinsics that work like in GPGPU: Just let the branched out ALUs sleep / do no-ops. As far as I understand Intel has done that with AVX-512 for KNL. This should be advertised much more. IMO one of the biggest mistakes with KNC was to only push OpenMP and neglect OpenCL. The promise of getting good performance by just recompiling OpenMP multicore code to KNC was snake oil - you have to rewrite all the vectorization in order to even get close to GPUs of the same generation, let alone GPUs released in subsequent years when Intel's production pipeline was stalled.
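For what it's worth, the AVX-512 mask registers expose roughly that model through intrinsics. A rough sketch (assuming an AVX-512 machine and n being a multiple of 16):

  #include <immintrin.h>

  /* Per-lane predication with AVX-512: lanes that fail the condition keep
     their old value, much like inactive threads in a GPU warp. */
  void add_if_positive(float *y, const float *x, int n) {
      for (int i = 0; i < n; i += 16) {
          __m512 vx = _mm512_loadu_ps(x + i);
          __m512 vy = _mm512_loadu_ps(y + i);
          /* mask of lanes where x > 0 */
          __mmask16 m = _mm512_cmp_ps_mask(vx, _mm512_setzero_ps(), _CMP_GT_OQ);
          /* only the masked lanes get x added; the others pass through */
          _mm512_storeu_ps(y + i, _mm512_mask_add_ps(vy, m, vy, vx));
      }
  }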


> I would love to see smarter compilers that actually understand the code, many steps beyond today's hardwired, pattern-matching-based optimizations.

It'll be interesting when we get to that point. I wonder if it'll be in ML or compilers that "cognition" happens first?


> From my experience, although auto-vectorization is better than it was 10 years ago, it is still very far from optimizing code properly without lots of tuning

Why do you think this is? Does C being difficult to reason about contribute (e.g. statefulness and not knowing what can modify what)?

I've done some SIMD and other optimisations on Android in C for graphics-related algorithms, and small changes and hints can make a 10× or more difference, so I understand your point.


The problem is usually determining when it's profitable, without any profiling, actually.

Additionally, C/C++ are languages where the default is "anything can alias anything" (restrict, until very recent standards, isn't the panacea people think it is) and, worse, it's really easy to end up with loop dependences as well as non-computable loop trip counts. Don't forget alignment, too!

This means inserting runtime checks.
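For a simple two-input loop, those checks amount to something like the following, written out by hand for illustration (the compiler emits the equivalent at the machine level; comparing pointers into different objects like this isn't strictly portable C):

  void add(float *dst, const float *a, const float *b, int n) {
      /* Does dst overlap either input? */
      int no_alias = (dst + n <= a || a + n <= dst) &&
                     (dst + n <= b || b + n <= dst);
      if (no_alias) {
          for (int i = 0; i < n; i++)   /* safe to take the vectorized path */
              dst[i] = a[i] + b[i];
      } else {
          for (int i = 0; i < n; i++)   /* scalar fallback */
              dst[i] = a[i] + b[i];
      }
  }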

So, assuming the compiler can reorder the loop to vectorize it (and honestly, with polyhedral optimizations, it almost always can if it's at all possible to do so), the question is: is it worth it to vectorize a loop but have to insert 5-6 runtime checks to test for aliasing/etc.

The answer is usually no.

Vectorization, sadly, is not one of those things where more is always better.

Vectorizing every loop / piece of straight-line code in a program will generally make things much, much slower.

(because now you have limited the execution resources to do the computation :P)


Be very careful about this in a shared environment. AVX512 slows down the CPU cores because of thermal and voltage throttling. The instructions "take more work". Intel CPUs take 1 MILLIsecond to return to normal speed.

If you're doing any kind of rapid context switching, or multiple workloads, depending on how your scheduler is set up, the OTHER workloads will show up as using a higher percentage of CPU time per work item. It's non-intuitive, and difficult to debug.


Now that major cloud vendors are selling VMs with guaranteed AVX512 support, how are they going to deal with the "noisy neighbor" problem?


For one, this is done on a core-by-core level. I believe Intel has the capability to monitor for this behaviour and do some dynamic throttling in certain models, which cloud vendors may have access to. I think they introduced some of this in Broadwell-E, where you can set the AVX offset per core, and the on-die PCU will throttle that core below the AVX base frequency, allowing the rest of the cores to remain at speed. These are typically controlled by BIOS, or MSRs.

We're currently trying to figure out how to deal with this. Some ideas come from Google's CPI2 paper, trying to dynamically schedule workloads with diversity if we think they interfere. Other thoughts have been simpler, like core pinning (knapsacking for latency, or throughput).

this is hard.

Disclaimer: These views represent my own, and not my employer's, or their vendor's views


They don't? AFAIK no cloud vendor guarantees consistent performance on their multi-tenancy offerings (at least none backed by SLAs). Vendors sell both dedicated tenancy, with consistent performance, at a higher price and shared tenancy, with variable performance, at a lower price and expect customers who choose the latter to deal with the variance on their own.

Large companies run a benchmark after each instance startup to check how noisy that instance's neighbors are. If it's bad then they simply terminate the instance and start another one[0].

[0] https://www.reddit.com/r/aws/comments/547xbx/netflix_found_5...


Giant heatsinks, and more sockets!

More likely though, they'll continue not to care. Performance is already hideously variable, it'll just get a bit worse.


CPUs throttle down for AVX and AVX2 workloads as well. Hopefully your vectorization is sufficient to make up for the throttling.


This doesn't use AVX512 though, and AVX512 is not supported in any mainstream JVM or CPU. It's only available in Xeon Phi, an HPC accelerator card with many slow cores.


AVX and AVX2 are also included.


I asked James Gosling about SIMD (MMX and SSE) back in the year 2000 JavaOne conference. He said the answer is method calls, and that the compiler can use whatever instructions it likes. (I submitted the question online, and he answered it on stage, and I later saw the recording; I didn't attend, nor meet him in person. He seemed a bit annoyed that the question was too simple.)


There's room for two approaches in Java really. The JIT can be smart and use SIMD instructions where it can see they are applicable, but there's also room for a small DSL-like API that allows library authors and other very experienced users to express a suitable algorithm and have it easily translated into the SIMD instructions available at runtime. Anybody interested in the latter should take a look at the work being done under Project Panama.


This presentation was posted recently on the general OpenJDK mailing list

  http://cr.openjdk.java.net/~vlivanov/talks/2017_Vectorization_in_HotSpot_JVM.pdf


Annotation based compiler hints?


This has got to be one of the best blog posts I've read in a while. It's clean, concise, has a clear use-case and is well benchmarked. Kudos to the author.


Thanks, I'm glad that you liked it


My understanding is that once the Truffle/Graal language implementation framework lands with Java 9, you will be able to substitute method calls with platform-specific instructions. See page 62 of http://lafo.ssw.uni-linz.ac.at/papers/2015_CGO_Graal.pdf


This is not a Java question. If a C compiler can do it then a Java Virtual Machine can do it, provided that the C code of the JVM makes it so. There are many implementations of the JVM so you really should be asking which JVMs do this. Oracle is not the only game in town, not to mention at least two open source JVMs where you could add whatever capabilities that you need.

On the other hand, if you are asking whether or not some magic compiler optimization will take your crappy code and make it run fast on SIMD, not only is that the wrong question but you have already lost the race.

The winners of the race asked the question, "How can we add a capability to our Java application to run computations fast using SIMD?" and they found a number of ways to do this without relying on magic. It might be a bit of work to code because you have to do it with intent, like the old timers who placed code and data carefully on their drum memory computers to make the code run much faster. You can code with intent in any language on any platform, but because your intent is stronger than the aesthetic perfection of the platform, things can look a little grungy to an outsider. Comment your code and document it well.

And ask yourself whether offloading the computation to a GPU might not be cheaper and even faster than SIMD.


Often it is not that simple to change JVMs, especially if your application is finely tuned for a specific GC. In such a case, is it always worth spending hundreds of work hours on migration (and testing it afterwards)? Another aspect is technical support, or other legal/contract bindings.

Besides, I wasn't asking for "magic compiler optimizations". I would prefer to use intrinsics directly. Is there a way to do that in Java?


Kind of.

https://www.slideshare.net/RednaxelaFX/green-teajug-hotspoti...

Here is also a presentation about them

https://www.youtube.com/watch?v=7J0RELNadks

Here is a list with some of them.

https://gist.github.com/apangin/7a9b7062a4bd0cd41fcc

The Panama JVM has more of these related to SIMD.

Of course all of this is JVM specific and each one has its own set.


I've been wondering what the best way to implement SIMD in a JITless language is. While I'm a big proponent of static compilation (coding mostly in Go), this runtime inflexibility is a bit painful. Sure, you could compile multiple versions, check the CPU and dispatch to the correct methods at runtime... but that's not quite right.

Another issue is how you code it up - both in terms of delivering multiple versions depending on CPU support and in terms of shielding the developer from low-level assembly. I'm all for high level APIs (think `sum(a vec, b vec)`) that get compiled down to whatever is supported, but I haven't seen many good examples of this.
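For the statically compiled case in C, GCC's function multi-versioning is one existing answer to the dispatch half of the problem: one source-level function, several compiled variants, and the choice is made once at load time via an ifunc resolver (GCC 6+; the function below is only an illustration):

  /* GCC emits a baseline, an SSE4.2 and an AVX2 clone of this function and
     resolves which one to call when the program is loaded. */
  __attribute__((target_clones("avx2", "sse4.2", "default")))
  int sum(const int *a, int n) {
      int s = 0;
      for (int i = 0; i < n; i++)
          s += a[i];
      return s;
  }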


Does anyone know how RyuJIT compares to Java 8 and Java 9 in a comparison like this?


RyuJIT does zero autovectorizing, but .NET has a nifty SIMD library that lets you do it by hand. Big plus is you can write once and run on SSE or AVX or AVX512. Big downside is only a tiny fraction of SIMD instructions are supported, though they are working on improving that.


Why not use OpenCL in some form, either raw or via Aparapi?


Crossing the JNI barrier is expensive performance-wise. Unless it's a longer-running, CPU-heavy calculation, it's usually best to stay in pure Java. There are some things where going out to another high-performance framework will speed things up, but the bar is a bit higher because of the JNI overhead.



