SimSIMD: Hardware-accelerated similarity metrics and distance functions (github.com/ashvardanian)
160 points by sroussey on Jan 23, 2024 | 36 comments


This is the kind of basement sorcery that makes the world go round. Nicely done.


Nice! Similarity searching is the bane of my existence, but also gives life to so many of the things I work on. 200x sounds wonderfully fast.

Does this do Levenshtein distance? I didn't see it in the main document.

Edit: also, THANK YOU for adding a parameter to specify the number of threads. Wish this were a more widespread practice.


The author also maintains the StringZilla library, which includes Levenshtein distance: https://github.com/ashvardanian/StringZilla


He-hey! For Levenshtein distance, check out the development branch of StringZilla. Hopefully it will be released this week :)

Oh, and for multithreading, USearch might be more practical.


Heck yeah, thanks.

  If the first 4 bytes of the strings are the same, the strings are likely to be equal. Similarly, the first 4 bytes of the strings can be used to determine their relative order most of the time.
That's a fun heuristic.
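
A rough Python sketch of the trick (the helper below is hypothetical, not StringZilla's actual code): check a fixed-size prefix first and fall back to the full comparison only when the prefixes tie.

  # Illustrative only: real SWAR/SIMD code does this with word-sized loads, not slices.
  def likely_less_than(a: bytes, b: bytes) -> bool:
      pa, pb = a[:4], b[:4]
      if pa != pb:        # the prefixes already decide the relative order
          return pa < pb
      return a < b        # rare slow path: full comparison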


Yes, it was very practical in most cases, but v3 only uses it for short needles with SWAR. SIMD variants will use Raita-style heuristics.


OMP_NUM_THREADS=n is pretty standard in HPC workloads
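
Or, if you'd rather set it from Python, one common pattern (assuming the extension initializes its OpenMP runtime after this point):

  import os
  # Must happen before the OpenMP runtime is initialized, i.e. before importing
  # whichever extension links against it (the value 8 is just an example).
  os.environ["OMP_NUM_THREADS"] = "8"
  import numpy as np  # or any other OpenMP-backed library loaded afterwards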


Huh, thanks. I'll check it out and see if I can drop that in.


Tangential: how do you guys decide which similarity metric to use? Euclidean seems the most intuitive, but I'm sure the others also have their use cases. The situation is similar to the measures used to compare two distributions (KL, for instance, though technically it's not a metric). Is there some art to it, or is there a systematic way to choose metrics for different tasks?


This is actually a really underappreciated question (and a really interesting one)! There's a lot of nuance here, because distance becomes less meaningful as you increase dimensions. You can find papers comparing Lp distances with different p values (p=2 == Euclidean == L2). As dimension increases, the gap between the nearest and farthest points shrinks relative to the distances themselves, making it harder to differentiate near points from far points. Cosine similarity is a common choice, but as dimensions increase, two random vectors become almost certain to be nearly orthogonal. This might seem counterintuitive, because in 2 or 3 dimensions the directions orthogonal to a given vector are so constrained: a single perpendicular line in R2, a plane in R3.

So the answer honestly tends to be ad hoc: "whatever works best". It's good to keep in mind that any intuition you have about geometry goes out the window as dimensions increase, and it's important to remember which assumptions you're making, especially when leaning on empiricism. There are definitely nuances that can point you in better directions (pun intended :) than random guessing, especially if you know a lot about your geometry, but it is messy, and small nuances can make big differences.

I wish I had a better answer, but I hope this is informative. Maybe some mathematician will show up and add more. I'm sure there's someone on HN who loves to talk about higher-dimensional geometry, and I'd love to hear those "rants."
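
A quick NumPy sketch of that concentration effect, using uniform random points (numbers are illustrative, not from any benchmark):

  import numpy as np

  rng = np.random.default_rng(0)
  for d in (2, 10, 100, 1000):
      points = rng.random((1000, d))   # uniform points in the unit cube
      query = rng.random(d)
      dists = np.linalg.norm(points - query, axis=1)
      # Relative contrast: how much farther the farthest neighbor is than the nearest.
      contrast = (dists.max() - dists.min()) / dists.min()
      print(f"d={d:5d}  relative contrast = {contrast:.3f}")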


"whatever works best" because it also totally depends on your field of application. Here[0] is for example a similarity search using the Tanimoto[1] index.

The "works best" is also, in many cases, subjective. This is also not easy to assess, you may need several people looking at the results, here molecules, to say if yes they are similar or not. A chemist will think differently than a biologist in this regard.

[0]: https://www.chemeo.com/similar?smiles=Cc1c%28%5bN%2b%5d%28%3...

[1]: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-0...
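
For reference, on binary fingerprints the Tanimoto index is just intersection over union of the set bits; a minimal NumPy version (the fingerprints below are made up for illustration):

  import numpy as np

  def tanimoto(a, b):
      # Tanimoto index of two binary fingerprints (0/1 arrays).
      both = np.logical_and(a, b).sum()
      either = np.logical_or(a, b).sum()
      return both / either if either else 1.0

  fp1 = np.array([1, 0, 1, 1, 0, 0, 1, 0])
  fp2 = np.array([1, 0, 0, 1, 0, 1, 1, 0])
  print(tanimoto(fp1, fp2))  # 3 shared bits / 5 bits set in either = 0.6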


Not sure how relevant this is, as it's not really my field, but possibly interesting: 'A feature selection method based on the Golden Jackal-Grey Wolf Hybrid Optimization Algorithm' https://journals.plos.org/plosone/article?id=10.1371/journal...


Yeah, I’m working on a test framework to try the various permutations to see what works.

And for me to get a feel for it.


My intuition:

L2 in low dims, Cosine in high dims. Hamming on similarly sized sets, Jaccard otherwise. Jensen-Shannon (symmetric KL) when dealing with histograms and probability distributions.
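
For anyone mapping those names to formulas, plain NumPy reference versions (SimSIMD and SciPy ship tuned equivalents of most of these):

  import numpy as np

  def euclidean(a, b):          # L2
      return np.linalg.norm(a - b)

  def cosine_distance(a, b):    # 1 - cosine similarity
      return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

  def hamming(a, b):            # differing positions in equal-length binary vectors
      return np.count_nonzero(a != b)

  def jaccard_distance(a, b):   # 1 - (intersection / union) for binary vectors
      return 1.0 - np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

  def jensen_shannon(p, q):     # distance form: sqrt of the symmetrized, smoothed KL
      p, q = p / p.sum(), q / q.sum()
      m = 0.5 * (p + q)
      def kl(x, y):
          mask = x > 0
          return np.sum(x[mask] * np.log(x[mask] / y[mask]))
      return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))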


Manhattan is better than L2 in high dimensions... but why cosine?



Cosine similarity is certainly not immune to the curse of dimensionality; in fact, it is explicitly prone to it. As dimensions increase, the likelihood that any two random vectors are nearly orthogonal rapidly increases. This is easy to reason out if you start in 2 dimensions, count the directions orthogonal to a given vector, then move to 3D, and so on.
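
A quick check with random Gaussian vectors shows the angles crowding around 90 degrees as the dimension grows:

  import numpy as np

  rng = np.random.default_rng(0)
  for d in (2, 10, 100, 1000):
      a = rng.standard_normal((2000, d))
      b = rng.standard_normal((2000, d))
      cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
      print(f"d={d:5d}  mean |cosine| = {np.abs(cos).mean():.3f}")  # shrinks toward 0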


Are vector databases like Chroma already doing some of this? If not, it could be a huge win for RAG workflows!


SimSIMD is already used in USearch, ClickHouse, and Lantern. In other solutions, SIMD is generally limited to pre-2015 assembly instructions. So yes, there are a lot of RAG workflows to accelerate :)


The sad part is that they are the ones reaping the benefits of this tech by capturing that sweet VC money.


I’m curious if pgvector is using it. It’s actually pretty fast for us without even indexing.


As far as I'm aware pg_vector just uses the compiler's autovectorization on float32. I think specifically for Euclidean distance you won't really beat GCC (the README even admits it: "GCC handles single-precision float but might not be the best choice for int8 and _Float16 arrays, which has been part of the C language since 2011.")


Yes, the improvements for float32 aren't very dramatic, but it can still be 3x faster than NumPy/SciPy: https://ashvardanian.com/posts/simsimd-faster-scipy/

C11 is also a bit tricky, as its support is optional, as far as I remember.


I also notice you use FMA in your AVX2 L2 distance calculation. I don't think pg_vector enables that, so SimSIMD might be slightly faster.

Also interesting that it beats NumPy/SciPy by so much! I wonder what they're doing..


Not using SIMD at all, it seems, and relying on auto-vectorization instead, if I'm not misreading it.

https://github.com/scipy/scipy/blob/main/scipy/spatial/src/d...


It doesn’t support f16 or i8 at all yet (at least not in the released version).


I’m using this and became a contributor this year (prebuilt binaries for nodejs and bun, as well as some ts/js mangling). :)


Interesting project. Would be nice to see how it compares to the built-in operations provided in Faiss.

https://github.com/facebookresearch/faiss/wiki/How-to-make-F...



Thinking more of a scientific comparison with metrics. Definitely looks interesting overall, appreciate you sharing this library as open source.


You are welcome :)

There is more material on my blog, including 4 articles from October 2023: https://ashvardanian.com/archives/

Benchmarks are quite easy to reproduce, especially in Python. The numbers for AWS Graviton 3, Apple M2 and Intel Sapphire Rapids CPUs can be found here: https://ashvardanian.com/posts/simsimd-faster-scipy/
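
Something along these lines is usually enough for a rough spot-check (just a sketch: it assumes simsimd.cosine(a, b) as the entry point, and the 1536-dim vectors are arbitrary):

  import time
  import numpy as np
  import scipy.spatial.distance as spd
  import simsimd  # assumption: the package exposes simsimd.cosine(a, b)

  a = np.random.rand(1536).astype(np.float32)
  b = np.random.rand(1536).astype(np.float32)

  def bench(fn, n=100_000):
      start = time.perf_counter()
      for _ in range(n):
          fn(a, b)
      return time.perf_counter() - start

  print("scipy  :", bench(spd.cosine))
  print("simsimd:", bench(simsimd.cosine))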


Great, I'll take a look!


I wonder how it compares to Faiss in terms of performance on Intel CPUs.


If I remember correctly, Faiss uses BLAS libraries for the dot product, while its cosine distance and L2 have very minimal SIMD support. In both cases SimSIMD should be faster for distance computations.

Multiply that by a more efficient HNSW implementation and you might be looking at a 100x performance difference at the scale of a billion vectors: https://www.unum.cloud/blog/2023-11-07-scaling-vector-search...


Thanks for the link!


Is this calling SIMD 'hardware acceleration', and are all the claimed improvements relative to Python?



