If the first 4 bytes of two strings are the same, the strings are likely to be equal. Similarly, the first 4 bytes can be used to determine their relative order most of the time.
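For reference, a minimal C sketch of that trick (the function name and the assumption that both buffers hold at least 4 bytes, e.g. by padding, are mine):

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical sketch: compare two strings using the first 4 bytes as a
       cheap filter before touching the rest. Assumes both buffers hold at
       least 4 bytes (shorter strings would be zero-padded). */
    int prefixed_compare(const char *a, size_t a_len, const char *b, size_t b_len) {
        int prefix_cmp = memcmp(a, b, 4);     /* settles most comparisons */
        if (prefix_cmp != 0)
            return prefix_cmp;
        size_t min_len = a_len < b_len ? a_len : b_len;
        int full_cmp = memcmp(a, b, min_len); /* prefixes match: rare slow path */
        if (full_cmp != 0)
            return full_cmp;
        return (a_len > b_len) - (a_len < b_len);
    }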
Tangential: How do you guys decide which similarity metric to use? Euclidean seems the most intuitive but I'm sure others also have use cases. This situation is similar to the similarity metrics used to compare two distributions (KL for instance, tho technically it's not a metric). Is there some art to it or is there really a systematic way to choose metrics for different tasks?
This is actually a really underappreciated question! (and also really interesting!) There's a lot of nuance to this, because the truth is that distance becomes less meaningful as you increase dimensions. You can find papers comparing Lp distances with different p values (p=2 == Euclidean == L2). But as dimension increases, the distances to the nearest and the furthest points converge toward each other (making it harder to differentiate near points from far points). Cosine similarity is a commonly used one, but as dimensions increase the likelihood that any two vectors are nearly orthogonal rapidly increases. This might seem counterintuitive because the probability is really low in 2 or 3 dims, where the directions orthogonal to a given vector form just a line in R2 or a plane in R3.
So the answer honestly tends to be ad hoc: "whatever works best". It's good to keep in mind that any intuition you have about geometry goes out the window as dimensions increase, so it's important to remember what assumptions you're making, especially when relying on empiricism. There are definitely nuances that can point you in better directions (pun intended :) than random guessing, especially if you know a lot about your data's geometry, but it is messy and small nuances can make big differences.
I wish I had a better answer but I hope this is informative. Maybe some mathematician will show up and add more. I'm sure there's someone on HN that loves to talk about higher dimensional geometry and I'd love to hear those "rants."
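To make the near-vs-far point concrete, here's a throwaway C demo (my own toy code, not from any library): sample random points in [0,1]^d and watch the nearest and furthest L2 distances from a query converge as d grows.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    static double l2(const double *a, const double *b, int d) {
        double s = 0;
        for (int i = 0; i < d; i++) { double t = a[i] - b[i]; s += t * t; }
        return sqrt(s);
    }

    int main(void) {
        srand(42);
        int dims[] = {2, 10, 100, 1000, 10000};
        int n = 1000; /* points sampled per dimension */
        for (int k = 0; k < 5; k++) {
            int d = dims[k];
            double *q = malloc(d * sizeof *q);
            double *p = malloc(d * sizeof *p);
            for (int i = 0; i < d; i++) q[i] = (double)rand() / RAND_MAX;
            double dmin = INFINITY, dmax = 0;
            for (int j = 0; j < n; j++) {
                for (int i = 0; i < d; i++) p[i] = (double)rand() / RAND_MAX;
                double dist = l2(q, p, d);
                if (dist < dmin) dmin = dist;
                if (dist > dmax) dmax = dist;
            }
            /* As d grows, min/max creeps toward 1: "near" and "far" blur together. */
            printf("d=%5d  min=%8.3f  max=%8.3f  min/max=%.3f\n", d, dmin, dmax, dmin / dmax);
            free(q); free(p);
        }
        return 0;
    }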
"whatever works best" because it also totally depends on your field of application. Here[0] is for example a similarity search using the Tanimoto[1] index.
The "works best" is also, in many cases, subjective. This is also not easy to assess, you may need several people looking at the results, here molecules, to say if yes they are similar or not. A chemist will think differently than a biologist in this regard.
L2 in low dims, Cosine in high dims. Hamming on same-size sets, Jaccard on others. Jensen-Shannon (symmetric KL) when dealing with histograms and probability distributions.
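For the last one, a rough sketch of Jensen-Shannon on already-normalized histograms (my own helper, not from any particular library):

    #include <math.h>
    #include <stddef.h>

    /* Jensen-Shannon divergence between two histograms. Assumes both sum
       to 1; the 0 * log 0 terms are treated as 0. */
    double jensen_shannon(const double *p, const double *q, size_t n) {
        double js = 0.0;
        for (size_t i = 0; i < n; i++) {
            double m = 0.5 * (p[i] + q[i]);
            if (p[i] > 0.0) js += 0.5 * p[i] * log(p[i] / m); /* 0.5 * KL(P || M) */
            if (q[i] > 0.0) js += 0.5 * q[i] * log(q[i] / m); /* 0.5 * KL(Q || M) */
        }
        return js; /* in nats; symmetric and bounded by log(2) */
    }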
Cosine similarity is certainly not immune to the curse of dimensionality. In fact, it is explicitly prone to it. As dimensions increase, the likelihood that any two vectors are nearly orthogonal rapidly increases. This is easy to reason out if you start from 2 dimensions, count the directions orthogonal to a given vector, then move to 3D, and so on.
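You can also see it numerically with a few lines of throwaway C (my own toy code): the average |cosine| between pairs of random vectors collapses toward 0 as the dimension grows.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    static double unit_rand(void) { return 2.0 * rand() / RAND_MAX - 1.0; } /* uniform in [-1, 1] */

    int main(void) {
        srand(7);
        int dims[] = {2, 3, 10, 100, 1000, 10000};
        for (int k = 0; k < 6; k++) {
            int d = dims[k], pairs = 200;
            double mean_abs_cos = 0.0;
            for (int t = 0; t < pairs; t++) {
                double dot = 0, na = 0, nb = 0;
                for (int i = 0; i < d; i++) {
                    double a = unit_rand(), b = unit_rand();
                    dot += a * b; na += a * a; nb += b * b;
                }
                mean_abs_cos += fabs(dot / sqrt(na * nb));
            }
            /* Random pairs look more and more orthogonal as d grows. */
            printf("d=%5d  mean |cos| = %.3f\n", d, mean_abs_cos / pairs);
        }
        return 0;
    }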
SimSIMD is already used in USearch, ClickHouse, and Lantern. In other solutions SIMD is generally limited to pre-2015 Assembly instructions. So yes, there are a lot of RAG workflows to accelerate :)
As far as I'm aware pg_vector just uses the compiler's autovectorization on float32. I think specifically for Euclidean distance you won't really beat GCC (the README even admits it: "GCC handles single-precision float but might not be the best choice for int8 and _Float16 arrays, which has been part of the C language since 2011.")
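For context, this is roughly the shape of loop that claim is about (not pgvector's actual source, just a sketch). With -O3 plus -ffast-math (or -fassociative-math, needed so GCC may reorder the float accumulation) it compiles to packed SIMD on AVX/NEON targets, which is why hand-written intrinsics buy little for the float32 L2 case.

    /* Not pgvector's code, just the kind of loop compilers autovectorize well.
       The float reduction is only vectorized under relaxed FP semantics,
       since vectorizing it reorders the additions. */
    float l2_squared_f32(const float *a, const float *b, int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }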
If I remember correctly, it uses BLAS libraries for the dot product, while Cosine distance and L2 have only very minimal SIMD support. In both cases SimSIMD should be faster for distance computations.