Hacker News
Lyra: Fast, in-memory, typo-tolerant, full-text search engine in TypeScript (github.com/nearform)
171 points by bpierre on Aug 1, 2022 | hide | past | favorite | 46 comments


Interesting that this is fast! I looked at the code and it looks like a super vanilla (prefix) trie, at least for exact match searching. For approximate searching, it seems to just explore the trie from the endpoint of the word in the query; if I understand correctly, it wouldn't find "banana" given the query "baana" and some tolerance.

The readme says that the trie has some tricks to make it fast; the only such trick that I could find is "fast properties", which seems to be a JS-specific thing to make object property access faster (something with promoting an ad-hoc object to a prototype, which presumably enjoys more optimisation in JS engines).

Hence I'm sort-of surprised that this performs well! If a simple trie is sufficient for that, I might go forward with implementing search for some of my other stuff just like that.
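A vanilla prefix trie of the kind described above fits in a few dozen lines; this is an illustrative sketch in TypeScript, not Lyra's actual code:

```typescript
// Minimal prefix trie: insert words, check exact membership,
// and collect all words sharing a prefix.
class TrieNode {
  children = new Map<string, TrieNode>();
  isWord = false;
}

class Trie {
  private root = new TrieNode();

  insert(word: string): void {
    let node = this.root;
    for (const ch of word) {
      let next = node.children.get(ch);
      if (!next) {
        next = new TrieNode();
        node.children.set(ch, next);
      }
      node = next;
    }
    node.isWord = true;
  }

  contains(word: string): boolean {
    const node = this.walk(word);
    return node !== undefined && node.isWord;
  }

  // All stored words beginning with `prefix`.
  withPrefix(prefix: string): string[] {
    const start = this.walk(prefix);
    const out: string[] = [];
    if (!start) return out;
    const visit = (node: TrieNode, acc: string): void => {
      if (node.isWord) out.push(acc);
      for (const [ch, child] of node.children) visit(child, acc + ch);
    };
    visit(start, prefix);
    return out;
  }

  private walk(s: string): TrieNode | undefined {
    let node: TrieNode | undefined = this.root;
    for (const ch of s) {
      node = node.children.get(ch);
      if (!node) return undefined;
    }
    return node;
  }
}
```

Exact match is a single walk from the root; prefix search is a walk plus a depth-first collection, which is why the structure alone can already be quite fast.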


Modern computers are ridiculously fast, and the V8 JIT compiler can produce pretty fast native code for hot paths.

All of this together allows JS to perform surprisingly well in the vanilla case, even with simple algorithms. I wonder how much faster one could make the lookup by e.g. using a typed array to represent the trie and the index; that would be cache-friendly.


I'm less surprised about the fact that JS can be fast, and more that a vanilla trie can be fast and (hopefully) resource-efficient; I had the impression that any scalable index would need to compact the trie in some manner (i.e. combining multiple characters in one node) or do other algorithmic improvements that I don't know about. (That's why I looked at the code, to find those -- but I didn't really find any)


The current release of Lyra treats the trie structure as a tree (of course), but we're planning to start approaching it as a graph for the exact reason you're mentioning. We don't have performance comparisons yet, but we'll document everything.


I wonder how this compares to Flexsearch or the other benchmarked libraries here: https://github.com/nextapps-de/flexsearch#performance-benchm...


I quite enjoy minisearch[1], which is also 0 dependencies, actively maintained, and I expect would work well in a worker environment. I dropped it into a service worker and paired it with a simple point-in-polygon script to enable geosearch for a recent project[2] and it played v. nicely.

[1] https://github.com/lucaong/minisearch

[2] https://github.com/theprojectsomething/re.places


Lyra author here: I wouldn't compare it with other JS search engines for a simple reason: we'll be targeting edge execution (Cloudflare workers, lambda@edge, etc.), so the focus is really different


Why does that matter? It seems you can run Lyra on the client and you can run other JS search engines on the "edge" (all that's needed is a JS runtime if I understand it correctly). It would be interesting to see how Lyra compares in these benchmarks.


We have a benchmark section in the repo and it's performing pretty well, we'll definitely run it against other engines!


Current version of FlexSearch (0.7.2) is not typo tolerant, see https://github.com/nextapps-de/flexsearch/issues/118


I was going to ask how this is different than lunr but I just noticed lunr hasn't been updated for almost 2 years now: https://github.com/olivernn/lunr.js


Yeah, but is that because it's unmaintained or because it's finished?


Anything with npm dependencies requires constant updates. The code may be finished but the dependencies publish security updates that need to be pulled in.



Dev dependencies can be a risk factor too. It is code you are running on your own machine after all.


I implemented a fuzzy search a few years ago and here's some things to look into.

Damerau-Levenshtein is a significant upgrade over the default Levenshtein algorithm, as it also picks up the very common transposition error, for not a lot of extra code.

https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_di...
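For reference, here's a minimal sketch of the optimal-string-alignment variant of Damerau-Levenshtein (the common single-matrix formulation, which charges one edit for an adjacent transposition):

```typescript
// Optimal string alignment variant of Damerau-Levenshtein:
// like plain Levenshtein, but an adjacent transposition
// ("ab" -> "ba") costs a single edit instead of two.
function damerauLevenshtein(a: string, b: string): number {
  const m = a.length;
  const n = b.length;
  // (m+1) x (n+1) distance matrix
  const d: number[][] = Array.from({ length: m + 1 }, () =>
    new Array<number>(n + 1).fill(0)
  );
  for (let i = 0; i <= m; i++) d[i][0] = i;
  for (let j = 0; j <= n; j++) d[0][j] = j;

  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution
      );
      // The one extra case over plain Levenshtein:
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // transposition
      }
    }
  }
  return d[m][n];
}
```

The transposition check is the only addition over the textbook Levenshtein loop, which is why it costs so little extra code.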

Typed arrays would also probably speed things up quite a bit.

Tokenizing may not be faster. Instead, set the first row of weights to ZERO; when you're done, look for the lowest value in the final row. This also saves the trouble of trying to tokenize properly for every language, as it's only concerned with the bytes themselves.

https://stackoverflow.com/questions/8139958/algorithm-to-fin...
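The zero-first-row trick from that Stack Overflow answer can be sketched like this (illustrative code, not from any of the libraries discussed):

```typescript
// Approximate substring match: standard Levenshtein DP over
// pattern x text, but the first row is all zeros so a match may
// start anywhere in the text; the minimum of the last row is the
// best edit distance of `pattern` against any substring of `text`.
function fuzzySubstringDistance(pattern: string, text: string): number {
  const m = pattern.length;
  const n = text.length;
  let prev = new Array<number>(n + 1).fill(0); // first row: all zeros
  for (let i = 1; i <= m; i++) {
    const curr = new Array<number>(n + 1);
    curr[0] = i; // matching i pattern chars against nothing
    for (let j = 1; j <= n; j++) {
      const cost = pattern[i - 1] === text[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion from pattern
        curr[j - 1] + 1,   // insertion into pattern
        prev[j - 1] + cost // substitution
      );
    }
    prev = curr;
  }
  return Math.min(...prev); // best match may end anywhere in text
}
```

Because the first row is zero, the algorithm never pays for text before the match starts; taking the minimum of the last row means it never pays for text after the match ends either.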

Performance of that algorithm also suffers a LOT with very long strings. For those, we went with one more like the fuzzy search used in Sublime text. Basically, you iterate the string one time and look for the search characters to appear in order. To go more advanced add a weight to each match type. Stuff like consecutive characters that match, cases match, or immediately transposed characters next to each other give higher weights to the resulting string.

https://www.forrestthewoods.com/blog/reverse_engineering_sub... (this looks like a decent overview of sublime fuzzy search).
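A toy version of that single-pass, in-order scan with weights might look like the following; the bonus values are made up for illustration:

```typescript
// Sketch of a Sublime-style fuzzy matcher: scan the text once,
// requiring query characters to appear in order, and reward
// consecutive matches and matching case. Returns null when not
// all query characters are found in order.
function fuzzyScore(query: string, text: string): number | null {
  let score = 0;
  let qi = 0;
  let prevMatched = false;
  for (let ti = 0; ti < text.length && qi < query.length; ti++) {
    if (query[qi].toLowerCase() === text[ti].toLowerCase()) {
      score += 1;                             // base match
      if (prevMatched) score += 5;            // consecutive-run bonus
      if (query[qi] === text[ti]) score += 1; // exact-case bonus
      prevMatched = true;
      qi++;
    } else {
      prevMatched = false;
    }
  }
  return qi === query.length ? score : null;
}
```

Unlike the DP matrix above, this is O(length of text) per candidate, which is why it holds up much better on long strings.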


Great list of tips!


No demo to try it out? How does it compare to Fuse? I used that recently to write a search component in Preact for my site.

https://www.lloydatkinson.net/posts/2022/writing-a-fuzzy-sea...

I might make a branch to try out Lyra and compare results.


Could this be made to work in Deno? Seems like it would be a good fit, given the domain and the minimal dependencies


Lyra author here: we planned support for Node.js, Deno, Bun, and plain V8. We'll be adding runtime-specific APIs via a plugin system later on


I admittedly have been hesitant to jump into denoland but isn't it just a runtime for typescript? Wouldn't it work out of the box?


There are a couple differences that are relevant to libraries:

1. System APIs (including fetch, etc) are entirely different from Node. Probably not relevant for this library though

2a. CommonJS-style imports/exports aren't supported; you have to use proper ES imports/exports

2b. Import paths must include the file extension, which non-Deno JS/TS often doesn't do

3. You don't use a package manager to install Deno dependencies; you import the source directly from the local file system or from a remote URL. The good news is this includes GitHub URLs! But the bad news is that eg. any transitive dependencies listed in a package.json won't be found

So... it's usually easy to port a project to Deno, but few NPM projects are compatible out of the box

One workaround that can often be used for points 2 and 3 is to load the package through something like Skypack (https://www.skypack.dev/), which will theoretically turn the whole thing into an ES module. But sometimes, depending on the library, there may still be weird issues

The sooner we can ship all libraries as ES modules by default, the better!
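To illustrate points 2b and 3, here's roughly what the difference in import style looks like; the paths and URLs below are hypothetical:

```typescript
// Node-style imports -- not accepted by Deno:
//   const { search } = require("some-lib");   // CommonJS
//   import { search } from "./search";        // ESM, but no extension
//
// Deno-style: explicit file extensions and direct URL imports,
// no package manager involved:
import { search } from "./search.ts";
import { tokenize } from "https://raw.githubusercontent.com/example/some-lib/main/tokenizer.ts";
```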


Regarding 2b, that’s also the case in node when using es imports, so support for that probably follows from 2a in most cases.


True, but unfortunately in TypeScript the norm is to omit the file extension (I think you even have to make configuration changes to allow otherwise). Node doesn't care because it doesn't deal with the TypeScript directly at all, so most Node-targeted TypeScript ends up not having file extensions


It works fine when imported from esm.sh


I had developed a simple full-text search engine in JavaScript a long time ago, for a specific project; at the time there were, surprisingly, no ready-made options.

It's quite straightforward for simple use cases (small corpus, no typo tolerance, no fuzzing, etc.)


v8 is not great at large heap sizes...


Agree! There should be an option for a pluggable storage system for this. That way people can write, or pick and choose, whatever storage mechanism they need.


indeed, I see that this project is storing documents as JS objects, so it is likely to overwhelm the GC when the database is in the region of 500MB or so. For things like this it's usually better to use an ArrayBuffer for storage, but that complicates the implementation significantly.

It is possible to write very fast JS, but it's hard to write very fast idiomatic JS.


I believe that depends on the document size. Large, static stuff gets written in a separate heap. Generally, bad performance happens with tons of tiny items.


Why isn't v8 good at large heap sizes?


Good question. This has just been my experience compared to the JVM or Go. That said, it's still an amazing VM. Just don't use it for 100gb heaps...

Probably something to do with being designed to run as part of a web browser. :)




Name collisions are becoming quite common. Fortunately, these projects are at least in separate languages/environments.


Searching on github shows (active, 50 stars or more):

* A Very Low-Bitrate Codec for Speech Compression from google

* An interactive, graphical Visualization Design Environment (VDE)

* A paid-members theme for Ghost

* High availability RabbitMQ client

* Open Source Workflow Engine for Cloud Native Infrastructure

* The Lyra Protocol (ethereum)

* Symfony2 admin bundle

There's no point complaining about name "collisions".


Yes, Lyra is the name of a constellation, so there are collisions even outside the programming world.


‘Fast’ and ‘in TypeScript’ seem to be mutually exclusive. At least for something as compute-intensive as a full-text search engine.


That's condescending. There is, obviously, a use case for in-core text search. E.g., mine is that it's better to waste a few CPU cycles in the browser than on the server. And JavaScript runs pretty fast, and has been doing so for years now. And while a JavaScript implementation won't beat a well-programmed C/Rust/Go version, the fact that it's in-core undoubtedly helps its speed. So, if you ever need a web user to search some moderately sized set of texts, this might be really helpful.

As a matter of fact, I have a similar module in my applications, e.g. for typo resistant searching through help docs and hierarchical menus.


> C/Rust/Go version

Minor nitpick: JS/V8 beats Go in most benchmarks that are relevant for search.

JS is the outlier here however, because of the insane amount of optimizations that made it perform so unreasonably well, despite being an interpreted language.


I've never seen such a benchmark. Which one is it?


It obviously depends on how you've implemented the feature, but there are a lot of comparative benchmarks if you bother to put "go vs JS performance benchmark" into the search bar and press enter. The language itself doesn't determine code performance, however, so you'll have good and bad implementations in any language you go with.

E.g. specifically regex, which is highly relevant when searching through strings: https://github.com/mariomka/regex-benchmark

And the always interesting techempower Project, which leaves the implementation to participants of each round. https://www.techempower.com/benchmarks/#section=data-r21&tes...

Choose whatever category you wish there; JS is faster than Go in almost all of them.

Even though I said it before, I'm going to repeat myself as I expect you to ignore my previous message: the language doesn't make any implementation fast or slow. You can have a well-performing search engine in Go and in JS. The performance difference will most likely not be caused by the language with these two choices, and the same applies to C/Rust. The language won't make the engine performant, and creating a maximally performant search engine is hard. But a theoretically perfect implementation would likely be fastest in C/Rust, followed by the usual suspects such as Go/Java/C#/JS, and finally ending with the other interpreted languages such as Ruby and Python.


Regex doesn't feature in this kind of library, at least not in mine. It's a token by token comparison/weighing, for which regular expressions are unsuitable. Mine looks at string prefixes and does a Levenshtein comparison on the results, and I think OP's project does that too.

The techempower benchmarks seem geared towards http backend server frameworks. There's only one Javascript framework that scores well, "just-js", but that's a bit low-level for a framework. I don't think it says much about text search performance.

> the language doesn't make any implementation fast or slow

Not ignoring that, but I think it's half true (you can't have a well performing app in native Python, basically, but a bad implementation will also cost a lot of performance). JS is fast enough for most tasks, that's true.


> the language doesn't make any implementation fast or slow

I think some languages are much easier to make things fast in than others. Even if the theoretical limit is the same (or nearly the same) in all languages.

Messing up somewhere in Javascript with async isn’t unlikely.


Well, the types only exist at compile time, so they're fast at least!


Disclaimer: Lyra author here!

The biggest problem I have with TypeScript is knowing WHAT the .ts files will be compiled into. I.e., if you're using enums or decorators, there's no plain support for them in JS, so you'll end up with some kind of "polyfills", which are not so optimized.

Once you analyze your compiled JS, you can write TS knowing what you're gonna get, which is the trick to make it optimized.

My two cents :)
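For illustration, this is roughly what the TypeScript compiler emits for a plain (non-const) enum such as `enum Color { Red, Green }`:

```javascript
// Compiled output of a non-const TypeScript enum: an IIFE that
// builds an object with forward and reverse mappings.
var Color;
(function (Color) {
  Color[Color["Red"] = 0] = "Red";
  Color[Color["Green"] = 1] = "Green";
})(Color || (Color = {}));
// Result: Color.Red === 0 and Color[0] === "Red".
```

A `const enum`, by contrast, is inlined as plain numeric literals at each use site and produces no object at all.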


The trick there is to not use enums or decorators! In all seriousness, I believe it’s common for typescript codebases to eschew these features, precisely because it’s hard to reason about the generated code (and preferable to avoid features that aren’t on an ECMAScript standards track).


Enums compile down to pretty standard objects, right? Is there a situation in which you found them to do something odd?

Decorators, yeah, I can imagine that’d end up interesting.



