Interesting that this is fast! I looked at the code and it looks like a super vanilla (prefix) trie, at least for exact-match searching. For approximate searching, it seems to just explore the trie from the endpoint of the word in the query; if I understand correctly, it wouldn't find "banana" given the query "baana" and some tolerance.
The readme says that the trie has some tricks to make it fast; the only such trick that I could find is "fast properties", which seems to be a JS-specific thing to make object property access faster (something with promoting an ad-hoc object to a prototype, which presumably enjoys more optimisation in JS engines).
Hence I'm sort-of surprised that this performs well! If a simple trie is sufficient for that, I might go forward with implementing search for some of my other stuff just like that.
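For reference, a plain object-based prefix trie of the kind described might look like this. This is a minimal sketch of the general idea, not Lyra's actual code:

```javascript
// Minimal object-based prefix trie: insert words, then test exact membership.
function createNode() {
  return { children: {}, isWord: false };
}

function insert(root, word) {
  let node = root;
  for (const ch of word) {
    node.children[ch] ??= createNode(); // create the child node if missing
    node = node.children[ch];
  }
  node.isWord = true;
}

function contains(root, word) {
  let node = root;
  for (const ch of word) {
    node = node.children[ch];
    if (!node) return false;
  }
  return node.isWord;
}

const root = createNode();
["banana", "band", "bandana"].forEach((w) => insert(root, w));
console.log(contains(root, "banana")); // true
console.log(contains(root, "baana")); // false: exact match only, no typo tolerance
```

As the comment above notes, the exact-match walk gives no typo tolerance by itself; approximate matching needs extra exploration of sibling branches.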
Modern computers are ridiculously fast, and V8's JIT compiler can produce pretty fast native code for hot paths.
All of that together allows JS to perform surprisingly well in the vanilla case, even with simple algorithms. I wonder how much faster one could make the lookup by e.g. using a typed array to represent the trie and the index; that would be cache-friendly.
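A minimal sketch of that typed-array idea, assuming a lowercase a-z alphabet and a fixed node capacity (neither of which any real library necessarily uses):

```javascript
// Trie over lowercase a-z packed into one flat Int32Array. Each node occupies
// 27 slots: 26 child pointers (0 = absent) plus a terminal flag. All nodes
// live in one contiguous buffer, which is friendlier to the CPU cache than a
// graph of small GC-managed objects.
const SLOTS = 27;
const arr = new Int32Array(SLOTS * 100000); // assumed capacity: 100k nodes
let nextNode = 1; // node 0 is the root, so 0 can mean "no child"

function insert(word) {
  let node = 0;
  for (const ch of word) {
    const c = ch.charCodeAt(0) - 97; // 'a' => 0
    const slot = node * SLOTS + c;
    if (arr[slot] === 0) arr[slot] = nextNode++; // allocate a fresh node
    node = arr[slot];
  }
  arr[node * SLOTS + 26] = 1; // mark end of a word
}

function contains(word) {
  let node = 0;
  for (const ch of word) {
    node = arr[node * SLOTS + (ch.charCodeAt(0) - 97)];
    if (node === 0) return false;
  }
  return arr[node * SLOTS + 26] === 1;
}

insert("banana");
insert("band");
console.log(contains("banana")); // true
console.log(contains("ban")); // false: prefix, not a stored word
```

The trade-off is exactly the one mentioned elsewhere in this thread: the flat layout is fast but considerably less convenient than plain objects.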
I'm less surprised about the fact that JS can be fast, and more that a vanilla trie can be fast and (hopefully) resource-efficient; I had the impression that any scalable index would need to compact the trie in some manner (i.e. combining multiple characters in one node) or do other algorithmic improvements that I don't know about. (That's why I looked at the code, to find those -- but I didn't really find any)
The current release of Lyra treats the trie structure as a tree (of course), but we're planning to start approaching it as a graph for exactly the reason you're mentioning. We don't have performance comparisons yet, but we'll document everything.
I quite enjoy minisearch[1], which also has zero dependencies, is actively maintained, and I expect would work well in a worker environment. I dropped it into a service worker and combined it with a simple point-in-polygon script to enable geosearch for a recent project[2], and it played v. nicely.
Lyra author here: I wouldn't compare it with other JS search engines for a simple reason: we'll be targeting edge execution (Cloudflare workers, lambda@edge, etc.), so the focus is really different
Why does that matter? It seems you can run Lyra on the client and you can run other JS search engines on the "edge" (all that's needed is a JS runtime if I understand it correctly). It would be interesting to see how Lyra compares in these benchmarks.
I was going to ask how this is different than lunr but I just noticed lunr hasn't been updated for almost 2 years now: https://github.com/olivernn/lunr.js
Anything with npm dependencies requires constant updates. The code may be finished but the dependencies publish security updates that need to be pulled in.
I implemented a fuzzy search a few years ago, and here are some things to look into.
Damerau-Levenshtein is a significant upgrade over the plain Levenshtein algorithm that's typically used: it also picks up the very common transposition error, for not a lot of extra code.
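A sketch of the restricted Damerau-Levenshtein variant (optimal string alignment), which adds just one extra case to the usual Levenshtein recurrence:

```javascript
// Restricted Damerau-Levenshtein (optimal string alignment): like plain
// Levenshtein but with one extra case for adjacent transpositions.
function osaDistance(a, b) {
  const m = a.length, n = b.length;
  // (m+1) x (n+1) matrix of edit distances between prefixes of a and b
  const d = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(0));
  for (let i = 0; i <= m; i++) d[i][0] = i;
  for (let j = 0; j <= n; j++) d[0][j] = j;
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution
      );
      // the extra transposition case: swapped adjacent characters cost 1
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost);
      }
    }
  }
  return d[m][n];
}

console.log(osaDistance("banana", "bnaana")); // 1 (plain Levenshtein gives 2)
console.log(osaDistance("kitten", "sitting")); // 3
```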
Typed arrays would also probably speed things up quite a bit.
Tokenizing first may not be faster. Set the first row of weights to zero, and when you're done, look for the lowest value in the final row. This also saves the trouble of trying to tokenize properly for every language, as it's only concerned with the bytes themselves.
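A sketch of that zero-first-row trick: initialising the first row to zeros lets a match begin anywhere in the text, and the minimum of the final row is the best edit distance of the pattern against any substring (function and parameter names here are mine):

```javascript
// Approximate substring search: a Levenshtein matrix whose first row is all
// zeros, so a match may start at any position in the text. The minimum of the
// final row is the best edit distance of `pattern` against any substring.
function bestSubstringDistance(pattern, text) {
  const m = pattern.length, n = text.length;
  let prev = new Array(n + 1).fill(0); // first row: all zeros
  for (let i = 1; i <= m; i++) {
    const curr = new Array(n + 1).fill(0);
    curr[0] = i; // consuming pattern characters still costs
    for (let j = 1; j <= n; j++) {
      const cost = pattern[i - 1] === text[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return Math.min(...prev); // best match ending anywhere in the text
}

console.log(bestSubstringDistance("ana", "banana")); // 0: exact substring
console.log(bestSubstringDistance("banana", "I bought a bnana yesterday")); // 1
```

Because it walks raw characters with no word boundaries, no language-specific tokenizer is needed, which is the point being made above.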
Performance of that algorithm also suffers a LOT with very long strings. For those, we went with something more like the fuzzy search used in Sublime Text: basically, you iterate the string once and look for the search characters to appear in order. To go more advanced, add a weight to each match type; stuff like consecutive matching characters, matching case, or immediately transposed characters next to each other gives a higher weight to the resulting string.
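A sketch of that single-pass, in-order matching idea; the specific weights here are illustrative, not Sublime Text's actual ones:

```javascript
// One pass over the text, requiring the query characters to appear in order,
// with simple bonuses for consecutive and case-exact matches. Weights are
// made up for illustration.
function fuzzyScore(query, text) {
  let score = 0;
  let qi = 0; // next query character to find
  let prevMatched = false;
  for (let ti = 0; ti < text.length && qi < query.length; ti++) {
    if (text[ti].toLowerCase() === query[qi].toLowerCase()) {
      score += 1;
      if (prevMatched) score += 2; // consecutive-match bonus
      if (text[ti] === query[qi]) score += 1; // exact-case bonus
      prevMatched = true;
      qi++;
    } else {
      prevMatched = false;
    }
  }
  return qi === query.length ? score : 0; // 0 = not all characters found
}

console.log(fuzzyScore("bnn", "banana") > 0); // true: b..n..n appear in order
console.log(fuzzyScore("bnx", "banana")); // 0: 'x' never appears
```

Unlike the dynamic-programming approach, this is O(text length) per query, which is why it holds up much better on long strings.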
There are a couple differences that are relevant to libraries:
1. System APIs (including fetch, etc) are entirely different from Node. Probably not relevant for this library though
2a. CommonJS-style imports/exports aren't supported; you have to use proper ES imports/exports
2b. Import paths must include the file extension, which non-Deno JS/TS often doesn't do
3. You don't use a package manager to install Deno dependencies; you import the source directly from the local file system or from a remote URL. The good news is this includes GitHub URLs! But the bad news is that e.g. any transitive dependencies listed in a package.json won't be found.
So... it's usually easy to port a project to Deno, but few NPM projects are compatible out of the box
One workaround that can often be used for points 2 and 3 is to load the package through something like Skypack (https://www.skypack.dev/), which will theoretically turn the whole thing into an ES module. But sometimes, depending on the library, there may still be weird issues
The sooner we can ship all libraries as ES modules by default, the better!
True, but unfortunately in TypeScript the norm is to omit the file extension (I think you even have to make configuration changes to allow otherwise). Node doesn't care because it doesn't deal with the TypeScript directly at all, so most Node-targeted TypeScript ends up not having file extensions
I had developed a simple full-text search engine in JavaScript a long time ago, for a specific project; at the time there were, surprisingly, no ready-made options.
It's quite straightforward for simple use cases (small corpus, no typo tolerance, no fuzzing, etc.)
Indeed; I see that this project is storing documents as JS objects, so it's likely to overwhelm the GC once the database is in the region of 500 MB or so. For things like this it's usually better to use an ArrayBuffer for storage, but that complicates the implementation significantly.
It is possible to write very fast JS, but it's hard to write very fast idiomatic JS.
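One common way to do that is to pack all the document text into a single buffer and keep only offsets into it; a rough sketch, with class and method names of my own invention:

```javascript
// Packing document strings into one buffer instead of holding millions of
// small JS strings/objects: the GC then sees one large allocation rather
// than a huge object graph of tiny items.
const encoder = new TextEncoder();
const decoder = new TextDecoder();

class PackedStrings {
  constructor(capacityBytes = 1 << 20) {
    this.bytes = new Uint8Array(capacityBytes); // fixed capacity for the sketch
    this.offsets = []; // flat [start, length] pairs; could also be typed
    this.used = 0;
  }
  add(s) {
    const encoded = encoder.encode(s); // UTF-8 bytes
    this.bytes.set(encoded, this.used);
    this.offsets.push(this.used, encoded.length);
    this.used += encoded.length;
    return this.offsets.length / 2 - 1; // document id
  }
  get(id) {
    const start = this.offsets[id * 2];
    const len = this.offsets[id * 2 + 1];
    return decoder.decode(this.bytes.subarray(start, start + len));
  }
}

const store = new PackedStrings();
const id = store.add("banana");
store.add("bandana");
console.log(store.get(id)); // "banana"
```

The awkward part, as noted above, is that everything downstream must now work with ids and offsets instead of ordinary strings, which is what makes the fast version unidiomatic.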
I believe that depends on the document size. Large, static objects get allocated in a separate heap (V8's large object space); generally, bad GC performance comes from tons of tiny items.
That's condescending. There is, obviously, a use case for in-core text search. Mine, for example, is that it's better to waste a few CPU cycles in the browser than on the server. JavaScript runs pretty fast, and has been doing so for years now; and while a JavaScript implementation won't beat a well-programmed C/Rust/Go version, the fact that it runs in core undoubtedly helps its speed. So, if you ever need a web user to search some moderately sized set of texts, this might be really helpful.
As a matter of fact, I have a similar module in my applications, e.g. for typo resistant searching through help docs and hierarchical menus.
Minor nitpick: JS/V8 beats Go in most benchmarks that are relevant for search.
JS is the outlier here, however, because of the insane amount of optimization work that has made it perform so unreasonably well, despite its origins as an interpreted language.
It obviously depends on how you've implemented the feature, but there are a lot of comparative benchmarks if you bother to put "go vs JS performance benchmark" into the search bar and press enter.
The language itself doesn't determine code performance, however, so you'll have good and bad implementations in any language you go with.
Choose whatever category you wish there; JS is faster than Go in almost all of them.
Even though I said it before, I'm going to repeat myself, as I expect you to ignore my previous message: the language doesn't make any implementation fast or slow. You can have a well-performing search engine in Go and in JS, and the performance difference will most likely not be caused by the language with these two choices. The same applies to C/Rust. The language won't make the engine performant, and creating a maximally performant search engine is hard. But a theoretically perfect implementation would likely be fastest in C/Rust, followed by the usual suspects such as Go/Java/C#/JS, and finally ending with the other interpreted languages such as Ruby and Python.
Regex doesn't feature in this kind of library, at least not in mine. It's a token-by-token comparison/weighting, for which regular expressions are unsuitable. Mine looks at string prefixes and does a Levenshtein comparison on the results, and I think OP's project does that too.
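A sketch of that two-stage approach: a cheap prefix filter first, then Levenshtein only on the survivors (the function names and thresholds here are illustrative, not from any particular library):

```javascript
// Standard iterative Levenshtein distance, keeping only two rows.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Filter candidates by a shared prefix, then rank the small remainder
// by edit distance.
function search(query, tokens, prefixLen = 2, maxDist = 2) {
  const prefix = query.slice(0, prefixLen);
  return tokens
    .filter((t) => t.startsWith(prefix)) // cheap filter over all tokens
    .map((t) => [t, levenshtein(query, t)]) // expensive step on few tokens
    .filter(([, d]) => d <= maxDist)
    .sort((a, b) => a[1] - b[1])
    .map(([t]) => t);
}

console.log(search("banana", ["banana", "bandana", "cabana", "band"]));
// ["banana", "bandana"]
```

The prefix filter is what a trie gives you for free: walking the query's first characters narrows the candidate set before any edit-distance work happens.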
The TechEmpower benchmarks seem geared towards HTTP backend server frameworks. There's only one JavaScript framework that scores well, "just-js", but that's a bit low-level for a framework. I don't think it says much about text-search performance.
> the language doesn't make any implementation fast or slow
Not ignoring that, but I think it's half true: you basically can't have a well-performing app in native Python, but a bad implementation will also cost a lot of performance. JS is fast enough for most tasks, that's true.
> the language doesn't make any implementation fast or slow
I think some languages are much easier to make things fast in than others. Even if the theoretical limit is the same (or nearly the same) in all languages.
Messing up somewhere in Javascript with async isn’t unlikely.
The biggest problem I have with TypeScript is knowing WHAT the .ts files will be compiled into. E.g., if you're using enums or decorators, there's no plain support for them in JS, so you'll end up with some kind of "polyfill", which is not so optimized.
Once you analyze your compiled JS, you can write TS knowing what you're going to get, which is the trick to keeping it optimized.
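For example, a numeric TypeScript enum such as `enum Color { Red, Green }` compiles to an IIFE that builds a two-way lookup object at runtime, rather than disappearing at compile time:

```javascript
// What `enum Color { Red, Green }` compiles down to: an IIFE that builds a
// two-way (name <-> number) lookup object at runtime.
var Color;
(function (Color) {
  Color[Color["Red"] = 0] = "Red";
  Color[Color["Green"] = 1] = "Green";
})(Color || (Color = {}));

console.log(Color.Red); // 0
console.log(Color[0]); // "Red"
```

That runtime object (an ad-hoc shape with mixed string and numeric keys) is exactly the kind of emitted code you only see once you inspect the compiled output.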
The trick there is to not use enums or decorators! In all seriousness, I believe it's common for TypeScript codebases to eschew these features, precisely because it's hard to reason about the generated code (and preferable to avoid features that aren't on an ECMAScript standards track).