Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Ken Thompson's NFA regex patent (1971) (patents.google.com)
120 points by jonstewart on Nov 11, 2022 | hide | past | favorite | 20 comments


I read an article a million years ago which I thought described this mechanism in a super friendly way:

https://perl.plover.com/Regex/article.html

It uses the idea of "putting a penny down" on the graph and moving it around, and I found the conversational tone of the description to be very easy to grok. Assuming this is the exact same mechanism (which the author cites), it might resonate for some of you too.


Nice article. On this bit: " It's about how you would write a regex package from scratch, in a language like C that doesn't already have regexes."

There's a 30-line C regex matching program that Rob Pike wrote and Brian Kernighan wrote about here. It only implements . ^ $ * (where c is any literal):

"A Regular Expression Matcher" (2007)

https://www.cs.princeton.edu/courses/archive/spr09/cos333/be...

> "Rob's implementation itself is a superb example of beautiful code: compact, elegant, efficient, and useful. It's one of the best examples of recursion that I have ever seen, and it shows the power of C pointers. Although at the time we were most interested in conveying the important role of a good notation in making a program easier to use and perhaps easier to write as well, the regular expression code has also been an excellent way to illustrate algorithms, data structures, testing, performance enhancement, and other important topics." - Brian Kernighan


As I said before: that's a simple backtracker with exponential complexity, and it adding more operators will trigger that behavior faster. I think Thompson's NFA matcher is O(n m), but I didn't check the patent. If it isn't, it's fairly easy to make it that.


Holy smoke, what an amazing article. This has nagged the back of my brain for a very long time.


ken also wrote a paper about his algorithm, published in CACM in 1968: https://dl.acm.org/doi/pdf/10.1145/363347.363387 (PDF)



Does anyone know if this patent was ever asserted by AT&T? (Of course, it's long expired.)


He did it before he was at bell labs


The patent's assigned to AT&T, though. If that was in effect before it expired, A&T could have asserted it.


Do current regex implementations use this algorithm?


Most don't, but re2 and a few others do. The ones that do use it don't have exponential runtime on malicious inputs and lack a few features (back references, mostly). https://swtch.com/~rsc/regexp/ is a great resource on this.


> Most don't

Is this true? I'm pretty sure that most of the regex engines I've used (grep, ripgrep, re2, hyperscan) use Thompson's construction or at least some NFA-based algorithm (not necessarily Thomopson's; Glushkov automata in particular are used by hyperscan).


Yes, all of those are finite automata based. Although grep usually uses the POSIX system regex library. GNU grep will use its own finite automata based approach for a subset (a large subset) of regexes.

But most regex engines are backtracking based. Perl, PCRE (used by PHP and probably others), Oniguruma/Onigmo (used by Ruby), Javascript, Java, Python. That covers a lot of ground.

Plus, popular reference web sites like https://www.regular-expressions.info/ are heavily biased toward regex engines that have features typically only implemented by backtracking engines.


There's also postgres's which descends from spencer's (Tcl Advanced Regular Expressions) and supports backreferences but IME is very hard to catch into catastrophic backtracking.

Possibly because you need to use the feature to trigger the NFA mode and the baseline test case for catastrophic backtracking doesn't, it just assumes the engine will backtrack.


GNU grep historical design discussion:

https://lists.freebsd.org/pipermail/freebsd-current/2010-Aug...

Modern GNU grep includes optional PCRE2 support. I incorrectly recalled that it skipped NFA->DFA conversion, but that maybe how something else like Go or re2 work in certain cases.

https://git.savannah.gnu.org/cgit/grep.git/tree/src

Most people in tech don't seem to grasp that there are very few compatible/identical regex formats. Hell, most people don't know what are the comprehensive list of code points characters classes include because they're poorly doc'd or undocumented. I had to write some scripts to find them for certain languages: Ruby, Python, and Rust. I advise people to never reinvent regex or Unicode parsing themselves because there are far too many security issues and edge cases that will inevitably become problems.


You're speaking to the author of Rust's regex engine.

> Hell, most people don't know what are the comprehensive list of code points characters classes include because they're poorly doc'd or undocumented. I had to write some scripts to find them for certain languages: Ruby, Python, and Rust.

Can you say more? All of the classes are documented here for the regex crate: https://docs.rs/regex/latest/regex/#syntax

Any not listed there are from Unicode and defined by Unicode.

> I advise people to never reinvent regex or Unicode parsing themselves because there are far too many security issues and edge cases that will inevitably become problems.

So what should have I done instead?

I generally advise people never to say "never reinvent something," because that stifles innovation and progress.

> Modern GNU grep includes optional PCRE2 support. I incorrectly recalled that it skipped NFA->DFA conversion, but that maybe how something else like Go or re2 work in certain cases.

Not quite sure what you're trying to say here, but see: https://news.ycombinator.com/item?id=33567129


As a side note, when converting the NFA to a DFA, you get an exponential number of states, since each DFA state corresponds to a set of NFA states.


True, and that's why libraries like RE2 don't build full DFAs. They use lazy DFAs. DFA states are compiled as needed. At most one state is built per byte of input, so this sidesteps the exponential time bound.

In all implementations I know of, memory use is bounded. When memory fills up, the cache of DFA states is cleared. If this happens too many times, RE2 quits the lazy DFA and falls back to a Pike VM. (Effectively the Thompson NFA matcher, but with capture group support.)


Most modern 'regex' implementations don't actually implement regular grammars.

I think they're generally closer in power to PDA.

https://en.wikipedia.org/wiki/Pushdown_automaton


Yep absolutely, almost all of them are more powerful than PDAs and in fact Perl's are Turing complete.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: