Ken Thompson's NFA regex patent (1971)

joedrago · on Nov 11, 2022

I read an article a million years ago which I thought described this mechanism in a super friendly way:

https://perl.plover.com/Regex/article.html

It uses the idea of "putting a penny down" on the graph and moving it around, and I found the conversational tone of the description to be very easy to grok. Assuming this is the exact same mechanism (which the author cites), it might resonate for some of you too.

photochemsyn · on Nov 12, 2022

Nice article. On this bit: " It's about how you would write a regex package from scratch, in a language like C that doesn't already have regexes."

There's a 30-line C regex matching program that Rob Pike wrote and Brian Kernighan wrote about here. It only implements . ^ $ * (where c is any literal):

"A Regular Expression Matcher" (2007)

https://www.cs.princeton.edu/courses/archive/spr09/cos333/be...

> "Rob's implementation itself is a superb example of beautiful code: compact, elegant, efficient, and useful. It's one of the best examples of recursion that I have ever seen, and it shows the power of C pointers. Although at the time we were most interested in conveying the important role of a good notation in making a program easier to use and perhaps easier to write as well, the regular expression code has also been an excellent way to illustrate algorithms, data structures, testing, performance enhancement, and other important topics." - Brian Kernighan

tgv · on Nov 12, 2022

As I said before: that's a simple backtracker with exponential complexity, and it adding more operators will trigger that behavior faster. I think Thompson's NFA matcher is O(n m), but I didn't check the patent. If it isn't, it's fairly easy to make it that.

jedimastert · on Nov 11, 2022

Holy smoke, what an amazing article. This has nagged the back of my brain for a very long time.

jonstewart · on Nov 11, 2022

ken also wrote a paper about his algorithm, published in CACM in 1968: https://dl.acm.org/doi/pdf/10.1145/363347.363387 (PDF)

HellsMaddy · on Nov 11, 2022

https://en.wikipedia.org/wiki/Thompson's_construction

Cupertino95014 · on Nov 11, 2022

Does anyone know if this patent was ever asserted by AT&T? (Of course, it's long expired.)

bodhiandphysics · on Nov 12, 2022

He did it before he was at bell labs

Cupertino95014 · on Nov 12, 2022

The patent's assigned to AT&T, though. If that was in effect before it expired, A&T could have asserted it.

ape4 · on Nov 11, 2022

Do current regex implementations use this algorithm?

bla3 · on Nov 11, 2022

Most don't, but re2 and a few others do. The ones that do use it don't have exponential runtime on malicious inputs and lack a few features (back references, mostly). https://swtch.com/~rsc/regexp/ is a great resource on this.

iron2disulfide · on Nov 11, 2022

> Most don't

Is this true? I'm pretty sure that most of the regex engines I've used (grep, ripgrep, re2, hyperscan) use Thompson's construction or at least some NFA-based algorithm (not necessarily Thomopson's; Glushkov automata in particular are used by hyperscan).

burntsushi · on Nov 11, 2022

Yes, all of those are finite automata based. Although grep usually uses the POSIX system regex library. GNU grep will use its own finite automata based approach for a subset (a large subset) of regexes.

But most regex engines are backtracking based. Perl, PCRE (used by PHP and probably others), Oniguruma/Onigmo (used by Ruby), Javascript, Java, Python. That covers a lot of ground.

Plus, popular reference web sites like https://www.regular-expressions.info/ are heavily biased toward regex engines that have features typically only implemented by backtracking engines.

masklinn · on Nov 11, 2022

There's also postgres's which descends from spencer's (Tcl Advanced Regular Expressions) and supports backreferences but IME is very hard to catch into catastrophic backtracking.

Possibly because you need to use the feature to trigger the NFA mode and the baseline test case for catastrophic backtracking doesn't, it just assumes the engine will backtrack.

1letterunixname · on Nov 12, 2022

GNU grep historical design discussion:

https://lists.freebsd.org/pipermail/freebsd-current/2010-Aug...

Modern GNU grep includes optional PCRE2 support. I incorrectly recalled that it skipped NFA->DFA conversion, but that maybe how something else like Go or re2 work in certain cases.

https://git.savannah.gnu.org/cgit/grep.git/tree/src

Most people in tech don't seem to grasp that there are very few compatible/identical regex formats. Hell, most people don't know what are the comprehensive list of code points characters classes include because they're poorly doc'd or undocumented. I had to write some scripts to find them for certain languages: Ruby, Python, and Rust. I advise people to never reinvent regex or Unicode parsing themselves because there are far too many security issues and edge cases that will inevitably become problems.

burntsushi · on Nov 12, 2022

You're speaking to the author of Rust's regex engine.

> Hell, most people don't know what are the comprehensive list of code points characters classes include because they're poorly doc'd or undocumented. I had to write some scripts to find them for certain languages: Ruby, Python, and Rust.

Can you say more? All of the classes are documented here for the regex crate: https://docs.rs/regex/latest/regex/#syntax

Any not listed there are from Unicode and defined by Unicode.

> I advise people to never reinvent regex or Unicode parsing themselves because there are far too many security issues and edge cases that will inevitably become problems.

So what should have I done instead?

I generally advise people never to say "never reinvent something," because that stifles innovation and progress.

> Modern GNU grep includes optional PCRE2 support. I incorrectly recalled that it skipped NFA->DFA conversion, but that maybe how something else like Go or re2 work in certain cases.

Not quite sure what you're trying to say here, but see: https://news.ycombinator.com/item?id=33567129

klodolph · on Nov 11, 2022

As a side note, when converting the NFA to a DFA, you get an exponential number of states, since each DFA state corresponds to a set of NFA states.

burntsushi · on Nov 11, 2022

True, and that's why libraries like RE2 don't build full DFAs. They use lazy DFAs. DFA states are compiled as needed. At most one state is built per byte of input, so this sidesteps the exponential time bound.

In all implementations I know of, memory use is bounded. When memory fills up, the cache of DFA states is cleared. If this happens too many times, RE2 quits the lazy DFA and falls back to a Pike VM. (Effectively the Thompson NFA matcher, but with capture group support.)

yarg · on Nov 11, 2022

Most modern 'regex' implementations don't actually implement regular grammars.

I think they're generally closer in power to PDA.

https://en.wikipedia.org/wiki/Pushdown_automaton

Kranar · on Nov 12, 2022

Yep absolutely, almost all of them are more powerful than PDAs and in fact Perl's are Turing complete.