Two weeks ago, I was at the Workshop Programmiersprachen und Rechenkonzepte, a yearly meeting of German programming language researchers. At the workshop, Frank Huch and Sebastian Fischer gave a really excellent talk about an elegant regular expression matcher written in Haskell. One design goal of the matcher was to run in time linear in the length of the input string (i.e. without backtracking) and linear in the size of the regular expression. Its memory use should also be only linear in the size of the regular expression.
In this blog post I want to describe this implementation and show the code of it, because it is quite simple. In a later post I will show what optimizations PyPy can perform on this matcher and also do some benchmarks.
Another note: This algorithm could not be used to implement PyPy's re module! So it won't help to speed up this currently rather slow implementation.
Implementing Regular Expression Matchers
There are two typical approaches to implementing regular expressions. A naive one is to use a backtracking implementation, which can lead to exponential matching times given a sufficiently evil regular expression.
The other, more complex one, is to transform the regular expression into a non-deterministic finite automaton (NFA) and then transform the NFA into a deterministic finite automaton (DFA). A DFA can be used to match a string efficiently. The problem with this approach is that turning an NFA into a DFA can lead to exponentially large automata.
Given this problem of potential memory explosion, a more sophisticated approach to matching is to not construct the DFA fully, but instead use the NFA for matching. This requires some care, because it is necessary to keep track of which set of states the automaton is in (it is not just one state, because the automaton is non-deterministic).
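To illustrate what keeping track of a set of states means, here is a minimal sketch that simulates a small hand-written NFA for (a|b)*abb. The transition table and the names NFA_TRANSITIONS, ACCEPTING and nfa_match are hypothetical and chosen only for this illustration. They are not part of the matcher described in this post.

```python
# Hypothetical hand-written NFA for (a|b)*abb; state 0 is the start state.
# A missing (state, char) entry means there is no transition.
NFA_TRANSITIONS = {
    (0, 'a'): {0, 1},   # non-deterministic: stay in 0 or guess that "abb" starts here
    (0, 'b'): {0},
    (1, 'b'): {2},
    (2, 'b'): {3},
}
ACCEPTING = {3}


def nfa_match(s):
    """Simulate the NFA by tracking the set of all states it could be in."""
    states = {0}
    for c in s:
        next_states = set()
        for state in states:
            next_states |= NFA_TRANSITIONS.get((state, c), set())
        states = next_states
    # the string matches if at least one possible state is accepting
    return bool(states & ACCEPTING)


print(nfa_match("ababb"))
print(nfa_match("abba"))
```

The set of states never grows beyond the number of NFA states, which is why this style of matching keeps memory use bounded by the size of the automaton.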
The algorithm described here is essentially equivalent to this approach. However, it does not need an intermediate NFA and represents a state of a corresponding DFA as a marked regular expression (represented as a tree of nodes). For many details about an alternative approach to implementing regular expressions efficiently, see Russ Cox's excellent article collection.
In the algorithm the regular expression is represented as a tree of nodes. The leaves of the tree each match exactly one character, except for the epsilon node, which matches the empty string. The inner nodes of the tree combine other nodes in various ways, like alternative, sequence or repetition. Every node in the tree can potentially have a mark. A node is marked if its sub-expression matches the string seen so far.
The basic approach of the algorithm is that for every character of the input string the regular expression tree is walked and a number of the nodes in the regular expression are marked. At the end of the string, if the top-level node is marked, the string matches, otherwise it does not. At the beginning of the string, one mark gets shifted into the regular expression from the top, and then the marks that are in the regex already are shifted around for every additional character.
Let's start looking at some code, and an example to make this clearer. The base class of all regular expression nodes is this:
    class Regex(object):
        def __init__(self, empty):
            # empty denotes whether the regular expression
            # can match the empty string
            self.empty = empty
            # mark that is shifted through the regex
            self.marked = False

        def reset(self):
            """ reset all marks in the regular expression """
            self.marked = False

        def shift(self, c, mark):
            """ shift the mark from left to right, matching character c."""
            # _shift is implemented in the concrete classes
            marked = self._shift(c, mark)
            self.marked = marked
            return marked
The match function which checks whether a string matches a regex is:
    def match(re, s):
        if not s:
            return re.empty
        # shift a mark in from the left, matching the first character
        result = re.shift(s[0], True)
        for c in s[1:]:
            # shift the internal marks around
            result = re.shift(c, False)
        re.reset()
        return result
The most important subclass of Regex is Char, which matches one concrete character:
    class Char(Regex):
        def __init__(self, c):
            Regex.__init__(self, False)
            self.c = c

        def _shift(self, c, mark):
            return mark and c == self.c
Shifting the mark through Char is easy: a Char instance retains a mark that is shifted in when the current character is the same as that in the instance.
Another easy case is that of the empty regular expression Epsilon:
    class Epsilon(Regex):
        def __init__(self):
            Regex.__init__(self, empty=True)

        def _shift(self, c, mark):
            return False
Epsilons never get a mark, but they can match the empty string.
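As a quick check of the two leaf classes, the following self-contained sketch repeats the definitions from above in condensed form and runs match on a few inputs. It assumes that match shifts the first character of the string (s[0]) in together with the initial mark.

```python
class Regex(object):
    def __init__(self, empty):
        self.empty = empty      # can this expression match the empty string?
        self.marked = False

    def reset(self):
        self.marked = False

    def shift(self, c, mark):
        self.marked = self._shift(c, mark)
        return self.marked


class Char(Regex):
    def __init__(self, c):
        Regex.__init__(self, False)
        self.c = c

    def _shift(self, c, mark):
        return mark and c == self.c


class Epsilon(Regex):
    def __init__(self):
        Regex.__init__(self, True)

    def _shift(self, c, mark):
        return False


def match(re, s):
    if not s:
        return re.empty
    result = re.shift(s[0], True)      # the mark enters with the first character
    for c in s[1:]:
        result = re.shift(c, False)    # afterwards no new mark is shifted in
    re.reset()
    return result


print(match(Char('a'), "a"))     # the mark is retained
print(match(Char('a'), "b"))     # wrong character, the mark is dropped
print(match(Char('a'), "aa"))    # no mark is shifted in for the second character
print(match(Epsilon(), ""))      # decided by the empty flag alone
print(match(Epsilon(), "a"))     # an Epsilon never gets a mark
```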
Now the more interesting cases remain. First we define an abstract base class Binary for composite regular expressions with two children, and then its first subclass Alternative, which matches if either of two regular expressions matches the string (usual regular expression syntax a|b).
    class Binary(Regex):
        def __init__(self, left, right, empty):
            Regex.__init__(self, empty)
            self.left = left
            self.right = right

        def reset(self):
            self.left.reset()
            self.right.reset()
            Regex.reset(self)

    class Alternative(Binary):
        def __init__(self, left, right):
            empty = left.empty or right.empty
            Binary.__init__(self, left, right, empty)

        def _shift(self, c, mark):
            marked_left = self.left.shift(c, mark)
            marked_right = self.right.shift(c, mark)
            return marked_left or marked_right
An Alternative can match the empty string if either of its children can. Similarly, shifting a mark into an Alternative shifts it into both its children. If either of the children is marked afterwards, the Alternative is marked too.
As an example, consider the regular expression a|b|c, which would be represented by the objects Alternative(Alternative(Char('a'), Char('b')), Char('c')). Matching the string "a" would lead to the following marks in the regular expression objects (green nodes are marked, white ones are unmarked):
At the start of the process, no node is marked. Then the first char is matched, which adds a mark to the Char('a') node, and the mark will propagate up the two Alternative nodes.
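The marks described above can be inspected directly by calling shift by hand instead of match (which resets the marks before returning). The sketch below is condensed for brevity: Alternative derives from Regex directly, and Binary and reset are left out, which is a simplification made only for this example.

```python
class Regex(object):
    def __init__(self, empty):
        self.empty = empty
        self.marked = False

    def shift(self, c, mark):
        self.marked = self._shift(c, mark)
        return self.marked


class Char(Regex):
    def __init__(self, c):
        Regex.__init__(self, False)
        self.c = c

    def _shift(self, c, mark):
        return mark and c == self.c


class Alternative(Regex):
    def __init__(self, left, right):
        Regex.__init__(self, left.empty or right.empty)
        self.left = left
        self.right = right

    def _shift(self, c, mark):
        marked_left = self.left.shift(c, mark)
        marked_right = self.right.shift(c, mark)
        return marked_left or marked_right


# a|b|c as Alternative(Alternative(Char('a'), Char('b')), Char('c'))
a, b, c = Char('a'), Char('b'), Char('c')
inner = Alternative(a, b)
outer = Alternative(inner, c)

# shift the initial mark in together with the character 'a'
outer.shift('a', True)

marks = {"a": a.marked, "b": b.marked, "c": c.marked,
         "a|b": inner.marked, "a|b|c": outer.marked}
print(marks)
```

Only Char('a') and the two Alternative nodes above it end up marked, while Char('b') and Char('c') stay unmarked.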
The two remaining classes are slightly trickier. Repetition is used to match a regular expression any number of times (usual regular expression syntax a*):
    class Repetition(Regex):
        def __init__(self, re):
            Regex.__init__(self, True)
            self.re = re

        def _shift(self, c, mark):
            return self.re.shift(c, mark or self.marked)

        def reset(self):
            self.re.reset()
            Regex.reset(self)
A Repetition can always match the empty string. The mark is shifted into the child, but if the Repetition is already marked, this will be shifted into the child as well, because the Repetition could match a second time.
As an example, consider the regular expression (a|b|c)* matching the string abcbac:
For every character, one of the alternatives matches, thus the repetition matches as well.
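This behaviour can be reproduced with a small self-contained sketch that shifts the characters of abcbac through (a|b|c)* one at a time and records whether the Repetition node is marked after each step. The classes are condensed copies of the ones shown in this post (Alternative derives from Regex directly and reset is omitted), and the names star and top_marks are chosen only for this example.

```python
class Regex(object):
    def __init__(self, empty):
        self.empty = empty
        self.marked = False

    def shift(self, c, mark):
        self.marked = self._shift(c, mark)
        return self.marked


class Char(Regex):
    def __init__(self, c):
        Regex.__init__(self, False)
        self.c = c

    def _shift(self, c, mark):
        return mark and c == self.c


class Alternative(Regex):
    def __init__(self, left, right):
        Regex.__init__(self, left.empty or right.empty)
        self.left = left
        self.right = right

    def _shift(self, c, mark):
        marked_left = self.left.shift(c, mark)
        marked_right = self.right.shift(c, mark)
        return marked_left or marked_right


class Repetition(Regex):
    def __init__(self, re):
        Regex.__init__(self, True)
        self.re = re

    def _shift(self, c, mark):
        # self.marked still holds the mark from the previous character here,
        # so a finished iteration feeds a fresh mark into the child
        return self.re.shift(c, mark or self.marked)


star = Repetition(Alternative(Alternative(Char('a'), Char('b')), Char('c')))

top_marks = []
for i, ch in enumerate("abcbac"):
    # only the first character carries the initial mark
    top_marks.append(star.shift(ch, i == 0))

print(top_marks)
```

The Repetition node is marked after every character, because each character completes one more iteration of the loop.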
The only missing class is that for sequences of expressions, Sequence (usual regular expression syntax ab):
    class Sequence(Binary):
        def __init__(self, left, right):
            empty = left.empty and right.empty
            Binary.__init__(self, left, right, empty)

        def _shift(self, c, mark):
            old_marked_left = self.left.marked
            marked_left = self.left.shift(c, mark)
            marked_right = self.right.shift(
                c, old_marked_left or (mark and self.left.empty))
            return (marked_left and self.right.empty) or marked_right
A Sequence can match the empty string only if both its children can. The mark handling is a bit delicate. If a mark is shifted in, it is shifted into the left child regular expression. If the left child was already marked before the shift, that mark is shifted into the right child. If the left child can match the empty string, the right child gets the incoming mark shifted in as well.
The whole sequence matches (i.e. is marked) if the left child is marked after the shift and the right child can match the empty string, or if the right child is marked.
Consider the regular expression abc matching the string abcd. For the first three characters, the marks wander from left to right. When the d is reached, the matching fails.
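The wandering marks can be observed with the following self-contained sketch. It assumes that abc is built as the left-nested tree Sequence(Sequence(Char('a'), Char('b')), Char('c')). The post does not specify the nesting, so this is a choice made for the example. The classes are condensed (Sequence derives from Regex directly, reset is omitted).

```python
class Regex(object):
    def __init__(self, empty):
        self.empty = empty
        self.marked = False

    def shift(self, c, mark):
        self.marked = self._shift(c, mark)
        return self.marked


class Char(Regex):
    def __init__(self, c):
        Regex.__init__(self, False)
        self.c = c

    def _shift(self, c, mark):
        return mark and c == self.c


class Sequence(Regex):
    def __init__(self, left, right):
        Regex.__init__(self, left.empty and right.empty)
        self.left = left
        self.right = right

    def _shift(self, c, mark):
        # remember the left mark from before this character is processed
        old_marked_left = self.left.marked
        marked_left = self.left.shift(c, mark)
        marked_right = self.right.shift(
            c, old_marked_left or (mark and self.left.empty))
        return (marked_left and self.right.empty) or marked_right


a, b, c = Char('a'), Char('b'), Char('c')
abc = Sequence(Sequence(a, b), c)

trace = []
for i, ch in enumerate("abcd"):
    top = abc.shift(ch, i == 0)
    # record the marks of the three leaves and of the top-level node
    trace.append((ch, a.marked, b.marked, c.marked, top))

for row in trace:
    print(row)
```

The top-level node is marked only after the c, so abc matches the prefix abc but not the full string abcd.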
More Complex Example
As a more complex example, consider the expression ((abc)*|(abcd))(d|e) matching the string abcabcabcd.
Note how the two branches of the first alternative match the first abc in parallel, until it becomes clear that only the left alternative (abc)* can work.
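For readers who want to try this example, here is a self-contained sketch that assembles all the classes from this post in condensed form and builds the expression by hand. The helper chars is hypothetical and only saves some typing. It builds a left-nested Sequence of Char nodes.

```python
class Regex(object):
    def __init__(self, empty):
        self.empty = empty
        self.marked = False

    def reset(self):
        self.marked = False

    def shift(self, c, mark):
        self.marked = self._shift(c, mark)
        return self.marked


class Char(Regex):
    def __init__(self, c):
        Regex.__init__(self, False)
        self.c = c

    def _shift(self, c, mark):
        return mark and c == self.c


class Binary(Regex):
    def __init__(self, left, right, empty):
        Regex.__init__(self, empty)
        self.left = left
        self.right = right

    def reset(self):
        self.left.reset()
        self.right.reset()
        Regex.reset(self)


class Alternative(Binary):
    def __init__(self, left, right):
        Binary.__init__(self, left, right, left.empty or right.empty)

    def _shift(self, c, mark):
        marked_left = self.left.shift(c, mark)
        marked_right = self.right.shift(c, mark)
        return marked_left or marked_right


class Repetition(Regex):
    def __init__(self, re):
        Regex.__init__(self, True)
        self.re = re

    def _shift(self, c, mark):
        return self.re.shift(c, mark or self.marked)

    def reset(self):
        self.re.reset()
        Regex.reset(self)


class Sequence(Binary):
    def __init__(self, left, right):
        Binary.__init__(self, left, right, left.empty and right.empty)

    def _shift(self, c, mark):
        old_marked_left = self.left.marked
        marked_left = self.left.shift(c, mark)
        marked_right = self.right.shift(
            c, old_marked_left or (mark and self.left.empty))
        return (marked_left and self.right.empty) or marked_right


def match(re, s):
    if not s:
        return re.empty
    result = re.shift(s[0], True)
    for c in s[1:]:
        result = re.shift(c, False)
    re.reset()
    return result


def chars(s):
    """Hypothetical helper: a left-nested Sequence of one Char per character."""
    node = Char(s[0])
    for ch in s[1:]:
        node = Sequence(node, Char(ch))
    return node


# ((abc)*|(abcd))(d|e)
regex = Sequence(
    Alternative(Repetition(chars("abc")), chars("abcd")),
    Alternative(Char('d'), Char('e')))

print(match(regex, "abcabcabcd"))
print(match(regex, "abcdd"))
print(match(regex, "abcabcdd"))
```

Because match resets all marks before returning, the same regex object can be reused for several strings.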
The match function above loops over the entire string once, without going back and forth. Each iteration walks the whole tree. Thus the complexity of the algorithm is O(m*n), where m is the size of the regular expression and n is the length of the string.
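The m*n behaviour can be made visible by counting how often shift is called. The sketch below adds a class-level counter Regex.visits to condensed copies of the classes. The counter and the function count_visits are hypothetical instrumentation for this example and are not part of the matcher. The tree for (a|b|c)* has six nodes (three Char, two Alternative, one Repetition).

```python
class Regex(object):
    visits = 0  # hypothetical counter of shift calls, for illustration only

    def __init__(self, empty):
        self.empty = empty
        self.marked = False

    def shift(self, c, mark):
        Regex.visits += 1
        self.marked = self._shift(c, mark)
        return self.marked


class Char(Regex):
    def __init__(self, c):
        Regex.__init__(self, False)
        self.c = c

    def _shift(self, c, mark):
        return mark and c == self.c


class Alternative(Regex):
    def __init__(self, left, right):
        Regex.__init__(self, left.empty or right.empty)
        self.left = left
        self.right = right

    def _shift(self, c, mark):
        marked_left = self.left.shift(c, mark)
        marked_right = self.right.shift(c, mark)
        return marked_left or marked_right


class Repetition(Regex):
    def __init__(self, re):
        Regex.__init__(self, True)
        self.re = re

    def _shift(self, c, mark):
        return self.re.shift(c, mark or self.marked)


def count_visits(s):
    """Shift s through a fresh (a|b|c)* tree and return the number of shift calls."""
    regex = Repetition(Alternative(Alternative(Char('a'), Char('b')), Char('c')))
    Regex.visits = 0
    for i, ch in enumerate(s):
        regex.shift(ch, i == 0)
    return Regex.visits


print(count_visits("abcbac"))          # 6 nodes * 6 characters
print(count_visits("abcbacabcbac"))    # 6 nodes * 12 characters
```

Every node is visited exactly once per character, since each composite node shifts into all of its children unconditionally. Doubling the string length therefore doubles the number of visits.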
Summary & Outlook
So, what have we achieved now? The code shown here can match regular expressions with the desired complexity. It is also not much code. By itself, the Python code shown above is not terribly efficient. In the next post I will show how the JIT generator can be used to make the simple matcher shown above really fast.