Showing posts with label PL research. Show all posts
Showing posts with label PL research. Show all posts

Sunday, February 3, 2008

From Regexps to Regexp Patterns, still functional

Here now is the long overdue follow-up to "RegExp but no state machine". Remember, we were talking about regular expressions: state machines come to mind, and we hurry to the cabinet with the imperative chainsaw, but it turns out that some scalpel-like pure functions can be used to cut the problem as well.

Binding Variables


Quite often, we are not just interested in the boolean answer to the question "did this input match the regexp?". We also want to know the parts that matched parts of the regexp, as in "if the file ends in .txt, put its name without the extension into variable x". In other words, we want to bind variables to parts of the input, depending on how the input matched. Regexps that contain variables deserve to get a beautiful and original name: let us call them regexp patterns.

You want to see an example that is not buried in natural language? We can do that... I will use Scala syntax, but keep in mind that regexp patterns in Scala are
undergoing a redesign - these example are just here to explain regexp patterns.

 
def just_the_name(name:Seq[Char]) = name match {
case Seq(filename @ _*, '.', 't', 'x', 't') => filename
}


Matching regexp patterns is a bit more involved than matching regexps. That is not apparent from the above, tame example, but what about something like this



def is_backup(name:Seq[Char]) = name match {
case Seq(_*, bak @ (() | ('.','b','a','k'))) => !bak.isEmpty
}


There are two ways that the input name can match the regexp patterns, a fact that is cleverly used to return a boolean to the caller. When there are many or-patterns and "wildcard_star" patterns, the number of possibilities can grow very large.

Back to Business


Maybe regexp patterns will be resurrected soon in Scala, for now, let us focus on the problem and its functional solution. We will write a Scala program that matches regular expression patterns.

Here is code to represent regexp patterns. It uses the RegExp classes defined before (a complete listing that you can compile in one go is provided at the end).

abstract class Pat[A]
// v is the variable (represented as Int), w the prefix that was
// already seen, r the remaining part we yet have to match
case class PVar[A](v:Int, word:RegExp.Word[A], re:RegExp[A]) extends Pat[A] {
override def toString = "x"+v+"@"+re.toString
}
case class PPair[A](left:Pat[A], right:Pat[A]) extends Pat[A]
case class PChoice[A](left:Pat[A], right:Pat[A]) extends Pat[A]
case class PatNil[A] extends Pat[A]
case class PatCons[B](hd:B, tl:List[B]) extends Pat[B]


As the comment says, these pattern representation is interesting: it does not only represent regexp patterns, but also partial matches of inputs. When we write down a regexp pattern, the "word" field of PVar will be empty. During the matching, as input becomes available, the different possibilities of going through the input will correspond to different instances of Pat.

Let us try how that would look for the is_backup example. To keep things short, we turn the pattern into x@'A', y@(()|'B')). We match the input A against the pattern, and see how we would go about this. We freely add the empty sequence () where it is appropriate.


Is the first letter of A consumed by x@'A', y@(()|'B'))? Yes, and it is bound to x.
Is () consumed by y@(()|'B'))? Yes, and nothing is bound y and we are done.



Now the same with AB


Is the first letter of AB consumed by x@'A', y@(()|'B'))? Yes, and it is bound to x.
Is the first letter of B consumed by y@(()|'B'))? Yes, and B is bound y.
Is () consumed by ()? yes and we are done.


So we want to step character by character through the input, and
keep track of the bindings by rewriting our patterns -- this is were the partial derivatives come into play. A function pdPat will rewrite a pattern with a partial derivative (the partDeriv function was explained before, in "RegExp but no state machine".)


def pdPat[A](pat:Pat[A],c:A):Pat[A] =
pat match {
case PVar(x, w, r) => PVar(x, w:::List(c), partDeriv(r, c))
case PPair(p1, p2) =>
val pn = PPair(pdPat(p1, c), p2)
if (acceptsEmpty(strip(p1)))
PChoice(pn, PPair(mkEmpPat(p1), pdPat(p2, c)))
else
pn
case PChoice(p1, p2) => PChoice(pdPat(p1, c), pdPat(p2, c))
}


In short, this function takes a regexp pattern and a character. It goes through the regexp pattern and constructs the partial derivative of its underlying regexp. When it needs to keep track of a binding (case PVar), it does so by appending the character at the end. It uses the functions mkEmpPat (which means matching against empty input and makes the regexps disappear or fail in a regexp pattern, but keeps the bindings that were computed before) and strip, which just gives us the underlying regexp and ignores all variable binding stuff that might be going on.

Exercise: write a function that computes the result of accepts_empty(strip(p)) directly, without constructing a regexp first.


The Full Monty


Here now, without further ado, the full example (including stuff from the first part). As always, it might be instructive to run to try out with your own examples, execute step-wise in your IDE debugger (Eclipse and Netbeans), or, the classic, insert lots of println statements.

Note that different possibilities of feeding an input character to a choice are tracked by return a list of environments. The "longest match" and the "shortest match" policies can be recovered by selecting a special element of the list of environments that is the end result of running the patMatch function.


object RegExpPat {

// from part 1

type Word[A] = List[A]

def matchRegExp[A](re:RegExp[A], w:Word[A]): Boolean =
w match {
case List() => acceptsEmpty(re)
case c::w1 => matchRegExp(partDeriv(re, c), w1)
}

def acceptsEmpty[A](re:RegExp[A]): Boolean =
re match {
case Phi() => false
case Empty() => true
case Choice(r1,r2) => acceptsEmpty(r1) || acceptsEmpty(r2)
case Seq(r1,r2) => acceptsEmpty(r1) && acceptsEmpty(r2)
case Star(r) => true
case L(_) => false
}

def partDeriv[A](re:RegExp[A], c:A): RegExp[A] = {
(re, c) match {
case (Phi(), _) => Phi()
case (Empty(), _) => Phi()
case (L(c1), c2) if (c1 == c2) => Empty()
case (L(c1), _) => Phi()
case (Choice(r1,r2), c) => Choice(partDeriv(r1,c),partDeriv(r2,c))
case (Seq(r1,r2), c) =>
val rn = Seq(partDeriv(r1,c),r2)
if (acceptsEmpty(r1)) Choice(rn, partDeriv(r2,c)) else rn
case (rs @ Star(r), c) => Seq(partDeriv(r,c),rs)
}
}

abstract class RegExp[A]

case class Phi[A]() extends RegExp[A]
case class Empty[A]() extends RegExp[A]
case class L[A](letter:A) extends RegExp[A]
case class Choice[A](left:RegExp[A], right:RegExp[A]) extends RegExp[A]
case class Seq[A](left:RegExp[A], right:RegExp[A]) extends RegExp[A]
case class Star[A](exp:RegExp[A]) extends RegExp[A]

// part 2

abstract class Pat[A]
// vx is the variable (represented as Int), w the prefix that was
// already seen, r the remaining part we yet have to match
case class PVar[A](v:Int, word:Word[A], re:RegExp[A]) extends Pat[A] {
override def toString = "x"+v+"@"+re.toString
}
case class PPair[A](left:Pat[A], right:Pat[A]) extends Pat[A]
case class PChoice[A](left:Pat[A], right:Pat[A]) extends Pat[A]
case class PatNil[A] extends Pat[A]
case class PatCons[B](hd:B, tl:List[B]) extends Pat[B]

// combinator
def pat_var[A](n: Int, r:RegExp[A]) = PVar(n,List(),r)

// binding
type Env[a] = List[(Int,Word[a])]

def patMatch[A](pat:Pat[A], word:Word[A]):List[Env[A]] = {
(pat,word) match {
case (p, c::w) =>
patMatch(pdPat(pat, c), w)
case (PVar(x,w,r),List()) =>
if (acceptsEmpty(r)) List(List((x,w))) else List()
case (PChoice(p1,p2), List()) =>
patMatch(p1, List()):::patMatch(p2, List())
case (PPair(p1,p2), List()) =>
for(xs <- patMatch(p1, List()); ys <- patMatch(p2, List())) yield xs:::ys
}
}

def longPatMatch[A](pat:Pat[A], word:Word[A]):Option[Env[A]] =
patMatch(pat,word) match {
case env::_ => Some(env)
case _ => None
}

def shortPatMatch[A](pat:Pat[A], word:Word[A]):Option[Env[A]] =
patMatch(pat,word) match {
case List() => None
case xs => Some(xs.last)
}

def pdPat[A](pat:Pat[A],c:A):Pat[A] =
pat match {
case PVar(x, w, r) => PVar(x, w:::List(c), partDeriv(r, c))
case PPair(p1, p2) =>
val pn = PPair(pdPat(p1, c), p2)
if (acceptsEmpty(strip(p1)))
PChoice(pn, PPair(mkEmpPat(p1), pdPat(p2, c)))
else
pn
case PChoice(p1, p2) => PChoice(pdPat(p1, c), pdPat(p2, c))
}

def strip[A](pat:Pat[A]):RegExp[A] =
pat match {
case PVar(_,w,r) => r
case PPair(p1,p2) => Seq(strip(p1),strip(p2))
case PChoice(p1,p2) => Choice (strip(p1),strip(p2))
}

def mkEmpPat[A](pat:Pat[A]):Pat[A] =
pat match {
case PVar(x,w,r) =>
if (acceptsEmpty(r))
PVar(x, w, Empty[A]())
else
PVar(x, w, Phi[A]())
case PPair(p1,p2) => PPair(mkEmpPat(p1), mkEmpPat(p2))
case PChoice(p1,p2) => PChoice (mkEmpPat(p1), mkEmpPat(p2))
}

// example for testing
val the_choice = Choice(Empty(), L('B'))

val ex = PPair(PVar(1, List(), Star(L('A'))), PVar(2, List(), the_choice))

def main(args:Array[String]) {
println(longPatMatch(ex, List('A')))
println(longPatMatch(ex, List('A','B')))
}
}


Oh, did I mention that there is no mutable state involved?

Credit: The functional implementation described here follows Kenny Zhuo Ming Lu and Martin Sulzmann. A toast to them! Several people, including myself, found similar solutions, but they are the heroic ones that implemented and proved it all.

Friday, December 7, 2007

Regexp but no state machine

Let me tell you a story on pure functional programming and the Scala language.

Functional programming led to amazingly beautiful things like the
Curry-Howard correspondence, MapReduce and the lift web framework.

I want to show you a simple program that matches regular expressions (regexps), which demonstrates rather nicely what makes functional programming so appealing. You probably know what regexps are and that they can be turned to state machines and so on...

Let's forget about that for a moment. Let us start from scratch.

Well, not entirely from scratch. Speaking the words of theory, we say, a *word* w is either the empty word, a single *letter* c, or the concatenation of two words w1,w2. A regular expression then describes a set of words that all have a particular structure in common.

In order to write down regexps in Scala, we need to define the building blocks. And since we would like to keep the door open to matching other things than strings of characters, these building blocks should be generic. Here is the code.


abstract class RegExp[A]

case class Phi[A]() extends RegExp[A]
case class Empty[A]() extends RegExp[A]
case class L[A](letter:A) extends RegExp[A]
case class Choice[A](left:RegExp[A], right:RegExp[A]) extends RegExp[A]
case class Seq[A](left:RegExp[A], right:RegExp[A]) extends RegExp[A]
case class Star[A](exp:RegExp[A]) extends RegExp[A]


We can now directly write things like Choice(L('a'),Star(L('b'))). Here is what these expressions mean:

Phi matches nothing
Empty matches the empty word
L(c) matches the single letter c
Seq(r1,r2) match all words w1,w2 where w1 matches r1 and w2 matches r2
Choice(r1,r2) matches all words w that match either r1 or r2 (or both)
Star(r) matches all words w1,...,wn where each of the wi matches r.

Now how can we write code that follows these definitions? Our first try might be a recursive function accepts that takes a regexp and a word and returns a boolean. Whenever a regexp contains another regexp, we could then call accept recursively, with suitable new input. Most rules seems easy, like for Choice we can try to match the input against r1 and if that doesn't work out, we would try r2. However, for Seq and Star we would have to cut a word in pieces and match the pieces against the regexps, possibly backtracking and trying out other ways to cut the word in pieces. That is quite inefficient.

It turns out there is a better way: A word can obviously only match if each of its letters were matched by some part of the regular expression. So why don't we take each letter, and try to "consume" it with some part of the regular expression. If we can play this game until no letters are left, this means the word is accepted. The key to making this work is to represent the "remaining" expression. But hey, don't we have everything there to represent expressions? We could just build the remaining expression as we go.

Here is the full program. Check out the partDeriv function, which computes the 'derivative' of a regular expression with respect to an input letter.

object Main {
abstract class RegExp[A]
case class Phi[A]() extends RegExp[A]
case class Empty[A]() extends RegExp[A]
case class L[A](letter:A) extends RegExp[A]
case class Choice[A](left:RegExp[A], right:RegExp[A]) extends RegExp[A]
case class Seq[A](left:RegExp[A], right:RegExp[A]) extends RegExp[A]
case class Star[A](exp:RegExp[A]) extends RegExp[A]

type Word[A] = List[A]

def matchRegExp[A](re:RegExp[A], w:Word[A]): Boolean =
w match {
case List() => acceptsEmpty(re)
case c::w1 => matchRegExp(partDeriv(re, c), w1)
}

def acceptsEmpty[A](re:RegExp[A]): Boolean =
re match {
case Phi() => false
case Empty() => true
case Choice(r1,r2) => acceptsEmpty(r1) || acceptsEmpty(r2)
case Seq(r1,r2) => acceptsEmpty(r1) && acceptsEmpty(r2)
case Star(r) => true
case L(_) => false
}

def partDeriv[A](re:RegExp[A], c:A): RegExp[A] =
(re, c) match {
case (Phi(), _) => Phi()
case (Empty(), _) => Phi()
case (L(c1), c2) if (c1 == c2) => Empty()
case (L(c1), _) => Phi()
case (Choice(r1,r2), c) => Choice(partDeriv(r1,c),partDeriv(r2,c))
case (Seq(r1,r2), c) =>
val rn = Seq(partDeriv(r1,c),r2)
if (acceptsEmpty(r1)) Choice(rn, partDeriv(r2,c)) else rn
case (rs @ Star(r), c) => Seq(partDeriv(r,c),rs)
}

def main(args:Array[String]): Unit = {
println matchRegExp(Choice(L('a'),Star(L('b'))), List('b','b','b'))
}
}


What do you know? Pattern matching and recursive calls. Not a single imperative update. It doesn't even use anonymous functions. The program looks remarkably similar in Haskell.

I could now go and try to explain why the partDeriv function does what it does, but you wouldn't learn anything from it. There are two things that are helpful to understand why this works: one is calculating what the function does "by hand", using sample inputs (feel free to insert debug print statements and run it if you are too lazy for that). And, if that is not enough, read the theorem and proof on the 1964 paper by Janus A. Brzozowski which is called "Derivatives of Regular Expressions" JACM 11:4 pp 481-494.

What it comes down to is that some computational problems have a structure to them that permits elegant solutions. David Pollak has summed it up nicely when he said about Scala "what it does is, it reduces code to its essence". Not every problem will have an elegant purely functional solution, but conciseness and clarity have their place in every software engineering project. Now enjoy finding the essence of your code.

Thursday, December 6, 2007

Scala for PL geeks^W researchers

Scala makes for a great language to experiment with programming language design. In my upcoming posts, I'll address applications such as:

  • parsing I'll explain how to get your parser up and running in no time, without resorting to external tools. Scala comes with an easy to use library of parser combinators, which means there's no extra tool to integrate in your build process, no weird generated code to debug, and no new language to learn. Well, that last part is a bit of a fib, as the library of parser combinators essentially implements a domain-specific language in Scala itself!
  • variable binding and substitution Getting substitution right is trickier than most people think. We'll look at a library that encapsulates all this, and which integrates nicely with Scala's parser combinators to express variable scoping right in your language's grammar.
  • type-checking We'll see how you can "port" the inference rules that define your type system from paper to Scala, with most of the boilerplate hidden underneath the covers of a type-checking monad.
My colleagues and I are happily using these libraries, and I hope to convince you to do the same (once I get around to documenting them and preparing them for release ;-) -- RSN!).