Jon Riehl’s Log

Thursday, January 28, 2010

December 2009 Mini-sprint Report

Tyler Green and I held a mini-sprint on Mython on December 17th, 2009. We worked on the following:

  • The trampoline parsing framework.
  • A regular expression quotation function.
  • A Cheetah quotation function.

I’m pleased to start moving the trampoline parsing framework out into Basil (see the basil.parsing.trampoline module, available in the Basil repository). I have been batting around the idea of using Python’s generators to implement recursive-descent parsers for a few years now, starting with a proof-of-concept demonstration in Fall of 2008. Finally, several issues with the existing MyFront front-end have forced me to roll something into Basil.

At its core, the framework is quite simple. It uses a trampoline to dispatch to a set of functions that return generators. When an LL(1) state machine (such as those generated by pgen) would push a nonterminal symbol, or a recursive-descent parsing function would call another parsing function, the generator yields the name of the nonterminal symbol. When the LL(1) state machine would pop, or a recursive-descent parsing function would return, the generator simply returns, which raises a StopIteration exception in Python. The top-level trampoline code maintains a stack of generators, pushing and dispatching to a new generator when a generator yields a nonterminal symbol, and popping when a generator raises StopIteration. This method keeps recursive-descent parsers from running into Python’s relatively shallow call stack bounds, and affords a form of syntactic extensibility by virtue of having a per-nonterminal dispatch table.
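The dispatch loop described above can be sketched in a few lines of Python 3. This is only an illustration of the technique, not the code in basil.parsing.trampoline: the Trampoline class, the shared state dictionary, the helper functions, and the toy grammar are all assumptions made for the example.

```python
def tokenize(text):
    """Split a string such as "1+(2+3)" into single-character tokens."""
    return [ch for ch in text if not ch.isspace()]


def peek(state):
    """Return the current token, or None at the end of input."""
    tokens, pos = state["tokens"], state["pos"]
    return tokens[pos] if pos < len(tokens) else None


def expect(state, predicate):
    """Consume the current token if it satisfies predicate, else fail."""
    token = peek(state)
    if token is None or not predicate(token):
        raise SyntaxError("unexpected token %r at position %d"
                          % (token, state["pos"]))
    state["pos"] += 1
    return token


def expr(state):
    # expr: atom ('+' atom)*
    state["trace"].append("expr")
    yield "atom"                      # "call" the atom parsing function
    while peek(state) == "+":
        expect(state, lambda t: t == "+")
        yield "atom"


def atom(state):
    # atom: DIGIT | '(' expr ')'
    state["trace"].append("atom")
    if peek(state) == "(":
        expect(state, lambda t: t == "(")
        yield "expr"                  # "call" the expr parsing function
        expect(state, lambda t: t == ")")
    else:
        expect(state, str.isdigit)


class Trampoline:
    """Drive generator-based parsing functions with an explicit stack."""

    def __init__(self, handlers):
        # Per-nonterminal dispatch table: name -> generator function.
        self.handlers = handlers

    def parse(self, start, tokens):
        state = {"tokens": tokens, "pos": 0, "trace": []}
        stack = [self.handlers[start](state)]
        while stack:
            try:
                nonterminal = next(stack[-1])   # resume the top generator
            except StopIteration:
                stack.pop()                     # the function "returned"
            else:
                # The function "called" a nonterminal: push its generator.
                stack.append(self.handlers[nonterminal](state))
        if state["pos"] != len(tokens):
            raise SyntaxError("unexpected token at position %d" % state["pos"])
        return state["trace"]


HANDLERS = {"expr": expr, "atom": atom}
trace = Trampoline(HANDLERS).parse("expr", tokenize("1+(2+3)"))
```

Because the nesting lives in an ordinary list rather than on the interpreter's call stack, input nested thousands of parentheses deep parses without hitting the recursion limit. Replacing an entry in HANDLERS swaps in a different parsing function for that nonterminal, which is the kind of extensibility the dispatch table affords.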

Interested parties should expect to see more about this particular module and its application in a new Mython front-end. The unit test for the trampoline module (see basil.parsing.tests.test_trampoline for the code) demonstrates how to use the framework, defining a recursive-descent parser for a simple calculator. At the time of writing, I’m still in the process of integrating the Mython-specific pieces of the front-end and handling some corner cases that I was previously ignoring.

While at OOPSLA 2009, Martin Hirzel noted that one relatively easy demonstration for Mython might involve embedding a regular expression language. While we might want to have something more powerful in the future, Python’s regular expression sub-language embeds easily into Mython. Tyler and I followed a similar strategy to the LLVM assembly embedding I did in November 2009.

Prior to the sprint I looked at our options for storing compiled regular expressions in a module. Unfortunately, the only clear option for serializing and deserializing regular expression state machines uses Python’s pickle module, which involves re-compilation of the regular expression. The result should be comparable to the LLVM assembly embedding: we gain static checks, and can drop extra backslashes. We save neither space in the bytecode nor run time in the compiled program.
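The re-compilation is easy to observe directly. The following is a small sketch, assuming CPython 3 (the sprint itself targeted Python 2) and a made-up pattern; it checks that the pickle names the re module's internal _compile function rather than storing the compiled state machine.

```python
import pickle
import re

# A pattern that matches a single backslash.
pattern = re.compile(r"\\")

# Protocol 0 is the text-based format.
payload = pickle.dumps(pattern, protocol=0)

# The payload records a call to re._compile with the pattern source and
# flags, so loading it compiles the regular expression again.
references_compile = b"_compile" in payload

restored = pickle.loads(payload)
same_source = restored.pattern == pattern.pattern
still_matches = restored.search("a\\b") is not None
```

Loading the payload therefore costs about the same as calling re.compile() on the original string, which is why the embedding cannot save run time.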

The quotation function is now in basil.lang.regex, with a corresponding unit test in basil.lang.tests.test_regex. Since the quotation function uses the built-in re module, we can import the quotation function at compile time, and not include it at run time. Here’s what we arrived at:

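import pickle
import re

def requote(name, src, env):
    # Compile at quotation time so that malformed patterns fail early.
    reobj = re.compile(src.strip())
    recode = pickle.dumps(reobj)
    # Generate run-time code that rebuilds the object from its pickle.
    recode1 = ("import pickle\n" +
               "%s = pickle.loads(%r)\n" % (name, recode))
    # Hand the generated source to the front end stored in the environment.
    ast, env = env["myfrontend"](recode1, env)
    return ast.body, env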

The requote() function compiles the regular expression source into a regular expression object, then serializes the object using the pickle module. Finally, the function generates run-time code that deserializes the pickle string. The following shows the Mython portion of the first (and currently only) regular expression test (from “test_regex01.my”):

#! /usr/bin/env mython
quote [myfront]:
    from basil.lang.regex import requote

quote [requote] myre0:
    you only need two \\ to match one backslash

The Mython test code binds a compiled regular expression to the myre0 identifier in the bytecode module. If we disassemble the module code object in the .pyc we see the following (reformatted a little):

...
  2          12 LOAD_NAME                0 (pickle)
             15 LOAD_ATTR                1 (loads)
             18 LOAD_CONST               2 (                \
"cre\n_compile\np0\n(S'you only need two \\\\\\\\ to match" \
" one backslash'\np1\nI0\ntp2\nRp3\n.")
             21 CALL_FUNCTION            1
             24 STORE_NAME               2 (myre0)
...

Those who are adept at reading Python pickle strings (as I was at some point in my career) can see that the regular expression pickle simply calls re._compile(), the function underlying the re.compile() call that the quotation function made. I would argue that at least we gained additional static checks, but hopefully developers are unit testing their embedded regular expression strings before using them in production. In that case the static checks only pay off once a code base holds so many regular expressions that nobody is sure they have all been checked. The test does at least save us two backslashes in the demo code (though the backslashes are then double escaped in the pickle string). I hope readers will speak up if I am missing a more efficient compile-time representation trick, such as binary pickles.
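The whole path, from quotation function to a bound run-time name, can be tried in plain Python 3 without Mython. This is a sketch under stated assumptions: standin_frontend() is a hypothetical replacement for the front end normally found under env["myfrontend"], it parses ordinary Python rather than Mython, and the local variable is renamed to tree so it does not shadow the ast module.

```python
import ast
import pickle
import re


def requote(name, src, env):
    # Same steps as the quotation function shown earlier.
    reobj = re.compile(src.strip())
    recode = pickle.dumps(reobj)
    recode1 = ("import pickle\n" +
               "%s = pickle.loads(%r)\n" % (name, recode))
    tree, env = env["myfrontend"](recode1, env)
    return tree.body, env


def standin_frontend(source, env):
    """Hypothetical stand-in for the Mython front end: parse plain Python."""
    return ast.parse(source), env


env = {"myfrontend": standin_frontend}
body, env = requote("myre0", r"two \\ to match one backslash", env)

# Wrap the returned statements in a module and execute them, which is
# roughly what happens when the compiled module is imported.
module = ast.Module(body=body, type_ignores=[])
namespace = {}
exec(compile(module, "<requote>", "exec"), namespace)
myre0 = namespace["myre0"]

# The static check: a malformed pattern fails inside requote(), before
# any run-time code is generated.
try:
    requote("bad", "(unclosed", env)
    static_check_fired = False
except re.error:
    static_check_fired = True
```

The generated body contains two statements, the import and the assignment, matching the shape of the disassembly above.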

Further building on the trick that I showed in the November 2009 article, Tyler and I started looking at embedding formatting strings in Mython. For example, we might want to rewrite the requote() function in Mython as shown below:

def requote(name, src, env):
    re_obj = re.compile(src.strip())
    re_pickle_str = repr(pickle.dumps(re_obj))
    quote [mython_template] out_ast:
        import pickle
        $name = pickle.loads($re_pickle_str)
    return out_ast, env

This example uses a hypothetical quotation function, mython_template(), to generate and compile Mython code to an abstract syntax tree (AST). This quotation function combines the string formatting and parsing (quotation) steps of requote(). Once compiled, the quotation function should expand back to something similar to the original requote() function.
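To make the combination of formatting and parsing concrete, here is a rough Python 3 approximation of what such a template might expand to. The python_template() helper is hypothetical, uses the standard library's string.Template for the $-placeholders, and parses plain Python instead of Mython.

```python
import ast
import pickle
import re
import string


def python_template(template_src, **bindings):
    """Hypothetical helper: fill in $-placeholders, then parse the result."""
    source = string.Template(template_src).substitute(bindings)
    return ast.parse(source).body


def requote(name, src, env):
    re_obj = re.compile(src.strip())
    re_pickle_str = repr(pickle.dumps(re_obj))
    # Formatting and parsing happen in a single step.
    out_ast = python_template(
        "import pickle\n"
        "$name = pickle.loads($re_pickle_str)\n",
        name=name, re_pickle_str=re_pickle_str)
    return out_ast, env


statements, _ = requote("myre0", "a+b", {})
```

A real mython_template() would do this work at compile time, so that only the expanded code remains in the module.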

On our way to something like mython_template(), it occurred to us that Cheetah is an expressive formatting language that would be easy to embed in Mython. The result is two new quotation functions, cheetah() and echeetah(), in the basil.lang.cheetah module. The cheetah() function takes the embedded string and uses it to create a constructor function (a curried call to the class constructor) for building a Cheetah Template object. The second function, echeetah(), builds a Cheetah Template instance, using the run-time environment to satisfy the namespace arguments in the constructor. An example of using these quotation functions appears in the basil.lang.tests.test_cheetah module, which in turn loads, compiles, and runs the test_cheetah01.my Mython module.
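The difference between the two functions is the difference between a curried constructor and an immediate construction. The sketch below illustrates only that distinction: StandInTemplate is a hypothetical class built on string.Template, not Cheetah's Template API, and the names are invented for the example.

```python
import functools
import string


class StandInTemplate:
    """Hypothetical stand-in for a template class; not Cheetah's real API."""

    def __init__(self, source, namespace):
        self.source = source
        self.namespace = namespace

    def render(self):
        return string.Template(self.source).substitute(self.namespace)


SOURCE = "Hello, $who!"

# cheetah()-style: bind only the template source now and leave the
# namespace to be supplied by the caller later.
make_greeting = functools.partial(StandInTemplate, SOURCE)
later = make_greeting({"who": "Basil"}).render()

# echeetah()-style: build the instance right away, taking the namespace
# from the environment available at that point.
runtime_env = {"who": "Mython"}
now = StandInTemplate(SOURCE, runtime_env).render()
```

The curried form suits templates that are rendered several times with different namespaces, while the eager form suits a one-off expansion in the current scope.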

This work continues to build examples of quotation functions. I have been working on getting a unit test suite set up for regression testing purposes, and an initial version is available. I am looking forward to hardening the Mython implementation using these tests, which will certainly be a goal of future mini-sprints and of the upcoming sprint at PyCon 2010.

posted by jriehl at 10:37 am  
