Jon Riehl’s Log

Thursday, January 28, 2010

December 2009 Mini-sprint Report

Tyler Green and I held a mini-sprint on Mython on December 17th, 2009. We worked on the following:

  • The trampoline parsing framework.
  • A regular expression quotation function.
  • A Cheetah quotation function.

I’m pleased to start moving the trampoline parsing framework out into Basil (see the basil.parsing.trampoline module, available in the Basil repository). I have been batting around the idea of using Python’s generators to implement recursive-descent parsers for a few years now, starting with a proof-of-concept demonstration in the fall of 2008. Finally, several issues with the existing MyFront front-end have forced me to roll something into Basil.

At its core, the framework is quite simple. The framework uses a trampoline to dispatch to a set of functions that return generators.  When an LL(1) state machine (such as those generated by pgen) would push a nonterminal symbol, or a recursive-descent parsing function would call another parsing function, the generator yields the name of the nonterminal symbol. When the LL(1) state machine would pop, or a recursive-descent parsing function would return, the generator simply returns, which raises a StopIteration exception in Python. The top-level trampoline code maintains a stack of generators, pushing and dispatching to a new generator when a generator yields a nonterminal symbol, and popping when a generator raises StopIteration. This method keeps recursive-descent parsers from running into Python’s relatively shallow call-stack bounds, and affords a form of syntactic extensibility by virtue of having a per-nonterminal dispatch table.
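The idea is small enough to sketch in plain Python. The following is only an illustrative sketch, not the actual basil.parsing.trampoline API (the names and dispatch-table shape here are made up): parsing functions are generators that yield nonterminal names instead of calling each other, and the trampoline manages the stack explicitly.

```python
# Minimal trampoline sketch (illustrative; not the basil.parsing.trampoline
# API).  Parsing functions are generators that yield nonterminal names
# instead of calling other parsing functions directly.

def parse_expr(tokens):
    # expr := term (('+' | '-') term)*
    result = yield "term"          # "call" the term parser via the trampoline
    while tokens and tokens[0] in "+-":
        op = tokens.pop(0)
        rhs = yield "term"
        result = (op, result, rhs)
    return result                  # returning raises StopIteration(result)

def parse_term(tokens):
    # term := DIGIT
    return int(tokens.pop(0))
    yield                          # unreachable; marks this as a generator

def trampoline(dispatch, start, tokens):
    # An explicit stack of generators replaces the Python call stack.
    stack = [dispatch[start](tokens)]
    retval = None
    while stack:
        try:
            # Resume the top parser, feeding it the last sub-result.
            nonterminal = stack[-1].send(retval)
            retval = None
            stack.append(dispatch[nonterminal](tokens))   # "push"
        except StopIteration as e:
            stack.pop()                                   # "pop"
            retval = e.value
    return retval
```

Running trampoline({"expr": parse_expr, "term": parse_term}, "expr", list("1+2+3")) builds the left-associated tree ('+', ('+', 1, 2), 3), with the nesting depth tracked on the trampoline's own stack rather than in nested Python calls.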

Interested parties should expect to see more about this particular module and its application in a new Mython front-end. The unit test for the trampoline module (see basil.parsing.tests.test_trampoline for the code) demonstrates how to use the framework, defining a recursive-descent parser for a simple calculator. At the time of writing, I’m still in the process of integrating the Mython-specific pieces of the front-end and handling some corner cases that I was previously ignoring.

While at OOPSLA 2009, Martin Hirzel noted that one relatively easy demonstration for Mython might involve embedding a regular expression language. While we might want to have something more powerful in the future, Python’s regular expression sub-language embeds easily into Mython. Tyler and I followed a similar strategy to the LLVM assembly embedding I did in November 2009.

Prior to the sprint I looked at our options for storing compiled regular expressions in a module. Unfortunately, the only clear option for serializing and deserializing regular expression state machines uses Python’s pickle module, which involves re-compilation of the regular expression. The result should be comparable to the LLVM assembly embedding: we gain static checks, and can drop extra backslashes. We don’t really save any space in bytecode, nor compiled program run time.
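That behavior is easy to check from a plain Python prompt; this sketch, using only the standard pickle and re modules, shows that the pickle round-trips the pattern by invoking the regular expression compiler again on load:

```python
# Check what pickling a compiled pattern buys us: the pickle stores the
# pattern source and flags, and loading it re-runs the regex compiler.
import pickle
import re

reobj = re.compile(r"\d+")
data = pickle.dumps(reobj)       # serializes (pattern, flags), not the machine
restored = pickle.loads(data)    # re-invokes the regex compiler under the hood

assert restored.pattern == reobj.pattern
assert restored.match("42") is not None
```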

The quotation function is now in basil.lang.regex, with a corresponding unit test in basil.lang.tests.test_regex.  Since the quotation function uses the built-in re module, we can import the quotation function at compile time, and not include it at run time. Here’s what we arrived at:

import pickle
import re

def requote(name, src, env):
    # Compile the embedded source now, at compile time.
    reobj = re.compile(src.strip())
    # Serialize the compiled pattern object.
    recode = pickle.dumps(reobj)
    # Generate run-time code that rebuilds the pattern from the pickle.
    recode1 = ("import pickle\n" +
               "%s = pickle.loads(%r)\n" % (name, recode))
    ast, env = env["myfrontend"](recode1, env)
    return ast.body, env

The requote() function compiles the regular expression source into a regular expression object, then serializes the object using the pickle module. Finally, the function generates run-time code that deserializes the pickle string.  The following shows the Mython portion of the first (and currently only) regular expression test:

#! /usr/bin/env mython
quote [myfront]:
    from basil.lang.regex import requote

quote [requote] myre0:
    you only need two \\ to match one backslash

The Mython test code binds a compiled regular expression to the myre0 identifier in the bytecode module. If we disassemble the module code object in the .pyc we see the following (reformatted a little):

  2          12 LOAD_NAME                0 (pickle)
             15 LOAD_ATTR                1 (loads)
             18 LOAD_CONST               2 (                              \
                 "cre\n_compile\np0\n(S'you only need two \\\\\\\\ to match" \
                 " one backslash'\np1\nI0\ntp2\nRp3\n.")
             21 CALL_FUNCTION            1
             24 STORE_NAME               2 (myre0)
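A similar listing can be reproduced without digging the code object out of a .pyc, by compiling equivalent generated source and handing it to the standard dis module. The payload below is a freshly made stand-in pickle, not the one from the Mython test, and the exact opcode names vary across Python versions:

```python
# Reproduce a similar disassembly by compiling generated source directly.
# The pickle payload is a stand-in; exact opcodes vary by Python version.
import dis
import io
import pickle
import re

recode1 = ("import pickle\n"
           "myre0 = pickle.loads(%r)\n" % pickle.dumps(re.compile("x")))
buf = io.StringIO()
dis.dis(compile(recode1, "<mython>", "exec"), file=buf)
listing = buf.getvalue()
```

The listing ends with the same shape as above: a constant pickle string pushed, pickle.loads called, and the result bound with STORE_NAME (myre0).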

For those who are adept at reading Python pickle strings (I was at some point in my career), we can see that the regular expression pickle simply calls the same re.compile() function that the quotation function called. I would argue that we at least gained additional static checks, but hopefully developers are unit testing their embedded regular expression strings before using them in production; in that case, static checks don’t pay off until there are so many regular expressions in the code base that nobody is sure they’ve all been checked. The test does at least save us two backslashes in the demo code (though the backslashes are then doubly escaped in the pickle string). I hope readers will speak up if I am missing a more efficient compile-time representation trick, such as binary pickles.
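On the binary-pickle question, the comparison is easy to make with the standard pickle module; this sketch just measures the two encodings of the demo pattern:

```python
# Compare the text pickle (protocol 0) against a binary protocol for the
# demo pattern.  Binary protocols avoid the backslash escaping entirely.
import pickle
import re

reobj = re.compile("you only need two \\\\ to match one backslash")
text_len = len(pickle.dumps(reobj, protocol=0))
binary_len = len(pickle.dumps(reobj, protocol=2))
assert binary_len < text_len   # the binary form is more compact here
```

The gap grows with the number of backslashes, since protocol 0 escapes each one while the binary protocols store the pattern bytes raw.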

Further building on the trick that I showed in the November 2009 article, Tyler and I started looking at embedding formatting strings in Mython. For example, we might want to rewrite the requote() function in Mython as shown below:

def requote (name, src, env):
    re_obj = re.compile(src.strip())
    re_pickle_str = repr(pickle.dumps(re_obj))
    quote [mython_template] out_ast:
        import pickle
        $name = pickle.loads($re_pickle_str)
    return out_ast, env

This example uses a hypothetical quotation function, mython_template(), to generate and compile Mython code to an abstract syntax tree (AST). This quotation function combines the string formatting and parsing (quotation) steps of requote(). Once compiled, the quotation function should expand back to something similar to the original requote() function.

On our way to something like mython_template(), it occurred to us that Cheetah is an expressive formatting language that would be easy to embed in Mython. The result is two new quotation functions, cheetah() and echeetah(), in the basil.lang.cheetah module. The cheetah() function takes the embedded string and uses it to create a constructor function (a curried call to the class constructor) for building a Cheetah Template object. The second function, echeetah(), builds a Cheetah Template instance, using the run-time environment to satisfy the namespace arguments in the constructor. An example of using these quotation functions appears in the basil.lang.tests.test_cheetah module, which in turn loads, compiles, and runs the Mython module.
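The curried-constructor shape is independent of Cheetah itself; the sketch below uses the standard library’s string.Template as a stand-in for Cheetah’s Template class (the names cheetah_like and echeetah_like are made up for illustration, and the real basil.lang.cheetah code differs):

```python
# Sketch of the curried-constructor idea behind cheetah() and echeetah(),
# with string.Template standing in for Cheetah's Template class.
import functools
from string import Template

def cheetah_like(name, src, env):
    # Curry the constructor over the embedded source; callers build the
    # template object later.
    env[name] = functools.partial(Template, src.strip())
    return env

def echeetah_like(name, src, env):
    # Build a renderer immediately, with the run-time environment standing
    # in for the constructor's namespace arguments.
    tmpl = Template(src.strip())
    env[name] = functools.partial(tmpl.substitute, env)
    return env
```

For instance, after cheetah_like("greeting", "Hello, $who!", env), the environment holds a constructor, and env["greeting"]().substitute(who="Mython") renders the template on demand.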

This work continues to build up examples of quotation functions. I have been working on getting a unit test suite set up for regression-testing purposes, and something is now available. I am looking forward to hardening the Mython implementation using these tests, which will certainly be a goal of future mini-sprints, and of the upcoming sprint at PyCon 2010.

posted by jriehl at 10:37 am  

Wednesday, January 13, 2010

OOPSLA 2009 and Extensible Languages

I hate to have sat on this essay as long as I have, since it has been over two months since OOPSLA 2009 finished up.  I hope that later is better than never.  After rereading this post, I also see one piece of information missing: the part where I drop a little sizzle for myself.  I presented at DLS 2009, and it went well [Riehl09].  In that talk I argued for language extensibility and showed how the extensibility mechanism in Mython affords easier language embedding and optimization.  How funny that I and other extensible-language implementers should stumble into Liskov’s warning a few days later…

I have always considered myself fortunate when I get to hear ACM Turing Award winners speak, and this year was no exception.  I was happy to see Barbara Liskov reprise her 2009 ACM Turing Award lecture at OOPSLA 2009.  Her talk was interesting for several reasons.  I found it exciting to listen to her discussion of the history of abstract data types, which were fundamental when I was first learning to program.  I also liked Liskov’s review of CLU; it is always humbling to see how little modern systems languages have progressed in the last three or four decades.  I particularly liked her pointers into the literature, and her injunctions against reinvention of the wheel.

As a researcher, I was particularly interested in her slide and comments about extensible languages: she didn’t think they were a good idea.  If I remember correctly, Liskov stated that extensible languages were researched in the 1960s and 1970s, abandoned as a bad idea, and that recent efforts to revive them will encounter the same barriers.  Before I address this issue further, I’d like to discuss the two papers she cited for extensible languages:

  • R. M. Balzer. “Dataless programming.”  FJCC 1967 [Balzer67].
  • Stephen A. Schuman and Philippe Jorrand.  “Definition mechanisms in extensible programming languages.”  FJCC 1970 [SJ70].

The paper by Balzer does not seem pertinent to my notions of language extensibility.  Balzer’s dataless programming language “feels” like a special form of aspect-oriented programming, where users define data layout in a special segment of the program specification.  Algorithms access data through a generic interface for decomposing and iterating through data structures.  I would note this citation makes more sense in the context of Liskov’s 1996 history of CLU [Liskov96], where the dataless programming language’s mechanism may have been characteristic of an approach used in other extensible languages:

Extensible languages contained a weak notion of data abstraction.  This work arose from a notion of “uniform referents” [Balzer 1967; Earley 1971; Ross 1970].

The paper by Schuman and Jorrand is more recognizably about language extensibility.  They present an extensible language, describing facilities for modifying both the semantic and syntactic properties of their language.  They focus on semantic extensibility “in the domain of values”, providing basic mechanisms for user defined data structures.  Their extension types include support for type casts and name indexed unary operators.  They do not go into other forms of semantic extensibility, mentioning they are possible, but outside of the scope of this particular paper.

Schuman and Jorrand’s paper presents a form of concrete syntax macros for syntactic extension.  They build on top of syntax macros as originally described by Leavenworth [Leavenworth66], but add additional functionality.  Schuman and Jorrand’s macro specifications consist of three things: a production, an optional predicate, and a replacement.  Their approach feels like a parser description language, where the production is in BNF style, extending the existing grammar of the extensible language, and the replacement is similar to a parsing action.  Using a production makes their approach able to use a single macro definition instead of having separate forms for expression and statement macros.  Predicates also allow their macros to do general purpose compile-time metaprogramming, binding values in the compile-time environment used by the replacement clause.

I want to quickly give some credit to Barbara’s point about reinvention of the wheel.  I see similarities between Schuman and Jorrand’s macros and the “patterns” proposed in a new language, π, presented by Roman Knöll in the Onward! track this year [KM09].  I think both of these mechanisms present issues of composability and modularity.  Schuman and Jorrand do spend some time wringing their hands over this issue, especially since efficiency of compilation was a larger issue when they developed their system.  Neither of these papers gives me an idea of why these ideas weren’t developed further and put into the languages I learned as a teenager.  I can imagine an extensible Turbo Pascal where we were invited to learn more about object-oriented programming by implementing our own object system.  I can’t trivially find a paper titled “Extensibility Considered Harmful” that exposes the fatal flaws in extensibility and explains why I wasn’t taught to extend my language.  The impression I got from Liskov was that these systems led to a “Tower of Babel” problem, making user programs unreadable, and therefore too difficult to reuse and maintain.

It isn’t surprising, therefore, that a member of the PLT, Shriram Krishnamurthi, raised this issue during the question period of Liskov’s presentation.  The Lisp and Scheme communities have lived with syntax macros for a long time, and they seem to get by just fine with them.  If I understand correctly, these communities see “language towers” develop, and they fall in and out of use.  It’s unfortunate that I wasn’t introduced to the Little Lisper until I was about to graduate from college, and even now I have to admit I don’t have my head fully wrapped around the macro system in MzScheme.

I think extensibility is a desirable property, and we can tame the issues it raises.  Users can use lexical scoping to delimit syntactic extensions, making it clear where they are using custom syntax (see “Mood-specific Languages” in [WP07]).  We also have several new approaches to dealing with ambiguity in parsing concrete syntax macros.  In the limit, our machines are much bigger than they were in the 1970s, and the cubic space and time of parsing an arbitrary context-free grammar simply isn’t the barrier it once was.  I don’t see how one can readily prove extensibility is a desirable or undesirable property (though I plan to show that one technique I’m developing for Mython saves a great deal of coding and porting effort).  The fact that previous extensibility mechanisms were not widely adopted does not prove extensibility is undesirable.  At worst, this outcome only shows that we haven’t found the right mechanism or pedagogical tools.  Therefore, I remain interested in lowering the barriers to the evolution of languages, and seeing what happens.

posted by jriehl at 10:26 am  
