Jon Riehl’s Log

Wednesday, March 4, 2009

Source-to-source Translation Using Camlp5

In this post, I’d like to illustrate how we can use Camlp5 to build a source-to-source translator using concrete syntax. Recall (from Jones et al.) that a translator involves three things: a source language, a target language, and an implementation language. Our example will use a domain-specific language as its source language, Ocaml as its target language, and Ocaml with the Camlp5 extensions as its implementation language. We’d like to specifically highlight how Camlp5 allows us to use parts of the concrete syntax of the source and target languages.

Using Camlp5

Note that Camlp5 syntax has some differences with Ocaml syntax. These are documented in the Camlp5 manual, and I’m still unfamiliar enough with the two grammars that I can’t necessarily identify which is which.

It should suffice to say that my examples will be in the implementation language, Camlp5. I’d encourage the reader to play along using the Camlp5 read-eval-print loop (REPL), which is run on top of the Ocaml REPL as follows:

$ ocaml -I +camlp5 camlp5r.cma

More seasoned Camlp5 users might note that I fibbed a little about there being only two grammars. Camlp5 has an additional syntax that is accessed by substituting “camlp5o.cma” for “camlp5r.cma”. I’ll defer to the Camlp5 user manual for further explaination of these two front-ends. All my examples will use the “restricted” syntax implemented in “camlp5r.cma”.

Implementing the Source Language

We begin implementing our translator by settling on a domain-specific language for our source language. Who doesn’t love infix-notation integer calculators? Not just the Camlp5 user manual, I assure you (see the info files for Bison, for example). Why should we break tradition, and overlook this classic example?

We can define a quick abstract syntax for calculator expressions as follows:

type expr =
    [ Op of string and expr and expr
    | Int of int
    | Var of string ];

This data type forms an intermediate target language for the parser.

Camlp5 has lots of parsing goodies, but the parts I like the best are the inline syntax definitions and the extensible parsers. Before we can uses these features, we’ll need to tell Camlp5 to extend its own front-end. We accomplish this using the “#load” directive. Following that, we can define a grammar and entry point:

#load "pa_extend.cmo";

value gram = Grammar.gcreate (Plexer.gmake ());
value exp = Grammar.Entry.create gram "expression";

EXTEND
  exp:
  [ "SUM"
    [ x = exp; "+"; y = exp -> Op "+" x y
    | x = exp; "-"; y = exp -> Op "-" x y]
  | "PROD"
    [ x = exp; "*"; y = exp -> Op "*" x y
    | x = exp; "/"; y = exp -> Op "/" x y]
  | "ATOM"
    [ x = INT -> Int (int_of_string x)
    | x = LIDENT -> Var x
    | "("; x = exp; ")" -> x ] ]
  ;
END;

Here we have defined an extensible parser with the sole non-terminal, exp. We have stratified the productions for exp into three levels: SUM for addition and subtraction, PROD for multiplication and division, and ATOM for constants, variables, and explicit grouping. The order these production are defined in is important. The ordering establishes precedence without having to declare intermediate non-terminals (like “term” or “factor”). The actual string labels for each grouping are optional, but I use them here for future experimentation with extending our little language.

Individual productions are organized similar to patterns, with a left-hand side and right-hand side, separated by an arrow, “->”. The left-hand side is a matching and binding pattern, with non-terminals appearing as labels, and binding forms using an identifier and equal sign. Note that we can interleave lexical information as strings in the pattern. When the left-hand side matches, the system evaluates the right-hand side expression in an environment extended with symbols bound during matching.

Camlp5 expands the contents of the EXTEND section into a set of statements that side effect the exp entry. We can witness this by running Camlp5 as a pre-processor, asking it to print the expansion in a simplified Camlp5 syntax (assuming we have the above example code defined in a file, “CalcBase.ml”):

$ camlp5r pr_r.cmo CalcBase.ml

Once we have a grammar, we can construct a parser:

value parse s = Grammar.Entry.parse exp (Stream.of_string s);

…and now that we have a parser, we can parse:

# parse "2 + 3 * foo";
- : expr = Op "+" (Int 2) (Op "*" (Int 3) (Var "foo"))
# parse "42 / 3 - bar";
- : expr = Op "-" (Op "/" (Int 42) (Int 3)) (Var "bar")
# parse "(9 + 2)^2";
- : expr = Op "+" (Int 9) (Int 2)

The last example given above has a syntax error (using the caret operator, “^”). Our function parses as much of the input string as it can, and returns the last legitimate result. We can force an exception by doing something like passing it the empty string, or an empty pair of parenthesis. In order to force the whole string to be in the concrete syntax, we would have to add some sort of delimiter or cue to the grammar. Section 20.9 of the Camlp5 user manual gives an example of adding an end-of-string token to the lexer and using this to ensure the whole string is in the formal language.

I’ve purposely omitted a lot of details for the sake of brevity, particularly details on lexical analysis and extensible scanners. For more information on writing extensible parsers, I refer readers to Section 8 of the Camlp5 user manual.

Implementing an Evaluator

Implementing an evaluator for our domain-specific language is straightforward (for people who’ve already seen this done a thousand times, at least). All we need to do is build a function that walks and evaluates the abstract syntax data structure, along with some minimal infrastructure:

value lookup = fun id -> if id = "thingy" then 1000 else 0;

exception EvalFailure of string;

value rec eval e = match e with
  [ Op "+" x y -> (eval x) + (eval y)
  | Op "-" x y -> (eval x) - (eval y)
  | Op "*" x y -> (eval x) * (eval y)
  | Op "/" x y -> (eval x) / (eval y)
  | Op opstr _ _ -> raise (EvalFailure ("Unknown operator: '" ^
                                        opstr ^ "'"))
  | Var idstr -> lookup idstr
  | Int n -> n];

We use a fixed look-up function for variables, which is appropriate since there is no syntax for assigning values to variables. We also raise an exception in the event that an unrecognized operator string is present in an Op constructor. If we are using the parse function from the previous section, we should never see an exception raised. Nothing in our example stops someone from defining a new parser with unrecognized operators, much less constructing them “by hand”. This possibility reflects a design choice that permits introducing new binary operators without changing the abstract syntax. I intend to demonstrate using this flexibility in later experiments where we’ll extend the source language.

We can compose the evaluator function with the parser to get an evaluator on strings in the source language:

value evalstr s = eval (parse s);

…and we have a calculator:

# value tests = ["2 + 3 * foo"; "42/3 - bar";
                 "(9 + baz) * (9 + baz)"];
value tests : list string =
  ["2 + 3 * foo"; "42/3 - bar"; "(9 + baz) * (9 + baz)"]
# List.map evalstr tests;
- : list int = [2; 14; 81]

An Aside: Staging and Printing in the Target Language

A language as simple as the calculator language implemented above does not really require the same kind of infrastructure as a more sophisticated intermediate language. In more complicated systems, we often have a set of transformers for the intermediate language (optimizers being one example of this). These transformers are simply functions that take an expression in the intermediate language and return an equivalent expression in the intermediate language. In a system with numerous transformers, we find it useful to see what the input and output terms are, so we may want to define a pretty-printer for our language.

Camlp5 has a pretty-printing infrastructure available for building functions that translate from an intermediate language back to a human readable string. I am going to ignore that infrastructure here, and instead try to use one of the pretty printers that come with Camlp5. One of these pretty-printers takes a Camlp5 syntax tree and translates it to a string.

In order to use the Camlp5 pretty-printer, I am going to use staging. Staging involves “raising the evaluation level” of code. We raise the evaluation level by translating abstract syntax into abstract syntax that constructs the original abstract syntax. With a staging function, we can translate our domain-specific abstract syntax (of type expr) into Camlp5 abstract-syntax, and then feed that into the pretty printer. The funny thing that makes this (more or less) work is that the code to construct abstract syntax is often identical to how we’d pretty-print it anyway (the exception to this rule of thumb occurs when we demand pretty-printing in the concrete or “surface” language).

We continue with this staging idea by using Camlp5’s quotation facilities. A quote is a means of saying “this bit of code should construct an intermediate representation of itself”. If we want to use quotation in Camlp5, we need to extend the language with quotation syntax. Here is an example of Camlp5 quotation:

#load "q_MLast.cmo";

value demoexp = let loc = Ploc.dummy in <:expr< 3 + 4 >>;

From the REPL this should ultimately display the following:

value demoexp : MLast.expr =
  MLast.ExApp
    (MLast.ExApp  (MLast.ExLid  "+")
       (MLast.ExInt  "3" ""))
    (MLast.ExInt  "4" "")

Great, so what if we wanted to get a string back out from our quoted expression? We first need to load a pretty printer. Loading a pretty-printer module registers the module’s pretty-printer with Camlp5. We can reference this pretty printer in the Pcaml structure. The following example builds a printing function that translates from Camlp5 expression abstract syntax to strings:

# #load "pr_r.cmo";
# value print e = Eprinter.apply Pcaml.pr_expr Pprintf.empty_pc e;
value print : MLast.expr -> string = <fun>
# print demoexp;
- : string = "3 + 4"

From the very same REPL we can now define a staging function that, modulo quotation syntax, looks like an identity transformer for our abstract syntax:

value rec stage (e : expr) : MLast.expr =
  let loc = Ploc.dummy in match e with
    [ Op ops lt rt -> <:expr< Op $str:ops$ $stage lt$ $stage rt$ >>
    | Int n -> <:expr< Int $int:string_of_int n$ >>
    | Var vid -> <:expr< Var $str:vid$ >> ];

Again, the angle brackets mark parts of code that should not be evaluated, but rather construct Camlp5 abstract syntax. The parts surrounded by dollar signs are anti-quotations. These are portions of code that should be evaluated, mostly for the purpose of assisting in the construction of abstract syntax. For example, in the case matching operators, we quote the application of the operator constructor, but then anti-quote to make a recursive call to stage. In the recursive call, stage builds abstract syntax for the subexpressions, contained within the operation constructor.

Okay, so we can print Camlp5 abstract syntax, and we can stage calculator abstract syntax into Camlp5 abstract syntax. Like the previous section, all that’s left to do is composing these two functions:

value printcalc e = print (stage e);

Now we can test the printcalc function on our test strings, demonstrating the amazing capability of taking some code and printing a string that is equivalent under some evaluation and staging regime:

# value testasts = List.map parse tests;
value testasts : list expr =
  [Op "+" (Int 2) (Op "*" (Int 3) (Var "foo"));
   Op "-" (Op "/" (Int 42) (Int 3)) (Var "bar");
   Op "*" (Op "+" (Int 9) (Var "baz")) (Op "+" (Int 9) (Var "baz"))]
# List.iter (fun ast -> print_endline (printcalc ast)) testasts;
Op "+" (Int 2) (Op "*" (Int 3) (Var "foo"))
Op "-" (Op "/" (Int 42) (Int 3)) (Var "bar")
Op "*" (Op "+" (Int 9) (Var "baz")) (Op "+" (Int 9) (Var "baz"))
- : unit = ()

Like other bits of Camlp5, the details of defining locations (the loc value that I keep let-binding) are outside the scope of this discussion. These locations are important for error reporting. See Section 20.5 in the Camlp5 manual for an example of using locations to improve error messages.

Implementing Custom Quotation

I didn’t go into the details of quotation above, but hopefully I’ve sparked some curiosity about it. Quotation in Camlp5 provides a form of what I call parametric quotation. In the case of Camlp5, there is an identifier parameter to the quotation syntax. We saw an example of identifier arguments to quotation in the previous section. Specifically, the expr identifier, embedded in the quotation brackets, “<:expr<“, told the Camlp5 pre-processor that we were quoting a Camlp5 expression. Please note that the expr identifier used in quotation does not reference the data type we built for the calculator abstract syntax!

The identifier argument to the quotation syntax references what is called a quotation expander in Camlp5 parlance. A quotation expander can take two forms. In this post, we only use one of these quotation expander constructions. We construct a quotation expander from a pair of functions from strings to Camlp5 abstract syntax. One function is responsible for returning Camlp5 abstract syntax for an expression, of type MLast.expr. The other function is responsible for returning abstract syntax for a pattern, of type MLast.patt.

One detail of quotation not previously mentioned was that Camlp5 quotations can do more than just construct abstraction syntax. Quotations can also appear as patterns, where they are used to match against abstract syntax. Camlp5 provides a quotation expander for building patterns using concrete syntax, patt. For example, we can do the following to build abstract syntax for pattern matching a zero constant in our calculator:

# let loc = Ploc.dummy in <:patt< Int 0 >>;
- : MLast.patt =
MLast.PaApp <abstr> (MLast.PaUid <abstr> "Int")
    (MLast.PaInt <abstr> "0" "")

We already have a parser for our calculator, and we can define the expression expander function by composing the parser with our staging function. To define a pattern expander requires us to define another parser that basically copies the concrete syntax for calculator expressions. One primary difference between these two parsers, however, is that if we want our pattern matcher to bind any variables, we need to add additional support for anti-quotation.

The code below copies the calculator expression grammar, but adds an anti-quotation reduction at the atom stratum. This example illustrates how anti-quotation is a property of the quotation expander, not the containing Camlp5 syntax. Some location book-keeping is handled by a new non-terminal, pat_antiquot, and modulo changing the anti-quotation “operator” to a percent sign, the following is adapted from the example in Section 20.9 in the Camlp5 manual:

value pat = Grammar.Entry.create gram "expression";

EXTEND
  GLOBAL: pat;

  pat:
  [ "SUMPAT"
      [ x = pat; "+"; y = pat -> <:patt< Op "+" $x$ $y$ >>
      | x = pat; "-"; y = pat -> <:patt< Op "-" $x$ $y$ >> ]
  | "PRODPAT"
      [ x = pat; "*"; y = pat -> <:patt< Op "*" $x$ $y$ >>
      | x = pat; "/"; y = pat -> <:patt< Op "/" $x$ $y$ >> ]
  | "ATOMPAT"
      [ x = INT -> <:patt< Int $str:x$ >>
      | x = LIDENT -> <:patt< Var $str:x$ >>
      | "%"; r = pat_antiquot -> r
      | "("; x = pat; ")" -> x ] ]
  ;

  pat_antiquot:
  [ [ i = LIDENT ->
        let r =
          let loc = Ploc.make_unlined (0, String.length i) in
          <:patt< $lid:i$ >>
        in
        <:patt< $anti:r$ >> ] ]
  ;

END;

Since the calculator language is a source language, we do not need to add anti-quotation to the expression expander. If we wanted to use the calculator language as a target language, or build rewriting functions (might include such useful things as a step-wise evaluator or constant folding optimizer), we would find adding anti-quotation to the calculator expression non-terminal much more helpful.

Given the new entry point for creating patterns from the calculator concrete syntax, we can now construct an expander function for both entry points and register the full expander with Camlp5:

value expand_expr s = stage (parse s);
value expand_patt s = Grammar.Entry.parse pat (Stream.of_string s);
Quotation.add "calc" (Quotation.ExAst (expand_expr, expand_patt));
Quotation.default.val := "calc";

The last statement makes our quotation expander the default quotation expander. We can now quote into the concrete syntax of our calculator without having to give an identifier argument:

# << 3  + xyzzy >>;
- : expr = Op "+" (Int 3) (Var "xyzzy")

In this example and at any point following registration of the expander in the REPL, we are able to quote into our custom concrete syntax. If we wanted to quote into our new language in a source file, we would have to define the quotation expander in a separate compilation unit and load the compiled module using the #load directive. At present, I’m not sure how much of a limitation separate compilation poses to the application of parametric quotation. While Mython currently doesn’t have all the nifty extensible parser infrastructure, it offers the ability to use a quotation expander as soon as it is defined.

Using Quotation in an Evaluator

Now that we have a full quotation expander for the calculator language, we can use quotation to build an evaluator that matches against concrete syntax as opposed to abstract syntax:

value rec eval2 e = match e with
  [ << %x + %y >> -> (eval2 x) + (eval2 y)
  | << %x - %y >> -> (eval2 x) - (eval2 y)
  | << %x * %y >> -> (eval2 x) * (eval2 y)
  | << %x / %y >> -> (eval2 x) / (eval2 y)
  | Var idstr -> lookup idstr
  | Int n -> n
  | _ -> raise (EvalFailure ("Unhandled abstract syntax :" ^
                             (printcalc e))) ];

The new evaluator should work identically to our previous evaluator:

# List.map eval2 testasts;
- : list int = [2; 14; 81]

I propose that using concrete syntax simplifies matching and rewriting code, making it easier to read and maintain. On the other hand, I worry over several counter-arguments. If we were to count characters, there is little net savings in using the concrete syntax over the abstract syntax. We reduce character count by no longer having to write out the constructor name, but we gain characters in the arguments to the various quotation and anti-quotation forms. I’m able to save a little more in the above example by making the calculator language the default expander. Using quotation also requires the writers and maintainers keep a map from concrete to abstract syntax in their heads. I suspect most people would find it easier to make sure a match is exhaustive by consulting the constructors listed in the algebraic data-type definition.

If we were to look at domain-specific optimizations, I think these weaknesses would not be as severe. As far as brevity is concerned, I would expect to be matching more complicated source language expressions, with drastically reduced readability if we were to write them out using the abstract syntax. I would also expect matches to be non-exhaustive by construction. These domain-specific optimizations would rewrite source terms in specific cases only, ignoring input terms otherwise.

From Evaluator to Translator

Finally, we arrive at a source-to-source translator by copying the previous evaluator, and then quoting the right-hand side of our pattern matches:

exception TranslationFailure of string;

value rec tr e = let loc = Ploc.dummy in match e with
  [ << %x + %y >> -> <:expr< $tr x$ + $tr y$ >>
  | << %x - %y >> -> <:expr< $tr x$ - $tr y$ >>
  | << %x * %y >> -> <:expr< $tr x$ * $tr y$ >>
  | << %x / %y >> -> <:expr< $tr x$ / $tr y$ >>
  | Var idstr -> <:expr< lookup $str:idstr$ >>
  | Int n -> <:expr< $int:string_of_int n$ >>
  | _ -> raise (TranslationFailure ("Unhandled abstract syntax :" ^
                                    printcalc e)) ];

From the REPL we can test the translator, expecting to see Camlp5 code that would evaluate to the same thing as if we ran our evaluator on it:

# value tr_print e = print_endline (print (tr e));
value tr_print : expr -> unit = <fun>
# List.iter tr_print testasts;
2 + 3 * lookup "foo"
42 / 3 - lookup "bar"
(9 + lookup "baz") * (9 + lookup "baz")
- : unit = ()

Again, I worry over the complexity of additional quotation code. Conversely, it feels very natural to be able to copy the evaluator from the previous section, and simply quote the right-hand side of the match subexpressions. Doing this transformation automatically almost seems in our reach. We’d just quote the whole evaluator, then transform it by staging the right-hand expressions, anti-quoting recursive calls, and other fiddly bits (it looks like some form of type-mapping might be required to handle some of the anti-quotation parameters).

Unfortunately, Camlp5 doesn’t seem to be able to handle the step of quoting the evaluator:

# let loc = Ploc.dummy in <:expr< << 3 + 4 >> >>;
                                  ^^^^^^^^^^^
While expanding quotation "expr":
Parse error: illegal begin of expr_eoi

Of course, I’m rapidly wandering outside the limits of my understanding of Camlp5, and a more thorough investigation into nested quotation might show I’m not doing something properly.

Conclusions

This post shows how to use Camlp5 to write a source-to-source translator. Our translator uses concrete syntax to both match expressions in the source language and construct expressions in the target language. When we compose our translator with the Ocaml compiler, we leave the realm of interpretation and start building compilers. I think such flexibility reflects a future where our domain-specific languages can both be very high-level, but also fast.

I think this post points towards “metaprogramming nirvana”. I have demonstrated how to code using bits of multiple language by using a sophisticated language infrastructure. We used a form of parametric quotation to easily switch into a source language we defined and back into the implementation language. We did the same with the target language, which was given to us as part of the implementation language.

Hopefully, I’ve set the stage for further investigations into metaprogramming nirvana while giving you something that is both concrete and digestible.

posted by jriehl at 3:54 pm  

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress