Jon Riehl’s Log

Wednesday, November 4, 2009

Embedding LLVM Assembly in Mython

Today we’re going to look at how we can use Mython and llvm-py to embed LLVM assembly code into a Mython module.  For those not familiar with Mython, I wouldn’t worry too much; what we are doing should neither look nor work too differently from the following bit of code (which requires Python 2.5, LLVM, and llvm-py to work, by the way):

import StringIO, llvm, llvm.core, llvm.ee
llvm_asm = """
@msg = internal constant [15 x i8] c"Hello, world.\\0A\\00"

declare i32 @puts(i8 *)

define i32 @not_really_main() {
    %cst = getelementptr [15 x i8]* @msg, i32 0, i32 0
    call i32 @puts(i8 * %cst)
    ret i32 0
}
"""
llvm_module = llvm.core.Module.from_assembly(
                  StringIO.StringIO(llvm_asm))
mp = llvm.core.ModuleProvider.new(llvm_module)
ee = llvm.ee.ExecutionEngine.new(mp)
not_really_main = llvm_module.get_function_named(
                      'not_really_main')
ee.run_function(not_really_main, [])

This code first defines an LLVM module in a Python string, then builds an LLVM module from the embedded code, and finally uses a JIT to link and run a function from the embedded module.  I would note two things about this little demo.  One, while the multiline string allows un-escaped quote characters, developers must still take care to escape backslashes.  Failing to do this causes the LLVM assembler to reject the string literal.  Two, the user of this code pays for the assembly of the LLVM code each time it is run.  Both of these are relatively minor problems, but they illustrate why a developer might prefer Mython over embedding another language as strings in a Python file.  Later, we shall develop these arguments in more depth.
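To see why the backslashes matter, here is a small sketch in plain Python 3 (the post itself targets Python 2.5, and the variable names are mine).  It compares the escaped literal with an unescaped one, and with a raw string, which is one more way to avoid doubling backslashes in ordinary Python.

```python
# Escaped form, as in the listing above: the assembler sees \0A and \00.
escaped = 'c"Hello, world.\\0A\\00"'

# Unescaped form: Python itself turns \0 into a NUL character, so the
# backslashes never reach the assembler.
unescaped = 'c"Hello, world.\0A\00"'

# A raw string keeps the backslashes without doubling them.
raw = r'c"Hello, world.\0A\00"'

print(escaped == raw)       # do the two spellings denote the same text?
print("\\" in unescaped)    # does any backslash survive when unescaped?
print(repr(unescaped))
```

Raw strings help with backslashes, but the embedded text is still an opaque string as far as the Python compiler is concerned.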

This post demonstrates how we can take the infrastructure in llvm-py and use it to embed LLVM source.  We show how to assemble the embedded LLVM source into LLVM bitcode at compile time.  We’ll then stash the bitcode for consumption by the LLVM JIT compiler and linker at run time.  This approach saves us from the bitcode compilation time, and ideally saves some space in the Python bytecode.  More importantly, this approach ensures that errors in the embedded source are detected at compile time, not run time.

Preliminaries

If you’re not terribly familiar with Python, LLVM, and llvm-py, I’d recommend reading at least the Python tutorial, the LLVM assembly tutorial, and the llvm-py user guide.  The llvm-py user guide should, in turn, point you at a specific test case for using the LLVM JIT, which the above code follows except it builds the module from assembly source.  The llvm-py documentation builds the module using wrapper objects for the LLVM intermediate representation (IR).

At the time of writing, I used LLVM 2.5 (via MacPorts), and built llvm-py from the Google Code repository.  Originally, I tried the llvm-py port, but the llvm-py 0.5 tarball it uses doesn’t build against LLVM 2.5.  I encountered this problem with llvm-py 0.5 again on Cygwin, this time doing a manual build and install of LLVM 2.5 from a source tarball.  On Cygwin I was also able to build and install the llvm-py Subversion head, but the example code for this post does not work there (it can’t dynamically resolve puts()).

Mython introduces a special form of quotation into the Python language.  The idea is that you can embed raw strings in your source code, and these strings are interpreted into Python code at compile time.  Quotation blocks look something like this:

quote [quotefn] name:
    ...

The quotefn() is ideally a function that takes a name, a string, and a dictionary, and returns a 2-tuple containing a list of Python abstract syntax trees (ASTs, specifically, statement nodes), and a dictionary.  Instead of giving a quick demonstration of how to define and use a quotation function in Mython, let’s go ahead and demonstrate these by embedding LLVM assembly.  I will explain the Mython code as we go along.
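For readers who want to see the shape of that contract before the LLVM example, here is a rough sketch in plain Python 3 using the standard ast module.  The function quote_string() and the driver lines after it are hypothetical stand-ins of mine, not part of Mython or MyFront; they only mimic the (name, source, env) in, (statement list, env) out convention described above.

```python
import ast

def quote_string(name, source, env):
    """Toy quotation function: bind `name` to the quoted text at run time."""
    # Build run-time source text, then parse it into a Module AST node.
    runtime_src = "%s = %r\n" % (name, source)
    module_ast = ast.parse(runtime_src)
    # Return the statement nodes plus the (unchanged) compile-time environment.
    return module_ast.body, env

# Stand-in for what a compiler would do with the returned statements.
stmts, env = quote_string("greeting", "Hello, world.\n", {})
module = ast.Module(body=stmts, type_ignores=[])
namespace = {}
exec(compile(module, "<quoted>", "exec"), namespace)
print(namespace["greeting"])
```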

I recommend you grab a copy of MyFront (which is part of the Basil language framework), and the test1.my source file from the Google Code repository (see availability, below).  The following discussion essentially gives the source code for test1.my, but lists it out of order.

Interfacing llvm-py and Mython

So now that I’ve discussed the preliminaries, let’s just go ahead and start defining the driver we’ll use to test the compile-time wrapper for the LLVM assembler.  Let’s assume that we already have a “quotation” function for LLVM assembly.

quote [llvm_as] llvm_module:
 @msg = internal constant [15 x i8] c"Hello, world.\0A\00"
 declare i32 @puts(i8 *)
 define i32 @not_really_main() {
     %cst = getelementptr [15 x i8]* @msg, i32 0, i32 0
     call i32 @puts(i8 * %cst)
     ret i32 0
 }

Our job consists of defining llvm_as() to be a quotation function that translates this quotation block into something like the following:

llvm_module = llvm.core.Module.from_bitcode(
                  StringIO.StringIO("..."))

At run time, the above constructs an LLVM module from the elided bitcode in the string literal (the "...").  We therefore need to define a compile-time function that does the following:

  • Takes the embedded source code and assembles it into an LLVM module.
  • Translates the LLVM module into a string literal containing LLVM bitcode.
  • Compiles a Python abstract syntax tree that will reconstruct the LLVM module from the embedded bitcode.
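If you do not have llvm-py at hand, the same three steps can be sketched with standard-library stand-ins in Python 3.  In this analogy (mine, not the post’s) compile() plays the role of the assembler, marshal plays the role of the bitcode writer and reader, and embed_compiled() is a hypothetical helper name.

```python
import ast
import marshal

def embed_compiled(name, source):
    """Standard-library stand-in for the three steps; this is not llvm-py."""
    # Step 1: "assemble" the embedded source (here: compile Python source).
    code_obj = compile(source, "<embedded>", "exec")
    # Step 2: serialize the result to bytes (stand-in for LLVM bitcode).
    payload = marshal.dumps(code_obj)
    # Step 3: build an AST that rebuilds the object from the embedded bytes.
    runtime_src = "import marshal\n%s = marshal.loads(%r)\n" % (name, payload)
    return ast.parse(runtime_src).body

stmts = embed_compiled("hello_code", "result = 6 * 7\n")

# Pretend to be the run-time side: execute the generated statements, then
# run the reconstructed code object.
namespace = {}
module = ast.Module(body=stmts, type_ignores=[])
exec(compile(module, "<generated>", "exec"), namespace)
scratch = {}
exec(namespace["hello_code"], scratch)
print(scratch["result"])
```

A malformed source string would fail in step 1, before any run-time code is generated, which is the property the rest of this post is after.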

Before we proceed, let us assume that we already have three bound variables, each corresponding to a quotation function parameter: name, source, and env.  The name variable is bound to the string literal "llvm_module".  The source variable contains the string of the LLVM assembly, with the leading indentation white space removed.  The env variable is a dictionary that is supposed to be an explicit replacement of the __globals__ dictionary, originally used by Python to manage its global namespace, but passed by MyFront explicitly as a reminder that it is a compile-time environment, not a run-time environment.  I’m not sure if this “explicit store passing” style actually buys us anything, and this may be dropped from quotation functions in later versions of Mython.

We’ve seen some of the above steps accomplished in the introduction.  We first must build an LLVM module from the LLVM assembly code, which is bound to the source variable:

fobj1 = StringIO.StringIO(source)
llvm_module = llvm.core.Module.from_assembly(fobj1)

We now have the same module we’ll want to use at run time bound at compile time (actually it’s a functionally identical module).  We need to emit the bitcode that we’re going to embed in the run-time code we’ll be generating:

fobj2 = StringIO.StringIO()
llvm_module.to_bitcode(fobj2)

This writes the bitcode into the StringIO file abstraction, where we can retrieve it as a string.  We can now build Python code in another string:

runtime_src = ("%s = llvm.core.Module.from_bitcode("
               "StringIO.StringIO(%r))\n" %
               (name, fobj2.getvalue()))
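The %r conversion is what makes this safe for binary data: it emits a literal with every awkward byte escaped.  A quick sanity check of that idea in Python 3 (where binary data is a bytes object rather than a Python 2 str) might look like the following; the payload is arbitrary bytes I made up, not real bitcode.

```python
import ast

# Arbitrary payload with a NUL, both quote characters, a newline, and a
# backslash; not real bitcode.
payload = b"BC\xc0\xde\x00'\"\n\\"

# Same formatting trick as above, in Python 3 spelling.
runtime_src = "data = %r\n" % (payload,)

# Parse the generated statement and evaluate its right-hand side safely.
value_node = ast.parse(runtime_src).body[0].value
round_tripped = ast.literal_eval(value_node)
print(runtime_src)
print(round_tripped == payload)
```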

Normally, I would expect the next step to be a possibly involved process of walking over some intermediate representation and constructing a Python AST to pass back to the compiler.  In this case, we can avoid having to do this, since all we need to do is embed the LLVM code as a string argument.  To convert the run-time code into an AST, we are going to take advantage of the fact that the compiler reflects its front-end in the env dictionary.  MyFront maps the string "myfrontend" to a function that translates Mython source code and the compile-time environment into a Python AST, and a possibly mutated compile-time environment.  This function allows us to simply take the above string and parse it into a Python AST like so:

runtime_ast, env = env["myfrontend"](runtime_src, env)

The myfrontend() function specifically returns a Module AST node.  In order to get a list of statement AST nodes, we’ll just have to look at the body member of the returned Module object.  The fully wrapped up Mython quotation function looks like this:

quote [myfront]:
    def llvm_as (name, source, env):
        assert name is not None
        fobj1 = StringIO.StringIO(source)
        llvm_module = llvm.core.Module.from_assembly(fobj1)
        fobj2 = StringIO.StringIO()
        llvm_module.to_bitcode(fobj2)
        runtime_src = ("%s = llvm.core.Module.from_bitcode("
                       "StringIO.StringIO(%r))\n" %
                       (name, fobj2.getvalue()))
        runtime_ast, env = env["myfrontend"](runtime_src, env)
        return runtime_ast.body, env

If you are curious about the above quotation block, the myfront() quotation function simply evaluates the embedded code at compile time and in the compile-time environment.  This allows us to define the llvm_as() function at compile time, but then throw it away at run time.

The only thing that is left is to test it:

def main ():
    import llvm.ee
    print llvm_module
    print "_" * 60
    provider = llvm.core.ModuleProvider.new(llvm_module)
    llvm_engine = llvm.ee.ExecutionEngine.new(provider)
    not_really_main = llvm_module.get_function_named(
                          'not_really_main')
    retval = llvm_engine.run_function(not_really_main, [])
    print "_" * 60
    print "Returned", retval.as_int()

if __name__ == "__main__":
    main()

When I run this on my Mac (again, this is all in the test1.my source file), I see the following (the lines that start with “$” show command line inputs):

$ MyFront test1.my
$ python -m test1
@msg = internal constant [15 x i8] c"Hello, world.\0A\00"\
               ; <[15 x i8]*> [#uses=1]

declare i32 @puts(i8*)

define i32 @not_really_main() {
 %cst = getelementptr [15 x i8]* @msg, i32 0, i32 0\
              ; <i8*> [#uses=1]
 %1 = call i32 @puts(i8* %cst)           ; <i32> [#uses=0]
 ret i32 0
}

____________________________________________________________
Hello, world.

____________________________________________________________
Returned 0

I was not able to get LLVM to dynamically link puts() on the Cygwin platform.  The resulting runtime code correctly outputs the module source, but then raises a signal, causing a core dump.  It would be nice if the abort signal were simply thrown as an exception.  I am reminded of the utility of David Beazley’s wrapped application debugger (WAD), or something similar, for catching signals and then translating them to Python exceptions.

Discussion

Now that we have looked at how to embed LLVM assembly in Mython, let’s look more closely at why you might want to use Mython’s approach.  This section looks at three things.  First, it compares code size at the module level.  Second, it gives measurements and discusses differences in run-time performance.  Finally, it demonstrates how both approaches to embedding handle errors in the embedded assembly.

I did not expect the resulting module sizes.  The Python version, test0.py, compiles to a file, test0.pyc, which is 1,252 bytes in size.  The Mython version compiles to test1.pyc, which is 1,335 bytes.  When I use llvm-as on the assembly code alone, I see why: without comments, the assembly code is smaller than the LLVM bitcode file by 79 bytes (213 bytes for the source code, 292 bytes for the bitcode).  I assume that for more complicated input source, the LLVM bitcode will be smaller than the source (one can always skew this by adding comments; the original standalone hello.ll was 538 bytes with white space and comments).

To compare the run-time performance of the naive and Mython embeddings, I created a test harness to measure three scenarios:

  1. The time it takes to construct an LLVM module from assembly source.  This should be representative of the time taken by a naive embedding.
  2. The time it takes to construct an LLVM module from assembly source and then serialize it into LLVM bitcode.  This should reflect the compile-time cost of the Mython embedding.
  3. The time it takes to construct an LLVM module from bitcode. This reflects the run-time code of the Mython embedding.

I implemented this test harness in test2.py, which can be found in the same repository as the other two test modules (see availability, below).  I am seeing the following results output from the test harness (times are in seconds, and reflect the minimum, maximum, and average times over 100 measurements of a function that performs the given test 1,000 times):

$ ./test2.py
Naive embedding summary: min=0.0542359 max=0.0596418 avg=0.0549726
Compile-time summary: min=0.134633 max=0.150107 avg=0.136246
Run-time summary: min=0.069649 max=0.0750451 avg=0.0704705
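I am not reproducing test2.py here, but a harness of this general shape can be sketched with the standard timeit module in Python 3.  The summarize() helper and the stand-in workload below are hypothetical names of mine; a real run would time the llvm-py calls instead, with the defaults of 100 samples of 1,000 calls each.

```python
import timeit

def summarize(label, func, repeat=100, number=1000):
    """Time `func` and report min, max, and average over `repeat` samples."""
    # Each sample is the total time for `number` consecutive calls.
    samples = timeit.repeat(func, repeat=repeat, number=number)
    stats = (min(samples), max(samples), sum(samples) / len(samples))
    print("%s summary: min=%g max=%g avg=%g" % ((label,) + stats))
    return stats

def stand_in_workload():
    # Placeholder; the real harness would build an LLVM module here.
    return sorted(range(50))

# Small counts so the sketch finishes quickly.
lo, hi, avg = summarize("Stand-in", stand_in_workload, repeat=5, number=100)
```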

These results come as a second surprise.  Since the test harness solely runs wrapped LLVM code, it might seem that the LLVM infrastructure handles assembly string inputs slightly faster than bitcode.  After thinking about this for a minute, I believe a more likely explanation is that the bitcode input is larger than the assembly string input.  Using the sizes given above, we can see the bitcode string is about 1.37 times larger than the assembly source.  The module construction time is only about 1.28 times longer.  These relative numbers imply that if I did use more complicated assembly source with equivalent or smaller resulting bitcode, I would see a slight performance increase.  This run-time performance increase would come at a small additional cost at compile time.  On my machine, these numbers imply it would take an additional 13.6 milliseconds per 1000 lines of embedded assembly code (not counting deallocation time).
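The arithmetic behind those ratios is easy to replay.  The sketch below (Python 3) uses only the figures quoted in this post; the one thing it adds is an assumed module length of about ten lines, which is my guess at how the per-1000-lines figure was reached and is not stated anywhere above.

```python
# Figures quoted in the post.
source_bytes, bitcode_bytes = 213, 292
naive_avg, runtime_avg, compile_avg = 0.0549726, 0.0704705, 0.136246
runs_per_sample = 1000

size_ratio = bitcode_bytes / source_bytes    # bitcode size vs. assembly size
time_ratio = runtime_avg / naive_avg         # bitcode load vs. assembly load

# Assumption (not stated in the post): about ten lines per embedded module.
assumed_lines = 10
seconds_per_line = compile_avg / runs_per_sample / assumed_lines
ms_per_1000_lines = seconds_per_line * 1000 * 1000

print("size ratio  %.2f" % size_ratio)
print("time ratio  %.2f" % time_ratio)
print("compile-time ms per 1000 lines  %.1f" % ms_per_1000_lines)
```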

Finally, we look at what happens when there is a syntax error in the embedded code.  In the repository, I copied the test0.py and test1.my files to bad0.py and bad1.my, respectively.  I then removed the leading “@” from the function name in the function definition.  Here is the result of compiling these two modules using the MyFront compiler (note that I’ve hand shortened the file paths using ellipses):

$ rm *.pyc
$ MyFront bad0.py
$ MyFront bad1.my
Error in quote-generated code, from block starting at line 41:
  Traceback (most recent call last):
    File ".../basil/lang/mython/MythonRewriter.py", line 106, in
handle_QuoteDef
    ret_val, env = quotefn(node.name, node.body, env)
    File "bad1.my", line 4, in llvm_as
    File ".../site-packages/llvm/core.py", line 330, in from_assembly
    raise llvm.LLVMException, ret
  LLVMException: expected function name
$ ls *.pyc
bad0.pyc

I have chosen to focus on just using the compiler, so you can clearly see that the naive embedding was quietly compiled into a Python bytecode file.  In this particular case, the LLVM error would be caught at import time:

$ python -m bad0
Traceback (most recent call last):
  File ".../runpy.py", line 95, in run_module
    filename, loader, alter_sys)
  File ".../runpy.py", line 52, in _run_module_code
    mod_name, mod_fname, mod_loader)
  File ".../runpy.py", line 32, in _run_code
    exec code in run_globals
  File ".../sandbox/llvm/bad0.py", line 27, in
    llvm_module = llvm.core.Module.from_assembly(StringIO.StringIO(
llvm_source))
  File ".../site-packages/llvm/core.py", line 330, in from_assembly
    raise llvm.LLVMException, ret
llvm.LLVMException: expected function name

If you were to just compile this file and ship it, you might be condemning users to a nasty surprise.  I know you’d still catch these kinds of bugs by extensive testing, right?  The specific bug I’ve injected would be pretty easy to find, since the exception would occur as soon as you import the module.  If you assembled the LLVM code inside a function, or on some special path, these kinds of bugs become much harder to find.  You would have to be especially careful if the LLVM source was automatically generated.
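The "special path" problem is not specific to LLVM.  As a rough illustration in Python 3, the sketch below uses Python’s own compile() as a stand-in for the LLVM assembler; the names EMBEDDED, rarely_used_path(), and check_at_build_time() are hypothetical.

```python
EMBEDDED = "def f(:\n    pass\n"   # deliberately malformed embedded source

def rarely_used_path():
    # Naive embedding: the embedded text is only checked when this runs.
    return compile(EMBEDDED, "<embedded>", "exec")

def check_at_build_time(source):
    """Return an error message for bad embedded source, or None if it is fine."""
    try:
        compile(source, "<embedded>", "exec")
    except SyntaxError as exc:
        return str(exc)
    return None

# Defining rarely_used_path() raised nothing, so the bug is still latent.
# An eager check, run once at build time, finds it immediately.
build_error = check_at_build_time(EMBEDDED)
print("caught early:", build_error is not None)
```

A quotation function is, in effect, a way of making the eager check the default rather than something you have to remember to run.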

I am slightly embarrassed to note that this kind of experiment can still go horribly wrong in Mython.  Since the current Mython implementation uses the Python tokenize module, it will not detect a DEDENT token if your embedded code has imbalanced brackets, braces, or parentheses.  Feel free to delete the close brace from the embedded LLVM and watch the resulting mess output by MyFront’s recursive descent parser.  I hope to have this problem fixed shortly.
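You can observe the underlying tokenizer behavior without Mython.  The sketch below uses the tokenize module from Python 3 (not the Python 2.5 version MyFront uses, though I would expect the same effect) on a made-up block; count_dedents() is a hypothetical helper.

```python
import io
import tokenize

def count_dedents(source):
    """Count DEDENT tokens seen before the tokenizer stops or fails."""
    count = 0
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.DEDENT:
                count += 1
    except (tokenize.TokenError, SyntaxError):
        # An unclosed bracket surfaces as an error at end of input.
        pass
    return count

balanced = "block:\n    define {\n        ret\n    }\nafter = 1\n"
unbalanced = balanced.replace("    }\n", "")   # drop the close brace

print(count_dedents(balanced), count_dedents(unbalanced))
```

Inside an open bracket the tokenizer stops tracking indentation, so the end of the indented block is never reported when the brace is missing.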

Conclusion

To conclude, I was really hoping to make the following claim:

  • We can embed LLVM bitcode in Python, and this should offer our compiled modules greater speed without sacrificing platform independence.

In this case, I was not able to make this claim.  The idea was that the time necessary to construct a module from a bitcode string should be less than the time necessary to parse an assembly string of equal size and create an LLVM module from it.  This claim might be easier to show for embeddings of native machine code, but that would cost us platform independence.  I would be interested in learning more about the LLVM bitcode format, and determining when it is likely that the bitcode for a module is larger than its source code (our example has a string literal in it, which might play some part in the source and bitcode sizes).

I hope the following claims are easier to accept given this example:

  • Mython makes it possible to embed code from other languages without string escapes.
  • Mython makes it possible to check embedded code at compile time.
  • If you already have a language implementation that can interface with Python, it is very simple (< 10 lines of code) to embed and statically check it in Mython.

I hope you will take the time to play around with building more quotation functions in Mython, and see what you can do with them.  I think quotation functions are a powerful mechanism for metaprogramming, and I hope to continue to provide interesting examples of their utility.

Availability

Instructions for obtaining Mython, and its implementation, the MyFront compiler, are given here: http://code.google.com/p/basil/wiki/GettingStarted

The source code for the Python and Mython demonstration and test modules is in the Basil framework sandbox.  You can get it from Google Code here: http://code.google.com/p/basil/source/browse/trunk/sandbox/llvm/

posted by jriehl at 5:35 pm  
