Saturday, June 14, 2008

Pyrex for performance and obfuscation

I've recently been asked to obfuscate a bunch of Python code. Encryption is one possibility, but the user needs the key along with the encrypted code in order to run the code, so this is really just a roundabout form of obfuscation. And if multi-billion-dollar (and rather unsavory) industries can't get this right, I'd rather not even try.

One novel form of obfuscation is compilation to C code, a task made relatively simple by Pyrex and, more recently, Cython. Both projects are mainly intended to ease the integration of C libraries with Python; both accomplish this by compiling native Python code into a .so shared object. This .so file should, in turn, be slightly harder to decipher than Python bytecode.
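For a sense of how readable plain bytecode is, here is a minimal sketch using the standard library's dis module. It assumes a current Python 3 interpreter rather than the 2008-era Python this post was written against, and the disassemble helper is a name made up for illustration. The exact opcodes printed vary between Python versions.

```python
import dis
import io


def pyfib(i):
    """Naive recursive Fibonacci, same shape as the function used later in this post."""
    if i < 3:
        return 1
    else:
        return pyfib(i - 1) + pyfib(i - 2)


def disassemble(func):
    """Return the human-readable bytecode listing for func as a string.

    Hypothetical helper for illustration; it only wraps dis.dis().
    """
    buf = io.StringIO()
    dis.dis(func, file=buf)
    return buf.getvalue()


if __name__ == "__main__":
    # The listing still shows names and constants, which is why
    # bytecode alone is weak obfuscation.
    print(disassemble(pyfib))
```

The listing keeps function names and constants in the clear, which is the property a compiled .so is hoped to weaken.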

Pyrex isn't as actively maintained as Cython, but it is available via MacPorts, so I'm using Pyrex for now.

Pyrex appears to work by first translating your Python code into C, then compiling this C against the Python libraries. Unannotated Python objects remain PyObject * pointers -- it's quite possible that the Python interpreter, or VM, or whatever lies underneath, is still doing most of the heavy lifting with Pyrex-translated code; I can't make any sense of it.

But Pyrex also allows you to write Python-like code that gets translated to native C, with all the implied performance gains. As a simple example, I've done a naive implementation of the Fibonacci sequence in plain Python and in Pyrex's C/Python intermediary. Here's the file, called "pyrex_fib.pyx":

cdef _cfib( int i ):
    if i < 3:
        return 1
    else:
        return _cfib(i-1) + _cfib(i-2)

def cfib( i ):
    return _cfib( i )

def pyfib( i ):
    if i < 3:
        return 1
    else:
        return pyfib(i-1) + pyfib(i-2)


_cfib and pyfib are the same function, with _cfib implemented in Pyrex C notation; cfib is a wrapper around _cfib. (Native C functions can't be called directly from Python and must be wrapped.)

"pyrex_fib.pyx" is compiled to a Python-friendly .so file via distutils; here are the contents of "setup.py" -- lifted from Michael's Guide to Pyrex:

from distutils.core import setup
from distutils.extension import Extension
from Pyrex.Distutils import build_ext
setup(
    name = "PyrexGuide",
    ext_modules=[
        Extension("pyrex_fib", ["pyrex_fib.pyx"])
    ],
    cmdclass = {'build_ext': build_ext}
)

The compilation is run with python setup.py build_ext --inplace. Note that bugs can result in the rather cryptic error message "error: Pyrex does not appear to be installed on platform 'posix'".


I also wrote a plain Python version, "py_fib.py":

def pyfib( i ):
    if i < 3:
        return 1
    else:
        return pyfib(i-1) + pyfib(i-2)


This enables me to compare Pyrex-translated Python code stored in a .so to the same code stored in a regular Python module, and compare them both to the "native" version.

I do this comparison via a simple module that imports both forms and runs them, timing each invocation. Here's the output:

kieran@host:~/tmp/pyrex$ ./time_fib.py
Sat Jun 14 12:15:07 2008
pyrex_fib.cfib(40) = 102334155 in 9.6s
pyrex_fib.pyfib(40) = 102334155 in 121.5s
py_fib.pyfib(40) = 102334155 in 98.0s
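The timing module itself isn't reproduced in this post. A minimal sketch of what such a script could look like follows, written for a current Python 3; the time_call helper is a hypothetical name, and a local pyfib stands in for the imported modules so the sketch runs without the compiled .so.

```python
import time


def pyfib(i):
    """Naive recursive Fibonacci, standing in for py_fib.pyfib."""
    if i < 3:
        return 1
    return pyfib(i - 1) + pyfib(i - 2)


def time_call(label, func, arg):
    """Call func(arg), print the result with elapsed wall-clock time, and return the line.

    Hypothetical helper; the original script's structure is not known.
    """
    start = time.time()
    result = func(arg)
    elapsed = time.time() - start
    line = "%s(%d) = %d in %.1fs" % (label, arg, result, elapsed)
    print(line)
    return line


if __name__ == "__main__":
    print(time.asctime())
    # A real run would import pyrex_fib and py_fib and time
    # pyrex_fib.cfib, pyrex_fib.pyfib and py_fib.pyfib in turn.
    # A smaller argument than 40 keeps this sketch quick.
    time_call("py_fib.pyfib", pyfib, 25)
```

Timing each call with wall-clock time is crude but adequate here, since the differences being measured are tens of seconds.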

Interestingly, the Pyrex-translated Python code is about 24% slower than the regular Python (121.5s vs. 98.0s); presumably, it's not benefiting from various interpreter optimizations. The "C" implementation blows them both out of the water, running roughly ten times faster than the plain Python version (9.6s vs. 98.0s).

Pyrex looks great for wrapping C libraries for Python, and might serve for code obfuscation, but the major limitation is the difficulty of moving non-scalar data types between C and Python: it wouldn't have been easy to return a Python dictionary from my cfib routine, for example.