Introduction¶PyPy's numpy support has matured enough that it can now support the lapack/blas libraries through the numpy.linalg module. To install the version of numpy this blog post refers to, install PyPy version 2.5.0 or newer, and run this:
pypy -m pip install git+https://bitbucket.org/pypy/numpy.git
This update is a major step forward for PyPy's numpy support. Many of the basic matrix operations depend on linalg, even matplotlib requires it to display legends (a pypy-friendly version of matplotlib 1.3 is available at https://github.com/mattip/matplotlib).
A number of improvements and adaptations, some of which are in the newly-released PyPy 2.5.0, made this possible:
- Support for an extended frompyfunc(), which in the PyPy version supports much of the ufunc API (signatures, multiple dtypes) allowing creation of pure-python, jit-friendly ufuncs. An additional keyword allows choosing between out = func(in) or func(in, out) ufunc signatures. More explanation follows.
- Support for GenericUfuncs via PyPy's (slow) capi-compatibility layer. The underlying mechanism actually calls the internal implementation of frompyfunc().
- A cffi version of _umath_linalg. Since cffi uses dlopen() to call into shared objects, we added support in the numpy build system to create non-python shared libraries from source code in the numpy tree. We also rewrote parts of the c-based _umath_linalg.c.src in python, renamed numpy's umath_linalg capi module to umath_linag_capi, and use it as a shared object through cffi.
Status¶We have not completely implemented all the linalg features. dtype resolution via casting is missing, especially for complex ndarrays, which leads to slight numerical errors where numpy uses a more precise type for intermediate calculations. Other missing features in PyPy's numpy support may have implications for complete linalg support.
Some OSX users have noticed they need to update pip to version 6.0.8 to overcome a regression in pip, and it is not clear if we support all combinations of blas/lapack implementations on all platforms.
Over the next few weeks we will be ironing out these issues.
Performance¶A simple benchmark is shown below, but let's state the obvious: PyPy's JIT and the iterators built into PyPy's ndarray implementation will in most cases be no faster than CPython's numpy. The JIT can help where there is a mixture of python and numpy-array code. We do have plans to implement lazy evaluation and to further optimize PyPy's support for numeric python, but numpy is quite good at what it does.
HowTo for PyPy's extended frompyfunc ¶The magic enabling blas support is a rewrite of the _umath_linalg c-based module as a cffi-python module that creates ufuncs via frompyfunc. We extended the numpy frompyfunc to allow it to function as a replacement for the generic ufunc available in numpy only through the c-api.
We start with the basic frompyfunc, which wraps a python function into a ufunc:
def times2(in0): return in0 * 2 ufunc = frompyfunc(times2, 1, 1)
In cpython's numpy the dtype of the result is always object, which is not implemented (yet) in PyPy, so this example will fail. While the utility of object dtypes can be debated, in the meantime we add a non-numpy-compatible keyword argument dtypes to frompyfunc. If dtype=['match'] the output dtype will match the dtype of the first input ndarray:
ufunc = frompyfunc(times2, 1, 1, dtype=['match']) ai = arange(24).reshape(3, 4, 2) ao = ufunc(ai) assert (ao == ai * 2).all()
I hear you ask "why is the dtypes keyword argument a list?" This is so we can support the Generalized Universal Function API, which allows specifying a number of specialized functions and the input-output dtypes each specialized function accepts.
Note that the function feeds the values of ai one at a time, the function operates on scalar values. To support more complicated ufunc calls, the generalized ufunc API allows defining a signature, which specifies the layout of the ndarray inputs and outputs. So we extended frompyfunc with a signature keyword as well.
We add one further extension to frompyfunc: we allow a Boolean keyword stack_inputs to specify the argument layout of the function itself. If the function is of the form:
out0, out1, ... = func(in0, in1,...)
then stack_inputs is False. If it is True the function is of the form:
func(in0, in1, ... out0, out1, ...)
Here is a complete example of using frompyfunc to create a ufunc, based on this link:
def times2(in_array, out_array): in_flat = in_array.flat out_flat = out_array.flat for i in range(in_array.size): out_flat[i] = in_flat[i] * 2 ufunc = frompyfunc([times2, times2], 1, 1, signature='(i)->(i)', dtypes=[dtype(int), dtype(int), dtype(float), dtype(float), ], stack_inputs=True, ) ai = arange(10, dtype=int) ai2 = ufunc(ai) assert all(ai2 == ai * 2)
Using this extended syntax, we rewrote the lapack calls into the blas functions in pure python, no c needed. Benchmarking this approach actually was much slower than using the upstream umath_linalg module via cpyext, as can be seen in the following benchmarks. This is due to the need to copy c-aligned data into Fortran-aligned format. Our __getitem__ and __setitem__ iterators are not as fast as pointer arithmetic in C. So we next tried a hybrid approach: compile and use numpy's umath_linalg python module as a shared object, and call the optimized specific wrapper function from it.
Benchmarks¶Here are some benchmarks, running a tight loop of the different versions of linalg.inv(a), where a is a 10x10 double ndarray. The benchmark ran on an i7 processor running ubuntu 14.04 64 bit:
|Impl.||Time after warmup|
|CPython 2.7 + numpy 1.10.dev + lapack||8.9 msec/1000 loops|
|PyPy 2.5.0 + numpy + lapack via cpyext||8.6 msec/1000 loops|
|PyPy 2.5.0 + numpy + lapack via pure python + cffi||19.9 msec/1000 loops|
|PyPy 2.5.0 + numpy + lapack via python + c + cffi||9.5 msec/1000 loops|
While no general conclusions may be drawn from a single micro-benchmark, it does indicate that there is some merit in the approach taken.
Conclusion¶PyPy's numpy now includes a working linalg module. There are still some rough corners, but hopefully we have implemented the parts you need. While the speed of the isolated linalg function is no faster than CPython and upstream numpy, it should not be significantly slower either. Your use case may see an improvement if you use a mix of python and lapack, which is the usual case.
Please let us know how it goes. We love to hear success stories too.
We still have challenges at all levels of programming,and are always looking for people willing to contribute, so stop by on IRC at #pypy.
mattip and the PyPy Team