Skip to main content

Comparing Partial Evaluation and Tracing, Part 1

As part of writing my PhD I am currently thinking about the relationship between PyPy's meta-tracing approach with various previous ideas to automatically get a (JIT-)compiler from only an interpreter of a language. One of the most-researched ideas along these lines is that of partial evaluation. Partial evaluation has basically the same goals as PyPy when it comes to compilers: Write an interpreter, and get a compiler for free. The methods for reaching that goal are a bit different. In this series of blog posts, I am trying to explore the similarities and differences of partial evaluation and PyPy's meta-tracing.

A Flowgraph Language

To be able to clearly understand what "partial evaluation" is and what "meta-tracing" is I will show an "executable model" of both. To that end, I am defining a small imperative language and will then show what a partial evaluator and a tracer for that language look like. All this code will be implemented in Prolog. (Any pattern-matching functional language would do, but I happen to know Prolog best. Backtracking is not used, so you can read things simply as functional programs.) In this post I will start with the definition of the language, and a partial evaluator for it. The code written in this blog post can be found fully here:

The language is conceptionally similar to PyPy's flow graphs, but a bit more restricted. It does not have function calls, only labelled basic blocks that consist of a series of linearly executed operations, followed by a conditional or an unconditional jump. Every operation is assigning a value to a variable, which is computed by applying some operation to some arguments.

A simple program to raise x to the yth power in that language looks like this:

    res = 1
    if y goto power_rec else goto power_done

    res = res * x
    y = y - 1
    if y goto power_rec else goto power_done


To represent the same program as Prolog data structures, we use the following Prolog code:

block(power, op1(res, same, const(1),
             if(y, power_rec, power_done))).
block(power_rec, op2(res, mul, var(res), var(x),
                 op2(y, sub, var(y), const(1),
                 if(y, power_rec, power_done)))).
block(power_done, print_and_stop(var(res))).

Every rule of block declares one block by first giving the label of the block, followed by the code. Code is a series of op1 or op2 statements terminated by a jump, an if or a print_and_stop. op1 statements are operations with one argument of the form op1(res_variable, operation_name, argument, next_statement). Arguments can be either variables in the form var(name) or constants in the form const(value).

To run programs in this flowgraph language, we first need some helper functionality. The first few helper functions are concerned with the handling of environments, the data structures the interpreter uses to map variable names occuring in the program to the variables' current values. In Python dictionaries would be used for this purpose, but in Prolog we have to emulate these by lists of key/value pairs (not very efficient, but good enough):

lookup(X, [], _) :- throw(key_not_found(X)).
lookup(Key, [Key/Value | _], Value) :- !.
lookup(Key, [_ | Rest], Value) :- lookup(Key, Rest, Value).

write_env([], X, V, [X/V]).
write_env([Key/_ | Rest], Key, Value, [Key/Value | Rest]) :- !.
write_env([Pair | Rest], Key, Value, [Pair | NewRest]) :- write_env(Rest, Key, Value, NewRest).

remove_env([], _, []).
remove_env([Key/_ | Rest], Key, Rest) :- !.
remove_env([Pair | Rest], Key, [Pair | NewRest]) :- remove_env(Rest, Key, NewRest).

resolve(const(X), _, X).
resolve(var(X), Env, Y) :- lookup(X, Env, Y).

The implementation of these functions is not too important. The lookup function finds a key in an environment list, the write_env function adds a new key/value pair to an environment, remove_env removes a key. The resolve function is used to take either a constant or a variable and return a value. If it's a constant, the value of that constant is returned, if it's a variable it is looked up in the environment. Note how the last argument of lookup and resolve is actually a return value, which is the typical approach in Prolog.

So far we have not specified what the primitive operations that can occur in the program actually mean. For that we define a do_op function which executes primitive operations:

do_op(same, X, X).
do_op(mul, X, Y, Z) :- Z is X * Y.
do_op(add, X, Y, Z) :- Z is X + Y.
do_op(sub, X, Y, Z) :- Z is X - Y.
do_op(eq, X, Y, Z) :- X == Y -> Z = 1; Z = 0.
do_op(ge, X, Y, Z) :- X >= Y -> Z = 1; Z = 0.
do_op(readlist, L, I, X) :- nth0(I, L, X).
do_op(Op, _, _, _) :- throw(missing_op(Op)).

Again the last argument is an output variable.

Now we can start executing simple operations. For that an interp predicate is defined. It takes as its first argument the current environment and as the second argument the operation to execute. E.g. to execute primitive operations with one or two arguments:

interp(op1(ResultVar, Op, Arg, Rest), Env) :-
    resolve(Arg, Env, RArg),
    do_op(Op, RArg, Res),
    write_env(Env, ResultVar, Res, NEnv),
    interp(Rest, NEnv).

interp(op2(ResultVar, Op, Arg1, Arg2, Rest), Env) :-
    resolve(Arg1, Env, RArg1),
    resolve(Arg2, Env, RArg2),
    do_op(Op, RArg1, RArg2, Res),
    write_env(Env, ResultVar, Res, NEnv),
    interp(Rest, NEnv).

First the arguments are resolved into values. Afterwards the operation is executed, and the result is written back into the environment. Then interp is called on the rest of the program. Similarly easy are the unconditional jump and print_and_stop:

interp(jump(L), Env) :-
    block(L, Block),
    interp(Block, Env).

interp(print_and_stop(Arg), Env) :-
    resolve(Arg, Env, Val),
    print(Val), nl.

In the unconditional jump we simply get the target block and continue executing that. To execute print_and_stop we resolve the argument, print the value and then are done.

The conditional jump is only slightly more difficult:

interp(if(V, L1, L2), Env) :-
    lookup(V, Env, Val),
    (Val == 0 ->
        block(L2, Block)
        block(L1, Block)
    interp(Block, Env).

First the variable is looked up in the environment. If the variable is zero, execution continues at the second block, otherwise it continues at the first block.

Given this interpreter, we can execute the above example program like this, on a Prolog console:

$ swipl -s
?- block(power, Block), interp(Block, [x/10, y/10]).

Partial Evaluation of the Flowgraph Language

Let's look at what a partial evaluator for this simple flowgraph language would look like. Partial evaluation (PE), also called specialization, is a program manipuation technique. PE takes an input program and transforms it into a (hopefully) simpler and faster output program. It does this by assuming that some variables in the input program are constants. All operations that act only on such constants can be folded away. All other operations need to remain in the output program (called residual program). Thus the partial evaluator proceeds much like an interpreter, just that it cannot actually execute some operations. Also, its output is not just a value, but also list of remaining operations that could not be optimized away.

The partial evaluator cannot use normal environments, because unlike the interpreter not all variables' values are known to it. It will therefore work on partial environments, which store just the know variables. For these partial environments, some new helper functions are needed:

plookup(Key, [], var(Key)).
plookup(Key, [Key/Value | _], const(Value)) :- !.
plookup(Key, [_ | Rest], Value) :- plookup(Key, Rest, Value).

presolve(const(X), _, const(X)).
presolve(var(V), PEnv, X) :- plookup(V, PEnv, X).

The function plookup takes a variable and a partial environment and returns either const(Value) if the variable is found in the partial environment or var(Key) if it is not. Equivalently, presolve is like resolve, except that it uses plookup instead of lookup.

With these helpers we can start writing a partial evaluator. The following two rules are where the main optimization in the form of constant folding happens. The idea is that when the partial evaluator sees an operation that involves only constant arguments, it can constant-fold the operation, otherwise it can't:

pe(op1(ResultVar, Op, Arg, Rest), PEnv, NewOp) :-
    presolve(Arg, PEnv, RArg),
    (RArg = const(C) ->
        do_op(Op, C, Res),
        write_env(PEnv, ResultVar, Res, NEnv),
        RestResidual = NewOp
        remove_env(PEnv, ResultVar, NEnv),
        NewOp = op1(ResultVar, Op, RArg, RestResidual)
    pe(Rest, NEnv, RestResidual).

pe(op2(ResultVar, Op, Arg1, Arg2, Rest), PEnv, NewOp) :-
    presolve(Arg1, PEnv, RArg1),
    presolve(Arg2, PEnv, RArg2),
    (RArg1 = const(C1), RArg2 = const(C2) ->
        do_op(Op, C1, C2, Res),
        write_env(PEnv, ResultVar, Res, NEnv),
        RestResidual = NewOp

        remove_env(PEnv, ResultVar, NEnv),
        NewOp = op2(ResultVar, Op, RArg1, RArg2, RestResidual)
    pe(Rest, NEnv, RestResidual).

The pe predicate takes a partial environment, the current operations and potentially returns a new operation. To partially evaluate a simple operation, its arguments are looked up in the partial environment. If all the arguments are constants, the operation can be executed, and no new operation is produced. Otherwise, we need to produce a new residual operation which is exactly like the one currently looked at. Also, the result variable needs to be removed from the partial environment, because it was just overwritten by an unknown value.

The potentially generated residual operation is stored into the output argument NewOp. The output argument of the recursive call is the last argument of the newly created residual operation, which will then be filled by the recursive call. This is a typical approach in Prolog, but may look strange if you are not familiar with it.

Note how the first case of these two rules is just like interpretation. The second case doesn't really do anything, it just produces a residual operation. This relationship between normal evaluation and partial evaluation is very typical.

The unconditional jump and print_and_stop are not much more complex:

pe(jump(L), PEnv, jump(LR)) :-
    do_pe(L, PEnv, LR).

pe(print_and_stop(Arg), Env, print_and_stop(RArg)) :-
    presolve(Arg, Env, RArg).

To partially evaluate an unconditional jump we again produce a jump. The target label of that residual jump is computed by asking the partial evaluator to produce residual code for the label L with the given partial environment. print_and_stop is simply turned into a print_and_stop. We will see the code for do_pe soon.

Conditional jumps are more interesting:

pe(if(V, L1, L2), PEnv, NewOp) :-
    plookup(V, PEnv, Val),
    (Val = const(C) ->
        (C = 0 ->
            L = L2
            L = L1
        do_pe(L, PEnv, LR),
        NewOp = jump(LR)
        do_pe(L1, PEnv, L1R),
        do_pe(L2, PEnv, L2R),
        NewOp = if(V, L1R, L2R)

First we look up the value of the condition variable. If it is a constant, we can produce better code, because we know statically that only one path is reachable. Thus we produce code for that path, and then emit an unconditional jump there. If the condition variable is not known at partial evaluation time, we need to partially evaluate both paths and produce a conditional jump in the residual code.

This rule is the one that causes the partial evaluator to potentially do much more work than the interpreter, because after an if sometimes both paths need to be explored. In the worst case this process never stops, so a real partial evaluator would need to ensure somehow that it terminates. There are many algorithms for doing that, but I will ignore this problem here.

Now we need to understand what the do_pe predicate is doing. Its most important task is to make sure that we don't do the same work twice by memoizing code that was already partially evaluated in the past. For that it keeps a mapping of Label, Partial Environment to Label of the residual code:

do_pe(L, PEnv, LR) :-
    (code_cache(L, PEnv, LR) ->
        gensym(L, LR),
        assert(code_cache(L, PEnv, LR)),
        block(L, Code),
        pe(Code, PEnv, Residual),
        assert(block(LR, Residual))

If the code cache indicates that label L was already partially evaluated with partial environment PEnv, then the previous residual code label LPrevious is returned. Otherwise, a new label is generated with gensym, the code cache is informed of that new label with assert, then the block is partially evaluated and the residual code is added to the database.

For those who know partial evaluation terminology: This partial evaluator is a polyvariant online partial evaluator. "Polyvariant" means that for every label, several specialized version of the block can be generated. "Online" means that no preprocessing is done before the partial evaluator runs.

Partial Evaluation Example

With this code we can look at the classical example of partial evaluation (it's probably the "Hello World" of partial evaluation). We can ask the partial evaluator to compute a power function, where the exponent y is a fixed number, e.g. 5, and the base x is unknown:

?- do_pe(power, [y/5], LR).
LR = power1.

To find out which code was produced, we can use listing:

?- listing(code_cache)
code_cache(power, [y/5], power1).
code_cache(power_rec, [y/5, res/1], power_rec1).
code_cache(power_rec, [y/4], power_rec2).
code_cache(power_rec, [y/3], power_rec3).
code_cache(power_rec, [y/2], power_rec4).
code_cache(power_rec, [y/1], power_rec5).
code_cache(power_done, [y/0], power_done1).

?- listing(block)
.... the block definition of the user program ....
block(power_done1, print_and_stop(var(res))).
block(power_rec5, op2(res, mul, var(res), var(x), jump(power_done1))).
block(power_rec4, op2(res, mul, var(res), var(x), jump(power_rec5))).
block(power_rec3, op2(res, mul, var(res), var(x), jump(power_rec4))).
block(power_rec2, op2(res, mul, var(res), var(x), jump(power_rec3))).
block(power_rec1, op2(res, mul, const(1), var(x), jump(power_rec2))).
block(power1, jump(power_rec1)).

The code_cache tells which residual labels correspond to which original labels under which partial environments. Thus, power1 contains the code of power under the assumption that y is 5. Looking at the block listing, the label power1 corresponds to code that simply multiplies res by x five times without using the variable x at all. The loop that was present in the original program has been fully unrolled, the loop variable y has disappeared. Hopefully this is faster than the original program.


In this blog post we saw an interpreter for a simple flow graph language in Prolog, together with a partial evaluator for it. The partial evaluator essentially duplicates every rule of the interpreter. If all the arguments of the current operation are known, it acts like the interpreter, otherwise it simply copies the operation into the residual code.

Partial evaluation can be used for a variety of applications, but the most commonly cited one is that of applying it to an interpreter. To do that, the program that the interpreter runs is assumed to be constant by the partial evaluator. Thus a specialized version of the interpreter is produced that does not use the input program at all. That residual code can be seen as a compiled version of the input program.

In the next blog post in this series we will look at writing a simple tracer for the same flowgraph language.

單中杰 wrote on 2012-01-26 16:57:

Excellent example and explanation! I look forward to the next installment!

But down with gensym! Instead, you can just let LR=pair(L,PEnv).

Armin Rigo wrote on 2012-01-26 17:36:

For those not too familiar with Prolog: assert(foo(..)) is not at all like the "assert" of Python or C code. Instead, it adds the rule 'foo(..)' in the database of rules. In other words, it is as if 'foo(..)' was added to the currently running program, as an extra rule.

Carl Friedrich Bolz-Tereick wrote on 2012-01-27 10:01:

單中杰: Thanks for the compliments.

I really like the idea of getting rid of gensym that way. It had never occurred to me to simply use a non-atomic term as a label, very nice.

Anonymous wrote on 2012-01-27 13:29:

Very interesting, but I'm a bit confused - what does block(X, Y) do? It isn't defined anywhere.

Carl Friedrich Bolz-Tereick wrote on 2012-01-27 13:38:

@Anonymous: block(L, O) lists all the labels and operations corresponding to the labels that exist in the user program. See the very beginning of the post. Also, when partial evaluation creates new code it adds new cases to block(L, O), with the statement assert(block(..., ...)).

PyPy internship at NCAR

Hello, everyone

I would like to inform you that there is a very interesting opportunity for doing an internship at NCAR in the lovely town of Boulder, situated on the foothils of Rocky Mountains. Before you read on, make sure you:

  • are a student of a US University, who is legally eligible to work in the US
  • are at least finishing second year this year
  • apply before February 3rd.

The internship itself will focus on using PyPy (in some way) to provide a high performance numeric kernel for an atmospheric model, and measuring how fast we can go. This is very much in line with what the current effort on NumPy in PyPy is about. The internship will be mentored by Davide del Vento and I hope to have some influence over where it goes myself :-)

A few interesting links:

Feel free to contact Davide for details about the proposal and pypy-dev or me directly for details about PyPy.

Cheers, fijal

Rahul wrote on 2012-01-16 05:03:

It looks good opportunity for a student. You can also post it on

Cameron Sparr wrote on 2012-02-01 22:56:

I've applied for the internship already but was hoping to get some more details so I could make some last-minute edits to my application! Do you have Davide Del Vento's contact info?

Maciej Fijalkowski wrote on 2012-02-02 08:34:

send me a mail

Transactional Memory (II)

Here is an update about the previous blog post about the Global Interpreter Lock (GIL). In 5 months, the point of view changed quite a bit.

Let me remind you that the GIL is the technique used in both CPython and PyPy to safely run multi-threaded programs: it is a global lock that prevents multiple threads from actually running at the same time. The reason to do that is that it would have disastrous effects in the interpreter if several threads access the same object concurrently --- to the point that in CPython even just manipulating the object's reference counter needs to be protected by the lock.

So far, the ultimate goal to enable true multi-CPU usage has been to remove the infamous GIL from the interpreter, so that multiple threads could actually run in parallel. It's a lot of work, but this has been done in Jython. The reason that it has not been done in CPython so far is that it's even more work: we would need to care not only about carefully adding fine-grained locks everywhere, but also about reference counting; and there are a lot more C extension modules that would need care, too. And we don't have locking primitives as performant as Java's, which have been hand-tuned since ages (e.g. to use help from the JIT compiler).

But we think we have a plan to implement a different model for using multiple cores. Believe it or not, this is better than just removing the GIL from PyPy. You might get to use all your cores without ever writing threads.

You would instead just use some event dispatcher, say from Twisted, from Stackless, or from your favorite GUI; or just write your own. From there, you (or someone else) would add some minimal extra code to the event dispatcher's source code, to exploit the new transactional features offered by PyPy. Then you would run your program on a special version of PyPy, and voilà: you get some form of automatic parallelization. Sounds magic, but the basic idea is simple: start handling multiple events in parallel, giving each one its own transaction. More about it later.

Threads or Events?

First, why would this be better than "just" removing the GIL? Because using threads can be a mess in any complex program. Some authors (e.g. Lee) have argued that the reason is that threads are fundamentally non-deterministic. This makes it very hard to reason about them. Basically the programmer needs to "trim" down the non-determinism (e.g. by adding locks, semaphores, etc.), and it's hard to be sure when he's got a sufficiently deterministic result, if only because he can't write exhaustive tests for it.

By contrast, consider a Twisted program. It's not a multi-threaded program, which means that it handles the "events" one after the other. The exact ordering of the events is not really deterministic, because they often correspond to external events; but that's the only source of non-determinism. The actual handling of each event occurs in a nicely deterministic way, and most importantly, not in parallel with the handling of other events. The same is true about other libraries like GUI toolkits, gevent, or Stackless.

(Of course the Twisted and the Stackless models, to cite only these two, are quite different from each other; but they have in common the fact that they are not multi-threaded, and based instead on "events" --- which in the Stackless case means running a tasklet from one switch() point to the next one.)

These two models --- threads or events --- are the two main models we have right now. The latter is more used in Python, because it is much simpler to use than the former, and the former doesn't give any benefit because of the GIL. A third model, which is the only one that gives multi-core benefits, is to use multiple processes, and do inter-process communication.

The problem

Consider the case of a big program that has arbitrary complicated dependencies. Even assuming a GIL-less Python, this is likely enough to prevent the programmer from even starting a multi-threaded rewrite, because it would require a huge mess of locks. He could also consider using multiple processes instead, but the result is annoying as well: the complicated dependencies translate into a huge mess of inter-process synchronization.

The problem can also be down-sized to very small programs, like the kind of hacks that you do and forget about. In this case, the dependencies might be simpler, but you still have to learn and use subtle locking patterns or a complex inter-process library, which is overkill for the purpose.

(This is similar to how explicit memory management is not very hard for small programs --- but still, nowadays a lot of people agree that automatic memory management is easier for programs of all sizes. I think the same will eventually be true for using multiple CPUs, but the correct solution will take time to mature, like garbage collectors did. This post is a step in hopefully the right direction :-))

Events in Transactions

Let me introduce the notion of independent events: two events are independent if they don't touch the same set of objects. In a multi-threaded world, it means that they can be executed in parallel without needing any lock to ensure correctness.

Events might also be mostly independent, i.e. they rarely access the same object concurrently. Of course, in a multi-threaded world we would still need locks to ensure correctness, but the point is that the locks are rarely causing pauses: lock contention is low.

Consider again the Twisted example I gave above. There are often several events pending in the dispatch queue (assuming the program is using 100% of our single usable CPU, otherwise the whole discussion is moot). The case I am interested in is the case in which these events are generally mostly independent, i.e. we expect few conflicts between them. However they don't have to be proved independent. In fact it is fine if they have arbitrary complicated dependencies as described above. The point is the expected common case. Imagine that you have a GIL-less Python and that you can, by a wave of your hand, have all the careful locking mess magically done. Then what I mean here is the case in which such a theoretical program would run mostly in parallel on multiple core, without waiting too often on the locks.

In this case, the solution I'm proposing is that with minimal tweaks in the event dispatch loop, we can handle multiple events on multiple threads, each in its own transaction. A transaction is basically a tentative execution of the corresponding piece of code: if we detect conflicts with other concurrently executing transactions, we abort the whole transaction and restart it from scratch.

By now, the fact that it can basically work should be clear: multiple transactions will only get into conflict when modifying the same data structures, which is the case where the magical wand above would have put locks. If the magical program could progress without too many locks, then the transactional program can progress without too many conflicts. In a way, you get even more than what the magical program can give you: each event is dispatched in its own transaction, which means that from each event's point of view, we have the illusion that nobody else is running concurrently. This is exactly what all existing Twisted-/Stackless-/etc.-based programs are assuming.

Note that this solution, without transactions, already exists in some other languages: for example, Erlang is all about independent events. This is the simple case where we can just run them on multiple cores, knowing by construction of the language that you can't get conflicts. Of course, it doesn't work for Python or for a lot of other languages. From that point of view, what I'm suggesting is merely that transactional memory could be a good model to cope with the risks of conflicts that come from not having a special-made language.

Not a perfect solution

Of course, transactional memory (TM) is not a perfect solution either. Right now, the biggest issue is the performance hit that comes from the software implementation (STM). In time, hardware support (HTM) is likely to show up and help mitigate the problem; but I won't deny the fact that in some cases, because it's simple enough and/or because you really need the top performance, TM is not the best solution.

Also, the explanations above are silent on what is a hard point for TM, namely system calls. The basic general solution is to suspend other transactions as soon as a transaction does its first system call, so that we are sure that the transaction will succeed. Of course this solution is far from optimal. Interestingly, it's possible to do better on a case-by-case basis: for example, by adding in-process buffers, we can improve the situation for sockets, by having recv() store in a buffer what is received so that it can be re-recv()-ed later if the transaction is aborted; similarly, send() or writes to log files can be delayed until we are sure that the transaction will commit.

From my point of view, the most important point is that the TM solution comes from the correct side of the "determinism" scale. With threads, you have to prune down non-determinism. With TM, you start from a mostly deterministic point, and if needed, you add non-determinism. The reason you would want to do so is to make the transactions shorter: shorter transactions have less risks of conflicts, and when there are conflicts, less things to redo. So making transactions shorter increases the parallelism that your program can achieve, while at the same time requiring more care.

In terms of an event-driven model, the equivalent would be to divide the response of a big processing event into several events that are handled one after the other: for example, the first event sets things up and fires the second event, which does the actual computation; and afterwards a third event writes the results back. As a result, the second event's transaction has little risks of getting aborted. On the other hand, the writing back needs to be aware of the fact that it's not in the same transaction as the original setting up, which means that other unrelated transactions may have run in-between.

One step towards the future?

These, and others, are the problems of the TM approach. They are "new" problems, too, in the sense that the existing ways of programming don't have these problems.

Still, as you have guessed, I think that it is overall a win, and possibly a big win --- a win that might be on the same scale for the age of multiple CPUs as automatic garbage collection was 20 years ago for the age of RAM size explosion.

Stay tuned for more!

--- Armin (and reviews by Antonio and Fijal)

UPDATE: please look at the tiny transaction module I wrote as an example. The idea is to have the same interface as this module, but implemented differently. By making use of transactional memory internally, it should be possible to safely run on multiple CPUs while keeping the very same programmer interface.
Unknown wrote on 2012-01-14 15:17:

Great article, great solution to a big problem...

I am really looking forward to this :-)

As an experiment I have developed Pyworks, which makes objects concurrent and methods asynchronious. But it makes little sense to do performance test on an multicore CPU because of the GIL.

The code for Pyworks can be found at

Anonymous wrote on 2012-01-14 15:38:

> These two models --- threads or events --- are the two main models we have right now.

Where does Go-style concurrency fit in?

gasche wrote on 2012-01-14 16:50:

If you go that road, you will certainly find out that Transactional Memory is much, much harder to get right than it looks like in today effectful/imperative languages. Sure, it looks wonderful on paper, but if your language doesn't help you control side-effects it will give you a very hard time.

Currently, there is satisfying STM support in Haskell (because of its tight type-based control of side-effects) and Clojure (beacuse of its tight control on mutability), and it might be getting into Scala.

I doubt Python can easily get such control, at least without an important reorganization of idiomatic practices and frameworks, that go beyond the "let's be event-driven" decision. Which makes your "this is going to work magically" story a bit hard to believe.

There has been intense research on this topic for some decades now, and several attempts at getting it to work in current mainstream languages have mostly failed.

See for example this long retrospective of the STM.NET effort at Microsoft Research, by Joe Duffy:
A (brief) retrospective on transactional memory
or this shorter blog post by Brian Hurt:
The problem with STM: your languages still suck.

I was a bit disappointed that you didn't cite any of the relevant literature in your post. It made me suspicious of "reiventing the wheel"...

Anonymous wrote on 2012-01-14 16:57:

One major use-case for multithreading involves a large, unchanging data structure which many threads access. I.e., the data structure is loaded by a parent task, then not modified again; a number of threads are then spawned to use it for calculations.

In CPython, the GIL makes this impossible if only because the reference counters need to be protected. With Cython in threads, however, you can turn off the GIL and do some work on C-style data structures.

I'm wondering whether the STM PyPy effort could have a very useful, and very early, benefit: simply enabling an unchanging data structure to be accessed by a number of processors via the kinds of events you describe. There wouldn't be a need for transactions, because the programmer would take responsibility for only sharing unchanging structures between simultaneously-executing events.

But it seems like the basic requirements for this kind of facility might be met in in early stage of STM development. And a solution that allowed multiple processors to access large, unchanging structures would be very useful in certain applications. I know I have one in mind that I'm looking at CPython/Cython for, but I'd rather see if I could get the performance I need from PyPy.

Just thought it was worth mentioning.

Armin Rigo wrote on 2012-01-14 19:27:

@Anonymous: in the extract you cite I meant "the two main models in Python". As far as I can tell, Go does concurrency by enforcing all communications to go via channels, so I would classify it as a "special-made" language. This solution might be nice and usable, but it does not really work at all in languages like Python.

Daniel Waterworth wrote on 2012-01-14 20:27:

@Armin, CSP may be built into Go, but IMO this was a mistake, there is no requirement for it to be a language feature; it fits nicer as library. See [python-csp] for a python implementation.


Armin Rigo wrote on 2012-01-14 21:11:

@gasche: I know about Haskell, Clojure and Scala, and I just read the two blog posts you pointed to.

I'm not talking about giving explicit TM to the end programmers. I'm instead considering TM as an internal, implementation-only feature. That makes it very similar to GCs.

I know the points and issues of traditional TM systems, which are nicely reported by Joe Duffy in "A (brief) retrospective on transactional memory". These are of course perfectly valid issues, but I think they do not apply (or "not that much") in the particular context I'm talking about. For example, this includes the large sections about nested transactions, and about consistency between the transactional and non-transactional worlds (Weak or Strong Atomicity, The Privatization Problem). Even "Where is the Killer App?" is obvious in this case: any existing Twisted App is potentially a Killer App.

Sorry for not including references to papers. I must admit I don't know any paper that describes a similar use case for TM.

Simon Weber wrote on 2012-01-14 21:45:

The link to the previous blog post is broken. It should be:

Anonymous wrote on 2012-01-15 07:24:

> @Armin, CSP may be built into Go, but IMO this was a mistake, there is no requirement for it to be a language feature; it fits nicer as library. See [python-csp] for a python implementation.

Stackless (which PyPy enables) supports Go-style channels as well, no?

René Dudfield wrote on 2012-01-15 08:03:

Your idea could work for other easy to inject into points, such as loops, and comprehensions. Especially with much of the work in pypy already done for identifying information about loops.

How does this compare to grand central dispatch and blocks?

Events are a very good way to model concurrency, and are widely used. It is a great place to dispatch concurrency into parallelism.

Closures/blocks provide a fairly decent way to get some of the protection of STM - and in many programs give you the 80% solution. For code that plays nicely and avoids mutable, or global data - this works. Luckily, a lot of event based code is already written in this way. As you say, they are "generally mostly independent".

Making the bad cases a quick fail, like in JavaScript worker threads could be an ok option. As soon as someone tries to access global data(do a system call, access the DOM, or access data outside the closure even), the program would fail there. Then you could fix those cases, or "add non-determinism" as you say. I think I'd prefer fail fast here, rather than have to detect these problems, and have them silently pass by.

You still have scheduling problems, and trying to figure out task size. As well, this does not solve lots of other problems. However, it is cool that it could be applied automatically, and probably 'safely'.

Another random thought... you could probably mark chunks of code as 'pure' as your run through them, and if they do a system call or mutate global data mark them as 'unpure' and don't try them again.

I very much look forward to reading your results as you implement more.

Eric van Riet Paap wrote on 2012-01-15 08:56:

When Armin gets this excited I'd fasten my seatbelt and put my goggles on.

Thank you for letting me be an (otherwise mostly silent) observer.

Please keep shifting boundaries!

- Eric

Armin Rigo wrote on 2012-01-16 10:08:

Update: please look at the tiny transaction module I wrote as an example. The idea is to have the same interface as this module, but implemented differently. By making use of transactional memory internally, it should be possible to safely run on multiple CPUs while keeping the very same programmer interface.

René Dudfield wrote on 2012-01-16 12:11:

@Armin: That transaction code looks very simple. It seems trivial to implement a map/mapReduce style function on top of your transaction module.

It is a very similar API to worker pool APIs which many thread using programs use. The main difference is that you combine the join() in the run method. It seems that a threaded web server for example could use this? What would happen if each incoming request comes in, and is put into the transaction (and say the 10th request has an error)? Would it be better to use multiple transactions?

Have you thought how thread local storage would work?

Armin Rigo wrote on 2012-01-16 12:55:

@notme: yes, a web server or anything can use this instead of using threads. It's of course missing a convincing select() or poll() version for that.

The details haven't been thought out; right now an exception interrupts everything. In an STM model it's unclear if concurrent transactions should still be allowed to complete or not. Anyway the point is that exceptions should not really occur because precisely they interrupt everything --- you would typically add instead in every transaction code like "try: .. except: traceback.print_exc()".

Thread local storage: what would be the point?

Unknown wrote on 2012-01-18 10:06:

I also see no reason for Thread local memory.

I like the idea of thinking about TM in the same line as GC. When you have GC the changes to the language is that you don't need to write free/dealloc.

Having TM would mean that you don't have to write acquire_GIL

headius wrote on 2012-01-24 04:22:

The devil's in the details.

I'm not sure I buy your conclusions here. STM is not a panacea for solving concurrency issues, and it has some key limitations that limit its general applicability.

On what granularity do you plan to have transactions? How do you know? Perhaps the VM will have enough knowledge of a given thread's activities to limit transactional overhead to only those structures in memory that are shared, but there still needs to be some indirection in case another thread hops in and starts making changes.

Where do transactions start and end? In STMs I know, the in-transaction overhead for reading and writing data is *much* higher, since it needs to know if someone else has committed a transaction first and be able to roll back.

Perhaps this is all intended to be hidden, and you never actually have "threads" that the user can see. But if you're going to parallelize, you'll have threads *somewhere* that are going to contend for resources. If they're going to contend for resources, even in an STM, they're going to have to check for contention, register their interest, and then you're back to the indirection overhead.

Perhaps I'm not understand what your end goal is. You can't simply turn the world into a series of transactions unless you want every read and write to have transaction overhead or you have some clear way of limiting transaction overhead to only where it's needed. You cite Erlang...but Erlang deals with immutable objects, and there's far less need for anything like an STM. Others have mentioned Clojure...but again, Clojure is mostly immutable structures, and transactional overhead is limited to Refs, where you'll make single coarse-grained reads and writes.

Am I missing the point? Are you not suggesting VM-wide STM, with the resulting transactional overhead for every read and write?

Armin Rigo wrote on 2012-01-24 10:03:

@Charles: Indeed, I am suggesting VM-wide STM, with the resulting transactional overhead for every read and write. I actually got such a VM yesterday (with no GC): it seems to be about 10x slower on a single thread.

Note that even 10x slower is a plus if it scales to dozens of processors. But of course, a better point of view is that some years ago the regular pypy *was* 10x slower than CPython. It was a lot of efforts but we managed to make it only 1.5-2x slower. And this is all without counting the JIT. If STM bogs down to a generally-not-triggered read barrier before every read, then the performance impact could be well under 2x.

Please note also that I don't care about Java-like performance where even loosing 10% of performance would be a disaster. If we end up with a pypy-tm that is 2x slower than a regular pypy, I would be quite happy, and I believe that there is a non-negligible fraction of the Python users that would be, too.

On granularity: for now I'm going with the idea that the granularity is defined "naturally" in the source program as the amount of work done every time some central dispatch loop calls some code. There might be several dispatch loops in total, too. This is true in the cases I can think of: typical Twisted or Stackless programs, pypy's "", the richards benchmark, etc.

Please look at for an example of what I'm talking about. It's a diff against the standard it is a pure Python user program in which I added calls to the new 'transaction' module. At this level there is no hint of Transactional Memory.

Armin Rigo wrote on 2012-01-31 17:13:

@Gary Robinson: (off-topic:) for this kind of use case, you can use os.fork() after the immutable data is ready. It "kind of works" both in pypy and in cpython, although not really --- in cpython the reference counts are modified, causing the pages to get unshared between processes; and in pypy the garbage collector (GC) has the same effect, so far. It could be solved in pypy by more tweaks the GC.

Anonymous wrote on 2012-02-01 18:43:

@armin: @Anonymous: in the extract you cite I meant "the two main models in Python". As far as I can tell, Go does concurrency by enforcing all communications to go via channels, so I would classify it as a "special-made" language. This solution might be nice and usable, but it does not really work at all in languages like Python.

Armin, Stackless Python uses a model that at the API level is very similar to Go. Go borrows from the Bell Labs family of languages (i.e. Newsqueak). The fundamental idea is that message pasing is used to share information between threads/processes/coroutines. In this regard, Go is in the same camp as say, Erlang (although the messaging systems are different).

What I think is interesting and workable for Python are efforts in languages like Polyphonic C# (see the paper "Scalable Join Patterns") and Concurrent/Parallel ML, where lock-free libraries and STM techniques are used under the hood to improve the efficiency of the messaging/synchronisation system. In this fashion, the programmer has a conceptually clean concurrency model and still can make the important decisions about how to partition the problem.


Anonymous wrote on 2012-02-01 18:59:

@daniel@Armin, CSP may be built into Go, but IMO this was a mistake, there is no requirement for it to be a language feature; it fits nicer as library. See [python-csp] for a python library

I have looked at Python-CSP a long time ago. I recall it being verbose. However I use Stackless Python. And using PyPy's, I implemented select() and join patterns. Sometimes I wish I had language support: they cut down on silly mistakes and make the code less verbose for simple cases. However what I have found is that the language can get in the way. For instance, in Go, one has to come up with hacks to do some simple like do a select on an arbitrary number of channels. Perhaps I am wrong but I suspect stuff like select()'s design was influenced by the fact Newsqueak was originally designed to make a windowing system easier to write. So one is monitoring only a handful of channels. In constrast, this is not the way Stackless Python programmes are written.


Armin Rigo wrote on 2012-02-01 20:39:

A link to a group that did the same thing (thanks a lot Andrew for this link!):

In particular the May 2007 paper (HotOS) nicely summarizes exactly what I'm trying to say, and I think it is clearer than me, if I have to jugde from feedback :-)

Anonymous wrote on 2012-02-27 17:57:

Speaking as someone maintaining a large application that uses Twisted, this sounds great.

NumPyPy progress report - running benchmarks


We're excited to let you know about some of the great progress we've made on NumPyPy: both completeness and performance. In this blog entry we mostly will talk about performance and how much progress we have made so far.

Word of warning: this work is in progress -- we're maybe half way to where we want to be and there are many trivial and not so trivial optimizations to be written. (For example, we haven't even started to implement important optimizations, like vectorization.)


We chose a laplace equation solver, based on SciPy's PerformancePython wiki. Unfortunately, the different implementations on the wiki page accidentally use two different algorithms, which have different convergences, and very different performance characteristics on modern computers. As a result, we implemented our own versions in both C and Python (with and without NumPy). The full source can be found in fijal's hack repo, all these benchmarks were performed at revision 18502dbbcdb3.

First, let me describe various algorithms used. Note that some of them contain PyPy-specific hacks to work around limitations in the current implementation. These hacks will go away eventually and the performance will improve. Numerically the algorithms used are identical, however exact data layout in memory differs between them.

A note about all the benchmarks: they each were run once, but the performance is very stable across runs.

Starting with the C version, it implements a trivial laplace transform using two loops and double-reference memory (array of int*). The double reference does not matter for performance and the two algorithms are implemented in inline-laplace.c and laplace.c. They were both compiled with gcc 4.4.5 at -O3. The inline version modifies array in-place while the non-inline version stores results in a copy. That makes them converge at different rate, hence different number of iterations

A straightforward version of those in Python is implemented in using, respectively, inline_slow_time_step and slow_time_step. slow_2_time_step does the same thing, except it copies arrays in-place instead of creating new copies. Table below compares running PyPy against C:

bench number of iterations time per iteration
laplace C 219 6.3ms
inline-laplace C 278 20ms
slow python 219 17ms
slow 2 python 219 14ms
inline_slow python 278 23.7ms

An important thing to notice is the data dependency of the inline version causes a huge slowdown for the C versions. This is not a severe disadvantage for us though -- the brain-dead Python version takes longer and PyPy is not able to take advantage of the knowledge that the data is independent. The results are in the same ballpark as the C versions -- 15% - 170% slower, but the algorithm one chooses matters more than the language. By comparison, the slow versions take about 5.75s each on CPython 2.6 per iteration and, by estimation, are about 200x slower than the PyPy equivalent, if I had the patience to measure the full run.

The next step is to use NumPy expressions. The first problem we run into is that computing the error requires walking the entire array a second time. This is fairly inefficient in terms of cache access, so I took the liberty of computing the errors every 15 steps. This results in the convergence being rounded to the nearest 15 iterations, but speeds things up considerably. numeric_time_step takes the most braindead approach of replacing the array with itself, like this:

u[1:-1, 1:-1] = ((u[0:-2, 1:-1] + u[2:, 1:-1])*dy2 +
                       (u[1:-1,0:-2] + u[1:-1, 2:])*dx2)*dnr_inv

We need 3 arrays here -- one is an intermediate (PyPy only needs one, for all of those subexpressions), one is a copy for computing the error, and one is the result. This works automatically because in NumPy + or * creates an intermediate, while NumPyPy avoids allocating the intermediate if possible.

numeric_2_time_step works in pretty much the same way:

src = self.u
self.u = src.copy()
self.u[1:-1, 1:-1] = ((src[0:-2, 1:-1] + src[2:, 1:-1])*dy2 +
                      (src[1:-1,0:-2] + src[1:-1, 2:])*dx2)*dnr_inv

except the copy is now explicit rather than implicit.

numeric_3_time_step does the same thing, but notice one doesn't have to copy the entire array, it's enough to copy the border pieces and fill rest with zeros:

src = self.u
self.u = numpy.zeros((self.nx, self.ny), 'd')
self.u[0] = src[0]
self.u[-1] = src[-1]
self.u[:, 0] = src[:, 0]
self.u[:, -1] = src[:, -1]
self.u[1:-1, 1:-1] = ((src[0:-2, 1:-1] + src[2:, 1:-1])*dy2 +
                      (src[1:-1,0:-2] + src[1:-1, 2:])*dx2)*dnr_inv

numeric_4_time_step is the one that tries hardest to resemble the C version. Instead of doing an array copy, it actually notices that one can alternate between two arrays. This is exactly what the C version does. The remove_invalidates call is a PyPy specific hack - we hope to remove this call in the near future, but, in short, it promises "I don't have any unbuilt intermediates that depend on the value of the argument", which means one doesn't have to compute sub-expressions one is not actually using:

self.old_u[:,:] = self.u
src = self.old_u
self.u[1:-1, 1:-1] = ((src[0:-2, 1:-1] + src[2:, 1:-1])*dy2 +
                      (src[1:-1,0:-2] + src[1:-1, 2:])*dx2)*dnr_inv

This one is the most comparable to the C version.

numeric_5_time_step does the same thing, but notices one doesn't have to copy the entire array, it's enough to just copy the edges. This is an optimization that was not done in the C version:

src = self.u
self.old_u, self.u = self.u, self.old_u
self.u[0] = src[0]
self.u[-1] = src[-1]
self.u[:, 0] = src[:, 0]
self.u[:, -1] = src[:, -1]
self.u[1:-1, 1:-1] = ((src[0:-2, 1:-1] + src[2:, 1:-1])*dy2 +
                      (src[1:-1,0:-2] + src[1:-1, 2:])*dx2)*dnr_inv

Let's look at the table of runs. As before, gcc 4.4.5, compiled at -O3, and PyPy nightly 7bb8b38d8563, on an x86-64 machine. All of the numeric methods run for 226 steps, slightly more than the 219, rounding to the next 15 when the error is computed.

benchmark PyPy CPython
numeric 21ms 35ms
numeric 2 14ms 37ms
numeric 3 13ms 29ms
numeric 4 11ms 31ms
numeric 5 9.3ms 21ms

We think that these preliminary results are pretty good. They're not as fast as the C version (or as fast as we'd like them to be), but we're already much faster than NumPy on CPython -- almost always by more than 2x on this relatively real-world example. This is not the end, though. In fact, it's hardly the beginning! As we continue work, we hope to make even more use of the high level information that we have. Looking at the assembler generated by gcc for this example, it's pretty clear we can outperform it thanks to better aliasing information and hence better possibilities for vectorization. Stay tuned.

EDIT: fixed the benchmark name

EDIT2: added info that first table is about PyPy

Cheers, fijal

D wrote on 2012-01-10 20:24:

Nice to hear, but what we (numpy users) really need is 2-dimensional matrices with basic arithmetic operations (+, -, /, *, sin, cos, pow etc) and other related methods, e.g. min(array,axis), nanmax(array, axis), argmax(array,axis), nanargmin(array, axis) etc. While CPython soft dependent on these operations works more or less fast, with PyPy it mere doesn't work at all. I hope first of all you'll focus on it instead of speed improvement for single-dimensional arrays.
Regards, D.

Maciej Fijalkowski wrote on 2012-01-10 20:27:

It would be really cool if you try before complaining. I think all of it works on a nightly build, except the axis argument which is on a branch being worked on.

D wrote on 2012-01-10 20:28:

Also, IIRC NumPyPy still misses linalg.solve method for solving systems of linear equations, that is highly important for lots of soft. Connecting sparse SLE solver (like umfpack or superlu from scipy.sparse) also would be very essential.

Maciej Fijalkowski wrote on 2012-01-10 20:30:

We're working on it. Stay tuned

D wrote on 2012-01-10 20:32:

Maciej, anything about 2-dimensional matrix implementations with related operations haven't been mentioned in blog, so why I have to know about it? I only installed and tried stable PyPy 1.7, because I had tried building PyPy from sources and found it damned hard, especially for my limited hardware (2 GB RAM).

Maciej Fijalkowski wrote on 2012-01-10 20:33:

Good point, we'll write a blog post what has been implemented as well. Try nightly

Adam wrote on 2012-01-10 21:02:

A Laplace transform is something quite different to solving Laplace's equation with finite differences...

Maciej Fijalkowski wrote on 2012-01-10 21:07:

fixed, thanks

Anonymous wrote on 2012-01-10 21:13:

It may be nice to link to the nightly builds so that people can try this out :)

Chris LeBlanc wrote on 2012-01-10 23:12:

This is excellent! Great work, the potential of this project is very exciting. I was quietly wishing for this since pypy first started.

I use NumPy all the time, and any increase in performance makes a big difference. This is one of the main advantages of NumPyPy over NumPy, so it makes sense to focus on it.

There seems to be lots of complaining about missing features and such, but having a solid foundation to work from seems to be the most important thing. Missing features can be added down the line.

I remember reading a blog post last year about using transactional memory as a way of removing the GIL. If you could combine that with NumPyPy to run numerical tasks in parallel, that would make a lot of scientific programmers very happy. I don't know if this is feasible, but it sure would be nice.

Keep up the good work.

Maciej Fijalkowski wrote on 2012-01-10 23:18:

Hi Chris.

We have vague plans how to parallelize numpy expressions without even having to remove the GIL. That way you'll have workers that are able to perform (or help perform) numeric tasks, but the interpreter itself will still run in a single thread. The same goes for GPUs and MIC.

Anonymous wrote on 2012-01-11 10:55:

Nightly builds

Anonymous wrote on 2012-01-11 13:33:

Please when you consider parallelizing things, do remember about leaving an explicit switch to turn it off!

I run my Python stuff on clusters through a queuing system and it will be VERY unhappy if single processes use more than one thread without informing the scheduler.

Anonymous wrote on 2012-01-11 13:34:

Hey, by the way, your progress on NumPy is amazing and highly appreciated.

Maciej Fijalkowski wrote on 2012-01-11 15:31:

@Anonymous of course, this is a given that we'll leave the switch to turn it off. It might be not even on by default, that's up for discussion

Paul Harrison wrote on 2012-01-12 02:42:

Chris, if you haven't considered this already, it's sometimes possible to achieve parallelism with multiple processes using memory mapped files as numpy arrays. It's a bit awkward, but it can also make for an easier path to a computation that is resumable or can be run on a cluster.

GIL removal would be wonderful, but it's a pretty ambitious idea. Then again, these pypy folk seem able to deliver on some pretty amazing stuff.

Peter S wrote on 2012-01-16 10:33:

I am closely following these developments with numpypy and I just succesfully tested the last nightly build, which I find very impressive!

For research purposes, the main thing we need is scipy.stats.ttest_1samp to work on pypy. Is there an estimation on when scipypy will be available?

Leysin Winter Sprint

PyPy Leysin Winter Sprint: 15-22nd January 2012

The next PyPy sprint will be in Leysin, Switzerland, for the eighth time. This is a fully public sprint: newcomers and topics other than those proposed below are welcome.

Goals and topics of the sprint

  • Py3k: work towards supporting Python 3 in PyPy
  • NumPyPy: work towards supporting the numpy module in PyPy
  • JIT backends: integrate tests for ARM; look at the PowerPC 64; maybe try again to write an LLVM- or GCC-based one
  • STM and STM-related topics; or the Concurrent Mark-n-Sweep GC
  • And as usual, the main side goal is to have fun in winter sports :-) We can take a day off for ski.

Exact times

The work days should be 15-21 January 2011 (Sunday-Saturday). The official plans are for people to arrive on the 14th or the 15th, and to leave on the 22nd.

Interested?

Anonymous wrote on 2011-12-28 01:30:

How is the STM work going, btw?

Do you have any indications yet on whether it'll be workable in an imperative VM?

Anonymous wrote on 2012-01-02 11:49:

any news on the win64 port?

Klaus Ramelow wrote on 2012-01-07 12:56:

Leysin Winter Sprint
Exact times

The work days should be 15-21 January 2011 (Sunday-Saturday).

I assume it will be January 2012

Armin Rigo wrote on 2012-01-09 19:54:

STM work is slowly progressing, as you must have noticed in pypy-dev.

The Win64 port's progress is unknown, sorry.

Come see us at PyCon 2012

PyCon 2012 is coming up in just a few short months, and PyPy will be well
represented there. We'll be delivering a tutorial, two talks, plus we'll be
around for the sprints.

Here are the abstracts for the tutorials and talks:

  • How to get the most out of your PyPy, by Maciej Fijalkowski, Alex Gaynor
    and Armin Rigo: For many applications PyPy can provide performance benefits
    right out of the box. However, little details can push your application to
    perform much better. In this tutorial we'll give you insights on how to push
    PyPy to its limits. We'll focus on understanding the performance
    characteristics of PyPy, and learning the analysis tools in order to maximize
    your applications' performance. This is the tutorial.
  • Why PyPy by example, by Maciej Fijalkowski, Alex Gaynor and Armin Rigo:
    One of the goals of PyPy is to make existing Python code faster; however an
    even broader goal was to make it possible to write things in Python that
    previously would needed to be written in C or other low-level language. This
    talk will show examples of this, and describe how they represent the
    tremendous progress PyPy has made, and what it means for people looking at
    using PyPy.
  • How the PyPy JIT works, by Benjamin Peterson: The Python community is
    abuzz about the major speed gains PyPy can offer for pure Python code. But how
    does the PyPy JIT actually work? This talk will discuss how the PyPy JIT is
    implemented. It will include descriptions of the tracing, optimization, and
    assembly generation phases. I will demonstrate each step with an example loop.

If you have any questions let us know! We look forward to seeing people at
PyCon and chatting about PyPy and the entire Python ecosystem.

See you there,
Maciej Fijalkowski, Alex Gaynor, Benjamin Peterson, Armin Rigo, and the entire PyPy team

Plotting using matplotlib from PyPy

Big fat warning This is just a proof of concept. It barely works. There are missing pieces left and right, which were replaced with hacks so I can get this to run and prove it's possible. Don't try this at home, especially your home. You have been warned.

There has been a lot of talking about PyPy not integrating well with the current scientific Python ecosystem, and numpypy (a NumPy reimplementation on top of pypy) was dubbed "a fancy array library". I'm going to show that integration with this ecosystem is possible with our design.

First, the demo:

#!/usr/bin/env pypy

# numpy, pypy version
import numpypy as numpy
# DRAGONS LIVE THERE (fortunately hidden)
from embed.emb import import_mod

pylab = import_mod('matplotlib.pylab')

if __name__ == '__main__':
    a = numpy.arange(100, dtype=int)
    b = numpy.sin(a)
    pylab.plot(a, b)

And you get:

Now, how to reproduce it:

  • You need a PyPy without cpyext, I did not find a linker that would support overriding symbols. Right now there are no nightlies like this, so you have to compile it yourself, like:

    ./ -Ojit --withoutmod-cpyext

    That would give you a PyPy that's unable to load some libraries like PIL, but perfectly working otherwise.

  • Speaking of which, you need a reasonably recent PyPy.

  • The approach is generally portable, however the implementation has been tested only on 64bit linux. Few tweaks might be required.

  • You need to install python2.6, the python2.6 development headers, and have numpy and matplotlib installed on that python.

  • You need a checkout of my hacks directory and put embedded on your PYTHONPATH, your pypy checkout also has to be on the PYTHONPATH.

Er wait, what happened?

What didn't happen is we did not reimplement matplotlib on top of PyPy. What did happen is we embed CPython inside of PyPy using ctypes. We instantiate it. and follow the embedding tutorial for CPython. Since numpy arrays are not movable, we're able to pass around an integer that's represents the memory address of the array data and reconstruct it in the embedded interpreter. Hence with a relatively little effort we managed to reuse the same array data on both sides to plot at array. Easy, no?

This approach can be extended to support anything that's not too tied with python objects. SciPy and matplotlib both fall into the same category but probably the same strategy can be applied to anything, like GTK or QT. It's just a matter of extending a hack into a working library.

To summarize, while we're busy making numpypy better and faster, it seems that all external libraries on the C side can be done using an embedded Python interpreter with relatively little effort. To get to that point, I spent a day and a half to learn how to embed CPython, with very little prior experience in the CPython APIs. Of course you should still keep as much as possible in PyPy to make it nice and fast :)

Cheers, fijal

Kumo wrote on 2011-12-09 04:06:

Pretty cool!

Eli Bressert wrote on 2011-12-09 20:27:

Two thumbs up! This is quite exciting! Looking forward to further followup from this.

How does Scipy look in terms of implementation, e.g. wrapping fortran code with f2py? Could it become achieved?

Pankaj wrote on 2011-12-10 06:14:

freaking awesome :)

Laptop repair wrote on 2011-12-10 13:02:

PyPy Version is showing best result, it is giving extra protection to program.

dac wrote on 2011-12-13 20:52:

Good work. Is this approach your long term plan for supporting scientific python libraries or just a stop-gap solution until "proper" support can be added to pypy (or to the library)?

Maciej Fijalkowski wrote on 2011-12-13 23:06:

@dac this can scale to the entire matplotlib/scipy fully. Whether scientific community people will take up a gargantuan task of moving SciPy/matplotlib out of using CPython C API is beyond my knowledge, but even if it'll happen, it won't happen in short-to-mid-term.

So overall I think it's a good midterm solution, that might just stay forever.

Anonymous wrote on 2014-01-04 09:23:

Another solution containing dragons, that someone might find useful:
1) create new python file, that would print diagrams
2) send data from main program running in pypy to the second python file using call from subprocess
eg. call(["python", "", "-data", str(my_data).replace(" ", ";")]), data should be be text type and contain separator other than space
3) parse input data using argparse and convert them using ast

Konstantin Lopuhin wrote on 2014-02-02 12:32:

Also, seems that embed must live below the root of pypy source tree (else it fails to create proper paths to ".o" output files in rpython.translator.platform.Platform._make_o_file).

PyPy 1.7 - widening the sweet spot

We're pleased to announce the 1.7 release of PyPy. As became a habit, this release brings a lot of bugfixes and performance improvements over the 1.6 release. However, unlike the previous releases, the focus has been on widening the "sweet spot" of PyPy. That is, classes of Python code that PyPy can greatly speed up should be vastly improved with this release. You can download the 1.7 release here:

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 1.7 and cpython 2.7.1 performance comparison) due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 32/64 or Windows 32. Windows 64 work is ongoing, but not yet natively supported.

The main topic of this release is widening the range of code which PyPy can greatly speed up. On average on our benchmark suite, PyPy 1.7 is around 30% faster than PyPy 1.6 and up to 20 times faster on some benchmarks.


  • Numerous performance improvements. There are too many examples which python constructs now should behave faster to list them.

  • Bugfixes and compatibility fixes with CPython.

  • Windows fixes.

  • PyPy now comes with stackless features enabled by default. However, any loop using stackless features will interrupt the JIT for now, so no real performance improvement for stackless-based programs. Contact pypy-dev for info how to help on removing this restriction.

  • NumPy effort in PyPy was renamed numpypy. In order to try using it, simply write:

    import numpypy as numpy

    at the beginning of your program. There is a huge progress on numpy in PyPy since 1.6, the main feature being implementation of dtypes.

  • JSON encoder (but not decoder) has been replaced with a new one. This one is written in pure Python, but is known to outperform CPython's C extension up to 2 times in some cases. It's about 20 times faster than the one that we had in 1.6.

  • The memory footprint of some of our RPython modules has been drastically improved. This should impact any applications using for example cryptography, like tornado.

  • There was some progress in exposing even more CPython C API via cpyext.

Things that didn't make it, expect in 1.8 soon

There is an ongoing work, which while didn't make it to the release, is probably worth mentioning here. This is what you should probably expect in 1.8 some time soon:

  • Specialized list implementation. There is a branch that implements lists of integers/floats/strings as compactly as array.array. This should drastically improve performance/memory impact of some applications
  • NumPy effort is progressing forward, with multi-dimensional arrays coming soon.
  • There are two brand new JIT assembler backends, notably for the PowerPC and ARM processors.


It's maybe worth mentioning that we're running fundraising campaigns for NumPy effort in PyPy and for Python 3 in PyPy. In case you want to see any of those happen faster, we urge you to donate to numpy proposal or py3k proposal. In case you want PyPy to progress, but you trust us with the general direction, you can always donate to the general pot.

Maciej Fijałkowki, Armin Rigo and the entire PyPy team

Unknown wrote on 2011-11-21 12:29:

Could you put a link to some sort of NEWS file, a list of issue tracker tickets, or at least the relevant span of the revision control tool so that I could browse what sorts of changes have gone into trunk since 1.6?

Anonymous wrote on 2011-11-21 12:54:

"PyPy now comes with stackless features enabled by default"

Could you please tell a bit more about it? Is it just sort of internal optimizations, something under the hood? Or does it mean tail recursion optimization? Or cooperative multitasking with greenlets? What's the API for stackless features?

Anonymous wrote on 2011-11-21 14:27:

Is it so hard to wait until you have a Windows build before announcing a release?

Or not telling in the release that the Windows binary is available?

Benjamin Peterson wrote on 2011-11-21 15:30:


hg log -rrelease-1.6:release-1.7

Jan Ziak (atomsymbol) wrote on 2011-11-21 16:38:

I am getting a segmentation fault.

D wrote on 2011-11-21 18:37:

So if I want to run PyPy on my code with numpy I have to replace in each file "import numpy" by "import numpypy", "from numpy import ..." by "from numpypy import ...". And each time I want to switch beween PyPy and CPython, I have to search and replace all those occurrences backward. Well done...

Anonymous wrote on 2011-11-21 19:35:

Thank you for all your work, it's nice to see how far you have come in so little time! Keep raising the bar.

Amaury wrote on 2011-11-21 21:06:

@D: Please take it the easy way and add "sys.modules['numpy'] = numpypy" at the start of your program.

Maciej Fijalkowski wrote on 2011-11-21 21:08:

@⚛ report a bug to

@D it's gonna stay like this until it's finished. The problem is that most programs won't run out of the box anyway as of now, because of some missing functionality. We'll probably rename it back once it's finished.

Armin Rigo wrote on 2011-11-21 21:09:

@D: all you need is to create a file "" that contains "from numpypy import *". (The real reason we did this temporary renaming is because numpy developers asked us to.)

More likely, though, you are probably going to hit some unimplemented feature anyway, as our numpy(py) is still incomplete.

Anonymous wrote on 2011-11-21 22:49:

Re: numpypy. The standard in the bad old days with three different and subtly incompatible array libraries was "try: import ...; except: ..."

Jan Ziak (atomsymbol) wrote on 2011-11-22 07:14:

@Maciej: I am *not* going to submit a bug report, on purpose. When developing software for the masses, there are always two sets of users. One set comprises the users who report bugs, the other set comprises the users who are experiencing issues but do not report bugs.

The ideal state would be that there are no bugs, but this is only theoretical of course.

As an experiment, I have decided not to tell you any information about the segmentation fault. Nothing. Absolutely nothing.

The question is what measures are you going to take to solve this PyPy issue.

Good luck ...

Maciej Fijalkowski wrote on 2011-11-22 08:09:

@⚛ we're going to do nothing with that. Most probably you're using a CPython C extension or some illegal ctypes invocation or older version of jinja that did that or something... Besides, there is absolutely no point in trying to fix a bug that noone can potentially provide any information for.


Jan Ziak (atomsymbol) wrote on 2011-11-22 09:07:


PyPy 1.6 worked OK (but it was slower than CPython).

"we're going to do nothing with that."


"Most probably you're using a CPython C extension or some illegal ctypes invocation or older version of jinja that did that or something..."

I don't think so. GDB says that the EIP register stops at an address which does not seem to belong to the PyPy executable nor to any dynamically loaded library. This leads me to the conclusion that the issue is in the x86 code generated by PyPy.

"Besides, there is absolutely no point in trying to fix a bug that noone can potentially provide any information for."

I am not saying you have to fix it. I am just saying that PyPy 1.7 generates code that segfaults.

Does PyPy employ partial verification when generating x86 code?

Jorgen wrote on 2011-11-22 09:28:


"As an experiment, I have decided not to tell you any information about the segmentation fault. Nothing. Absolutely nothing."

So you want to conduct an experiment into 'How to help out an open source project by withholding crucial information'? And I thought the ideas of my PhD-advisor were bad ...

Anonymous wrote on 2011-11-22 09:56:

The point he, she, or it is making is that PyPy should contain a theorem prover to verify the code it generates so it is possible to prove mathematically that it never generates bad code—and that anything else is beneath the contempt of a serious computer scientist. If you need information about a segfault in order to debug it, you obviously have not thought it through thoroughly enough.

Jan Ziak (atomsymbol) wrote on 2011-11-22 10:02:

@Jorgen and @Maciej:

Well, I previously wrote here that "The question is what measures are you (=the PyPy team) going to take to solve this PyPy issue."

This sentence of mine contained the additional information that: I believe that it is a PyPy issue.

Maciej then wrote: "Most probably you're using a CPython C extension or ... that did that or something". This means he was trying to put the blame on others (C extensions or whatever) rather than admitting that it might be an issue attributable to PyPy and PyPy alone.

Then you (Jorgen) wrote "So you want to conduct an experiment into 'How to help out an open source project by withholding crucial information'?". And that is exactly what I intend to do: to help the PyPy project by withholding crucial information.

It will work.

Jan Ziak (atomsymbol) wrote on 2011-11-22 10:14:


"... PyPy should contain a theorem prover to verify the code it generates so it is possible to prove mathematically that it never generates bad code"

I believe such a thing is impossible.

Anonymous wrote on 2011-11-22 11:36:

It's possible if you let the verifier reject legal code. It's probably not realistic though, RPython (or is that the JIT-annotation language?) would have to be designed to be verifiable for whatever property you want to verify.

Armin Rigo wrote on 2011-11-22 14:28:

@⚛: you're sitting in your own corner of the world thinking that we will try hard to figure out which segfault you could possibly mean, and that it will help the PyPy project :-) I've heard many misconceptions of how Open Source works, but I've never heard this one.

How it really works is: you think you have a genuine segfault and want to report it, in which case you file a bug to, and maybe we have to discuss more to figure out why, for example, it appears on your machine and not ours, or which configuration you need to reproduce it; sometimes it can take efforts on both parties to even reproduce the problem.

You are free to not play this game, but then just like Maciej said, you will be fully ignored. Even if it's a real bug, it's likely that over time someone else will report or fix it. I'm not trying to force you to "reveal" it to us; feel free to ignore me. I'm just explaining how I believe Open Source works.

The difference for us is small, because a real bug will be seen and reported by others too. The difference for you is whether you would like to contribute and get our thanks, or don't care about it.

Anonymous wrote on 2011-11-22 23:50:

The pypy team "could" solve it. But it would be a massive waste of time, and of cource the changes are that they are unable to because of problems in your setup. I most certainly hope no open source team really spend their time on such ghost hunts.

Anonymous wrote on 2011-11-23 04:25:

Winston Ewert wrote on 2011-11-23 04:42:

Somewhat off the topic of this post, but I'm wondering what the special optimization of string lists would be. I can see obvious benefits to storing ints/floats directly in the list rather then as boxed numbers, but not so much for strings since they have be stored using an indirection anyways.

Carl Friedrich Bolz-Tereick wrote on 2011-11-23 09:15:


astutely observed (as always). There are two points to string lists:

1) PyPy's strings have one extra indirection, e.g. the data is not stored in the string box. This is due to RPython restrictions. With string lists, one indirection can be removed.

2) If the JIT knows that the full list stores only strings, it can actually generate better code, because it does not need to check the type of the item that was just read out of the list.

vacation homes in kissimmee florida wrote on 2011-11-25 09:30:

This means he was trying to put the blame on others....

wholesale electronics wrote on 2011-12-17 01:20:

omething under the hood? Or does it mean tail recursion optimization?

Gothenburg sprint report

In the past week, we have been busy hacking on PyPy at the Gothenburg sprint, the second of this 2011. The sprint was hold at Laura's and Jacob's place, and here is a brief report of what happened.

In the first day we welcomed Mark Pearse, who was new to PyPy and at his first sprint. Mark worked the whole sprint in the new SpecialisedTuple branch, whose aim is to have a special implementation for small 2-items and 3-items tuples of primitive types (e.g., ints or floats) to save memory. Mark paired with Antonio for a couple of days, then he continued alone and did an amazing job. He even learned how to properly do Test Driven Development :-).

Antonio spent a couple of days investigating whether it is possible to use application checkpoint libraries such as BLCR and DMTCP to save the state of the PyPy interpreter between subsequent runs, thus saving also the JIT-compiled code to reduce the warmup time. The conclusion is that these are interesting technologies, but more work would be needed (either on the PyPy side or on the checkpoint library side) before it can have a practical usage for PyPy users.

Then, Antonio spent most of the rest of the sprint working on his ffistruct branch, whose aim is to provide a very JIT-friendly way to interact with C structures, and eventually implement ctypes.Structure on top of that. The "cool part" of the branch is already done, and the JIT already can compile set/get of fields into a single fast assembly instruction, about 400 times faster than the corresponding ctypes code. What is still left to do is to add a nicer syntax (which is easy) and to implement all the ctypes peculiarities (which is tedious, at best :-)).

As usual, Armin did tons of different stuff, including fixing a JIT bug, improving the performance of file.readlines() and working on the STM branch (for Software Transactional Memory), which is now able to run RPython multithreaded programs using software transaction (as long as they don't fill up all the memory, because support for the GC is still missing :-)). Finally, he worked on improving the Windows version of PyPy. While doing so he discovered together with Anto a terrible bug which lead to a continuous leak of stack space because the JIT called some functions using the wrong calling convention.

Håkan, with some help from Armin, worked on the jit-targets branch, whose goal is to heavily refactor the way the traces are internally represented by the JIT, so that in the end we can produce (even :-)) better code than what we do nowadays. More details in this mail.

Andrew Dalke worked on a way to integrate PyPy with FORTRAN libraries, and in particular the ones which are wrapped by Numpy and Scipy: in doing so, he wrote f2pypy, which is similar to the existing f2py but instead of producing a CPython extension module it produces a pure python modules based on ctypes. More work is needed before it can be considered complete, but f2pypy is already able to produce a wrapper for BLAS which passes most of the tests under CPython, although there's still work left to get it working for PyPy.

Armin and Håkan with Laura's "5x faster" cake
Christian Tismer worked the whole sprint on the branch to make PyPy compatible with Windows 64 bit. This needs a lot of work because a lot of PyPy is written under the assumption that the long type in C has the same bit size than void*, which is not true on Win64. Christian says that in the past Genova-Pegli sprint he completed 90% of the work, and in this sprint he did the other 90% of the work. Obviously, what is left to complete the task is the third 90% :-). More seriously, he estimated a total of 2-4 person-weeks of work to finish it.

But, all in all, the best part of the sprint has been the cake that Laura baked to celebrate the "5x faster than CPython" achievement. Well, actually our speed page reports "only" 4.7x, but that's because in the meantime we switched from comparing against CPython 2.6 to comparing against CPython 2.7, which is slightly faster. We are confident that we will reach the 5x goal again, and that will be the perfect excuse to eat another cake :-)
Albien wrote on 2011-11-15 00:40:

Freaking amazing guys together!!!

Kumo wrote on 2011-11-15 03:28:

"5x faster than CPython cake". Sounds delicious.

Anonymous wrote on 2011-11-15 10:18:

awesome! what do you think? how much room for improvement is there? is 10x possible? :)

Luis wrote on 2011-11-15 13:52:

Congratulations! I guess that 5x faster (Unladen Swallow's performance goal) means that pypy is now "officially" fast.

As Anonymous asked above, I also wonder how much room for improvement there is from now on.
Have all the low hanging fruits been picked already? Can we expect this pace of improvement to go on for a while? Or you are close to hit the limit?

Well, I know it's hard to predict... I'd just like to know what your heart tells you :-)

Thank you guys for all the hard work!

Anonymous wrote on 2011-11-18 15:56:

does pygame work with pypy? would be awesome... what about pyopengl?

Anonymous wrote on 2011-11-19 00:28:

Sorry, but pyopengl require either numpy or Numeric, which unfortunatly ain't supported yet.

Anonymous wrote on 2011-12-17 01:06:

Five times faster than CPython. Great! How does it compare to C?