As part of writing my PhD I am currently thinking about the relationship between PyPy's meta-tracing approach and various previous ideas for automatically getting a (JIT-)compiler from just an interpreter of a language. One of the most-researched ideas along these lines is that of partial evaluation. Partial evaluation has basically the same goals as PyPy when it comes to compilers: write an interpreter, and get a compiler for free. The methods for reaching that goal are a bit different. In this series of blog posts, I am trying to explore the similarities and differences of partial evaluation and PyPy's meta-tracing.
A Flowgraph Language
To be able to clearly understand what "partial evaluation" is and what "meta-tracing" is, I will show an "executable model" of both. To that end, I am defining a small imperative language and will then show what a partial evaluator and a tracer for that language look like. All this code will be implemented in Prolog. (Any pattern-matching functional language would do, but I happen to know Prolog best. Backtracking is not used, so you can read things simply as functional programs.) In this post I will start with the definition of the language, and a partial evaluator for it. The full code from this blog post can be found here: https://paste.pocoo.org/show/541004/
The language is conceptually similar to PyPy's flow graphs, but a bit more restricted. It does not have function calls, only labelled basic blocks that consist of a series of linearly executed operations, followed by a conditional or an unconditional jump. Every operation assigns to a variable a value that is computed by applying some operation to some arguments.
A simple program to raise x to the yth power in that language looks like this:
power:
    res = 1
    if y goto power_rec else goto power_done

power_rec:
    res = res * x
    y = y - 1
    if y goto power_rec else goto power_done

power_done:
    print_and_stop(res)
To represent the same program as Prolog data structures, we use the following Prolog code:
block(power, op1(res, same, const(1),
                 if(y, power_rec, power_done))).
block(power_rec, op2(res, mul, var(res), var(x),
                 op2(y, sub, var(y), const(1),
                 if(y, power_rec, power_done)))).
block(power_done, print_and_stop(var(res))).
Every rule of block declares one block by first giving the label of the block, followed by the code. Code is a series of op1 or op2 statements terminated by a jump, an if or a print_and_stop. op1 statements are operations with one argument of the form op1(res_variable, operation_name, argument, next_statement). Arguments can be either variables in the form var(name) or constants in the form const(value).
To run programs in this flowgraph language, we first need some helper functionality. The first few helper functions are concerned with the handling of environments, the data structures the interpreter uses to map variable names occurring in the program to the variables' current values. In Python, dictionaries would be used for this purpose, but in Prolog we have to emulate them by lists of key/value pairs (not very efficient, but good enough):
lookup(X, [], _) :- throw(key_not_found(X)).
lookup(Key, [Key/Value | _], Value) :- !.
lookup(Key, [_ | Rest], Value) :- lookup(Key, Rest, Value).

write_env([], X, V, [X/V]).
write_env([Key/_ | Rest], Key, Value, [Key/Value | Rest]) :- !.
write_env([Pair | Rest], Key, Value, [Pair | NewRest]) :- write_env(Rest, Key, Value, NewRest).

remove_env([], _, []).
remove_env([Key/_ | Rest], Key, Rest) :- !.
remove_env([Pair | Rest], Key, [Pair | NewRest]) :- remove_env(Rest, Key, NewRest).

resolve(const(X), _, X).
resolve(var(X), Env, Y) :- lookup(X, Env, Y).
The implementation of these functions is not too important. The lookup function finds a key in an environment list, the write_env function adds a new key/value pair to an environment, remove_env removes a key. The resolve function is used to take either a constant or a variable and return a value. If it's a constant, the value of that constant is returned, if it's a variable it is looked up in the environment. Note how the last argument of lookup and resolve is actually a return value, which is the typical approach in Prolog.
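A couple of example queries (my own, not from the original code) show how these helpers behave on the Prolog console:

?- lookup(x, [x/5, y/7], V).
V = 5.

?- write_env([x/5], y, 7, NEnv).
NEnv = [x/5, y/7].

?- resolve(var(y), [x/5, y/7], V).
V = 7.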
So far we have not specified what the primitive operations that can occur in the program actually mean. For that we define a do_op function which executes primitive operations:
do_op(same, X, X).
do_op(mul, X, Y, Z) :- Z is X * Y.
do_op(add, X, Y, Z) :- Z is X + Y.
do_op(sub, X, Y, Z) :- Z is X - Y.
do_op(eq, X, Y, Z) :- X == Y -> Z = 1; Z = 0.
do_op(ge, X, Y, Z) :- X >= Y -> Z = 1; Z = 0.
do_op(readlist, L, I, X) :- nth0(I, L, X).
do_op(Op, _, _, _) :- throw(missing_op(Op)).
Again the last argument is an output variable.
Now we can start executing simple operations. For that an interp predicate is defined. It takes the operation to execute as its first argument and the current environment as its second argument. E.g. to execute primitive operations with one or two arguments:
interp(op1(ResultVar, Op, Arg, Rest), Env) :-
    resolve(Arg, Env, RArg),
    do_op(Op, RArg, Res),
    write_env(Env, ResultVar, Res, NEnv),
    interp(Rest, NEnv).

interp(op2(ResultVar, Op, Arg1, Arg2, Rest), Env) :-
    resolve(Arg1, Env, RArg1),
    resolve(Arg2, Env, RArg2),
    do_op(Op, RArg1, RArg2, Res),
    write_env(Env, ResultVar, Res, NEnv),
    interp(Rest, NEnv).
First the arguments are resolved into values. Afterwards the operation is executed, and the result is written back into the environment. Then interp is called on the rest of the program. Similarly easy are the unconditional jump and print_and_stop:
interp(jump(L), Env) :-
    block(L, Block),
    interp(Block, Env).

interp(print_and_stop(Arg), Env) :-
    resolve(Arg, Env, Val),
    print(Val), nl.
In the unconditional jump we simply get the target block and continue executing that. To execute print_and_stop we resolve the argument, print the value and then are done.
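At this point we can already execute a small hand-written snippet that ends in a print_and_stop (an example of mine, not from the post):

?- interp(op2(x, add, const(2), const(3), print_and_stop(var(x))), []).
5
true.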
The conditional jump is only slightly more difficult:
interp(if(V, L1, L2), Env) :-
    lookup(V, Env, Val),
    (Val == 0 ->
        block(L2, Block)
    ;
        block(L1, Block)
    ),
    interp(Block, Env).
First the variable is looked up in the environment. If the variable is zero, execution continues at the second block, otherwise it continues at the first block.
Given this interpreter, we can execute the above example program like this, on a Prolog console:
$ swipl -s cfglang.pl
?- block(power, Block), interp(Block, [x/10, y/10]).
10000000000
Partial Evaluation of the Flowgraph Language
Let's look at what a partial evaluator for this simple flowgraph language would look like. Partial evaluation (PE), also called specialization, is a program manipulation technique. PE takes an input program and transforms it into a (hopefully) simpler and faster output program. It does this by assuming that some variables in the input program are constants. All operations that act only on such constants can be folded away. All other operations need to remain in the output program (called the residual program). Thus the partial evaluator proceeds much like an interpreter, except that it cannot actually execute some operations. Also, its output is not just a value, but also a list of the remaining operations that could not be optimized away.
The partial evaluator cannot use normal environments, because unlike the interpreter it does not know all the variables' values. It will therefore work on partial environments, which store just the known variables. For these partial environments, some new helper functions are needed:
plookup(Key, [], var(Key)).
plookup(Key, [Key/Value | _], const(Value)) :- !.
plookup(Key, [_ | Rest], Value) :- plookup(Key, Rest, Value).

presolve(const(X), _, const(X)).
presolve(var(V), PEnv, X) :- plookup(V, PEnv, X).
The function plookup takes a variable and a partial environment and returns either const(Value) if the variable is found in the partial environment or var(Key) if it is not. Equivalently, presolve is like resolve, except that it uses plookup instead of lookup.
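For example (again queries of mine), looking up a known and an unknown variable gives:

?- plookup(y, [y/5], X).
X = const(5).

?- plookup(x, [y/5], X).
X = var(x).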
With these helpers we can start writing a partial evaluator. The following two rules are where the main optimization in the form of constant folding happens. The idea is that when the partial evaluator sees an operation that involves only constant arguments, it can constant-fold the operation, otherwise it can't:
pe(op1(ResultVar, Op, Arg, Rest), PEnv, NewOp) :-
    presolve(Arg, PEnv, RArg),
    (RArg = const(C) ->
        do_op(Op, C, Res),
        write_env(PEnv, ResultVar, Res, NEnv),
        RestResidual = NewOp
    ;
        remove_env(PEnv, ResultVar, NEnv),
        NewOp = op1(ResultVar, Op, RArg, RestResidual)
    ),
    pe(Rest, NEnv, RestResidual).

pe(op2(ResultVar, Op, Arg1, Arg2, Rest), PEnv, NewOp) :-
    presolve(Arg1, PEnv, RArg1),
    presolve(Arg2, PEnv, RArg2),
    (RArg1 = const(C1), RArg2 = const(C2) ->
        do_op(Op, C1, C2, Res),
        write_env(PEnv, ResultVar, Res, NEnv),
        RestResidual = NewOp
    ;
        remove_env(PEnv, ResultVar, NEnv),
        NewOp = op2(ResultVar, Op, RArg1, RArg2, RestResidual)
    ),
    pe(Rest, NEnv, RestResidual).
The pe predicate takes the current operation and a partial environment, and produces the residual operation in its last argument. To partially evaluate a simple operation, its arguments are looked up in the partial environment. If all the arguments are constants, the operation can be executed, and no new operation is produced. Otherwise, we need to produce a new residual operation which is exactly like the one currently looked at. Also, the result variable needs to be removed from the partial environment, because it was just overwritten by an unknown value.
The potentially generated residual operation is stored into the output argument NewOp. The output argument of the recursive call is the last argument of the newly created residual operation, which will then be filled by the recursive call. This is a typical approach in Prolog, but may look strange if you are not familiar with it.
Note how the first case of these two rules is just like interpretation. The second case doesn't really do anything, it just produces a residual operation. This relationship between normal evaluation and partial evaluation is very typical.
The unconditional jump and print_and_stop are not much more complex:
pe(jump(L), PEnv, jump(LR)) :-
    do_pe(L, PEnv, LR).

pe(print_and_stop(Arg), Env, print_and_stop(RArg)) :-
    presolve(Arg, Env, RArg).
To partially evaluate an unconditional jump we again produce a jump. The target label of that residual jump is computed by asking the partial evaluator to produce residual code for the label L with the given partial environment. print_and_stop is simply turned into a print_and_stop. We will see the code for do_pe soon.
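To illustrate the two cases of the op2 rule with a couple of queries of my own (one where everything is constant, one where x is unknown):

?- pe(op2(res, mul, var(res), const(2), print_and_stop(var(res))), [res/5], NewOp).
NewOp = print_and_stop(const(10)).

?- pe(op2(res, mul, var(res), var(x), print_and_stop(var(res))), [res/5], NewOp).
NewOp = op2(res, mul, const(5), var(x), print_and_stop(var(res))).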
Conditional jumps are more interesting:
pe(if(V, L1, L2), PEnv, NewOp) :-
    plookup(V, PEnv, Val),
    (Val = const(C) ->
        (C = 0 ->
            L = L2
        ;
            L = L1
        ),
        do_pe(L, PEnv, LR),
        NewOp = jump(LR)
    ;
        do_pe(L1, PEnv, L1R),
        do_pe(L2, PEnv, L2R),
        NewOp = if(V, L1R, L2R)
    ).
First we look up the value of the condition variable. If it is a constant, we can produce better code, because we know statically that only one path is reachable. Thus we produce code for that path, and then emit an unconditional jump there. If the condition variable is not known at partial evaluation time, we need to partially evaluate both paths and produce a conditional jump in the residual code.
This rule is the one that causes the partial evaluator to potentially do much more work than the interpreter, because after an if sometimes both paths need to be explored. In the worst case this process never stops, so a real partial evaluator would need to ensure somehow that it terminates. There are many algorithms for doing that, but I will ignore this problem here.
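As a small (contrived) example of my own, consider a counting loop with an unknown upper bound n:

block(count, op2(i, add, var(i), const(1),
             op2(c, eq, var(i), var(n),
             if(c, count_done, count)))).
block(count_done, print_and_stop(var(i))).

If i is known, say 0, but n is not, then every trip around the loop produces a new partial environment (i/1, then i/2, and so on), so no two specializations ever share a partial environment and the partial evaluator keeps producing fresh residual blocks forever.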
Now we need to understand what the do_pe predicate is doing. Its most important task is to make sure that we don't do the same work twice by memoizing code that was already partially evaluated in the past. For that it keeps a mapping of Label, Partial Environment to Label of the residual code:
% Note: code_cache/3 and block/2 need to be declared dynamic (e.g. with
% :- dynamic code_cache/3, block/2.) so that assert/1 is allowed to add
% new clauses to them.
do_pe(L, PEnv, LR) :-
    (code_cache(L, PEnv, LR) ->
        true
    ;
        gensym(L, LR),
        assert(code_cache(L, PEnv, LR)),
        block(L, Code),
        pe(Code, PEnv, Residual),
        assert(block(LR, Residual))
    ).
If the code cache indicates that label L was already partially evaluated with partial environment PEnv, the label of the previously generated residual code is returned directly. Otherwise, a new label LR is generated with gensym, the code cache is informed of that new label with assert, then the block is partially evaluated and the residual code is added to the database under the new label.
For those who know partial evaluation terminology: this partial evaluator is a polyvariant online partial evaluator. "Polyvariant" means that for every label, several specialized versions of the block can be generated. "Online" means that no preprocessing is done before the partial evaluator runs.
Partial Evaluation Example
With this code we can look at the classical example of partial evaluation (it's probably the "Hello World" of partial evaluation). We can ask the partial evaluator to compute a power function, where the exponent y is a fixed number, e.g. 5, and the base x is unknown:
?- do_pe(power, [y/5], LR).
LR = power1.
To find out which code was produced, we can use listing:
?- listing(code_cache).
code_cache(power, [y/5], power1).
code_cache(power_rec, [y/5, res/1], power_rec1).
code_cache(power_rec, [y/4], power_rec2).
code_cache(power_rec, [y/3], power_rec3).
code_cache(power_rec, [y/2], power_rec4).
code_cache(power_rec, [y/1], power_rec5).
code_cache(power_done, [y/0], power_done1).

?- listing(block).
.... the block definition of the user program ....
block(power_done1, print_and_stop(var(res))).
block(power_rec5, op2(res, mul, var(res), var(x), jump(power_done1))).
block(power_rec4, op2(res, mul, var(res), var(x), jump(power_rec5))).
block(power_rec3, op2(res, mul, var(res), var(x), jump(power_rec4))).
block(power_rec2, op2(res, mul, var(res), var(x), jump(power_rec3))).
block(power_rec1, op2(res, mul, const(1), var(x), jump(power_rec2))).
block(power1, jump(power_rec1)).
The code_cache tells which residual labels correspond to which original labels under which partial environments. Thus, power1 contains the code of power under the assumption that y is 5. Looking at the block listing, the label power1 corresponds to code that simply multiplies res by x five times without using the variable y at all. The loop that was present in the original program has been fully unrolled, and the loop variable y has disappeared. Hopefully this is faster than the original program.
In this blog post we saw an interpreter for a simple flow graph language in Prolog, together with a partial evaluator for it. The partial evaluator essentially duplicates every rule of the interpreter. If all the arguments of the current operation are known, it acts like the interpreter, otherwise it simply copies the operation into the residual code.
Partial evaluation can be used for a variety of applications, but the most commonly cited one is that of applying it to an interpreter. To do that, the program that the interpreter runs is assumed to be constant by the partial evaluator. Thus a specialized version of the interpreter is produced that does not use the input program at all. That residual code can be seen as a compiled version of the input program.
In the next blog post in this series we will look at writing a simple tracer for the same flowgraph language.
I would like to inform you that there is a very interesting opportunity for doing an internship at NCAR in the lovely town of Boulder, situated in the foothills of the Rocky Mountains. Before you read on, make sure you:
- are a student of a US University, who is legally eligible to work in the US
- are at least finishing second year this year
- apply before February 3rd.
The internship itself will focus on using PyPy (in some way) to provide a high performance numeric kernel for an atmospheric model, and measuring how fast we can go. This is very much in line with what the current effort on NumPy in PyPy is about. The internship will be mentored by Davide del Vento and I hope to have some influence over where it goes myself :-)
A few interesting links:
- program website
- internship proposal - note that the actual roadmap is very flexible, as long as it's a numeric kernel of an atmospheric model using PyPy.
Feel free to contact Davide for details about the proposal and pypy-dev or me directly for details about PyPy.
Here is an update on the previous blog post about the Global Interpreter Lock (GIL). In the five months since then, the point of view has changed quite a bit.
Let me remind you that the GIL is the technique used in both CPython and PyPy to safely run multi-threaded programs: it is a global lock that prevents multiple threads from actually running at the same time. The reason to do that is that it would have disastrous effects in the interpreter if several threads access the same object concurrently --- to the point that in CPython even just manipulating the object's reference counter needs to be protected by the lock.
So far, the ultimate goal to enable true multi-CPU usage has been to remove the infamous GIL from the interpreter, so that multiple threads could actually run in parallel. It's a lot of work, but this has been done in Jython. The reason that it has not been done in CPython so far is that it's even more work: we would need to care not only about carefully adding fine-grained locks everywhere, but also about reference counting; and there are a lot more C extension modules that would need care, too. And we don't have locking primitives as performant as Java's, which have been hand-tuned for ages (e.g. to use help from the JIT compiler).
But we think we have a plan to implement a different model for using multiple cores. Believe it or not, this is better than just removing the GIL from PyPy. You might get to use all your cores without ever writing threads.
You would instead just use some event dispatcher, say from Twisted, from Stackless, or from your favorite GUI; or just write your own. From there, you (or someone else) would add some minimal extra code to the event dispatcher's source code, to exploit the new transactional features offered by PyPy. Then you would run your program on a special version of PyPy, and voilà: you get some form of automatic parallelization. Sounds magic, but the basic idea is simple: start handling multiple events in parallel, giving each one its own transaction. More about it later.
First, why would this be better than "just" removing the GIL? Because using threads can be a mess in any complex program. Some authors (e.g. Lee) have argued that the reason is that threads are fundamentally non-deterministic. This makes it very hard to reason about them. Basically the programmer needs to "trim" down the non-determinism (e.g. by adding locks, semaphores, etc.), and it's hard to be sure when he's got a sufficiently deterministic result, if only because he can't write exhaustive tests for it.
By contrast, consider a Twisted program. It's not a multi-threaded program, which means that it handles the "events" one after the other. The exact ordering of the events is not really deterministic, because they often correspond to external events; but that's the only source of non-determinism. The actual handling of each event occurs in a nicely deterministic way, and most importantly, not in parallel with the handling of other events. The same is true about other libraries like GUI toolkits, gevent, or Stackless.
(Of course the Twisted and the Stackless models, to cite only these two, are quite different from each other; but they have in common the fact that they are not multi-threaded, and based instead on "events" --- which in the Stackless case means running a tasklet from one switch() point to the next one.)
These two models --- threads or events --- are the two main models we have right now. The latter is more used in Python, because it is much simpler to use than the former, and the former doesn't give any benefit because of the GIL. A third model, which is the only one that gives multi-core benefits, is to use multiple processes, and do inter-process communication.
Consider the case of a big program that has arbitrary complicated dependencies. Even assuming a GIL-less Python, this is likely enough to prevent the programmer from even starting a multi-threaded rewrite, because it would require a huge mess of locks. He could also consider using multiple processes instead, but the result is annoying as well: the complicated dependencies translate into a huge mess of inter-process synchronization.
The problem can also be down-sized to very small programs, like the kind of hacks that you do and forget about. In this case, the dependencies might be simpler, but you still have to learn and use subtle locking patterns or a complex inter-process library, which is overkill for the purpose.
(This is similar to how explicit memory management is not very hard for small programs --- but still, nowadays a lot of people agree that automatic memory management is easier for programs of all sizes. I think the same will eventually be true for using multiple CPUs, but the correct solution will take time to mature, like garbage collectors did. This post is a step in hopefully the right direction :-))
Let me introduce the notion of independent events: two events are independent if they don't touch the same set of objects. In a multi-threaded world, it means that they can be executed in parallel without needing any lock to ensure correctness.
Events might also be mostly independent, i.e. they rarely access the same object concurrently. Of course, in a multi-threaded world we would still need locks to ensure correctness, but the point is that the locks are rarely causing pauses: lock contention is low.
Consider again the Twisted example I gave above. There are often several events pending in the dispatch queue (assuming the program is using 100% of our single usable CPU, otherwise the whole discussion is moot). The case I am interested in is the case in which these events are generally mostly independent, i.e. we expect few conflicts between them. However they don't have to be proved independent. In fact it is fine if they have arbitrarily complicated dependencies as described above. The point is the expected common case. Imagine that you have a GIL-less Python and that you can, by a wave of your hand, have all the careful locking mess magically done. Then what I mean here is the case in which such a theoretical program would run mostly in parallel on multiple cores, without waiting too often on the locks.
In this case, the solution I'm proposing is that with minimal tweaks in the event dispatch loop, we can handle multiple events on multiple threads, each in its own transaction. A transaction is basically a tentative execution of the corresponding piece of code: if we detect conflicts with other concurrently executing transactions, we abort the whole transaction and restart it from scratch.
By now, the fact that it can basically work should be clear: multiple transactions will only get into conflict when modifying the same data structures, which is the case where the magical wand above would have put locks. If the magical program could progress without too many locks, then the transactional program can progress without too many conflicts. In a way, you get even more than what the magical program can give you: each event is dispatched in its own transaction, which means that from each event's point of view, we have the illusion that nobody else is running concurrently. This is exactly what all existing Twisted-/Stackless-/etc.-based programs are assuming.
Note that this solution, without transactions, already exists in some other languages: for example, Erlang is all about independent events. This is the simple case where we can just run them on multiple cores, knowing by construction of the language that you can't get conflicts. Of course, it doesn't work for Python or for a lot of other languages. From that point of view, what I'm suggesting is merely that transactional memory could be a good model to cope with the risks of conflicts that come from not having a special-made language.
Of course, transactional memory (TM) is not a perfect solution either. Right now, the biggest issue is the performance hit that comes from the software implementation (STM). In time, hardware support (HTM) is likely to show up and help mitigate the problem; but I won't deny the fact that in some cases, because it's simple enough and/or because you really need the top performance, TM is not the best solution.
Also, the explanations above are silent on what is a hard point for TM, namely system calls. The basic general solution is to suspend other transactions as soon as a transaction does its first system call, so that we are sure that the transaction will succeed. Of course this solution is far from optimal. Interestingly, it's possible to do better on a case-by-case basis: for example, by adding in-process buffers, we can improve the situation for sockets, by having recv() store in a buffer what is received so that it can be re-recv()-ed later if the transaction is aborted; similarly, send() or writes to log files can be delayed until we are sure that the transaction will commit.
From my point of view, the most important point is that the TM solution comes from the correct side of the "determinism" scale. With threads, you have to prune down non-determinism. With TM, you start from a mostly deterministic point, and if needed, you add non-determinism. The reason you would want to do so is to make the transactions shorter: shorter transactions have less risks of conflicts, and when there are conflicts, less things to redo. So making transactions shorter increases the parallelism that your program can achieve, while at the same time requiring more care.
In terms of an event-driven model, the equivalent would be to divide the response of a big processing event into several events that are handled one after the other: for example, the first event sets things up and fires the second event, which does the actual computation; and afterwards a third event writes the results back. As a result, the second event's transaction has little risks of getting aborted. On the other hand, the writing back needs to be aware of the fact that it's not in the same transaction as the original setting up, which means that other unrelated transactions may have run in-between.
These, and others, are the problems of the TM approach. They are "new" problems, too, in the sense that the existing ways of programming don't have these problems.
Still, as you have guessed, I think that it is overall a win, and possibly a big win --- a win that might be on the same scale for the age of multiple CPUs as automatic garbage collection was 20 years ago for the age of RAM size explosion.
Stay tuned for more!
--- Armin (and reviews by Antonio and Fijal)
UPDATE: please look at the tiny transaction module I wrote as an example. The idea is to have the same interface as this module, but implemented differently. By making use of transactional memory internally, it should be possible to safely run on multiple CPUs while keeping the very same programmer interface.
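To make the idea more concrete, here is a small usage sketch of mine, assuming an interface along the lines of transaction.add(callback, *args) and transaction.run(); the exact API of the module linked above may differ:

import transaction

squares = []

def handle_event(n):
    # From this handler's point of view nothing else runs concurrently;
    # conflicting transactions would simply be retried behind the scenes.
    squares.append(n * n)

for n in range(10):
    transaction.add(handle_event, n)   # schedule a handler, don't run it yet
transaction.run()                      # run all scheduled handlers, possibly in parallel

print(sorted(squares))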
We're excited to let you know about some of the great progress we've made on NumPyPy: both completeness and performance. In this blog entry we mostly will talk about performance and how much progress we have made so far.
Word of warning: this work is in progress -- we're maybe half way to where we want to be and there are many trivial and not so trivial optimizations to be written. (For example, we haven't even started to implement important optimizations, like vectorization.)
We chose a Laplace equation solver, based on SciPy's PerformancePython wiki. Unfortunately, the different implementations on the wiki page accidentally use two different algorithms, which have different convergence behaviour and very different performance characteristics on modern computers. As a result, we implemented our own versions in both C and Python (with and without NumPy). The full source can be found in fijal's hack repo; all these benchmarks were performed at revision 18502dbbcdb3.
First, let me describe various algorithms used. Note that some of them contain PyPy-specific hacks to work around limitations in the current implementation. These hacks will go away eventually and the performance will improve. Numerically the algorithms used are identical, however exact data layout in memory differs between them.
A note about all the benchmarks: they each were run once, but the performance is very stable across runs.
Starting with the C version: it implements a trivial Laplace solver using two loops and double-reference memory (an array of int*). The double reference does not matter for performance, and the two algorithms are implemented in inline-laplace.c and laplace.c. They were both compiled with gcc 4.4.5 at -O3. The inline version modifies the array in place, while the non-inline version stores the results in a copy. That makes them converge at different rates, hence the different number of iterations.
A straightforward version of those in Python is implemented in laplace.py using, respectively, inline_slow_time_step and slow_time_step. slow_2_time_step does the same thing, except that it copies arrays in place instead of creating new copies. The table below compares running PyPy against C:
| bench         | number of iterations | time per iteration |
| slow 2 python | 219                  | 14ms               |
An important thing to notice is that the data dependency of the inline version causes a huge slowdown for the C versions. This is not a severe disadvantage for us though -- the brain-dead Python version takes longer and PyPy is not able to take advantage of the knowledge that the data is independent. The results are in the same ballpark as the C versions -- 15% to 170% slower -- but the algorithm one chooses matters more than the language. By comparison, the slow versions take about 5.75s per iteration each on CPython 2.6 and, by estimation, are about 200x slower than the PyPy equivalent, if I had the patience to measure the full run.
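For reference, the pure Python versions boil down to a pair of nested loops over the grid. Here is a rough sketch of mine; the actual slow_time_step in laplace.py differs in details such as how the error is computed:

def slow_time_step_sketch(u, dx2, dy2):
    # One relaxation step: read from the old grid, write the results into a
    # fresh copy.  `u` is a list of lists of floats.
    nx, ny = len(u), len(u[0])
    dnr_inv = 0.5 / (dx2 + dy2)
    new_u = [row[:] for row in u]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            new_u[i][j] = ((u[i-1][j] + u[i+1][j]) * dy2 +
                           (u[i][j-1] + u[i][j+1]) * dx2) * dnr_inv
    return new_u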
The next step is to use NumPy expressions. The first problem we run into is that computing the error requires walking the entire array a second time. This is fairly inefficient in terms of cache access, so I took the liberty of computing the errors every 15 steps. This results in the convergence being rounded to the nearest 15 iterations, but speeds things up considerably. numeric_time_step takes the most braindead approach of replacing the array with itself, like this:
u[1:-1, 1:-1] = ((u[0:-2, 1:-1] + u[2:, 1:-1])*dy2 +
                 (u[1:-1, 0:-2] + u[1:-1, 2:])*dx2)*dnr_inv
We need 3 arrays here -- one is an intermediate (PyPy only needs one, for all of those subexpressions), one is a copy for computing the error, and one is the result. This works automatically because in NumPy + or * creates an intermediate, while NumPyPy avoids allocating the intermediate if possible.
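To make that difference concrete, here is a conceptual sketch of mine (not PyPy's actual implementation): if operations build a lazy expression tree instead of full arrays, the whole expression can be computed element by element in one pass when the result is finally needed, without allocating temporaries.

class Node(object):
    def __add__(self, other): return BinOp(lambda x, y: x + y, self, other)
    def __mul__(self, other): return BinOp(lambda x, y: x * y, self, other)

class Leaf(Node):
    def __init__(self, data): self.data = data
    def compute(self, i): return self.data[i]

class BinOp(Node):
    def __init__(self, fn, a, b): self.fn, self.a, self.b = fn, a, b
    def compute(self, i): return self.fn(self.a.compute(i), self.b.compute(i))

a = Leaf([1.0, 2.0, 3.0])
b = Leaf([4.0, 5.0, 6.0])
expr = (a + b) * a                            # just builds a tree, no arrays allocated
result = [expr.compute(i) for i in range(3)]  # one pass, no temporaries
print(result)                                 # [5.0, 14.0, 27.0]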
numeric_2_time_step works in pretty much the same way:
src = self.u
self.u = src.copy()
self.u[1:-1, 1:-1] = ((src[0:-2, 1:-1] + src[2:, 1:-1])*dy2 +
                      (src[1:-1, 0:-2] + src[1:-1, 2:])*dx2)*dnr_inv
except the copy is now explicit rather than implicit.
numeric_3_time_step does the same thing, but notice that one doesn't have to copy the entire array: it's enough to copy the border pieces and fill the rest with zeros:
src = self.u
self.u = numpy.zeros((self.nx, self.ny), 'd')
self.u[0] = src[0]
self.u[-1] = src[-1]
self.u[:, 0] = src[:, 0]
self.u[:, -1] = src[:, -1]
self.u[1:-1, 1:-1] = ((src[0:-2, 1:-1] + src[2:, 1:-1])*dy2 +
                      (src[1:-1, 0:-2] + src[1:-1, 2:])*dx2)*dnr_inv
numeric_4_time_step is the one that tries hardest to resemble the C version. Instead of doing an array copy, it actually notices that one can alternate between two arrays. This is exactly what the C version does. The remove_invalidates call is a PyPy specific hack - we hope to remove this call in the near future, but, in short, it promises "I don't have any unbuilt intermediates that depend on the value of the argument", which means one doesn't have to compute sub-expressions one is not actually using:
remove_invalidates(self.old_u)
remove_invalidates(self.u)
self.old_u[:,:] = self.u
src = self.old_u
self.u[1:-1, 1:-1] = ((src[0:-2, 1:-1] + src[2:, 1:-1])*dy2 +
                      (src[1:-1, 0:-2] + src[1:-1, 2:])*dx2)*dnr_inv
This one is the most comparable to the C version.
numeric_5_time_step does the same thing, but notices that one doesn't have to copy the entire array: it's enough to just copy the edges. This is an optimization that was not done in the C version:
remove_invalidates(self.old_u)
remove_invalidates(self.u)
src = self.u
self.old_u, self.u = self.u, self.old_u
self.u[0] = src[0]
self.u[-1] = src[-1]
self.u[:, 0] = src[:, 0]
self.u[:, -1] = src[:, -1]
self.u[1:-1, 1:-1] = ((src[0:-2, 1:-1] + src[2:, 1:-1])*dy2 +
                      (src[1:-1, 0:-2] + src[1:-1, 2:])*dx2)*dnr_inv
Let's look at the table of runs. As before, gcc 4.4.5, compiled at -O3, and PyPy nightly 7bb8b38d8563, on an x86-64 machine. All of the numeric methods run for 226 steps, slightly more than the 219 above, since the error is only computed every 15 steps.
We think that these preliminary results are pretty good. They're not as fast as the C version (or as fast as we'd like them to be), but we're already much faster than NumPy on CPython -- almost always by more than 2x on this relatively real-world example. This is not the end, though. In fact, it's hardly the beginning! As we continue work, we hope to make even more use of the high level information that we have. Looking at the assembler generated by gcc for this example, it's pretty clear we can outperform it thanks to better aliasing information and hence better possibilities for vectorization. Stay tuned.
EDIT: fixed the benchmark name
EDIT2: added info that first table is about PyPy
PyPy Leysin Winter Sprint: 15-22nd January 2012
The next PyPy sprint will be in Leysin, Switzerland, for the eighth time. This is a fully public sprint: newcomers and topics other than those proposed below are welcome.
Goals and topics of the sprint
- Py3k: work towards supporting Python 3 in PyPy
- NumPyPy: work towards supporting the numpy module in PyPy
- JIT backends: integrate tests for ARM; look at the PowerPC 64; maybe try again to write an LLVM- or GCC-based one
- STM and STM-related topics; or the Concurrent Mark-n-Sweep GC
- And as usual, the main side goal is to have fun in winter sports :-) We can take a day off for ski.
The work days should be 15-21 January 2012 (Sunday-Saturday). The official plans are for people to arrive on the 14th or the 15th, and to leave on the 22nd.
Interested? Read more...
PyCon 2012 is coming up in just a few short months, and PyPy will be well
represented there. We'll be delivering a tutorial, two talks, plus we'll be
around for the sprints.
Here are the abstracts for the tutorials and talks:
How to get the most out of your PyPy, by Maciej Fijalkowski, Alex Gaynor
and Armin Rigo: For many applications PyPy can provide performance benefits
right out of the box. However, little details can push your application to
perform much better. In this tutorial we'll give you insights on how to push
PyPy to its limits. We'll focus on understanding the performance
characteristics of PyPy, and learning the analysis tools in order to maximize
your applications' performance. This is the tutorial.
Why PyPy by example, by Maciej Fijalkowski, Alex Gaynor and Armin Rigo:
One of the goals of PyPy is to make existing Python code faster; however an
even broader goal was to make it possible to write things in Python that
previously would have needed to be written in C or another low-level
language. This talk will show examples of this, and describe how they
represent the tremendous progress PyPy has made, and what it means for
people looking at PyPy.
How the PyPy JIT works, by Benjamin Peterson: The Python community is
abuzz about the major speed gains PyPy can offer for pure Python code. But how
does the PyPy JIT actually work? This talk will discuss how the PyPy JIT is
implemented. It will include descriptions of the tracing, optimization, and
assembly generation phases. I will demonstrate each step with an example loop.
If you have any questions let us know! We look forward to seeing people at
PyCon and chatting about PyPy and the entire Python ecosystem.
See you there,
Maciej Fijalkowski, Alex Gaynor, Benjamin Peterson, Armin Rigo, and the entire PyPy team
Big fat warning This is just a proof of concept. It barely works. There are missing pieces left and right, which were replaced with hacks so I can get this to run and prove it's possible. Don't try this at home, especially your home. You have been warned.
There has been a lot of talking about PyPy not integrating well with the current scientific Python ecosystem, and numpypy (a NumPy reimplementation on top of pypy) was dubbed "a fancy array library". I'm going to show that integration with this ecosystem is possible with our design.
First, the demo:
#!/usr/bin/env pypy
# numpy, pypy version
import numpypy as numpy
# DRAGONS LIVE THERE (fortunately hidden)
from embed.emb import import_mod

pylab = import_mod('matplotlib.pylab')

if __name__ == '__main__':
    a = numpy.arange(100, dtype=int)
    b = numpy.sin(a)
    pylab.plot(a, b)
    pylab.show()
And you get:
Now, how to reproduce it:
You need a PyPy without cpyext, I did not find a linker that would support overriding symbols. Right now there are no nightlies like this, so you have to compile it yourself, like:
./translate.py -Ojit targetpypystandalone.py --withoutmod-cpyext
That would give you a PyPy that's unable to load some libraries like PIL, but perfectly working otherwise.
Speaking of which, you need a reasonably recent PyPy.
The approach is generally portable; however, the implementation has been tested only on 64-bit Linux. A few tweaks might be required.
You need to install python2.6, the python2.6 development headers, and have numpy and matplotlib installed on that python.
You need a checkout of my hacks directory and put embedded on your PYTHONPATH, your pypy checkout also has to be on the PYTHONPATH.
Er wait, what happened?
We did not reimplement matplotlib on top of PyPy. What we did is embed CPython inside of PyPy using ctypes: we instantiate it and follow the embedding tutorial for CPython. Since numpy arrays are not movable, we're able to pass around an integer that represents the memory address of the array data and reconstruct the array in the embedded interpreter. Hence, with relatively little effort, we managed to reuse the same array data on both sides to plot an array. Easy, no?
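To give a flavour of the trick, here is a heavily simplified sketch of mine (not the actual embed.emb code; the calls are the standard CPython embedding API, but the library name is an assumption and the real bridge does considerably more work to pass the array data pointer across):

import ctypes

# Load the CPython shared library into the PyPy process and start an
# embedded interpreter, as described in the CPython embedding tutorial.
libpython = ctypes.CDLL("libpython2.6.so.1.0", mode=ctypes.RTLD_GLOBAL)
libpython.Py_Initialize()

# The real code builds a numpy array on the CPython side that reuses the
# data pointer passed over from numpypy; here we just run a statement
# inside the embedded interpreter to show that it is alive.
libpython.PyRun_SimpleString("import numpy; print numpy.__version__")

libpython.Py_Finalize()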
This approach can be extended to support anything that's not too tied with python objects. SciPy and matplotlib both fall into the same category but probably the same strategy can be applied to anything, like GTK or QT. It's just a matter of extending a hack into a working library.
To summarize, while we're busy making numpypy better and faster, it seems that all external libraries on the C side can be done using an embedded Python interpreter with relatively little effort. To get to that point, I spent a day and a half to learn how to embed CPython, with very little prior experience in the CPython APIs. Of course you should still keep as much as possible in PyPy to make it nice and fast :)
We're pleased to announce the 1.7 release of PyPy. As has become a habit, this release brings a lot of bugfixes and performance improvements over the 1.6 release. However, unlike the previous releases, the focus has been on widening the "sweet spot" of PyPy. That is, the classes of Python code that PyPy can greatly speed up have been vastly extended with this release. You can download the 1.7 release here:
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 1.7 and cpython 2.7.1 performance comparison) due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 32/64 or Windows 32. Windows 64 work is ongoing, but not yet natively supported.
The main topic of this release is widening the range of code which PyPy can greatly speed up. On average on our benchmark suite, PyPy 1.7 is around 30% faster than PyPy 1.6 and up to 20 times faster on some benchmarks.
Numerous performance improvements. There are too many examples of Python constructs that should now behave faster to list them all.
Bugfixes and compatibility fixes with CPython.
PyPy now comes with stackless features enabled by default. However, any loop using stackless features will interrupt the JIT for now, so no real performance improvement for stackless-based programs. Contact pypy-dev for info how to help on removing this restriction.
NumPy effort in PyPy was renamed numpypy. In order to try using it, simply write:
import numpypy as numpy
at the beginning of your program. There has been huge progress on numpy in PyPy since 1.6, the main feature being the implementation of dtypes.
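A small usage sketch of mine (the exact set of supported dtypes and array operations in 1.7 may differ):

import numpypy as numpy

a = numpy.array([1, 2, 3], dtype=numpy.float64)
print(a.dtype)               # float64
print((a * 2.0 + 1.0)[2])    # 7.0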
JSON encoder (but not decoder) has been replaced with a new one. This one is written in pure Python, but is known to outperform CPython's C extension up to 2 times in some cases. It's about 20 times faster than the one that we had in 1.6.
The memory footprint of some of our RPython modules has been drastically improved. This should impact any applications using for example cryptography, like tornado.
There was some progress in exposing even more CPython C API via cpyext.
Things that didn't make it, expect in 1.8 soon
There is ongoing work which, while it didn't make it into this release, is probably worth mentioning here. This is what you should probably expect in 1.8 some time soon:
- Specialized list implementation. There is a branch that implements lists of integers/floats/strings as compactly as array.array. This should drastically improve performance/memory impact of some applications
- NumPy effort is progressing forward, with multi-dimensional arrays coming soon.
- There are two brand new JIT assembler backends, notably for the PowerPC and ARM processors.
It's maybe worth mentioning that we're running fundraising campaigns for NumPy effort in PyPy and for Python 3 in PyPy. In case you want to see any of those happen faster, we urge you to donate to numpy proposal or py3k proposal. In case you want PyPy to progress, but you trust us with the general direction, you can always donate to the general pot.
Maciej Fijałkowski, Armin Rigo and the entire PyPy team
In the past week, we have been busy hacking on PyPy at the Gothenburg sprint, the second one of 2011. The sprint was held at Laura's and Jacob's place, and here is a brief report of what happened.
On the first day we welcomed Mark Pearse, who was new to PyPy and at his first sprint. Mark worked the whole sprint on the new SpecialisedTuple branch, whose aim is to have a special implementation for small 2-item and 3-item tuples of primitive types (e.g., ints or floats) to save memory. Mark paired with Antonio for a couple of days, then he continued alone and did an amazing job. He even learned how to properly do Test Driven Development :-).
Antonio spent a couple of days investigating whether it is possible to use application checkpoint libraries such as BLCR and DMTCP to save the state of the PyPy interpreter between subsequent runs, thus saving also the JIT-compiled code to reduce the warmup time. The conclusion is that these are interesting technologies, but more work would be needed (either on the PyPy side or on the checkpoint library side) before it can have a practical usage for PyPy users.
Then, Antonio spent most of the rest of the sprint working on his ffistruct branch, whose aim is to provide a very JIT-friendly way to interact with C structures, and eventually implement ctypes.Structure on top of that. The "cool part" of the branch is already done, and the JIT already can compile set/get of fields into a single fast assembly instruction, about 400 times faster than the corresponding ctypes code. What is still left to do is to add a nicer syntax (which is easy) and to implement all the ctypes peculiarities (which is tedious, at best :-)).
As usual, Armin did tons of different stuff, including fixing a JIT bug, improving the performance of file.readlines() and working on the STM branch (for Software Transactional Memory), which is now able to run RPython multithreaded programs using software transactions (as long as they don't fill up all the memory, because support for the GC is still missing :-)). Finally, he worked on improving the Windows version of PyPy. While doing so he discovered, together with Anto, a terrible bug which led to a continuous leak of stack space because the JIT called some functions using the wrong calling convention.
Håkan, with some help from Armin, worked on the jit-targets branch, whose goal is to heavily refactor the way the traces are internally represented by the JIT, so that in the end we can produce (even :-)) better code than what we do nowadays. More details in this mail.
Andrew Dalke worked on a way to integrate PyPy with FORTRAN libraries, and in particular the ones which are wrapped by NumPy and SciPy: in doing so, he wrote f2pypy, which is similar to the existing f2py, but instead of producing a CPython extension module it produces a pure Python module based on ctypes. More work is needed before it can be considered complete, but f2pypy is already able to produce a wrapper for BLAS which passes most of the tests under CPython, although there's still work left to get it working for PyPy.
|Armin and Håkan with Laura's "5x faster" cake|
But, all in all, the best part of the sprint has been the cake that Laura baked to celebrate the "5x faster than CPython" achievement. Well, actually our speed page reports "only" 4.7x, but that's because in the meantime we switched from comparing against CPython 2.6 to comparing against CPython 2.7, which is slightly faster. We are confident that we will reach the 5x goal again, and that will be the perfect excuse to eat another cake :-)