
Porting the JIT to CLI (part 3)

In my two previous posts, we talked about the PyPy JIT generator: we saw that it can produce huge speedups, and we looked at how its backend-independent frontend works.

In this post, we will look closer at the internals of the CLI JIT backend; in particular, we will see how we work around some serious limitations of the platform, and why these workarounds didn't have any serious impact on the performance of our toy virtual machine.

Graphs, blocks, links

One of the core aspects of the PyPy translator is the concept of flow graphs: a flow graph is a data structure that represents the code we are operating on. It is composed of a set of basic blocks, each block containing a sequence of operations; blocks are connected together by links, and each link can carry a variable number of arguments whose values are passed to the target block. In case a block has more than one outgoing link, the one to follow is selected by looking at the value of a designated variable (the exitswitch), thus making it possible to implement conditional jumps. For a more complete description of the flow graph model, check the documentation.

As we saw in the previous post, the generated JIT compiler makes heavy use of flexswitches to generate efficient code, continuously intermixing JIT-compile time and runtime.

In terms of graphs, we can think of a flexswitch as a special block whose links change over time. In particular, adding a new case to the flexswitch is equivalent to creating a link whose target is a new block where the just-generated code starts. Thus, the graph grows over time, as shown by the following images:

In the images above, the block containing the flexswitch is colored in cyan. In the first picture, there is only one block connected to the flexswitch: this block contains the code to restart the JIT compilation. The second picture shows the graph after the first case has been added: you can clearly see that a new block has been created and attached to the flexswitch. Finally, the third picture shows the graph after a while, with a lot of new blocks attached.

Translating graphs to CLI

Conceptually, the goal of the CLI JIT backend is to express these graphs in terms of CLI bytecode.

Translating a single block is easy, as it is just a list of sequential operations, and it's straightforward to map each operation to the equivalent CLI opcode or to a call to a helper method. Moreover, we need a way to express the links between the various basic blocks: if the links are known in advance, rendering them is as easy as emitting a (potentially conditional) jump to the target block. Thus, we won't discuss this part in detail, as it is quite straightforward.

The hard part is how to implement flexswitches: at the time when we are emitting the code, some of the blocks of this growable graph don't even exist: how can we make a jump to a non-existent block of code? For backends that emit assembly code, this is rather easy: when they need to add a new case to the flexswitch, they can just patch the existing code to insert a jump to a newly allocated area of memory, where the new code is being generated.

For CLI this approach is not feasible, as the VM will never allow us to modify existing code. Thus, we need to think of a different approach.

Graphs and methods

In .NET, the basic unit of compilation is the method: the only way to execute some bytecode is to wrap it into a method. Moreover, it is not possible to execute a method until it has been completed, and after that point it is no longer possible to add new code to it.

Because of all these constraints, we cannot simply map each graph to its own method, since we saw that our graphs can grow after they have already been executed a few times.

Hence, we need to distinguish between the two concepts:

  • a graph is the logical unit of code as seen by the JIT compiler: concretely, the CLI JIT backend renders it as one or more methods;
  • a method is a collection of basic blocks; each method has a so-called parent graph, i.e. the graph its blocks logically belong to.

The first method of a graph is called the main method (which has nothing to do with the Main static methods found in .exe files); the other methods are called children methods.

When we want to add a new case to the flexswitch, we create a method containing all the new code; then we wrap the method inside a delegate (the .NET equivalent of a function pointer) and pass it to the flexswitch, so that it can later invoke it.

The hard bit: non-local links

Using this approach, after a while the blocks of our original graph are scattered over a lot of different methods; however, there are no constraints on how these blocks can be linked together, so it happens that there are links between blocks which are not in the same method. In the following, we will refer to them as non-local links.

If the non-local block we want to jump to happens to be at the beginning of its containing method, it is enough to invoke the method; but what if we want to jump somewhere in the middle? What we really want is to produce a method which has multiple entry points; again, doing this in assembly would be trivial, but the virtual machine does not provide any support for it, so we need a workaround.

Each method in a graph is assigned a unique 16-bit method id; each block in a method is assigned a progressive 16-bit block number. From these two numbers, we can compute the block id as an unsigned integer, by storing the method id in the upper 16 bits and the block number in the lower 16 bits. By construction, the block id is guaranteed to be unique within the graph.
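
For illustration, here is a minimal sketch of how a block id could be packed and unpacked; these helpers are invented for this post and are not part of the actual backend:

static class BlockIdExample
{
    // method id in the upper 16 bits, block number in the lower 16 bits
    public static uint Make(ushort methodid, ushort blocknum)
    {
        return ((uint)methodid << 16) | blocknum;
    }

    public static ushort MethodId(uint blockid) { return (ushort)(blockid >> 16); }
    public static ushort BlockNum(uint blockid) { return (ushort)(blockid & 0xFFFFu); }
}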

The following picture shows a graph composed of three methods; the id of each method is shown in red, while the block ids are shown in red (for the method id part) and black (for the block number part). The graph contains three non-local links; in particular, note the link between blocks 0x00020001 and 0x00010001, which connects two blocks that reside in different methods.

Every method contains a special dispatch block (not shown in the picture above), whose goal is to jump to the specified block number inside the method itself. The first argument of a child method is always a block id; when the method starts, it immediately jumps to the dispatch block, and thus to the desired block.

For example, suppose we have a method which contains 3 blocks numbered 0, 1 and 2; here is what its dispatch block looks like; for simplicity it is shown as C# code, but it is actually generated as IL bytecode:

// dispatch block
int methodid = (int)((blockid & 0xFFFF0000) >> 16); // take the upper 16 bits
int blocknum = (int)(blockid & 0x0000FFFF);         // take the lower 16 bits

if (methodid != MY_METHOD_ID) {
    // jump_to_unknown block
    ...
}

switch(blocknum) {
    case 0:
        goto block0;
    case 1:
        goto block1;
    case 2:
        goto block2;
    default:
        throw new Exception("Invalid block id");
}

Whenever we want to jump to a non-local block, it is enough to store the block id in the appropriate variable and jump to the dispatch block. If the block resides in a different method, the jump_to_unknown block is entered; this special block is implemented differently by the main method and the child methods, as we will see soon.

Each time a new method is added to the graph, we build a delegate for it, and store it in a special array called method_map; since we assign method ids sequentially starting from 0, we are sure that to fetch the method whose id is n we can simply load the n-th element of the array.
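
To make this concrete, here is a minimal sketch of how such a registration could work; the class and names are invented for this post, and it assumes the FlexSwitchCase delegate type introduced just below:

class MethodRegistry
{
    FlexSwitchCase[] method_map = new FlexSwitchCase[16];
    int next_method_id = 0;

    // returns the method id assigned to the newly generated method
    public int Register(FlexSwitchCase meth)
    {
        if (next_method_id == method_map.Length)
            System.Array.Resize(ref method_map, method_map.Length * 2); // grow if needed
        method_map[next_method_id] = meth;
        return next_method_id++;
    }
}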

The jump_to_unknown block of the main method uses this array to select the right method, and calls it (FlexSwitchCase is the type of delegates for all children methods):

// jump_to_unknown block of the main method
FlexSwitchCase meth = method_map[methodid];
blockid = meth(blockid, ...); // execute the method
goto dispatch_block;

Each child method returns a block id specifying the next block to jump to; after its execution, we assign the return value to the blockid variable, and jump again to the dispatch block, which will jump again to the appropriate block.

Keeping this in mind, it is straightforward to implement the jump_to_unknown block of children methods: it is enough to return the target block id to the caller, and let its dispatch loop do the right thing. If the caller is also a child method, it will return it again, until we reach the dispatch loop of the main method, which will finally do the jump. In theory, we could implement things differently and jump directly from one child method to another, but in that case the call stack could grow indefinitely in the case of a tight loop between two blocks residing in different methods.
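
For comparison with the snippet above, a sketch of the corresponding block in a child method could be as simple as this:

// jump_to_unknown block of a child method (sketch): just return the target
// block id and let the caller's dispatch loop do the actual jump
return blockid;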

To implement the dispatch block we can exploit the switch opcode of the CLI; if the .NET JIT is smart enough, it can render it as an indirect jump. Overall, jumping to a non-local block consists of an indirect function call (by invoking the delegate) plus an indirect jump (by executing the switch opcode); even if this is more costly than a simple direct jump, we will see in the next section that this is not the main source of overhead when following a non-local link.

Obviously, the slow dispatching logic is needed only when we want to jump to a non-local block; if the target block happens to reside in the same method as the current one, we can directly jump to it, completely removing the overhead.

Moreover, the dispatch blocks are emitted only if needed, i.e. if the parent graph contains at least one flexswitch; graphs without flexswitches are rendered in the obvious way, by making one method per graph.

The slow bit: passing arguments

Jumping to the correct block is not enough to follow a link: as we said before, each link carries a set of arguments to be passed from the source to the target block. As usual, passing arguments across local links is easy, as we can just use local variables to hold their values; on the other hand, non-local links make things more complex.

The only way to jump to a block is to invoke its containing method, so the first solution that comes to mind is to specify its input arguments as parameters of the method; however, each block potentially has a different number (and different types) of input arguments than every other block, so we need to think of something else.

An alternative solution could be to compute the union of the sets of input arguments of all the blocks in the method, and use this set as the signature of the method; this way, there would be enough space to specify the input arguments for every block we might want to jump to, with each block ignoring the unused parameters.

Unfortunately, all the children methods must have the very same signature, as they are all called from the same call site in the dispatch block of the main method. Since the union of the sets of input arguments (and hence the computed signature) varies from method to method, this solution cannot work.

We might think of determining the signature by computing the union of the input arguments of all blocks in the graph; this way, all the children methods would have the same signature. But as we said above, the graph grows new blocks at runtime, so we cannot determine in advance which set of input arguments we will need.

To solve the problem we need a way to pass a variable number of arguments without knowing in advance either their number or their types. Thus, we use an instance of this class:

public class InputArgs {
    public int[] ints;
    public float[] floats;
    public object[] objs;
    ...
}

Since the fields are arrays, they can grow as needed to contain any number of arguments; arguments of primitive type are stored in the ints or floats array, depending on their type; arguments of reference type are stored in the objs array: it's up to each block to cast each argument back to the needed type.

This solution imposes a huge overhead on both writing and reading arguments:

  • when writing, we need to make sure that the arrays are big enough to contain all the arguments we need; if not, we need to allocate a bigger array. Moreover, for each argument we store into the array, the virtual machine performs a bounds check, even if we know the index will never be out of bounds (because we checked the size of the array in advance);
  • when reading, the same bounds check is performed for each argument read; moreover, for each value read from the objs array we need to insert a downcast.

To mitigate the performance drop, we avoid allocating a new InputArgs object each time we do a non-local jump; instead, we preallocate one at the beginning of the main method, and reuse it all the time.
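
As a concrete illustration, here is a minimal sketch of how a source block could write its outgoing arguments into the preallocated InputArgs and how the target block could read them back; all the names, array sizes and block ids are invented for this example:

static class NonLocalArgsExample
{
    // allocated once at the beginning of the main method and reused forever
    static readonly InputArgs args = new InputArgs {
        ints = new int[8], floats = new float[8], objs = new object[8]
    };

    // source block: store the outgoing arguments, then jump via the dispatcher
    static uint PrepareNonLocalJump(int counter, string message)
    {
        args.ints[0] = counter;       // primitive arguments go into ints/floats
        args.objs[0] = message;       // reference arguments go into objs
        return 0x00020001;            // block id of the target block (method 2, block 1)
    }

    // target block, in another method: read the arguments back
    static void TargetBlock()
    {
        int counter = args.ints[0];
        string message = (string)args.objs[0];   // downcast needed for objs
    }
}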

Our benchmarks show that passing arguments in arrays is about 10 times slower than passing them as real parameters of a method. Unfortunately, we couldn't come up with anything better.

Implementing flexswitches

Now, we can exploit all this machinery to implement flexswitches, as this is our ultimate goal. As described above, the point is to be able to add new cases at runtime, each case represented as a delegate. Here is an excerpt of the C# class that implements a flexswitch that switches over an integer value:

public class IntLowLevelFlexSwitch
{
    public uint default_blockid = 0xFFFFFFFF;
    public int numcases = 0;
    public int[] values = new int[4];
    public FlexSwitchCase[] cases = new FlexSwitchCase[4];

    public void add_case(int value, FlexSwitchCase c)
    {
        ...
    }

    public uint execute(int value, InputArgs args)
    {
        for(int i=0; i<numcases; i++)
            if (values[i] == value) {
                return cases[i](0, args);
            }
        return default_blockid;
    }
}

For each case, we store both the triggering value and the corresponding delegate; the add_case method takes care of appending value and c to the values and cases arrays, respectively (and of resizing them if necessary). The interesting bit is the execute method: it takes a value and a set of input arguments to be passed across the link, and jumps to the right block by performing a linear search in the values array.
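
The body of add_case is elided above; a minimal sketch of what it could look like, based on this description (not the actual backend code), is:

public void add_case(int value, FlexSwitchCase c)
{
    if (numcases == values.Length) {
        // grow both parallel arrays when they are full
        System.Array.Resize(ref values, values.Length * 2);
        System.Array.Resize(ref cases, cases.Length * 2);
    }
    values[numcases] = value;
    cases[numcases] = c;
    numcases++;
}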

As shown in the previous sections, the first argument of a FlexSwitchCase is the block id to jump to; since when we go through a flexswitch we always want to jump to the first block of the method, we pass the special value 0 as the block id, which precisely means jump to the first block. This little optimization saves us from having to explicitly store the block id of the first block of each case.

The value returned by execute is the next block id to jump to; if the value is not found in the values array, we return the default_blockid, whose value has been set earlier by the JIT compiler; default_blockid usually points to a block containing code to restart the JIT compiler. When the JIT compiler restarts, it emits more code for the missing case, then calls add_case on the flexswitch; from then on, the new blocks are wired into the existing graph, and we have finally managed to implement growable graphs.

Performance

As we saw, implementing growable graphs on the CLI is a pain, as the virtual machine offers very little support, so we need an incredible number of workarounds. Moreover, the generated code is much worse than what an assembly backend could produce, and the cost of following a non-local link is very high compared to a local one.

However, our first blog post showed that we still get very good performance; how is this possible?

As usual in computer science, most of the time of a running program is spent in a tiny fraction of the code; our benchmark is no exception, and the vast majority of the time is spent in the inner loop that multiplies numbers; the graph is built in such a way that all the blocks that are part of the inner loop reside in the same method, so that all the links inside it are local (and fast).

Flexswitches and non-local links play a key role in selecting the right specialized implementation of the inner loop, but once it is selected they are not executed anymore until we have finished the computation.

It is still unclear how things will look when we compile the full Python language instead of a toy one; depending on the code, it could be possible to have non-local links inside the inner loop, thus making performance much worse.

Alternative implementations

Before implementing the solution described here, we carefully studied a lot of possible alternatives, but all of them either didn't work because of a limitation of the virtual machine, or would have worked only with terrible performance.

In particular, in theory it is possible to implement non-local links using tail calls, by putting each block in its own method and doing a tail call instead of a jump; this would also solve the problem of how to pass arguments, as each method could have its own signature matching the input args of the block. I would like to explain this solution in more detail as I think it's really elegant and nice, but since this post is already too long, I'll stop here :-).

In theory, if the .NET JIT were smart enough it could inline and optimize away the tail calls (or at least many of them) and give us very efficient code. However, a benchmark I wrote shows that tail calls are up to 10 times slower (!!!) than normal calls, thus making it impractical to use them for our purposes.

Conclusion

Despite the complexity of the implementation, our results are extremely good; the speedup we got is impressive, and it proves that PyPy's approach to JIT compilers can work well also on top of object-oriented virtual machines like .NET or the JVM.

Generating bytecode for those machines at runtime is not a new idea; Jython, IronPython, JRuby and other languages have been doing this for years. However, Jython and IronPython do only a simple "static" translation, which doesn't take advantage of the information gathered at runtime to generate better, faster and specialized code. Recently, JRuby grew a new strategy to JIT-compile only hotspots, taking advantage of some information gathered while interpreting the code; this is still a "one-shot" compilation, where the compiled code does not change over time.

To my knowledge, PyPy is the first example of a language which implements a true JIT compiler on top of the underlying JIT compiler of the virtual machine, emitting bytecode that changes and adapts over time. If someone knows of other languages doing that, I would really like to know more.

The problem with such an innovative approach is that current virtual machines are not designed to support it natively, and this forces us to use a lot of workarounds that slow down the generated code. The hope is that in the future virtual machines will grow features that help us to generate such code. The experimental Da Vinci VM seems to go in the right direction, so it is possible that in the future I will try to write a JIT backend for it.

At the moment, the CLI JIT backend is almost complete, and all the hardest problems seem to be solved; the next step is to fix all the remaining bugs and implement some minor features that are still missing, then try to apply it to the full Python language and see what the outcome is.

Unknown wrote on 2008-12-07 22:33:

JikesRVM + LLVM

https://osdir.com/ml/java.jikes.rvm.devel/2003-09/msg00059.html

Don't know if it succeeded.

Yosef wrote on 2008-12-08 08:15:

The comment about assembly-code patching is interesting. Do you mean assembly code backends can do runtime patching of previously generated code? I thought this is impossible, because operating systems mark executable pages as read-only. How is that dealt with?

Maciej Fijalkowski wrote on 2008-12-08 09:29:

Most executable pages are read only, but there is nothing that stops you from creating ones that are rw. You just pass different flags to mmap.

Cheers,
fijal

Antonio Cuni wrote on 2008-12-08 19:11:

@Yosef
about patching generated code, see fijal's comment. Btw, this is exactly the same approach used by psyco

Anonymous wrote on 2008-12-11 08:22:

It's wrong to say IronPython only does static translation. The Call Site stuff happens and generates IL at run time, and generates different code depending on the types. In fact you may want to look at how they do it, because they regenerate the IL for a method multiple times, which may be another way of implementing Flex switches

Antonio Cuni wrote on 2008-12-12 09:02:

@Ben Young
do you have a link that explains in more detail what you mean?
As far as I know, DLR's callsites are just a way to do polymorphic inline caches, but nothing more. In particular, they don't do any specialization of the called code.

You are right that we could do the same to implement flexswitches, though I think this is a minor optimization, as right now the real performance problem is how to pass arguments across non-local links.

Anonymous wrote on 2008-12-12 16:26:

Hi Antonio

IronPython can create a method that looks like this

object add(object a, object b)
{
throw new Exception("I don't know how to add")
}

into

object add(object a, object b)
{
if(a is int && b is int)
return (int)a + (int)b

throw new Exception("I don't know how to add")
}

and can further add new tests at runtime. The code to do the adding is written directly into the method body and there's no further call needed. This is runtime code generation, not just caching

In your case, instead of having multiple methods implementing different blocks you could just rewrite the whole "master" method every time the flexswitch changes. That way there's no call overhead at all. That's what the DLR does. I think the main thing it's missing is promotion, so shared tests can't be moved up a level, and it doesn't do inlining.

Anonymous wrote on 2008-12-18 08:32:

I'm a beginner programmer, so please excuse my beginner questions :-)

I just started learning Python as my first programming language. Several of my programmer friends have said I should learn Java instead, one reason being the difference in performance - specifically for doing natural language processing / AI stuff which is the area I am interested in.

With PyPy, do you think it is likely that in the near future, Python's performance may be close to that of Java? I do plan on learning multiple languages, but it would be nice if I could stick with Python for as long as possible :-)

Lucian wrote on 2008-12-20 12:56:

@Anonymous

Probably. People have great hopes for PyPy, but you can never know how it will turn out, if at all.

Right now, you can use things like numpy, psycho, shedskin, cython/pyrex and a few others to speed up your code, only needing to know a few things about C or C++. Google them.

Luis wrote on 2008-12-20 14:38:

@Sin

You don't need to know any c or c++ to use psyco or shedskin. Only python.

Porting the JIT to CLI (part 2)

In my previous post, we saw that the PyPy JIT generator can produce huge speedups when applied to the tlc toy language.

In this post we will dive a bit into the internals of the PyPy JIT, to see how it manages to do so. Note that this is a very high-level overview of how the JIT works, and applies to all backends. Then, in the third post of this series, we will look closer at the CLI JIT backend, seeing how it works around some .NET limitations and what the generated code looks like.

PyPy JIT for dummies

As you surely know, the key idea of PyPy is that we are too lazy to write a JIT of our own: so, instead of spending nights writing a JIT, we spend years coding a JIT generator that writes the JIT for us :-).

I'm not going to explain how the JIT generator does its job (perhaps this will be the subject of another blog post), but how the generated JIT works.

There are values that, if known at compile time (i.e., when the JIT compiler runs), let the JIT produce very efficient code. In a dynamic language, types are the primary example: for instance, suppose you are a compiler and you have to compile the following Python function:

def mysum(a):
  return a + 1

At compile time, you don't have any knowledge about the type of the parameter: it could be an integer, a float, a user-defined object, etc. In this situation, the only safe choice is to emit code which does the usual, slow, full lookup to know how to perform the operations.

On the other hand, suppose that you knew in advance that the parameter is an integer: this time, you could emit code that exploits this extra knowledge, by performing directly a fast integer addition.

The idea behind the PyPy JIT is that if you don't have enough knowledge to generate efficient code, you stop compiling and wait until you know exactly what you need. Concretely, you emit code that runs until the point where you stopped the compilation, and then triggers a special procedure that restarts the compiler. This time the JIT compiler knows everything you need, because you can inspect the state of the running program.

Let's see an example: the first time the JIT compiles mysum, it produces something like this pseudo-code:

PyObject mysum_compiled(PyObject a)
{
  Type a_type = a.GetType();
  switch(a_type) {
      default: continue_compilation(a_type, <position>);
  }
}

If you call mysum(41), the execution goes to the default branch of the switch, thus calling continue_compilation: its job is to restart the JIT compiler, which can now emit fast code because it knows the exact type of a; then, it modifies the original mysum_compiled function, in order to make it execute the newly generated code the next time it encounters an integer at that point:

PyObject mysum_compiled(PyObject a)
{
  Type a_type = a.GetType();
  switch(a_type) {
      PyInteger: return new PyInteger(a.value+1); // fast path!
      default: continue_compilation(a_type, <position>);
  }
}

From now on, every time we call mysum with an integer argument, the JIT compiler is not called anymore and the fast path is directly executed; if we happen to call mysum with a float argument, the switch goes to the default branch again, and the JIT compiler is started once more to produce fast code also for this case. What happens in practice is that compile time and runtime are continuously intermixed, until the switches are stable enough and the compiler is not needed anymore.

In PyPy jargon, this kind of "growable switch" is called a flexswitch, and it's one of the most important concepts of our JIT generator.

Promotion

How can the JIT generator know which values are useful to know to generate efficient code and which aren't? Unfortunately it can't, or at least our JIT generator is not smart enough at the moment.

To get the best from it, the developers of the VM need to instruct it by annotating the variables on which we want the JIT to stop until it knows their actual values; this is done by using particular hints, called promote and promote_class; variables annotated with such hints are said to be promoted. If something is promoted, a flexswitch is used to gain information about it, as seen in the previous section.

As an example, let's look at an excerpt from the main dispatch loop of the tlc virtual machine:

elif opcode == ADD:
  a, b = stack.pop(), stack.pop()
  hint(a, promote_class=True)
  hint(b, promote_class=True)
  stack.append(b.add(a))

This is the implementation of the ADD opcode: first, it pops two values from the stack; then, it computes the result; finally, it pushes the result onto the stack again. In between, the classes of both a and b have been promoted: this means that when the JIT emits the code for b.add(a), it knows exactly what is happening: if it sees that both are instances of the IntObj class, it inlines the method call and emits a fast integer addition instead.

Virtuals

The other important concept of the JIT is the presence of virtual structures, virtual lists, and virtual dictionaries. Again, I'm not going to explain in depth how they work, but only why they are so important for generating highly efficient code.

The essence of virtuals is that you don't allocate objects until you really need to, e.g. because they are being passed as an argument to some external function. Instead, we store all the information we need in local variables; e.g., in the case of a virtual structure, we create as many local variables as the number of its fields: if the structure escapes the local scope, we force it into a real object, by allocating memory on the heap and initializing it from the current values of the local variables.

This technique allows the JIT to avoid the allocation of many temporary objects that hold intermediate results; consider for example the following Python loop:

result = 0
for i in range(N):
  result += i
return result

Without the JIT, at each iteration a new int object is created and bound to the result variable, while the previous one is discarded and no longer needed. By combining virtuals and promotion, the JIT can emit code that does the whole computation locally, and allocates a real object only at the end, when it escapes the local scope because it is returned from the function.
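
To give a rough idea, here is a hedged sketch, in C#-like code with invented names (this is not the actual generated code), of what the JIT could emit for the loop above once promotion and virtuals kick in:

class PyInteger { public int value; public PyInteger(int v) { value = v; } }

static class VirtualsExample
{
    static PyInteger SumLoop(int N)
    {
        int result = 0;                // the "virtual" integer object, kept in a plain local
        for (int i = 0; i < N; i++)
            result += i;               // no temporary boxed object allocated per iteration
        return new PyInteger(result);  // forced to a real object only when it escapes
    }
}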

Putting it all together

This is, essentially, how PyPy's generated JITs work. To summarize, our JITs emit multiple versions of each chunk of code: each version is specialized and optimized for one particular case.

The cost of selecting the right specialization to use (through flexswitches) is almost always negligible compared to how much time you save by running the fast version instead of the more-general-but-slow one. Moreover, each specialized version knows the exact shape of the objects it's dealing with, so they can be virtualized to make the generated code even more efficient.

In the end, the actual code generation is done by one of the JIT backends: the backends exploit all the knowledge gathered in the previous steps to produce highly efficient code, but this will be the subject of the next blog post.

Anonymous wrote on 2008-11-07 12:46:

Wow... I love this approach. Keep up the great work and interesting posts!

Luis wrote on 2008-11-07 18:12:

This is a very clear and didactic explanation. Thanks!

Anonymous wrote on 2008-11-07 20:14:

can't wait for the next one

kbob wrote on 2008-11-08 01:35:

What does the flexswitch compile into? I'm guessing it would look like

t = type(obj);
if (t == int)
...
else if (t == float)
...
else
....

but maybe there's a better way (or maybe the answer is backend-dependent).

Antonio Cuni wrote on 2008-11-08 08:53:

@bob
your guess is right, how to implement the flexswitch is backend-dependent. This is the hardest part for .NET, as the flexswitch needs to grow dynamically (i.e., you have to add more cases after the .NET method has already been compiled). It will be the subject of the next blog post.

Unknown wrote on 2009-01-05 22:46:

It seems that an implication of the JIT way is that, by adopting a consistent habit of implementing type driven Generic Functions, the JIT could accomplish nearly all of the possible optimizations in a single pass. In other words, by definition, each type based variation of a Generic Function call can only be fired when data of that type is provided as a parameter.

Porting the JIT to CLI (part 1)

As the readers of this blog already know, I have been working on the CLI JIT backend for some months: last Friday, it reached an important milestone, as it is now able to produce huge speedups for a little dynamic language. To know how huge the speedup is, read on :-).

The goal of the PyPy JIT generator is to take an interpreter and, with the help of a few annotations, automatically generate a JIT compiler for it. In this post, we will talk about the tlc virtual machine: while tlc is just a toy language, it contains some features that make it an interesting target for our JIT generator.

The tlc virtual machine

tlc is executed by a stack-based, dynamically typed virtual machine (for those who know a bit about the Python VM: does it sound familiar? :-)).

There are three types of objects: integers, nil, and cons cells (i.e. lisp-like pairs of objects).

As the VM is very simple, it provides only a few opcodes:

  • opcodes to manipulate the stack, like PUSH, POP, etc.
  • integer operations, like ADD, MUL, all the comparisons, etc.: these operations can only be applied to integers;
  • list operations, like CONS, CAR, CDR: these operations can only be applied to lists;
  • other operations, including jumps and conditional jumps.

The VM is interesting for our purposes because it has a lot of similarities with Python (though on a smaller scale, of course):

  1. it has to do type-checks at runtime before doing most of the operations;
  2. every time you do an arithmetic operation, it has to unbox the operands, do the computation, and then box the result again.

This means that even if you have a program which only uses integers, you are paying a lot of overhead.

To know more about this toy VM, look at its source code: the interesting bits are the classes used to represent objects, and the interp_eval function, which contains the main loop of the virtual machine. As you can see, the implementation is quite straightforward; all the hint calls you see are the special annotations needed by the JIT generator to produce better code.

Let's JIT it!

So, the whole point is to generate a JIT compiler from it, isn't it?

First, checkout a fresh copy of the oo-jit branch:

$ svn co https://codespeak.net/svn/pypy/branch/oo-jit

Then, go to the oo-jit/pypy/jit/tl directory, and compile the tlc VM with the CLI backend and JIT enabled:

$ cd oo-jit/pypy/jit/tl/
$ ../../translator/goal/translate.py -b cli --jit --batch targettlc
...
lot of texts
...

If everything went OK, you now have a targettlc-cli executable, which accepts two arguments: the name of the file containing the tlc program we want to run, and an integer to be passed to it.

Luckily, in the same directory we have a factorial.tlc file that contains the bytecode for a function that -- guess? -- computes the factorial of a given integer; let's try it:

$ ./targettlc-cli factorial.tlc 5
Non jitted:    120 (0.009371 seconds)
Warmup jitted: 120 (0.208954 seconds)
Warmed jitted: 120 (0.000323999999999991 seconds)

Cool, it seems that the result was computed correctly :-). As you can see from the output, we ran the program three times:

  1. by plain interpretation, without any jitting;
  2. with the jit enabled: this run includes the time spent by doing the compilation itself, plus the time spent by running the produced code;
  3. again with the jit enabled, but this time the compilation has already been done, so we are actually measuring how good the code we produced is.

So, it's time to run a benchmark: let's try to compute the factorial of a very big number; the result will be 0, because obviously after a while we overflow, but after all we are interested in the time spent, not in the result:

$ ./targettlc-cli factorial.tlc 5000000
Non jitted:    0 (19.93247 seconds)
Warmup jitted: 0 (0.293229999999998 seconds)
Warmed jitted: 0 (0.0494239999999984 seconds)

$ python -c 'print 19.93247/0.0494239999999984'
403.295362577

And no, I didn't make any mistake in copying&pasting: the jitted version is really 400 times faster than the non-jitted one!

Warning: my laptop seems not to be very well suited for benchmarks, as the results vary a lot from run to run; I've run the benchmarks a lot of times, and I got speedup factors of up to 500, so your results may be different.

More benchmarks

It's also interesting to compare the result with a manually written C# version of the factorial, to see how good the code we produced is; to get reasonable results, we need to compute a larger factorial, to let the code run a bit longer:

$ ./targettlc-cli --onlyjit factorial.tlc 100000000
Warmup jitted: 0 (0.980856 seconds)
Warmed jitted: 0 (0.769716 seconds)

$ mono factorial.exe 100000000
C#:            0 (0.153777 seconds)

$ python -c 'print 0.769716/0.153777'
5.00540392907

We know that the generated code is far from optimal, but probably the factor of five is at least partially due to the fact that Mono's own JIT is optimized for C#-like code, and our code has a completely different shape.

All the benchmarks above were run under Linux, with Mono 1.9.1. Here are the results for the same benchmarks, but run with Microsoft CLR (on a different machine, so the absolute values are not comparable):

$ ./targettlc-cli factorial.tlc 5000000
Non jitted:    0 (15,640625 seconds)
Warmup jitted: 0 (0,4375 seconds)
Warmed jitted: 0 (0,03125 seconds)

$ python -c 'print 15.640625/0.03125'
500.5

$ ./targettlc-cli --onlyjit factorial.tlc 100000000
Warmup jitted: 0 (0,90625 seconds)
Warmed jitted: 0 (0,515625 seconds)

$ ./factorial.exe 100000000
C#:            0 (0,34375 seconds)

$ python -c 'print 0.515625/0.34375'
1.5

The results are even better than before; this is probably thanks to the CLR's JIT, which does a better job than Mono when faced with something different from the usual C#-like code.

Conclusions (for now)

This is a very important result, because it proves that PyPy's approach to JIT compilers can be applied effectively also to OO virtual machines; the result is even better than what I expected, because when generating code for .NET we have much less freedom than when generating assembly code, and I had to play some tricks to work around some .NET limitations.

Moreover, it worked on the first try :-). I tried to compile the tlc virtual machine as soon as all the related JIT tests were passing, and surprisingly everything worked just fine, even if it was the very first time I was trying to apply some features of the JIT to something bigger than a test: I think this is yet another proof that Test Driven Development just works!

Even if this is a major milestone, the CLI JIT backend is not yet complete: as a consequence it can't be used for the full PyPy yet, but all the hardest problems should have been solved by now.

Since a lot of readers asked for more technical details, especially about the JIT, I will soon try to write a second blog post explaining how the CLI backend works internally, with a brief look at the generated code to see what it looks like.

Anonymous wrote on 2008-11-04 01:50:

If you are benchmarking on Linux then watch out for CPU speed scaling. For example on Ubuntu by default the ondemand governor is used which runs the CPU at lowest possible speed and until there is any CPU demand at which point it runs at fastest. The time to switch varies (eg sometimes it can be instantaneous, other times a second or two, and other times not at all).

Make sure to use: cpufreq-set -g performance

That will run at maximum CPU speed the whole time.

Lucian wrote on 2008-11-04 02:27:

Woohoo! Can't wait for more (and the jvm counterpart).

Anonymous wrote on 2008-11-04 08:46:

I agree that this result is very important. For me and many others too i guess, a very good JIT is the most important missing part in python and other dynamic languages. Speed *is* important,

For integers, psycho also showed huge performance gains. But i think the real proof of pypy's approach would be to show similar results for floating point operations also...

Anonymous wrote on 2008-11-04 11:09:

awesome!

Anonymous wrote on 2008-11-04 11:47:

Keep it on!

Antonio Cuni wrote on 2008-11-04 16:50:

hi, thanks for your valuable comments.

Some notes:

- I know about cpufreq-set, but even setting the governor to performance doesn't help, the timings vary a lot between different runs. If someone knows a way to run reliable benchmarks, it would be very appreciated!

- I have plans to experiment also the JIT on the JVM: since HotSpot usually does a better job than CLR's JIT, it's possible/likely that the JVM is a better platform for our purposes. Also, the experimental Da Vinci Machine contains features that could be very useful for us. Unfortunately the PyPy non-JIT JVM backend is not as advanced as the CLI one, and it lacks some features that are really needed for writing a JIT backend.

- Float operations are already (mostly) supported by our JIT backends; I bet that if you add a FloatObj to the tlc interpreter, you will see huge speedups as well. However, the real point of PyPy's approach is that once finished it will optimize much more than ints and floats, including features that are currently not implemented by psyco (e.g. generators).

Ορέστης wrote on 2008-11-04 21:21:

Brilliant post! Keep us updated!

One year PyPy Blog

Last Friday the PyPy Status Blog had its first anniversary. Yay! After not really buying into any of this new-fangled "blog" stuff for a long time we just bit the bullet and got started. Totally surprisingly it even worked. We posted 76 posts in the last year, more than one per week. By now we have more than 800 subscribers (according to feedburner), which is quite cool for a rather niche blog.

To make our blog even more interesting, I would like to ask for some feedback via the comments:

  • Which posts did you like in particular?
  • What sort of posts would you be interested in getting more of?
  • Any other improvements we could make?
Anonymous wrote on 2008-11-02 18:28:

For me the most interesting posts is about status of PyPy project. It will be great if you could post more frequently.

Unknown wrote on 2008-11-02 20:40:

+1

Michael Foord wrote on 2008-11-02 21:09:

It's been great to read about PyPy progress, congratulations for surviving a year and many thanks.

I also like to hear the status updates - and wouldn't mind a bit more technical detail.

In fact some deep dives into individual aspects of PyPy would be *great*, even if they're more effort to write...

Eduardo O. Padoan wrote on 2008-11-02 21:45:

Greetings!
What about quick Weekly Status Updates, with a summary of svn activity and stuff?

Anonymous wrote on 2008-11-03 02:41:

It's not just a first for you, it's a first for me. This is the first blog I have ever subscribed to. You can attribute that to the fact that this subject is genuinely interesting.

The blog format has many benefits. For one, it amortizes the effort required to understand the project. This allows me to take my time, wiki whatever I need to, and savor the details. It takes time for me to learn the concepts but in due time, I can see myself eventually contributing to the project. The other benefit is I can see all the hot topics revolved around the various pypy projects. The whole partial evaluation, for example, was something new I learned about.

I would agree that increasing the rate of posts would be nice. While I can't say for others, in my personal experience, it seems that logged projects tend to finish faster than unlogged projects.

Bill Mill wrote on 2008-11-03 03:09:

> Which posts did you like in particular?

I just scanned a bunch of entries, and "List comprehension implementation details" jumped out at me as a really nice one. I like that it points out some of the deep details of python that are easy for me to not think about because I'm not implementing it.

> What sort of posts would you be interested in getting more of?

More technical details posts, I really like the one about the JIT and Prolog too.

I post your articles to reddit too, and I think "we can now run big software X" and efficiency milestones are successful at attracting a lot of attention (if that's what you want!)

Benjamin Peterson wrote on 2008-11-03 03:11:

Thanks so much for doing this! It makes me very jealous over here in CPython land.

I like to hear about specific new projects and ideas you guys are working on.

nshepperd wrote on 2008-11-03 05:39:

For me the most interesting things were the technical details posts, like Bill Mill said. But I get excited any time there is a new blog post. :)

Unknown wrote on 2008-11-03 06:35:

Being in the scientific computation area at the moment, I'm very eager to hear about progress in the JIT framework, esp. for 64 bit Linux.

Yet most other posts are also interesting.

Anonymous wrote on 2008-11-03 11:01:

> Which posts did you like in particular?

Anything about the JIT and its progress.

Good luck!

Unknown wrote on 2008-11-03 12:57:

Hi,

I think the blog is pretty good. Weekly summaries would make it rock, though.

And I am also especially interested in hearing about progress on JIT work. And about any use of LLVM.

Best
Anders

Carl Friedrich Bolz-Tereick wrote on 2008-11-03 14:46:

Thanks for all the friendly comments!

So more technical posts it will be :-). Those are mostly even fun to write, it's just usually quite a bit of work. I actually have a vague plan to give a basic introduction of the ideas behind the JIT (but will still take some time, I am busy with the lecture at the moment).

About more summaries of what happens: it requires a lot of discipline (see the Python-dev summaries) and I am not sure we have that :-). It would need somebody dedicated to care for it, and that one won't be me at the moment.

Luis wrote on 2008-11-03 15:03:

Personally, I get very anxious when you say "it will be ready when it's ready". Aaarghhh! Please, at least lie a little bit :-).
For example: "Pypy is now 1.8x slower than cpython, but after [feature here] it will be 10x faster".
Well, just kidding. Congratulations for all the great work and keep it up!

Damian Cugley wrote on 2008-11-03 16:03:

I am not especially in favour of weekly summaries, unless there is some interesting progress to report. Otherwise you end up with someone filling in progress reports because they feel obliged to, rather than to celebrate new features, and it becomes a chore.

That said, PyPy has many subprojects; maybe having a round-robin system where we get a progress report from a different project every week would be interesting.

Anonymous wrote on 2008-11-03 23:05:

I'm a regular Python user that wishes often for something a little faster with the same flexibility. So generally, I read this because I think you guys are on a good track for JIT optimisation and other fun things.

I guess I'm looking forward to the eventual series of posts that talks about how you can start using this on your system to replace your system Python, followed by you talking the regular Python core developers into working directly in PyPy instead. =)

Paul D. Eden wrote on 2008-11-03 23:36:

For me the best parts are the tutorials and howtos relating to rpython, translating to c, etc.

Konrad wrote on 2008-11-07 11:54:

I'm interested in status updates and longer descriptions on how elements of PyPy work. Sprint summaries are fine as long as they carry one of the above (they usually do, though :>)

John Mudd wrote on 2008-11-10 13:27:

I'm interested in anything to do with multi-thread support, GIL elimination, general status, progress and future plans.

Sprint Discussions: JIT Generator Planning

Background

Finally, the JIT post :-). First some background: Despite our plans at the end of the EU period, PyPy's Python interpreter didn't get a good and widely applicable JIT in the last year. The reason for that was that we discovered that although the basic idea to generate JIT compilers is good, the concrete prototype made during the EU period is basically flawed. It could have been pushed a bit farther, but would have run into deep troubles eventually. One of the problems would have been performance instability: change a seemingly unrelated bit in your source program, and the performance changes in unexpected ways, which is clearly not desirable. Another problem with that old approach is that too much assembler code is generated, leading to memory problems, and also that the generated assembler is bad in various ways, e.g. it is hard in that approach to do proper register allocation.

Therefore we decided that it would be worthless to pursue this direction much further. Instead we tried to research approaches to fixing the inherent problems. This research was largely done in Prolog and I eventually wrote my Master's thesis about it. From the Prolog work we got some good insights into what needs to be done and what sorts of techniques are needed. Also, it inspired Armin to do some more exploration on a small Python prototype which used the lessons learned from Prolog and also some additional ideas from tracing JITs. So far, however, the prototype is neither in RPython, nor much integrated with PyPy.

This research is not the only thing happening in the JIT-area. During the last year, Antonio Cuni was working on bringing the JIT to pypy-cli. This consisted mostly of writing a .NET backend for the old JIT-generator. Some further work is being done since August by John Witulski, who is writing an AMD64 backend for the JIT-generator for his Bachelor's thesis.

Where to go from there

During the sprint we discussed in which directions we should continue now. We plan to work quite a bit on the JIT in the coming months. Both Armin and Anto are in Düsseldorf for four months, and together with me they plan to mostly work on the JIT (as well as giving a lecture on "Dynamic Programming Languages", trying to ensnare some more students).

The first step will be to experiment a bit more with Armin's prototype. So far it looks rather promising, but there are some unsolved issues that we need to look into first. The first issue is to think a bit about how to efficiently do profiling to compile only important code paths. The other large issue is the so-called "virtualizables". Roughly speaking, they are the frame objects of the interpreter from which the JIT is generated. They need special treatment, because on the one hand it is important that they get optimized away to make the code fast, since the frames are accessed all the time for the local variables; on the other hand they should still be usable for introspection if code is around that is trying to look into them.

When this is done, the prototype needs to be ported to RPython, which is a non-trivial task, since it is rather dynamic so far (it is rather important that the unresolved issues are done before the porting, because once the prototype is in RPython, experimentation will be harder). The porting has the potential to be tedious, but in a sense it is "just work", as opposed to unclear research.

At this point it will become important to think about the backend interface. The interface that the old frontend used to produce assembler code won't be usable for the new approach, so things need to be rearranged slightly. Afterwards the backends will have more information and be invoked at a slightly higher level, which should allow them to produce better code.

When all this is done, the JIT generator will be in a rather good state and it should become possible (modulo a lot of details, of course), to use it on the Python interpreter.

Conclusion

I am intentionally not attaching any time estimates to this blog post. So far our time estimates have not been very accurate when it comes to the JIT, which only led to disappointment when the JIT failed to materialize. We hope that we will progress in interesting ways in the next four months, but who knows. Note that we are really quite disappointed ourselves that it took so much longer than we planned and hoped. The reason for this is mostly that this work really is research and sometimes it is just hard to predict what sort of problems turn up. Partial evaluation (the basis for our JIT generator) is a 30-year-old technique that was always just promising and never really successful, so the fact that we think we can solve its problems in a few years is very much hubris anyway :-). On the positive side, we think that we now know these problems much better than ever before and that we have a plan that has a chance to succeed.

Also we are still convinced that our approach has huge potential, despite the difficulties. If we manage to pull it off, it should be significantly simpler to support new language features in the JIT and also to get speedups on some rather interesting bits of the language. Some ideas we have include generating a JIT for the regex engine or speeding up ctypes bindings to be nearly as fast as an extension module (or faster?). Also the JIT will be such that by construction the JIT-generated code behaves identically to the original code, which isn't always true for Psyco, for example.

Luis wrote on 2008-10-14 19:20:

Thank you very much for this post Carl! I have a couple of questions:
You said you were experimenting with some ideas from tracing JITs. I wonder how much the new javascript VMs are influencing your work in pypy. Were these techniques considered from the beginning or this is just because of the latest success of tracemonkey? And if so, are there ideas from Chrome's v8 or Squirellfish than could be applied too?
Do they change in any way your expectations regarding the potential of pypy concerning speed?

Anonymous wrote on 2008-10-15 05:42:

Hi,

I guess you already know about the work done on hybrid frames in VisualWorks, but since you mentioned the problem, anyone else could benefit from the following link:

https://pages.cs.wisc.edu/~cymen/misc/interests/oopsla99-contexts.pdf

Maciej Fijalkowski wrote on 2008-10-15 13:21:

@luis: we definitely read same papers (by Michael Franz and others) as tracemonkey authors. Work on tracing jit for pypy is older than release of tracemonkey (it's even older than first work on tracemonkey). Regarding chrome's v8 it seems the main optimization is implementation of hidden classes, which we kind of get for free combining jit and our existing optimization called shared dict.

Anonymous wrote on 2008-10-27 05:59:

From the sound of it, it sounds like it would be useful to have a 64-bit version of Psyco, otherwise there is no stopgap in the meantime...

Anonymous wrote on 2008-12-01 13:50:

When will PyPy match Psyco's speed?

Anonymous wrote on 2009-01-18 20:40:

Yes, especially on Linux, everything is moving to 64-bit. If there's no 64-bit Psyco, you can't get the benefits of 64-bit Python (memory) and use Psyco at the same time.

Sprint Discussions: C++ Library Bindings

At the beginning of this year, PyPy grew ctypes support, thanks to generous support by Google. This made it possible to interface with C libraries from our Python interpreter, something that was possible but rather tedious before. What we are lacking so far is a way to interface to large C++ libraries (like GUI libraries). During the sprint we had a brainstorming session about possible approaches for fixing this shortcoming.

For CPython there are a number of approaches in common use, such as Boost.Python and SWIG.

Those all have the property that they produce some code that is then compiled with a compiler to produce a CPython extension. The produced code also uses functions from CPython's C-API. This model is not simple to use for PyPy in its current state. Since PyPy generates C code automatically, a fixed C-level API does not exist (it is not unlikely that at one point in the future we might have to provide one, but not yet). At the moment, PyPy very much has a "Don't call us, we call you"-approach.

A very different approach is followed by the Reflex package, which is developed at CERN (which has an incredible amount of C++ libraries). It is not mainly intended for writing Python bindings for C++ libraries but instead provides reflection capabilities for C++. The idea is that for every C++ shared library, an additional shared library is produced which, together with Reflex, makes it possible to introspect properties of C++ classes, methods, etc. at runtime. These facilities are then used for writing a small generic CPython extension module that allows CPython to use any C++ library for which this reflection information was generated.

This approach is a bit similar to the ctypes module, except that ctypes does not use any reflection information: the user has to specify the data structures that occur in the C code herself. This sometimes makes it rather burdensome to write cross-platform library bindings.
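
For contrast, here is a minimal sketch of the manual declarations ctypes needs; the library name libfoo.so, the Point struct and the foo_point_norm function are purely hypothetical, invented only for illustration.

import ctypes

class Point(ctypes.Structure):
    # The struct layout has to be spelled out by hand: there is no
    # reflection information that ctypes could consult.
    _fields_ = [("x", ctypes.c_double),
                ("y", ctypes.c_double)]

lib = ctypes.CDLL("libfoo.so")                        # hypothetical library
lib.foo_point_norm.argtypes = [ctypes.POINTER(Point)]
lib.foo_point_norm.restype = ctypes.c_double

p = Point(3.0, 4.0)
print lib.foo_point_norm(ctypes.byref(p))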

For PyPy the approach seems rather fitting: We would need to implement only the generic extension module and could then use any number of C++ libraries. Of course some more evaluation is needed (e.g. to find out whether there are any restrictions for the C++ code that the library can use and how bothersome it is to get this reflection information for a large library) but so far it seems promising.
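
To make the idea of a single generic binding layer a bit more concrete, here is a rough sketch in Python. None of this is the real Reflex API: the class_info dictionary and the backend object are invented stand-ins for the reflection dictionary and for the low-level layer that does the actual construction and method calls.

def make_proxy_class(class_info, backend):
    # class_info: reflection data for one C++ class (its name and method names).
    # backend: the one generic low-level layer that knows how to construct
    # C++ objects and call their methods via the reflection dictionaries.
    def make_method(method_name):
        def method(self, *args):
            return backend.call(self._handle, method_name, args)
        return method

    namespace = {}
    for method_name in class_info['methods']:
        namespace[method_name] = make_method(method_name)

    def __init__(self, *args):
        self._handle = backend.construct(class_info['name'], args)
    namespace['__init__'] = __init__

    return type(class_info['name'], (object,), namespace)

The point of the sketch is that this code is completely independent of any particular C++ library: the same generic module would serve every library for which reflection information exists.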

Anonymous wrote on 2008-10-14 17:39:

I've done a fair amount of complicated Boost.Python wrapping, and also implemented a small replacement for it with most of the complexity removed. There are two main reasons why Boost.Python is so complicated:

1. It supports arbitrarily complex memory and sharing semantics on the C++ classes (and is runtime polymorphic on how the memory of wrapped objects is managed).

2. It supports arbitrary overloading of C++ functions.

If you remove those two generality requirements (by requiring that wrapped C++ objects are also PyObjects and banning overloading), it's possible to write very lightweight C++ bindings. Therefore, I think it's critical to factor the C/C++ API design so that as much of it as possible is writable in application-level Python on top of a small core that does the final C++ dispatch.

For example, if you wrap a C++ vector class with a bunch of overloads of operator+ in Boost.Python, each call to __add__ has to do a runtime search through all the overloads, asking whether each one matches the arguments passed. Each such check does a runtime search through a table of converters. It would be a terrible shame if that overhead isn't stripped by the JIT, which means it has to be in Python.
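
To make the cost concrete, here is a toy sketch (all names invented, nothing to do with Boost.Python's real internals) of the kind of doubly nested runtime search being described:

def dispatch_add(overloads, converter_table, args):
    for overload in overloads:                # first search, done on every call
        if len(overload.signature) != len(args):
            continue
        converters = []
        for cpp_type, pyarg in zip(overload.signature, args):
            conv = find_converter(converter_table, cpp_type, pyarg)
            if conv is None:
                break                         # this overload does not match
            converters.append(conv)
        else:
            cpp_args = [c(a) for c, a in zip(converters, args)]
            return overload.call(cpp_args)
    raise TypeError("no matching overload for operator+")

def find_converter(table, cpp_type, pyarg):
    for entry in table:                       # second, nested runtime search
        if entry.cpp_type == cpp_type and entry.matches(pyarg):
            return entry.convert
    return None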

I think a good test library for thinking about these issues is numpy, since it has some memory management complexity as well as internal overloading.

I could go on, but it'd probably be better to do that via email. :)

Geoffrey

René Dudfield wrote on 2008-10-14 21:05:

Please add a C API :)

Once that is done, it's lots easier to interface with the outside world.

For a lot of C++ APIs I find it easy enough to write a C API on top of them.

In fact, many C++ APIs provide a C API, since that makes it easier to work with different C++ compilers. As you probably know, different C++ compilers mangle things differently.

It is possible to look at C++ code at runtime; you just need to be able to interpret the C++ symbols. I know someone did a prototype of this for VC6 on Windows: he parsed the symbols and then created the functions at run time with ctypes. However, the approach is not portable between platforms, compilers, or even different versions of compilers. Of course this didn't allow you to use many of the C++ features, only some.

If you look at how swig works, you will see it kind of generates a C API for many C++ things.


For libraries, it is customary to provide a C API. It just makes things easier.

Unknown wrote on 2008-10-15 01:11:

you might want to look at PyRoot [1,2,3], which uses the Reflex library to automatically wrap (and pythonize) the C++ libraries/types for which a Reflex dictionary has been generated beforehand.

Theoretically any piece of C++ can be wrapped, as Reflex uses gccxml [4] to extract information from a library and to generate the dictionary library.

Using it in one of CERN's LHC experiments, which makes heavy (ab)use of templates (Boost), I can say that we had basically no problems. Usually the only problems we got were either at the gccxml level (resolution of typedefs, default template arguments, ...) or at the gccxml-to-Reflex level (mainly naming-convention problems interfering with the autoloading of types at runtime).

Being a client of gccxml is rather annoying, as its development is... opaque.

I know the Reflex guys were at some point investigating a migration to an LLVM-based version (with GCC as a frontend) to replace gccxml.

[1] https://wlav.web.cern.ch/wlav/pyroot/
[2] https://root.cern.ch/viewcvs/trunk/bindings/pyroot/
[3] https://www.scipy.org/PyRoot
[4] https://www.gccxml.org/

Unknown wrote on 2008-10-15 11:16:

There's been some (small) discussion in the SWIG project of making an alternative output method which creates a simple C API for a C++ project and wraps that with ctypes (generating the Python side of the ctypes bindings, too). So far this is purely theoretical, but all the pieces needed to do it are present in the SWIG source code. If Reflex doesn't work out, this might be a reasonable alternative approach.

Maciej Fijalkowski wrote on 2008-10-15 13:41:

Wow, a lot of very informative posts. We'll definitely evaluate what you all posted in more detail. Also, in case you want to discuss more, the mailing list is usually a better place for discussions. Feel free to send new ideas or more detailed info there.

Cheers,
fijal

Anonymous wrote on 2008-10-16 00:44:

FYI, check out Elsa (https://www.cubewano.org/oink). It is much better than Reflex.

Carl Friedrich Bolz-Tereick wrote on 2008-10-16 10:57:

illume: Adding a C-API is rather hard, and probably not on our todo list, unless somebody pays for it :-).

anonymous: From a quick glance I am not sure Elsa would really help. Yes, you can use it to parse C++ headers and get information about them. But as far as I can see, you cannot use it to create shared libraries that can be used to dynamically construct classes and dynamically call methods on them. Besides, the idea is to have a solution that works on both CPython and PyPy. Reflex already has a way to bind C++ libraries to CPython, so we only need to do the PyPy part.

Anyway, if anybody is interested in more detailed discussions, we should all move to pypy-dev.

Unknown wrote on 2008-10-19 09:40:

I have also done a fair amount of Boost.Python wrapping. I even created a code generator for it - Py++ (www.language-binding.net).

The problem you want to solve is very complex. Exposing C++ code as-is is not enough. You will have to create a "bridge" between different concepts.

For example, C++ and Python iterators: in C++, in order to get the iterator's value, you have to dereference it; in Python, you just have the value.

This is just a single example, I have others.

If you continue with the project - I would like to be involved.

Unknown wrote on 2008-10-21 02:48:

Roman,

I haven't looked at the code of Boost.Python for a long time, but the way "we" pythonize iteration over STL sequences is rather simple.

when one writes:
foos = std.vector('FooKlass')()
# fill foos
# iterate
for foo in foos:
    print foo.value()

what the PyROOT/Reflex layer does is look at the dictionary for the std::vector(FooKlass), discover that there is a pair of functions 'begin' and 'end', and figure out that one can create a Python iterator from that pair.
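
A rough sketch of that idea, where begin, end, deref and inc are invented stand-ins for the generic calls that would actually go through the reflection dictionary:

def cxx_sequence_iter(obj, begin, end, deref, inc):
    # Turn a C++ begin()/end() pair into a Python generator.
    it = begin(obj)            # C++: obj.begin()
    stop = end(obj)            # C++: obj.end()
    while it != stop:          # C++: it != obj.end()
        yield deref(it)        # C++: *it
        it = inc(it)           # C++: ++it

which is enough to make a loop like "for foo in foos:" work.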

anyways, as Maciej pointed out, we could try to move this discussion here [1] or there [2]...

cheers,
sebastien.

[1] https://codespeak.net/pipermail/pypy-dev/2008q4/004847.html

[2] https://codespeak.net/pipermail/pypy-dev/2008q4/004843.html

Anonymous wrote on 2008-10-30 12:27:

Just to let you know, there is an upcoming paper under review at PythonPapers.org on the topic of C++ wrapping in Python. Keep an eye out for it!

ilkosta wrote on 2008-12-09 14:08:

Maybe it's also worth evaluating the xrtti library, a library comparable to Reflex but without the CERN and ROOT dependencies.

Sprint Discussions: Release Planning

One of the discussions that happened during the sprint was about how to approach the next PyPy release. There hasn't been a release since the end of the EU period, which is not an optimal situation. Therefore we plan to make a 1.1 release at the beginning of next year, ideally before Pycon US. We'd also like to move towards time-based releases. This will be greatly helped by the new buildbot infrastructure, which allows us to decide when the state of the codebase is stable enough to release.

Another goal of the release is to involve more people from the wider PyPy community by having bugdays and generally asking for more support. This will be particularly useful for bugs on platforms that none of the core developers are using.

Feature-wise the release will mostly contain CPython 2.5 language support, including some new extension modules (like ctypes, expat, sqlite). In addition we plan to make it easier to actually install and use the PyPy Python interpreter, which means some sort of proper installation procedure and supporting distutils on top of PyPy. Another part of the release will be support for fully sand-boxing an interpreter.

Additionally, there have also been a large number of improvements on several levels since the last release: optimizations, faster old-style classes, better GCs, correct finalization behaviour, lots and lots of bugfixes, better threading support (still with the GIL), some work on improving memory behaviour, ...

In contrast to our last release, we will focus mainly on PyPy's Python interpreter, and more particularly on its C version. PyPy also contains various experimental interpreters, e.g. for Prolog, Smalltalk, JavaScript and Scheme. We also don't intend to put the LLVM and JavaScript backends into the release, since they are essentially unmaintained and at least partially broken. If anybody is particularly interested in one of these components, please feel free to step up and take responsibility for them. Another thing that the release won't contain is a JIT. I plan to make another blog post about this soon, stay tuned.

Michael Foord wrote on 2008-10-12 17:58:

Looking forward to news on the JIT.

Düsseldorf Sprint Report Days 1-3

The Düsseldorf sprint is currently in full progress and this post will try to summarize what progress has been made in the last days. We are (again) sprinting at the STUPS group of the Düsseldorf University. You can find the sprint announcement and the daily planning file.

Holger and Samuele put quite some effort over several days into setting up and improving PyPy's testing infrastructure. PyPy has a variety of tests. On the one hand, there are of course our own tests. But then we also have the CPython tests that should be run on top of pypy-c. Up to now we used a custom-made pile of hacks, held together by lots of duct-tape. It consisted of a variety of different machines running different things with different reporting solutions. Some of the old test-results can still be found on wyvern. Now we are moving to a buildbot based solution together with a custom reporter to have a view similar to the old one. Some details are not quite finished yet, but most of the things are already working rather well (currently all the results displayed are from the 2.5-merge branch).

Another large (and ongoing) topic of work is the 2.5 branch. It contains the work done by our Summer of Code student, Bruno Gola, of adding CPython 2.5 features to PyPy's Python interpreter. While Bruno implemented most language features and imported the 2.5 stdlib into PyPy, a lot of details were still missing. In the last days nearly everybody worked on fixing small issues and failing stdlib tests. While doing that we tried to categorize some CPython tests as implementation-dependent so that we can skip them when running on PyPy.

Memory Improvements

One goal of the sprint is to measure and to improve the memory behaviour of our Python interpreter. The idea is to make pypy-c a realistic option for use on embedded devices. By memory behaviour we mean both the dynamic memory usage (how many bytes a dict or an instance takes) and the size of the executable, as well as details of the GC strategy.

Alexander, Carl Friedrich and Antonio did some work on analyzing the static data that a pypy-c executable contains. Our executables have a tendency to be rather large, both due to a lot of code and due to a large amount of static data. The analysis didn't give any really surprising results; the problem is mostly that we have a lot of static data originating from more or less everywhere in our program. Two big offenders are the unicodedata module with about 750 KB of static data and the multimethod tables with about 150 KB of data.

Armin, Iko, Anto and Maciek worked on a new approach to malloc removal. This is (for PyPy) a crucial optimization of the translation toolchain: it performs escape analysis to find out which objects don't outlive the frame they were allocated in. Since RPython is garbage-collected we usually have a lot of allocations, so it is important to statically get rid of many of them. To do that successfully, some inlining is needed to give the analysis more context. This is why we have rather aggressive inlining settings, to allow as much malloc removal as possible. The new approach tries to inline functions only if this actually leads to the successful removal of a malloc operation. The code is not quite finished yet, so it remains to be seen how successful it will be.
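
As a toy illustration (not actual PyPy code) of what malloc removal buys us, consider a small wrapper object that never leaves the frame it is allocated in:

class W_IntObject(object):
    def __init__(self, intval):
        self.intval = intval

def add_numbers(a, b):
    w_result = W_IntObject(a + b)   # an allocation...
    return w_result.intval          # ...whose object never escapes the frame

# Once inlining has exposed enough context, escape analysis can prove that
# the W_IntObject never outlives add_numbers, so the translation toolchain
# can effectively rewrite the function to:
def add_numbers_optimized(a, b):
    return a + b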

Before the sprint Maciek had started to work on a mark-compact GC for PyPy. The idea is that it is better for memory-constrained environments because it does not double the memory requirements during collections. During the sprint Armin and Maciek worked on cleaning up the code a bit and then merging the branch. An interesting property of the mark-compact GC is that after a collection all the memory that is not currently used by the program is returned to the operating system. Right now the GC is not as fast as our more mature ones, but it will probably be the basis for future tweaking.
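
For illustration only, here is a toy sketch of the mark-compact idea (nothing like PyPy's real GC): a Python list stands in for the heap, references are indices into it, and after compaction the freed tail of the list corresponds to the memory that could be returned to the operating system.

class Obj(object):
    def __init__(self, refs=()):
        self.refs = list(refs)        # indices of referenced heap objects
        self.marked = False

def collect(heap, roots):
    # 1. mark: find everything reachable from the roots
    stack = list(roots)
    while stack:
        obj = heap[stack.pop()]
        if not obj.marked:
            obj.marked = True
            stack.extend(obj.refs)
    # 2. compute forwarding addresses for the survivors
    forward = {}
    for i, obj in enumerate(heap):
        if obj.marked:
            forward[i] = len(forward)
    # 3. fix all references to point to the new locations
    for obj in heap:
        if obj.marked:
            obj.refs = [forward[r] for r in obj.refs]
    # 4. slide the survivors down; the tail of the old heap is now free
    heap[:] = [obj for obj in heap if obj.marked]
    for obj in heap:
        obj.marked = False
    return [forward[r] for r in roots]    # updated root indices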

A small thing that Alexander and Carl Friedrich did to make objects smaller is to enable shared instance dictionaries also for instances of old-style classes; before, this worked only for instances of new-style classes. Shared instance dictionaries are a way to reduce the memory usage of instances. In the optimal case, they give the same memory savings that __slots__ give, but without any behavioural changes. Conceptually the technique is very similar e.g. to the notion of "map" in the Self project, or the hidden classes that Google Chrome's V8 is using (click on the link, there are nice graphics). The difference is that for them it is mostly a way to get faster attribute access, while PyPy is so far only using it for memory savings (but that might change in the future).
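
A toy sketch of the underlying idea (this is not PyPy's actual implementation): instances that have the same set of attribute names share a single "map" describing the layout, and each instance only stores a list of values.

class Map(object):
    def __init__(self):
        self.indexes = {}        # attribute name -> position in the values list
        self.transitions = {}    # attribute name -> map after adding that name

    def add_attribute(self, name):
        if name not in self.transitions:
            new_map = Map()
            new_map.indexes = dict(self.indexes)
            new_map.indexes[name] = len(self.indexes)
            self.transitions[name] = new_map
        return self.transitions[name]

EMPTY_MAP = Map()

class Instance(object):
    def __init__(self):
        self.map = EMPTY_MAP     # shared between all instances with the same
        self.values = []         # attribute names; only the values are per-instance

    def setattr(self, name, value):
        if name in self.map.indexes:
            self.values[self.map.indexes[name]] = value
        else:
            self.map = self.map.add_attribute(name)
            self.values.append(value)

    def getattr(self, name):
        return self.values[self.map.indexes[name]]

Two instances that get the same attributes in the same order end up sharing one map object, so the per-instance cost is just the values list - which is where the __slots__-like memory savings come from.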

In parallel to all the other work, John Witulski worked tirelessly on advancing the AMD64-JIT-backend. John has the implementation of this backend as the topic of his Bachelor's thesis. He is progressing quite well (especially also considering that this is his first sizeable Python project ever), just sometimes being impaired by such annoyances as errors in the official Intel documentation. By now the backend is supporting many integer operations and control flow.

René Dudfield wrote on 2008-10-12 07:55:

Hello,

sounds like some fun sprinting :)

Have you considered mmap for some of those big memory users?

Especially for the unicode stuff, which mostly won't be used by many applications, it should be a good win -- both for load time and memory use.

Double plus extra combo win!!! if you use multiple processes.

cu,

Unknown wrote on 2008-10-21 18:05:

I'm not sure what you mean.
But modern operating systems (at least Linux) use on-demand loading of executables and libraries, so they never copy anything from an executable file to memory unless it is used.

In fact, process startup is implemented internally by mmap()ing the executable and libraries into the address space of the new process.

If you use multiple processes, it still works well - data pages are also shared, until some process writes to them.

For MMU-less devices, the above does not apply (and ucLinux allows Linux to run on them).
But in that case, I guess that no demand loading is available and that mmap() copies the mapped data into memory - you need to explicitly swap code segments in and out (i.e. use good old overlays), and I guess no modern programming environment has direct support for them any more.

You can still emulate overlays with advanced usage of linker scripts, however: you put some code in a section, create variables in the linker script containing the begin and end offsets of that section, and copy data into memory from that section. But I think that making relocations to that code work flawlessly is impossible - you always need to refer to the buffer containing the loaded data.

Prolog-JIT Master's-Thesis Finished

As we already blogged, in the last half-year or so, Michael Leuschel, Armin and I did a lot of JIT generator work on a Prolog prototype. The idea was to experiment with some techniques more quickly than would have been possible with RPython. These experiments were quite successful in themselves: with very little code we managed to get a JIT that is not doing too badly when compared to existing projects for Prolog.

This Prolog work was also the subject of my Master's thesis. I finished the thesis about two weeks ago (and since then have been mostly sleeping and then sprinting). The thesis should be self-contained when it comes to explaining the JIT concepts but needs knowledge of Prolog to be understandable.

PyPy/Python at the Maemo summit

Maciej and I visited the Maemo Summit in Berlin - a community meetup around Nokia's Linux-based mobile platform. We spontaneously did a lightning talk about a first pypy-c running on Maemo and got nice feedback.

We also had a nice lunch with guys from the INDT in Brazil, including Marcio Marcedo and Marcelo Eduardo. It turns out that Python is used a lot on Maemo; for example, the nice Canola UI is done with it. It will be interesting to see how this shapes up in relation to the iPhone and Android.

A lot of Nokia engineers were around, and they announced that from October on they are going for weekly releases of their SDK for the new Fremantle (Maemo 5) Debian-based platform until the SDK becomes final - if we got this right.

Funnily enough, we met Marius Gedminas from the Programmers of Vilnius - he gave a lightning talk on his impressions as a community member. We think Python programmers really should go to non-Python-centric conferences much more.

The whole event took place at the C-Base - it was a bit cramped in some of the sessions, with something like 200 people attending.
cheers, Maciej and Holger