GC improvements

In the last week, I (Armin) have been taking some time off the JIT work to improve our GCs. More precisely, our GCs now take one or two words less for every object. This further reduces the memory usage of PyPy, as we will show at the end.

Background information: RPython object model

We first need to understand the RPython object model as implemented by our GCs and our C backend. (Note that the object model of the Python interpreter is built on top of that, but is more complicated -- e.g. Python-level objects are much more flexible than RPython objects.)

Consider these two RPython classes:

class A:
    def __init__(self, x):
        self.x = x
    def f(self):
        return self.x * 42

class B(A):
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def f(self):
        return self.x + self.y

The instances of A and B look like this in memory (all cells are one word):

instance of A:    | GC header | vtable ptr of A | hash | x |

instance of B:    | GC header | vtable ptr of B | hash | x | y |

The first word, the GC header, describes the layout. Half of it encodes the shape of the object, including where it contains further pointers, so that the GC can trace it. The other half contains GC flags (e.g. the mark bit of a mark-and-sweep GC).

The second word is used for method dispatch. It is similar to a C++ vtable pointer. It points to static data that is mostly a table of methods (as function pointers), containing e.g. the method f of the example.

The hash field is not necessarily there; it is only present in classes whose hash is ever taken in the RPython program (which includes being keys in a dictionary). It is an "identity hash": it works like object.__hash__() in Python, but it cannot just be the address of the object in case of a GC that moves objects around.

Finally, the x and y fields are, obviously, used to store the value of the fields. Note that instances of B can be used in places that expect a pointer to an instance of A.

Unifying the vtable ptr with the GC header

The first idea for saving a word in every object comes from the observation that both the vtable ptr and the GC header store information about the class of the object. Therefore it is natural to try to keep only one of them. Since we still need bits for the GC flags, the field that has to go is the vtable pointer.

This means that method dispatch needs to be more clever: it cannot directly read the vtable ptr, but needs to compute it from the half-word of the GC header. Fortunately, this can be done with no extra instruction at the assembler level. Here is how things look in the end, assuming a 32-bit x86 machine (but note that as usual we just generate portable C).

The trick for achieving efficiency is that we store all vtables together in memory, and make sure that they don't take more than 256 KB in total (16 bits of index, plus 2 bits of alignment). Here is what the assembler code (produced by the normal C compiler, e.g. gcc) for calling a method looks like. Before the change:

MOV EDX, [EAX + 4]               # load the vtable ptr from object EAX
MOV EDX, [EDX + method_offset]   # load the function pointer from the vtable
CALL EDX

Instead, we now have:

MOVZX EDX, [EAX]     # load the 16-bit part of the GC header from EAX
MOV EDX, [vtable_start + 4*EDX + method_offset]
CALL EDX

Note that the complex addressing scheme done by the second MOV is still just one instruction: the vtable_start and method_offset are constants, so they are combined. And as the vtables are anyway aligned at a word boundary, we can use 4*EDX to address them, giving us 256 KB instead of just 64 KB of vtables.
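
To illustrate the scheme, here is a small, purely illustrative Python simulation of the dispatch (the names and data structures are made up; the real thing is the generated C whose assembly is shown above):

vtables = []                      # plays the role of the packed vtable area

def new_vtable(methods):
    # returns the 16-bit "type id" that the GC header will carry
    vtables.append(methods)
    return len(vtables) - 1

class Instance(object):
    def __init__(self, typeid, **fields):
        self.gc_header = typeid   # low half-word of the GC header
        self.__dict__.update(fields)

def call_method(obj, name, *args):
    typeid = obj.gc_header & 0xFFFF      # MOVZX EDX, [EAX]
    method = vtables[typeid][name]       # MOV EDX, [vtable_start + 4*EDX + off]
    return method(obj, *args)            # CALL EDX

A_typeid = new_vtable({'f': lambda self: self.x * 42})
a = Instance(A_typeid, x=10)
print(call_method(a, 'f'))               # prints 420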

Optimizing the hash field

In PyPy's Python interpreter, all application-level objects are represented as an instance of some subclass of W_Root. Since all of these objects could potentially be stored in a dictionary by the application Python program, all these objects need a hash field. Of course, in practice, only a fraction of all objects in a Python program end up having their hash ever taken. Thus this field of W_Root is wasted memory most of the time.

(Up to now, we had a hack in place to save the hash field on a few classes like W_IntegerObject, but that meant that the Python expression "object.__hash__(42)" would raise a TypeError in PyPy.)

The solution we implemented now (done by some Java GCs, among others) is to add a hash field to an object when the (identity) hash of that object is actually taken. This means that we had to enhance our GCs to support this. When objects are allocated, we don't reserve any space for the hash:

object at 0x74B028:    | ...00... | x | y |

When the hash of an object is taken, we use its current memory address, and set a flag in the GC header saying that this particular object needs a hash:

object at 0x74B028:    | ...01... | x | y |

If the GC needs to move the object to another memory location, it will make the new version of the object bigger, i.e. it will also allocate space for the hash field:

object at 0x825F60:    | ...11... | x | y | 0x74B028 |

This hash field is immediately initialized with the old memory address, which is the hash value that we have returned so far for the object. To avoid disturbing the layout of the object, we always put the extra hash field at its end. Of course, once set, the hash value does not change even if the object needs to move again.
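
Sketched as a small runnable Python simulation (purely illustrative; the names and the toy "addresses" are made up, the real logic lives inside the GC):

class ToyObject(object):
    def __init__(self, address):
        self.address = address        # current (fake) memory address
        self.hash_taken = False       # the flag bit in the GC header
        self.stored_hash = None       # the extra field, only present after a move

def identity_hash(obj):
    if obj.stored_hash is not None:   # object already carries a hash field
        return obj.stored_hash
    obj.hash_taken = True             # remember that its address was handed out
    return obj.address

def gc_move(obj, new_address):
    # what the GC does when it relocates the object
    if obj.hash_taken and obj.stored_hash is None:
        # allocate one extra word and preserve the old address as the hash
        obj.stored_hash = obj.address
    obj.address = new_address

o = ToyObject(0x74B028)
h = identity_hash(o)                  # 0x74B028
gc_move(o, 0x825F60)
assert identity_hash(o) == h          # the hash is stable across moves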

Results

Running the following program on PyPy's Python interpreter with n=4000000:

class X(object):      # an empty class is assumed here; each instance is one list node
    pass

def make_linked_list(n):
    a = None
    i = 0
    while i < n:
        b = X()
        b.next = a
        a = b
        i += 1

the two optimizations together save 32 MB of RAM (i.e. 8 bytes per object). The version of PyPy we measured this with was built as follows:

./translate.py --gcremovetypeptr targetpypystandalone --objspace-std-withsharingdict

On a 32-bit Linux the total amount of RAM used is 247 MB, and the program completes in 10.3 seconds. On CPython, it consumes 684 MB and takes 89 seconds to complete... This nicely shows that our GCs are much faster at allocating objects, and that our objects can be much smaller than CPython's.

Armin Rigo & Carl Friedrich Bolz

Shahms wrote on 2009-10-16 16:53:

Not really GC related and you may have covered this in another post, but how does PyPy handle id() in a world where the object may move? Is the hash field reused for this when necessary as well? If so, how do you deal with the possibility of another object being allocated at the same address as the original object? If not, how do you avoid having an object's id() change when it's moved?

kbob wrote on 2009-10-16 17:55:

Very nice. Using the address for the hash value was especially clever. But how random are those hash values?

Alex wrote on 2009-10-16 19:15:

kbob: If PyPy is anything like CPython the randomness isn't so important. The CPython dictionary hash collision resolution strategy is extremely efficient, even amongst hashes with very similar values.

Lucian wrote on 2009-10-16 19:39:

This is all sorts of cool. I can't wait for a mostly-production-ready PyPy with JIT.

On a somewhat related note, how do the JIT and ctypes interact right now, if at all?

Carl Friedrich Bolz-Tereick wrote on 2009-10-16 19:43:

Shahms: Excellent question! The implementation of id that we have is basically a weak key dict mapping objects to ids on demand. This has the fun side-effect that the ids of PyPy's objects start at 1 and count up from there.

This is rather inefficient (e.g. your garbage collections become linearly slower the more objects you have that have their id taken), but there is not much else you can do. Jython uses a similar solution. For this reason, calling id a lot is essentially discouraged in code you want to run on PyPy.
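
A minimal sketch of such an on-demand scheme in plain Python (illustrative only; PyPy's real implementation lives next to the GC and keys strictly on object identity):

import weakref

_ids = weakref.WeakKeyDictionary()    # entries disappear together with their objects
_next_id = [1]                        # ids start at 1 and count up

def toy_id(obj):
    # note: a real implementation must key on identity, not on __eq__/__hash__
    try:
        return _ids[obj]
    except KeyError:
        _ids[obj] = _next_id[0]
        _next_id[0] += 1
        return _ids[obj]

class Anything(object):
    pass

a = Anything()
print(toy_id(a))     # 1
print(toy_id(a))     # still 1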

Carl Friedrich Bolz-Tereick wrote on 2009-10-16 19:50:

kbob: I think they should be random enough. You get a collision if you ask the hash of object a, then a collection happens that moves a, then you ask object b for its hash and object b happens to be in the place where object a was before. That sounds unlikely.

If you write contrived code that has a loop that repeatedly allocates an object, asks for its hash by putting it into a dict and then forces a nursery collection, you can get collisions: all those objects will be at the beginning of the nursery when their hash is taken. Unlikely again to occur in practice.

Carl Friedrich Bolz-Tereick wrote on 2009-10-16 19:57:

Alex: you are right. We use exactly CPython's algorithm for implementing dicts, so having bad hash functions is not a big problem. However, if you really have hash value collisions (e.g. everything hashes to 1) your dict still degenerates to essentially a linear search.

Skandalfo wrote on 2009-10-16 20:05:

Wow! You guys are my computing heroes.

Whenever I talk to other people about your project, I always state you are the best example I can imagine of REAL innovation in computer languages.

That said, I gather the only thing making id() different from hash() is that you need to guarantee that the values for live objects are always unique.

You could just use the same strategy as with the hash, sticking the id value along the object the next time the object is moved by the GC.

Meanwhile, from the time id() is called to the time the object is moved, you can just temporarily store an {address: id} mapping somewhere. Entries would be removed from the map once the objects get moved. From then on the id would be attached to the object.

If GC cycles are frequent, the map doesn't have to grow too large.

I don't know if the need for id reuse after the id space gets exhausted is important or not. Once you get to the end of the space, you would have to scan the map and heap to find a convenient "hole" to reuse, I suppose.

Shahms wrote on 2009-10-16 20:19:

Thanks, Carl. Following up what Skandalfo said, (although this is probably a poor forum for such discussions), it seems like you could reuse the hash field for id as well. Given that the minimum size for a Python object is > 1 byte, you should have at least that much space for offsetting the hash/id. As the GC/allocator has to store information about addresses and blocks anyway it should be a relatively simple matter of building and maintaining a bloom filter of offsets in use for a particular base address.

Of course, this also constrains the addresses at which Python objects may be allocated, and the lower bits in the address may already be used for other purposes...

Carl Friedrich Bolz-Tereick wrote on 2009-10-16 20:37:

Skandalfo, Shahms: I guess there are possible ways to make id a bit faster than what we have now. What we have now is well-tested and works reasonably enough. I assume anyway that there is not too much Python code whose performance depends critically on having an extremely efficient implementation of id (and if there is, I am prepared to ask the author to rewrite the code instead :-) ).

Skandalfo wrote on 2009-10-16 20:38:

Shahms: I confess I don't understand your proposal. Do you mean you can have at most as many live objects as the available address space divided by the object alignment?

When I talked about id space I wasn't referring to the memory required to store the per-object id value, but the fact that if you assign the id values using sequential values, and those values are, for instance, 64 bit integers, you could theoretically create and destroy a lot of objects in a long lived process and the sequence would wrap around.

About making hash/id the same, I've just checked that CPython does indeed use the id() value as the value returned by the default hash() implementation.

You could just do the same, and use the id value as the "master" one. For hash() you would just call id(). This allows you to use just one value attached to the objects for both functions.

The cost of that approach would be having to assign an id immediately (having to put it into the temporary map, then having to look it up in the map until the next time the object is moved) for the call to hash() (with no call to id()) too.

The good thing compared to the weak key dict, is that the temporary map doesn't need to be garbage collected at all. The entries are removed when objects are moved (or collected).

Shahms wrote on 2009-10-16 20:44:

Carl, no doubt you're right. I know that I can probably count the number of times I've needed to use id() on one hand and I'm pretty sure the vast majority of those cases was sticking an unhashable object in a dict.

Skandalfo wrote on 2009-10-16 20:53:

Carl, Shahms: I couldn't agree more about id() not being important.

Probably Guido should have refrained from making it available in CPython at the time. I suppose it was just easy to add it to the language with the memory allocation model of CPython. The fact is that I don't really see any use for id() once you have the "is" operator and the hash() method...

Michael Hudson-Doyle wrote on 2009-10-16 22:19:

Yay, I remember talking about removing the gc type pointer, oh, about 3.5 years ago :) Cool that it got done, sounds like a neat pair of hacks.

Maciej Fijalkowski wrote on 2009-10-17 01:17:

@Lucian:

ctypes and the JIT work just fine together.

Anonymous wrote on 2009-10-17 09:57:

Doesn't deepcopy use id() a lot? I remember once using deepcopy on a complicated structure, resulting in thousands of id() calls.

RonnyPfannschmidt wrote on 2009-10-17 10:08:

What about pickle - as far as I remember, its memo code for dealing with object cycles uses id, too.

Armin Rigo wrote on 2009-10-17 16:32:

Too bad for the current implementation of pickle and deepcopy. The fault in that case is CPython's general view that id() is cheap, despite repeated attempts to convince them otherwise. These attempts have been made notably by guys from Jython, even before PyPy's time; indeed id() is a mess for any implementation apart from CPython's simple non-moving GC.

A suitable replacement would be e.g. a 'collections.identitydict' type, if someone feels like going Yet Another Time to python-dev with this issue.

Marius Gedminas wrote on 2009-10-17 22:20:

When I was writing objgraph I saw no way of traversing arbitrary object graphs without using id().

collections.identitydict sounds like a nice idea. Has anyone written a PEP for it?

Anonymous wrote on 2009-10-18 09:14:

Is there any possibility to translate pypy under OSX 10.6 as 32bit? Translation works but I get a "ValueError: bad marshal data" when running pypy-c. I assume that is due to the fact that I got a 64bit binary.

Maciej Fijalkowski wrote on 2009-10-18 18:49:

@Anonymous:

Try deleting all your .pyc files and see what happens.

Armin Rigo wrote on 2009-10-19 10:30:

Marius: as I said, feel free to :-) but the current situation is, no, not as far as I know.

klaussfreire wrote on 2009-10-19 16:38:

Wouldn't it free up the GC from all that burden if only a set of live ids were kept? (ie: no weak dict)

So, when you get an id() call, you check the object to see if there's a cached id (much like the hash hack) - if not, you generate a random (or sequential) unused id and store it both in the "live ids" set and in the object's structure, as done with hash values.

So, successive calls to id() would be as fast as in CPython, and garbage collection would be fast too (only an extra set deletion per object whose id was obtained).

In fact, this set could be implemented as a bit array with "free lists", which could be very very efficient, given that its size will be bound by the number of live objects.

Armin Rigo wrote on 2009-10-21 08:11:

Claudio: this doesn't work (unless I misunderstood). You cannot add a field like the hash field at any point in time, but only during collections when the object moves.

klaussfreire wrote on 2009-10-21 13:34:

Yes, I've been thinking about that too.

But that can be patched - the weak key dict could still be used for those objects that haven't been collected yet. Since addition of the id would most likely happen in the nursery, or the first generation at most (big assumption), I don't think the dict would grow very big even under heavy id() usage.

omul cu 6233 wrote on 2009-11-02 21:51:

Wohoo, nice performance

Unknown wrote on 2010-04-14 15:14:

I'm astonished a bit by your need to pack vtables together within 256KB. How many bits do you need for mark-and-sweep marking or similar stuff? The usual solution I've seen for this is to use the low two bits of the vtable pointer for flags, usually, and mask them off when reading the vtable pointer. Would it work here?

If that isn't enough, then you have to pack vtables together as you do (maybe in a bigger space if you can use more bits).

PJE wrote on 2010-09-22 18:22:

I can think of one place where I use a lot of id() calls, and that's in PEAK-Rules' generic function implementation, for indexing "is" tests.

For example, if you have a bunch of methods that test if "x is Something" (for different values of Something), then a dictionary of id()'s is used to identify which of these tests went off. While the total number of Somethings isn't likely to be high, the weakref dict in PyPy means that every 'x' the function is called with will end up burning memory and speed to hold an id forever.

While it's perhaps the case that I could avoid this by using a linear search (ugh) in cases where the number of Somethings is small, it's an example of a place where id() makes an operation neat, fast, and simple in regular Python.

Of course, if there were another way to put arbitrary (i.e possibly-unhashable, comparable only by identity) objects in a dictionary, and then determine whether a given object was one of them, that'd certainly be a suitable substitute.

Or, if PyPy offered a temp_id() that would simply let you *check* identity, without forcing the object to hold onto it, that'd work fine too. Say, if there was a has_id() that told you if an id() is outstanding for the object already, or a get_id() that returned None for an object whose id() had never been taken.

With an API like that, I could prevent memory/speed blowup by not having each call of the function adding more objects to PyPy's id() dict.

(Heck, perhaps such an API should be added across Python versions, i.e., to CPython and Jython as well.)

Maciej Fijalkowski wrote on 2010-09-22 18:30:

@PJE PyPy offers collections.identity_dict, or something similar, which would have the effect you want (but internally doesn't use the id operation, just object identity).

Anonymous wrote on 2011-05-07 02:34:

This program in C# takes 589 milliseconds, and 52 MB RAM. 17x faster, 4.75x less RAM.

Anonymous wrote on 2011-09-15 10:37:

And in assembly it will be even faster and smaller.

Python has many lovely attributes, but efficiency is not its primary virtue. That said, making it more efficient is still a plus, which is what this work is doing.

First pypy-cli-jit benchmarks

As the readers of this blog already know, I've been working on porting the JIT to CLI/.NET for the last months. Now that it's finally possible to get a working pypy-cli-jit, it's time to do some benchmarks.

Warning: as usual, all of this has to be considered an alpha version: don't be surprised if you get a crash when trying to run pypy-cli-jit. Of course, things are improving very quickly so it should become more and more stable as days pass.

For this time, I decided to run four benchmarks. Note that for all of them we run the main function once in advance, to let the JIT recognize the hot loops and emit the corresponding code (see the sketch after the list). Thus, the results reported do not include the time spent by the JIT compiler itself, but give a good measure of how good the code generated by the JIT is. At this point in time, I know that the CLI JIT backend spends way too much time compiling stuff, but this issue will be fixed soon.

  • f1.py: this is the classic PyPy JIT benchmark. It is just a function that does some computational intensive work with integers.
  • floatdemo.py: this is the same benchmark involving floating point numbers that has already been described in a previous blog post.
  • oodemo.py: this is just a microbenchmark doing object oriented stuff such as method calls and attribute access.
  • richards2.py: a modified version of the classic richards.py, with a warmup call before starting the real benchmark.
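
The warm-up step mentioned above looks roughly like this (a sketch of the methodology, not the actual benchmark harness):

import time

def measure(main, n):
    main(n)                   # warmup run: the JIT traces and compiles the hot loops
    start = time.time()
    main(n)                   # timed run: measures the generated code only
    return time.time() - start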

The benchmarks were run on a Windows machine with an Intel Pentium Dual Core E5200 2.5GHz and 2GB RAM, both with .NET (CLR 2.0) and Mono 2.4.2.3.

Because of a known mono bug, if you use a version older than 2.1 you need to pass the option -O=-branch to mono when running pypy-cli-jit, else it will just loop forever.

For comparison, we also run the same benchmarks with IronPython 2.0.1 and IronPython 2.6rc1. Note that IronPython 2.6rc1 does not work with mono.

So, here are the results (expressed in seconds) with Microsoft CLR:

Benchmark     pypy-cli-jit   ipy 2.0.1   ipy 2.6   ipy 2.0.1/pypy   ipy 2.6/pypy
f1                   0.028       0.145     0.136            5.18x          4.85x
floatdemo            0.671       0.765     0.812            1.14x          1.21x
oodemo               1.25        4.278     3.816            3.42x          3.05x
richards2         1228         442       670                0.36x          0.54x

And with Mono:

Benchmark     pypy-cli-jit   ipy 2.0.1   ipy 2.0.1/pypy
f1                   0.042       0.695           16.54x
floatdemo            0.781       1.218            1.55x
oodemo               1.703       9.501            5.31x
richards2          720         862                1.20x

These results are very interesting: under the CLR, we are between 5x faster and 3x slower than IronPython 2.0.1, and between 4.8x faster and 1.8x slower than IronPython 2.6. On the other hand, on Mono we are consistently faster than IronPython, up to 16x. It is also interesting to note that pypy-cli runs faster on CLR than on Mono for all benchmarks except richards2.

I've not investigated yet, but I think that the culprit is the terrible behaviour of tail calls on CLR: as I already wrote in another blog post, tail calls are ~10x slower than normal calls on CLR, while being only ~2x slower than normal calls on Mono. richards2 is probably the benchmark that makes the most use of tail calls, thus explaining why we have a much better result on Mono than on CLR.

The next step is probably to find an alternative implementation that does not use tail calls: this will probably also improve the time spent by the JIT compiler itself, which is not reported in the numbers above but is so far surely too high to be acceptable. Stay tuned.

Michael Foord wrote on 2009-10-15 15:01:

Perhaps you should try another run with the .NET 4 beta. They have at least *mostly* fixed the terrible performance of tail calls there.

Anyway - interesting stuff, keep up the good work. What is the current state of .NET integration with pypy-cli?

Antonio Cuni wrote on 2009-10-15 15:17:

Oh, I didn't know about .NET 4 beta. Have you got any link that explains how they fixed the tail call stuff? I'll surely give it a try.

About the .NET integration: no news from this front. Nowadays I'm fully concentrated on the JIT because I need some (possibly good :-)) results for my PhD thesis. When pypy-cli-jit is super-fast, I'll try to make it also useful :-)

Michael Foord wrote on 2009-10-15 15:30:

Here's at least one link (with some references) on the tail call improvements in .NET 4:

https://extended64.com/blogs/news/archive/2009/05/10/tail-call-improvements-in-net-framework-4.aspx

Michael Foord wrote on 2009-10-15 15:31:

I'm also intrigued as to why you didn't benchmark IronPython 2.6 on Mono? I thought that on very recent versions of Mono you could build and run IronPython 2.6 fine now?

Michael Foord wrote on 2009-10-15 15:34:

Ah, I see now you say that it doesn't work. Hmmm... there are definitely folks who maintain a version that does work (perhaps needing Mono 2.4.3 which I guess is trunk?).

See the download previews here anyway: https://ironpython-urls.blogspot.com/2009/09/more-from-mono-moonlight-2-monodevelop.html

Anonymous wrote on 2009-10-15 16:09:

I wonder if this paper would be useful? It's a way to do continuations using the stack on .NET. Maybe you can use it to speed up tail calls?

https://www.cs.brown.edu/~sk/Publications/Papers/Published/pcmkf-cont-from-gen-stack-insp/

Antonio Cuni wrote on 2009-10-19 11:58:

@Michael: from the link you posted, it seems that tail call improvements in .NET 4 are only for x86_64, but my benchmarks were run on 32 bit, so I don't think it makes a difference. Anyway, I'll try to benchmark with .NET 4 soon, thanks for the suggestion.

@Anonymous: the paper is interesting, but I don't think it's usable for our purposes: throwing and catching exceptions is incredibly costly in .NET, we cannot really use them too heavily. The fact that the paper says nothing about performance is also interesting :-)

PyPy's JIT now supports floats

Hello.

We've just merged the branch which adds float support to the x86 backend. This means that floating point operations are now super fast in PyPy's JIT. Let's have a look at an example, provided by Alex Gaynor and stolen from the Factor blog.

The original version of the benchmark was definitely tuned for the performance needs of CPython.

For running this on PyPy, I changed it to a somewhat simpler version of the program, and I'll explain a few changes that I made, which reflect the current limitations of PyPy's JIT. They're not very deep and they might already be gone by the time you're reading this:

  • Usage of __slots__. This is a bit ridiculous, but we spent quite a bit of time speeding up normal instances of new-style classes, which are very fast, yet ones with __slots__ are slower. To be fixed soon.
  • Usage of reduce. This one is even more obscure, but reduce is not perceived as a thing producing loops in a program. Moving to a pure-Python version of reduce (as sketched below) fixes the problem.
  • Using x ** 2 vs x * x. In PyPy, reading a local variable is a no-op when JITted (the same as reading a local variable in C). However, multiplication is a simpler operation than the power operation.
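
To make the last two points concrete, here is the flavour of the rewrites (an illustrative sketch, not the actual benchmark source):

def my_reduce(func, seq, start):
    # a plain loop instead of the built-in reduce, so the JIT sees the loop
    acc = start
    for item in seq:
        acc = func(acc, item)
    return acc

def norm_squared(vec):
    # x * x instead of x ** 2: multiplication is cheaper than the power operation
    return my_reduce(lambda acc, x: acc + x * x, vec, 0.0)

print(norm_squared([1.5, 2.0, 2.5]))      # prints 12.5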

I also included the original Java benchmark. Please note that the original Java version is similar to my modified one (not the one specifically tuned for CPython).

The performance figures below (for n = 1 000 000), average of 10 runs:
  • CPython 2.6: 7.56s
  • CPython & psyco 2.6: 4.44s
  • PyPy: 1.63s
  • Java (JVM 1.6, client mode): 0.77s

and while the JVM is much faster, it's very good that we can even compare :-)

Cheers
fijal
Anonymous wrote on 2009-10-06 17:26:

So it's much faster than Psyco and only about 2x slower than the JVM. That's impressive, as Python is much more dynamic!

Congrats and thanks for the regular updates, it's much appreciated.

Luis wrote on 2009-10-06 17:31:

Very exciting!
By the way, this result doesn't include the time to generate assembler. Right?

Anonymous wrote on 2009-10-06 17:37:

Great, you guys are heroes!

Btw, what's the next big hurdle to run real-world programs? Memory use? Threads?

Anonymous wrote on 2009-10-06 17:47:

Great job! I really appreciate your work.

@Luis: I think it does include the assembler. I just compiled trunk and ran the modified benchmark on Python 2.6 and pypy-c-jit. Best time of 10 runs:
Python 2.6.2: 0.911113977432
PyPy: 0.153664112091
So it's nearly 6x faster for me (including the time for generating the assembler, of course) - even much better than the posted numbers... I don't know if CPython was run with the unmodified version of the benchmark though.

William wrote on 2009-10-06 19:36:

I'd be interested to see the results for a much longer run (n = 10 000 000?).

Panos Laganakos wrote on 2009-10-06 19:55:

Wicked! Keep the sweetness coming :)

Unknown wrote on 2009-10-07 03:15:

Very exciting. Thanks! These are nearing "holy crap" numbers.

<mindControl>siiiixty foooouuur biiiiit<mindControl>

:-)

René Dudfield wrote on 2009-10-07 11:35:

awesome! things are really starting to move along now :)

I tried the same little benchmark with the shedskin python to C++ compiler for comparison:

cpython2.5: 16.2770409584
cpython2.6: 12.2321541309
shedskin: 0.316256999969

Shedskin is 38.6 times faster than cpython2.6, and 51.4 times faster than cpython2.5... and to extrapolate from your numbers 3.9 times faster than the jvm.

Of course that doesn't include the time it takes to generate the C++ and then compile it with g++ (using the old 4.0.1 g++, not the latest 4.4). I also didn't include the python interpreter startup cost.

btw, I found map, reduce and filter all to be faster with pure python versions when using psyco too.

cu!

Maciej Fijalkowski wrote on 2009-10-07 13:42:

@illume

that's a bit of an unfair comparison, since shedskin is not Python. You can compare RPython and shedskin though. RPython is sometimes even faster than C...

And also, yes, in PyPy or psyco time we include compilation time.

Cheers,
fijal

Luis wrote on 2009-10-07 14:34:

I'm still confused.. if you post the average of 10 runs, and assembler is generated only in the first run, then this time is diluted. Shouldn't you compute the average of 10 runs, but excluding the first one? (that means, running it 11 times and ignoring the first one?).

Anonymous wrote on 2009-10-07 18:31:

@Luis: no, I think fijal started the pypy-c interpreter 10 times, and each time it generates assembly (it's not cached afaik).

Luis wrote on 2009-10-07 19:28:

Well, no matter how they measure it, this is definitely within the "Holy Crap" range...

Maciej Fijalkowski wrote on 2009-10-07 20:37:

@Luis:

Maybe I should... I really ran this 10 times while assembler was generated only during the first run. But then, diluting assembler generation time over the runs is kind of a real-life effect...

Baczek wrote on 2009-10-08 16:06:

how about including unladen swallow results?

Michael Allman wrote on 2009-10-08 18:26:

How come the pypy JIT is compiled AOT to C? I thought the idea of PyPy was to implement a python runtime in python? Why not run the JIT on a python runtime?

Awesome work. I wish the Ruby folk were as motivated...

Cheers.

Anonymous wrote on 2009-10-08 18:32:

I seem to recall grumblings from C++ programmers a few years ago when Java started supporting multi-core architecture, which made Java execution as fast or faster than C++ with much less development effort (for free with the Java interpreter vs hand-written C++ support).

If your testing machine is a multi-core/processor machine, it might be appropriate to say that PyPy is now as fast as C++ (without explicit multi-core support). Wow!

Armin Rigo wrote on 2009-10-09 11:39:

Michael: because our goal is to have a general framework, not a Python-centered solution. For example, the JIT generator works mostly out of the box with any other language that we implemented in RPython (which includes Smalltalk).

hihhu wrote on 2009-10-09 18:06:

Great work!

How large an effort would it be to have eg. Perl or Ruby working with this? Just out of curiosity, I'm trying to understand this project better.

Anonymous wrote on 2009-10-09 20:23:

In the correct original version of the benchmark there are two calls to sin(). A good compiler optimizes one of them away. A worse compiler doesn't. So it's fairer to put back the second sin in the Python code too.

Maciej Fijalkowski wrote on 2009-10-11 19:37:

@hihu:

It would be a bit easier than writing the interpreter in C, since RPython is much nicer. Also, you get a JIT almost for free and a decent GC for free. On the other hand, writing interpreters is quite a bit of work on its own.

@Anonymous:

Indeed, well spotted, it would be fairer. However, there is no measurable difference (at least in PyPy's running time).

PS. We have weekends, too.

Cheers,
fijal

della wrote on 2009-10-12 08:48:

Would a Pypy implementation of Perl/Ruby/PHP mean that it would be possible to use libraries developed in one language for the other one? That would be very cool indeed.

And, for that matter, would that mean interoperability between python2 and python3 modules when the py3 interpreter will be done? :)

Maciej Fijalkowski wrote on 2009-10-12 15:33:

@della.

In general, that would not be that simple. You need to somehow map data types between interpreters in an unclear manner. For example, what would happen if you call Python2.x function passing argument that is py3k dict (which has different interface)?

Cheers,
fijal

della wrote on 2009-10-13 09:34:

One would imagine having different interfaces for the same objects when accessed from 2.x and 3.x code. Would that be difficult?

Of course, I understand mapping data structures between languages that have many more differences between them than py2 and py3 would definitely be more complex.

Anonymous wrote on 2009-11-02 18:08:

Not to rain on the parade, but Java's trig functions are very slow outside of -pi/2,pi/2 range to correct terrible fsin/fcos results on Intel x86.

See https://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4857011

Your benchmark should include something to measure the error, or not use trig functions as a benchmark when comparing to Java.

First results of the JIT

Hi all,

Just a quick note to tell you that we are progressing on the JIT front. Here are the running times of the richards benchmark on my laptop:

  • 8.18 seconds with CPython 2.5.2;
  • 2.61 seconds with pypy-c-jit (3x faster than CPython);
  • 1.04 seconds if you ignore the time spent making assembler (8x faster than CPython);
  • 1.59 seconds on Psyco, for reference (5x faster than CPython).

Yes, as these numbers show, we are spending 1.57 seconds in the JIT support code. That's too much -- even ridiculously so -- for anything but a long-running process. We are working on that :-)

If you want to build your own pypy-c-jit (for x86-32 only for now):

  • you need a Subversion checkout of trunk;
  • run pypy/translator/goal/translate.py with the -Ojit option;
  • as usual, wait a long time (and be sure you have more than 1GB of RAM).

For now pypy-c-jit spews a lot of debugging output and there are a few known examples where it crashes. As we like to repeat, however, it's a complete JIT: apart from the crashes (the bugs are probably in the JIT support code), it supports the whole Python language from the start -- in the sense of doing correct things. Future work includes Python-specific improvements, e.g. tweaking the data structures used to store Python objects so that they are more JIT-friendly.

EDIT: Oh yes, fijal reminds me that CPython 2.6 is 30% faster than CPython 2.5 on this benchmark (which is mostly my "fault", as I extracted a small part of PyPy and submitted it as a patch to CPython that works particularly well for examples like richards). It does not fundamentally change the fact that we are way faster though.

Unknown wrote on 2009-09-27 17:56:

This thing just got interesting.

Why this particular benchmark?

cjrh wrote on 2009-09-27 19:32:

Fantastic!

At this point, it would be a really good idea for the pypy team to prepare downloadable binaries, setup tools, or eggs for making it extremely easy for a new user to try it out. Now that the performance is starting to become interesting, many more people will want to experiment with it and you don't want that enthusiasm hampered by a somewhat involved setup process.

Unknown wrote on 2009-09-27 19:40:

> it would be a really good idea for the pypy team to prepare downloadable binaries or setup tools, or eggs

I second this notion. I am among the group of people who are quite tempted to try things out, but not sure how much work I'll have to do first.

Anonymous wrote on 2009-09-27 20:27:

Me too I'd like pre-built binaries

Anonymous wrote on 2009-09-27 20:29:

I agree. Please put some binaries on your page to make it easier for everyone to survey what you've done!

Armin Rigo wrote on 2009-09-27 20:30:

This particular benchmark happens to be the one we use; there is no deep reason besides its relative simplicity (but it is not a microbenchmark, it's really computing something). We will of course make more tests with a better range of benchmarks when things start to settle a bit. Right now we are busy developing, and the numbers change every week.

It's also for this reason that there is no nicely-packaged release, sorry :-)
Note that translating your own pypy-c-jit is not a lot of work for you. It is just a lot of work for your CPU :-) You just do "svn co", "cd pypy/translator/goal" and "./translate -Ojit".

Anonymous wrote on 2009-09-27 20:41:

I would appreciate binaries because I don't have a computer with multi-GB RAM. I tried translating pypy a few months ago but gave up after several hours (the computer was just swapping constantly).

I can wait some longer, but regular binary releases (even if just unstable trunk snapshots) would be useful.

Anyway, keep up the good work! This is looking really promising.

Anonymous wrote on 2009-09-28 00:47:

Quite nice, thank you..

PavPanchekha wrote on 2009-09-28 03:27:

Perhaps there is some way to store generated assembler code? I don't know too much about assembler or the JIT backend, but I assume that it'd be possible to stick the generated assembler code into a comment (or, if those don't exist, a docstring) in the .pyc file, so that a library that is commonly imported won't have to waste time generating assembler.

Benjamin Peterson wrote on 2009-09-28 03:45:

@PavPanchekha We specialize the assembler aggressively, so that probably wouldn't be so useful. We have a lot of room to improve on assembly generation, though.

Unknown wrote on 2009-09-28 07:09:

Thanks for the update!

Anonymous wrote on 2009-09-28 11:40:

Anonymous said: I would appreciate binaries because I don't have a computer with multi-GB RAM.

I do have such a computer, but I would still appreciate binaries, because the current trunk does not translate for me:

[translation:ERROR] File "/tmp/pypy/pypy/annotation/annrpython.py", line 227, in addpendingblock
[translation:ERROR] assert s_oldarg.contains(s_newarg)
[translation:ERROR] AssertionError':
[translation:ERROR] .. v1703 = simple_call((function mmap), v1702, map_size_0, (7), (34), (-1), (0))
[translation:ERROR] .. '(pypy.rlib.rmmap:628)alloc'

Armin Rigo wrote on 2009-09-28 11:59:

You are probably missing a dependency. See https://codespeak.net/pypy/dist/pypy/doc/getting-started-python.html#translating-the-pypy-python-interpreter

della wrote on 2009-09-28 13:43:

Great work! Is it possible to build the 32-bit binary on a 64-bit machine without too much effort? Having those instructions would certainly help us 64-bit people :)

Luis wrote on 2009-09-28 14:02:

I guess the time spent making assembler is only the first time the code is executed. Is that right? If so, we can consider an 8x speedup as the most accurate result. Or not?

nshepperd wrote on 2009-09-28 14:41:

@della: I use a 32-bit chroot on my own x64 machine. I don't know if that counts as "too much effort" (certainly it theoretically shouldn't require that), but it has been for me the most painless way to do it.

Maciej Fijalkowski wrote on 2009-09-28 15:16:

@Luis: yes, it's only the first time. Well, it depends how you count, but it can be considered an 8x speedup...

Armin Rigo wrote on 2009-09-28 20:17:

Here are prebuilt C sources (in which "tracing" time was reduced by 20-30% since the blog post):

https://wyvern.cs.uni-duesseldorf.de/~arigo/chain.tar.bz2

Linux x86-32 only. You still need a svn checkout of PyPy, and you still need to compile them with gcc -- but it does not take too long: edit the first entry of the Makefile to point to your checkout of PyPy and type "make". This still assumes that all dependencies have been installed first. Don't complain if the #includes are at the wrong location for your distribution; you would get them right if you translated the whole thing yourself. In fact, don't complain if it does not compile for any reason, please :-) C sources like that are not really supposed to be portable, because they are just intermediates in the translation process.

Anonymous wrote on 2009-09-28 21:28:

You are probably missing a dependency. See https://codespeak.net/pypy/dist/pypy/doc/getting-started-python.html#translating-the-pypy-python-interpreter

Dear Armin, it seem like this document should mention libexpat1-dev and libssl-dev as dependencies, too. Anyway, I managed to build pypy-c, and here are the result for some small benchmarks I wrote. (Is there a way here at blogger.com to not break the formatting?)

                python 2.5   psyco   pypy-c
richards              14.9     2.9      3.9
mergesort             27.6     4.8     26.3
convexhull             9.4     5.6      6.3
bigcityskyline        46.9     3.1      7.6
fft                   14.1    15.4     25.0

Thank you all for your efforts.

Armin Rigo wrote on 2009-09-29 07:47:

Thanks for the missing dependencies; added to the development version of the page. Thanks also for the numbers you report. The next obvious thing we miss is float support (coming soon!), which shows in some of your benchmarks.

René Dudfield wrote on 2009-09-29 08:06:

Hi,

this is so unbelievably awesome, it's going to take me a while to recover from all the awesomness.

CONGRATS!

ps. a nice improvement for users is to get your ./configure script to find dependencies and report the ones missing, and ones used (s/configure/setup.py/g).

Anonymous wrote on 2009-09-29 13:09:

nice!

so what is your guess at the moment? how fast can pypy get if you further optimize the jit?

Anonymous wrote on 2009-09-29 13:47:

Dear Pypy developers, is it possible to switch off the very agressive JIT logging in pypy-c? First, this could make pypy-c a drop-in replacement for cpython. (Many more beta-testers.) Second, the logging itself seems to be somewhat resource-intensive.

Very cool Mandelbrot ascii art, by the way.

Maciej Fijalkowski wrote on 2009-09-30 08:42:

Dear anonymous.

you can compile ./translate.py -Ojit --jit-debug=profile

There is no runtime switch unfortunately, so far.

Cheers,
fijal

Anonymous wrote on 2009-09-30 10:32:

Thank you! For many of us, the translation-time switch will be just as good.

della wrote on 2009-10-01 09:31:

I can't seem to compile (32-bit Ubuntu 9.10 chroot), by manually executing the Makefile in /tmp/usession-0/testing_1 I get this traceback:

File "/home/della/pkg/pypy/trunk/pypy/translator/c/gcc/trackgcroot.py", line 1210, in (module)
tracker.process(f, g, filename=fn)
File "/home/della/pkg/pypy/trunk/pypy/translator/c/gcc/trackgcroot.py", line 229, in process
lines = self.process_function(lines, entrypoint, filename)
File "/home/della/pkg/pypy/trunk/pypy/translator/c/gcc/trackgcroot.py", line 244, in process_function
table = tracker.computegcmaptable(self.verbose)
File "/home/della/pkg/pypy/trunk/pypy/translator/c/gcc/trackgcroot.py", line 285, in computegcmaptable
self.parse_instructions()
File "/home/della/pkg/pypy/trunk/pypy/translator/c/gcc/trackgcroot.py", line 364, in parse_instructions
meth = self.find_missing_visit_method(opname)
File "/home/della/pkg/pypy/trunk/pypy/translator/c/gcc/trackgcroot.py", line 390, in find_missing_visit_method
raise UnrecognizedOperation(opname)
__main__.UnrecognizedOperation: jc

there are also some type warnings for pointers, I don't know if they could be useful. Maybe you can help me?

Armin Rigo wrote on 2009-10-01 16:56:

Thanks for the report, della. Fixed, if you want to try again. Parsing gcc output is a bit delicate as the exact set of operations used depends on the specific version and command-line options passed to gcc.

Armin Rigo wrote on 2009-10-01 16:58:

Since the blog post, here are the updated numbers: we run richards.py in 2.10 seconds (almost 4x faster than CPython), and only spend 0.916 seconds actually running the assembler (almost 9x faster than CPython).

Anonymous wrote on 2009-10-01 18:25:

Very nice. Do you expect to get faster than psyco?

Luis wrote on 2009-10-02 01:27:

This is very exciting! Please, try to post updates to these figures... thanks!

Unhelpful wrote on 2009-10-04 01:13:

I was having the same problem as della, and your fix seems to work, but it's breaking somewhere else now. I don't think I have a dependency problem, I can build a working pypy-c without jit. Running make manually produces heaps of warnings about incompatible pointers, some probably harmless (int* vs long int*, these should be the same on x86-32), but others worry me more, like struct stat* vs struct stat64*, or struct SSL* vs char**. I put the complete output of a manual run of make online.

Anonymous wrote on 2009-10-04 01:40:

Interestingly, the translation itself seems to consume at most about 960MB of ram. It's easy to translate on a system even with only a gig of ram if you stop everything else.

Try switching run levels or the like.

The -Ojit option seems to cause an error in translation with Revision 68125, when translated using Python 2.5.2 on Debian Lenny.

proteusguy wrote on 2009-10-04 07:31:

First off - congratulations and good job on the great progress. I've been watching this project since the 2007 PyCon in DFW and it's great to see these promising results.

That said, while I know there's still a lot of work to do and this is very much an in-progress thing, I'm very much looking forward to an excuse to try this stuff out in anger - real practical situations. For me that means some statistical calculation engines (Monte Carlo analysis) front-ended by web services. In both situations this brings up two constraints: a) must support 64bit (because our data sets rapidly go above 4GB RAM) and b) must not be overly memory hungry (because any significant incremental overhead really hurts when your data sets are already over 4GB RAM).

For now we use Psyco for small stuff but have to re-implement in C++ once we hit that 32-bit limit. PyPy is very exciting as a practical alternative to Psyco because of anticipated 64bit support. I wonder if, due to the existence of Psyco already, PyPy shouldn't focus first on 64bit instead?

Few things would speed up progress more than getting PyPy used out in the wild - even if only by those of us who appreciate that it's very much in flux but still understand how to benefit from it.

I understand you guys have your focus and goals and encourage you to keep up the good work. Just thought I'd throw this out as an idea to consider. I'm sure there are a lot like me anxious to give it a spin.

-- Ben Scherrey

Armin Rigo wrote on 2009-10-04 11:25:

Andrew: can you update and try again? If you still have the .c files around it is enough to go there and type "make"; otherwise, restart the build. It should still crash, but give us more information about why it does.

Unhelpful wrote on 2009-10-04 16:04:

The new traceback is:

Traceback (most recent call last):
File "/home/chshrcat/build/pypy-trunk/pypy/translator/c/gcc/trackgcroot.py", line 1211, in <module>
assert fn.endswith('.s')
AssertionError

Is the position in input tracked? that might help, or I could package my .gcmap files.

Unhelpful wrote on 2009-10-04 16:15:

The trouble seems to be implement.gcmap and implement_9.gcmap. These are both empty, and trigger the assertion error.

Running trackgcroot as the Makefile does, but without those two files, permits compilation to continue, but linking fails with undefined references to various symbols with the prefix 'pypy_g_'.

I suspected the changes might have invalidated the old .gcmap files, so I tried removing them, and got this when it tried to generate implement.gcmap:

Traceback (most recent call last):
File "/home/chshrcat/build/pypy-trunk/pypy/translator/c/gcc/trackgcroot.py", line 1214, in <module>
tracker.process(f, g, filename=fn)
File "/home/chshrcat/build/pypy-trunk/pypy/translator/c/gcc/trackgcroot.py", line 229, in process
lines = self.process_function(lines, entrypoint, filename)
File "/home/chshrcat/build/pypy-trunk/pypy/translator/c/gcc/trackgcroot.py", line 244, in process_function
table = tracker.computegcmaptable(self.verbose)
File "/home/chshrcat/build/pypy-trunk/pypy/translator/c/gcc/trackgcroot.py", line 285, in computegcmaptable
self.parse_instructions()
File "/home/chshrcat/build/pypy-trunk/pypy/translator/c/gcc/trackgcroot.py", line 365, in parse_instructions
insn = meth(line)
File "/home/chshrcat/build/pypy-trunk/pypy/translator/c/gcc/trackgcroot.py", line 741, in visit_jmp
self.conditional_jump(line)
File "/home/chshrcat/build/pypy-trunk/pypy/translator/c/gcc/trackgcroot.py", line 757, in conditional_jump
raise UnrecognizedOperation(line)
__main__.UnrecognizedOperation: jmp T.14141

Anonymous wrote on 2009-10-04 16:19:

A correction/clarification to last night's post:

There isn't a bug in the -Ojit translation process, I was just missing a dependency that I could've sworn I've installed before.

The translation process only takes < 1GB memory if done without any options. Attempting to translate with the -Ojit option takes at least 2.5GB of RAM, as I tried last night (with it as the only running process) and it consumed my swapfile and ran out of memory.

Is there any documented way to use a translated pypy binary to build other pypy translations? That might help reduce the build requirements, and would also be mighty cool.

Armin Rigo wrote on 2009-10-04 16:50:

NickDaly: checked in, please try. Also, please come to the mailing list instead of posting here if you have further comments to do... https://codespeak.net/mailman/listinfo/pypy-dev

Michael Allman wrote on 2009-10-05 10:56:

Is pypy-c-jit written in C or Python or something else? I ask because of the "c" in pypy-c-jit.

Carl Friedrich Bolz-Tereick wrote on 2009-10-05 14:28:

Michael: It is written in RPython (a subset of Python) but then translated to C. By convention we therefore call the executable-name pypy-c. If the executable also contains a JIT, we call it pypy-c-jit.

Carl Friedrich Bolz-Tereick wrote on 2009-10-05 15:48:

Ben Scherrey: 64bit support might happen not too far in the future. Not using too much memory is a different problem, that might take a while longer. It has two aspects, one is that the JIT itself uses way too much memory at the moment. We will work on that soon.

The other aspect is making sure that your dataset does not take too much heap. It depends a bit which data structures you use, but it's not likely to be that great right now. That might change at some point; I have some ideas in that direction, but not really time to work on them soon.

PyPy sprint in Düsseldorf, 6 Nov - 13 Nov

The next PyPy sprint will be held in the Computer Science department of Heinrich-Heine Universität Düsseldorf from the 6th to the 13th of November 2009. This is a fully public sprint, everyone is welcome to join us.

Topics and goals

At the sprint we intend to work on the JIT generator in PyPy and on applying it to PyPy's Python interpreter.

The precise work that will be done is not fixed, as we don't know in which state the JIT will be in November. However, possible areas of work might include:

  • tweaking the interpreter/objspace to be more JIT-friendly, e.g. instance implementation code, call code
  • if there is interest starting non x86-32 JIT backends
  • trying out existing software to find features where the optimizations of the JIT could be improved
  • improving our benchmarking infrastructure

We will give special priority to topics that "non-core" people find interesting (as long as they are somehow JIT-related).

For an introduction of how our JIT-generation process works, please refer to our blog:

https://morepypy.blogspot.com/2009/03/jit-bit-of-look-inside.html

There is also a more dense academic paper about the subject:

https://codespeak.net/svn/pypy/extradoc/talk/icooolps2009/bolz-tracing-jit-final.pdf

Location

The sprint will take place in a seminar room of the computer science department. It is in the building 25.12 of the university campus. For travel instructions see

https://stups.cs.uni-duesseldorf.de/anreise/esbahn.php

Registration

If you'd like to come, please subscribe to the pypy-sprint mailing list, drop a note about your interests and post any questions. More organisational information will be sent to that list. We'll keep a list of people which we'll update (you can do so yourself if you have codespeak commit rights).

Unknown wrote on 2009-09-25 08:53:

Following the svn mailing list, there appears to have been a number of quite large refactorings of the JIT recently. Is there a good description of what they are going to achieve, and what the performance gains are? A blog post with an update would be really cool

PyPy gets a new compiler

Today, I merged the parser-compiler branch, which I have been working on over the summer. It contained a total rewrite of both PyPy's Python parser and AST compiler. PyPy's old parser was (in)famous internally for being complicated and slow (with many algorithmic complexities greater than O(n)). The new parser is an LL(1) parser like CPython's, kept as simple as I could make it (though it doesn't share the hacks of CPython's parser).

The new compiler is based on the Abstract Syntax Trees (AST) that CPython 2.5 introduced, instead of PyPy's old AST based on that of the compiler package. This means that Python code running on PyPy will be able to use the same _ast interface as CPython. PyPy's _ast implementation supports AST features that CPython 2.6 added, including compiling modified AST to bytecode and executing it. In this rewrite, some more obscure compiler features were added, too. For example, jumps in bytecode can now be greater than 65535 bytes! (That's like an if statement with 7000 lines of code in the body.)
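
As a small illustration of the kind of thing this enables (using the CPython 2.6 style _ast API; the snippet itself is just an example, not from PyPy's test suite):

import _ast

# parse source to an AST instead of bytecode, tweak a literal in the tree,
# then compile the modified tree back to bytecode and run it
tree = compile("x = 6 * 7", "<example>", "exec", _ast.PyCF_ONLY_AST)
tree.body[0].value.right.n = 100      # change the literal 7 into 100

ns = {}
exec(compile(tree, "<example>", "exec"), ns)
print(ns['x'])                        # prints 600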

While the PyPy translation toolchain still has many obscure details and hacks, this merge completes the process of making the actual Python interpreter very clean. Hopefully, this will make adding new features much easier and make PyPy less frustrating to maintain as well as providing application level code with an improved AST interface!

Jeremy Cowles wrote on 2009-08-25 23:03:

Nice, keep up the good work!

Anonymous wrote on 2009-08-26 08:12:

Thank you.. Keep it up.

random user wrote on 2009-08-26 17:52:

Very nice. Thanks for all of your work!

tobami wrote on 2009-08-31 10:43:

Hi, the Gothenburg sprint news are very interesting.

What are your thoughts about a release roadmap? Do you intend to release a pypy 1.2 with improved compatibility and speed but no JIT, and later include the JIT (version 1.5, maybe)?

I think publishing some kind of roadmap would be useful, as a project suffers when its release cycles are BOTH long and unpredictable.

tobami wrote on 2009-08-31 10:51:

Also, starting from the next stable release, it would be great to publish some kind of benchmarks page to keep track of performance across different versions (cpython 2.6 vs pypy 1.1 vs pypy 1.2 vs pypy with JIT).

Now that I think of it, do you need some kind of help with the website? I think starting with the next pypy release, the project will get a lot more visibility, and a nicer and better structured website would be a definite plus. If you feel it would be a useful task I could help there.

Maciej Fijalkowski wrote on 2009-08-31 16:05:

Hey.

Both the benchmarks (that would also include, say, Jython) and a nice website for people who actually want to use it would be very nice additions. We would definitely appreciate some help with it.

If you have any ideas feel free to continue discussion on pypy-dev.

Cheers,
fijal

tobami wrote on 2009-08-31 21:46:

Hi Maciej, as you suggested, I have subscribed to the pypy-dev mailing list and have started the discussion.

Cheers,

Miquel

Maciej Fijalkowski wrote on 2009-09-01 15:47:

Hey Miguel.

I fail to see your post.

Cheers,
fijal

tobami wrote on 2009-09-02 10:28:

it got rejected. I have written to pypy-dev-owner to see where the problem is.

Cheers

Maciej Fijalkowski wrote on 2009-09-05 10:13:

@tobami

you should subscribe to the list first. We get far too much spam to accept posts from non-members.

Cheers,
fijal

tobami wrote on 2009-09-05 21:21:

@Maciej,

well, I subscribed first, that is the problem. I now get sent the pypy-dev mailing list, but my post got rejected anyway. And pypy-owner hasn't answered yet.

What can I do?

Maciej Fijalkowski wrote on 2009-09-06 19:57:

@tobami

you did something wrong. pypy-dev is not a moderated list (from members, that is). Can you leave your mail, so we can stop spamming here? Mine is fijal at merlinux.eu

Gothenburg JIT sprint report

Finally, we managed to squeeze in some time to write a report about what has been going on at the mysterious JIT sprint in Gothenburg, Sweden. The main goals of the sprint were to lay down the groundwork for getting more JIT work going in the next months and to get more of the PyPy developers up to speed with the current state of the JIT. One of the elements was to get better stability of the JIT, moving it slowly from being a prototype to actually working nicely on larger programs.

The secret goal of the sprint was to seek more speed, which Anto and Carl Friedrich did even during the break day.

We spent the first two days improving test coverage of the x86 backend and the optimizer. Now we have 100% coverage with unittests (modulo figleaf bugs), which does not mean anything, but it's better than before.

Then we spent quite some time improving the optimizer passes, so now we generate far less code than before the sprint, because a lot of it is optimized away. On the interpreter side, we marked more objects (like code objects) as immutable, so that reading fields from them can be constant-folded.
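
To give a feeling for the kind of hint involved, here is a minimal sketch of marking an RPython class as immutable; the attribute name follows RPython conventions, though the exact hint applied to the interpreter's code objects during the sprint may have differed:

# A sketch of an RPython-level immutability hint (illustrative, not the
# exact code changed during the sprint).
class Code(object):
    # Declare that these fields never change after construction, so the
    # JIT may constant-fold reads of them when the Code object is constant.
    _immutable_fields_ = ['co_name', 'co_argcount']

    def __init__(self, co_name, co_argcount):
        self.co_name = co_name
        self.co_argcount = co_argcount

def check_arg_count(code, n_args):
    # In a JITted trace where `code` is constant, this field read and the
    # comparison can both be folded away.
    return n_args == code.co_argcount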

Another important optimization we did is to remove consecutive reads of the same field from the same structure, when no code in between can change it.
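
In plain Python terms, the effect is roughly the following hand transformation (a hypothetical illustration, not actual trace output):

# Before the optimization: p.x and p.y are each read twice, even though
# nothing in between can change them.
def norm_sq(p):
    return p.x * p.x + p.y * p.y

# After the optimization, the trace behaves as if each field had been read
# once and cached in a local variable.
def norm_sq_optimized(p):
    x = p.x
    y = p.y
    return x * x + y * y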

Our JIT is a hybrid environment, where only hot loops of code are JITted and the rest stays interpreted. We found out that the performance of the non-JITted part was suboptimal, because all accesses to Python frames went through an extra layer of indirection. We removed this layer of indirection in the case where the JIT and the interpreter cannot access the same frame (which is the common case).

We also spent some time improving the performance of our x86 backend, by making it use more registers and by doing more advanced variable renaming at the end of loops. It seems that using more registers is not as much of a win as we hoped, because modern-day processors are much smarter than we thought.

The most mind-bending part was finding out why we lose performance by making the JIT see more of the interpreter. It took us two very frustrating days and 36 gray hairs to find out that from the JIT we call a different malloc function in the Boehm GC, which is far slower than the version that we use from the interpreter. This meant that the more we JITted, the slower our code got, purely because of the mallocs.

Now that this is fixed, the world makes much more sense again.

A lot of the sprint's work is not directly measurable in the performance figures, but we did a lot of work that is necessary for performance to improve in the next weeks. After we have done a bit more work, we should be able to provide some performance figures for programs that are more realistic than just loops that count to ten million (which are very fast already :).

Now we're going to enjoy a couple of days off to recover from the sprint.

Bästa hälsningar,
Carl Friedrich, fijal

Anonymous wrote on 2009-08-25 14:26:

Excellent summary. You should never doubt the value of these updates, they are essential for maintaining awareness.

Bourne wrote on 2009-08-25 18:11:

Congrats on your impressive work! This sounds more and more promising.

Freakazo wrote on 2009-08-28 15:15:

Updates like this are extremely interesting to read, and they give me a month's worth of new terms and technology to learn :D

Can't wait to use Pypy!

PyPy numeric experiments

Because PyPy will be presenting at the upcoming EuroSciPy conference, I have been playing recently with the idea of NumPy and PyPy integration. My idea is to integrate PyPy's JIT with NumPy, or at least a very basic subset of it. Time constraints make it impossible to hand-write a JIT compiler that understands NumPy. But given PyPy's architecture we actually have a JIT generator, so we don't need to write one :-)

Our JIT has shown that it can speed up small arithmetic examples significantly. What happens with something like NumPy?

I wrote a very minimal subset of NumPy in RPython, called micronumpy (only single-dimension int arrays that can only get and set items), and a benchmark against it. The point of this benchmark is to compare the performance of a builtin function (numpy.minimum) against the equivalent hand-written function, written in pure Python and compiled by our JIT.

The goal is to prove that it is possible to write algorithms in Python instead of C without loss of efficiency. Sure, we can write some functions (like minimum in the following example), but there is a whole universe of other ufuncs which would be cool to have in Python instead, assuming this could be done without a huge loss in efficiency.
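
The benchmark is roughly of the following shape; this is a simplified sketch, and the array size and function names here are made up for illustration (the real benchmark lives in the PyPy repository):

import time
import numpy

def minimum_pure_python(a, b):
    # Hand-written equivalent of numpy.minimum for 1-d int arrays.
    result = numpy.zeros(len(a), dtype=int)
    for i in range(len(a)):
        if a[i] < b[i]:
            result[i] = a[i]
        else:
            result[i] = b[i]
    return result

if __name__ == '__main__':
    n = 10 * 1000 * 1000
    a = numpy.arange(n)
    b = numpy.arange(n)[::-1]

    t0 = time.time()
    numpy.minimum(a, b)        # builtin, implemented in C (or RPython)
    t1 = time.time()
    minimum_pure_python(a, b)  # pure Python, interpreted or JIT-compiled
    t2 = time.time()

    print('builtin:     %.2fs' % (t1 - t0))
    print('pure Python: %.2fs' % (t2 - t1))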

Here are the results. This is comparing PyPy svn revision 66303 in the pyjitpl5 branch against python 2.6 with NumPy 1.2.1. The builtin numpy.minimum in PyPy is just a naive implementation in RPython, which is comparable to the speed of a naive implementation written in C (and thus a bit slower than the optimized version in NumPy):

NumPy (builtin function) 0.12s
PyPy's micronumpy (builtin function) 0.28s
CPython (pure Python) 11s
PyPy with JIT (pure Python) 0.91s

As we can see, PyPy's JIT is slower than NumPy's optimized C version, but still much faster than CPython (about 12x).

Why is it slower? When you actually look at the assembler, it's pretty obvious that it's atrocious. There's a lot of speedup to be gained just from doing simple optimizations on the resulting assembler. There are also pretty obvious limitations, like the x86 backend not being able to emit instructions for floats, or the x86_64 backend not being there yet. Those limitations are not fundamental in any sense and should be relatively straightforward to overcome. Therefore it seems we can get C-level speeds for pure Python implementations of numeric algorithms using NumPy arrays in PyPy. I think it's an interesting perspective: Python has the potential of becoming less of a glue language and more of a real implementation language in the scientific field.

Cheers,
fijal
Anonymous wrote on 2009-07-17 20:50:

I have the feeling you are confessing PyPy's secret goal ;-).

Anonymous wrote on 2009-07-18 08:48:

a really efficient python for science: THAT would be a real milestone for dynamic languages; and start their era...

tobami wrote on 2009-07-21 10:51:

Very, very interesting.

Something I missed though was a real naive C implementation. You state it is about as fast as "PyPy's micronumpy", but it would have been nice to post the numbers. Of course, the problem is that the code would be different (C, instead of Python), but still...

Anonymous wrote on 2009-07-22 09:37:

What would it take to get this really started? Some of our group would happily help here, if there is a sort of a guideline (a TODO list?) that tells what must be done (i.e. as a friend put it, we would be codemonkeys).

Yosef wrote on 2009-07-27 07:19:

The difference in pure-Python speed is what is most interesting for me, as however much NumPy you use, sometimes important parts of the software still can't be easily vectorized (or vectorized at all). If PyPy can let me run compiled NumPy (or Cython) code glued with lightning-fast Python, this leaves me with almost no performance problems. Add to that the convenience of vectorization as a means of writing short, readable code, and it's a winning combination.

Zeev wrote on 2009-07-29 09:37:

Saying that implementing efficient code generation for floating point code on x86 in your JIT is going to be straightforward is disingenuous.

René Dudfield wrote on 2009-07-30 04:02:

Here's a project using CorePy, a runtime assembler, to create a faster numpy:

https://numcorepy.blogspot.com/

There are also projects like pycuda and pygpu, which generate numpy code to run on GPUs.

It gets many times faster than standard numpy.

pygame uses SDL blitters, and its own blitters - which are specialised array operations for images... these are many times faster than numpy in general - since they are hand optimized assembler, or very efficiently optimised C.

Remember that hand optimized assembler can be 10x faster than even C, and that not all C code is equal.

So it seems that even the pypy-generated C code could be faster still.

What about applying pypy to CUDA, or OpenCL C like languages?

cu,

Maciej Fijalkowski wrote on 2009-08-02 22:17:

@ilume

I think you're completely missing the point. These experiments are performed using pure-Python code that happens to operate on numpy arrays. Assembler generation happens when this code is interpreted, so we are not even starting from the level of hand-written C. Corenumpy, on the other hand, is trying to speed up the numpy operations themselves (which is also a nice goal, but a completely different one).

Cheers,
fijal

Anonymous wrote on 2011-04-13 09:48:

Hi Maciej! Would you mind blogging an update on PyPy / C interfaces and NumPy?

I am extensively using NumPy / SciPy / NLopt (apart from the stuff I import from there, my stuff is mostly pure Python algorithms, which the interpreter spends most of its time working on).

The latest improvements in the PyPy JIT really sound as if they could magically, dramatically speed up my stuff...

I don't mind trying PyPy out in production if it will yield significant speedups (and debugging why it doesn't otherwise), but I need access to C stuff from within Python.

Maciej Fijalkowski wrote on 2011-04-13 09:53:

Stay tuned, I'll blog about it when I have more results. The progress has been slow so far, but it might accelerate

Anonymous wrote on 2011-04-13 13:37:

Hi! Thanks, can't wait for it... :-)

ECOOP 2009

Last week (from the 6th to the 10th of July) Anto, Armin and I (Carl Friedrich) were in the magnificent city of Genova, Italy, at the ECOOP conference. In this blog post I want to give a (necessarily personal) account of what we did there.

Workshop days: ICOOOLPS

The first two days of the conference were the workshop days. On Monday we attended the ICOOOLPS workshop (see the programme of the workshop). We had gotten two papers accepted at the workshop (one about layering PyPy's JIT on top of the CLR and one about the basic idea of PyPy's tracing JIT) and thus gave two presentations there; one was given by Anto, the other by me. Both went reasonably well, and we got some positive feedback.

Nearly all the other talks were rather interesting as well. I particularly liked the one by Hans Schippers, who presented a machine model built on delegation called delMDSOC. The model is meant to implement most features that a language needs to make it possible to separate cross-cutting concerns. In the talk at ICOOOLPS he presented an extension to the model that adds concurrency support, using a combination of actors and coroutines. He then showed that the concurrency mechanisms of Java, Salsa (an extension of Java adding actors) and Io can be mapped to this model.

Furthermore there were two interesting invited talks, one by Andreas Gal (Mozilla), and one by Cliff Click (Azul Systems). Andreas explained how TraceMonkey works. This was very useful for me, because his talk was just before mine and I could thus kill most of my introduction about tracing JIT compilers and have more time for the really interesting stuff :-). Cliff talked about implementing other languages on top of the JVM and some of the pitfalls in getting them to perform well.

All in all, ICOOOLPS was a very enjoyable workshop, also with many interesting discussions.

On Tuesday there were more workshops, but also the PyPy tutorial, so I only went to a few talks of the COP workshop and spent the rest of the morning preparing the tutorial (see next section).

Tutorial

On Tuesday afternoon we gave a PyPy Tutorial, as part of the ECOOP summer school. The first lesson we learned was that (as opposed to a community conference) people don't necessarily want to actually take their laptops out and try stuff. We gave a slow walk-through of the full life-cycle of developing a dynamic language interpreter using PyPy's tool-chain: starting from writing your interpreter in RPython and testing it on top of CPython, to translating it to C, .NET or Java, to actually adding hints to get a JIT inserted.
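
To make the starting point of that life-cycle concrete, here is a minimal sketch of a translatable program in the style of pypy/translator/goal/target*.py; the file name and the exact signature of target() are illustrative assumptions rather than code from the repository:

# targetexample.py -- a minimal program that the translation toolchain can
# turn into a standalone C executable (illustrative sketch).
def entry_point(argv):
    # This function is RPython: it must stay statically typeable.
    total = 0
    i = 0
    while i < 10:
        total += i
        i += 1
    print total
    return 0

def target(driver, args):
    # Tells the translation driver which function is the program's entry point.
    return entry_point, None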

There were about seven people attending the tutorial, a couple of which were very interested and were asking questions and discussing. Some of the discussions were even very technical, e.g. one about the details of our type-inference algorithm for RPython and why we cannot do a bottom-up analysis but have to use forward-propagation instead.

Jan Vitek of Purdue University told of some of the problems of the OVM project, which is (among other things) a Java implementation in Java (OVM also wants to support implementing VMs for other languages with it, if I understood correctly). He said that the project has essentially gotten too large and complicated, which means that it is very hard for new people to get into the project. While PyPy doesn't have some of the problems of a full Java implementation (e.g. right now our concurrency support is minimal) I definitely think that some of these risks apply to PyPy as well and we should find ways to improve the situation in this regard. Channeling Samuele: Somewhere inside the large lumbering blob of PyPy there is an elegant core trying to get out.

Main Conference

From Wednesday till Friday the main conference was happening. Many of the talks were not all that interesting for me, being quite Java centric. One talk that I liked a lot was "Making Sense of Large Heaps", which was presented by Nick Mitchell (IBM). He presented a tool called "Yeti" that can be used to analyze large heaps of Java programs. The tool uses some clever algorithms and heuristics to summarize the heap usage of data structures in intelligent ways to make it easier to find possible memory-wasters in a program. Nick also gave Anto and me a demo of the tool, where we tried to apply it to pypy-jvm (we found out that a fifth of the static data in there belongs to the parser/compiler :-( ).

On each of the days of the conference there was a keynote. I missed the one by Simon Peyton-Jones on Wednesday about type classes in Haskell. On Thursday, David Ungar was awarded the Dahl-Nygaard-Prize for his work on the Self programming language. Subsequently he gave a really inspiring keynote with the title "Self and Self: Whys and Wherefores" where he recollected Self's history, both on a technical as well as on a social level. Parts of the talk were snippets from the movies Self: The Movie and Alternate Reality Kit, both of which I highly recommend.

The keynote on Friday was by Cliff Click with the title "Java on 1000 Cores: Tales of Hardware/Software Co-design". He described the custom CPU architecture that Azul Systems has developed to run Java server applications on hundreds of cores. The talk was mostly about the hardware, which I found very interesting (but some people didn't care for too much). Azul's CPU is essentially 54 in-order RISC cores in a single processor. The cores have a lot of extensions that make it easier to run Java on them, e.g. hardware read- and write-barriers, hardware transactional memory and hardware escape-detection (!).

In addition to the talks, there is of course always the hallway track (or coffee track) which is the track where you stand in the hallway and discuss with people. As usual, this was the most interesting part of the conference. One of those talks was Anto and me giving a PyPy demo to David Ungar. We had a very interesting discussion about VM implementation in general and the sort of debugging tools you need to write in particular. He liked PyPy a lot, which makes me very happy. He also liked the fact that I have actually read most Self papers :-).

Alexander Kellett wrote on 2009-07-16 19:09:

The link to delMDSOC should be https://www.hpi.uni-potsdam.de/hirschfeld/projects/delmdsoc/

Alex

Anonymous wrote on 2009-07-17 04:10:

Glad it went well.

I gather there wasn't any sprint at EuroPython? I was hoping for some news.

If you can get a real python implementation out there that is starting to get faster than CPython, you could get some momentum really quickly.

I hope Unladen Swallow doesn't end up stealing your potential userbase, or at least dividing it.

Donovan Preston wrote on 2009-07-18 01:28:

I <3 Self.

Terrence wrote on 2009-07-18 05:45:

Is something like your PyPy Tutorial online somewhere? I have a befunge interpreter that I've been meaning to get working on pypy but I have almost no idea where to begin. I've been reading pypy's code on and off for a while now and it's very slow going. If there were some way to get up and running faster, I would really like to know about it.

Armin Rigo wrote on 2009-07-20 10:25:

You can find the tutorial here, but the part written down is quite limited. If you need starting points, look at pypy/translator/goal/target*.py (for example targetjsstandalone, which runs our partial JS interpreter).

Terrence wrote on 2009-07-21 08:40:

Thank you for the link. I had started with targetpypystandalone.py, which, on reflection, appears to be more towards the deep end of the pool. The javascript target is exactly what I'm looking for.

Luis wrote on 2009-07-09 19:37:

Please guys, can any of you tell us something about the news from EuroPython's VM panel? I've been searching for comments on blogs in the last few days but I couldn't find anything... There were many interesting presentations (hotpy, crosstwiner, psyco v2, etc...) but no comments so far! Is there any significant news in that field? How do these projects compare to pypy...?

Kumo wrote on 2009-07-11 00:44:

Will you publish the slides of the 2nd talk, as you did with the 1st?

I am looking forward to reading the slides as well as more comments about the talks.

Keep up the good work!