In the last week, I (Armin) have been taking some time off the JIT work to improve our GCs. More precisely, our GCs now take one or two words less for every object. This further reduces the memory usage of PyPy, as we will show at the end.
Background information: RPython object model
We first need to understand the RPython object model as implemented by our GCs and our C backend. (Note that the object model of the Python interpreter is built on top of that, but is more complicated -- e.g. Python-level objects are much more flexible than RPython objects.)
Consider these two RPython classes:
class A:
    def __init__(self, x):
        self.x = x
    def f(self):
        return self.x * 42

class B(A):
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def f(self):
        return self.x + self.y
The instances of A and B look like this in memory (all cells are one word):
|GC header||vtable ptr of A||hash||x|
|GC header||vtable ptr of B||hash||x||y|
The first word, the GC header, describes the layout. It encodes in half a word the shape of the object, including where it contains further pointers, so that the GC can trace it. The other half contains GC flags (e.g. the mark bit of a mark-and-sweep GC).
The second word is used for method dispatch. It is similar to a C++ vtable pointer. It points to static data that is mostly a table of methods (as function pointers), containing e.g. the method f of the example.
The hash field is not necessarily there; it is only present in classes whose hash is ever taken in the RPython program (which includes being keys in a dictionary). It is an "identity hash": it works like object.__hash__() in Python, but it cannot just be the address of the object in case of a GC that moves objects around.
Finally, the x and y fields are, obviously, used to store the value of the fields. Note that instances of B can be used in places that expect a pointer to an instance of A.
Unifying the vtable ptr with the GC header
The first idea for saving a word in every object comes from the observation that both the vtable ptr and the GC header store information about the class of the object. It is therefore natural to try to keep only one of them. Since we still need bits for the GC flags, the field we have to remove is the vtable pointer.
This means that method dispatch needs to be more clever: it cannot directly read the vtable ptr, but needs to compute it from the half-word of the GC header. Fortunately, this can be done with no extra instruction at the assembler level. Here is how things look in the end, assuming a 32-bit x86 machine (but note that as usual we just generate portable C).
The trick for achieving efficiency is that we store all vtables together in memory, and make sure that they don't take more than 256 KB in total (16 bits, plus 2 bits of alignment). Here is how the assembler code (produced by the normal C compiler, e.g. gcc) for calling a method looks. Before the change:
MOV EDX, [EAX + 4]              # load the vtable ptr from object EAX
MOV EDX, [EDX + method_offset]  # load the function pointer from the vtable
CALL EDX
Instead, we now have:
MOVZX EDX, [EAX]                # load the 16-bit part of the GC header from EAX
MOV EDX, [vtable_start + 4*EDX + method_offset]
CALL EDX
Note that the complex addressing scheme done by the second MOV is still just one instruction: the vtable_start and method_offset are constants, so they are combined. And as the vtables are anyway aligned at a word boundary, we can use 4*EDX to address them, giving us 256 KB instead of just 64 KB of vtables.
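For readers who prefer code to assembler, here is a small Python sketch of the same address arithmetic. The names vtable_start and method_offset stand in for link-time constants and are only placeholders; the real system of course does this in compiled C, not Python.

WORD = 4  # bytes per word on 32-bit x86

def method_slot_address(gc_header, vtable_start, method_offset):
    # the low 16 bits of the GC header identify the class ("shape")
    type_id = gc_header & 0xFFFF
    # all vtables live together starting at vtable_start; because they are
    # word-aligned we can scale the 16-bit id by 4, covering 256 KB in total
    return vtable_start + WORD * type_id + method_offset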
Optimizing the hash field
In PyPy's Python interpreter, all application-level objects are represented as an instance of some subclass of W_Root. Since all of these objects could potentially be stored in a dictionary by the application Python program, all these objects need a hash field. Of course, in practice, only a fraction of all objects in a Python program end up having their hash ever taken. Thus this field of W_Root is wasted memory most of the time.
(Up to now, we had a hack in place to save the hash field on a few classes like W_IntegerObject, but that meant that the Python expression "object.__hash__(42)" would raise a TypeError in PyPy.)
The solution we implemented now (done by some Java GCs, among others) is to add a hash field to an object when the (identity) hash of that object is actually taken. This means that we had to enhance our GCs to support this. When objects are allocated, we don't reserve any space for the hash:

(figure: object at 0x74B028)

When the hash of an object is taken, we use its current memory address, and set a flag in the GC header saying that this particular object needs a hash:

(figure: object at 0x74B028)

If the GC needs to move the object to another memory location, it will make the new version of the object bigger, i.e. it will also allocate space for the hash field:

(figure: object at 0x825F60)
This hash field is immediately initialized with the old memory address, which is the hash value that we gave so far for the object. To not disturb the layout of the object, we always put the extra hash field at the end. Of course, once set, the hash value does not change even if the object needs to move again.
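To make the scheme concrete, here is a toy Python model of it (purely illustrative, not actual GC code): the "needs hash" flag lives in the GC header, and the extra field only appears when a flagged object is moved.

class ToyGCObject(object):
    def __init__(self, address):
        self.address = address     # where the object currently lives
        self.needs_hash = False    # flag stored in the GC header
        self.stored_hash = None    # extra field, appended only on move

def identity_hash(obj):
    if obj.stored_hash is not None:
        return obj.stored_hash     # the object moved after being hashed
    obj.needs_hash = True          # remember: this object needs a hash field
    return obj.address             # until it moves, the address is the hash

def move(obj, new_address):
    if obj.needs_hash and obj.stored_hash is None:
        obj.stored_hash = obj.address   # freeze the old address as the hash
    obj.address = new_address           # the hash no longer changes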
Running the following program on PyPy's Python interpreter with n=4000000:
class X(object):   # a minimal class whose instances just carry a 'next' field
    pass

def make_linked_list(n):
    a = None
    i = 0
    while i < n:
        b = X()
        b.next = a
        a = b
        i += 1
the two optimizations together save 32 MB of RAM (i.e. 8 bytes per object). The version of PyPy we measured this with was built as follows:
./translate.py --gcremovetypeptr targetpypystandalone --objspace-std-withsharingdict
The total amount of RAM used on a 32-bit Linux is 247 MB, completing in 10.3 seconds. On CPython, it consumes 684 MB and takes 89 seconds to complete... This nicely shows that our GCs are much faster at allocating objects, and that our objects can be much smaller than CPython's.
Armin Rigo & Carl Friedrich Bolz
As the readers of this blog already know, I've been working on porting the JIT to CLI/.NET over the last few months. Now that it's finally possible to get a working pypy-cli-jit, it's time to do some benchmarks.
Warning: as usual, all of this has to be considered an alpha version: don't be surprised if you get a crash when trying to run pypy-cli-jit. Of course, things are improving very quickly, so it should become more and more stable as days pass.
For this time, I decided to run four benchmarks. Note that for all of them we run the main function once in advance, to let the JIT recognize the hot loops and emit the corresponding code. Thus, the results reported do not include the time spent by the JIT compiler itself, but give a good measure of the quality of the code generated by the JIT (a sketch of such a warm-up/timing harness follows the list of benchmarks below). At this point in time, I know that the CLI JIT backend spends way too much time compiling stuff, but this issue will be fixed soon.
- f1.py: this is the classic PyPy JIT benchmark. It is just a function that does some computational intensive work with integers.
- floatdemo.py: this is the same benchmark involving floating point numbers that have already been described in a previous blog post.
- oodemo.py: this is just a microbenchmark doing object oriented stuff such as method calls and attribute access.
- richards2.py: a modified version of the classic richards.py, with a warmup call before starting the real benchmark.
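As mentioned above, here is a sketch of the kind of timing harness used: main is run once so that the JIT has already compiled the hot loops, and only the second run is measured. The function names are hypothetical, not the actual benchmark driver.

import time

def bench(main, *args):
    main(*args)                # warm-up: the JIT finds and compiles the hot loops
    start = time.time()
    main(*args)                # measured run: mostly JIT-generated code
    return time.time() - start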
The benchmarks were run on a Windows machine with an Intel Pentium Dual Core E5200 2.5GHz and 2GB RAM, both with .NET (CLR 2.0) and Mono.
Because of a known mono bug, if you use a version older than 2.1 you need to pass the option -O=-branch to mono when running pypy-cli-jit, else it will just loop forever.
For comparison, we also ran the same benchmarks with IronPython 2.0.1 and IronPython 2.6rc1. Note that IronPython 2.6rc1 does not work with Mono.
So, here are the results (expressed in seconds) with Microsoft CLR:
Benchmark     pypy-cli-jit   ipy 2.0.1   ipy 2.6   ipy 2.0.1/pypy   ipy 2.6/pypy
f1            0.028          0.145       0.136     5.18x            4.85x
floatdemo     0.671          0.765       0.812     1.14x            1.21x
oodemo        1.25           4.278       3.816     3.42x            3.05x
richards2     1228           442         670       0.36x            0.54x
And with Mono:
Benchmark     pypy-cli-jit   ipy 2.0.1   ipy 2.0.1/pypy
f1            0.042          0.695       16.54x
floatdemo     0.781          1.218       1.55x
oodemo        1.703          9.501       5.31x
richards2     720            862         1.20x
These results are very interesting: under the CLR, we are between 5x faster and 3x slower than IronPython 2.0.1, and between 4.8x faster and 1.8x slower than IronPython 2.6. On the other hand, on Mono we are consistently faster than IronPython, up to 16x. It is also interesting to note that pypy-cli-jit runs faster on the CLR than on Mono for all benchmarks except richards2.
I have not investigated yet, but I think the culprit is the terrible behaviour of tail calls on the CLR: as I already wrote in another blog post, tail calls are ~10x slower than normal calls on the CLR, while being only ~2x slower than normal calls on Mono. richards2 is probably the benchmark that makes the most use of tail calls, which would explain why we get a much better result on Mono than on the CLR.
The next step is probably to find an alternative implementation that does not use tail calls: this will probably also improve the time spent by the JIT compiler itself, which is not reported in the numbers above but is, so far, surely too high to be acceptable. Stay tuned.
We've just merged the branch which adds float support to the x86 backend. This means that floating point operations are now super fast in PyPy's JIT. Let's have a look at an example, provided by Alex Gaynor and stolen from the Factor blog.
The original version of the benchmark was definitely tuned for the performance needs of CPython.
For running this on PyPy, I changed it to a slightly simpler version of the program, and I'll explain a few of the changes I made, which reflect the current limitations of PyPy's JIT. They're not very deep and they might already be gone by the time you read this:
- Usage of __slots__. This is a bit ridiculous, but we have spent quite a bit of time speeding up normal instances of new-style classes, which are now very fast, while instances of classes using __slots__ are slower. To be fixed soon.
- Usage of reduce. This one is even more obscure: reduce is not perceived as something that produces loops in a program. Moving to a pure-Python version of reduce fixes the problem (a sketch of such a version follows this list).
- Using x ** 2 vs x * x. In PyPy, reading a local variable is a no-op when JITted (the same as reading a local variable in C). However, multiplication is a simpler operation than the power operation.
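As promised in the point about reduce above, here is a minimal pure-Python replacement. This is only a sketch; the version actually used in the benchmark may differ in details. Because it is an ordinary Python loop, the JIT can trace and compile it.

def reduce_py(function, sequence, initial):
    result = initial
    for item in sequence:
        result = function(result, item)   # a plain loop the JIT can see
    return result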
I also included the original Java benchmark. Please note that the original Java version is similar to my modified one (not the one specifically tuned for CPython). The performance figures below (for n = 1,000,000) are the average of 10 runs:
- CPython 2.6: 7.56s
- CPython & psyco 2.6: 4.44s
- PyPy: 1.63s
- Java (JVM 1.6, client mode): 0.77s
and while the JVM is much faster, it's very good that we can even compare :-)

Cheers,
Just a quick note to tell you that we are progressing on the JIT front. Here are the running times of the richards benchmark on my laptop:
- 8.18 seconds with CPython 2.5.2;
- 2.61 seconds with pypy-c-jit (3x faster than CPython);
- 1.04 seconds if you ignore the time spent making assembler (8x faster than CPython);
- 1.59 seconds with Psyco, for reference (5x faster than CPython).
Yes, as this table shows, we are spending 1.57 seconds in the JIT support code. That's too much -- even ridiculously so -- for anything but a long-running process. We are working on that :-)
If you want to build your own pypy-c-jit (for x86-32 only for now):
- you need a Subversion checkout of trunk;
- as usual, wait a long time (and be sure you have more than 1GB of RAM).
pypy-c-jit spews a lot of debugging output and there are a few known examples where it crashes. As we like to repeat, however, it's a complete JIT: apart from the crashes (the bugs are probably in the JIT support code), it supports the whole Python language from the start -- in the sense of doing correct things. Future work includes Python-specific improvements, e.g. tweaking the data structures used to store Python objects so that they are more JIT-friendly.
EDIT: Oh yes, fijal reminds me that CPython 2.6 is 30% faster than CPython 2.5 on this benchmark (which is mostly my "fault", as I extracted a small part of PyPy and submitted it as a patch to CPython that works particularly well for examples like richards). It does not fundamentally change the fact that we are way faster though.
The next PyPy sprint will be held in the Computer Science department of Heinrich-Heine Universität Düsseldorf from the 6th to the 13th of November 2009. This is a fully public sprint, everyone is welcome to join us.
Topics and goals
At the sprint we intend to work on the JIT generator in PyPy and on applying it to PyPy's Python interpreter.
The precise work that will be done is not fixed, as we don't know in which state the JIT will be in November. However, possible areas of work might include:
- tweaking the interpreter/objspace to be more JIT-friendly, e.g. instance implementation code, call code
- if there is interest, starting non-x86-32 JIT backends
- trying out existing software to find features where the optimizations of the JIT could be improved
- improving our benchmarking infrastructure
We will give special priority to topics that "non-core" people find interesting (as long as they are somehow JIT-related).
For an introduction of how our JIT-generation process works, please refer to our blog:
There is also a more dense academic paper about the subject:
The sprint will take place in a seminar room of the computer science department. It is in the building 25.12 of the university campus. For travel instructions see
If you'd like to come, please subscribe to the pypy-sprint mailing list, drop a note about your interests and post any questions. More organisational information will be sent to that list. We'll keep a list of people, which we'll update (you can also do so yourself if you have codespeak commit rights).
Today, I merged the parser-compiler branch, which I have been working on over the summer. It contained a total rewrite of both PyPy's Python parser and its AST compiler. PyPy's old parser was (in)famous internally for being complicated and slow (with many algorithmic complexities greater than O(n)). The new parser is as simple an LL(1) parser as I could make it, like CPython's (though it doesn't share the hacks of CPython's parser).
The new compiler is based on the Abstract Syntax Trees (AST) that CPython 2.5 introduced, instead of PyPy's old AST, which was based on that of the compiler package. This means that Python code running on PyPy will be able to use the same _ast interface as CPython. PyPy's _ast implementation supports AST features that CPython 2.6 added, including compiling modified ASTs to bytecode and executing them. In this rewrite, some more obscure compiler features were added, too. For example, jumps in bytecode can now be greater than 65535 bytes! (That's like an if statement with 7000 lines of code in the body.)
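As a small illustration of what this interface allows, the snippet below parses some source, tweaks the tree and compiles the modified tree to bytecode. It uses the convenience helpers of CPython 2.6's ast module for brevity and is only an example, not code taken from PyPy itself.

import ast

tree = ast.parse("x = 6 * 7")
for node in ast.walk(tree):
    if isinstance(node, ast.Num) and node.n == 6:
        node.n = 7                        # rewrite the constant, just as a demo

code = compile(tree, "<ast-demo>", "exec")  # compile the modified AST to bytecode
namespace = {}
exec code in namespace                    # Python 2 syntax, matching the era
print namespace["x"]                      # prints 49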
While the PyPy translation toolchain still has many obscure details and hacks, this merge completes the process of making the actual Python interpreter very clean. Hopefully, this will make adding new features much easier and make PyPy less frustrating to maintain as well as providing application level code with an improved AST interface!
Finally, we managed to squeeze in some time to write a report about what has been going on at the mysterious JIT sprint in Gothenburg, Sweden. The main goals of the sprint were to lay down the groundwork for getting more JIT work going in the coming months and to get more PyPy developers up to speed with the current state of the JIT. One of the elements was to improve the stability of the JIT, slowly moving it from being a prototype to actually working nicely on larger programs.
The secret goal of the sprint was to seek more speed, which Anto and Carl Friedrich did even during the break day:
We spent the first two days improving test coverage of the x86 backend and the optimizer. Now we have 100% coverage with unittests (modulo figleaf bugs), which does not mean anything, but it's better than before.
Then we spent quite some time improving the optimizer passes, so now we generate far less code than before the sprint, because a lot of it is optimized away. On the interpreter side, we marked more objects (like code objects) as immutable, so that reading fields from them can be constant-folded.
Another important optimization that we did is to remove consecutive reads of the same fields from the same structure, when no code in between can change them.
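As an illustration of the sort of pattern this helps (Python-level code, not JIT internals; the example is made up), consider the hypothetical function below: with the optimization, each of point.x and point.y is read only once per trace, since nothing in between can modify the fields.

class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

def norm_squared(point):
    # naively this reads point.x twice and point.y twice; the optimizer
    # keeps only one read of each field
    return point.x * point.x + point.y * point.y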
Our JIT is a hybrid environment, where only hot loops of code are jitted and the rest stays interpreted. We found out that the performance of the non-jitted part was suboptimal, because all accesses to Python frames went through an extra layer of indirection. We removed this layer of indirection in the case where the JIT and the interpreter cannot access the same frame (which is the common case).
We also spent some time improving the performance of our x86 backend, by making it use more registers and by doing more advanced variable renaming at the end of loops. It seems that using more registers is not as much of a win as we hoped, because modern-day processors are much smarter than we thought.
The most mind-bending part was finding out why we lose performance by making the JIT see more of the interpreter. It took us two very frustrating days and 36 gray hairs to find out that from the JIT we call a different malloc function in the Boehm GC, which is far slower than the version that we use from the interpreter. This meant that the more we jitted, the slower our code got, purely because of the mallocs.
Now that this is fixed, the world makes much more sense again.
A lot of the sprint's work is not directly measurable in the performance figures, but we did a lot of work that is necessary for performance to improve in the next weeks. After we have done a bit more work, we should be able to provide some performance figures for programs that are more realistic than just loops that count to ten million (which are very fast already :).
Now we're going to enjoy a couple of days off to recover from the sprint.
Carl Friedrich, fijal
Because PyPy will be presenting at the upcoming euroscipy conference, I have been playing recently with the idea of NumPy and PyPy integration. My idea is to integrate PyPy's JIT with NumPy or at least a very basic subset of it. Time constraints make it impossible to hand write a JIT compiler that understands NumPy. But given PyPy's architecture we actually have a JIT generator, so we don't need to write one :-)
Our JIT has shown that it can speed up small arithmetic examples significantly. What happens with something like NumPy?
I wrote a very minimal subset of NumPy in RPython, called micronumpy (only single-dimension int arrays that can only get and set items), and a benchmark against it. The point of this benchmark is to compare the performance of a builtin function (numpy.minimum) against the equivalent hand-written function, written in pure Python and compiled by our JIT.
The goal is to prove that it is possible to write algorithms in Python instead of C without loss of efficiency. Sure, we can write some functions (like minimum in the following example), but there is a whole universe of other ufuncs which would be cool to have in Python instead, assuming this could be done without a huge loss in efficiency.
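Here is a sketch of the kind of hand-written, pure-Python minimum that the benchmark pits against the builtin numpy.minimum. The actual benchmark code may differ in details; the point is simply that it is a plain element-wise loop.

def minimum(a, b):
    # element-wise minimum of two single-dimension arrays of the same length
    result = [0] * len(a)
    i = 0
    while i < len(a):
        if a[i] < b[i]:
            result[i] = a[i]
        else:
            result[i] = b[i]
        i += 1
    return result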
Here are the results. This is comparing PyPy svn revision 66303 in the pyjitpl5 branch against python 2.6 with NumPy 1.2.1. The builtin numpy.minimum in PyPy is just a naive implementation in RPython, which is comparable to the speed of a naive implementation written in C (and thus a bit slower than the optimized version in NumPy):
NumPy (builtin function)                0.12s
PyPy's micronumpy (builtin function)    0.28s
CPython (pure Python)                   11s
PyPy with JIT (pure Python)             0.91s
As we can see, PyPy's JIT is slower than NumPy's optimized C version, but still much faster than CPython (12x).
Why is it slower? When you actually look at the assembler, it's pretty obvious that it's atrocious. There's a lot of speedup to be gained just from doing simple optimizations on the resulting assembler. There are also pretty obvious limitations, like the x86 backend not being able to emit opcodes for floats, or x86_64 support not being there. Those limitations are not fundamental in any sense and should be relatively straightforward to overcome. Therefore it seems we can get C-level speeds for pure Python implementations of numeric algorithms using NumPy arrays in PyPy. I think it's an interesting perspective that Python has the potential of becoming less of a glue language and more of a real implementation language in the scientific field.

Cheers,
Last week (from the 6th to the 10th of July) Anto, Armin and I (Carl Friedrich) were in the magnificent city of Genova, Italy, at the ECOOP conference. In this blog post I want to give a (necessarily personal) account of what we did there.
Workshop days: ICOOOLPS
The first two days of the conference were the workshop days. On Monday we attended the ICOOOLPS workshop (see the programme of the workshop). We had gotten two papers accepted at the workshop (one about layering PyPy's JIT on top of the CLR and one about the basic idea of PyPy's tracing JIT) and thus gave two presentations there, one by Anto, the other by me. Both went reasonably well, and we got some positive feedback.
Nearly all the other talks were rather interesting as well. I particularly liked the one by Hans Schippers, who presented a machine model built on delegation called delMDSOC. The model is meant to implement most features that a language needs to make it possible to separate cross-cutting concerns. In the talk at ICOOOLPS he presented an extension to the model that adds concurrency support, using a combination of actors and coroutines. He then showed that the concurrency mechanisms of Java, Salsa (an extension of Java adding actors) and Io can be mapped to this model.
Furthermore there were two interesting invited talks, one by Andreas Gal (Mozilla), and one by Cliff Click (Azul Systems). Andreas explained how TraceMonkey works. This was very useful for me, because his talk was just before mine and I could thus kill most of my introduction about tracing JIT compilers and have more time for the really interesting stuff :-). Cliff talked about implementing other languages on top of the JVM and some of the pitfalls in getting them to perform well.
All in all, ICOOOLPS was a very enjoyable workshop, also with many interesting discussions.
On Tuesday there were more workshops, but also the PyPy tutorial, so I only went to a few talks of the COP workshop and spent the rest of the morning preparing the tutorial (see next section).
On Tuesday afternoon we gave a PyPy tutorial, as part of the ECOOP summer school. The first lesson we learned was that (as opposed to a community conference) people don't necessarily want to actually take their laptop out and try stuff. We gave a slow walk-through of the full life-cycle of developing a dynamic language interpreter using PyPy's tool-chain: starting from writing your interpreter in RPython, to testing it on top of CPython, to translating it to C, .NET or Java, to finally adding hints to get a JIT inserted.
There were about seven people attending the tutorial, a couple of which were very interested and were asking questions and discussing. Some of the discussions were even very technical, e.g. one about the details of our type-inference algorithm for RPython and why we cannot do a bottom-up analysis but have to use forward-propagation instead.
Jan Vitek of Purdue University told of some of the problems of the OVM project, which is (among other things) a Java implementation in Java (OVM also wants to support implementing VMs for other languages with it, if I understood correctly). He said that the project has essentially gotten too large and complicated, which means that it is very hard for new people to get into the project. While PyPy doesn't have some of the problems of a full Java implementation (e.g. right now our concurrency support is minimal) I definitely think that some of these risks apply to PyPy as well and we should find ways to improve the situation in this regard. Channeling Samuele: Somewhere inside the large lumbering blob of PyPy there is an elegant core trying to get out.
From Wednesday till Friday the main conference was happening. Many of the talks were not all that interesting for me, being quite Java centric. One talk that I liked a lot was "Making Sense of Large Heaps", which was presented by Nick Mitchell (IBM). He presented a tool called "Yeti" that can be used to analyze large heaps of Java programs. The tool uses some clever algorithms and heuristics to summarize the heap usage of data structures in intelligent ways to make it easier to find possible memory-wasters in a program. Nick also gave Anto and me a demo of the tool, where we tried to apply it to pypy-jvm (we found out that a fifth of the static data in there belongs to the parser/compiler :-( ).
On each of the days of the conference there was a keynote. I missed the one by Simon Peyton-Jones on Wednesday about type classes in Haskell. On Thursday, David Ungar was awarded the Dahl-Nygaard-Prize for his work on the Self programming language. Subsequently he gave a really inspiring keynote with the title "Self and Self: Whys and Wherefores" where he recollected Self's history, both on a technical as well as on a social level. Parts of the talk were snippets from the movies Self: The Movie and Alternate Reality Kit, both of which I highly recommend.
The keynote on Friday was by Cliff Click with the title "Java on 1000 Cores: Tales of Hardware/Software Co-design". He described the custom CPU architecture that Azul Systems has developed to run Java server applications on hundreds of cores. The talk was mostly about the hardware, which I found very interesting (though some people didn't care for it too much). Azul's CPU is essentially 54 in-order RISC cores in a single processor. The cores have a lot of extensions that make it easier to run Java on them, e.g. hardware read- and write-barriers, hardware transactional memory and hardware escape-detection (!).
In addition to the talks, there is of course always the hallway track (or coffee track) which is the track where you stand in the hallway and discuss with people. As usual, this was the most interesting part of the conference. One of those talks was Anto and me giving a PyPy demo to David Ungar. We had a very interesting discussion about VM implementation in general and the sort of debugging tools you need to write in particular. He liked PyPy a lot, which makes me very happy. He also liked the fact that I have actually read most Self papers :-).