Finally, we managed to squeeze in some time to write a report about what has been going on the mysterious JIT sprint in Gothenburg, Sweden. The main goals of the sprint were to lay down the groundwork for getting more JIT work going in the next months and get more of PyPy developers up to speed with the current state of the JIT. One of the elements was to get better stability of the JIT, moving it slowly from being a prototype to actually work nicely on larger programs.
The secret goal of the sprint was to seek more speed, which Anto and Carl Friedrich did even during the break day:
We spent the first two days improving test coverage of the x86 backend and the optimizer. Now we have 100% coverage with unittests (modulo figleaf bugs), which does not mean anything, but it's better than before.
Then we spent quite some time improving the optimizer passes, so now we generate far less code than before the sprint, because a lot of it is optimized away. On the interpreter side, we marked more objects (like code objects) as immutable, so that reading fields from them can be constant-folded.
Another important optimization that we did is to remove consecutive reading of the same fields from the same structure, if no code in between can change it.
Our JIT is a hybrid environment, where only hot loops of code are jitted and the rest stays being interpreted. We found out that the performance of the non-jitted part was suboptimal, because all accesses to python frames went through an extra layer of indirection. We removed this layer of indirection, in the case where the jit and the interpreter cannot access the same frame (which is the common case).
We also spent some time improving the performance of our x86 backend, by making it use more registers and by doing more advanced variable renaming at the end of loops. It seems that using more registerd is not as much of a win as we hoped, because modern day processors are much smarter than we thought.
The most mind bending part was finding why we loose performance by making the JIT see more of the interpreter. It took us two very frustrating days and 36 gray hairs to find out that from the JIT we call a different malloc function in the Boehm GC, which is by far slower than the version that we use from the interpreter. This meant that the more we jitted, the slower our code got, purely because of the mallocs.
Now that this is fixed, the world makes much more sense again.
A lot of the sprint's work is not directly measurable in the performance figures, but we did a lot of work that is necessary for performance to improve in the next weeks. After we have done a bit more work, we should be able to provide some performance figures for programs that are more realistic than just loops that count to ten millions (which are very fast already :).
Now we're going to enjoy a couple of days off to recover from the sprint.
Carl Friedrich, fijal