The question I was asked most often during my recent US trip was how exactly the hints work that interpreter authors can use to improve the execution speed of the programs running on their interpreters. Since those hints are not really documented all that well, I decided to write blog posts about them. This is the first one.
First, let's recap some basics: PyPy's approach to implementing dynamic languages is to write an interpreter for the language in RPython. This interpreter can be translated to C and then further to machine code. The interpreter consists of code in the form of a large number of generated C functions and some data. Similarly, the user program consists of functions in the language the interpreter executes.
As was explained in a blog post and a paper two years ago, PyPy's JIT is a meta-tracer. Since we want to re-use our tracer for a variety of languages, we don't trace the execution of the user program, but instead trace the execution of the interpreter that is running the program. This means that the traces don't contain the bytecodes of the language in question, but RPython-level operations that the interpreter did to execute the program.
On the other hand, the loops that are traced by the tracer are the loops in the user program. This means that the tracer stops tracing after one iteration of the loop in the user function that is being considered. At this point, it can have traced many iterations of the interpreter main loop.
Here's a diagram of this process:
On the left you see the levels of execution. The CPU executes the binary of PyPy's Python interpreter, which consists of RPython functions that have been compiled first to C, then to machine code. Some of these functions contain loops, others don't. The interpreter runs a Python program written by a programmer (the user). If the tracer is used, it traces operations on the level of the interpreter. However, the extent of the trace is determined by the loops in the user program.
How Far Should Tracing Go
When the tracer encounters a function call at the interpreter level, e.g. the interpreter main loop calling a helper function, it can do one of two things:
- it can trace into the helper function, effectively inlining it into the trace.
- it can not trace into the function and instead record a call to that function as an operation in the trace. Such a call operation in the trace is sometimes called residual call.
As a default, the tracer will try to trace into the helper because that will give more information to the optimizer, allowing it to do a better job. This is particularly important for the allocation removal optimization, because if a freshly allocated object is passed as an argument to a residual call, its allocation cannot be optimized away.
There is a problem however if the helper function itself contains a loop. The tracer records the linear sequence of operations that are being executed. Thus when it encounters a loop on the interpreter level it records all the operations of every iteration of the loop itself, with the net effect of unrolling it. The only places where the tracer stops and tries to close the trace is in the main loop of the interpreter. When the tracer encounters the main loop, it also checks whether the original user loop has been closed, and thus whether it can stop tracing.
For most helper functions in the interpreter that contain loops, fully unrolling does not make sense. If a loop is unrolled, the trace is specific to the number of iteration that was seen during tracing. If the trace is later executed with a different number of iterations, the trace will be left via a guard failure, which is inefficient. Therefore the default behaviour of the tracer is to never trace into a function on the interpreter level that contains a loop, but to trace into all non-looping helper functions.
This default behaviour is essentially a heuristic, but one that usually makes sense. We want to produce just enough traces to make the resulting code efficient, but not more. Therefore we trace as much as possible (everything by default) except the functions which loops where tracing would produce code that is less general than it could be.
As an example for a helper with a loop, take string concatenation. It loops over the characters of both arguments and copies them over into the result string. It does not make sense to unroll the loops in this function. If we do that, the resulting trace can only be used for strings of the length that was seen during tracing. In practise, the string lengths are usually different each run, meaning that the trace with unrolling is not run to completion in most cases.
Influencing the Default Behaviour
Sometimes the default behaviour is not actually what is wanted. This is something the interpreter author has to decide, usually by looking at the traces that are produced and deciding that they should be improved. There are two ways in which the default is wrong:
- false negatives: if a helper function that does contain a loop should be traced into, unrolling the loop.
- false positives: if a helper function that does not contain a loop is inlined into the trace, but the interpreter author decides that this is not helpful.
If the interpreter author finds false negatives or false positives, she can fix that by applying a hint to the tracer. These hints take the form of function decorators (which both live in the pypy.rlib.jit module). In the next two subsections I will describe these two function decorators and their use.
Unrolling Functions With Loops
The first decorator, used to fix false negatives, is the unroll_safe decorator. It is used to tell the tracer to always trace into a function that has a loop, effectively unrolling the loop. This decorator should be used only if the loop in the helper function is expected to always run for the same number of iterations. This sounds like a strong restriction, in practise this is less severe: The number of iterations needs to only be the same in the context where the helper functions is traced from.
It is easiest to understand this condition via an example. Let's look at the BUILD_TUPLE bytecode in Python. It takes one argument, the length n of the tuple being built. The bytecode pops n arguments from the stack, turns them into a tuple and pushes that tuple on the stack. Thus the function that implements BUILD_TUPLE in PyPy's Python interpreter calls a helper popvalues which pops n values from the stack and returns them in a list. This helper is implemented with a loop and would thus not be traced into by default. The loop in the helper can run for very different numbers of iterations, because it is used in a variety of places. However, for every concrete BUILD_TUPLE bytecode, the argument will be constant. Therefore it is safe (and even necessary) to annotate popvalues with the unroll_safe decorator.
A different example is the implementation of the isinstance builtin. It is used to check whether an object a is an instance of a class B like this: isinstance(a, B). The second argument of the function can also be a tuple of classes to check whether an object is an instance of one of a number of classes: isinstance(a, (A, B, C, D)). To implement this second case, the implementation of isinstance contains a loop iterating over the elements of the tuple. The number of loop iterations can vary, but is usually fixed for each individual call site which typically just lists a few classes in the source code. Therefore it is also safe to annotate the implementation of isinstance with the unroll_safe decorator.
Preventing the Tracing of Functions
The second decorator dont_look_inside is used to fix false positives. It tells the JIT to never trace into the decorated function and just always produce a residual call instead. This decorator is in many ways less important than the unrolling one (except for a special situation that I will describe in a follow-up post). It is used if tracing into a function is not expected to yield any speed benefits, because the optimizer will not be able to improve it much. This is often the case if the called helper function does not contain any "dynamic" behaviour. In such a situation it is better to just leave the function call in the trace, because that produces less code.
An example would be the import mechanism in Python. It's very unlikely that any performance improvement can be had by turning part of it into assembler. Therefore we hide it from the tracer by annotating them with dont_look_inside.
In this post we discussed two hints that can be used to control precisely which parts of the interpreter should be meta-traced. If these hints are used carefully, this can go a long way to making the interpreter produce traces that contain exactly the interesting part of the execution, and will contain calls to the functions that can not be optimized by tracing techniques.
In the next part of this series I will discuss a different set of hints that can be used to strongly optimize traces.