Armin and me have been working on PyPy's parser and bytecode compiler for the Python language in the last days. Armin implemented several bytecode optimizations that CPython has since a while whereas I tried to refactor our tokenizer and parser (because our existing parser is rather slow and also not very nice code). Armin is mostly done whereas the new parser is not very far yet.
What is done, however, is the Python tokenizer. It is implemented in the usual way, by using a set of regular expressions to generate a deterministic finite automaton (DFA). This automaton is then turned into a big function which does the actual tokenization. Of course the picture is not quite as simple for Python, because it is not possible to tokenize Python using only regular expressions. To generate the proper "indent" and "dedent" tokens it would be necessary to keep state (the previous indentation levels) which a DFA cannot do. This is solved by postprocessing the tokens that the tokenizer produces to turn whitespace tokens into the proper indent and dedent tokens.
For debugging purposes I implemented a visualization tool for DFAs using PyPy's pygame-based graph viewer. The graph viewer is able to visualize interactively any graph given in the graph-description language of Graphviz. Looking at the tokenizing DFA for Python is rather instructive, both for understanding how tokenizing works and (maybe) for understanding the Python language. To try it, download the dot file of the DFA and run from a pypy checkout:
$ python pypy/bin/dotviewer.py tokenizer.dotThe following is a screenshot of the graphviewer:
For people who don't want do checkout PyPy I generated a (rather big) png for the DFA.
Next thing I would like to do (apart from actually finishing the parser, of course :-) ) is visualize the Python grammar itself using syntax diagrams or something similar. So far I couldn't really find a program to do that, though.
The next PyPy sprint will be held in Leysin, Switzerland, for the fifth time. The overall idea of the sprint is to continue working on making PyPy ready for general use.
The proposed topics are: ctypes, JIT, testing, LLVM. This is a fully public sprint, so newcomers and other topics are welcome. And like previous winters, the main side goal is to have fun in winter sports :-) See the sprint announcement for details.
A few days ago, Armin discovered Gnuplot. He wrote a script that turns the results of the nightly benchmark runs into plots (lower is always better, all the numbers of the microbenchmarks are "times slower than CPython"). The corresponding microbenchmarks can be found in the repository. Staring at the plots revealed a strange performance regression around the revision 45000. After some investigation Armin found that an mostly unrelated change had disabled our method cache, which caused the regression. This was fixed. In addition, Armin did a few other small tweaks in the interpreter main loop, making sure that small bytecodes are inlined into the main loop. This gave another few percent of performance increase. Together with the GC improvements two weeks ago this leads to the fastest non-JIT PyPy ever. Unfortunately "fastest" is not really very fast yet in absolute terms, with realistic apps being around 3-4 times slower than CPython. Especially calls (in all its variants) are quite slow, which is something we should look into.
Old-style classes have so far been a bit neglected by PyPy's Python interpreter. By default, PyPy makes all classes new-style and you have to use a command-line switch (--oldstyle) at startup or at translation time to change that default. Then you would get an pure-Python implementation of classic classes. This implementation was extremely slow (around 20 times slower than classic classes in CPython). In the past we had hoped that we could get away with mostly only supporting new-style classes, however it seems that real-world software seems to rely on them quite a bit, so we decided to offer a better migration path. A while ago I therefore started a re-implementation of classic classes in RPython to speed them up. This work is now finished, the branch I worked on got merged today. Speed for the old-style class benchmarks was improved greatly and I found quite a number of bugs in the old implementation too. New-style classes are still a bit faster than old-style in PyPy though, and this is unlikely to change.
Recently I've been doing a lot of profiling on the PyPy executables to find speed bottlenecks. Valgrind (the original page seems to be down) is an extremely nice tool for doing this. It has several built-in tools that give you different types of profiles. The callgrind mode provides you with a lot of information including relative call costs. The cachegrind tool gives you less information, but what it gives you (e.g. cache misses) is much more accurate. The obvious choice would be to have a way to combine the results of two profiling runs to have both. In the last days I wrote a script that does this. It's available at my user's svn and has a pretty intuitive command line interface. The combining calculation are not perfect yet, total costs of functions can still be a bit bogus (they can sum up to whatever) but at least the relative figures are good. This means that we can stop looking at two different types of graphs now. An awesome tool for analyzing the profile data is kcachegrind. Which also proves that my 12'' display is to small at least for some things :-). Update: pygrind is available under the MIT license.
In the latest bunch of tasks that Titus released on Friday for the Google Highly Open Participation Contest there are several that are related to PyPy. Some of them are about presenting PyPy to a technical audience: Task 187, Task 188, Task 189, Task 190. Then there are some three about Ropes, which are all rather challenging:
- Solving the first three section of the last ICFP contest with PyPy's ropes implementation:Task 248.
- Implementing nice wrapper classes around PyPy's ropes algorithms to make their use convenient: Task 239.
- Implementing the Ropes algorithms in C as a CPython extension module: Task 218 (already taken).
It seems that we can do better! Armin fixed a bug in our generational garbage collector, which caused variable sized objects (e.g. arrays) to be allocated outside of the nursery. This resulted in 50% speedup on synthetic benchmarks and about 10-20% on real world ones. Doing some preliminary measures, it seems that we spend roughly 10% of the time in garbage collection, which is good (and there is still some room for improvements!)