
Visualizing a Python tokenizer

Armin and I have been working on PyPy's parser and bytecode compiler for the Python language over the last few days. Armin implemented several bytecode optimizations that CPython has had for a while, whereas I tried to refactor our tokenizer and parser (because our existing parser is rather slow and also not very nice code). Armin is mostly done, whereas the new parser is not very far along yet.

What is done, however, is the Python tokenizer. It is implemented in the usual way: a set of regular expressions is used to generate a deterministic finite automaton (DFA), which is then turned into a big function that does the actual tokenization. Of course the picture is not quite as simple for Python, because Python cannot be tokenized using only regular expressions. To generate the proper "indent" and "dedent" tokens it would be necessary to keep state (the previous indentation levels), which a DFA cannot do. This is solved by postprocessing the tokens that the tokenizer produces, turning whitespace tokens into the proper indent and dedent tokens.
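The postprocessing step can be sketched roughly like this (a minimal illustration of the idea, not PyPy's actual code — the token names and the shape of the input are hypothetical):

```python
# Sketch of the indent/dedent postprocessing described above.  We assume
# the DFA stage already produced, per logical line, its leading
# whitespace width and the remaining tokens; the stack of indentation
# levels below is exactly the state a plain DFA cannot keep.

def add_indent_tokens(lines):
    """lines: list of (indent_width, tokens) pairs, one per logical line.
    Returns the token stream with INDENT/DEDENT markers inserted."""
    levels = [0]
    result = []
    for width, tokens in lines:
        if width > levels[-1]:
            levels.append(width)
            result.append(("INDENT", width))
        else:
            while width < levels[-1]:
                levels.pop()
                result.append(("DEDENT", width))
            if width != levels[-1]:
                raise IndentationError("unindent does not match any outer level")
        result.extend(tokens)
    # close all blocks still open at end of file
    while len(levels) > 1:
        levels.pop()
        result.append(("DEDENT", 0))
    return result

# Example corresponding to:  "if x:" / "    y" / "z"
toks = add_indent_tokens([
    (0, [("NAME", "if"), ("NAME", "x"), ("OP", ":")]),
    (4, [("NAME", "y")]),
    (0, [("NAME", "z")]),
])
```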

For debugging purposes I implemented a visualization tool for DFAs using PyPy's pygame-based graph viewer. The graph viewer can interactively visualize any graph given in the graph-description language of Graphviz. Looking at the tokenizing DFA for Python is rather instructive, both for understanding how tokenizing works and (maybe) for understanding the Python language. To try it, download the dot file of the DFA and run from a pypy checkout:

$ python pypy/bin/

The following is a screenshot of the graphviewer:

For people who don't want to check out PyPy I generated a (rather big) PNG of the DFA.
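The dot format the viewer consumes is plain text, so a DFA transition table is easy to render. A small sketch (the automaton below is a toy one recognizing decimal integers, not Python's tokenizing DFA):

```python
# Emit Graphviz dot text for a small DFA -- the same kind of
# graph description the pygame graph viewer displays.

def dfa_to_dot(transitions, start, accepting):
    """transitions: dict mapping (state, label) -> next state."""
    lines = ["digraph dfa {", "  rankdir=LR;"]
    for state in sorted(accepting):
        lines.append('  "%s" [shape=doublecircle];' % state)
    lines.append('  "" [shape=none]; "" -> "%s";' % start)  # entry arrow
    for (src, label), dst in sorted(transitions.items()):
        lines.append('  "%s" -> "%s" [label="%s"];' % (src, dst, label))
    lines.append("}")
    return "\n".join(lines)

dot = dfa_to_dot(
    {("start", "digit"): "int", ("int", "digit"): "int"},
    start="start",
    accepting={"int"},
)
print(dot)
```

Saving the output to a file and opening it with the graph viewer (or plain `dot -Tpng`) gives the same kind of picture as the screenshot above.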

Next thing I would like to do (apart from actually finishing the parser, of course :-) ) is visualize the Python grammar itself using syntax diagrams or something similar. So far I couldn't really find a program to do that, though.

Anonymous wrote on 2008-01-08 16:11:
Carl Friedrich Bolz-Tereick wrote on 2008-01-08 17:52:

Hi Robin,

Yes, this is sort of cool already, but it doesn't really give you as much information as the grammar itself. It's more of an overview-like thing.


Carl Friedrich

Anonymous wrote on 2008-01-08 18:05:

Check out ANTLRWorks or ANTLRStudio for superb visualization of grammars.

Carl Friedrich Bolz-Tereick wrote on 2008-01-09 13:57:

Yeah, ANTLRWorks seems pretty nice in that respect. Thanks for the hint.

Unknown wrote on 2008-01-18 13:52:


I was considering to use

for a toy language project for uni (Computer Engineering, IT) because I guess ANTLR(works) and its mailing list could help me a lot in understanding the very basics of 'grammar design'...

Or at least I hope so! :-)

However, I've also been lurking the PyPy ml for quite a while now and was considering the possibility to implement the whole (*toy*) interpreter in RPython, so to understand a bit more of PyPy's design by actually coding something simple which makes some use of it. :-)

So, would you consider trying to port the current ANTLR's Python runtime to RPython a good way for me to start doing something with PyPy?

Would you consider the thing interesting? I know this possibility was discussed on IRC some time ago and wasn't thought to be that useful in the end, but maybe you have discussed it some more since then and changed your minds, I don't know...

How would you rate the difficulty of such a task for a PyPy and ANTLR newbie, also? :-)

I wouldn't try doing that right now, anyway, but maybe in March I should manage to get some spare time for it. In the meantime, I'd try to read as many docs and sources as possible...

Tell me your opinion, please! :-)


PS: I'm writing here because you were discussing PyPy's Python lexer and somebody wrote about ANTLRworks, but if you think I'd better send this message to the list please just tell me and I'll do so! :-)

Carl Friedrich Bolz-Tereick wrote on 2008-01-18 16:32:

Hi Matteo,

I would post to the pypy-dev mailing list instead, it is better to discuss such longish things there. Thanks for your interest!


Carl Friedrich

Unknown wrote on 2008-05-19 14:45:

Does it use Graphviz's dot utility for rendering?

Konrad wrote on 2008-06-04 18:36:


Yes, it does.

Unknown wrote on 2008-06-04 18:47:

That's a pity, it can't be used as a standalone product without installing Graphviz.


PyPy Winter Sports Sprint from 12-19th of January in Leysin, Switzerland

The next PyPy sprint will be held in Leysin, Switzerland, for the fifth time. The overall idea of the sprint is to continue working on making PyPy ready for general use.

The proposed topics are: ctypes, JIT, testing, LLVM. This is a fully public sprint, so newcomers and other topics are welcome. And like previous winters, the main side goal is to have fun in winter sports :-) See the sprint announcement for details.


Various Performance Improvements

A few days ago, Armin discovered Gnuplot. He wrote a script that turns the results of the nightly benchmark runs into plots (lower is always better; all the numbers of the microbenchmarks are "times slower than CPython"). The corresponding microbenchmarks can be found in the repository. Staring at the plots revealed a strange performance regression around revision 45000. After some investigation Armin found that a mostly unrelated change had disabled our method cache, which caused the regression. This was fixed. In addition, Armin did a few other small tweaks in the interpreter main loop, making sure that small bytecodes are inlined into the main loop. This gave another few percent of performance. Together with the GC improvements two weeks ago this leads to the fastest non-JIT PyPy ever. Unfortunately "fastest" is not really very fast yet in absolute terms, with realistic apps being around 3-4 times slower than CPython. Especially calls (in all their variants) are quite slow, which is something we should look into.
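The metric on the plots is just the ratio of the two timings; a tiny sketch of how such numbers are computed (the benchmark names and timings below are made up for illustration):

```python
# Turn raw benchmark timings into the "times slower than CPython"
# ratios the nightly plots show (lower is better).  Numbers are
# invented for illustration, not real benchmark results.

cpython_times = {"richards": 0.26, "pystone": 0.31}   # seconds
pypy_times    = {"richards": 0.95, "pystone": 1.10}   # seconds

slowdown = {name: pypy_times[name] / cpython_times[name]
            for name in cpython_times}

for name, ratio in sorted(slowdown.items()):
    print("%-10s %.2fx slower than CPython" % (name, ratio))
```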

Anonymous wrote on 2008-01-06 06:05:

I'm amazed by the progress you guys have made. 3 - 4 times slower than CPython is actually really good considering what it does!

PyPy is one of the most interesting computer language projects on the net.

Faster implementation of classic classes merged

Old-style classes have so far been a bit neglected by PyPy's Python interpreter. By default, PyPy makes all classes new-style and you have to use a command-line switch (--oldstyle) at startup or at translation time to change that default. Then you would get a pure-Python implementation of classic classes. This implementation was extremely slow (around 20 times slower than classic classes in CPython). In the past we had hoped that we could get away with mostly only supporting new-style classes; however, real-world software seems to rely on them quite a bit, so we decided to offer a better migration path. A while ago I therefore started a re-implementation of classic classes in RPython to speed them up. This work is now finished, and the branch I worked on got merged today. Speed for the old-style class benchmarks was improved greatly and I found quite a number of bugs in the old implementation too. New-style classes are still a bit faster than old-style classes in PyPy though, and this is unlikely to change.
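For readers unfamiliar with classic classes: their attribute lookup searches the instance dict first, then the class and its bases depth-first, left-to-right (no C3 linearization as in new-style classes). A toy model of that rule — purely an illustration, not PyPy's actual RPython implementation:

```python
# Toy model of old-style (classic) class attribute lookup:
# instance dict first, then depth-first left-to-right through the
# class and its bases.  Not PyPy's real code, just the semantics.

class ClassicClass:
    def __init__(self, name, bases, namespace):
        self.name, self.bases, self.namespace = name, bases, namespace

    def lookup(self, attr):
        # depth-first, left-to-right -- no C3 linearization
        if attr in self.namespace:
            return self.namespace[attr]
        for base in self.bases:
            try:
                return base.lookup(attr)
            except AttributeError:
                pass
        raise AttributeError(attr)

class ClassicInstance:
    def __init__(self, cls):
        self.cls = cls
        self.attrs = {}

    def getattr(self, attr):
        if attr in self.attrs:           # instance dict wins
            return self.attrs[attr]
        return self.cls.lookup(attr)     # then the class hierarchy

A = ClassicClass("A", (), {"x": 1})
B = ClassicClass("B", (A,), {})
obj = ClassicInstance(B)
obj.attrs["y"] = 2
```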

Michael Foord wrote on 2007-12-14 18:29:

Hey guys - it's great to hear so much about PyPy progress. Keep up the good work (coding *and* blogging, of course).


Carl Friedrich Bolz-Tereick wrote on 2007-12-14 18:38:

Hi Michael!

It seems we are slowly getting into this blogging thing. Good to hear that somebody is actually reading that stuff too :-).


Carl Friedrich

Anonymous wrote on 2007-12-15 01:05:

Of course it is being read, more please :)!

Anonymous wrote on 2007-12-20 14:10:

Dear Carl and other PyPy developers,
Thank you for all of your hard work in getting PyPy to its present impressive state.
I really enjoy reading about your activities and accomplishments on this blog and on the PyPy irc logs.


Carl Friedrich Bolz-Tereick wrote on 2007-12-20 14:23:

Hi gyro!

Really impressive that you chew through all the IRC-logs: Even I find that a lot of work sometimes :-)

Profiling for fun with valgrind

Recently I've been doing a lot of profiling on the PyPy executables to find speed bottlenecks. Valgrind (the original page seems to be down) is an extremely nice tool for doing this. It has several built-in tools that give you different types of profiles. The callgrind mode provides you with a lot of information, including relative call costs. The cachegrind tool gives you less information, but what it gives you (e.g. cache misses) is much more accurate. The obvious wish is for a way to combine the results of two profiling runs to get both. In the last days I wrote a script that does this. It's available at my user's svn and has a pretty intuitive command line interface. The combining calculations are not perfect yet; total costs of functions can still be a bit bogus (they can sum up to whatever), but at least the relative figures are good. This means that we can stop looking at two different types of graphs now. An awesome tool for analyzing the profile data is KCachegrind, which also proves that my 12'' display is too small, at least for some things :-). Update: pygrind is available under the MIT license.
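The core of such a combination can be sketched very simply: keep the relative per-function cost distribution from the callgrind run, but scale it to cachegrind's more accurate total event count. This is only an illustration of the idea, not pygrind's actual algorithm:

```python
# Hypothetical sketch of combining two valgrind profiles: callgrind's
# per-function costs give the relative distribution, cachegrind's
# total gives the accurate absolute scale.  Not pygrind's real code.

def combine_profiles(callgrind_costs, cachegrind_total):
    """callgrind_costs: dict mapping function name -> event cost.
    cachegrind_total: accurate total event count from cachegrind."""
    total = sum(callgrind_costs.values())
    return {name: cachegrind_total * cost / total
            for name, cost in callgrind_costs.items()}

combined = combine_profiles(
    {"eval_frame": 600, "lookup": 300, "alloc": 100},
    cachegrind_total=5000,
)
```

This preserves exactly the property the post mentions: the relative figures stay trustworthy even if the absolute per-function totals are approximate.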

José Fonseca wrote on 2007-12-14 13:11:


In what license are you releasing pygrind? I would like to integrate pygrind's code into Gprof2Dot (LGPL) to be able to generate nice-looking graphs from cachegrind output.

Maciej Fijalkowski wrote on 2007-12-14 13:18:

Indeed, good question, thanks. I've updated the blog post (I think the license satisfies you; if not, you can have it under the LGPL if you like).

José Fonseca wrote on 2007-12-14 14:43:

Excellent. Thanks!

Aaron Bentley wrote on 2007-12-14 17:57:

If you like kcachegrind, you might like using it for profiling python:

PyPy tasks in GHOP

In the latest bunch of tasks that Titus released on Friday for the Google Highly Open Participation Contest there are several that are related to PyPy. Some of them are about presenting PyPy to a technical audience: Task 187, Task 188, Task 189, Task 190. Then there are three about Ropes, which are all rather challenging:

In addition there is a task to use PyPy's sandboxing features to provide an interactive Python tutorial on a web page: Task 220. We're really looking forward to working together with some bright students!
teki wrote on 2007-12-06 01:28:

"faster than c" means you can write code in other languages which looks like it does the same as the c one, but the c one is slower

Anonymous wrote on 2007-12-06 09:20:

Took me some time to get the joke. =)

Fredrik Johansson wrote on 2007-12-06 16:48:

Dream on; only imaginary things can be faster than c.

(Relativistic mass formula with square root of a negative number for those who don't get it.)

Carl Friedrich Bolz-Tereick wrote on 2007-12-06 17:09:

Sorry, sorry. I think I spent too much time around physicists. Further hint: notice the lower-case c.

Eduardo O. Padoan wrote on 2007-12-18 16:06:

Benchmarking is relative :)

Good news from the garbage collection front

It seems that we can do better! Armin fixed a bug in our generational garbage collector, which caused variable-sized objects (e.g. arrays) to be allocated outside of the nursery. This resulted in a 50% speedup on synthetic benchmarks and about 10-20% on real-world ones. Doing some preliminary measurements, it seems that we spend roughly 10% of the time in garbage collection, which is good (and there is still some room for improvement!)
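Why does nursery allocation matter so much? In a generational GC, allocating in the nursery is just bumping a pointer, while falling back to the general allocator (as the fixed bug accidentally did for arrays) is far more expensive. A toy model of the fast and slow paths (sizes and limits invented for illustration):

```python
# Toy model of bump-pointer nursery allocation in a generational GC.
# The fast path only increments a pointer; objects that don't fit take
# the slow path into the old generation -- the path the fixed bug
# wrongly forced all variable-sized objects onto.

class Nursery:
    def __init__(self, size):
        self.size = size
        self.top = 0                      # bump pointer
        self.outside_allocations = 0

    def allocate(self, nbytes):
        if self.top + nbytes <= self.size:
            addr = self.top               # fast path: bump the pointer
            self.top += nbytes
            return ("nursery", addr)
        # slow path: nursery full or object too big
        self.outside_allocations += 1
        return ("external", nbytes)

n = Nursery(size=64)
a = n.allocate(16)
b = n.allocate(16)
big = n.allocate(100)   # doesn't fit: allocated outside the nursery
```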