Armin and me have been working on PyPy's parser and bytecode compiler for the Python language in the last days. Armin implemented several bytecode optimizations that CPython has since a while whereas I tried to refactor our tokenizer and parser (because our existing parser is rather slow and also not very nice code). Armin is mostly done whereas the new parser is not very far yet.
What is done, however, is the Python tokenizer. It is implemented in the usual way, by using a set of regular expressions to generate a deterministic finite automaton (DFA). This automaton is then turned into a big function which does the actual tokenization. Of course the picture is not quite as simple for Python, because it is not possible to tokenize Python using only regular expressions. To generate the proper "indent" and "dedent" tokens it would be necessary to keep state (the previous indentation levels) which a DFA cannot do. This is solved by postprocessing the tokens that the tokenizer produces to turn whitespace tokens into the proper indent and dedent tokens.
For debugging purposes I implemented a visualization tool for DFAs using PyPy's pygame-based graph viewer. The graph viewer is able to visualize interactively any graph given in the graph-description language of Graphviz. Looking at the tokenizing DFA for Python is rather instructive, both for understanding how tokenizing works and (maybe) for understanding the Python language. To try it, download the dot file of the DFA and run from a pypy checkout:
$ python pypy/bin/dotviewer.py tokenizer.dotThe following is a screenshot of the graphviewer:
For people who don't want do checkout PyPy I generated a (rather big) png for the DFA.
Next thing I would like to do (apart from actually finishing the parser, of course :-) ) is visualize the Python grammar itself using syntax diagrams or something similar. So far I couldn't really find a program to do that, though.