A quick update. The first version of pypy-stm based on regular
threads is ready. Still having no JIT and a 4-or-5-times performance
hit, it is not particularly fast, but I am happy that it turns out not
to be much slower than the previous thread-less attempts. It is at
least fast enough to run faster (in real time) than an equivalent no-STM
PyPy, if fed with an eight-threaded program on an eight-core machine
(provided, of course, you don't mind it eating all 8 cores' CPU power
instead of just one :-).
You can download and play around with this binary for Linux 64. It
was made from the stm-thread branch of the PyPy repository (translate.py --stm -O2 targetpypystandalone.py). (Be sure
to put it where it can find its stdlib, e.g. by putting it inside the
directory from the official 1.9 release.)
This binary supports the thread module and runs without the GIL.
So, despite the factor-of-4 slow-down issue, it should be the fourth
complete Python interpreter in which we can reasonably claim to have
resolved the problem of the GIL. (The first one was Greg Stein's Python
1.4, re-explored here; the second one is Jython; the third one is
IronPython.) Unlike the previous three, it is also the first one to
offer full GIL semantics to the programmer, and additionally
thread.atomic (see below). I should also add that we're likely to
see in the next year a 5th such interpreter, too, based on Hardware
Transactional Memory (same approach as with STM, but using e.g.
The binary I linked to above supports all built-in modules from PyPy,
apart from signal, still being worked on (which can be a bit
annoying because standard library modules like subprocess depend on
it). The sys.get/setcheckinterval() functions can be used to tweak
the frequency of the automatic commits. Additionally, it offers
thread.atomic, described in the previous blog post as a way to
create longer atomic sections (with the observable effect of preventing
the "GIL" to be released during that time). A complete
transaction.py module based on it is available from the sources.
The main missing features are:
- the signal module;
- the Garbage Collector, which does not do major collections so far, only
- and finally, the JIT, which needs some amount of integration to generate
the correctly-tweaked assembler.
We're pleased to announce the 1.9 release of PyPy. This release brings mostly
bugfixes, performance improvements, other small improvements and overall
progress on the numpypy effort.
It also brings an improved situation on Windows and OS X.
You can download the PyPy 1.9 release here:
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for
CPython 2.7. It's fast (pypy 1.9 and cpython 2.7.2 performance comparison)
due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64 or
Windows 32. Windows 64 work is still stalling, we would welcome a volunteer
to handle that.
Thanks to our donors
But first of all, we would like to say thank you to all people who
donated some money to one of our four calls:
Thank you all for proving that it is indeed possible for a small team of
programmers to get funded like that, at least for some
time. We want to include this thank you in the present release
announcement even though most of the work is not finished yet. More
precisely, neither Py3k nor STM are ready to make it in an official release
yet: people interested in them need to grab and (attempt to) translate
PyPy from the corresponding branches (respectively py3k and
- This release still implements Python 2.7.2.
- Many bugs were corrected for Windows 32 bit. This includes new
functionality to test the validity of file descriptors; and
correct handling of the calling convensions for ctypes. (Still not
much progress on Win64.) A lot of work on this has been done by Matti Picus
and Amaury Forgeot d'Arc.
- Improvements in cpyext, our emulator for CPython C extension modules.
For example PyOpenSSL should now work. We thank various people for help.
- Sets now have strategies just like dictionaries. This means for example
that a set containing only ints will be more compact (and faster).
- A lot of progress on various aspects of numpypy. See the numpy-status
page for the automatic report.
- It is now possible to create and manipulate C-like structures using the
PyPy-only _ffi module. The advantage over using e.g. ctypes is that
_ffi is very JIT-friendly, and getting/setting of fields is translated
to few assembler instructions by the JIT. However, this is mostly intended
as a low-level backend to be used by more user-friendly FFI packages, and
the API might change in the future. Use it at your own risk.
- The non-x86 backends for the JIT are progressing but are still not
merged (ARMv7 and PPC64).
- JIT hooks for inspecting the created assembler code have been improved.
See JIT hooks documentation for details.
- select.kqueue has been added (BSD).
- Handling of keyword arguments has been drastically improved in the best-case
scenario: proxy functions which simply forwards *args and **kwargs
to another function now performs much better with the JIT.
- List comprehension has been improved.
There will be a corresponding 1.9 release of JitViewer which is guaranteed to work
with PyPy 1.9. See the JitViewer docs for details.
The PyPy Team
For various reasons, less work than usual has been done since the last status
update. However, some interesting things happened anyway.
As readers know, so far we spent most of the effort in fixing all PyPy's own
tests which started to fail for various py2/py3 differences. Most of them
failed for shallow reasons, e.g. syntactic changes or the int/long
unifications. Others failed for subtle differences and needed a bit more care,
for example the fact that unbound methods are gone in Py3k.
The good news is that finally we are seeing the light at the end of the
tunnel. Most of them have been fixed. For sine other tests, we introduced the
concept of "py3k-skipping": some optimizations and modules are indeed failing,
but right now we are concentrating on completing the core language and so we
are not interested in those. When the core language will be done, we will be
able to easily find and work on the py3k-skipped tests. In particular, for
now we disabled the Int and String dict strategies, which are broken
because of the usual int/long unification and str vs bytes. As for modules,
for now _continuation (needed for stackless) and _multiprocessing do
not work yet.
Another non-trivial feature we implemented is the proper cleaning of exception
variables when we exit except blocks. This is a feature which touches
lots of levels of PyPy, starting from astcompiler, down to the bytecode
interpreter. It tooks two days of headache, but at the end we made it :-).
Additionally, Amaury did a lot of improvements to cpyext, which had been
broken since forever on this branch.
As for the next plans, now that things are starting to work and PyPy's own
tests mostly pass, we can finally start to run the compiled PyPy against
CPython's test suite. It is very likely that we will have tons of failures at
the beginning, but once we start to fix them one by one, a Py3k-compatible
PyPy will be closer and closer.
Here is another update on the status of Software Transactional Memory on PyPy.
Those of you who have been closely following this blog since last year know that, from the very first post about STM, I explored various design ideas about the API that we should get when programming in Python.
I went a full circle, and now I am back to where I started (with, important difference, a very roughly working implementation of pypy-stm).
What I realized is that the "thread" module is not that bad after all --- I mean, yes, it is a horribly low-level interface, but it is general enough to build various interesting things on top of it. What the "stm-thread" branch of PyPy contains is, basically, the regular "thread" module in which the GIL was replaced with STM. It gives multicore capabilities to any program based on multiple threads. (This is so far exactly the idea same than the one being investigated for Hardware Transactional Memory. It is roughly also what you would get if you managed to convince GCC 4.7 to compile CPython using STM.)
Now while this might already be quite interesting to some people, here is how it relates to all I said previously: namely, threads are bad, and some new "transaction" module would be a better idea.
There is one new core functionality in the "stm-thread" branch: it is "thread.atomic", a context manager that can be used in a "with" statement (exact name subject to change). In terms of the GIL, it prevents the GIL from being released in the "with" block. In terms of STM, it prevents a "transaction break", which means that the whole "with" statement runs in one single transaction. (From the Python programmer's point of view, the net effect is the same.)
So far, no ground-breaking news. But what I missed previously is that this is enough to give multicore capabilities even to a program that is not using threads so far. It is possible to rewrite an equivalent of the old transaction module in a few pages of pure Python, using "thread.atomic". Something along the following lines: start N threads that each reads from a Queue.Queue() the next job to do, and does it in a "with thread.atomic" block. The STM version of PyPy is then able to run these atomic blocks concurrently. The key point is that the slightly delicate handling of threads should be nicely hidden inside the new "transaction" module, and from outside the observed behavior would be exactly as if the transactions that we schedule are run serially.
The point I kept missing was that, yes, this sounds like nonsense, because it seems that we create N threads just to serialize their work again in "thread.atomic" sections. In fact this would be nonsense in any model that would "just" remove the GIL to let multiple threads run concurrently without crashing. Indeed, you have multiple threads, but their atomic blocks would be again a sort of GIL: only one of them would run at a time. And this is indeed the simple model of execution that you get even with STM --- but not the model of performance. The performance with STM scales with the number of cores, as long as there is enough non-conflicting work to do.
So in summary the complete circle back to the starting point is that threads might be a good low-level model. It mends itself naturally to, say, a kind of program in which the main thread polls file descriptors using select() or the Linux epoll(), and the work received is split along N other threads --- which is the kind of program you would naturally write in other languages that don't have a GIL, say Java. The other threads can then use "thread.atomic" blocks to protect sections of their work. The traditional Transactional Memory point of view is that you use such blocks to guard the short sections of code that communicate with other threads or modify global state, but nothing prevents you from using much larger sections: you should be able to scale them up to the size of a native "unit of work", so that every unit is naturally atomic. And then it's only a matter of design: you can tweak an existing module that does the thread pooling to add one "with thread.atomic"; or do it yourself from scratch; or (if the design is compatible enough) just plug in the proposed pure-Python "transaction" module. Or if you feel like it you can even use threads directly (but keep in mind that using threads too explicitly is not a composable abstraction, whereas higher-level designs typically are).
At the end of the day, you can write or reuse programs whose global structure you are already familiar with, for example with a thread pool (that can be hidden in a library if you prefer), or any other structure with or without explicit threads. But you can do so without all the mess that comes with threads like locks and deadlocks. From that angle it is really similar to Garbage Collection: e.g. the Boehm GC (now used by GCC itself) lets you write C code like you are used to, but forgeting all you had to learn about careful explicit memory management.
A short update on the Software Transactional Memory (STM) side. Let me remind you that the work is to add STM internally into PyPy, with the goal of letting the user's programs run on multiple cores after a minor adaptation. (The goal is not to expose STM to the user's program.) I will soon write some official documentation that explains in more details exactly what you get. For now you can read the previous blog posts, and you can also find technical details in the call for donation itself; or directly look at how I adapted the examples linked to later in this post.
I have now reached the point where the basics seem to work. There is no integration with the JIT so far; moreover the integration with the Garbage Collection subsystem is not finished right now, but at least it is "not crashing in my simple tests and not leaking memory too quickly". (It means that it is never calling __del__ so far, although it releases memory; and when entering transactional mode or when going to the next transaction, all live objects become immortal. This should still let most not-too-long-running programs work.)
If you want to play with it, you can download this binary (you need to put it in a place with the paths lib-python and lib_pypy, for example inside the main directory from a regular nightly tarball or from a full checkout). This version was compiled for Linux x86 32-bit from the stm-gc branch on the 25th of April. It runs e.g. the modified version of richards. This branch could also be translated for Linux x86-64, but not for other OSes nor other CPUs for now.
The resulting pypy-stm exposes the same interface as the pure Python transaction module, which is an emulator (running on CPython or any version of PyPy) which can be used to play around and prepare your programs. See the comments in there. A difference is that the real pypy-stm doesn't support epoll right now, so it cannot be used yet to play with a branch of Twisted that was already adapted (thanks Jean-Paul Calderone); but that's coming soon. For now you can use it to get multi-core usage on purely computational programs.
I did for example adapt PyPy's own translate.py: see the tweak in rpython/rtyper.py. Lines 273-281 are all that I needed to add, and they are mostly a "simplification and parallelization" of the lines above. There are a few more places in the whole translate.py that could be similarly modified, but overall it is just that: a few places. I did not measure performance, but I checked that it is capable of using multiple cores in the RTyping step of translation, with --- as expected --- some still-reasonable number of conflicts, particularly at the beginning when shared data structures are still being built.
On a few smaller, more regular examples like richards, I did measure the performance. It is not great, even taking into account that it has no JIT so far. Running pypy-stm with one thread is roughly 5 times slower than running a regular PyPy with no JIT (it used to be better in previous versions, but they didn't have any GC; nevertheless, I need to investigate). However, it does seem to scale. At least, it scales roughly as expected on my 2-real-cores, 4-hyperthreaded-cores laptop (i.e. for N between 1 and 4, the N-threaded pypy-stm performs similarly to N independent pypy-stm's running one thread each).
...a big thank you to everyone who contributed some money to support this! As you see on the PyPy site, we got more than 6700$ so far in only 5 or 6 weeks. Thanks to that, my contract started last Monday, and I am now paid a small salary via the Software Freedom Conservancy (thanks Bradley M. Kuhn for organizational support from the SFC). Again, thank you everybody!
UPDATE: The performance regression was due to disabling an optimization, the method cache, which caused non-deterministic results --- the performance could vary from simple to double. Today, as a workaround, I made the method cache transaction-local for now; it is only effective for transactions that run for long enough (maybe 0.1ms or 1ms), but at least it is there in this situation. In the version of richards presented above, the transactions are too short to make a difference (around 0.015ms).
A lot of things happened in March, like pycon. I was also busy doing other things (pictured), so apologies for the late numpy status update.
However, a lot of things have happened and numpy continues to be one of the main points of entry for hacking on PyPy. Apologies to all the people whose patches I don't review in timely manner, but seriously, you do a lot of work.
This list of changes is definitely not exhaustive, and I might be forgetting important contributions. In a loose order:
Matti Picus made out parameter work for a lot of (but not all) functions.
We merged record dtypes support. The only missing dtypes left are complex (important), datetime (less important) and object (which will probably never be implemented because it makes very little sense and is a mess with moving GCs).
Taavi Burns and others implemented lots of details, including lots of ufuncs. On the completely unscientific measure of "implemented functions" on numpypy status page, we're close to 50% of numpy working. In reality it might be more or less, but after complex dtypes we're getting very close to running real programs.
Bool indexing of arrays of the same size should work, leaving only arrays-of-ints indexing as the last missing element of fancy indexing.
I did some very early experiments on SSE. This work is seriously preliminary - in fact the only implemented operation is addition of float single-dimension numpy arrays. However, results are encouraging, given that our assembler generator is far from ideal:
The benchmark repo is available. GCC was run with -O3, no further options specified. PyPy was run with default options, the SSE branch is under backend-vector-ops, but it's not working completely yet.
One might argue that C and Python is not the same code - indeed it is not. It just shows some possible approach to writing numeric code.
Next step would be to just continue implementing missing features such as
- specialised arrays i.e. masked arrays and matrixes
- core modules such as fft, linalg, random.
- numpy's testing framework
The future is hard to predict, but we're not far off!
UPDATE:Indeed, string and unicode dtypes are not supported yet. They're as important as complex dtype
So, PyCon happened. This was the biggest PyCon ever and probably the biggest gathering of Python hackers ever.
From the PyPy perspective, a lot at PyCon was about PyPy. Listing things:
- David Beazley presented an excellent keynote describing his experience diving head-first into PyPy and at least partly failing. He, however, did not fail to explain bits and pieces about PyPy's architecture. Video is available.
- We gave tons of talks, including the tutorial, why pypy by example and pypy's JIT architecture
- We had a giant influx of new commiters, easily doubling the amount of pull requests ever created for PyPy. The main topics for newcomers were numpy and py3k, disproving what David said about PyPy being too hard to dive into ;)
- Guido argued in his keynote that Python is not too slow. In the meantime, we're trying to prove him correct :-)
We would like to thank everyone who talked to us, shared ideas and especially those who participated in sprints - we're always happy to welcome newcomers!
I'm sure there are tons of things I forgot, but thank you all!
A lot of work has been done during the last month: as usual, the list of changes is too big to be reported in a detalied way, so this is just a summary of what happened.
One of the most active areas was killing old and deprecated features. In particular, we killed support for the __cmp__ special method and its counsins, the cmp builtin function and keyword argument for list.sort() and sorted(). Killing is easy, but then you have to fix all the places which breaks because of this, including all the types which relied on __cmp__ to be comparable,, fixing all the tests which tried to order objects which are no longer ordeable now, or implementing new behavior like forbidding calling hash() on objects which implement __eq__ but not __hash__.
Among the other features, we killed lots of now-gone functions in the operator module, the builtins apply(), reduce() and buffer, and the os.* functions to deal with temporary files, which has been deprecated in favour of the new tempfile module.
The other topic which can't miss in a py3k status update is, as usual, string-vs-unicode. At this round, we fixed bugs in string formatting (in particular to teach format() to always use unicode strings) and various corner cases about when calling the (possibly overridden) __str__ method on subclasses of str. Believe me, you don't want to know the precise rules :-).
Other features which we worked on and fixed tests include, but are not limited to, marshal, hashlib, zipimport, _socket and itertools, plus the habitual endless lists of tests which fail for shallow reasons such as the syntactic differences, int vs long, range() vs list(range()) etc. As a result, the number of failing tests dropped from 650 to 235: we are beginning to see the light at the end of the tunnel :-)
Benjamin finished implementing Python 3 syntax. Most of it was small cleanups and tweaks to be compatible with CPython such as making True and False keywords and preventing . . . (note spaces between dots) from being parsed as Ellipsis. Larger syntax additions included keyword only arguments and function annotations.
Finally, we did some RPython fixes, so that it is possible again to translate PyPy in the py3k branch. However, the resuling binary is a strange beast which mixes python 2 and python 3 semantics, so it is unusable for anything but showing friends how cool it is.
I would like to underline that I was not alone in doing all this work. In particular, a lot of people joined the PyPy sprint at Pycon and worked on the branch, as you can clearly see in this activity graph. I would like to thank all who helped!
Antonio and Benjamin
The next PyPy sprint will be held --- for the first time in a while --- in a place where we haven't been so far: Leipzig, Germany, at the Python Academy's Teaching Center. It will take place from the 22nd to the 27th of June 2012, before EuroPython. Thanks to Mike Müller for organizing it!
This is a fully public sprint, everyone is welcome to join us. All days are full sprint days, so it is recommended to arrive the 21st and leave the 28th.
Topics and goals
Open. Here are some goals:
- numpy: progress towards completing the numpypy module; try to use it in real code
- stm: progress on Transactional Memory; try out the transaction module on real code.
- jit optimizations: there are a number of optimizations we can still try out or refactor.
- work on various, more efficient data structures for Python language. A good example would be lazy string slicing/concatenation or more efficient objects.
- any other PyPy-related topic is fine too.
For students, we have the possibility to support some costs via PyPy funds. Additionally, we can support you applying for grants from the PSF and other sources.
If you'd like to come, please sign up either by announcing yourself on pypy-dev, or by directly adding yourself to the list of people. (We need to have a head count for the organization.) If you are new to the project please drop a note about your interests and post any questions.
For more information, please see the sprint announcement.
The Software Transactional Memory call for donations is up. From the proposal:
Previous attempts on Hardware Transactional Memory focused on parallelizing existing programs written using the
Note that, yes, this gives you both sides of the coin: you keep using your non-thread-based program (without worrying about locks and their drawbacks like deadlocks, races, and friends), and your programs benefit from all your cores.
In more details, a low-level built-in module will provide the basics to start transactions in parallel; but this module will be only used internally in a tweaked version of, say, a Twisted reactor. Using this reactor will be enough for your existing Twisted-based programs to actually run on multiple cores. You, as a developer of the Twisted-based program, have only to care about improving the parallelizability of your program (e.g. by splitting time-consuming transactions into several parts; the exact rules will be published in detail once they are known).
The point is that your program is always correct, and can be tweaked to improve performance. This is the opposite from what explicit threads and locks give you, which is a performant program which you need to tweak to remove bugs. Arguably, this approach is the reason for why you use Python in the first place :-)