Image processing is notoriously a CPU-intensive task. To do it in real time, you need to implement your algorithm in a fast language, so trying to do it in Python is foolish: Python is clearly not fast enough for this task. Or is it?
Actually, it turns out that the PyPy JIT compiler produces code which is fast enough to do real-time video processing using two simple algorithms implemented by Håkan Ardö.
sobel.py implements a classical way of locating edges in images, the Sobel operator. It approximates the magnitude of the image gradient. The processing time is spent on two convolutions between the image and 3x3 kernels.
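For reference, here is a minimal sketch of a Sobel magnitude pass over a grayscale image stored as a flat list (illustrative only, not the actual code from sobel.py):

from math import sqrt

def sobel_magnitude(img, width, height):
    # img is a flat list of grayscale values; edge pixels are left untouched
    out = [0] * (width * height)
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            p = lambda dx, dy: img[(y + dy) * width + (x + dx)]
            # horizontal 3x3 Sobel kernel: [-1 0 1; -2 0 2; -1 0 1]
            gx = (-1*p(-1,-1) + 1*p(1,-1)
                  -2*p(-1, 0) + 2*p(1, 0)
                  -1*p(-1, 1) + 1*p(1, 1))
            # vertical 3x3 Sobel kernel: [-1 -2 -1; 0 0 0; 1 2 1]
            gy = (-1*p(-1,-1) - 2*p(0,-1) - 1*p(1,-1)
                  +1*p(-1, 1) + 2*p(0, 1) + 1*p(1, 1))
            out[y * width + x] = min(255, int(sqrt(gx*gx + gy*gy)))
    return out

The inner loop is exactly the kind of pure-Python arithmetic that is very slow on CPython and that PyPy's JIT compiles down to tight machine code.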
magnify.py implements a pixel coordinate transformation that rearranges the pixels in the image to form a magnifying effect in the center. It consists of a single loop over the pixels in the output image copying pixels from the input image.
You can try it yourself by downloading the appropriate demo:
To run the demo, you need to have mplayer installed on your system. The demo has only been tested on Linux; it might (or might not) work on other systems as well:
$ pypy pypy-image-demo/sobel.py
$ pypy pypy-image-demo/magnify.py

By default, the two demos use an example AVI file. To have more fun, you can use your webcam by passing the appropriate mplayer parameters to the scripts, e.g.:
$ pypy demo/sobel.py tv://

By default magnify.py uses nearest-neighbor interpolation. By adding the option -b, bilinear interpolation is used instead, which gives a smoother result:
$ pypy demo/magnify.py -b

There is only a single implementation of the algorithm in magnify.py. The two different interpolation methods are implemented by subclassing the class used to represent images and embedding the interpolation within the pixel access method. PyPy is able to achieve good performance with this kind of abstraction because it can inline the pixel access method and specialize the implementation of the algorithm. In C++ that kind of pixel access method would be virtual, and you would need to use templates to get the same effect without incurring runtime overhead.
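As a hedged sketch of this kind of abstraction (the class and method names here are illustrative, not the actual demo code), the interpolation can live entirely inside the pixel-access method:

class Image(object):
    # nearest-neighbour access: just round the (possibly fractional) coordinates
    def __init__(self, width, height, data):
        self.width, self.height, self.data = width, height, data

    def _raw(self, ix, iy):
        return self.data[iy * self.width + ix]

    def __getitem__(self, pos):
        x, y = pos
        return self._raw(int(x + 0.5), int(y + 0.5))

class BilinearImage(Image):
    # same algorithm, smoother result: weight the four surrounding pixels
    # (edge handling omitted for brevity)
    def __getitem__(self, pos):
        x, y = pos
        x0, y0 = int(x), int(y)
        dx, dy = x - x0, y - y0
        return (self._raw(x0,     y0)     * (1 - dx) * (1 - dy) +
                self._raw(x0 + 1, y0)     * dx       * (1 - dy) +
                self._raw(x0,     y0 + 1) * (1 - dx) * dy +
                self._raw(x0 + 1, y0 + 1) * dx       * dy)

The magnifying loop itself only ever does something like out[ox, oy] = src[sx, sy]; because the JIT inlines and specializes __getitem__, choosing the smarter interpolation costs far less than a virtual call per pixel would in a conventional interpreter.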
These are the results for sobel.py:

- PyPy: ~47.23 fps
- CPython: ~0.08 fps

This means that on sobel.py, PyPy is about 590 times faster.

And for magnify.py:

- PyPy: ~26.92 fps
- CPython: ~1.78 fps

On magnify.py the difference is much less evident, and the speedup is "only" 15x.
It must be noted that this is an extreme example of what PyPy can do. In particular, you cannot expect (yet :-)) PyPy to be fast enough to run an arbitrary video processing algorithm in real time, but the demo still proves that PyPy has the potential to get there.
People that listened to my (Armin Rigo) lightning talk at EuroPython know that suddenly, we have a plan to remove the Global Interpreter Lock --- the infamous GIL, the thing in CPython that prevents multiple threads from actually running in your Python code in parallel.
That's not actually new, because Jython has been doing it all along. Jython works by very carefully adding locks to all the mutable built-in types, and by relying on the underlying Java platform to be efficient about them (so that the result is faster than, say, very carefully adding similar locks in CPython). By "very carefully", I mean really really carefully; for example, 'dict1.update(dict2)' needs to lock both dict1 and dict2, but if you do it naively, then a parallel 'dict2.update(dict1)' might cause a deadlock.
All of PyPy, CPython and IronPython have a GIL. But for PyPy we are considering a quite different approach than Jython's, based on Software Transactional Memory. This is a recent development in computer science, and it gives a nicer solution than locking. Here is a short introduction to it.
Say you want to atomically pop an item from 'list1' and append it to 'list2':
def f(list1, list2):
    x = list1.pop()
    list2.append(x)
This is not safe in multithreaded cases (even with the GIL). Say that you call f(l1, l2) in thread 1 and f(l2, l1) in thread 2. What you want is that it has no effect at all (x is moved from one list to the other, then back). But what can occur is that instead the top of the two lists are swapped, depending on timing issues.
One way to fix it is with a global lock:
def f(list1, list2):
    global_lock.acquire()
    x = list1.pop()
    list2.append(x)
    global_lock.release()
A finer-grained way to fix it is with locks that come with the lists:
def f(list1, list2):
    acquire_all_locks(list1.lock, list2.lock)
    x = list1.pop()
    list2.append(x)
    release_all_locks(list1.lock, list2.lock)
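As an aside, a helper like acquire_all_locks (an illustrative name, not a real stdlib function) must itself be written carefully to avoid the dict1.update(dict2) / dict2.update(dict1) style of deadlock mentioned above; one common sketch is to always take the locks in a globally consistent order:

def acquire_all_locks(*locks):
    # acquire in a fixed global order (here: by object id), so that two
    # threads locking the same pair of lists can never deadlock
    for lock in sorted(locks, key=id):
        lock.acquire()

def release_all_locks(*locks):
    for lock in locks:
        lock.release()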
The second solution is a model for Jython's, while the first is a model for CPython's. Indeed, in CPython's interpreter, we acquire the GIL, then we do one bytecode (or actually a number of them, like 100), then we release the GIL; and then we proceed to the next bunch of 100.
Software Transactional Memory (STM) gives a third solution:
def f(list1, list2):
    while True:
        t = transaction()
        x = list1.pop(t)
        list2.append(t, x)
        if t.commit():
            break
In this solution, we make a transaction object and use it in all reads and writes we do to the lists. There are actually several different models, but let's focus on one of them. During a transaction, we don't actually change the global memory at all. Instead, we use the thread-local transaction object. We store in it which objects we read from, which objects we write to, and what values we write. It is only when the transaction reaches its end that we attempt to "commit" it. Committing might fail if other commits have occurred in between, creating inconsistencies; in that case, the transaction aborts and must restart from the beginning.
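To make this more concrete, here is a deliberately toy model of such a transaction object (purely illustrative: a real STM detects conflicts and commits atomically with much more sophisticated machinery than a single global lock):

import threading

_commit_lock = threading.Lock()   # toy shortcut to make commits atomic

class Transaction(object):
    def __init__(self, shared):
        self.shared = shared   # shared dict: key -> (version, value)
        self.reads = {}        # key -> version observed during this transaction
        self.writes = {}       # key -> value written during this transaction

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        version, value = self.shared[key]
        self.reads[key] = version
        return value

    def write(self, key, value):
        self.writes[key] = value      # global memory is not touched yet

    def commit(self):
        with _commit_lock:
            # abort if anything we read was changed by another commit in between
            for key, version in self.reads.items():
                if self.shared[key][0] != version:
                    return False
            for key, value in self.writes.items():
                old_version = self.shared.get(key, (0, None))[0]
                self.shared[key] = (old_version + 1, value)
            return True

With this model, the "while True: ... if t.commit(): break" loop in the pseudocode above simply re-runs the whole body whenever another thread's commit invalidated one of its reads.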
In the same way as the previous two solutions are models for CPython and Jython, the STM solution looks like it could be a model for PyPy in the future. In such a PyPy, the interpreter would start a transaction, do one or several bytecodes, and then end the transaction; and repeat. This is very similar to what is going on in CPython with the GIL. In particular, it means that it gives programmers all the same guarantees as the GIL does. The only difference is that it can actually run multiple threads in parallel, as long as their code does not interfere with each other. (In particular, if you need not just the GIL but actual locks in your existing multi-threaded program, then this will not magically remove the need for them. You might get an additional built-in module that exposes STM to your Python programs, if you prefer it over locks, but that's another question.)
Why not apply that idea to CPython? Because we would need to change everything everywhere. In the example above, you may have noted that I no longer call 'list1.pop()', but 'list1.pop(t)'; this is a way to say that the implementation of all the methods needs to be changed, in order to do their work "transactionally". This means that instead of really changing the global memory in which the list is stored, it must instead record the change in the transaction object. If our interpreter is written in C, as CPython is, then we need to write it explicitly everywhere. If it is written instead in a higher-level language, as PyPy is, then we can add this behavior as a set of translation rules, and apply them automatically wherever it is necessary. Moreover, it can be a translation-time option: you can either get the current "pypy" with a GIL, or a version with STM, which would be slower due to the extra bookkeeping. (How much slower? I have no clue, but as a wild guess, maybe between 2 and 5 times slower. That is fine if you have enough cores, as long as it scales nicely :-)
A final note: as STM research is very recent (it started around 2003), there are a number of variants around, and it's not clear yet which one is better in which cases. As far as I can tell, the approach described in "A Comprehensive Strategy for Contention Management in Software Transactional Memory" seems to be one possible state-of-the-art; it also seems to be "good enough for all cases".
So, when will it be done? I cannot say yet. It is still at the idea stage, but I think that it can work. How long would it take us to write it? Again no clue, but we are looking at many months rather than many days. This is the sort of thing that I would like to be able to work on full time after the Eurostars funding runs out on September 1. We are currently looking at ways to use crowdfunding to raise money so that I can do exactly that. Expect a blog post about that very soon. But this looks like a perfect candidate for crowdfunding -- there are at least thousands of you who would be willing to pay 10s of Euros to Kill the GIL. Now we only have to make this happen.
I'm here to report back the results of our survey. First, we're very pleased to report that a number of you guys are happily running PyPy in production! Most (97%) of the respondents using PyPy are using it because it's faster, and a further 26% (respondents could choose multiple answers) are using it because of lower memory usage. Of users who aren't using PyPy, the most common reason was C extensions, followed by "Other".
From reading the extra comments section there are a few things we've learned:
- Google docs needs a better UI for this stuff
- A huge number of people want NumPy and SciPy; it was easily the most requested C extension (25% of respondents said something about NumPy). We've already blogged on the topic of our plans for NumPy.
- Having packages in the various OS's repositories would be a big help in getting users up and running.
A huge thanks to everyone who responded! Finally, if you're using PyPy in production we'd love to get a testimonial from you; if you're willing to spare a few minutes to give us a quote or two, please get in contact with us via our mailing list.
The next PyPy sprint will be in Genova-Pegli, Italy, the week after EuroPython
(which is in Florence, about 3h away by train). This is a fully public sprint:
newcomers and topics other than those proposed below are welcome.
Goals and topics of the sprint
Now that we have released 1.5, the sprint is going to focus mainly on fixing issues reported by various users. Possible topics include, but are not limited to:
- fixing issues in the bug tracker
- improve cpyext, the C-API compatibility layer, to support more extension modules
- finish/improve/merge jitypes2, the branch which makes ctypes JIT friendly
- general JIT improvements
- improve our tools, like the jitviewer or the buildbot infrastructure
- make your favorite module/application work on PyPy, if it doesn't yet
Of course this does not prevent people from showing up with a more precise interest in mind. If there are newcomers, we will gladly give introduction talks.
Since we are almost on the beach, we can take one day off for summer relaxation and/or tourist visits nearby :-).
Exact times
The work days should be 27 June - 2 July 2011. People may arrive on the 26th already and/or leave on the 3rd.
Location & Accommodation
Both the sprint venue and the lodging will be at Albergo Puppo in Genova-Pegli, Italy. Pegli is a nice and peaceful little quarter of Genova, and the hotel is directly on the beach, making it a perfect place for those who want to enjoy the sea in the middle of the Italian summer, as a quick search on Google Images shows :-)
The place has a good ADSL Internet connection with wireless installed. You can of course arrange your own lodging anywhere, but I definitely recommend lodging there too.
Please confirm that you are coming so that we can adjust the reservations as appropriate. The prices are as follows, and they include breakfast and a parking place for the car, in case you need it:
- single room: 70 €
- double room: 95 €
- triple room: 105 €

Please register by hg:

https://foss.heptapod.net/pypy/extradoc/-/blob/branch/default/extradoc/sprintinfo/genova-pegli-2011/people.txt

or on the pypy-dev mailing list if you do not yet have check-in rights.
We've been working on PyPy for a long time. But readers of this blog will know that in the past year something has changed: we think PyPy is production ready. And it's not just us: this week LWN.net wrote an article about how PyPy sped up one of their scripts by a factor of three, noting that "plans are to run gitdm under PyPy from here on out". All in all we think PyPy is pretty great, but not everyone is using it yet, and we want to know why. We want your feedback on why PyPy isn't ready to be your only Python yet, and how we can improve it to make that happen.
Therefore, we've put together a quick survey. Whether you're using PyPy or not, if you could take a few minutes to fill it out and let us know how we're doing, we'd really appreciate it. You can find the form here.
Thanks, The PyPy team
We are in the process of migrating the hosting machine for PyPy, moving away from codespeak.net and towards a mixture of custom servers (e.g. for buildbot.pypy.org) and wide-scale services (e.g. for the docs, now at readthedocs.org).
When this is done, a proper announcement will be posted here. In the meantime, we have already moved the mailing lists, which are now hosted on python.org. The subscribers' lists have been copied, so if you didn't notice anything special for the past week, then everything works fine :-) This concerns pypy-dev, pypy-issue and pypy-commit. Two notes:
- Some settings have not been copied, notably if you used to disable mail delivery. Sorry about that; you have to re-enter such settings.
- Following the move, about 50 addresses have been dropped for being invalid. I'm unsure why they were not dropped earlier, but in case sending mail to you from python.org instead of codespeak.net fails, then you have been dropped from the mailing lists, and you need to subscribe again.
Fancy high-level interfaces often come with a high runtime overhead that makes them slow. Here is an experiment with building such an interface using constructions that PyPy should be good at optimizing. The idea is to allow the JIT in PyPy to remove the overhead introduced by using a fancy high-level Python interface on top of a low-level C interface. The application considered is linear programming, a tool used to solve linear optimization problems. It can for example be used to find the nonnegative values x, y and z that give the maximum value of 10x + 6y + 4z subject to the linear constraints shown in the code below:
lp = LinearProgram()
x, y, z = lp.IntVar(), lp.IntVar(), lp.IntVar()
lp.objective = 10*x + 6*y + 4*z
lp.add_constraint( x + y + z <= 100 )
lp.add_constraint( 10*x + 4*y + 5*z <= 600 )
lp.add_constraint( 2*x + 2*y + 6*z <= 300 )
lp.add_constraint( x >= 0 )
lp.add_constraint( y >= 0 )
lp.add_constraint( z >= 0 )
maxval = lp.maximize()
print maxval
print x.value, y.value, z.value

To benchmark the API I used it to solve a minimum-cost flow problem with 154072 nodes and 390334 arcs. The C library needs 9.43 s to solve this; the pplp interface adds another 5.89 s under PyPy and 28.17 s under CPython. A large amount of time is still spent setting up the problem, but it's a significant improvement over the 20 minutes required on CPython by cvxopt, which is probably not designed to be fast on this kind of benchmark. I have not been able to get cvxopt to work under PyPy. The benchmark used is available here.
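To give an idea of how such a high-level interface can be built in pure Python, here is a rough, made-up sketch of the expression-building part (this is not the actual pplp code; all names and structure are assumptions):

class Expr(object):
    # a linear expression kept as {variable: coefficient} plus a constant
    def __init__(self, coeffs=None, const=0):
        self.coeffs = coeffs or {}
        self.const = const

    def __add__(self, other):
        coeffs = dict(self.coeffs)
        for var, c in other.coeffs.items():
            coeffs[var] = coeffs.get(var, 0) + c
        return Expr(coeffs, self.const + other.const)

    def __rmul__(self, scalar):
        return Expr(dict((v, scalar * c) for v, c in self.coeffs.items()),
                    scalar * self.const)

    def __le__(self, bound):
        return ('<=', self, bound)   # a record handed to add_constraint()

    def __ge__(self, bound):
        return ('>=', self, bound)

class IntVar(Expr):
    def __init__(self):
        Expr.__init__(self, {self: 1})
        self.value = None

An expression like 10*x + 6*y + 4*z then builds a small graph of Expr objects through ordinary operator overloading. Under CPython every one of those tiny method calls pays full interpreter overhead; under PyPy the JIT can inline and largely remove them before the data is handed over to the C solver, which is the effect the timings above illustrate.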
Hi everyone. Since yesterday's blog post we got a ton of feedback, so we want to clarify a few things, as well as share some of the progress we've made, in only the 24 hours since the post.
Reusing the original NumPy
First, a lot of people have asked why we cannot just reuse the original NumPy through cpyext, our CPython C-API compatibility layer. We believe this is not the best approach, for a few reasons:
- cpyext is slow, and always will be slow. It has to emulate far too many details of the CPython object model that don't exist on PyPy (e.g., reference counting). Since people are using NumPy primarily for speed this would mean that even if we could have a working NumPy, no one would want to use it. Also, as soon as the execution crosses the cpyext boundary, it becomes invisible to the JIT, which means the JIT has to assume the worst and deoptimize stuff away.
- NumPy uses many obscure documented and undocumented details of the CPython C-API. Emulating these is often difficult or impossible (e.g. we can't fix accessing a struct field, as there's no function call for us to intercept).
- It's not much fun. Frankly, working on cpyext, debugging the crashes, and everything else that goes with it is not terribly fun, especially when you know that the end result will be slow. We've demonstrated we can build a much faster NumPy, in a way that's more fun, and given that the people working on this are volunteers, it's important to keep us motivated.
Finally, we are not proposing to rewrite the entirety of NumPy or, god forbid, BLAS, or any of the low level stuff that operates on C-level arrays, only the parts that interface with Python code directly.
C bindings vs. CPython C-API
There are two issues with C code; one has a very nice story, and the other not so much. First is the case of arbitrary C code that isn't Python related, things like libsqlite, libbz2, or any random C shared library on your system. PyPy will quite happily call into these, and bindings can be developed either at the RPython level (using rffi) or in pure Python, using ctypes. Writing bindings with ctypes has the advantage that they can run on every alternative Python implementation, such as Jython and IronPython. Moreover, once we merge the jitypes2 branch, ctypes calls will even be smoking fast.
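For instance, a minimal ctypes binding to a random C shared library might look like this (using libbz2's version function purely as an illustration; the library lookup is platform dependent and will fail if libbz2 is not installed):

import ctypes
import ctypes.util

# locate and load the shared library
libbz2 = ctypes.CDLL(ctypes.util.find_library("bz2"))

# const char *BZ2_bzlibVersion(void)
BZ2_bzlibVersion = libbz2.BZ2_bzlibVersion
BZ2_bzlibVersion.restype = ctypes.c_char_p
BZ2_bzlibVersion.argtypes = []

print BZ2_bzlibVersion()

This runs unchanged on CPython and PyPy (and, modulo ctypes support, on other implementations), which is exactly the portability advantage mentioned above.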
On the other hand there is the CPython C-extension API. This is a very specific API which CPython exposes, and PyPy tries to emulate. It will never be fast, because there is far too much overhead in all the emulation that needs to be done.
One of the reasons people write C extensions is for speed. Often, with PyPy you can just forget about C, write everything in pure Python and let the JIT do its magic.
In case the PyPy JIT alone isn't fast enough, or you just want to use existing C code, it might make sense to split your C extension into two parts: one which doesn't touch the CPython C-API and thus can be loaded with ctypes and called from PyPy, and another which does the interfacing with Python for CPython (where it will be faster).
There are also libraries written in C to interface with existing C codebases, but for which performance is not the largest goal. For these, the right solution is to try using cpyext; if it works, that's great, but if it fails the solution will be to rewrite the binding using ctypes, where it will work on all Python VMs, not just CPython.
And finally there are rare cases where rewriting in RPython makes more sense; NumPy is one of the few examples of these, because we need to be able to give the JIT hints on how to appropriately vectorize all of the operations on an array. In general, writing in RPython is not necessary for almost any library. NumPy is something of a special case because it is so ubiquitous that every ounce of speed is valuable, and the way people use it leads to code structure where the JIT benefits enormously from extra hints and from the ability to manipulate memory directly, which is not possible from Python.
On a more positive note, after we published the last post, several new people came and contributed improvements to the numpy-exp branch. We would like to thank all of them:
- nightless_night contributed: An implementation of __len__, fixed bounds checks on __getitem__ and __setitem__.
- brentp contributed: Subtraction and division on NumPy arrays.
- MostAwesomeDude contributed: Multiplication on NumPy arrays.
- hodgestar contributed: Binary operations between floats and NumPy arrays.
Those last two were technically an outstanding branch we finally merged, but hopefully you get the picture. In addition there was some exciting work done by regular PyPy contributors. I hope it's clear that there's a place to jump in for people with any level of PyPy familiarity. If you're interested in contributing please stop by #pypy on irc.freenode.net, the pypy-dev mailing list, or send us pull requests on bitbucket.
NumPy integration is one of the single most requested features for PyPy. This post tries to describe where we are, what we plan (or what we don't plan), and how you can help.
Short version for the impatient: we are doing experiments, which show that PyPy+numpy can be faster and better than CPython+numpy. We have a plan on how to move forward, but at the moment there is a lack of dedicated people or money to tackle it.
The slightly longer version
Integrating numpy in PyPy has been my pet project on an on-and-off (mostly off) basis over the past two years. There were some experiments, then a long pause, and then some more experiments which are documented below.
The general idea is not to use the existing CPython module, but to reimplement numpy in RPython (i.e. the language PyPy is implemented in), thus letting our JIT achieve extra speedups. The really cool thing about this part is that numpy will automatically benefit from any general JIT improvements, without any need for extra tweaking.
| | add | iterate |
|CPython 2.6.5 with numpy 1.3.0 | 0.260s (1x) | 4.2s (1x) |
|PyPy numpy-exp @ 3a9d77b789e1 | 0.120s (2.2x) | 0.087s (48x) |
The add benchmark spends most of the time inside the + operator on arrays (doing a + a + a + a + a), which in CPython is implemented in C. As you can see from the table above, the PyPy version is already ~2 times faster. (numexpr is still faster than PyPy, but we're working on it.)
The exact way array addition is implemented is worth another blog post, but in short it lazily evaluates the expression and computes it at the end, avoiding intermediate results. This approach scales much better than numexpr and can lead to speeding up all the operations that you can perform on matrices.
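As a hedged toy model of what "lazy evaluation" means here (this is not the actual numpy-exp implementation, just an illustration of the idea):

class Array(object):
    def __init__(self, data):
        self.data = data
    def __add__(self, other):
        return Add(self, other)      # build an expression node, compute nothing yet
    def evaluate(self, i):
        return self.data[i]

class Add(Array):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def evaluate(self, i):
        return self.left.evaluate(i) + self.right.evaluate(i)

def force(expr, size):
    # a single loop over the result: a + a + a + a + a allocates no intermediate arrays
    return Array([expr.evaluate(i) for i in range(size)])

The JIT can inline the whole chain of evaluate calls into one machine-code loop over the output, which is roughly why this approach can scale better than creating an intermediate array per operation.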
The next obvious step to get even more speedups would be to extend the JIT to use SSE operations on x86 CPUs, which should speed it up by roughly an additional 2x, as well as using multiple threads to do the operations.
iterate is also interesting, but for entirely different reasons. On CPython it spends most of the time inside a Python loop; the PyPy version is ~48 times faster, because the JIT can optimize across the python/numpy boundary. This shows the potential of the approach: users are not grossly penalized for writing their loops in Python.
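For context, the two microbenchmarks are roughly of the following shape (a guess at their structure based on the description above, not the actual benchmark source):

def bench_add(a):
    # dominated by the array + operator
    return a + a + a + a + a

def bench_iterate(a):
    # dominated by a plain Python loop crossing the python/numpy boundary
    total = 0.0
    for i in range(len(a)):
        total += a[i]
    return total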
The drawback of this approach is that we need to reimplement numpy in RPython, which takes time. A very rough estimate is that it would be possible to implement a useful subset of it (for some definition of useful) within one to three man-months.
It also seems that the result will be faster for most cases, and the same speed as the original numpy for the others. The only problem is finding dedicated people willing to spend quite some time on this. However, I am willing to both mentor such a person and encourage him or her.
A good starting point for helping would be to look at what's already implemented in the micronumpy module and try extending it. Adding a - operator or support for integers would be an interesting start. Drop by on #pypy on irc.freenode.net or get in contact with developers via some other channel (such as the pypy-dev mailing list) if you want to help.
Another option would be to sponsor NumPy development. In case you're interested, please get in touch with us or leave your email in comments.
We're pleased to announce the 1.5 release of PyPy. This release updates PyPy with the features of CPython 2.7.1, including the standard library. Thus all the features of CPython 2.6 and CPython 2.7 are now supported. It also contains additional performance improvements. You can download it here:
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7.1. It's fast (pypy 1.5 and cpython 2.6.2 performance comparison) due to its integrated tracing JIT compiler.
This release includes the features of CPython 2.6 and 2.7. It also includes a large number of small improvements to the tracing JIT compiler. It supports Intel machines running Linux 32/64 or Mac OS X. Windows is beta (it roughly works but a lot of small issues have not been fixed so far). Windows 64 is not yet supported.
Numerous speed achievements are described on our blog. Normalized speed charts comparing pypy 1.5 and pypy 1.4 as well as pypy 1.5 and cpython 2.6.2 are available on our benchmark website. The speed improvement over 1.4 seems to be around 25% on average.
- The largest change in PyPy's tracing JIT is adding support for loop invariant code motion, which was mostly done by Håkan Ardö. This feature improves the performance of tight loops doing numerical calculations.
- The CPython extension module API has been improved and now supports many more extensions. For information on which ones are supported, please refer to our compatibility wiki.
- These changes make it possible to support Tkinter and IDLE.
- The cProfile profiler is now working with the JIT. However, it skews the performance in unstudied ways. Therefore it is not yet usable to analyze subtle performance problems (the same is true for CPython of course).
- There is an external fork which includes an RPython version of the postgresql bindings. However, there are no prebuilt binaries for this.
- Our developer documentation was moved to Sphinx and cleaned up.
- and many small things :-)
Carl Friedrich Bolz, Laura Creighton, Antonio Cuni, Maciej Fijalkowski, Amaury Forgeot d'Arc, Alex Gaynor, Armin Rigo and the PyPy team