Software Transactional Memory lisp experiments
As covered in the previous blog post, the STM subproject of PyPy has been back on the drawing board. The result of this experiment is an STM-aware garbage collector written in C. This is finished by now, thanks to Armin's and Remi's work, we have a fully functional garbage collector and a STM system that can be used from any C program with enough effort. Using it is more than a little mundane, since you have to inserts write and read barriers by hand everywhere in your code that reads or writes to garbage collector controlled memory. In the PyPy integration, this manual work is done automatically by the STM transformation in the interpreter.
However, to experiment some more, we created a minimal lisp-like/scheme-like interpreter (called Duhton), that follows closely CPython's implementation strategy. For anyone familiar with CPython's source code, it should be pretty readable. This interpreter works like a normal and very basic lisp variant, however it comes with a transaction builtin, that lets you spawn transactions using the STM system. We implemented a few demos that let you play with the transaction system. All the demos are running without conflicts, which means there are no conflicting writes to global memory and hence the demos are very amenable to parallelization. They exercise:
- arithmetics - demo/many_sqare_roots.duh
- read-only access to globals - demo/trees.duh
- read-write access to local objects - demo/trees2.duh
With the latter ones being very similar to the classic gcbench. STM-aware Duhton can be found in the stmgc repo, while the STM-less Duhton, that uses refcounting, can be found in the duhton repo under the base branch.
Below are some benchmarks. Note that this is a little comparing apples to oranges since the single-threaded duhton uses refcounting GC vs generational GC for STM version. Future pypy benchmarks will compare more apples to apples. Moreover none of the benchmarks has any conflicts. Time is the total time that the benchmark took (not the CPU time) and there was very little variation in the consecutive runs (definitely below 5%).
|benchmark||1 thread (refcount)||1 thread (stm)||2 threads||4 threads|
As you can see, the slowdown for STM vs single thread is significant (1.8x, 1.7x, 1.6x respectively), but still lower than 2x. However the speedup from running on multiple threads parallelizes the problem almost perfectly.
While a significant milestone, we hope the next blog post will cover STM-enabled pypy that's fully working with JIT work ongoing.
fijal on behalf of Remi Meier and Armin Rigo
I hacked a bit; inserted likely hint on early exit on spinlock acquisition, Haswell xacquire/xrelease hints on spinlock acquisition and release, and compiled with Haswell optimized flags.
Resulting scaling from 1 to 4 threads for tests were 1.92, 1.87 and 1.88. I think that's already quite close to 2.
I think this is OK, but not extraordinary.
Just to clarify my above comment: those were average factors of scaling per doubling of threads. So, 4-thread version ran actually 3.67, 3.50 and 3.54 times faster than single-threaded version.
Cool that you hacked on it! Note however that spinlock acquisition is not a blocker in these examples --- we implement STM mostly without locks, and locks are acquired rarely. Running independent code without getting STM conflicts means that each thread will in practice only acquire its own lock. And a single global lock is used for major GC --- but there, the large amount of work done means that using the Haswell xacquire/xrelease hints is just counterproductive.
"Resulting scaling from 1 to 4 threads" doesn't mean anything, as in some examples it scales perfectly, and in other examples it doesn't scale at all (as expected).
All your arguments are valid, and I didn't really expect much from hinting, just decided to try. It would seem that Haswell is still inching towards higher multicore scalability - probably thanks to improved atomic and fence ops in general. It's a benefit for those workloads that should conceptually scale well...
You really need to go above 4 threads: 8,16,32, and 64 at least. Then plot out the overhead of the STM related to this level of threading. If your benchmark is too small, alter it so that it makes sense to try and solve it with 64 threads.
@glen: we're focusing right now on the machines we have, which are standard Intels with 4, 8, or at most 12 cores. I believe it is interesting too, and it's what people have right now in their own desktop or laptop computers. Obviously the scalability to larger numbers of cores is important as well, but we can't simply disregard any result involving less than 64 cores.
This is a really great news.
Wish you all the best with further work!