Repeating a Matrix Multiplication Benchmark
I watched the Hennessy & Patterson's Turing award lecture recently:
In it, there's a slide comparing the performance of various matrix multiplication implementations, using Python (presumably CPython) as a baseline and comparing that against various C implementations (I couldn't find the linked paper yet):
I expected the baseline speedup of switching from CPython to C to be higher and I also wanted to know what performance PyPy gets, so I did my own benchmarks. This is a problem that Python is completely unsuited for, so it should give very exaggerated results.
The usual disclaimers apply: All benchmarks are lies, benchmarking of synthetic workloads even more so. My implementation is really naive (though I did optimize it a little bit to help CPython), don't use any of this code for anything real. The benchmarks ran on my rather old Intel i5-3230M laptop under Ubuntu 17.10.
With that said, my results were as follows:
|Implementation||time||speedup over CPython||speedup over PyPy|
|CPython||512.588 ± 2.362 s||1 ×|
|PyPy||8.167 ± 0.007 s||62.761 ± 0.295 ×||1 ×|
|'naive' C||2.164 ± 0.025 s||236.817 ± 2.918 ×||3.773 ± 0.044 ×|
|NumPy||0.171 ± 0.002 s||2992.286 ± 42.308 ×||47.678 ± 0.634 ×|
This is running 1500x1500 matrix multiplications with (the same) random matrices. Every implementation is run 50 times in a fresh process. The results are averaged, the errors are bootstrapped 99% confidence intervals.
So indeed the speedup that I got of switching from CPython to C is quite a bit higher than 47x! PyPy is much better than CPython, but of course can't really compete against GCC. And then the real professionals (numpy/OpenBLAS) are in a whole 'nother league. The speedup of the AVX numbers in the slide above is even higher than my NumPy numbers, which I assume is the result of my old CPU with two cores, vs. the 18 core CPU with AVX support. Lesson confirmed: leave matrix multiplication to people who actually know what they are doing.