
CFFI release 0.3

Hi everybody,

We released CFFI 0.3. This is the first release that supports more than CPython 2.x :-)

  • CPython 2.6, 2.7, and 3.x are supported (3.3 definitely, but maybe 3.2 or earlier too)
  • PyPy trunk is supported.

In more detail, the main news is:

  • support for PyPy. You need to get a trunk version of PyPy, which comes with the built-in module _cffi_backend to use with the CFFI release. For testing, you can download the Linux 32/64 versions of PyPy trunk. The OS/X and Windows versions of _cffi_backend are not tested at all so far, so they probably don't work yet.
  • support for Python 3. It is unknown which exact version is required; probably 3.2 or even earlier, but we need 3.3 to run the tests. The 3.x version is not a separate source; it runs out of the same sources. Thanks Amaury for starting this port.
  • the main change in the API is that you need to use ffi.string(cdata) instead of str(cdata) or unicode(cdata). The motivation for this change was Python 3 compatibility. If your Python 2 code used to contain str(<cdata 'char *'>), it would interpret the memory content as a null-terminated string; but on Python 3 it would just return a different string, namely "<cdata 'char *'>", and proceed without even a crash, which is bad. So ffi.string() solves this by always returning the memory content as an 8-bit string (which is a str in Python 2 and a bytes in Python 3). See the short example after this list.
  • other minor API changes are documented at https://cffi.readthedocs.org/ (grep for version 0.3).
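
To make the difference concrete, here is a minimal sketch of the new API. It is only an illustration: it assumes a Unix-like system, where ffi.dlopen(None) loads the standard C library.

    from cffi import FFI

    ffi = FFI()
    ffi.cdef("size_t strlen(const char *);")
    C = ffi.dlopen(None)                # the standard C library (Unix-like)

    buf = ffi.new("char[]", b"hello")   # a <cdata 'char[6]'>, NUL-terminated
    print(C.strlen(buf))                # 5
    print(ffi.string(buf))              # b'hello': the memory content, on
                                        # both Python 2 and Python 3
    # str(buf) on Python 3 would instead return "<cdata 'char[6]' ...>"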

Upcoming work, to be done before release 1.0:

  • expose to the user the module cffi.model in a possibly refactored way, for people that don't like (or for some reason can't easily use) strings containing snippets of C declarations. We are thinking about refactoring it in such a way that it has a ctypes-compatible interface, to ease porting existing code from ctypes to cffi. Note that this would concern only the C type and function declarations, not all the rest of ctypes.
  • CFFI 1.0 will also have a corresponding PyPy release. We are thinking about calling it PyPy 2.0 and including the whole of CFFI (instead of just the _cffi_backend module like now). In other words it will support CFFI out of the box --- we want to push forward usage of CFFI in PyPy :-)

Cheers,

Armin Rigo and Maciej Fijałkowski

C++ objects in cppyy, part 1: Data Members

The cppyy module makes it possible to call into C++ from PyPy through the Reflex package. Documentation and setup instructions are available here. Recent work has focused on STL, low-level buffers, and code quality, but also a lot on pythonizations for the CINT backend, which is mostly for High Energy Physics (HEP) use only. A previous posting walked through the high-level structure and organization of the module, where it was argued why it is necessary to write cppyy in RPython and generate bindings at run-time for the best performance. This posting details how access to C++ data structures is provided and is part of a series of 3 postings on C++ object representation in Python: the second posting will be about method dispatching, the third will tie up several odds and ends by showing how the choices presented here and in part 2 work together to make features such as auto-casting possible.

Wrapping Choices

Say we have a plain old data (POD) type, which is the simplest possible data structure in C++. For example:

    struct A {
        int    m_i;
        double m_d;
    };

What should such a POD look like when represented in Python? Let's start by looking at a Python data structure that is functionally similar, in that it also carries two public data members of the desired types. Something like this:

    class A(object):
        def __init__(self):
            self.m_i = 0
            self.m_d = 0.

Alright, now how to go about connecting this Python class with the former C++ POD? Or rather, how to connect instances of either. The exact memory layout of a Python A instance is up to Python, and likewise the layout of a C++ A instance is up to C++. Both layouts are implementation details of the underlying language, language implementation, language version, and the platform used. It should be no surprise then, that for example an int in C++ looks nothing like a PyIntObject, even though it is perfectly possible, in both cases, to point out in memory where the integer value is. The two representations can thus not make use of the same block of memory internally. However, the requirement is that the access to C++ from Python looks and feels natural in its use, not that the mapping is exact. Another requirement is that we want access to the actual object from both Python and C++. In practice, it is easier to provide natural access to C++ from Python than the other way around, because the choices of memory layout in C++ are far more restrictive: the memory layout defines the access, as the actual class definition is gone at run-time. The best choice then, is that the Python object will act as a proxy to the C++ object, with the actual data always being in C++.

From here it follows that if the m_i data member lives in C++, then Python needs some kind of helper to access it. Conveniently, since version 2.2, Python has a property construct that can take a getter and setter function that are called when the property is used in Python code, and present it to the programmer as if it were a data member. So we arrive at this (note how the property instance is a variable at the class level):

    class A(object):
        def __init__(self):
            self._cppthis = construct_new_A()
        m_i = property(get_m_i, set_m_i)
        m_d = property(get_m_d, set_m_d)

The construct_new_A helper is not very interesting (the reflection layer can provide for it directly), and methods are a subject for part 2 of this posting, so focus on get_m_i and set_m_i. In order for the getter to work, the method needs to have access to the C++ instance for which the Python object is a proxy. On access, Python will call the getter function with the proxy instance for which it is called. The proxy has a _cppthis data member from which the C++ instance can be accessed (think of it as a pointer) and all is good, at least for m_i. The second data member m_d, however, requires some more work: it is located at some offset into _cppthis. This offset can be obtained from the reflection information, which lets the C++ compiler calculate it, so details such as byte padding are fully accounted for. Since the setter also needs the offset, and since both share some more details such as the containing class and type information of the data member, it is natural to create a custom property class. The getter and setter methods then become bound methods of an instance of that custom property, CPPDataMember, and there is one such instance per data member. Think of something along these lines:

    def make_datamember(cppclass, name):
        cppdm = cppyy.CPPDataMember(cppclass, name)
        return property(cppdm.get, cppdm.set)

Here the make_datamember function replaces the call to property in the class definition above.
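
For completeness, a rough sketch of what such a custom property could look like. The helper names reflection_offset and reflection_converter are hypothetical; the actual implementation in cppyy is written in RPython and differs in detail.

    class CPPDataMember(object):
        def __init__(self, cppclass, name):
            # both pieces of information come from the C++ reflection layer
            # (hypothetical helper names)
            self.offset = reflection_offset(cppclass, name)
            self.converter = reflection_converter(cppclass, name)

        def get(self, proxy):
            # proxy._cppthis "points to" the C++ instance; the compiler-
            # provided offset locates this particular data member
            return self.converter.from_memory(proxy._cppthis + self.offset)

        def set(self, proxy, value):
            self.converter.to_memory(proxy._cppthis + self.offset, value)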

Now hold on a minute! Before it was argued that Python and C++ can not share the same underlying memory structure, because of choices internal to the language. But if on the Python side choices are being made by the developer of the language bindings, that is no longer a limitation. In other words, why not go through e.g. the Python extension API, and do this:

    struct A_pyproxy {
        PyObject_HEAD
        int    m_i;
        double m_d;
    };

Doing so would save on malloc overhead and remove a pointer indirection. There are some technical issues specific to PyPy for such a choice: there is no such thing as PyPyObject_HEAD and the layout of objects is not a given as that is decided only at translation time. But assume that those issues can be solved, and also accept that there is no problem in creating structure definitions like this at run-time, since the reflection layer can provide both the required size and access to the placement new operator (compare e.g. CPython's struct module). There is then still a more fundamental problem: it must be possible to take over ownership in Python from instances created in C++ and vice-versa. With a proxy scheme, that is trivial: just pass the pointer and do the necessary bookkeeping. With an embedded object, however, not every use case can be implemented: e.g. if an object is created in Python, passed to C++, and deleted in C++, it must have been allocated independently. The proxy approach is therefore still the best choice, although embedding objects may provide for optimizations in some use cases.

Inheritance

The next step, is to take a more complicated C++ class, one with inheritance (I'm leaving out details such as constructors etc., for brevity):

    class A {
    public:
        virtual ~A() {}
        int    m_i;
        double m_d;
    };

    class B : public A {
    public:
        virtual ~B() {}
        int    m_j;
    };

From the previous discussion, it should already be clear what this will look like in Python:

    class A(object):
        def __init__(self):
            self._cppthis = construct_new_A()
        m_i = make_datamember('A', 'm_i')
        m_d = make_datamember('A', 'm_d')

    class B(A):
        def __init__(self):
            self._cppthis = construct_new_B()
        m_j = make_datamember('B', 'm_j')

There are some minor adjustments needed, however. For one, the offset of the m_i data member may no longer be zero: it is possible that a virtual function dispatch table (vtable) pointer is added at the beginning of A (an alternative is to have the vtable pointer at the end of the object). But if m_i is handled the same way as m_d, with the offset provided by the compiler, then the compiler will add the bits, if any, for the vtable pointer and all is still fine. A real problem could come in, however, with a call of the m_i property on an instance of B: in that case, the _cppthis points to a B instance, whereas the getter/setter pair expects an A instance. In practice, this is usually not a problem: compilers will align A and B and calculate an offset for m_j from the start of A. Still, that is an implementation detail (even though it is one that can be determined at run-time and thus taken advantage of by the JIT), so it can not be relied upon. The m_i getter thus needs to take into account that it can be called with a derived type, and so it needs to add an additional offset. With that modification, the code looks something like this (as you would have guessed, this is getting more and more into pseudo-code territory, although it is conceptually close to the actual implementation in cppyy):

    def get_m_i(self):
        return int(self._cppthis + offset(A, m_i) + offset(self.__class__, A))

Which is a shame, really, because the offset between B and A is going to be zero most of the time in practice, and the JIT can not completely elide the offset calculation (as we will see later; it is easy enough to elide if self.__class__ is A, though). One possible solution is to repeat the properties for each derived class, i.e. to have a get_B_m_i etc., but that looks ugly on the Python side and anyway does not work in all cases: e.g. with multiple inheritance where there are data members with the same name in both bases, or if B itself has a public data member called m_i that shadows the one from A. The optimization then, is achieved by making B in charge of the offset calculations, by making offset a method of B, like so:

    def get_m_i(self):
        return int(self._cppthis + offset(A, m_i) + self.offset(A))

The insight is that by scanning the inheritance hierarchy of a derived class like B, you can know statically whether it may sometimes need offsets, or whether the offsets are always going to be zero. Hence, if the offsets are always zero, the method offset on B will simply return the literal 0 as its implementation, with the JIT taking care of the rest through inlining and constant folding. If the offset could be non-zero, then the method will perform an actual calculation, and it will let the JIT elide the call only if possible.
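
In sketch form, where cpp_base_offset and SomeVirtualBase are hypothetical names:

    class B(A):                    # base offsets statically known to be zero
        def offset(self, base):
            return 0               # a literal: inlined and folded by the JIT

    class E(SomeVirtualBase):      # offsets may be non-zero in this hierarchy
        def offset(self, base):
            # an actual calculation; the JIT elides this call only if possible
            return cpp_base_offset(self.__class__, base, self._cppthis)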

Multiple Virtual Inheritance

Next up would be multiple inheritance, but that is not very interesting: we already have the offset calculation between the actual and base class, which is all that is needed to resolve any multiple inheritance hierarchy. So, skip that and move on to multiple virtual inheritance. That this is going to be a tad more complicated will be clear if you show the following code snippet to any old C++ hand and see how they respond. Most likely you will be told: "Don't ever do that." But if code can be written, it will be written, and so, for the sake of argument, consider what the following would look like in Python:

    class A {
    public:
        virtual ~A() {}
        int m_a;
    };

    class B : public virtual A {
    public:
        virtual ~B() {}
        int m_b;
    };

    class C : public virtual A {
    public:
        virtual ~C() {}
        int m_c;
    };

    class D : public virtual B, public virtual C {
    public:
        virtual ~D() {}
        int m_d;
    };

Actually, nothing changes from what we have seen so far: the scheme as laid out above is fully sufficient. For example, D would simply look like:

    class D(B, C):
        def __init__(self):
            self._cppthis = construct_new_D()
        m_d = make_datamember('D', 'm_d')

Point being, the only complication added by the multiple virtual inheritance, is that navigation of the C++ instance happens with pointers internal to the instance rather than with offsets. However, it is still a fixed offset from any location to any other location within the instance as its parts are laid out consecutively in memory (this is not a requirement, but it is the most efficient, so it is what is used in practice). But what you can not do, is determine the offset statically: you need a live (i.e. constructed) object for any offset calculations. In Python, everything is always done dynamically, so that is of itself not a limitation. Furthermore, self is already passed to the offset calculation (remember that this was done to put the calculation in the derived class, to optimize the common case of zero offset), thus a live C++ instance is there precisely when it is needed. The call to the offset calculation is hard to elide, since the instance will be passed to a C++ helper and so the most the JIT can do is guard on the instance's memory address, which is likely to change between traces. Instead, explicit caching is needed on the base and derived types, allowing the JIT to elide the lookup in the explicit cache.
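
A sketch of such an explicit cache, with cpp_compute_offset as a hypothetical name for the C++ helper:

    _offset_cache = {}     # (derived, base) -> offset, filled in lazily

    def offset_of(derived, base, cppthis):
        key = (derived, base)
        if key not in _offset_cache:
            # a live object is needed the first time; the result can be
            # reused, since the parts of an instance are laid out
            # consecutively in memory
            _offset_cache[key] = cpp_compute_offset(derived, base, cppthis)
        return _offset_cache[key]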

Static Data Members and Global Variables

That, so far, covers all access to instance data members. Next up are static data members and global variables. A complication here is that a Python property needs to live on the class in order to work its magic. Otherwise, if you get the property, it will simply return the getter function, and if you set it, it will disappear. The logical conclusion then, is that a property representing a static or global variable needs to live on the class of the class: the metaclass. If done directly though, that would mean that every static data member is available from every class, since all Python classes have the same metaclass, which is the class type (and which is its own metaclass). To prevent that from happening, and because type is actually immutable, each proxy class needs to have its own custom metaclass. Furthermore, since static data can also be accessed on the instance, the class, too, gets a property object for each static data member. Expressed in code, for a basic C++ class, this looks as follows:

    class A {
    public:
        static int s_i;
    };

Paired with some Python code such as this, needed to expose the static variable both on the class and the instance level:

    meta_A = CppClassMeta('meta_A', (CPPMetaBase,), {})
    meta_A.s_i = make_datamember('A', 's_i')

    class A(object):
        __metaclass__ = meta_A
        s_i = make_datamember('A', 's_i')

Inheritance adds no complications for the access of static data per se, but there is the issue that the metaclasses must follow the same hierarchy as the proxy classes, for the Python method resolution order (MRO) to work. In other words, there are two complete, parallel class hierarchies that map one-to-one: a hierarchy for the proxy classes and one for their metaclasses.
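
Continuing the pseudo-code from above: for a derived class B with a hypothetical static member s_j, the inheritance is mirrored on the metaclass side, like so:

    meta_B = CppClassMeta('meta_B', (meta_A,), {})    # parallels class B(A)
    meta_B.s_j = make_datamember('B', 's_j')

    class B(A):
        __metaclass__ = meta_B
        s_j = make_datamember('B', 's_j')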

A parallel class hierarchy is used also in other highly dynamic, object-oriented environments, such as for example Smalltalk. In Smalltalk as well, class-level constructs, such as class methods and data members, are defined for the class in the metaclass. A metaclass hierarchy has further uses, such as lazy loading of nested classes and member templates (this would be coded up in the base class of all metaclasses: CPPMetaBase), and makes it possible to distribute these over different reflection libraries. With this in place, you can write Python code like so:

    >>>> from cppyy.gbl import A
    >>>> a = A()
    >>>> a.s_i = 42
    >>>> print A.s_i == a.s_i
    True
    >>>> # etc.

The implementation of the getter for s_i is a lot easier than for instance data: the static data lives at a fixed, global, address, so no offset calculations are needed. The same is done for global data or global data living in namespaces: namespaces are represented as Python classes, and global data are implemented as properties on them. The need for a metaclass is one of the reasons why it is easier for namespaces to be classes: module objects are too restrictive. And even though namespaces are not modules, you still can, with some limitations, import from them anyway.

It is common that global objects themselves are pointers, and therefore it is allowed that the stored _cppthis is not a pointer to a C++ object, but rather a pointer to a pointer to a C++ object. A double pointer, as it were. This way, if the C++ code updates the global pointer, it will automatically reflect on the Python side in the proxy. Likewise, if on the Python side the pointer gets set to a different variable, it is the pointer that gets updated, and this will be visible on the C++ side. In general, however, the same caveat as for normal Python code applies: in order to set a global object, it needs to be set within the scope of that global object. As an example, consider the following code for a C++ namespace NS with global variable g_a, which behaves the same as Python code for what concerns the visibility of changes to the global variable:

    >>>> from cppyy.gbl import NS, A
    >>>> from NS import g_a
    >>>> g_a = A(42)                     # does NOT update C++ side
    >>>> print NS.g_a.m_i
    13                                   # the old value happens to be 13
    >>>> NS.g_a = A(42)                  # does update C++ side
    >>>> print NS.g_a.m_i
    42
    >>>> # etc.

Conclusion

That covers all there is to know about data member access of C++ classes in Python through a reflection layer! A few final notes: RPython does not support metaclasses, and so the construction of proxy classes (code like make_datamember above) happens in Python code instead. There is an overhead penalty of about 2x over pure RPython code associated with that, due to extra guards that get inserted by the JIT. A factor of 2 sounds like a lot, but the overhead is tiny to begin with, and 2x of tiny is still tiny and it's not easy to measure. The class definition of the custom property, CPPDataMember, is in RPython code, to be transparent to the JIT. The actual offset calculations are in the reflection layer. Having the proxy class creation in Python, with structural code in RPython, complicates matters if proxy classes need to be constructed on-demand. For example, if an instance of an as-of-yet unseen type is returned by a method. Explaining how that is solved is a topic of part 2, method calls, so stay tuned.

This posting laid out the reasoning behind the object representation of C++ objects in Python by cppyy for the purpose of data member access. It explained how the chosen representation of offsets gives rise to a very pythonic representation, which allows Python introspection tools to work as expected. It also explained some of the optimizations done for the benefit of the JIT. Next up are method calls, which will be described in part 2.

Sindwiller wrote on 2012-09-12 13:50:

On a related note, do you know when Reflex will discard gccxml? I'm using Boost.Python with Ogre3D (among other things) right now and I'm looking into the pypy option. Gccxml, however, complains about some C++11 related stuff (which is somewhat odd, to say the least, as I don't expose any Ogre-internal class or anything like that).

Wim Lavrijsen wrote on 2013-02-27 23:28:

Reflex itself will be discarded in favor of clang from llvm. That is, however, still experimental, but we're getting there.

Multicore Programming in PyPy and CPython

Hi all,

This is a short "position paper" kind of post about my view (Armin Rigo's) on the future of multicore programming in high-level languages. It is a summary of the keynote presentation at EuroPython. As I learned by talking with people afterwards, I am not a good enough speaker to manage to convey a deeper message in a 20-minute talk. I will try instead to convey it in a 250-line post...

This is about three points:

  1. We often hear about people wanting a version of Python running without the Global Interpreter Lock (GIL): a "GIL-less Python". But what we programmers really need is not just a GIL-less Python --- we need a higher-level way to write multithreaded programs than using directly threads and locks. One way is Automatic Mutual Exclusion (AME), which would give us an "AME Python".
  2. A good enough Software Transactional Memory (STM) system can be used as an internal tool to do that. This is what we are building into an "AME PyPy".
  3. The picture is darker for CPython, though there is a way forward too. The problem is that when we say STM, we think about either GCC 4.7's STM support, or Hardware Transactional Memory (HTM). Both solutions are enough for a "GIL-less CPython", but not for an "AME CPython", due to capacity limitations. For the latter, we need somehow to add some large-scale STM into the compiler.

Let me explain these points in more detail.

GIL-less versus AME

The first point is in favor of the so-called Automatic Mutual Exclusion approach. The issue with using threads (in any language with or without a GIL) is that threads are fundamentally non-deterministic. In other words, a program's behavior is not reproducible at all, and worse, we cannot even reason about it --- it quickly becomes messy. We would have to consider all possible combinations of code paths and timings, and we cannot hope to write tests that cover all combinations. This fact is often documented as one of the main blockers towards writing successful multithreaded applications.

We need to solve this issue with a higher-level solution. Such solutions exist theoretically, and Automatic Mutual Exclusion (AME) is one of them. The idea of AME is that we divide the execution of each thread into a number of "atomic blocks". Each block is well-delimited and typically large. Each block runs atomically, as if it acquired a GIL for its whole duration. The trick is that internally we use Transactional Memory, which is a technique that lets the system run the atomic blocks from each thread in parallel, while giving the programmer the illusion that the blocks have been run in some global serialized order.

This doesn't magically solve all possible issues, but it helps a lot: it is far easier to reason in terms of a random ordering of large atomic blocks than in terms of a random ordering of lines of code --- not to mention the mess that multithreaded C is, where even a random ordering of instructions is not a sufficient model any more.

What do such atomic blocks look like? For example, a program might contain a loop over all keys of a dictionary, performing some "mostly-independent" work on each value. This is a typical example: each atomic block is one iteration through the loop. By using the technique described here, we can run the iterations in parallel (e.g. using a thread pool) but using AME to ensure that they appear to run serially.

In Python, we don't care about the order in which the loop iterations are done, because we are anyway iterating over the keys of a dictionary. So we get exactly the same effect as before: the iterations still run in some random order, but --- and that's the important point --- they appear to run in a global serialized order. In other words, we introduced parallelism, but only under the hood: from the programmer's point of view, his program still appears to run completely serially. Parallelisation as a theoretically invisible optimization... more about the "theoretically" in the next paragraph.

Note that randomness of order is not fundamental: there are techniques building on top of AME that can be used to force the order of the atomic blocks, if needed.

PyPy and STM/AME

Talking more precisely about PyPy: the current prototype pypy-stm is doing precisely this. In pypy-stm, the length of the atomic blocks is selected in one of two ways: either explicitly or automatically.

The automatic selection gives blocks corresponding to some small number of bytecodes, in which case we have merely a GIL-less Python: multiple threads will appear to run serially, with the execution randomly switching from one thread to another at bytecode boundaries, just like in CPython.

The explicit selection is closer to what was described in the previous section: someone --- the programmer or the author of some library that the programmer uses --- will explicitly put with thread.atomic: in the source, which delimits an atomic block. For example, we can use it to build a library that can be used to iterate over the keys of a dictionary: instead of iterating over the dictionary directly, we would use some custom utility which gives the elements "in parallel". It would give them by using internally a pool of threads, but enclosing every handling of an element into such a with thread.atomic block.
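
A minimal sketch of such a utility, assuming pypy-stm's thread.atomic and a hypothetical run_in_thread_pool helper:

    import thread                    # pypy-stm adds 'atomic' to this module

    def process_items(d, handle):
        def worker(key):
            with thread.atomic:      # one element = one atomic block
                handle(key, d[key])
        run_in_thread_pool(worker, d.keys())    # hypothetical pool helper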

This gives the nice illusion of a global serialized order, and thus gives us a well-behaving model of the program's behavior.

Restating this differently, the only semantical difference between pypy-stm and a regular PyPy or CPython is that it has thread.atomic, which is a context manager that gives the illusion of forcing the GIL to not be released during the execution of the corresponding block of code. Apart from this addition, they are apparently identical.

Of course they are only semantically identical if we ignore performance: pypy-stm uses multiple threads and can potentially benefit from that on multicore machines. The drawback is: when does it benefit, and how much? The answer to this question is not immediate. The programmer will usually have to detect and locate places that cause too many "conflicts" in the Transactional Memory sense. A conflict occurs when two atomic blocks write to the same location, or when A reads it, B writes it, but B finishes first and commits. A conflict causes the execution of one atomic block to be aborted and restarted, due to another block committing. Although the process is transparent, if it occurs more than occasionally, then it has a negative impact on performance.
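
As a tiny example, reusing thread.atomic from the sketch above: if the following function runs in two threads at once, the two atomic blocks conflict whenever they overlap, because both read and write the same location.

    counter = {'hits': 0}

    def note_hit():
        with thread.atomic:
            counter['hits'] += 1     # read + write of the same location; one
                                     # of two overlapping blocks must restart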

There is no out-of-the-box perfect solution for solving all conflicts. What we will need is more tools to detect them and deal with them, data structures that are made aware of the risks of "internal" conflicts when externally there shouldn't be one, and so on. There is some work ahead.

The point here is that from the point of view of the final programmer, we get conflicts that we should resolve --- but at any point, our program is correct, even if it may not yet be as efficient as it could be. This is the opposite of regular multithreading, where programs are efficient but not as correct as they could be. In other words, as we all know, we only have resources to do the easy 80% of the work and not the remaining hard 20%. So in this model we get a program that has 80% of the theoretical maximum of performance and it's fine. In the regular multithreading model we would instead only manage to remove 80% of the bugs, and we are left with obscure rare crashes.

CPython and HTM

Couldn't we do the same for CPython? The problem here is that pypy-stm is implemented as a transformation step during translation, which is not directly possible in CPython. Here are our options:

  • We could review and change the C code everywhere in CPython.
  • We use GCC 4.7, which supports some form of STM.
  • We wait until Intel's next generation of CPUs comes out ("Haswell") and use HTM.
  • We write our own C code transformation within a compiler (e.g. LLVM).

I will personally file the first solution in the "thanks but no thanks" category. If anything, it will give us another fork of CPython that will painfully struggle to stay no more than 3-4 versions behind, and then eventually die. It is very unlikely to ever be merged into the CPython trunk, because it would need changes everywhere. Not to mention that these changes would be very experimental: tomorrow we might figure out that different changes would have been better, and have to start from scratch again.

Let us turn instead to the next two solutions. Both of these solutions are geared toward small-scale transactions, but not long-running ones. For example, I have no clue how to give GCC rules about performing I/O in a transaction --- this seems not supported at all; and moreover looking at the STM library that is available so far to be linked with the compiled program, it assumes short transactions only. By contrast, when I say "long transaction" I mean transactions that can run for 0.1 seconds or more. To give you an idea, in 0.1 seconds a PyPy program allocates and frees on the order of ~50MB of memory.

Intel's Hardware Transactional Memory solution is both more flexible and comes with a stricter limit. In one word, the transaction boundaries are given by a pair of special CPU instructions that make the CPU enter or leave "transactional" mode. If the transaction aborts, the CPU cancels any change, rolls back to the "enter" instruction and causes this instruction to return an error code instead of re-entering transactional mode (a bit like a fork()). The software then detects the error code. Typically, if transactions are rarely cancelled, it is fine to fall back to a GIL-like solution just to redo these cancelled transactions.

About the implementation: this is done by recording all the changes that a transaction wants to make to the main memory, and keeping them invisible to other CPUs. This is "easily" achieved by keeping them inside this CPU's local cache; rolling back is then just a matter of discarding a part of this cache without committing it to memory. From this point of view, it is a safe bet that we are actually talking about the regular per-core Level 1 and Level 2 caches --- so any transaction that cannot fully store its read and written data in the 64+256KB of the L1+L2 caches will abort.

So what does it mean? A Python interpreter overflows the L1 cache of the CPU very quickly: just creating new Python function frames takes a lot of memory (on the order of magnitude of 1/100 of the whole L1 cache). Adding a 256KB L2 cache into the picture helps, particularly because it is highly associative and thus avoids a lot of false conflicts. However, as long as the HTM support is limited to L1+L2 caches, it is not going to be enough to run an "AME Python" with any sort of medium-to-long transaction. It can run a "GIL-less Python", though: just running a few hundred or even thousand bytecodes at a time should fit in the L1+L2 caches, for most bytecodes.

I would vaguely guess that it will take on the order of 10 years until CPU cache sizes grow enough for a CPU in HTM mode to actually be able to run 0.1-second transactions. (Of course in 10 years' time a lot of other things may occur too, including the whole Transactional Memory model being displaced by something else.)

Write your own STM for C

Let's discuss now the last option: if neither GCC 4.7 nor HTM are sufficient for an "AME CPython", then we might want to write our own C compiler patch (as either extra work on GCC 4.7, or an extra pass to LLVM, for example).

We would have to deal with the fact that we get low-level information, and somehow need to preserve interesting high-level bits through the compiler up to the point at which our pass runs: for example, whether the field we read is immutable or not. (This is important because some common objects are immutable, e.g. PyIntObject. Immutable reads don't need to be recorded, whereas reads of mutable data must be protected against other threads modifying them.) We can also have custom code to handle the reference counters: e.g. not consider it a conflict if multiple transactions have changed the same reference counter, but just resolve it automatically at commit time. We are also free to handle I/O in the way we want.

More generally, the advantage of this approach over both the current GCC 4.7 and over HTM is that we control the whole process. While this still looks like a lot of work, it looks doable. It would be possible to come up with a minimal patch of CPython that can be accepted into core without too much trouble (e.g. to mark immutable fields and tweak the refcounting macros), and keep all the cleverness inside the compiler extension.

Conclusion

I would assume that a programming model specific to PyPy and not applicable to CPython has little chance of catching on, as long as PyPy is not the main Python interpreter (which looks unlikely to change anytime soon). Thus as long as only PyPy has AME, it looks like it will not become the main model of multicore usage in Python. However, I can conclude with a more positive note than during the EuroPython conference: it is a lot of work, but there is a more-or-less reasonable way forward to have an AME version of CPython too.

In the meantime, pypy-stm is around the corner, and together with tools developed on top of it, it might become really useful and used. I hope that in the next few years this work will trigger enough motivation for CPython to follow the ideas.

JohnLenton wrote on 2012-08-09 12:29:

A question: does a “donate towards STM/AME in pypy” also count as a donation towards the CPython work? Getting the hooks in CPython to allow exploration and implementation of this seems at least as important as the pypy work. In fact, I think it’s quite a bit more important.

Armin Rigo wrote on 2012-08-09 12:55:

@John: I didn't foresee this development at the start of the year, so I don't know. It's a topic that would need to be discussed internally, likely with feedback from past donors.

Right now of course I'm finishing the basics of pypy-stm (working on the JIT now), and from there on there is a lot that can be done as pure Python, like libraries of better-suited data structures --- and generally gaining experience that would anyway be needed for CPython's work.

Anonymous wrote on 2012-08-09 15:53:

With HTM you don't have to have a one-to-one mapping between your application transactions and the hardware interface. You can also have an STM, that is implemented using HTM. So you may do all the book-keeping yourself in software, but then at commit time use HTM.

Nat Tuck wrote on 2012-08-09 16:37:

No. We really do want a GIL-free Python. Even if that means we sometimes need to deal with locks.

Right now a high end server can have 64 cores. That means that parallel python code could run faster than serial C code.

STM and other high level abstractions are neat, but they're no substitute for just killing the damn GIL.

Anonymous wrote on 2012-08-09 17:32:

What does 'just killing the damn GIL' mean without something like STM? Do you consider it acceptable for Python primitives not to be threadsafe?

If you intend to run 64 cores, then what is the exact reason you need threading and can't use multiprocessing?

Anonymous wrote on 2012-08-09 19:54:

Jesus Christ why don't we all just spend 5 min fiddling with the multiprocessing module and learn how to partition execution and queues like we partition sequences of statements into functions? So sick of GIL articles and the obsession with not learning how to divide up the work and communicate. In some ways the need to recognize narrow channels where relatively small amounts of data are being channeled through relatively intense blocks of execution and create readable, explicit structure around those blocks might actually improve the comprehensibility of some code I've seen. Getting a little tired of seeing so much effort by excellent, essential, dedicated Python devs getting sucked up by users who won't get it.

I think users are driving this speed-for-free obsession way too far. If anything, bugs in a magical system are harder to find than understanding explicit structure, and explicit structure that's elegant is neither crufty nor slow. Eventually, no interpreter will save a bad programmer. Are we next going to enable the novice "Pythonista" to forego any knowledge of algorithms?

We -need- JIT on production systems to get response times down for template processing without micro-caching out the wazoo. These types of services are already parallel by nature of the servers and usually I/O bound except for the few slow parts. Cython already serves such an excellent role for both C/C++ APIs AND speed AND optimizing existing python code with minimal changes. JIT PyPy playing well with Cython would make Python very generally uber. Users who actually get multiprocessing and can divide up the workflow won't want a slower implementation of any other kind. Getting a somewhat good solution for 'free' is not nearly as appealing as the additional headroom afforded by an incremental user cost (adding some strong typing or patching a function to work with pypy/py3k).

Unknown wrote on 2012-08-09 19:59:

template processing. lol.

Maciej Fijalkowski wrote on 2012-08-09 21:27:

@Anonymous.

I welcome you to work out how to make pypy translation process parallel using any techniques you described.

Benjamin wrote on 2012-08-10 07:27:

I get the overall goals and desires and I think they are fabulous. However, one notion that seems counterintuitive to me is the desire for large atomic operations.

Aside from the nomenclature (atomic generally means smallest possible), my intuition is that STM would generally operate more efficiently by having fewer roll-backs with small atomic operations and frequent commits. This leads me to assume there is some sort of significant overhead involved with the setup or teardown of the STM 'wrapper'.

From a broader perspective, I get that understanding interlacing is much easier with larger pieces, but larger pieces of code don't lend themselves to wide distribution across many cores like small pieces do.

It seems, to me, that you're focusing heavily on the idea of linearly written code magically functioning in parallel and neglecting the idea of simple, low-cost concurrency, which might have a much bigger short-term impact; and which, through use, may shed light on better frameworks for reducing the complexity inherent in concurrency.

Armin Rigo wrote on 2012-08-10 08:57:

@Anonymous: "So you may do all the book-keeping yourself in software, but then at commit time use HTM.": I don't see how (or the point), can you be more explicit or post a link?

@Anonymous: I'm not saying that STM is the final solution to all problems. Some classes of problems have other solutions that work well so far and I'm not proposing to change them. Big servers can naturally handle big loads just by having enough processes. What I'm describing instead is a pure language feature that may or may not help in particular cases --- and there are other cases than the one you describe where the situation is very different and multiprocessing doesn't help at all. Also, you have to realise that any argument "we will never need feature X because we can work around it using hack Y" is bound to lose eventually: at least some people in some cases will need the clean feature X because the hack Y is too complicated to learn or use correctly.

@Benjamin: "atomic" actually means "not decomposable", not necessarily "as small as possible". This focus on smallness of transaction IMO is an artefact of last decade's research focus. In my posts I tend to focus on large transaction as a counterpoint: in the use cases I have in mind there is no guarantee that all transactions will be small. Some of them may be, but others not, and this is a restriction. In things like "one iteration through this loop = one transaction", some of these iterations go away and do a lot of stuff.

Unknown wrote on 2012-08-10 18:15:

Transactional programming is neat. So are Goroutines and functional-style parallelism. On the other hand, I think that C and C++ (or at least C1x and C++11) get one thing completely right: they don't try to enforce any particular threading model. For some problems (like reference counts, as you mention), you really do want a different model. As long as other languages force me to choose a single model, my big projects will stay in C/C++.

Benjamin wrote on 2012-08-10 21:17:

@Armin I'd love to hear your thoughts (benefits, costs, entrenched ideas, etc.) on large vs small transactions at some point. Though I suspect that would be a post unto itself.

Armin Rigo wrote on 2012-08-10 22:04:

@Benjamin: a user program might be optimized to reduce its memory usage, for example by carefully reusing objects instead of throwing them away, finding more memory-efficient constructs, and so on. But in many cases in Python you don't care too much. Similarly, I expect that it's possible to reduce the size of transactions by splitting them up carefully, hoping to get some extras in performance. But most importantly I'd like a system where the programmer didn't have to care overmuch about that. It should still work reasonably well for *any* size, just like a reasonable GC should work for any heap size.

If I had to describe the main issue I have against HTM, it is that beyond some transaction size we lose all parallelism because it has to fall back on the GIL.

Well, now that I think about it, it's the same in memory usage: if you grow past the RAM size, the program is suddenly swapping, and performance becomes terrible. But RAM sizes are so far much more generous than maximum hardware transaction sizes.

Unknown wrote on 2012-08-12 08:26:

There are two key concurrency patterns to keep in mind when considering Armin's STM work:

1. Event-loop based applications that spend a lot of time idling waiting for events.

2. Map-reduce style applications where only the reduce step is particularly prone to resource contention, but the map step is read-heavy (and thus hard to split amongst multiple processes)

For both of those use cases, splitting out multiple processes often won't pay off due to either the serialisation overhead or the additional complexity needed to make serialisation possible at all.

Coarse-grained STM, however, should pay off handsomely in both of those scenarios: if the CPU bound parts of the application are touching different data structures, or are only *reading* any shared data, with any writes being batched for later application, then the STM interaction can be built in to the event loop or parallel execution framework.

Will STM help with threading use cases where multiple threads are simultaneously reading and writing the same data structure? No, it won't. However, such applications don't exploit multiple cores effectively even with free threading, because their *lock* contention will also be high.

As far as "just kill the GIL" goes, I've already written extensively on that topic: https://python-notes.boredomandlaziness.org/en/latest/python3/questions_and_answers.html#but-but-surely-fixing-the-gil-is-more-important-than-fixing-unicode

klaussfreire wrote on 2012-08-13 23:35:

Option 5, implement STM on the operating system. Linux already has COW for processes, imagine COW-MERGE for threads.

When you start transactional mode, all pages are marked read-only, thread-private and COW. When you commit, dirty pages are merged with the processes' page maps, unless conflicts arise (the process already has dirty pages).

A simple versioning system and version checks would take care of conflict detection.

I just wonder how difficult it would be designing applications that can run on this model (conflicts at page level vs object level).

Thread-private allocation arenas are entirely possible, to keep new objects from creating conflicts all the time, so it would be a matter of making read-only use of objects really read-only, something I've done incrementally in patches already. Reference counts have to be externalized (taken out of PyObject), for instance.

Armin Rigo wrote on 2012-08-14 09:12:

@klaussfreire: that approach is a cool hack but unlikely to work in practice in a language like Python, because the user doesn't control at all what objects are together with what other objects on the same pages. Even with the reference counts moved out of the way I guess you'd have far too many spurious conflicts.

klaussfreire wrote on 2012-08-14 15:43:

@Armin, well, Python itself does know.

In my half-formed idea in my head, python would use thread-local versions of the integer pool and the various free lists, and allocation of new objects would be served from an also thread-local arena (while in a transaction).

Read-write access to shared objects, yes, would be a little bit unpredictable. That's why I was wondering how good (if at all) it would work for Python.

Wim Lavrijsen wrote on 2012-08-14 20:18:

@klaussfreire

is this perhaps what you are looking for: https://plasma.cs.umass.edu/emery/grace

Cheers,
Wim

klaussfreire wrote on 2012-08-14 21:50:

Damn. And I thought I was being original. I can already spot a few key places where kernel-based support would be superior (not only raw performance, but also transparency), but in general, that's exactly what I was talking about, sans transaction retrials.

Mark D. wrote on 2012-08-16 04:23:

0.1 second transactions? With hardware transactional memory the general idea is transactions about ten thousand times smaller. A dozen memory modifications maybe.

It would be prohibitively expensive, hardware wise, to implement conflict detection for transactions much larger than that, to say nothing of the occurrence of conflicts requiring rollback and re-execution if such enormously large transactions were executed optimistically.

Armin Rigo wrote on 2012-08-19 11:58:

@Mark D.: I don't know if "a dozen memory modifications" comes from real work in the field or is just a guess. My own guess would be that Intel Haswell supports easily hundreds of modifications, possibly thousands. Moreover the built-in cache coherency mechanisms should be used here too, in a way that scales with the cache size; this means they should not be "prohibitively expensive".
Of course I know that in 0.1 seconds we do far more than thousands of writes, but I think that nothing strictly limits the progression of future processors in that respect.

The occurrence of conflicts in large transactions depends on two factors. First, "true conflicts", which is the hard problem, but which I think should be relatively deterministic and debuggable with new tools. Second, "false conflicts", which is the HTM/STM mechanism detecting a conflict when there is none. To handle large transactions this should occur with a probability very, very close to 0% for each memory access. In pypy-stm it is 0%, but indeed, with HTM it depends on how close to 0% they can get. I have no data on that.

Ole Laursen wrote on 2012-09-06 15:04:

I'm a little late, but regarding the simple let's-do-the-loop-concurrently example: if pypy-stm ends up working out as hoped, would it be relatively easy for pypy to do it automatically, without having to use the parallel loop thing explicitly?

I have a hunch the answer would be yes, but that the hard part is figuring out when it makes sense and how to do the split (each thread needs a good chunk to work on).

On the other hand, GCC has OpenMP which does seem really convenient and also looks like it has (or rather an implementation of that would have to have) solved part of this problem.

Many years ago, I read about research in auto-parallelising compilers and it struck me as a really hard problem. But if you can just do some magic with the loops, perhaps it's an attainable goal?

Unknown wrote on 2012-09-06 21:02:

I really believe that concurrency - like memory allocation, GC and safe arrays - should be done without the user thinking about it...

Languages like Erlang, ABCL and Concurrent Object Oriented C solve this quite elegantly.

Just make every Object a "process" (thread/greenlet) and every return value a Future and you are done :-)

Anonymous wrote on 2015-09-22 07:53:

Ammm... Jython 2.7.0 !

All pure Python syntax using threading instantly go MULTI-CORE! All you need to do is replace the 'p' with a 'j' in your command and voila!

;)

NumPyPy non-progress report

Hello everyone.

Not much has happened in the past few months with numpypy development. Part of the reason was me doing other stuff, part was various unexpected visa-related admin, part was EuroPython, and part was a long-awaited holiday.

The thing that's maybe worth mentioning is that this does not mean the donations disappeared in the mist. PyPy developers are being paid to work on NumPyPy on an hourly basis - that means that if I decide to take holidays or work on something else, the money simply stays in the account until later.

Thanks again for all the donations, I hope to get back to this topic soon!

Cheers,
fijal


Stephen Weber wrote on 2012-08-09 00:37:

Thanks for the non-update, I trust you that all is well. Rest helps us work better!

Unknown wrote on 2012-08-13 13:25:

Please don’t worry too much about the money lost/not-lost. The important part is that you enjoy the programming. For you, because that’s more fun and for us because more fun for the programmer means better code.

CFFI release 0.2.1

Hi everybody,

We released CFFI 0.2.1 (expected to be 1.0 soon). CFFI is a way to call C from Python.

EDIT: Win32 was broken in 0.2. Fixed.

This release is only for CPython 2.6 or 2.7. PyPy support is coming in the ffi-backend branch, but is not finished yet. CPython 3.x would be easy but requires the help of someone.

The package is available on bitbucket as well as documented. You can also install it straight from the Python Package Index: pip install cffi

  • Contains numerous small changes and support for more C-isms.
  • The biggest news is the support for installing packages that use ffi.verify() on machines without a C compiler. Arguably, this lifts the last serious restriction for people to use CFFI.
  • Partial list of smaller changes:
    • mappings between 'wchar_t' and Python unicodes
    • the introduction of ffi.NULL
    • a possibly clearer API for ffi.new(): e.g. to allocate a single int and obtain a pointer to it, use ffi.new("int *") instead of the old ffi.new("int") (see the sketch after this list)
    • and of course a plethora of smaller bug fixes
  • CFFI uses pkg-config to install itself if available. This helps locate libffi on modern Linuxes. Mac OS/X support is available too (see the detailed installation instructions). Win32 should work out of the box. Win64 has not been really tested yet.
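
As a quick sketch of the ffi.new() change from the list above:

    from cffi import FFI

    ffi = FFI()
    p = ffi.new("int *")    # allocate a single int, get a pointer to it
    p[0] = 42
    print(p[0])             # 42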

Cheers,
Armin Rigo and Maciej Fijałkowski

Prototype PHP interpreter using the PyPy toolchain - Hippy VM

Hello everyone.

I'm proud to release the result of a Facebook-sponsored study on the feasibility of using the RPython toolchain to produce a PHP interpreter. The rules were simple: two months; one person; get as close to PHP as possible, implementing enough warts and corner cases to be reasonably sure that it answers hard problems in the PHP language. The outcome is called Hippy VM and implements most of the PHP 1.0 language (functions, arrays, ints, floats and strings). This should be considered an alpha release.

The resulting interpreter is obviously incomplete – it does not support all modern PHP constructs (classes are completely unimplemented), builtin functions, grammar productions, web server integration, builtin libraries, etc., etc. It's just complete enough for me to reasonably be able to say that – given some engineering effort – it's possible to provide a rock-solid and fast PHP VM using PyPy technologies.

The result is available in a Bitbucket repo and is released under the MIT license.

Performance

The table below shows a few benchmarks comparing Hippy VM to Zend (a standard PHP interpreter available in Linux distributions) and HipHop VM (a PHP-to-C++ optimizing compiler developed by Facebook). The versions used were Zend 5.3.2 (Zend Engine v2.3.0) and HipHop VM heads/vm-0-ga4fbb08028493df0f5e44f2bf7c042e859e245ab (note that you need to check out the vm branch to get the newest version).

The run was performed on 64-bit Linux running on a Xeon W3580 with 8M of L2 cache, which was otherwise unoccupied.

Unfortunately, I was not able to run it on the JITted version of HHVM, the new effort by Facebook, but people involved with the project told me it's usually slower or comparable with the compiled HipHop. Their JITted VM is still alpha software, so I'll update it as soon as I have the info.

    benchmark      Zend    HipHop VM  Hippy VM   Hippy / Zend  Hippy / HipHop
    arr            2.771   0.508+-0%  0.274+-0%  10.1x         1.8x
    fannkuch       21.239  7.248+-0%  1.377+-0%  15.4x         5.3x
    heapsort       1.739   0.507+-0%  0.192+-0%  9.1x          2.6x
    binary_trees   3.223   0.641+-0%  0.460+-0%  7.0x          1.4x
    cache_get_scb  3.350   0.614+-0%  0.267+-2%  12.6x         2.3x
    fib            2.357   0.497+-0%  0.021+-0%  111.6x        23.5x
    fasta          1.499   0.233+-4%  0.177+-0%  8.5x          1.3x

The PyPy compiler toolchain provides a way to implement a dynamic language interpreter in a high-level language called RPython. This is a language which is lower-level than Python, but still higher-level than C or C++: for example, RPython is a garbage-collected language. The killer feature is that the toolchain will generate a JIT for your interpreter which will be able to leverage most of the work that has been done on speeding up Python in the PyPy project. The resulting JIT is generated for your interpreter, and is not Python-specific. This was one of the toolchain's original design decisions – in contrast to e.g. the JVM, which was initially only used to interpret Java and later adjusted to serve as a platform for dynamic languages.

Another important difference is that there is no common bytecode to which you compile both your language and Python, so you don't inherit problems presented when implementing language X on top of, say, Parrot VM or the JVM. The PyPy toolchain does not impose constraints on the semantics of your language, whereas the benefits of the JVM only apply to languages that map well onto Java concepts.

To read more about creating your own interpreters using the PyPy toolchain, read more blog posts or an excellent article by Laurence Tratt.

PHP deviations

The project's biggest deviation from the PHP specification is probably that GC is no longer reference counting. That means that the object finalizer, when implemented, will not be called directly at the moment of object death, but at some later point. There are possible future developments to alleviate that problem, by providing "refcounted" objects when leaving the current scope. Research has to be done in order to achieve that.

Assessment

The RPython toolchain seems to be a cost-effective choice for writing dynamic language VMs. It both provides a fast JIT and gives you access to low-level primitives when you need them. A good example is in the directory hippy/rpython which contains the implementation of an ordered dictionary. An ordered dictionary is not a primitive that RPython provides – it's not necessary for the goal of implementing Python. Now, implementing it on top of a normal dictionary is possible, but inefficient. RPython provides a way to work directly at a lower level, if you desire to do so.

Things that require improvements in RPython:

  • The lack of mutable strings at the RPython level turned out to be a problem. I ended up using lists of characters, which are efficient but inconvenient, since they don't support any string methods.
  • Frame handling is too conservative and too Python-specific, especially around calls. It's possible to implement a less general, but simpler and faster, frame-handling scheme in RPython.

Status of the implementation

Don't use it! It's a research prototype intended to assess the feasibility of using RPython to create dynamic language VMs. The most notable missing feature is reasonable error reporting. That said, I'm confident it implements enough of the PHP language to show that a full implementation will have the same performance characteristics.

Benchmarks

The benchmarks are a selection of computer language shootout benchmarks, plus cache_get_scb, which is a piece of old Facebook code. All benchmarks other than that one (which is not open source, but definitely the most interesting :( ) are available in the bench directory. The Python program that runs them, runner.py, lives in the same directory. It runs each benchmark 10 times, discards the first 3 runs (to ignore JIT warm-up time) and averages the rest, as sketched below. Where the standard deviation is omitted it is below 0.5%; as you can see, it is fairly minimal for all interpreters and runs.
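
Roughly re-created in plain Python, the measurement logic looks like this (a sketch, not the actual runner.py):

    import math
    import subprocess
    import time

    def bench(cmd, total_runs=10, warmup=3):
        # Run the benchmark repeatedly, drop the warm-up runs, and
        # report the mean and standard deviation of the remainder.
        times = []
        for _ in range(total_runs):
            t0 = time.time()
            subprocess.check_call(cmd)
            times.append(time.time() - t0)
        times = times[warmup:]
        mean = sum(times) / len(times)
        var = sum((t - mean) ** 2 for t in times) / len(times)
        return mean, math.sqrt(var)

    # e.g. bench(['./hippy-c', 'bench/fib.php'])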

The benchmarks were not selected for their ease of optimization – the optimizations in the interpreter were written specifically for this set of benchmarks. No special JIT optimizations were added, and barring what's mentioned below a vanilla PyPy 1.9 checkout was used for compilation.

So, how fast will my website run if this is completed?

The truth is that I lack the benchmarks to answer that question right now. However, the core of the PHP language is implemented to the point where I'm confident that the performance characteristics will not change as more of PHP is added.

How do I run it?

Get a PyPy checkout, apply the diff if you want to squeeze out the last bits of performance, and run pypy-checkout/pypy/bin/rpython targethippy.py to get an executable that resembles a PHP interpreter. You can also run python targethippy.py file.php directly, but this will be about 2000x slower.

RPython modifications

There was one modification that I made to the PyPy source code; the diff is available. It's trivial, and should simply be made optional in the RPython JIT generator, but given the very constrained time frame it was easier to just do it:

  • gen_store_back_in_virtualizable was disabled. This feature is necessary for Python frames but not for PHP frames. PHP frames do not have to be kept alive after we exit a function.

Future

Hippy is a cool prototype that presents a very interesting path towards a fast PHP VM. However, at the moment I have too many other open source commitments to take on the task of completing it in my spare time. I do think that this project has a lot of potential, but I will not commit to any further development at this time. If you send pull requests I'll try to review them. I'm also open to having further development on this project funded, so if you're interested in this project and the potential of a fast PHP interpreter, please get in touch.

Cheers,
fijal

EDIT: Fixed the path to the rpython binary

Anonymous wrote on 2012-07-13 23:26:

it's cool. Next on the list Javascript to Python/PyPy converter...

Maciej Fijalkowski wrote on 2012-07-13 23:34:

please read the blog post first. It's *not* a PHP to Python converter. There is also a partly started JS implementation at https://bitbucket.org/pypy/lang-js, but JS is kind of useless without a browser.

Anonymous wrote on 2012-07-14 00:30:

JS to pypy would be useful when time comes to running all those node based apps in prod ;)

Also, Java to PyPy would be a cool experiment too - jvm's way too bloated...

Christian Heimes wrote on 2012-07-14 01:42:

Do I read the numbers correctly? The fibonacci test runs more than 110 times faster in your experimental, 2 months old VM than in the default Zend VM? That's amazing!

It took me a while to figure out the meaning of the numbers. Please add units and explain that small is faster.

Christian

Unknown wrote on 2012-07-14 02:27:

Nice, Python surprising when

Konstantine Rybnikov wrote on 2012-07-14 07:25:

Cool. When will your pypy converter convert my c++ programs to python? Can't wait until that happens! Anyway, nice work!

p.s.: sarcasm

Benedikt Morbach wrote on 2012-07-14 10:22:

Hey there, nice work.

Do you have any numbers or estimates how memory consumption compares?

Ole Laursen wrote on 2012-07-14 11:56:

I hope you get funding for researching the refcount thing. Being able to predict when something gets whacked is just really convenient and something PyPy Python can benefit from too.

While GC may be more efficient, its unpredictable nature does become a problem in production in some cases.

For instance, for a webapp written with Django and CPython, when a request is over I know that the stuff that was allocated is now gone unless I put something in a global data structure. I suspect many applications have similar patterns where you perform a big operation after which it's natural to have a clean up.

Inactive Account wrote on 2012-07-15 00:21:

Wow, this is wonderful.
You rock.

I surely hope you get funding.

If I didn't live in Brazil, and our currency wasn't so weak, and my income wasn't so low, I would definitely donate some dozens of dollars.

Keep up the good work

Tom wrote on 2012-07-15 19:02:

I would like to see how this compares to the Phalanger project, which runs PHP on the .NET runtime.

Maciej Fijalkowski wrote on 2012-07-15 19:05:

About Phalanger: the short answer is that I don't have Windows, and comparisons on Mono would be a bit disingenuous. The longer answer is that I don't expect Phalanger to particularly excel compared to Zend.

For example, compare the performance of IronPython and CPython. The same reasons apply as they do to the JVM or Parrot - this is IMO not the right way for dynamic languages.

Anonymous wrote on 2012-07-15 20:16:

Does the Zend test include APC as well? That's the current standard way to run php scripts...

Maciej Fijalkowski wrote on 2012-07-15 20:29:

Yes, although APC does not change anything in *this* set of benchmarks, precisely because you run everything in-process (within the same interpreter instance even).

Reini Urban wrote on 2012-07-16 16:25:

Love this effort and esp. the benchmarks! Great work

Referring to your mentioning of JVM and parrot:

You consider it a disadvantage to be tied to an existing set of VM opcodes when implementing many languages. You were talking about .NET (which had to add Iron-style dynamic reflection later) or the JVM.

parrot already has all the functionality the JVM or .NET was missing, and even more (e.g. dynamic types loadable as plugins), and considers it an advantage to share opcodes and bytecode libraries across different languages.

But parrot cannot compete with your speed yet.

SM wrote on 2012-07-16 17:36:

Very interesting project. It would be nice if you used a recent version of PHP for comparisons - 5.3.2 is over 2 years old and one version behind. Try something like 5.4.4.

Reinis I. wrote on 2012-07-18 20:59:

> JS is kind of useless without a browser

This would have been more true before Node.js, but now it's false.

Arne Babenhauserheide wrote on 2012-07-18 22:18:

Wow, 1.5x to 20x faster than a PHP-compiler and 7x to 100x faster than PHP itself… congrats!

Anonymous wrote on 2012-07-24 11:01:

Offtopic: not trying to sound offensive or pushy, but what happened to numpypy development? I'm regularly checking https://buildbot.pypy.org/numpy-status/latest.html, and it looks like its development has been stale for several months.

Maciej Fijalkowski wrote on 2012-07-24 11:06:

@Anonymous not much. I'll write a non-progress blog post some time soon.

Anonymous wrote on 2012-07-24 11:46:

@Fijal
Thank you!

Dima Tisnek wrote on 2012-08-08 09:33:

Awesome proof of concept!

Can you post a memory footprint comparison, please?

And perhaps a quick overview of what these test cases cover: arithmetic, function-call overhead, dynamic language features?

Thanks for your hard work; without the likes of you OSS would never exist!

Anonymous wrote on 2013-02-03 15:15:

Just in case anyone *is* interested in implementing PHP on the Parrot Virtual Machine, you don't have to tie yourself to the PVM bytecodes.

You can write your PHP compiler entirely in NQP (Not Quite Perl) which in turn produces parrot bytecode for you.

This is important for two reasons:

First, NQP is a mid-level language, relatively easy to write in, and it doesn't require you to know anything at all about the PVM.

Second, although NQP *presently* only targets PVM, there's an in-progress backend which targets the Java Virtual Machine! Early benchmarks suggest that it is already faster than perl5, and there are many optimizations and speedups to come.

Thus, if you were to write a PHP compiler in NQP, you could target either the Parrot Virtual machine, or (in the future) the Java virtual machine.

Py3k status update #5

This is the fifth status update about our work on the py3k branch, which we
can work on thanks to all of the people who donated to the py3k proposal.

Apart from the usual "fix shallow py3k-related bugs" part, most of my work in
this iteration has been to fix the bootstrap logic of the interpreter, in
particular to setup the initial sys.path.

Until a few weeks ago, the logic to determine sys.path was written entirely
at app-level in pypy/translator/goal/app_main.py, which is automatically
included inside the executable during translation. The algorithm is more or
less like this:

  1. find the absolute path of the executable by looking at sys.argv[0]
    and cycling through all the directories in PATH
  2. starting from there, go up in the directory hierarchy until we find a
    directory which contains lib-python and lib_pypy (see the sketch below)
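
A minimal sketch of that search in plain Python (hypothetical helper names, not the actual app_main.py code):

    import os

    def find_executable(argv0):
        # Step 1: if argv0 has no directory component, try every
        # directory listed in $PATH.
        if os.sep in argv0:
            return os.path.abspath(argv0)
        for d in os.environ.get('PATH', '').split(os.pathsep):
            path = os.path.join(d, argv0)
            if os.path.isfile(path):
                return os.path.abspath(path)
        return argv0

    def find_stdlib_root(executable):
        # Step 2: walk up from the executable until a directory
        # contains both lib-python and lib_pypy.
        dirname = os.path.dirname(executable)
        while True:
            if (os.path.isdir(os.path.join(dirname, 'lib-python')) and
                    os.path.isdir(os.path.join(dirname, 'lib_pypy'))):
                return dirname
            parent = os.path.dirname(dirname)
            if parent == dirname:    # reached the filesystem root
                return None
            dirname = parent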

This works fine for Python 2, where paths and filenames are represented as
8-bit strings, but it is a problem for Python 3, where we want to use unicode
instead. In particular, whenever we try to convert an 8-bit string into
unicode, PyPy asks the _codecs built-in module to find the suitable
codec. Then, _codecs tries to import the encodings package, to list
all the available encodings. encodings is a package of the standard
library written in pure Python, so it is located inside
lib-python/3.2. But at this point we have yet to add
lib-python/3.2 to sys.path, so the import fails. Bootstrap problem!

The hard part was finding the problem: since the error happens so
early, the interpreter is not even able to display a traceback, because it
cannot yet import traceback.py. The only way to debug it was through some
carefully placed print statements and the help of gdb. Once the problem was
found, the solution was as easy as moving part of the logic to RPython,
where we don't have bootstrap problems.

Once the problem was fixed, I was finally able to run all the CPython tests
against the compiled PyPy. As expected there are lots of failures, and fixing
them will keep me busy for the next months.

Anonymous wrote on 2012-07-10 17:10:

Would be nice to have a PyPy distribution embedded in OpenOffice 3.4.2

haypo wrote on 2012-07-11 10:18:

I solved a similar issue in Python 3.2. Python 3 used the wrong encoding to encode/decode filenames. When I tried to use the filesystem encoding instead, I had an ugly bootstrap issue with encodings implemented in Python (whereas ASCII, latin1 and utf-8 are implemented in C with a fast path).

The solution is to use C functions to encode to and decode from the locale encoding, because the filesystem encoding is the locale encoding. mbstowcs() and wcstombs() are used until the Python codec machinery is ready.

Anonymous wrote on 2012-07-13 15:58:

Did you try to compare PyPy to Pythran? According to its author, Pythran is on some benchmarks 30x faster than PyPy: https://linuxfr.org/users/serge_ss_paille/journaux/pythran-python-c#comment-1366988

see also the manual here: https://github.com/serge-sans-paille/pythran/blob/master/MANUAL

What do you think of this approach of translating Python to C++ ?

Maciej Fijalkowski wrote on 2012-07-13 17:54:

@Anonymous - there is extremely little point in comparing python with whatever-looks-like-python-but-is-not. It's beyond the scope of this blog for sure.

Anonymous wrote on 2012-07-13 21:11:

To be fair to @Anonymous, the pypy developers commonly compare pypy to C in benchmarks, so it's not so unreasonable. The point is only that one should understand that they are different languages, not that all comparisons between languages are pointless.

Maciej Fijalkowski wrote on 2012-07-13 21:19:

Oh yes, sure. It's as productive to compare pypy to shedskin as it is to compare pypy with g77. It still *is*, or might be, a valuable comparison, but it is important to keep in mind that those languages are different.

Unknown wrote on 2012-08-13 17:30:

Any news on the py3k side?

That’s actually what’s most interesting to me on a practical level and it would be nice to know how long it will still take till I can test it :)

Antonio Cuni wrote on 2012-08-14 10:06:

@arne due to EuroPython and some personal issues not much has happened on the py3k side in the past month.

It is hard to give estimates about when things will be ready, because it depends a lot on how much time I'll be able to dedicate to it. At this point, most of the major features are implemented and I am fixing all the smaller ones which are highlighted by failing CPython tests. However, sometimes a small feature might take much more time to fix than a big one.

EuroPython sprint

Hi all,

EuroPython is next week. We will actually be giving a presentation on Monday, in one of the plenary talks: PyPy: current status and GIL-less future. This is the first international PyPy keynote we give, as far as I know, but not the first keynote about PyPy [David Beazley's video] :-)

The other talks are PyPy JIT under the hood and to some extent Performance analysis tools for JITted VMs. This year we are also trying out a help desk. Finally, we will have the usual sprint after EuroPython on Saturday and Sunday.

See you soon!

Armin.

holger krekel wrote on 2012-06-28 10:35:

Don't you consider the David Beazley keynote at Pycon 2012 as a talk about PyPy? (even if not from a core dev)

Armin Rigo wrote on 2012-06-28 10:38:

That's what the link "the first keynote about PyPy" is about. It's a link to the pypy blog where we talk about David's keynote. I did not find a direct page at us.pycon.org...

Architecture of Cppyy

The cppyy module makes it possible to call into C++ from PyPy through the Reflex package. Work started about two years ago, with a follow-up sprint a year later. The module has now reached an acceptable level of maturity, and initial documentation with setup instructions, as well as a list of the currently supported language features, is now available here. There is a sizable (non-PyPy) set of unit and application tests that is still being worked through, not all of them of general applicability, so development continues its current somewhat random walk towards full language coverage. However, if you find that cppyy by and large works for you except for certain specific features, feel free to ask for them to be given higher priority.

Cppyy handles bindings differently than what is typically found in other tools with a similar objective, so this update walks through some of these differences, and explains why choices were made as they are.

The most visible difference is from the viewpoint of the Python programmer interacting with the module. The two canonical ways of making Python part of a larger environment are to either embed or extend it. The latter is done with so-called extension modules, which are explicitly constructed to be very similar in their presentation to the Python programmer as normal Python modules. In cppyy, however, the external C++ world is presented from a single entrance point, the global C++ namespace (in the form of the variable cppyy.gbl). Thus, instead of importing a package that contains your C++ classes, usage looks like this (assuming class MyClass in the global namespace):

>>>> import cppyy
>>>> m = cppyy.gbl.MyClass()
>>>> # etc.

This is more natural than it appears at first: C++ classes and functions are, once compiled, represented by unique linker symbols, so it makes sense to give them their own unique place on the Python side as well. This organization allows pythonizations of C++ classes to propagate from one code to another, ensures that all normal Python introspection (such as issubclass and isinstance) works as expected in all cases, and that it is possible to represent C++ constructs such as typedefs simply by Python references. Achieving this unified presentation would clearly require a lot of internal administration to track all C++ entities if they each lived in their own, pre-built extension modules. So instead, cppyy generates the C++ bindings at run-time, which brings us to the next difference.
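
For instance, assuming hypothetical C++ classes Base and Derived (with Derived inheriting from Base) known to the back-end, the generated proxies behave as ordinary Python classes:

>>>> import cppyy
>>>> issubclass(cppyy.gbl.Derived, cppyy.gbl.Base)
True
>>>> d = cppyy.gbl.Derived()
>>>> isinstance(d, cppyy.gbl.Base)
True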

Then again, that is not really a difference: when writing or generating a Python extension module, the result is some C code that consists of calls into Python, which then gets compiled. However, it is not the bindings themselves that are compiled; it is the code that creates the bindings that gets compiled. In other words, any generated or hand-written extension module does exactly what cppyy does, except that they are much more specific in that the bound code is hard-wired with e.g. fixed strings and external function calls. The upshot is that in Python, where all objects are first-class and run-time constructs, there is no difference whatsoever between bindings generated at run-time, and bindings generated at ... well, run-time really. There is a difference in organization, though, which goes back to the first point of structuring the C++ class proxies in Python: given that a class will settle in a unique place once bound, instead of inside a module that has no meaning in the C++ world, it follows that it can also be uniquely located in the first place. In other words, cppyy can, and does, make use of a class loader to auto-load classes on-demand.

If at this point this all reminds you a bit of ctypes, just with some extra bells and whistles, you would be quite right. In fact, internally cppyy makes heavy use of the RPython modules that form the guts of ctypes. The difficult part of ctypes, however, is the requirement to annotate functions and structures. That is not very pleasant in C, but in C++ there is a whole other level of complexity, in that the C++ standard leaves many low-level details that are required for dispatching calls and understanding object layout "implementation defined." Of course, in the case of Open Source compilers, getting at those details is doable, but having to reverse engineer closed-source compilers gets old rather quickly in more ways than one. More generally, these implementation-defined details prevent a clean interface into C++, i.e. one without a further dependency on the compiler, like the one that the CFFI module provides for C. Still, once internal pointers have been followed, offsets have been calculated, this objects have been provided, etc., the final dispatch into binary C++ is no different from that into C, and cppyy will therefore be able to make use of CFFI internally, like it does with ctypes today. This is especially relevant in the CLang/LLVM world, where stub functions are done away with. To get the required low-level details, then, cppyy relies on a back-end rather than on the programmer, and this is where Reflex (together with the relevant C++ compiler) comes in, largely automating this tedious process.

There is nothing special about Reflex per se, other than that it is relatively lightweight, available, and has proven to be able to handle huge code bases. It was a known quantity when work on cppyy started, and given the number of moving parts in learning PyPy, that was a welcome relief. Reflex is based on gccxml, and can therefore handle pretty much any C or C++ code that you care to throw at it. It is also technically speaking obsolete as it will not support C++11, since gccxml won't, but its expected replacement, based on CLang/LLVM, is not quite there yet (we are looking at Q3 of this year).

In cppyy, access to Reflex, or any back-end for that matter, is through a thin C API (see the schematic below): cppyy asks high-level questions of the back-end, and receives low-level results, some of which are in the form of opaque handles. This ensures that cppyy is not tied to any specific back-end. In fact, it currently already supports another, CINT, but that back-end is of little interest outside of High Energy Physics (HEP). The Python side is always the same, however, so any Python code based on cppyy does not have to change if the back-end changes. To use the system, a back-end specific tool (genreflex for Reflex) is first run on a set of header files with a selection file for choosing the required classes. This produces a C++ file that must be compiled into a shared library, and a corresponding map file for the class loader. These shared libraries, with their map files alongside, can be put anywhere as long as they can be located through the standard paths for the dynamic loader. With that in place, the setup is ready, and the C++ classes are available to be used from cppyy.
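
Concretely, assuming a class MyClass declared in MyClass.h and selected in selection.xml (all file and class names here are illustrative), the Reflex-based setup might look like this:

    # Preparation, as shell commands (exact compiler flags depend on
    # your Reflex installation):
    #   $ genreflex MyClass.h --selection=selection.xml -o MyClass_rflx.cpp
    #   $ g++ -shared -fPIC MyClass_rflx.cpp -lReflex -o libMyClassDict.so

    import cppyy
    cppyy.load_reflection_info("libMyClassDict.so")
    m = cppyy.gbl.MyClass()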

So far, nothing that has been described is specific to PyPy. In fact, most of the technologies described have been used for a long time on CPython already, so why the need for a new, PyPy-specific, module? To get to that, it is important to first understand how a call is mediated between Python and C++. In Python, there is the concept of a PyObject, which has a reference count, a pointer to a type object, and some payload. There are APIs to extract the low-level information from the payload for use in the C++ call, and to repackage any results from the call. This marshalling is where the bulk of the time is spent when dispatching. To be absolutely precise, most C++ extension module generators produce slow dispatches because they don't handle overloads efficiently, but even then, they still spend most of their time in the marshalling code, albeit in calls that fail before trying the next overload.

In PyPy, speed is gained by having the JIT unbox objects into the payload only, allowing it to become part of compiled traces. If the same marshalling APIs were used, the JIT would be forced to rebox the payload, hand it over through the API, only to have it unboxed again by the binding. Doing so is dreadfully inefficient. The objective of cppyy, then, is to keep all code transparent to the JIT until the absolute last possible moment, i.e. the call into C++ itself, therefore allowing it to (more or less) directly pass the payload it already has, with an absolute minimal amount of extra work. In the extreme case when the binding is not to a call, but to a data member of an object (or to a global variable), the memory address is delivered to the JIT and this results in direct access with no overhead.

Note the interplay: cppyy in PyPy does not work like a binding in the CPython sense, where there is a back-and-forth between the interpreter and the extension. Instead, it does its work by being transparent to the JIT, allowing the JIT to dissolve the binding. And with that, we have made a full circle: if to work well with the JIT, and in so doing achieve the best performance, you cannot have marshalling or do any other API-based driving, then the concept of compiled extension modules is out, and the better solution is in run-time generated bindings.

That leaves one final point. What if you do want to present an extension module-like interface to programmers that use your code? But of course, this is Python: everything consists of first-class objects, whose behavior can be changed on the fly. In CPython, you might hesitate to make such changes, as every overlay or indirection results in quite a bit of overhead. With PyPy, however, these layers are all optimized out of existence, making that a non-issue.
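
As an illustration (a hypothetical facade, not part of cppyy itself), one could present a slice of the global namespace through a module-like overlay:

    import cppyy

    class Facade(object):
        """Module-like wrapper around a C++ namespace proxy."""
        def __init__(self, namespace):
            self._ns = namespace
        def __getattr__(self, name):
            # Under PyPy the JIT sees through this indirection, so
            # the extra layer costs nothing in compiled traces.
            return getattr(self._ns, name)

    mylib = Facade(cppyy.gbl)
    # mylib.MyClass() now behaves like cppyy.gbl.MyClass()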

This posting laid out the reasoning behind the organization of cppyy. A follow-up is planned, to explain how C++ objects are handled and represented internally.

Wim Lavrijsen

Fernando Perez wrote on 2012-06-25 21:00:

Thanks for this excellent post; any chance you'll make it to Scipy'2012 in Austin? I still remember your talk at one of the very old Scipys at Caltech as one of the best we've had; it would be great to catch up on the implications of your continued work on this front since. With the recent progress on cython and numpy/numba, fresh ideas on the C++ front are a great complement.

Sebastien Binet wrote on 2012-06-26 09:28:

Wim,

I know you are quite attached to details so I was surprised by:

"""
Reflex is based on gccxml, and can therefore handle pretty much any C or C++ code that you care to throw at it
"""

but that's not true: gccxml being an interesting and useful hack of the C++ frontend of GCC, it can only correctly parse the subset of C which is valid C++.

here are a few links:
https://stackoverflow.com/questions/1201593/c-subset-of-c-where-not-examples

https://en.wikipedia.org/wiki/Compatibility_of_C_and_C%2B%2B

I discovered it the hard way...

Anonymous wrote on 2012-06-26 09:45:

@Sebastien, GCC-XML must be able to parse the entirety of C, since it has to support "extern C" blocks, mustn't it?

Sebastien Binet wrote on 2012-06-26 12:30:

"extern C" is "just" modifying the symbol mangling mechanism of the identifiers inside the extern-C block.

just try this example from the link I posted earlier:
https://stackoverflow.com/questions/1201593/c-subset-of-c-where-not-examples

"""
struct A { struct B { int a; } b; int c; };
struct B b; // ill-formed: b has incomplete type (*not* A::B)
"""

even if you create a foo.h like so:

"""
#ifdef __cplusplus
extern "C" {
#endif

struct A { struct B { int a; } b; int c; };
struct B b;
#ifdef __cplusplus
}
#endif
"""

and compile some main.c/cxx (which just includes that header) with gcc/g++, you'll get:

"""
$ gcc main.c
$ echo $?
0

$ g++ main.cxx
In file included from main.cxx:2:0:
foo.h:7:12: error: aggregate ‘B b’ has incomplete type and cannot be defined
zsh: exit 1 g++ main.cxx
"""

gccxml is using the C++ parser, thus my first remark :}

Sebastien Binet wrote on 2012-06-26 12:54:

Also, as we are in the nitpicking and parsing department: any C++ keyword which isn't also a C keyword can be correctly used as an identifier in a C file, making that file valid C that falls outside the C++-compatible subset of C (e.g. class, new, this, to name a few of the most popular types or identifiers one can find in C codebases).

Wim Lavrijsen wrote on 2012-06-26 17:59:

@Fernando: no, no travel for me anytime soon. If Py4Science is still going, though, I can always walk down the hill, of course. :)

I've seen Numba (Stefan brought it up on the pypy-dev list), but it appears to be focused on C. With LLVM, we are using the AST directly. I don't think you can drive C++ through llvm-py.

@Sebastien: the "details" that you are missing are in that "pretty much any" is not the same as "all." Worse, Reflex has a whole toolchain of gccxml, genreflex, C++ compiler, and finally the Reflex API. You lose information at every step along the way. It's one more reason for CLang/LLVM, but as said, that's for Q3/2012.

Note though that there are two kinds of C headers that one may encounter. Those that are in a pure C environment, and those for mixed C/C++ use (e.g. Python.h and the system headers). In the former case, no-one would drag in the dependency on a C++ compiler, just to use Reflex. Using e.g. CFFI is a much better option. In the other case, there is no problem either way.

Cheers,
Wim

Anonymous wrote on 2012-06-27 11:46:

On a similar note, what's the state of embedding PyPy into C++ (or does cppyy make that case fully obsolete?)?

Wim Lavrijsen wrote on 2012-06-27 18:16:

@anonymous: there was a recent thread on pypy-dev, showing a successful embedding: https://mail.python.org/pipermail/pypy-dev/2012-March/009661.html

If done through C++, you can use the Python C-API (through cpyext), but AFAIK, that doesn't play nicely with threads yet.

Cheers,
Wim

Matthias wrote on 2012-06-28 16:50:

From my past experience, wrapping a C++ library for Python involves a whole lot more than just being able to call functions and having objects.

For example, using a binding generator like SWIG you need to annotate your source, because the source alone does not have sufficient information to generate proper bindings (at least no bindings that feel Python-like).

So I am wondering how Cppyy behaves in this area.

E.g. how does this play with templates? I will probably still need to define up-front which instantiations I need to be available in python?

How does it deal with object ownership? E.g. what happens if the C++ code decides to delete an object that python still points to? Or how are shared pointers dealt with?

How is type mapping handled? E.g. you might want to call functions taking MyString with "standard" python strings instead of having to construct MyString() objects first and then passing those.

Wim Lavrijsen wrote on 2012-06-28 18:36:

@Matthias: there are several follow-up posts planned to explain everything in detail, so just a few quick answers now.

Pythonizations are handled automatically based on signature, otherwise by allowing user defined pythonization functions.

Template instantiations are still needed in the Reflex world, but with CLang/LLVM, those can be generated by the backend (CINT can perform the instantiations automatically as well).

Object ownership can be handled heuristically if the C++ side behaves (this is e.g. the case for most of ROOT). If that's not the case, extra annotations per function or per object are needed. In addition, communication with the memory regulator (a tracker of all proxies on the python side) through a callback on both sides is possible.

Type mappings happen through custom converters that are to be coded up in either Python or C++. Standard mappings (e.g. the use of std::string in the way that you describe for MyString) have been added by default. Type mappings can also be done based on signature in some cases.

Not everything of the above is implemented in cppyy yet, but all have been solved before in PyROOT on CPython. It's just a matter of time to implement things for cppyy. The important point, however, is that none of this needs a separate language: most of it can be handled automatically, with a little work of the programmer in python proper or, worst case, with a C++ helper.

Cheers,
Wim

Release 0.1 of CFFI

Hi.

We're pleased to announce the first public release, 0.1, of CFFI, a way to call C from Python.
(This release does not support PyPy yet --- but we announce it here as it is planned for the
next release :-)

The package is available on bitbucket and is documented. You can also install it
straight from the Python package index (pip).

The aim of this project is to provide a convenient and reliable way of calling C code from Python.
The interface is based on LuaJIT's FFI and follows a few principles:

  • The goal is to call C code from Python. You should be able to do so
    without learning a 3rd language: every alternative requires you to learn
    its own language (Cython, SWIG) or API (ctypes). So we tried to
    assume that you know Python and C and to minimize the extra bits of API
    that you need to learn.
  • Keep all the Python-related logic in Python so that you don't need to
    write much C code (unlike CPython native C extensions).
  • Work either at the level of the ABI (Application Binary Interface)
    or the API (Application Programming Interface). Usually, C
    libraries have a specified C API but often not an ABI (e.g. they may
    document a "struct" as having at least these fields, but maybe more).
    (ctypes works at the ABI level, whereas Cython or native C extensions
    work at the API level; see the example after this list.)
  • We try to be complete. For now some C99 constructs are not supported,
    but all C89 should be, including macros (and including macro "abuses",
    which you can manually wrap in saner-looking C functions).
  • We attempt to support both PyPy and CPython (although PyPy support is not
    complete yet) with a reasonable path for other Python implementations like
    IronPython and Jython.
  • Note that this project is not about embedding executable C code in
    Python, unlike Weave. This is about calling existing C libraries
    from Python.
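
As a taste of the ABI level, here is a small example in the style of the documentation's printf demo (details of the 0.1 API may differ slightly):

    from cffi import FFI

    ffi = FFI()
    ffi.cdef("""
        int printf(const char *format, ...);   // copied from the man page
    """)
    C = ffi.dlopen(None)                  # loads the standard C library
    arg = ffi.new("char[]", "world")      # char arg[] = "world";
    C.printf("hi there, %s!\n", arg)      # call printf()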

Status of the project

Consider this a beta release. Creating CPython extensions is fully supported and the API should
be relatively stable; however, minor adjustments of the API are still possible.

PyPy support is not yet done and this is a goal for the next release. There are vague plans to make this the
preferred way to call C from Python that can reliably work between PyPy and CPython.

Right now CFFI's verify() requires a C compiler and header files to be available at run-time.
This limitation will be lifted in the near future, together with a way to cache the resulting binary.

Cheers,

Armin Rigo and Maciej Fijałkowski

intgr wrote on 2012-06-19 00:28:

Will the CFFI be any JIT-friendlier than PyPy's ctypes?

Anonymous wrote on 2012-06-19 16:46:

What's the difference between CFFI and CPyExt?

RonnyPfannschmidt wrote on 2012-06-19 18:04:

@intgr yes

@anon cffi is an FFI; cpyext is an emulation of the CPython C API. They are completely different things.