Natural Language Processing for Icelandic with PyPy: A Case Study
Icelandic is one of the smallest languages of the world, with about 370.000 speakers. It is a language in the Germanic family, most similar to Norwegian, Danish and Swedish, but closer to the original Old Norse spoken throughout Scandinavia until about the 14th century CE.
As with other small languages, there are worries that the language may not survive in a digital world, where all kinds of fancy applications are developed first - and perhaps only - for the major languages. Voice assistants, chatbots, spelling and grammar checking utilities, machine translation, etc., are increasingly becoming staples of our personal and professional lives, but if they don’t exist for Icelandic, Icelanders will gravitate towards English or other languages where such tools are readily available.
Iceland is a technology-savvy country, with world-leading adoption rates of the Internet, PCs and smart devices, and a thriving software industry. So the government figured that it would be worthwhile to fund a 5-year plan to build natural language processing (NLP) resources and other infrastructure for the Icelandic language. The project focuses on collecting data and developing open source software for a range of core applications, such as tokenization, vocabulary lookup, n-gram statistics, part-of-speech tagging, named entity recognition, spelling and grammar checking, neural language models and speech processing.
My name is Vilhjálmur Þorsteinsson, and I’m the founder and CEO of a software startup Miðeind in Reykjavík, Iceland, that employs 10 software engineers and linguists and focuses on NLP and AI for the Icelandic language. The company participates in the government’s language technology program, and has contributed significantly to the program’s core tools (e.g., a tokenizer and a parser), spelling and grammar checking modules, and a neural machine translation stack.
When it came to a choice of programming languages and development tools for the government program, the requirements were for a major, well supported, vendor-and-OS-agnostic FOSS platform with a large and diverse community, including in the NLP space. The decision to select Python as a foundational language for the project was a relatively easy one. That said, there was a bit of trepidation around the well known fact that CPython can be slow for inner-core tasks, such as tokenization and parsing, that can see heavy workloads in production.
I first became aware of PyPy in early 2016 when I was developing a crossword game Netskrafl in Python 2.7 for Google App Engine. I had a utility program that compressed a dictionary into a Directed Acyclic Word Graph and was taking 160 seconds to run on CPython 2.7, so I tried PyPy and to my amazement saw a 4x speedup (down to 38 seconds), with literally no effort besides downloading the PyPy runtime.
This led me to select PyPy as the default Python interpreter for my company’s Python development efforts as well as for our production websites and API servers, a role in which it remains to this day. We have followed PyPy’s upgrades along the way, being just about to migrate our minimally required language version from 3.6 to 3.7.
In NLP, speed and memory requirements can be quite important for software usability. On the other hand, NLP logic and algorithms are often complex and challenging to program, so programmer productivity and code clarity are also critical success factors. A pragmatic approach balances these factors, avoids premature optimization and seeks a careful compromise between maximal run-time efficiency and minimal programming and maintenance effort.
Turning to our use cases, our Icelandic text tokenizer "Tokenizer" is fairly light, runs tight loops and performs a large number of small, repetitive operations. It runs very well on PyPy’s JIT and has not required further optimization.
Our Icelandic parser Greynir (known on PyPI as reynir) is, if I may say so myself, a piece of work. It parses natural language text according to a hand-written context-free grammar, using an Earley-type algorithm as enhanced by Scott and Johnstone. The CFG contains almost 7,000 nonterminals and 6,000 terminals, and the parser handles ambiguity as well as left, right and middle recursion. It returns a packed parse forest for each input sentence, which is then pruned by a scoring heuristic down to a single best result tree.
This parser was originally coded in pure Python and turned out to be unusably slow when run on CPython - but usable on PyPy, where it was 3-4x faster. However, when we started applying it to heavier production workloads, it became apparent that it needed to be faster still. We then proceeded to convert the innermost Earley parsing loop from Python to tight C++ and to call it from PyPy via CFFI, with callbacks for token-terminal matching functions (“business logic”) that remained on the Python side. This made the parser much faster (on the order of 100x faster than the original on CPython) and quick enough for our production use cases. Even after moving much of the heavy processing to C++ and using CFFI, PyPy still gives a significant speed boost over CPython.
Connecting C++ code with PyPy proved to be quite painless using CFFI, although we had to figure out a few magic incantations in our build module to make it compile smoothly during setup from source on Windows and MacOS in addition to Linux. Of course, we build binary PyPy and CPython wheels for the most common targets so most users don’t have to worry about setup requirements.
With the positive experience from the parser project, we proceeded to take a similar approach for two other core NLP packages: our compressed vocabulary package BinPackage (known on PyPI as islenska) and our trigrams database package Icegrams. These packages both take large text input (3.1 million word forms with inflection data in the vocabulary case; 100 million tokens in the trigrams case) and compress it into packed binary structures. These structures are then memory-mapped at run-time using mmap and queried via Python functions with a lookup time in the microseconds range. The low-level data structure navigation is done in C++, called from Python via CFFI. The ex-ante preparation, packing, bit-fiddling and data structure generation is fast enough with PyPy, so we haven’t seen a need to optimize that part further.
To showcase our tools, we host public (and open source) websites such as greynir.is for our parsing, named entity recognition and query stack and yfirlestur.is for our spell and grammar checking stack. The server code on these sites is all Python running on PyPy using Flask, wrapped in gunicorn and hosted on nginx. The underlying database is PostgreSQL accessed via SQLAlchemy and psycopg2cffi. This setup has served us well for 6 years and counting, being fast, reliable and having helpful and supporting communities.
As can be inferred from the above, we are avid fans of PyPy and commensurately thankful for the great work by the PyPy team over the years. PyPy has enabled us to use Python for a larger part of our toolset than CPython alone would have supported, and its smooth integration with C/C++ through CFFI has helped us attain a better tradeoff between performance and programmer productivity in our projects. We wish for PyPy a great and bright future and also look forward to exciting related developments on the horizon, such as HPy.