A dive into packaging native python extensions
The complete guide to building your own native wheel from scratch
There are cases where you want to extend python with native code, e.g. for scientific computing (numpy, scipy), database connectors (mysqlclient, psycopg2) or UI (pygobject, pyqt). For cpython this is traditionally done in C/C++, but you can also use the C api from D (pyd), go (cffi) or rust (cffi or pyo3).
Distributing those extensions is a big problem. Until recently, the only viable option was to write special plugins for setuptools, e.g. milksnake for cffi or setuptools-rust for pyo3. Inspired by the new pyproject.toml, I wanted to get rid of the flaws and resulting pain of setuptools. So I went to write pyo3-pack, which aims at making packaging and publishing native python modules in rust as easy as wasm-pack makes it for javascript.
It turns out that writing such a tool is relatively easy (less than a thousand lines of rust to get from source to wheel). The hard part is to find out what you need to do in the first place. The documentation is scattered across different and partially outdated tutorials, PEPs, stack overflow answers, references, examples and source code; I sometimes even had to resort to reverse engineering. So I decided to write down everything I learned about native wheels, which eventually became this blog post.
The good parts
The official tutorial on native modules, Extending Python with C or C++, is a good introduction to the core concepts of native modules: The header files, the calling conventions, GC, the object protocol and error handling. It only shows building for C/C++ with distutils (the predecessor to setuptools) though, omits the officially blessed manylinux, and lacks an explanation of the abi and linking options (more on those later).
For the daily work, the Python/C API Reference Manual is often much better. It also has some explanation for the ABI.
For the rest of the post I’ll assume that you have built your native python module as a shared library (e.g. one exporting a PyInit_<modname> function for python 3) with your technology of choice.
Metadata
Each python package, whether it is an egg or a wheel or a source archive, is described by structured metadata, which contains fields required for pip to work and informational fields used e.g. for pypi.
There are five versions for the metadata of python packages: 1.0 (PEP 241), 1.1 (PEP 314), 1.2 (PEP 345), 2.0 (PEP 426) and 2.1 (PEP 566).
2.0 was an attempt to replace the key-value structure of the metadata with a json-like structure. This could have been a big improvement, but it was withdrawn (and is not accepted by pypi or pip) since it would have been too big a breakage. This is why the current version is called 2.1, even though it is backwards compatible to 1.0.
The current specification can be found at PyPA’s Core metadata specifications page, which is pretty self-explaining and worth reading.
N.B.: https://www.pypa.io/en/latest/roadmap/ is completely outdated as it still features Metadata 2.0 as part of the roadmap. https://packaging.python.org/specifications/core-metadata/#description is misleading since you must not use the RFC 822 in the metadata for the pypi upload (see the section on uploading) and for the METADATA file inside the wheel you can just put the description in the body, i.e. after all the keys.
Tags and naming
Native modules need to specify with which platforms and python interpreters they are compatible. Python has two coexisting standards with slightly different syntax: PEP 425 for packages and PEP 3149 for shared libraries. Both are based on abi tags, so let’s discuss them first.
The cpython ABI
The cpython abi is composed of the major and minor version of cpython and a set of abiflags, which are determined by compiler flags. According to PEP 3149 there are three such compile time options we need to consider (at least for linux and mac):
d: --with-pydebug
m: --with-pymalloc
u: --with-wide-unicode
For practical purposes, d is irrelevant, m is always set and u may or may not be set - more on u below. The tag for this abi is cp{major}{minor}{abiflags} or cpython-{major}{minor}{abiflags}. My python 3.6 installation is for example cp36m and cpython-36m.
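As a small illustration of the two tag spellings, here is a sketch that builds both from the version and the abiflags. The function name cpython_abi_tags is made up for this example, and the fallback to an empty string is an assumption for interpreters that don't expose sys.abiflags (such as windows builds).

```python
import sys


def cpython_abi_tags(major, minor, abiflags):
    """Return the (wheel-style, PEP 3149-style) abi tags for a cpython build."""
    return (
        "cp{}{}{}".format(major, minor, abiflags),
        "cpython-{}{}{}".format(major, minor, abiflags),
    )


# sys.abiflags is not available everywhere, so fall back to an empty string
current = cpython_abi_tags(
    sys.version_info.major, sys.version_info.minor, getattr(sys, "abiflags", "")
)
print(current)
```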
The u or wide-unicode flag is about the representation of unicode characters (introductory article). Initially, python unicode characters were fixed to two bytes (UCS-2), meaning that any 3 or 4 byte characters were not representable. This changed with PEP 261, which added optional support for wide unicode characters (UCS-4) to python2. The choice between UCS-2 and UCS-4 was made a compile time option, creating the abi without “u” for UCS-2 and one with “u” for UCS-4 (ignoring the option to completely disable unicode). In python 3.3 this was replaced by a system that determines the representation at runtime described in PEP 393, removing the “u” flag from the abi. This means that the wide-unicode option is only relevant for backwards compatibility with python 2.
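If you only need to know whether a running interpreter is a wide build, one way that should work is to look at sys.maxunicode. This is a sketch with a made-up helper name, not what pyo3-pack does:

```python
import sys


def is_wide_unicode_build():
    # sys.maxunicode is 0xFFFF on narrow (UCS-2) builds and 0x10FFFF on wide
    # builds; since python 3.3 it is always 0x10FFFF
    return sys.maxunicode > 0xFFFF


print(is_wide_unicode_build())
```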
The stable abi
There are obviously some big drawbacks to having tons of different abis which you all need to support, build and test, so PEP 384 introduced the “stable abi” in python 3.2. This abi, with the tag abi3, contains a subset of the full abi that is forward compatible with all future 3.x releases of cpython. In the header files, everything that is not part of the stable abi is gated with #if !defined(Py_LIMITED_API).
The stable abi is extended from time to time, meaning that you can require the stable abi and a minimum version. In the header files this is done by setting Py_LIMITED_API to the minimum supported python version in the PY_VERSION_HEX format as described in the documentation. In the headers this is checked e.g. with #if !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x03030000 for a function that was added to the stable abi in python 3.3.
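To make the hex format a bit more concrete, here is a sketch that computes the value you would define Py_LIMITED_API to for a given minimum version. The helper name is hypothetical; it assumes you only care about major and minor and leave the micro, release level and serial parts at zero.

```python
import sys


def limited_api_hex(major, minor):
    """Py_LIMITED_API value for the minimum version major.minor."""
    # PY_VERSION_HEX layout: one byte major, one byte minor, then micro,
    # release level and serial, which are left at zero here
    return "0x{:02x}{:02x}0000".format(major, minor)


print(limited_api_hex(3, 3))
# sys.hexversion uses the same layout, so the two can be compared directly
print(sys.hexversion >= int(limited_api_hex(3, 3), 16))
```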
Sysconfig
In the initial version of this post, I wrote about getting the required information about the interpreter through sysconfig. But it turned out that sysconfig behaves inconsistently across python versions and operating systems. E.g. the VERSION field on linux is in the format {major}.{minor}, while it is {major}{minor} on windows (both with python 3.7). There’s also EXT_SUFFIX, which tells you the complete extension of the library filename on linux (e.g. ".cpython-35m-x86_64-linux-gnu.so"), but on windows it’s just .pyd. I’ve collected a few samples in a folder in the pyo3-pack repo. You’ll find more of those weird cases in there.
I’m currently using the following snippet with python -c and do the logic and sanity checks in rust.
import sysconfig
import sys
import json

print(json.dumps({
    "major": sys.version_info.major,
    "minor": sys.version_info.minor,
    "abiflags": sysconfig.get_config_var("ABIFLAGS"),
    "m": sysconfig.get_config_var("WITH_PYMALLOC") == 1,
    "u": sysconfig.get_config_var("Py_UNICODE_SIZE") == 4,
    "d": sysconfig.get_config_var("Py_DEBUG") == 1,
    # This one isn't technically necessary, but still very useful for sanity checks
    "platform": sys.platform,
}))
This is then deserialized into the equivalent of the following python 3.7 code:
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interpreter:
    major: int
    minor: int
    abiflags: Optional[str]
If you still want to use sysconfig, the easiest way is through python -m sysconfig. As seen above, you can use WITH_PYMALLOC (1 means m), Py_UNICODE_SIZE (4 means u) and Py_DEBUG (1 would mean d) for the python 2 abiflags.
To get the flags in machine readable form as a Dict[str, Union[str, int]], use:
python -c "import json, sysconfig; print(json.dumps(sysconfig.get_config_vars()))"
For a Dict[str, str], use:
python -c "import json, sysconfig; print(json.dumps({k:str(v) for k, v in sysconfig.get_config_vars().items()}))"
Naming shared libraries
PEP 3149 defines that shared libraries will get a tag between the file name and the extension, separated by a dot. It tells you that this tag needs to include at least the implementation (i.e. cpython) with its major and minor version. It also shows .cpython-32mu.so as an example for such a file extension, from which we can derive .cpython-{major}{minor}{abiflags}.so as a template.
This sounds nice, but is extremely misleading if not plainly wrong in reality.
From picking apart other native libraries and trial and error with filenames I figured the following:
- Python 2.7 - 3.2 doesn’t have any abitags.
- Python 3.2 - 3.4 actually use the scheme .cpython-{major}{minor}{abiflags}.so for POSIX (i.e. linux and mac), but accept files without a tag. Windows still doesn’t use tags.
- Python 3.5+ uses a new scheme with the platform included, which is now also used for windows. 3.5+ also accepts files without any tag, but not those with a 3.2 - 3.4 style tag.
The only place the new, 3.5+ scheme has ever been announced is the python 3.5 release notes. But rejoice, even those are wrong. (I tried googling both the wrong and the correct version, but it really seems to be only in those release notes.)
For 3.5+, I found that the following is what’s actually working (and also what setuptools produces):
Linux
Template: .cpython-{major}{minor}{abiflags}-{architecture}-{os}.so
architecture is either i386 or x86_64, and os is linux-gnu. The release notes state that the file extension is .pyd, which is wrong and doesn’t work in practice. Also note that os has an internal minus, breaking the general rule of separating parts of the tag with a minus.
Example: steinlaus.cpython-35m-x86_64-linux-gnu.so
Mac OS
Template: .cpython-{major}{minor}{abiflags}-darwin.so
Example: steinlaus.cpython-35m-darwin.so
Windows
Template: {name}.cp{major}{minor}-{platform}.pyd
The platform is either win_amd64 or win32. .pyd files are just renamed .dll files, which is confirmed in the official windows FAQ (which is otherwise extremely outdated).
Example: steinlaus.cp35-win_amd64.pyd
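The three templates can be sketched as one function. This is only an illustration of the observations above for python 3.5+, not a spec; the function name and the target strings used to select a template are made up for this example.

```python
def shared_library_name(module, major, minor, abiflags, target):
    """Filename of a native module for cpython 3.5+, per the templates above.

    target is a hypothetical selector: "linux-x86_64", "linux-i386",
    "darwin", "win_amd64" or "win32".
    """
    if target.startswith("linux-"):
        architecture = target[len("linux-"):]
        return "{}.cpython-{}{}{}-{}-linux-gnu.so".format(
            module, major, minor, abiflags, architecture
        )
    if target == "darwin":
        return "{}.cpython-{}{}{}-darwin.so".format(module, major, minor, abiflags)
    if target in ("win_amd64", "win32"):
        # The windows template carries no abiflags
        return "{}.cp{}{}-{}.pyd".format(module, major, minor, target)
    raise ValueError("unsupported target: {}".format(target))


print(shared_library_name("steinlaus", 3, 5, "m", "linux-x86_64"))
```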
Naming wheels
The documentation for defining wheels is much better than the one for naming .so files, with most parts being specified in PEP 425.
The official schema from that PEP is {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl, which is used for all python versions. The distribution is your package’s name escaped with re.sub("[^\w\d.]+", "_", distribution, re.UNICODE). We can ignore and skip the build tag. The python tag for our case is cp{major}{minor}, and the abi tag is either cp{major}{minor}{abiflags}, abi3 or none.
For the platform tag it states that “The platform tag is simply distutils.util.get_platform() with all hyphens - and periods . replaced with underscore _.”. This is unfortunate, since the output of distutils.util.get_platform() isn’t specified, so we need to reverse engineer. Looking only at 32-bit and 64-bit x86, we have either win_amd64 or win32 for windows. For linux, we have linux_i686 or linux_x86_64, even though in practice we must use either manylinux1_i686 or manylinux1_x86_64 as described in the manylinux paragraph below. For mac the tag used by setuptools is macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64 for whatever reason.
Examples:
steinlaus-1.0.0-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
steinlaus-1.0.0-cp36-cp36m-manylinux1_x86_64.whl
steinlaus-1.0.0-cp36-cp36m-win_amd64.whl
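Putting the schema together could look like the following sketch. The function names are made up, and the optional build tag is left out. Note that re.UNICODE is passed as the flags keyword here, because the fourth positional argument of re.sub is count; that is my adjustment, not something the PEP spells out.

```python
import re


def escape_distribution(distribution):
    # Same pattern as in the schema description; re.UNICODE goes in as
    # `flags` since the fourth positional argument of re.sub is `count`
    return re.sub(r"[^\w\d.]+", "_", distribution, flags=re.UNICODE)


def wheel_filename(distribution, version, python_tag, abi_tag, platform_tag):
    """Wheel filename without the optional build tag."""
    return "{}-{}-{}-{}-{}.whl".format(
        escape_distribution(distribution), version, python_tag, abi_tag, platform_tag
    )


print(wheel_filename("steinlaus", "1.0.0", "cp36", "cp36m", "manylinux1_x86_64"))
```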
Manylinux
Libraries and binaries on Linux are traditionally (for better or worse) dynamically linked to libraries in $LD_LIBRARY_PATH, which are installed through the system’s package manager. Native modules could require arbitrary versions of arbitrary libraries, but can’t guarantee they are installed on the target machine, leading to linker errors when importing. To avoid such incompatibilities, PEP 513 specifies a target manylinux1 which contains only a set of old versions of libraries that can be found on basically every Linux. (This is an extremely short summary of the rationale in the PEP.)
Wheels for the manylinux1 target must be built in the manylinux1 docker container. This container is based on CentOS 5, i.e. some very old Linux. Using this docker image is the only officially blessed way to build for the linux target in general. Pypi only accepts wheels with the manylinux1 tag and rejects those with a linux tag. A slightly more modern target, manylinux2010, is currently being worked on as a successor for manylinux1 (PEP 571 - The manylinux2010 Platform Tag, tracking issue).
manylinux is accompanied by a tool called auditwheel that checks the library and then “awards” the manylinux1 tag. Afaik this is not checked by pypi, so it’s possible to lie about that check.
By default rust only links very few system libraries, which are a subset of the manylinux1 target. This means that pyo3-pack only needs to check that the constraints are met and we can otherwise totally skip the whole ancient-docker-mess.
The internals of a (binary) wheel
While there are alternative ways to install python packages, using wheels with pip is (for good reasons) the officially blessed one, so for pyo3-pack I’ve only looked into building wheels. They are specified in PEP 427.
Wheels are generally just zip files with a .whl extension. They come in two flavors: sdist and bdist. bdist (“built distribution”) wheels are pre-built packages. Their installation is mostly just unpacking the archive. They specify the compatible python version(s), an abi and a platform. sdist (“source distribution”) wheels contain all the sources including your setup.py or pyproject.toml, so for installing them, they need to be built first.
Every wheel contains a {distribution}-{version}.dist-info folder with the following files inside it, where {distribution} is again the name with the underscore-escapes.
- WHEEL:
  Wheel-Version: 1.0
  Generator: pyo3-pack ({version})
  Root-Is-Purelib: false
  Tag: {python tag}-{abi tag}-{platform tag}
- METADATA: This file contains the metadata as described above. Since metadata 2.1, you can (and want to) put the description in the body of the file, separated from the key value pairs by a newline. The only required keys are Metadata-Version, Name and Version.
  Metadata-Version: 2.1
  Name: {name}
  Version: {version}
  Summary: {summary or UNKNOWN}

  {description / content of readme.md}
- RECORD: This file contains checksums and sizes of all files. Each line contains a file, a hash and the size of the file in bytes separated by commas like the following:
  path/to/file,sha256=HASH-AS-URLSAFE-BASE64-NOPAD,1234
  The only exception is the record file itself, for which hash and size are left blank:
  {name}-{version}.dist-info/RECORD,,
  The exact format is described in PEP 376, while PEP 427 adds that the hashing algorithm must be “sha256 or better”.
- entry_points.txt: This file isn’t specified in any PEP, but in the Entry points specification. It contains sections with key-value pairs in the ini format. While there’s more it can do, the interesting part is a section called console_scripts. This section lists functions which should be exposed as shell commands. The keys are the commands, while the value specifies which function to call. Pip will create the scripts, which are small wrappers around the functions, when installing the package. The functions have the structure some.module.path:object.attr. E.g. poetry defines
  [console_scripts]
  poetry=poetry.console:main
  which pip translates to
  #!/usr/bin/python3
  # -*- coding: utf-8 -*-
  import re
  import sys

  from poetry.console import main

  if __name__ == '__main__':
      sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
      sys.exit(main())
- top_level.txt: Setuptools also adds this file, which contains only the name of your package. This is part of the (PEP-less) egg format, the predecessor of wheels, as described in The Internal Structure of Python Eggs. This file is not documented and not needed for wheels and therefore not added by other packagers such as poetry. (Interestingly enough, the wheel repository, which adds the bdist_wheel command to setuptools, does not even contain the string top_level.txt.)
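The RECORD line format from the list above can be sketched in a few lines. The helper name record_line is made up; the sketch assumes plain paths without commas or quotes, which would otherwise need csv-style escaping.

```python
import base64
import hashlib


def record_line(path_in_wheel, content):
    """One RECORD line for a file with the given bytes content."""
    digest = hashlib.sha256(content).digest()
    # urlsafe base64 with the trailing "=" padding stripped
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return "{},sha256={},{}".format(path_in_wheel, encoded, len(content))


print(record_line("get_fourtytwo/__init__.py", b"from .native_fourtytwo import native_class\n"))
```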
For the actual package you have two options:
If you only need to package one shared library, you put it at the top level of the zip. The shared library must be named according to the rules described above, while the basename must be the name of the module.
Example:
.
├── get_fourtytwo-1.6.8.dist-info
│ ├── METADATA
│ ├── RECORD
│ └── WHEEL
└── get_fourtytwo.cpython-36m-x86_64-linux-gnu.so
For any wheels containing python files, whether they have native components or not, the top level module is a python module. This means a directory at the top level with the name of the module and a __init__.py inside that directory. Inside this directory the same rules as for any other python project apply. Native modules work the same way as pure python single file modules, only that the filenames end with .so or .pyd instead of .py. Take a look at numpy’s wheels for a complex, real world scenario.
Example:
.
├── get_fourtytwo
│ ├── __init__.py
│ ├── native_fourtytwo.cpython-36m-x86_64-linux-gnu.so
│ └── python_fourtytwo.py
└── get_fourtytwo-1.6.8.dist-info
    ├── METADATA
    ├── RECORD
    └── WHEEL
… where __init__.py contains
from .native_fourtytwo import native_class
from .python_fourtytwo import some_class
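To tie the pieces together, here is a rough sketch of writing such a wheel with nothing but the standard library. It is not how pyo3-pack is implemented (that is rust); write_wheel and its parameters are invented for this example, the Generator value is a placeholder, and name is assumed to be already escaped.

```python
import base64
import hashlib
import zipfile


def _record_entry(path, content):
    digest = base64.urlsafe_b64encode(hashlib.sha256(content).digest()).rstrip(b"=")
    return "{},sha256={},{}".format(path, digest.decode("ascii"), len(content))


def write_wheel(target, name, version, tag, files):
    """Write a wheel to target (a path or file object).

    files maps paths inside the archive to their bytes content.
    """
    dist_info = "{}-{}.dist-info".format(name, version)
    files = dict(files)
    files[dist_info + "/WHEEL"] = (
        "Wheel-Version: 1.0\nGenerator: example (0.0.0)\n"
        "Root-Is-Purelib: false\nTag: {}\n".format(tag)
    ).encode("utf-8")
    files[dist_info + "/METADATA"] = (
        "Metadata-Version: 2.1\nName: {}\nVersion: {}\n".format(name, version)
    ).encode("utf-8")
    record = [_record_entry(path, content) for path, content in sorted(files.items())]
    # The RECORD file lists itself without hash and size
    record.append(dist_info + "/RECORD,,")
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, content in sorted(files.items()):
            archive.writestr(path, content)
        archive.writestr(dist_info + "/RECORD", "\n".join(record) + "\n")
```

The shared library itself would simply be one more entry in files, read from disk as bytes and stored under its tagged filename.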
Besides the presented wheel 1.0 format, PEP 491 defining a “Wheel 1.9” format also exists. It is officially in draft status, but it seems completely abandoned, with no mention either on the mailing list or in the relevant github repos. The PEP doesn’t explain why version 1.9 should follow version 1.0.
Note that you can lie to pypi about the metadata. E.g. I actually ran into a case, where a .tar.gz was uploaded as 3.0.{date}, while the installed package identified itself as 3.0.dev0, which didn’t exist on pypi. This effectively broke pip freeze.
Source distributions
Source distributions, sdist for short, are special source archives that can be built and installed with pip. They are used e.g. when there are no wheels for the current platform/abi and as base for building debian or fedora packages. While they have existed for a longer time, they are formally specified in PEP 517. This PEP differentiates between a source tree, which would be the git repository, and a source distribution, which this paragraph is about.
A source distribution is a .tar.gz archive. It is explicitly stated that zip archives are not allowed anymore, even though it mentions lxml-3.4.4.zip as an example in the beginning. The filename is {name}-{version}.tar.gz. The archive contains one folder, which is named {name}-{version}. This folder contains the required source, a setup.py and/or a pyproject.toml, and a file called PKG-INFO which is identical to the METADATA file in wheels.
Example:
foobar-0.11.2/
├── foobar
│ ├── __init__.py
│ └── main.py
├── LICENSE
├── PKG-INFO
├── pyproject.toml
└── setup.py
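The layout is simple enough to produce with the tarfile module. The following is only a sketch of the archive structure described above, with an invented write_sdist helper; it does not generate PKG-INFO for you, that has to be part of files.

```python
import io
import tarfile


def write_sdist(fileobj, name, version, files):
    """Write a .tar.gz source distribution to fileobj.

    files maps paths relative to the {name}-{version} folder to bytes.
    Returns the filename the archive should be saved under.
    """
    root = "{}-{}".format(name, version)
    with tarfile.open(fileobj=fileobj, mode="w:gz") as archive:
        for path, content in sorted(files.items()):
            info = tarfile.TarInfo("{}/{}".format(root, path))
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return root + ".tar.gz"
```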
If the archive contains a pyproject.toml with a [build-system] section that specifies a list of packages required for building as requires and the path to a build backend object in build-backend, this backend should be called by pip to build the source distribution into a wheel. pip 10.0.1, which is the latest version as of this writing, refuses to install such wheels stating “This version of pip does not implement PEP 517 so it cannot build a wheel without ‘setuptools’ and ‘wheel’.”. We can therefore skip any further details about the build backend because we can’t use it yet anyway.
Without a pyproject.toml with those entries, pip executes the setup.py in the directory, meaning that currently the way to support source distributions is to use setuptools, which is exactly what I wanted to avoid. This means no sdist in custom packagers for now.
As a side note, both flit and poetry already implement the PEP 517 interface (buildapi.py in flit and api.py in poetry) and add a pyproject.toml to the archive. But as they omit the [build-system] section, pip instead uses the setup.py they also create.
Finding python interpreters
It’s convenient for building and essential for testing to find the installed python versions. For linux and mac, you can check which python binaries are in PATH and then use the snippet from above to get the version and abiflags. (Or you can use a fixed list of 2.7 and 3.5+ and just try each because there’s no good library to work with PATH yet).
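From python itself, the fixed-list approach could be sketched with shutil.which. The candidate names below are an assumption (python2.7 plus python3.5 to python3.9); adjust them to whatever versions you care about.

```python
import shutil


def find_interpreters(candidates=None):
    """Return the paths of all candidate python binaries found in PATH."""
    if candidates is None:
        # Hypothetical fixed list: 2.7 and 3.5 up to 3.9
        candidates = ["python2.7"] + ["python3.{}".format(minor) for minor in range(5, 10)]
    found = []
    for name in candidates:
        path = shutil.which(name)
        if path is not None and path not in found:
            found.append(path)
    return found


print(find_interpreters())
```

Each path found this way can then be run with the snippet from the sysconfig section to get the version and abiflags.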
For windows, every python version is just called python.exe. Fortunately, there is a launcher called py. With -0 (but not --list, even if the help says otherwise) it will list all known versions, which you can launch with py -{version}. It’s then easy to get the path of the actual interpreter with py -{version} -c "import sys; print(sys.executable)".
Contemporary legacy uploading
Now that we’ve got our wheel built, we also want to publish it, i.e. upload it to pypi (which is now powered by a software called warehouse). It turns out that the api to upload packages is called the “legacy api”, even though there’s no new api for uploads (there is a json api, but it only supports reading package metadata). The upload part of the legacy api had no documentation other than “use twine”, so I read through the source of poetry uploader, warehouse’s endpoint and warehouse’s tests to figure out how to use that api. Eventually I wrote a pull request to warehouse documenting that api.
Errors
As mentioned in the preface, all the information presented here is assembled from many different sources of varying quality and up-to-dateness, with some parts being reverse-engineered. So if you find any errors or missing parts, please ping me (konstin@mailbox.org, konstin on github, @konstinx on twitter).