Principle of operation

This section attempts to give a basic sense of how greenback works.

Async/await basics and limitations

Python’s async/await syntax goes to some lengths to look like normal straight-line Python code: the async version of some logic looks pretty similar to the threaded or non-concurrent version, just with extra async and await keywords. Under the hood, though, an async callstack is represented as something very much like a generator. When some async function in the framework you’re using needs to suspend the current callstack and run other code for a bit, it effectively yields an object that tells the event loop about its intentions. The exact nature of these “traps” is a private implementation detail of the particular framework you’re using. For example, asyncio yields Futures, curio yields specially formatted tuples, and Trio (currently) yields internal objects representing the two primitive operations cancel_shielded_checkpoint() and wait_task_rescheduled(). For much more detail on how async/await works (explained much more approachably, too!), see the excellent writeup by Brett Cannon: How the heck does async/await work in Python 3.5?
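
To make that concrete, here is a toy awaitable standing in for a framework’s private trap object, with the coroutine driven by hand via send() the way an event loop would. (All class and function names here are invented for illustration.)

    class Sleepy:
        # Toy stand-in for a framework's private trap object: it yields
        # a request up to whatever code is driving the coroutine.
        def __await__(self):
            return (yield ("sleep", 1.0))

    async def work():
        await Sleepy()
        return "done"

    coro = work()
    print(coro.send(None))   # -> ("sleep", 1.0): the trap surfaces here
    try:
        coro.send(None)      # the "event loop" resumes the task
    except StopIteration as exc:
        print(exc.value)     # -> "done"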

Common to both async functions and generators is the limitation that a yield can only suspend the function that directly contains it. If you want the yielded value to propagate multiple levels up the callstack – all the way to the event loop, in the async case – you need cooperation at each level, in the form of an await (in an async function) or yield from (in a generator) statement. This means every point where execution might be suspended is marked in the source code, which is a useful property. It also means that adding some I/O (which might block, and thus needs to be able to suspend execution) at the bottom of a long chain of formerly-synchronous functions requires adding async and await keywords at every level of the chain. That property is sometimes problematic. To be sure, doing that work can reveal important bugs (read Unyielding if you don’t believe it); but sometimes it’s simply an infeasible amount of work to do all at once, especially if you need to interoperate with a large project and/or external dependencies that weren’t written to support async/await.
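
The following runnable sketch (using Trio; the function names are invented) shows that propagation requirement in action: once the bottom of the chain does real async I/O, every level above it must grow async and await keywords.

    import trio

    async def fetch_bytes():
        await trio.sleep(0.1)      # stand-in for I/O that may block
        return b"\x00\x01"

    async def parse_record():
        # This used to be a plain function; once fetch_bytes() became
        # async, it had to become async too...
        return len(await fetch_bytes())

    async def main():
        # ...and so does every caller above it, all the way up to the
        # event loop.
        print(await parse_record())

    trio.run(main)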

“Reentering the event loop” (letting a regular synchronous function call an async function using the asynchronous context that exists somewhere further up the callstack) has therefore been a popular Python feature request, but unfortunately it’s fundamentally at odds with how generators and async functions are implemented internally. The CPython interpreter uses a few levels of the C call stack to implement each level of the running Python call stack, as is natural. The Python frame objects and everything they reference are allocated on the heap, but the C stack is still used to track which functions called which. Since C doesn’t have any support for suspending and resuming a callstack, a Python yield necessarily turns into a C return, and the later resumption of the generator or async function is accomplished by a fresh C-level call to the frame-evaluation function (using the same frame object as before). Yielding out of a 10-level callstack requires 10 levels of C return, and resuming it requires 10 fresh nested calls to the frame-evaluation function. This strategy requires the Python interpreter to be aware of each place where execution of a generator or async function might be suspended, and both performance and comprehensibility argue against allowing such suspensions to occur in every operation.

Sounds like we’re out of luck, then. Or are we?

Greenlets: a different approach

Before Python had async/await or yield from, before it even had context managers or generators with send() and throw() methods, a third-party package called greenlet provided support for a very different way of suspending a callstack. This approach required no interpreter support and no special keywords, because it worked at the C level, copying parts of the C stack to and from the heap in order to implement suspension and resumption. It relied on deep architecture-specific magic and gave rise to a number of subtle bugs, but those have generally been worked out in the years since the first release in 2006, and the package is now considered quite stable.

The greenlet package has spawned its own fair share of concurrency frameworks such as gevent and eventlet. If those meet your needs, you’re welcome to use them and never give async/await another glance. For our purposes, though, we’re interested in using just the greenlet primitive: the ability to suspend a callstack of ordinary synchronous functions that haven’t been marked with any special syntax.

Using greenlets to bridge the async/sync divide

From the perspective of someone writing an async function, your code is the only thing running in your thread until you yield, yield from, or await, at which point you are suspended until your top-level caller (such as the async event loop) sees fit to resume you. From the event loop’s perspective, the situation is reversed: each “step” of an async task (represented by a send() call on the coroutine object) cedes control to the task until the task yields control back to the event loop. Our goal is to allow something other than a yield statement to make this send() call return.

greenback achieves this by introducing a “shim” coroutine in between the async event loop and your task’s “real” coroutine. Whenever the event loop wants to run the next step of your task, it runs a step of this “shim” coroutine, which creates a greenlet for running the next step of the underlying “real” coroutine. This greenlet terminates when the “real” send() call does (corresponding to a yield statement inside your async framework), but because it’s a greenlet, we can also suspend it at a time of our choosing even in the absence of any yield statements. greenback.await_() makes use of that capability by repeatedly calling send() on its argument coroutine, using the greenlet-switching machinery to pass the yielded traps up to the event loop and get their responses sent back down.
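
For the curious, here is a heavily simplified, runnable sketch of those two pieces. It is not greenback’s actual implementation (which also handles throw(), close(), exception propagation, non-coroutine awaitables, and more); every name here is invented, and the manual driving at the bottom stands in for a real event loop.

    import types
    import greenlet

    @types.coroutine
    def _shim(orig_coro):
        # The shim coroutine that sits between the event loop and the task.
        next_send = None
        child = None
        while True:
            if child is None or child.dead:
                # Fresh greenlet to run one step (one send()) of the
                # real coroutine.
                child = greenlet.greenlet(orig_coro.send)
            try:
                trap = child.switch(next_send)
            except StopIteration as exc:
                return exc.value      # the real coroutine finished
            # We got here either because send() returned normally (a real
            # yield inside the framework) or because await_() switched out
            # mid-step. Either way, pass the trap up to the event loop and
            # feed its response back in on the next iteration.
            next_send = yield trap

    def await_(awaitable):
        # Must be called somewhere inside a greenlet created by _shim.
        coro = awaitable.__await__()
        parent = greenlet.getcurrent().parent
        next_send = None
        while True:
            try:
                trap = coro.send(next_send)
            except StopIteration as exc:
                return exc.value
            # Suspend this entire synchronous callstack; _shim will yield
            # the trap upward and later switch back in with the response.
            next_send = parent.switch(trap)

    # Demo: drive the shim by hand, the way an event loop would.
    class Trap:
        def __await__(self):
            return (yield "please send me a number")

    def sync_helper():
        # An ordinary synchronous function that nonetheless awaits:
        return await_(Trap())

    async def task():
        return sync_helper() + 1

    shim = _shim(task())
    print(shim.send(None))        # -> "please send me a number"
    try:
        shim.send(41)
    except StopIteration as exc:
        print(exc.value)          # -> 42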

Once you understand the approach, most of the remaining trickery is in the answer to the question: “how do we install this shim coroutine?” In Trio you can directly replace trio.lowlevel.Task.coro with a wrapper of your choosing, but in asyncio the ability to modify the analogous field is not exposed publicly, and on CPython it’s not even exposed to Python at all. It’s necessary to use ctypes to edit the coroutine pointer in the C task object, and fix up the reference counts accordingly. This works well once the details are ironed out. There’s some additional glue to deal with exception propagation, non-coroutine awaitables, and so forth, but the core of the implementation is not very much changed from the gist sketch that originally inspired it.

What can I do with this, anyway?

You really can switch greenlets almost anywhere, which means you can call greenback.await_() almost anywhere too. For example, you can perform async operations in magic methods: operator implementations, property getters and setters, object initializers, you name it. You can use combinators that were written for synchronous code (map, filter, sum, everything in itertools, and so forth) with asynchronous operations (though they’ll still execute serially within that call – the only concurrency you obtain is with other async tasks). You can use libraries that support a synchronous callback, and actually run async code inside the callback. And when this async code blocks waiting for something, it will play nice and allow all your other async tasks to run.
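
For instance, here is the synchronous combinator pipeline sum(map(...)) driving async operations. This sketch uses greenback’s documented API (ensure_portal() and await_()); the helper function names are invented.

    import trio
    import greenback

    async def fetch_square(n):
        await trio.sleep(0.01)     # stand-in for real async I/O
        return n * n

    def sync_square(n):
        # An ordinary synchronous function, usable with map()/sum()/etc.,
        # that nonetheless performs an async operation:
        return greenback.await_(fetch_square(n))

    async def main():
        await greenback.ensure_portal()   # make await_() legal in this task
        # The calls run serially within sum(), but other tasks can run
        # whenever one of them is blocked.
        print(sum(map(sync_square, range(5))))   # -> 30

    trio.run(main)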

You may find greenback.autoawait useful in some of these situations: it’s a decorator that turns an async function into a synchronous one. There are also greenback.async_context and greenback.async_iter for sync-ifying async context managers and async iterators, respectively.
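
A brief sketch of these helpers, assuming a portal has already been set up in the current task (e.g. via greenback.ensure_portal()); the tick() generator is invented for demonstration:

    import trio
    import greenback

    @greenback.autoawait
    async def pause():
        await trio.sleep(1)

    async def tick(n):
        # Minimal async iterator for demonstration purposes.
        for i in range(n):
            await trio.sleep(0)
            yield i

    async def main():
        await greenback.ensure_portal()
        pause()   # a plain synchronous call, but it really awaits

        # Sync-ify an async iterator:
        for i in greenback.async_iter(tick(3)):
            print(i)   # 0, 1, 2

    trio.run(main)

greenback.async_context works the same way for async context managers, letting synchronous code use a plain with statement where async with would normally be required.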

If you’re feeling reckless, you can even use greenback to run async code in places you might think impossible, such as finalizers (__del__ methods), weakref callbacks, or perhaps even signal handlers. This is not recommended (your async library will not be happy, to put it mildly, if the signal arrives or GC occurs in the middle of its delicate task bookkeeping) but it seems that you can get away with it some reasonable fraction of the time. Don’t try these in production, though!

All of these are fun to play with, but in most situations the ergonomic benefit is not going to be worth the “spooky action at a distance” penalty. The real benefits probably come mostly when working with large established non-async projects. For example, you could write a pytest plugin that surrounds the entire run in a call to trio.run(), with greenback.await_() used at your leisure to escape back into a shared async context. Perhaps this could allow running multiple async tests in parallel in the same thread. At this point such things are only vague ideas, which may well fail to work out. The author’s hope is that greenback gives you the tool to pursue whichever ones seem worthwhile to you.

What’s the performance impact?

Running anything with a greenback portal available incurs some slowdown, and actually using await_() incurs some more, but neither cost is extreme.

The slowdown due to greenback is mostly proportional to the number of times you yield to the event loop with a portal active, as well as the number of portal creations and await_() calls you perform. You can run the microbenchmark.py script from the Git repository to see the numbers on your machine. On a 2023 MacBook Pro (ARM64), with CPython 3.12, greenlet 3.0.3, and Trio 0.24.0, I get:

  • Baseline: The simplest possible async operation is what Trio calls a checkpoint: yield to the event loop and ask to immediately be rescheduled again. This takes about 13.6 microseconds on Trio and 12.9 microseconds on asyncio. (asyncio is able to take advantage of some C acceleration here.)

  • Adding the greenback portal, without making any await_() calls yet, adds about 1 microsecond per checkpoint.

  • Executing each of those checkpoints through a separate await_() adds about another 2 microseconds per await_(). (Surrounding the entire checkpoint loop in a single await_(), by contrast, has negligible impact.) A sketch of this kind of measurement loop appears after the list.

  • Creating a new portal for each of those await_(checkpoint()) invocations adds another 16 microseconds or so per portal creation. If you use with_portal_run_sync(), portal creation gets about 10 microseconds faster (so the portal is only adding about 6 microseconds of overhead).
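
Here is a rough sketch of the kind of loop these numbers describe; the repository’s microbenchmark.py is the real source, and this minimal version times only the await_()-per-checkpoint case.

    import time
    import trio
    import greenback

    async def main():
        N = 100_000
        await greenback.ensure_portal()
        start = time.perf_counter()
        for _ in range(N):
            # One full await_() per checkpoint: the most expensive of the
            # per-checkpoint variations measured above.
            greenback.await_(trio.lowlevel.checkpoint())
        elapsed = time.perf_counter() - start
        print(f"{elapsed / N * 1e6:.2f} us per await_(checkpoint())")

    trio.run(main)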

Keep in mind that these are microbenchmarks: your actual program is probably not executing checkpoints in a tight loop! The more work you’re doing each time you’re scheduled, the less overhead greenback will entail.