Numba: Accelerating Python Code with Simple Decorators

Python has two well-known characteristics: it is easy to write and slow to run. Developers love the clean syntax and rich ecosystem, but grimace at the overhead from dynamic typing, interpreted execution, and the GIL’s limits on multi-threading. Tools like Cython address this by compiling selected functions ahead of time, but most such tools require non-Pythonic syntax and are far from plug-and-play. This post introduces numba’s decorator-based approach to accelerating Python functions: a simple, flexible way to speed up Python code without changing how it looks.

Environment setup

To keep things simple, the development environment here uses conda in a virtual environment. The required packages are numba and numpy:

conda create -n numba-python python=3.7
conda activate numba-python
conda install numba numpy

Basic concepts

Compiled language, interpreted language, and JIT

High-level programs can be executed directly by an interpreter, or compiled to machine code first. Languages that use direct interpretation are called interpreted languages; their execution flow is roughly: source code → runtime → bytecode → machine code. Common examples include Perl, Python, MATLAB, and Ruby. Languages that require ahead-of-time (AoT) compilation are called compiled languages; their flow is: source code → machine code → runtime. Common examples include C and C++. Interpreted languages offer more flexibility since they can run directly, but they’re generally slower, especially for code that runs repeatedly, which is re-interpreted on every execution. JIT (Just-In-Time) compilation compiles code to machine code at runtime, combining the speed of compiled code with the flexibility of an interpreter, and allows adaptive optimization.

Python is a typical interpreted language. At runtime, the interpreter compiles source code to bytecode (cached in .pyc files for imported modules), then hands it to the Python virtual machine for interpretation. The bytecode doesn’t need to be regenerated if the source hasn’t changed, but the VM interpretation step still runs every time. With numba.jit, the first call compiles the function to machine code; subsequent calls with the same argument types skip the interpretation step entirely, reaching speeds comparable to compiled languages.
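To make the bytecode step concrete, here is a minimal sketch using the standard-library dis module. The function add_val is a hypothetical stand-in rather than one of the benchmark functions in this post, and the exact opcode names printed depend on the CPython version.

```python
import dis

def add_val(x, val):
    # A single Python-level addition; CPython compiles it to a few bytecode ops.
    return x + val

# The compiled bytecode is stored on the function's code object.
raw_bytecode = add_val.__code__.co_code

# Opcode names the virtual machine steps through on every call.
opnames = [ins.opname for ins in dis.get_instructions(add_val)]
print(opnames)
```

Every call to add_val walks through these instructions in the interpreter loop. numba.jit replaces that walk with a call into machine code compiled for the argument types it has seen.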

Decorators

A decorator is a common design pattern that adds new behavior to an existing object without changing its structure. In Python, decorators can be defined as functions or classes and applied with @decorator. The code below shows several decorator patterns:

Code: decorators

Define a decorator using a function

import time
import numpy as np
import functools

def timef0(func):
    """simplest decorator, with wrong func name"""
    def wrapper(*args, **kw):
        startt = time.time()
        result = func(*args, **kw)
        print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        return result
    return wrapper


def timef1(func):
    """decorator with correct func name"""
    @functools.wraps(func)
    def wrapper(*args, **kw):
        startt = time.time()
        result = func(*args, **kw)
        print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        return result
    return wrapper

def timef2(num_runs):
    """decorator with param"""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kw):
            startt = time.time()
            for _ in range(num_runs):
                result = func(*args, **kw)
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
            return result
        return wrapper
    return decorator

def timef3(*args, **kw):
    """decorator with and without param"""
    if len(args) == 1 and len(kw)==0:
        func = args[0]
        @functools.wraps(func)
        def wrapper(*args, **kw):
            startt = time.time()
            result = func(*args, **kw)
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
            return result
        return wrapper
    elif len(args) == 0 and len(kw)!=0:
        num_runs = kw.get("num_runs", 1)
        warmup   = kw.get("warmup", 0)
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kw):
                for _ in range(warmup):
                    result = func(*args, **kw)
                startt = time.time()
                for _ in range(num_runs):
                    result = func(*args, **kw)
                print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
                return result
            return wrapper
        return decorator
    else:
        raise ValueError("Params for decorator are not expected!")

def timef4(func=None, num_runs=1, warmup=0):
    """decorator with and without param, in a flatter style"""
    if not func:
        return functools.partial(timef4, num_runs=num_runs, warmup=warmup)
    @functools.wraps(func)
    def wrapper(*args, **kw):
        for _ in range(warmup):
            result = func(*args, **kw)
        startt = time.time()
        for _ in range(num_runs):
            result = func(*args, **kw)
        if num_runs == 1:
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        else:
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
        return result
    return wrapper

@timef0
def add_val_f0(_list, val):
    return [l+val for l in _list]

@timef1
def add_val_f1(_list, val):
    return [l+val for l in _list]

@timef2(num_runs=2)
def add_val_f2(_list, val):
    return [l+val for l in _list]

@timef3
def add_val_f3_woparam(_list, val):
    return [l+val for l in _list]

@timef3(num_runs=2, warmup=1)
def add_val_f3_wparam(_list, val):
    return [l+val for l in _list]

@timef4
def add_val_f4_woparam(_list, val):
    return [l+val for l in _list]

@timef4(num_runs=2, warmup=1)
def add_val_f4_wparam(_list, val):
    return [l+val for l in _list]

print(f"add_val_f0 func name: {add_val_f0.__name__}")
add_val_f0(np.ones(1000000), 10)
print(f"add_val_f1 func name: {add_val_f1.__name__}")
add_val_f1(np.ones(1000000), 10)
print(f"add_val_f2 func name: {add_val_f2.__name__}")
add_val_f2(np.ones(1000000), 10)
print(f"add_val_f3_woparam func name: {add_val_f3_woparam.__name__}")
add_val_f3_woparam(np.ones(1000000), 10)
print(f"add_val_f3_wparam func name: {add_val_f3_wparam.__name__}")
add_val_f3_wparam(np.ones(1000000), 10)
print(f"add_val_f4_woparam func name: {add_val_f4_woparam.__name__}")
add_val_f4_woparam(np.ones(1000000), 10)
print(f"add_val_f4_wparam func name: {add_val_f4_wparam.__name__}")
add_val_f4_wparam(np.ones(1000000), 10)

Output:

add_val_f0 func name: wrapper
add_val_f0 time cost: 0.257897s
add_val_f1 func name: add_val_f1
add_val_f1 time cost: 0.263955s
add_val_f2 func name: add_val_f2
add_val_f2 time cost: 0.269486s (Avg over 2 runs)
add_val_f3_woparam func name: add_val_f3_woparam
add_val_f3_woparam time cost: 0.257728s
add_val_f3_wparam func name: add_val_f3_wparam
add_val_f3_wparam time cost: 0.283820s (Avg over 2 runs)
add_val_f4_woparam func name: add_val_f4_woparam
add_val_f4_woparam time cost: 0.254313s
add_val_f4_wparam func name: add_val_f4_wparam
add_val_f4_wparam time cost: 0.282968s (Avg over 2 runs)

Define a decorator using a class
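The same timing decorator can also be written as a class. The sketch below is only one way this might look; the name TimeF and its n_calls attribute are hypothetical and do not come from the original code. __init__ receives the function, and __call__ plays the role of wrapper.

```python
import functools
import time

class TimeF:
    """Class-based timing decorator (hypothetical name, for illustration)."""

    def __init__(self, func):
        # Copy __name__, __doc__, etc. from func onto this instance;
        # this is the class-based counterpart of @functools.wraps(func).
        functools.update_wrapper(self, func)
        self.func = func
        self.n_calls = 0  # per-function state lives on the instance

    def __call__(self, *args, **kw):
        startt = time.time()
        result = self.func(*args, **kw)
        self.n_calls += 1
        print(f"{self.func.__name__} time cost: {time.time()-startt:0.6f}s")
        return result

@TimeF
def add_val(_list, val):
    return [l + val for l in _list]

out = add_val([1, 2, 3], 10)
print(add_val.__name__, add_val.n_calls)
```

A class is convenient when the decorator needs to keep state between calls, as with the call counter here.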

The essence of a decorator is adding functionality without altering the function’s structure: timing in this example, or registering a function in a module. The decorator therefore still returns a callable: either the original function, or a wrapper that calls it. Note that returning a plain wrapper breaks attribute consistency. As timef0 shows, the decorated function’s __name__ becomes wrapper. To preserve function attributes through decoration, apply @functools.wraps(func) to the wrapper.

When a decorator is applied, decorator(func) is what actually executes. Applying @timef1 to add_val_f1 at declaration is equivalent to add_val_f1 = timef1(add_val_f1). This nesting makes it straightforward to pass parameters to decorators. For example, @timef2(num_runs=2) on add_val_f2 is equivalent to timef2(num_runs=2)(add_val_f2). Since timef2(num_runs=2) returns the decorator function, this is ultimately decorator(add_val_f2). In this pattern, timef2 only handles parameters while decorator does the actual wrapping. I personally prefer the recursive style shown in timef4, which handles both the bare and the parameterized form in one flat function.
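The equivalence described above can be checked directly. This is a small sketch with a hypothetical repeat decorator factory (not part of the original code), applied once with @ syntax and once by calling it explicitly.

```python
import functools

def repeat(num_runs):
    """Decorator factory: handles the parameter, returns the real decorator."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kw):
            for _ in range(num_runs):
                result = func(*args, **kw)
            return result
        return wrapper
    return decorator

calls = []  # records which function body ran, and how often

@repeat(num_runs=2)
def with_syntax(x):
    calls.append("syntax")
    return x + 1

def plain(x):
    calls.append("manual")
    return x + 1

# The same thing spelled out: repeat(num_runs=2) returns decorator,
# and decorator(plain) returns the wrapper.
manual = repeat(num_runs=2)(plain)

r1 = with_syntax(1)
r2 = manual(1)
print(r1, r2, calls)
```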

Decorators can also be nested — examples appear in later sections.
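As a quick illustration of nesting, stacked decorators apply bottom-up: the one closest to the def wraps the function first. The tag factory below is hypothetical and exists only to make the order visible.

```python
import functools

def tag(label):
    """Hypothetical decorator factory that wraps the return value in a label."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kw):
            return f"{label}({func(*args, **kw)})"
        return wrapper
    return decorator

@tag("outer")
@tag("inner")
def value():
    return "x"

# Equivalent to: value = tag("outer")(tag("inner")(value))
result = value()
print(result)
```

This is why the benchmarks in this post place @timef above @numba.jit: the function is compiled first, and the timer wraps the compiled version.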

Accelerating Python for loops with numba.jit

In Python’s for loops, the same bytecode is interpreted repeatedly, even for identical code — an inefficient execution model. numba.jit can accelerate this with a single decorator:

Code: accelerating for loops with numba.jit

import time
import functools
import numpy as np
import numba


def timef(func=None, num_runs=1, warmup=0):
    """calculate the time cost for the function"""
    if not func:
        return functools.partial(timef, num_runs=num_runs, warmup=warmup)
    @functools.wraps(func)
    def wrapper(*args, **kw):
        for _ in range(warmup):
            result = func(*args, **kw)
        startt = time.time()
        for _ in range(num_runs):
            result = func(*args, **kw)
        if num_runs == 1:
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        else:
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
        return result
    return wrapper

@timef
def add_val_f0(_list, val):
    return [l+val for l in _list]

@timef
def add_val_f1(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef
@numba.jit
def add_val_f2(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef(num_runs=2, warmup=1)
@numba.jit
def add_val_f2_warmedup(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef
@numba.jit(nopython=True)
def add_val_f3(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True)
def add_val_f3_warmedup(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef
@numba.jit(nopython=True, cache=True)
def add_val_f4(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True, cache=True)
def add_val_f4_warmedup(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

# The thread pool size is read from NUMBA_NUM_THREADS; set it before any parallel function runs
numba.config.NUMBA_NUM_THREADS=4

@timef
@numba.jit(nopython=True, parallel=True)
def add_val_f5(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True, parallel=True)
def add_val_f5_warmedup(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef
@numba.jit(nopython=True, parallel=True)
def add_val_f6(ls, val):
    a = np.zeros(len(ls))
    for index in numba.prange(len(ls)):
        a[index] = ls[index]+val
    return a

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True, parallel=True)
def add_val_f6_warmedup(ls, val):
    a = np.zeros(len(ls))
    for index in numba.prange(len(ls)):
        a[index] = ls[index]+val
    return a

@timef
def add_val_numpy(ls, val):
    return ls + val

size=10000000
add_val_f0(np.ones(size), 10)
add_val_f1(np.ones(size), 10)
add_val_f2(np.ones(size), 10)
add_val_f2_warmedup(np.ones(size), 10)
add_val_f3(np.ones(size), 10)
add_val_f3_warmedup(np.ones(size), 10)
add_val_f4(np.ones(size), 10)
add_val_f4_warmedup(np.ones(size), 10)
add_val_f5(np.ones(size), 10)
add_val_f5_warmedup(np.ones(size), 10)
add_val_f6(np.ones(size), 10)
add_val_f6_warmedup(np.ones(size), 10)
add_val_numpy(np.ones(size), 10)

Output:

add_val_f0 time cost: 2.478387s
add_val_f1 time cost: 4.142665s
add_val_f2 time cost: 0.228293s
add_val_f2_warmedup time cost: 0.037352s (Avg over 2 runs)
add_val_f3 time cost: 0.129881s
add_val_f3_warmedup time cost: 0.036910s (Avg over 2 runs)
add_val_f4 time cost: 0.038818s
add_val_f4_warmedup time cost: 0.036762s (Avg over 2 runs)
add_val_f5 time cost: 0.214643s
add_val_f5_warmedup time cost: 0.017964s (Avg over 2 runs)
add_val_f6 time cost: 0.273015s
add_val_f6_warmedup time cost: 0.010958s (Avg over 2 runs)
add_val_numpy time cost: 0.012424s

The corresponding cache files generated in __pycache__:

.
├── numba_jit.py
└── __pycache__
    ├── numba_jit.add_val_f4-68.py37m.1.nbc
    ├── numba_jit.add_val_f4-68.py37m.nbi
    ├── numba_jit.add_val_f4_warmedup-76.py37m.1.nbc
    └── numba_jit.add_val_f4_warmedup-76.py37m.nbi

The benchmark shows that @numba.jit significantly speeds up Python code. The large gap between each add_val_fN (first call) and its add_val_fN_warmedup counterpart shows that JIT still pays a compilation cost on the first invocation. After that, execution speed approaches NumPy’s optimized performance, and the prange version (add_val_f6_warmedup) slightly exceeds it in this run.

Three numba.jit parameters are worth knowing:

nopython: controls compilation mode. numba has two modes: nopython mode and object mode. Setting nopython=True forces nopython mode, which produces the fastest code. The tradeoff is that nopython code can’t access the Python C API, so arbitrary Python classes aren’t supported as input types; this can be worked around with numba.jitclass (numba.experimental.jitclass in numba 0.49 and later) or NumPy structured arrays.
cache: enables caching compiled functions to disk, avoiding recompilation on every program start. As shown above, once the cache files exist, add_val_f4 with cache=True runs its first call at the same speed as the pre-warmed version. The very first run still has to compile and write the cache file, so it is slow.
parallel: enables parallel optimization. With parallel=True, numba automatically parallelizes supported array operations within the function; explicit loops are only parallelized when written with numba.prange, which is why add_val_f6 is faster than add_val_f5 once warmed up. Thread count is controlled via the NUMBA_NUM_THREADS environment variable (or numba.config.NUMBA_NUM_THREADS, set before any parallel function runs) and, in numba 0.49 and later, numba.set_num_threads().

numba.jit handles most loop cases well. But when you want to operate along a specific axis of an array, jit isn’t the right tool. The next section covers numba’s vectorize and guvectorize for building efficient NumPy universal functions that handle multi-dimensional arrays.

Creating efficient NumPy universal functions with numba.vectorize and guvectorize

A NumPy universal function (ufunc) operates on each element of an array. Built-in ufuncs such as add, multiply, and maximum are implemented in C and run fast. NumPy lets you build custom ufunc-like functions from Python functions (for example with numpy.vectorize), but these offer no speed advantage over plain Python. They are also aimed at elementwise operations; generalized ufuncs like matrix multiplication aren’t well supported. numba’s vectorize and guvectorize are compatible with numpy.ufunc semantics, run efficiently, and support generalized ufuncs.
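To see why a NumPy-level custom ufunc is no faster than plain Python, the sketch below counts how often the Python function is actually invoked. The names mul and call_counter are hypothetical, chosen for this illustration.

```python
import numpy as np

call_counter = [0]  # mutable counter updated from inside the function

def mul(a, b):
    call_counter[0] += 1
    return a * b

# otypes is given so NumPy does not make an extra trial call to infer the dtype.
mul_vec = np.vectorize(mul, otypes=[np.float64])

x = np.ones((2, 3))
out = mul_vec(x, 2.0)
print(out.shape, call_counter[0])
```

The Python function runs once per element, so the loop is still executed by the interpreter; numpy.vectorize adds broadcasting convenience, not speed.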

The examples below demonstrate numba.vectorize on elementwise multiplication and numba.guvectorize on matrix multiplication.

Code: numba.vectorize for efficient numpy ufunc — elementwise multiplication

import time
import functools
import numpy as np
import numba

numba.config.NUMBA_NUM_THREADS=8

def timef(func=None, num_runs=1, warmup=0):
    """calculate the time cost for the function"""
    if not func:
        return functools.partial(timef, num_runs=num_runs, warmup=warmup)
    @functools.wraps(func)
    def wrapper(*args, **kw):
        for _ in range(warmup):
            result = func(*args, **kw)
        startt = time.time()
        for _ in range(num_runs):
            result = func(*args, **kw)
        if num_runs == 1:
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        else:
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
        return result
    return wrapper

@timef(num_runs=2, warmup=1)
def dotmul_numpy(matA, matB):
    return matA*matB

@timef(num_runs=2, warmup=1)
def dotmul_python(matA, matB):
    matC = np.empty_like(matA)
    for i in range(matA.shape[0]):
        for j in range(matA.shape[1]):
            matC[i,j] = matA[i,j] * matB[i,j]
    return matC

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True)
def dotmul_numba_jit(matA, matB):
    matC = np.empty_like(matA)
    for i in range(matA.shape[0]):
        for j in range(matA.shape[1]):
            matC[i,j] = matA[i,j] * matB[i,j]
    return matC

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True, parallel=True)
def dotmul_numba_jit_parallel(matA, matB):
    matC = np.empty_like(matA)
    for i in numba.prange(matA.shape[0]):
        for j in numba.prange(matA.shape[1]):
            matC[i,j] = matA[i,j] * matB[i,j]
    return matC

@np.vectorize
def dotmul_numpy_vectorize(a, b):
    return a*b
dotmul_numpy_vectorize.__name__ = "dotmul_numpy_vectorize"
dotmul_numpy_vectorize = timef(num_runs=2, warmup=1)(dotmul_numpy_vectorize)

@timef(num_runs=2, warmup=1)
@numba.vectorize(nopython=True)
def dotmul_numba_vectorize(a, b):
    return a*b

@timef(num_runs=2, warmup=1)
@numba.vectorize('float64(float64, float64)', target='parallel', nopython=True)
def dotmul_numba_vectorize_parallel(a, b):
    return a*b

@numba.vectorize('float64(float64, float64)', target='cuda')
def dotmul_numba_vectorize_cuda(a, b):
    return a*b
dotmul_numba_vectorize_cuda.__name__ = "dotmul_numba_vectorize_cuda"
dotmul_numba_vectorize_cuda = timef(num_runs=2, warmup=1)(dotmul_numba_vectorize_cuda)

size=1000
dotmul_python(np.ones((size,size)), np.ones((size,size)))
dotmul_numpy(np.ones((size,size)), np.ones((size,size)))
dotmul_numba_jit(np.ones((size,size)), np.ones((size,size)))
dotmul_numba_jit_parallel(np.ones((size,size)), np.ones((size,size)))
dotmul_numpy_vectorize(np.ones((size,size)), np.ones((size,size)))
dotmul_numba_vectorize(np.ones((size,size)), np.ones((size,size)))
dotmul_numba_vectorize_parallel(np.ones((size,size)), np.ones((size,size)))
dotmul_numba_vectorize_cuda(np.ones((size,size)), np.ones((size,size)))

Output:

dotmul_python time cost: 0.390452s (Avg over 2 runs)
dotmul_numpy time cost: 0.002007s (Avg over 2 runs)
dotmul_numba_jit time cost: 0.002113s (Avg over 2 runs)
dotmul_numba_jit_parallel time cost: 0.001313s (Avg over 2 runs)
dotmul_numpy_vectorize time cost: 0.143894s (Avg over 2 runs)
dotmul_numba_vectorize time cost: 0.001535s (Avg over 2 runs)
dotmul_numba_vectorize_parallel time cost: 0.001899s (Avg over 2 runs)
dotmul_numba_vectorize_cuda time cost: 0.004523s (Avg over 2 runs)

Code: numba.guvectorize for efficient numpy generalized ufunc — matrix multiplication

import time
import functools
import numpy as np
import numba

numba.config.NUMBA_NUM_THREADS=8

def timef(func=None, num_runs=1, warmup=0):
    """calculate the time cost for the function"""
    if not func:
        return functools.partial(timef, num_runs=num_runs, warmup=warmup)
    @functools.wraps(func)
    def wrapper(*args, **kw):
        for _ in range(warmup):
            result = func(*args, **kw)
        startt = time.time()
        for _ in range(num_runs):
            result = func(*args, **kw)
        if num_runs == 1:
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        else:
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
        return result
    return wrapper

@timef(num_runs=2, warmup=1)
def matmul_numpy(matA, matB):
    return np.matmul(matA, matB)

@timef(num_runs=2, warmup=1)
def matmul_python(matA, matB):
    m, n = matA.shape
    n, p = matB.shape
    matC = np.zeros((m,p))
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]
    return matC

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True)
def matmul_numba_jit(matA, matB):
    m, n = matA.shape
    n, p = matB.shape
    matC = np.zeros((m,p))
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]
    return matC

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True, parallel=True)
def matmul_numba_jit_parallel(matA, matB):
    m, n = matA.shape
    n, p = matB.shape
    matC = np.zeros((m,p))
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]
    return matC

@timef(num_runs=2, warmup=1)
@numba.guvectorize("float64[:,:], float64[:,:], float64[:,:]", "(m,n),(n,p)->(m,p)", nopython=True)
def matmul_numba_guvectorize(matA, matB, matC):
    m, n = matA.shape
    n, p = matB.shape
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]

@timef(num_runs=2, warmup=1)
@numba.guvectorize("float64[:,:], float64[:,:], float64[:,:]", "(m,n),(n,p)->(m,p)", target='parallel', nopython=True)
def matmul_numba_guvectorize_parallel(matA, matB, matC):
    m, n = matA.shape
    n, p = matB.shape
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]

@numba.guvectorize("float64[:,:], float64[:,:], float64[:,:]", "(m,n),(n,p)->(m,p)", target='cuda')
def matmul_numba_guvectorize_cuda(matA, matB, matC):
    m, n = matA.shape
    n, p = matB.shape
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]
matmul_numba_guvectorize_cuda.__name__ = "matmul_numba_guvectorize_cuda"
matmul_numba_guvectorize_cuda = timef(num_runs=2, warmup=1)(matmul_numba_guvectorize_cuda)

size=1000
matmul_python(np.ones((size,size)), np.ones((size,size)))
matmul_numpy(np.ones((size,size)), np.ones((size,size)))
matmul_numba_jit(np.ones((size,size)), np.ones((size,size)))
matmul_numba_jit_parallel(np.ones((size,size)), np.ones((size,size)))
matmul_numba_guvectorize(np.ones((size,size)), np.ones((size,size)), np.zeros((size,size)))
matmul_numba_guvectorize_parallel(np.ones((size,size)), np.ones((size,size)), np.zeros((size,size)))

size=100
matmul_numba_guvectorize_cuda(np.ones((size,size)), np.ones((size,size)), np.zeros((size,size)))

Output:

matmul_python time cost: 578.654421s (Avg over 2 runs)
matmul_numpy time cost: 0.011648s (Avg over 2 runs)
matmul_numba_jit time cost: 1.189993s (Avg over 2 runs)
matmul_numba_jit_parallel time cost: 1.129338s (Avg over 2 runs)
matmul_numba_guvectorize time cost: 1.124118s (Avg over 2 runs)
matmul_numba_guvectorize_parallel time cost: 1.129258s (Avg over 2 runs)
matmul_numba_guvectorize_cuda time cost: 0.179321s (Avg over 2 runs)

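numba.vectorize produces ufuncs that outperform numpy.vectorize by roughly two orders of magnitude here and match or slightly beat NumPy’s C implementation. numba.guvectorize is flexible and far faster than pure Python, but at about 1.1s it still falls well short of NumPy’s optimized matrix multiplication (0.012s). Two caveats when reading these numbers: matmul_numba_jit_parallel uses plain range loops rather than numba.prange, so parallel=True has little to parallelize and its timing matches the serial version; and the CUDA variant was run at size=100, so its timing is not directly comparable with the others.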

Like numba.jit, both vectorize and guvectorize accept nopython and cache parameters. Instead of parallel=True, they use a target parameter: target="cpu" for single-threaded, target="parallel" for multi-threaded. There’s also target="cuda" for NVIDIA GPU acceleration, though for serious CUDA work I’d point you to my other post, Numba: Learning CUDA Programming Quickly with Python.

Why use ufuncs over jit? jit is more intuitive, more flexible, and often faster. The main reason to reach for ufuncs is broadcasting: numba-generated numpy ufuncs can broadcast over specific axes and support reductions and accumulations. This makes input dimensionality more flexible. Outside that use case, I generally prefer numba.jit.

Ahead-of-time compilation with numba.pycc

While numba’s primary mode is JIT, it also supports ahead-of-time (AoT) compilation, similar to Cython. This section rewrites the numba.jit example using pycc:

Code: ahead-of-time compilation with numba.pycc

# numba_pycc_module.py

import numba
from numba.pycc import CC
import numpy as np

cc = CC('numba_pycc_test_module')

@cc.export('add_valf', 'f8[:](f8[:], f8)')
@cc.export('add_vali', 'i4[:](i4[:], i4)')
def add_val(ls, val):
    a = np.empty_like(ls)
    for index in numba.prange(len(ls)):
        a[index] = ls[index]+val
    return a

if __name__ == "__main__":
    cc.compile()

Running this script (python numba_pycc_module.py) generates the compiled .so binary next to the source files:

.
├── numba_pycc_module.py
├── numba_pycc_test_module.cpython-37m-x86_64-linux-gnu.so
└── numba_pycc_test.py

# numba_pycc_test.py

import time
import functools
import numpy as np

import numba_pycc_test_module


def timef(func=None, num_runs=1, warmup=0):
    """calculate the time cost for the function"""
    if not func:
        return functools.partial(timef, num_runs=num_runs, warmup=warmup)
    @functools.wraps(func)
    def wrapper(*args, **kw):
        for _ in range(warmup):
            result = func(*args, **kw)
        startt = time.time()
        for _ in range(num_runs):
            result = func(*args, **kw)
        if num_runs == 1:
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        else:
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
        return result
    return wrapper

size=10000000
timef(numba_pycc_test_module.add_valf)(np.ones(size), 10)
timef(num_runs=2, warmup=1)(numba_pycc_test_module.add_valf)(np.ones(size), 10)
timef(numba_pycc_test_module.add_vali)(np.ones(size, dtype=np.int32), 10)
timef(num_runs=2, warmup=1)(numba_pycc_test_module.add_vali)(np.ones(size, dtype=np.int32), 10)

Output:

add_valf time cost: 0.027234s
add_valf time cost: 0.029468s (Avg over 2 runs)
add_vali time cost: 0.019918s
add_vali time cost: 0.018991s (Avg over 2 runs)

numba.pycc compiles Python functions to machine code ahead of time. The compiled module has zero compilation overhead at runtime and doesn’t require numba to be installed on the target machine (NumPy is still needed). pycc can also integrate into setuptools build scripts.

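The tradeoff: removing the numba runtime dependency also removes numba-specific features like parallel computation. pycc also only supports regular functions, not ufuncs, and every exported name needs its own explicit signature, which is why add_val is exported twice above as add_valf and add_vali.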

That covers the main ways numba can accelerate Python code. I’ve found numba genuinely satisfying to use: there are constraints, but the flexible syntax, solid NumPy array support, and large speedups without restructuring code make it worthwhile. This post didn’t cover everything — notably cfunc, nogil in jit, and experimental features like jit_module. numba is under active development and worth following. I hope this gives you enough to start using it in your own projects. Check the official documentation when you need more detail.

References

Writing this post drew from official documentation and other blogs — recommended reading: