ForeverYoung

Numba: Speeding Up Python Code with Simple Decorators

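Ease of use and slow execution are the two labels most often attached to Python. Developers love its simple syntax and rich ecosystem of libraries, yet they resent the slow execution that comes with dynamic, interpreted execution and the poor multithreaded scalability caused by the GIL. To work around these inherent problems, tools such as Cython compile selected functions ahead of time to improve performance. However, most of these tools and libraries require syntax that is far from Pythonic, and they are not plug-and-play. This article briefly introduces how to use the Numba library to speed up Python functions simply by decorating them. It aims to give readers a simple, flexible way of compiling Python code that addresses these inherent problems and improves execution speed.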

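Environment Setup

To keep the setup simple, all code in this article runs inside a conda virtual environment. The required libraries are numba and numpy. The environment can be created with the following commands.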

conda create -n numba-python python=3.7
conda activate numba-python
conda install numba numpy

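Basic Concepts

Compiled languages, interpreted languages, and just-in-time (JIT) compilation

A program written in a high-level language is generally either executed directly by an interpreter or compiled to machine code by a compiler and then executed. Languages executed directly by an interpreter are called interpreted languages; their execution flow is typically (source code -> program starts -> bytecode -> machine code). Common interpreted languages include Perl, Python, MATLAB, and Ruby. Languages that must be compiled ahead of time (AoT) before they run are called compiled languages; their execution flow is typically (source code -> machine code -> program runs). Common compiled languages include C and C++. Because interpreted code can be run directly, it is more flexible. In general, though, an interpreted language has to be translated while it runs, so it executes more slowly than a compiled language. This is especially costly for code that runs repeatedly, which the interpreter often has to translate again each time. Just-in-time (JIT) compilation combines the speed of compiled code with the flexibility of an interpreter, and it allows adaptive optimization.

Python is a typical interpreted language. At run time, the interpreter first compiles the source to bytecode (cached in .pyc files for imported modules) and then hands the bytecode to the Python virtual machine, which interprets it instruction by instruction. If the source has not changed, the bytecode does not need to be regenerated, but the interpretation step in the virtual machine is repeated every time the code runs. With Numba, a function decorated with numba.jit is compiled to machine code the first time it is called. Later calls invoke that machine code directly and skip the repeated interpretation, which brings the speed close to that of a compiled language.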
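The bytecode mentioned above can be inspected with the standard-library dis module. The following is a minimal sketch; add_val is an illustrative function named after the examples later in this article, and the exact instruction names printed depend on the Python version.

```python
import dis

def add_val(values, val):
    """Add val to every element, the same computation used later in this article."""
    return [v + val for v in values]

# The compiled bytecode is stored on the function's code object as raw bytes.
bytecode = add_val.__code__.co_code
print(type(bytecode), len(bytecode))

# Human-readable listing of the instructions the virtual machine interprets.
dis.dis(add_val)

# The same instructions as objects, for programmatic inspection.
instructions = list(dis.get_instructions(add_val))
print(instructions[0].opname)
```

Every call to add_val walks through these instructions again in the interpreter loop, which is the repeated cost that numba.jit removes.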

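Decorators

A decorator is a common design pattern that adds new behavior to an existing object without changing its structure. In Python, a decorator can be defined as a function or as a class and is applied with the @decorator syntax. The code below shows several common ways to define and apply decorators.

Code: decorators

Defining decorators with functions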

import time
import numpy as np
import functools

def timef0(func):
    """simplist decorator, with wrong func name"""
    def wrapper(*args, **kw):
        startt = time.time()
        result = func(*args, **kw)
        print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        return result
    return wrapper


def timef1(func):
    """decorator with correct func name"""
    @functools.wraps(func)
    def wrapper(*args, **kw):
        startt = time.time()
        result = func(*args, **kw)
        print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        return result
    return wrapper

def timef2(num_runs):
    """decorator with param"""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kw):
            startt = time.time()
            for _ in range(num_runs):
                result = func(*args, **kw)
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
            return result
        return wrapper
    return decorator

def timef3(*args, **kw):
    """decorator with and without param"""
    if len(args) == 1 and len(kw)==0:
        func = args[0]
        @functools.wraps(func)
        def wrapper(*args, **kw):
            startt = time.time()
            result = func(*args, **kw)
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
            return result
        return wrapper
    elif len(args) == 0 and len(kw)!=0:
        num_runs = kw["num_runs"] if "num_runs" in kw else 1
        warmup   = kw["warmup"] if "warmup" in kw else 0
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kw):
                for _ in range(warmup):
                    result = func(*args, **kw)
                startt = time.time()
                for _ in range(num_runs):
                    result = func(*args, **kw)
                print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
                return result
            return wrapper
        return decorator
    else:
        raise ValueError("Params for decorator are not expected!")

def timef4(func=None, num_runs=1, warmup=0):
    """decorator with and without param, a more flatten way"""
    if not func:
        return functools.partial(timef4, num_runs=num_runs, warmup=warmup)
    @functools.wraps(func)
    def wrapper(*args, **kw):
        for _ in range(warmup):
            result = func(*args, **kw)
        startt = time.time()
        for _ in range(num_runs):
            result = func(*args, **kw)
        if num_runs == 1:
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        else:
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
        return result
    return wrapper

@timef0
def add_val_f0(_list, val):
    return [l+val for l in _list]

@timef1
def add_val_f1(_list, val):
    return [l+val for l in _list]

@timef2(num_runs=2)
def add_val_f2(_list, val):
    return [l+val for l in _list]

@timef3
def add_val_f3_woparam(_list, val):
    return [l+val for l in _list]

@timef3(num_runs=2, warmup=1)
def add_val_f3_wparam(_list, val):
    return [l+val for l in _list]

@timef4
def add_val_f4_woparam(_list, val):
    return [l+val for l in _list]

@timef4(num_runs=2, warmup=1)
def add_val_f4_wparam(_list, val):
    return [l+val for l in _list]

print(f"add_val_f0 func name: {add_val_f0.__name__}")
add_val_f0(np.ones(1000000), 10)
print(f"add_val_f1 func name: {add_val_f1.__name__}")
add_val_f1(np.ones(1000000), 10)
print(f"add_val_f2 func name: {add_val_f2.__name__}")
add_val_f2(np.ones(1000000), 10)
print(f"add_val_f3_woparam func name: {add_val_f3_woparam.__name__}")
add_val_f3_woparam(np.ones(1000000), 10)
print(f"add_val_f3_wparam func name: {add_val_f3_wparam.__name__}")
add_val_f3_wparam(np.ones(1000000), 10)
print(f"add_val_f4_woparam func name: {add_val_f4_woparam.__name__}")
add_val_f4_woparam(np.ones(1000000), 10)
print(f"add_val_f4_wparam func name: {add_val_f4_wparam.__name__}")
add_val_f4_wparam(np.ones(1000000), 10)

Output:

add_val_f0 func name: wrapper
add_val_f0 time cost: 0.257897s
add_val_f1 func name: add_val_f1
add_val_f1 time cost: 0.263955s
add_val_f2 func name: add_val_f2
add_val_f2 time cost: 0.269486s (Avg over 2 runs)
add_val_f3_woparam func name: add_val_f3_woparam
add_val_f3_woparam time cost: 0.257728s
add_val_f3_wparam func name: add_val_f3_wparam
add_val_f3_wparam time cost: 0.283820s (Avg over 2 runs)
add_val_f4_woparam func name: add_val_f4_woparam
add_val_f4_woparam time cost: 0.254313s
add_val_f4_wparam func name: add_val_f4_wparam
add_val_f4_wparam time cost: 0.282968s (Avg over 2 runs)

Defining decorators with classes
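No class-based listing survives in this copy of the article, so the following is only a sketch of what such a decorator could look like. The names TimeF, num_calls, and add_val_cls are hypothetical and are not taken from the original. The instance stores the function in __init__ and does the timing in __call__; functools.update_wrapper plays the role that functools.wraps plays for function-based decorators.

```python
import functools
import time

class TimeF:
    """Class-based timing decorator (hypothetical counterpart of timef1)."""

    def __init__(self, func):
        # Copy __name__, __doc__ and similar attributes from func onto this instance.
        functools.update_wrapper(self, func)
        self.func = func
        self.num_calls = 0

    def __call__(self, *args, **kw):
        # Called whenever the decorated function is called.
        self.num_calls += 1
        startt = time.time()
        result = self.func(*args, **kw)
        print(f"{self.func.__name__} time cost: {time.time()-startt:0.6f}s")
        return result

@TimeF
def add_val_cls(_list, val):
    return [l + val for l in _list]

print(f"add_val_cls func name: {add_val_cls.__name__}")
print(add_val_cls([1, 2, 3], 10))
print(f"calls so far: {add_val_cls.num_calls}")
```

One reason to choose the class form is that the instance can carry state between calls, such as the call counter here.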

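The essence of a decorator is to add behavior to a function without changing the function itself, for example measuring its run time as in this article, or registering it in a module. To keep calls consistent, a decorator still returns a callable. That callable is usually one of two things: the original function, or a wrapper function that calls the original. Note that a wrapper is a new function object, so its attributes differ from those of the original. As timef0 above shows, returning a bare wrapper changes the decorated function's name to wrapper, which breaks the consistency between the function before and after decoration. To keep the attributes consistent, apply the @functools.wraps(func) decorator to the wrapper.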

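Applying a decorator really executes decorator(func). For example, putting @timef1 on add_val_f1 is equivalent to add_val_f1 = timef1(add_val_f1) at definition time. This is why arguments can be passed to a decorator by nesting functions. For timef2 above, the declaration is equivalent to timef2(num_runs=2)(add_val_f2); since timef2(num_runs=2) returns the decorator function, this is in turn equivalent to decorator(add_val_f2). Here timef2 only receives the parameters, while decorator does the work that timef1 does. The author prefers the recursive style shown in timef4, which allows more flexible definitions and parameter passing.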
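The equivalence between the @ syntax and an explicit call can be checked directly. The sketch below uses a hypothetical parameterized decorator named repeat (not part of the original code) and decorates one function with @ and another by hand.

```python
import functools

def repeat(num_runs):
    """Hypothetical parameterized decorator: call func num_runs times."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kw):
            for _ in range(num_runs):
                result = func(*args, **kw)
            return result
        return wrapper
    return decorator

calls = []

@repeat(num_runs=2)
def record_a(x):
    calls.append(("a", x))
    return x + 1

def record_b(x):
    calls.append(("b", x))
    return x + 1

# Manual form of the same decoration: call repeat, then call its result on the function.
record_b = repeat(num_runs=2)(record_b)

print(record_a(1), record_b(1))
print(calls)
```

Both functions behave identically: each underlying function runs twice per call and keeps its own name.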

Decorators can also be stacked; concrete examples appear in the sections below.
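When decorators are stacked, the one closest to the def is applied first and the top one wraps everything beneath it. The sketch below illustrates the order with a hypothetical tag decorator that only records when its wrapper is entered.

```python
import functools

def tag(label, log):
    """Hypothetical decorator that records when its wrapper is entered."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kw):
            log.append(label)
            return func(*args, **kw)
        return wrapper
    return decorator

log = []

# Equivalent to: square = tag("outer", log)(tag("inner", log)(square))
@tag("outer", log)
@tag("inner", log)
def square(x):
    return x * x

print(square(3))
print(log)
```

The same rule explains why a timing decorator placed above a compiling decorator measures the compiled function rather than the original one.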

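Speeding up Python for loops with numba.jit

As the previous section noted, a Python for loop interprets the same bytecode again on every iteration, which is inefficient. With Numba, adding a simple @numba.jit decorator is enough to speed this up. The code below accelerates the earlier example.

Code: speeding up a for loop with numba.jit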
import time
import functools
import numpy as np
import numba


def timef(func=None, num_runs=1, warmup=0):
    """calculate the time cost for the function"""
    if not func:
        return functools.partial(timef, num_runs=num_runs, warmup=warmup)
    @functools.wraps(func)
    def wrapper(*args, **kw):
        for _ in range(warmup):
            result = func(*args, **kw)
        startt = time.time()
        for _ in range(num_runs):
            result = func(*args, **kw)
        if num_runs == 1:
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        else:
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
        return result
    return wrapper

@timef
def add_val_f0(_list, val):
    return [l+val for l in _list]

@timef
def add_val_f1(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef
@numba.jit
def add_val_f2(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef(num_runs=2, warmup=1)
@numba.jit
def add_val_f2_warmedup(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef
@numba.jit(nopython=True)
def add_val_f3(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True)
def add_val_f3_warmedup(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef
@numba.jit(nopython=True, cache=True)
def add_val_f4(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True, cache=True)
def add_val_f4_warmedup(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

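# Cap the thread count for parallel targets; it cannot exceed NUMBA_NUM_THREADS.
numba.set_num_threads(min(4, numba.config.NUMBA_NUM_THREADS))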

@timef
@numba.jit(nopython=True, parallel=True)
def add_val_f5(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True, parallel=True)
def add_val_f5_warmedup(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef
@numba.jit(nopython=True, parallel=True)
def add_val_f6(ls, val):
    a = np.zeros(len(ls))
    for index in numba.prange(len(ls)):
        a[index] = ls[index]+val
    return a

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True, parallel=True)
def add_val_f6_warmedup(ls, val):
    a = np.zeros(len(ls))
    for index in numba.prange(len(ls)):
        a[index] = ls[index]+val
    return a

@timef
def add_val_numpy(ls, val):
    return ls + val

size=10000000
add_val_f0(np.ones(size), 10)
add_val_f1(np.ones(size), 10)
add_val_f2(np.ones(size), 10)
add_val_f2_warmedup(np.ones(size), 10)
add_val_f3(np.ones(size), 10)
add_val_f3_warmedup(np.ones(size), 10)
add_val_f4(np.ones(size), 10)
add_val_f4_warmedup(np.ones(size), 10)
add_val_f5(np.ones(size), 10)
add_val_f5_warmedup(np.ones(size), 10)
add_val_f6(np.ones(size), 10)
add_val_f6_warmedup(np.ones(size), 10)
add_val_numpy(np.ones(size), 10)

Output:

add_val_f0 time cost: 2.478387s
add_val_f1 time cost: 4.142665s
add_val_f2 time cost: 0.228293s
add_val_f2_warmedup time cost: 0.037352s (Avg over 2 runs)
add_val_f3 time cost: 0.129881s
add_val_f3_warmedup time cost: 0.036910s (Avg over 2 runs)
add_val_f4 time cost: 0.038818s
add_val_f4_warmedup time cost: 0.036762s (Avg over 2 runs)
add_val_f5 time cost: 0.214643s
add_val_f5_warmedup time cost: 0.017964s (Avg over 2 runs)
add_val_f6 time cost: 0.273015s
add_val_f6_warmedup time cost: 0.010958s (Avg over 2 runs)
add_val_numpy time cost: 0.012424s

The corresponding cache files are generated in __pycache__:

.
├── numba_jit.py
└── __pycache__
    ├── numba_jit.add_val_f4-68.py37m.1.nbc
    ├── numba_jit.add_val_f4-68.py37m.nbi
    ├── numba_jit.add_val_f4_warmedup-76.py37m.1.nbc
    └── numba_jit.add_val_f4_warmedup-76.py37m.nbi

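The example above shows that simply adding the @numba.jit decorator speeds up Python code. The large gap between each add_val_fn and its add_val_fn_warmedup counterpart shows that the first call to a @numba.jit function still pays for compilation and is therefore comparatively slow. After that first compilation, the speed is close to, and in some cases better than, that of NumPy's optimized implementation.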

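Three parameters of numba.jit are used most often: nopython, cache, and parallel. nopython=True compiles the function without falling back to the Python interpreter and raises an error if that is not possible. cache=True saves the compiled result to disk (the .nbi and .nbc files shown above) so that later runs skip compilation. parallel=True enables automatic parallelization, including explicit parallel loops written with numba.prange.

numba.jit handles the vast majority of loops, but it falls short when an operation should apply along only certain dimensions of an array. The next section discusses how vectorize and guvectorize in Numba generate NumPy universal functions that extend to multidimensional arrays.

Creating efficient NumPy universal functions with Numba's vectorize and guvectorize

In NumPy, a universal function (ufunc) is a function that operates on every element of an array. Basic operations such as add, multiply, and maximum are implemented in C in NumPy and therefore run efficiently.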

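For more complex operations, NumPy lets users define their own ufuncs. Unlike ufuncs implemented in C, however, a ufunc defined in Python runs at about the same speed as the original Python function. In addition, a user-defined NumPy ufunc only supports elementwise operations; for generalized universal functions such as matrix multiplication, NumPy's support is not convenient. Numba's vectorize and guvectorize remain compatible with numpy.ufunc while running efficiently and supporting generalized universal functions.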
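As a small illustration of a ufunc-like function defined in Python, the sketch below wraps a scalar function with np.vectorize. The function clipped_mul is hypothetical. np.vectorize applies it elementwise with broadcasting, but it still calls the Python function for each element, which is why it is not faster than a plain loop.

```python
import numpy as np

def clipped_mul(a, b):
    """Scalar Python function: multiply, then cap the result at 10."""
    return min(a * b, 10.0)

# Wrap the scalar function so it accepts arrays and broadcasts them.
vec_clipped_mul = np.vectorize(clipped_mul)

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([2.0, 5.0])          # broadcast against each row of x

result = vec_clipped_mul(x, y)
print(result)
```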

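The examples in this section use Numba's vectorize and guvectorize, respectively, to speed up user-defined NumPy ufuncs.

Code: creating an efficient NumPy ufunc with numba.vectorize - elementwise multiplication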
import time
import functools
import numpy as np
import numba

numba.config.NUMBA_NUM_THREADS=8

def timef(func=None, num_runs=1, warmup=0):
    """calculate the time cost for the function"""
    if not func:
        return functools.partial(timef, num_runs=num_runs, warmup=warmup)
    @functools.wraps(func)
    def wrapper(*args, **kw):
        for _ in range(warmup):
            result = func(*args, **kw)
        startt = time.time()
        for _ in range(num_runs):
            result = func(*args, **kw)
        if num_runs == 1:
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        else:
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
        return result
    return wrapper

@timef(num_runs=2, warmup=1)
def dotmul_numpy(matA, matB):
    return matA*matB

@timef(num_runs=2, warmup=1)
def dotmul_python(matA, matB):
    matC = np.empty_like(matA)
    for i in range(matA.shape[0]):
        for j in range(matA.shape[1]):
            matC[i,j] = matA[i,j] * matB[i,j]
    return matC

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True)
def dotmul_numba_jit(matA, matB):
    matC = np.empty_like(matA)
    for i in range(matA.shape[0]):
        for j in range(matA.shape[1]):
            matC[i,j] = matA[i,j] * matB[i,j]
    return matC

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True, parallel=True)
def dotmul_numba_jit_parallel(matA, matB):
    matC = np.empty_like(matA)
    for i in numba.prange(matA.shape[0]):
        for j in numba.prange(matA.shape[1]):
            matC[i,j] = matA[i,j] * matB[i,j]
    return matC

@np.vectorize
def dotmul_numpy_vectorize(a, b):
    return a*b
dotmul_numpy_vectorize.__name__ = "dotmul_numpy_vectorize"
dotmul_numpy_vectorize = timef(num_runs=2, warmup=1)(dotmul_numpy_vectorize)

@timef(num_runs=2, warmup=1)
@numba.vectorize(nopython=True)
def dotmul_numba_vectorize(a, b):
    return a*b

@timef(num_runs=2, warmup=1)
@numba.vectorize('float64(float64, float64)', target='parallel', nopython=True)
def dotmul_numba_vectorize_parallel(a, b):
    return a*b

@numba.vectorize('float64(float64, float64)', target='cuda')
def dotmul_numba_vectorize_cuda(a, b):
    return a*b
dotmul_numba_vectorize_cuda.__name__ = "dotmul_numba_vectorize_cuda"
dotmul_numba_vectorize_cuda = timef(num_runs=2, warmup=1)(dotmul_numba_vectorize_cuda)

size=1000
dotmul_python(np.ones((size,size)), np.ones((size,size)))
dotmul_numpy(np.ones((size,size)), np.ones((size,size)))
dotmul_numba_jit(np.ones((size,size)), np.ones((size,size)))
dotmul_numba_jit_parallel(np.ones((size,size)), np.ones((size,size)))
dotmul_numpy_vectorize(np.ones((size,size)), np.ones((size,size)))
dotmul_numba_vectorize(np.ones((size,size)), np.ones((size,size)))
dotmul_numba_vectorize_parallel(np.ones((size,size)), np.ones((size,size)))
dotmul_numba_vectorize_cuda(np.ones((size,size)), np.ones((size,size)))

Output:

dotmul_python time cost: 0.390452s (Avg over 2 runs)
dotmul_numpy time cost: 0.002007s (Avg over 2 runs)
dotmul_numba_jit time cost: 0.002113s (Avg over 2 runs)
dotmul_numba_jit_parallel time cost: 0.001313s (Avg over 2 runs)
dotmul_numpy_vectorize time cost: 0.143894s (Avg over 2 runs)
dotmul_numba_vectorize time cost: 0.001535s (Avg over 2 runs)
dotmul_numba_vectorize_parallel time cost: 0.001899s (Avg over 2 runs)
dotmul_numba_vectorize_cuda time cost: 0.004523s (Avg over 2 runs)
Code: creating an efficient NumPy generalized ufunc with numba.guvectorize - matrix multiplication
import time
import functools
import numpy as np
import numba

numba.config.NUMBA_NUM_THREADS=8

def timef(func=None, num_runs=1, warmup=0):
    """calculate the time cost for the function"""
    if not func:
        return functools.partial(timef, num_runs=num_runs, warmup=warmup)
    @functools.wraps(func)
    def wrapper(*args, **kw):
        for _ in range(warmup):
            result = func(*args, **kw)
        startt = time.time()
        for _ in range(num_runs):
            result = func(*args, **kw)
        if num_runs == 1:
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        else:
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
        return result
    return wrapper

@timef(num_runs=2, warmup=1)
def matmul_numpy(matA, matB):
    return np.matmul(matA, matB)

@timef(num_runs=2, warmup=1)
def matmul_python(matA, matB):
    m, n = matA.shape
    n, p = matB.shape
    matC = np.zeros((m,p))
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]
    return matC

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True)
def matmul_numba_jit(matA, matB):
    m, n = matA.shape
    n, p = matB.shape
    matC = np.zeros((m,p))
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]
    return matC

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True, parallel=True)
def matmul_numba_jit_parallel(matA, matB):
    m, n = matA.shape
    n, p = matB.shape
    matC = np.zeros((m,p))
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]
    return matC

@timef(num_runs=2, warmup=1)
@numba.guvectorize("float64[:,:], float64[:,:], float64[:,:]", "(m,n),(n,p)->(m,p)", nopython=True)
def matmul_numba_guvectorize(matA, matB, matC):
    m, n = matA.shape
    n, p = matB.shape
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]

@timef(num_runs=2, warmup=1)
@numba.guvectorize("float64[:,:], float64[:,:], float64[:,:]", "(m,n),(n,p)->(m,p)", target='parallel', nopython=True)
def matmul_numba_guvectorize_parallel(matA, matB, matC):
    m, n = matA.shape
    n, p = matB.shape
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]

@numba.guvectorize("float64[:,:], float64[:,:], float64[:,:]", "(m,n),(n,p)->(m,p)", target='cuda')
def matmul_numba_guvectorize_cuda(matA, matB, matC):
    m, n = matA.shape
    n, p = matB.shape
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]
matmul_numba_guvectorize_cuda.__name__ = "matmul_numba_guvectorize_cuda"
matmul_numba_guvectorize_cuda = timef(num_runs=2, warmup=1)(matmul_numba_guvectorize_cuda)

size=1000
matmul_python(np.ones((size,size)), np.ones((size,size)))
matmul_numpy(np.ones((size,size)), np.ones((size,size)))
matmul_numba_jit(np.ones((size,size)), np.ones((size,size)))
matmul_numba_jit_parallel(np.ones((size,size)), np.ones((size,size)))
matmul_numba_guvectorize(np.ones((size,size)), np.ones((size,size)), np.zeros((size,size)))
matmul_numba_guvectorize_parallel(np.ones((size,size)), np.ones((size,size)), np.zeros((size,size)))

size=100
matmul_numba_guvectorize_cuda(np.ones((size,size)), np.ones((size,size)), np.zeros((size,size)))

Output:

matmul_python time cost: 578.654421s (Avg over 2 runs)
matmul_numpy time cost: 0.011648s (Avg over 2 runs)
matmul_numba_jit time cost: 1.189993s (Avg over 2 runs)
matmul_numba_jit_parallel time cost: 1.129338s (Avg over 2 runs)
matmul_numba_guvectorize time cost: 1.124118s (Avg over 2 runs)
matmul_numba_guvectorize_parallel time cost: 1.129258s (Avg over 2 runs)
matmul_numba_guvectorize_cuda time cost: 0.179321s (Avg over 2 runs)

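The examples above show that a ufunc generated by numba.vectorize runs much faster than one generated by numpy.vectorize, and in this measurement even faster than NumPy's C-implemented ufunc. numba.guvectorize, on the other hand, can be customized flexibly and is far faster than the pure Python implementation, but it still lags behind NumPy's optimized code. Note that matmul_numba_jit_parallel loops with plain range rather than numba.prange, so parallel=True has no explicit parallel loop to work on, which is consistent with its timing matching the serial version. The CUDA timing was measured with size=100 rather than size=1000, so it is not directly comparable with the other rows.

Like numba.jit, numba.vectorize and numba.guvectorize accept the nopython and cache parameters to set the compilation mode and enable the file cache. Unlike numba.jit, they control parallel execution through the target parameter: with target="cpu" Numba uses a single thread, and with target="parallel" it runs on multiple threads. target can also be set to "cuda" to run the ufunc on an NVIDIA GPU, although the author does not recommend it. Readers who want to accelerate Python code with CUDA can refer to the author's other post, "Numba: 通过python快速学习cuda编程" (learning CUDA programming quickly with Python).

Why use a ufunc rather than jit, when jit is more intuitive, more flexible, and sometimes even faster? The main reason is that the numpy.ufunc generated by Numba broadcasts over axes, and in some cases can reduce or accumulate along an axis, so the dimensions of the input data are less constrained. Apart from that, the author recommends numba.jit.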

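Compiling Python functions ahead of time with numba.pycc

Although Numba mainly compiles just in time, it also offers ahead-of-time (AoT) compilation, as Cython does. This section rewrites the earlier numba.jit example so that it is compiled ahead of time.

Code: compiling Python functions ahead of time with numba.pycc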
# numba_pycc_module.py

import numba
from numba.pycc import CC
import numpy as np

cc = CC('numba_pycc_test_module')

@cc.export('add_valf', 'f8[:](f8[:], f8)')
@cc.export('add_vali', 'i4[:](i4[:], i4)')
def add_val(ls, val):
    a = np.empty_like(ls)
    for index in numba.prange(len(ls)):
        a[index] = ls[index]+val
    return a

if __name__ == "__main__":
    cc.compile()

Running the script generates a .so binary:

.
├── numba_pycc_module.py
├── numba_pycc_test_module.cpython-37m-x86_64-linux-gnu.so
└── numba_pycc_test.py
# numba_pycc_test.py

import time
import functools
import numpy as np

import numba_pycc_test_module


def timef(func=None, num_runs=1, warmup=0):
    """calculate the time cost for the function"""
    if not func:
        return functools.partial(timef, num_runs=num_runs, warmup=warmup)
    @functools.wraps(func)
    def wrapper(*args, **kw):
        for _ in range(warmup):
            result = func(*args, **kw)
        startt = time.time()
        for _ in range(num_runs):
            result = func(*args, **kw)
        if num_runs == 1:
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        else:
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
        return result
    return wrapper

size=10000000
timef(numba_pycc_test_module.add_valf)(np.ones(size), 10)
timef(num_runs=2, warmup=1)(numba_pycc_test_module.add_valf)(np.ones(size), 10)
timef(numba_pycc_test_module.add_vali)(np.ones(size, dtype=np.int32), 10)
timef(num_runs=2, warmup=1)(numba_pycc_test_module.add_vali)(np.ones(size, dtype=np.int32), 10)

Output:

add_valf time cost: 0.027234s
add_valf time cost: 0.029468s (Avg over 2 runs)
add_vali time cost: 0.019918s
add_vali time cost: 0.018991s (Avg over 2 runs)

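The example above shows that numba.pycc can compile Python functions to machine code ahead of time. The compiled module has no compilation overhead at run time and does not depend on the Numba library. numba.pycc can also integrate the compilation step into setuptools scripts. Although numba.pycc is flexible and convenient, it has limitations. Dropping the dependency on Numba also means giving up some Numba features, such as parallel execution. Moreover, numba.pycc only compiles regular numba.jit functions; ufuncs are not supported.

This concludes the introduction to speeding up Python code with Numba. In the author's experience, Numba has some limitations, but its simple and flexible syntax, its fairly complete support for NumPy arrays, and the large speedups it delivers while preserving the original structure of Python functions are very satisfying. This article leaves out some features that the author uses less often, such as cfunc in Numba and the nogil option of jit. Numba is also still under active development, and experimental features such as jit_module are interesting. I hope this article gives readers a first understanding of Numba that they can apply to their own projects. When in doubt, consult the official documentation. Best wishes.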

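References

This article draws heavily on the official documentation and on other blog posts. Where the wording overlaps, credit belongs to the original authors.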

