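Easy to use and slow to run are the two labels most often attached to Python. Developers love its simple syntax and rich libraries, yet resent the low runtime efficiency that comes with its dynamic nature and interpreted execution, and the poor multithreaded scalability caused by the GIL. To address these inherent problems, tools such as Cython compile selected functions ahead of time to speed them up. However, most of these tools and libraries require syntax that is far from Pythonic, and they are not plug-and-play. This article briefly shows how to accelerate Python functions with the numba library by applying decorators to them, giving readers a simple and flexible compilation approach for improving the execution efficiency of Python code.
Environment setup
To simplify setup, the entire environment in this article is configured with conda and tested inside a conda virtual environment. The only dependencies are numba and numpy. The environment can be created with the following commands.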
conda create -n numba-python python=3.7
conda activate numba-python
conda install numba numpy
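Basic concepts
Compiled languages, interpreted languages, and just-in-time (JIT) compilation
A program written in a high-level language is generally either executed directly by an interpreter or compiled to machine code by a compiler and then executed. Languages executed directly by an interpreter are called interpreted languages; their execution flow is typically (source code -> program run -> bytecode -> machine code). Common interpreted languages include Perl, Python, MATLAB, and Ruby. Languages that must be compiled ahead of time (AoT) before execution are called compiled languages; their execution flow is typically (source code -> machine code -> program run). Common compiled languages include C and C++. Because interpreted languages can be run directly, their code is more flexible. In general, however, an interpreted language has to translate while it runs, so it executes more slowly than a compiled language; code that runs repeatedly is often translated repeatedly. Just-in-time (JIT) compilation combines the speed of compiled code with the flexibility of an interpreter and allows adaptive optimization.
Python is a typical interpreted language. At execution time, the interpreter first compiles the program to bytecode (which may be cached in a .pyc file) and then sends the bytecode to the Python virtual machine, which interprets it. If the program has not changed, the bytecode does not need to be regenerated, but the interpretation step on the virtual machine is repeated on every execution. With numba, a function decorated with numba.jit is compiled to machine code the first time it is called; later calls invoke that machine code directly and skip the repeated interpretation, reaching speeds close to those of a compiled language.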
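To make the bytecode step concrete, the sketch below uses the standard-library dis module to list the bytecode instructions CPython generates for a small function. The function name add_val is only illustrative, and the exact instruction names vary between Python versions.

```python
import dis


def add_val(x, val):
    """Tiny function whose bytecode we want to inspect."""
    return x + val


# Each instruction listed here is what the Python virtual machine
# interprets again on every call to add_val.
instructions = list(dis.get_instructions(add_val))
for ins in instructions:
    print(ins.opname, ins.argrepr)
```

A JIT compiler such as numba replaces this per-call interpretation with machine code generated once.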
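Decorators
A decorator is a common design pattern that adds new functionality to an existing object without changing its structure. In Python, a decorator can be defined as a function or a class and is applied with the @decorator syntax. The code below shows several common ways to define and apply decorators.
Code: decorators
Defining decorators with functions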
import time
import numpy as np
import functools

def timef0(func):
    """simplest decorator, with wrong func name"""
    def wrapper(*args, **kw):
        startt = time.time()
        result = func(*args, **kw)
        print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        return result
    return wrapper

def timef1(func):
    """decorator with correct func name"""
    @functools.wraps(func)
    def wrapper(*args, **kw):
        startt = time.time()
        result = func(*args, **kw)
        print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        return result
    return wrapper

def timef2(num_runs):
    """decorator with param"""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kw):
            startt = time.time()
            for _ in range(num_runs):
                result = func(*args, **kw)
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
            return result
        return wrapper
    return decorator

def timef3(*args, **kw):
    """decorator with and without param"""
    if len(args) == 1 and len(kw)==0:
        func = args[0]
        @functools.wraps(func)
        def wrapper(*args, **kw):
            startt = time.time()
            result = func(*args, **kw)
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
            return result
        return wrapper
    elif len(args) == 0 and len(kw)!=0:
        num_runs = kw["num_runs"] if "num_runs" in kw else 1
        warmup = kw["warmup"] if "warmup" in kw else 0
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kw):
                for _ in range(warmup):
                    result = func(*args, **kw)
                startt = time.time()
                for _ in range(num_runs):
                    result = func(*args, **kw)
                print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
                return result
            return wrapper
        return decorator
    else:
        raise ValueError("Params for decorator are not expected!")

def timef4(func=None, num_runs=1, warmup=0):
    """decorator with and without param, a more flatten way"""
    if not func:
        return functools.partial(timef4, num_runs=num_runs, warmup=warmup)
    @functools.wraps(func)
    def wrapper(*args, **kw):
        for _ in range(warmup):
            result = func(*args, **kw)
        startt = time.time()
        for _ in range(num_runs):
            result = func(*args, **kw)
        if num_runs == 1:
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        else:
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
        return result
    return wrapper

@timef0
def add_val_f0(_list, val):
    return [l+val for l in _list]

@timef1
def add_val_f1(_list, val):
    return [l+val for l in _list]

@timef2(num_runs=2)
def add_val_f2(_list, val):
    return [l+val for l in _list]

@timef3
def add_val_f3_woparam(_list, val):
    return [l+val for l in _list]

@timef3(num_runs=2, warmup=1)
def add_val_f3_wparam(_list, val):
    return [l+val for l in _list]

@timef4
def add_val_f4_woparam(_list, val):
    return [l+val for l in _list]

@timef4(num_runs=2, warmup=1)
def add_val_f4_wparam(_list, val):
    return [l+val for l in _list]

print(f"add_val_f0 func name: {add_val_f0.__name__}")
add_val_f0(np.ones(1000000), 10)
print(f"add_val_f1 func name: {add_val_f1.__name__}")
add_val_f1(np.ones(1000000), 10)
print(f"add_val_f2 func name: {add_val_f2.__name__}")
add_val_f2(np.ones(1000000), 10)
print(f"add_val_f3_woparam func name: {add_val_f3_woparam.__name__}")
add_val_f3_woparam(np.ones(1000000), 10)
print(f"add_val_f3_wparam func name: {add_val_f3_wparam.__name__}")
add_val_f3_wparam(np.ones(1000000), 10)
print(f"add_val_f4_woparam func name: {add_val_f4_woparam.__name__}")
add_val_f4_woparam(np.ones(1000000), 10)
print(f"add_val_f4_wparam func name: {add_val_f4_wparam.__name__}")
add_val_f4_wparam(np.ones(1000000), 10)
Output:
add_val_f0 func name: wrapper
add_val_f0 time cost: 0.257897s
add_val_f1 func name: add_val_f1
add_val_f1 time cost: 0.263955s
add_val_f2 func name: add_val_f2
add_val_f2 time cost: 0.269486s (Avg over 2 runs)
add_val_f3_woparam func name: add_val_f3_woparam
add_val_f3_woparam time cost: 0.257728s
add_val_f3_wparam func name: add_val_f3_wparam
add_val_f3_wparam time cost: 0.283820s (Avg over 2 runs)
add_val_f4_woparam func name: add_val_f4_woparam
add_val_f4_woparam time cost: 0.254313s
add_val_f4_wparam func name: add_val_f4_wparam
add_val_f4_wparam time cost: 0.282968s (Avg over 2 runs)
Defining decorators with a class
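A minimal sketch of a class-based decorator follows. The class name TimeF and its calls counter are hypothetical and chosen only for illustration; the idea is that an instance stores the wrapped function in __init__ and runs the extra logic in __call__, which also lets the decorator keep state between calls.

```python
import functools
import time


class TimeF:
    """Class-based timing decorator (hypothetical name, for illustration)."""

    def __init__(self, func):
        # Copy __name__, __doc__, etc. from func onto this instance
        functools.update_wrapper(self, func)
        self.func = func
        self.calls = 0  # state preserved between calls

    def __call__(self, *args, **kw):
        startt = time.perf_counter()
        result = self.func(*args, **kw)
        self.calls += 1
        print(f"{self.func.__name__} time cost: {time.perf_counter()-startt:0.6f}s")
        return result


@TimeF
def add_val(_list, val):
    return [l + val for l in _list]


out = add_val([1, 2, 3], 10)
print(add_val.__name__, add_val.calls)
```

Here functools.update_wrapper plays the role that @functools.wraps plays for function-based decorators.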
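The essence of a decorator is to add functionality to a function without changing its structure, for example measuring the time cost as in this example, or registering the function in a module. To keep calls consistent, a decorator still returns a function object. The returned function is usually one of two kinds: the original input function, or a wrapper function that calls the original. Note that when a wrapper is returned, it is a new function object, so its function attributes differ. As timef0 above shows, simply returning the wrapper changes the decorated function's name to the wrapper's name, which breaks consistency before and after decoration. To keep the attributes consistent, apply the @functools.wraps(func) decorator to the wrapper.
When a decorator is applied, what actually runs is decorator(func). For example, adding @timef1 to add_val_f1 is equivalent at definition time to add_val_f1 = timef1(add_val_f1). This also makes it possible to pass parameters to a decorator through nested functions. For timef2 above, the declaration is equivalent to timef2(num_runs=2)(add_val_f2); since timef2(num_runs=2) returns the decorator function, the declaration is in turn equivalent to decorator(add_val_f2). In this case timef2 only carries the parameters, and the inner decorator function does the work that timef1 does. The author prefers the recursive style shown in timef4, which allows more flexible definition and parameter passing.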
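The equivalence between the @ syntax and an explicit call can be checked directly. The sketch below uses a hypothetical parameterised decorator named repeat, which is not part of the examples above, and applies it once with @ and once by hand.

```python
import functools


def repeat(num_runs):
    """Parameterised decorator: repeat(num_runs) returns the real decorator."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kw):
            result = None
            for _ in range(num_runs):
                result = func(*args, **kw)
            return result
        return wrapper
    return decorator


calls = []


@repeat(num_runs=2)
def with_syntax(x):
    calls.append("with_syntax")
    return x + 1


def without_syntax(x):
    calls.append("without_syntax")
    return x + 1


# Manual form of the same declaration: call repeat first, then apply the result
without_syntax = repeat(num_runs=2)(without_syntax)

r1 = with_syntax(1)
r2 = without_syntax(1)
print(r1, r2, calls)
```

Both functions behave identically: each runs its body twice and returns the last result.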
Decorators can also be nested; a concrete example appears in the next section.
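As a small illustration of nesting order, the sketch below stacks two decorators built from a hypothetical helper named tag. The decorator closest to the function is applied first, so its wrapper ends up innermost at call time.

```python
import functools


def tag(label, log):
    """Return a decorator that records when its wrapper is entered and left."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kw):
            log.append(f"enter {label}")
            result = func(*args, **kw)
            log.append(f"exit {label}")
            return result
        return wrapper
    return decorator


log = []


@tag("outer", log)
@tag("inner", log)
def add(a, b):
    return a + b


total = add(1, 2)
print(log)
```

The declaration is equivalent to add = tag("outer", log)(tag("inner", log)(add)).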
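Accelerating Python for loops with numba.jit
As mentioned in the previous section, inside a Python for loop even identical code has its bytecode interpreted again on every iteration, which is inefficient. With numba, adding a simple @numba.jit decorator speeds this up; the code below accelerates the example above.
Code: accelerating a for loop with numba.jit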
import time
import functools
import numpy as np
import numba

def timef(func=None, num_runs=1, warmup=0):
    """calculate the time cost for the function"""
    if not func:
        return functools.partial(timef, num_runs=num_runs, warmup=warmup)
    @functools.wraps(func)
    def wrapper(*args, **kw):
        for _ in range(warmup):
            result = func(*args, **kw)
        startt = time.time()
        for _ in range(num_runs):
            result = func(*args, **kw)
        if num_runs == 1:
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        else:
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
        return result
    return wrapper

@timef
def add_val_f0(_list, val):
    return [l+val for l in _list]

@timef
def add_val_f1(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef
@numba.jit
def add_val_f2(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef(num_runs=2, warmup=1)
@numba.jit
def add_val_f2_warmedup(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef
@numba.jit(nopython=True)
def add_val_f3(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True)
def add_val_f3_warmedup(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef
@numba.jit(nopython=True, cache=True)
def add_val_f4(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True, cache=True)
def add_val_f4_warmedup(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

numba.config.NUMBA_DEFAULT_NUM_THREADS=4

@timef
@numba.jit(nopython=True, parallel=True)
def add_val_f5(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True, parallel=True)
def add_val_f5_warmedup(ls, val):
    a = np.zeros(len(ls))
    for index,_ in enumerate(ls):
        a[index] = ls[index]+val
    return a

@timef
@numba.jit(nopython=True, parallel=True)
def add_val_f6(ls, val):
    a = np.zeros(len(ls))
    for index in numba.prange(len(ls)):
        a[index] = ls[index]+val
    return a

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True, parallel=True)
def add_val_f6_warmedup(ls, val):
    a = np.zeros(len(ls))
    for index in numba.prange(len(ls)):
        a[index] = ls[index]+val
    return a

@timef
def add_val_numpy(ls, val):
    return ls + val

size=10000000
add_val_f0(np.ones(size), 10)
add_val_f1(np.ones(size), 10)
add_val_f2(np.ones(size), 10)
add_val_f2_warmedup(np.ones(size), 10)
add_val_f3(np.ones(size), 10)
add_val_f3_warmedup(np.ones(size), 10)
add_val_f4(np.ones(size), 10)
add_val_f4_warmedup(np.ones(size), 10)
add_val_f5(np.ones(size), 10)
add_val_f5_warmedup(np.ones(size), 10)
add_val_f6(np.ones(size), 10)
add_val_f6_warmedup(np.ones(size), 10)
add_val_numpy(np.ones(size), 10)
Output:
add_val_f0 time cost: 2.478387s
add_val_f1 time cost: 4.142665s
add_val_f2 time cost: 0.228293s
add_val_f2_warmedup time cost: 0.037352s (Avg over 2 runs)
add_val_f3 time cost: 0.129881s
add_val_f3_warmedup time cost: 0.036910s (Avg over 2 runs)
add_val_f4 time cost: 0.038818s
add_val_f4_warmedup time cost: 0.036762s (Avg over 2 runs)
add_val_f5 time cost: 0.214643s
add_val_f5_warmedup time cost: 0.017964s (Avg over 2 runs)
add_val_f6 time cost: 0.273015s
add_val_f6_warmedup time cost: 0.010958s (Avg over 2 runs)
add_val_numpy time cost: 0.012424s
The corresponding cache files are generated in __pycache__:
.
├── numba_jit.py
└── __pycache__
    ├── numba_jit.add_val_f4-68.py37m.1.nbc
    ├── numba_jit.add_val_f4-68.py37m.nbi
    ├── numba_jit.add_val_f4_warmedup-76.py37m.1.nbc
    └── numba_jit.add_val_f4_warmedup-76.py37m.nbi
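The example above shows that Python code can be accelerated simply by adding the @numba.jit decorator. The large speed gap between add_val_fn and add_val_fn_warmedup shows that with @numba.jit the first call still pays for compilation, so it is relatively slow. After that first compilation, execution speed approaches or even exceeds that of optimized numpy code.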
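Three parameters of numba.jit are used most often: nopython, cache, and parallel.
- nopython controls numba's compilation mode. numba has two modes: nopython mode and object mode. With nopython=True, numba compiles in nopython mode. This mode produces the fastest code, but because the generated code cannot access the Python C API, its compatibility with native Python code is limited. For example, nopython mode cannot accept native Python classes as input types; this can be worked around with numba.jitclass (numba.experimental.jitclass in newer releases) or with numpy structured arrays.
- cache controls caching of the compiled function to a file on disk. Passing cache=True avoids recompiling every time the Python program is launched, which removes the slow first call. In the example above, add_val_f4 uses cache=True to store the compiled result and therefore reaches a speed similar to add_val_f4_warmedup. Note that when no cache file exists yet, the first call generates it, and that call is just as slow.
- parallel controls parallelization of the function. With parallel=True, numba applies parallel optimizations to operations inside the function. Loops can also be parallelized explicitly with numba.prange to improve performance further. The number of threads is configured through numba.config; the example sets numba.config.NUMBA_DEFAULT_NUM_THREADS, while the setting documented by numba is NUMBA_NUM_THREADS, and newer releases also provide numba.set_num_threads.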
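numba.jit is sufficient for the vast majority of loops, but it falls short when an operation should apply only along certain dimensions of an array. The next section discusses how to use numba's vectorize and guvectorize to generate numpy universal functions that handle this extension to multidimensional arrays.
Creating efficient numpy universal functions with numba's vectorize and guvectorize
In numpy, a universal function (ufunc) is a function that operates on each element of an array. Basic operations such as add, dot, max, and any are implemented in C in numpy for good performance.
For more complex operations, numpy allows custom ufuncs. Compared with the C-implemented ufuncs, however, a ufunc defined in Python runs at roughly the same speed as the original Python function. In addition, numpy's custom ufuncs only support element-wise operations; for more complex operations such as matrix multiplication, which is a generalized universal function, numpy's support is not friendly. numba's vectorize and guvectorize remain compatible with the features of numpy.ufunc while running efficiently and supporting generalized universal functions.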
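For reference, this is how a custom element-wise function is built in plain numpy. The sketch below uses np.vectorize and np.frompyfunc on a hypothetical function scaled_add; both still call the Python function once per element, which is why they bring broadcasting but no speed-up.

```python
import numpy as np


def scaled_add(a, b):
    """Scalar Python function applied element by element."""
    return a + 2 * b


# np.vectorize wraps the scalar function so it accepts arrays and broadcasts
vec = np.vectorize(scaled_add)

col = np.arange(3).reshape(3, 1)  # shape (3, 1)
row = np.arange(4).reshape(1, 4)  # shape (1, 4)
out = vec(col, row)               # broadcast to shape (3, 4)

# np.frompyfunc returns a genuine np.ufunc object (results have dtype object)
uf = np.frompyfunc(scaled_add, 2, 1)
out_obj = uf(col, row)

print(out.shape, type(uf).__name__, out_obj.dtype)
```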
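The examples in this section use numba's vectorize and guvectorize, respectively, to accelerate custom numpy ufuncs.
Code: creating an efficient numpy universal function with numba.vectorize - element-wise multiplication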
import time
import functools
import numpy as np
import numba

numba.config.NUMBA_NUM_THREADS=8

def timef(func=None, num_runs=1, warmup=0):
    """calculate the time cost for the function"""
    if not func:
        return functools.partial(timef, num_runs=num_runs, warmup=warmup)
    @functools.wraps(func)
    def wrapper(*args, **kw):
        for _ in range(warmup):
            result = func(*args, **kw)
        startt = time.time()
        for _ in range(num_runs):
            result = func(*args, **kw)
        if num_runs == 1:
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        else:
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
        return result
    return wrapper

@timef(num_runs=2, warmup=1)
def dotmul_numpy(matA, matB):
    return matA*matB

@timef(num_runs=2, warmup=1)
def dotmul_python(matA, matB):
    matC = np.empty_like(matA)
    for i in range(matA.shape[0]):
        for j in range(matA.shape[1]):
            matC[i,j] = matA[i,j] * matB[i,j]
    return matC

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True)
def dotmul_numba_jit(matA, matB):
    matC = np.empty_like(matA)
    for i in range(matA.shape[0]):
        for j in range(matA.shape[1]):
            matC[i,j] = matA[i,j] * matB[i,j]
    return matC

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True, parallel=True)
def dotmul_numba_jit_parallel(matA, matB):
    matC = np.empty_like(matA)
    for i in numba.prange(matA.shape[0]):
        for j in numba.prange(matA.shape[1]):
            matC[i,j] = matA[i,j] * matB[i,j]
    return matC

@np.vectorize
def dotmul_numpy_vectorize(a, b):
    return a*b
dotmul_numpy_vectorize.__name__ = "dotmul_numpy_vectorize"
dotmul_numpy_vectorize = timef(num_runs=2, warmup=1)(dotmul_numpy_vectorize)

@timef(num_runs=2, warmup=1)
@numba.vectorize(nopython=True)
def dotmul_numba_vectorize(a, b):
    return a*b

@timef(num_runs=2, warmup=1)
@numba.vectorize('float64(float64, float64)', target='parallel', nopython=True)
def dotmul_numba_vectorize_parallel(a, b):
    return a*b

@numba.vectorize('float64(float64, float64)', target='cuda')
def dotmul_numba_vectorize_cuda(a, b):
    return a*b
dotmul_numba_vectorize_cuda.__name__ = "dotmul_numba_vectorize_cuda"
dotmul_numba_vectorize_cuda = timef(num_runs=2, warmup=1)(dotmul_numba_vectorize_cuda)

size=1000
dotmul_python(np.ones((size,size)), np.ones((size,size)))
dotmul_numpy(np.ones((size,size)), np.ones((size,size)))
dotmul_numba_jit(np.ones((size,size)), np.ones((size,size)))
dotmul_numba_jit_parallel(np.ones((size,size)), np.ones((size,size)))
dotmul_numpy_vectorize(np.ones((size,size)), np.ones((size,size)))
dotmul_numba_vectorize(np.ones((size,size)), np.ones((size,size)))
dotmul_numba_vectorize_parallel(np.ones((size,size)), np.ones((size,size)))
dotmul_numba_vectorize_cuda(np.ones((size,size)), np.ones((size,size)))
Output:
dotmul_python time cost: 0.390452s (Avg over 2 runs)
dotmul_numpy time cost: 0.002007s (Avg over 2 runs)
dotmul_numba_jit time cost: 0.002113s (Avg over 2 runs)
dotmul_numba_jit_parallel time cost: 0.001313s (Avg over 2 runs)
dotmul_numpy_vectorize time cost: 0.143894s (Avg over 2 runs)
dotmul_numba_vectorize time cost: 0.001535s (Avg over 2 runs)
dotmul_numba_vectorize_parallel time cost: 0.001899s (Avg over 2 runs)
dotmul_numba_vectorize_cuda time cost: 0.004523s (Avg over 2 runs)
Code: creating an efficient numpy generalized universal function with numba.guvectorize - matrix multiplication
import time
import functools
import numpy as np
import numba

numba.config.NUMBA_NUM_THREADS=8

def timef(func=None, num_runs=1, warmup=0):
    """calculate the time cost for the function"""
    if not func:
        return functools.partial(timef, num_runs=num_runs, warmup=warmup)
    @functools.wraps(func)
    def wrapper(*args, **kw):
        for _ in range(warmup):
            result = func(*args, **kw)
        startt = time.time()
        for _ in range(num_runs):
            result = func(*args, **kw)
        if num_runs == 1:
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        else:
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
        return result
    return wrapper

@timef(num_runs=2, warmup=1)
def matmul_numpy(matA, matB):
    return np.matmul(matA, matB)

@timef(num_runs=2, warmup=1)
def matmul_python(matA, matB):
    m, n = matA.shape
    n, p = matB.shape
    matC = np.zeros((m,p))
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]
    return matC

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True)
def matmul_numba_jit(matA, matB):
    m, n = matA.shape
    n, p = matB.shape
    matC = np.zeros((m,p))
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]
    return matC

@timef(num_runs=2, warmup=1)
@numba.jit(nopython=True, parallel=True)
def matmul_numba_jit_parallel(matA, matB):
    m, n = matA.shape
    n, p = matB.shape
    matC = np.zeros((m,p))
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]
    return matC

@timef(num_runs=2, warmup=1)
@numba.guvectorize("float64[:,:], float64[:,:], float64[:,:]", "(m,n),(n,p)->(m,p)", nopython=True)
def matmul_numba_guvectorize(matA, matB, matC):
    m, n = matA.shape
    n, p = matB.shape
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]

@timef(num_runs=2, warmup=1)
@numba.guvectorize("float64[:,:], float64[:,:], float64[:,:]", "(m,n),(n,p)->(m,p)", target='parallel', nopython=True)
def matmul_numba_guvectorize_parallel(matA, matB, matC):
    m, n = matA.shape
    n, p = matB.shape
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]

@numba.guvectorize("float64[:,:], float64[:,:], float64[:,:]", "(m,n),(n,p)->(m,p)", target='cuda')
def matmul_numba_guvectorize_cuda(matA, matB, matC):
    m, n = matA.shape
    n, p = matB.shape
    for i in range(m):
        for j in range(p):
            matC[i, j] = 0
            for k in range(n):
                matC[i, j] += matA[i, k] * matB[k, j]
matmul_numba_guvectorize_cuda.__name__ = "matmul_numba_guvectorize_cuda"
matmul_numba_guvectorize_cuda = timef(num_runs=2, warmup=1)(matmul_numba_guvectorize_cuda)

size=1000
matmul_python(np.ones((size,size)), np.ones((size,size)))
matmul_numpy(np.ones((size,size)), np.ones((size,size)))
matmul_numba_jit(np.ones((size,size)), np.ones((size,size)))
matmul_numba_jit_parallel(np.ones((size,size)), np.ones((size,size)))
matmul_numba_guvectorize(np.ones((size,size)), np.ones((size,size)), np.zeros((size,size)))
matmul_numba_guvectorize_parallel(np.ones((size,size)), np.ones((size,size)), np.zeros((size,size)))
size=100
matmul_numba_guvectorize_cuda(np.ones((size,size)), np.ones((size,size)), np.zeros((size,size)))
Output:
matmul_python time cost: 578.654421s (Avg over 2 runs)
matmul_numpy time cost: 0.011648s (Avg over 2 runs)
matmul_numba_jit time cost: 1.189993s (Avg over 2 runs)
matmul_numba_jit_parallel time cost: 1.129338s (Avg over 2 runs)
matmul_numba_guvectorize time cost: 1.124118s (Avg over 2 runs)
matmul_numba_guvectorize_parallel time cost: 1.129258s (Avg over 2 runs)
matmul_numba_guvectorize_cuda time cost: 0.179321s (Avg over 2 runs)
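The examples above show that, compared with the ufunc produced by numpy.vectorize, numba.vectorize runs far faster and in this benchmark even exceeds the speed of numpy's C-implemented operation. numba.guvectorize allows flexible custom definitions and is far faster than the plain Python implementation, but it still lags behind numpy's optimized code.
Like numba.jit, numba.vectorize and numba.guvectorize accept the nopython and cache parameters to set the compilation mode and caching. Unlike numba.jit, they control parallel execution through the target parameter. With target="cpu" numba uses a single thread; with target="parallel" it uses multiple threads. target can also select NVIDIA CUDA as the compute device for a numpy universal function, although the author does not recommend it. Readers who want to accelerate Python code with CUDA can see the author's other blog post, "Numba: 通过python快速学习cuda编程".
Why use a ufunc rather than jit, when jit is more intuitive, more flexible, and even faster? The main reason is that a numpy.ufunc generated by numba can broadcast over certain axes and, in some cases, reduce or accumulate along them, so the dimensions of the input data are less constrained. Apart from that, the author recommends numba.jit.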
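Ahead-of-time compilation of Python functions with numba.pycc
Although numba mainly compiles just in time, it also offers ahead-of-time (AoT) compilation, as Cython does. This section rewrites the earlier numba.jit example so that it is compiled ahead of time.
Code: ahead-of-time compilation of a Python function with numba.pycc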
# numba_pycc_module.py
import numba
from numba.pycc import CC
import numpy as np

cc = CC('numba_pycc_test_module')

@cc.export('add_valf', 'f8[:](f8[:], f8)')
@cc.export('add_vali', 'i4[:](i4[:], i4)')
def add_val(ls, val):
    a = np.empty_like(ls)
    for index in numba.prange(len(ls)):
        a[index] = ls[index]+val
    return a

if __name__ == "__main__":
    cc.compile()
Running the script produces a .so binary file:
.
├── numba_pycc_module.py
├── numba_pycc_test_module.cpython-37m-x86_64-linux-gnu.so
└── numba_pycc_test.py
# numba_pycc_test.py
import time
import functools
import numpy as np
import numba_pycc_test_module

def timef(func=None, num_runs=1, warmup=0):
    """calculate the time cost for the function"""
    if not func:
        return functools.partial(timef, num_runs=num_runs, warmup=warmup)
    @functools.wraps(func)
    def wrapper(*args, **kw):
        for _ in range(warmup):
            result = func(*args, **kw)
        startt = time.time()
        for _ in range(num_runs):
            result = func(*args, **kw)
        if num_runs == 1:
            print(f"{func.__name__} time cost: {time.time()-startt:0.6f}s")
        else:
            print(f"{func.__name__} time cost: {(time.time()-startt)/num_runs:0.6f}s (Avg over {num_runs} runs)")
        return result
    return wrapper

size=10000000
timef(numba_pycc_test_module.add_valf)(np.ones(size), 10)
timef(num_runs=2, warmup=1)(numba_pycc_test_module.add_valf)(np.ones(size), 10)
timef(numba_pycc_test_module.add_vali)(np.ones(size, dtype=np.int32), 10)
timef(num_runs=2, warmup=1)(numba_pycc_test_module.add_vali)(np.ones(size, dtype=np.int32), 10)
Output:
add_valf time cost: 0.027234s
add_valf time cost: 0.029468s (Avg over 2 runs)
add_vali time cost: 0.019918s
add_vali time cost: 0.018991s (Avg over 2 runs)
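The example above shows that numba.pycc can compile Python functions to machine code ahead of time. The compiled module has no compilation overhead at run time and does not depend on the Numba library. numba.pycc can also integrate the compilation step into build scripts such as setuptools. Although numba.pycc is flexible and convenient, it has limitations. Removing the dependency on numba also removes access to some of numba's features, such as parallel computation. In addition, numba.pycc only allows regular numba.jit functions; ufuncs cannot be used.
This concludes the author's introduction to accelerating Python code with numba. Despite a few limitations, the author is very satisfied with numba's simple and flexible syntax, its fairly complete support for numpy arrays, and the large speed-ups it delivers while preserving the original structure of Python functions. Some features the author uses less often are not covered here, such as cfunc in numba and nogil in jit. numba is also still under active development, and experimental features such as jit_module are very interesting. The author hopes this article gives readers a first understanding of numba that they can apply to their own projects. When in doubt, consult the official documentation. Best wishes.
References
While writing this article, the author drew heavily on the official documentation and other blogs and benefited greatly. Where viewpoints or wording coincide, they should be credited to the original authors. The relevant reference links are as follows: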