Numba: Learning CUDA Programming Quickly with Python

Complex build environments, pointer arithmetic, and debugging make CUDA programming notoriously daunting for beginners. This post uses Python’s numba library as a faster path to understanding how CUDA’s highly concurrent, multithreaded programming model works.

Environment setup

To keep things simple, everything here is configured via conda in a virtual environment. The required packages are cudatoolkit, numba, and numpy:

conda create -n numba-cuda python=3.7
conda activate numba-cuda
conda install cudatoolkit numba numpy

After installation, use nvidia-smi to check GPU status. The table shows which programs are running on each GPU, their VRAM usage, and overall memory utilization:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
| 27%   36C    P8   112W / 250W |   6338MiB / 11175MiB |     38%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     32597      G   /home/yang/anaconda3/envs/ssds/bin/python   6338MiB |
+-----------------------------------------------------------------------------+

You can also verify numba can see the GPU with:

python -c "from numba import cuda; print(cuda.gpus)"

If it returns <Managed Device 0>..., the GPU is accessible from Python.

If you don’t have a GPU, numba includes a GPU simulator. Enable it with export NUMBA_ENABLE_CUDASIM=1. The simulator runs on CPU, so its compute throughput is far lower than real GPU hardware — expect performance slightly slower than equivalent CPU multithreading. Code that works in the simulator won’t necessarily run on real hardware either. Use it for learning only.

Once the environment is ready, let’s start writing CUDA.

”Hello, World!”: First look at Grid, Block, and Thread

With the environment working, this section runs a “Hello, World!” program on the GPU to introduce the basic concepts and the process of launching a GPU function.

Code: "Hello, World!"

from numba import cuda

def cpu_print():
    print('cpu: "hello world!"')

@cuda.jit
def gpu_print():
    print(cuda.blockIdx.x, cuda.threadIdx.x, 'gpu: "hello world!"')

if __name__ == "__main__":
    gpu_print[2, 4]()
    cuda.synchronize()
    cpu_print()

Output:

0 0 gpu: "hello world!"
0 1 gpu: "hello world!"
0 2 gpu: "hello world!"
0 3 gpu: "hello world!"
1 0 gpu: "hello world!"
1 1 gpu: "hello world!"
1 2 gpu: "hello world!"
1 3 gpu: "hello world!"
cpu: "hello world!"

In numba, CUDA functions live in numba.cuda, imported via from numba import cuda. To run code on the GPU, wrap it in a function and add the @cuda.jit decorator. Functions decorated with @cuda.jit are called kernels — they’re called from the CPU but execute only on the GPU.

More precisely: a kernel runs on the GPU’s smallest compute units (CUDA cores). Each invocation runs in a single thread; multiple threads form a block; multiple blocks form a grid. All threads executing a given kernel invocation share one grid. The grid controls block count and lifetime; each block controls the count and lifetime of its threads. The compute unit that runs a thread is a Streaming Processor (SP); a block runs on a Streaming Multiprocessor (SM); the grid runs on a whole GPU card.

When the CPU calls a kernel, it specifies the number of blocks per grid and threads per block: kernel[num_blocks_per_grid, num_threads_per_block](). In this example, gpu_print[2, 4]() launches 2 blocks with 4 threads each — 8 threads total, hence 8 lines of “hello world!”.

Kernel launches are asynchronous: after launching a GPU kernel, the CPU doesn’t wait for it to finish before moving on. When synchronization is needed, call cuda.synchronize() to block the CPU until the GPU is done. Without it, cpu: "hello world!" prints first — the CPU reaches cpu_print before the GPU finishes, because launching a kernel has latency and execution itself is deferred.

Each thread knows its block index, thread index, block count, and thread count. Since these indices map directly to loop indices, CUDA parallelism can replace sequential loops in a fairly straightforward way. Thread (and block) indices support up to three dimensions, accessed via x, y, z. In this example, cuda.blockIdx.x and cuda.threadIdx.x give each thread’s block and thread index.

The next section uses matrix addition to illustrate basic CUDA parallel computation.

Matrix addition: first steps with parallel computation

The “Hello, World!” example covered the kernel execution model and the Grid/Block/Thread hierarchy. This section uses 2D matrix addition to go deeper into thread configuration and data transfer.

Code: matrix addition

import time
import random
import math
import numpy as np
from numba import cuda, jit


def matrix_add_cpu(matrix_A, matrix_B, n_rows, n_cols):
    return [[matrix_A[ridx][cidx] + matrix_B[ridx][cidx] for cidx in range(n_cols)] for ridx in range(n_rows)]

def matrix_add_numpy(matrix_A, matrix_B, n_rows, n_cols):
    return matrix_A + matrix_B

def matrix_add_cpu_v2(matrix_A, matrix_B, matrix_C, n_rows, n_cols):
    for ridx in range(n_rows):
        for cidx in range(n_cols):
            matrix_C[ridx][cidx] = matrix_A[ridx][cidx] + matrix_B[ridx][cidx]
    return matrix_C

@jit
def matrix_add_numba_jit(matrix_A, matrix_B, matrix_C, n_rows, n_cols):
    for ridx in range(n_rows):
        for cidx in range(n_cols):
            matrix_C[ridx][cidx] = matrix_A[ridx][cidx] + matrix_B[ridx][cidx]
    return matrix_C

@cuda.jit
def matrix_add_numba_cuda(matrix_A, matrix_B, matrix_C, n_rows, n_cols):
    ridx = cuda.threadIdx.y + cuda.blockDim.y * cuda.blockIdx.y
    cidx = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x

    if ridx < n_rows and cidx < n_cols:
        matrix_C[ridx][cidx] = matrix_A[ridx][cidx] + matrix_B[ridx][cidx]

@cuda.jit
def matrix_add_numba_cuda_copy(matrix_A, matrix_B, matrix_C, n_rows, n_cols):
    ridx = cuda.threadIdx.y + cuda.blockDim.y * cuda.blockIdx.y
    cidx = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x

    if ridx < n_rows and cidx < n_cols:
        matrix_C[ridx][cidx] = matrix_A[ridx][cidx] + matrix_B[ridx][cidx]

@cuda.jit
def matrix_add_numba_cuda_v2(matrix_A, matrix_B, matrix_C, n_rows, n_cols):
    idx = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x
    ridx = idx // n_cols
    cidx = idx %  n_cols

    if ridx < n_rows and cidx < n_cols:
        matrix_C[ridx][cidx] = matrix_A[ridx][cidx] + matrix_B[ridx][cidx]


if __name__ == "__main__":
    n_row, n_col = 1080, 1920

    arr_A, arr_B = [[[random.random() for _ in range(n_col)] for _ in range(n_row)] for _ in range(2)]
    elapsed_time = time.time()
    arr_cpu = matrix_add_cpu(arr_A, arr_B, n_row, n_col)
    elapsed_time = time.time() - elapsed_time
    print("process time on cpu : {:.03f}ms".format(elapsed_time * 1000))

    arr_A, arr_B = np.array(arr_A), np.array(arr_B)
    elapsed_time = time.time()
    arr_numpy = matrix_add_numpy(arr_A, arr_B, n_row, n_col)
    elapsed_time = time.time() - elapsed_time
    if (arr_numpy - np.array(arr_cpu)).max() < 1e-8:
        print("process time on numpy : {:.03f}ms".format(elapsed_time * 1000))

    arr_C = arr_B.copy()
    elapsed_time = time.time()
    arr_cpu_v2 = matrix_add_cpu_v2(arr_A, arr_B, arr_C, n_row, n_col)
    elapsed_time = time.time() - elapsed_time
    if (arr_numpy - arr_cpu_v2).max() < 1e-8:
        print("process time on cpu v2 : {:.03f}ms".format(elapsed_time * 1000))

    elapsed_time = time.time()
    arr_n_jit = matrix_add_numba_jit(arr_A, arr_B, arr_C, n_row, n_col)
    elapsed_time = time.time() - elapsed_time
    if (arr_numpy - arr_n_jit).max() < 1e-8:
        print("process time on numba jit : {:.03f}ms".format(elapsed_time * 1000))

    threadsperblock = (16, 16)
    blockspergrid_x = math.ceil(n_col / threadsperblock[0])
    blockspergrid_y = math.ceil(n_row / threadsperblock[1])
    blockspergrid = (blockspergrid_x, blockspergrid_y)
    elapsed_time = time.time()
    matrix_add_numba_cuda[blockspergrid, threadsperblock](arr_A, arr_B, arr_C, n_row, n_col)
    cuda.synchronize()
    elapsed_time = time.time() - elapsed_time
    if (arr_numpy - arr_C).max() < 1e-8:
        print("process time on numba cuda : {:.03f}ms".format(elapsed_time * 1000))

    d_arr_A, d_arr_B, d_arr_C = cuda.to_device(arr_A), cuda.to_device(arr_B), cuda.to_device(arr_C)
    cuda.synchronize()
    elapsed_time = time.time()
    matrix_add_numba_cuda_copy[blockspergrid, threadsperblock](d_arr_A, d_arr_B, d_arr_C, n_row, n_col)
    cuda.synchronize()
    elapsed_time = time.time() - elapsed_time
    arr_C = d_arr_C.copy_to_host()
    cuda.synchronize()
    if (arr_numpy - arr_C).max() < 1e-8:
        print("process time on numba cuda with copied data : {:.03f}ms".format(elapsed_time * 1000))

    threadsperblock = 256
    blockspergrid   = math.ceil((n_col*n_row)/threadsperblock)
    elapsed_time = time.time()
    matrix_add_numba_cuda_v2[blockspergrid, threadsperblock](d_arr_A, d_arr_B, d_arr_C, n_row, n_col)
    cuda.synchronize()
    elapsed_time = time.time() - elapsed_time
    arr_C = d_arr_C.copy_to_host()
    cuda.synchronize()
    if (arr_numpy - arr_C).max() < 1e-8:
        print("process time on numba cuda with copied data : {:.03f}ms".format(elapsed_time * 1000))

Output 1:

process time on cpu : 203.750ms
process time on numpy : 2.630ms
process time on cpu v2 : 1334.258ms
process time on numba jit : 235.476ms
process time on numba cuda : 217.982ms
process time on numba cuda with copied data : 76.042ms
process time on numba cuda v2 with copied data : 75.180ms

Output 2:

process time on cpu : 245.238ms
process time on numpy : 2.766ms
process time on cpu v2 : 1410.334ms
process time on numba jit (1st): 234.586ms
process time on numba jit (2nd): 3.086ms
process time on numba cuda (1st): 234.172ms
process time on numba cuda (2nd): 15.431ms
process time on numba cuda with copied data (1st): 80.449ms
process time on numba cuda with copied data (2nd): 0.642ms
process time on numba cuda v2 with copied data (1st): 78.265ms
process time on numba cuda v2 with copied data (2nd): 0.538ms

When each computation is independent, a CUDA kernel can replace a loop and deliver a large speedup. In pure Python, adding two matrices requires nested loops over every element. In a CUDA kernel, each thread handles one element — the loop indices map to thread indices via ridx = cuda.threadIdx.y + cuda.blockDim.y * cuda.blockIdx.y and cidx = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x. Compared to pure Python, CUDA gives a 3x to 18x speedup. Notably, the second invocation drops to around 0.8ms — a 264x to 1654x improvement — which suggests the first run includes numba’s JIT compilation overhead.

Each SM has a fixed number of SPs. For most current GPUs, an SM contains 1024 or 512 SPs, which sets the maximum threads per block. In this example, threads per block is fixed at (16, 16) = 256. Hardware schedules threads in groups called warps of 32. All 32 threads in a warp execute the same instruction simultaneously; if fewer are active, the rest are masked and idle. To maximize warp utilization, thread count should be a multiple of 32. Here, 256 = 32 × 8.

Since threads per block is typically a constant, block count is usually the ceiling division of total elements by threads per block. For this 1920×1080 matrix: blockspergrid_x = math.ceil(n_col / threadsperblock[0]) and blockspergrid_y = math.ceil(n_row / threadsperblock[1]) = (120, 68) = 8160 blocks total. Because of ceiling division, total threads exceed the matrix size, so the kernel needs a bounds check: if ridx < n_rows and cidx < n_cols:. Multi-dimensional thread/block counts are intuitive but generate more wasted threads from ceiling rounding — which is why CUDA code usually prefers 1D thread/block counts.

GPU kernels operate on GPU VRAM, not CPU RAM. Before running a kernel, input data must be copied from CPU RAM to GPU VRAM; results must be copied back afterward. numba.cuda handles this automatically when it detects CPU-side data — but that means unnecessary round-trips: matrix_A and matrix_B don’t need to come back to CPU. It’s better to use cuda.to_device() to copy inputs to GPU explicitly, and arr.copy_to_host() to retrieve results. To avoid the cost of copying an output array from CPU to GPU, allocate it directly on the GPU with cuda.device_array(). As the benchmarks show, manual memory management significantly outperforms automatic.

One more thing: multiple small transfers are slower than a single equivalent large transfer. Copy data in one batch when possible. When data is too large to process at once, batching transfers with computation can improve throughput further.

The next section applies batching to matrix addition using CUDA streams.

Batched matrix processing: streams

When data is too large to process in one shot, the GPU queues work serially. CUDA streams pipeline this: split a large task into batches, each going through copy → compute → writeback, with these stages overlapping across batches for a throughput gain.

Code: matrix addition with multiple streams

import time
import random
import math
import numpy as np
from numba import cuda

@cuda.jit
def matrix_add_numba_cuda(matrix_A, matrix_B, matrix_C, n_rows, n_cols):
    idx = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x
    ridx = idx // n_cols
    cidx = idx %  n_cols

    if ridx < n_rows and cidx < n_cols:
        matrix_C[ridx][cidx] = matrix_A[ridx][cidx] + matrix_B[ridx][cidx]

@cuda.jit
def matrix_add_numba_cuda_queue(matrix_A, matrix_B, matrix_C, n_rows, n_cols):
    idx = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x
    ridx = idx // n_cols
    cidx = idx %  n_cols

    if ridx < n_rows and cidx < n_cols:
        matrix_C[ridx][cidx] = matrix_A[ridx][cidx] + matrix_B[ridx][cidx]

@cuda.jit
def matrix_add_numba_cuda_stream(matrix_A, matrix_B, matrix_C, n_rows, n_cols):
    idx = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x
    ridx = idx // n_cols
    cidx = idx %  n_cols

    if ridx < n_rows and cidx < n_cols:
        matrix_C[ridx][cidx] = matrix_A[ridx][cidx] + matrix_B[ridx][cidx]


if __name__ == "__main__":
    n_row, n_col = 10800, 19200

    arr_A, arr_B, arr_C = np.random.random((n_row, n_col)), np.random.random((n_row, n_col)), np.random.random((n_row, n_col))

    # basic one time
    threadsperblock = 256
    blockspergrid   = math.ceil((n_col*n_row)/threadsperblock)

    elapsed_time = time.time()
    d_arr_A, d_arr_B, d_arr_C = cuda.to_device(arr_A), cuda.to_device(arr_B), cuda.to_device(arr_C)
    matrix_add_numba_cuda[blockspergrid, threadsperblock](d_arr_A, d_arr_B, d_arr_C, n_row, n_col)
    cuda.synchronize()
    arr_result = d_arr_C.copy_to_host()
    cuda.synchronize()
    elapsed_time = time.time() - elapsed_time
    print("process time on numba cuda : {:.03f}ms".format(elapsed_time * 1000))

    # queue
    n_streams     = 4
    seg_row       = n_row // n_streams
    blockspergrid = math.ceil((seg_row* n_col) / threadsperblock)

    elapsed_time = time.time()
    for i in range(0, n_streams):
        d_arr_A = cuda.to_device(arr_A[i * seg_row : (i + 1) * seg_row])
        d_arr_B = cuda.to_device(arr_B[i * seg_row : (i + 1) * seg_row])
        cuda.synchronize()
        matrix_add_numba_cuda_queue[blockspergrid, threadsperblock](
                d_arr_A,
                d_arr_B,
                d_arr_C[i * seg_row : (i + 1) * seg_row],
                seg_row,
                n_col)
        cuda.synchronize()
        arr_C[i * seg_row : (i + 1) * seg_row] = d_arr_C[i * seg_row : (i + 1) * seg_row].copy_to_host()
        cuda.synchronize()
    elapsed_time = time.time() - elapsed_time
    if (arr_result - arr_C).max() < 1e-8:
        print("process time on numba cuda queue: {:.03f}ms".format(elapsed_time * 1000))

    # stream
    stream_list   = list()
    for i in range (0, n_streams):
        stream = cuda.stream()
        stream_list.append(stream)

    elapsed_time = time.time()
    for i in range(0, n_streams):
        d_arr_A = cuda.to_device(arr_A[i * seg_row : (i + 1) * seg_row], stream=stream_list[i])
        d_arr_B = cuda.to_device(arr_B[i * seg_row : (i + 1) * seg_row], stream=stream_list[i])
        matrix_add_numba_cuda_stream[blockspergrid, threadsperblock, stream_list[i]](
                d_arr_A,
                d_arr_B,
                d_arr_C[i * seg_row : (i + 1) * seg_row],
                seg_row,
                n_col)
        arr_C[i * seg_row : (i + 1) * seg_row] = d_arr_C[i * seg_row : (i + 1) * seg_row].copy_to_host(stream=stream_list[i])
    cuda.synchronize()
    elapsed_time = time.time() - elapsed_time
    if (arr_result - arr_C).max() < 1e-8:
        print("process time on numba cuda stream: {:.03f}ms".format(elapsed_time * 1000))

Output 1:

process time on numba cuda : 1180.036ms
process time on numba cuda queue: 1019.919ms
process time on numba cuda stream: 973.997ms

Output 2:

process time on numba cuda (1st): 1187.603ms
process time on numba cuda (2nd): 883.670ms
process time on numba cuda queue (1st): 1064.817ms
process time on numba cuda queue (2nd): 932.134ms
process time on numba cuda stream (1st): 959.764ms
process time on numba cuda stream (2nd): 887.833ms

gantt
	title Default vs Queue vs Stream
	dateFormat  ss
    axisFormat  %S
	section Default
	Host2Device      : a1, 0s, 3s
	Kernal           : a2, after a1 , 3s
    Device2Host      : after a2 , 3s

	section Queue
	Host2Device1     : 0s, 1s
	Kernal1          : 1s
    Device2Host1     : 1s
	Host2Device2     : 1s
	Kernal2          : 1s
    Device2Host2     : 1s
	Host2Device3     : 1s
	Kernal3          : 1s
    Device2Host3     : 1s

	section Stream
	Host2Device1     : b1, 0s, 1s
	Kernal1          : b2, after b1 , 1s
    Device2Host1     : after b2 , 1s
	Host2Device2     : after b1,  1s
	Kernal2          : b3, after b2, 1s
    Device2Host2     : after b3 , 1s
	Host2Device3     : after b2,  1s
	Kernal3          : b4, after b3, 1s
    Device2Host3     : after b4 , 1s

Many CUDA operations are independent at the hardware level — kernel execution, host-to-device copies, device-to-host copies can all overlap. CUDA streams exploit this: split a large task into batches, assign each batch to a stream, and let copy and compute stages from different batches overlap. In this example, stream processing is slightly faster than simple serial queuing.

To use streams, create stream objects and associate them with each operation. In numba: create a stream with cuda.stream(); launch a kernel with kernel[num_blocks, num_threads, stream](); transfer data with cuda.to_device(stream=stream) and arr.copy_to_host(stream=stream). Operations without an explicit stream run on the default stream.

Each stream executes its operations in order, but non-default streams are asynchronous with each other. The default stream acts as a barrier: it waits for all non-default streams to finish before executing, and non-default streams wait for the default stream before running.

Don’t create too many streams. If per-batch kernel compute time is shorter than a single transfer, the pipeline advantage disappears — each stream ends up waiting for other streams’ transfers anyway. In practice, the optimal stream count depends on hardware transfer overhead and kernel compute time and may need some tuning.

That covers matrix addition. The next section introduces matrix summation and the parallel reduction algorithm, along with warp divergence and coalesced memory access.

Matrix summation: parallel reduction

In the matrix addition examples, each element was independent — threads didn’t need to coordinate. Many real problems aren’t like this: each step depends on the result of the previous one. This section uses matrix summation to introduce parallel reduction for binary operations.

Code: matrix summation

import time
import random
import math
import numpy as np
from numba import cuda, jit


def matrix_sum_numpy(matrix, n_rows, n_cols):
    return matrix.sum()

def matrix_sum_cpu(matrix, n_rows, n_cols):
    res = 0
    for ridx in range(n_rows):
        for cidx in range(n_cols):
            res += matrix[ridx][cidx]
    return res

@jit
def matrix_sum_numba_jit(matrix, n_rows, n_cols):
    res = 0
    for ridx in range(n_rows):
        for cidx in range(n_cols):
            res += matrix[ridx][cidx]
    return res

@cuda.jit
def matrix_sum_reduce_numba_cuda(matrix, res, max_stride_iter, n_rows, n_cols):
    idx = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x
    volume = n_rows * n_cols
    
    if idx < volume:
        ridx = idx // n_cols
        cidx = idx %  n_cols

        for s in range(max_stride_iter):
            stride = 2**s
            if (cuda.threadIdx.x % (2 * stride)) == 0:
                idx_s = idx + stride
                if idx_s < volume:
                    ridx_s = idx_s // n_cols
                    cidx_s = idx_s %  n_cols
                    matrix[ridx][cidx] += matrix[ridx_s][cidx_s]
            cuda.syncthreads()

    if cuda.threadIdx.x == 0:
        res[cuda.blockIdx.x] = matrix[ridx][cidx]

@cuda.jit
def matrix_sum_reduce_numba_cuda_v2(matrix, res, max_stride_iter, n_rows, n_cols):
    blockIdx = cuda.blockDim.x * cuda.blockIdx.x
    idx = cuda.threadIdx.x + blockIdx
    volume = n_rows * n_cols
    
    if idx < volume:
        for s in range(max_stride_iter):
            stride = 2**s
            idx =  2 * stride * cuda.threadIdx.x
            if idx < cuda.blockDim.x:
                idx = idx + blockIdx
                idx_s = idx + stride
                if idx < volume and idx_s < volume:
                    ridx = idx // n_cols
                    cidx = idx %  n_cols
                    ridx_s = idx_s // n_cols
                    cidx_s = idx_s %  n_cols
                    matrix[ridx][cidx] += matrix[ridx_s][cidx_s]
            cuda.syncthreads()

    if cuda.threadIdx.x == 0:
        res[cuda.blockIdx.x] = matrix[ridx][cidx]

@cuda.jit
def matrix_sum_reduce_numba_cuda_v3(matrix, res, max_stride_iter, n_rows, n_cols):
    idx = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x
    volume = n_rows * n_cols
    
    if idx < volume:
        ridx = idx // n_cols
        cidx = idx %  n_cols
        for _ in range(max_stride_iter):
            stride = cuda.blockDim.x // 2
            if cuda.threadIdx.x < stride:
                idx_s = idx + stride
                if idx_s < volume:
                    ridx_s = idx_s // n_cols
                    cidx_s = idx_s %  n_cols
                    matrix[ridx][cidx] += matrix[ridx_s][cidx_s]
            cuda.syncthreads()

    if cuda.threadIdx.x == 0:
        res[cuda.blockIdx.x] = matrix[ridx][cidx]

@cuda.reduce
def matrix_sum_reduce_numba_cuda_reduce(a, b):
    return a + b


if __name__ == "__main__":
    n_row, n_col = 1080, 1920
    arr = np.random.random((n_row, n_col))

    elapsed_time = time.time()
    res_numpy = matrix_sum_numpy(arr, n_row, n_col)
    elapsed_time = time.time() - elapsed_time
    print("process time on numpy : {:.03f}ms".format(elapsed_time * 1000))

    elapsed_time = time.time()
    res_cpu = matrix_sum_cpu(arr, n_row, n_col)
    elapsed_time = time.time() - elapsed_time
    if (res_numpy - res_cpu) < 1e-6:
        print("process time on cpu : {:.03f}ms".format(elapsed_time * 1000))

    elapsed_time = time.time()
    res_n_jit = matrix_sum_numba_jit(arr, n_row, n_col)
    elapsed_time = time.time() - elapsed_time
    if (res_numpy - res_n_jit) < 1e-6:
        print("process time on numba jit : {:.03f}ms".format(elapsed_time * 1000))

    # numba cuda v1, v2, v3 and reduce version
    threadsperblock = 1024
    blockspergrid   = math.ceil((n_col*n_row)/threadsperblock)
    max_stride_iter = math.ceil(math.log2(threadsperblock))
    block_res_n_cuda = np.zeros(blockspergrid)
    d_arr = cuda.to_device(arr)
    d_block_res_n_cuda = cuda.to_device(block_res_n_cuda)
    cuda.synchronize()

    elapsed_time = time.time()
    matrix_sum_reduce_numba_cuda[blockspergrid, threadsperblock](d_arr, d_block_res_n_cuda, max_stride_iter, n_row, n_col)
    cuda.synchronize()
    block_res_n_cuda = d_block_res_n_cuda.copy_to_host()
    cuda.synchronize()
    res_n_cuda = 0
    for res in block_res_n_cuda:
        res_n_cuda += res
    elapsed_time = time.time() - elapsed_time
    if (res_numpy - res_n_cuda) < 1e-6:
        print("process time on numba cuda: {:.03f}ms".format(elapsed_time * 1000))

    elapsed_time = time.time()
    matrix_sum_reduce_numba_cuda_v2[blockspergrid, threadsperblock](d_arr, d_block_res_n_cuda, max_stride_iter, n_row, n_col)
    cuda.synchronize()
    block_res_n_cuda = d_block_res_n_cuda.copy_to_host()
    cuda.synchronize()
    res_n_cuda = 0
    for res in block_res_n_cuda:
        res_n_cuda += res
    elapsed_time = time.time() - elapsed_time
    if (res_numpy - res_n_cuda) < 1e-6:
        print("process time on numba cuda v2: {:.03f}ms".format(elapsed_time * 1000))

    elapsed_time = time.time()
    matrix_sum_reduce_numba_cuda_v3[blockspergrid, threadsperblock](d_arr, d_block_res_n_cuda, max_stride_iter, n_row, n_col)
    cuda.synchronize()
    block_res_n_cuda = d_block_res_n_cuda.copy_to_host()
    cuda.synchronize()
    res_n_cuda = 0
    for res in block_res_n_cuda:
        res_n_cuda += res
    elapsed_time = time.time() - elapsed_time
    if (res_numpy - res_n_cuda) < 1e-6:
        print("process time on numba cuda v3: {:.03f}ms".format(elapsed_time * 1000))

    arr_1d = arr.reshape(-1)
    d_arr_1d = cuda.to_device(arr_1d)
    elapsed_time = time.time()
    res_n_cuda_reduce = matrix_sum_reduce_numba_cuda_reduce(d_arr_1d)
    elapsed_time = time.time() - elapsed_time
    if (res_numpy - res_n_cuda_reduce) < 1e-6:
        print("process time on numba reduce: {:.03f}ms".format(elapsed_time * 1000))

Output 1:

process time on numpy : 0.820ms
process time on cpu : 566.166ms
process time on numba jit : 187.324ms
process time on numba cuda: 208.259ms
process time on numba cuda v2: 137.212ms
process time on numba cuda v3: 120.333ms
process time on numba reduce: 376.588ms

Output 2:

process time on numpy : 0.871ms
process time on cpu : 567.084ms
process time on numba jit (1st): 236.218ms
process time on numba jit (2nd): 2.202ms
process time on numba cuda (1st): 252.695ms
process time on numba cuda (2nd): 1.662ms
process time on numba cuda v2 (1st): 158.547ms
process time on numba cuda v2 (2nd): 1.468ms
process time on numba cuda v3 (1st): 130.770ms
process time on numba cuda v3 (2nd): 1.101ms
process time on numba reduce (1st): 434.946ms
process time on numba reduce (2nd): 0.686ms

Parallel reduction is a fundamental parallel algorithm for problems of the form: given N inputs and an associative binary operator, produce a single result. The operator can be sum, max, min, square, logical AND/OR, etc. Here we use addition. The naive approach is serial traversal (matrix_sum_cpu). For parallel computation, the associativity and commutativity of addition mean elements can be summed in any order. The approach: partition the input array into smaller chunks, use one thread to compute a partial sum per chunk, then sum the partial sums. This is parallel reduction. Time complexity drops from O(N) to O(log N). numba.cuda provides a built-in 1D parallel reduction via @cuda.reduce, as used in matrix_sum_reduce_numba_cuda_reduce.

The explicit implementation is in matrix_sum_reduce_numba_cuda. The kernel loops over stride values, doubling each iteration up to the block’s thread count. When a thread’s index is a multiple of the current stride’s interval, it adds the element at distance stride into its own element. Thread 0 then saves the block’s partial sum. The kernel only produces per-block partial sums — global synchronization across blocks isn’t possible in CUDA. Blocks may be scheduled in multiple waves: a kernel requiring 10 blocks with only 5 available at a time runs in two waves, making it impossible to synchronize across all 10 simultaneously. Within-block synchronization uses cuda.syncthreads().

In matrix_sum_reduce_numba_cuda, only threads whose index is divisible by the current stride interval are active — the rest sit idle within their warp. This is warp divergence: the hardware schedules at warp granularity, so idle threads in a warp still consume scheduling resources. Warp divergence can be reduced by reorganizing thread assignments. In matrix_sum_reduce_numba_cuda_v2, threads in the first half of the block are active while threads in the second half are idle — keeping entire warps either fully active or fully idle. This simple index reorganization yields a 1.5x speedup over v1.

Memory coalescing also matters. In v2, the memory access stride per loop iteration is 2 * stride * threadIdx.x — non-contiguous. Reorganizing to interleaved pairing (as in v3) concentrates reads and writes to contiguous addresses each iteration, improving memory access efficiency. v3 is about 1.14x faster than v2.

(To be continued)

References

Writing this post drew heavily from official documentation and other blogs. Recommended reading:

Reading about something a hundred times is no substitute for doing it once. After going through this post, try to reproduce the CUDA code yourself.