A Deep Dive into Sparse Matrices in PyTorch 2.12: COO, CSR, CSC, BSR, and BSC

In many models and systems now, matrix shapes are large enough to feel a little absurd. Embedding tables, graph adjacency matrices, feature matrices in recommender systems, attention masks: they can easily contain millions or tens of millions of elements. But if you actually inspect them, many positions contain nothing useful. Their value is just 0.

The sparse matrix idea is simple: if there are too many zeros, do not store the zeros. Store only the nonzero values. Memory should go down, and later matrix multiplication may skip some wasted work.

The catch is immediate. Once you throw away the zeros, you must remember one more thing: where did the nonzero values come from? The way you encode those positions mostly determines what the sparse matrix looks like, and it also affects which path torch.sparse.mm can take.

PyTorch 2.12 commonly exposes COO, CSR, CSC, BSR, and BSC sparse layouts. They all do the same basic job: store nonzero values and their positions. The difference is in the position encoding. Some layouts store raw coordinates. Some compress row pointers. Some compress column pointers. Some group elements into blocks.

I want to pull that apart first: which tensors does each PyTorch sparse layout actually store? When does sparse storage really save memory? When you call torch.sparse.mm, which layouts run, and which layouts can merely be constructed?

Near the end, I include one CPU/GPU benchmark. It only describes what happened for these machines, these PyTorch wheels, and these matrix shapes. Sparse performance depends heavily on hardware and kernels. One table should not be read as a general law.

Start with “do not store zeros”

Assume we have a matrix A:

Its shape is 5 x 6, and it has nnz = 5 nonzero elements. If we store it as dense float32, it needs 5 x 6 x 4 = 120 bytes, not counting tensor metadata.

Dense storage only cares about the value at each position. Sparse storage has to think one step further: besides the nonzero values themselves, usually called values, it also has to store where those values lived in the original matrix, usually through some form of indices. So the difference between sparse layouts is not mainly the API name. It is the index structure.

Sparse layouts remove zeros, but indices are not free. In PyTorch, indices are usually int64, so each index is 8 bytes. If the matrix is small, or not sparse enough, sparse storage can end up using more memory.

The difference between sparse layouts mostly comes from how coordinates are encoded.

COO: the direct coordinate table

COO means coordinate list. It is basically a table of coordinates. It is a convenient entry format because if you know “which row, which column, and what value”, you can build the sparse matrix. PyTorch’s torch.sparse_coo_tensor(indices, values, size) takes two main pieces of data:

import torch

indices = torch.tensor([
    [0, 1, 1, 3, 3, 4],  # row
    [2, 0, 4, 1, 3, 5],  # col
])
values = torch.tensor([3, 4, 5, 7, 8, 9], dtype=torch.float32)

a = torch.sparse_coo_tensor(indices, values, size=(5, 6))

The nice thing about COO is that it is direct. Many sparse datasets do not start life as a matrix anyway. They arrive as events, edges, or samples, one record at a time. Accumulating them as COO is often the least painful path.

The index overhead is also direct. In a 2D matrix, each nonzero element needs two coordinates. If the value is float32 and the index is int64, each nonzero element is roughly 20 bytes:

Here s_i is index bytes, and s_v is value bytes. This still excludes tensor metadata. If the matrix is not sparse enough, COO indices quickly eat back the space saved by dropping zeros.

PyTorch COO has one detail that is easy to miss: it can be uncoalesced. The same coordinate can appear multiple times, and semantically those repeated values are summed. For example, position (1, 4) can appear twice; after coalesce(), it becomes a single coordinate. This is useful for incremental construction. For computation, you usually want to call coalesce() first, otherwise repeated entries at the same position add traversal work.

unmerged = torch.sparse_coo_tensor(
    torch.tensor([[1, 1], [4, 4]]),
    torch.tensor([2.0, 3.0]),
    size=(5, 6),
)

merged = unmerged.coalesce()
# merged.indices() -> [[1], [4]]
# merged.values()  -> [5.]

The COO intuition is: easy to build, flexible to represent, and visibly redundant in its indices.

CSR: compress row coordinates into pointers

CSR means compressed sparse row. With CSR, the layout starts compressing repeated coordinates. Instead of storing a row index for every nonzero element, it uses a row pointer array to describe the data range for each row:

crow_indices = torch.tensor([0, 1, 3, 3, 5, 6])
col_indices = torch.tensor([2, 0, 4, 1, 3, 5])
values = torch.tensor([3, 4, 5, 7, 8, 9], dtype=torch.float32)

a = torch.sparse_csr_tensor(crow_indices, col_indices, values, size=(5, 6))

crow_indices has length n_rows + 1. The nonzero elements in row i live between crow_indices[i] and crow_indices[i + 1], a half-open slice. For example, row 1 has range [1, 3), so it reads col_indices[1:3] = [0, 4] and values[1:3] = [4, 5].

CSR turns repeated row coordinates into row pointers. Row traversal is natural; column coordinates are still stored per nonzero element.

CSR storage is roughly:

Compared with COO, CSR removes the row index for every nonzero element, but adds a pointer array of length n_rows + 1. The wider the matrix, and the more nonzero elements per row, the more likely CSR is to save index storage relative to COO.

The PyTorch docs also mention that CSR sparse matrix multiplication is usually better suited than COO for compressed row format. That sentence should not be taken alone as a conclusion. Actual speed still depends on the sparsity pattern, matrix size, right-hand-side dense width, CPU/GPU backend, and PyTorch build.

CSC: compress column coordinates into pointers

CSC means compressed sparse column. You can think of it as the transpose view of CSR: it compresses columns and stores row indices per nonzero element. If CSR turns “row-wise access” into contiguous slices, CSC does the same for “column-wise access”.

ccol_indices = torch.tensor([0, 1, 2, 3, 4, 5, 6])
row_indices = torch.tensor([1, 3, 0, 3, 1, 4])
values = torch.tensor([4, 7, 3, 8, 5, 9], dtype=torch.float32)

a = torch.sparse_csc_tensor(ccol_indices, row_indices, values, size=(5, 6))

ccol_indices[j] to ccol_indices[j + 1] describes the nonzero elements in column j. In other words, CSR is friendly to “find data by row”, while CSC is friendly to “find data by column”.

CSR and CSC both compress coordinates in their value arrays, but they make traversal contiguous along different directions.

For A @ X, where the left-hand side is sparse and the right-hand side is dense, CSR’s row slices fit the output-row computation more naturally. CSC is not useless. It is better for column-oriented access, some transposed forms, or particular backend paths. In the local experiment here, CSC can run torch.sparse.mm on CPU, but it is much slower than COO/CSR. That only describes this environment and this workload; it does not mean CSC is generally slower.

BSR and BSC: compress coordinates at block level

BSR means block sparse row, and BSC means block sparse column. They push coordinate compression one level higher, from element coordinates to block coordinates. Instead of treating every nonzero element as an independent coordinate, the matrix is split into fixed-size dense blocks. If a block contains any element that needs to be preserved, the whole dense payload of that block is stored.

For example, use a 2 x 2 block. Even if a block contains only one nonzero value, BSR still stores 4 values. The cost is that zeros inside the block may come back. The benefit is that there are far fewer block coordinates, and computation inside a block looks more like a regular small dense matrix.

dense = torch.tensor([
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 5, 6],
    [0, 0, 7, 8],
], dtype=torch.float32)

bsr = dense.to_sparse_bsr(blocksize=(2, 2))
bsc = dense.to_sparse_bsc(blocksize=(2, 2))

The practical question for BSR/BSC is whether nonzeros naturally appear in blocks.

BSR/BSC storage is approximately:

Here n_b is the number of nonzero blocks, n_p is the number of compressed block rows or block columns, and b_h x b_w is the block shape. For attention masks, graph structures, finite-element matrices, or some block-structured feature interactions, this structure can make sense. For randomly scattered nonzero elements, block sparse often stores many zeros again.

The random 5% sparse matrix in the local experiment is a good example: 16x16 BSR/BSC storage is already slightly higher than dense, because nearly every block is “lit up” by at least one nonzero element.

`torch.sparse.mm`: layout exists, but computation paths differ

PyTorch 2.12 provides these constructors, but torch.sparse.mm support is not “all sparse layouts behave the same”. The PyTorch 2.12 docstring says it supports COO and CSR storage formats. In the gradient support section, COO @ Dense and CSR @ Dense both support backward for both inputs, while CSC/BSR/BSC @ Dense does not support backward.

This matters in training code. You may be able to construct a CSR/CSC/BSR/BSC tensor, but that does not mean every layout has the forward kernel, backward kernel, or device support you need. In real code, treat layout as part of the API contract, not just something a tensor constructor accepts.

A minimal check is to run it directly:

def try_sparse_mm(a_sparse, x_dense):
    try:
        y = torch.sparse.mm(a_sparse, x_dense)
        return y, None
    except Exception as exc:
        return None, f"{type(exc).__name__}: {exc}"

The benchmark script in this post does the same thing. If a layout runs, it times it. If it does not, it records unsupported.

CPU/GPU experiment: storage saved, multiplication speed

After the data structures, look at one set of actual numbers. The table puts CPU and GPU results side by side. Both experiments use a 2048 x 2048 matrix, a 2048 x 64 right-hand-side dense matrix, float32, and 16 x 16 block size. CPU numbers come from a local arm64 PyTorch 2.12 wheel; GPU numbers come from the same workload on H200/CUDA.

CPU command:

uv run --python 3.12 --with 'torch==2.12.*' scripts/benchmark_pytorch_sparse.py --device cpu

GPU command:

python -m haptic_foundation.scripts.benchmark_pytorch_sparse_gpu --device cuda --rows 2048 --cols 2048 --rhs-cols 64 --block-size 16

Code: GPU sparse matrix benchmark

from __future__ import annotations

import argparse
import platform
import time
from dataclasses import dataclass
from typing import Callable

import torch


@dataclass
class Case:
    name: str
    dense: torch.Tensor
    rhs: torch.Tensor


@dataclass
class Result:
    case: str
    layout: str
    nnz: int
    storage_bytes: int | None
    storage_ratio: float | None
    time_ms: float | None
    time_ratio: float | None
    note: str


def tensor_bytes(tensor: torch.Tensor) -> int:
    return tensor.untyped_storage().nbytes()


def sparse_storage_bytes(tensor: torch.Tensor) -> int:
    layout = tensor.layout
    if layout == torch.sparse_coo:
        return (
            tensor.indices().untyped_storage().nbytes()
            + tensor.values().untyped_storage().nbytes()
        )
    if layout in {torch.sparse_csr, torch.sparse_bsr}:
        return (
            tensor.crow_indices().untyped_storage().nbytes()
            + tensor.col_indices().untyped_storage().nbytes()
            + tensor.values().untyped_storage().nbytes()
        )
    if layout in {torch.sparse_csc, torch.sparse_bsc}:
        return (
            tensor.ccol_indices().untyped_storage().nbytes()
            + tensor.row_indices().untyped_storage().nbytes()
            + tensor.values().untyped_storage().nbytes()
        )
    raise ValueError(f"unsupported layout for storage accounting: {layout}")


def cuda_median_ms(
    fn: Callable[[], torch.Tensor],
    *,
    warmup: int,
    repeats: int,
) -> float:
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()

    times: list[float] = []
    for _ in range(repeats):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    return sorted(times)[len(times) // 2]


def cpu_median_ms(fn: Callable[[], torch.Tensor], *, min_run_time: float) -> float:
    for _ in range(3):
        fn()
    start = time.perf_counter()
    iters = 0
    samples: list[float] = []
    while time.perf_counter() - start < min_run_time or iters < 5:
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
        iters += 1
    return sorted(samples)[len(samples) // 2]


def median_ms(
    fn: Callable[[], torch.Tensor],
    *,
    device: torch.device,
    warmup: int,
    repeats: int,
    min_run_time: float,
) -> float:
    if device.type == "cuda":
        return cuda_median_ms(fn, warmup=warmup, repeats=repeats)
    return cpu_median_ms(fn, min_run_time=min_run_time)


def make_unstructured_case(
    name: str,
    rows: int,
    cols: int,
    rhs_cols: int,
    density: float,
    *,
    seed: int,
    dtype: torch.dtype,
    device: torch.device,
) -> Case:
    generator = torch.Generator(device=device).manual_seed(seed)
    nnz = max(1, int(rows * cols * density))
    flat = torch.randperm(rows * cols, generator=generator, device=device)[:nnz]
    row = torch.div(flat, cols, rounding_mode="floor")
    col = flat.remainder(cols)
    values = torch.randn(nnz, generator=generator, dtype=dtype, device=device)
    dense = torch.zeros((rows, cols), dtype=dtype, device=device)
    dense[row, col] = values
    rhs = torch.randn((cols, rhs_cols), generator=generator, dtype=dtype, device=device)
    return Case(name=name, dense=dense, rhs=rhs)


def make_block_case(
    name: str,
    rows: int,
    cols: int,
    rhs_cols: int,
    block_size: tuple[int, int],
    block_density: float,
    *,
    seed: int,
    dtype: torch.dtype,
    device: torch.device,
) -> Case:
    brow, bcol = block_size
    if rows % brow or cols % bcol:
        raise ValueError("rows and cols must be divisible by block size")
    generator = torch.Generator(device=device).manual_seed(seed)
    block_rows = rows // brow
    block_cols = cols // bcol
    n_blocks = max(1, int(block_rows * block_cols * block_density))
    chosen = torch.randperm(block_rows * block_cols, generator=generator, device=device)[:n_blocks]
    dense = torch.zeros((rows, cols), dtype=dtype, device=device)
    for block in chosen.tolist():
        br = block // block_cols
        bc = block % block_cols
        dense[br * brow : (br + 1) * brow, bc * bcol : (bc + 1) * bcol] = torch.randn(
            (brow, bcol), generator=generator, dtype=dtype, device=device
        )
    rhs = torch.randn((cols, rhs_cols), generator=generator, dtype=dtype, device=device)
    return Case(name=name, dense=dense, rhs=rhs)


def convert_layouts(dense: torch.Tensor, block_size: tuple[int, int]) -> dict[str, torch.Tensor]:
    coo = dense.to_sparse_coo().coalesce()
    return {
        "dense": dense,
        "coo": coo,
        "csr": dense.to_sparse_csr(),
        "csc": dense.to_sparse_csc(),
        "bsr": dense.to_sparse_bsr(blocksize=block_size),
        "bsc": dense.to_sparse_bsc(blocksize=block_size),
    }


def run_case(
    case: Case,
    block_size: tuple[int, int],
    *,
    device: torch.device,
    warmup: int,
    repeats: int,
    min_run_time: float,
) -> list[Result]:
    layouts = convert_layouts(case.dense, block_size)
    dense_bytes = tensor_bytes(case.dense)
    dense_time_ms = median_ms(
        lambda: torch.mm(case.dense, case.rhs),
        device=device,
        warmup=warmup,
        repeats=repeats,
        min_run_time=min_run_time,
    )
    results = [
        Result(
            case=case.name,
            layout="dense",
            nnz=int(torch.count_nonzero(case.dense).item()),
            storage_bytes=dense_bytes,
            storage_ratio=1.0,
            time_ms=dense_time_ms,
            time_ratio=1.0,
            note="baseline",
        )
    ]
    for layout_name in ["coo", "csr", "csc", "bsr", "bsc"]:
        sparse = layouts[layout_name]
        storage_bytes = sparse_storage_bytes(sparse)
        try:
            torch.sparse.mm(sparse, case.rhs)
            if device.type == "cuda":
                torch.cuda.synchronize()
            time_ms = median_ms(
                lambda sparse=sparse: torch.sparse.mm(sparse, case.rhs),
                device=device,
                warmup=warmup,
                repeats=repeats,
                min_run_time=min_run_time,
            )
            note = "ok"
        except Exception as exc:
            time_ms = None
            note = f"unsupported: {type(exc).__name__}: {str(exc).splitlines()[0]}"
        results.append(
            Result(
                case=case.name,
                layout=layout_name,
                nnz=int(sparse._nnz()),
                storage_bytes=storage_bytes,
                storage_ratio=storage_bytes / dense_bytes,
                time_ms=time_ms,
                time_ratio=None if time_ms is None else time_ms / dense_time_ms,
                note=note,
            )
        )
    return results


def fmt_bytes(value: int | None) -> str:
    if value is None:
        return "-"
    units = ["B", "KiB", "MiB", "GiB"]
    size = float(value)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f} {unit}" if unit != "B" else f"{int(size)} B"
        size /= 1024
    return f"{value} B"


def fmt_ratio(value: float | None) -> str:
    return "-" if value is None else f"{value:.3f}x"


def fmt_ms(value: float | None) -> str:
    return "-" if value is None else f"{value:.3f}"


def print_markdown(results: list[Result]) -> None:
    print(
        "| case | layout | nnz / blocks | storage | storage vs dense | "
        "median ms | time vs dense | note |"
    )
    print("| --- | ---: | ---: | ---: | ---: | ---: | ---: | --- |")
    for row in results:
        print(
            f"| {row.case} | {row.layout} | {row.nnz} | {fmt_bytes(row.storage_bytes)} | "
            f"{fmt_ratio(row.storage_ratio)} | {fmt_ms(row.time_ms)} | "
            f"{fmt_ratio(row.time_ratio)} | {row.note} |"
        )


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--device", default="cuda", choices=["cuda", "cpu"])
    parser.add_argument("--rows", type=int, default=2048)
    parser.add_argument("--cols", type=int, default=2048)
    parser.add_argument("--rhs-cols", type=int, default=64)
    parser.add_argument("--block-size", type=int, default=16)
    parser.add_argument("--warmup", type=int, default=10)
    parser.add_argument("--repeats", type=int, default=50)
    parser.add_argument("--min-run-time", type=float, default=0.2)
    parser.add_argument("--require-torch-prefix", default="2.12.")
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    if args.require_torch_prefix and not torch.__version__.startswith(args.require_torch_prefix):
        raise SystemExit(
            f"Expected torch {args.require_torch_prefix}*, got torch {torch.__version__}"
        )
    if args.device == "cuda" and not torch.cuda.is_available():
        raise SystemExit("CUDA device requested, but torch.cuda.is_available() is false")

    device = torch.device(args.device)
    dtype = torch.float32
    block_size = (args.block_size, args.block_size)
    cases = [
        make_unstructured_case("random 0.1%", args.rows, args.cols, args.rhs_cols, 0.001, seed=11, dtype=dtype, device=device),
        make_unstructured_case("random 1%", args.rows, args.cols, args.rhs_cols, 0.01, seed=12, dtype=dtype, device=device),
        make_unstructured_case("random 5%", args.rows, args.cols, args.rhs_cols, 0.05, seed=13, dtype=dtype, device=device),
        make_block_case("16x16 blocks 1%", args.rows, args.cols, args.rhs_cols, block_size, 0.01, seed=21, dtype=dtype, device=device),
    ]

    print(f"torch: {torch.__version__}")
    print(f"python: {platform.python_version()} ({platform.machine()})")
    print(f"device: {device}")
    if device.type == "cuda":
        props = torch.cuda.get_device_properties(device)
        print(f"cuda: {torch.version.cuda}")
        print(f"gpu: {props.name}")
        print(f"compute capability: {props.major}.{props.minor}")
        print(f"total memory: {fmt_bytes(props.total_memory)}")
    else:
        print(f"threads: {torch.get_num_threads()}")
    print(f"shape: {args.rows}x{args.cols}, rhs: {args.cols}x{args.rhs_cols}, dtype: float32")
    print(f"block size: {block_size[0]}x{block_size[1]}")
    print()

    all_results: list[Result] = []
    for case in cases:
        all_results.extend(
            run_case(
                case,
                block_size,
                device=device,
                warmup=args.warmup,
                repeats=args.repeats,
                min_run_time=args.min_run_time,
            )
        )
    print_markdown(all_results)


if __name__ == "__main__":
    main()

Environment and measurement method:

PyTorch: 2.12.0
Matrix shape: 2048 x 2048
Right-hand-side dense matrix: 2048 x 64
dtype: float32
block size: 16 x 16
storage accounting: actual tensor buffer bytes, including values and index tensors, excluding Python object metadata
CPU timing: torch.utils.benchmark.Timer(...).blocked_autorange
GPU timing: CUDA events + torch.cuda.synchronize()

The results use dense as 1.0. time vs dense (CPU) = 0.102x means the CPU sparse path took about 10.2% of CPU dense time. time vs dense (GPU) = 1.748x means the GPU sparse path was about 1.75x slower than dense on the same GPU.

case	layout	nnz / blocks	storage	storage vs dense	time vs dense (CPU)	time vs dense (GPU)	note
random 0.1%	dense	4194	16.0 MiB	1.000x	1.000x	1.000x	baseline
random 0.1%	coo	4194	81.9 KiB	0.005x	0.102x	1.748x	ok
random 0.1%	csr	4194	97.9 KiB	0.006x	0.137x	0.971x	ok
random 0.1%	csc	4194	97.9 KiB	0.006x	0.306x	3.925x	ok
random 0.1%	bsr	3684	3.7 MiB	0.228x	-	2.290x	CPU unsupported
random 0.1%	bsc	3684	3.7 MiB	0.228x	-	-	CPU/GPU unsupported
random 1%	dense	41943	16.0 MiB	1.000x	1.000x	1.000x	baseline
random 1%	coo	41943	819.2 KiB	0.050x	0.792x	1.486x	ok
random 1%	csr	41943	835.2 KiB	0.051x	0.920x	0.929x	ok
random 1%	csc	41943	835.2 KiB	0.051x	4.169x	4.115x	ok
random 1%	bsr	15135	15.0 MiB	0.938x	-	2.555x	CPU unsupported
random 1%	bsc	15135	15.0 MiB	0.938x	-	-	CPU/GPU unsupported
random 5%	dense	209715	16.0 MiB	1.000x	1.000x	1.000x	baseline
random 5%	coo	209715	4.0 MiB	0.250x	3.571x	2.015x	ok
random 5%	csr	209715	4.0 MiB	0.251x	4.149x	1.007x	ok
random 5%	csc	209715	4.0 MiB	0.251x	17.511x	4.172x	ok
random 5%	bsr	16384	16.3 MiB	1.016x	-	2.674x	CPU unsupported
random 5%	bsc	16384	16.3 MiB	1.016x	-	-	CPU/GPU unsupported
16x16 blocks 1%	dense	41728	16.0 MiB	1.000x	1.000x	1.000x	baseline
16x16 blocks 1%	coo	41728	815.0 KiB	0.050x	0.833x	1.481x	ok
16x16 blocks 1%	csr	41728	831.0 KiB	0.051x	0.796x	0.971x	ok
16x16 blocks 1%	csc	41728	831.0 KiB	0.051x	2.753x	4.004x	ok
16x16 blocks 1%	bsr	163	166.6 KiB	0.010x	-	2.226x	CPU unsupported
16x16 blocks 1%	bsc	163	166.6 KiB	0.010x	-	-	CPU/GPU unsupported

This table is easy to misread, so the boundary matters: these numbers only apply to this shape, these sparsity patterns, and these two runtime environments.

Start with storage. For random 0.1%, COO uses only about 0.5% of dense storage. Dense is 16 MiB; COO is 81.9 KiB. If the question is simply “how do I fit this matrix in memory?”, sparse storage helps directly.

But saving storage does not mean multiplication is always faster. On CPU, COO/CSR are faster than dense for random 0.1%; by random 5%, COO/CSR/CSC are all slower. On GPU, in this H200 result, CSR stays close to dense, while COO and CSC are slower. The reason is not mysterious: dense GEMM is highly optimized contiguous computation; sparse multiplication has to read indices, do indirect access, and often write dense output. Once there are enough nonzero elements, index overhead and irregular memory access can cancel out the skipped multiplications.

Block sparse is more picky about the data. At random 1%, 16x16 BSR/BSC storage is already close to dense storage; at random 5%, it is slightly above dense. The reason is the same old issue: if a block has even one nonzero element, the whole 16x16 payload is stored. In the artificial 16x16 blocks 1% case, BSR/BSC use only about 1% of dense storage. Block sparse looks right only when the block structure is real.

One more practical point: layout support is part of the performance result. The local CPU wheel did not run BSR/BSC through torch.sparse.mm; H200 can run BSR, but it is still slower than dense for this workload; BSC on CUDA also lacks the Strided + SparseBsc @ Strided path. This does not mean the BSR/BSC data structures are meaningless. It means current backend and kernel coverage decide whether you can use them, and whether they are fast.

How to choose a format

If you have a batch of coordinates and values and want to put them into a sparse tensor, COO is usually the easiest format. It is construction-friendly and can represent duplicate coordinates. After construction, if you want to compute with it, consider coalesce() first.

If the main operation is A @ X, where A is the left-hand sparse matrix, CSR is a natural candidate. It compresses by row, and each output row can read that row’s nonzero entries from one contiguous slice. It is not always fastest, but the data structure matches this access pattern.

If your access pattern is naturally column-oriented, or if you often work with transposed views, CSC is worth considering. Do not assume CSC and CSR have identical speed just because CSC is the “mirror image” of CSR. Actual kernels are often not symmetric.

If nonzero elements really appear in blocks, then consider BSR/BSC. The benefit of block sparse comes from “fewer block coordinates” and “more regular dense computation inside each block”. If your nonzeros are just random points, block sparse stores the zeros inside blocks too, and storage can quickly approach dense.

If this is production code, measure it. At minimum, fix the shape, density, dtype, index dtype, right-hand-side width, device, PyTorch version, and build information. Choosing a sparse matrix format is not an abstract format question. It is an engineering question about data distribution and kernel support.