In many models and systems now, matrix shapes are large enough to feel a little absurd. Embedding tables, graph adjacency matrices, feature matrices in recommender systems, attention masks: they can easily contain millions or tens of millions of elements. But if you actually inspect them, many positions contain nothing useful. Their value is just 0.
The sparse matrix idea is simple: if there are too many zeros, do not store the zeros. Store only the nonzero values. Memory should go down, and later matrix multiplication may skip some wasted work.
The catch is immediate. Once you throw away the zeros, you must remember one more thing: where did the nonzero values come from? The way you encode those positions mostly determines what the sparse matrix looks like, and it also affects which path torch.sparse.mm can take.
PyTorch 2.12 commonly exposes COO, CSR, CSC, BSR, and BSC sparse layouts. They all do the same basic job: store nonzero values and their positions. The difference is in the position encoding. Some layouts store raw coordinates. Some compress row pointers. Some compress column pointers. Some group elements into blocks.
I want to pull that apart first: which tensors does each PyTorch sparse layout actually store? When does sparse storage really save memory? When you call torch.sparse.mm, which layouts run, and which layouts can merely be constructed?
Near the end, I include one CPU/GPU benchmark. It only describes what happened for these machines, these PyTorch wheels, and these matrix shapes. Sparse performance depends heavily on hardware and kernels. One table should not be read as a general law.
Start with “do not store zeros”
Assume we have a matrix A:
Its shape is 5 x 6, and it has nnz = 5 nonzero elements. If we store it as dense float32, it needs 5 x 6 x 4 = 120 bytes, not counting tensor metadata.
Dense storage only cares about the value at each position. Sparse storage has to think one step further: besides the nonzero values themselves, usually called values, it also has to store where those values lived in the original matrix, usually through some form of indices. So the difference between sparse layouts is not mainly the API name. It is the index structure.
Sparse layouts remove zeros, but indices are not free. In PyTorch, indices are usually int64, so each index is 8 bytes. If the matrix is small, or not sparse enough, sparse storage can end up using more memory.
COO: the direct coordinate table
COO means coordinate list. It is basically a table of coordinates. It is a convenient entry format because if you know “which row, which column, and what value”, you can build the sparse matrix. PyTorch’s torch.sparse_coo_tensor(indices, values, size) takes two main pieces of data:
import torch
indices = torch.tensor([
[0, 1, 1, 3, 3, 4], # row
[2, 0, 4, 1, 3, 5], # col
])
values = torch.tensor([3, 4, 5, 7, 8, 9], dtype=torch.float32)
a = torch.sparse_coo_tensor(indices, values, size=(5, 6))
The nice thing about COO is that it is direct. Many sparse datasets do not start life as a matrix anyway. They arrive as events, edges, or samples, one record at a time. Accumulating them as COO is often the least painful path.
The index overhead is also direct. In a 2D matrix, each nonzero element needs two coordinates. If the value is float32 and the index is int64, each nonzero element is roughly 20 bytes:
Here s_i is index bytes, and s_v is value bytes. This still excludes tensor metadata. If the matrix is not sparse enough, COO indices quickly eat back the space saved by dropping zeros.
PyTorch COO has one detail that is easy to miss: it can be uncoalesced. The same coordinate can appear multiple times, and semantically those repeated values are summed. For example, position (1, 4) can appear twice; after coalesce(), it becomes a single coordinate. This is useful for incremental construction. For computation, you usually want to call coalesce() first, otherwise repeated entries at the same position add traversal work.
unmerged = torch.sparse_coo_tensor(
torch.tensor([[1, 1], [4, 4]]),
torch.tensor([2.0, 3.0]),
size=(5, 6),
)
merged = unmerged.coalesce()
# merged.indices() -> [[1], [4]]
# merged.values() -> [5.]
The COO intuition is: easy to build, flexible to represent, and visibly redundant in its indices.
CSR: compress row coordinates into pointers
CSR means compressed sparse row. With CSR, the layout starts compressing repeated coordinates. Instead of storing a row index for every nonzero element, it uses a row pointer array to describe the data range for each row:
crow_indices = torch.tensor([0, 1, 3, 3, 5, 6])
col_indices = torch.tensor([2, 0, 4, 1, 3, 5])
values = torch.tensor([3, 4, 5, 7, 8, 9], dtype=torch.float32)
a = torch.sparse_csr_tensor(crow_indices, col_indices, values, size=(5, 6))
crow_indices has length n_rows + 1. The nonzero elements in row i live between crow_indices[i] and crow_indices[i + 1], a half-open slice. For example, row 1 has range [1, 3), so it reads col_indices[1:3] = [0, 4] and values[1:3] = [4, 5].
CSR storage is roughly:
Compared with COO, CSR removes the row index for every nonzero element, but adds a pointer array of length n_rows + 1. The wider the matrix, and the more nonzero elements per row, the more likely CSR is to save index storage relative to COO.
The PyTorch docs also mention that CSR sparse matrix multiplication is usually better suited than COO for compressed row format. That sentence should not be taken alone as a conclusion. Actual speed still depends on the sparsity pattern, matrix size, right-hand-side dense width, CPU/GPU backend, and PyTorch build.
CSC: compress column coordinates into pointers
CSC means compressed sparse column. You can think of it as the transpose view of CSR: it compresses columns and stores row indices per nonzero element. If CSR turns “row-wise access” into contiguous slices, CSC does the same for “column-wise access”.
ccol_indices = torch.tensor([0, 1, 2, 3, 4, 5, 6])
row_indices = torch.tensor([1, 3, 0, 3, 1, 4])
values = torch.tensor([4, 7, 3, 8, 5, 9], dtype=torch.float32)
a = torch.sparse_csc_tensor(ccol_indices, row_indices, values, size=(5, 6))
ccol_indices[j] to ccol_indices[j + 1] describes the nonzero elements in column j. In other words, CSR is friendly to “find data by row”, while CSC is friendly to “find data by column”.
For A @ X, where the left-hand side is sparse and the right-hand side is dense, CSR’s row slices fit the output-row computation more naturally. CSC is not useless. It is better for column-oriented access, some transposed forms, or particular backend paths. In the local experiment here, CSC can run torch.sparse.mm on CPU, but it is much slower than COO/CSR. That only describes this environment and this workload; it does not mean CSC is generally slower.
BSR and BSC: compress coordinates at block level
BSR means block sparse row, and BSC means block sparse column. They push coordinate compression one level higher, from element coordinates to block coordinates. Instead of treating every nonzero element as an independent coordinate, the matrix is split into fixed-size dense blocks. If a block contains any element that needs to be preserved, the whole dense payload of that block is stored.
For example, use a 2 x 2 block. Even if a block contains only one nonzero value, BSR still stores 4 values. The cost is that zeros inside the block may come back. The benefit is that there are far fewer block coordinates, and computation inside a block looks more like a regular small dense matrix.
dense = torch.tensor([
[1, 2, 0, 0],
[3, 4, 0, 0],
[0, 0, 5, 6],
[0, 0, 7, 8],
], dtype=torch.float32)
bsr = dense.to_sparse_bsr(blocksize=(2, 2))
bsc = dense.to_sparse_bsc(blocksize=(2, 2))
BSR/BSC storage is approximately:
Here n_b is the number of nonzero blocks, n_p is the number of compressed block rows or block columns, and b_h x b_w is the block shape. For attention masks, graph structures, finite-element matrices, or some block-structured feature interactions, this structure can make sense. For randomly scattered nonzero elements, block sparse often stores many zeros again.
The random 5% sparse matrix in the local experiment is a good example: 16x16 BSR/BSC storage is already slightly higher than dense, because nearly every block is “lit up” by at least one nonzero element.
torch.sparse.mm: layout exists, but computation paths differ
PyTorch 2.12 provides these constructors, but torch.sparse.mm support is not “all sparse layouts behave the same”. The PyTorch 2.12 docstring says it supports COO and CSR storage formats. In the gradient support section, COO @ Dense and CSR @ Dense both support backward for both inputs, while CSC/BSR/BSC @ Dense does not support backward.
This matters in training code. You may be able to construct a CSR/CSC/BSR/BSC tensor, but that does not mean every layout has the forward kernel, backward kernel, or device support you need. In real code, treat layout as part of the API contract, not just something a tensor constructor accepts.
A minimal check is to run it directly:
def try_sparse_mm(a_sparse, x_dense):
try:
y = torch.sparse.mm(a_sparse, x_dense)
return y, None
except Exception as exc:
return None, f"{type(exc).__name__}: {exc}"
The benchmark script in this post does the same thing. If a layout runs, it times it. If it does not, it records unsupported.
CPU/GPU experiment: storage saved, multiplication speed
After the data structures, look at one set of actual numbers. The table puts CPU and GPU results side by side. Both experiments use a 2048 x 2048 matrix, a 2048 x 64 right-hand-side dense matrix, float32, and 16 x 16 block size. CPU numbers come from a local arm64 PyTorch 2.12 wheel; GPU numbers come from the same workload on H200/CUDA.
CPU command:
uv run --python 3.12 --with 'torch==2.12.*' scripts/benchmark_pytorch_sparse.py --device cpu
GPU command:
python -m haptic_foundation.scripts.benchmark_pytorch_sparse_gpu --device cuda --rows 2048 --cols 2048 --rhs-cols 64 --block-size 16
Code: GPU sparse matrix benchmark
from __future__ import annotations
import argparse
import platform
import time
from dataclasses import dataclass
from typing import Callable
import torch
@dataclass
class Case:
name: str
dense: torch.Tensor
rhs: torch.Tensor
@dataclass
class Result:
case: str
layout: str
nnz: int
storage_bytes: int | None
storage_ratio: float | None
time_ms: float | None
time_ratio: float | None
note: str
def tensor_bytes(tensor: torch.Tensor) -> int:
return tensor.untyped_storage().nbytes()
def sparse_storage_bytes(tensor: torch.Tensor) -> int:
layout = tensor.layout
if layout == torch.sparse_coo:
return (
tensor.indices().untyped_storage().nbytes()
+ tensor.values().untyped_storage().nbytes()
)
if layout in {torch.sparse_csr, torch.sparse_bsr}:
return (
tensor.crow_indices().untyped_storage().nbytes()
+ tensor.col_indices().untyped_storage().nbytes()
+ tensor.values().untyped_storage().nbytes()
)
if layout in {torch.sparse_csc, torch.sparse_bsc}:
return (
tensor.ccol_indices().untyped_storage().nbytes()
+ tensor.row_indices().untyped_storage().nbytes()
+ tensor.values().untyped_storage().nbytes()
)
raise ValueError(f"unsupported layout for storage accounting: {layout}")
def cuda_median_ms(
fn: Callable[[], torch.Tensor],
*,
warmup: int,
repeats: int,
) -> float:
for _ in range(warmup):
fn()
torch.cuda.synchronize()
times: list[float] = []
for _ in range(repeats):
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
fn()
end.record()
torch.cuda.synchronize()
times.append(start.elapsed_time(end))
return sorted(times)[len(times) // 2]
def cpu_median_ms(fn: Callable[[], torch.Tensor], *, min_run_time: float) -> float:
for _ in range(3):
fn()
start = time.perf_counter()
iters = 0
samples: list[float] = []
while time.perf_counter() - start < min_run_time or iters < 5:
t0 = time.perf_counter()
fn()
samples.append((time.perf_counter() - t0) * 1000.0)
iters += 1
return sorted(samples)[len(samples) // 2]
def median_ms(
fn: Callable[[], torch.Tensor],
*,
device: torch.device,
warmup: int,
repeats: int,
min_run_time: float,
) -> float:
if device.type == "cuda":
return cuda_median_ms(fn, warmup=warmup, repeats=repeats)
return cpu_median_ms(fn, min_run_time=min_run_time)
def make_unstructured_case(
name: str,
rows: int,
cols: int,
rhs_cols: int,
density: float,
*,
seed: int,
dtype: torch.dtype,
device: torch.device,
) -> Case:
generator = torch.Generator(device=device).manual_seed(seed)
nnz = max(1, int(rows * cols * density))
flat = torch.randperm(rows * cols, generator=generator, device=device)[:nnz]
row = torch.div(flat, cols, rounding_mode="floor")
col = flat.remainder(cols)
values = torch.randn(nnz, generator=generator, dtype=dtype, device=device)
dense = torch.zeros((rows, cols), dtype=dtype, device=device)
dense[row, col] = values
rhs = torch.randn((cols, rhs_cols), generator=generator, dtype=dtype, device=device)
return Case(name=name, dense=dense, rhs=rhs)
def make_block_case(
name: str,
rows: int,
cols: int,
rhs_cols: int,
block_size: tuple[int, int],
block_density: float,
*,
seed: int,
dtype: torch.dtype,
device: torch.device,
) -> Case:
brow, bcol = block_size
if rows % brow or cols % bcol:
raise ValueError("rows and cols must be divisible by block size")
generator = torch.Generator(device=device).manual_seed(seed)
block_rows = rows // brow
block_cols = cols // bcol
n_blocks = max(1, int(block_rows * block_cols * block_density))
chosen = torch.randperm(block_rows * block_cols, generator=generator, device=device)[:n_blocks]
dense = torch.zeros((rows, cols), dtype=dtype, device=device)
for block in chosen.tolist():
br = block // block_cols
bc = block % block_cols
dense[br * brow : (br + 1) * brow, bc * bcol : (bc + 1) * bcol] = torch.randn(
(brow, bcol), generator=generator, dtype=dtype, device=device
)
rhs = torch.randn((cols, rhs_cols), generator=generator, dtype=dtype, device=device)
return Case(name=name, dense=dense, rhs=rhs)
def convert_layouts(dense: torch.Tensor, block_size: tuple[int, int]) -> dict[str, torch.Tensor]:
coo = dense.to_sparse_coo().coalesce()
return {
"dense": dense,
"coo": coo,
"csr": dense.to_sparse_csr(),
"csc": dense.to_sparse_csc(),
"bsr": dense.to_sparse_bsr(blocksize=block_size),
"bsc": dense.to_sparse_bsc(blocksize=block_size),
}
def run_case(
case: Case,
block_size: tuple[int, int],
*,
device: torch.device,
warmup: int,
repeats: int,
min_run_time: float,
) -> list[Result]:
layouts = convert_layouts(case.dense, block_size)
dense_bytes = tensor_bytes(case.dense)
dense_time_ms = median_ms(
lambda: torch.mm(case.dense, case.rhs),
device=device,
warmup=warmup,
repeats=repeats,
min_run_time=min_run_time,
)
results = [
Result(
case=case.name,
layout="dense",
nnz=int(torch.count_nonzero(case.dense).item()),
storage_bytes=dense_bytes,
storage_ratio=1.0,
time_ms=dense_time_ms,
time_ratio=1.0,
note="baseline",
)
]
for layout_name in ["coo", "csr", "csc", "bsr", "bsc"]:
sparse = layouts[layout_name]
storage_bytes = sparse_storage_bytes(sparse)
try:
torch.sparse.mm(sparse, case.rhs)
if device.type == "cuda":
torch.cuda.synchronize()
time_ms = median_ms(
lambda sparse=sparse: torch.sparse.mm(sparse, case.rhs),
device=device,
warmup=warmup,
repeats=repeats,
min_run_time=min_run_time,
)
note = "ok"
except Exception as exc:
time_ms = None
note = f"unsupported: {type(exc).__name__}: {str(exc).splitlines()[0]}"
results.append(
Result(
case=case.name,
layout=layout_name,
nnz=int(sparse._nnz()),
storage_bytes=storage_bytes,
storage_ratio=storage_bytes / dense_bytes,
time_ms=time_ms,
time_ratio=None if time_ms is None else time_ms / dense_time_ms,
note=note,
)
)
return results
def fmt_bytes(value: int | None) -> str:
if value is None:
return "-"
units = ["B", "KiB", "MiB", "GiB"]
size = float(value)
for unit in units:
if size < 1024 or unit == units[-1]:
return f"{size:.1f} {unit}" if unit != "B" else f"{int(size)} B"
size /= 1024
return f"{value} B"
def fmt_ratio(value: float | None) -> str:
return "-" if value is None else f"{value:.3f}x"
def fmt_ms(value: float | None) -> str:
return "-" if value is None else f"{value:.3f}"
def print_markdown(results: list[Result]) -> None:
print(
"| case | layout | nnz / blocks | storage | storage vs dense | "
"median ms | time vs dense | note |"
)
print("| --- | ---: | ---: | ---: | ---: | ---: | ---: | --- |")
for row in results:
print(
f"| {row.case} | {row.layout} | {row.nnz} | {fmt_bytes(row.storage_bytes)} | "
f"{fmt_ratio(row.storage_ratio)} | {fmt_ms(row.time_ms)} | "
f"{fmt_ratio(row.time_ratio)} | {row.note} |"
)
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser()
parser.add_argument("--device", default="cuda", choices=["cuda", "cpu"])
parser.add_argument("--rows", type=int, default=2048)
parser.add_argument("--cols", type=int, default=2048)
parser.add_argument("--rhs-cols", type=int, default=64)
parser.add_argument("--block-size", type=int, default=16)
parser.add_argument("--warmup", type=int, default=10)
parser.add_argument("--repeats", type=int, default=50)
parser.add_argument("--min-run-time", type=float, default=0.2)
parser.add_argument("--require-torch-prefix", default="2.12.")
return parser.parse_args()
def main() -> None:
args = parse_args()
if args.require_torch_prefix and not torch.__version__.startswith(args.require_torch_prefix):
raise SystemExit(
f"Expected torch {args.require_torch_prefix}*, got torch {torch.__version__}"
)
if args.device == "cuda" and not torch.cuda.is_available():
raise SystemExit("CUDA device requested, but torch.cuda.is_available() is false")
device = torch.device(args.device)
dtype = torch.float32
block_size = (args.block_size, args.block_size)
cases = [
make_unstructured_case("random 0.1%", args.rows, args.cols, args.rhs_cols, 0.001, seed=11, dtype=dtype, device=device),
make_unstructured_case("random 1%", args.rows, args.cols, args.rhs_cols, 0.01, seed=12, dtype=dtype, device=device),
make_unstructured_case("random 5%", args.rows, args.cols, args.rhs_cols, 0.05, seed=13, dtype=dtype, device=device),
make_block_case("16x16 blocks 1%", args.rows, args.cols, args.rhs_cols, block_size, 0.01, seed=21, dtype=dtype, device=device),
]
print(f"torch: {torch.__version__}")
print(f"python: {platform.python_version()} ({platform.machine()})")
print(f"device: {device}")
if device.type == "cuda":
props = torch.cuda.get_device_properties(device)
print(f"cuda: {torch.version.cuda}")
print(f"gpu: {props.name}")
print(f"compute capability: {props.major}.{props.minor}")
print(f"total memory: {fmt_bytes(props.total_memory)}")
else:
print(f"threads: {torch.get_num_threads()}")
print(f"shape: {args.rows}x{args.cols}, rhs: {args.cols}x{args.rhs_cols}, dtype: float32")
print(f"block size: {block_size[0]}x{block_size[1]}")
print()
all_results: list[Result] = []
for case in cases:
all_results.extend(
run_case(
case,
block_size,
device=device,
warmup=args.warmup,
repeats=args.repeats,
min_run_time=args.min_run_time,
)
)
print_markdown(all_results)
if __name__ == "__main__":
main()
Environment and measurement method:
- PyTorch:
2.12.0 - Matrix shape:
2048 x 2048 - Right-hand-side dense matrix:
2048 x 64 - dtype:
float32 - block size:
16 x 16 - storage accounting: actual tensor buffer bytes, including values and index tensors, excluding Python object metadata
- CPU timing:
torch.utils.benchmark.Timer(...).blocked_autorange - GPU timing: CUDA events +
torch.cuda.synchronize()
The results use dense as 1.0. time vs dense (CPU) = 0.102x means the CPU sparse path took about 10.2% of CPU dense time. time vs dense (GPU) = 1.748x means the GPU sparse path was about 1.75x slower than dense on the same GPU.
| case | layout | nnz / blocks | storage | storage vs dense | time vs dense (CPU) | time vs dense (GPU) | note |
|---|---|---|---|---|---|---|---|
| random 0.1% | dense | 4194 | 16.0 MiB | 1.000x | 1.000x | 1.000x | baseline |
| random 0.1% | coo | 4194 | 81.9 KiB | 0.005x | 0.102x | 1.748x | ok |
| random 0.1% | csr | 4194 | 97.9 KiB | 0.006x | 0.137x | 0.971x | ok |
| random 0.1% | csc | 4194 | 97.9 KiB | 0.006x | 0.306x | 3.925x | ok |
| random 0.1% | bsr | 3684 | 3.7 MiB | 0.228x | - | 2.290x | CPU unsupported |
| random 0.1% | bsc | 3684 | 3.7 MiB | 0.228x | - | - | CPU/GPU unsupported |
| random 1% | dense | 41943 | 16.0 MiB | 1.000x | 1.000x | 1.000x | baseline |
| random 1% | coo | 41943 | 819.2 KiB | 0.050x | 0.792x | 1.486x | ok |
| random 1% | csr | 41943 | 835.2 KiB | 0.051x | 0.920x | 0.929x | ok |
| random 1% | csc | 41943 | 835.2 KiB | 0.051x | 4.169x | 4.115x | ok |
| random 1% | bsr | 15135 | 15.0 MiB | 0.938x | - | 2.555x | CPU unsupported |
| random 1% | bsc | 15135 | 15.0 MiB | 0.938x | - | - | CPU/GPU unsupported |
| random 5% | dense | 209715 | 16.0 MiB | 1.000x | 1.000x | 1.000x | baseline |
| random 5% | coo | 209715 | 4.0 MiB | 0.250x | 3.571x | 2.015x | ok |
| random 5% | csr | 209715 | 4.0 MiB | 0.251x | 4.149x | 1.007x | ok |
| random 5% | csc | 209715 | 4.0 MiB | 0.251x | 17.511x | 4.172x | ok |
| random 5% | bsr | 16384 | 16.3 MiB | 1.016x | - | 2.674x | CPU unsupported |
| random 5% | bsc | 16384 | 16.3 MiB | 1.016x | - | - | CPU/GPU unsupported |
| 16x16 blocks 1% | dense | 41728 | 16.0 MiB | 1.000x | 1.000x | 1.000x | baseline |
| 16x16 blocks 1% | coo | 41728 | 815.0 KiB | 0.050x | 0.833x | 1.481x | ok |
| 16x16 blocks 1% | csr | 41728 | 831.0 KiB | 0.051x | 0.796x | 0.971x | ok |
| 16x16 blocks 1% | csc | 41728 | 831.0 KiB | 0.051x | 2.753x | 4.004x | ok |
| 16x16 blocks 1% | bsr | 163 | 166.6 KiB | 0.010x | - | 2.226x | CPU unsupported |
| 16x16 blocks 1% | bsc | 163 | 166.6 KiB | 0.010x | - | - | CPU/GPU unsupported |
This table is easy to misread, so the boundary matters: these numbers only apply to this shape, these sparsity patterns, and these two runtime environments.
Start with storage. For random 0.1%, COO uses only about 0.5% of dense storage. Dense is 16 MiB; COO is 81.9 KiB. If the question is simply “how do I fit this matrix in memory?”, sparse storage helps directly.
But saving storage does not mean multiplication is always faster. On CPU, COO/CSR are faster than dense for random 0.1%; by random 5%, COO/CSR/CSC are all slower. On GPU, in this H200 result, CSR stays close to dense, while COO and CSC are slower. The reason is not mysterious: dense GEMM is highly optimized contiguous computation; sparse multiplication has to read indices, do indirect access, and often write dense output. Once there are enough nonzero elements, index overhead and irregular memory access can cancel out the skipped multiplications.
Block sparse is more picky about the data. At random 1%, 16x16 BSR/BSC storage is already close to dense storage; at random 5%, it is slightly above dense. The reason is the same old issue: if a block has even one nonzero element, the whole 16x16 payload is stored. In the artificial 16x16 blocks 1% case, BSR/BSC use only about 1% of dense storage. Block sparse looks right only when the block structure is real.
One more practical point: layout support is part of the performance result. The local CPU wheel did not run BSR/BSC through torch.sparse.mm; H200 can run BSR, but it is still slower than dense for this workload; BSC on CUDA also lacks the Strided + SparseBsc @ Strided path. This does not mean the BSR/BSC data structures are meaningless. It means current backend and kernel coverage decide whether you can use them, and whether they are fast.
How to choose a format
If you have a batch of coordinates and values and want to put them into a sparse tensor, COO is usually the easiest format. It is construction-friendly and can represent duplicate coordinates. After construction, if you want to compute with it, consider coalesce() first.
If the main operation is A @ X, where A is the left-hand sparse matrix, CSR is a natural candidate. It compresses by row, and each output row can read that row’s nonzero entries from one contiguous slice. It is not always fastest, but the data structure matches this access pattern.
If your access pattern is naturally column-oriented, or if you often work with transposed views, CSC is worth considering. Do not assume CSC and CSR have identical speed just because CSC is the “mirror image” of CSR. Actual kernels are often not symmetric.
If nonzero elements really appear in blocks, then consider BSR/BSC. The benefit of block sparse comes from “fewer block coordinates” and “more regular dense computation inside each block”. If your nonzeros are just random points, block sparse stores the zeros inside blocks too, and storage can quickly approach dense.
If this is production code, measure it. At minimum, fix the shape, density, dtype, index dtype, right-hand-side width, device, PyTorch version, and build information. Choosing a sparse matrix format is not an abstract format question. It is an engineering question about data distribution and kernel support.