
Commit e13572a

excelle08 authored and meta-codesync[bot] committed
Add pytorch_gemm_gpuless benchmark to DCPerf AI suite (#542)
Summary:
Pull Request resolved: #542

OSS build recipe and benchpress integration for the GPU-less torch.mm micro-benchmark. It measures host-side dispatch overhead of torch.mm without requiring a GPU, for analyzing CPU frontend bottlenecks (BTB/L1I capacity) on Neoverse V2 and AMD Zen4.

Package contents:
- install_pytorch_gemm_gpuless.sh: auto-detects CUDA drivers, installs PyTorch CUDA (via conda-forge) or PyTorch CPU accordingly, and builds the C extensions
- cleanup_pytorch_gemm_gpuless.sh: removes the installed benchmark
- src/: 6 Python + 4 C/C++ source files with local imports
- PytorchGemmGpulessParser: extracts wall_time_per_call_us, host_overhead_per_call_us, and simulated_tflops from benchmark stdout

Three job configs:
- pytorch_gemm_gpuless_stage1: Python dispatch overhead (any machine)
- pytorch_gemm_gpuless_stage2_nosleep: full host overhead via mock_cuda
- pytorch_gemm_gpuless_stage2_spin: with a spin delay simulating GPU latency

CUDA driver handling:
- The install script detects libcuda.so.1 or installs the cuda-compat package
- If CUDA drivers are present: installs PyTorch CUDA, and both stages are available
- If no CUDA drivers are present: installs PyTorch CPU; stage2 is blocked at runtime with a clear error message directing the user to install cuda-compat
- mock_cuda.cpp extended with cuInit/cuDeviceGetCount/cuDeviceGet/cuDeviceGetAttribute/cuDeviceGetName/cuDeviceTotalMem/cuDriverGetVersion mocks for GPU-less machine support

Reviewed By: pranaykoka

Differential Revision: D97787215
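The warmup-then-measure loop implied by the `--steps`/`--warmups` options above can be sketched in plain Python. This is an illustrative harness only, not the benchmark's actual code; `fn` stands in for the dispatched torch.mm call:

```python
import time

def measure_per_call_us(fn, steps=1000, warmups=100):
    """Time fn over many iterations and report mean wall time per call in us.

    Mirrors the --steps/--warmups structure of the job configs: warmup
    iterations are discarded, then `steps` calls are timed in bulk so the
    timer overhead is amortized across the whole run.
    """
    for _ in range(warmups):          # warm caches, allocators, etc.
        fn()
    start = time.perf_counter()
    for _ in range(steps):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / steps * 1e6      # seconds -> microseconds per call

# Hypothetical stand-in for the measured call:
wall_us = measure_per_call_us(lambda: sum(range(100)))
print(f"Wall time / call: {wall_us:.3f} us")
```

Timing the bulk loop rather than each call individually is what makes sub-microsecond dispatch overheads measurable at all with a coarse wall clock.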
1 parent 4f92357 · commit e13572a

17 files changed: +2352 −0 lines

benchpress/config/benchmarks_ai.yml

Lines changed: 10 additions & 0 deletions

```diff
@@ -65,6 +65,16 @@ adsim:
     - target_latency_msec
     - target_latency_percentile
 
+pytorch_gemm_gpuless:
+  parser: pytorch_gemm_gpuless
+  install_script: ./packages/ai_wdl/pytorch_gemm_gpuless/install_pytorch_gemm_gpuless.sh
+  cleanup_script: ./packages/ai_wdl/pytorch_gemm_gpuless/cleanup_pytorch_gemm_gpuless.sh
+  path: ./benchmarks/ai_wdl/pytorch_gemm_gpuless/run.sh
+  metrics:
+    - wall_time_per_call_us
+    - host_overhead_per_call_us
+    - simulated_tflops
+
 type_conversion:
   parser: type_conversion
   install_script: ./packages/ai_wdl/type_conversion/install_type_conversion.sh
```
benchpress/config/jobs_ai.yml

Lines changed: 72 additions & 0 deletions

```diff
@@ -340,3 +340,75 @@
   description: Benchmark for common AI data type conversion
   args:
     - '--benchmark_format=json'
+
+- benchmark: pytorch_gemm_gpuless
+  name: pytorch_gemm_gpuless_stage1
+  description: GPU-less torch.mm dispatch overhead via TorchDispatchMode (no CUDA needed).
+  args:
+    - stage1
+    - '-m {m}'
+    - '-n {n}'
+    - '-k {k}'
+    - '-t {dtype}'
+    - '--steps {steps}'
+    - '--warmups {warmups}'
+    - '--gpu-model {gpu_model}'
+    - '--efficiency {efficiency}'
+    - '--no-sleep'
+  vars:
+    - 'm=1024'
+    - 'n=1024'
+    - 'k=1024'
+    - 'dtype=bfloat16'
+    - 'steps=1000000'
+    - 'warmups=10000'
+    - 'gpu_model=gb200'
+    - 'efficiency=0.5'
+
+- benchmark: pytorch_gemm_gpuless
+  name: pytorch_gemm_gpuless_stage2_nosleep
+  description: GPU-less torch.mm full host-side overhead via mock_cuda (requires libcuda.so.1).
+  args:
+    - stage2
+    - '-m {m}'
+    - '-n {n}'
+    - '-k {k}'
+    - '-t {dtype}'
+    - '--steps {steps}'
+    - '--warmups {warmups}'
+    - '--gpu-model {gpu_model}'
+    - '--efficiency {efficiency}'
+    - '--no-sleep'
+  vars:
+    - 'm=1024'
+    - 'n=1024'
+    - 'k=1024'
+    - 'dtype=bfloat16'
+    - 'steps=1000000'
+    - 'warmups=10000'
+    - 'gpu_model=gb200'
+    - 'efficiency=0.5'
+
+- benchmark: pytorch_gemm_gpuless
+  name: pytorch_gemm_gpuless_stage2_spin
+  description: GPU-less torch.mm with spin delay (simulates GPU latency via clock_gettime polling).
+  args:
+    - stage2
+    - '-m {m}'
+    - '-n {n}'
+    - '-k {k}'
+    - '-t {dtype}'
+    - '--steps {steps}'
+    - '--warmups {warmups}'
+    - '--gpu-model {gpu_model}'
+    - '--efficiency {efficiency}'
+    - '--delay-mode spin'
+  vars:
+    - 'm=1024'
+    - 'n=1024'
+    - 'k=1024'
+    - 'dtype=bfloat16'
+    - 'steps=1000000'
+    - 'warmups=10000'
+    - 'gpu_model=gb200'
+    - 'efficiency=0.5'
```

benchpress/plugins/parsers/__init__.py

Lines changed: 2 additions & 0 deletions

```diff
@@ -42,6 +42,7 @@
 from .multichase_pointer import MultichasePointerParser
 from .nginx_wrk_bench import NginxWrkParser
 from .nnpi_net4 import NNPINet4Parser
+from .pytorch_gemm_gpuless import PytorchGemmGpulessParser
 from .rebatch import RebatchParser
 from .returncode import ReturncodeParser
 from .schbench import SchbenchParser
@@ -116,6 +117,7 @@ def register_parsers(factory):
     factory.register("adsim", AdSimParser)
     factory.register("cdn_bench", CDNBenchParser)
     factory.register("type_conversion", TypeConversionParser)
+    factory.register("pytorch_gemm_gpuless", PytorchGemmGpulessParser)
     factory.register("xsbench", XSBenchParser)
     if not open_source:
         factory.register("hackperf", HackperfParser)
```
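The `register` calls above plug each parser class into a name-to-class registry that the framework looks up by the `parser:` key in the benchmark config. A minimal sketch of that pattern (not benchpress's actual factory, whose interface may differ):

```python
class ParserFactory:
    """Minimal name -> parser-class registry, sketched for illustration."""

    def __init__(self):
        self._registry = {}

    def register(self, name, cls):
        # Map a config-file name to the class that parses that benchmark.
        self._registry[name] = cls

    def create(self, name):
        # Instantiate the parser registered under `name`.
        return self._registry[name]()


class PytorchGemmGpulessParser:  # stand-in for the real parser class
    pass


factory = ParserFactory()
factory.register("pytorch_gemm_gpuless", PytorchGemmGpulessParser)
parser = factory.create("pytorch_gemm_gpuless")
```

Keeping registration in one `register_parsers` function means adding a benchmark touches only the import list and a single `register` line, as this two-line diff shows.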
benchpress/plugins/parsers/pytorch_gemm_gpuless.py (new file)

Lines changed: 49 additions & 0 deletions

```python
#!/usr/bin/env python3
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

# pyre-unsafe
import re

from benchpress.lib.parser import Parser


class PytorchGemmGpulessParser(Parser):
    """Parser for pytorch_gemm_gpuless benchmark output.

    Extracts metrics from both Stage 1 (TorchDispatchMode) and
    Stage 2 (mock_cuda) output formats:
        Wall time / call: 13.730 us
        Host overhead / call: 13.730 us
        Simulated TF/s: 156.440000
    """

    def parse(self, stdout, stderr, returncode):
        metrics = {}

        for line in stdout:
            line = line.strip()

            m = re.search(r"Wall\s+time\s*/\s*call:\s+([\d.]+)\s*us", line)
            if m:
                metrics["wall_time_per_call_us"] = float(m.group(1))
                continue

            m = re.search(r"Host\s+overhead\s*/\s*call:\s+([\d.]+)\s*us", line)
            if m:
                metrics["host_overhead_per_call_us"] = float(m.group(1))
                continue

            m = re.search(r"Simulated\s+TF/s:\s+([\d.]+)", line)
            if m:
                metrics["simulated_tflops"] = float(m.group(1))
                continue

            m = re.search(r"Simulated\s+GPU\s*/\s*call:\s+([\d.]+)\s*us", line)
            if m:
                metrics["simulated_gpu_per_call_us"] = float(m.group(1))
                continue

        return metrics
```
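The sample lines from the parser's docstring exercise the first three regexes. The extraction logic can be checked standalone with the same patterns, without pulling in the benchpress `Parser` base class:

```python
import re

# Sample stdout lines taken from the parser's docstring above.
SAMPLE_STDOUT = [
    "Wall time / call: 13.730 us",
    "Host overhead / call: 13.730 us",
    "Simulated TF/s: 156.440000",
]

# The same patterns the parser uses, keyed by the metric they populate.
PATTERNS = {
    "wall_time_per_call_us": r"Wall\s+time\s*/\s*call:\s+([\d.]+)\s*us",
    "host_overhead_per_call_us": r"Host\s+overhead\s*/\s*call:\s+([\d.]+)\s*us",
    "simulated_tflops": r"Simulated\s+TF/s:\s+([\d.]+)",
}

metrics = {}
for line in SAMPLE_STDOUT:
    for key, pat in PATTERNS.items():
        m = re.search(pat, line.strip())
        if m:
            metrics[key] = float(m.group(1))
print(metrics)
```

Because the patterns use `\s+`/`\s*` around the separators, the parser tolerates whatever column alignment the two stages print, which is why one parser covers both output formats.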
packages/ai_wdl/pytorch_gemm_gpuless/README.md (new file)

Lines changed: 62 additions & 0 deletions

````markdown
# pytorch_gemm_gpuless

GPU-less `torch.mm` micro-benchmark that measures host-side dispatch overhead
without requiring a GPU. Designed for analyzing CPU frontend bottlenecks
(BTB/L1I capacity) on Neoverse V2 (GB200/GB300) and AMD Zen4.

## Stages

| Stage | What it Measures | Requirements |
|-------|------------------|--------------|
| 1 (`TorchDispatchMode`) | Python dispatch overhead | Any machine |
| 2 (`mock_cuda`) | Full host-side overhead (C++ + CUDA driver API) | CUDA drivers (libcuda.so.1) |

Stage 2 requires NVIDIA driver userspace libraries (`cuda-compat` package).
No GPU hardware is needed, only the driver shared library for function table
patching. The install script auto-detects and installs `cuda-compat` if
available via the package manager.

## Installation

```bash
./benchpress -b ai install pytorch_gemm_gpuless_stage1
./benchpress -b ai install pytorch_gemm_gpuless_stage2_nosleep
```

The install script will:
- Detect CUDA driver availability
- Install PyTorch CUDA (if drivers are present) or PyTorch CPU (if not)
- Build the C extensions (nop_delay, mock_cuda) via setuptools

Stage 2 jobs will error at runtime if CUDA drivers are missing.

## Run

```bash
# Stage 1 — pure host dispatch overhead (any machine)
./benchpress -b ai run pytorch_gemm_gpuless_stage1

# Stage 2 — full C++ dispatch overhead (requires CUDA drivers)
./benchpress -b ai run pytorch_gemm_gpuless_stage2_nosleep
./benchpress -b ai run pytorch_gemm_gpuless_stage2_spin
```

## Metrics

| Metric | Unit | Description |
|--------|------|-------------|
| `wall_time_per_call_us` | microseconds | Total wall time per torch.mm call |
| `host_overhead_per_call_us` | microseconds | Host dispatch overhead per call |
| `simulated_tflops` | TF/s | Simulated throughput |

## Sample Output

```json
{
  "benchmark_name": "pytorch_gemm_gpuless_stage1",
  "metrics": {
    "wall_time_per_call_us": 76.196,
    "host_overhead_per_call_us": 76.196,
    "simulated_tflops": 28.183608
  }
}
```
````
packages/ai_wdl/pytorch_gemm_gpuless/cleanup_pytorch_gemm_gpuless.sh (new file)

Lines changed: 11 additions & 0 deletions

```bash
#!/bin/bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

AI_BENCH_ROOT="$(dirname "$(readlink -f "$0")")"
BENCHPRESS_ROOT="$(readlink -f "$AI_BENCH_ROOT/../../..")"
BENCHMARKS_DIR="${BENCHPRESS_ROOT}/benchmarks/ai_wdl/pytorch_gemm_gpuless"

rm -rf "$BENCHMARKS_DIR"
```
