
Micro Benchmarks

Computation Benchmarks#

kernel-launch#

Introduction#

Measure GPU kernel launch performance from multiple perspectives, including end-to-end latency, host-side dispatch overhead, steady-state launch throughput, and device-side launch time.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| kernel-launch/e2e_latency_us | time (us) | Single-shot end-to-end latency measured in CPU time. |
| kernel-launch/host_dispatch_us | time (us) | Host-side dispatch overhead per kernel measured in CPU time. |
| kernel-launch/launch_throughput_mkps | throughput (MKPS) | Steady-state kernel launch throughput. |
| kernel-launch/device_launch_us | time (us) | Device-side average launch time per kernel measured by events. |

gemm-flops#

Introduction#

Measure the GPU GEMM FLOPS for different float and int data types, with or without Tensor Core (XDLOPS), performed by NVIDIA cutlass or AMD rocblas-bench. The benchmark supports one or more GEMM shapes in m,n,k format.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| gemm-flops/fp64_m${m}_n${n}_k${k}_flops | FLOPS (GFLOPS) | GEMM float64 peak FLOPS. |
| gemm-flops/fp32_m${m}_n${n}_k${k}_flops | FLOPS (GFLOPS) | GEMM float32 peak FLOPS. |
| gemm-flops/fp16_m${m}_n${n}_k${k}_flops | FLOPS (GFLOPS) | GEMM float16 peak FLOPS. |
| gemm-flops/fp64_tc_m${m}_n${n}_k${k}_flops | FLOPS (GFLOPS) | GEMM float64 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/tf32_tc_m${m}_n${n}_k${k}_flops | FLOPS (GFLOPS) | GEMM tensor-float32 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/fp16_tc_m${m}_n${n}_k${k}_flops | FLOPS (GFLOPS) | GEMM float16 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/bf16_tc_m${m}_n${n}_k${k}_flops | FLOPS (GFLOPS) | GEMM bfloat16 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/int8_tc_m${m}_n${n}_k${k}_iops | IOPS (GIOPS) | GEMM int8 peak IOPS with NVIDIA Tensor Core. |
| gemm-flops/int4_tc_m${m}_n${n}_k${k}_iops | IOPS (GIOPS) | GEMM int4 peak IOPS with NVIDIA Tensor Core. |
| gemm-flops/fp32_xdlops_m${m}_n${n}_k${k}_flops | FLOPS (GFLOPS) | GEMM float32 peak FLOPS with AMD XDLOPS. |
| gemm-flops/fp16_xdlops_m${m}_n${n}_k${k}_flops | FLOPS (GFLOPS) | GEMM float16 peak FLOPS with AMD XDLOPS. |
| gemm-flops/bf16_xdlops_m${m}_n${n}_k${k}_flops | FLOPS (GFLOPS) | GEMM bfloat16 peak FLOPS with AMD XDLOPS. |
| gemm-flops/int8_xdlops_m${m}_n${n}_k${k}_iops | IOPS (GIOPS) | GEMM int8 peak IOPS with AMD XDLOPS. |
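
As a rough illustration of how a GFLOPS figure relates to a GEMM shape, the sketch below uses the conventional count of 2·m·n·k operations per GEMM. The function names and the timing value are hypothetical; how cutlass or rocblas-bench derive their reported numbers internally is not shown here.

```python
def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Return GFLOPS for one m x n x k GEMM that took `seconds`."""
    # Each of the m*n*k multiply-add pairs is conventionally counted as 2 operations.
    ops = 2 * m * n * k
    return ops / seconds / 1e9


def metric_name(dtype: str, m: int, n: int, k: int, unit: str = "flops") -> str:
    """Fill in the gemm-flops metric name pattern for one shape (hypothetical helper)."""
    return f"gemm-flops/{dtype}_m{m}_n{n}_k{k}_{unit}"


if __name__ == "__main__":
    # Made-up timing: a 4096 x 4096 x 4096 GEMM finishing in 10 ms.
    print(metric_name("fp16_tc", 4096, 4096, 4096),
          gemm_gflops(4096, 4096, 4096, 0.010))
```

The same formula applies to the integer variants, with the result read as GIOPS.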

matmul#

Introduction#

Large scale matmul operation using torch.matmul with one GPU.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| pytorch-matmul/nosharding_time | time (ms) | Time of pure matmul operation. |

cublaslt-gemm / hipblaslt-gemm#

Introduction#

Measure the GEMM performance of cublasLtMatmul or hipblasLt-bench.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cublaslt-gemm/${dtype}_${batch}_${m}_${n}_${k}_flops | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |
| hipblaslt-gemm/${dtype}_${batch}_${m}_${n}_${k}_flops | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |

cublas-function#

Introduction#

Measure the performance of the most common NVIDIA cuBLAS functions, using parameters taken from the training of models including ResNet, VGG, DenseNet, LSTM, BERT, and GPT-2.

The supported functions for cuBLAS are as follows:

  • cublasSgemm
  • cublasSgemmStridedBatched
  • cublasGemmStridedBatchedEx
  • cublasGemmEx
  • cublasCgemm3mStridedBatched
  • cublasCgemm

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cublas-function/name_${function_name}_${parameters}_time | time (us) | The mean time to execute the cublas function with the parameters. |
| cublas-function/name_${function_name}_${parameters}_correctness | | Whether the results of executing the cublas function with the parameters pass the correctness check, if the correctness check is enabled. |
| cublas-function/name_${function_name}_${parameters}_error | | The error ratio of the results of executing the cublas function with the parameters, if the correctness check is enabled. |

cudnn-function#

Introduction#

Measure the performance of the most common NVIDIA cuDNN functions, using parameters taken from the training of models including ResNet, VGG, DenseNet, LSTM, BERT, and GPT-2.

The supported functions for cuDNN are as follows:

  • cudnnConvolutionBackwardFilter
  • cudnnConvolutionBackwardData
  • cudnnConvolutionForward

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cudnn-function/name_${function_name}_${parameters}_time | time (us) | The mean time to execute the cudnn function with the parameters. |

tensorrt-inference#

Introduction#

Run inference for PyTorch/ONNX models on NVIDIA GPUs with TensorRT.

Currently the following models are supported:

alexnet, densenet121, densenet169, densenet201, densenet161, googlenet, inception_v3, mnasnet0_5, mnasnet1_0, mobilenet_v2, resnet18, resnet34, resnet50, resnet101, resnet152, resnext50_32x4d, resnext101_32x8d, wide_resnet50_2, wide_resnet101_2, shufflenet_v2_x0_5, shufflenet_v2_x1_0, squeezenet1_0, squeezenet1_1, vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19, lstm, bert-base, bert-large, gpt2-small

Large models such as gpt2-large are not currently supported, because models larger than 2GB (the maximum protobuf size) cannot be exported in one ONNX file.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| tensorrt-inference/${model}_gpu_time_mean | time (ms) | The mean GPU latency to execute the kernels for a query. |
| tensorrt-inference/${model}_gpu_time_99 | time (ms) | The 99th percentile GPU latency to execute the kernels for a query. |
| tensorrt-inference/${model}_host_time_mean | time (ms) | The mean H2D, GPU, and D2H latency to execute the kernels for a query. |
| tensorrt-inference/${model}_host_time_99 | time (ms) | The 99th percentile H2D, GPU, and D2H latency to execute the kernels for a query. |
| tensorrt-inference/${model}_end_to_end_time_mean | time (ms) | The mean duration from when the H2D of a query is called to when the D2H of the same query is completed. |
| tensorrt-inference/${model}_end_to_end_time_99 | time (ms) | The P99 duration from when the H2D of a query is called to when the D2H of the same query is completed. |

ort-inference#

Introduction#

Measure the inference performance of torchvision models using ONNX Runtime. Currently the following models are supported:

alexnet, densenet121, densenet169, densenet201, densenet161, googlenet, inception_v3, mnasnet0_5, mnasnet1_0, mobilenet_v2, resnet18, resnet34, resnet50, resnet101, resnet152, resnext50_32x4d, resnext101_32x8d, wide_resnet50_2, wide_resnet101_2, shufflenet_v2_x0_5, shufflenet_v2_x1_0, squeezenet1_0, squeezenet1_1, vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19

The supported percentiles are 50, 90, 95, 99, and 99.9.

Parameters#

| Parameter | Default | Description |
|-----------|---------|-------------|
| --pytorch_models | See above | Torchvision models to export to ONNX and run with ONNX Runtime. |
| --precision | float16 | Inference precision: float32, float16, or int8. |
| --graph_opt_level | 3 | ONNX Runtime graph optimization level: 0, 1, 2, or 3. |
| --batch_size | 32 | Batch size of the generated input tensor. |
| --num_warmup | 64 | Number of warmup inference iterations excluded from metrics. |
| --num_steps | 256 | Number of measured inference iterations. |
| --execution_provider | auto | ONNX Runtime execution provider: auto, cuda, rocm, migraphx, cpu, or a full provider name. |
| --pretrained | false | Use pretrained torchvision weights when exporting ONNX models. |

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| ort-inference/{precision}_{model}_time | time (ms) | The mean latency to execute one batch of inference. |
| ort-inference/{precision}_{model}_time_{percentile} | time (ms) | The {percentile}th percentile latency to execute one batch of inference. |
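
To make the relationship between measured steps and the reported values concrete, here is a minimal sketch that turns a list of per-batch latencies into a mean and the supported percentiles. The helper names, the nearest-rank percentile method, and the `_time_<percentile>` key layout are assumptions for illustration; the benchmark may compute percentiles differently.

```python
import math
from statistics import mean

# Percentiles listed in the text above.
PERCENTILES = (50, 90, 95, 99, 99.9)


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(precision, model, latencies_ms):
    """Build a metric dict from per-batch latencies given in milliseconds."""
    ordered = sorted(latencies_ms)
    prefix = f"ort-inference/{precision}_{model}_time"
    metrics = {prefix: mean(ordered)}
    for pct in PERCENTILES:
        # Assumed key layout: <prefix>_<percentile>
        metrics[f"{prefix}_{pct}"] = nearest_rank(ordered, pct)
    return metrics


if __name__ == "__main__":
    sample = [float(i) for i in range(1, 11)]  # made-up latencies
    for name, value in summarize("float16", "resnet50", sample).items():
        print(name, value)
```

With the defaults above, the input list would hold the 256 measured steps, with the 64 warmup steps already dropped.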

gpu-burn#

Introduction#

Multi-GPU CUDA stress test for GPU compute and memory utilization, performed by gpu-burn. Supports double-precision types and Tensor Cores.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| gpu-burn/time | time (s) | The runtime for gpu-burn test. |
| gpu-burn/gpu_[0-9]_pass | yes/no | The result of the gpu-burn test for each GPU (1: yes, 0: no). |
| gpu-burn/abort | yes/no | Whether or not the gpu-burn test aborted before returning GPU results (1: yes, 0: no). |

cpu-hpl#

Introduction#

HPL, or High Performance Computing Linpack, evaluates compute bandwidth by solving dense linear systems in double-precision arithmetic. Performed by the High-Performance Linpack Benchmark for Distributed-Memory Computers.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cpu-hpl/tests_pass | | HPL completed running and correctness test has passed (1: pass, 0: fail). |
| cpu-hpl/throughput | bandwidth (GFlops) | Compute bandwidth. |
| cpu-hpl/time | time (s) | Time elapsed during HPL run. |

gpu-hpl#

Introduction#

Measure GPU HPL performance for dense linear algebra workloads. Performed by rocHPL.

Parameters#

gpu-hpl always generates an HPL input .dat file from the command-line parameters. The generated file name and output file name are derived from the same workload prefix used in metric keys.

| Parameter | Default | Description |
|-----------|---------|-------------|
| --p | 1 | Number of rows in the MPI process grid. |
| --q | 1 | Number of columns in the MPI process grid. |
| --local-p | | Optional number of rows in the node-local MPI process grid. |
| --local-q | | Optional number of columns in the node-local MPI process grid. |
| --n | 45312 | Global matrix size. |
| --nb | 384 | Panel/block size. |
| --warmup | 0 | Number of warmup HPL runs to exclude from result aggregation. |
| --iterations | 1 | Number of measured HPL runs to include in result aggregation. |
| --reduce-op | max | Reduce operator for measured runs by FLOPS: mean, median, max, min. |
| --pmap | 1 | Process mapping: 0 for row-major, 1 for column-major. |
| --bcast | 0 | Broadcast topology: 0 for 1rg, 1 for 1rM, 2 for 2rg, 3 for 2rM, 4 for Lng, 5 for LnM. |
| --threshold | 16.0 | Residual check threshold. |
| --pfact | 2 | Panel factorization: 0 for left, 1 for Crout, 2 for right. |
| --nbmin | 32 | Recursive stopping criterion. |
| --ndiv | 2 | Number of panels in recursion. |
| --rfact | 2 | Recursive panel factorization: 0 for left, 1 for Crout, 2 for right. |
| --depth | 1 | Lookahead depth. |
| --swap | 1 | Swapping algorithm: 0 for binary exchange, 1 for long, 2 for mix. |
| --swapping-threshold | 64 | Swapping threshold. |
| --l1 | 0 | L1 storage form: 0 for transposed, 1 for non-transposed. |
| --u | 0 | U storage form: 0 for transposed, 1 for non-transposed. |
| --equilibration | 0 | Equilibration: 0 for no, 1 for yes. |
| --memory-alignment | 8 | Memory alignment in doubles. |

--warmup runs are excluded from result aggregation. --reduce-op is applied to the measured FLOPS values. The reported _time metric is reduced in the same performance direction by applying --reduce-op to 1 / time and then converting the result back to seconds.
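
The aggregation rule above can be sketched as follows. This is only an illustration of the described behavior, not the benchmark's code: the function name, the per-run lists, and the sample values are hypothetical.

```python
from statistics import mean, median

# Reduce operators named by --reduce-op.
REDUCERS = {"mean": mean, "median": median, "max": max, "min": min}


def reduce_runs(flops, times, reduce_op="max", warmup=0):
    """Aggregate per-run FLOPS and times, skipping the first `warmup` runs."""
    reducer = REDUCERS[reduce_op]
    measured_flops = flops[warmup:]
    measured_times = times[warmup:]
    reduced_flops = reducer(measured_flops)
    # Reduce 1 / time so that "max" picks the fastest run, then convert back to seconds.
    reduced_time = 1.0 / reducer([1.0 / t for t in measured_times])
    return reduced_flops, reduced_time


if __name__ == "__main__":
    # Made-up results for three runs; the first one is treated as warmup.
    print(reduce_runs([100.0, 200.0, 400.0], [4.0, 2.0, 1.0], "max", warmup=1))
    print(reduce_runs([100.0, 200.0, 400.0], [4.0, 2.0, 1.0], "mean", warmup=1))
```

Note that with `mean`, reducing `1 / time` yields the harmonic mean of the run times rather than their arithmetic mean, which is what keeps the time metric pointing in the same performance direction as FLOPS.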

Metrics#

rocHPL reports performance, time, and correctness metrics. The metric key includes the configured HPL variant, process grid, matrix size, and block size: ${tv}_P${P}_Q${Q}_N${N}_NB${NB}. The tv field is based on the rocHPL T/V value and includes an extended suffix for L1, U, Equilibration, and memory-alignment. For example, WC11R2R32_TTN8 uses transposed L1, transposed U, no equilibration, and memory alignment 8.

| Name | Unit | Description |
|------|------|-------------|
| gpu-hpl/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_flops | FLOPS (GFLOPS) | Throughput for the specified rocHPL run. |
| gpu-hpl/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_time | time (s) | Time elapsed during the specified HPL run. |
| gpu-hpl/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_tests_pass | | Whether residual checks passed (1: pass, 0: fail). |

gpu-hpl-mxp#

Introduction#

Measure GPU HPL-MxP performance for mixed-precision dense linear algebra workloads. Performed by rocHPL-MxP.

Parameters#

gpu-hpl-mxp always generates an HPL-MxP input .dat file from the command-line parameters. The generated file name and output file name are derived from the same workload prefix used in metric keys.

| Parameter | Default | Description |
|-----------|---------|-------------|
| --p | 1 | Number of rows in the MPI process grid. |
| --q | 1 | Number of columns in the MPI process grid. |
| --local-p | | Optional number of rows in the node-local MPI process grid. |
| --local-q | | Optional number of columns in the node-local MPI process grid. |
| --n | 61440 | Global matrix size. |
| --nb | 2560 | Panel/block size. |
| --warmup | 0 | Number of warmup HPL-MxP runs to exclude from result aggregation. |
| --iterations | 1 | Number of measured HPL-MxP runs to include in result aggregation. |
| --reduce-op | max | Reduce operator for measured runs by FLOPS: mean, median, max, min. |
| --pmap | 1 | Process mapping: 0 for row-major, 1 for column-major. |
| --bcast | 0 | Broadcast topology: 0 for 1rg, 1 for 1rM, 2 for 2rg, 3 for 2rM, 4 for Lng, 5 for LnM. |
| --threshold | 16.0 | Residual check threshold. |

--warmup runs are excluded from result aggregation. --reduce-op is applied to the measured FLOPS values. The reported _time metric is reduced in the same performance direction by applying --reduce-op to 1 / time and then converting the result back to seconds.

Metrics#

rocHPL-MxP reports performance, time, and correctness metrics. The metric key includes the configured HPL-MxP variant, process grid, matrix size, and block size: ${tv}_P${P}_Q${Q}_N${N}_NB${NB}. The tv field is based on the rocHPL-MxP T/V value, for example WC1.

| Name | Unit | Description |
|------|------|-------------|
| gpu-hpl-mxp/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_flops | FLOPS (GFLOPS) | Throughput for the specified rocHPL-MxP run. |
| gpu-hpl-mxp/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_time | time (s) | Time elapsed during the specified HPL-MxP run. |
| gpu-hpl-mxp/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_tests_pass | | Whether residual checks passed (1: pass, 0: fail). |

gpu-hpcg#

Introduction#

Measure GPU HPCG performance for sparse linear algebra and multigrid-style workloads. Performed by rocHPCG.

Parameters#

| Parameter | Default | Description |
|-----------|---------|-------------|
| --npx | 1 | Number of MPI processes in the x dimension. |
| --npy | 1 | Number of MPI processes in the y dimension. |
| --npz | 1 | Number of MPI processes in the z dimension. |
| --nx | 560 | Local problem size in the x dimension. |
| --ny | 280 | Local problem size in the y dimension. |
| --nz | 280 | Local problem size in the z dimension. |
| --rt | 60 | Benchmark runtime in seconds. |
| --tol | 1.0 | Verification control: 0 runs reference verification; non-zero skips it. |
| --pz | 0 | Partition boundary in the z process dimension. |
| --zl | --nz | Local nz value for processes with z rank lower than --pz. |
| --zu | --nz | Local nz value for processes with z rank greater than or equal to --pz. |

Metrics#

rocHPCG reports performance and time metrics. Performance metrics are reported for final, ddot, waxpby, spmv, mg, and total. The metric key includes the configured process domain and local problem size: p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}.

| Name | Unit | Description |
|------|------|-------------|
| gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_flops | FLOPS (GFLOPS) | Throughput for the specified rocHPCG operation. |
| gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_bandwidth | bandwidth (GB/s) | Bandwidth for the specified rocHPCG operation. |
| gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_flops_per_process | FLOPS (GFLOPS) | Per-process throughput for the specified operation. |
| gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_bandwidth_per_process | bandwidth (GB/s) | Per-process bandwidth for the specified operation. |
| gpu-hpcg/setup_time_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz} | time (s) | Setup phase duration. |
| gpu-hpcg/optimization_time_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz} | time (s) | Optimization phase duration. |
| gpu-hpcg/total_time_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz} | time (s) | Total runtime. |

cpu-stream#

Introduction#

Measure the memory bandwidth and computation rate for simple vector kernels. Performed by the University of Virginia STREAM benchmark.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cpu-stream/threads | | Number of threads used for the test. Determined by core count. |
| cpu-stream/['copy', 'scale', 'add', 'triad']_throughput | bandwidth (MB/s) | Memory throughput of the designated kernel operation. |
| cpu-stream/['copy', 'scale', 'add', 'triad']_time_avg | time (s) | Average elapsed time over all iterations. |
| cpu-stream/['copy', 'scale', 'add', 'triad']_time_min | time (s) | Minimum elapsed time over all iterations. |
| cpu-stream/['copy', 'scale', 'add', 'triad']_time_max | time (s) | Maximum elapsed time over all iterations. |
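
For orientation, the sketch below shows how STREAM-style bandwidth follows from the array length and the best (minimum) kernel time, using STREAM's usual accounting of arrays touched per kernel. The function name and the sample numbers are hypothetical, and whether the reported throughput is derived in exactly this way is an assumption.

```python
# Arrays read or written per kernel in STREAM's accounting:
# copy a=b, scale a=s*b, add a=b+c, triad a=b+s*c.
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def stream_mb_per_s(kernel, n_elements, min_time_s, bytes_per_element=8):
    """Return MB/s (1 MB = 10^6 bytes) for one kernel given its best time."""
    moved_bytes = ARRAYS_TOUCHED[kernel] * bytes_per_element * n_elements
    return moved_bytes / min_time_s / 1e6


if __name__ == "__main__":
    # Made-up run: triad over 10 million doubles with a best time of 24 ms.
    print(stream_mb_per_s("triad", 10_000_000, 0.024))
```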

Communication Benchmarks#

cpu-memory-bw-latency#

Introduction#

Measure the memory copy bandwidth and latency across different CPU NUMA nodes. Performed by the Intel MLC Tool.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cpu-memory-bw-latency/mem_bandwidth_matrix_numa_[0-9]+_[0-9]+_bw | bandwidth (MB/s) | Former NUMA to latter NUMA memory bandwidth. |
| cpu-memory-bw-latency/mem_bandwidth_matrix_numa_[0-9]+_[0-9]+_lat | time (ns) | Former NUMA to latter NUMA memory latency. |
| cpu-memory-bw-latency/mem_max_bandwidth_all_reads_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, full read. |
| cpu-memory-bw-latency/mem_max_bandwidth_3_1_reads-writes_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, read : write = 3 : 1. |
| cpu-memory-bw-latency/mem_max_bandwidth_2_1_reads-writes_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, read : write = 2 : 1. |
| cpu-memory-bw-latency/mem_max_bandwidth_1_1_reads-writes_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, read : write = 1 : 1. |
| cpu-memory-bw-latency/mem_max_bandwidth_stream-triad_like_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, with stream-triad like pattern. |

mem-bw#

Introduction#

Measure the memory copy bandwidth across PCI-e and memory copy bandwidth between GPUs, performed by NVIDIA or AMD bandwidth test tool. The bandwidthTest sample was out-of-date and has been removed as of the CUDA Samples 12.9 release. For up-to-date bandwidth measurements, refer instead to the nvbandwidth benchmark.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| mem-bw/h2d_bw | bandwidth (GB/s) | Host to device copy bandwidth. |
| mem-bw/d2h_bw | bandwidth (GB/s) | Device to host copy bandwidth. |
| mem-bw/d2d_bw | bandwidth (GB/s) | Device to device copy bandwidth. |

gpu-copy-bw#

Measure the memory copy bandwidth achieved by the GPU SM or DMA engine, including device-to-host, host-to-device, and device-to-device copies. For measurements of peer-to-peer communication performance between AMD GPUs, GPU memory buffers are allocated in hipDeviceMallocUncached (previously hipDeviceMallocFinegrained) mode to maximize performance.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cpu_to_gpu[0-9]+_by_(sm\|dma)_under_numa[0-9]+_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+_to_cpu_by_(sm\|dma)_under_numa[0-9]+_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU writing one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+_to_gpu[0-9]+_by_(sm\|dma)_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing its own memory using DMA engine or GPU SM. |
| gpu[0-9]+_to_gpu[0-9]+_(read\|write)_by_(sm\|dma)_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing a peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
| cpu_and_gpu[0-9]+_by_(sm\|dma)_under_numa[0-9]+_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+_and_cpu_by_(sm\|dma)_under_numa[0-9]+_bw | bandwidth (GB/s) | Same as above, but generated by --dtoh --bidirectional. |
| gpu[0-9]+_and_gpu[0-9]+_by_(sm\|dma)_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing its own memory using DMA engine or GPU SM. |
| gpu[0-9]+_and_gpu[0-9]+_(read\|write)_by_(sm\|dma)_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing a peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
| gpu[0-9]+_to_gpu_all_write_by_sm_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU writing all peer GPUs' memory using GPU SM with peer communication enabled. |
| gpu_all_to_gpu[0-9]+_write_by_sm_bw | bandwidth (GB/s) | The unidirectional bandwidth of all peer GPUs writing one GPU's memory using GPU SM with peer communication enabled. |
| gpu_all_to_gpu_all_write_by_sm_bw | bandwidth (GB/s) | The unidirectional bandwidth of all peer GPUs writing all peer GPUs' memory using GPU SM with peer communication enabled. |
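
When post-processing results, the name patterns above can be used to pull out the GPU index, NUMA node, and copy engine. The sketch below covers only the four unidirectional patterns; the category labels and the `classify` helper are hypothetical, and it assumes metric names appear exactly as written in the table, with no extra prefix or suffix.

```python
import re

# Unidirectional patterns from the table, with named groups added for extraction.
PATTERNS = [
    ("host_to_device", re.compile(
        r"cpu_to_gpu(?P<gpu>\d+)_by_(?P<engine>sm|dma)_under_numa(?P<numa>\d+)_bw")),
    ("device_to_host", re.compile(
        r"gpu(?P<gpu>\d+)_to_cpu_by_(?P<engine>sm|dma)_under_numa(?P<numa>\d+)_bw")),
    ("peer", re.compile(
        r"gpu(?P<src>\d+)_to_gpu(?P<dst>\d+)_(?P<op>read|write)_by_(?P<engine>sm|dma)_bw")),
    ("device_to_device", re.compile(
        r"gpu(?P<src>\d+)_to_gpu(?P<dst>\d+)_by_(?P<engine>sm|dma)_bw")),
]


def classify(metric):
    """Return (category, fields) for a known metric name, or None if nothing matches."""
    for category, pattern in PATTERNS:
        match = pattern.fullmatch(metric)
        if match:
            return category, match.groupdict()
    return None


if __name__ == "__main__":
    for name in ("cpu_to_gpu0_by_dma_under_numa1_bw", "gpu0_to_gpu1_read_by_sm_bw"):
        print(name, classify(name))
```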

gpu-stream#

Introduction#

Measure the memory bandwidth of GPU using BabelStream (hip-stream) backend. The benchmark executes copy, scale, add, triad, and dot operations. The array_size parameter represents the number of elements. Each benchmark run measures the GPU visible to the current process.

Metrics#

| Metric Name | Unit | Description |
|-------------|------|-------------|
| STREAM_INIT_[float\|double]_array_[0-9]+_bw | bandwidth (GB/s) | Initialization phase bandwidth for the current benchmark run and one array size. |
| STREAM_INIT_[float\|double]_array_[0-9]+_time | time (s) | Initialization phase runtime for the current benchmark run and one array size. |
| STREAM_READ_[float\|double]_array_[0-9]+_bw | bandwidth (GB/s) | Read phase bandwidth for the current benchmark run and one array size. |
| STREAM_READ_[float\|double]_array_[0-9]+_time | time (s) | Read phase runtime for the current benchmark run and one array size. |
| STREAM_COPY_[float\|double]_array_[0-9]+_bw | bandwidth (GB/s) | Maximum copy bandwidth for the current benchmark run and one array size. |
| STREAM_COPY_[float\|double]_array_[0-9]+_time_min | time (s) | Minimum copy runtime for the current benchmark run and one array size. |
| STREAM_COPY_[float\|double]_array_[0-9]+_time_max | time (s) | Maximum copy runtime for the current benchmark run and one array size. |
| STREAM_COPY_[float\|double]_array_[0-9]+_time_avg | time (s) | Average copy runtime for the current benchmark run and one array size. |
| STREAM_MUL_[float\|double]_array_[0-9]+_bw | bandwidth (GB/s) | Maximum mul bandwidth for the current benchmark run and one array size. |
| STREAM_MUL_[float\|double]_array_[0-9]+_time_min | time (s) | Minimum mul runtime for the current benchmark run and one array size. |
| STREAM_MUL_[float\|double]_array_[0-9]+_time_max | time (s) | Maximum mul runtime for the current benchmark run and one array size. |
| STREAM_MUL_[float\|double]_array_[0-9]+_time_avg | time (s) | Average mul runtime for the current benchmark run and one array size. |
| STREAM_ADD_[float\|double]_array_[0-9]+_bw | bandwidth (GB/s) | Maximum add bandwidth for the current benchmark run and one array size. |
| STREAM_ADD_[float\|double]_array_[0-9]+_time_min | time (s) | Minimum add runtime for the current benchmark run and one array size. |
| STREAM_ADD_[float\|double]_array_[0-9]+_time_max | time (s) | Maximum add runtime for the current benchmark run and one array size. |
| STREAM_ADD_[float\|double]_array_[0-9]+_time_avg | time (s) | Average add runtime for the current benchmark run and one array size. |
| STREAM_TRIAD_[float\|double]_array_[0-9]+_bw | bandwidth (GB/s) | Maximum triad bandwidth for the current benchmark run and one array size. |
| STREAM_TRIAD_[float\|double]_array_[0-9]+_time_min | time (s) | Minimum triad runtime for the current benchmark run and one array size. |
| STREAM_TRIAD_[float\|double]_array_[0-9]+_time_max | time (s) | Maximum triad runtime for the current benchmark run and one array size. |
| STREAM_TRIAD_[float\|double]_array_[0-9]+_time_avg | time (s) | Average triad runtime for the current benchmark run and one array size. |
| STREAM_DOT_[float\|double]_array_[0-9]+_bw | bandwidth (GB/s) | Maximum dot bandwidth for the current benchmark run and one array size. |
| STREAM_DOT_[float\|double]_array_[0-9]+_time_min | time (s) | Minimum dot runtime for the current benchmark run and one array size. |
| STREAM_DOT_[float\|double]_array_[0-9]+_time_max | time (s) | Maximum dot runtime for the current benchmark run and one array size. |
| STREAM_DOT_[float\|double]_array_[0-9]+_time_avg | time (s) | Average dot runtime for the current benchmark run and one array size. |

gpu-stream reports phase and function metrics. _ratio and block_* metrics are removed. Bandwidth metrics are converted from BabelStream max_mbytes_per_sec by using GB/s = MB/s / 1000.
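
A minimal sketch of the unit handling described above: the decimal MB/s to GB/s conversion, plus the array footprint implied by an element count. The helper names and sample values are hypothetical; only the division by 1000 and the element-count meaning of array_size come from the text.

```python
# Size in bytes of one element for each supported data type.
ELEMENT_BYTES = {"float": 4, "double": 8}


def to_gb_per_s(max_mbytes_per_sec):
    """Convert a BabelStream MB/s value to GB/s using GB/s = MB/s / 1000."""
    return max_mbytes_per_sec / 1000


def array_bytes(array_size, dtype):
    """Bytes occupied by one array, since array_size counts elements, not bytes."""
    return array_size * ELEMENT_BYTES[dtype]


def bw_metric_name(kernel, dtype, array_size):
    """Fill in the STREAM_<KERNEL>_<dtype>_array_<size>_bw name pattern."""
    return f"STREAM_{kernel.upper()}_{dtype}_array_{array_size}_bw"


if __name__ == "__main__":
    # Made-up BabelStream result for a triad over 2^25 doubles.
    print(bw_metric_name("triad", "double", 2**25), to_gb_per_s(1234567.0))
    print(array_bytes(2**25, "double"))
```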

ib-loopback#

Introduction#

Measure the InfiniBand loopback verbs bandwidth, performed by OFED performance tests.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| ib-loopback/ib_write_bw_${msg_size} | bandwidth (GB/s) | InfiniBand loopback write bandwidth with given message size. |
| ib-loopback/ib_read_bw_${msg_size} | bandwidth (GB/s) | InfiniBand loopback read bandwidth with given message size. |
| ib-loopback/ib_send_bw_${msg_size} | bandwidth (GB/s) | InfiniBand loopback send bandwidth with given message size. |

nccl-bw / rccl-bw#

Introduction#

Measure the performance of NCCL/RCCL operations under multi-node traffic patterns, performed by nccl-tests or rccl-tests. The following operations are currently supported: allreduce, allgather, broadcast, reduce, reducescatter, alltoall. Both in-place and out-of-place measurements are supported.

The following traffic patterns are supported:

  • all-nodes, validate the NCCL/RCCL performance across all VM nodes simultaneously.
  • pair-wise, validate the NCCL/RCCL performance across VM pairs with all possible combinations in parallel.
  • k-batch, validate the NCCL/RCCL performance across VM groups with a specified batch scale.
  • topo-aware, validate the NCCL/RCCL performance across VM pairs with different distances/hops as a quick test.
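
To illustrate what "all possible combinations in parallel" can look like for the pair-wise pattern, here is a sketch that uses the classic round-robin (circle) schedule: every round pairs each host at most once, and over all rounds every pair of hosts meets exactly once. This is an assumed scheduling scheme chosen for illustration, and the function name is hypothetical; the benchmark's own pair generation may differ.

```python
def pairwise_rounds(hosts):
    """Return a list of rounds; each round is a list of disjoint host pairs."""
    nodes = list(hosts)
    if len(nodes) % 2:
        nodes.append(None)  # placeholder so an odd host count still pairs up
    n = len(nodes)
    rounds = []
    for _ in range(n - 1):
        # Pair the i-th node from the front with the i-th node from the back.
        pairs = [(nodes[i], nodes[n - 1 - i]) for i in range(n // 2)]
        rounds.append([p for p in pairs if None not in p])
        # Keep the first node fixed and rotate the rest by one position.
        nodes = [nodes[0], nodes[-1]] + nodes[1:-1]
    return rounds


if __name__ == "__main__":
    for index, pairs in enumerate(pairwise_rounds(["vm0", "vm1", "vm2", "vm3"])):
        print(index, pairs)
```

All pairs within one round can run concurrently because no host appears twice in a round.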

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| nccl-bw/${operation}_${msg_size}_time | time (us) | NCCL operation latency with given message size. |
| nccl-bw/${operation}_${msg_size}_algbw | bandwidth (GB/s) | NCCL operation algorithm bandwidth with given message size. |
| nccl-bw/${operation}_${msg_size}_busbw | bandwidth (GB/s) | NCCL operation bus bandwidth with given message size. |
| rccl-bw/${operation}_${msg_size}_time | time (us) | RCCL operation latency with given message size. |
| rccl-bw/${operation}_${msg_size}_algbw | bandwidth (GB/s) | RCCL operation algorithm bandwidth with given message size. |
| rccl-bw/${operation}_${msg_size}_busbw | bandwidth (GB/s) | RCCL operation bus bandwidth with given message size. |
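
The three metrics are related: algorithm bandwidth is message size divided by time, and bus bandwidth rescales it by a per-operation factor that depends on the number of ranks. The sketch below follows the convention described in the nccl-tests performance notes; the function names and the sample numbers are hypothetical, and the factors should be treated as an assumption about that convention rather than as code from this benchmark.

```python
# Per-operation factor applied to algbw to obtain busbw, for n ranks.
BUS_FACTOR = {
    "allreduce": lambda n: 2 * (n - 1) / n,
    "allgather": lambda n: (n - 1) / n,
    "reducescatter": lambda n: (n - 1) / n,
    "alltoall": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
    "reduce": lambda n: 1.0,
}


def algbw_gbps(msg_size_bytes, time_us):
    """Algorithm bandwidth in GB/s: bytes moved divided by elapsed time."""
    return msg_size_bytes / (time_us * 1e-6) / 1e9


def busbw_gbps(operation, msg_size_bytes, time_us, n_ranks):
    """Bus bandwidth in GB/s: algbw scaled by the operation's factor."""
    return algbw_gbps(msg_size_bytes, time_us) * BUS_FACTOR[operation](n_ranks)


if __name__ == "__main__":
    # Made-up measurement: 1 GB allreduce across 8 ranks in 10,000 us.
    print(algbw_gbps(1_000_000_000, 10_000))
    print(busbw_gbps("allreduce", 1_000_000_000, 10_000, 8))
```

Bus bandwidth is the figure to compare against hardware link speed, since it is meant to be independent of the number of ranks.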

If MPI mode is enabled and a traffic pattern is specified, the metric pattern changes to nccl-bw/${operation}_${serial_index}_${parallel_index}:${msg_size}_time, where:

  • serial_index is the index of the host group among the groups that run serially.
  • parallel_index is the index of the host list among the lists that run in parallel.

tcp-connectivity#

Introduction#

Test the TCP connectivity between the current node and the nodes in the hostfile, performed by tcping.

Metrics#

| Metrics | Unit | Description |
|---------|------|-------------|
| tcp-connectivity/${hostname/ip}_successed_count | count | Number of successful TCP connections between the current node and other nodes. |
| tcp-connectivity/${hostname/ip}_failed_count | count | Number of failed TCP connections between the current node and other nodes. |
| tcp-connectivity/${hostname/ip}_success_rate | | Success rate (successful/total) of TCP connections between the current node and other nodes. |
| tcp-connectivity/${hostname/ip}_time_min | time (ms) | Minimum latency of TCP connections between the current node and other nodes. |
| tcp-connectivity/${hostname/ip}_time_max | time (ms) | Maximum latency of TCP connections between the current node and other nodes. |
| tcp-connectivity/${hostname/ip}_time_avg | time (ms) | Average latency of TCP connections between the current node and other nodes. |
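
As an illustration of how these values fit together for one target host, the sketch below derives the counts, success rate, and latency statistics from a list of successful connection latencies and a failure count. The function name and inputs are hypothetical, and the success rate is expressed as a fraction here; the benchmark may report it on a different scale.

```python
from statistics import mean


def connectivity_metrics(host, latencies_ms, failed_count):
    """Summarize connection attempts to `host`; latencies cover successful attempts only."""
    succeeded = len(latencies_ms)
    total = succeeded + failed_count
    prefix = f"tcp-connectivity/{host}"
    metrics = {
        f"{prefix}_successed_count": succeeded,
        f"{prefix}_failed_count": failed_count,
        # successful / total, guarded against the case of zero attempts
        f"{prefix}_success_rate": succeeded / total if total else 0.0,
    }
    if latencies_ms:
        metrics[f"{prefix}_time_min"] = min(latencies_ms)
        metrics[f"{prefix}_time_max"] = max(latencies_ms)
        metrics[f"{prefix}_time_avg"] = mean(latencies_ms)
    return metrics


if __name__ == "__main__":
    # Made-up result: three successful connections and one failure.
    for name, value in connectivity_metrics("node-1", [1.0, 2.0, 3.0], 1).items():
        print(name, value)
```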

gpcnet-network-test / gpcnet-network-load-test#

Introduction#

A distributed test of global network performance and congestion, performed by GPCNET.

gpcnet-network-test: Full-system network tests using random ring, natural ring, alltoall, and allreduce patterns. Requires at least 2 nodes.

gpcnet-network-load-test: Selected full-system network tests run with four congestors to measure network congestion or contention. Requires at least 10 nodes.

  • Supported network tests: RR Two-sided Lat (8 B), RR Get Lat (8 B), RR Two-sided BW (131072 B), RR Put BW (131072 B), RR Two-sided BW+Sync (131072 B), Nat Two-sided BW (131072 B), Multiple Allreduce (8 B), Multiple Alltoall (4096 B)
  • Supported congestors: Alltoall (4096 B), Two-sided Incast (4096 B), Put Incast (4096 B), Get Bcast (4096 B)

Metrics#

| Metrics | Unit | Description |
|---------|------|-------------|
| gpcnet-network-test/rr_two-sided_lat_${stat} | time (us) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'random ring communication pattern two-sided latency' test. |
| gpcnet-network-test/rr_two-sided+sync_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'random ring communication pattern two-sided bandwidth with barrier' test. |
| gpcnet-network-test/multiple_allreduce_time_${stat} | time (us) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'Multiple Allreduce (8 B)' test. |
| gpcnet-network-test/rr_get_lat_${stat} | time (us) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'RR Get Lat (8 B)' test. |
| gpcnet-network-test/rr_two-sided_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'RR Two-sided BW (131072 B)' test. |
| gpcnet-network-test/nat_two-sided_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'Nat Two-sided BW (131072 B)' test. |
| gpcnet-network-test/multiple_alltoall_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'Multiple Alltoall (4096 B)' test. |
| gpcnet-network-load-test/rr_two-sided_lat_x_${stat} | factor (x) | Summary of the congestion impact factor for the network test. |
| gpcnet-network-load-test/rr_two-sided+sync_bw_x_${stat} | factor (x) | Summary of the congestion impact factor for the network test. |
| gpcnet-network-load-test/multiple_allreduce_x_${stat} | factor (x) | Summary of the congestion impact factor for the network test. |

ib-traffic#

Introduction#

Measure the InfiniBand performance under multi-node traffic patterns.

The direction between client and server can be 'cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu'.

The traffic pattern is defined in a config file, which is pre-defined for one-to-many, many-to-one and all-to-all patterns. Each row in the config is one round, and all pairs of nodes in a row run ib command simultaneously.

Besides the above three patterns, ib-traffic also supports topology-aware traffic pattern. To run ib-traffic with topology-aware pattern, the user needs to specify 3 required (and 2 optional) parameters in YAML config file:

  • --pattern  topo-aware
  • --ibstat  path to ibstat output
  • --ibnetdiscover  path to ibnetdiscover output
  • --min_dist  minimum distance of VM pairs (optional, default 2)
  • --max_dist  maximum distance of VM pairs (optional, default 6)

Each row in the config file contains all VM pairs at a fixed distance (number of hops). By default, the 1st, 2nd, and 3rd rows contain all VM pairs with a topology distance of 2, 4, and 6, respectively.

Metrics#

| Metrics | Unit | Description |
|---------|------|-------------|
| ib-traffic/ib_write_bw_${msg_size}_${direction}_${line}_${pair}:${server}_${client} | bandwidth (GB/s) | The max bandwidth of perftest (ib_write_bw, ib_send_bw, ib_read_bw) using ${msg_size} with ${direction} ('cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu') run between the ${pair}th node pair in the ${line}th line of the config. ${server} and ${client} are the hostnames of the server and client. |
| ib-traffic/ib_write_lat_${msg_size}_${direction}_${line}_${pair}:${server}_${client} | time (us) | The max latency of perftest (ib_write_lat, ib_send_lat, ib_read_lat) using ${msg_size} with ${direction} ('cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu') run between the ${pair}th node pair in the ${line}th line of the config. ${server} and ${client} are the hostnames of the server and client. |

nvbandwidth#

Introduction#

Measure bandwidth and latency for various memcpy patterns across different links using copy engine or kernel copy methods, performed by nvbandwidth.

Metrics#

| Metrics | Unit | Description |
| --- | --- | --- |
| host_to_device_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | Host to device CE memcpy using cuMemcpyAsync. |
| host_to_device_memcpy_ce_sum_bw | GB/s | Sum of all values in the output matrix. |
| device_to_host_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | Device to host CE memcpy using cuMemcpyAsync. |
| device_to_host_memcpy_ce_sum_bw | GB/s | Sum of all values in the output matrix. |
| host_to_device_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | A host to device copy is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. |
| host_to_device_bidirectional_memcpy_ce_sum_bw | GB/s | Sum of all values in the output matrix. |
| device_to_host_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | A device to host copy is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. |
| device_to_host_bidirectional_memcpy_ce_sum_bw | GB/s | Sum of all values in the output matrix. |
| device_to_device_memcpy_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_memcpy_read_ce_sum_bw | GB/s | Sum of all values in the output matrix. |
| device_to_device_memcpy_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_memcpy_write_ce_sum_bw | GB/s | Sum of all values in the output matrix. |
| device_to_device_bidirectional_memcpy_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. A copy in the opposite direction of the measured copy is run simultaneously but not measured. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_bidirectional_memcpy_read_ce_sum_bw | GB/s | Sum of all values in the output matrix. |
| device_to_device_bidirectional_memcpy_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. A copy in the opposite direction of the measured copy is run simultaneously but not measured. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_bidirectional_memcpy_write_ce_sum_bw | GB/s | Sum of all values in the output matrix. |
| all_to_host_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between a single device and the host while simultaneously running copies from all other devices to the host. |
| all_to_host_memcpy_ce_sum_bw | GB/s | Sum of all values in the output matrix. |
| all_to_host_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | A device to host copy is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic. |
| all_to_host_bidirectional_memcpy_ce_sum_bw | GB/s | Sum of all values in the output matrix. |
| host_to_all_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync from the host to a single device while simultaneously running copies from the host to all other devices. |
| host_to_all_memcpy_ce_sum_bw | GB/s | Sum of all values in the output matrix. |
| host_to_all_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | A host to device copy is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic. |
| host_to_all_bidirectional_memcpy_ce_sum_bw | GB/s | Sum of all values in the output matrix. |
| all_to_one_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| all_to_one_write_ce_sum_bw | GB/s | Sum of all values in the output matrix. |
| all_to_one_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| all_to_one_read_ce_sum_bw | GB/s | Sum of all values in the output matrix. |
| one_to_all_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| one_to_all_write_ce_sum_bw | GB/s | Sum of all values in the output matrix. |
| one_to_all_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| one_to_all_read_ce_sum_bw | GB/s | Sum of all values in the output matrix. |
| host_to_device_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | Host to device SM memcpy using a copy kernel. |
| host_to_device_memcpy_sm_sum_bw | GB/s | Sum of all values in the output matrix. |
| device_to_host_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | Device to host SM memcpy using a copy kernel. |
| device_to_host_memcpy_sm_sum_bw | GB/s | Sum of all values in the output matrix. |
| device_to_device_memcpy_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_memcpy_read_sm_sum_bw | GB/s | Sum of all values in the output matrix. |
| device_to_device_memcpy_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_memcpy_write_sm_sum_bw | GB/s | Sum of all values in the output matrix. |
| device_to_device_bidirectional_memcpy_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Copies are run in both directions between each pair, and the sum is reported. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_bidirectional_memcpy_read_sm_sum_bw | GB/s | Sum of all values in the output matrix. |
| device_to_device_bidirectional_memcpy_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Copies are run in both directions between each pair, and the sum is reported. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_bidirectional_memcpy_write_sm_sum_bw | GB/s | Sum of all values in the output matrix. |
| all_to_host_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between a single device and the host while simultaneously running copies from all other devices to the host. |
| all_to_host_memcpy_sm_sum_bw | GB/s | Sum of all values in the output matrix. |
| all_to_host_bidirectional_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | The device to host bandwidth of a copy kernel is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic using copy kernels. |
| all_to_host_bidirectional_memcpy_sm_sum_bw | GB/s | Sum of all values in the output matrix. |
| host_to_all_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel from the host to a single device while simultaneously running copies from the host to all other devices. |
| host_to_all_memcpy_sm_sum_bw | GB/s | Sum of all values in the output matrix. |
| host_to_all_bidirectional_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | The host to device bandwidth of a copy kernel is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic using copy kernels. |
| host_to_all_bidirectional_memcpy_sm_sum_bw | GB/s | Sum of all values in the output matrix. |
| all_to_one_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| all_to_one_write_sm_sum_bw | GB/s | Sum of all values in the output matrix. |
| all_to_one_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| all_to_one_read_sm_sum_bw | GB/s | Sum of all values in the output matrix. |
| one_to_all_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| one_to_all_write_sm_sum_bw | GB/s | Sum of all values in the output matrix. |
| one_to_all_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| one_to_all_read_sm_sum_bw | GB/s | Sum of all values in the output matrix. |
| host_device_latency_sm_cpu[0-9]_gpu[0-9]_lat | µs | Host-device SM copy latency using a pointer-chase kernel. |
| host_device_latency_sm_sum_lat | µs | Sum of all values in the output matrix. |
| device_to_device_latency_sm_gpu[0-9]_gpu[0-9]_lat | µs | Measures latency of a pointer dereference operation between each pair of accessible peers. Memory is allocated on a GPU and is accessed by the peer GPU to determine latency. |
| device_to_device_latency_sm_sum_lat | µs | Sum of all values in the output matrix. |
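
The metric names above follow a regular shape: test name, copy method (ce or sm), either an index pair or `sum`, and a `bw` or `lat` suffix. The sketch below shows one way to split such names when post-processing results. The regular expression, the helper names `parse_metric` and `pair_total`, and the sample values are all illustrative assumptions inferred from the table, not part of the benchmark itself.

```python
import re

# Assumed pattern, inferred from the metric names listed above.
METRIC_RE = re.compile(
    r"^(?P<test>[a-z_]+)_(?P<method>ce|sm)_"
    r"(?:(?P<first>(?:cpu|gpu)\d+)_(?P<second>gpu\d+)|(?P<sum>sum))"
    r"_(?P<kind>bw|lat)$"
)


def parse_metric(name):
    """Split a metric name into its parts, or return None if it does not match."""
    match = METRIC_RE.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    is_sum = parts["sum"] is not None
    return {
        "test": parts["test"],
        "method": parts["method"],  # "ce" = copy engine, "sm" = copy kernel
        "pair": None if is_sum else (parts["first"], parts["second"]),
        "is_sum": is_sum,
        "unit": "GB/s" if parts["kind"] == "bw" else "µs",
    }


def pair_total(results, test, method):
    """Add up the per-pair values of one test, for comparison with its sum metric."""
    total = 0.0
    for name, value in results.items():
        info = parse_metric(name)
        if info and not info["is_sum"] and (info["test"], info["method"]) == (test, method):
            total += value
    return total


# Made-up sample values, for illustration only.
sample = {
    "host_to_device_memcpy_ce_cpu0_gpu0_bw": 25.0,
    "host_to_device_memcpy_ce_cpu0_gpu1_bw": 24.5,
    "host_to_device_memcpy_ce_sum_bw": 49.5,
}
print(parse_metric("host_to_device_memcpy_ce_cpu0_gpu1_bw"))
print(pair_total(sample, "host_to_device_memcpy", "ce"))
```

Comparing `pair_total` against the corresponding `_sum_` metric is a cheap sanity check that no per-pair entry was dropped while collecting results.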

Computation-communication Benchmarks#

computation-communication-overlap#

Introduction#

Test the performance of a single node when communication and computation overlap.

Metrics#

| Name | Unit | Description |
| --- | --- | --- |
| pytorch-computation-communication-overlap/mul_time | time (ms) | Time of communication and mul kernel computation overlap. |
| pytorch-computation-communication-overlap/matmul_time | time (ms) | Time of communication and matmul kernel computation overlap. |

sharding-matmul#

Introduction#

Test the performance of large-scale matmul operations across multiple GPUs, using one of two merge strategies:

  • allreduce: Each GPU computes part of the matmul, then AllReduce merges the partial results into one tensor.
  • allgather: Each GPU computes part of the matmul, then AllGather + Concat merges the partial results into one tensor.
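
The two merge strategies can be illustrated with a single-process simulation in plain Python. This is only a sketch under stated assumptions: the text above does not say which dimension each variant shards, so the code assumes the allreduce variant splits the inner dimension (each rank produces a full-size partial sum) and the allgather variant splits the output columns (each rank produces a slice). The helper names are hypothetical, and lists of lists stand in for GPU tensors.

```python
def matmul(a, b):
    """Plain row-by-column matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split(seq, parts):
    """Split a sequence into `parts` contiguous chunks of equal size."""
    size = len(seq) // parts
    return [seq[i * size:(i + 1) * size] for i in range(parts)]


def column_shards(m, parts):
    """Split a matrix into `parts` column blocks."""
    columns = list(zip(*m))  # transpose so that columns become rows
    return [[list(row) for row in zip(*chunk)] for chunk in split(columns, parts)]


def sharded_allreduce(a, b, world_size):
    # Assumption: rank r holds a column block of A and the matching row block of B.
    a_parts = column_shards(a, world_size)
    b_parts = split(b, world_size)
    partials = [matmul(a_r, b_r) for a_r, b_r in zip(a_parts, b_parts)]
    # Simulated AllReduce: elementwise sum of the full-size partial products.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def sharded_allgather(a, b, world_size):
    # Assumption: rank r holds all of A and one column block of B.
    partials = [matmul(a, b_r) for b_r in column_shards(b, world_size)]
    # Simulated AllGather + Concat: join the column slices of each output row.
    result = []
    for i in range(len(a)):
        row = []
        for part in partials:
            row.extend(part[i])
        result.append(row)
    return result


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 0], [1, 2, 1, 3]]
print(sharded_allreduce(a, b, 2) == matmul(a, b))
print(sharded_allgather(a, b, 2) == matmul(a, b))
```

Both variants reproduce the unsharded product. They differ in what is communicated: full-size partial sums in the allreduce case, output slices in the allgather case.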

Metrics#

| Name | Unit | Description |
| --- | --- | --- |
| pytorch-sharding-matmul/allreduce_time | time (ms) | Time of sharding matmul using allreduce. |
| pytorch-sharding-matmul/allgather_time | time (ms) | Time of sharding matmul using allgather. |

dist-inference#

Introduction#

Test the performance of distributed model inference. Both a PyTorch implementation and a C++ implementation are supported.

Parameters#

| Parameter | Default | Description |
| --- | --- | --- |
| --use_pytorch | false | Use the PyTorch implementation. If omitted, the C++ implementation is used. |
| --batch_size | 64 | Batch size of the generated input tensor. |
| --input_size | 1024 | Input dimension of the synthetic model. |
| --hidden_size | 1024 | Hidden dimension of the synthetic model. |
| --alpha | 1.0 | Alpha coefficient for D = alpha * (A * B) + beta * C. |
| --beta | 1.0 | Beta coefficient for D = alpha * (A * B) + beta * C. |
| --num_layers | 1 | Number of repeated compute-communicate-activate layers. |
| --computation_kernel | matmul | Computation kernel: addmm, matmul, or mul. |
| --communication_kernel | allreduce | Communication kernel: allgather, allreduce, or alltoall. |
| --activation_kernel | relu | Activation kernel: relu, sigmoid, or tanh. |
| --precision | float32 | Model precision, such as float32 or float16. |
| --num_warmup | 50 | Number of warmup steps excluded from metrics. |
| --num_steps | 10000 | Number of measured inference steps. |
| --distributed_impl | ddp | Distributed implementation for the PyTorch path. |
| --distributed_backend | nccl | Distributed backend for the PyTorch path. |
| --use_cuda_graph | false | Launch kernels in CUDA graph mode when supported. |
| --tune_gemm | false | Tune GEMM performance before measurement in the C++ implementation. |
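
The formula D = alpha * (A * B) + beta * C and the compute-communicate-activate layer structure can be illustrated with a small single-process simulation. This is only a sketch: it uses plain Python lists instead of GPU tensors, simulates allreduce as an elementwise sum across ranks, and treats A, B, and C as input, weight, and bias, which is an assumption not stated above. All function names are hypothetical.

```python
def addmm(alpha, a, b, beta, c):
    """Return D = alpha * (A @ B) + beta * C for lists of lists."""
    product = [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
    return [[alpha * p + beta * q for p, q in zip(p_row, c_row)]
            for p_row, c_row in zip(product, c)]


def allreduce(per_rank):
    """Simulated allreduce: elementwise sum over the tensors held by each rank."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*per_rank)]


def relu(m):
    """Elementwise max(0, x)."""
    return [[max(0.0, v) for v in row] for row in m]


def layer(per_rank_inputs, weight, bias, alpha=1.0, beta=1.0):
    """One compute-communicate-activate layer, evaluated for every simulated rank."""
    computed = [addmm(alpha, x, weight, beta, bias) for x in per_rank_inputs]  # compute
    reduced = allreduce(computed)                                              # communicate
    return [relu(reduced) for _ in per_rank_inputs]                            # activate


# Two simulated ranks, identity weight, made-up inputs.
inputs = [[[1.0, -2.0]], [[0.5, 1.0]]]
weight = [[1.0, 0.0], [0.0, 1.0]]
bias = [[0.0, -1.0]]
outputs = layer(inputs, weight, bias)
print(outputs[0])
```

With `--num_layers` greater than 1, a loop would feed each layer's output back in as the next layer's input.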

Metrics#

| Name | Unit | Description |
| --- | --- | --- |
| pytorch-dist-inference/step_times | time (ms) | Average time of model inference runs. |
| pytorch-dist-inference/step_times_${percentile} | time (ms) | Tail (50, 90, 95, 99, 99.9 percentile) time of model inference runs. |
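
As a rough illustration of how an average and tail values relate to the raw per-step timings, the sketch below summarizes a list of step times after dropping warmup steps. It uses the nearest-rank percentile definition, which is an assumption: the benchmark may interpolate differently. The function names and sample data are hypothetical.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list (p in the range 0 < p <= 100)."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(step_times_ms, num_warmup=0):
    """Return the mean and tail percentiles of the measured (non-warmup) steps."""
    measured = sorted(step_times_ms[num_warmup:])
    return {
        "mean": sum(measured) / len(measured),
        "percentiles": {p: percentile(measured, p) for p in (50, 90, 95, 99, 99.9)},
    }


# Made-up timings: two slow warmup steps followed by 100 measured steps.
times = [1000.0, 1000.0] + [float(i) for i in range(1, 101)]
summary = summarize(times, num_warmup=2)
print(summary["mean"], summary["percentiles"])
```

Excluding warmup steps matters here: the two slow initial steps would otherwise dominate both the mean and the upper percentiles.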

Storage Benchmarks#

disk-benchmark#

Introduction#

Measure the disk performance through FIO.

Metrics#

| Name | Unit | Description |
| --- | --- | --- |
| disk-benchmark/${disk_name}_rand_read_write_bs | size (bytes) | Disk random read write block size. |
| disk-benchmark/${disk_name}_rand_read_write_read_iops | IOPS | Disk random read write read IOPS. |
| disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_95.0 | time (ns) | Disk random read write read latency at the 95.0th percentile. |
| disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_99.0 | time (ns) | Disk random read write read latency at the 99.0th percentile. |
| disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_99.9 | time (ns) | Disk random read write read latency at the 99.9th percentile. |
| disk-benchmark/${disk_name}_rand_read_write_write_iops | IOPS | Disk random read write write IOPS. |
| disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_95.0 | time (ns) | Disk random read write write latency at the 95.0th percentile. |
| disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_99.0 | time (ns) | Disk random read write write latency at the 99.0th percentile. |
| disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_99.9 | time (ns) | Disk random read write write latency at the 99.9th percentile. |