Testing tensor cores
Although tensor cores were introduced with the RTX 20 series, I only tried them briefly after launch and never used them in practice apart from deploying TensorRT models. Initially they only supported FP16, but more data types are supported now. With the significant boost to tensor cores in the RTX 50 series, and the disappointing increase in shader-core performance, it's worth spending some time investigating this feature. I want to test how the tensor cores' performance compares to the ordinary CUDA/shader cores, and whether I should upgrade to an RTX 5090.
I'm running all tests on my RTX 3080 Ti, which has 320 tensor cores and offers 136 dense / 273 sparse FP16 TFLOPS and 34.1 / 68.2 TF32 TFLOPS. For higher precision, tensor cores typically use the TF32 data type, which keeps FP32's 8-bit exponent but only has 10 bits for the significand (reduced from FP32's 23 bits). This reduction lowers precision to roughly 3 decimal digits, but gives a significant speedup (about 500 TFLOPS of TF32 on the tensor cores versus about 60 TFLOPS of FP32 on the CUDA cores of an NVIDIA H100). Server GPUs also have better support for higher-precision tensor-core computation.
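To get a feeling for what 10 significand bits mean, here is a small sketch (plain Python, no GPU needed; the round_to_tf32 helper is my own, not an NVIDIA API) that emulates TF32 by truncating a float32 significand to 10 bits. The hardware rounds rather than truncates, but the size of the error is comparable:

import struct

def round_to_tf32(x: float) -> float:
    # Reinterpret the float32 bit pattern and zero the low 13 significand bits,
    # keeping the sign bit, the 8 exponent bits and the top 10 significand bits.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    bits &= 0xFFFFE000
    return struct.unpack('<f', struct.pack('<I', bits))[0]

x = 3.14159265
y = round_to_tf32(x)
print(y, abs(y - x) / x)  # relative error on the order of 2**-10, i.e. about 3 decimal digits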
Does my RTX 3080 Ti support TF32? It does according to this, but is that throughput coming from the tensor cores or from the regular shader cores? I need to test it to find out, and monitoring tensor-core utilization takes more effort than a glance at nvidia-smi.
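As a quick first check, PyTorch itself can report whether the card's architecture has the TF32 tensor-core path at all (TF32 requires Ampere or newer, i.e. compute capability 8.0+). This only confirms the hardware path exists, not that a given kernel actually uses it:

import torch

name = torch.cuda.get_device_name()
major, minor = torch.cuda.get_device_capability()
print(f"{name}: compute capability {major}.{minor}")  # RTX 3080 Ti (Ampere) reports 8.6
print("TF32 tensor-core path available:", major >= 8)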
Monitoring
The easiest way is to compare the numerical results (and the run time) of the same FP32 matmul with TF32 enabled and disabled:
import torch
import torch.profiler

matrix_size = 2 << 12  # 8192 x 8192, large enough to keep the Tensor Cores busy

# Generate two large random matrices on the GPU with FP32 precision
A = torch.randn((matrix_size, matrix_size), device='cuda', dtype=torch.float32)
B = torch.randn((matrix_size, matrix_size), device='cuda', dtype=torch.float32)

def test():
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log'),
        record_shapes=True,
        with_stack=True
    ) as prof:
        C = torch.zeros((matrix_size, matrix_size), device='cuda', dtype=torch.float32)
        for _ in range(100):
            C = torch.matmul(A, B)  # FP32 matmul; runs as TF32 on the Tensor Cores when allowed
    print(prof.key_averages().table(sort_by="cuda_time_total"))
    return C

torch.backends.cuda.matmul.allow_tf32 = True    # Enable TF32 for matmul
torch.backends.cudnn.allow_tf32 = True          # Enable TF32 for cuDNN operations
C_tf32 = test()

torch.backends.cuda.matmul.allow_tf32 = False   # Disable TF32 for matmul
torch.backends.cudnn.allow_tf32 = False         # Disable TF32 for cuDNN operations
C_fp32 = test()

error = (C_tf32 - C_fp32).abs().max()
print(f"Max absolute error between TF32 and FP32: {error}")
I do observe a speedup and a difference in the resulting matrices. The first profile below is the TF32 run, the second is the FP32 run:
--------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
--------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::zeros 0.06% 1.711ms 0.77% 23.881ms 23.881ms 1
aten::empty 0.03% 1.076ms 0.07% 2.163ms 2.163ms 1
cudaStreamIsCapturing 0.00% 5.746us 0.00% 5.746us 1.436us 4
cudaMalloc 0.14% 4.196ms 0.14% 4.196ms 599.480us 7
aten::zero_ 0.06% 1.934ms 0.65% 20.008ms 20.008ms 1
aten::fill_ 0.03% 944.903us 0.58% 18.074ms 18.074ms 1
cudaLaunchKernel 0.55% 17.129ms 0.55% 17.129ms 17.129ms 1
aten::matmul 0.02% 632.346us 2.81% 86.796ms 867.963us 100
aten::mm 1.91% 58.923ms 2.79% 86.164ms 861.639us 100
cudaFree 0.10% 3.219ms 0.10% 3.219ms 3.219ms 1
cudaDeviceGetAttribute 0.00% 25.599us 0.00% 25.599us 0.217us 118
cudaGetDriverEntryPoint 0.00% 0.879us 0.00% 0.879us 0.440us 2
cudaGetSymbolAddress 0.63% 19.461ms 0.63% 19.461ms 19.461ms 1
cuLaunchKernel 0.05% 1.419ms 0.05% 1.419ms 14.191us 100
cudaDeviceSynchronize 96.42% 2.982s 96.42% 2.982s 2.982s 1
--------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 3.093s
-------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
-------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::zeros 0.00% 33.837us 0.03% 1.406ms 1.406ms 1
aten::empty 0.02% 1.304ms 0.02% 1.304ms 1.304ms 1
aten::zero_ 0.00% 6.321us 0.00% 67.731us 67.731us 1
aten::fill_ 0.00% 20.899us 0.00% 61.410us 61.410us 1
cudaLaunchKernel 0.00% 40.511us 0.00% 40.511us 40.511us 1
aten::matmul 0.00% 220.170us 0.17% 9.472ms 94.718us 100
aten::mm 0.09% 4.829ms 0.17% 9.252ms 92.516us 100
cudaDeviceGetAttribute 0.00% 31.641us 0.00% 31.641us 0.316us 100
cudaMemsetAsync 0.03% 1.775ms 0.03% 1.775ms 17.752us 100
cuLaunchKernel 0.02% 1.246ms 0.02% 1.246ms 12.456us 100
cudaStreamIsCapturing 0.00% 2.410us 0.00% 2.410us 2.410us 1
cudaMalloc 0.03% 1.367ms 0.03% 1.367ms 1.367ms 1
cudaDeviceSynchronize 99.80% 5.408s 99.80% 5.408s 5.408s 1
-------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 5.419s
Max absolute error between TF32 and FP32: 0.15386199951171875
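For context, each entry of C is a sum of 8192 products of standard normal values, so a typical entry has magnitude around sqrt(8192) ≈ 90; an absolute error of ~0.15 therefore corresponds to a relative error in the 1e-3 range, roughly what a 10-bit significand predicts. A quick sketch to confirm, reusing C_tf32 and C_fp32 from above:

rel_err = ((C_tf32 - C_fp32).abs() / C_fp32.abs().clamp_min(1e-6)).median()
print(f"Median relative error: {rel_err:.2e}")  # expect somewhere around 1e-4 to 1e-3

The profiler tables above are also dominated by a single cudaDeviceSynchronize, so as a complementary measurement I can time the matmul directly and convert it to achieved TFLOPS (a sketch using torch.utils.benchmark; one N×N matmul is about 2·N³ FLOPs), to compare against the spec-sheet 34.1 dense TF32 TFLOPS:

import torch
from torch.utils import benchmark

N = 2 << 12
A = torch.randn(N, N, device='cuda')
B = torch.randn(N, N, device='cuda')

for tf32 in (True, False):
    torch.backends.cuda.matmul.allow_tf32 = tf32
    t = benchmark.Timer(stmt='torch.matmul(A, B)', globals={'torch': torch, 'A': A, 'B': B})
    seconds = t.blocked_autorange(min_run_time=2).median  # median time per matmul, CUDA-synchronized
    print(f"allow_tf32={tf32}: {seconds * 1e3:.1f} ms/matmul, ~{2 * N**3 / seconds / 1e12:.1f} TFLOPS")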
Automatic Mixed Precision package
This is the easiest way to leverage the tensor cores for training and inference: it allows selected ops to automatically run in lower precision. I will use the standard tutorial for testing.
Include this import for AMP:
from torch.amp import autocast, GradScaler
and modify the forward and backpropagation code as follows:
scaler = GradScaler('cuda')
...
with autocast('cuda'):  # Enable mixed precision for the forward pass
    pred = model(data)
    loss = loss_fn(pred, y)

# Backpropagation
scaler.scale(loss).backward()  # Scale the loss for stability
scaler.step(optimizer)
scaler.update()
...
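For completeness, here is a minimal self-contained sketch of the whole AMP training step (the model, data, and hyperparameters are placeholders of my own, not the ones from the tutorial):

import torch
from torch import nn
from torch.amp import autocast, GradScaler

device = 'cuda'
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
scaler = GradScaler('cuda')

for _ in range(100):
    data = torch.randn(256, 512, device=device)      # placeholder batch
    y = torch.randint(0, 10, (256,), device=device)  # placeholder labels

    optimizer.zero_grad(set_to_none=True)
    with autocast('cuda'):                 # forward pass in mixed precision
        pred = model(data)
        loss = loss_fn(pred, y)

    scaler.scale(loss).backward()          # scale the loss so small FP16 gradients don't underflow
    scaler.step(optimizer)                 # unscales gradients, skips the step if inf/nan appeared
    scaler.update()                        # adjust the loss scale for the next iteration

Under autocast the linear layers' matmuls should run in FP16 on the tensor cores, while reductions such as the loss stay in FP32.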