9.1 MAC Fundamentals
The Multiply-Accumulate (MAC) operation is the single most important arithmetic operation in GPU computing. Every rendered pixel, neural-network inference, and physics simulation ultimately reduces to long chains of MAC operations:

D = A × B + C
MAC Operation — Hardware View
══════════════════════════════
A ──────┐
        ├──▶ ┌────────────┐        ┌────────────┐
B ──────┘    │ Multiplier │──────▶ │   Adder    │──────▶ D (Result)
             │  (A × B)   │        │            │
             └────────────┘        │ (Prod + C) │
C ────────────────────────────────▶│            │
                                   └────────────┘
Single MAC: 1 multiply + 1 add = 2 FLOPs
FMA (Fused Multiply-Add):
┌──────────────────────────────────────────────────┐
│ Multiplier and Adder share a SINGLE rounding │
│ step → more accurate + saves 1 cycle │
│ IEEE 754-2008 mandates FMA as a single op │
└──────────────────────────────────────────────────┘
MAC vs FMA — Critical Distinction:

| Operation | Formula | Rounding | Accuracy | Hardware Cost |
|---|---|---|---|---|
| MAC (unfused) | D = round(round(A×B) + C) | Two rounding steps | Lower — error accumulates | Separate multiplier + adder |
| FMA (fused) | D = round(A×B + C) | One rounding step | Higher — single rounding | Fused unit, wider internal datapath |
Why FMA Matters: In deep learning, millions of MAC operations chain together. With unfused MAC, rounding errors accumulate across layers and can degrade model accuracy; FMA's single rounding step keeps that error in check, as the sketch below makes concrete. Modern GPU floating-point pipelines are built around FMA units rather than separate multiply and add stages.
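To see the double-rounding effect in a simulator, here is a minimal testbench sketch (the module name tb_fma_vs_mac is ours; in SystemVerilog, shortreal is FP32 and real is FP64, though simulator support for true single-precision shortreal arithmetic varies). Evaluating the expression in double precision approximates the single final rounding an FMA performs:

module tb_fma_vs_mac;
  shortreal a, b, c, unfused;
  real      fused_ref;
  initial begin
    a = 1.0e8;                    // exactly representable in FP32
    b = 1.0 + 2.0**-23;           // 1 plus one FP32 ulp
    c = -1.0e8;
    unfused   = (a * b) + c;      // product rounded to FP32, THEN added
    fused_ref = real'(a) * real'(b) + real'(c);  // one final rounding
    $display("unfused MAC : %f", unfused);       // expected ~8.000000
    $display("fused (ref) : %f", fused_ref);     // expected ~11.920929
  end
endmodule

The exact product is 100000011.92…; rounding it to FP32 first (ulp = 8 at this magnitude) snaps it to 100000008 before the subtraction, so the unfused result is off by almost 4.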
9.2 Where MAC Lives in GPU Architecture
MAC/FMA units are embedded at every level of the GPU compute hierarchy:
MAC Usage Across GPU Architecture
═══════════════════════════════════
┌───────────────────────────────────────────────────────┐
│                Streaming Multiprocessor               │
│                                                       │
│  ┌─────────────────────────────────────────────┐      │
│  │            Shader / CUDA Cores              │      │
│  │  ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐    │      │
│  │  │ FP32  │ │ FP32  │ │ FP32  │ │ FP32  │    │      │
│  │  │ FMA   │ │ FMA   │ │ FMA   │ │ FMA   │    │      │
│  │  │ Unit  │ │ Unit  │ │ Unit  │ │ Unit  │    │      │
│  │  └───────┘ └───────┘ └───────┘ └───────┘    │      │
│  │  × 32 per SM = 32 FMAs per cycle            │      │
│  │  Each processes 1 thread of a warp          │      │
│  └─────────────────────────────────────────────┘      │
│                                                       │
│  ┌─────────────────────────────────────────────┐      │
│  │            Tensor Cores (AI/ML)             │      │
│  │  ┌───────────────────────────────────┐      │      │
│  │  │    4×4 Systolic Array of MACs     │      │      │
│  │  │                                   │      │      │
│  │  │     MAC   MAC   MAC   MAC         │      │      │
│  │  │     MAC   MAC   MAC   MAC         │      │      │
│  │  │     MAC   MAC   MAC   MAC         │      │      │
│  │  │     MAC   MAC   MAC   MAC         │      │      │
│  │  │                                   │      │      │
│  │  │  = 64 MACs per cycle per Tensor   │      │      │
│  │  │    Core (4×4 × 4 depth)           │      │      │
│  │  └───────────────────────────────────┘      │      │
│  │  × 4 Tensor Cores per SM                    │      │
│  └─────────────────────────────────────────────┘      │
│                                                       │
│  ┌─────────────────────────────────────────────┐      │
│  │          Texture Units (Graphics)           │      │
│  │  Bilinear filtering  = 4 MACs per texel     │      │
│  │  Trilinear filtering = 8 MACs per texel     │      │
│  └─────────────────────────────────────────────┘      │
│                                                       │
│  ┌─────────────────────────────────────────────┐      │
│  │             Ray Tracing Cores               │      │
│  │  Ray-triangle intersection = 18+ MACs       │      │
│  │  per ray-triangle test                      │      │
│  └─────────────────────────────────────────────┘      │
└───────────────────────────────────────────────────────┘
| GPU Block | MAC Type | Precision | MACs per Cycle (per SM) | Use Case |
|---|---|---|---|---|
| Shader Cores | Scalar FMA | FP32, FP16 | 32-128 | General compute, shading |
| Tensor Cores | Matrix FMA (systolic) | FP8, BF16, FP16, TF32, INT8 | 256-1024 | AI training & inference |
| Texture Units | Fixed-point MAC | FP16/fixed | 16-32 | Texture filtering |
| RT Cores | FP MAC | FP32 | Varies | Ray intersection math |
| SFU | Iterative MAC | FP32 | 8 | sin, cos, rsqrt via polynomial (sketch below) |
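The SFU row is worth unpacking: "iterative MAC" means transcendentals are evaluated as short polynomial chains, i.e., back-to-back MAC operations (Horner's rule). A minimal sketch follows (the function name poly_sin is ours, and the Taylor coefficients are for illustration only; production SFUs typically use table-driven minimax or quadratic-interpolation coefficients on a range-reduced argument):

// sin(x) near 0 via Horner's rule: one MAC per line.
function automatic shortreal poly_sin (input shortreal x);
  shortreal x2, acc;
  x2  = x * x;
  acc = -(1.0 / 5040.0);           // -x^7/7! term
  acc = acc * x2 + (1.0 / 120.0);  //  x^5/5!  <- MAC
  acc = acc * x2 - (1.0 / 6.0);    // -x^3/3!  <- MAC
  acc = acc * x2 + 1.0;            //  x term  <- MAC
  return acc * x;                  // sin(x) ≈ x · P(x²)
endfunction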
Scale Perspective: A modern high-end GPU has ~128 SMs. Each SM has ~128 FP32 FMA units + 4 Tensor Cores. Total: ~16,384 scalar FMAs + ~512 Tensor Cores running in parallel = capable of ~80+ TFLOPS FP32 and ~2500+ TOPS INT8.
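Where the ~80 TFLOPS figure comes from (a back-of-envelope check; the ~2.5 GHz boost clock is our assumption, not stated above):

128 SMs × 128 FMA units/SM × 2 FLOPs/FMA × 2.5 GHz ≈ 82 TFLOPS FP32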
9.3 MAC/FMA RTL Design — Hardware Implementation
9.3.1 FP32 FMA Microarchitecture
FP32 Fused Multiply-Add Pipeline (5-stage)
═══════════════════════════════════════════
Stage 1: DECODE & UNPACK
┌──────────────────────────────────────────────┐
│ Extract from IEEE 754:                       │
│   A = {sign_a, exp_a[7:0], mant_a[22:0]}     │
│   B = {sign_b, exp_b[7:0], mant_b[22:0]}     │
│   C = {sign_c, exp_c[7:0], mant_c[22:0]}     │
│ Add implicit leading 1: mant = {1, frac}     │
│ Handle special cases: NaN, Inf, Zero, Denorm │
└──────────────────────────────────────┬───────┘
                                       │
Stage 2: MULTIPLY                      ▼
┌──────────────────────────────────────────────┐
│ product_mant = mant_a × mant_b               │
│   (24-bit × 24-bit = 48-bit result)          │
│ product_exp  = exp_a + exp_b - 127 (bias)    │
│ product_sign = sign_a XOR sign_b             │
│                                              │
│ Hardware: Booth-encoded Wallace tree         │
│ multiplier for speed                         │
└──────────────────────────────────────┬───────┘
                                       │
Stage 3: ALIGN                         ▼
┌──────────────────────────────────────────────┐
│ exp_diff = product_exp - exp_c               │
│ Shift C's mantissa to align with product:    │
│   aligned_c = mant_c >> exp_diff             │
│   (a negative exp_diff means C dominates,    │
│    so the product is shifted instead)        │
│                                              │
│ Key: must use a WIDE internal datapath       │
│ (74+ bits) to avoid precision loss           │
└──────────────────────────────────────┬───────┘
                                       │
Stage 4: ADD / SUBTRACT                ▼
┌──────────────────────────────────────────────┐
│ If same sign: sum = product + aligned_c      │
│ If diff sign: sum = product - aligned_c      │
│                                              │
│ Handle the massive-cancellation case:        │
│ when A×B ≈ -C, the result can lose many      │
│ leading bits → need a Leading Zero           │
│ Anticipator (LZA)                            │
└──────────────────────────────────────┬───────┘
                                       │
Stage 5: NORMALIZE & ROUND             ▼
┌──────────────────────────────────────────────┐
│ 1. Count leading zeros (from LZA)            │
│ 2. Left-shift to normalize: 1.xxxxx          │
│ 3. Adjust exponent accordingly               │
│ 4. Round to 23-bit mantissa (IEEE 754)       │
│    → round-to-nearest-even (banker's)        │
│ 5. Pack: {sign, exp[7:0], frac[22:0]}        │
│ 6. Handle overflow → Inf, underflow → Zero   │
└──────────────────────────────────────────────┘
9.3.2 RTL Code — Simplified FMA Pipeline
module fma_fp32 #(
  parameter int MANT_W = 24,            // significand width incl. hidden bit
  parameter int EXP_W  = 8,
  parameter int PROD_W = 2*MANT_W       // 48-bit raw product
)(
  input  logic        clk, rst_n, valid_in,
  input  logic [31:0] op_a, op_b, op_c,
  output logic        valid_out,
  output logic [31:0] result
);
  // ---- Stage 1: multiply mantissas, add exponents, resolve sign ----
  logic [PROD_W-1:0] product_s1;
  logic [EXP_W:0]    exp_prod_s1;       // one extra bit catches overflow
  logic              sign_prod_s1;
  logic [31:0]       op_c_s1;           // C must travel with its operation
  logic              valid_s1, valid_s2;

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) valid_s1 <= 1'b0;
    else begin
      product_s1   <= {1'b1, op_a[22:0]} * {1'b1, op_b[22:0]};
      exp_prod_s1  <= {1'b0, op_a[30:23]} + {1'b0, op_b[30:23]} - 9'd127;
      sign_prod_s1 <= op_a[31] ^ op_b[31];
      op_c_s1      <= op_c;
      valid_s1     <= valid_in;
    end
  end

  // ---- Stage 2: align C to the product and add ----
  logic [PROD_W+1:0] sum_s2;
  logic [EXP_W:0]    exp_s2;

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) valid_s2 <= 1'b0;
    else begin
      sum_s2   <= product_s1 + align(exp_prod_s1, op_c_s1);
      exp_s2   <= exp_prod_s1;
      valid_s2 <= valid_s1;
    end
  end

  // ---- Stage 3: normalize, round, pack ----
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) valid_out <= 1'b0;
    else begin
      result    <= normalize_and_round(sum_s2, exp_s2);
      valid_out <= valid_s2;
    end
  end

  // The align() and normalize_and_round() bodies implement Stages 3 and 5
  // of Section 9.3.1 (shift C by the exponent difference; LZA, left-shift,
  // round-to-nearest-even, pack). They are omitted here for brevity, as are
  // the special-case paths for NaN/Inf/zero/denormals.
endmodule
9.3.3 Systolic Array — How Tensor Cores Use MACs
Systolic Array (4×4) — Data Flow for Matrix Multiply
═════════════════════════════════════════════════════
B elements flow DOWN ↓          A elements flow RIGHT →

           b0    b1    b2    b3
           ↓     ↓     ↓     ↓
a0 →    ┌─────┬─────┬─────┬─────┐
        │ MAC │ MAC │ MAC │ MAC │ → partial sums
        │ d+= │ d+= │ d+= │ d+= │
a1 →    ├─────┼─────┼─────┼─────┤
        │ MAC │ MAC │ MAC │ MAC │
        │a×b+c│a×b+c│a×b+c│a×b+c│
a2 →    ├─────┼─────┼─────┼─────┤
        │ MAC │ MAC │ MAC │ MAC │
        │     │     │     │     │
a3 →    ├─────┼─────┼─────┼─────┤
        │ MAC │ MAC │ MAC │ MAC │
        │     │     │     │     │
        └─────┴─────┴─────┴─────┘
           ↓     ↓     ↓     ↓
        Result matrix D accumulates over K cycles

Each cell:
┌──────────────────────────┐
│ accum += a_in × b_in     │  ← one MAC per cycle
│ pass a_in → right        │
│ pass b_in → down         │
└──────────────────────────┘

4×4 array doing K=4 accumulation:
  = 4 × 4 × 4 = 64 MACs in 4 cycles
  = 16 MACs per cycle throughput
RTL Design Challenge: The systolic array requires precise data staging — A elements must arrive one cycle apart horizontally, B elements one cycle apart vertically. Skew registers at the array edges handle this timing. The accumulator inside each cell must handle mixed-precision (e.g., FP16 multiply → FP32 accumulate) without overflow.
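A minimal sketch of one such processing element (PE) follows, using INT8 operands with a 32-bit accumulator to keep floating-point details out of the way; an FP16-multiply/FP32-accumulate cell has the same dataflow. The module name and ports are illustrative, not taken from any shipping design:

module systolic_pe (
  input  logic               clk, rst_n,
  input  logic               clear,    // start of a new dot product
  input  logic signed [7:0]  a_in,     // A operand, flows right
  input  logic signed [7:0]  b_in,     // B operand, flows down
  output logic signed [7:0]  a_out,
  output logic signed [7:0]  b_out,
  output logic signed [31:0] accum     // D element, stays in place
);
  logic signed [31:0] prod;
  assign prod = a_in * b_in;           // sign-extended to 32 bits by context

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      a_out <= '0;
      b_out <= '0;
      accum <= '0;
    end else begin
      a_out <= a_in;                   // operands march through the array
      b_out <= b_in;
      accum <= (clear ? 32'sd0 : accum) + prod;  // results stay put
    end
  end
endmodule

Note the defining property of the systolic style: operands are registered and handed to a neighbor every cycle, while each partial result never moves.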
9.3.4 MAC in the Multiplier — Booth Encoding
Booth-Encoded Wallace Tree Multiplier
(Used inside every MAC unit)
══════════════════════════════════════
24-bit × 24-bit mantissa multiplication:
Radix-4 Booth Encoding:
┌──────────────────────────────────────────────┐
│ Reduces 24 partial products → 12             │
│ Each overlapping 3-bit window of the         │
│ multiplier B, {B[i+1], B[i], B[i-1]}         │
│ (stepping i by 2), selects one partial       │
│ product:                                     │
│                                              │
│  Window      Action    Partial Product       │
│  ─────────   ──────    ───────────────       │
│  000, 111      0         0                   │
│  001, 010     +1×A       A                   │
│  011          +2×A       A << 1              │
│  100          -2×A      -(A << 1)            │
│  101, 110     -1×A      -A                   │
└──────────────────────────────────────────────┘
Wallace Tree Compression:
┌─────────────────────────────────────────┐
│ 12 partial products                     │
│  → Layer 1: 3:2 compressors → 8 rows    │
│  → Layer 2: 3:2 compressors → 6 rows    │
│  → Layer 3: 3:2 compressors → 4 rows    │
│  → Layer 4: 3:2 compressors → 3 rows    │
│  → Layer 5: 3:2 compressors → 2 rows    │
│  → Final:   CPA (carry-propagate add)   │
│                                         │
│ Total delay: ~5 compressor levels       │
│ + 1 CPA = very fast parallel multiply   │
└─────────────────────────────────────────┘
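In RTL, the Booth table above reduces to a small selector per window. A hedged sketch (the function name booth_pp is ours; 26 bits holds ±2×A for a 24-bit multiplicand, and the Wallace-tree wiring and final sign-extension tricks are not shown):

function automatic logic signed [25:0] booth_pp (
  input logic [23:0] a,       // multiplicand (mantissa)
  input logic [2:0]  w        // window {b[i+1], b[i], b[i-1]}
);
  unique case (w)
    3'b000, 3'b111: booth_pp = 26'sd0;
    3'b001, 3'b010: booth_pp =  signed'({2'b00, a});       // +A
    3'b011:         booth_pp =  signed'({1'b0, a, 1'b0});  // +2A
    3'b100:         booth_pp = -signed'({1'b0, a, 1'b0});  // -2A
    3'b101, 3'b110: booth_pp = -signed'({2'b00, a});       // -A
  endcase
endfunction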
9.4 Precision Formats — Which MAC for Which Task
Number Formats Used in GPU MACs
════════════════════════════════
FP32 (Single Precision):
┌──┬──────────┬─────────────────────────┐
│S │ Exponent │        Mantissa         │  32 bits total
│1 │  8 bits  │        23 bits          │  Range: ±3.4×10³⁸
└──┴──────────┴─────────────────────────┘  Precision: ~7 decimal digits

FP16 (Half Precision):
┌──┬────────┬────────────┐
│S │  Exp   │  Mantissa  │  16 bits total
│1 │ 5 bits │  10 bits   │  Range: ±65504
└──┴────────┴────────────┘  Precision: ~3 decimal digits

BF16 (Brain Float):
┌──┬──────────┬────────┐
│S │ Exponent │  Mant  │  16 bits total
│1 │  8 bits  │ 7 bits │  Same range as FP32!
└──┴──────────┴────────┘  Precision: ~2 decimal digits

TF32 (Tensor Float):
┌──┬──────────┬────────────┐
│S │ Exponent │  Mantissa  │  19 bits (internal only)
│1 │  8 bits  │  10 bits   │  Range of FP32, precision of FP16
└──┴──────────┴────────────┘  Used only inside Tensor Cores

FP8 (E4M3 / E5M2):
┌──┬───────┬───────┐
│S │  Exp  │ Mant  │  8 bits total
│1 │  4/5  │  3/2  │  Very low precision, very fast
└──┴───────┴───────┘  Used for inference

INT8:
┌────────────┐
│ 8-bit int  │  Range: -128 to +127
└────────────┘  No exponent — fixed point
| Format | Bits | MACs per Cycle (per Tensor Core) | Use Case | Accuracy Trade-off |
|---|---|---|---|---|
| FP32 | 32 | 16 | Scientific computing, graphics shading | Highest — gold standard |
| TF32 | 19 | 64 | AI training (default) | Good — minimal loss vs FP32 |
| BF16 | 16 | 128 | AI training (mixed precision) | Good — same range as FP32 |
| FP16 | 16 | 128 | Inference, graphics | Moderate — limited range |
| FP8 | 8 | 256 | Inference, fast training | Low — needs careful scaling |
| INT8 | 8 | 256 | Inference deployment | Requires quantization-aware training |
| FP4 | 4 | 512 | Aggressive inference | Very low — experimental |
RTL Impact: Each precision format requires a different multiplier width, accumulator width, and rounding logic. Supporting multiple formats in one MAC unit requires multiplexed datapaths or reconfigurable multiplier arrays. The trend is toward flexible MAC units that handle FP32/TF32/BF16/FP16/FP8/INT8 in the same hardware, selectable per instruction.
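A hedged sketch of the kind of per-instruction format descriptor such a flexible MAC might decode before configuring its datapath widths (the enum and field names are ours, not from any real ISA; widths follow the bit-layout diagrams above, counting the hidden bit):

typedef enum logic [2:0] {
  FMT_FP32, FMT_TF32, FMT_BF16, FMT_FP16, FMT_FP8_E4M3, FMT_INT8
} mac_fmt_e;

typedef struct packed {
  logic [3:0] exp_w;    // exponent width (0 = fixed point)
  logic [4:0] mant_w;   // significand width incl. hidden bit
} fmt_cfg_t;

function automatic fmt_cfg_t fmt_decode (input mac_fmt_e fmt);
  unique case (fmt)
    FMT_FP32:     fmt_decode = '{exp_w: 8, mant_w: 24};
    FMT_TF32:     fmt_decode = '{exp_w: 8, mant_w: 11};
    FMT_BF16:     fmt_decode = '{exp_w: 8, mant_w: 8};
    FMT_FP16:     fmt_decode = '{exp_w: 5, mant_w: 11};
    FMT_FP8_E4M3: fmt_decode = '{exp_w: 4, mant_w: 4};
    FMT_INT8:     fmt_decode = '{exp_w: 0, mant_w: 8};
  endcase
endfunction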
Mixed-Precision Training — How Formats Combine:
Mixed-Precision Training Flow
══════════════════════════════
Forward Pass:
┌──────────┐     ┌──────────────┐     ┌─────────────┐
│ Weights  │     │ Tensor Core  │     │ Activations │
│  FP16    │────▶│ FP16 × FP16  │────▶│    FP16     │
│          │     │ + FP32 accum │     │             │
└──────────┘     └──────────────┘     └─────────────┘

Loss Calculation: FP32 (full precision)

Backward Pass:
┌──────────┐     ┌──────────────┐     ┌──────────┐
│Gradients │     │ Tensor Core  │     │  Weight  │
│  FP16    │────▶│ FP16 × FP16  │────▶│ Updates  │
│          │     │ + FP32 accum │     │   FP32   │
└──────────┘     └──────────────┘     └──────────┘
                                           │
                                           ▼
                                    Master weights
                                    stored in FP32
                                 (never lose precision)
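The "FP16 multiply, FP32 accumulate" step implies a widening conversion inside the MAC datapath. A minimal sketch of that widening (the helper name fp16_to_fp32 is ours; denormals flush to signed zero and Inf/NaN are not handled, so it is illustrative rather than IEEE 754 complete):

function automatic logic [31:0] fp16_to_fp32 (input logic [15:0] h);
  logic        s;
  logic [4:0]  e;
  logic [9:0]  m;
  {s, e, m} = h;                           // unpack sign/exponent/mantissa
  if (e == '0)
    fp16_to_fp32 = {s, 31'b0};             // zero/denormal → signed zero
  else
    fp16_to_fp32 = {s, 8'(e) + 8'd112, m, 13'b0};  // rebias 15 → 127
endfunction

The rebias constant is the difference of the exponent biases (127 − 15 = 112), and the 10-bit FP16 mantissa is left-justified into the 23-bit FP32 field; because every normal FP16 value is exactly representable in FP32, the widening itself introduces no rounding error.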