Basics of GPU — From Concept to Handoff to Hardening

A Comprehensive Guide to GPU Architecture, RTL Design, Verification & Physical Implementation

1. GPU Architecture Fundamentals

1.1 The Graphics Pipeline

Unlike CPUs (optimized for single-thread latency), GPUs are optimized for throughput — processing thousands of threads simultaneously. The classic graphics pipeline:

Application (CPU)
      │
      ▼
Vertex Shader ──▶ Primitive Assembly ──▶ Rasterization
                                              │
                                              ▼
Frame Buffer ◀──── Blend / ROP ◀──── Fragment Shader
      │
      ▼
   Display
Stage | What it Does | RTL Design Relevance
------|--------------|---------------------
Vertex Shader | Transforms 3D vertex positions, applies lighting per vertex | Programmable execution units, ALU design
Primitive Assembly | Groups vertices into triangles | Fixed-function logic, FIFO management
Rasterization | Converts triangles to pixel fragments | Edge equations, interpolation hardware
Fragment Shader | Computes per-pixel color, textures, effects | Texture units, ALU arrays, shared memory
ROP (Render Output) | Depth test, blending, anti-aliasing | Fixed-function, memory bandwidth critical
Key Insight for SoC Engineers: The vertex and fragment shaders run on the same unified shader cores in modern GPUs. The fixed-function stages (rasterizer, ROP) are separate hardware blocks. As a GPU RTL designer, you may work on either type.

1.2 Compute Architecture — SIMT Model

Modern GPUs use SIMT (Single Instruction, Multiple Threads) — a key concept to master:

SIMT Execution Model

Warp (32 threads):
  ┌ T0 ┬ T1 ┬ T2 ┬ T3 ┬ ... ┬ T31 ┐
    │    │    │    │          │
    ▼    ▼    ▼    ▼          ▼
  Same instruction (ADD R1, R2), different data per thread

Common terminology:
  • Warp — 32 threads (most common)
  • Wavefront — 64 threads (alternative)
  • EU Thread — 8-16 threads (varies by vendor)
Concept | Description | Why it Matters for RTL
--------|-------------|-----------------------
Thread | Single execution instance | Each thread has its own register state
Warp/Wavefront | Group of threads executing same instruction | Warp scheduler is a critical RTL block
Thread Block | Group of warps sharing resources | Shared memory, barrier synchronization HW
Grid | All thread blocks for a kernel | Dispatch unit, work distribution
Divergence | Threads in a warp take different paths (if/else) | Predication masks, re-convergence logic
Critical RTL Design Point: When threads in a warp diverge (if/else), the GPU must execute both paths and mask inactive threads. This is handled by predication hardware — a key area for GPU RTL designers.
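To make the predication mechanics concrete, here is a toy Python model of both-paths execution with lane masks. This illustrates the concept only — it is not any vendor's actual hardware, and it uses an 8-lane "warp" for brevity:

```python
# Toy model of SIMT divergence handling: when threads of a warp disagree
# on a branch, hardware runs BOTH paths, masking off inactive lanes,
# then re-converges.

def simt_branch(data, cond, then_fn, else_fn):
    """Execute an if/else over one warp using predication masks."""
    taken_mask = [cond(x) for x in data]        # per-lane predicate
    result = list(data)
    # Pass 1: execute the 'then' path, non-taken lanes masked off
    for lane, active in enumerate(taken_mask):
        if active:
            result[lane] = then_fn(data[lane])
    # Pass 2: execute the 'else' path, taken lanes masked off
    for lane, active in enumerate(taken_mask):
        if not active:
            result[lane] = else_fn(data[lane])
    # Re-convergence point: all lanes active again
    return result

warp = list(range(8))  # 8 lanes for brevity (real warps: 32)
out = simt_branch(warp, lambda x: x % 2 == 0,
                  lambda x: x * 10,    # 'then' path
                  lambda x: x + 100)   # 'else' path
print(out)  # [0, 101, 20, 103, 40, 105, 60, 107]
```

Note that the warp pays for both paths in time: a fully divergent warp takes roughly the sum of both path lengths, which is why divergence hurts throughput.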

1.3 GPU Memory Hierarchy

GPU Memory Hierarchy (fastest → slowest)

  Register File   (256 KB per SM)     per thread
        │
  Shared Memory   (64-228 KB/SM)      per thread block (software-managed scratchpad)
        │
  L1 Cache        (128-256 KB)        per SM
        │
  L2 Cache        (4-96 MB)           shared across GPU
        │
  GDDR6X / HBM    (8-80 GB)           off-chip DRAM (high bandwidth)
Memory Type | Latency (cycles) | Bandwidth | RTL Design Considerations
------------|------------------|-----------|---------------------------
Register File | 0-1 | Highest | Multi-ported SRAM, bank conflicts, allocation logic
Shared Memory | ~20-30 | Very High | Bank design (32 banks), conflict resolution, barrier sync
L1 Cache | ~30-50 | High | Tag arrays, replacement policy, coherence with shared mem
L2 Cache | ~200-400 | Medium | Slice architecture, NoC interface, coherence protocol
GDDR6X/HBM | ~400-800 | Up to 3+ TB/s (HBM3) | Memory controller, scheduling, refresh, ECC
SoC Engineer Crossover: GPU memory hierarchy is similar to CPU caches you already know, but with key differences: (1) Register files are massively larger (to hold thousands of thread contexts), (2) Shared memory is explicitly managed by software, (3) Bandwidth is prioritized over latency.
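The shared-memory banking rule in the table above can be made concrete with a toy conflict counter. This assumes the common 32-bank, 4-byte-word layout and deliberately ignores the same-word broadcast case that real hardware handles specially:

```python
# Toy shared-memory bank-conflict counter. With 32 banks and 4-byte
# words, bank = (addr // 4) % 32; a warp's access takes as many cycles
# as the most-loaded bank (conflict-free = 1 cycle).

def access_cycles(addrs, num_banks=32, word_bytes=4):
    loads = {}
    for a in addrs:
        b = (a // word_bytes) % num_banks
        loads[b] = loads.get(b, 0) + 1
    return max(loads.values())   # serialized over the hottest bank

unit_stride = [4 * t for t in range(32)]   # each lane hits its own bank
stride_2    = [8 * t for t in range(32)]   # lanes pair up on 16 banks
print(access_cycles(unit_stride))  # 1  (conflict-free)
print(access_cycles(stride_2))     # 2  (2-way conflict)
```

This is exactly the analysis the conflict-resolution RTL performs per access, and why unit-stride layouts matter so much in shader code.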


2. GPU Microarchitecture Deep Dive

2.1 Streaming Multiprocessor (SM) Architecture

The Streaming Multiprocessor (SM) / Compute Unit (CU) / Execution Unit (EU) is the fundamental building block:

Streaming Multiprocessor (SM)

  Warp Schedulers (×4):  Sched0 │ Sched1 │ Sched2 │ Sched3
                            │ issue instructions each cycle
                            ▼
  Execution Units:
    INT32 ALU ×2 │ FP32 ALU ×2 │ FP64 ALU │ Tensor Core │ SFU (special func: sin/cos)

  Register File (256 KB) │ Shared Memory (100 KB) │ L1 Cache (128 KB)

  Load/Store Units (memory requests) │ Texture Units (filtering, LOD)
Sub-block | Function | RTL Complexity
----------|----------|---------------
Warp Scheduler | Picks ready warps, issues instructions each cycle | Very High — scoreboarding, dependency tracking
INT32/FP32 ALUs | Core arithmetic — add, mul, fma, bitwise | Medium — pipelined datapaths
Tensor Cores | Matrix multiply-accumulate (AI/ML workloads) | Very High — systolic arrays, mixed precision
SFU | Transcendental functions — sin, cos, rsqrt | Medium — iterative approximation
Register File | Thread context storage | Very High — multi-banked, conflict-free access
Load/Store Units | Memory access — coalescing, address generation | Very High — coalescing logic, TLB
Texture Units | Texture sampling, bilinear/trilinear filtering | Medium — interpolation math, LOD calc

2.2 Warp Scheduling — The Heart of GPU Performance

The warp scheduler is one of the most critical RTL blocks in a GPU. It must:

  • Track dependency status of all active warps (scoreboard)
  • Select ready warps each cycle (priority/round-robin/age-based)
  • Handle divergence and re-convergence
  • Manage stalls (memory latency, execution hazards)
Warp Scheduling — Latency Hiding
═══════════════════════════════════

Cycle:    1   2   3   4   5   6   7   8   9   10
────────────────────────────────────────────────────────
Warp 0:  EXE EXE MEM --- --- --- --- --- --- MEM_DONE
Warp 1:  --- EXE EXE EXE MEM --- --- --- ---
Warp 2:  --- --- EXE EXE EXE EXE MEM --- ---
Warp 3:  --- --- --- EXE EXE EXE EXE EXE MEM
────────────────────────────────────────────────────────

While Warp 0 waits for memory, the scheduler runs Warps 1, 2, 3.
This is GPU LATENCY HIDING.
Key Design Decision: The scheduler's policy (Greedy-Then-Oldest, Loose Round-Robin, Two-Level) can swing performance by 15-30%, making it a critical area where micro-architecture innovation happens.
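The latency-hiding timeline above can be reproduced with a tiny issue-slot simulation. This is a toy single-issue model with made-up latencies, not a real scheduler — its only point is to show utilization rising as warps are added:

```python
# Toy single-issue scheduler: each instruction takes 1 issue slot, then
# the warp stalls mem_latency cycles before its next instruction.

def run(num_warps, insts=4, mem_latency=6):
    """Return issue-slot utilization (1.0 = no bubbles)."""
    ready = [0] * num_warps          # cycle each warp becomes issuable
    left = [insts] * num_warps       # instructions left per warp
    cycle = issued = 0
    while any(left):
        cand = [w for w in range(num_warps) if left[w] and ready[w] <= cycle]
        if cand:
            w = min(cand, key=lambda i: ready[i])   # oldest-ready first
            issued += 1
            left[w] -= 1
            ready[w] = cycle + 1 + mem_latency      # stall until mem returns
        cycle += 1                                   # bubble if nothing ready
    return issued / cycle

print(run(1))   # ~0.18 — one warp leaves the pipe mostly idle
print(run(7))   # 1.0  — seven warps hide the memory latency completely
```

With a 6-cycle stall, 7 resident warps are exactly enough to keep the issue slot full every cycle — the same occupancy arithmetic GPU architects do for real memory latencies.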

2.3 Register File Design

GPU register files are unlike anything in CPU design:

Parameter | CPU | GPU
----------|-----|----
Size per core | ~5-10 KB | ~256 KB per SM
Ports | ~10-20 read/write | Banked (32+ banks) to simulate many ports
Allocation | Rename/OoO | Static at kernel launch
Purpose | Speculative execution | Hold thousands of thread contexts
RTL Challenge: Designing a 256 KB register file with conflict-free access across 32 banks, serving 4 warp schedulers issuing to multiple execution units — while meeting timing at 1.5+ GHz — is one of the hardest GPU RTL design problems.


3. RTL Design for GPU IPs

3.1 RTL Coding Style for GPU Blocks

GPU RTL coding follows strict guidelines for synthesis and timing closure at advanced nodes:

// Example: simple pipelined FP32 FMA unit (Fused Multiply-Add)
// Key GPU operation: result = A * B + C
module fp32_fma_pipe (
    input  logic        clk,
    input  logic        rst_n,
    input  logic        valid_in,
    input  logic [31:0] operand_a,
    input  logic [31:0] operand_b,
    input  logic [31:0] operand_c,
    output logic        valid_out,
    output logic [31:0] result
);

    // Pipeline stage 1: multiply 24-bit mantissas (23 fraction bits
    // plus the implicit leading 1), giving a 48-bit product
    logic [47:0] mul_result_s1;
    logic [31:0] operand_c_s1;
    logic        valid_s1;

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            valid_s1 <= 1'b0;
        end else begin
            valid_s1      <= valid_in;
            mul_result_s1 <= {1'b1, operand_a[22:0]} * {1'b1, operand_b[22:0]};
            operand_c_s1  <= operand_c;
        end
    end

    // Pipeline stage 2: accumulate + normalize
    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            valid_out <= 1'b0;
        end else begin
            valid_out <= valid_s1;
            result    <= normalize(add(mul_result_s1, operand_c_s1));
        end
    end

endmodule

GPU RTL Coding Guidelines:

  • Always use always_ff for sequential logic — synthesizers optimize better
  • Pipeline aggressively — GPU targets high clock frequencies (1.5-2.5 GHz)
  • Avoid latches — use explicit flops for every storage element
  • Minimize combinational depth — target <15 logic levels per stage
  • Clock gating everywhere — GPUs burn 300-700W, power is critical
  • Parameterize designs — GPU blocks are instantiated many times

3.2 FSM Design for GPU Controllers

GPU control logic frequently uses FSMs. Example — a simplified warp dispatch controller:

Warp Dispatch FSM

  IDLE ───ready warps > 0───▶ DECODE
    ▲                            │ decoded
    │ all warps complete         ▼
  COMPLETE ◀──result ready─── EXECUTE ───memory request───▶ MEM_WAIT
RTL Best Practice: Use one-hot encoding for GPU FSMs at advanced nodes — it's faster (single bit check) and easier for synthesis tools to optimize, even though it uses more flops.
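The dispatch FSM above can be written down as an executable transition table — a quick way to sanity-check state reachability before committing to RTL. Note the MEM_WAIT → EXECUTE return edge is an assumption here (the diagram omits it); in RTL each state would be one bit of a one-hot vector:

```python
# Warp-dispatch FSM as a transition table (the MEM_WAIT -> EXECUTE
# return edge is assumed; the source diagram does not show it).

TRANSITIONS = {
    ("IDLE",     "ready_warps"):  "DECODE",
    ("DECODE",   "decoded"):      "EXECUTE",
    ("EXECUTE",  "mem_request"):  "MEM_WAIT",
    ("EXECUTE",  "result_ready"): "COMPLETE",
    ("MEM_WAIT", "mem_done"):     "EXECUTE",   # assumed return edge
    ("COMPLETE", "all_done"):     "IDLE",
}

def step(state, event):
    # Undefined (state, event) pairs hold state, like a gated register
    return TRANSITIONS.get((state, event), state)

state = "IDLE"
for ev in ["ready_warps", "decoded", "mem_request", "mem_done",
           "result_ready", "all_done"]:
    state = step(state, ev)
print(state)  # back to IDLE after a full dispatch round trip
```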

3.3 Pipeline Design Patterns

Pattern | Use Case in GPU | Key Design Consideration
--------|-----------------|--------------------------
Linear Pipeline | ALU execution stages | Balanced stage delays, forwarding paths
Elastic Pipeline | Memory subsystem | Valid/ready handshake, backpressure
Skid Buffer | Interface between blocks | Absorbs 1-cycle stalls without losing throughput
Credit-based Flow | NoC, L2 cache interface | Prevents overflow, decouples producer/consumer
FIFO Queues | Instruction buffer, memory queues | Async FIFOs for clock domain crossing
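The credit-based flow pattern from the table can be sketched behaviorally in a few lines — a toy model of the producer-side counter and receiver buffer, not an RTL interface:

```python
# Minimal credit-based flow-control sketch (NoC/L2-style): the producer
# holds one credit per receiver buffer slot, decrements on send, and
# the receiver returns a credit when it frees a slot.

class CreditLink:
    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # producer-side credit counter
        self.fifo = []                # receiver buffer

    def try_send(self, pkt):
        if self.credits == 0:
            return False              # backpressure: overflow impossible
        self.credits -= 1
        self.fifo.append(pkt)
        return True

    def consume(self):
        pkt = self.fifo.pop(0)
        self.credits += 1             # credit return to producer
        return pkt

link = CreditLink(buffer_slots=2)
print(link.try_send("p0"), link.try_send("p1"), link.try_send("p2"))
# third send stalls until a credit comes back
link.consume()
print(link.try_send("p2"))
```

The key property — the receiver buffer can never overflow because the producer stalls itself — is exactly what makes credits attractive across long, pipelined NoC links where a valid/ready handshake would add round-trip latency.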


4. Power, Performance, Area (PPA) Optimization

4.1 Power Optimization Techniques

Technique | Savings | Where Used in GPU
----------|---------|-------------------
Clock Gating | 20-40% | Idle execution units, unused warps, inactive SMs
Power Gating | Up to 90% (leakage) | Entire SMs powered off when unused
DVFS | Variable | Dynamic voltage/frequency based on workload
Operand Isolation | 5-10% | Gate inputs to multipliers when not valid
Memory Banking | 10-20% | Only activate needed SRAM banks
Data Encoding | 5-15% | Bus invert coding on wide data buses
// Clock gating example for GPU ALU
logic alu_clk_en;
logic alu_gated_clk;

assign alu_clk_en = valid_in | flush | stall_recovery;

// Use ICG cell (Integrated Clock Gate) — synthesis replaces this
clk_gate_cell u_alu_cg (
    .clk_in  (clk),
    .enable  (alu_clk_en),
    .clk_out (alu_gated_clk)
);

4.2 Performance Optimization

  • Pipeline balancing — ensure each stage has similar combinational delay
  • Reduce stall cycles — forwarding/bypassing between dependent instructions
  • Memory coalescing — merge multiple thread memory accesses into one transaction
  • Occupancy optimization — maximize active warps per SM to hide latency
  • Instruction-Level Parallelism — dual-issue or quad-issue capabilities
Performance Metric: GPU performance is measured in FLOPS (Floating Point Operations Per Second). A modern GPU achieves 40-80 TFLOPS FP32. RTL designers must ensure the datapath can sustain peak throughput without pipeline bubbles.
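As a sanity check on those numbers, peak FP32 throughput is just units × rate arithmetic (one FMA = 2 FLOPs). The configuration below — 128 SMs with 128 FMA lanes at 2 GHz — is hypothetical, chosen only to land inside the quoted range:

```python
# Peak-throughput arithmetic: FLOPS = FMA units x 2 FLOPs x clock.
# The GPU configuration here is hypothetical, for illustration only.

def peak_tflops(num_sms, fma_units_per_sm, clock_ghz):
    # FLOPs/cycle x GHz gives GFLOPS; divide by 1e3 for TFLOPS
    return num_sms * fma_units_per_sm * 2 * clock_ghz / 1e3

print(peak_tflops(128, 128, 2.0))  # 65.536 — inside the 40-80 TFLOPS range
```

Sustaining that peak is the hard part: one pipeline bubble per N cycles caps achievable throughput at (N-1)/N of this number, which is why scheduler and forwarding design matter as much as ALU count.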

4.3 Area Optimization

  • Resource sharing — time-multiplex ALUs between warps
  • SRAM compilers — use foundry-optimized SRAM macros vs flip-flop arrays
  • Logic restructuring — share common sub-expressions across datapaths
  • Encoding efficiency — use compressed formats (FP16, BF16, INT8 for AI)

4.4 Timing Closure at Advanced Nodes

Node | Target Freq | Key Challenges
-----|-------------|----------------
7nm | 1.5-1.8 GHz | Wire delay dominates, multi-Vt optimization
5nm | 1.8-2.2 GHz | FinFET variability, increased leakage
4nm/3nm | 2.0-2.5 GHz | GAA transistors, extreme BEOL congestion
RTL for Timing: At 3nm, you must think about timing while writing RTL. Techniques: limit fan-out to <8, break long combinational chains, use retiming-friendly coding, insert pipeline stages at module boundaries.


5. Verification & Debug

5.1 UVM Verification for GPU Blocks

GPU verification shares UVM methodology with SoC, but with GPU-specific challenges:

GPU Verification Hierarchy
══════════════════════════

  Full Chip / SoC Level     — system tests, boot, OS
  GPU Cluster Level         — multi-SM tests, L2, NoC
  SM / Compute Unit Level   — shader programs, scheduling
  Sub-block Level           — ALU, register file, cache
  Unit Level                — individual pipeline stages
GPU Verification Challenge | Approach
---------------------------|----------
Massive parallelism (thousands of threads) | Constrained random with thread-aware coverage
Warp divergence/convergence | Directed tests + coverage on divergence patterns
Floating-point precision | Reference model comparison with tolerance
Memory consistency | Litmus tests, memory ordering checkers
Performance (cycle accuracy) | Performance counters in RTL, throughput assertions

5.2 Formal Equivalence Verification (FEV)

FEV ensures RTL changes don't break functionality — critical in GPU design where frequent optimizations happen:

Tool | Vendor | Use Case
-----|--------|----------
Formality | Synopsys | RTL-to-gate equivalence after synthesis
Conformal LEC | Cadence | Logic equivalence checking
JasperGold | Cadence | Property checking, formal proofs
VC Formal | Synopsys | Assertion-based formal verification
// SVA assertion example for a GPU pipeline:
// ensure every valid instruction produces a result within N cycles
property instr_completes;
    @(posedge clk) disable iff (!rst_n)
    (valid_in && !stall) |-> ##[1:MAX_LATENCY] valid_out;
endproperty

assert property (instr_completes)
    else $error("Instruction did not complete within %0d cycles", MAX_LATENCY);

5.3 GPU-Specific Debug Techniques

  • Waveform analysis — trace warp scheduler decisions, identify stall causes
  • Performance counters — embed HW counters for IPC, cache hit rate, occupancy
  • Shader ISA simulation — run actual shader programs on RTL, compare with golden model
  • Memory trace analysis — verify coalescing, bank conflict detection
  • Power estimation — use switching activity from simulation for power analysis
  • Protocol checkers — AXI/ACE/CHI protocol monitors on interfaces


6. TCL Scripting for Design Automation

6.1 Synthesis Automation

# TCL script for synthesis — GPU block

# Read design files
set search_path    [list ./rtl ./lib ./constraints]
set target_library {foundry_3nm_ss_0p72v_125c.db}
set link_library   "* $target_library"

# Analyze and elaborate
analyze -format sverilog [glob ./rtl/*.sv]
elaborate gpu_sm_top

# Apply constraints
source ./constraints/gpu_sm.sdc
set_clock_uncertainty 0.05 [get_clocks gpu_clk]
set_max_fanout 8 [current_design]

# Clock gating insertion
set_clock_gating_style -type integrated \
    -minimum_bitwidth 4 \
    -control_point before

# Compile with high effort
compile_ultra -gate_clock -timing_high_effort_script

# Reports
report_timing -max_paths 50         > rpt/timing.rpt
report_area -hierarchy              > rpt/area.rpt
report_power -analysis_effort high  > rpt/power.rpt
report_clock_gating                 > rpt/cg.rpt

# Write outputs
write -format verilog -hierarchy -output netlist/gpu_sm_top.v
write_sdc netlist/gpu_sm_top.sdc

6.2 Place & Route Automation

# TCL for Place & Route — GPU block

# Initialize design
read_verilog netlist/gpu_sm_top.v
read_sdc     netlist/gpu_sm_top.sdc
read_lef     {tech.lef macro.lef}

# Floorplan
floorPlan -r 0.7 0.8 5 5 5 5

# Place SRAM macros (register file, caches)
placeInstance u_regfile_sram 100 200 R0
placeInstance u_l1_cache     300 200 R0

# Power planning
addStripe -layer M8 -width 2 -spacing 2 -set_to_set_distance 40 \
    -nets {VDD VSS}

# Placement and optimization
place_opt_design
clock_opt_design
route_opt_design

# Timing signoff
report_timing -max_paths 100 -early -late

6.3 Lint, CDC, and RDC

# SpyGlass lint and CDC setup for GPU block

# Lint run
set_option enableSV yes
read_file -type sourcelist rtl_files.f
current_goal lint/lint_rtl
run_goal

# CDC (Clock Domain Crossing) analysis
# Critical for GPU: core clock, memory clock, PCIe clock
current_goal cdc/cdc_verify
set_parameter crossingcheck_strictsync yes
run_goal

# Report CDC violations
report_crossings -format csv > cdc_crossings.csv


7. SoC Integration of GPU IP

7.1 Bus Protocols for GPU-SoC Interface

GPU IP Integration in SoC
═════════════════════════

  CPU Cluster ◀──AXI/ACE (coherent)──▶ Interconnect / NoC
                                              │
               ┌──────────────────────────────┼──────────────────┐
               ▼                              ▼                  ▼
          GPU IP                        Memory Ctrl        Display Engine
          (AXI-M, AXI-S, IRQ, DMA)      (DDR/HBM)
Protocol | Use in GPU Integration | Key Features
---------|------------------------|--------------
AXI4 | Non-coherent memory access | Burst, outstanding transactions, ID-based ordering
ACE/ACE-Lite | Cache-coherent access with CPU | Snoop channels, shared/exclusive states
CHI | Next-gen coherent interconnect | Packet-based, scalable, request/response/data/snoop
AXI-Stream | Display pipeline, video output | Unidirectional, no address, continuous flow
APB | GPU configuration registers | Simple, low-bandwidth control interface
SoC Crossover: Understanding AXI/ACE protocols, clock domain crossings at SoC level, and integration debugging is directly transferable to GPU integration work.

7.2 Clock & Power Domains in GPU SoCs

GPU SoC Clock & Power Architecture
════════════════════════════════════

  Power Domain GPU_VDD:  SM 0 … SM N on gpu_clk (2 GHz), each SM individually power-gatable
  Power Domain MEM_VDD:  L2 Cache and Mem Ctrl on mem_clk (1 GHz)
  Power Domain IO_VDD:   PCIe on pcie_clk (250 MHz), Display on disp_clk (pixel clock)

  CDC crossings needed at every domain boundary!

7.3 Customer Collaboration — GPU as IP

When a GPU is delivered as IP to SoC customers (your outsourcing experience is valuable here), the package typically includes:

Deliverable | Description
------------|------------
RTL Package | Encrypted/obfuscated RTL, integration wrapper, config parameters
Integration Guide | Clock/reset requirements, pin descriptions, connectivity rules
Verification IP | UVM agents, reference tests, coverage models for customer verification
Timing Constraints | SDC files, interface timing budgets, multicycle paths
Power Intent | UPF/CPF files, isolation/retention requirements
Programming Guide | Register map, driver interface, initialization sequence
Note: Experience packaging IP for external customers, writing SOWs, and managing outsourced deliverables is a valuable complement to GPU RTL design skills.


8. Modern GPU Trends (2024-2026)

8.1 Chiplet Architecture & NoC

Modern GPU Chiplet Architecture — Package (CoWoS)
════════════════════════════════

  Compute Chiplet (5nm, 32 SMs) ─┐
  Compute Chiplet (5nm, 32 SMs) ─┼── Die-to-Die Interconnect ── IO / Cache Die (6nm)
  Compute Chiplet (5nm, 32 SMs) ─┘                              (L2 Cache + NoC,
                                                                 HBM Controllers,
                                                                 PCIe Gen5)
  HBM3 Stacks (×5) attach to the IO/cache die
NoC Relevance: Chiplet GPUs need sophisticated NoC designs for die-to-die communication, cache coherence across chiplets, and bandwidth management — a direct application of Network-on-Chip expertise.

8.2 AI/ML Acceleration — Tensor Cores

Generation | Operations | Precision | TOPS
-----------|------------|-----------|------
Tensor Core v1 (1st Gen) | 4x4 matrix FMA | FP16 | ~125
Tensor Core v2 (2nd Gen) | Sparse + dense | TF32, BF16, INT8 | ~312
Tensor Core v3 (3rd Gen) | Transformer Engine | FP8, FP16, BF16 | ~1000
Tensor Core v4 (4th Gen) | 2nd gen Transformer | FP4, FP8, FP16 | ~2500
Tensor Core Operation (Simplified)
════════════════════════════════════

  Matrix A (4x4) × Matrix B (4x4) + Matrix C (4x4) = Matrix D (4x4)
   FP16/BF16        FP16/BF16        FP32              FP32

  Done in ONE cycle per Tensor Core!
  → This is how GPUs achieve massive AI throughput
RTL Design Impact: Tensor cores are essentially systolic arrays — a grid of multiply-accumulate units with data flowing through them. Designing these requires understanding of dataflow architecture, mixed-precision arithmetic, and efficient SRAM access patterns.

8.3 Ray Tracing Hardware

Modern GPUs include dedicated RT (Ray Tracing) cores:

Component | Function | RTL Design
----------|----------|-----------
BVH Traversal Unit | Traverse acceleration structure (tree) | Tree walker FSM, stack management
Ray-Box Intersection | Test ray against bounding boxes | FP comparators, slab method
Ray-Triangle Intersection | Test ray against triangle geometry | Möller-Trumbore algorithm in hardware
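The Möller-Trumbore test the table cites is compact enough to show in full. Here is a plain-Python sketch of the arithmetic the RT-core datapath computes (pure floating point, no BVH traversal around it):

```python
# Moller-Trumbore ray-triangle intersection: the cross/dot products
# below are the MAC work a ray-triangle test burns in hardware.

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])

def dot(a, b):
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def sub(a, b):
    return (a[0]-b[0], a[1]-b[1], a[2]-b[2])

def intersect(orig, dirn, v0, v1, v2, eps=1e-9):
    """Return hit distance t along the ray, or None on a miss."""
    e1, e2 = sub(v1, v0), sub(v2, v0)     # triangle edge vectors
    p = cross(dirn, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None                        # ray parallel to triangle
    inv = 1.0 / det
    t_vec = sub(orig, v0)
    u = dot(t_vec, p) * inv                # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(t_vec, e1)
    v = dot(dirn, q) * inv                 # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    return dot(e2, q) * inv                # hit: distance t

# Ray down +z through the unit triangle in the z = 1 plane:
t = intersect((0.2, 0.2, 0.0), (0.0, 0.0, 1.0),
              (0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0))
print(t)  # 1.0
```

Counting the two cross products and four dot products (plus the reciprocal and compares) is where the "18+ MACs per ray-triangle test" figure in Section 9 comes from.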


9. Multiply-Accumulate (MAC) — In Depth

9.1 MAC Fundamentals

The Multiply-Accumulate (MAC) operation is the single most important arithmetic operation in GPU computing. Every pixel rendered, every neural network inference, and every physics simulation ultimately reduces to MAC operations: D = A × B + C

MAC Operation — Hardware View
══════════════════════════════

  A ──┐
      ├──▶ Multiplier (A × B) ──▶ Adder (Prod + C) ──▶ D (Result)
  B ──┘                              ▲
  C ─────────────────────────────────┘

  Single MAC: 1 multiply + 1 add = 2 FLOPs

  FMA (Fused Multiply-Add): the multiplier and adder share a SINGLE
  rounding step → more accurate + saves 1 cycle.
  IEEE 754-2008 mandates FMA as a single op.

MAC vs FMA — Critical Distinction:

Operation | Formula | Rounding | Accuracy | Hardware Cost
----------|---------|----------|----------|---------------
MAC (unfused) | D = round(round(A×B) + C) | Two rounding steps | Lower — error accumulates | Separate multiplier + adder
FMA (fused) | D = round(A×B + C) | One rounding step | Higher — single rounding | Fused unit, wider internal datapath
Why FMA Matters: In deep learning, millions of MAC operations chain together. With unfused MAC, rounding errors accumulate across layers and degrade model accuracy. FMA's single rounding preserves precision. Modern GPUs exclusively use FMA units.
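The single-versus-double rounding difference is easy to reproduce numerically. The sketch below emulates a toy 8-bit-significand float with exact `Fraction` arithmetic — an illustration of the effect, not an IEEE-accurate model:

```python
# Demonstration: with a toy 8-bit-significand format, fusing the
# multiply-add into one rounding step preserves bits that the unfused
# version throws away, especially when A*B nearly cancels against C.

from fractions import Fraction
import math

def rnd(x, sig_bits=8):
    """Round x to sig_bits significand bits (round-to-nearest)."""
    if x == 0:
        return Fraction(0)
    e = math.floor(math.log2(abs(x))) - (sig_bits - 1)
    scale = Fraction(2) ** e
    return Fraction(round(Fraction(x) / scale)) * scale

a, b = Fraction(3, 256) + 1, Fraction(5, 256) + 1   # near-1 operands
c = -1                                               # cancels most of a*b

exact   = a * b + c
unfused = rnd(rnd(a * b) + c)    # two roundings (MAC)
fused   = rnd(a * b + c)         # one rounding  (FMA)

print(abs(unfused - exact), abs(fused - exact))  # unfused error is larger
```

In this example the unfused error is 15× the fused error; chained across millions of MACs in a deep network, that gap is exactly why training stacks insist on FMA.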

9.2 Where MAC Lives in GPU Architecture

MAC/FMA units are embedded at every level of the GPU compute hierarchy:

MAC Usage Across GPU Architecture (per Streaming Multiprocessor)
═══════════════════════════════════

  • Shader / CUDA cores: 32 FP32 FMA units per SM — 32 FMAs per cycle,
    each processing one thread of a warp
  • Tensor Cores (AI/ML): a 4×4 systolic array of MACs = 64 MACs per
    cycle per Tensor Core (4×4 × 4 depth), × 4 Tensor Cores per SM
  • Texture units (graphics): bilinear filtering = 4 MACs per texel,
    trilinear = 8 MACs per texel
  • Ray tracing cores: ray-triangle intersection = 18+ MACs per
    ray-triangle test
GPU Block | MAC Type | Precision | MACs per Cycle (per SM) | Use Case
----------|----------|-----------|--------------------------|----------
Shader Cores | Scalar FMA | FP32, FP16 | 32-128 | General compute, shading
Tensor Cores | Matrix FMA (systolic) | FP8, BF16, FP16, TF32, INT8 | 256-1024 | AI training & inference
Texture Units | Fixed-point MAC | FP16/fixed | 16-32 | Texture filtering
RT Cores | FP MAC | FP32 | Varies | Ray intersection math
SFU | Iterative MAC | FP32 | 8 | sin, cos, rsqrt via polynomial
Scale Perspective: A modern high-end GPU has ~128 SMs. Each SM has ~128 FP32 FMA units + 4 Tensor Cores. Total: ~16,384 scalar FMAs + ~512 Tensor Cores running in parallel = capable of ~80+ TFLOPS FP32 and ~2500+ TOPS INT8.

9.3 MAC/FMA RTL Design — Hardware Implementation

9.3.1 FP32 FMA Microarchitecture

FP32 Fused Multiply-Add Pipeline (5 stages)
═══════════════════════════════════════════

Stage 1 — DECODE & UNPACK
  Extract the IEEE 754 fields:
    A = {sign_a, exp_a[7:0], mant_a[22:0]}  (same for B and C)
  Add the implicit leading 1: mant = {1, frac}.
  Handle special cases: NaN, Inf, Zero, Denormals.

Stage 2 — MULTIPLY
  product_mant = mant_a × mant_b   (24-bit × 24-bit = 48-bit result)
  product_exp  = exp_a + exp_b - 127 (bias)
  product_sign = sign_a XOR sign_b
  Hardware: Booth-encoded Wallace tree multiplier for speed.

Stage 3 — ALIGN
  exp_diff = product_exp - exp_c
  Shift C's mantissa to align with the product: aligned_c = mant_c >> exp_diff
  Key: use a WIDE internal datapath (74+ bits) to avoid precision loss.

Stage 4 — ADD / SUBTRACT
  Same sign:      sum = product + aligned_c
  Different sign: sum = product - aligned_c
  Handle the massive-cancellation case: when A×B ≈ -C the result can
  lose many bits → a Leading Zero Anticipator (LZA) is needed.

Stage 5 — NORMALIZE & ROUND
  1. Count leading zeros (from the LZA)
  2. Left-shift to normalize: 1.xxxxx
  3. Adjust the exponent accordingly
  4. Round to a 23-bit mantissa (IEEE 754 round-to-nearest-even)
  5. Pack: {sign, exp[7:0], frac[22:0]}
  6. Handle overflow → Inf, underflow → Zero

9.3.2 RTL Code — Simplified FMA Pipeline

// Simplified 3-stage FP32 FMA (conceptual — production uses 5+ stages)
module fma_fp32 #(
    parameter MANT_W = 24,              // 23 fraction bits + 1 implicit
    parameter EXP_W  = 8,
    parameter PROD_W = 2*MANT_W         // 48-bit product
)(
    input  logic        clk, rst_n, valid_in,
    input  logic [31:0] op_a, op_b, op_c,
    output logic        valid_out,
    output logic [31:0] result
);

    // Valid pipeline tracks data through all three stages
    logic valid_s1, valid_s2;
    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            {valid_s1, valid_s2, valid_out} <= '0;
        end else begin
            valid_s1  <= valid_in;
            valid_s2  <= valid_s1;
            valid_out <= valid_s2;
        end
    end

    // Stage 1: multiply mantissas, compute exponent
    logic [PROD_W-1:0] product_s1;
    logic [EXP_W:0]    exp_prod_s1;
    logic              sign_prod_s1;
    logic [31:0]       op_c_s1;          // carry C alongside the product

    always_ff @(posedge clk) begin
        product_s1   <= {1'b1, op_a[22:0]} * {1'b1, op_b[22:0]};
        exp_prod_s1  <= op_a[30:23] + op_b[30:23] - 8'd127;
        sign_prod_s1 <= op_a[31] ^ op_b[31];
        op_c_s1      <= op_c;
    end

    // Stage 2: align addend C, perform addition
    logic [PROD_W+1:0] sum_s2;
    logic [EXP_W:0]    exp_s2;

    always_ff @(posedge clk) begin
        // Align and add (simplified — a real design handles sign,
        // shift amount, and a wide internal datapath)
        sum_s2 <= product_s1 + align(exp_prod_s1, op_c_s1);
        exp_s2 <= exp_prod_s1;
    end

    // Stage 3: normalize and round
    always_ff @(posedge clk) begin
        result <= normalize_and_round(sum_s2, exp_s2);
    end

endmodule

9.3.3 Systolic Array — How Tensor Cores Use MACs

Systolic Array (4×4) — Data Flow for Matrix Multiply
═════════════════════════════════════════════════════

  B elements flow DOWN and A elements flow RIGHT through a 4×4 grid of
  MAC cells; partial sums stay in place and accumulate over K cycles.

  Each cell, every cycle:
    accum += a_in × b_in     ← one MAC per cycle
    pass a_in → right
    pass b_in → down

  4×4 array doing K=4 accumulation:
    4 × 4 × 4 = 64 MACs in 4 cycles = 16 MACs per cycle throughput
RTL Design Challenge: The systolic array requires precise data staging — A elements must arrive one cycle apart horizontally, B elements one cycle apart vertically. Skew registers at the array edges handle this timing. The accumulator inside each cell must handle mixed-precision (e.g., FP16 multiply → FP32 accumulate) without overflow.
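The skewed schedule described above can be modeled in a few lines. Rather than shuffling per-cell registers, this sketch uses the timing identity the skew registers create: cell (i, j) receives operand pair (a[i][k], b[k][j]) at cycle t = i + j + k:

```python
# Cycle-by-cycle toy of the output-stationary systolic schedule:
# at cycle t, cell (i, j) fires one MAC on the pair with k = t - i - j
# (edge skew registers create exactly this arrival timing).

def systolic_matmul(A, B):
    n = len(A)
    acc = [[0.0] * n for _ in range(n)]          # per-cell accumulator
    for t in range(3 * n - 2):                   # fill + steady + drain
        for i in range(n):
            for j in range(n):
                k = t - i - j                    # which pair arrives now
                if 0 <= k < n:
                    acc[i][j] += A[i][k] * B[k][j]   # one MAC this cycle
    return acc
```

A 4×4 multiply drains in 3n-2 = 10 cycles here: the corner cell (3, 3) gets its last operand pair at cycle t = 3+3+3 = 9, which is the pipeline-fill overhead an RTL implementation must also absorb.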

9.3.4 MAC in the Multiplier — Booth Encoding

Booth-Encoded Wallace Tree Multiplier (used inside every MAC unit)
══════════════════════════════════════

24-bit × 24-bit mantissa multiplication:

Radix-4 Booth encoding reduces 24 partial products to 12. Each 2 bits of
multiplier B (together with one overlap bit from the group below; the
table shows the overlap-bit = 0 case) encode one signed digit:

  B[1:0]   Action   Partial Product
  ──────   ──────   ───────────────
    00      +0       0
    01      +1×A     A
    10      -2×A     -(A << 1)
    11      -1×A     -A

Wallace tree compression:

  12 partial products
  → Layer 1: 3:2 compressors → 8 rows
  → Layer 2: 3:2 compressors → 6 rows
  → Layer 3: 3:2 compressors → 4 rows
  → Layer 4: 3:2 compressors → 3 rows
  → Layer 5: 3:2 compressors → 2 rows
  → Final:   CPA (carry-propagate add)

  Total delay: ~5 compressor levels + 1 CPA = very fast parallel multiply

9.4 Precision Formats — Which MAC for Which Task

Number Formats Used in GPU MACs ════════════════════════════════ FP32 (Single Precision): ┌──┬──────────┬───────────────────────┐ │S │ Exponent │ Mantissa │ 32 bits total │1 │ 8 bits │ 23 bits │ Range: ±3.4×10³⁸ └──┴──────────┴───────────────────────┘ Precision: ~7 decimal digits FP16 (Half Precision): ┌──┬───────┬──────────┐ │S │ Exp │ Mantissa │ 16 bits total │1 │ 5 bit │ 10 bits │ Range: ±65504 └──┴───────┴──────────┘ Precision: ~3 decimal digits BF16 (Brain Float): ┌──┬──────────┬───────┐ │S │ Exponent │ Mant │ 16 bits total │1 │ 8 bits │ 7 bit │ Same range as FP32! └──┴──────────┴───────┘ Precision: ~2 decimal digits TF32 (Tensor Float): ┌──┬──────────┬──────────┐ │S │ Exponent │ Mantissa │ 19 bits (internal only) │1 │ 8 bits │ 10 bits │ Range of FP32, precision of FP16 └──┴──────────┴──────────┘ Used only inside Tensor Cores FP8 (E4M3 / E5M2): ┌──┬──────┬─────┐ │S │ Exp │Mant │ 8 bits total │1 │ 4/5 │ 3/2 │ Very low precision, very fast └──┴──────┴─────┘ Used for inference INT8: ┌────────────┐ │ 8-bit int │ Range: -128 to +127 └────────────┘ No exponent — fixed point
| Format | Bits | MAC/cycle (per Tensor Core) | Use Case | Accuracy Trade-off |
|---|---|---|---|---|
| FP32 | 32 | 16 | Scientific computing, graphics shading | Highest — gold standard |
| TF32 | 19 | 64 | AI training (default) | Good — minimal loss vs FP32 |
| BF16 | 16 | 128 | AI training (mixed precision) | Good — same range as FP32 |
| FP16 | 16 | 128 | Inference, graphics | Moderate — limited range |
| FP8 | 8 | 256 | Inference, fast training | Low — needs careful scaling |
| INT8 | 8 | 256 | Inference deployment | Requires quantization-aware training |
| FP4 | 4 | 512 | Aggressive inference | Very low — experimental |
RTL Impact: Each precision format requires a different multiplier width, accumulator width, and rounding logic. Supporting multiple formats in one MAC unit requires multiplexed datapaths or reconfigurable multiplier arrays. The trend is toward flexible MAC units that handle FP32/TF32/BF16/FP16/FP8/INT8 in the same hardware, selectable per instruction.
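The range/precision trade-offs among these formats can be demonstrated with a few bit-level conversions. This is a sketch: `to_bf16` truncates rather than rounds (real MAC rounding logic varies by design), and Python's `struct` format code `'e'` provides IEEE half precision.

```python
import struct

def f32_bits(x):
    """Bit pattern of x as IEEE-754 single precision (1/8/23)."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def to_bf16(x):
    """BF16 is FP32 with the low 16 bits dropped (1/8/7): modeled by
    truncation here. Same 8-bit exponent, so same dynamic range as FP32."""
    return struct.unpack('<f', struct.pack('<I', f32_bits(x) & 0xFFFF0000))[0]

def to_fp16(x):
    """Round-trip through IEEE half precision (1/5/10)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

pi = 3.14159265
print(to_bf16(pi))    # 3.140625: only ~2-3 significant digits survive
print(to_fp16(pi))    # 3.140625: 10 mantissa bits, but range capped at 65504
print(to_bf16(1e38))  # close to 1e38: BF16 inherits FP32's exponent range,
                      # a value FP16 cannot represent at all
```

This is exactly why BF16 is preferred for training gradients (range matters more than precision) while FP16 needs loss scaling to stay inside its representable range.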

Mixed-Precision Training — How Formats Combine:

Mixed-Precision Training Flow ══════════════════════════════ Forward Pass: ┌──────────┐ ┌──────────────┐ ┌──────────┐ │ Weights │ │ Tensor Core │ │ Activations│ │ FP16 │────▶│ FP16 × FP16 │────▶│ FP16 │ │ │ │ + FP32 accum │ │ │ └──────────┘ └──────────────┘ └──────────┘ Loss Calculation: FP32 (full precision) Backward Pass: ┌──────────┐ ┌──────────────┐ ┌──────────┐ │Gradients │ │ Tensor Core │ │ Weight │ │ FP16 │────▶│ FP16 × FP16 │────▶│ Updates │ │ │ │ + FP32 accum │ │ FP32 │ └──────────┘ └──────────────┘ └──────────┘ │ ▼ Master Weights stored in FP32 (never lose precision)
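The FP32 master-weight box above is not optional, and a toy numeric experiment shows why: in FP16, a typical small weight update rounds away entirely. This sketch emulates FP16 storage with a `struct` round-trip (real frameworks additionally apply loss scaling to keep gradients representable).

```python
import struct

def fp16(x):
    """Round x to the nearest IEEE half-precision value (models FP16 storage)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

lr, grad = 1e-2, 1e-2                    # per-step weight update of 1e-4
w_fp16 = w_master = 1.0
for _ in range(100):
    w_fp16 = fp16(w_fp16 - lr * grad)    # FP16 spacing near 1.0 is ~5e-4,
                                         # so 1.0 - 1e-4 rounds back to 1.0
    w_master -= lr * grad                # FP32 master weight: the update sticks

print(w_fp16)               # 1.0  (100 updates lost to rounding)
print(round(w_master, 6))   # 0.99
```

Keeping the master copy in FP32 and casting down to FP16 only for the Tensor Core GEMMs gives the speed of FP16 compute without this stagnation.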

↑ Back to Table of Contents

10. GPU Deployment for Model Training

10.1 Hardware Stack — What's Needed to Train a Model

Complete GPU Training Infrastructure ═════════════════════════════════════ ┌─────────────────────────────────────────────────────────────┐ │ DATA CENTER RACK │ │ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │ GPU SERVER NODE (1 of many) │ │ │ │ │ │ │ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │ │ │ │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │ │ │ │ │ │ 80GB │ │ 80GB │ │ 80GB │ │ 80GB │ │ │ │ │ │ HBM3 │ │ HBM3 │ │ HBM3 │ │ HBM3 │ │ │ │ │ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ │ │ │ │ │ │ │ │ │ │ │ │ ════╧══════════╧══════════╧══════════╧═════ │ │ │ │ NVLink / GPU-to-GPU Interconnect │ │ │ │ (900 GB/s bidirectional) │ │ │ │ ═══════════════════╤═══════════════════════ │ │ │ │ │ │ │ │ │ ┌──────────────────▼──────────────────────┐ │ │ │ │ │ CPU (Host) │ │ │ │ │ │ 2× Server CPU, 512GB-2TB DDR5 RAM │ │ │ │ │ │ Orchestrates training, data loading │ │ │ │ │ └──────────────────┬──────────────────────┘ │ │ │ │ │ PCIe Gen5 ×16 │ │ │ │ ┌──────────────────▼──────────────────────┐ │ │ │ │ │ Network Interface │ │ │ │ │ │ InfiniBand 400 Gb/s (RDMA) │ │ │ │ │ │ or RoCE (RDMA over Ethernet) │ │ │ │ │ └──────────────────┬──────────────────────┘ │ │ │ │ │ │ │ │ └─────────────────────┼──────────────────────────────────┘ │ │ │ │ │ ══════════════════════╧══════════════════════════════════ │ │ High-Speed Network Fabric (Spine-Leaf) │ │ ═══════╤══════════╤══════════╤══════════╤════════════════ │ │ │ │ │ │ │ │ ┌────▼───┐ ┌────▼───┐ ┌────▼───┐ ┌────▼───┐ │ │ │Node 1 │ │Node 2 │ │Node 3 │ │Node N │ │ │ │8 GPUs │ │8 GPUs │ │8 GPUs │ │8 GPUs │ │ │ └────────┘ └────────┘ └────────┘ └────────┘ │ │ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │ STORAGE SUBSYSTEM │ │ │ │ High-speed NVMe SSDs (local) + Parallel filesystem │ │ │ │ (Lustre / GPFS / WekaFS) for shared training data │ │ │ │ Capacity: 100s of TB to PBs │ │ │ └────────────────────────────────────────────────────────┘ │ │ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │ COOLING & POWER │ │ │ │ Liquid cooling (direct-to-chip) for GPU nodes │ │ │ │ Power: 5-10 kW per GPU node │ │ │ │ Total cluster: 100 kW to 100+ MW │ │ │ └────────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────┘

Hardware Component Breakdown:

| Component | Purpose | Specification (Typical) | Why It's Needed |
|---|---|---|---|
| GPU | Compute (MAC operations) | 80-192 GB HBM3, 1000+ TFLOPS | All matrix math for training happens here |
| HBM Memory | GPU memory | 80-192 GB per GPU, 2-4 TB/s bandwidth | Holds model weights, activations, gradients |
| GPU Interconnect | GPU-to-GPU within node | 900 GB/s bidirectional per link | Gradient synchronization, tensor parallelism |
| Host CPU | Data loading, orchestration | 64-128 cores, 512 GB-2 TB DDR5 | Preprocesses data, feeds GPUs |
| System RAM | CPU memory | 512 GB-2 TB DDR5 | Dataset buffering, CPU-side processing |
| Network (InfiniBand) | Node-to-node communication | 400 Gb/s, RDMA | Multi-node distributed training |
| NVMe SSDs | Local fast storage | 4-16 TB per node, 7 GB/s | Checkpoint saving, data staging |
| Parallel Filesystem | Shared storage | PB-scale, 100+ GB/s aggregate | Training dataset access from all nodes |
| Network Switches | Spine-leaf fabric | 400 Gb/s ports, low latency | Connects all nodes with full bisection bandwidth |
| Cooling | Thermal management | Direct liquid cooling | GPUs generate 300-700 W each; air cooling insufficient at scale |
| Power Distribution | Electrical infrastructure | 5-10 kW per node | Clean power delivery, redundancy (UPS) |

10.2 Software Stack — From Hardware to Model

GPU Training Software Stack ════════════════════════════ ┌─────────────────────────────────────────────────┐ │ APPLICATION LAYER │ │ Training script (Python) │ │ model.train(), optimizer.step() │ ├─────────────────────────────────────────────────┤ │ ML FRAMEWORK │ │ PyTorch / TensorFlow / JAX │ │ Defines model, loss, optimizer, data pipeline │ ├─────────────────────────────────────────────────┤ │ DISTRIBUTED TRAINING LIBRARIES │ │ DeepSpeed / Megatron / FSDP / Horovod │ │ Model parallelism, data parallelism, ZeRO │ ├─────────────────────────────────────────────────┤ │ GPU COMPUTE LIBRARIES │ │ cuDNN (neural net primitives) │ │ cuBLAS (matrix operations — GEMM) │ │ NCCL (multi-GPU communication) │ │ cuFFT, cuSPARSE, cuRAND │ ├─────────────────────────────────────────────────┤ │ RUNTIME / COMPILER │ │ CUDA Runtime / ROCm / oneAPI │ │ Kernel compilation, memory management │ │ Stream scheduling, async execution │ ├─────────────────────────────────────────────────┤ │ GPU DRIVER │ │ Kernel-mode driver │ │ GPU context management, memory mapping │ │ PCIe / NVLink communication │ ├─────────────────────────────────────────────────┤ │ OPERATING SYSTEM │ │ Linux (Ubuntu / RHEL / Rocky) │ │ Kernel modules, IOMMU, huge pages │ ├─────────────────────────────────────────────────┤ │ FIRMWARE │ │ GPU VBIOS / firmware │ │ BMC (Baseboard Management Controller) │ │ NIC firmware (InfiniBand / RoCE) │ ├─────────────────────────────────────────────────┤ │ HARDWARE │ │ GPU silicon → HBM → PCIe/NVLink → CPU → Network │ └─────────────────────────────────────────────────┘

Software Dependencies — Detailed:

| Layer | Software | Function | Depends On |
|---|---|---|---|
| Framework | PyTorch | Model definition, autograd, training loop | Python, CUDA, cuDNN, NCCL |
| Framework | TensorFlow | Graph-based model execution | Python, CUDA, cuDNN, XLA |
| Framework | JAX | Functional transformations, JIT compilation | Python, XLA, CUDA |
| Distributed | DeepSpeed | ZeRO optimizer, pipeline parallelism | PyTorch, NCCL, MPI |
| Distributed | Megatron-LM | Tensor + pipeline parallelism for LLMs | PyTorch, NCCL, CUDA |
| Math Library | cuBLAS | GEMM (General Matrix Multiply) — the MAC workhorse | CUDA Runtime, GPU Driver |
| DNN Library | cuDNN | Convolution, attention, pooling, normalization | cuBLAS, CUDA Runtime |
| Communication | NCCL | All-reduce, broadcast across GPUs | GPU Driver, NVLink/PCIe/InfiniBand |
| Runtime | CUDA Toolkit | Kernel launch, memory management, streams | GPU Driver, Linux kernel |
| Driver | GPU Kernel Driver | Hardware abstraction, context management | Linux kernel, GPU firmware |
| Networking | MPI / RDMA | Inter-node communication | InfiniBand driver, OS kernel |
| Container | Docker + GPU Runtime | Reproducible environment packaging | OS, GPU Driver (must match) |
| Orchestration | Kubernetes + GPU plugin | Cluster scheduling, multi-job management | Docker, network fabric, storage |
Key Dependency Chain: Training script → PyTorch → cuDNN → cuBLAS → CUDA Runtime → GPU Driver → GPU Firmware → GPU Silicon (MAC units). Every layer must be version-compatible. A driver mismatch or cuDNN version conflict can prevent training entirely.

10.3 Multi-GPU Training Topologies

Data Parallelism vs Model Parallelism vs Pipeline Parallelism ══════════════════════════════════════════════════════════════ DATA PARALLELISM (most common): ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │ │ Full │ │ Full │ │ Full │ │ Full │ │ Model │ │ Model │ │ Model │ │ Model │ │ Copy │ │ Copy │ │ Copy │ │ Copy │ │ │ │ │ │ │ │ │ │ Batch/4 │ │ Batch/4 │ │ Batch/4 │ │ Batch/4 │ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │ │ │ │ └──────────┴──────────┴──────────┘ All-Reduce gradients (via NCCL over NVLink) TENSOR PARALLELISM (for large layers): ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │ │ Layer │ │ Layer │ │ Layer │ │ Layer │ │ Cols │ │ Cols │ │ Cols │ │ Cols │ │ [0:N/4] │ │[N/4:N/2]│ │[N/2:3N/4]│ │[3N/4:N]│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │ │ │ │ └──────────┴──────────┴──────────┘ All-Reduce partial results PIPELINE PARALLELISM (for deep models): ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ GPU 0 │──▶│ GPU 1 │──▶│ GPU 2 │──▶│ GPU 3 │ │ Layers │ │ Layers │ │ Layers │ │ Layers │ │ 1-10 │ │ 11-20 │ │ 21-30 │ │ 31-40 │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ Micro-batches flow through stages like an assembly line
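The gradient synchronization step in data parallelism reduces to one collective operation. The pure-Python stand-in below shows the semantics of NCCL's all-reduce followed by averaging; it is only illustrative (real all-reduce runs a ring or tree algorithm over NVLink/InfiniBand, and `all_reduce_mean` is a made-up name).

```python
def all_reduce_mean(grads_per_gpu):
    """Average per-replica gradient vectors so that every replica ends up
    with the identical result: the net effect of an all-reduce (sum)
    followed by division by the world size."""
    world = len(grads_per_gpu)
    n = len(grads_per_gpu[0])
    avg = [sum(g[i] for g in grads_per_gpu) / world for i in range(n)]
    return [list(avg) for _ in range(world)]  # each replica gets a full copy

# 4 "GPUs", each holding gradients computed on its quarter of the batch
local_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
synced = all_reduce_mean(local_grads)
print(synced[0])   # [4.0, 5.0]: every replica now applies the same update
```

After this step all model copies take an identical optimizer step, which is what keeps the replicas in lockstep across the whole cluster.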

GPU Interconnect Topologies:

| Topology | Bandwidth | Scope | Use Case |
|---|---|---|---|
| NVLink (within node) | 900 GB/s per GPU | 8 GPUs in one server | Tensor parallelism, fast all-reduce |
| NVSwitch | Full bisection | All-to-all within node | Any GPU can talk to any GPU at full speed |
| InfiniBand (across nodes) | 400 Gb/s per port | Thousands of nodes | Data parallelism gradient sync |
| PCIe Gen5 | 64 GB/s per ×16 | GPU ↔ CPU | Data loading, CPU-GPU transfer |
| RoCE v2 | 100-400 Gb/s | Ethernet-based clusters | Lower-cost alternative to InfiniBand |

10.4 End-to-End Training Workflow

How a Model Gets Trained — Step by Step ═════════════════════════════════════════ ┌──────────────────┐ │ 1. PREPARE DATA │ │ │ │ Raw data (text, │ ┌──────────────────┐ │ images, etc.) │────▶│ Tokenize / │ │ stored on │ │ Preprocess │ │ parallel FS │ │ (CPU job) │ └──────────────────┘ └────────┬─────────┘ │ ▼ ┌──────────────────────────────────────────────┐ │ 2. LOAD MODEL & DISTRIBUTE │ │ │ │ Initialize model weights (random or │ │ from checkpoint) │ │ │ │ Distribute across GPUs: │ │ - Data parallel: copy model to all GPUs │ │ - Tensor parallel: shard layers across GPUs │ │ - Pipeline parallel: assign layer groups │ └──────────────────────────┬────────────────────┘ │ ▼ ┌──────────────────────────────────────────────┐ │ 3. TRAINING LOOP (repeat millions of times) │ │ │ │ a) DataLoader fetches batch from storage │ │ CPU → GPU transfer via PCIe/pinned memory │ │ │ │ b) FORWARD PASS │ │ Input → Layer 1 → Layer 2 → ... → Output │ │ Each layer: GEMM (MAC!) + activation │ │ Uses: cuBLAS GEMM → Tensor Cores → MACs │ │ │ │ c) LOSS COMPUTATION │ │ Compare output to ground truth │ │ Compute scalar loss value │ │ │ │ d) BACKWARD PASS (backpropagation) │ │ Compute gradients via chain rule │ │ Each layer: another GEMM (MAC!) for grads │ │ Memory intensive — store activations │ │ │ │ e) GRADIENT SYNC (multi-GPU) │ │ All-reduce gradients via NCCL │ │ NVLink within node, InfiniBand across │ │ │ │ f) OPTIMIZER STEP │ │ Update weights: W = W - lr × gradient │ │ Adam/AdamW: also maintains momentum │ │ │ │ g) CHECKPOINT (periodically) │ │ Save model state to NVMe/parallel FS │ │ Enables restart if hardware fails │ └──────────────────────────────────────────────┘ │ ▼ ┌──────────────────────────────────────────────┐ │ 4. EVALUATION & EXPORT │ │ │ │ Run validation dataset through model │ │ Measure accuracy/perplexity/loss │ │ Export final weights for inference │ └──────────────────────────────────────────────┘
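Steps 3a-3g shrink to a few lines if the "model" is a two-parameter linear fit. This CPU-only sketch is purely illustrative (no real framework, multi-GPU and I/O steps reduced to comments), but each labeled line maps to one box in the workflow above.

```python
import random

# Toy dataset: y = 2x + 1 exactly; the model learns y_hat = w*x + b
random.seed(0)
data = [(x, 2.0 * x + 1.0) for x in [i / 10 for i in range(-20, 21)]]
w, b, lr = 0.0, 0.0, 0.05
checkpoints = []

for epoch in range(200):                       # 3) training loop
    random.shuffle(data)                       # a) "DataLoader" fetches a batch
    gw = gb = loss = 0.0
    for x, y in data:
        y_hat = w * x + b                      # b) forward pass (the GEMM/MAC step)
        err = y_hat - y
        loss += err * err / len(data)          # c) loss computation (MSE)
        gw += 2 * err * x / len(data)          # d) backward pass (chain rule)
        gb += 2 * err / len(data)
    # e) on multi-GPU, gradients would be all-reduced here (NCCL)
    w -= lr * gw                               # f) optimizer step: W = W - lr*grad
    b -= lr * gb
    if epoch % 50 == 0:
        checkpoints.append({"w": w, "b": b})   # g) periodic checkpoint

print(f"w={w:.3f}, b={b:.3f}, loss={loss:.2e}")
```

At scale only the shapes change: the forward and backward lines become billions of MACs per step, and the checkpoint dict becomes hundreds of GB written to the parallel filesystem.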

Compute Cost at Each Stage:

| Training Step | Primary Hardware | Bottleneck | MAC Usage |
|---|---|---|---|
| Data Loading | CPU + SSD + PCIe | I/O bandwidth | None |
| Forward Pass | GPU Tensor Cores | Compute (MACs) | Maximum |
| Loss Computation | GPU Shader Cores | Minimal | Low |
| Backward Pass | GPU Tensor Cores | Compute + Memory | Maximum |
| Gradient Sync | NVLink / InfiniBand | Network bandwidth | None |
| Optimizer Step | GPU Shader Cores | Memory bandwidth | Low |
| Checkpointing | NVMe SSDs | I/O bandwidth | None |
Key Insight: The forward and backward passes are where 90%+ of GPU compute happens — and it's almost entirely matrix multiply (GEMM), which means MAC operations. This is why GPU performance for AI is measured in MAC throughput (TFLOPS/TOPS), and why Tensor Cores (dense MAC arrays) exist.
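A back-of-envelope calculation shows how GEMM shapes translate into MAC counts and runtime. The layer sizes and the 500 TFLOPS sustained rate below are illustrative assumptions, not figures for any specific GPU or model.

```python
def gemm_macs(m, k, n):
    """MAC count for an (m x k) @ (k x n) matrix multiply:
    one MAC per (row, column, inner-dimension) triple."""
    return m * k * n

# One feed-forward GEMM: 8192 tokens, hidden size 4096 -> 16384 (illustrative)
macs = gemm_macs(8192, 4096, 16384)
flops = 2 * macs                  # 1 MAC = 1 multiply + 1 add = 2 FLOPs
seconds = flops / 500e12          # at an assumed sustained 500 TFLOPS
print(f"{macs:.3e} MACs, {seconds * 1e3:.2f} ms")
```

Multiplying a few milliseconds per GEMM by the thousands of GEMMs per step and millions of steps per run is exactly where the month-long training times in the next table come from.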

Scale Examples:

| Model Size | GPUs Needed | Training Time | Cost (approx.) |
|---|---|---|---|
| 1B parameters | 8 GPUs (1 node) | ~1-3 days | $5K-$15K |
| 7B parameters | 32 GPUs (4 nodes) | ~1-2 weeks | $50K-$150K |
| 70B parameters | 256 GPUs (32 nodes) | ~1-3 months | $1M-$5M |
| 400B+ parameters | 2000+ GPUs (250+ nodes) | ~3-6 months | $10M-$100M+ |

↑ Back to Table of Contents

11. Key Concepts & Q&A

11.1 Fundamental Questions & Answers

Q1: How does GPU latency hiding differ from CPU out-of-order execution?
A: CPUs hide latency by reordering instructions within a single thread (OoO execution with reservation stations, ROB). GPUs hide latency by switching between thousands of threads (warps). When one warp stalls on a memory access, the scheduler immediately issues instructions from another ready warp. This requires massive register files to hold all thread contexts simultaneously, but avoids the complex OoO hardware. The trade-off: GPUs sacrifice single-thread performance for aggregate throughput.
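The effect is easy to see in a toy issue-slot simulation: a single-issue "SM" whose warps alternate a few ALU ops with a long-latency load. All parameters here are illustrative round numbers, and `sm_utilization` is a made-up name, not a real tool.

```python
def sm_utilization(num_warps, mem_latency=400, alu_ops=4, total_cycles=10_000):
    """Toy single-issue SM: each warp issues `alu_ops` 1-cycle instructions,
    then a load that parks it for `mem_latency` cycles. Returns the fraction
    of issue slots filled; the scheduler hides latency by switching warps."""
    ready_at = [0] * num_warps
    remaining = [alu_ops] * num_warps
    issued = 0
    for cycle in range(total_cycles):
        for w in range(num_warps):          # pick the first ready warp
            if ready_at[w] <= cycle:
                issued += 1
                remaining[w] -= 1
                if remaining[w] == 0:       # hit the load: stall this warp
                    ready_at[w] = cycle + 1 + mem_latency
                    remaining[w] = alu_ops
                else:
                    ready_at[w] = cycle + 1
                break
    return issued / total_cycles

print(f"1 warp:    {sm_utilization(1):.1%}")    # ~1%: latency fully exposed
print(f"128 warps: {sm_utilization(128):.1%}")  # ~100%: latency fully hidden
```

This is also why the register file must hold every resident warp's context at once: occupancy is the hardware's latency-hiding budget.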
Q2: What happens when threads in a warp diverge at a branch?
A: The GPU executes both paths sequentially, masking inactive threads using predication. For example, if 20 of 32 threads take the if-path: first execute the if-path with a 20-thread active mask, then execute the else-path with a 12-thread active mask. At the reconvergence point (post-dominator), all 32 threads resume together. This wastes SIMT lanes and reduces throughput. Modern GPUs use independent thread scheduling to allow partial reconvergence before the post-dominator.
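A mask-based issue counter makes the cost concrete. This is a sketch of the classic post-dominator reconvergence scheme only (no independent thread scheduling), and `warp_issue_count` is an illustrative name.

```python
def warp_issue_count(cond_mask, if_len, else_len):
    """Instructions issued by one warp at a divergent branch: a path is
    issued for the whole warp whenever at least one lane takes it, with
    non-participating lanes masked off (predicated)."""
    taken = sum(cond_mask)
    issues = 0
    if taken > 0:                       # if-path, active mask = cond
        issues += if_len
    if taken < len(cond_mask):          # else-path, active mask = ~cond
        issues += else_len
    return issues

uniform  = [1] * 32                     # all lanes agree: one path only
diverged = [1] * 20 + [0] * 12          # 20 of 32 lanes take the if-path
print(warp_issue_count(uniform, 10, 10))    # 10
print(warp_issue_count(diverged, 10, 10))   # 20: both paths serialized
```

Even a single disagreeing lane doubles the issue count for equal-length paths, which is why divergence-free kernels are a standard optimization target.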
Q3: How would you design a clock gating strategy for a GPU SM?
A: Multi-level approach: (1) Fine-grained: gate individual ALU units when no valid instruction — use ICG cells, minimum 4-bit width threshold. (2) Medium-grained: gate entire execution sub-blocks (tensor cores, SFUs) based on instruction type decode. (3) Coarse-grained: power-gate entire SMs when no workload assigned — use retention flops for fast wake-up. (4) Data-dependent: operand isolation on multiplier inputs when not in use. Track clock gating efficiency with coverage metrics targeting >70% gating in idle scenarios.
Q4: Explain memory coalescing in GPU and its RTL implications.
A: When 32 threads in a warp access consecutive memory addresses, the hardware merges them into a single wide memory transaction (e.g., 32×4B = 128B). The coalescing unit in the load/store pipeline compares thread addresses, detects stride patterns, and generates minimum transactions. RTL design requires: address comparison logic for 32 threads, transaction merging FSM, handling of partial coalescing when threads access non-consecutive addresses, and replay logic for bank conflicts in shared memory.
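The address-comparison step reduces, to first order, to counting the distinct 128-byte segments touched by the warp. This sketch ignores alignment, partial transactions, and replay (`num_transactions` is an illustrative name).

```python
def num_transactions(byte_addrs, segment=128):
    """128B memory transactions needed for one warp's per-lane byte
    addresses: one transaction per distinct 128-byte segment touched."""
    return len({addr // segment for addr in byte_addrs})

coalesced = [lane * 4 for lane in range(32)]     # consecutive 4B words
strided   = [lane * 128 for lane in range(32)]   # one lane per segment
print(num_transactions(coalesced))   # 1: fully coalesced, 32 x 4B = 128B
print(num_transactions(strided))     # 32: worst case, 32x the traffic
```

The 32x spread between the two cases is the entire motivation for the coalescing hardware, and for structure-of-arrays data layouts in kernels.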
Q5: How do you approach timing closure on a GPU block at 3nm?
A: (1) RTL phase: limit combinational depth to <12 gates, register all module outputs, use retiming-friendly structures. (2) Synthesis: multi-Vt optimization (HVT for non-critical, LVT for critical paths), aggressive clock gating for power. (3) P&R: macro placement for SRAMs near datapaths, strategic pipeline flop placement, useful skew insertion. (4) Signoff: run timing at worst-case PVT corners (SS, 0.72V, 125°C), fix setup/hold with ECO. Key GPU-specific: the register file-to-ALU path is usually the critical path — focus there first.
Q6: How is a GPU IP block integrated into an SoC?
A: (1) Define interface protocol (AXI4 for memory, APB for config), clock/reset requirements, power domains. (2) Provide integration wrapper with configurable parameters (number of SMs, cache sizes). (3) CDC analysis at all clock boundaries (GPU core clock, memory clock, bus clock). (4) Deliver UPF for power intent, SDC for timing constraints. (5) Provide verification IP (UVM agents, protocol checkers, integration tests). (6) Provide integration guide, known issues, and debug access via JTAG/scan.

11.2 Hands-On Design Exercises

| Exercise | Skills Tested | Key Concepts |
|---|---|---|
| Design a warp scheduler | Microarchitecture thinking, priority logic | Scoreboarding, dependency tracking |
| Design a banked register file | SRAM design, conflict resolution | Bank interleaving, port arbitration |
| Design a memory coalescing unit | Address comparison, transaction merging | CAM-like structures |
| Debug a timing violation | Physical design understanding | Critical path analysis, fixing techniques |
| Write clock gating RTL | Power awareness in coding | ICG instantiation, enable generation |
| Design an AXI-to-internal bridge | Protocol knowledge, FSM design | AXI channels, outstanding transactions |

↑ Back to Table of Contents

12. Study Resources & Preparation Roadmap

12.1 Recommended Study Plan (4-6 Weeks)

| Week | Focus Area | Activities |
|---|---|---|
| Week 1 | GPU Architecture Basics | Study the CUDA programming model, understand SIMT, read GPU architecture whitepapers |
| Week 2 | GPU Microarchitecture | Deep dive into SM structure, warp scheduling, register file design, memory hierarchy |
| Week 3 | RTL & PPA for GPU | Practice GPU-style RTL coding, study clock gating techniques, review timing closure at advanced nodes |
| Week 4 | Verification & FEV | Review formal verification tools (Formality, Conformal), practice SVA assertions |
| Week 5 | Integration & Protocols | Study AXI/ACE/CHI protocols, CDC techniques, power intent (UPF) |
| Week 6 | Design Exercises | Whiteboard exercises, practice explaining designs, solve design problems |

12.2 Essential Reading

| Resource | Type | Why Read It |
|---|---|---|
| GPU Architecture Whitepapers (latest generation) | Whitepaper | Official architecture details with block diagrams |
| "Computer Architecture: A Quantitative Approach" — Hennessy & Patterson (Ch. 4: GPU) | Textbook | Academic foundation of GPU architecture |
| CUDA C++ Programming Guide | Documentation | Understand the software model that drives hardware decisions |
| "A Survey of Techniques for Architecting and Managing GPU Register File" | Research Paper | Deep dive into register file design challenges |
| RDNA/CDNA Architecture Whitepapers | Whitepaper | Alternative GPU architecture perspective |
| Xe Architecture Documentation | Documentation | Another GPU architecture approach |
| Hot Chips Conference Presentations | Conference Talks | Latest GPU architecture announcements and trends |

12.3 SoC-to-GPU Skill Mapping

Engineers with SoC backgrounds have strong transferable foundations for GPU design:

| SoC Skill | GPU Equivalent | Relevance |
|---|---|---|
| SoC RTL design | GPU IP RTL design | Complex pipelined blocks with similar structures |
| UVM verification | GPU block verification | UVM methodology scales to GPU's parallel verification challenges |
| NoC / interconnect | GPU chiplet interconnect | NoC fabric design directly applicable to GPU die-to-die |
| AXI/ACE protocols | GPU-SoC interface | Bridges GPU IP into the SoC seamlessly |
| Security architecture | GPU secure boot, TEE | Security expertise is increasingly critical in GPU designs |
| Customer/outsourcing | GPU IP delivery | Managing IP deliverables and customer integration |
| Automotive standards | Automotive GPU compliance | Critical for automotive GPU deployments |
Key Takeaway: SoC architects bring system-level thinking, security expertise, and integration skills that complement GPU-specific domain knowledge. The combination of NoC, verification, and customer-facing experience is particularly valuable in modern chiplet-based GPU architectures.

↑ Back to Table of Contents