Basics of GPU — From Concept to Handoff to Hardening

A Comprehensive Guide to GPU Architecture, RTL Design, Verification & Physical Implementation

1. GPU Architecture Fundamentals

1.1 The Graphics Pipeline

Unlike CPUs (optimized for single-thread latency), GPUs are optimized for throughput — processing thousands of threads simultaneously. The classic graphics pipeline:

Application (CPU)
      │
      ▼
Vertex Shader ──▶ Primitive Assembly ──▶ Rasterization
                                              │
                                              ▼
Frame Buffer ◀──── Blend / ROP ◀──── Fragment Shader
      │
      ▼
   Display
Stage | What it Does | RTL Design Relevance
------|--------------|---------------------
Vertex Shader | Transforms 3D vertex positions, applies lighting per vertex | Programmable execution units, ALU design
Primitive Assembly | Groups vertices into triangles | Fixed-function logic, FIFO management
Rasterization | Converts triangles to pixel fragments | Edge equations, interpolation hardware
Fragment Shader | Computes per-pixel color, textures, effects | Texture units, ALU arrays, shared memory
ROP (Render Output) | Depth test, blending, anti-aliasing | Fixed-function, memory bandwidth critical
Key Insight for SoC Engineers: The vertex and fragment shaders run on the same unified shader cores in modern GPUs. The fixed-function stages (rasterizer, ROP) are separate hardware blocks. As a GPU RTL designer, you may work on either type.

1.2 Compute Architecture — SIMT Model

Modern GPUs use SIMT (Single Instruction, Multiple Threads) — a key concept to master:

SIMT Execution Model

Warp (32 threads):
  ┌ T0 ┬ T1 ┬ T2 ┬ T3 ┬ ... ┬ T31 ┐
    │    │    │    │          │
    ▼    ▼    ▼    ▼          ▼
  Same instruction (ADD R1, R2), different data per thread

Common terminology:
  • Warp — 32 threads (most common)
  • Wavefront — 64 threads (alternative)
  • EU Thread — 8-16 threads (varies by vendor)
Concept | Description | Why it Matters for RTL
--------|-------------|-----------------------
Thread | Single execution instance | Each thread has its own register state
Warp/Wavefront | Group of threads executing same instruction | Warp scheduler is a critical RTL block
Thread Block | Group of warps sharing resources | Shared memory, barrier synchronization HW
Grid | All thread blocks for a kernel | Dispatch unit, work distribution
Divergence | Threads in a warp take different paths (if/else) | Predication masks, re-convergence logic
Critical RTL Design Point: When threads in a warp diverge (if/else), the GPU must execute both paths and mask inactive threads. This is handled by predication hardware — a key area for GPU RTL designers.
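To make the predication mechanics concrete, here is a toy Python model of both-paths execution with lane masks. This illustrates the concept only — it is not any vendor's actual hardware, and it uses an 8-lane "warp" for brevity:

```python
# Toy model of SIMT divergence handling: when threads of a warp disagree
# on a branch, hardware runs BOTH paths, masking off inactive lanes,
# then re-converges.

def simt_branch(data, cond, then_fn, else_fn):
    """Execute an if/else over one warp using predication masks."""
    taken_mask = [cond(x) for x in data]        # per-lane predicate
    result = list(data)
    # Pass 1: execute the 'then' path, non-taken lanes masked off
    for lane, active in enumerate(taken_mask):
        if active:
            result[lane] = then_fn(data[lane])
    # Pass 2: execute the 'else' path, taken lanes masked off
    for lane, active in enumerate(taken_mask):
        if not active:
            result[lane] = else_fn(data[lane])
    # Re-convergence point: all lanes active again
    return result

warp = list(range(8))  # 8 lanes for brevity (real warps: 32)
out = simt_branch(warp, lambda x: x % 2 == 0,
                  lambda x: x * 10,    # 'then' path
                  lambda x: x + 100)   # 'else' path
print(out)  # [0, 101, 20, 103, 40, 105, 60, 107]
```

Note that the warp pays for both paths in time: a fully divergent warp takes roughly the sum of both path lengths, which is why divergence hurts throughput.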

1.3 GPU Memory Hierarchy

GPU Memory Hierarchy (fastest → slowest)

  Register File   (256 KB per SM)     per thread
        │
  Shared Memory   (64-228 KB/SM)      per thread block (software-managed scratchpad)
        │
  L1 Cache        (128-256 KB)        per SM
        │
  L2 Cache        (4-96 MB)           shared across GPU
        │
  GDDR6X / HBM    (8-80 GB)           off-chip DRAM (high bandwidth)
Memory Type | Latency (cycles) | Bandwidth | RTL Design Considerations
------------|------------------|-----------|---------------------------
Register File | 0-1 | Highest | Multi-ported SRAM, bank conflicts, allocation logic
Shared Memory | ~20-30 | Very High | Bank design (32 banks), conflict resolution, barrier sync
L1 Cache | ~30-50 | High | Tag arrays, replacement policy, coherence with shared mem
L2 Cache | ~200-400 | Medium | Slice architecture, NoC interface, coherence protocol
GDDR6X/HBM | ~400-800 | Up to 3+ TB/s (HBM3) | Memory controller, scheduling, refresh, ECC
SoC Engineer Crossover: GPU memory hierarchy is similar to CPU caches you already know, but with key differences: (1) Register files are massively larger (to hold thousands of thread contexts), (2) Shared memory is explicitly managed by software, (3) Bandwidth is prioritized over latency.
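The shared-memory banking rule in the table above can be made concrete with a toy conflict counter. This assumes the common 32-bank, 4-byte-word layout and deliberately ignores the same-word broadcast case that real hardware handles specially:

```python
# Toy shared-memory bank-conflict counter. With 32 banks and 4-byte
# words, bank = (addr // 4) % 32; a warp's access takes as many cycles
# as the most-loaded bank (conflict-free = 1 cycle).

def access_cycles(addrs, num_banks=32, word_bytes=4):
    loads = {}
    for a in addrs:
        b = (a // word_bytes) % num_banks
        loads[b] = loads.get(b, 0) + 1
    return max(loads.values())   # serialized over the hottest bank

unit_stride = [4 * t for t in range(32)]   # each lane hits its own bank
stride_2    = [8 * t for t in range(32)]   # lanes pair up on 16 banks
print(access_cycles(unit_stride))  # 1  (conflict-free)
print(access_cycles(stride_2))     # 2  (2-way conflict)
```

This is exactly the analysis the conflict-resolution RTL performs per access, and why unit-stride layouts matter so much in shader code.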


2. GPU Microarchitecture Deep Dive

2.1 Streaming Multiprocessor (SM) Architecture

The Streaming Multiprocessor (SM) / Compute Unit (CU) / Execution Unit (EU) is the fundamental building block:

Streaming Multiprocessor (SM)

  Warp Schedulers (×4):  Sched0 │ Sched1 │ Sched2 │ Sched3
                            │ issue instructions each cycle
                            ▼
  Execution Units:
    INT32 ALU ×2 │ FP32 ALU ×2 │ FP64 ALU │ Tensor Core │ SFU (special func: sin/cos)

  Register File (256 KB) │ Shared Memory (100 KB) │ L1 Cache (128 KB)

  Load/Store Units (memory requests) │ Texture Units (filtering, LOD)
Sub-block | Function | RTL Complexity
----------|----------|---------------
Warp Scheduler | Picks ready warps, issues instructions each cycle | Very High — scoreboarding, dependency tracking
INT32/FP32 ALUs | Core arithmetic — add, mul, fma, bitwise | Medium — pipelined datapaths
Tensor Cores | Matrix multiply-accumulate (AI/ML workloads) | Very High — systolic arrays, mixed precision
SFU | Transcendental functions — sin, cos, rsqrt | Medium — iterative approximation
Register File | Thread context storage | Very High — multi-banked, conflict-free access
Load/Store Units | Memory access — coalescing, address generation | Very High — coalescing logic, TLB
Texture Units | Texture sampling, bilinear/trilinear filtering | Medium — interpolation math, LOD calc

2.2 Warp Scheduling — The Heart of GPU Performance

The warp scheduler is one of the most critical RTL blocks in a GPU. It must:

  • Track dependency status of all active warps (scoreboard)
  • Select ready warps each cycle (priority/round-robin/age-based)
  • Handle divergence and re-convergence
  • Manage stalls (memory latency, execution hazards)
Warp Scheduling — Latency Hiding
═══════════════════════════════════

Cycle:    1   2   3   4   5   6   7   8   9   10
────────────────────────────────────────────────────────
Warp 0:  EXE EXE MEM --- --- --- --- --- --- MEM_DONE
Warp 1:  --- EXE EXE EXE MEM --- --- --- ---
Warp 2:  --- --- EXE EXE EXE EXE MEM --- ---
Warp 3:  --- --- --- EXE EXE EXE EXE EXE MEM
────────────────────────────────────────────────────────

While Warp 0 waits for memory, the scheduler runs Warps 1, 2, 3.
This is GPU LATENCY HIDING.
Key Design Decision: The scheduler's policy (Greedy-Then-Oldest, Loose Round-Robin, Two-Level) can swing performance by 15-30%, making it a critical area where micro-architecture innovation happens.
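The latency-hiding timeline above can be reproduced with a tiny issue-slot simulation. This is a toy single-issue model with made-up latencies, not a real scheduler — its only point is to show utilization rising as warps are added:

```python
# Toy single-issue scheduler: each instruction takes 1 issue slot, then
# the warp stalls mem_latency cycles before its next instruction.

def run(num_warps, insts=4, mem_latency=6):
    """Return issue-slot utilization (1.0 = no bubbles)."""
    ready = [0] * num_warps          # cycle each warp becomes issuable
    left = [insts] * num_warps       # instructions left per warp
    cycle = issued = 0
    while any(left):
        cand = [w for w in range(num_warps) if left[w] and ready[w] <= cycle]
        if cand:
            w = min(cand, key=lambda i: ready[i])   # oldest-ready first
            issued += 1
            left[w] -= 1
            ready[w] = cycle + 1 + mem_latency      # stall until mem returns
        cycle += 1                                   # bubble if nothing ready
    return issued / cycle

print(run(1))   # ~0.18 — one warp leaves the pipe mostly idle
print(run(7))   # 1.0  — seven warps hide the memory latency completely
```

With a 6-cycle stall, 7 resident warps are exactly enough to keep the issue slot full every cycle — the same occupancy arithmetic GPU architects do for real memory latencies.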

2.3 Register File Design

GPU register files are unlike anything in CPU design:

Parameter | CPU | GPU
----------|-----|----
Size per core | ~5-10 KB | ~256 KB per SM
Ports | ~10-20 read/write | Banked (32+ banks) to simulate many ports
Allocation | Rename/OoO | Static at kernel launch
Purpose | Speculative execution | Hold thousands of thread contexts
RTL Challenge: Designing a 256 KB register file with conflict-free access across 32 banks, serving 4 warp schedulers issuing to multiple execution units — while meeting timing at 1.5+ GHz — is one of the hardest GPU RTL design problems.


3. RTL Design for GPU IPs

3.1 RTL Coding Style for GPU Blocks

GPU RTL coding follows strict guidelines for synthesis and timing closure at advanced nodes:

// Example: simple pipelined FP32 FMA unit (Fused Multiply-Add)
// Key GPU operation: result = A * B + C
module fp32_fma_pipe (
    input  logic        clk,
    input  logic        rst_n,
    input  logic        valid_in,
    input  logic [31:0] operand_a,
    input  logic [31:0] operand_b,
    input  logic [31:0] operand_c,
    output logic        valid_out,
    output logic [31:0] result
);

    // Pipeline stage 1: multiply 24-bit mantissas (23 fraction bits
    // plus the implicit leading 1), giving a 48-bit product
    logic [47:0] mul_result_s1;
    logic [31:0] operand_c_s1;
    logic        valid_s1;

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            valid_s1 <= 1'b0;
        end else begin
            valid_s1      <= valid_in;
            mul_result_s1 <= {1'b1, operand_a[22:0]} * {1'b1, operand_b[22:0]};
            operand_c_s1  <= operand_c;
        end
    end

    // Pipeline stage 2: accumulate + normalize
    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            valid_out <= 1'b0;
        end else begin
            valid_out <= valid_s1;
            result    <= normalize(add(mul_result_s1, operand_c_s1));
        end
    end

endmodule

GPU RTL Coding Guidelines:

  • Always use always_ff for sequential logic — synthesizers optimize better
  • Pipeline aggressively — GPU targets high clock frequencies (1.5-2.5 GHz)
  • Avoid latches — use explicit flops for every storage element
  • Minimize combinational depth — target <15 logic levels per stage
  • Clock gating everywhere — GPUs burn 300-700W, power is critical
  • Parameterize designs — GPU blocks are instantiated many times

3.2 FSM Design for GPU Controllers

GPU control logic frequently uses FSMs. Example — a simplified warp dispatch controller:

Warp Dispatch FSM

  IDLE ───ready warps > 0───▶ DECODE
    ▲                            │ decoded
    │ all warps complete         ▼
  COMPLETE ◀──result ready─── EXECUTE ───memory request───▶ MEM_WAIT
RTL Best Practice: Use one-hot encoding for GPU FSMs at advanced nodes — it's faster (single bit check) and easier for synthesis tools to optimize, even though it uses more flops.
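The dispatch FSM above can be written down as an executable transition table — a quick way to sanity-check state reachability before committing to RTL. Note the MEM_WAIT → EXECUTE return edge is an assumption here (the diagram omits it); in RTL each state would be one bit of a one-hot vector:

```python
# Warp-dispatch FSM as a transition table (the MEM_WAIT -> EXECUTE
# return edge is assumed; the source diagram does not show it).

TRANSITIONS = {
    ("IDLE",     "ready_warps"):  "DECODE",
    ("DECODE",   "decoded"):      "EXECUTE",
    ("EXECUTE",  "mem_request"):  "MEM_WAIT",
    ("EXECUTE",  "result_ready"): "COMPLETE",
    ("MEM_WAIT", "mem_done"):     "EXECUTE",   # assumed return edge
    ("COMPLETE", "all_done"):     "IDLE",
}

def step(state, event):
    # Undefined (state, event) pairs hold state, like a gated register
    return TRANSITIONS.get((state, event), state)

state = "IDLE"
for ev in ["ready_warps", "decoded", "mem_request", "mem_done",
           "result_ready", "all_done"]:
    state = step(state, ev)
print(state)  # back to IDLE after a full dispatch round trip
```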

3.3 Pipeline Design Patterns

Pattern | Use Case in GPU | Key Design Consideration
--------|-----------------|--------------------------
Linear Pipeline | ALU execution stages | Balanced stage delays, forwarding paths
Elastic Pipeline | Memory subsystem | Valid/ready handshake, backpressure
Skid Buffer | Interface between blocks | Absorbs 1-cycle stalls without losing throughput
Credit-based Flow | NoC, L2 cache interface | Prevents overflow, decouples producer/consumer
FIFO Queues | Instruction buffer, memory queues | Async FIFOs for clock domain crossing
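The credit-based flow pattern from the table can be sketched behaviorally in a few lines — a toy model of the producer-side counter and receiver buffer, not an RTL interface:

```python
# Minimal credit-based flow-control sketch (NoC/L2-style): the producer
# holds one credit per receiver buffer slot, decrements on send, and
# the receiver returns a credit when it frees a slot.

class CreditLink:
    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # producer-side credit counter
        self.fifo = []                # receiver buffer

    def try_send(self, pkt):
        if self.credits == 0:
            return False              # backpressure: overflow impossible
        self.credits -= 1
        self.fifo.append(pkt)
        return True

    def consume(self):
        pkt = self.fifo.pop(0)
        self.credits += 1             # credit return to producer
        return pkt

link = CreditLink(buffer_slots=2)
print(link.try_send("p0"), link.try_send("p1"), link.try_send("p2"))
# third send stalls until a credit comes back
link.consume()
print(link.try_send("p2"))
```

The key property — the receiver buffer can never overflow because the producer stalls itself — is exactly what makes credits attractive across long, pipelined NoC links where a valid/ready handshake would add round-trip latency.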


4. Power, Performance, Area (PPA) Optimization

4.1 Power Optimization Techniques

Technique | Savings | Where Used in GPU
----------|---------|-------------------
Clock Gating | 20-40% | Idle execution units, unused warps, inactive SMs
Power Gating | Up to 90% (leakage) | Entire SMs powered off when unused
DVFS | Variable | Dynamic voltage/frequency based on workload
Operand Isolation | 5-10% | Gate inputs to multipliers when not valid
Memory Banking | 10-20% | Only activate needed SRAM banks
Data Encoding | 5-15% | Bus invert coding on wide data buses
// Clock gating example for GPU ALU
logic alu_clk_en;
logic alu_gated_clk;

assign alu_clk_en = valid_in | flush | stall_recovery;

// Use ICG cell (Integrated Clock Gate) — synthesis replaces this
clk_gate_cell u_alu_cg (
    .clk_in  (clk),
    .enable  (alu_clk_en),
    .clk_out (alu_gated_clk)
);

4.2 Performance Optimization

  • Pipeline balancing — ensure each stage has similar combinational delay
  • Reduce stall cycles — forwarding/bypassing between dependent instructions
  • Memory coalescing — merge multiple thread memory accesses into one transaction
  • Occupancy optimization — maximize active warps per SM to hide latency
  • Instruction-Level Parallelism — dual-issue or quad-issue capabilities
Performance Metric: GPU performance is measured in FLOPS (Floating Point Operations Per Second). A modern GPU achieves 40-80 TFLOPS FP32. RTL designers must ensure the datapath can sustain peak throughput without pipeline bubbles.
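As a sanity check on those numbers, peak FP32 throughput is just units × rate arithmetic (one FMA = 2 FLOPs). The configuration below — 128 SMs with 128 FMA lanes at 2 GHz — is hypothetical, chosen only to land inside the quoted range:

```python
# Peak-throughput arithmetic: FLOPS = FMA units x 2 FLOPs x clock.
# The GPU configuration here is hypothetical, for illustration only.

def peak_tflops(num_sms, fma_units_per_sm, clock_ghz):
    # FLOPs/cycle x GHz gives GFLOPS; divide by 1e3 for TFLOPS
    return num_sms * fma_units_per_sm * 2 * clock_ghz / 1e3

print(peak_tflops(128, 128, 2.0))  # 65.536 — inside the 40-80 TFLOPS range
```

Sustaining that peak is the hard part: one pipeline bubble per N cycles caps achievable throughput at (N-1)/N of this number, which is why scheduler and forwarding design matter as much as ALU count.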

4.3 Area Optimization

  • Resource sharing — time-multiplex ALUs between warps
  • SRAM compilers — use foundry-optimized SRAM macros vs flip-flop arrays
  • Logic restructuring — share common sub-expressions across datapaths
  • Encoding efficiency — use compressed formats (FP16, BF16, INT8 for AI)

4.4 Timing Closure at Advanced Nodes

Node | Target Freq | Key Challenges
-----|-------------|----------------
7nm | 1.5-1.8 GHz | Wire delay dominates, multi-Vt optimization
5nm | 1.8-2.2 GHz | FinFET variability, increased leakage
4nm/3nm | 2.0-2.5 GHz | GAA transistors, extreme BEOL congestion
RTL for Timing: At 3nm, you must think about timing while writing RTL. Techniques: limit fan-out to <8, break long combinational chains, use retiming-friendly coding, insert pipeline stages at module boundaries.


5. Verification & Debug

5.1 UVM Verification for GPU Blocks

GPU verification shares UVM methodology with SoC, but with GPU-specific challenges:

GPU Verification Hierarchy
══════════════════════════

  Full Chip / SoC Level     — system tests, boot, OS
  GPU Cluster Level         — multi-SM tests, L2, NoC
  SM / Compute Unit Level   — shader programs, scheduling
  Sub-block Level           — ALU, register file, cache
  Unit Level                — individual pipeline stages
GPU Verification Challenge | Approach
---------------------------|----------
Massive parallelism (thousands of threads) | Constrained random with thread-aware coverage
Warp divergence/convergence | Directed tests + coverage on divergence patterns
Floating-point precision | Reference model comparison with tolerance
Memory consistency | Litmus tests, memory ordering checkers
Performance (cycle accuracy) | Performance counters in RTL, throughput assertions

5.2 Formal Equivalence Verification (FEV)

FEV ensures RTL changes don't break functionality — critical in GPU design where frequent optimizations happen:

Tool | Vendor | Use Case
-----|--------|----------
Formality | Synopsys | RTL-to-gate equivalence after synthesis
Conformal LEC | Cadence | Logic equivalence checking
JasperGold | Cadence | Property checking, formal proofs
VC Formal | Synopsys | Assertion-based formal verification
// SVA assertion example for a GPU pipeline:
// ensure every valid instruction produces a result within N cycles
property instr_completes;
    @(posedge clk) disable iff (!rst_n)
    (valid_in && !stall) |-> ##[1:MAX_LATENCY] valid_out;
endproperty

assert property (instr_completes)
    else $error("Instruction did not complete within %0d cycles", MAX_LATENCY);

5.3 GPU-Specific Debug Techniques

  • Waveform analysis — trace warp scheduler decisions, identify stall causes
  • Performance counters — embed HW counters for IPC, cache hit rate, occupancy
  • Shader ISA simulation — run actual shader programs on RTL, compare with golden model
  • Memory trace analysis — verify coalescing, bank conflict detection
  • Power estimation — use switching activity from simulation for power analysis
  • Protocol checkers — AXI/ACE/CHI protocol monitors on interfaces


6. TCL Scripting for Design Automation

6.1 Synthesis Automation

# TCL script for synthesis — GPU block

# Read design files
set search_path    [list ./rtl ./lib ./constraints]
set target_library {foundry_3nm_ss_0p72v_125c.db}
set link_library   "* $target_library"

# Analyze and elaborate
analyze -format sverilog [glob ./rtl/*.sv]
elaborate gpu_sm_top

# Apply constraints
source ./constraints/gpu_sm.sdc
set_clock_uncertainty 0.05 [get_clocks gpu_clk]
set_max_fanout 8 [current_design]

# Clock gating insertion
set_clock_gating_style -type integrated \
    -minimum_bitwidth 4 \
    -control_point before

# Compile with high effort
compile_ultra -gate_clock -timing_high_effort_script

# Reports
report_timing -max_paths 50         > rpt/timing.rpt
report_area -hierarchy              > rpt/area.rpt
report_power -analysis_effort high  > rpt/power.rpt
report_clock_gating                 > rpt/cg.rpt

# Write outputs
write -format verilog -hierarchy -output netlist/gpu_sm_top.v
write_sdc netlist/gpu_sm_top.sdc

6.2 Place & Route Automation

# TCL for Place & Route — GPU block

# Initialize design
read_verilog netlist/gpu_sm_top.v
read_sdc     netlist/gpu_sm_top.sdc
read_lef     {tech.lef macro.lef}

# Floorplan
floorPlan -r 0.7 0.8 5 5 5 5

# Place SRAM macros (register file, caches)
placeInstance u_regfile_sram 100 200 R0
placeInstance u_l1_cache     300 200 R0

# Power planning
addStripe -layer M8 -width 2 -spacing 2 -set_to_set_distance 40 \
    -nets {VDD VSS}

# Placement and optimization
place_opt_design
clock_opt_design
route_opt_design

# Timing signoff
report_timing -max_paths 100 -early -late

6.3 Lint, CDC, and RDC

# SpyGlass lint and CDC setup for GPU block

# Lint run
set_option enableSV yes
read_file -type sourcelist rtl_files.f
current_goal lint/lint_rtl
run_goal

# CDC (Clock Domain Crossing) analysis
# Critical for GPU: core clock, memory clock, PCIe clock
current_goal cdc/cdc_verify
set_parameter crossingcheck_strictsync yes
run_goal

# Report CDC violations
report_crossings -format csv > cdc_crossings.csv


7. SoC Integration of GPU IP

7.1 Bus Protocols for GPU-SoC Interface

GPU IP Integration in SoC
═════════════════════════

  CPU Cluster ◀──AXI/ACE (coherent)──▶ Interconnect / NoC
                                              │
               ┌──────────────────────────────┼──────────────────┐
               ▼                              ▼                  ▼
          GPU IP                        Memory Ctrl        Display Engine
          (AXI-M, AXI-S, IRQ, DMA)      (DDR/HBM)
Protocol | Use in GPU Integration | Key Features
---------|------------------------|--------------
AXI4 | Non-coherent memory access | Burst, outstanding transactions, ID-based ordering
ACE/ACE-Lite | Cache-coherent access with CPU | Snoop channels, shared/exclusive states
CHI | Next-gen coherent interconnect | Packet-based, scalable, request/response/data/snoop
AXI-Stream | Display pipeline, video output | Unidirectional, no address, continuous flow
APB | GPU configuration registers | Simple, low-bandwidth control interface
SoC Crossover: Understanding AXI/ACE protocols, clock domain crossings at SoC level, and integration debugging is directly transferable to GPU integration work.

7.2 Clock & Power Domains in GPU SoCs

GPU SoC Clock & Power Architecture
════════════════════════════════════

  Power Domain GPU_VDD:  SM 0 … SM N on gpu_clk (2 GHz), each SM individually power-gatable
  Power Domain MEM_VDD:  L2 Cache and Mem Ctrl on mem_clk (1 GHz)
  Power Domain IO_VDD:   PCIe on pcie_clk (250 MHz), Display on disp_clk (pixel clock)

  CDC crossings needed at every domain boundary!

7.3 Customer Collaboration — GPU as IP

When a GPU is delivered as IP to SoC customers (your outsourcing experience is valuable here), the package typically includes:

Deliverable | Description
------------|------------
RTL Package | Encrypted/obfuscated RTL, integration wrapper, config parameters
Integration Guide | Clock/reset requirements, pin descriptions, connectivity rules
Verification IP | UVM agents, reference tests, coverage models for customer verification
Timing Constraints | SDC files, interface timing budgets, multicycle paths
Power Intent | UPF/CPF files, isolation/retention requirements
Programming Guide | Register map, driver interface, initialization sequence
Note: Experience packaging IP for external customers, writing SOWs, and managing outsourced deliverables is a valuable complement to GPU RTL design skills.


8. Modern GPU Trends (2024-2026)

8.1 Chiplet Architecture & NoC

Modern GPU Chiplet Architecture — Package (CoWoS)
════════════════════════════════

  Compute Chiplet (5nm, 32 SMs) ─┐
  Compute Chiplet (5nm, 32 SMs) ─┼── Die-to-Die Interconnect ── IO / Cache Die (6nm)
  Compute Chiplet (5nm, 32 SMs) ─┘                              (L2 Cache + NoC,
                                                                 HBM Controllers,
                                                                 PCIe Gen5)
  HBM3 Stacks (×5) attach to the IO/cache die
NoC Relevance: Chiplet GPUs need sophisticated NoC designs for die-to-die communication, cache coherence across chiplets, and bandwidth management — a direct application of Network-on-Chip expertise.

8.2 AI/ML Acceleration — Tensor Cores

Generation | Operations | Precision | TOPS
-----------|------------|-----------|------
Tensor Core v1 (1st Gen) | 4x4 matrix FMA | FP16 | ~125
Tensor Core v2 (2nd Gen) | Sparse + dense | TF32, BF16, INT8 | ~312
Tensor Core v3 (3rd Gen) | Transformer Engine | FP8, FP16, BF16 | ~1000
Tensor Core v4 (4th Gen) | 2nd gen Transformer | FP4, FP8, FP16 | ~2500
Tensor Core Operation (Simplified)
════════════════════════════════════

  Matrix A (4x4) × Matrix B (4x4) + Matrix C (4x4) = Matrix D (4x4)
   FP16/BF16        FP16/BF16        FP32              FP32

  Done in ONE cycle per Tensor Core!
  → This is how GPUs achieve massive AI throughput
RTL Design Impact: Tensor cores are essentially systolic arrays — a grid of multiply-accumulate units with data flowing through them. Designing these requires understanding of dataflow architecture, mixed-precision arithmetic, and efficient SRAM access patterns.

8.3 Ray Tracing Hardware

Modern GPUs include dedicated RT (Ray Tracing) cores:

Component | Function | RTL Design
----------|----------|-----------
BVH Traversal Unit | Traverse acceleration structure (tree) | Tree walker FSM, stack management
Ray-Box Intersection | Test ray against bounding boxes | FP comparators, slab method
Ray-Triangle Intersection | Test ray against triangle geometry | Möller-Trumbore algorithm in hardware
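The Möller-Trumbore test the table cites is compact enough to show in full. Here is a plain-Python sketch of the arithmetic the RT-core datapath computes (pure floating point, no BVH traversal around it):

```python
# Moller-Trumbore ray-triangle intersection: the cross/dot products
# below are the MAC work a ray-triangle test burns in hardware.

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])

def dot(a, b):
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def sub(a, b):
    return (a[0]-b[0], a[1]-b[1], a[2]-b[2])

def intersect(orig, dirn, v0, v1, v2, eps=1e-9):
    """Return hit distance t along the ray, or None on a miss."""
    e1, e2 = sub(v1, v0), sub(v2, v0)     # triangle edge vectors
    p = cross(dirn, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None                        # ray parallel to triangle
    inv = 1.0 / det
    t_vec = sub(orig, v0)
    u = dot(t_vec, p) * inv                # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(t_vec, e1)
    v = dot(dirn, q) * inv                 # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    return dot(e2, q) * inv                # hit: distance t

# Ray down +z through the unit triangle in the z = 1 plane:
t = intersect((0.2, 0.2, 0.0), (0.0, 0.0, 1.0),
              (0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0))
print(t)  # 1.0
```

Counting the two cross products and four dot products (plus the reciprocal and compares) is where the "18+ MACs per ray-triangle test" figure in Section 9 comes from.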


9. Multiply-Accumulate (MAC) — In Depth

9.1 MAC Fundamentals

The Multiply-Accumulate (MAC) operation is the single most important arithmetic operation in GPU computing. Every pixel rendered, every neural network inference, and every physics simulation ultimately reduces to MAC operations: D = A × B + C

MAC Operation — Hardware View
══════════════════════════════

  A ──┐
      ├──▶ Multiplier (A × B) ──▶ Adder (Prod + C) ──▶ D (Result)
  B ──┘                              ▲
  C ─────────────────────────────────┘

  Single MAC: 1 multiply + 1 add = 2 FLOPs

  FMA (Fused Multiply-Add): the multiplier and adder share a SINGLE
  rounding step → more accurate + saves 1 cycle.
  IEEE 754-2008 mandates FMA as a single op.

MAC vs FMA — Critical Distinction:

Operation | Formula | Rounding | Accuracy | Hardware Cost
----------|---------|----------|----------|---------------
MAC (unfused) | D = round(round(A×B) + C) | Two rounding steps | Lower — error accumulates | Separate multiplier + adder
FMA (fused) | D = round(A×B + C) | One rounding step | Higher — single rounding | Fused unit, wider internal datapath
Why FMA Matters: In deep learning, millions of MAC operations chain together. With unfused MAC, rounding errors accumulate across layers and degrade model accuracy. FMA's single rounding preserves precision. Modern GPUs exclusively use FMA units.
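The single-versus-double rounding difference is easy to reproduce numerically. The sketch below emulates a toy 8-bit-significand float with exact `Fraction` arithmetic — an illustration of the effect, not an IEEE-accurate model:

```python
# Demonstration: with a toy 8-bit-significand format, fusing the
# multiply-add into one rounding step preserves bits that the unfused
# version throws away, especially when A*B nearly cancels against C.

from fractions import Fraction
import math

def rnd(x, sig_bits=8):
    """Round x to sig_bits significand bits (round-to-nearest)."""
    if x == 0:
        return Fraction(0)
    e = math.floor(math.log2(abs(x))) - (sig_bits - 1)
    scale = Fraction(2) ** e
    return Fraction(round(Fraction(x) / scale)) * scale

a, b = Fraction(3, 256) + 1, Fraction(5, 256) + 1   # near-1 operands
c = -1                                               # cancels most of a*b

exact   = a * b + c
unfused = rnd(rnd(a * b) + c)    # two roundings (MAC)
fused   = rnd(a * b + c)         # one rounding  (FMA)

print(abs(unfused - exact), abs(fused - exact))  # unfused error is larger
```

In this example the unfused error is 15× the fused error; chained across millions of MACs in a deep network, that gap is exactly why training stacks insist on FMA.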

9.2 Where MAC Lives in GPU Architecture

MAC/FMA units are embedded at every level of the GPU compute hierarchy:

MAC Usage Across GPU Architecture (per Streaming Multiprocessor)
═══════════════════════════════════

  • Shader / CUDA cores: 32 FP32 FMA units per SM — 32 FMAs per cycle,
    each processing one thread of a warp
  • Tensor Cores (AI/ML): a 4×4 systolic array of MACs = 64 MACs per
    cycle per Tensor Core (4×4 × 4 depth), × 4 Tensor Cores per SM
  • Texture units (graphics): bilinear filtering = 4 MACs per texel,
    trilinear = 8 MACs per texel
  • Ray tracing cores: ray-triangle intersection = 18+ MACs per
    ray-triangle test
GPU Block | MAC Type | Precision | MACs per Cycle (per SM) | Use Case
----------|----------|-----------|--------------------------|----------
Shader Cores | Scalar FMA | FP32, FP16 | 32-128 | General compute, shading
Tensor Cores | Matrix FMA (systolic) | FP8, BF16, FP16, TF32, INT8 | 256-1024 | AI training & inference
Texture Units | Fixed-point MAC | FP16/fixed | 16-32 | Texture filtering
RT Cores | FP MAC | FP32 | Varies | Ray intersection math
SFU | Iterative MAC | FP32 | 8 | sin, cos, rsqrt via polynomial
Scale Perspective: A modern high-end GPU has ~128 SMs. Each SM has ~128 FP32 FMA units + 4 Tensor Cores. Total: ~16,384 scalar FMAs + ~512 Tensor Cores running in parallel = capable of ~80+ TFLOPS FP32 and ~2500+ TOPS INT8.

9.3 MAC/FMA RTL Design — Hardware Implementation

9.3.1 FP32 FMA Microarchitecture

FP32 Fused Multiply-Add Pipeline (5 stages)
═══════════════════════════════════════════

Stage 1 — DECODE & UNPACK
  Extract the IEEE 754 fields:
    A = {sign_a, exp_a[7:0], mant_a[22:0]}  (same for B and C)
  Add the implicit leading 1: mant = {1, frac}.
  Handle special cases: NaN, Inf, Zero, Denormals.

Stage 2 — MULTIPLY
  product_mant = mant_a × mant_b   (24-bit × 24-bit = 48-bit result)
  product_exp  = exp_a + exp_b - 127 (bias)
  product_sign = sign_a XOR sign_b
  Hardware: Booth-encoded Wallace tree multiplier for speed.

Stage 3 — ALIGN
  exp_diff = product_exp - exp_c
  Shift C's mantissa to align with the product: aligned_c = mant_c >> exp_diff
  Key: use a WIDE internal datapath (74+ bits) to avoid precision loss.

Stage 4 — ADD / SUBTRACT
  Same sign:      sum = product + aligned_c
  Different sign: sum = product - aligned_c
  Handle the massive-cancellation case: when A×B ≈ -C the result can
  lose many bits → a Leading Zero Anticipator (LZA) is needed.

Stage 5 — NORMALIZE & ROUND
  1. Count leading zeros (from the LZA)
  2. Left-shift to normalize: 1.xxxxx
  3. Adjust the exponent accordingly
  4. Round to a 23-bit mantissa (IEEE 754 round-to-nearest-even)
  5. Pack: {sign, exp[7:0], frac[22:0]}
  6. Handle overflow → Inf, underflow → Zero

9.3.2 RTL Code — Simplified FMA Pipeline

// Simplified 3-stage FP32 FMA (conceptual — production uses 5+ stages)
module fma_fp32 #(
    parameter MANT_W = 24,              // 23 fraction bits + 1 implicit
    parameter EXP_W  = 8,
    parameter PROD_W = 2*MANT_W         // 48-bit product
)(
    input  logic        clk, rst_n, valid_in,
    input  logic [31:0] op_a, op_b, op_c,
    output logic        valid_out,
    output logic [31:0] result
);

    // Valid pipeline tracks data through all three stages
    logic valid_s1, valid_s2;
    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            {valid_s1, valid_s2, valid_out} <= '0;
        end else begin
            valid_s1  <= valid_in;
            valid_s2  <= valid_s1;
            valid_out <= valid_s2;
        end
    end

    // Stage 1: multiply mantissas, compute exponent
    logic [PROD_W-1:0] product_s1;
    logic [EXP_W:0]    exp_prod_s1;
    logic              sign_prod_s1;
    logic [31:0]       op_c_s1;          // carry C alongside the product

    always_ff @(posedge clk) begin
        product_s1   <= {1'b1, op_a[22:0]} * {1'b1, op_b[22:0]};
        exp_prod_s1  <= op_a[30:23] + op_b[30:23] - 8'd127;
        sign_prod_s1 <= op_a[31] ^ op_b[31];
        op_c_s1      <= op_c;
    end

    // Stage 2: align addend C, perform addition
    logic [PROD_W+1:0] sum_s2;
    logic [EXP_W:0]    exp_s2;

    always_ff @(posedge clk) begin
        // Align and add (simplified — a real design handles sign,
        // shift amount, and a wide internal datapath)
        sum_s2 <= product_s1 + align(exp_prod_s1, op_c_s1);
        exp_s2 <= exp_prod_s1;
    end

    // Stage 3: normalize and round
    always_ff @(posedge clk) begin
        result <= normalize_and_round(sum_s2, exp_s2);
    end

endmodule

9.3.3 Systolic Array — How Tensor Cores Use MACs

Systolic Array (4×4) — Data Flow for Matrix Multiply
═════════════════════════════════════════════════════

  B elements flow DOWN and A elements flow RIGHT through a 4×4 grid of
  MAC cells; partial sums stay in place and accumulate over K cycles.

  Each cell, every cycle:
    accum += a_in × b_in     ← one MAC per cycle
    pass a_in → right
    pass b_in → down

  4×4 array doing K=4 accumulation:
    4 × 4 × 4 = 64 MACs in 4 cycles = 16 MACs per cycle throughput
RTL Design Challenge: The systolic array requires precise data staging — A elements must arrive one cycle apart horizontally, B elements one cycle apart vertically. Skew registers at the array edges handle this timing. The accumulator inside each cell must handle mixed-precision (e.g., FP16 multiply → FP32 accumulate) without overflow.
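The skewed schedule described above can be modeled in a few lines. Rather than shuffling per-cell registers, this sketch uses the timing identity the skew registers create: cell (i, j) receives operand pair (a[i][k], b[k][j]) at cycle t = i + j + k:

```python
# Cycle-by-cycle toy of the output-stationary systolic schedule:
# at cycle t, cell (i, j) fires one MAC on the pair with k = t - i - j
# (edge skew registers create exactly this arrival timing).

def systolic_matmul(A, B):
    n = len(A)
    acc = [[0.0] * n for _ in range(n)]          # per-cell accumulator
    for t in range(3 * n - 2):                   # fill + steady + drain
        for i in range(n):
            for j in range(n):
                k = t - i - j                    # which pair arrives now
                if 0 <= k < n:
                    acc[i][j] += A[i][k] * B[k][j]   # one MAC this cycle
    return acc
```

A 4×4 multiply drains in 3n-2 = 10 cycles here: the corner cell (3, 3) gets its last operand pair at cycle t = 3+3+3 = 9, which is the pipeline-fill overhead an RTL implementation must also absorb.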

9.3.4 MAC in the Multiplier — Booth Encoding

Booth-Encoded Wallace Tree Multiplier (used inside every MAC unit)
══════════════════════════════════════

24-bit × 24-bit mantissa multiplication:

Radix-4 Booth encoding reduces 24 partial products to 12. Each 2 bits of
multiplier B (together with one overlap bit from the group below; the
table shows the overlap-bit = 0 case) encode one signed digit:

  B[1:0]   Action   Partial Product
  ──────   ──────   ───────────────
    00      +0       0
    01      +1×A     A
    10      -2×A     -(A << 1)
    11      -1×A     -A

Wallace tree compression:

  12 partial products
  → Layer 1: 3:2 compressors → 8 rows
  → Layer 2: 3:2 compressors → 6 rows
  → Layer 3: 3:2 compressors → 4 rows
  → Layer 4: 3:2 compressors → 3 rows
  → Layer 5: 3:2 compressors → 2 rows
  → Final:   CPA (carry-propagate add)

  Total delay: ~5 compressor levels + 1 CPA = very fast parallel multiply

9.4 Precision Formats — Which MAC for Which Task

Number Formats Used in GPU MACs ════════════════════════════════ FP32 (Single Precision): ┌──┬──────────┬───────────────────────┐ │S │ Exponent │ Mantissa │ 32 bits total │1 │ 8 bits │ 23 bits │ Range: ±3.4×10³⁸ └──┴──────────┴───────────────────────┘ Precision: ~7 decimal digits FP16 (Half Precision): ┌──┬───────┬──────────┐ │S │ Exp │ Mantissa │ 16 bits total │1 │ 5 bit │ 10 bits │ Range: ±65504 └──┴───────┴──────────┘ Precision: ~3 decimal digits BF16 (Brain Float): ┌──┬──────────┬───────┐ │S │ Exponent │ Mant │ 16 bits total │1 │ 8 bits │ 7 bit │ Same range as FP32! └──┴──────────┴───────┘ Precision: ~2 decimal digits TF32 (Tensor Float): ┌──┬──────────┬──────────┐ │S │ Exponent │ Mantissa │ 19 bits (internal only) │1 │ 8 bits │ 10 bits │ Range of FP32, precision of FP16 └──┴──────────┴──────────┘ Used only inside Tensor Cores FP8 (E4M3 / E5M2): ┌──┬──────┬─────┐ │S │ Exp │Mant │ 8 bits total │1 │ 4/5 │ 3/2 │ Very low precision, very fast └──┴──────┴─────┘ Used for inference INT8: ┌────────────┐ │ 8-bit int │ Range: -128 to +127 └────────────┘ No exponent — fixed point
| Format | Bits | MAC/cycle (per Tensor Core) | Use Case | Accuracy Trade-off |
|---|---|---|---|---|
| FP32 | 32 | 16 | Scientific computing, graphics shading | Highest — gold standard |
| TF32 | 19 | 64 | AI training (default) | Good — minimal loss vs FP32 |
| BF16 | 16 | 128 | AI training (mixed precision) | Good — same range as FP32 |
| FP16 | 16 | 128 | Inference, graphics | Moderate — limited range |
| FP8 | 8 | 256 | Inference, fast training | Low — needs careful scaling |
| INT8 | 8 | 256 | Inference deployment | Requires quantization-aware training |
| FP4 | 4 | 512 | Aggressive inference | Very low — experimental |
RTL Impact: Each precision format requires a different multiplier width, accumulator width, and rounding logic. Supporting multiple formats in one MAC unit requires multiplexed datapaths or reconfigurable multiplier arrays. The trend is toward flexible MAC units that handle FP32/TF32/BF16/FP16/FP8/INT8 in the same hardware, selectable per instruction.
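The range/precision trade-offs among these formats can be demonstrated with a few bit-level conversions. This is a sketch: `to_bf16` truncates rather than rounds (real MAC rounding logic varies by design), and Python's `struct` format code `'e'` provides IEEE half precision.

```python
import struct

def f32_bits(x):
    """Bit pattern of x as IEEE-754 single precision (1/8/23)."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def to_bf16(x):
    """BF16 is FP32 with the low 16 bits dropped (1/8/7): modeled by
    truncation here. Same 8-bit exponent, so same dynamic range as FP32."""
    return struct.unpack('<f', struct.pack('<I', f32_bits(x) & 0xFFFF0000))[0]

def to_fp16(x):
    """Round-trip through IEEE half precision (1/5/10)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

pi = 3.14159265
print(to_bf16(pi))    # 3.140625: only ~2-3 significant digits survive
print(to_fp16(pi))    # 3.140625: 10 mantissa bits, but range capped at 65504
print(to_bf16(1e38))  # close to 1e38: BF16 inherits FP32's exponent range,
                      # a value FP16 cannot represent at all
```

This is exactly why BF16 is preferred for training gradients (range matters more than precision) while FP16 needs loss scaling to stay inside its representable range.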

Mixed-Precision Training — How Formats Combine:

Mixed-Precision Training Flow ══════════════════════════════ Forward Pass: ┌──────────┐ ┌──────────────┐ ┌──────────┐ │ Weights │ │ Tensor Core │ │ Activations│ │ FP16 │────▶│ FP16 × FP16 │────▶│ FP16 │ │ │ │ + FP32 accum │ │ │ └──────────┘ └──────────────┘ └──────────┘ Loss Calculation: FP32 (full precision) Backward Pass: ┌──────────┐ ┌──────────────┐ ┌──────────┐ │Gradients │ │ Tensor Core │ │ Weight │ │ FP16 │────▶│ FP16 × FP16 │────▶│ Updates │ │ │ │ + FP32 accum │ │ FP32 │ └──────────┘ └──────────────┘ └──────────┘ │ ▼ Master Weights stored in FP32 (never lose precision)
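The FP32 master-weight box above is not optional, and a toy numeric experiment shows why: in FP16, a typical small weight update rounds away entirely. This sketch emulates FP16 storage with a `struct` round-trip (real frameworks additionally apply loss scaling to keep gradients representable).

```python
import struct

def fp16(x):
    """Round x to the nearest IEEE half-precision value (models FP16 storage)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

lr, grad = 1e-2, 1e-2                    # per-step weight update of 1e-4
w_fp16 = w_master = 1.0
for _ in range(100):
    w_fp16 = fp16(w_fp16 - lr * grad)    # FP16 spacing near 1.0 is ~5e-4,
                                         # so 1.0 - 1e-4 rounds back to 1.0
    w_master -= lr * grad                # FP32 master weight: the update sticks

print(w_fp16)               # 1.0  (100 updates lost to rounding)
print(round(w_master, 6))   # 0.99
```

Keeping the master copy in FP32 and casting down to FP16 only for the Tensor Core GEMMs gives the speed of FP16 compute without this stagnation.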

↑ Back to Table of Contents

10. GPU Deployment for Model Training

10.1 Hardware Stack — What's Needed to Train a Model

Complete GPU Training Infrastructure ═════════════════════════════════════ ┌─────────────────────────────────────────────────────────────┐ │ DATA CENTER RACK │ │ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │ GPU SERVER NODE (1 of many) │ │ │ │ │ │ │ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │ │ │ │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │ │ │ │ │ │ 80GB │ │ 80GB │ │ 80GB │ │ 80GB │ │ │ │ │ │ HBM3 │ │ HBM3 │ │ HBM3 │ │ HBM3 │ │ │ │ │ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ │ │ │ │ │ │ │ │ │ │ │ │ ════╧══════════╧══════════╧══════════╧═════ │ │ │ │ NVLink / GPU-to-GPU Interconnect │ │ │ │ (900 GB/s bidirectional) │ │ │ │ ═══════════════════╤═══════════════════════ │ │ │ │ │ │ │ │ │ ┌──────────────────▼──────────────────────┐ │ │ │ │ │ CPU (Host) │ │ │ │ │ │ 2× Server CPU, 512GB-2TB DDR5 RAM │ │ │ │ │ │ Orchestrates training, data loading │ │ │ │ │ └──────────────────┬──────────────────────┘ │ │ │ │ │ PCIe Gen5 ×16 │ │ │ │ ┌──────────────────▼──────────────────────┐ │ │ │ │ │ Network Interface │ │ │ │ │ │ InfiniBand 400 Gb/s (RDMA) │ │ │ │ │ │ or RoCE (RDMA over Ethernet) │ │ │ │ │ └──────────────────┬──────────────────────┘ │ │ │ │ │ │ │ │ └─────────────────────┼──────────────────────────────────┘ │ │ │ │ │ ══════════════════════╧══════════════════════════════════ │ │ High-Speed Network Fabric (Spine-Leaf) │ │ ═══════╤══════════╤══════════╤══════════╤════════════════ │ │ │ │ │ │ │ │ ┌────▼───┐ ┌────▼───┐ ┌────▼───┐ ┌────▼───┐ │ │ │Node 1 │ │Node 2 │ │Node 3 │ │Node N │ │ │ │8 GPUs │ │8 GPUs │ │8 GPUs │ │8 GPUs │ │ │ └────────┘ └────────┘ └────────┘ └────────┘ │ │ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │ STORAGE SUBSYSTEM │ │ │ │ High-speed NVMe SSDs (local) + Parallel filesystem │ │ │ │ (Lustre / GPFS / WekaFS) for shared training data │ │ │ │ Capacity: 100s of TB to PBs │ │ │ └────────────────────────────────────────────────────────┘ │ │ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │ COOLING & POWER │ │ │ │ Liquid cooling (direct-to-chip) for GPU nodes │ │ │ │ Power: 5-10 kW per GPU node │ │ │ │ Total cluster: 100 kW to 100+ MW │ │ │ └────────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────┘

Hardware Component Breakdown:

| Component | Purpose | Specification (Typical) | Why It's Needed |
|---|---|---|---|
| GPU | Compute (MAC operations) | 80-192 GB HBM3, 1000+ TFLOPS | All matrix math for training happens here |
| HBM Memory | GPU memory | 80-192 GB per GPU, 2-4 TB/s bandwidth | Holds model weights, activations, gradients |
| GPU Interconnect | GPU-to-GPU within node | 900 GB/s bidirectional per link | Gradient synchronization, tensor parallelism |
| Host CPU | Data loading, orchestration | 64-128 cores, 512 GB-2 TB DDR5 | Preprocesses data, feeds GPUs |
| System RAM | CPU memory | 512 GB-2 TB DDR5 | Dataset buffering, CPU-side processing |
| Network (InfiniBand) | Node-to-node communication | 400 Gb/s, RDMA | Multi-node distributed training |
| NVMe SSDs | Local fast storage | 4-16 TB per node, 7 GB/s | Checkpoint saving, data staging |
| Parallel Filesystem | Shared storage | PB-scale, 100+ GB/s aggregate | Training dataset access from all nodes |
| Network Switches | Spine-leaf fabric | 400 Gb/s ports, low latency | Connects all nodes with full bisection bandwidth |
| Cooling | Thermal management | Direct liquid cooling | GPUs generate 300-700 W each; air cooling insufficient at scale |
| Power Distribution | Electrical infrastructure | 5-10 kW per node | Clean power delivery, redundancy (UPS) |

10.2 Software Stack — From Hardware to Model

GPU Training Software Stack ════════════════════════════ ┌─────────────────────────────────────────────────┐ │ APPLICATION LAYER │ │ Training script (Python) │ │ model.train(), optimizer.step() │ ├─────────────────────────────────────────────────┤ │ ML FRAMEWORK │ │ PyTorch / TensorFlow / JAX │ │ Defines model, loss, optimizer, data pipeline │ ├─────────────────────────────────────────────────┤ │ DISTRIBUTED TRAINING LIBRARIES │ │ DeepSpeed / Megatron / FSDP / Horovod │ │ Model parallelism, data parallelism, ZeRO │ ├─────────────────────────────────────────────────┤ │ GPU COMPUTE LIBRARIES │ │ cuDNN (neural net primitives) │ │ cuBLAS (matrix operations — GEMM) │ │ NCCL (multi-GPU communication) │ │ cuFFT, cuSPARSE, cuRAND │ ├─────────────────────────────────────────────────┤ │ RUNTIME / COMPILER │ │ CUDA Runtime / ROCm / oneAPI │ │ Kernel compilation, memory management │ │ Stream scheduling, async execution │ ├─────────────────────────────────────────────────┤ │ GPU DRIVER │ │ Kernel-mode driver │ │ GPU context management, memory mapping │ │ PCIe / NVLink communication │ ├─────────────────────────────────────────────────┤ │ OPERATING SYSTEM │ │ Linux (Ubuntu / RHEL / Rocky) │ │ Kernel modules, IOMMU, huge pages │ ├─────────────────────────────────────────────────┤ │ FIRMWARE │ │ GPU VBIOS / firmware │ │ BMC (Baseboard Management Controller) │ │ NIC firmware (InfiniBand / RoCE) │ ├─────────────────────────────────────────────────┤ │ HARDWARE │ │ GPU silicon → HBM → PCIe/NVLink → CPU → Network │ └─────────────────────────────────────────────────┘

Software Dependencies — Detailed:

| Layer | Software | Function | Depends On |
|---|---|---|---|
| Framework | PyTorch | Model definition, autograd, training loop | Python, CUDA, cuDNN, NCCL |
| Framework | TensorFlow | Graph-based model execution | Python, CUDA, cuDNN, XLA |
| Framework | JAX | Functional transformations, JIT compilation | Python, XLA, CUDA |
| Distributed | DeepSpeed | ZeRO optimizer, pipeline parallelism | PyTorch, NCCL, MPI |
| Distributed | Megatron-LM | Tensor + pipeline parallelism for LLMs | PyTorch, NCCL, CUDA |
| Math Library | cuBLAS | GEMM (General Matrix Multiply) — the MAC workhorse | CUDA Runtime, GPU Driver |
| DNN Library | cuDNN | Convolution, attention, pooling, normalization | cuBLAS, CUDA Runtime |
| Communication | NCCL | All-reduce, broadcast across GPUs | GPU Driver, NVLink/PCIe/InfiniBand |
| Runtime | CUDA Toolkit | Kernel launch, memory management, streams | GPU Driver, Linux kernel |
| Driver | GPU Kernel Driver | Hardware abstraction, context management | Linux kernel, GPU firmware |
| Networking | MPI / RDMA | Inter-node communication | InfiniBand driver, OS kernel |
| Container | Docker + GPU Runtime | Reproducible environment packaging | OS, GPU Driver (must match) |
| Orchestration | Kubernetes + GPU plugin | Cluster scheduling, multi-job management | Docker, network fabric, storage |
Key Dependency Chain: Training script → PyTorch → cuDNN → cuBLAS → CUDA Runtime → GPU Driver → GPU Firmware → GPU Silicon (MAC units). Every layer must be version-compatible. A driver mismatch or cuDNN version conflict can prevent training entirely.

10.3 Multi-GPU Training Topologies

Data Parallelism vs Model Parallelism vs Pipeline Parallelism ══════════════════════════════════════════════════════════════ DATA PARALLELISM (most common): ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │ │ Full │ │ Full │ │ Full │ │ Full │ │ Model │ │ Model │ │ Model │ │ Model │ │ Copy │ │ Copy │ │ Copy │ │ Copy │ │ │ │ │ │ │ │ │ │ Batch/4 │ │ Batch/4 │ │ Batch/4 │ │ Batch/4 │ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │ │ │ │ └──────────┴──────────┴──────────┘ All-Reduce gradients (via NCCL over NVLink) TENSOR PARALLELISM (for large layers): ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │ │ Layer │ │ Layer │ │ Layer │ │ Layer │ │ Cols │ │ Cols │ │ Cols │ │ Cols │ │ [0:N/4] │ │[N/4:N/2]│ │[N/2:3N/4]│ │[3N/4:N]│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │ │ │ │ └──────────┴──────────┴──────────┘ All-Reduce partial results PIPELINE PARALLELISM (for deep models): ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ GPU 0 │──▶│ GPU 1 │──▶│ GPU 2 │──▶│ GPU 3 │ │ Layers │ │ Layers │ │ Layers │ │ Layers │ │ 1-10 │ │ 11-20 │ │ 21-30 │ │ 31-40 │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ Micro-batches flow through stages like an assembly line
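The gradient synchronization step in data parallelism reduces to one collective operation. The pure-Python stand-in below shows the semantics of NCCL's all-reduce followed by averaging; it is only illustrative (real all-reduce runs a ring or tree algorithm over NVLink/InfiniBand, and `all_reduce_mean` is a made-up name).

```python
def all_reduce_mean(grads_per_gpu):
    """Average per-replica gradient vectors so that every replica ends up
    with the identical result: the net effect of an all-reduce (sum)
    followed by division by the world size."""
    world = len(grads_per_gpu)
    n = len(grads_per_gpu[0])
    avg = [sum(g[i] for g in grads_per_gpu) / world for i in range(n)]
    return [list(avg) for _ in range(world)]  # each replica gets a full copy

# 4 "GPUs", each holding gradients computed on its quarter of the batch
local_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
synced = all_reduce_mean(local_grads)
print(synced[0])   # [4.0, 5.0]: every replica now applies the same update
```

After this step all model copies take an identical optimizer step, which is what keeps the replicas in lockstep across the whole cluster.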

GPU Interconnect Topologies:

| Topology | Bandwidth | Scope | Use Case |
|---|---|---|---|
| NVLink (within node) | 900 GB/s per GPU | 8 GPUs in one server | Tensor parallelism, fast all-reduce |
| NVSwitch | Full bisection | All-to-all within node | Any GPU can talk to any GPU at full speed |
| InfiniBand (across nodes) | 400 Gb/s per port | Thousands of nodes | Data parallelism gradient sync |
| PCIe Gen5 | 64 GB/s per ×16 | GPU ↔ CPU | Data loading, CPU-GPU transfer |
| RoCE v2 | 100-400 Gb/s | Ethernet-based clusters | Lower-cost alternative to InfiniBand |

10.4 End-to-End Training Workflow

How a Model Gets Trained — Step by Step ═════════════════════════════════════════ ┌──────────────────┐ │ 1. PREPARE DATA │ │ │ │ Raw data (text, │ ┌──────────────────┐ │ images, etc.) │────▶│ Tokenize / │ │ stored on │ │ Preprocess │ │ parallel FS │ │ (CPU job) │ └──────────────────┘ └────────┬─────────┘ │ ▼ ┌──────────────────────────────────────────────┐ │ 2. LOAD MODEL & DISTRIBUTE │ │ │ │ Initialize model weights (random or │ │ from checkpoint) │ │ │ │ Distribute across GPUs: │ │ - Data parallel: copy model to all GPUs │ │ - Tensor parallel: shard layers across GPUs │ │ - Pipeline parallel: assign layer groups │ └──────────────────────────┬────────────────────┘ │ ▼ ┌──────────────────────────────────────────────┐ │ 3. TRAINING LOOP (repeat millions of times) │ │ │ │ a) DataLoader fetches batch from storage │ │ CPU → GPU transfer via PCIe/pinned memory │ │ │ │ b) FORWARD PASS │ │ Input → Layer 1 → Layer 2 → ... → Output │ │ Each layer: GEMM (MAC!) + activation │ │ Uses: cuBLAS GEMM → Tensor Cores → MACs │ │ │ │ c) LOSS COMPUTATION │ │ Compare output to ground truth │ │ Compute scalar loss value │ │ │ │ d) BACKWARD PASS (backpropagation) │ │ Compute gradients via chain rule │ │ Each layer: another GEMM (MAC!) for grads │ │ Memory intensive — store activations │ │ │ │ e) GRADIENT SYNC (multi-GPU) │ │ All-reduce gradients via NCCL │ │ NVLink within node, InfiniBand across │ │ │ │ f) OPTIMIZER STEP │ │ Update weights: W = W - lr × gradient │ │ Adam/AdamW: also maintains momentum │ │ │ │ g) CHECKPOINT (periodically) │ │ Save model state to NVMe/parallel FS │ │ Enables restart if hardware fails │ └──────────────────────────────────────────────┘ │ ▼ ┌──────────────────────────────────────────────┐ │ 4. EVALUATION & EXPORT │ │ │ │ Run validation dataset through model │ │ Measure accuracy/perplexity/loss │ │ Export final weights for inference │ └──────────────────────────────────────────────┘
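Steps 3a-3g shrink to a few lines if the "model" is a two-parameter linear fit. This CPU-only sketch is purely illustrative (no real framework, multi-GPU and I/O steps reduced to comments), but each labeled line maps to one box in the workflow above.

```python
import random

# Toy dataset: y = 2x + 1 exactly; the model learns y_hat = w*x + b
random.seed(0)
data = [(x, 2.0 * x + 1.0) for x in [i / 10 for i in range(-20, 21)]]
w, b, lr = 0.0, 0.0, 0.05
checkpoints = []

for epoch in range(200):                       # 3) training loop
    random.shuffle(data)                       # a) "DataLoader" fetches a batch
    gw = gb = loss = 0.0
    for x, y in data:
        y_hat = w * x + b                      # b) forward pass (the GEMM/MAC step)
        err = y_hat - y
        loss += err * err / len(data)          # c) loss computation (MSE)
        gw += 2 * err * x / len(data)          # d) backward pass (chain rule)
        gb += 2 * err / len(data)
    # e) on multi-GPU, gradients would be all-reduced here (NCCL)
    w -= lr * gw                               # f) optimizer step: W = W - lr*grad
    b -= lr * gb
    if epoch % 50 == 0:
        checkpoints.append({"w": w, "b": b})   # g) periodic checkpoint

print(f"w={w:.3f}, b={b:.3f}, loss={loss:.2e}")
```

At scale only the shapes change: the forward and backward lines become billions of MACs per step, and the checkpoint dict becomes hundreds of GB written to the parallel filesystem.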

Compute Cost at Each Stage:

| Training Step | Primary Hardware | Bottleneck | MAC Usage |
|---|---|---|---|
| Data Loading | CPU + SSD + PCIe | I/O bandwidth | None |
| Forward Pass | GPU Tensor Cores | Compute (MACs) | Maximum |
| Loss Computation | GPU Shader Cores | Minimal | Low |
| Backward Pass | GPU Tensor Cores | Compute + Memory | Maximum |
| Gradient Sync | NVLink / InfiniBand | Network bandwidth | None |
| Optimizer Step | GPU Shader Cores | Memory bandwidth | Low |
| Checkpointing | NVMe SSDs | I/O bandwidth | None |
Key Insight: The forward and backward passes are where 90%+ of GPU compute happens — and it's almost entirely matrix multiply (GEMM), which means MAC operations. This is why GPU performance for AI is measured in MAC throughput (TFLOPS/TOPS), and why Tensor Cores (dense MAC arrays) exist.
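A back-of-envelope calculation shows how GEMM shapes translate into MAC counts and runtime. The layer sizes and the 500 TFLOPS sustained rate below are illustrative assumptions, not figures for any specific GPU or model.

```python
def gemm_macs(m, k, n):
    """MAC count for an (m x k) @ (k x n) matrix multiply:
    one MAC per (row, column, inner-dimension) triple."""
    return m * k * n

# One feed-forward GEMM: 8192 tokens, hidden size 4096 -> 16384 (illustrative)
macs = gemm_macs(8192, 4096, 16384)
flops = 2 * macs                  # 1 MAC = 1 multiply + 1 add = 2 FLOPs
seconds = flops / 500e12          # at an assumed sustained 500 TFLOPS
print(f"{macs:.3e} MACs, {seconds * 1e3:.2f} ms")
```

Multiplying a few milliseconds per GEMM by the thousands of GEMMs per step and millions of steps per run is exactly where the month-long training times in the next table come from.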

Scale Examples:

| Model Size | GPUs Needed | Training Time | Cost (approx.) |
|---|---|---|---|
| 1B parameters | 8 GPUs (1 node) | ~1-3 days | $5K-$15K |
| 7B parameters | 32 GPUs (4 nodes) | ~1-2 weeks | $50K-$150K |
| 70B parameters | 256 GPUs (32 nodes) | ~1-3 months | $1M-$5M |
| 400B+ parameters | 2000+ GPUs (250+ nodes) | ~3-6 months | $10M-$100M+ |

↑ Back to Table of Contents

11. Key Concepts & Q&A

11.1 Fundamental Questions & Answers

Q1: How does GPU latency hiding differ from CPU out-of-order execution?
A: CPUs hide latency by reordering instructions within a single thread (OoO execution with reservation stations, ROB). GPUs hide latency by switching between thousands of threads (warps). When one warp stalls on a memory access, the scheduler immediately issues instructions from another ready warp. This requires massive register files to hold all thread contexts simultaneously, but avoids the complex OoO hardware. The trade-off: GPUs sacrifice single-thread performance for aggregate throughput.
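The effect is easy to see in a toy issue-slot simulation: a single-issue "SM" whose warps alternate a few ALU ops with a long-latency load. All parameters here are illustrative round numbers, and `sm_utilization` is a made-up name, not a real tool.

```python
def sm_utilization(num_warps, mem_latency=400, alu_ops=4, total_cycles=10_000):
    """Toy single-issue SM: each warp issues `alu_ops` 1-cycle instructions,
    then a load that parks it for `mem_latency` cycles. Returns the fraction
    of issue slots filled; the scheduler hides latency by switching warps."""
    ready_at = [0] * num_warps
    remaining = [alu_ops] * num_warps
    issued = 0
    for cycle in range(total_cycles):
        for w in range(num_warps):          # pick the first ready warp
            if ready_at[w] <= cycle:
                issued += 1
                remaining[w] -= 1
                if remaining[w] == 0:       # hit the load: stall this warp
                    ready_at[w] = cycle + 1 + mem_latency
                    remaining[w] = alu_ops
                else:
                    ready_at[w] = cycle + 1
                break
    return issued / total_cycles

print(f"1 warp:    {sm_utilization(1):.1%}")    # ~1%: latency fully exposed
print(f"128 warps: {sm_utilization(128):.1%}")  # ~100%: latency fully hidden
```

This is also why the register file must hold every resident warp's context at once: occupancy is the hardware's latency-hiding budget.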
Q2: What happens when threads in a warp diverge at a branch?
A: The GPU executes both paths sequentially, masking inactive threads using predication. For example, if 20 of 32 threads take the if-path: first execute the if-path with a 20-thread active mask, then execute the else-path with a 12-thread active mask. At the reconvergence point (post-dominator), all 32 threads resume together. This wastes SIMT lanes and reduces throughput. Modern GPUs use independent thread scheduling to allow partial reconvergence before the post-dominator.
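A mask-based issue counter makes the cost concrete. This is a sketch of the classic post-dominator reconvergence scheme only (no independent thread scheduling), and `warp_issue_count` is an illustrative name.

```python
def warp_issue_count(cond_mask, if_len, else_len):
    """Instructions issued by one warp at a divergent branch: a path is
    issued for the whole warp whenever at least one lane takes it, with
    non-participating lanes masked off (predicated)."""
    taken = sum(cond_mask)
    issues = 0
    if taken > 0:                       # if-path, active mask = cond
        issues += if_len
    if taken < len(cond_mask):          # else-path, active mask = ~cond
        issues += else_len
    return issues

uniform  = [1] * 32                     # all lanes agree: one path only
diverged = [1] * 20 + [0] * 12          # 20 of 32 lanes take the if-path
print(warp_issue_count(uniform, 10, 10))    # 10
print(warp_issue_count(diverged, 10, 10))   # 20: both paths serialized
```

Even a single disagreeing lane doubles the issue count for equal-length paths, which is why divergence-free kernels are a standard optimization target.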
Q3: How would you design a clock gating strategy for a GPU SM?
A: Multi-level approach: (1) Fine-grained: gate individual ALU units when no valid instruction — use ICG cells, minimum 4-bit width threshold. (2) Medium-grained: gate entire execution sub-blocks (tensor cores, SFUs) based on instruction type decode. (3) Coarse-grained: power-gate entire SMs when no workload assigned — use retention flops for fast wake-up. (4) Data-dependent: operand isolation on multiplier inputs when not in use. Track clock gating efficiency with coverage metrics targeting >70% gating in idle scenarios.
Q4: Explain memory coalescing in GPU and its RTL implications.
A: When 32 threads in a warp access consecutive memory addresses, the hardware merges them into a single wide memory transaction (e.g., 32×4B = 128B). The coalescing unit in the load/store pipeline compares thread addresses, detects stride patterns, and generates minimum transactions. RTL design requires: address comparison logic for 32 threads, transaction merging FSM, handling of partial coalescing when threads access non-consecutive addresses, and replay logic for bank conflicts in shared memory.
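The address-comparison step reduces, to first order, to counting the distinct 128-byte segments touched by the warp. This sketch ignores alignment, partial transactions, and replay (`num_transactions` is an illustrative name).

```python
def num_transactions(byte_addrs, segment=128):
    """128B memory transactions needed for one warp's per-lane byte
    addresses: one transaction per distinct 128-byte segment touched."""
    return len({addr // segment for addr in byte_addrs})

coalesced = [lane * 4 for lane in range(32)]     # consecutive 4B words
strided   = [lane * 128 for lane in range(32)]   # one lane per segment
print(num_transactions(coalesced))   # 1: fully coalesced, 32 x 4B = 128B
print(num_transactions(strided))     # 32: worst case, 32x the traffic
```

The 32x spread between the two cases is the entire motivation for the coalescing hardware, and for structure-of-arrays data layouts in kernels.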
Q5: How do you approach timing closure on a GPU block at 3nm?
A: (1) RTL phase: limit combinational depth to <12 gates, register all module outputs, use retiming-friendly structures. (2) Synthesis: multi-Vt optimization (HVT for non-critical, LVT for critical paths), aggressive clock gating for power. (3) P&R: macro placement for SRAMs near datapaths, strategic pipeline flop placement, useful skew insertion. (4) Signoff: run timing at worst-case PVT corners (SS, 0.72V, 125°C), fix setup/hold with ECO. Key GPU-specific: the register file-to-ALU path is usually the critical path — focus there first.
Q6: How is a GPU IP block integrated into an SoC?
A: (1) Define interface protocol (AXI4 for memory, APB for config), clock/reset requirements, power domains. (2) Provide integration wrapper with configurable parameters (number of SMs, cache sizes). (3) CDC analysis at all clock boundaries (GPU core clock, memory clock, bus clock). (4) Deliver UPF for power intent, SDC for timing constraints. (5) Provide verification IP (UVM agents, protocol checkers, integration tests). (6) Provide integration guide, known issues, and debug access via JTAG/scan.

11.2 Hands-On Design Exercises

| Exercise | Skills Tested | Key Concepts |
|---|---|---|
| Design a warp scheduler | Microarchitecture thinking, priority logic | Scoreboarding, dependency tracking |
| Design a banked register file | SRAM design, conflict resolution | Bank interleaving, port arbitration |
| Design a memory coalescing unit | Address comparison, transaction merging | CAM-like structures |
| Debug a timing violation | Physical design understanding | Critical path analysis, fixing techniques |
| Write clock gating RTL | Power awareness in coding | ICG instantiation, enable generation |
| Design an AXI-to-internal bridge | Protocol knowledge, FSM design | AXI channels, outstanding transactions |

↑ Back to Table of Contents

12. Study Resources & Preparation Roadmap

12.1 Recommended Study Plan (4-6 Weeks)

| Week | Focus Area | Activities |
|---|---|---|
| Week 1 | GPU Architecture Basics | Study the CUDA programming model, understand SIMT, read GPU architecture whitepapers |
| Week 2 | GPU Microarchitecture | Deep dive into SM structure, warp scheduling, register file design, memory hierarchy |
| Week 3 | RTL & PPA for GPU | Practice GPU-style RTL coding, study clock gating techniques, review timing closure at advanced nodes |
| Week 4 | Verification & FEV | Review formal verification tools (Formality, Conformal), practice SVA assertions |
| Week 5 | Integration & Protocols | Study AXI/ACE/CHI protocols, CDC techniques, power intent (UPF) |
| Week 6 | Design Exercises | Whiteboard exercises, practice explaining designs, solve design problems |

12.2 Essential Reading

| Resource | Type | Why Read It |
|---|---|---|
| GPU Architecture Whitepapers (latest generation) | Whitepaper | Official architecture details with block diagrams |
| "Computer Architecture: A Quantitative Approach" — Hennessy & Patterson (Ch. 4: GPU) | Textbook | Academic foundation of GPU architecture |
| CUDA C++ Programming Guide | Documentation | Understand the software model that drives hardware decisions |
| "A Survey of Techniques for Architecting and Managing GPU Register File" | Research Paper | Deep dive into register file design challenges |
| RDNA/CDNA Architecture Whitepapers | Whitepaper | Alternative GPU architecture perspective |
| Xe Architecture Documentation | Documentation | Another GPU architecture approach |
| Hot Chips Conference Presentations | Conference Talks | Latest GPU architecture announcements and trends |

12.3 SoC-to-GPU Skill Mapping

Engineers with SoC backgrounds have strong transferable foundations for GPU design:

| SoC Skill | GPU Equivalent | Relevance |
|---|---|---|
| SoC RTL design | GPU IP RTL design | Complex pipelined blocks with similar structures |
| UVM verification | GPU block verification | UVM methodology scales to GPU's parallel verification challenges |
| NoC / interconnect | GPU chiplet interconnect | NoC fabric design directly applicable to GPU die-to-die |
| AXI/ACE protocols | GPU-SoC interface | Bridges GPU IP into the SoC seamlessly |
| Security architecture | GPU secure boot, TEE | Security expertise is increasingly critical in GPU designs |
| Customer/outsourcing | GPU IP delivery | Managing IP deliverables and customer integration |
| Automotive standards | Automotive GPU compliance | Critical for automotive GPU deployments |
Key Takeaway: SoC architects bring system-level thinking, security expertise, and integration skills that complement GPU-specific domain knowledge. The combination of NoC, verification, and customer-facing experience is particularly valuable in modern chiplet-based GPU architectures.

↑ Back to Table of Contents