Bandwidth-Efficient BVH Layout
for Incremental Hardware Traversal

Gábor Liktor Karthik Vaidyanathan
Intel Corporation


Motivation

Low-Cost BVH Traversal Methods

BVH node compression and incremental traversal methods allow efficient HW implementations.

Node Compression
Reduced memory footprint and bandwidth
  • [Mahovsky '05]:
    Bounding plane quantization
  • [Fabianowski and Dingliana '09]:
    Parent plane reuse
Incremental Traversal
Reduced-precision arithmetic
Low power and area cost
  • [Keely '14]:
    BVH quantization with incremental traversal.
  • [Vaidyanathan et al. '16]:
    Previous talk!
    Robust watertight method, enables parent plane reuse.
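As a concrete (and simplified) illustration of the bounding-plane quantization behind these compressed node formats, the C++ sketch below stores each child plane as an 8-bit offset within the parent bounds. The 8-bit width and the rounding helpers are illustrative assumptions, not the exact encodings used in the cited papers.

```cpp
#include <cstdint>
#include <cmath>

// Illustrative compressed node: each child AABB is quantized to 8-bit
// grid coordinates relative to the parent bounds (widths are hypothetical).
struct QuantizedChild {
    uint8_t lo[3];  // quantized lower corner, per axis
    uint8_t hi[3];  // quantized upper corner, per axis
};

// Quantize a child plane position into the parent's [pMin, pMax] range.
// Lower planes round down and upper planes round up, so the quantized box
// always conservatively contains the original child box.
inline uint8_t quantizeLo(float v, float pMin, float pMax) {
    float t = (v - pMin) / (pMax - pMin);      // normalize to [0,1]
    t = std::fmin(std::fmax(t, 0.0f), 1.0f);
    return (uint8_t)std::floor(t * 255.0f);    // round toward -inf
}
inline uint8_t quantizeHi(float v, float pMin, float pMax) {
    float t = (v - pMin) / (pMax - pMin);
    t = std::fmin(std::fmax(t, 0.0f), 1.0f);
    return (uint8_t)std::ceil(t * 255.0f);     // round toward +inf
}
```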

Fixed-function MIMD architectures

Functional parallelism instead of SIMD coherence extraction.
Latency hiding through multithreading and out-of-order execution.

Benefits include:

  • Custom-built, efficient HW units
  • Good handling of divergent traversal
  • The lack of SIMD batching reduces resource (e.g. cache bank) conflicts
  • High throughput through HW pipelining
Previous Work
TRaX [Spjut et al. '09], MIMD TM [Kopta et al. '10]: MIMD processing for incoherent rays
T&I Engine [Nah et al. '11], SGRT [Lee et al. '13]: Dedicated HW units for traversal and intersection

BVH Bandwidth Challenges

Several sources of bandwidth in a ray tracing system…

  • Ray data
    In a MIMD system, ray sorting can be avoided
  • Traversal stack
    Stackless / short-stack methods exist [Laine '10]
  • Primitive data
    Indices and vertices; outside our scope, though we show one optimization
  • BVH node data
    Frequent, potentially incoherent accesses
    Cache-efficiency is crucial

Problem

The reduction in BVH bandwidth is not proportional to the compression of the BVH node data.

New Challenges

  • Systems access nodes through cache hierarchies
  • Our quantized BVH nodes become comparable in size to a cache line
    The smallest HW transaction size
    Goal: minimize redundant data on cache lines, improve cache locality
  • The node's child pointer becomes a significant overhead
    Goal: compressed pointer structure
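To make these overheads concrete, here is a back-of-the-envelope sketch; all sizes (6-byte quantized box, 4-byte pointer, 64-byte line) are illustrative assumptions, not figures from the paper.

```cpp
#include <cstdio>

int main() {
    // Hypothetical sizes, for illustration only.
    const int cacheLine    = 64;  // smallest HW transaction (bytes)
    const int quantizedBox = 6;   // e.g. 6 x 8-bit quantized planes
    const int childPointer = 4;   // full-width 32-bit child pointer
    const int nodeSize     = quantizedBox + childPointer;

    printf("nodes per cache line : %d\n", cacheLine / nodeSize);          // 6
    printf("pointer overhead     : %d%%\n", 100 * childPointer / nodeSize); // 40%
    // With nodes this small, a full-width pointer is a large fraction of
    // each node, and unrelated nodes sharing a line waste fetched bytes
    // unless the layout packs likely-co-traversed nodes together.
    return 0;
}
```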

Importance of Memory Layout


Node Ordering Solutions


Contributions

Node layout and addressing scheme

  • We compress child pointers through clustering
  • We introduce a new node type to reference
    address space changes

Architecture proposal and analysis

  • Specialized for incremental traversal with plane reuse
  • Cycle-accurate cache simulation
  • Show improved L1$ and L2$ bandwidth compared to standard BVH layouts

Ideas for pointer compression


New node type: "Glue Node"

  • Stores a large pointer
  • Can be a child of a compressed internal node
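One possible way to encode this, sketched in C++; the field names and widths are assumptions made for illustration, since the slides only state that a glue node stores a large pointer and can occupy a child slot of a compressed internal node.

```cpp
#include <cstdint>

// Compressed internal node: children are addressed with small local
// offsets within the current cluster (widths are illustrative).
struct CompressedNode {
    uint8_t childOffset[2];   // short, cluster-local child references
    uint8_t quantizedBox[6];  // quantized bounding planes (see earlier sketch)
};

// Glue node: occupies a child slot but carries no geometry; it only
// stores a full-width pointer that redirects traversal into another
// address space, so ordinary nodes never need large pointers.
struct GlueNode {
    uint64_t targetAddress;   // large pointer to the referenced subtree
};
```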

Multi-Level Clustered BVH


We introduce two levels of clustering
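The slides leave the addressing details to the figures; as a hedged sketch, the snippet below shows how cluster-relative child references could be resolved, assuming nodes within a cluster are laid out contiguously and a glue node carries the full pointer when traversal crosses clusters (all names and sizes are hypothetical).

```cpp
#include <cstdint>

// Hypothetical two-level addressing: a node is identified by the base
// address of its cluster plus a small local index within that cluster.
struct NodeRef {
    uint64_t clusterBase;  // full address of the cluster (changes rarely)
    uint32_t localIndex;   // small offset, cheap to store per child
};

constexpr uint32_t kNodeSize = 8;  // bytes per compressed node (illustrative)

// Resolving an ordinary child only needs the small local offset; the
// cost of a full pointer is paid only when a glue node switches clusters.
inline uint64_t resolve(const NodeRef& ref) {
    return ref.clusterBase + uint64_t(ref.localIndex) * kNodeSize;
}

inline NodeRef followGlue(uint64_t glueTarget) {
    return NodeRef{glueTarget, 0};  // enter the new cluster at its root
}
```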

Algorithm



Architecture

System Functional Overview


RT Functional Units


  • TU, PU: dedicated pipelined HW units for traversal and primitive intersection
    Similar to the T&I Engine or SGRT
  • A new unit for processing leaf nodes: LU
    • Functional difference: leaf nodes store no bounding planes
    • A leaf node can also be a glue node
    • Different frequency of encountering leaf and internal nodes

Increase throughput and utilization?

  • Multithreading
  • Out-of-order processing based on cache hits
  • Number of units
    • Follows expected ratio of node types during traversal
    • In our workloads, ~1 triangle is tested for every 8 internal nodes traversed
    • The number of traversed glue nodes is of similar magnitude to the triangle tests
    • This brings the ratio to 8T:2L:1P
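A quick numerical check of that ratio, treating the per-triangle event counts above as round numbers and assuming roughly one leaf node per triangle test:

```cpp
#include <cstdio>

int main() {
    // Approximate events per triangle test, following the slide:
    // ~8 internal nodes traversed per triangle tested, and roughly as
    // many glue nodes as triangles. The LU handles both leaves and glue
    // nodes, so it sees about 2 operations per triangle test.
    const int internalPerTri = 8;  // TU work
    const int leafPerTri     = 1;  // LU work: leaf nodes (assumed ~1 per triangle)
    const int gluePerTri     = 1;  // LU work: glue nodes
    const int triPerTri      = 1;  // PU work

    printf("TU : LU : PU = %d : %d : %d\n",
           internalPerTri, leafPerTri + gluePerTri, triPerTri);  // 8 : 2 : 1
    return 0;
}
```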

Traversal Cluster


  • The reduced bandwidth between L1$ and TU allows sharing the same L1 cache between units
  • We use heavily banked caches
    • L1$-TU: 8-byte crossbar network
    • 64-byte cache lines
  • Bank conflicts
    • Partially hidden by multithreading
    • We store the top 3 levels (7 nodes) of the BVH in a per-unit lookup buffer.
      Most conflicts occur at the top of the tree
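A minimal sketch of such a per-unit lookup buffer; the 7-node threshold follows the slide, while the node size and indexing convention are assumptions.

```cpp
#include <cstdint>

// Top 3 BVH levels of a binary tree = 1 + 2 + 4 = 7 nodes. Keeping a
// private copy per traversal unit removes most bank conflicts, which
// concentrate at the top of the tree where every ray starts.
constexpr uint32_t kTopNodes = 7;
constexpr uint32_t kNodeSize = 8;  // bytes per compressed node (illustrative)

struct TraversalUnit {
    uint8_t topBuffer[kTopNodes * kNodeSize];  // per-unit copy of the root levels

    // Fetch a node: the top of the tree comes from the local buffer,
    // everything else goes through the shared, banked L1.
    const uint8_t* fetchNode(uint32_t nodeIndex, const uint8_t* l1Data) const {
        if (nodeIndex < kTopNodes)
            return &topBuffer[nodeIndex * kNodeSize];
        return &l1Data[nodeIndex * kNodeSize];  // stands in for an L1 access
    }
};
```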

Primitive Unit

  • Pipelined intersection arithmetic
  • 4 fetch units
    • To keep the single pipeline busy
    • 4 fetches/cycle are needed for 1 triangle intersection/cycle

Sub-Clusters

  • Even with heavy banking, we noticed severe bank conflict overhead with 8 TUs per cluster
  • To make 8 units possible, we split the L1 cache into two discrete parts
    • Reduce conflicts through redundancy
  • The resulting blocks, called Sub-Clusters, share the following:
    • Triangle unit
    • Ray interface
    • L2$ interface

Scalable Architecture


Results

Scenes


Workloads captured as PBRT / Embree traces at 512 × 512 pixel resolution.

Compared Methods


Node Ordering

BVH Layouts

  • Regular Index Buffers
    • Leaf nodes store triangle IDs; vertices are fetched via an index buffer lookup
  • Indices in Leaf Extension Nodes
    • Vertex indices are pre-fetched with the leaf nodes
    • Requires an extension node (two 8-byte fetches in total)
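The two leaf layouts, sketched as C++ structs; the field names and widths are illustrative, the only constraints taken from the slide being that the extension variant carries the vertex indices with the leaf and fits in two 8-byte fetches.

```cpp
#include <cstdint>

// Layout 1: regular index buffer. The leaf stores only a triangle id;
// the three vertex indices require an extra, separate index-buffer lookup.
struct LeafRegular {
    uint32_t triangleId;      // used to address the external index buffer
    uint32_t primCount;       // illustrative metadata field
};                            // one 8-byte fetch

// Layout 2: indices in a leaf extension node. The vertex indices travel
// with the leaf, so they arrive in the same two 8-byte fetches and the
// separate index-buffer access is avoided.
struct LeafWithExtension {
    uint32_t triangleId;
    uint32_t vertexIndex[3];  // pre-fetched vertex indices
};                            // 16 bytes = two 8-byte fetches
```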

Bandwidth Measurements


Bandwidth as a Function of Cache Size


  • The relative improvement in L1 bandwidth is constant compared to ODFL
  • When scaling L2 with a fixed L1, the reduction diminishes as L2 increases
    • The coherence of outstanding misses decreases
    • Our clustering heuristic can predict these less reliably
  • Utilization increases with L2 capacity

Throughput and Scalability


Performance bottlenecks:

  • Ratio of units: a bottleneck in scenes where the ratio of internal to leaf nodes is less than 8:1
  • Bus utilization: the L2-LLC bus can serve at most 64 bytes/cycle

Conclusion

BVH Compression
for Practical Cache Hierarchies

  • BV quantization alone is not sufficient
    to realize its bandwidth potential
  • Our layout scheme demonstrates significant bandwidth reduction

Architectural Implications

  • Reduced node bandwidth permits a shared cache among HW units
  • Global ray reordering schemes are less appealing for traversal, but remain important for shading
  • A throughput of several hundred Mrays/s is possible with reasonable bandwidth

Future Work

  • We have not optimized BVH clustering for high performance…
  • We believe an efficient implementation is possible due to its greedy nature
  • Addressing primitive bandwidth compression is an important future challenge

Thank You

We are grateful to the open-source community for the scripts that made this presentation possible

  • jmpress.js: http://jmpressjs.github.io (presentation engine)
  • snap.svg: http://snapsvg.io/ (vector graphics renderer)
  • kramdown: https://kramdown.gettalong.org/ (nice markup language by Thomas Leitner)

Many thanks to Anjul Patney for his help in publishing the slides