Bandwidth-Efficient BVH Layout
for Incremental Hardware Traversal

Gábor Liktor Karthik Vaidyanathan
Intel Corporation


Motivation

Low-Cost BVH Traversal Methods

BVH node compression and incremental traversal methods allow efficient HW implementations.

Node Compression
Reduced memory footprint and bandwidth
  • [Mahovsky '05]:
    Bounding plane quantization
  • [Fabianowski and Dingliana '09]:
    Parent plane reuse
Incremental Traversal
Reduced-precision arithmetic
Low power and area cost
  • [Keely '14]:
    BVH quantization with incremental traversal.
  • [Vaidyanathan et al. '16]:
    Previous talk!
    Robust watertight method, enables parent plane reuse.
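As a concrete (and simplified) illustration of the bounding-plane quantization behind these compressed node formats, the C++ sketch below stores each child plane as an 8-bit offset within the parent bounds. The 8-bit width and the rounding helpers are illustrative assumptions, not the exact encodings used in the cited papers.

```cpp
#include <cstdint>
#include <cmath>

// Illustrative compressed node: each child AABB is quantized to 8-bit
// grid coordinates relative to the parent bounds (widths are hypothetical).
struct QuantizedChild {
    uint8_t lo[3];  // quantized lower corner, per axis
    uint8_t hi[3];  // quantized upper corner, per axis
};

// Quantize a child plane position into the parent's [pMin, pMax] range.
// Lower planes round down and upper planes round up, so the quantized box
// always conservatively contains the original child box.
inline uint8_t quantizeLo(float v, float pMin, float pMax) {
    float t = (v - pMin) / (pMax - pMin);      // normalize to [0,1]
    t = std::fmin(std::fmax(t, 0.0f), 1.0f);
    return (uint8_t)std::floor(t * 255.0f);    // round toward -inf
}
inline uint8_t quantizeHi(float v, float pMin, float pMax) {
    float t = (v - pMin) / (pMax - pMin);
    t = std::fmin(std::fmax(t, 0.0f), 1.0f);
    return (uint8_t)std::ceil(t * 255.0f);     // round toward +inf
}
```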

Fixed-function MIMD architectures

Functional parallelism instead of SIMD coherence extraction.
Latency hiding through multithreading and out-of-order execution.

Benefits include:

  • Custom-built, efficient HW units
  • Good handling of divergent traversal
  • The lack of SIMD batching reduces resource (e.g. cache bank) conflicts
  • High throughput through HW pipelining
Previous Work
TRaX [Spjut et al. '09], MIMD TM [Kopta et al. '10]: MIMD processing for incoherent rays
T&I Engine [Nah et al. '11], SGRT [Lee et al. '13]: Dedicated HW units for traversal and intersection

BVH Bandwidth Challenges

Several sources of bandwidth in a ray tracing system…

  • Ray data
    In a MIMD system, ray sorting can be avoided
  • Traversal stack
    Stackless / short-stack methods exist [Laine '10]
  • Primitive data
    Indices and vertices; outside our scope, though we show one optimization
  • BVH node data
    Frequent, potentially incoherent accesses
    Cache-efficiency is crucial

Problem

The reduction in BVH bandwidth is not proportional to the compression of the BVH node data.

New Challenges

  • Systems access nodes through cache hierarchies
  • Our quantized BVH nodes become comparable in size to a cache line
    The smallest HW transaction size
    Goal: minimize redundant data on cache lines, improve cache locality
  • The node's child pointer becomes a significant overhead
    Goal: compressed pointer structure
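To make these overheads concrete, here is a back-of-the-envelope sketch; all sizes (6-byte quantized box, 4-byte pointer, 64-byte line) are illustrative assumptions, not figures from the paper.

```cpp
#include <cstdio>

int main() {
    // Hypothetical sizes, for illustration only.
    const int cacheLine    = 64;  // smallest HW transaction (bytes)
    const int quantizedBox = 6;   // e.g. 6 x 8-bit quantized planes
    const int childPointer = 4;   // full-width 32-bit child pointer
    const int nodeSize     = quantizedBox + childPointer;

    printf("nodes per cache line : %d\n", cacheLine / nodeSize);          // 6
    printf("pointer overhead     : %d%%\n", 100 * childPointer / nodeSize); // 40%
    // With nodes this small, a full-width pointer is a large fraction of
    // each node, and unrelated nodes sharing a line waste fetched bytes
    // unless the layout packs likely-co-traversed nodes together.
    return 0;
}
```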

Importance of Memory Layout


Node Ordering Solutions


Contributions

Node layout and addressing scheme

  • We compress child pointers through clustering
  • We introduce a new node type to reference
    address space changes

Architecture proposal and analysis

  • Specialized for incremental traversal with plane reuse
  • Cycle-accurate cache simulation
  • Show improved L1$ and L2$ bandwidth compared to standard BVH layouts

Ideas for pointer compression


New node type: "Glue Node"

  • Stores a large pointer
  • Can be a child of a compressed internal node
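One possible way to encode this, sketched in C++; the field names and widths are assumptions made for illustration, since the slides only state that a glue node stores a large pointer and can occupy a child slot of a compressed internal node.

```cpp
#include <cstdint>

// Compressed internal node: children are addressed with small local
// offsets within the current cluster (widths are illustrative).
struct CompressedNode {
    uint8_t childOffset[2];   // short, cluster-local child references
    uint8_t quantizedBox[6];  // quantized bounding planes (see earlier sketch)
};

// Glue node: occupies a child slot but carries no geometry; it only
// stores a full-width pointer that redirects traversal into another
// address space, so ordinary nodes never need large pointers.
struct GlueNode {
    uint64_t targetAddress;   // large pointer to the referenced subtree
};
```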

Multi-Level Clustered BVH


We introduce two levels of clustering
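The slides leave the addressing details to the figures; as a hedged sketch, the snippet below shows how cluster-relative child references could be resolved, assuming nodes within a cluster are laid out contiguously and a glue node carries the full pointer when traversal crosses clusters (all names and sizes are hypothetical).

```cpp
#include <cstdint>

// Hypothetical two-level addressing: a node is identified by the base
// address of its cluster plus a small local index within that cluster.
struct NodeRef {
    uint64_t clusterBase;  // full address of the cluster (changes rarely)
    uint32_t localIndex;   // small offset, cheap to store per child
};

constexpr uint32_t kNodeSize = 8;  // bytes per compressed node (illustrative)

// Resolving an ordinary child only needs the small local offset; the
// cost of a full pointer is paid only when a glue node switches clusters.
inline uint64_t resolve(const NodeRef& ref) {
    return ref.clusterBase + uint64_t(ref.localIndex) * kNodeSize;
}

inline NodeRef followGlue(uint64_t glueTarget) {
    return NodeRef{glueTarget, 0};  // enter the new cluster at its root
}
```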

Algorithm



Architecture

System Functional Overview


RT Functional Units


  • TU, PU: dedicated pipelined HW units for traversal and primitive intersection
    Similar to the T&I Engine or SGRT
  • A new unit for processing leaf nodes: LU
    • Functional difference: leaf nodes store no bounding planes
    • A leaf node can also be a glue node
    • Different frequency of encountering leaf and internal nodes

Increase throughput and utilization?

  • Multithreading
  • Out-of-order processing based on cache hits
  • Number of units
    • Follows expected ratio of node types during traversal
    • In our workloads, ~1 triangle is tested for every 8 internal nodes traversed
    • The number of traversed glue nodes is of similar magnitude to the triangle tests
    • This brings the ratio to 8T:2L:1P
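A quick numerical check of that ratio, treating the per-triangle event counts above as round numbers and assuming roughly one leaf node per triangle test:

```cpp
#include <cstdio>

int main() {
    // Approximate events per triangle test, following the slide:
    // ~8 internal nodes traversed per triangle tested, and roughly as
    // many glue nodes as triangles. The LU handles both leaves and glue
    // nodes, so it sees about 2 operations per triangle test.
    const int internalPerTri = 8;  // TU work
    const int leafPerTri     = 1;  // LU work: leaf nodes (assumed ~1 per triangle)
    const int gluePerTri     = 1;  // LU work: glue nodes
    const int triPerTri      = 1;  // PU work

    printf("TU : LU : PU = %d : %d : %d\n",
           internalPerTri, leafPerTri + gluePerTri, triPerTri);  // 8 : 2 : 1
    return 0;
}
```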

Traversal Cluster


  • The reduced bandwidth between L1$ and TU allows sharing the same L1 cache between units
  • We use heavily banked caches
    • L1$-TU: 8-byte crossbar network
    • 64-byte cache lines
  • Bank conflicts
    • Partially hidden by multithreading
    • We store the top 3 levels (7 nodes) of the BVH in a per-unit lookup buffer.
      Most conflicts occur at the top of the tree
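A minimal sketch of such a per-unit lookup buffer; the 7-node threshold follows the slide, while the node size and indexing convention are assumptions.

```cpp
#include <cstdint>

// Top 3 BVH levels of a binary tree = 1 + 2 + 4 = 7 nodes. Keeping a
// private copy per traversal unit removes most bank conflicts, which
// concentrate at the top of the tree where every ray starts.
constexpr uint32_t kTopNodes = 7;
constexpr uint32_t kNodeSize = 8;  // bytes per compressed node (illustrative)

struct TraversalUnit {
    uint8_t topBuffer[kTopNodes * kNodeSize];  // per-unit copy of the root levels

    // Fetch a node: the top of the tree comes from the local buffer,
    // everything else goes through the shared, banked L1.
    const uint8_t* fetchNode(uint32_t nodeIndex, const uint8_t* l1Data) const {
        if (nodeIndex < kTopNodes)
            return &topBuffer[nodeIndex * kNodeSize];
        return &l1Data[nodeIndex * kNodeSize];  // stands in for an L1 access
    }
};
```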

Primitive Unit

  • Pipelined intersection arithmetic
  • 4 fetch units
    • To keep the single pipeline busy
    • 4 fetches/cycle are needed for 1 triangle intersection/cycle

Sub-Clusters

  • Even with heavy banking, we noticed severe bank conflict overhead with 8 TUs per cluster
  • To make 8 units possible, we split the L1 cache into two discrete parts
    • Reduce conflicts through redundancy
  • The resulting blocks, called Sub-Clusters, share the following:
    • Triangle unit
    • Ray interface
    • L2$ interface

Scalable Architecture


Results

Scenes


Workloads captured as PBRT / Embree traces at 512 × 512 pixel resolution.

Compared Methods


Node Ordering

BVH Layouts

  • Regular Index Buffers
    • Leaf nodes store triangle IDs; vertices are fetched via an index buffer lookup
  • Indices in Leaf Extension Nodes
    • Vertex indices are pre-fetched with the leaf nodes
    • Requires an extension node (two 8-byte fetches in total)
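The two leaf layouts, sketched as C++ structs; the field names and widths are illustrative, the only constraints taken from the slide being that the extension variant carries the vertex indices with the leaf and fits in two 8-byte fetches.

```cpp
#include <cstdint>

// Layout 1: regular index buffer. The leaf stores only a triangle id;
// the three vertex indices require an extra, separate index-buffer lookup.
struct LeafRegular {
    uint32_t triangleId;      // used to address the external index buffer
    uint32_t primCount;       // illustrative metadata field
};                            // one 8-byte fetch

// Layout 2: indices in a leaf extension node. The vertex indices travel
// with the leaf, so they arrive in the same two 8-byte fetches and the
// separate index-buffer access is avoided.
struct LeafWithExtension {
    uint32_t triangleId;
    uint32_t vertexIndex[3];  // pre-fetched vertex indices
};                            // 16 bytes = two 8-byte fetches
```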

Bandwidth Measurements


Bandwidth as a Function of Cache Size


  • The relative improvement in L1 bandwidth is constant compared to ODFL
  • When scaling L2 with a fixed L1, the reduction diminishes as L2 increases
    • The coherence of outstanding misses decreases
    • Our clustering heuristic can predict these less reliably
  • Utilization increases with L2 capacity

Throughput and Scalability


Performance bottlenecks:

  • Ratio of units: a bottleneck in scenes where the ratio of internal to leaf nodes is less than 8:1
  • Bus utilization: the L2-LLC bus can serve at most 64 bytes/cycle

Conclusion

BVH Compression
for Practical Cache Hierarchies

  • BV quantization alone is not sufficient
    to realize its bandwidth potential
  • Our layout scheme demonstrates significant bandwidth reduction

Architectural Implications

  • Reduced node bandwidth permits a shared cache among HW units
  • Global ray reordering schemes are less appealing for traversal, but remain important for shading
  • A throughput of several hundred Mrays/s is possible with reasonable bandwidth

Future Work

  • We have not optimized BVH clustering for high performance…
  • We believe an efficient implementation is possible due to its greedy nature
  • Addressing primitive bandwidth compression is an important future challenge

Thank You

We are grateful to the open-source community for the scripts that made this presentation possible

  • jmpress.js: http://jmpressjs.github.io (presentation engine)
  • snap.svg: http://snapsvg.io/ (vector graphics renderer)
  • kramdown: https://kramdown.gettalong.org/ (nice markup language by Thomas Leitner)

Many thanks to Anjul Patney for his help in publishing the slides