

EFFICIENT BVH & RAY TRACING

Members: Judy Mai & Vivian Wang

(project handin was emailed to staff)

Summary

We implemented an efficient ray tracer parallelized with OpenMP and pthreads, and used pthreads to build the scene BVH in parallel.

Background

Ray tracing is used to render images of 3D objects. It involves sending rays from the camera through each pixel on the screen and finding the closest intersected face, if any. If there is an intersection, more rays are sent out from the intersection point. This repeats until either a light source is reached or a maximum number of bounces has been completed. This quickly becomes compute-intensive because of the sheer number of rays and intersection calculations (a minimal sketch of the per-ray loop is below).
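At a high level, each ray is handled by a loop like the one sketched here. This is only an illustration of the idea, not the actual 15-462 starter code: the types and helpers (Vec3, closest_hit, sample_bounce, and so on) are hypothetical stand-ins.

```cpp
struct Vec3 { float x, y, z; };
Vec3 operator*(const Vec3& a, const Vec3& b) { return {a.x * b.x, a.y * b.y, a.z * b.z}; }
Vec3 operator+(const Vec3& a, const Vec3& b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }

struct Ray { Vec3 origin, dir; };
struct Intersection { bool hit; Vec3 point, emitted, albedo; };

// Hypothetical helpers: find the closest intersected face, and sample the
// next bounce direction from the hit point.
Intersection closest_hit(const Ray& r);
Ray sample_bounce(const Intersection& isect);

// Follow one ray for at most max_depth bounces, accumulating light.
Vec3 trace_ray(Ray ray, int max_depth) {
    Vec3 radiance{0, 0, 0};
    Vec3 throughput{1, 1, 1};
    for (int depth = 0; depth < max_depth; ++depth) {
        Intersection isect = closest_hit(ray);              // the expensive step
        if (!isect.hit) break;                               // ray left the scene
        radiance = radiance + throughput * isect.emitted;    // light source reached
        throughput = throughput * isect.albedo;
        ray = sample_bounce(isect);                          // send a new ray from the hit point
    }
    return radiance;
}
```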

Intersection calculations get expensive very quickly. Even a small screen has thousands of pixels, each of which sends out numerous rays, and a game scene easily contains thousands if not millions of faces; each ray must test every face for an intersection, even though it misses most of them. A BVH (bounding volume hierarchy) tree is one way to decrease the number of computations needed to find the closest intersection point. A BVH groups the faces in a scene into nested clusters, each enclosed by a bounding box. This speeds up ray tracing significantly because the ray only needs to test a cluster's bounding box to eliminate large portions of the faces at once (a sketch of this pruning is below).
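The sketch below shows why this helps: a ray only descends into a subtree if it hits that subtree's bounding box, so whole clusters of faces are skipped with a single cheap box test. Again, the types and helpers (BBox, hit_box, hit_face) are illustrative, not the actual assignment code.

```cpp
#include <algorithm>
#include <limits>
#include <memory>
#include <vector>

struct Ray;                                            // ray type as in the earlier sketch
struct BBox { /* min/max corners */ };
struct Face { /* one triangle */ };

// Hypothetical intersection tests: a cheap ray-box "slab" test and an exact
// ray-triangle test that reports the hit distance t.
bool hit_box(const Ray& r, const BBox& b);
bool hit_face(const Ray& r, const Face& f, float* t);

struct BVHNode {
    BBox bounds;                                       // box enclosing every face below this node
    std::vector<Face> faces;                           // non-empty only at leaves
    std::unique_ptr<BVHNode> left, right;
};

// Returns the distance t of the closest hit under `node`, or infinity if none.
float bvh_closest_hit(const BVHNode& node, const Ray& ray) {
    if (!hit_box(ray, node.bounds))
        return std::numeric_limits<float>::infinity(); // prune the whole cluster at once
    float best = std::numeric_limits<float>::infinity();
    for (const Face& f : node.faces) {                 // leaf: test the few remaining faces
        float t;
        if (hit_face(ray, f, &t) && t < best) best = t;
    }
    if (node.left)  best = std::min(best, bvh_closest_hit(*node.left, ray));
    if (node.right) best = std::min(best, bvh_closest_hit(*node.right, ray));
    return best;
}
```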

Building a BVH in parallel is slightly tricky because we are building a tree, so some parts must be done sequentially: although the node splits on each level of the tree are independent of one another, a child node cannot be split until its parent has been split.

On the other hand, parallelizing the ray tracing itself is much simpler because each pixel can be computed independently.

Image of BVH

Approach

We started off with some base code from Computer Graphics 15-462, implemented using OpenGL and C++. This was previously a homework assignment, but we had to finish implementing ray tracing and indirect lighting.

Strangely enough, our code wasn't running properly on the GHC machines (it ran 3x faster with no parallelism than with 8 threads, even at 4am), so we had to set up our local machines with all the required packages for testing. Hence, all of our data was collected on an Intel(R) Core(TM) i7-4700HQ CPU @ 2.40 GHz machine.

Raytracing

Originally we wanted to implement our ray tracer in CUDA for speedup, since it consists of a massive number of simple arithmetic calculations (exactly what the GPU is built for). However, we were concerned that, because rays traced with indirect lighting do not all terminate at the same time and CUDA executes in blocks, we would not be able to use dynamic scheduling.

This would essentially involve parallelizing calls to the function trace_ray. However, after looking into it further, we had a lot of difficulty properly linking the files in CMake. Since you can't call host functions from the device, we would have had to port over or copy many of the libraries and functions. We also briefly considered using ISPC before realizing that it had the same library issues as CUDA, along with heavy restrictions on types.

Instead, we decided to use pthreads and OpenMP to parallelize ray tracing. Since the color from ray intersections can be calculated independently for each pixel, we could parallelize over pixels, but we decided to parallelize over tiles of the scene rather than pixel by pixel so that each unit of work would not be too fine-grained (a rough sketch of the tile decomposition is below).
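The pthread version looks roughly like the sketch below. raytrace_tile() is a hypothetical per-tile worker that shades every pixel in a rectangular block, and the interleaved tile-to-thread assignment is just one reasonable choice; our actual partitioning may differ in detail.

```cpp
#include <pthread.h>
#include <algorithm>
#include <vector>

void raytrace_tile(int x0, int y0, int x1, int y1);    // hypothetical per-tile worker

struct TileArgs { int thread_id, num_threads, width, height, tile_size; };

void* tile_worker(void* p) {
    TileArgs* a = static_cast<TileArgs*>(p);
    int tiles_x = (a->width  + a->tile_size - 1) / a->tile_size;
    int tiles_y = (a->height + a->tile_size - 1) / a->tile_size;
    // Interleaved assignment: thread i takes tiles i, i+T, i+2T, ...
    for (int t = a->thread_id; t < tiles_x * tiles_y; t += a->num_threads) {
        int tx = (t % tiles_x) * a->tile_size;
        int ty = (t / tiles_x) * a->tile_size;
        raytrace_tile(tx, ty,
                      std::min(tx + a->tile_size, a->width),
                      std::min(ty + a->tile_size, a->height));
    }
    return nullptr;
}

void raytrace_image(int width, int height, int tile_size, int num_threads) {
    std::vector<pthread_t> threads(num_threads);
    std::vector<TileArgs> args(num_threads);
    for (int i = 0; i < num_threads; ++i) {
        args[i] = {i, num_threads, width, height, tile_size};
        pthread_create(&threads[i], nullptr, tile_worker, &args[i]);
    }
    for (int i = 0; i < num_threads; ++i) pthread_join(threads[i], nullptr);
}
```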


BVH Tree Building

Our original goal was to parallelize BVH tree generation with ISPC. As with CUDA, we faced some challenges getting ISPC set up with CMake. After completing setup, we ported the function split_node over to ISPC and encountered some difficulties with its function calls into other files. Fortunately, those functions were simple, so we just made a copy in the ISPC file. The next issue we found was ISPC's restrictions on input and return types. This made passing information into the ISPC function incredibly difficult, and passing more than one piece of information back nearly impossible; we had to hack around it by storing two ints in a double (a sketch of the idea is below). However, after going to all this trouble, we realized that much of the split_node function wasn't parallelizable. The for loops within the function calculate a minimum, which requires a certain amount of sequential code. On top of that, these loops run only num_buckets iterations, which is a relatively small number, so the work was not worth the overhead of passing information to ISPC.
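For the curious, the sketch below shows one way such a hack can work, written in plain C++ rather than our exact ISPC-side code: two 32-bit values are packed into a 64-bit pattern whose bits are reinterpreted as a double. This is only safe as long as no arithmetic is ever done on the packed value.

```cpp
#include <cstdint>
#include <cstring>
#include <cassert>

// Pack two 32-bit ints into the bit pattern of a double (no numeric conversion).
double pack_ints(int32_t a, int32_t b) {
    uint64_t bits = (static_cast<uint64_t>(static_cast<uint32_t>(a)) << 32) |
                    static_cast<uint32_t>(b);
    double packed;
    std::memcpy(&packed, &bits, sizeof packed);   // reinterpret the bits
    return packed;
}

// Recover the two ints from the packed double.
void unpack_ints(double packed, int32_t* a, int32_t* b) {
    uint64_t bits;
    std::memcpy(&bits, &packed, sizeof bits);
    *a = static_cast<int32_t>(bits >> 32);
    *b = static_cast<int32_t>(bits & 0xFFFFFFFFu);
}

int main() {
    int32_t a, b;
    unpack_ints(pack_ints(42, -7), &a, &b);
    assert(a == 42 && b == -7);
}
```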

Instead, we decided to use pthreads and parallelize across calls to split_node. We restructured the code a bit, converting the recursive calls to split_node into work items on a shared work queue. To interact with locks as little as possible, after a successful split each thread pushed only one of the two child nodes onto the work queue and continued working on the other child itself (a sketch of this structure is below).
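Structurally, the build ends up looking something like the sketch below. It uses std::thread and std::mutex for brevity where our code uses raw pthread calls, and Node/split_node() are hypothetical stand-ins, but the shape is the same: pop a node, split it, push only one child, and keep working on the other.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct Node { /* primitive range, bounds, children ... */ };

// Hypothetical: splits `n` in place; returns false if `n` should stay a leaf.
bool split_node(Node* n, Node** left, Node** right);

std::deque<Node*> work_queue;
std::mutex m;
std::condition_variable cv;
int pending = 0;   // nodes queued or currently being processed

void worker() {
    while (true) {
        Node* n;
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return !work_queue.empty() || pending == 0; });
            if (work_queue.empty()) return;          // all work finished
            n = work_queue.front();
            work_queue.pop_front();
        }
        // Keep splitting without returning to the queue: push one child,
        // continue on the other.
        while (true) {
            Node *left, *right;
            if (!split_node(n, &left, &right)) break;   // reached a leaf
            {
                std::lock_guard<std::mutex> lk(m);
                work_queue.push_back(right);
                ++pending;                              // one extra live node now exists
            }
            cv.notify_one();
            n = left;
        }
        {
            std::lock_guard<std::mutex> lk(m);
            --pending;                                  // this chain of splits is done
            if (pending == 0) cv.notify_all();
        }
    }
}

void build_bvh(Node* root, int num_threads) {
    pending = 1;
    work_queue.push_back(root);
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
}
```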

We considered using a different approach to building the BVH tree that might work better with ISPC, but the implementation described in the first link under Resources requires combining results between nodes, which becomes difficult with ISPC's return type restrictions.

Results

Our project's goal was to reduce the time it takes to build the BVH and complete raytracing, so we measured performance by comparing our times against a baseline implementation with no parallelization.

We tested our implementations on four different scenes, each representing a different case:

cbspheres_lambertian (14 primitives) – very few primitives

cow (5856 primitives) – moderate number of primitives, evenly distributed (balanced BVH)

cblucy (133796 primitives) – many primitives unevenly distributed (clustered in the middle)

dragon (105120 primitives) – many primitives, spread relatively evenly across the scene (somewhat balanced BVH)

Image of Scenes


Baseline times with no parallelization (single-threaded on the CPU).
| Scene | Number of primitives | BVH time (s) | Raytracing (s) | Raytracing with indirect light (s) |
|---|---|---|---|---|
| cbspheres | 14 | 0.0001 | 82.9075 | 226.3883 |
| cow | 5856 | 0.0687 | 114.3519 | 356.4323 |
| cblucy | 133796 | 2.8661 | 281.7143 | 660.2168 |
| dragon | 105120 | 2.6075 | 483.296 | 1085.938 |

Note that it makes sense for dragon to take longer to raytrace than cblucy despite cblucy having more primitives: cblucy's primitives are mostly clustered in one region of the scene, whereas dragon has many primitives at nearly every point in the image.


BVH Tree Building

We experimented with different numbers of pthreads when building the BVH tree. The table below shows speedup over the single-threaded baseline; negative values indicate a slowdown.

| Scene | 16 pthreads | 8 pthreads | 4 pthreads | 2 pthreads |
|---|---|---|---|---|
| cbspheres | -22.0x | -11.2x | -5.50x | -4.50x |
| cow | 2.15x | 2.09x | 2.44x | 1.77x |
| cblucy | -1.38x | 1.92x | 2.27x | 1.43x |
| dragon | -1.36x | 1.76x | 2.03x | 1.60x |

We noticed that adding pthreads to parallelize cbspheres actually slowed it down due to the overhead of creating unnecessary threads: with so few primitives in the scene, split_node is only called a handful of times.

With too many pthreads (see the 16-pthread column), thread startup overhead dominates, and in many situations, depending on how the BVH tree is shaped, there is not enough work to keep all the threads busy.


Ray Tracing with pthreads

We experimented with various numbers of pthreads and achieved the fastest raytracing speedup at 32 threads: 4.76x on average with indirect lighting.

With indirect lighting:

| Scene | 32 pthreads | 16 pthreads | 8 pthreads | 4 pthreads | 2 pthreads |
|---|---|---|---|---|---|
| cbspheres | 4.70x | 4.07x | 3.65x | 2.54x | 1.61x |
| cow | 5.71x | 4.84x | 4.38x | 3.11x | 1.88x |
| cblucy | 3.74x | 3.51x | 3.27x | 2.26x | 1.31x |
| dragon | 4.87x | 4.40x | 4.08x | 2.70x | 1.60x |
| Average | 4.76x | 4.21x | 3.85x | 2.65x | 1.60x |
Without indirect lighting:

| Scene | 32 pthreads | 16 pthreads | 8 pthreads | 4 pthreads | 2 pthreads |
|---|---|---|---|---|---|
| cbspheres | 3.62x | 3.32x | 3.09x | 2.28x | 1.42x |
| cow | 4.28x | 3.38x | 3.13x | 2.31x | 1.42x |
| cblucy | 4.01x | 3.49x | 3.08x | 2.21x | 1.33x |
| dragon | 4.00x | 3.67x | 3.22x | 1.28x | 2.30x |
| Average | 3.98x | 3.47x | 3.13x | 2.02x | 1.62x |

Note that speedup steadily decreases with fewer pthreads, which makes sense: raytracing takes a lot of time and, unlike with BVH building, all of the work exists from the beginning, so threads always have work to do.


Ray Tracing with OpenMP

Another implementation we tried was using OpenMP with dynamic scheduling to parallelize raytracing (a sketch of the loop is below).
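The change from the pthread version is small: the same tile loop, but handed to OpenMP with schedule(dynamic) so an idle thread grabs the next unfinished tile as soon as it is done with its current one. As before, raytrace_tile() is a hypothetical helper standing in for our per-tile shading code.

```cpp
#include <algorithm>

void raytrace_tile(int x0, int y0, int x1, int y1);   // hypothetical per-tile worker

void raytrace_image_omp(int width, int height, int tile_size) {
    int tiles_x = (width  + tile_size - 1) / tile_size;
    int tiles_y = (height + tile_size - 1) / tile_size;

    // Dynamic scheduling: tiles are handed out one at a time as threads free up.
    #pragma omp parallel for schedule(dynamic)
    for (int t = 0; t < tiles_x * tiles_y; ++t) {
        int tx = (t % tiles_x) * tile_size;
        int ty = (t / tiles_x) * tile_size;
        raytrace_tile(tx, ty,
                      std::min(tx + tile_size, width),
                      std::min(ty + tile_size, height));
    }
}
```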

With indirect lighting:

| Scene | Baseline time (s) | OpenMP time (s) | Speedup | OpenMP w/ dynamic scheduling time (s) | Speedup |
|---|---|---|---|---|---|
| cbspheres | 226.3883 | 79.6586 | 2.84x | 54.7401 | 4.14x |
| cow | 356.4323 | 89.7448 | 3.97x | 70.5077 | 5.06x |
| cblucy | 660.2168 | 274.6610 | 2.40x | 170.0832 | 3.88x |
| dragon | 1085.938 | 428.7411 | 2.53x | 218.5886 | 4.97x |
| Average | | | 2.94x | | 4.51x |

On raytracing with indirect lighting, we achieved an average of 2.94x speedup using OpenMP, which increased to 4.51x speedup after we included dynamic scheduling.

Without indirect lighting:

| Scene | Baseline time (s) | OpenMP time (s) | Speedup | OpenMP w/ dynamic scheduling time (s) | Speedup |
|---|---|---|---|---|---|
| cbspheres | 82.9075 | 30.7268 | 2.70x | 23.4781 | 3.53x |
| cow | 114.3519 | 38.4358 | 2.98x | 32.1103 | 3.56x |
| cblucy | 281.7143 | 121.9358 | 2.31x | 74.9961 | 3.76x |
| dragon | 483.296 | 210.5504 | 2.30x | 121.931 | 3.96x |
| Average | | | 2.57x | | 3.70x |

On raytracing without indirect lighting, we achieved an average of 2.57x speedup using OpenMP, which increased to 3.70x speedup after we included dynamic scheduling.


Overall Analysis

Overall, our results show that the 32-pthread implementation is the fastest for raytracing, with 4.76x speedup. OpenMP with dynamic scheduling is a close second at 4.51x speedup.

On the other hand, BVH construction is actually better with fewer threads because of startup overhead and the lack of available work early on; it was fastest with 4 pthreads.

One interesting thing we found was that raytracing with indirect lighting got more speedup from parallelization than raytracing without it, likely because the extra per-pixel work better amortizes the parallelization overhead.

Another thing we realized was that our current implementation of the BVH build is not a good fit for parallelization through ISPC because of the restrictions ISPC places on input and return types.

Resources

Below are some graphics articles that we referenced…

List of Work

Equal work was performed by both project members. Much of the code was written in pair programming.