Kernel number optimization

CUDA kernels have a much lower overhead compared to other GPU APIs¹. Even so, if a large number of them is launched, it can add up.

BEAGLE's core execution model is a list of operations, where each operation is an update to partial likelihood of some node. The caller builds up this list and BEAGLE iterates over it with a for loop, executing each operation one after another. In the CUDA implementation, each operation launches a separate CUDA kernel.

Here's an interactive illustration. Each square is the state of the partial likelihood. Rows are nodes, columns are sites. When the square is blacked out, it means it is being updated. The gray background shows kernel execution.

Kernel execution in BEAGLE

This means that if the tree has N nodes and an update has changed all of them² This can add up, especially for large trees made up of thousands of nodes.

b3's CUDALikelihood, on the other hand, launches at most 2 kernels³. It calculates likelihood projections which are associated with edges instead of nodes. If all leaf edges have been updated², it launches update_leaves which calculates all projections in parallel across both sites and edges themselves.

For the rest of the edges, the propose kernel is called. It sequentially calculates all projections in a single kernel.

Kernel execution in b3

This design helps increase parallelism on expensive moves and minimizes kernel launch overhead.

b3 originally used Vulkan because it was cross-platform. But the overhead turned out to be too great: the Vulkan implementation was 5 times slower than the BEAGLE one. Video game frames have milliseconds to get rendered (8ms for 120fps, ~3ms for 360fps), while each MCMC step in phylogenetic simulations should ideally take at most hundreds of microseconds (600µs for 1 million steps per 10 minutes). ↩
Full updates are the most expensive kind of updates. They are caused, for example, by the tree scale operator or by changing the parameters of the substation model. ↩ ↩²
Technically, both b3 and BEAGLE have to launch one more kernel (so 3 and N + 1 in total, respectively) to get the root likelihood. But its cost is negligible for both. ↩

Kernel number optimization

Footnotes