COMP 426 Project
Sam Anthony 40271987

= Heterogeneous Multicore Implementation of 2D Bouncing Balls Simulation Using OpenCL =

The PCAM methodology was used to redesign the bouncing balls simulation as a heterogeneous multicore program. There are four steps in PCAM: Partitioning, Communication, Agglomeration, and Mapping. The program was designed as a pipeline that runs on the CPU and GPU in parallel.

# Partitioning

The first step of PCAM is partitioning, which involves breaking the computation into individual pieces. Functional decomposition was used to split the computation into tasks and extract coarse-grained parallelism from the program. There are four tasks: 1) update the positions of the balls, 2) check for collisions between balls, 3) check for collisions with the edges of the screen, and 4) generate the vertex array. One OpenCL kernel was developed for each of these tasks: move(), collideBalls(), collideWalls(), and genVertices().

Data decomposition was used to extract fine-grained parallelism from the collideBalls() task. The partitioning involves building sets of independent pairs of balls that can be tested for collision in parallel. The set of possible collisions between balls is represented as the edges of a complete graph, and the partition is built by repeatedly finding matchings on this graph. A matching is a set of independent edges, where two edges are independent if they share no vertices; therefore, all pairs of balls (edges) in a matching can be tested for collision in parallel. This partitioning is performed only once, at startup, and it removes the need for any synchronization at run time.

# Communication

The second step of PCAM is communication, where the flow of data from one task to the next is identified. At a high level, the task dependency graph is straightforward, with data flowing linearly from task 1 to 2 to 3 to 4.

    [move] -> [collideBalls] -> [collideWalls] -> [genVertices]

There is a smaller dependency graph within task 2 (collideBalls()). The cells of the partition are computed one after another in sequence, but all collisions within a cell are computed in parallel.

             Cell 1                       Cell n
    +----------------------+     +----------------------+
    |   --> [collision] -- |     |   --> [collision] -- |
    |  /                \  |     |  /                \  |
   ----      ...      -----> ... ----      ...      ---->
    |  \                /  |     |  \                /  |
    |   --> [collision] -- |     |   --> [collision] -- |
    +----------------------+     +----------------------+

# Agglomeration and Mapping

The final two steps of PCAM are agglomeration and mapping. Agglomeration involves grouping tasks together to minimize communication. Finally, groups of tasks are mapped to the processors that will execute them.

The program was developed for a heterogeneous system with a CPU and a GPU. The tasks were divided into two groups: those that operate on balls, and those that operate on vertices. The first three tasks—move, collideBalls, and collideWalls—work with the positions, velocities, etc. of the balls, so they go into the first group. Task 4—genVertices—generates an array of vertices from the positions of the balls, so it goes into the second group.

The GPU typically has many more compute units than the CPU; for example, the development system has 8 CPU cores and 28 GPU compute units. The number of vertices is also an order of magnitude greater than the number of balls, because each ball is drawn with many vertices (32). Therefore, the vertex-related task 4, which has a very high degree of data parallelism, was mapped to the GPU. The other three tasks were mapped to the CPU.
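In OpenCL terms, this mapping amounts to giving each device its own context and command queue; because the CPU and GPU may belong to different vendors' platforms, they cannot share a single context. The following is a minimal sketch of that setup, not the project's actual code: the helper findDevice() is hypothetical, and error handling is omitted.

    #include <CL/cl.h>

    /* Hypothetical helper: scan every platform for the first device of
       the requested type. The CPU and GPU may come from different
       platforms (e.g., different vendors' drivers). */
    static cl_device_id findDevice(cl_device_type type) {
        cl_platform_id platforms[16];
        cl_uint n = 0;
        clGetPlatformIDs(16, platforms, &n);
        if (n > 16) n = 16;
        for (cl_uint i = 0; i < n; i++) {
            cl_device_id dev;
            if (clGetDeviceIDs(platforms[i], type, 1, &dev, NULL) == CL_SUCCESS)
                return dev;
        }
        return NULL;
    }

    /* One context and one in-order command queue per device. Devices on
       different platforms cannot share a context, so each pipeline stage
       gets its own. */
    static void setupDevices(cl_command_queue *cpuQ, cl_command_queue *gpuQ) {
        cl_device_id cpuDev = findDevice(CL_DEVICE_TYPE_CPU);
        cl_device_id gpuDev = findDevice(CL_DEVICE_TYPE_GPU);
        cl_context cpuCtx = clCreateContext(NULL, 1, &cpuDev, NULL, NULL, NULL);
        cl_context gpuCtx = clCreateContext(NULL, 1, &gpuDev, NULL, NULL, NULL);
        *cpuQ = clCreateCommandQueue(cpuCtx, cpuDev, 0, NULL);
        *gpuQ = clCreateCommandQueue(gpuCtx, gpuDev, 0, NULL);
    }

Keeping one queue per device is what later allows the two pipeline stages to execute concurrently.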
A four-stage pipeline with one task per stage was considered, but in the end, the first three tasks were agglomerated and a two-stage pipeline was used. The first stage—consisting of the move, collideBalls, and collideWalls tasks—runs on the CPU. The second stage—genVertices—runs on the GPU.

# Implementation Notes

By Little's Law, the number of items in flight in a pipeline is the product of its bandwidth and latency, so the pipeline's capacity must be at least that product (C >= b * l). In this case, the bandwidth is 1 because each stage computes one frame at a time, and the latency is 2 because there are two stages. Therefore, the memory capacity must be at least twice that required to compute a single frame sequentially.

The positions of the balls are held in two buffers, one on the CPU and one on the GPU. The two stages communicate by copying the CPU buffer to the GPU after the CPU finishes computing the updated positions. Since CPU device memory is allocated from host memory anyway, the CPU-side buffer was created with CL_MEM_USE_HOST_PTR. This allows the CPU buffer to be copied directly from the host to the GPU, without an intermediate copy from CPU device memory to host memory. Note that clEnqueueCopyBuffer() cannot be used because the CPU and GPU may be from different platforms, and thus cannot share a context.

The kernels for each pipeline stage are executed asynchronously from the host, so they run in parallel on the two devices. When both stages have finished, the updated positions are copied forward from the CPU to the GPU.

                         time -->
    CPU | [move]->[collideBalls]->[collideWalls]->event|
                                                       |
    GPU | [genVertices]--------->[draw screen]-------->| [copy buffer]
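To make the timeline concrete, here is a minimal sketch of one frame of the pipeline loop. It assumes the per-device queues from the mapping step, kernel objects for the four tasks, and two position buffers: positions, the host array backing the CL_MEM_USE_HOST_PTR CPU buffer, and gpuPositions, the GPU-side buffer. All of these names are illustrative, most kernel-argument setup is omitted, and the cell-index argument of collideBalls() is an assumption about its signature.

    #include <CL/cl.h>

    /* Assumed globals, created during setup (names are illustrative). */
    extern cl_command_queue cpuQ, gpuQ;
    extern cl_kernel moveKernel, collideBallsKernel, collideWallsKernel,
                     genVerticesKernel;
    extern cl_mem gpuPositions;      /* GPU-side position buffer          */
    extern cl_float2 *positions;     /* host memory backing the CPU buffer */
    extern size_t nBalls, nVertices, pairsPerCell;
    extern cl_uint nCells;

    /* One pipeline iteration: both stages run concurrently, then the
       updated positions are copied forward from the CPU to the GPU. */
    static void runFrame(void) {
        cl_event cpuDone, gpuDone;

        /* Stage 1 on the CPU: the three ball tasks in dependency order.
           collideBalls is launched once per cell of the matching
           partition; pairs within a cell run in parallel. */
        clEnqueueNDRangeKernel(cpuQ, moveKernel, 1, NULL, &nBalls, NULL,
                               0, NULL, NULL);
        for (cl_uint c = 0; c < nCells; c++) {
            /* Hypothetical: pass the current cell index to the kernel. */
            clSetKernelArg(collideBallsKernel, 2, sizeof(cl_uint), &c);
            clEnqueueNDRangeKernel(cpuQ, collideBallsKernel, 1, NULL,
                                   &pairsPerCell, NULL, 0, NULL, NULL);
        }
        clEnqueueNDRangeKernel(cpuQ, collideWallsKernel, 1, NULL, &nBalls,
                               NULL, 0, NULL, &cpuDone);

        /* Stage 2 on the GPU: build vertices from last frame's positions. */
        clEnqueueNDRangeKernel(gpuQ, genVerticesKernel, 1, NULL, &nVertices,
                               NULL, 0, NULL, &gpuDone);

        /* The events live in different contexts, so wait on each one
           separately; a single clWaitForEvents() call over both would
           fail with CL_INVALID_CONTEXT. */
        clWaitForEvents(1, &cpuDone);
        clWaitForEvents(1, &gpuDone);

        /* The CPU buffer was created with CL_MEM_USE_HOST_PTR, so the
           host array `positions` holds the updated data; write it
           straight into the GPU-side buffer. clEnqueueCopyBuffer()
           cannot be used here because the buffers are in different
           contexts. */
        clEnqueueWriteBuffer(gpuQ, gpuPositions, CL_TRUE, 0,
                             nBalls * sizeof(cl_float2), positions,
                             0, NULL, NULL);
    }

The blocking write doubles as the frame's synchronization point on the GPU queue: once it returns, the GPU-side buffer holds the new positions for the next genVertices() launch.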