 report.txt | 170
 1 file changed, 111 insertions(+), 59 deletions(-)
@@ -1,63 +1,115 @@
-COMP 426 Assignment 3
+COMP 426 Project
 Sam Anthony 40271987
 
-= Manycore implementation of 2D bouncing balls simulation on the GPU =
+= Heterogeneous Multicore Implementation of 2D Bouncing Balls Simulation Using OpenCL =
+
+The PCAM methodology was used to redesign the bouncing balls simulation
+as a heterogeneous multicore program. There are four steps in PCAM:
+Partitioning, Communication, Agglomeration, and Mapping. The program
+was designed as a pipeline that runs on the CPU and GPU in parallel.
+
-
-# Simulation
-
-This version uses OpenCL to run the simulation on the GPU. The old C
-functions were transcribed into three OpenCL kernels: move(), which
-updates the positions of the balls; collideWalls(), which reacts to
-collisions between balls and the bounds of the screen; and collideBalls(),
-which reacts to collisions between pairs of balls.
-
-The positions, velocities, and radii of the balls are stored in OpenCL
-buffers in GPU memory. Vectors are now real float2 vectors as opposed
-to structs in prior implementations.
-
-The move() and collideWalls() kernels run with the global work size set
-to the number of balls. Each thread works on one of the balls.
-
-The collideBalls() kernel relies on the partitioning scheme from the
-TBB implementation. The host partitions the set of collisions between
-pairs of balls so that dependencies are separated in different cells of
-the partition. All collisions within a cell may run in parallel without
-synchronization. The partition is represented as a 2D array of pairs of
-indices of balls. Each row of the array represents a cell, and each pair
-of indices within a row represents a collision between those two balls.
-
-The host creates an OpenCL buffer for each cell of the partition and
-copies the arrays into GPU memory. To run the collideBalls() kernel,
-the host iterates over the partition and sets the kernel argument to the
-appropriate buffer containing the cell. The global work size is set to
-the size of the cell. Each thread works on one collision between a pair
-of balls.
-
-This version of the program uses a different collision formula based on
-the impulse (J) between the two colliding balls. The formula is based
-on this page:
-
-https://introcs.cs.princeton.edu/java/assignments/collisions.html
-
-
-# Graphics
-
-OpenGL is used to draw the balls on screen. The program leverages
-interoperability between OpenCL and OpenGL to minimize data transfer
-between host and GPU.
-
-There are two OpenGL vertex buffer objects that reside on the GPU:
-one containing vertices, and one containing colors. There is also an
-additional OpenCL buffer instantiated with clCreateFromGLBuffer() which
-points to the same data as the GL vertex VBO.
-
-The genVertices() OpenCL kernel sets the vertex buffer according to the
-positions of the balls. The local work size is the number of vertices
-per ball (currently 24). Each thread sets one vertex. There is one
-work group per ball. The group id is used to index the position array,
-and the global id is used to index the vertex array.
-
-After the vertices are set, the program uses glDrawArrays() with
-GL_TRIANGLE_FAN to draw the balls on-screen.
+# Partitioning
+
+The first step of PCAM is partitioning. It involves breaking the
+computation into individual pieces.
+
+Functional decomposition was used to split the computation into tasks
+and extract coarse-grained parallelism from the program. There are four
+tasks: 1) update the positions of the balls, 2) check for collisions
+between balls, 3) check for collisions with the edges of the screen, and
+4) generate the vertex array. One OpenCL kernel was developed for each
+of these tasks: move(), collideBalls(), collideWalls(), and genVertices().
+
+Data decomposition was used to extract fine-grained parallelism from
+the collideBalls() task. The partitioning involves building sets of
+independent pairs of balls that can be tested for collision in parallel.
+The set of possible collisions between balls is represented as the edges
+of a complete graph. The partition is built by repeatedly finding
+matchings on this graph. A matching is a set of independent edges, where
+two edges are independent if they don't share any vertices. Therefore,
+all pairs of balls (edges) in a matching can be tested for collision in
+parallel. This partitioning is performed only once at startup, and it
+removes the need for any synchronization at run time.
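+
+The report does not include the partitioning code; as an illustration,
+the partition might be built with repeated greedy matchings along these
+lines (a sketch only; build_partition() and its sizing bound are
+hypothetical, not the project's actual code):
+
+    /* Sketch: split all n*(n-1)/2 ball pairs into cells (matchings).
+     * Caller provides cells[] and cell_sizes[] with room for up to
+     * 2 * n_balls cells (greedy maximal matchings need at most that). */
+    #include <stdbool.h>
+    #include <stdlib.h>
+
+    typedef struct { int a, b; } Pair;
+
+    void build_partition(int n_balls, Pair **cells, int *cell_sizes,
+                         int *n_cells)
+    {
+        int total = n_balls * (n_balls - 1) / 2;
+        Pair *edges = malloc(total * sizeof *edges);
+        bool *done = calloc(total, sizeof *done);
+        bool *used = malloc(n_balls * sizeof *used);
+        int e = 0, remaining = total, c = 0;
+
+        for (int i = 0; i < n_balls; i++)
+            for (int j = i + 1; j < n_balls; j++)
+                edges[e++] = (Pair){ i, j };
+
+        while (remaining > 0) {
+            /* Greedily collect edges whose endpoints are still free:
+             * the result is a matching, i.e. one cell of the partition. */
+            for (int i = 0; i < n_balls; i++)
+                used[i] = false;
+            cells[c] = malloc(n_balls / 2 * sizeof(Pair));
+            int sz = 0;
+            for (int k = 0; k < total; k++) {
+                if (done[k] || used[edges[k].a] || used[edges[k].b])
+                    continue;
+                used[edges[k].a] = used[edges[k].b] = true;
+                done[k] = true;
+                cells[c][sz++] = edges[k];
+                remaining--;
+            }
+            cell_sizes[c++] = sz;
+        }
+        *n_cells = c;
+        free(edges); free(done); free(used);
+    }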
+
+
+# Communication
+
+The second step of PCAM is communication, where the flow of data from
+one task to the next is identified. At a high level, the task dependency
+graph is straightforward, with data flowing linearly from task 1 to 2
+to 3 to 4.
+
+[move] -> [collideBalls] -> [collideWalls] -> [genVertices]
+
+There is a smaller dependency graph within task 2 (collideBalls()).
+The cells of the partition are computed one after another, but all
+collisions within a cell are computed in parallel.
+
+           Cell 1                        Cell n
+  +----------------------+      +----------------------+
+  |  --> [collision] --  |      |  --> [collision] --  |
+  | /                  \ |      | /                  \ |
+ ----       ...       ------> ... ----      ...      ---->
+  | \                  / |      | \                  / |
+  |  --> [collision] --  |      |  --> [collision] --  |
+  +----------------------+      +----------------------+
+
+
+# Agglomeration and Mapping
+
+The final two steps of PCAM are agglomeration and mapping. Agglomeration
+involves grouping tasks together to minimize communication. Finally,
+groups of tasks are mapped to the processors that will execute them.
+
+The program was developed for a heterogeneous system with a CPU and
+a GPU. The tasks were divided into two groups: those that operate on
+balls, and those that operate on vertices. The first three tasks (move,
+collideBalls, and collideWalls) work with the positions, velocities,
+and other per-ball data, so they go into the first group. Task 4
+(genVertices) generates an array of vertices from the positions of the
+balls, so it goes into the second group.
+
+The GPU typically has many more compute units than the CPU. For example,
+the development system has 8 CPU cores and 28 GPU compute units. The
+number of vertices is an order of magnitude greater than the number of
+balls because each ball is drawn with many vertices (32). Therefore,
+the vertex-related task 4, which has a very high degree of data
+parallelism, was mapped to the GPU. The other three tasks were mapped
+to the CPU.
+
+A four-stage pipeline with one task per stage was considered, but in
+the end, the first three tasks were agglomerated and a two-stage
+pipeline was used. The first stage, consisting of the move,
+collideBalls, and collideWalls tasks, runs on the CPU. The second
+stage, genVertices, runs on the GPU.
+
+
+# Implementation Notes
+
+Little's Law says that the capacity of a pipeline must be at least the
+product of its bandwidth and latency (C >= b * l). In this case, the
+bandwidth is 1 because each stage computes one frame at a time, and
+the latency is 2 because there are two stages. Therefore, the memory
+capacity must be at least twice that required to compute a single frame
+sequentially. The positions of the balls are held in two buffers, one
+on the CPU and one on the GPU. The two stages communicate by copying
+the CPU buffer to the GPU after the CPU finishes computing the updated
+positions.
+
+Since CPU device memory is allocated from host memory anyway, the
+CPU-side buffer was created with CL_MEM_USE_HOST_PTR. This allows the
+CPU buffer to be copied directly from host to GPU, without needing to
+copy from CPU to host first. Note that clEnqueueCopyBuffer() cannot be
+used because the CPU and GPU may be from different platforms, and thus
+cannot share a context.
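+
+For illustration, the two-buffer setup and the forward copy could look
+roughly like this (a sketch; names such as create_position_buffers(),
+forward_positions(), and N_BALLS are assumptions, not the project's
+actual code):
+
+    /* Sketch: one position buffer per device, in two separate OpenCL
+     * contexts. host_pos backs the CPU buffer via CL_MEM_USE_HOST_PTR,
+     * so the CPU stage writes straight into host memory. */
+    #include <CL/cl.h>
+
+    #define N_BALLS 64                   /* assumed ball count */
+
+    static cl_float2 host_pos[N_BALLS];  /* backs the CPU-side buffer */
+
+    void create_position_buffers(cl_context ctx_cpu, cl_context ctx_gpu,
+                                 cl_mem *pos_cpu, cl_mem *pos_gpu)
+    {
+        cl_int err;
+
+        /* The CPU buffer aliases host memory: no device-to-host copy. */
+        *pos_cpu = clCreateBuffer(ctx_cpu,
+                                  CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
+                                  sizeof host_pos, host_pos, &err);
+
+        *pos_gpu = clCreateBuffer(ctx_gpu, CL_MEM_READ_WRITE,
+                                  sizeof host_pos, NULL, &err);
+    }
+
+    void forward_positions(cl_command_queue queue_cpu,
+                           cl_command_queue queue_gpu, cl_mem pos_gpu)
+    {
+        /* Wait for the CPU stage, then push its results into the GPU
+         * context; clEnqueueCopyBuffer() cannot cross two contexts.
+         * (A fully portable version would map the CPU buffer before
+         * reading host_pos directly.) */
+        clFinish(queue_cpu);
+        clEnqueueWriteBuffer(queue_gpu, pos_gpu, CL_TRUE, 0,
+                             sizeof host_pos, host_pos, 0, NULL, NULL);
+    }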
+
+The kernels for each pipeline stage are executed asynchronously from
+the host, so they run in parallel on the two devices. When both stages
+have finished, the updated positions are copied forward from the CPU
+to the GPU.
+
+time -->
+
+CPU | [move]->[collideBalls]->[collideWalls]->event |
+    |                                               |
+GPU | [genVertices]--------->[draw screen]--------->| [copy buffer] |
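+
+Put together, one frame of the pipeline might be driven roughly as
+follows (a sketch; frame() and the kernel handles are illustrative,
+and per-cell argument setting for collideBalls() is omitted):
+
+    /* Sketch of the two-stage pipeline driver (illustrative names).
+     * Both queues are filled without waiting, so the CPU device and
+     * the GPU device work on their stages concurrently. */
+    #include <CL/cl.h>
+
+    void frame(cl_command_queue queue_cpu, cl_command_queue queue_gpu,
+               cl_kernel move_k, cl_kernel collide_balls_k,
+               cl_kernel collide_walls_k, cl_kernel gen_vertices_k,
+               size_t n_balls, size_t n_vertices)
+    {
+        /* Stage 1 on the CPU device: the three agglomerated tasks.
+         * (collideBalls() actually runs once per partition cell.) */
+        clEnqueueNDRangeKernel(queue_cpu, move_k, 1, NULL,
+                               &n_balls, NULL, 0, NULL, NULL);
+        clEnqueueNDRangeKernel(queue_cpu, collide_balls_k, 1, NULL,
+                               &n_balls, NULL, 0, NULL, NULL);
+        clEnqueueNDRangeKernel(queue_cpu, collide_walls_k, 1, NULL,
+                               &n_balls, NULL, 0, NULL, NULL);
+
+        /* Stage 2 on the GPU device, using last frame's positions. */
+        clEnqueueNDRangeKernel(queue_gpu, gen_vertices_k, 1, NULL,
+                               &n_vertices, NULL, 0, NULL, NULL);
+
+        /* Wait for both devices, then copy the positions forward,
+         * e.g. with forward_positions() sketched above. */
+        clFinish(queue_cpu);
+        clFinish(queue_gpu);
+    }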