author    Sam Anthony <sam@samanthony.xyz>  2024-11-20 12:16:59 -0500
committer Sam Anthony <sam@samanthony.xyz>  2024-11-20 12:16:59 -0500
commit    cc219d02737e7c982eb5d4431b1a7e1d575a37be (patch)
tree      9557d268ebaab60cb094af16c476efedf4d34eea /report.txt
parent    b074a56aa5e4485c4d2b261bf698f5424a02aa00 (diff)
Diffstat (limited to 'report.txt')
-rw-r--r--  report.txt  170
1 file changed, 111 insertions(+), 59 deletions(-)
diff --git a/report.txt b/report.txt
index 1eab1a1..ed03942 100644
--- a/report.txt
+++ b/report.txt
@@ -1,63 +1,115 @@
-COMP 426 Assignment 3
+COMP 426 Project
Sam Anthony 40271987
-= Manycore implementation of 2D bouncing balls simulation on the GPU =
+= Heterogeneous Multicore Implementation of 2D Bouncing Balls Simulation Using OpenCL =
+
+The PCAM methodology was used to redesign the bouncing balls simulation
+as a heterogeneous multicore program. There are four steps in PCAM:
+Partitioning, Communication, Agglomeration, and Mapping. The program
+was designed as a pipeline that runs on the CPU and GPU in parallel.
+
-
-# Simulation
-
-This version uses OpenCL to run the simulation on the GPU. The old C
-functions were transcribed into three OpenCL kernels: move(), which
-updates the positions of the balls; collideWalls(), which reacts to
-collisions between balls and the bounds of the screen; and collideBalls(),
-which reacts to collisions between pairs of balls.
-
-The positions, velocities, and radii of the balls are stored in OpenCL
-buffers in GPU memory. Vectors are now real float2 vectors as opposed
-to structs in prior implementations.
-
-The move() and collideWalls() kernels run with the global work size set
-to the number of balls. Each thread works on one of the balls.
-
-The collideBalls() kernel relies on the partitioning scheme from the
-TBB implementation. The host partitions the set of collisions between
-pairs of balls so that dependencies are separated in different cells of
-the partition. All collisions within a cell may run in parallel without
-synchronization. The partition is represented as a 2D array of pairs of
-indices of balls. Each row of the array represents a cell, and each pair
-of vertices within a row represents a collision between those two balls.
-
-The host creates an OpenCL buffer for each cell of the partition and
-copies the arrays into GPU memory. To run the collideBalls() kernel,
-the host iterates over the partition and sets the kernel argument to the
-appropriate buffer containing the cell. The global work size is set to
-the size of the cell. Each thread works on one collision between a pair
-of balls.
-
-This version of the program uses a different collision formula based on
-the impulse (J) between the two colliding balls. The formula is based
-on this page:
-
-https://introcs.cs.princeton.edu/java/assignments/collisions.html
-
-
-# Graphics
-
-OpenGL is used to draw the balls on screen. The program leverages
-interoperability between OpenCL and OpenGL to minimize data transfer
-between host and GPU.
-
-There are two OpenGL vertex buffer objects that reside on the GPU:
-one containing vertices, and one containing colors. There is also an
-additional OpenCL buffer instantiated with clCreateFromGLBuffer() which
-points to the same data as the GL vertex VBO.
-
-The genVertices() OpenCL kernel sets the vertex buffer according to the
-positions of the balls. The local work size is the number of vertices
-per ball (24 currently.) Each thread sets one vertex. There is one
-work group per ball. The group id is used to index the position array,
-and the global id is used to index the vertex array.
-
-After the vertices are set, the program uses glDrawArrays() with
-GL_TRIANGLE_FAN to draw the balls on-screen.
+# Partitioning
+
+The first step of PCAM is partitioning. It involves breaking the
+computation into individual pieces.
+
+Functional decomposition was used to split the computation into tasks
+and extract coarse-grained parallelism from the program. There are four
+tasks: 1) update the positions of the balls, 2) check for collisions
+between balls, 3) check for collisions with the edges of the screen, and
+4) generate the vertex array. One OpenCL kernel was developed for each
+of these tasks: move(), collideBalls(), collideWalls(), and genVertices().
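As an illustration, the per-ball update performed by the move() task can be sketched as follows. This is a Python sketch of the kernel's logic only; the actual implementation is an OpenCL kernel operating on float2 vectors, and the dt parameter and names here are illustrative, not taken from the source:

```python
def move(positions, velocities, dt):
    """Advance each ball by its velocity; in OpenCL, one work-item per ball."""
    return [(x + vx * dt, y + vy * dt)
            for (x, y), (vx, vy) in zip(positions, velocities)]
```

collideWalls() and collideBalls() have the same per-element shape: each work-item reads and writes the state of one ball (or one independent pair of balls).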
+
+Data decomposition was used to extract fine-grained parallelism from
+the collideBalls() task. The partitioning involves building sets of
+independent pairs of balls that can be tested for collision in parallel.
+The set of possible collisions between balls is represented as the edges
+of a completely connected graph. The partition is built by repeatedly
+finding matchings on this graph. A matching is a set of independent
+edges. Two edges are independent if they don't share any vertices.
+Therefore, all pairs of balls (edges) in a matching can be tested for
+collision in parallel. This partitioning is performed only once, at
+startup, and it removes the need for any synchronization at run time.
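The startup partitioning can be sketched as repeated greedy matching on the complete graph of ball pairs. This is a simplified Python illustration; the matching strategy in the actual program may differ, but the invariant is the same: no ball appears twice within a cell.

```python
def partition_collisions(n):
    """Partition all pairs of n balls (edges of the complete graph) into
    cells, where each cell is a matching: no ball appears in two pairs."""
    remaining = [(i, j) for i in range(n) for j in range(i + 1, n)]
    cells = []
    while remaining:
        used, cell, rest = set(), [], []
        for i, j in remaining:
            if i not in used and j not in used:  # edge independent of cell so far
                cell.append((i, j))
                used.update((i, j))
            else:
                rest.append((i, j))
        cells.append(cell)
        remaining = rest
    return cells
```

Every pair lands in exactly one cell, and all pairs within a cell can be tested for collision concurrently without locks.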
+
+
+# Communication
+
+The second step of PCAM is communication, where the flow of data from
+one task to the next is identified. At a high level, the task dependency
+graph is straightforward, with data flowing linearly from task 1 to 2 to
+3 to 4.
+
+[move] -> [collideBalls] -> [collideWalls] -> [genVertices]
+
+There is a smaller dependency graph within task 2 (collideBalls()).
+The cells of the partition are computed sequentially, one after another,
+but all collisions within a cell are computed in parallel.
+
+ Cell 1 Cell n
+ +----------------------+ +----------------------+
+ | --> [collision] -- | | --> [collision] -- |
+ | / \ | | / \ |
+---- ... -----> ... ---- ... ---->
+ | \ / | | \ / |
+ | --> [collision] -- | | --> [collision] -- |
+ +----------------------+ +----------------------+
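The per-cell dispatch shown above can be sketched as a host-side driver loop. This is a Python illustration; in the real program each cell is one kernel launch with one work-item per pair, and collide_pair stands in for the collision test:

```python
def collide_balls(cells, collide_pair):
    """Run cells one after another; within a cell, the pairs touch
    disjoint balls, so they could all execute in parallel."""
    for cell in cells:
        balls = [b for pair in cell for b in pair]
        assert len(balls) == len(set(balls))  # no ball shared within a cell
        for i, j in cell:                     # in OpenCL: one work-item per pair
            collide_pair(i, j)
```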
+
+
+# Agglomeration and Mapping
+
+The final two steps of PCAM are agglomeration and mapping. Agglomeration
+involves grouping tasks together to minimize communication. Finally,
+groups of tasks are mapped to processors that will execute them.
+
+The program was developed for a heterogeneous system with a CPU and
+a GPU. The tasks were divided into two groups: those that operate on
+balls, and those that operate on vertices. The first three
+tasks—move, collideBalls, and collideWalls—work with the positions,
+velocities, etc. of the balls, so they go into the first group. Task
+4—genVertices—generates an array of vertices from the positions of
+the balls, so it goes in the second group.
+
+The GPU typically has many more compute units than the CPU. For example,
+the development system has 8 CPU cores and 28 GPU CUs. The number of
+vertices is an order of magnitude greater than the number of balls
+because each ball is drawn with many vertices (32). Therefore, the
+vertex-related task 4, which has a very high degree of data-parallelism,
+was mapped to the GPU. The other three tasks were mapped to the CPU.
+
+A four-stage pipeline with one task per stage was considered, but in
+the end, the first three tasks were agglomerated and a two-stage
+pipeline was used. The first stage—consisting of the move,
+collideBalls, and collideWalls tasks—runs on the CPU. The second
+stage—genVertices—runs on the GPU.
+
+
+# Implementation Notes
+
+Little's Law says that the capacity of a pipeline must be at least
+the product of the bandwidth and latency (C >= b * l). In this case,
+the bandwidth is 1 since each stage computes one frame at a time, and
+the latency is 2 because there are two stages. Therefore, the memory
+capacity must be at least twice that required to compute a single frame
+sequentially. The positions of the balls are held in two buffers, one on
+the CPU and one on the GPU. The two stages communicate by copying the CPU
+buffer to the GPU after the CPU finishes computing the updated positions.
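The capacity calculation above can be stated directly (a trivial Python illustration of C >= b * l):

```python
def required_capacity(bandwidth, latency):
    """Little's Law for a pipeline: items in flight >= bandwidth * latency."""
    return bandwidth * latency

# One frame per stage per step (b = 1) and two stages (l = 2) means two
# copies of the ball positions must exist at once: the CPU-side buffer
# and the GPU-side buffer.
```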
+
+Since CPU device memory is allocated from host memory anyway, the CPU-side
+buffer was created with CL_MEM_USE_HOST_PTR. This allows the CPU buffer
+to be directly copied from host to GPU, without needing to copy from CPU
+to host first. Note that clEnqueueCopyBuffer() cannot be used because
+the CPU and GPU may be from different platforms, and thus cannot share
+a context.
+
+The kernels for each pipeline stage are executed asynchronously from
+the host, so they run in parallel on the two devices. When both stages
+have finished, the updated positions are copied forward from the CPU to
+the GPU.
+
+time -->
+
+CPU | [move]->[collideBalls]->[collideWalls]->event|
+ |
+GPU | [genVertices]--------->[draw screen]-------->| [copy buffer]
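The two-stage overlap in the diagram above can be sketched with host threads. This is a Python illustration of the scheduling pattern only; the real program issues asynchronous OpenCL commands on two queues rather than spawning threads, and the stage callbacks here are placeholders:

```python
import threading

def run_pipeline(frames, cpu_stage, gpu_stage, copy_forward):
    """Each iteration: the CPU stage (move/collideBalls/collideWalls) and
    the GPU stage (genVertices/draw) run concurrently; once both finish,
    the updated positions are copied forward from CPU to GPU."""
    for _ in range(frames):
        cpu = threading.Thread(target=cpu_stage)
        gpu = threading.Thread(target=gpu_stage)
        cpu.start(); gpu.start()
        cpu.join(); gpu.join()   # barrier: wait for both stages
        copy_forward()           # hand the new positions to the GPU
```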