GPU-Based Digital Hologram Reconstruction and Particle Detection A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Daniel Benjamin Taylor IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE John Sartori December, 2014 c© Daniel Benjamin Taylor 2014 ALL RIGHTS RESERVED Acknowledgements I would like to acknowledge a number of people for their contributions to my experience and success in graduate school: Professors John Sartori (Electrical Engineering) and Jiarong Hong (Mechanical En- gineering), for both giving me the opportunity to work on this project and for the expertise required to complete it. Mostafa Toloui, for providing a sound basis for this research and for his help in developing the algorithms presented. Professor Brent Hecht (Computer Science), for pushing me to do more than I wanted to and for providing feedback and insight of great value. i Dedication I would like to dedicate this work to my family, for providing the means and motivation to do great things: My parents, for letting me drill holes in the walls, build siege engines in the basement, get aluminum chips everywhere, and generally encouraging doing things of real value. My eldest brother, for setting a standard to live up to. And my #1 fan, for keeping me from slacking off. ii Abstract Digital holograms, when combined with tracer particles, can be used for examining otherwise-invisible fluid flows. These holograms can be captured with standard digital imaging equipment, however processing them to extract tracer or particle locations is computationally expensive. Exacerbating the issue is that hundreds or thousands of holograms must be reconstructed to analyze a single flow. Presented here is a hologram reconstruction and particle extraction system exploiting the massive parallelism of graphics processing units (GPUs) to reduce the time required to locate the particles in 3D space by orders of magnitude. This system requires no expensive proprietary hardware and runs on standard computers, with the only special requirement being an off-the-shelf Nvidia GPU. iii Contents Acknowledgements i Dedication ii Abstract iii List of Tables vi List of Figures vii 1 Introduction 1 2 Background and Motivation 4 2.1 Hologram Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2 CUDA architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.1 Host and device code . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.2 Kernels, threads, warps, blocks, and grids . . . . . . . . . . . . . 7 2.2.3 CUDA memory model . . . . . . . . . . . . . . . . . . . . . . . . 8 3 Related Work 10 4 Implementation 13 4.1 Stage 1: Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.2 Stage 2-3: Thresholding . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 4.2.1 Tenengrad map formation . . . . . . . . . . . . . . . . . . . . . . 16 4.2.2 Combined image, max tenengrad . . . . . . . . . . . . . . . . . . 16 iv 4.2.3 Threshold calculation . . . . . . . . . . . . . . . . . . . . . . . . 18 4.3 Stage 4-6: Centroid Calculation . . . . . . . . . . . . . . . . . . . . . . . 18 4.3.1 2D CCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 4.3.2 3D connected-component labeling (CCL) . . . . . . . . . . . . . 22 4.4 Host code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 5 Results and Discussion 28 5.1 Test setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 5.2 Figures of merit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5.3 Test results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.4.1 Detection Rate and False-Positives . . . . . . . . . . . . . . . . . 31 5.4.2 XY Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.4.3 Z Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.4.4 Execution Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 6 Conclusion 35 References 36 Appendix A. Glossary and Acronyms 38 A.1 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 A.2 Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 v List of Tables 5.1 3 µm performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 A.1 Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 vi List of Figures 2.1 Example reconstructed planes . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 CUDA thread organization . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 CUDA memory organization . . . . . . . . . . . . . . . . . . . . . . . . 9 4.1 Pseudo-code: Stage 1 zKernel . . . . . . . . . . . . . . . . . . . . . . . . 14 4.2 Pseudo-code: Stage 1 memcpyKernel . . . . . . . . . . . . . . . . . . . . 14 4.3 Pseudo-code: Stage 1 operation . . . . . . . . . . . . . . . . . . . . . . . 15 4.4 Pseudo-code: Tenengrad calculation . . . . . . . . . . . . . . . . . . . . 15 4.5 Pseudo-code: Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.6 Pseudo-code: Tenengrad map calculation . . . . . . . . . . . . . . . . . 17 4.7 Pseudo-code: Combined image, max tenengrad . . . . . . . . . . . . . . 17 4.8 Pseudo-code: Threshold calculation . . . . . . . . . . . . . . . . . . . . . 19 4.9 Particle YZ cross-section . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.10 Pseudo-code: 2D CCL, host code . . . . . . . . . . . . . . . . . . . . . . 20 4.11 Pseudo-code: 2D CCL, device code . . . . . . . . . . . . . . . . . . . . . 21 4.12 Pseudo-code: Centroid x, y approximation . . . . . . . . . . . . . . . . . 22 4.13 Pseudo-code: 2D CCL centroid formation . . . . . . . . . . . . . . . . . 23 4.14 Pseudo-code: 3D CCL, device code . . . . . . . . . . . . . . . . . . . . . 24 4.15 Pseudo-code: 3D CCL centroid formation . . . . . . . . . . . . . . . . . 25 4.16 Pseudo-code: Centroid merging . . . . . . . . . . . . . . . . . . . . . . . 26 4.17 Pseudo-code: False-positive removal . . . . . . . . . . . . . . . . . . . . 26 4.18 Pseudo-code: Host code overall operation . . . . . . . . . . . . . . . . . 27 5.1 CUDA execution time by stage . . . . . . . . . . . . . . . . . . . . . . . 31 5.2 MATLAB and CUDA Implementation Z Error . . . . . . . . . . . . . . 33 5.3 MATLAB and CUDA Implementation Execution Times . . . . . . . . . 34 vii Chapter 1 Introduction In recent years, GPUs have become an important tool for high-powered computing, especially as the performance gains of single-threaded processors slows [1]. GPUs over- come this limitation by providing immense computational capabilities through massively parallel processing of data, using the so-called SIMD (single instruction, multiple data) architecture, in which identical instructions are carried out upon differing pieces of data in parallel. This architecture, while powerful, only lends itself well to a subset of problems. These problems typically exhibit two characteristics: independence of data, and low data transfers. In order to reap a benefit from parallel processing, there must be work available for each processor. Forced serialization- where calculations on one piece of data require a previous piece of data to have been processed- can quickly negate the benefits of having multiple processors available. In the most extreme case, where no calculation can be performed before the previous one has completed, no benefit can be had from parallel processing, as only one processor could be active at a time. Thus workloads with a large number amount of data with no interdependencies will maximally benefit from parallelization. Ken Batcher is often attributed with saying ”A supercomputer is a device for turn- ing compute-bound problems into I/O-bound problems.” GPUs exhibit similar behavior. Because they can process large amounts of data quickly, getting data into and out of the GPU can quickly the performance bottleneck instead of computational capabilities. 1 2The ideal GPU-computing problem is computationally intensive, but requires relatively small datasets as input, and generates relatively little data as output. The problem considered in this paper is that of digital hologram reconstruction and particle detection, especially as applicable to tracer-particle detection for identifying fluid flows. It exhibits both of the characteristics noted above. Firstly, holograms can be reconstructed by using the Fresnel-Kirchhoff Diffraction Formula for each pixel of a given plane, independent of other pixels. Similarly, the resulting reconstructed images can be processed pixel-by-pixel to identify particles, operations which can be performed independently and summed together at the end. Additionally, hologram reconstruction and particle detection requires minimal data input and produces minimal output. The only input required is a single image that makes up the hologram, and the output produced consists of a list of particles’ x, y, z locations. Because of these characteristics, hologram reconstruction and particle extraction is theoretically an ideal problem for GPU acceleration. This paper shows this to be true in practice, with runtimes orders of magnitude lower than a traditional CPU implemen- tation. This paper makes two main contributions to the existing GPU-accelerated hologra- phy work: • An approach to hologram reconstruction and particle extraction that runs entirely on the GPU, excepting some minor CPU-based post-processing • Massive improvements in reconstruction speed with accuracy comparable to exist- ing sequential techniques, achieved through parallel adaptations of the sequential algorithms The remainder of the paper is organized as follows: • Chapter 2 gives an overview of the problem, as well as the motivation for this work • Chapter 3 covers related work in the field • Chapter 4 describes in detail how the system was implemented in a parallel manner using the CUDA GPU-computing toolkit 3• Chapter 5 presents and discusses the speed and accuracy of the implementation, both in absolute terms and as compared to a single-threaded reference implemen- tation • Chapter 6 provides a summary of the work presented and comments on future work. Chapter 2 Background and Motivation This chapter seeks to provide an overview of how hologram reconstruction works and why it’s a good match for a massively parallel SIMD architecture. It further outlines the basic workings of CUDA, required to follow the details of the implementation presented in chapter 4. 2.1 Hologram Reconstruction The problem considered in this paper is that of reconstructing digital holograms and extracting particle locations from those holograms. The general approach for doing so is as follows: 1. Obtain a hologram: This is outside the scope of this paper, but a hologram is generally generated by illuminating the area of interest with a laser. Any particles in the area will create refraction patterns, and a camera is used to record those into a hologram. Thus a hologram consists of a single monochrome image containing the sum of the refraction patterns from all of the particles in the area of interest. Simulated holograms can also be generated (including on GPUs, as in [2], [3]). 2. Reconstruct focused depth planes: Using the Fresnel-Kirchhoff Diffraction Formula, it’s possible to re-create how the diffraction patterns would appear in a plane some distance from the camera along the direction of the light’s propagation. By doing so for a multitude of distances from the camera, an approximation of 4 5the diffraction patterns in 3D space can be created. An example of the results of this process can be seen in figure 2.1. 3. Threshold particles: In order to locate particles in space, they must somehow be separated from the background. Because the pixels contributing to particles have a different intensity than those forming the background, a single value can be used to separate the two. This step is used to find the value that maximizes the particle-contributing pixels detected while still ignoring the background. 4. Calculate centroids: Once the particles have been separated from the back- ground, their location in 3D space must be calculated. This is done by scanning across the reconstructed planes and connecting pixels that are adjacent (and fall within the threshold) into individual centroids, each representing a particle. The average (x,y,z) location of all of the pixels belonging to each centroid is used to calculate the location of that particle. 5. Post-processing: Centroids may not be detected perfectly. For example, the refraction patterns may constructively interfere to create pixels within the particle threshold, leading to the detection of a ”particle” that doesn’t exist. Similarly, actual particles may register as multiple particles due to gaps in their thresholded pixels. Because of this, post-processing of the detected centroids is required to filter out these false-postives and to merge parts of the same centroid into a single particle. On a standard CPU, these steps must all be done sequentially. Ignoring the ac- quisition step, the CPU must first iteratively reconstruct each focused Z plane. From there, it must calculate the ideal threshold value, then use that threshold to identify each centroid and add up the pixels that comprise it. Finally, it must perform any post-processing on the centroids calculated. As shown in chapter 5, this is a lengthy process, even with modern high-end CPUs. Analysis of a turbulent fluid flow may require processing hundreds or thousands of holograms, greatly exacerbating the issue. All but one of these tasks has a large amount of independent work- each Z plane that must be reconstructed has no dependence on any other Z plane, and thus each could be done in parallel. Planes can similarly be thresholded at the same time. Finally, 6Figure 2.1: Example reconstructed planes from a simulated hologram (upper left). The planes are reconstructed at (from lower-left to upper-right) 3.293, 5.963, and 8.633 mm from the camera long the z axis. The black spots indicate the particle being in focus at that depth. identifying centroids is a largely independent task, as each centroid can be calculated at the same time as the others. Post-processing is the only step that’s largely sequential, due in large part to the fact that pieces of particles may be merged together during the process. This creates data dependencies that severely limit the applicability of parallelism. For the other steps, however, the parallelism isn’t limited to just the planes or the particles. Multiple threads can work together to reconstruct a single plane, as the pixels within a plane are also independent of one another. Similarly, multiple threads can count the pixels comprising a single centroid, then merge their sums at the end to get the average location. Because the number of potential parallel tasks is so large, this problem benefits enormously from the massive amount of parallelism available in modern GPUs. 2.2 CUDA architecture For the parallel implementation of hologram reconstruction, CUDA was used. CUDA is Nvidia’s GPU computing toolkit, enabling general computation to be done on their 7GPUs. Discussed in this section is an overview of CUDA’s execution and memory mod- els, intended to provide a background for understanding the implementation presented in chapter 4. 2.2.1 Host and device code Each CUDA program consists of two parts: the host code, and the device code. The host code runs on the CPU of the computer and is, for our intents and purposes, identical to any other C program. It can use all of the usual libraries, constructs, etc. that any other program has access to, and (generally) runs purely sequentially. It additionally has the job of controlling the GPU, which mostly entails moving data into or out of the GPU’s memory and commanding the GPU to perform computations. The device code executes on the GPU itself, and consists of kernels and other device helper functions. A kernel is basically the unit of execution in CUDA- a function that is executed in parallel by a number of threads. When a kernel is launched by the host code, the host specifies the block size and the grid size of the kernel launch, in addition to the parameters passed to the function itself. 2.2.2 Kernels, threads, warps, blocks, and grids A CUDA kernel is essentially a function executed on the GPU. A kernel is launched with a specific grid. This grid is is comprised of thread blocks, each of which contains a number of threads. Grids and blocks may be 1-, 2-, or 3-dimensional. Figure 2.2 illustrates this with a 2D grid containing 2D blocks. The blocks in a grid execute largely independently, while constructs exist for threads within the same block to synchronize and quickly share access to data. Threads in CUDA are further grouped into warps. A warp is a group of threads that is executed in strict lockstep- each thread in a warp will execute the same instruction at the same time. Unlike grids, blocks, and threads, warps are handled internally by the GPU, and not specified by the programmer. When a kernel is launched, it has a specified block and grid size. Each thread can identify its location within the block by the parameters threadIdx.x, threadIdx.y, and threadIdx.z. Position within in the grid can similarly be obtained with the blockIdx 8Figure 2.2: Visualization of CUDA thread organization (figure 6 of the CUDA C Programming Guide [4]) structure. Using this information, each thread can select a separate piece of data to compute on. 2.2.3 CUDA memory model In addition to the parallel computation model, CUDA has a memory model that differs slightly from standard CPU. CUDA still has registers and main memory (RAM, referred to as ”global memory” in CUDA parlance) on the GPU, as a standard CPU does, as well as cache. There are two important distinctions from a standard CPU, however. Similar to a traditional system, memory accesses in CUDA are significantly slower than the processor frequency. In CUDA, however, many threads are trying to access memory at once, as opposed to (ideally) a single thread on a CPU. Because threads execute in a SIMD fashion, all of the threads in a warp must complete their memory access before any of them can continue. Thus a single memory access can have the same effect on speed as many consecutive accesses on a CPU. 9Figure 2.3: Visualization of CUDA memory organization (figure 7 of the CUDA C Programming Guide [4]) In part to work around that, CUDA implements a split implicit/explicit cache, with the explicit cache being referred to as ”shared memory”. As its name implies, shared memory is shared amongst a block of threads, and may be used to either share data among them, or to cache data that each thread would otherwise individually fetch from main memory. Because CUDA can rapidly become IO-bound, this can be a great advantage, as an access to shared memory is significantly faster than global memory [4]. This memory configuration is shown in figure 2.3. Chapter 3 Related Work Digital holography work exists outside of the realm of GPUs; many of the ideas are re-used here. [5], [6] present techniques for reconstruction very similar to those utilized in this paper. [7] utilizes a component segmentation approach like that used in the 3D CCL ap- proach used here. [8] presents the hybrid tenegrad approach for improved depth accuracy in processing holograms, also utilized in this work. Similarly, using GPUs in digital holography is similarly not a new idea. A body of work exists to either render images from captured holograms or to generate holograms in various capacities, using GPUs for acceleration. [9] shows work to render detailed images from captured holograms. CUDA was used to greatly reduce the reconstruction time when compared to a CPU implementation, and a speedup of greater than 100 times was shown. Similar to the work presented here, cuFFT was used for FFT operations. [10] similarly examines capturing and manipulating holograms using CUDA. They further extend the work to high-resolution (4k and 8x) imagery and real-time rendering of the captured scenes in 3D. The work focuses on the detailed rendering 3D imagery, as opposed to extracting information from these images. [3] further explores reconstructing images, and also presents a technique for creating them in the first place. While GPUs are used in the work, it was based around OpenCL 10 11 and an AMD GPU, not Nvidia’s CUDA. Again, no attempt was made to detect particles. [2] similarly demonstrates generating holograms, but further shows the utilization of multiple GPUs to do so. A speedup of approximately 100 times over a CPU implemen- tation was shown using a single GPU, with an additional factor of 2.5 being achieved when multiple GPUs were used. [11] uses holograms to manipulate tiny particles using laser beams. The work pre- sented uses GPUs to solve for the phase modulation required to deliver energy from a laser array to the target focal point. Prior efforts to utilize parallel processing not only to recreate or generate holograms, but also to extract information from them also exist. [12] presents an implementation of reconstructing digital holograms in real-time (24 frames per second), using 512-pixel-square holograms. This was implemented using CUDA and the cuFFT library, and included parallel capture (with the CPU) and pro- cessing (with the GPU) of frames. No attempt was made, however, at detecting particles in the reconstructed images. Furthermore, each frame reconstructed is similar in work to a single plane reconstructed in this work presented in this paper, and thus the per- formance presented here is significantly higher. This is likely due to advances in GPU technology, as well as improvements in the cuFFT library. [13] presents a parallel particle-detection approach using computer clusters. The technique presented is shown to function over a range of particle sizes by using different algorithms based on the particle size. The algorithms generally utilize peak-detection, with the value being detected being changed based on particle size (possibilities pre- sented include pixel intensity and Sobel gradient). The implementation presented, in addition to not being on a GPU, is also done in an ”embarrassingly parallel fashion”, where a single core handles a single hologram. Thus a speedup in the processing of multiple holograms is demonstrated, but not for any single hologram. [14] also demonstrates hologram particle extraction, but uses a different technique and GPUs. Planes are reconstructed, then approximate particle locations are detected based on the focus at a given plane. The most intense particles are then segmented, with their refraction patterns removed from the planes to minimize noise for consecutive particle extractions 12 Despite the use of GPUs, only a 10x speedup over a CPU implementation was found. Additionally, some processing was found to require either double-precision calculations on the GPU or additional CPU processing. No hard performance numbers were given to enable a performance comparison. Finally, [15] presents work similar to that presented in this paper, showing particle extraction from holograms (and even particle tracking, albeit using existing libraries), implemented in the interpreted IDL programming language. A different approach is taken to find particles’ locations as well, approximating the x, y centers by taking mul- tiple gradients along the refraction patterns, then applying a best-fit approach to match the patterns to predictions made by the Lorenz-Mie theory to find z. Particle extraction was done at 8 frames per second with images 200 pixels square, using a combination of multi-threaded CPU processing and a GPU. The GPU-CPU hybrid implementation was found to be 20x faster than that of the CPU-only version. The work presented in this paper improves upon this performance by using CUDA directly from a compiled C program. Additionally, it shows an implementation almost entirely on the GPU, maximizing the benefits of parallelism and identifying particles in half the computation time, adjusted for image size. Finally, this paper presents an alternative method for locating particles in space. Chapter 4 Implementation This chapter describes the implementation of the hologram reconstruction algorithm in CUDA. The implementation is broken up into ”stages”, roughly corresponding to the steps outlined in chapter 2. One section is devoted to each stage, including an overview of what functionality the stage provides, as well as the implementation thereof and a discussion of why each stage was implemented in the manner indicated. The numbering irregularities of the stages should also be addressed. A reference implementation in MATLAB (developed by Mostafa Toloui and Jiarong Hong, cur- rently unpublished) was used as a guide for the CUDA implementation. The MATLAB implementation consisted of six discrete steps in extracting particle centroids. Due to architectural differences between CUDA and MATLAB, it was efficient to combine some stages into a single stage in CUDA. Thus the numbering of CUDA stages reflects the MATLAB stage functionality that each CUDA stage implements. 4.1 Stage 1: Reconstruction Reconstructing Z planes- recreating the refraction patterns at a distance from the camera- is performed in as straightforward a manner as possible, using previously- established techniques [5], [6]. First, the hologram is converted into the frequency domain using a FFT. CUDA provides a fast, parallel implementation of this in their cuFFT package. To avoid reinventing the wheel, and to leverage the knowledge the Nvidia developers have of 13 14 their product, this was used for all FFT operations. Once the hologram is in the frequency domain, three steps are performed for each plane to be reconstructed. First, a kernel is launched to calculate the value of each pixel. This calculation is shown in figure 4.1. Require: w, h, the width and height of the hologram Require: hz, the complex output array Require: x, y, z, the location to reconstruct Require: λ, the wavelength of illuminating light Require: reso, the image resolution in micrometers per pixel Require: fftHolo, the complex hologram array in the frequency domain fx = (x+ w 2 ) mod w + 1 2 − 1 w fy = (y + h 2 ) mod h+ 1 2 − 1 h φ = piλ reso2 θ = zφ(f2x + f 2 y ) hz(x, y) = fftHolo(x, y)∠θ Figure 4.1: Stage 1 calculation of pixel values in the complex frequency domain. Because each pixel’s value is independent of the adjacent pixels’ values, this calcu- lation can be done highly efficiently in parallel; each thread performs the calculation shown in figure 4.1 on its own x, y location in parallel. The host code iterates over z, as seen in figure 4.3. Once each pixel in the plane has been calculated, an inverse FFT is performed (again using cuFFT) to find the complex value of the refractions at that plane. Finally, a third kernel is launched to calculate the magnitude of the refraction at each pixel and copy the result into the destination array for use by later kernels. Again, each value is independent of the others, so calculations can be done efficiently. This operation is shown in figure 4.2. Require: planeIn, the 3D complex-valued inverse FFT data array Require: planeOut, the 2D real-valued output data array Require: x, y, z, the pixel location to copy planeOut(x, y) =abs(planeIn(x, y, z) Figure 4.2: Device algorithm for copying the inverse FFT results into the output array It should be noted that this approach appears to introduce an additional memory copy, which were previously noted to be expensive. Because of the way cuFFT handles 15 complex-to-real transforms, additional memory juggling would be required even if this approach was used. As such, the more human-readable approach was taken. The overall stage 1 code is shown in figure 4.3. Require: holoIn, the real-valued input hologram fftHolo =cuFFT(holoIn, forward) for all z do Launch zKernel (figure 4.1) Perform inverse FFT on the result using cuFFT Launch memcpyKernel (figure 4.2) end for Figure 4.3: Algorithm for completing the reconstruction step. 4.2 Stage 2-3: Thresholding In order to detect a threshold for separating the particles from the background, some metric for evaluating the quality of any given threshold is required. The metric chosen was the sharpness of the minimum-intensity image- that is, the image generated by condensing all of the reconstructed planes into a single image by taking the minimum pixel intensity value at each x, y location. The sharpness of a given configuration was calculated as the average tenengrad value of pixels with non-zero tenengrads, where the tenengrad of a pixel at position x, y is shown in figure 4.4. This approach was shown by [8] to be highly accurate in detecting the locations of particles along the axis of reconstruction. Require: pixel(x, y) is the magnitude of the pixel at location x, y of the input Require: pixel(x ± [0, 1], y ± [0, 1]) = pixel(x, y) when x ± [0, 1], y ± [0, 1] is out of bounds l = (pixel(x− 1, y − 1) + 2pixel(x− 1, y) + pixel(x− 1, y + 1))/8 r = −(pixel(x+ 1, y − 1) + 2pixel(x+ 1, y) + pixel(x+ 1, y + 1))/8 t = (pixel(x− 1, y − 1) + 2pixel(x, y − 1) + pixel(x+ 1, y − 1))/8 b = −(pixel(x− 1, y + 1) + 2pixel(x, y + 1) + pixel(x+ 1, y + 1))/8 return (l + r)2 + (t+ b)2 Figure 4.4: Pseudo-code for calculating the tenengrad of a pixel at location x, y. In order to accomplish this, a large number of steps were used. It seems likely that 16 they could be simplified somewhat, though that’s left as an area of future research. As an alternative, an option was provided in the code to manually enter a pixel threshold based on prior knowledge of the holograms, bypassing these lengthy calculations. 4.2.1 Tenengrad map formation First, the tenengrad of each pixel in each reconstructed plane is calculated, along with the average tenengrad of each of those planes. This is accomplished with a single CUDA kernel consisting of a single block for each reconstructed Z plane. The threads iterate over the plane, calculating tenengrads for individual pixels as they do so, also keeping track of the total calculated tenengrad. Once the tenengrad for each pixel of a plane has been calculated, the block utilizes shared memory to find the sum of all tenengrads in the plane. From this the average tenengrad is found for the plane. With the average tenengrad known, the threads re-iterate over each pixel in the plane, this time calculating the gradient (figure 4.5). For pixels with gradients greater than the average, the gradient is stored into the tenengrad map, while a zero value is stored for those under the average. The result of this kernel is the tenengrad map for the plane- an image roughly corresponding to the edges detected in the plane, weighted by the sharpness of the edge. This procedure is shown in figure 4.6 Require: x, y, the location to calculate the gradient for gradLR =abs(pix(x− 1, y)− pix(x+ 1, y)) gradUD =abs(pix(x, y − 1)− pix(x, y + 1)) return 0.5×max(gradLR, gradUD) Figure 4.5: Gradient calculation function. For this application, the magnitude of the greater of the vertical and horizontal gradients is taken. The result is divided by two, as the calculated gradient is taken across two pixels. 4.2.2 Combined image, max tenengrad Next, a second kernel calculates two additional images: the combined image, and the maximum tenengrad image. The former consists of the minimum-intensity (lowest- valued) pixel in each x, y location for any depth z, while the latter holds the highest 17 Require: w, h are the width and height of the hologram Require: tenengrad map is an output array with dimensions equal to w, h x = threadIdx.x; y = threadIdx.y; xstep = blockDim.x; ystep = blockDim.y; for all i, px = x+ i× xstep < w do for all j, py = y + j × ystep < h do tenengrad map(px, py) =tenengrad(px, py) end for end for Figure 4.6: Pseudo-code for calculating the tenengrad map of an image. This would be run by each thread in a block, with one block per reconstructed plane tenengrad value at each x, y location. This kernel is configured as a 2D grid, where each thread handles a single pixel, and iterates along the z dimension. Taken together, these are used to form a heuristic for the sharpness at a given threshold value in the next step. The calculation of the combined image and maximum tenengrad is shown in figure 4.7. Require: tenengrads[z](x, y) corresponds to the tenengrad calculated for pixel x, y on plane z Require: pixel[z](x, y) corresponds to the value of the pixel x, y of the reconstructed plane z Require: combined and max tenengrad are output arrays x = blockIdx.x× blockDim.x+ threadIdx.x; y = blockIdx.y × blockDim.y + threadIdx.y; for all z do combined(x, y) =min(pixel[z](x, y), combined(x, y)) max tenengrad(x, y) =max(tenengrads[z](x, y),max tenengrad(x, y)) end for Figure 4.7: Pseudo-code for calculating the combined image and maximum tenengrad. Each thread on the GPU runs this code, with one thread per pixel of the image. 18 4.2.3 Threshold calculation The last kernel is used to calculate the actual best threshold. First, the combined image is masked with a potential threshold value- each pixel under the threshold is left as-is, while those above the threshold are set to 0. This seemingly-backwards approach is due to pixels corresponding to particles being darker than the background, and hence having a low pixel value. The resulting values are pairwise-multiplied by the maximum tenengrad image, and added to the total sharpness of the threshold. Once the total sharpness for a threshold value has been calculated, the average is found by dividing by the number of non-zero elements that added to the total. The possible threshold values are calculated in parallel by the GPU, one threshold value per block. From these values, the host selects the to be used in later calculations. This algorithm is described in figure 4.8. 4.3 Stage 4-6: Centroid Calculation The final GPU step of extracting particles from the hologram is calculating centroids. This involves taking the reconstructed planes, examining the pixels below the calculated threshold, deciding which belong to what particle, and finally calculating the average location of each particle based on the pixels that compose it. An averaging approach is desirable because the particles become highly elongated along the axis of reconstruction (figure 4.9). Thus finding one point where the particle is most in focus is less than guaranteed to reflect the particle’s true location. Two approaches to solving this problem were tested, presented in the following sections. 4.3.1 2D CCL This approach is significantly simpler than 3D CCL (section 4.3.2), and is based on the premise that with the combined image known in advance, the approximate x, y locations of the particles in the image are also known. From there, the search space for particle centroids can be greatly reduced, resulting in a faster runtime. This makes use of a few assumptions. Firstly, it assumes that particles are well- spaced- that is, that each particle will form a distinct shape in the combined image, and 19 Require: w, h are the width and height of the hologram Require: threshold is the block’s threshold to calculate on Require: tenengrad map is the previously-calculated array of tenengrads Require: cmb masked is the previously-calculated minimum-intensity image, but with any cmb masked(x, y) < threshold being 0 x = threadIdx.x; y = threadIdx.y; xstep = blockDim.x; ystep = blockDim.y; sharpness = 0 nonzero = 0 for all i, px = x+ i× xstep < w do for all j, py = y + j × ystep < h do val = tenengrad map(px, py)×tenengrad(cmb masked(px, py)) sharpness+ = abs(val) if val 6= 0 then nonzero+ = 1 end if end for end for nonzero =sum(nonzero) sharpness =sum(sharpness) return nonzero,sharpness Figure 4.8: Threshold sharpness calculation GPU code. Each block calculates the sharpness for a different threshold; the host code than selects the threshold value with the highest sharpness of those returned. The ”sum” function simply merges the partial results from all of the threads in the block into the final value. not overlap with other nearby particles. This is a safe assumption for the use case of tracking fluid flows, as tracer particles that are that nearby can be treated as a single particle anyway. Furthermore, if they’re separated sufficiently along the Z axis, they can be differentiated anyway. Second, the particles are assumed to be of a uniform size. As explained later, the search space for each centroid is based upon the likely size of the particle, in this case calculated as the average of all particles present. While this simplifies the algorithm somewhat, it could likely be worked around in future work. For the purposes of tracking fluid flows, where tracers are of uniform size, this assumption is justified. Finally, the particles are assumed to be roughly spherical, which is again the case 20 Figure 4.9: A spherical particle’s cross-section along the axis of reconstruction. with tracers. This is important for identifying the center of the search space for the centroid of each particle- an elongated particle may present an inaccurate x, y center, causing the end result to be skewed. 2D CCL Before performing localized searches for particles, the approximate locations of them must be known. The first part of this endeavor is performing a 2-dimensional CCL to form the groups of pixels denoting the centers of the particles. Each pixel is initially labeled with either its row-major index (w × y + x) if its combined-image value is less than the threshold, or -1 (unsigned, the largest possible unsigned value), if it’s above. From there, each CUDA thread repeatedly examines the pixels surrounding the pixel it corresponds to, and assigns its pixel the lowest-valued label of itself and any of its neighbors, as seen in figure 4.11. As soon as no thread changes its pixel’s value for an iteration, the labels are globally consistent, and the components have been formed. This procedure is shown in figure 4.10. Require: labels is the array containing the component labels for each pixel changed = true while changed do changed =launch ccl kernels(labels) end while Figure 4.10: 2D CCL algorithm host code. It simply re-launches the kernels until none of them adjust the labels. 21 x = blockIdx.x× blockDim.x+ threadIdx.x; y = blockIdx.y × blockDim.y + threadIdx.y; globalChanged = false changed = true while changed do changed = false oldlabel = pix(x, y) label(x, y) = min(label(x, y), label(x−1, y), label(x+1, y), label(x, y−1), label(x, y+ 1)) if label(x, y) 6= oldlabel then globalChanged = true end if changed = syncthreads or(changed) end while return globalChanged Figure 4.11: 2D CCL algorithm executed on the GPU. The line changed = syncthreads or(changed) is used to keep individual blocks iterating until the block converges by instructing each block to re-iterate if any thread updated a pixel value. Centroid x, y approximation Once the components have been labeled, the approximate x, y positions of each particle must be found. This operation is identical to finding a centroid in 2D. A global list of centroids is created, ordered by label. A unique center label can be identified by any pixel having a label matching its index. Each thread then examines a single pixel. If that pixel has a valid label, it adds the x, y values to the global total of the center with the corresponding label, as well as incrementing the pixel count for that center. Once all threads have finished this task, the totals are divided by the pixel counts for each center to find the average x, y position for that particle. This pixel count information is further used to calculate the average size of the particles. This procedure is shown in figure 4.12. The host code handles the task of dividing the x, y sums by the number of pixels in the center to get the average x, y location. 22 if label(x, y) is valid then center = findCenterByLabel(label(x, y)) center.x+ = x center.y+ = y center.pixel count+ = 1 end if Figure 4.12: Algorithm for finding the centers of potential particles in the combined image. Centroid formation Once the approximate centers of the particles have been detected, forming the centroids is relatively straightforward. A kernel is launched with one block for each particle detected, with the block being sized proportionately to the average particle cross-section calculated previously. Each block is centered on one of the detected particles, and essentially iterates along the Z dimension, adding any pixels under the threshold to the total contributing to the location. When two consecutive planes are found that contain no pixels under the threshold, the centroid is considered finished, and no further pixels are added to it. If a plane further on contains pixels below the threshold, a new centroid is formed. This algorithm is shown in figure 4.13. 4.3.2 3D connected-component labeling (CCL) The second approach considered is full 3D CCL. This attempts to create components in 3D space, instead of attempting to find just the x, y centers. It uses much the same approach as the 2D CCL, but also propagates labels along the Z axis. This approach is largely identical to that detailed in [7]. 3D CCL The first step is the actual CCL approach. As mentioned previously, this is largely the same as the 2D version (figure 4.11), but iterates over the Z dimension as well. At each iteration, it takes the lowest label of any pixel adjacent to that x, y, z location, assuming the location has a label to begin with. The kernel is configured such that one thread handles a given x, y location for all z. This is shown in figure 4.14. The host code is 23 Require: x, y, center location of this block cx = x− blockDim.x/2 + threadIdx.x; cy = y − blockDim.y/2 + threadIdx.y; centroid = empty centroid for all z do if No pixels below the threshold then if Was forming a centroid and this is the second such plane then merge(centroid) globalSave(centroid) centroid = empty centroid end if Go to next iteration end if centroid.x+ = cx centroid.y+ = cy centroid.z+ = z centroid.pixelcount+ = 1 end for Figure 4.13: Algorithm for finding centroids using 2D CCL. The function merge(centroid) combines the partial centroids being calculated by each thread in the block; globalSave(centroid) saves the calculated centroid to global memory for the host to access. The host code later divides the centroid’s x, y, z values by its pixel count to get the actual x, y, z location of the centroid. identical to the 2D CCL case (figure 4.10), except that the labels array is 3D instead of 2D. Once this operation has completed, the label array contains elements either with value INT MAX (if they don’t contain a particle pixel), or of an integer representing which centroid they belong to. Centroid formation Forming centroids using the 3D CCL approach is done by simply examining each pixel and, if it has a label corresponding to a centroid, adding that pixel’s x, y, z location to the centroid’s total. After this has been completed, the host code divides the x, y, z totals by the number of pixels that contributed to the centroid to get the average location. The GPU code implements this in a straightforward manner. One thread is assigned to each x, y location, and the kernel then iterates over z. When a thread finds a pixel 24 x = blockIdx.x× blockDim.x+ threadIdx.x; y = blockIdx.y × blockDim.y + threadIdx.y; minlabel = INT MAX changed = true globalchanged = false while changed do changed = false for all z do if label(x, y, z) == INT MAX then minlabel = INT MAX else minlabel =min(minlabel, label(x, y, z), label(x + 1, y, z), label(x − 1, y, z), label(x, y + 1, z), label(x, y − 1, z)) if minlabel 6= label(x, y, z) then changed = true globalchanged = true end if end if label(x, y, z) = minlabel end for changed = syncthreads or(changed) end while return globalchanged Figure 4.14: 3D CCL algorithm executed on the GPU. The line changed = syncthreads or(changed) is used to keep individual blocks iterating until the block converges by instructing each block to re-iterate if any thread updated a pixel value. INT MAX is used to denote pixels over the threshold. that’s labeled with a centroid, it looks up the centroid by it’s label, then adds the x, y, z location of the pixel to it and also increments the pixel count. This is illustrated in figure 4.15. 4.4 Host code The host code in this implementation has fairly minimal functionality. It largely exists to manage passing data between the different CUDA stages, and also handles the usual configuration duties- configuring variables, reading input and writing output. The host code does, however, perform some relevant computational duties. Because 25 x = blockIdx.x× blockDim.x+ threadIdx.x; y = blockIdx.y × blockDim.y + threadIdx.y; for all z do if pix(x, y, z) 6= INT MAX then centroid =getCentroid(pix(x, y, z)) centroid.x+ = x centroid.y+ = y centroid.z+ = z centroid.pixelcount+ = 1 end if end for Figure 4.15: 3D centroid-formation algorithm. The function getCentroid(pix(x, y, z)) gets a reference to the global centroid array, looking up the correct centroid by index. the GPU memory is limited, it can’t reconstruct and process an arbitrarily large num- ber of Z planes at a time- the memory required scales with the size of the hologram and reconstruction resolution. As such, it must reconstruct and process sets of planes iteratively. The host code manages this process by iterating over the Z plane sets and launching the kernels appropriately. Additionally, each iteration creates as output a complete set of detected centroids. Because the Z plane processing commences iteratively, it’s possible that a particle is detected in two such sets. It’s also possible that a particle becomes over-segmented- that is, multiple centroids are registered for a single particle because the refraction pattern for that particle is composed of two or more disjoint pixel groups. The host code must correct for these errors. Fortunately, both can be solved in one pass: In both cases, centroids are simply merged together based on proximity- if two centroids are close enough in x, y, and z, it can reasonably be assumed that they belong to the same particle. Thus the centroids are combined, forming a new centroid with an x, y, z location given by the average of the two contributing centroids, weighted by the number of pixels comprising each. While this is a computationally-expensive process (O(n2) in the number of cen- troids), it maps poorly to the GPU due to the high number of data dependencies. The actual algorithm for merging centroids is shown below in figure 4.16. Finally, the host code must do some filtering. Particles’ refraction patterns may interfere constructively or destructively with each other to create pixels that register as 26 Require: mergeRadius, a configurable parameter for all ci ∈ centroids do for all cj ∈ centroids, cj 6= ci do if distance(ci, cj) < mergeRadius then merge(ci, cj) end if end for end for Figure 4.16: Host code algorithm for merging fragmented centroids. being part of a particle, but aren’t. The centroids corresponding to these false-positives should then ideally be detected and discarded. The refraction pattern for each particle changes with each Z plane. Because of this, any interference significant enough to trigger a false-positive reading is likely to be short- lived. As such, the centroid resulting from any such interference will have relatively few pixels contributing to it when compared to a centroid created by a real particle. Thus a simple filter to discard centroids smaller than a fixed number of pixels is effective in greatly reducing these false-positive results. This approach is shown in figure 4.17. Require: minCentroidSize, a configurable parameter for all centroidci do if sizeof(ci) < minCentroidSize then remove ci end if end for Figure 4.17: Host code algorithm for reducing false-positives reported. Finally, the high-level host code operation is shown in figure 4.18. 27 Require: hologram, an input hologram image Require: k, the number of planes that can be handled at once in GPU memory Require: nz, the number of planes to reconstruct Allocate memory on GPU for all i, k × i < nz do Launch stage 1-2 kernels Launch stage 3 kernels Launch stage 4-6 kernels Copy centroids to host from GPU end for Merge centroids Remove false-positives return centroids Figure 4.18: Host code algorithm for controlling the entire hologram reconstruction and particle extraction. Chapter 5 Results and Discussion 5.1 Test setup Due to difficulties in getting the existing MATLAB code to run in order to benchmark it, the MATLAB and CUDA experiments were run on different machines. The GPU tests were run on a Linux machine containing two Intel Xeon E5-2650 processors at 2.0 GHz, 64 GB of RAM, and three Nvidia GPUs was used (though only a single Tesla K20c GPU was used by the application). The software used was CUDA 6.5. The MATLAB tests were run on a Windows 7 machine containing an Intel Core i7-4770 CPU at 3.4 GHz with 16 GB of RAM. MATLAB 2013 was used to perform the tests. While this is less than ideal for making a fair comparison, the MATLAB performance is, if anything, overstated, as the CPU used for the MATLAB computations has almost twice the single-thread performance as that used for the GPU experiments [16]. For both the CUDA and MATLAB configurations, the inputs provided consisted of the input hologram, as well as parameters concerning the hologram, such as the wavelength of the illuminating light, the resolution of the image (pixels/m), and the reconstruction depths, which theoretically correspond to the bounding near and far planes of the particles contained in the hologram. As outputs, each algorithm provided a list x, y, z positions, one per particle detected. To allow the accuracy of each approach to be tested, simulated holograms were 28 29 generated and used as inputs, with each hologram tested by both the CUDA and MAT- LAB implementations. Because the holograms were simulated, the ”real” locations of all particle contained within them could be known. This information was compared to the particles detected by each algorithm, and used to determine the accuracy of each. 5.2 Figures of merit Before comparing the GPU and reference implementations, figures of merit must be decided upon. In this instance, there are essentially two points on which a comparison can be made. Firstly, speed. Because the driving force for this research is to greatly reduce the time overhead involved in analyzing fluid flows, the time required to reconstruct and analyze a hologram is a key point of comparison. As mentioned previously, memory operations (allocating memory and moving data) are relatively expensive in CUDA. As such, it becomes relevant to not only measure the time required to reconstruct and analyze a single hologram, but also the time required to process multiple holograms, where the memory allocation cost is amortized across multiple images. The second figure of merit for hologram reconstruction is accuracy. Being able to reconstruct holograms and detect particles arbitrarily fast is only useful if the results are useful, after all. With holograms, accuracy can be compared in a number of ways- the position of detected particles in three dimensions compared to the physical location of particles in space, and the number of particles detected. The latter point encompasses detecting all the particles that exist, as well as the existence of false-positives- reporting a particle where none exists. 5.3 Test results The performance of both the MATLAB and CUDA implementations are shown in table 5.1. Figure 5.3 graphically represents execution, and the relative time required for each CUDA stage is shown in figure 5.1. Figure 5.2 further highlights the accuracy of the CUDA and MATLAB approaches as a function of the particle field density. 30 T ec h n iq u e N u m b er of p ar ti cl es P ar ti cl es D et ec te d F al se P os it iv es A ve ra ge X Y er ro r (µ m ) A ve ra ge Z er ro r (µ m ) S td . Z er ro r (µ m ) T ot al ru n ti m e (s ) C al cu la ti on ti m e (s ) 3D CCL 100 100 0 0 0.280 0.4 5.596 1.779 2D CCL 100 0 0 0.270 0.5 5.409 1.355 MATLAB 100 0 0.039 0.984 0.3 2091 — 3D CCL 467 467 0 0.001 0.308 0.5 5.579 1.817 2D CCL 467 0 0.005 0.291 0.5 5.426 1.380 MATLAB 467 0 0.090 1.022 0.4 2127 — 3D CCL 1030 1016 4 0.047 0.764 2.5 5.743 1.932 2D CCL 1017 4 0.110 0.668 1.9 5.118 1.401 MATLAB 1021 0 0.136 1.033 0.5 2131 — 3D CCL 1417 1402 6 0.344 1.333 7.2 5.846 2.049 2D CCL 1402 6 0.385 1.255 7.1 5.273 1.445 MATLAB 1379 3 0.163 1.029 0.5 2121 — 3D CCL 1877 1829 7 0.253 1.448 5.1 6.044 2.253 2D CCL 1826 8 0.321 1.403 5.0 5.475 1.577 MATLAB 1795 6 0.187 1.075 0.6 2516 — 3D CCL 2229 2180 25 0.648 2.413 10.5 6.301 2.471 2D CCL 2186 27 0.662 2.376 10.6 5.541 1.695 MATLAB 2094 17 0.116 1.102 0.6 2298 — Table 5.1: Performance comparison between CUDA and MATLAB code. All holograms were 1024 × 1024 pixels, with particles 3 µm in diameter, and a camera resolution of 0.75 µm/pixel. 1101 planes were reconstructed in 1 µm increments (z = 0 to 1100 µm). MATLAB doesn’t have fixed memory allocation costs as CUDA does, and thus does not have a separate ’calculation’ time. 31 Figure 5.1: Relative execution time of each CUDA stage. Memory transfers for each stage are rolled into each stage’s execution time. 5.4 Discussion 5.4.1 Detection Rate and False-Positives The first metric to compare the CUDA and MATLAB implementations is how many particles were correctly detected, and how many false-positives were reported. From table 5.1, it can be seen that both implementations behave identically up to 467 particles. At 1030 particles, the MATLAB implementation performs slightly better, detecting a few more particles and reporting no false positives (as compared to 4 false positives for the CUDA code). For greater particle densities (1415-2229 particles), the CUDA version detects more of the particles, at the expense of more false-positives. Both the 2D and 3D segmentation approaches used by the CUDA algorithm perform similarly. The false-positive rate should be considered carefully, however, as the CUDA and MATLAB codes perform different post-processing steps on the detected particles. Be- cause this work focused on the parallelizable portions of the algorithm, the post-processing code has room for improvement, which could well improve the false-positive rate to a level on par with the existing MATLAB code. Despite this, both MATLAB and CUDA perform well by this metric, having around a false positive rate of under 2% and detecting well over 90% of the particles present in the hologram. 32 5.4.2 XY Error As shown in table 5.1, the accuracy of detected particle locations in the XY plane was found to be superb for all particle densities and both CUDA and MATLAB approaches. Average error in all was well under one particle diameter. While the accuracy was found to be good, it did show a decrease as the particle density increased. This is due in part to the higher numbers of false-positives, which generally appeared near, but not exactly on top of, the locations of real particles, and thus contributed to larger average error values. Because the false-positives were reported as particles by the algorithm, their error was still counted in calculating the overall accuracy of the implementation. 5.4.3 Z Error The MATLAB and CUDA implementations diverge when it comes to the accuracy of located particles along the Z axis, as shown in figure 5.2. For up to 1030 particles in the field, the different approaches behaved similarly, with the CUDA implementation demonstrating slightly more accuracy at lower densities. At higher densities (1417-2229 particles), however, the CUDA implementation proved to be less reliable. While the average detection error remained under the diameter of a particle, the standard deviation increased enormously. This is again partially at- tributable to the false-positive issue noted previously, where each false-positive con- tributed significantly more error than the real particles detected. With the values measured, the CUDA implementation likely would not be useful for real applications with high particle densities, though future work to implement more advanced post-processing steps could likely change that. This is further supported by the CUDA implementation actually performing better at lower field densities. 5.4.4 Execution Time Before comparing the execution times of the MATLAB and CUDA implementations, it should be noted that, despite running on a faster CPU, the MATLAB implementation is inherently handicapped. Because MATLAB code is interpreted, it runs slower than a 33 Figure 5.2: Plots of CUDA (2D and 3D CCL) and MATLAB average Z error (top) and standard deviation (bottom), as a function of particle density. compiled C program1 , such as the CUDA implementation. As such, the performance gains demonstrated are not entirely due to the parallelization of the algorithms, but also in part to the fact that they’re compiled into code that can run directly on the CPU and GPU without the interpreter’s overhead. Regardless of the reason, the CUDA implementation clearly demonstrates a perfor- mance advantage. As shown in figure 5.3, the MATLAB implementation takes signifi- cantly longer- by almost three orders of magnitude- to perform the same workload as the CUDA version. This difference is especially notable when only the CUDA calculation time is taken into account, as one-time memory allocations and other setup account for 1 This likely is the same reason the work presented here shows significant performance improvements over that presented in [15]. 34 Figure 5.3: Plot of MATLAB and CUDA execution times for various particle field densities. over half of the CUDA runtime (table 5.1 and figure 5.1). One other point of interest is the dependence on the particle field density. The MATLAB version seems to have negligible dependence on the field density- likely be- cause reconstructing the planes and calculating tenengrad maps is the majority of the execution time, and thus any small additional amount of work is less than obvious. The CUDA version clearly has some dependence on the number of particles. Because some post-processing is done on the CPU after the GPU has finished, and because the GPU is significantly faster than the CPU, the extra time is highly visible. Furthermore, the post-processing steps in the CUDA version are O(n2), creating a larger dependence on particle count. Figure 5.1 shows that this processing time is non-negligible for larger particle counts. Figure 5.1 contains a couple other details of interest. Firstly, it highlights the large one-time setup overhead CUDA has: over half of the execution time. Batch processing of holograms thus stands to show further performance gains over a single hologram. Furthermore, it can be seen that, especially for the 3D CCL approach, each CUDA stage contributes enough execution time to benefit from parallelization. This is likely a factor in the performance improvement seen over [15], which only parallelized part of the algorithm. The 2D CCL approach shows the near-negligible stage 4-6 computation time, and also the lower relative effect of the CPU-based post-processing. Both of these point to it being more conducive to a parallel approach than 3D CCL. Chapter 6 Conclusion GPU computing is a highly effective solution to a class of problems with low data transfer and high computation requirements. This paper presents a GPU-accelerated implemen- tation of one such problem: that of reconstructing digital holograms and extracting 3D particle locations from them. It further demonstrates that the parallel implementation of the procedure produces similar results as the sequential version (for relatively sparse particle fields), while taking almost three orders of magnitude less time to execute. Some accuracy issues remain at higher particle field densities. These could likely be resolved with more sophisticated post-processing techniques of the detected particles, such as those implemented in the MATLAB code. This is one area of future research. The implementation presented also has possibilities for further performance enhance- ments. Because groups of reconstructed planes are processed iteratively, independently of the other groups, the possibility exists to process them in parallel using multiple GPUs. A performance increase roughly linear in the number of GPUs used should be possible with this technique. Despite these opportunities for improvement, the ability to extract process sufficient holograms for flow analysis in a matter of a few hours instead of a few weeks is a huge step forward. 35 References [1] Herb Sutter. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobbs journal, 30(3):202–210, 2005. [2] Hoonjong Kang, Fahri Yaras¸, and Levent Onural. Graphics processing unit accel- erated computation of digital holograms. Applied optics, 48(34):H137–H143, 2009. [3] Tomoyoshi Shimobaba, Tomoyoshi Ito, Nobuyuki Masuda, Yasuyuki Ichihashi, and Naoki Takada. Fast calculation of computer-generated-hologram on amd hd5000 series gpu and opencl. Optics express, 18(10):9955–9960, 2010. [4] NVIDIA Corporation. CUDA C Programming Guide, July 2013. [5] Jerome H Milgram and Weichang Li. Computational reconstruction of images from holograms. Applied optics, 41(5):853–864, 2002. [6] Joseph Katz and Jian Sheng. Applications of holography in fluid mechanics and particle dynamics. Annual Review of Fluid Mechanics, 42:531–555, 2010. [7] Jian Sheng, Edwin Malkiel, and Joseph Katz. Single beam two-views holographic particle image velocimetry. Applied optics, 42(2):235–250, 2003. [8] Daniel R Guildenbecher, Jian Gao, Phillip L Reu, and Jun Chen. Digital holog- raphy reconstruction algorithms to estimate the morphology and depth of non- spherical absorbing particles. In SPIE Optical Engineering+ Applications, pages 849303–849303. International Society for Optics and Photonics, 2012. [9] Lukas Ahrenberg, Andrew J Page, Bryan M Hennelly, John B McDonald, and Thomas J Naughton. Using commodity graphics hardware for real-time digital hologram view-reconstruction. Display Technology, Journal of, 5(4):111–119, 2009. 36 37 [10] Yasuyuki Ichihashi, Ryutaro Oi, Takanori Senoh, Kenji Yamamoto, and Taiichiro Kurita. Real-time capture and reconstruction system with multiple gpus for a 3d live scene by a generation from 4k ip images to 8k holograms. Optics express, 20(19):21645–21655, 2012. [11] S Bianchi and R Di Leonardo. Real-time optical micro-manipulation using op- timized holograms generated on the gpu. Computer Physics Communications, 181(8):1444–1448, 2010. [12] Tomoyoshi Shimobaba, Yoshikuni Sato, Junya Miura, Mai Takenouchi, and To- moyoshi Ito. Real-time digital holographic microscopy using the graphic processing unit. Optics express, 16(16):11776–11781, 2008. [13] Jacob P Fugal, Timothy J Schulz, and Raymond A Shaw. Practical methods for automated reconstruction and characterization of particles in digital in-line holograms. Measurement Science and Technology, 20(7):075501, 2009. [14] La´szlo´ Orzo´, Z Gorocs, Istva´n Szatma´ri, and Szabolcs Toke´s. Gpu implementation of volume reconstruction and object detection in digital holographic microscopy. In Cellular Nanoscale Networks and Their Applications (CNNA), 2010 12th Inter- national Workshop on, pages 1–4. IEEE, 2010. [15] Fook Chiong Cheong, Bhaskar Jyoti Krishnatreya, and David G Grier. Strategies for three-dimensional particle tracking with holographic video microscopy. Optics express, 18(13):13563–13573, 2010. [16] PassMark Software, 2014. Appendix A Glossary and Acronyms A.1 Glossary • Block – A collection of CUDA threads. May be 1-, 2-, or 3-dimensional. • Compute Unified Device Architecture (CUDA) – Nvidia Corporation’s GPU computing platform, allowing their graphics processors to perform general computation. • Connected-Component Labelling (CCL) – The process of examining the pix- els in an image and assinging those belonging to a single component (eg, a particle) the same label or number. • False-positive – A particle reported by the algorithm which doesn’t correspond to any physical particle present in the hologram. • Global memory – The on-GPU RAM of a CUDA GPU. Equivalent to the ”main memory” of a CPU. • Graphics Processing Unit (GPU) – A device in modern computers that han- dles rendering graphics on the display. Can also be used for general-purpose processing, operating in a massively parallel computing paradigm. • Grid – A collection of blocks in CUDA. May be 1-, 2-, or 3-dimensional. Each grid corresponds to a single kernel launch. 38 39 • Minimum-Intensity Image – A projection of a reconstructed hologram onto a 2D plane. Each x, y location contains the lowest value of any pixel in that x, y location across all z planes. • Shared memory – Fast, explicit cache on CUDA GPUs, available to program- mers to minimize accesses to global memory and maximize performance. • Tenengrad – Essentially the gradient of a pixel, calculated using all 8 pixels surrounding the pixel of interest. • Tenengrad Map – An image with pixels containing the tenengrad of each pixel of an original image; similar to a gradient map. • Thread – A single process operating on a CPU or GPU. • Warp – The unit of execution for a CUDA GPU. The threads in a warp are executed in lockstep by the GPU. Warps are transparent to the programmer, and managed by the GPU. A.2 Acronyms Table A.1: Acronyms Acronym Meaning CCL Connected-Component labelling CUDA Compute Unified Device Architecture FFT Fast Fourier Transform GPU Graphics Processing Unit SIMD Single Instruction, Multiple Data