To better understand GPU compute kernels and parallel computation in general, this project explored Barnes-Hut, an optimized N-body simulation algorithm. The algorithm relies on approximation, and its kernel is organized to make efficient use of memory accesses and so improve performance. Our work focused mainly on profiling the computation time of individual sections of the kernel and on comparing the effectiveness of the different optimizations it contains. The slides below detail the importance of cache utilization and memory coalescing when implementing highly parallel kernels.
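As a rough illustration of the approximation Barnes-Hut relies on, the sketch below shows the standard opening criterion: a tree cell is treated as a single mass at its center of mass when the ratio of cell size to distance falls below a threshold. This is a minimal Python sketch, not the kernel we profiled; the function name, the 2D tuples, and the default threshold of 0.5 are assumptions chosen for illustration.

```python
import math

def accept_as_single_mass(cell_size, cell_com, body_pos, theta=0.5):
    """Return True if a cell is far enough away to be approximated.

    cell_size: side length of the (hypothetical) tree cell
    cell_com:  (x, y) center of mass of the bodies in the cell
    body_pos:  (x, y) position of the body being updated
    theta:     opening threshold (assumed value; smaller is more accurate)
    """
    dx = cell_com[0] - body_pos[0]
    dy = cell_com[1] - body_pos[1]
    distance = math.hypot(dx, dy)
    # Guard against a zero distance, then apply the size/distance test.
    return distance > 0.0 and cell_size / distance < theta
```

When the test fails, the traversal descends into the cell's children instead, which is where most of the irregular memory access in a tree-based kernel comes from.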