Show simple item record

dc.contributor.advisor: Gratz, Paul
dc.creator: Radaideh, Ahmad Mahmoud Mesleh
dc.date.accessioned: 2021-01-06T21:53:00Z
dc.date.available: 2021-01-06T21:53:00Z
dc.date.created: 2020-05
dc.date.issued: 2020-04-20
dc.date.submitted: May 2020
dc.identifier.uri: https://hdl.handle.net/1969.1/191830
dc.description.abstract: Thread-parallel hardware, such as Graphics Processing Units (GPUs), greatly outperforms CPUs in compute throughput and memory bandwidth, which makes it ideal for accelerating a wide range of data-parallel applications. These designs deliver high-performance computing by supporting a massive thread-level parallelism (TLP) processing model. Our work focuses on making thread-parallel hardware more power- and energy-efficient and higher performing, and on making the simulation of this type of hardware more accurate. It is divided into three main parts: (1) We introduce a coalescing-aware register file organization that takes advantage of the narrow-width data frequently present in general-purpose applications in order to increase performance and reduce energy consumption in GPUs. We present a new design capable of combining read and write accesses originating from the same or different warps into fewer accesses. Our design reduces the number of register file accesses by 30.5%, achieves an IPC speedup of 16.5%, and reduces overall GPU energy by 32.2% on average. (2) We present a low-cost power-saving scheme for GPUs that dynamically exploits frequent zero data within and across registers in order to gate off register file reads and writes and execution units, reducing dynamic power without impacting performance. Our scheme reduces register file reads and writes on average by 50% and 54%, respectively. Register file and execution unit dynamic power are reduced on average by 27% and 19%, respectively, and total GPU dynamic power by about 8% on average. (3) For multi-threaded applications, results from full-system architecture simulation can often be inconsistent, primarily because of a combination of small input sets and the behavior of the Linux thread scheduler. We propose a simple solution in which the scheduler is modified to enforce a mapping of software threads onto distinct available processors; this provides consistent runtimes for short-running, multi-threaded benchmarks and therefore expected, consistent experimental results. (Illustrative code sketches of ideas (2) and (3) appear after the record below.) [en]
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: Register file coalescing [en]
dc.subject: zero gating [en]
dc.subject: power optimization [en]
dc.subject: performance optimization [en]
dc.subject: GPGPU [en]
dc.subject: full system simulation [en]
dc.subject: impact of thread scheduler [en]
dc.title: Power and Performance Optimization in GPGPU [en]
dc.type: Thesis [en]
thesis.degree.department: Electrical and Computer Engineering [en]
thesis.degree.discipline: Computer Engineering [en]
thesis.degree.grantor: Texas A&M University [en]
thesis.degree.name: Doctor of Philosophy [en]
thesis.degree.level: Doctoral [en]
dc.contributor.committeeMember: Hu, Jiang
dc.contributor.committeeMember: Braga Neto, Ulisses
dc.contributor.committeeMember: Kim, Eun
dc.type.material: text [en]
dc.date.updated: 2021-01-06T21:53:01Z
local.etdauthor.orcid: 0000-0003-2943-5019
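
Illustrative code sketches (not part of the archived record)

The zero-data gating idea in part (2) of the abstract can be pictured with the minimal C sketch below. It is a simplification under stated assumptions: the per-register zero flags, the register array size, and the function names are invented here for illustration, and the thesis itself works at the hardware/warp level rather than on scalar registers.

#include <stdbool.h>
#include <stdint.h>

#define NUM_REGS 64                     /* illustrative register count */

static uint32_t reg_file[NUM_REGS];     /* modeled register values */
static bool     is_zero[NUM_REGS];      /* per-register "all zero" flag */

/* When the zero flag is set, the physical register-file read (and its
 * dynamic energy) can be gated off and a constant zero forwarded instead. */
static uint32_t read_operand(int reg)
{
    if (is_zero[reg])
        return 0;                       /* gated: no array access needed */
    return reg_file[reg];               /* normal read */
}

/* Writes of zero only update the flag; non-zero values are written as usual. */
static void write_operand(int reg, uint32_t value)
{
    is_zero[reg] = (value == 0);
    if (!is_zero[reg])
        reg_file[reg] = value;
}

The scheduler change in part (3) enforces a mapping of software threads onto distinct processors inside the Linux kernel. A rough user-space analogue, assuming four worker threads and a placeholder benchmark body, pins each thread to its own CPU with the standard pthread_setaffinity_np call:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NUM_THREADS 4                   /* assumed thread count */

static void *worker(void *arg)
{
    long id = (long)arg;
    /* ... benchmark kernel would run here ... */
    printf("thread %ld on CPU %d\n", id, sched_getcpu());
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];

    for (long i = 0; i < NUM_THREADS; i++) {
        pthread_create(&threads[i], NULL, worker, (void *)i);

        /* Pin thread i to CPU i so each software thread runs on a distinct
         * core, mirroring the enforced mapping that stabilizes runtimes. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)i, &set);
        pthread_setaffinity_np(threads[i], sizeof(set), &set);
    }

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}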

