CE 538: PARALLEL COMPUTER ARCHITECTURE
Term Project
Goal
The main goal of the term project is to write a comprehensive summary of a research topic of your interest, and/or
to do a performance/power analysis of realistic benchmarks using the gem5 simulator.
You can do the final project alone or in a team of two.
We recommend that you form a team: not only will you be able to tackle a more interesting project,
but you will also more than double the quality of the project by having someone else to brainstorm with.
Topics
You will need to study and understand a number of research papers (approximately 15-20) on a specific topic, write a summary, and present your work during class. Potential research topics (not a complete list):
- Automatic thread extraction and parallelization. Starting from a traditional, sequential implementation of an application, we seek to detect thread-level parallelism and automatically partition the computation into threads. This remains an open problem, but there has been considerable research in this direction in recent years.
- Architectural Synthesis from High-Level Representations. A lot of research effort has been expended over the last 10 years to automatically generate architectures starting from a program written in C or a similar programming language. This research has been mainly fueled by researchers in the hardware (but not architecture) community, and is currently limited to converting small functions or loop structures. Besides summarizing the most important prior work, you should comment on the pros and cons of various programming models for architectural synthesis. What are the basic features of a successful programming model for mapping complex applications automatically to hardware? Look at C/C++, streaming, data-parallel programming models, etc.
- Stream Computing. Stream computing models an application as a set of tasks communicating through streams of data. It facilitates extraction of coarse-grain parallelism, and better data communication and staging. Study the various streaming programming models and processors based on streaming. You should also comment on which families of applications can benefit from the streaming programming model, and on how streaming can be used for automatic hardware generation.
- Processor customization. Most processors designed and fabricated have a fixed, general-purpose architecture. Some researchers have proposed adding customizable instructions and hardware either before or even after fabrication. There are even companies that have created base processors that can be customized at compile time (Tensilica) or at run time (Stretch).
- Microprocessor reliability and validation
- Low-power and low-energy microprocessors. Most of the results obtained in the last 40 years of computer architecture have focused on increasing performance. In the last 10-15 years there has been a spate of power-oriented optimizations for processors, including voltage and frequency scaling focused mostly on media streaming. We also have some very low-level optimizations, such as shutting off subsections of the design, reducing address variations on pins, using high-threshold logic to reduce leakage, and subsetting the cache in various ways. More recently, embedded processors have provided hardwired units for high-level operations, such as TCP offload, video processing, compression, security, and packet processing. In principle, performance and energy efficiency are not necessarily opposing goals, in that both involve getting the most utility out of a certain amount of hardware resources. Energy optimizations pay attention to the utility of the resource-time product. Take a line of architectural research (such as cache organization, prediction, ILP, multiprocessing, thread support, I/O processors, or instruction set design) and examine its impact from an energy-efficiency viewpoint. Part of the challenge here is that the metrics of comparison and the appropriate benchmarks are much less straightforward than for performance.
- Transactional memory methodologies. Transactional memory borrows ideas from database transactions to eliminate fine-grain data synchronization. You should summarize and comment on the pros and cons of hardware, software, and hybrid methods of implementing transactional memory in a multiprocessor.
- Comparison of GPU and FPGA. Due to their massive computational power, graphics processing units (GPUs) and reconfigurable logic (FPGAs) have become popular platforms for executing high-throughput parallel applications. Applications typically exhibit vastly different performance characteristics depending on the accelerator. For the best application-to-accelerator mapping, factors such as programmability, sources of parallelism, performance, programming cost, and sources of overhead in the design flows must all be taken into consideration. After your summary, you should comment on the pros and cons of each platform regarding performance, scalability, power, power density, ease of programming, etc. You are expected to comment on the application characteristics and sources of parallelism better served by each platform.
- Networks on Chip (NoC). A NoC is an approach to designing the communication subsystem between cores in a multiprocessor system or System-on-Chip (SoC). It applies networking theory and methods to on-chip communication. A NoC is constructed from multiple point-to-point data links interconnected by switches (a.k.a. routers), such that messages can be relayed from any source module to any destination module over several links, by making routing decisions at the switches. Summarize the latest NoC research, and comment on the present and future applicability of NoCs to modern systems of various scales.
- Secure processors. Discuss hardware enhancements to general-purpose processors to implement secure computing. Also discuss the special cryptoprocessor accelerators that are often used to provide extra security.
- Cache optimizations for Chip Multiprocessors. As Chip Multiprocessors (CMPs) move to highly shared caches (e.g., Sun's Niagara has 32 threads sharing a 3MB L2 cache), the competition for these limited physical resources will become intense. One can imagine workloads where increasing the number of threads actually decreases throughput due to cache thrashing, just as we have long seen with page thrashing. What mechanisms should be included in hardware caches, and what policies should the OS implement, to limit this problem and maximize throughput?
- Performance analysis using the gem5 simulator. Use the gem5 open-source simulator to analyze the performance of a benchmark (e.g., benchmarks from PARSEC) under different configuration scenarios. You should analyze execution time, throughput, memory-hierarchy behavior, and thread behavior. You can also analyze power/energy consumption using the McPAT simulator.