Research Statement
Hung-Wei Tseng

I have research experience in many areas of computer science and engineering, including computer architecture [1, 2, 3, 4], high-performance and reliable storage systems [5, 6, 3], software runtime systems [7, 8], programming languages [1, 7], compilers [9], embedded systems [10, 11, 12, 13], and computer networks [14, 15]. Much of my research springs from the observation that entrenched programming and execution models do not take advantage of modern parallel and heterogeneous computer architectures, creating redundancies and limiting applications' ability to use computing resources. The resulting software programs waste the potential of modern computer systems and undermine their effectiveness. Therefore, my research projects focus on making computation more efficient.

I started pursuing this route with my PhD thesis on the data-triggered threads (DTT) model. The DTT model eliminates redundant computation and creates non-traditional parallelism for applications running on multi-core processors through microarchitecture, programming languages, software runtime systems, and compiler optimizations. Since receiving my PhD, I have also led a large-scale project that builds efficient data storage and communication mechanisms for big-data applications in heterogeneous computing systems that include high-speed non-volatile memory devices, GPU accelerators, and FPGAs. These projects have laid the foundation for my future research, in which I hope to build efficient heterogeneous parallel computers and IoT (Internet of Things) systems for emerging applications by rethinking architectures, programming languages, systems, and compilers. Below, I describe my research projects at UCSD and sketch my future research directions.

Data-triggered threads

Data-triggered threads (DTT) [1, 2] is a programming and execution model that avoids redundant computation and exploits parallelism by initiating computation only when the application changes memory content. With the DTT model, code that depends on changing data can immediately execute in parallel, while code that depends on unchanged data can be skipped, avoiding redundant computation. I am the principal designer and developer of DTT. My thesis research makes the following contributions. (1) It defines a programming model, based on imperative programming languages, that allows programmers to express computation in a way that exposes redundancies and identifies new opportunities for parallel execution. (2) It proves that the DTT model requires only a small amount of microarchitectural change. (3) It provides a software-only solution for executing DTT applications on existing architectures. (4) It demonstrates how legacy programs can take advantage of the DTT model without any programmer intervention, using a transparent, compiler-only transformation. Viewed as a whole, my research projects on the DTT model exploit architectures, programming languages, runtime systems, and compilers to improve applications' performance and energy efficiency.

DTT model and microarchitecture

Most computers today use the von Neumann model, in which applications initiate parallelism based on the program counter. This conventional approach limits systems' ability to make use of parallelism. In addition, this model incurs significant redundant computation. Our research shows that loading redundant values (i.e., unchanged values previously loaded from the same addresses) accounts for more than 70% of memory loads, and that more than half of the computations that follow these loads are unnecessary. In my research, I defined a set of language extensions to C/C++ that allow programmers to describe applications using the DTT model. Unlike other proposals that try to reduce redundant computation, the DTT model needs only a small amount of hardware support. The simulation results [1] show that the DTT model can improve performance by 45% on SPEC2000 applications, and a significant portion of this performance gain comes from eliminating redundant computation.
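To make the model concrete, the following minimal sketch emulates the DTT idea in plain, runnable C. It is an illustration only: the actual C/C++ extensions in [1] and the runtime API in [7] use different syntax, and in the real system the triggered computation runs asynchronously in a separate thread. The key behavior is that the computation attached to a datum fires only when a store actually changes that datum.

    #include <stdio.h>

    static int data[4];     /* tracked data: changes trigger recomputation  */
    static int sum;         /* result that depends on the tracked data      */

    /* The "data-triggered thread": recomputes the dependent result.        */
    static void recompute(void) {
        sum = 0;
        for (int i = 0; i < 4; i++) sum += data[i];
    }

    /* Tracked store: triggers recompute() only when the value changes.     */
    static void store_tracked(int idx, int value) {
        if (data[idx] == value) return;   /* redundant store: skip the work */
        data[idx] = value;
        recompute();                      /* data changed: trigger the work */
    }

    int main(void) {
        store_tracked(0, 5);   /* value changes, so recompute() runs        */
        store_tracked(0, 5);   /* redundant, so recompute() is skipped      */
        printf("sum = %d\n", sum);
        return 0;
    }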
DTT runtime system

To increase the applicability of the DTT model, I designed a software-only framework. I applied several optimizations to minimize the multithreading overhead on real systems. In addition, the runtime system dynamically and transparently disables DTT when a DTT code section may potentially underperform the conventional model. The runtime system achieves a 15% performance improvement on SPEC2000 applications using an additional thread and, when DTT parallelism is added to traditional parallelism, a 19% speedup on PARSEC applications [7]. To provide better support for fine-grained, massive data-level parallelism, I further extended the runtime system and the DTT model to allow out-of-order execution of multiple data-triggered threads. The extended DTT model also enables the runtime system to schedule tasks according to data locations and further improve performance. The results in [8] demonstrate that the DTT model can effectively overlap computation with I/O and achieve better scalability than traditional parallelism.

CDTT compiler

While the original DTT proposal relies on programmers' efforts to achieve performance improvement, I have also been working on an LLVM-based compiler, CDTT, that automatically generates DTT code for legacy C/C++ applications [9]. Even without profile data, the CDTT compiler can identify code that contains redundant behavior, which is where the DTT model provides the largest performance advantages. In most applications, the compiled binary running on the software-only DTT runtime system achieves nearly the same level of performance as programmers' hand modifications, with an average 10% performance gain for SPEC2000 [9].

Efficient Heterogeneous Systems for Data-Intensive Applications

The growing size of application data, coupled with the emergence of heterogeneous computing resources, high-performance non-volatile memory technologies, and network devices, is reshaping the computing landscape. However, programming and execution models on these platforms still follow the CPU-centric approach, which results in inefficiencies. I am currently leading a group of 6 PhD students and 2 undergraduate students in a project to improve the performance of data-intensive applications on these systems. Within 15 months, the project has made the following contributions: (1) We designed a system that enables peer-to-peer data transfers between SSDs and GPUs, eliminating redundant CPU and main memory operations. (2) We designed a simplified API and system software stack that improves data transfer performance and accelerates applications. (3) We demonstrated that exploiting the processing power inside storage devices to deserialize application objects improves energy efficiency without sacrificing performance. The following paragraphs describe these projects in more detail.

Efficiently Supplying Data in Computers

As computers become heterogeneous, the demands of exchanging data among storage and computing devices increase. One set of GPU benchmarks we studied left the GPU idle for 54% of the total execution time because of stalls due to data transfers [3]. Current programming models in heterogeneous computing systems still transfer data through the CPU and the main memory, even though the majority of the computation may not use the CPU. The result is redundant data copies that consume CPU time, waste memory bandwidth, and occupy memory space.
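The code below shows this conventional path as a minimal sketch using standard POSIX and CUDA runtime calls (error handling omitted). Even though the GPU performs the computation, the data makes two trips, one into a main-memory staging buffer and one from that buffer to the GPU, consuming CPU time and memory bandwidth along the way.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <cuda_runtime.h>

    /* Load `size` bytes of file `path` into GPU memory the conventional way. */
    void load_to_gpu(const char *path, void **dev_buf, size_t size) {
        void *host_buf = malloc(size);       /* staging buffer in main memory */
        int fd = open(path, O_RDONLY);
        read(fd, host_buf, size);            /* copy #1: SSD -> main memory   */
        close(fd);
        cudaMalloc(dev_buf, size);
        cudaMemcpy(*dev_buf, host_buf, size, /* copy #2: main memory -> GPU   */
                   cudaMemcpyHostToDevice);
        free(host_buf);
    }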
In addition, current programming models require applications to set up the data transfer themselves, preventing applications from dynamically utilizing more efficient data transfer mechanisms. To address these deficiencies, my team and I re-engineered the system to support peer-to-peer data transfers between an SSD and a GPU, bypassing the CPU and the main memory. We defined an application interface that frees applications from the task of setting up data routes, and we designed a runtime system that dynamically chooses the most efficient route to carry application data. A real-world evaluation shows that the proposed design can improve application performance by 46% and reduce energy use by 28%, without modifying the computation kernels. Such a system is even more effective for multiprogrammed server workloads, where improving the utilization of computing resources yields a 50% performance gain [3]. The resulting system is now the backbone of many other research projects in the group; I am advising several students as they develop database systems and a high-performance MapReduce framework on this platform.
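From the application's point of view, the interface can be as simple as naming the data and its destination. The sketch below is hypothetical (the actual interface appears in [3]; the function and type names here are invented for illustration), but it captures the design: the application never constructs a route, and the runtime transparently chooses peer-to-peer or a staged copy through main memory, whichever is more efficient at the moment.

    #include <stddef.h>

    /* Hypothetical interface sketch; not the actual API from [3].         */
    typedef enum { ROUTE_AUTO, ROUTE_P2P, ROUTE_STAGED } route_t;

    /* Move `size` bytes at `offset` of file `path` into GPU memory at
       `gpu_dst`. With ROUTE_AUTO, the runtime picks the best route, e.g.,
       peer-to-peer over PCIe when the SSD and GPU both support it.        */
    int g_transfer(const char *path, size_t offset,
                   void *gpu_dst, size_t size, route_t route);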
Efficiently Using Computing Resources

The emergence of heterogeneous computing systems also encourages us to re-examine the role of the CPU and make greater use of the processing resources spread throughout the system. In my research, I observed that using the CPU to deserialize application objects from text files accounts for 64% of total execution time and prevents applications from sending data directly between storage devices and heterogeneous accelerators. At the same time, emerging NVMe SSDs contain energy-efficient embedded cores that allow us to perform object deserialization while bypassing the system overhead. I worked with two undergraduate researchers to move deserialization onto these unused processing resources within the SSD (the sketch below illustrates the kind of work being moved). By offloading object deserialization from the CPU to the SSD, we were able to speed up applications by 1.39× and reduce energy consumption by 42%. This work has demonstrated the value of redefining the interaction between applications and storage devices. Accordingly, I am working with the group to provide innovative SSD interfaces and network semantics that help eliminate inefficient CPU code and improve system efficiency.
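Deserializing numbers from their text representation into binary objects is simple, byte-by-byte parsing, a poor use of a high-performance CPU but a good match for the SSD's energy-efficient embedded cores. The plain-C sketch below shows the flavor of the work; conceptually, a loop like this is what moves from the host into the SSD.

    #include <stdlib.h>

    /* Deserialize up to `max` whitespace-separated numbers from the text
       buffer `text` into the binary array `out`; returns the count parsed. */
    size_t deserialize_doubles(const char *text, double *out, size_t max) {
        char *end;
        size_t n = 0;
        while (n < max) {
            double v = strtod(text, &end);
            if (end == text) break;       /* no more parsable numbers       */
            out[n++] = v;
            text = end;
        }
        return n;
    }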
Future Directions

Because of the limitations imposed by dark silicon, computer systems are shifting to rely more on parallel and heterogeneous architectures. In this environment, the overhead of moving data among devices becomes critical for performance, especially when an application needs to deal with a large amount of data. While my prior work helps address the problem of redundant computation, as well as unnecessary data movement (including cache-to-cache, memory-to-memory, and storage-to-memory), the fundamental demands of moving data from data storage to device memory remain a challenge for high-performance computing resources. To reduce the amount of data movement and improve the scalability of computing, I would like to bring computation closer to data sources, so that the system propagates computation and data to other computing resources only when doing so is beneficial. To provide this kind of in-house computing power for data sources (including storage devices, network interface cards, and memory modules), I plan to investigate efficient processor designs for these devices. In addition, I will enhance and extend programming tools so that programmers can easily take advantage of this new architecture. The ultimate goal is to perform computation at the optimum place within every kind of computer system, including heterogeneous parallel computers and emerging larger-scale systems like IoT. In the following, I outline several specific research topics directed toward this goal.

Embedding computing power in different layers of the system

Embedding computing power in different layers of data storage is not just a matter of adding processors to peripheral devices. Each type of hardware's specific characteristics affect the design philosophy of processors that will work on that device. The target applications running in the system also drive design decisions and require different trade-offs in the architecture. This line of inquiry raises many research questions. For example, what types of processor architecture are needed to balance the various performance, power, energy, and hardware costs of different types of data storage and I/O devices? Which layers of the memory hierarchy and what kinds of I/O devices need computing power? How do these new processors in the memory/storage hierarchy affect the role and design of the CPU? How can we efficiently share data but still maintain consistency and coherence among processors on different devices in a system? What architectural support is required to efficiently and dynamically move computation across different processing units in the new design?

The computing scenarios of data storage and I/O devices differ from those of CPUs in several respects, including latencies, bandwidth, and parallelism. For example, the access time of current storage-class non-volatile memory technology is still orders of magnitude slower than on-chip caches. At the same time, these devices also offer rich internal parallelism. In addition, workloads that are suitable for processing inside storage or I/O devices can exhibit different behaviors from those that current CPUs are optimized for. As a result, we will need brand-new processor architectures to efficiently process these workloads. With heterogeneous processors taking on the burden of computation in the system, we can revisit the CPU design and refocus the CPU on compute-intensive workloads.

Sharing data efficiently among different processor-equipped devices requires new models of coherence and consistency. Even in the current computation model, maintaining coherence across high-speed volatile memories with different latencies still incurs significant overhead. If the system is to extend memory coherence and consistency even further (e.g., to the non-volatile memory that most storage devices use), data persistence becomes even more challenging, as does the problem of latency.
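The persistence challenge is visible even at the level of a single store on today's hardware. The fragment below uses real x86 intrinsics from <immintrin.h>: a store to non-volatile memory becomes coherent (visible to other cores) immediately, but making it durable requires explicitly flushing the cache line and ordering that flush, and every such flush adds latency on the critical path.

    #include <immintrin.h>

    /* Make a single store durable on x86: flush the line, then fence.     */
    void persist_store(int *nvm_addr, int value) {
        *nvm_addr = value;      /* coherent right away; durable only later */
        _mm_clflush(nvm_addr);  /* push the cache line out toward the NVM  */
        _mm_sfence();           /* order the flush before later stores     */
    }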
Easy-to-use programming model and efficient system software

As computer systems offer increased computing resources, writing a program that best utilizes those resources can become enormously complicated yet still provide only limited flexibility. To efficiently manage tasks and communication across different computing resources, we must rethink programming models, as well as the design of software interfaces, runtime systems, and operating systems. Extending existing programming models and their runtime systems, such as Spark and MapReduce, is the first step in this direction. With these high-level programming models hiding the details of hardware architecture and data storage, programmers can distribute computation to different heterogeneous devices by composing a single program. However, designing a lightweight and efficient runtime system that supports these programming models on a system with tens of computing devices (or more) presents a wealth of research problems. Programming models like the DTT model that trigger asynchronous, event-driven, and data-aware parallelism can also serve as an alternative programming framework, as these models avoid the latency and resource-utilization issues of Spark and MapReduce. A programming model like DTT that is compatible with existing high-level programming languages presents the opportunity to explore ways of using legacy code to generate new code that can make effective use of emerging computing resources. (My work on the CDTT compiler described above is only one example.)

Near-data processing for IoT

Bringing computation closer to data locations is beneficial for systems where communication is the dominant cost. In the IoT, where the latency and energy consumption required to exchange data through wireless communication generate significant overhead, applying the concept of near-data processing is especially valuable. My previous work in this area [14, 15] helps reduce the energy consumption of wireless communication, but fundamentally changing the computation model can improve the situation even further. As with supporting processing inside data-storage devices within a computer, bringing computation closer to IoT devices requires both hardware and software support. However, designing a processor for tiny wireless devices is more challenging than doing so within a single-node computer, because we must both keep most computation in-house to reduce the amount of outgoing data and stay within a limited energy budget (the sketch below illustrates this in-house processing). Processors that combine an energy-efficient, general-purpose core with hardware accelerators would be a compelling choice. To make programming IoT systems as easy as writing programs on a single computer, I also plan to explore a programming-model design that allows the programmer to easily configure tasks while requiring the least possible middleware overhead from each device. To further reduce the cost of communication, I am also interested in developing a lightweight, RDMA-like protocol that bypasses part of the overhead of the Internet stack.
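As a minimal sketch of what in-house processing buys on a sensor node, the fragment below summarizes a window of raw samples locally and transmits only the summary; radio_send() is a hypothetical stand-in for the node's network stack. Cutting the outgoing data by the window size cuts the radio traffic, and with it the dominant energy cost, by the same factor.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical helper: transmit `len` bytes over the node's radio.    */
    void radio_send(const void *msg, size_t len);

    /* Near-data processing in-house: send one mean per 64-sample window
       instead of 64 raw readings -- a 64x reduction in outgoing traffic.  */
    void report_window(const uint16_t samples[64]) {
        uint32_t sum = 0;
        for (int i = 0; i < 64; i++)
            sum += samples[i];
        uint16_t mean = (uint16_t)(sum / 64);
        radio_send(&mean, sizeof mean);
    }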
Finally, since many architectural design decisions are determined by the needs of real applications, I also hope to conduct interdisciplinary research projects with researchers from health care, the social sciences, and human-computer interaction to understand the behavior of IoT applications.

The significant changes in computer architectures and application demands are forcing us to rethink computing models, including all aspects of architectures, programming languages, compilers, and systems. In my future work as a professor, I plan to focus on bridging the gap between emerging computer architectures and applications, and on conducting interdisciplinary projects that will help shed light on these emerging issues.

References

[1] H.-W. Tseng and D. M. Tullsen, Data-triggered threads: Eliminating redundant computation, in 17th International Symposium on High Performance Computer Architecture (HPCA 2011), pp. 181–192, 2011.
[2] H.-W. Tseng and D. M. Tullsen, Eliminating redundant computation and exposing parallelism through data-triggered threads, IEEE Micro, Special Issue on the Top Picks from Computer Architecture Conferences, vol. 32, pp. 38–47, 2012.
[3] H.-W. Tseng, Y. Liu, M. Gahagan, J. Li, Y. Jin, and S. Swanson, Gullfoss: Accelerating and simplifying data movement among heterogeneous computing and storage resources, Tech. Rep. CS2015-1015, Department of Computer Science and Engineering, University of California, San Diego, 2015.
[4] C.-L. Yang, A. R. Lebeck, H.-W. Tseng, and C.-H. Lee, Tolerating memory latency through push prefetching for pointer-intensive applications, ACM Transactions on Architecture and Code Optimization (TACO), vol. 1, no. 4, pp. 445–475, 2004.
[5] H.-W. Tseng, L. M. Grupp, and S. Swanson, Understanding the impact of power loss on flash memory, in 48th Design Automation Conference (DAC 2011), pp. 35–40, 2011.
[6] H.-W. Tseng, L. M. Grupp, and S. Swanson, Underpowering NAND flash: Profits and perils, in 50th Design Automation Conference (DAC 2013), pp. 1–6, 2013.
[7] H.-W. Tseng and D. M. Tullsen, Software data-triggered threads, in ACM SIGPLAN 2012 Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA 2012), pp. 703–716, 2012.
[8] H.-W. Tseng and D. M. Tullsen, Data-triggered multithreading for near-data processing, in 1st Workshop on Near-Data Processing (WoNDP 2013), 2013.
[9] H.-W. Tseng and D. M. Tullsen, CDTT: Compiler-generated data-triggered threads, in 20th International Symposium on High Performance Computer Architecture (HPCA 2014), pp. 650–661, 2014.
[10] H.-L. Li, C.-L. Yang, and H.-W. Tseng, Energy-aware flash memory management in virtual memory system, IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 16, no. 8, pp. 952–964, 2008.
[11] H.-W. Tseng, H.-L. Li, and C.-L. Yang, An energy-efficient virtual memory system with flash memory as the secondary storage, in 2006 International Symposium on Low Power Electronics and Design (ISLPED 2006), pp. 418–423, 2006.
[12] C.-L. Yang, H.-W. Tseng, C.-C. Ho, and J.-L. Wu, Software-controlled cache architecture for energy efficiency, IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 5, pp. 634–644, 2005.
[13] C.-L. Yang, H.-W. Tseng, and C.-C. Ho, Smart cache: An energy-efficient D-cache for a software MPEG-2 video decoder, in 2003 Joint Conference of the Fourth International Conference on Information, Communications and Signal Processing and Fourth Pacific Rim Conference on Multimedia (ICICS-PCM 2003), pp. 1660–1664, 2003.
[14] S.-H. Yang, H.-W. Tseng, E. H.-K. Wu, and G.-H. Chen, Utilization based duty cycle tuning MAC protocol for wireless sensor networks, in IEEE Global Telecommunications Conference (GLOBECOM 2005), pp. 3258–3262, 2005.
[15] H.-W. Tseng, S.-H. Yang, P.-Y. Chuangi, E. H.-K. Wu, and G.-H. Chen, An energy consumption analytic model for a wireless sensor MAC protocol, in IEEE 60th Vehicular Technology Conference (VTC2004-Fall), pp. 4533–4537, 2004.