Research Statement Yiying Zhang
My research interests span operating systems, distributed systems, computer architecture, networking, and data analytics, with a focus on building fast, reliable, and flexible systems for emerging hardware and applications. My doctoral research centers on file and storage systems, specifically on how to remove redundant levels of indirection in storage systems [1, 2, 3]. More recently, I have been exploring and building systems for next-generation fast, byte-addressable, non-volatile main memory (NVMM) [4, 5]. I have also worked on other aspects of storage systems [6, 7, 8], and I have experience in several different fields including machine learning and bioinformatics [9, 10]. My research approach has three characteristics:

Reexamining established principles in light of new hardware and applications. System software, hardware, and applications have changed dramatically over the past decades and will continue to evolve. I believe that these new technologies and applications call for a reexamination of how we apply many established computer science design principles. My dissertation research reconsiders indirection and its costs for new types of storage systems. More recently, I revisited traditional data replication schemes for NVMM.

Building real systems. I believe in prototyping research projects in real systems and understand the challenges and benefits of doing so. My work on a hardware prototype of a new I/O system is among the few research projects on flash-based SSDs built with real hardware. As a system designer, my experience with hardware has been both challenging and rewarding. I learned that implementing designs in reality can bring unexpected difficulties to light and inspire completely new designs.

Interdisciplinary research. I enjoy interdisciplinary research and have worked in several quite different areas, including storage systems, operating systems, distributed systems, computer architecture, networking, scheduling and optimization, machine learning, and bioinformatics. In several of my projects, I have looked at how software and hardware should interact with each other in light of technology shifts. My interdisciplinary background enables me to see the whole picture when building systems.

Below, I first describe my thesis and more recent research, then briefly discuss research I have done in other areas, and finally present my future research agenda.

1 De-indirection in Storage Systems

Indirection is a core technique in computer systems that offers many benefits. However, it is easy to overlook its costs. As software and hardware systems become more complex, redundant levels of indirection can exist in a single system. For example, running a file system on top of a device with indirection (such as a flash-based SSD) creates two levels of indirection: a block is first mapped from a file offset to its logical address and then from the logical address to its physical address in the device. Excess indirection can cause performance and memory overheads. For instance, flash-based SSDs maintain their level of indirection in SSD-internal DRAM, imposing performance, monetary, and energy costs. Such costs are of even greater concern when SSD capacity grows or when SSDs are deployed in mobile devices.
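To make the two levels of mapping concrete, the toy C sketch below resolves a file block first through the file system's map and then through the FTL's map, and then shows the de-indirected case in which the file system records physical addresses directly; the table sizes and contents are made up for illustration and this is not code from my systems.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy illustration (hypothetical sizes and contents): a file-system map
     * (file block -> logical block address) stacked on an SSD flash
     * translation layer (logical block address -> physical flash page). */
    #define NBLOCKS 8

    static uint32_t fs_map[NBLOCKS]  = {3, 7, 1, 0, 5, 2, 6, 4}; /* file -> LBA */
    static uint32_t ftl_map[NBLOCKS] = {6, 4, 0, 2, 7, 1, 5, 3}; /* LBA  -> PPA */

    /* With both layers present, every access resolves two mappings; the second
     * table lives in SSD-internal DRAM and grows with device capacity. */
    static uint32_t resolve_two_levels(uint32_t file_block)
    {
        uint32_t lba = fs_map[file_block];   /* hop 1: file system metadata */
        return ftl_map[lba];                 /* hop 2: device mapping table */
    }

    /* De-indirection collapses the hops: the file system records the physical
     * address directly, so the per-block device-side table is no longer needed. */
    static uint32_t deindirected_map[NBLOCKS]; /* file -> PPA */

    int main(void)
    {
        for (uint32_t b = 0; b < NBLOCKS; b++)
            deindirected_map[b] = resolve_two_levels(b);

        for (uint32_t b = 0; b < NBLOCKS; b++)
            printf("file block %u -> physical page %u\n", b, deindirected_map[b]);
        return 0;
    }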
For my Ph.D. dissertation, I proposed removing excess indirection, or de-indirection, for flash-based SSDs. I used two methods to perform de-indirection. The first method is to avoid excess indirection in flash-based SSDs in the first place. The second method is to allow excess indirection to be created and later remove part of it dynamically.

Avoiding excess indirection. Using the first method of de-indirection, I designed and implemented Nameless Writes [1], a new device interface that removes the need for indirection in flash-based SSDs. Nameless writes allow the device to choose the location of a write and inform the file system of the name (i.e., the physical address) where the block now resides. The file system then records the physical address in its metadata for future accesses. One of the major challenges I encountered in designing nameless writes is that flash-based SSDs must move physical blocks for tasks like wear leveling, and these physical address changes need to be reflected in the file system metadata. I used a new method in which the device sends callbacks to the file system to inform it about physical address changes. I ported the Linux ext3 file system to nameless writes and developed a flash-based SSD emulator that models typical SSD configurations and firmware. Experiments show that nameless writes reduce the SSD mapping table size (and thus the amount of DRAM in the SSD) by 14× to 50× and improve random write performance by 20× compared to a typical traditional SSD.
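The following sketch shows, in toy user-level form, the general shape of this interface; the function names and callback signature are hypothetical, and the real interface in [1, 2] is a kernel block-device interface. The device chooses where a block lands and returns that address as its name, and a later device-initiated migration is reported through a callback so the file system can patch its metadata.

    #include <stdint.h>
    #include <stdio.h>

    /* A user-level toy of the nameless-write idea with hypothetical names;
     * it is not the actual interface from [1, 2]. */
    typedef uint64_t ppa_t;                 /* physical flash address */

    /* Callback the file system registers to learn about device-initiated moves. */
    typedef void (*migration_cb)(ppa_t old_addr, ppa_t new_addr, void *fs_private);

    /* Toy "device": appends writes and occasionally relocates a block. */
    static ppa_t next_free = 0;

    static ppa_t nameless_write(const void *data, size_t len)
    {
        (void)data; (void)len;
        return next_free++;                 /* device picks the location */
    }

    static void device_migrate(ppa_t victim, migration_cb cb, void *fs_private)
    {
        ppa_t new_addr = next_free++;       /* e.g., wear leveling moves the block */
        cb(victim, new_addr, fs_private);   /* tell the file system the new name  */
    }

    /* Toy "file system": a one-file inode that stores physical addresses directly. */
    static ppa_t inode_blocks[4];

    static void fs_on_migration(ppa_t old_addr, ppa_t new_addr, void *fs_private)
    {
        ppa_t *blocks = fs_private;
        for (int i = 0; i < 4; i++)
            if (blocks[i] == old_addr)
                blocks[i] = new_addr;       /* update metadata for future reads */
    }

    int main(void)
    {
        char buf[4096] = {0};
        for (int i = 0; i < 4; i++)
            inode_blocks[i] = nameless_write(buf, sizeof buf);

        device_migrate(inode_blocks[2], fs_on_migration, inode_blocks);

        for (int i = 0; i < 4; i++)
            printf("inode block %d lives at physical address %llu\n",
                   i, (unsigned long long)inode_blocks[i]);
        return 0;
    }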
I was then curious to see how nameless writes would work with real hardware, since they require fundamental changes in the internal workings of the device, its interface to the host operating system, and the host OS itself. Using an experimental flash-based SSD hardware board, I developed a hardware prototype of the nameless-write SSD [2]. During the prototyping, I discovered several challenges not foreseen in my emulation or in previously published work. Specifically, because the nameless-writes interface is fundamentally different from traditional I/O interfaces, it is difficult to integrate nameless writes into the existing fixed SATA storage interface. To tackle these problems, I redesigned the SSD storage system by placing minimal functionality in the device and moving complex functionality into the kernel. Nameless writes have attracted attention from several storage companies, and a major mobile device manufacturer even expressed interest in deploying nameless writes in its cell phones.

Reducing existing excess indirection. Using the second method of de-indirection, I designed, implemented, and evaluated the File System De-Virtualizer (FSDV) [3], a system that dynamically removes storage-device indirection costs. FSDV is a flexible, lightweight tool that reduces excess indirection by changing file system pointers to use device physical addresses. When FSDV is not running, the file system and the device both maintain their indirection layers and perform normal I/O operations. When it runs, FSDV significantly reduces indirection mapping table space in a dynamic way while preserving foreground I/O performance. Moreover, because most of the functionality is placed in FSDV, only small changes are required in existing storage systems.

2 Non-Volatile Main Memory

Next-generation non-volatile memories (NVMs) promise DRAM-like performance, persistence, and high density. They can attach directly to processors to form non-volatile main memory (NVMM) and offer the opportunity to build very low-latency storage systems. Recently, I have been working on understanding NVMM performance and building systems designed for NVMM.

Understanding and improving NVMM performance. I analyzed storage application performance with NVMM using a hardware NVMM emulator [5]. I found that although NVMM is projected to have higher latency and lower bandwidth than DRAM, these differences have only a small impact on application performance. Rather, the bottleneck of NVMM performance is the cost of ensuring that data resides safely in NVMM (rather than in the volatile caches). In response, I designed and implemented an approach that selectively flushes data from CPU caches to minimize this cost. This technique significantly improves performance (up to 240×) for applications that require strict durability and consistency guarantees over large regions of memory.
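As a rough illustration of the idea of selective flushing (not the mechanism evaluated in [5]), the sketch below writes back only the cache lines that cover a small modified record, rather than an entire mapped region, before treating the update as durable; the 64-byte line size, the x86 flush intrinsics, and the stand-in buffer are assumptions.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <immintrin.h>   /* _mm_clflush, _mm_sfence (x86) */

    #define CACHE_LINE 64    /* assumed line size */

    /* Flush only the cache lines covering [addr, addr+len) and fence, instead
     * of writing back an entire mapped region before declaring an update durable. */
    static void persist_range(const void *addr, size_t len)
    {
        uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
        uintptr_t end = (uintptr_t)addr + len;

        for (; p < end; p += CACHE_LINE)
            _mm_clflush((const void *)p);   /* write back just the dirty lines   */
        _mm_sfence();                       /* order flushes before later stores */
    }

    /* Toy usage: a 64-byte log record in a buffer that merely stands in for a
     * mapped NVMM range. */
    static char nvmm_region[1 << 20];

    int main(void)
    {
        struct { uint64_t seq; char msg[56]; } *rec = (void *)nvmm_region;

        rec->seq = 1;
        memcpy(rec->msg, "hello, durable world", 21);
        persist_range(rec, sizeof *rec);    /* ~1 line flushed, not the whole MB */
        return 0;
    }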
Reliable and highly-available NVMM. Reliability and availability are critical to the success of NVMM as persistent storage in large-scale data center environments. Providing reliability and availability for NVMM is challenging, since the latency of data replication can squander the low latency that NVMM can provide. I reexamined traditional data replication methods and the tradeoffs among performance, reliability, availability, and consistency in light of new NVMM technologies. As a result, I designed, developed, and evaluated Mojim [4], a system that provides the reliability and availability that large-scale storage systems require while preserving the low-latency performance of NVMM. Mojim achieves these goals by using highly optimized RDMA-based replication protocols and a two-tier architecture in which the primary tier contains a mirrored pair of nodes and the secondary tier contains one or more backup nodes with weakly consistent copies of the data. I implemented Mojim as a generic layer in the Linux kernel and developed an optimized RDMA-based protocol to replicate fine-grained data in NVMM. Experiments show that, surprisingly, Mojim provides replicated NVMM with only 29% to 73% of the average latency of un-replicated NVMM and 0.5× to 3.5× its bandwidth. It is also 3.4× to 4× faster than existing replication schemes. Mojim is recent work, but many companies have already shown interest in Mojim and my other NVMM-related research; some of them are building NVMM-based systems with architectures and approaches that closely resemble Mojim.
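The toy sketch below shows the overall shape of such a two-tier replication flow; the names are hypothetical and this is not Mojim's interface or protocol. In a real deployment, the synchronous copy to the mirrored node would be an RDMA write, and the secondary backups would be updated off the critical path.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Hypothetical two-tier replication flow; the "nodes" are local buffers so
     * the control flow is visible without any RDMA setup. */
    #define REGION_SIZE 4096

    static char primary[REGION_SIZE];      /* primary node's NVMM region        */
    static char mirror[REGION_SIZE];       /* primary tier: mirrored pair       */
    static char backup[REGION_SIZE];       /* secondary tier: weakly consistent */

    /* Synchronous path: the caller's commit returns only after the mirrored
     * node holds the update (one replicated write plus an ack in a real system). */
    static void replicate_to_mirror(size_t off, size_t len)
    {
        memcpy(mirror + off, primary + off, len);
    }

    /* Asynchronous path: secondary backups are brought up to date lazily, so
     * they may briefly lag behind (weak consistency). */
    static void replicate_to_backup_async(size_t off, size_t len)
    {
        memcpy(backup + off, primary + off, len);   /* imagine this being queued */
    }

    /* Application-visible commit of a fine-grained range. */
    static void commit_range(size_t off, size_t len)
    {
        replicate_to_mirror(off, len);          /* on the critical path  */
        replicate_to_backup_async(off, len);    /* off the critical path */
    }

    int main(void)
    {
        memcpy(primary + 128, "balance=42", 10);
        commit_range(128, 10);
        printf("mirror sees: %.10s\n", mirror + 128);
        return 0;
    }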
Other aspects of NVMM. I have also been working on a few other NVMM-related projects, including optimizations for NVMM in virtualized environments and data placement in hybrid DRAM and NVMM systems. In addition, I am currently mentoring six Ph.D. students and visiting scholars to tackle problems in various areas, including architectural and programming-language support for NVMM, distributed NVMM systems, and OS optimizations for NVMM.

3 Other Research

Storage systems. Apart from the research described above, I have worked on several other aspects of storage systems. First, I analyzed data-center storage workloads and designed and developed a system for accelerating storage-level cache warmup [6]. Experiments show that this system speeds up cache warmup by 14% to 100% and provides 44% to 228% more server-load reduction than traditional cache warmup. Second, I designed and implemented a system that prevents correlated device failures in a flash-based storage array by carefully introducing slightly heavier dummy writes on one device, causing it to fail sooner, and then slowing down I/O rates on the surviving device [11]. Third, I analyzed the contents of different file servers for duplicate-aware usage, and designed and implemented Duplicate-Aware Disk Arrays (DADA) [7], a system that keeps track of block duplication and uses it to improve the reliability and availability of storage arrays. DADA reduces disk scrubbing and recovery time by 17% to 26%.

Scheduling, optimization, machine learning, and bioinformatics. Prior to my Ph.D. work in storage systems, I worked in several other fields. I analyzed energy-usage logs of the Condor cluster at the University of Wisconsin and developed scheduling algorithms to reduce the energy consumption of server clusters. I also worked at a startup company, where I was the primary developer of a new product, Locomotive Planning Optimizer, which optimizes locomotive routing and scheduling using integer programming and graph theory. While studying for my master's degree, I investigated several machine learning techniques, including decision trees, neural networks, and support vector machines, to learn the cleavage properties of HIV-1 protein sequences and other properties of DNA and RNA sequences [9, 10].

4 Future Research

Going forward, I hope to leverage my areas of expertise and to explore a broad range of other areas. Below, I discuss three research directions that I plan to take in the future: (1) building rack-scale systems for little big data analysis; (2) automating software and hardware system implementation; and (3) interdisciplinary research.

4.1 Rack-Scale Little Big Data

Analyzing the enormous amount of data in today's world is challenging. Much research has focused on supporting the analytics and processing of gigantic datasets on the scale of petabytes or exabytes. However, there are also many analytics tasks that involve only a few terabytes of data [12, 13, 14]. For example, as of October 2014, GenBank, a database that contains all publicly available DNA sequences, holds only 680 GB of data [15], and the Facebook friendship graph contains less than 2 TB of data [16]: an amount that can fit in the memory of a common server rack. Even for some datasets in the peta- or exabyte range, analytics is performed only on a condensed form of the raw data (e.g., a few summarizing values of a raw MRI image). These little big data (LBD) present their own challenges. Many of these analytics workloads are highly diverse in their data skew, job burstiness, and I/O and compute patterns. Another trend is the increasing demand for interactive data analytics, such as real-time risk management. Going forward, I plan to build Rack-Scale Analytical Systems (RacSAS) to more efficiently support LBD, the diversity among these datasets, and the need for interactive access.
With the growing complexity of hardware and the heterogeneity of analytics applications, a single framework and a fixed hardware configuration will not be enough. I advocate a reconfigurable hardware layer and a flexible, lightweight software layer. To achieve this flexibility, I believe we need to rethink the abstractions of different software and hardware layers and expose the right amount of information across layers. With such cross-layer information, we can build flexible system software and analytics frameworks that better utilize and share hardware resources and support application diversity. I would also like to explore more radical design approaches such as disaggregated computing, storage, and networking resources. In the following, I outline several research directions towards building RacSAS.

Configurable storage system. Many data analytics tasks are data-centric and exhibit heterogeneity, high skew, burstiness, and iterative patterns. To support such diversity, I plan to design a tiering architecture of DRAM, non-volatile main memory, and flash. The decreasing price of flash makes it possible to store all raw LBD data on flash in RacSAS. Emerging flash-based SSDs have easily programmable computation capabilities inside them [17] and can thus perform raw-data preprocessing. While flash-based SSDs offer better performance than hard disks, their performance can be irregular because of SSDs' internal operations [1]. Exposing the performance cost of such internal operations enables system software to better schedule and coordinate tasks at different layers, for example by allowing SSDs to schedule their wear-leveling operations at system idle time. To take another example, next-generation non-volatile memories promise low latency and byte addressability, but they are still likely to exhibit read-write performance imbalance; conveying such information to software layers would be helpful as well. To manage these storage devices and enable diverse applications, I believe that we should have highly flexible storage software. For example, it should be easy to configure what form the data is stored in, how much data is cached at each layer, what consistency and reliability model to use for different applications, and whether to remove data duplication. I plan to decouple traditional file and storage system functionalities to enable a configurable storage system. My previous work in cross-layer design [1, 2], file system design [1, 3], data caching [6, 8], reliability and consistency models [4], and data duplication [7] has prepared me for building this kind of configurable storage system.
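A per-application policy of this kind could be expressed as a small configuration structure; the sketch below is purely illustrative (its fields and enums are assumptions, not an existing RacSAS interface).

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Illustrative only: one way a RacSAS-style storage layer could let each
     * application declare how its data is laid out, cached, and protected. */
    enum data_layout { LAYOUT_ROW, LAYOUT_COLUMNAR, LAYOUT_RAW };
    enum tier        { TIER_DRAM, TIER_NVMM, TIER_FLASH, TIER_COUNT };
    enum consistency { CONSIST_STRICT, CONSIST_RELAXED };
    enum reliability { RELIABLE_REPLICATED, RELIABLE_RECOMPUTE };

    struct storage_policy {
        enum data_layout layout;            /* what form the data is stored in    */
        size_t cache_mb[TIER_COUNT];        /* how much is cached at each tier    */
        enum consistency consistency;       /* per-application consistency model  */
        enum reliability reliability;       /* replicate, or recompute on loss    */
        bool deduplicate;                   /* whether to remove duplicate blocks */
    };

    /* Example: an interactive analytics job whose raw input can be recomputed,
     * so replication is relaxed and the full dataset lives on flash. */
    static const struct storage_policy interactive_lbd_policy = {
        .layout      = LAYOUT_COLUMNAR,
        .cache_mb    = { [TIER_DRAM] = 2048, [TIER_NVMM] = 65536, [TIER_FLASH] = 0 },
        .consistency = CONSIST_RELAXED,
        .reliability = RELIABLE_RECOMPUTE,
        .deduplicate = true,
    };

    int main(void)
    {
        printf("DRAM cache: %zu MB, dedup: %d\n",
               interactive_lbd_policy.cache_mb[TIER_DRAM],
               interactive_lbd_policy.deduplicate);
        return 0;
    }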
Rack-scale low-latency communication. As computation and storage get faster, data communication will become the bottleneck for delivering low-latency application performance. To enable fast and flexible data movement, I plan to reexamine networking architectures and protocols and optimize them for rack-scale data processing. My previous work on optimized RDMA protocols [4] provides a starting point towards a low-latency rack-scale network. I also hope to reexamine message passing over commodity networks and look into other possibilities at the rack scale, such as photonic networks and PCIe interconnects.

Redesigning OS and system software. With the increasing performance of new storage and networking technologies [5], the overhead of the OS and other system software is becoming the bottleneck in application latency. For RacSAS, an important issue is how to reduce software overhead, especially for real-time analytics. I would like to reexamine current OS and system software functionalities and redesign them in the context of RacSAS and LBD.

Reliability and security. Reliability and security will remain important issues for rack-scale systems. For example, how do we prevent a whole-rack failure? How do we prevent information leaks from a lost rack during transportation? From my previous work [4], I found that we should reconsider traditional data replication and the tradeoffs among reliability, availability, and performance for new technologies. I believe that RacSAS and LBD also call for a reexamination of reliability, security, and other system properties: for example, relaxing the reliability requirements for data that can be recomputed.

Adaptive resource sharing and scheduling. With the diversity of data analytics tasks and the flexibility of RacSAS's lower layers, scheduling and resource sharing will be a challenging problem. I plan to study typical LBD workloads in different domains and design scheduling systems that take into consideration domain-specific knowledge, per-workload information, and the resource information that RacSAS exposes.
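To illustrate how such a scheduler might combine per-workload information with resource state exposed by lower layers, the sketch below scores candidate nodes with a made-up cost function; the structures, fields, and weights are assumptions for illustration only.

    #include <stddef.h>
    #include <stdio.h>

    /* Illustrative only: pick a node for a task by combining per-workload
     * information with device state that a RacSAS-style stack exposes across
     * layers (e.g., whether an SSD is busy with internal operations). */
    struct node_state {
        double cpu_load;          /* 0.0 (idle) .. 1.0 (saturated)            */
        double ssd_internal_busy; /* exposed cost of ongoing device operations */
        double free_dram_gb;
    };

    struct task_profile {
        double io_weight;         /* how I/O-bound the workload is, 0..1 */
        double working_set_gb;
    };

    static double placement_cost(const struct task_profile *t,
                                 const struct node_state *n)
    {
        double cost = (1.0 - t->io_weight) * n->cpu_load
                    + t->io_weight * n->ssd_internal_busy;   /* cross-layer info */
        if (n->free_dram_gb < t->working_set_gb)
            cost += 1.0;                                     /* would spill to flash */
        return cost;
    }

    static size_t pick_node(const struct task_profile *t,
                            const struct node_state *nodes, size_t nnodes)
    {
        size_t best = 0;
        for (size_t i = 1; i < nnodes; i++)
            if (placement_cost(t, &nodes[i]) < placement_cost(t, &nodes[best]))
                best = i;
        return best;
    }

    int main(void)
    {
        struct node_state rack[2] = {
            { .cpu_load = 0.2, .ssd_internal_busy = 0.9, .free_dram_gb = 64 },
            { .cpu_load = 0.6, .ssd_internal_busy = 0.1, .free_dram_gb = 32 },
        };
        struct task_profile scan = { .io_weight = 0.8, .working_set_gb = 16 };

        printf("run I/O-heavy scan on node %zu\n", pick_node(&scan, rack, 2));
        return 0;
    }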
New programming model. MapReduce-like frameworks were designed for petascale data processing. I believe that LBD calls for a new, more versatile programming model that is suitable for diverse LBD applications and takes advantage of the flexibility of RacSAS.

Application adaptation. Finally, I would like to take a different approach from the application's point of view and help RacSAS users adapt their analytics algorithms and software. My previous experience in machine learning [9, 10] has prepared me for this effort.

4.2 Automating System Implementation

One of my longer-term research directions is to automate the implementation and testing of new software and hardware systems. Currently, building software and hardware systems is difficult, time-consuming, and error-prone. For example, modern file systems such as btrfs contain tens of thousands of lines of kernel code. Modern hardware devices such as SSDs have complex firmware that intelligently manages their resources. Implementing a new OS requires even more effort. At the same time, the increasing heterogeneity of applications and hardware technologies has increased the need for new systems that are optimized and customized for specific domains. My goal is to raise the level of abstraction of system implementation, so that system builders can use high-level programming models to specify system functionality. One possible approach is to reuse and decouple existing system components, while allowing system builders to modify and add components through well-defined interfaces. As a first step toward building such automated systems, I hope to build libraries of generic Linux driver components.
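As a sketch of what such decoupled components with well-defined interfaces might look like, the snippet below describes a hypothetical component as a small table of function pointers that a driver can compose; it illustrates the idea and is not an existing Linux facility.

    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical illustration: each reusable driver building block
     * (queueing, DMA mapping, error handling, ...) exports the same small
     * interface, so a generator or system builder can compose components
     * instead of rewriting boilerplate for every new device. */
    struct component_ops {
        int  (*init)(void *ctx);
        void (*teardown)(void *ctx);
    };

    struct component {
        const char *name;
        const struct component_ops *ops;
        void *ctx;                          /* component-private state */
    };

    /* A driver assembled from components just walks the table. */
    static int driver_init(struct component *parts, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            int err = parts[i].ops->init(parts[i].ctx);
            if (err) {
                while (i--)
                    parts[i].ops->teardown(parts[i].ctx);  /* unwind on failure */
                return err;
            }
            printf("initialized %s\n", parts[i].name);
        }
        return 0;
    }

    /* Two stand-in components; real ones would wrap queueing, DMA mapping, etc. */
    static int  noop_init(void *ctx)     { (void)ctx; return 0; }
    static void noop_teardown(void *ctx) { (void)ctx; }
    static const struct component_ops noop_ops = { noop_init, noop_teardown };

    int main(void)
    {
        struct component parts[] = {
            { "request-queue", &noop_ops, NULL },
            { "dma-mapper",    &noop_ops, NULL },
        };
        return driver_init(parts, sizeof parts / sizeof parts[0]);
    }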
4.3 Interdisciplinary Research

I would like to work on interdisciplinary projects with researchers from other fields. For example, I am interested in designing programming models for future hardware technologies, such as new type systems for reliable data access in non-volatile main memory. I am also interested in the security issues raised by emerging hardware devices, such as preventing information leaks from non-volatile main memory. As another example, I hope to make the computation and storage of bioinformatics data more efficient; one possibility is to integrate sequence alignment with storage deduplication techniques. I would also like to adapt database query optimization techniques for data selection across different applications in heterogeneous storage environments. Finally, I would like to redesign networking mechanisms and protocols for new hardware and application trends, for example by moving networking closer to processors.

Overall, I believe that my experience in storage systems, operating systems, distributed systems, architecture, networking, scheduling, and machine learning provides me with a solid background for my future research. As a professor, I want to be at the frontier of these research directions and make an impact in the real world. I look forward to joining a stimulating environment where I can learn from others and contribute to the research community.

References

[1] Yiying Zhang, Leo Prasath Arulraj, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. De-indirection for Flash-based SSDs with Nameless Writes. In Proceedings of the 10th USENIX Symposium on File and Storage Technologies (FAST '12), San Jose, California, February 2012.
[2] Mohit Saxena, Yiying Zhang, Michael M. Swift, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Getting Real: Lessons in Transitioning Research Simulations into Hardware Systems. In Proceedings of the 11th USENIX Symposium on File and Storage Technologies (FAST '13), San Jose, California, February 2013.
[3] Yiying Zhang, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Removing the Costs and Retaining the Benefits of Flash-Based SSD Virtualization with FSDV. In preparation.
[4] Yiying Zhang, Jian Yang, Amirsaman Memaripour, and Steven Swanson. Mojim: A Reliable and Highly-Available Non-Volatile Memory System. To appear in Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15), 2015.
[5] Yiying Zhang and Steven Swanson. Sync Stinks: Application Performance with Non-Volatile Main Memory. In preparation.
[6] Yiying Zhang, Gokul Soundararajan, Mark W. Storer, Lakshmi N. Bairavasundaram, Sethuraman Subbiah, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Warming up Storage-Level Caches with Bonfire. In Proceedings of the 11th USENIX Symposium on File and Storage Technologies (FAST '13), San Jose, California, February 2013.
[7] Yiying Zhang and Vijayan Prabhakaran. Duplication Aware Disk Array. Microsoft Technical Report.
[8] Mohit Saxena, Michael M. Swift, and Yiying Zhang. FlashTier: A Lightweight, Consistent and Durable Storage Cache. In Proceedings of the EuroSys Conference (EuroSys '12), Bern, Switzerland, April 2012.
[9] Hyeoncheol Kim, Tae-Sun Yoon, Yiying Zhang, Anupam Dikshit, and Su-Shing Chen. Predictability of Rules in HIV-1 Protease Cleavage Site Analysis. In Proceedings of the 2006 International Conference on Computational Science (ICCS '06), Reading, United Kingdom, March 2006.
[10] Hyeoncheol Kim, Yiying Zhang, Yong-Seok Heo, Heung-Bum Oh, and Su-Shing Chen. Specificity Rule Discovery in HIV-1 Protease Cleavage Site Analysis. Computational Biology and Chemistry, 32:72-79, 2008.
[11] Yiying Zhang, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Warped Mirrors for Flash. In Proceedings of the 29th IEEE Conference on Massive Data Storage (MSST '13), Long Beach, California, May 2013.
[12] Yanpei Chen, Sara Alspaugh, and Randy Katz. Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads. Proceedings of the VLDB Endowment, 5(12), August 2012.
[13] Raja Appuswamy, Christos Gkantsidis, Dushyanth Narayanan, Orion Hodson, and Antony Rowstron. Scale-up vs Scale-out for Hadoop: Time to Rethink? In Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC '13), Santa Clara, California, October 2013.
[14] Yahoo Inc. Yahoo! WebScope Datasets.
[15] National Institutes of Health (NIH). GenBank.
[16] Facebook Inc. Large-scale Graph Partitioning with Apache Giraph.
[17] Sudharsan Seshadri, Mark Gahagan, Sundaram Bhaskaran, Trevor Bunker, Arup De, Yanqin Jin, Yang Liu, and Steven Swanson. Willow: A User-Programmable SSD. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14), Broomfield, Colorado, October 2014.
