Introducing the Single-chip Cloud Computer
Exploring the Future of Many-core Processors

White Paper
Intel Labs

Jim Held
Intel Fellow, Intel Labs
Director, Tera-scale Computing Research

Sean Koehl
Technology Evangelism Manager
Intel Labs Strategy & Planning

Executive Summary

Today, cloud datacenters deploy hundreds or thousands of networked computers to support workloads such as search and social networking. What if we could bring all that computing power onto a single chip and, in the process, boost performance and power efficiency while radically reducing costs? That's the vision behind the Single-chip Cloud Computer (SCC), an experimental processor that Intel has designed to explore the future of many-core computing.

Introduction

The SCC is a single piece of 45nm, high-k metal-gate silicon the size of a postage stamp. This fully functional research microprocessor contains 48 Intel Architecture (IA) cores, the most ever integrated on a silicon CPU chip. The chip was designed as both a hardware prototype and a concept vehicle for parallel software research. The SCC represents the latest achievement of the Intel Tera-scale Computing Research Program. Three Intel lab sites around the world collaborated on the research and development of the SCC: the Advanced Microprocessor Research lab in Hillsboro, USA; Intel Labs Braunschweig in Germany; and Intel Labs Bangalore in India.

Cloud datacenter on a silicon chip

The architecture of the SCC resembles a small cluster or cloud of computers, similar to the computer clusters in cloud datacenters that deliver services such as electronic banking, online shopping, and social networking to millions of people. The chip's 24 tiles are organized in a 6x4 two-dimensional array that mimics the organization of computers in a typical datacenter. Such a highly integrated processor could one day replace an entire rack of servers. Think of the SCC as a prototype of a datacenter on a chip, with the added benefits of much faster networking, lower power and reduced costs that high integration delivers.

Each of the 24 tiles in the SCC contains two IA-compatible cores. Each core can boot its own operating system (currently Linux) and software stack. The cores are connected via a mesh network with low latency and high bandwidth (256 gigabytes per second, far beyond what traditional clusters deliver). The chip also contains four integrated DDR3 memory controllers as well as hardware support for message passing, a technique for rapid communication between cores.

Overcoming barriers to scaling

The SCC's novel architecture incorporates sophisticated power management and memory technology as well as a high-speed on-chip network for data sharing.
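
The tile layout described earlier in this section (24 tiles in a 6x4 array, two cores per tile) can be made concrete with a short Python sketch. The core numbering scheme and the hop-count metric below are assumptions chosen for illustration only; this paper does not say how the SCC numbers its cores or routes traffic across the mesh.

```python
# Illustrative model of the SCC tile array: 6 columns x 4 rows, 2 cores per tile.
# ASSUMPTION: cores are numbered row by row, two consecutive IDs per tile.
COLS, ROWS, CORES_PER_TILE = 6, 4, 2
TOTAL_CORES = COLS * ROWS * CORES_PER_TILE  # 48

def tile_of(core_id):
    """Return the (x, y) mesh coordinates of the tile holding core_id."""
    if not 0 <= core_id < TOTAL_CORES:
        raise ValueError(f"core_id must be in 0..{TOTAL_CORES - 1}")
    tile = core_id // CORES_PER_TILE
    return tile % COLS, tile // COLS

def hops(core_a, core_b):
    """Mesh hops between two cores, measured as Manhattan distance.

    ASSUMPTION: traffic moves along rows and columns only; two cores on the
    same tile are zero hops apart.
    """
    (ax, ay), (bx, by) = tile_of(core_a), tile_of(core_b)
    return abs(ax - bx) + abs(ay - by)
```

Under these assumptions, the two cores farthest apart sit on opposite corners of the array, so the longest path crosses 5 columns and 3 rows.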
These experimental technologies are designed to overcome barriers to scaling future Intel microprocessors to 100 cores and beyond. Such technologies could lead to a dramatic reduction in the number of physical computers needed to create a cloud datacenter, and a corresponding decline in the cost of operations.

"With the technologies we're exploring in the Single-Chip Cloud Computer, you could imagine a datacenter of the future that will be an order of magnitude more energy efficient than what exists today, saving significant resources on space and power costs," said Justin Rattner, head of Intel Labs and Intel's Chief Technology Officer.

Figure: Intel's Single-Chip Cloud Computer (SCC) is the size of a postage stamp and contains 48 Intel Architecture (IA) cores, the most ever integrated on a silicon CPU chip. The chip's architecture resembles a small cluster or cloud of computers, similar to the computer clusters found in cloud datacenters.

Advanced power management

One of the key capabilities of the SCC is fine-grained power management. In designing the chip, Intel developed innovative techniques that allow all 48 cores to operate simultaneously at low power: from 25 watts up to 125 watts in total when running at maximum performance (about as much as
today's Intel processors or two standard household light bulbs). Such low power could translate into big savings for datacenters, whose energy bills are growing faster than hardware costs as more and more servers are added to deliver the higher performance customers are demanding.

Power usage is determined largely by the cores' clock speeds and operating voltages. Power management is controlled by software, which can be programmed to dynamically configure voltages and clock speeds for different cores, or even to turn off entire regions of the chip when they are not needed. These power management capabilities enable software developers to design applications that intelligently manage power consumption, adapting in real time to use only the energy required at the moment.

In a test of the fine-grained power management capability, Intel researchers used software to adjust power levels for different sections of the chip. The researchers successfully mixed and matched voltage and clock speed to meet the needs of a series of tasks (in this case, climate modeling equations) whose power requirements varied over time.

Figure: A look inside the architecture of Intel's Single-Chip Cloud Computer (SCC).

Message passing and memory sharing

With shared memory programming, the overhead required to synchronize different caches across the chip grows as core counts increase. However, in the supercomputing world, the message-passing programming model has been used to develop parallel programs that scale well to hundreds or even thousands of processors. It's also the basis of the scale-out computing approach of cloud datacenters, which involves adding independent nodes as needed to tackle a large or complex application rather than adding more resources (cores, memory) to existing nodes. The SCC chip has special hardware support for message passing.
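
As a rough illustration of the message-passing model described above, the following Python sketch uses threads and queues to stand in for cores and their mailboxes. It shows only the programming style (workers share no data and communicate solely by sending and receiving messages); it is not the SCC's hardware mechanism or any Intel API, and the names `worker` and `parallel_sum` are hypothetical.

```python
import queue
import threading

def worker(rank, inbox, outbox):
    """Simulated core: receive one chunk of work as a message, reply with its sum."""
    chunk = inbox.get()               # blocking receive
    outbox.put((rank, sum(chunk)))    # send the partial result back

def parallel_sum(data, n_workers=4):
    """Sum `data` by scattering slices to workers and gathering partial sums."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(rank, inboxes[rank], results))
        for rank in range(n_workers)
    ]
    for t in threads:
        t.start()
    # Scatter: each worker gets every n_workers-th element, starting at its rank.
    for rank in range(n_workers):
        inboxes[rank].put(data[rank::n_workers])
    for t in threads:
        t.join()
    # Gather: collect exactly one reply per worker, in whatever order they arrived.
    partials = dict(results.get() for _ in range(n_workers))
    return sum(partials.values())
```

Because no worker reads or writes another worker's data, nothing here depends on caches being kept coherent, which is the property that lets this style of program scale to large numbers of nodes.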
Programmers working with cloud, cluster and datacenter applications can use message passing to maximize the performance of the SCC by moving data across the network of cores with extremely low latencies and high bandwidth.

Sometimes developers will also want to use code designed for the traditional, cache-coherent model of today's microprocessors. Intel Labs has developed a software-based memory sharing technology to maintain coherence among caches on the chip, eliminating the need for hardware cache coherence support and potentially increasing
the performance and energy efficiency of future many-core chips. One advantage of our software-based approach is that memory coherency does not need to be enforced across all 48 cores. Rather, the software can define which cores receive memory updates as a group, giving developers greater flexibility in how they distribute application processing across the cores.

To illustrate how the software-based shared memory technology of the chip can work without hardware cache coherence support, researchers performed a financial analytics task, applying the widely used Black-Scholes model to run thousands of calculations in parallel, evaluating thousands of possible market scenarios to arrive rapidly at the best investment decision.

Putting the chip to the test

Intel has ported a variety of applications to the SCC and validated the concept of bringing on chip the parallelism of datacenter programming models that scale to thousands of cores. In addition to testing the advanced power management and memory capabilities of the chip, Intel researchers demonstrated Apache Hadoop (an open source, Java-based framework that supports data-intensive, distributed applications) sorting objects using an approach similar to a web search. The application ran on the SCC with minimal changes required by the developer.

In another demo, the researchers showed how parallelism could enable a scripting language, JavaScript, to execute a high-performance task in a browser. Although JavaScript is used in every browser, it's used mainly for simple tasks, such as processing web forms, and performs poorly when running more complicated operations. JavaScript has been underutilized until now for lack of a suitable programming environment. Leveraging the capabilities of the SCC, researchers used JavaScript to perform physics-based cloth modeling, a complex task that typically would be coded in a more sophisticated language such as C or C++.
The chip acted as a server farm, dividing the work involved in calculating the motion of interactive cloth, with each core working independently to own and process one piece of cloth.

Intel researchers continue to port applications to the SCC and to test and validate the chip's capabilities.

Collaborating with the research community

Intel plans to build 100 or more experimental chips for use by dozens of industrial and academic research collaborators around the world, to accelerate many-core software research and advanced development. We have begun discussions with several close collaborators in parallel computing research, including our partners in the Universal Parallel Computing Research Centers (UPCRCs): Microsoft Research, UC Berkeley, and the University of Illinois.

Microsoft researchers have already modified the company's popular Visual Studio* software development environment to create a message passing application that can run on the SCC. They demonstrated how easy it is for a programmer to set up a project and to edit, compile and run applications that take advantage of the chip's unique features.

Other Intel partners have expressed a strong interest in the new research chip. "The single-chip cloud computer is of great interest to application developers and tools researchers," says Professor Wen-Mei Hwu, Co-director of the UPCRC at the University of Illinois. "The availability of the hardware will greatly accelerate our development of applications and tools for massively parallel computing platforms."

Beyond the datacenter

The implications of the SCC extend far beyond datacenter applications. Intel believes the chip
is an ideal test bed to experiment with bringing parallel programming models, such as message passing, from the datacenter to the desktop. Common software applications can be run on the processor, and most languages designed to program clusters could be ported to the chip. (Intel researchers are currently programming the chip in C/C++ as well as JavaScript.)

Research using the SCC could one day lead to processors with 100 or more cores, spurring the development of new applications that are far more sophisticated and intuitive than those we have today. Imagine communicating with a computer in the way you communicate with people, or picture intuitive visual and tactile interfaces that eliminate the need for keyboards, remote controls or joysticks.

The chip's array of small IA-based tiles could one day lead to more flexible, scalable designs for datacenters and desktop PCs. With each tile connected to the mesh but operating independently, developers could design chips on demand, assembling the appropriate number and types of tiles to match the specific needs of their applications.

Future multi-core processors might enable better artificial intelligence, instant HD-quality video communications, photo-realistic games, and multimedia data mining. In the future you might be able to perform a Hadoop search on your laptop to quickly retrieve images of family and friends in photos and videos stored on the hard drive. Physics-based modeling could be used to deliver more realism to virtual worlds, online multiplayer games and 3D movies. Massive parallelism could make possible immersive online shopping experiences, such as virtual dressing rooms that allow you to try out clothes on your virtual body and see how they would actually fall on you and match your skin tone. And it could give manufacturers the ability to build sophisticated virtual models, eliminating the time and expense of creating physical prototypes.
These are just a few of the potential future applications that could be supported as processors scale to 100 cores and beyond.

That future is approaching rapidly. The number of cores that can be integrated on a chip is constrained by current manufacturing processes and power limitations. With each new processor generation, we find ways to make cores even smaller and more power-efficient. And as we do, the software architecture that Intel researchers have established for the SCC will scale well, enabling core counts to increase. In the long term, the potential applications of massively parallel computing will not be limited by manufacturing processes or power consumption. When we reach 100 cores and beyond, the applications that are possible will be limited only by the imagination of developers.

Learn more

For more information about the SCC, visit www.intel.com/go/terascale

Copyright Intel Corporation 2010
*Other names and brands may be claimed as the property of others.