ICRI-CI Retreat, Architecture track
Uri Weiser, June 5th, 2015
- Funnel: Memory Traffic Reduction for Big Data & Machine Learning (Uri)
- Accelerators for Big Data & Machine Learning (Ran)
- Machine Learning for Architecture: Context-based Prefetching (Yoav)
- Memory Intensive Architecture (Avinoam)
ICRI-CI Architecture track: Theme
Past theme: Develop new Heterogeneous Architecture concepts and Architecture for Machine Learning, and employ Machine Learning to develop architecture.
Next phase Capstone: Optimized IA for Big Data & Machine Learning Workloads
- Funnel (new)
- Accelerators: Architecture for Machine Learning (continuation)
- Machine Learning for Architecture (continuation)
- Memory Intensive Architecture (continuation)
ICRI-CI Architecture track research activities: Past and Future
- Hetero: Provide an energy tool to be used for future SoC energy partition
- Power management: next step in Heterogeneous computing
- Funnel: Proof of Concept and potential (collaboration with Intel Labs - Debbie Marr's group)
- Machine Learning for Architecture: Context-Aware prediction
- Accelerators: Associative processors
- Memory Intensive Architecture
The Funnel Research
"A Funnel is a pipe with a wide, often conical mouth and a narrow stem."
May 2015
Environment: the Era of Big Data
Data centers:
- Size: ~1 million m^2
- Power: ~100 MWatts
- Power cost (US, 2014): $10B (1)
- Power Usage Efficiency: PUE = 1.2 to 2.0
1 Joule saved in computing saves around 1.5 Joules of data center energy.
(1) http://www.computerworld.com/article/2598562/data-center/data-centers-are-the-new-polluters.html
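The 1.5x figure on that last line follows directly from the PUE definition (total facility energy divided by IT energy); a worked form of the relation, assuming a mid-range PUE of about 1.5:

```latex
E_{\mathrm{facility}} = \mathrm{PUE} \cdot E_{\mathrm{IT}}
\;\Rightarrow\;
\Delta E_{\mathrm{facility}} = \mathrm{PUE} \cdot \Delta E_{\mathrm{IT}}
\approx 1.5 \times 1\,\mathrm{J} = 1.5\,\mathrm{J}
```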
Datacenter Power
[Figure: hardware subsystem power breakdown]
Source: Luiz Barroso's talk, http://www.cs.berkeley.edu/~rxin/db-papers/warehousescalecomputing.pdf
Energy:
[Figures]
From: Bill Dally (NVIDIA and Stanford), "Efficiency and Parallelism, the challenges of future computing"
From: Mark Horowitz (Stanford), "Computing's Energy Problems"
Energy in mind (@28nm technology)
- ~20 pJ per arithmetic op
- ~200 pJ per instruction (~10X the op itself)
- ~20,000 pJ per 256-bit DRAM access (~100X an instruction)
From: Bill Dally (NVIDIA and Stanford), "Efficiency and Parallelism, the challenges of future computing"
Energy: DRAM
[Figure: DRAM access energy]
Data movements
[Diagram: data flows from the data source (SSD/NIC) through the MC and Front End into DRAM, and is then copied up through the cache ($) to the CPU for operations]
Data movements: Read Once
[Same diagram, with read-once data highlighted]
Cache/Memory are not effective if:
- Cache related: reuse distance > ~1M accesses
- Memory related: reuse distance > ~1G accesses
Read-once data should NOT reside in either the standard cache or DRAM.
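To make the reuse-distance criterion concrete, here is a minimal Python sketch (our illustration, not from the slides; it counts raw accesses between touches, not unique addresses) that measures reuse distances in an access trace. Addresses that never recur are exactly the read-once data the slide says should bypass the cache and DRAM:

```python
def reuse_distances(trace):
    """Map each address to the distances between its repeated accesses."""
    last_seen = {}   # address -> index of its previous access
    distances = {}   # address -> list of reuse distances
    for i, addr in enumerate(trace):
        if addr in last_seen:
            distances.setdefault(addr, []).append(i - last_seen[addr])
        last_seen[addr] = i
    return distances

# Addresses touched exactly once never appear in the result:
# they are "read once" and are candidates to bypass cache and DRAM.
trace = ["A", "B", "C", "A", "D", "B", "E"]
print(reuse_distances(trace))   # {'A': [3], 'B': [4]}; C, D, E are read-once
```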
Data Movement reduction: Read Once
Reduce data movement by computing as close as possible to the data source.
- In Big Data processing (especially in the ETL* stage) a huge amount of data is read only once.
- Why direct such data to DRAM? Use a HW cyclic buffer instead (e.g. DDIO/DCA).
Funnel idea - where should the Funnel reside?
- At the DISK/SSD/NIC
- At the front end, bypassing DRAM via DDIO/DCA
*ETL = Extract, Transform, Load
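As a software analogue of the cyclic-buffer idea, here is a hedged sketch (the class and its API are illustrative assumptions, not DDIO's actual interface): read-once chunks are staged in a small fixed-size ring and consumed in place, rather than landing in DRAM-resident application buffers:

```python
from collections import deque

class CyclicBuffer:
    """Toy model of a HW cyclic buffer for read-once streaming data."""
    def __init__(self, slots):
        self.buf = deque(maxlen=slots)   # old slots are recycled, as in HW

    def produce(self, chunk):
        self.buf.append(chunk)           # device (SSD/NIC) writes a chunk

    def consume(self):
        return self.buf.popleft() if self.buf else None  # CPU reads it once

ring = CyclicBuffer(slots=4)
for chunk in ("rec1", "rec2", "rec3"):
    ring.produce(chunk)                  # data lands in the ring, not the DRAM heap
while (c := ring.consume()) is not None:
    pass                                 # transform once, emit the reduced result
```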
- Dedicated ETL server clusters: performance limited by I/O, disk, network, database, HDFS, etc. A lot of data gets moved around. For some customers more time is spent in ETL than in ML training.
- Dedicated ML training server clusters: for deep learning, this is where people use accelerators to make training faster. Not very scalable; wall-clock time (latency) oriented.
- Application servers: ML inference/classification/prediction/scoring is embedded in the larger context of whole-application services. Usually no accelerators. Very scalable; throughput-oriented.
ETL - Extract, Transform, Load
- A lot of data moving in -> data movements
- Simple transformations -> simple computation
Bottleneck:
- Performance limited by I/O, disk, network, database, HDFS
- Data is not accessed frequently, but a huge amount of data is accessed
- Energy bottleneck? ETL more than ML training
Big Data system flow
[Diagram: Data In -> ETL -> ML -> Client]
The Funnel
N_f = Funnel ratio; BW_in = BW_out * N_f
Move computing closer to the data source.
[Diagram: data enters at BW_in, passes through cascaded funnel stages with ratios N_3, N_2, N_1 (intermediate bandwidths BW_1, BW_2), and reaches the Computing Engine at BW_out - bandwidth decreases at each stage]
What should the INPUT BW be to fully utilize the computing engine?
BW_in = BW_out * N_3 * N_2 * N_1
The Funnel: balance the system!
If data is to be consumed at the highest rate by the computing engines:
[Diagram: DISK/SSD/NIC at 0.4 GB/s per SATA link -> PCIe at 15 GB/s -> DDR4 at 50 GB/s -> Computing Engines, with per-stage funnel ratios N_1 = 1]
If you use a Funnel, remember the system should be balanced: BW_in = BW_out * N_3 * N_2 * N_1
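A small Python sketch of the balance rule: given a set of per-stage funnel ratios (the N_i values below are illustrative assumptions, not measurements) and the slide's link speeds, it computes the source bandwidth, and the number of 0.4 GB/s SATA links, needed to keep the computing engine fully utilized:

```python
import math

def required_input_bw(bw_out_gbs, funnel_ratios):
    """Balance rule from the slide: BW_in = BW_out * product of N_i."""
    bw_in = bw_out_gbs
    for n in funnel_ratios:
        bw_in *= n
    return bw_in

bw_out = 50.0                  # GB/s consumed by the computing engine (DDR4-class, from the slide)
ratios = [1, 2, 5]             # assumed N_1, N_2, N_3 data-reduction factors per funnel stage
bw_in = required_input_bw(bw_out, ratios)
sata_links = math.ceil(bw_in / 0.4)   # 0.4 GB/s per SATA disk (from the slide)
print(f"BW_in = {bw_in} GB/s -> {sata_links} SATA links to balance the system")
```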
Process data - at which location?
Read once; predefined filter vs. dynamic filter.
[Diagram: filter F1 at the data source (SSD/NIC), filter F2 at the Front End, ahead of the cache ($), CPU operations, and DRAM]
Read-once data should NOT reside in either the standard cache or DRAM.
- Save energy?
- Save DRAM space
- Save data movements
Research Plan
Identify access patterns related to buckets (a classifier sketch follows below):
- Write once - Read once
- Write many - Read once
- Write once - Read many
- Write many - Read many
Provide a solution for the ETL stage to reduce energy and improve performance:
- Funnel I at the disk level
- Funnel II at the front end
- Proof of Concept via DDIO
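A minimal sketch (our illustration, not part of the plan's tooling) of how observed per-object write/read counts could be sorted into the four buckets above; the example counters are assumed, not measured:

```python
def bucket(writes, reads):
    """Classify an object into one of the four access-pattern buckets."""
    w = "Write once" if writes <= 1 else "Write many"
    r = "Read once" if reads <= 1 else "Read many"
    return f"{w} - {r}"

# Example: counters gathered from an (assumed) access trace.
counts = {"log_block": (1, 1), "index_page": (1, 9), "scratch": (7, 6)}
for name, (w, r) in counts.items():
    print(name, "->", bucket(w, r))
# log_block -> Write once - Read once   (Funnel candidate: bypass cache/DRAM)
```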
Moving data in ETL wastes energy in big data infrastructure and applications.
Our Research:
Comprehend data flow and access patterns in big data applications
- Data flow IO: disk / network
- Flow through: SATA / QPI / Chipset / DDR / DRAM / caches
- Data read/write patterns
Apply energy-efficient solutions for each data class
- Funnel: move computation to data when possible
- Funnel: aggregate data early on to reduce communication
- Store data in optimized memory structures based on usage
Open issues and future research
- SW and OS
- Co-processor or heterogeneous system
- Compatibility
- Application awareness of the feature
Summary
The Funnel: functions execute close to the data source
- Reduction of data movement
- Frees up the system's memory resources (re-spark)
- Simple, energy-efficient engines at the front end
Issues
- Compatibility: apps, OS
- Amount of energy saving
Thanks