Networking in the Big Data Era Nelson L. S. da Fonseca Institute of Computing, State University of Campinas, Brazil e-mail: nfonseca@ic.unicamp.br IFIP/IEEE NOMS, Krakow May 7th, 2014
Outline What is Big Data? What is the role of networking in Big Data? What are the sources of Big Data? What are the issues in networking for Big Data? How can Big Data be processed, transferred and stored in a friendly way?
What is Big Data?
In 60 seconds.. https://plus.google.com/+avinash/posts/mgyatu6mbhd
It is not just about Volume!
Big Data Siewert, S. B. Big data in the cloud, IBM developerWorks, Tech. Rep., http://www.ibm.com/developerworks/library/bd-bigdatacloud/#what-is-big-data, July 9, 2013.
Big Data and Enterprise http://wikibon.org/wiki/v/big_data_market_size_and_vendor_revenues Analytics: The real-world use of big data: How innovative enterprises extract value from uncertain data, Executive Report, IBM Institute for Business Value
What is the role of networking in Big Data?
https://www.usenix.org/legacy/event/usenix99/invited_talks/mashey.pdf
Infrastress: In a single day (Nov 11th, 2013), Alibaba Mall processed 105.8 million online transactions from 213 million users and 4.1 billion transactions
Networking..
Networking Computing Network Bandwidth Communication delays (tolerance) Degree of interactivity Storage
What are the sources of data?
Map-Reduce Facebook trace analysis: 30% to 50% of running time is taken up by the communication phase
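To put the 30% to 50% figure in perspective, here is a minimal sketch of an Amdahl-style estimate. It assumes that only the communication phase benefits from a faster network and that the phases do not overlap; the function name and parameters are illustrative, not taken from the trace analysis.

```python
def overall_speedup(comm_fraction, comm_speedup):
    """Job speedup when only the communication phase gets faster.

    comm_fraction: share of running time spent communicating (0..1)
    comm_speedup:  factor by which the communication phase is accelerated
    Assumption: computation and communication do not overlap.
    """
    if not 0.0 <= comm_fraction <= 1.0:
        raise ValueError("comm_fraction must be between 0 and 1")
    if comm_speedup <= 0:
        raise ValueError("comm_speedup must be positive")
    return 1.0 / ((1.0 - comm_fraction) + comm_fraction / comm_speedup)
```

Under these assumptions, halving the communication time (`comm_speedup=2.0`) gives an overall speedup of about 1.18 when communication is 30% of the running time, and about 1.33 when it is 50%.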
Schedulers that are data-location aware, to decrease network traffic as well as I/O operations. How to schedule tasks with heterogeneous (CPU, I/O) demands to promote load balance? How to benefit from YARN resource management?
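A data-location-aware scheduler can be sketched as a greedy assignment that prefers nodes already holding a replica of the task's input block. This is only an illustration under simple assumptions (one input block per task, fixed slots per node); the function and argument names are hypothetical and this is not the Hadoop or YARN scheduler API.

```python
def schedule(tasks, block_locations, free_slots):
    """Greedy locality-aware task assignment.

    tasks:           list of (task_id, block_id) pairs
    block_locations: dict block_id -> set of nodes holding a replica
    free_slots:      dict node -> number of free task slots
    Returns (assignment, remote) where assignment maps task_id -> node and
    remote counts tasks that must read their block over the network.
    """
    slots = dict(free_slots)  # work on a copy
    assignment = {}
    remote = 0
    for task_id, block_id in tasks:
        # Nodes with a local replica and at least one free slot.
        local = [n for n in sorted(block_locations.get(block_id, ()))
                 if slots.get(n, 0) > 0]
        if local:
            node = max(local, key=lambda n: slots[n])
        else:
            candidates = [n for n in sorted(slots) if slots[n] > 0]
            if not candidates:
                raise RuntimeError("no free slots left")
            # Fall back to the least loaded node; the block crosses the network.
            node = max(candidates, key=lambda n: slots[n])
            remote += 1
        slots[node] -= 1
        assignment[task_id] = node
    return assignment, remote
```

The `remote` counter is a rough proxy for the network traffic the scheduler failed to avoid. A real scheduler would also weigh CPU and I/O demand, which this sketch ignores.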
Scientific Computation The Montage application created by NASA/IPAC stitches together multiple input images to create custom mosaics of the sky.
Current cloud tools do not provide an out-of-the-box solution to address application needs. Interconnects are the major obstacle to broad adoption of cloud computing for larger-scale, more tightly coupled HPC applications.
Sensing-as-a-Service C. Perera, A. Zaslavsky, P. Christen and D. Georgakopoulos, "Sensing as a service model for smart cities supported by Internet of Things", Transactions on Emerging Telecommunications Technologies, 2014; 25:81-93
Objects may need to be uniquely identified, or to be identified as belonging to a given class; multi-service platform; distributed processing of data traces; distributed flow control; privacy.
What are the issues in networking for Big Data?
Data Centers
Cloud Data Center Traffic Cisco Global Cloud Index: Forecast and Methodology, 2012-2017
Canonical Data Center Architecture Core (L3) Aggregation (L2) Edge (L2) Top-of-Rack Application servers
Data Center Traffic: Most flows are small (< 10 KB). Most of the bytes are in the top 10% of large flows. Traffic leaving edge switches is ON-OFF, with lognormal distributions. Packet size distribution is bimodal (200 B and 1400 B). T. Benson, A. Akella, and D. A. Maltz. 2010. Network traffic characteristics of data centers in the wild. In Proc. of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10). ACM, New York, NY, USA, 267-280
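The two flow-size observations (most flows are small, most bytes sit in the largest flows) can be checked on any list of flow sizes with a few lines. The sketch below is illustrative only; the function names are hypothetical and the 10 KB threshold and 10% fraction simply mirror the figures quoted on the slide.

```python
def small_flow_fraction(flow_sizes, threshold_bytes=10_000):
    """Fraction of flows smaller than threshold_bytes."""
    return sum(1 for s in flow_sizes if s < threshold_bytes) / len(flow_sizes)


def top_flows_byte_share(flow_sizes, fraction=0.10):
    """Share of all bytes carried by the largest `fraction` of flows."""
    sizes = sorted(flow_sizes, reverse=True)
    k = max(1, int(len(sizes) * fraction))  # at least one flow
    return sum(sizes[:k]) / sum(sizes)
```

For a toy sample of nine 1 KB flows and one 1 MB flow, 90% of the flows are small while the single largest flow carries over 99% of the bytes.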
Data Center Traffic: In cloud data centers, the majority of flows stay within the rack (80%), while in enterprise and university data centers this varies from 40% to 90%. The core layer is the most utilized; the edge layer is lightly utilized. The core layer contains hot spots, but in fewer than 25% of links. No need for more bisection bandwidth. Most losses occur on links with low utilization, due to bursty traffic.
VM Processes: VM arrival and departure processes are self-similar, power law. VMs in the system: ARIMA model. Yi Han, Jeffrey Chan and Christopher Leckie. Analysing Virtual Machine Usage in Cloud Computing. In Proceedings of the IEEE 2013 3rd International Workshop on Performance Aspects of Cloud and Service Virtualization, 2013
Need for a longitudinal study on cloud (data center) traffic characterization. Need for publicly available traces.
Data Center Network: Fat Tree, DCell, BCube, Jellyfish
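As a worked example for the first topology, the sketch below sizes a k-ary fat tree. It assumes the common three-layer construction from identical k-port switches (k pods, each with k/2 edge and k/2 aggregation switches, plus (k/2)^2 core switches); the slide itself only names the topology, so this construction and the function name are assumptions.

```python
def fat_tree_size(k):
    """Element counts for a k-ary fat tree built from k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even number >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,         # k/2 per pod
        "aggregation_switches": k * half,  # k/2 per pod
        "core_switches": half * half,
        "hosts": k * half * half,          # k^3 / 4
    }
```

With 48-port switches this construction supports 27,648 hosts, which illustrates why such topologies are attractive for scaling out with commodity switches.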
Liberate upper-layer switches for load balancing, to avoid a few hot switches being overloaded. Networks should guarantee good isolation and stable service among multiple tenants.
Hybrid Data Center Networks Christoforos Kachris and Ioannis Tomkos "Optical interconnection networks for data centers", ONDM 2013
Need for high-radix, scalable, energy-efficient data centers that can sustain the exponential increase in network traffic.
VM Placing: Non-trivial network topology for scalability and reliability. Multi-path routing; routes can change dynamically. Heterogeneous services; large variety of run-time traffic patterns. Unpredictable traffic variability due to unpredictable request spikes and service-dependent operations.
Need for traffic-aware VM placing that takes into consideration the correlations of VM traffic as well as traffic variability; dynamic placement decision.
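One simple way to picture traffic-aware placement is a greedy heuristic that co-locates the VM pairs exchanging the most traffic. The sketch below is a toy under strong assumptions: it uses only mean pairwise rates (so it ignores the traffic variability and dynamic re-placement the slide calls for), racks have a fixed VM capacity, and all names are hypothetical.

```python
def place_vms(traffic, rack_capacity, num_racks):
    """Greedy placement: heaviest-communicating VM pairs share a rack if possible.

    traffic: dict {(vm_a, vm_b): mean traffic rate}
    Returns dict vm -> rack index.
    """
    placement = {}
    load = [0] * num_racks

    def put(vm, preferred=None):
        if vm in placement:
            return
        # Try the partner's rack first, then racks from least to most loaded.
        order = sorted(range(num_racks), key=lambda r: load[r])
        if preferred is not None:
            order = [preferred] + order
        for rack in order:
            if load[rack] < rack_capacity:
                placement[vm] = rack
                load[rack] += 1
                return
        raise RuntimeError("not enough rack capacity")

    for (a, b), _rate in sorted(traffic.items(), key=lambda kv: -kv[1]):
        put(a, placement.get(b))
        put(b, placement.get(a))
    return placement


def inter_rack_traffic(traffic, placement):
    """Total traffic that has to leave a rack under the given placement."""
    return sum(rate for (a, b), rate in traffic.items()
               if placement[a] != placement[b])
```

Comparing `inter_rack_traffic` before and after placement gives a rough measure of how much load is kept off the aggregation and core layers.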
VM Migration: Improvement of data and network locality; not always possible to maintain the same IP address, leading to service disruption; WAN migration: trade-off between bandwidth and downtime. R. Boutaba, Q. Zhang and M. F. Zhani. Virtual Machine Migration in Cloud Computing Environments: Benefits, Challenges, and Approaches. In Communication Infrastructures for Cloud Computing. H. Mouftah and B. Kantarci (Editors). IGI-Global, USA. pp. 383-408, September, 2013
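The bandwidth versus downtime trade-off can be illustrated with a simplified pre-copy live migration model, assumed here for illustration and not taken from the cited chapter: memory is copied in rounds while the VM keeps running and dirtying pages at a constant rate, and the VM is paused only for the final round. Units (MB, MB/s) and parameter names are hypothetical.

```python
def precopy_migration(mem_mb, bandwidth_mbs, dirty_mbs, stop_mb=50.0, max_rounds=30):
    """Estimate (total_time_s, downtime_s) for iterative pre-copy migration.

    mem_mb:        VM memory to transfer (MB)
    bandwidth_mbs: available migration bandwidth (MB/s)
    dirty_mbs:     rate at which the running VM dirties memory (MB/s)
    stop_mb:       remaining data below which the VM is paused for the last copy
    """
    if bandwidth_mbs <= 0:
        raise ValueError("bandwidth must be positive")
    to_send = float(mem_mb)
    live_time = 0.0
    for _ in range(max_rounds):
        if to_send <= stop_mb:
            break
        round_time = to_send / bandwidth_mbs
        live_time += round_time
        # Memory dirtied during this round must be re-sent in the next one.
        to_send = dirty_mbs * round_time
    downtime = to_send / bandwidth_mbs  # final stop-and-copy phase
    return live_time + downtime, downtime
```

In this model, a 1000 MB VM with a 10 MB/s dirty rate over a 100 MB/s link migrates in about 11.1 s with roughly 0.1 s of downtime. On a WAN link whose bandwidth approaches the dirty rate, the rounds stop converging and the downtime grows sharply.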
Transport protocols to handle service disruption. Sophisticated management strategies for large-scale VM deployment. Development of an inter-data-center VM migration framework. VM migration to facilitate the collaboration between cloud and mobile devices.
How can Big Data be processed, transferred and stored in a friendly way?
Software Defined Data Center
Software Defined Data Center <http://youtu.be/uwb4kmghzaa>
Virtual Networks
VN Scheme | Description | Encapsulation | Scalability (# of VNs)
VLAN | Bridges VMs, for dedicated management | MAC-in-IP | 2^12
VXLAN | Ameliorates scalability for cloud environments | MAC-in-UDP | 2^24
NVGRE | Ameliorates scalability for cloud environments | MAC-in-GRE | 2^24
Contrail | Uses OpenFlow | All | 2^12
NSX | Uses OpenFlow | All | 2^12
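The scalability figures follow directly from the width of the network identifier field: 12 bits for the VLAN ID and 24 bits for the VXLAN and NVGRE identifiers. A minimal sketch of the arithmetic (the dictionary and function names are illustrative):

```python
# Identifier width in bits for each scheme.
ID_BITS = {"VLAN": 12, "VXLAN": 24, "NVGRE": 24}


def max_virtual_networks(bits):
    """Size of the identifier space for a field of the given width."""
    return 2 ** bits


def id_space(scheme):
    """Identifier space of a scheme listed in ID_BITS."""
    return max_virtual_networks(ID_BITS[scheme])
```

This gives 4,096 identifiers for VLAN against 16,777,216 for VXLAN or NVGRE, a factor of 4,096 more, which is why the 24-bit schemes suit multi-tenant cloud environments.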
Network Virtualization
Scheme | Routing Protocol | Multicast Tree | Encapsulation
TRILL | IS-IS | Single | MAC-in-MAC
SPB | IS-IS | Multiple | MAC-in-MAC
NetLord | SPAIN | Single | MAC-in-(IP+MAC)
OpenFlow | All | Single | All
OpenFlow Switching [Figure: an OpenFlow Switch, with a Flow Table (hw) and a Secure Channel (sw) to a Controller PC, as defined by the OpenFlow Switch specification] The Stanford Clean Slate Program http://cleanslate.stanford.edu
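The role of the flow table can be sketched as a prioritized match-action lookup, with unmatched packets handed to the controller over the secure channel. This is a conceptual toy, not the OpenFlow wire protocol or any controller's API; the class names, field names and action strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FlowEntry:
    match: dict        # header field -> required value; missing fields are wildcards
    action: str        # e.g. "output:2" or "drop" (illustrative strings)
    priority: int = 0
    packets: int = 0   # per-entry counter


class FlowTable:
    def __init__(self):
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)
        # Keep the highest-priority entries first.
        self.entries.sort(key=lambda e: -e.priority)

    def lookup(self, packet):
        """Return the action of the first matching entry, or punt to the controller."""
        for entry in self.entries:
            if all(packet.get(k) == v for k, v in entry.match.items()):
                entry.packets += 1
                return entry.action
        return "send_to_controller"  # table miss
```

A table miss is where flow setup delay comes from: the first packet of a new flow waits for the controller to install an entry.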
ElasticTree [Brandon Heller, NSDI 2010]
SDN functional architecture
OpenDaylight Platform http://www.opendaylight.org/project/technical-overview
SDN and Hadoop P. Qin, B. Dai, B. Huang and G. Xu, Bandwidth-Aware Scheduling with SDN in Hadoop: A New Trend for Big Data, in Proc. of INFOCOM 2014
Multipath solution for fast traffic rerouting. Scalability: current support of 10^6 flows for data centers. Delay in flow setup and proliferation of flow table entries can limit scalability. How to place a given number of controllers in a certain physical network such that predefined objectives are achieved?
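The controller placement question can be made concrete with one possible objective, chosen here as an assumption: minimize the worst-case latency from any switch to its nearest controller. The sketch below brute-forces all placements, which is only feasible for small topologies; the function names and the link-latency representation are hypothetical.

```python
from itertools import combinations


def shortest_paths(nodes, links):
    """All-pairs shortest path latencies (Floyd-Warshall) on an undirected graph.

    links: dict {(u, v): latency}
    """
    inf = float("inf")
    dist = {u: {v: (0.0 if u == v else inf) for v in nodes} for u in nodes}
    for (u, v), w in links.items():
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    for m in nodes:
        for u in nodes:
            for v in nodes:
                if dist[u][m] + dist[m][v] < dist[u][v]:
                    dist[u][v] = dist[u][m] + dist[m][v]
    return dist


def place_controllers(nodes, links, k):
    """Return (worst_latency, sites) minimizing the worst switch-to-controller latency."""
    dist = shortest_paths(nodes, links)
    best = None
    for sites in combinations(nodes, k):
        worst = max(min(dist[s][c] for c in sites) for s in nodes)
        if best is None or worst < best[0]:
            best = (worst, sites)
    return best
```

Other objectives, such as average latency, controller load or resilience to failures, would change only the expression computed for each candidate placement.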
Network Programming Languages
Language | Short description
FML | High-level policy description language (e.g. access control)
Frenetic | Avoids race conditions through well-defined high-level programming abstractions
Nettle | Allows programmers to deal with streams instead of events
NetCore | Means for expressing packet-forwarding policies at a high level
Procera | High-level abstractions to describe reactive and temporal behaviors
Pyretic | Specifies network policies at a high level of abstraction, offering transparent composition and topology mapping
Flog | Combines ideas of FML and Frenetic, providing an event-driven and forward-chaining logic programming language
HFT | Enables hierarchical policy description with conflict-resolution operators, well suited for decentralized decision makers
FatTire | Enables hierarchical policy description with conflict-resolution operators, well suited for decentralized decision makers
Support from SDN languages for the automation of the Software Defined Data Center. Abstraction of resources to support application requirements.
Software Defined Storage
Final Remarks: Need for characterization of traffic generated by Big Data applications. Use of communication patterns of Big Data processing to define resource allocation. Distributed processing of Big Data. Integration of SDN functionality into the automation of the SDDC to support the requirements of Big Data applications.