Big Data and Future Networks: A Perspective from the United States
Hisashi Kobayashi (小林久志)
Princeton University and National Institute of Information and Communications Technology
Acknowledgments
Prof. Tadao Saito, Toyota InfoTechnology Center
Dr. Nozomu Nishinaga, Mr. Masahiro Kiyokawa, and Mr. Hiroaki Yano, NICT
Prof. Mung Chiang, Princeton University
Dr. Evangelos Eleftheriou, IBM Zurich Research Lab
Mr. Kaiser Fung, author of Numbers Rule Your World
Dr. Kazuo Iwano, Mitsubishi Corporation
Prof. Brian L. Mark, George Mason University
Prof. Dipankar Raychaudhuri, Rutgers University
Prof. Phuoc Tran-Gia, University of Würzburg
Prof. Howard Wactlar, CMU and NSF CISE Directorate
Prof. Philip Yu, University of Illinois at Chicago
2 Big Data and Future Network Design Hisashi Kobayashi
Outline
How Much Information? How Big is Data?
President Obama's Open Government Initiative
President Obama's Big Data Initiative
Big Data in Science and Technology Research
- NITRD Program, NSF, DARPA, DOE
Big Data in Enterprises
Call for Data Science and Data Scientists
Big Data and Networks
References
HOW MUCH INFORMATION? HOW BIG IS DATA?
Source: The World of Data (by IBM): http://adamov.net.ru/images/share/the-world-of-large-scale-data-processing.jpg
How Much Data Was Out There? [Kobayashi et al. 2005]
Online (disk drives, file systems): 300 Petabytes
Offline (magnetic tape, CDs): 8 Exabytes
Analog data (paper, film, videotape): 200 Exabytes
Petabyte = 1,000,000,000,000,000 bytes, or 10^15 bytes
Exabyte = 1,000,000,000,000,000,000 bytes, or 10^18 bytes
cf. 2003 report by a U.C. Berkeley research group.
Source: http://www.sims.berkeley.edu/research/projects/how-much-info-2003/
Some Big Numbers
0.43 x 10^18 seconds: The age of the universe (13.77 billion years).
5 Exabytes: All words ever spoken by human beings (in text). Roy Williams (Caltech, 1993)
21 Exabytes/month: Global Internet traffic in 2007. Padmasree Warrior (Cisco, March 2010)
160 Exabytes: Digital information created, captured, and replicated worldwide in 2007 (International Data Corporation, 2007)
42 Zettabytes: All words ever spoken by human beings (if digitized as 16 kHz 16-bit audio). Mark Liberman (U. Penn, 2003)
Prefixes: kilo 10^3, mega 10^6, giga 10^9, tera 10^12, peta 10^15, exa 10^18, zetta 10^21, yotta 10^24
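The figures on this slide can be sanity-checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `to_bytes` and the use of a 365.25-day year are assumptions made here, not part of the original talk.

```python
# Assumption: a Julian year of 365.25 days is used for the conversion.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Decimal (SI) prefixes as listed on the slide, stored as powers of ten.
SI_PREFIXES = {
    "kilo": 3, "mega": 6, "giga": 9, "tera": 12,
    "peta": 15, "exa": 18, "zetta": 21, "yotta": 24,
}

def to_bytes(value, prefix):
    """Convert a value with an SI prefix (e.g. 5 'exa') to plain bytes."""
    return value * 10 ** SI_PREFIXES[prefix]

# Age of the universe in seconds, from 13.77 billion years.
age_of_universe_s = 13.77e9 * SECONDS_PER_YEAR

# All words ever spoken: text estimate versus digitized-audio estimate.
speech_as_text = to_bytes(5, "exa")
speech_as_audio = to_bytes(42, "zetta")
audio_to_text_ratio = speech_as_audio / speech_as_text

print(f"Age of the universe: {age_of_universe_s / 1e18:.2f} x 10^18 s")
print(f"Audio estimate is {audio_to_text_ratio:.0f} times the text estimate")
```

The ratio between the audio and text estimates shows how strongly the chosen representation, rather than the underlying content, drives the size of a data set.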
Source: Asigra Info Graphic: http://thumbnails.visually.netdna-cdn.com/big-data-infographic_504f4d2f5bd2f.jpg
Source: The Retailer's Guide: http://venturebeat.files.wordpress.com/2012/11/retailersbigdata_final.png
Source: http://www.weforum.org/reports/personaldata-emergence-new-asset-class January 2011, Davos, Switzerland
Every day, we create 2.5 quintillion (2.5 x 10^18) bytes (i.e., 2.5 Exabytes) of data, so much that 90% of the data in the world today has been created in the last two years alone. [IBM]
Raw data has little value by itself. We must process data and extract information in a usable form.
- Big Data tools, e.g., Apache Hadoop, MapReduce
- Data Science (data mining, machine learning)
- Need for advancing statistical analysis techniques that are scalable.
We must then turn the information into valuable action, e.g., Amazon.com, a better government.
Open Government Initiative
"My administration is committed to creating an unprecedented level of openness in Government. We will work together to ensure the public trust and establish a system of transparency, public participation, and collaboration. Openness will strengthen our democracy and promote efficiency and effectiveness in Government."
---- President Barack Obama, 01/21/09
Government should be transparent
- Transparency promotes accountability and provides information to citizens.
Government should be participatory
- Knowledge is widely dispersed in society, and public officials benefit from having access to that knowledge.
Government should be collaborative
- We should use innovative tools, methods and systems to cooperate with nonprofit organizations, businesses, and individuals in the private sector.
Open Government Directive
1. Publish Government Information Online
2. Improve the Quality of Government Information
3. Create and Institutionalize a Culture of Open Government
4. Create an Enabling Policy Framework for Open Government
-- Peter R. Orszag, Director, Office of Management and Budget, 12/8/09
http://www.whitehouse.gov/sites/default/files/omb/assets/memoranda_2010/m10-06.pdf
Source: Howard Wactlar, NSF CISE Directorate at NIST Big Data Meeting, June 2012
President Obama's Big Data Initiative
To advance state-of-the-art technologies to collect, store, preserve, manage, analyze and share Big Data.
To accelerate the pace of discovery in science and engineering, strengthen national security, and transform teaching and learning.
To expand the workforce needed to develop and use Big Data technologies.
More than $200 million in new commitments through six Federal departments and agencies.
- Announced by the Office of Science and Technology Policy (OSTP) on March 29, 2012
BIG DATA IN SCIENCE AND TECHNOLOGY RESEARCH
NITRD (Networking and Information Technology Research and Development) Program
Provides a framework in which many Federal agencies coordinate their R&D efforts on networking and IT.
Operates under the aegis of the NITRD Subcommittee of the National Science and Technology Council (NSTC)'s Committee on Technology.
The National Coordination Office (NCO) supports the NITRD Program by providing technical expertise, planning and coordination, and by serving as the Program's central point of contact.
The NITRD Program's focus:
Big Data (BD)
Cyber Physical Systems (CPS)
Cyber Security and Information Assurance (CSIA)
Health Information Technology R&D (Health IT R&D)
Human Computer Interaction and Information Management (HCI&IM)
Etc.
Source: Howard Wactlar, NSF CISE Directorate at NIST Big Data Meeting, June 2012
Source: Howard Wactlar, NSF CISE Directorate at NIST Big Data Meeting, June 2012
DARPA's XDATA Program
Invests $25 million/year.
Develops computational techniques and software tools for both semi-structured (e.g., tabular, relational, categorical, meta-data) and unstructured (e.g., text documents, message traffic) data.
- Scalable algorithms for processing imperfect data in distributed data stores
- Effective human-computer interaction tools for rapidly customizable visual reasoning
DOE's Scalable Data Management, Analysis and Visualization (SDAV) Institute ($25 million over 5 years)
Project Leader: Dr. Arie Shoshani, Lawrence Berkeley National Laboratory
BIG DATA IN ENTERPRISES
The Big Data market will exceed $50B worldwide by 2017.
http://sourcedigit.com/700-big-data-market-size-forecasts-2012-17/
The Big Data Market: IDC Japan's Forecast
2011: 14.25 billion yen
2012: 19.7 billion yen
2016: 76.5 billion yen
The current Big Data market is only slightly more than 0.1% of the overall IT market of 13 trillion yen.
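The growth implied by this forecast can be expressed as a compound annual growth rate. The following sketch uses the three forecast figures from the slide (in units of 100 million yen) and the 13 trillion yen IT market size; the function name `cagr` and the variable names are illustrative assumptions.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1

# Forecast figures from the slide, in units of 100 million yen.
forecast_100m_yen = {2011: 142.5, 2012: 197.0, 2016: 765.0}

# Implied annual growth between the 2012 and 2016 forecasts.
growth_2012_2016 = cagr(forecast_100m_yen[2012], forecast_100m_yen[2016], 2016 - 2012)

# Share of the 2011 Big Data market in the overall 13 trillion yen IT market.
share_2011 = forecast_100m_yen[2011] * 1e8 / 13e12

print(f"Implied growth 2012-2016: {growth_2012_2016:.1%} per year")
print(f"Share of total IT market in 2011: {share_2011:.2%}")
```

Under these figures the forecast corresponds to roughly 40% growth per year, starting from a share of about 0.1% of the total IT market.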
Another forecast is much bigger (by an order of magnitude).
Source: http://www.microsoft.com/ja-jp/sqlserver/2012/big-data/default.aspx
Big Data: The Management Revolution
Success story of Amazon.com: 30-40% annual growth in 2008-2012 [HBR]
Data Analytics (DA) will replace the HiPPO (Highest Paid Person's Opinion). [HBR]
Data analysts (or data scientists) are in short supply.
[HBR]: Harvard Business Review, October 2012: http://hbr.org/archive-toc/br1210
Diamond Harvard Business Review (Japanese edition), "Big Data: The First Year of Competition," February 2013
Big Data in Enterprises (cont'd)
Big Data exceeds the processing capacity of conventional relational database systems.
Big Data primarily addresses the database (DB)/data warehousing (DWH) aspect of data analysis.
Apache Hadoop is the first technology for Big Data.
-- Distributed data storage
-- Analysis algorithms for parallel data
Apache Hadoop
A distributed computational framework that can process a wide range of datasets.
High-performance parallel data processing using MapReduce.
Reliable data storage using the Hadoop Distributed File System (HDFS).
- Often used together with NoSQL ("Not only SQL") data stores rather than SQL-only relational databases.
Typical users seem obsessed with the quantity, not the quality, of data. More thought should be given to how to collect and select data [Kaiser Fung].
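The MapReduce style of processing mentioned above can be illustrated with a single-process word count. This is a minimal sketch in plain Python, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are assumptions chosen for illustration, and a real Hadoop job would distribute these phases across many machines and read its input from HDFS.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle and reduce sequentially over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data big networks", "data networks data"])
print(counts)
```

Because each call to `map_phase` and each call to `reduce_phase` is independent of the others, both phases can run in parallel, which is the property that lets this model scale to large clusters.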
How to handle the 3 Vs [IBM]
1. Volume:
- Massively parallel processing (e.g., Greenplum data computing)
- Distributed computing platform (e.g., Apache Hadoop)
2. Velocity:
- Processing of streaming data to keep the storage requirement practical (e.g., Large Hadron Collider at CERN)
- Instantaneous response in some applications (e.g., financial trading)
3. Variety:
- Need to deal with diverse data types and sources (e.g., text from SNS, data from sensors, image data, GPS data from mobile phones, etc.)
[IBM] http://www-01.ibm.com/software/data/bigdata
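The Velocity point, processing a stream without storing it, can be illustrated by computing summary statistics in a single pass. The sketch below uses Welford's online algorithm for the mean and variance; the class name `RunningStats` and the sample readings are hypothetical, and this is only one of many possible streaming techniques.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's online algorithm).

    Only three numbers are kept in memory, no matter how long the stream is.
    """

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of the observations seen so far."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

stats = RunningStats()
for reading in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):  # hypothetical sensor readings
    stats.update(reading)

print(stats.count, stats.mean, stats.variance)
```

Each reading can be discarded as soon as it has been folded in, so the storage requirement stays constant regardless of the data rate.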
Big Data Platform
Data Warehousing (DWH): Store large volumes of information from multiple sources.
Hadoop-based Analytics: Reduce the cost of analyzing massive data.
Unstructured Database (as well as RDB) and NoSQL
Stream Computing: Continuously analyze data to take action in real time.
Text Analytics (or Text Mining): Analyze the textual content of unstructured information, using information retrieval, data mining, machine learning, statistics and computational linguistics.
Data Visualization Tools (or Infographics): Real-time processing and dashboard presentation, e.g., Tableau [http://www.tableausoftware.com/], Spotfire [http://spotfire.tibco.jp/], etc.
Some Vendors of Big Data Tools
Greenplum: http://en.wikipedia.org/wiki/greenplum
- founded in 2003
- acquired by EMC in 2010
Netezza: http://en.wikipedia.org/wiki/netezza
- founded in 2000
- acquired by IBM in 2010 for $1.7B
SPSS: http://ja.wikipedia.org/wiki/spss
- founded in 1968
- acquired by IBM in 2009 for $1.2B
Vertica (acquired by HP)
Oracle, SAP and Microsoft also provide Big Data tools.
For Japan, see Nikkei Computer, January 10, 2013 issue.
Call for Better DATA SCIENCE And More DATA SCIENTISTS
Try to gain insights from data, instead of presenting all collected data.
Study and extend classical statistical techniques:
- Exploratory Data Analysis (EDA)
- Time Series Analysis
- Hidden Markov Models (HMMs)
- Bayesian Statistics and MCMC
- etc.
Scalable Algorithms and Analytics
e.g., PageRank Algorithm (an efficient algorithm to compute the principal eigenvector of a Markov transition matrix)
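As an illustration of the PageRank example, the sketch below computes the stationary distribution of a small link graph by power iteration, i.e., by repeatedly applying the Markov transition matrix to a probability vector. The damping factor of 0.85, the function name `pagerank` and the four-page graph are assumptions made for this example, not details from the talk.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank.

    `links` maps every page to the list of pages it links to; every linked
    page must itself appear as a key. Returns a dict of page -> rank that
    sums to 1.
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by pages without outgoing links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not links[u])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        # Each page passes its rank in equal shares to the pages it links to.
        for u in nodes:
            for v in links[u]:
                new_rank[v] += damping * rank[u] / len(links[u])
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank

# Hypothetical four-page web graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

Each iteration only touches every link once, which is why the method scales to very large sparse graphs where a direct eigendecomposition would be impractical.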
Important Subfields of Data Mining
Data stream mining [Aggarwal]
- Computer network traffic
- Web searches
- Sensor data
Graph mining [Aggarwal]
- Web data
- Social network analysis
- Bioinformatics
C. C. Aggarwal (Ed.), Data Streams: Models and Algorithms, Kluwer Academic Publishers
C. C. Aggarwal and H. Wang (Eds.), Managing and Mining Graph Data, Springer
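A typical data stream mining task on network traffic is finding the most frequent items without storing the whole stream. The sketch below uses the Misra-Gries frequent-items algorithm as one possible example; the function name `misra_gries` and the packet source addresses are hypothetical.

```python
def misra_gries(stream, k):
    """Misra-Gries summary using at most k - 1 counters.

    Any item that occurs more than len(stream) / k times is guaranteed to be
    among the returned candidates. The counts are lower bounds, not exact.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all counters and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical stream of packet source addresses.
packets = ["10.0.0.1"] * 6 + ["10.0.0.2", "10.0.0.3", "10.0.0.1", "10.0.0.4"]
candidates = misra_gries(packets, k=3)
print(candidates)
```

A second pass over the data, where one is available, can confirm which candidates really exceed the frequency threshold.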
Japan's Serious Shortage of Data Scientists
Number of new graduates with knowledge of data analysis (statistics, machine learning, etc.) in 2008: United States 24,730, China 17,410, India 13,270, Japan 3,400. (The number grows by 10.4% per year in China and shrinks by 5.3% per year in Japan.)
Source: http://blogs.itmedia.co.jp/business20/2012/10/post-2438.html
Number of SAS (Statistical Analysis System) certified professionals: United States 10,544, India 5,907, South Korea 1,381, United Kingdom 1,242, Japan 800.
Number of SAS certified professionals relative to GDP (United States = 100): United States 100, India 458, South Korea 177, United Kingdom 73, Japan 20.
Source: Diamond Harvard Business Review (Japanese edition), Feb. 2013
[McKinsey] Big data: The next frontier for innovation, competition and productivity, McKinsey & Co., May 2011
BIG DATA and NETWORKS
Source: What happens in an Internet Minute? (by Intel): http://www.intel.com/content/dam/www/public/us/en/images/illustrations/embedded-infographic-600-logo.jpg
Big Data vs. Networks
Networks to cope with Big Data.
- Sufficient storage, bandwidth and processing
Big Data to help design and manage Networks.
- Better performance, reliability and security
Big Data and Networks for a better world.
- Transparent government, law enforcement
- Risk management
- Innovative applications for value creation, e.g., user behavior tracking and marketing (privacy and security are critical).
Cloud Computing & Networking: A Platform for Big Data
Cloud computing offers on-demand access to a shared pool of configurable resources.
Big Data requires a novel approach to meet its storage and processing requirements.
The Cloud can make big data (analytics) accessible to those who could not otherwise use it.
Disk storage performance can be a problem when storage is shared by various users.
OpenFlow and FLARE will help Data Centers handle Big Data
Help control the connectivity of Data Centers for big data analytics via virtualization.
Especially useful in a multi-tenant Data Center environment.
Facilitate load balancing among Data Centers.
FLARE: Deeply Programmable Network (DPN) Architecture by Aki Nakao
ID/Locator Separation and Context-oriented Service for Big Data
Here, context means data attributes, e.g., identity, group association, time, location, etc.
Data Centric Networking (also called Named Data Networking or NDN) appears to be a suitable approach to Big Data, but its performance implications are unclear.
GUID (Globally Unique ID) of MobilityFirst also facilitates context-oriented service.
Optical Technologies: Fast Transport and Processing of Big Data
Integrated Optical Path and Optical Packets of the AKARI Architecture.
Silicon Nanophotonics Technology
- Integrates optical and electrical circuits on a single silicon chip, using a 90 nm CMOS fabrication line.
cf. IBM press release, Dec 10, 2012: http://www-03.ibm.com/press/us/en/pressrelease/39641.wss
Additional Issues that Future Network Architectures Should Address:
Interface to Database
- Increasingly unstructured and heterogeneous
- Requires fast processing and transportation
The Database community and the Networking community should interact.
- No FIA (Future Internet Architecture) project addresses database issues.
Service Layer for Big Data applications
References
[Kobayashi et al. 2005] H. Kobayashi, F. Dolivo, E. Eleftheriou, "35 Years of Progress in Digital Magnetic Recording," 2005 Eduard Rhein Technology Award Lecture.
[IBM] http://www-01.ibm.com/software/data/bigdata
[UCB] http://www.sims.berkeley.edu/research/projects/how-much-info-2003
[McKinsey] "Big data: The next frontier for innovation, competition and productivity," McKinsey & Co., May 2011, http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation
[IBM] "IBM Lights Up Silicon Chips to Tackle Big Data," press release, Dec 10, 2012, http://www-03.ibm.com/press/us/en/pressrelease/39641.wss
Appendix
Big Data across the Federal Government (4)
NITRD's Focus (2)
NSF-NIH Initiative (2)
McKinsey Global Institute's Report (2)
2012 Summer Olympic Games
Big Numbers
Data Never Sleeps (Fortune Magazine, 7/2012)
Twitter 2012
Big Data for Healthcare
Big Data Across the Federal Government (March 29, 2012)
Department of Defense (DOD)
Defense Advanced Research Projects Agency (DARPA)
- Anomaly Detection at Multiple Scales (ADAMS) program
- Cyber-Insider Threat (CINDER) program
Department of Homeland Security (DHS)
- Center of Excellence on Visualization and Data Analytics
Department of Energy (DOE)
- Advanced Scientific Computing Research (ASCR)
- High Performance Storage System (HPSS)
Department of Veterans Affairs (VA)
- Consortium for Healthcare Informatics Research (CHIR)
- Corporate Data Warehouse (CDW)
- Genomic Information System for Integrated Science (GenISIS)
Department of Health and Human Services (HHS)
Centers for Disease Control and Prevention (CDC)
- BioSense 2.0 program
Centers for Medicare & Medicaid Services (CMS)
- A data warehouse based on Hadoop is being developed.
- Use of XML database technologies is being evaluated.
Food & Drug Administration (FDA)
- Virtual Laboratory Environment (VLE)
National Archives & Records Administration (NARA)
- Cyberinfrastructure for a Billion Electronic Records (CI-BER)
National Aeronautics & Space Administration (NASA)
- Earth Science Data and Information System (ESDIS)
- Global Earth Observation System of Systems (GEOSS)
- Planetary Data System (PDS)
- Multimission Archive at Space Telescope Science Institute (MAST)
National Endowment for the Humanities (NEH)
- Digging into Data Challenge
National Institutes of Health (NIH)
- The Cancer Imaging Archive (TCIA)
- Neuroimaging Informatics Tools and Resource Clearinghouse (NITRC)
- Neuroscience Information Framework (NIF)
- Structural Genomics Initiative
- Worldwide Protein Data Bank (wwPDB)
- Biomedical Informatics Research Network (BIRN)
- Collaborative Research in Computational Neuroscience (CRCNS)
National Science Foundation (NSF)
- Core Techniques and Technologies for Advancing Big Data Science & Engineering
- Cyberinfrastructure Framework for 21st Century Science & Engineering (CIF21)
- Data and Software Preservation for Open Science (DASPOS)
- Computational and Data-enabled Science and Engineering (CDS&E) in Mathematical and Statistical Science (CDS&E-MSS)
- Open Science Grid (OSG)
- Theoretical and Computational Astrophysics Networks (TCAN)
National Security Agency (NSA)
- Vigilant Net: A Competition to Foster and Test Cyber Defense Situational Awareness at Scale
- NSA/CSS Commercial Solutions Center (NCSC)
United States Geological Survey (USGS)
- John Wesley Powell Center for Analysis and Synthesis
The NITRD Program's focus:
Big Data (BD)
Cyber Physical Systems (CPS)
Cyber Security and Information Assurance (CSIA)
Health Information Technology R&D (Health IT R&D)
Human Computer Interaction and Information Management (HCI&IM)
The NITRD Program's focus (cont'd):
High Confidence Software and Systems (HCSS)
High End Computing (HEC)
Large Scale Networking (LSN)
Software Design and Productivity (SDP)
Social, Economic, and Workforce Implications of IT and IT Workforce Development (SEW)
Wireless Spectrum Research and Development (WSRD)
NSF-NIH Big Data Initiative
Eight (8) fundamental research projects on Big Data were announced on October 3, 2012.
Typically, one to three investigators per project.
Total of $15 million, so about $1.9 million per project on average.
1. Eliminating the Data Ingestion Bottleneck in Big-Data Applications, M. Farach-Colton (Rutgers) and M. Bender (Stony Brook)
2. DataBridge - A Sociometric System for Long-Tail Science Data Collections, A. Rajasekar (Univ. of N.C.), G. King (Harvard) and Justin Zhan (NC Agricultural & Technical State Univ.)
3. A Formal Foundation for Big Data Management, D. Suciu (Univ. of Washington)
4. Analytical Approaches to Massive Data Computation with Applications to Genomics, E. Upfal (Brown)
5. Distribution-based Machine Learning for High-dimensional Datasets, A. Singh (CMU)
6. Genomes Galore - Core Techniques, Libraries, and Domain Specific Languages for High-Throughput DNA Sequencing, S. Aluru (Iowa State), O. Olukotun (Stanford) and W. Feng (Virginia Tech)
7. Big Tensor Mining: Theory, Scalable Algorithms and Applications, C. Faloutsos (CMU) and N. Sidiropoulos (U. of Minnesota)
8. Discovery and Social Analytics for Large-Scale Scientific Literature, P. Kantor, T. Joachims (Cornell) and D. Blei (Princeton)
Source: Big Data at London Summer Games 2012: http://www.cloudtweaks.com/web/content//big-data-infographic1.jpg
Source: How much data is generated every minute: http://blogs-images.forbes.com/davefeinleib/files/2012/07/big-data-infographic.jpg
Source: Facts about Twitter: http://blog.sironaconsulting.com/.a/6a00d8341c761a53ef016767bafa2c970b-pi
Source: Info Graphic Healthcare IT: http://www.healthcareitconnect.com/wp-content/uploads/2012/10/infographic-big-data.jpg