Big Data and Industrial Internet

Big Data and Industrial Internet Keijo Heljanko Department of Computer Science and Helsinki Institute for Information Technology HIIT School of Science, Aalto University keijo.heljanko@aalto.fi 16.6-2015 1/13

Big Data EMC sponsored IDC study estimates the Digital Universe to be 4.4 Zettabytes (4 400 000 000 TB) in 2013 The amount of data is estimated to grow to 44 Zettabytes by 2020 Data comes from: Video, digital images, sensor data, biological data, Internet sites, social media, Internet of Things,... Some examples: Netflix is collecting 1 PB of data per month from its video service user behaviour Rovio is collecting in the order of 1 TB of data per day of games logs The problem of such large data massed, termed Big Data calls for new approaches to analyzing these data masses 2/13

No Single Threaded Performance Increases Herb Sutter: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb s Journal, 30(3), March 2005 (updated graph in August 2009). 3/13

Implications of the End of Free Lunch The clock speeds of microprocessors are not going to improve much in the foreseeable future The efficiency gains in single threaded performance are going to be only moderate The number of transistors in a microprocessor is still growing at a high rate One of the main uses of transistors has been to increase the number of computing cores the processor has The number of cores in a low end workstation (as those employed in large scale datacenters) is going to keep on steadily growing Programming models need to change to efficiently exploit all the available concurrency - scalability to high number of cores/processors will need to be a major focus 4/13

Hadoop - Linux of Big Data Hadoop = Open Source Distributed Operating System Distribution for Big Data Based on cloning the Google architecture design Fault tolerant distributed filesystem: HDFS Batch processing for unstructured data: Hadoop MapReduce and Apache Pig (HDD), Apache Spark (RAM) Distributed SQL database queries for analytics: Apache Hive, Spark SQL, Cloudera Impala, Facebook Presto Fault tolerant real-time distributed database: HBase Distributed machine learning libraries, text indexing & search, etc. Data import and export with traditional databases: Sqoop 5/13

Commercial Hadoop Support Cloudera: Probably the largest Hadoop distributor, partially owned by Intel (740 million USD investment for 18% share). Available from: http://www.cloudera.com/ Hortonworks: Yahoo! spin-off from their large Hadoop development team: http://www.hortonworks.com/ MapR: A rewrite of much of Apache Hadoop in C++, including a new filesystem. API-compatible with Apache Hadoop. http://www.mapr.com/ 6/13

Example: Cloudera s Hadoop Stack 7/13

Example: Hortonworks Hadoop Stack 8/13

Lambda Architecture Batch computing systems like Hadoop have a large latency in order of minutes to hours but have extremely good fault tolerance Stream processing systems such as Apache Storm combined with Apache Kafka allow for large scale real time processing with reduce fault tolerance These stream processing frameworks can provide latencies in the order of tens of milliseconds Lambda architecture is the combination of consistent / exact but high latency batch processing and available / approximate low latency stream processing 9/13

Lambda Architecture (cnt.) For more details, see e.g.,: http://lambda-architecture.net/ 10/13

Lambda Architecture for IoT Use 3 11/13

Big Data at Aalto We have been working with Hadoop since 2010 Teaching Hadoop and Big data technologies since 2011 - Course CSE-E5430 Scalable Cloud Computing, materials freely available on the Web: https://noppa.aalto.fi/noppa/kurssi/cse-e5430/luennot Analytics and Data Science Minor for Aalto students since 2014 Collaborating in Data to Intelligence project on Hadoop and Apache Spark related topics with CSC, Tieto, and PacketVideo Collaborating with QPR Software on business process mining 12/13

Conclusions Hadoop is becoming the Linux distribution for Big Data It consists of a number of interoperable open source components Commercial support is available from commercial vendors such as: Cloudera, HortonWorks, MapR, IBM, and Intel; as well as their partners Amazon is providing Elastic MapReduce (EMR), which is basically Hadoop as a service offering Microsoft HDInsight is a similar service for Hadoop on Microsoft Azure We are happy to discuss your Big Data issues, email: keijo.heljanko@aalto.fi 13/13