Big Data Analytics

Prof. Dr. Lars Schmidt-Thieme
Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science
University of Hildesheim, Germany

33. Sitzung des Arbeitskreises Informationstechnologie (33rd Meeting of the Information Technology Working Group), Hildesheim

[based on original slides by Lucas Rego Drumond, ISMLL 2014]

Outline
1. What is Big Data?
2. Where to Find Big Data?
3. Big Data Applications
4. How to analyze Big Data?
5. Big Data at ISMLL, University of Hildesheim
6. Conclusions

1. What is Big Data?

What is Big Data?
Some definitions:
- "A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." (http://en.wikipedia.org/wiki/big_data)
- "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." (www.gartner.com/it-glossary/big-data/)

Big Data Dimensions (the 4 Vs)

What is Big Data?
Big Data is about:
- Storing and accessing large amounts of (unstructured/complex) data
  - querying
- Processing high-volume data streams
- Decision support based on large data
  - visualization, reporting
  - navigation / query interfaces
  - contextualization / sense making
- Building predictive models trained on large data

2. Where to Find Big Data?

Big Data in Physics (CERN)
- The Large Hadron Collider has collected data from over 300 trillion proton-proton collisions
- Approx. 25 Petabytes per year

Big Data in Genetics (Ensembl)
- The Ensembl database contains the genome of humans and 50 other species
- only 250 GB
Source: http://www.ensembl.org/

Big Data in the Web (Google)
- 3.3 billion searches per day (on average)
- 30 trillion unique URLs identified on the Web
- 20 billion sites crawled a day
- In 2008 Google processed more than 20 Petabytes of data per day
Sources:
- http://searchengineland.com/google-search-press-129925
- Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.

Big Data in Social Media (Facebook)
- 1.28 billion users (1.23 billion monthly active in January 2014)
- Size of user data stored by Facebook: 300 Petabytes
- Average amount of data that Facebook takes in daily: 600 Terabytes
- Size of Facebook's Graph Search database: 700 Terabytes
Source: http://allfacebook.com/orcfile_b130817

Big Data in Social Media (Twitter)
- Average number of tweets per day: 58 million
- Number of Twitter search engine queries every day: 2.1 billion
- Total number of active registered Twitter users: 645,750,000
Source: http://www.statisticbrain.com/twitter-statistics/

Big Data Public Datasets

Dataset               Content                                       Size
1000 Genomes Project  DNA of 1700 humans                            200 TB
Common Crawl Corpus   5 billion web pages                           81 TB
Wikipedia / Freebase  1.9 billion subject/predicate/object triples  250 GB
Million Song Dataset  audio features of 1M songs                    280 GB
OpenStreetMap         a map of earth                                90 GB
2000 US Census        US census data                                200 GB
PubChem library       biological activities of small molecules      230 GB
NCDC weather data     daily measurements from 9000 stations         20 GB
Open Library          metadata of 20M books                         7 GB
Twitter               1.6 billion tweets                            0.6 GB

For comparison: CD 700 MB, DVD 4.7 to 17 GB, Blu-ray 25 to 100 GB, hard disk 3 to 4 TB.

How Large is 1 Petabyte (PB)?
- 35.7 million years, counted at one byte per second
- 245 years of listening to music stored in CD quality (500 MB/h)
- 25 years of watching DVDs
- 223,000 DVDs (4.7 GB each), but there are only 74,000 TV movies on IMDB (1.8M including TV episodes)
- can be stored on 341 hard disks at 3 TB / 90 € each (30,000 €)
- 96 days to read sequentially from standard hard disks (1030 MBit/s)

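The comparisons on this slide are plain arithmetic and can be rechecked with a few lines of Python. The sketch below assumes binary prefixes (1 PB = 2**50 bytes, 1 MB = 2**20 bytes) and a 365-day year. These conventions are an assumption, not stated on the slide, although they reproduce the DVD, hard disk and read-time figures. The function name `petabyte_facts` is hypothetical. Under these assumptions, the music comparison at 500 MB/h comes out at roughly 245 years.

```python
# Assumption: binary prefixes throughout (not stated on the slide).
PB = 2**50
TB = 2**40
GB = 2**30
MB = 2**20

SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def petabyte_facts():
    """Return the slide's comparisons for 1 PB as plain numbers."""
    return {
        # counting one byte per second
        "years_counting_bytes": PB / SECONDS_PER_YEAR,
        # audio stored at 500 MB per hour
        "years_of_music": PB / (500 * MB) / 24 / 365,
        # DVDs of 4.7 GB each
        "dvds": PB / (4.7 * GB),
        # hard disks of 3 TB each
        "harddisks": PB / (3 * TB),
        # sequential read at 1030 Mbit/s (factor 8 converts bytes to bits)
        "days_to_read": PB * 8 / (1030 * MB) / SECONDS_PER_DAY,
    }


if __name__ == "__main__":
    for name, value in petabyte_facts().items():
        print(f"{name}: {value:,.1f}")
```

With decimal prefixes (1 PB = 10**15 bytes) the results shift by roughly 10 percent, so the choice of convention matters when quoting such figures.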
3. Big Data Applications

What to do with Big Data?
Application examples:
- Test complex scientific hypotheses (physics, genetics)
- Index the web, including relevance feedback by users (web)
- Online personalized advertising (social media, esp. Google)
- Recommender systems (e-commerce, esp. Amazon)
- Media analysis, sentiment analysis, market research (social media)
  - e.g., Obama campaign 2012

More Applications & Case Studies
- T-Mobile USA: integrated Big Data across multiple IT systems to combine customer transaction and interaction data in order to better predict customer defections
  - By leveraging social media data along with transaction data from CRM and billing systems, customer defections are said to have been cut in half in a single quarter.
- US Xpress: collects data elements ranging from fuel usage to tire condition to truck engine operations to GPS information
  - Optimal fleet management
- McLaren's Formula One racing team: real-time car sensor data during car races
  - Real-time identification of issues with its racing cars

4. How to analyze Big Data?

How to handle Big Data? The BI Approach
- Data Warehouse
- Static databases (snapshots)
- Structured data
- Centralized approaches

How to handle Big Data? The Distributed Approach
- Massive Parallelism
- Heterogeneous data sources
- Unstructured data
- Data streams

Challenges
Challenges to deal with large volumes of data:
- Store and query large amounts of data in a distributed environment efficiently
  - distributed file systems
  - NoSQL databases, distributed databases
- Process distributed large data efficiently
  - execution environments
- Scale / distribute machine learning techniques
  - distributed learning algorithms (message passing)
  - ML execution environments

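To make the idea of a distributed learning algorithm with message passing more concrete, here is a minimal sketch of data-parallel gradient descent. It is an illustration under simplifying assumptions, not a method taken from the slides: the model is a one-parameter regression y ≈ w · x, the "workers" are simulated sequentially in one process, and the names `local_gradient` and `distributed_fit` are hypothetical. Each worker only sees its own data shard and sends a small message (gradient sum and example count) to a coordinator, which aggregates the messages and updates the shared parameter.

```python
def local_gradient(w, shard):
    """Worker step: squared-error gradient sum for y ~ w * x on one shard."""
    grad = sum(2 * (w * x - y) * x for x, y in shard)
    return grad, len(shard)  # the message sent to the coordinator


def distributed_fit(shards, learning_rate=0.01, steps=200):
    """Coordinator: broadcast w, collect one message per shard, update w."""
    w = 0.0
    for _ in range(steps):
        # In a real cluster these calls would run in parallel on separate nodes.
        messages = [local_gradient(w, shard) for shard in shards]
        total_grad = sum(g for g, _ in messages)
        total_n = sum(n for _, n in messages)
        w -= learning_rate * total_grad / total_n
    return w


if __name__ == "__main__":
    # Toy data following y = 2 * x, split across three hypothetical workers.
    shards = [[(1, 2), (2, 4)], [(3, 6), (4, 8)], [(5, 10), (6, 12)]]
    print(distributed_fit(shards))
```

The raw data never leaves its shard. Only two numbers per worker and step are communicated, which is what makes this pattern attractive when the data is too large to move.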
Execution Environments: MapReduce

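The MapReduce programming model can be sketched with a word count, the customary introductory example. The code below is a single-process illustration of the three phases (map, shuffle, reduce). It does not use the Hadoop API, and the function names `map_phase`, `shuffle`, `reduce_phase` and `map_reduce` are hypothetical. In a real execution environment the map and reduce calls would be distributed over many nodes and the shuffle would move data across the network.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    docs = ["big data is big", "data streams are big data"]
    print(map_reduce(docs))
```

Because each map call looks at one record and each reduce call at one key, both phases can be parallelized without coordination between the individual calls.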
Execution Environments: GraphLab

5. Big Data at ISMLL, University of Hildesheim

Usage of Big Data in Cooperative Research Projects
- transportation
  - REDUCTION: Danish taxi fleet (14,000 vehicles, 1.5 million trips, 2.2 billion GPS measurements)
- e-commerce / recommender systems
  - Netflix dataset: 100 million transactions
  - Rossmann Online / Compra
- technology enhanced learning
  - Whizz Education (1200 exercises, 250,000 students, 30 million interactions)
- engineering data mining
  - Rolls Royce: jet engine vibration
  - Detectino: ground penetrating radar

ISMLL Compute Cluster
- 61 compute nodes
- 840 cores, 1288 threads
- 2.2 TB RAM
- 183 TB hard disk capacity
- 10 TFlops
special nodes:
- database server
- coprocessor compute node (240 cores, 960 threads, 4 TFlops)
software:
- simple scheduler (Sun Grid Engine)
- MapReduce (Hadoop)

Data Analytics: A New Study Programme in 2015

Term  Module                            Type     ECTS
1     Advanced Machine Learning         lecture  6
      Modern Optimization Techniques    lecture  6
      Programming Machine Learning      lab      6
      Seminar Data Analytics I          seminar  4
      application module I              misc.    6
2     Advanced Database Technologies    lecture  6
      Data and Privacy Protection       lecture  3
      methodological specialization I   lecture  6
      Distributed Machine Learning      lab      6
      Seminar Data Analytics II         seminar  4
      Project (part I)                  project  6
3     Planning and Optimal Control      lecture  6
      methodological specialization II  lecture  6
      Project (part II)                 project  9
      Seminar Data Analytics III        seminar  4
      application module II             misc.    6
4     Master thesis and colloquium      thesis   30
      Total                                      120

Data Analytics: A New Study Programme in 2015
- solid education on all fundamental aspects of data analytics at the state of the art
  - machine learning
  - analytical database technology & execution models
  - planning and control
- several methodological specializations: Bayesian Networks, Computer Vision, etc.
- integrated application area
  - media systems, software engineering, environmental sciences, computer linguistics, information sciences, psychology (several still requested)
- hands-on lab courses
- a deep and fun, two-term integrated group project
- fully internationally targeted (completely in English)

6. Conclusions

Big Data: Opportunities ...

Big Data: ... and Risks

Conclusions
- Big Data Analytics addresses the analysis of large and complex data (volume, variety, velocity, veracity)
  - at Petabyte scale (1 Petabyte = 1024 Terabytes)
- Big Data emerges naturally in many different domains (science, web, e-commerce, social media, robotics)
- Big Data requires massively parallel infrastructure to be analyzed in a timely manner
  - data centers with hundreds and thousands of compute nodes
  - distributed databases, NoSQL databases
  - execution environments (MapReduce)
- To exploit big data in a principled and optimal way, machine learning methods have to be scaled and distributed. This requires innovations at the level of the learning algorithms as well as special ML execution environments.