Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme




Big Data Analytics
Prof. Dr. Lars Schmidt-Thieme
Information Systems and Machine Learning Lab (ISMLL), Institute of Computer Science, University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie (33rd meeting of the Information Technology Working Group), Hildesheim
[based on original slides by Lucas Rego Drumond, ISMLL 2014]

Outline
1. What is Big Data?
2. Where to Find Big Data?
3. Big Data Applications
4. How to analyze Big Data?
5. Big Data at ISMLL, University of Hildesheim
6. Conclusions

1. What is Big Data?


What is Big Data?
Some definitions:
- "A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications."
  Source: http://en.wikipedia.org/wiki/big_data
- "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."
  Source: www.gartner.com/it-glossary/big-data/

Big Data Dimensions (the 4 Vs)

What is Big Data?
Big Data is about:
- Storing and accessing large amounts of (unstructured/complex) data
  - querying
- Processing high-volume data streams
- Decision support based on large data
  - visualization, reporting
  - navigation / query interfaces
  - contextualization / sense making
- Building predictive models trained on large data

2. Where to Find Big Data?

Big Data in Physics (CERN)
- The Large Hadron Collider has collected data from over 300 trillion proton-proton collisions
- Approx. 25 Petabytes per year

Big Data in Genetics (Ensembl)
- The Ensembl database contains the genome of humans and 50 other species
- only 250 GB
Source: http://www.ensembl.org/

Big Data in the Web (Google)
- 3.3 billion searches per day (on average)
- 30 trillion unique URLs identified on the Web
- 20 billion sites crawled a day
- In 2008 Google processed more than 20 Petabytes of data per day
Sources:
- http://searchengineland.com/google-search-press-129925
- Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.

Big Data in Social Media (Facebook)
- 1.28 billion users (1.23 billion monthly active in January 2014)
- Size of user data stored by Facebook: 300 Petabytes
- Average amount of data that Facebook takes in daily: 600 Terabytes
- Size of Facebook's Graph Search database: 700 Terabytes
Source: http://allfacebook.com/orcfile_b130817

Big Data in Social Media (Twitter)
- Average number of tweets per day: 58 million
- Number of Twitter search engine queries every day: 2.1 billion
- Total number of active registered Twitter users: 645,750,000
Source: http://www.statisticbrain.com/twitter-statistics/

Big Data Public Datasets

Dataset                 Content                                          Size
1000 Genomes Project    DNA of 1700 humans                               200 TB
Common Crawl Corpus     5 billion web pages                              81 TB
Wikipedia / Freebase    1.9 billion subject/predicate/object triples     250 GB
Million Song Dataset    audio features of 1 million songs                280 GB
OpenStreetMap           a map of earth                                   90 GB
2000 US Census          US census data                                   200 GB
PubChem library         biological activities of small molecules         230 GB
NCDC weather data       daily measurements from 9000 stations            20 GB
Open Library            metadata of 20 million books                     7 GB
Twitter                 1.6 billion tweets                               0.6 GB

For comparison: CD 700 MB, DVD 4.7 to 17 GB, Blu-ray 25 to 100 GB, hard disk 3 to 4 TB.

How Large is 1 Petabyte (PB)?
- 35.7 million years when counted at one byte per second
- 254 years listening to music stored in CD quality (500 MB/h)
- 25 years watching DVDs
  - 223,000 DVDs (4.7 GB each)
  - but there are only 74,000 TV movies on IMDB! (1.8 million including TV episodes)
- can be stored on 341 hard disks of 3 TB at 90 € each (about 30,000 € in total)
- 96 days to read sequentially from standard hard disks (1030 MBits/s)
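
Several of these figures can be reproduced with a few lines of arithmetic. The sketch below is only a plausibility check: it assumes binary units (1 PB = 2^50 bytes, 1 MBit = 2^20 bits) and a 365-day year, conventions that are not stated on the slide but appear to match its numbers. All function names are made up for illustration.

```python
# Back-of-the-envelope check of some petabyte figures.
# Assumptions (not stated in the slides): binary units, 365-day year.
PB = 2**50    # bytes in one petabyte
TB = 2**40    # bytes in one terabyte
GB = 2**30    # bytes in one gigabyte
MBIT = 2**20  # bits in one megabit

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def years_counting(num_bytes, bytes_per_second=1):
    """Years needed to count num_bytes at the given rate."""
    return num_bytes / bytes_per_second / SECONDS_PER_YEAR


def dvds_needed(num_bytes, dvd_gb=4.7):
    """Number of single-layer DVDs with the same total capacity."""
    return num_bytes / (dvd_gb * GB)


def disks_needed(num_bytes, disk_tb=3):
    """Number of hard disks of disk_tb terabytes with the same capacity."""
    return num_bytes / (disk_tb * TB)


def days_to_read(num_bytes, mbits_per_second=1030):
    """Days needed to read num_bytes sequentially at the given bit rate."""
    seconds = num_bytes * 8 / (mbits_per_second * MBIT)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"counting: {years_counting(PB) / 1e6:.1f} million years")
    print(f"DVDs:     {dvds_needed(PB):,.0f}")
    print(f"disks:    {disks_needed(PB):.1f}")
    print(f"reading:  {days_to_read(PB):.1f} days")
```

With decimal units (1 PB = 10^15 bytes) the results would come out roughly 11 % smaller, so the choice of convention matters at this scale.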

3. Big Data Applications

What to do with Big Data?
Application examples:
- Test complex scientific hypotheses (physics, genetics)
- Index the web, including relevance feedback by users (web)
- Online personalized advertising (social media, esp. Google)
- Recommender systems (e-commerce, esp. Amazon)
- Media analysis, sentiment analysis, market research (social media)
  - e.g., Obama campaign 2012

More Applications & Case Studies
- T-Mobile USA: integrated Big Data across multiple IT systems to combine customer transaction and interaction data in order to better predict customer defections
  - By leveraging social media data along with transaction data from CRM and billing systems, customer defections are said to have been cut in half in a single quarter.
- US Xpress: collects data elements ranging from fuel usage to tire condition to truck engine operations to GPS information
  - Optimal fleet management
- McLaren's Formula One racing team: real-time car sensor data during car races
  - Real-time identification of issues with its racing cars

4. How to analyze Big Data?

How to handle Big Data? The BI Approach
- Data Warehouse
- Static databases (snapshots)
- Structured data
- Centralized approaches

How to handle Big Data? The Distributed Approach
- Massive Parallelism
- Heterogeneous data sources
- Unstructured data
- Data streams

Challenges
Challenges to deal with large volumes of data:
- Store and query large amounts of data in a distributed environment efficiently
  - distributed file systems
  - NoSQL databases, distributed databases
- Process distributed large data efficiently
  - execution environments
- Scale / distribute machine learning techniques
  - distributed learning algorithms (message passing)
  - ML execution environments
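
To make the idea of a distributed learning algorithm based on message passing more concrete, the following sketch simulates data-parallel gradient descent for a one-parameter linear model. Each simulated worker holds one shard of the data, computes a local gradient, and sends it as a message to a coordinator, which combines the messages and updates the shared parameter. The slides do not name a specific algorithm, so the model, the toy data, the learning rate and all function names are assumptions chosen for illustration; a real system would run the workers on separate machines.

```python
def local_gradient(w, shard):
    """Worker step: gradient of the squared error of y ~ w * x on one shard.

    Returns the message sent to the coordinator: (gradient sum, sample count).
    """
    grad = sum(2.0 * (w * x - y) * x for x, y in shard)
    return grad, len(shard)


def train(shards, learning_rate=0.01, steps=200):
    """Coordinator loop: collect worker messages, average, update w."""
    w = 0.0
    for _ in range(steps):
        # In a real cluster these calls would run in parallel on the workers.
        messages = [local_gradient(w, shard) for shard in shards]
        total_grad = sum(g for g, _ in messages)
        total_n = sum(n for _, n in messages)
        w -= learning_rate * total_grad / total_n
    return w


# Hypothetical toy data following y = 3x, split across three workers.
data = [(float(x), 3.0 * x) for x in range(1, 10)]
shards = [data[0:3], data[3:6], data[6:9]]

if __name__ == "__main__":
    print(f"learned w = {train(shards):.4f}")
```

Only the gradient sums and counts travel between workers and coordinator, never the raw data, which is what keeps the communication cost independent of the shard size.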

Execution Environments: MapReduce
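
The MapReduce programming model can be illustrated with the classic word-count example. The sketch below runs the three phases (map, shuffle, reduce) in a single Python process; it is not the Hadoop API, and all function names are hypothetical. In a real execution environment the map and reduce calls would be distributed over many nodes and the shuffle would move data across the network.

```python
from collections import defaultdict


def map_phase(documents, mapper):
    """Apply the mapper to every input record and emit (key, value) pairs."""
    for doc in documents:
        yield from mapper(doc)


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(doc):
    """Emit (word, 1) for every whitespace-separated word in a document."""
    for word in doc.lower().split():
        yield word, 1


def count_reducer(word, counts):
    """Sum the partial counts for one word."""
    return sum(counts)


def word_count(documents):
    pairs = map_phase(documents, word_mapper)
    return reduce_phase(shuffle(pairs), count_reducer)


if __name__ == "__main__":
    print(word_count(["big data", "Big Data analytics"]))
```

Because the mapper looks at one record at a time and the reducer at one key at a time, both phases can be parallelized without the user code having to deal with distribution.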

Execution Environments: GraphLab
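
GraphLab is a graph-parallel environment in which computation is expressed as small programs that run per vertex and read from neighboring vertices. The slide itself only shows a figure, so the sketch below is an assumption-laden illustration of that vertex-centric style rather than GraphLab's actual API: it computes PageRank with an explicit "gather" step (collect contributions from in-neighbors) and "apply" step (update the vertex value), and it runs synchronously in one process for simplicity. The function name and parameters are hypothetical.

```python
def pagerank(edges, num_iterations=20, damping=0.85):
    """Vertex-centric PageRank on a directed graph given as (src, dst) pairs."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = {v: 0 for v in vertices}
    in_neighbors = {v: [] for v in vertices}
    for src, dst in edges:
        out_degree[src] += 1
        in_neighbors[dst].append(src)

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(num_iterations):
        new_rank = {}
        for v in vertices:
            # Gather: sum the rank shares arriving over incoming edges.
            gathered = sum(rank[u] / out_degree[u] for u in in_neighbors[v])
            # Apply: combine the gathered value into the new vertex state.
            new_rank[v] = (1 - damping) / n + damping * gathered
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical three-vertex graph.
    print(pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]))
```

Each vertex update only touches the vertex and its direct neighbors, which is the property a graph-parallel environment exploits to partition the graph across machines.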

5. Big Data at ISMLL, University of Hildesheim

Usage of Big Data in Cooperative Research Projects
- transportation
  - REDUCTION: Danish taxi fleet (14,000 vehicles, 1.5 million trips, 2.2 billion GPS measurements)
- e-commerce / recommender systems
  - Netflix dataset: 100 million transactions
  - Rossmann Online / Compra
- technology enhanced learning
  - Whizz Education (1200 exercises, 250,000 students, 30 million interactions)
- engineering data mining
  - Rolls Royce: jet engine vibration
  - Detectino: ground penetrating radar

ISMLL Compute Cluster
- 61 compute nodes
- 840 cores, 1288 threads
- 2.2 TB RAM
- 183 TB hard disk capacity
- 10 TFlops
special nodes:
- database server
- coprocessor compute node (240 cores, 960 threads, 4 TFlops)
software:
- simple scheduler (Sun Grid Engine)
- MapReduce (Hadoop)

Data Analytics: A New Study Programme in 2015

Term   Module                               Type      ECTS
1      Advanced Machine Learning            lecture   6
       Modern Optimization Techniques       lecture   6
       Programming Machine Learning         lab       6
       Seminar Data Analytics I             seminar   4
       application module I                 misc.     6
2      Advanced Database Technologies       lecture   6
       Data and Privacy Protection          lecture   3
       methodological specialization I      lecture   6
       Distributed Machine Learning         lab       6
       Seminar Data Analytics II            seminar   4
       Project (part I)                     project   6
3      Planning and Optimal Control         lecture   6
       methodological specialization II     lecture   6
       Project (part II)                    project   9
       Seminar Data Analytics III           seminar   4
       application module II                misc.     6
4      Master thesis and colloquium         thesis    30
Total                                                 120

Data Analytics: A New Study Programme in 2015
- solid education on all fundamental aspects of data analytics at the state of the art
  - machine learning
  - analytical database technology & execution models
  - planning and control
- several methodological specializations: Bayesian Networks, Computer Vision, etc.
- integrated application area
  - media systems, software engineering, environmental sciences, computer linguistics, information sciences, psychology (several still requested)
- hands-on lab courses
- a deep and fun, two-term integrated group project
- fully internationally targeted (completely in English)

6. Conclusions

Big Data Chances ...

Big Data ... and Risks

Conclusions
- Big Data Analytics addresses the analysis of large and complex data (volume, variety, velocity, veracity) at Petabyte scale (1 Petabyte = 1024 Terabytes)
- Big Data emerges naturally in many different domains (science, web, e-commerce, social media, robotics)
- Big Data requires massively parallel infrastructure to be analyzed in a timely manner
  - data centers with hundreds to thousands of compute nodes
  - distributed databases, NoSQL databases
  - execution environments (MapReduce)
- To exploit big data in a principled and optimal way, machine learning methods have to be scaled and distributed, which requires innovations at the level of the learning algorithms as well as special ML execution environments.