Big Data Analytics. Lucas Rego Drumond

Similar documents

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

Big Data and Analytics (Fall 2015)

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Challenges for Data Driven Systems

The 4 Pillars of Technosoft s Big Data Practice

BIG DATA TRENDS AND TECHNOLOGIES

Are You Ready for Big Data?

CSE 427 CLOUD COMPUTING WITH BIG DATA APPLICATIONS

BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand?

Big Data Analytics. Genoveva Vargas-Solar French Council of Scientific Research, LIG & LAFMIA Labs

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Are You Ready for Big Data?

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

An Oracle White Paper June Oracle: Big Data for the Enterprise

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Big Data Analytics. Lucas Rego Drumond

Analyzing Big Data with AWS

An Oracle White Paper September Oracle: Big Data for the Enterprise

Big Data Systems CS 5965/6965 FALL 2015

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

The Next Wave of Data Management. Is Big Data The New Normal?

BIG DATA What it is and how to use?

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Big Data With Hadoop

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Modern Data Warehouse

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Big Data Analytics. Lucas Rego Drumond

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Next presentation starting soon Business Analytics using Big Data to gain competitive advantage

Big Data: Opportunities & Challenges, Myths & Truths 資料來源 : 台大廖世偉教授課程資料

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Transforming the Telecoms Business using Big Data and Analytics

Big Data Management and Analytics

Cleveland State University

Software Engineering for Big Data. CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo

Big Data Analytics. Lucas Rego Drumond

Application Development. A Paradigm Shift

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Big Data Management in the Clouds. Alexandru Costan IRISA / INSA Rennes (KerData team)

How To Handle Big Data With A Data Scientist

Big Data Analytics. Lucas Rego Drumond

An Oracle White Paper October Oracle: Big Data for the Enterprise

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

NoSQL for SQL Professionals William McKnight

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Big Data Are You Ready? Thomas Kyte

BUDT 758B-0501: Big Data Analytics (Fall 2015) Decisions, Operations & Information Technologies Robert H. Smith School of Business

Big Systems, Big Data

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

BIG DATA CHALLENGES AND PERSPECTIVES

Hadoop. Sunday, November 25, 12

Extreme Computing. Big Data. Stratis Viglas. School of Informatics University of Edinburgh Stratis Viglas Extreme Computing 1

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

INTRODUCTION TO CASSANDRA

Big Data and Hadoop with components like Flume, Pig, Hive and Jaql

Big Data on Microsoft Platform

How Companies are! Using Spark

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Interactive data analytics drive insights

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

Delivering new insights and value to consumer products companies through big data

Problems to store, transfer and process the Big Data 6/2/2016 GIANG TRAN - TTTGIANG2510@GMAIL.COM 1

Real Time Big Data Processing

This Symposium brought to you by

CISC 432/CMPE 432/CISC 832 Advanced Database Systems

How To Scale Out Of A Nosql Database

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Big Data Frameworks Course. Prof. Sasu Tarkoma

CitusDB Architecture for Real-Time Big Data

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014

Large-Scale Data Processing

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

SEIZE THE DATA SEIZE THE DATA. 2015

Native Connectivity to Big Data Sources in MSTR 10

Machine Learning over Big Data

Scaling Out With Apache Spark. DTL Meeting Slides based on

The Internet of Things and Big Data: Intro

Big Data. Fast Forward. Putting data to productive use

Introduction to Analytics and Big Data - Hadoop. Rob Peglar EMC Isilon

How To Use Big Data For Telco (For A Telco)

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Getting to Know Big Data

Big Data a threat or a chance?

Transcription:

Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 36

Outline 1. What is Big Data? 3. Organizational Stuff Big Data Analytics 1 / 36

1. What is Big Data? Outline 1. What is Big Data? 3. Organizational Stuff Big Data Analytics 1 / 36

1. What is Big Data? What is Big Data? Big Data Analytics 1 / 36

1. What is Big Data? What is Big Data? Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it. - Dan Ariely Big Data Analytics 2 / 36

1. What is Big Data? What is Big Data? Some definitions: A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. http://en.wikipedia.org/wiki/big data Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. www.gartner.com/it-glossary/big-data/ Big Data Analytics 3 / 36

1. What is Big Data? What is Big Data? Big Data is about: Storing and accessing large amounts of (unstructured) data Processing high volume data streams Making sense of the data Predictive technologies Big Data Analytics 4 / 36

1. What is Big Data? Where to find Big Data? 1.28 billion users (1.23 billion monthly active in January 2014) Size of user data sored by Facebook: 300 Petabytes Average amount of data that Facebook takes in daily: 600 terabytes Size of Facebook s Graph Search database: 700 Terabytes Source: http://allfacebook.com/orcfile b130817 Big Data Analytics 5 / 36

1. What is Big Data? Where to find Big Data? 3.3 billion searches per day (on average) 1 30 trillion unique URLs identified on the Web 1 20 billion sites crawled a day 1 In 2008 Google processed more than 20 Petabytes of data per day 2 1 http://searchengineland.com/google-search-press-129925 2 Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113. Big Data Analytics 6 / 36

1. What is Big Data? Where to find Big Data? Average number of tweets per day: 58 million 1 Number of Twitter search engine queries every day: 2.1 billion 1 Total number of active registered Twitter users: 645,750,000 1 1 http://www.statisticbrain.com/twitter-statistics/ Big Data Analytics 7 / 36

1. What is Big Data? Where to find Big Data? Ensembl database contains the genome of humans and 50 other species only 250 GB 1 1 http://www.ensembl.org/ Big Data Analytics 8 / 36

1. What is Big Data? Where to find Big Data? Large Hadron Collider has collected data from over 300 trillion proton-proton collisions Approx. 25 Petabytes per year Big Data Analytics 9 / 36

1. What is Big Data? What to do with Big Data? We don t want to know things but to understand them! Big Data Analytics 10 / 36

1. What is Big Data? What to do with Big Data? - Case Studies T-Mobile USA: integrated Big Data across multiple IT systems to combine customer transaction and interactions data in order to better predict customer defections By leveraging social media data along with transaction data from CRM and Billing systems, customer defections has been cut in half in a single quarter. US Xpress: collects data elements ranging from fuel usage to tire condition to truck engine operations to GPS information Optimal fleet management McLaren s Formula One racing team: real-time car sensor data during car races Real time identification of issues with its racing cars Big Data Analytics 11 / 36

1. What is Big Data? What to do with Big Data? - The BI Approach Data Warehouse Static databases Structured data Centralized approaches Big Data Analytics 12 / 36

1. What is Big Data? What to do with Big Data? Massive Parallelism Heterogeneous data sources Unstructured data Data streams Big Data Analytics 13 / 36

1. What is Big Data? What to do with Big Data? Application examples: Online personalized advertising Sentiment analysis and behavior prediction Detecting adverse events and predicting their impact Automatic Translation Image Classification and object recognition Intelligent public services Big Data Analytics 14 / 36

1. What is Big Data? How? In order to deal with large volumes of data we need to address the following challenges: Effectively store and large amounts of data in a distributed environment Query distributed databases Parallel and distributed programing models Data Mining and machine learning techniques to make sense of the data Effective data visualisation techniques Big Data Analytics 15 / 36

Outline 1. What is Big Data? 3. Organizational Stuff Big Data Analytics 16 / 36

Overview Part III Machine Learning Algorithms Part II Large Scale Computational Models Part I Distributed Database Distributed File System Big Data Analytics 16 / 36

Overview Part I Distributed Database Distributed File System Big Data Analytics 17 / 36

Storing In a distributed environment the data storing mechanisms should address the following issues Parallel Reading and Writing Data node Failures High Availability Big Data Analytics 18 / 36

Distributed File Systems The Google File System Architecture Big Data Analytics 19 / 36

Databases Databases are needed for Querying and indexing transaction procesing State-of-the-art: Relational Databases For processing big data one needs a database which: Supports high level of parallelism Supports analytical processing Has a flexible data model to deal with unstructured data sources Big Data Analytics 20 / 36

Databases for Big Data - NoSQL NoSQL - Not only SQL Wide variety of database technologies Dynamic Schema sharded indexing horizontal scaling support columnar storage Big Data Analytics 21 / 36

NoSQL Databases Big Data Analytics 22 / 36

Overview Part II Large Scale Computational Models Part I Distributed Database Distributed File System Big Data Analytics 23 / 36

Accessing A computational model is needed to: Provide a set of useful computational primitives Hide the complexity of distributed and parallel programming Ensure Fault Tolerance Examples: MapReduce GraphLab Pregel Apache Spark Big Data Analytics 24 / 36

MapReduce Big Data Analytics 25 / 36

GraphLab Big Data Analytics 26 / 36

Apache Spark Big Data Analytics 27 / 36

Overview Part III Machine Learning Algorithms Part II Large Scale Computational Models Part I Distributed Database Distributed File System Big Data Analytics 28 / 36

Making sense of the data Linear and Non Linear Models for classification and regression Scalable learning algorithms (e.g. Stochastic Gradient Descent) Distributed Learning Algorithms (e.g. ADMM) Models for Link Prediction and link analysis Factorization models Distributed Learning Schemes (e.g. NOMAD, FPSGD) Big Data Analytics 29 / 36

Classification x 2 w w x + b = 1 w x + b = 0 2 w w x + b = 1 x 1 b w Big Data Analytics 30 / 36

Recommender Systems Big Data Analytics 31 / 36

Graph Analysis y F S (n, o) =? Obama y F S (o, l) =? Lucas o l y C (o, c) = 1 c Annexation of Crimea y F (l, o) = 1y C (l, c) =? y C (o, p) =? y C (l, p) = 1 p Paco s Death y S (l, n) = 1 ys (n, l) = 1 Nico n y F S (n, l) =? y C (n, o) =? Big Data Analytics 32 / 36

Course Overview Main goal: predictive analytics from large scale data! Introduction (1 Lecture) Machine Learning problems afflicted by Big Data (3 Lectures) Distributed Learning algorithms (3 Lectures) Parallel and distributed programing models (4 Lectures) Large scale storage and retrieval mechanisms (1 Lecture) Big Data Analytics 33 / 36

3. Organizational Stuff Outline 1. What is Big Data? 3. Organizational Stuff Big Data Analytics 34 / 36

3. Organizational Stuff Exercises and tutorials There will be a weekly sheet with two exercises handed out each Friday in the tutorial. 1st sheet will be handed out Fri. 24.04 Solutions to the exercises can be submitted until next Thursday before the lecture. 1st sheet is due Thu. 30.04. Exercises will be corrected Tutorials each Friday 10-12, 1st tutorial at Friday 24.04 Successful participation in the tutorial gives up to 10% bonus points for the exam. Big Data Analytics 34 / 36

3. Organizational Stuff Exams and credit points There will be a written exam at the end of the term (2h, 4 problems). The course gives 6 ECTS The course can be used in IMIT MSc. / Informatik / Gebiet KI & ML Wirtschaftsinformatik MSc / Informatik / Gebiet KI & ML Big Data Analytics 35 / 36

3. Organizational Stuff Some books Anand Rajaraman, Jure Leskovec, and Jeffrey Ullman: Mining of massive datasets Available online: http://infolab.stanford.edu/ ullman/mmds.html Gautam Shroff: The Intelligent Web: Search, smart algorithms, and big data Big Data Analytics 36 / 36