How To Get The Most Out Of Big Data




Big Data Analytics: Chances and Challenges
Volker Markl, Professor and Chair, Database Systems and Information Management (DIMA), Technische Universität Berlin
www.dima.tu-berlin.de
Volker Markl, DIMA, BDOD: Big Data Chances and Challenges

More and more data is available to science and business!
Data sources: sensor data, web archives, video streams, audio streams, simulation data, RFID data
Drivers: Cloud Computing, Internet of Services, Internet of Things, Cyber-physical Systems
Underlying trends: Connectivity, Collaboration, Computer-generated data
Slide 2

Data are becoming increasingly complex!
Data: Size (volume), Freshness (velocity), Format/Media Type (variability), Uncertainty/Quality (veracity), etc.
Slide 3

Data and analyses are becoming increasingly complex!
Data: Size (volume), Freshness (velocity), Format/Media Type (variability), Uncertainty/Quality (veracity), etc.
Analysis: Selection/Grouping (map/reduce), Relational Operators (Join/Correlation), Extraction & Integration (map/reduce or dataflow systems), Data Mining (R, S+, Matlab), Predictive Models (R, S+, Matlab), etc.
Slide 4

Data-driven applications
will revolutionize decision making in business and the sciences!
have great economic potential!
Examples: home automation, health, lifecycle management, water management, market research, information marketplaces, traffic management, energy management
Slide 5

General consensus seems to be that people (would like to) have lots of data, which they would like to analyze. How do we go about that? Slide 6

Running in Circles? SQL, NoSQL, SQL--, NoMapReduce
Slide 7

What is Wrong with this Picture? scripting SQL-- XQuery? wrong platform? column store++ a query plan scalable parallel sort Slide 8

Big Data Technologies Worldwide
MapReduce and tools: Hadoop, Cloudera Impala (USA), InfoSphere BigInsights (IBM)
Data mining and information extraction: SystemT (IBM)
Statistical packages: R (University of Auckland), Matlab (Mathworks; USA), SAS, SPSS
DBMS: Aster (Teradata), Greenplum (EMC), DB2 (IBM), SQL Server (Microsoft), Oracle Exadata (Oracle)
Business analytics/intelligence: SAS Analytics (SAS), Vertica (HP), SPSS (IBM)
Scripting languages: Pig, Hive, JAQL
Graph databases: Neo4j (USA), AllegroGraph (Franz Inc.), Giraph (Apache)
Visualization: Tableau (Tableau Software; USA)
Search / indexing: Lucene (Apache), Solr (Apache)
Other: UIMA (IBM)
Slide 9

Big Data Technologies
System | Type | Stakeholder
ParStream | DBMS (in-memory) | ParStream
SAP Hana | DBMS (in-memory) | SAP
Hyper | DBMS (in-memory) | TU München
BigMemory | DBMS (in-memory) | Terracotta Inc./Software AG
Scalaris | Storage (parallel) | Konrad-Zuse-Zentrum für Informationstechnik
RapidMiner | Data mining toolkit | Rapid-I
ImageMaster | Enterprise Content Management | T-Systems
SalesIntelligence | Market research | Implisense
Predictive-Analytics-Suite | Business Analytics | Blue Yonder
Stratosphere | Scalable Data Analytics System | TU Berlin
Slide 10

3 Big Trends in Big Data
Solving complex data analysis problems: predictive analytics; infinite data streams; distributed data; dealing with uncertainty; bringing the human in the loop (visual analytics)
Producing results in near-real time: first results fast; low latency; exploiting new hardware (manycore, OpenGL/CUDA, FPGA, heterogeneous processors, NUMA, remote memory)
Hiding complexity: declarative data analysis programs; aggregation; relational algebra; iterative algorithms; distributed state; fault tolerance (pessimistic/optimistic)
Slide 11

Another V - Value: Big Data in the Cloud - the Information Economy
A major new trend in information processing will be the trading of original and enriched data, effectively creating an information economy.
"When hardware became commoditized, software was valuable. Now that software is being commoditized, data is valuable." (Tim O'Reilly)
"The important question isn't who owns the data. Ultimately, we all do. A better question is, who owns the means of analysis?" (A. Croll, Mashable, 2011)
Cloud-Computing Stack: the market situation (value increase)
Information as a Service: MIA, Datamarket
Data as a Service: MIA, AzureIMR, etc.
SaaS: Salesforce.com, Office 2010 WebApps
PaaS: Microsoft Azure, Google App Engine
IaaS: Amazon Elastic Compute Cloud
User groups: End users, Application developers, Corporations, Analysts, System administrators
Slide 12

Information Marketplaces: Enabling SMEs to capitalize on Big Data
Diagram labels: Data providers, Algorithms, Users, Data & Aggregation, Revenue Sharing, Technology Licensing, Queries, Analytical results, Marketplace, Social Media Monitoring (e.g.), Media Publisher Services (e.g.), SEO (e.g.), Index, Massively Parallel Infrastructure, Trust, Distributed Data Storage, German Web, Infrastructure as a Service
http://www.mia-marktplatz.de
http://www.dopa-project.eu
Slide 13

Big Data looks Tiny from the Stratosphere
Stratosphere is a European declarative, massively-parallel Big Data Analytics System, funded by DFG and EIT, available open-source under the Apache License.
Pipeline: Data analysis program, Program compiler, Dataflow, Stratosphere optimizer, Execution plan, Job graph, Execution graph
Stratosphere optimizer: automatic selection of parallelization, shipping and local strategies, operator order and placement
Runtime operators: hash- and sort-based out-of-core operator implementations, memory management
Parallel execution engine: task scheduling, network data transfers, resource allocation, checkpointing
http://www.stratosphere.eu
Slide 14

Germany is on the Map!
Scalable Machine Learning for Big Data: tutorial by Microsoft at the IEEE International Conference on Data Engineering 2013
Slide 15

5 Mistakes Data Scientists Make When Putting Big Data To Work
Selecting a platform that scales: performance (runtime, first results fast); people
Selecting the right data: information quality; timeliness
Choosing the right model: assumptions about the data; reproducibility; numerical stability
Trusting your results: confidence and statistical significance; lineage
Slide 16

Big Data is Interdisciplinary
Application Dimension: Digital Preservation, Decision Making, Verticals
Legal Dimension: Copyright, Privacy, Liability
Social Dimension: User Behaviour, Societal Impact, Collaboration
Technology Dimension: Scalable Data Processing, Signal Processing, Statistics/ML, Linguistics, HCI/Visualization
Economic Dimension: Business Models, Benchmarking, Impact of Open Source, Deployment, Pricing
Slide 17

Call to Action
Educate: Data Scientists to create the required talent; T-shaped students; information literacy; Data Analytics curriculum
Research: Big Data Analytics technologies; data management (uncertainty, query processing under near real-time constraints, information extraction); programming models; machine learning and statistical methods; systems architectures; information visualization
Innovate: to maintain competitiveness; demonstrate flagship use-cases to raise awareness; promote startups in the area of data analytics; transfer technologies to enterprises, in particular SMEs; determine legal frameworks and business models
We need to ensure a European technological leadership role in Big Data
Slide 18

commandments for Big Data Analytics
Slide 19

1: Thou shall use declarative languages
All-pairs shortest paths using recursive doubling in Stratosphere's Scala front-end
Slide 20
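The slide's Scala program is not part of this transcript. As a rough illustration of the recursive doubling idea it names, here is a minimal plain-Python sketch, not Stratosphere code. It assumes the graph is given as a dense weight matrix; the function names and the example graph are hypothetical. Each step squares the distance matrix under (min, +), so the number of edges covered by known shortest paths doubles per step.

```python
import math

INF = math.inf  # marks a missing edge


def min_plus_square(dist):
    """One doubling step: join every known path i->k with every known path k->j."""
    n = len(dist)
    return [
        [min(dist[i][k] + dist[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def all_pairs_shortest_paths(weights):
    """weights[i][j] is the edge weight, INF if absent, and 0 on the diagonal."""
    n = len(weights)
    dist = [row[:] for row in weights]
    covered = 1  # shortest paths using up to `covered` edges are already known
    while covered < n - 1:
        dist = min_plus_square(dist)
        covered *= 2
    return dist


# Hypothetical 4-vertex example graph.
graph = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
shortest = all_pairs_shortest_paths(graph)
```

The loop runs about log2(n) times, which is what makes the formulation attractive for a parallel dataflow system: each step is a single join-and-aggregate over the whole distance relation.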

2: Thou shall accept external (dynamic) sources
In situ data - no load
3: Thou shall use rich primitives
Beyond MapReduce: Map, Match, Reduce, Cross, CoGroup
Slide 21
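To make the five primitives concrete, the following is a minimal in-memory sketch of their usual semantics in plain Python. It is not the Stratosphere API: the function names, signatures, and sample records are assumptions chosen for illustration, and a real system would run each operator in parallel over partitioned data.

```python
from collections import defaultdict
from itertools import product


def map_op(records, fn):
    """Map: call fn on each record independently; fn returns a list of outputs."""
    return [out for rec in records for out in fn(rec)]


def reduce_op(records, key, fn):
    """Reduce: group records by key and call fn once per group."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    return [fn(k, group) for k, group in groups.items()]


def cross_op(left, right, fn):
    """Cross: call fn on every pair of the Cartesian product."""
    return [fn(l, r) for l, r in product(left, right)]


def match_op(left, right, left_key, right_key, fn):
    """Match: equi-join, calling fn on each pair of records with equal keys."""
    index = defaultdict(list)
    for r in right:
        index[right_key(r)].append(r)
    return [fn(l, r) for l in left for r in index.get(left_key(l), [])]


def cogroup_op(left, right, left_key, right_key, fn):
    """CoGroup: call fn once per key with the full groups from both inputs."""
    lgroups, rgroups = defaultdict(list), defaultdict(list)
    for l in left:
        lgroups[left_key(l)].append(l)
    for r in right:
        rgroups[right_key(r)].append(r)
    keys = list(dict.fromkeys(list(lgroups) + list(rgroups)))
    return [fn(k, lgroups.get(k, []), rgroups.get(k, [])) for k in keys]


# Hypothetical sample data: (customer, amount) and (customer, city).
orders = [("alice", 30), ("bob", 20), ("alice", 5)]
cities = [("alice", "Berlin"), ("carol", "Paris")]

doubled = map_op(orders, lambda o: [(o[0], o[1] * 2)])
totals = reduce_op(orders, lambda o: o[0], lambda k, g: (k, sum(a for _, a in g)))
joined = match_op(orders, cities, lambda o: o[0], lambda c: c[0],
                  lambda o, c: (o[0], o[1], c[1]))
counts = cogroup_op(orders, cities, lambda o: o[0], lambda c: c[0],
                    lambda k, os, cs: (k, len(os), len(cs)))
pairs = cross_op([1, 2], ["a", "b"], lambda l, r: (l, r))
```

Match and CoGroup differ in what the user function sees: Match gets one joined pair at a time and drops keys missing on either side, while CoGroup gets both complete groups, including empty ones.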

4: Thou shall deeply embed UDFs
5: Thou shall optimize
Concise and flexible
Slide 22

6: Thou shall iterate
Needed for most interesting analysis cases
Pregel as a Stratosphere query plan with comparable performance.
Slide 23
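As an illustration of why iteration matters, here is a minimal plain-Python sketch of a bulk iteration: the same step function is applied to the whole state until nothing changes. The example computes connected components by propagating the smallest vertex label, a typical Pregel-style job. The helper names and the sample graph are hypothetical, and this is not the Stratosphere iteration operator.

```python
def bulk_iterate(initial, step, max_iterations=100):
    """Apply step to the full state until a fixpoint or the iteration limit."""
    state = initial
    for _ in range(max_iterations):
        next_state = step(state)
        if next_state == state:
            break
        state = next_state
    return state


def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    neighbours = {v: {v} for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    def step(labels):
        # Each vertex adopts the smallest label among itself and its neighbours.
        return {v: min(labels[n] for n in neighbours[v]) for v in vertices}

    return bulk_iterate({v: v for v in vertices}, step)


# Hypothetical graph with two components: {1, 2, 3} and {4, 5}.
components = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

Every pass recomputes all labels, including those that have already converged, which is the cost a bulk formulation pays for its simplicity.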

7: Thou shall use a scalable and efficient execution engine
Pipeline and data parallelism, flexible checkpointing, optimized network data transfers
Slide 24

An Overview of Stratosphere
Slide 25

Pipeline: Meteor program, Meteor compiler, Pact program, Stratosphere optimizer, Execution plan, Job graph, Execution graph
Stratosphere optimizer: picks data shipping and local strategies, operator order
Runtime: hash- and sort-based out-of-core operator implementations, memory management
Nephele execution engine: task scheduling, network data transfers, resource allocation, checkpointing
Slide 26
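The slide does not say how the optimizer chooses a data shipping strategy. The following toy sketch in plain Python shows the general shape of such a cost-based choice for a join. The cost model (records sent over the network), the strategy names, and the function itself are assumptions made for illustration, not the Stratosphere optimizer's actual rules.

```python
def choose_join_shipping(left_size, right_size, parallelism):
    """Pick the cheapest shipping strategy under a toy network-cost model.

    Assumed costs, in records sent over the network:
      - repartitioning both inputs by key sends each record once
      - broadcasting one input sends it to every parallel instance
    """
    candidates = {
        "repartition both": left_size + right_size,
        "broadcast left": left_size * parallelism,
        "broadcast right": right_size * parallelism,
    }
    strategy = min(candidates, key=candidates.get)
    return strategy, candidates[strategy]


# A tiny right input is cheaper to broadcast than to repartition a huge left input.
small_side = choose_join_shipping(1_000_000, 100, parallelism=8)
# Two inputs of similar size are cheaper to repartition.
similar = choose_join_shipping(1_000, 1_200, parallelism=8)
```

A production optimizer would also weigh local strategies (hash versus sort-merge), existing partitioning, and disk costs; the sketch covers only the shipping decision.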

Stratosphere bulk iterations
Currently making the optimizer iteration-aware and exposing PACT-level abstraction
commandments
Slide 27

PACT4S: Embedding of PACTs in Scala
A Master's thesis developed the language and a Scala compiler plug-in; an ongoing Master's thesis will integrate it into the query optimizer. Still in (very) prototypical mode.
Slide 28

Ingredients of a Big Data Project
Technology (data preparation, scalable processing): database systems, scalable platforms
Mathematical analysis methods: machine learning/statistics/optimization
Application: real-world analysis problem
Further labels: visualization, NLP, signal processing
Slide 29