Big Data Analytics. Chances and Challenges. Volker Markl

Similar documents

How To Get The Most Out Of Big Data

Big Data looks Tiny from the Stratosphere

Towards a Thriving Data Economy: Open Data, Big Data, and Data Ecosystems

Big Data Research in Berlin BBDC and Apache Flink

Apache Flink Next-gen data analysis. Kostas

Architectures for Big Data Analytics A database perspective

Harnessing the power of advanced analytics with IBM Netezza

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Large-Scale Data Processing

Big Data on Microsoft Platform

MATLAB in Business Critical Applications Arvind Hosagrahara Principal Technical Consultant

Cloud Courses Description

Assignment # 1 (Cloud Computing Security)

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum

The 4 Pillars of Technosoft s Big Data Practice

CLOUD COMPUTING. A Primer

Hadoop in the Hybrid Cloud

Introduction to Cloud Computing

Challenges for Data Driven Systems

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

The 3 questions to ask yourself about BIG DATA

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Big Data Challenges and Success Factors. Deloitte Analytics Your data, inside out

BIG DATA TRENDS AND TECHNOLOGIES

Cloud Courses Description

BIG DATA-AS-A-SERVICE

Big Data. Lyle Ungar, University of Pennsylvania

Big Data and Industrial Internet

Archiving and Sharing Big Data Digital Repositories, Libraries, Cloud Storage

Manifest for Big Data Pig, Hive & Jaql

CUMULUX WHICH CLOUD PLATFORM IS RIGHT FOR YOU? COMPARING CLOUD PLATFORMS. Review Business and Technology Series

Information Processing, Big Data, and the Cloud

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

Big Data and Analytics: Challenges and Opportunities

Customized Report- Big Data

Bringing Big Data Modelling into the Hands of Domain Experts

Microsoft Research Windows Azure for Research Training

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

IaaS Federation. Contrail project. IaaS Federation! Objectives and Challenges! & SLA management in Federations 5/23/11

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Microsoft Research Microsoft Azure for Research Training

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Big Data With Hadoop

Customer Case Study. Sharethrough

Real Time Big Data Processing

COMP9321 Web Application Engineering

Machine Learning for Data-Driven Discovery

Datacenters and Cloud Computing. Jia Rao Assistant Professor in CS

Why Big Data in the Cloud?

Enterprise Application Enablement for the Internet of Things

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

Apache Hama Design Document v0.6

Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges

GigaSpaces Real-Time Analytics for Big Data

Cloud Computing Training

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

WHITE PAPER. Harnessing the Power of Advanced Analytics How an appliance approach simplifies the use of advanced analytics

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Talend Real-Time Big Data Sandbox. Big Data Insights Cookbook

Hybrid Cloud Architectures for Operational Performance Management

How To Understand Cloud Computing

Infrastructures for big data

Big Data Are You Ready? Jorge Plascencia Solution Architect Manager

Best Practices for Hadoop Data Analysis with Tableau

CPET 581 Cloud Computing: Technologies and Enterprise IT Strategies

Big Data and Data Science: Behind the Buzz Words

Cloud Computing: What IT Professionals Need to Know

Internet of Things. Opportunity Challenges Solutions

ICT Perspectives on Big Data: Well Sorted Materials

Reference Architecture, Requirements, Gaps, Roles

Real-Time Analytics on Large Datasets: Predictive Models for Online Targeted Advertising

Exploiting the power of Big Data

Cloud Computing Trends

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Integrating a Big Data Platform into Government:

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Building Out Your Cloud-Ready Solutions. Clark D. Richey, Jr., Principal Technologist, DoD

Fast Data in the Era of Big Data: Tiwtter s Real-Time Related Query Suggestion Architecture

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Azure Data Lake Analytics

TRAINING PROGRAM ON BIGDATA/HADOOP

SQL Server 2012 Business Intelligence Boot Camp

Roadmap Talend : découvrez les futures fonctionnalités de Talend

The Cloud at Crawford. Evaluating the pros and cons of cloud computing and its use in claims management

Cloud Computing Technology

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Transcription:

Volker Markl Professor and Chair Database Systems and Information Management (DIMA), Technische Universität Berlin www.dima.tu-berlin.de Big Data Analytics Chances and Challenges Volker Markl DIMA BDOD Big Data Chances and Challenges

More and more data is available to science and business! sensor data Drivers: Cloud Computing Internet of Services Internet of Things Cyberphysical Systems web archives video streams audio streams simulation data Underlying Trends: Connectivity Collaboration Computer generated data RFID data Slide 2

Data are becoming increasingly complex! Size Freshness Format/Media Type Uncertainty/Quality etc. (volume) (velocity) (variability) (veracity) Data Slide 3

Data and analyses are becoming increasingly complex! Size Freshness Format/Media Type Uncertainty/Quality etc. Data (volume) (velocity) (variability) (veracity) Selection/Grouping Relational Operators Iextraction & Integration, Data Mining Predictive Models etc. Analysis (map/reduce) (Join/Correlation) (map/reduce or dataflow systems) (R, S+, Matlab) (R, S+, Matlab) Slide 4

Data-driven applications home automation health lifecycle management water management market research information marketplaces traffic management will revolutionize decision making in business and the sciences! have great economic potential! Slide 5 energy management

General consensus seems to be that people (would like to) have lots of data, which they would like to analyze. How do we go about that? Slide 6

Running in Circles? SQL NoSQL SQL-- NoMapReduce Slide 7

What is Wrong with this Picture? scripting SQL-- XQuery? wrong platform? column store++ a query plan scalable parallel sort Slide 8

3 Big Trends in Big Data Solving complex data analysis problems Predictive analytics Infinite data streams Distributed data Dealing with uncertainty Bringing the human in the loop: Visual analytics Producing results in near-real time First results fast Low latency Exploiting new Hardware (manycore, OpenGL/CUDA, FPGA, heterogeneous processors, NUMA, remote memory) Hiding complexity Declarative data anaylsis programs aggregation relational algebra iterative algorithms distributed state Fault tolerance (pessimistic/optimistic) Slide 11

Value increase Another V - Value: Big Data in the Cloud - the Information Economy A major new trend in information processing will be the trading of original and enriched data, effectively creating an information economy. When hardware became commoditized, software was valuable. Now that software is being commoditized, data is valuable. (TIM O REILLY) The important question isn t who owns the data. Ultimately, we all do. A better question is, who owns the means of analysis? (A. CROLL, MASHABLE, 2011) Information as a Service End users MIA, Datamarket, Data as a Service Corporations MIA, Azure IMR, etc SaaS Analysts Salesforce.com, Office 2010 WebApps PaaS Application Developers Microsoft Azure, Google App Engine IaaS System administrators Amazon Elastic Compute Cloud Cloud-Computing Stack The Market situation Slide 12

Trust Information Marketplaces: Enabling SMEs to capitalize on Big Data Dataproviders Algorithms Users Data & Aggregation Revenue Sharing Technology Licensing Queries Analytical results Marketplace e.g., Social Media Monitoring e.g., Media Publisher Services e.g., SEO Index Massively Parallel Infrastructure Distributed Data Storage German Web Infrastructure as a Service Slide 13 http://www.mia-marktplatz.de http://www.dopa-project.eu

Big Data looks Tiny from the Stratosphere Stratosphere is a European declarative, massively-parallel Big Data Analytics System, funded by DFG and EIT, available open-source with Apache License Data analysis program Dataflow Program compiler Execution plan Stratosphere optimizer Automatic selection of parallelization, shipping and local strategies, operator order and placement Job graph Execution graph Runtime operators Hash- and sort-based out-of-core operator implementations, memory management Slide 14 Parallel execution engine Task scheduling, network data transfers, resource allocation, checkpointing http://www. stratosphere.eu

5 Mistakes Data Scientists Make When Putting Big Data To Work Selecting a platform that scales Performance (runtime, first-results fast) People Selecting the right data Information Quality Timeliness Choosing the right model Assumptions about the data Reproducability Numerical stability Trusting your results Confidence and statistical significance Lineage Slide 16

Big Data is Interdisciplinary Digital Preservation Decision Making Verticals Application Dimension Legal Dimension Copyright Privacy Social Dimension User Behaviour Societal Impact Collaboration Scalable Data Processing Signal Processing Statistics/ML Linguistics HCI/Visualization Technology Dimension Economic Dimension Business Models Benchmarking Impact of Open Source Deployment Pricing Slide 17

Call to Action Educate Data Scientists to create the required talent T-shaped students Information literacy Data Analytics Curriculum Research Big Data Analytics Technologies Data management (uncertainty, query processing under near real-time constraints, information extraction) Programming models Machine learning and statistical methods Systems architectures Information Visualization Innovate to maintain competetiveness Demonstrate flagship use-cases to raise awareness Promote startups in the area of data analytics Transfer technologies to European enterprises, in particular SMEs Determine legal frameworks and business models We need to ensure a European technological leadership role in Big Data Seite 18

7 commandments Big for Data Analytics Seite 19

1: Thou shall use declarative languages All-pairs shortest paths using recursive doubling in Stratosphere s Scala front-end Seite 20

2: Thou shall accept external (dynamic) sources In situ data - no load 3: Thou shall use rich primitives Beyond MapReduce Map Match Reduce Cross CoGroup Seite 21

4: Thou shall deeply 5: Thou shall optimize embed UDFs Concise and flexible Seite 22

6: Thou shall iterate Needed for most interesting analysis cases Pregel as a Stratosphere query plan with comparable performance. Seite 23

7: Thou shall use a scalable and efficient execution engine Pipeline and data parallelism, flexible checkpointing, optimized network data transfers Seite 24

An Overview of Stratosphere Seite 25

Meteor program Pact program Meteor compiler Execution plan Stratosphere optimizer Picks data shipping and local strategies, operator order Job graph Execution graph Runtime Hash- and sort-based out-of-core operator implementations, memory management Nephele execution engine Task scheduling, network data transfers, resource allocation, checkpointing Seite 26

Stratosphere bulk iterations Currently making the optimizer iterationaware and exposing PACT-level abstraction commandments Seite 27

PACT4S: Embedding of PACTs in Scala Master thesis developed language and Scala compiler plug-in, ongoing Master thesis will integrate in query optimizer, still in (very) prototypical mode Seite 28