A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

Similar documents
Big Data Processing: Past, Present and Future

The Inside Scoop on Hadoop

Bringing Big Data to People

Modernizing Your Data Warehouse for Hadoop

Please give me your feedback

Advanced Analytics & IoT Architectures

Big Data Big Data/Data Analytics & Software Development

Microsoft technológie pre BigData. Ľubomír Goryl Solution Professional

Microsoft Analytics Platform System. Solution Brief

Ali Ghodsi Head of PM and Engineering Databricks

Modern Data Warehousing

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Hadoop: Distributed Data Processing. Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010

How Companies are! Using Spark

Hadoop Ecosystem B Y R A H I M A.

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

HDP Hadoop From concept to deployment.

Unified Big Data Processing with Apache Spark. Matei

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Moving From Hadoop to Spark

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand?

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Large scale processing using Hadoop. Ján Vaňo

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture

Azure Data Lake Analytics

Cloud Big Data Architectures

HDP Enabling the Modern Data Architecture

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Big Data Technologies Compared June 2014

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

How To Create A Data Visualization With Apache Spark And Zeppelin

Big Data and Industrial Internet

WHITE PAPER. Building Big Data Analytical Applications at Scale Using Existing ETL Skillsets INTELLIGENT BUSINESS STRATEGIES

Big Data on Microsoft Platform

The Enterprise Data Hub and The Modern Information Architecture

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

Upcoming Announcements

SAP and Hortonworks Reference Architecture

How To Handle Big Data With A Data Scientist

How Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns

BIG DATA What it is and how to use?

The Future of Data Management

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

BIG DATA TRENDS AND TECHNOLOGIES

TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

BIG DATA & DATA SCIENCE

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

The Role Polybase in the MDW. Brian Mitchell Microsoft Big Data Center of Expertise

Big Data Introduction

Making big data simple with Databricks

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

Why Spark on Hadoop Matters

Big Analytics in the Cloud. Matt Winkler PM, Big

Microsoft SQL Server 2012 with Hadoop

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

A Brief Introduction to Apache Tez

SQLSaturday #399 Sacramento 25 July, Big Data Analytics with Excel

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Unified Big Data Analytics Pipeline. 连 城

Microsoft Big Data. Solution Brief

Hadoop implementation of MapReduce computational model. Ján Vaňo

Conquering Big Data with BDAS (Berkeley Data Analytics)

Dell In-Memory Appliance for Cloudera Enterprise

Tap into Hadoop and Other No SQL Sources

Big Data. Lyle Ungar, University of Pennsylvania

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

Big Data Trends and HDFS Evolution

Massive Scale Analytics for a Smarter Planet

Databricks. A Primer

Architectures for massive data management

Unlocking the True Value of Hadoop with Open Data Science

From Spark to Ignition:

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

Register on projectbotticelli.com. Introduction to BI & Big Data DAX MDX Data Mining

Hadoop-BAM and SeqPig

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Workshop on Hadoop with Big Data

Cost-Effective Business Intelligence with Red Hat and Open Source

How to make BIG DATA work for you. Faster results with Microsoft SQL Server PDW

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Application Development. A Paradigm Shift

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Native Connectivity to Big Data Sources in MSTR 10

Databricks. A Primer

Einsatzfelder von IBM PureData Systems und Ihre Vorteile.

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Big Data Analytics OverOnline Transactional Data Set

Transcription:

Presented By: Orion Gebremedhin Director of Technology, Data & Analytics, Neudesic LLC. Data Platform VTSP, Microsoft Corp. @OrionGM BIG DATA PROCESSING A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

TOPICS COVERED 1 2 Fundamentals of Big Data Platforms Major Big Data Tools

Scaling Up vs. Out SCALE UP (SMP) SCALE OUT (MPP) + (n) Upgrade components or buy bigger server each time Multiprocessor system where processors share resources : Operating System (OS) Memory I/O devices connected using a common bus Add nodes to the cluster Multiple processing nodes OS RAM Network

Innovation Timeline Doug Cutting & Mike Cafarella started working on Nutch Doug Cutting adds DFS & MapReduce support to Nutch NY Times converts 4TB of image archives over 100 EC2s Fastest sort of a TB, 3.5 mins over 910 nodes Fastest sort of a TB, 62 secs over 1,460 nodes Sorted a PB in 16.25 hours over 3,658 nodes 2002 2003 2004 2005 2006 2007 2008 2009 Google publishes GFS & MapReduce papers Yahoo! Hires Cutting, Hadoop spins out of Nutch Founded Doug Cutting joins Cloudera Facebooks launches Hive: SQL Support for Hadoop Hadoop Summit 2009, 750 attendees

THE FUNDAMENTALS OF HADOOP Hadoop evolved directly from commodity scientific supercomputing clusters developed in the 1990s Hadoop consists of: MapReduce Hadoop Distributed File System (HDFS)

WHAT S NEW

BASICS OF MPP 400 bills 1 bill/ sec = 400 Seconds

BASICS OF MPP 200 bills 1 bill/ sec = 200 Seconds 200 bills 1 bill/ sec = 200 Seconds Total = 200 Seconds

BASICS OF MPP 100 Bills 1 bill/ sec = 100 Seconds 100 Bills 1 bill/ sec = 100 Seconds 100 Bills 1 bill/ sec = 100 Seconds 100 Bills 1 bill/ sec = 100 Seconds Total = 100 Seconds

HDFS & MAPREDUCE The Main Node: runs the Job tracker and the name node controls the files. Each node runs two processes: Task Tracker and Data Node Map Reduce Job Tracker Task Tracker Task Tracker HDFS Cluster Name Node Data Node Data Node 1 N

BASICS OF MAPREDUCE The Main Node: runs the Job tracker and the name node controls the files. Each node runs two processes: Task Tracker and Data Node Query Result Data Nodes/Task Trackers Query Name Node/ Job Tracker

EXECUTION UNITS MAPREDUCE The overall MapReduce word count process Input Splitting Mapping Shuffling Reducing Final Result

SOME DISTRIBUTIONS OF APACHE HADOOP Apache Foundation

Sandbox Hortonworks

MAPREDUCE PIG & HIVE MAPREDUCE PIG HIVE Java Write many lines of code Mostly used by Yahoo Most used for data processing Shares some constructs w/ SQL Is more Verbose Needs a lot of training for users with limited procedural programming background Offers control over the flow of data Mostly used by Facebook for analytic purposes Used for analytics Relatively easier for developers w/ SQL experience Less control over optimization of data flows compared to Pig Not as efficient as MapReduce Higher productivity for data scientists and developers

THE EXPLOSION OF HADOOP

THE HISTORY OF SPARK MapReduce Top Level 2006 2010 2013 2004 2009 2011 2014 Spark Paper BSD Open Source Apache 17

SPARK SHARED LIBRARIES

SPARK THE UNIFIED PLATFORM FOR BIG DATA APIs for : Scala Java Python R Spark SQL Spark Streaming MLlib (machine learning) GraphX (graph) Spark Core

SPARK BENEFITS Performance Developer Productivity Unified Engine Ecosystem Using in-memory computing, Spark is considerably faster than Hadoop (100x in some tests). Can be used for batch and real-time data processing. Easy-to-use APIs for processing large datasets. Includes 100+ operators for transforming. Integrated framework includes higher-level libraries for interactive SQL queries, processing streaming data, machine learning and graph processing. A single application can combine all types of processing. Spark has built-in support for many data sources such as HDFS, RDBMS, S3, Apache Hive, Cassandra and MongoDB. Runs on top of the Apache YARN resource manager.

ANALYTICS CORTANA

SQL Server BIG DATA OPTIMIZATIONS

SQL Server APS High-Speed Interconnect Node 1 Node 2 Node 3 A F G R S Z

SQL Server APS GROWTH TOPOLOGY Base Unit Scale Extension Unit Base Unit

SQL Server Azure SQL DW Application or User Connection Data Loading (Poly Base, ADF, SSIS, REST, OLE, ODBC, ADF, AZCopy, PS) DMS SQL DB Compute Node Massively Parallel Processing (MPP) Engine SQL DB DMS SQL DB DMS SQL DB DMS SQL DB DMS Compute Node Compute Node Compute Node Compute Node Blob Storage [WASB(S)] Azure Infrastructure and Storage

SQL Server DEPLOY OPTIONS & HYBRID SOLUTIONS

SQL Server CONNECTING ISLANDS OF DATA WITH POLYBASE Select Result set Microsoft Azure HDInsight Hortonworks for Windows and Linux Cloudera SQL Server Parallel Data Warehouse PolyBase Microsoft HDInsight Provides a single T-SQL query model for PDW and Hadoop with rich features of T-SQL, including joins without ETL Uses the power of MPP to enhance query execution performance Supports Windows Azure HDInsight to enable new hybrid cloud scenarios Provides the ability to query non-microsoft Hadoop distributions, such as Hortonworks and Cloudera

USE CASE: SUPPLY CHAIN MANAGEMENT Use Case 3: Supply Chain Management

USE CASE: SMART GRID MANAGEMENT

USE CASES SMART GRID PREDICTIVE MAINTENANCE DEMAND FORECASTING GRID OPTIMIZATION THEFT PERVENTION DEMAND RESPONSE CUSTOMER PROFILING

Neudesic partnered with one of the nation s largest utility companies that recently deployed Smar Utility Meters for power customers, nearly a million meters sending usage data every 15 minutes. The result: an Azure hybrid big data processing solution that enabled the customer to perform gap analytics: a process for identifying gaps that exist in the power usage readings, over 7x faster than their previous solution! Billions of Smart Meter reads get processed to identify the nature and duration of the gaps to mitigate revenue losses.

USE CASES SMART GRID

USE CASE: REAL TIME TRAFFIC ANALYSIS

REAL TIME TRAFFIC ANALYSIS

USE CASES STREAM ANALYTICS Real-time fraud detection Connected cars Click-stream analysis Real-time financial portfolio alerts Smart grid, energy management CRM alerting sales to customer case Data and identity protection services Real-time sales tracking

ML PROBLEMS SOLVED BY AZURE ML Classification Regression Recommenders Anomaly Detection Clustering

INDUSTRY USE CASES MACHINE LEARNING

THE MACHINE LEARNING WORKFLOW Input data Data transformation Define model Split data Train model Score (prediction) Evaluate Model

AZURE DATA FACTORY

HD INSIGHT

BIG DATA & Advanced Analytics Roadshow Questions? Orion Gebremedhin Orion.Gebremedhin@Neudesic.com Twitter: @oriongm Marc Lobree Marc.Lobree@Neudesic.com