The Inside Scoop on Hadoop

Similar documents
Big Data Processing: Past, Present and Future

Bringing Big Data to People

Modernizing Your Data Warehouse for Hadoop

Tap into Hadoop and Other No SQL Sources

Big Data on Microsoft Platform

Modern Data Warehousing

Microsoft Analytics Platform System. Solution Brief

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

SELLING PROJECTS ON THE MICROSOFT BUSINESS ANALYTICS PLATFORM

Native Connectivity to Big Data Sources in MSTR 10

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Please give me your feedback

Big Data and Industrial Internet

Big Data Technologies Compared June 2014

TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP

BIG DATA TRENDS AND TECHNOLOGIES

Hadoop and Map-Reduce. Swati Gore

BIG DATA USING HADOOP

Building a BI Solution in the Cloud

Building & Optimizing Enterprise-class Hadoop with Open Architectures Prem Jain NetApp

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

Polybase for SQL Server 2016

Hadoop in the Hybrid Cloud

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

LEARNING SOLUTIONS website milner.com/learning phone

Apache Hadoop: Past, Present, and Future

Big Data Analytics OverOnline Transactional Data Set

Constructing a Data Lake: Hadoop and Oracle Database United!

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Microsoft technológie pre BigData. Ľubomír Goryl Solution Professional

Business Intelligence for Big Data

Building a data analytics platform with Hadoop, Python and R

Microsoft Big Data. Solution Brief

Market Overview: Big Data Integration

BIG DATA SOLUTION DATA SHEET

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Big data blue print for cloud architecture

Virtualizing Apache Hadoop. June, 2012

Hadoop & Spark Using Amazon EMR

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Protecting Big Data Data Protection Solutions for the Business Data Lake

CREATING PACKAGED IP FOR BUSINESS ANALYTICS PROJECTS

Big Data and Apache Hadoop Adoption:

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

P4.1 Reference Architectures for Enterprise Big Data Use Cases Romeo Kienzler, Data Scientist, Advisory Architect, IBM Germany, Austria, Switzerland

Course 10977A: Updating Your SQL Server Skills to Microsoft SQL Server 2014

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Real Time Big Data Processing

Harnessing the Power of the Microsoft Cloud for Deep Data Analytics

Map Reduce & Hadoop Recommended Text:

Testing Big data is one of the biggest

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

SQL Server 2012 Business Intelligence Boot Camp

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Architecting Open source solutions on Azure. Nicholas Dritsas Senior Director, Microsoft Singapore

HDP Hadoop From concept to deployment.

Big Data and Hadoop for the Executive A Reference Guide

Using Tableau Software with Hortonworks Data Platform

Azure Data Lake Analytics

Data processing goes big

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

File S1: Supplementary Information of CloudDOE

Microsoft Research Windows Azure for Research Training

Big Business, Big Data, Industrialized Workload

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Supported Platforms. HP Vertica Analytic Database. Software Version: 7.0.x

Microsoft Research Microsoft Azure for Research Training

STeP-IN SUMMIT June 2014 at Bangalore, Hyderabad, Pune - INDIA. Performance testing Hadoop based big data analytics solutions

The Role Polybase in the MDW. Brian Mitchell Microsoft Big Data Center of Expertise

INTRODUCTION TO CASSANDRA

Introducing the Reimagined Power BI Platform. Jen Underwood, Microsoft

#TalendSandbox for Big Data

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

Big Data Too Big To Ignore

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

How To Extend An Enterprise Bio Solution

Next-Generation Cloud Analytics with Amazon Redshift

Course 20467: Designing Self-Service Business Intelligence and Big Data Solutions

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

The little elephant driving Big Data

Ubuntu and Hadoop: the perfect match

Oracle Big Data Discovery (BDD) Hadoop Visualization

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

MS 20467: Designing Business Intelligence Solutions with Microsoft SQL Server 2012

SQLSaturday #399 Sacramento 25 July, Big Data Analytics with Excel

Big Data and Data Science: Behind the Buzz Words

Microsoft SQL Server 2012 with Hadoop

Proact whitepaper on Big Data

Transcription:

The Inside Scoop on Hadoop Orion Gebremedhin National Solutions Director BI & Big Data, Neudesic LLC. VTSP Microsoft Corp. Orion.Gebremedhin@Neudesic.COM B-orgebr@Microsoft.com @OrionGM

The Inside Scoop on Hadoop

Topics Covered Understanding Hadoop Big Data Solution Deployment Models Architecting the Modern Data Warehouse Summary

Understanding Hadoop

Big Data = Hadoop? * A Modern Data Architecture with Apache Hadoop, Hortonworks Inc. 2014

The Fundamentals of Hadoop Hadoop evolved directly from commodity scientific supercomputing clusters developed in the 1990s. Hadoop consists of a parallel execution framework called Map/Reduce and Hadoop Distributed File System (HDFS).

Latest Developments

HDFS Very high fault tolerance Can not be updated but corrections can be appended File blocks are replicated multiple types Three types nodes: Name Node (Directory) Backup Node ( checkpoint) Data Node-actual data

MapReduce A programing framework for library and runtime. just like.net Map Function - Take a task and break it down into small tasks Reduce Function - Combine the partial answers and find the combined list Master (Job Tracker) Is where you submit a query. Manages the Task Trackers which do the actual Map or Reduce task. Workers (Task Trackers) Do the work, just as each nodes in the cluster have a data node, they also have a task tracker

Basics of MapReduce = 400 Seconds 400 bills 1 bill/ sec

Basics of MapReduce = 200 Seconds 200 Bills 1 bill/ sec = 200 Seconds 200 Bills 1 bill/ sec Total =200 seconds

Basics if MapReduce 100 bills 100 bills 100 bills 100 bills = 100 Seconds = 100 Seconds = 100 Seconds = 100 Seconds Total =100 seconds

Basics of MapReduce Query Query Result Name Node/Job Tracker Data Nodes/Task Trackers

HDFS and MapReduce The Main Node: runs the Job tracker and The name node controls the files. Each node runs two processes: Task Tracker and Data Node

Hive and Pig MapReduce Java write many lines of code Pig Mostly used by yahoo highly used for data processing Shares some constructs with SQL e.g. filtering, selecting, grouping, and ordering. But syntax is very different from sql. Is more Verbose Needs a lot of training for users with limited procedural programming background. Gives you more control over the flow of data. Hive Mostly used by Facebook for analytic purposes Used for analytics Relatively easier for developers with SQL experience. Less control over optimization of data flows compared to Pig Not as efficient as MapReduce Higher productivity for data scientists and developers

HDFS * A Modern Data Architecture with Apache Hadoop, Hortonworks Inc. 2014

Big Data Solution Deployment Models

Major Players in Big Data Apache Foundation Hortonworks Cloudera MapR Pentaho Amazon (AWS)

Hortonworks June 2011 funded by $23 million from Yahoo! and Benchmark Capital as an independent company Horton the Elephant - Horton Hears a Who! Employs contributors to project Apache Hadoop October 2011 partnered with Microsoft : Azure and Windows Server. Cloudera founded in October 2008 started the effort to be Microsoft Azure Certified in October 2014.

HDP User Interface

Deployment Models On Premise Deployment Microsoft Analytics Platform System (APS) Oracle Big Data Appliance Hortonworks Data Platform (HDP) Cloudera's CDH Pivotal Data Computing Appliance (DCA) Big Data as a service HDInsight Cloudera on AWS Amazon RedShift Amazon Elastic MapReduce

HDInsight: Hadoop As A Cloud Service

HDFS * A Modern Data Architecture with Apache Hadoop, Hortonworks Inc. 2014

HDInsight Versions

Architecting the Modern Data Warehouse

The ETL Automation Model * A Modern Data Architecture with Apache Hadoop, Hortonworks Inc. 2014

The ETL Automation Model

BI-Integration Model

Hybrid BI-Integration Model Reporting & Analytics Self- Service BI (Excel Pivot Reports, etc) Reports Charts and Dashboards EDW Enterprise Data Models Enterprise Data Warehouse Staging Hadoop Enterprise Staging Area HDFS Data Integration tool Data Sources SQL Server Oracle, DB2, etc Flat Files sources Mainframes and proprietary data sources

Hybrid BI-Integration Model Compute: HDInsight Cluster:ssughadoop Container:ssughadoop Nodes: 1 Extract to BLOB Container: ssugstaging Container: ssugedw Power BI Data Source: Azure Blob Store Upload Integration (SSIS, Data Model) Power BI Data Source: EDW www.neubikes.com Source: weblogs EDW_neuBikes On- Prem Data Warehouse (SQL, SSAS- Tabular, SSAS OLAP) Databases: SQL Server, Oracle, etc. Flat Files Third Party applications

Hybrid BI-Integration Model

Summary

Summary Understand your data growth to determine when to Scale-Out. Determine the right tool for the workload you have. Choose the right deployment of Big Data Solutions Hybridize, do not start from scratch!

Questions? Questions and Discussion