Big Data on Microsoft Platform




Prepared by GJ Srinivas, Corporate TEG - Microsoft

Contents
1. What is Big Data?
2. Characteristics of Big Data
3. Enter Hadoop
4. Microsoft Big Data Solutions
5. Microsoft Big Data Solutions for Hadoop
6. Hadoop Data Import/Export on Microsoft Platform
7. Conclusion

What is Big Data?

Data keeps growing, and the growth is unprecedented. There was a time when enterprise-level organizations dealt with a few gigabytes of data, which was itself considered huge and challenging to manage. Today the need has moved from managing terabytes to managing petabytes.

Big data is data that is too big to manage by conventional means. The term applies to data sets whose size is beyond the ability of commonly used software tools to capture, manage and process within a given elapsed time. In some cases Big data is new; in others it has been there all along. It is driven by the data explosion and large volumes, but it also has other facets.

Big data covers both structured and unstructured data. In many cases it is unstructured or semi-structured: for example weblogs, data coming from machines, plants or energy sensors, RFID, social data, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological and medical records, and photography or video archives. These are the types of data that people want to analyze in order to make decisions.

Gartner says of the world of data: "By 2012 organizations that build a modern information management system will outperform their peers financially by 20%"

Characteristics of Big Data

Big data has multiple facets, typically called the 4 V's of Big data:

Volume: Exceeds the physical limits of vertical scalability.
Velocity: The decision window is small compared to the data change rate.
Variety: Many different formats make integration expensive.
Variability: Not the same as Variety. Many options or variable interpretations confound analysis.

A typical Big data pattern starts with holding data in a "digital box"-like store, where all potentially valuable data is retained and analyzed later. Once the data is stored, it is mined for insight and the results are fed to downstream systems. This information production then acts as a factory for consumable feeds: the data is discovered, enriched and published. Finally, the loop is optimized by analyzing ambient data, monitoring the system and tuning its behavior.

Enter Hadoop

Hadoop is the most visible face of Big data, and much Big data processing is based on it. Hadoop is a technology developed in the open source Apache community that enables analysis of semi-structured and structured data, with the data distributed across commodity servers within a cluster.

Hadoop has two major aspects: storage and processing.

Storage (Hadoop Distributed File System layer)
HDFS is all about storage. It is the infrastructure that Hadoop implements to store data across multiple nodes. HDFS also helps make applications resilient to server, node or job failures.

Processing (MapReduce layer)
Every query in the Hadoop system is a job, and each job creates a number of tasks. These tasks process the data, shuffle the results, reduce them, aggregate them and then bring the results back to the client application. The MapReduce layer consists of a JobTracker and TaskTrackers, which work in tandem with HDFS. The beauty of Hadoop is that it distributes the data, runs the computation locally where the data resides, and then brings the results back.

Hadoop = HDFS + MapReduce

Microsoft Big Data Solutions

Microsoft provides a range of solutions to address Big data challenges. Data warehouse solutions from SQL Server 2008 R2/2012, Fast Track Data Warehouse, Business Data Warehouse and Parallel Data Warehouse (PDW) offer a robust and scalable platform for storing and analyzing data in a traditional data warehouse. Among these, PDW offers customers enterprise-class performance that handles massive volumes of over 600 TB.

Microsoft also provides LINQ to HPC, which enables the development and deployment of data-intensive applications using technologies such as .NET and Language Integrated Query (LINQ) to express data analytics algorithms. LINQ to HPC can integrate with SQL Server 2008 R2 and the portfolio of BI offerings from Microsoft such as SSRS, SSAS, PowerPivot and Excel.

Microsoft Big Data Solutions for Hadoop

Hadoop is not a Microsoft solution. To work with Apache Hadoop, Microsoft has come up with three Hadoop deployment models on the Windows platform:
1. On premise
2. On Cloud
3. Elastic MapReduce
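Returning briefly to the MapReduce layer described above: the flow of a job (process the data, shuffle the results, reduce and aggregate them, return them to the client) can be illustrated with a small single-process sketch. This is not Hadoop code and uses no Hadoop API. The function names, the word-count task and the sample input are all assumptions chosen for illustration, and the "splits" merely stand in for data blocks that HDFS would hold on different nodes.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)

def run_job(splits):
    """Run map, shuffle and reduce over a list of input splits."""
    mapped = []
    for split in splits:  # in a real cluster each split is processed on its own node
        for record in split:
            mapped.extend(map_phase(record))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical input: two splits standing in for blocks stored on two nodes.
splits = [
    ["big data on hadoop", "hadoop stores data"],
    ["hadoop processes data"],
]
result = run_job(splits)
print(result["hadoop"], result["data"])
```

In a real cluster the map work runs in parallel on the nodes that hold the data and only the grouped intermediate results travel over the network, which is the "calculate locally, then bring the results back" idea described above.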

Figure 1: Microsoft Big Data Solution

On premise distribution: This is a dedicated on-premise cluster. If you want to run Hadoop on standalone or distributed Windows servers located in your own infrastructure, Hadoop services for on-premise distribution provide that option. This distribution offers integration with Active Directory and System Center, which makes it an enterprise-manageable setup.

On cloud distribution: This is a dedicated cloud cluster. The Hadoop services are hosted in the cloud and the cluster does not go away. This distribution provides some elasticity, depending on how you set up storage and partitioning. In essence it is about running Hadoop on Azure. Hadoop-based services for Windows Azure are available by invitation. More on this can be found at www.hadooponazure.com.

Elastic MapReduce (EMR): Setting up Apache Hadoop in your own infrastructure is not simple. EMR makes this easy, and the Elastic MapReduce model is the most elastic of the three. It suits the case where you want to get a Hadoop cluster up and running very quickly, run the calculations, pull back the results and then no longer care about the cluster. The management and deployment process is simple: the cluster is managed for you on Azure, while you manage the actual processing.

Hadoop Data Import/Export on Microsoft Platform

Data can be imported into or exported from the Hadoop storage system in various ways from the Microsoft platform. For data transfer, the FTPS and ODBC server ports must be opened. FTPS port 2226 is required if you want to use Remote Desktop and work on the Hadoop system. The ODBC server port is for connecting from end-user tools such as Power View, PowerPivot and Excel. Once the ports are open, data can be brought into the cluster through various interfaces. For example, from the Manage Cluster interface, data can be brought in from Windows Azure Blob storage and Data Market. Other options are the JavaScript console and SQOOP connectors.

Interacting with Hadoop

The traditional way to interact with Hadoop was to write your own jobs in Java. If you are not interested in understanding how a MapReduce job is executed at that level, you can use JavaScript. The JavaScript console is an interactive development environment with charting and graphing. The HIVE and PIG interfaces also help in interacting with Hadoop, and SQOOP connectors help in transferring data between relational and Hadoop storage systems.

The HIVE interface is of interest to data professionals because it uses a SQL-like language called HiveQL. It lets you create structures over the data, forming a metadata layer. For example, weblogs can be structured in table format, which you can then work with from common tools such as Excel, Power View or PowerPivot. In other words, HIVE is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. More details on the Hive project, including examples, documentation and use cases, can be found at http://hive.apache.org

PIG is a data-flow language, closer to a scripting language, used to transform and analyze large data sets on Hadoop. It is an abstraction layer for interacting with Hadoop: you write scripts that get converted to MapReduce jobs. Those who are familiar with SQL tend to embrace HIVE first; those who are familiar with scripting languages tend to embrace PIG first.
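To give a feel for the data-flow style that PIG scripts follow, here is a minimal sketch in plain Python. It is not Pig Latin and does not run on Hadoop. The log format, the sample lines and the function names are assumptions made for illustration; the comments only point to the Pig Latin operators (LOAD, FILTER, GROUP, FOREACH, COUNT) that play a comparable role at each step.

```python
from collections import Counter

# Hypothetical weblog lines in the form "<client> <page> <status>".
LOG_LINES = [
    "10.0.0.1 /home 200",
    "10.0.0.2 /home 200",
    "10.0.0.1 /cart 500",
    "10.0.0.3 /cart 200",
    "10.0.0.2 /home 404",
]

def load(lines):
    """Step 1 (compare LOAD): parse raw lines into records with named fields."""
    for line in lines:
        client, page, status = line.split()
        yield {"client": client, "page": page, "status": int(status)}

def keep_successful(records):
    """Step 2 (compare FILTER): keep only requests that returned status 200."""
    return (r for r in records if r["status"] == 200)

def count_by_page(records):
    """Step 3 (compare GROUP followed by FOREACH with COUNT): hits per page."""
    return Counter(r["page"] for r in records)

# The script reads as a chain of transformations, each feeding the next.
hits = count_by_page(keep_successful(load(LOG_LINES)))
print(dict(hits))
```

Each step takes a data set and produces a new one, which is what makes this style easy to translate into a sequence of MapReduce jobs on a cluster.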
PIG can be extended through user-defined functions and methods. More can be learned from http://pig.apache.org/ You can also use other languages, such as Perl, to get the data and work on it.

SQOOP helps in pulling data from SQL into Hadoop or from Hadoop into SQL. Every SQOOP connector is bi-directional. SQOOP runs MapReduce jobs to transfer data to and from HDFS in parallel and with fault tolerance. These connectors can be downloaded from http://bit/ly/rulsix.

Conclusion

The three flavors of Microsoft Hadoop distribution fulfill a variety of analysis and decision-support needs on unstructured and semi-structured data. Integrating semi-structured and unstructured data with the RDBMS through SQOOP connectors extends the scope of analysis. With the various connectors for Microsoft SQL Server and Parallel Data Warehouse, along with end-user ODBC connectors, Microsoft has made Hadoop accessible to a broader class of end users, developers and IT professionals.

References
Figure 1 source: http://download.microsoft.com/download/1/8/b/18be3550-d04c-4b3f-9310-F8BC1B62D397/MicrosoftBigDataSolutionSheet.pdf