How to Leverage Data Lakes (Hive, Hortonworks, Cloudera, Impala, Spark SQL)

Similar documents
Native Connectivity to Big Data Sources in MSTR 10

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Tap into Hadoop and Other No SQL Sources

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Hadoop & Spark Using Amazon EMR

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?

Actian SQL in Hadoop Buyer s Guide

The Future of Data Management

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

SAP and Hortonworks Reference Architecture

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Real Time Big Data Processing

Self-service BI for big data applications using Apache Drill

Self-service BI for big data applications using Apache Drill

Luncheon Webinar Series May 13, 2013

Using Tableau Software with Hortonworks Data Platform

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014

Hadoop Ecosystem B Y R A H I M A.

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Information Builders Mission & Value Proposition

Introducing Oracle Exalytics In-Memory Machine

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

HDP Hadoop From concept to deployment.

Big Data With Hadoop

Big Data Analytics Nokia

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

So What s the Big Deal?

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Comprehensive Analytics on the Hortonworks Data Platform

Ganzheitliches Datenmanagement

HDP Enabling the Modern Data Architecture

Moving From Hadoop to Spark

Are You Big Data Ready?

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

Please give me your feedback

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Harnessing the Power of the Microsoft Cloud for Deep Data Analytics

Data Integration Checklist

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

Next-Generation Cloud Analytics with Amazon Redshift

The Inside Scoop on Hadoop

How To Handle Big Data With A Data Scientist

MapR: Best Solution for Customer Success

Big Data at Cloud Scale

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

The Future of Data Management with Hadoop and the Enterprise Data Hub

The Internet of Things and Big Data: Intro

SQLSaturday #399 Sacramento 25 July, Big Data Analytics with Excel

Cloud Big Data Architectures

Upcoming Announcements

Sisense. Product Highlights.

TECHNOLOGY TRANSFER PRESENTS MIKE FERGUSON BIG DATA MULTI-PLATFORM JUNE 25-27, 2014 RESIDENZA DI RIPETTA - VIA DI RIPETTA, 231 ROME (ITALY)

Bringing the Power of SAS to Hadoop. White Paper

Three Reasons Why Visual Data Discovery Falls Short

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Using Big Data for Smarter Decision Making. Colin White, BI Research July 2011 Sponsored by IBM

Constructing a Data Lake: Hadoop and Oracle Database United!

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Implement Hadoop jobs to extract business value from large and varied data sets

Bringing Big Data to People

WHITE PAPER. Building Big Data Analytical Applications at Scale Using Existing ETL Skillsets INTELLIGENT BUSINESS STRATEGIES

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Big Data and Hadoop for the Executive A Reference Guide

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Interactive data analytics drive insights

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Workshop on Hadoop with Big Data

Cloudera Enterprise Data Hub in Telecom:

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

This Symposium brought to you by

How To Scale Out Of A Nosql Database

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Reference Architecture, Requirements, Gaps, Roles

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Safe Harbor Statement

Dell In-Memory Appliance for Cloudera Enterprise

Navigating the Big Data infrastructure layer Helena Schwenk

Qsoft Inc

BIG DATA TRENDS AND TECHNOLOGIES

The Digital Enterprise Demands a Modern Integration Approach. Nada daveiga, Sr. Dir. of Technical Sales Tony LaVasseur, Territory Leader

A Survey on Big Data Analytical Tools

WELCOME TO The Future of Analytics in Action: The Art of the Possible

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering

Oracle Big Data Fundamentals Ed 1 NEW

Big Data on Microsoft Platform

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

Transcription:

How to Leverage Data Lakes (Hive, Hortonworks, Cloudera, Impala, Spark SQL) Presented by: Trishla Maru #mstrworld

Agenda Data Lakes and Hadoop Ecosystem Various Ways to Access Big Data Demo Top 5 Reasons to Why Use MicroStrategy Customer Success Stories Summary and Q&A 2 #mstrworld

Typical Big Data Lake MicroStrategy Analytics Platform Hive Pig Impala Spark Drill DATA LAKE Hadoop HDFS OLTP,ERP, CRM systems Document s Emails Web logs, Click Streams Social Networks Machine Generated Sensor Data Geolocation Data SOURCES OF DATA

Data Warehouse vs Data Lake A Data Lake is NOT a Data Warehouse Data Warehouse Data Lake Structured and Processed Data Structured, Semi Structured/Unstructured/Raw Data Schema on write processing Schema on read processing Expensive for large data storage volumes Designed for low cost storage Less agile, fixed configuration Highly agile, configure and reconfigure as needed More for business professionals Data scientists et. al.

MicroStrategy/Hadoop Ecosystem MicroStrategy Analytics Platform Cloudera, MapR, Hortonworks, Amazon EMR MAP REDUCE Hive, Pig Spark SQL MSTR HADOOP IMPALA DRILL SPARK GATE- WAY Phoenix/JDBC HBase, Cassandra YARN Hadoop Distributed File System (HDFS)

Bring All Relevant Data to Decision Makers Support for Big Data Sources in the Data Lake Optimized Access to Your Entire Big Data Ecosystem as If It Were a Single Database MapReduce & NOSQL Databases Elastic Map Reduce BigInsights Distribution HDFS Columnar Databases Redshift Data Warehouse Appliances HANA Parallel Data Warehouse Relational Databases Multidimensional Databases Analysis Services SaaS-Based App Data Google Analytics Zendesk Generic Web Services SOAP REST Generic Web Services with OAuth..many more.. User / Departmental Data Clipboard MicroStrategy Dataset

Various Ways to Connect to Hadoop

Generalized Big Data Analytics Workflow Big Data Analytics Workflow Sources: Unstructured, Semi-structured, Structured Outputs: HDFS, RDBMS, Hive, HBase, Cassandra, MongoDB, SOLR, Elastic, Other NoSQL Models Models In general, the big data environments are distributed transformation and calculation frameworks that take in sources and generate outputs. Sources tend to be unstructured or semi-structured whereas outputs tend to be semistructured and structured. Hadoop-based workflows may incorporate several components tied together with a dataflow or workflow manager such as Pig, Oozie, others or custom programs. Spark is able to use Spark APIs and components in the workflow for a more uniform experience

Ways to Connect and Query Hadoop #1 Access and Query Hadoop via Hive Hadoop MicroStrategy Analytics Platform Hive ODBC Connector Any Hadoop Distribution Hive HDFS This is the most popular way of querying Hadoop, via Hive. Hive allows users who aren t familiar with programming to access and analyze big data in a less technical way, using a SQL-like syntax called Hive Query Language (HiveQL). Hive is used for complex, long-running tasks and analyses on large sets of data. Hive uses the map reduce framework to query data from HDFS. It is fault tolerant, but has high latency.

Ways to Connect and Query Hadoop #2 Access and Query Hadoop via Impala (only with Cloudera distribution) Hadoop MicroStrategy Analytics Platform Hive ODBC Connector Cloudera Distribution Impala HDFS Impala also uses SQL syntax to query Hadoop. But does NOT use the map reduce framework. It has its own processing engine to query HDFS. Impala is used for analysis when the user wants to run and return quickly on a small subset of data, e.g. analyzing company finances for a daily or weekly report. Not ideal for complex data manipulation, data preparation etc. Hive is a screwdriver and Impala is a drill bit.

Hive and Impala Hive Impala Hive is SQL on Hadoop Impala is SQL on HDFS Hive uses MapReduce to process its queries Hive cannot perform in-memory query operations Hive is fault tolerant. Recommended for ETL type of jobs Impala uses its own processing engine Impala can perform in-memory query operations Impala is not fault tolerant Hive has high latency Impala has faster response times, great for small ad-hoc queries Use Case: If you have batch processing kind of needs, use Hive Use Case: If you need real time, adhoc queries over a subset of data, use Impala

Ways to Connect and Query Hadoop #3 Access and Query Hadoop via Apache Spark Hadoop MicroStrategy Analytics Platform Spark ODBC Connector Apache Spark HDFS Apache Spark is one of the fastest evolving open source communities, popular for it s MPP capabilities on top of Hadoop and providing advanced analytical workflows. MicroStrategy connects to Spark via Spark SQL/ODBC connector. Hive on Spark Replaces Hive SQL Engine with Spark SQL Engine starting Hive 1.2 Set hive.execution.engine=spark

Key Points for Spark Spark doesn t store information it uses common Hadoop and NoSQL storage for input and output Internally, Spark uses in-memory stores to pass data from one processing step to another Spark SQL in Spark applications works on the in-memory stores to simplify Spark programming against the Resilient Distribute Datasets (RDD) structures. This is not accessible via ODBC In-memory RDD s are not available after a Spark application completes. Spark SQL ODBC can only access RDD s that have persisted in Hive

Ways to Connect and Query Hadoop #4 Tap into Hadoop Natively with MicroStrategy s Hadoop Gateway MicroStrategy Analytics Platform Big Data Engine/Hadoop Gateway NEW Hadoop HDFS We launched this connectivity with v10. Big Data Engine (BDE), is a native YARN based application that enables direct access to HDFS. None of the core BI competitors have this capability. YARN (Yet Another Resource Negotiator) is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters. Use Case: Fulfills faster data loading of data from HDFS and leverage our in-memory layer for analytics.

Demo

Top Five Reasons to Why use MicroStrategy for your next Big Data Project?

#1 Leverage varied flavors to access Hadoop that fulfills the business needs Hadoop Gateway SQL on Hadoop/Big Data Apache Hive Apache Spark Apache Pig MicroStrategy is the only core BI tool to provide a native connectivity to HDFS that enables fast data loads into its in-memory layer. MicroStrategy certifies Cloudera Impala, Presto, Google Big Query and Pivotal HAWQ, Apache Drill as a data source. MicroStrategy optimizes and certifies Hadoop/Hive as a data source. MicroStrategy certifies Spark via Spark SQL/ODBC. MicroStrategy also provides a connector to execute freeform pig-latin reports

#2 Analyze large volumes of data with a platform that is scalable and performant Sample Applications Access the Database With High Throughout CRM analysis across a large customer base Interactive analysis on high-volume clickstream data Merchant analytics for a credit card issuer Create and Publish the Cube With High Data Scalability Analyze Data With Fast Response Times Store manager application for a large national chain

#3: Cleanse your Hadoop data on the fly with MicroStrategy s Data Wrangler Cleanse and refine your data with Data Wrangler Cleanse your data Extract, Split Patterns Generate Facets Get Suggestions from our Data Visualization Wrangler, as Rename, Delete, and many more.. you prepare your data! & Analysis Designed for analysts and business users without spending any extra $$ Predictive Interactions Work on a sample and Apply to the entire dataset Undo/Redo without Data penalty Preparation

#4: Perform on the fly dynamic searches on semi-structured data Business Case: Analyzing Product Catalog Perform faceted search by dynamically clustering the items. Users can then drill down by applying specific constraints to the search results Manufacturers can get superior feedback Analyzing Diagnostic Notes Needs search engine s capabilities to facilitate semi-structured data analysis in health care domain. Helping health care professionals make informed decisions in situations that are time sensitive. Configure the file in Apache Solr Import the file via data import Dynamically Search for keywords via Visual Insight

#5: Get federated data on dashboards leveraging MicroStrategy s integrated architecture Oracle Hadoop Excel Personal Data Sources Relational Databases Multi-Dimensional Sources Analysis Services Map Reduce Databases

Customer Success Stories

Key BI Characteristics: INDUSTRY: Wireless Telecom BI COMPONENTS: Multiple Applications DATABASE: Hive and Red Shift HADOOP DISTRIBUTION: Impala VOLUME OF DATA 1 TB Daily APPLICATIONS: 30 Applications Self Service Big Data Analytics with over 30 Applications Business Use and Benefits The largest telecom company in Korea, SK Planet provides mobile services such as navigation, mobile payment, online shopping, memberships, etc. They have 30+ services, and they have all of their data in Hadoop (basically no relational databases for analysis). MicroStrategy provides with an centralized environment for analyzing their Hadoop data in a managed self-service manner. Self service data discovery while fully leveraging MicroStrategy s inmemory technology.

Key BI Characteristics: INDUSTRY: Entertainment BI COMPONENTS: 1 Application; Traditional Reports USERS ~200 DATABASE: Hadoop, Teradata HADOOP DISTRIBUTION: Amazon EMR VOLUME OF DATA Petabytes TYPE OF DATA Log and Events data APPLICATIONS: Sales Analysis Business Use and Benefits Making Better Sales Decisions by Getting Insights from Web Logs in Hadoop Sales Analysis generally with a new launch in new region, quick report analysis to understand the new accounts, number of hours of viewing etc. Directly querying and reporting from MicroStrategy on logs via Hive Able to make better Sales decisions

Key BI Characteristics: INDUSTRY: E-commerce BI COMPONENTS: 1 Application; Reports, Dashboards, VI USERS ~200 DATABASE: Hadoop, Oracle HADOOP DISTRIBUTION: Apache VOLUME OF DATA Petabytes TYPE OF DATA Web Logs, Online behavior APPLICATIONS: Sales Analysis Business Use and Benefits Analyzing web logs/online behavior stored in Hadoop. Dashboards and VI analysis run against our in-memory cubes. And ad-hoc reports run live against Hive. Analyzing Online Behavior with Live Queries to Hadoop End users do not need to code with MapReduce Developers are more productive delivering self service BI through a tool instead of coding custom user interface.

Leading Electronics Manufacturing Key BI Characteristics: INDUSTRY: Electronics Manufacturing BI COMPONENTS: Multiple Applications DATABASE: Hive and Red Shift HADOOP DISTRIBUTION: S3 VOLUME OF DATA 20-30 TB Daily APPLICATIONS: Smart TV Analytics, Mobile App Analytics Business Use and Benefits This Electronics company is collecting data from all of their smart TVs around the world to analyze the user behavior and app usage. It includes remote controller click stream and log analysis for more than 30 smart TV services. Smart TV Analytics on User Behavior and App Usage The data is actually not stored in HDFS but in S3. Fully leveraging the elastic computing capability of AWS for Map Reduce, with the CPU time based costing of ec2, this saves a big cut from the infrastructure cost. Using MicroStrategy to build applications to analyze click streams and logs.

Multi-Channel Digital Distribution Provider Key BI Characteristics: INDUSTRY: Electronics and Media BI COMPONENTS: 1 Application; Reports, VI, Dashboards DATABASE: Hadoop, Hive HADOOP DISTRIBUTION: Cloudera Impala VOLUME OF DATA Over 1 Billion traffic attribute combinations APPLICATIONS: Traffic Attribute Multiplier Business Use and Benefits The Traffic Attribute Multiplier application is helping Adconion to get precisely target their digital ads, shorten the time to prepare and tune models and better ad delivery ROI for their customers Precise Targeting of Digital Ads Leveraging MicroStrategy s integration to Impala and the rich visualizations library, making it easy to be consumed by business users. Achieved 2.4% improvement in ad budgets spending efficiency

Summary Data Lakes are Powerful MicroStrategy is ready to support your hybrid architecture Various Ways to Access Big Data MicroStrategy offers it all and is committed to be at par A Robust Platform to Analyze Big Data Highly Scalable In Memory engine Get Federated Data Effectively Navigate Data Across Multiple Data Sources with MicroStrategy multi-source and data blending End User Data Preparation Perform on the fly cleansing on Hadoop data

Questions? Q&A