How to Leverage Data Lakes (Hive, Hortonworks, Cloudera, Impala, Spark SQL)

How to Leverage Data Lakes (Hive, Hortonworks, Cloudera, Impala, Spark SQL) Presented by: Trishla Maru #mstrworld

Agenda Data Lakes and Hadoop Ecosystem Various Ways to Access Big Data Demo Top 5 Reasons to Why Use MicroStrategy Customer Success Stories Summary and Q&A 2 #mstrworld

Typical Big Data Lake MicroStrategy Analytics Platform Hive Pig Impala Spark Drill DATA LAKE Hadoop HDFS OLTP,ERP, CRM systems Document s Emails Web logs, Click Streams Social Networks Machine Generated Sensor Data Geolocation Data SOURCES OF DATA

Data Warehouse vs Data Lake A Data Lake is NOT a Data Warehouse Data Warehouse Data Lake Structured and Processed Data Structured, Semi Structured/Unstructured/Raw Data Schema on write processing Schema on read processing Expensive for large data storage volumes Designed for low cost storage Less agile, fixed configuration Highly agile, configure and reconfigure as needed More for business professionals Data scientists et. al.

MicroStrategy/Hadoop Ecosystem MicroStrategy Analytics Platform Cloudera, MapR, Hortonworks, Amazon EMR MAP REDUCE Hive, Pig Spark SQL MSTR HADOOP IMPALA DRILL SPARK GATE- WAY Phoenix/JDBC HBase, Cassandra YARN Hadoop Distributed File System (HDFS)

Bring All Relevant Data to Decision Makers Support for Big Data Sources in the Data Lake Optimized Access to Your Entire Big Data Ecosystem as If It Were a Single Database MapReduce & NOSQL Databases Elastic Map Reduce BigInsights Distribution HDFS Columnar Databases Redshift Data Warehouse Appliances HANA Parallel Data Warehouse Relational Databases Multidimensional Databases Analysis Services SaaS-Based App Data Google Analytics Zendesk Generic Web Services SOAP REST Generic Web Services with OAuth..many more.. User / Departmental Data Clipboard MicroStrategy Dataset

Various Ways to Connect to Hadoop

Generalized Big Data Analytics Workflow Big Data Analytics Workflow Sources: Unstructured, Semi-structured, Structured Outputs: HDFS, RDBMS, Hive, HBase, Cassandra, MongoDB, SOLR, Elastic, Other NoSQL Models Models In general, the big data environments are distributed transformation and calculation frameworks that take in sources and generate outputs. Sources tend to be unstructured or semi-structured whereas outputs tend to be semistructured and structured. Hadoop-based workflows may incorporate several components tied together with a dataflow or workflow manager such as Pig, Oozie, others or custom programs. Spark is able to use Spark APIs and components in the workflow for a more uniform experience

Ways to Connect and Query Hadoop #1 Access and Query Hadoop via Hive Hadoop MicroStrategy Analytics Platform Hive ODBC Connector Any Hadoop Distribution Hive HDFS This is the most popular way of querying Hadoop, via Hive. Hive allows users who aren t familiar with programming to access and analyze big data in a less technical way, using a SQL-like syntax called Hive Query Language (HiveQL). Hive is used for complex, long-running tasks and analyses on large sets of data. Hive uses the map reduce framework to query data from HDFS. It is fault tolerant, but has high latency.

Ways to Connect and Query Hadoop #2 Access and Query Hadoop via Impala (only with Cloudera distribution) Hadoop MicroStrategy Analytics Platform Hive ODBC Connector Cloudera Distribution Impala HDFS Impala also uses SQL syntax to query Hadoop. But does NOT use the map reduce framework. It has its own processing engine to query HDFS. Impala is used for analysis when the user wants to run and return quickly on a small subset of data, e.g. analyzing company finances for a daily or weekly report. Not ideal for complex data manipulation, data preparation etc. Hive is a screwdriver and Impala is a drill bit.

Hive and Impala Hive Impala Hive is SQL on Hadoop Impala is SQL on HDFS Hive uses MapReduce to process its queries Hive cannot perform in-memory query operations Hive is fault tolerant. Recommended for ETL type of jobs Impala uses its own processing engine Impala can perform in-memory query operations Impala is not fault tolerant Hive has high latency Impala has faster response times, great for small ad-hoc queries Use Case: If you have batch processing kind of needs, use Hive Use Case: If you need real time, adhoc queries over a subset of data, use Impala

Ways to Connect and Query Hadoop #3 Access and Query Hadoop via Apache Spark Hadoop MicroStrategy Analytics Platform Spark ODBC Connector Apache Spark HDFS Apache Spark is one of the fastest evolving open source communities, popular for it s MPP capabilities on top of Hadoop and providing advanced analytical workflows. MicroStrategy connects to Spark via Spark SQL/ODBC connector. Hive on Spark Replaces Hive SQL Engine with Spark SQL Engine starting Hive 1.2 Set hive.execution.engine=spark

Key Points for Spark Spark doesn t store information it uses common Hadoop and NoSQL storage for input and output Internally, Spark uses in-memory stores to pass data from one processing step to another Spark SQL in Spark applications works on the in-memory stores to simplify Spark programming against the Resilient Distribute Datasets (RDD) structures. This is not accessible via ODBC In-memory RDD s are not available after a Spark application completes. Spark SQL ODBC can only access RDD s that have persisted in Hive

Ways to Connect and Query Hadoop #4 Tap into Hadoop Natively with MicroStrategy s Hadoop Gateway MicroStrategy Analytics Platform Big Data Engine/Hadoop Gateway NEW Hadoop HDFS We launched this connectivity with v10. Big Data Engine (BDE), is a native YARN based application that enables direct access to HDFS. None of the core BI competitors have this capability. YARN (Yet Another Resource Negotiator) is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters. Use Case: Fulfills faster data loading of data from HDFS and leverage our in-memory layer for analytics.

Top Five Reasons to Why use MicroStrategy for your next Big Data Project?

#1 Leverage varied flavors to access Hadoop that fulfills the business needs Hadoop Gateway SQL on Hadoop/Big Data Apache Hive Apache Spark Apache Pig MicroStrategy is the only core BI tool to provide a native connectivity to HDFS that enables fast data loads into its in-memory layer. MicroStrategy certifies Cloudera Impala, Presto, Google Big Query and Pivotal HAWQ, Apache Drill as a data source. MicroStrategy optimizes and certifies Hadoop/Hive as a data source. MicroStrategy certifies Spark via Spark SQL/ODBC. MicroStrategy also provides a connector to execute freeform pig-latin reports

#2 Analyze large volumes of data with a platform that is scalable and performant Sample Applications Access the Database With High Throughout CRM analysis across a large customer base Interactive analysis on high-volume clickstream data Merchant analytics for a credit card issuer Create and Publish the Cube With High Data Scalability Analyze Data With Fast Response Times Store manager application for a large national chain

#3: Cleanse your Hadoop data on the fly with MicroStrategy s Data Wrangler Cleanse and refine your data with Data Wrangler Cleanse your data Extract, Split Patterns Generate Facets Get Suggestions from our Data Visualization Wrangler, as Rename, Delete, and many more.. you prepare your data! & Analysis Designed for analysts and business users without spending any extra $$ Predictive Interactions Work on a sample and Apply to the entire dataset Undo/Redo without Data penalty Preparation

#4: Perform on the fly dynamic searches on semi-structured data Business Case: Analyzing Product Catalog Perform faceted search by dynamically clustering the items. Users can then drill down by applying specific constraints to the search results Manufacturers can get superior feedback Analyzing Diagnostic Notes Needs search engine s capabilities to facilitate semi-structured data analysis in health care domain. Helping health care professionals make informed decisions in situations that are time sensitive. Configure the file in Apache Solr Import the file via data import Dynamically Search for keywords via Visual Insight

#5: Get federated data on dashboards leveraging MicroStrategy s integrated architecture Oracle Hadoop Excel Personal Data Sources Relational Databases Multi-Dimensional Sources Analysis Services Map Reduce Databases

Customer Success Stories

Key BI Characteristics: INDUSTRY: Wireless Telecom BI COMPONENTS: Multiple Applications DATABASE: Hive and Red Shift HADOOP DISTRIBUTION: Impala VOLUME OF DATA 1 TB Daily APPLICATIONS: 30 Applications Self Service Big Data Analytics with over 30 Applications Business Use and Benefits The largest telecom company in Korea, SK Planet provides mobile services such as navigation, mobile payment, online shopping, memberships, etc. They have 30+ services, and they have all of their data in Hadoop (basically no relational databases for analysis). MicroStrategy provides with an centralized environment for analyzing their Hadoop data in a managed self-service manner. Self service data discovery while fully leveraging MicroStrategy s inmemory technology.

Key BI Characteristics: INDUSTRY: Entertainment BI COMPONENTS: 1 Application; Traditional Reports USERS ~200 DATABASE: Hadoop, Teradata HADOOP DISTRIBUTION: Amazon EMR VOLUME OF DATA Petabytes TYPE OF DATA Log and Events data APPLICATIONS: Sales Analysis Business Use and Benefits Making Better Sales Decisions by Getting Insights from Web Logs in Hadoop Sales Analysis generally with a new launch in new region, quick report analysis to understand the new accounts, number of hours of viewing etc. Directly querying and reporting from MicroStrategy on logs via Hive Able to make better Sales decisions

Key BI Characteristics: INDUSTRY: E-commerce BI COMPONENTS: 1 Application; Reports, Dashboards, VI USERS ~200 DATABASE: Hadoop, Oracle HADOOP DISTRIBUTION: Apache VOLUME OF DATA Petabytes TYPE OF DATA Web Logs, Online behavior APPLICATIONS: Sales Analysis Business Use and Benefits Analyzing web logs/online behavior stored in Hadoop. Dashboards and VI analysis run against our in-memory cubes. And ad-hoc reports run live against Hive. Analyzing Online Behavior with Live Queries to Hadoop End users do not need to code with MapReduce Developers are more productive delivering self service BI through a tool instead of coding custom user interface.

Leading Electronics Manufacturing Key BI Characteristics: INDUSTRY: Electronics Manufacturing BI COMPONENTS: Multiple Applications DATABASE: Hive and Red Shift HADOOP DISTRIBUTION: S3 VOLUME OF DATA 20-30 TB Daily APPLICATIONS: Smart TV Analytics, Mobile App Analytics Business Use and Benefits This Electronics company is collecting data from all of their smart TVs around the world to analyze the user behavior and app usage. It includes remote controller click stream and log analysis for more than 30 smart TV services. Smart TV Analytics on User Behavior and App Usage The data is actually not stored in HDFS but in S3. Fully leveraging the elastic computing capability of AWS for Map Reduce, with the CPU time based costing of ec2, this saves a big cut from the infrastructure cost. Using MicroStrategy to build applications to analyze click streams and logs.

Multi-Channel Digital Distribution Provider Key BI Characteristics: INDUSTRY: Electronics and Media BI COMPONENTS: 1 Application; Reports, VI, Dashboards DATABASE: Hadoop, Hive HADOOP DISTRIBUTION: Cloudera Impala VOLUME OF DATA Over 1 Billion traffic attribute combinations APPLICATIONS: Traffic Attribute Multiplier Business Use and Benefits The Traffic Attribute Multiplier application is helping Adconion to get precisely target their digital ads, shorten the time to prepare and tune models and better ad delivery ROI for their customers Precise Targeting of Digital Ads Leveraging MicroStrategy s integration to Impala and the rich visualizations library, making it easy to be consumed by business users. Achieved 2.4% improvement in ad budgets spending efficiency

Summary Data Lakes are Powerful MicroStrategy is ready to support your hybrid architecture Various Ways to Access Big Data MicroStrategy offers it all and is committed to be at par A Robust Platform to Analyze Big Data Highly Scalable In Memory engine Get Federated Data Effectively Navigate Data Across Multiple Data Sources with MicroStrategy multi-source and data blending End User Data Preparation Perform on the fly cleansing on Hadoop data

Questions? Q&A