Integrating MicroStrategy 9.3.1 with Hadoop/Hive




This document provides an overview of Hadoop/Hive and how MicroStrategy integrates with the latest version of Hive. It provides best practices and usage examples for business users to perform sophisticated analysis on data stored in Hadoop directly from MicroStrategy applications.

Table of Contents

INTRODUCTION
ARCHITECTURAL CONSIDERATIONS
  HOW MICROSTRATEGY INTEGRATES WITH HADOOP
  APACHE HIVE
    What Hive Does
    How Hive Works
    How MicroStrategy Uses Hive
  APACHE HBASE
    What HBase Does
    How HBase Works
    How MicroStrategy Uses HBase
  ALTERNATIVES TO HIVE
    Apache Pig
USING HADOOP FOR ANALYTICS
  INITIAL DATA EXPLORATION AND DISCOVERY
    Example of the Above Use Case
  RDBMS REPLACEMENT FOR LARGE DATA SCALE BI IMPLEMENTATIONS
  USAGE PATTERNS FOR MICROSTRATEGY WITH HADOOP AS A DATA SOURCE
PHYSICAL OPTIMIZATIONS
  STORAGE FILE FORMAT
QUERY ENGINE OPTIMIZATIONS FOR HIVE
  OVERVIEW
  DEFAULT VLDB SETTINGS
    Selected VLDB Settings for Hive 0.11.x
    Intermediate Table Type
    Sub Query Type
  PARTITIONING
    Bucketing
QUERY OPTIMIZATIONS
  JOINS
    Map Join
    Bucket Map Join
    Sort-Merge Join
    Key Takeaways from the Tests
  EXTENSIBILITY USING UDFS, UDAFS AND CUSTOM MAP/REDUCE SCRIPTS
HOW MICROSTRATEGY 9.3.1 KEEPS PACE WITH HIVE'S EVOLUTION
  KNOWN ISSUES WITH HIVEQL AS OF HIVE 0.11
WORKLOAD MANAGEMENT
  WORKLOAD MANAGEMENT IN MICROSTRATEGY
    Monitoring of MicroStrategy Workload
  WORKLOAD MANAGEMENT IN HADOOP
    Hive Thread Safety and Concurrent Queries
    Authentication and Security
    MicroStrategy's Connection Management
    Canceling a Hive Query from MicroStrategy
  HIGH AVAILABILITY FOR THE HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
  HIVE METASTORE USE CASES AND FAILOVER SCENARIO
    Use Cases
    Deployment Scenarios
APPENDIX A
  OTHER VLDB SETTINGS
    VLDB String Insertion Settings when Using Permanent Tables and Explicit Table Creation
  SUPPORTED DATA TYPES
    How to Deal with the DATE Data Type in Older Hive Versions
  TRANSFORMATIONS IN HIVE
  MICROSTRATEGY FUNCTIONS IMPLEMENTED BY USING HIVEQL EXPRESSIONS
  CONFIGURING THE MICROSTRATEGY-HIVE ENVIRONMENT
    Hive Metastore
  HIVE AND QUERY LATENCY
    Using the MicroStrategy Multi-Source Option to Support Element List Prompts
APPENDIX B
  SAMPLE REPORTS
REFERENCES

Introduction

Hadoop is a framework for storing and processing extremely large volumes of data on clusters built from commodity hardware. MicroStrategy provides a high performance, scalable enterprise Business Intelligence platform capable of delivering deep insight with interactive dashboards and superior analytics to large user populations through Web browsers, mobile devices and Office applications. MicroStrategy enables unified access and analysis of data stored in multiple systems in the enterprise with its dynamic query generation and drill anywhere capabilities.

This paper provides a brief overview of the MicroStrategy architecture and explains how this architecture takes advantage of the technology advances and big data analytics capabilities of Hadoop. The intended audience for this discussion includes application developers who want to understand how to configure a system using MicroStrategy and Hadoop.

Architectural Considerations

Hadoop is made up of a complex ecosystem of components, each adding specific functional aspects to a Hadoop implementation. In this context, let's review the most important components to consider when integrating MicroStrategy and Hadoop.

The most important component of Hadoop is storage. Storage is accomplished with the Hadoop Distributed File System (HDFS), a reliable and distributed file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers. We will discuss the various pieces of the Hadoop stack in detail in the following sections.

How MicroStrategy Integrates with Hadoop

MicroStrategy primarily connects to Apache Hive when interacting with Hadoop. From the perspective of the MicroStrategy platform, Hive makes a Hadoop cluster look like a relational database. MicroStrategy optimizes for and certifies Hadoop/Hive as a data source to provide a user experience that requires no programming.

MicroStrategy also supports integration with Hadoop through Pig, a scripting platform for processing and analyzing large data sets. More details on connecting to Pig follow in later sections of this paper.

Apache Hive

Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop that provides data summarization, ad-hoc query, and analysis of large datasets. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL). Hive eases integration between Hadoop and tools for business intelligence and visualization.

What Hive Does

Hadoop was built to organize and store massive amounts of data. A Hadoop cluster is a reservoir of heterogeneous data from multiple sources and in different formats. Hive is designed to give commercial and legacy applications access to the data in Hadoop through a query language and a set of standardized protocols. By doing so, it allows users to explore and structure that data, analyze it, and then turn it into business insight.

How Hive Works

The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. Databases are comprised of tables, which are made up of partitions. Hive supports overwriting or appending data, but not updates and deletes. Within a particular database, data in the tables is serialized, and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.

Within Hive, there is a metastore layer that stores the metadata for Hive tables and partitions in a relational database and provides clients (including Hive) access to this information via the metastore service API. Through this metastore service, Hive presents a table-like structure for the files in HDFS. The metastore itself can be hosted in another relational database; the databases for which setup scripts are provided are Derby, MySQL, Oracle, and PostgreSQL. MicroStrategy leverages this metastore to access the Hive tables through the Open Database Connectivity (ODBC) protocol, which happens seamlessly, as though the tables were being accessed directly from the HDFS layer.

Hive supports primitive data types such as TIMESTAMP, STRING, FLOAT, BOOLEAN, DECIMAL, BINARY, DOUBLE, INT, TINYINT, SMALLINT and BIGINT. In addition, primitive data types can be combined to form complex data types, such as structs, maps and arrays.
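As a brief illustration of these complex types, the following is a minimal HiveQL sketch, with a hypothetical table and columns that are not taken from this paper, showing how an array, a map, and a struct can be declared and addressed:

CREATE TABLE web_session (
  session_id    STRING,
  pages_visited ARRAY<STRING>,
  http_headers  MAP<STRING, STRING>,
  client        STRUCT<browser:STRING, os:STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':';

-- Elements are addressed by index, key, or field name:
SELECT session_id,
       pages_visited[0],            -- first element of the array
       http_headers['User-Agent'],  -- value for a map key
       client.browser               -- field of a struct
FROM web_session
LIMIT 10;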

How MicroStrategy Uses Hive

MicroStrategy connects to Hive using the Open Database Connectivity (ODBC) protocol. MicroStrategy tests and certifies both the MicroStrategy-branded ODBC driver and the ODBC drivers provided with specific Hadoop distributions. MicroStrategy leverages the ODBC standard to query the Hive metastore in order to obtain metadata about the objects (e.g. tables and columns) stored in Hive. The ODBC driver that MicroStrategy supports and certifies primarily implements the ODBC 3.x API calls and hence integrates well with MicroStrategy, supporting all of its catalog functionality such as fetching tables, columns and data types. The SQL catalog in MicroStrategy currently works by using the ODBC driver's API calls instead of custom queries, and query generation support, the function library, and related capabilities are provided through a combination of HiveQL language support and ODBC capabilities. More details on the function library support are provided below in the table on MicroStrategy/Hive 0.11 functions.

MicroStrategy not only allows submitting hand-written HiveQL queries and retrieving results, but is also capable of generating HiveQL queries based on a logical data model (schema) created in the MicroStrategy metadata.

Apache HBase

As part of the Data Services layer, Apache HBase is a non-relational (NoSQL) database that runs on top of HDFS (Hadoop Distributed File System). HBase is a columnar data store that provides fault-tolerant storage and quick access to large quantities of sparse data. It also adds additional capabilities to Hadoop, allowing users to perform updates, inserts and deletes.

What HBase Does

HBase provides random, real-time access to your Big Data. HBase was created for hosting very large tables with billions of rows and millions of columns.

How HBase Works

Apache HBase uses Log Structured Merge trees (LSM trees) to store and query data. It features compression, in-memory caching, bloom filters, and very fast scans. HBase tables can serve as both the input and output for MapReduce jobs.

How MicroStrategy Uses HBase

The Hive/HBase integration support allows HiveQL statements to access HBase tables for both read (SELECT) and write (INSERT). It is even possible to combine access to HBase tables with native Hive tables via joins and unions. Assuming the HBase tables are already pre-created on the Hadoop node via Hive, users can use the HBase ODBC driver, which delivers broad compatibility and ensures full functionality for users analyzing and reporting on Big Data through MicroStrategy. It accepts an application's SQL queries, generates execution plans, and transforms them into calls to HBase's REST API.
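For reference, pre-creating an HBase-backed table via Hive uses the HBase storage handler together with a column-mapping property, along the lines of the following sketch (the table name and column family here are hypothetical; see the HBase integration wiki linked below for the authoritative syntax):

CREATE TABLE hbase_orders (
  order_id INT,
  status   STRING,
  total    DOUBLE
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  -- map the Hive columns to the HBase row key and a column family
  "hbase.columns.mapping" = ":key,order:status,order:total"
)
TBLPROPERTIES ("hbase.table.name" = "orders");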

MicroStrategy is currently researching the HBase capabilities further, and for any questions regarding this functionality users should contact MicroStrategy Technical Support. This paper won't discuss the details of creating HBase tables to be managed by Hive; users can visit https://cwiki.apache.org/confluence/display/hive/hbaseintegration for more information.

Alternatives to Hive

Apache Pig

Apache Pig allows you to write complex MapReduce transformations using a simple scripting language called Pig Latin. Pig Latin allows users to define a set of transformations on a data set such as aggregate, join and sort. Pig translates the Pig Latin script into MapReduce so that it can be executed within Hadoop. Pig Latin is sometimes extended using UDFs (User Defined Functions), which the user can write in Java or a scripting language and then call directly from Pig Latin scripts.

What Pig Does

Pig was designed for performing a long series of data operations, making it ideal for three categories of Big Data operations: standard extract-transform-load (ETL) data pipelines, research on raw data, and iterative processing of data. Whatever the use case, Pig is:

- Extensible. Pig users can create custom functions to meet their particular processing requirements.
- Easy to program. Complex tasks involving interrelated data transformations can be simplified and encoded as data flow sequences. Pig programs accomplish huge tasks, but they are easy to write and maintain.
- Self-optimizing. The system automatically optimizes execution of Pig jobs, so the user can focus on semantics.

How Pig Works

Pig runs on Hadoop and makes use of MapReduce and the Hadoop Distributed File System (HDFS). The language for the platform is called Pig Latin, which abstracts the Java MapReduce idiom into a form similar to SQL. Pig Latin is a flow language whereas SQL is a declarative language. SQL is great for asking a question of your data, while Pig Latin allows you to write a data flow that describes how your data will be transformed. Since Pig Latin scripts can be graphs (instead of requiring a single output) it is possible to build complex data flows involving multiple inputs, transforms, and outputs. Users can extend Pig Latin by writing their own functions, using Java, Python, Ruby, or other scripting languages.

MicroStrategy's Pig connector is a specific configuration of the XQuery framework that connects to a web services API, allowing Pig Latin scripts to be submitted to the Hadoop cluster using MicroStrategy's freeform reporting infrastructure.

For more information on how to connect using this connector, please refer to TN 36186: How to configure MicroStrategy to connect to Hadoop Pig.

Proprietary Products

Several companies have created products that provide functionality similar to Apache Hive. Because MicroStrategy strives to support best-of-breed technology, many of these technologies are supported. At the time of writing this integration paper, MicroStrategy works transparently with the following technologies that provide pass-through functionality to Apache Hadoop:

- Hortonworks Distribution of Apache Hadoop Hive
- Cloudera's Distribution of Apache Hadoop Hive
- Cloudera Impala
- Greenplum HD
- Pivotal HAWQ
- Teradata Enterprise Access for Hadoop
- SAP HANA integration with Hadoop
- Hadoop on Demand integration with ParAccel
- SQL-H Aster Data
- The BigSQL component of IBM InfoSphere BigInsights
- Intel Distribution for Apache Hadoop

Note that the above list evolves constantly as the Hadoop framework matures. For more information on any of the above or other Hadoop-related data sources, please contact MicroStrategy Technical Support.

Using Hadoop for Analytics

Hadoop is extremely scalable (it can run on thousands of nodes), reliable (with built-in support for handling node failures) and inexpensive, since it is open source software that runs on commodity hardware. Another subtle strength of Hadoop is its flexibility: you do not need to determine the structure of the data at the time you load it into Hadoop. HDFS is a file system, so you can simply store files and figure out how to query them later.

Although Hive makes Hadoop look like a relational database, Hadoop is often seen as a complementary technology to an RDBMS or enterprise data warehouse. Enterprise customers sometimes choose to work with Hadoop as a complementary technology to build an end-to-end stack that covers all their data management and analytics needs. Hadoop by itself is not a data integration platform; however, industry vendors are filling the functionality gap by porting existing profiling, parsing, ETL, and cleansing capabilities to run directly on the Hadoop framework. The key to the growth and success of Hadoop is the expansion of the ecosystem with solutions that complement its native features. MapReduce code generation allows Hadoop to be used for data integration, and one way to perform MapReduce programming without actual coding on the user's end is through Hive, thus enabling the platform for data integration.

MicroStrategy customers are looking at MicroStrategy/Hadoop integration mainly for:

- Initial data exploration and discovery, and/or
- Cost-effective RDBMS replacement for deployments with large data volumes.

Initial Data Exploration and Discovery

Current Situation: Most current BI projects using traditional EDW technology often limit data volumes because of technical or cost limitations. Any data that might be available outside these limits is not available for analysis. Hadoop offers a cost-effective way to make additional data available for analysis.

Solution: The solution here is to use Hadoop as a data sink where no data is thrown away, and to leverage it as database technology for low volume initial data exploration and discovery. MicroStrategy offers data scientists and analysts agile access to the raw data in Hadoop. The goal is to collect sufficient insight to make a business case to invest in the data transfer into the traditional EDW, which in turn makes the new data available for high volume analysis by broader audiences.

Example of the above use case:

MicroStrategy's Query Builder reporting capability allows users to graphically construct HiveQL to query data in Hadoop through Hive. This Query Builder report can include all the columns that are part of the query. Essentially, we can use Hive's built-in functions such as YEAR() and MONTH() in the column expression editor to dynamically construct the time hierarchy from the date information that comes with the original data set. Another advanced use case is building attributes by using regular expressions to parse a description field and extract specific data on the fly. This is called schema on demand, which is one of the most popular use cases for Hadoop; a brief HiveQL sketch of both techniques appears at the end of this section. In addition, the Query Builder interface also provides a join editor with which you can write custom statements for the join clause, specifying a full outer join, left outer join or right outer join rather than just the basic inner join.

The underlying query behind the report can be tweaked even further than the basic SQL generated via Query Builder. We can add a Python script, along with a TRANSFORM statement, to scrub the data into the format we would like. That is done through Report Pre SQL statements in MicroStrategy's VLDB properties, which are commonly used to add scripts or run transforms before the data is used for any further processing. The end result of such a query is that we can explore this data set using MicroStrategy's Visual Insight, which lets users gain insight into the data in a visual format, for example by performing a map visualization and drilling down into it. More discussion on this follows later in the document.

HiveQL or Pig Latin against Hadoop has its benefits as explained above, but it also comes with its own set of limitations, e.g. relatively slow execution times. A potential way of getting around this is using in-memory cubes for sampling. More on this follows later.
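As referenced above, here is a minimal HiveQL sketch of deriving a time hierarchy with YEAR()/MONTH() and parsing an attribute out of a free-text field with a regular expression. The table and the 'model=' pattern are hypothetical illustrations, not objects from this paper:

SELECT year(a11.order_date)  AS order_year,   -- derive time hierarchy levels
       month(a11.order_date) AS order_month,  -- from a plain date column
       regexp_extract(a11.description, 'model=([0-9]+)', 1) AS model_nbr,
       sum(a11.qty_sold)     AS qty_sold
FROM order_detail a11
GROUP BY year(a11.order_date),
         month(a11.order_date),
         regexp_extract(a11.description, 'model=([0-9]+)', 1);

Note that the expressions are repeated in the GROUP BY clause, since Hive 0.11 does not allow grouping by a column alias.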

RDBMS Replacement for Large Data Scale BI Implementations

Current Situation: Big data on current RDBMS platforms is prohibitively expensive at large data volumes. Cost is driven by software license fees, proprietary hardware, and installation/maintenance cost. Hadoop offers (relatively) high performance at big data scale (>>10 TB) on commodity hardware, and recent advances in Hadoop technology seem to make it feasible for interactive analysis.

Solution: The solution in this case is to optimize the Hadoop/MicroStrategy setup for query performance, allowing interactive use and user scalability. MicroStrategy comes with dynamic report personalization, which includes object prompts allowing users to choose from all reporting, analysis, and business logic objects to author their own reports at run time. Also, its automatic multi-source drill anywhere into Hive is a huge benefit when compared to other BI sources. Users can store the detailed level data in Hadoop, perform intermediate rollups (hourly/daily) in Hadoop, and then move the rollups to the MicroStrategy server for interactive analysis.

Usage Patterns for MicroStrategy with Hadoop as a Data Source

There are primarily two ways one can query data in Hadoop using MicroStrategy.

Freeform Reports. You can design freeform HiveQL queries or Pig Latin scripts that are executed using the MicroStrategy freeform reporting infrastructure.
- The results of these freeform queries can be extracted into a MicroStrategy in-memory cube for further slicing and dicing with sub-second response times.
- The freeform queries can also be parameterized and executed in a scheduled or interactive fashion against the Hadoop cluster.

Modeled Reports. You can also design a semantic layer in MicroStrategy and enable users to design reports using this semantic layer and a drag-and-drop MicroStrategy Web interface. The HiveQL queries for these reports are dynamically generated by MicroStrategy's Query Engine.
- These reports can be designed to run purely against data in Hive,
- or against data in both an RDBMS system and Hive using MicroStrategy's MultiSource capability. Users can also seamlessly drill down from data in an in-memory cube, to detail data in an RDBMS, and to further detail in Hadoop/Hive.

Figure 1: Usage patterns for MicroStrategy with Hadoop as a data source

Some of the optimizations discussed in this document are only relevant for dynamic query generation, while others are relevant for all use cases.

Physical Optimizations

Storage File Format

Data in the files that make up a table in Hive can be stored in different formats. You can indicate the file format and its SerDe (short for serializer/deserializer) properties as part of the table definition in Hive. By default the file format is textfile, which is a text representation of the data. There are also binary storage formats such as sequencefile and rcfile that provide compressed storage of data. Sequence files are row-oriented, while RCFile is a hybrid row-columnar storage format. The RCFile format can provide better performance for queries that access only a small number of columns in the table, which is typically the case for queries executed from MicroStrategy. Storing intermediate tables as (compressed) RCFile reduces the I/O and storage requirements significantly over text, sequence file, and row formats in general. Querying tables stored in RCFile format allows Hive to skip large parts of the data and get the results faster and cheaper.

Different file formats and compression codecs work better for different data sets. While you may see performance gains regardless of file format, choosing the proper format for your data can yield further performance improvements. Use the following considerations to decide which combination of file format and compression to use for a particular table:

- If you are working with existing files that are already in a supported file format, use the same format for your table where practical. If the original format does not yield acceptable query performance or resource usage, consider creating a new table with different file format or compression characteristics, and doing a one-time conversion by copying the data to the new table using the INSERT statement.

- Text files are convenient to produce through many different tools and are human-readable for ease of verification and debugging. Those characteristics are why text is generally the default format for a CREATE TABLE statement in Hive. When performance and resource usage are the primary considerations, use one of the other file formats and consider using compression. A typical workflow might involve bringing data into a table by copying CSV or TSV files into the appropriate data directory, and then using the INSERT ... SELECT syntax to copy the data into a table using a different, more compact file format; a sketch of this conversion appears at the end of this section.

- Different forms of compression can actually improve performance, since they reduce the amount of work to be done (e.g. dictionary encoding), and lightweight compression increases the amount of data you can keep in memory. It is worth noting that if you care about performance you should use Snappy compression, which was designed for low latency, rather than GZIP or BZIP. GZIP and BZIP are designed for storage density but are extremely CPU costly and are not recommended for any data sets where performance is a consideration.

If the user intends to connect to HBase, then the storage file format does not really matter as such, since the user will need to create the HBase tables through Hive using the storage handler and the right STORED BY property.
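As a minimal sketch of that one-time conversion, assuming a text-format table named order_detail_text already exists (the table names and the choice of the Snappy codec are illustrative assumptions):

-- create an RCFile-format copy of the table's schema
CREATE TABLE order_detail_rc LIKE order_detail_text;
ALTER TABLE order_detail_rc SET FILEFORMAT RCFILE;

-- enable compressed output for the conversion pass
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- one-time copy from the text table into the compressed RCFile table
INSERT OVERWRITE TABLE order_detail_rc
SELECT * FROM order_detail_text;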

Query Engine Optimizations for Hive

Overview

MicroStrategy identifies tables as base tables, lookup tables and intermediate tables. Base tables support the calculation and aggregation of metrics, and lookup tables resolve the attribute forms. Base tables in MicroStrategy are the logical representations of the physical tables that exist in the database. Once a table has been created in the database, it can be pulled into the warehouse catalog in MicroStrategy and used directly, so the end user need not create anything else on top of it. If they want to, users can still write logical views over the existing tables in MicroStrategy.

An intermediate table is a table created on the database to store temporary data used to calculate the final result set. These tables can either be 'permanent' or 'true temporary'. Usually, the creation of these tables is not logged in the database logs and the tables are not indexed, with some exceptions. These tables are created with the intent that they exist only temporarily on the relational database and are dropped when the connection is closed. On any database that does not support 'true temporary tables', any table created by MicroStrategy or any other process will be 'real' or 'permanent', since it is considered permanent by the relational database even if it is used to store temporary results. So, it is considered an intermediate permanent table from the perspective of MicroStrategy. On databases where 'true temporary tables' are supported, the MicroStrategy SQL Generation Engine can be configured to create the table type of choice using the 'Intermediate Table Type' VLDB property in version 9.x (more details on that in the following section). One more option that exists for the intermediate table type is derived tables. A derived table is nothing but a nested 'select' statement. This type of query does not create any table on the database side, and all the processing is performed in the database server's memory. For Hive, derived tables are chosen as the default option for intermediate tables while formulating the query.

The maintenance of the physical data organization is generally transparent to MicroStrategy, and it regards base tables and intermediate tables very similarly. The MicroStrategy platform provides VLDB drivers for all supported RDBMS platforms to generate optimized queries that take advantage of database-specific functionality. The full set of VLDB properties is documented in the MicroStrategy System Administration Guide. The settings that are most relevant to Hive are discussed below.

Default VLDB Settings

MicroStrategy VLDB properties are settings that customize the queries generated by MicroStrategy Intelligence Server. The settings provide a way to manipulate HiveQL join patterns, HiveQL insert statements, and table creation properties without manually altering HiveQL scripts, as well as how the Analytical Engine manages certain results. By adjusting the HiveQL statements using VLDB properties, specific optimizations can further enhance the performance of queries, and also provide the capability to easily incorporate and take advantage of features introduced in new versions of Hive. This is a key strength of MicroStrategy. MicroStrategy's VLDB driver for Hive is designed to use Hive-specific features when they lead to improved performance or analytical functionality. When a Database Instance is configured to use the Hive 0.11.x database connection type, the recommended values for all VLDB properties are automatically used for every report executed against that Database Instance. The recommended VLDB optimizations for Hive are listed below. Administrators may add or modify VLDB optimizations on MicroStrategy projects, reports, or metrics at any time so that their queries are specifically optimized for their data warehouse environment.

Selected VLDB Settings for Hive 0.11.x

VLDB Category | VLDB Property | Setting Value
Tables | Intermediate Table Type | Derived Table
Tables | Fallback Table Type | Permanent Table
Tables | Maximum SQL Passes Before FallBack | 0 (no threshold)
Tables | Maximum Tables in FROM Clause Before FallBack | 0 (no threshold)
Tables | Drop Temp Table Method | Drop after final pass
Tables | Table Creation Type | Implicit Table
Query Optimizations | Sub Query Type | Use temporary table, falling back to IN (SELECT COL) for correlated sub-query
Joins | Full Outer Join Support | Supported
Select/Insert | Distinct/Group By option (when no aggregation and not table key) | Use Group By (This setting controls the use of DISTINCT or GROUP BY when selecting elements of a non-key column from a table. The default is to use GROUP BY, but users may experiment with DISTINCT versus GROUP BY to pick the best value for their environment.)
Select/Insert | UNION Multiple Insert | DO NOT Use UNION
Joins | SQL Join Type | Join 92
Query Optimizations | SQL Global Optimization | Level 4: Level 2 + Merge All Passes with Different WHERE
Query Optimizations | Set Operator Optimization | Disabled

Intermediate Table Type

The ability to generate multi-pass HiveQL is a key feature of the MicroStrategy Query Generation Engine. For Hive, the best way to handle multi-pass queries is through derived tables.

Derived Tables

Rather than implement each pass in a separate table, derived table syntax allows the query generation engine to issue additional passes as query blocks in the FROM clause. Instead of issuing multiple HiveQL statements that create intermediate tables, the query generation engine generates a single large HiveQL statement.

select pa13.region_nbr REGION_NBR,
       a14.region_desc REGION_DESC,
       pa11.store_nbr STORE_NBR,
       a12.store_desc STORE_DESC,
       pa11.wjxbfs1 WJXBFS1,
       pa13.wjxbfs1 WJXBFS2
from (select a11.store_nbr STORE_NBR,
             sum(a11.cle_sls_dlr) WJXBFS1
      from mstrlabs_store_division a11
      group by a11.store_nbr) pa11
join mstrlabs_lookup_store a12
  on (pa11.store_nbr = a12.store_nbr)
join (select a11.region_nbr REGION_NBR,
             sum(a11.cle_sls_dlr) WJXBFS1
      from mstrlabs_region_division a11
      group by a11.region_nbr) pa13
  on (a12.region_nbr = pa13.region_nbr)
join mstrlabs_lookup_region a14
  on (pa13.region_nbr = a14.region_nbr)

Note that not all reports are able to use derived tables. There are two primary scenarios in which permanent tables must be used instead of derived tables:

- When a report uses a function supported in the MicroStrategy Analytical Engine that is not supported in Hive (e.g. many of the functions in the financial and statistical function packages). If these functions are used in intermediate calculations, the MicroStrategy Analytical Engine will perform the calculations and then attempt to insert the records back into Hive for further processing.

- When a report uses the MicroStrategy partitioning feature. When using partitioning, the query generation engine executes a portion of the query in order to determine which partitions to use. The results are then used to construct the rest of the query. Because the full structure of the query is not known prior to execution, the query generation engine must use intermediate tables to execute the query in multiple steps.

These situations do not cover 100% of the cases in which temporary tables must be used. The remaining cases are relatively obscure combinations of VLDB settings, such as certain combinations of Sub Query Type plus outer join settings on metrics plus non-aggregatable metrics. Hive, however, does not yet support INSERT INTO TABLE ... VALUES syntax, which would be required to support the scenarios mentioned above. It is therefore recommended that you do not use the partitioning feature. Also note that if you use a function that is not supported by Hive and MicroStrategy's Analytical Engine needs to insert records back into Hive after some intermediate processing, the report will fail.

Permanent Tables

If Intermediate Table Type is set to Permanent Table, MicroStrategy will generate a plain vanilla CREATE TABLE statement for intermediate tables. This can be useful because other VLDB settings such as Table Qualifier, Table Space, etc. can be used to customize this standard CREATE TABLE syntax. See Additional VLDB Settings below for more detail on the settings that can be customized when using permanent tables.

CREATE TABLE ZZMD00 AS
select a12.state_nbr state_nbr,
       sum(a11.end_cle_stk_dlr) WJXBFS1
from eatwh1_store_division a11
join eatwh1_lookup_store a12
  on (a11.store_nbr = a12.store_nbr)
group by a12.state_nbr

select pa15.state_nbr state_nbr,
       a16.state_desc state_desc0,
       pa15.wjxbfs1 WJXBFS1
from ZZMD00 pa15
join eatwh1_lookup_state a16
  on (pa15.state_nbr = a16.state_nbr)

Temporary Tables

Many RDBMSs support the concept of a temporary table, a special kind of table that has less overhead than a regular table. When the VLDB property Intermediate Table Type is set to True Temporary Table, MicroStrategy generates syntax to create these special tables for multi-pass queries. Hive 0.11 doesn't support the concept of temporary tables, so this option is not relevant for Hive.

Sub Query Type

The following are cases in which the query generation engine will generate subqueries (i.e. query blocks in the WHERE clause):

- Reports that use relationship filters
- Reports that use NOT IN set qualification, e.g. AND NOT <metric_qualification> or AND NOT <relationship_filter>

- Reports that use attribute qualification with many-to-many relationships, e.g. show Revenue by Category, filter on Catalog
- Reports that raise the level of a filter, e.g. a dimensional metric at the Region level, but qualifying on Store
- Reports that use non-aggregatable metrics, e.g. inventory metrics
- Reports that use dimensional extensions
- Reports that use attribute-to-attribute comparison in the filter

Through MicroStrategy, we can generate the following forms of subqueries:

1. WHERE EXISTS (SELECT * ...)
2. WHERE EXISTS (SELECT col1, col2, ...)
3. WHERE COL1 IN (SELECT s1.col1 ...), falling back to EXISTS (SELECT * ...) for multiple-column IN
4. WHERE (COL1, COL2, ...) IN (SELECT s1.col1, s1.col2 ...)
5. Use temporary table, falling back to EXISTS (SELECT * ...) for correlated subquery
6. WHERE COL1 IN (SELECT s1.col1 ...), falling back to EXISTS (SELECT col1, col2, ...) for multiple-column IN
7. Use temporary table, falling back to IN (SELECT COL) for correlated subquery

However, not all of the above are supported in Hive. Hence, MicroStrategy rewrites most of the subqueries to use intermediate tables instead. The default setting for Sub Query Type for Hive 0.11.x is option 7 above: use temporary table, falling back to IN for correlated subquery. This setting instructs the query generation engine to handle non-correlated subqueries using temporary tables, while correlated subqueries use the IN subquery syntax. If a query truly requires a correlated subquery, there is no way to rewrite it using temporary tables; such queries will fail, since Hive does not support correlated queries at the time of writing this paper.

select distinct pa11.year_id year_id,
       a13.year_desc year_desc,
       pa11.wjxbfs1 WJXBFS1
from (select a12.year_id year_id,
             a11.cur_trn_dt cur_trn_dt,
             sum(a11.bgn_cle_stk_dlr) WJXBFS1
      from mstrlabs_region_division a11
      join mstrlabs_lookup_day a12
        on (a11.cur_trn_dt = a12.cur_trn_dt)
      where a12.month_id = 199312
      group by a12.year_id, a11.cur_trn_dt) pa11
join (select pc11.year_id year_id,
             min(pc11.cur_trn_dt) WJXBFS1
      from (select a12.year_id year_id,
                   a11.cur_trn_dt cur_trn_dt,
                   sum(a11.bgn_cle_stk_dlr) WJXBFS1
            from mstrlabs_region_division a11
            join mstrlabs_lookup_day a12
              on (a11.cur_trn_dt = a12.cur_trn_dt)
            where a12.month_id = 199312
            group by a12.year_id, a11.cur_trn_dt) pc11
      group by pc11.year_id) pa12
  on (pa11.cur_trn_dt = pa12.wjxbfs1 and pa11.year_id = pa12.year_id)
join mstrlabs_lookup_year a13
  on (pa11.year_id = a13.year_id)

Partitioning

Hive organizes data in tables and partitions. As an example, order details can be stored in an order_detail table that is partitioned by date, with a 2009-05-01 partition for May 1, 2009 data and a 2009-04-30 partition for April 30, 2009 data.

The data for a particular date goes into the partition for that date. A good partitioning scheme allows Hive to prune data while processing a query, which has a direct impact on how fast the result of a query can be produced; e.g. queries on the order details for a single day do not have to process data for other days.

Behind the scenes, Hive stores partitions and tables as directories in the Hadoop Distributed File System (HDFS). In the previous example, the table order_detail could be mapped to /warehouse/order_detail, while each of the partitions maps to /warehouse/order_detail/ds=2009-05-01 and /warehouse/order_detail/ds=2009-04-30, where ds (date stamp) is a partitioning column. The partitioning scheme can have multiple columns as well, in which case each partitioning column maps to a level within the directory name space. Note that partitioning on base tables is transparent to MicroStrategy. At this time, MicroStrategy does not create partitioned intermediate tables.

Bucketing

Within each partition the data can further be bucketed into individual files. The bucketing can be random or hashed on a particular column. Organizing tables (or partitions) into buckets has a couple of advantages.

- The first is to enable more efficient queries. Bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. In particular, a join of two tables that are bucketed on the same columns, which include the join columns, can be efficiently implemented as a map-side join. In addition, the data within a bucket can also be sorted by one or more columns. This allows even more efficient map-side joins, since the join of each bucket becomes an efficient merge-sort.

- The second reason to bucket a table is to make sampling more efficient. When working with large datasets, it is very convenient to try out queries on a fraction of your dataset while you are in the process of developing or refining them.

Populating Partitioned and Bucketed Tables

Hive has built-in capabilities that make it easy to build partitioned and/or bucketed versions of a table from a non-partitioned, non-bucketed table. A typical workflow is as follows:

1. Create a non-partitioned, non-bucketed version of the table.

CREATE TABLE OrderDetail(
  order_id INT,
  item_id INT,
  emp_id INT,
  promotion_id INT,
  customer_id INT,
  qty_sold FLOAT,
  unit_price FLOAT,
  unit_cost FLOAT,
  discount FLOAT,
  order_date STRING
)
row format delimited fields terminated by '\t' escaped by '\\'
stored as textfile;

2. Load data into the table. Note that this is merely a file move operation, unlike the load of a table in a traditional RDBMS.

LOAD DATA LOCAL INPATH '/hadoophome/data/order_detail.tsv'
OVERWRITE INTO TABLE OrderDetail;

3. Create a partitioned and bucketed version of the table. In the sample below the table is partitioned by order_date and bucketed by order_id.

CREATE TABLE OrderDetailPartClust(
  order_id INT,
  item_id INT,
  emp_id INT,
  promotion_id INT,
  customer_id INT,
  qty_sold FLOAT,
  unit_price FLOAT,
  unit_cost FLOAT,
  discount FLOAT
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (order_id) INTO 16 BUCKETS
row format delimited fields terminated by '\t' escaped by '\\'
stored as TEXTFILE;

4. Use Hive's dynamic partition insert and bucket enforcement properties to dynamically build the partitioned and bucketed version of the table by inserting data from the non-partitioned, non-bucketed version.

SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE OrderDetailPartClust PARTITION(order_date)
SELECT order_id, item_id, emp_id, promotion_id, customer_id,
       qty_sold, unit_price, unit_cost, discount, order_date
FROM OrderDetail;

See https://cwiki.apache.org/hive/tutorial.html#tutorial-dynamicpartitioninsert for more details on Hive's dynamic partition insert capabilities and https://cwiki.apache.org/hive/languagemanual-ddl-bucketedtables.html for loading data into bucketed tables.

Query Optimizations

Joins

Map Join

By default, joins in Hive involve a map stage and a reduce stage. A mapper reads from the join tables and emits the join key and join value pairs into an intermediate file. Hadoop sorts and merges these pairs in what's called the shuffle stage. The reducer takes the sorted results as input and does the actual join work. The shuffle stage is very expensive since it needs to sort and merge; eliminating the shuffle and reduce stages improves task performance. The motivation of a map join is to save the shuffle and reduce stages and do the join work only in the map stage.

By doing so, when one of the join tables is small enough to fit into memory, all the mappers can hold the data in memory and do the join work there, so all the join operations can be finished in the map stage.

How can MicroStrategy take advantage of this? You can enable map-side joins by setting the Hive configuration variable hive.auto.convert.join to true (i.e. set hive.auto.convert.join=true;). There are a couple of different ways to set this variable:

1. The statement can be added to the Report Pre SQL statements at the Database Instance level, such that each report request carries with it the logic to auto-enable map-side joins (a sketch follows at the end of this subsection).

2. The variable can also be set by modifying the hive-site.xml file:

<property>
  <name>hive.auto.convert.join</name>
  <value>true</value>
  <description>variable to auto enable map join</description>
</property>

In addition to auto-enabling map-side joins, there are also a couple of advanced parameters around the small table that you can use as memory and size governors. Both parameters can be set either via the Report Pre SQL statement or by modifying the hive-site.xml file.

hive.hashtable.max.memory.usage
This parameter controls the memory usage of the local task relative to the heap size. For example:

SET hive.hashtable.max.memory.usage=0.9;

This means that if the memory usage of the local task exceeds 90% of its heap size, the local task will abort and fall back to the common join, i.e. the traditional way of joining tables in Hive.

hive.smalltable.filesize
This parameter controls how big the small table can be. For example:

SET hive.smalltable.filesize=25000000L;

This means that if the small table's file size is less than 25 MB, Hive runs the map join task; otherwise it falls back to the common join.
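As referenced above, a minimal sketch of what such a report execution could look like from Hive's perspective, reusing table names from the earlier derived-table example: the Pre SQL statement runs first, then the generated report query joins the fact table to a lookup table small enough to be held in memory and joined on the map side.

-- Report Pre SQL statement issued before the report query
SET hive.auto.convert.join=true;

-- generated report query; the small lookup table is eligible
-- to be held in memory and joined on the map side
SELECT a12.region_desc,
       sum(a11.bgn_cle_stk_dlr) AS stock_dollars
FROM mstrlabs_region_division a11
JOIN mstrlabs_lookup_region a12
  ON (a11.region_nbr = a12.region_nbr)
GROUP BY a12.region_desc;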

Bucket Map Join

As discussed above, in the map-side join case, if the tables being joined are bucketed on the same columns, and the bucket counts are a multiple of each other, the buckets can be joined with each other. If table T1 has 8 buckets and table T2 has 4 buckets (and table T2 is the smaller table), the following join can be done on the map side. Instead of fetching T2 completely for each mapper of T1, only the required buckets are fetched; for the query below, the mapper processing bucket 1 of T1 will only fetch bucket 1 of T2.

SELECT /*+ MAPJOIN(T2) */ T1.key, T2.value
FROM T1 join T2 on T1.key = T2.key

This behavior is governed by the following parameters:

set hive.auto.convert.join=true;
set hive.optimize.bucketmapjoin = true;

Sort-Merge Join

If the tables being joined are sorted and bucketed on the join column, and the number of buckets is the same, a sort-merge join can be performed. The corresponding buckets are joined with each other within the mapper. If both T1 and T2 are sorted and bucketed on the key column and have the same number of buckets, the join below is executed as a sort-merge join on the map side. The mapper for a bucket of T1 will traverse the corresponding bucket of T2 and perform a sort-merge join.

SELECT /*+ MAPJOIN(T2) */ T1.key, T2.value
FROM T1 join T2 on T1.key = T2.key

This behavior is governed by the following parameters:

set hive.auto.convert.join=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

Below are the results of sample tests run in-house to show the impact that partitioning, clustering and map-side join techniques have on Hive query performance:

- Orderdetailnopartnoclust holds about 1 year's worth of data (for 2008) and is about 2B rows in size.
- Orderdetailpartclust also holds about 1 year's worth of data (for 2008) and is about 2B rows in size. It is partitioned by order_date, clustered by item_id into 16 buckets, and each bucket is sorted on customer_id.
- Lookup_Day holds 1 year's worth of Time dimension data (for 2008) and is about 365 rows in size.

Single-table queries:

S1 - Non-partitioned, non-clustered single table with WHERE clause on the partition column (date). Execution time: 400 secs
select a11.emp_id, count(1), count(distinct a11.customer_id), sum(a11.qty_sold)
from orderdetailnopartnoclust a11
where order_date = '01-01-2008'
group by a11.emp_id;

S2 - Partitioned, clustered single table with WHERE clause on the partition column (date). Execution time: 44 secs
select a11.emp_id, count(1), count(distinct a11.customer_id), sum(a11.qty_sold)
from orderdetailpartclust a11
where a11.order_date = '01-01-2008'
group by a11.emp_id;

S3 - Partitioned, clustered single table with WHERE clause on the partition column (date) and a TABLESAMPLE clause. Execution time: 20 secs
select a11.emp_id, count(1), count(distinct a11.customer_id), sum(a11.qty_sold)
from orderdetailpartclust TABLESAMPLE(BUCKET 2 OUT OF 128 on item_id) a11
where a11.order_date = '01-01-2008'
group by a11.emp_id;

Multiple-table queries:

M1 - Baseline query: partitioned, clustered fact table, join to one table, no map-side join, WHERE clause on a higher level non-partitioned column (week). Execution time: 1509 secs
select a11.emp_id, count(1), count(distinct a11.customer_id), sum(a11.qty_sold)
from orderdetailpartclust a11
join Lookup_Day a12 on (a11.order_date = a12.day_date)
where a12.week_id = '20081'
group by a11.emp_id;

M2 - Same as M1 with a normal join, but with the larger table streamed by moving it outer-most in the join clause. Execution time: 1365 secs
select a11.emp_id, count(1), count(distinct a11.customer_id), sum(a11.qty_sold)
from Lookup_Day a12
join orderdetailpartclust a11 on (a11.order_date = a12.day_date)
where a12.week_id = '20081'
group by a11.emp_id;

M3 - Partitioned, clustered fact table, join to one table, WHERE clause on a higher level non-partitioned column (week), with map-side join. Execution time: 700 secs
select a11.emp_id, count(1), count(distinct a11.customer_id), sum(a11.qty_sold)
from orderdetailpartclust a11
join Lookup_Day a12 on (a11.order_date = a12.day_date)
where a12.week_id = '20081'
group by a11.emp_id;

M4 - Partitioned, clustered fact table, join to one table, WHERE clause on the partitioned column, with map-side join. Execution time: 143 secs
select a11.emp_id, count(1), count(distinct a11.customer_id), sum(a11.qty_sold)
from orderdetailpartclust a11
join Lookup_Day a12 on (a11.order_date = a12.day_date)
where (a11.order_date >= '01-01-2008' and a11.order_date <= '01-06-2008')
group by a11.emp_id;

M5 - Equivalent query that qualifies directly on the partitioned column, with no joins. Execution time: 96 secs
select a11.emp_id, count(1), count(distinct a11.customer_id), sum(a11.qty_sold)
from orderdetailpartclust a11
where (a11.order_date >= '01-01-2008' and a11.order_date <= '01-06-2008')
group by a11.emp_id;

Key takeaways from the tests

- It is good to have the data in Hive tables partitioned, and even more important to qualify on the partitioned column(s) to let Hive prune the data. Queries S2/S3 vs. S1 and M4/M5 vs. M1/M2/M3 show the impact of qualifying on partition column(s).
- Auto-enabling map-side joins (set hive.auto.convert.join=true) is recommended.
- Clustering/bucketing helps in quick sampling of data through the use of the TABLESAMPLE clause. Query S3 shows the impact of bucketing. Bucketing is basically an enabler for map-side merge joins.

Also note that performance is contingent on the environment, and actual results will vary based on cluster size and specification. The Apache community is working on a multi-faceted initiative, Stinger, to improve Hive performance; please refer to additional information on this project here.

Extensibility Using UDFs, UDAFs and Custom Map/Reduce Scripts

In addition to the built-in functions that Hive provides, MicroStrategy also allows users to take advantage of custom UDFs and UDAFs built in-house. These custom functions are typically built as JAR files that can be added to a Hive session through ADD commands supplied using MicroStrategy's Pre SQL functionality. Hive uses Hadoop's Distributed Cache to distribute the added files to all the machines in the cluster at query execution time. More information about adding resources to a Hive session can be found at https://cwiki.apache.org/hive/languagemanual-cli.html#languagemanualcli-hiveresources.

You can use a combination of the CREATE TEMPORARY FUNCTION command and the ADD JAR command to make a function available for the duration of a Hive session. For example, you can add a user-defined function LString that returns the first x characters of a string. Let's say the JAR is located under the /user/mydata folder on the local file system. You can use the following set of commands to add the JAR and create a function LString that can be used in HiveQL statements in that Hive session:

add jar /user/mydata/lstring.jar;
create temporary function LString as 'com.example.hive.udf.lstring';
select mydate, LString(mystring, 5) from test limit 10;

MicroStrategy enables you to perform sophisticated analysis without having to write code or learn HiveQL. However, there might be scenarios where you want to use custom map and/or reduce scripts for your analytics. You can use Hive's TRANSFORM capability to do this. The TRANSFORM function can be invoked using MicroStrategy's Pre SQL functionality. In the example below, a Python script is used to geocode IP addresses collected from an Apache Web log.

ADD FILE /hadoophome/data/apachelog/ipgeocode.py;

INSERT OVERWRITE TABLE GeocodedIPs
SELECT TRANSFORM (IPAddress)
USING 'python IpGeocodeStdInOut.py'
AS (IPAddress, Country, State, City, Zipcode, Latitude, Longitude)
FROM ApacheRawIps;

More information about Hive's TRANSFORM capability can be found here: https://cwiki.apache.org/confluence/display/hive/languagemanual+transform

How MicroStrategy 9.3.1 Keeps Pace with Hive's Evolution

The following figure depicts how new features arrived with each release of Hive and how MicroStrategy adopted them in the product.

Note that MicroStrategy 9.3.1 certifies Hive 0.10 (and older versions) and also Hive 0.11. The certification of Hive 0.11 with MicroStrategy is documented in the KBase as TN 43815.

Known Issues with HiveQL as of Hive 0.11
1. No support for subqueries (a SELECT statement in the WHERE clause)
   a. Some Relationship Filters in MicroStrategy will result in subqueries
2. No support for INSERT INTO TABLE ... VALUES (see the sketch after this list)
   a. MicroStrategy Multi-Source Option will not be able to move data from one RDBMS into Hive
   b. Functions supported by MicroStrategy but not by Hive (e.g. MovingAvg) will not be able to insert rows back into Hive
3. Limited support for UNION ALL
   a. The MicroStrategy partitioning feature will result in UNION ALL statements
4. No support for temporary-table syntax
5. No support for SQLCancel, SQLExtendedFetch, SQLDescribeParam, SQLTransact, or query timeout (SQLSetConnectAttr and/or SQLSetStmtAttr) through the ODBC layer
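To make issue 2 concrete, a hedged sketch: row-at-a-time inserts must be rewritten as INSERT ... SELECT, which Hive does support (the sales and sales_staging tables are hypothetical):

-- Not accepted by Hive as of 0.11:
-- INSERT INTO TABLE sales VALUES (1, 'a');
-- Supported alternative: populate from another table
INSERT INTO TABLE sales
SELECT * FROM sales_staging;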

Workload Management

In general, a typical MicroStrategy workload consists of a mix of queries with significantly varying complexity. These queries originate from different MicroStrategy jobs, e.g. element requests, simple grids (select * from), complex analytical multi-pass reports, dashboards that rely on multiple (simple or complex) queries, database write-back from Transaction Services, and Cube reports designed to fill MicroStrategy In-Memory Cubes. Additionally, it must be assumed that other workloads are simultaneously being submitted to Hadoop from other sources.

Workload Management (WLM) is necessary to optimize access to database resources for concurrently executing queries. The goals of functional workload management are to:
- Optimally leverage available (hardware) resources for performance and throughput
- Prioritize access for high-priority jobs
- Assure resource availability by preventing any small set of jobs from locking up the system

Effective workload management starts with comprehensive monitoring, which allows bottleneck conditions to be identified, and then leverages the available platform tools to implement a workload-management strategy that meets all of the above goals.

Workload Management in MicroStrategy

MicroStrategy internally breaks down every user request into one or more jobs that are processed independently. Each job advances through a series of processing steps, one of which might be the submission of multi-pass queries to a database. MicroStrategy is capable of processing and submitting multiple jobs in parallel to a database. By default, MicroStrategy opens multiple connections, called database threads, to any data source. This is illustrated in the diagram below.

A typical BI environment encompasses queries ranging from very quick and urgent to long-running and low priority. To avoid the scenario where a small number of expensive queries block all access to database resources, all database threads are assigned priority classes (High, Medium and Low). When report jobs are submitted, each job is assigned a priority based on a list of application parameters: user group, application type, project, request type and cost. MicroStrategy workload management routes each job according to its priority to the corresponding database threads. When no database thread is available, jobs are queued until a database thread of the appropriate class becomes available. For more information on how to set these priorities, refer to technical note TN5401 (TN5300-071-0124): How to set Group Prioritization in MicroStrategy Intelligence Server 8.x - 9.0.x?

Administrators can, for each priority class and depending on the available database resources, specify the number of warehouse connections required for efficient job processing. They can also monitor queue time in Enterprise Manager; sustained queue time indicates insufficient resources for a given priority class. While the intuitive response is to increase the number of database threads, care must be taken not to overload the database, as this leads at best to suboptimal performance and at worst to unstable BI environments. The optimal number of connections depends on several factors; the main criterion to consider when setting it is the number of concurrent queries the warehouse can support.

For more information on job prioritization and connection mapping, please refer to technical note TN8486 (TN5300-72X-0405): What is Job Prioritization and Connection Management in MicroStrategy Intelligence Server 8.x and 9.x?

Monitoring of MicroStrategy Workload

The primary goals when monitoring workload are to ensure that the workload is utilizing the available hardware resources efficiently, that sufficient hardware resources are available to handle the given workload, and, in case the first conditions are not met, to provide the insights needed to identify issues and areas for improvement.

A good starting point for monitoring the workload of a MicroStrategy implementation is MicroStrategy Enterprise Manager. Enterprise Manager provides an overview of the BI workload and the consumed resources, such as time spent and CPU cycles, and allows the time spent querying the database to be identified, from an aggregated view down to individual user requests.

Enterprise Manager can help achieve detailed analysis in several categories. In operational analysis, one can monitor concurrency, queue and response-time trends by hour or minute, and gain insight into scheduling, caching, prioritization and clustering. This can help summarize the usage, the breakdown of the load and the performance of the system. Below is an example of an out-of-the-box Operational Analysis report (Weekly Summary of Project Activity).

In some cases the time spent on the database will make up a significant portion of the overall processing time for a user request. At that point, any further analysis needs to take place on the database layer.

Workload management in Hadoop

Each Hadoop vendor or distribution has its own workload-management techniques and interfaces to help customers manage their clusters, enable logging and monitoring, and diagnose their environments. Customers on Cloudera's distribution can set up and monitor their cluster using Cloudera Manager; similarly, Hortonworks customers can monitor their resources using Apache Ambari. For details on these or any other technologies, please refer to the distribution's documentation on setting up and using their systems.

Hive Thread Safety and Concurrent Queries

HiveServer was not considered thread-safe until an improved version, HiveServer2, was released in late 2012. HiveServer2 supports Kerberos authentication and multi-client concurrency. Based on our research and technology partnerships with various Hadoop vendors, we recommend that customers run HiveServer2 because of the new functionality and authentication features it offers.

Authentication and Security

To secure communication among its various components and with MicroStrategy, a Hadoop system uses Kerberos or LDAP. In short, Hadoop security uses Kerberos principals and keytabs to perform user authentication on all remote procedure calls. A Kerberos principal represents a unique identity in a Kerberos-secured system; Kerberos assigns tickets to principals to enable them to access Kerberos-secured Hadoop services. Further details on Hadoop security are out of scope for this paper.

HiveServer2 can be configured to authenticate all connections; by default, it allows any client to connect. This is configured through the hive.server2.authentication property in the hive-site.xml file. You can also configure pluggable authentication, which allows you to use a custom authentication provider for HiveServer2, and impersonation, which allows users to execute queries and access HDFS files as the connected user rather than as the super-user who started the HiveServer2 daemon. For more information, please refer to your Hadoop distribution's documentation.
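As a hedged sketch (KERBEROS is one of several supported values; NONE is the default, and your distribution's documentation lists the rest), the property is set in hive-site.xml like so:

<property>
  <name>hive.server2.authentication</name>
  <!-- NONE (the default) allows any client; KERBEROS or LDAP enforce authentication -->
  <value>KERBEROS</value>
</property>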

For guidelines, see Cloudera's Hive Security Configuration documentation and the Hortonworks documentation on configuring secure Hadoop.

For further information on how to connect from MicroStrategy to a secured Cloudera Hadoop cluster, please refer to this TN in MicroStrategy's KBase: TN 43150 - How to configure connectivity to a secured CDH cluster with MicroStrategy 9.x with the Cloudera ODBC driver for the Intelligence Server on a Linux operating system.

For information on how to connect from MicroStrategy to a secured Hortonworks system, refer to Appendix A: Configuring Kerberos Authentication for Windows and Appendix B: Driver Authentication Configuration for Windows in the Hortonworks Hive ODBC Driver User Guide, available at http://hortonworks.com/wp-content/uploads/2013/08/Hortonworks-Hive-ODBC-Driver-User-Guide.pdf

MicroStrategy's Connection Management

In addition to HiveServer2's ability to handle concurrent connections, MicroStrategy provides built-in connection-management logic that can be used to control the number of concurrent requests submitted to the Hive server. Details on how to set up connection management are in MicroStrategy Technical Note TN8486 - What is Job Prioritization and Connection Management in MicroStrategy Intelligence Server 8.x and 9.x (https://resource.microstrategy.com/support/mainsearch.aspx?tnkey=8486).

Canceling a Hive query from MicroStrategy

When a Hive query request is canceled from within MicroStrategy, the job's execution is stopped on the MicroStrategy side. However, this does not currently kill the underlying Hadoop MapReduce job triggered by the Hive query, because the ODBC driver does not yet support the SQLCancel ODBC API call.

High Availability for the Hadoop Distributed File System (HDFS)

For the scope of this paper, it is sufficient to note that High Availability is an option when the availability of the cluster is critical. For detailed implementation techniques, please refer to the specific Hadoop distribution's HA documentation. For Hortonworks, please refer to the documentation at http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_hdp1-system-admin-guide/content/ch_hdp1-high-availability-for-hadoop.html and for the Cloudera distribution, please refer to http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-High-Availability-Guide/CDH4-High-Availability-Guide.html

Hive Metastore Use Cases and Failover Scenarios

This section provides information on the use cases and failover scenarios for high availability (HA) in the Hive metastore.

Use Cases

The metastore HA solution is designed to handle metastore service failures. If a deployed metastore service goes down, it can remain unavailable for a considerable time until the service is brought back up. To avoid such outages, deploy the metastore service in HA mode.

Deployment Scenarios

We recommend deploying the metastore service on multiple hosts concurrently. Each Hive metastore client reads the configuration property hive.metastore.uris to get a list of metastore servers with which it can try to communicate.

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://$Hive_Metastore_Server_Host_Machine_FQDN</value>
  <description>A comma separated list of metastore uris on which metastore service is running</description>
</property>

These metastore servers store their state in a MySQL HA cluster, which has an altogether different set of failover-protection methodologies that are out of scope for this paper. In the case of a secure cluster, each of the metastore servers will additionally need the following configuration property in its hive-site.xml file.

<property>
  <name>hive.cluster.delegation.token.store.class</name>
  <value>org.apache.hadoop.hive.thrift.DBTokenStore</value>
</property>

Failover Scenario

A Hive metastore client always uses the first URI to connect to the metastore server. If that server becomes unreachable, the client picks a URI from the list at random and tries connecting to it.
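Because hive.metastore.uris is a comma-separated list, a two-server HA deployment might look like this (the host names are hypothetical; 9083 is the conventional metastore port):

<property>
  <name>hive.metastore.uris</name>
  <!-- Clients connect to the first URI and fail over to another from the list on error -->
  <value>thrift://metastore1.example.com:9083,thrift://metastore2.example.com:9083</value>
</property>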

Appendix A Other VLDB Settings

VLDB String Insertion Settings when using Permanent Tables and Explicit Table Creation

When Intermediate Table Type is set to Permanent Tables, some additional string-valued settings are enabled so that the user can customize the syntax of the CREATE TABLE statement. Also, when Table Creation Type is set to Explicit, temporary tables are created with separate CREATE TABLE and INSERT statements (implicit table creation does not generate INSERT statements), and some additional string-valued settings are enabled so that the user can customize the syntax of the INSERT statements. The SQL below shows the position of these VLDB settings.

[Report Pre Statement]
[Table Pre Statement]
create [Table Qualifier] table [Table Descriptor][Table Prefix]<table_name> [Table Option]
( <column_specifications> ) [Table Space]
[Create Post String]
[Insert Pre Statement]
[Bulk Insert String]insert into [Table Prefix]<table_name> [Insert Table Option]
select [SQL Hint] <column_expressions>
from <tables_and_joins>
where <filter_expressions>
group by <column_expressions>[Insert Post String]
[Insert Post Statement]
create [Index Qualifier] index [Index Prefix]<index_name> on <table_name>(<column_expressions>) [Index Post String]
[Table Post Statement]
select [SQL Hint] <column_expressions>
from <table_list>
where <joins_and_filter_expressions>
group by <column_expressions>[Select Post String][Select Statement Post String]
[Report Post Statement]
[Cleanup Post Statement]
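As a hedged illustration of how these placeholders resolve (the intermediate table name, columns and source table are hypothetical, and all optional strings are left empty), explicit creation of an intermediate table might emit:

create table ZZMD00 (
  EMP_ID int,
  WJXBFS1 double)

insert into ZZMD00
select a11.emp_id, sum(a11.qty_sold)
from orderdetailpartclust a11
group by a11.emp_id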

Supported Data Types

MicroStrategy supports the following Hive 0.11 data types:
o Primitive: TINYINT, SMALLINT, INT, BIGINT, DOUBLE, FLOAT, STRING (same as VARCHAR), BOOLEAN, TIMESTAMP, DECIMAL
o Complex: ARRAYS, MAPS, STRUCTS

How to Deal with the DATE Data Type in Older Hive Versions

Hive 0.7 and earlier did not have a built-in DATE or TIMESTAMP data type. Starting with Hive 0.8, there is a native TIMESTAMP data type that can be used in MicroStrategy to represent and use all DATE columns. A native DATE data type is still not available as of Hive 0.11. The following section describes the workaround to use if you are on Hive 0.7 or an earlier version, where TIMESTAMP was not supported.

Dates are typically stored as strings in Hive. Although Hive does not have native support for date as a data type, it has very rich function support for operating on date data stored within strings. Please refer to https://cwiki.apache.org/confluence/display/hive/languagemanual+udf#languagemanualudf-DateFunctions for more details.

This section describes how you can configure certain properties in MicroStrategy to handle dates represented by strings in Hive. Right-click the Hive database instance and select VLDB Properties.
o Go to the DATE FORMAT option under the Select/Insert section. Set its value to match how the date string is formatted in your Hive set-up. For instance, if dates are formatted as 2010-11-24, set the date format to yyyy-mm-dd.
o Go to the DATE PATTERN option under the Select/Insert section. Set it to #0. This allows the string-literal notation to be used for date values.

You can also define a lookup table for attributes in the time hierarchy to support additional display forms. For example, you can define an LU_DAY table that has these columns in addition to others (a DDL sketch follows at the end of this section):
o DAY_ID (string): ID column that stores the string representation of dates, e.g. 2010-11-26
o DAY_DESC (string): Description column that stores the display form for the date, e.g. 11/26/2010
o LY_DAY_DATE (string): Column that stores last year's date in ID form, e.g. 2009-11-26

Other table samples:
o DAY_SLS fact table: DAY_DATE (string), CALLCTRID (int), ..., DOLLAR_SALES (float)
o YTD_DAY table: DAY_DATE (string), YTD_DAY_DATE (string)

For a sample schema containing the tables LU_DAY and DAY_SLS, the time-hierarchy attributes can be defined in the following manner:
o Day@ID = LU_DAY.Day_id; DAY_SLS.Day_id
o Day@DESC = LU_DAY.Day_Desc
o *Month@ID = Concat(Year(DAY_SLS.Day_id), Month(DAY_SLS.Day_id))
o *Year@ID = Year(DAY_SLS.Day_id)
o *MonthOfYear@ID = Month(DAY_SLS.Day_id)

Finally, modify the FORM FORMAT of the ID column of the DAY/DATE attribute from STRING to DATE.
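A minimal sketch of the LU_DAY lookup table described above (the column set and types are as listed; storage details would depend on your environment):

-- Hypothetical DDL for the LU_DAY lookup table described above
create table LU_DAY (
  DAY_ID      string,  -- e.g. '2010-11-26'
  DAY_DESC    string,  -- e.g. '11/26/2010'
  LY_DAY_DATE string   -- last year's date in ID form, e.g. '2009-11-26'
);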

Hive's date-manipulation functions (DATEDIFF, DATE_ADD, DATE_SUB, YEAR, MONTH, DAY, etc.) expect dates to be formatted as yyyy-MM-dd. If your date data is not stored in the yyyy-MM-dd format, you can use the unix_timestamp and from_unixtime functions to convert it. For example, if the date is in dd-MM-yyyy format, such as 31-01-2012, you can convert it to the yyyy-MM-dd format using the following expression:

from_unixtime(unix_timestamp(date_id, 'dd-MM-yyyy'), 'yyyy-MM-dd')

You can either define the ID form of your DAY attribute with the expression above, or alternatively populate a new table with the newly formatted DAY column.

Note: Appendix B has examples of reports involving dates that are treated as strings, including reports with basic date manipulations, last-year date calculations, YTD calculations, etc.

*By defining these forms directly on the fact table, we avoid having to join in the day lookup table for any queries that need to roll up by Month, Year or Month of Year.

Transformations in Hive

In MicroStrategy, transformations are schema objects used to compare like values at different times, for example this year versus last year, or date versus month-to-date. Transformations are useful for discovering and analyzing time-based trends in the data. There are two types of transformations: expression-based and table-based. Expression-based transformations use a mathematical formula in their definition, whereas table-based transformations base their definition on a relation or lookup table. In this case, where strings serve as the date data type, we can employ both kinds. Expression-based transformations are useful for 1:1 transformations such as Last Year; an example is sketched below. Table-based transformations may be necessary if you have an irregular calendar and/or want to do 1:M transformations such as YTD, MTD, etc.
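A hedged sketch of an expression-based Last Year transformation over string dates, assuming the yyyy-MM-dd ID format used throughout this section (the column name day_id is hypothetical):

-- Shift a yyyy-MM-dd string date back one year by rewriting the year prefix
concat(cast(cast(substring(day_id, 1, 4) as int) - 1 as string), substring(day_id, 5))

A table-based transformation (such as the LY_DAY_DATE column in LU_DAY) handles edge cases like leap days more gracefully than this purely arithmetic approach.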

MicroStrategy Functions Implemented Using HiveQL Expressions

MicroStrategy 9 ships with over 250 functions available in the Analytical Engine. Although some of these functions do not have a direct analog in Hive, they can still be calculated in Hive using HiveQL expressions built from functions that are available. MicroStrategy provides out-of-the-box HiveQL expression support for many of these functions, either because a corresponding function exists in Hive or because one can be composed from a combination of Hive functions and expressions. Support for some OLAP functions such as RANK became available in Hive 0.11; in versions where it is not supported, RANK is calculated in MicroStrategy's Analytical Engine layer.

In addition to MicroStrategy's built-in functions, users can also leverage additional UDFs and UDAFs offered by Hive that are not yet supported as functions in MicroStrategy. MicroStrategy's Apply (pass-through) functions can be used to invoke Hive UDFs, UDAFs or arbitrary SQL expressions in the SELECT, WHERE, GROUP BY and HAVING clauses of HiveQL. For example, you can get the population for different states from the state XML string (with structure <state><population>xxxx</population><housingunits>yyyy</housingunits>) using this expression with the xpath_string function:

ApplySimple("xpath_string(#0, 'state/population')", state)

The complete list of MicroStrategy functions that are implemented with SQL patterns in Hive 0.11 is shown below.

Table of MicroStrategy/Hive 0.11 Functions

MSTR Function               Recommended Syntax
CoalesceFunction            Not supported
ExceptFunction              Not supported
FirstFunction               MIN(#0)
GeoMeanFunction             EXP(AVG(LN(#0)))
GreatestFunction            MAX(#0)
IFOperator                  (CASE WHEN #0 THEN #1 ELSE #2 END)
IsNotNullFunction           Not supported
IsNullFunction              Not supported
LastFunction                MAX(#0)
LeastFunction               MIN(#0)
MedianFunction              Not supported
MovingDifferenceFunction    Not supported
NotInOperator               Not supported
NotLikeOperator             Not supported
ProductFunction             Not supported
StdevFunction               STDDEV(#0#< #*#>)
StdevPFunction              stddev_pop(#0#< #*#>)
UnionFunction               Not supported
VarFunction                 VARIANCE(#0#< #*#>)
VarPFunction                VAR_POP(#0#< #*#>)
AddDaysFunction             DATE_ADD(#0, #1)
AddMonthsFunction           Not supported
CurrentDateFunction         TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP()))
CurrentDateTimeFunction     FROM_UNIXTIME(UNIX_TIMESTAMP())
CurrentTimeFunction         FROM_UNIXTIME(UNIX_TIMESTAMP())
DateFunction                TO_DATE(#0)
DayOfMonthFunction          DAYOFMONTH(#0)
DayOfWeekFunction           Not supported
DayOfYearFunction           Not supported
DaysBetweenFunction         DATEDIFF(#1, #0)
HourFunction                HOUR(#0)
MilliSecondFunction         Not supported
MinuteFunction              MINUTE(#0)
MonthEndDateFunction        Not supported
MonthFunction               MONTH(#0)
MonthsBetweenFunction       Not supported

MonthStartDateFunction      Not supported
QuarterFunction             Not supported
SecondFunction              SECOND(#0)
WeekFunction                WEEKOFYEAR(#0)
YearEndDateFunction         Not supported
YearFunction                YEAR(#0)
YearStartDateFunction       Not supported
BandingCFunction            Not supported
BandingFunction             Not supported
AbsFunction                 ABS(#0)
AcosFunction                ACOS(#0)
AcoshFunction               LN(#0+SQRT(#0-1)*SQRT(#0+1))
AsinFunction                ASIN(#0)
AsinhFunction               LN(#0+SQRT(POW(#0, 2)+1))
Atan2Function               Not supported
AtanFunction                ATAN(#0)
AtanhFunction               ((LN(1+#0)-LN(1-#0))/2)
CeilingFunction             CEIL(#0)
CosFunction                 COS(#0)
CoshFunction                (EXP(#0)+EXP(-#0))/2
DegreesFunction             DEGREES(#0)
ExpFunction                 EXP(#0)
FloorFunction               FLOOR(#0)
Int2Function                Not supported
IntersectFunction           Not supported
IntFunction                 Not supported
LnFunction                  LN(#0)
Log10Function               LOG10(#0)
LogFunction                 LOG(#1, #0)
ModFunction                 SIGN(#0)*PMOD(ABS(#0), #1)
PowerFunction               POWER(#0, #1)
QuotientFunction            CAST((#0)/(CASE WHEN (#1)=0 THEN NULL ELSE (#1) END) AS INT)
RadiansFunction             RADIANS(#0)
RandbetweenFunction         Not supported
RankFunction                rank() over(#1#2#<,#*#> order by #0#,#<partition by#>#<#, #*#>#)
Round2Function              ROUND(#0, #1)
RoundFunction               ROUND(#0)
SinFunction                 SIN(#0)
SinhFunction                ((EXP(#0)+EXP(#0*(-1)))/2)
SqrtFunction                SQRT(#0)
TanFunction                 TAN(#0)

TanhFunction                (-1+(2/(1+EXP(-2*#0))))
TruncFunction               CAST(#0 AS INT)
NullToZeroFunction          CASE WHEN #0 IS NULL THEN CAST(0 AS DOUBLE) ELSE CAST(#0 AS DOUBLE) END
ZeroToNullFunction          CASE WHEN #0 = 0 THEN NULL ELSE #0 END
FirstInRangeFunction        first_value(#0) over(#1)
LagFunction                 CASE WHEN count(*) OVER ([#P][#O] rows between unbounded preceding and current row) <= #1 THEN #2 ELSE sum(#0#< #*#>) OVER ([#P][#O] rows between #1 preceding and #1 preceding) END
LastInRangeFunction         last_value(#0) over(#1)
LeadFunction                CASE WHEN count(*) OVER ([#P][#O] rows between current row and unbounded following) <= #1 THEN #2 ELSE sum(#0#< #*#>) OVER ([#P][#O] rows between #1 following and #1 following) END
MovingAvgFunction           Not supported
MovingCountFunction         Not supported
MovingMaxFunction           Not supported
MovingMinFunction           Not supported
MovingStdevFunction         Not supported
MovingStdevPFunction        Not supported
MovingSumFunction           Not supported
OLAPAvgFunction             Not supported
OLAPCountFunction           Not supported
OLAPMaxFunction             Not supported
OLAPMinFunction             Not supported
OLAPRankFunction            rank() over([#P][#O])
OLAPSumFunction             Not supported
RunningAvgFunction          Not supported
RunningCountFunction        Not supported
RunningMaxFunction          Not supported
RunningMinFunction          Not supported
RunningStdevFunction        Not supported
RunningStdevPFunction       Not supported
RunningSumFunction          Not supported
CorrelationFunction         Not supported
CovarianceFunction          covar_pop(#0, #1)
FisherFunction              Not supported
InterceptFunction           Not supported
PearsonFunction             Not supported
RSquareFunction             Not supported
SlopeFunction               Not supported
StandardizeFunction         CASE WHEN (#2 > 0) THEN (#0-#1)/(#2) ELSE NULL END

SteYXFunction               Not supported
ConcatBlankFunction         CONCAT(#0#<,' ',#*#>)
ConcatFunction              CONCAT(#0#<, #*#>)
InitCapFunction             CONCAT(UPPER(SUBSTR(#0, 1, 1)), LOWER(SUBSTR(#0, 2)))
LeftStrFunction             SUBSTR(#0, 1, #1)
LengthFunction              LENGTH(#0)
LowerFunction               LOWER(#0)
LTrimFunction               LTRIM(#0)
PositionFunction            INSTR(#0, #1)
RightStrFunction            SUBSTR(#0, (LENGTH(#0) - #1 + 1))
RTrimFunction               RTRIM(#0)
SubStrFunction              SUBSTRING(#0, #1, #2)
TrimFunction                TRIM(#0)
UpperFunction               UPPER(#0)
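In the patterns above, #0, #1, ... stand for the function's arguments. As a hedged sketch of how a pattern expands (the metric and schema reuse the benchmark tables from earlier in this paper), a geometric-mean metric on qty_sold would be emitted as:

-- GeoMeanFunction pattern EXP(AVG(LN(#0))) with #0 = a11.qty_sold
select a11.emp_id, EXP(AVG(LN(a11.qty_sold))) WJXBFS1
from orderdetailpartclust a11
group by a11.emp_id;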

Configuring the MicroStrategy-Hive Environment

Hive Metastore

All the metadata for Hive tables and partitions is stored in the Hive Metastore. Most commercial relational databases and many open-source datastores are supported; in practice, any datastore with a JDBC driver can likely be used. If Hive's Metastore is configured to store metadata locally in an embedded Apache Derby database, note that this configuration allows only a single user to access the Metastore at a time. It is recommended that users instead use a MySQL database as the metastore, in either local or remote mode. See the Hive Metastore documentation (http://wiki.apache.org/hadoop/hive/adminmanual/metastoreadmin) for additional information.

Hive and Query Latency

Hadoop 1.0 is not an interactive query environment. Hadoop 2.0 will support varied workloads, including interactive query, but from a Hadoop 1.0 perspective even the simplest queries can take on the order of tens of seconds to return. This is important when you consider that MicroStrategy submits HiveQL queries to populate pick lists for prompts and element browsing when defining filters, etc. Furthermore, Hive does not process joins as well as mature MPP databases; joins can take a significant amount of time to process with Hive. Below are some tips on improving the performance of Hive queries.

Using the MicroStrategy Multi-Source Option to support element list prompts

A separate RDBMS can be used to hold dimensional data (lookup tables). This can improve the performance of element browsing and of element list prompts defined on the dimensional data. The steps below can be used to configure this:
o Include dimension tables for any attribute used in prompts in both the RDBMS and Hive.
o In the screenshot below, dimension tables for the Time (LU_MONTH, LU_QUARTER and LU_YEAR) and Product (LU_CATEGORY, LU_SUBCATEG, LU_ITEM) dimensions have been modeled from both a MySQL database and Hive.

Figure: Architect screenshot of dimension tables modeled in MySQL and Hive

o Set the RDBMS database instance as the project primary database. In the screenshot below, the MySQL database instance has been set as the project primary database instance.

Figure: Screenshot of the project primary database instance

o This set-up routes element browsing and prompt requests to the RDBMS and can provide improved performance for such requests.
o If you plan to use Freeform SQL reports to query Hive, you can define prompts using attributes modeled in MySQL.
o If you use the MicroStrategy Engine to dynamically generate the HiveQL for reports, you also need to set Hive as the secondary data source (database instance) for the lookup tables.

This type of conformed-dimension set-up eliminates the need to move dimensional data from one system to another at run time, while directing element look-up requests to the lower-latency RDBMS. Note, however, that joins are improved starting with Hive 0.11 as part of the Stinger initiative. For more information, refer to http://hortonworks.com/blog/apache-hive-0-11-stinger-phase-1-delivered/
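As a hedged illustration of what gets routed where (LU_MONTH is from the screenshot above; the column names are assumptions), an element-browse request for Month resolves to a simple lookup query that the low-latency primary RDBMS can answer:

-- Hypothetical element-browse SQL answered by the MySQL primary instance
select distinct MONTH_ID, MONTH_DESC
from LU_MONTH
order by MONTH_ID;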

Appendix B Sample Reports

Sample reports dealing with the lack of a DATE data type in Hive:

Example 1: A basic report with Call Center, Region and Sales on the template, where we select the day to be a particular day, in this case 10/5/2007. The calendar picker correctly treats the string-typed day values as dates.

HiveQL:
select a11.call_ctr_id call_ctr_id,
  max(a12.center_name) center_name,
  a12.region_id region_id,
  max(a13.region_name) region_name,
  sum(a11.tot_unit_sales) WJXBFS1
from tuto901_day_ctr_sls a11
join tuto901_lu_call_ctr a12 on (a11.call_ctr_id = a12.call_ctr_id)
join tuto901_lu_region a13 on (a12.region_id = a13.region_id)
where a11.day_date = '2007-10-05'
group by a11.call_ctr_id, a12.region_id

Example 2: Filter: dates are from a list of 01/01/2007, 01/02/2007, 01/03/2007, etc.

HiveQL:
select SUBSTRING(a11.DAY_DATE, 1, 4) CustCol_2,
  a11.day_date DAY_DATE,
  max(a12.day_date_desc) DAY_DATE_DESC,
  sum(a11.tot_dollar_sales) WJXBFS1
from tuto901_day_ctr_sls a11
join tuto901_lu_day a12 on (a11.day_date = a12.day_date)
where a11.day_date in ('2007-01-01', '2007-01-02', '2007-01-03', '2007-01-04', '2007-01-05', '2007-01-06', '2007-01-07', '2007-01-08', '2007-01-09', '2008-01-30')
group by SUBSTRING(a11.DAY_DATE, 1, 4), a11.day_date;

Example 3:

HiveQL:
select a12.country_id COUNTRY_ID,
  max(a13.country_name) COUNTRY_NAME,
  a11.call_ctr_id CALL_CTR_ID,
  max(a12.center_name) CENTER_NAME,
  SUBSTRING(a11.DAY_DATE, 6, 2) CustCol_3,
  sum(a11.tot_dollar_sales) WJXBFS1
from tuto901_day_ctr_sls a11
join tuto901_lu_call_ctr a12 on (a11.call_ctr_id = a12.call_ctr_id)
join tuto901_lu_country a13 on (a12.country_id = a13.country_id)
where SUBSTRING(a11.DAY_DATE, 6, 2) in ('10', '11', '12')
group by a12.country_id, a11.call_ctr_id, SUBSTRING(a11.DAY_DATE, 6, 2);

Example 4: Last Year transformation report. A basic report that makes use of the Last Year transformation from the LU_DAY table.

HiveQL:
select distinct a13.country_id country_id,
  a15.country_name country_name,
  pa11.call_ctr_id call_ctr_id,
  a13.center_name center_name,
  a13.region_id region_id,
  a14.region_name region_name,
  pa11.wjxbfs1 WJXBFS1,
  pa12.wjxbfs1 WJXBFS2
from (select a11.call_ctr_id call_ctr_id,
    sum(a11.tot_dollar_sales) WJXBFS1
  from tuto901_day_ctr_sls a11
  where a11.day_date in ('2008-12-28', '2008-12-29', '2008-12-30', '2008-12-23', '2008-12-24', '2008-12-25', '2008-12-26', '2008-12-27')
  group by a11.call_ctr_id) pa11
join (select a11.call_ctr_id call_ctr_id,
    sum(a11.tot_dollar_sales) WJXBFS1
  from tuto901_day_ctr_sls a11
  join tuto901_lu_day a12 on (a11.day_date = a12.ly_day_date)
  where a12.day_date in ('2008-12-28', '2008-12-29', '2008-12-30', '2008-12-23', '2008-12-24', '2008-12-25', '2008-12-26', '2008-12-27')
  group by a11.call_ctr_id) pa12
  on (pa11.call_ctr_id = pa12.call_ctr_id)
join tuto901_lu_call_ctr a13

  on (pa11.call_ctr_id = a13.call_ctr_id)
join tuto901_lu_region a14 on (a13.region_id = a14.region_id)
join tuto901_lu_country a15 on (a13.country_id = a15.country_id)

Example 5: Using a greater-than filter on a date, for example 11/08/2008:

HiveQL:
select a12.region_id region_id,
  max(a14.region_name) region_name,
  a11.day_date day_date,
  max(a13.day_date_desc) day_date_desc,
  sum(a11.tot_dollar_sales) WJXBFS1
from tuto901_day_ctr_sls a11
join tuto901_lu_call_ctr a12 on (a11.call_ctr_id = a12.call_ctr_id)
join tuto901_lu_day a13 on (a11.day_date = a13.day_date)
join tuto901_lu_region a14 on (a12.region_id = a14.region_id)
where a11.day_date > '2008-11-08'
group by a12.region_id, a11.day_date

Example 6: Year-to-Date report. Filter: Day in list: 01/04/2007, 01/05/2008

HiveQL:
select a12.day_date day_date,
  max(a14.day_date_desc) day_date_desc,
  a13.country_id country_id,
  max(a16.country_name) country_name,
  a13.region_id region_id,
  max(a15.region_name) region_name,
  sum(a11.tot_dollar_sales) WJXBFS1
from tuto901_day_ctr_sls a11
join tuto901_ytd_day a12 on (a11.day_date = a12.ytd_day_date)
join tuto901_lu_call_ctr a13 on (a11.call_ctr_id = a13.call_ctr_id)
join tuto901_lu_day a14 on (a12.day_date = a14.day_date)
join tuto901_lu_region a15 on (a13.region_id = a15.region_id)
join tuto901_lu_country a16 on (a13.country_id = a16.country_id)
where a12.day_date in ('2007-01-04', '2008-01-05')
group by a12.day_date, a13.country_id, a13.region_id

References

Apache Hive Wiki: https://cwiki.apache.org/hive/
Hadoop: The Definitive Guide, 2nd Edition, O'Reilly / Yahoo! Press
Cloudera Resources: http://www.cloudera.com/content/support/en/documentation.html
Hortonworks Resources: http://docs.hortonworks.com/