Self-service BI for big data applications using Apache Drill
2015 MapR Technologies
Data Is Doubling Every Two Years
  Unstructured data will account for more than 80% of the data collected by organizations.
  [Chart: total data stored (structured and semi-structured) versus IT resources, 1980-2020]
  Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
Data Increasingly Stored in Non-Relational Datastores
  Relational databases (fixed schema; DBA controls structure)
    Volume: GBs-TBs
    Structure: structured
    Development: planned (release cycle = months-years)
  Non-relational datastores (dynamic / flexible schema; application controls structure)
    Volume: TBs-PBs
    Structure: structured, semi-structured and unstructured
    Development: iterative (release cycle = days-weeks)
  [Timeline: 1980-2020]
How To Bring SQL Into An Unstructured Future?
  Familiarity of SQL: SQL; BI (Tableau, MicroStrategy, etc.); low latency; scalability
  Agility & flexibility of NoSQL: no schema management (HDFS (Parquet, JSON, etc.), HBase); no transform or silos of data
Industry's First Schema-free SQL Engine for Big Data
Apache Drill Brings Flexibility & Performance
  Access to any data type, any data source: relational; nested data; schema-less
  Rapid time to insights: query data in-situ; no schemas required; easy to get started
  Integration with existing tools: ANSI SQL; BI tool integration
  Scale in all dimensions: TB-PB of scale; 1000s of users; 1000s of nodes
  Granular security: authentication; row/column level controls; de-centralized
Extending Self Service to Schema-free Data
  [Chart: agility & business value versus use cases for BI, by era]
  1980s-1990s: IT-driven BI (IT-created reports, spreadsheets)
  2000s: IT-driven BI plus self-service BI (analyst-driven with IT support for ETL)
  Now: IT-driven BI and self-service BI plus schema-free data exploration (analyst-driven with no IT dependency)
Enabling As-It-Happens Business with Instant Analytics
  Governed approach: Hadoop data, data modeling, transformation, data movement (optional), users. Source data evolution and new business questions restart the cycle. Total time to insight: weeks to months.
  Exploratory approach: Hadoop data, users. Total time to insight: minutes.
Drill's Role in the Enterprise Data Architecture
  Raw data: JSON, CSV, ...
  Optimized data: Parquet
  Centrally-structured data: schemas in Hive Metastore
  Relational data: highly-structured data
  Engines shown: exploration (known and unknown questions); Hive, Impala, Spark SQL; Oracle, Teradata
Business Benefits
  Rapid time-to-value for business analysts: SQL specialists and BI analysts can query any dataset, including complex nested data, instantly, versus waiting several weeks for data preparation by IT.
  Efficiency with easy governance for IT: IT can avoid unnecessary ETL cycles and schema maintenance activities, but still ensure governance through easy-to-deploy granular access controls.
  Accelerated big data adoption for businesses: Organizations can use the existing and large SQL talent base and tools to rapidly discover new business insights from big data.
Quick Tour: Self-Service Data Exploration with Apache Drill
Data is growing fast and is scattered across various silos:
  Customers: CSV files
  Website click logs: JSON files
  Product database: MapR-DB (NoSQL)
Apache Drill: SQL in a Non-Relational World
  WANT: ANSI SQL; BI (Tableau, MicroStrategy, etc.); low latency; scalability; agility
  DON'T WANT: create and maintain schemas in advance (HDFS (Parquet, JSON, etc.), HBase); transform, copy, or move data
Closing The Gap Between Different Datasources Using Drill
  Customers (CSV): Cust_id, Customername, State, Gender, Agg_rev, Age, Membership
  Website click logs (JSON): Trans_id, Sess_date, Cust_id, Device, Prod_id, Purch_flag
  Product database (NoSQL, HBase / MapR-DB): Prod_id, Productname, Category, Price
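The three sources above share join keys (Cust_id between customers and click logs, Prod_id between click logs and products). The following is a minimal Python sketch of that join logic, not Drill itself: the sample values are invented, a plain dict stands in for the NoSQL product table, and the function names are hypothetical. With Drill the same result would be expressed as a single SQL query over the three sources.

```python
import csv
import io
import json

# Hypothetical sample rows: field names follow the slide, values are invented.
CUSTOMERS_CSV = """cust_id,customername,state,gender,agg_rev,age,membership
1,Alice,CA,F,180,34,gold
2,Bob,CO,M,75,41,basic
"""

CLICKS_JSON = """
{"trans_id": 100, "sess_date": "2015-01-10", "cust_id": 1, "device": "mobile", "prod_id": "p1", "purch_flag": true}
{"trans_id": 101, "sess_date": "2015-01-11", "cust_id": 2, "device": "desktop", "prod_id": "p2", "purch_flag": false}
{"trans_id": 102, "sess_date": "2015-01-12", "cust_id": 1, "device": "tablet", "prod_id": "p2", "purch_flag": true}
"""

# Assumption: a dict keyed by prod_id stands in for the HBase / MapR-DB table.
PRODUCTS = {
    "p1": {"productname": "Widget", "category": "tools", "price": 20.0},
    "p2": {"productname": "Gadget", "category": "toys", "price": 35.0},
}


def load_customers(text):
    """Parse the CSV text into a dict keyed by integer cust_id."""
    return {int(row["cust_id"]): row for row in csv.DictReader(io.StringIO(text))}


def load_clicks(text):
    """Parse newline-delimited JSON click records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def purchases_by_state(customers, clicks, products):
    """Join clicks to customers and products, then sum purchase value per state."""
    totals = {}
    for click in clicks:
        if not click["purch_flag"]:
            continue  # only count sessions that ended in a purchase
        state = customers[click["cust_id"]]["state"]
        price = products[click["prod_id"]]["price"]
        totals[state] = totals.get(state, 0.0) + price
    return totals


result = purchases_by_state(
    load_customers(CUSTOMERS_CSV), load_clicks(CLICKS_JSON), PRODUCTS
)
print(result)
```

The point of the sketch is that no source had to be converted or loaded into a common store first; each is read in its native format and joined on the shared keys.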
Demo
In lieu of the live demonstration, please find links below:
  Apache Drill with Tableau (4:28): https://www.youtube.com/watch?v=eh0_vrtakyk
  Twitter analytics with Apache Drill and MicroStrategy (5:02): https://www.youtube.com/watch?v=-gqwgahtc2y
  Analyzing JSON and Packet Data with SAP Lumira and Apache Drill: https://www.youtube.com/watch?v=s-featdi2wa
Access control that scales
  PAM authentication + user impersonation
  [Diagram: users query Drill View 1 and Drill View 2, which sit on top of files, HBase and Hive]
  Fine-grained row and column level access control with Drill views; no centralized security repository required
Granular security permissions through Drill views
  Raw file (/raw/cards.csv). Owner: Admins. Permission: Admins.
    Name  City      State  Credit Card #
    Dave  San Jose  CA     1374-7914-3865-4817
    John  Boulder   CO     1374-9735-1794-9711
  Business Analyst view. Owner: Admins. Permission: Business Analysts.
    Name  City      State
    Dave  San Jose  CA
    John  Boulder   CO
  Data Scientist view (/views/maskedcards.csv). Owner: Admins. Permission: Data Scientists.
    Name  City      State  Credit Card #
    Dave  San Jose  CA     1374-1111-1111-1111
    John  Boulder   CO     1374-1111-1111-1111
  The views are not a physical data copy.
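The two views above can be illustrated with ordinary SQL views. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for Drill: the table and view names are hypothetical, and SQLite has no per-view permissions, so the owner and permission settings shown on the slide are not modeled here. It only shows how a view can drop or mask a column without copying the underlying data.

```python
import sqlite3

# In-memory database standing in for the raw file /raw/cards.csv.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cards (name TEXT, city TEXT, state TEXT, card_number TEXT)"
)
conn.executemany(
    "INSERT INTO cards VALUES (?, ?, ?, ?)",
    [
        ("Dave", "San Jose", "CA", "1374-7914-3865-4817"),
        ("John", "Boulder", "CO", "1374-9735-1794-9711"),
    ],
)

# Business analyst view: the card number column is left out entirely.
conn.execute("CREATE VIEW analyst_view AS SELECT name, city, state FROM cards")

# Data scientist view: keep the first four digits, mask the remaining groups.
conn.execute(
    "CREATE VIEW scientist_view AS "
    "SELECT name, city, state, "
    "substr(card_number, 1, 4) || '-1111-1111-1111' AS card_number "
    "FROM cards"
)

analyst_rows = conn.execute("SELECT * FROM analyst_view ORDER BY name").fetchall()
scientist_rows = conn.execute("SELECT * FROM scientist_view ORDER BY name").fetchall()

print(analyst_rows)
print(scientist_rows)
```

Both views are evaluated against the single raw table at query time, which mirrors the slide's point that a view is not a physical data copy.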
Case Studies
Self-Service Data Exploration
  Direct access to any data store from familiar tools; ANSI SQL compatible
  Use cases: raw data exploration; JSON analytics; DWH offload
  Data sources: files ({JSON}, Parquet, text files); directories; Hive; HBase
Data Warehouse Offload with Drill & MapR
  Ultimately replace existing expensive SQL analytics platform with Hadoop
  OBJECTIVES
    Mine credit card data and compare consumer shopping habits
    Require internal SQL specialists to gain instant access to data at all times
  CHALLENGES
    Want to preserve instant access to data but at a lower price point
    Need a system that is reliable, does not lose data and is fast
    Must be able to leverage the SQL skill sets in the company
  SOLUTION
    Apache Drill allows interactive analysis on large datasets, with MapR as the underlying platform that meets scale, reliability and data protection needs
    SQL users did not have to learn Pig, HiveQL or any other language, and continue to use Tableau and SQuirreL on top of Drill
  BUSINESS IMPACT POTENTIAL
    Hadoop and Drill dramatically reduce the price point to less than $1,000 / TB
    MapR platform with Drill delivers reliability and performance for the end users
    Leverage existing BI and SQL skill sets on Hadoop without retraining
Telecom OEM Application with Drill & MapR
  Leverage Drill's JSON capabilities to create revenue-generating IoT services
  OBJECTIVES
    Offer a service to mobile operators to proactively monitor and improve their subscriber experience
    Instant availability of data from diverse and disparate sources
  CHALLENGES
    Data is very diverse and dynamic, using JSON as the key format
    Require interactive, ad-hoc analysis capabilities via standard BI tools such as Tableau and Spotfire
  SOLUTION
    Apache Drill is being used to build the engine for the interactive experience
    Drill allows SQL queries on incoming JSON structures natively, without requiring any centralized schema definitions
    Drill connects to all BI tools using standard ODBC connectors
  BUSINESS IMPACT POTENTIAL
    Provide new revenue-generating services to mobile operators
    Enable deeper, instant intelligence about the networks and users
    Reduce maintenance costs: no IT intervention required for schema changes
Recap: Apache Drill Enables Self-Service SQL for Big Data
  AGILITY: INSTANT INSIGHTS TO BIG DATA
    Direct queries on self-describing data
    No schemas or ETL required
  FLEXIBILITY: ONE INTERFACE FOR HADOOP & NOSQL
    Query HBase and other NoSQL stores
    Use SQL to natively operate on complex data types (such as JSON)
  FAMILIARITY: EXISTING SKILLS & TECHNOLOGIES
    Leverage ANSI SQL skills and BI tools
    Plug-n-play with Hive schema, file formats, UDFs
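To make "self-describing data" concrete, here is a small Python sketch of schema discovery at read time. It is an illustration of the idea only and makes no claim about how Drill works internally: the sample records and the `flatten` helper are hypothetical. Each JSON record carries its own field names, so the set of columns is derived from the data rather than declared in advance, and records are free to differ in shape.

```python
import json

# Hypothetical records: the second one lacks "geo" and adds a "device" field.
RECORDS = """
{"id": 1, "user": {"name": "Dave", "geo": {"state": "CA"}}}
{"id": 2, "user": {"name": "John"}, "device": "mobile"}
"""


def flatten(value, prefix=""):
    """Flatten nested dicts into dotted column names, e.g. user.geo.state."""
    flat = {}
    for key, item in value.items():
        path = f"{prefix}{key}"
        if isinstance(item, dict):
            flat.update(flatten(item, path + "."))
        else:
            flat[path] = item
    return flat


# Read every record, then derive the column set from what was actually seen.
rows = [flatten(json.loads(line)) for line in RECORDS.splitlines() if line.strip()]
columns = sorted({name for row in rows for name in row})

# Fields missing from a record become None instead of causing a schema error.
table = [[row.get(name) for name in columns] for row in rows]

print(columns)
for line in table:
    print(line)
```

A new field appearing in later records simply adds a column on the next read, which is the behavior the "no schemas or ETL required" bullet refers to.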
Learn more and get started with Apache Drill
  New to MapR and/or Drill?
    Get started with free MapR On Demand training
    Test drive Drill on cloud with Amazon EMR
    Learn how to use Drill with Hadoop using MapR sandbox
  Ready to play with your data?
    Try out the "Apache Drill in 10 mins" guide on your desktop
    Download Drill for your MapR cluster and start exploration
    Use both with relational and JSON datasets
  Comprehensive tutorials and documentation available
  Ask questions: user@drill.apache.org
Thank You
  Contact: muddenfeldt@mapr.com, mkieboom@mapr.com
  Social: @mapr, maprtech, mapr-technologies, MapRTechnologies, maprtech
Backup Slides
MapR with Drill is Top-Ranked SQL-on-Hadoop
  [Chart key: the number indicates a company's relative strength across all vectors; the size of the ball indicates a company's relative strength along an individual vector]
  Source: Gigaom Research, 2015
  "Like other vendors' offerings, Drill handles BI and interactive queries with great aplomb, but it is designed to serve these workloads with data complexity that goes well beyond the flat structured data that other SQL-on-Hadoop systems deal with."
SQL technologies available on MapR
  Key use cases
    Drill: self-service data exploration; interactive BI / ad-hoc queries
    Hive: batch / ETL / long-running jobs
    Impala: interactive BI / ad-hoc queries
    Spark SQL: SQL as part of Spark pipelines / advanced analytic workflows
  Data sources
    Files support. Drill: Parquet, JSON, Text, all Hive file formats. Hive: yes (all Hive file formats). Impala: yes (Parquet, Sequence, RC, Text, AVRO). Spark SQL: Parquet, JSON, Text, all Hive file formats.
    HBase/MapR-DB. Drill: yes. Hive: yes, performance issues. Impala: yes, performance issues. Spark SQL: same as Hive.
    Beyond Hadoop. Drill: yes. Hive: no. Impala: no. Spark SQL: yes.
  Data types
    Relational. Drill: yes. Hive: yes. Impala: yes. Spark SQL: yes.
    Complex/nested. Drill: yes. Hive: limited. Impala: no. Spark SQL: limited.
  Metadata
    Schema-less / dynamic schema. Drill: yes. Hive: no. Impala: no. Spark SQL: limited.
    Hive Metastore. Drill: yes. Hive: yes. Impala: yes. Spark SQL: yes.
  SQL / BI tools
    SQL support. Drill: ANSI SQL. Hive: HiveQL. Impala: HiveQL. Spark SQL: ANSI SQL (limited) & HiveQL.
    Client support. Drill: ODBC/JDBC. Hive: ODBC/JDBC. Impala: ODBC/JDBC. Spark SQL: ODBC/JDBC.
  Platform
    Beyond memory. Drill: yes. Hive: yes. Impala: yes. Spark SQL: yes.
    Optimizer. Drill: limited. Hive: limited. Impala: limited. Spark SQL: limited.
    Latency. Drill: low. Hive: medium. Impala: low. Spark SQL: low (in-memory) / medium.
    Concurrency. Drill: high. Hive: medium. Impala: high. Spark SQL: medium.
Key Reasons for Selecting MapR
  [Chart: respondents who had prior experience with another Hadoop distribution*]
  * Apache Hadoop, Cloudera or Hortonworks
MapR: The Only Platform Architected For Big, Fast, Reliable
  [Architecture diagram, top to bottom]
  APACHE HADOOP AND OSS ECOSYSTEM
    Categories: Batch; ML, Graph; SQL; NoSQL & Search; Streaming; Data Integration & Access; Security; Workflow & Data Governance; Provisioning & coordination
    Projects: Spark, Drill, Cascading, GraphX, Spark SQL, Hue, Pig, MLLib, Impala, Solr, Storm, HttpFS, Savannah, Tez, MapReduce v1 & v2, Mahout, Hive, YARN, HBase, Spark Streaming, Flume, Sqoop, Juju, Sentry, Oozie, ZooKeeper
    Layers: EXECUTION ENGINES; DATA GOVERNANCE AND OPERATIONS
  MapR Data Platform (Random Read/Write)
    MapR-FS (HDFS and NFS APIs)
    MapR-DB (High-Performance NoSQL)
  Callouts
    Your choice of SQL
    Open source projects inherit MapR's platform attributes
    Trillion files
    2-11x faster
    More efficient use of infrastructure (30-50% lower TCO)
    Industry's only mirroring, point-in-time consistent snapshots
    First new database designed for operational real-time
MapR: Best Solution for Customer Success
  Best Product; High Growth; 700+ Customers; Premier Investors; Apache Open Source
  2X Growth in Direct Customers
  2X Growth in Annual Subscriptions (ACV)
  140% Dollar-based Net Expansion
  90% Subscription Licenses Software Margins