Dashboard Engine for Hadoop
Matt McDevitt, Sr. Project Manager
Pavan Challa, Sr. Data Engineer
June 2015
Think Big. Start Smart. Scale Fast.
Agenda
- Think Big Overview
- Engagement Model
- Solution Offerings
- Dashboard Engine
- Demo
- Q&A
2015 Think Big, a CONFIDENTIAL Teradata Company
Think Big Overview
- Founded in 2010, acquired in 2014, international in 2015
- First and leading professional services firm exclusively focused on big data
- End-to-end services: strategy, design, implementation, IP/software, support, and managed services
- Academy to scale delivery capability
- Extend and integrate open source with UDA
- Team-based delivery with Solution Center
- Growing quickly: we're hiring!
Think Big Engagement Model
Think Big Analytics VELOCITY Methodology
[Diagram labels: New Data; Big Data Approach; Use Cases; Roadmap; Big Data Lab; Business Analytics; New Models; New Analytics; New Insights; New Data Requirements; Big Data Program Mgt; Solutions Planning and Design; Prioritization; Capability Backlog; Grooming for engineering; Data Science; Discovery; R&D; Managed Services; Quality Assurance & Test; Managed Support; Break Fix; Sustaining Engineering; Data Engineering; Engineering Sprint(s); Releases]
Think Big Solution Offerings
1. Big Data Strategy Roadmap
2. Data Lake Starter Program
3. Data Lake Optimization
4. Data Lake Managed Services
5. Presto for the Enterprise (new as of June 10, 2015)
6. Big Data Managed Services
7. Think Big Academy
Custom Analytics Solution Services
- Device Data Manufacturing Operations
- Omni-Channel Marketing Analytics
- Financial Services Fraud/Risk Analytics
- Healthcare personalization
- Device Data Behavior Analytics
- IT Threat Detection
- Public Sector Risk Analysis
- Gaming Analytics
Data Lake Implementation MAKING BIG DATA COME ALIVE
Data Lake Program Offers
- Data Lake: Starter Program
  - Stand up a Data Lake and build 3 governed batch data ingest streams
  - Includes services and subscription software frameworks
- Data Lake: Optimization
  - Add governance to your Data Lake
  - For Data Lakes not originally built by Think Big
- Data Lake: Dashboard Engine Reporting
  - Install and configure the engine with the Data Lake to build dashboard analytics for deep dimensional rollup reporting with Tableau on Hadoop
- Data Lake: Security
  - Data Security & InfoSec, cluster hardening, perimeter, connectivity
- Data Lake: Managed Services
  - Only for Data Lakes that Think Big designs and builds
  - On premise, public cloud (AWS), and private cloud (Teradata and Altiscale)
Think Big Data Lake Starter Program (8-Week Engagement)
Objective: Design, develop, and deploy Data Lake ingestion with governance
Phases (2 weeks each): Design; Build & Test; Integrate & Tune; Assess, Mentor & Plan
Activities:
- Collaborative workshops with business groups
- Identification and prioritization of high-value data streams
- Gap analysis
- Data stream prioritization
- Develop ingest workflows
- Install Metadata and Info Security services
- Prepare cluster for integration test
- Development and unit testing
- Install ingest and system test
- Begin profiling data
- System integration testing
- Learn about information security and data wrangling
- Begin building DL reporting
- Final tuning, assessment, and next steps
[Remaining slide labels: Organization & Training; Data Sources; Cluster Configuration & Integration; Info Security Objectives; Software Component Installation; Data Profiling and Capability; Follow-up Roadmap; Executive Presentation]
Think Big Enterprise Data Lake
[Diagram labels: Perimeter-Authentication-Authorization; Sequence; Automate; Prepare Source Metadata; Collect & Manage; Apply Structure; Evaluate Source Data; Ingest Metadata; Prepare Data for Ingest; Information Sources; InfoSec; Compress; Protect; Dashboard Engine; Downstream Applications; Enterprise Data Lake]
[Architecture diagram labels: API; Statistics; Graph Analytics; Dashboard Engine; Realtime Processing; Machine Learning; Discovery Zone; Kafka; Spark; Experimental Data; Data Lab; Msg Queue; CDC; Raw Data Processing; Derived Views; Buffer Server; Governed Ingestion; Data Repository; Metadata Repository; Security, Archival; Loom integrated (Metadata, lineage, Wrangling); RainStor (System of Record, Archive)]
Why a Dashboard Engine?
[Diagram labels: Events; Hadoop]
Think Big Dashboard Engine Strengths
- Near real-time analytics
- Easily scales to hundreds of simultaneous users
- Query latency typically under 100 ms
- Deep dimensional drill-down
- Works with popular BI tools
  - JavaScript, jQuery
  - Tableau
  - Others to be announced soon
Using Tableau without Dashboard Engine
- Queryable data is limited by the size of the server.
- Doesn't scale as the number of users grows.
[Diagram labels: Hadoop; Extract; Middle Tier Server]
Using Impala without Think Big Dashboard Engine
- For the time the query is running, most or all of the cluster is dedicated to that one query.
  - Has limitations if the cluster has other loads
  - Has limitations for simultaneous dashboard users
- Low latencies are possible only if all the event data is in RAM at query time.
Dashboard Engine Architecture
Think Big's Dashboard Engine for Hadoop
- Uses the power of Apache Spark to pre-aggregate data
- Scales as event volume grows
- Scales as the number of users grows
[Diagram label: API]
Store cube data
Key                               Value
Arrivals-a:SFO-s:CA-2014-01-02    433
Arrivals-a:SFO-s:CA-2014-01-03    479
Arrivals-a:SFO-s:CA-2014-01-04    429
Arrivals-s:CA-2014-01-02          1911
Arrivals-s:CA-2014-01-03          2053
Arrivals-s:CA-2014-01-04          1965
Arrivals-2014-01-02               14158
Arrivals-2014-01-03               14269
Arrivals-2014-01-04               14147
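The key layout on this slide suggests that each event contributes to one counter per rollup level (airport within state, state, and overall) per day. The following is a minimal sketch of that idea in plain Python, not the engine's Spark implementation. The function names, the `a`/`s` prefixes as dimension abbreviations, and the sample events are assumptions made for illustration.

```python
from collections import Counter

def rollup_keys(metric, dims, day):
    """Return one key per rollup level, finest level first.

    dims is ordered fine-to-coarse, e.g. [("a", "SFO"), ("s", "CA")].
    Dropping leading dimensions one at a time gives the coarser levels.
    """
    keys = []
    for i in range(len(dims) + 1):
        parts = [metric] + [f"{name}:{value}" for name, value in dims[i:]] + [day]
        keys.append("-".join(parts))
    return keys

def build_cube(events):
    """Pre-aggregate (airport, state, day, arrivals) events into cube counters."""
    cube = Counter()
    for airport, state, day, arrivals in events:
        for key in rollup_keys("Arrivals", [("a", airport), ("s", state)], day):
            cube[key] += arrivals
    return cube

# Hypothetical events, not the demo data set.
events = [
    ("SFO", "CA", "2014-01-02", 3),
    ("LAX", "CA", "2014-01-02", 5),
    ("JFK", "NY", "2014-01-02", 2),
]
cube = build_cube(events)
```

With the counters stored this way, a dashboard query at any level becomes a single key lookup rather than a scan over raw events.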
API - Connecting to the Dashboard Engine
- Aggregate API that understands metrics, dimensions, and time ranges.
- Relational API that understands (some) SQL.
[Diagram labels: Aggregate API; SQL API]
Demo
Flight Data Statistics for Demo
- Running on a 16-node cluster (TD Appliance for Hadoop)
- Process and store all data in ~2 hours

                  Rows           Storage space
Flight records    160 million    30 GB
MOLAP cube        35 billion     2.1 TB
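The table implies how much the pre-aggregated cube expands relative to the raw flight records. A quick calculation from the slide's figures makes the trade-off concrete. Decimal units (1 GB = 10**9 bytes) are an assumption here, and because the slide's numbers are rounded, the results are approximate.

```python
# Figures from the slide; decimal units are an assumption.
flight_rows = 160 * 10**6      # 160 million flight records
cube_rows = 35 * 10**9         # 35 billion MOLAP cube rows
flight_bytes = 30 * 10**9      # 30 GB
cube_bytes = 2100 * 10**9      # 2.1 TB

row_expansion = cube_rows / flight_rows          # cube rows per flight record
storage_expansion = cube_bytes / flight_bytes    # cube storage relative to raw data
bytes_per_flight_row = flight_bytes / flight_rows
bytes_per_cube_row = cube_bytes / cube_rows

print(f"rows: x{row_expansion}, storage: x{storage_expansion}")
print(f"bytes per row: raw {bytes_per_flight_row}, cube {bytes_per_cube_row}")
```

On these figures the cube holds roughly 219 rows per flight record but only about 70 times the storage, since each cube row is much smaller than a raw record.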
SQL Query to REST API Example
Sends SQL queries to the API:
  SELECT FlightData.Date AS "none_date_ok",
         FlightData.State AS "none_state_nk",
         SUM(FlightData.Arrivals) AS "sum_arrivals_nk"
  FROM "default"."flightdata" "FlightData"
  GROUP BY "none_date_ok", "none_state_nk"
Translated to Aggregate API queries:
  http://10.25.12.241:52080/clickstream/aggregate/v1/?period=day&start=1970-01-01&dimension=state:&metric=arrivals
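The translation target can be illustrated by assembling the Aggregate API URL from the pieces a SQL query supplies: the grouped dimensions, the summed metric, and the period. This is a minimal sketch. The helper `aggregate_url` and its parameter names are hypothetical, and only the query-string parameters that appear on the slides (`period`, `start`, `end`, `dimension`, `metric`) are used.

```python
from urllib.parse import urlencode

# Endpoint as shown on the slide.
BASE_URL = "http://10.25.12.241:52080/clickstream/aggregate/v1/"

def aggregate_url(metric, dimensions, period="day", start="1970-01-01", end=None):
    """Build an Aggregate API URL (hypothetical helper).

    dimensions is a list of (name, value) pairs; an empty value means
    "list every value of this dimension", written as "name:" in the URL.
    """
    params = [("period", period), ("start", start)]
    if end is not None:
        params.append(("end", end))
    for name, value in dimensions:
        params.append(("dimension", f"{name}:{value}"))
    params.append(("metric", metric))
    # Keep ":" unescaped so the URL matches the form shown on the slide.
    return BASE_URL + "?" + urlencode(params, safe=":")

# GROUP BY state, SUM(arrivals), per day, as in the SQL above.
url = aggregate_url("arrivals", [("state", "")])
```

The date column in the GROUP BY maps to `period=day` rather than to a `dimension` parameter, which is why only `state` appears as a dimension.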
Example index: List all Airports for a specific State
<index name="airportsbystate">
  <periods>
    <period>day</period>
  </periods>
  <indexdimensions>
    <dimension name="state" />
  </indexdimensions>
  <listdimensions>
    <dimension name="airport" />
  </listdimensions>
</index>
Aggregate use: Show arrivals for all airports for NY
http://10.25.12.241:52080/clickstream/aggregate/v1/?period=day&start=2014-01-04&end=2014-01-05&dimension=Airport:&dimension=State:NY&metric=Arrivals&headers=on

Day Start     Airport   State   Arrivals
2014-01-04    ALB       NY      20
2014-01-04    ART       NY      1
2014-01-04    BUF       NY      40
...
2014-01-04    JFK       NY      167
2014-01-04    LGA       NY      206
2014-01-04    ROC       NY      17
2014-01-04    SWF       NY      2
2014-01-04    SYR       NY      14
Index: List Flight No / Carrier / City / State combinations
<index name="listflightnocarriercitystate">
  <periods>
    <period>day</period>
  </periods>
  <indexdimensions>
  </indexdimensions>
  <listdimensions>
    <dimension name="state" />
    <dimension name="city" />
    <dimension name="carrier" />
    <dimension name="flightno" />
  </listdimensions>
</index>
Dimensions use: Show all Flight/Carrier/City/State
http://10.25.12.241:52080/clickstream/dimensions/v1/?period=day&start=2014-01-04&end=2014-01-05&dimension=State:&dimension=City:&dimension=Carrier:&dimension=flightno:

"results":[
  ["AK","Anchorage, AK","AS","101"],
  ["AK","Anchorage, AK","AS","102"],
  ["AK","Anchorage, AK","AS","103"],
  ["AK","Anchorage, AK","AS","106"],
  ["AK","Anchorage, AK","AS","108"],
  ...
  ["AL","Huntsville, AL","DL","1782"],
  ["AL","Huntsville, AL","DL","2077"],
  ...
  ["WY","Rock Springs, WY","OO","7413"]]
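A client can turn the dimensions response into something a dashboard control can use, such as a per-state list of flights. The sketch below assumes the response body is a JSON object with a "results" array, as the fragment on the slide suggests, and that each row lists values in the same order as the `dimension=` parameters in the request. The function name and the abbreviated sample body are illustrative.

```python
import json
from collections import defaultdict

# Column order assumed to follow the dimension= parameters of the request.
COLUMNS = ("state", "city", "carrier", "flightno")

def flights_by_state(body):
    """Group (carrier, flightno) pairs by state from a dimensions response."""
    rows = json.loads(body)["results"]
    grouped = defaultdict(list)
    for row in rows:
        record = dict(zip(COLUMNS, row))
        grouped[record["state"]].append((record["carrier"], record["flightno"]))
    return dict(grouped)

# Abbreviated sample built from rows shown on the slide.
sample = (
    '{"results": ['
    '["AK", "Anchorage, AK", "AS", "101"], '
    '["AK", "Anchorage, AK", "AS", "102"], '
    '["AL", "Huntsville, AL", "DL", "1782"]]}'
)
by_state = flights_by_state(sample)
```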
Index Question
Q: How do you drill down to the list of Delta flights that caused delays in Colorado?
A: Create the index below, rerun the index creation step, then query the delay metrics for the given state and carrier while listing flight numbers (dimension=flightno:).
<index name="listflightnobycarrierstate">
  <periods>
    <period>day</period>
  </periods>
  <indexdimensions>
    <dimension name="state" />
    <dimension name="carrier" />
  </indexdimensions>
  <listdimensions>
    <dimension name="flightno" />
  </listdimensions>
</index>
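The examples in this deck suggest a pattern: dimensions given a value in the query (such as State:NY) appear under indexdimensions, and dimensions left empty (such as Airport:) appear under listdimensions. The sketch below checks a query against an index definition under that inferred rule. The exact matching semantics of the engine are an assumption, `index_matches` is a hypothetical helper, and the codes CO and DL stand in for Colorado and Delta.

```python
import xml.etree.ElementTree as ET

INDEX_XML = """
<index name="listflightnobycarrierstate">
  <periods><period>day</period></periods>
  <indexdimensions>
    <dimension name="state" />
    <dimension name="carrier" />
  </indexdimensions>
  <listdimensions>
    <dimension name="flightno" />
  </listdimensions>
</index>
"""

def dimension_names(index, section):
    """Return the lowercased dimension names declared under one section."""
    return {d.get("name").lower() for d in index.findall(f"{section}/dimension")}

def index_matches(index_xml, query_dimensions):
    """Check whether a query's dimensions line up with an index definition.

    Assumed rule: dimensions with a value must equal the index dimensions,
    and dimensions with an empty value must equal the list dimensions.
    """
    index = ET.fromstring(index_xml)
    fixed = {name.lower() for name, value in query_dimensions if value}
    listed = {name.lower() for name, value in query_dimensions if not value}
    return (fixed == dimension_names(index, "indexdimensions")
            and listed == dimension_names(index, "listdimensions"))

# dimension=state:CO&dimension=carrier:DL&dimension=flightno:
query = [("state", "CO"), ("carrier", "DL"), ("flightno", "")]
matches = index_matches(INDEX_XML, query)
```

A query that fixes only the state while listing flight numbers would not match this index under the assumed rule, which is why the answer on the slide calls for a new index and a rerun of index creation.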
Questions?
We are hiring!
http://thinkbigcareers.teradata.com/
Data Analytics | Data Engineers | Data Solutions
Think Big International