Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016



Similar documents
Unified Big Data Analytics Pipeline. 连 城

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Unified Big Data Processing with Apache Spark. Matei

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

How Companies are! Using Spark

Self-service BI for big data applications using Apache Drill

Moving From Hadoop to Spark

SQL on NoSQL (and all of the data) With Apache Drill

Self-service BI for big data applications using Apache Drill

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Hadoop & Spark Using Amazon EMR

Scaling Out With Apache Spark. DTL Meeting Slides based on

Native Connectivity to Big Data Sources in MSTR 10

The Internet of Things and Big Data: Intro

Oracle Big Data Fundamentals Ed 1 NEW

Hadoop Job Oriented Training Agenda

Oracle Big Data SQL Technical Update

MapR: Best Solution for Customer Success

Hadoop Ecosystem B Y R A H I M A.

A Brief Introduction to Apache Tez

Data Governance in the Hadoop Data Lake. Michael Lang May 2015

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Beyond Hadoop with Apache Spark and BDAS

How To Create A Data Visualization With Apache Spark And Zeppelin

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian

Big Data Course Highlights

Upcoming Announcements

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

This is a brief tutorial that explains the basics of Spark SQL programming.

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

Turn Big Data to Small Data

Tap into Hadoop and Other No SQL Sources

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

Hadoop in the Enterprise

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Analytics on Spark &

Why Spark on Hadoop Matters

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Integrating Apache Spark with an Enterprise Data Warehouse

Workshop on Hadoop with Big Data

Open Source Technologies on Microsoft Azure

Big Data Patterns. Ron Bodkin Founder and President, Think Big

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Impala: A Modern, Open-Source SQL Engine for Hadoop. Marcel Kornacker Cloudera, Inc.

Data Services Advisory

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

Best Practices for Hadoop Data Analysis with Tableau

Impala: A Modern, Open-Source SQL

07/11/2014 Julien! Poorna! Andreas

Introduction to Big Data Training

Complete Java Classes Hadoop Syllabus Contact No:

Next-Gen Big Data Analytics using the Spark stack

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex

Ali Ghodsi Head of PM and Engineering Databricks

Building a real-time, self-service data analytics ecosystem Greg Arnold, Sr. Director Engineering

Apache Kylin Introduction Dec 8,

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

WHITE PAPER. Building Big Data Analytical Applications at Scale Using Existing ETL Skillsets INTELLIGENT BUSINESS STRATEGIES

HDP Hadoop From concept to deployment.

Information Builders Mission & Value Proposition

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

StratioDeep. An integration layer between Cassandra and Spark. Álvaro Agea Herradón Antonio Alcocer Falcón

Data Warehouse 2.0 How Hive & the Emerging Interactive Query Engines Change the Game Forever. David P. Mariani AtScale, Inc. September 16, 2013

Sisense. Product Highlights.

Integrate Master Data with Big Data using Oracle Table Access for Hadoop

ANALYTICS CENTER LEARNING PROGRAM

Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing

Chase Wu New Jersey Ins0tute of Technology

Big Data SQL and Query Franchising

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

Unified Batch & Stream Processing Platform

HPE Vertica & Hadoop. Tapping Innovation to Turbocharge Your Big Data. #SeizeTheData

Play with Big Data on the Shoulders of Open Source

AtScale Intelligence Platform

Ganzheitliches Datenmanagement

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Communicating with the Elephant in the Data Center

Apache Flink Next-gen data analysis. Kostas

Conquering Big Data with BDAS (Berkeley Data Analytics)

Architectures for Big Data Analytics A database perspective

Spark: Making Big Data Interactive & Real-Time

INTRODUCING DRUID: FAST AD-HOC QUERIES ON BIG DATA MICHAEL DRISCOLL - CEO ERIC TSCHETTER - LEAD METAMARKETS

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Using RDBMS, NoSQL or Hadoop?

How To Scale Out Of A Nosql Database

Oracle Big Data SQL. Architectural Deep Dive. Dan McClary, Ph.D. Big Data Product Management Oracle

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

#TalendSandbox for Big Data

Qlik Sense Enabling the New Enterprise

DANIEL EKLUND UNDERSTANDING BIG DATA AND THE HADOOP TECHNOLOGIES NOVEMBER 2-3, 2015 RESIDENZA DI RIPETTA - VIA DI RIPETTA, 231 ROME (ITALY)

Case Study : 3 different hadoop cluster deployments

Session# - AaS 2.1 Title SQL On Big Data - Technology, Architecture and Roadmap

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?

How Intel IT Successfully Migrated to Cloudera Apache Hadoop*

Transcription:

Big Data Approaches Making Sense of Big Data Ian Crosland Jan 2016

Accelerate Big Data ROI Even firms that are investing in Big Data are still struggling to get the most from it. Make Big Data Accessible Deliver Big Data In Context Keep Big Data Relevant Qlik s platform drives higher ROI by delivering big data in context with other data to ensure that Big Data stays relevant.

Partner for Success A strong culture of Partnership Broad range of technology partnerships Dedicated staff ensures continued focus Continuous evaluation of new market entrants HADOOP DATABASES ACCELERATORS DOMAIN And more

Hadoop

Apache Hadoop v2 Hive (SQL) Pig (ETL) Mahout (ML) Giraph (Graph) Hive 1.2.0, Tez, Pig, Cascading, etc. HBase MapReduce Other Compute Engines (Tez, Spark, etc.) YARN (Cluster Resource Manager) HDFS2

Stinger Initiative Hive on Tez YARN integration Distributed execution framework Eliminate extra map reads Dataflow model on DAG of nodes Query Optimisations Vectorised query execution Filter at storage layer vs SQL engine SQL cost based optimiser ORCFile format Higher compression Columnar Ideal for frequent fact filters Source: http://hortonworks.com/labs/stinger/ 145 developers 44 companies Connect via ODBC

Impala Parquet file format Driven from Twitter use cases Columnar data storage Limits IO to data needed Space saving Impala SQL Cost based optimisations Authentication, AD/Kerberos YARN integration In memory caching Metastore Can be same DB as Hive metastore e.g. MySQL Query optimiser can use table/column stats Can use Hbase/HDFS with several file formats e.g. RCfile/Parquet Impala Roadmap Additional SQL support S3 integration Nested data Connect via ODBC Source: http://blog.cloudera.com/blog/2014/08/whats-next-for-impala-focus-on-advanced-sql-functionality/ Other Sources: http://blog.cloudera.com/blog/2015/02/how-to-do-real-time-big-data-discovery-using-cloudera-enterpriseand-qlik-sense/

Apache Drill Dynamic Schema Discovery Does not require schema/type spec to start query execution Leverage self describing data formats, e.g. Parquet,Avro,JSON and NoSQL DB Flexible data model built for complex/semi-structured data Connect via ODBC ODBC/JDBC Drill SQL Query Layer & Execution Engine Files HBase Hive Hive UDF s Metastore SerDes Source: https://www.mapr.com/products/apache-drill Performance Distributed execution engine for query processing Columnar execution, avoids disk access for columns not in query Vectorisation allows CPU to operate on record batches Optimistic and pipelined query execution Data sources Drill creates a virtual view in JSON Nested data support MAPR-FS,HDFS,H-Base Can use Hive Metastore JSON, Mongo DB/NoSQL Reuse Hive UDF s

Resilient Distributed Datasets In memory MR does not lend itself to interactive /ad-hoc queries Logical collection of data partitioned across machines Can reference external datasets Market Spark distributed with ALL hadoop distro s Not just Big Data use cases Connect via ODBC Spark SQL Spark Spark Streamin g SparkSQL DataFrame distributed collection of data in named columns Supported on RDD s,parquet,json,hive,jdbc sources YARN integration Mlib Machine Learning GraphX Graph Comp. Spark R R on Spark Spark Core Engine Source: https://databricks.com/spark/about Alpha/Pre-alpha

Deeper Dive

Qlik Big Data Methodologies Different data volumes and complexities are best met using different methods Different methods ensure an optimized experience for the user for every situation On Demand App Generation Methods can be combined to meet different use cases Methods vary in deployment complexity Direct Discovery Chaining Data Volume Size (rows) Dimensions (columns) Cardinality (uniqueness) App Complexity Computational complexity such as set analysis Object density In-Memory Segmentation

On Demand App Generation A shopping cart approach to analytics Dimension selection to generate filtered analytics On demand data slices Converting Big Data to small data analytics Driven by business users governed by IT

Sample QlikView Process Selection App Dimensions in list boxes Conditional show applied to a button to limit amount of data Action to invoke analysis document with parameters Index Technique ASPX page invoked with selection criteria/user name EDX parameters passed separated in a list and pipe terminated Analysis App Analysis app produced by EDX task Data slice limited by populating a WHERE clause with the parameters

Sample QlikView Server Process 1. Selection App with dimensional data is deployed to access point 2. Business User makes multiple data selections 3. Governed selections will drive the indexing of analysis app 4. EDX process indexes analysis app with latest data from source(s) 5. Associative analysis app available for business user Aggregated data and dimensions Selection app 1 3 User selection criteria drives creation of analysis app 4 Data from multiple data sources is indexed into a template app based on the user selections Analysis app 2 5 User makes personalized selections from across many data sources Highly interactive user experience within a purpose built analysis app!

Case Study - Telco 1. Selection App is populated with dimensional data on schedule 2 2. User selects dimensional criteria. After the governed limit is reached an ASPX page is invoked 3. QMS API and EDX indexes analysis app with the most recent data from the Teradata database with only the data slice relevant to the user 4. Analysis app deployed to access point with user security 4 Analysis App Access Point 1 Selection App QlikView Server IIS QlikView Management Service ASPX QMS API 3 EDX Publisher

Qlik Sense and Elastic Tweets Example 2 1. Tweets are populated into a Elastic DB via Logstash 2. User searches for Tweets stored in the Elastic DB from a custom web page Web Page Index.html API s Proxy, Engine and Repository 3. NodeJS container with Sense Proxy, Engine and Repository API s and indexes the analysis app with the Tweets from the search stored in the Elastic DB. The data slice contains relevant for the user 4. Analysis app updated and published to a shared stream Qlik Sense Server 4 Analysis App 3 REST connector 1

Architecture Authenticate Generate and Publish App External Authentication 4243 4242 Qlik Sense Proxy API Qlik Sense Engine API Qlik Sense Repository API Gets client certificate Generates XrfKey Send request for ticket Authenticates to Sense with ticket Access Index template app Copies template and renames with filter added to app name Replaces the $search_terms$ in the script with filter Generates app with data and publishes to everyone stream

Thank you