Session# - AaS 2.1 Title SQL On Big Data - Technology, Architecture and Roadmap
|
|
- Bartholomew Edwards
- 8 years ago
- Views:
Transcription
1 Session# - AaS 2.1 Title SQL On Big Data - Technology, Architecture and Roadmap Sumit Pal Independent Big Data and Data Science Consultant, Boston 1 Data Center World Certified Vendor Neutral Each presenter is required to certify that their presentation will be vendor-neutral. As an attendee you have a right to enforce this policy of having no sales pitch within a session by alerting the speaker if you feel the session is not being presented in a vendor neutral fashion. If the issue continues to be a problem, please alert Data Center World staff after the session is complete. 2 1
2 SQL On Big Data - Technology, Architecture and Roadmap What is Big Data What is SQL SQL on Big Data Why SQL on Hadoop Limitations of SQL on Hadoop, Challenges & Solution Types of SQL (Volume, Velocity & Variety) Architectures Batch, Interactive, Streaming Innovations happening in this space 3 What is Big Data 4 2
3 What is SQL (Structured Query Language) - ANSI Standard - Manipulate and work with Relational Databases - Create Database - Insert / Update / Delete data into Databases - Retrieve data from Databases with filtering criteria 5 What is OLAP (On Line Analytical Processing) - Used in BI Tools - Calculating Trends from Historical Data - Building Business Scenarios - Building Aggregates and View Data Dimensionally ( Time, Geography etc.) - Navigate and Explore Adhoc Analysis - Drill Down 6 3
4 What is Hadoop New Approach to Distributed Data Processing Data is Local, NOT move across Network Shared Nothing Architecture 1. No synchronization requirement among the nodes 2. Designed for failure Multiple copies of data 3. Consistent Architecture individual failures does not fail the job 7 What is Map Reduce 8 4
5 What is HDFS 9 Why SQL on Hadoop More and more data is getting available on Hadoop SQL is an incredibly popular data querying language Is a bridge between business analysts and organizations big data Analysts would like to query data on Hadoop without using MapReduce Seamless Integration with BI Tools Scale that traditional DBMS cannot offer 10 5
6 SQL on Hadoop Goals Distributed, Scale Out Architecture Avoid Expensive Analytic DBs and Appliances Avoid Data Movement from HDFS to Analytic DBs High Concurrency of end users 11 Why SQL on Hadoop Appliances 12 6
7 SQL in Hadoop Landscape 13 Challenges of SQL on Hadoop Map Reduce and HDFS traditionally meant to solve Batch Oriented Data Map Reduce is high latency Map Reduce Not designed for long Data Pipelines (Complex SQL is inefficiently expressed as many MR stages) Disk IO between Map & Reduce do lot of data Shuffling & Sort 14 7
8 Approaches to solve the challenges 15 Approaches to solve the challenges Storage layer optimizations Data retrieval data locality, storage layout / formats & indexing Indexing (JethroData later slide) File Formats Avro, ORC, Parquet, Sequence Files Choosing the optimal file format in Hadoop is one of the most essential drivers of functionality and performance for big data processing and query Data Compression ( Reduce IO) GZIP, BZIP2, LZO, Snappy, LZ4 Workloads are IO Bound Reduce IO Compression Algorithms Compression has a gotcha Must be Splittable for Hadoop Tradeoff Storage / Network Bandwidth / CPU 16 8
9 Analytic Types 17 Batch SQL on Hadoop Hive is designed for batch queries Uses Map Reduce in the background Primarily for queries on large data sets and large ETL jobs for batch workloads Not designed for OLTP OR real time What Hive values most are scalability (scale out add machines to cluster) Extensibility (with Map Reduce framework and UDF/UDAF/UDTF) Fault tolerance and loose coupling with its input formats and Ser/De 18 9
10 Batch SQL on Hadoop Slow multiple stages of Map Reduce Complex SQL Queries need MR job Chaining, incurs Multiple Shuffles and Disk Writes 19 Optimizations Hive ~ Interactive SQL Hints ( /*MAP JOIN */) ( /*STREAMABLE */) Partitioning & Bucketing Done in relational DBs, Bucketing takes care of Skewness Vectorization Reduces the CPU usage, for query scans, filters, aggregates, and joins. Processing a block of 1024 rows at a time Within block, each column is stored as a vector Uses few instructions and finishes each instruction in fewer clock cycles, by using processor pipeline and cache memory
11 Interactive SQL 21 MPP / Full Scan Architecture 22 11
12 MPP / Full Scan Architecture Client: SELECT day, sum(sales) FROM t1 WHERE prod= abc GROUP BY day Query Planner/ Mgr Query Planner /Mgr Query Planner/ Mgr Query Planner/ Mgr Query Planner/ Mgr Query Executor Query Executor Query Executor Query Executor Query Executor Data Node Data Node Data Node Data Node Data Node Performance and resources based on the size of the dataset 23 Impala Architecture Biggest Advantage Dis-Advantage No Data Movement out of the clusters, No SPOF Processes/Deamon on the each Data Node Reasons for High Performance C++ Instead of Java Runtime Code Generation New Execution Engine ( Not Map Reduce ) Work on Optimized Data/File Format Parquet (Columnar Compressed) 24 12
13 Impala Architecture Fast and Efficient IO manager handle large data spread across array of hard drives (rotational, or SSD) Designed to run on modern architecture, recommended chipsets (i.e. Sandy Bridge, Bulldozer), as the LLVM IR compiler will use newer hardware instructions to help maximize IO throughput Impala s execution engine is decoupled from the storage engine, allowing it to plug other storage engines underneath 25 Impala Architecture 26 13
14 Impala Architecture 27 Impala Architecture 28 14
15 Impala Architecture 29 Impala Architecture LLVM & Others Runtime code generation to improve execution times Perform just in time (JIT) compilation to generate machine code Produce query specific versions of functions critical to performance Virtual function calls incur a large performance penalty. If object type is known, use code generation to replace the virtual function call with inline HDFS feature short circuit local reads to bypass the DataNode protocol when reading from local disk 30 15
16 ImpalaToGo New Kid on the Block Thin Layer on top of Impala ( C++ ) Moves out Impala to its own Cluster (DeCouples Storage From Compute ) ImpalaToGo can now reside in its own Cluster ( Separate from Hadoop Cluster ) Challenge deploying the MPP/full scan compute architecture against data that is now separated by network Deals with it by copying relevant data to local cluster Uses Tachyon to optimize it. This is great when all the data used by all current users can fit fully in cache. Once it doesn t the network cost kicks back in 31 Vertica MPP Based Analytic DB Engine MPP Columnar RDBMS 10x 100x compared to RDBMS on Analytic Queries Scales Linearly Commodity Hardware Built in Fault Tolerance Highly Compressed data for low IO 32 16
17 Vertica with Hadoop (2 Use Cases) 33 Vertica with Hadoop (2 Use Cases) 34 17
18 Spark SQL 35 Spark SQL 36 18
19 JethroData Indexes in Hadoop 37 Indexed Architecture Client: SELECT day, sum(sales) FROM t1 WHERE prod= abc GROUP BY day Jethro Query Node Query Node 1. Index Access 2. Read data only for require rows Data Node Data Node Data Node Data Node Data Node Performance and resources based on the size of the result set 38 19
20 Streaming SQL - Architecture 39 Streaming SQL - Architecture Store first, process second are unable to scale for real time Big Data Hadoop unable to offer latency and throughput for real time applications Telecoms, IOT and Cybersecurity Streams are infinite tables Standing query that executes over data Operating continuously on data as they arrive and by updating results incrementally in real time 40 20
21 Streaming SQL - Architecture Streaming Aggregation AVG, COUNT, MAX, MIN, SUM, STDDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP Sliding Windowed analytics Streaming analytics and sliding windows Output record (or row) is produced for each new input record. Each field (or column) in the output record may be calculated using a different window or partition. Windows can be time or row based. 41 Streaming SQL - Architecture In Memory Processing Lock free Data Structures Stateless implementation enables SQL queries to be distributed over Nodes 42 21
22 Innovations SQL Improvements Big Data Analytics (Hive Specific) SQL on Un Structured Data OLAP on Hadoop with SQL Probabilistic Query Engines BlinkDB 43 ACID transactions (HIVE-5317) INSERT INTO tbl SELECT INSERT INTO tbl VALUES UPDATE tbl SET WHERE DELETE FROM tbl WHERE MERGE INTO tbl USING src ON WHEN MATCHED THEN... WHEN NOT MATCHED THEN... SET TRANSACTION LEVEL BEGIN/END TRANSACTION How is it being done (look at the paper in reference) The heart of the approach is the client side merge of the HDFS files and directories 44 22
23 Window/Analytic Query (HIVE-4197) Windowing functions LEAD, LAG, FIRST_VALUE, LAST_VALUE, OVER with standard aggregates: COUNT, SUM, MIN, MAX, AVG OVER with a PARTITION BY OVER with PARTITION BY and ORDER BY OVER with a window specification Window specifications support these standard options: ROWS ((CURRENT ROW) (UNBOUNDED [num]) PRECEDING) AND (UNBOUNDED [num]) FOLLOWING Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank, NTILE 45 Innovations SQL on UnStructured Data 46 23
24 OLAP on Hadoop Apache Kylin Extremely Fast OLAP Engine at Scale ANSI SQL Interface on Hadoop Interactive Query Capability MOLAP Cube Seamless Integration with BI Tools 47 OLAP on Hadoop Apache Kylin 48 24
25 Choose Two Fast Response Low Latency SQL Engines Accurate Big Volume 49 Probabilistic SQL Query Engine Blink DB 50 25
26 Probabilistic SQL Query Engine Blink DB 51 Hybrid Transactional Analytical Processing Coined in early 2014 by Gartner New generation of in-memory data platforms OLTP & OLAP No data duplication/movement Hybrid Transaction/Analytical Processing In Memory technology Enabler Analytics over changing data ETL is not used any more SAP with the SAPHANA platform VoltDB, NuoDB, Clustrix, and MemSQLdatabase 52 26
27 Hybrid Transactional Analytical Processing
28 Why SQL on Hadoop is a Bad Idea Stefan Groschupf CEO and Chairman of Datameer When you put data into a structure, like SQL, you limit what you can do with the data in future. A question of Waterfall versus Agile in Data Analytics. SQL requires a waterfall design approach and is limited. Hadoop truly enables an agile analytics approach. You first collect all the data, and then you pull out whatever you want. stefan groschupf sqlhadoop.html#.vcitfxctr9c.linkedin 55 References JethroData Hive Paper icde2010.pdf Hive Transactions SQL to MR Translator state.edu/hpcs/www/html/publications/papers/tr 11 7.pdf Impala in Action Manning I was one of the reviewers Hive Cost Based Optimization 2.pdf Hive on Spark Hive Speed on spark is blazing fast or is itfinal Apache Kylin BlinkDB BlinkDB p212agarwalv2 Presto hadoop conference japan 2014?related=
29 References Hive Vectorization Hive LLAP longlived execution in hive Impala Paper Requiremtns of Stream Processing 57 ( Q& A ) Select Questions from Audience I will try to optimize the Queries 58 29
30 3 Key Things You Have Learned During this Session 1. What is Big Data 2. What is SQL 3. SQL on Big Data 59 Thank you 60 30
SQL on Hadoop Technology, Architecture & Innovations
SQL on Hadoop Technology, Architecture & Innovations 1 Introduction Sumit Pal Independent Consultant Big Data Architecture & Solutions (Spark, Pig, Hive, Impala, Tableau, Scala, Java, Python, R) palsumitpal@gmail.com
More informationArchitectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
More informationUsing distributed technologies to analyze Big Data
Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/
More informationApache Kylin Introduction Dec 8, 2014 @ApacheKylin
Apache Kylin Introduction Dec 8, 2014 @ApacheKylin Luke Han Sr. Product Manager lukhan@ebay.com @lukehq Yang Li Architect & Tech Leader yangli9@ebay.com Agenda What s Apache Kylin? Tech Highlights Performance
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationArchitectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
More informationBigData in Real-time. Impala Introduction. TCloud Computing 天 云 趋 势 孙 振 南 zhennan_sun@tcloudcomputing.com. 2012/12/13 Beijing Apache Asia Road Show
BigData in Real-time Impala Introduction TCloud Computing 天 云 趋 势 孙 振 南 zhennan_sun@tcloudcomputing.com 2012/12/13 Beijing Apache Asia Road Show Background (Disclaimer) Impala is NOT an Apache Software
More informationThe Sierra Clustered Database Engine, the technology at the heart of
A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel
More informationTHE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS
THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS WHITE PAPER Successfully writing Fast Data applications to manage data generated from mobile, smart devices and social interactions, and the
More informationA REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
More informationLuncheon Webinar Series May 13, 2013
Luncheon Webinar Series May 13, 2013 InfoSphere DataStage is Big Data Integration Sponsored By: Presented by : Tony Curcio, InfoSphere Product Management 0 InfoSphere DataStage is Big Data Integration
More informationHadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
More informationHIVE + AMAZON EMR + S3 = ELASTIC BIG DATA SQL ANALYTICS PROCESSING IN THE CLOUD A REAL WORLD CASE STUDY
HIVE + AMAZON EMR + S3 = ELASTIC BIG DATA SQL ANALYTICS PROCESSING IN THE CLOUD A REAL WORLD CASE STUDY Jaipaul Agonus FINRA Strata Hadoop World New York, Sep 2015 FINRA - WHAT DO WE DO? Collect and Create
More informationUnified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
More informationCloudera Impala: A Modern SQL Engine for Hadoop Headline Goes Here
Cloudera Impala: A Modern SQL Engine for Hadoop Headline Goes Here JusIn Erickson Senior Product Manager, Cloudera Speaker Name or Subhead Goes Here May 2013 DO NOT USE PUBLICLY PRIOR TO 10/23/12 Agenda
More informationTrafodion Operational SQL-on-Hadoop
Trafodion Operational SQL-on-Hadoop SophiaConf 2015 Pierre Baudelle, HP EMEA TSC July 6 th, 2015 Hadoop workload profiles Operational Interactive Non-interactive Batch Real-time analytics Operational SQL
More informationMaximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
More informationReference Architecture, Requirements, Gaps, Roles
Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture
More informationImpala: A Modern, Open-Source SQL Engine for Hadoop. Marcel Kornacker Cloudera, Inc.
Impala: A Modern, Open-Source SQL Engine for Hadoop Marcel Kornacker Cloudera, Inc. Agenda Goals; user view of Impala Impala performance Impala internals Comparing Impala to other systems Impala Overview:
More informationManaging Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
More informationBig Telco, Bigger DW Demands: Moving Towards SQL-on-Hadoop
Big Telco, Bigger DW Demands: Moving Towards SQL-on-Hadoop Keuntae Park IT Manager of SK Telecom, South Korea s largest wireless communications provider Work on commercial products (~ 12) T-FS: Distributed
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationBig Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
More informationInge Os Sales Consulting Manager Oracle Norway
Inge Os Sales Consulting Manager Oracle Norway Agenda Oracle Fusion Middelware Oracle Database 11GR2 Oracle Database Machine Oracle & Sun Agenda Oracle Fusion Middelware Oracle Database 11GR2 Oracle Database
More informationCitusDB Architecture for Real-Time Big Data
CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing
More informationActian Vector in Hadoop
Actian Vector in Hadoop Industrialized, High-Performance SQL in Hadoop A Technical Overview Contents Introduction...3 Actian Vector in Hadoop - Uniquely Fast...5 Exploiting the CPU...5 Exploiting Single
More informationHadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010
Hadoop s Entry into the Traditional Analytical DBMS Market Daniel Abadi Yale University August 3 rd, 2010 Data, Data, Everywhere Data explosion Web 2.0 more user data More devices that sense data More
More informationIBM Data Retrieval Technologies: RDBMS, BLU, IBM Netezza, and Hadoop
IBM Data Retrieval Technologies: RDBMS, BLU, IBM Netezza, and Hadoop Frank C. Fillmore, Jr. The Fillmore Group, Inc. Session Code: E13 Wed, May 06, 2015 (02:15 PM - 03:15 PM) Platform: Cross-platform Objectives
More informationSAP HANA PLATFORM Top Ten Questions for Choosing In-Memory Databases. Start Here
PLATFORM Top Ten Questions for Choosing In-Memory Databases Start Here PLATFORM Top Ten Questions for Choosing In-Memory Databases. Are my applications accelerated without manual intervention and tuning?.
More informationThe Internet of Things and Big Data: Intro
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific
More informationImpala: A Modern, Open-Source SQL
Impala: A Modern, Open-Source SQL Engine Headline for Goes Hadoop Here Marcel Speaker Kornacker Name Subhead marcel@cloudera.com Goes Here CIDR 2015 Cloudera Impala Agenda Overview Architecture and Implementation
More informationGetting Started with Hadoop. Raanan Dagan Paul Tibaldi
Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop
More informationHiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group
HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
More informationBig Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.
Big Data Technology ดร.ช ชาต หฤไชยะศ กด Choochart Haruechaiyasak, Ph.D. Speech and Audio Technology Laboratory (SPT) National Electronics and Computer Technology Center (NECTEC) National Science and Technology
More informationA Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
More informationIn-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet
In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Buzzwords Berlin - 2015 Big data analytics / machine
More informationActian SQL in Hadoop Buyer s Guide
Actian SQL in Hadoop Buyer s Guide Contents Introduction: Big Data and Hadoop... 3 SQL on Hadoop Benefits... 4 Approaches to SQL on Hadoop... 4 The Top 10 SQL in Hadoop Capabilities... 5 SQL in Hadoop
More informationPetabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013
Petabyte Scale Data at Facebook Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013 Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP) and Analytics
More informationBig Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
More information2009 Oracle Corporation 1
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material,
More informationApache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc. aaa@cloudera.com, twicer: @awadallah
Apache Hadoop: The Pla/orm for Big Data Amr Awadallah CTO, Founder, Cloudera, Inc. aaa@cloudera.com, twicer: @awadallah 1 The Problems with Current Data Systems BI Reports + Interac7ve Apps RDBMS (aggregated
More informationHadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard
Hadoop and Relational base The Best of Both Worlds for Analytics Greg Battas Hewlett Packard The Evolution of Analytics Mainframe EDW Proprietary MPP Unix SMP MPP Appliance Hadoop? Questions Is Hadoop
More informationHow To Create A Data Visualization With Apache Spark And Zeppelin 2.5.3.5
Big Data Visualization using Apache Spark and Zeppelin Prajod Vettiyattil, Software Architect, Wipro Agenda Big Data and Ecosystem tools Apache Spark Apache Zeppelin Data Visualization Combining Spark
More informationData Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com
Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,
More informationUnified Big Data Analytics Pipeline. 连 城 lian@databricks.com
Unified Big Data Analytics Pipeline 连 城 lian@databricks.com What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an
More informationConjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
More informationFederated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian Tzolov @christzolov
Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA by Christian Tzolov @christzolov Whoami Christian Tzolov Technical Architect at Pivotal, BigData, Hadoop, SpringXD,
More informationSaskatoon Business College Corporate Training Centre 244-6340 corporate@sbccollege.ca www.sbccollege.ca/corporate
Microsoft Certified Instructor led: Querying Microsoft SQL Server (Course 20461C) Date: October 19 23, 2015 Course Length: 5 day (8:30am 4:30pm) Course Cost: $2400 + GST (Books included) About this Course
More informationA very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect
A very short talk about Apache Kylin Business Intelligence meets Big Data Fabian Wilckens EMEA Solutions Architect 1 The challenge today 2 Very quickly: OLAP Online Analytical Processing How many beers
More informationLambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
More informationHadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
More informationAdvanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
More informationJun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC
Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC Agenda Quick Overview of Impala Design Challenges of an Impala Deployment Case Study: Use Simulation-Based Approach to Design
More informationI/O Considerations in Big Data Analytics
Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very
More informationBig Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016
Big Data Approaches Making Sense of Big Data Ian Crosland Jan 2016 Accelerate Big Data ROI Even firms that are investing in Big Data are still struggling to get the most from it. Make Big Data Accessible
More informationTurn Big Data to Small Data
Turn Big Data to Small Data Use Qlik to Utilize Distributed Systems and Document Databases October, 2014 Stig Magne Henriksen Image: kdnuggets.com From Big Data to Small Data Agenda When do we have a Big
More informationGraySort on Apache Spark by Databricks
GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner
More informationHow To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
More informationInnovative technology for big data analytics
Technical white paper Innovative technology for big data analytics The HP Vertica Analytics Platform database provides price/performance, scalability, availability, and ease of administration Table of
More informationOracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
More informationUsing RDBMS, NoSQL or Hadoop?
Using RDBMS, NoSQL or Hadoop? DOAG Conference 2015 Jean- Pierre Dijcks Big Data Product Management Server Technologies Copyright 2014 Oracle and/or its affiliates. All rights reserved. Data Ingest 2 Ingest
More informationIn-Memory Columnar Databases HyPer. Arto Kärki University of Helsinki 30.11.2012
In-Memory Columnar Databases HyPer Arto Kärki University of Helsinki 30.11.2012 1 Introduction Columnar Databases Design Choices Data Clustering and Compression Conclusion 2 Introduction The relational
More informationBig Data Analytics - Accelerated. stream-horizon.com
Big Data Analytics - Accelerated stream-horizon.com Legacy ETL platforms & conventional Data Integration approach Unable to meet latency & data throughput demands of Big Data integration challenges Based
More informationNoSQL for SQL Professionals William McKnight
NoSQL for SQL Professionals William McKnight Session Code BD03 About your Speaker, William McKnight President, McKnight Consulting Group Frequent keynote speaker and trainer internationally Consulted to
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationHow To Use Big Data For Telco (For A Telco)
ON-LINE VIDEO ANALYTICS EMBRACING BIG DATA David Vanderfeesten, Bell Labs Belgium ANNO 2012 YOUR DATA IS MONEY BIG MONEY! Your click stream, your activity stream, your electricity consumption, your call
More informationSAP HANA From Relational OLAP Database to Big Data Infrastructure
SAP HANA From Relational OLAP Database to Big Data Infrastructure Anil K Goel VP & Chief Architect, SAP HANA Data Platform WBDB 2015, June 16, 2015 Toronto SAP Big Data Story Data Lifecycle Management
More informationNavigating the Big Data infrastructure layer Helena Schwenk
mwd a d v i s o r s Navigating the Big Data infrastructure layer Helena Schwenk A special report prepared for Actuate May 2013 This report is the second in a series of four and focuses principally on explaining
More informationIN-MEMORY DATABASE SYSTEMS. Prof. Dr. Uta Störl Big Data Technologies: In-Memory DBMS - SoSe 2015 1
IN-MEMORY DATABASE SYSTEMS Prof. Dr. Uta Störl Big Data Technologies: In-Memory DBMS - SoSe 2015 1 Analytical Processing Today Separation of OLTP and OLAP Motivation Online Transaction Processing (OLTP)
More informationIntegrating Apache Spark with an Enterprise Data Warehouse
Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software
More informationHow to Choose Between Hadoop, NoSQL and RDBMS
How to Choose Between Hadoop, NoSQL and RDBMS Keywords: Jean-Pierre Dijcks Oracle Redwood City, CA, USA Big Data, Hadoop, NoSQL Database, Relational Database, SQL, Security, Performance Introduction A
More informationUnderstanding the Value of In-Memory in the IT Landscape
February 2012 Understing the Value of In-Memory in Sponsored by QlikView Contents The Many Faces of In-Memory 1 The Meaning of In-Memory 2 The Data Analysis Value Chain Your Goals 3 Mapping Vendors to
More informationMicrosoft Analytics Platform System. Solution Brief
Microsoft Analytics Platform System Solution Brief Contents 4 Introduction 4 Microsoft Analytics Platform System 5 Enterprise-ready Big Data 7 Next-generation performance at scale 10 Engineered for optimal
More informationPreview of Oracle Database 12c In-Memory Option. Copyright 2013, Oracle and/or its affiliates. All rights reserved.
Preview of Oracle Database 12c In-Memory Option 1 The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any
More informationBIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014
BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 Ralph Kimball Associates 2014 The Data Warehouse Mission Identify all possible enterprise data assets Select those assets
More informationParquet. Columnar storage for the people
Parquet Columnar storage for the people Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter Nong Li nong@cloudera.com Software engineer, Cloudera Impala Outline Context from various
More informationlow-level storage structures e.g. partitions underpinning the warehouse logical table structures
DATA WAREHOUSE PHYSICAL DESIGN The physical design of a data warehouse specifies the: low-level storage structures e.g. partitions underpinning the warehouse logical table structures low-level structures
More informationPulsar Realtime Analytics At Scale. Tony Ng April 14, 2015
Pulsar Realtime Analytics At Scale Tony Ng April 14, 2015 Big Data Trends Bigger data volumes More data sources DBs, logs, behavioral & business event streams, sensors Faster analysis Next day to hours
More informationBig Data Research in the AMPLab: BDAS and Beyond
Big Data Research in the AMPLab: BDAS and Beyond Michael Franklin UC Berkeley 1 st Spark Summit December 2, 2013 UC BERKELEY AMPLab: Collaborative Big Data Research Launched: January 2011, 6 year planned
More informationHPE Vertica & Hadoop. Tapping Innovation to Turbocharge Your Big Data. #SeizeTheData
HPE Vertica & Hadoop Tapping Innovation to Turbocharge Your Big Data #SeizeTheData The HPE Vertica portfolio One Vertica Engine running on Cloud, Bare Metal, or Hadoop Data Nodes HPE Vertica OnDemand &
More informationSAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011
SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications Jürgen Primsch, SAP AG July 2011 Why In-Memory? Information at the Speed of Thought Imagine access to business data,
More informationGridGain In- Memory Data Fabric: UlCmate Speed and Scale for TransacCons and AnalyCcs
GridGain In- Memory Data Fabric: UlCmate Speed and Scale for TransacCons and AnalyCcs DMITRIY SETRAKYAN Founder & EVP Engineering @dsetrakyan www.gridgain.com #gridgain Agenda EvoluCon of In- Memory CompuCng
More informationBig Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect
Big Data & QlikView Democratizing Big Data Analytics David Freriks Principal Solution Architect TDWI Vancouver Agenda What really is Big Data? How do we separate hype from reality? How does that relate
More informationBusiness Intelligence for Big Data
Business Intelligence for Big Data Will Gorman, Vice President, Engineering May, 2011 2010, Pentaho. All Rights Reserved. www.pentaho.com. What is BI? Business Intelligence = reports, dashboards, analysis,
More informationBig Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com
Big Data Primer Alex Sverdlov alex@theparticle.com 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.
More informationAlternatives to HIVE SQL in Hadoop File Structure
Alternatives to HIVE SQL in Hadoop File Structure Ms. Arpana Chaturvedi, Ms. Poonam Verma ABSTRACT Trends face ups and lows.in the present scenario the social networking sites have been in the vogue. The
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationHow Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns
How Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns Table of Contents Abstract... 3 Introduction... 3 Definition... 3 The Expanding Digitization
More informationOracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
More informationIntegrating Big Data into the Computing Curricula
Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationReal-Time Data Analytics and Visualization
Real-Time Data Analytics and Visualization Making the leap to BI on Hadoop Predictive Analytics & Business Insights 2015 February 9, 2015 David P. Mariani CEO, AtScale, Inc. THE TRUTH ABOUT DATA We think
More informationTE's Analytics on Hadoop and SAP HANA Using SAP Vora
TE's Analytics on Hadoop and SAP HANA Using SAP Vora Naveen Narra Senior Manager TE Connectivity Santha Kumar Rajendran Enterprise Data Architect TE Balaji Krishna - Director, SAP HANA Product Mgmt. -
More informationBig Data Analytics - Accelerated. stream-horizon.com
Big Data Analytics - Accelerated stream-horizon.com StreamHorizon & Big Data Integrates into your Data Processing Pipeline Seamlessly integrates at any point of your your data processing pipeline Implements
More informationPostgreSQL Business Intelligence & Performance Simon Riggs CTO, 2ndQuadrant PostgreSQL Major Contributor
PostgreSQL Business Intelligence & Performance Simon Riggs CTO, 2ndQuadrant PostgreSQL Major Contributor The research leading to these results has received funding from the European Union's Seventh Framework
More informationBig Data and Market Surveillance. April 28, 2014
Big Data and Market Surveillance April 28, 2014 Copyright 2014 Scila AB. All rights reserved. Scila AB reserves the right to make changes to the information contained herein without prior notice. No part
More informationAligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap
Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap 3 key strategic advantages, and a realistic roadmap for what you really need, and when 2012, Cognizant Topics to be discussed
More informationOracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc.
Oracle9i Data Warehouse Review Robert F. Edwards Dulcian, Inc. Agenda Oracle9i Server OLAP Server Analytical SQL Data Mining ETL Warehouse Builder 3i Oracle 9i Server Overview 9i Server = Data Warehouse
More informationHadoop-BAM and SeqPig
Hadoop-BAM and SeqPig Keijo Heljanko 1, André Schumacher 1,2, Ridvan Döngelci 1, Luca Pireddu 3, Matti Niemenmaa 1, Aleksi Kallio 4, Eija Korpelainen 4, and Gianluigi Zanetti 3 1 Department of Computer
More information