Impala: A Modern, Open-Source SQL Engine for Hadoop. Marcel Kornacker Cloudera, Inc.
1 Impala: A Modern, Open-Source SQL Engine for Hadoop Marcel Kornacker Cloudera, Inc.
2 Agenda
  Goals; user view of Impala
  Impala performance
  Impala internals
  Comparing Impala to other systems
3 Impala Overview: Goals
  General-purpose SQL query engine:
    works both for analytical and transactional/single-row workloads
    supports queries that take from milliseconds to hours
  Runs directly within Hadoop:
    reads widely used Hadoop file formats
    talks to widely used Hadoop storage managers
    runs on same nodes that run Hadoop processes
  High performance:
    C++ instead of Java
    runtime code generation
    completely new execution engine that doesn't build on MapReduce
4 User View of Impala: Overview
  Runs as a distributed service in cluster: one Impala daemon on each node with data
  User submits query via ODBC/JDBC, Impala CLI or Hue to any of the daemons
  Query is distributed to all nodes with relevant data
  If any node fails, the query fails
  Impala uses Hive's metadata interface, connects to Hive's metastore
  Supported file formats:
    Parquet columnar format (more on that later)
    sequence files and RCFile with Snappy/gzip compression
    Avro data files
    uncompressed/LZO-compressed text files
5 User View of Impala: SQL
  SQL support:
    patterned after Hive's version of SQL
    essentially SQL-92, minus correlated subqueries
    INSERT INTO... SELECT...
    only equi-joins; no non-equi joins, no cross products
    ORDER BY requires LIMIT
    limited DDL support
  Functional limitations:
    no custom UDFs, file formats, SerDes
    no beyond-SQL features (buckets, samples, transforms, arrays, structs, maps, xpath, json)
    only hash joins; joined table has to fit in aggregate memory of all executing nodes
6 User View of Impala: HBase
  Functionality highlights:
    support for SELECT, INSERT INTO... SELECT..., and INSERT INTO... VALUES(...)
    predicates on rowkey columns are mapped into start/stop row
    predicates on other columns are mapped into SingleColumnValueFilters
  But: mapping of HBase table into metastore table patterned after Hive
    all data stored as scalars and in ASCII
    the rowkey needs to be mapped to a single string column
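The predicate mapping on this slide can be sketched as follows. This is a minimal illustration, not Impala's planner code: the Predicate class and plan_hbase_scan function are hypothetical names, only the operators "=", ">=" and "<" are handled on the row key, and row keys are assumed to be ASCII strings. It relies on the HBase convention that a scan's start row is inclusive and its stop row is exclusive.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Predicate:
    column: str
    op: str      # "=", ">=" or "<" in this sketch
    value: str


def plan_hbase_scan(preds: List[Predicate], rowkey_col: str):
    """Split predicates into a row-key range and per-column filters.

    Hypothetical helper, not an Impala or HBase API.
    """
    start_row: Optional[str] = None   # inclusive lower bound
    stop_row: Optional[str] = None    # exclusive upper bound
    filters: List[Tuple[str, str, str]] = []

    def raise_start(candidate: str) -> None:
        nonlocal start_row
        start_row = candidate if start_row is None else max(start_row, candidate)

    def lower_stop(candidate: str) -> None:
        nonlocal stop_row
        stop_row = candidate if stop_row is None else min(stop_row, candidate)

    for p in preds:
        if p.column == rowkey_col and p.op == "=":
            raise_start(p.value)
            lower_stop(p.value + "\x00")  # smallest key greater than value
        elif p.column == rowkey_col and p.op == ">=":
            raise_start(p.value)
        elif p.column == rowkey_col and p.op == "<":
            lower_stop(p.value)
        else:
            # Everything else stands in for a SingleColumnValueFilter.
            filters.append((p.column, p.op, p.value))
    return start_row, stop_row, filters
```

Narrowing the row range lets the scan skip rows entirely, whereas the remaining filters still have to be evaluated against every row in the range.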
7 User View of Impala: HBase
  Roadmap:
    full support for UPDATE and DELETE
    storage of structured data to minimize storage and access overhead
    composite row key encoding, mapped into an arbitrary number of table columns
8 Impala Single-User Performance
  Benchmark: 20 queries from TPC-DS, in 3 categories:
    interactive: 1 month
    reports: several months
    deep analytics: all data
  Main fact table: 5 years of data, 1 TB, stored as Snappy-compressed sequence files
  Cluster: 20 machines, 24 cores each
9 Impala Single-User Performance
  Speed-up over Hive:
    interactive: 25x - 68x
    reports: 6x - 56x
    deep analytics: 6x - 55x
10 Impala Multi-User Performance
  Benchmark for query latency in multi-user environment:
    same dataset and workload as single-user benchmark
    same hardware config
    multiple clients issue queries in parallel
11 Impala Multi-User Performance
  Query throughput (in queries per second) in multi-user environment:
    scaling up workload (not # of machines)
    qps increases until cluster is saturated
    qps stable at that point, system doesn't waste work
12 Impala Architecture
  Two binaries: impalad and statestored
  Impala daemon (impalad) - N instances
    handles client requests and all internal requests related to query execution
  State store daemon (statestored) - 1 instance
    provides name service and metadata distribution
13 Impala Architecture
  Query execution phases:
    request arrives via ODBC/JDBC
    planner turns request into collections of plan fragments
    coordinator initiates execution on remote impalad nodes
  During execution:
    intermediate results are streamed between executors
    query results are streamed back to client
    subject to limitations imposed by blocking operators (top-n, aggregation)
14 Impala Architecture: Query Execution
  Request arrives via ODBC/JDBC
  [Diagram: a SQL app sends a SQL request via ODBC to an impalad. Three impalad nodes are shown, each with a Query Planner, Query Coordinator and Query Executor next to an HDFS DN and HBase. Hive Metastore, HDFS NN and Statestore appear as separate services.]
15 Impala Architecture: Query Execution
  Planner turns request into collections of plan fragments
  Coordinator initiates execution on remote impalad nodes
  [Diagram: same layout. SQL app connected via ODBC; three impalad nodes, each with a Query Planner, Query Coordinator and Query Executor next to an HDFS DN and HBase; Hive Metastore, HDFS NN and Statestore as separate services.]
16 Impala Architecture: Query Execution
  Intermediate results are streamed between impalads
  Query results are streamed back to client
  [Diagram: same layout, with query results flowing back to the SQL app via ODBC. Three impalad nodes, each with a Query Planner, Query Coordinator and Query Executor next to an HDFS DN and HBase; Hive Metastore, HDFS NN and Statestore as separate services.]
17 Query Planning: Overview
  2-phase planning process:
    single-node plan: left-deep tree of plan operators
    partitioning of operator tree into plan fragments for parallel execution
  Parallelization of operators:
    all query operators are fully distributed
  Join order = FROM clause order
  Post-GA: cost-based optimizer
18 Query Planning: Single-Node Plan
  Plan operators: Scan, HashJoin, HashAggregation, Union, TopN, Exchange
  Example:
    SELECT t1.custid, SUM(t2.revenue) AS revenue
    FROM LargeHdfsTable t1
    JOIN LargeHdfsTable t2 ON (t1.id1 = t2.id)
    JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
    WHERE t3.category = 'Online'
    GROUP BY t1.custid
    ORDER BY revenue DESC
    LIMIT 10
19 Query Planning: Single-Node Plan
  Single-node plan for example:
    TopN
      Agg
        HashJoin
          HashJoin
            Scan: t1
            Scan: t2
          Scan: t3
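Because the join order follows the FROM clause, the join part of a left-deep single-node plan can be built by folding the table list from left to right. The sketch below only illustrates that shape. The tuple-based plan nodes and the left_deep_plan and render helpers are made up for this example and are not Impala's planner classes.

```python
from typing import List, Tuple

# A plan node is a plain tuple: ("Scan", table) or ("HashJoin", left, right).
PlanNode = Tuple


def left_deep_plan(tables: List[str]) -> PlanNode:
    """Join tables in FROM-clause order; each join's left input is the plan so far."""
    plan: PlanNode = ("Scan", tables[0])
    for table in tables[1:]:
        plan = ("HashJoin", plan, ("Scan", table))
    return plan


def render(node: PlanNode, depth: int = 0) -> List[str]:
    """Render the plan as indented text lines, left input first."""
    indent = "  " * depth
    if node[0] == "Scan":
        return [f"{indent}Scan: {node[1]}"]
    lines = [f"{indent}HashJoin"]
    lines += render(node[1], depth + 1)
    lines += render(node[2], depth + 1)
    return lines


if __name__ == "__main__":
    print("\n".join(render(left_deep_plan(["t1", "t2", "t3"]))))
```

In a left-deep tree the right input of every join is a base-table scan, which is why only the left side grows as more tables are added.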
20 Query Planning: Distributed Plans
  Goals:
    maximize scan locality, minimize data movement
    full distribution of all query operators (where semantically correct)
  Parallel joins:
    broadcast join: join is collocated with left input; right-hand side table is broadcast to each node executing join
      -> preferred for small right-hand side input
    partitioned join: both tables are hash-partitioned on join columns
      -> preferred for large joins
    cost-based decision based on column stats/estimated cost of data transfers
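One way to picture the cost-based choice between the two join strategies is to compare the bytes each one would move over the network. The model below is a deliberately simplified assumption for illustration, not Impala's actual cost formula: a broadcast join ships the right-hand side to every executing node, while a partitioned join reshuffles both inputs once. The function name and parameters are hypothetical.

```python
def choose_join_strategy(lhs_bytes: int, rhs_bytes: int, num_nodes: int) -> str:
    """Pick the strategy with the lower estimated network transfer.

    Assumed cost model (illustrative only):
      broadcast:   the right-hand side is copied to each of num_nodes nodes
      partitioned: both inputs are hash-partitioned and sent once
    """
    broadcast_cost = rhs_bytes * num_nodes
    partitioned_cost = lhs_bytes + rhs_bytes
    return "broadcast" if broadcast_cost <= partitioned_cost else "partitioned"


if __name__ == "__main__":
    # Large fact table joined with a small dimension table on 20 nodes.
    print(choose_join_strategy(10**12, 10**6, 20))
    # Two large tables on 20 nodes.
    print(choose_join_strategy(10**12, 10**12, 20))
```

Under this model a small right-hand side favors broadcast and two large inputs favor partitioning, which matches the preferences stated on the slide.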
21 Query Planning: Distributed Plans
  Parallel aggregation:
    pre-aggregation where data is first materialized
    merge aggregation partitioned by grouping columns
  Parallel top-n:
    initial top-n operation where data is first materialized
    final top-n in single-node plan fragment
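The two-phase pattern for aggregation and top-n can be sketched on in-memory data. This is a toy model under stated assumptions: each "node" is just a Python list of (key, value) rows, SUM is the only aggregate, and crc32 stands in for whatever hash function a real engine would use to repartition. All function names are invented for the example.

```python
import heapq
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, int]


def pre_aggregate(rows: List[Row]) -> Dict[str, int]:
    """Phase 1: partial SUM per key, computed where the rows are materialized."""
    partial: Dict[str, int] = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return dict(partial)


def merge_aggregate(partials: List[Dict[str, int]], num_partitions: int) -> List[Dict[str, int]]:
    """Phase 2: repartition partial sums by grouping key and merge them."""
    partitions: List[Dict[str, int]] = [defaultdict(int) for _ in range(num_partitions)]
    for partial in partials:
        for key, value in partial.items():
            target = zlib.crc32(key.encode()) % num_partitions
            partitions[target][key] += value
    return [dict(p) for p in partitions]


def top_n(partitions: List[Dict[str, int]], n: int) -> List[Row]:
    """Initial top-n per partition, then a final top-n over the survivors."""
    candidates: List[Row] = []
    for partition in partitions:
        candidates += heapq.nlargest(n, partition.items(), key=lambda kv: kv[1])
    return heapq.nlargest(n, candidates, key=lambda kv: kv[1])
```

Each key lands in exactly one partition after repartitioning, so the per-partition sums are final, and any row in the global top n must also be in the top n of its own partition.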
22 Query Planning: Distributed Plans
  In the example:
    scans are local: each scan receives its own fragment
    1st join: large x large -> partitioned join
    2nd join: large x small -> broadcast join
    pre-aggregation in fragment that materializes join result
    merge aggregation after repartitioning on grouping column
    initial top-n in fragment that does merge aggregation
    final top-n in coordinator fragment
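23 Query Planning: Distributed Plans
  [Diagram: distributed plan for the example; fragments are labeled "at HDFS DN", "at HBase RS" and "at coordinator". From the top: TopN; Broadcast; TopN and MergeAgg; hash t1.custid; Pre-Agg and HashJoin, with Broadcast from Scan: t3; HashJoin, with Scan: t1 (hash t1.id1) and Scan: t2 (hash t2.id).]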
24 Metadata Handling
  Impala metadata:
    Hive's metastore: logical metadata (table definitions, columns, CREATE TABLE parameters)
    HDFS NameNode: directory contents and block replica locations
    HDFS DataNode: block replicas' volume ids
  Caches metadata: no synchronous metastore API calls during query execution
    impalad instances read metadata from metastore at startup
    REFRESH [<tbl>]: selectively reloads metadata at single impalad instance
  Post-GA: metadata distribution through statestore
  Post-GA: HCatalog and AccessServer
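The caching behavior described here (load at startup, serve queries from the cache, reload only on an explicit REFRESH) can be modeled in a few lines. This is a conceptual sketch only: MetadataCache and its load_table callback are hypothetical and do not correspond to Impala classes or to the Hive metastore API.

```python
from typing import Callable, Dict, List, Optional


class MetadataCache:
    """Per-daemon cache: lookups never call the metastore synchronously."""

    def __init__(self, load_table: Callable[[str], List[str]]):
        self._load_table = load_table          # hypothetical metastore lookup
        self._tables: Dict[str, List[str]] = {}

    def load_at_startup(self, table_names: List[str]) -> None:
        for name in table_names:
            self._tables[name] = self._load_table(name)

    def get(self, name: str) -> List[str]:
        # Served from memory, so the result may be stale until a refresh.
        return self._tables[name]

    def refresh(self, name: Optional[str] = None) -> None:
        """Model of REFRESH [<tbl>]: reload one table, or all cached tables."""
        names = [name] if name is not None else list(self._tables)
        for table in names:
            self._tables[table] = self._load_table(table)
```

The trade-off is visible in get: it is fast because it never leaves the process, but changes made in the metastore stay invisible on this daemon until refresh is called.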
25 Impala Execution Engine
  Written in C++ for minimal execution overhead
  Internal in-memory tuple format puts fixed-width data at fixed offsets
  Uses intrinsics/special CPU instructions for text parsing, crc32 computation, etc.
  Runtime code generation for "big loops"
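The idea of keeping fixed-width data at fixed offsets can be illustrated with Python's struct module. The layout below (a 4-byte int, an 8-byte int and an 8-byte double, packed without padding) is an assumption chosen for the example and is not Impala's actual tuple format. The point is only that a column's position is known in advance, so one field can be read without decoding the rest of the row.

```python
import struct
from typing import List, Tuple

# Assumed example layout: id (int32), custid (int64), revenue (float64).
# "<" selects little-endian with no padding, so offsets are 0, 4 and 12.
ROW = struct.Struct("<iqd")
REVENUE_OFFSET = 12


def pack_rows(rows: List[Tuple[int, int, float]]) -> bytearray:
    """Store all rows back to back in one contiguous buffer."""
    buf = bytearray(ROW.size * len(rows))
    for i, row in enumerate(rows):
        ROW.pack_into(buf, i * ROW.size, *row)
    return buf


def read_revenue(buf: bytearray, row_index: int) -> float:
    """Read a single column directly from its fixed offset."""
    return struct.unpack_from("<d", buf, row_index * ROW.size + REVENUE_OFFSET)[0]
```

Locating a field is plain arithmetic on the row index and a constant offset, which is the kind of access pattern that compiled code can turn into a single load.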
26 Impala Execution Engine
  More on runtime code generation:
    example of "big loop": insert batch of rows into hash table
    known at query compile time: # of tuples in batch, tuple layout, column types, etc.
    generate at compile time: loop that inlines all function calls, contains no dead code, minimizes branches
    code generated using LLVM
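As a rough analogy for runtime code generation, the sketch below builds the source of a hash-table insert loop with the column positions baked in as constants, then compiles it once per query. Impala generates machine code with LLVM; this Python version uses exec purely to show the specialization idea, and generate_insert_loop is a hypothetical name.

```python
from typing import Callable, Dict, List, Tuple


def generate_insert_loop(key_index: int, value_index: int) -> Callable:
    """Generate a loop specialized for one tuple layout.

    The column indexes are known when the query is compiled, so they are
    written into the generated source as literals instead of being looked
    up for every row.
    """
    source = (
        "def insert_batch(table, batch):\n"
        "    for row in batch:\n"
        f"        table.setdefault(row[{key_index}], []).append(row[{value_index}])\n"
        "    return table\n"
    )
    namespace: Dict[str, object] = {}
    exec(source, namespace)  # compile the specialized function
    return namespace["insert_batch"]


if __name__ == "__main__":
    insert_batch = generate_insert_loop(key_index=0, value_index=2)
    rows: List[Tuple] = [(1, "a", 10), (2, "b", 20), (1, "c", 30)]
    print(insert_batch({}, rows))
```

The generated function contains no branching on layout or column type, which is the property the slide describes as "no dead code, minimizes branches".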
27 Impala's Statestore
  Central system state repository:
    name service (membership)
    Post-GA: metadata
    Post-GA: other scheduling-relevant or diagnostic state
  Soft-state:
    all data can be reconstructed from the rest of the system
    cluster continues to function when statestore fails, but per-node state becomes increasingly stale
  Sends periodic heartbeats:
    pushes new data
    checks for liveness
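The push-based heartbeat can be modeled as a loop that delivers pending updates to every subscriber and uses the delivery attempt itself as the liveness check. This is an in-process toy under stated assumptions: the Statestore and Daemon classes are invented for the example, there is no networking, and an unreachable subscriber is simulated by raising ConnectionError.

```python
from typing import Callable, Dict, Set


class Daemon:
    """Subscriber that keeps a local, possibly stale, copy of the state."""

    def __init__(self) -> None:
        self.cache: Dict[str, object] = {}
        self.reachable = True

    def on_heartbeat(self, update: Dict[str, object]) -> None:
        if not self.reachable:
            raise ConnectionError("subscriber did not respond")
        self.cache.update(update)


class Statestore:
    def __init__(self) -> None:
        self.subscribers: Dict[str, Callable[[Dict[str, object]], None]] = {}
        self.pending: Dict[str, object] = {}
        self.live: Set[str] = set()

    def register(self, name: str, callback: Callable[[Dict[str, object]], None]) -> None:
        self.subscribers[name] = callback

    def publish(self, key: str, value: object) -> None:
        self.pending[key] = value

    def heartbeat(self) -> Set[str]:
        """Push new data to all subscribers and record who answered."""
        update, self.pending = self.pending, {}
        for name, callback in self.subscribers.items():
            try:
                callback(update)
                self.live.add(name)
            except ConnectionError:
                self.live.discard(name)
        return set(self.live)
```

A daemon that misses heartbeats keeps serving from its cache, which mirrors the soft-state property: the cluster keeps working, but that daemon's view grows stale.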
28 Statestore: Why not ZooKeeper
  ZK is not a good pub-sub system:
    Watch API is awkward and requires a lot of client logic
    multiple round-trips required to get data for changes to node's children
    push model is more natural for our use case
  Don't need all the guarantees ZK provides:
    serializability
    persistence
    prefer to avoid complexity where possible
  ZK is bad at the things we care about and good at the things we don't
29 Comparing Impala to Dremel
  What is Dremel:
    columnar storage for data with nested structures
    distributed scalable aggregation on top of that
  Columnar storage in Hadoop: Parquet
  Distributed aggregation: Impala
  Impala plus Parquet: a superset of the published version of Dremel (which didn't support joins)
30 More about Parquet
  What it is:
    columnar container format for all popular serialization formats: Avro, Thrift, Protocol Buffers
    successor to Doug Cutting's Trevni
    co-designed by Cloudera and Twitter
    open source; hosted on GitHub
  Features:
    fully shredded nested data; repetition and definition levels similar to Dremel's ColumnIO
    column values stored in native types (bool, int<x>, float, double, byte array)
    support for index pages for fast lookup
    extensible value encodings (run-length encoding, dictionary,...)
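The two value encodings named on this slide can be shown on a single column of strings. This is a conceptual sketch, not Parquet's on-disk format: the function names are invented, and the real encodings work on pages of binary data rather than Python lists.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each value with a small integer code into a dictionary."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(codes: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (code, run_length) pairs."""
    runs: List[List[int]] = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return [(code, length) for code, length in runs]


def decode(dictionary: List[str], runs: List[Tuple[int, int]]) -> List[str]:
    """Invert both encodings to recover the original column."""
    return [dictionary[code] for code, length in runs for _ in range(length)]
```

Columnar layout is what makes these encodings pay off: values from one column sit next to each other, so low-cardinality columns produce small dictionaries and long runs.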
31 Comparing Impala to Hive
  Hive: MapReduce as an execution engine
    high-latency, low-throughput queries
    fault-tolerance model based on MapReduce's on-disk checkpointing; materializes all intermediate results
    Java runtime allows for easy late-binding of functionality: file formats and UDFs
    extensive layering imposes high runtime overhead
  Impala:
    direct, process-to-process data exchange
    no fault tolerance
    an execution engine designed for low runtime overhead
32 Impala Roadmap: 2013
  Additional SQL:
    UDFs
    SQL authorization and DDL
    ORDER BY without LIMIT
    analytic/window functions
    support for structured data types
  Improved HBase support:
    composite keys, complex types in columns, index nested-loop joins, INSERT/UPDATE/DELETE
  Runtime optimizations:
    straggler handling
    join order optimization
    improved cache management
    data collocation for improved join performance
33 Impala Roadmap: 2013
  Better metadata handling:
    automatic metadata distribution through statestore
  Resource management:
    goal: run exploratory and production workloads in same cluster, against same data, without impacting production jobs
34 Try it out!
  Impala 1.0 was released on 04/30
  We have packages for: RHEL6.2/5.7 Ubuntu and SLES11 Debian6
  Questions/comments? My address:
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard
Hadoop and Relational base The Best of Both Worlds for Analytics Greg Battas Hewlett Packard The Evolution of Analytics Mainframe EDW Proprietary MPP Unix SMP MPP Appliance Hadoop? Questions Is Hadoop
Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, XLDB Conference at Stanford University, Sept 2012
Petabyte Scale Data at Facebook Dhruba Borthakur, Engineer at Facebook, XLDB Conference at Stanford University, Sept 2012 Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP)
Integrate Master Data with Big Data using Oracle Table Access for Hadoop
Integrate Master Data with Big Data using Oracle Table Access for Hadoop Kuassi Mensah Oracle Corporation Redwood Shores, CA, USA Keywords: Hadoop, BigData, Hive SQL, Spark SQL, HCatalog, StorageHandler
BIG DATA TECHNOLOGY. Hadoop Ecosystem
BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big
Big Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is
A Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop Ecosystem Raj Nair Director Data Platform Kiru Pakkirisamy CTO AGENDA About Penton and Serendio Inc Data Processing at Penton PoC Use Case Functional
Pivotal HAWQ 1.2.1 Release Notes
Pivotal HAWQ 1.2.1 Release Notes Rev: A03 Published: September 15, 2014 Updated: November 12, 2014 Contents About the Pivotal HAWQ Components What's New in the Release Supported Platforms Installation
