HAWQ Architecture. Alexey Grishchenko

Size: px
Start display at page:

Download "HAWQ Architecture. Alexey Grishchenko"

Transcription

1 HAWQ Architecture Alexey Grishchenko

2 Who I am Enterprise Pivotal 7 years in data processing 5 years of experience with MPP 4 years with Hadoop Using HAWQ since the first internal Beta Responsible for designing most of the EMEA HAWQ and Greenplum implementations Spark contributor

3 Agenda What is HAWQ

4 Agenda What is HAWQ Why you need it

5 Agenda What is HAWQ Why you need it HAWQ Components

6 Agenda What is HAWQ Why you need it HAWQ Components HAWQ Design

7 Agenda What is HAWQ Why you need it HAWQ Components HAWQ Design Query execution example

8 Agenda What is HAWQ Why you need it HAWQ Components HAWQ Design Query execution example Competitive solutions

9 What is Analytical SQL-on-Hadoop engine

10 What is Analytical SQL-on-Hadoop engine HAdoop With Queries

11 What is Analytical SQL-on-Hadoop engine HAdoop With Queries Postgres Greenplum HAWQ 2005 Fork Postgres 8.0.2

12 What is Analytical SQL-on-Hadoop engine HAdoop With Queries Postgres Greenplum HAWQ Fork Postgres Rebase Postgres

13 What is Analytical SQL-on-Hadoop engine HAdoop With Queries Postgres Greenplum HAWQ Fork Postgres Rebase Postgres Fork GPDB

14 What is Analytical SQL-on-Hadoop engine HAdoop With Queries Postgres Greenplum HAWQ Fork Postgres Rebase Postgres Fork GPDB HAWQ

15 What is Analytical SQL-on-Hadoop engine HAdoop With Queries Postgres Greenplum HAWQ Fork Postgres Rebase Postgres Fork GPDB HAWQ HAWQ Open Source

16 HAWQ is C and C++ lines of code

17 HAWQ is C and C++ lines of code of them in headers only

18 HAWQ is C and C++ lines of code of them in headers only Python LOC

19 HAWQ is C and C++ lines of code of them in headers only Python LOC Java LOC

20 HAWQ is C and C++ lines of code of them in headers only Python LOC Java LOC Makefile LOC

21 HAWQ is C and C++ lines of code of them in headers only Python LOC Java LOC Makefile LOC Shell scripts LOC

22 HAWQ is C and C++ lines of code of them in headers only Python LOC Java LOC Makefile LOC Shell scripts LOC More than 50 enterprise customers

23 HAWQ is C and C++ lines of code of them in headers only Python LOC Java LOC Makefile LOC Shell scripts LOC More than 50 enterprise customers More than 10 of them in EMEA

24 Apache HAWQ Apache HAWQ (incubating) from What s in Open Source Sources of HAWQ 2.0 alpha HAWQ 2.0 beta is planned for 2015 Q4 HAWQ 2.0 GA is planned for 2016 Q1 Community is yet young come and join!

25 Why do we need it?

26 Why do we need it? SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003

27 Why do we need it? SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, Example line query with a number of window function generated by Cognos

28 Why do we need it? SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, Example line query with a number of window function generated by Cognos Universal tool for ad hoc analytics on top of Hadoop data

29 Why do we need it? SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, Example line query with a number of window function generated by Cognos Universal tool for ad hoc analytics on top of Hadoop data Example - parse URL to extract protocol, host name, port, GET parameters

30 Why do we need it? SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, Example line query with a number of window function generated by Cognos Universal tool for ad hoc analytics on top of Hadoop data Example - parse URL to extract protocol, host name, port, GET parameters Good performance

31 Why do we need it? SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, Example line query with a number of window function generated by Cognos Universal tool for ad hoc analytics on top of Hadoop data Example - parse URL to extract protocol, host name, port, GET parameters Good performance How many times the data would hit the HDD during a single Hive query?

32 HAWQ Cluster Server 1 Server 2 Server 3 Server 4 ZK JM ZK JM ZK JM NameNode SNameNode interconnect Server 5 Server 6 Server N Datanode Datanode Datanode

33 HAWQ Cluster Server 1 Server 2 Server 3 Server 4 YARN RM YARN App Timeline ZK JM ZK JM ZK JM NameNode SNameNode interconnect Server 5 Server 6 Server N YARN NM YARN NM YARN NM Datanode Datanode Datanode

34 HAWQ Cluster Server 1 Server 2 Server 3 Server 4 YARN RM YARN App Timeline ZK JM ZK JM ZK JM HAWQ Master HAWQ Standby NameNode SNameNode interconnect Server 5 Server 6 Server N YARN NM YARN NM YARN NM Datanode Datanode Datanode

35 Master Servers Server 1 Server 2 Server 3 Server 4 YARN RM YARN App Timeline ZK JM ZK JM ZK JM HAWQ Master HAWQ Standby NameNode SNameNode interconnect Server 5 Server 6 Server N YARN NM YARN NM YARN NM Datanode Datanode Datanode

36 Master Servers HAWQ Master HAWQ Standby Master Query Parser Distributed TransacJons Manager Query Parser Distributed Transac,ons Manager Query OpJmizer Metadata Catalog WAL repl. Query Op,mizer Metadata Catalog Global Resource Manager Query Dispatch Global Resource Manager Query Dispatch

37 Segments Server 1 Server 2 Server 3 Server 4 YARN RM YARN App Timeline ZK JM ZK JM ZK JM HAWQ Master HAWQ Standby NameNode SNameNode interconnect Server 5 Server 6 Server N YARN NM YARN NM YARN NM Datanode Datanode Datanode

38 Segments Query Executor libhdfs3 PXF YARN Node Manager Local Filesystem Temporary Data Directory Logs

39 Metadata HAWQ metadata structure is similar to Postgres catalog structure

40 Metadata HAWQ metadata structure is similar to Postgres catalog structure Statistics Number of rows and pages in the table

41 Metadata HAWQ metadata structure is similar to Postgres catalog structure Statistics Number of rows and pages in the table Most common values for each field

42 Metadata HAWQ metadata structure is similar to Postgres catalog structure Statistics Number of rows and pages in the table Most common values for each field Histogram of values distribution for each field

43 Metadata HAWQ metadata structure is similar to Postgres catalog structure Statistics Number of rows and pages in the table Most common values for each field Histogram of values distribution for each field Number of unique values in the field

44 Metadata HAWQ metadata structure is similar to Postgres catalog structure Statistics Number of rows and pages in the table Most common values for each field Histogram of values distribution for each field Number of unique values in the field Number of null values in the field

45 Metadata HAWQ metadata structure is similar to Postgres catalog structure Statistics Number of rows and pages in the table Most common values for each field Histogram of values distribution for each field Number of unique values in the field Number of null values in the field Average width of the field in bytes

46 Statistics No Statistics How many rows would produce the join of two tables?

47 Statistics No Statistics How many rows would produce the join of two tables? ü From 0 to infinity

48 Statistics No Statistics How many rows would produce the join of two tables? ü From 0 to infinity Row Count How many rows would produce the join of two row tables?

49 Statistics No Statistics How many rows would produce the join of two tables? ü From 0 to infinity Row Count How many rows would produce the join of two row tables? ü From 0 to

50 Statistics No Statistics How many rows would produce the join of two tables? ü From 0 to infinity Row Count How many rows would produce the join of two row tables? ü From 0 to Histograms and MCV How many rows would produce the join of two row tables, with known field cardinality, values distribution diagram, number of nulls, most common values?

51 Statistics No Statistics How many rows would produce the join of two tables? ü From 0 to infinity Row Count How many rows would produce the join of two row tables? ü From 0 to Histograms and MCV How many rows would produce the join of two row tables, with known field cardinality, values distribution diagram, number of nulls, most common values? ü ~ From 500 to 1 500

52 Metadata Table structure information ID Name Num Price 1 Яблоко Груша Банан Апельсин Киви Арбуз Дыня Ананас 35 90

53 Metadata Table structure information Distribution fields ID Name Num Price 1 Яблоко Груша Банан Апельсин Киви Арбуз Дыня Ананас hash(id)

54 Metadata Table structure information Distribution fields Number of hash buckets ID Name Num Price 1 Яблоко Груша Банан Апельсин Киви Арбуз Дыня Ананас hash(id) ID Name Num Price 1 Яблоко Груша Банан Апельсин Киви Арбуз Дыня Ананас 35 90

55 Metadata Table structure information Distribution fields Number of hash buckets Partitioning (hash, list, range) ID Name Num Price 1 Яблоко Груша Банан Апельсин Киви Арбуз Дыня Ананас hash(id) ID Name Num Price 1 Яблоко Груша Банан Апельсин Киви Арбуз Дыня Ананас 35 90

56 Metadata Table structure information Distribution fields Number of hash buckets Partitioning (hash, list, range) General metadata Users and groups

57 Metadata Table structure information Distribution fields Number of hash buckets Partitioning (hash, list, range) General metadata Users and groups Access privileges

58 Metadata Table structure information Distribution fields Number of hash buckets Partitioning (hash, list, range) General metadata Users and groups Access privileges Stored procedures PL/pgSQL, PL/Java, PL/Python, PL/Perl, PL/R

59 Query Optimizer HAWQ uses cost-based query optimizers

60 Query Optimizer HAWQ uses cost-based query optimizers You have two options Planner evolved from the Postgres query optimizer ORCA (Pivotal Query Optimizer) developed specifically for HAWQ

61 Query Optimizer HAWQ uses cost-based query optimizers You have two options Planner evolved from the Postgres query optimizer ORCA (Pivotal Query Optimizer) developed specifically for HAWQ Optimizer hints work just like in Postgres Enable/disable specific operation Change the cost estimations for basic actions

62 Storage Formats Which storage format is the most optimal?

63 Storage Formats Which storage format is the most optimal? ü It depends on what you mean by optimal

64 Storage Formats Which storage format is the most optimal? ü It depends on what you mean by optimal Minimal CPU usage for reading and writing the data

65 Storage Formats Which storage format is the most optimal? ü It depends on what you mean by optimal Minimal CPU usage for reading and writing the data Minimal disk space usage

66 Storage Formats Which storage format is the most optimal? ü It depends on what you mean by optimal Minimal CPU usage for reading and writing the data Minimal disk space usage Minimal time to retrieve record by key

67 Storage Formats Which storage format is the most optimal? ü It depends on what you mean by optimal Minimal CPU usage for reading and writing the data Minimal disk space usage Minimal time to retrieve record by key Minimal time to retrieve subset of columns etc.

68 Storage Formats Row-based storage format Similar to Postgres heap storage No toast No ctid, xmin, xmax, cmin, cmax

69 Storage Formats Row-based storage format Similar to Postgres heap storage No toast No ctid, xmin, xmax, cmin, cmax Compression No compression Quicklz Zlib levels 1-9

70 Storage Formats Apache Parquet Mixed row-columnar table store, the data is split into row groups stored in columnar format

71 Storage Formats Apache Parquet Mixed row-columnar table store, the data is split into row groups stored in columnar format Compression No compression Snappy Gzip levels 1 9

72 Storage Formats Apache Parquet Mixed row-columnar table store, the data is split into row groups stored in columnar format Compression No compression Snappy Gzip levels 1 9 The size of row group and page size can be set for each table separately

73 Resource Management Two main options Static resource split HAWQ and YARN does not know about each other

74 Resource Management Two main options Static resource split HAWQ and YARN does not know about each other YARN HAWQ asks YARN Resource Manager for query execution resources

75 Resource Management Two main options Static resource split HAWQ and YARN does not know about each other YARN HAWQ asks YARN Resource Manager for query execution resources Flexible cluster utilization Query might run on a subset of nodes if it is small

76 Resource Management Two main options Static resource split HAWQ and YARN does not know about each other YARN HAWQ asks YARN Resource Manager for query execution resources Flexible cluster utilization Query might run on a subset of nodes if it is small Query might have many executors on each cluster node to make it run faster

77 Resource Management Two main options Static resource split HAWQ and YARN does not know about each other YARN HAWQ asks YARN Resource Manager for query execution resources Flexible cluster utilization Query might run on a subset of nodes if it is small Query might have many executors on each cluster node to make it run faster You can control the parallelism of each query

78 Resource Management Resource Queue can be set with Maximum number of parallel queries

79 Resource Management Resource Queue can be set with Maximum number of parallel queries CPU usage priority

80 Resource Management Resource Queue can be set with Maximum number of parallel queries CPU usage priority Memory usage limits

81 Resource Management Resource Queue can be set with Maximum number of parallel queries CPU usage priority Memory usage limits CPU cores usage limit

82 Resource Management Resource Queue can be set with Maximum number of parallel queries CPU usage priority Memory usage limits CPU cores usage limit MIN/MAX number of executors across the system

83 Resource Management Resource Queue can be set with Maximum number of parallel queries CPU usage priority Memory usage limits CPU cores usage limit MIN/MAX number of executors across the system MIN/MAX number of executors on each node

84 Resource Management Resource Queue can be set with Maximum number of parallel queries CPU usage priority Memory usage limits CPU cores usage limit MIN/MAX number of executors across the system MIN/MAX number of executors on each node Can be set up for user or group

85 External Data PXF Framework for external data access Easy to extend, many public plugins available Official plugins: CSV, SequenceFile, Avro, Hive, HBase Open Source plugins: JSON, Accumulo, Cassandra, JDBC, Redis, Pipe

86 External Data PXF Framework for external data access Easy to extend, many public plugins available Official plugins: CSV, SequenceFile, Avro, Hive, HBase Open Source plugins: JSON, Accumulo, Cassandra, JDBC, Redis, Pipe HCatalog HAWQ can query tables from HCatalog the same way as HAWQ native tables

87 Query Example HAWQ Master Query Parser Query OpJmizer YARN RM Metadata TransacJon Mgr. Resource Mgr. Query Dispatch NameNode Server 1 Server 2 Server N Plan Resource Prepare Execute Result Cleanup

88 Query Example HAWQ Master Query Parser Query OpJmizer YARN RM Metadata TransacJon Mgr. QE Resource Mgr. Query Dispatch NameNode Server 1 Server 2 Server N Plan Resource Prepare Execute Result Cleanup

89 Query Example HAWQ Master Query Parser Query OpJmizer YARN RM Metadata TransacJon Mgr. QE Resource Mgr. Query Dispatch NameNode Server 1 Server 2 Server N Plan Resource Prepare Execute Result Cleanup

90 Query Example HAWQ Master Query Parser Query OpJmizer YARN RM Metadata TransacJon Mgr. QE Resource Mgr. Query Dispatch NameNode Server 1 Server 2 Server N Plan Resource Prepare Execute Result Cleanup

91 Query Example HAWQ Master Query Parser Query OpJmizer YARN RM Metadata TransacJon Mgr. QE Resource Mgr. Query Dispatch NameNode Server 1 Server 2 Server N Plan Resource Prepare Execute Result Cleanup

92 Query Example HAWQ Master Query Parser Query OpJmizer Motion Gather Project s.beer, s.price YARN RM Metadata Resource Mgr. HashJoin b.name = s.bar TransacJon Mgr. QE Query Dispatch s Scan Sells Motion Redist(b.name) Filter b.city = 'San Francisco' b Scan Bars NameNode Server 1 Server 2 Server N Plan Resource Prepare Execute Result Cleanup

93 Query Example HAWQ Master Query Parser Query OpJmizer YARN RM Metadata Resource Mgr. TransacJon Mgr. QE Query Dispatch NameNode Server 1 Server 2 Server N Plan Resource Prepare Execute Result Cleanup

94 Query Example HAWQ Master Query Parser Query OpJmizer YARN RM Metadata TransacJon Mgr. QE Resource Mgr. Query Dispatch I need 5 containers Each with 1 CPU core and 256 MB RAM NameNode Server 1 Server 2 Server N Plan Resource Prepare Execute Result Cleanup

95 Query Example HAWQ Master Query Parser Query OpJmizer Server 1: 2 containers Server 2: 1 container Server N: 2 containers YARN RM Metadata TransacJon Mgr. QE Resource Mgr. Query Dispatch I need 5 containers Each with 1 CPU core and 256 MB RAM NameNode Server 1 Server 2 Server N Plan Resource Prepare Execute Result Cleanup

96 Query Example HAWQ Master Query Parser Query OpJmizer Server 1: 2 containers Server 2: 1 container Server N: 2 containers YARN RM Metadata TransacJon Mgr. QE Resource Mgr. Query Dispatch I need 5 containers Each with 1 CPU core and 256 MB RAM NameNode Server 1 Server 2 Server N Plan Resource Prepare Execute Result Cleanup

97 Query Example HAWQ Master Query Parser Query OpJmizer Server 1: 2 containers Server 2: 1 container Server N: 2 containers YARN RM Metadata TransacJon Mgr. QE Resource Mgr. Query Dispatch I need 5 containers Each with 1 CPU core and 256 MB RAM NameNode Server 1 Server 2 Server N Plan Resource Prepare Execute Result Cleanup

98 Query Example HAWQ Master Query Parser Query OpJmizer Server 1: 2 containers Server 2: 1 container Server N: 2 containers YARN RM Metadata TransacJon Mgr. QE Resource Mgr. Query Dispatch I need 5 containers Each with 1 CPU core and 256 MB RAM NameNode Server 1 Server 2 Server N QE QE QE QE QE Plan Resource Prepare Execute Result Cleanup

99 Query Example HAWQ Master Query Parser Query OpJmizer YARN RM Metadata Resource Mgr. TransacJon Mgr. QE Query Dispatch NameNode Server 1 Server 2 Server N QE QE QE QE QE Plan Resource Prepare Execute Result Cleanup

100 Query Example HAWQ Master Query Parser Query OpJmizer Motion Gather Project s.beer, s.price YARN RM Metadata Resource Mgr. HashJoin b.name = s.bar TransacJon Mgr. QE Query Dispatch s Scan Sells Motion Redist(b.name) Filter b.city = 'San Francisco' b Scan Bars NameNode Server 1 Server 2 Server N QE QE QE QE QE Plan Resource Prepare Execute Result Cleanup

101 Query Example HAWQ Master Query Parser Query OpJmizer Motion Gather Project s.beer, s.price YARN RM Metadata Resource Mgr. HashJoin b.name = s.bar TransacJon Mgr. QE Query Dispatch s Scan Sells Motion Redist(b.name) Filter b.city = 'San Francisco' b Scan Bars NameNode Server 1 Server 2 Server N QE QE QE QE QE Plan Resource Prepare Execute Result Cleanup

102 Query Example HAWQ Master Query Parser Query OpJmizer Motion Gather Project s.beer, s.price YARN RM Metadata Resource Mgr. HashJoin b.name = s.bar TransacJon Mgr. QE Query Dispatch s Scan Sells Motion Redist(b.name) Filter b.city = 'San Francisco' b Scan Bars NameNode Server 1 Server 2 Server N QE QE QE QE QE Plan Resource Prepare Execute Result Cleanup

103 Query Example HAWQ Master Query Parser Query OpJmizer Motion Gather Project s.beer, s.price YARN RM Metadata Resource Mgr. HashJoin b.name = s.bar TransacJon Mgr. QE Query Dispatch s Scan Sells Motion Redist(b.name) Filter b.city = 'San Francisco' b Scan Bars NameNode Server 1 Server 2 Server N QE QE QE QE QE Plan Resource Prepare Execute Result Cleanup

104 Query Example HAWQ Master Query Parser Query OpJmizer Motion Gather Project s.beer, s.price YARN RM Metadata Resource Mgr. HashJoin b.name = s.bar TransacJon Mgr. QE Query Dispatch s Scan Sells Motion Redist(b.name) Filter b.city = 'San Francisco' b Scan Bars NameNode Server 1 Server 2 Server N QE QE QE QE QE Plan Resource Prepare Execute Result Cleanup

105 Query Example HAWQ Master Query Parser Query OpJmizer YARN RM Metadata Resource Mgr. TransacJon Mgr. QE Query Dispatch NameNode Server 1 Server 2 Server N QE QE QE QE QE Plan Resource Prepare Execute Result Cleanup

106 Query Example HAWQ Master Query Parser Query OpJmizer YARN RM Metadata Resource Mgr. TransacJon Mgr. QE Query Dispatch NameNode Server 1 Server 2 Server N QE QE QE QE QE Plan Resource Prepare Execute Result Cleanup

107 Query Example HAWQ Master Query Parser Query OpJmizer YARN RM Metadata Resource Mgr. TransacJon Mgr. QE Query Dispatch NameNode Server 1 Server 2 Server N QE QE QE QE QE Plan Resource Prepare Execute Result Cleanup

108 Query Example HAWQ Master Query Parser Query OpJmizer YARN RM Metadata Resource Mgr. TransacJon Mgr. QE Query Dispatch NameNode Server 1 Server 2 Server N QE QE QE QE QE Plan Resource Prepare Execute Result Cleanup

109 Query Example HAWQ Master Query Parser Query OpJmizer YARN RM Metadata TransacJon Mgr. QE Resource Mgr. Query Dispatch Free query resources Server 1: 2 containers Server 2: 1 container Server N: 2 containers NameNode Server 1 Server 2 Server N QE QE QE QE QE Plan Resource Prepare Execute Result Cleanup

110 Query Example HAWQ Master Query Parser Query OpJmizer OK YARN RM Metadata TransacJon Mgr. QE Resource Mgr. Query Dispatch Free query resources Server 1: 2 containers Server 2: 1 container Server N: 2 containers NameNode Server 1 Server 2 Server N QE QE QE QE QE Plan Resource Prepare Execute Result Cleanup

111 Query Example HAWQ Master Query Parser Query OpJmizer OK YARN RM Metadata TransacJon Mgr. QE Resource Mgr. Query Dispatch Free query resources Server 1: 2 containers Server 2: 1 container Server N: 2 containers NameNode Server 1 Server 2 Server N QE QE QE QE QE Plan Resource Prepare Execute Result Cleanup

112 Query Example HAWQ Master Query Parser Query OpJmizer OK YARN RM Metadata TransacJon Mgr. QE Resource Mgr. Query Dispatch Free query resources Server 1: 2 containers Server 2: 1 container Server N: 2 containers NameNode Server 1 Server 2 Server N QE QE QE QE QE Plan Resource Prepare Execute Result Cleanup

113 Query Example HAWQ Master Query Parser Query OpJmizer OK YARN RM Metadata TransacJon Mgr. QE Resource Mgr. Query Dispatch Free query resources Server 1: 2 containers Server 2: 1 container Server N: 2 containers NameNode Server 1 Server 2 Server N QE QE QE QE QE Plan Resource Prepare Execute Result Cleanup

114 Query Example HAWQ Master Query Parser Query OpJmizer YARN RM Metadata TransacJon Mgr. Resource Mgr. Query Dispatch NameNode Server 1 Server 2 Server N Plan Resource Prepare Execute Result Cleanup

115 Query Performance Data does not hit the disk unless this cannot be avoided

116 Query Performance Data does not hit the disk unless this cannot be avoided Data is not buffered on the segments unless this cannot be avoided

117 Query Performance Data does not hit the disk unless this cannot be avoided Data is not buffered on the segments unless this cannot be avoided Data is transferred between the nodes by UDP

118 Query Performance Data does not hit the disk unless this cannot be avoided Data is not buffered on the segments unless this cannot be avoided Data is transferred between the nodes by UDP HAWQ has a good cost-based query optimizer

119 Query Performance Data does not hit the disk unless this cannot be avoided Data is not buffered on the segments unless this cannot be avoided Data is transferred between the nodes by UDP HAWQ has a good cost-based query optimizer C/C++ implementation is more efficient than Java implementation of competitive solutions

120 Query Performance Data does not hit the disk unless this cannot be avoided Data is not buffered on the segments unless this cannot be avoided Data is transferred between the nodes by UDP HAWQ has a good cost-based query optimizer C/C++ implementation is more efficient than Java implementation of competitive solutions Query parallelism can be easily tuned

121 Competitive Solutions Query OpJmizer Hive SparkSQL Impala HAWQ

122 Competitive Solutions Hive SparkSQL Impala HAWQ Query OpJmizer ANSI SQL

123 Competitive Solutions Hive SparkSQL Impala HAWQ Query OpJmizer ANSI SQL Built-in Languages

124 Competitive Solutions Hive SparkSQL Impala HAWQ Query OpJmizer ANSI SQL Built-in Languages Disk IO

125 Competitive Solutions Hive SparkSQL Impala HAWQ Query OpJmizer ANSI SQL Built-in Languages Disk IO Parallelism

126 Competitive Solutions Hive SparkSQL Impala HAWQ Query OpJmizer ANSI SQL Built-in Languages Disk IO Parallelism DistribuJons

127 Competitive Solutions Hive SparkSQL Impala HAWQ Query OpJmizer ANSI SQL Built-in Languages Disk IO Parallelism DistribuJons Stability

128 Competitive Solutions Hive SparkSQL Impala HAWQ Query OpJmizer ANSI SQL Built-in Languages Disk IO Parallelism DistribuJons Stability Community

129 Roadmap AWS and S3 integration

130 Roadmap AWS and S3 integration Mesos integration

131 Roadmap AWS and S3 integration Mesos integration Better Ambari integration

132 Roadmap AWS and S3 integration Mesos integration Better Ambari integration Cloudera, MapR and IBM Hadoop distributions native support

133 Roadmap AWS and S3 integration Mesos integration Better Ambari integration Cloudera, MapR and IBM Hadoop distributions native support Make the SQL-on-Hadoop engine ever!

134 Summary Modern SQL-on-Hadoop engine For structured data processing and analysis Combines the best techniques of competitive solutions Just released to the open source Community is very young Join our community and contribute!

135 Questions Apache HAWQ Reach me on

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian Tzolov @christzolov

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian Tzolov @christzolov Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA by Christian Tzolov @christzolov Whoami Christian Tzolov Technical Architect at Pivotal, BigData, Hadoop, SpringXD,

More information

Accelerating GeoSpatial Data Analytics With Pivotal Greenplum Database

Accelerating GeoSpatial Data Analytics With Pivotal Greenplum Database Copyright 2014 Pivotal Software, Inc. All rights reserved. 1 Accelerating GeoSpatial Data Analytics With Pivotal Greenplum Database Kuien Liu Pivotal, Inc. FOSS4G Seoul 2015 Warm-up: GeoSpatial on Hadoop

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Fundamentals Curriculum HAWQ

Fundamentals Curriculum HAWQ Fundamentals Curriculum Pivotal Hadoop 2.1 HAWQ Education Services zdata Inc. 660 4th St. Ste. 176 San Francisco, CA 94107 t. 415.890.5764 zdatainc.com Pivotal Hadoop & HAWQ Fundamentals Course Description

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

Apache Sentry. Prasad Mujumdar [email protected] [email protected]

Apache Sentry. Prasad Mujumdar prasadm@apache.org prasadm@cloudera.com Apache Sentry Prasad Mujumdar [email protected] [email protected] Agenda Various aspects of data security Apache Sentry for authorization Key concepts of Apache Sentry Sentry features Sentry architecture

More information

Pivotal HAWQ 1.2.1 Release Notes

Pivotal HAWQ 1.2.1 Release Notes Pivotal HAWQ 1.2.1 Release Notes Rev: A03 Published: September 15, 2014 Updated: November 12, 2014 Contents About the Pivotal HAWQ Components What's New in the Release Supported Platforms Installation

More information

Impala: A Modern, Open-Source SQL Engine for Hadoop. Marcel Kornacker Cloudera, Inc.

Impala: A Modern, Open-Source SQL Engine for Hadoop. Marcel Kornacker Cloudera, Inc. Impala: A Modern, Open-Source SQL Engine for Hadoop Marcel Kornacker Cloudera, Inc. Agenda Goals; user view of Impala Impala performance Impala internals Comparing Impala to other systems Impala Overview:

More information

How To Create A Data Visualization With Apache Spark And Zeppelin 2.5.3.5

How To Create A Data Visualization With Apache Spark And Zeppelin 2.5.3.5 Big Data Visualization using Apache Spark and Zeppelin Prajod Vettiyattil, Software Architect, Wipro Agenda Big Data and Ecosystem tools Apache Spark Apache Zeppelin Data Visualization Combining Spark

More information

EMC SOLUTION FOR AGILE AND ROBUST ANALYTICS ON HADOOP DATA LAKE WITH PIVOTAL HDB

EMC SOLUTION FOR AGILE AND ROBUST ANALYTICS ON HADOOP DATA LAKE WITH PIVOTAL HDB EMC SOLUTION FOR AGILE AND ROBUST ANALYTICS ON HADOOP DATA LAKE WITH PIVOTAL HDB ABSTRACT As companies increasingly adopt data lakes as a platform for storing data from a variety of sources, the need for

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look IBM BigInsights Has Potential If It Lives Up To Its Promise By Prakash Sukumar, Principal Consultant at iolap, Inc. IBM released Hadoop-based InfoSphere BigInsights in May 2013. There are already Hadoop-based

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Hadoop: The Definitive Guide

Hadoop: The Definitive Guide FOURTH EDITION Hadoop: The Definitive Guide Tom White Beijing Cambridge Famham Koln Sebastopol Tokyo O'REILLY Table of Contents Foreword Preface xvii xix Part I. Hadoop Fundamentals 1. Meet Hadoop 3 Data!

More information

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam [email protected]

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam [email protected] Agenda The rise of Big Data & Hadoop MySQL in the Big Data Lifecycle MySQL Solutions for Big Data Q&A

More information

Cloudera Impala: A Modern SQL Engine for Hadoop Headline Goes Here

Cloudera Impala: A Modern SQL Engine for Hadoop Headline Goes Here Cloudera Impala: A Modern SQL Engine for Hadoop Headline Goes Here JusIn Erickson Senior Product Manager, Cloudera Speaker Name or Subhead Goes Here May 2013 DO NOT USE PUBLICLY PRIOR TO 10/23/12 Agenda

More information

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big

More information

Apache HBase. Crazy dances on the elephant back

Apache HBase. Crazy dances on the elephant back Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage

More information

Hadoop Job Oriented Training Agenda

Hadoop Job Oriented Training Agenda 1 Hadoop Job Oriented Training Agenda Kapil CK [email protected] Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

The Hadoop Eco System Shanghai Data Science Meetup

The Hadoop Eco System Shanghai Data Science Meetup The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related

More information

Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC

Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC Agenda Quick Overview of Impala Design Challenges of an Impala Deployment Case Study: Use Simulation-Based Approach to Design

More information

H2O on Hadoop. September 30, 2014. www.0xdata.com

H2O on Hadoop. September 30, 2014. www.0xdata.com H2O on Hadoop September 30, 2014 www.0xdata.com H2O on Hadoop Introduction H2O is the open source math & machine learning engine for big data that brings distribution and parallelism to powerful algorithms

More information

Native Connectivity to Big Data Sources in MSTR 10

Native Connectivity to Big Data Sources in MSTR 10 Native Connectivity to Big Data Sources in MSTR 10 Bring All Relevant Data to Decision Makers Support for More Big Data Sources Optimized Access to Your Entire Big Data Ecosystem as If It Were a Single

More information

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385 brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 1 Hadoop in a heartbeat 3 2 Introduction to YARN 22 PART 2 DATA LOGISTICS...59 3 Data serialization working with text and beyond 61 4 Organizing and

More information

Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Understanding Big Data and Big Data Analytics Getting familiar with Hadoop Technology Hadoop release and upgrades

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Big Data Primer. 1 Why Big Data? Alex Sverdlov [email protected]

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com Big Data Primer Alex Sverdlov [email protected] 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.

More information

Enabling High performance Big Data platform with RDMA

Enabling High performance Big Data platform with RDMA Enabling High performance Big Data platform with RDMA Tong Liu HPC Advisory Council Oct 7 th, 2014 Shortcomings of Hadoop Administration tooling Performance Reliability SQL support Backup and recovery

More information

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex Holmes @

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex Holmes @ The Top 10 7 Hadoop Patterns and Anti-patterns Alex Holmes @ whoami Alex Holmes Software engineer Working on distributed systems for many years Hadoop since 2008 @grep_alex grepalex.com what s hadoop...

More information

Big Data Technology Core Hadoop: HDFS-YARN Internals

Big Data Technology Core Hadoop: HDFS-YARN Internals Big Data Technology Core Hadoop: HDFS-YARN Internals Eshcar Hillel Yahoo! Ronny Lempel Outbrain *Based on slides by Edward Bortnikov & Ronny Lempel Roadmap Previous class Map-Reduce Motivation This class

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

The Internet of Things and Big Data: Intro

The Internet of Things and Big Data: Intro The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific

More information

Performance and Scalability Overview

Performance and Scalability Overview Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics platform. PENTAHO PERFORMANCE ENGINEERING

More information

SQL on NoSQL (and all of the data) With Apache Drill

SQL on NoSQL (and all of the data) With Apache Drill SQL on NoSQL (and all of the data) With Apache Drill Richard Shaw Solutions Architect @aggress Who What Where NoSQL DB Very Nice People Open Source Distributed Storage & Compute Platform (up to 1000s of

More information

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016 Big Data Approaches Making Sense of Big Data Ian Crosland Jan 2016 Accelerate Big Data ROI Even firms that are investing in Big Data are still struggling to get the most from it. Make Big Data Accessible

More information

Integrating Apache Spark with an Enterprise Data Warehouse

Integrating Apache Spark with an Enterprise Data Warehouse Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software

More information

Parquet. Columnar storage for the people

Parquet. Columnar storage for the people Parquet Columnar storage for the people Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter Nong Li [email protected] Software engineer, Cloudera Impala Outline Context from various

More information

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack Apache Spark Document Analysis Course (Fall 2015 - Scott Sanner) Zahra Iman Some slides from (Matei Zaharia, UC Berkeley / MIT& Harold Liu) Reminder SparkConf JavaSpark RDD: Resilient Distributed Datasets

More information

docs.hortonworks.com

docs.hortonworks.com docs.hortonworks.com Hortonworks Data Platform: Hive Performance Tuning Copyright 2012-2015 Hortonworks, Inc. Some rights reserved. The Hortonworks Data Platform, powered by Apache Hadoop, is a massively

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,[email protected]

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,[email protected] Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,

More information

HDFS. Hadoop Distributed File System

HDFS. Hadoop Distributed File System HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files

More information

Moving From Hadoop to Spark

Moving From Hadoop to Spark + Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com [email protected] Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee

More information

Actian SQL in Hadoop Buyer s Guide

Actian SQL in Hadoop Buyer s Guide Actian SQL in Hadoop Buyer s Guide Contents Introduction: Big Data and Hadoop... 3 SQL on Hadoop Benefits... 4 Approaches to SQL on Hadoop... 4 The Top 10 SQL in Hadoop Capabilities... 5 SQL in Hadoop

More information

AtScale Intelligence Platform

AtScale Intelligence Platform AtScale Intelligence Platform PUT THE POWER OF HADOOP IN THE HANDS OF BUSINESS USERS. Connect your BI tools directly to Hadoop without compromising scale, performance, or control. TURN HADOOP INTO A HIGH-PERFORMANCE

More information

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta [email protected] Radu Chilom [email protected] Buzzwords Berlin - 2015 Big data analytics / machine

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI

More information

Actian Vector in Hadoop

Actian Vector in Hadoop Actian Vector in Hadoop Industrialized, High-Performance SQL in Hadoop A Technical Overview Contents Introduction...3 Actian Vector in Hadoop - Uniquely Fast...5 Exploiting the CPU...5 Exploiting Single

More information

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing

More information

Integrating with Hadoop HPE Vertica Analytic Database. Software Version: 7.2.x

Integrating with Hadoop HPE Vertica Analytic Database. Software Version: 7.2.x HPE Vertica Analytic Database Software Version: 7.2.x Document Release Date: 5/18/2016 Legal Notices Warranty The only warranties for Hewlett Packard Enterprise products and services are set forth in the

More information

Case Study : 3 different hadoop cluster deployments

Case Study : 3 different hadoop cluster deployments Case Study : 3 different hadoop cluster deployments Lee moon soo [email protected] HDFS as a Storage Last 4 years, our HDFS clusters, stored Customer 1500 TB+ data safely served 375,000 TB+ data to customer

More information

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction

More information

Oracle Big Data SQL. Architectural Deep Dive. Dan McClary, Ph.D. Big Data Product Management Oracle

Oracle Big Data SQL. Architectural Deep Dive. Dan McClary, Ph.D. Big Data Product Management Oracle Oracle Big Data SQL Architectural Deep Dive Dan McClary, Ph.D. Big Data Product Management Oracle Copyright 2014, Oracle and/or its affiliates. All rights reserved. Safe Harbor Statement The following is

More information

PivotalR: A Package for Machine Learning on Big Data

PivotalR: A Package for Machine Learning on Big Data PivotalR: A Package for Machine Learning on Big Data Hai Qian Predictive Analytics Team, Pivotal Inc. [email protected] Copyright 2013 Pivotal. All rights reserved. What Can Small Data Scientists Bring

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information

Hadoop: Embracing future hardware

Hadoop: Embracing future hardware Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop

More information

Big Data SQL and Query Franchising

Big Data SQL and Query Franchising Big Data SQL and Query Franchising An Architecture for Query Beyond Hadoop Dan McClary, Ph.D. Big Data Product Management Oracle Copyright 2014, Oracle and/or its affiliates. All rights reserved. Safe Harbor

More information

Oracle Database 11g: SQL Tuning Workshop

Oracle Database 11g: SQL Tuning Workshop Oracle University Contact Us: + 38516306373 Oracle Database 11g: SQL Tuning Workshop Duration: 3 Days What you will learn This Oracle Database 11g: SQL Tuning Workshop Release 2 training assists database

More information

Using RDBMS, NoSQL or Hadoop?

Using RDBMS, NoSQL or Hadoop? Using RDBMS, NoSQL or Hadoop? DOAG Conference 2015 Jean- Pierre Dijcks Big Data Product Management Server Technologies Copyright 2014 Oracle and/or its affiliates. All rights reserved. Data Ingest 2 Ingest

More information

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved. EMC Federation Big Data Solutions 1 Introduction to data analytics Federation offering 2 Traditional Analytics! Traditional type of data analysis, sometimes called Business Intelligence! Type of analytics

More information

Complete Java Classes Hadoop Syllabus Contact No: 8888022204

Complete Java Classes Hadoop Syllabus Contact No: 8888022204 1) Introduction to BigData & Hadoop What is Big Data? Why all industries are talking about Big Data? What are the issues in Big Data? Storage What are the challenges for storing big data? Processing What

More information

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 Ralph Kimball Associates 2014 The Data Warehouse Mission Identify all possible enterprise data assets Select those assets

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

Open Source Technologies on Microsoft Azure

Open Source Technologies on Microsoft Azure Open Source Technologies on Microsoft Azure A Survey @DChappellAssoc Copyright 2014 Chappell & Associates The Main Idea i Open source technologies are a fundamental part of Microsoft Azure The Big Questions

More information

Cloudera Manager Training: Hands-On Exercises

Cloudera Manager Training: Hands-On Exercises 201408 Cloudera Manager Training: Hands-On Exercises General Notes... 2 In- Class Preparation: Accessing Your Cluster... 3 Self- Study Preparation: Creating Your Cluster... 4 Hands- On Exercise: Working

More information

Performance and Scalability Overview

Performance and Scalability Overview Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics Platform. Contents Pentaho Scalability and

More information

This Preview Edition of Hadoop Application Architectures, Chapters 1 and 2, is a work in progress. The final book is currently scheduled for release

This Preview Edition of Hadoop Application Architectures, Chapters 1 and 2, is a work in progress. The final book is currently scheduled for release Compliments of PREVIEW EDITION Hadoop Application Architectures DESIGNING REAL WORLD BIG DATA APPLICATIONS Mark Grover, Ted Malaska, Jonathan Seidman & Gwen Shapira This Preview Edition of Hadoop Application

More information

Best Practices for Hadoop Data Analysis with Tableau

Best Practices for Hadoop Data Analysis with Tableau Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Sisense. Product Highlights. www.sisense.com

Sisense. Product Highlights. www.sisense.com Sisense Product Highlights Introduction Sisense is a business intelligence solution that simplifies analytics for complex data by offering an end-to-end platform that lets users easily prepare and analyze

More information

HiBench Introduction. Carson Wang ([email protected]) Software & Services Group

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group HiBench Introduction Carson Wang ([email protected]) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is

More information

Design and Evolution of the Apache Hadoop File System(HDFS)

Design and Evolution of the Apache Hadoop File System(HDFS) Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop

More information

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering MySQL and Hadoop: Big Data Integration Shubhangi Garg & Neha Kumari MySQL Engineering 1Copyright 2013, Oracle and/or its affiliates. All rights reserved. Agenda Design rationale Implementation Installation

More information

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse SQL Server 2012 PDW Ryan Simpson Technical Solution Professional PDW Microsoft Microsoft SQL Server 2012 Parallel Data Warehouse Massively Parallel Processing Platform Delivers Big Data HDFS Delivers Scale

More information

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved. Collaborative Big Data Analytics 1 Big Data Is Less About Size, And More About Freedom TechCrunch!!!!!!!!! Total data: bigger than big data 451 Group Findings: Big Data Is More Extreme Than Volume Gartner!!!!!!!!!!!!!!!

More information

Wisdom from Crowds of Machines

Wisdom from Crowds of Machines Wisdom from Crowds of Machines Analytics and Big Data Summit September 19, 2013 Chetan Conikee Irfan Ahmad About Us CloudPhysics' mission is to discover the underlying principles that govern systems behavior

More information

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon [email protected] [email protected] XLDB

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer [email protected] Assistants: Henri Terho and Antti

More information

Hadoop and Hive Development at Facebook. Dhruba Borthakur Zheng Shao {dhruba, zshao}@facebook.com Presented at Hadoop World, New York October 2, 2009

Hadoop and Hive Development at Facebook. Dhruba Borthakur Zheng Shao {dhruba, zshao}@facebook.com Presented at Hadoop World, New York October 2, 2009 Hadoop and Hive Development at Facebook Dhruba Borthakur Zheng Shao {dhruba, zshao}@facebook.com Presented at Hadoop World, New York October 2, 2009 Hadoop @ Facebook Who generates this data? Lots of data

More information

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc. Oracle BI EE Implementation on Netezza Prepared by SureShot Strategies, Inc. The goal of this paper is to give an insight to Netezza architecture and implementation experience to strategize Oracle BI EE

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Frontera: open source, large scale web crawling framework. Alexander Sibiryakov, October 1, 2015 [email protected]

Frontera: open source, large scale web crawling framework. Alexander Sibiryakov, October 1, 2015 sibiryakov@scrapinghub.com Frontera: open source, large scale web crawling framework Alexander Sibiryakov, October 1, 2015 [email protected] Sziasztok résztvevők! Born in Yekaterinburg, RU 5 years at Yandex, search quality

More information

... ... PEPPERDATA OVERVIEW AND DIFFERENTIATORS ... ... ... ... ...

... ... PEPPERDATA OVERVIEW AND DIFFERENTIATORS ... ... ... ... ... ..................................... WHITEPAPER PEPPERDATA OVERVIEW AND DIFFERENTIATORS INTRODUCTION Prospective customers will often pose the question, How is Pepperdata different from tools like Ganglia,

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard Hadoop and Relational base The Best of Both Worlds for Analytics Greg Battas Hewlett Packard The Evolution of Analytics Mainframe EDW Proprietary MPP Unix SMP MPP Appliance Hadoop? Questions Is Hadoop

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source DMITRIY SETRAKYAN Founder, PPMC http://www.ignite.incubator.apache.org #apacheignite Agenda Apache Ignite (tm) In- Memory

More information

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011 SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications Jürgen Primsch, SAP AG July 2011 Why In-Memory? Information at the Speed of Thought Imagine access to business data,

More information

Big Data Course Highlights

Big Data Course Highlights Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like

More information

MySQL and Hadoop. Percona Live 2014 Chris Schneider

MySQL and Hadoop. Percona Live 2014 Chris Schneider MySQL and Hadoop Percona Live 2014 Chris Schneider About Me Chris Schneider, Database Architect @ Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for

More information

Constructing a Data Lake: Hadoop and Oracle Database United!

Constructing a Data Lake: Hadoop and Oracle Database United! Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.

More information

Integrating with Apache Hadoop HPE Vertica Analytic Database. Software Version: 7.2.x

Integrating with Apache Hadoop HPE Vertica Analytic Database. Software Version: 7.2.x HPE Vertica Analytic Database Software Version: 7.2.x Document Release Date: 12/7/2015 Legal Notices Warranty The only warranties for Hewlett Packard Enterprise products and services are set forth in the

More information