Processing Big Data With SQL on Hadoop. Jens Albrecht jens.albrecht@th-nuernberg.de


Why SQL for Big Data?
  Mature technology
  Broad knowledge available
  Powerful query language
  High interactive performance
  Many third-party tools for data analysis and visualization
  Flexible data structures: semi-structured data, changing schemas
  Self-service: data integration on the fly
  Scalability: analysis, integration, volume
  (Relatively) low cost: commodity hardware, open source
Prof. Dr. Jens Albrecht SQL on Hadoop 3

Why SQL for Big Data?
  (diagram: extended DWH + data lake, agility)

Hadoop-SQL Integration
  Hive (native Hadoop): HiveQL on MR / Tez, on HDFS
  Pure Hadoop SQL engines: distributed SQL engine on HDFS
  Format-agnostic SQL engines: distributed SQL engine on HDFS, Hive, NoSQL
  RDBMS with Hadoop access: RDBMS on relational storage and Hadoop


Hive
  Architecture (diagram): Hive (HiveQL execution, SerDe) with Meta Store, on MapReduce, on HDFS/HBase
  General
    Developed initially by Facebook
    SQL processing for HDFS and HBase
    Table definitions in the Hive Meta Store
    Generation of MapReduce code
    Schema-on-read via SerDe
  Advantages
    Mature part of every Hadoop distribution
    Simple setup
    Java API for UDFs
    Usage of many data formats via SerDe
  Disadvantages
    Batch-oriented, slow

Schema-on-Write vs. Schema-on-Read
  Relational database: schema-on-write
    Multi-structured source data -> ETL -> relational DBMS -> SQL
  Big data processing: schema-on-read
    Multi-structured source data -> load as-is -> Hadoop -> schema mapped to original files -> SQL
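The difference between the two pipelines can be sketched in plain Python. This is an illustration only, not how Hive or a relational DBMS works internally. The functions `load_schema_on_write` and `read_with_schema` and the malformed second input line are hypothetical.

```python
# Raw, tab-separated input lines. The second line is a hypothetical
# malformed record (device id is not a number).
RAW_LINES = [
    "4711\t542815\t49.454N\t11.077E",
    "4712\tnot-a-number\t48.137N\t11.575E",
]

# Column names and converters, applied positionally.
SCHEMA = [("userid", int), ("deviceid", int), ("longitude", str), ("latitude", str)]


def load_schema_on_write(lines, schema):
    """Schema-on-write: convert and validate at load time (the ETL step).

    Lines that do not fit the schema never reach the table.
    """
    table, rejected = [], []
    for line in lines:
        try:
            values = line.split("\t")
            table.append({name: cast(v) for (name, cast), v in zip(schema, values)})
        except ValueError:
            rejected.append(line)
    return table, rejected


def read_with_schema(stored_lines, schema):
    """Schema-on-read: lines are stored as-is, the schema is applied per query.

    In this sketch, a field that cannot be converted is returned as None.
    """
    for line in stored_lines:
        row = {}
        for (name, cast), value in zip(schema, line.split("\t")):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None
        yield row


table, rejected = load_schema_on_write(RAW_LINES, SCHEMA)
rows = list(read_with_schema(RAW_LINES, SCHEMA))
print(len(table), len(rejected), rows[1]["deviceid"])
```

With schema-on-write the malformed line is rejected during loading; with schema-on-read it stays in storage and only the affected field is missing at query time. A different schema could later be mapped onto the same stored lines without reloading them.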

Schema-on-Read: Hive & CSV

    CREATE TABLE gps_data (
      userid    INT,
      deviceid  INT,
      longitude STRING,
      latitude  STRING,
      utctime   STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- load data = copy
    LOAD DATA LOCAL INPATH 'new_data/gps.dat'
    OVERWRITE INTO TABLE gps_data;

    SELECT COUNT(*) FROM gps_data;

  Sample data
    4711 542815 49.454N 11.077E 10/01/2014@10:00:00UTC

Schema-on-Read: Hive & Regexp

    CREATE TABLE weblog (
      host     STRING,
      identity STRING,
      user     STRING,
      time     STRING,
      request  STRING,
      status   STRING,
      size     STRING,
      referer  STRING,
      agent    STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
    )
    STORED AS TEXTFILE;
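How such a regular expression maps one log line to nine string columns can be tried out with Python's `re` module. This is a sketch under assumptions: the pattern is the access-log expression with its `|` alternations written out, the sample line is made up, and `parse_log_line` is a hypothetical helper, not part of Hive.

```python
import re

# One capture group per table column; the last two groups are optional.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)

COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]


def parse_log_line(line):
    """Return a column -> string mapping, or None if the line does not match."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Hypothetical log line in combined log format.
line = ('127.0.0.1 - frank [10/Oct/2014:13:55:36 +0200] '
        '"GET /index.html HTTP/1.1" 200 2326 '
        '"http://example.com/start.html" "Mozilla/5.0"')
row = parse_log_line(line)
print(row["host"], row["status"], row["request"])
```

All values stay strings, matching the STRING columns of the table definition. Quotes and brackets are part of the captured values because they sit inside the capture groups.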

Schema-on-Read: Hive & JSON
  (figure: JSON data and its relational mapping in Hive)
  Source: http://thornydev.blogspot.de/2013/07/querying-json-records-via-hive.html

Hive on Tez (Stinger)
  Architecture (diagram): Hive (HiveQL execution, SerDe) with Meta Store, on MapReduce or Tez/YARN, on HDFS/HBase
  Stinger Initiative
    "Make Hive 100x faster"
    Finished with Hive 0.13 (April 2014)
    Replace MapReduce with Tez
    Native columnar data format (ORC)
  Stinger.Next
    Phase 1: Hive 0.14 (November 2014)
      ACID transactions
    Phase 2: (Q2 2015)
      Sub-second queries with LLAP
      Machine learning integration
    Phase 3: (Q4 2015)
      SQL:2011 analytic functions
      Materialized views


Pure Hadoop SQL Engines
  Architecture (diagram): distributed SQL engine with SQL query coordination and local agents, over data files in HDFS/HBase
  Approach
    Distributed, parallel SQL engine
    Often uses Hive metadata
    Support for optimized data formats
    Hadoop as mandatory basis
  Advantages and disadvantages
    Significantly faster than Hive
    Low latency through a dedicated engine
    Operator pipelining and result caching
  Differentiation of solutions
    Supported SQL functionality
    Point querying
    Cost-based optimizer / performance
    Transaction support

Pure Hadoop SQL Engines
  (diagram: distributed SQL engine with SQL query coordination and local agents, over data files in HDFS/HBase; product label: Big Insights)

Pure Hadoop SQL Engines
  Example: IBM Big SQL (figure)
  Source: IBM

Apache Spark & Spark SQL
  Architecture (diagram): Spark SQL (SQL execution, SerDe, SchemaRDD) on Apache Spark, on HDFS/HBase
  General
    SQL engine based on Spark
    Data access via DataFrames (formerly SchemaRDD)
    In-memory columnar format
    HDFS / HBase as storage
  Advantages
    Spark as general-purpose parallel computing framework
    Support of Hive extensions like UDFs and SerDes, and of Hive metadata
  Disadvantages
    Not yet fully mature
    Not yet as fast as competitors

Apache Spark
  Distributed in-memory computing framework
    Data caching
    General framework for all kinds of SQL and non-SQL analytics
    Support for out-of-the-box libraries as well as Java, Python and Scala in the same engine and for the same data
    New data sources API allows writing plugins for non-Hadoop sources
  Stack (diagram): Spark SQL, Spark Streaming, Machine Learning (MLlib), Graph Computation (GraphX) on the Spark execution engine; ZooKeeper, Hadoop YARN (optional), HDFS

Apache Spark
  Resilient Distributed Datasets (RDDs): transformations and actions

    val lines = spark.textFile("hdfs://...")
    val errors = lines.filter(_.startsWith("ERROR"))               // transformation
    errors.persist()

    // Return the time fields of errors mentioning HDFS,
    // assuming time is field number 3 in a TSV file
    val hdfs_errors = errors.filter(_.contains("HDFS"))            // transformation
    val time_fields = hdfs_errors.map(_.split("\t")(3)).collect()  // action

  Source: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Spark Transformations
  Transformations: create a new RDD from an existing one
    Lazy evaluation: results are not materialized
    Much more than map and reduce
    map, filter, sample, groupByKey, sortByKey, reduceByKey, union, pipe, repartition, join, leftOuterJoin, rightOuterJoin
  Actions: return a value or dataset to the calling program
    reduce, collect, count, first, take(n), saveAsTextFile
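The split between lazy transformations and eager actions can be mimicked in a few lines of Python. The class `LazyDataset` below is a hypothetical toy, not the Spark API: it only records how to compute a result and counts how many source elements were actually read. The log lines are made up.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, trace):
        self._source = source  # zero-argument callable returning an iterator
        self._trace = trace    # shared list recording every element read

    @classmethod
    def from_list(cls, items):
        trace = []

        def read():
            for item in items:
                trace.append(item)  # note each element that is really read
                yield item

        return cls(read, trace)

    # Transformations: return a new dataset, compute nothing yet.
    def filter(self, predicate):
        return LazyDataset(
            lambda: (x for x in self._source() if predicate(x)), self._trace)

    def map(self, func):
        return LazyDataset(lambda: (func(x) for x in self._source()), self._trace)

    # Actions: trigger evaluation and return a value to the caller.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())

    @property
    def reads(self):
        return len(self._trace)


lines = LazyDataset.from_list([
    "ERROR\thdfs\tdisk full\t10:00",
    "INFO\tyarn\tstarted\t10:01",
    "ERROR\tyarn\ttimeout\t10:02",
])
errors = lines.filter(lambda l: l.startswith("ERROR"))
times = errors.filter(lambda l: "hdfs" in l).map(lambda l: l.split("\t")[3])

reads_before = lines.reads  # the chain is defined, nothing has been read
result = times.collect()    # the action runs the whole chain
reads_after = lines.reads
print(reads_before, result, reads_after)
```

This toy has no caching, so every action re-reads the source; that recomputation is what a call like `persist()` is meant to avoid.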

Adding Schema to RDDs (figure)
  Source: http://de.slideshare.net/jeykottalam/spark-sqlamp-camp2014


SQL Engine with Pluggable Storage
  Architecture (diagram): distributed SQL engine with SQL query coordination and local agents; connector plugins for HDFS, Hive, JSON, Parquet, Cassandra, MySQL, Oracle
  Approach
    Distributed, parallel SQL engine
    Often uses Hive metadata
    Support for optimized file formats
    Hadoop mandatory as basis
  General advantages
    Significantly faster than Hive
    Low latency by avoiding MapReduce
    Operator pipelining and caching
    Scalable
  General disadvantages
    Usually no transaction support

Google Dremel (figure)

Apache Drill
  Source: http://www.mapr.com
  Self-service data integration
    No metadata repository required
    Dynamic schema discovery: metadata automatically extracted for data sources: RDBMS and Hive (comprehensive), HBase (partial) or files (on the fly)
    Utilizes self-describing data formats (Parquet, JSON, Avro)
    SQL DDL can be used to create metadata explicitly
  ANSI SQL plus flexible data model
    Fully ANSI-compliant SQL
    DrQL with SQL extensions for nested data structures (like JSON)

Apache Drill: SQL for JSON

    select name, flatten(fillings) as f
    from dfs.users.`/donuts.json`
    where f.cal < 300;

  Source: https://www.mapr.com/products/apache-drill
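What flattening a nested array means can be shown without a Drill installation. The following is a minimal Python sketch: the donut records are invented, and `flatten` is a hypothetical helper that imitates the idea of the query (one output row per array element), not Drill's implementation.

```python
import json

# Hypothetical content of a donuts file: each record has a nested array.
DONUTS_JSON = """
[
  {"name": "Cake",   "fillings": [{"name": "custard", "cal": 250},
                                  {"name": "jam", "cal": 320}]},
  {"name": "Glazed", "fillings": []},
  {"name": "Filled", "fillings": [{"name": "cream", "cal": 290}]}
]
"""


def flatten(records, array_field, alias):
    """Emit one output row per element of record[array_field].

    The remaining top-level fields are repeated on every output row.
    """
    for record in records:
        for element in record.get(array_field, []):
            row = {k: v for k, v in record.items() if k != array_field}
            row[alias] = element
            yield row


donuts = json.loads(DONUTS_JSON)
low_cal = [(row["name"], row["f"]["name"])
           for row in flatten(donuts, "fillings", "f")
           if row["f"]["cal"] < 300]
print(low_cal)
```

In this sketch a record with an empty array produces no output rows, and the filter on `cal` is applied to the flattened elements, one at a time.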

Apache Drill: SQL for Heterogeneous Data Formats
  Formats and sources: JSON, CSV, ORC, Parquet, HBase tables, MongoDB

    select USERS.name, USERS.emails.work
    from dfs.logs.`/data/logs` LOGS,
         dfs.users.`/profiles.json` USERS
    where LOGS.uid = USERS.uid
      and errorlevel > 5
    order by count(*);


External Tables in HDFS
  (diagram: RDBMS, logical mapping, HDFS)

    CREATE TABLE SCOTT.SALES_HDFS_EXT_TAB (
      PROD_ID       NUMBER(6),
      CUST_ID       NUMBER,
      TIME_ID       DATE,
      CHANNEL_ID    CHAR(1),
      PROMO_ID      NUMBER(6),
      QUANTITY_SOLD NUMBER(3),
      AMOUNT_SOLD   NUMBER(10,2)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY SALES_EXT_DIR
      ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ','
        ( PROD_ID DECIMAL EXTERNAL, )
        PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
      )
      LOCATION ( 'file_sales_1', 'file_sales_2', 'files_sales_3' )
    );

RDBMS with Hadoop Integration
  Architecture (diagram): parallel database with SQL query coordination; relational tables plus external tables and MR loader; MapReduce on HDFS/HBase
  Approach
    Towards genuine integration of Hadoop into the RDBMS
    Utilize Hadoop's computational power
    Cost-based choice
  Advantages
    Easiest way to use Hadoop as a data source
    Combined access to traditional and new data sources
  Disadvantages
    Cost
    Limited data sources
    Vendor lock-in

RDBMS with Hadoop Integration
  Products
    Microsoft Polybase (part of MS Analytics Platform)
    Oracle Big Data SQL (part of Oracle Big Data Appliance in combination with Exadata)
  Use cases
    Extension of a traditional BI system
    Data-lake scenario with RDBMS as primary system and Hadoop for mass data
    Mix of analytic and transactional load

Oracle Big Data SQL
  Integrates Hive metadata
  Allows hybrid queries
  Includes Hadoop and NoSQL in relational queries
  Uses Exadata Smart Scan technology
  Source: http://www.oracle.com

Hadoop-SQL Integration
  Hive (native Hadoop): HiveQL on MR / Tez, on HDFS (Stinger)
  Pure Hadoop SQL engines: distributed SQL engine on HDFS (Big Insights)
  Format-agnostic SQL engines: distributed SQL engine on HDFS, Hive, NoSQL
  RDBMS with Hadoop access: RDBMS on relational storage and Hadoop

> Hadoop File Formats

File Formats for Big Data
  Text formats
    High storage usage: bad scan performance
    Low compression: bad scan performance
  Dedicated formats (e.g. DB-internal)
    Not open, no interoperability
  Requirements for big data
    Interoperability
    Low storage / good compression
    High performance
    Flexible schema
  A schema for a file format? Query performance for a file format?
    Big data often have a nested structure and multiple schema variants and versions

Motivation for File Formats (figure)
  Source: http://de.slideshare.net/cloudera/hadoop-summit-36479635

Considerations for File Formats
  Query tools
    None
    Frameworks like MapReduce, Spark, Cascading
    Query engines like Pig, Hive, Impala
  Schema versioning
    Schema present? If so, can it change?
  Splittability / partitioning
    Can files be split for distributed processing?
    Example: CSV: yes, XML: partial, MP4: no
  Block compression
    Can blocks be independently compressed and distributed?
    Block compression is a prerequisite for partition compression!
  File size
    Size in bytes and number of files?
    Hadoop likes big, splittable files!
    Lots of small files cost performance
  Load profile
    Write performance
    Filter operations
    Reading of single columns
    Full scans
  Source: http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore
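Why block compression keeps a file splittable can be illustrated with Python's `zlib` module. This is a sketch under assumptions: `zlib` merely stands in for whatever codec a real file format uses, and `compress_blocks` and `read_block` are hypothetical helpers.

```python
import zlib


def compress_blocks(lines, lines_per_block):
    """Compress each block of lines on its own, so blocks stay independent."""
    blocks = []
    for start in range(0, len(lines), lines_per_block):
        chunk = "\n".join(lines[start:start + lines_per_block])
        blocks.append(zlib.compress(chunk.encode("utf-8")))
    return blocks


def read_block(blocks, index):
    """Decompress a single block without touching any other block."""
    return zlib.decompress(blocks[index]).decode("utf-8").split("\n")


lines = [f"user{i},click,{i * 10}" for i in range(10)]
blocks = compress_blocks(lines, lines_per_block=4)

# A worker assigned to the last block needs only that block's bytes.
last_block = read_block(blocks, 2)
print(len(blocks), last_block)
```

If the whole file were compressed as one stream instead, a worker interested in the last lines would have to decompress everything before them, which is what makes such files hard to split.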

Example Row Format: Avro
  Schema specification
    Internally stored in binary format
    Self-describing files

    record Person {
      string username;
      union { null, long } favouritenumber;
      array<string> interests;
    }

  Reader vs. writer schema
    Allows different "views" on files
    (diagram: on write, the writer schema is stored with the Avro data; on read, the Avro parser applies resolution rules between reader schema and writer schema)
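The idea behind reader and writer schemas can be sketched in Python. This is a strongly simplified illustration, not the Avro library: fields are matched by name, a reader default fills in fields the writer did not know, and fields only the writer knew are dropped. The `resolve` function, the `country` field and the sample record are hypothetical.

```python
MISSING = object()  # marks "reader schema declares no default"


def resolve(record, writer_fields, reader_schema):
    """Project a record written with writer_fields onto reader_schema.

    reader_schema maps field name -> default value (MISSING if none).
    """
    resolved = {}
    for name, default in reader_schema.items():
        if name in writer_fields:
            resolved[name] = record[name]  # matched by name
        elif default is not MISSING:
            resolved[name] = default       # reader default fills the gap
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved                        # writer-only fields are dropped


# Writer schema: the fields of the Person record.
writer_fields = ["username", "favouritenumber", "interests"]
record = {"username": "alice", "favouritenumber": 7, "interests": ["hiking"]}

# Reader schema: a different "view" that ignores favouritenumber
# and expects a newer field, country, with a default.
reader_schema = {"username": MISSING, "interests": MISSING, "country": "unknown"}

view = resolve(record, writer_fields, reader_schema)
print(view)
```

Because the writer schema travels with the data, a reader can apply its own schema to old and new files alike, as long as every field it expects either exists in the file or has a default.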

Example Column Format
  Columnar formats in general
    Trade faster reads for slower writes
    Very good compression
  Parquet files
    Hybrid partitioning: sets of records in blocks, columnar within blocks
    Zone maps per block as a kind of index (min/max values per column)
  Image: http://de.slideshare.net/cloudera/hadoop-summit-36479635
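Hybrid partitioning and zone maps can be sketched with plain Python lists. This is an illustration of the principle only, not the Parquet file layout. The functions `build_row_groups` and `scan` and the temperature data are hypothetical.

```python
def build_row_groups(records, columns, group_size):
    """Split records into blocks; inside each block store data per column.

    Every block also gets a zone map: (min, max) per column.
    """
    groups = []
    for start in range(0, len(records), group_size):
        block = records[start:start + group_size]
        data = {c: [r[c] for r in block] for c in columns}
        zone_map = {c: (min(v), max(v)) for c, v in data.items()}
        groups.append({"data": data, "zone_map": zone_map})
    return groups


def scan(groups, column, low, high, project):
    """Return `project` values of rows with low <= column <= high.

    Blocks whose zone map excludes the range are skipped without being read.
    """
    hits, blocks_read = [], 0
    for group in groups:
        block_min, block_max = group["zone_map"][column]
        if block_max < low or block_min > high:
            continue                     # no row in this block can match
        blocks_read += 1
        values = group["data"][column]
        wanted = group["data"][project]  # only the needed columns are touched
        hits.extend(w for v, w in zip(values, wanted) if low <= v <= high)
    return hits, blocks_read


temps = [3, 5, 4, 21, 25, 22, 8, 9, 7]
records = [{"id": i, "temp": t} for i, t in enumerate(temps)]
groups = build_row_groups(records, ["id", "temp"], group_size=3)

hits, blocks_read = scan(groups, "temp", 20, 30, project="id")
print(hits, blocks_read)
```

Only one of the three blocks is read for this query, and within that block only the two columns involved. Zone maps help most when similar values are stored close together, as in this example.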

Databases as Lego Construction Kit!?
  Traditional monolithic RDBMS
    SQL, SQL processor, data dictionary, distributed execution, storage management
  Hadoop DB building blocks
    SQL, SQL processor
    MapReduce, Spark
    CSV, Seq, Avro, JSON, ORC, Parquet
  Generic execution engine
  Metadata sharing in the Hive repository or self-describing file formats
  Operator push-down to intelligent file interfaces

> Summary

Considerations for SQL on Hadoop Solutions
  SQL functionality
    Coverage of the SQL standard
    User-defined functions
    Transactional safety
  Performance and stability
    Multi-user workloads
    Efficiency of joins and aggregations (I/O problems? Size limits?)
  Supported data and storage formats
    Logical formats: relational, JSON, none, ...
    Physical formats: CSV, Parquet, ORC, Avro, ...
  Intelligent storage plugins / data federation
    Access to various data sources beyond Hadoop
    Push down predicates, access selected columns only
  Support for your Hadoop distribution

Hadoop vs. SQL
  Hadoop SQL technologies
    Supplement traditional RDBMS
    Extend traditional RDBMS
    Develop new RDBMS
  Hadoop and SQL are moving closer together.
  The SQL universe is getting wider.
  Database systems are becoming open and modular.

References
  L. Chang et al.: HAWQ: A Massively Parallel Processing SQL Engine in Hadoop. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 1223-1234
  A. Floratou, U. Minhas, F. Özcan: SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures. Proceedings of the VLDB Endowment, Vol. 7, No. 12, 2014
  M. Hausenblas, J. Nadeau: Apache Drill: Interactive Ad-Hoc Analysis at Scale. Big Data Magazine, June 2013
  M. Kornacker et al.: Impala: A Modern, Open-Source SQL Engine for Hadoop. 7th Biennial Conference on Innovative Data Systems Research (CIDR '15)
  D. J. DeWitt et al.: Split Query Processing in Polybase. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 1255-1266
  S. Melnik et al.: Dremel: Interactive Analysis of Web-Scale Datasets. PVLDB, 3(1-2):330-339, 2010
  M. Zaharia et al.: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012
  J. Albrecht, S. Alexander: Hadoop und SQL rücken enger zusammen. Computerwoche, Nov. 2013, http://www.computerwoche.de/a/hadoop-und-sql-ruecken-enger-zusammen,2549475
  C. Deptula: Hadoop File Formats: It's Not Just CSV Anymore. Blog post, 2014, http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore
  J. Le Dem: Efficient Data Storage for Analytics with Apache Parquet 2.0. Hadoop Summit 2014, http://de.slideshare.net/cloudera/hadoop-summit-36479635
  M. Rathbone: 8 SQL-on-Hadoop frameworks worth checking out. Blog post, 2014, http://blog.matthewrathbone.com/2014/06/08/sql-engines-for-hadoop.html
  P. Srivati: Resilient Distributed Datasets (RDD) for the impatient. Blog post, 2014, http://www.thecloudavenue.com/2014/01/resilient-distributed-datasets-rdd.html
  S. Yegulalp: 10 ways to query Hadoop with SQL. InfoWorld, 2014, http://www.infoworld.com/article/2683729/hadoop/10-ways-to-query-hadoop-with-sql.html