E6895 Advanced Big Data Analytics Lecture 4: Data Store
Ching-Yung Lin, Ph.D.
Adjunct Professor, Dept. of Electrical Engineering and Computer Science
Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center
Columbia University, 2015
Reference
Spark SQL
Apache Hive
Using Hive to Create a Table
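The slide's listing is not preserved in this text, so here is a minimal sketch of creating and loading a Hive table. The table name, columns, and file path are invented for illustration; the HiveQL can be typed at the hive> prompt, or issued through Spark's HiveContext as shown (assuming an existing SparkContext sc, e.g. in spark-shell).

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)

// Hypothetical table: three columns, comma-delimited text files.
hiveCtx.sql("""CREATE TABLE IF NOT EXISTS users (id INT, name STRING, email STRING)
               ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
               STORED AS TEXTFILE""")

// Load a local CSV file into the table (path is hypothetical).
hiveCtx.sql("LOAD DATA LOCAL INPATH '/tmp/users.csv' OVERWRITE INTO TABLE users")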
Creating, Dropping, and Altering DBs in Apache Hive
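In the same vein, a hedged sketch of the database-level DDL (database name and properties are made up); each statement is standard HiveQL and works equally at the hive> prompt.

// Reuses the hiveCtx created above.
hiveCtx.sql("CREATE DATABASE IF NOT EXISTS inventory COMMENT 'demo database'")
hiveCtx.sql("ALTER DATABASE inventory SET DBPROPERTIES ('owner' = 'cylin')")
hiveCtx.sql("USE inventory")
hiveCtx.sql("DROP DATABASE IF EXISTS inventory CASCADE")   // CASCADE also drops its tables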
Another Hive Example
Hive's operation modes
Using HiveQL for Spark SQL
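A short sketch of issuing HiveQL from Spark SQL, reusing the HiveContext above; the queried table and columns are the hypothetical ones created earlier.

// HiveContext reads table definitions from the Hive metastore and accepts HiveQL.
val activeUsers = hiveCtx.sql("SELECT name, email FROM users WHERE id > 100")
activeUsers.take(5).foreach(println)   // the result is a SchemaRDD of Row objects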
Hive Language Manual
Using Spark SQL: Steps and Example
Query testtweet.json (get it from the Learning Spark GitHub repository: https://github.com/databricks/learning-spark/tree/master/files)
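A minimal sketch of the steps, in the spirit of the Learning Spark example: load the JSON file, register it as a temporary table, then query it with SQL. The relative path and the field names text and retweetCount are assumptions taken from that book's tweet example; check the downloaded file if yours differ.

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
val tweets = hiveCtx.jsonFile("files/testtweet.json")   // schema is inferred from the JSON
tweets.printSchema()                                    // look at the inferred schema
tweets.registerTempTable("tweets")
val topTweets = hiveCtx.sql(
  "SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10")
topTweets.collect().foreach(println)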
SchemaRDD
Row Objects: Row objects represent records inside SchemaRDDs, and are simply fixed-length arrays of fields.
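For instance (assuming the topTweets SchemaRDD from the sketch above, columns text and retweetCount), fields can be read positionally with typed getters or with generic indexing.

val texts  = topTweets.map(row => row.getString(0))   // column 0: text, typed accessor
val counts = topTweets.map(row => row(1))             // column 1: generic access, returns Any
texts.take(3).foreach(println)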
Types stored by SchemaRDDs
Look at the Schema (not a complete screenshot)
Another way to create a SchemaRDD
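One such route (a sketch using the Spark 1.1/1.2 programmatic-schema API, which later releases renamed; the people data and column names are invented) builds Row objects by hand and attaches an explicit StructType.

import org.apache.spark.sql._

val sqlCtx = new SQLContext(sc)
val rowRDD = sc.parallelize(Seq(Row("Ada", 36), Row("Bob", 42)))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  IntegerType, nullable = true)))
val people = sqlCtx.applySchema(rowRDD, schema)        // yields a SchemaRDD
people.registerTempTable("people")
sqlCtx.sql("SELECT name FROM people WHERE age > 40").collect().foreach(println)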
JDBC Server: Spark SQL provides JDBC connectivity, which is useful for connecting business intelligence tools to a Spark cluster and for sharing a cluster across multiple users.
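The server is started with sbin/start-thriftserver.sh and listens on the Hive Thrift port (10000 by default). As a hedged sketch, any JDBC client can then query it through the Hive driver; host, port, and table name below are assumptions, and hive-jdbc must be on the classpath.

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT COUNT(*) FROM tweets")   // hypothetical table
while (rs.next()) println(rs.getLong(1))
conn.close()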
User-Defined Functions (UDF): UDFs allow you to register custom functions in Python, Java, and Scala to call within SQL. This is a very popular way to expose advanced functionality to SQL users in an organization, so that these users can call into it without writing code.
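A minimal Scala sketch (Spark 1.x used registerFunction on the SQL/Hive context; the function name strLen and the tweets table are just for illustration).

// Register a Scala function and call it from SQL.
hiveCtx.registerFunction("strLen", (s: String) => s.length)
hiveCtx.sql("SELECT strLen(text) FROM tweets LIMIT 5").collect().foreach(println)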
Spark Streaming: In Spark 1.1, Spark Streaming is available only in Java and Scala; Spark 1.2 has limited Python support.
Spark Streaming architecture
Spark Streaming with Spark's components
Try these examples
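As a starting point, a sketch of the classic streaming filter example: it reads text from a socket in 1-second batches and prints the lines containing "error". Host, port, and the filter term are assumptions (feed it with something like nc -lk 7777).

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))           // 1-second batch interval
val lines = ssc.socketTextStream("localhost", 7777)      // DStream of text lines
val errorLines = lines.filter(_.contains("error"))
errorLines.print()                                       // print a few elements of each batch
ssc.start()                                              // start receiving and processing
ssc.awaitTermination()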
RDF and SPARQL
Resource Description Framework (RDF): a W3C standard since 1999. Triples. Example: if a company has nine of part p1234 in stock, a simplified triple representing this might be {p1234 instock 9}: Instance Identifier, Property Name, Property Value. In a proper RDF version of this triple the representation is more formal: triples require uniform resource identifiers (URIs).
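A minimal sketch with Apache Jena (the open-source toolkit mentioned at the end of this deck) that builds this triple with full URIs; the example.com namespaces are placeholders, and older Jena releases used the com.hp.hpl.jena packages instead of org.apache.jena.

import org.apache.jena.rdf.model.ModelFactory

val model   = ModelFactory.createDefaultModel()
val part    = model.createResource("http://example.com/inventory/p1234")
val inStock = model.createProperty("http://example.com/vocab/instock")
model.addLiteral(part, inStock, 9L)      // the triple {p1234 instock 9}, now with URIs
model.write(System.out, "TURTLE")        // serialize, e.g. as Turtle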
An example of a complete description
Advantages of RDF: Virtually any RDF software can parse the lines shown above as a self-contained, working data file. You can declare properties if you want: the RDF Schema standard lets you declare classes and the relationships between properties and classes, but the flexibility that comes from this lack of dependence on schemas is the first key to RDF's value. Splitting the triples across several lines does not affect their collective meaning, which makes sharding of data collections easy, and multiple datasets can be combined into a usable whole by simple concatenation. Because the inventory dataset's property-name URIs come from a shared vocabulary, it is easy to aggregate data from multiple sources.
SPARQL Query Language for RDF: The following SPARQL query asks for all property names and values associated with the fbd:s9483 resource:
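The query itself did not survive the text extraction; a sketch of what it looks like is below (the fbd: namespace URI is not given in the deck, so the PREFIX is a placeholder).

val query = """
  PREFIX fbd: <http://example.com/fbd/>
  SELECT ?propertyName ?propertyValue
  WHERE { fbd:s9483 ?propertyName ?propertyValue }
"""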
The SPARQL Query Result from the previous example
Another SPARQL Example: What is this query for? Data
Open Source Software: Apache Jena
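A hedged end-to-end sketch with Jena's ARQ engine: load an RDF file into a model and run a SPARQL SELECT over it. The file name is hypothetical, and the query is the generic list-some-triples form rather than anything from the slides.

import org.apache.jena.query.{QueryExecutionFactory, QueryFactory}
import org.apache.jena.rdf.model.ModelFactory

val model = ModelFactory.createDefaultModel()
model.read("inventory.ttl")                       // hypothetical Turtle file
val q = QueryFactory.create("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
val qexec = QueryExecutionFactory.create(q, model)
val results = qexec.execSelect()
while (results.hasNext) {
  val soln = results.nextSolution()
  println(s"${soln.get("s")}  ${soln.get("p")}  ${soln.get("o")}")
}
qexec.close()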