CS 378 Big Data Programming. Lecture 9 Complex Writable Types



Similar documents
About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. AVRO Tutorial

Serializing Data with Protocol Buffers. Vinicius Vielmo Cogo Smalltalks, DI, FC/UL. February 12, 2014.

HDFS. Hadoop Distributed File System

AVRO - SERIALIZATION

First Java Programs. V. Paúl Pauca. CSC 111D Fall, Department of Computer Science Wake Forest University. Introduction to Computer Science

Extreme Computing. Hadoop MapReduce in more detail.

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

Scalable Network Measurement Analysis with Hadoop. Taghrid Samak and Daniel Gunter Advanced Computing for Sciences, LBNL

The C Programming Language course syllabus associate level

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

ANDROID APPS DEVELOPMENT FOR MOBILE GAME

Developing a Web Server Platform with SAPI Support for AJAX RPC using JSON

Services. Relational. Databases & JDBC. Today. Relational. Databases SQL JDBC. Next Time. Services. Relational. Databases & JDBC. Today.

CSE 373: Data Structure & Algorithms Lecture 25: Programming Languages. Nicki Dell Spring 2014

What is COM/DCOM. Distributed Object Systems 4 COM/DCOM. COM vs Corba 1. COM vs. Corba 2. Multiple inheritance vs multiple interfaces

Hadoop. Dawid Weiss. Institute of Computing Science Poznań University of Technology

Getting Started with the Internet Communications Engine

JAXB: Binding between XML Schema and Java Classes

JAXB Tips and Tricks Part 2 Generating Java Classes from XML Schema. By Rob Ratcliff

Evolution of the Major Programming Languages

Openbus Documentation

ECE 122. Engineering Problem Solving with Java

Java 7 Recipes. Freddy Guime. vk» (,\['«** g!p#« Carl Dea. Josh Juneau. John O'Conner

DEVELOPING CONTRACT - DRIVEN WEB SERVICES USING JDEVELOPER. The purpose of this tutorial is to develop a java web service using a top-down approach.

Performance Testing for GDB

Java Web Services Training

Dr. Chuck Cartledge. 15 Oct. 2015

Contents. 9-1 Copyright (c) N. Afshartous

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

RTI Routing Service. Release Notes

Java CPD (I) Frans Coenen Department of Computer Science

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Assignment 3 Version 2.0 Reactive NoSQL Due April 13

ITS. Java WebService. ITS Data-Solutions Pvt Ltd BENEFITS OF ATTENDANCE:

University of Maryland. Tuesday, February 2, 2010

Chapter 1. Dr. Chris Irwin Davis Phone: (972) Office: ECSS CS-4337 Organization of Programming Languages

Web Services Technologies

CSE 130 Programming Language Principles & Paradigms


Jeffrey D. Ullman slides. MapReduce for data intensive computing

Cloudera Certified Developer for Apache Hadoop

Hadoop Streaming. Table of contents

This Preview Edition of Hadoop Application Architectures, Chapters 1 and 2, is a work in progress. The final book is currently scheduled for release

1.Remote Procedure Call Basics Requests Types Response Strategies/Goals FAQ...

PL/JSON Reference Guide (version 1.0.4)

Illustration 1: Diagram of program function and data flow

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

XML Processing and Web Services. Chapter 17

Apache Thrift and Ruby

Firewall Builder Architecture Overview

Web Services Description Language (WSDL) Wanasanan Thongsongkrit

Hadoop Project for IDEAL in CS5604

Java (12 Weeks) Introduction to Java Programming Language

Storage Classes CS 110B - Rule Storage Classes Page 18-1 \handouts\storclas

Types of Cloud Computing

MS ACCESS DATABASE DATA TYPES

Introduction To Hive

BIG DATA, MAPREDUCE & HADOOP

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

Hadoop Distributed File System (HDFS) Overview

Introduction to Apache Pig Indexing and Search

University of Hull Department of Computer Science. Wrestling with Python Week 01 Playing with Python

CS 378 Big Data Programming. Lecture 2 Map- Reduce

Hadoop Design and k-means Clustering

Integrating VoltDB with Hadoop

INTRODUCTION TO OBJECTIVE-C CSCI 4448/5448: OBJECT-ORIENTED ANALYSIS & DESIGN LECTURE 12 09/29/2011

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

CS506 Web Design and Development Solved Online Quiz No. 01

CS 378 Big Data Programming

Advanced Java Client API

C#5.0 IN A NUTSHELL. Joseph O'REILLY. Albahari and Ben Albahari. Fifth Edition. Tokyo. Sebastopol. Beijing. Cambridge. Koln.

C++ Wrapper Library for Firebird Embedded SQL

CORBA Objects in Python

Developing Java Web Services

Comp151. Definitions & Declarations

Database Management System Choices. Introduction To Database Systems CSE 373 Spring 2013

Flash and Python. Dynamic Object oriented Rapid development. Flash and Python. Dave Thompson

Problem 1. CS 61b Summer 2005 Homework #2 Due July 5th at the beginning of class

Python 2 and 3 compatibility testing via optional run-time type checking

Introduction to Python

Programming Project 1: Lexical Analyzer (Scanner)

Lecture 5: Java Fundamentals III

Scoping (Readings 7.1,7.4,7.6) Parameter passing methods (7.5) Building symbol tables (7.6)

Microsoft Azure Data Technologies: An Overview

Oracle Database: SQL and PL/SQL Fundamentals NEW

JAVA r VOLUME II-ADVANCED FEATURES. e^i v it;

Programming Languages CIS 443

OTN Developer Day: Oracle Big Data. Hands On Lab Manual. Introduction to Oracle NoSQL Database

Static vs. Dynamic. Lecture 10: Static Semantics Overview 1. Typical Semantic Errors: Java, C++ Typical Tasks of the Semantic Analyzer

Distributed Network Management Using SNMP, Java, WWW and CORBA

The presentation explains how to create and access the web services using the user interface. WebServices.ppt. Page 1 of 14

Transcription:

CS 378 Big Data Programming Lecture 9 Complex Writable Types

Review Assignment 4 - CustomWritable QuesIons/issues?

Hadoop Provided Writables We ve used several Hadoop Writable classes Text LongWritable ArrayWritable Extended as LongArrayWritable, DoubleArrayWritable Hadoop provides many other classes Wrappers for all Java primiive types Some for Hadoop usage (TaskTrackerStatus) Others for us to extend (MapWritable)

User Defined Writables Hadoop provided classes cover commonly used types and data structures But we re likely to need more applicaion specific data structures/types For example, WordStatistics We can define these one by one Must implement the Writable interface This will become tedious

User Defined Data Types Where might we look for a soluion? How are ad hoc types transferred elsewhere? Web formats for data structures XML, JSON Plus: Human readable, self describing Minus: verbose, serializaion is slower Java serializaion We have to write the serializaion code Again tedious, as data types get complex

User Defined Data Types RPC mechanisms Marshall data in objects to be transferred to a remote procedure (no shared memory) Usually procedure calls share memory Java serializaion is one such mechanism Some others we ll look at: Google protocol buffers (protobufs) AVRO

Protobuf and AVRO These two approaches are interesing in that They allow us to define complex types via a schema or IDL (Interface DefiniIon Language) They handle all the data marshalling/serializaion They create bindings for various langauges AVRO was designed for use with Hadoop Protobufs require a Writable wrapper May be provided now, wasn t a few years ago

Protobuf Basics Protocol buffers (protobufs) used extensively at Google as the RPC mechanism MulIple language support (Java, C++, Python) Used in the Google map- reduce framework The schema language (IDL) defines messages Data structures containing primiive data types Required or opional Repeated (array) Embedded message

Protobuf Example package stats; option java_package = com.refactorlabs.cs378.utils ; option java_outer_classname = WordStatisticsProto ; message WordStatistics { } required int64 document_count = 1; required int64 total_count = 2; required int64 sum_of_squares = 3; optional double mean = 4; optional double variance = 5;

Schema EvoluIon As your data changes and you update the message definiion Old Java code can read and use data wrigen under the new schema It simply doesn t see the new fields New Java code can read and use data wrigen under the old schema New fields added must be opional The has() methods can be used to determine where new fields are unpopulated

AVRO Basics AVRO provides serializaion of objects RPC mechanism Container file for storing objects (schema stored also) Binary format as well as text format The schema language allows us to define complex objects Schema language uses JSON syntax Data structures containing primiive data types Complex types: record, enum, array, map, union, fixed

AVRO Basics PrimiIve types null boolean int, long float, double bytes, string Union: list of possible types If null included, field can have no value

AVRO Basics Records name, namespace doc aliases fields Name, doc, type, default, order, aliases Enums name, namespace aliases, doc symbols

AVRO Basics Arrays items { type : array, items : string } Maps values { type : map, values : string } Keys are assumed to be strings Fixed Fixed number of bytes

AVRO Basics With a schema defined, we compile it to create bindings to a language Output is Java source code (Python available too) Package and class name as we defined them So what does this Java class do for us? Allows instance to be created and populated Allows access to the data stored therein Performs serializaion This is one main reason for using AVRO objects AVRO objects implement Writable for use in Hadoop mapreduce AVRO objects implement other stuff (tostring(), parsing, )

Schema EvoluIon As your data changes and you update the message definiion In AVRO objects, the writer s schema is included, and can be compared to the reader s schema Comparison rules and rules for handling missing fields (in one schema but not the other) can be found here: hgp://avro.apache.org/docs/1.7.4/spec.html#schema+resoluion