AVRO - SERIALIZATION




http://www.tutorialspoint.com/avro/avro_serialization.htm Copyright tutorialspoint.com

What is Serialization?

Serialization is the process of translating data structures or object state into a binary or textual form so that the data can be transported over a network or stored on some persistent storage. Once the data has been transported over the network or retrieved from the persistent storage, it needs to be deserialized again. Serialization is also termed marshalling, and deserialization is termed unmarshalling.

Serialization in Java

Java provides a mechanism called object serialization, in which an object is represented as a sequence of bytes that includes the object's data as well as information about the object's type and the types of the data stored in it. After a serialized object has been written to a file, it can be read back and deserialized; that is, the type information and the bytes that represent the object and its data can be used to recreate the object in memory. In Java, the ObjectOutputStream and ObjectInputStream classes are used to serialize and deserialize objects, respectively.

Serialization in Hadoop

In distributed systems such as Hadoop, serialization is generally used for two purposes: interprocess communication and persistent storage.

Interprocess Communication

To establish interprocess communication between the nodes connected in a network, the RPC (Remote Procedure Call) technique is used. RPC uses serialization internally to convert the message into binary format before sending it to the remote node over the network; at the other end, the remote system deserializes the binary stream back into the original message. The RPC serialization format is required to be:

Compact - To make the best use of network bandwidth, which is the scarcest resource in a data center.

Fast - Since communication between the nodes is crucial in distributed systems, serialization and deserialization should be quick and produce little overhead.

Extensible - Protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol in a controlled manner for clients and servers.

Interoperable - The message format should support nodes written in different languages.

Persistent Storage

Persistent storage is digital storage that does not lose its data when the power supply fails. Magnetic disks and hard disk drives are examples.

Writable Interface

Writable is the interface in Hadoop that provides the methods for serialization and deserialization.

The Writable interface declares the following methods.

1. void readFields(DataInput in) - Deserializes the fields of the given object from the input stream.

2. void write(DataOutput out) - Serializes the fields of the given object to the output stream.

WritableComparable Interface

WritableComparable is the combination of the Writable and Comparable interfaces: it inherits Hadoop's Writable interface as well as Java's Comparable interface, and therefore provides methods for data serialization, deserialization, and comparison.

1. int compareTo(WritableComparable obj) - Compares the current object with the given object obj.

In addition to these interfaces, Hadoop supports a number of wrapper classes that implement WritableComparable, each wrapping a Java primitive type (for example, IntWritable for int). These classes are useful for serializing various types of data in Hadoop. As an instance, consider the IntWritable class and see how it is used to serialize and deserialize data in Hadoop.

IntWritable Class

This class implements the Writable, Comparable, and WritableComparable interfaces. It wraps an integer value and provides the methods used to serialize and deserialize integer data.

Constructors

1. IntWritable()
2. IntWritable(int value)

Methods

1. int get() - Returns the integer value held by the current object.

2. void readFields(DataInput in) - Deserializes the data from the given DataInput object into the current object.

3. void set(int value) - Sets the value of the current IntWritable object.

4. void write(DataOutput out) - Serializes the data in the current object to the given DataOutput object.

Serializing the Data in Hadoop

The procedure to serialize integer data is discussed below.

Instantiate the IntWritable class by wrapping an integer value in it.

Instantiate the ByteArrayOutputStream class.

Instantiate the DataOutputStream class, passing the ByteArrayOutputStream object to it.

Serialize the integer value in the IntWritable object using the write() method. This method needs a DataOutput object; the serialized data is stored in the byte-array stream that was passed to the DataOutputStream at instantiation time.

Convert the stream's contents to a byte array using toByteArray().

Example

The following example shows how to serialize integer data in Hadoop.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class Serialization {

   public byte[] serialize() throws IOException {
      //Instantiating the IntWritable object
      IntWritable intwritable = new IntWritable(12);

      //Instantiating the ByteArrayOutputStream object
      ByteArrayOutputStream byteoutputStream = new ByteArrayOutputStream();

      //Instantiating the DataOutputStream object
      DataOutputStream dataOutputStream = new DataOutputStream(byteoutputStream);

      //Serializing the data
      intwritable.write(dataOutputStream);

      //Storing the serialized object in a byte array
      byte[] byteArray = byteoutputStream.toByteArray();

      //Closing the OutputStream
      dataOutputStream.close();
      return byteArray;
   }

   public static void main(String[] args) throws IOException {
      Serialization serialization = new Serialization();
      byte[] bytes = serialization.serialize();
      System.out.println("Serialized " + bytes.length + " bytes");
   }
}

Deserializing the Data in Hadoop

The procedure to deserialize integer data is discussed below.

Instantiate the IntWritable class.

Instantiate the ByteArrayInputStream class with the serialized byte array.

Instantiate the DataInputStream class, passing the ByteArrayInputStream object to it.

Deserialize the data from the DataInputStream object using the readFields() method of the IntWritable class. The deserialized data is stored in the IntWritable object; you can retrieve it using the get() method of that class.

Example

The following example shows how to deserialize integer data in Hadoop.

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;

import org.apache.hadoop.io.IntWritable;

public class Deserialization {

   public void deserialize(byte[] byteArray) throws Exception {
      //Instantiating the IntWritable object
      IntWritable intwritable = new IntWritable();

      //Instantiating the ByteArrayInputStream object
      ByteArrayInputStream inputStream = new ByteArrayInputStream(byteArray);

      //Instantiating the DataInputStream object
      DataInputStream dataInputStream = new DataInputStream(inputStream);

      //Deserializing the data in the DataInputStream
      intwritable.readFields(dataInputStream);

      //Printing the deserialized data
      System.out.println(intwritable.get());
   }

   public static void main(String[] args) throws Exception {
      Deserialization dese = new Deserialization();
      dese.deserialize(new Serialization().serialize());
   }
}

Advantage of Hadoop over Java Serialization

Hadoop's Writable-based serialization can reduce object-creation overhead by reusing Writable objects, which is not possible with Java's native serialization framework.

Disadvantages of Hadoop Serialization

There are two ways to serialize Hadoop data:

You can use the Writable classes provided by Hadoop's native library.

You can use Sequence Files, which store the data in a binary format.

The main drawback of these two mechanisms is that Writables and SequenceFiles have only a Java API; they cannot be written or read from any other language. Therefore, files created in Hadoop with either of these mechanisms cannot be read by other languages, which makes Hadoop a limited box. To address this drawback, Doug Cutting created Avro, a language-independent data serialization system.
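The reuse advantage described above can be demonstrated without a Hadoop dependency. In the following JDK-only sketch, the class names Box, RefillableInt, and ReuseDemo are made up for illustration: Java's ObjectInputStream.readObject() allocates a fresh object for every record read, while a Writable-style readFields() refills one preallocated object in place.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

//Hypothetical record type for the Java-serialization side of the comparison
class Box implements Serializable {
   int value;
   Box(int value) { this.value = value; }
}

//Hypothetical Writable-style type: readFields() refills an existing instance
class RefillableInt {
   int value;
   void write(DataOutput out) throws IOException { out.writeInt(value); }
   void readFields(DataInput in) throws IOException { value = in.readInt(); }
}

public class ReuseDemo {

   //Java serialization: readObject() allocates a fresh object per record
   static boolean javaReusesObjects() throws IOException, ClassNotFoundException {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      ObjectOutputStream oos = new ObjectOutputStream(buf);
      oos.writeObject(new Box(1));
      oos.writeObject(new Box(2));
      oos.close();

      ObjectInputStream ois =
            new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray()));
      Box first = (Box) ois.readObject();
      Box second = (Box) ois.readObject();
      return first == second;   //false: two distinct allocations
   }

   //Writable style: one preallocated object is refilled for every record
   static int[] writableRefillValues() throws IOException {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buf);
      RefillableInt w = new RefillableInt();
      w.value = 1; w.write(out);
      w.value = 2; w.write(out);
      out.close();

      DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
      RefillableInt reused = new RefillableInt();
      reused.readFields(in);
      int firstValue = reused.value;
      reused.readFields(in);   //the same object is refilled in place
      return new int[] { firstValue, reused.value };
   }

   public static void main(String[] args) throws Exception {
      System.out.println("Java serialization reuses objects: " + javaReusesObjects());
      int[] values = writableRefillValues();
      System.out.println("Refilled values: " + values[0] + ", " + values[1]);
   }
}
```

In a MapReduce job that processes millions of records, this difference matters: a mapper can allocate one IntWritable and refill it per record, instead of paying for one allocation per record as Java serialization would require.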
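For reference, IntWritable.write() delegates to DataOutput.writeInt() in Hadoop's implementation, so the four bytes produced by the serialize() method above are simply the big-endian encoding of the wrapped value. The following JDK-only sketch (the class name IntBytesDemo is made up; no Hadoop dependency) reproduces that byte layout:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class IntBytesDemo {

   //Encode an int the way DataOutput.writeInt does: four big-endian bytes
   static byte[] encode(int value) throws IOException {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buf);
      out.writeInt(value);
      out.close();
      return buf.toByteArray();
   }

   //Decode the four bytes back into an int with DataInput.readInt
   static int decode(byte[] bytes) throws IOException {
      return new DataInputStream(new ByteArrayInputStream(bytes)).readInt();
   }

   public static void main(String[] args) throws IOException {
      byte[] bytes = encode(12);
      System.out.println(Arrays.toString(bytes));   //[0, 0, 0, 12]
      System.out.println(decode(bytes));            //12
   }
}
```

This fixed four-byte layout is what makes the earlier Serialization/Deserialization pair symmetric: readFields() reads back exactly the bytes that write() produced.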