Private Content Based Image Information Retrieval using Map-Reduce


Private Content Based Image Information Retrieval using Map-Reduce

A Dissertation submitted in fulfillment for the award of the degree Master of Technology, Computer Engineering

Submitted by
Arpit D. Dongaonkar
MIS No:

Under the guidance of
Prof. Sunil B. Mane
College of Engineering, Pune

DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATION TECHNOLOGY,
COLLEGE OF ENGINEERING, PUNE-5
June, 2013

DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATION TECHNOLOGY, COLLEGE OF ENGINEERING, PUNE

CERTIFICATE

This is to certify that the dissertation entitled Private Content Based Image Information Retrieval using Map-Reduce has been successfully completed by Mr. Arpit Dilip Dongaonkar, MIS No , as a fulfillment of the End Semester evaluation of Master of Technology.

SIGNATURE
Prof. Sunil B. Mane
Project Guide
Dept. of Computer Engineering and Information Technology,
College of Engineering Pune,
Shivajinagar, Pune - 5.

SIGNATURE
Dr. J. V. Aghav
Head
Dept. of Computer Engineering and Information Technology,
College of Engineering Pune,
Shivajinagar, Pune - 5.

Dedicated to

My Mother Sunita D. Dongaonkar,
My Father Prof. Dilip R. Dongaonkar
for their love and everything that they have given me...

and also to

All family members for supporting me...

Acknowledgements

I would like to express my sincere gratitude towards Prof. Sunil B. Mane, my research guide, for his patient guidance, enthusiastic encouragement and useful critiques of this research work. Without his invaluable guidance, this work would never have been successful. I would also like to thank Vikas Jadhav, my classmate, for continuous and detailed knowledge sharing on Hadoop and other aspects, from installation to working. I would also like to thank all my M.Tech friends for their unconditional support and help. Last but not least, my sincere thanks to all my teachers who directly or indirectly helped me to learn new technologies and complete my post graduation successfully.

Arpit Dilip Dongaonkar
College of Engineering, Pune

Abstract

Today, the role of information technology is changing, and this change is leading to real-life software applications: information technology has entered every part of our day-to-day life, from traffic signals and power grids to everyday financial transactions. But this technology comes at a cost. As it touches every layer of life and community, there is a need to make it available to every layer of the community. Cloud computing is an emerging technology which reduces this cost while providing an efficient information technology platform and service. The cloud reduces the load on local disks, operating systems and, eventually, on CPU processing. Cloud providers offer infrastructure, platforms, databases etc. on which you can install your operating system and required software and store your data. This service is billed on a pay-per-hour basis, so any application deployed on the cloud must be fast, efficient and secure. As the Internet expands, the data created over it grows exponentially, and a large part of this data consists of images. Today, a large amount and variety of image data is produced through digital cameras, mobile phones, photo-editing software etc. These images are private to a particular user, and this digital image data should be secured so that no one except the owner can access it. Several domains use content based image retrieval applications; such applications should be fast enough to carry out user functionalities efficiently and secure enough to protect user data. If an application is to be deployed on the cloud, it should be supported by the cloud infrastructure. One technology which is supported on the cloud and performs distributed parallel computing is MapReduce. I am proposing a system which allows a user to upload, search and retrieve some of his personal images, based on one particular query image, securely and efficiently from

his continuously expanding image database. The system incorporates an encryption technique for security and the Map-Reduce technique for efficient upload, search and retrieval of images over a large dataset of user images. The proposed solution is a system to securely upload, search and query images in massive image data storage using a content based image retrieval scheme with the Map-Reduce technique, evaluated and compared against an existing solution.

Keywords - Information technology, Cloud computing, secured personal data.

Contents

List of Tables
List of Figures

1 INTRODUCTION
1.1 Private Information Retrieval
1.2 Image Retrieval
1.2.1 Content Based Image Retrieval
1.3 Similarity Measurement
1.3.1 Euclidean Distance
1.3.2 Mahalanobis Distance
1.3.3 Chord Distance
1.4 Hadoop
1.4.1 MapReduce
1.4.2 HDFS
1.4.3 HBase
1.5 Oblivious Technique
2 Literature Survey
2.1 PIRMAP [1]
2.1.1 PIR
2.1.2 MapReduce in PIRMAP
2.1.3 Working of PIRMAP
2.1.4 Future Scope/Issues
2.2 Map/Reduce in CBIR Application [2]
2.2.1 System Model
2.2.2 Future scope/issues
2.3 An Oblivious Image Retrieval Protocol [3]
2.3.1 System Model
2.3.2 Future Scope/Issues
2.4 Distributed Image Retrieval System Based on MapReduce [4]
2.4.1 System Model
2.4.2 Future Scope/Issues
2.5 Summary
3 Motivation
3.1 Problem Definition
3.2 Scope of Research
3.3 Objectives
3.4 System Requirement Specifications
4 System Design
4.1 Complete Architecture
4.1.1 Upload Images
4.1.2 Image Retrieval
5 Implementation of System
5.1 Log In Screen
5.2 Create Account
5.3 User Operation Screen
5.3.1 Changes in Hadoop
5.3.2 Upload Process
5.3.3 Image Retrieval
6 Experiments, Testing and Result Analysis
6.1 Cluster Configuration
6.2 Experiments and Testing
6.3 Result Analysis
6.4 Error Check in Application
6.5 Comparison of Different Systems
6.6 System Output
6.6.1 Input Images
7 Conclusions and Future Scope
7.1 Conclusion
7.2 Future Scope
A Publication Status

List of Tables

6.1 Cluster Configuration
6.2 Second Cluster Configuration

List of Figures

1.1 Map Reduce Process
4.1 System Architecture
4.2 Upload Process
4.3 Image Retrieval Process
5.1 LogIn Screen
5.2 Create Account Screen
5.3 Account Creation Screen - Validation
5.4 User Operations Screen
5.5 Upload Screen
5.6 Uploading Screen
5.7 Process behind Upload
5.8 Search Screen
5.9 Searching Screen
5.10 Processes behind Search Screen
6.1 One Node Cluster Difference at Upload MapReduce
6.2 One Node Cluster Difference at Retrieval MapReduce
6.3 Three Node Cluster Difference at Upload MapReduce
6.4 Three Node Cluster Difference at Retrieval MapReduce
6.5 Performance of Upload MapReduce at Cluster
6.6 Performance of Retrieval MapReduce at Cluster
6.7 Scale Up between Different Cluster Nodes
6.8 Scale Up between Different Cluster Nodes
6.9 Error Checking in
6.10 Error Check at Upload Images
6.11 Error check at Search Images
6.12 Error check at Search Images
6.13 Input set of Images Uploaded
6.14 Input Search Image
6.15 Output of System

Chapter 1 INTRODUCTION

1.1 Private Information Retrieval

Today, most of the data on the Internet lies in database servers connected to the Internet. When a user makes a query to a database, he/she should get the required information. There are, however, many curious users who query a database, intentionally or unintentionally, to obtain the private information of other users. A lot of research has been devoted to protecting databases from this type of intrusion. Private information retrieval is a research area where security against this kind of intrusion is considered. A private information retrieval (PIR) protocol allows a user to retrieve an item from a server in possession of a database without revealing which item he/she is retrieving. Over time, different improvements to PIR have been proposed which make PIR faster, more efficient and more secure. Techniques such as oblivious transfer have also been implemented in PIR; oblivious transfer restricts the user from getting information about other database items [3]. PIR also maintains the privacy of queries against the database.

1.2 Image Retrieval

An image retrieval system is designed to browse, search and retrieve images from a large database of digital images. Most traditional and common methods of image retrieval

utilize some method of adding metadata such as captions, keywords, or descriptions to the images so that retrieval can be performed over the annotation words. Manual image annotation is time-consuming, laborious and expensive; to address this, there has been a large amount of research on automatic image annotation [2]. Annotated images are searched based on the images' metadata.

Image meta search: search of images based on associated metadata such as keywords, text, etc.

Content-based image retrieval (CBIR): the application of computer vision to image retrieval. CBIR aims at avoiding the use of textual descriptions and instead retrieves images based on similarities in their contents (textures, colors, shapes etc.) to a user-supplied query image or user-specified image features.

1.2.1 Content Based Image Retrieval

Image retrieval has been an active research area since the 1970s. In the beginning, research concentrated on text based search only. It was a new framework at that time, which used the names of image files as the search criterion. In this framework, images needed to be annotated first and were then retrieved from a database management system [5]. This framework had two limitations: first, the size of image data that can fit into a database, and second, the extensive manual annotation work. There was a need for research on retrieval without the limitations of manual entry and large data size, and this led to the Content Based Image Retrieval technique. Content-based image retrieval (CBIR) is also known as query by image content (QBIC) [6]. CBIR is a technique in which the content of an image is used as the matching criterion instead of the image's metadata such as keywords, tags, or any name associated with the image. This provides a closer approximate match compared to text based image retrieval [7], [8], [9], [10].

The term content in this context might refer to colors, shapes, textures, or any other information that can be derived from the image itself [6], [8], [10]. Image content can be categorized into visual and semantic content. Visual content can be very general or domain specific. General visual content includes color, texture, shape, spatial relationship, etc. Domain specific visual content, like human faces, is application dependent and may involve domain knowledge. Semantic content is obtained either by textual annotation or by complex inference procedures based on visual content [8].

COLOR: Image retrieval based on color actually means retrieval on color descriptors. The most commonly used color descriptors are the color histogram, color coherence vector, color correlogram, and color moments [9]. A color histogram identifies the proportion of pixels within an image holding specific values, which can be used to find the similarity between two images by using similarity distance measures [8], [11]. It tries to identify color proportion by region and is independent of image size, format or orientation [6]. Color moments consist of the first order (mean), second order (variance) and third order (skewness) moments of an image [8].

TEXTURE: The texture of an image refers to the visual patterns that the image possesses and how they are spatially defined. Textures are represented by texels, which are then placed into a number of sets, depending on how many textures are detected in the image. These sets not only define the texture, but also where in the image the texture is located. Statistical methods such as co-occurrence matrices can be used for a quantitative measure of the arrangement of intensities in a region [8], [12].

SHAPE: Shape here does not mean the shape of the whole image, but the shape of a particular region or object. Segmentation and edge detection are prominent techniques that can be used in shape detection [6].

There are several components calculated from image content, such as color intensity, entropy, mean of the image etc., which are useful in the creation of a feature vector. The feature vector of every image is calculated and stored in a database. When the user wants to retrieve a set of images, he queries the system through an image. A rough sketch of the color moment computation is given below.
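The following is a rough, self-contained Java sketch of the color moment idea for one channel (first order mean, second order variance, third order skewness over the pixel values of that channel); the class and method names are illustrative and not taken from any cited system.

public final class ColorMoments {
    // First three color moments of one channel (e.g. the red values of all pixels).
    public static double[] colorMoments(int[] channel) {
        int n = channel.length;
        double mean = 0;
        for (int v : channel) mean += v;
        mean /= n;

        double variance = 0, skewness = 0;
        for (int v : channel) {
            double d = v - mean;
            variance += d * d;
            skewness += d * d * d;
        }
        variance /= n;
        double sigma = Math.sqrt(variance);
        // Third standardized moment; defined as 0 when the channel is completely uniform.
        skewness = sigma == 0 ? 0 : (skewness / n) / (sigma * sigma * sigma);

        return new double[] { mean, variance, skewness };
    }
}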

The feature vector of the query image is matched with the vectors of the stored images, and images whose vectors are similar to that of the queried image are retrieved. The user then selects the required image from the set of shown images.

1.3 Similarity Measurement

A similarity measurement is selected to find how similar two vectors are. The problem can be converted to computing the discrepancy between two vectors $x, y \in \mathbb{R}^d$. Three distance measurements, the Euclidean, Mahalanobis, and chord distances, are reviewed as follows [11]. A small sketch of the Euclidean and chord distances in code is given at the end of this section.

1.3.1 Euclidean Distance

The Euclidean distance between $x, y \in \mathbb{R}^d$ is computed by [11]

$\delta_1(x, y) = \|x - y\|_2 = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2}$

1.3.2 Mahalanobis Distance

The Mahalanobis distance between two vectors x and y with respect to the training patterns $x_i$ is computed by [11]

$\delta_2(x, y)^2 = (x - y)^t S^{-1} (x - y)$

where the mean vector u and the sample covariance matrix S of the sample $\{x_i \mid 1 \le i \le n\}$ of size n are computed by [11]

$S = \frac{1}{n} \sum_{i=1}^{n} (x_i - u)(x_i - u)^t$ with $u = \frac{1}{n} \sum_{i=1}^{n} x_i$

1.3.3 Chord Distance

The chord distance between two vectors x and y measures the distance between the projections of x and y onto the unit sphere, and can be computed by [11]

$\delta_3(x, y) = \left\| \frac{x}{r} - \frac{y}{s} \right\|_2$, where $r = \|x\|_2$ and $s = \|y\|_2$
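Two of these measures translate directly into code. A minimal Java sketch, assuming plain double[] feature vectors of equal length (class and method names are illustrative):

public final class Distances {
    // Euclidean distance: delta_1(x, y) = sqrt(sum_j (x_j - y_j)^2)
    public static double euclidean(double[] x, double[] y) {
        double sum = 0;
        for (int j = 0; j < x.length; j++) {
            double diff = x[j] - y[j];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Chord distance: delta_3(x, y) = || x/||x||_2 - y/||y||_2 ||_2
    public static double chord(double[] x, double[] y) {
        double r = 0, s = 0;
        for (int j = 0; j < x.length; j++) { r += x[j] * x[j]; s += y[j] * y[j]; }
        r = Math.sqrt(r);
        s = Math.sqrt(s);
        double sum = 0;
        for (int j = 0; j < x.length; j++) {
            double diff = x[j] / r - y[j] / s;
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}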

1.4 Hadoop

Apache Hadoop is a collection of open-source software projects for reliable, scalable, distributed computing. The Hadoop software libraries provide a framework that allows distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from a single machine to thousands of machines, each offering local computation and storage [13]. Some modules of Hadoop are listed below [13]:

1. Hadoop Common: the common utilities that support the other Hadoop modules.
2. Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
3. Hadoop YARN: a framework for job scheduling and cluster resource management.
4. Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
5. HBase: a scalable, distributed database that supports structured data storage for large tables.

1.4.1 MapReduce

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of commodity computers (nodes). The computation performed on the nodes is independent of the physical location of the nodes, i.e. whether all nodes are on the same network and use similar hardware (called a cluster) or the nodes are shared across geographically and administratively distributed systems and use more heterogeneous hardware (called a grid). It is also independent of the location of the data; data can be stored in the cluster at any node. The computation takes a set of input key/value pairs and produces a set of output key/value pairs.

The user of the MapReduce library expresses the computation as two functions: map and reduce. The map function, written by the programmer, takes an input key/value pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all values associated with the same key and passes them to the reduce function. The reduce function, also written by the programmer, accepts a key together with the list of values generated by the map function for that key. It merges these values to form a possibly smaller set of values; typically just zero or one output value is produced per key, which is the actual output of the user program.

map (k1, v1) -> list(k2, v2)
reduce (k2, list(v2)) -> list(k3, v3)

Computational processing can occur on data stored either in a file system (unstructured, HDFS) or in a database (structured, HBase).

Map: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node [14].

Reduce: The master node then collects the answers to all the sub-problems and combines them in some way to form the output: the answer to the problem it was originally trying to solve [14].

MapReduce allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of reducers can perform the reduce phase provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative.

Figure 1.1: Map Reduce Process.

Example [15]: The map function emits each word plus an associated count of occurrences (just 1 in this simple example). The reduce function sums together all counts emitted for a particular word.

map(String key, String value):
  // key: text file name, value: text file contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word, values: a list of counts for that word
  int total_count = 0;
  for each v in values:
    total_count += ParseInt(v);
  Emit(AsString(total_count));
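For reference, the same word count expressed against the Hadoop Java MapReduce API might look roughly like the sketch below; it assumes the standard org.apache.hadoop.mapreduce Mapper and Reducer classes, and the job wiring (driver class, input/output paths) is omitted.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for every input line, emit (word, 1) pairs.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);              // intermediate (word, 1)
        }
    }
}

// Reducer: sum all counts emitted for the same word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable v : values) {
            total += v.get();
        }
        context.write(key, new IntWritable(total)); // final (word, total count)
    }
}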

1.4.2 HDFS

The Hadoop Distributed File System (HDFS) [16] is another module of the Hadoop project. It is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and works on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets [16].

1.4.3 HBase

HBase works on top of HDFS. It is a distributed, column-oriented database built on top of HDFS [17]. It is the Hadoop application to use when users require real-time read/write random access to very large datasets. HBase can scale linearly just by adding nodes. It is not relational and does not support SQL, but given the proper problem space, it is able to do what an RDBMS cannot: host very large, sparsely populated tables on clusters made from commodity hardware [17]. A minimal sketch of how a system like this one might store an image's encrypted feature vector in an HBase table is given below.
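As an illustration only, storing an image's encrypted feature vector as one row of an HBase table could look like the following sketch with the standard HBase Java client; the table name "images", column family "feat" and row-key scheme are hypothetical and not taken from the implementation described later.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FeatureStoreSketch {
    public static void storeFeature(String imageId, byte[] encryptedFeature) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("images"))) { // hypothetical table
            // Row key = image id; one column of family "feat" holds the encrypted vector.
            Put put = new Put(Bytes.toBytes(imageId));
            put.addColumn(Bytes.toBytes("feat"), Bytes.toBytes("vector"), encryptedFeature);
            table.put(put);
        }
    }
}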

1.5 Oblivious Technique

The oblivious retrieval technique is designed to achieve user and database privacy. It relies on a homomorphic encryption technique, which allows addition and multiplication operations to be performed on encrypted data without changing the original plaintext data. In this application, homomorphic encryption is applied to image feature data. Today, many website firms outsource their data to external storage servers. In this case, it is necessary to maintain the privacy of the user with respect to the external server, and vice versa. The oblivious technique is one attempt to maintain this type of security: privacy of the user with respect to the server, as well as privacy of the server with respect to the user, is maintained with the use of homomorphic encryption. Homomorphic encryption is an asymmetric cryptographic technique; it works with a public and private key pair. It allows specific types of computations to be carried out on ciphertext and yields an encrypted result which, when decrypted, matches the result of the operations performed on the plaintext. For instance, one person could add two encrypted numbers and then another person could decrypt the result, without either of them being able to find the values of the individual numbers. Using the public key of the user, the feature vector is stored in encrypted form in the database. When the user issues a query, it goes to the database in encrypted form, and images are searched and retrieved on the basis of similar encrypted feature vectors. The encrypted image data is sent back to the user, who decrypts it using his private key. The cryptographic technique used in this system is the Paillier cryptosystem. In the Paillier cryptosystem, the homomorphic property is supported as follows.

Homomorphic addition of plaintexts: the product of two ciphertexts decrypts to the sum of their corresponding plaintexts,

$D(E(m_1, r_1) \cdot E(m_2, r_2) \bmod n^2) = (m_1 + m_2) \bmod n$

Homomorphic encryption algorithms:
1. Simple RSA algorithm
2. ElGamal cryptosystem
3. Paillier cryptosystem

A small sketch of Paillier's additive property is given after this list.
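The following is a minimal, self-contained Java sketch of the Paillier cryptosystem and its additive property, using the textbook simplification g = n + 1; the key size, randomness handling and class names are illustrative only, and this is not the implementation used in the system.

import java.math.BigInteger;
import java.security.SecureRandom;

public class PaillierSketch {
    private static final SecureRandom RNG = new SecureRandom();
    private final BigInteger n, nSquare, g, lambda, mu;

    public PaillierSketch(int bits) {
        BigInteger p = BigInteger.probablePrime(bits / 2, RNG);
        BigInteger q = BigInteger.probablePrime(bits / 2, RNG);
        n = p.multiply(q);
        nSquare = n.multiply(n);
        g = n.add(BigInteger.ONE);                     // simplified variant: g = n + 1
        // phi(n) is used here; lcm(p - 1, q - 1) also works.
        lambda = p.subtract(BigInteger.ONE).multiply(q.subtract(BigInteger.ONE));
        mu = L(g.modPow(lambda, nSquare)).modInverse(n);
    }

    private BigInteger L(BigInteger u) {               // L(u) = (u - 1) / n
        return u.subtract(BigInteger.ONE).divide(n);
    }

    public BigInteger encrypt(BigInteger m) {          // E(m) = g^m * r^n mod n^2
        BigInteger r = new BigInteger(n.bitLength(), RNG)
                .mod(n.subtract(BigInteger.ONE)).add(BigInteger.ONE);
        return g.modPow(m, nSquare).multiply(r.modPow(n, nSquare)).mod(nSquare);
    }

    public BigInteger decrypt(BigInteger c) {          // D(c) = L(c^lambda mod n^2) * mu mod n
        return L(c.modPow(lambda, nSquare)).multiply(mu).mod(n);
    }

    public static void main(String[] args) {
        PaillierSketch ps = new PaillierSketch(512);
        BigInteger c1 = ps.encrypt(BigInteger.valueOf(17));
        BigInteger c2 = ps.encrypt(BigInteger.valueOf(25));
        // Multiplying ciphertexts corresponds to adding the underlying plaintexts.
        BigInteger sum = ps.decrypt(c1.multiply(c2).mod(ps.nSquare));
        System.out.println(sum);                        // prints 42
    }
}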

Chapter 2 Literature Survey

2.1 PIRMAP [1]

Private Information Retrieval (PIR) allows for retrieval of bits from a database in a way that hides a user's access pattern from the server. This paper presents PIRMAP, a practical, highly efficient protocol for PIR in MapReduce, a widely supported cloud computing API. PIRMAP focuses especially on the retrieval of large files from the cloud, where it achieves optimal communication complexity (O(l) for retrieval of an l-bit file) with query times significantly faster than previous schemes.

2.1.1 PIR

When a user connects to a cloud, there is a series of interactions between the user and the server. The user connects to the cloud to retrieve his private information which is stored there. The information can be in the form of different files. The user wants to retrieve x files of his choice out of his n files, and the server should not learn which files he chose. The user therefore has to provide a query to the cloud such that he receives the required files and the transaction remains secure.

2.1.2 MapReduce in PIRMAP

This information retrieval technique uses MapReduce to parallelize the process, which reduces the computational time to retrieve a file from the cloud and makes it very easy to use. The first phase is called the Map phase. MapReduce automatically splits input computations equally among the available nodes in the cloud data center, and each node then runs a function called map on its respective piece (called an InputSplit). It is important to note that the splitting actually occurs when the data is uploaded into the cloud. This means that each mapper node has local access to its InputSplit as soon as computation is started, and a lengthy copying and distributing period is avoided. The map function runs a user defined computation on each InputSplit and outputs (emits) a number of key-value pairs that go into the next phase. The second phase, Reduce, takes as input all of the key-value pairs emitted by the mappers and sends them to reducer nodes in the data center. Specifically, each reducer node receives a single key, along with the sequence of values emitted by the mappers which share that key. The reducers then take each set and combine it in some way, emitting a single value for each key [1].

2.1.3 Working of PIRMAP

PIRMAP is an extension of the PIR protocol by Kushilevitz and Ostrovsky, targeting the retrieval of large files in a parallelization-aggregation computation framework such as MapReduce. We start by giving an overview of PIRMAP, which can be used with any additively homomorphic encryption scheme.

Upload: In the following, we assume that the cloud user has already uploaded its files into the cloud using the interface provided by the cloud provider.

Query: In keeping with standard PIR notation, the data set holds n files, each of which

is l bits in length. There is also an additional parameter k, which is the block size of the chosen cipher. For ease of presentation, we consider the case where all files are the same length, but PIRMAP can easily be extended to accommodate variable length files by padding or prepending each file with a few bytes that specify its length.

2.1.4 Future Scope/Issues

This implementation is specifically for files which contain textual data; it does not support multimedia data such as image, video or audio files.

2.2 Map/Reduce in CBIR Application [2]

In the beginning, image retrieval depended mainly on text based retrieval of an image. This technology is widely used; the current mainstream search engines Google, Baidu, Yahoo, etc. mainly use this method to search images. In this technique, the name of the image file is used to compare and retrieve the image from the database. But it has drawbacks: researchers often need to manually mark all images with text, and this mark text cannot objectively and accurately describe the visual information of the images. After the 1990s, Content-Based Image Retrieval appeared (referred to below as CBIR); unlike TBIR, it can extract visual features from the image automatically and then retrieve images by their visual features. Since CBIR is intuitive and efficient, and can be widely used in information retrieval, medical diagnosis, trademark and intellectual property protection, crime prevention and other areas, it has very high applied value.

2.2.1 System Model

Selecting an Algorithm

The technique used in this paper uses color feature-based image retrieval. This mainly involves the following algorithms: color histogram, hue histogram, color moments, color entropy etc. [15]

Taking into account that the color histogram algorithm is widely used and that its feature extraction and similarity matching are easier, this method is used as the target color algorithm.

Color Histogram

The color histogram of an image gives extensive information about its structure. It gives the number of pixels and their colors in RGB format etc., so it is easy to calculate the mean, entropy and median of an image, which can be used as features in feature extraction. These features can be compared with the queried image's features to retrieve all similar images from the database.

Feature Calculation and Similarity Matching

The feature vector of each uploaded image is stored in the database. This feature vector is matched with the feature vector of the input image. Both feature vectors are used to calculate a similarity coefficient, the Pearson correlation coefficient. MapReduce is used in the similarity matching phase only, to provide fast results. The feature vector of the input image is calculated at query time and matched against the images from the image library, and the matching images are retrieved.

2.2.2 Future scope/issues

The MapReduce technique is used at the similarity matching stage of the system. It could also be used efficiently in the feature extraction process, by splitting the image into parts at the map stage and then combining them at the reduce stage. A small sketch of the Pearson correlation computation mentioned above follows.
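For reference, the Pearson correlation coefficient between two equal-length feature vectors (for example, two color histograms) can be computed as in this small Java sketch; the class and method names are illustrative and not taken from the paper.

public final class Pearson {
    // Pearson correlation coefficient r between two equal-length feature vectors.
    // r lies in [-1, 1]; values close to 1 indicate very similar histograms.
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }
}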

2.3 An Oblivious Image Retrieval Protocol [3]

Today many website providers have large collections of images. For example, google.com and facebook.com have large databases of images where each user uploads his private data to the data center of the website. This data should be protected from external or internal threats. The threat may be stealing of image data, modification of the data, deletion of the private data of an individual, or misuse of it for a particular purpose. As the size of this data increases day by day, companies find it hard to manage and secure such large data. Their data center servers are getting overloaded due to the huge data volume. To reduce the strain on their storage servers they look for another option: using external storage servers. These external storage servers are maintained and managed by other companies, i.e. the data is outsourced to other companies to reduce the overhead on internal servers. In this case it becomes very important to protect and secure a customer's data on the external outsourced database servers. The Oblivious Image Retrieval Protocol comes into the picture for outsourced image databases. It is a technique to securely query an image database for the required image data and retrieve the matched images from the database. The retrieved data should also be securely transferred to the user.

2.3.1 System Model

The system assumes that the feature vectors of all the images are already in the database. This feature data is stored in encrypted form in the database, and all query operations are done on the encrypted data. The protocol mainly contains two parts: a privacy preserving querying mechanism and oblivious transfer of the decryption keys. The privacy preserving protocol of the paper works as follows. First, to query an image set, the user generates an encryption of the query feature vector set using a homomorphic public-key encryption technique and sends it to the database server. The query feature vector is distorted by the user with a constant random vector to prevent any statistical inference by the database server. Second, the database server uses the encrypted query feature vector and performs a homomorphic subtraction with a random feature vector. The server performs a subtraction of the database image features with the same random feature vector. Before sending the results back to the user, the database server performs a permutation of the subtracted feature vectors so that the user is not able to learn the relative indexing structure of the database images. The server includes pseudo-identifiers

for the permuted images to allow the user to identify his choices. Third, upon receiving the server response, the user removes the distorting random constant from the server response and calculates the Euclidean norm of the numerical difference between the query image feature vector and the database image feature vectors [3].

2.3.2 Future Scope/Issues

The oblivious image retrieval technique can be implemented using the MapReduce technique; MapReduce can be used at the content based image retrieval stage.

2.4 Distributed Image Retrieval System Based on MapReduce [4]

A Distributed Image Retrieval System (DIRS) is a system in which images are retrieved in a content based way, and the retrieval over massive image data storage is carried out in parallel by utilizing the MapReduce distributed computing model.

2.4.1 System Model

In this system, images are uploaded to HDFS in one MapReduce process. This MapReduce process extracts features of the images using the LIRE library and stores the images and their feature vectors in an HBase table. Images are also retrieved in parallel. In the feature matching process, the CBIR system calculates the similarity between the sample image and the target images, and then returns those images matching the sample image most closely. Before the MapReduce job is started, the sample image is added to the Distributed Cache to enable every map task to access it (a sketch of this step is given below). Each map task reads the image from the Distributed Cache, extracts its visual features, and then compares the feature vectors with the feature vectors of the target images from HBase. All matched images are returned to the user.
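A rough illustration of that Distributed Cache step with the current Hadoop Java API is sketched below; the file path and job setup are hypothetical, and the system described in the paper may have used the older DistributedCache class instead.

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SampleImageCacheSketch {
    public static Job buildJob(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "image-retrieval");
        // Make the sample (query) image available locally to every map task.
        job.addCacheFile(new URI("hdfs:///user/demo/query.jpg#query.jpg")); // hypothetical path
        return job;
    }

    public static class RetrievalMapper extends Mapper<Object, Object, Object, Object> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // The cached file is visible to the task under its link name.
            URI[] cached = context.getCacheFiles();
            String queryImagePath = cached[0].getPath();
            // ... load the query image and extract its feature vector here ...
        }
    }
}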

2.4.2 Future Scope/Issues

The distributed image retrieval technique needs a secure technique to protect the data.

2.5 Summary

After the literature survey, I found that each paper has some future scope. I want to combine all three future scopes into one application. This application will upload, search and retrieve image files from a distributed database for a particular user and will do so in a secure manner, thus maintaining user and data privacy. This will be carried out using the MapReduce technology, which will optimize the time complexity of retrieval.

Chapter 3 Motivation

With the explosive growth of digital media data, there is a huge demand for new tools and systems that enable the average user to search, access, process, manage, author and share digital media content more efficiently and more effectively. This shifts people's interests from plain work to work combined with entertainment. Examples: business phones to smartphones, watching movies/TV, listening to music, social media to connect and share. There are different types of multimedia content present today. A multimedia information system can store and retrieve text, 2D grey-scale and color images, 1D time series, digitized voice or music, and video, and there is a need for an efficient, robust and secure solution to retrieve this kind of multimedia information. We consider image retrieval rather than audio or video information retrieval because today image retrieval systems have many applications. Image retrieval techniques have gone through several changes and are becoming more and more advanced over time. In the early years of image retrieval, images were searched on a text basis, where each image had to be annotated with particular text. This technique was troublesome because of its complexity. Today the content based image retrieval technique provides an extensive search

facility over the database, which retrieves all related images that are similar in content to the queried image. CBIR technology has been used in several applications such as fingerprint identification, biodiversity information systems, digital libraries, crime prevention, medicine and historical research, among others; CBIR thus covers a wide range of different domains. Today many websites like google.com, facebook.com etc. have a lot of images on their servers. Many users access these websites regularly; they upload, search and download images from them. There is a need for a fast and secure technique to upload, search and retrieve these images on user demand.

3.1 Problem Definition

The purpose of this research work is to create an efficient private content based image information upload, search and retrieval system using MapReduce. The system will be created from easily available APIs which have cloud support, so that the system can easily be ported to the cloud for use.

3.2 Scope of Research

The scope of this research is not limited to one particular domain; it spans multiple domains. Domains in which extensive research can be done are Private Information Retrieval, Content Based Image Retrieval and oblivious transfer, i.e. security. Research papers in this area suggest content based image retrieval techniques over cloud computing as well as secure transfer techniques for the images, but not a single paper talks about fast (cloud-based) as well as secure retrieval of images from servers using the Private Information Retrieval technique. This system can evolve in many ways: by changing the CBIR algorithm, by using different encryption techniques, or by using different storage methods over HDFS, this system can be moved to a new, more efficient and secure level.

3.3 Objectives

1. To implement CBIR in Map-Reduce
2. To select and implement a homomorphic encryption technique for secure retrieval
3. To evaluate the proposed solution

3.4 System Requirement Specifications

Hardware: minimum one desktop system with an Intel or AMD processor and 2 GB of RAM. Newer hardware will provide better results, i.e. a cluster of 5 nodes having Intel Core series processors (i3, i5, i7) with high clock frequency and 4 GB RAM will definitely perform better than a 5 node cluster having an older processor generation working at a lower frequency with less RAM. The cluster should have nodes with the same configuration; if it has nodes with different RAM capacity or processor frequency, the cluster will not perform to capacity.

Software:
1. Operating System: any open source Linux operating system.
2. Coding language: Hadoop MapReduce technology/software, Java language (JDK)
3. IDE: Eclipse environment

Chapter 4 System Design

4.1 Complete Architecture

Figure 4.1 shows the complete architecture of the system.

Figure 4.1: System Architecture

It provides a web based interface to receive the user's requests. After receiving a request, the system performs the specific operation on the images, whether it is upload or retrieval. The core of this system is divided into two phases: one phase uploads images, and the other phase searches for and retrieves similar images. Both modules are deployed on a Hadoop cluster, which follows the MapReduce paradigm. The upload process reads input data from HDFS, processes it using map and reduce functions, and then writes the output results to HDFS again. Similarly, the image retrieval process uses map and reduce functions to retrieve similar images depending upon the queried image. These images are given to the user on the local file system.

4.1.1 Upload Images

Upload Images is an independent process where the user can upload his image or photo information to the system's database. The process of uploading images is carried out with the help of the Map/Reduce technique, which parallelizes the upload. The upload has three sub-processes: first the splitting of an image, second feature extraction, and third encryption and storage. The upload process is parallelized with the help of the Map-Reduce technique for each image. When the user uploads his private photos or images to the database, he might want to upload one image or many images at a single point of time. If the user wants to upload many images at once, he only has to provide the path of the image folder where the images are stored, and the map stage will parallelize the upload process instead of uploading one image at a time. The architecture of the upload process can be seen in figure 4.2. For better understanding, the explanation of the upload process is split into three distinct phases, corresponding to image splitting, image feature extraction, and encryption.

Phase 1: An image which is going to be uploaded is first split into 16 small images. This step is taken to disperse the data across nodes so that an attacker is not able to reconstruct the data from one location.

Figure 4.2: Upload Process

The Map/Reduce technique works on its distributed file system, called the Hadoop Distributed File System (HDFS) [18], [19], [13], [16], which is specifically built for parallelization. All image files are stored in the file system. The information about all split images of every image is stored in an HBase [17] table. HBase works on top of HDFS. The table in HBase stores the path of the image and the other features of that image.

Phase 2: These images are given to the image processing part. In this phase, the color histogram of each image is calculated. This histogram provides different feature values such as entropy, intensity, mean, median etc. These values are used for the similarity calculation [11] of images, i.e. the distance between the queried image and the images from HDFS.

Phase 3: The calculated features are given to the homomorphic encryption system, which stores the vectors in an HBase table in encrypted form. The homomorphic encryption system allows additive and multiplicative operations to be carried out on encrypted data without changing the original data, i.e. multiplication of all 16 encrypted feature vectors of a particular image leads to the summation of the feature vector of that image in plaintext form. The advantage of this comes into the picture at the time of retrieval: when the user queries the system, the encrypted feature vector of the queried image is used against the total feature vector of all split images of a particular image. The algorithm used for this encryption is the Paillier cryptosystem. It uses a pair of private and public keys of the user. The image feature vector of an image is encrypted with the public key of the user and entered into HDFS. The user's public key is available to the database server, whereas the private and public keys are available to the user. The encrypted feature vectors are stored in files on HDFS. Each split image has its corresponding feature vector, and it is stored in the same row along with the encrypted image name. A sketch of the splitting and per-tile feature computation in these phases is given below.
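A minimal Java sketch of Phase 1 and Phase 2 for a single image, splitting it into a 4 x 4 grid of 16 tiles and computing the mean R, G and B value of each tile as a simple feature; class and method names are illustrative, and the actual system additionally encrypts each value with the user's Paillier public key.

import java.awt.image.BufferedImage;

public class SplitAndFeatureSketch {
    /** Returns a 16 x 3 array: for each of the 16 tiles, the mean R, G and B values. */
    public static double[][] tileMeans(BufferedImage img) {
        int tileW = img.getWidth() / 4;
        int tileH = img.getHeight() / 4;
        double[][] means = new double[16][3];
        for (int ty = 0; ty < 4; ty++) {
            for (int tx = 0; tx < 4; tx++) {
                BufferedImage tile = img.getSubimage(tx * tileW, ty * tileH, tileW, tileH);
                double r = 0, g = 0, b = 0;
                long pixels = (long) tileW * tileH;
                for (int y = 0; y < tileH; y++) {
                    for (int x = 0; x < tileW; x++) {
                        int argb = tile.getRGB(x, y);
                        r += (argb >> 16) & 0xFF;   // red channel
                        g += (argb >> 8) & 0xFF;    // green channel
                        b += argb & 0xFF;           // blue channel
                    }
                }
                int idx = ty * 4 + tx;
                means[idx][0] = r / pixels;
                means[idx][1] = g / pixels;
                means[idx][2] = b / pixels;
            }
        }
        return means;
    }
}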

35 vector of all split image of particular image. Algorithm used for this encryption technique is Paillier cryptosystem. It uses pair of private and public key of user. Image feature vector of an image is encrypted with the public key of user and is entered into HDFS. Users public key will be available with the database server whereas private and public key will be available with user. This encrypted feature vector is stored into files on HDFS. Each split image will have its feature vector corresponding to it and it will be stored in same row along with encrypted image name Image Retrieval Figure 4.3: Image Retrieval Process This phase of the system provides user,a GUI to search and retrieve images. Here user will upload one image for which he wants to find similar images. When user wants to search images based on particular image, he uploads it to system. Processes included in upload phase are also carried out on this image. Feature vector of each image is calculated from color histogram. All feature vectors of split images are encrypted with the public key of user. These all encrypted feature vectors of queried image are sent to server. This procedure of searching,comparing and retrieving images is also one Map reduce process. In this process, when user uploads image, encrypted feature vectors of queried image i.e. feature vectors of all split image files along with all image feature vectors in HBase table goes to Map process. 23

Map Stage: The map process multiplies all the feature vectors of a particular image and emits the total feature vector of the image. It does this in parallel for all images present in the system: one mapper is created per image, and it emits the total feature vector after combining the 16 split feature vectors.

Reduce Stage: The vectors of all images are collected in the reduce step. Each vector is compared with the feature vector of the queried image. The comparison is done on the basis of similarity measures, calculated from standard formulas such as the Euclidean distance or the Pearson correlation coefficient [11]. In the proposed protocol, the Euclidean distance between the color histograms of the images is calculated. A sketch of how the 16 encrypted split vectors can be combined in the map stage is given below.
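The map-stage combination relies on the Paillier additive property: multiplying the 16 ciphertexts of a given channel yields an encryption of the sum of the corresponding plaintext values. A minimal sketch, assuming each split image's feature value is already a Paillier ciphertext and that the public modulus squared (n^2) is available; names are illustrative.

import java.math.BigInteger;
import java.util.List;

public class CombineEncryptedFeatures {
    /**
     * Homomorphically "adds" the encrypted per-split feature values:
     * the product of the 16 ciphertexts modulo n^2 decrypts to the sum
     * of the 16 plaintext values (Paillier additive property).
     */
    public static BigInteger combine(List<BigInteger> splitCiphertexts, BigInteger nSquare) {
        BigInteger total = BigInteger.ONE;
        for (BigInteger c : splitCiphertexts) {
            total = total.multiply(c).mod(nSquare);
        }
        return total;
    }
}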

Chapter 5 Implementation of System

The graphical user interface (GUI) is the most important part of any application; it allows the user to interact with the system. The success of an application depends on how well it is designed and how simple it is to use. Even though this application is designed in a bottom-up fashion, i.e. first the main Map-Reduce code and lastly the GUI, I will introduce the GUI first, because when any user starts this application, he or she interacts with the GUI only. The complete application, including the GUI, is created in Eclipse [20], an open source integrated development environment (IDE) which helps to develop applications easily. This IDE comes with integrated J2EE or Java EE [21]. J2EE is the Java 2 Enterprise Edition technology; it is used to build enterprise web applications. Java EE includes several API specifications, such as JDBC, Enterprise JavaBeans, Connectors, servlets, Java Server Pages (JSPs) [12] and several web service technologies. This allows developers to create enterprise applications that are portable and scalable, and that integrate with legacy technologies [21]. I have created the GUI with JSPs, which enable Web developers and designers to rapidly develop and easily maintain information-rich, dynamic Web pages. Java code can be integrated with HTML in JSPs. JSP technology separates the user interface from content generation, enabling designers to change the overall page layout without altering the underlying dynamic content [22]. In this application, the JSPs contain all GUI code and the Java code contains all the business logic of the application. All MapReduce processes are run through Java code, as there are specific commands which run a MapReduce process.

This application is designed in such a way that if any process required for a MapReduce program to run is not present, the application automatically restarts all processes in the system and makes sure that all processes are up and running. This is done only for the local deployment of the Hadoop cluster, because when the server is restarted all data will be formatted. In an actual industry deployment this case will not arise, because all processes will always be running.

5.1 Log In Screen

As the title suggests, this application is private content based image information retrieval, so every user has his own user id and password through which he/she logs in to his/her account to upload or retrieve images. The Log In screen lets the user maintain his privacy. This matters because databases are often outsourced by companies, in which case the data of many users can be present on the same server/cluster. This technique ensures that a user can see only his own data and not that of others.

Figure 5.1: LogIn Screen

5.2 Create Account

The Create Account form appears after clicking the Create Account link on the sign-up page. To access the application, the user has to create an account with the web site; it is essentially a sign-up form.

Figure 5.2: Create Account Screen

The user account is created from the user's id. Validations are done on the id field, i.e. if the user enters a wrong id, the form asks for a correct id again. After a valid id is entered, the user account is created along with the security keys. The user is notified of the User ID to log in with. This User ID creates his account, i.e. a folder in HDFS. Login information, i.e. user ids and passwords, is stored in an HBase table, which is also a part of Hadoop; it is a scalable database which allows tables to be created, stores data in them, and can expand at run time.

Figure 5.3: Account Creation Screen - Validation

After successful creation of the account, the user gets a message that the account has been created and is ready to access. The user then goes to the LogIn screen to log in.

5.3 User Operation Screen

After logging in, the user sees a screen with three links: one for logout, a second for uploading images, and a third for retrieving images similar to one particular image from the database.

Figure 5.4: User Operations Screen

5.3.1 Changes in Hadoop

A conventional Hadoop set-up does not allow a process to write output files to an existing output path. A user of this system will not upload files only once; he will use the system any time, anywhere, and it is necessary for the system to store the user's images at one particular location. So whenever the upload MapReduce process runs, all encrypted feature vectors must go to the same folder that belongs to the user, which cannot be achieved by allocating a new output folder on every run. To remove this limitation, I have modified the Hadoop code so that the output of every MapReduce process goes into the user-specific folder and not to any other location.

Modified files: FileOutputFormat.java, MultipleOutputFormat.java, MultipleTextOutputFormat.java

5.3.2 Upload Process

When the user selects the upload images link on the screen, he is directed to a page where he can specify the folder path of the images. All the images present in the folder are uploaded to HDFS first; these images are used in the upload MapReduce. For each upload, a Map-Reduce process runs. It splits an image into 16 small images and calculates the feature vector of every small image, i.e. the color moments of each split image. This feature vector is nothing but the color moments of the image calculated from its color histogram. A color histogram identifies the proportion of pixels within an image holding specific values, which can be used to find the similarity between two images by using similarity distance measures [8], [11]. It tries to identify color proportion by region and is independent of image size, format or orientation [6]. Color moments consist of the first order (mean), second order (variance) and third order (skewness) moments of an image [8]. All moments are encrypted with the public key of the user and are then stored in a file. Whenever the user uploads images to the database, they go into the same folder. Hadoop normally does not support using the same output folder in different MapReduce processes, so I modified Hadoop so that in each upload Map-Reduce the files go into the same folder.

Figure 5.5: Upload Screen

The MapReduce process takes some time to complete; if the user does not get a proper message he might think that the images are already uploaded and might shut down the process. So the user gets a message showing that files are being uploaded. While the MapReduce job is executing, the user sees the following screen until the process completes.

Figure 5.6: Uploading Screen

There is also a chance of repeated file names, so files whose names match files already present in HDFS are renamed at the time of upload. First, all the user's file names are read from HDFS and collected into one string; the file name of the image which is going to be uploaded is matched against this string, and if the string contains the image name, the image is renamed. The program then checks whether the newly generated name is present in the string; if it is, the file is renamed again.

Map Reduce Algorithm

The MapReduce technique is used to create the feature vectors of images. The map splits the input image file into small chunks of fixed size, calculates their feature vectors, encrypts them using the homomorphic encryption technique and stores them in the database. The reducer gets sorted input from the mapper; it combines all the split encrypted feature vectors and emits the image id and the combined list.

Algorithm 5.1 Upload Map Reduce Process Algorithm

Mappers emit encrypted feature vectors keyed by name and featureKey, the execution framework groups vectors by name along with featureKey, and the reducers write names and encrypted feature values to files.

imgs <- new BufferedImage[]
random <- new BigInteger
n, g, r, nSquare <- new BigInteger
dMeanR <- new double
dMeanG <- new double
dMeanB <- new double

1: procedure PrintPixelARGB(int pixel)
2:   dMeanR <- dMeanR + red pixel value
3:   dMeanG <- dMeanG + green pixel value
4:   dMeanB <- dMeanB + blue pixel value

1: procedure Encryption(int meanVal)
2:   return BigInteger encrypted(meanVal)

1: class Mapper
2: procedure Map(ImgId iid, Img im)
3:   m0, m1, m2 <- new BigInteger
4:   imgs <- splitImages(im)
5:   for all img i in imgs[count] do
6:     for all pixel p in imgs[i] do
7:       PrintPixelARGB(p)
8:     m0 <- Encryption(dMeanR)
9:     m1 <- Encryption(dMeanG)
10:    m2 <- Encryption(dMeanB)
11:    Emit(ImgId iid, encrypted_vals <iid_i, m0, m1, m2>)

1: class Reducer
2: procedure Reduce(ImgId iid, encrypted_vals <iid_i, m0, m1, m2>)
3:   iIdL <- new List
4:   for all encrypted_vals <iid_i, m> in encrypted_vals
5:     [<iid_0, m0, m1, m2>, <iid_1, m0, m1, m2>, ...] do
6:     Append(iIdL, <iid_i, m>)
7:   Emit(ImgId iid, iIdL)

The following image shows the working of the MapReduce process in Eclipse, i.e. behind the screen.

Figure 5.7: Process behind Upload

5.3.3 Image Retrieval

When the user clicks the search images link, he is directed to a page where he can specify the file path of an image. This is the image on the basis of which images are to be retrieved from the database. The comparison is done between the features of the images stored in HDFS and the features of the image in the search process. In the upload process, feature vectors are calculated and stored in HDFS; these feature vectors are used for the comparison.

Figure 5.8: Search Screen

When the user selects an image, it passes through several steps. These steps are similar to those of the upload process, i.e. image feature calculation and encryption. The image is first split into 16 parts, just as in the upload phase; this step is taken to keep the loss of pixels equal in both cases. The features are encrypted with the same public key of the user, so at the similarity matching phase the features can be compared consistently. At the upload stage, the features of the split images are stored in HDFS, so at similarity measurement time they must be combined back into one feature vector. Here the Paillier cryptosystem comes into the picture: Paillier allows addition and multiplication operations to be performed on encrypted data without changing the original data. A similarity distance is calculated between the stored image feature vectors and the feature vector of the image to be searched. The similarity distance is simply a mathematical formula which gives the relation between the images; there are different measures which can be used for the calculation [8], [11]. The Euclidean distance measure is used for the calculation of similarity between images. This distance is calculated between the means and provides information about the similarity between two images. This comparison process is done in a MapReduce job to make it parallel. While the MapReduce process is running, the user sees the following screen.

Figure 5.9: Searching Screen

Background processes can be seen in the following screen.

Figure 5.10: Processes behind Search Screen

Map Reduce Algorithm

Algorithm 5.2 Image Retrieval Map Reduce Process Algorithm

Mappers emit encrypted feature vectors keyed by opName and featureId, the execution framework groups vectors by name along with opName and featureId, the combiner gets the encrypted feature vectors of the queried image, calculates the difference between the combined values from the datastore and the vectors of the queried image, also calculates the square, and emits the values. The reducers get the difference values and write names and the square root of the encrypted feature values to files.

upload_img_m1 <- new BigInteger
upload_img_m2 <- new BigInteger
upload_img_m3 <- new BigInteger


More information

The EMSX Platform. A Modular, Scalable, Efficient, Adaptable Platform to Manage Multi-technology Networks. A White Paper.

The EMSX Platform. A Modular, Scalable, Efficient, Adaptable Platform to Manage Multi-technology Networks. A White Paper. The EMSX Platform A Modular, Scalable, Efficient, Adaptable Platform to Manage Multi-technology Networks A White Paper November 2002 Abstract: The EMSX Platform is a set of components that together provide

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BIG DATA TECHNOLOGY. Hadoop Ecosystem BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big

More information

Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Hypertable Architecture Overview

Hypertable Architecture Overview WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for

More information

Feasibility Study of Searchable Image Encryption System of Streaming Service based on Cloud Computing Environment

Feasibility Study of Searchable Image Encryption System of Streaming Service based on Cloud Computing Environment Feasibility Study of Searchable Image Encryption System of Streaming Service based on Cloud Computing Environment JongGeun Jeong, ByungRae Cha, and Jongwon Kim Abstract In this paper, we sketch the idea

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS. PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS Project Project Title Area of Abstract No Specialization 1. Software

More information

Click Stream Data Analysis Using Hadoop

Click Stream Data Analysis Using Hadoop Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

Encryption and Decryption for Secure Communication

Encryption and Decryption for Secure Communication Encryption and Decryption for Secure Communication Charu Rohilla Rahul Kumar Yadav Sugandha Singh Research Scholar, M.TECH CSE Dept. Asst. Prof. IT Dept. Asso. Prof. CSE Dept. PDMCE, B.Garh PDMCE, B.Garh

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Big Systems, Big Data

Big Systems, Big Data Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,

More information

Cisco Data Preparation

Cisco Data Preparation Data Sheet Cisco Data Preparation Unleash your business analysts to develop the insights that drive better business outcomes, sooner, from all your data. As self-service business intelligence (BI) and

More information

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

A programming model in Cloud: MapReduce

A programming model in Cloud: MapReduce A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value

More information

Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis

Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis , 22-24 October, 2014, San Francisco, USA Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis Teng Zhao, Kai Qian, Dan Lo, Minzhe Guo, Prabir Bhattacharya, Wei Chen, and Ying

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Distributed Framework for Data Mining As a Service on Private Cloud

Distributed Framework for Data Mining As a Service on Private Cloud RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

More information

Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Karagpur

Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Karagpur Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Karagpur Lecture No. #06 Cryptanalysis of Classical Ciphers (Refer

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Introduction to Big Data Training

Introduction to Big Data Training Introduction to Big Data Training The quickest way to be introduce with NOSQL/BIG DATA offerings Learn and experience Big Data Solutions including Hadoop HDFS, Map Reduce, NoSQL DBs: Document Based DB

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Large-Scale Test Mining

Large-Scale Test Mining Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Task Scheduling in Hadoop

Task Scheduling in Hadoop Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed

More information

BASHO DATA PLATFORM SIMPLIFIES BIG DATA, IOT, AND HYBRID CLOUD APPS

BASHO DATA PLATFORM SIMPLIFIES BIG DATA, IOT, AND HYBRID CLOUD APPS WHITEPAPER BASHO DATA PLATFORM BASHO DATA PLATFORM SIMPLIFIES BIG DATA, IOT, AND HYBRID CLOUD APPS INTRODUCTION Big Data applications and the Internet of Things (IoT) are changing and often improving our

More information

The big data revolution

The big data revolution The big data revolution Friso van Vollenhoven (Xebia) Enterprise NoSQL Recently, there has been a lot of buzz about the NoSQL movement, a collection of related technologies mostly concerned with storing

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Secure cloud access system using JAR ABSTRACT:

Secure cloud access system using JAR ABSTRACT: Secure cloud access system using JAR ABSTRACT: Cloud computing enables highly scalable services to be easily consumed over the Internet on an as-needed basis. A major feature of the cloud services is that

More information

Project Proposal. Data Storage / Retrieval with Access Control, Security and Pre-Fetching

Project Proposal. Data Storage / Retrieval with Access Control, Security and Pre-Fetching 1 Project Proposal Data Storage / Retrieval with Access Control, Security and Pre- Presented By: Shashank Newadkar Aditya Dev Sarvesh Sharma Advisor: Prof. Ming-Hwa Wang COEN 241 - Cloud Computing Page

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

More information

Fast remote data access for control of TCP/IP network using android Mobile device

Fast remote data access for control of TCP/IP network using android Mobile device RESEARCH ARTICLE OPEN ACCESS Fast remote data access for control of TCP/IP network using android Mobile device Vaibhav Muddebihalkar *, R.M Gaudar** (Department of Computer Engineering, MIT AOE Alandi

More information

Image Search by MapReduce

Image Search by MapReduce Image Search by MapReduce COEN 241 Cloud Computing Term Project Final Report Team #5 Submitted by: Lu Yu Zhe Xu Chengcheng Huang Submitted to: Prof. Ming Hwa Wang 09/01/2015 Preface Currently, there s

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2 Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

More information

Big Data Analytics by Using Hadoop

Big Data Analytics by Using Hadoop Governors State University OPUS Open Portal to University Scholarship All Capstone Projects Student Capstone Projects Spring 2015 Big Data Analytics by Using Hadoop Chaitanya Arava Governors State University

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

http://www.paper.edu.cn

http://www.paper.edu.cn 5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission

More information

A block based storage model for remote online backups in a trust no one environment

A block based storage model for remote online backups in a trust no one environment A block based storage model for remote online backups in a trust no one environment http://www.duplicati.com/ Kenneth Skovhede (author, kenneth@duplicati.com) René Stach (editor, rene@duplicati.com) Abstract

More information

Chapter 17. Transport-Level Security

Chapter 17. Transport-Level Security Chapter 17 Transport-Level Security Web Security Considerations The World Wide Web is fundamentally a client/server application running over the Internet and TCP/IP intranets The following characteristics

More information

MapReduce (in the cloud)

MapReduce (in the cloud) MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:

More information

Journal of Environmental Science, Computer Science and Engineering & Technology

Journal of Environmental Science, Computer Science and Engineering & Technology JECET; March 2015-May 2015; Sec. B; Vol.4.No.2, 202-209. E-ISSN: 2278 179X Journal of Environmental Science, Computer Science and Engineering & Technology An International Peer Review E-3 Journal of Sciences

More information

Final Year Project Interim Report

Final Year Project Interim Report 2013 Final Year Project Interim Report FYP12016 AirCrypt The Secure File Sharing Platform for Everyone Supervisors: Dr. L.C.K. Hui Dr. H.Y. Chung Students: Fong Chun Sing (2010170994) Leung Sui Lun (2010580058)

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Sync Security and Privacy Brief

Sync Security and Privacy Brief Introduction Security and privacy are two of the leading issues for users when transferring important files. Keeping data on-premises makes business and IT leaders feel more secure, but comes with technical

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON BIG DATA MANAGEMENT AND ITS SECURITY PRUTHVIKA S. KADU 1, DR. H. R.

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Big Data Storage, Management and challenges. Ahmed Ali-Eldin Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big

More information

MySQL and Virtualization Guide

MySQL and Virtualization Guide MySQL and Virtualization Guide Abstract This is the MySQL and Virtualization extract from the MySQL Reference Manual. For legal information, see the Legal Notices. For help with using MySQL, please visit

More information

Generic Log Analyzer Using Hadoop Mapreduce Framework

Generic Log Analyzer Using Hadoop Mapreduce Framework Generic Log Analyzer Using Hadoop Mapreduce Framework Milind Bhandare 1, Prof. Kuntal Barua 2, Vikas Nagare 3, Dynaneshwar Ekhande 4, Rahul Pawar 5 1 M.Tech(Appeare), 2 Asst. Prof., LNCT, Indore 3 ME,

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

BlackBerry Enterprise Service 10. Secure Work Space for ios and Android Version: 10.1.1. Security Note

BlackBerry Enterprise Service 10. Secure Work Space for ios and Android Version: 10.1.1. Security Note BlackBerry Enterprise Service 10 Secure Work Space for ios and Android Version: 10.1.1 Security Note Published: 2013-06-21 SWD-20130621110651069 Contents 1 About this guide...4 2 What is BlackBerry Enterprise

More information

A NOVEL APPROACH FOR MULTI-KEYWORD SEARCH WITH ANONYMOUS ID ASSIGNMENT OVER ENCRYPTED CLOUD DATA

A NOVEL APPROACH FOR MULTI-KEYWORD SEARCH WITH ANONYMOUS ID ASSIGNMENT OVER ENCRYPTED CLOUD DATA A NOVEL APPROACH FOR MULTI-KEYWORD SEARCH WITH ANONYMOUS ID ASSIGNMENT OVER ENCRYPTED CLOUD DATA U.Pandi Priya 1, R.Padma Priya 2 1 Research Scholar, Department of Computer Science and Information Technology,

More information

1. Introduction 1.1 Methodology

1. Introduction 1.1 Methodology Table of Contents 1. Introduction 1.1 Methodology 3 1.2 Purpose 4 1.3 Scope 4 1.4 Definitions, Acronyms and Abbreviations 5 1.5 Tools Used 6 1.6 References 7 1.7 Technologies to be used 7 1.8 Overview

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information