Private Content Based Image Information Retrieval using Map-Reduce


Private Content Based Image Information Retrieval using Map-Reduce

A Dissertation submitted in fulfillment for the award of the degree Master of Technology, Computer Engineering

Submitted by
Arpit D. Dongaonkar
MIS No:

Under the guidance of
Prof. Sunil B. Mane
College of Engineering, Pune

DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATION TECHNOLOGY,
COLLEGE OF ENGINEERING, PUNE-5
June, 2013

DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATION TECHNOLOGY, COLLEGE OF ENGINEERING, PUNE

CERTIFICATE

This is to certify that the dissertation entitled Private Content Based Image Information Retrieval using Map-Reduce has been successfully completed by Mr. Arpit Dilip Dongaonkar, MIS No , as a fulfillment of the End Semester evaluation of Master of Technology.

SIGNATURE
Prof. Sunil B. Mane
Project Guide
Dept. of Computer Engineering and Information Technology,
College of Engineering Pune,
Shivajinagar, Pune - 5.

SIGNATURE
Dr. J. V. Aghav
Head
Dept. of Computer Engineering and Information Technology,
College of Engineering Pune,
Shivajinagar, Pune - 5.

Dedicated to

My Mother Sunita D. Dongaonkar,
My Father Prof. Dilip R. Dongaonkar
for their love and everything that they have given me...

and also to

All family members for supporting me...

Acknowledgements

I would like to express my sincere gratitude towards Prof. Sunil B. Mane, my research guide, for his patient guidance, enthusiastic encouragement and useful critiques of this research work. Without his invaluable guidance, this work would never have been successful. I would also like to thank Vikas Jadhav, my classmate, for continuous and detailed knowledge sharing on Hadoop and other aspects, from installation to working. I would also like to thank all my M.Tech friends for their unconditional support and help. Last but not least, my sincere thanks to all my teachers who directly or indirectly helped me to learn new technologies and complete my post graduation successfully.

Arpit Dilip Dongaonkar
College of Engineering, Pune

Abstract

Today, the role of information technology is changing, and this change is leading to real-life software applications: information technology has entered every part of our day-to-day life, from traffic signals and power grids to everyday financial transactions. But this technology comes at a cost. As it touches every layer of life and community, there is a need to make it available to every layer of the community. Cloud computing is an emerging technology which reduces this cost while providing an efficient information technology platform and service. The cloud reduces the load on local disks, operating systems and, eventually, on CPU processing. Cloud providers offer infrastructure, platforms, databases etc. on which you can install your operating system and required software and store your data. This service is billed on a pay-per-hour basis, so any application deployed on the cloud must be fast, efficient and secure. As the Internet expands, the data created over it grows exponentially, and a large part of this data consists of images. Today, a large amount and variety of image data is produced through digital cameras, mobile phones, photo-editing software etc. These images are private to a particular user, and this digital image data should be secured so that no one except the owner can access it. Several domains use content based image retrieval applications; such applications should be fast enough to carry out user functionalities efficiently and secure enough to protect user data. If an application is to be deployed on the cloud, it should be supported by the cloud infrastructure. One technology which is supported on the cloud and performs distributed parallel computing is MapReduce. I am proposing a system which allows a user to upload, search and retrieve some of his personal images, based on one particular query image, securely and efficiently from

his continuously expanding image database. The system incorporates an encryption technique for security and the Map-Reduce technique for efficient upload, search and retrieval of images over a large dataset of user images. The proposed solution is a system to securely upload, search and query images in massive image data storage using a content based image retrieval scheme with the Map-Reduce technique, evaluated and compared against an existing solution.

Keywords - Information technology, Cloud computing, secured personal data.

Contents

List of Tables
List of Figures

1 INTRODUCTION
1.1 Private Information Retrieval
1.2 Image Retrieval
1.2.1 Content Based Image Retrieval
1.3 Similarity Measurement
1.3.1 Euclidean Distance
1.3.2 Mahalanobis Distance
1.3.3 Chord Distance
1.4 Hadoop
1.4.1 MapReduce
1.4.2 HDFS
1.4.3 HBase
1.5 Oblivious Technique
2 Literature Survey
2.1 PIRMAP [1]
2.1.1 PIR
2.1.2 MapReduce in PIRMAP
2.1.3 Working of PIRMAP
2.1.4 Future Scope/Issues
2.2 Map/Reduce in CBIR Application [2]
2.2.1 System Model
2.2.2 Future scope/issues
2.3 An Oblivious Image Retrieval Protocol [3]
2.3.1 System Model
2.3.2 Future Scope/Issues
2.4 Distributed Image Retrieval System Based on MapReduce [4]
2.4.1 System Model
2.4.2 Future Scope/Issues
2.5 Summary
3 Motivation
3.1 Problem Definition
3.2 Scope of Research
3.3 Objectives
3.4 System Requirement Specifications
4 System Design
4.1 Complete Architecture
4.1.1 Upload Images
4.1.2 Image Retrieval
5 Implementation of System
5.1 Log In Screen
5.2 Create Account
5.3 User Operation Screen
5.3.1 Changes in Hadoop
5.3.2 Upload Process
5.3.3 Image Retrieval
6 Experiments, Testing and Result Analysis
6.1 Cluster Configuration
6.2 Experiments and Testing
6.3 Result Analysis
6.4 Error Check in Application
6.5 Comparison of Different Systems
6.6 System Output
6.6.1 Input Images
7 Conclusions and Future Scope
7.1 Conclusion
7.2 Future Scope
A Publication Status

List of Tables

6.1 Cluster Configuration
6.2 Second Cluster Configuration

List of Figures

1.1 Map Reduce Process
4.1 System Architecture
4.2 Upload Process
4.3 Image Retrieval Process
5.1 LogIn Screen
5.2 Create Account Screen
5.3 Account Creation Screen - Validation
5.4 User Operations Screen
5.5 Upload Screen
5.6 Uploading Screen
5.7 Process behind Upload
5.8 Search Screen
5.9 Searching Screen
5.10 Processes behind Search Screen
6.1 One Node Cluster Difference at Upload MapReduce
6.2 One Node Cluster Difference at Retrieval MapReduce
6.3 Three Node Cluster Difference at Upload MapReduce
6.4 Three Node Cluster Difference at Retrieval MapReduce
6.5 Performance of Upload MapReduce at Cluster
6.6 Performance of Retrieval MapReduce at Cluster
6.7 Scale Up between Different Cluster Nodes
6.8 Scale Up between Different Cluster Nodes
6.9 Error Checking in
6.10 Error Check at Upload Images
6.11 Error check at Search Images
6.12 Error check at Search Images
6.13 Input set of Images Uploaded
6.14 Input Search Image
6.15 Output of System

Chapter 1 INTRODUCTION

1.1 Private Information Retrieval

Today, most of the data on the Internet lies in database servers connected to the Internet. When a user makes a query to a database, he/she should get the required information. There are, however, many curious users who query a database, intentionally or unintentionally, to obtain the private information of other users. A lot of research has been devoted to protecting databases from this type of intrusion. Private information retrieval is a research area where security against this kind of intrusion is considered. A private information retrieval (PIR) protocol allows a user to retrieve an item from a server in possession of a database without revealing which item he/she is retrieving. Over time, different improvements to PIR have been proposed which make PIR faster, more efficient and more secure. Techniques such as oblivious transfer have also been implemented in PIR; oblivious transfer restricts the user from getting information about other database items [3]. PIR also maintains the privacy of queries against the database.

1.2 Image Retrieval

An image retrieval system is designed to browse, search and retrieve images from a large database of digital images. Most traditional and common methods of image retrieval

utilize some method of adding metadata such as captions, keywords, or descriptions to the images so that retrieval can be performed over the annotation words. Manual image annotation is time-consuming, laborious and expensive; to address this, there has been a large amount of research on automatic image annotation [2]. Annotated images are searched based on the images' metadata.

Image meta search: search of images based on associated metadata such as keywords, text, etc.

Content-based image retrieval (CBIR): the application of computer vision to image retrieval. CBIR aims at avoiding the use of textual descriptions and instead retrieves images based on similarities in their contents (textures, colors, shapes etc.) to a user-supplied query image or user-specified image features.

1.2.1 Content Based Image Retrieval

Image retrieval has been an active research area since the 1970s. In the beginning, research concentrated on text based search only. It was a new framework at that time, which used the names of image files as the search criterion. In this framework, images needed to be annotated first and were then retrieved from a database management system [5]. This framework had two limitations: first, the size of image data that can fit into a database, and second, the extensive manual annotation work. There was a need for research on retrieval without the limitations of manual entry and large data size, and this led to the Content Based Image Retrieval technique. Content-based image retrieval (CBIR) is also known as query by image content (QBIC) [6]. CBIR is a technique in which the content of an image is used as the matching criterion instead of the image's metadata such as keywords, tags, or any name associated with the image. This provides a closer approximate match compared to text based image retrieval [7], [8], [9], [10].

The term content in this context might refer to colors, shapes, textures, or any other information that can be derived from the image itself [6], [8], [10]. Image content can be categorized into visual and semantic content. Visual content can be very general or domain specific. General visual content includes color, texture, shape, spatial relationship, etc. Domain specific visual content, like human faces, is application dependent and may involve domain knowledge. Semantic content is obtained either by textual annotation or by complex inference procedures based on visual content [8].

COLOR: Image retrieval based on color actually means retrieval on color descriptors. The most commonly used color descriptors are the color histogram, color coherence vector, color correlogram, and color moments [9]. A color histogram identifies the proportion of pixels within an image holding specific values, which can be used to find the similarity between two images by using similarity distance measures [8], [11]. It tries to identify color proportion by region and is independent of image size, format or orientation [6]. Color moments consist of the first order (mean), second order (variance) and third order (skewness) moments of an image [8].

TEXTURE: The texture of an image refers to the visual patterns that the image possesses and how they are spatially defined. Textures are represented by texels, which are then placed into a number of sets, depending on how many textures are detected in the image. These sets not only define the texture, but also where in the image the texture is located. Statistical methods such as co-occurrence matrices can be used for a quantitative measure of the arrangement of intensities in a region [8], [12].

SHAPE: Shape here does not mean the shape of the whole image, but the shape of a particular region or object. Segmentation and edge detection are prominent techniques that can be used in shape detection [6].

There are several components calculated from image content, such as color intensity, entropy, mean of the image etc., which are useful in the creation of a feature vector. The feature vector of every image is calculated and stored in a database. When the user wants to retrieve a set of images, he queries the system through an image. A rough sketch of the color moment computation is given below.
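The following is a rough, self-contained Java sketch of the color moment idea for one channel (first order mean, second order variance, third order skewness over the pixel values of that channel); the class and method names are illustrative and not taken from any cited system.

public final class ColorMoments {
    // First three color moments of one channel (e.g. the red values of all pixels).
    public static double[] colorMoments(int[] channel) {
        int n = channel.length;
        double mean = 0;
        for (int v : channel) mean += v;
        mean /= n;

        double variance = 0, skewness = 0;
        for (int v : channel) {
            double d = v - mean;
            variance += d * d;
            skewness += d * d * d;
        }
        variance /= n;
        double sigma = Math.sqrt(variance);
        // Third standardized moment; defined as 0 when the channel is completely uniform.
        skewness = sigma == 0 ? 0 : (skewness / n) / (sigma * sigma * sigma);

        return new double[] { mean, variance, skewness };
    }
}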

The feature vector of the query image is matched with the vectors of the stored images, and images whose vectors are similar to that of the queried image are retrieved. The user then selects the required image from the set of shown images.

1.3 Similarity Measurement

A similarity measurement is selected to find how similar two vectors are. The problem can be converted to computing the discrepancy between two vectors $x, y \in \mathbb{R}^d$. Three distance measurements, the Euclidean, Mahalanobis, and chord distances, are reviewed as follows [11]. A small sketch of the Euclidean and chord distances in code is given at the end of this section.

1.3.1 Euclidean Distance

The Euclidean distance between $x, y \in \mathbb{R}^d$ is computed by [11]

$\delta_1(x, y) = \|x - y\|_2 = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2}$

1.3.2 Mahalanobis Distance

The Mahalanobis distance between two vectors x and y with respect to the training patterns $x_i$ is computed by [11]

$\delta_2(x, y)^2 = (x - y)^t S^{-1} (x - y)$

where the mean vector u and the sample covariance matrix S of the sample $\{x_i \mid 1 \le i \le n\}$ of size n are computed by [11]

$S = \frac{1}{n} \sum_{i=1}^{n} (x_i - u)(x_i - u)^t$ with $u = \frac{1}{n} \sum_{i=1}^{n} x_i$

1.3.3 Chord Distance

The chord distance between two vectors x and y measures the distance between the projections of x and y onto the unit sphere, and can be computed by [11]

$\delta_3(x, y) = \left\| \frac{x}{r} - \frac{y}{s} \right\|_2$, where $r = \|x\|_2$ and $s = \|y\|_2$
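Two of these measures translate directly into code. A minimal Java sketch, assuming plain double[] feature vectors of equal length (class and method names are illustrative):

public final class Distances {
    // Euclidean distance: delta_1(x, y) = sqrt(sum_j (x_j - y_j)^2)
    public static double euclidean(double[] x, double[] y) {
        double sum = 0;
        for (int j = 0; j < x.length; j++) {
            double diff = x[j] - y[j];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Chord distance: delta_3(x, y) = || x/||x||_2 - y/||y||_2 ||_2
    public static double chord(double[] x, double[] y) {
        double r = 0, s = 0;
        for (int j = 0; j < x.length; j++) { r += x[j] * x[j]; s += y[j] * y[j]; }
        r = Math.sqrt(r);
        s = Math.sqrt(s);
        double sum = 0;
        for (int j = 0; j < x.length; j++) {
            double diff = x[j] / r - y[j] / s;
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}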

1.4 Hadoop

Apache Hadoop is a collection of open-source software projects for reliable, scalable, distributed computing. The Hadoop software libraries provide a framework that allows distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from a single machine to thousands of machines, each offering local computation and storage [13]. Some modules of Hadoop are listed below [13]:

1. Hadoop Common: the common utilities that support the other Hadoop modules.
2. Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
3. Hadoop YARN: a framework for job scheduling and cluster resource management.
4. Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
5. HBase: a scalable, distributed database that supports structured data storage for large tables.

1.4.1 MapReduce

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of commodity computers (nodes). The computation performed on the nodes is independent of the physical location of the nodes, i.e. whether all nodes are on the same network and use similar hardware (called a cluster) or the nodes are shared across geographically and administratively distributed systems and use more heterogeneous hardware (called a grid). It is also independent of the location of the data; data can be stored in the cluster at any node. The computation takes a set of input key/value pairs and produces a set of output key/value pairs.

The user of the MapReduce library expresses the computation as two functions: map and reduce. The map function, written by the programmer, takes an input key/value pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all values associated with the same key and passes them to the reduce function. The reduce function, also written by the programmer, accepts a key together with the list of values generated by the map function for that key. It merges these values to form a possibly smaller set of values; typically just zero or one output value is produced per key, which is the actual output of the user program.

map (k1, v1) -> list(k2, v2)
reduce (k2, list(v2)) -> list(k3, v3)

Computational processing can occur on data stored either in a file system (unstructured, HDFS) or in a database (structured, HBase).

Map: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node [14].

Reduce: The master node then collects the answers to all the sub-problems and combines them in some way to form the output: the answer to the problem it was originally trying to solve [14].

MapReduce allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of reducers can perform the reduce phase provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative.

Figure 1.1: Map Reduce Process.

Example [15]: The map function emits each word plus an associated count of occurrences (just 1 in this simple example). The reduce function sums together all counts emitted for a particular word.

map(String key, String value):
  // key: text file name, value: text file contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word, values: a list of counts for that word
  int total_count = 0;
  for each v in values:
    total_count += ParseInt(v);
  Emit(AsString(total_count));
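For reference, the same word count expressed against the Hadoop Java MapReduce API might look roughly like the sketch below; it assumes the standard org.apache.hadoop.mapreduce Mapper and Reducer classes, and the job wiring (driver class, input/output paths) is omitted.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for every input line, emit (word, 1) pairs.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);              // intermediate (word, 1)
        }
    }
}

// Reducer: sum all counts emitted for the same word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable v : values) {
            total += v.get();
        }
        context.write(key, new IntWritable(total)); // final (word, total count)
    }
}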

1.4.2 HDFS

The Hadoop Distributed File System (HDFS) [16] is another module of the Hadoop project. It is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and works on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets [16].

1.4.3 HBase

HBase works on top of HDFS. It is a distributed, column-oriented database built on top of HDFS [17]. It is the Hadoop application to use when users require real-time read/write random access to very large datasets. HBase can scale linearly just by adding nodes. It is not relational and does not support SQL, but given the proper problem space, it is able to do what an RDBMS cannot: host very large, sparsely populated tables on clusters made from commodity hardware [17]. A minimal sketch of how a system like this one might store an image's encrypted feature vector in an HBase table is given below.
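As an illustration only, storing an image's encrypted feature vector as one row of an HBase table could look like the following sketch with the standard HBase Java client; the table name "images", column family "feat" and row-key scheme are hypothetical and not taken from the implementation described later.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FeatureStoreSketch {
    public static void storeFeature(String imageId, byte[] encryptedFeature) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("images"))) { // hypothetical table
            // Row key = image id; one column of family "feat" holds the encrypted vector.
            Put put = new Put(Bytes.toBytes(imageId));
            put.addColumn(Bytes.toBytes("feat"), Bytes.toBytes("vector"), encryptedFeature);
            table.put(put);
        }
    }
}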

1.5 Oblivious Technique

The oblivious retrieval technique is designed to achieve user and database privacy. It relies on a homomorphic encryption technique, which allows addition and multiplication operations to be performed on encrypted data without changing the original plaintext data. In this application, homomorphic encryption is applied to image feature data. Today, many website firms outsource their data to external storage servers. In this case, it is necessary to maintain the privacy of the user with respect to the external server, and vice versa. The oblivious technique is one attempt to maintain this type of security: privacy of the user with respect to the server, as well as privacy of the server with respect to the user, is maintained with the use of homomorphic encryption. Homomorphic encryption is an asymmetric cryptographic technique; it works with a public and private key pair. It allows specific types of computations to be carried out on ciphertext and yields an encrypted result which, when decrypted, matches the result of the operations performed on the plaintext. For instance, one person could add two encrypted numbers and then another person could decrypt the result, without either of them being able to find the values of the individual numbers. Using the public key of the user, the feature vector is stored in encrypted form in the database. When the user issues a query, it goes to the database in encrypted form, and images are searched and retrieved on the basis of similar encrypted feature vectors. The encrypted image data is sent back to the user, who decrypts it using his private key. The cryptographic technique used in this system is the Paillier cryptosystem. In the Paillier cryptosystem, the homomorphic property is supported as follows.

Homomorphic addition of plaintexts: the product of two ciphertexts decrypts to the sum of their corresponding plaintexts,

$D(E(m_1, r_1) \cdot E(m_2, r_2) \bmod n^2) = (m_1 + m_2) \bmod n$

Homomorphic encryption algorithms:
1. Simple RSA algorithm
2. ElGamal cryptosystem
3. Paillier cryptosystem

A small sketch of Paillier's additive property is given after this list.
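The following is a minimal, self-contained Java sketch of the Paillier cryptosystem and its additive property, using the textbook simplification g = n + 1; the key size, randomness handling and class names are illustrative only, and this is not the implementation used in the system.

import java.math.BigInteger;
import java.security.SecureRandom;

public class PaillierSketch {
    private static final SecureRandom RNG = new SecureRandom();
    private final BigInteger n, nSquare, g, lambda, mu;

    public PaillierSketch(int bits) {
        BigInteger p = BigInteger.probablePrime(bits / 2, RNG);
        BigInteger q = BigInteger.probablePrime(bits / 2, RNG);
        n = p.multiply(q);
        nSquare = n.multiply(n);
        g = n.add(BigInteger.ONE);                     // simplified variant: g = n + 1
        // phi(n) is used here; lcm(p - 1, q - 1) also works.
        lambda = p.subtract(BigInteger.ONE).multiply(q.subtract(BigInteger.ONE));
        mu = L(g.modPow(lambda, nSquare)).modInverse(n);
    }

    private BigInteger L(BigInteger u) {               // L(u) = (u - 1) / n
        return u.subtract(BigInteger.ONE).divide(n);
    }

    public BigInteger encrypt(BigInteger m) {          // E(m) = g^m * r^n mod n^2
        BigInteger r = new BigInteger(n.bitLength(), RNG)
                .mod(n.subtract(BigInteger.ONE)).add(BigInteger.ONE);
        return g.modPow(m, nSquare).multiply(r.modPow(n, nSquare)).mod(nSquare);
    }

    public BigInteger decrypt(BigInteger c) {          // D(c) = L(c^lambda mod n^2) * mu mod n
        return L(c.modPow(lambda, nSquare)).multiply(mu).mod(n);
    }

    public static void main(String[] args) {
        PaillierSketch ps = new PaillierSketch(512);
        BigInteger c1 = ps.encrypt(BigInteger.valueOf(17));
        BigInteger c2 = ps.encrypt(BigInteger.valueOf(25));
        // Multiplying ciphertexts corresponds to adding the underlying plaintexts.
        BigInteger sum = ps.decrypt(c1.multiply(c2).mod(ps.nSquare));
        System.out.println(sum);                        // prints 42
    }
}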

Chapter 2 Literature Survey

2.1 PIRMAP [1]

Private Information Retrieval (PIR) allows for retrieval of bits from a database in a way that hides a user's access pattern from the server. This paper presents PIRMAP, a practical, highly efficient protocol for PIR in MapReduce, a widely supported cloud computing API. PIRMAP focuses especially on the retrieval of large files from the cloud, where it achieves optimal communication complexity (O(l) for retrieval of an l-bit file) with query times significantly faster than previous schemes.

2.1.1 PIR

When a user connects to a cloud, there is a series of interactions between the user and the server. The user connects to the cloud to retrieve his private information which is stored there. The information can be in the form of different files. The user wants to retrieve x files of his choice out of his n files, and the server should not learn which files he chose. The user therefore has to provide a query to the cloud such that he receives the required files and the transaction remains secure.

2.1.2 MapReduce in PIRMAP

This information retrieval technique uses MapReduce to parallelize the process, which reduces the computational time to retrieve a file from the cloud and makes it very easy to use. The first phase is called the Map phase. MapReduce automatically splits input computations equally among the available nodes in the cloud data center, and each node then runs a function called map on its respective piece (called an InputSplit). It is important to note that the splitting actually occurs when the data is uploaded into the cloud. This means that each mapper node has local access to its InputSplit as soon as computation is started, and a lengthy copying and distributing period is avoided. The map function runs a user defined computation on each InputSplit and outputs (emits) a number of key-value pairs that go into the next phase. The second phase, Reduce, takes as input all of the key-value pairs emitted by the mappers and sends them to reducer nodes in the data center. Specifically, each reducer node receives a single key, along with the sequence of values emitted by the mappers which share that key. The reducers then take each set and combine it in some way, emitting a single value for each key [1].

2.1.3 Working of PIRMAP

PIRMAP is an extension of the PIR protocol by Kushilevitz and Ostrovsky, targeting the retrieval of large files in a parallelization-aggregation computation framework such as MapReduce. We start by giving an overview of PIRMAP, which can be used with any additively homomorphic encryption scheme.

Upload: In the following, we assume that the cloud user has already uploaded its files into the cloud using the interface provided by the cloud provider.

Query: In keeping with standard PIR notation, the data set holds n files, each of which

is l bits in length. There is also an additional parameter k, which is the block size of the chosen cipher. For ease of presentation, we consider the case where all files are the same length, but PIRMAP can easily be extended to accommodate variable length files by padding or prepending each file with a few bytes that specify its length.

2.1.4 Future Scope/Issues

This implementation is specifically for files which contain textual data; it does not support multimedia data such as image, video or audio files.

2.2 Map/Reduce in CBIR Application [2]

In the beginning, image retrieval depended mainly on text based retrieval of an image. This technology is widely used; the current mainstream search engines Google, Baidu, Yahoo, etc. mainly use this method to search images. In this technique, the name of the image file is used to compare and retrieve the image from the database. But it has drawbacks: researchers often need to manually mark all images with text, and this mark text cannot objectively and accurately describe the visual information of the images. After the 1990s, Content-Based Image Retrieval appeared (referred to below as CBIR); unlike TBIR, it can extract visual features from the image automatically and then retrieve images by their visual features. Since CBIR is intuitive and efficient, and can be widely used in information retrieval, medical diagnosis, trademark and intellectual property protection, crime prevention and other areas, it has very high applied value.

2.2.1 System Model

Selecting an Algorithm

The technique used in this paper uses color feature-based image retrieval. This mainly involves the following algorithms: color histogram, hue histogram, color moments, color entropy etc. [15]

Taking into account that the color histogram algorithm is widely used and that its feature extraction and similarity matching are easier, this method is used as the target color algorithm.

Color Histogram

The color histogram of an image gives extensive information about its structure. It gives the number of pixels and their colors in RGB format etc., so it is easy to calculate the mean, entropy and median of an image, which can be used as features in feature extraction. These features can be compared with the queried image's features to retrieve all similar images from the database.

Feature Calculation and Similarity Matching

The feature vector of each uploaded image is stored in the database. This feature vector is matched with the feature vector of the input image. Both feature vectors are used to calculate a similarity coefficient, the Pearson correlation coefficient. MapReduce is used in the similarity matching phase only, to provide fast results. The feature vector of the input image is calculated at query time and matched against the images from the image library, and the matching images are retrieved.

2.2.2 Future scope/issues

The MapReduce technique is used at the similarity matching stage of the system. It could also be used efficiently in the feature extraction process, by splitting the image into parts at the map stage and then combining them at the reduce stage. A small sketch of the Pearson correlation computation mentioned above follows.
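For reference, the Pearson correlation coefficient between two equal-length feature vectors (for example, two color histograms) can be computed as in this small Java sketch; the class and method names are illustrative and not taken from the paper.

public final class Pearson {
    // Pearson correlation coefficient r between two equal-length feature vectors.
    // r lies in [-1, 1]; values close to 1 indicate very similar histograms.
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }
}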

2.3 An Oblivious Image Retrieval Protocol [3]

Today many website providers have large collections of images. For example, google.com and facebook.com have large databases of images where each user uploads his private data to the data center of the website. This data should be protected from external or internal threats. The threat may be stealing of image data, modification of the data, deletion of the private data of an individual, or misuse of it for a particular purpose. As the size of this data increases day by day, companies find it hard to manage and secure such large data. Their data center servers are getting overloaded due to the huge data volume. To reduce the strain on their storage servers they look for another option: using external storage servers. These external storage servers are maintained and managed by other companies, i.e. the data is outsourced to other companies to reduce the overhead on internal servers. In this case it becomes very important to protect and secure a customer's data on the external outsourced database servers. The Oblivious Image Retrieval Protocol comes into the picture for outsourced image databases. It is a technique to securely query an image database for the required image data and retrieve the matched images from the database. The retrieved data should also be securely transferred to the user.

2.3.1 System Model

The system assumes that the feature vectors of all the images are already in the database. This feature data is stored in encrypted form in the database, and all query operations are done on the encrypted data. The protocol mainly contains two parts: a privacy preserving querying mechanism and oblivious transfer of the decryption keys. The privacy preserving protocol of the paper works as follows. First, to query an image set, the user generates an encryption of the query feature vector set using a homomorphic public-key encryption technique and sends it to the database server. The query feature vector is distorted by the user with a constant random vector to prevent any statistical inference by the database server. Second, the database server uses the encrypted query feature vector and performs a homomorphic subtraction with a random feature vector. The server performs a subtraction of the database image features with the same random feature vector. Before sending the results back to the user, the database server performs a permutation of the subtracted feature vectors so that the user is not able to learn the relative indexing structure of the database images. The server includes pseudo-identifiers

for the permuted images to allow the user to identify his choices. Third, upon receiving the server response, the user removes the distorting random constant from the server response and calculates the Euclidean norm of the numerical difference between the query image feature vector and the database image feature vectors [3].

2.3.2 Future Scope/Issues

The oblivious image retrieval technique can be implemented using the MapReduce technique; MapReduce can be used at the content based image retrieval stage.

2.4 Distributed Image Retrieval System Based on MapReduce [4]

A Distributed Image Retrieval System (DIRS) is a system in which images are retrieved in a content based way, and the retrieval over massive image data storage is carried out in parallel by utilizing the MapReduce distributed computing model.

2.4.1 System Model

In this system, images are uploaded to HDFS in one MapReduce process. This MapReduce process extracts features of the images using the LIRE library and stores the images and their feature vectors in an HBase table. Images are also retrieved in parallel. In the feature matching process, the CBIR system calculates the similarity between the sample image and the target images, and then returns those images matching the sample image most closely. Before the MapReduce job is started, the sample image is added to the Distributed Cache to enable every map task to access it (a sketch of this step is given below). Each map task reads the image from the Distributed Cache, extracts its visual features, and then compares the feature vectors with the feature vectors of the target images from HBase. All matched images are returned to the user.
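A rough illustration of that Distributed Cache step with the current Hadoop Java API is sketched below; the file path and job setup are hypothetical, and the system described in the paper may have used the older DistributedCache class instead.

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SampleImageCacheSketch {
    public static Job buildJob(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "image-retrieval");
        // Make the sample (query) image available locally to every map task.
        job.addCacheFile(new URI("hdfs:///user/demo/query.jpg#query.jpg")); // hypothetical path
        return job;
    }

    public static class RetrievalMapper extends Mapper<Object, Object, Object, Object> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // The cached file is visible to the task under its link name.
            URI[] cached = context.getCacheFiles();
            String queryImagePath = cached[0].getPath();
            // ... load the query image and extract its feature vector here ...
        }
    }
}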

2.4.2 Future Scope/Issues

The distributed image retrieval technique needs a secure technique to protect the data.

2.5 Summary

After the literature survey, I found that each paper has some future scope. I want to combine all three future scopes into one application. This application will upload, search and retrieve image files from a distributed database for a particular user and will do so in a secure manner, thus maintaining user and data privacy. This will be carried out using the MapReduce technology, which will optimize the time complexity of retrieval.

Chapter 3 Motivation

With the explosive growth of digital media data, there is a huge demand for new tools and systems that enable the average user to search, access, process, manage, author and share digital media content more efficiently and more effectively. This shifts people's interests from plain work to work combined with entertainment. Examples: business phones to smartphones, watching movies/TV, listening to music, social media to connect and share. There are different types of multimedia content present today. A multimedia information system can store and retrieve text, 2D grey-scale and color images, 1D time series, digitized voice or music, and video, and there is a need for an efficient, robust and secure solution to retrieve this kind of multimedia information. We consider image retrieval rather than audio or video information retrieval because today image retrieval systems have many applications. Image retrieval techniques have gone through several changes and are becoming more and more advanced over time. In the early years of image retrieval, images were searched on a text basis, where each image had to be annotated with particular text. This technique was troublesome because of its complexity. Today the content based image retrieval technique provides an extensive search

facility over the database, which retrieves all related images that are similar in content to the queried image. CBIR technology has been used in several applications such as fingerprint identification, biodiversity information systems, digital libraries, crime prevention, medicine and historical research, among others; CBIR thus covers a wide range of different domains. Today many websites like google.com, facebook.com etc. have a lot of images on their servers. Many users access these websites regularly; they upload, search and download images from them. There is a need for a fast and secure technique to upload, search and retrieve these images on user demand.

3.1 Problem Definition

The purpose of this research work is to create an efficient private content based image information upload, search and retrieval system using MapReduce. The system will be created from easily available APIs which have cloud support, so that the system can easily be ported to the cloud for use.

3.2 Scope of Research

The scope of this research is not limited to one particular domain; it spans multiple domains. Domains in which extensive research can be done are Private Information Retrieval, Content Based Image Retrieval and oblivious transfer, i.e. security. Research papers in this area suggest content based image retrieval techniques over cloud computing as well as secure transfer techniques for the images, but not a single paper talks about fast (cloud-based) as well as secure retrieval of images from servers using the Private Information Retrieval technique. This system can evolve in many ways: by changing the CBIR algorithm, by using different encryption techniques, or by using different storage methods over HDFS, this system can be moved to a new, more efficient and secure level.

3.3 Objectives

1. To implement CBIR in Map-Reduce
2. To select and implement a homomorphic encryption technique for secure retrieval
3. To evaluate the proposed solution

3.4 System Requirement Specifications

Hardware: minimum one desktop system with an Intel or AMD processor and 2 GB of RAM. Newer hardware will provide better results, i.e. a cluster of 5 nodes having Intel Core series processors (i3, i5, i7) with high clock frequency and 4 GB RAM will definitely perform better than a 5 node cluster having an older processor generation working at a lower frequency with less RAM. The cluster should have nodes with the same configuration; if it has nodes with different RAM capacity or processor frequency, the cluster will not perform to capacity.

Software:
1. Operating System: any open source Linux operating system.
2. Coding language: Hadoop MapReduce technology/software, Java language (JDK)
3. IDE: Eclipse environment

Chapter 4 System Design

4.1 Complete Architecture

Figure 4.1 shows the complete architecture of the system.

Figure 4.1: System Architecture

It provides a web based interface to receive the user's requests. After receiving a request, the system performs the specific operation on the images, whether it is upload or retrieval. The core of this system is divided into two phases: one phase uploads images, and the other phase searches for and retrieves similar images. Both modules are deployed on a Hadoop cluster, which follows the MapReduce paradigm. The upload process reads input data from HDFS, processes it using map and reduce functions, and then writes the output results to HDFS again. Similarly, the image retrieval process uses map and reduce functions to retrieve similar images depending upon the queried image. These images are given to the user on the local file system.

4.1.1 Upload Images

Upload Images is an independent process where the user can upload his image or photo information to the system's database. The process of uploading images is carried out with the help of the Map/Reduce technique, which parallelizes the upload. The upload has three sub-processes: first the splitting of an image, second feature extraction, and third encryption and storage. The upload process is parallelized with the help of the Map-Reduce technique for each image. When the user uploads his private photos or images to the database, he might want to upload one image or many images at a single point of time. If the user wants to upload many images at once, he only has to provide the path of the image folder where the images are stored, and the map stage will parallelize the upload process instead of uploading one image at a time. The architecture of the upload process can be seen in figure 4.2. For better understanding, the explanation of the upload process is split into three distinct phases, corresponding to image splitting, image feature extraction, and encryption.

Phase 1: An image which is going to be uploaded is first split into 16 small images. This step is taken to disperse the data across nodes so that an attacker is not able to reconstruct the data from one location.

Figure 4.2: Upload Process

The Map/Reduce technique works on its distributed file system, called the Hadoop Distributed File System (HDFS) [18], [19], [13], [16], which is specifically built for parallelization. All image files are stored in the file system. The information about all split images of every image is stored in an HBase [17] table. HBase works on top of HDFS. The table in HBase stores the path of the image and the other features of that image.

Phase 2: These images are given to the image processing part. In this phase, the color histogram of each image is calculated. This histogram provides different feature values such as entropy, intensity, mean, median etc. These values are used for the similarity calculation [11] of images, i.e. the distance between the queried image and the images from HDFS.

Phase 3: The calculated features are given to the homomorphic encryption system, which stores the vectors in an HBase table in encrypted form. The homomorphic encryption system allows additive and multiplicative operations to be carried out on encrypted data without changing the original data, i.e. multiplication of all 16 encrypted feature vectors of a particular image leads to the summation of the feature vector of that image in plaintext form. The advantage of this comes into the picture at the time of retrieval: when the user queries the system, the encrypted feature vector of the queried image is used against the total feature vector of all split images of a particular image. The algorithm used for this encryption is the Paillier cryptosystem. It uses a pair of private and public keys of the user. The image feature vector of an image is encrypted with the public key of the user and entered into HDFS. The user's public key is available to the database server, whereas the private and public keys are available to the user. The encrypted feature vectors are stored in files on HDFS. Each split image has its corresponding feature vector, and it is stored in the same row along with the encrypted image name. A sketch of the splitting and per-tile feature computation in these phases is given below.
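A minimal Java sketch of Phase 1 and Phase 2 for a single image, splitting it into a 4 x 4 grid of 16 tiles and computing the mean R, G and B value of each tile as a simple feature; class and method names are illustrative, and the actual system additionally encrypts each value with the user's Paillier public key.

import java.awt.image.BufferedImage;

public class SplitAndFeatureSketch {
    /** Returns a 16 x 3 array: for each of the 16 tiles, the mean R, G and B values. */
    public static double[][] tileMeans(BufferedImage img) {
        int tileW = img.getWidth() / 4;
        int tileH = img.getHeight() / 4;
        double[][] means = new double[16][3];
        for (int ty = 0; ty < 4; ty++) {
            for (int tx = 0; tx < 4; tx++) {
                BufferedImage tile = img.getSubimage(tx * tileW, ty * tileH, tileW, tileH);
                double r = 0, g = 0, b = 0;
                long pixels = (long) tileW * tileH;
                for (int y = 0; y < tileH; y++) {
                    for (int x = 0; x < tileW; x++) {
                        int argb = tile.getRGB(x, y);
                        r += (argb >> 16) & 0xFF;   // red channel
                        g += (argb >> 8) & 0xFF;    // green channel
                        b += argb & 0xFF;           // blue channel
                    }
                }
                int idx = ty * 4 + tx;
                means[idx][0] = r / pixels;
                means[idx][1] = g / pixels;
                means[idx][2] = b / pixels;
            }
        }
        return means;
    }
}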

35 vector of all split image of particular image. Algorithm used for this encryption technique is Paillier cryptosystem. It uses pair of private and public key of user. Image feature vector of an image is encrypted with the public key of user and is entered into HDFS. Users public key will be available with the database server whereas private and public key will be available with user. This encrypted feature vector is stored into files on HDFS. Each split image will have its feature vector corresponding to it and it will be stored in same row along with encrypted image name Image Retrieval Figure 4.3: Image Retrieval Process This phase of the system provides user,a GUI to search and retrieve images. Here user will upload one image for which he wants to find similar images. When user wants to search images based on particular image, he uploads it to system. Processes included in upload phase are also carried out on this image. Feature vector of each image is calculated from color histogram. All feature vectors of split images are encrypted with the public key of user. These all encrypted feature vectors of queried image are sent to server. This procedure of searching,comparing and retrieving images is also one Map reduce process. In this process, when user uploads image, encrypted feature vectors of queried image i.e. feature vectors of all split image files along with all image feature vectors in HBase table goes to Map process. 23

Map Stage: The map process multiplies all the feature vectors of a particular image and emits the total feature vector of the image. It does this in parallel for all images present in the system: one mapper is created per image, and it emits the total feature vector after combining the 16 split feature vectors.

Reduce Stage: The vectors of all images are collected in the reduce step. Each vector is compared with the feature vector of the queried image. The comparison is done on the basis of similarity measures, calculated from standard formulas such as the Euclidean distance or the Pearson correlation coefficient [11]. In the proposed protocol, the Euclidean distance between the color histograms of the images is calculated. A sketch of how the 16 encrypted split vectors can be combined in the map stage is given below.
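The map-stage combination relies on the Paillier additive property: multiplying the 16 ciphertexts of a given channel yields an encryption of the sum of the corresponding plaintext values. A minimal sketch, assuming each split image's feature value is already a Paillier ciphertext and that the public modulus squared (n^2) is available; names are illustrative.

import java.math.BigInteger;
import java.util.List;

public class CombineEncryptedFeatures {
    /**
     * Homomorphically "adds" the encrypted per-split feature values:
     * the product of the 16 ciphertexts modulo n^2 decrypts to the sum
     * of the 16 plaintext values (Paillier additive property).
     */
    public static BigInteger combine(List<BigInteger> splitCiphertexts, BigInteger nSquare) {
        BigInteger total = BigInteger.ONE;
        for (BigInteger c : splitCiphertexts) {
            total = total.multiply(c).mod(nSquare);
        }
        return total;
    }
}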

Chapter 5 Implementation of System

The graphical user interface (GUI) is the most important part of any application; it allows the user to interact with the system. The success of an application depends on how well it is designed and how simple it is to use. Even though this application is designed in a bottom-up fashion, i.e. first the main Map-Reduce code and lastly the GUI, I will introduce the GUI first, because when any user starts this application, he or she interacts with the GUI only. The complete application, including the GUI, is created in Eclipse [20], an open source integrated development environment (IDE) which helps to develop applications easily. This IDE comes with integrated J2EE or Java EE [21]. J2EE is the Java 2 Enterprise Edition technology; it is used to build enterprise web applications. Java EE includes several API specifications, such as JDBC, Enterprise JavaBeans, Connectors, servlets, Java Server Pages (JSPs) [12] and several web service technologies. This allows developers to create enterprise applications that are portable and scalable, and that integrate with legacy technologies [21]. I have created the GUI with JSPs, which enable Web developers and designers to rapidly develop and easily maintain information-rich, dynamic Web pages. Java code can be integrated with HTML in JSPs. JSP technology separates the user interface from content generation, enabling designers to change the overall page layout without altering the underlying dynamic content [22]. In this application, the JSPs contain all GUI code and the Java code contains all the business logic of the application. All MapReduce processes are run through Java code, as there are specific commands which run a MapReduce process.

This application is designed in such a way that if any process required for a MapReduce program to run is not present, the application automatically restarts all processes in the system and makes sure that all processes are up and running. This is done only for the local deployment of the Hadoop cluster, because when the server is restarted all data will be formatted. In an actual industry deployment this case will not arise, because all processes will always be running.

5.1 Log In Screen

As the title suggests, this application is private content based image information retrieval, so every user has his own user id and password through which he/she logs in to his/her account to upload or retrieve images. The Log In screen lets the user maintain his privacy. This matters because databases are often outsourced by companies, in which case the data of many users can be present on the same server/cluster. This technique ensures that a user can see only his own data and not that of others.

Figure 5.1: LogIn Screen

5.2 Create Account

The Create Account form appears after clicking the Create Account link on the sign-up page. To access the application, the user has to create an account with the web site; it is essentially a sign-up form.

Figure 5.2: Create Account Screen

The user account is created from the user's id. Validations are done on the id field, i.e. if the user enters a wrong id, the form asks for a correct id again. After a valid id is entered, the user account is created along with the security keys. The user is notified of the User ID to log in with. This User ID creates his account, i.e. a folder in HDFS. Login information, i.e. user ids and passwords, is stored in an HBase table, which is also a part of Hadoop; it is a scalable database which allows tables to be created, stores data in them, and can expand at run time.

Figure 5.3: Account Creation Screen - Validation

After successful creation of the account, the user gets a message that the account has been created and is ready to access. The user then goes to the LogIn screen to log in.

5.3 User Operation Screen

After logging in, the user sees a screen with three links: one for logout, a second for uploading images, and a third for retrieving images similar to one particular image from the database.

Figure 5.4: User Operations Screen

5.3.1 Changes in Hadoop

A conventional Hadoop set-up does not allow a process to write output files to an existing output path. A user of this system will not upload files only once; he will use the system any time, anywhere, and it is necessary for the system to store the user's images at one particular location. So whenever the upload MapReduce process runs, all encrypted feature vectors must go to the same folder that belongs to the user, which cannot be achieved by allocating a new output folder on every run. To remove this limitation, I have modified the Hadoop code so that the output of every MapReduce process goes into the user-specific folder and not to any other location.

Modified files: FileOutputFormat.java, MultipleOutputFormat.java, MultipleTextOutputFormat.java

5.3.2 Upload Process

When the user selects the upload images link on the screen, he is directed to a page where he can specify the folder path of the images. All the images present in the folder are uploaded to HDFS first; these images are used in the upload MapReduce. For each upload, a Map-Reduce process runs. It splits an image into 16 small images and calculates the feature vector of every small image, i.e. the color moments of each split image. This feature vector is nothing but the color moments of the image calculated from its color histogram. A color histogram identifies the proportion of pixels within an image holding specific values, which can be used to find the similarity between two images by using similarity distance measures [8], [11]. It tries to identify color proportion by region and is independent of image size, format or orientation [6]. Color moments consist of the first order (mean), second order (variance) and third order (skewness) moments of an image [8]. All moments are encrypted with the public key of the user and are then stored in a file. Whenever the user uploads images to the database, they go into the same folder. Hadoop normally does not support using the same output folder in different MapReduce processes, so I modified Hadoop so that in each upload Map-Reduce the files go into the same folder.

Figure 5.5: Upload Screen

The MapReduce process takes some time to complete; if the user does not get a proper message he might think that the images are already uploaded and might shut down the process. So the user gets a message showing that files are being uploaded. While the MapReduce job is executing, the user sees the following screen until the process completes.

Figure 5.6: Uploading Screen

There is also a chance of repeated file names, so files whose names match files already present in HDFS are renamed at the time of upload. First, all the user's file names are read from HDFS and collected into one string; the file name of the image which is going to be uploaded is matched against this string, and if the string contains the image name, the image is renamed. The program then checks whether the newly generated name is present in the string; if it is, the file is renamed again.

Map Reduce Algorithm

The MapReduce technique is used to create the feature vectors of images. The map splits the input image file into small chunks of fixed size, calculates their feature vectors, encrypts them using the homomorphic encryption technique and stores them in the database. The reducer gets sorted input from the mapper; it combines all the split encrypted feature vectors and emits the image id and the combined list.

Algorithm 5.1 Upload Map Reduce Process Algorithm

Mappers emit encrypted feature vectors keyed by name and featureKey, the execution framework groups vectors by name along with featureKey, and the reducers write names and encrypted feature values to files.

imgs <- new BufferedImage[]
random <- new BigInteger
n, g, r, nSquare <- new BigInteger
dMeanR <- new double
dMeanG <- new double
dMeanB <- new double

1: procedure PrintPixelARGB(int pixel)
2:   dMeanR <- dMeanR + red pixel value
3:   dMeanG <- dMeanG + green pixel value
4:   dMeanB <- dMeanB + blue pixel value

1: procedure Encryption(int meanVal)
2:   return BigInteger encrypted(meanVal)

1: class Mapper
2: procedure Map(ImgId iid, Img im)
3:   m0, m1, m2 <- new BigInteger
4:   imgs <- splitImages(im)
5:   for all img i in imgs[count] do
6:     for all pixel p in imgs[i] do
7:       PrintPixelARGB(p)
8:     m0 <- Encryption(dMeanR)
9:     m1 <- Encryption(dMeanG)
10:    m2 <- Encryption(dMeanB)
11:    Emit(ImgId iid, encrypted_vals <iid_i, m0, m1, m2>)

1: class Reducer
2: procedure Reduce(ImgId iid, encrypted_vals <iid_i, m0, m1, m2>)
3:   iIdL <- new List
4:   for all encrypted_vals <iid_i, m> in encrypted_vals
5:     [<iid_0, m0, m1, m2>, <iid_1, m0, m1, m2>, ...] do
6:     Append(iIdL, <iid_i, m>)
7:   Emit(ImgId iid, iIdL)

The following image shows the working of the MapReduce process in Eclipse, i.e. behind the screen.

Figure 5.7: Process behind Upload

5.3.3 Image Retrieval

When the user clicks the search images link, he is directed to a page where he can specify the file path of an image. This is the image on the basis of which images are to be retrieved from the database. The comparison is done between the features of the images stored in HDFS and the features of the image in the search process. In the upload process, feature vectors are calculated and stored in HDFS; these feature vectors are used for the comparison.

Figure 5.8: Search Screen

When the user selects an image, it passes through several steps. These steps are similar to those of the upload process, i.e. image feature calculation and encryption. The image is first split into 16 parts, just as in the upload phase; this step is taken to keep the loss of pixels equal in both cases. The features are encrypted with the same public key of the user, so at the similarity matching phase the features can be compared consistently. At the upload stage, the features of the split images are stored in HDFS, so at similarity measurement time they must be combined back into one feature vector. Here the Paillier cryptosystem comes into the picture: Paillier allows addition and multiplication operations to be performed on encrypted data without changing the original data. A similarity distance is calculated between the stored image feature vectors and the feature vector of the image to be searched. The similarity distance is simply a mathematical formula which gives the relation between the images; there are different measures which can be used for the calculation [8], [11]. The Euclidean distance measure is used for the calculation of similarity between images. This distance is calculated between the means and provides information about the similarity between two images. This comparison process is done in a MapReduce job to make it parallel. While the MapReduce process is running, the user sees the following screen.

Figure 5.9: Searching Screen

Background processes can be seen in the following screen.

Figure 5.10: Processes behind Search Screen

Map Reduce Algorithm

Algorithm 5.2 Image Retrieval Map Reduce Process Algorithm

Mappers emit encrypted feature vectors keyed by opName and featureId, the execution framework groups vectors by name along with opName and featureId, the combiner gets the encrypted feature vectors of the queried image, calculates the difference between the combined values from the datastore and the vectors of the queried image, also calculates the square, and emits the values. The reducers get the difference values and write names and the square root of the encrypted feature values to files.

upload_img_m1 <- new BigInteger
upload_img_m2 <- new BigInteger
upload_img_m3 <- new BigInteger


More information

The EMSX Platform. A Modular, Scalable, Efficient, Adaptable Platform to Manage Multi-technology Networks. A White Paper.

The EMSX Platform. A Modular, Scalable, Efficient, Adaptable Platform to Manage Multi-technology Networks. A White Paper. The EMSX Platform A Modular, Scalable, Efficient, Adaptable Platform to Manage Multi-technology Networks A White Paper November 2002 Abstract: The EMSX Platform is a set of components that together provide

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BIG DATA TECHNOLOGY. Hadoop Ecosystem BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big

More information

Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Hypertable Architecture Overview

Hypertable Architecture Overview WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for

More information

Feasibility Study of Searchable Image Encryption System of Streaming Service based on Cloud Computing Environment

Feasibility Study of Searchable Image Encryption System of Streaming Service based on Cloud Computing Environment Feasibility Study of Searchable Image Encryption System of Streaming Service based on Cloud Computing Environment JongGeun Jeong, ByungRae Cha, and Jongwon Kim Abstract In this paper, we sketch the idea

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS. PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS Project Project Title Area of Abstract No Specialization 1. Software

More information

Click Stream Data Analysis Using Hadoop

Click Stream Data Analysis Using Hadoop Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

Encryption and Decryption for Secure Communication

Encryption and Decryption for Secure Communication Encryption and Decryption for Secure Communication Charu Rohilla Rahul Kumar Yadav Sugandha Singh Research Scholar, M.TECH CSE Dept. Asst. Prof. IT Dept. Asso. Prof. CSE Dept. PDMCE, B.Garh PDMCE, B.Garh

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Big Systems, Big Data

Big Systems, Big Data Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,

More information

Cisco Data Preparation

Cisco Data Preparation Data Sheet Cisco Data Preparation Unleash your business analysts to develop the insights that drive better business outcomes, sooner, from all your data. As self-service business intelligence (BI) and

More information

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

A programming model in Cloud: MapReduce

A programming model in Cloud: MapReduce A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value

More information

Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis

Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis , 22-24 October, 2014, San Francisco, USA Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis Teng Zhao, Kai Qian, Dan Lo, Minzhe Guo, Prabir Bhattacharya, Wei Chen, and Ying

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Distributed Framework for Data Mining As a Service on Private Cloud

Distributed Framework for Data Mining As a Service on Private Cloud RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

More information

Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Karagpur

Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Karagpur Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Karagpur Lecture No. #06 Cryptanalysis of Classical Ciphers (Refer

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Introduction to Big Data Training

Introduction to Big Data Training Introduction to Big Data Training The quickest way to be introduce with NOSQL/BIG DATA offerings Learn and experience Big Data Solutions including Hadoop HDFS, Map Reduce, NoSQL DBs: Document Based DB

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Large-Scale Test Mining

Large-Scale Test Mining Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Task Scheduling in Hadoop

Task Scheduling in Hadoop Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed

More information

BASHO DATA PLATFORM SIMPLIFIES BIG DATA, IOT, AND HYBRID CLOUD APPS

BASHO DATA PLATFORM SIMPLIFIES BIG DATA, IOT, AND HYBRID CLOUD APPS WHITEPAPER BASHO DATA PLATFORM BASHO DATA PLATFORM SIMPLIFIES BIG DATA, IOT, AND HYBRID CLOUD APPS INTRODUCTION Big Data applications and the Internet of Things (IoT) are changing and often improving our

More information

The big data revolution

The big data revolution The big data revolution Friso van Vollenhoven (Xebia) Enterprise NoSQL Recently, there has been a lot of buzz about the NoSQL movement, a collection of related technologies mostly concerned with storing

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Secure cloud access system using JAR ABSTRACT:

Secure cloud access system using JAR ABSTRACT: Secure cloud access system using JAR ABSTRACT: Cloud computing enables highly scalable services to be easily consumed over the Internet on an as-needed basis. A major feature of the cloud services is that

More information

Project Proposal. Data Storage / Retrieval with Access Control, Security and Pre-Fetching

Project Proposal. Data Storage / Retrieval with Access Control, Security and Pre-Fetching 1 Project Proposal Data Storage / Retrieval with Access Control, Security and Pre- Presented By: Shashank Newadkar Aditya Dev Sarvesh Sharma Advisor: Prof. Ming-Hwa Wang COEN 241 - Cloud Computing Page

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

More information

Fast remote data access for control of TCP/IP network using android Mobile device

Fast remote data access for control of TCP/IP network using android Mobile device RESEARCH ARTICLE OPEN ACCESS Fast remote data access for control of TCP/IP network using android Mobile device Vaibhav Muddebihalkar *, R.M Gaudar** (Department of Computer Engineering, MIT AOE Alandi

More information

Image Search by MapReduce

Image Search by MapReduce Image Search by MapReduce COEN 241 Cloud Computing Term Project Final Report Team #5 Submitted by: Lu Yu Zhe Xu Chengcheng Huang Submitted to: Prof. Ming Hwa Wang 09/01/2015 Preface Currently, there s

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2 Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

More information

Big Data Analytics by Using Hadoop

Big Data Analytics by Using Hadoop Governors State University OPUS Open Portal to University Scholarship All Capstone Projects Student Capstone Projects Spring 2015 Big Data Analytics by Using Hadoop Chaitanya Arava Governors State University

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

http://www.paper.edu.cn

http://www.paper.edu.cn 5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission

More information

A block based storage model for remote online backups in a trust no one environment

A block based storage model for remote online backups in a trust no one environment A block based storage model for remote online backups in a trust no one environment http://www.duplicati.com/ Kenneth Skovhede (author, kenneth@duplicati.com) René Stach (editor, rene@duplicati.com) Abstract

More information

Chapter 17. Transport-Level Security

Chapter 17. Transport-Level Security Chapter 17 Transport-Level Security Web Security Considerations The World Wide Web is fundamentally a client/server application running over the Internet and TCP/IP intranets The following characteristics

More information

MapReduce (in the cloud)

MapReduce (in the cloud) MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:

More information

Journal of Environmental Science, Computer Science and Engineering & Technology

Journal of Environmental Science, Computer Science and Engineering & Technology JECET; March 2015-May 2015; Sec. B; Vol.4.No.2, 202-209. E-ISSN: 2278 179X Journal of Environmental Science, Computer Science and Engineering & Technology An International Peer Review E-3 Journal of Sciences

More information

Final Year Project Interim Report

Final Year Project Interim Report 2013 Final Year Project Interim Report FYP12016 AirCrypt The Secure File Sharing Platform for Everyone Supervisors: Dr. L.C.K. Hui Dr. H.Y. Chung Students: Fong Chun Sing (2010170994) Leung Sui Lun (2010580058)

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Sync Security and Privacy Brief

Sync Security and Privacy Brief Introduction Security and privacy are two of the leading issues for users when transferring important files. Keeping data on-premises makes business and IT leaders feel more secure, but comes with technical

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON BIG DATA MANAGEMENT AND ITS SECURITY PRUTHVIKA S. KADU 1, DR. H. R.

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Big Data Storage, Management and challenges. Ahmed Ali-Eldin Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big

More information

MySQL and Virtualization Guide

MySQL and Virtualization Guide MySQL and Virtualization Guide Abstract This is the MySQL and Virtualization extract from the MySQL Reference Manual. For legal information, see the Legal Notices. For help with using MySQL, please visit

More information

Generic Log Analyzer Using Hadoop Mapreduce Framework

Generic Log Analyzer Using Hadoop Mapreduce Framework Generic Log Analyzer Using Hadoop Mapreduce Framework Milind Bhandare 1, Prof. Kuntal Barua 2, Vikas Nagare 3, Dynaneshwar Ekhande 4, Rahul Pawar 5 1 M.Tech(Appeare), 2 Asst. Prof., LNCT, Indore 3 ME,

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

BlackBerry Enterprise Service 10. Secure Work Space for ios and Android Version: 10.1.1. Security Note

BlackBerry Enterprise Service 10. Secure Work Space for ios and Android Version: 10.1.1. Security Note BlackBerry Enterprise Service 10 Secure Work Space for ios and Android Version: 10.1.1 Security Note Published: 2013-06-21 SWD-20130621110651069 Contents 1 About this guide...4 2 What is BlackBerry Enterprise

More information

A NOVEL APPROACH FOR MULTI-KEYWORD SEARCH WITH ANONYMOUS ID ASSIGNMENT OVER ENCRYPTED CLOUD DATA

A NOVEL APPROACH FOR MULTI-KEYWORD SEARCH WITH ANONYMOUS ID ASSIGNMENT OVER ENCRYPTED CLOUD DATA A NOVEL APPROACH FOR MULTI-KEYWORD SEARCH WITH ANONYMOUS ID ASSIGNMENT OVER ENCRYPTED CLOUD DATA U.Pandi Priya 1, R.Padma Priya 2 1 Research Scholar, Department of Computer Science and Information Technology,

More information

1. Introduction 1.1 Methodology

1. Introduction 1.1 Methodology Table of Contents 1. Introduction 1.1 Methodology 3 1.2 Purpose 4 1.3 Scope 4 1.4 Definitions, Acronyms and Abbreviations 5 1.5 Tools Used 6 1.6 References 7 1.7 Technologies to be used 7 1.8 Overview

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information