
APPLICATIONS OF MULTIMEDIA FORENSICS

DISSERTATION

Submitted in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY (Electrical Engineering) at the POLYTECHNIC INSTITUTE OF NEW YORK UNIVERSITY

Sevinc Bayram

January 2012


Microfilm or other copies of this dissertation are obtainable from UMI Dissertation Publishing, ProQuest CSA, 789 E. Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI

Vita

Sevinç Bayram was born in Bursa, the lovely green city of Turkey. She received her B.Sc. and M.Sc. degrees in Electronics Engineering from Uludag University, Bursa, Turkey, in 2002 and 2005, respectively. In her B.Sc. studies, she researched the fingerprint classification problem. During her M.Sc. studies, she was introduced to the world of Multimedia Forensics, for which she developed a passion. In 2006, Sevinç Bayram started working towards her Ph.D. degree at Polytechnic Institute of NYU. In 2010, she was a summer intern at Dolby Labs, where she worked on audio forensics techniques. Her research interests include all aspects of Multimedia Forensics, including Tamper Detection, Source Device/Model Identification, Computer Generated Image Identification, Efficient Techniques in Multimedia Forensics for Large Databases, and applications of Multimedia Forensics Techniques.

To my dear parents, Mustafa and Seviye Bayram

Acknowledgements

First and foremost, I would like to thank my supervisor Prof. Nasir Memon for welcoming me into his research group, for his continuous help and guidance, and for the many insightful discussions he provided. I would also like to thank him for his unlimited patience and tolerance and for his great personality, not only guiding me in research but also guiding and helping me in other aspects of life. It was, and still is, an honor to know and work with him. I wish to express my sincere thanks to Professor Hüsrev Taha Sencar for always being there when needed, for sharing his valuable ideas with me, and for being a very good advisor, a very good friend and a very good example. I cannot emphasize enough how much I benefited from his knowledge, wisdom and personality. I would also like to thank Professors Yao Wang and Ivan Selesnick for serving on my thesis committee. I was very fortunate to have them as my professors on very important subjects which I use daily in my research. I would like to especially thank them for being an inspiration and for saving me with their lecture notes (which I check quite often) whenever I am stuck.

I was also very fortunate to have Ismail Avcibas as my Master's thesis advisor, who introduced me to multimedia forensics problems. His supervision and support are truly appreciated. I would like to take this opportunity to also thank all the members of ISIS LAB. I have learned a lot from each of them. My life in this country would not be livable without my dear friends Anagha Mudigonda, Kagan Bakanoglu, Naren Venkatraman, Cagdas Dogan, Senem Acet Coskun, Baris Coskun, Ozgu Alay, Yagiz Sutcu, Kurt Rosenfeld, Napa Sae-Bae, and Apuroop Gadde, and I am grateful to each of them for being there for me all the time. Special thanks to my better half Dervis Salih for his love and support; for being my best friend, listening to all my complaints and, no matter what, finding a way to make me happy. I would like to especially thank him for his efforts to make me a better researcher, for motivating me to work harder and for setting a great example himself. Last but not least, I would like to express my gratitude to my beloved family for their never ending love and support; to my parents, who raised me with a love towards education and science and supported me in all my pursuits; to my sister and brother for their continuous encouragement; and to my two beautiful nieces, Esra and Serra, for adding joy to my life.

AN ABSTRACT

APPLICATIONS OF MULTIMEDIA FORENSICS

by Sevinç Bayram

Nasir Memon, Advisor

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy (Electrical Engineering)

January 2011

In recent years, the problem of multimedia source verification has received rapidly growing attention. To determine the source of a multimedia object (image/video), several techniques have been developed that can identify characteristics that relate to the physical processes and algorithms used in its generation. In particular, it has been shown that noise-like variations in images and videos, due to the different light sensitivity of pixels, can be accurately measured and used as a fingerprint of an imaging sensor. The presence of a sensor fingerprint in a multimedia object provides evidence that the given multimedia object was captured by that exact sensor. Motivated by this, in this thesis, we investigate the different

application potentials of the sensor fingerprint matching technique. For this purpose, we first focus on using sensor fingerprints in a source identification scenario, where the aim is to find, in a timely fashion, the multimedia objects captured by a given device in a large collection of multimedia objects. The associated fingerprint matching method can be computationally expensive, especially for applications that involve large-scale databases (e.g., YouTube or Flickr). To overcome these limitations, we propose two different approaches. In the first approach, we propose to represent sensor fingerprints in binary-quantized form. While a significant improvement in efficiency is achieved with this approach, it is shown through both analytical study and simulations that the reduction in matching accuracy due to quantization is insignificant compared to conventional approaches. Experiments on actual sensor fingerprint data are conducted to confirm that there is only a slight increase in the probability of error and to demonstrate the computational efficacy of the approach. In our second approach, we present a binary search tree (BST) data structure based on group testing to enable fast identification. Our results on real-world and simulation data show that a major improvement in search time can be achieved with the proposed scheme. The limitations of the BST are also shown analytically. Furthermore, we demonstrate how to use device characteristics for the conventional content-based video copy detection task. We show the viability of our scheme by both analyzing its robustness against common video processing operations and evaluating its performance on real-world data, including controlled video

sequences as well as videos downloaded from YouTube. Our results show that the proposed scheme is very effective and suitable for the video copy detection application.

Publications

Publications Related to Thesis

S. Bayram, H. T. Sencar, N. Memon, Video Copy Detection Based on Source Device Characteristics: A Complementary Approach to Content-Based Methods, ACM International Conference on Multimedia Information Retrieval, October 2008, Vancouver, CA. (5% oral presentation acceptance rate)

S. Bayram, H. T. Sencar, N. Memon, Efficient Techniques For Sensor Fingerprint Matching In Large Image & Video Databases, SPIE Electronic Imaging, January 2010, San Jose, CA.

S. Bayram, H. T. Sencar, N. Memon, Efficient Sensor Fingerprint Matching Through Fingerprint Quantization, manuscript accepted to appear in IEEE Transactions on Information Forensics and Security.

S. Bayram, H. T. Sencar, N. Memon, Efficient Video Copy Detection Based on Source Device Characteristics, manuscript in preparation to be submitted to IEEE Transactions on Multimedia.

S. Bayram, H. T. Sencar, N. Memon, Group Testing Based Sensor Fingerprint Identification in Large Databases, manuscript in preparation to be submitted to IEEE Transactions on Information Forensics and Security.

Other Publications During PhD Studies

Journal

S. Bayram, H. T. Sencar, N. Memon, Classification of digital camera-models based on demosaicing artifacts, Journal of Digital Investigation, Volume 5, Issues 1-2, September 2008. (Editorial in New Scientist Magazine, Issue 2682, Page 30, 14 November 2008)

S. Bayram, J. Ma, P. Tao, V. Svetnik, High-throughput Ocular Artifact Reduction in Multichannel Electroencephalography (EEG) Using Component Subspace Projection, Journal of Neuroscience Methods, March 15;196(1).

S. Bayram, J. Ma, P. Tao, V. Svetnik, Muscle Artifacts in MultiChannel EEG: Characteristics and Reduction, accepted to Clinical Neurophysiology.

S. Bayram, H. T. Sencar, N. Memon, Ensemble Systems for Steganalysis, manuscript submitted to IEEE Transactions on Information Forensics and Security.

Conference

S. Bayram, H. T. Sencar, N. Memon, and I. Avcibas, Improvements on source camera-model identification based on CFA interpolation, Proc. of WG 11.9 International Conference on Digital Forensics, 2006, Florida.

Y. Sutcu, S. Bayram, H. T. Sencar, and N. Memon, Improvements on sensor noise based source camera identification, IEEE International Conference on Multimedia and Expo (ICME), 2007, Beijing, China.

A. E. Dirik, S. Bayram, H. T. Sencar, and N. Memon, New Features to Identify Computer Generated Images, IEEE International Conference on Image Processing (ICIP), 2007, San Antonio, TX.

S. Bayram, H. T. Sencar, N. Memon, A Survey of Copy-Move Forgery Detection Techniques, IEEE Western New York Image Processing Workshop, September 2008, NY. (Best student paper award)

S. Bayram, H. T. Sencar, N. Memon, An Efficient and Robust Method For Detecting Copy-Move Forgery, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009, Taipei, Taiwan.

S. Bayram, A. E. Dirik, H. T. Sencar and N. Memon, An Ensemble of Classifiers Approach to Steganalysis, ICPR, 2010.

Contents

Vita
Acknowledgements
Publications
1 Introduction
  1.1 Contributions of Thesis
  1.2 Organization of the Thesis
2 Background
  2.1 Imaging sensor output model
  2.2 PRNU noise estimation
  2.3 Verification with PRNU
    Robustness of PRNU and Anti-Forensics
3 Efficient Sensor Fingerprint Identification In Large Image & Video Databases

  3.1 Through Sensor Fingerprint Quantization
    Effect on Matching Performance
    Computation and Storage Aspects
    Experimental Results
  3.2 Through Binary Search Tree Structure Based on Group Testing
    Binary Search Tree
    Building The Tree by Hierarchical Clustering
    Retrieving Multiple Objects
    Experimental Results
4 Video Copy Detection Based on Source Device Characteristics: A Complementary Approach to Content-Based Methods
    Camcorder Identification based on PRNU
    Obtaining Source Characteristics Based Video Signatures
    Robustness Properties of Video Signatures
      Contrast Adjustment
      Brightness Adjustment
      Blurring
      AWGN Addition
      Compression
      Random Frame Dropping
    Performance Evaluation

5 Conclusions and Future Directions

List of Figures

1.1 High Level Description of Proposed Multimedia Source Device Identification System
2.1 A simplified depiction of an imaging pipeline within a camera and the different components involved
2.2 PRNU extraction and verification process
3.1 The change in correlation due to binary-quantization of sensor fingerprints. The solid line shows the theoretical correlation between the two real-valued fingerprints, the dashed line shows the correlation between a real and a binary fingerprint, and the dotted line shows the correlation between two binary fingerprints. Circles show the correlation between two real-valued fingerprints obtained through numerical simulation. Similarly, stars show the correlation between a real- and a binary-valued fingerprint and rectangles show the correlation between two binary-valued simulated fingerprints

3.2 The change in PCE due to binarization of fingerprints. The solid line shows the PCE between two real simulated fingerprints, the dashed line shows the PCE between a real and a binary simulated fingerprint, and the dotted line shows the PCE between two binary simulated fingerprints
3.3 ROC curves comparing performance when fingerprint matching involves (a) only real-valued fingerprints, (b) real- and binary-valued fingerprints, and (c) only binary-valued fingerprints
3.4 The distribution of intra-camera correlations for (a) Camera A, (b) Camera B, and (c) Camera C, obtained between the camera fingerprints and fingerprints of images captured by the same camera under the three settings
3.5 The distribution of inter-camera correlations for (a) Camera A, (b) Camera B, and (c) Camera C, obtained between the camera fingerprints and fingerprints of images captured by other cameras under the three settings

3.6 The distribution of correlations. (a), (b), and (c) correspond to Camera A, (d), (e), and (f) to Camera B, and (g), (h), and (i) to Camera C. The first column is the correlation between real-valued fingerprints, the second column is the correlation between real-valued and binary-valued fingerprints, and the last column is the correlation between two binary-valued fingerprints
3.7 The accuracies for varying k values in each case for Camera C. The solid line shows the real-real case, the dashed line shows the real-binary case and the dotted line shows the binary-binary case
3.8 Distribution of correlation between a query fingerprint and a composite when the composite does not contain a matching fingerprint. The fingerprints are n = 10^7 in size and the composite contains N = . The red line shows the analytical findings and the blue lines show the results on simulation data
3.9 Distribution of correlation between a query fingerprint and a composite when the composite contains a matching fingerprint. The fingerprints are n = 10^7 in size and the composite contains N = . The red line shows the analytical findings and the blue lines show the results on simulation data
3.10 ROC curves comparing performance when a fingerprint is correlated with composite fingerprints, and n =

3.11 Binary search tree
3.12 The distribution of correlation values using the fingerprint of Sony Cybershot P72 and the fingerprint estimates from Sony Cybershot S90 (green) and Sony Cybershot P72 (purple)
3.13 (a) Precision-Recall diagrams for 3 cameras when the group based approach is used and for Sony S90 when the tree is built by random splitting. (b) Precision-Recall diagrams for Sony S90 with different quality fingerprints
4.1 (a) A video and its contrast enhanced duplicate. (b) A video and its advertisement overlaid version. (c) Two videos taken at slightly different angles. (d) Similar but not duplicate videos
4.2 The distribution of inter- and intra-correlation values. Distributions in blue indicate the correlation values of fingerprints associated with the videos shot by the reference camcorder. Distributions in red indicate the correlation values of fingerprints between the reference camcorder and other camcorders. The bit-rate of videos and the number of frames in each segment are (a) 1 Mbps and 1000 frames, (b) 1 Mbps and 1500 frames, (c) 2 Mbps and 1000 frames, and (d) 2 Mbps and 1500 frames
4.3 Distribution of correlation values obtained by correlating signatures of 50 video clips with each other

4.4 The distribution of correlation values. The blue distribution is obtained by correlating the unmodified video clips and their modified versions, red is obtained by correlating video clips with the ones coming from the same camcorder, and green is obtained by cross correlation of different videos, for different types of manipulations. (a) Decreased contrast. (b) Increased contrast. (c) Decreased brightness. (d) Increased brightness. (e) Blurring. (f) AWGN addition. (g) Compression. (h) Random frame dropping
4.5 The change in mean of correlation values as a function of the strength of (a) contrast increase and (b) contrast decrease
4.6 The change in mean of correlation values as a function of the strength of (a) brightness decrease and (b) brightness increase
4.7 The change in mean of correlation values as a function of the strength of (a) blurring and (b) AWGN addition
4.8 The change in mean of correlation values as a function of the strength of (a) compression and (b) frame dropping
4.9 (a) Cross-correlation of extracted video signatures. (b) ROC curve for detection results on the videos downloaded from YouTube
4.10 Video copies for which the extracted signatures are dissimilar
4.11 Video copies with similar signatures
4.12 Different videos with similar signatures. corr(a,b) =

4.13 The frames of four example videos from a commercial series
4.14 The distribution of correlation values obtained from the commercial series. The red distribution is obtained by pair-wise correlations of each video and the blue distribution is obtained by correlation of composite videos

List of Tables

3.1 Comparison of resource requirements of different methods proposed for fingerprint matching
3.2 Change in True Positive Rates due to Binarization Using Camera-Dependent Thresholds
3.3 Change in Probability of Error Due to Binary-Quantization
3.4 Performance Results with Binary Search Tree

Chapter 1

Introduction

Recent research in digital image and video forensics [1] has shown that media data has certain characteristics that relate to the physical mechanisms and algorithms used in its generation. These characteristics, although imperceptible to the human eye, get embedded within multimedia data and are essentially a combination of two interrelated factors: first, the class properties that are common among all devices of a brand and/or model; and second, the individual properties that set a device apart from others in its class. For example, for image data, many approaches have been demonstrated to identify the class of the device that created the image (for example, was the picture taken by a Sony camera or an iPhone camera) based on the properties of the components used in the imaging pipeline, such as the type of the color filter array used, the type of lens, compression parameters, or the specifics of the demosaicing (interpolation)

technique [2, 3, 4, 5, 6, 7, 8, 9, 10]. It has also been shown that certain unique low-level characteristics of the image, such as the noise-like characteristics of the imaging sensor [11, 12, 13] and traces of sensor dust [14], can be successfully used in determining the specific source device used to capture the image (for example, was this picture taken with this specific camera, even though both are Sony?). While the existence of multimedia forensics techniques is essential in determining the origin, veracity and nature of media data, these techniques have a potential for much wider applications. Motivated by this, in this thesis, we show how source device characteristics can be used for several other multimedia applications. We particularly focus on the applications of one of the most successful unique source camera verification techniques, namely, the Photo Response Non-Uniformity (PRNU) noise-based sensor fingerprint matching method [11, 15]. It is well established now that any sensor leaves a unique fingerprint in every image/frame captured with the sensor, much like how every gun leaves unique scratch marks on every bullet that passes through its barrel. Furthermore, this unique fingerprint is hard to remove or forge and survives a multitude of operations performed on the image, such as blurring, scaling, compression, and even printing and scanning. The sensor fingerprint matching technique has been shown to be very successful when a one-to-one comparison of two cameras/camcorders is conducted. Therefore, it can be very useful in cases where legal entities want to verify whether an image/video with illegal content has been captured by a suspect's device.

Figure 1.1: High Level Description of Proposed Multimedia Source Device Identification System

Although existing capabilities today can reliably verify the source of an image or video, there are many cases that require solutions far beyond current capabilities. Consider the following scenario where a legal entity gets hold of some illegal (e.g., child pornographic, terror related) content, but there is no suspect or any other evidence immediately available. Now, it is highly conceivable that the owner of the device that captured this content also has publicly available multimedia data, such as in a Flickr or Facebook account. In this practical scenario, the question arises whether it is possible to search a large collection of content like Flickr or Facebook (or a database accessible only to the legal entity) to link the illegal content at hand with other content in the available multimedia collection. Considering these facts, the most straightforward application of the sensor fingerprint matching technique would be source device (camera/camcorder) identification, where the aim is to find the multimedia object(s) captured by a given device in a large database of objects. While the identification problem can be solved by performing multiple one-to-one matchings, this would require a number of comparisons linear in the size of the database; since the fingerprint size is very large, this is clearly not feasible. There are two obvious choices to increase the matching speed in large databases. In the first approach, the fingerprint size can be decreased by obtaining a compact representation so that even linear comparisons can be performed. Another standard approach, as depicted in Figure 1.1, would be to organize and index the database such that one is able to quickly and efficiently search only a small part of the database, and find the objects whose

sensor fingerprints match the query object within a certain tolerance threshold. Moreover, we explore other application possibilities of sensor fingerprints besides forensics. In Chapter 4, we show how these fingerprints can be deployed to achieve the goals of conventional content-based video copy detection techniques. Video copy detection is defined as the automated analysis procedure used to identify duplicate and modified copies of a video among a large number of videos. This procedure can then be used for efficient indexing, copyright management and accurate retrieval of videos, as well as detection and removal of repeated videos to reduce storage costs. In this thesis, we develop a technique that adapts the sensor fingerprint matching technique for video copy detection purposes.

1.1 Contributions of Thesis

In this thesis, we explore the different applications of the sensor fingerprint matching method. The first application we consider is efficient source device identification in large databases. The straightforward approaches to achieve efficiency would be either to compress the fingerprints or to index the database so that it can be queried faster. While indexing and searching large collections of multimedia objects has been a well-studied problem with some remarkable engineering successes, indexing and querying device fingerprints poses some unique challenges, including the large dimensionality and randomness of device fingerprints, and a complicated matching procedure that is required to map the query to the nearest fingerprint. In order to

overcome these challenges, we propose two different approaches. The first approach shows that sensor fingerprint matching can be done significantly faster by binary quantization of the real-valued sensor fingerprint data. We show that the use of binary fingerprints can significantly decrease the storage space needed, the I/O time, and the matching time. Furthermore, we show through both numerical and experimental analysis that the reduction in matching performance due to the information loss from binarization is minimal. In the second approach, the main idea is to lower the computational complexity by reducing the number of matchings to be performed. This is realized by a group testing based approach, where the query fingerprint is matched with composite fingerprints instead of each fingerprint in the database. The method proposes to construct a binary search tree to index the fingerprints. The advantages of the approach are demonstrated on fingerprints created synthetically and also on fingerprints extracted from real-world images. The limitations of the method are shown analytically. Another application we propose to use sensor fingerprints for is video copy detection. In our approach, we use the fact that a video may be a combination of segments captured by several different camcorders. Hence, we propose to use the weighted combination of the fingerprints of the camcorders (i.e., imaging sensors) involved in the generation of a video as our video signature. We investigate the robustness of our video fingerprint under different manipulations. We also explore the viability of our scheme on videos downloaded from YouTube.

In summary, the contributions of this thesis are:

- We propose to compress sensor fingerprints to enable fast identification of multimedia objects in large databases captured by a given device. For this purpose, we propose to quantize each element of the fingerprint into a binary number.
- For the same application, we propose to index the sensor fingerprints by building a binary search tree based database. While the structure of the tree is studied in detail, approaches for building and updating the tree are also proposed.
- We adapt the sensor fingerprint matching technique to the video copy detection problem. We show how to get a reliable and robust video fingerprint using the sensor fingerprints of the camcorders used in the capturing process of the video.

1.2 Organization of the Thesis

In the next chapter, an overview of the sensor fingerprint matching method will be provided and the verification process will be described. Sensor fingerprint quantization and binary search tree methodologies for efficient sensor fingerprint identification in large databases will be detailed in Chapter 3. In Chapter 4, our video copy detection technique will be described. Finally, in Chapter 5, we will conclude the thesis and discuss future directions.

Chapter 2

Background

Advancements in sensor technology have led to many new and novel imaging sensors. However, images captured by these sophisticated sensors still suffer from systematic noise components such as photo response non-uniformity (PRNU) noise. The PRNU noise signal is caused mainly by impurities in silicon wafers. These imperfections affect the light sensitivity of each individual pixel and form a fixed noise pattern. Since every image captured by the same sensor exhibits the same pattern, PRNU noise can be used as a fingerprint of the sensor. In the rest of this chapter we briefly summarize the basic sensor output model, the PRNU estimation process, and the matching methodology.

Figure 2.1: A simplified depiction of an imaging pipeline within a camera and the different components involved.

2.1 Imaging sensor output model

In a digital imaging device, the light entering the camera through its lens is first filtered and focused onto sensor (e.g., CCD) elements which capture the individual pixels that comprise the image. The sensor is the main and most expensive component of a digital imaging device. Each light sensing element of the sensor array integrates the incident light and obtains a digital signal representation of the scenery. Generally, the sensor elements are monochromatic; therefore, for each pixel one color value is captured, typically red, green, or blue (RGB). Later, a demosaicing operation takes place to calculate the missing color values. This is followed by white balancing, colorimetric interpretation, and gamma correction. After these, noise reduction, anti-aliasing and sharpening are performed to avoid color artifacts. At the end, the image/video is compressed and saved in the device's memory [16]. A simplified version of an imaging pipeline is shown in Figure 2.1. For every color channel, let us denote the digital signal representation before demosaicing as I[i], and the incident light intensity as Y[i], where i = 1, ..., n specifies

a specific pixel. Below, all matrices shown in bold font are in vector form and all operations are element-wise. A simplified model of the sensor output can then be written as:

I = g^γ · [(1 + K)Y + Λ]^γ + Θ_q        (2.1.1)

In the above equation, g denotes the color channel gain and γ is the gamma correction factor. The zero-mean noise-like signal responsible for PRNU is denoted by K. This signal is called the sensor fingerprint. Moreover, Λ denotes the combination of all other additive noise sources such as dark current noise, shot noise and read-out noise. Finally, Θ_q denotes the quantization noise. To factor out the most dominant component, the light intensity Y, from this equation, the Taylor expansion (1 + x)^γ = 1 + γx + O(x²) can be used, yielding

I = (gY)^γ · [1 + K + Λ/Y]^γ + Θ_q = (gY)^γ · (1 + γK + γΛ/Y) + Θ_q        (2.1.2)

Finally, to simplify the notation and to reduce the number of symbols, γ can be absorbed into the PRNU factor K and the sensor output model can be written as:

I = I^(0) + I^(0)K + Θ        (2.1.3)

where I^(0) = (gY)^γ is the sensor output in the absence of noise, I^(0)K is the PRNU noise term, and Θ = γI^(0)Λ/Y + Θ_q is the composite of the independent random noise components.
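The following short sketch illustrates the simplified model of Eq. 2.1.3 numerically. The array size, noise levels and PRNU amplitude are arbitrary illustrative values chosen here (they are not taken from the thesis), and a real sensor output would of course depend on an actual scene Y.

```python
# Minimal numerical sketch of the simplified sensor output model of Eq. 2.1.3.
# All parameter values below are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                            # number of pixels (in vectorized form)
K = 0.01 * rng.standard_normal(n)     # zero-mean PRNU factor (sensor fingerprint)
I0 = rng.uniform(50.0, 200.0, n)      # noise-free output I^(0) = (gY)^gamma
Theta = rng.standard_normal(n)        # composite independent random noise

I = I0 + I0 * K + Theta               # Eq. 2.1.3: I = I^(0) + I^(0)K + Theta
```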

Figure 2.2: PRNU extraction and verification process

2.2 PRNU noise estimation

To get a reliable estimate of the sensor fingerprint K, the camera in question or a collection of images taken by the camera/sensor is needed. Let us assume that L images are available from the camera. A denoised version of the image I is obtained using a denoising filter F, Î^(0) = F(I) [17]. To find a reliable estimate of the PRNU, Î^(0) should be subtracted from both sides of Equation 2.1.3, so that host signal rejection and noise suppression can be done to improve the signal-to-noise ratio between I^(0)K and I:

W = I − Î^(0) = IK + I^(0) − Î^(0) + (I^(0) − I)K + Θ = IK + Ξ        (2.2.1)

The noise term Ξ is a combination of Θ and the two terms introduced by the denoising filter. This noise can be non-stationary in textured areas; therefore, images with smooth regions help to obtain better PRNU estimates. The estimator for the sensor fingerprint K from L images I_1, I_2, ..., I_L, along with the Gaussian noise terms Ξ_1, Ξ_2, ..., Ξ_L of variance σ², can then be written as:

W_k / I_k = K + Ξ_k / I_k,   W_k = I_k − Î_k^(0),   Î_k^(0) = F(I_k)        (2.2.2)

where k = 1, 2, ..., L. Finally, the maximum likelihood estimate of the sensor fingerprint K̂ can then be written as:

K̂ = ( Σ_{k=1}^{L} W_k I_k ) / ( Σ_{k=1}^{L} (I_k)² )        (2.2.3)

The top row of Figure 2.2 shows the PRNU noise estimation process. This estimate is referred to as the fingerprint of the camera. Every image taken by a specific camera will have this PRNU as part of the image, which uniquely identifies the camera.
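A compact sketch of the estimator in Eqs. 2.2.2-2.2.3 is given below. The denoising filter F used in the thesis is the filter of [17]; here a simple Gaussian blur from SciPy stands in for it, so the code only illustrates the structure of the maximum likelihood estimate, not the exact filter used.

```python
# Sketch of the maximum likelihood fingerprint estimate of Eq. 2.2.3.
# A Gaussian blur stands in for the denoising filter F of [17]; it is only a
# placeholder, not the filter actually used in the thesis.
import numpy as np
from scipy.ndimage import gaussian_filter

def estimate_fingerprint(images):
    """images: list of L grayscale frames (2-D arrays) taken by one camera."""
    num = np.zeros_like(images[0], dtype=np.float64)
    den = np.zeros_like(images[0], dtype=np.float64)
    for frame in images:
        I = frame.astype(np.float64)
        W = I - gaussian_filter(I, sigma=1.0)   # residual W_k = I_k - F(I_k)
        num += W * I                            # accumulate sum_k W_k I_k
        den += I * I                            # accumulate sum_k I_k^2
    return num / den                            # K_hat = sum W_k I_k / sum I_k^2
```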

2.3 Verification with PRNU

In the previous section, we described how to estimate a camera's PRNU noise, or in other words its sensor fingerprint K̂. Now, given a camera with fingerprint K and a query image, the presence of K in the query would indicate that the image was captured by the given camera. To determine if K is present in image I, we denoise the image with the same denoising filter F. The PRNU estimate of the image therefore is W = I − F(I). The detection problem can be formalized as a binary hypothesis test, where H_0: W = Ξ and H_1: W = I·K̂ + Ξ. To decide which hypothesis to accept, a detection statistic is computed, either the correlation, ρ = corr(W, I·K̂), or the peak to correlation energy, P = pce(W, I·K̂), between the query fingerprint and the database fingerprint. If we are considering correlation as our matching metric, H_0 will be accepted when ρ < τ_ρ; otherwise, if PCE is used to match the fingerprints, H_0 will be accepted when P < τ_P. Here, τ_ρ and τ_P are two predetermined threshold values used for correlation and PCE comparisons, respectively. Let us assume two fingerprints X and Y. Then, the correlation is defined as

ρ(X, Y) = ( Σ_{j=1}^{n} X_j Y_j ) / ( √(Σ_{j=1}^{n} X_j²) · √(Σ_{j=1}^{n} Y_j²) )        (2.3.1)

Likewise, PCE is defined as

P(X, Y) = c²(0) / E[c²(k)],   k = 0, 1, ..., n        (2.3.2)

where c(k) is the circular convolution between the two fingerprints, computed as

c(k) = (1/n) Σ_{j=1}^{n} X_j Y_{j+k}        (2.3.3)
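The two detection statistics can be sketched as follows; the circular term c(k) is computed with the FFT, and, as a simplification, the energy term of the PCE is taken as the mean over all non-zero lags (implementations often also exclude a small neighbourhood around the peak, which is omitted here).

```python
# Sketch of the detection statistics of Eqs. 2.3.1-2.3.3. c(k) is obtained via
# the FFT; excluding a small region around the peak from the energy estimate,
# as is often done for the PCE, is omitted for simplicity.
import numpy as np

def corr(x, y):                                   # Eq. 2.3.1
    return np.sum(x * y) / (np.sqrt(np.sum(x * x)) * np.sqrt(np.sum(y * y)))

def pce(x, y):                                    # Eqs. 2.3.2-2.3.3
    n = x.size
    c = np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(y)).real / n
    return c[0] ** 2 / np.mean(c[1:] ** 2)        # peak energy / average energy
```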

Robustness of PRNU and Anti-Forensics

PRNU noise is caused by manufacturing defects that are impossible to prevent. Therefore, all multimedia objects produced by a sensor will exhibit a PRNU noise. In addition, the probability of two sensors exhibiting the same PRNU is very low due to the large size (typically larger than 10^6 elements) and random nature of the PRNU noise. The direct correction of PRNU requires an operation called flat-fielding, which in essence can only be realized by creating a perfectly lit scene within the device. However, obtaining uniform sensor illumination in camera is not trivial; therefore, PRNU noise cannot be fixed easily. All these facts make PRNU a perfect candidate to be used as a sensor fingerprint. Of course, one can always wonder how robust the PRNU (sensor fingerprint) is and whether it can survive common image processing operations and/or attacks. In [18], the authors have studied the correction and robustness properties of PRNU. This study showed that PRNU is highly robust against denoising, JPEG compression and out-of-camera demosaicing. In [19, 20, 21], the authors further removed the effects of common image processing operations of the demosaicing, JPEG compression and gamma correction type by a series of post-processing operations on PRNU estimates. In another study, Goljan et al. [22] showed that even after scaling and cropping PRNU can be detected, given that the right alignment can be found by a brute force search, and in [23], PRNU detection on printed images has been investigated. It has been shown that PRNU can survive a high quality printing and scanning cycle. However, PRNU fails against the so-called fingerprint-copy attack. This attack is essentially realized by estimating camera A's fingerprint and superimposing it onto an image captured by camera B. In [24, 25], it has been shown that such an attack cannot

be detected and the fake fingerprint cannot be distinguished from the genuine one. Such an attack, for example, could potentially be used to frame an innocent person. Recently, Goljan et al. [26] proposed a countermeasure for the fingerprint-copy attack. In this approach, it is assumed that the forensics analyst has access to images captured by camera A that were not available to the attacker. Such images can be used to estimate a new fingerprint. The idea behind the countermeasure is that the fingerprints of the images captured by camera A should be consistent with each other, but the fingerprint of the image that attempts to mimic the original fingerprint should be different.

Chapter 3

Efficient Sensor Fingerprint Identification In Large Image & Video Databases

As mentioned before, the sensor fingerprint matching technique can achieve very high accuracy, with false positive and false negative rates in the order of 10^-6 or less [20]. However, when a large database is concerned, the sensor fingerprint matching method presents its own unique set of challenges. These challenges revolve around two main issues. The first issue relates to the large dimensionality and high precision representation of sensor fingerprints. As a result, main memory operations like loading of fingerprint data take a considerable amount of time. At the same time, each sensor fingerprint needs a fairly large amount of space for storage. Further, since sensor finger-

print data looks more or less random, compression is not very effective. Typically, the fingerprint extracted from a 10 megapixel image may take up to 50 MB of space even after compression. The second issue is the computational complexity of the matching algorithm. The matching process involves vector operations which, when combined with the high dimensionality of the data, become a critical concern. In this chapter, we describe in detail the different approaches we take to make the identification process possible. Before we do that, let us fix some notation and terminology. Let us assume that we have a database D, which is comprised of N fingerprints. The fingerprints are modeled as normally distributed random sequences p_i = [p_i^1, p_i^2, ..., p_i^n] such that p_i^j = X_i^j + µ_i^j, with µ_i^j ∼ N(0, σ²) and X_i^j ∼ N(0, 1 − σ²), for i = 1, ..., N and j = 1, ..., n. Furthermore, the distribution of the correlation between non-matching fingerprints will be

ρ(p_k, p_l) ∼ N(0, 1/n)        (3.0.1)

In this equation, since the fingerprints are non-matching, X_k and X_l are independent from each other. On the other hand, when X_k = X_l, meaning the fingerprints are matching, the distribution of the correlation would be

ρ(p_k, p_l) ∼ N(1 − σ², (2σ² − σ⁴)/n)        (3.0.2)
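The two models above are easy to check by simulation; the sketch below uses a small n, a single illustrative σ² and a modest number of trials (all values are chosen here, not taken from the thesis), and compares the empirical moments with Eqs. 3.0.1 and 3.0.2.

```python
# Monte Carlo check of the correlation models of Eqs. 3.0.1 and 3.0.2.
# n, sigma^2 and the number of trials are illustrative values only.
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, trials = 100_000, 0.9, 200

def corr(x, y):
    return np.sum(x * y) / np.sqrt(np.sum(x * x) * np.sum(y * y))

matching, nonmatching = [], []
for _ in range(trials):
    X = rng.normal(0.0, np.sqrt(1.0 - sigma2), n)      # shared true fingerprint
    p_k = X + rng.normal(0.0, np.sqrt(sigma2), n)      # two noisy estimates of
    p_l = X + rng.normal(0.0, np.sqrt(sigma2), n)      # the same fingerprint
    q = rng.standard_normal(n)                         # unrelated fingerprint
    matching.append(corr(p_k, p_l))
    nonmatching.append(corr(p_k, q))

print(np.mean(matching), 1.0 - sigma2)                 # mean close to 1 - sigma^2
print(np.var(nonmatching), 1.0 / n)                    # variance close to 1/n
```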

The sensor fingerprint verification process was discussed in the previous chapter. The naïve way of using this verification technique for identification purposes would be simply comparing all fingerprints in the database with the query sensor fingerprint. As can be deduced from Eq. 2.3.1, computing the correlation between two fingerprints of length n requires 3n multiplications. When calculating the PCE, on the other hand, n² multiplications need to be performed. However, since the PCE is built from the circular convolution between the two vectors, it can be obtained through the fast Fourier transform, which involves n log n + n complex multiplications. Considering typical values of n, where n > 10^6 even for a basic camera, and noting that each element of the fingerprint is a double precision number, it can be seen that the correlation and PCE operations are quite involved, yet ultimately negligible when only the match between two fingerprints is considered. However, in our case, the database contains a very large number of fingerprints, which increases the computational requirements immensely. In this case, the query fingerprint has to be compared with all the fingerprints in the database, which means that N fingerprints of size n must be loaded into main memory, and N correlation or PCE values must be calculated. Since N could range from hundreds up to many millions depending on the database, the computational load may become simply overwhelming. In [27], the authors proposed to use only the k elements of the camera fingerprints with the highest energy values, where k ≪ n. For each fingerprint in the database, only these high energy coefficients and their locations are stored, which results in a considerable storage gain. In this system, when a fingerprint is queried, a matching metric is calculated between the query fingerprint and the fingerprints in the database using only the elements in the previously stored locations.
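A sketch of this digest idea, as described above, is shown below: only the k largest-magnitude elements of a camera fingerprint and their locations are kept, and the correlation is computed over those positions only. The value of k and the function names are illustrative.

```python
# Sketch of the fingerprint-digest idea of [27] described above: store only the
# k highest-energy elements of a camera fingerprint and their locations, and
# correlate over those positions. The value of k is illustrative.
import numpy as np

def make_digest(camera_fingerprint, k=50_000):
    idx = np.argpartition(np.abs(camera_fingerprint), -k)[-k:]  # k largest |.|
    return idx, camera_fingerprint[idx]

def digest_corr(digest, database_fingerprint):
    idx, x = digest
    y = database_fingerprint[idx]                 # same locations in the target
    return np.sum(x * y) / np.sqrt(np.sum(x * x) * np.sum(y * y))
```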

This would ideally result in an n/k-times speed-up both in memory load time and in computation time. Obviously, one needs to choose k carefully, as the probability of error increases with decreasing k. One limitation of the above approach is the assumption that the database only stores fingerprints of known sources, i.e., camera fingerprints. In the case where the database stores fingerprints obtained from individual images, this method cannot be used directly, because determining the k most significant fingerprint elements requires access to the corresponding camera fingerprint. In those cases, when the query fingerprint is a camera fingerprint, its k highest energy elements can be used for matching while still keeping the sensor fingerprints in full length. Although this would eliminate all the advantages due to storage and memory operations, the computation of the matching metric will be fast. In the rest of this chapter, we will introduce two approaches aiming to increase the efficiency of the sensor identification process. In Section 3.1, we will show how to quantize the fingerprints and discuss the effects of quantization. In Section 3.2, we will present a group testing based binary search tree procedure.

3.1 Through Sensor Fingerprint Quantization

Efficient similarity matching in large databases has been studied in many different research areas in multimedia, including biometrics and video copy detection. Different approaches have been proposed on how to index and store the database so that when an object is queried, the entities in the database can be accessed easily and matched efficiently. These approaches all try to obtain a more compact representation for the data. For this purpose, both the structure (like minutiae points in biometric fingerprints) and inherent dependencies (like eigenfaces for face images) in the data are exploited. Although the problem setting in sensor fingerprint matching is much like these problems, the proposed solutions cannot be trivially extended here, as sensor fingerprints do not have the same structural properties and do not exhibit systematic dependencies. To increase the efficiency of sensor fingerprint matching in large databases, in this section, we propose to apply an information reduction operation and represent sensor fingerprints in a quantized form. Ideally, we would like to obtain a representation as compact as possible. Therefore, we particularly focus on binary quantization and, essentially, use only each element's sign information, disregarding magnitude information completely. Hence, given two real-valued fingerprints X and Y to be matched, their quantized versions X̂ and Ŷ are obtained by the following relation:

X̂_j = −1 if X_j < 0, +1 if X_j ≥ 0;   Ŷ_j = −1 if Y_j < 0, +1 if Y_j ≥ 0        (3.1.1)
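In code, the quantization of Eq. 3.1.1 is a one-line sign operation; a minimal sketch:

```python
# Binary quantization of Eq. 3.1.1: keep only the sign of each element,
# mapping negative values to -1 and non-negative values to +1.
import numpy as np

def binarize(fingerprint):
    return np.where(fingerprint < 0, -1, 1).astype(np.int8)
```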

In what follows, we investigate how binarization of sensor fingerprints affects matching performance and describe the advantages gained from the use of binary fingerprints.

Effect on Matching Performance

The main question that arises is what the loss in matching performance is when using a binarized version of a fingerprint. In this sub-section, we show by analysis as well as simulation that the loss is small. The information loss due to quantization of fingerprints will obviously cause a degradation in the detection statistic, i.e., the initially computed correlation and PCE values. Given the definitions of the two statistics in Eqs. 2.3.1 and 2.3.2, and defining X̂ and Ŷ as the binary-quantized versions of the two fingerprints X and Y, respectively, the corresponding correlation, ρ̂, and PCE, P̂, values can be computed in terms of ρ and P as follows:

ρ̂ = corr(X̂, Ŷ) = 4Q(0, 0; ρ) − 1        (3.1.2)

P̂ = pce(X̂, Ŷ) = (4Q(0, 0; c(0)) − 1)² / E[(4Q(0, 0; c(k)) − 1)²],   k = 1, ..., n        (3.1.3)

where Q(0, 0; ρ) is the two-dimensional Q-function defined as

Q(x, y; ρ) = 1/(2π√(1 − ρ²)) ∫_x^∞ ∫_y^∞ exp( −(x₁² + y₁² − 2ρx₁y₁) / (2(1 − ρ²)) ) dx₁ dy₁        (3.1.4)

The details of the derivation can be found in Appendix A.
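For a zero-mean, unit-variance bivariate normal pair, the orthant probability gives the closed form 4Q(0, 0; ρ) − 1 = (2/π) arcsin(ρ); this identity is not stated in the thesis but is a standard result, and it is consistent with the linear fit ρ ≈ 1.57ρ̂ reported later (1.57 ≈ π/2 for small ρ). The sketch below checks Eq. 3.1.2 against it by Monte Carlo; the ρ values and sample size are illustrative.

```python
# Monte Carlo check of Eq. 3.1.2. For a zero-mean bivariate normal pair, the
# orthant probability gives 4*Q(0,0;rho) - 1 = (2/pi)*arcsin(rho), which is
# used here instead of evaluating the double integral of Eq. 3.1.4.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
for rho in (0.05, 0.1, 0.3):
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n)
    rho_hat = np.mean(np.sign(x) * np.sign(y))          # corr of +/-1 sequences
    print(rho, rho_hat, 2.0 / np.pi * np.arcsin(rho))   # empirical vs analytical
```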

An alternative setting for fingerprint matching can also be considered by assuming the case where one of the fingerprints is binary-valued and the other is real-valued, i.e., between X and Ŷ or vice versa. This would correspond to a scenario where the sensor fingerprints in the database are kept in binary form, for reasons of efficiency, and the query fingerprint is real-valued. In this case, the correlation ρ̌ between the real-valued query X and the binary-valued database fingerprint Ŷ can be calculated by numerically solving the integral given in Eq. 3.1.4. Figure 3.1 demonstrates through analytical computation how ρ̂ and ρ̌ change with respect to ρ. It is no surprise that ρ > ρ̌ > ρ̂ over the entire range of values, except when ρ gets closer to one. In this regime, ρ̌ is lower than ρ̂. This is expected because correlation yields high values only if the correlated sequences are extremely related to each other in terms of the magnitude values of the coefficients. Therefore, when two fingerprints are related to each other, their binary versions will also be related. For example, consider the case where X = Y. In this case, ρ = ρ̂ = 1. However, the relation between real-valued and binary-quantized versions will remain limited, even when the two fingerprints are the same prior to quantization, i.e., ρ̌ < 1. Regardless of this issue, it can be noticed that there is a significant drop in ρ̂ when ρ takes values in the range of ρ = 0.4 to ρ = 0.9. This essentially implies that the use of binary fingerprints will incur a significant performance penalty; however, as earlier studies demonstrate [11, 28], the correlation between two sensor fingerprints can rarely reach as high as 0.3. Therefore, in practice, we are

only interested in the ρ < 0.3 regime, where the gap between ρ̂ and ρ is relatively small. Numerical simulations were also performed to demonstrate the accuracy of the analytical results. For this purpose, we created a database of synthetically generated fingerprints. We start with a sequence, X = [X^1, ..., X^n], of independent samples from a normal distribution, where X^j ∼ N(0, 1 − σ²) and j = 1, ..., n. This sequence is then mixed with zero-mean white Gaussian noise sequences at varying power levels to generate the fingerprints, p_i^j = X^j + µ_i^j, with µ_i^j ∼ N(0, σ²) and i = 1, ..., 100. It should be noted that each fingerprint is a sequence of normally distributed, zero-mean, unit-variance, independent random variables, i.e., p_i^j ∼ N(0, 1). Also note that the designated correlation between two such sequences will be ρ = 1 − σ². To obtain distributions of correlation coefficients, for each σ and therefore each ρ value, 100 fingerprints were generated and correlated with each other. The mean correlation values with respect to different ρ values are plotted with circle markers in Figure 3.1. These synthetic fingerprints were then quantized into binary values as described in Eq. 3.1.1. Mean correlation values were also obtained by correlating binary-valued fingerprints with each other and by correlating binary-valued and real-valued fingerprints in the two sets. The corresponding results are plotted using square and star markers, respectively, in Figure 3.1. It can be seen that the simulation results fit perfectly with the analytical findings. Likewise, a simulation analysis was performed to determine the change in the PCE metric due to quantization.

Figure 3.1: The change in correlation due to binary-quantization of sensor fingerprints. The solid line shows the theoretical correlation between the two real-valued fingerprints, the dashed line shows the correlation between a real and a binary fingerprint, and the dotted line shows the correlation between two binary fingerprints. Circles show the correlation between two real-valued fingerprints obtained through numerical simulation. Similarly, stars show the correlation between a real- and a binary-valued fingerprint and rectangles show the correlation between two binary-valued simulated fingerprints.

The corresponding results are given in Figure 3.2. This figure also shows that the reduction in PCE is considerably small in the region of interest, i.e., when ρ < 0.3. Now that we have determined how the correlation changes after quantization, we can estimate the probability of error (POE) approximately. For non-quantized full fingerprints, Goljan et al. [29] calculated the POE by assuming that the distribution of correlation coefficients computed among non-matching fingerprints (i.e., fingerprints associated with different image sensors) follows Equation 3.0.1, ρ ∼ N(0, 1/n), and

the distribution between matching fingerprints (i.e., fingerprints obtained from images taken by the same image sensor) follows Equation 3.0.2, ρ ∼ N(1 − σ², (2σ² − σ⁴)/n).

Figure 3.2: The change in PCE due to binarization of fingerprints. The solid line shows the PCE between two real simulated fingerprints, the dashed line shows the PCE between a real and a binary simulated fingerprint, and the dotted line shows the PCE between two binary simulated fingerprints.

Correspondingly, for a database consisting of N − 1 non-matching fingerprints and one matching fingerprint, the probability of detection and false alarm rates are calculated as:

P_FA = 1 − (1 − Q(τ_ρ √n))^N,  and  P_D = (1 − Q(τ_ρ √n))^(N−1) · Q( √n (τ_ρ − 1 + σ²) / √(2σ² − σ⁴) )        (3.1.5)

These results can be extended to compute the error rates after quantization of fingerprints as well. Note that the distribution of the correlation between two non-matching binary fingerprints will be approximately the same as that of real-valued fingerprints, while that between matching binary finger-

50 prints. The distribution of correlation between matching binary-valued fingerprints will change though. In Figure3.1, it can be observed that in the region of interest, there s almost a linear relation between ρ, ˇρ and ˆρ, which can be approximated as ρ 1.57ˆρ and ρ 1.25ˇρ by simple linear curve fitting. Using this approximation,the distribution of correlation between matching binary fingerprints can be determined as ˆρ N ( 1 σ2, 2σ2 σ n and false alarm rates can be obtained as: ). Correspondingly, the probability of detection ˆP F A = 1 (1 Q(1.57τˆρ n)) N, and (3.1.6) ˆP D = (1 Q(1.57τˆρ n)) N 1 Q( 1.57 n(τˆρ 1 σ ) 2σ2 σ 4 ). Similarly, for the real-valued query fingerprint and the binary-valued database fingerprint case, ˇρ N ( 1 σ2 rates would be:, 2σ2 σ n ) and probability of detection and false alarm ˆP F A = 1 (1 Q(1.25τˇρ n)) N, and (3.1.7) ˆP D = (1 Q(1.25τˇρ n)) N 1 Q( 1.25 n(τˇρ 1 σ ) 2σ2 σ 4 ). To compare the performance, ROC curves are plotted for a few selected ρ values. Figure 3.3 displays the ROC curves corresponding to use of real-valued and binary-valued fingerprints during matching. ROC curves are obtained for different ρ values and N is set to It can be seen in all these figures that as ρ increases ROC curves get closer to each other. It should be noted that, in practice, when performing fingerprint matching ρ will take values in the ρ > 0.02 range for matching 26

Table 3.1: Comparison of resource requirements of different methods proposed for fingerprint matching

Method                  | # of comparisons | Storage need (bits) | Data to be loaded (bits) | # multiplications (ρ) | Complexity of mult.
Conventional            | N                | 64Nn                | 64Nn                     | 3n                    | d²
Tree based search       | (N/t) log₂ t     | 128Nn               | 64 (N/t) log₂ t · n      | 3n                    | d²
Short digest (database) | N                | 64Nk                | 64Nk                     | 3k                    | d²
Short digest (query)    | N                | 64Nn                | 64Nn                     | 3k                    | d²
Binary-quantization     | N                | Nn                  | Nn                       | n                     | 1

Computation and Storage Aspects

From an application standpoint, the most important implication of quantization is the reduction in computation and storage as compared to the conventional approach. Considering a database of sensor fingerprints and a query fingerprint to be matched against this database, Table 3.1 provides the computational and storage requirements of the conventional method for fingerprint matching and of the schemes proposed in [27, 30] in comparison to the proposed approach. It should be noted that in this table, N represents the number of fingerprints in the database, n is the fixed length of each fingerprint, t is the number of elements in a tree as defined in [30], k ≪ n is the length of the short digest as defined in [27], and d represents the number of bits needed to store each fingerprint element.

Figure 3.3: ROC curves comparing performance when fingerprint matching involves (a) only real-valued fingerprints, (b) real- and binary-valued fingerprints, and (c) only binary-valued fingerprints.

As mentioned before, for a fair comparison, we consider two versions of the digest based approach described in [27]. In the first one, the database only stores real-valued digests obtained from the camera fingerprints. In the second one, however, only the digest for the query fingerprint is available and the database stores full-length fingerprints. It can be seen in the table that the least number of matching operations can be achieved only with the tree based structure. All other approaches will require a linear number of matching operations to be conducted. From the storage point of view, binarization reduces the storage requirement by a factor of 64, assuming the other techniques use a 64-bit floating point number to store fingerprint elements. Correspondingly, I/O operations that involve transfer of fingerprints from and to disk storage will be much faster as compared to other approaches. This aspect is very important, as I/O operations are the main bottleneck for sensor fingerprint matching in a large database. With the use of binary-valued fingerprints, the computational complexity of correlation decreases as well. For binary fingerprints, only n multiplications are needed instead of 3n, since √(Σ_{j=1}^{n} (X̂_j)²) · √(Σ_{j=1}^{n} (Ŷ_j)²) = n. Further, the correlation between two binary-valued fingerprints effectively reduces to the computation of the Hamming distance between two sequences, which can be implemented much faster. Let d_H = Hamming(X̂, Ŷ) represent the Hamming distance between two binary fingerprints X̂ and Ŷ. The correlation ρ̂ can be expressed in terms of d_H as

ρ̂ = (n − 2 d_H) / n        (3.1.8)
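A sketch of this Hamming-distance formulation with bit-packed fingerprints is given below; the packing convention (one sign bit per element via numpy's packbits) is a choice made here for illustration.

```python
# Sketch of correlation via Hamming distance (Eq. 3.1.8). Fingerprints are
# stored packed, one sign bit per element; the Hamming distance is the number
# of set bits in the XOR of the two packed arrays.
import numpy as np

def pack_signs(fingerprint):
    return np.packbits(fingerprint >= 0)          # 1 bit per element, 8 per byte

def corr_from_hamming(packed_x, packed_y, n):
    d_h = int(np.unpackbits(np.bitwise_xor(packed_x, packed_y)).sum())
    return (n - 2 * d_h) / n                      # Eq. 3.1.8

# Example with two weakly correlated synthetic fingerprints of length n
rng = np.random.default_rng(3)
n = 10 ** 6
x = rng.standard_normal(n)
y = 0.1 * x + rng.standard_normal(n)
print(corr_from_hamming(pack_signs(x), pack_signs(y), n))
```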

On the contrary, with floating-point numbers, the complexity of multiplication depends on the number of bits in the floating-point representation.

Experimental Results

To demonstrate the performance and efficiency of the proposed approach, 300 images from each of three different digital cameras, a Canon Powershot A80, a Sony Cybershot S90, and a Canon Powershot S1 IS, were collected at their native resolutions. We refer to these cameras as Camera A, B and C, respectively, for the sake of simplicity. All images were then cropped to a fixed resolution. Sensor fingerprints were extracted from each image. For each camera, 100 images were averaged together to obtain the camera fingerprint, while the rest of the sensor fingerprints from single images were saved in a database of 600 fingerprints. All the fingerprints, including the camera fingerprints, were binarized and saved in another database.

Performance Results

In the experiments, three different settings were considered. The first setting corresponds to the conventional fingerprint matching procedure where both the query camera fingerprint and the database fingerprints are real-valued.

Figure 3.4: The distribution of intra-camera correlations for (a) Camera A, (b) Camera B, and (c) Camera C, obtained between the camera fingerprints and fingerprints of images captured by the same camera under the three settings.

The second one refers to the case where the query camera fingerprint is real-valued and all the database fingerprints are binary-valued. Finally, the third setting represents the case where both query and database fingerprints are binary-valued. In our experiments, since we assume a camera fingerprint is available and used as the query to the database, a camera-dependent threshold is used rather than a fixed threshold for all tests. Therefore, distinct thresholds for both the correlation and PCE statistics were determined for each camera under the three settings. For all settings, first, the correlation and PCE between the camera fingerprints and the fingerprints of images captured by the same camera were calculated. Later, the same metrics between the camera fingerprints and the fingerprints of images captured by other cameras were calculated. The threshold for each metric was set in such a way that no false positives would occur. The true positive rates are shown in Table 3.2. From this table, it can be seen that only for Camera C did quantization cause a slight increase of nearly 0.3% in the error rate for both the correlation and PCE metrics. To get a closer look at how the correlation changes, Gaussian curves were fit to the histograms of correlation values obtained in the previous experiment. Figure 3.4 shows the distributions of intra-camera correlations, where camera fingerprints are correlated with fingerprints from images of the same camera under the three settings. Similarly, Figure 3.5 provides the distributions of inter-camera correlations obtained between camera fingerprints and the fingerprints of images from other cameras for all settings.

Figure 3.5: The distribution of inter-camera correlations for (a) Camera A, (b) Camera B, and (c) Camera C, obtained between the camera fingerprints and fingerprints of images captured by other cameras under the three settings.

Table 3.2: Change in True Positive Rates due to Binarization Using Camera-Dependent Thresholds

Camera ID | Metric | Real-Real | Real-Binary | Binary-Binary
Camera A  | ρ      | 100%      | 100%        | 100%
Camera A  | P      | 100%      | 100%        | 100%
Camera B  | ρ      | 100%      | 100%        | 100%
Camera B  | P      | 100%      | 100%        | 100%
Camera C  | ρ      | 100%      | 99.82%      | 99.64%
Camera C  | P      | 100%      | 100%        | 99.82%

In these figures, straight lines correspond to the conventional setting for fingerprint matching, dashed lines to the second setting where only the database fingerprints are binary, and dotted lines to the third setting where all fingerprints are binary-valued. (In all the figures, the x-axis is fixed.) As expected, the mean correlation value drops with the use of binary fingerprints. However, since the variance also decreases, it compensates for the errors due to the reduction in the mean. It can also be seen that the mean correlation values for Camera C are very low even in the case where both fingerprints are real-valued. Alternatively, for each setting and for each camera, we plotted histograms of the correlation of camera fingerprints with the fingerprints of the images from the same camera and from other cameras on the same figure, as shown in Figure 3.6. Using the sample distributions in these figures, we can analytically calculate the change in probability of error due to quantization. Table 3.3 presents the false reject rates

Table 3.3: Change in Probability of Error Due to Binary Quantization

Camera ID   Real-Real   Real-Binary   Binary-Binary
Camera A    3.59%       4.39%         4.47%
Camera B    2.51%       3.42%         3.92%
Camera C    3.05%       5.59%         10.27%

for each camera at a fixed false positive rate. These results show that the increase in probability of error for Cameras A and B is minimal after binarization; however, for Camera C an increase of 7% is observed. This was expected, since the correlations of fingerprints from Camera C with each other were relatively low compared to the other cameras in the first place. We also investigated the possibility of combining quantization with the idea of using short fingerprint digests for matching, as described in [27]. For this purpose we conducted an experiment by selecting the k highest-energy coefficients and converting them to binary as before. In Figure 3.7, the matching accuracy for Camera C is given under the three settings for varying k. It can be seen that when k > 25000, the drop in accuracy for both the real-binary and binary-binary settings is very small. This shows that the two methods can be combined for faster computation of correlation.
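To make the combination concrete, the following is a minimal C sketch of how a binarized digest of the k highest-energy coefficients might be formed. It is illustrative only: the index selection by absolute value and the sign-based quantization rule are assumptions made here, since the exact digest construction of [27] and the binarization equation are given elsewhere in the thesis.

```c
#include <stdlib.h>
#include <math.h>
#include <stdint.h>

/* Fingerprint being digested; file-scope so the qsort comparator can see it. */
static const float *g_fp;

static int cmp_energy_desc(const void *a, const void *b)
{
    float ea = fabsf(g_fp[*(const int *)a]);
    float eb = fabsf(g_fp[*(const int *)b]);
    return (ea < eb) - (ea > eb);            /* sort descending by |value| */
}

/*
 * Build a binary digest from the k highest-energy coefficients of a
 * real-valued fingerprint fp of length n.  idx[] receives the chosen
 * coefficient positions and bits[] their signs (+1 / -1).
 */
void binary_digest(const float *fp, int n, int k, int *idx, int8_t *bits)
{
    int *order = malloc((size_t)n * sizeof *order);   /* index array to sort */
    for (int i = 0; i < n; i++) order[i] = i;

    g_fp = fp;
    qsort(order, (size_t)n, sizeof *order, cmp_energy_desc);

    for (int j = 0; j < k; j++) {
        idx[j]  = order[j];
        bits[j] = (fp[order[j]] >= 0.0f) ? 1 : -1;    /* assumed sign quantizer */
    }
    free(order);
}
```

The digest keeps both the retained positions and their signs, so a later binary comparison can be restricted to those k coefficients only.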

Figure 3.6: The distribution of correlations. (a), (b), and (c) correspond to Camera A, (d), (e), and (f) to Camera B, and (g), (h), and (i) to Camera C. The first column is the correlation between real-valued fingerprints, the second column is the correlation between real-valued and binary-valued fingerprints, and the last column is the correlation between two binary-valued fingerprints.

Figure 3.7: The accuracies for varying k values in each case for Camera C. The solid line shows the real-real case, the dashed line the real-binary case, and the dotted line the binary-binary case.

Efficiency Results

Having established the performance of fingerprint matching with binary fingerprints, we now examine the gain obtained in terms of storage and computational requirements. To test the efficacy of the method, 1000 fingerprints were extracted from random images. These fingerprints were then cropped to a fixed size and vectorized. The same fingerprints were also quantized into binary fingerprints as before. In our experiments, both Matlab and C implementations were used for the evaluations. It must be noted, however, that none of these implementations were optimized for best performance, and better results could be achieved. Our intention is to demonstrate the relative improvements that can be obtained, not to set performance bounds.

To quantify the storage gain, Matlab was first used to save the fingerprints in its proprietary .mat format. Noting that Matlab applies its own compression, each real-valued fingerprint required 5750 KB of storage space. For the corresponding binary-valued fingerprints, however, the storage space was reduced to 127 KB, yielding a storage gain of around 45 times. The real-valued fingerprints were also saved in an uncompressed file format using a C implementation. This required 6145 KB of storage space for a real-valued fingerprint and 97 KB for a binary fingerprint, which corresponds to a 64 times reduction in file size. To observe the speed-up in I/O loading times, both the real-valued and binary-valued fingerprints were loaded into main memory one by one using the Matlab and C implementations. Since each element of a real-valued fingerprint is stored in double-precision floating-point format, loading involves reading a sequence of doubles. Noting that the smallest addressable storage unit is a character, which uses one byte, each element of a binary-valued fingerprint would have to be stored in a byte. Therefore, the improvement in loading time will not reflect the reduction in storage size. To overcome this limitation, we performed an 8-bit encoding in which the elements of the binary-valued fingerprints are grouped into blocks of eight bits and stored in byte format. Experiments show that with this naive scheme, loading time in Matlab improves eight times, whereas our C implementation became 21 times faster compared to the loading time of real-valued fingerprints.
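As a rough illustration of the 8-bit encoding described above, the following C sketch packs a ±1 binary fingerprint into bytes and unpacks it again; the ±1 representation and the bit ordering are assumptions made here for illustration, not necessarily those of the actual implementation.

```c
#include <stdint.h>
#include <stddef.h>

/* Pack a fingerprint of +1/-1 values into bytes, 8 elements per byte
 * (a +1 becomes a 1-bit, a -1 becomes a 0-bit).  n need not be a
 * multiple of 8; the unused trailing bits are left as zero. */
void pack_fingerprint(const int8_t *fp, size_t n, uint8_t *packed)
{
    for (size_t i = 0; i < (n + 7) / 8; i++)
        packed[i] = 0;
    for (size_t i = 0; i < n; i++)
        if (fp[i] > 0)
            packed[i / 8] |= (uint8_t)(1u << (i % 8));
}

/* Recover the +1/-1 fingerprint from its packed form. */
void unpack_fingerprint(const uint8_t *packed, size_t n, int8_t *fp)
{
    for (size_t i = 0; i < n; i++)
        fp[i] = ((packed[i / 8] >> (i % 8)) & 1u) ? 1 : -1;
}
```

Stored this way, a fingerprint of n elements occupies roughly n/8 bytes, which is where the 64 times reduction relative to 8-byte doubles comes from.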

The other important concern is the speed-up in the computation of the decision statistics. To measure this, we picked a fingerprint from the database and correlated it with all of the 1000 fingerprints. In the Matlab implementation, for real-valued fingerprints we use the corr2 command, and for binary-valued fingerprints we compute the dot product of the fingerprints and divide by the fingerprint length. These operations were also implemented in C. In both implementations, we achieved around a four times speed-up in the correlation computation. It must be noted, however, that although the simple encoding approach described above is suitable for fast loading of binary fingerprints into memory, it nevertheless requires decoding the data stored in memory to obtain the fingerprints, which incurs additional computation and time. To avoid decoding, we considered two approaches that operate directly on the packed byte values to compute the Hamming distance between two binary fingerprints. One approach is based on XORing bytes and counting the number of one bits, and the other uses a precomputed look-up table over the 8-bit byte values (a sketch of this byte-wise computation is given at the end of this subsection). These methods yielded a nearly 9 times faster computation of the correlation between binary-valued fingerprints compared to real-valued fingerprints. Overall, it can be stated that binary quantization of fingerprints, compared to the use of real-valued fingerprints, enables a 64 times gain in storage space, at least a 21 times speed-up in loading time, and a 9 times speed-up in the computation of correlation, while providing almost the same matching accuracy. Obviously, these

improvements will be further enhanced when binarization is combined with the short digest method, which by itself improves correlation time by more than 20 times, as demonstrated in the previous subsection.
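To make the byte-wise computation concrete, the following sketch combines the two ideas mentioned above: bytes are XORed and the number of differing bits is read from a precomputed table, after which the Hamming distance is converted to a correlation through Eq. 3.1.8. It is an illustrative version only, assuming the 8-bit packing shown earlier rather than the exact implementation used in the experiments.

```c
#include <stdint.h>
#include <stddef.h>

static uint8_t popcnt8[256];   /* number of 1-bits in each possible byte value */

static void init_popcnt8(void)
{
    for (int v = 0; v < 256; v++) {
        int c = 0;
        for (int b = 0; b < 8; b++)
            c += (v >> b) & 1;
        popcnt8[v] = (uint8_t)c;
    }
}

/* Correlation between two packed binary fingerprints of n (+1/-1) elements,
 * computed from their Hamming distance d_H via Eq. (3.1.8):
 *     rho = (n - 2 * d_H) / n                                              */
double packed_correlation(const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t nbytes = (n + 7) / 8;
    size_t d_h = 0;
    for (size_t i = 0; i < nbytes; i++)
        d_h += popcnt8[a[i] ^ b[i]];        /* differing bits in this byte */
    return ((double)n - 2.0 * (double)d_h) / (double)n;
}
```

Here init_popcnt8() would be called once before the first comparison; after that, each correlation needs only one XOR and one table look-up per byte of the packed fingerprints.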

3.2 Through Binary Search Tree Structure Based on Group Testing

In this section we will use ideas inspired by group testing approaches to design a source device identification system that can potentially be used with a large collection of multimedia objects. A group test is a simultaneous test on an arbitrary group of items that can give one of two outcomes, positive or negative. The outcome is negative if and only if all the items in the group test negative. Group testing has been used in many applications to efficiently identify rare events in a large population [31, 32]. It has also been used in the design of a variety of multi-access communication protocols where multiple users are simultaneously polled to detect their activity [33]. Consider the classic fake coin problem as a simple example. Let us assume that we are given eight coins and we would like to know which one is fake (positive), given that the fake coin weighs less than a genuine one. If we divide the coins into two groups and weigh each, the group containing the fake coin should weigh less. Therefore, we can determine right away which group contains the fake coin. We can further divide this group into two groups and continue weighing until we find the fake coin, with only a logarithmic number of weighings.
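The halving strategy can be written down in a few lines. The C sketch below is purely illustrative: it simulates a weighing by comparing the average weight per coin in each half of the remaining candidates.

```c
#include <stdio.h>

/* Find the index of the single light (fake) coin by repeatedly "weighing"
 * the two halves of the remaining candidate range against each other.
 * Comparing averages rather than sums handles unequal half sizes. */
static int find_fake(const double *w, int n)
{
    int lo = 0, hi = n;                        /* candidates are in [lo, hi) */
    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;
        double left = 0.0, right = 0.0;
        for (int i = lo; i < mid; i++) left  += w[i];
        for (int i = mid; i < hi; i++) right += w[i];
        if (left / (mid - lo) < right / (hi - mid))
            hi = mid;                          /* lighter group: keep left half */
        else
            lo = mid;                          /* otherwise keep right half */
    }
    return lo;
}

int main(void)
{
    double coins[8] = {10, 10, 10, 9.5, 10, 10, 10, 10};   /* coin 3 is fake */
    printf("fake coin found at index %d\n", find_fake(coins, 8));
    return 0;
}
```

With eight coins this terminates after three weighings, matching the logarithmic count mentioned above.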

3.2.1 Binary Search Tree

Similar to the approach in the fake coin problem, given a query sensor fingerprint, rather than checking for a match against each object's fingerprint in the database, we can perform a match against fingerprint estimates that are combined together into a composite fingerprint. To be more clear, let us assume we have a database D = {p_i} of sensor fingerprint estimates extracted from N multimedia objects (i = 1, 2, ..., N). Each element of a fingerprint estimate is modeled as p_i^j = X_i^j + μ_i^j, as defined in the beginning of this chapter (X_i^j ~ N(0, 1 − σ²), μ_i^j ~ N(0, σ²), p_i^j ~ N(0, 1), and j = 1, ..., n). Let us also assume that we have a query fingerprint p_q = X_q + μ_q with the same properties. We define the composite fingerprint as the normalized summation of all N fingerprints in the database,

C = (1/√N) Σ_{i=1}^N p_i.    (3.2.1)

The factor 1/√N here makes the composite have unit variance. Our problem then turns into a hypothesis test:

H0: There is no matching fingerprint in the database D; D = {p_i : p_i = X_i + μ_i, X_i and X_q are independent}.

H1: There are one or more matching fingerprints in D; D = {p_i : p_i = X_i + μ_i, X_i = X_q for some i}.

For the null hypothesis, the distribution of the correlation between the query and the composite fingerprint will be the same as the distribution of the correlation between two

single non-matching fingerprints, ρ_N = ⟨C, p_q⟩ ~ N(0, 1/n), since the composite fingerprint will still be independent of p_q. For H1, let us assume the worst-case scenario where there is only one matching fingerprint in the composite. In this case, the distribution of the correlation can be calculated as follows:

ρ_M = ⟨C, p_q⟩ = (1/√N) Σ_{i=1}^N ⟨p_i, p_q⟩ = (1/√N) ( ⟨p_j, p_q⟩ + Σ_{i=1, i≠j}^N ⟨p_i, p_q⟩ )
    ~ (1/√N) [ N(1 − σ², (2σ² − σ⁴)/n) + N(0, (N − 1)/n) ].    (3.2.2)

Note that the two distributions in Equation 3.2.2 are highly dependent on each other, so we cannot simply assume that their variances add. However, for large N and n, we can assume that the contribution of the first term to the variance is negligible. Therefore, the distribution of the correlation of a query fingerprint with a composite that contains only one matching fingerprint can be written as

ρ_M = ⟨C, p_q⟩ ~ N( (1 − σ²)/√N, 1/n ).    (3.2.3)

To verify the correctness of this model, we generated a sample simulation database of N = 4096 normal fingerprints with zero mean and unit variance, {ps_i}, such that i = 1, ..., 4096 and ps_i^j ~ N(0, 1), j = 1, ..., 10^7. We also generated a query database of fingerprints from the first set, such that corr(ps_i, qs_i) = 0.1, by adding white Gaussian noise to the entries in the first database. The query fingerprints were also normalized to have zero mean and unit variance. In the first experiment, our aim was to measure the distribution of the correlation where the

composite fingerprint does not include a fingerprint similar to the query. For each query fingerprint (m = 1, ..., 4096) we measured the following correlation:

ρ_N^s = (1/√N) ⟨ qs_m, Σ_{k=1, k≠m}^N ps_k ⟩.    (3.2.4)

The distribution of ρ_N^s is shown in Figure 3.8 in blue. We also drew our analytical finding for the corresponding correlation distribution, ρ_N, on the same figure in red. Furthermore, we measured the correlation between the query fingerprints and a composite containing a matching fingerprint:

ρ_M^s = (1/√N) ⟨ Σ_{k=1}^N ps_k, qs_m ⟩.    (3.2.5)

Similarly, the distribution of ρ_M^s is shown in Figure 3.9 in blue, and the corresponding analytical finding, ρ_M, is drawn on the same figure in red. From these two figures, we can say that our model approximates the real situation very well. Now let us calculate the probability of false alarm and the probability of detection for a fixed threshold τ_N:

P_FA = Q(τ_N √n), and    (3.2.6)

P_D = Q( (τ_N − (1 − σ²)/√N) √n ).    (3.2.7)

Using these expressions, Figure 3.10 shows the corresponding ROC curves when n = 10^7 and the correlation between matching fingerprints is 0.1. This figure shows that the probability of error increases with the number of fingerprints in

Figure 3.8: Distribution of the correlation between a query fingerprint and a composite when the composite does not contain a matching fingerprint. The fingerprints are n = 10^7 in size and the composite contains N = 4096 fingerprints. The red line shows the analytical findings and the blue lines show the results on simulation data.

the composite. With the given parameters, N = 4096 seems to give a good trade-off between performance and efficiency. Therefore, if the composite has no matching fingerprint, we can determine this with only one correlation, which results in a 4096 times improvement. On the other hand, if there is a matching fingerprint in the composite, one should search the composite further. This can be realized with a binary search tree structure in which the database is split into two groups. Figure 3.11 illustrates the binary search tree created in this way. In this example, we assume that the database contains 8 fingerprint estimates extracted from 8 images. The leaves of the tree represent the fingerprint estimates p_i (i = 1, 2, ..., 8) and the

Figure 3.9: Distribution of the correlation between a query fingerprint and a composite when the composite contains a matching fingerprint. The fingerprints are n = 10^7 in size and the composite contains N = 4096 fingerprints. The red line shows the analytical findings and the blue lines show the results on simulation data.

parent nodes represent the normalized sum of their children. We also assume we have a query fingerprint f_A, and that one of the fingerprint estimates in the database exhibits the same fingerprint as the query (p_3 = f_A + η). The matching fingerprint can be identified by marching along the branches of the search tree, from top to bottom, performing the hypothesis test at each level with a different threshold fixed for that level. In the figure, red arrows depict the route the algorithm follows before identifying p_3 at a leaf of the tree. A binary search tree constructed in this way can potentially yield a logarithmic reduction in identification complexity. However, as mentioned before, the probability of error increases with the size of

Figure 3.10: ROC curves comparing performance when a fingerprint is correlated with composite fingerprints, for n = 10^7.

the tree. On the other hand, it is always possible to build higher trees; all we need to do is essentially perform a hypothesis test at each node and, based on the probability that we obtain, decide whether or not to go down. In the worst-case scenario, a match will be detected at every level. In that case, the complexity of the binary search tree would be O((Nn/h) log(h)), where h is the number of fingerprints in the tree. For example, when N = 4096 and there is a single matching fingerprint, the improvement will be 4096/12 = 341 times. In addition, when a database contains more than one fingerprint from the same device, random splitting will further affect the performance of the method. Therefore, the BST should be constructed in such a way that the fingerprints of media objects

Figure 3.11: Binary search tree.

captured by the same device are placed close together in the branches of the tree. In the next section, a hierarchical clustering based tree building scheme that addresses this problem will be explained. Another problem the binary search tree introduces is storage space. Note that the leaves of the tree represent the fingerprints in the database, and the internal nodes are their composites. Since we need to keep all the nodes as well, the storage space doubles. Because of the size of sensor fingerprints, this might create a big problem. However, we believe that in the applications we consider, such as legal applications, more resources can be devoted to storage in order to gain efficiency.
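The descent described above can be sketched in C as follows. This is a simplified illustration rather than the exact implementation used in the experiments: the node layout, the recursive descent into every child whose composite passes the level threshold, and the per-level thresholds passed in as an array are all assumptions made for the sketch.

```c
#include <stddef.h>

/* A node of the binary search tree.  Leaves hold a single fingerprint
 * estimate; internal nodes hold the normalized sum of their children. */
typedef struct node {
    const float *composite;   /* composite (or leaf) fingerprint, length n */
    struct node *left, *right;
    int object_id;            /* valid only at leaves */
} node_t;

/* Normalized correlation between two zero-mean, unit-variance fingerprints. */
static double correlate(const float *a, const float *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += (double)a[i] * (double)b[i];
    return s / (double)n;
}

/* Descend the tree, testing the query against each composite and going
 * down only into subtrees whose correlation exceeds the threshold fixed
 * for that level (e.g., chosen from Eq. 3.2.6 for a target false-alarm
 * rate at the composite size of that level).  Matching leaf ids are
 * appended to out[], assumed large enough; the match count is returned. */
static int search(const node_t *nd, const float *query, size_t n,
                  const double *level_thresh, int level,
                  int *out, int nout)
{
    if (nd == NULL || correlate(nd->composite, query, n) <= level_thresh[level])
        return nout;                       /* H0 accepted: prune this subtree */
    if (nd->left == NULL && nd->right == NULL) {
        out[nout++] = nd->object_id;       /* leaf: report a match */
        return nout;
    }
    nout = search(nd->left,  query, n, level_thresh, level + 1, out, nout);
    nout = search(nd->right, query, n, level_thresh, level + 1, out, nout);
    return nout;
}
```

Retrieving further objects from the same device would then amount to subtracting the matched estimate from the composites along the reported path and repeating the search, as described in Section 3.2.3.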

3.2.2 Building the Tree by Hierarchical Clustering

In this section, we describe a method to build the BST in a way that ensures that media objects captured by the same device are located close to each other in the tree. An obvious way to ensure this is to correlate each fingerprint estimate with the rest of the database and sort the estimates according to the correlation results. However, building the tree like this would take O(nN²) operations and therefore would not be feasible. In this thesis, we study a more efficient method to build the tree, based on hierarchical divisive clustering. The divisive clustering process starts at the top level with all the entities in one cluster. This top-level cluster is split using a flat clustering algorithm, and the procedure is applied recursively until each entity is in its own singleton cluster [34, 35]. The root of the tree contains a composite fingerprint obtained by summing all the fingerprint estimates in the database, C = Σ_{i=1}^N p_i. Each individual estimate is then correlated with this composite. The estimates are sorted and divided into two equal-sized clusters based on their correlation values. The basic idea here is that if there is more than one media object from the same device, then the correlation values of the corresponding fingerprint estimates with the composite fingerprint should be close to one another. Let us assume that we have m objects from device A in our database, {p_j = f_A + μ_j : j = 1, 2, ..., m}. The correlation between the composite fingerprint C and a single

fingerprint estimate of camera A can be written as

ρ(C, p_q) = ⟨ f_A + μ_q, Σ_{j=1}^m (f_A + μ_j) ⟩ = m ‖f_A‖² + Σ_{j=1}^m ⟨ μ_q, μ_j ⟩.    (3.2.8)

Using this fact, the fingerprint estimates in the database are sorted according to their correlation results. It is expected that the fingerprint estimates of images from the same device will be listed in succession after the sorting operation. The database is then split into two subsets, and this process is repeated at every level within each subset, which makes the complexity of the tree building method O(nN log N).

3.2.3 Retrieving Multiple Objects

The search procedure also needs to accommodate operations like retrieving several media objects captured by the same imaging device. To be able to retrieve all the media objects captured by the same device, the tree needs to be updated after every search. For this, the most recently matched fingerprint estimate is subtracted from the composite fingerprints of all its parent nodes. (In Figure 3.11, this is equivalent to subtracting the fingerprint estimate associated with p_3 from all the nodes in the path depicted with red arrows.) The tree can be restored to its initial form when the search for one device has ended. In a forensic setting, where it is important to limit false positives, the update and search operations can be repeated consecutively until the search yields a fingerprint whose correlation with the query fingerprint is lower than the preset threshold. On the other hand, this method can also be used in a retrieval setting. In this case, rather than setting a

threshold for the correlation value, the number of searches can be predetermined. In this way, one can retrieve as many objects as desired and eliminate false positives according to a threshold later on.

3.3 Experimental Results

To demonstrate the performance and efficiency of the proposed approach, we provide results corresponding to different experimental scenarios. For this, we collected 300 images from each of 5 different digital cameras, a Canon Powershot A80, Sony Cybershot S90, Sony Cybershot P72, Canon Powershot S1 IS, and Panasonic DMC FZ20, at their native resolutions. We also downloaded 17,000 images from the Internet. All images were then cropped to a fixed size. As a part of the offline step, fingerprint estimates are extracted from all the images and saved in a database to be used later. In the following experiments, the performance of the proposed approach is presented under different settings.

I. The first experiment is designed to evaluate the efficiency of the proposed method when the fingerprint of the sensor in question is present and there is only one image in the database exhibiting this fingerprint. For this purpose, we built a database of 1024 images containing only one image from each camera and 1019 images from the Internet. We obtained the fingerprints of the devices from 200 PRNU noise profiles associated with each device. With the conventional approach, where the device fingerprint is correlated with the

fingerprint estimate of each image, it took 7 minutes and 11 seconds to find an image captured by one of the devices, versus 9.61 seconds using the proposed approach, with no errors in either case.

Table 3.4: Performance Results with Binary Search Tree

Camera                   Number of Searches   Time              Performance
Panasonic DMC FZ20                            min., 34 sec.     50/50
Canon Powershot A80                           min., 11 sec.     50/50
Canon Powershot S1 IS    27                   6 min., 47 sec.   48/50
Sony Cybershot S90                            min., 47 sec.     47/50
Sony Cybershot P72                            min., 42 sec.     46/50

II. In this experiment, we built a database of images by mixing 50 images from each camera with the images from the Internet. Our goal here was to measure the performance and efficiency of our approach in identifying as many images as possible while minimizing false matches (i.e., false positives). For this purpose, we set a threshold on the correlation value. Sensor fingerprints of the devices were again obtained using 200 PRNU noise profiles from each camera. The search is repeated until a fingerprint estimate with a correlation lower than the threshold is found. Table 3.4 shows the number of searches, the time, and the number of images we were able to detect with our method.

Figure 3.12: The distribution of correlation values using the fingerprint of the Sony Cybershot P72 and the fingerprint estimates from the Sony Cybershot S90 (green) and the Sony Cybershot P72 (purple).

In Table 3.4, it can be seen that the errors primarily involve the Sony cameras. This is a result of the high correlation between the sensor fingerprints of the two Sony camera models. To further examine this phenomenon, in Figure 3.12 we provide the distribution of the correlation values between the fingerprint of the Sony P72 and the fingerprint estimates of itself and of the Sony S90. The results show that to minimize the error one needs to further increase the threshold value. This observation is also in line with the results of [36], where it is shown that fingerprints of cameras from the same manufacturer correlate better with each other due to the use of similar demosaicing algorithms, which reveals itself as a systematic artifact in the extracted PRNU noise profiles. Further improving these results requires removal of such demosaicing artifacts from the PRNU noise profiles during the offline step.

Table 3.4 also shows that to identify 50 images one does not need to perform 50 searches, as the fingerprints of the same device were mostly placed in neighboring leaves of the tree. In this case, the average time to detect 50 images was around 6 minutes and 40 seconds with our method. On the other hand, a linear search would take more than 2 hours.

III. In order to show how the proposed method can be used for a retrieval task as described in Section 3.2.3, we use the same setting as in the previous experiment. In this case, we repeatedly update the tree until we identify all the fingerprint estimates associated with the given cameras. Figure 3.13-a shows the precision-recall diagrams for all the cameras for which errors occurred during matching, to show how many searches have to be performed. The figure indicates that the worst precision is about 0.5, which means that we need to search at most 100 times to find all the relevant fingerprints in the database. In this experiment, we also demonstrated a case where the group-based approach is not used and the tree is built by splitting the fingerprint estimates randomly. The red line in Figure 3.13-a shows the precision-recall diagram after 100 searches where the fingerprint of the Sony S90 is used to search over the database. As expected, the search accuracy with random splitting is inferior to the structured case. This is primarily because when fingerprint estimates are distributed over the nodes randomly, the nodes will not be far enough apart. As a result, fingerprint estimates associated with a given node

are more likely to be close to other node descriptors.

IV. In this experiment, we investigate the impact of the device fingerprint's quality on the search results. For this purpose, a tree is constructed from 4096 fingerprint estimates, 50 of which came from the Sony Cybershot S90. Then 4 sensor fingerprints are generated for the Sony Cybershot S90 by averaging 50, 100, 150, and 200 fingerprint estimates coming from the same camera. In addition, to test the limits, we also used a single fingerprint estimate as the device fingerprint during the search. The precision-recall diagrams corresponding to the 5 different device fingerprints are presented in Figure 3.13-b. The results show that performance does not change much with the number of fingerprints used when generating the device fingerprint. It must be noted that even though we used a single fingerprint estimate as the device fingerprint, this promising result was achieved because related fingerprints were located in neighboring leaves of the tree, which caused the nodes to act almost as device fingerprints. Although we present results for one camera, we observed that performance was very similar for the other cameras as well.

V. We conducted a final experiment to test how the fingerprints associated with images from a given camera are placed in the search tree. For this purpose, we constructed trees using 64, 128, 256, 512, 1024, 2048, and 4096 fingerprint estimates. In each case, roughly 5% of the images came from the same camera (Sony S90). The experiments showed that for tree sizes of 64, 128,

Figure 3.13: (a) Precision-recall diagrams for 3 cameras when the group-based approach is used, and for the Sony S90 when the tree is built by random splitting. (b) Precision-recall diagrams for the Sony S90 with different quality fingerprints.

and 256, all the fingerprints of the Sony S90 were placed successively. When the size was increased to 512 and 1024, we observed that in both cases only two fingerprints from the Internet images were placed in between. For sizes of 2048 and 4096, this number increased to three. These results show that our approach can successfully cluster fingerprints associated with a given camera when building the binary search tree.

Finally, we address the issue of using a threshold when deciding the validity of a match returned by the search operation. We noted earlier that to decide whether a database contains an image associated with a query sensor fingerprint, one needs to rely on a preset threshold below which a match is considered invalid. Obviously, setting such a threshold might lead to missed matches. Hence, the choice of threshold poses a trade-off between early termination of the search (i.e., avoiding false positives during matching) and an increased number of missed matches. Since our matching criterion is based on [11], the threshold values given there and the corresponding false-positive rates can also be used here when selecting a value for the threshold.

Chapter 4

Video Copy Detection Based on Source Device Characteristics: A Complementary Approach to Content-Based Methods

Video copy detection techniques are automated analysis procedures to identify duplicate and modified copies of a video among a large number of videos so that their use can be managed by content owners and distributors. These techniques are required to accomplish various tasks involved in identifying, searching, and retrieving videos from a database. Furthermore, due to the increase in the scale of video databases, the ability to accurately and rapidly perform these tasks becomes increasingly

crucial. For example, it is reported that the number of videos in the video sharing site YouTube's database had reached 73.8 million by March 2008 and that every day more than 150 thousand videos are uploaded to its servers (according to the Kansas State University Digital Ethnography Group's YouTube statistics report). In such systems, copy detection techniques are needed for efficient indexing, copyright management, and accurate retrieval of videos, as well as for the detection and removal of repeated videos to reduce storage costs. The development of monitoring systems that can track commercials and media content (e.g., songs, movies, etc.) over various broadcast channels is another application where copy detection techniques are needed. In any case, realizing the above tasks requires techniques that are capable of providing distinguishing characteristics of videos which are also robust to various types of modification. The most prominent approach in video copy detection has been to extract unique features from the audiovisual content [37, 38], and many content-based features have been proposed. These include color features like layouts [39], histograms [40, 41, 42], and coherence [43]; spatial features like edge maps [44] and texture properties [45, 46]; temporal features like shot length [47]; and spatiotemporal features like 3D-DCT coefficient properties [48] and differential luminance characteristics [49]. A video signature is generated either by organizing the computed features into suitable representations or by cryptographically hashing them to obtain more succinct representations. The resulting signatures are expected

to be unique and robust under common processing operations. These signatures are stored in a database for later verifying whether a given video matches. The biggest challenge in video copy detection is to retrieve duplicate or modified versions of a video while being able to discriminate it from other similar videos. Since a video can be modified in many different ways, including common video processing operations, overlaying graphical objects onto video frames, and insertion or deletion of video content, obtaining video signatures that are robust to all these types of modifications is a challenging task. Figure 4.1-a displays frames from a video and its contrast-enhanced version, which are expected to yield the same signature. Similarly, Figure 4.1-b displays a copy of a video with an overlaid advertisement. While robustness of the extracted video signatures is crucial for the success of video copy detection techniques, such a requirement, at the same time, makes it very difficult to differentiate between videos that are very similar in content. Figures 4.1-c and 4.1-d show frames from videos that are visually very similar but essentially different. Therefore, in the presence of many content-wise similar videos, detecting modified copies of a given video becomes a very challenging task, and the rapidly increasing size of video databases significantly exacerbates the problem. In the context of these difficulties, the main insight of this work is that the use of source device characteristics provides a new level of information that can help alleviate the above problems. The fact that source characteristics are not primarily content dependent makes them potentially very effective against problems arising due to similarity

Figure 4.1: (a) A video and its contrast-enhanced duplicate. (b) A video and its advertisement-overlaid version. (c) Two videos taken at slightly different angles. (d) Similar but not duplicate videos.

of content. Moreover, since source device characteristics are not equally subject to the constraints of the audiovisual content, they are not affected by common video processing operations in the same way, which makes them robust against certain modifications. Hence, incorporating source device characteristics with content-based features will improve the overall accuracy of video copy detection techniques. In this chapter of the thesis, we propose a new video copy detection scheme that utilizes unique characteristics of the imaging sensors used in cameras and camcorders. The underlying idea of the proposed scheme is that a video signature can be defined as a weighted combination of the fingerprints of the camcorders (i.e., their imaging sensors) involved in the generation of a video. The resulting signature essentially depends on various factors that include the duration of the video, the number of involved camcorders, the contribution of each camcorder, and partly the content of the video. We demonstrate

the viability of the idea on videos taken by several different camcorders and on several copies of duplicate and near-duplicate videos downloaded from YouTube. Our results show that signatures extracted from a set of videos downloaded from YouTube do not yield a false positive in detecting near-duplicate videos and that the signatures are robust to both temporal changes and various common processing operations.

4.1 Camcorder Identification Based on PRNU

Chen et al. [15] extended the approach in [11], which was described in Section 2.2, to videos in order to identify the source camcorder. Although digital cameras and camcorders are very similar in their operation, obtaining an estimate of the sensor fingerprint from a video is a more challenging task. As a comparison, for Internet-quality videos of size 264x352 at a 150 kb/sec bit-rate, the video duration needed to obtain a reliable fingerprint is around 10 minutes [15], whereas a few hundred images are typically sufficient to obtain the fingerprint of a digital camera. There are several reasons for this: (i) the frame sizes of typical videos are smaller, which decreases the information available for reliable detection; (ii) successive frames are very much alike, hence averaging successive instances of PRNU noise patterns does not effectively eliminate content dependency; and (iii) because of motion compensation, sensor fingerprints might be lost in some parts of the frames. Essentially, the accuracy of the fingerprint estimate depends on the quality (compression and resolution) and

the duration of the video (i.e., the number of frames). In Figure 4.2, we show the impact of the quality and length of the video on fingerprint estimates obtained from videos taken by 5 different camcorders. Each video is encoded at 1 Mbps and 2 Mbps bit-rates and divided into segments of 1000 frames and 1500 frames, and a PRNU noise pattern is extracted from each segment to obtain a fingerprint of the camcorder. By designating one of the camcorders as the reference, inter- and intra-correlations of the obtained fingerprints are computed with respect to the reference camcorder. It can be seen that for increasing quality and longer segments, the fingerprint estimates yield better differentiation of videos taken by the reference camcorder from videos taken by other camcorders.

4.2 Obtaining Source Characteristics Based Video Signatures

Since a video can be generated by a single camcorder or by combining multiple video segments captured by several camcorders, we define a video signature to be the weighted combination of the fingerprints of the involved camcorders. We therefore utilize a procedure similar to the one described in [15] for extracting the PRNU noise pattern from a video frame. We denoise each video frame with a wavelet-based denoising filter and extract the noise residues, which are then averaged together. The resulting pattern is the combination of the camcorder fingerprints, and it is treated as

Figure 4.2: The distribution of inter- and intra-correlation values. Distributions in blue indicate the correlation values of fingerprints associated with the videos shot by the reference camcorder. Distributions in red indicate the correlation values of fingerprints between the reference camcorder and the other camcorders. The bit-rate of the videos and the number of frames in each segment are (a) 1 Mbps and 1000 frames, (b) 1 Mbps and 1500 frames, (c) 2 Mbps and 1000 frames, and (d) 2 Mbps and 1500 frames.

the signature of the video. If a video is shot by, for example, two camcorders, the extracted signature will be the weighted average of the fingerprints of these two camcorders. The weighting will depend on the length of the video shot by each camcorder. To detect whether two videos are copies of each other, we assess the correlation between the two video signatures. Since the PRNU noise pattern is intrinsic to an imaging sensor, one issue that needs

to be addressed is how to identify videos taken by the same camcorder (or a fixed set of camcorders), as they are expected to yield the same signature. Essentially, due to the inability to extract an accurate estimate of the underlying PRNU noise, fingerprints extracted from a video also have contributions from the content itself. That is, the extracted video signature will not only depend on the imaging sensor fingerprints but will also exhibit some degree of content dependency. In Figure 4.2, it can be seen that the fingerprints extracted from videos captured by the reference camcorder correlate more strongly; however, even in the best case the correlation value is still modest. For unmodified or slightly modified videos, the correlation would take values close to one. On the other hand, for near-duplicate videos, no matter how similar they are, as long as the source camcorders are different, the correlation values will not be high. These points will be explored further in the following sections. Another challenge in video copy detection is the robustness of the extracted video signature when the video is subjected to common processing. The proposed video signature extraction scheme is expected to be robust to linear operations, as they will not degrade the PRNU noise. The scheme should also be robust to temporal changes, like random frame droppings and time desynchronizations, as long as the number of frames in a video is not reduced dramatically. Since modifications like blurring, noise addition, and compression are expected to degrade the fingerprint estimates, the proposed scheme would be robust to this type of modification only up to a certain extent. One critical type of modification that will impact the performance

negatively is frame cropping or scaling. This would require establishing synchronization between the sensor fingerprints from the original video and its scaled or cropped version prior to comparing the video signatures. Although Goljan et al. [50] showed that sensor fingerprints can be detected under image cropping or re-scaling through a search over the relevant (cropping and scaling) parameters, this would nevertheless increase the computational complexity. To evaluate our video copy detection scheme, we performed two sets of experiments. In the first set, we provide results demonstrating the robustness of the video signatures against various common processing operations. In the second set, we apply the proposed scheme to videos downloaded from YouTube and show how the scheme performs on real-life test data, where no information is available on the source camcorders. In the following sections, we provide an evaluation of the proposed scheme.

4.3 Robustness Properties of Video Signatures

To test the robustness of the extracted signatures, we used videos captured by five different camcorders in Mini-DV format with a frame resolution of 0.68 megapixels. The videos are initially encoded at an average bit-rate of 2 Mbps and at 30 frames per second. The videos depict various sceneries that include indoor and outdoor scenes, fast-moving objects, and still scenes, shot at varying optical zoom levels and also using camcorder panning. The videos captured with each camera are divided into 10

clips of 1000 frames, and the signatures of the resulting 50 video clips are extracted. Figure 4.3 shows that the correlation values computed between different video clips range up to 0.2. The results demonstrate that each video clip yields a different signature even when they are shot by the same camcorder. Next, we assess the robustness properties of the extracted video signatures by subjecting the video clips to various types of modifications at varying strengths. We extracted signatures from video clips that have undergone manipulation and correlated these signatures with the signatures of the original (unmodified) video clips. For each manipulation we provide distributions of how the original signatures correlate with (a) signatures extracted from their modified versions (blue distribution), (b) signatures extracted from other videos taken by the same camera (red distribution), and (c) signatures extracted from videos taken by other camcorders (green distribution). The distributions of these correlation values are shown in Figure 4.4. As can be seen in this figure, when the content is different, the correlation of the signatures is less than 0.2 for all types of modifications. Therefore, if the correlation between the signatures from the original and the modified version of a video is above 0.2, video copies can be reliably detected. In our experiments, we set the threshold for identification to 0.2 so that none of the different videos, whether taken by the same camcorder or not, would be identified as copies. As a performance measure we consider the true positive rate (TPR), which is the rate of correctly detected copies of a video. For each manipulation, we also provide a figure showing the change in the mean of the

Figure 4.3: Distribution of correlation values obtained by correlating the signatures of the 50 video clips with each other.

signature correlations between each video and its modified version as a function of the manipulation strength.

4.3.1 Contrast Adjustment

The contrast adjustment operation modifies the range of pixel values without changing their mutual dynamic relationship. Contrast enhancement (increase) maps the luminance values in the interval [v_l, v_h] to the interval [0, 255]; luminance values below v_l and above v_h are saturated to 0 and 255, respectively. In the same manner, when contrast is decreased, the luminance values in [0, 255] are mapped to the range [v_l, v_h]. Since the PRNU noise is largely preserved under contrast adjustment, the resulting video signatures will not be modified significantly. In the experiments, we tried various [v_l, v_h] values ranging from [25, 230] to [115, 140]. As can be seen in Figure 4.5-a, video signatures are robust up to 90%

Figure 4.4: The distribution of correlation values. The blue distribution is obtained by correlating the unmodified video clips and their modified versions, red by correlating video clips with others from the same camcorder, and green by cross-correlating different videos, for different types of manipulations. (a) Decreased contrast. (b) Increased contrast. (c) Decreased brightness. (d) Increased brightness. (e) Blurring. (f) AWGN addition. (g) Compression. (h) Random frame dropping.

contrast increase, which corresponds to the [102, 153] range. For the enhancement values [114, 140], the mean correlation value was around 0.18 and in fact all correlation values were lower than 0.2. However, it must be noted that this is a very extreme case in which most of the luminance values of the frames are saturated to 0 and 255. On the other hand, even in the most extreme case of contrast decrease, where the luminance values in the range [0, 255] are mapped to the [114, 140] range, we were able to detect all the copies of the video clips (Figure 4.5-b). The distributions of correlation values between the signatures of the original video clips and their contrast-increased versions can be seen in Figure 4.4-a, and their contrast-decreased versions in Figure 4.4-b. These results show that the extracted signatures are very robust to contrast manipulations.

Figure 4.5: The change in the mean of correlation values as a function of the strength of (a) contrast increase and (b) contrast decrease.
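As a rough illustration of the contrast adjustment used in this subsection, the C sketch below maps 8-bit luminance values with the interval [v_l, v_h] and saturation as described above; the uint8 frame layout and integer rounding are assumptions made for the sketch.

```c
#include <stdint.h>
#include <stddef.h>

/* Contrast increase: map luminance values in [v_lo, v_hi] onto [0, 255],
 * saturating values below v_lo to 0 and above v_hi to 255. */
void contrast_increase(uint8_t *frame, size_t npixels, int v_lo, int v_hi)
{
    for (size_t i = 0; i < npixels; i++) {
        int v = frame[i];
        if (v <= v_lo)      frame[i] = 0;
        else if (v >= v_hi) frame[i] = 255;
        else                frame[i] = (uint8_t)((v - v_lo) * 255 / (v_hi - v_lo));
    }
}

/* Contrast decrease: the reverse mapping, from [0, 255] into [v_lo, v_hi]. */
void contrast_decrease(uint8_t *frame, size_t npixels, int v_lo, int v_hi)
{
    for (size_t i = 0; i < npixels; i++)
        frame[i] = (uint8_t)(v_lo + frame[i] * (v_hi - v_lo) / 255);
}
```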

4.3.2 Brightness Adjustment

Brightness adjustment is performed by either adding or subtracting p percent of the frame's mean luminance value to or from each pixel in the frame, where p is a user-defined parameter. Since this operation only offsets the pixel values, the PRNU noise will be almost fully preserved and the video signature will not change much. During the experiments we varied the value of p between 10% and 190%, where 10%-99% indicates a brightness increase and 101%-190% indicates a brightness decrease. The correlations of the signatures after adjusting brightness are given in Figures 4.4-c and 4.4-d. Also, the average change in the correlation values with respect to the change in brightness level is given in Figures 4.6-a and 4.6-b. As can be seen in these figures, detection fails only when the brightness increase is at an extreme level. For the other instances, the video signatures are observed to be robust; therefore, we were able to detect all the copied videos without any false positives.

Figure 4.6: The change in the mean of correlation values as a function of the strength of (a) brightness decrease and (b) brightness increase.
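A minimal sketch of this brightness offset, assuming 8-bit luminance and treating p as a signed percentage (rather than the 10%-190% encoding used in the experiments), is given below.

```c
#include <stdint.h>
#include <stddef.h>

/* Shift every pixel by p percent of the frame's mean luminance.
 * A positive p brightens the frame, a negative p darkens it;
 * results are clipped to the valid 8-bit range. */
void adjust_brightness(uint8_t *frame, size_t npixels, double p)
{
    double sum = 0.0;
    for (size_t i = 0; i < npixels; i++)
        sum += frame[i];
    double offset = (p / 100.0) * (sum / (double)npixels);

    for (size_t i = 0; i < npixels; i++) {
        double v = frame[i] + offset;
        frame[i] = (uint8_t)(v < 0.0 ? 0.0 : (v > 255.0 ? 255.0 : v));
    }
}
```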

4.3.3 Blurring

Blurring is performed by filtering each frame with a standard Gaussian filter with parameter σ (i.e., the standard deviation). Since blurring removes much of the medium- to high-frequency content, the PRNU noise may be largely removed (depending on the choice of σ), making the extracted signatures unreliable. In the experiments, we considered the values σ = 2, 3, 5, 7. Figure 4.7-a shows the mean value of the resulting correlations as the filter size changes. The results indicate that the signatures are robust to blurring only if the Gaussian filter width σ is less than 3. The distribution of correlation values can be seen in Figure 4.4-e.

4.3.4 AWGN Addition

Noise addition will degrade the accuracy of the PRNU noise estimates, and it is to be expected that with increasing noise power, reliable detection of the PRNU noise will get more and more difficult. When the noise is additive and frame-wise independent, its impact can be reduced by averaging over a large number of frames; however, this will be effective only for very long videos. In the experiments, we added additive white Gaussian noise (AWGN), with varying standard deviation σ, to each video frame. The considered range of noise levels is σ = 2, 3, 5, 10, 20, 30. The results in Figure 4.7-b show that performance is not satisfactory when σ > 5. For σ = 20, 30 our scheme did not work at all; for σ = 5 we achieved an 80% true positive rate (TPR), and for σ = 10 the TPR was 30%, in both cases with no false

positives. Figure 4.4-f provides the distribution of correlation values after AWGN addition.

Figure 4.7: The change in the mean of correlation values as a function of the strength of (a) blurring and (b) AWGN addition.

4.3.5 Compression

To show the impact of compression, we re-encoded all videos at bit-rates ranging from 0.8 Mbps to 2 Mbps, while preserving the frame resolution. (Since compression beyond 0.8 Mbps caused a decrease in frame resolution, we did not consider lower bit-rate values.) We observed that accuracy does not vary with the bit-rate, as can be seen in Figure 4.8-a and Figure 4.4. Therefore, we can conclude that the signatures are very robust to bit-rate changes. The distribution of correlation values, given in Figure 4.4-g, shows a similar trend.

Figure 4.8: The change in the mean of correlation values as a function of the strength of (a) compression and (b) frame dropping.

4.3.6 Random Frame Dropping

To illustrate the impact of a lossy channel, we randomly removed frames from each video clip before extracting the signature. The drop rate varied between 50% and 90%. As Figure 4.8-b and Figure 4.4-h indicate, the extracted signatures were reliable for all frame drop rates. The correlations of the signatures after random frame dropping can also be found in Figure 4.4-h.

4.4 Performance Evaluation

To test the performance of the proposed video copy detection scheme, we used videos from the video sharing site YouTube. For this purpose, we downloaded more than 400 videos found under 44 distinct search names without imposing any other constraints (e.g., resolution, compression level, synchronization in time). Each distinct video had between 2 and 39 copies. These videos include TV commercials, movie trailers,

and music clips, and the duration of each video varies from 20 seconds to 10 minutes at a resolution of 240x320 pixels. The signatures extracted from the 400 videos are then cross-correlated. The distributions of the resulting correlation values are given in Figure 4.9-a. In this figure, the blue distribution curve indicates the correlation of signatures associated with the same videos, and the red one the correlation of signatures associated with different videos. From these distributions, it can be immediately seen that for the same videos the correlation values are in general greater than 0.5 and mostly close to 1. For different videos, on the other hand, the correlation values are centered around 0 with a maximum of less than 0.5. To evaluate the performance of the scheme in detecting video copies at a given decision threshold, we counted the number of decision errors in which the copy of a video is deemed to be a different video (false rejection) and in which a different video is detected as a copy (false acceptance). (This is realized by comparing the correlation values associated with all pairs of videos against a preset threshold; a small counting sketch is given after the discussion of misidentification causes below.) Figure 4.9-b displays the receiver operating characteristic (ROC) curve, which shows the change in the false-acceptance rate against the false-rejection rate as the decision threshold is varied across all values. The ROC curve shows that the misidentification rate is very low. Note that in the best case, the accuracy is 99.30%. To see why some of the videos did not correlate with their copies, we examined the videos more closely. We found several reasons for not obtaining similar signatures from the available copies. The most common reason for a misidentification is (slight)

Figure 4.9: (a) Cross-correlation of extracted video signatures. (b) ROC curve for detection results on the videos downloaded from YouTube.

scaling of the videos. Since after scaling the extracted signatures do not align, those videos yielded very low correlation values. Figure 4.10-a shows an example of a video and its scaled copy. As expected, another reason for observing low correlation values is compression. Figure 4.10-b provides an example where the copied version of the video is compressed by a factor of 0.75, which yields a correlation value just below the threshold. Another factor contributing to misidentifications is extra content (like advertisements) inserted into the videos. We observed that if the added content is around 10% of the length of the original video, the signatures still yield correlation values of more than 0.5. When the added content is more than 30% of the original duration, the resulting signatures become substantially dissimilar. Another reason for low correlation values is video summarization. It is observed that even if a video is shortened by more than 30% of its original length, the signatures yield satisfactory correlation. However, in general, when videos are shortened by more than 40%, our scheme was not able to correctly detect the copies.
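The error counting behind the ROC curve in Figure 4.9-b can be sketched as follows; this is only an illustrative C sketch, assuming the pairwise correlations have already been computed and separated into copy pairs and different-video pairs.

```c
#include <stddef.h>

/* Count false rejections (copy pairs falling below the threshold) and
 * false acceptances (different-video pairs exceeding it) at a single
 * decision threshold; sweeping the threshold traces out the ROC curve. */
void count_errors(const double *copy_corr, size_t n_copy,
                  const double *diff_corr, size_t n_diff,
                  double threshold, double *frr, double *far)
{
    size_t fr = 0, fa = 0;
    for (size_t i = 0; i < n_copy; i++)
        if (copy_corr[i] < threshold) fr++;     /* missed copy */
    for (size_t i = 0; i < n_diff; i++)
        if (diff_corr[i] >= threshold) fa++;    /* wrongly accepted */
    *frr = (double)fr / (double)n_copy;
    *far = (double)fa / (double)n_diff;
}
```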

Figure 4.10: Video copies for which the extracted signatures are dissimilar. (a) Scaled versions. (b) Highly compressed version.

We notice that our signatures are quite robust in the presence of on-screen graphic objects, like subtitles and small advertisements, that overlay the video content. Figure 4.11-a gives one such example where detection can be successfully achieved. In addition, small shifts in time did not affect the signature much. In Figure 4.11-b one can see the 300th frames of a video and its copy. In this example, the second video started with a blank screen lasting around 2 seconds, which yielded a shift in time. We also examined the videos that were falsely detected as copies with high correlation values. In our observation, the most dominant factor in those cases is the continuous presence of a logo or advertisement in different videos, as exemplified in Figure 4.12.

Figure 4.11: Video copies with similar signatures. (a) Added subtitles. (b) Shifted in time.

To determine the impact of the content and the imaging sensor fingerprint on the resulting video signatures, we performed another experiment on YouTube videos. For this purpose, we downloaded 36 distinct videos of a commercial series (the now

Figure 4.12: Different videos with similar signatures. corr(a,b) = 0.45.

famous PC vs. Mac commercial) hypothesizing that they were captured using the same set of source device(s). The videos were short, typically 27 seconds, with an average of 700 frames per video at a resolution of 240x320 pixels per frame, and content-wise they are quite similar. Figure 4.13 shows representative frames extracted from four of the videos. We extracted signatures from each of the videos and computed pairwise correlations among all signatures. In Figure 4.14, the red distribution shows the values obtained by the pairwise correlations. As can be seen, most of the resulting values are very close to 0, implying no relation between the videos. These results indicate that if the content of the videos is not the same, but only very similar, our scheme does not detect them as copies, even if they might have been captured by the same set of source devices. On the other hand, these results do not allow us to conclude whether or not the commercials were captured using the same source device(s), as videos captured by the same device would be expected to yield higher correlation values. Since extracting a reliable sensor fingerprint from Internet-quality videos requires longer videos, we performed another experiment to see if the source devices for the videos match. For this purpose, we first randomly chose 10 videos and combined them together to generate a composite video. Then

Figure 4.13: The frames of four example videos from a commercial series.

we generated another composite video by choosing 10 different videos from the remaining ones and correlated the resulting signatures of the two composite videos. (Note that the two composite videos have no overlapping content.) We repeated the same experiment 250 times, drawing different combinations of the 36 videos each time. The distribution of the resulting correlation values is shown in Figure 4.14 in blue. These results strongly imply that at least some of the videos were taken by the same set of cameras or camcorders. Overall, the experiments on this commercial series showed that when the same camcorders are used in the capturing process of two videos, if their contents are not the same, even though they may be similar, the resulting signatures will be significantly different.

Figure 4.14: The distribution of correlation values obtained from the commercial series. The red distribution is obtained by pairwise correlations of the individual videos and the blue distribution by correlations of the composite videos.


More information

JPEG compression of monochrome 2D-barcode images using DCT coefficient distributions

JPEG compression of monochrome 2D-barcode images using DCT coefficient distributions Edith Cowan University Research Online ECU Publications Pre. JPEG compression of monochrome D-barcode images using DCT coefficient distributions Keng Teong Tan Hong Kong Baptist University Douglas Chai

More information

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Automatic Photo Quality Assessment Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Estimating i the photorealism of images: Distinguishing i i paintings from photographs h Florin

More information

Image Authentication Scheme using Digital Signature and Digital Watermarking

Image Authentication Scheme using Digital Signature and Digital Watermarking www..org 59 Image Authentication Scheme using Digital Signature and Digital Watermarking Seyed Mohammad Mousavi Industrial Management Institute, Tehran, Iran Abstract Usual digital signature schemes for

More information

Semantic Video Annotation by Mining Association Patterns from Visual and Speech Features

Semantic Video Annotation by Mining Association Patterns from Visual and Speech Features Semantic Video Annotation by Mining Association Patterns from and Speech Features Vincent. S. Tseng, Ja-Hwung Su, Jhih-Hong Huang and Chih-Jen Chen Department of Computer Science and Information Engineering

More information

ABSTRACT VIDEO INDEXING AND SUMMARIZATION USING MOTION ACTIVITY. by Kadir Askin Peker

ABSTRACT VIDEO INDEXING AND SUMMARIZATION USING MOTION ACTIVITY. by Kadir Askin Peker ABSTRACT VIDEO INDEXING AND SUMMARIZATION USING MOTION ACTIVITY by Kadir Askin Peker In this dissertation, video-indexing techniques using low-level motion activity characteristics and their application

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

WHITE PAPER. Are More Pixels Better? www.basler-ipcam.com. Resolution Does it Really Matter?

WHITE PAPER. Are More Pixels Better? www.basler-ipcam.com. Resolution Does it Really Matter? WHITE PAPER www.basler-ipcam.com Are More Pixels Better? The most frequently asked question when buying a new digital security camera is, What resolution does the camera provide? The resolution is indeed

More information

International Journal of Advanced Information in Arts, Science & Management Vol.2, No.2, December 2014

International Journal of Advanced Information in Arts, Science & Management Vol.2, No.2, December 2014 Efficient Attendance Management System Using Face Detection and Recognition Arun.A.V, Bhatath.S, Chethan.N, Manmohan.C.M, Hamsaveni M Department of Computer Science and Engineering, Vidya Vardhaka College

More information

Image Compression through DCT and Huffman Coding Technique

Image Compression through DCT and Huffman Coding Technique International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Rahul

More information

WATERMARKING FOR IMAGE AUTHENTICATION

WATERMARKING FOR IMAGE AUTHENTICATION WATERMARKING FOR IMAGE AUTHENTICATION Min Wu Bede Liu Department of Electrical Engineering Princeton University, Princeton, NJ 08544, USA Fax: +1-609-258-3745 {minwu, liu}@ee.princeton.edu ABSTRACT A data

More information

Parametric Comparison of H.264 with Existing Video Standards

Parametric Comparison of H.264 with Existing Video Standards Parametric Comparison of H.264 with Existing Video Standards Sumit Bhardwaj Department of Electronics and Communication Engineering Amity School of Engineering, Noida, Uttar Pradesh,INDIA Jyoti Bhardwaj

More information

Scanners and How to Use Them

Scanners and How to Use Them Written by Jonathan Sachs Copyright 1996-1999 Digital Light & Color Introduction A scanner is a device that converts images to a digital file you can use with your computer. There are many different types

More information

Building an Advanced Invariant Real-Time Human Tracking System

Building an Advanced Invariant Real-Time Human Tracking System UDC 004.41 Building an Advanced Invariant Real-Time Human Tracking System Fayez Idris 1, Mazen Abu_Zaher 2, Rashad J. Rasras 3, and Ibrahiem M. M. El Emary 4 1 School of Informatics and Computing, German-Jordanian

More information

Video Affective Content Recognition Based on Genetic Algorithm Combined HMM

Video Affective Content Recognition Based on Genetic Algorithm Combined HMM Video Affective Content Recognition Based on Genetic Algorithm Combined HMM Kai Sun and Junqing Yu Computer College of Science & Technology, Huazhong University of Science & Technology, Wuhan 430074, China

More information

Data Storage. Chapter 3. Objectives. 3-1 Data Types. Data Inside the Computer. After studying this chapter, students should be able to:

Data Storage. Chapter 3. Objectives. 3-1 Data Types. Data Inside the Computer. After studying this chapter, students should be able to: Chapter 3 Data Storage Objectives After studying this chapter, students should be able to: List five different data types used in a computer. Describe how integers are stored in a computer. Describe how

More information

Data Storage 3.1. Foundations of Computer Science Cengage Learning

Data Storage 3.1. Foundations of Computer Science Cengage Learning 3 Data Storage 3.1 Foundations of Computer Science Cengage Learning Objectives After studying this chapter, the student should be able to: List five different data types used in a computer. Describe how

More information

A Proposal for OpenEXR Color Management

A Proposal for OpenEXR Color Management A Proposal for OpenEXR Color Management Florian Kainz, Industrial Light & Magic Revision 5, 08/05/2004 Abstract We propose a practical color management scheme for the OpenEXR image file format as used

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Multimedia Document Authentication using On-line Signatures as Watermarks

Multimedia Document Authentication using On-line Signatures as Watermarks Multimedia Document Authentication using On-line Signatures as Watermarks Anoop M Namboodiri and Anil K Jain Department of Computer Science and Engineering Michigan State University East Lansing, MI 48824

More information

Big Data: Rethinking Text Visualization

Big Data: Rethinking Text Visualization Big Data: Rethinking Text Visualization Dr. Anton Heijs anton.heijs@treparel.com Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important

More information

Multimodal Biometric Recognition Security System

Multimodal Biometric Recognition Security System Multimodal Biometric Recognition Security System Anju.M.I, G.Sheeba, G.Sivakami, Monica.J, Savithri.M Department of ECE, New Prince Shri Bhavani College of Engg. & Tech., Chennai, India ABSTRACT: Security

More information

DYNAMIC RANGE IMPROVEMENT THROUGH MULTIPLE EXPOSURES. Mark A. Robertson, Sean Borman, and Robert L. Stevenson

DYNAMIC RANGE IMPROVEMENT THROUGH MULTIPLE EXPOSURES. Mark A. Robertson, Sean Borman, and Robert L. Stevenson c 1999 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or

More information

A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks

A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks H. T. Kung Dario Vlah {htk, dario}@eecs.harvard.edu Harvard School of Engineering and Applied Sciences

More information

Florida International University - University of Miami TRECVID 2014

Florida International University - University of Miami TRECVID 2014 Florida International University - University of Miami TRECVID 2014 Miguel Gavidia 3, Tarek Sayed 1, Yilin Yan 1, Quisha Zhu 1, Mei-Ling Shyu 1, Shu-Ching Chen 2, Hsin-Yu Ha 2, Ming Ma 1, Winnie Chen 4,

More information

How To Filter Spam Image From A Picture By Color Or Color

How To Filter Spam Image From A Picture By Color Or Color Image Content-Based Email Spam Image Filtering Jianyi Wang and Kazuki Katagishi Abstract With the population of Internet around the world, email has become one of the main methods of communication among

More information

Image Normalization for Illumination Compensation in Facial Images

Image Normalization for Illumination Compensation in Facial Images Image Normalization for Illumination Compensation in Facial Images by Martin D. Levine, Maulin R. Gandhi, Jisnu Bhattacharyya Department of Electrical & Computer Engineering & Center for Intelligent Machines

More information

High Quality Image Magnification using Cross-Scale Self-Similarity

High Quality Image Magnification using Cross-Scale Self-Similarity High Quality Image Magnification using Cross-Scale Self-Similarity André Gooßen 1, Arne Ehlers 1, Thomas Pralow 2, Rolf-Rainer Grigat 1 1 Vision Systems, Hamburg University of Technology, D-21079 Hamburg

More information

An Energy-Based Vehicle Tracking System using Principal Component Analysis and Unsupervised ART Network

An Energy-Based Vehicle Tracking System using Principal Component Analysis and Unsupervised ART Network Proceedings of the 8th WSEAS Int. Conf. on ARTIFICIAL INTELLIGENCE, KNOWLEDGE ENGINEERING & DATA BASES (AIKED '9) ISSN: 179-519 435 ISBN: 978-96-474-51-2 An Energy-Based Vehicle Tracking System using Principal

More information

ECE 533 Project Report Ashish Dhawan Aditi R. Ganesan

ECE 533 Project Report Ashish Dhawan Aditi R. Ganesan Handwritten Signature Verification ECE 533 Project Report by Ashish Dhawan Aditi R. Ganesan Contents 1. Abstract 3. 2. Introduction 4. 3. Approach 6. 4. Pre-processing 8. 5. Feature Extraction 9. 6. Verification

More information

How To Fix Out Of Focus And Blur Images With A Dynamic Template Matching Algorithm

How To Fix Out Of Focus And Blur Images With A Dynamic Template Matching Algorithm IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 10 April 2015 ISSN (online): 2349-784X Image Estimation Algorithm for Out of Focus and Blur Images to Retrieve the Barcode

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

Canny Edge Detection

Canny Edge Detection Canny Edge Detection 09gr820 March 23, 2009 1 Introduction The purpose of edge detection in general is to significantly reduce the amount of data in an image, while preserving the structural properties

More information

Classification of Fingerprints. Sarat C. Dass Department of Statistics & Probability

Classification of Fingerprints. Sarat C. Dass Department of Statistics & Probability Classification of Fingerprints Sarat C. Dass Department of Statistics & Probability Fingerprint Classification Fingerprint classification is a coarse level partitioning of a fingerprint database into smaller

More information

TVL - The True Measurement of Video Quality

TVL - The True Measurement of Video Quality ACTi Knowledge Base Category: Educational Note Sub-category: Video Quality, Hardware Model: N/A Firmware: N/A Software: N/A Author: Ando.Meritee Published: 2010/10/25 Reviewed: 2010/10/27 TVL - The True

More information

HSI BASED COLOUR IMAGE EQUALIZATION USING ITERATIVE n th ROOT AND n th POWER

HSI BASED COLOUR IMAGE EQUALIZATION USING ITERATIVE n th ROOT AND n th POWER HSI BASED COLOUR IMAGE EQUALIZATION USING ITERATIVE n th ROOT AND n th POWER Gholamreza Anbarjafari icv Group, IMS Lab, Institute of Technology, University of Tartu, Tartu 50411, Estonia sjafari@ut.ee

More information

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor

More information

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

More information

Circle Object Recognition Based on Monocular Vision for Home Security Robot

Circle Object Recognition Based on Monocular Vision for Home Security Robot Journal of Applied Science and Engineering, Vol. 16, No. 3, pp. 261 268 (2013) DOI: 10.6180/jase.2013.16.3.05 Circle Object Recognition Based on Monocular Vision for Home Security Robot Shih-An Li, Ching-Chang

More information

A Method of Caption Detection in News Video

A Method of Caption Detection in News Video 3rd International Conference on Multimedia Technology(ICMT 3) A Method of Caption Detection in News Video He HUANG, Ping SHI Abstract. News video is one of the most important media for people to get information.

More information

ROBUST COLOR JOINT MULTI-FRAME DEMOSAICING AND SUPER- RESOLUTION ALGORITHM

ROBUST COLOR JOINT MULTI-FRAME DEMOSAICING AND SUPER- RESOLUTION ALGORITHM ROBUST COLOR JOINT MULTI-FRAME DEMOSAICING AND SUPER- RESOLUTION ALGORITHM Theodor Heinze Hasso-Plattner-Institute for Software Systems Engineering Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany theodor.heinze@hpi.uni-potsdam.de

More information

Low-resolution Character Recognition by Video-based Super-resolution

Low-resolution Character Recognition by Video-based Super-resolution 2009 10th International Conference on Document Analysis and Recognition Low-resolution Character Recognition by Video-based Super-resolution Ataru Ohkura 1, Daisuke Deguchi 1, Tomokazu Takahashi 2, Ichiro

More information

Introduction to Pattern Recognition

Introduction to Pattern Recognition Introduction to Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Choosing a digital camera for your microscope John C. Russ, Materials Science and Engineering Dept., North Carolina State Univ.

Choosing a digital camera for your microscope John C. Russ, Materials Science and Engineering Dept., North Carolina State Univ. Choosing a digital camera for your microscope John C. Russ, Materials Science and Engineering Dept., North Carolina State Univ., Raleigh, NC One vital step is to choose a transfer lens matched to your

More information

The Delicate Art of Flower Classification

The Delicate Art of Flower Classification The Delicate Art of Flower Classification Paul Vicol Simon Fraser University University Burnaby, BC pvicol@sfu.ca Note: The following is my contribution to a group project for a graduate machine learning

More information

An introduction to OBJECTIVE ASSESSMENT OF IMAGE QUALITY. Harrison H. Barrett University of Arizona Tucson, AZ

An introduction to OBJECTIVE ASSESSMENT OF IMAGE QUALITY. Harrison H. Barrett University of Arizona Tucson, AZ An introduction to OBJECTIVE ASSESSMENT OF IMAGE QUALITY Harrison H. Barrett University of Arizona Tucson, AZ Outline! Approaches to image quality! Why not fidelity?! Basic premises of the task-based approach!

More information

Galaxy Morphological Classification

Galaxy Morphological Classification Galaxy Morphological Classification Jordan Duprey and James Kolano Abstract To solve the issue of galaxy morphological classification according to a classification scheme modelled off of the Hubble Sequence,

More information

Palmprint Recognition. By Sree Rama Murthy kora Praveen Verma Yashwant Kashyap

Palmprint Recognition. By Sree Rama Murthy kora Praveen Verma Yashwant Kashyap Palmprint Recognition By Sree Rama Murthy kora Praveen Verma Yashwant Kashyap Palm print Palm Patterns are utilized in many applications: 1. To correlate palm patterns with medical disorders, e.g. genetic

More information

Image Compression and Decompression using Adaptive Interpolation

Image Compression and Decompression using Adaptive Interpolation Image Compression and Decompression using Adaptive Interpolation SUNILBHOOSHAN 1,SHIPRASHARMA 2 Jaypee University of Information Technology 1 Electronicsand Communication EngineeringDepartment 2 ComputerScience

More information

Blind Deconvolution of Barcodes via Dictionary Analysis and Wiener Filter of Barcode Subsections

Blind Deconvolution of Barcodes via Dictionary Analysis and Wiener Filter of Barcode Subsections Blind Deconvolution of Barcodes via Dictionary Analysis and Wiener Filter of Barcode Subsections Maximilian Hung, Bohyun B. Kim, Xiling Zhang August 17, 2013 Abstract While current systems already provide

More information

NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju

NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE Venu Govindaraju BIOMETRICS DOCUMENT ANALYSIS PATTERN RECOGNITION 8/24/2015 ICDAR- 2015 2 Towards a Globally Optimal Approach for Learning Deep Unsupervised

More information

Potential of face area data for predicting sharpness of natural images

Potential of face area data for predicting sharpness of natural images Potential of face area data for predicting sharpness of natural images Mikko Nuutinen a, Olli Orenius b, Timo Säämänen b, Pirkko Oittinen a a Dept. of Media Technology, Aalto University School of Science

More information

SoMA. Automated testing system of camera algorithms. Sofica Ltd

SoMA. Automated testing system of camera algorithms. Sofica Ltd SoMA Automated testing system of camera algorithms Sofica Ltd February 2012 2 Table of Contents Automated Testing for Camera Algorithms 3 Camera Algorithms 3 Automated Test 4 Testing 6 API Testing 6 Functional

More information

MASCOT Search Results Interpretation

MASCOT Search Results Interpretation The Mascot protein identification program (Matrix Science, Ltd.) uses statistical methods to assess the validity of a match. MS/MS data is not ideal. That is, there are unassignable peaks (noise) and usually

More information

Simultaneous Gamma Correction and Registration in the Frequency Domain

Simultaneous Gamma Correction and Registration in the Frequency Domain Simultaneous Gamma Correction and Registration in the Frequency Domain Alexander Wong a28wong@uwaterloo.ca William Bishop wdbishop@uwaterloo.ca Department of Electrical and Computer Engineering University

More information

Biometric Authentication using Online Signatures

Biometric Authentication using Online Signatures Biometric Authentication using Online Signatures Alisher Kholmatov and Berrin Yanikoglu alisher@su.sabanciuniv.edu, berrin@sabanciuniv.edu http://fens.sabanciuniv.edu Sabanci University, Tuzla, Istanbul,

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

A comprehensive survey on various ETC techniques for secure Data transmission

A comprehensive survey on various ETC techniques for secure Data transmission A comprehensive survey on various ETC techniques for secure Data transmission Shaikh Nasreen 1, Prof. Suchita Wankhade 2 1, 2 Department of Computer Engineering 1, 2 Trinity College of Engineering and

More information

Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

More information

LIST OF CONTENTS CHAPTER CONTENT PAGE DECLARATION DEDICATION ACKNOWLEDGEMENTS ABSTRACT ABSTRAK

LIST OF CONTENTS CHAPTER CONTENT PAGE DECLARATION DEDICATION ACKNOWLEDGEMENTS ABSTRACT ABSTRAK vii LIST OF CONTENTS CHAPTER CONTENT PAGE DECLARATION DEDICATION ACKNOWLEDGEMENTS ABSTRACT ABSTRAK LIST OF CONTENTS LIST OF TABLES LIST OF FIGURES LIST OF NOTATIONS LIST OF ABBREVIATIONS LIST OF APPENDICES

More information

How To Evaluate The Performance Of The Process Industry Supply Chain

How To Evaluate The Performance Of The Process Industry Supply Chain Performance Evaluation of the Process Industry Supply r Chain: Case of the Petroleum Industry in India :.2A By Siddharth Varma Submitted in fulfillment of requirements of the degree of DOCTOR OF PHILOSOPHY

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Face Recognition in Low-resolution Images by Using Local Zernike Moments

Face Recognition in Low-resolution Images by Using Local Zernike Moments Proceedings of the International Conference on Machine Vision and Machine Learning Prague, Czech Republic, August14-15, 014 Paper No. 15 Face Recognition in Low-resolution Images by Using Local Zernie

More information

Build Panoramas on Android Phones

Build Panoramas on Android Phones Build Panoramas on Android Phones Tao Chu, Bowen Meng, Zixuan Wang Stanford University, Stanford CA Abstract The purpose of this work is to implement panorama stitching from a sequence of photos taken

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Prediction of DDoS Attack Scheme

Prediction of DDoS Attack Scheme Chapter 5 Prediction of DDoS Attack Scheme Distributed denial of service attack can be launched by malicious nodes participating in the attack, exploit the lack of entry point in a wireless network, and

More information

Signature verification using Kolmogorov-Smirnov. statistic

Signature verification using Kolmogorov-Smirnov. statistic Signature verification using Kolmogorov-Smirnov statistic Harish Srinivasan, Sargur N.Srihari and Matthew J Beal University at Buffalo, the State University of New York, Buffalo USA {srihari,hs32}@cedar.buffalo.edu,mbeal@cse.buffalo.edu

More information

Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

More information

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture.

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture. Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture. Chirag Gupta,Sumod Mohan K cgupta@clemson.edu, sumodm@clemson.edu Abstract In this project we propose a method to improve

More information

Principal components analysis

Principal components analysis CS229 Lecture notes Andrew Ng Part XI Principal components analysis In our discussion of factor analysis, we gave a way to model data x R n as approximately lying in some k-dimension subspace, where k

More information

2695 P a g e. IV Semester M.Tech (DCN) SJCIT Chickballapur Karnataka India

2695 P a g e. IV Semester M.Tech (DCN) SJCIT Chickballapur Karnataka India Integrity Preservation and Privacy Protection for Digital Medical Images M.Krishna Rani Dr.S.Bhargavi IV Semester M.Tech (DCN) SJCIT Chickballapur Karnataka India Abstract- In medical treatments, the integrity

More information

Enhancing the SNR of the Fiber Optic Rotation Sensor using the LMS Algorithm

Enhancing the SNR of the Fiber Optic Rotation Sensor using the LMS Algorithm 1 Enhancing the SNR of the Fiber Optic Rotation Sensor using the LMS Algorithm Hani Mehrpouyan, Student Member, IEEE, Department of Electrical and Computer Engineering Queen s University, Kingston, Ontario,

More information

VISUALIZATION. Improving the Computer Forensic Analysis Process through

VISUALIZATION. Improving the Computer Forensic Analysis Process through By SHELDON TEERLINK and ROBERT F. ERBACHER Improving the Computer Forensic Analysis Process through VISUALIZATION The ability to display mountains of data in a graphical manner significantly enhances the

More information

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29.

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29. Broadband Networks Prof. Dr. Abhay Karandikar Electrical Engineering Department Indian Institute of Technology, Bombay Lecture - 29 Voice over IP So, today we will discuss about voice over IP and internet

More information

Graphic Design. Background: The part of an artwork that appears to be farthest from the viewer, or in the distance of the scene.

Graphic Design. Background: The part of an artwork that appears to be farthest from the viewer, or in the distance of the scene. Graphic Design Active Layer- When you create multi layers for your images the active layer, or the only one that will be affected by your actions, is the one with a blue background in your layers palette.

More information

IN current film media, the increase in areal density has

IN current film media, the increase in areal density has IEEE TRANSACTIONS ON MAGNETICS, VOL. 44, NO. 1, JANUARY 2008 193 A New Read Channel Model for Patterned Media Storage Seyhan Karakulak, Paul H. Siegel, Fellow, IEEE, Jack K. Wolf, Life Fellow, IEEE, and

More information

Numerical Algorithms Group. Embedded Analytics. A cure for the common code. www.nag.com. Results Matter. Trust NAG.

Numerical Algorithms Group. Embedded Analytics. A cure for the common code. www.nag.com. Results Matter. Trust NAG. Embedded Analytics A cure for the common code www.nag.com Results Matter. Trust NAG. Executive Summary How much information is there in your data? How much is hidden from you, because you don t have access

More information

Performance Analysis and Comparison of JM 15.1 and Intel IPP H.264 Encoder and Decoder

Performance Analysis and Comparison of JM 15.1 and Intel IPP H.264 Encoder and Decoder Performance Analysis and Comparison of 15.1 and H.264 Encoder and Decoder K.V.Suchethan Swaroop and K.R.Rao, IEEE Fellow Department of Electrical Engineering, University of Texas at Arlington Arlington,

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

JPEG Image Compression by Using DCT

JPEG Image Compression by Using DCT International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-4, Issue-4 E-ISSN: 2347-2693 JPEG Image Compression by Using DCT Sarika P. Bagal 1* and Vishal B. Raskar 2 1*

More information

Automatic Detection of Emergency Vehicles for Hearing Impaired Drivers

Automatic Detection of Emergency Vehicles for Hearing Impaired Drivers Automatic Detection of Emergency Vehicles for Hearing Impaired Drivers Sung-won ark and Jose Trevino Texas A&M University-Kingsville, EE/CS Department, MSC 92, Kingsville, TX 78363 TEL (36) 593-2638, FAX

More information

Survey of Scanner and Printer Forensics at Purdue University

Survey of Scanner and Printer Forensics at Purdue University Survey of Scanner and Printer Forensics at Purdue University Nitin Khanna, Aravind K. Mikkilineni, George T.-C. Chiu, Jan P. Allebach, and Edward J. Delp Purdue University, West Lafayette IN 47907, USA

More information

6 EXTENDING ALGEBRA. 6.0 Introduction. 6.1 The cubic equation. Objectives

6 EXTENDING ALGEBRA. 6.0 Introduction. 6.1 The cubic equation. Objectives 6 EXTENDING ALGEBRA Chapter 6 Extending Algebra Objectives After studying this chapter you should understand techniques whereby equations of cubic degree and higher can be solved; be able to factorise

More information

CHAPTER VII CONCLUSIONS

CHAPTER VII CONCLUSIONS CHAPTER VII CONCLUSIONS To do successful research, you don t need to know everything, you just need to know of one thing that isn t known. -Arthur Schawlow In this chapter, we provide the summery of the

More information