MusicGuide: Album Reviews on the Go

Serdar Sali

Abstract

The cameras on mobile phones have untapped potential as input devices. In this paper, we present MusicGuide, an application that provides the user with information on a music album on the go, using only a photo of the album cover as input. Relevant data on the album, such as reviews and track samples, is then collected from various sources on the web and presented to the user. Our experiments show that MusicGuide is robust against various transforms and artifacts in the input images, and that it achieves good accuracy, with recognition rates above 90%.

1. Introduction

Mobile devices continue to grow enormously in popularity all around the world, and through convergence with PDAs, digital cameras and music players, they are set to become the mobile computing platform of choice for most people. This computing potential remains largely untapped, even though strong and continuously improving communication capabilities, the practicality of a small form factor and their ubiquity make mobile devices an ideal platform for conveying custom-tailored, context-based information to the user.

In this paper, we propose a system that provides a user with product information on the go by using the camera on the phone as an input device. While various methods exist to provide users with product information, they usually rely on text-based input from the user, or on data obtained by other means such as barcodes or RFID. However, entering text on these devices is impractical: space limitations make full-size keyboards either impossible or cumbersome to use due to cramped, tiny keys. RFID readers are not widely available on phones, and RFID tags are not widely deployed on products. While cameras on mobile phones can be used as barcode readers, barcodes are small and difficult to focus on, so the resulting images are usually too blurry for reliable recognition. An ideal method is therefore to use photos of product covers taken by the camera embedded in the phone as the input image for a search. In a wider sense, this problem is a subset of the content-based image retrieval (CBIR) problem and is called query by example.

Various CBIR applications exist on mobile phones. The main differences lie in the type of input used, where the computation takes place, and the domain to which the architecture applies. The architecture we propose is aimed at searching for information on music albums from a mobile device. In contrast to applications where all computation is performed on the device, our system employs a client-server design, since the dataset for our domain is too large to store on the device and computation power on the device is likely to be a bottleneck: the mobile device is the client, and all recognition and information aggregation is done on a remote server. The user takes a photo of an album cover and sends it to the server via e-mail, a WiFi connection or MMS messaging; the server takes this photo as input, performs the search, aggregates information from various resources and sends the results back to the user (a sketch of this exchange follows below).
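To make this exchange concrete, the following is a minimal sketch of the client side in Python. The server address, endpoint and form-field name are illustrative assumptions rather than part of the deployed system (which also accepts photos via e-mail and MMS), and the requests library stands in for whichever HTTP stack the phone provides.

    import requests  # illustrative; any HTTP client would do

    SERVER_URL = "http://musicguide.example.com/match"  # hypothetical endpoint

    def query_album(photo_path):
        """Upload a cover photo and return the server's HTML result page."""
        with open(photo_path, "rb") as photo:
            response = requests.post(SERVER_URL, files={"photo": photo},
                                     timeout=300)
        response.raise_for_status()
        return response.text  # ratings, review excerpts, track-sample links

    if __name__ == "__main__":
        html = query_album("cover.jpg")
        with open("result.html", "w") as out:  # kept for subsequent views
            out.write(html)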
The contribution of this paper is a mobile album cover recognition system utilizing the cameras on mobile phones. The motivation behind selecting this domain is the difficulty of obtaining information on an album based only on its cover, which is the only information available to a user in a store. While a buyer can sample a book or a magazine by reading sections of it, for albums sampling is limited to what is available at the listening kiosks. Our architecture gives users the ability to obtain information on an album simply by taking a photo of the album cover, which is both intuitive and practical.

2. Related Work

While CBIR is a large problem that encompasses a broad range of applications, such as semantic retrieval of images, searching based on particular features of an image such as color distribution or frequency, or retrieving pictures with similar textual annotations, the area relevant to our research is the retrieval of images similar to a sample input image, also called query by example. Since we use images taken by a user as input, the recognition algorithm employed must be robust against rotation, translation, scaling and partial occlusion. It is also important that the algorithm is invariant, to some degree, to noise in the image and to differences in illumination. Furthermore, since we compare against actual digital versions of the cover art, we need a reliable descriptor with good accuracy.

2.1. Similarity-based search

Over the years, various approaches have been proposed to address the problem of matching images based on similarity. These approaches mainly differ in how they form the feature vector for an image. The most widely used sources for feature extraction are color, shape and texture. One of the first algorithms utilizing color was proposed by Swain and Ballard [1]. This algorithm computes color histograms for the images and then intersects the histograms to perform the similarity search (a minimal sketch of histogram intersection is given below). Improvements that add spatial information and correlation to color histograms were proposed in [2] and [3]. Color histograms are very sensitive to noise, and they work best when both the input image and the database images are taken by the same device and therefore have similar color representation. Unfortunately, this is not the case for our domain: cameras on phones tend to take relatively noisy pictures, and our database consists of digital versions of cover art, not photos of them.

In addition to color histograms, various methods look for textures in an image and their spatial placement, and construct a feature vector based on this information. One important work utilizing textures is that of Tamura et al. [4], in which the authors propose computational approximations to the following texture features: coarseness, contrast, roughness, regularity, directionality and line-likeness. These properties are based on how humans actually perceive textures. While such methods are usually good at detecting uniform regions such as sky and sea in an image, they are not applicable to our domain, because such regions are not sufficient to correctly identify a single match for an input image.
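To make the color-histogram baseline of [1] concrete, the sketch below implements histogram intersection in Python with NumPy; the bin count and uniform RGB binning are illustrative choices rather than those of [1].

    import numpy as np

    def color_histogram(image, bins=8):
        """Normalized 3-D RGB histogram; `image` is an (H, W, 3) uint8 array."""
        hist, _ = np.histogramdd(image.reshape(-1, 3),
                                 bins=(bins,) * 3,
                                 range=((0, 256),) * 3)
        return hist / hist.sum()

    def histogram_intersection(query_hist, model_hist):
        """Swain-Ballard intersection: 1.0 means identical color distributions."""
        return float(np.minimum(query_hist, model_hist).sum())

The database image with the highest intersection score is returned as the match; as noted above, the score degrades when the query photo and the stored cover differ in color response, which is precisely our situation.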
Shape can also be used to build a feature vector from an image. Methods relying on shape usually work on the similarity of edges, corners and the shapes of objects in the image. Feature extraction using shape usually works at a local level: it is concerned with locating points of interest in the image, rather than with the global distribution of a feature as in color-based methods. One of the most commonly used shape detectors is the Harris corner detector [5]. Another descriptor that makes use of local points of interest is SIFT (Scale-Invariant Feature Transform) [6]. SIFT is a popular method based on the detection of key points in an image using the Difference-of-Gaussians method (a reduced sketch of this detection step is given at the end of Section 2). The resulting features are invariant to scaling, differences in illumination and rotation, and the method works well even for 3D objects and different viewpoints. These properties make SIFT ideal for our architecture.

2.2. Object Recognition Systems on Mobile Phones

While many web-based systems exist for content-based image search, applications of these systems to mobile devices are relatively rare. Some research is being conducted on building assistive technologies for the blind and visually impaired using mobile devices and object recognition. One such system is GroZi [7], which aims to help the visually impaired by recognizing and locating the items on a customer's shopping list among the items on a shelf. Another set of applications aims to build on-demand mobile tour or museum guides [8]. Some of these systems try to perform location recognition using photos of the location as input. Yeh et al. [9] propose a system that first queries an image database using a photo taken by the user; after a match is found, the system searches the web based on keywords associated with the determined location. PhoneGuide [10] is another application that uses the mobile phone as a museum guide: the user can take pictures of the various exhibits in the museum with the digital camera on his/her phone, and then query a database for information on the exhibit using the image as input. Seifert et al. [11] propose a system that utilizes object recognition and GPS on a mobile device for taking an inventory of traffic signs; the recognition algorithm takes advantage of the fact that traffic signs have distinctive shapes and patterns that are known in advance. Jia et al. [12] propose a generic architecture called Photo-to-Search that can be used to query the web directly from mobile devices, using images taken by the digital camera on the device together with minimal textual input; the underlying object recognition system uses SIFT to detect key points and perform the matching. There are also systems for outdoor hobbyists, such as systems for flower or fish recognition [13, 14].
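Before describing our method, we make the Difference-of-Gaussians detection underlying SIFT concrete. The following is a heavily reduced, illustrative sketch; the real detector in [6] adds octaves, scale-space extrema in three dimensions, edge-response suppression, sub-pixel refinement and the 128-dimensional descriptor.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def dog_keypoints(gray, sigmas=(1.0, 1.6, 2.6, 4.1), threshold=0.03):
        """Toy Difference-of-Gaussians detector on a 2-D float image in [0, 1]."""
        blurred = [gaussian_filter(gray, s) for s in sigmas]
        dogs = [b2 - b1 for b1, b2 in zip(blurred, blurred[1:])]
        keypoints = []
        for d in dogs:
            for y in range(1, d.shape[0] - 1):
                for x in range(1, d.shape[1] - 1):
                    patch = np.abs(d[y - 1:y + 2, x - 1:x + 2])
                    # keep pixels whose DoG response dominates their 8 neighbors
                    if patch[1, 1] > threshold and patch[1, 1] == patch.max():
                        keypoints.append((x, y))
        return keypoints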
3. Our Method

Image recognition systems require high performance, and an album cover database is too large to store on a phone. For these reasons, we implemented our system as a client-server architecture (Fig. 1).

Figure 1 - The MusicGuide architecture: the user takes a photo of the cover and sends it to the server, which matches the image, collects the associated data, and sends the results back to the phone.

The user interface on the phone lets the user take a photo with a resolution of 500×500 pixels using the camera on the device. This photo is then automatically uploaded to our server, where we perform the matching and data aggregation. The server then pushes the results back to the phone as an HTML file. This HTML file contains the product rating and track samples from Amazon, and the average critic score and review excerpts from Metacritic with links to the full reviews; it is also saved on the phone for subsequent viewing if desired.

Our server has two main tasks: image matching and data collection. The image matching component uses David Lowe's SIFT, which has been shown to have good precision and is invariant to various image transforms. The implementation we use is a modified version of Lowe's original implementation (an illustrative sketch of the matching step is given at the end of this section). Once the matching is performed, the server retrieves the user rating and track samples for the album from Amazon using the Amazon Web Services API, and parses the HTML of the product page on Metacritic to obtain the review excerpts, links to the full versions of these reviews and the average critic score for the album. Since the input image is small, the total data communicated between the phone and the server is on average 100 KB; most of this is the input image and the matching cover image from our database, which the server sends back so that the user can evaluate the correctness of the match.

The user interface is implemented in Python. Data collection on the server is implemented in VB.Net, whereas the object recognition algorithm is written in C for efficiency. A web interface to the server was also built using ASP.Net. While the user interface we implemented provides a practical mechanism for performing the search, the web interface allows our server to be accessed from any web-capable phone with a browser.
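Our matcher itself is a modified version of Lowe's C implementation and is not reproduced here; as an illustration of the matching step, the sketch below uses OpenCV's SIFT together with Lowe's ratio test [6], with an exhaustive scan over the database as in our system. Function and variable names are illustrative.

    import cv2

    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher(cv2.NORM_L2)

    def match_score(query_path, cover_path, ratio=0.75):
        """Count ratio-test matches between a query photo and a stored cover."""
        query = cv2.imread(query_path, cv2.IMREAD_GRAYSCALE)
        cover = cv2.imread(cover_path, cv2.IMREAD_GRAYSCALE)
        _, query_desc = sift.detectAndCompute(query, None)
        _, cover_desc = sift.detectAndCompute(cover, None)
        good = 0
        for first, second in matcher.knnMatch(query_desc, cover_desc, k=2):
            if first.distance < ratio * second.distance:  # Lowe's ratio test
                good += 1
        return good

    # Exhaustive search: the database cover with the most good matches wins.
    # best = max(cover_paths, key=lambda p: match_score("photo.jpg", p))

This exhaustive scan over the database is what makes query time grow linearly with database size, as measured in the next section.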
4. Performance

To evaluate the performance of our architecture, we test our system with 45 input images, taken with the camera of a Nokia N95 phone, against a database of up to 500 cover images. We measure the percentage of correct matches as well as the time it takes to execute a query and find a match. Fig. 2 shows the percentage of correct matches for different database sizes.

Figure 2 - Percentage of correct matches for different database sizes (100 images: 97.7%; 300 images: 93.3%; 500 images: 93.3%).

Our results show that our method has high accuracy in finding the correct match. Among the 45 images, only one was identified incorrectly with 100 images in our database, and only three with 500. Furthermore, the recognition rate falls rather slowly with increasing database size, which suggests that our method can be relied on to work with good accuracy on larger databases. The examples of correct matches given in Fig. 3 show how reliable our architecture is, even under various transforms and artifacts in the input images and background clutter. All the examples given were correctly matched in all our trials.

Figure 3 - Some examples of correct matches. Our method works reliably even with (a) heavy background clutter in the input image, (b) different viewpoints, and (c) blurry, noisy input images.

Since we perform an exhaustive search to find the matching image, the time it takes to execute a query increases linearly with the number of images in the database. Time measurements on a 1.6 GHz computer with 1 GB of RAM indicate that it takes on average 200 seconds to search through the 500 images in our database. Query time can be improved by using approximate methods to find the match, at the expense of some accuracy; the sketch below illustrates the idea.
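As an illustration of this approximate alternative (not the method used in our experiments), the sketch below indexes all database descriptors in a k-d tree so that each query descriptor is matched by a tree search instead of a full scan; Lowe's Best-Bin-First search [6] is the variant usually paired with SIFT.

    import numpy as np
    from scipy.spatial import cKDTree

    # descriptors_per_cover: list of (n_i, 128) float32 SIFT descriptor
    # arrays, one entry per database cover (computed as sketched above).

    def build_index(descriptors_per_cover):
        all_desc = np.vstack(descriptors_per_cover)
        # owner[j] = index of the cover that descriptor j came from
        owner = np.repeat(np.arange(len(descriptors_per_cover)),
                          [len(d) for d in descriptors_per_cover])
        return cKDTree(all_desc), owner

    def best_cover(tree, owner, query_desc, n_covers):
        """Vote for the cover owning the nearest neighbor of each descriptor."""
        _, nearest = tree.query(query_desc, k=1)
        votes = np.zeros(n_covers, dtype=int)
        np.add.at(votes, owner[nearest], 1)  # unbuffered add handles repeats
        return int(votes.argmax())

In 128 dimensions an exact k-d tree degenerates toward a linear scan, which is why practical variants bound the number of tree bins inspected; that bound is exactly the accuracy-for-speed trade-off mentioned above.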
5. Conclusion and Future Work

In this paper, we have presented a mobile CBIR architecture for album cover recognition, which can be used to provide users with reviews and sample tracks for an album on the go. The high accuracy of our system makes for a reliable and practical application. Furthermore, our system can readily be used with existing cover databases, as we perform the matching against actual digital versions of the cover art. As future work, we plan to implement a faster search by using approximate methods to find the correct match for a query image. We also plan to enrich the data collected on an album by including pricing data from various sources and recommendations of similar items, and to extend our application to other media such as books and DVDs.

References

[1] M. J. Swain and D. H. Ballard, "Color Indexing", International Journal of Computer Vision, Vol. 7, No. 1, pp. 11-32, 1991.
[2] G. Pass and R. Zabih, "Comparing images using joint histograms", Multimedia Systems, Vol. 7, pp. 234-240, 1999.
[3] G. Pass and R. Zabih, "Histogram refinement for content-based image retrieval", IEEE Workshop on Applications of Computer Vision, pp. 96-102, 1996.
[4] H. Tamura, S. Mori, and T. Yamawaki, "Texture features corresponding to visual perception", IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-8, No. 6, 1978.
[5] C. Harris and M. Stephens, "A combined corner and edge detector", ALVEY Vision Conference, pp. 147-151, 1988.
[6] D. G. Lowe, "Distinctive image features from scale-invariant keypoints", International Journal of Computer Vision, Vol. 60, No. 2, pp. 91-110, 2004.
[7] M. Merler, C. Galleguillos, and S. Belongie, "Recognizing Groceries in situ Using in vitro Training Data", to appear, SLAM 2007, Minneapolis, MN, retrieved from http://grozi.calit2.net/files/grozi_slam.pdf.
[8] N. Davies, K. Cheverst, A. Dix, and A. Hesse, "Understanding the role of image recognition in mobile tour guides", Proceedings of the 7th International Conference on Human-Computer Interaction with Mobile Devices & Services, pp. 191-198, 2005.
[9] T. Yeh, K. Tollmar, and T. Darrell, "Searching the Web with Mobile Images for Location Recognition", 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'04), Vol. 2, pp. 76-81, 2004.
[10] P. Föckler, T. Zeidler, B. Brombach, E. Bruns, and O. Bimber, "PhoneGuide: museum guidance supported by on-device object recognition on mobile phones", Proceedings of the 4th International Conference on Mobile and Ubiquitous Multimedia, MUM '05, Vol. 154, pp. 3-10, 2005.
[11] C. Seifert, L. Paletta, A. Jeitler, E. Hödl, J. Andreu, P. Luley, and A. Almer, "Visual Object Detection for Mobile Road Sign Inventory", Lecture Notes in Computer Science, MobileHCI 2004, pp. 491-495, Springer, 2004.
[12] M. Jia, X. Fan, X. Xie, M. Li, and W. Ma, "Photo-to-Search: Using Camera Phones to Inquire of the Surrounding World", The 7th International Conference on Mobile Data Management (MDM'06), p. 46, 2006.
[13] M. Noda, H. Sonobe, S. Takagi, et al., "Cosmos: Convenient Image Retrieval System of Flowers for Mobile Computing Situations", Proceedings of the IASTED Conference on Information Systems and Databases 2002, pp. 25-30, 2002.
[14] H. Sonobe, S. Takagi, and F. Yoshimoto, "Image Retrieval System of Fishes Using a Mobile Device", Proceedings of the International Workshop on Advanced Image Technology 2004, pp. 33-37, 2004.