Judging a Movie by its Poster using Deep Learning

Brett Kuprel
kuprel@stanford.edu

Abstract

It is often the case that a human can determine the genre of a movie by looking at its poster. This task is not trivial for computers. A recent advance in machine learning called deep learning allows algorithms to learn important features from large datasets. Rather than analyzing an image pixel by pixel, for example, higher level features can be used for classification. In this project I attempted to train a neural network of stacked autoencoders to predict a movie's genre given an image of its poster. My hypothesis is that a good algorithm can correctly guess the genre from the movie poster at least half the time.

1. Introduction

A neuron is a computational unit that takes as input $x \in \mathbb{R}^n$ and outputs an activation $h(x) = f(w^T x + b)$, where $f$ is a sigmoidal function. A neural network is a network of these neurons. See the example in figure 1, taken from the UFLDL tutorial (Ng et al., 2010).

Figure 1. Top: a single neuron. Bottom: a neural network (specifically a feedforward network).

1.1. Forward Propagation

A neural network performs a computation on an input $x$ by forward propagation. Let $a^{(l)} \in \mathbb{R}^{n_l}$ be the activations (i.e. outputs) of the $n_l$ neurons in layer $l$, and let $W^{(l)}$ be the matrix whose rows are the weight vectors $w$ of layer $l$. We have the following recursion:

$$a^{(l+1)} = f\left(W^{(l)} a^{(l)} + b^{(l)}\right) \quad (1)$$

To determine the final hypothesis $h$ given an input $x$, iteratively apply this recursion starting from $a^{(0)} = x$.

1.2. Autoencoders

An autoencoder is a neural network that takes as input $x \in [0, 1]^n$, maps it to a latent representation $y \in [0, 1]^{n'}$, and finally outputs $z \in [0, 1]^n$, a reconstructed version of $x$. If the input is interpreted as a bit vector, the reconstruction error can be measured by the cross entropy

$$J(x, z) = -\sum_{k=1}^{n} \left[ x_k \log z_k + (1 - x_k) \log(1 - z_k) \right] \quad (2)$$

When $n' < n$, the latent layer $y$ can be thought of as a lossy compression of $x$. The compression does not generalize to arbitrary $x$, but this is usually acceptable because many datasets lie on lower dimensional manifolds. Natural images, for instance, are a very small subset of all possible images.
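To make the forward propagation recursion of equation (1) concrete, here is a minimal numpy sketch; the two-layer network, its sizes, and the random weights are arbitrary stand-ins rather than anything from the project:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def forward_propagate(x, weights, biases):
        # Apply a^(l+1) = f(W^(l) a^(l) + b^(l)) starting from a^(0) = x (equation 1).
        a = x
        for W, b in zip(weights, biases):
            a = sigmoid(W @ a + b)
        return a  # the final hypothesis h

    # Toy network: 4 inputs -> 3 hidden units -> 2 outputs.
    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
    biases = [np.zeros(3), np.zeros(2)]
    h = forward_propagate(np.array([0.1, 0.5, 0.2, 0.9]), weights, biases)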
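The reconstruction error of equation (2) is just as direct to compute; the input and reconstruction below are made-up values:

    import numpy as np

    def reconstruction_error(x, z, eps=1e-12):
        # Cross entropy between an input x and its reconstruction z, both in [0, 1]^n (equation 2).
        z = np.clip(z, eps, 1.0 - eps)  # guard against log(0)
        return -np.sum(x * np.log(z) + (1.0 - x) * np.log(1.0 - z))

    x = np.array([0.0, 1.0, 1.0, 0.0])
    z = np.array([0.1, 0.8, 0.9, 0.2])  # a hypothetical reconstruction of x
    print(reconstruction_error(x, z))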
2. Methods

2.1. Model

Let each movie poster be a vector $x^{(i)} \in \mathbb{R}^n$, where $n$ is the number of pixels in the image. Each movie belongs to at most 3 genres. I express this as a boolean vector $y^{(i)} \in \{0, 1\}^{|G|}$, where $G$ is the set of genres and $y^{(i)}_j = 1$ if movie $i$ belongs to genre $j$, and 0 otherwise.

Figure 2. An autoencoder.

The algorithm will produce a single genre prediction $\hat{y} \in G$. Define the prediction as the argmax of the conditional probability distribution:

$$\hat{y}^{(i)} = \arg\max_{j} \; P(Y = j \mid x^{(i)}) \quad (3)$$

and the conditional probability table (CPT) as the softmax of the final hypothesis layer of the network:

$$P(Y = j \mid x^{(i)}) = \frac{\exp(h_j)}{\sum_{k} \exp(h_k)} \quad (4)$$

where $h \in \mathbb{R}^{|G|}$ is found by forward propagation of $x^{(i)}$ through the network.

The goal is to minimize the prediction error rate. Define an error to occur when the predicted genre for movie $i$ is not in the set of genres that movie $i$ belongs to:

$$\text{\% Error} = \frac{1}{|D|} \sum_{i \in D} \left(1 - y^{(i)}_{\hat{y}^{(i)}}\right) \quad (5)$$

This error rate is not differentiable in the model parameters $W$ and $b$ because of the argmax used to find $\hat{y}$. To train the network, I instead minimize the negative log likelihood:

$$J(W, b) = -\sum_{i} \sum_{j} Y_{ij} \log P(Y = j \mid x^{(i)}) \quad (6)$$

where the $i$th rows of $X$ and $Y$ are $x^{(i)}$ and $y^{(i)}$, and the sum runs over all elements. Notice that the matrix of conditional probabilities is indexed by the boolean matrix $Y$, so this is still a differentiable function of the parameters $W$ and $b$.

2.2. Learning Parameters

Backpropagation is a gradient-based method used to train the weights in a neural network. It uses gradient descent to update the parameters:

$$W^{(l)}_{ij} \leftarrow W^{(l)}_{ij} - \alpha \frac{\partial J}{\partial W^{(l)}_{ij}}, \qquad b^{(l)}_{i} \leftarrow b^{(l)}_{i} - \alpha \frac{\partial J}{\partial b^{(l)}_{i}} \quad (7)$$

Gradients are often messy to derive by hand and do not provide much insight into the problem at hand. I used a Python package called Theano that calculates these gradients and applies the updates to the model parameters $W$ and $b$ behind the scenes.

2.3. Stacking Autoencoders

Before applying backpropagation, it would be nice if $W$ and $b$ were initialized to something reasonable. A known problem with training neural networks is diffusion of gradients: when backpropagation is run from scratch, only the nodes close to the final layer are updated properly. A greedy method for initializing $W$ and $b$ is to stack autoencoders. The idea is simple: train an autoencoder on a set of data $X$, use the learned feature representations as the input to another autoencoder, and repeat until you have as many layers as you want. For this project I used 3 latent layers (aside from the initial data and hypothesis layers). This process results in a reasonable initialization of the weights $W$ and $b$. It also allows unlabeled data to be used effectively for feature learning; of the movie poster images I had, fewer than 1/6 had genre labels.
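Before moving on to the data, a few short sketches of the pieces above may help. First, the prediction and error metric of equations (3) and (5), written with numpy; the probabilities and labels are toy values:

    import numpy as np

    def error_rate(probs, Y):
        # probs: (m, |G|) predicted genre distribution; Y: (m, |G|) boolean label matrix.
        predicted = probs.argmax(axis=1)            # equation (3)
        hits = Y[np.arange(len(Y)), predicted]      # y^(i) indexed at the predicted genre
        return 1.0 - hits.mean()                    # equation (5)

    Y = np.array([[1, 0, 1], [0, 1, 0]])            # two movies, three genres
    probs = np.array([[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]])
    print(error_rate(probs, Y))                     # 0.5: the second prediction misses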
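Next, a minimal Theano sketch of the softmax layer, the negative log likelihood of equation (6), and the gradient updates of equation (7). This shows only a top logistic-regression layer with illustrative sizes and variable names, not the project's actual multi-layer code:

    import numpy as np
    import theano
    import theano.tensor as T

    n_in, n_out = 100 * 100, 20                     # flattened poster pixels, number of genres
    x = T.matrix('x')                               # minibatch of posters, one per row
    y = T.matrix('y')                               # boolean genre indicator matrix Y

    W = theano.shared(np.zeros((n_in, n_out), dtype=theano.config.floatX), name='W')
    b = theano.shared(np.zeros(n_out, dtype=theano.config.floatX), name='b')

    p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)   # equation (4)
    cost = -T.sum(y * T.log(p_y_given_x))           # equation (6)

    alpha = 0.01
    g_W, g_b = T.grad(cost, [W, b])                 # Theano derives the gradients symbolically
    train_step = theano.function(                   # one gradient descent update, equation (7)
        inputs=[x, y], outputs=cost,
        updates=[(W, W - alpha * g_W), (b, b - alpha * g_b)])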
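Finally, a compact numpy illustration of the greedy layer-wise pretraining described in section 2.3: train one autoencoder, feed its latent codes to the next, and keep the learned encoder weights as the initialization of the deep network. The layer sizes, learning rate, and random data here are toy stand-ins (the project itself used Theano for this step):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def train_autoencoder(X, n_hidden, lr=0.1, epochs=50, seed=0):
        # One sigmoid autoencoder trained by gradient descent on the cross entropy of equation (2).
        rng = np.random.default_rng(seed)
        n_visible = X.shape[1]
        W1 = rng.normal(0.0, 0.01, (n_visible, n_hidden)); b1 = np.zeros(n_hidden)
        W2 = rng.normal(0.0, 0.01, (n_hidden, n_visible)); b2 = np.zeros(n_visible)
        for _ in range(epochs):
            Y = sigmoid(X @ W1 + b1)                # latent representation y
            Z = sigmoid(Y @ W2 + b2)                # reconstruction z
            dZ = (Z - X) / len(X)                   # gradient of the cross entropy wrt the output pre-activation
            dY = (dZ @ W2.T) * Y * (1.0 - Y)
            W2 -= lr * (Y.T @ dZ); b2 -= lr * dZ.sum(axis=0)
            W1 -= lr * (X.T @ dY); b1 -= lr * dY.sum(axis=0)
        return W1, b1, sigmoid(X @ W1 + b1)

    X = np.random.rand(256, 100)                    # stand-in for flattened posters in [0, 1]
    weights, biases, codes = [], [], X
    for n_hidden in [64, 32, 16]:                   # three latent layers, as in the project
        W, b, codes = train_autoencoder(codes, n_hidden)
        weights.append(W); biases.append(b)         # use these to initialize the deep network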
2.4. Getting the Data

IMDB has a link that goes to a random popular movie: http://www.imdb.com/random/title. Using this link, I can obtain the genre, rating, and poster for N movies as shown in algorithm 1. I used the BeautifulSoup package in Python to scrape the HTML.

Algorithm 1 Scrape IMDB
  Input: desired number of movies N
  Output: dictionary M where a movie key m points to genre g, rating r, and movie poster p
  M <- {}
  U <- http://www.imdb.com/random/title
  while |M| < N do
    m <- getMovieTitle(U)
    if m not in M then
      p <- getMoviePoster(U)
      g <- getMovieGenre(U)
      r <- getMovieRating(U)
      M[m] <- {genre: g, rating: r, poster: p}
    end if
  end while

This algorithm seems to exhaust IMDB's random popular movie function at a little under 1,000 movies; at that point, it visits close to 100 pages before seeing a movie that hasn't been scraped yet. There is another website, Movie Poster DB, that also has a random movie link and claims to host over 100 thousand movie posters. While these posters do not have ratings or genre labels, they can still be used for feature learning. I wrote a similar script for that website and was able to scrape 5,000 posters in just a few hours.

Table 1. Genre counts for movies in the IMDB dataset. One movie can belong to multiple genres.

    Genre         Count
    Drama         365
    Comedy        247
    Action        234
    Adventure     178
    Crime         170
    Thriller      135
    Sci-Fi        102
    Fantasy        90
    Romance        89
    Mystery        79
    Horror         54
    Animation      53
    Family         49
    Biography      30
    History        23
    Documentary    18
    War            16
    Sport          11
    Western         4
    Musical         3

2.5. Preparing Data

Let M be the dictionary returned by algorithm 1. I split the data into a training set $D_{train}$, a validation set $D_{valid}$, and a test set $D_{test}$ with sizes 80%, 10%, and 10%.

One frustration I ran into while scraping posters was that there is no standard image shape. To apply most machine learning algorithms, each data point should have the same set of features (i.e. the same image size). The PIL package in Python provides PIL.Image.resize(new_size), which converts an image of any size to any other size. After playing around with different image sizes, I settled on 100 x 100. The change in aspect ratio does not affect the poster as much as I would have expected. I also had to decide what to do with color. Color does not seem to add enough information to warrant tripling the number of features (or reducing the number of pixels per image to one third), so I used the luminosity function

$$\text{gray} = 0.299\,\text{red} + 0.587\,\text{green} + 0.114\,\text{blue}$$

to convert each image to grayscale. See figure 3 for the preprocessed IMDB dataset.

Figure 3. Movie posters from IMDB, standardized to 100 x 100 pixels and converted to grayscale.

2.6. Implementing a Neural Net

I used a Python package called Theano (Bergstra et al., 2010) designed for deep learning. It simplifies running algorithms on the GPU: the same code written for the CPU will work on the GPU, as long as floats are used. Among other things, it uses lazy evaluation and performs symbolic differentiation. I found an example of using stacked autoencoders to classify the MNIST handwritten digits dataset and used it as starter code to build my movie poster classifier.
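To make Algorithm 1 from section 2.4 concrete, here is a minimal sketch of the scraping loop. It assumes the requests library for fetching pages (the report only mentions BeautifulSoup), pulls the title from the page's <title> tag, and leaves the genre, rating, and poster extraction as stubs, since the selectors for those depend on IMDB's markup:

    import requests
    from bs4 import BeautifulSoup

    RANDOM_URL = 'http://www.imdb.com/random/title'

    def scrape_movies(n_desired):
        movies = {}
        while len(movies) < n_desired:
            html = requests.get(RANDOM_URL).text
            soup = BeautifulSoup(html, 'html.parser')
            title = soup.title.get_text(strip=True)   # page <title>; trimming any site suffix is omitted
            if title in movies:
                continue                              # already scraped; follow the random link again
            movies[title] = {
                'genre': None,                        # parse from the page here (markup-specific)
                'rating': None,
                'poster': None,
            }
        return movies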
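The preprocessing of section 2.5 amounts to one resize and one weighted sum per poster. A small sketch with PIL and numpy, assuming a poster saved as a local file (the file name is hypothetical):

    import numpy as np
    from PIL import Image

    def preprocess_poster(path, size=(100, 100)):
        img = Image.open(path).convert('RGB').resize(size)   # force a common 100x100 shape
        rgb = np.asarray(img, dtype=np.float64) / 255.0
        gray = rgb @ np.array([0.299, 0.587, 0.114])         # luminosity grayscale
        return gray.ravel()                                  # 10,000-dimensional feature vector

    x = preprocess_poster('poster.jpg')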
Figure 4. Training speedup using the GPU.

3. Results

I scraped a total of 5,800 images, each 100 by 100 pixels in grayscale; 800 of them are shown in figure 3. 5,000 of the images have no genre labels. The remaining 800 have genre labels distributed as shown in table 1, and each movie has between 1 and 3 genre labels associated with it.

Training a 3-layer architecture (layers of 1000, 500, and 300 nodes) topped with a layer of multiclass logistic regression results in a validation set error rate of 47% and a test set error rate of 49.5%. This means that, given a movie's poster, the algorithm correctly predicts one of its genres out of 20 possible genres about 50% of the time. Note that if drama is guessed every time, the genre is predicted correctly 45.6% of the time. A plot of the negative log likelihood over the number of iterations through the training set is shown in figure 5.

Figure 5. Negative log likelihood.

The images that most highly activate the neurons in the 3rd layer are shown in figure 6. Notice that a few of them look like faces. The first and second layer features were less exciting, so I did not include them. My code is split across many files, so I decided to omit it from the report; please email me if you want any part or all of it.

Figure 6. Learned features in the 3rd hidden layer.
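As a quick sanity check of the baseline quoted above: with 800 labeled posters, 365 of which are tagged Drama (table 1), always guessing drama is correct 365/800 of the time:

    n_labeled, n_drama = 800, 365
    print(f'{n_drama / n_labeled:.1%}')   # 45.6%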
4. Conclusion

It was difficult to implement a deep neural network for the first time in 1 week. That said, I think my neural network suffered from the curse of dimensionality. My images were 100 by 100 pixels, for a total of 10,000 variables per training example, and I only had 5,000 training examples. A smarter method might be to cut each poster into patches and then classify using a voting scheme amongst the patches in the poster. Another method could be to simply use lower resolution movie posters.

References

Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral presentation.

Ng, Andrew, Ngiam, Jiquan, Foo, Chuan Y., Mai, Yifan, and Suen, Caroline. UFLDL tutorial, 2010. URL http://ufldl.stanford.edu/wiki/index.php/ufldl_tutorial.