Automatic Name-Face Alignment to Enable Cross-Media News Retrieval

Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence Autoatic ae-face Alignent to Enable Cross-Media ews Retrieval Yueie Zhang*, Wei Wu*, Yang Li*, Cheng Jin*, Xiangyang Xue*, Jianping Fan + *School of Coputer Science, Shanghai Key Laboratory of Intelligent Inforation Processing, Fudan University, Shanghai, China + Departent of Coputer Science, The University of orth Carolina at Charlotte, USA *{yzhang, 10210240122, 11210240052, c, xyxue}@fudan.edu.cn, + fan@uncc.edu Abstract A new algorith is developed in this paper to support autoatic nae-face alignent for achieving ore accurate cross-edia news retrieval. We focus on extracting valuable inforation fro large aounts of news iages and their captions, where ulti-level iage-caption pairs are constructed for characterizing both significant naes with higher salience and their cohesion with huan faces extracted fro news iages. To reedy the issue of lacking enough related inforation for rare nae, Web ining is introduced to acquire the extra ultiodal inforation. We also ephasize on an optiization echanis by our Iproved Self-Adaptive Siulated Annealing Genetic Algorith to verify the feasibility of alignent cobinations. Our experients have obtained very positive results. 1 Introduction With the explosive growth of ultiodal news available both online and offline, how to integrate ultiodal inforation sources to achieve ore accurate cross-edia news retrieval becoes an iportant research focus [Datta et al., 2008]. Usually ultiodal news is exhibited with the for of captioned news iages, which ostly describe stories about people [Liu et al., 2008]. Thus enabling autoatic nae-face alignent has becoe a critical issue for supporting cross-edia news retrieval. However, because of the seantic gap [Fan et al., 2012], there ay exist huge uncertainty on the correspondence relationships aong naes (in captions) and faces (in iages) [Deschacht et al., 2007], [Yang et al., 2009], e.g., the relationship between naes and faces in a news is any-to-any rather than one-to-one. To achieve autoatic nae-face alignent, three inter-related issues should be addressed siultaneously: 1) ultiodal analysis of cross-edia news to identify better correspondences between naes and faces; 2) discovery of the issing inforation for rare nae; and 3) ultiodal optiization to achieve ore effective alignents for all possible nae-face pairs. To address the first issue, it is very iportant to develop robust algoriths that are able to achieve ore accurate nae-face alignent. To address the second issue, it is very interesting to leverage large-scale Web news for issing inforation prediction. To address the third issue, it is critical to develop a new optiization algorith with high accuracy rate but low coputational cost. Based on these observations above, a novel schee is developed in this paper for facilitating autoatic nae-face alignent to enable ore effective cross-edia news retrieval. Our schee significantly differs fro other earlier work in: a) Salient naes and cohesive faces are extracted fro cross-edia news. Such naes ight have higher correspondence possibility with the faces in news iages, and such faces ight have higher coherence with the salient naes. b) Web news is integrated for discovering issing inforation for rare nae. c) An efficient easureent and optiization echanis is established to verify the feasibility of our autoatic alignent algorith. d) A new algorith is developed to achieve ore precise characterization of the correlations between naes and faces. Such cross-edia alignent is treated as a proble of bi-edia seantic apping, and odeled as a correlation distribution over seantic representations of naes and faces. Our experients on a large nuber of public data fro Yahoo! ews have obtained very positive results. 2 Related Work Alignent of naes and faces in cross-edia news is not a novel task [Satoh et al., 1997&1999], [Berg et al., 2004&2005]. Earlier research considered acquiring the relevant faces based on the original text query over iage captions, and ranking or filtering the returned iages by a face detector. However, the irrelevant or incorrect results ay be produced by utilizing the siple atching between a query nae and captions. Most face recognition approaches are only applied to the controlled setting and liited data collections [Mensink et al., 2008], [Bozorgtabar et al., 2011]. Thus the increasing research interest focuses on cobining textual and visual inforation to support precise alignent. In recent years, there is soe related research work by using textual inforation that accopanies the iage. Yang et al. [2004] proposed an approach for finding the specific persons in broadcast news videos by exploring various clues such as naes in the transcript, face, anchor scenes and the tiing pattern between naes and people. Everingha et al. 2768

[2006], [2009] investigated the proble of autoatically labeling appearances of the naes in TV or fil, and deonstrated that high precision can be achieved by cobining ultiple sources of inforation. Ozkan et al. [2006] proposed a graph-based ethod to retrieve the correct faces of a queried person using both text and visual appearances. Le et al. [2007], [2008] introduced an unsupervised ethod for face annotation by ining Web, which aied to retrieve relevant faces of one person by learning the visual consistency aong results retrieved fro text-correlation-based search engines. It s necessary to ention the ost interesting research work developed by Berg et al. [2004], [2005], [2007], Guillauin et al. [2008], [2012], and Pha et al. [2010], [2011]. Berg et al. proposed a ethod for association of naes to faces using ore realistic dataset. Guillauin et al. considered two scenarios of naing people in databases of news photos with captions, that is, finding faces of a single query person and assigning naes to all faces. Pha et al. reported their experients on aligning naes and faces as found in the iages and captions of online news websites. Unfortunately, all these existing approaches have not provided good solutions for the following iportant issues: (1) Analysis and Measure for ae Salience and ae-face Cohesion in Deep Level Most existing alignent ethods focus on exploiting all the inter-linkage inforation between each nae and each face in a sae captioned news iage, which leads to a serious cobinatorial proble for possible nae-face pairs. Without sufficient analysis of nae salience and easure for nae-face cohesion, soe naes and faces ay for the noisy inforation and cause the insignificant nae-face atching udgeent. (2) Discovering Extra Multiodal Inforation for Rare ae Most existing work concerns finding faces of a specific nae underlying such a precondition that the relevant face set consists of a large group of highly siilar faces for the original nae query. With the increasing growth of Web news, it appears that such Web inforation can itigate the lack of relevant faces for rare nae. According to our best knowledge, no existing research has ade full use of Web news to achieve ore accurate alignent for rare nae. (3) Multiodal Alignent Optiization for ae-face Pairs Finding the best atching between naes and faces for all the pairs is very difficult and coplex, and ay becoe a P-hard proble. Probability-statistical odels can be used for coputing the alignent likelihood, but have the serious proble of getting stuck in a local axiu. Fro the view of cobinatorial optiization, it s a very significant way to set a global obective function, the local constraint conditions and an integer prograing odel. 3 ae Salience Identification In a news caption, not all naes are equally iportant. Thus it s necessary to evaluate every nae in the sae caption and udge which naes ay possibly co-occur with the faces in the original iage-caption pair. Here, salience is defined as a easureent for the iportant degree of each nae in a caption. A pattern for coputing such a score is constructed based on the in-depth ulti-level analysis of caption. In the initial nae list, the naes are enuerated without any ranking. Since the syntactic structure is indicative of various inforation distribution in a caption, the relative rank of naes can be confired by eans of the roles of naes and the relationships aong naes in the syntactic parse tree. (1) Syntactic Parse Tree Depth (SPTD) This factor indicates the iniu depth for all the naes of a certain nae clustering in the parse tree for a caption. If a nae has a shallower depth, it will have the higher iportance. Given a caption with naes, each nae is associated with its corresponding nae clustering C i and C i represents the th nae in C i. The depth value for each C i can be defined as: C Ci SC SPT Depth Ci i SPTD in 1 _ (1) where SC(C i ) denotes the size of C i, i.e., the nuber of naes with the co-reference relationship contained in C i, which can also be understood as the occurrence frequency of the sae naes with different fors in the given caption; and SPT_Depth(C i) is the depth of C i in the parse tree. (2) Syntactic Parse Tree Traversal Order (SPTTO) This factor indicates the iniu breadth-first traversal order for all the naes of a certain nae clustering in the parse tree for a caption. It s proposed based on such a fact that aong the nodes in the sae level, a node with the higher priority of the traversal order eans its higher relative iportance. The traversal order value for each C i can be defined as: C Ci SC SPT _ BFT OrderCi i SPTTO where SPT_BFT-Order(C i) is the breadth-first traversal order of C i in the syntactic parse tree. In ost cases the higher frequency for a specific nae in a caption indicates its higher iportance. It s very necessary to take such frequency into account as a key reference factor. Given different naes in a caption, in which each nae corresponds to a nae clustering C i, i=1,,, the initial rank value, Relative Salience (RS), can be coputed as: in (2) 1 SC Ci 1SPTDC SPTDCi RS Ci (3) 1 SCC 1 1SPTDC 1 SPTTOC SPTTOCi, i1 RS Ci 1 1 1 SPTTOC where α, β and γ are the relative iportance coefficients, α+β+γ=1. Maybe there is an extree situation that only one nae is involved in the caption, it s insignificant to evaluate the relative salience for such a unique nae, RS(C 1 )=1. It s rather crude to assue that every nae in the caption appears in the iage. Thus it s very significant to ake the further refineent to filter the naes with the extreely lower possibility to be pictured as the detected faces. A visual survey for nae-face pairs with the association linkage learned that the nuber of available faces is generally less than or equal to the nuber of detected naes in a pair. We ake the siplifying assuption that all the naes with the initial rank values below a certain threshold ay not be pictured as the selected faces. Given F faces and naes (F<=) in the initial rank list for a pair, F selected faces (F <=F) are retained after face filtering and naes are refined to naes (F <= <=) according to the visual exhibition inforation. Under such refineent, ultiple naes with the relatively higher iportance are chosen to interpret the selected faces in the iage, i.e., nae denoising. 2769

4 ae-face Cohesion Measure The general assuption for nae-face alignent is that the appearances of the faces for a person/nae are ore siilar than those for different persons/naes. Thus it s very useful to establish a cohesion easureent to evaluate the intrinsic association degree aong all the faces in the whole database. Firstly, the siilarity between two faces is evaluated based on the coon neighbors aong k-nearest neighbors for every face. Given two faces F i and F in the local face set FS_ related to a real nae, if F i is close to F and they are both close to FS_, F i and F are close with higher confidence for the utual density since their siilarity is deterined based on all the faces in FS_. Thus the siilarity between a pair of faces can be defined as: KSFi, FS _, k KSF, FS _, k SiFi, F (4) k where KS(F i, FS_, k) and KS(F, FS_, k) denote the k-nearest neighbors when F i or F positions in FS_ ; and k is a dynaically varied value and changes according to the size of the local face set related to the current nae. Secondly, the Local Density Score (LDS) is utilized to easure the density for each face in the local face set. The larger the density is, the ore relevant the face and the nae associated with the face set are. The LDS for a face F i can be described as the average siilarity between this face and its k-nearest faces in the sae face set. The higher LDS indicates the higher connectivity between F i and its k-nearest faces, that is, F i is ore relevant with associated with FS_. F KSF FS k SiFi F i LDSFi FS k, _,,, _, (5) k Thirdly, the Local Cohesion Degree (LCD) is especially proposed to easure the utual density aong all the faces in the face set related to a particular nae under the current global alignent anner, and defined as: FS, k LDSFi, FS _ k LCD _ FiFS _, (6) If a face has the higher LDS value in the related face set, this face will have the higher density with the other faces in the sae face set and then will be ore relevant with the nae corresponding to the face set. Therefore, as the su of the LDS values for all the faces in a local face set, LCD can be well representative of the utual density aong all the faces in this face set. The larger the LCD is, the higher the utual density aong all the faces in the face set will be. The higher LCD can be regarded as a good indication of the higher local cohesion for the face set, that is, the face set is ore relevant with the assigned nae. Under an arbitrary global alignent anner, the su of the LCD values for all the face sets can reflect the whole global cohesion for the coplete face set related to all the naes. Thus the larger su of the LCD values for the face sets ebodies the higher global cohesion for the coplete face set, and then further expresses the higher global relevance between the whole face set and the nae set under the current global alignent anner. 5 Web Mining for Rare ae The nae-face alignent can also be regarded as a restricted face naing or retrieval proble. The general assuption is that the nuber of faces corresponding to the query nae is relatively large. However, soe rare naes ay lead to such a fact that the nuber of their faces is very sall. The rare nae here does not ean the real single nae, but refers to the rare person that occurs only several ties or even one tie in the whole database. Thus for the alignent evaluation of rare nae, it s necessary to establish a discovery echanis to suppleent ore available ultiodal inforation. As the vast knowledge base, Web has becoe the best resource to create a reference annotated face iage set for rare nae. Face iages about a specific person can be autoatically retrieved by using Web iage search engines. Based on Web ining, a certain nuber of relevant face iages and their annotations can be discovered siultaneously, and becoe the valuable enrichent for rare nae. All the nae expression fors for a rare nae are collected in a rare nae clustering. Each for in the clustering is taken as an original nae query and subitted to Web iage search engines. Thus a list of relevant face iages and their annotations can both be acquired. Each returned face iage will be processed through the face detection and filtering, and the available and eaningful face iages are kept. Given a rare nae ae 0, all of its face iages can be ranked according to the salience value of ae 0 in each face iage. The Top-R face iages are aintained as the copleentary face inforation to ae 0. R is a generalized threshold for the copleentary face iage selection, and does not have a fixed setting. Because there ay be different nubers of face iages returned for different rare naes, R eans a liited proportion of the face iages preferred, and varies with the size of the returned face iage set. 6 ae-face Alignent via ISSAGA We consider such an alignent task as an optiization proble for searching the optial nae-face atching in each pair. The Standard Genetic Algorith (SGA) can obtain the global optial search using genetic operations, but there are soe disadvantages such as preaturity and the poor search power for local optial solutions [Wang et al., 2005]. However, the Siulated Annealing Algorith (SAA) has the stronger ability for local optiization [Andresen et al., 2008]. Thus the Iproved Self-Adaptive Siulated Annealing Genetic Algorith (ISSAGA), which integrates the advantages of SGA and SAA and a self-adaptive probability adustent strategy, is introduced to ake a better solution. 6.1 Multiodal ae-face Alignent Suppose P iage-caption pairs include F real faces (i.e., ot ull Face) and real naes (i.e., ot ull ae), PS denotes the Iage-Caption Pair Set for P pairs; FS denotes the Face Set for F faces; S denotes the ae Set for naes; W_FP i denotes whether the face F belongs to the pair P i, P i PS, F FS, i=1,, PS, =1,, FS, which is equal to: 1, if the face F is inthe pair Pi W _ FPi 0, otherwise W_P ik denotes whether the nae k belongs to P i, P i S, i=1,, PS, k=1,, S, which is equal to: k PS, 2770

1, if the nae k is in the pair Pi W _ Pik 0, otherwise FP i denotes the face set for P i, FP i ={F W_FP i=1, F FS}, P i PS, i=1,, PS ; P i denotes the nae set for P i, P i ={ k W_P ik =1, k S}, P i PS, i=1,, PS ; W_F k denotes whether F is assigned to k, F FS, k S, =1,, FS, k=1,, S, which is equal to: 1, if the face F is assigned to the nae k W _ Fk 0, otherwise FS_ k denotes the total associated face set for k based on the global assignent of naes and faces, k S, k=1,, S ; LCD(FS_, k) denotes the utual density aong all the faces in the face set FS_ related to the nae under a certain global alignent anner, S, =1,, S. Thus our atheatical odel can be foralized as: ax LCDFS _, k Fi FS _ LDSFi, FS _, k(7) S ax S FS _ FS _ s.t., 1) FPi FS, Pi S. That is, the union set of the 1i PS 1i PS face set included in each pair is the universal face set of the whole database, and the union set of the nae set included in each pair is the universal nae set. 2) FP i FP =Ф, i, i, =1,, PS. That is, there is no intersection set between the face sets of two pairs. 3) FP i = P i, i=1,, PS. That is, to establish ore efficient coding inforation for the optiization algorith, we for a constraint that the nubers of faces and naes in an arbitrary pair ust be sae. ull faces or naes can be used to suppleent the less ones. 4) Pi 1 _ k W Fk 1, F FP i, k P i, i=1,, PS, =1,, FP i. That is, in a pair each real face can only be assigned to FPi at ost a real nae in the sae pair. 5) 1 _ W Fk 1, F FP i, k P i, i=1,, PS, k=1,, P i. That is, in a pair each real nae can only be assigned to at ost a real face in the sae pair. 6), _ F 0 FPi l P W Fl, i=1,, PS, i =1,, FP i, l=1,, S. That is, the face that belongs to a pair can only be assigned to the nae in the sae pair. 6.2 ISSAGA-based Multiodal Optiization a. Encoding of Chroosoe The naes fro all the pairs are positioned in a fixed order, and the faces fro all the pairs are ranked in segents and then grouped into a chroosoe. Each chroosoe corresponds to a solution. A repeatable natural nuber encoding pattern is adopted to design the gene g i of the chroosoe C, C={g i}, i=1,, PS, =1,, FP i, where i is the pair nuber; is the face nuber for the pair P i ; and g i denotes the nuber for the face F in P i. C can be expanded as {g 11,, g 1 FP1,, g i1,, g i FPi,, g PS 1,, g PS FP PS }, where {g i1,, g i FPi } is a segent and each segent keeps the relatively independent relationship with the other segents. b. Initial Population The initial population P(t) with L chroosoes is produced by a stochastic ethod. The stochastic order in each segent is used to generate an initial chroosoe C, and then the other L-1 different chroosoes are achieved fro C. According to the obective function value, the current best chroosoe can be regarded as the initial optial solution. c. Self-Adaptive Selection-Reproduction Operator According to our atheatical odel, the obective function for the l th chroosoe in a population can be constructed as: LCD FS _, k OF Cl (8) S FS _ To liit the quantity of chroosoes that dissatisfy the constraints in the new population, we use a Multiple Roulette Wheel ethod and copute the selection probability as: f ' C (9) l PS Cl, l 1,, M M f ' Ck k1 where M is the nuber of chroosoes in the current population, and f ( ) is obtained through the self-adaptive transforation based on the original fitness function f( ) to avoid preaturity and slow convergence speed. g ax e e f Cl OFCl, f ' Cl a * f Cl g * f ax f in(10) g ax e e where f ax and f in are the axiu and iniu fitness values for the current population; g is the nuber of generations under the current genetic anner; g ax is the axiu nuber of genetic generations; and a is a constant paraeter. d. Self-Adaptive Siulated Annealing Crossover Operator Considering in ost cases there are two or three faces in a news iage and the axiu order nuber for the faces in a segent of a chroosoe is also two or three, thus a One-Point crossover ethod is adopted. Two special strategies of the self-adaptive crossover probability (P C ) coputation and the siulated annealing operation are introduced. PC1 PC 2ax f Ci, f C (11) PC1, ax f Ci, f C PC Ci, C f ax PC1, ax f Ci, f C where f avg is the average fitness value in the current population; P C1 and P C2 are two predefined paraeters. The siulated annealing operation is utilized to udge whether the inferior solution should be accepted to substitute the previous chroosoe when facing with an inferior solution and generate a probability (P A ) for accepting the inferior solution. 1 PA Cl (12) f Cl' f Cl g T 0 1 e where f(c l ) denotes the fitness value for the new chroosoe C l generated after the crossover; T 0 is the initial teperature for the siulated annealing operation; and δ is a preset ratio coefficient for the teperature reduction. e. Self-Adaptive Siulated Annealing Mutation Operator The utation operator cannot be executed aong segents but inside every segent by utilizing an Exchange Mutation ethod. Meanwhile, two special strategies of the self-adaptive utation probability (P M ) coputation and the siulated annealing operation are also introduced. PM 1 PM 2f ax f Cl PM 1, f Cl (13) PM Cl f ax PM 1, f Cl where P M1 and P M2 are two predefined paraeters. Given a chroosoe C k ={S 1,, S i,, S }, = PS, S i is the i th segent. Taking a segent as a unit, the 2-exchange utation is accoplished aong the segents of C k by P M (C k ). g 2771

7 Experient and Analysis 7.1 Dataset and Evaluation Metrics Our dataset is established based on Labeled Yahoo! ews Data constructed by Berg et al. [2005] and further developed by Guillauin et al. [2008]. It consists of 20,071 captioned news iages, and all the iage-caption pairs contain 31,147 face iages that belong to 5,873 different persons/naes. To evaluate the effectiveness of our algorith, the ground truth associations are considered to easure the benchark etrics for nae-face aligned pairs. Correct Label Rate (CLR) is defined as the percentage of correct pairs generated in the ground truth links. Our evaluation can not only consider the association links between real naes and faces, but also concern the assignents to null naes or null faces. 7.2 Experient on Cross-Media Alignent Our cross-edia alignent odel is created by integrating ae Salience Ranking (SR), ae-face Cohesion Measure (FCM), Web-based Multiodal Inforation Mining (WMIM) for rare nae and the Iproved Self-Adaptive Siulated Annealing Genetic Algorith of ISSAGA. To investigate the effect of each part on the whole alignent perforance, we introduce three evaluation patterns: 1) Baseline(FCM) based on the basic preprocessing with o SR for captions and FCM for iages; 2) Baseline(FCM)+SR based on the baseline odel with FCM, SR is integrated; 3) Baseline(FCM)+SR+WMIM by fusing the baseline odel with FCM and SR, WMIM is added. These patterns are tested in two alignent optiization anners of general GA and ISSAGA, which ais to explore the advantages of our proposed ISSAGA. To verify the consistency of our alignent odel in different apping cases, the perforance evaluation focuses on two assignent schees. One contains the specific appings of Real Face->ull ae (RF->) and Real ae->ull Face (R->F) except the general apping of Real Face<->Real ae (), and another only involves. The experiental results are shown in Table 1. Assignent Schee Evaluation Pattern GA ISSAGA CLR (%) CLR (%) Baseline(FCM) [o SR] 52.12 53.49 RF-> Baseline(FCM)+SR 58.81 60.76 R->F Baseline(FCM)+SR+WMIM 61.71 63.16 Baseline(FCM) [o SR] 54.31 56.78 Baseline(FCM)+SR 61.17 62.47 Baseline(FCM)+SR+WMIM 63.21 65.77 Table 1. The experiental results on our whole dataset. It can be seen fro Table 1 that for the nae-face linking in our whole dataset, we can still obtain the best CLR value of 65.77% in the sae evaluation pattern of fusing Baseline with FCM, SR, WMIM and ISSAGA and excluding the appings related to ull ae and ull Face. In coparison with the baseline odel, the alignent perforance could be prooted to a great degree by successively adding SR and WMIM under our ISSAGA-based optiization. Copared the results for different assignent schees, we can observe that when only considering the alignent between real naes and faces, the alignent perforance increases and becoes pretty good. Although when evaluating the links with null naes and faces the alignent perforance aybe slightly influenced by such extra uncertainty, the CLR can still reach a relatively high value. Through the coparison between the results based on the traditional GA and ISSAGA, it can be found that our ISSAGA is obviously superior to the traditional GA and ore suitable for the alignent optiization. In addition, the convergence curves of the best solutions found by the traditional GA and our ISSAGA are shown in Figure 1. The traditional GA iproves the solutions very fast within 60 iterations, and reaches a plateau after that. In contrast, our ISSAGA keeps constantly iproving the solutions with ore iterations, gives uch better solutions, and has a stronger ability of escaping fro the local optia. It can be concluded that our ISSAGA provides better solutions within a satisfactory aount of tie and is very useful to contribute better solutions for such a special optiization proble. Figure 1. The convergence curves of the traditional GA and our ISSAGA. 7.3 Experient on Cross-Media ews Retrieval To further explore the applicability of our alignent odel in cross-edia news retrieval, we particularly select 16 naes with the larger nuber of faces in the whole dataset as the query topics and carry out the retrieval runs based on the original dataset and the dataset with the alignent processing respectively. The Precision (P), Recall (R) and F-easure (F) values for each nae query are shown in Table 2. It can be seen that the best run is based on the dataset with alignent processing, and its results exceed those by another run based on the original dataset without any alignent processing draatically. For all the nae queries, their F values under the run with alignent processing are obviously higher than those under another run, and the highest increase can reach to 20%. By adopting our alignent odel for the original dataset, the whole retrieval perforance for the selected nae queries has gained the significant iproveent. Captioned ews Iage Dataset for Testing ae Query The Original Dataset The Dataset with Alignent Processing Baseline+SR SR+FCM+WMIM+ISSAGA P (%) R (%) F (%) P (%) R (%) F (%) George W. Bush 55.30 95.72 70.11 75.44 87.69 81.11 Collin Powell 57.38 72.92 64.23 75.00 69.08 71.92 Tony Blair 62.59 80.45 70.41 77.49 78.12 77.80 Donald Rusfeld 70.77 76.72 73.63 87.22 75.57 80.98 Gerhard Schroeder 56.88 80.52 66.67 77.83 74.46 76.11 Ariel Sharon 77.55 22.75 35.18 97.44 22.75 36.89 Hugo Chavez 47.26 90.32 62.05 86.61 88.71 87.65 Junichiro Koizui 55.78 73.21 63.32 73.11 77.68 75.32 Jean Chretien 62.67 86.24 72.59 80.73 80.73 80.73 John Ashcroft 55.82 83.49 66.91 76.85 76.15 76.50 Jacques Chirac 48.25 73.40 58.22 63.72 76.60 69.57 Hans Blix 47.10 81.11 59.59 73.96 78.89 76.34 Vladiir Putin 42.76 75.58 54.62 51.85 65.12 57.73 Silvio Berlusconi 55.88 71.25 62.64 66.27 68.75 67.48 Arnold Schwarzenegger 60.00 88.73 71.59 95.45 88.73 91.97 John egroponte 72.22 60.00 65.55 86.36 58.46 69.72 Table 2. The experiental results for cross-edia retrieval. 2772

7.4 Coparison with Existing Approaches To give full exhibition to the superiority of our alignent odel, we have also perfored a coparison between our approach and the other classical ones in recent years. Three approaches developed by Berg et al. [2005], Guillauin et al. [2008] and Pha et al. [2010] are analogous with ours, and then we accoplished the on the sae dataset. The experiental results are presented in Table 3, which reflect the difference of power between these four approaches. Approach Assignent Schee Evaluation Pattern CLR (%) Berg [o SR] 49.58 RF-> Berg+SR 53.07 R->F Berg+SR+WMIM 57.12 Berg et al. s [2005] (Berg) Guillauin et al. s [2008] (Guill) Pha et al. s [2010] (Pha) Our Approach with ISSAGA RF-> R->F RF-> R->F RF-> R->F Berg [o SR] 52.71 Berg+SR 56.45 Berg+SR+WMIM 59.93 Guill [o SR] 52.14 Guill+SR 57.87 Guill+SR+WMIM 60.97 Guill [o SR] 54.41 Guill+SR 59.95 Guill+SR+WMIM 62.43 Pha [o SR] 52.69 Pha+SR 56.13 Pha+SR+WMIM 61.57 Pha [o SR] 55.61 Pha+SR 58.98 Pha+SR+WMIM 62.73 Baseline(FCM) [o SR] 53.49 Baseline(FCM)+SR 60.76 Baseline(FCM)+SR+WMIM 63.16 Baseline(FCM) [o SR] 56.78 Baseline(FCM)+SR 62.47 Baseline(FCM)+SR+WMIM 65.77 Table 3. The coparison between our and the other existing ethods. It can be found fro Table 3 that for the nae-face linking on our whole dataset by Berg/Guiilauin/Pha et al. s approaches, we can obtain the best CLR values of 52.71%, 54.41% and 55.61% in the evaluation patterns of Berg/Guill/ Pha [o SR] and excluding the appings related to ull ae and ull Face respectively. The ain reason is that when only considering the alignents between real faces and real naes, all these three approaches can reduce the cobination abiguities between naes and faces in the sae iage-caption pair to a certain degree and then the relatively better CLR values can be obtained. Copared the results of Berg/Guill/Pha [o SR] and our baseline odel with FCM, we can find the best CLR value of 56.78% appears in the results of our odel, which is obviously higher than those CLR values of 52.71%, 54.41% and 55.61% for Berg/Guill/ Pha [o SR] respectively. This iplies that both of our nae-face cohesion easure echanis FCM and cobination optiization algorith ISSAGA are feasible for facilitating ore effective nae-face alignent evaluation and optiization. Furtherore, through fusing our SR and WMIM with Berg/Guillauin/Pha et al. s approaches in turn, the better perforance can be acquired, in which the best CLR values are considerably iproved to 59.93%, 62.43% and 62.73% respectively. This confirs the proinent roles of our SR and WMIM in autoatic nae-face alignent once again. Copared with the iproved Berg/ Guillauin/Pha et al. s approaches that integrate with SR and WMIM, our alignent odel with FCM, SR, WMIM and ISSAGA can still present the better perforance and the best CLR value of 65.77% differs greatly fro those of Berg/Guillauin/Pha et al. s. This indicates that our approach is really superior to Berg/Guillauin/Pha et al. s, and also further confirs that our alignent odel with FCM, SR, WMIM and ISSAGA is exactly a better way for deterining nae-face associations and can support such a special cobinatorial optiization proble ore effectively. 7.5 Analysis and Discussion Through the analysis for the aligned nae-face linkages with failure, it can be found: 1) The alignent acquisition is associated with the preprocessing for iage-caption pairs, especially for caption text. It s easier for iage and caption preprocessing to produce soe errors or issing detections for faces and naes. Such wrong or undetected inforation will seriously affect the whole alignent perforance. 2) The other naes that co-occur with a certain nae in the sae caption aybe helpful for facilitating this nae s related face finding and naing. The sae case is for faces in iages. Such nae or face co-occurrence can be utilized to further iprove the alignent easureent. 3) For ultiodal inforation fro Web ining, instead of directly using the feedback fro iage search engines, it s helpful to exploit a relevance evaluation anner to select ore appropriate copleentary inforation for rare nae. (4) It is a novel way to solve nae-face alignent fro the view of cobinatorial optiization. Although our ISSAGA have obtained the satisfactory perforance on large-scale dataset, aking ore highly specialized consideration about genetic operations will be ore beneficial to acquiring ore feasible solutions. 5) Soe news captions involve extreely rare nae. With very liited inforation suppleent fro Web, such news lacks enough ultiodal inforation for alignent processing. This ay be a ore stubborn proble. 8 Conclusions and Future Work A new algorith is introduced in this paper to support ore precise autoatic nae-face alignent for cross-edia news retrieval. The ulti-level analysis of iage-caption pairs is established for characterizing the eaningful naes with higher salience and the cohesion between naes and faces. To reedy the issue of rare nae, the extra ultiodal inforation is acquired through Web ining. ISSAGA is introduced to solve the alignent optiization better. Thus a novel odel is developed by integrating all these aspects to effectively exploit nae-face correlations. Our future work will focus on aking our syste available online, so that ore Internet users can benefit fro our research. Acknowledgents This work is supported by ational atural Science Fund of China (o. 61170095; o. 61272285), ational Science & Technology Pillar Progra of China (o. 2012BAH59F04), 973 Progra of China (o. 2010CB327906), Shanghai Outstanding Acadeic Leader Progra (o. 12XD1400900), and Shanghai Leading Acadeic Discipline Proect (o. B114). Cheng Jin is the corresponding author. 2773

References [Andresen et al., 2008] M. Andresen, H. Bräsel, J. Tusch, M. Mörig, F. Werner, and P. Willenius. Siulated annealing and genetic algoriths for iniizing ean flow tie in an open shop. Matheatical and Coputer Modelling, 48(7-8):1279-1293, 2008. [Berg et al., 2005] T.L. Berg, A.C. Berg, J. Edwards, and D.A. Forsyth. Who s in the picture. Advances in eural Inforation Processing Systes 17, 37-144, 2005. [Berg et al., 2007] T.L. Berg, A.C. Berg, J. Edwards, and M. Maire. aes and faces. Technical Report, U.C. at Berkeley, 2007. [Berg et al., 2004] T.L. Berg, A.C. Berg, J. Edwards, M. Maire, R. White, Y.W. The, E. Learned-Miller, and D. Forsyth. aes and faces in the news. In Proceedings of CVPR 2004, 2:848-854, 2004. [Bozorgtabar et al., 2011] B. Bozorgtabar and G.A. Rezai Rad. A genetic prograing - PCA hybrid face recognition algorith. Journal of Signal and Inforation Processing, 2:170-174, 2011. [Datta et al., 2008] R. Datta, D. Joshi, J. Li, and J.Z. Wang. Iage retrieval: Ideas, influences, and trends of the new age. ACM Coputing Surveys (CSUR), 40(2), Article 5, 2008. [Deschacht et al., 2007] K. Deschacht and M.F. Moens, Text analysis for autoatic iage annotation. In Proceedings of ACL2007, 1000-1007, 2007. [Everingha et al., 2006] M. Everingha, J. Sivic, and A. Zisseran. Hello! My nae Is Buffy - Autoatic naing of characters in TV video. In Proceedings of BMVC 2006, 889-908, 2006. [Everingha et al., 2009] M. Everingha, J. Sivic, and A. Zisseran. Taking the bite out of autoatic naing of characters in TV video. Iage and Vision Coputing, 27(5):545-559, 2009. [Fan et al., 2012] J.P. Fan, X.F. He,. Zhou, J.Y. Peng, and R. Jain. Quantitative characterization of seantic gaps for learning coplexity estiation and inference odel selection. IEEE Transactions on Multiedia, 14(5):1414-1428, 2012. [Guillauin et al., 2008] M. Guillauin, T. Mensink, J. Verbeek, and C. Schid. Autoatic face naing with caption-based supervision. In Proceedings of CVPR 2008, 1-8, 2008. [Guillauin et al., 2012] M. Guillauin, T. Mensink, J. Verbeek, and C. Schid. Face recognition fro caption-based supervision. International Journal of Coputer Vision, 96(1):64-82, 2012. [Le et al., 2008] D.D. Le and S. Satoh. Unsupervised face annotation by ining the Web. In Proceedings of ICDM 2008, 383-392, 2008. [Le et al., 2007] D.D. Le, S. Satoh, M.E. Houle, and D.P.T. guyen. Finding iportant people in large news video databases using ultiodal and clustering analysis. In Proceedings of ICDEW 2007, 127-136, 2007. [Liu et al., 2008] C. Liu, S. Jiang, and Q. Huang. aing faces in broadcast news video by iage Google. In Proceedings of MM 2008, 717-720, 2008. [Mensink et al., 2008] T. Mensink and J. Verbeek. Iproving People Search using Query expansions: How friends help to find people. In Proceedings of ECCV 2008, 86-99, 2008. [Ozkan et al., 2006] D. Ozkan and P. Duygulu. A graph based approach for naing faces in news photo. In Proceedings of CVPR 2006, 1477-1482, 2006. [Pha et al., 2010] P.T. Pha, M.F. Moens, and T. Tuytelaars. Cross-edia alignent of naes and faces. IEEE Transactions on Multiedia, 12(1):13-27, 2010. [Pha et al., 2011] P.T. Pha, T. Tuytelaars, and M.F. Moens. aing people in news videos with label propagation. IEEE Multiedia, 18(3): 44-55, 2011. [Satoh et al., 1997] S. Satoh and T. Kanade. ae-it: Association of face and nae in video. In Proceedings of CVPR 1997, 368-373, 1997. [Satoh et al., 1999] S. Satoh, Y. akaura, and T. Kanade. ae-it: aing and Detecting Faces in ews Videos. IEEE Multiedia, 6(1):22-35, 1999. [Wang et al., 2005] Z.G. Wang, M. Rahan., and Y.S. Wong. Optiization of ulti-pass illing using parallel genetic algorith and parallel genetic siulated annealing. International Journal of Machine Tools and Manufacture, 45(15):1726-1734, 2005. [Yang et al., 2004] J. Yang, M.Y. Chen, and A.G. Hauptann. Finding person x: Correlating naes with visual appearances. In Proceedings of CIVR 2004, 270-278, 2004. [Yang et al., 2009] Y. Yang, D. Xu, F.P. ie, J.B. Luo, and Y.T. Zhuang. Ranking with local regression and global alignent for cross edia retrieval. In Proceedings of MM 2009, 175-184, 2009. 2774