Analyzing Chinese-English Mixed Language Queries in a Web Search Engine

Hengyi Fu
School of Information
Florida State University
142 Collegiate Loop, FL 32306
hf13c@my.fsu.edu

Shuheng Wu
School of Information
Florida State University
142 Collegiate Loop, FL 32306
sw09f@my.fsu.edu

77th ASIS&T Annual Meeting, October 31-November 4, 2014, Seattle, WA, USA. Copyright is retained by the author(s).

ABSTRACT
With the increasing number of multilingual web pages on the Internet, multilingual information retrieval has become an important research issue. Although queries are a key element of the information retrieval process, mixed-language queries have not yet been adequately studied. This study investigates the search topics and user intents of Chinese-English mixed language queries submitted to a Chinese web search engine, and develops a typology of English terms used in those mixed language queries. The preliminary findings show that the categories of search topics and the distribution of user intents differ from those of monolingual queries reported in previous studies, suggesting a specific searching behavior among users of Chinese-English mixed language queries. The findings could provide useful insights into that behavior, and inform the construction of controlled vocabularies and cross-lingual query expansion.

Keywords
Information retrieval, multilingual search query, search behavior, search topics, user intent.

INTRODUCTION
English has remained the predominant language online, accounting for 55.5% of the content languages used in websites (W3Techs, 2013); however, the amount of content in other languages is increasing rapidly. It is becoming more common to find web pages that are available in multiple languages, or a single web page written in more than one language. In cultures where people use both Chinese and English, mixing the two languages in speech is very common. In general, a mixed language consists of sentences or terms mostly in the primary language with some words in a secondary language. For instance, Hong Kong people typically speak Cantonese with English words. When such people search the web, they often use Chinese-English mixed language queries in order to approximate their information need more accurately than they could with Chinese-only queries.

A mixed language query is a search query that includes words from two or more languages. For example, the query 3G 手机 is a Chinese-English mixed language search query that a user created to look for information about 3G mobile phones. This phenomenon is called codeswitching in linguistics. It used to be considered of peripheral importance, but has now become a general focus of research in the sociological, psycholinguistic, and linguistic communities (Auer, 1998). However, little is known about why and how people use mixed language queries in web searching, or about the characteristics of those queries.

Research on Chinese search engines shows that the number of Chinese-English mixed language queries submitted by users has increased in recent years (Chau, Lu, Fang, & Yang, 2009). Studying the characteristics of Chinese-English mixed language queries in web searching will be fruitful for a better understanding of users' needs behind the queries, and can inform the construction of controlled vocabularies and query expansion. Although previous studies (Chau, Fang, & Yang, 2007; Chau et al., 2009; Lu, Chau, Fang, & Yang, 2006) have analyzed the characteristics of search queries in English and Chinese, their findings may not be applicable to Chinese-English mixed language queries, due to the great discrepancy between these two languages. To the best of our knowledge, no previous research has explicitly studied Chinese-English mixed language queries in depth.
The purpose of this study is to examine the topics and user intents of Chinese-English mixed language queries in a Chinese web search engine, and to identify the types of English terms used in those mixed queries. The topics and user intents identified in this study can inform the construction of topic-specific thesauri, and enable web search engines to provide users with more relevant results and more precisely targeted sponsored links. English terms used in Chinese-English mixed language queries, especially those without well-known Chinese translations, reflect users' vocabularies; they can be used to enhance Chinese controlled vocabularies or other knowledge organization systems, and can also be applied in query expansion to improve recall.

RELATED WORK
Search Topics
Determining the topics of queries is an ongoing area of study. Spink, Wolfram, Jansen, and Saracevic (2001) studied query logs from the Excite™ search engine from 1997 and 1999. They classified about 2,500 queries into 11 overlapping topic categories using a grounded theory approach. They found that no single query subject category accounted for more than 17% of the traffic, and that the distribution of topics of web queries does not coincide with the distribution of subject content of web sites. Beitzel et al. (2005) presented a topical breakdown of a manually classified sample of queries taken from one week's worth of queries posed to AOL™ search, which has a large number of backend databases with topic-specific information. Using this topically labeled dataset, Jansen and Booth (2010) manually labeled each query with a three-level classification of user intent and developed a classification scheme showing how user intent varied across topics.

User Intents in Web Search
Several studies (Broder, 2002; Rose & Levinson, 2004; Jansen, Booth, & Spink, 2008) have attempted to classify user queries in terms of users' informational actions. Due to page limitations, this study uses only the queries themselves to determine user intents. Broder (2002) manually classified a small set of queries as transactional, navigational, or informational by using a survey of AltaVista users. Navigational queries are used to find a certain webpage that the user already has in mind or at least assumes exists. Typical queries in this category are searches for the homepage of an organization. With informational queries, users want to find information on certain topics.
Such queries usually lead to a set of results rather than to just one suitable document. The purpose of transactional queries is to reach a site where a further interaction is necessary. The interaction could be downloading, shopping, or accessing certain databases. Based solely on the log analysis, Broder (2002) reported that 48% of the queries were informational, 20% navigational, and 30% transactional.

Analysis of Chinese-English Mixed Language Queries
Chau et al. (2007) collected 1,255,633 queries from a Chinese search engine, and found that the search topics were similar to those in English search engines. Chau et al. (2009) reported that Chinese users tended to use more diversified search terms, and that the use of characters in search queries was quite different from that in general online information in Chinese. However, neither of these studies investigated Chinese-English mixed language queries, although Chau et al. (2007) observed the codeswitching phenomenon and indicated it would be an interesting topic to explore. The only study found that relates to Chinese-English mixed language queries was conducted by Lu et al. (2006). They found that mixed language queries submitted to Timway, a Chinese web search engine in Hong Kong, were primarily caused by technical terms in computer science, names of magazines and firms, and some English words not having a popular Chinese translation.

RESEARCH QUESTIONS
This poster addresses the following research questions:
RQ1. What are the main topics that web search engine users search for with Chinese-English mixed language queries?
RQ2. How are the user intents of these Chinese-English mixed language queries distributed amongst the topics?
RQ3. What kinds of English terms do users choose to create a Chinese-English mixed language query?

DATA COLLECTION AND RESEARCH METHOD
This study uses queries submitted to the Sogou web search engine (http://www.sogou.com/), which is one of the most popular search engines in China.
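Isolating mixed queries from a search log of this kind amounts to a per-query character test: keep a query only if it contains both ASCII and Chinese characters. The Python sketch below illustrates such a test. It is not the authors' code (the study reports using C++), and the character classes (ASCII letters and digits; the CJK Unified Ideographs block U+4E00 to U+9FFF) and the function names are assumptions chosen for illustration.

```python
import re

# Assumption: "English" is approximated by ASCII letters and digits.
ASCII_ALNUM = re.compile(r"[A-Za-z0-9]")
# Assumption: "Chinese" is approximated by the CJK Unified Ideographs block.
CJK = re.compile(r"[\u4e00-\u9fff]")


def is_mixed_query(query: str) -> bool:
    """Return True if the query contains both ASCII and CJK characters."""
    return bool(ASCII_ALNUM.search(query)) and bool(CJK.search(query))


def select_mixed(queries):
    """Keep unique, non-empty mixed queries, preserving first-seen order."""
    seen = set()
    selected = []
    for raw in queries:
        query = raw.strip()
        if query and query not in seen and is_mixed_query(query):
            seen.add(query)
            selected.append(query)
    return selected
```

For example, `is_mixed_query("3G 手机")` holds for the sample query from the introduction, while a Chinese-only query such as `手机` is rejected. A production filter would also need to decide how to treat full-width Latin letters and punctuation, which this sketch ignores.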
The query-log data was collected in June 2012. A record in the log file consists of six fields: time of the click event, user ID, user query, ranking of the clicked URL, ordering of the user click, and the clicked URL. The query-log data contains 86,538,613 nonempty queries, 4,345,557 of them unique. The researchers developed C++ code to select and pre-process all the Chinese-English queries, that is, queries that contain both ASCII and double-byte characters, from the log data. The final dataset has 346,989 Chinese-English mixed language queries, accounting for 7.98% of the unique queries. A random sample of 384 Chinese-English mixed language queries was drawn from this dataset for content analysis. The sample size was determined using a technique introduced by Powell and Connaway (2004). The unit of analysis is each Chinese-English query in the sample. To determine the topics of those Chinese-English mixed language queries, two researchers independently coded all the queries in the sample using a scheme developed by Jansen and Booth (2010) consisting of 19 topical categories. The researchers then compared their coding, and discussed and resolved any differences. During the coding process, the researchers decided to drop the Other category from Jansen and Booth's scheme, and added five new categories: Arts & Design, Education, Employment, Law & Legislation, and Technology. The researchers achieved an intercoder reliability of 0.872 (Cohen's Kappa). Using a similar coding process, the researchers independently coded the same set of queries into three different user intents: Informational, Transactional, and
Navigational (Broder, 2002), and obtained an intercoder reliability of 0.856. To identify reasons for users employing English terms as part of their search queries, the researchers used the open coding approach (Charmaz, 2006; Strauss & Corbin, 1994) to analyze all the English terms in the same set of queries. Note that an English term in a mixed language query may be a single English letter, word, phrase, or sentence fragment, and thus one English term may generate one or multiple codes. For example, the query Michael Jackson You rock my world 在线视频, which looks for a specific online video, generates two codes for the English term: Singer and Song. The researchers independently coded the same set of English terms and created a codebook after comparing, discussing, and resolving differences. The researchers then used the codebook to recode all the English terms and attained an intercoder reliability of 0.884.

FINDINGS
The analysis identified 21 categories of search topics from the Chinese-English mixed language queries (see Table 1).

Category            # of queries  Percentage
Arts & Design       2             0.5%
Auto                7             1.8%
Business            20            5.2%
Computing           88            22.9%
Education           24            6.3%
Employment          2             0.5%
Entertainment       87            22.7%
Games               36            9.4%
Health              7             1.8%
Home                1             0.3%
Law & Legislation   8             2.1%
News                1             0.3%
Organization        6             1.6%
Places              1             0.3%
Porn                4             1.0%
Research            17            4.4%
Shopping            24            6.3%
Sports              8             2.1%
Technology          34            8.9%
Travel              2             0.5%
URL                 5             1.3%
Total               384           100%

Table 1. Categories of queries by topic.

The most popular topics are Computing (22.9%), Entertainment (22.7%), and Games (9.4%), together accounting for 55% of the queries. This may be because a large number of software products and online games developed in the United States, Europe, and Japan do not have popular Chinese translations. Besides Other, two categories in Jansen and Booth's (2010) scheme, Holiday and Misspellings, were not found in the sample of this study. The researchers found that a large number of English terms in Chinese-English mixed language queries are abbreviations (e.g., MV, GRE, ISO) and words that are short or easy to spell (e.g., blog, Nike, Windows). This may explain the absence of Misspellings in the queries.

Category            Informational  Transactional  Navigational
Arts & Design       1.0%           0.0%           0.0%
Auto                3.0%           0.7%           0.0%
Business            7.5%           0.7%           12.5%
Computing           23.4%          26.5%          3.1%
Education           3.5%           7.9%           15.6%
Employment          1.0%           0.0%           0.0%
Entertainment       7.5%           43.0%          21.9%
Games               7.5%           13.2%          3.1%
Health              3.5%           0.0%           0.0%
Home                0.5%           0.0%           0.0%
Law & Legislation   3.0%           0.7%           3.1%
News                0.0%           0.0%           3.1%
Organization        0.5%           0.0%           15.6%
Places              0.5%           0.0%           0.0%
Porn                0.5%           2.0%           0.0%
Research            7.5%           0.7%           3.1%
Shopping            10.9%          1.3%           0.0%
Sports              4.0%           0.0%           0.0%
Technology          13.9%          3.3%           3.1%
Travel              1.0%           0.0%           0.0%
URL                 0.0%           0.0%           15.6%
Total               100%           100%           100%

Table 2. Categories of queries distributed amongst user intent.

The second research question examines how user intent varies by search topic. Among the 384 queries in the sample, 52.3% are informational, 39.3% are transactional, and 8.3% are navigational. While the percentage of informational queries reported in this study is similar to that of Broder (2002), far fewer navigational queries were identified. The distribution of user intents amongst different topics is shown in Table 2; it differs largely from that of a previous study (Jansen & Booth, 2010).

A large number of informational queries fall into the topics of Computing and Technology, which account for almost 40% of all informational queries. This may indicate that users are interested in learning about new technologies, computer software, and hardware, most of which were developed outside of China, such as AutoCAD, Photoshop, and Java. The largest topic groups among transactional queries are Entertainment, Computing, and Games, accounting for more than 80% of all the transactional queries. The popularity of these transactional queries results from users looking for specific downloadable software, video games, songs, movies, and books through web search engines. Assuming that transactional queries carry a higher commercial inclination, these would be of most interest to online advertising. For these users, web search engines could place heavy weight in results on commercial content or sponsored links. The most frequent topics among navigational queries are Entertainment, Education, Organization, and URL, accounting for almost 70% of all the navigational queries. This may suggest that users are interested in specific entertainment, education, and organization websites.
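The combined shares quoted in this discussion follow directly from the columns of Table 2. As a quick arithmetic check, the sketch below sums the relevant cells. The dictionary layout and the helper name `combined_share` are illustrative choices, and only the Table 2 rows cited in the text are reproduced.

```python
# Column percentages copied from Table 2, in the order
# (informational, transactional, navigational).
TABLE2 = {
    "Computing": (23.4, 26.5, 3.1),
    "Education": (3.5, 7.9, 15.6),
    "Entertainment": (7.5, 43.0, 21.9),
    "Games": (7.5, 13.2, 3.1),
    "Organization": (0.5, 0.0, 15.6),
    "Technology": (13.9, 3.3, 3.1),
    "URL": (0.0, 0.0, 15.6),
}
INTENTS = ("informational", "transactional", "navigational")


def combined_share(intent, topics):
    """Sum the Table 2 percentages of the given topics for one intent."""
    column = INTENTS.index(intent)
    return round(sum(TABLE2[topic][column] for topic in topics), 1)


print(combined_share("informational", ["Computing", "Technology"]))
print(combined_share("transactional", ["Entertainment", "Computing", "Games"]))
print(combined_share("navigational",
                     ["Entertainment", "Education", "Organization", "URL"]))
```

The three sums come to 37.3, 82.7, and 68.7 percent, consistent with the descriptions "almost 40%", "more than 80%", and "almost 70%".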
Type                        Frequency  Percentage
Band/Singer                 14         3.3%
Brand                       22         5.2%
TV Channel                  3          0.7%
Company                     24         5.7%
Drama/Movie                 4          1.0%
Game                        10         2.4%
Hardware                    12         2.9%
Manga                       4          1.0%
Model Name/Version Number   42         10.0%
Organization                9          2.1%
Region                      2          0.5%
Software                    68         16.2%
Song                        14         3.3%
Symbol                      17         4.0%
Terminology                 142        33.8%
URL                         24         5.7%
Website                     9          2.1%

Table 3. Types of English terms in Chinese-English mixed language queries.

To answer the third research question, this study developed a typology of 16 types of English terms used in those Chinese-English mixed language queries (see Table 3). The most frequent type of English term is Terminology (33.8%), followed by Software (16.2%) and Model Name/Version Number (10.0%). Terminologies (e.g., GRE, mp3, and blog) can be further categorized into three levels: cultural, community, and technical. For example, the term av (an abbreviation for adult video) originated in Japan and is well known in Asian culture.

The analysis found that not all the English terms used in the mixed language queries have corresponding Chinese translations. Terms without translations include the version numbers of certain software (e.g., XP), hardware, video games, and standards; the model names of electronic devices, automobiles, and materials; URLs; and symbols (e.g., C programming language). Although most of the English terms have Chinese translations, some of them may be easier to type and memorize (e.g., NBA vs. 美国职业篮球联赛), or better known than their Chinese translations (e.g., mp3, pdf, DHL, DIY). Some of the English terms can be translated into Chinese, but people may not bother to do so, as with less popular English songs and movies. However, the underlying motives for using mixed language queries need to be further identified by interviewing the users of Chinese-English mixed language queries.
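The NBA example suggests how English terms with known Chinese equivalents could feed cross-lingual query expansion. The sketch below is one hypothetical way to do this, not a method from the study: it finds the ASCII runs in a query and, for each run found in a bilingual lexicon, adds a variant with the Chinese equivalent substituted. The lexicon contents, the function name `expand_query`, and the sample query are assumptions; only the NBA translation pair comes from the text.

```python
import re

# Hypothetical bilingual lexicon keyed by lowercased English term.
# Only the NBA pair is taken from the paper.
LEXICON = {"nba": ["美国职业篮球联赛"]}

# Assumption: an English term is a run of ASCII letters or digits.
ENGLISH_RUN = re.compile(r"[A-Za-z0-9]+")


def expand_query(query, lexicon=LEXICON):
    """Return the original query plus variants with translated English terms."""
    variants = [query]
    for match in ENGLISH_RUN.finditer(query):
        for translation in lexicon.get(match.group().lower(), []):
            variant = query[:match.start()] + translation + query[match.end():]
            if variant not in variants:
                variants.append(variant)
    return variants


# Made-up mixed query used only to exercise the function.
print(expand_query("NBA 在线视频"))
```

A search engine could submit every variant and merge the results, which is one way the recall improvement mentioned earlier might be realized. Terms without a lexicon entry, such as version numbers or URLs, are left untouched.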
CONCLUSION AND FUTURE RESEARCH
This study analyzed the search topics and user intents of Chinese-English mixed language queries submitted to a Chinese web search engine, and developed a typology of English terms used in those mixed language queries. The distributions of search topics and user intents reported in this study differ from those reported in studies of monolingual queries, suggesting a specific searching behavior among users of Chinese-English mixed language queries. Future research includes interviewing users who employ Chinese-English mixed language queries in web searching, and investigating the contexts in which users want to use such queries and the situations in which these queries would be beneficial.

REFERENCES
Auer, P. (1998). Code-switching in conversation: Language, interaction, and identity. London: Routledge.
Beitzel, S. M., Jensen, E. C., Frieder, O., Lewis, D. D., Chowdhury, A., & Kołcz, A. (2005). Improving automatic query classification via semi-supervised learning. Proceedings of the 5th IEEE International Conference on Data Mining (ICDM) (pp. 42-49). Menlo Park, CA: IEEE Computer Society.
Broder, A. (2002). A taxonomy of web search. SIGIR Forum, 36(2), 3-10.
Charmaz, K. (2006). Coding in grounded theory practice. In Constructing grounded theory: A practical guide through qualitative analysis (pp. 42-71). Thousand Oaks, CA: Sage.
Chau, M., Fang, X., & Yang, C. C. (2007). Web searching in Chinese: A study of a search engine in Hong Kong. Journal of the American Society for Information Science and Technology, 58(7), 1044-1054.
Chau, M., Lu, Y., Fang, X., & Yang, C. C. (2009). Characteristics of character usage in Chinese web searching. Information Processing and Management, 45(1), 115-130.
Jansen, B. J., & Booth, D. L. (2010). Classifying web queries by topic and user intent. Retrieved June 1, 2014 from http://faculty.ist.psu.edu/jjansen/academic/jansen_user_intent.pdf
Jansen, B. J., Booth, D. L., & Spink, A. (2008). Determining the informational, navigational, and transactional intent of web queries. Information Processing and Management, 44(3), 1251-1266.
Lu, Y., Chau, M., Fang, X., & Yang, C. C. (2006). Analysis of the bilingual queries in a Chinese web search engine. Retrieved June 1, 2014 from http://www.fbe.hku.hk/~mchau/papers/bilingualqueries_web.pdf
Powell, R. R., & Connaway, L. S. (2004). Basic research methods for librarians (4th ed.). Westport, CT: Libraries Unlimited.
Rose, D., & Levinson, D. (2004). Understanding user goals in web search. In S. Feldman, M. Uretsky, M. Najork, & C. Wills (Eds.), Proceedings of the 13th International Conference on World Wide Web (WWW 2004) (pp. 13-19). New York: ACM Press.
Spink, A., Wolfram, D., Jansen, M. B. J., & Saracevic, T. (2001). Searching the web: The public and their queries. Journal of the American Society for Information Science and Technology, 52(3), 226-234.
Strauss, A., & Corbin, J. (1994). Grounded theory methodology: An overview. In N. K. Denzin & Y. S. Lincoln (Eds.), The handbook of qualitative research (pp. 273-285). Thousand Oaks, CA: Sage Publications.
W3Techs. (2013). Usage of content languages for website. Retrieved June 1, 2014 from http://w3techs.com/technologies/overview/content_language/all