Analyzing Chinese-English Mixed Language Queries in a Web Search Engine




Hengyi Fu
School of Information, Florida State University
142 Collegiate Loop, FL 32306
hf13c@my.fsu.edu

Shuheng Wu
School of Information, Florida State University
142 Collegiate Loop, FL 32306
sw09f@my.fsu.edu

77th ASIS&T Annual Meeting, October 31-November 4, 2014, Seattle, WA, USA. Copyright is retained by the author(s).

ABSTRACT
With the increasing number of multilingual web pages on the Internet, multilingual information retrieval has become an important research issue. Although queries are a key element of the information retrieval process, mixed-language queries have not yet been adequately studied. This study investigates the search topics and user intents of Chinese-English mixed language queries submitted to a Chinese web search engine, and develops a typology of English terms used in those queries. The preliminary findings show that the categories of search topics and the distribution of user intents differ from those of monolingual queries reported in previous studies, suggesting that users of Chinese-English mixed language queries exhibit distinct search behavior. The findings could offer useful insights into this behavior and inform the construction of controlled vocabularies and cross-lingual query expansion.

Keywords
Information retrieval, multilingual search query, search behavior, search topics, user intent.

INTRODUCTION
English has remained the predominant language online, accounting for 55.5% of the content languages used in websites (W3Techs, 2013); however, the amount of content in other languages is increasing rapidly. It is becoming more common to find web pages that are available in multiple languages, or a single web page written in more than one language. In cultures where people use both Chinese and English, mixing the two languages in speech is very common. In general, a mixed language consists of sentences or terms mostly in the primary language with some words in a secondary language. For instance, people in Hong Kong typically speak Cantonese interspersed with English words. When such people search the web, they often use Chinese-English mixed language queries to approximate their information need more accurately than Chinese-only queries would.

A mixed language query is a search query that includes words from two or more languages. For example, the query "3G 手机" is a Chinese-English mixed language query that a user created to look for information about 3G mobile phones. This phenomenon is called codeswitching in linguistics. It used to be considered of peripheral importance, but has now become a general focus of research in the sociological, psycholinguistic, and linguistic communities (Auer, 1998). However, little is known about why and how people use mixed language queries in web searching, or about the characteristics of those queries. Research on Chinese search engines shows that the number of Chinese-English mixed language queries submitted by users has increased in recent years (Chau, Lu, Fang, & Yang, 2009). Studying the characteristics of Chinese-English mixed language queries in web searching will lead to a better understanding of the users' needs behind the queries, and inform the construction of controlled vocabularies and query expansion.

Although previous studies (Chau, Fang, & Yang, 2007; Chau et al., 2009; Lu, Chau, Fang, & Yang, 2006) have analyzed the characteristics of search queries in English and Chinese, their findings may not be applicable to Chinese-English mixed language queries, because the two languages differ greatly. To the best of our knowledge, no previous research has explicitly studied Chinese-English mixed language queries in depth.
The purpose of this study is to examine the topics and user intents of Chinese-English mixed language queries in a Chinese web search engine, and to identify the types of English terms used in those mixed queries. The topics and user intents identified in this study can inform the construction of topic-specific thesauri, and enable web search engines to provide users with more relevant results and more precisely targeted sponsored links. English terms used in Chinese-English mixed language queries, especially those without well-known Chinese translations, reflect users' vocabularies; they can be used to enhance Chinese controlled vocabularies or other knowledge organization systems, and can also be applied in query expansion to improve recall.

RELATED WORK

Search Topics
Determining the topics of queries is an ongoing area of study. Spink, Wolfram, Jansen, and Saracevic (2001) studied query logs from the Excite™ search engine from 1997 and 1999. They classified about 2,500 queries into 11 overlapping topic categories using a grounded theory approach. They found that no single query subject category accounted for more than 17% of the traffic, and that the distribution of topics of web queries does not coincide with the distribution of subject content of web sites. Beitzel et al. (2005) presented a topical breakdown of a manually classified sample of queries taken from one week's worth of queries posed to AOL™ search, which has a large number of backend databases with topic-specific information. Using this topically labeled dataset, Jansen and Booth (2010) manually labeled each query with a three-level classification of user intent and developed a classification scheme showing how user intent varies across topics.

User Intents in Web Search
Several studies (Broder, 2002; Rose & Levinson, 2004; Jansen, Booth, & Spink, 2008) have attempted to classify user queries in terms of users' informational actions. Because of page limits, this study uses only the queries themselves to determine user intents. Broder (2002) manually classified a small set of queries as transactional, navigational, or informational by using a survey of AltaVista users. Navigational queries are used to find a particular web page that the user already has in mind or at least assumes exists. Typical queries in this category are searches for the homepage of an organization. With informational queries, users want to find information on certain topics.
Such queries usually lead to a set of results rather than to a single suitable document. The purpose of transactional queries is to reach a site where further interaction is necessary, such as downloading, shopping, or accessing certain databases. Based solely on log analysis, Broder (2002) reported that 48% of the queries were informational, 20% navigational, and 30% transactional.

Analysis of Chinese-English Mixed Language Queries
Chau et al. (2007) collected 1,255,633 queries from a Chinese search engine and found that the search topics were similar to those in English search engines. Chau et al. (2009) reported that Chinese users tended to use more diversified search terms and that the use of characters in search queries was quite different from that in general online information in Chinese. However, neither study investigated Chinese-English mixed language queries, although Chau et al. (2007) observed the codeswitching phenomenon and indicated that it would be an interesting topic to explore. The only study found that relates to Chinese-English mixed language queries was conducted by Lu et al. (2006). They found that mixed language queries submitted to Timway, a Chinese web search engine in Hong Kong, were primarily caused by technical terms in computer science, names of magazines and firms, and English words that have no popular Chinese translation.

RESEARCH QUESTIONS
This poster addresses the following research questions:
RQ1. What are the main topics of the Chinese-English mixed language queries that web search engine users submit?
RQ2. How are the user intents of these Chinese-English mixed language queries distributed across topics?
RQ3. What kinds of English terms do users choose when creating a Chinese-English mixed language query?

DATA COLLECTION AND RESEARCH METHOD
This study uses queries submitted to the Sogou web search engine (http://www.sogou.com/), which is one of the most popular search engines in China.
The query-log data was collected in June 2012. A record in the log file consists of six fields: time of the click event, user ID, user query, ranking of the clicked URL, ordering of the user click, and the clicked URL. The query-log data contains 86,538,613 nonempty queries, 4,345,557 of them unique. The researchers developed C++ code to select and pre-process all the Chinese-English queries, that is, queries containing both ASCII and double-byte characters, from the log data. The final dataset has 346,989 Chinese-English mixed language queries, accounting for 7.98% of the unique queries. A random sample of 384 Chinese-English mixed language queries was drawn from this dataset for content analysis. The sample size was determined using a technique introduced by Powell and Connaway (2004). The unit of analysis is each Chinese-English query in the sample.

To determine the topics of the Chinese-English mixed language queries, two researchers independently coded all the queries in the sample using a scheme developed by Jansen and Booth (2010) consisting of 19 topical categories. The researchers then compared their coding to discuss and resolve any differences. During the coding process, the researchers decided to drop the Other category from Jansen and Booth's scheme and added five new categories: Arts & Design, Education, Employment, Law & Legislation, and Technology. The researchers achieved an intercoder reliability of 0.872 (Cohen's Kappa). Using a similar coding process, the researchers independently coded the same set of queries into three user intents, Informational, Transactional, and Navigational (Broder, 2002), and obtained an intercoder reliability of 0.856.

To identify reasons for users employing English terms as part of their search queries, the researchers used the open coding approach (Charmaz, 2006; Strauss & Corbin, 1994) to analyze all the English terms in the same set of queries. Note that an English term in a mixed language query refers to a single English letter, word, phrase, or sentence fragment, so one English term may generate one or multiple codes. For example, the query "Michael Jackson You rock my world 在线视频", which looks for a specific online video, generates two codes for the English term: Singer and Song. The researchers independently coded the same set of English terms and created a codebook after comparing, discussing, and resolving differences. The researchers then used the codebook to recode all the English terms and attained an intercoder reliability of 0.884.

FINDINGS
The analysis identified 21 categories of search topics in the Chinese-English mixed language queries (see Table 1).

Category            # of queries   Percentage
Arts & Design       2              0.5%
Auto                7              1.8%
Business            20             5.2%
Computing           88             22.9%
Education           24             6.3%
Employment          2              0.5%
Entertainment       87             22.7%
Games               36             9.4%
Health              7              1.8%
Home                1              0.3%
Law & Legislation   8              2.1%
News                1              0.3%
Organization        6              1.6%
Places              1              0.3%
Porn                4              1.0%
Research            17             4.4%
Shopping            24             6.3%
Sports              8              2.1%
Technology          34             8.9%
Travel              2              0.5%
URL                 5              1.3%
Total               384            100%

Table 1. Categories of queries by topic.

The most popular topics are Computing (22.9%), Entertainment (22.7%), and Games (9.4%), together accounting for 55% of the queries. This may be because a large number of software products and online games developed in the United States, Europe, and Japan do not have popular Chinese translations. Besides Other, two categories in Jansen and Booth's (2010) scheme, Holiday and Misspellings, were not found in the sample of this study. The researchers found that a large number of English terms in Chinese-English mixed language queries are abbreviations (e.g., "MV", "GRE", "ISO") or words that are short or easy to spell (e.g., "blog", "Nike", "Windows"). This may explain the absence of Misspellings in the queries.

Category            Informational   Transactional   Navigational
Arts & Design       1.0%            0.0%            0.0%
Auto                3.0%            0.7%            0.0%
Business            7.5%            0.7%            12.5%
Computing           23.4%           26.5%           3.1%
Education           3.5%            7.9%            15.6%
Employment          1.0%            0.0%            0.0%
Entertainment       7.5%            43.0%           21.9%
Games               7.5%            13.2%           3.1%
Health              3.5%            0.0%            0.0%
Home                0.5%            0.0%            0.0%
Law & Legislation   3.0%            0.7%            3.1%
News                0.0%            0.0%            3.1%
Organization        0.5%            0.0%            15.6%
Places              0.5%            0.0%            0.0%
Porn                0.5%            2.0%            0.0%
Research            7.5%            0.7%            3.1%
Shopping            10.9%           1.3%            0.0%
Sports              4.0%            0.0%            0.0%
Technology          13.9%           3.3%            3.1%
Travel              1.0%            0.0%            0.0%
URL                 0.0%            0.0%            15.6%
Total               100%            100%            100%

Table 2. Categories of queries distributed amongst user intent.

The second research question examines how user intent varies by search topic. Among the 384 queries in the sample, 52.3% are informational, 39.3% are transactional, and 8.3% are navigational. While the percentage of informational queries in this study is similar to that reported by Broder (2002), far fewer navigational queries were identified. The distribution of user intents across topics is shown in Table 2; it differs considerably from that of a previous study (Jansen & Booth, 2010).

A large share of informational queries falls into the topics of Computing and Technology, which together account for almost 40% of all informational queries. This may indicate that users are interested in learning about new technologies, computer software, and hardware, most of which were developed outside of China, such as AutoCad, Photoshop, and Java. The largest topic groups among transactional queries are Entertainment, Computing, and Games, accounting for more than 80% of all transactional queries. The popularity of these transactional queries results from users looking for specific downloadable software, video games, songs, movies, and books through web search engines. Assuming that transactional queries carry a higher commercial inclination, these would be of most interest to online advertisers. For these users, web search engines could give heavier weight to commercial content or sponsored links in the results. The most frequently occurring topics among navigational queries are Entertainment, Education, Organization, and URL, accounting for almost 70% of all navigational queries. This may suggest that users are interested in specific entertainment, education, and organization websites.
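The column percentages in Table 2 follow from counting topics within each intent over the coded sample. The following Python sketch illustrates that tabulation only; it is not the authors' code, and the function name and the toy records are hypothetical.

```python
from collections import Counter, defaultdict


def intent_distribution_by_topic(coded_queries):
    """Return {intent: {topic: percent}} where each intent sums to about 100%.

    coded_queries is an iterable of (topic, intent) pairs, one per query.
    """
    counts = defaultdict(Counter)
    for topic, intent in coded_queries:
        counts[intent][topic] += 1

    distribution = {}
    for intent, topic_counts in counts.items():
        total = sum(topic_counts.values())
        # Percentage of this intent's queries that fall into each topic.
        distribution[intent] = {
            topic: round(100.0 * n / total, 1)
            for topic, n in topic_counts.items()
        }
    return distribution


# Hypothetical toy sample (not data from the study).
sample = [
    ("Computing", "Informational"),
    ("Computing", "Informational"),
    ("Technology", "Informational"),
    ("Shopping", "Informational"),
    ("Entertainment", "Transactional"),
    ("Games", "Transactional"),
]
dist = intent_distribution_by_topic(sample)
```

Normalizing within each intent, rather than within each topic, matches the layout of Table 2, where every intent column totals 100%.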
To answer the third research question, this study developed a typology of 17 types of English terms used in the Chinese-English mixed language queries (see Table 3). The most frequent type of English term is Terminology (33.8%), followed by Software (16.2%) and Model Name/Version Number (10.0%).

Type                         Frequency   Percentage
Band/Singer                  14          3.3%
Brand                        22          5.2%
TV Channel                   3           0.7%
Company                      24          5.7%
Drama/Movie                  4           1.0%
Game                         10          2.4%
Hardware                     12          2.9%
Manga                        4           1.0%
Model Name/Version Number    42          10.0%
Organization                 9           2.1%
Region                       2           0.5%
Software                     68          16.2%
Song                         14          3.3%
Symbol                       17          4.0%
Terminology                  142         33.8%
URL                          24          5.7%
Website                      9           2.1%

Table 3. Types of English terms in Chinese-English mixed language queries.

Terminologies (e.g., "GRE", "mp3", and "blog") can be further categorized into three levels: cultural, community, and technical. For example, the term "av", an abbreviation for "adult video", originated in Japan and is well known in Asian culture. The analysis found that not all the English terms used in the mixed language queries have corresponding Chinese translations. Terms without translations include the version numbers of certain software (e.g., XP), hardware, video games, and standards; the model names of electronic devices, automobiles, and materials; URLs; and symbols (e.g., C programming language). Although most of the English terms have Chinese translations, some of them may be easier to type and memorize (e.g., "NBA" vs. "美国职业篮球联赛"), or better known than their Chinese translations (e.g., "mp3", "pdf", "DHL", "DIY"). Some English terms can be translated into Chinese, but people may not bother to do so, as with less popular English songs and movies. However, the underlying motives for using mixed language queries need to be further identified by interviewing the users of Chinese-English mixed language queries.
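Before English terms can be coded, they have to be separated from the Chinese text of each query. The paper states only that the authors used C++ to select queries containing both ASCII and double-byte characters; the Python sketch below is an illustrative stand-in, not their implementation. It assumes that a mixed query contains at least one ASCII letter and at least one character in the Unicode range U+4E00 to U+9FFF, and that an English term is a run of ASCII alphanumeric tokens joined by single spaces, dots, or hyphens. The function names are hypothetical.

```python
import re

# Assumption: common Chinese characters fall in the CJK Unified Ideographs block.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
ASCII_LETTER = re.compile(r"[A-Za-z]")
# Assumption: an English term is one or more ASCII alphanumeric tokens
# joined by a single space, dot, or hyphen (e.g. "You rock my world").
ENGLISH_TERM = re.compile(r"[A-Za-z0-9]+(?:[ .\-][A-Za-z0-9]+)*")


def is_mixed_query(query):
    """True if the query contains both Chinese characters and ASCII letters."""
    return bool(CJK_CHAR.search(query)) and bool(ASCII_LETTER.search(query))


def english_terms(query):
    """Return the ASCII term spans found in the query, in order."""
    return [match.group() for match in ENGLISH_TERM.finditer(query)]


print(is_mixed_query("3G 手机"))
print(english_terms("Michael Jackson You rock my world 在线视频"))
```

Under these assumptions, a whole English fragment such as "Michael Jackson You rock my world" is returned as one term, which is consistent with the paper's note that a single English term may receive more than one code (here, Singer and Song). Digit-only queries are deliberately not treated as mixed; the paper's ASCII-based criterion may have handled them differently.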
CONCLUSION AND FUTURE RESEARCH
This study analyzed the search topics and user intents of Chinese-English mixed language queries submitted to a Chinese web search engine, and developed a typology of English terms used in those queries. The distribution of search topics and user intents reported in this study differs from that reported in studies of monolingual queries, suggesting that users of Chinese-English mixed language queries exhibit distinct search behavior. Future research includes interviewing users who employ Chinese-English mixed language queries in web searching, and investigating the contexts in which users choose such queries and the situations in which these queries would be beneficial.

REFERENCES
Auer, P. (1998). Code-switching in conversation: Language, interaction, and identity. London: Routledge.

Beitzel, S. M., Jensen, E. C., Frieder, O., Lewis, D. D., Chowdhury, A., & Kołcz, A. (2005). Improving automatic query classification via semi-supervised learning. Proceedings of the 5th IEEE International Conference on Data Mining (ICDM) (pp. 42-49). Menlo Park, CA: IEEE Computer Society.

Broder, A. (2002). A taxonomy of web search. SIGIR Forum, 36(2), 3-10.

Charmaz, K. (2006). Coding in grounded theory practice. In Constructing grounded theory: A practical guide through qualitative analysis (pp. 42-71). Thousand Oaks, CA: Sage.

Chau, M., Fang, X., & Yang, C. C. (2007). Web searching in Chinese: A study of a search engine in Hong Kong. Journal of the American Society for Information Science and Technology, 58(7), 1044-1054.

Chau, M., Lu, Y., Fang, X., & Yang, C. C. (2009). Characteristics of character usage in Chinese web searching. Information Processing and Management, 45(1), 115-130.

Jansen, B. J., & Booth, D. L. (2010). Classifying web queries by topic and user intent. Retrieved June 1, 2014 from http://faculty.ist.psu.edu/jjansen/academic/jansen_user_intent.pdf

Jansen, B. J., Booth, D. L., & Spink, A. (2008). Determining the informational, navigational, and transactional intent of web queries. Information Processing and Management, 44(3), 1251-1266.

Lu, Y., Chau, M., Fang, X., & Yang, C. C. (2006). Analysis of the bilingual queries in a Chinese web search engine. Retrieved June 1, 2014 from http://www.fbe.hku.hk/~mchau/papers/bilingualqueries_web.pdf

Powell, R. R., & Connaway, L. S. (2004). Basic research methods for librarians (4th ed.). Westport, CT: Libraries Unlimited.

Rose, D., & Levinson, D. (2004). Understanding user goals in web search. In S. Feldman, M. Uretsky, M. Najork, & C. Wills (Eds.), Proceedings of the 13th International Conference on World Wide Web (WWW 2004) (pp. 13-19). New York: ACM Press.

Spink, A., Wolfram, D., Jansen, M. B. J., & Saracevic, T. (2001). Searching the web: The public and their queries. Journal of the American Society for Information Science and Technology, 52(3), 226-234.

Strauss, A., & Corbin, J. (1994). Grounded theory methodology: An overview. In N. K. Denzin & Y. S. Lincoln (Eds.), The handbook of qualitative research (pp. 273-285). Thousand Oaks, CA: Sage Publications.

W3Techs. (2013). Usage of content languages for website. Retrieved June 1, 2014 from http://w3techs.com/technologies/overview/content_language/all