SELECTION OF BEST KEYWORDS: A POISSON REGRESSION MODEL

Ji Li, Rui Pan, and Hansheng Wang

ABSTRACT: With the rapid development of the Internet and information technology, consumers have increasingly begun to acquire information through search engines, thus creating profitable advertising opportunities and advancing the practice of paid search advertising. For paid search advertising, a key issue is determining which keywords to bid on, because the total number of possible keywords is huge. To provide theoretical guidance, this study proposes a statistical model that links advertising effectiveness to keyword characteristics. Through empirical tests with a real data set obtained from the Web site of a service company in China, this research reveals that the parsing features of keywords affect their advertising effectiveness, which can help advertisers create and select new keywords for paid search advertising campaigns.

Keywords: search engine marketing, paid search advertising, keyword, Poisson regression

The rapid development and popularization of the Internet have induced profound changes in the attitudes and related behaviors of consumers. They increasingly search for product information and make purchases on the Internet, driving the development of online advertising. By 2011, according to some estimates, companies will have spent US$36.5 billion in online advertising, and 40% of that growth will come from paid search advertising (Agarwal, Hosanagar, and Smith 2008).

The emergence of paid search advertising relates closely to search engine technology development and consumer search behavior. Search engines provide a powerful search technology that enables consumers to obtain a large amount of comparatively precise target content in a very short time. Therefore, search engines offer an important tool for consumers to access needed information through the Internet.
Usually consumers input a keyword into a search engine, which then displays relevant information in the list of search results in a certain order. Top-ranked links appear more likely to be clicked on than links with lower ranks (Feng, Bhargava, and Pennock 2007), so earning a top position for a Web site among the search results can facilitate effective advertising. Therefore, understanding the ranking policies of a search engine is crucial. Typically, a search engine applies two ranking policies, depending on whether links have been paid for or not. For unpaid (or unsponsored) links, a search engine ranks them according to their "relevance," though different search engines (e.g., Yahoo, Google, Baidu, Bing) have different definitions of relevance. If a Web site receives the highest relevance score according to a search engine's algorithm, its link will be placed on top of the displayed results. Such results typically are referred to as organic search results (Yang and Ghose 2009), as differentiated from paid search results. To fall among the top-ranked entries in an organic search result is truly beneficial, because it is free. However, to ensure a Web site appears among the top search results is very difficult, if not impossible. Therefore, another option is to pay the search engine to guarantee the Web site will be ranked as close to the top of the search results as possible. Through the practice of paid search advertising, an advertiser bids a price for every click a keyword receives from the search engine. Usually multiple bids exist for the same keyword. The search engine then determines their relative order, according to the bid prices, together with some other search engine-specific concerns (e.g., Google evaluates the landing page quality). With the help of paid search advertising, a good ranking can be obtained simply by bidding a sufficient price. Such paid search advertising is extremely popular. 
To make good use of their paid search advertisements, marketers also must decide which keywords to select. Search engines typically allow a single sponsor to bid on tens of thousands of keywords simultaneously, though most potential sponsors must consider total possible keywords that run into the billions. Selecting the right keywords from a huge pool of possible keyword candidates is challenging; even if the total number is relatively small, bidding on all of them simultaneously rarely is preferable. Most keywords do not generate reliable advertising impact and instead just exhaust the sponsor's advertising budget. Bidding on the right keywords is thus the first and very crucial step in optimizing advertising budget spending.

Journal of Interactive Advertising, Vol 11 No 1 (Fall 2010), pp American Academy of Advertising, All rights reserved ISSN

Despite the practical importance of keyword selection, principled theoretical guidance for actual practice is lacking. Practitioners mainly rely on subjective experience and/or experiments to select keywords. Because of the vast number of keywords, this typical practice is ineffective and time consuming. We therefore construct a statistical model that links advertising effectiveness to keyword characteristics. We test our model empirically using a real data set from a Chinese service Web site. The parsing characteristics of keywords affect their advertising effectiveness, which can help advertisers create and select new keywords in paid search advertising campaigns. Furthermore, the general principles of our model apply to other, similar online companies.

LITERATURE REVIEW

To facilitate our understanding of keywords in paid search advertising, as well as gain insights on the clickthrough rates of paid search advertisements, we systematically reviewed research into three related aspects: clickthroughs as a key performance index of paid search advertising, the effect of keywords, and the effects of keyword location (or ranking) on clickthrough rates for paid search advertising.

Internet Advertising and Clickthroughs

Prior to the emergence of paid search ads, studies of Internet advertising focused largely on banner advertising.
Researchers used clickthrough rates (CTR) to measure banner ad performance (Chatterjee, Hoffman, and Novak 2003), but these rates began to decline rapidly in the 1990s, falling from 7% in 1996 to .3%. Dreze and Hussherr (2003) showed that these low CTRs resulted because consumers generally avoid looking at Internet advertising. Cho and Cheon (2004) further found that consumers did so because they perceived that advertising impeded their goal achievement. They suggested the adoption of highly customized advertising messages, in line with the Web context. Moore, Stammerjohan, and Coulter (2005) also examined the important role of congruity between the Web site and related advertising, which can increase positive attitudes in consumers.

Paid search ads can overcome these shortcomings of banner advertising. Because paid search advertisements appear in search results and correlate strongly with consumer needs, consumers are more likely to pay attention to ads they believe are useful, motivating them to click. Thus, CTRs in paid search advertising are much higher than in banner ads. Furthermore, advertisers only need to pay for ads clicked on by consumers, so clickthrough volume offers an important indicator of the performance of paid search advertising.

Keywords in Paid Search Advertising

To search for information on the Internet, consumers usually type a keyword or keywords related to the desired information into a search engine. Because of differences in their focus or language habits, consumers often use vastly different keywords. In this sense, the matching probability of keywords in paid search advertising also varies. The characteristics of keywords are therefore the first important determinants of CTR for paid search advertising. Rutz and Bucklin (2007) study keywords for search engine advertising using data from a major hotel chain.
They find that keyword features, such as whether the word "hotel" is included or if they pertain to geographical information, brand names, and so on, influence the conversion rate of paid search advertising. They also propose a conversion rate forecasting model to help advertisers choose keywords more effectively. Ghose and Yang (2009) study keyword features, such as length, brand information, and retailer information. By analyzing data from an online retailer, they find that consumer CTR for keywords with carried brand information are higher, and consumers are more likely to purchase when the keywords contain retailers' information. In a recent study, Rutz and Bucklin (2010) further analyze the indirect influence of a general search (no brand information in the keywords) on branded search (brand information in the keywords). General keyword search advertisements can reveal advertiser brand information to consumers, which will affect their future searches for branded keywords. Keyword Ranking or Location Consumers are more likely to click on earlier or higher ranked links than those listed in the bottom of organic or paid search list results (Feng, Bhargava, and Pennock 2007). We therefore expect that the rank of keywords is an important determinant of the CTR of paid search advertising.
Several researchers focus on empirical analyses of keyword rankings. Rutz and Bucklin (2007) find that the average position of keywords affects conversion rates; Agarwal, Hosanagar, and Smith (2008) reveal that a more prominent location of a search advertisement is not necessarily better than a less prominent one. Although the CTR of paid search advertisements decreases when the location is less prominent, for long keywords, the conversion rate increases first, then decreases following the decline in ad location.

Advertisers must bid more for each keyword to compete for a better rank, location, or position. Several researchers study the bidding mechanism of paid search advertising (e.g., Edelman and Ostrovsky 2007; Edelman, Ostrovsky, and Schwarz 2007), but because this topic is not the focus of this study, we do not discuss their findings here.

RESEARCH QUESTIONS

For keyword selection, previous studies mainly focus on online businesses, such as online hotel reservations or e-commerce. However, more offline companies also make use of paid search advertising to promote their company Web sites or raise brand awareness. Moreover, there are no studies on paid search advertising in the Chinese market. Language differences ensure that keyword research has a strong cultural component. Therefore, this study contributes to existing research on paid search advertising by answering two key research questions:

RQ1: How do offline companies choose keywords in paid search advertisements, considering that they focus on communication and promotion of corporate and brand image, whereas online companies instead aim to sell products or services?

RQ2: How can advertisers create and select keywords in paid search advertising for the Chinese market, in accordance with the characteristics of Chinese syntax?
METHODOLOGY

For this study, we investigate each keyword's clickthrough volume, as determined by the parsing characteristics and ranking of each keyword.

Data Source

The data for this study came from a tutoring agency company that helps customers find tutors or extracurricular classes to meet their education needs. The data were collected for 38 days, from October 20, 2009, until November 26, 2009. During the 38-day data collection period, users employed 648 keywords to search. The data therefore consisted of searched keywords, the corresponding clickthrough volume, and the average rank of the keywords.

Data Recoding

Because the keywords are text messages that are not easy to incorporate in a quantitative analysis, we decompose them according to their semantic structure. The general patterns of keywords in the data set are as follows: "北京家教" (tutoring in Beijing); "初三物理家教" (physics tutoring for students in the third year of secondary school); or "小学英语辅导班" (primary school English instructing class). Thus, we derive the following general structure for recoding keywords: Place + Action + Grade + Subject + Purpose + Classification, where

Place means the place where tutors can be found, given as a province, city, or region name, such as "北京" (Beijing), "上海" (Shanghai), and so on. It allows for a default level. There are 22 levels (i.e., cities) for this factor.

Action describes the act of finding tutors and allows for a default level. There are two levels, "找" (searching for) and the default.

Grade is the grade level taught by tutors, such as "小学" (primary school), "初三" (third year in secondary school), or "高三" (sixth year in secondary school), with a default level. Seven levels constitute this factor.

Subject describes the specific subjects for tutors, such as "数学" (mathematics), "语文" (Chinese), "英语" (English), and so on. With the default level, there are six levels in this factor.
Purpose, a core component of keywords, indicates the most basic search purposes, such as "家教" (tutoring), "辅导" (instructing), and so on. To avoid keywords that lead to nothing, we require no default for this factor, which consists of five levels.

Classification describes the kinds of tutoring service searchers will find, such as "中介" (agent), "网" (Web
site), or "班" (tutoring classes). A default level exists, and there are a total of six levels.

To gain some intuitive understanding, we took three keywords as examples and decomposed their semantic components. We present the results in Table 1.

Table 1. Examples for Keyword Recoding

Keywords: (1) 北京家教 (tutoring in Beijing); (2) 初三物理家教 (physics tutoring for students in the third year of secondary school); (3) 小学英语辅导班 (primary school English instructing class)

Factor            (1)               (2)                                         (3)
Place             北京 (Beijing)     NA                                          NA
Action            NA                NA                                          NA
Grade             NA                初三 (third year of secondary school)        小学 (primary school)
Subject           NA                物理 (physics)                               英语 (English)
Purpose           家教 (tutoring)    家教 (tutoring)                              辅导 (instructing)
Classification    NA                NA                                          班 (class)

Furthermore, we added the total clickthrough volume for the 38-day data collection period to derive a final clickthrough volume. To study the influence of keyword rank, we treated it as a categorical variable. Specifically, we classified rank into three categories: Category 1 consists of those keywords with an average rank between 1 and 1.5, Category 2 comprises those from 1.5 to 2.5, and Category 3 contains keywords with ranks higher than 2.5. In addition, we obtained the length of each keyword by taking one Chinese character as two characters in length; therefore, the words "北京家教," "初三物理家教," and "小学英语辅导班" are 8, 12, and 14 characters in length, respectively. In total, the data include 648 samples.

Model Setup

Because the dependent variable (clickthrough volume) is a form of count data, we introduced Poisson regression to the analysis. We assume that the probability that the clickthrough volume $Y_i$ for the ith keyword is k times is

$$P(Y_i = k) = \frac{\lambda_i^{k} e^{-\lambda_i}}{k!},$$

where $\lambda_i$ is the average clickthrough volume for the ith keyword. Although clickthrough volume is an integer, the average clickthrough volume is continuous. Because $\lambda_i$ has a large variance, we carried out a logarithmic transformation.
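The recoding rules and the Poisson probability described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names are hypothetical, the treatment of the category boundaries at exactly 1.5 and 2.5 is an assumption (the text does not say which side they fall on), and counting ASCII characters as length one is likewise an assumption.

```python
import math


def rank_category(avg_rank: float) -> int:
    """Map an average rank to the three rank categories.

    Assumption: the boundaries 1.5 and 2.5 belong to the lower category.
    """
    if avg_rank <= 1.5:
        return 1
    if avg_rank <= 2.5:
        return 2
    return 3


def keyword_length(keyword: str) -> int:
    """Count each Chinese (non-ASCII) character as two characters.

    Assumption: any ASCII character counts as one.
    """
    return sum(1 if ch.isascii() else 2 for ch in keyword)


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k clickthroughs when the mean volume is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


if __name__ == "__main__":
    for kw in ("北京家教", "初三物理家教", "小学英语辅导班"):
        print(kw, keyword_length(kw))
    print(rank_category(1.2), rank_category(2.0), rank_category(2.8))
```

The three example keywords reproduce the lengths 8, 12, and 14 given in the text.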
The Poisson regression model is:

$$\log(\lambda_i) = \beta_0 + \beta_1\,\mathrm{rank}_{1i} + \beta_2\,\mathrm{rank}_{2i} + \beta_3\,\mathrm{length}_i + \sum_{j=1}^{6}\sum_{k=1}^{K_j} \gamma_{jk}\, x_{ijk},$$

where $\lambda_i$ is the average clickthrough volume for the ith keyword; $\mathrm{rank}_1$ and $\mathrm{rank}_2$ refer to the first and second category of rank, with the third category as a baseline; and $\mathrm{length}_i$ is the length of the ith keyword. We recoded the keyword as six factors, so $x_{ijk}$ refers to the kth level under the jth (j = 1,..., 6) factor of the ith keyword. It is a dummy variable, for which $K_j$ means the total number of levels under the jth factor.

RESULTS

Descriptive Analysis

We summarize the descriptive statistics for clickthrough volume, rank, and length in Table 2.
Table 2. Descriptive Analysis for Clickthrough Volume, Rank, and Keyword Length (columns: Maximum, Median, Minimum, Mean, Standard Deviation; rows: Clickthrough volume, Rank, Keyword length)

As Table 2 shows, the largest clickthrough volume for a keyword is 3,143, and the smallest is just 1 click. The median clickthrough volume is 3, which means the distribution of this variable is strongly right-skewed; most keywords were clicked on no more than 3 times, but some keywords were clicked on 3,143 times. The highest rank is 1, and the lowest is 3, due to the rearrangement of this variable. The lengths of the keywords range from 4 to 18 characters.

For the basic characteristics of the keywords, we conducted a descriptive analysis of the six factors. Of the 22 place levels, the default accounts for 10.34% of total keywords, indicating that a significant portion of keywords do not include any specific place information. The keywords containing "北京" (Beijing) and "天津" (Tianjin) account for 6.64% and 6.17%, respectively, of total keywords. The action variable mostly is accounted for by the default (94.91%), followed by "找" (searching) at 5.09%. The default level of the grade also accounts for the largest proportion (50.31%), followed by "高中" (senior secondary) and "小学" (primary school), with shares of 13.73% and 11.88%, respectively. Similarly, of its six levels, the default subject level accounts for 55.56%, followed by "数学" (mathematics) and "英语" (English), which represent 14.51% and 13.43%, respectively. For the purpose variable, we included no default level, and most of the keyword sample consists of "家教" (tutoring, 69.44%), followed by "辅导" (instructing, 21.76%). Finally, classification mainly consists of the default level, at 83.95%, followed by "班" (classes) at 8.64%.
Regression Results

Judging from the overall fit of the model, the chi-squared goodness of fit is 22,688, with a p-value of less than .0001, so the model is significant. The estimated parameters of all variables, the standard errors, and the p-values are in Tables 3-9. The first level in each table is the benchmark; for this study, the default level of each factor, except purpose, provides that benchmark.

Table 3. Regression Results: Intercept, Rank, and Keyword Length

Variable             Parameter    Standard Error    p-value
Intercept                                           <.0001
Rank (Category 1)                                   <.0001
Rank (Category 2)                                   <.0001
Keyword length                                      <.0001

In Table 3, we show that the coefficients of the two categories of rank are positive, so when the rank of the keyword falls in Category 1 or 2, the clickthrough volume is larger than that of a keyword in Category 3. Specifically, the coefficient of Category 1 is smaller than that of Category 2, which indicates that the clickthrough volume is greater when the rank falls in Category 2. Furthermore, the keyword length coefficient is negative; that is, the longer the keyword, the lower the clickthrough volume.
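Because the model is linear in the logarithm of the mean, each coefficient acts multiplicatively on the expected clickthrough volume. The short derivation below makes this reading explicit. The symbols are assumptions chosen for illustration: $\beta_1$ and $\beta_2$ denote the coefficients of rank Categories 1 and 2, $\beta_3$ the keyword length coefficient, and $c$ collects every term that is held fixed in the comparison.

```latex
% Two keywords identical except for rank category (Category 3 is the baseline):
\frac{\lambda^{(\mathrm{cat}\,1)}}{\lambda^{(\mathrm{cat}\,3)}}
  = \frac{\exp(c + \beta_1)}{\exp(c)} = \exp(\beta_1),
\qquad
\frac{\lambda^{(\mathrm{cat}\,2)}}{\lambda^{(\mathrm{cat}\,3)}} = \exp(\beta_2).

% Positive coefficients give ratios above 1; comparing Category 1 with Category 2:
\frac{\lambda^{(\mathrm{cat}\,1)}}{\lambda^{(\mathrm{cat}\,2)}}
  = \exp(\beta_1 - \beta_2) < 1 \quad \text{when } \beta_1 < \beta_2.

% One additional character of keyword length:
\frac{\lambda(\mathrm{length} + 1)}{\lambda(\mathrm{length})}
  = \exp(\beta_3) < 1 \quad \text{when } \beta_3 < 0.
```

Read this way, a negative length coefficient means that every extra character scales the expected clickthrough volume down by the same factor, rather than subtracting a fixed number of clicks.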
Table 4. Regression Result: Place

Variable     Parameter    Standard Error    p-value
Beijing                                     <.0001
Changchun                                   <.0001
Chengdu                                     <.0001
Dalian                                      <.0001

All levels for the place factor are significant and negative. That is, the clickthrough volume of keywords that include place information is significantly less than that of keywords without any geographic information. The average regression coefficient for place is

Table 5. Regression Result: Action

Variable              Parameter    Standard Error    p-value
找 (searching for)                                    <.0001

Action is also significant, and the coefficient is negative, indicating that keywords that do not express any action earn the largest clickthrough volumes.

Table 6. Regression Result: Grade

Variable                                          Parameter    Standard Error    p-value
初三 (third year in junior secondary school)                                      <.0001
初中 (junior secondary school)                                                    <.0001
高中 (senior secondary)                                                           <.0001

The regression coefficients for all grade levels are significant and negative. The clickthrough volumes for keywords containing information about grades thus are significantly lower than those with no grade information. The mean of the regression coefficients for all grade levels is
Table 7. Regression Result: Subject

Variable              Parameter    Standard Error    p-value
化学 (chemistry)                                      <.0001
数学 (mathematics)                                    <.0001
物理 (physics)                                        <.0001

Similar to the previous factors, all subject levels are significant, with negative regression coefficients. The clickthrough volumes of keywords containing information about subjects are significantly less than those of keywords without subject-related information. The mean of the regression coefficients for the subject factor at all levels is

Table 8. Regression Result: Purpose

Variable                                Parameter    Standard Error    p-value
课外辅导 (extracurricular instructing)                                   .0000
辅导 (instructing)                                                      <.0001
家教 (tutoring)                                                         <.0001
家教辅导 (instructing by tutor)                                          <.0001

Because there is no default level for the purpose factor, we use the level selected by the statistical software SAS as a benchmark. As Table 8 shows, the regression coefficients in all levels are significant; in addition, some are positive, whereas others are negative, among which the coefficient of "家教" is greatest. Therefore, a keyword that includes "家教" has a greater clickthrough volume than the others.

Table 9. Regression Result: Classification

Variable              Parameter    Standard Error    p-value
班 (class)                                            <.0001
网 (Web site)                                         <.0001
信息 (information)                                    <.0001
Finally, for the classification factor, all the regression coefficients for other levels are significant, though in different directions. For example, the coefficient of "网" (Web site) is positive, so the clickthrough volumes of keywords that express a preference for a tutoring Web site are greater than those of keywords without any classification preference.

Using all estimated parameters for all levels of the independent variables, we can create keywords and estimate their clickthrough volumes. Take "北京数学家教" (mathematics tutoring in Beijing) for example. Its six factor levels are as follows: place = "北京" (Beijing); action = NA; grade = NA; subject = "数学" (mathematics); purpose = "家教" (tutoring); and classification = NA. The length of this keyword is 12. When the assumed rank is 2, the average clickthrough volume = exp( ) = . This computation is an extreme case, because we assume a rank in Category 2, the category with the highest estimated clickthrough volume. In reality, the actual clickthrough volume of the keyword would likely be lower.

DISCUSSION

Through Poisson regression analysis, we have learned about the influence of several independent variables and their respective levels on the clickthrough volumes of keywords. Rank has a nonlinear effect; the keyword with an average rank between 1.5 and 2.5 has the highest clickthrough volume, followed by the keyword with an average rank between 1 and 1.5. The keyword with an average rank greater than 2.5 has the lowest clickthrough volume. The length of a keyword has a negative effect, which means that the longer the keyword, the lower its clickthrough volume. In addition, various factors of the keywords themselves significantly affect clickthrough volume. Advertisers can use these results to predict the clickthrough volume of a specific keyword, then select new keywords that will yield higher clickthrough volumes to enhance their marketing performance.
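The prediction step described above can be sketched as follows. Every coefficient in this example is a hypothetical placeholder chosen only to mimic the reported signs (positive rank effects with Category 2 largest, negative length, place, and subject effects); none of them is an estimate from the study, and the names `COEF` and `predict_clicks` are illustrative.

```python
import math

# HYPOTHETICAL coefficients for illustration only; NOT the paper's estimates.
COEF = {
    "intercept": 2.0,
    "rank": {1: 0.5, 2: 0.8, 3: 0.0},  # Category 3 is the baseline
    "length": -0.05,                   # per character of keyword length
    "place": {"北京": -0.6},
    "subject": {"数学": -0.4},
    "purpose": {"家教": 0.3},
}


def predict_clicks(rank_cat, length, levels, coef=COEF):
    """Expected clickthrough volume exp(linear predictor) under a log link.

    `levels` maps factor name -> level; None marks the default (benchmark)
    level, which contributes nothing to the linear predictor.
    """
    eta = coef["intercept"] + coef["rank"][rank_cat] + coef["length"] * length
    for factor, level in levels.items():
        if level is not None:
            eta += coef[factor][level]
    return math.exp(eta)


if __name__ == "__main__":
    # "北京数学家教": place = 北京, subject = 数学, purpose = 家教, length 12.
    levels = {"place": "北京", "action": None, "grade": None,
              "subject": "数学", "purpose": "家教", "classification": None}
    for cat in (1, 2, 3):
        print(cat, round(predict_clicks(cat, 12, levels), 3))
```

With real estimates substituted for the placeholders, an advertiser could score candidate keywords the same way and keep those with the highest predicted volume for a given rank category.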
For most factors, the default level achieves the highest clickthrough volume, in support of the assumption of a negative relationship between the length and clickthrough volume of the keywords. Therefore, when constructing new keywords, the advertisers should start from the default level of all factors (i.e., the keyword for all default levels is " 家 教," or tutoring), and then adjust the various factors in turn to find a clickthrough volume that meets their standard. When calculating the clickthrough volume of these keywords, we did not take rank into consideration, but in practice of course, specific rank information should be obtained before calculating the clickthrough volume. PRACTICAL IMPLICATIONS As an emerging means of advertising, paid search advertising has rapidly developed in recent years and been recognized and accepted by an increasing number of advertisers and consumers. Paid search advertising challenges traditional advertising media in terms of cost, accuracy, flexibility, and controllability. It also influences e-commerce development, marketing strategies, and consumer behaviors. Approximately 40 million companies use search engine ranking and bidding to conduct their marketing activities. However, a vast majority of companies still fail to recognize the space for improvement in their paid search advertising decisions. Keyword construction is an initial step, yet in China, no products or services provide effective keyword construction suggestions. A typical suggestion is to find keywords that relate to the advertiser's products, brands, competitors, and target customers, then generate a large number of keywords through permutation and combination. But which specific keywords induce effective advertising performance? Among these effective keywords, what kind of internal relations do they show? What key techniques support keyword permutations and combinations? 
Prior to our study, there has been no effective way to analyze these issues to come up with effective solutions. This study reveals some inherent rules for effective keyword construction through a performance analysis of keywords in a paid search advertising campaign by a service company. We thus offer advice about the choice of keywords by a company. Moreover, this study provides a keyword selection approach to paid search advertisers. We discover statistical relationships between the rules and techniques of keyword design; marketers could increase clickthrough volumes by analyzing historical data and constructing high-performing keywords accordingly. LIMITATIONS AND FURTHER RESEARCH This exploratory study on the use of keywords in paid search advertising naturally contains some limitations that should be addressed in further research. First, we do not take consumer behaviors into account. The most fundamental reason for the relationship between keywords and clickthrough volume may be varying consumer demands. This study therefore provides only a decision-making tool for advertisers, rather than granting more insight into the very essence of paid search advertising. Additional studies should analyze paid search advertising keywords from a consumer perspective to discover
their attitudes and behaviors, which determine clickthrough volume. Second, we did not analyze bidding decisions. Because the bid price (i.e., single clickthrough cost) determines the rank of a keyword, and also consumes the advertising budget, it is critical to the effectiveness of any paid search advertising. However, the relationship between the single clickthrough cost and the rank of the keyword is complicated. On the one hand, a single clickthrough cost requires a minimum starting cost; on the other hand, the ad ranking is not linearly related to the bid but rather is the result of bidding with other competitors, which requires an analysis that includes competition. Further research should focus on the bid ranking of keywords.

REFERENCES

Agarwal, Ashish, Kartik Hosanagar, and Michael David Smith (2008), "Location, Location, Location: An Analysis of Profitability of Position in Online Advertising Markets," (accessed March 26, 2010).

Chatterjee, Patrali, Donna Hoffman, and Tom Novak (2003), "Modeling the Clickstream: Implications for Web-Based Advertising Efforts," Marketing Science, 22 (4),

Cho, Chang-Hoan and Hongsik John Cheon (2004), "Why Do People Avoid Advertising on the Internet?" Journal of Advertising, 33 (4),

Dreze, Xavier and Francois-Xavier Hussherr (2003), "Internet Advertising: Is Anybody Watching?" Journal of Interactive Marketing, 17 (4),

Edelman, Ben and Michael Ostrovsky (2007), "Strategic Bidder Behavior in Sponsored Search Auctions," Decision Support Systems, 43 (1),

---, ---, and Michael Schwarz (2007), "Internet Advertising and the Generalized 2nd-Price Auction: Selling Billions of Dollars Worth of Keywords," American Economic Review, 97 (1),

Feng, Juan, Hemant Bhargava, and David Pennock (2007), "Implementing Sponsored Search in Web Search Engines: Computational Evaluation of Alternative Mechanisms," INFORMS Journal on Computing, 19 (1),

Ghose, Anindya and Sha Yang (2009), "An Empirical Analysis of Search Engine Advertising Sponsored Search in Electronic Markets," Management Science, 55 (10),

Moore, Robert, Claire Allison Stammerjohan, and Robin Coulter (2005), "Banner Advertiser-Website Congruity and Color Effects on Attention and Attitudes," Journal of Advertising, 34 (2),

Rutz, Oliver J. and Randolph E. Bucklin (2007), "A Model for Individual Keyword Performance in PSA," (accessed March 26, 2010).

--- and --- (2010), "From Generic to Branded: A Model of Spillover in PSA," Journal of Marketing Research, forthcoming.

Yang, Sha and Anindya Ghose (2009), "Analyzing the Relationship Between Organic and Sponsored Search Advertising: Positive, Negative or Zero Interdependence?" working paper.

ABOUT THE AUTHORS

Ji Li (Ph.D., Peking University) is an Assistant Professor of Marketing in the School of Business, Central University of Finance and Economics. Her research interests include marketing modeling, Internet marketing, and customer relationship management.

Rui Pan is a Ph.D. candidate in Business Statistics and Econometrics at the Guanghua School of Management, Peking University. Her areas of research interest include analyses of ultra-high-dimensional data, marketing modeling, and social network analysis.

Hansheng Wang (Ph.D., University of Wisconsin-Madison) is a Professor of Business Statistics and Econometrics in Guanghua School of Management, Peking University. His areas of research interest include ultra-high-dimensional data analysis, regression shrinkage and selection, sufficient dimension reduction, and nonparametric and semiparametric models.