Emerging Technology Forecasting Using New Patent Information Analysis

Sunghae Jun* and Seung-Joo Lee
Department of Statistics, Cheongju University, Chungbuk, Korea
shjun@cju.ac.kr, access@cju.ac.kr
*Corresponding Author: Sunghae Jun (shjun@cju.ac.kr)

Abstract

Emerging technology drives technological development and innovation in diverse fields of technology. Emerging technology forecasting can predict the possible areas of emerging technology. However, it is difficult to forecast emerging technology because most technology forecasting tasks depend on the subjective experience of experts. Patent analysis is an objective method to recognize the trends in technological development. Many patent analysis methods have been researched; these methods apply text mining techniques to analyze the text data of patent documents, such as the title and abstract. This approach has some limitations, namely the computing cost and information loss associated with the preprocessing step of text mining. Therefore, we propose a new patent information analysis to overcome these problems. Using the International Patent Classification codes from the patent documents of a target technology, we construct an emerging technology forecasting model. This research combines statistical inference and neural networks to construct our model for new patent information analysis. We perform a case study to verify how our research can be practically applied, using nanotechnology as the target technology. This research can therefore contribute to R&D planning.

Keywords: Emerging technology, patent information analysis, technology forecasting

1. Introduction

Technology forecasting (TF) anticipates technological trends such as the direction and the rate of technological development [1-3]. We can plan R&D policy and new product development based on TF results. Also, a TF model can inform a company's strategic decisions concerning technology licensing and patent management [4].
TF is thus an important issue for a company. Emerging technology drives technological development and innovation in diverse fields of technology [5]. The development of a technology depends on the emergence of related technologies [6, 7]. Thus, extracting the relationships between a target technology and its related technologies is important to technological development. This extraction is called emerging technology forecasting (ETF) in this paper. ETF predicts possible areas of emerging technology. However, it is difficult to forecast emerging technology because most technology forecasting tasks depend on the subjective experience of experts. Such methods are not stable, and so more objective ETF methods are required [8]. Patent analysis is an objective method to recognize the trends in technological development. Many patent analysis methods have been researched [4, 9, 10]; these methods apply text mining techniques to analyze the text data of patent documents, such as the title and abstract, and they are more quantitative TF approaches than earlier methods such as the Delphi study [11, 12]. However, most patent analysis approaches have some limitations, namely the
computing cost and the information loss associated with the preprocessing step of text mining [9, 13]. One approach that avoids this problem is to analyze International Patent Classification (IPC) code data [14]. Therefore, we propose a new patent information analysis method for the purpose of ETF. Using the IPC codes from the patent documents of a target technology, we construct an emerging technology forecasting model. This research combines statistical inference and neural networks to construct our model for new patent information analysis. We perform a case study to verify how our research can be practically applied, using nanotechnology as the target technology.

2. Emerging Technology Forecasting

An emerging technology is a technology that affects and explains other technologies in the coming years [7]. In other words, if the development of a technology relies on the development of a related technology, then the latter is an emerging technology. Determining the relationship between a target technology and its emerging technologies is very important to ETF. The faster the technology develops, the more important the emerging technology is. In previous research, determining emerging technologies for a target technology depended on the Delphi study [6, 7]. In this paper, however, we determine emerging technologies via a quantitative and objective approach. Our research uses statistical inference, regression, and neural networks as quantitative methods and analyzes patent documents as objective data, because patent data are appropriate historical data for ETF [7].

3. New Patent Information Analysis for Emerging Technology Forecasting

In this paper, we construct an (n × p) patent-IPC code matrix (PICM) from the retrieved patent documents using text mining techniques; this matrix is then used for our ETF modeling. Figure 1 shows our PICM structure.

Figure 1. PICM Structure

The rows and columns of this matrix are the patent numbers and IPC codes, respectively; a matrix element represents the frequency with which a given IPC code occurs in a patent. Our quantitative analysis is performed using this matrix. First, we have to determine the dependent and independent variables for the construction of the ETF model. The dependent variable is the IPC code representing the target technology. The IPC codes explaining (affecting) the target technology are the independent variables. Not all of the other IPC codes (excluding the dependent-variable code) qualify as independent variables; thus, we have to select the significant independent variables that explain the target variable. In this paper, we use statistical inference to select the significant IPC codes (independent variables). To construct our ETF model, we combine statistical inference and neural network methods. In the statistical inference step, we select the IPC codes to be used in the ETF model. The selected IPC codes are the top-ranked codes and are divided into dependent and independent codes. Each code
is a variable in our model. The dependent code is the response variable; this is our ETF target technology. All of the remaining codes are candidate independent (predictive) variables: the target technology (dependent code) is explained by the technologies of the predictive variables (independent codes). To select a meaningful variable to explain the target variable, we test the significance of each independent variable. In this paper, we use the probability value (p-value) of a test statistic computed from the observed data under the assumption that the null hypothesis (H_0) is true (i.e., the variable is not significant). The p-value p(x) represents the probability of observing a sample as extreme as X under H_0 [15], and the significance level α of the level-α test takes a value between 0 and 1. A small p(x) suggests that the alternative hypothesis is true (i.e., the variable is significant). When the p-value is less than the significance level (e.g., 0.05, corresponding to a 95% confidence level), we determine that the variable is significant. This research applies the p-value to two statistical methods, multiple regression and correlation analysis. Our regression model is as follows:

IPC_Ti = β_0 + β_1·IPC_1i + β_2·IPC_2i + … + β_k·IPC_ki + ε_i

where IPC_Ti is the frequency of the target-technology IPC code in the ith patent (i = 1, 2, …, n), β_0 is the intercept, and (β_1, …, β_k) are the slopes [16]. β_j denotes the amount of increase or decrease in the value of IPC_T (target technology, dependent variable) associated with a one-unit increase in IPC_j (explanatory technology, independent variable). In other words, the regression parameter β_j describes the relationship between IPC_T and IPC_j. To select the necessary IPC codes for the ETF model of IPC_T, we test the significance of β_j as follows:

H_0: β_j = 0  vs.  H_1: β_j ≠ 0

The null hypothesis (H_0) states that the variable is not significant and the alternative hypothesis (H_1) states that the variable is significant.
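As a concrete illustration of this significance test, the following sketch (illustrative code written for this explanation, not the authors' implementation) fits a simple regression of a target IPC code on a single explanatory code and returns the slope estimate with its t-statistic for H_0: β_j = 0; the p-value would then be read from the t-distribution with n − 2 degrees of freedom.

```python
def slope_t_test(x, y):
    """Fit y = b0 + b1*x by least squares and return (b1, t-statistic)
    for testing H0: b1 = 0 against H1: b1 != 0."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx                       # slope estimate
    b0 = my - b1 * mx                    # intercept estimate
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - 2)  # residual variance
    se_b1 = (s2 / sxx) ** 0.5            # standard error of the slope
    return b1, b1 / se_b1

# Toy data: one explanatory IPC code's frequencies vs. the target
# IPC code's frequencies across five patents (synthetic numbers)
b1, t = slope_t_test([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
```

A large |t| (equivalently, a small p-value) leads to rejecting H_0 and keeping IPC_j in the model; for the full multiple regression with k explanatory codes, a statistical package such as R or statsmodels reports the corresponding p-value for each fitted β_j directly.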
If the p-value of the regression result is less than 0.05 (i.e., at the 95% confidence level), we can reject H_0 and accept H_1, so the variable is selected for the ETF model. We also apply Pearson correlation analysis as a method of statistical inference for extracting the significant variables. The correlation coefficient is as follows:

r = Σ_{i=1..n} (IPC_ji − mean(IPC_j))(IPC_Ti − mean(IPC_T)) / √[ Σ_{i=1..n} (IPC_ji − mean(IPC_j))² · Σ_{i=1..n} (IPC_Ti − mean(IPC_T))² ]

where mean(IPC_j) and mean(IPC_T) are the mean values of IPC_ji and IPC_Ti (i = 1, 2, …, n), respectively. This coefficient represents the correlation between IPC_ji and IPC_Ti. As in the multiple regression analysis, H_0 states that the independent variable (IPC_j) is not significant with respect to the dependent variable (IPC_T), and H_1 states the significance of IPC_j to IPC_T. Using the p-value of IPC_j, we decide whether the IPC code is significant. Therefore, we select the necessary IPC codes to explain the development of IPC_T from the combined results of the multiple regression and the Pearson correlation. The selected IPC codes are used in the following neural network model to construct an ETF model. Neural networks have been used for diverse analyses, such as forecast modeling [17]. These networks are mathematical models that imitate the information processes of the human brain. The neural network model is a powerful tool for
predicting future events using previous (historical) data [17]. This model consists of neurons and the connecting weights between neurons. The neurons are contained in input, hidden, and output layers, as shown in Figure 2.

Figure 2. Neural Network Model for ETF

In our neural network model for ETF, the input layer is composed of the independent variables (explanatory IPC codes), and the dependent variable (target IPC code) is in the output layer. Also, this model has one hidden layer. We use the sigmoid function as the activation function of the neural network model [18]. This research uses gradient descent as an optimization strategy for constructing the ETF model based on neural networks [18]. Our proposed process of patent information analysis is shown in Figure 3.

Figure 3. Patent Analysis Process for ETF

Figure 3 represents the ETF process from the retrieved patent data to the emerging technology path diagram for identifying the future technology. To obtain a final path analysis result from the neural networks, we use the mean squared error (MSE) [19] between the predicted and actual values, and we determine the best model as the one with the smallest MSE. From the best model, we can construct the emerging technology path for ETF. Therefore, we forecast the emerging technology via the path diagram.

4. Experimental Results

To verify the performance of this research, we conduct an experiment using the patent data related to nanotechnology from the Korea Intellectual Property Rights Information Service (KIPRIS) [20]. We retrieved a total of 2,482 patent documents concerning nanotechnology. These were all the relevant patents submitted before April
14, 2012. From the patent data, we extracted all the IPC codes for our experiment. Figure 4 shows the PICM of nanotechnology.

Figure 4. PICM of Nanotechnology

The dimension of this matrix is (2482 × 253); in other words, the total number of distinct IPC codes over all patent documents is 253. An element of this matrix is the frequency with which a given IPC code occurred in each patent document. For example, the IPC code A01N occurred once in patent 9989024. Figure 5 shows the frequencies of the top-ranked IPC codes.

Figure 5. Top Ranked IPC Codes

We used the 13 IPC codes with frequencies larger than 100. Also, we designated the IPC code with the largest frequency (H01L) as the dependent variable because this code occurred in the most nanotechnology patents. Thus, we constructed the following ETF model: H01L = f(A61K, B01J, …, G01N, H01M). Our nanotechnology ETF model predicts the trend of H01L using the predictive model f(·) based on A61K, B01J, B05D, B29C, B32B, B82Y, C01B, C01G, C08K, C23C, G01N, and H01M.

Table 1. P-values of Correlation and Regression Analysis

Independent code | p-value (Corr. coefficient) | p-value (Reg. parameter)
A61K | 0.001 | 0.001
B01J | 0.010 | 0.007
B05D | 0.004 | 0.003
B29C | 0.005 | 0.004
B32B | 0.001 | 0.001
B82Y | 0.558 | 0.115
C01B | 0.003 | 0.003
C01G | 0.246 | 0.587
C08K | 0.011 | 0.006
C23C | 0.071 | 0.163
G01N | 0.029 | 0.007
H01M | 0.029 | 0.016
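The variable selection step that follows can be reproduced from Table 1 in a few lines. This is a sketch of the selection logic only (not the authors' code); the p-values are copied from the table, and 0.05 is the significance level used in the paper.

```python
# p-values from Table 1: (correlation test, regression test) per IPC code
pvalues = {
    "A61K": (0.001, 0.001), "B01J": (0.010, 0.007), "B05D": (0.004, 0.003),
    "B29C": (0.005, 0.004), "B32B": (0.001, 0.001), "B82Y": (0.558, 0.115),
    "C01B": (0.003, 0.003), "C01G": (0.246, 0.587), "C08K": (0.011, 0.006),
    "C23C": (0.071, 0.163), "G01N": (0.029, 0.007), "H01M": (0.029, 0.016),
}
ALPHA = 0.05  # significance level

# Keep a code only when both tests reject H0 (both p-values below ALPHA)
selected = sorted(code for code, (p_corr, p_reg) in pvalues.items()
                  if p_corr < ALPHA and p_reg < ALPHA)
removed = sorted(set(pvalues) - set(selected))
# selected has nine codes; removed == ['B82Y', 'C01G', 'C23C']
```

Running this filter keeps nine codes and drops B82Y, C01G, and C23C, matching the selection reported in the paper.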
We found that the IPC codes B82Y, C01G, and C23C were not significant because their p-values for both the correlation (Corr.) coefficient and the regression (Reg.) parameter were greater than 0.05. Thus, we removed these variables (IPC codes) when constructing our ETF model. Using the nine selected IPC codes, we performed a neural network analysis. Figure 6 shows the importance results we obtained for the nine selected IPC codes from the neural network model.

Figure 6. Importance Results of Explanatory IPC Codes

We found that the IPC code C01B was the most important for explaining the IPC code H01L. Table 2 shows the resulting hierarchical models used to construct the ETF model.

Table 2. Hierarchical Models for the ETF Model

Model | Used IPC codes
M1 | C01B
M2 | M1 + B32B
M3 | M2 + B29C
M4 | M3 + G01N
M5 | M4 + B01J
M6 | M5 + C08K
M7 | M6 + A61K
M8 | M7 + H01M
M9 | M8 + B05D

First, we constructed the ETF model for H01L using one IPC code, C01B. Next, we added the second most important IPC code, B32B, to the first model (M1) to construct the second model (M2). In this manner, we constructed nine candidate models (M1 to M9) to determine the optimal ETF model. Table 3 shows the neural network results of the candidate models.

Table 3. Neural Network Results Using Nine IPC Codes

Model | MSE (Training) | MSE (Test) | Diff. | Lift
M1 | 272.31 | 122.69 | 149.62 | 4.9
M2 | 269.78 | 122.59 | 147.19 | 5.0
M3 | 279.65 | 110.84 | 168.81 | 4.9
M4 | 263.72 | 124.98 | 138.74 | 5.0
M5 | 263.74 | 122.97 | 140.77 | 4.8
M6 | 270.41 | 116.49 | 153.92 | 4.7
M7 | 264.04 | 122.06 | 141.98 | 4.6
M8 | 276.65 | 102.92 | 173.73 | 4.6
M9 | 258.78 | 120.11 | 138.67 | 4.6
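The modeling procedure used here (a single hidden layer with sigmoid activations, trained by gradient descent, with candidate models compared by MSE) can be sketched in plain Python. This is a toy illustration on synthetic data, not the authors' implementation; names such as `train_nn` are our own.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_nn(X, y, hidden=3, lr=0.1, epochs=5000, seed=1):
    """One-hidden-layer network (sigmoid hidden units, linear output)
    trained by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    n_in = len(X[0])
    W1 = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            # forward pass
            h = [sigmoid(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j])
                 for j in range(hidden)]
            out = sum(w * hj for w, hj in zip(W2, h)) + b2
            err = out - target
            # backward pass: gradient descent updates
            for j in range(hidden):
                grad_hidden = err * W2[j] * h[j] * (1.0 - h[j])
                W2[j] -= lr * err * h[j]
                for i in range(n_in):
                    W1[j][i] -= lr * grad_hidden * x[i]
                b1[j] -= lr * grad_hidden
            b2 -= lr * err

    def predict(x):
        h = [sigmoid(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j])
             for j in range(hidden)]
        return sum(w * hj for w, hj in zip(W2, h)) + b2
    return predict

def mse(predict, X, y):
    """Mean squared error between predicted and actual values."""
    return sum((predict(x) - t) ** 2 for x, t in zip(X, y)) / len(y)

# Synthetic example: learn y = x1 + x2 from four samples
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0.0, 1.0, 1.0, 2.0]
model = train_nn(X, y)
```

In the paper's setting, each row of X would hold a patent's frequencies for the IPC codes of a candidate model (M1 to M9) and y the H01L frequency; every candidate is trained on the training set, and the one with the smallest training/test MSE difference is kept.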
We divided the IPC code data into two sets, a training set and a test set. Using the training data, we constructed our neural network model; this model was then validated on the test data. The model with the smallest MSE difference (Diff.) between the training and test data sets was deemed the best. We can thus consider the M4 and M9 models as candidates for the best model. Because there was little difference between the MSE results of M4 and M9, we used the lift value as another measure to determine the best model. The lift value measures the improvement in performance of a constructed model; for example, if a model has a lift value of 3, then the model is three times more accurate than random selection without the model. Since M4 had the higher lift value, we selected the four IPC codes of M4 (C01B, B32B, B29C, and G01N) for our best ETF model of nanotechnology. This research used these IPC codes to forecast the target technology (H01L). Also, the other IPC codes (B01J, C08K, A61K, H01M, and B05D) were considered to explain the IPC codes of M4. Table 4 shows the path analysis result for these IPC codes.

Table 4. Path Analysis Result (p-values)

IPC code | C01B | B32B | B29C | G01N
B01J | 0.001 | 0.935 | 0.225 | 0.206
C08K | 0.419 | 0.451 | 0.780 | 0.186
A61K | 0.117 | 0.111 | 0.463 | 0.236
H01M | 0.985 | 0.249 | 0.248 | 0.535
B05D | 0.034 | 0.001 | 0.760 | 0.059

B01J was significant with respect to C01B because its p-value was less than 0.05. Also, we found that B05D affected C01B and B32B significantly. Using these results, we constructed the path diagram for the nanotechnology ETF model shown in Figure 7.

Figure 7. Emerging Technology Path for Nanotechnology

The nanotechnology (H01L) develops through the development of four technologies: C01B, B32B, B29C, and G01N. In turn, the technology of C01B depends on the technological developments of B01J and B05D, and the development of B05D affects the technology of B32B.

5.
Conclusions

In this paper, we proposed an ETF method using statistical inference and neural networks. This research used the IPC code data of patent documents for the ETF model. As a case study to verify our ETF model, we selected nanotechnology as the target technology. Using the p-values from the regression and correlation analyses, we extracted the meaningful IPC codes from the PICM and constructed an ETF model. The
proposed model used multiple regression and neural network models to determine the relationships between IPC codes. Our target technology was defined as the dependent variable in our model, and the other codes were treated as independent variables. From the results of the regression and neural networks, we constructed a technological path diagram for ETF. We concluded that the ETF of nanotechnology requires the development of the C01B, B32B, B29C, and G01N technologies directly and the development of B01J and B05D indirectly. The contribution of our research is an objective attempt to forecast emerging technology using quantitative methods (regression and neural networks) and objective data (IPC codes of patent documents). Our model can be applied to the ETF of diverse technologies. This research has some limitations, such as its dependence on expert interpretation of the constructed path diagram; in other words, we need nanotechnology experts to determine the effective usage of the nanotechnology path diagram. In future work, we will research more objective ETF approaches to address this problem.

References

[1] A. K. Firat, W. L. Woon and S. Madnick, Technological Forecasting - A Review, Working paper CISL# 2008-15, (2008).
[2] S. Jun, A Forecasting Model for Technological Trend Using Unsupervised Learning, Communications in Computer and Information Science, vol. 258, (2011), pp. 51-60.
[3] A. T. Roper, S. W. Cunningham, A. L. Porter, T. W. Mason, F. A. Rossini and J. Banks, Forecasting and Management of Technology, Wiley, (2011).
[4] S. Jun, S. Park and D. Jang, Technology Forecasting using Matrix Map and Patent Clustering, Industrial Management & Data Systems, vol. 112, Issue 5, (2012).
[5] Y. G. Kim, J. H. Suh and S. C. Park, Visualization of patent analysis for emerging technology, Expert Systems with Applications, vol. 34, (2008), pp. 1804-1812.
[6] T. U. Daim, G. Rueda, H. Martin and P.
Gerdsri, Forecasting emerging technologies: Use of bibliometrics and patent analysis, Technological Forecasting & Social Change, vol. 73, (2006), pp. 981-1012.
[7] M. Bengisu and R. Nekhili, Forecasting emerging technologies with the aid of science and technology databases, Technological Forecasting & Social Change, vol. 73, (2006), pp. 835-844.
[8] S. Park and S. Jun, New Technology Management Using Time Series Regression and Clustering, International Journal of Software Engineering and Its Applications, vol. 6, no. 2, (2012), pp. 155-160.
[9] M. Fattori, G. Pedrazzi and R. Turra, Text mining applied to patent mapping: a practical business case, World Patent Information, vol. 25, (2003), pp. 335-342.
[10] S. Lee, B. Yoon and Y. Park, An approach to discovering new technology opportunities: Keyword-based patent map approach, Technovation, vol. 29, (2009), pp. 481-497.
[11] V. W. Mitchell, Using Delphi to Forecast in New Technology Industries, Marketing Intelligence & Planning, vol. 10, Issue 2, (1992), pp. 4-9.
[12] Y. C. Yun, G. H. Jeong and S. H. Kim, A Delphi technology forecasting approach using a semi-Markov concept, Technological Forecasting and Social Change, vol. 40, (1991), pp. 273-287.
[13] Y. H. Tseng, C. J. Lin and Y. I. Lin, Text mining techniques for patent analysis, Information Processing & Management, vol. 43, Issue 5, (2007), pp. 1216-1247.
[14] S. Jun, IPC Code Analysis of Patent Documents Using Association Rules and Maps - Patent Analysis of Database Technology, Communications in Computer and Information Science, vol. 258, (2011), pp. 21-30.
[15] G. Casella and R. L. Berger, Statistical Inference, Duxbury, (2002).
[16] R. H. Myers, Classical and Modern Regression with Applications, Duxbury, (1990).
[17] P. Giudici, Applied Data Mining: Statistical Methods for Business and Industry, Wiley, (2003).
[18] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, (2001).
[19] B. L.
Bowerman, R. T. O'Connell and A. B. Koehler, Forecasting, Time Series, and Regression: An Applied Approach, Brooks/Cole, (2005).
[20] Korea Intellectual Property Rights Information Service (KIPRIS), www.kipris.or.kr.
Authors

Sunghae Jun

He received the BS, MS, and PhD degrees from the Department of Statistics, Inha University, Korea, in 1993, 1996, and 2001, respectively. He also received a PhD degree from the Department of Computer Science, Sogang University, in 2007. He was a visiting scholar in the Department of Statistics, Oklahoma State University, United States, from 2009 to 2010. He is currently an associate professor in the Department of Statistics, Cheongju University.

Seung-Joo Lee

He received the BS degree from the Department of Applied Statistics, Cheongju University, Korea, in 1985. He also received the MS and PhD degrees from the Department of Statistics, Dongkuk University, Korea, in 1987 and 1995, respectively. He is currently a professor in the Department of Statistics, Cheongju University. He has researched Bayesian and multivariate statistics.