Domain Classification of Technical Terms Using the Web
Systems and Computers in Japan, Vol. 38, No. 14, 2007
Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006.

Domain Classification of Technical Terms Using the Web

Mitsuhiro Kida,1 Masatsugu Tonoike,2 Takehito Utsuro,3 and Satoshi Sato4
1 Nintendo Co., Ltd., Kyoto, Japan
2 Nakai Research Center, Fuji Xerox Co., Ltd., Kanagawa, Japan
3 Department of Intelligent Interaction Technologies, Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan
4 Department of Electrical Engineering and Computer Science, Graduate School of Engineering, Nagoya University, Nagoya, Japan

SUMMARY

This paper proposes a method of domain classification of technical terms using the Web. In the proposed method, it is assumed that, for a given technical domain, a list of known technical terms of the domain is available. Technical documents of the domain are collected through a Web search engine and are used to generate a vector space model of the domain. The domain specificity of a target term is then estimated according to the distribution of the domains of the sample pages containing the target term. Experimental evaluation shows that the proposed method of domain classification of a technical term achieves roughly 90% precision/recall. We then apply this technique of estimating the domain specificity of a term to the task of discovering novel technical terms that are not included in any existing lexicon of technical terms of the domain. Out of 1000 randomly selected candidate technical terms per domain, we discovered about 100 to 200 novel technical terms. © 2007 Wiley Periodicals, Inc. Syst Comp Jpn, 38(14): 11-19, 2007; Published online in Wiley InterScience.

Key words: technical terms; Web; term extraction; domain classification.

1. Introduction

Lexicons of technical terms are among the most important language resources, both for human use and for computational research areas such as information retrieval and natural language processing. Among the various research issues concerning technical terms, full-/semi-automatic compilation of technical term lexicons is a central one. In many research fields, novel technologies are invented every year, and the research areas around them keep growing; along with such inventions, novel technical terms are created year by year. Under these circumstances, manually compiling lexicons of technical terms for hundreds of thousands of technical domains would require an enormous cost. It is therefore essential to develop techniques for full-/semi-automatic compilation of technical term lexicons for various technical domains.

The whole task of compiling a technical term lexicon can be roughly decomposed into two subprocesses: (1) collecting candidate technical terms of a technical domain, and (2) judging whether each candidate is actually a technical term of the target domain.
The technique for the first subprocess is closely related to research on automatic term recognition and has been relatively well studied (e.g., Ref. 7). The technique for the second subprocess, on the other hand, has not been studied as thoroughly. Exceptions are works such as Refs. 1 and 2, whose techniques are mainly based on the tendency of technical terms to appear in technical documents of limited domains rather than in documents of everyday use such as newspaper and magazine articles. Although the underlying idea of those works is very interesting, they are limited in that they require a certain amount of corpus for each technical domain, and it is not practical to manually collect such corpora for hundreds of thousands of technical domains. For the second subprocess, therefore, it is very important to devise a technique for automatically classifying the domain of a technical term.

Based on this observation, among the several key issues involved in the second subprocess, this paper mainly focuses on estimating the domain specificity of a term. Supposing that a target technical term and a technical domain are given, we propose a technique for automatically estimating the specificity of the target term with respect to the target domain. The domain specificity of the term is judged at one of the following three levels: (i) the term mostly appears in the target domain, (ii) the term appears in the target domain as well as in other domains, (iii) the term generally does not appear in the target domain.

The key idea of the proposed technique is as follows [8, 9]. We assume that sample technical terms of the target domain are available. Using such sample terms in search engine queries, we first collect a corpus of the target domain from the Web. In a similar way, we also collect sample pages that include the target term from the Web. Then, the similarity of document content is measured between the corpus of the target domain and each of the sample pages that include the target term. Finally, the domain specificity of the target term is estimated according to the distribution of the domains of those sample pages.

Figure 1 illustrates the idea of this technique.

Fig. 1. Degree of specificity of a term based on the domain of the documents (example terms: "impedance characteristic," "electromagnetism," and "response characteristic").

Among the three example (Japanese) terms, the first term ("impedance characteristic") mostly appears in documents of the electric engineering domain on the Web. For the second term ("electromagnetism"), about half of the sample pages collected from the Web can be regarded as belonging to the electric engineering domain, while the rest cannot. For the last term ("response characteristic"), only a few of the sample pages can be regarded as belonging to the electric engineering domain. In our technique, such differences in the distribution are easily identified, and the domain specificities of the three terms are estimated accordingly.

As experimental evaluation, we first evaluate the proposed technique of estimating the domain specificity of a term using manually constructed development and evaluation term sets, where we achieve roughly 90% precision/recall.
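The following is a minimal Python sketch of the idea illustrated in Fig. 1: the specificity level of a term is read off from the fraction of its sample pages that fall in the target domain. The function name, the thresholds, and the in-domain fractions for the three example terms are illustrative assumptions, not values taken from the paper.

```python
def classify_specificity(in_domain_fraction, a_plus=0.6, a_pm=0.2):
    """Map the fraction of a term's sample pages that belong to the target
    domain onto the three specificity levels (thresholds are illustrative)."""
    if in_domain_fraction >= a_plus:
        return "+"    # mostly appears in the target domain
    if in_domain_fraction >= a_pm:
        return "±"    # appears in the target domain as well as in others
    return "-"        # generally does not appear in the target domain

# Illustrative fractions for the three example terms of Fig. 1
# (electric engineering domain); the numbers are invented for illustration.
examples = {
    "impedance characteristic": 0.90,
    "electromagnetism": 0.50,
    "response characteristic": 0.05,
}
for term, fraction in examples.items():
    print(f"{term}: g = {classify_specificity(fraction)}")
```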
Furthermore, we present the results of applying this technique of estimating the domain specificity of a term to the task of discovering novel technical terms that are not included in any existing lexicon of technical terms of the domain. Candidate technical terms are first collected from the Web corpus of the target domain. Then, about 70 to 80% of those candidates are excluded by roughly judging the domain of their constituent words. Finally, out of 1000 randomly selected candidate technical terms per domain, we discovered about 100 to 200 novel technical terms that are not included in any existing lexicon of the domain, with about 75% precision and 80% recall.

2. Domain Specificity Estimation of Technical Terms Using Documents Collected from the Web

2.1. Outline

As introduced in the previous section, the underlying purpose of this paper is to devise a technique for automatically classifying the domain of a technical term. More specifically, this paper mainly focuses on estimating the domain specificity of a term t with respect to a domain C, supposing that the term t and the domain C are given.
Generally speaking, the coarsest-grained classification of the domain specificity of a term is binary: terms that are used in a certain technical domain versus terms that are not. In this paper, we further classify the degree g(t, C) of domain specificity into the following three levels:

$$ g(t, C) = \begin{cases} + & (t \text{ mostly appears in the documents of the domain } C) \\ \pm & (t \text{ generally appears in the documents of the domain } C \text{ as well as in those of other domains}) \\ - & (t \text{ generally does not appear in the documents of the domain } C) \end{cases} $$

(When domain specificity is simply classified into two classes with the coarsest-grained binary classification above, we regard terms with domain specificity + or ± as those that are used in the domain, and terms with domain specificity − as those that are not.)

The input to the process of domain specificity estimation is the target term t together with a set T_C of sample terms of the domain C, and its output is the estimated domain specificity g(t, C). The process is illustrated in Fig. 2 and can be decomposed into two subprocesses: (a) constructing the corpus D_C of the domain C, and (b) estimating the specificity of the term t with respect to the domain C.

Fig. 2. Domain specificity estimation of terms based on Web documents.

In the estimation subprocess, the domain of each document that includes the target term t is estimated, and the domain specificity of t is judged according to the distribution of the domains of those documents. The details of the two subprocesses are described below.

2.2. Constructing the corpus of the domain

When constructing the corpus D_C of the domain C from the set T_C of sample terms of the domain, we first collect, for each term t in T_C, the top 100 pages obtained from search engine queries that include the term t into a set D_t.* The queries are designed so that documents describing the technical term t are ranked high. When constructing a corpus of the Japanese language, the search engine goo is used; the specific queries are phrases with topic-marking postpositional particles, such as "t-toha," "t-toiu," and "t-wa," the adnominal phrase "t-no," and t itself. The union of the sets D_t over the terms t in T_C is then constructed and denoted D(T_C):

$$ D(T_C) = \bigcup_{t \in T_C} D_t $$

Finally, in order to exclude noise texts from D(T_C), the documents in D(T_C) are ranked according to the number of sample terms (from T_C) that each document contains. Through a preliminary experiment, we decided that it is sufficient to keep the top 500 documents, which we regard as the corpus D_C of the domain C.

* Related techniques for automatically constructing the corpus of a domain from sample terms of the domain include those presented in Refs. 3 and 6. We plan to evaluate the performance of those related techniques and compare them with the one employed in this paper.
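As a rough illustration, the corpus construction just described can be sketched in Python as follows. The `search(query, k)` callable standing in for the goo search engine, the way results from the individual query patterns are merged into the top 100 pages per term, and all identifiers are assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, List

# Query patterns corresponding to "t-toha", "t-toiu", "t-wa", "t-no", and t itself.
QUERY_TEMPLATES = ["{t}とは", "{t}という", "{t}は", "{t}の", "{t}"]

def collect_pages_for_term(t: str, search: Callable[[str, int], List[str]],
                           k: int = 100) -> List[str]:
    """Collect up to the top-k pages for term t (the set D_t)."""
    pages: List[str] = []
    for template in QUERY_TEMPLATES:
        pages.extend(search(template.format(t=t), k))
    # How the per-query result lists are merged into the top 100 pages is a
    # simplification here.
    return pages[:k]

def build_domain_corpus(sample_terms: List[str],
                        search: Callable[[str, int], List[str]],
                        keep: int = 500) -> List[str]:
    """Build the domain corpus D_C from the sample term set T_C."""
    # D(T_C): union of D_t over all sample terms t in T_C.
    union: List[str] = []
    for t in sample_terms:
        for page in collect_pages_for_term(t, search):
            if page not in union:
                union.append(page)
    # Rank pages by how many sample terms they contain and keep the top 500,
    # which filters out noise pages.
    union.sort(key=lambda page: sum(term in page for term in sample_terms),
               reverse=True)
    return union[:keep]
```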
2.3. Domain specificity estimation of technical terms

Given the corpus D_C of the domain C, the domain specificity of a term t with respect to the domain C is estimated through the following three steps:

Step 1: Collect documents that include the term t from the Web, and construct the set D_t of those documents.

Step 2: For each document in the set D_t, estimate its domain by measuring its similarity against the corpus D_C of the domain C. Then, given a certain lower bound L of document similarity, extract from D_t the documents with sufficiently large similarity values into a set D_t(C, L).
Step 3: Estimate the domain specificity g(t, C) of t using the document set D_t(C, L) constructed in Step 2.

Details of the three steps are given below.

Collecting Web documents including the target term

For each target term t, documents that include t are collected from the Web. Following a procedure similar to that for constructing the corpus of the domain C described in Section 2.2, the top 100 pages obtained with search engine queries are collected into a set D_t.

Domain estimation of documents

For each document in the set D_t, its domain is estimated by measuring its similarity against the corpus D_C of the domain C. Then, given a certain lower bound L of document similarity, documents with sufficiently large similarity values are extracted from D_t into the set D_t(C, L).

For document similarity calculation, we simply employ a standard vector space model,* in which a document vector is constructed, after removing 153 stop words, as a vector of the frequencies of content words such as nouns and verbs in the document. The corpus D_C of the domain C is regarded as a single document d_C, and a document vector dv(d_C) is constructed. For each document d_t in the set D_t, a document vector dv(d_t) is likewise constructed. The cosine similarity between dv(d_t) and dv(d_C) is then calculated and defined as the similarity sim(d_t, D_C) between the document d_t and the corpus D_C of the domain C:

$$ \mathrm{sim}(d_t, D_C) = \cos\bigl(dv(d_t), dv(d_C)\bigr) = \frac{dv(d_t) \cdot dv(d_C)}{\lvert dv(d_t) \rvert \, \lvert dv(d_C) \rvert} $$

Finally, given a certain lower bound L of document similarity, documents d_t with similarity value sim(d_t, D_C) greater than or equal to L are regarded as belonging to the domain C and are collected into the set D_t(C, L):

$$ D_t(C, L) = \{\, d_t \in D_t \mid \mathrm{sim}(d_t, D_C) \ge L \,\} $$

In the experimental evaluation of Section 2.4, the lower bound L is determined using a development term set for tuning the various parameters of the whole process of estimating the domain specificity of technical terms.

* An alternative here is to apply supervised document classification techniques based on machine learning (e.g., Ref. 4). In particular, since it is not easy to collect negative data in the task of domain specificity estimation of a term, it seems very interesting to apply recently developed techniques that do not require labeled negative data (e.g., Ref. 5).

Domain specificity estimation of a term

The domain specificity of the term t with respect to the domain C is estimated using the document sets D_t and D_t(C, L). This is done by simply calculating the following ratio r_L of the numbers of documents in the two sets:

$$ r_L = \frac{|D_t(C, L)|}{|D_t|} $$

Then, by introducing two thresholds a(±) and a(+) for the ratio r_L, the specificity g(t, C) of t is estimated at one of the three levels:

$$ g(t, C) = \begin{cases} + & (r_L \ge a(+)) \\ \pm & (a(\pm) \le r_L < a(+)) \\ - & (r_L < a(\pm)) \end{cases} $$

In the experimental evaluation of Section 2.4, as with the lower bound L of document similarity, the two thresholds a(±) and a(+) are also determined using the development term set mentioned above.
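Steps 1 to 3 can be sketched as follows, as a minimal Python sketch of the vector space model and the ratio r_L described above. The whitespace tokenizer and the five-word stop list are crude stand-ins for the paper's content-word extraction and 153 stop words, which for Japanese would require morphological analysis.

```python
import math
import re
from collections import Counter
from typing import List

STOP_WORDS = {"the", "a", "of", "and", "to"}   # stand-in for the 153 stop words

def doc_vector(text: str) -> Counter:
    """Frequency vector of the (content) words of a document."""
    words = re.findall(r"\w+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine(v1: Counter, v2: Counter) -> float:
    """Cosine similarity between two frequency vectors."""
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm = math.sqrt(sum(c * c for c in v1.values())) \
        * math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

def specificity(term_pages: List[str], domain_corpus: List[str],
                L: float, a_pm: float, a_plus: float) -> str:
    """Estimate g(t, C) from the page set D_t of the target term."""
    d_C = doc_vector(" ".join(domain_corpus))          # corpus as one document d_C
    in_domain = [p for p in term_pages
                 if cosine(doc_vector(p), d_C) >= L]   # D_t(C, L)
    r_L = len(in_domain) / len(term_pages)             # ratio r_L
    if r_L >= a_plus:
        return "+"
    if r_L >= a_pm:
        return "±"
    return "-"
```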
2.4. Experimental evaluation

We evaluate the proposed method on five sample domains: electric engineering, optics, aerospace engineering, nucleonics, and astronomy. For each of the five domains C, the set T_C of sample (Japanese) terms is constructed by randomly selecting 100 terms* from an existing (Japanese) lexicon of technical terms intended for human use. For each of the five domains, we then manually constructed a development term set T_dev and an evaluation term set T_eva, each containing 100 terms (terms with frequency of at least five and between 100 and 10,000 search engine hits).

For each of the domain specificity levels +, ±, and −, Table 1 lists the number of terms of that class. In our experimental evaluation, the development term set T_dev is used for tuning the lower bound L of document similarity as well as the two thresholds a(±) and a(+); those parameter values are determined so as to maximize the F score (α = 0.75), defined as

$$ F = \frac{1}{\alpha \dfrac{1}{\text{Precision}} + (1 - \alpha) \dfrac{1}{\text{Recall}}} $$

We chose the weight α = 0.75 because we prefer precision to recall in applications such as automatic technical term collection, where automatically collected technical terms are recursively used as seed terms in later steps of a bootstrapping process.

* Through a preliminary experiment, we found that it is not necessary to start with a set T_C of more than 100 sample terms; the minimum required size of T_C varies across domains.
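One possible realization of this parameter tuning is a simple grid search over L, a(±), and a(+) on the development set, as sketched below. The grid, the `predict` callable, and the treatment of "+" and "±" as the positive class are assumptions; the paper does not specify how the search over parameter values is carried out.

```python
from itertools import product
from typing import Callable, Dict

def f_score(precision: float, recall: float, alpha: float = 0.75) -> float:
    """F = 1 / (alpha / Precision + (1 - alpha) / Recall)."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

def tune(dev_set: Dict[str, bool],
         predict: Callable[[str, float, float, float], str]):
    """Grid-search L, a(±), and a(+) to maximize F (alpha = 0.75) on T_dev."""
    grid = [i / 10 for i in range(1, 10)]
    best_f, best_params = 0.0, None
    for L, a_pm, a_plus in product(grid, grid, grid):
        if a_pm > a_plus:            # a(±) should not exceed a(+)
            continue
        # A term counts as "used in the domain" if it is judged + or ±.
        predicted = {t: predict(t, L, a_pm, a_plus) in ("+", "±")
                     for t in dev_set}
        tp = sum(1 for t, gold in dev_set.items() if gold and predicted[t])
        fp = sum(1 for t, gold in dev_set.items() if not gold and predicted[t])
        fn = sum(1 for t, gold in dev_set.items() if gold and not predicted[t])
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = f_score(precision, recall)
        if f > best_f:
            best_f, best_params = f, (L, a_pm, a_plus)
    return best_params, best_f
```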
Table 1. Number of terms for experimental evaluation

Table 2. Precision/recall of domain specificity estimation
Parameter values and experimental evaluation results are summarized in Table 2. Generally speaking, the task of discriminating the terms with domain specificity + from the rest is much harder than that of discriminating those with + or ± from those with −. In Table 2(b) in particular, the results for the domains "aerospace engineering" and "nucleonics" have lower precision than the other domains. Each of these two domains has another closely related domain (e.g., military for "aerospace engineering" and radiation medicine for "nucleonics"), and technical terms of these two domains tend to appear also in documents of those closely related domains. Most of the errors are caused by the existence of such close domains.

3. Collecting Novel Technical Terms of a Domain from the Web

3.1. Procedure

This section describes how the technique of domain specificity estimation of technical terms is applied to the task of discovering novel technical terms that are not included in any existing lexicon of technical terms of the domain. First, as shown in Fig. 3, candidate technical terms are collected from the corpus D_C of the domain C. In the case of the Japanese language, as candidates of novel technical terms we collect compound nouns consisting of more than one noun with frequency counts of five or more, restricting ourselves to compound nouns that are not included in any existing lexicon of technical terms of the domain. Then, after excluding terms that do not share any constituent noun with the sample terms of the given set T_C, the domain specificity of the remaining terms is automatically estimated. Finally, we regard terms with domain specificity + or ± as those that are used in the domain, and collect them into the set T_web,C.

Fig. 3. Collecting novel technical terms of a domain from the Web.

3.2. Experimental evaluation

Table 3 compares the numbers of candidate novel technical terms collected from the Web with the numbers remaining after excluding terms that do not share any constituent noun with the sample terms of the given set T_C. As shown in the table, about 70 to 80% of the candidates are excluded, while the proportion of technical terms among the remaining candidates increases. This result clearly shows the effectiveness of the constituent-noun filtering technique in reducing the computational cost of discovering a fixed number of novel technical terms. Then, for each domain, we randomly select 1000 of the remaining candidates and estimate their domain specificity with the proposed method. After manually judging the domain specificity of those 1000 terms, we measure the precision/recall of the proposed method as in Table 4, achieving about 75% precision and 80% recall. Since we simply collect compound nouns as candidate technical terms, the term unit is sometimes incorrect: a candidate may carry an extraneous prefix or suffix. Considering this, Table 4 also gives the term-unit correctness rate for the terms with domain specificity + or ±. Taking this rate into account, we can conclude that, out of the 1000 candidates, we discovered about 100 to 200 novel technical terms that are not included in any existing lexicon of the domain. This result clearly supports the effectiveness of the proposed technique for the purpose of full-/semi-automatic compilation of technical term lexicons.
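The candidate collection and constituent-noun filtering of Section 3.1 can be sketched as follows; `extract_compound_nouns` and `split_nouns` are hypothetical stand-ins for the Japanese morphological analysis the paper relies on.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set

def collect_candidates(domain_corpus: Iterable[str],
                       extract_compound_nouns: Callable[[str], List[str]],
                       lexicon: Set[str],
                       min_freq: int = 5) -> List[str]:
    """Compound nouns with frequency >= 5 that are in no existing lexicon."""
    counts: Counter = Counter()
    for page in domain_corpus:
        counts.update(extract_compound_nouns(page))
    return [t for t, c in counts.items() if c >= min_freq and t not in lexicon]

def constituent_filter(candidates: Iterable[str],
                       sample_terms: Iterable[str],
                       split_nouns: Callable[[str], List[str]]) -> List[str]:
    """Keep only candidates sharing a constituent noun with some sample term."""
    sample_nouns = {n for t in sample_terms for n in split_nouns(t)}
    return [c for c in candidates
            if any(n in sample_nouns for n in split_nouns(c))]

# The surviving candidates are then passed to the specificity estimator of
# Section 2; those judged "+" or "±" are collected into T_web,C.
```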
4. Concluding Remarks

This paper proposed a method of domain specificity estimation of technical terms using the Web. In the proposed method, it is assumed that, for a given technical domain, a list of known technical terms of the domain is available. Technical documents of the domain are collected through a Web search engine and are then used to generate a vector space model of the domain. The domain specificity of a target term is estimated according to the distribution of the domains of the sample pages containing the target term.
Table 3. Changes in number of technical term candidates with constituent filter

Table 4. Precision/recall of collecting novel technical terms
Experimental evaluation results showed that the proposed method achieves roughly 90% precision/recall. We then applied this technique of estimating the domain specificity of a term to the task of discovering novel technical terms that are not included in any existing lexicon of technical terms of the domain. Out of 1000 randomly selected candidate technical terms per domain, we discovered about 100 to 200 novel technical terms.

Related techniques for domain specificity estimation of a term include one based on a straightforward comparison of language models built from the domain corpus and from the documents of the target term. Although such a technique is mathematically well defined, it is limited in that it assumes a single language model per target term. In particular, when a target term appears in documents of more than one domain, the technique proposed in this paper is advantageous, since it independently estimates the domain of each individual document that includes the target term.

REFERENCES

1. Chung TM. A corpus comparison approach for terminology extraction. Terminology 2004;9.
2. Drouin P. Term extraction using non-technical corpora as a point of leverage. Terminology 2003;9.
3. Huang C-C, Lin K-M, Chien L-F. Automatic training corpora acquisition through Web mining. Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence.
4. Joachims T. Learning to classify text using support vector machines: Methods, theory, and algorithms. Springer-Verlag.
5. Liu B, Dai Y, Li X, Lee WS, Yu PS. Building text classifiers using positive and unlabeled examples. Proceedings of the 3rd IEEE International Conference on Data Mining.
6. Liu B, Li X, Lee WS, Yu PS. Text classification by labeling words. Proceedings of the 19th AAAI.
7. Nakagawa H, Mori T. Automatic term recognition based on statistics of compound nouns and their components. Terminology 2003;9.
8. Utsuro T, Kida M, Tonoike M, Sato S. Towards automatic domain classification of technical terms: Estimating domain specificity of a term using the Web. In: Ng HT, Leong M-K, Kan M-Y, Ji DH (editors). Information retrieval technology: Third Asia Information Retrieval Symposium, AIRS 2006. Lecture Notes in Computer Science Vol. 4182. Springer.
9. Utsuro T, Kida M, Tonoike M, Sato S. In: Matsumoto Y, Sproat R, Wong K-F, Zhang M (editors). Computer processing of Oriental languages: Beyond the Orient: The research challenges ahead. Lecture Notes in Artificial Intelligence Vol. 4285. Springer.

AUTHORS

Mitsuhiro Kida received his B.E. degree in electrical engineering (2004) and M.Inf. degree from Kyoto University. Since 2006, he has been engaged in the development of information technologies at Nintendo Co., Ltd.

Masatsugu Tonoike received his B.E. degree in information engineering (2001) and M.Inf. (2003) and D.Inf. degrees from Kyoto University. Since then, he has been engaged in research and development of information technologies at Fuji Xerox Co., Ltd.
AUTHORS (continued)

Takehito Utsuro (member) received his B.E. (1989), M.E. (1991), and D.Eng. degrees in electrical engineering from Kyoto University. He was an instructor at the Graduate School of Information Science, Nara Institute of Science and Technology, from 1994 to 2000, a lecturer in the Department of Information and Computer Sciences, Toyohashi University of Technology, from 2000 to 2002, and a lecturer in the Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, beginning in 2003. He is now an associate professor in the Department of Intelligent Interaction Technologies, Graduate School of Systems and Information Engineering, University of Tsukuba. From 1999 to 2000, he was a visiting scholar in the Department of Computer Science of Johns Hopkins University. His professional interests lie in natural language processing, information retrieval, text mining from the Web, machine learning, spoken language processing, and artificial intelligence.

Satoshi Sato is a professor in the Department of Electrical Engineering and Computer Science at Nagoya University. His recent work in natural language processing focuses on automatic paraphrasing, controlled language, and automatic terminology construction. He received his B.E. (1983), M.E. (1985), and D.Eng. degrees from Kyoto University.
