Mining Commonalities and Variabilities from Natural Language Documents


Alessio Ferrari, ISTI-CNR, Pisa, Italy
Giorgio O. Spagnolo, ISTI-CNR, Pisa, Italy
Felice Dell'Orletta, ILC-CNR, Pisa, Italy

ABSTRACT
A company that wishes to enter an established market with a new, competitive product must analyse the product solutions of its competitors. Identifying and comparing the features provided by the other vendors can greatly help during the market analysis. However, mining common and variant features from the publicly available documents of the competitors is a time-consuming and error-prone task. In this paper, we suggest employing a natural language processing approach based on contrastive analysis to identify commonalities and variabilities from the brochures of a group of vendors. We present a first step towards a practical application of the approach, in the context of the market of Communications-Based Train Control (CBTC) systems.

Categories and Subject Descriptors
D.2.1 [Software Engineering]: Requirements Specification analysis, methodologies, specification

General Terms
Design, Algorithms

Keywords
Software Product Lines, Variability Mining

INTRODUCTION
A business that decides to enter an established technological market must accurately analyse the products of its competitors. In the case of cheap mass products (e.g., mobiles, laptops), the new company can actually purchase the products and evaluate their features in order to compare them.
In the case of expensive, large-scale, and often customized products (e.g., security systems, intelligent transport systems), the company has to rely on the existing public documentation about the products, since the cost required to purchase the actual products would be prohibitive.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. SPLC 2013, August Tokyo, Japan. Copyright 2013 ACM /13/08...$

In this paper, we consider the case of Communications-Based Train Control (CBTC) systems. CBTC is the latest technological frontier for signalling and train control in the metro market [9]. CBTC systems offer flexible degrees of automation, from enforcing control over dangerous operations performed by the driver, to the complete replacement of the driver role with an automatic pilot and an automatic on-board monitoring system. A previous paper [7] reported an experience in which a set of publicly available documents (brochures) was used to derive a global CBTC model, from which specific product requirements for novel CBTC systems were derived. The model was represented in the form of a feature diagram [11], following the principles of product line engineering. The bottleneck found in that experience was the large amount of human inspection required to identify the common components, as well as the architectural differences, between the solutions proposed by the different vendors.
The identification of these commonalities and variabilities enabled the definition of mandatory and variant features in the global feature diagram.

In order to reduce the time required to extract commonalities and variabilities from the brochures of the different vendors, in this paper we suggest adopting an automated Natural Language Processing (NLP) approach named contrastive analysis to identify domain-specific terms (single and multi-word) from textual documents [3]. The proposed method takes the brochures of the different vendors as input, and identifies the linguistic expressions in the documents that can be considered as terms. In this context, a term is defined as a conceptually independent expression. Then, the method automatically identifies which terms are actually domain-specific for the metro control system domain. The domain-specific terms that are common to all the brochures are considered as commonality candidates. On the other hand, the domain-specific terms that appear only in a subset of the brochures are considered as variability candidates. A pilot test is presented to illustrate the feasibility and effectiveness of the approach. The pilot test is focused on five vendors and 19 documents pages in total.

The paper is structured as follows. In Sect. 1, related works are briefly discussed. In Sect. 2, our approach for commonality and variability mining is presented. Sect. 3 describes the results of the pilot test. Sect. 4 draws conclusions and final remarks.

1. RELATED WORKS
Mining commonalities and variabilities from natural language documents is an open issue in product line engineering, with several solutions proposed in the literature. In general, the approaches are based on two steps: feature mining and feature model synthesis. Since in this paper we focus on feature mining, we compare the works according to the methodology applied to identify features.

Most of the works focus on the extraction of features from natural language requirements and legacy documentation [8, 4, 2, 12, 13, 15]. The DARE tool [8] is the earliest contribution in this sense. It employs a semi-automated approach that identifies features through lexical analysis based on term frequency (i.e., frequently used terms are considered more relevant for the domain). Chen et al. [4] suggest the use of clustering to identify features: requirements are grouped together according to their similarity, and each group of requirements represents a feature. Clustering is also employed in the subsequent works [2, 12, 13, 15], but while in [4] the computation of the similarity among requirements is manual, the other works employ automated approaches. In particular, [2] uses IR-based methods, namely the Vector Similarity Metric (VSM) and Latent Semantic Analysis (LSA). With VSM, requirements are represented as vectors of terms, and compared by computing the cosine between the vectors. With LSA, requirements are similar if they contain semantically similar terms. Two terms are considered semantically similar if they normally occur together in the requirements document. LSA is also employed by Weston et al. [15], aided by syntactic and semantic analysis, to extract the so-called Early Aspects. These are cross-cutting concerns that are useful to derive features. Finally, Niu et al.
[12, 13] use Lexical Affinities (LA), roughly term co-occurrences, as the basis to find representative expressions (named Functional Requirements Profiles) in functional requirements.

All the previously cited works use requirements as the main source for feature mining. Other works [10, 6, 1] present approaches where public product descriptions are employed, as in our case. While in [10] the feature extraction process is manual, the other papers suggest automated approaches. The feature mining methodology presented in [6] is based on clustering, and the authors also provide automated approaches for recommending useful features for new products. Instead, the approach presented in [1] is based on searching for variability patterns within tables where the descriptions of the products are stored in a semi-structured manner. That approach also includes a relevant part of feature model synthesis.

Regardless of the technology, the main difference between [6], [1] and our work is that the former two rely on feature descriptions that are rather structured. Indeed, in [6] the features of a product are expressed with short sentences in bullet-list form, while in [1] features are stored in a tabular format. In our case, instead, we deal with brochures containing less structured text, where the features have to be discovered within the sentences.

The novelties of the current work w.r.t. the other papers are: 1) the usage of free-text informative brochures as the input documents for the commonality/variability mining process; 2) the usage of contrastive analysis for the extraction of domain-specific terms.

2. THE NLP APPROACH
The proposed method is based on a novel natural language processing approach, named contrastive analysis [3], for the extraction of domain-specific terms from natural language documents. In this context, a term is a conceptually independent linguistic unit, which can be composed of a single word or of multiple words.
For example, "Automatic Train Protection" is a term, while "Protection" is not a term, since in the textual documents considered in our study it often appears coupled with the same words (i.e., train, mission), and therefore it cannot be considered as conceptually independent.

The contrastive analysis technology aims at detecting those terms in a document that are specific for the domain of the document under consideration [3, 5]. Roughly, contrastive analysis considers the terms extracted from domain-generic documents (e.g., newspapers), and the terms extracted from the domain-specific document to be analysed. If a term in the domain-specific document also occurs frequently in the domain-generic documents, the term is considered as domain-generic. On the other hand, if the term is not frequent in the domain-generic documents, it is considered as domain-specific.

In our work, the documents from which we want to extract domain-specific terms are the brochures of the different vendors. A brochure is a promotional document that describes the product to possible customers. Here, the reasonable assumption is that both commonalities and variabilities can be found among the domain-specific terms of the brochures.

The proposed method is summarized in Fig. 1. First, conceptually independent expressions (i.e., terms) are identified ( of Terms). Then, Contrastive Analysis is applied to select the terms that are domain-specific. From these terms, commonality and variability candidates are extracted (Commonality/Variability ).

[Figure 1: Overview of the approach. Brochures; of Terms; Contrastive Analysis (Automatic Extraction of Domain-specific Terms); Domain-specific Terms; Commonality; Variability]

2.1 of Terms
Each vendor might have more than one brochure. We collect the brochures of the same vendor i in a single document D_i. Therefore, given n vendors, we have n documents D_1 ... D_n. From each one of these documents we identify a ranked list of terms.
To this end, we perform the following steps.

POS Tagging: first, Part of Speech (POS) Tagging is performed with an English version of the tool described in [5]. With POS Tagging, each word is associated with its grammatical category (noun, verb, adjective, etc.).

Linguistic Filters: after POS tagging, we select all those words or groups of words (referred to in the following as multi-words) that follow a set of specific POS patterns (i.e., sequences of POS) that we consider relevant in our context. For example, we are not interested in multi-words that end with a preposition, while we are interested in multi-words with a format like <adjective, noun, noun> (such as "Automatic Train Protection").

C-NC Value: terms are finally identified and ranked by computing a termhood metric, called C-NC value [3]. This metric establishes how likely a word or a multi-word is to be conceptually independent from the context in which it appears. The computation of the metric is rather complex, and its explanation is beyond the scope of this paper. The interested reader can refer to [3] for further details. Here we give an idea of the spirit of the metric. Roughly, a word/multi-word is conceptually dependent if it often occurs with the same words (i.e., it is nested). Instead, a word/multi-word is conceptually independent if it occurs in different contexts (i.e., it is normally accompanied by different words). Hence, a higher C-NC rank is assigned to those words/multi-words that are conceptually independent, while lower values are assigned to words/multi-words that require additional words to be meaningful in the context in which they are uttered.

After this analysis, for each D_i, we have a ranked list of words/multi-words that can be considered terms, together with their ranking according to the C-NC metric, and their frequency (i.e., number of occurrences) in D_i. The more likely a word/multi-word is to be a term, the higher its ranking. From the list we select the k terms that received the highest ranking. The value of k shall be selected empirically. A higher value of k allows more domain-specific terms to be included in the list, but it might also introduce noisy items, since words/multi-words with a low rank might be included as well.
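The linguistic filtering step can be illustrated with a minimal Python sketch. It is not the tool used by the authors: the coarse tag names (`NOUN`, `ADJ`, `PREP`), the one-letter encoding, the `max_len` window and the function name `candidate_multiwords` are all assumptions made for the example. The pattern accepts any mix of nouns, prepositions and adjectives closed by at least one noun, so that no candidate ends with a preposition.

```python
import re

# Hypothetical coarse tag set: each POS tag maps to one character so that a
# tag sequence can be matched with an ordinary regular expression.
TAG_CODES = {"NOUN": "N", "ADJ": "A", "PREP": "P"}

# Assumed filter: any mix of nouns, prepositions and adjectives, closed by
# one or more nouns, so a candidate never ends with a preposition.
PATTERN = re.compile(r"[NPA]*N+")


def candidate_multiwords(tagged, max_len=4):
    """Return every window of up to max_len tokens whose tags match PATTERN."""
    # Tags outside TAG_CODES become "x", which the pattern can never match.
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    found = []
    for start in range(len(tagged)):
        last = min(start + max_len, len(tagged))
        for end in range(start + 1, last + 1):
            if PATTERN.fullmatch(codes[start:end]):
                found.append(" ".join(word for word, _ in tagged[start:end]))
    return found


# Hand-tagged toy input; a real run would take the output of a POS tagger.
tagged_sentence = [
    ("the", "DET"), ("Automatic", "ADJ"), ("Train", "NOUN"),
    ("Protection", "NOUN"), ("of", "PREP"), ("trains", "NOUN"),
]
candidates = candidate_multiwords(tagged_sentence)
```

The sketch only produces candidates. Deciding which of them are conceptually independent terms is the job of the C-NC value, which is not reproduced here.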
2.2 Contrastive Analysis
The previous step leads to a ranked list of k terms, where each term might be domain-generic or domain-specific. With the contrastive analysis step, terms are re-ranked according to their domain-specificity. To this end, the proposed approach takes as input: 1) the ranked list of terms extracted from the document D_i; 2) a second list of terms extracted, with the same method described in Sect. 2.1, from a set of documents that we will name the contrastive corpora. The contrastive corpora is a set of documents containing domain-generic terminology. In particular, we have considered the Penn Treebank corpus, which collects articles from the Wall Street Journal. The reasonable assumption here is that a term that frequently occurs in the Wall Street Journal is not likely to be a domain-specific term of the metro domain.

The new rank R_i(t) for a term t extracted from a document D_i is computed according to the function:

R_i(t) = \arctan\left( \log(f_i(t)) \cdot \frac{f_i(t)}{F_c(t) / N_c} \right)

where f_i(t) is the frequency of the term t extracted from D_i, F_c(t) is the sum of the frequencies of t in the contrastive corpora, and N_c is the sum of the frequencies of all the terms extracted from D_i in the contrastive corpora. Roughly, if a term is less frequent in the contrastive corpora, it is considered as a domain-specific term, and it is ranked higher. If two terms are equally frequent in the contrastive corpora, but one of them is more frequent in D_i, it is considered as a term that characterizes the domain more than the other, and, again, it is ranked higher.

After this analysis, for each D_i, we have a list of terms, together with their ranking according to the function R, and their frequency in D_i. The more likely a term is to be domain-specific, the higher its ranking. From each list, we select the l terms that received the highest ranking.
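As a rough illustration of the re-ranking, the Python sketch below computes a rank of the form arctan(log(f_i(t)) · f_i(t) / (F_c(t) / N_c)) for a handful of terms. Several points are assumptions of the sketch rather than statements of the paper: the natural logarithm is used, a term that never occurs in the contrastive corpus is given a frequency floor of 1 to avoid a division by zero, and the frequencies are invented for the example.

```python
import math


def rank_terms(doc_freq, contrast_freq, floor=1.0):
    """Re-rank the terms of one vendor document by domain-specificity.

    doc_freq:      term -> frequency in the vendor document (f_i)
    contrast_freq: term -> frequency in the contrastive corpus (F_c)
    floor:         assumed smoothing value for terms missing from the corpus
    """
    smoothed = {t: max(contrast_freq.get(t, 0), floor) for t in doc_freq}
    # N_c: total contrastive frequency of the terms extracted from the document.
    n_c = sum(smoothed.values())
    scores = {
        t: math.atan(math.log(f) * f / (smoothed[t] / n_c))
        for t, f in doc_freq.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented frequencies, for illustration only.
doc_freq = {"automatic train control": 7, "system": 40, "interlocking": 12}
contrast_freq = {"system": 900}

ranked = rank_terms(doc_freq, contrast_freq)
top_l = [term for term, _ in ranked[:2]]  # keep the l = 2 best-ranked terms
```

With these numbers, "system" is frequent in the vendor document but also frequent in the contrastive corpus, so it ends up last, while the two terms absent from the corpus are ranked by their frequency in the document.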
The choice of l shall be performed empirically: higher values of l tend to include terms that are not domain-specific, while lower values tend to exclude terms that might be relevant in the subsequent phases.

2.3 Commonality
The commonality candidates are the domain-specific terms that are common to all the documents. Indeed, if a term is domain-specific and appears in all the documents of the different vendors, it is likely to be a common feature of all the products. More formally, if C_1 ... C_n are the sets of domain-specific terms for D_1 ... D_n respectively, then the set of commonality candidates is defined as:

C = C_1 \cap C_2 \cap \dots \cap C_n

A ranking is also provided for the set of commonality candidates. The ranking value of each term is its average rank.

2.4 Variability
The variability candidates are those terms that are domain-specific, and therefore appear in some of the C_i sets, but are not part of the commonalities. We assume that, if a domain-specific term appears in some of the documents of the different vendors, but not in all of them, it is likely to be a variant feature, characterizing only a subset of the products. More formally, we define the variability candidates as:

V = (C_1 \cup C_2 \cup \dots \cup C_n) \setminus C

Also in this case, the ranking value of each term is its average rank.

The sets C and V contain domain-specific terms of the documents. To establish whether they actually include commonalities or variabilities, a human operator shall assess the relevance of each candidate.

3. PILOT TEST
We have performed a pilot test to evaluate the effectiveness of the approach. To this end, we have selected the brochures and other publicly available documents of five CBTC vendors, namely Alstom, Bombardier, Invensys, Thales, and Siemens. The characteristics of the dataset are summarized in Table 1.
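The set definitions of Sect. 2.3 and 2.4 translate directly into set operations. The Python sketch below is one possible reading, not the authors' implementation: the function name `mine_candidates` and the rank values are hypothetical, and the average rank of a variability candidate is taken only over the vendors whose list contains the term, which is an assumption.

```python
def mine_candidates(vendor_ranks):
    """Split domain-specific terms into commonality and variability candidates.

    vendor_ranks: one dict per vendor, mapping each domain-specific term
                  (the set C_i) to its rank for that vendor.
    """
    if not vendor_ranks:
        raise ValueError("at least one vendor is required")
    term_sets = [set(ranks) for ranks in vendor_ranks]
    common = set.intersection(*term_sets)      # C: terms shared by every vendor
    variable = set.union(*term_sets) - common  # V: terms found only in a subset

    def average_rank(term):
        # Assumption: average only over the vendors whose list has the term.
        found = [ranks[term] for ranks in vendor_ranks if term in ranks]
        return sum(found) / len(found)

    return (
        {term: average_rank(term) for term in common},
        {term: average_rank(term) for term in variable},
    )


# Hypothetical ranks, for illustration only.
vendors = [
    {"ATP": 1.0, "interlocking": 0.5, "Smartlock": 0.75},
    {"ATP": 0.5, "interlocking": 1.0, "Airlink": 0.25},
    {"ATP": 0.75, "NetTrac": 0.5},
]
commonalities, variabilities = mine_candidates(vendors)
```

In this toy input only "ATP" is present in every list, so it is the single commonality candidate. Every other term falls into the variability set and still needs the manual assessment described above.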
Table 1: Dataset
Vendor | # Docs | # Pages | # Words
Alstom ,031
Bombardier ,317
Invensys ,341
Thales ,478
Siemens ,631
Total ,798

3.1 of Domain-specific Terms
Following the approach described in Sect. 2.1, we have first identified the lists of those words/multi-words that can be considered terms. One list has been produced for each vendor. To this end, we have performed POS Tagging and we have selected the Linguistic Filters that we were interested in. In this case, we were interested in finding features that represent components of the products, rather than features related to functionalities. This choice was driven by our previous experience in manually building a global feature model for a CBTC system, to compare the different CBTC products. In our experience, the CBTC brochures tend to describe their systems in terms of the architectural components provided [7]. If we were interested in functional features, the approach would have to be applied to documents including such features (e.g., requirements).

Components are normally represented by nouns/acronyms, possibly coupled with prepositions and adjectives. Therefore, the preferred POS patterns are those that match the following regular expression, chosen as linguistic filter:

(Noun | Preposition | Adjective)* Noun+

If we were interested in features associated with functionalities, POS patterns including verbs would have been adopted.

Among the automatically extracted terms, for each D_i we have selected the k = 600 items that received the highest ranking according to the C-NC Value. The value of k has been chosen empirically: we have seen that the majority of the domain-specific terms to be re-ranked in the contrastive analysis phase were actually included in the first 600 terms. Higher values of k introduced noisy items, while lower values excluded relevant domain-specific items.

Then, Contrastive Analysis has been performed, and the terms have been re-ranked according to their domain specificity, computed through the R function. From each list, we have selected the first l = 100, l = 200 and l = 300 terms.

3.2 Commonality
Following the approach described in Sect. 2.3, we have identified the list of commonality candidates for each value of l. Higher values of l lead to a higher number of candidates.
In Table 2, we report the list of candidates together with their average rank (normalised between 0 and 100). The first group includes the candidates extracted when l = 100, while the two following groups include the additional candidates extracted when l = 200 and l = 300, respectively. The commonalities that have been manually assessed as relevant features are highlighted in bold.

We observe that the typical CBTC components appear in the list. The ATS (Automatic Train Supervision) is the centralized system that monitors and dispatches the trains. The interlocking manages the switches and allows or denies the routing of the trains. The Automatic Train Control component controls the movement of the trains. It is composed of the ATO (Automatic Train Operation) and the ATP (Automatic Train Protection) systems. The ATO is the virtual train driver. The ATP controls the train speed and brakes the train in case the allowed speed is exceeded.

The majority of the other terms are typical terms of the railway domain, not strictly related to CBTC. However, we argue that a domain expert can easily recognize that these terms cannot be regarded as features, and therefore discard them from the list. Furthermore, when l = 300 no additional interesting commonality is included in the list. Therefore, in this context, using a threshold l = 200 might be sufficient.

Table 2: Commonality candidates (columns: Candidate, Avg. Rank, Avg. Freq.)
CBTC train control automatic train train ATP mass transit ATS interlocking ATO Automatic Train Control functionality train operation transit rail interface solution platform track speed

It is interesting to notice that terms that are not very frequent in the documents, such as "Automatic Train Control", which occurs 7 times on average, are highly ranked as domain-specific terms. Approaches based solely on the frequency of the terms (e.g., [8]) would hardly recognize such a term as relevant for the domain under consideration.
3.3 Variability
The variability candidates have been extracted according to the formula reported in Sect. 2.4. The number of variability candidates is 372 when l = 100, and increases to 809 and 1180 when l = 200 and l = 300, respectively. In the pilot test, we have manually inspected only the candidates for l = 100, since these already provided enough informative content. The first author, who was not involved in the experiments, has manually assessed the candidates. He found that 47% of the terms (174 out of 372) can actually be considered as variant features. Table 3 lists some variant features that are useful for discussion.

Table 3: Selection of variability candidates (columns: Candidate, Avg. Rank, Avg. Freq.)
Airlink
region ATO
region ATP
central control network
CCTV
train registry system
train registry
Smartlock
BDR
Base Data Radio
NetTrac

We have observed that the list includes several terms that are associated with the components provided by specific vendors. These components can be regarded as variant features. For example, the list includes the term "Smartlock", which is the proprietary interlocking of Alstom, the term "Airlink", which is the Siemens wireless communication system, and "NetTrac", which is the commercial name of the ATS provided by Thales. Other optional components, which are used by more than one vendor, are also found in the list of the variability candidates. Among them, it is worth mentioning the terms "CCTV", which is the Closed Circuit Television system, and "central control network". Among the variant features, we also find sub-systems of the commonalities already identified, for example "region ATO" and "region ATP", which are sub-systems of the ATO and of the ATP systems, respectively.

In the list, we also notice some issues related to synonyms and acronyms. In particular, we have found that about 10% of the terms among the selected variabilities (17 out of 174) represent the same feature as another term. For example, the list includes both "train registry" and "train registry system", which both refer to the same sub-system of the ATS. However, these situations can be automatically discovered by checking the rank and the average frequency: the two terms have the same values for both indexes. Equal values indicate that, although the two expressions are treated as different terms, the original expression in the text is always "train registry system". If rank and frequency are different, the issue can be addressed by searching for terms whose words are subsets of other terms. Such a search shall be supervised by a human operator, in order not to treat "wayside ATP" and "wayside ATP computer" as synonyms, since these are different components.

Furthermore, we notice that both the acronym "BDR" and the associated term "Base Data Radio" appear in the list. This problem occurs for only 2% of the terms (3 out of 174). Reducing these expressions to their acronym version during pre-processing might solve these issues.

4. CONCLUSION
In this paper, we presented an approach for commonality and variability mining from domain-specific natural language documents. We have performed a pilot test to qualitatively evaluate the approach in the metro systems domain. The approach demonstrated its expected time-effectiveness.
In the previous experience, the commonality/variability mining activity was performed manually, and required about one week of work by four Ph.D. students with a background in metro systems. In the pilot test, the identification of commonalities and variabilities was completed in a matter of minutes, considering also the time required to assess the variabilities. Though the two experiences are not comparable (the automated analysis was performed after we had acquired knowledge about the products of the vendors), we argue that the automated step introduced can considerably ease the preliminary analysis required for the definition of a global feature model such as the one discussed in [7].

Besides solving the issues related to acronyms and synonyms, we are currently focusing on the definition of heuristics to identify relationships/constraints among the identified features. For example, hierarchical relationships might connect terms that are formed by an adjective coupled with a domain-specific term (e.g., "regional ATP"). Such heuristics could be integrated with automated approaches for feature model synthesis, such as the ones presented in [14] and [1].

6. REFERENCES
[1] M. Acher, A. Cleve, G. Perrouin, P. Heymans, C. Vanbeneden, P. Collet, and P. Lahire. On extracting feature models from product descriptions. In Proc. of VaMoS '12, pages 45-54.
[2] V. Alves, C. Schwanninger, L. Barbosa, A. Rashid, P. Sawyer, P. Rayson, C. Pohl, and A. Rummler. An exploratory study of information retrieval techniques in domain analysis. In Proc. of SPLC '08, pages 67-76.
[3] F. Bonin, F. Dell'Orletta, S. Montemagni, and G. Venturi. A contrastive approach to multi-word extraction from domain-specific corpora. In Proc. of LREC '10, pages 19-21.
[4] K. Chen, W. Zhang, H. Zhao, and H. Mei. An approach to constructing feature models based on requirements clustering. In Proc. of RE '05, pages 31-40.
[5] F. Dell'Orletta. Ensemble system for part-of-speech tagging. In Proc. of Evalita '09, Evaluation of NLP and Speech Tools for Italian.
[6] H. Dumitru, M. Gibiec, N. Hariri, J. Cleland-Huang, B. Mobasher, C. Castro-Herrera, and M. Mirakhorli. On-demand feature recommendations derived from mining public product descriptions. In Proc. of ICSE '11.
[7] A. Ferrari, G. O. Spagnolo, G. Martelli, and S. Menabeni. Product line engineering applied to CBTC systems development. In Proc. of ISOLA '12, volume 7610 of LNCS.
[8] W. Frakes, R. Prieto-Diaz, and C. Fox. DARE: Domain analysis and reuse environment. Ann. Softw. Eng., 5.
[9] IEEE. Standard for CBTC Performance and Functional Requirements. IEEE Std.
[10] I. John. Capturing product line information from legacy user documentation. Springer.
[11] K. C. Kang, S. G. Cohen, J. A. Hess, W. E. Novak, and A. S. Peterson. Feature-Oriented Domain Analysis (FODA) Feasibility Study. Technical report, Carnegie-Mellon University SE Institute.
[12] N. Niu and S. M. Easterbrook. Extracting and modeling product line functional requirements. In Proc. of RE '08.
[13] N. Niu and S. M. Easterbrook. On-demand cluster analysis for product line functional requirements. In Proc. of SPLC '08, pages 87-96.
[14] S. She, R. Lotufo, T. Berger, A. Wasowski, and K. Czarnecki. Reverse engineering feature models. In Proc. of ICSE '11.
[15] N. Weston, R. Chitchyan, and A. Rashid. A framework for constructing semantically composable feature models from natural language requirements. In Proc. of SPLC '09.

ACKNOWLEDGMENTS
This work was partially supported by the PAR FAS (TRACE-IT) project.


More information

communication tool: Silvia Biffignandi

communication tool: Silvia Biffignandi An analysis of web sites as a communication tool: an application in the banking sector Silvia Biffignandi Bibliography Datamining come approccio alle analisi dei mercati e delle performance aziendali,

More information

From Terminology Extraction to Terminology Validation: An Approach Adapted to Log Files

From Terminology Extraction to Terminology Validation: An Approach Adapted to Log Files Journal of Universal Computer Science, vol. 21, no. 4 (2015), 604-635 submitted: 22/11/12, accepted: 26/3/15, appeared: 1/4/15 J.UCS From Terminology Extraction to Terminology Validation: An Approach Adapted

More information

The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization (STREDEO PROJECT)

The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization (STREDEO PROJECT) The Development of Multimedia-Multilingual Storage, Retrieval and Delivery for E-Organization (STREDEO PROJECT) Asanee Kawtrakul, Kajornsak Julavittayanukool, Mukda Suktarachan, Patcharee Varasrai, Nathavit

More information

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 ushaduraisamy@yahoo.co.in 2 anishgracias@gmail.com Abstract A vast amount of assorted

More information

Using i for Transformational Creativity in Requirements Engineering

Using i for Transformational Creativity in Requirements Engineering Using i for Transformational Creativity in Requirements Engineering Sushma Rayasam and Nan Niu Department of EECS, University of Cincinnati Cincinnati, OH, USA 45221 rayasasa@mail.uc.edu, nan.niu@uc.edu

More information

Incorporating Window-Based Passage-Level Evidence in Document Retrieval

Incorporating Window-Based Passage-Level Evidence in Document Retrieval Incorporating -Based Passage-Level Evidence in Document Retrieval Wensi Xi, Richard Xu-Rong, Christopher S.G. Khoo Center for Advanced Information Systems School of Applied Science Nanyang Technological

More information

An Approach towards Automation of Requirements Analysis

An Approach towards Automation of Requirements Analysis An Approach towards Automation of Requirements Analysis Vinay S, Shridhar Aithal, Prashanth Desai Abstract-Application of Natural Language processing to requirements gathering to facilitate automation

More information

Domain Adaptive Relation Extraction for Big Text Data Analytics. Feiyu Xu

Domain Adaptive Relation Extraction for Big Text Data Analytics. Feiyu Xu Domain Adaptive Relation Extraction for Big Text Data Analytics Feiyu Xu Outline! Introduction to relation extraction and its applications! Motivation of domain adaptation in big text data analytics! Solutions!

More information

Interactive Dynamic Information Extraction

Interactive Dynamic Information Extraction Interactive Dynamic Information Extraction Kathrin Eichler, Holmer Hemsen, Markus Löckelt, Günter Neumann, and Norbert Reithinger Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI, 66123 Saarbrücken

More information

SERG. Reconstructing Requirements Traceability in Design and Test Using Latent Semantic Indexing

SERG. Reconstructing Requirements Traceability in Design and Test Using Latent Semantic Indexing Delft University of Technology Software Engineering Research Group Technical Report Series Reconstructing Requirements Traceability in Design and Test Using Latent Semantic Indexing Marco Lormans and Arie

More information

How To Write A Summary Of A Review

How To Write A Summary Of A Review PRODUCT REVIEW RANKING SUMMARIZATION N.P.Vadivukkarasi, Research Scholar, Department of Computer Science, Kongu Arts and Science College, Erode. Dr. B. Jayanthi M.C.A., M.Phil., Ph.D., Associate Professor,

More information

Clustering of Polysemic Words

Clustering of Polysemic Words Clustering of Polysemic Words Laurent Cicurel 1, Stephan Bloehdorn 2, and Philipp Cimiano 2 1 isoco S.A., ES-28006 Madrid, Spain lcicurel@isoco.com 2 Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe,

More information

Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis

Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Yue Dai, Ernest Arendarenko, Tuomo Kakkonen, Ding Liao School of Computing University of Eastern Finland {yvedai,

More information

Identifying Thesis and Conclusion Statements in Student Essays to Scaffold Peer Review

Identifying Thesis and Conclusion Statements in Student Essays to Scaffold Peer Review Identifying Thesis and Conclusion Statements in Student Essays to Scaffold Peer Review Mohammad H. Falakmasir, Kevin D. Ashley, Christian D. Schunn, Diane J. Litman Learning Research and Development Center,

More information

Big Data and Text Mining

Big Data and Text Mining Big Data and Text Mining Dr. Ian Lewin Senior NLP Resource Specialist Ian.lewin@linguamatics.com www.linguamatics.com About Linguamatics Boston, USA Cambridge, UK Software Consulting Hosted content Agile,

More information

Software Product Lines

Software Product Lines Software Product Lines Software Product Line Engineering and Architectures Bodo Igler and Burkhardt Renz Institut für SoftwareArchitektur der Technischen Hochschule Mittelhessen Sommersemester 2015 Questions:

More information

Semantic annotation of requirements for automatic UML class diagram generation

Semantic annotation of requirements for automatic UML class diagram generation www.ijcsi.org 259 Semantic annotation of requirements for automatic UML class diagram generation Soumaya Amdouni 1, Wahiba Ben Abdessalem Karaa 2 and Sondes Bouabid 3 1 University of tunis High Institute

More information

Content-Based Discovery of Twitter Influencers

Content-Based Discovery of Twitter Influencers Content-Based Discovery of Twitter Influencers Chiara Francalanci, Irma Metra Department of Electronics, Information and Bioengineering Polytechnic of Milan, Italy irma.metra@mail.polimi.it chiara.francalanci@polimi.it

More information

Processing and data collection of program structures in open source repositories

Processing and data collection of program structures in open source repositories 1 Processing and data collection of program structures in open source repositories JEAN PETRIĆ, TIHANA GALINAC GRBAC AND MARIO DUBRAVAC, University of Rijeka Software structure analysis with help of network

More information

Interactive Information Visualization of Trend Information

Interactive Information Visualization of Trend Information Interactive Information Visualization of Trend Information Yasufumi Takama Takashi Yamada Tokyo Metropolitan University 6-6 Asahigaoka, Hino, Tokyo 191-0065, Japan ytakama@sd.tmu.ac.jp Abstract This paper

More information

Natural Language Database Interface for the Community Based Monitoring System *

Natural Language Database Interface for the Community Based Monitoring System * Natural Language Database Interface for the Community Based Monitoring System * Krissanne Kaye Garcia, Ma. Angelica Lumain, Jose Antonio Wong, Jhovee Gerard Yap, Charibeth Cheng De La Salle University

More information

The Specific Text Analysis Tasks at the Beginning of MDA Life Cycle

The Specific Text Analysis Tasks at the Beginning of MDA Life Cycle SCIENTIFIC PAPERS, UNIVERSITY OF LATVIA, 2010. Vol. 757 COMPUTER SCIENCE AND INFORMATION TECHNOLOGIES 11 22 P. The Specific Text Analysis Tasks at the Beginning of MDA Life Cycle Armands Šlihte Faculty

More information

Sentiment analysis on tweets in a financial domain

Sentiment analysis on tweets in a financial domain Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International

More information

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. White Paper Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. Using LSI for Implementing Document Management Systems By Mike Harrison, Director,

More information

Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision

Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision Viktor PEKAR Bashkir State University Ufa, Russia, 450000 vpekar@ufanet.ru Steffen STAAB Institute AIFB,

More information

How To Rank Term And Collocation In A Newspaper

How To Rank Term And Collocation In A Newspaper You Can t Beat Frequency (Unless You Use Linguistic Knowledge) A Qualitative Evaluation of Association Measures for Collocation and Term Extraction Joachim Wermter Udo Hahn Jena University Language & Information

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518 International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 INTELLIGENT MULTIDIMENSIONAL DATABASE INTERFACE Mona Gharib Mohamed Reda Zahraa E. Mohamed Faculty of Science,

More information

PoS-tagging Italian texts with CORISTagger

PoS-tagging Italian texts with CORISTagger PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy fabio.tamburini@unibo.it Abstract. This paper presents an evolution of CORISTagger [1], an high-performance

More information

Analyzing survey text: a brief overview

Analyzing survey text: a brief overview IBM SPSS Text Analytics for Surveys Analyzing survey text: a brief overview Learn how gives you greater insight Contents 1 Introduction 2 The role of text in survey research 2 Approaches to text mining

More information

Customer Intentions Analysis of Twitter Based on Semantic Patterns

Customer Intentions Analysis of Twitter Based on Semantic Patterns Customer Intentions Analysis of Twitter Based on Semantic Patterns Mohamed Hamroun mohamed.hamrounn@gmail.com Mohamed Salah Gouider ms.gouider@yahoo.fr Lamjed Ben Said lamjed.bensaid@isg.rnu.tn ABSTRACT

More information

Improving Decision Making in Software Product Lines Product Plan Management

Improving Decision Making in Software Product Lines Product Plan Management Improving Decision Making in Software Product Lines Product Plan Management Pablo Trinidad, David Benavides, and Antonio Ruiz-Cortés Dpto. de Lenguajes y Sistemas Informáticos University of Seville Av.

More information

Clustering Connectionist and Statistical Language Processing

Clustering Connectionist and Statistical Language Processing Clustering Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised

More information

Ontology-Based Filtering Mechanisms for Web Usage Patterns Retrieval

Ontology-Based Filtering Mechanisms for Web Usage Patterns Retrieval Ontology-Based Filtering Mechanisms for Web Usage Patterns Retrieval Mariângela Vanzin, Karin Becker, and Duncan Dubugras Alcoba Ruiz Faculdade de Informática - Pontifícia Universidade Católica do Rio

More information

A Survey on Product Aspect Ranking

A Survey on Product Aspect Ranking A Survey on Product Aspect Ranking Charushila Patil 1, Prof. P. M. Chawan 2, Priyamvada Chauhan 3, Sonali Wankhede 4 M. Tech Student, Department of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra,

More information

IT services for analyses of various data samples

IT services for analyses of various data samples IT services for analyses of various data samples Ján Paralič, František Babič, Martin Sarnovský, Peter Butka, Cecília Havrilová, Miroslava Muchová, Michal Puheim, Martin Mikula, Gabriel Tutoky Technical

More information

Domain Classification of Technical Terms Using the Web

Domain Classification of Technical Terms Using the Web Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

More information

Web opinion mining: How to extract opinions from blogs?

Web opinion mining: How to extract opinions from blogs? Web opinion mining: How to extract opinions from blogs? Ali Harb ali.harb@ema.fr Mathieu Roche LIRMM CNRS 5506 UM II, 161 Rue Ada F-34392 Montpellier, France mathieu.roche@lirmm.fr Gerard Dray gerard.dray@ema.fr

More information

Prototype software framework for causal text mining

Prototype software framework for causal text mining Prototype software framework for causal text mining L.A. de Vries School of Management & Governance University of Twente The Netherlands +31 (0) 648951497 l.a.devries@student.utwente.nl ABSTRACT Consulting

More information

Creating Template Contract Documents using Multi- Agent Text Understanding and Clustering in Cars Insurance Domain

Creating Template Contract Documents using Multi- Agent Text Understanding and Clustering in Cars Insurance Domain Creating Template Contract Documents using Multi- Agent Text Understanding and Clustering in Cars Insurance Domain Igor Minakov 1, George Rzevski 2, Petr Skobelev 1, Simon Volman 1 1 MAGENTA Development,

More information

The Oxford Learner s Dictionary of Academic English

The Oxford Learner s Dictionary of Academic English ISEJ Advertorial The Oxford Learner s Dictionary of Academic English Oxford University Press The Oxford Learner s Dictionary of Academic English (OLDAE) is a brand new learner s dictionary aimed at students

More information

Automated Extraction of Security Policies from Natural-Language Software Documents

Automated Extraction of Security Policies from Natural-Language Software Documents Automated Extraction of Security Policies from Natural-Language Software Documents Xusheng Xiao 1 Amit Paradkar 2 Suresh Thummalapenta 3 Tao Xie 1 1 Dept. of Computer Science, North Carolina State University,

More information

IT Customer Relationship Management supported by ITIL

IT Customer Relationship Management supported by ITIL Page 170 of 344 IT Customer Relationship supported by ITIL Melita Kozina, Tina Crnjak Faculty of Organization and Informatics University of Zagreb Pavlinska 2, 42000 {melita.kozina, tina.crnjak}@foi.hr

More information

dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING

dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING ABSTRACT In most CRM (Customer Relationship Management) systems, information on

More information

The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge

The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge White Paper October 2002 I. Translation and Localization New Challenges Businesses are beginning to encounter

More information

Identifying Focus, Techniques and Domain of Scientific Papers

Identifying Focus, Techniques and Domain of Scientific Papers Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 sonal@cs.stanford.edu Christopher D. Manning Department of

More information

SPLConfig: Product Configuration in Software Product Line

SPLConfig: Product Configuration in Software Product Line SPLConfig: Product Configuration in Software Product Line Lucas Machado, Juliana Pereira, Lucas Garcia, Eduardo Figueiredo Department of Computer Science, Federal University of Minas Gerais (UFMG), Brazil

More information

Cost-Effective Traceability Links for Architecture-Level Software Understanding: A Controlled Experiment

Cost-Effective Traceability Links for Architecture-Level Software Understanding: A Controlled Experiment Cost-Effective Traceability Links for Architecture-Level Software Understanding: A Controlled Experiment Muhammad Atif Javed, Srdjan Stevanetic and Uwe Zdun Software Architecture Research Group University

More information

EA-Analyzer: Automating Conflict Detection in Aspect-Oriented Requirements

EA-Analyzer: Automating Conflict Detection in Aspect-Oriented Requirements 2009 IEEE/ACM International Conference on Automated Software Engineering EA-Analyzer: Automating Conflict Detection in Aspect-Oriented Requirements Alberto Sardinha, Ruzanna Chitchyan, Nathan Weston, Phil

More information

Using Feedback Tags and Sentiment Analysis to Generate Sharable Learning Resources

Using Feedback Tags and Sentiment Analysis to Generate Sharable Learning Resources Using Feedback Tags and Sentiment Analysis to Generate Sharable Learning Resources Investigating Automated Sentiment Analysis of Feedback Tags in a Programming Course Stephen Cummins, Liz Burd, Andrew

More information

Semantic Analysis of Business Process Executions

Semantic Analysis of Business Process Executions Semantic Analysis of Business Process Executions Fabio Casati, Ming-Chien Shan Software Technology Laboratory HP Laboratories Palo Alto HPL-2001-328 December 17 th, 2001* E-mail: [casati, shan] @hpl.hp.com

More information

Parsing Software Requirements with an Ontology-based Semantic Role Labeler

Parsing Software Requirements with an Ontology-based Semantic Role Labeler Parsing Software Requirements with an Ontology-based Semantic Role Labeler Michael Roth University of Edinburgh mroth@inf.ed.ac.uk Ewan Klein University of Edinburgh ewan@inf.ed.ac.uk Abstract Software

More information

A Note on Automated Support for Product Application Discovery

A Note on Automated Support for Product Application Discovery A Note on Automated Support for Product Application Discovery 14 August 2004 Abstract In addition to new product development, manufacturers often seek to find new applications for their existing products.

More information

CAPTURING THE VALUE OF UNSTRUCTURED DATA: INTRODUCTION TO TEXT MINING

CAPTURING THE VALUE OF UNSTRUCTURED DATA: INTRODUCTION TO TEXT MINING CAPTURING THE VALUE OF UNSTRUCTURED DATA: INTRODUCTION TO TEXT MINING Mary-Elizabeth ( M-E ) Eddlestone Principal Systems Engineer, Analytics SAS Customer Loyalty, SAS Institute, Inc. Is there valuable

More information

Feature Selection for Electronic Negotiation Texts

Feature Selection for Electronic Negotiation Texts Feature Selection for Electronic Negotiation Texts Marina Sokolova, Vivi Nastase, Mohak Shah and Stan Szpakowicz School of Information Technology and Engineering, University of Ottawa, Ottawa ON, K1N 6N5,

More information

Query Recommendation employing Query Logs in Search Optimization

Query Recommendation employing Query Logs in Search Optimization 1917 Query Recommendation employing Query Logs in Search Optimization Neha Singh Department of Computer Science, Shri Siddhi Vinayak Group of Institutions, Bareilly Email: singh26.neha@gmail.com Dr Manish

More information

Cross-Language Information Retrieval by Domain Restriction using Web Directory Structure

Cross-Language Information Retrieval by Domain Restriction using Web Directory Structure Cross-Language Information Retrieval by Domain Restriction using Web Directory Structure Fuminori Kimura Faculty of Culture and Information Science, Doshisha University 1 3 Miyakodani Tatara, Kyoutanabe-shi,

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Clustering Technique in Data Mining for Text Documents

Clustering Technique in Data Mining for Text Documents Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

More information

Traceability in Requirement Specifications Using Natural Languages

Traceability in Requirement Specifications Using Natural Languages Traceability in Requirement Specifications Using Natural Languages Kroha, P., Hnetynka, P., Simko, V., Vinarek, J. 1 Traceability and Requirements In software evolution and maintenance, traceability of

More information

Database Design For Corpus Storage: The ET10-63 Data Model

Database Design For Corpus Storage: The ET10-63 Data Model January 1993 Database Design For Corpus Storage: The ET10-63 Data Model Tony McEnery & Béatrice Daille I. General Presentation Within the ET10-63 project, a French-English bilingual corpus of about 2 million

More information

A Comparative Study on Sentiment Classification and Ranking on Product Reviews

A Comparative Study on Sentiment Classification and Ranking on Product Reviews A Comparative Study on Sentiment Classification and Ranking on Product Reviews C.EMELDA Research Scholar, PG and Research Department of Computer Science, Nehru Memorial College, Putthanampatti, Bharathidasan

More information

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features , pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of

More information

Carrying Ideas from Knowledge-based Configuration to Software Product Lines

Carrying Ideas from Knowledge-based Configuration to Software Product Lines Carrying Ideas from Knowledge-based Configuration to Software Product Lines Juha Tiihonen 1, Mikko Raatikainen 2, Varvana Myllärniemi 2, and Tomi Männistö 1 1 {firstname.lastname}@cs.helsinki.fi, University

More information

The Future of Transportation

The Future of Transportation The Future of Transportation Innovation at Siemens Press and Analyst Event, CEO Mobility siemens.com/innovation Exponential growth of digitalization will change rail and road transportation enormously

More information

A Survey of Customer Relationship Management

A Survey of Customer Relationship Management A Survey of Customer Relationship Management RumaPanda 1, Dr. A. N. Nandakumar 2 1 Assistant Professor, Dept. of C.S.E. Vemana I T, VTU,Bangalore, Karnataka, India 2 Professor & Principal, R. L. Jalappa

More information

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Open Domain Information Extraction. Günter Neumann, DFKI, 2012 Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for

More information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department

More information

Get the most value from your surveys with text analysis

Get the most value from your surveys with text analysis PASW Text Analytics for Surveys 3.0 Specifications Get the most value from your surveys with text analysis The words people use to answer a question tell you a lot about what they think and feel. That

More information

Information extraction from online XML-encoded documents

Information extraction from online XML-encoded documents Information extraction from online XML-encoded documents From: AAAI Technical Report WS-98-14. Compilation copyright 1998, AAAI (www.aaai.org). All rights reserved. Patricia Lutsky ArborText, Inc. 1000

More information

Data Mining Governance for Service Oriented Architecture

Data Mining Governance for Service Oriented Architecture Data Mining Governance for Service Oriented Architecture Ali Beklen Software Group IBM Turkey Istanbul, TURKEY alibek@tr.ibm.com Turgay Tugay Bilgin Dept. of Computer Engineering Maltepe University Istanbul,

More information

CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test

CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test CINTIL-PropBank I. Basic Information 1.1. Corpus information The CINTIL-PropBank (Branco et al., 2012) is a set of sentences annotated with their constituency structure and semantic role tags, composed

More information

Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata

Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Alessandra Giordani and Alessandro Moschitti Department of Computer Science and Engineering University of Trento Via Sommarive

More information

Analysis of Social Media Streams

Analysis of Social Media Streams Fakultätsname 24 Fachrichtung 24 Institutsname 24, Professur 24 Analysis of Social Media Streams Florian Weidner Dresden, 21.01.2014 Outline 1.Introduction 2.Social Media Streams Clustering Summarization

More information

Chapter 17 Recommendation Systems in Requirements Discovery

Chapter 17 Recommendation Systems in Requirements Discovery Chapter 17 Recommendation Systems in Requirements Discovery Negar Hariri, Carlos Castro-Herrera, Jane Cleland-Huang, and Bamshad Mobasher Abstract Recommendation systems offer the opportunity for supporting

More information

Multilingual and Localization Support for Ontologies

Multilingual and Localization Support for Ontologies Multilingual and Localization Support for Ontologies Mauricio Espinoza, Asunción Gómez-Pérez and Elena Montiel-Ponsoda UPM, Laboratorio de Inteligencia Artificial, 28660 Boadilla del Monte, Spain {jespinoza,

More information

Discovering Sequential Rental Patterns by Fleet Tracking

Discovering Sequential Rental Patterns by Fleet Tracking Discovering Sequential Rental Patterns by Fleet Tracking Xinxin Jiang (B), Xueping Peng, and Guodong Long Quantum Computation and Intelligent Systems, University of Technology Sydney, Ultimo, Australia

More information

A Question Answering service for information retrieval in Cooper

A Question Answering service for information retrieval in Cooper A Question Answering service for information retrieval in Cooper Bas Giesbers¹, Antonio Taddeo², Wim van der Vegt¹, Jan van Bruggen¹, Rob Koper¹, ¹Open University of the Netherlands {Bas.Giesbers, Jan.vanBruggen,

More information

Technical Report. The KNIME Text Processing Feature:

Technical Report. The KNIME Text Processing Feature: Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold Killian.Thiel@uni-konstanz.de Michael.Berthold@uni-konstanz.de Copyright 2012 by KNIME.com AG

More information

Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery

Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery Jan Paralic, Peter Smatana Technical University of Kosice, Slovakia Center for

More information

Enhancing Requirement Traceability Link Using User's Updating Activity

Enhancing Requirement Traceability Link Using User's Updating Activity ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology Volume 3, Special Issue 3, March 2014 2014 International Conference

More information

Keywords: Regression testing, database applications, and impact analysis. Abstract. 1 Introduction

Keywords: Regression testing, database applications, and impact analysis. Abstract. 1 Introduction Regression Testing of Database Applications Bassel Daou, Ramzi A. Haraty, Nash at Mansour Lebanese American University P.O. Box 13-5053 Beirut, Lebanon Email: rharaty, nmansour@lau.edu.lb Keywords: Regression

More information

ELEVATING FORENSIC INVESTIGATION SYSTEM FOR FILE CLUSTERING

ELEVATING FORENSIC INVESTIGATION SYSTEM FOR FILE CLUSTERING ELEVATING FORENSIC INVESTIGATION SYSTEM FOR FILE CLUSTERING Prashant D. Abhonkar 1, Preeti Sharma 2 1 Department of Computer Engineering, University of Pune SKN Sinhgad Institute of Technology & Sciences,

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

More information