An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them


Vangelis Karkaletsis and Constantine D. Spyropoulos

NCSR Demokritos, Institute of Informatics & Telecommunications, Aghia Paraskevi Attikis, Athens, Greece
{vangelis, iit.demokritos.gr/skel

Abstract. The paper presents a platform that facilitates the use of tools for collecting domain specific web pages as well as for extracting information from them. It also supports the configuration of such tools to new domains and languages. The platform provides a user friendly interface through which the user can specify the domain specific resources (ontology, lexica, corpora for the training and testing of the tools), train the collection and extraction tools using these resources, and test the tools with various configurations. The platform design is based on the methodology proposed for web information retrieval and extraction in the context of the R&D project CROSSMARC.

1 Introduction

The growing volume of web content in various languages and formats, along with the lack of structured information and the diversity of information, has made information and knowledge management a real challenge in the effort to support the information society. Enabling large-scale information extraction (IE) from the Web is a crucial issue for the future of the Internet. The traditional approach to Web IE is to create wrappers, i.e. sets of extraction rules, either manually or automatically. At run-time, wrappers extract information from unseen collections of Web pages of known layout and fill the slots of a predefined template. The manual creation of wrappers has many shortcomings, due to the overhead of writing and maintaining them.
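The notion of a wrapper as a set of extraction rules that fill a predefined template can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the slot names, the regular expressions, the sample page and the helper `apply_wrapper` are hypothetical and are not taken from CROSSMARC.

```python
import re

# Hypothetical hand-written wrapper for ONE known page layout:
# each template slot is paired with one extraction rule (a regex with one group).
LAPTOP_WRAPPER = {
    "model": re.compile(r'<h2 class="name">(.*?)</h2>'),
    "price": re.compile(r'<span class="price">(.*?)</span>'),
    "ram": re.compile(r"<td>RAM</td>\s*<td>(.*?)</td>"),
}


def apply_wrapper(wrapper, html):
    """Fill every template slot with the first match of its rule, or None."""
    template = {}
    for slot, rule in wrapper.items():
        match = rule.search(html)
        template[slot] = match.group(1).strip() if match else None
    return template


# Illustrative page in the layout the wrapper was written for.
page = """
<h2 class="name">Acme Book 14</h2>
<span class="price">999 EUR</span>
<table><tr><td>RAM</td> <td>512 MB</td></tr></table>
"""

filled = apply_wrapper(LAPTOP_WRAPPER, page)

# A small change in the page formatting silently breaks the price rule.
changed_page = page.replace('class="price"', 'class="cost"')
broken = apply_wrapper(LAPTOP_WRAPPER, changed_page)
```

In this sketch `filled` has all three slots populated, whereas `broken["price"]` is `None` after a one-attribute change in the markup, which is the kind of maintenance overhead that rules tied to a single layout incur.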
On the other hand, the automatic creation of wrappers (wrapper induction) also presents problems, since the wrappers must be re-trained when the formatting of the targeted Web site changes or when pages from a similar Web site are to be analyzed. Training an effective site-independent wrapper is an attractive solution in terms of scalability, since any domain-specific page could be processed without relying heavily on the hypertext structure.

V. Karkaletsis and C.D. Spyropoulos: An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them, StudFuzz 185, (2005). © Springer-Verlag Berlin Heidelberg 2005

The collection of the application-specific web pages that will be processed by the wrappers is also a crucial issue. A collection mechanism is necessary for locating the application-specific web sites and identifying the interesting pages within them.

The design and development of web page collection and extraction systems needs to consider requirements such as enabling adaptation to new domains and languages, facilitating maintenance for an existing domain, providing strategies for effective site navigation, ensuring personalized access, and handling structured, semi-structured or unstructured data. The implementation of a web page collection and extraction mechanism that effectively addresses these issues was the motivation for the R&D project CROSSMARC, which was partially funded by the EC. The CROSSMARC work resulted in a system for web information retrieval and extraction which can be trained for new applications and languages, and in a customization infrastructure that supports configuration of the system to new domains and languages.

Based on the methodology proposed in CROSSMARC, we started the development of a new platform to facilitate the use of collection and extraction tools as well as their customization. The platform provides a user friendly interface through which the user can specify the domain specific resources (ontology, lexica, corpora for the training and testing of the tools), train the collection and extraction tools using these resources, and test the tools with various configurations. The current version of the platform incorporates mainly CROSSMARC tools for the case studies in which it is being tested. However, its open architecture design also enables the incorporation of new tools.

The paper first outlines the CROSSMARC work in relation to other work in the area.
It then presents the first version of the platform, as well as some first results from its use in case studies.

2 Related Work

Collection of domain specific web pages involves the use of focused web crawling and spidering technologies. The motivation for focused crawling comes from the poor performance of general-purpose search engines, which depend on the results of generic Web crawlers. The aim is to adapt the behavior of the search engine to the user requirements. The term focused crawling was introduced in [1], where the system presented starts with a set of representative pages and a topic hierarchy and tries to find more instances of interesting topics in the hierarchy by following the links in the seed pages. Another interesting approach to focused crawling is adopted by the InfoSpiders system [4], a multi-agent focused crawler which uses as starting points a set of keywords and a set of root pages. The crawler implemented according to the CROSSMARC approach involves three different crawlers, which exploit topic hierarchies, keywords from domain ontologies and lexica, and a set of representative pages [8].

While in focused crawling the aim is to adapt the behavior of the search engine to the requirements of a user, in site-specific spidering the spider navigates within a Web site, following links best-scored-first. Each Web page visited is evaluated in order to decide whether it is really relevant to the topic, and its hyperlinks are scored in order to decide whether they are likely to lead to useful pages. Therefore, site-specific spidering involves two decision functions: one which classifies Web pages as being interesting (e.g. laptop offers) or not, and one that scores hyperlinks according to their potential usefulness.

The input to the first decision function is a Web page visited by the spider and its output is a binary decision. This is a typical text classification task. Various machine learning methods have been used for constructing such text classifiers; an up-to-date survey of such approaches is provided in [6]. In CROSSMARC we examined a large number of classification approaches in order to find the most appropriate one for each domain and language.

The second decision function in site-specific spidering is a regression function: the input is the hyperlink, together with its anchor and possibly surrounding text, and the output is a score corresponding to the probability of reaching a product page quickly through this link. As with classification, a variety of machine learning methods are available for learning regression functions.
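The interplay of the two decision functions in a best-scored-first spider can be sketched as follows. This is only an illustration under stated assumptions: the in-memory `SITE`, the keyword-based `is_interesting` and `score_link` functions, and the `threshold` parameter are hypothetical stand-ins for a trained page classifier and a learned link-scoring regressor, and no network access is performed.

```python
import heapq

# Toy in-memory "site": URL -> (page text, [(anchor text, target URL), ...]).
SITE = {
    "/": ("Welcome to our shop", [("Laptops", "/laptops"), ("About us", "/about")]),
    "/about": ("Company history", [("Home", "/")]),
    "/laptops": ("Laptop categories", [("Laptop offers", "/offers"), ("Home", "/")]),
    "/offers": ("laptop offer: 14 inch notebook, price 999", []),
}


def is_interesting(text):
    """First decision function: binary page classifier (keyword stand-in)."""
    lowered = text.lower()
    return "offer" in lowered and "price" in lowered


def score_link(anchor):
    """Second decision function: link score in [0, 1] (keyword stand-in)."""
    words = anchor.lower().split()
    hits = sum(word in {"laptop", "laptops", "offers"} for word in words)
    return hits / len(words) if words else 0.0


def spider(start, threshold=0.0):
    """Visit pages best-scored-first; return (visit order, interesting pages)."""
    frontier = [(-1.0, start)]  # heapq is a min-heap, so scores are negated
    seen, order, found = {start}, [], []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = SITE[url]
        order.append(url)
        if is_interesting(text):
            found.append(url)
        for anchor, target in links:
            score = score_link(anchor)
            # Links scoring below the threshold are never followed.
            if target not in seen and score >= threshold:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
    return order, found


order, found = spider("/")
pruned_order, _ = spider("/", threshold=0.5)
```

With the default threshold the sketch reaches the offers page before the low-scoring "About us" page; raising the threshold prunes that page from the visit entirely, which is the trade-off a link-scoring threshold controls.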
However, in contrast to text classification, the task of hyperlink scoring has not been studied extensively in the literature. Most of the work on scoring and ordering links refers to Web-wide crawlers rather than site-specific spiders, and is based on the popularity of the pages pointed to by the links being examined. This approach is inappropriate for the spider implemented in CROSSMARC. The only really relevant work that has been identified in the literature is [5], which uses a type of simplified reinforcement learning to score the hyperlinks met by a Web spider. A reinforcement learning link scoring methodology was also examined in CROSSMARC and was compared against a rule-based methodology.

Concerning information extraction from web pages, a number of systems have been developed to extract structured data from web pages. A recent survey of existing web extraction tools is found in [3], where a classification of these tools is proposed based on the technologies used for wrapper creation or induction. According to [3], tools can be classified in the following categories:

Languages for wrappers development: these are languages designed for assisting the manual creation of wrappers.
HTML-aware tools: these tools convert a web page into a tree representation that reflects the HTML tag hierarchy. Extraction rules are then applied to the tree representation.
Wrapper Induction tools: these tools generate delimiter-based rules relying on page formatting features and not on linguistic ones. They present similarities with the HTML-aware tools.
NLP-based tools: these tools employ natural language processing (NLP) techniques, such as part of speech tagging and phrase chunking, to learn extraction rules.
Ontology-based tools: these tools employ a domain-specific ontology to locate ontology instances in the web page, which are then used to fill the template slots.

Fig. 1. Classification of web extraction tools (taken from [3]) and CROSSMARC position

CROSSMARC employs most of the categories of web extraction tools presented in [3] (see Fig. 1). It uses:

Wrapper Induction (WI) techniques, in order to exploit the formatting features of the web pages.
NLP techniques, to exploit linguistic features of the web pages, enabling the processing of domain specific web pages in different sites and in different languages (multilingual, site-independent).
Ontology engineering, to enable the creation and maintenance of ontologies, language-specific lexica, as well as other application-specific resources.

Details on the CROSSMARC extraction tools are presented in [2]. More relevant publications can be found at the project's web site.
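As a rough illustration of the ontology-based category, the sketch below uses language-specific lexica attached to ontology concepts to fill the same template from pages in different languages. The tiny `ONTOLOGY`, its lexicon entries, the "label: value" input format and the function `fill_template` are all assumptions made for this example; they do not reproduce the CROSSMARC ontology or its extraction modules.

```python
# Hypothetical mini-ontology: each concept carries language-specific lexicon entries.
ONTOLOGY = {
    "processor": {"en": ["processor", "cpu"], "el": ["επεξεργαστής"]},
    "memory": {"en": ["ram", "memory"], "el": ["μνήμη"]},
}


def fill_template(text, lang):
    """Map 'label: value' lines to ontology concepts via the lexicon for `lang`."""
    template = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # not a label/value line in this simplified format
        label, value = line.split(":", 1)
        label = label.strip().lower()
        for concept, lexica in ONTOLOGY.items():
            if label in lexica.get(lang, []):
                template[concept] = value.strip()
    return template


english = fill_template("CPU: Pentium 4\nRAM: 512 MB", "en")
greek = fill_template("Επεξεργαστής: Pentium 4\nΜνήμη: 512 MB", "el")
```

Both calls yield the same language-independent template keyed by ontology concept, which suggests why adding a language in such a design amounts mainly to adding a lexicon rather than rewriting extraction rules.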

3 The Platform

The CROSSMARC work resulted in a core system for web information retrieval and extraction which can be trained for new applications and languages, and in a customization infrastructure that supports configuration of the system to new domains and languages. The core system implements a distributed, multi-agent, open and multi-lingual architecture, which is depicted in Fig. 2. It involves components for the identification of interesting web sites (focused crawling) and the location of domain-specific web pages within these sites (spidering), the extraction of information about product/offer descriptions from the collected web pages, and the storage and presentation of the extracted information to the end-user according to his/her preferences.

Fig. 2. System's agent based architecture

The infrastructure for configuring to new domains and languages involves:

an ontology management system for the creation and maintenance of the ontology, the lexicons and other ontology-related resources;
a methodology and a tool for the formation of the corpus necessary for the training and testing of the modules in the spidering component;
a methodology and a tool for the collection and annotation of the corpus necessary for the training and testing of the information extraction components.

Based on this work, we started the development of a platform that will enable the integration, training and testing of collection and extraction tools (such as the ones developed in CROSSMARC) under a common interface. The experience gained from building three different applications using CROSSMARC tools significantly assisted the platform design. These applications concerned the extraction of information from:

laptop offers in e-retailers' web sites (in four languages),
job offers in IT companies' web sites (in four languages),
holiday packages in the sites of travel agencies (in two languages).

According to the CROSSMARC methodology, the building of an application involves two main stages. The first concerns the creation of the application-specific resources using the customization infrastructure, whereas the second concerns the training of the integrated system using the application-specific resources, and the system configuration.

The first stage is realized, in our platform, by the Ontology and Corpora tabs. Through the Ontology tab (see Fig. 3), the user can invoke an ontology management system in order to create or update the domain specific ontology, the lexicons under the domain ontology, the important entities and fact types for the domain, and the user stereotype definitions according to the ontology. In the current version, the ontology management system of CROSSMARC is used. The Ontology tab also enables the user to specify the location of the ontology related resources that he/she wants to use in the next steps of the application building (see Fig. 4).

Fig. 3. Ontology tab: invoking the ontology management system

Fig. 4. Ontology tab: specifying the ontology related resources

Through the Corpora tab the user can perform several tasks. The user can invoke the Corpus Formation Tool (CFT), which helps users build a corpus of positive and negative pages with respect to a given domain (see Fig. 5). This corpus is then used for the training and testing of the Page Filtering component of the spidering tool. In addition, the user can specify the folder(s) where the corpora for the training and testing of the Information Extraction components are stored, and also invoke the annotation tool. The current version of the platform employs a different annotation tool from the one that was included in the CROSSMARC distribution. The new tool is one of those provided by the Ellogon language engineering platform of our laboratory. However, the platform also supports the use of the CROSSMARC Web annotation tool [7].

Fig. 5. Corpora tab: invoking the Corpus Formation Tool

The second processing stage is realized by the Training and Extraction tabs. Through the Training tab (see Fig. 6), the user can invoke the machine learning based training tools for the Page Filtering, Link Scoring, and Information Extraction components. In the case of Information Extraction in particular, training involves two separate modules, the Named Entity Recognition & Classification (NERC) module and the Fact Extraction (FE) module. The current version of the platform employs the Ellogon-based NERC and FE modules developed by our laboratory. The platform can also support the use of the other NERC and FE training tools developed in the context of CROSSMARC, since they all share common I/O specifications.

Fig. 6. Training tab: invoking the training tool for page filtering

Through the Extraction tab (see Fig. 7), the user can configure and test the Crawling, Spidering and Information Extraction components. In the case of Crawling, the user can set the starting points for the crawler by editing the corresponding configuration file. In a similar way, a different crawler can be incorporated and configured according to the specific domains. A new crawler is currently under development and will be tested through the platform in a future case study. In the case of Spidering, the user can select the model for page filtering and link scoring (machine learning based or heuristics based), edit the heuristics based model, set a threshold for link scoring, and set several more advanced options. The user can test the components with various configurations, view the results and decide on the preferred configuration. Concerning Information Extraction, the user can test the NERC and FE components separately, and configure the demarcation components. In the current version, the platform supports only the NERC component.

Fig. 7. Extraction tab: configuring the spidering component (advanced options)

It must be noted that the outcome of using the platform is not necessarily a complete web content collection and extraction system. As shown in the case studies section, the platform user can build a crawler for a new domain, a collection system (crawler and spider), a named entity recognition system, or an information extraction system. This depends on the needs of the specific task and on the domain.

4 Case Studies

The current version of the platform was used for building several applications. Some of these applications are presented below, grouped according to the different tasks.

The first group of applications involves the development of crawlers for an information filtering task. More specifically, the task was to develop crawlers for specific topics (English and Greek languages were covered) that return lists of web sites for these topics. These lists would be used to train an information filtering system. Examples of topics include web sites that provide a service to communicate (chat) with other users in real time, web sites that provide services (send/receive messages), sites with job offers, etc. In these cases, the Extraction tab of the platform was used to configure the starting points of the crawler, test it and find the best configuration for each topic.

Another group of applications concerns the development of systems collecting web pages for specific domains and languages. An example domain is personal web pages of academic staff in University departments (Greek pages were covered).
Such applications involve the training of both the crawling and the spidering components using the platform functionalities. More specifically, they use the Ontology tab to create the domain-specific ontology and lexica, the Corpora tab to create the corpus for the training of page filtering, the Training tab to train the page filtering and link scoring components, and the Extraction tab to configure and test the crawling and spidering components.

A third group of applications concerns the development of named entity recognition systems for specific domains and languages, which requires the collection and annotation of the necessary corpus and the training and testing of the system. In a similar way, information extraction systems can be developed.

The final group of applications integrates the collection and extraction mechanisms, as was the case for the CROSSMARC domains. The platform, in its current status, does not support the development of such integrated applications.

5 Concluding Remarks

The CROSSMARC project implemented a distributed, multi-agent, open and multilingual architecture for web retrieval and extraction, which integrates several components based on state of the art AI technologies and commercial tools. Based on this work we are developing a platform that enables the integration, training and testing of collection and extraction tools, such as the ones developed in CROSSMARC. A first version of this platform is currently being tested in several case studies for the development of focused crawlers, spiders, and information extraction systems. The current version employs mainly CROSSMARC tools. However, due to its open design, other tools have also been employed, and more will be integrated and tested in the near future.

References

1. Chakrabarti S., van den Berg M.H., Dom B.E.: Focused Crawling: a new approach to topic-specific Web resource discovery. Proceedings of the 8th International World Wide Web Conference, Toronto, Canada (1999)
2. Karkaletsis V., Spyropoulos C.D., Grover C., Pazienza M.T., Coch J., Souflis D.: A Platform for Cross-lingual, Domain and User Adaptive Web Information Extraction. Proceedings of the European Conference in Artificial Intelligence (ECAI), Valencia, Spain (2004)
3. Laender A., Ribeiro-Neto B., da Silva A., Teixeira J.: A Brief Survey of Web Data Extraction Tools. ACM SIGMOD Record, vol. 31(2) (2002)
4. Menczer F., Belew R.K.: Adaptive Retrieval Agents: Internalizing Local Context and Scaling up to the Web. Machine Learning, 39(2/3) (2000)
5. Rennie J., McCallum A.: Efficient Web Spidering with Reinforcement Learning. Proceedings of the 16th International Conference on Machine Learning (ICML-99) (1999)
6. Sebastiani F.: Machine learning in automated text categorization. ACM Computing Surveys, 34(1) (2002)
7. Sigletos G., Farmakiotou D., Stamatakis K., Paliouras G., Karkaletsis V.: Annotating Web pages for the needs of Web Information Extraction applications. Proceedings of the 12th International WWW Conference (Poster Session), Budapest, Hungary (2003)
8. Stamatakis K., Karkaletsis V., Paliouras G., Horlock J., Grover C., Curran J.R., Dingare S.: Domain-Specific Web Site Identification: The CROSSMARC Focused Web Crawler. Proceedings of the 2nd International Workshop on Web Document Analysis (WDA 2003), Edinburgh, UK (2003)

Health-related Web Content: quality labelling mechanisms and the MedIEQ approach

Health-related Web Content: quality labelling mechanisms and the MedIEQ approach Health-related Web Content: quality labelling mechanisms and the MedIEQ approach Vangelis Karkaletsis, Kostas Stamatakis, Vangelis Metsis, Vassiliki Redoumi, Dimitris Tsarouhas National Centre for Scientific

More information

Use of Ontologies for Cross-lingual Information Management in the Web

Use of Ontologies for Cross-lingual Information Management in the Web Use of Ontologies for Cross-lingual Information Management in the Web Ben Hachey, Claire Grover, Vangelis Karkaletsis, Alexandros Valarakos, Maria Teresa Pazienza, Michele Vindigni, Emmanuel Cartier, José

More information

2QWRORJ\LQWHJUDWLRQLQDPXOWLOLQJXDOHUHWDLOV\VWHP

2QWRORJ\LQWHJUDWLRQLQDPXOWLOLQJXDOHUHWDLOV\VWHP 2QWRORJ\LQWHJUDWLRQLQDPXOWLOLQJXDOHUHWDLOV\VWHP 0DULD7HUHVD3$=,(1=$L$UPDQGR67(//$72L0LFKHOH9,1',*1,L $OH[DQGURV9$/$5$.26LL9DQJHOLV.$5.$/(76,6LL (i) Department of Computer Science, Systems and Management,

More information

Combining Ontological Knowledge and Wrapper Induction techniques into an e-retail System 1

Combining Ontological Knowledge and Wrapper Induction techniques into an e-retail System 1 Combining Ontological Knowledge and Wrapper Induction techniques into an e-retail System 1 Maria Teresa Pazienza, Armando Stellato and Michele Vindigni Department of Computer Science, Systems and Management,

More information

Purchasing the Web: an Agent based E-retail System with Multilingual Knowledge

Purchasing the Web: an Agent based E-retail System with Multilingual Knowledge WSS03 Applications, Products and Services of Web-based Support Systems 165 Purchasing the Web: an Agent based E-retail System with Multilingual Knowledge Maria Teresa Pazienza, Armando Stellato, Michele

More information

Multilingual XML-Based Named Entity Recognition for E-Retail Domains

Multilingual XML-Based Named Entity Recognition for E-Retail Domains Multilingual XML-Based Named Entity Recognition for E-Retail Domains Claire Grover, Scott McDonald, Donnla Nic Gearailt, Vangelis Karkaletsis Ý, Dimitra Farmakiotou Ý, Georgios Samaritakis Ý, Georgios

More information

Er is door mij gebruik gemaakt van dia s uit presentaties van o.a. Anastasios Kesidis, CIL, Athene Griekenland, en Asaf Tzadok, IBM Haifa Research Lab

Er is door mij gebruik gemaakt van dia s uit presentaties van o.a. Anastasios Kesidis, CIL, Athene Griekenland, en Asaf Tzadok, IBM Haifa Research Lab IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Er is door mij gebruik gemaakt van dia s uit presentaties

More information

Intinno: A Web Integrated Digital Library and Learning Content Management System

Intinno: A Web Integrated Digital Library and Learning Content Management System Intinno: A Web Integrated Digital Library and Learning Content Management System Synopsis of the Thesis to be submitted in Partial Fulfillment of the Requirements for the Award of the Degree of Master

More information

Blog Post Extraction Using Title Finding

Blog Post Extraction Using Title Finding Blog Post Extraction Using Title Finding Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School

More information

Natural Language to Relational Query by Using Parsing Compiler

Natural Language to Relational Query by Using Parsing Compiler Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

ONTOLOGY-BASED APPROACH TO DEVELOPMENT OF ADJUSTABLE KNOWLEDGE INTERNET PORTAL FOR SUPPORT OF RESEARCH ACTIVITIY

ONTOLOGY-BASED APPROACH TO DEVELOPMENT OF ADJUSTABLE KNOWLEDGE INTERNET PORTAL FOR SUPPORT OF RESEARCH ACTIVITIY ONTOLOGY-BASED APPROACH TO DEVELOPMENT OF ADJUSTABLE KNOWLEDGE INTERNET PORTAL FOR SUPPORT OF RESEARCH ACTIVITIY Yu. A. Zagorulko, O. I. Borovikova, S. V. Bulgakov, E. A. Sidorova 1 A.P.Ershov s Institute

More information

Mining Navigation Histories for User Need Recognition

Mining Navigation Histories for User Need Recognition Mining Navigation Histories for User Need Recognition Fabio Gasparetti and Alessandro Micarelli and Giuseppe Sansonetti Roma Tre University, Via della Vasca Navale 79, Rome, 00146 Italy {gaspare,micarel,gsansone}@dia.uniroma3.it

More information

A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2

A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2 UDC 004.75 A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2 I. Mashechkin, M. Petrovskiy, A. Rozinkin, S. Gerasimov Computer Science Department, Lomonosov Moscow State University,

More information

A Survey On Various Kinds Of Web Crawlers And Intelligent Crawler

A Survey On Various Kinds Of Web Crawlers And Intelligent Crawler A Survey On Various Kinds Of Web Crawlers And Intelligent Crawler Mridul B. Sahu 1, Prof. Samiksha Bharne 2 1 M.Tech Student, Dept. Of Computer Science And Engineering, (BIT), Ballarpur, India 2 Professor,

More information

Collecting Polish German Parallel Corpora in the Internet

Collecting Polish German Parallel Corpora in the Internet Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska

More information

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks Text Analytics World, Boston, 2013 Lars Hard, CTO Agenda Difficult text analytics tasks Feature extraction Bio-inspired

More information

Tamil Search Engine. Abstract

Tamil Search Engine. Abstract Tamil Search Engine Baskaran Sankaran AU-KBC Research Centre, MIT campus of Anna University, Chromepet, Chennai - 600 044. India. E-mail: baskaran@au-kbc.org Abstract The Internet marks the era of Information

More information

A Platform Independent Testing Tool for Automated Testing of Web Applications

A Platform Independent Testing Tool for Automated Testing of Web Applications A Platform Independent Testing Tool for Automated Testing of Web Applications December 10, 2009 Abstract Increasing complexity of web applications and their dependency on numerous web technologies has

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518 International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 INTELLIGENT MULTIDIMENSIONAL DATABASE INTERFACE Mona Gharib Mohamed Reda Zahraa E. Mohamed Faculty of Science,

More information

The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization (STREDEO PROJECT)

The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization (STREDEO PROJECT) The Development of Multimedia-Multilingual Storage, Retrieval and Delivery for E-Organization (STREDEO PROJECT) Asanee Kawtrakul, Kajornsak Julavittayanukool, Mukda Suktarachan, Patcharee Varasrai, Nathavit

More information

Web Data Scraper Tools: Survey

Web Data Scraper Tools: Survey International Journal of Computer Science and Engineering Open Access Survey Paper Volume-2, Issue-5 E-ISSN: 2347-2693 Web Data Scraper Tools: Survey Sneh Nain 1*, Bhumika Lall 2 1* Computer Science Department,

More information

Cloud Storage-based Intelligent Document Archiving for the Management of Big Data

Cloud Storage-based Intelligent Document Archiving for the Management of Big Data Cloud Storage-based Intelligent Document Archiving for the Management of Big Data Keedong Yoo Dept. of Management Information Systems Dankook University Cheonan, Republic of Korea Abstract : The cloud

More information

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type. Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada

More information

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari berardi@di.uniba.it

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari berardi@di.uniba.it Web Mining Margherita Berardi LACAM Dipartimento di Informatica Università degli Studi di Bari berardi@di.uniba.it Bari, 24 Aprile 2003 Overview Introduction Knowledge discovery from text (Web Content

More information
