Master Thesis Proposal
Web Data Extraction of University Staff Competencies
Edin Zildzo, 1125449
Supervisor: Ao.Univ.Prof.Dr. Jürgen Dorn
September 11, 2014

1 Problem Statement

Web data extraction is a challenging process because web pages combine complex data structures with unstructured content. Pages that vary widely in style and code, or that violate standards, are considered unstructured and therefore difficult to extract from. Web pages can be categorized by how their information is produced: some display static text, others fetch their content dynamically from a backend database at runtime, and some run complex scripts that generate data only at display time. A complete web page can thus be viewed as a combination of different content types rendered as visual blocks inside the browser window. [1]

A common problem for web data extraction tools is the structure of the data on a website. When data is written as plain text without classifying markup, it is very hard to identify what the individual text sections represent. On university web pages, most faculty members maintain their own list of publications, which reflects their expertise in a particular area. Some publications are not available on the university web page, so digital libraries will have to be searched in order to obtain a complete publication list for a given author. The extracted data will then need to be filtered, refined, and analyzed in order to derive knowledge about the competencies of faculty members. Publications will be analyzed against an ontology of competence concepts.
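To illustrate why unclassified plain text is hard to extract, consider a publication entry written as a single string. The following minimal Python sketch (the pattern and the example entry are invented for illustration, not taken from any actual university page) can only recover the fields when the entry happens to follow one particular layout:

```python
import re

def parse_publication(entry: str):
    """Heuristically split a plain-text publication entry into
    authors, title, venue, and year. This only works for one
    common layout: Authors. "Title." Venue, Year. Real entries
    vary widely, which is exactly the extraction problem."""
    match = re.match(
        r'(?P<authors>[^"]+)\.\s*"(?P<title>[^"]+)"\s*(?P<venue>.*?)(?P<year>\d{4})',
        entry,
    )
    if match is None:
        return None  # unrecognized layout: a human reader would still understand it
    return {k: v.strip(' ,.') for k, v in match.groupdict().items()}

entry = 'J. Doe and A. Smith. "Ontology-based Competency Matching." Proc. of KM Conf., 2013.'
print(parse_publication(entry))
```

The sketch shows the fragility of rule-based parsing: a single missing quotation mark or a reordered field breaks the heuristic, whereas pages with classified sections do not suffer from this ambiguity.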
Competency management can be seen as one of the foundations of learning activities in knowledge-intensive organizations. As a critical point in the functioning of knowledge management, competencies require a representational framework that is rich enough to support effective and efficient processes of competency search, matching, and analysis. [6]

2 Expected Results

The outcome of this thesis will be an evaluation of existing web data extraction tools and the design and implementation of software for extracting and analyzing data about faculty members and their publications based on the defined ontology. First, the requirements for the software will be identified and used to design it; the software will then be implemented according to that design. The implemented software will be evaluated and compared to existing extraction tools. Extracted data will be stored in a database and refined in order to derive the competencies of selected faculty members; it could also be used for further processing (e.g., a knowledge extraction process).

3 Methodology and Approach

The methodology will consist of:

1. Search and analysis of literature
The literature survey needs to provide thorough information in the area of web data extraction.

2. Designing software for web data extraction
In order to determine the requirements of the software, existing tools and approaches will be analyzed. Based on the analysis results, the software for web data extraction will be designed. The use case will be a specific institute of the Faculty of Informatics at the Vienna University of Technology.

3. Designing an ontology with competence concepts
The ontology will be designed and used for the analysis of publications in order to derive the competencies of faculty members.

4. Implementation of the software
The software that extracts the data for further analysis will be implemented. The data analysis will be based on competence concepts. Technologies that might be used for the implementation are Python, Selenium, CSS selectors (for data navigation), and PHP. Python is a powerful programming language with functions well suited for deep navigation of web pages. Selenium enables browser automation: it drives a real web browser from program code and thereby makes it possible to read and manipulate data on websites. The technology will be chosen based on research and a comparison of the methods already available for web data extraction, in order to select the method that provides the most accurate results.

5. Evaluation of results
The implemented software will be evaluated and the results of this work analyzed. To evaluate the extraction results, a questionnaire/survey will be carried out among university staff in order to check whether the extracted data used for the assessment of competencies matches the competency data that staff members provide in the survey. The survey sample will consist of randomly chosen staff members from the database.

4 State of the Art

Nowadays there are many commercial web data extraction tools, and their functionality is mostly similar. Some tools provide more functionality than others, but the core problem remains the structure of the various web pages. Most tools can detect common structures and extract data efficiently, but problems arise in irregular cases: page sections that are not properly marked, text sections that are not classified, or dynamic data generated on a page in a complex way.
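The contrast with properly marked sections can be made concrete. When an author classifies page sections (e.g., with a CSS class), a simple selector suffices to locate the data. The following sketch uses only Python's standard library as a stand-in for the Selenium/CSS-selector navigation planned above; the class name "publication" and the page fragment are invented placeholders:

```python
from html.parser import HTMLParser

class PublicationExtractor(HTMLParser):
    """Collect the text of elements marked class="publication".
    Extraction is easy precisely because the page author classified
    the sections; unmarked plain text offers no such anchor."""
    def __init__(self):
        super().__init__()
        self._in_pub = False
        self.publications = []

    def handle_starttag(self, tag, attrs):
        if ("class", "publication") in attrs:
            self._in_pub = True

    def handle_endtag(self, tag):
        self._in_pub = False

    def handle_data(self, data):
        if self._in_pub and data.strip():
            self.publications.append(data.strip())

html = """
<ul>
  <li class="publication">Web Data Extraction Survey, 2013</li>
  <li class="publication">Ontology-based Competency Management, 2005</li>
  <li>Office hours: Tue 10-12</li>
</ul>
"""
extractor = PublicationExtractor()
extractor.feed(html)
print(extractor.publications)
```

The two classified entries are returned while the unclassified list item is ignored; on pages where such markers are missing, a tool has to fall back on heuristics like the plain-text parsing problem described in Section 1.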
Commercial tools like Mozenda, Visual Web Ripper, and Lixto provide good functionality and are user-oriented. The scientific literature contains numerous further approaches to web data extraction, but many of them are not yet fully implemented. One of the most prominent examples of systems coming from the
academic research field is Lixto. Lixto is a typical visually aided, state-of-the-art web data extraction system in which the user simply selects visually the data that should be extracted; usually, no programming knowledge is required. [2]

Mozenda is a practical tool for basic users. It has a pleasant interactive user interface and a powerful browser from which data is selected for the extraction process. Mozenda supports scheduled extractions and provides several data output formats. [3]

Visual Web Ripper is an excellent tool for automated web scraping. It extracts complete data structures, such as product catalogues. If needed, Visual Web Ripper can repeatedly submit forms for all possible input values, which is important for multiple searches. [4]

Web-Harvest is a tool written in Java. It offers a way to collect the desired web pages and extract useful data from them. Web-Harvest mainly focuses on HTML/XML-based web sites, which still make up the vast majority of web content. It can also easily be supplemented with custom Java libraries in order to augment its extraction capabilities. [5]

Other approaches and tools not mentioned in this proposal will also be analyzed and compared. Formal ontologies for the management of competencies have already been proposed. Nonetheless, more work is required to clarify the concept of competency and to provide integrative schemas for competencies. [6]

References

[1] Narwal, N., "Improving web data extraction by noise removal," Fifth International Conference on Advances in Recent Technologies in Communication and Computing (ARTCom 2013), pp. 388-395, 20-21 Sept.
2013. doi: 10.1049/cp.2013.2241. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6843017&isnumber=6835102

[2] "State-of-the-art web data extraction systems for online business intelligence." vol. 64 (2013): 145.
[3] Mozenda.com (2014). Web Data Scraping Videos, Web Data Mining Videos, Screen Scraper Video Tutorials. [online] Available at: https://www.mozenda.com/features [Accessed 3 Sep. 2014].

[4] Visual Web Ripper Review (2012). [online] Web Scraping. Available at: http://scraping.pro/visual-web-ripper-review/ [Accessed 3 Sep. 2014].

[5] Web-harvest.sourceforge.net (2014). Web-Harvest Project Home Page. [online] Available at: http://web-harvest.sourceforge.net/ [Accessed 3 Sep. 2014].

[6] Sicilia, M.-A. (2005). Ontology-based Competency Management: Infrastructures for the Knowledge Intensive Learning Organization.