Master Thesis Proposal




Web Data Extraction of University Staff Competencies

Edin Zildzo, 1125449
Supervisor: Ao.Univ.Prof.Dr. Jürgen Dorn
September 11, 2014

1 Problem Statement

Web data extraction is a challenging process because web pages combine complex data structures with unstructured data. Pages that vary widely in style and code, or that violate web standards, are considered unstructured and complex for the data extraction process. Web pages can be categorized by how their information is produced: some display static text, others retrieve information from a backend database dynamically at runtime, and some run complex scripts to generate data at display time. Finally, the complete web page can be viewed as a combination of different types of content displayed in the form of Visual Blocks inside the web browser window. [1]

A common problem for web data extraction tools is the structure of data on the website. When data is written as plain text without classifiers, it is very difficult to identify what the text sections represent.

On university web pages, most faculty members have their own list of publications, which shows their expertise in a particular area. Some publications are not available on the university web page, so it will be necessary to browse digital libraries in order to get a complete list of publications for a particular author. The extracted data will need to be filtered, refined and analyzed in order to obtain knowledge about the competences of faculty members. Publications will be analyzed based on an ontology with competence concepts.
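To make the idea of analyzing publications against an ontology more concrete, the following minimal Python sketch maps publication titles to competence concepts. It is only an illustration under stated assumptions: the flat dictionary `COMPETENCE_ONTOLOGY`, its concepts and indicator terms, the function `competences_for` and the sample titles are all hypothetical. The ontology to be designed in this thesis may look quite different.

```python
import re
from collections import Counter

# Hypothetical mini-ontology: competence concept -> indicator terms.
# A real ontology would carry hierarchies and relations; this flat
# mapping is an assumption made purely for illustration.
COMPETENCE_ONTOLOGY = {
    "Web Data Extraction": {"web scraping", "data extraction", "wrapper"},
    "Ontology Engineering": {"ontology", "semantic web"},
    "Machine Learning": {"classification", "clustering", "neural network"},
}


def competences_for(titles):
    """Count, per competence concept, how many titles mention an indicator term."""
    counts = Counter()
    for title in titles:
        # Lowercase and collapse whitespace so matching is case-insensitive.
        text = re.sub(r"\s+", " ", title.lower())
        for concept, terms in COMPETENCE_ONTOLOGY.items():
            if any(term in text for term in terms):
                counts[concept] += 1
    return counts


# Made-up publication titles for one hypothetical staff member.
titles = [
    "A Wrapper Induction Approach to Web Data Extraction",
    "Ontology-based Competency Management",
    "Clustering Publications with an Ontology of Research Topics",
]
profile = competences_for(titles)
for concept, hits in profile.most_common():
    print(f"{concept}: {hits} publication(s)")
```

Plain substring matching is chosen here only for brevity. It ignores synonyms, word forms and context, which is one reason why the extracted data would still need filtering and refinement.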

Competency management can be seen as one of the foundations of learning activities in knowledge intensive organizations. As a critical point in the functioning of knowledge management, competencies require a representational framework that is rich enough to support effective and efficient processes of competency search, matching and analysis. [6]

2 Expected Results

The outcome of this thesis will be an evaluation of existing web data extraction tools and the design and implementation of software for data extraction and data analysis of faculty members and their publications, based on the defined ontology. First, the requirements for the software will be identified and used to design it. The software will then be implemented according to this design, evaluated, and compared to existing extraction tools. Extracted data will be stored in a database and refined in order to derive the competences of selected faculty members; it could also be used for further processing (e.g., a knowledge extraction process).

3 Methodology and Approach

The methodology will consist of:

1. Search and Analysis of Literature
The literature needs to provide in-depth information in the area of web data extraction.

2. Designing a Software for Web data extraction
In order to determine the requirements of the software, existing tools and approaches will be analyzed. Based on the results of this analysis, the software for web data extraction will be designed. The use case will be a specific institute of the Faculty of Informatics at the Vienna University of Technology.

3. Designing an Ontology with Competence concepts

An ontology will be designed and used for the analysis of publications in order to obtain the competences of faculty members.

4. Implementation of a Software
Software will be implemented that extracts data for further analysis. The data analysis will be based on competence concepts. Technologies which might be used for the implementation are Python, Selenium, CSS selectors (for data navigation) and PHP. Python is a powerful programming language that offers relevant functions for deep navigation of web pages. Selenium enables browser automation from languages such as Java and Python; it drives a web browser from program code and makes it possible to read and manipulate data from websites. The technology will be chosen based on research and a comparison of available web data extraction methods, in order to select the method that provides the most accurate results.

5. Evaluation of Results
The implemented software will be evaluated and the results of this work will be analyzed. In order to evaluate the results of the extraction, a questionnaire/survey will be carried out among university staff to check whether the extracted data used for the assessment of competencies matches the competency data that staff members provide in the survey. The survey sample will consist of randomly chosen staff members from the database.

4 State of the Art

Nowadays there are many commercial web data extraction tools, and their functionality is mostly similar. Some tools provide more functionality than others, but the core problem remains: the structure of various web pages. Most tools can detect common structures and extract data efficiently, but problems arise in unusual cases such as page sections that are not properly marked, text sections that are not classified, or dynamic data generated in a complex way.
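The difficulty with page sections that are not properly marked can be illustrated with a minimal Python sketch based on the standard library module `html.parser`. Everything specific in it is an assumption made for illustration: the class name `publication`, the two sample HTML fragments and the placeholder publication entries are hypothetical and do not come from any real university page.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PublicationParser(HTMLParser):
    """Collect the text of every element whose class list contains 'publication'."""

    def __init__(self):
        super().__init__()
        self.publications = []
        self._depth = 0    # greater than 0 while inside a marked element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or "publication" in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Join the collected text and normalize whitespace.
            self.publications.append(" ".join("".join(self._buffer).split()))
            self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract(html):
    parser = PublicationParser()
    parser.feed(html)
    parser.close()
    return parser.publications


# Hypothetical page where each entry is marked with a class.
MARKED = """
<ul>
  <li class="publication">Author, A. <i>Example Title One</i> (2014)</li>
  <li class="publication">Author, B. Example Title Two (2013)</li>
</ul>
"""

# The same entries as one unmarked run of plain text.
UNMARKED = "<p>Author, A. Example Title One (2014) Author, B. Example Title Two (2013)</p>"

print(extract(MARKED))
print(extract(UNMARKED))
```

On the marked fragment the parser returns the two entries separately, while on the unmarked fragment it finds nothing. Recovering the entries there would require heuristics on the text itself, which is the kind of case in which the tools discussed here run into problems.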
Some commercial tools such as Mozenda, Visual Web Ripper and Lixto provide good functionality and are user oriented. In the scientific literature there are numerous approaches to web data extraction, but they are not yet fully implemented. One of the most prominent examples of systems coming from the academic research field is Lixto. Lixto is a typical visually aided state-of-the-art web data extraction system in which the user is asked to simply visually select the data that should be extracted. Usually, no programming knowledge is required. [2]

Mozenda is a practical tool for basic users. It has an interactive user interface and a powerful browser from which data is selected for the extraction process. With Mozenda it is possible to schedule extractions, and it provides several data output formats. [3]

Visual Web Ripper is a tool for automated web scraping. It extracts complete data structures, such as product catalogues. If needed, Visual Web Ripper can repeatedly submit forms for all possible input values, which is important for multiple searches. [4]

Web-Harvest is a tool written in Java. It offers a way to collect desired web pages and extract useful data from them. Web-Harvest mainly focuses on HTML/XML based web sites, which still make up the vast majority of web content. It can also be easily supplemented by custom Java libraries in order to augment its extraction capabilities. [5]

Other approaches and tools not mentioned in this proposal will also be analyzed and compared.

Formal ontologies for the management of competencies have already been proposed. Nonetheless, more work is required in the clarification of the concept of competency and also in providing integrative schemas for competencies. [6]

References

[1] Narwal, N., "Improving web data extraction by noise removal," Fifth International Conference on Advances in Recent Technologies in Communication and Computing (ARTCom 2013), pp. 388-395, 20-21 Sept. 2013. doi: 10.1049/cp.2013.2241. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6843017&isnumber=6835102
[2] ...-of-the-art web data extraction systems for online business intelligence." 64 tomas (2013): 145.

[3] Mozenda.com, (2014). Web Data Scraping videos, Web Data Mining Videos, Screen Scraper Video Tutorials. [online] Available at: https://www.mozenda.com/features [Accessed 3 Sep. 2014].
[4] Ripper, V. (2012). Visual Web Ripper Review. [online] Web Scraping. Available at: http://scraping.pro/visual-web-ripper-review/ [Accessed 3 Sep. 2014].
[5] Web-harvest.sourceforge.net, (2014). Web-Harvest Project Home Page. [online] Available at: http://web-harvest.sourceforge.net/ [Accessed 3 Sep. 2014].
[6] Sicilia, M.-A. (2005), Ontology-based Competency Management: Infrastructures for the Knowledge Intensive Learning Organization.