Chapter III. Historical Background




Various research works related to web mining, web content mining, query enhancement, educational tools, the educational perspective of web content mining, security issues for kids searching the web, adaptation of web content to mobile devices, and other topics were studied. The most important related research papers are presented here.

3.1 Different Approaches for Web Mining

Kolari and Joshi [120] present an overview of past and current work in the three main areas of web mining research (content, structure, and usage), as well as emerging work in semantic web mining. The authors also discuss privacy issues, distributed web mining, and semantic web mining in relation to web mining. The focus is on the adaptation and evaluation of web sites based on the content searched; finally, they present a semantic web concept for better extraction of data. They also discuss the semantic information the Semantic Web provides: exposing content semantics and links explicitly can help in many tasks, including mining the hidden Web, that is, data stored in databases and not accessible through search engines.

Raymond Kosala and Hendrik Blockeel [129] survey the research carried out in the area of web mining by various authors working in the field. They describe three categories of web mining (web content mining, web structure mining, and web usage mining) and the research areas related to these three categories. The authors explain the web mining categories from a database perspective, classified as the IR view and the DB view, and compare all three categories based on these views.

Cooley and Srivastava [130] discuss web mining in two distinct ways and develop a taxonomy of the various ongoing efforts related to it. The first, called web content mining in their paper, is the process of information discovery from sources across the World Wide Web.
The second, called web usage mining, is the process of mining for user browsing and access patterns. The authors define web mining, present an overview of the various research issues, techniques, and development efforts, and focus on web usage mining. The paper also describes WEBMINER, a system for web usage mining, and concludes by listing open research issues.

Appelt and Israel [42] give a detailed view of the extraction of information from current web databases. They cover almost all aspects of information extraction and elaborate every situation with the help of an example. The authors focus on MUC, evaluation metrics, knowledge extraction systems, the components of an information extraction system, and many other topics.

Srivastava et al. [76] give a brief overview of the developments that have taken place in the field of web mining over the last five years. They elaborate on the various areas of web mining, its applications, and the various contributions, along with future directions.

3.2 Approaches used in Web Content Mining

Azmy [99] discusses various issues related to web content mining. The author classifies web content mining on the basis of the various types of data available, such as structured, unstructured, and semi-structured data. Various real-life applications that use web content mining are also listed in the paper.

Kushmerick et al. [112] introduce a method for automatically constructing wrappers, called wrapper induction. They also define the HLRT bias and the use of heuristic knowledge to compose the algorithm's oracle. The authors use PAC analysis to bound the problem's sample complexity and show that the system degrades gracefully with imperfect labelling knowledge.

Yu, Cai et al. [45] present a new approach to extracting web content structure based on visual representation, which uses an automatic, top-down, tag-tree-independent approach to detect web content structure. The paper reports experiments showing that the presented technique helps in web adaptation, information retrieval, and information extraction.

Arasu [11] focuses on the problem of automatically extracting database values from a common template used by web pages, without any human intervention. The author discusses an algorithm that helps in the above-stated task.
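The HLRT wrapper class that Kushmerick et al. learn consists of a head delimiter, a tail delimiter, and a (left, right) delimiter pair per attribute. A minimal sketch of how a learned HLRT wrapper extracts tuples follows; the delimiters and the page are toy examples, not taken from [112]:

```python
def hlrt_extract(page, h, t, pairs):
    """Extract tuples with an HLRT wrapper: skip past the head
    delimiter h, stop at the tail delimiter t, and repeatedly pull
    out each attribute between its (left, right) delimiter pair."""
    body = page[page.index(h) + len(h): page.index(t)]
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in pairs:
            start = body.find(left, pos)
            if start == -1:
                return tuples          # no more records
            start += len(left)
            end = body.index(right, start)
            row.append(body[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))

page = "<html>Results<b>Congo</b><i>242</i><b>Egypt</b><i>20</i>End</html>"
rows = hlrt_extract(page, "Results", "End", [("<b>", "</b>"), ("<i>", "</i>")])
# rows == [("Congo", "242"), ("Egypt", "20")]
```

The induction task in [112] is then to choose h, t, and the delimiter pairs from labelled example pages, which this sketch takes as given.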
The algorithm works on the concepts of equivalence classes and differentiating roles.

Pinto et al. [47] explore the extraction of data from tables embedded within web pages. The authors discuss the use of conditional random fields (CRFs) for table extraction and compare them with hidden Markov models (HMMs). With the help of experiments, they show the improvement of CRFs over generative models (HMMs) and conditionally trained stateless models.

Zhai and Liu [181] propose a new method of web content mining, namely instance-based learning, which performs extraction by comparing each new instance to be extracted with labelled instances. This method avoids the problem of inductive learning (wrapper induction), which requires an initial set of labelled pages to learn extraction rules. Experimental results with product data extraction from 24 diverse web sites show that the approach is highly effective.

Gupta and Kaiser [135] propose a new approach to content extraction using the Document Object Model tree. Working on the DOM tree rather than raw HTML markup enables the user to perform content extraction that identifies and preserves the original data instead of summarizing it. The authors have implemented the approach in a publicly available web proxy to extract content from HTML web pages.

In another paper, Zhai and Liu [16] propose a more effective technique for automatic data extraction from web pages. Given a page, their method first builds a tag tree based on visual information. It then performs a post-order traversal of the tree, matching subtrees along the way using a tree edit distance method and visual cues. The method enables accurate alignment and extraction of both flat and nested data records, and experimental results show that it performs data extraction accurately.

Wang and Lochovsky [74] describe a system called DeLa, which reconstructs (part of) a "hidden" back-end web database. It does this by sending queries through HTML forms, automatically generating regular expression wrappers to extract data objects from the result pages, and restoring the retrieved data into an annotated (labelled) table. The whole process needs no human involvement and proves to be fast (less than one minute of wrapper induction per site) and accurate (over 90% correctness for data extraction and around 80% correctness for label assignment).

Buttler and Liu [36] present a fully automated object extraction system, Omini. A distinct feature of Omini is its suite of algorithms and automatically learned information extraction rules for discovering and extracting objects from dynamic or static web pages that contain multiple object instances. The authors evaluated the system on more than 2,000 web pages over 40 sites.

Chang and Lui [23] propose IEPAD, a system that automatically discovers extraction rules from web pages. The system can automatically identify record boundaries by repeated pattern mining and multiple sequence alignment. The discovery of repeated patterns is realized through a data structure called PAT trees. Additionally, repeated patterns are further extended by pattern alignment to cover all record instances. This approach to IE involves no human effort and no content-dependent heuristics. Experimental results show that the constructed extraction rules achieve 97 percent extraction accuracy over fourteen popular search engines.

Crescenzi et al. [165] investigate techniques for extracting data from HTML sites through the use of automatically generated wrappers. To automate wrapper generation and the data extraction process, the authors develop a novel technique that compares HTML pages and generates a wrapper based on their similarities and differences. Experimental results on real-life data-intensive web sites confirm the feasibility of the approach.

Rosenfeld et al. [19] describe a general procedure for structural extraction, which allows automatic extraction of entities from a document based on their visual characteristics and relative position in the document layout. The structural extraction procedure is a learning algorithm which automatically generalizes from examples. The procedure is a general one, applicable to any document format with visual and typographical information. The authors also describe a specific implementation of the procedure for PDF documents, called PES (PDF Extraction System), which is able to extract fields such as author(s), title, and date with very high accuracy.

Lan and Liu [89] propose a technique to clean web pages of ads and other links using web mining. The authors propose a compressed tree structure which captures the commonality of web pages; based on these commonalities, they assign a weight to each node and extract the content.
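The subtree-matching step that Zhai and Liu's approach relies on can be illustrated with Simple Tree Matching, a standard dynamic-programming measure for ordered labelled trees; the sketch below is a generic version of that measure, not the authors' exact routine:

```python
def simple_tree_match(a, b):
    """Simple Tree Matching: count the maximum number of matching
    node pairs between two ordered, labelled trees, where trees are
    (label, [children]) tuples. Children are aligned in order with
    a longest-common-subsequence-style dynamic program."""
    if a[0] != b[0]:
        return 0
    ca, cb = a[1], b[1]
    m, n = len(ca), len(cb)
    # dp[i][j]: best match using the first i children of a, first j of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(dp[i-1][j], dp[i][j-1],
                           dp[i-1][j-1] + simple_tree_match(ca[i-1], cb[j-1]))
    return 1 + dp[m][n]

# two table rows with a different number of cells
t1 = ("tr", [("td", []), ("td", []), ("td", [])])
t2 = ("tr", [("td", []), ("td", [])])
print(simple_tree_match(t1, t2))   # 3: the tr plus two aligned td cells
```

A high score between two subtrees suggests they are instances of the same record template, which is what the alignment step exploits.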
Bing Liu et al. [21] present an automatic algorithm to mine both contiguous and non-contiguous data records. No manual procedure is used. The algorithm relies on two important observations about data records and on a string matching algorithm.

Ajoudanian et al. [145] discuss various issues related to knowledge extraction from the deep web and present a technique to extract knowledge from the web using a correlation mining approach.

Bin and Chang [63] evaluate various 1:1 schema matching algorithms and present a technique for matching schemas using correlation mining. In particular, the authors have developed the DCM framework, which consists of data pre-processing, dual mining of positive and negative correlations, and finally matching of web query interfaces. They have integrated various automatic techniques for extracting the interfaces.

Bergman [97] gives a detailed view of the content available on the net, deep under the various links and databases. The author has conducted various studies and surveys whose results clearly state that the deep web is very vast and needs to be mined to obtain a correct view of the data.

Ajoudanian and Jazi [146] present a new system that extracts information from the deep web automatically. The algorithm developed by the authors works in two steps: in the first step it extracts information from query interfaces, and in the second step it matches them with the online databases. It uses a clustering technique to extract and match the content in the database.

Laender et al. [7] describe a heuristic-based automatic method for extraction of objects. The authors focus on domain ontology to achieve high accuracy, but the approach incurs a high cost.

Embley and Jiang [43] propose a method to detect informative blocks in web pages. However, their work is limited by two assumptions: (1) the coherent content blocks of a web page are known in advance, and (2) similar blocks of different web pages are also known in advance, which is difficult to achieve.

Lin and Ho [147] propose a frequency-based algorithm for the detection of templates or patterns. However, they are not concerned with the actual content of the web page.

Yossef and Rajagopalan [182] focus only on structured data, whereas web pages are usually semi-structured.

Lee and Ling [98] enhance the HITS algorithm of [64] by evaluating the entropy of anchor text for important links.
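The entropy weighting of anchor text used by Lee and Ling can be illustrated with a small sketch; the exact formulation in [98] may differ, so the Shannon entropy over pooled anchor terms below is an assumption:

```python
import math
from collections import Counter

def anchor_entropy(anchor_texts):
    """Shannon entropy of the term distribution over the anchor texts
    pointing at a page: low entropy means incoming links describe the
    page consistently, which can be taken as a sign of an important,
    well-characterized link target."""
    terms = [t for text in anchor_texts for t in text.lower().split()]
    total = len(terms)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(terms).values())

consistent = anchor_entropy(["mining", "mining", "mining"])      # == 0.0
diverse = anchor_entropy(["mining", "search", "index", "rank"])  # == 2.0
```

A link whose anchors score low entropy would then receive a higher weight in the HITS computation than one described by scattered, unrelated terms.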
3.3 Information retrieval for Kids

Jochmann et al. [66] discuss various characteristics of kids that one should keep in mind while developing any information retrieval system for kids. Experiments and results are reported using adult-oriented search interfaces.

Duarte [148] focuses on the information needs of kids by examining their search behaviour. The author tries to answer a few questions that need to be addressed by any IR system for kids.

Duarte et al. [149] analyse the sessions and queries reflecting kids' information needs and compare them with general queries and sessions. The authors enrich the AOL query log by incorporating the results of kids' queries.

Eickhoff et al. [29] present the use of query assistance and search moderation techniques so that kids have a better experience searching the web. The authors also focus on interface design for kids.

Duarte et al. [159] analyse a large query log from a commercial search engine and identify the problems related to child search behaviour. The target audience of their work ranges from children of age 6 to adults of 18. They also study search difficulties based on query metrics.

Sandra Hirsh [58] presents a study of 64 fifth-grade students who used a science library catalogue for searching content on the web. The study highlights the problems kids face while searching and possible solutions.

Carsten Eickhoff et al. [30] present an automatic way of identifying web pages suitable for kids. The focus is on child psychology and cognitive science. The authors investigate the potential of combining topical and non-topical aspects of identifying age-appropriate content for kids.

Carsten Eickhoff et al. [31] discuss the cognitive specifics of children and the way these can be encoded for classification. The authors work on two dimensions: child friendliness and focus toward child audiences.

Hauff and Trieschnigg [32] discuss Project Gutenberg's effort to make classic literature available to children in a secure way.

Glassey et al. [131] present an interaction-based information filtering system for kids. The system focuses on user interaction modelling, user evaluation, automatic detection of child-friendly information, etc.
Gyllstrom et al. [84] present a system named Tad Polemic, which assists children in searching the web for difficult topics and also provides filtering of content based upon the child's interest and age.

Jochmann et al. [77] conducted a study to gather quantitative and qualitative data about children's interaction with web search engines. They found that kids perform poorly on metaphorical interfaces and well on Google.

Eickhoff and Vries [33] present a paradigm for identifying suitable videos for kids on YouTube on the basis of various features such as user reviews, comments, author information, and community information.

Kalsbeek and Wit [100] try to uncover methods and techniques that can be used to automatically improve search results for queries formulated by children. The authors present a prototype of a query expander that implements several of these techniques.

3.3.1 Mobile based Educational tools for retrieval of content for kids

The software [51] is Apple's tool for creating digital content, getting that content to students, and letting them play it back anytime, anywhere. It even introduces students to educational mobile applications for iPod touch and iPhone, so that students can access reference information, write blog posts, develop physics models, or simulate flying over the earth.

The software [35] is a calculator implemented as a MIDlet. It is a quadratic equation solver which solves equations of the form Ax²+Bx+C. It helps in the calculation of mathematical equations and can also work with 5-digit quadratic equations.

The software [180], Yahoo! OneSearch or Google search, is search enhanced for mobile users. Users can search for anything from stock quotes to celebrity news, sports scores, or movie reviews and get the most current, relevant answers every time they search. These search engines understand the user's intent and remember the user's location, giving answers tailored to where the user is.
Another initiative [102] is Mobile Education by Tata Indicom. To promote education in the remotest corners of the nation, the company has partnered with SNDT Women's University, ATOM Tech (Any Transaction on Mobile), and Indian PCO Teleservices (IPTL). In this alliance, SNDT University will develop and manage content, Tata Indicom will be the carrier on its service channels, ATOM will provide the intermediary interfaces, and IPTL will look after the service distribution and dissemination system. M-Education will offer contemporary content to students and do away with the need to visit physical schools and colleges, thus bridging physical distances using CDMA technology.

Miriam Held [103] encourages parents and teachers to allow kids to use mobile devices as educational tools for learning various aspects of life, as mobiles are handy, always available, and personal to use.

Petra Wentzal et al. [126] conducted online surveys and interviews with three major universities in the UK, and also worked on the GIPSY and MALENO projects, to gather information about students' current frame of mind regarding mobile-based education and ways to enhance it, offering various suggestions.

Molnar and Martínez [13] propose a system where games are created by game designers and educational content by teachers, and both are brought together in a seamless manner. This novel approach facilitates the use of educational games without the need for programming skills and guarantees that teachers can easily create the educational content that goes into the mobile games.

Lalita S. Kumar et al. [90] present a study conducted at IGNOU to illustrate the issues in the current learning process of students and the improvements that can be made with the use of mobile technology. The paper reports the findings of a study conducted to analyse the effect of mobile device intervention on student support services and to gauge its use for enhancing the teaching-learning process, as a future study in the context of distance education programmes.

Another important paper, by Matthew Kam et al. [104], presents educational games on mobiles for kids who do not have access to school because of family problems. The authors give future directions for designing educational games that target less well-prepared children in developing regions.

Jill O'Neill [79] presents a comparative analysis of users' views on searching the web with mobiles versus desktop computers, illustrating the benefits and drawbacks of both.
Kelly [105] presents a comparative analysis of printed books with e-readers available online and the changes that have taken place in users' reading habits over the past few years, from books to desktops to iPads.

Marc Prensky [106] highlights the utility of the cell phone by focusing on all the types of content a cell phone can display, such as voice, video, text, graphics, and animation.

Roksana Begum [134] investigates the potential of cell phone use as an instructional tool in EFL classrooms in Bangladesh. The author collected data through student questionnaires, teacher interview records, and classroom observation reports. The research results demonstrate that the cell phone has great potential as an instructional tool despite some challenges, which can be resolved by the sincere efforts of the authorities and teachers, and by changing the ethical point of view that considers cell phones merely a disturbing factor in the classroom.

Another paper, by Wendeson et al. [153], presents a survey of 90 undergraduate students of Universiti Teknologi PETRONAS (UTP) to identify students' perception of m-learning. The results show that the students are willing to use m-learning: their acceptance level is high, and the respondents largely accept m-learning as a method of teaching and learning that can improve educational efficiency by complementing traditional learning at UTP.

Valk et al. [80] surveyed the results of six m-learning projects in developing countries of Asia to review the evidence of the role of mobile-phone-facilitated m-learning in contributing to improved educational outcomes. The authors examine the extent to which the use of mobile phones helped to improve educational outcomes in two specific ways: (1) improving access to education, and (2) promoting new learning.

3.3.2 Security issues related to IR for kids

An important paper on child online safety [44], related to an upcoming project on child online security, focuses on the various issues necessary to ensure child security and covers three aspects: technological protection, parental protection, and self-protection. The project is in its initial phases and focuses on child security in developing countries.

Another related paper, by Nancy Kranich [106], highlights various problems related to filters, such as under-blocking, over-blocking, and age restrictions, and recommends guidelines to ensure online security for kids.

In paper [57], the author discusses various web protection methods.
In one more paper [50], the author discusses the use of the Internet in public schools, the security measures applied while accessing the Internet in these schools, and the potential problems these schools face while applying various security checks, i.e., the impact of these security filters on students, teachers, and legal proceedings.

The Electronic Privacy Information Center [52] has tried to identify the impact of software filters using a traditional search engine and a new search engine advertised as the "world's first family-friendly Internet search site". The search was conducted using 100 sample terms. The study concluded that the filtering mechanism prevented children from obtaining a great deal of useful and appropriate information that is currently available on the Internet.

Amanda Lehrat [14] reports the results of a survey conducted by Pew Internet, which showed that 70% of teens between the ages of 12 and 16, and their parents, use the Internet with some sort of monitoring software.

Stol and Kaspersen [175] discuss an investigation of the technical and legal possibilities of filtering and blocking child pornographic material on the Internet. The methods of investigation used are desk research (literature, documents, and media websites) and semi-structured interviews with experts.

The paper [15] presents a fact sheet designed to help parents talk to their children about online safety and about protecting their identity from criminals. The fact sheet was prepared by the Australian Bankers' Association (ABA) and the Australian Federal Police (AFP).

Another webpage [54] focuses on the significance of content filtering tools for children's access to web sites, the limitations of content filtering software, and the effectiveness of blacklists as a filtering tool.

The webpage [67] presents a discussion on how to provide school children with access to the Internet and yet keep them safe.

3.3.3 Query enhancement techniques related to kids

Dwivedi and Govil [143] present a model to enhance the query at the user level. The highlight of the model is that it includes NLP techniques to enrich the user's experience of entering the query and also uses databases of synonyms for semantic searching. Above all, the remaining modules of the search engine, such as ranking and tokenization, are not altered.

Duarte et al. [151] analyse groups of queries suitable for kids. The aims of the analysis are: (i) to identify differences in the query space, content space, user sessions, and user click behaviour; and (ii) to enhance the query log by including annotations of queries, sessions, and actions.
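The user-level synonym expansion of [143] can be sketched roughly as follows; the synonym table and the OR query syntax are illustrative assumptions, and the downstream engine is left untouched, as the model requires:

```python
def expand_query(query, synonyms):
    """User-level query enhancement: each query term is OR-ed with
    its known synonyms before the query is handed, unchanged in all
    other respects, to the underlying search engine."""
    parts = []
    for term in query.lower().split():
        alts = [term] + synonyms.get(term, [])
        parts.append("(" + " OR ".join(alts) + ")" if len(alts) > 1 else term)
    return " ".join(parts)

# toy synonym database standing in for the model's semantic resources
synonyms = {"kids": ["children", "minors"], "films": ["movies"]}
expanded = expand_query("films for kids", synonyms)
# "(films OR movies) for (kids OR children OR minors)"
```

Because only the query string is rewritten, ranking and tokenization modules behave exactly as before, which is the property the model emphasizes.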
Sieg et al. [9] present ARCH, an interactive query formulation aid based on conceptual categories. The user's query is reformulated to include categories that the user recognizes as important and to exclude those that are not.

Yonggang and Frei [179] present a probabilistic query expansion model based on a similarity thesaurus which is constructed automatically. The authors discuss two important issues with query expansion: the selection and the weighting of additional search terms. They show that query expansion results in a notable improvement in retrieval effectiveness when measured using both recall-precision and usefulness.

Stanković [132] discusses issues related to the improvement of queries using a rule-based procedure implemented in WS4LR, a workstation for manipulating heterogeneous lexical resources developed by the Human Language Technology Group at the University of Belgrade. The paper presents the automatic production of lemmas for a morphological dictionary from a given list of compounds, and its evaluation on several different sets of data.

Rungsawang and Tangpong [10] propose a novel query expansion technique which employs association rules previously mined from the collection. In this method, each user-submitted query that matches the left-hand side of a rule is appended with the terms on its right-hand side. The authors use the Apriori algorithm and association rule mining to validate the results.

Hafernik [34] explores geospatial information in queries to improve retrieval by automatically disambiguating geospatial terms within the queries using outside geospatial knowledge gathered from the Internet, including city names, countries, regions, parts of countries, and location information. The approach combines simple linguistic analysis with query modification via the addition of geospatial information.

Konishi [85] presents a patent retrieval system that extracts patent terms from documents. The main scope of the method is appropriate query expansion to improve recall. Query terms are extracted from the topic claim and then expanded with terms extracted from sentences in the patent document that explain the topic claim.
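The association-rule expansion of [10] can be sketched as follows; the rules below are invented for illustration, whereas in the actual method they would be mined from the collection with Apriori:

```python
def expand_with_rules(query_terms, rules):
    """Association-rule query expansion: when the query covers the
    left-hand side of a mined rule, the terms on the rule's
    right-hand side are appended to the query."""
    q = set(query_terms)
    added = []
    for lhs, rhs in rules:
        if lhs <= q:                     # query matches the rule's LHS
            added += [t for t in rhs if t not in q and t not in added]
    return list(query_terms) + added

# hypothetical rules: LHS term set -> RHS terms to append
rules = [
    (frozenset(["web", "mining"]), ["crawler", "extraction"]),
    (frozenset(["java"]), ["jvm"]),
]
expanded = expand_with_rules(["web", "mining"], rules)
# ['web', 'mining', 'crawler', 'extraction']
```

Only rules whose entire left-hand side is covered by the query fire, so an unrelated rule (here the "java" one) leaves the query unchanged.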
Poblete [22] presents a study of various applications of Web query mining that help improve search engine ranking, Web information retrieval and Web site design. The key idea is to take advantage of the implicit feedback users leave behind while navigating the Web. Mike Smit [101] presents Quoogle, an enhancement to Google. It is an IR-based model that enriches existing Google results with suggested keywords: Quoogle downloads the first 100 results returned by a short query and performs standard text analysis to extract additional keywords.

White & Jose [133] propose novel methods for result presentation, query modification, retrieval strategy selection and evaluation. These methods support effective information access and assist searchers in formulating query statements and in making decisions on how to use those queries. Although the Web is used as the document collection for this investigation, the findings are potentially generalisable to other document domains.

Sawant [12] describes a semantic approach to web search using a stand-alone Java application. An Ontology Web Language (OWL) model is used to build a knowledge base about different types of organisms, the goal being to guide the Google web search engine using this OWL model. In the first approach to semantic web search an inference engine called CLIPS is used, and in the second approach the Protege-OWL API is used.

Crous & Bishop [160] propose a framework that enables automated agents to search the semantic web. The paper demonstrates the use of RDF query based Semantos as well as query enhancement services, which help the search agents improve the quality of the results they generate.

Ju Fan et al. [78] suggest query terms for the improvement of IR systems. The authors use a tree structure to access terms by prefix and devise a progressive ranking algorithm to find the top-k terms efficiently.

Hollink & Tsikrika [166] present a method that exploits `linked data' to determine semantic relations between consecutive user queries. Applying this method to the logs of an image search engine revealed interesting usage patterns, such as users often searching for two entities that share a property. Ruthven [72] views relevance feedback (RF) as a process of explanation and uses abductive inference to provide a framework for an explanation-based account of RF.
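The prefix-tree idea behind top-k term suggestion can be illustrated with a small trie. This is a simplified sketch under assumed data structures: it enumerates all completions under a prefix and ranks them with a heap, whereas a progressive ranking algorithm such as the one attributed to Ju Fan et al. would avoid the exhaustive enumeration.

```python
import heapq

class TrieNode:
    def __init__(self):
        self.children = {}
        self.term = None      # set when a complete term ends at this node
        self.score = 0.0      # e.g. a query-log frequency

class TermTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, score):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.term, node.score = term, score

    def top_k(self, prefix, k):
        """Return the k highest-scoring terms sharing `prefix`."""
        node = self.root
        for ch in prefix:                     # walk down to the prefix node
            if ch not in node.children:
                return []
            node = node.children[ch]
        heap = []
        stack = [node]                        # collect all completions below it
        while stack:
            n = stack.pop()
            if n.term is not None:
                heap.append((-n.score, n.term))
            stack.extend(n.children.values())
        heapq.heapify(heap)                   # rank by descending score
        return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]

trie = TermTrie()
for term, score in [("informatics", 2.0), ("information", 9.0),
                    ("informal", 4.0), ("index", 7.0)]:
    trie.insert(term, score)
print(trie.top_k("info", 2))   # ['information', 'informal']
```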
Hiramatsu & Satoh [86] present a two-phase query modification using ontology for geographic information navigators. The authors give an outline of the prototype system and examine the effect of the query modification through an example.

3.4 Adaptation of content for mobile devices

Yang & Li [177] focus on a P2P collaborative deployment scheme that divides web pages into small logical blocks so that existing web pages can be displayed through web adaptation engines. The webpage [119] discusses various issues of browsing and searching on mobile devices, such as smaller screens, the typing limitations of phone keypads and the cost of spending long periods scrolling through mobile search results.

Best [39] discusses the mobile web, which is part of the WWW and carries almost the same features as the web: a rich user experience, user participation, dynamic content, metadata, web standards, scalability, openness, freedom and collective intelligence by way of user participation. Banerjee et al. [152] summarize features of mobile web access, such as WAP and i-mode (Japan), from a mobile device such as a cell phone, PDA or other portable gadget connected to a public network; such access requires neither a desktop computer nor a fixed landline connection. Another website, Wikimedia [174], discusses mobile Web access and the problem of incompatibility between its formats and the information available on the Internet.

Economides [8] proposes a model for adaptation engines with seventeen criteria for evaluating a web adaptation engine. Chua et al. [65] discuss the differences between single-user browsing and co-browsing and propose a content adaptation framework based on the concepts of a shared viewpoint and a personal viewpoint. Yang & Li [178] draw on the thumbnail view concept, the VIPS method and AJAX, and propose a dynamic Web page adaptation for mobile devices.

Buyukkokten & Paepcke [117] focus on searching from a PDA using single-page web sites. The authors introduce a power browser that increases the productivity of mobile users searching the web from mobile devices. Kaljuvee et al. [118] propose a design for displaying and manipulating HTML forms on mobile devices. The authors develop eight algorithms for matching labels to form widgets.
The algorithms are broadly classified into two categories: n-gram comparisons and form layout conventions. Cserkúti et al. [124] propose a proxy-based content re-authoring system known as Smart Web. This system adapts the web page content to the client device; it typically performs structural analysis and applies a number of transformations as well.
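The n-gram comparison category of label matching can be sketched as follows. This is an illustrative assumption about how such a comparison might work, not the algorithms of Kaljuvee et al.: it scores a text label against candidate widget names using a Dice coefficient over character trigrams, and all names here are made up.

```python
def ngrams(s, n=3):
    """Character n-grams of a lowercased, space-padded string, so that
    boundary characters also contribute."""
    s = f"  {s.lower()}  "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Dice coefficient over character n-grams: 1.0 for identical
    strings, 0.0 for strings sharing no n-grams."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

# A label is assigned to the candidate widget name it resembles most.
label = "E-mail address"
widgets = ["email_addr", "phone_no", "zip_code"]
best = max(widgets, key=lambda w: ngram_similarity(label, w))
print(best)   # email_addr
```

The complementary form-layout category would instead use positional cues, e.g. preferring the nearest widget to the left of or below a label.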

Hoschka [125] outlines the MWI (Mobile Web Initiative), giving an insight into its historical development, its future, and its story over the past years and the years to come. Another webpage, User Vision [161], focuses mainly on the differences between the desktop view and the mobile view of a web page. The author also gives guidelines for designing a web page that adapts to small-screen devices so as to improve readability. The white paper [154] proposes an enhanced method of web content adaptation for mobile devices. According to the author, the process of Web content adaptation consists of four stages: block filtering, block title extraction, block content summarization, and personalization through learning. As a result of learning, personalization is realized by showing the information for the relevant block at the top of the content list.
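The four-stage adaptation process described in [154] can be pictured as a simple pipeline. The sketch below is a loose illustration under assumed data structures: the stage boundaries match the white paper's names, but the heuristics (length-based filtering, first-sentence titles, truncation summaries, term-count personalization) are placeholders of my own.

```python
# Hypothetical sketch of a four-stage content adaptation pipeline.
# A "block" is a dict with "title" and "text" extracted from a page.

def filter_blocks(blocks, min_len=20):
    """Stage 1, block filtering: drop navigation/advert fragments
    (approximated here as very short text)."""
    return [b for b in blocks if len(b["text"]) >= min_len]

def extract_title(block):
    """Stage 2, block title extraction: fall back to the first
    sentence when no explicit title exists."""
    if not block.get("title"):
        block["title"] = block["text"].split(".")[0][:40]
    return block

def summarize(block, max_chars=60):
    """Stage 3, block content summarization: shorten for a small screen."""
    block["summary"] = block["text"][:max_chars]
    return block

def personalize(blocks, interest_terms):
    """Stage 4, personalization: put blocks matching learned user
    interests at the top of the content list."""
    def score(b):
        return sum(b["text"].lower().count(t) for t in interest_terms)
    return sorted(blocks, key=score, reverse=True)

page = [
    {"title": "", "text": "Menu"},
    {"title": "", "text": "Cricket scores from today. India won by five wickets."},
    {"title": "News", "text": "Parliament passed the new budget bill this morning."},
]
adapted = personalize([summarize(extract_title(b)) for b in filter_blocks(page)],
                      interest_terms=["cricket"])
print(adapted[0]["title"])   # Cricket scores from today
```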