Boosting Bookmark Category Web Page Classification Accuracy using Multiple Clustering Approaches

Chris Staff
Department of Artificial Intelligence, University of Malta

Abstract

Web browser bookmark files are used to store links to web sites that the user would like to revisit. However, bookmark files tend to be under-utilised, as time and effort are needed to keep them organised. We use four methods to index and automatically classify documents referred to in 80 bookmark files, based on document title-only and full-text indexing and two clustering approaches. We evaluate the approaches by selecting a bookmark entry to classify from a bookmark file and re-creating a snapshot of the bookmark file that contains only entries created before the selected bookmark entry. Individually, the algorithms have an accuracy at rank 1 similar to the title-only baseline approach, but the different approaches tend to make different recommendations. We improve accuracy by combining the recommendations at rank 1 of each algorithm. The baseline algorithm is 39% accurate at rank 1 when the target category contains 7 entries. By merging the recommendations of the 4 approaches, we reach 78.7% accuracy on average, recommending a maximum of 3 categories. 30.6% of the time we need make only one recommendation, which is correct 81.4% of the time.

1. Motivation

Web browsing software, such as Safari, Internet Explorer, and Mozilla Firefox, includes a bookmark or favorites facility, so that a user can keep an electronic record of web sites or web pages that they have visited and are likely to want to revisit. It is usually possible to manually organise these bookmark entries into related collections called folders or categories. If bookmark files are kept organised and up-to-date, they could be a good indication of a user's long-term and short-term interests, which could be used to automatically identify and retrieve related information.
However, bookmark files require user effort to keep organised, so a collection of bookmarks tends to become disorganised over time [1], [2]. We describe HyperBK2, which can assist users in keeping a collection of bookmarks organised by recommending the category in which to store the entry for a web page that is being bookmarked. We examine different approaches to indexing web pages, deriving category representations, and classifying web pages into categories.

Ideally, a user would be recommended a single category that would always be the correct one (i.e., the user would never opt to save the entry to some other category). The approaches that we have compared do not meet this ideal, but we can offer the user a selection of categories that may include the correct one. Of course, we can do this trivially by offering the user all the categories that exist (just as in information retrieval we can guarantee a recall of 100% by retrieving all documents in the collection), so we want to show the user as small a selection of recommendations as possible, while maximising the chances that the small selection contains the correct, target category. We experimented with taking the top-5 recommendations from two different approaches and fusing them, which resulted in an average of 7 recommendations in a result set and an accuracy of 80%, and with fusing the results of the four different approaches at rank 1, which gives comparable accuracy while offering the user a maximum of only 3 recommendations.

In section 2 we discuss similar systems. HyperBK2's indexing and classification approach is discussed in section 3, the evaluation approach in section 4, and the results are presented and discussed in section 5. Section 6 outlines our future work and conclusions.

2. Background and Similar Systems

Web pages are frequently classified to automatically create web directories [3], [4], [5], [6] with predefined categories [7] or dynamic or ad hoc categories [6], [3], or to assist users in bookmarking favourite web pages [8]. Bookmarking is a popular way to store and organise information for later use [9]. However, drawbacks exist, especially when the number of bookmarks increases over time [10]. Bookmark managers support users in creating and maintaining reusable bookmark lists. These may store information locally, such as HyperBK [11], Conceptual Navigator, and Check&Get, or centrally, such as CariBo [8], Delicious, and BookmarkTracker.

Web pages may be classified using contextual information, such as neighbourhood information and link graphs [4], [5], or using supervised [12] or partially supervised learning techniques [6]. In [7], a web page is summarized before it is classified, to eliminate noise in the form of navigational links and adverts.

Delicious, which is an online service, allows users to share bookmarks. Categorisation is aided by the use of tags, which users associate with their bookmarks. However, there are no explicit category recommendations when a new bookmark is being stored. InLinx [7] provides for both recommendation and classification of bookmarks into globally predefined categories. Classification is based on the user's profile and the web-page content. CariBo [8] classifies a bookmark by first establishing a similarity in the interests of two users and then finding a mapping between the folder location of a bookmark in the collaborators' bookmark files and that of the target user's bookmark hierarchy.

In previous work [11], we built a bookmark management system, HyperBK, that can recommend a destination bookmark category (folder). However, only a small number of bookmark files had been used in the evaluation.
In HyperBK2, we have modified our approach to indexing and classification, and we have evaluated the new approach using 80 bookmark files.

3. HyperBK2's Indexing and Classification Approach

The literature suggests that web page classification frequently uses a global classification taxonomy [7] or makes use of a web page's neighbourhood information [4], [5]. We want to take a partially supervised approach to clustering [6]: the only sources of information are the web page to be bookmarked and the web page entries in the user's existing bookmark categories (positive examples). We avoid using a global classification taxonomy, instead using the categories that a user has created in his or her own bookmark file. This allows our recommendations to be personalised, and bookmark entries will be grouped according to an individual user's needs and preferences.

We examine four approaches to indexing bookmark files and classifying web pages into an existing bookmark category. We use one, called TITLE-ONLY, as a baseline. As the name suggests, only the document title is indexed, using TFIDF [13]. The indexed titles are combined to build a centroid representation of a category. An incoming document is classified according to the similarity of its title to each of the category centroids. The other three approaches, FULL-TEXT, CLUSTER, and SINGLETON, all build their indices and classify web pages based on a document's full text. They vary according to which bookmarked web pages (bookmark entries) in a category are used to derive the category centroids that are compared to the incoming web page. TITLE-ONLY and FULL-TEXT build one centroid per category, using all the entries in the category. SINGLETON treats each entry as a cluster centroid, so a category containing n entries will have n centroids. CLUSTER clusters the n entries in a category using a thresholded similarity measure, deriving m centroids (where 1 <= m <= n).
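The difference between SINGLETON and CLUSTER can be illustrated with a minimal sketch. It is not HyperBK2's actual code: the function names, the use of `Counter` term-frequency vectors, merging by summing term frequencies, and comparing each candidate against the growing centroid are all assumptions made for illustration. The default threshold of 0.2 is the arbitrarily chosen value the paper reports for CLUSTER.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def singleton_centroids(entries):
    """SINGLETON: every entry is its own centroid (n entries, n centroids)."""
    return [Counter(e) for e in entries]


def cluster_centroids(entries, threshold=0.2):
    """CLUSTER: greedy, order-dependent grouping into m centroids, 1 <= m <= n.

    `entries` are term-frequency vectors in the order the user created them.
    """
    unclustered = [Counter(e) for e in entries]
    centroids = []
    while unclustered:
        centroid = unclustered.pop(0)  # oldest unclustered entry seeds a cluster
        remaining = []
        for doc in unclustered:
            if cosine(centroid, doc) > threshold:
                centroid.update(doc)  # assumption: merge by summing frequencies
            else:
                remaining.append(doc)
        unclustered = remaining
        centroids.append(centroid)
    return centroids


entries = [
    Counter({"python": 2, "code": 1}),
    Counter({"python": 1, "tutorial": 1}),
    Counter({"recipe": 3, "cake": 1}),
]
centroids = cluster_centroids(entries)
```

In this toy category the two programming pages merge into one centroid and the recipe page stays a centroid of its own, so CLUSTER yields two centroids where SINGLETON yields three.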
In both SINGLETON and CLUSTER, the category recommended for an incoming web page is the category containing the centroid most similar to the incoming web page. The user can always override HyperBK2's recommendation and store the entry in another category or create a new category.

Regardless of the approach used, we first create a forward index for each document referred to in the bookmark file. We remove stop words, HTML and script tags, stem the remaining terms using Gupta's Python implementation of the Porter Stemmer, and calculate the term frequency for each stem. Once the forward indexing of bookmark entries is complete, we identify the documents that are to be used to create the centroid or centroids for each category. For the TITLE-ONLY and FULL-TEXT approaches, we take the appropriate forward index of each document $d_1$ to $d_N$ in a category and merge them, calculating a term weight by summing the term frequencies (TF) of each term $j_1$ to $j_m$ in each document in the category, and multiplying the sum by the Normalised Document Frequency ($NDF_{j_i} = DF_{j_i}/N$, where $N$ is the total number of documents in the category):

$$\sum_{d=1}^{N} TF_{j_i,d} \times NDF_{j_i}$$

This has the effect of reducing the weight of terms that occur in few documents in the category. For the SINGLETON approach, each selected document becomes a centroid in its own right. For the CLUSTER approach, we pick a document representation from a category and compare it to each of the remaining (unclustered) documents in the same category, merging it with those representations whose similarity is above a certain threshold (arbitrarily set to 0.2). We then create another centroid by picking one of the currently unclustered documents in the category and merging it with other, similar, currently unclustered document representations. If a document is not sufficiently similar to any other document in the category, it becomes a centroid in its own right. This iterative process continues until each document in a category is allocated to a cluster or is turned into a centroid. Cluster membership is influenced by the order in which bookmark entries are created in a category, because we select entries (to form the first centroid of a category, and to compare entries to the centroid) in the order in which they were added by the user.

We create a TITLE-ONLY or FULL-TEXT representation of the web page to classify, using the same formula (with NDF = 1). We then use the Cosine Similarity Measure [13] to measure the similarity between the web page and each category centroid in the bookmark file. The category containing the highest ranking category centroid is recommended.

4. Evaluation Approach

We collected 80 bookmark files from anonymous users. Each bookmark file is in the Netscape bookmark file format, and stores the date that each bookmark entry was created.
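The centroid weighting and cosine-based classification described in section 3 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the HyperBK2 implementation: documents are assumed to be already stop-worded and stemmed term-frequency dictionaries, and the names `category_centroid`, `rank_categories`, and the toy categories are hypothetical.

```python
import math
from collections import Counter


def category_centroid(docs):
    """Weight each term by (sum of its TF over the category) * NDF,
    where NDF = DF / N and N is the number of documents in the category."""
    n = len(docs)
    tf_sum, df = Counter(), Counter()
    for tf in docs:
        tf_sum.update(tf)     # adds this document's term frequencies
        df.update(tf.keys())  # each term present counts once per document
    return {t: tf_sum[t] * df[t] / n for t in tf_sum}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_categories(page_tf, centroids):
    """Rank category names by similarity of the page to each centroid.
    The page uses plain term frequencies, i.e. the same formula with NDF = 1."""
    scores = {name: cosine(page_tf, c) for name, c in centroids.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical bookmark file with two categories.
centroids = {
    "programming": category_centroid(
        [{"python": 2, "code": 1}, {"python": 1, "tutorial": 1}]
    ),
    "cooking": category_centroid([{"recipe": 3, "cake": 1}]),
}
ranking = rank_categories({"python": 1, "tips": 1}, centroids)
```

In the "programming" centroid, "python" occurs in both documents and keeps its full summed frequency (3.0), while "code" and "tutorial" occur in only one of the two documents and are halved (0.5 each), which is the down-weighting effect the text describes.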
We use this date to re-create a snapshot of the bookmark file's state just prior to the addition of the bookmark to be classified. The method of evaluation is to select bookmark entries from a number of bookmark files, according to some criteria (see below), and to measure the ability of the classification methods to recommend their original category. We measure the presence of the target category in ranks 1 to 5 as the accuracy at each rank.

The criteria we use to select bookmark entries for classification from a bookmark file, and to determine the eligibility of the bookmark file snapshot to participate in a particular run, are ENTRY-TO-TAKE and NO-OF-CATEGORIES. ENTRY-TO-TAKE is the position n of the entry in a category that is selected for classification. If there is a problem with the bookmark entry selected (e.g., the web page it represents no longer exists), then we take the next entry in the category, if possible. We ran our system with values for ENTRY-TO-TAKE of 2, 4, 6, 7, 8, 9, and 11. For example, in the simplest case (ENTRY-TO-TAKE = 2), the second entry created in each category would be selected for classification, and a snapshot of that category would contain only one entry. The snapshots of other categories in the bookmark file contain entries created before the selected entry. NO-OF-CATEGORIES is the number of categories that must exist in a snapshot of a bookmark file for it to participate in the evaluation. We imposed a minimum of 5 categories, which would give a random classifier a maximum chance of only 20% of correctly assigning a selected bookmark entry to its original category (at rank 1). We wanted to see if we would bias results in our system's favour if we did not impose this minimum, so we removed this constraint for the evaluation platform with the best performing criteria (table 3, runs 5 and 6).
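The snapshot procedure above can be sketched as follows. This is a simplified illustration, not the evaluation platform itself: the `Entry` record, its integer `created` field (standing in for the creation date stored in the bookmark file), and the function names are assumptions, and the fallback to the next entry when a page no longer exists is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    url: str
    category: str
    created: int  # assumed: creation date as a sortable timestamp


def entries_to_classify(entries, entry_to_take):
    """Select the nth entry created in each category (ENTRY-TO-TAKE = n)."""
    by_category = defaultdict(list)
    for e in sorted(entries, key=lambda e: e.created):
        by_category[e.category].append(e)
    return [es[entry_to_take - 1] for es in by_category.values()
            if len(es) >= entry_to_take]


def snapshot(entries, target):
    """Bookmark file state just before `target` was added."""
    snap = defaultdict(list)
    for e in entries:
        if e.created < target.created:
            snap[e.category].append(e)
    return dict(snap)


def is_eligible(snap, no_of_categories=5):
    """NO-OF-CATEGORIES: minimum number of categories in the snapshot."""
    return len(snap) >= no_of_categories


a1, b1 = Entry("http://a/1", "A", 1), Entry("http://b/1", "B", 2)
a2, b2 = Entry("http://a/2", "A", 3), Entry("http://b/2", "B", 4)
entries = [a1, b1, a2, b2]
targets = entries_to_classify(entries, entry_to_take=2)
snap = snapshot(entries, a2)
```

With ENTRY-TO-TAKE = 2, the second entry of each category is selected, and the snapshot for `a2` contains only the single earlier entry in each category.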
We first ran the baseline TITLE-ONLY evaluation platform, using only a web page's title to construct the document index and to perform the classification. We then compared this approach to one that indexed and classified documents using FULL-TEXT (table 3). Each category had just one centroid representation, created by merging the descriptions of all the documents in a category snapshot. The results are presented in subsection 5.2. Next, categories were divided into clusters, using the SINGLETON approach (in which each document in a category snapshot is considered to be a centroid) and the CLUSTER approach (in which centroids are built by merging representations of sufficiently similar documents using the cosine similarity measure). The results are presented in subsection 5.3 and compared to the FULL-TEXT and TITLE-ONLY baseline results in subsection 5.4.

We notice that although the full-text and baseline approaches appear to give similar levels of accuracy (table 3), the different algorithms make different recommendations 33% of the time, and by merging their results we obtain higher accuracy (table 4). However, the disadvantage of merging the results is that, on average, a user would be recommended 7 categories. Given that 61% of the bookmark files used in the evaluation contain up to 10 categories (table 1), this offers no advantage over allowing users to choose a destination category themselves, though it would be advantageous to the 26.25% of users with 21 or more categories.
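Fusing the top-5 lists of two approaches, and measuring accuracy at a given rank, can be sketched as below. The function names and the fusion order (first list, then unseen items from the second) are assumptions made for illustration; the paper does not state how the fused list is ordered.

```python
def fuse_top_k(ranking_a, ranking_b, k=5):
    """Union of the top-k categories of two rankings, duplicates removed.
    Order is an assumption: items of ranking_a first, then new ones from ranking_b."""
    return list(dict.fromkeys(ranking_a[:k] + ranking_b[:k]))


def accuracy_at_rank(results, k):
    """Fraction of (ranking, target) pairs whose target is in the top k."""
    if not results:
        return 0.0
    hits = sum(1 for ranking, target in results if target in ranking[:k])
    return hits / len(results)


# Hypothetical rankings from TITLE-ONLY and FULL-TEXT for one bookmark entry.
title_only = ["a", "b", "c", "d", "e"]
full_text = ["c", "a", "f", "g", "b"]
fused = fuse_top_k(title_only, full_text)
```

In this made-up case the two lists share three categories and contribute four further distinct ones, so the fused list has seven entries, the same shape as the average case reported in the text.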

Table 1. Submitted bookmark files and their numbers of categories (columns: No. of categories, No. of bookmark files). First row: 1 category, 8 bookmark files.

We ran the CLUSTER and SINGLETON experiments to see if they were more accurate than the unclustered approaches, and also to see if we could achieve even higher accuracy by recommending a maximum of four categories: the categories recommended at rank 1 by each approach. A user would then be presented with between 1 and 4 recommendations: 1 if every approach made the same recommendation, and 4 if each made a different recommendation. These results are presented in subsection 5.4.

5. Results

5.1. Bookmark File Properties and Bookmark Entries Selected for Classification

In this section, we describe the general properties of the bookmark files that we collected, in terms of the number of categories that they contain (table 1), and provide information about the number of bookmark file entries selected for classification for each run (Total Eligible Entries in table 2). On average, bookmark files used in the evaluation have 23 categories, with a minimum of 1 and a maximum of 229. Table 1 gives the number of categories in each bookmark file used in the evaluation. 8 files (10%) contain only one category, 51 files (63.75%) contain between 2 and 20 categories, and 21 files (26.25%) contain more than 20 categories. Table 2 gives the parameters for each run, and the total number of bookmark entries selected for classification in each run.

5.2. TITLE-ONLY and FULL-TEXT

In table 3, we present the results of the runs, highlighting the best performances. We conducted the evaluation as follows. From each bookmark file, all the bookmark entries that satisfied the criteria were extracted, and a snapshot of the bookmark file was created per eligible bookmark entry. In table 3, we see that from rank 2 onwards, the FULL-TEXT approach has an advantage over TITLE-ONLY. Ideally, the top ranking recommendation is the target category.
The best performance was 44% and the worst was 24% (both FT rank 1). However, it turns out that the TITLE-ONLY and FULL-TEXT approaches frequently recommend different categories (on average, the two approaches recommend 3 identical categories and 4 different categories in all). When we merge the results (table 4) we see an increase in accuracy, although users would need to be shown 5 to 10 recommended categories, and 7 categories on average.

5.3. CLUSTER and SINGLETON

We ran CLUSTER and SINGLETON on categories from which the 8th entry was taken (Run 5). SINGLETON gives 39.2% accuracy at rank 1, and 60.8% accuracy at rank 5. CLUSTER yields an accuracy of 37.7% at rank 1, and 65.3% at rank 5. These are similar to the accuracy of the baseline classification at rank 1 (39%) and the FULL-TEXT results at rank 5 (64%), respectively, so they are not, individually, an improvement over the simpler approaches. However, when we compare the recommendations made by each of the four approaches, we are able to improve our recommendation accuracy, and we can also reduce the number of categories recommended to the user.

5.4. Merging Recommendations at Rank 1

We want a mechanism that has a good chance of recommending the correct category, without overloading the user with too many choices of category. We can merge the FULL-TEXT and TITLE-ONLY recommendations to increase the recommendation accuracy, at the cost of giving the user a choice of 7 candidate categories on average (subsection 5.2). We measure the frequency of agreement between the different approaches at rank 1, and the accuracy of the recommendation. We compare the results obtained for run 5. We can predict the quality of a recommendation based on the degree of agreement on the recommended category between the different approaches (table 5).
The following arrangements of agreement between the different approaches are possible: all four approaches can give the same result (4-of-a-kind), which may be correct or incorrect; three of the approaches may make the same recommendation (3-of-a-kind), with the fourth giving a different one, and either one is correct or both are incorrect; two of the approaches may recommend the same category, with the other two approaches agreeing on another category or each recommending a different category (2-of-a-kind), with any recommendation correct or all incorrect; or each approach may make a different recommendation (1-of-a-kind), and one of them may be correct, or they may all be incorrect.

Table 2. No. of bookmark entries classified (columns: Run, ENTRY-TO-TAKE, NO-OF-CATEGORIES, Total Eligible Entries).

Table 3. Comparing FULL-TEXT (FT) and TITLE-ONLY (TO) classification accuracy (percent) (columns: Run, then TO and FT accuracy at each of ranks 1 to 5).

Table 4. Merging the FULL-TEXT and TITLE-ONLY recommendations improves accuracy (percent) (columns: Run, then accuracy at each of ranks 1 to 5).

Table 5. Comparing merged recommendations at rank 1 with the baseline and merged TITLE-ONLY and FULL-TEXT approaches at rank 5

                                                          4-of-a-kind  3-of-a-kind  2-of-a-kind  1-of-a-kind
Probability of observation                                30.6%        38.4%        21.4%        9.6%
Accuracy                                                  81.4%        64.2%        95.8%        40.6%
No. of recommended categories                             1            2            2 or 3       4
% improvement over baseline (52% at rank 5)               +56.7%       +23.5%       +84.2%       -21.9%
No. of recommended categories using baseline only         5            5            5            5
% improvement over merged TITLE-ONLY + FULL-TEXT
  (80% at rank 5)                                         +1.8%        -19.8%       +19.8%       -49.5%
Average no. of recommended categories using merged
  TITLE-ONLY and FULL-TEXT results                        7            7            7            7

Table 5 gives the different combinations, the frequency of observing them, their accuracy, the number of categories that would need to be shown to users, the percentage improvement over the accuracy of the merged results of TITLE-ONLY and FULL-TEXT (at rank 5), and the percentage improvement over the accuracy of the TITLE-ONLY baseline (at rank 5). When the approaches agree on 2, 3, or 4 of the recommendations, the recommendation is correct on average 78.7% of the time. An added benefit is that HyperBK2 needs to make only 1, 2, or 3 recommendations. The approaches disagree totally only 9.6% of the time, and accuracy is then only 40.6%, despite needing to make 4 recommendations.
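The agreement patterns and the relative-improvement figures in table 5 can be reproduced with a short sketch. The function names and the input shape (a mapping from approach name to its rank-1 category) are assumptions for illustration only.

```python
from collections import Counter


def merge_rank1(recommendations):
    """Merge the rank-1 category of each approach.

    Returns (k, categories): k is the size of the largest agreeing group
    (4-of-a-kind down to 1-of-a-kind) and `categories` are the distinct
    recommendations to show, most agreed-upon first.
    """
    counts = Counter(recommendations.values())
    categories = [category for category, _ in counts.most_common()]
    return max(counts.values()), categories


def relative_improvement(accuracy, reference):
    """Percentage change of `accuracy` relative to `reference`."""
    return (accuracy - reference) / reference * 100.0


# Hypothetical rank-1 recommendations for one bookmark entry.
k, shown = merge_rank1({
    "TITLE-ONLY": "News",
    "FULL-TEXT": "News",
    "CLUSTER": "Research",
    "SINGLETON": "News",
})
```

Here three approaches agree, so the pattern is 3-of-a-kind and the user would be shown two categories. Applying `relative_improvement` to the 2-of-a-kind accuracy (95.8%) against the 52% baseline gives roughly +84.2%, matching the corresponding table entry.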
Our results are an improvement on those of CariBo, a collaborative bookmark category recommendation system that was evaluated on the bookmark files of 15 users and has 60% accuracy at rank 5 [8].

6. Future Work and Conclusions

When we merge the recommendations of the four different indexing and clustering approaches at rank 1, 90.4% of the time we can recommend 1 to 3 categories, and the target category will be recommended on average 78.7% of the time. This gives a 51.3% improvement over the TITLE-ONLY baseline at rank 5 (which needs to show users 5 categories), and a slight decrease of just 1.7% compared to the merged recommendations of TITLE-ONLY and FULL-TEXT (which would need to show users an average of 7 categories and a maximum of 10).

We have extended our previous work on automatic bookmark classification by comparing indexing and classification methods based on vector-based full-text and title-only representations of documents in a bookmark category. We also built a full-text representation based on category entry cluster centroids, where a cluster is either a singleton entry or has its membership determined by entry similarity. We conducted several runs in which the bookmark entry selected for classification was the 2nd, 4th, 6th, 7th, 8th, 9th, or 11th entry created in a category. The FULL-TEXT and TITLE-ONLY approaches worked best for categories already containing seven entries. However, SINGLETON treats each entry as a centroid, so it can work for an arbitrary number of entries in a category, and CLUSTER creates an arbitrary number of centroids in a category by merging representations of an arbitrary number of similar entries, treating non-similar entries as singleton centroids, so it too is unaffected by the actual number of entries in a category. There is a greater likelihood of making an accurate recommendation if at least two of the approaches make the same recommendation.

We intend to automatically generate a query from the category centroids, and to evaluate the ability to automatically find previously unseen documents that users consider relevant and worth bookmarking.

References

[1] D. Abrams and R. Baecker, "How people use WWW bookmarks," in CHI '97: CHI '97 extended abstracts on Human factors in computing systems. New York, NY, USA: ACM, 1997.
[2] D. Abrams, R. Baecker, and M. Chignell, "Information archiving with bookmarks: personal web space construction and organization," in CHI '98: Proceedings of the SIGCHI conference on Human factors in computing systems. New York, NY, USA: ACM Press/Addison-Wesley Publishing Co., 1998.
[3] X. Peng and B. Choi, "Automatic web page classification in a dynamic and hierarchical way," in ICDM '02: Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM '02). Washington, DC, USA: IEEE Computer Society, 2002.
[4] X. Qi and B. D. Davison, "Knowing a web page by the company it keeps," in CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management. New York, NY, USA: ACM, 2006.
[5] D. Shen, J.-T. Sun, Q. Yang, and Z. Chen, "A comparison of implicit and explicit links for web page classification," in WWW '06: Proceedings of the 15th international conference on World Wide Web. New York, NY, USA: ACM, 2006.
[6] H. Yu, J. Han, and K. C.-C. Chang, "PEBL: Web page classification without negative examples," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 1.
[7] C. Bighini, A. Carbonaro, and G. Casadei, "InLinx for document classification, sharing and recommendation," icalt, vol. 00, p. 91.
[8] D. Benz, K. H. L. Tso, and L. Schmidt-Thieme, "Automatic bookmark classification - a collaborative approach," in Proceedings of the 2nd Workshop in Innovations in Web Infrastructure (IWI2) at WWW2006, Edinburgh, Scotland, May.
[9] H. Bruce, W. Jones, and S. Dumais, "Keeping and refinding information on the web: What do people do and what do they need?" in ASIST 2004: Proceedings of the 67th ASIST annual meeting. Chicago, IL: Information Today, Inc.
[10] W. Jones, H. Bruce, and S. Dumais, "Keeping found things found on the web," in CIKM '01: Proceedings of the tenth international conference on Information and knowledge management. New York, NY, USA: ACM, 2001.
[11] C. Staff and I. Bugeja, "Automatic classification of web pages into bookmark categories," in SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM, 2007.
[12] M. Tsukada, T. Washio, and H. Motoda, "Automatic web-page classification by using machine learning methods," in WI '01: Proceedings of the First Asia-Pacific Conference on Web Intelligence: Research and Development. London, UK: Springer-Verlag, 2001.
[13] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Ithaca, NY, USA, Tech. Rep., 1987.


User Data Analytics and Recommender System for Discovery Engine User Data Analytics and Recommender System for Discovery Engine Yu Wang Master of Science Thesis Stockholm, Sweden 2013 TRITA- ICT- EX- 2013: 88 User Data Analytics and Recommender System for Discovery

More information

How To Cluster On A Search Engine

How To Cluster On A Search Engine Volume 2, Issue 2, February 2012 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: A REVIEW ON QUERY CLUSTERING

More information

How Programmers Use Internet Resources to Aid Programming

How Programmers Use Internet Resources to Aid Programming How Programmers Use Internet Resources to Aid Programming Jeffrey Stylos Brad A. Myers Computer Science Department and Human-Computer Interaction Institute Carnegie Mellon University 5000 Forbes Ave Pittsburgh,

More information

Expert Finding Using Social Networking

Expert Finding Using Social Networking San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research 1-1-2009 Expert Finding Using Social Networking Parin Shah San Jose State University Follow this and

More information

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

More information

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School

More information

On the Feasibility of Answer Suggestion for Advice-seeking Community Questions about Government Services

On the Feasibility of Answer Suggestion for Advice-seeking Community Questions about Government Services 21st International Congress on Modelling and Simulation, Gold Coast, Australia, 29 Nov to 4 Dec 2015 www.mssanz.org.au/modsim2015 On the Feasibility of Answer Suggestion for Advice-seeking Community Questions

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Automated Collaborative Filtering Applications for Online Recruitment Services

Automated Collaborative Filtering Applications for Online Recruitment Services Automated Collaborative Filtering Applications for Online Recruitment Services Rachael Rafter, Keith Bradley, Barry Smyth Smart Media Institute, Department of Computer Science, University College Dublin,

More information

Personalized Hierarchical Clustering

Personalized Hierarchical Clustering Personalized Hierarchical Clustering Korinna Bade, Andreas Nürnberger Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, D-39106 Magdeburg, Germany {kbade,nuernb}@iws.cs.uni-magdeburg.de

More information

A Proposed Algorithm for Spam Filtering Emails by Hash Table Approach

A Proposed Algorithm for Spam Filtering Emails by Hash Table Approach International Research Journal of Applied and Basic Sciences 2013 Available online at www.irjabs.com ISSN 2251-838X / Vol, 4 (9): 2436-2441 Science Explorer Publications A Proposed Algorithm for Spam Filtering

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

Domain Classification of Technical Terms Using the Web

Domain Classification of Technical Terms Using the Web Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig

More information

Clustering Data Streams

Clustering Data Streams Clustering Data Streams Mohamed Elasmar Prashant Thiruvengadachari Javier Salinas Martin gtg091e@mail.gatech.edu tprashant@gmail.com javisal1@gatech.edu Introduction: Data mining is the science of extracting

More information

Web Personalization based on Usage Mining

Web Personalization based on Usage Mining Web Personalization based on Usage Mining Sharhida Zawani Saad School of Computer Science and Electronic Engineering, University of Essex, Wivenhoe Park, Colchester, Essex, CO4 3SQ, UK szsaad@essex.ac.uk

More information

Web Mining Seminar CSE 450. Spring 2008 MWF 11:10 12:00pm Maginnes 113

Web Mining Seminar CSE 450. Spring 2008 MWF 11:10 12:00pm Maginnes 113 CSE 450 Web Mining Seminar Spring 2008 MWF 11:10 12:00pm Maginnes 113 Instructor: Dr. Brian D. Davison Dept. of Computer Science & Engineering Lehigh University davison@cse.lehigh.edu http://www.cse.lehigh.edu/~brian/course/webmining/

More information

A PERSONALIZED WEB PAGE CONTENT FILTERING MODEL BASED ON SEGMENTATION

A PERSONALIZED WEB PAGE CONTENT FILTERING MODEL BASED ON SEGMENTATION A PERSONALIZED WEB PAGE CONTENT FILTERING MODEL BASED ON SEGMENTATION K.S.Kuppusamy 1 and G.Aghila 2 1 Department of Computer Science, School of Engineering and Technology, Pondicherry University, Pondicherry,

More information

The Enron Corpus: A New Dataset for Email Classification Research

The Enron Corpus: A New Dataset for Email Classification Research The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu

More information

Building A Smart Academic Advising System Using Association Rule Mining

Building A Smart Academic Advising System Using Association Rule Mining Building A Smart Academic Advising System Using Association Rule Mining Raed Shatnawi +962795285056 raedamin@just.edu.jo Qutaibah Althebyan +962796536277 qaalthebyan@just.edu.jo Baraq Ghalib & Mohammed

More information

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM. DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,

More information

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph Janani K 1, Narmatha S 2 Assistant Professor, Department of Computer Science and Engineering, Sri Shakthi Institute of

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Data Mining in Web Search Engine Optimization and User Assisted Rank Results

Data Mining in Web Search Engine Optimization and User Assisted Rank Results Data Mining in Web Search Engine Optimization and User Assisted Rank Results Minky Jindal Institute of Technology and Management Gurgaon 122017, Haryana, India Nisha kharb Institute of Technology and Management

More information

Search Engines. Stephen Shaw <stesh@netsoc.tcd.ie> 18th of February, 2014. Netsoc

Search Engines. Stephen Shaw <stesh@netsoc.tcd.ie> 18th of February, 2014. Netsoc Search Engines Stephen Shaw Netsoc 18th of February, 2014 Me M.Sc. Artificial Intelligence, University of Edinburgh Would recommend B.A. (Mod.) Computer Science, Linguistics, French,

More information

COURSE RECOMMENDER SYSTEM IN E-LEARNING

COURSE RECOMMENDER SYSTEM IN E-LEARNING International Journal of Computer Science and Communication Vol. 3, No. 1, January-June 2012, pp. 159-164 COURSE RECOMMENDER SYSTEM IN E-LEARNING Sunita B Aher 1, Lobo L.M.R.J. 2 1 M.E. (CSE)-II, Walchand

More information

Search and Information Retrieval

Search and Information Retrieval Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

More information

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,

More information

ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach

ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach Banatus Soiraya Faculty of Technology King Mongkut's

More information

Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information

Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information Eric Hsueh-Chan Lu Chi-Wei Huang Vincent S. Tseng Institute of Computer Science and Information Engineering

More information

ARTIFICIAL INTELLIGENCE METHODS IN STOCK INDEX PREDICTION WITH THE USE OF NEWSPAPER ARTICLES

ARTIFICIAL INTELLIGENCE METHODS IN STOCK INDEX PREDICTION WITH THE USE OF NEWSPAPER ARTICLES FOUNDATION OF CONTROL AND MANAGEMENT SCIENCES No Year Manuscripts Mateusz, KOBOS * Jacek, MAŃDZIUK ** ARTIFICIAL INTELLIGENCE METHODS IN STOCK INDEX PREDICTION WITH THE USE OF NEWSPAPER ARTICLES Analysis

More information

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India sumit_13@yahoo.com 2 School of Computer

More information

Can the Content of Public News be used to Forecast Abnormal Stock Market Behaviour?

Can the Content of Public News be used to Forecast Abnormal Stock Market Behaviour? Seventh IEEE International Conference on Data Mining Can the Content of Public News be used to Forecast Abnormal Stock Market Behaviour? Calum Robertson Information Research Group Queensland University

More information

Finding Advertising Keywords on Web Pages. Contextual Ads 101

Finding Advertising Keywords on Web Pages. Contextual Ads 101 Finding Advertising Keywords on Web Pages Scott Wen-tau Yih Joshua Goodman Microsoft Research Vitor R. Carvalho Carnegie Mellon University Contextual Ads 101 Publisher s website Digital Camera Review The

More information

Full-text Search in Intermediate Data Storage of FCART

Full-text Search in Intermediate Data Storage of FCART Full-text Search in Intermediate Data Storage of FCART Alexey Neznanov, Andrey Parinov National Research University Higher School of Economics, 20 Myasnitskaya Ulitsa, Moscow, 101000, Russia ANeznanov@hse.ru,

More information

Profile Based Personalized Web Search and Download Blocker

Profile Based Personalized Web Search and Download Blocker Profile Based Personalized Web Search and Download Blocker 1 K.Sheeba, 2 G.Kalaiarasi Dhanalakshmi Srinivasan College of Engineering and Technology, Mamallapuram, Chennai, Tamil nadu, India Email: 1 sheebaoec@gmail.com,

More information

An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them

An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them Vangelis Karkaletsis and Constantine D. Spyropoulos NCSR Demokritos, Institute of Informatics & Telecommunications,

More information

Towards Effective Recommendation of Social Data across Social Networking Sites

Towards Effective Recommendation of Social Data across Social Networking Sites Towards Effective Recommendation of Social Data across Social Networking Sites Yuan Wang 1,JieZhang 2, and Julita Vassileva 1 1 Department of Computer Science, University of Saskatchewan, Canada {yuw193,jiv}@cs.usask.ca

More information

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari berardi@di.uniba.it

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari berardi@di.uniba.it Web Mining Margherita Berardi LACAM Dipartimento di Informatica Università degli Studi di Bari berardi@di.uniba.it Bari, 24 Aprile 2003 Overview Introduction Knowledge discovery from text (Web Content

More information

A Survey on Association Rule Mining in Market Basket Analysis

A Survey on Association Rule Mining in Market Basket Analysis International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 4, Number 4 (2014), pp. 409-414 International Research Publications House http://www. irphouse.com /ijict.htm A Survey

More information

ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search

ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search Project for Michael Pitts Course TCSS 702A University of Washington Tacoma Institute of Technology ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search Under supervision of : Dr. Senjuti

More information

Bisecting K-Means for Clustering Web Log data

Bisecting K-Means for Clustering Web Log data Bisecting K-Means for Clustering Web Log data Ruchika R. Patil Department of Computer Technology YCCE Nagpur, India Amreen Khan Department of Computer Technology YCCE Nagpur, India ABSTRACT Web usage mining

More information

Twitter sentiment vs. Stock price!

Twitter sentiment vs. Stock price! Twitter sentiment vs. Stock price! Background! On April 24 th 2013, the Twitter account belonging to Associated Press was hacked. Fake posts about the Whitehouse being bombed and the President being injured

More information

Some Research Challenges for Big Data Analytics of Intelligent Security

Some Research Challenges for Big Data Analytics of Intelligent Security Some Research Challenges for Big Data Analytics of Intelligent Security Yuh-Jong Hu hu at cs.nccu.edu.tw Emerging Network Technology (ENT) Lab. Department of Computer Science National Chengchi University,

More information

How do Students Organize Personal Information Spaces?

How do Students Organize Personal Information Spaces? How do Students Organize Personal Information Spaces? Sharon Hardof-Jaffe 1, Arnon Hershkovitz 1, Hama Abu-Kishk 2, Ofer Bergman 3, Rafi Nachmias 1 {sharonh2, arnonher, nachmias}@post.tau.ac.il, hama@bgu.ac.il,

More information

Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information

Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi, and Gianluca Bontempi 15/07/2015 IEEE IJCNN

More information

Infoview XIR3. User Guide. 1 of 20

Infoview XIR3. User Guide. 1 of 20 Infoview XIR3 User Guide 1 of 20 1. WHAT IS INFOVIEW?...3 2. LOGGING IN TO INFOVIEW...4 3. NAVIGATING THE INFOVIEW ENVIRONMENT...5 3.1. Home Page... 5 3.2. The Header Panel... 5 3.3. Workspace Panel...

More information

A PREDICTIVE MODEL FOR QUERY OPTIMIZATION TECHNIQUES IN PERSONALIZED WEB SEARCH

A PREDICTIVE MODEL FOR QUERY OPTIMIZATION TECHNIQUES IN PERSONALIZED WEB SEARCH International Journal of Computer Science and System Analysis Vol. 5, No. 1, January-June 2011, pp. 37-43 Serials Publications ISSN 0973-7448 A PREDICTIVE MODEL FOR QUERY OPTIMIZATION TECHNIQUES IN PERSONALIZED

More information

Financial Trading System using Combination of Textual and Numerical Data

Financial Trading System using Combination of Textual and Numerical Data Financial Trading System using Combination of Textual and Numerical Data Shital N. Dange Computer Science Department, Walchand Institute of Rajesh V. Argiddi Assistant Prof. Computer Science Department,

More information

Comparing Tag Clouds, Term Histograms, and Term Lists for Enhancing Personalized Web Search

Comparing Tag Clouds, Term Histograms, and Term Lists for Enhancing Personalized Web Search Comparing Tag Clouds, Term Histograms, and Term Lists for Enhancing Personalized Web Search Orland Hoeber and Hanze Liu Department of Computer Science, Memorial University St. John s, NL, Canada A1B 3X5

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

A Framework for Dynamic Faculty Support System to Analyze Student Course Data

A Framework for Dynamic Faculty Support System to Analyze Student Course Data A Framework for Dynamic Faculty Support System to Analyze Student Course Data J. Shana 1, T. Venkatachalam 2 1 Department of MCA, Coimbatore Institute of Technology, Affiliated to Anna University of Chennai,

More information

EXTENDING JMETER TO ALLOW FOR WEB STRUCTURE MINING

EXTENDING JMETER TO ALLOW FOR WEB STRUCTURE MINING EXTENDING JMETER TO ALLOW FOR WEB STRUCTURE MINING Agustín Sabater, Carlos Guerrero, Isaac Lera, Carlos Juiz Computer Science Department, University of the Balearic Islands, SPAIN pinyeiro@gmail.com, carlos.guerrero@uib.es,

More information

Data Mining with Hadoop at TACC

Data Mining with Hadoop at TACC Data Mining with Hadoop at TACC Weijia Xu Data Mining & Statistics Data Mining & Statistics Group Main activities Research and Development Developing new data mining and analysis solutions for practical

More information

Privacy Protection in Personalized Web Search- A Survey

Privacy Protection in Personalized Web Search- A Survey Privacy Protection in Personalized Web Search- A Survey Greeshma A S. * Lekshmy P. L. M.Tech Student Assistant Professor Dept. of CSE & Kerala University Dept. of CSE & Kerala University Thiruvananthapuram

More information

Using Wikipedia to Translate OOV Terms on MLIR

Using Wikipedia to Translate OOV Terms on MLIR Using to Translate OOV Terms on MLIR Chen-Yu Su, Tien-Chien Lin and Shih-Hung Wu* Department of Computer Science and Information Engineering Chaoyang University of Technology Taichung County 41349, TAIWAN

More information

AUTOMATIC CLASSIFICATION OF QUESTIONS INTO BLOOM'S COGNITIVE LEVELS USING SUPPORT VECTOR MACHINES

AUTOMATIC CLASSIFICATION OF QUESTIONS INTO BLOOM'S COGNITIVE LEVELS USING SUPPORT VECTOR MACHINES AUTOMATIC CLASSIFICATION OF QUESTIONS INTO BLOOM'S COGNITIVE LEVELS USING SUPPORT VECTOR MACHINES Anwar Ali Yahya *, Addin Osman * * Faculty of Computer Science and Information Systems, Najran University,

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

SEO Techniques for various Applications - A Comparative Analyses and Evaluation

SEO Techniques for various Applications - A Comparative Analyses and Evaluation IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727 PP 20-24 www.iosrjournals.org SEO Techniques for various Applications - A Comparative Analyses and Evaluation Sandhya

More information

Programming Risk Assessment Models for Online Security Evaluation Systems

Programming Risk Assessment Models for Online Security Evaluation Systems Programming Risk Assessment Models for Online Security Evaluation Systems Ajith Abraham 1, Crina Grosan 12, Vaclav Snasel 13 1 Machine Intelligence Research Labs, MIR Labs, http://www.mirlabs.org 2 Babes-Bolyai

More information

Optimised Realistic Test Input Generation

Optimised Realistic Test Input Generation Optimised Realistic Test Input Generation Mustafa Bozkurt and Mark Harman {m.bozkurt,m.harman}@cs.ucl.ac.uk CREST Centre, Department of Computer Science, University College London. Malet Place, London

More information

Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects

Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects Mohammad Farahmand, Abu Bakar MD Sultan, Masrah Azrifah Azmi Murad, Fatimah Sidi me@shahroozfarahmand.com

More information

Search Result Optimization using Annotators

Search Result Optimization using Annotators Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

More information

Florida International University - University of Miami TRECVID 2014

Florida International University - University of Miami TRECVID 2014 Florida International University - University of Miami TRECVID 2014 Miguel Gavidia 3, Tarek Sayed 1, Yilin Yan 1, Quisha Zhu 1, Mei-Ling Shyu 1, Shu-Ching Chen 2, Hsin-Yu Ha 2, Ming Ma 1, Winnie Chen 4,

More information

Context Aware Predictive Analytics: Motivation, Potential, Challenges

Context Aware Predictive Analytics: Motivation, Potential, Challenges Context Aware Predictive Analytics: Motivation, Potential, Challenges Mykola Pechenizkiy Seminar 31 October 2011 University of Bournemouth, England http://www.win.tue.nl/~mpechen/projects/capa Outline

More information

WE DEFINE spam as an e-mail message that is unwanted basically

WE DEFINE spam as an e-mail message that is unwanted basically 1048 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999 Support Vector Machines for Spam Categorization Harris Drucker, Senior Member, IEEE, Donghui Wu, Student Member, IEEE, and Vladimir

More information

A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE

A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LIX, Number 1, 2014 A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE DIANA HALIŢĂ AND DARIUS BUFNEA Abstract. Then

More information

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search Feature Article: Ahu Sieg, Bamshad Mobasher and Robin Burke 7 Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search Ahu Sieg, Bamshad Mobasher, Robin Burke Center for Web

More information

Semantic Jira - Semantic Expert Finder in the Bug Tracking Tool Jira

Semantic Jira - Semantic Expert Finder in the Bug Tracking Tool Jira Semantic Jira - Semantic Expert Finder in the Bug Tracking Tool Jira Velten Heyn and Adrian Paschke Corporate Semantic Web, Institute of Computer Science, Koenigin-Luise-Str. 24, 14195 Berlin, Germany

More information

Developing a Collaborative MOOC Learning Environment utilizing Video Sharing with Discussion Summarization as Added-Value

Developing a Collaborative MOOC Learning Environment utilizing Video Sharing with Discussion Summarization as Added-Value , pp. 397-408 http://dx.doi.org/10.14257/ijmue.2014.9.11.38 Developing a Collaborative MOOC Learning Environment utilizing Video Sharing with Discussion Summarization as Added-Value Mohannad Al-Mousa 1

More information

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute

More information

Inverted files and dynamic signature files for optimisation of Web directories

Inverted files and dynamic signature files for optimisation of Web directories s and dynamic signature files for optimisation of Web directories Fidel Cacheda, Angel Viña Department of Information and Communication Technologies Facultad de Informática, University of A Coruña Campus

More information

Music Genre Classification

Music Genre Classification Music Genre Classification Michael Haggblade Yang Hong Kenny Kao 1 Introduction Music classification is an interesting problem with many applications, from Drinkify (a program that generates cocktails

More information

TREC 2007 ciqa Task: University of Maryland

TREC 2007 ciqa Task: University of Maryland TREC 2007 ciqa Task: University of Maryland Nitin Madnani, Jimmy Lin, and Bonnie Dorr University of Maryland College Park, Maryland, USA nmadnani,jimmylin,bonnie@umiacs.umd.edu 1 The ciqa Task Information

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information