Topical Crawling for Business Intelligence
Gautam Pant and Filippo Menczer
Department of Management Sciences, The University of Iowa, Iowa City, IA 52242, USA
(Current affiliation: School of Informatics and Computer Science Department, Indiana University. [email protected])

Abstract. The Web provides us with a vast resource for business intelligence. However, the large size of the Web and its dynamic nature make the task of foraging appropriate information challenging. General-purpose search engines and business portals may be used to gather some basic intelligence. Topical crawlers, driven by richer contexts, can then leverage this basic intelligence to facilitate in-depth and up-to-date research. In this paper we investigate the use of topical crawlers in creating a small document collection that helps locate relevant business entities. The problem of locating business entities is encountered when an organization looks for competitors, partners or acquisitions. We formalize the problem, create a test bed, introduce metrics to measure the performance of crawlers, and compare the results of four different crawlers. Our results underscore the importance of identifying good hubs and of exploiting link contexts based on tag trees for accelerating the crawl and improving the overall results.

1 Introduction

A large number of business entities, including start-up companies and established corporations, have a Web presence. This makes the Web a lucrative source for locating business information of interest. A company that is planning to diversify or invest in a start-up would want to locate a number of players in that area of business. The intelligence gathering process may involve manual efforts using search engines, business portals or personal contacts. Topical crawlers can help in extracting a small but focused document collection from the Web that can then be thoroughly mined for appropriate information using off-the-shelf text mining, indexing and ranking tools.

Topical crawlers, also called focused crawlers, have been studied extensively in the past [6, 7, 3, 10, 2]. In our previous evaluation studies of topical crawlers, we found a similarity-based Naive Best-First crawler to be quite effective [11, 12, 18]. However, the Naive Best-First crawler made no use of the inherent structure available in an HTML document. We also studied algorithms that attempt to identify the context of a link using a sliding window and a distance measure based on the number of links separating a word from a given link. However, such approaches
often performed no better than variants of the Naive Best-First approach [12, 18]. In this paper we consider a crawler that identifies the context of a link using the HTML page's tag tree, or Document Object Model (DOM), representation.

The problem of locating business entities on the Web maps onto the general problem of Web resource discovery. However, it has some features that distinguish it from the general case. In particular, business communities are highly competitive. Hence, it is unlikely that a company's Web page would link to a Web page of its competitor. A topical crawler that is aware of such a domain-level characteristic may use it to its advantage. We evaluate our crawlers over millions of pages across 159 topics. Based on the number of topics and the number of pages crawled per topic, our evaluations are the most extensive in the currently available topical crawling literature.

2 The Problem

Searching for URLs of related business entities is a type of business intelligence problem. The entities could be related through their area of competence, research thrust, comparable nature (such as start-ups) or a combination of such features. We start by assuming that a short list of URLs of related business entities is already available, but that the list needs to be further expanded. The short list may have been generated manually with the help of search engines, business portals or Web directories.

An analyst may face some hurdles in expanding the list of relevant URLs. Such hurdles could be due to a lack of appropriate content in relevant pages, inadequate user queries, staleness of search engines' collections, or bias in search engines' rankings. Similar problems plague information discovery using Web directories or portals. The staleness of a search engine's collection is highlighted by the dynamic nature of the Web [9]. Cho and Garcia-Molina [4] have shown, in a study of 720,000 Web pages spanning over 4 months, that 40% of the pages in the .com domain changed every day. Hence, it is reasonable to complement traditional techniques with topical crawlers to discover up-to-date business information.

Since our focus is on studying the effect of different crawling techniques, we do not investigate the issue of ranking. We believe that ranking is a separate task best left until a collection has been created. In fact, many different indexing, ranking or text mining tools may be applied to retrieve information from the collection. Our goal is to find ways of crawling and building a small but effective collection for the purpose of finding related business entities. We measure the quality of the collection at various points during the crawl through precision-like and recall-like metrics that are described next.

3 The Test Bed

For our test bed, we need a number of topics and corresponding lists of related business entities. The hierarchical categories of the Open Directory Project
(ODP) are used for this purpose. ODP provides us with a categorical collection of URLs that are manually edited and not biased by commercial considerations. Hence, we can find categories that list related business entities as judged by humans. To begin with, we identify categories in the ODP hierarchy that end with any of the following words: Companies, Consultants, or Manufacturers. These categories serve as topics for our test bed. We collect only those categories that have more than 20 unique external URLs, and we skip categories that have Regional, World or International as a top-level or sub-category. After going through the entire RDF dump of ODP, 159 such categories are found.

Next, we split each category's external URLs into two disjoint sets: seeds and targets. The set of seeds is created by picking 10 external URLs at random from a category. The remaining URLs make up the targets. The seeds are given to the crawlers as starting links while the targets are used for evaluation. The keywords that guide the crawlers are created by concatenating all the alphanumeric tokens in the ODP category name, leaving out the most general category (Science, Arts, etc.). We also concatenate the manual descriptions of the seeds and the targets to create a master description that serves as another source for evaluations. An example of a topic with corresponding keywords, description, seeds and targets as extracted from an ODP category is shown in Table 1.

Each crawler is provided with a set of keywords and the corresponding seeds to start the crawl. A crawler is then allowed to crawl up to 10,000 pages starting from the seeds. The process is repeated for each of the 159 topics in the test bed. As a result, we may crawl more than one and a half million pages for one crawler alone.
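The seed/target split and keyword construction above are easy to mechanize. The following minimal sketch (in Java, the implementation language of the crawlers described later) assumes each topic arrives as an ODP category path plus its list of external URLs; the class and field names are ours, not the authors', and the exact handling of the category path is an assumption.

```java
import java.util.*;

/** Sketch of test-bed construction for one ODP category: 10 random external
 *  URLs become seeds, the rest become targets, and the keywords are the
 *  alphanumeric tokens of the category path minus the most general category. */
public class TestBedTopic {
    final List<String> seeds = new ArrayList<>();
    final List<String> targets = new ArrayList<>();
    final String keywords;

    TestBedTopic(String categoryPath, List<String> externalUrls, long randomSeed) {
        // Assumes the category has more than 20 unique external URLs, as required above.
        List<String> urls = new ArrayList<>(externalUrls);
        Collections.shuffle(urls, new Random(randomSeed));
        seeds.addAll(urls.subList(0, 10));
        targets.addAll(urls.subList(10, urls.size()));

        // categoryPath holds the ODP hierarchy with levels separated by '/'.
        String[] levels = categoryPath.split("/");
        StringBuilder kw = new StringBuilder();
        for (int i = 1; i < levels.length; i++) {   // skip the most general category (Science, Arts, etc.)
            for (String token : levels[i].split("[^A-Za-z0-9]+")) {
                if (!token.isEmpty()) kw.append(token).append(' ');
            }
        }
        keywords = kw.toString().trim();
    }
}
```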
We use the following precision-like and recall-like measures to understand the performance of the crawlers.

Precision@N: We measure the precision for a crawler on topic t after crawling N pages as:

\mathrm{precision@}N_t = \frac{1}{N} \sum_{i=1}^{N} \mathrm{sim}(d_t, p_i) \quad (1)

where d_t is the description of the topic t, p_i is a crawled page, and sim(d_t, p_i) is the cosine similarity between the two. The cosine similarity is measured using a TF-IDF weighting scheme [17]. In particular, the weight of a term k in a given page p is computed as:

w_{kp} = \left( 0.5 + \frac{0.5 \, \mathit{tf}_{kp}}{\max_{k' \in T_p} \mathit{tf}_{k'p}} \right) \ln \frac{|C^t|}{\mathit{df}_k} \quad (2)

where tf_kp is the frequency of the term k in page p, T_p is the set of all terms in page p, C^t is the set of pages crawled by all the crawlers for topic t at the end of the crawl, and df_k is the number of pages in C^t that contain the term k. After representing a topic description (d) and a crawled page (p) as vectors of term weights (v_d and v_p respectively), the cosine similarity between them is computed as:

\mathrm{sim}(d, p) = \frac{v_d \cdot v_p}{\|v_d\| \, \|v_p\|} \quad (3)

where v_d · v_p is the dot (inner) product of the two vectors. Note that the terms used for cosine similarity, throughout this paper, undergo stemming (using the Porter stemming algorithm [15]) and stoplisting. For a given N, we can average precision@N_t over all the topics in the test bed to get the average precision@N. The latter can be plotted as a trajectory over time, where time is approximated by increasing values of crawled pages (N).

Target recall@N: In the absence of a known relevant set, we treat the recall of the targets as an indirect indicator of actual recall (Figure 1). The target recall@N_t for a crawler on a topic t can be computed at various points (N) during the crawl:

\mathrm{target\ recall@}N_t = \frac{|T_t \cap C_t^N|}{|T_t|} \quad (4)

where T_t is the set of targets for topic t, and C_t^N is the set of N crawled pages. Again, we can average the recall (average target recall@N) at various values of N over the topics in the test bed. The above evaluation metrics are a special case of a general evaluation methodology for topical crawlers [18]. We will perform one-tailed t-tests to understand, in statistical terms, the benefit of using one crawler over another. The null hypothesis in all such tests will be that the two crawlers under consideration perform equally well.

Fig. 1. |T_t ∩ C_t^N| / |T_t| as an estimate of |R_t ∩ C_t^N| / |R_t|, where R_t denotes the (unknown) set of relevant pages.
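The two metrics translate directly into code. A minimal sketch follows, assuming pages have already been reduced to stemmed, stoplisted term vectors; the class and method names are ours.

```java
import java.util.*;

/** Sketch of the evaluation metrics of Section 3 (Equations 1-4). */
public class CrawlMetrics {

    /** TF-IDF weight of a term in a page (Equation 2). docFreq and numCrawled
     *  come from the union of all crawls for the topic. */
    static double weight(int tf, int maxTf, int docFreq, int numCrawled) {
        return (0.5 + 0.5 * tf / (double) maxTf) * Math.log(numCrawled / (double) docFreq);
    }

    /** Cosine similarity between two term-weight vectors (Equation 3). */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
            na += e.getValue() * e.getValue();
        }
        for (double w : b.values()) nb += w * w;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** precision@N for one topic (Equation 1): mean similarity of the first N
     *  crawled pages to the topic description. */
    static double precisionAtN(Map<String, Double> topicDesc,
                               List<Map<String, Double>> crawledPages, int n) {
        double sum = 0;
        for (int i = 0; i < n; i++) sum += cosine(topicDesc, crawledPages.get(i));
        return sum / n;
    }

    /** target recall@N for one topic (Equation 4). */
    static double targetRecallAtN(Set<String> targets, List<String> crawledUrls, int n) {
        Set<String> crawled = new HashSet<>(crawledUrls.subList(0, n));
        crawled.retainAll(targets);
        return crawled.size() / (double) targets.size();
    }
}
```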
4 Crawlers

We evaluate the performance of four different crawlers. Before we describe the crawling algorithms, it is worthwhile to illustrate the general crawling infrastructure that all the crawlers use.

Table 1. A sample topic: keywords, description, seeds, and targets.
  keywords: Gambling Equipment Manufacturers
  description: Gemaco Cards, manufacturer of promotional and casino playing cards in poker, bridge... R.T. Plastics, supplier of casino value, roulette, poker and tournament chips...
  seeds, targets: external URLs drawn from the corresponding ODP category

4.1 Crawling Infrastructure

The crawlers are implemented as multi-threaded objects in Java. Each crawler has many (possibly hundreds of) threads of execution sharing a single synchronized frontier that lists the unvisited URLs. Each thread of the crawler follows a crawling loop that involves picking the next URL to crawl from the frontier, fetching the page corresponding to the URL through HTTP, parsing the retrieved page, and finally adding the unvisited URLs to the frontier. Before the URLs are added to the frontier they may be assigned a score that represents the estimated benefit of visiting the corresponding page.

Some of the crawlers utilize the tag tree structure of Web pages. They first tidy the HTML page and then use an XML parser to obtain relevant information from the DOM structure. Tidying an HTML page includes both insertion of missing tags and reordering of tags in the page. The process is necessary for mapping the content of a page onto a tree structure with integrity, where each node has a single parent. We also enclose all the text tokens within <text>...</text> tags. This makes sure that all text tokens appear as leaves on the tag tree. While this step is not necessary for mapping an HTML document to a tree structure, it does provide some simplifications for the analysis. Figure 2 shows a snippet of an HTML page that is subsequently cleaned and converted into a convenient XML format before being represented as a tag tree.

The current crawler implementation uses 75 threads of execution and limits the maximum size of the frontier to 70,000 URLs. Only the first 10 KB of a Web page are downloaded. Note that during a crawl of 10,000 pages, a crawler may encounter more than 70,000 unvisited URLs. However, given that the average number of outlinks on Web pages is 7 [16], the maximum frontier size is not very restrictive for a crawl of 10,000 pages. Being a shared resource for all the threads, the frontier is also responsible for enforcing certain ethics that prevent the threads from accessing the same server too frequently. In particular, the frontier tries to enforce the constraint that every batch of D URLs picked from it comes from D different server host names (D = 50). The crawlers also respect the Robot Exclusion Protocol.
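A minimal sketch of the crawling loop just described is given below. It is not the authors' implementation: fetch() and extractLinks() are hypothetical placeholders for the HTTP and parsing components, and robots.txt handling and the per-host politeness batching are omitted. The frontier here is ordered by score, as in the best-first variants described next.

```java
import java.util.*;
import java.util.concurrent.*;

/** Sketch of the shared-frontier crawling loop run by each crawler thread. */
public class CrawlerSketch {

    static class ScoredUrl implements Comparable<ScoredUrl> {
        final String url; final double score;
        ScoredUrl(String url, double score) { this.url = url; this.score = score; }
        public int compareTo(ScoredUrl o) { return Double.compare(o.score, score); } // best first
    }

    static final int MAX_FRONTIER = 70_000;
    // Shared, synchronized frontier of unvisited URLs.
    static final PriorityBlockingQueue<ScoredUrl> frontier = new PriorityBlockingQueue<>();
    static final Set<String> visited = ConcurrentHashMap.newKeySet();

    static void crawlLoop(int maxPages) throws InterruptedException {
        while (visited.size() < maxPages) {
            ScoredUrl next = frontier.take();           // pick the next URL to crawl
            if (!visited.add(next.url)) continue;       // another thread already took it
            String page = fetch(next.url);              // first 10 KB over HTTP (hypothetical helper)
            if (page == null) continue;
            for (String link : extractLinks(page)) {    // parse out-links (hypothetical helper)
                if (!visited.contains(link) && frontier.size() < MAX_FRONTIER) {
                    frontier.offer(new ScoredUrl(link, score(page, link)));
                }
            }
        }
    }

    // Placeholders; each topical crawler plugs in its own scoring function.
    static double score(String page, String link) { return 0.0; }
    static String fetch(String url) { return ""; }
    static List<String> extractLinks(String page) { return Collections.emptyList(); }
}
```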
[Figure 2: an HTML fragment consisting of a <p> element that contains a <strong> heading ("About Exelixis"), descriptive text, and an <a> link.]
Fig. 2. An HTML snippet (top) whose source (center) is modified into a convenient XML format (bottom). The tag tree representation of the HTML snippet is shown on the right.

4.2 Breadth-First Crawler

Breadth-First is a baseline crawler for our experiments. The crawl frontier is a FIFO queue. Each thread of the crawler picks up the URL at the front of the queue and adds new unvisited URLs to the back of it. Since the crawler is multi-threaded, many pages are fetched simultaneously. Hence, it is possible that the pages are not fetched in an exact FIFO order. In addition, the ethics enforced in the system, which prevent inundating a Web server with requests, also make the crawler deviate from strict FIFO order. The crawler adds unvisited URLs to the frontier only when the size of the frontier is less than the maximum allowed.

4.3 Naive Best-First Crawler

The Naive Best-First crawler treats a Web page as a bag of words. It computes the cosine similarity of the page to the given keywords and uses it as a score (link_score_bfs) of the unvisited URLs on the page. The URLs are then added to the frontier, which is maintained as a priority queue using the scores. Each thread picks the best URL in the frontier to crawl, and inserts the unvisited URLs at appropriate positions in the priority queue.
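As a sketch (not the original code), the scoring step of this crawler can be written as follows, reusing the cosine() helper from the metrics sketch above; stemming and stoplisting are left out for brevity, and the class and method names are ours.

```java
import java.util.*;

/** Sketch of Naive Best-First link scoring: keyword and page vectors are
 *  built from raw term frequencies (no IDF). */
public class NaiveBestFirstScorer {

    /** Raw term-frequency vector for a piece of text. */
    static Map<String, Double> termFrequencies(String text) {
        Map<String, Double> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) tf.merge(token, 1.0, Double::sum);
        }
        return tf;
    }

    /** link_score_bfs: cosine similarity between the page and the keywords;
     *  every unvisited out-link on the page inherits this single score. */
    static double linkScoreBfs(String pageText, String keywords) {
        return CrawlMetrics.cosine(termFrequencies(keywords), termFrequencies(pageText));
    }
}
```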
The link_score_bfs is computed using Equation 3; however, the keywords replace the topic description, and the vector representation is based only on term frequencies. A TF-IDF weighting scheme is problematic during the crawl because there is no a priori knowledge about the distribution of terms across the crawled pages. As in the case of the Breadth-First crawler, a multi-threaded version of the Naive Best-First crawler with ethical behavior does not crawl in an exact best-first order. Due to multiple threads it acts like a Best-N-First crawler, where N is related to the number of simultaneously running threads. Best-N-First is a generalized version of the Naive Best-First crawler that picks the N best URLs to crawl at a time. We have found certain versions of the Best-N-First crawler to be strong competitors in our previous evaluations [14, 12, 18]. Note that the Naive Best-First crawler and the crawlers that follow keep the frontier size within its upper bound by retaining only the best URLs based on the assigned scores.

4.4 DOM Crawler

Unlike the Naive Best-First crawler, the DOM crawler tries to make use of the structure that is available in an HTML page through its tag tree representation. Figure 2 shows the tag tree representation of an HTML snippet. In its general form, the DOM crawler treats some node that appears on the path from the root of the tag tree to a link as an aggregation node of the link. All the text in the sub-tree rooted at the aggregation node is then considered a context of the link. In the current implementation of the DOM crawler we have set the immediate parent of a link to be its aggregation node. While the choice of the parent as the aggregation node may seem arbitrary, we note that it is no more arbitrary than picking W words or bytes around a hyperlink and treating them as the link's context, as has been done before [7, 1]. In an ideal situation we would like to identify the optimal aggregation node for a given page and topic. This problem is being studied [13] but it is beyond the scope of this paper. In our example from Figure 2, all the text (at any depth) under the <p> paragraph tag forms a context for the link.

After associating a link with its context, the crawler computes the cosine similarity between the context and the given keywords. This measure is called the context_score. In addition, the crawler computes the cosine similarity between the page in which the link was found and the keywords, which corresponds to link_score_bfs. The final link_score_dom associated with a link is computed as:

link_score_dom = α · link_score_bfs + (1 − α) · context_score

where α weighs the relative importance of the entire page content vs. the context of a link. It is set to 0.25 to give more importance to the context score. The frontier is a priority queue based on link_score_dom. In practice, due to the reasons mentioned for the other crawlers, the individual pages are not fetched in the exact order of priority. Note that the Naive Best-First crawler can be seen as a special case of the DOM crawler in which the aggregation node is set to the root of the DOM tree (the <html> tag), or where α = 1.
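A sketch of this scoring scheme, using the standard org.w3c.dom API on a tidied page together with the helpers sketched earlier, might look as follows. It illustrates the idea and is not the authors' implementation.

```java
import org.w3c.dom.*;

/** Sketch of DOM-crawler link scoring: the immediate parent of an <a> element
 *  is the aggregation node, and all text under it is the link context. */
public class DomScorer {

    static final double ALPHA = 0.25; // weight of whole-page score vs. context score

    static double linkScoreDom(Document page, Element anchor, String keywords) {
        // Whole-page score, as in the Naive Best-First crawler.
        double linkScoreBfs = NaiveBestFirstScorer.linkScoreBfs(
                page.getDocumentElement().getTextContent(), keywords);
        // Context score: text of the sub-tree rooted at the aggregation node (the link's parent).
        Node aggregationNode = anchor.getParentNode();
        double contextScore = NaiveBestFirstScorer.linkScoreBfs(
                aggregationNode.getTextContent(), keywords);
        return ALPHA * linkScoreBfs + (1 - ALPHA) * contextScore;
    }
}
```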
4.5 Hub-Seeking Crawler

The Hub-Seeking crawler is an extension of the DOM crawler that tries to explore potential hubs during its crawl. Since the seed URLs are assumed to be relevant to the topic, the crawler considers a page that links to many of the seed hosts to be a good hub. Seed hosts are the fully qualified host names of the Web servers of the seed URLs. A page points to a seed host if it points to any page from that host. The crawler assigns a hub_score to each page based on the number of seed hosts it points to. It then combines the hub_score with the link_score_dom to get a link_score_hub for each link on a given page. We would like hub_score to have the following properties:

- It should be a non-negative increasing function for all values of n >= 0, where n is the number of seed hosts the page points to.
- It should have an extremely small or zero value for n = 0, 1 (to avoid false positives).
- The score for n = 2 should be relatively high as compared to the average link_score_dom. The event that a page points to two different business entities in the same area is relatively unlikely and hence must be fully exploited.
- The scores must lie between 0 and 1 so that they can be compared against link_score_dom.

While there are infinitely many functions that satisfy the above properties, we use the one described in Figure 3.

Fig. 3. Hub scores based on \mathrm{hub\_score}(n) = \frac{n(n-1)}{1+n^2}, \quad n = 0, 1, 2, \ldots

Note that the equation in Figure 3 relies on the fact that the seed URLs are relevant. It is ad hoc and lacks the sophistication of methods such as Kleinberg's algorithm [8] to identify hubs and authorities. However, it is easy to compute in real time during the crawl using the link information from a single page. The link_score_hub that is used to prioritize the frontier queue is calculated as:

link_score_hub = max(hub_score, link_score_dom).   (5)
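The hub score and the combined link score are straightforward to compute from a page's out-links. A minimal sketch, assuming the seed hosts are available as a set of host names (names are ours):

```java
import java.util.*;

/** Sketch of the Hub-Seeking crawler's scoring (Figure 3 and Equation 5). */
public class HubSeekingScorer {

    /** hub_score(n) = n(n-1)/(1+n^2): zero for n = 0 or 1, 0.4 at n = 2,
     *  and approaching 1 as a page points to more distinct seed hosts. */
    static double hubScore(int n) {
        return n * (n - 1) / (1.0 + n * n);
    }

    /** Number of distinct seed hosts that a page's out-links point to. */
    static int seedHostsPointedTo(Collection<String> outLinkHosts, Set<String> seedHosts) {
        Set<String> hit = new HashSet<>(outLinkHosts);
        hit.retainAll(seedHosts);
        return hit.size();
    }

    /** link_score_hub = max(hub_score, link_score_dom) (Equation 5). */
    static double linkScoreHub(double hubScore, double linkScoreDom) {
        return Math.max(hubScore, linkScoreDom);
    }
}
```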
5 Performance

Figures 4 (a) & (b) show the performance of all the crawlers described above on the test bed. We note that on average the Hub-Seeking crawler performs better than the DOM crawler, which in turn performs better than the Naive Best-First crawler, for most of the 10,000-page crawls. The baseline Breadth-First crawler performs the worst, as expected. This observation holds for both of the performance metrics described in Section 3. The performance of the Hub-Seeking crawler is significantly better than that of the Naive Best-First crawler on average target recall@10000 (Figure 4 (b)); a one-tailed t-test confirms that the difference is statistically significant. While the Hub-Seeking crawler also outperforms the Naive Best-First crawler on the average precision@10000 metric (Figure 4 (a)), the corresponding one-tailed t-test gives a larger p-value, so the benefit is not as significant for precision. However, this means that while building a collection that is at least as topically focused as the one built by the Naive Best-First crawler, the Hub-Seeking crawler will harvest a greater number of the relevant pages of other business entities in the area.

Fig. 4. Performance of the crawlers: (a) average precision@N and (b) average target recall@N using the ODP links as seeds; (c) average precision@N and (d) average target recall@N using the augmented seeds. The error bars correspond to ±1 standard error.
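The significance statements above rest on one-tailed t-tests over per-topic scores. As a sketch, and assuming the test is paired on the per-topic values (an assumption on our part), the test statistic can be computed as follows; the one-tailed p-value then comes from the t distribution with n - 1 degrees of freedom, e.g. via a statistics library.

```java
/** Sketch of a paired t statistic for comparing two crawlers: a and b hold
 *  one performance value (e.g. target recall@10000) per topic. */
public class PairedTTest {

    static double pairedTStatistic(double[] a, double[] b) {
        int n = a.length;
        double[] diff = new double[n];
        double mean = 0;
        for (int i = 0; i < n; i++) { diff[i] = a[i] - b[i]; mean += diff[i]; }
        mean /= n;
        double var = 0;
        for (double d : diff) var += (d - mean) * (d - mean);
        var /= (n - 1);                    // sample variance of the per-topic differences
        return mean / Math.sqrt(var / n);  // t = mean(diff) / standard error of the mean
    }
}
```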
6 Improving the Seed Set

A search engine can assist a topical crawler by sharing the more global Web information available to it. We use the Google API to automatically find more effective seeds for the crawler. The Hub-Seeking crawler assumes that a page that points to many of the seed hosts is a good hub. We use a similar idea to find good hubs from a search engine in order to better seed the crawl. Using the Google API, we find the top 10 back-links of all the test bed seeds for a topic. We then count the number of times a back-link repeats across different seeds. The count is used as a hub score. The process is repeated for all the 159 topics in the test bed. Only the top 10 back-links according to this hub score are kept for each topic. If the count for a back-link is less than 2, the link is filtered out. Also, to make sure that the problem is non-trivial, we must avoid hubs that are duplicates of the ODP page that lists the seeds and the targets. For this purpose, we do not include any back-link that is from the domain dmoz.org or directory.google.com (a popular directory based on ODP data). Furthermore, to screen unknown ODP mirrors, any back-link page that has higher than 90% cosine similarity to the topic description is filtered out. A back-link page is first projected into the space of terms that appear in the topic description, and then the cosine similarity is measured in a manner similar to the Naive Best-First crawler. The projection is necessary to ignore extra text (in addition to the ODP data) that may have been added by a site.

Due to the filtering criteria and the limited results available through the Google API, it is possible that we do not find even a single hub for certain topics. In fact, out of the 159 topics, we find hubs for only 94 topics. We use the hubs thus obtained along with the original test bed seeds to form an augmented seed set. The augmented seeds are used to start the crawl for each of the 94 topics.
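The back-link counting and filtering steps can be sketched as follows. Here backlinksOf() and hostOf() are hypothetical stand-ins for the search-engine back-link query (the paper used the Google API) and a host-name extractor, and the 90% cosine-similarity mirror filter is omitted for brevity.

```java
import java.util.*;
import java.util.stream.Collectors;

/** Sketch of seed augmentation: back-links that recur across several seeds
 *  of a topic are treated as candidate hubs. */
public class SeedAugmenter {

    static List<String> findHubs(List<String> seeds) {
        // Count, for each back-link page, how many distinct seeds it points to.
        Map<String, Integer> hubScore = new HashMap<>();
        for (String seed : seeds) {
            for (String backlink : backlinksOf(seed, 10)) {            // top 10 back-links per seed
                String host = hostOf(backlink);
                if (host.endsWith("dmoz.org") || host.endsWith("directory.google.com")) continue;
                hubScore.merge(backlink, 1, Integer::sum);
            }
        }
        // Keep at most the top 10 back-links with a count of at least 2.
        return hubScore.entrySet().stream()
                .filter(e -> e.getValue() >= 2)
                .sorted((a, b) -> b.getValue() - a.getValue())
                .limit(10)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Hypothetical helpers: a back-link query and a host-name extractor.
    static List<String> backlinksOf(String url, int k) { return Collections.emptyList(); }
    static String hostOf(String url) { return url; }
}
```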
Figures 4 (c) & (d) show the performance of the four crawlers using the augmented seed set. We first notice that the order of performance (on both metrics) among the crawlers remains the same as that for the original seeds. However, the performance of the DOM crawler, in addition to that of the Hub-Seeking crawler, is now significantly better than the Naive Best-First crawler on average target recall@10000 (Figure 4 (d)); one-tailed t-tests verify this for both the DOM crawler and the Hub-Seeking crawler. Another important observation is that all the crawlers improve their performance, as compared to the experiments using the original seeds, for most of the crawls (cf. Figure 4 (a),(b) and Figure 4 (c),(d)). However, the average precision@N plots (with augmented seeds) show steeper downhill slopes, leading to slightly poorer performance towards the end of the crawls for all but the Breadth-First crawler (Figure 4 (c)). The benefit of augmented seeds is more prominent in the average target recall@N plots (cf. Figure 4 (b) and (d)). We perform one-tailed t-tests to check the hypothesis that the availability of augmented seeds (that include hubs) significantly improves the performance of the crawlers based on average target recall. We find that for all the crawlers the null hypothesis (no difference with augmented seeds) can be rejected in favor of the alternative hypothesis (performance improves with augmented seeds) at conventional significance levels, and in most cases at even stricter ones. We conclude that there is strong evidence that the availability of good hubs during the crawl improves the performance of the crawlers. By providing the hubs, the search engine assists the crawlers in making a good start that affects their overall performance.

7 Related Work

Various facets of Web crawlers have been studied in depth for nearly a decade (e.g., [6, 7, 3, 10]). Chakrabarti et al. [3] used a distiller within their crawler that identified good hubs using a modified version of Kleinberg's algorithm [8]. As noted earlier, we use a much simpler approach to identify potential hubs. Moreover, the authors did not show strong evidence of the benefit of using hubs during the crawl. In more recent work, Chakrabarti et al. [2] used the DOM structure of a Web page in a focused crawler. However, their use of DOM trees differs from the present idea of using the aggregation node to associate a link with a context. In particular, their method uses the DOM tree in a manner similar to the idea of using text tokens in the vicinity of a link to derive the context. In contrast, the aggregation node explicitly captures the tag tree hierarchy by grouping text tokens based on a common ancestor.

Measuring the performance of a crawler is a challenging problem due to the non-availability of relevant sets on the Web. Nevertheless, researchers use various metrics to understand the performance of topical crawlers (e.g., [5, 3, 11]). A study by Menczer et al. [11] on the evaluation of topical crawlers looks at a number of ways to compare different crawlers. A more general framework to evaluate topical crawlers is presented by Srinivasan et al. [18].

8 Conclusions

We investigated the problem of creating a small but effective collection of Web documents for the purpose of locating related business entities by evaluating four different crawlers. We found that the Hub-Seeking crawler, which identifies potential hubs and exploits the intra-document structure, significantly outperforms the Naive Best-First crawler based on estimated recall. The Hub-Seeking crawler also maintains a better estimated precision than the Naive Best-First crawler. Since Web pages of similar business entities are expected to form a competitive Web community, recognizing neutral hubs that link to many of the competing entities is important. We see a positive effect on performance through the identification of hubs both at the start of the crawl and during the crawl process. In the future we would like to explore the performance of the Hub-Seeking crawler on more general crawling problems. As noted earlier, the issue of finding the optimal aggregation node is being investigated [13].
Acknowledgments

Thanks to Robin McEntire, Valdis A. Dzelzkalns, and Paul Stead at GlaxoSmithKline for their valuable suggestions. This work has been supported in part through a summer internship by GlaxoSmithKline R&D to GP, and by the NSF under a CAREER grant to FM.

References

1. S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource list compilation by analyzing hyperlink structure and associated text. In WWW7, 1998.
2. S. Chakrabarti, K. Punera, and M. Subramanyam. Accelerated focused crawling through online relevance feedback. In WWW2002, Hawaii, May 2002.
3. S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific Web resource discovery. In WWW8, May 1999.
4. J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. In VLDB 2000, Cairo, Egypt.
5. J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. Computer Networks, 30(1-7), 1998.
6. P. M. E. De Bra and R. D. J. Post. Information retrieval in the World Wide Web: Making client-based searching feasible. In Proc. 1st International World Wide Web Conference, 1994.
7. M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, and S. Ur. The shark-search algorithm. An application: Tailored Web site mapping. In WWW7, 1998.
8. J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 1999.
9. S. Lawrence and C. L. Giles. Accessibility of information on the Web. Nature, 400, 1999.
10. F. Menczer and R. K. Belew. Adaptive retrieval agents: Internalizing local context and scaling up to the Web. Machine Learning, 39(2-3), 2000.
11. F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.
12. F. Menczer, G. Pant, and P. Srinivasan. Topical web crawlers: Evaluating adaptive algorithms. To appear in ACM Trans. on Internet Technologies. fil/papers/toit.pdf.
13. G. Pant. Deriving link-context from HTML tag tree. In 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003.
14. G. Pant, P. Srinivasan, and F. Menczer. Exploration versus exploitation in topic driven crawlers. In WWW02 Workshop on Web Dynamics, 2002.
15. M. Porter. An algorithm for suffix stripping. Program, 14(3), 1980.
16. S. RaviKumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the Web graph. In FOCS, pages 57-65, Nov. 2000.
17. G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
18. P. Srinivasan, F. Menczer, and G. Pant. A general evaluation framework for topical crawlers. Information Retrieval. Submitted. fil/papers/crawl framework.pdf.