Optimizing Apache Nutch For Domain Specific Crawling at Large Scale
2015 IEEE International Conference on Big Data (Big Data)

Luis A. Lopez 1, Ruth Duerr 2, Siri Jodha Singh Khalsa 3
NSIDC 1, The Ronin Institute 2, University of Colorado Boulder 3
Boulder, Colorado. [email protected]

Abstract: Focused crawls are key to acquiring data at large scale in order to implement systems like domain search engines and knowledge databases. Focused crawls add non-trivial problems to the already difficult problem of web-scale crawling. To address some of these issues, BCube, a building block of the National Science Foundation's EarthCube program, has developed a tailored version of Apache Nutch for data and web service discovery at scale. We describe how we started with a vanilla version of Apache Nutch and how we optimized and scaled it to reach gigabytes of discovered links and almost half a billion documents of interest crawled so far.

Keywords: focused crawl, big data, Apache Nutch, data discovery

I. INTRODUCTION

Once a problem becomes too large for vertical scaling to work, scaling horizontally is required. This is why Apache Nutch and Hadoop were designed [10]. Hadoop abstracts away most of the perils of implementing a distributed, fault-tolerant file system, uses MapReduce as its data-processing model, and has become a de facto standard for big data processing. Hadoop has advantages and limitations: it scales well and has a solid user base, but a plethora of configuration properties must be tuned in order to achieve a performant cluster [8]. Having a performant cluster is not enough, however; the crawl itself must be optimized so that fetch times stay within the polite crawl delays requested by the remote servers. At that point, other problems appear: slow servers, malformed content, poor standard implementations, dark data, semantic-equivalence issues, and very sparse data distribution.
Each challenge requires a different mitigation technique, and some can only be addressed by a site's owners. For example, if a slow server has relevant information, all we can ethically do is contact the organization to let them know they have a performance issue. The same holds for malformed content and poor standard implementations, such as wrong MIME types and dates reported by the servers. In BCube we ran into most of these problems, and we attempted to improve the crawl by tackling those that are under our control.

The role of BCube, as part of the NSF-funded EarthCube project, is to discover and integrate scientific data sets, web services, and information to build a "Google" for scientific data. For this reason a simple web crawler was not enough, and a focused crawler had to be developed. The crawler is a core component of BCube but not the only one; at the moment our stack also uses semantic technologies to extract valuable metadata from the discovered data. Several academic papers have been written about Hadoop cluster optimization and about advanced techniques for focused crawling; the purpose of this paper, however, is to describe how Nutch can be used as a domain-specific (topical) discovery tool, and our results so far in using it at scale. First, we describe the issues we ran into while doing focused crawls and what a user gets from a vanilla Nutch cluster. Second, we discuss the customizations made to maximize cluster utilization. Lastly, we discuss the improved performance and future work for this project.

II. RELATED WORK

Previous work on this topic has focused on the use of supervised and semi-supervised machine learning techniques to improve relevant document retrieval [1][2][11], but there is little literature on how these techniques perform at scale; in most cases the core component of the research is a custom-made crawler that is only tested on a few hundred pages [3][7][11].
This is not just inconclusive but also hides problems that focused crawlers run into at large scale [6]. Some open source projects can be, and have been, used to perform focused crawls as well; Scrapy (Frontera), Heritrix, and Bixo are among the most mature, and they all support extensions. Sadly, none of these frameworks has been used extensively in domain-specific crawls at scale with publicly available results. Future work would necessarily include porting the code used in BCube to these frameworks and benchmarking the results on similar hardware. Current efforts like JPL's MEMEX use Nutch and other Apache components to crawl the deep web and extract metadata from more specialized content [13]. As MEMEX and other projects evolve, it will be interesting to compare their findings and performance once the size of the search space grows to cover a significant portion of the Internet.
III. FOCUSED CRAWLS

3.1 Challenges

A focused crawl could be oversimplified as a Depth-First Search (DFS) with the added complexity that we operate on an undirected graph with billions of edges. Our job is then to figure out how to retrieve the most documents of interest while visiting the minimum number of nodes (web pages). This is critical because the Internet is large and computational resources are usually limited by time and cost. Precisely because of the intractable size of the Internet, Google came up with PageRank [4], so that links can be crawled in an ordered manner based on their relevance. The relevance algorithm used in PageRank is very similar to the one used by Nutch's OPIC scoring filter. This type of algorithm prioritizes popular documents and sites that are well connected. That approach is good for the average end user but not really suitable for domain-specific searches, as figures 3.1 and 3.2 illustrate. BCube's scoring filter currently uses only content; in the future we plan to use the path context in which a relevant document is located, plus semantic techniques, to improve and adapt the relevance based on user interaction [14]. As we stated before, the focus here will be on how a focused crawler performs independently of the scoring algorithm.

Let's assume that we have a good scoring algorithm in place and we are about to crawl the .edu search space. Our intuition says that we'll index more relevant documents as time passes, that we'll find a Gaussian distribution of these documents in our search space, and that performance will be predictable based on the input size. We will probably be wrong. As we scaled BCube to cover millions of URLs, we ran into three issues with our focused crawls.

3.1.1 Falling into the Tar Pit

We use the term "tar pit" to describe a host that contains many relevant documents; in our case these were data granules and web services (or data repositories).
For these sites our scoring filter acts as expected and boosts the URLs where relevant documents were found. If a tar pit was big enough, our crawler behaved not like a finder but like a harvester: we were indexing thousands of relevant documents from the same hosts, and this left other interesting sites unexplored as we spent more and more time harvesting the tar pits. Figure 3.3 shows this best: as our crawling frontier expands, all the relevant documents in blue get a higher priority if we discover them early.

Fig. 3.1 PageRank
Fig. 3.2 Focused Crawl
Fig. 3.3 A tar pit in blue; other relevant documents on top.

Note that in PageRank-like algorithms the nodes with more inlinks are visited first, which is not optimal for potentially diffuse, sparse, not very popular but relevant information. BCube's approach to scoring is not complicated: we used a semi-supervised relevance filter in which relevant documents were manually tagged and then tokenized, so that we can compare two things, content and URL path components.

To mitigate the tar-pit problem, we configured Nutch to explore only a limited number of documents from the same origin, even if all of them are relevant. This allowed us to discover more diverse data and improved cluster performance, as we were no longer crawling only a small set of high-scoring hosts. Crawling a limited number of hosts is, however, an almost unavoidable scenario in focused crawls, and this behavior led to our second problem: performance degradation.
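The mitigation described above can be sketched as a best-first frontier with a per-host counter; this is an illustrative sketch, not BCube's actual Nutch plugin code, and the `score` and `fetch_outlinks` callables are hypothetical stand-ins for the real relevance filter and fetcher:

```python
import heapq
from collections import defaultdict
from urllib.parse import urlparse

def crawl(seeds, score, fetch_outlinks, budget=100_000, max_per_host=1000):
    """Best-first focused crawl with a per-host cap (tar-pit mitigation).

    heapq is a min-heap, so scores are negated to pop the most
    relevant URL first. URLs popped from a capped host are simply
    dropped, mirroring the "discover and move on" behavior.
    """
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = {url for _, url in frontier}
    per_host = defaultdict(int)
    fetched = []
    while frontier and len(fetched) < budget:
        _, url = heapq.heappop(frontier)
        host = urlparse(url).netloc
        if per_host[host] >= max_per_host:
            continue  # tar pit: move on even if the page scores well
        per_host[host] += 1
        fetched.append(url)
        for link in fetch_outlinks(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return fetched
```

With `max_per_host` set low, a highly relevant but huge host stops monopolizing the frontier, which is the behavior the per-origin Nutch configuration achieves at scale.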
3.1.2 Performance Degradation

In focused crawls the number of hosts should decrease over time, since we only explore paths that are relevant, causing our cluster performance to decrease over time as well. This degradation follows an exponential decay pattern [18]. Figure 3.4 depicts the crawl performance we observed. For example, after finding the domain climate-data.org, our crawler proceeded to boost the score of its documents, and after a little while we noticed the following numbers in our crawl database.

Table 3.1 Documents fetched from climate-data.org

Domain                 Documents fetched
fr.climate-data.org
bn.climate-data.org
de.climate-data.org
en.climate-data.org

As one can guess, the documents were very similar climate reports, simply served on sites geared toward French, German, Bengali, and English users. This is also shown in figure 3.5.

Fig. 3.4 Documents indexed per second as the crawl frontier expands. The two models represent slightly different decay constants [18].

This exponential decay is driven by three factors: seed quality, crawl politeness, and scoring effectiveness. If we use good URL seeds [9] we reduce our search space early, and if we crawl politely we have to obey robots.txt, causing underutilization of the cluster. This decay has also been found by other research [17][5]. In BCube we tried to overcome this issue by spawning separate clusters to handle specific use cases, scaling out or reducing resources as needed. Thanks to Amazon's EMR this task was simplified; the only extra effort was scripting the glue code to keep track of all the running clusters. In our case most of the servers had crawler-friendly robots.txt values, but some had Crawl-Delay values of up to one day per request.
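The decay pattern of figure 3.4 can be written as a simple model; the initial throughput and decay constant below are illustrative placeholders, since the paper's fitted constants are not reproduced here:

```python
import math

def docs_per_second(t_hours, d0, decay_constant):
    """Indexed-documents-per-second model: throughput decays
    exponentially as the frontier narrows to fewer,
    politeness-limited hosts."""
    return d0 * math.exp(-decay_constant * t_hours)

# A decay constant can be read off a half-life: if throughput halves
# every 12 hours, k = ln(2) / 12.
k = math.log(2) / 12
```

Fitting `k` to observed throughput makes it possible to compare the two models in figure 3.4 quantitatively, or to decide when a shrinking crawl should be moved to a smaller cluster.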
Having multiple crawls working at different speeds allowed us to maximize our resources and avoid this problem.

3.1.3 Duplication

Every search engine has to deal with duplication. Identifying and deleting duplicated content is a process that normally happens as a post-processing step, once all the documents are indexed. At web scale, however, we cannot afford to index essentially identical repositories from different hosts and deduplicate later. There are also different kinds of equivalence: clones, versions, copies, and semantically similar documents. A curious case happened when our filter boosted similar relevant documents: we crawled the same documents a repeated number of times, indexing them over and over. This was because some sites distribute content in different languages while the actual content, given that we are looking for scientific data, does not change much.

Fig. 3.5 Similar documents in slightly different yellow colors.

Currently BCube is exploring semantic technologies [12][14] to further characterize similar relevant documents, to avoid crawling multiple times and indexing what is basically the same information.

IV. NUTCH CUSTOMIZATIONS

4.1 Out-of-the-box configuration

Nutch comes with generic plugins enabled, and its default configuration is good for general crawls but not so much for focused crawls. Nutch uses two main configuration files, nutch-default.xml and nutch-site.xml, the latter used to override the default configuration. We can adjust a number of things in the configuration files, e.g. the number of documents to crawl on each iteration or the number of threads to use on each mapper. There are also Hadoop-specific configuration properties that we probably need to tune. Our crawling strategy required us to index the raw content of a relevant document, and there was no official plugin developed for this.
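One common way to catch clones and near-copies before indexing is shingle-based resemblance; this is a generic sketch of that technique, not the semantic approach BCube is exploring, and as the climate-data.org case shows it would not catch translated versions of the same content:

```python
import re

def shingles(text, k=5):
    """Set of k-word shingles of a normalized document."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k])
            for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Resemblance of two shingle sets; 1.0 means identical content."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(doc_a, doc_b, threshold=0.8):
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```

At crawl scale one would hash the shingles (e.g. MinHash) rather than compare full sets pairwise, but the resemblance measure is the same.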
We also had to index documents based on their MIME type (once they pass all the filters), and Nutch did not have this plugin until recently.
4.2 BCube tuning and performance

After replacing the scoring filter and putting our extra plugins in place, we had to maximize fetching speeds. The properties for this are grouped in the fetcher section of the nutch-default configuration file. Something to take into account beforehand is the number of CPUs available to each Hadoop mapper, which we used for our baseline calculation. Since document fetching is not a CPU-bound task, we can play with the values to manage workload distribution. For example, if we have a 16-node cluster and we want to crawl 160,000 URLs at a time, then each mapper gets up to 10,000 URLs. Given that we generate only 1,000 documents per host, and all our hosts have more than 1,000 documents, each mapper will have 10 queues. There is another important property to configure: the number of threads that can access the same queue. This value can improve speeds dramatically depending on how we group our URLs. Going back to the previous case, with 10 queues per mapper and 2 threads per host, having more than 20 threads per mapper is wasteful, as the maximum number of working threads is given by (queues * fetcher.threads.per.queue). Nutch properties can also be used to dynamically target a specific bandwidth; in our experiments these properties were less effective than calculating from the current domains and active queues.

Once we had the maximum number of working threads, we capped fetching times at 45 minutes. This was because we noticed that the cluster crawled more than 90% of its queues in 30 minutes or less (see fig. 4.1) but in some cases took up to 2 hours to finish. The actual cause was that some queues had high Crawl-Delay values and the cluster waited until all the queues were empty.

Some operations in Nutch are performed on the current segment on which we are operating; a segment is the list of URLs being fetched, parsed, and integrated into our crawl database.
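The workload arithmetic above can be captured in a small helper; a sketch of the baseline calculation, not code from the BCube stack:

```python
def max_working_threads(urls_per_cycle, nodes, generate_max_count,
                        threads_per_queue):
    """Upper bound on concurrently useful fetcher threads per mapper.

    Each mapper receives urls_per_cycle / nodes URLs; with at most
    generate_max_count URLs per host, that yields this many host
    queues, and politeness allows only threads_per_queue threads to
    work one queue at a time.
    """
    urls_per_mapper = urls_per_cycle // nodes
    queues = urls_per_mapper // generate_max_count
    return queues * threads_per_queue

# The example from the text: a 16-node cluster crawling 160,000 URLs
# per cycle, 1,000 URLs per host, 2 threads per queue -> 10 queues
# and at most 20 useful threads per mapper.
```

Configuring more fetcher threads than this bound yields idle threads, which is why the thread count was derived from queues rather than from CPU count.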
Other operations are carried out on the entire crawl database. We improved performance by applying some filters only in the parsing stage and by pruning the URLs that we considered totally off-topic or that had very low scores. This was important because, even if we only explore the high-scoring nodes, MapReduce will process the total number of URLs in our database in operations like scoring propagation. This is similar to pruning in search algorithms such as A* or alpha-beta [19].

The following properties had the biggest impact on the performance of individual clusters: we applied filters once, we reduced the minimum waiting time between requests, we generated only the 1,000 highest-scoring URLs per host, and we limited the fetching stage to 45 minutes.

Table 4.1 Configuration changes for a 16-node cluster.

Property                    Default    BCube
crawldb.url.filters         true       false
db.update.max.inlinks
db.injector.overwrite       false      true
generate.max.count                     1000
fetcher.server.delay        10         2
fetcher.threads.fetch
fetcher.threads.per.queue   1          2
fetcher.timelimit.mins                 45

Finally, after several tests, we show the typical behavior of one of our crawls against what we got from a vanilla version of Nutch. This test was carried out using 16-node clusters on Amazon's EMR.

Fig. 4.1 Fetching time on 250k documents and 4,223 queues.

A key difference between our focused crawls and a normal crawl lies in what we keep from them. Normal crawls index all the documents found; we do not. We keep only the crawldb and linkdb, and delete the content of all documents once the ones we consider relevant are indexed. The result is savings above 99% of the total HDFS space.

Fig. 4.2 Relevant documents indexed per iteration in a vanilla Nutch cluster and a BCube cluster.
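Expressed as overrides in nutch-site.xml, the changes in Table 4.1 would look roughly like the fragment below; this is a sketch showing only the properties whose values the text states, and the exact values used by BCube beyond those are not reproduced here:

```xml
<!-- nutch-site.xml: overrides of nutch-default.xml (sketch) -->
<configuration>
  <property>
    <name>crawldb.url.filters</name>
    <value>false</value> <!-- apply URL filters once, not on every update -->
  </property>
  <property>
    <name>db.injector.overwrite</name>
    <value>true</value>
  </property>
  <property>
    <name>generate.max.count</name>
    <value>1000</value> <!-- at most 1,000 URLs per host per cycle -->
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>2</value> <!-- seconds between requests to the same host -->
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>
  <property>
    <name>fetcher.timelimit.mins</name>
    <value>45</value> <!-- cap the fetch stage at 45 minutes -->
  </property>
</configuration>
```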
As figure 4.2 shows, the gains are noticeable. We could theoretically match this performance with a bigger vanilla Nutch cluster indexing more documents, but that would be expensive and inefficient.
Table 4.2 summarizes the content of our crawl database after 2 weeks of continuous crawling.

Table 4.2 Crawl database status after two weeks of crawling.

Status         Number
redir_temp     670,937
redir_perm     826,173
notmodified    603,808
duplicate      958,431
unfetched      565,990,788
gone           65,527
fetched        408,368,666

Of the fetched documents, only 614,724 were relevant enough to be indexed, roughly 0.15% of the total documents retrieved.

A note on costs: the total cost of this two-week crawl was less than $500 USD on the Amazon cloud. Obviously, this is considerably less expensive than buying hardware and attempting an in-house cluster. We also used spot instances to decrease costs.

V. CONCLUSIONS

Based on the empirical results we obtained after optimizing Nutch for focused crawling, we have learned that there are major issues that can only be reproduced at large scale, and they only get worse as the search space grows. The two most difficult problems we ran into were performance degradation due to sparse data distribution and duplicated content on the Web. We also learned that some issues cannot be addressed by improving the focused crawl alone; there are problems out of our control that probably won't be solved by the crawl process itself: prohibitive values for crawlers in robots.txt, poor performance of remote hosts, incorrect implementations of web standards, bad metadata, and semantically similar content. Despite the complexity of the overall problem, we introduced different mitigation techniques that tackled some of these issues with good results. Search-space pruning using manually tagged relevant documents, optimized configuration values to improve fetching times, and the "discover and move on" approach for relevant data proved to be the most effective.
Future work will focus on optimizing the scoring algorithms and introducing graph context [1][11] and semantic information to make the crawl more adaptable and responsive to sparse data distribution. We also want to develop data-type-specific responses so that we maximize both the data set and data service information captured, without crawling large amounts of very similar information. BCube is committed to sharing all findings with the community, and all the work presented here is the result of open-sourced code. Our Nutch crawldb set, with more than 100 million URLs that might be useful to other research, is publicly available in AWS S3 (200.62 GB, with more than 6,761 unique domains) [15][16].

REFERENCES

[1] Michelangelo Diligenti, Frans Coetzee, Steve Lawrence, C. Lee Giles, Marco Gori. "Focused Crawling Using Context Graphs." 26th International Conference on Very Large Databases (VLDB 2000), September 2000.
[2] O. Etzioni, M. Cafarella, D. Downey, et al. "Web-scale Information Extraction in KnowItAll."
[3] John Garofalakis, Yannis Panagis, Evangelos Sakkopoulos, Athanasios Tsakalidis. "Web Service Discovery Mechanisms: Looking for a Needle in a Haystack?" Journal of Web Engineering, Rinton Press, 5(3).
[4] Page, L., Brin, S., Motwani, R., Winograd, T. "The PageRank Citation Ranking: Bringing Order to the Web." Technical Report, Stanford Digital Library Technologies.
[5] Yang Sun, Ziming Zhuang, and C. Lee Giles. "A Large-Scale Study of Robots.txt." The Pennsylvania State University, ACM.
[6] Wei Xu, Ling Huang, Armando Fox, David A. Patterson, and Michael Jordan. "Detecting Large-Scale System Problems by Mining Console Logs." University of California, Berkeley, Technical Report No. UCB/EECS, July 21.
[7] Li, Wenwen, Yang, Chaowei, and Yang, Chongjun (2010). "An Active Crawler for Discovering Geospatial Web Services and Their Distribution Pattern: A Case Study of OGC Web Map Service." International Journal of Geographical Information Science, 24(8).
[8] Huang, Shengsheng, Jie Huang, Jinquan Dai, Tao Xie, and Bo Huang. "The HiBench Benchmark Suite: Characterization of the MapReduce-based Data Analysis." 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW).
[9] Hati, Debashis, and Amritesh Kumar. "An Approach for Identifying URLs Based on Division Score and Link Score in Focused Crawler." International Journal of Computer Applications (IJCA) 2.3 (2010).
[10] Cafarella, Mike, and Doug Cutting. "Building Nutch." Queue 2.2 (2004): 54.
[11] Liu, Lu, and Tao Peng. "Clustering-based Topical Web Crawling Using CFu-tree Guided by Link-context." Frontiers of Computer Science 8.4 (2014).
[12] Isaacson, Joe. "The Science of Crawl (Part 1): Deduplication of Web Content." Web.
[13] White, Chris, and Mattmann, Chris. "Memex (Domain-Specific Search)." DARPA, Feb. Web.
[14] Lopez, L. A.; Khalsa, S. J. S.; Duerr, R.; Tayachow, A.; Mingo, E. "The BCube Crawler: Web Scale Data and Service Discovery for EarthCube." American Geophysical Union, Fall Meeting 2014, abstract #IN51C-06.
[15] Nutch logs are stored in s3://bcube-nutch-test/logs; crawl data:
[16] Nutch JIRA tickets:
[17] Krugler, Ken. "Performance Problems with Vertical/Focused Web Crawling." 2009. Web.
[18] Truslove, I.; Duerr, R. E.; Wilcox, H.; Savoie, M.; Lopez, L.; Brandt, M. "Crawling the Web for Libre." American Geophysical Union, Fall Meeting 2012, abstract #IN11D-1482.
[19] "Alpha-Beta Pruning." Wikipedia.
Journal homepage: http://www.journalijar.com INTERNATIONAL JOURNAL OF ADVANCED RESEARCH RESEARCH ARTICLE CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING R.Kohila
Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2
Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2 Department of Computer Engineering, YMCA University of Science & Technology, Faridabad,
W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract
W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the
RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS
ISBN: 978-972-8924-93-5 2009 IADIS RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS Ben Choi & Sumit Tyagi Computer Science, Louisiana Tech University, USA ABSTRACT In this paper we propose new methods for
Open source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
BIG DATA, MAPREDUCE & HADOOP
BIG, MAPREDUCE & HADOOP LARGE SCALE DISTRIBUTED SYSTEMS By Jean-Pierre Lozi A tutorial for the LSDS class LARGE SCALE DISTRIBUTED SYSTEMS BIG, MAPREDUCE & HADOOP 1 OBJECTIVES OF THIS LAB SESSION The LSDS
Building Multilingual Search Index using open source framework
Building Multilingual Search Index using open source framework ABSTRACT Arjun Atreya V 1 Swapnil Chaudhari 1 Pushpak Bhattacharyya 1 Ganesh Ramakrishnan 1 (1) Deptartment of CSE, IIT Bombay {arjun, swapnil,
International Journal of Advanced Research in Computer Science and Software Engineering
Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Performance of
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing
Final Project Proposal. CSCI.6500 Distributed Computing over the Internet
Final Project Proposal CSCI.6500 Distributed Computing over the Internet Qingling Wang 660795696 1. Purpose Implement an application layer on Hybrid Grid Cloud Infrastructure to automatically or at least
User Guide to the Content Analysis Tool
User Guide to the Content Analysis Tool User Guide To The Content Analysis Tool 1 Contents Introduction... 3 Setting Up a New Job... 3 The Dashboard... 7 Job Queue... 8 Completed Jobs List... 8 Job Details
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
A Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search
Project for Michael Pitts Course TCSS 702A University of Washington Tacoma Institute of Technology ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search Under supervision of : Dr. Senjuti
Machine Learning Log File Analysis
Machine Learning Log File Analysis Research Proposal Kieran Matherson ID: 1154908 Supervisor: Richard Nelson 13 March, 2015 Abstract The need for analysis of systems log files is increasing as systems
Evaluating partitioning of big graphs
Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist [email protected], [email protected], [email protected] Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed
How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Search and Real-Time Analytics on Big Data
Search and Real-Time Analytics on Big Data Sewook Wee, Ryan Tabora, Jason Rutherglen Accenture & Think Big Analytics Strata New York October, 2012 Big Data: data becomes your core asset. It realizes its
Everything you need to know about flash storage performance
Everything you need to know about flash storage performance The unique characteristics of flash make performance validation testing immensely challenging and critically important; follow these best practices
Indexing big data with Tika, Solr, and map-reduce
Indexing big data with Tika, Solr, and map-reduce Scott Fisher, Erik Hetzner California Digital Library 8 February 2012 Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 1 / 19 Outline
Big Data: Tools and Technologies in Big Data
Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can
ANALYSING THE FEATURES OF JAVA AND MAP/REDUCE ON HADOOP
ANALYSING THE FEATURES OF JAVA AND MAP/REDUCE ON HADOOP Livjeet Kaur Research Student, Department of Computer Science, Punjabi University, Patiala, India Abstract In the present study, we have compared
Implementing Graph Pattern Mining for Big Data in the Cloud
Implementing Graph Pattern Mining for Big Data in the Cloud Chandana Ojah M.Tech in Computer Science & Engineering Department of Computer Science & Engineering, PES College of Engineering, Mandya [email protected]
Optimizing Shared Resource Contention in HPC Clusters
Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs
CloudSearch: A Custom Search Engine based on Apache Hadoop, Apache Nutch and Apache Solr
CloudSearch: A Custom Search Engine based on Apache Hadoop, Apache Nutch and Apache Solr Lambros Charissis, Wahaj Ali University of Crete {lcharisis, ali}@csd.uoc.gr Abstract. Implementing a performant
Evaluating Task Scheduling in Hadoop-based Cloud Systems
2013 IEEE International Conference on Big Data Evaluating Task Scheduling in Hadoop-based Cloud Systems Shengyuan Liu, Jungang Xu College of Computer and Control Engineering University of Chinese Academy
MINIMIZING STORAGE COST IN CLOUD COMPUTING ENVIRONMENT
MINIMIZING STORAGE COST IN CLOUD COMPUTING ENVIRONMENT 1 SARIKA K B, 2 S SUBASREE 1 Department of Computer Science, Nehru College of Engineering and Research Centre, Thrissur, Kerala 2 Professor and Head,
Big Data and Natural Language: Extracting Insight From Text
An Oracle White Paper October 2012 Big Data and Natural Language: Extracting Insight From Text Table of Contents Executive Overview... 3 Introduction... 3 Oracle Big Data Appliance... 4 Synthesys... 5
Analysis of Issues with Load Balancing Algorithms in Hosted (Cloud) Environments
Analysis of Issues with Load Balancing Algorithms in Hosted (Cloud) Environments Branko Radojević *, Mario Žagar ** * Croatian Academic and Research Network (CARNet), Zagreb, Croatia ** Faculty of Electrical
Frontera: open source, large scale web crawling framework. Alexander Sibiryakov, October 1, 2015 [email protected]
Frontera: open source, large scale web crawling framework Alexander Sibiryakov, October 1, 2015 [email protected] Sziasztok résztvevők! Born in Yekaterinburg, RU 5 years at Yandex, search quality
IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE
IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT
Cloud Computing @ JPL Science Data Systems
Cloud Computing @ JPL Science Data Systems Emily Law, GSAW 2011 Outline Science Data Systems (SDS) Space & Earth SDSs SDS Common Architecture Components Key Components using Cloud Computing Use Case 1:
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
Implement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
Indian Journal of Science The International Journal for Science ISSN 2319 7730 EISSN 2319 7749 2016 Discovery Publication. All Rights Reserved
Indian Journal of Science The International Journal for Science ISSN 2319 7730 EISSN 2319 7749 2016 Discovery Publication. All Rights Reserved Perspective Big Data Framework for Healthcare using Hadoop
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Mark Bennett. Search and the Virtual Machine
Mark Bennett Search and the Virtual Machine Agenda Intro / Business Drivers What to do with Search + Virtual What Makes Search Fast (or Slow!) Virtual Platforms Test Results Trends / Wrap Up / Q & A Business
Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf
Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf Rong Gu,Qianhao Dong 2014/09/05 0. Introduction As we want to have a performance framework for Tachyon, we need to consider two aspects
Survey on Scheduling Algorithm in MapReduce Framework
Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India
HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering
HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering Chang Liu 1 Jun Qu 1 Guilin Qi 2 Haofen Wang 1 Yong Yu 1 1 Shanghai Jiaotong University, China {liuchang,qujun51319, whfcarter,yyu}@apex.sjtu.edu.cn
Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing
Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing Hsin-Wen Wei 1,2, Che-Wei Hsu 2, Tin-Yu Wu 3, Wei-Tsong Lee 1 1 Department of Electrical Engineering, Tamkang University
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
MLg. Big Data and Its Implication to Research Methodologies and Funding. Cornelia Caragea TARDIS 2014. November 7, 2014. Machine Learning Group
Big Data and Its Implication to Research Methodologies and Funding Cornelia Caragea TARDIS 2014 November 7, 2014 UNT Computer Science and Engineering Data Everywhere Lots of data is being collected and
Understanding Web personalization with Web Usage Mining and its Application: Recommender System
Understanding Web personalization with Web Usage Mining and its Application: Recommender System Manoj Swami 1, Prof. Manasi Kulkarni 2 1 M.Tech (Computer-NIMS), VJTI, Mumbai. 2 Department of Computer Technology,
Magento & Zend Benchmarks Version 1.2, 1.3 (with & without Flat Catalogs)
Magento & Zend Benchmarks Version 1.2, 1.3 (with & without Flat Catalogs) 1. Foreword Magento is a PHP/Zend application which intensively uses the CPU. Since version 1.1.6, each new version includes some
Investigating Hadoop for Large Spatiotemporal Processing Tasks
Investigating Hadoop for Large Spatiotemporal Processing Tasks David Strohschein [email protected] Stephen Mcdonald [email protected] Benjamin Lewis [email protected] Weihe
International Journal of Innovative Research in Computer and Communication Engineering
FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,
Chapter 1 - Web Server Management and Cluster Topology
Objectives At the end of this chapter, participants will be able to understand: Web server management options provided by Network Deployment Clustered Application Servers Cluster creation and management
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
Data Integrity Check using Hash Functions in Cloud environment
Data Integrity Check using Hash Functions in Cloud environment Selman Haxhijaha 1, Gazmend Bajrami 1, Fisnik Prekazi 1 1 Faculty of Computer Science and Engineering, University for Business and Tecnology
