A USE CASE OF BIG DATA EXPLORATION & ANALYSIS WITH HADOOP: STATISTICS REPORT GENERATION

Sumitha VS 1, Shilpa V 2
1 M.E. Final Year, Department of Computer Science Engineering (IT), UVCE, Bangalore, gvsumitha@gmail.com
2 M.E. Final Year, Department of Computer Science Engineering (IT), UVCE, Bangalore, shilpav66@gmail.com

ABSTRACT

The proposed system is a use case of Big Data exploration and analysis: an intelligent system for the web that evaluates domain names on various parameters in order to reveal business trends and, in turn, enhance the business. It provides a unique method to learn more about registered domain names and related statistics. The service uses multiple website attributes, such as domain name and resolution status, along with insights gathered from the analysis of related data, to deliver relevant, actionable reports. These reports are designed to provide business intelligence: they give better insight into how the business works and into customer trends, which helps in understanding the customer better.

Keywords: Domain, Big data, Hadoop distributed file system (HDFS), Crawler

INTRODUCTION

Big Data encompasses everything from click stream data from the web to genomic and proteomic data from biological research and medicine. Big Data is a heterogeneous mix of structured data (traditional datasets in rows and columns, such as DBMS tables, CSV and XLS files) and unstructured data such as attachments, manuals, images, PDF documents, medical records (x-rays, ECG and MRI images), forms, contacts, documents, and rich media like graphics, video and audio. Businesses are primarily concerned with managing unstructured data, because over 80 percent of enterprise data is unstructured and requires significant storage space and effort to manage.
Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. Such data requires new technologies and architectures so that value can be extracted from it through capture and analysis. New sources of big data include location-specific data arising from traffic management and from the tracking of personal devices such as smartphones. Big Data has emerged because we live in a society that makes increasing use of data-intensive technologies. Because of the sheer size of the data, effective analysis with existing traditional techniques becomes very difficult. Since Big Data is a recent technology that can bring huge benefits to business organizations, the challenges and issues involved in adopting it need to be understood.

The term Big Data describes datasets that keep growing to the point where they become difficult to manage with existing database management concepts and tools. The difficulties relate to data capture, storage, search, sharing, analytics, visualization and so on. The properties of Big Data (volume, velocity, variety, variability, value and complexity) pose many challenges. The challenges faced in large-scale data management include scalability, unstructured data, accessibility, real-time analytics, fault tolerance and many more. In addition to variations in the amount of data stored in different sectors, the types of data generated and stored (encoded video, images, audio, or text/numeric information) also differ markedly from industry to industry.

2. LITERATURE SURVEY

The Lustre File System, an open source, high-performance file system from Cluster File Systems, Inc., is a distributed file system that eliminates the performance, availability, and scalability problems present in many traditional distributed file systems. Lustre is a highly modular, next-generation storage architecture that combines established open standards, the Linux operating system, and innovative protocols into a reliable, network-neutral data storage and retrieval solution. Lustre provides high I/O throughput in clusters and shared-data environments, and also provides independence from the location of data on the physical storage, protection from single points of failure, and fast recovery from cluster reconfiguration and server or network outages.

As a parallel file system, PVFS (Parallel Virtual File System) has the primary goal of providing high-speed access to file data for parallel applications. In addition, PVFS provides a cluster-wide consistent name space, enables user-controlled striping of data across disks on different I/O nodes, and allows existing binaries to operate on PVFS files without recompiling. Like many other file systems, PVFS is designed as a client-server system with multiple servers, called I/O daemons. I/O daemons typically run on separate nodes in the cluster, called I/O nodes, which have disks attached to them. Each PVFS file is striped across the disks on the I/O nodes. Application processes interact with PVFS via a client library. PVFS also has a manager daemon that handles only metadata operations, such as permission checking for file creation, open, close, and remove operations. The manager does not participate in read/write operations; the client library and the I/O daemons handle all file I/O without the intervention of the manager. The clients, I/O daemons, and the manager need not run on different machines, although running them on different machines may result in higher performance.
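The striping of a file across I/O nodes described above can be illustrated with a short sketch. It assumes plain round-robin striping that starts at I/O node 0 with a fixed stripe size; the names `locate` and `split_request` are hypothetical and are not part of the PVFS client library, whose actual striping parameters are user-controlled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StripeLocation:
    io_node: int       # index of the I/O node that stores the byte
    local_offset: int  # offset inside that node's portion of the file


def locate(offset: int, stripe_size: int, num_io_nodes: int) -> StripeLocation:
    """Map a logical file offset to an I/O node (assumed round-robin layout)."""
    if offset < 0 or stripe_size <= 0 or num_io_nodes <= 0:
        raise ValueError("offset must be >= 0; stripe_size and num_io_nodes > 0")
    stripe_index = offset // stripe_size
    io_node = stripe_index % num_io_nodes
    # Number of full stripes this node already holds before the current one.
    stripes_before = stripe_index // num_io_nodes
    local_offset = stripes_before * stripe_size + offset % stripe_size
    return StripeLocation(io_node, local_offset)


def split_request(offset: int, length: int, stripe_size: int, num_io_nodes: int):
    """Split a byte range into (io_node, local_offset, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        loc = locate(offset, stripe_size, num_io_nodes)
        # A piece never crosses a stripe boundary.
        chunk = min(stripe_size - offset % stripe_size, end - offset)
        pieces.append((loc.io_node, loc.local_offset, chunk))
        offset += chunk
    return pieces
```

Because each piece of a request maps to a single I/O node, a client can issue the pieces to different I/O daemons in parallel, which is consistent with the manager staying out of the read/write path.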
PVFS is primarily a user-level implementation; no kernel modifications or modules are necessary to install or operate the file system. A Linux kernel module is, however, available to make simple file manipulation more convenient. PVFS currently uses TCP for all internal communication. As a result, it does not depend on any particular message-passing library.

Cloud-based storage services have established themselves as a paradigm of choice for supporting the bulk storage needs of modern networked services and applications. Although individual storage service providers can be trusted to do their best to store user data reliably, exclusive reliance on any single provider or storage service leaves users inherently at risk of being locked out of their data due to outages, connectivity problems, and unforeseen alterations of the service contracts. An emerging multi-cloud storage paradigm addresses these concerns by replicating data across multiple cloud storage services, potentially operated by distinct providers. Although significant progress has been made in building practical multi-cloud storage systems, little is known about their fundamental capabilities and limitations.
The primary challenge lies in the wide variety of storage interfaces and consistency semantics offered by different cloud providers to their external users.

3. BIG DATA ANALYTICS

Big data analytics is the area where advanced analytic techniques operate on big data sets. It is really about two things, big data and analytics, and how the two have teamed up to create one of the most profound trends in business intelligence (BI). Map Reduce by itself is capable of analyzing large distributed data sets, but the heterogeneity, velocity and volume of Big Data make it a challenge for traditional data analysis and management tools. A problem with Big Data systems is that they use NoSQL stores, which have no Data Description Language (DDL) but do support transaction processing. Also, web-scale data is not universal and is heterogeneous. For the analysis of Big Data, database integration and cleaning are much harder than in traditional mining approaches. Parallel processing and distributed computing, which are nearly non-existent in RDBMS, are becoming standard procedure.

With big data analytics, the user is trying to discover new business facts that no one in the enterprise knew before; a better term would be discovery analytics. To do that, the analyst needs large volumes of data with plenty of detail. This is often data that the enterprise has not yet tapped for analytics, for example log data. The analyst might mix that data with historic data from a data warehouse and discover, for example, a new change in behavior in a subset of the customer base. The discovery would lead to a metric, report, analytic model, or some other product of BI, through which the company could track and predict the new form of customer behavioral change.
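As an illustration of the Map Reduce model referred to in this section, the following minimal sketch simulates a job in a single process. It is not Hadoop code: the record layout (domain name, resolution status), the sample statuses and all function names are assumptions made for the example. The job counts domains per top-level domain and resolution status, the kind of statistic a domain report might contain.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit ((tld, status), 1) for one (domain, status) record."""
    domain, status = record
    tld = domain.rsplit(".", 1)[-1].lower()
    yield (tld, status), 1


def combine_phase(pairs):
    """Combine: pre-aggregate the map output of one input split locally."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def shuffle_phase(combined_splits):
    """Shuffle: group the values of all splits by key."""
    grouped = defaultdict(list)
    for pairs in combined_splits:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: sum the partial counts for one key."""
    return key, sum(values)


def run_job(splits):
    """Run map, combine, shuffle and reduce over a list of input splits."""
    combined = [
        combine_phase(pair for record in split for pair in map_phase(record))
        for split in splits
    ]
    grouped = shuffle_phase(combined)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, already divided into two splits.
    sample_splits = [
        [("example.com", "resolved"), ("foo.com", "resolved")],
        [("bar.net", "unresolved"), ("baz.com", "resolved")],
    ]
    for key, count in sorted(run_job(sample_splits).items()):
        print(key, count)
```

In a real cluster each split would be processed on the node that stores it, and the combine step would reduce the amount of data moved during the shuffle; here everything runs sequentially to keep the data flow visible.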
Discovery analytics against big data can be enabled by different types of analytic tools, including those based on SQL queries, data mining, statistical analysis, fact clustering, data visualization, natural language processing, text analytics and artificial intelligence. A unique challenge for researchers and academicians is that large datasets need special processing systems. Map Reduce over HDFS gives data scientists the techniques through which analysis of Big Data can be done. HDFS is a distributed file system architecture modeled on the design of the original Google File System. Map Reduce jobs use efficient data processing techniques that can be applied in each of the phases of MapReduce, namely Mapping, Combining, Shuffling, Indexing, Grouping and Reducing.

3.1 Hadoop and its characteristics

Hadoop is an open source project hosted by the Apache Software Foundation. It consists of many small sub-projects that belong to the category of infrastructure for distributed computing. The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers. One hundred organizations worldwide report using Hadoop.

4. SYSTEM ARCHITECTURE

4.1 Actors

Website Operators - They maintain the web servers of the Internet. They are the custodians of their networks, operating Firewalls, Intrusion
Product Development - In the past, they have provided a vital input into the system: the marker database, which defines many (but not all) attributes that the system can find in a domain.
The business office also analyzes the reports and looks for interesting correlations and data points.
Engineering - Besides developing features for the application, engineering maintains strict control of the categorization of domain names in the NAICS classification system.
Customers - Some customers provide a critical input into the system: the list of domain names they want to study. All customers receive a report once a month with the results of the analysis.
Digital Envoy - Digital Envoy, through its NetAcuity API, provides geo-location information for IP addresses. This information is used, for example, to obtain and record the geographical location of a web server.

5. IMPLEMENTATION

Fig-1: System Overview
Fig-1: System Implementation

Crawl: Downloads a number of pages from each domain in a number of zones.
Analyze: As domains are completed by the crawler, the downloaded pages are searched for "markers", classified using Grapeshot, geolocated, etc. The results are compiled in result files.
Synthesize: Combines analysis data with traffic data collected from Root Level Name Servers to provide link analysis, clustering, etc.
Report: Using the analyzer and synthesizer files, the system produces reports for three types of customers: Registries, Registrars and Internal.

International Journal of Research In Science & Engineering e-issn:

Access: Using the indexer, the analysis outputs are made available through the UI via the Search Engine.

The crawler determines which domains to work on by querying the workflow database. As the crawler finishes crawling a domain, it writes the crawled domain information to a file under a path (A) on a Network Attached Storage (NAS) device; each of these files contains output for multiple domains. At the same time, the crawler records the ejection of the domain in the workflow database. Once the crawler finishes writing an LZO file, it creates a trigger file in another NAS directory (B) and updates the workflow database to indicate that the file is complete.

The Analyzer waits for a trigger file to appear in the trigger directory (B). When one does, the Analyzer performs analysis on all the domains in the corresponding file and writes the results to a local directory (C). Once the Analyzer completes all the work for a trigger file, the workflow engine detects the completed trigger file (B) and updates the workflow database to indicate that the file has been completed.

Once the crawler and the Analyzer have completed all the work for a zone, the workflow engine detects that the zone can be consolidated and starts the consolidation process. The Analysis Consolidator collects all the results for a specific zone from the local analyzer directories (C), consolidates the output into a single file (D), and pushes the resulting file into the Verisign Shared Compute Cluster (VSCC) (F). The Analysis Consolidator then inserts Synthesizer routines into the synthesizer database (this occurs on a monthly basis).

The DNS Traffic Processor waits for files to arrive in its inbound directory (E). Once enough files have arrived (this should happen on a daily basis), the DNS Traffic Processor loads the files into the VSCC (F) and then inserts Synthesizer routines into the synthesizer database.
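The hand-off between the crawler and the Analyzer through trigger files can be sketched as follows. This is a minimal local-filesystem illustration under stated assumptions, not the system's implementation: plain JSON lines stand in for the LZO files, the marker patterns are invented, renaming the trigger to ".done" is an assumed completion signal, and the workflow database is left out. All function and file names are hypothetical.

```python
import json
import re
import tempfile
from pathlib import Path

# Hypothetical marker database: marker name -> pattern searched in page text.
MARKERS = {
    "has_shopping_cart": re.compile(r"add to cart", re.IGNORECASE),
    "has_contact_form": re.compile(r"<form[^>]*contact", re.IGNORECASE),
}


def crawler_publish(data_dir, trigger_dir, batch_name, pages_by_domain):
    """Write one batch of crawled domains to (A), then create its trigger in (B)."""
    data_file = data_dir / f"{batch_name}.jsonl"
    with data_file.open("w", encoding="utf-8") as fh:
        for domain, page in pages_by_domain.items():
            fh.write(json.dumps({"domain": domain, "page": page}) + "\n")
    # The trigger is created only after the data file is closed, so the
    # analyzer never picks up a partially written batch.
    trigger = trigger_dir / f"{batch_name}.trigger"
    trigger.touch()
    return trigger


def analyzer_poll_once(data_dir, trigger_dir, result_dir):
    """Process every pending trigger in (B); write results to (C)."""
    handled = []
    for trigger in sorted(trigger_dir.glob("*.trigger")):
        batch_name = trigger.stem
        results = []
        with (data_dir / f"{batch_name}.jsonl").open(encoding="utf-8") as fh:
            for line in fh:
                record = json.loads(line)
                found = sorted(
                    name for name, pattern in MARKERS.items()
                    if pattern.search(record["page"])
                )
                results.append({"domain": record["domain"], "markers": found})
        out_file = result_dir / f"{batch_name}.results.json"
        out_file.write_text(json.dumps(results), encoding="utf-8")
        # Assumed completion signal for the workflow engine: rename the trigger.
        trigger.rename(trigger.with_suffix(".done"))
        handled.append(batch_name)
    return handled


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dirs = [Path(tmp) / name for name in ("A", "B", "C")]
        for d in dirs:
            d.mkdir()
        crawler_publish(dirs[0], dirs[1], "batch-0001",
                        {"shop.example": "<button>Add to cart</button>"})
        print(analyzer_poll_once(*dirs))
```

Separating the data file from the trigger file means the consumer only needs to watch one directory for small marker files, and the producer controls exactly when a batch becomes visible.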
The Synthesizer processes analysis files on the VSCC (F) and stores the resulting files in directories on the reporting server (H). The Synthesizer also processes the traffic files on the VSCC (F) and produces Lucene indexes (G) that are used by the Search Server. The Workflow Engine kicks off reports associated with the zone being processed. Customers access the Customer Web application and can view information contained in the Lucene indexes.

6. RESULTS

6.1 Report 1

6.2 Report 2

CONCLUSION

The main motivation here is the need to improve the system performance and resource utilization of the existing system. When huge amounts of data come into the system to be analyzed and processed so that they can ultimately be put to use, HDFS plays an important role. The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. With the rapid growth of enterprise data volumes, from petabytes to zettabytes and beyond, large-scale data processing may become a challenging issue, attracting plenty of attention in both academia and industry. Better and more efficient ways to handle such volumes of data may be needed; hence continued analysis and research are important.