Lightweight Stack for Big Data Analytics. Department of Computer Science, University of St Andrews
Lightweight Stack for Big Data Analytics

Muhammad Asif Saleem

Dissertation 2014
Erasmus Mundus MSc in Dependable Software Systems
Department of Computer Science, University of St Andrews

A dissertation submitted in partial fulfillment of the requirements for the Erasmus Mundus MSc in Dependable Software Systems

Head of Department: Dr. Steve Linton
Supervisor: Dr. Adam Barker

June 2014
Declaration

I hereby certify that this material, which I now submit for assessment of the program of study leading to the award of Master of Science in Dependable Software Systems, is entirely my own work and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work, and was performed during the current academic year except where otherwise stated. The main text of this project report is words long.

In submitting this project report to the University of St Andrews, I give permission for it to be made available for use in accordance with the regulations of the University Library. I also give permission for the title and abstract to be published and for copies of the report to be made and supplied at cost to any bona fide library or research worker, and to be made available on the World Wide Web. I retain the copyright in this work.

If there is a strong case for the protection of confidential data, the parts of the declaration giving permission for its use and publication may be omitted by prior permission of the Erasmus Mundus DESEM Program Coordinator.
Acknowledgements

I would like to express my gratitude to my supervisor Dr. Adam Barker for his keen attention, guidance and encouragement. I am also thankful to my co-supervisor Dr. Blesson Varghese for his creative ideas, suggestions and moral support over the duration of this work. I would also like to extend my gratitude to all of my lecturers at the university for the knowledge, skills and support they imparted to me.
Abstract

A wide variety of Big Data sets are employed in social science, physical science and the business world. Analytics on such data can make predictions and estimations that were not previously possible. While there are technologies to support Big Data analytics, it is not easy for a non-specialist computer user, whether a social scientist, a physical scientist or a business analyst, to make use of them without significant Big Data expertise. The aim of this project is to develop a lightweight stack that supports technologies for Big Data analytics.
Contents

1 Introduction: Big Data; Problem Description; Motivation; Goals of this research; Contribution of this thesis; Overview of the thesis
2 Related Work: Hive web-based interface; IBM InfoSphere BigInsights; SAP Big Data Analytics; TERADATA Big Data Analytics; Cloudera Big Data Solutions; HortonWorks; Amazon Big Data Analytics Platform; 1010data Big Data Platform; Oracle Big Data Platform; Hewlett Packard Big Data Platform; Summary
3 Big Data Processing Technologies: Hadoop; Hive; Summary
4 Development Process & Solution Design: Problem Statement Revisited; Development Process; Solution Requirements Specifications; Solution Design; Solution and Research Objectives; Summary
5 Solution Implementation: Implementation Decisions & Development; Web Services; Data Browser Implementation (User Interaction Layer); Query Management Layer Implementation; Summary
6 Case Studies and Experiments: Case Studies; Summary
7 Evaluation: Comparison with Related Work; Performance Evaluation; Summary
8 Conclusion & Future Work: Conclusions; Future Work
List of Figures

Figure I: MapReduce Programming Model
Figure II: Hadoop Architecture with Two Nodes
Figure III: Hive Architecture [Ref Figure II]
Figure IV: Hive MapReduce Job Process [25]
Figure V: Web-based framework for Big Data exploration in real time
Figure VI: Buzz Score From May
Figure VII: Buzz Score From June
Figure VIII: Daily Aggregated Buzz Scores From 01 April 2005 to 31 July
Figure IX: Comparison of Actual and Predicted Buzz Scores using the EP technique
Figure X: Comparison of Actual and Predicted Buzz Scores using the RP technique
Figure XI: Actual and Predicted Score with 4-week sample
Figure XII: Actual and Predicted Score with 8-week sample
Figure XIII: Number of bird-flu-related statements found in the corpus
Figure XIV: Frequency of related statements found in the corpus
Figure XV: Frequency of related statements found in the corpus

List of Tables

Table 2-1: Comparison of different Big Data platforms/frameworks
Table 4-1: Research goals and proposed solution
Table 5-1: REST vs SOAP [35]
Table 6-1: % error between predicted and actual buzz scores
Table 6-2: Correlation between actual and predicted scores
Table 7-1: Comparison between related work and BigExcel
Table 7-2: Performance analysis
1 Introduction

Big Data technologies are transforming the way data is analyzed. One reason is the massive amount of data being generated from sources such as social networks, sensors, search engines, banks, telecommunications and the web; handling this massive amount of data takes us into the era of Big Data. According to YouTube statistics, 100 hours of video are uploaded to YouTube servers every minute [1]. Facebook deals with more than 500 terabytes of data daily [2]. Companies such as Google and Yahoo record search engine results to analyze search trends, crawl different web sources to detect important events, and gather marketing data to analyze current and future trends, all of which results in the generation of large data sets, also referred to as Big Data. Data is everywhere, from the social sciences to the physical sciences and the business and commercial world. For example, digitizing the past fifty years' newspapers produces a massive amount of data; in astronomy, storing billions of astronomical objects, and in biology, storing genes, proteins and small molecules, result in massive amounts of data. In the business world, handling millions of call data records in telecommunications, millions of transactions in banking and millions of transactions for a multinational grocery store results in large data sets. Analyzing these large data sets and extracting meaningful information from them is a challenge in itself.

1.1 Big Data

Big Data can be described by three Vs: variety, volume and velocity [3].

Variety. Data comes in different variations, for example semi-structured or unstructured: data generated from web sites, social networks, sensors and web logs is unstructured. Structured data refers to data such as call data records converted to a tabular format in order to calculate their monetary value, bank transaction data, or data generated by airline ticketing systems. These are different varieties of data.

Volume. Volume refers to the amount of data, or the size of the data set. Nowadays the figures are in terabytes and petabytes. For instance, an Airbus can generate half a terabyte of data in one flight [4].
Velocity. Velocity refers to the speed at which data is generated, which is very high nowadays. For example, weather sensors keep generating data as new updates arrive, Twitter generates around 9,100 tweets per second, and Facebook users send 3 million messages to each other every 20 minutes [5].

There are different technologies to analyze Big Data. Hadoop [6] is one of the most popular among them. There are many others, such as MongoDB [7], which is a NoSQL database, and warehousing technologies such as Hive and HBase [6], which can be used in conjunction with Hadoop to ease analysis and provide an abstraction over the Hadoop platform.

1.2 Problem Description

There are different technologies for Big Data analysis, but most of them are complex and require expertise. Non-computer scientists such as social scientists in particular need good programming skills and knowledge of configuring and maintaining the infrastructure, which makes it almost impossible for them to explore large data sets or perform ad hoc analytics on them. For example, how can a social scientist explore the data to find an event in 1975 given the previous fifty years of newspaper data? Or how can a social scientist predict human behavior by analyzing the previous five years of data gathered from different sources such as cell phone records with GPS tracking, search engine queries, internet transaction data, consumer behavior or social network activity? [8]

There are plenty of Big Data analysis platforms and frameworks available in the market, but the problem for non-computer scientists is mastering them, because of the complexity involved and, where necessary, the training required to use them for exploring Big Data in an ad hoc manner and performing analytics on large data sets. These systems also carry the burden of maintenance, at both the application and the infrastructure level.

1.3 Motivation

The motivation behind this research is to assist non-computer scientists such as social scientists in exploring large data sets and performing ad hoc analytics without requiring technical expertise in Big Data technologies or in maintaining and configuring the underlying infrastructure. Nowadays Big Data is an important area in every field, whether social or scientific. Experts such as social scientists who do not have a technical background in computer software systems, and especially in Big Data analytics platforms, cannot get insights into large data sets because they lack the technical expertise to deal with these systems.
One of the motivations behind this research is to give access to Big Data exploration to scientists who want to analyze information for cutting-edge research or to predict future trends in their respective fields by analyzing the large data sets generated from a particular event or group of events. Technological advancements in Big Data technologies give computer scientists and commercial companies the opportunity to look inside large data sets, but these technologies are too complex for non-computer scientists, which leads to the need for a simple, understandable framework that hides the complexity of Big Data technologies for exploration and analytics.

1.4 Goals of this research

The goals of this research are to support exploring Big Data in an ad hoc manner by facilitating the management of user interaction with large data sets and by providing an abstraction over the deployment, configuration and maintenance of the underlying infrastructure and Big Data technologies. Frameworks such as BigExcel (the framework developed under this research) will help non-expert users such as social scientists to harness Big Data in just a few clicks with little effort. For instance, if a social scientist wants to explore one year of language data gathered from different web sources to find out what the most popular event of that year was and use it to draw conclusions in their research, the scientist can do so through a simple application by taking simple steps for loading and analyzing the data. All state-of-the-art Big Data platforms and frameworks are either commercial or too complex for non-technical users, and hence they are out of reach of non-computer scientists without considerable training. The goal of BigExcel is to address this issue by developing a framework for non-computer scientists.

1.5 Contribution of this thesis

The contributions of this thesis are:

1. A lightweight web-based framework for the exploration of, and analytics on, large data sets in an ad hoc manner
2. Abstraction for non-computer scientists, such as social scientists, over the use of Big Data technologies
3. Abstraction for non-computer scientists, such as social scientists, over configuring and maintaining the underlying infrastructure for Big Data technologies
4. Access to Big Data exploration for non-computer scientists such as social scientists
5. Real-time analytics on large data sets
1.6 Overview of the thesis

In this chapter we discussed what Big Data is, how it is generated and the challenges it brings with it. We discussed the problem statement, research goals and contribution of this work. In the next chapter we present the related work and a comparison between different Big Data platforms and frameworks, followed by an overview of the Big Data technologies Hadoop (a Big Data processing technology based on the MapReduce programming model) and Hive (a database for handling large data sets with SQL as an interfacing language) in chapter three. In chapter four we present the design of the lightweight web-based framework BigExcel, followed by the implementation of the framework in chapter five. In chapter six we present two case studies: one demonstrates a predictive analytics use case and the other the exploration of a large language corpus. The two data sets are taken from the Yahoo research labs sandbox [9]. In chapter seven we evaluate our work by comparing it with the related work and by providing a performance analysis and a usability assessment of the framework. In chapter eight we conclude our work and provide future directions for this research.
2 Related Work

In this chapter we discuss the related work by analyzing tools and frameworks that are available to explore Big Data and are capable of performing analytics. We look at these tools and frameworks with respect to the difficulty and expertise required to use them by non-expert users, and also whether they support the exploration of Big Data or not.

2.1 Hive web-based interface

Hive has an interactive web interface which is designed for the administration of Hive as well as for querying the database. Users can create and delete tables and browse the database schema. In addition, users can execute queries supplied through the web-based interface. It is not really an analytical platform; it was developed so that Hive can be used easily and interactively through a web interface. The web interface is easy to use, but tasks such as creating and browsing schemas and starting and stopping Hive are too technical for non-technical users. Also, for the Hive web interface to work, the user must have Hive configured and deployed on their computer system, which in turn requires Hadoop for processing; this makes it very complex for non-expert users, with no support for analytics.

2.2 IBM InfoSphere BigInsights

InfoSphere BigInsights is a Big Data analytics platform from IBM which supports different types of analytics under one roof. InfoSphere is built on top of Hadoop to enhance its capabilities and provides an interactive interface for analyzing Big Data. InfoSphere has built-in analytics capabilities, including text analytics for getting insights from large textual data, a social data analyzer for analyzing social media data, and machine data analytics for analyzing machine data such as data from sensors and GPS; InfoSphere also supports integration with other Big Data technologies. In addition, InfoSphere provides a SQL interface, namely BigSQL, and a spreadsheet-like interface called BigSheets for analyzing and exploring Big Data easily, together with development tools, analytics and security for Big Data operations. BigSheets and BigSQL are core components of the system. BigSQL gives the user the facility to explore the database schema and provides access to analytics using a structured query language. BigSheets provides Big Data analysis, exploration and manipulation for non-programmers and non-technical users. The system can load data from multiple sources such as databases, web crawlers, text files, JSON and CSV files, and can store it in
the distributed file system for processing. BigSheets provides the user with an interactive, Excel-like interface where users can select or import data for analysis, for example by applying filters or by using built-in aggregation functions such as sum or average. It also supports user-defined functions, where users can write their own logic for the computation. BigSheets also supports interactive visualization of data and analysis results, with multiple chart types for rendering results such as line, bar and pie charts. Despite all these benefits and the ease of use InfoSphere provides, the issue with InfoSphere is that it is not easy to configure: even a technical computer scientist needs training to install and configure the system. Hence InfoSphere takes away the opportunity for non-expert users such as social scientists to use the system without technical expertise [10].

2.3 SAP Big Data Analytics

SAP is one of the leading providers of Big Data analytics platforms and one of the first companies to introduce an in-memory database for analytics. The aim of building the in-memory database was to provide a single database for both transactional and analytical data processing, commonly referred to as OLTP (online transaction processing) and OLAP (online analytical processing) systems. Any application can be built on top of the SAP HANA (High Performance Analytic Appliance) in-memory database for analytics. The HANA back-end engine is available for use in the cloud as well, but few applications exist that can be used by non-expert users for analytics. HANA has a strong library called the Predictive Analytics Library (PAL), which HANA developers can use to perform predictive calculations by simply invoking library functions, hence easing Big Data analytics. For setting up a HANA system for Big Data exploration, the user needs some type of ETL (extraction, transformation and load) tool for loading the data into the system and then an application which can manipulate that data. Many sample applications exist to demonstrate the power of HANA, but most of them address business use cases and are designed for expert users, or for users who can use the system after adequate training [11] [12].

2.4 TERADATA Big Data Analytics

The Aster Big Analytics Appliance is the solution from TERADATA for Big Data analytics. Aster is a database developed by TERADATA supporting both row-based and column-based storage, and it is the key component of their Big Data analytical platform, which consists of the Aster database and Aster SQL-MapReduce, an interfacing SQL language for the Aster
database that includes support for built-in MapReduce functions using the SQL language, similarly to Hive, which sits on top of Hadoop, along with their own supplied and tested hardware. We discuss the architecture and working of Hive and Hadoop in detail in the next chapter. The Aster Big Analytics Appliance is a combination of software and hardware, and the basic idea is to run Big Data technologies on dedicated, specially designed hardware by connecting multiple nodes with InfiniBand [13] instead of running Hadoop on commodity hardware with traditional network connections. New nodes can be added as needed by using separate dedicated hardware machines. Despite the dedicated hardware and fast database, together with built-in functions for MapReduce tasks and a SQL interface for analytics, there are few applications available for non-expert users to explore and analyze Big Data. The technology's main focus is on business users, and users have to have a physical machine (not commodity hardware) together with a front-end application to use the system, which is not suitable for non-technical users [14].

2.5 Cloudera Big Data Solutions

Cloudera was founded in 2008 and was the first enterprise distributor of Apache Hadoop and other Big Data technologies. The contribution of Cloudera to the Big Data world is an abstraction over different Big Data technologies such as Hadoop, Hive and HCatalog [15], which makes it easier for users to use these technologies without going into technical details. Cloudera solutions are ready to install on any commodity hardware, hide the technical details of compiling and configuring the Big Data technologies, and provide system management such as configuration, deployment, security management, diagnostics and operational report generation. Cloudera is at the forefront of providing back-end solutions for Big Data exploration and analysis, but it does not provide any tool or framework that can actually be used on top of these technologies, especially by non-expert users [16].

2.6 HortonWorks

HortonWorks was founded in 2011 by ex-Yahoo engineers. The concept is the same as Cloudera: to provide different Big Data technologies for enterprise computing. As a result the company developed the HortonWorks Data Platform for enterprise computing, which builds on top of Hadoop and has the capability to provide different types of analysis such as batch, real-time and interactive. The platform also supports data management, a facility built on top of the Hadoop file system and resource manager. In addition to the multiple analysis capabilities of the Horton data platform, it supports different technologies for data integration and data flow
control. Technologies such as Apache Falcon [17], Sqoop [18] and Flume [19] are part of the platform and provide easy and systematic access for moving data in and out of Hadoop. Furthermore, the platform provides AAA (authentication, authorization and audit) and data protection by encrypting traffic during transfer between different nodes, such as remote procedure calls, data transferred over HTTP, JDBC and the data transfer protocol. The platform also supports the management of Hadoop cluster operations. Deploying and configuring the Horton data platform requires considerable expertise and training, along with experience of other technologies for loading data into the distributed file system for processing. Furthermore, the system lacks applications where the user can simply log in and start analyzing the data without expert knowledge of the system [20].

2.7 Amazon Big Data Analytics Platform

Amazon is a pioneer in cloud computing and provides technological services through web interfaces usually referred to as web services. Amazon Elastic MapReduce (EMR) is a web service from Amazon through which we can access Big Data technologies such as Hadoop and Hive via Amazon's exposed APIs or by launching an EC2 (Elastic Compute Cloud) instance, where users can deploy map and reduce scripts for processing on Hadoop and Hive. Hive can be accessed via the Amazon-exposed APIs such as JDBC, or by using a Hive-compatible client such as the Hive command-line utility or Beeline. Non-expert users who do not have the technical grounding to write map and reduce scripts cannot use this technology, and non-programmers cannot work with the exposed APIs, as they would have to be proficient in programming. In addition, users need to build an application on top of Amazon Web Services to give non-expert users access to Big Data exploration.

2.8 1010data Big Data Platform

1010data is a leading provider of Big Data exploration and analysis, serving multiple industries such as retail, telecom, financial, healthcare and government. 1010data systems can be installed on premises and can be accessed through the cloud. The 1010data Big Data platform has a very interactive, spreadsheet-like interface called The Trillion-Row Spreadsheet [21]. It is similar to an Excel spreadsheet, but it is capable of handling billions or trillions of rows of data. The cloud solution can be used by non-expert users, as it only requires authentication for accessing the system to explore the data. In addition, the platform provides the facility to import and load data from multiple sources easily.
1010data's Big Data platform depends on an underlying high-performance database technology which supports both row-based and column-based storage. Column-based storage is better for set-type operations, as the data is physically adjacent on disk and the disk takes less time to locate it when a retrieval query arrives. The database supports multiple built-in functions for analytics, usually referred to as in-database analytics. The functions include grouping functions; statistical functions such as mean and standard deviation; time-series functions such as regression and correlation coefficients; and text functions such as finding specific text in the data, for example finding tweets in a large data set. 1010data provides an easy-to-use interface even for non-expert users, but the technology it uses is now dated. Although it uses a row- and column-based storage database, other databases such as SAP HANA provide faster responses to OLAP queries. Furthermore, Hadoop and other Big Data technologies are at the forefront of analyzing Big Data and provide a reliable and fast way to explore it, and the 1010data platform does not support Hadoop and other Big Data technologies, which means this platform might not be of interest to most non-technical users [21].

2.9 Oracle Big Data Platform

Oracle is not behind any other company in Big Data solutions. Oracle introduced an in-memory database option in version 12c of the Oracle database. Oracle partners with Cloudera to provide Hadoop for Big Data analytics. In addition, Oracle has developed its own analytical tools using R [22] for Hadoop, together with in-memory and NoSQL databases. The system also has a data management facility for moving data in and out of Hadoop and to and from the database, with dedicated and tested hardware from Oracle. Oracle also provides the Big Data platform as a service in the cloud, so any application can be built on top of it. Despite the powerful hardware and software capability to deliver real insights into Big Data, there is a need to build a simple application on top of this technology to support non-expert users in exploring Big Data [23].

2.10 Hewlett Packard Big Data Platform

HP provides analytics through its HP Vertica analytics platform. The product is based on the Vertica database, which is the backbone of the platform. The database provides fast execution of queries, especially set-type queries and warehouse applications, due to its columnar storage structure. In addition, the database uses advanced compression techniques to compress the data stored on disk, along with in-database analytics. Furthermore, the database has an integration with Hadoop for processing data stored in the Hadoop file system,
which can be loaded into the database through the Hadoop-Vertica connector. However, the system does not provide the flexibility to execute MapReduce jobs in Hadoop; instead, the platform is more interested in pulling data from the Hadoop distributed file system and processing it using its rich in-database analytical functions. Despite the fact that the system provides high-performance analytics, it cannot be used by non-expert users, as the system has to be configured and deployed on premises and few applications exist which can use the platform easily for Big Data exploration. Furthermore, the system does not provide MapReduce computation facilities, which is a key factor for any Big Data exploration and analytics application nowadays [24]. Table 2-1 shows a tabular comparison of the different technologies and applications discussed above.

2.11 Summary

In this chapter we discussed different Big Data platforms for analyzing and exploring large data sets, looked at their capabilities and considered whether non-technical users can use them. We also provided a tabular comparison between them. In the next chapter we discuss the popular Big Data processing technologies Hadoop and Hive.
Feature                     InfoSphere  SAP  TERADATA  Cloudera  HortonWorks  Amazon  1010data      Oracle  HP   Hive
Analysis web interface      Yes         No   No        Yes       No           Yes     Yes           No      No   Yes
Hadoop support              Yes         Yes  Yes       Yes       Yes          Yes     No            Yes     No   No
Dedicated hardware support  Yes         Yes  Yes       No        No           No      No            Yes     Yes  N/A
Analytics                   Yes         Yes  Yes       No        No           No      Yes           Yes     Yes  No
Database support            No          Yes  Yes       No        No           No      Yes           Yes     Yes  No
Data management             Yes         Yes  Yes       Yes       Yes          No      Yes           Yes     Yes  No
Non-expert users support    Yes (Low)   No   No        No        No           No      Yes (Medium)  No      No   No
Business users support      Yes         Yes  Yes       Yes       Yes          Yes     Yes           Yes     Yes  Yes

Table 2-1: Comparison of different Big Data platforms/frameworks
3 Big Data Processing Technologies

In this chapter we discuss the Big Data processing technologies that we have used in developing the framework. We start with Hadoop and describe its programming model and architecture, followed by an overview of Hive, its architecture, and the benefits Hive provides for analyzing large data sets.

3.1 Hadoop

Hadoop is based on the MapReduce programming model originally developed by Google in 2004 and is currently maintained by the Apache Software Foundation as open-source software. Hadoop is the most widely used Big Data technology for analyzing large data sets. Hadoop's batch and parallel processing nature makes it well suited to crunching large amounts of data easily and cost-effectively.

Figure I: MapReduce programming model (input splits are passed to mappers, which emit key/value pairs that are grouped and passed to reducers to produce the output).

The MapReduce programming model is based on the concept of key/value pairs, as shown in Figure I. The user's input data file is split into multiple chunks of 64 MB, and different chunks of the same file are assigned to different mappers for parallel processing. A mapper parses its input split, converts it to key/value pairs for processing, and generates intermediate key/value pairs as output, as shown in Figure I.
The reducer then loads these intermediate key/value pairs and sorts them so that related keys are grouped together; it then processes each group to produce the output file. Each reducer usually produces zero or one output file. We can describe the map and reduce functions as follows:

Map (key, value) -> list of (intermediate key, value)
Reduce (intermediate key, list(value)) -> final output in the form of a file

Hadoop implements the MapReduce programming model and requires different modules for parallel processing. A Hadoop cluster consists of several types of nodes, namely the NameNode, DataNodes, the JobTracker and TaskTrackers, described below.

NameNode. This node is responsible for managing the HDFS (Hadoop Distributed File System). It keeps a record of the file system in the form of a tree data structure and manages the DataNodes, which are responsible for all operations on the file system, such as moving, deleting or adding files. When a client requests an operation on the file system, the NameNode notifies the concerned DataNode to process the request; that DataNode processes the request and sends the result back to the client. The figure below shows the NameNode architecture, with the NameNode managing a set of DataNodes.

JobTracker. The JobTracker is responsible for assigning tasks to TaskTrackers (described below) and for periodically checking that each TaskTracker is up and running by pinging the node, so that in case of node failure processing can be transferred to another live node.

TaskTracker. The TaskTracker is responsible for accepting mapper, reducer and shuffle/sort tasks from the JobTracker.
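To make the map and reduce functions described above concrete, the following is a minimal word-count job written against the Hadoop Java MapReduce API. It is an illustrative sketch only: the class names and the input and output paths are assumptions and are not part of the framework developed in this thesis.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in its input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts of each intermediate key (word)
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/input"));     // assumed HDFS paths
        FileOutputFormat.setOutputPath(job, new Path("/user/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The compiled job is packaged as a jar and submitted to the cluster, where the JobTracker distributes the map and reduce tasks to the TaskTrackers described above.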
HDFS. The Hadoop Distributed File System is used for storing and retrieving data within the Hadoop architecture. HDFS has the following characteristics:

File storage: HDFS provides the storage capability within the Hadoop architecture; data can be imported or loaded from different sources for analysis.

Large files: HDFS is designed for managing very large data files, usually in the terabyte or petabyte range.

Streaming data access: HDFS is based on the principle of write once, read many times. Due to its large block size of 64 MB, data can be read and written in large chunks, reducing disk seeks.

Designed for commodity hardware: HDFS is designed to run on commodity hardware (low-cost hardware from different vendors), and hence hardware failure is expected. The file system ensures the persistence of data by replicating it to different nodes and handles all the complexity behind this.

Client interface: HDFS provides a client interface for reading and writing data to the file system for processing through the DataNodes.

Figure II shows the Hadoop architecture in a cluster setup with two nodes. From the figure we can see that the JobTracker communicates with the TaskTrackers to assign tasks such as map, reduce and shuffle operations. The NameNode is responsible for handling all the DataNodes in the cluster; typically one DataNode is assigned to each node in the cluster. In addition to the primary NameNode, there may be a secondary NameNode on the master node, which is usually used to store a replica of the NameNode's metadata for handling failures [25] [26] [27] [28].
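As a brief illustration of the HDFS client interface mentioned above, the sketch below uses the Hadoop Java FileSystem API to list a directory and stream-read a file. The NameNode URI, directory and file names are assumptions made for the example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");   // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // List the files stored under a (hypothetical) user directory
        for (FileStatus status : fs.listStatus(new Path("/user/bigexcel"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        // Stream-read a large file; HDFS is optimised for this write-once, read-many pattern
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/bigexcel/sample.csv"))))) {
            System.out.println(reader.readLine());   // print only the header line
        }
        fs.close();
    }
}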
3.2 Hive

Hive is a warehouse solution initially developed by Facebook and now available from the Apache Software Foundation as an open-source framework. The main issue with Hadoop is the need to define custom logic when writing map and reduce tasks. Even users who understand what custom logic is required for analyzing their data may be unable to write the map and reduce tasks because of a lack of programming ability, which keeps many non-programmers away from Hadoop and means they cannot analyze their data without knowledge of a programming language.

Figure II: Hadoop architecture with two nodes (a master node running the JobTracker and NameNode, and two slave nodes each running a TaskTracker and a DataNode on HDFS).

Hive overcomes this issue by providing a database-like structure on top of Hadoop with an interfacing query language called HiveQL (Hive Query Language). HiveQL is similar to the standard database interfacing language SQL (Structured Query Language). Users can load their data into Hive using its data-loading facility. Hive communicates with Hadoop and automatically creates the MapReduce jobs for analysis. Users can also define custom map and reduce functions by embedding them into Hive queries for processing on Hadoop. Figure III shows the architecture of Hive. In Hive, the user issues queries via the CLI (command-line interface). Hive automatically decides how many mappers and reducers are needed to process the query, which simplifies processing, whereas in Hadoop the users have to define how many mappers and reducers are needed to complete the task.
Additionally, users can manually set the number of mappers and reducers needed for a query in the configuration file.

Figure III: Hive architecture (Hive submits jobs to the Hadoop master node shown in Figure II).

Figure IV shows job creation and execution within the Hive architecture. The user supplies the query via the Hive CLI (command-line interface), and Hive creates the job that is to be executed on Hadoop. All the processing logic is built into Hive, such as assigning the mappers and reducers required for the query, or deciding whether the query requires a MapReduce job at all. Queries such as aggregations or queries with user-defined logic require MapReduce jobs for processing, whereas queries such as listing table values (SELECT * FROM TABLE_NAME) do not require MapReduce jobs.
Figure IV: Hive MapReduce job process [25] (Hive submits a job to the JobTracker on the master node; resources such as split files are copied to HDFS, and TaskTrackers on the slave nodes execute the work).

The job is submitted to the JobTracker by Hive, and the JobTracker assigns it to TaskTrackers (mappers or reducers) for execution. The created job copies the required files, such as the input file, to HDFS for processing. The JobTracker and TaskTrackers also communicate with HDFS to get current file information for their processing tasks. Furthermore, Hive provides multiple client interfaces in addition to the CLI. The Thrift client library provides third-party clients with access to the database. Beeline is another command-line utility providing access to Hive metadata and data. Hive also provides a JDBC (Java Database Connectivity) API for accessing Hive from Java applications; HiveServer2 was introduced in version 0.11 of Hive to facilitate clients who want to access the database via the JDBC API. Hive also requires an additional relational database for storing the metadata of the warehouse, e.g. table names, data types and table partitions, and gives access to this metastore through the metastore API.
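As an illustration of the JDBC route just mentioned, a minimal Java client might connect to HiveServer2 and run an aggregation query roughly as follows. The host, port, credentials and table name are assumptions for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver and open a connection
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "user", "");
        Statement stmt = con.createStatement();

        // A simple aggregation query; Hive compiles it into MapReduce jobs behind the scenes
        ResultSet rs = stmt.executeQuery(
                "SELECT category, AVG(score) FROM sample_table GROUP BY category");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }
        rs.close();
        stmt.close();
        con.close();
    }
}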
Users can configure any relational database as the metastore; by default, Hive uses the embedded Derby relational database. Hive supports almost all the data types of a traditional database, e.g. integers, strings and CLOBs. In addition, Hive has built-in aggregation functions, such as calculating the average or the standard deviation of a column. Users can also write their own custom functions, called UDFs (user-defined functions), for any custom logic. Hive supports almost all relational operators and traditional database query operators such as <= and BETWEEN; the IN operator was added in the latest version of Hive. Furthermore, users can write their own custom scripts (algorithms), which can be embedded into a Hive query using the following syntax:

SELECT TRANSFORM (column names ...) USING custom_logic FROM TABLE_NAME WHERE (condition ...)

The column names are the data columns on which the user wants to perform the analysis. The TRANSFORM clause converts the selected columns into strings separated by tab characters (\t); the script reads each such string like reading a line from the command line and can build custom logic around it. Every tuple matching the query is passed to the script until all filtered rows have been processed [29].

Hive only supports structured data. For analyzing large data sets, users have to convert their data into a structured format. For instance, unstructured data from web logs must first be processed, using some data or Big Data processing technology, before loading it into Hive; an independent utility is needed to convert the results extracted from unstructured data into a structured form, such as a tabular format, so that they can be loaded into Hive for analysis.

3.3 Summary

In this chapter we discussed current state-of-the-art Big Data processing technologies and presented their architectures. In the next chapter we discuss the development process and solution design of the lightweight Big Data analytics framework.
4 Development Process & Solution Design

In this chapter we present the solution requirements and the design of the lightweight web-based framework. We revisit the problem statement described in the introduction chapter and design the framework accordingly. We also discuss how this solution addresses the problem and whether it meets its goals.

4.1 Problem Statement Revisited

As discussed in the first chapter, Big Data technologies are complex and require considerable computer skills; non-computer scientists in particular require skills and training in order to use them. There is also a need to configure and maintain the infrastructure for these technologies. To address this, we need tools such as a lightweight framework which can be used by non-expert users to explore large data sets in an ad hoc manner with little effort.

4.2 Development Process

We have used software engineering principles (represented here using the waterfall model) for developing the solution, which provide a systematic approach to software development. We follow these phases:

- Requirements specification
- Solution design
- Solution implementation
- Testing / experiments
During requirements gathering and specification, we gather requirements according to the stated objectives and research goals. After that we design a solution which satisfies the requirements, followed by the solution implementation. Testing is done in the form of experiments using large test data sets from the Yahoo labs sandbox. We iterate back and forth, adjusting the design and implementation for any requirement changes or missing functionality discovered during development. This chapter covers the first two phases, followed by the implementation phase in the next chapter; the test/experiment phase is covered in the case studies and experiments chapter.

4.3 Solution Requirements Specifications

Below are the identified requirements for designing the web-based framework for exploring and analyzing large data sets:

- A data browser which interacts with the user and assists in querying and analyzing large data sets
- A query constructor/processor, which constructs queries from the information entered by the user for analytics and the exploration of large data sets
- A data manager which assists the query constructor in analytics and in exploring large data sets in an ad hoc manner
- An analytical processor, which assists with ad hoc analytics on large data sets
- A modules repository, a set of algorithms used by the analytical processor for analysis
- Support for Big Data analysis technologies, e.g. Hive and Hadoop, by providing an abstraction to users
- Support for the infrastructure on which these technologies run, by providing an abstraction to users

4.4 Solution Design

To design a framework which fulfils the stated requirements, we use a three-tier architectural pattern. The framework consists of three layers: the user interaction layer, which handles all user interaction; the query management layer, which is responsible for handling queries and analytics requests; and the infrastructure management layer, which handles the infrastructure used by the framework.
User interaction layer. The user interaction layer is responsible for taking queries and input from the user for ad hoc exploration of, and analytics on, large data sets. This layer consists of a module called the data browser, which communicates with the query management layer over the standard HTTP protocol. All communication between the user interaction layer and the query management layer is secured using the security built into the HTTP stack. The data browser supports the user in the following tasks:

- Writing queries for ad hoc exploration
- Visualizing large data sets in tabular form
- Ad hoc analytics through visualization of results
- Visualizing HDFS (the Hadoop Distributed File System)
- Converting data sets for loading
- Loading data into the Hive warehouse

The data browser is a web-based application which the user can access over the internet.

Query management layer. The query management layer is responsible for handling queries related to data exploration, data aggregation and analytics on the data. It consists of five modules, namely the query constructor, response processor, analytical processor, modules repository and data manager.

Query constructor: The query constructor is responsible for constructing the user's queries. Queries might be of different types, such as aggregation queries or analytics queries. The query constructor passes queries to the analytical processor, which in turn interacts with the infrastructure layer. Queries for moving data to HDFS or loading data into the Hive warehouse are handled by the data manager.

Analytical processor: The analytical processor is responsible for handling analytical queries for ad hoc analytics. It consults the modules repository, a set of algorithms (user-defined logic for exploring data sets), to perform analytics.
Figure V: Web-based framework for Big Data exploration in real time. (User interaction layer: data browser. Query management layer: data manager, query constructor, analytical processor, modules repository, response processor. Infrastructure layer: connection manager, Hive, Hadoop, metastore/HDFS.)
By combining the user's query with an algorithm, the analytical processor interacts with the infrastructure layer for processing.

Modules repository: The modules repository is a set of algorithms which can be supplied to extract information for analytics. The preferred languages for these algorithms are scripting languages such as Perl, Python and shell script, but algorithms can also be written in Java.

Data manager: The data manager is responsible for handling processing queries such as moving data to HDFS and loading it into the Hive warehouse.

Response processor: The response processor is responsible for getting and parsing the response from the infrastructure layer and sending it to the data browser for rendering. Responses typically include views of large data sets, aggregation results, analytics results and a view of the HDFS file system.

Infrastructure layer. The infrastructure layer is responsible for processing the user's queries on large data sets. It might be a single machine or a cloud [30]. The infrastructure layer consists of the Big Data processing technologies and the hardware. We use Hive and Hadoop as the data processing technologies: Hadoop is a framework based on the MapReduce paradigm, and Hive is a warehouse application designed to run on top of the Hadoop distributed file system, providing a SQL-like interface for querying and analyzing large data sets, as described in chapter three. The infrastructure layer consists of the connection manager and the hardware running Hive and Hadoop.

Connection manager: The connection manager is responsible for handling and processing requests to the Big Data processing technologies. It is the first point of contact for requests in the infrastructure layer. It passes requests to the underlying technologies for processing and retrieves the response for sending back to the response processor, which renders it in the data browser.

Cloud/hardware: This module consists of pre-configured hardware or a cloud deployment which can run Big Data processing technologies such as Hive and Hadoop for processing the user's queries. This pre-configured system helps the users of the framework to interact with these technologies without
having any technical expertise in them, hence helping the framework to provide an abstraction between the user and the underlying infrastructure. We propose to use the cloud.

4.5 Solution and Research Objectives

We now analyze how our proposed solution achieves the research goals stated in the first chapter.

Research goal: a framework for non-computer scientists
- Purpose: easy-to-use interface; Solution: basic data management, analytics and visualization of results
- Purpose: abstraction over Big Data technologies; Solution: Hive and Hadoop in this case
- Purpose: abstraction over configuration and maintenance of infrastructure; Solution: cloud or specialized commodity hardware

Research goal: access to Big Data exploration
- Purpose: modules repository; Solution: custom-defined algorithms

Table 4-1: Research goals and proposed solution

Table 4-1 shows the main research goals and the proposed solution. The research goal of developing a Big Data analytics framework for non-computer scientists is achieved by combining an easy-to-use interface with an abstraction over the Big Data processing technologies and the underlying infrastructure, using the cloud or specialized commodity hardware, together with basic data management facilities such as converting and loading text files and a pre-written set of algorithms for analysis.

4.6 Summary

In this chapter we presented the design of a lightweight web-based framework for Big Data analytics. We outlined the requirements specifications for the design and presented the block diagram in Figure V, along with a class diagram (see Appendix I). In addition, we described the development process that we are using for developing the system. In the next chapter we present the detailed implementation, including the technologies used for the web-based framework.
5 Solution Implementation

In this chapter we discuss the implementation of the web-based framework. We discuss the development decisions regarding the technologies used, explaining and comparing the related technologies for framework development and for Big Data processing.

5.1 Implementation Decisions & Development

Many technologies are available nowadays for developing web-based applications, but it is important to select a technology which provides rapid application development along with a rich set of interface design tools. Below we discuss the technologies that we have used to implement the lightweight web-based framework, along with a description of the implementation.

5.2 Web Services

Web services are key components in application development, especially for distributed applications where one or more systems are scattered across different locations or running on different servers. Web services enable applications to communicate effectively and securely between different components of a system residing on the same machine or on different machines. REST (Representational State Transfer) [31] and SOAP (Simple Object Access Protocol) [32] are two standard technologies for designing web services. SOAP is more powerful than REST with respect to security, as SOAP has its own mechanism for securing communication, whereas REST uses the standard hypertext transfer protocol (HTTP), which relies on the standard OSI (Open Systems Interconnection) stack and its built-in security layer using TLS (transport-level security) for data encryption. SOAP is complex with respect to development and configuration compared to REST, and most corporations use SOAP rather than REST. Some key differences between REST and SOAP are listed below [33].
REST: uses the standard HTTP protocol for communication. SOAP: is itself a protocol.
REST: uses transport-level security (TLS) for secure communication. SOAP: provides end-to-end message security and defines its own security via WS-Security [32].
REST: uses JSON (JavaScript Object Notation) [34] for data transport. SOAP: uses XML for data transport.
REST: no support for distributed transactions. SOAP: supports distributed transactions.
REST: better interoperability because of the use of the standard HTTP protocol. SOAP: has different implementation standards, which might create problems when communicating across different vendors' implementations.
REST: easy to achieve scalability. SOAP: much harder to achieve scalability.

Table 5-1: REST vs SOAP [35]

The framework uses REST web services: the query management layer is exposed through a REST API, and the data browser communicates with it via REST calls over HTTP. The benefit of using REST in our framework is that any client can query the query management layer by calling the service URL, regardless of location and portability restrictions over the internet, and can manipulate or visualize the results. In addition, REST is less complex to use and supports rapid application development.
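To give a flavour of what exposing the query management layer over REST could look like, the following is a minimal JAX-RS sketch of a query resource. The path, parameter names and the way the HiveQL string is built are assumptions made for illustration, not the framework's actual API.

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

@Path("/query")
public class QueryResource {

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public String runAggregation(@QueryParam("table") String table,
                                 @QueryParam("column") String column,
                                 @QueryParam("function") String function) {
        // A query constructor would build a HiveQL statement from the request parameters
        String hiveQl = "SELECT " + function + "(" + column + ") FROM " + table;
        // ... pass hiveQl to the connection manager, run it on Hive and wrap the result as JSON
        return "{\"query\": \"" + hiveQl + "\"}";
    }
}

A client such as the data browser would then issue plain HTTP calls of the form GET /query?table=sample_table&column=score&function=AVG and render the returned JSON.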
5.3 Data Browser Implementation (User Interaction Layer)

As this is a web-based framework, we use standard HTML pages for the data browser. Because the data browser requires controls for loading and visualizing data, we use RichFaces [36] for designing it. RichFaces is an open-source user interface design kit which uses JavaServer Faces [37] and provides a rich set of features for interface design. JavaServer Faces is an MVC (model-view-controller) framework from Oracle which uses a component concept for designing web interfaces. The packages, pages and classes of the data browser are described below.

Package Login
- Login.xhtml: web page responsible for authenticating the user.
- LoginBean.java: temporarily stores login information to be passed to the database for authentication.

Package Client/viewdata
- ViewDataRich.xhtml: gets the user's input for viewing a table stored in Hive.
- PartialLoadPaginatedData.xhtml: shows partial table data on the web interface using pagination or lazy loading (a technique for loading chunks of data from a large table).
- ViewTableBackingBean.java: stores the query used to view data from the web page and transfers it for processing.
- PaginationViewModel.java: handles lazy loading of the data; PartialLoadPaginatedData.xhtml passes it the number of records to be loaded on each client request.

Package Client/loaddata
- LoadData.xhtml: loads user data into the Hive warehouse.
- LoadTxtData.xhtml: converts text data sets into CSV format for loading into Hive.
- ClientLoadCall.java: calls the REST API for loading data.
- LoadDataBean.java: holds temporary data to be processed for loading.
- TextToCSV.java: converts text files into CSV.
- ReadCSVFileRowCount.java: reads the total row count while loading the data and stores it in a database as metadata of the CSV file.
- FileChooserForLoadingData.java: selects the files on disk to be loaded.
Package Client/analysis
- ModuleRespostirySelection.xhtml: gets custom algorithms (user-defined logic) from the user for execution on Hive.
- QueryHive.xhtml: supports direct querying of Hive, e.g. aggregation queries.
- UtilityFunctions.xhtml: handles utility functions to be performed on the data sets, e.g. computing the standard deviation of a column in a large data set.
- GraphVisualization.xhtml: visualizes results in graphical form.
- HiveQueryBean.java: temporary storage of data for direct querying of Hive on data sets.
- QueryHiveCall.java: handles REST calls to the query management layer for querying and analytics of data as well as for performing aggregate functions.
- UtilityFunctionBean.java: temporary storage of data for performing utility functions, e.g. average.
- ExtrapoationbasedPrediction.java: calculates extrapolation-based predictions.
- RegressionbasedPrediction.java: calculates regression-based predictions.
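The two prediction classes above are used in the case studies of chapter six; their exact formulas are not reproduced here. Purely to illustrate the general idea, a regression-based and an extrapolation-based prediction over a series of scores could be sketched as follows (method names and sample values are made up and may differ from the actual implementation).

public class PredictionSketch {

    // Fit y = a + b*x by least squares over the observed scores, then predict the next value
    static double regressionPredict(double[] scores) {
        int n = scores.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int x = 0; x < n; x++) {
            sumX += x;
            sumY += scores[x];
            sumXY += x * scores[x];
            sumXX += (double) x * x;
        }
        double b = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double a = (sumY - b * sumX) / n;
        return a + b * n;   // predicted score for the next time step
    }

    // Naive linear extrapolation from the last two observations
    static double extrapolationPredict(double[] scores) {
        int n = scores.length;
        return scores[n - 1] + (scores[n - 1] - scores[n - 2]);
    }

    public static void main(String[] args) {
        double[] weeklyScores = {12.0, 15.5, 14.0, 18.2};   // made-up sample values
        System.out.println("Regression-based prediction:    " + regressionPredict(weeklyScores));
        System.out.println("Extrapolation-based prediction: " + extrapolationPredict(weeklyScores));
    }
}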
5.4 Query Management Layer Implementation

The query management layer is the main processing layer of the framework. It consists of several sub-modules, each with its own responsibilities. We have opted for a modular approach in implementing this layer, as it helps achieve high cohesion and loose coupling between the modules. First we describe the query constructor implementation, followed by the other modules in this layer.

Query constructor implementation

Package Server/query constructor
- QueryConstructor.java: constructs the user's queries for execution on the Hive warehouse.

Data manager

The data manager is responsible for transferring data to HDFS and loading it into Hive. Its implementation is as follows.

Package Server/loading
- LoadData.java: provides the REST API interface for all requests related to loading data into Hive.
- LoadDataLogic.java: transfers data to HDFS and loads it into Hive.
- LoadDataServlet.java: loads the REST API interface classes into the web container.
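As a rough sketch of the two steps the data manager performs (transferring a file to HDFS and asking Hive to load it into a table), one possible implementation is shown below. The host names, paths and table name are assumptions, and the actual LoadDataLogic.java may differ.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadDataSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: copy the converted CSV file from local disk into HDFS
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");   // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/tmp/dataset.csv"),
                             new Path("/user/bigexcel/staging/dataset.csv"));

        // Step 2: ask Hive (via JDBC) to load the staged file into a warehouse table
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "user", "");
        Statement stmt = con.createStatement();
        stmt.execute("LOAD DATA INPATH '/user/bigexcel/staging/dataset.csv' INTO TABLE dataset");
        stmt.close();
        con.close();
        fs.close();
    }
}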
Data Manager

The data manager is responsible for transferring data to HDFS and loading it into Hive. The list below shows its implementation.

Package: Server/loading
- LoadData.java: This Java class is responsible for providing a REST API interface for all requests related to loading data into Hive.
- LoadDataLogic.java: This Java class is responsible for transferring data to HDFS and loading it into Hive.
- LoadDataServlet.java: This Java class is responsible for loading the REST API interface classes into the web container.

Response Processor

The response processor is responsible for parsing the response received from the connection manager so that it can be displayed in the data browser. The list below shows the implementation of the response processor.

Package: Server/viewdata
- LoadTable.java: This Java class is responsible for providing a REST API interface for all requests related to viewing stored data.
- LoadTableFromDataBase.java & LoadTableLogic.java: These Java classes are responsible for providing services for viewing stored data from the warehouse.
- ResponseParser.java: This Java class is responsible for parsing the responses.

Analytical Processor

The analytical processor is responsible for processing analytic queries by consulting the modules repository (algorithm library). The list below shows the implementation details.

Package: Server/analysis
- CustomeScriptAnalysis.java: This Java class is responsible for executing custom scripts (algorithms) for analytics on the Hive store.
- AnalyticsInterface.java: This Java class is responsible for providing a REST API interface for all requests related to analytics.
- UtilityFunctionsComputations.java: This Java class is responsible for handling the computations related to utility functions, e.g. average, standard deviation, sum, etc.

In addition to the above modules, the implementation contains a general module which interacts with several other modules in the design. For instance, we use a separate database for storing metadata about the warehouse, such as the number of rows in a table, recorded before the table is loaded into Hive. The purpose of this metadata database is to let the framework display such information without sending resource-intensive queries, such as counting the rows of a million-row table, to Hive. Hive is designed to store very large data sets, and such queries can take considerable time and slow down the framework; keeping the metadata separately in a small database lets the framework process and visualize this information efficiently without issuing informational queries to Hive.
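As an illustration of how such a metadata store can be used, the sketch below saves and reads a table's row count through the SQLite JDBC driver. It is a minimal sketch, assuming the xerial sqlite-jdbc driver is on the classpath; the database file name, table layout and class name are hypothetical and do not reproduce the framework's SQLLiteDBAccess class.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Minimal sketch of a metadata store backed by SQLite (assumed driver: org.sqlite.JDBC).
// It keeps per-table row counts outside Hive so informational queries stay cheap.
public class MetadataStoreSketch {

    private static final String DB_URL = "jdbc:sqlite:metadata.db"; // hypothetical file

    public static void saveRowCount(String tableName, long rowCount) throws Exception {
        Class.forName("org.sqlite.JDBC"); // register the driver (may be automatic with newer versions)
        try (Connection conn = DriverManager.getConnection(DB_URL)) {
            conn.createStatement().executeUpdate(
                "CREATE TABLE IF NOT EXISTS table_metadata (table_name TEXT PRIMARY KEY, row_count INTEGER)");
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT OR REPLACE INTO table_metadata (table_name, row_count) VALUES (?, ?)")) {
                ps.setString(1, tableName);
                ps.setLong(2, rowCount);
                ps.executeUpdate();
            }
        }
    }

    public static long readRowCount(String tableName) throws Exception {
        Class.forName("org.sqlite.JDBC");
        try (Connection conn = DriverManager.getConnection(DB_URL);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT row_count FROM table_metadata WHERE table_name = ?")) {
            ps.setString(1, tableName);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getLong(1) : -1L; // -1 when no metadata is recorded
            }
        }
    }
}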
Furthermore, the general module contains classes for conversion to and from JSON for transporting data. The list below shows the implementation details of the general module.

Package: General/client
- ApplicationMessages.java: This Java class is responsible for assisting the response processor in processing responses.
- ClientSideGsonConversion.java: This Java class is responsible for client-side JSON conversion (to and from).

Package: General/server
- ServerSideGsonConversion.java: This Java class is responsible for server-side JSON conversion (to and from).
- SQLLiteDBAccess.java: This Java class is responsible for handling transactions on the temporary database, such as insertion, update and retrieval of records.

Infrastructure Layer

The infrastructure layer is responsible for providing the infrastructure and the Big Data processing technologies as a bundle, hiding the configuration and deployment details of technologies such as Hadoop and Hive. As discussed earlier, we are using Hive and Hadoop for data processing, so a connection manager module is required to manage all traffic between the query management layer and the infrastructure layer. The list below shows the connection manager implementation details.

Package: Infrastructure/hive and hadoop
- AccessHadoopFileSystem.java: This Java class is responsible for handling access to HDFS.
- TreeDataModelForHDFS.java: This Java class is responsible for handling the data model for the HDFS file system tree visualization.
- DataHolderForTree.java: This Java class is responsible for getting data from HDFS and passing it to the data model for rendering.
- ConnectionManager.java: This Java class is responsible for handling the connection with the Hive warehouse.
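To illustrate what the connection manager does at the lowest level, the sketch below opens a JDBC connection to Hive and runs a simple statement. It is a sketch only: the host, port, database and credentials are placeholders, and the driver class and URL scheme depend on the Hive version deployed in the infrastructure layer (org.apache.hive.jdbc.HiveDriver with jdbc:hive2:// for HiveServer2, or the older org.apache.hadoop.hive.jdbc.HiveDriver with jdbc:hive:// for HiveServer).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch of opening a Hive connection and issuing a query over JDBC.
// Host, port, database and credentials are placeholders.
public class HiveConnectionSketch {

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 driver
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // print each table name in the warehouse
            }
        }
    }
}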
5.5 Summary

In this chapter we presented the detailed implementation of the lightweight web-based framework and discussed the different technologies used in the implementation. In the next chapter, we test the framework with two case studies using large data sets (Big Data) from the Yahoo Labs sandbox.
6 :-: Case Studies and Experiments

In this chapter, we demonstrate the framework with two case studies that use data sets from the Yahoo sandbox. We analyze the data and try to predict future trends based on current data. In addition, we try to identify the important events that happened in 2006 from a large data set gathered from a corpus of news-related websites.

6.1 Case Study 1

This case study uses a dataset from the Yahoo sandbox. The data set is taken from the marketing and advertising category and represents Yahoo Buzz Game scores (a buzz score is a value calculated by Yahoo based on the searches for each product). It contains Yahoo search engine results for different products such as TVs, e-books, browsers, maps, games, Linux and photo sharing/organizing websites, each associated with its buzz score. Each product further contains different product types, such as Internet Explorer, Mozilla and Netscape for the browser product, and Debian, Fedora and Ubuntu for the Linux product. Users buy and sell stocks, and the price fluctuates according to the supply and demand of the stocks, determined using a market mechanism developed by Yahoo called the dynamic pari-mutuel market [9]. Dividends are paid to stock holders based on the buzz scores, where a buzz score represents the percentage of searches at a particular time for a particular product type. Researchers such as social scientists can use this data to predict online market trends and can use these results to test market behaviour theories [9].

The data set contains data for the periods from 1st April 2005 to 31st July 2005 and from 22nd August 2005 to 6th April 2006, where for each searched product and product type the date and time are recorded along with the buzz score. The size of this data set is approximately 90 MB uncompressed. We use this data to predict buzz scores based on hourly, daily and weekly analysis of the data, and we compare the actual figures with the predicted values in order to draw future trends from the predicted data. The initial data is in structured text (.txt) format. We use the web-based framework to convert it to (.csv) format. The CSV file is then loaded into Hive for analysis. After loading the data, the user can explore it by binding the custom-written scripts (algorithms) from the modules repository.
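The conversion step itself is simple. Purely as an illustration (the framework's TextToCSV class is not reproduced here, and the tab-delimited input format and file names are assumptions), a text-to-CSV conversion might look like this:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;

// Illustrative sketch: convert a delimited .txt dataset into .csv before loading it
// into Hive. The input is assumed to be tab-separated; file names are hypothetical.
public class TextToCsvSketch {

    public static long convert(String txtPath, String csvPath) throws Exception {
        long rows = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(txtPath));
             BufferedWriter out = new BufferedWriter(new FileWriter(csvPath))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line.replace('\t', ','));  // swap the tab delimiters for commas
                out.newLine();
                rows++;
            }
        }
        return rows; // the row count can later be stored as metadata for the table
    }

    public static void main(String[] args) throws Exception {
        long rows = convert("buzz_scores.txt", "buzz_scores.csv");
        System.out.println("Converted " + rows + " rows");
    }
}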
Hourly Analysis

For the hourly analysis we use a custom-developed script written in Perl [38], a scripting language (see Appendix II), taken from the modules repository. The analytical processor embeds the module from the modules repository and executes the analysis on Hive. In this analysis, we average the buzz scores on an hourly basis in order to predict future trends, which can then be used to trade stocks based on buzz scores. Figure VI shows the buzz scores for the five days from 23/05/2005 to 27/05/2005, where each hour of each day is the average of the buzz scores from 00:00:00 to 00:59:59 of that hour.

[Figure VI: Buzz scores from May 2005. Average hourly buzz scores (y-axis) against time of day (x-axis, 01:00:00 to 23:00:00) for 23/05/2005 to 27/05/2005.]

Figure VI shows the search volume, in terms of buzz scores, for the video games product. In the figure we can see the clustering of points around times such as 2:00:00, 7:00:00, 11:00:00, 15:00:00, 18:00:00 and 22:00:00, which can be used to predict reasonable trends for trading video game stocks at different time intervals. To generate this analysis, the user inputs the table name (in this case yahoo_buzz_scores), the column names (date, time, buzz_scores) and the product name, e.g. vgames, along with the algorithm from the modules repository. The user can view the products first by querying the table on the framework interface. Optionally, the user can enter the dates for the analysis manually; the query constructor then uses the dates supplied by the user. Once the required input is available, the query constructor constructs the query and the analytical processor executes it on Hive. The query below was used to generate the Figure VI graph. TRANSFORM is the built-in Hive clause for incorporating user-defined logic (an algorithm) into a Hive query; it takes the column names as input, and the records are filtered with the WHERE clause (the date range shown corresponds to the period plotted in Figure VI):

SELECT TRANSFORM (date, time, buzz_scores) USING hourly_analysis
FROM YAHOO_BUZZ_SCORES
WHERE product = 'vgames' AND date >= '2005-05-23' AND date <= '2005-05-27'
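A TRANSFORM script is simply a program that reads the selected columns from standard input as tab-separated lines and writes its result rows, also tab-separated, to standard output. The hourly analysis script in Appendix II follows this convention in Perl; the sketch below is an illustrative Java equivalent of the same idea (not the script actually used), averaging the scores per date and hour:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative Java equivalent of a Hive TRANSFORM script. Hive streams the selected
// columns (date, time, buzz_scores) to stdin as tab-separated lines and reads
// tab-separated result lines from stdout. This version averages scores per (date, hour).
public class HourlyAnalysisSketch {

    public static void main(String[] args) throws Exception {
        Map<String, double[]> accumulators = new LinkedHashMap<>(); // "date\thour" -> {sum, count}
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String[] cols = line.split("\t");
            if (cols.length < 3) continue;                       // skip malformed rows
            String key = cols[0] + "\t" + cols[1].split(":")[0]; // date and hour of HH:MM:SS
            double score = Double.parseDouble(cols[2].trim());
            double[] acc = accumulators.get(key);
            if (acc == null) {
                acc = new double[2];
                accumulators.put(key, acc);
            }
            acc[0] += score;
            acc[1] += 1;
        }
        for (Map.Entry<String, double[]> e : accumulators.entrySet()) {
            double average = e.getValue()[0] / e.getValue()[1];
            System.out.println(e.getKey() + "\t" + average);     // date \t hour \t average score
        }
    }
}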
The filtered data is processed by the user-defined logic, in this case the hourly_analysis script named in the query. For the output of the analysis, Hive provides two options: the result can be stored in a temporary table, or it can be displayed on the client screen. Similarly, Figure VII shows the graphical representation of the data for the five days from 06/06/2005 to 10/06/2005.

[Figure VII: Buzz scores from June 2005. Average hourly buzz scores (y-axis) against time of day (x-axis, 01:00:00 to 23:00:00) for 06/06/2005 to 10/06/2005.]

From Figure VII we can see that at times 3:00:00, 11:00:00, 20:00:00 and 21:00:00 the graph shows clustering of data points. These points can be used to predict trends, for example by taking the average of the clustered points of the five days for a particular hour in order to predict that hour on a sixth day. The framework constructed the same query for generating Figure VII as it constructed for generating the Figure VI graph, with the appropriate date range.

Daily Analysis

In the daily analysis we incorporate two techniques. The first is called extrapolation-based prediction and the second is called regression-based prediction. In extrapolation-based prediction we aggregate the data by taking the average of the n preceding days. For instance, if we want to predict the searching trends on 23 July 2005 with a sample of the 7 preceding days, then we take the daily average from 15 July 2005 to 22 July 2005. Similarly, if we want to predict the searching trends for 23 July 2005 based on a preceding 4-week sample, then we use the average buzz score of each Saturday; the dates are the 8th and 25th June and the 9th and 16th July.
Similarly, if an 8-week sample is selected to predict the trends, then the average buzz scores from each Saturday (14th May; 4th, 18th and 25th June; and 2nd, 9th and 16th July) are used.

Formula for extrapolation-based prediction:

$\text{predicted score} = \frac{1}{n}\sum_{i=1}^{n} x_i$

where $n$ is the number of preceding days used as sample points to predict the buzz score and $x_i$ is the buzz score of sample day $i$.

[Figure VIII: Daily aggregated buzz scores from 01 April 2005 to 31 July 2005 for E-books, Online Music, Social Network, Photos and Video Games.]

Figure VIII represents the daily aggregated buzz scores for the different technologies over a period of four months. To predict the searching trend for 23 July 2005 for e-books with a sample of the 7 preceding days, the user inputs the column names (in this case date, time, buzz_scores), the product name (ebooks) and the sample of preceding days, which is 7 in this case, by entering the dates from 15 July 2005 to 22 July 2005. The query constructor then constructs the following query:

SELECT TRANSFORM (date, time, buzz_scores) USING daily_analysis
FROM YAHOO_BUZZ_SCORES
WHERE product = 'ebooks' AND date >= '2005-07-15' AND date <= '2005-07-22'
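As an illustration of the calculation (the helper class below is hypothetical and not part of the framework), the extrapolation-based prediction is simply the mean of the n preceding sample points:

// Illustrative sketch of extrapolation-based prediction: the forecast is the mean
// of the n preceding sample points (e.g. daily aggregated buzz scores).
public class ExtrapolationPredictionSketch {

    public static double predict(double[] precedingScores) {
        double sum = 0.0;
        for (double score : precedingScores) {
            sum += score;
        }
        return sum / precedingScores.length; // (1/n) * sum of x_i
    }

    public static void main(String[] args) {
        // hypothetical daily averages for the 7 preceding days
        double[] lastSevenDays = {12.4, 11.9, 13.1, 12.8, 12.2, 11.7, 12.9};
        System.out.println("Predicted buzz score: " + predict(lastSevenDays));
    }
}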
In the regression-based prediction technique, samples of the n preceding days or n preceding weeks are collected and aggregated using linear regression to predict future trends. For this analysis the same data as in the previous query is supplied at the user interaction layer and the query constructor constructs the query, but this time the analysis uses regression-based aggregation instead of aggregation by average, as in the extrapolation-based technique.

Formula for regression-based prediction (equation for forecasting):

$Y = a + bX$, where $a = \bar{y} - b\bar{x}$ and $b = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$

$\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$ respectively.

[Figure IX: Comparison of actual and predicted buzz scores using the EP technique, for E-book, Online Music, Social Network, Photo Organiser and Video Game, with samples of 7, 14 and 28 days and of 4 and 8 weeks.]

Figure IX shows the comparison of actual and predicted buzz scores using the EP technique. We can see that for online music the predicted score with a sample of 7 days is the lowest, whereas the sample of 8 weeks gives the highest score compared with the actual score. We can see fluctuation in the predicted scores for e-books; a high predicted score can be seen with a sample of 8 preceding weeks, taking the buzz score of each Saturday. The social network category shows that the predicted score is lower than the actual score with samples of 7 and 14 days, while it is higher when a 28-preceding-day sample is used. Similarly, the photo organiser category shows an increase in buzz score when a 28-day sample is used, whereas the video game category does not show a significant increase or decrease in the buzz score.
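For comparison, the regression-based prediction described above fits a least-squares line through the preceding sample points and evaluates it one step ahead. The sketch below illustrates the calculation; the class is hypothetical and not part of the framework, and it assumes the sample points are equally spaced (x = 1..n):

// Illustrative sketch of regression-based prediction: fit y = a + b*x by least
// squares over the n preceding sample points (x = 1..n) and evaluate the line
// at x = n + 1 to forecast the next value.
public class RegressionPredictionSketch {

    public static double predictNext(double[] y) {
        int n = y.length;
        double xMean = (n + 1) / 2.0;                 // mean of x = 1..n
        double yMean = 0.0;
        for (double value : y) {
            yMean += value;
        }
        yMean /= n;

        double numerator = 0.0, denominator = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = (i + 1) - xMean;
            numerator   += dx * (y[i] - yMean);       // sum (x - xbar)(y - ybar)
            denominator += dx * dx;                   // sum (x - xbar)^2
        }
        double b = numerator / denominator;           // slope
        double a = yMean - b * xMean;                 // intercept
        return a + b * (n + 1);                       // forecast one step ahead
    }

    public static void main(String[] args) {
        // hypothetical daily averages for the 7 preceding days
        double[] lastSevenDays = {12.4, 11.9, 13.1, 12.8, 12.2, 11.7, 12.9};
        System.out.println("Predicted buzz score: " + predictNext(lastSevenDays));
    }
}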
Figure X shows the comparison of actual and predicted buzz scores with the regression-based prediction technique. We can see from the graph that online music has almost the same trend between the actual and predicted scores when an 8-week sample is used, whereas a sample of 7 preceding days results in a high buzz score. The predicted score for e-books is high with a sample of 28 days and lower with a sample of 8 weeks, compared with the actual buzz score. The predicted score for video games is again consistent, as it was with the extrapolation technique, whereas the social network score is high with the 8-week sample and lower with the 14- and 28-preceding-day and 4-preceding-week samples, compared with the actual buzz score. The predicted score for the photo organiser is slightly lower with the 7-day sample and slightly higher with the 4-week sample, compared with the actual buzz score.

[Figure X: Comparison of actual and predicted buzz scores using the RP technique, for E-book, Online Music, Social Network, Photo Organiser and Video Game, with samples of 7, 14 and 28 days and of 4 and 8 weeks.]

Table 6-1 shows the error percentage of the prediction techniques EP and RP.

[Table 6-1: % error between the predicted and actual buzz scores for E-books, Online Music, Social Network, Photos and Video Game, under EP and RP with samples of 7, 14 and 28 days and of 4 and 8 weeks.]

A value close to zero represents the correctness of the prediction technique; e.g. the video game category with an 8-week sample has an error of only 2% in its prediction.
Weekly Analysis

In the weekly analysis we take the buzz scores of n preceding weeks, with samples of 4 and 8 weeks. For instance, if we want to predict the buzz scores on several days in a week, then we take the scores of the n preceding weeks, where each week consists of the buzz scores obtained by aggregating (summing and averaging the whole day's buzz scores for a particular product type, e.g. Adobe books in the e-book category) the daily scores in that week. For example, to predict the buzz scores in the week from 17/07/2005 to 23/07/2005 we take the buzz scores of four previous weeks (not necessarily consecutive) when a 4-week sample is used. We have used the buzz scores of the weeks 12/06/2005 to 18/06/2005, 19/06/2005 to 25/06/2005, 26/06/2005 to 02/07/2005 and 10/07/2005 to 16/07/2005, where each week consists of the aggregated buzz scores of every day from Sunday to Saturday, as shown in the sample table below.

| Week | Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
| 12/06/2005 to 18/06/2005 | ... | ... | ... | ... | ... | ... | ... |

For generating Figure XI, the framework used multiple queries:

SELECT TRANSFORM (date, time, buzz_scores) USING daily_analysis
FROM YAHOO_BUZZ_SCORES
WHERE product = 'ebooks' AND date >= '...' AND date <= '...'

SELECT TRANSFORM (date, time, buzz_scores) USING daily_analysis
FROM YAHOO_BUZZ_SCORES
WHERE product = 'ebooks' AND product_type = 'adobebook'
  AND date >= '...' AND date <= '...' AND date >= '...' AND date <= '...'

SELECT TRANSFORM (date, time, buzz_scores) USING daily_analysis
FROM YAHOO_BUZZ_SCORES
WHERE product = 'ebooks' AND product_type = 'safaribook'
  AND date >= '...' AND date <= '...' AND date >= '...' AND date <= '...'

SELECT TRANSFORM (date, time, buzz_scores) USING daily_analysis
FROM YAHOO_BUZZ_SCORES
WHERE product = 'ebooks' AND product_type = 'amazonbook'
  AND date >= '...' AND date <= '...' AND date >= '...' AND date <= '...'

All the above queries use the daily_analysis script for the weekly analysis; the changes are only in the WHERE condition, where the rows are selected. The user inputs the table name and column names, selects the daily analysis script from the modules repository and, optionally, the dates for the analysis.
[Figure XI: Actual and predicted scores with a 4-week sample; panels: E-book (4 week), E-book Adobe Book (4 week), E-book Safari Book (4 week), E-book Amazon (4 weeks).]

Figure XI shows better trends with a sample of 4 weeks for the e-book product and the Adobe book product type when the regression-based technique is used. Safari books and Amazon also have better predictions using the regression-based technique. The scores obtained with the extrapolation technique are more consistent with the actual scores, but they are higher than the actual scores.
[Figure XII: Actual and predicted scores with an 8-week sample; panels: E-book (8 week), E-book Adobe Book (8 week), E-book Safari Book (8 week), E-book Amazon (8 week).]

Figure XII shows the actual and predicted scores with a sample of 8 weeks, and again we can see from the figures that the trend is better captured using regression-based prediction for e-books. The Adobe book statistics show an increase in buzz scores on Wednesday when the regression-based technique is used, while the extrapolation-based technique shows almost identical variation in the graph and is closer to the actual results, but with low buzz scores on each day. The Safari book statistics show variations in the results that are almost opposite to the actual results, but again the regression-based technique shows better predicted results than the extrapolation-based technique; in addition, the predicted search volume for Safari books is low except on Saturday, where it is predicted to be higher. For the Amazon books, both the regression- and extrapolation-based techniques show close results.

The table below shows the correlation between the actual and predicted buzz scores, computed as

$r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$

where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$.
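This is the Pearson correlation coefficient between the actual and predicted scores. As an illustration only (the class below is hypothetical, not part of the framework), it can be computed as follows:

// Illustrative sketch of the correlation measure used in Table 6-2: the Pearson
// correlation coefficient between actual (x) and predicted (y) buzz scores.
public class CorrelationSketch {

    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double xMean = 0.0, yMean = 0.0;
        for (int i = 0; i < n; i++) {
            xMean += x[i];
            yMean += y[i];
        }
        xMean /= n;
        yMean /= n;

        double covariance = 0.0, xVariance = 0.0, yVariance = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - xMean;
            double dy = y[i] - yMean;
            covariance += dx * dy;   // sum (x - xbar)(y - ybar)
            xVariance  += dx * dx;   // sum (x - xbar)^2
            yVariance  += dy * dy;   // sum (y - ybar)^2
        }
        return covariance / Math.sqrt(xVariance * yVariance);
    }

    public static void main(String[] args) {
        // hypothetical actual vs. predicted daily scores for one week
        double[] actual    = {10.2, 11.5, 9.8, 12.1, 10.9, 11.2, 10.5};
        double[] predicted = {10.0, 11.1, 10.2, 11.8, 11.0, 10.9, 10.7};
        System.out.println("Correlation: " + pearson(actual, predicted));
    }
}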
[Table 6-2: Correlation between actual and predicted scores for E-book, Adobe Book, Amazon and Safari Book, under EP and RP with samples of 4 and 8 weeks.]

Table 6-2 shows the correlation between the actual and predicted scores, where we can see a correlation of 68% when the 8-week sample is used with the EP technique, and correlations of 70% and 88% with the RP technique for Amazon and Safari books. In addition, there is a low correlation for Amazon books when the 4-week sample is used with the EP technique. We can conclude from our results that the RP technique is generally better than EP for predicting future trends, and hence users of the framework, such as non-expert scientists, can predict future market trends using either technique with little effort.

6.2 Case Study 2

In this case study we take a language dataset from the Yahoo sandbox, which is almost 30 GB of uncompressed data, of which approximately 12 GB is 5-gram data gathered from a large collection of news-related websites and from more than 14.6 million documents containing almost 126 million unique sentences. The data sets cover the period of 11 months from February 2006 to December 2006. Scientists, such as social scientists working in the linguistics domain, can use this data to build statistical models for different domains, as well as to analyse the events that happened in specific years by extracting the related information from the data [9].

This data set consists of n-grams from 1 to 5, where 1 indicates one word and 5 indicates five words from the corpus. We use 5-gram words in this case study, as the likelihood of finding meaningful information in the corpus increases as n increases. We use this corpus to find the important events that happened during the period from February to December 2006, from a total of 29,570,136 n-gram entries in the corpus.

In the first scenario we search for H5N1 influenza, also called bird flu. The first case was reported in January in Turkey, and deaths were recorded throughout the year in various countries, including Turkey and other countries of Africa, Asia and Europe [39]. We use different n-gram words to look for the possible text: for example, "bird flu" is a 2-gram, while "bird flu deaths", "bird flu disease", "bird flu came" and "bird flu influenza" are 3-grams; similarly, for a 4-gram, "bird flu cause deaths" might be the sentence. Figure XIII shows the number of statements/words related to bird flu found in the corpus, where <token> represents a word within an n-gram. For example, "bird flu <token> <token> <token>" might be "bird flu death in Egypt" or "bird flu deaths have occurred"; "<token> bird flu <token> <token>" might be "human bird flu death came" or "six bird flu deaths traced"; and "<token> <token> bird flu <token>" might be "the 41st bird flu death" or "confirm another bird flu death" as the 5-gram word.
[Figure XIII: Number of bird flu related statements found in the corpus, for the patterns "bird flu <token> <token> <token>", "<token> bird flu <token> <token>" and "<token> <token> bird flu <token>".]

From Figure XIII we can see that statements with the bird flu phrase at the start of the 5-gram have a higher frequency than statements with the bird flu keyword in the middle. There are a total of 1563 5-grams related to the bird flu event in the whole corpus, which is almost (1563 / 29,570,136) * 100, or approximately 0.005%, of the entire data set. To perform the above analysis, the user provides the keyword to search for and frequency as the column name, viewing them through the data browser interface, along with the table name. The query constructor constructs the following query to perform the analysis:

SELECT TRANSFORM (key_word, frequency) USING bird_flu_event_analysis FROM YAHOO_NGRAMS

The bird_flu_event_analysis is a user-written algorithm for analysing the supplied keyword over the corpus. The analytical processor selects the bird_flu_event_analysis script from the modules repository and executes the above query on the Hive warehouse. The response processor parses the response and sends the information back to the data browser for visualization.
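To illustrate what such a keyword analysis script might do (this is a sketch only; the actual bird_flu_event_analysis script is not reproduced here, and the hard-coded phrase is an assumption), it receives the (key_word, frequency) rows streamed by Hive on standard input and counts the 5-grams that contain the phrase:

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Illustrative sketch of a keyword analysis TRANSFORM script. Hive streams the
// selected columns (key_word, frequency) to stdin as tab-separated lines; this
// sketch counts the 5-grams containing a phrase and sums their frequencies.
public class KeywordAnalysisSketch {

    public static void main(String[] args) throws Exception {
        String phrase = "bird flu";              // assumed search phrase
        long matchingNgrams = 0;
        long totalFrequency = 0;

        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String[] cols = line.split("\t");
            if (cols.length < 2) continue;       // skip malformed rows
            String ngram = cols[0].toLowerCase();
            if (ngram.contains(phrase)) {
                matchingNgrams++;
                totalFrequency += Long.parseLong(cols[1].trim());
            }
        }
        // emit one tab-separated result row back to Hive
        System.out.println(matchingNgrams + "\t" + totalFrequency);
    }
}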
In the second scenario we consider the FIFA World Cup of 2006, which Italy won by beating France 5-3 on penalty kicks after a 1-1 draw over the two halves. The event took place in June and July of 2006 and was watched by a cumulative audience of billions of non-unique viewers [40]. As the FIFA World Cup is the most watched sports event in the world, we expect considerable discussion of this event on the internet. To explore the data for events related to the World Cup, we use two key phrases, "Italy beat France" and "Italy wins". Again, we take the 5-gram data for the analysis because, as explained before, it is more likely to yield results than keywords of fewer than 5 grams.

[Figure XIV: Frequency of related statements found in the corpus, for the patterns "italy beat france <token> <token>", "<token> <token> italy beat france", "<token> italy beat france <token>", "italy wins <token> <token> <token>" and "<token> italy wins <token> <token>".]

Figure XIV shows the frequency of occurrence of the different statements in the corpus. As with bird flu, statements with "Italy beat France" as the first tokens have a higher frequency of occurrence than the rest of the combinations. In total, 388 statements were found that are directly related to the event. To perform this analysis using the framework, the user inputs the key word and frequency as the column names, and the query constructor constructs the following query:

SELECT TRANSFORM (key_word, frequency) USING world_cup_event_analysis FROM YAHOO_NGRAMS

The query constructor passes the query to the analytical processor, which selects the user-defined algorithm from the modules repository and executes it on the Hive warehouse for analysis. The response processor receives the analysis response and passes the result to the data browser for visualization.

In the third scenario we consider the event of Google purchasing YouTube on 9th October 2006 for $1.65 billion, competing with other bidders such as Yahoo, Microsoft, Viacom and News Corporation [41]. We expect this event to have been discussed on many websites and blogs. We search for key phrases such as "Google bought YouTube" and "Google buys YouTube".
[Figure XV: Frequency of related statements found in the corpus, for the patterns "google bought youtube <token> <token>", "<token> google bought youtube <token>", "<token> <token> google bought youtube", "google buys youtube <token> <token>", "<token> google buys youtube <token>" and "<token> <token> google buys youtube".]

Figure XV shows a higher frequency of occurrence when the phrase "Google buys YouTube" is used. Overall, we found 56 occurrences of statements that are directly related to our key phrases. To perform the above analysis, the user inputs the key word and frequency as the column names, and the query constructor constructs the following query:

SELECT TRANSFORM (key_word, frequency) USING google_youtube_event_analysis FROM YAHOO_NGRAMS

The query constructor passes the query to the analytical processor, which takes the user-defined algorithm from the modules repository and executes it on the Hive warehouse. The response from the infrastructure layer is handled by the response processor, which parses it and sends the result back to the data browser for visualization.

This case study demonstrates the usability of the framework for non-expert users, who can explore and analyse data with minimal effort and without technical knowledge of the Big Data technologies. A non-expert user can easily obtain the information by analysing the graphs and conclude, for example, that the most discussed of these events in 2006 was bird flu.
6.3 Summary

In this chapter we presented two case studies to demonstrate the framework. In the first case study we predicted future trends from a market dataset using extrapolation- and regression-based techniques, and in the second case study we demonstrated the exploration of a large data set to understand the important events that happened in 2006. The case studies demonstrate the usefulness of the framework for non-expert users. In the next chapter, we evaluate the framework by comparing it with the related work and in terms of performance.
7 :-: Evaluation

In this chapter, we evaluate our solution by comparing it with the related work discussed in chapter two. We also examine the performance of the tool and discuss its usability for non-expert users.

7.1 Comparison with related work

Table 7-1 shows the comparison of BigExcel with the related technologies. From our discussion in chapter two, we can conclude that most Big Data platforms or frameworks are commercial and complex; they are designed for expert and business users, who can use analytics to enhance their business and gain insights into data in ways that were not possible before. We have compared the lightweight web-based framework with these related platforms and frameworks.

From Table 7-1 we can conclude that the complexity of almost all the Big Data platforms is high. An exception is the 1010data platform, which has a spreadsheet-like structure and hides the data processing technologies from the user; however, 1010data is developed with business users in mind and is not suitable for non-technical users, it lacks Big Data processing technologies at the backend, and it relies on a high-performance database, which may not be the solution to all Big Data problems. In comparison, the lightweight web-based framework supports the state-of-the-art Big Data processing technologies Hive and Hadoop for analysing data, providing an abstraction over their technical deployment, configuration and running, together with an easy-to-use interface. BigExcel checks all the points in Table 7-1 except that it is not intended for business users and its data management support is limited, whereas most commercial platforms have good data management support.

Similarly, InfoSphere has a spreadsheet-like interface called BigSheets, but the platform is designed for expert users, and even running a basic (trial) version requires considerable configuration and deployment. InfoSphere supports everything from analytics to current Big Data processing technologies such as Hadoop, along with dedicated hardware and data management support. On the other hand, BigExcel provides an easy-to-use interface and an abstraction over the configuration and deployment of the Big Data technologies, along with analytics support.
| Features | InfoSphere | SAP | TERADATA | Cloudera | HortonWorks | Amazon | 1010data | Oracle | HP | Hive Analysis | BigExcel |
| Web Interface | Yes | No | No | Yes | No | Yes | Yes | No | No | Yes | Yes |
| Hadoop Support | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | No | No | Yes |
| Dedicated Hardware Support | Yes | Yes | Yes | No | No | No | No | Yes | Yes | N/A | No |
| Analytics | Yes | Yes | Yes | No | No | No | Yes | Yes | Yes | No | Yes |
| Database Support | No | Yes | Yes | No | No | No | Yes | Yes | Yes | No | No |
| Data Management | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes (Limited) |
| Non-expert users support | Yes (Low) | No | No | No | No | No | Yes (Medium) | No | No | No | Yes |
| Business users support | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No |

Table 7-1 Comparison between the related platforms and BigExcel
Similarly, the other platforms shown in Table 7-1, such as SAP, TERADATA, Cloudera and HortonWorks, all have similar properties, being commercial products intended for technical users. BigExcel, by contrast, is intended for non-expert scientists such as social scientists and can be accessed over the internet, where users can start visualization and analysis of Big Data straight away.

7.2 Performance Evaluation

The performance of the framework depends heavily on the underlying hardware, as the framework uses Big Data processing technologies at the backend. There are multiple options for getting the infrastructure and the Big Data processing technologies up and running: using commodity hardware with pre-configured and deployed Hadoop and Hive, or using a cloud service such as Amazon Elastic MapReduce (EMR). Table 7-2 shows different benchmarks obtained from the framework.

| Parameters | Performance | Description |
| Data Loading Time, Data Size = 90 MB | 1 minute | A machine running an Intel i3 processor with 2 cores. Loading time depends on the data size and the underlying hardware; for instance, loading 10 GB of data takes 10 seconds on a machine running an Intel i3 core processor. |
| Data Loading Time, Data Size = 15 GB | 20 minutes | A machine running an Intel i3 processor with 2 cores |
| Analytics Processing, Data Size = 90 MB | Depends upon the cluster setup (e.g. small, medium, large); typically less than a minute in a small cluster setup | Amazon Elastic MapReduce; depends on the underlying hardware |
| Analytics Processing, Data Size = 15 GB | Depends upon the cluster setup (e.g. small, medium, large); typically less than 10 minutes in a small cluster setup | Amazon Elastic MapReduce |
| Usability | The framework is mainly developed for non-technical users, who can explore and analyze Big Data without having any technical expertise | Simple web-based framework; can be accessed over HTTP via any web browser; loading, processing and visualization of Big Data |

Table 7-2 Performance Analysis

7.3 Summary

In this chapter we presented the evaluation of the framework, discussing the performance benchmarks and comparing the related platforms and frameworks with BigExcel. In the next chapter, we provide the conclusions and future work.
8 :-: Conclusion & Future Work

In this chapter, we draw the conclusions of the work presented in this research and suggest future work in the domain, including how the framework could be made easier and more efficient for non-expert scientists to use.

8.1 Conclusions

The main limitations of current Big Data tools are their complexity and their unsuitability for researchers who are not computer scientists. This research aimed to overcome this issue by developing an easy-to-use web-based framework for Big Data analytics which hides the technical details needed to explore large data sets. With this in mind, we presented a web-based framework for exploring Big Data in an ad hoc manner. The main intention of this research was to assist non-expert users, such as social scientists, in gaining insights into Big Data. Users can easily load data into the framework and start analysing it straight away by selecting algorithms from the modules repository. In addition, users can submit queries to Hive in the SQL-like language HiveQL through the framework's web interface, and can view the Big Data like a spreadsheet on the web interface by loading it incrementally. Users can give a column name as input, along with the table name, on the web interface for analysis, which can be viewed by loading the data on the interface. After an algorithm is selected from the modules repository, the framework performs the analysis. The framework takes care of the underlying logic and gives the user an abstraction over the Big Data technologies, their configuration and deployment, as well as over the application's configuration and running, and hence facilitates non-expert users in harnessing Big Data. Furthermore, the framework was developed using software engineering principles.

The outcome of this research is a framework for Big Data analytics which can be used to explore large data sets in an ad hoc manner. The feasibility of the tool was tested by taking two large data sets from the Yahoo Labs sandbox, for predictive analytics and for the analysis of a large language data set extracted from a corpus of news-related websites. We evaluated the framework with respect to its usability for non-expert users and with respect to the performance of the tool. This research is useful mainly for scientists who work in the social sciences and are not technically familiar with software, especially Big Data analytics platforms. With the help of this framework they can now access Big Data and perform analytics on it, which in turn helps researchers develop an interest in employing historical Big Data in their fields to drive future innovations based on previous trends or information. The work presented in this thesis (see Appendix III) is expected to be submitted to the IEEE Big Humanities workshop 2014, to be held in conjunction with IEEE Big Data 2014.
8.2 Future Work

Although the framework provides all the basic functionality to explore Big Data in an ad hoc manner, it still has limitations. The data management facility in the framework is very limited: it only supports structured data in the form of text files for loading into Hive. Unstructured datasets are not supported, as they require conversion from unstructured to structured data after extracting meaningful information from them. The modules repository (algorithm repository) is also limited. Currently, the algorithms are applicable to particular types of data sets; for instance, data sets that involve temporal information can be analysed using these algorithms, as can data sets that involve historical data. For instance, if we can derive some form of structured information from the last fifty years of newspaper data outlining the headlines, then this information can be used to draw conclusions about important events that happened during a specific time period, let us say in a year or in the past 10 years, from this large corpus.

To improve the framework, we need to develop a better data handling utility, keeping non-technical users in mind as the main users, which supports the extraction and manipulation of unstructured information. More generic algorithms need to be developed to support more data sets for analysis. An analytical library also needs to be developed to support statistical and mathematical functions, which the framework could invoke for analytics by passing only the input values.
Bibliography

[1] (2014, May) YouTube. [Online].
[2] (2014) CNET. [Online].
[3] Avita Katal, Mohammad Wazid, and R. H. Goudar, "Big data: Issues, challenges, tools and good practices," in Contemporary Computing (IC3), 2013 Sixth International Conference, 2013, pp.
[4] (2014) Airbus. [Online].
[5] (2014) StatisticsBrain. [Online].
[6] (2014) StatisticBrain. [Online].
[7] (2014) MongoDB. [Online].
[8] (2014) Stony Brook University. [Online].
[9] (2014) Yahoo Labs Webscope. [Online].
[10] (2014) IBM. [Online].
[11] (2014) SAP Big Data Solutions. [Online].
[12] (2014) SAP HANA In-Memory Database. [Online].
[13] Yi-Man Ma, Che-Rung Lee, and Yeh-Ching Chung, "InfiniBand virtualization on KVM," in 2012 IEEE 4th International Conference on Cloud Computing Technology and Science (CloudCom), 2012, pp.
[14] (2014) TERADATA Aster. [Online].
[15] (2014) Apache HCatalog. [Online].
[16] (2014) Cloudera. [Online].
[17] (2014) Apache Falcon. [Online].
[18] (2014) Apache Sqoop. [Online].
[19] (2014) Apache Flume. [Online].
[20] (2014) HortonWorks. [Online].
[21] (2014) 1010data Insight. Now. [Online].
[22] (2014) The R Project for Statistical Computing. [Online].
[23] (2014) Oracle Big Data. [Online].
[24] (2014) HP Vertica Analytics Platform. [Online].
[25] Tom White, Hadoop: The Definitive Guide, 3rd ed., Mike Loukides and Meghan Blanchette, Eds.: O'Reilly.
[26] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI'04: Sixth Symposium on Operating System Design and Implementation.
[27] Aditya B. Patel, Manashvi Birla, and Ushma Nair, "Addressing Big Data Problem Using Hadoop and Map Reduce," in 2012 Nirma University International Conference on Engineering (NUiCONE), 2012, pp.
[28] Hadoop Wiki. [Online].
[29] Edward Capriolo, Dean Wampler, and Jason Rutherglen, Programming Hive, 1st ed., Courtney Nash and Mike Loukides, Eds.: O'Reilly.
[30] Panagiotis Kalagiakos and Panagiotis Karampelas, "Cloud Computing Learning," in International Conference on Application of Information and Communication Technologies (AICT), Baku, 2011, pp.
[31] Leonard Richardson and Sam Ruby, RESTful Web Services.: O'Reilly Media.
[32] James Snell, Doug Tidwell, and Pavel Kulchenko, Programming Web Services with SOAP.: O'Reilly Media.
[33] (2014) MSDN Microsoft. [Online].
[34] (2014) JSON. [Online].
[35] (Accessed June 2014) Microsoft MSDN. [Online].
[36] Max Katz and Ilya Shaikovsky, Practical RichFaces (Expert's Voice in Java Technology), 2nd ed.: Paul Manning.
[37] David Geary and Cay S. Horstmann, Core JavaServer Faces, 3rd ed.: Prentice Hall.
[38] (2014) The Perl Programming Language. [Online].
[39] Robert G. Webster and Elena A. Govorkova, "H5N1 Influenza: Continuing Evolution and Spread," in New England Journal of Medicine, pp.
[40] (2014) FIFA. [Online].
[41] (2014) New York Times. [Online].
Appendix I

Client Side Class Diagram (figure)
Server Side Class Diagram (figure)
Appendix II

Example Algorithm used for Hourly Analysis

#!/usr/bin/perl
# global variables
my $temp_start_date = ' ';
my $format = '%Y-%m-%d';
my $temp_original_date = $temp_start_date;
my @temp_dates_array = split(/-/, $temp_start_date);
$temp_date = $temp_dates_array[2] + 31*$temp_dates_array[1] + 365*$temp_dates_array[0];
my $temp_hour = 0;
my $interval_sum = 0.0;
my $interval_average = 0.0;
my $count = 0;
# This flag is for the case where there is only one hour
my $flag = 0;
my $date_comparison_flag = 0;
my $current_hour_changed_date = 0;

while ($input = <STDIN>) {
    # each input row from Hive is tab-separated: date, time, buzz score
    my @columns = split(/\t/, $input);
    my $date  = $columns[0];
    my $time  = $columns[1];
    my $score = $columns[2];
    chomp($date);
    chomp($time);
    chomp($score);

    # handling the time and date of the current row
    my @current_time = split(/:/, $time);
    my $current_hour = $current_time[0] + 0;
    my @current_date_array = split(/-/, $date);
    $db_date = $current_date_array[2] + 31*$current_date_array[1] + 365*$current_date_array[0];

    if ($db_date == $temp_date) {
        $date_comparison_flag = 1;
        if ($current_hour == $temp_hour) {
            my $current_minute = $current_time[1] + 0;
            if ($current_minute <= 60) {
                $interval_sum += $score;
                $count++;
                $flag = 1;
            }
        }
        else {
            if ($temp_hour != 0) {
                # the hour has changed: emit the average for the previous hour
                $interval_average = $interval_sum / $count;
                my $temporary_current_hour = $current_hour - 1;
                if ($temporary_current_hour == 0) {
                    $temporary_current_hour++;
                }
                print "$date, $temporary_current_hour:00:00, $interval_average\n";
                $temporary_current_hour = 0;
                $count = 0;
                $interval_sum = 0;
                $flag = 1;
                my $current_minute_for_changed_hour = $current_time[1] + 0;
                if ($current_minute_for_changed_hour <= 60) {
                    $interval_sum += $score;
                    $count++;
                }
                my $current_hour_changed = $current_time[0] + 0;
                $current_hour_changed_date = $current_hour_changed;
                # remember the hour so the last hour can be displayed at the output
                $temp_hour = $current_hour_changed;
            }
            else {
                # first time hour check
                my $current_minute = $current_time[1] + 0;
                if ($current_minute <= 60) {
                    $interval_sum += $score;
                    $count++;
                    $flag = 1;
                }
                my $current_hour_inside = $current_time[0] + 0;
                $current_hour_changed_date = $current_hour_inside;
                $temp_hour = $current_hour_inside;
            }
        }
    }
    else {
        # the date has changed: emit the average for the last hour of the previous date
        if ($date_comparison_flag == 1) {
            $interval_average = $interval_sum / $count;
            my $temporary_current_hour = $current_hour;
            if ($current_hour_changed_date == 0) {
                $current_hour_changed_date++;
            }
            print "$temp_original_date, $current_hour_changed_date:00:00, $interval_average\n";
            $current_hour_changed_date = 0;
        }
        $temp_date = $db_date;
        $temp_original_date = $date;
        $interval_sum = 0;
        $count = 0;
        $flag = 1;
        my $current_minute_date_changed = $current_time[1] + 0;
        if ($current_minute_date_changed <= 60) {
            $interval_sum += $score;
            $count++;
        }
        $current_hour_changed_date = $current_time[0] + 0;
        $temp_hour = $current_hour_changed_date;
    }
} # end of while loop

# for handling one hour only (case where the hour never changes)
if ($flag == 1) {
    $interval_average = $interval_sum / $count;
    if ($current_hour_changed_date == 0) {
        $current_hour_changed_date++;
    }
    print "$temp_original_date, $current_hour_changed_date:00:00, $interval_average\n";
}
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
BIG DATA SOLUTION DATA SHEET
BIG DATA SOLUTION DATA SHEET Highlight. DATA SHEET HGrid247 BIG DATA SOLUTION Exploring your BIG DATA, get some deeper insight. It is possible! Another approach to access your BIG DATA with the latest
A very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
Big Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
Hadoop Job Oriented Training Agenda
1 Hadoop Job Oriented Training Agenda Kapil CK [email protected] Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module
WHITE PAPER. Four Key Pillars To A Big Data Management Solution
WHITE PAPER Four Key Pillars To A Big Data Management Solution EXECUTIVE SUMMARY... 4 1. Big Data: a Big Term... 4 EVOLVING BIG DATA USE CASES... 7 Recommendation Engines... 7 Marketing Campaign Analysis...
Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP
Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools
COURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
Big Data and Hadoop with Components like Flume, Pig, Hive and Jaql
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 7, July 2014, pg.759
Integrating SAP BusinessObjects with Hadoop. Using a multi-node Hadoop Cluster
Integrating SAP BusinessObjects with Hadoop Using a multi-node Hadoop Cluster May 17, 2013 SAP BO HADOOP INTEGRATION Contents 1. Installing a Single Node Hadoop Server... 2 2. Configuring a Multi-Node
H2O on Hadoop. September 30, 2014. www.0xdata.com
H2O on Hadoop September 30, 2014 www.0xdata.com H2O on Hadoop Introduction H2O is the open source math & machine learning engine for big data that brings distribution and parallelism to powerful algorithms
Deploying Hadoop with Manager
Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer [email protected] Alejandro Bonilla / Sales Engineer [email protected] 2 Hadoop Core Components 3 Typical Hadoop Distribution
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
Big Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru
Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy Presented by: Jeffrey Zhang and Trishla Maru Agenda Big Data Overview All About Hadoop What is Hadoop? How does MicroStrategy connects to Hadoop?
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
Manifest for Big Data Pig, Hive & Jaql
Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,
Big Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
Reference Architecture, Requirements, Gaps, Roles
Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture
Lecture 10 - Functional programming: Hadoop and MapReduce
Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop
MapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues
Hadoop Big Data for Processing Data and Performing Workload
Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer
Workshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
BIG DATA TECHNOLOGY. Hadoop Ecosystem
BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big
Advanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
The Inside Scoop on Hadoop
The Inside Scoop on Hadoop Orion Gebremedhin National Solutions Director BI & Big Data, Neudesic LLC. VTSP Microsoft Corp. [email protected] [email protected] @OrionGM The Inside Scoop
Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam [email protected]
Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam [email protected] Agenda The rise of Big Data & Hadoop MySQL in the Big Data Lifecycle MySQL Solutions for Big Data Q&A
Introduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# So What is Data Science?" Doing Data Science" Data Preparation" Roles" This Lecture" What is Data Science?" Data Science aims to derive knowledge!
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012
Big Data Buzzwords From A to Z By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012 Big Data Buzzwords Big data is one of the, well, biggest trends in IT today, and it has spawned a whole new generation
Big Data Analysis and HADOOP
Big Data Analysis and HADOOP B.Jegatheswari and M.Muthulakshmi III year MCA AVC College of engineering, Mayiladuthurai. Email ID: [email protected] Mobile: 8220380693 Abstract: - Digital universe with
Peers Techno log ies Pv t. L td. HADOOP
Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and
