A Requirements Specification Framework for Big Data Collection and Capture


A Requirements Specification Framework for Big Data Collection and Capture

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Software Engineering

At the College of Computer and Information Sciences, Prince Sultan University

By Nouf A. Al-Najran

February 2015

2 For my parents

3 Acknowledgements All praises and thanks be to Allah for helping me in the completion of this thesis. My profound gratitude is expressed to my esteemed and kind supervisor, Prof. Dr. Ajantha Dahanayake. She has given liberally of her time and freely of her expert knowledge, to the great benefit of this research. Both her patience and critical interest at every stage have been the source of great encouragement. Particular thanks are appropriate to Dr. Areej Al-Wabil and Dr. Musaad bin Muqbil for their support, encouragements, and insightful suggestions. Also, I would like to thank my competent and skillful teachers who have brought me to the point I am at. Truly, I have been blessed to be taught by the very best. My greatest thanks and appreciation go to my family who taught me the value of hard work and dedication. Their belief in me never wavers, and they always lift my spirit with their constant love and support. I should not neglect to thank my dear husband for his understanding and unceasing support he provided throughout my study.

Abstract

This thesis introduces big data scenarios to the domain of data collection, because the ad hoc processes of data gathering currently used by most organizations are proving inadequate in a digital world that is expanding with a seemingly infinite amount of information. As a consequence, users are often unable to obtain specific, relevant information from large-scale data collections. Today's practice tends to collect bulks of data that most often: (1) contain large portions of useless data; and (2) lead to longer analysis time frames and thus a longer time to insights. As this has implications for real-time decision support, it is vital for businesses, organizations, and associations to implement better approaches for developing scenario-relevant data collection. The premise of this thesis is that big data analytics can only be successful when they are able to digest captured data and deliver valuable information. Therefore, this thesis develops a conceptual model for a well-defined, scenario-based big data collection process through a Requirements Specification Framework. The framework has been validated for its effectiveness in improving the process of data collection through an experiment that provides quantitative measures of the relevancy of the collected feeds. In the conducted experiment, the ad hoc process of data collection generates 8.5% relevant feeds and 91.5% irrelevant feeds, whereas the scenario-based data collection generates 92.5% relevant feeds and 7.5% irrelevant feeds. Hence, in a time of mass content creation, the Requirements Specification Framework contributes to: (1) the Requirements Engineering domain, through scenario-based big data collection; (2) collecting data according to scenarios of interest for the analysis supporting (real-time) decisions; (3) the reduction of unnecessary or garbage data collection, which is a huge problem for big data in terms of storage, transportation, and analytic time for (real-time) decision support. This research therefore contributes to a paradigm shift in big data collection.

Keywords: Big Data Scenarios, Information Filtering, Big Data Collection, Big Data Analytics, Scenario-based Data Collection, Software Requirements Engineering.

Research Abstract (translated from the Arabic)

This thesis introduces "big data scenarios" to the field of data collection, because the traditional process of data collection used by organizations has proven insufficient in a digital world that constantly multiplies and expands with the unbounded growth of information. As a result, users are often unable to obtain specific, relevant data within the wide range of available data collections. Instead, they tend to collect bulks of data that often: (1) contain large portions of useless data, and (2) lead to analyzing the data over longer time frames and therefore a longer time to obtain the desired insights. This yields irrelevant insights or unwanted information and thus negatively affects timely decision making. It is therefore essential for businesses, organizations, and companies to adopt better methods for reaching the data relevant to the specified scenario. This thesis rests on the premise that the success of big data analytics techniques depends on their ability to digest the captured data and deliver valuable information. Accordingly, it contributes a conceptual model for a big data collection process based on a well-defined scenario, through a specification framework. The framework has been validated for its effectiveness in improving the data collection process through an experiment, which provides quantitative measures of the relevancy of the collected data. In the conducted experiment, the unplanned (ad hoc) data collection process generates 8.5% relevant data and 91.5% irrelevant data, whereas scenario-based data collection generates 92.5% relevant data and 7.5% irrelevant data. Hence, in a time of mass content creation, this conceptual framework contributes to: (1) the Requirements Engineering domain, based on scenario-based big data collection; (2) collecting data according to the scenarios required for (timely) analysis to support decision making; (3) reducing the collection of unnecessary or garbage data, which is a huge problem for big data in terms of storage, transportation, and the analysis time needed to support decision making. This research therefore contributes mainly to a paradigm shift in big data collection.

Keywords: Big Data Scenarios, Information Filtering, Big Data Collection, Big Data Analytics, Scenario-based Data Collection, Software Requirements Engineering.

Table of Contents

Dedication
Acknowledgments
Abstract in English
Abstract in Arabic
List of Tables
List of Figures
List of Appendix Tables
List of Appendix Figures
List of Abbreviations

Chapter 1: Introduction
    Introduction
    Area of Research
    Motivation
    What is Big Data?
        Characteristics of Big Data
        General Problems of Big Data
    Data Collection
        Problems of Big Data Collection
    Usefulness of Big Data to Organizations
    Requirements Engineering for Big Data Capture and Collection
    Problem Statement
    Research Questions and Objectives
    Scope of the Thesis
    Outline of the Thesis

Chapter 2: Literature Review
    Introduction
    Related Works
        Relation of Software Engineering Applications to Big Data Collection
        Hadoop: The Big Data Management Framework
            Hadoop Distributed File System (HDFS)
            YARN (MapReduce 2.0)
        Innovative Big Data Collection Approaches
        Data Reduction Approaches
    Research Innovation
    Research Inspiration

Chapter 3: Research Methods
    Introduction
    Research Methods
        Research Design
        Research Philosophy
        Research Strategy
        Research Techniques and Analysis
    Research Instruments and Procedures
        Experiment Tools
    Research Method Structure

Chapter 4: Scenario-based Big Data Collection
    Introduction
    Relevancy of Requirements Engineering According to SDLC
    The Scenario-based Approach
    Backward Analysis
    Planning for Big Data Collection
        Determine Data Collection Needs
        Determine Big Data Capturing Techniques According to Data Types
        Determine Big Data Analytical Techniques
    Marketing Scenario Example

Chapter 5: Data Collection Requirements Modelling
    Introduction
    The W*H Conceptual Model for Services Using the Rhetorical and Zachman Frameworks
    The Requirements Specification Framework for Scenario-Based Data Collection
        Determining the Scenario-based Data Collection Factors
        The Characteristics of Big Data Scenarios

Chapter 6: Case Study
    Introduction
    US Presidential Elections Case Overview
        Description of the Needs
        Key Problem
        Application of the Specification Framework
    SA Ebola Scare Case Overview
        Description of the Needs
        Key Problem
        Application of the Specification Framework
    US Elections and SA Ebola Scare Case Analysis
    Traffic Management Automated System Case Overview
        Description of the Needs
        Key Problem
        Application of the Specification Framework
    Auto Traffic Management Case Analysis

Chapter 7: Experiment and Validation
    Introduction
    Experimental Design
    Experimental Work
        Data Collection Scenario
        Ad hoc Data Collection and Result Analysis
        Scenario-based Data Collection and Result Analysis

Chapter 8: Discussion
    Introduction
    Research Analysis
    The Conceptual Model for Scenario-based Data Collection
    Framework Evaluation Case Study
    Framework Validation Experiment

Chapter 9: Conclusion
    Introduction
    Research Conclusion
    Limitations
        Subjective Theme
        Limited Number of Cases
        Experiment and Validation: Limited Sample Size
        Lack of Generalizability
    Future Research Directions

References
Appendices
    Appendix A. The Unstructured Interview
    Appendix B. Other Hadoop Components
    Appendix C. Mapping the W*H Framework to Big Data Collection Domain
    Appendix D. Collecting Data in DiscoverText
    Appendix E. Related Published Papers

LIST OF TABLES

TABLE 2.1: SOFTWARE ENGINEERING APPLICATIONS AND BIG DATA COLLECTION
TABLE 2.2: STUDIES AROUND BIG DATA SCENARIOS AND BIG DATA COLLECTION
TABLE 4.1: BUSINESS DOMAIN AND CORRESPONDING SCENARIOS
TABLE 4.2: CATEGORIES OF BIG DATA ANALYZING TECHNIQUES AND APPLICABILITY
TABLE 6.1: FRAMEWORK APPLICATION ON US PRESIDENTIAL ELECTIONS SCENARIO
TABLE 6.2: FRAMEWORK APPLICATION ON THE EBOLA SCARE IN SA SCENARIO
TABLE 6.3: FRAMEWORK APPLICATION ON AUTO TRAFFIC MANAGEMENT SCENARIO
TABLE 7.1: KEYWORDS AND THEIR OCCURRENCES IN THE AD HOC DATA COLLECTION
TABLE 7.2: KEYWORDS AND THEIR OCCURRENCES IN THE SCENARIO-BASED DATA COLLECTION
TABLE 8.1: THE CONCEPTUAL MODEL FOR SCENARIO-BASED DATA COLLECTION

LIST OF FIGURES

FIGURE 1.1: THESIS STRUCTURE
FIGURE 2.1: HORTONWORKS DATA PLATFORM
FIGURE 3.1: RESEARCH METHOD
FIGURE 4.1: OVERLAPPING SCENARIOS IN A DOMAIN
FIGURE 4.2: POSSIBLE SCENARIOS IN THE BUSINESS AND MEDICAL DOMAINS
FIGURE 4.3: REQUIREMENTS ENGINEERING PHASE IN A BIG DATA SOFTWARE LIFE CYCLE
FIGURE 4.4: DETERMINING BIG DATA CAPTURING TECHNIQUES
FIGURE 5.1: THE REQUIREMENTS SPECIFICATION FRAMEWORK FOR SCENARIO-BASED BIG DATA COLLECTION
FIGURE 6.1: DATA FLOWING THROUGH FLUME CHANNEL
FIGURE 6.2: DATA FLOWING THROUGH SQOOP AND FLUME INTO HDFS
FIGURE 7.1: AD HOC DATA COLLECTION RESULT IN BAR CHART GRAPH
FIGURE 7.2: AD HOC DATA COLLECTION RESULT IN PIE CHART
FIGURE 7.3: A SNAPSHOT OF THE AD HOC DATA COLLECTION FEEDS (I)
FIGURE 7.4: A SNAPSHOT OF THE AD HOC DATA COLLECTION FEEDS (II)
FIGURE 7.5: SCENARIO-BASED DATA COLLECTION RESULT IN BAR CHART GRAPH
FIGURE 7.6: SCENARIO-BASED DATA COLLECTION RESULT IN PIE CHART
FIGURE 7.7: A SNAPSHOT OF THE SCENARIO-BASED DATA COLLECTION FEEDS (I)
FIGURE 7.8: A SNAPSHOT OF THE SCENARIO-BASED DATA COLLECTION FEEDS (II)
FIGURE 8.1: COMPARISON OF THE EXPERIMENTATION RESULTS
FIGURE 9.1: AD HOC PROCESS OF DATA COLLECTION (I)
FIGURE 9.2: IMPROVED PROCESS OF DATA COLLECTION (II)

LIST OF APPENDIX TABLES

TABLE 1B: HADOOP ECOSYSTEM COMPONENTS
TABLE 1C: MAPPING THE W*H MODEL QS TO SCENARIO-BASED DATA COLLECTION QS

LIST OF APPENDIX FIGURES

FIGURE 1C: THE W*H INQUIRY-BASED CONCEPTUAL MODEL FOR SERVICES
FIGURE 2C: THE REQUIREMENTS SPECIFICATION FRAMEWORK FOR SCENARIO-BASED BIG DATA COLLECTION
FIGURE 1D: DISCOVERTEXT DASHBOARD
FIGURE 2D: START A NEW PROJECT
FIGURE 3D: NAME YOUR PROJECT
FIGURE 4D: IMPORT DATA
FIGURE 5D: DATA SOURCES
FIGURE 6D: TWITTER FEED (I)
FIGURE 7D: TWITTER FEED (II)
FIGURE 8D: TWITTER FEED (III)
FIGURE 9D: ARCHIVE MANAGEMENT
FIGURE 10D: LIST OF TWEET FETCHES
FIGURE 11D: NOTIFICATIONS

LIST OF ABBREVIATIONS

SDLC: Software Development Life Cycle
RE: Requirements Engineering
NIC: National Information Center
HDFS: Hadoop Distributed File System
HDP: Hortonworks Data Platform
YARN: Yet Another Resource Negotiator
NoSQL: Not Only SQL
ETL: Extract, Transform, and Load
CF: Collaborative Filtering
NLP: Natural Language Processing
DOM: Document Object Module
MIDIS: Multi-Intelligence Data Integration Services
DTO: Data Transformation Operations
BGP: Basic Graph Pattern
BFS: Breadth First Search
SNAP: Stanford Network Analysis Platform
FIFO: First In First Out
MVC: Model View Controller
ICT: Information and Communication Technology
CDR: Call Detail Records
POS: Part Of Speech
EVD: Ebola Virus Disease
MOH: Ministry of Health
RFID: Radio Frequency Identification
ITS: Intelligent Transportation System
MOI: Ministry Of Interior

15 CHAPTER 1: INTRODUCTION We have more data than we have skills to turn it into useful knowledge Mark Rolston [1]

1.1 INTRODUCTION

Today, with the explosion of digital data growth in social media, marketing, healthcare, national security, weather forecasting, etc., most enterprises, organizations, and governments are unable to effectively filter and analyze those massive collections of data for timely and informed decision making [2]. This is because separating out the relevant and meaningful information, which can reveal hidden patterns, from the available universe of data is a non-trivial task [3]. Consequently, both the public and the private sector may not be able to cope with the velocity of data collection and fail to make use of those data for purposes such as real-time decision support [2]. Hence, it is important to move away from the ad hoc process of data collection and develop a better strategy for capturing the useful data that can leverage valuable information and insights in a timely manner. Just as these data volumes require different capturing techniques, they also need special analysis techniques that make the data meaningful, reduce data noise, and store only what is needed to answer the useful questions [4]. A structured and effective way to bring the right data to the right analytical technique is to derive a framework that collects data according to the requirements of the analysis needs: a model that helps users ask the right set of questions, defining the required outcome of data analytics prior to attempting to capture the data [5]. Therefore, this thesis provides a requirements specification framework, organized as a set of questions, that serves as a requirements engineering model for data collection and identifies the properties of the data analytics environment in order to yield a meaningful data collection process.

1.2 AREA OF RESEARCH

Due to the diversity and heterogeneity of data structures and formats found in various data sources, such as healthcare, national security, and weather forecasting systems, there is a strong need for a sound data collection approach in which people are aided in capturing and extracting useful knowledge from different data sources. Hence, there is a need for a user-friendly framework that provides the right questions for capturing useful information from all the available data. The framework introduced in this thesis is inspired by the work of [6], which follows the Zachman framework

as an inquiry system for information systems engineering, and by the Hermagoras of Temnos framework used in legal inquiry [7]. Hermagoras of Temnos was a Greek rhetor who established a classical rhetorical heuristic for identifying the crucial issue in a given case, based on a sequence of 6Ws + 1H (who, what, when, where, why, how, and by what means) [6]. Therefore, this thesis looks into several research fields from the point of view of software requirements engineering for big data capture and collection, such as big data analytical techniques, big data capturing techniques, search patterns, and information retrieval.

1.3 MOTIVATION

In the age of information overload, this thesis is motivated by the vision of ensuring access to the most valuable sources with the least resources. Recently, in addition to other data sources, much attention has been given to the production of user-generated content through what is known as Web 2.0 [8]. Decision makers often need to utilize the relevant structured, semi-structured, and unstructured data to drive their strategy [9]. Hence, two observations derived from studies have been a source of encouragement and motivation for conducting this research [10]:

- Studies have shown that more data does not necessarily mean more knowledge. In fact, a lot of data can be overwhelming and sometimes leads to the wrong decision.
- You can never work with all the available data. Running after every piece of data will consume a great deal of resources and lead to paralysis, so focus on the data that will best serve the scenario involved.

Therefore, reversing the process of data collection, by analyzing the required output (the business scenario) in order to determine the relevant input, looks at data capture from a different angle and contributes to science: it adds a new dimension of knowledge and technology for dealing with big data capture. Thus, this study emphasizes the demand for a well-defined mechanism that develops effective processes to extract the maximum value from the available data and brings decision makers closer to extracting value out of big data. The need to develop a value-added data collection framework that assists users in understanding what they require to know before attempting to collect the data is the main motivation of this research.
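To make the 6Ws + 1H inquiry more concrete, the following is a minimal, purely illustrative sketch (in Python) of how a scenario might be specified as answers to those questions and then used to filter incoming feeds. The field names, the marketing example, and the simple keyword-matching rule are assumptions made for illustration only; the actual question set of the Requirements Specification Framework is developed in Chapter 5.

```python
# Illustrative sketch only: a scenario specification expressed as answers to
# the 6Ws + 1H inquiry, and a trivial relevance filter derived from it. The
# field names and example values are assumptions, not the thesis framework.
from dataclasses import dataclass, field


@dataclass
class ScenarioSpec:
    who: str                 # stakeholder asking the question
    what: str                # phenomenon of interest
    when: str                # time window of interest
    where: str               # data sources / locations
    why: str                 # decision the analysis must support
    how: str                 # intended analytical technique
    by_what_means: list[str] = field(default_factory=list)  # capture keywords


def is_relevant(feed_text: str, spec: ScenarioSpec) -> bool:
    """Keep a feed only if it mentions at least one scenario keyword."""
    text = feed_text.lower()
    return any(keyword.lower() in text for keyword in spec.by_what_means)


# Hypothetical marketing scenario (cf. the marketing example in Chapter 4).
spec = ScenarioSpec(
    who="marketing analyst",
    what="customer sentiment about a product launch",
    when="launch week",
    where="Twitter feeds",
    why="adjust the launch campaign in (near) real time",
    how="sentiment analysis",
    by_what_means=["product launch", "#launch", "new release"],
)

feeds = ["Loving the new release so far!", "Traffic on King Fahd Road is heavy"]
print([f for f in feeds if is_relevant(f, spec)])  # keeps only the first feed
```

The point of the sketch is only that the scenario, not the data source, drives which feeds are kept; everything collected is already scoped by the answers to the inquiry questions.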

1.4 WHAT IS BIG DATA?

In the digital world, everyone is dealing with data in one way or another. People communicate through social networks and generate content such as blog posts, photos, and videos. Wireless sensors and RFID readers create signals, and servers continuously log messages about what they are doing. Scientists run experiments and create detailed measurements, and marketers record information about sales, suppliers, operations, customers, and so on. This rapid growth of data is the reason behind the evolution of big data [11]. According to the leading IT industry research group Gartner [12], big data is defined as: "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Yet there are two more equally important characteristics to consider, which are Veracity and Value [13].

1.4.1 CHARACTERISTICS OF BIG DATA

Big data is characterized by the following five elements:

- Volume: How big the data is growing. Many factors contribute to the increase in data volume: historical, transaction-based data stored through the years and the exploding stream of multi-structured data from social media and mobile devices, in addition to the increasing amounts of sensor and machine-to-machine data being collected, with new sources of data emerging every year. This rapid growth causes the digital universe to double in size every two years [14].
- Velocity: How fast the data is being generated. Data is streaming in at unimaginable speed. For example, according to emorpis Technologies [15], every minute 100 hours of video are uploaded to YouTube, over 200 million e-mails are sent, around 20 million photos are viewed and uploaded on Flickr, hundreds of thousands of tweets are sent, and almost 2.5 million queries on Google are performed.
- Variety: Data is not only structured or representable in rows and columns. The variation of data types today includes source, format, and structure [16]. Data today comes from different sources in all types of

formats. Structured data, unstructured text documents, e-mail, video, audio, stock ticker data, and financial transactions are some examples.
- Veracity: Trustworthiness, validity, and quality of the data. According to IBM [17], veracity refers to how much of the data can be trusted when key decisions need to be made on such large volumes collected at high rates of velocity and variety. Paul Miller [18] reported that a good process will, typically, make bad decisions if based upon bad data.
- Value: The success of big data drives businesses in terms of better and faster management decisions and financial performance [19].

1.4.2 GENERAL PROBLEMS OF BIG DATA

Big data can provide big success opportunities [11]. However, as with most emerging technologies, several characteristics are associated with big data problems that make them technically challenging. These general problems or challenges of big data can be grouped into three categories: data, process, and management [16].

Data Challenges
- Volume: How to deal with the sheer volumes of big data in terms of processing and storage?
- Velocity: How to respond to the flood of information in a real-time manner, or at least in the time required by the application?
- Variety: How to deal with the multiplicity of data sources, formats, and structures?
- Veracity: As this is a critical challenge, there are several problems associated with it [16]: How can you cope with the invalidity, untruths, missing values, or uncertainty of the data being analyzed? How broad is the coverage of the data available for analysis? How timely are the readings of the values?

These specific problems can be combined in one huge problem:

20 Chapter 1: Introduction How can you discover high-quality data from all the high volumes of data that are available out there? Process Challenges According to Laura Haas (IBM Research), process challenges include [18]: Collecting the data. This challenge is illustrated in more details later in this chapter, as this research is centered on addressing and enhancing the process of big data capture Integrating the data from multiple resources Transforming the data into a format that is feasible for analysis Modeling the data Visualizing the results of data analysis and sharing the output Management Challenges Many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data. So the data must be secured and access controlled as well as logged for audits Michael Blaha [20]. The main management challenges include [18]: Data privacy Security Governance Ethical The problems associated are: How to track and ensure that the data is used correctly? How are the data being used, transformed, and derived? And managing its lifecycle. These were the general problems of big data categorized in three dimensions: Data, process, and management problems. 1.5 DATA COLLECTION According to [21], the process of data collection is defined as: The process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes. 6 P a g e

21 Chapter 1: Introduction This research utilizes the generic definition of data collection to the collection of big data. However, as big data requires different analytical tools and techniques, it as well requires non-ad hoc capturing approaches and technologies [4]. Clearly, those huge volumes of continuously generated data are more than what conventional technologies can sustain. Hence, the lack of effective processes for information collection and management in organizations adapting to big data solutions, can result in a negative impact in the financial as well as reputation wise [4] PROBLEMS OF BIG DATA COLLECTION Acquiring the data that holds useful information from tremendous amounts of available data with the rapid increase of online information is a non-trivial task [3]. Collecting scenario-based relevant data from all the available information sources, poses several challenges [16]: Integrating multi-disciplinary methods aiming to locate useful data in the large volume, messy, often schema-less, and complex world of big data. Understanding big data analyzing techniques as well as big data capturing techniques to be able to select the right one for the scenario being processed. And consolidating the possible factors that can have a control over reducing the unwanted data. The ability to develop a simple yet comprehensible and powerful approach to guide and streamline the data collection process, based on the properties of the given scenario. Collecting big data requires experts with technical knowledge who can map the right data to the right analytical technique, and execute complex data queries. 1.6 USEFULNESS OF BIG DATA TO ORGANIZATIONS Today, with the expansion of the adoption of big data, organizations are using big data analytics to benefit their businesses [9]. They are taking advantage of the vast amounts of available information to enhance their process of decision making and performance. Following is an illustration of how some companies are taking advantage of big data [22]: 7 P a g e

22 Chapter 1: Introduction Amazon uses big data to build and power their recommender system that suggests products to people through their purchase history and clickstream data. Samsung Inc. uses big data on its new smart TVs to enhance their content recommendation engine, and thus, provide the customer with more accurate and user specific recommendations. Progressive Insurance Inc. relies on big data to decide on competitive pricing and capture customer driving behavior. LexisNexis Risk Solutions Inc. uses big data to help financial organizations and other clients detect and reduce fraud through identifying individuals, including family relationships. These are some examples of how organizations using big data to leverage their performance. 1.7 REQUIREMENTS ENGINEERING FOR BIG DATA CAPTURE AND COLLECTION The volume, velocity, variety and veracity of big data has grown tremendously in the past years due to the vast spread of software systems as well as the social behaviors [23]. Social behaviors refer to the communication of people through social media applications such as Facebook, Myspace, Twitter, Digg, YouTube, and Flickr to express their thoughts, voice, opinions, and connect to each other anytime and anywhere [24]. This exponential growth and activities around big data in social applications and IT software systems have intensified the need for a well-structured Software Engineering approach to identify the requirements of the big data collection process [25]. One important aspect of Software Engineering when it comes to big data, is related to the capture and collection of relevant data for software systems. Seeking to collect as much data as possible creates a significant software processing challenge for software data analysts [9]. Therefore, we need to invest in a Requirement Engineering approach that specifies the requirements and structure for gathering and collecting only the needful data according to scenarios, and discarding irrelevant and useless data. The research into a Specification Framework for Big Data Collection and capturing is therefore the Requirement Engineering phase for the big data collection process of a big data 8 P a g e

software solution [26]. The guiding questions in the framework constitute a structured process for system analysts to elicit the big data collection requirements in a more effective and user-friendly manner. This approach to Requirements Engineering is one of the main principles of the Software Development Life Cycle (SDLC) [27].

1.8 PROBLEM STATEMENT

Today, organizations and individuals use computers to solve complex problems, and this use, for business and many other purposes, contributes to generating huge volumes of digital data. These volumes of information are evolving at a great pace, making the process of retrieving relevant and valuable information for producing decisions very difficult [2]. In the past, excessive data volume was a storage issue, but with decreasing storage costs, organizations tend to acquire and store all the available data through data streaming, whether it matches their organizational needs or not [28]. This creates another issue: datasets become so huge that efficiency becomes a big challenge for current data analytical technologies [14]. This is unfortunate because analysts will consume a lot of time trying to find matching patterns in the data and may not be able to answer important questions in a timely manner. Organizations will be stuck with an ever-growing volume of data and may miss opportunities to act on critical business decisions [2]. Technology allows you to fetch every bit and byte, but not all of the data out there is relevant or useful. Organizations need to separate the meaningful information from the chatter and focus on what counts. Thus, the real issue behind big data value is not only the acquisition and storage of massive volumes of data; rather, it lies in the process of acquiring only what is suspected of being relevant for further analysis [16]. When the amount of data to be analyzed is reduced, managing its storage, merging, analysis, and governance across different varieties of data is expected to be simpler and more controllable [29]. The size of the web grew from 800 million pages in 1999 to 11.5 billion in 2005, and is probably more than 30 billion pages nowadays [30]. The amount of information continues to increase at an enormous rate. Therefore, it is imperative that businesses, organizations, and associations find better approaches for information filtering, which would effectively decrease the information overload and improve the precision of the results [29].
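The "precision of the results" referred to here can be made explicit. Assuming, as in the experiment reported in Chapter 7, that relevancy is judged per collected feed, the relevancy rate of a collection is presumably computed as

\[
\text{relevancy rate} \;=\; \frac{\lvert \text{relevant feeds} \rvert}{\lvert \text{collected feeds} \rvert} \times 100\%
\]

Under this measure, the ad hoc collection in the experiment yields 8.5% relevant feeds, while the scenario-based collection yields 92.5% (see the Abstract and Chapter 7).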

Big data analytics can only be effective when the underlying data collection processes are able to leverage the information relevant to a particular scenario [31], thus improving the usefulness of the analysis results. Therefore, a more powerful mechanism of data capture guidance is needed to avoid wasting time and resources analyzing irrelevant data.

1.9 RESEARCH QUESTIONS AND OBJECTIVES

The study examines the structure of the current process of data collection and its inadequacy for the huge world of digital data. It raises the significant question: How can we improve the ad hoc process of data collection that hinders the efficiency of extracting value from large datasets in a timely manner? This thesis endeavors to answer this research question through the introduction of a requirements specification framework which can play a significant and potentially profitable role in big data collection processes. The research objectives are:

a) To provide a problem-centric and user-centric approach that improves data collection in the big data domain over the ad hoc data collection process, which collects huge volumes of data, most of which is irrelevant to the particular business or organizational scenarios and is inefficient for creating value out of big data analytics. It is therefore necessary to have an approach that manages to collect only the data that is relevant to the scenario under investigation [32].

b) To examine how scenario-based data collection can leverage the usefulness for businesses and organizations of making better real-time decisions.

c) To define an approach for analysis-driven data collection based on business scenarios, by determining what output is needed in order to determine the relevant input (Backward Analysis).

d) To develop a framework that provides well-structured processes to locate the appropriate data and increase the precision of the results.

Therefore, the goal of this thesis is to define coherent processes for acquiring only the data relevant to the business question from all the available data. Thus, data analytics can be done in smaller time frames, allowing decisions to be made faster and with

higher precision. Improving the current data capturing process, so that accurate and useful conclusions can be drawn from it, will contribute to changing the way people collect data and therefore to transforming decision making in a way that gives business the required advantage.

1.10 SCOPE OF THE THESIS

As with most technologies, extracting value from the available universe of information has a core body of processing stages. In terms of big data, these stages are: data collection, processing, storage, and performing analytics [11]. Logically, the data input into later stages of processing will be affected by the amount of data acquired and how relevant it is to the scenario being investigated. Therefore, the focus of this research is directed to the primary phase of a big data solution. This phase is inspired by, and is similar to, the primary phase of a Software Development Life Cycle (SDLC), which is Requirements Engineering (RE). In an SDLC process, RE is used to collect the requirements of a software system from the stakeholders [27]. This research follows RE in collecting the requirements of a data collection process and provides a requirements specification framework that improves the current ad hoc process of data collection. Digging into big data storage capacities, processing facilities, and different big data mining and analytical techniques is beyond the scope of this research.

1.11 OUTLINE OF THE THESIS

Apart from this introduction, the rest of the thesis is structured in eight chapters, as outlined in Figure 1.1: Literature Review, Methodology, Scenario-based Data Collection, Data Collection Requirements Modelling, Case Study, Experiment and Validation, Discussion, and Conclusion. Chapter 2 contains the literature review. It discusses the related works, including some available data reduction approaches, highlighting the innovativeness of this research. Additionally, an overview of the work that has been the main source of inspiration is presented. Chapter 3 contains the research methodology. It provides a demonstration of the methodology adapted to conduct this research. Chapter 4 introduces important research concepts and provides a mechanism for planning the data collection process. The framework developed and proposed as the

core of this research is presented in Chapter 5, along with supporting materials. Chapter 6 provides an application of the framework to three case studies covering the three big data formats. The framework is validated through an experiment that proves its effectiveness in Chapter 7. Afterwards, the research analysis, a discussion of the framework, its validation, and the conceptual model are provided in Chapter 8. Finally, Chapter 9 contains the conclusion, the limitations of this research, and further research directions.

FIGURE 1.1: THESIS STRUCTURE

This chapter presented an introduction and overview of this thesis. It provided a brief glimpse of the research areas and the motivation behind this study, and it introduced the phenomenon of big data and its general associated problems. It identified the process of data collection in relation to big data, and the challenges of big data collection. Moreover, a view of the uses of big data in some organizations was presented, big data capture and collection was discussed from the angle of Software Engineering, and the problem statement was illustrated along with the research question, objectives, and scope.

27 CHAPTER 2: LITERATURE REVIEW There is no data on the future Laurel Cutler [33]

2.1 INTRODUCTION

Software Engineering and its applications through information technology are a subject of intense discussion around the globe, and a large number of scientific studies have been published on this discipline over the Web [34]. Nothing seems to stand still in this area, because as soon as one work is developed, another comes out to supplant the previous one. In terms of big data collection, much research is conducted in this field, but there is no clear and sufficient information on how to determine relevancy within structured, semi-structured, and unstructured data in all the available universe of information [11]. In this chapter, an overview of the related work is presented, highlighting the value and worthiness of this research and how it differs from other contributions. It states the research innovation and the main source of inspiration in conducting the core research of this thesis.

2.2 RELATED WORKS

2.2.1 RELATION OF SOFTWARE ENGINEERING APPLICATIONS TO BIG DATA COLLECTION

Application: Software Requirements Engineering
Description: Organizations will not get the software they need if the software requirements were not right from the very beginning [35]. The hardest part of building a software system is deciding precisely what to build, which illustrates why Requirements Engineering is so important [35].
Relation to Big Data Collection: The proposed Requirements Specification Framework for scenario-based big data collection contributes to Requirements Engineering, as it provides a structured set of questions to assist users in identifying the requirements of the data collection process (in the big data domain) based on the scenario of interest, and therefore in collecting the right data.

Application: Reverse Engineering
Description: Reverse Engineering can be applied to re-specify a system for reimplementation [37]. The system's specifications may be reverse engineered and provided as an input to the requirements specification process for system replacement. In reengineering, the system may be restructured and re-documented without changing its functionality, in order to support frequent maintenance [37].
Relation to Big Data Collection: The proposed Requirements Specification Framework for scenario-based big data collection contributes to Reverse Engineering through Backward Analysis in optimizing the process of data collection. It provides means for analyzing the properties of the scenario (business problem) of interest and determining the relevant elements which, when collected, will probably reveal hidden patterns, prior to the actual data collection process.

Application: Software Process Improvement
Description: Software, in its development process, requires continuous improvement in order to ensure quality products [38]. In a competitive industry, companies tend to hire professionals with multiple skills, implement new technologies, and adopt new methods, standards, and techniques to improve their processes.
Relation to Big Data Collection: In big data applications, domain experts make use of all the available data to make informed decisions and leverage their business strategy [38]. The proposed Requirements Specification Framework for scenario-based big data collection contributes to Process Improvement as it improves the process of decision making. Applying the proposed framework generates datasets that are relevant to the scenario of interest, which require less processing and analysis time, and therefore less time to insights (real-time decision support).

TABLE 2.1. SOFTWARE ENGINEERING APPLICATIONS AND BIG DATA COLLECTION

2.2.2 HADOOP: THE BIG DATA MANAGEMENT FRAMEWORK

This section provides an overview of Apache Hadoop as a big data processing framework, some of whose components will be revisited in later chapters (see Appendix B for more information on Hadoop's core components). The aspects are explained here in a highly simplified manner; a detailed description of them can be found in [39-50].

APACHE HADOOP

Hadoop is the name that creator Doug Cutting's son gave to his stuffed toy elephant. He was looking for something that was easy to say and that stands for nothing in particular [39].

30 Chapter 2: Literature Review Hadoop provides a distributed file system (HDFS) and a framework for the capturing, processing and transformation of very large data sets using the MapReduce [42] paradigm. The important characteristic of Hadoop is the partitioning of data and computation across many (thousands) of hosts, and executing application computations in parallel close to their data. Hadoop is an Apache project; all components are available via the Apache open source license. Yahoo! has developed and contributed to 80% of the core of Hadoop [44]. Although Hadoop is best known for MapReduce and HDFS, the term is also used for a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing [39]. A brief explanation of the core components for Hadoop ecosystem: HDFS (storage), and MapReduce 2.0 or YARN (resource managing and data processing) will be provided. The other components will be summarized in a table at the end of this section. The use of components will be depending on Hortonworks Data Platform (HDP) [45] as an open source distribution powered by Apache Hadoop. HDP provides actual Apache-released versions of the components with all necessary bug fixes to make all the interoperable needs in the production environment (see Figure 2.1). FIGURE 2.1. HORTONWORKS DATA PLATFORM [45] Hadoop Distributed File System (HDFS) HDFS is the file system component of Hadoop designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware [39]. HDFS stores file systems metadata and application data separately. As in other distributed file systems, such as, 16 P a g e

31 Chapter 2: Literature Review PVFS [46], Lustre [47] and GFS [48], HDFS stores metadata on a dedicated server, called the NameNode. Application data are stored on other servers called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols [49] YARN (MapReduce 2.0) MapReduce was created by Google mainly to process big volume of unstructured data. MapReduce is a general execution engine that is ignorant of storage layouts and data schemas. The runtime system automatically parallelizes computations across a large cluster of machines, handles failures and manages disk and network efficiency. The user only needs to provide a map function and a reduce function. The map function is applied to all input rows of the dataset and produces an intermediate output that is aggregated by the reduce function later to produce the final result [50]. In 2010, a group at Yahoo! began to design the next generation of MapReduce. The result was YARN, which meets the scalability shortcomings of classic MapReduce. YARN is more general than MapReduce, and in fact MapReduce is just one type of YARN application. The beauty of YARN s design is that different YARN applications can coexist on the same cluster, so a MapReduce application can run at the same time as an MPI (Message Passing Interface) application [39]. It performs the resource management function in Hadoop 2.0 and extends MapReduce capabilities by supporting non-mapreduce workloads associated with other programming models [45]. Which brings great benefits for manageability and cluster utilization [39] INNOVATIVE BIG DATA COLLECTION APPROACHES In this section, several related works and innovative approaches for data collection and integration are visited. It aims to provide a knowledge in big data challenges, current data collection state of art, and how this research contributes to a shift in the domain of big data collection and analytics. In [51], the authors have emphasized that data analysis based on spatial and temporal relationships yields new knowledge discovery in multi-database environments. They have developed a novel approach to data analysis by 17 P a g e

32 Chapter 2: Literature Review turning topsy-turvy the analysis task. This approach provides that the analysis task drives the features of data collectors. These collectors are small databases which collect the data of interest. To illustrate their idea, they have surveyed the processes and tools used to analyze traffic behavior of passengers in the Tokyo metropolitan railway environment. Moreover, they presented the data integration method for heterogeneous legacy databases by combining equality, similarity, topological relationships, directional relationships and distance relationships for spatial and temporal data. C. Anne and B. Boury in [52], proposed a framework facilitating the integration of heterogeneous unstructured and structured data, enabling Hard/Soft fusion and preparing for various analytics exploitation. It as well provides timely and relevant information to the analyst through intuitive search and discovery mechanisms. The authors described the design and implementation of a prototype for scalable MIDIS, based on a flexible data integration approach, making use of Semantic Web and Big Data technologies. In [53], the white paper published by Intel walked through the challenge of extracting big data from multiple sources. It has explained how Hadoop infrastructure can contribute to the process of big data ETL. It illustrates the process of loading different data formats from multiple data sources into Hadoop s warehouse in a technical point of view. However, they did not touch the idea of reducing useless data capture nor producing real-time management decisions. IBM in [54] provides a means of classifying big data business problems according to a specified criteria. They have provided a pattern-based approach to facilitate the task of defining an overall big data architecture. Their idea of classifying data in order to map each problem with its suitable solution pattern provides an understanding of how a structured classification approach can lead to an analysis of the need and a clear vision of what needs to be captured. Moreover, IBM has presented several real-life samples of big data case studies in [55]. From the two previous contributions of IBM in the field of big data, the idea of scenario analysis for a structured approach to big data collection has emerged. 18 P a g e
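As an aside to these related works, the MapReduce programming model summarized in Section 2.2.2, which underlies Hadoop and several of the platforms surveyed here, boils down to two user-supplied functions. The following is a minimal, self-contained sketch of that map/reduce contract in plain Python; it uses no Hadoop or YARN APIs and is provided for illustration only.

```python
# Minimal illustration of the map/reduce contract (no Hadoop/YARN involved):
# the user supplies a map function (applied to every input record) and a
# reduce function (applied to all intermediate values sharing a key).
from collections import defaultdict


def map_fn(record: str):
    """Emit (word, 1) pairs for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def reduce_fn(key: str, values: list[int]) -> tuple[str, int]:
    """Aggregate all counts emitted for one word."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle phase: group intermediate values by key. In a real Hadoop job
    # the framework does this in parallel across many machines.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())


lines = ["big data needs big ideas", "relevant data beats more data"]
print(run_mapreduce(lines, map_fn, reduce_fn))
# {'big': 2, 'data': 3, 'needs': 1, 'ideas': 1, 'relevant': 1, 'beats': 1, 'more': 1}
```

The sketch only mirrors the division of labor described above: the runtime handles partitioning, grouping, and fault tolerance, while the analyst writes the two functions.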

33 Chapter 2: Literature Review The authors in [56], have studied different big data types and problems. They developed a conceptual framework that classifies big data problems according to the format of the data that must be processed. It maps the big data types with the appropriate combinations of data processing components. These components are the processing and analytic tools in order to generate useful patterns from this type of data. In [32], Nakanishi emphasizes that most current data analytics and data mining methods are insufficient for the big data environment. Therefore, they have designed and proposed a model that creates axes for correlation measurement on big data analytics. This model maps the Bayesian network to measure correlation mutually in the coordination axes. It contributes to a shift in the domain of big data analytics DATA REDUCTION APPROACHES There are several approaches and technologies discussed that may possibly lead to have a control on limiting or reducing unwanted data. Some of which are: Visualization and manual Data Collection [57]. However, several challenges emerged as a result of this process. These include the possibility for correct misses/false alarms and errors in categorizing the data and can be very time consuming. Machine Learning and Data Mining techniques [58]. However, data mining can only be applied to structured data that can be stored in a relational database. Collaborative Filtering (CF) is a common web technique for providing personalized recommendations, such as the ones generated by Amazon (based on the user profile and transaction history). In spite of the technique s effectiveness, it rises privacy issues as some customers don t prefer to have their preferences or habits widely known, along with other associated challenges such as data sparsity, scalability, and synonymy [59]. Contextual Approach uses semantic technologies such as an NLP, annotation, and classification to handle information integration (depending on the context of the web page at that moment in time) and 19 P a g e

34 Chapter 2: Literature Review querying of distributed data. For query representation, SPARQL language is specifically designed for the semantic technology and enables constructing sophisticated queries to search for different types of data [60]. This approach is efficient in terms of its high precision in controlling unwanted data, as it takes into account the important factors such as keywords, synonyms and antonyms. However, it requires a different infrastructure and highly skilled experts to deal with the complicated technology. More studies on big data scenarios and big data collection approaches are described in Table 2.2. TABLE 2.2. STUDIES AROUND BIG DATA SCENARIOS AND BIG DATA COLLECTION Authors W. C. Wesley, B. J. David and K. Hooshang [61] V. Sitalakshmi and K. Sadhana [62] R. Sanjay [63] Study Description This research describes the ineffectiveness of general queries in addressing scenariospecific information gathering. It calls for a scenario-based approach for information retrieval. Addresses the challenge of retrieving text, barcodes and images (unstructured data) that is relevant, pertinent and novel. An approach aims to acquire and store unstructured data through utilizing Hadoop components Tools, Languages, Approach scenario-based proxies, contextsensitive navigation and matching, content correlation of documents and user models Intelligent Image Retrieval components. Intelligent Information Retrieval components. Recommender component Hadoop components Findings Propose a medical digital library that supports scenario-specific and user-tailored information retrieval. The development of a recommender system framework that combines data relevance from multiple sources. The framework has been evaluated and proved high effectiveness. The development of a big data management system that includes data acquiring, organization and analysis. Z. Z. A. Siti, M.D. Noorazida and H. H. Azizul [64] This research contributes to the approach of classifying and capturing unstructured web data and the efficiency of Document Object Module (DOM) tree for classification process, XML for data transmission from web The development of an interface that allows people to extract meaningful multimedia data. The tool will extract useful 20 P a g e

35 Chapter 2: Literature Review A. K. Craig and S. Pedro [65] H. Olaf, B. Christian and C. F. Johann [66] S. A. Catanese, P. De Meo, E. Ferrara, G. Fiumara and A. Provetti [67] multimedia database in storing this sort of data. Described an approach to building and executing integration and restructuring plans to support analysis and visualization tools on very large and diverse datasets Introduced an approach to discover data that might be relevant for answering a query during the query execution itself Developed a set of tools to analyze specific properties of socialnetwork graphs, i.e., among others, degree distribution, centrality measures, scaling laws and distribution of friendship into multimedia database Built a comprehensive set of Data Transformation Operations (DTO) including structured information and semistructured data SPARQL query language in the context of Basic Graph Pattern (BGP) matching over a fixed set of RDF graphs Breadth-first-search (BFS) using FIFO queue information from the specified URL. The proposed approach will enable developers to rapidly and correctly prepare data for analysis and visualization tools and link the output of one tool to the input of the next, all within the big data environment The more links exist, the more complete results can be expected because more relevant data might be discovered The analysis of collected datasets has been conducted exploiting the functionalities of the Stanford Network Analysis Platform library (SNAP), which provides general purpose network analysis functions T. Cao and Q. Nguyen [68] The authors proposed a semantic approach for searching tourist information and generating travel itinerary. An ontological model for representation of tourist resources as well as traveler s profile An algorithm for generating travel itinerary that will combine semantic matching with ant colony optimization technique An experiment was conducted to show that the proposed algorithm generates travel itinerary relevant to both criteria of itinerary length and user interest B. Thalheim And Y. Kiyoki [51] Developed a novel approach to realize dynamic data integration and analysis among heterogeneous databases model-view-controller (MVC) Data analysis based on spatial and temporal relationships leads to new knowledge discovery in multidatabase environments 21 P a g e

36 Chapter 2: Literature Review Anne- Claire and Boury- Brisset [52] This research makes use of big data technologies, ontological models and semantic-based analysis to address the challenge of transforming the overwhelming amounts of sensed data into useful, actionable intelligence in a timely manner. R&D intelligence data integration platform MIDIS Presented the ongoing work that they are conducting for the development of a scalable and flexible platform through experimenting with recent big data technologies. As one of the most important elements for successful data collection is being able to deliver relevant information and services in real-time, a simple and comprehensible approach that relies on the properties of the need is essential to control the data collection process. 2.3 RESEARCH INNOVATION Indeed, big data scenarios play a vital role in the process of collecting the relevant data. Much research was conducted around big data scenarios and around data collection [55], [52]. However, there was no clear and sufficient information that links the two fields together. Therefore, the innovativeness of this research lies in the development of a scenario-based big data collection framework that performs as the Requirements Engineering phase in a software life cycle. The framework links the two or more aspects together to provide a well-defined approach for identifying the properties of the scenario context in which the data collection process will take place. This research studies the requirements specification of the big data collection process and makes it more tailored to the business needs, in order to decreases the analysis time and increases the value of the results by making faster management decisions. 2.4 RESEARCH INSPIRATION The main inspiration of this thesis comes from the W*H Conceptual Model for Services [6]. The authors in their research have studied the concept of services as a design artifact. They have aimed to merge the gap between main service design initiatives and their abstraction level interpretations. In order to address their research goal, the authors have developed an inquiry- 22 P a g e

37 Chapter 2: Literature Review based conceptual model for service systems designing. This model formulates the right questions that specify service systems innovation, design and development. This chapter provided an overview of some Software Engineering applications as well as the big data framework, Hadoop. It visited multiple related works in the arena of big data collection. In this matter, it highlights the uniqueness of this thesis contribution in providing an approach that rests big data collection on the analysis of the big data scenario being addressed. In this chapter, the main inspiration in the development of the core part of this thesis has been as well presented. 23 P a g e

38 CHAPTER 3: RESEARCH METHODS Data is the new oil. We need to find it, extract it, refine it, distribute it and monetize it David Buckingham [69]

3.1 INTRODUCTION

In this chapter, the research method is outlined. The research design, philosophy, and strategy are presented. The techniques and data analysis methods used in analyzing the data and providing results are illustrated as well. Moreover, the instruments and procedures used to conduct the experimental work are also discussed.

3.2 RESEARCH METHODS

3.2.1 RESEARCH DESIGN

This research is designed to develop a Requirements Specification Framework for scenario-based big data collection, based on conceptual modeling following the principles of design science [70]. The main finding of this research is a requirements specification framework, the outcome of a comprehensive analysis and categorization of subjective information. The research relies as well on quantitative analysis in order to validate the proposed framework. Thus, a mixed research design combining qualitative and quantitative methods of investigation is followed.

3.2.2 RESEARCH PHILOSOPHY

This research is associated with an interpretive philosophy [71], because it needs to make sense of the subjective and socially constructed meanings expressed about the concepts under study [71]. It commences with an inductive approach, where the data gathered is analyzed and used to develop a richer theoretical perspective than what already exists in the literature.

3.2.3 RESEARCH STRATEGY

The research is exploratory in nature: it explores the subject and allows for the development of knowledge. It is also cross-sectional in nature. It supports gathering a more in-depth contextual understanding of the proposed framework in order to address the research question and meet the objectives. Real-life case studies are carried out to evaluate and examine the proposed framework and its applicability in fulfilling its purpose.

An experiment is conducted in order to evaluate and validate the effectiveness of the framework in providing relevant data according to the given scenario, compared to an ad hoc process of data collection. The statistical results of the experiment shall establish the validity of the research in its answer to the research question.

RESEARCH TECHNIQUES AND DATA ANALYSIS
This is mixed methods research. It uses a variety of data collection techniques and analytical procedures to develop and validate the framework. In order to maximize the validity and trustworthiness of the findings, the research uses a hybrid access type to gather a richer set of data about the related works. The hybrid access data collection method refers to collecting the data and materials through different access types, such as traditional access and internet access [71]. The primary source of data collection is literature exploration through in-depth internet access, going through various relevant publications and white papers. Supporting data is collected through traditional access and conversations with interested participants in local as well as international conferences. In addition, observations of several companies and meetings with experts such as Mr. Joseph Kambourakis¹ through unstructured verbal interviews took place in March 2014 (see Appendix A for a brief description of the unstructured interview). The choice of companies was determined by ease of accessibility, reputation, and level of involvement in this field.

3.3 RESEARCH INSTRUMENTS AND PROCEDURES

This research attempts to provide a requirements specification framework for scenario-based data collection. In order for this framework to be validated for effectiveness, some tools need to be available to aid in the framework evaluation process.

¹ Mr. Joseph Kambourakis, EMC data scientist (EMC are the initials of the corporation's founders, where E stands for Egan, M stands for Marino, and C for Corporation).

EXPERIMENT TOOLS

DiscoverText
DiscoverText is a powerful and reliable web-based software application launched by Texifter. It enables collecting text from social media and a variety of other sources. In addition to data collection, the software is designed to improve standard research, government, and business processes. It provides collaborative text analytics solutions tailored to the user's specific needs [72]. With DiscoverText it is possible to ingest hundreds of thousands of items from social media and electronic document repositories. This advanced social search, leveraging metadata, networks, credentials, and filters, will change the way users interact with text data over time [72]. It helps organizations aggregate customer feedback from many public and private sources and generate key insights for better business processes. DiscoverText has many other text analysis and storage features to sift and sort textual data. However, the experiment in this research is interested only in the data collection feature of this tool and will not go through its other text mining capabilities.

The reasons behind the selection of DiscoverText for performing the validation are:
- It generates reliable and accurate results in terms of data analysis, pattern matching, and consistency [72].
- Among other information retrieval and text analytics tools, DiscoverText provides a simple and user-friendly interface that does not require intensive training or technical expertise.
- Re-inventing the wheel and implementing a module to import and aggregate data would require vast time and effort. This effort lies in [73]:
  1. Understanding and applying specialized programming languages, such as Python, JSON, R, etc.
  2. Integrating with different social media infrastructures such as Twitter, Facebook, and Path, and handling security issues such as acquiring authentication tokens to fetch feeds.
  The development of such a tool is not within the scope of this thesis.

Therefore, this tool is suitable for performing the experiment on an ad hoc data collection process and a planned-for data collection process [72]. The experiment is intended to validate the requirements specification framework for effectiveness and to answer the research question.

3.4 RESEARCH METHOD STRUCTURE

This is exploratory research, as outlined earlier in this chapter, where vast amounts of information are collected to dig into and explore the process of data collection in relation to big data. After gathering all the required data and analyzing and exploring existing theories and related approaches, a conceptual model of scenario-based data collection is developed in the form of a requirements specification framework. This framework consists of a specialized set of questions that should be asked to limit the data collection process to the data that is suspected of being relevant for further analysis. Relevancy of the data is determined by asking the framework questions to restrict the data collection to only the data that is related to the scenario of interest. Data that does not influence the result of performing analytics will be avoided. Afterwards, three real-life case studies are presented as an application of the framework and are used to validate the framework for effectiveness through an experiment. The experiment's data are analyzed and the results are presented. Figure 3.1 shows the structure of the research method.

FIGURE 3.1 RESEARCH METHOD (Exploratory Research: Collect Information (research papers, applications collecting big data), Analyze Collected Information, Devise a Framework (conceptual model), Case Study, Validation, Discussion)

This chapter has presented an overview of the research methods, in terms of the research design, philosophy, strategy, techniques and data analysis, and the instruments and procedures followed to conduct this research.

CHAPTER 4: SCENARIO-BASED DATA COLLECTION

"We are looking for that handful of data that is meaningful for what we are doing."
Roser [74]

4.1 INTRODUCTION

In this chapter the basic concepts are defined, and the relation of planning for Requirements Engineering in an SDLC to this research is provided. The scenario-based approach, along with illustrative examples based on defined criteria, is presented. Moreover, the concept of backward analysis as a form of Reverse Engineering, and its impact in relation to data collection, is explained. Afterwards, the importance of planning for a structured data collection process and how to determine the data collection and analytics techniques that are most appropriate for your big data environment is provided.

4.2 RELEVANCY OF REQUIREMENTS ENGINEERING ACCORDING TO SDLC

A domain is a sphere of knowledge, influence, or activity [75]. The domain includes distinct or interrelated scenarios for a broad range of connected capabilities, as shown in Figure 4.1. These domains are enabled or controlled by domain experts and require some degree of resilience and security in transactions and operations. Information and Communication Technology (ICT) cuts across all domains [76].

FIGURE 4.1: OVERLAPPING SCENARIOS IN A DOMAIN (Domain X containing overlapping Scenarios A, B, and C)

A scenario is an account or synopsis of a possible course of action or events [77]. In Scenario-based Requirements Engineering, several interpretations of scenarios have been proposed. These interpretations include behaviors derived from use cases [78], narratives for requirements elicitation and validation [79], [80], and descriptions of system usages [81]. In Software Engineering, scenarios often describe information at the instance level [82]. They may be used to validate requirements in order to check the operation of a new system through test data. Moreover, scenarios may be used as pathways to specify system usage by being represented as

animations and simulations of the new system. This allows validation by examination of the behavior of the future system [82]. In relation to data collection, the Australian government in 2008 defined the scenario as a brief narrative, or story, used to describe the hypothetical use of one or more systems to capture relevant information [83]. Therefore, a scenario s is a sub-category or a special case of the domain D. This implies that a domain is a generalized area that contains several possible scenarios, as shown in (1).

s₁, s₂, s₃, ..., sₙ ∈ D    (1)

THE SCENARIO-BASED APPROACH
According to IBM [54], in a scenario-based approach, each scenario starts with a business problem and describes why a big data solution is required. For instance, in a hospital practice management scenario in the Medical domain, systems need to manage the business of the general practice. Relevant information to address this scenario would be recording patient details, managing bookings and attendances, and managing financial transactions. On the other hand, clinical management systems form a scenario in the same domain; however, the relevant information would be data about the patient's health issues, current and past problems, conditions, and treatments, including medications and diagnostic tests. Figure 4.2 below presents the possible scenarios in the Medical and Business domains, based on IBM's classification of big data business problems [54].

FIGURE 4.2: POSSIBLE SCENARIOS IN THE BUSINESS AND MEDICAL DOMAINS [54] (Business domain: Marketing, Retail, Telecommunications, Customer Service; Medical domain: Hospital Information, Clinical Management, Patient Monitoring, Medical Imaging)

Table 4.1 describes in detail the scenarios presented in Figure 4.2 that belong to the Business domain [54], [56].

TABLE 4.1. BUSINESS DOMAIN AND CORRESPONDING SCENARIOS

Domain: Business
- Context: Telecommunications. Example scenario: Telecommunications operators need to build detailed customer churn models that include social media and transaction data, such as Call Detail Records (CDRs), to keep up with the competition.
- Context: Marketing. Example scenario: Marketing departments need to determine what users are saying about the company and its products or services, especially after a new product or release is launched.
- Context: Customer service. Example scenario: IT departments want to gain insight that can improve system performance.
- Context: Retail. Example scenario: Retailers need to make personalized offers to customers based on buying behavior and location.

BACKWARD ANALYSIS
Backward Analysis refers to performing one of the main applications of Software Engineering, which is Reverse Engineering [37]. It refers to looking at the properties of the solution that is needed (the output) in order to figure out the appropriate input [84]. According to [84], backward analysis is the process of defining the properties of the input, given or based on the context and properties of the output. This concept is utilized in optimizing the process of data collection. Analyzing the properties of the scenario at hand and determining the relevant elements that, when collected, will probably reveal hidden value should be done prior to the data collection process. Comprehensive backward analysis will eliminate the chance of being overwhelmed by bulks of irrelevant

data. This will help users and businesses generate fast management decisions and answer mission-critical questions. Collecting data based on a prior analysis of the needs of a particular business scenario eliminates the presence of unrelated data. Therefore, the effectiveness of the final insights derived from the analytics depends more on the quality than on the quantity of the data that forms the foundation for the analytic techniques [11].

4.3 PLANNING FOR BIG DATA COLLECTION

Big data collection is the basic building block of big data analytics, as it is equivalent to the Requirements Elicitation phase in the SDLC [26]. Planning for this phase can be the most time-consuming part of data analytics; thus, if it is not done wisely, a great deal of resources will be wasted in the attempt to make quick and informed decisions [85]. This means that we need to have a balance between the data that needs to be collected and the analysis technique associated with it. Therefore, making plans for a scenario-based big data collection process requires following a structured and systematic approach. This is to ensure that the questions asked yield the right analysis technique and that the selected analysis technique will take the right data for analysis. Figure 4.3 presents the mapping of the primary phase, which must exist on top of the data collection process, to Requirements Engineering in an SDLC.

FIGURE 4.3: REQUIREMENTS ENGINEERING PHASE IN A BIG DATA SOFTWARE LIFE CYCLE

DETERMINE DATA COLLECTION NEEDS
As stated earlier, the focus of this thesis is on an efficient process of big data collection. This section presents a generic, high-level set of questions provided by [85] as a data collection plan. Following is a presentation of these questions, answered in relation to big data scenario needs (a small sketch that writes such a plan down in code follows this list):

What is the reason for data collection? Identify the motivation and the intended outcomes of the data collection. For example, statisticians want to conduct a polling study. They collect tweets to determine the opinion polls and forecast whether Obama or Romney will win the elections.

What data need to be collected? Determine what data should be collected in order to address the scenario needs. Thinking about the data that will account for the factors in the scenario will eliminate the chance of being unintentionally overwhelmed with irrelevant data. Try to reduce the number of factors that will affect the outcome of the analysis. For example, consider the above example about the US presidential elections. At first glance, the following factors could play a role in this forecasting scenario: anti-Obama polls, anti-Romney polls, the entire population of the U.S., the extent of racism in the nation, the demographics of anti-Obama supporters, previous Democratic and Republican presidents, the candidates' doctrines, and more. However, taking a closer look, and using creative thinking and knowledge of the scenario, these factors can be narrowed down to: the population of the U.S. above 17 years old, and anti-Obama polls. Data that can provide information about those factors could include Twitter data and the census data the government routinely collects. You now know exactly what type of data is adequate to collect.

What is the best source of data? For the particular scenario, determine where to find the source of information that will most likely address the scenario at hand. These data sources may be social media data such as Twitter, Facebook, Flickr,

etc., or another source of Web information such as blogs, forums, articles, etc.

How much data is enough? Identifying what constitutes enough data depends on the type of analytics involved. When doing statistical analysis, the bigger the sample, the less likely your analysis will suffer from bias. But when performing predictive analytics, more data does not always mean better results. A good rule of thumb is that you should collect the minimum amount of data that will produce unbiased and accurate results. This will reduce the time and resources allocated for collecting and analyzing more data when having more will not make much difference.

When to collect the data? Attempting to collect the data at an arbitrary time may result in retrieving relevant but useless data. For example, when planning a predictive analysis of the Obama versus Romney elections, the best time to collect tweets is the week prior to election day.

These are the general questions that lead to determining the data collection needs as a data collection plan. These questions can be further refined into more scenario-specific questions for determining the scenario-based data collection needs.
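To make the plan above concrete, the answers can be written down before any data is fetched. The following is a minimal, hypothetical Python sketch for the election polling example; the dictionary keys, the values, and the simple relevance check are illustrative assumptions and are not artifacts of the thesis.

```python
# Hypothetical encoding of the data collection plan for the election example.
# All keys, values, and the relevance rule below are illustrative assumptions.
collection_plan = {
    "reason":   "Forecast whether Obama or Romney will win the elections",
    "data":     ["tweets mentioning either candidate",
                 "census data for the population above 17 years old"],
    "sources":  ["Twitter", "government census records"],
    "volume":   "the minimum sample that still gives unbiased results",
    "timing":   "the week prior to election day",
    "keywords": ["obama", "romney", "democrat", "republican"],
}

def in_scope(text: str, plan: dict) -> bool:
    """Coarse relevance gate: keep only items that mention a planned keyword."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in plan["keywords"])

print(in_scope("I will vote for Obama next week", collection_plan))       # True
print(in_scope("Check out my new recipe for pancakes", collection_plan))  # False
```

Writing the plan down in this form makes the later, scenario-specific questions of the framework easier to answer and to audit.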

DETERMINE BIG DATA CAPTURING TECHNIQUES ACCORDING TO DATA TYPES
With the explosion of digital data nowadays, several data collection tools and techniques have emerged, either as standalone data integration tools or within a data management framework. However, Hadoop has emerged as the de facto framework for big data management and processing and has been utilized by multiple companies in performing big data analytics [53]. Therefore, this section presents Hadoop's data capturing components among the available data collection and integration techniques. Most of the huge volumes of data available today are in an unstructured or semi-structured format. This leads to a rising demand for special data capturing methods that deal with these types of data. Apache Flume and Sqoop, as presented earlier, are Hadoop's two core components that are connected to HDFS and responsible for collecting streams of data in real time [45]. Flume efficiently collects, aggregates, and loads large amounts of semi-structured and unstructured data, such as social media, sensor data, and log files. Sqoop, on the other hand, efficiently collects bulks of structured data, such as medical data stored in relational databases, and loads it into HDFS, and vice versa. See Figure 4.4.

FIGURE 4.4: DETERMINING BIG DATA CAPTURING TECHNIQUES (unstructured and semi-structured data flow through Flume, and structured data through Sqoop, into HDFS in the Hadoop ecosystem)
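The routing rule of Figure 4.4 can be summarized in a few lines of code. The sketch below is a plain Python illustration of that decision only; it is not Hadoop, Flume, or Sqoop code, and the function name and return labels are hypothetical.

```python
# Minimal illustration of the capture-routing rule in Figure 4.4.
# The function and its return labels are hypothetical, not Hadoop APIs.
def capture_component(data_format: str) -> str:
    """Map a data format onto the Hadoop-style capture component."""
    if data_format in ("unstructured", "semi-structured"):
        return "Flume -> HDFS"   # streams: social media, sensor data, log files
    if data_format == "structured":
        return "Sqoop -> HDFS"   # bulk import from relational databases
    raise ValueError(f"unknown data format: {data_format}")

print(capture_component("semi-structured"))  # Flume -> HDFS
print(capture_component("structured"))       # Sqoop -> HDFS
```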

DETERMINE BIG DATA ANALYTICAL TECHNIQUES
IBM provides a good practice for performing a big data solution. It states that a good first step in planning for a big data solution, before capturing the data, is to specify the type of data analytics that is most appropriate for your big data environment [54]. This is because each type of big data analytics is appropriate for different types of data structures, formats, and processing methodologies. Therefore, understanding big data analytical techniques in order to determine which one is the most suitable for your business scenario simplifies the complex process of information extraction and business insight. Different kinds of big data analytic tools or techniques can be used to extract knowledge from very large and complex datasets to create business value. Each of these techniques is used in certain cases and delivers different results or provides different insights. The decision of choosing the right technique to apply depends mainly on the type of scenario and the data being processed [54]. Therefore, it is important to have a clear understanding of those techniques and when each can be used when developing your big data collection plan or strategy. Table 4.2 provides the general categories of techniques used for big data analytics, mapped to their applicability [86].

TABLE 4.2 CATEGORIES OF BIG DATA ANALYZING TECHNIQUES AND APPLICABILITY [86]

- Association Rule. Applicability: a need for discovering inter-dependencies between variables in large datasets. Scenario example: What is the likelihood that people who purchase tea also purchase sugar?
- Classification Analysis. Applicability: a need for identifying to which of a set of categories different types of data belong. Scenario example: To which category does this belong?
- Regression Analysis. Applicability: a need for describing how the value of a dependent variable varies according to the value of an independent variable. Scenario example: How does the type of background music affect the time spent in a store?
- Sentiment Analysis. Applicability: a need for determining people's opinions with respect to a topic. Scenario example: How do people feel about Hillary Clinton running in the elections?
- Social Network Analysis. Applicability: a need for analyzing relationships between people in different fields and activities. Scenario example: What is the influence of Miley Cyrus on teenagers?
- Genetic Algorithms. Applicability: a need for searching for useful solutions to problems that require optimization. Scenario example: Which advertisements should we run, and in what time slot, to maximize our sales?
- Location-based Analysis. Applicability: a need for exploring the relationships of data elements that can be tied to a geo-location. Scenario example: Which negative driving patterns are within the driver's control and which are not?

By understanding the big data analyzing techniques outlined above and when to apply each of them, it will be possible to select the one that is most appropriate for your big data environment and that, when applied, will help to recover hidden patterns and valuable insights.

MARKETING SCENARIO EXAMPLE
Apple has released the iPhone 6 and wants to know how the public feels about the product launch right now, and then to track how those opinions change over time. The company's business question might be: How does the public feel about the product, and how can the sentiment data be used to predict customers' opinions and enhance our products? There are huge volumes of data found in diverse sources that the company may collect and analyze. However, collecting too much data will increase the resources needed to analyze these data and produce valuable insights. What the company needs in order to answer its question is sentiment data describing people's opinions, attitudes, and emotions. Therefore, the best source for collecting this unstructured data would be social media posts and online product reviews. The company will look at the public sentiment on Twitter during the days leading up to, and immediately following, the release of the iPhone 6. Apache Flume (as shipped with distributions such as Hortonworks) can be used to efficiently obtain the Twitter data. Flume is connected to HDFS and acts as a log aggregator for collecting log data from many data sources [45]. In Apple's case, Flume will capture the Twitter stream data and load it into HDFS for further processing. Thus, by analyzing the tweets flowing through Flume into HDFS, the company is able to know what the public in each of its markets is tweeting about the product in real time and can see how that sentiment changes after the product's launch. This is particularly useful when urgent marketing and distribution decisions about the product need to be made.
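Once the tweets for this scenario have been captured, the sentiment question can be answered with any of the Sentiment Analysis techniques listed in Table 4.2. The following toy Python sketch is a deliberately simplified word-list tally, not the analysis technique the thesis prescribes; the keyword sets and the sample tweets are invented for illustration only.

```python
# Toy sentiment tally over captured tweets. The word lists and example
# tweets are illustrative assumptions, not part of the thesis experiment.
POSITIVE = {"love", "great", "amazing", "best", "fast"}
NEGATIVE = {"hate", "terrible", "disappointed", "worst", "slow"}

def polarity(tweet: str) -> str:
    words = set(tweet.lower().split())
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

captured_feeds = [
    "I love the new iPhone camera, it is great",
    "Battery life is terrible, very disappointed",
    "Picked up the new iPhone today",
]
for feed in captured_feeds:
    print(polarity(feed), "-", feed)
```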

What domain experts really need is a structured approach to follow in order to identify the right factors that govern the data collection process. Such an approach shall contribute to the analysis of the scenario at hand and of the properties of the outputs.

This chapter has defined a conceptual model of the domain and the scenario, along with some illustrative examples. It explained the concept of backward analysis and its impact in relation to big data collection. The chapter also emphasized the importance of having a well-defined data collection plan and a Requirements Analysis according to scenarios, which shall act as a Requirements Engineering phase for the data collection process, following the SDLC in Software Engineering [27]. It has provided an understanding of the general categories of analyzing techniques and when to choose each of them depending on the pre-defined scenario.

CHAPTER 5: DATA COLLECTION REQUIREMENTS MODELING

"We are looking for that handful of data that is meaningful for what we are doing."
Roser [74]

5.1 INTRODUCTION

In this chapter, the W*H model and the work that serves as the main inspiration for this research are presented. An identification of the main factors that govern the data collection phase of a big data analytics solution is discussed in relation to the W*H model. Moreover, the proposed Requirements Specification Framework for Scenario-based Big Data Collection, along with a decomposition of its main elements, is provided, and the mapping of the W*H model to the Big Data Collection domain is discussed.

5.2 THE W*H CONCEPTUAL MODEL FOR SERVICES

The W*H model extends the Rhetorical and Zachman frameworks [6]. The Zachman framework is an analytical model that uses the classical description of the rhetor Hermagoras of Temnos (what, how, where, who, when, why) to organize descriptive representations [6]. The key questions in systems development are:
- What is represented in the system? (Data)
- How does the system work? (Function)
- Where is the information system used? (Network)
- Who will be using the system? (People)
- When will the system be used? (Time)
- Why is the system used? (Motivation)

The W*H conceptual model, an inquiry system for the conceptualization process, is inspired by the success of Zachman's framework as an inquiry system for information systems engineering, and by the success of the Hermagoras of Temnos frames in legal inquiry frameworks [6]. The W*H model provides a detailed description of services and of how the responsibility of providing the service information is divided between the service provider, the service consumer, and the patterns that govern the process of information exchange. Services are characterized by several factors which formulate the service properties. Some of these specific properties are the supplier or manufacturer, the pricing that is applicable, and the costs depending on the user, provider, and deliverer [6].

The foundation of the W*H model is a set of primary, secondary, and additional questions that bring completeness, simplicity, and correctness into service systems innovation, design, and development [6]. In the W*H model, the service is primarily declared by answering the following questions:
- Wherefore? (The ends) It defines the benefit a potential user may obtain when using the service. This factor is based on the answers to the following questions: why, whereto, for when, for which reason.
- Whereof? (The sources) It defines a general description of the environment for the service.
- Wherewith? (The supporting means) It identifies the aspects that must be known to potential users when utilizing the service.
- Worthiness? (The surplus value) It defines the value of service utilization for the potential user.

These are the primary set of questions, which lead to further secondary and additional questions for a full understanding of the domain's services and requirements. The W*H model comprises 23 questions in total, which cover the complete spectrum of questions addressing the service description [6]. (See Appendix C for more information about the service model dimensions and the mapping of the W*H specification framework and the W*H model to the requirements specification framework for scenario-based data collection.)

5.3 USING THE RHETORICAL AND ZACHMAN FRAMEWORKS

In relation to scenario-based data collection, each domain has different scenarios, as mentioned earlier, and each scenario requires gathering different information. In some scenarios one source of data might be enough, but in others integrated data from several sources will be necessary. This depends on several factors that govern the data collection process:
- The scenario that needs to be analyzed.
- The analytical techniques associated with it.
- The time and resources available.

- The data access constraints.

For this reason, care should be taken with the specific factors that must be clearly pointed out to ensure that the collection of irrelevant data is eliminated. A comprehensive and substantial series of questions should be asked prior to the process of data collection to support this need. A well-formed structure and prioritization of the questions are also needed. Therefore, it is essential to follow a systematic approach that allows the main data collection factors to be considered in a scenario-based manner. Following [6], the conceptual framework may be headed by the questions: who, what, when, where, why, in what way, by what means. This classical rhetorical framework was established a long time ago². The specification framework developed in this chapter follows the work of [6] for service systems. It is inspired by the success of Zachman's framework for information systems development, which is itself a partial reinvention of the Hermagoras of Temnos framework.

5.4 THE REQUIREMENTS SPECIFICATION FRAMEWORK FOR SCENARIO-BASED DATA COLLECTION

This section presents the framework for identifying the factors that govern the big data collection process. This framework is built according to the approach followed in [6]. As discussed earlier, the focus of this thesis is identifying the scenario specifications at the Data Collection phase of a Big Data Analytics solution. It aims to shorten the analysis time through data reduction, by focusing on retrieving data from the source that meets the big data scenario at hand. Due to the sheer volume, velocity, and variety of big data, it is challenging to minimize the amount of data to be collected.

² It dates back to Cicero and even to Hermagoras of Temnos, who was one of the inventors of rhetorical frames in the 2nd century BC. The latter used a frame consisting of the seven questions: Quis, quid, quando, ubi, cur, quem ad modum, quibus adminiculis (W7: who, what, when, where, why, in what way, by what means). The work of Hermagoras of Temnos is almost lost. He had a great influence on orality due to his proposals; for instance, Cicero discussed his proposals intensively and thus made them available [6].

The framework developed shall be applied during the data collection phase, as the initial process in a big data analytics solution.

DETERMINING THE SCENARIO-BASED DATA COLLECTION FACTORS
When attempting to capture data to address a certain scenario, the scenario properties and many other factors should be taken into consideration. These factors need to be categorized and put into a hierarchy to bring simplicity, completeness, and correctness into the data collection process, in line with the three Cs of RE (consistency, correctness, and completeness) [87]. An idea that is useful in such a categorization is separation of concerns by aspect and grouping of the questions, depending on the related factors, into a hierarchy of primary, secondary, and additional questions.

THE CHARACTERISTICS OF BIG DATA SCENARIOS
Scenarios are characterized according to their specific properties: the domain they belong to, the temporal and spatial factors, the search patterns such as keywords, phrases, or named entities, the analyzing technique, and the capturing technique [54]. Scenarios may be composed of other scenarios. In this case, some properties of the scenarios may contradict one another, while others may be the same. Nested scenarios are beyond the scope of this study. A full description of the scenario-based data collection process is governed by the following factors:

(A) The scenario or purpose (wherefore) of the data collection, and thus the insights a potential domain expert may obtain from analyzing the collected data. The scenario description governs the data collection process and allows the data collection to be characterized. This characterization is based on the answers to the following questions: why, whereto, for when, for which reason. They define the potential and the capability of the scenario-based data collection process.

(B) The sources (whereof) factor determines the data source that is most likely to generate relevant information about the given scenario among

the available data sources. It describes the provider of the data to be collected that is relevant to the given scenario, the consumer of the processed data, the classification of the content format, and how much data is expected. This classification can help in understanding how the data is acquired and how it will be analyzed. These are declared by answering questions such as: where from, to whom, in what format, and how much.

(C) The search patterns (by what) factor captures the Supporting needs and Activity factors. It determines which parts of speech (POS), phrases, and keywords correspond to the scenario at hand and must be contained within the data that we want to pull out.
a. The Supporting needs factor describes the analytical technique that can analyze the collected data, the capturing technique that can be utilized to capture the right data, the frequency of the arriving data, the environment in which the data collection process is implemented, and whether the data is processed in real time or batched for later processing. These are declared by answering questions such as: what, how, whence, where, and whether.
b. The Activity factor describes the input and the expected output of the data collection process. These are declared by answering questions such as: what-in, what-out.

(D) The value (worthiness) a scenario-based big data collection and analytics effort is expected to provide for the scenario context and time context, saving time by collecting not garbage but only the needed data that is ready to use for more accurate real-time analysis. These are declared by answering the following questions: where about and when.
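Before the factors are combined into the framework diagram, it is worth noting that they map naturally onto a simple record structure. The sketch below is a hypothetical Python encoding of factors (A) to (D); the class name, field names, and example values are illustrative assumptions, not an artifact of the thesis.

```python
# Hypothetical encoding of the scenario-based data collection factors (A)-(D).
# Field names follow the framework's question words; values are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ScenarioSpec:
    # (A) Scenario / purpose -- wherefore
    problem: str                    # why
    domain: str                     # whereto
    period: str                     # for when
    motivation: str                 # for which reason
    # (B) Sources -- whereof
    provider: str                   # where from
    consumer: str                   # to whom
    data_format: str                # in what format
    quantity: str                   # how much
    location: str                   # where
    # (C) Search patterns -- by what (supporting needs and activity)
    keywords: List[str] = field(default_factory=list)
    capturing_technique: str = ""   # how
    analytical_technique: str = ""  # what
    triggering_event: str = ""      # whence
    processing: str = ""            # whether: real-time or batch
    # (D) Value -- worthiness
    scenario_context: str = ""      # where about
    time_context: str = ""          # when

# Example instantiation for the election polling scenario of Chapter 4
# (values shortened and purely illustrative):
polling_spec = ScenarioSpec(
    problem="Tackle the problem of opinion polls", domain="Elections",
    period="The week prior to election day", motivation="Forecast the winner",
    provider="Twitter", consumer="The campaign team", data_format="unstructured",
    quantity="1500 tweets per fetch", location="United States",
    keywords=["Obama", "Romney"],
    capturing_technique="Twitter Streaming API",
    analytical_technique="Sentiment Analysis",
    triggering_event="every 15 minutes", processing="real-time",
    scenario_context="(Political) predictions", time_context="on-demand",
)
```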

Figure 5.1 combines the scenario-based data collection factors into a general framework, following the approach presented in [6].

FIGURE 5.1 THE REQUIREMENTS SPECIFICATION FRAMEWORK FOR SCENARIO-BASED BIG DATA COLLECTION (a diagram relating the four primary dimensions: Scenario (wherefore: problem, domain, period, motivation), Sources (whereof: provider, consumer, format, location, quantity), Search Patterns (by what: supporting needs and activity, i.e., capturing and analytical technique, triggering event, processing methodology, input, output), and Value (worthiness: scenario context, time context))

In summary, the proposed scenario-based specification framework is thus based on the following questions:
- Primarily: wherefore, whereof, by what, worthiness; and additionally why, whereto, for when, for which reason;
- Secondarily: where from, to whom, in what format, how much; and additionally what, how, whence, where, whichever, whether, what in, what out;
- Additionally: where about and when.

These questions form the kernel of the specification framework.

This chapter has presented the works that are the main inspiration for this research. It introduced the Rhetorical and Zachman frameworks, which are the basis for the W*H model. The chapter raised the question of what the right questions to ask are, in relation to the W*H model, in order to ensure collecting the right data. It has provided a demonstration of the development of a conceptual model of the scenario-based data collection process through a specification framework. The mapping of the W*H model to the Big Data Collection domain has been discussed as well.

CHAPTER 6: CASE STUDY

"With data collection, 'the sooner the better' is always the best answer."
Marissa Mayer [88]

6.1 INTRODUCTION

This chapter presents an illustration of the application of the proposed specification framework on three examples of real-life scenario cases. These cases cover the data capture process for all the types of big data. The first two cases are applications of the specification framework on unstructured data, while the third case is a complex scenario that requires hybrid data collection of structured and semi-structured data. The first case addresses the prediction of the United States presidential elections scenario. The second case addresses the tracking of the Ebola scare in Saudi Arabia. The last case aims to decrease the traffic violation rate in Saudi Arabia and enhance drivers' safety. The chapter will show how the scenario-based data collection model can be used in a big data problem to produce a clear and focused data collection process. The next section provides a brief background on the business environment of the first scenario case, a description of the business mission, a demonstration of the key issues or problems encountered, and then the solution as an application of the specification framework introduced earlier. This is followed by the second case scenario in the same sequence, and then an analysis of the two framework applications. The third case follows in the same manner.

6.2 US PRESIDENTIAL ELECTIONS CASE

OVERVIEW
The day of the Presidential Elections was scheduled for the 6th of November 2012. The two candidates were Barack Obama and Mitt Romney. The former was a civil-rights lawyer and teacher before pursuing a political career. He was elected to the Illinois State Senate in 1996 and served from 1997 to 2004. He was elected in 2008 as a Democratic president of the United States and is in a challenge to win re-election against the Republican candidate Mitt Romney [89]. The former Massachusetts Governor Mitt Romney is making a second bid for the presidency. He was asked in 1999 to head a struggling U.S. Olympic Committee in advance of the 2002 winter games; Romney helped the committee erase a deficit of $379 million, and the games were considered a success [90].

DESCRIPTION OF THE NEEDS
Meanwhile, a data collection team from the presidential campaign of Barack Obama wants to tackle the problem of opinion polls and produce a prediction for

the incumbent president. As Twitter is becoming one of the major online platforms on which political associations as well as various interest groups express their opinions and thoughts, the team has decided to use it as the main data source for their project. The presidential campaign to which the data collection team belongs has two major objectives that it would like to achieve by collecting Twitter information:
- To assess the amount of information regarding the political events in the United States.
- To use the information gained to predict the overall voting intentions and estimate the winning candidate a week before the day of the elections (see Table 6.1).

The Twitter API restricts fetches to 1500 tweets per request, which is a reasonable amount to allow for relevant data capture and real-time analysis [91].

KEY PROBLEM
However, the team lacks a well-defined strategy for collecting the data. They face the dilemma of what data to collect and what to leave, what the right patterns to search for are, when the right time to collect the data is and at what frequency, which capturing and analyzing techniques to associate with it, and many other factors.

APPLICATION OF THE SPECIFICATION FRAMEWORK
The approach to the modeling of the scenario-based data collection process described in Chapter 5 supports the development of a well-defined data collection plan. It can provide the team with a roadmap for building virtual boundaries around the collection process, based on the answers to the specification framework's questions. Table 6.1 is a summarized version of the application of the proposed specification framework for scenario-based data collection modelling.

TABLE 6.1 FRAMEWORK APPLICATION ON THE US PRESIDENTIAL ELECTIONS SCENARIO

Scenario (wherefore): Obama's campaign wants to know the likelihood of defeating Romney and winning the elections.
  Purpose:
  - Problem (why): To tackle the problem of opinion polls.
  - Domain (where to): Elections.
  - Period (when): Starting on the 26th of October 2012.
  - Motivation (for which reason): To rally individual voters based on the estimation results.

Sources (whereof): Social media data.
  Data:
  - Provider (by whom): Twitter.
  - Consumer (to whom): Obama's campaign.
  - Format (in what format): Unstructured.
  - Location (where): United States.
  - Quantity (how much): 1500 tweets at a time.

Search patterns (by what): Barak, Obama, Mitt, Romney, democrat, republican, Michelle, #Obama2012, #RomneyRyan2012, Mitt2012, #Obama.
  Technology:
  - Capturing (how): Twitter Streaming API through Flume.
  - Analytical (what): Sentiment Analysis.
  - Triggering event (whence): Every 15 minutes.
  - Processing methodology (whether): Real-time processing.
  Activity:
  - Input (what in): A search query including the search patterns or keywords, such as #Obama.
  - Output (what out): A collection of relevant tweets matching the specified query, such as: "I don't support #Obama because I need a job or healthcare. I support him because he shines the light on the future I want for my children."

Value (worthiness): To estimate the election results.
  Context:
  - Scenario context (where about): (Political) predictions.
  - Time context (when): On-demand.

6.3 SA EBOLA SCARE CASE

OVERVIEW
Ebola Virus Disease (EVD) is a severe acute viral infection that is the hot topic of the breaking news nowadays. It is initially marked by muscle pain, intense weakness, sore throat, and headache. This is followed by vomiting and diarrhea, organ failure, and severe bleeding, and it may lead to death in many cases. EVD is native to Africa and can be transmitted to people through infected animals or through contact with body fluids and/or contaminated needles [92], [93]. The fact that no licensed vaccine or specific treatment is yet available to cure the disease causes public fear of the disease. Therefore, governments around the world are trying to protect the health of their citizens. In Saudi Arabia, the scare spread after a man who had come back from a business trip to Sierra Leone, and who was suffering from symptoms similar to those of Ebola, died on August 6th, 2014. Since then, the government of Saudi Arabia has stopped issuing visas to West African countries [94].

DESCRIPTION OF THE NEEDS
Keeping up with the EVD outbreak, the ICT department at the Ministry of Health (MOH) in Saudi Arabia wants to track citizens' fear of and reactions to the virus. The ICT department has decided to use Twitter feeds as the main data source, as they reflect the public's opinions and fears in real time. The Saudi Arabian MOH, to which the ICT department belongs, has two major objectives that it would like to achieve by collecting and analyzing Twitter information:
- To track citizens' reactions to the EVD scare in Saudi Arabia and how they tend to protect themselves.
- Based on the feedback gained, to make timely and informed decisions regarding the disease awareness campaign.

The Twitter API restricts fetches to 1500 tweets per request, which is a reasonable amount to allow for relevant data capture and real-time analysis [91].

KEY PROBLEM
The MOH needs to make real-time decisions in order to protect the health of its citizens. Therefore, the ICT department should produce real-time analysis insights. However, the lack of a well-defined, scenario-based strategy for data collection causes uncertainty about which data can provide meaningful results and which data is useless and need not be collected. This may lead the department to collect a lot of unwanted data, thus increasing the analysis time and producing results over longer time frames.

APPLICATION OF THE SPECIFICATION FRAMEWORK
Table 6.2 provides the second application of the specification framework for scenario-based data collection modelling, as a solution to the scenario case of the Ebola scare in Saudi Arabia.

TABLE 6.2 FRAMEWORK APPLICATION ON THE EBOLA SCARE IN SA SCENARIO

Scenario (wherefore): The Ministry of Health (MOH) wants to track the Ebola scare in Saudi Arabia.
  Purpose:
  - Problem (why): To track the Ebola scare in Saudi Arabia.
  - Domain (where to): News.
  - Period (when): A week starting from the 6th of August 2014.
  - Motivation (for which reason): To develop and support a purposeful awareness campaign.

Sources (whereof): Social media data.
  Data:
  - Provider (by whom): Twitter.
  - Consumer (to whom): MOH.
  - Format (in what format): Unstructured.
  - Quantity (how much): 1500 tweets at a time.
  - Location (where): Saudi Arabia.

Search patterns (by what): Ebola, Ebola virus, Ebola infection, Ebola death, Ebola scare, Ebola panic, and their Arabic equivalents (إيبولا، فيروس إيبولا، عدوى إيبولا، وفاة إيبولا، الخوف من إيبولا، استنفار إيبولا).
  Technology:
  - Capturing (how): Twitter Streaming API through Flume.
  - Analytical (what): Sentiment Analysis.
  - Triggering event (whence): Every minute.
  - Processing methodology (whether): Real-time processing.
  Activity:
  - Input (what in): A search query including the search patterns or keywords, such as "Ebola virus" / "فيروس إيبولا".
  - Output (what out): A collection of relevant tweets matching the specified query, such as: "A case of Ebola has been discovered today in Jeddah.. and we are approaching school then pilgrimage.. It is important to alert health efforts to overcome the danger" ("اليوم تم اكتشاف حالة ايبولا بجده.. ونحن مقبلين على الدوام الدراسي ثم الحج.. من المهم استنفار جهود الصحة لاحتواء الخطر!").

Value (worthiness): To provide awareness and virus precautions.
  Context:
  - Scenario context (where about): (Medical) timely awareness decisions.
  - Time context (when): On-demand.

6.4 US ELECTIONS AND SA EBOLA SCARE CASE ANALYSIS

The two cases provided above present the specification of the relevant data for each of the real-life scenarios. In both cases, it is possible to collect the required data through Apache Flume (Hadoop's data collecting and aggregating component). Flume can be used in such situations to fetch the relevant Twitter data directly from the Twitter API and forward it to HDFS for storage and for processing through Apache Storm, which performs real-time integration and trend detection on the Twitter feeds. Thus, Flume is responsible for efficiently moving the data from the source to the sink, as shown in Figure 6.1. Twitter provides two options for collecting feeds [91]:
- Twitter Search API: searches backward and provides matching tweets that have already been sent, up to one week old (not applicable for predictive analysis).
- Twitter Streaming API: searches forward and captures matching tweets in real time as they are sent.

The Twitter Streaming API fits the scenarios provided well, as they both require collecting real-time data. The Twitter Streaming API supports the filtering and specification of constraints on the data entering Flume according to the search patterns, quantity, frequency, location, and other factors determined above. According to Mediabistro [95], 277,000 tweets are published every minute, and the Twitter API limits data pulls to a maximum of 1500 tweets per fetch [91]. However, restricting data pulls to 1500 scenario-related tweets every 15 minutes in the first case, and 1500 scenario-related streaming tweets every minute in the second case, for a certain time frame, is reasonable for producing timely and informed analysis results [91].
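To make the filtering step concrete, the sketch below shows how such scenario constraints could have been expressed in Python against the classic Twitter Streaming API (v1.1, since deprecated) using the tweepy library, version 3.x. It is an illustration only and not the Flume configuration used in the cases; the credential placeholders and the keyword list (taken from the Ebola specification in Table 6.2) are assumptions.

```python
# A hypothetical tweepy 3.x sketch of scenario-constrained streaming against
# the old Twitter Streaming API (v1.1). Credentials are placeholders; the
# thesis cases actually route the stream through Apache Flume into HDFS.
import tweepy

SCENARIO_KEYWORDS = ["Ebola", "Ebola virus", "Ebola scare"]  # from Table 6.2
MAX_TWEETS = 1500                                            # per-fetch limit [91]

class ScenarioListener(tweepy.StreamListener):
    def __init__(self):
        super().__init__()
        self.count = 0

    def on_status(self, status):
        self.count += 1
        print(status.text)              # in the thesis setup: forward to HDFS
        return self.count < MAX_TWEETS  # stop after the planned quantity

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")  # placeholders
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")   # placeholders

stream = tweepy.Stream(auth=auth, listener=ScenarioListener())
stream.filter(track=SCENARIO_KEYWORDS, languages=["en", "ar"])
```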

FIGURE 6.1 DATA FLOWING THROUGH THE FLUME CHANNEL (source: Twitter Streaming API; Flume channel; sink: HDFS)

What has been presented in the first and second cases is an illustration of the specification framework applied to real-time unstructured data collection scenarios. To demonstrate the framework's comprehensiveness and generality over the three types of big data structures (structured, semi-structured, and unstructured data), we need to provide an application of the framework on the structured as well as the semi-structured data types. The following case is a scenario that requires hybrid data capture of batch structured data pulled out of relational databases, combined with semi-structured geo-location data transmitted through sensors in real time.

6.5 TRAFFIC MANAGEMENT AUTOMATED SYSTEM CASE

OVERVIEW
With the obvious recent increase in the number of traffic accidents and deaths due to excessive speed and careless drivers, the government of Saudi Arabia decided to develop an automated system for the management of traffic via e-systems that covers the major cities in Saudi Arabia. The system shall be designed with sensors and Radio Frequency Identification (RFID) readers that communicate the position and speed of each vehicle to the National Information Center (NIC) [96]. These sensors shall sense unsafe events such as speeding and passing traffic signals while they are red. These transmitted real-time data are called geo-located data. According to [45], geo-location data determines the location of an asset or an individual at a moment in time using digital technology. This data may be delivered

in two forms: either as (x, y) coordinates or as an actual address. Google Maps, Google Earth, and Bing Maps are advanced applications of geo-location data.

DESCRIPTION OF THE NEEDS
The government of Saudi Arabia wants to refine and visualize machine-generated data in order to reduce traffic problems and improve drivers' and passengers' safety. Because the large volumes of traffic violation data may create processing problems, it has been suggested to use Hadoop to address this issue in a fault-tolerant manner. Therefore, the government needs to develop a system to auto-detect, capture, and transmit the data of vehicles that have violated traffic regulations to Hadoop HDFS clusters. This transmitted data is a combination of semi-structured geo-location data and structured data from the NIC. The government is looking forward to achieving the following objectives [96]:
- Improve the level of traffic safety.
- Use the latest advanced technologies in Intelligent Transportation Systems (ITS) to ensure a safe traffic environment.
- Raise the efficiency of existing road networks.
- Strengthen public safety by using the latest surveillance systems.
- Implement traffic regulations strictly and continuously.

KEY PROBLEM
Because such a complicated scenario requires capturing and integrating hybrid data types from different data sources, there should be a well-defined process to guide the data collection choices.

APPLICATION OF THE SPECIFICATION FRAMEWORK
Table 6.3 provides an application of the specification framework for scenario-based data collection modelling, identifying the factors that will govern and limit the process of data collection.

TABLE 6.3 FRAMEWORK APPLICATION ON THE AUTO TRAFFIC MANAGEMENT SCENARIO

Scenario (wherefore): The Ministry of Interior (MOI) wants to limit vehicle traffic violations.
  Purpose:
  - Problem (why): Increased rate of traffic violations leading to deadly accidents.
  - Domain (where to): Public Security.
  - Period (when): Whenever an event is triggered.
  - Motivation (for which reason): To reduce traffic problems and improve drivers' and passengers' safety.

Sources (whereof): Machine-generated data (semi-structured) and the NIC relational database (structured).
  Data (semi-structured):
  - Provider (by whom): Geo-location data through sensors.
  - Consumer (to whom): MOI.
  - Format (in what format): Semi-structured data.
  - Quantity (how much): 1 terabyte.
  - Location (where): Saudi Arabia.
  Data (structured):
  - Provider (by whom): NIC database.
  - Consumer (to whom): MOI.
  - Format (in what format): Structured data.
  - Quantity (how much): 1 terabyte.
  - Location (where): Saudi Arabia.

Search patterns (by what): For the semi-structured data: time, date, vehicle's plate number, event, latitude, longitude, city, and velocity. For the structured data: driver's national ID, vehicle ID, and model number.
  Technology:
  - Capturing (how): Flume streams the geo-location data; Sqoop imports the structured vehicle data.
  - Analytical (what): Location-based Analysis.
  - Triggering event (whence): Speeding and passing red traffic signals.
  - Processing methodology (whether): Real-time processing for the streamed data; batch processing for the imported data.
  Activity:
  - Input (what in): Geo-location data and structured vehicle data from the NIC database.
  - Output (what out): Data ready for visualization.

Value (worthiness): To refine and visualize machine-generated and structured data.
  Context:
  - Scenario context (where about): (Traffic General Directorate) a reduction in traffic problems.
  - Time context (when): Continuous.

6.6 AUTO TRAFFIC MANAGEMENT CASE ANALYSIS

The application of the specification framework on the hybrid data collection scenario serves as a structured and well-defined data collection plan. It specifies the major factors that govern the data collection process in order to produce the right dataset for the given scenario. In the Auto Traffic Management case provided above, two types of data need to be collected:
- Structured Data: this data will be collected from the vehicle tables in the NIC database through Apache Sqoop. Sqoop will import the specified structured data from the database and transmit it to HDFS for batch processing.
- Semi-structured Data: the geo-location data will be captured in real time from the sensors and RFID readers through Apache Flume and transferred to HDFS, along with the structured data, for further processing and visualization through Apache Hive and HCatalog.

Hadoop is capable of processing the sheer volumes of traffic violation data in Saudi Arabia. Figure 6.2 presents the process of data capture through Flume and Sqoop into HDFS.

FIGURE 6.2 DATA FLOWING THROUGH SQOOP AND FLUME INTO HDFS (source data: vehicle records and geo-location data; capture: Sqoop and Flume; Hadoop: HDFS with Hive for data processing and HCatalog for table metadata)

The conceptual model is now used as a well-defined process for collecting relevant structured, semi-structured, and unstructured data based on a clear vision of the scenario specification: what the reason for the data collection is, what the right search

patterns are, what the best data source is, and how the data will be captured and analyzed. The typical situation is the non-existence of a scenario-based data collection model; data collectors therefore tend to collect huge volumes of data that are mostly irrelevant. The capturing of irrelevant data results in a waste of resources. Therefore, the scenario-based data collection specification framework will be an essential element that leads to the elimination of noisy data collection, thus reducing the analysis time and producing useful results in a timely manner.

This chapter has provided an application of the scenario-based data collection model on the three types of big data. First, it demonstrated the framework on two real-life case studies that require unstructured data collection: the framework questions were applied to the US Presidential Elections scenario and the SA Ebola Scare scenario. Afterwards, a complex scenario that requires hybrid data collection of structured and semi-structured data was presented. The three cases provided in this chapter have illustrated how to establish a well-defined strategy for a more focused and structured data collection process.

CHAPTER 7: EXPERIMENT AND VALIDATION

"An experiment is a question which science poses to Nature, and a measurement is the recording of Nature's answer."
Max Planck [97]

7.1 INTRODUCTION

In this chapter, experimental work on Twitter feeds is presented in order to examine the effectiveness and power of the scenario-based data collection framework. The experimentation is divided into two parts. The first experiment performs a random data collection without applying the scenario-based framework as a first step prior to the actual data collection; this process is worked out and analyzed below. The second experiment applies the scenario-based data collection framework to the same scenario in an attempt to collect relevant feeds; this process is provided below as well. A comparison and an analysis of the experiment, and the validation of the scenario-based specification framework, are also discussed.

7.2 EXPERIMENTAL DESIGN

The value of this experiment is concentrated on validating that the specification framework provides relevant data in a timely manner. This can be made visible in comparison with an ad hoc process of data collection, through an analysis of the generated data. Therefore, in order to ensure the availability of the right and sufficient data to prove the validity of the framework clearly and efficiently, considerable time and effort has been taken to design and organize the experiment and its attributes properly.

7.3 EXPERIMENTAL WORK

In this section, an experiment is worked out using the web-based application DiscoverText³, in order to evaluate and ensure the validity of the scenario-based data collection framework in providing relevant data in a timely manner. The experiment consists of:
1. An ad hoc process of data collection, where the scenario-based data collection framework is not applied prior to fetching the data, and thus no investigation of the scenario at hand is made.
2. In contrast, a systematic process of data collection, through the application of the Requirements Specification Framework, on the same scenario.

Afterwards, the results of the ad hoc data collection process, where the framework was not applied, are compared to the results of the data collection in which the framework was applied prior to attempting to collect the data.

³ This tool has been introduced and illustrated earlier in Chapter 3. A detailed description of how this software application works is provided in Appendix D.

The comparison is intended to provide evidence on the validity of the specification framework and its efficiency in improving the data collection process. 7.3.1 Data Collection Scenario This experiment studies the validation of the specification framework on the Ebola scenario presented in section 6.3. The specification framework of the Ebola case study is presented in table 6.2. 7.3.2 Ad hoc Data Collection and Result Analysis In this section, DiscoverText [72] is used to collect Twitter feeds without prior planning; the scenario-based data collection framework is not applied before starting to collect the data. In this attempt to collect data relevant to the Ebola Scare scenario through DiscoverText, the tool has not been scheduled according to the scenario-based specification framework that forces relevant feeds. Therefore, after executing this ad hoc process of data collection, the resulting feeds appear to be a combination of Arabic and English tweets, most of which are irrelevant to the scenario, with only a few relevant tweets. Table 7.1 presents the keywords according to the specification framework for Ebola Scare given in table 6.2 and, for each keyword, the number of its appearances in the fetched feeds (4) when performing ad hoc data collection. Sample size: 200.
TABLE 7.1 KEYWORDS AND THEIR OCCURRENCES IN THE AD HOC DATA COLLECTION
Keyword: k1: Ebola | k2: Ebola virus | k3: Ebola scare | k4: Ebola panic | k5: Ebola infection | k6: Ebola death
Occurrence: as plotted in figure 7.1
Figure 7.1 presents a bar chart of the ad hoc data collection result, with the keywords on the X axis and the occurrence counts on the Y axis.
(4) The English keywords were considered in the search for relevant feeds to clarify the point in a simple way; the sample size is 200.

FIGURE 7.1 AD HOC DATA COLLECTION RESULT IN BAR CHART GRAPH
According to this classification of the fetched feeds, the percentage of relevant feeds against the percentage of irrelevant feeds can be calculated as follows:
The total feeds fetched (sample size) = 200 (1)
The total number of relevant feeds (sum of keyword occurrences) = 17 (2)
The total number of irrelevant feeds = (1) - (2) = 183 (3)
The percentage of relevant feeds = (2) × 100 / (1) = 8.5%
The percentage of irrelevant feeds = (3) × 100 / (1) = 91.5%
The visual presentation of the proportion of relevant feeds to irrelevant feeds can be viewed in figure 7.2.
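The keyword filtering and the calculations (1) to (3) above can be expressed, for illustration, as a short sketch. The fetched tweets are placeholders; only the documented totals (a sample of 200 with 17 relevant feeds) are taken from the experiment.

```python
# Minimal sketch of the relevancy calculation (1) to (3). The fetched tweets
# are placeholders; only the documented totals (sample size 200, 17 relevant
# feeds) come from the experiment itself.
KEYWORDS = ["Ebola", "Ebola virus", "Ebola scare",
            "Ebola panic", "Ebola infection", "Ebola death"]

def keyword_occurrences(feeds, keywords=KEYWORDS):
    """Count, for each keyword, how many fetched feeds contain it."""
    return {k: sum(1 for f in feeds if k.lower() in f.lower()) for k in keywords}

def relevancy_breakdown(sample_size, relevant_count):
    """Reproduce calculations (1) to (3) and the two percentages."""
    irrelevant_count = sample_size - relevant_count             # (3) = (1) - (2)
    return {
        "relevant_%": 100.0 * relevant_count / sample_size,     # (2) x 100 / (1)
        "irrelevant_%": 100.0 * irrelevant_count / sample_size, # (3) x 100 / (1)
    }

print(relevancy_breakdown(200, 17))  # {'relevant_%': 8.5, 'irrelevant_%': 91.5}
```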

FIGURE 7.2 AD HOC DATA COLLECTION RESULT IN PIE CHART
A snapshot of the feeds container can be viewed in figures 7.3 and 7.4. The two figures show the same container; two snapshots were provided to increase the visibility of the fetched feeds.
FIGURE 7.3 A SNAPSHOT OF THE AD HOC DATA COLLECTION FEEDS (I)

FIGURE 7.4 A SNAPSHOT OF THE AD HOC DATA COLLECTION FEEDS (II)
Section 7.3.2 has presented an experiment of ad hoc data collection on a sample of size 200 for the Ebola Scare unstructured data scenario. DiscoverText was used to fetch the tweets from the Twitter API and store them in a container for further manipulation. The fetched feeds were then filtered according to specific keywords and the number of their occurrences, in order to be compared in the next sections with the result of the systematic data collection process that applies the scenario-based data collection framework. 7.3.3 Scenario-based Data Collection and Result Analysis In this section, DiscoverText is used to collect Twitter feeds with prior planning and investigation, through the application of the scenario-based data collection framework to the given scenario. Unlike the ad hoc process of data collection provided earlier, here the tool is scheduled according to the application of the framework to the Ebola Scare scenario given in table 6.2. This produces more accurate results in terms of relevant feeds and time to capture the right data, thus providing well-defined criteria to enhance the process of data collection and thereby addressing the research question.

Therefore, DiscoverText has been specified according to the characteristics of the Requirements Specification Framework as presented in table 6.2 of chapter 6: the source (whereof), the search patterns, that is, keywords or phrases (by what), the provider (by whom), the period for the fetches (when), the fetch triggering event (whence), the quantity or maximum items per fetch (how much), and other constraints that help narrow the results to what is relevant to this case. After executing the tool, most of the resulting feeds appear to be relevant and accurate. Table 7.2 presents the keywords according to the specification framework for Ebola Scare given in table 6.2 and, for each keyword, the number of its appearances in the fetched feeds when applying the scenario-based framework. Sample size: 200.
TABLE 7.2 KEYWORDS AND THEIR OCCURRENCES IN THE SCENARIO-BASED DATA COLLECTION
Keyword: k1: Ebola | k2: Ebola virus | k3: Ebola scare | k4: Ebola panic | k5: Ebola infection | k6: Ebola death
Occurrence: as plotted in figure 7.5
Figure 7.5 presents a bar chart of the scenario-based data collection result, with the keywords on the X axis and the occurrence counts on the Y axis.
FIGURE 7.5 SCENARIO-BASED DATA COLLECTION RESULT IN BAR CHART GRAPH
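As an illustration of how such a specification could be held by a collection tool, the sketch below captures the framework answers as a configuration object and applies the search patterns as a filter. The field names and the placeholder period are assumptions for illustration, not the exact DiscoverText settings used.

```python
# Sketch of the scenario-based specification as a configuration object.
# Field names mirror the framework questions; the period values are
# placeholders, not the exact settings used in the experiment.
ebola_scare_spec = {
    "whereof_source": "Twitter",                     # where the data lives
    "by_whom_provider": "Twitter API via DiscoverText",
    "by_what_search_patterns": [
        "Ebola", "Ebola virus", "Ebola scare",
        "Ebola panic", "Ebola infection", "Ebola death",
    ],
    "when_period": ("<start date>", "<end date>"),   # fetch window (placeholder)
    "whence_trigger": "scheduled fetch",
    "how_much_quantity": 200,                        # maximum items per fetch
    "where_location": "Saudi Arabia",
}

def matches_spec(tweet_text: str, spec: dict) -> bool:
    """Keep only feeds containing at least one of the specified patterns."""
    text = tweet_text.lower()
    return any(p.lower() in text for p in spec["by_what_search_patterns"])
```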

Comparing figure 7.5 with figure 7.1 shows the clear difference between the ad hoc and the scenario-based data collection results; it is apparent that the scenario-based data collection provides more accurate and relevant feeds. To provide further evidence of the power of the scenario-based framework in enhancing the process of data collection, the percentage of relevant feeds against irrelevant feeds is calculated and then presented on a pie chart for a clearer comparison.
The total feeds fetched (sample size) = 200 (1)
The total number of relevant feeds (sum of keyword occurrences) = 185 (2)
The total number of irrelevant feeds = (1) - (2) = 15 (3)
The percentage of relevant feeds = (2) × 100 / (1) = 92.5%
The percentage of irrelevant feeds = (3) × 100 / (1) = 7.5%
A visual presentation of the proportion of relevant feeds to irrelevant feeds can be viewed in figure 7.6.
FIGURE 7.6 SCENARIO-BASED DATA COLLECTION RESULT IN PIE CHART
The relevancy and accuracy of the retrieved data, compared to the randomly retrieved data in the ad hoc process, is visible in the snapshots of the feeds container in figures 7.7 and 7.8. Again, the two figures show the same container; two snapshots were provided to increase the visibility of the fetched feeds.

FIGURE 7.7 A SNAPSHOT OF THE SCENARIO-BASED DATA COLLECTION FEEDS (I)
FIGURE 7.8 A SNAPSHOT OF THE SCENARIO-BASED DATA COLLECTION FEEDS (II)

Section 7.3.3 has presented an experiment that applies the scenario-based framework to the Ebola Scare scenario provided in chapter 6, prior to performing the data collection process. This has helped constrain the data that flows in according to specified criteria, thus providing feeds that are more accurate and relevant to the case under analysis. To allow an unbiased comparison, the sample size was fixed at 200 for the Ebola Scare scenario-based data collection. As with the ad hoc data collection experiment, DiscoverText was used to fetch the tweets from the Twitter API and store them in a container for further manipulation. However, in the latter experiment, the tool was specified according to the scenario-based framework. When the fetched feeds were filtered according to the same keywords, the number of occurrences increased sharply for each keyword, causing a significant decrease in the rate of irrelevant feeds. Therefore, comparing the analysis of the systematic data collection process that applies the scenario-based data collection framework against the analysis of the ad hoc data collection process provides evidence of the effectiveness of the scenario-based data collection framework in retrieving the relevant data and discarding unwanted data.
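The two runs can be summarized side by side from the totals reported in this chapter (200 feeds per run; 17 relevant ad hoc versus 185 relevant scenario-based), as in the following minimal sketch.

```python
# Side-by-side comparison of the two runs, using only the totals reported
# in this chapter (sample size 200; 17 relevant ad hoc, 185 scenario-based).
results = {
    "ad hoc":         {"sample": 200, "relevant": 17},
    "scenario-based": {"sample": 200, "relevant": 185},
}

for name, r in results.items():
    relevant_pct = 100.0 * r["relevant"] / r["sample"]
    irrelevant_pct = 100.0 - relevant_pct
    print(f"{name:15s} relevant: {relevant_pct:4.1f}%   irrelevant: {irrelevant_pct:4.1f}%")
# ad hoc          relevant:  8.5%   irrelevant: 91.5%
# scenario-based  relevant: 92.5%   irrelevant:  7.5%
```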

CHAPTER 8: DISCUSSION
"We are drowning in information and starving for knowledge." (Rupert Murdoch [88])

8.1 INTRODUCTION This chapter provides a discussion of the research analysis in terms of its contributions to the data collection process in relation to a big data analytics solution, which is presented in section 8.2. A discussion of the conceptual modeling of scenario-based data collection is provided in section 8.2.1, the evaluation of the framework application on the provided case studies is discussed in section 8.2.2, and the validation of the framework is analyzed in terms of its effectiveness in the data collection process and its reflection on the research question in section 8.2.3. 8.2 RESEARCH ANALYSIS This thesis started with the motivation of developing a well-defined approach that is able to assist executives and decision makers in taking timely management decisions. The development of this approach proved challenging, especially with the overwhelming data generated and found everywhere. This leaves data analysts frustrated about how to gather and analyze all the available data and still produce results in shorter time frames. What has been observed and realized is that the key enabler to obtaining valuable insights in real time is the collection of less, and more focused, data: getting the right answers without having to analyze huge volumes of data that are most likely needless and irrelevant. From this point of view, this research aimed to examine and study the different factors that govern the data collection process. It has emphasized the essence of analyzing the requirements of the scenario that needs to be addressed prior to attempting data collection. The study developed an understanding of the different big data analytical techniques in order to know how to choose the right one according to the scenario under investigation. 8.2.1 THE CONCEPTUAL MODEL FOR SCENARIO-BASED DATA COLLECTION The proposed Requirements Specification Framework is a full description of the scenario-based data collection process based on primary, secondary and additional questions inspired by [6], which was derived following the work of Hermagoras of Temnos. A focus on the framework to perform data collection, based on answering the primary, secondary and additional questions, leads to a

more detailed description of the scenario environment, thus making the big picture of the big data collection process clear. To address this big data collection challenge, a conceptual model is developed for scenario-based data collection modeling (Table 8.1). The model in Table 8.1 is composed to serve the following purposes: The investigation through simple and organized questions according to the primary factors (wherefore, whereof, by what, and worthiness), further yielding secondary and additional questions along the scenario, sources, search patterns and value spaces, covers the scenario properties required for data collection. The questions are produced in relation to big data collection inspired by the W*H service modeling approach [6], which extends the inquiry system of the Hermagoras of Temnos frameworks. The conceptual model comprises 21 questions in total that cover the complete spectrum of questions addressing the description of the scenario-based data collection process. The model's comprehensiveness and flexibility support its application to any given scenario across all domains during the data collection phase of a big data solution. The exhaustiveness of the model became the main contributor to the scenario-dependent factors that govern the data collection process. The model contributes as the primary input model, yielding a structured and well-defined data collection activity in big data solution modelling.
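As a sketch of how the question catalogue summarized in Table 8.1 below could be held in software, the following structure groups the secondary and additional questions under the four primary factors; the key names are illustrative, not part of the framework's notation.

```python
# Sketch of the question catalogue in Table 8.1, grouped under the four
# primary factors. Key names are illustrative, not the framework's notation.
SCENARIO_BASED_COLLECTION_QUESTIONS = {
    "Scenario (Wherefore?)": {
        "Problem": "Why?", "Domain": "Where to?",
        "Period": "For when?", "Motivation": "For which reason?",
    },
    "Sources (Whereof?)": {
        "Provider": "Where from?", "Consumer": "To whom?",
        "Format": "In what format?", "Location": "Where?",
        "Quantity": "How much?",
    },
    "Search Patterns (By what?)": {
        "Analytical technique": "What?", "Capturing technique": "How?",
        "Triggering event": "Whence?", "Processing methodology": "Whether?",
        "Input": "What in?", "Output": "What out?",
    },
    "Value (Worthiness?)": {
        "Scenario context": "Where about?", "Time context": "When?",
    },
}

# 4 primary questions plus 17 secondary/additional questions = 21 in total
assert 4 + sum(len(group) for group in SCENARIO_BASED_COLLECTION_QUESTIONS.values()) == 21
```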

TABLE 8.1 THE CONCEPTUAL MODEL FOR SCENARIO-BASED DATA COLLECTION
Scenario-based Data Collection Properties:
Scenario (Wherefore? purpose): Problem: Why?; Domain: Where to?; Period: For when?; Motivation: For which reason?
Sources (Where of? Data): Provider: Where from?; Consumer: To whom?; Format: In what format?; Location: Where?; Quantity: How much?
Search Patterns (By-what? Supporting needs): Analytical technique: What?; Capturing technique: How?; Triggering event: Whence?; Processing methodology: Whether?; Activity: Input: What in?; Output: What out?
Value (Worthiness? Context): Scenario context: Where about?; Time context: When?
8.2.2 FRAMEWORK EVALUATION CASE STUDY The framework is examined in chapter 6, through its application to three case studies of real-life scenarios. The first scenario aimed to tackle the problem of opinion polls and estimate the winning president in the United States. The presidential campaign, which is interested in the predictive analysis of the US Presidential Elections, has two major objectives: To assess the amount of information regarding the political events in the United States. To use the information gained to predict the overall voting intentions and estimate the winning candidate a week before the day of the elections (see table 6.1 in chapter 6). The second scenario aimed to track the Ebola scare in Saudi Arabia in order to make timely disease awareness decisions. The ICT departments in the MOH need to collect real-time streams of Twitter data to achieve the following objectives:

To track citizens' reactions to the EVD scare in Saudi Arabia, and how they tend to protect themselves. Based on the feedback gained, to make timely informed decisions regarding the disease awareness campaign (see table 6.2 in chapter 6). Moreover, the third scenario aimed to limit traffic problems in Saudi Arabia and increase driver and passenger safety. The Interior Ministry of the government of Saudi Arabia decided to develop an automated traffic violation detection system that captures geo-location and vehicle data. As per their specification, the objectives of their mission are [96]: Improve the level of traffic safety. Use the latest advanced technologies in Intelligent Transportation Systems (ITS) to ensure a safe traffic environment. Raise the efficiency of existing road networks. Strengthen public security by using the latest surveillance systems. Implement traffic regulations strictly and continuously (see table 6.3 in chapter 6). Without identifying the key points for collecting the right social media data for analysis in the first two scenarios, it would be difficult to make decisions about what to capture and what is just useless chatter; for the third case, the framework likewise provides a well-defined data collection approach to the scenario being processed. The case studies have shown how a systematic and structured process can provide useful data collection that yields timely analysis results. Therefore, it is necessary to incorporate scenario-based data collection modeling into any big data solution, as it performs as a filter to the other phases of the solution. 8.2.3 FRAMEWORK VALIDATION EXPERIMENT The Requirements Specification Framework has been validated in chapter 7 through an experiment that is divided into two parts. The purpose of this experiment was to validate the framework's effectiveness in collecting the relevant data according to the scenario requirements and minimizing the unwanted data. Findings of the experiment have shown that when performing an ad hoc process without planning for the data collection requirements, most

of the resulting data appears to be random and irrelevant. However, when applying the framework to the scenario for data collection and then performing the actual process of data collection, the resulting data appears to be more accurate, useful and relevant to the scenario. Figure 8.1 presents the results of the two parts of the experiment.
FIGURE 8.1 COMPARISON OF THE EXPERIMENTATION RESULTS
It is visible in the figure above how the readings have sharply increased for each keyword when the framework was applied, compared with the ad hoc process readings. Therefore, the validation experiment has proved the framework's effectiveness in improving the current (ad hoc) process of data collection. This reflects on the research question, where the Requirements Specification Framework contributes to a paradigm shift in big data collection. This chapter has provided an analysis of the research in terms of its contribution to the development of a Requirements Specification Framework for Scenario-based Big Data Collection processes. It has presented a description of a conceptual model based on the framework questions. In addition, the chapter has discussed the evaluation of the framework applications on the three case studies, and the validation of the framework along with its reflection on the research question.

CHAPTER 9: CONCLUSION
"It's fine to celebrate success but it is more important to heed the lessons of failure." (Bill Gates)

9.1 INTRODUCTION In this chapter, the research conclusion is provided, in which the main research question presented in the introductory chapter is answered. Moreover, additional observations made during this study are discussed, research issues and limitations are raised, and the research areas recommended for further study are provided. 9.2 RESEARCH CONCLUSION This thesis commenced with an introduction to the main research elements, such as an illustration of the problem statement, followed by the scope and objectives. As described in the introductory chapter, the main contribution of this research is focused on improving the current process of big data collection. In the same chapter, the main research question was defined as follows: How can we improve the ad hoc process of data collection that hinders the efficiency of extracting value from large datasets in a timely manner? In an attempt to answer this question, the research has gone through various areas of interest. It highlighted the phenomenon of big data as a broad space and a complex environment, taking into account the general challenges of big data: data, process, and management challenges. The research explored the literature and related works around Software Engineering applications, the available approaches that have control over limiting or reducing useless data, and Hadoop and its core components as a scalable and efficient framework that is capable of processing tremendous bulks of big data. It reviewed some theories and approaches that may lead to a sound process of data collection. Furthermore, the ad hoc process of data collection and its weaknesses have been highlighted. It investigated how the notion that "the more data, the better the results" may lead to the collection of much unwanted data, resulting in wasted resources and a longer time to decision. Afterwards, it discussed the methodology along with the instruments, procedures and methods followed to conduct this thesis. Based on a study of the scientific research materials and the literature exploration, it has been observed that the concept of Backward Analysis, which is performing Reverse Engineering, can add a positive impact to the process of data collection: a data collection that considers the properties of the scenario and the required output

before attempting to collect any data (input). The research has studied the scenario-based factors that govern the data collection process and organized them in the form of primary, secondary and additional questions. These questions form the kernel of the Requirements Specification Framework, developed as a structured, well-defined approach for the scenario-based big data collection process. Moreover, each of the big data analytical solutions, as well as Hadoop's capturing techniques, has been mapped to the situations where it can be the right choice to address the given scenario. Figure 9.1 shows the current process of data collection, where there is no planning before attempting to collect the data. As a result of this ad hoc process of data collection, most of the resulting data is irrelevant.
[Data Sources -> Data Collection -> Data Processing -> Storage -> Data Analysis]
FIGURE 9.1 AD HOC PROCESS OF DATA COLLECTION (I)
The improved process of data collection through the Requirements Specification Framework is provided in figure 9.2.
[Data Sources -> Requirements Specification Framework -> Data Collection & Processing -> Storage -> Data Analysis]
FIGURE 9.2 IMPROVED PROCESS OF DATA COLLECTION (II)
The Requirements Specification Framework has been evaluated on three real-life scenario case studies for effectiveness in delivering a better data collection process. The scenarios covered the three types of big data: structured, semi-structured and unstructured, to demonstrate the comprehensiveness of the framework. Afterwards, the framework has been validated through an experiment that compared the readings

of the results when applying the framework before collecting data against performing the data collection without any planning. The experimentation on the Requirements Specification Framework for Scenario-based Big Data Collection proves that applying the framework improves the ad hoc process of data collection by providing just the relevant data, and therefore allows extracting value from large datasets in a timely manner. This reflects on addressing the research question, as the proposed framework contributes a paradigm improvement to the data collection process. Hence, the Requirements Specification Framework is an added value, as it brings simplicity, correctness and completeness into the arena of big data collection. Domain experts can base their capturing infrastructure on this framework to obtain timely management decisions. 9.3 LIMITATIONS In addition to all possible improvements already mentioned in this thesis, there are different limitations and valuable enhancements which are important to mention as further research. Next, these limitations and recommendations are described in no particular order. 9.3.1 SUBJECTIVE THEME This thesis focuses on the phenomenon of Big Data. However, big data is a fuzzy term that has no fixed definition; it is not yet mature enough, is loosely defined, and is prone to context and understanding issues. Based on the 5 Vs that characterize big data, provided earlier in the thesis, the characteristics of big data scenarios were derived. These characteristics perform as the main factors contributing to the development of the scenario-based big data collection framework. Although these factors that govern the data collection process were defined after a comprehensive literature exploration of this topic, both researchers and organizations may define more or fewer Vs of big data, as it is a subjective topic. A difference in the definition of big data characteristics may lead to a different set of questions governing the data collection process. 9.3.2 LIMITED NUMBER OF CASES The three cases studied in this thesis covered the three types of big data. Nevertheless, using more cases to cover as many big data sources as possible, and asking the same framework questions on each of the

different scenario cases, might lead to further framework refinements. This may increase the level of validity and generality of the framework. 9.3.3 EXPERIMENT AND VALIDATION Limited Sample Size: With a sample size of two hundred, the experiment met its purpose and the results are visible. However, a bigger sample size would have generated more accurate, reliable and unbiased results. Lack of Generalizability: As the experiment was applied to a single real-life case that requires social media feeds, the study may be limited in external validity to other cases, because the findings are based on a single case. 9.4 FURTHER RESEARCH DIRECTIONS Studying more cases that require data from other sources may lead to further framework refinements; the framework will therefore likely become more generic and cover scenarios in other domains, such as scientific scenarios, thus enhancing the validity of the application and experimentation results. Another promising research direction is to pursue scenario-based data collection engineering as a standard approach for big data collection improvement in a big data solution environment. This could be through the automation of the framework and the development of software tools to be plugged in before the data collection and analysis, to provide the relevant data for a timely decision-making strategy and process improvement.

96 REFERENCES [1] R. Mark (2012). Frog's Mark Rolston: the 'Minority Report' interface is a 'terrible idea'. The Verge, Wall Street Journal. [2] Economist Intelligence Unit (2012), The Deciding Factor: Big Data & Decision Making Capgemini. [3] J. F. Gantz, and D. Reinsel, (2011), Extracting Value from Chaos. IDC. [4] B. Purcell (2013), The Emergence of Big Data Technology and Analytics Journal of Technology Research JTR, Vol. 4, Page 1. [5] G. Guest, K. M. MacQueen, and E. E.Namey, (2012), Applied thematic analysis. Thousand Oaks, Calif. Sage Publications. [6] A. Dahanayake, and B. Thalheim (2014), W*H: The conceptual Model for Services. ESF 2014 workshop on "Correct software for web application", Sringer-Verlage. [7] J. A. Zachman (1987), A framework for information systems architecture. IBM Systems Journal 26, Vol. 3, Pages [8] T.L. Tuten (2008), Advertising 2.0: Social media marketing in a web 2.0 world. Wesrport, Connecticut: Praeger. [9] M. Schroeck, R. Shockley, J. Smart, D. Romero-Morales, and P. Tufano, (2012), Analytics: The real-world use of big data: How innovative enterprises extract value from uncertain data. IBM Global Business Services, Business Analytics and Optimization, Executive Report. [10] P. Patterson (2013), More Data Does Not Equal Better Insights. Online article. Retrieved from [11] Z. Santovena, Alejandro, (2013), Big data: evolution, components, challenges and opportunities. Massachusetts Institute of Technology. [12] C. Regina, M. Beyer, M. Adrian, T. Friedman, D. Logan, F. Buytendijk, M. Pezzini, R. Edjlali, A. White, and D. Laney (2013), Top 10 Technology Trends Impacting Information Infrastructure. Gartner publication. [13] P. Hitzler, and K. Janowicz, (2013), Linked Data, Big Data, and the 4th Paradigm. Semantic Web Journal. Vol. 4, No. 3, Pages , IOS Press. [14] G. Martin (2013), Profit from Big Data white paper, HP Corp. [15] emorphis (2013), How Big data Can Help Your Organization to Improve Business. Emorphis.com. [16] R. Akerkar (2013), Big Data computing, First Edition, Chapman and Hall/CRC. [17] S. B. Siewert (2013), Big data in the cloud IBM Corp. [18] R. Barca, L. Haas, A. Halevy, P. Miller, and R. V. Zicari, (2012), Big Data for Good. ODBMS Industry Watch. [19] W. Rasmus,and S. Velu (2013), The value of Big Data. Insights of the Bain & Company. [20] ODBM (2012), Data Modeling for Analytical Data Warehouses. Interview with Michael Blaha. ODBMS Industry Watch. [21] K. F. Punch (2005), Introduction to Social Research: Qualitative and Quantitative Approaches. Britain: Sage. [22] T. Pearson, and R. Wegener (2013), Big Data: The organizational challenge, Bain & Company. [23] NESSI (2014), Software Engineering: Key Enabler for Innovation Networked European Software and Services Initiative (NESSI). 83 P a g e

97 [24] Lei Tang, and Huan Liu (2010), "Toward Predicting Collective Behavior via Social Dimension Extraction", IEEE Intelligent Systems, Vol. 25, No. 4, Pages [25] Biswas, Amitava, and J. Singh. (2006), "Software engineering challenges in new media applications." 10th International conference on Software Engineering & Applications (SEA), Dallas. [26] Li, Jiang, A. Eberlein, and B. H. Far. (2005), "Combining Requirements Engineering Techniques-Theory and Case Study". Engineering of Computer-Based Systems ECBS'05. 12th IEEE International Conference and Workshops. [27] R. S. Pressman, (2010), Software Engineering: A Practitioner's Approach. 7th Ed. New York: McGraw-Hill. [28] META. (2001), "3D Data Management: Controlling Data Volume, Velocity, and Variety". META Group. [29] S. W. Hermansen, (2012), Reducing Big Data to Manageable Portions. SESUG, USA. [30] Z. Guo and J. Wang (2011), "Information retrieval from large data sets via multiplewinners-take-all", in Proc. ISCAS, Rio de Janeiro, pages [31] EY, (2014), Big Data, Changing the way business compete and operate. Insights on Governance, Risk and Compliance. [32] T. Nakanishi, (2014), A Data-driven Axes Creation Model for Correlation Measurement on Big Data Analytics. Proceedings of 24th International Conference on Information Modelling and Knowledge Bases (EJC 2014). [33] L. Cutler (1987) Futurist Laurel Cutler retrieved from [34] T. Lethbridge, and j. Singer, (2001), Advances in Software Engineering: Comprehension, Evaluation, and Evolution. Experiences conducting studies of the work practices of software engineers. In H. Erdogmus & O. Tanir (Eds.) New York: Springer. [35] W. Westfall, (2005), Software Requirements Engineering: What, Why, Who, When, and How. Software Quality Professional, Vol.7, No.4, pages [36] J. Castro, M. Kolp, and J. Mylopoulos, (2002), Towards requirements-driven information systems engineering: the Tropos project. Information Systems, Vol. 27, Pages [37] S.L. Pfleeger, and J.M. Atlee. (2010), Software Engineering: Theory and Practice. 4th Edition, Pearson Education. [38] R. Conradi, and A. Fuggetta, (2002), "Improving Software Process Improvement". IEEE Software, Vol. 19, No. 4, Pages [39] T. White (2012) "Hadoop: The Definitive Guide. UK: O'Reilly. [40] Apache Hadoop. The Apache Software Foundation. Retrieved from [41] S. Singh, and N. Singh (2012), "Big data analytics" International Conference on Communication, Information & Computing Technology (ICCICT), Pages 1-4. [42] W. Xu, W. Luo, and N. Woodward (2012), Analysis and optimization of data import with Hadoop. In Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), IEEE 26th International, Pages [43] J. Dean, and S. Ghemawat (2008), MapReduce: Simplified Data Processing on Large Clusters Communications of the ACM, Vol. 51, No. 1, Pages P a g e

98 [44] S. Kurazumi, T. Tsumura, S. Saito, and H. Matsuo (2012), "Dynamic Processing Slots Scheduling for I/O Intensive Jobs of Hadoop MapReduce" Networking and Computing (ICNC), Third International Conference, Pages 288,292, 5-7. [45] Hortonworks. Retrieved from [46] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur (2000), PVFS: A parallel file system for Linux clusters in Proceedings of 4th Annual Linux Showcase and Conference, Pages [47] Lustre File System. Retrieved from [48] M. K. McKusick, and S. Quinlan (2009), GFS: Evolution on Fast-forward ACM Queue, Vol. 7, No. 7, Page 10. [49] K. Shvachko, H. Kuang, S. Radia, and R. Chansler (2010), "The Hadoop Distributed File System" Mass Storage Systems and Technologies (MSST), IEEE 26th Symposium, Pages 1,10, 3-7. [50] X. Qin, H. Wang, F. Li, B. Zhou, Y. Cao, C. Li, H. Chen, X. Zhou, X. Du, and S. Wang (2012), "Beyond Simple Integration of RDBMS and MapReduce -- Paving the Way toward a Unified System for Big Data Analytics: Vision and Progress" Cloud and Green Computing (CGC), Second International Conference, Pages 716,725, 1-3. [51] B. Thalheim, and Y. Kiyoki (2012), Analysis-Driven Data Collection, Integration and Preparation for Visualisation EJC Pages [52] B. B. Claire, (2013), Managing Semantic Big Data for Intelligence STIDS, Vol of CEUR Workshop Proceedings, Page [53] Intel (2012), Extract, Transform, and Load Big Data with Apache Hadoop White paper. [54] D. Mysore, S. Khupat, and S. Jain (2013), Big Data architecture and patterns, Part1: Introduction to Big Data classification and architecture. IBM Corp. [55] C. Spencer Big Data scenarios and case studies IBM Corp. [56] N. Alnajran, M. Alswilmi, and A. Dahanayake, (2014), Conceptual Framework for Big Data Analytics Solutions Proceedings of 24th International Conference on Information Modelling and Knowledge Bases (EJC 2014). [57] C. Angela (2013), Challenges of Capturing Relevant Data Umati Project. [58] F. Neck, G. and A. David (2006), Challenges and Opportunities in Internet Data Mining, Carnegie Mellon University, Pittsburgh, PA [59] X. Su and T. M. Khoshgoftaar (2009), "A Survey of Collaborative Filtering Techniques" Advances in Artificial Intelligence, Vol. 6, No. 4. [60] D. Dimitre, P. Roopa, Q. Abir and H. Jeff (2009) ISENS: A System for Information Integration, Exploration, and Querying of Multi-Ontology Data Sources, IEEE Computer Society, ICSC. Pages [61] W. Chu, D. Johnson and H. Kangarloo (2000), "A medical digital library to support scenario and user-tailored information retrieval" IEEE Trans. Inf. Technol. Biomed., Vol. 4, No. 2, Pages [62] V. Sitalakshmi and K. Sadhana (2013), Intelligent Information Retrieval and Recommender System Framework International Journal of Future Computer and Communication, Vol. 2, No. 2, Pages ISSN: [63] R. Sanjay (2013), Big Data and Hadoop with components like Flume, Pig, Hive and Jaql. International Conference on Cloud, Big Data and Trust 2013, Nov 13-15, RGPV. 85 P a g e

99 [64] Z. Z. A. Siti, M.D. Noorazida and H. H. Azizul (2010), "Extraction and classification of unstructured data in WebPages for structured multimedia database via XML" in Information Retrieval & Knowledge Management, (CAMP). International Conference on, pages [65] K. Craig and S. Pedro (2013), Semantics for Big Data Integration and Analysis In AAAI Fall Symposium Series. [66] H. Olaf, B. Christian and C. F. Johann (2009), Executing SPARQL Queries Over the Web of Linked Data The Semantic Web-ISWC. Pages [67] S. A. Catanese, P. De Meo, E. Ferrara, G. Fiumara and A. Provetti (2011), Crawling Facebook for Social Network Analysis Purposes arxiv: v1 [cs.si]. [68] T. Cao and Q. Nguyen (2012), "Semantic approach to travel information search and itinerary recommendation" International Journal of Web Information Systems, Vol. 8, No.3. Pages [69] D. Buckingham (2011), Data is the new Oil. Business Computing World co, UK. [70] Henver, S. March, J. Park, and S. Ram, (2004), Design Science in Information Systems Research. MIS Quarterly Vol. 28 No. 1. Pages [71] M. Saunders, P. Lewis, and A. Thornhill, (2012), Research Methods for Business Students. 6 th ed. England: Pearson. [72] S. Shulman, (2011), DiscoverText: Software training to unlock the power of text. Proceedings of the 12th Annual International Digital Government Research Conference: Digital Government Innovation in Challenging Times. ACM. Pages [73] F. Javed, 9 best languages for crunching data, A FastCoLabs article. [74] Occupational Outlook Quarterly (2013), Working with big data retrieved from [75] Domain [Def. 4]. (n.d.). Merriam-Webster Online. In Merriam-Webster. [76] SWE (2013) Modeling the Environment: Business Domains, Technology Groups, Archetypes, and Vignettes A Community-developed Dictionary of Software Weakness Types. [77] Scenario [Def. 4]. (n.d.). Merriam-Webster Online. In Merriam-Webster. [78] I. Jacobson, M. Christerson, P. Jonsson, and G. Overgaard (1992), Object Oriented Software Eng.: A Use-Case Driven Approach. AddisonWesley. [79] C. Potts, K. Takahashi, and A.I. Anton (1994), Inquiry-Based Requirements Analysis IEEE Software, Vol. 11, No. 2. Pages [80] A. G. Sutcliffe (1997), A Technique Combination Approach to Requirements Engineering Proceedings of the Third IEEE Int l Symp. Requirements Engineering, IEEE CS Press. Pages 65 74, [81] M. Kyng (1995), Creating Contexts for Design J.M. Carroll, ed., Scenario-Based Design: Envisioning Work and Technology in System Development. New York: John Wiley & Sons. [82] A. G. Sutcliffe, N. A. Maiden, S. Minocha, and D. Manuel (1998), Supporting Scenario-Based Requirements Engineering IEEE Transactions on Software Engineering. Vol. 24, No. 12. [83] AIHW (2008), Scenario-based evaluation of existing data collections Australia, Australian Institute of Health and Welfare. [84] Backward analysis. (n.d.). The Free On-line Dictionary of Computing. [85] S. Barksdale, and T. Lund (2006), 10 Steps to Successful Strategic Planning. Alexandria, Virginia: ASTD Press. 86 P a g e

100 [86] D. Stephenson (2013), 7 Big Data Techniques that create Business Value. Retrieved from [87] D. Zowghi, and V. Gervasi (2002), The Three Cs of Requirements: Consistency, Completeness, and Correctness. Proceedings of 8th International Workshop on Requirements Engineering: Foundation for Software Quality. [88] Spinnakr (2012), The Data Science of Digital Marketing. Retrieved from [89] B. H. Obama, Jr. (2014). The Biography.com website. [90] W. M. Romney. (2014). The Biography.com website [91] Twitter, Inc., Twitter API. The diev.twitter.com website. [92] Mayo Clinic, (2014), Ebola Virus and Marburg Virus. The Mayo Clinic website. [93] D. S. Fedson (2014), A Practical Treatment for Patients with Ebola Virus Disease, Journal of Infectious Diseases Advance Access. Vol No. 6. [94] Daily Bhaskar (2014), Saudi Arabia blocks visas from West African countries amid Ebola crisis. Retrieved from [95] S. Bennett (2014), How Much Data is generated on Twitter Every Minute. Retrieved from [96] Saher System (2014), Ministry of Interior (MOI) website. [97] M. Walsh, Imperfection the Workshop of Creation. Authorhouse, P a g e

APPENDIX A
THE UNSTRUCTURED INTERVIEW
Briefly, during the interview, questions such as the following were asked:
Q: What is your process of data collection when it comes to big data analytics?
A: Our enterprise data management software stores and manages all the internally generated data in HDFS for processing.
Q: Do you record only internal data, or do you expand your collection mechanism to retrieve external data as well?
A: We record server logs, auditing, employee records, sales transactions, and EMC internal data. No external or social data is captured.
Q: Do you record all the available data whether it matches your need or not, and then analyze it all in order to find insights, or do you perform an initial sort of data filtering according to your needs prior to collecting the data?
A: Well, we don't perform such filtering before collecting any data. This is because we only store our organizational internal data that is related to our transactions, employees, and sales in HDFS as it is being generated, which is always needful data that will be accessed one day.
Q: If yes, does this process allow you to make decisions in a timely manner, or is the data too big to be analyzed efficiently?
A: We have managed to adopt big data analytics to leverage our sales and productivity and cope with the age of data evolution. We are in the process of training our employees to use R in executing our advanced analytics, which include Naive Bayesian Classifier, K-Means Clustering, Association Rules, Decision Trees, Linear and Logistic Regression, Time Series Analysis, and Text Analytics methods. Going back to your question, some of our collected data streams are analyzed in near real-time, some types of data are batched for later processing, while some other portions of data are unfortunately stored but may never be retrieved and used.

102 APPENDIX B OTHER HADOOP COMPONENTS The aspects are explained here in a highly simplified manner. A detailed description of them can be found in [39-50]. TABLE 1B: HADOOP ECOSYSTEM COMPONENTS Type of Component Description Service Core HDFS Provides scalable and reliable data storage of massive amounts of data (data blocks are distributed among clusters) for further processing. It is suitable for applications with large and multi-structured data sets (e.g., web and social data, human generated log, and biometrics data) to provide for performing predictive analysis and pattern recognition. HDFS is possible to interact with batch data processing as well as the data in real time events (sensors or fraud) even before it lands on HDFS. MapReduce Yarn Framework for writing applications that process large amounts of structured and unstructured data in parallel by decomposing a massive job into smaller tasks and a massive data set into smaller partitions such that each task processes a different partition in parallel on commodity hardware reliably, and in a fault-tolerant manner. Framework for Hadoop data processing supports MapReduce and other programming models. It handles the resource management, security, etc.. and to allow for multiple ways to interact with the data in HDFS (batch with MapReduce, streaming with Apache Storm, interactive SQL with Apache Hive and Apache Tez). 89 P a g e

103 Tez Generalizes MapReduce to support near real-time processing. It can scale up request and meet demands for fast response times providing the suitable framework to execute near real-time processing systems. Data Pig Platform paired with MapReduce and HDFS for processing large Big Data. It performs all of the data processing by compiling its Latin scripts to produce sequences of MapReduce programs. Hive Data Warehouse that enables easy data summarization and ad-hoc queries. It also allows a mechanism for structuring the semi-structured (customer logs) and unstructured data (machine generated and transaction data) and perform queries using SQL-like language called HiveQL. Hive resides on top of MapReduce and next to Pig. HBase HCatalog A distributed, scalable, big data store with random, real time read/write access. For storing huge amounts of unstructured data, RDBMS will not be adequate as the data sets will grow and accordingly will rise issues with scaling up request since these relational databases were not designed to be distributed. Hbase (column-based), NoSQL database that allows for low-latency, quick lookups in Hadoop is needed to maintain a class of a non-relational data storage systems that supports data consistency, scalability and excellent performance. Provides centralized way for data processing systems to understand the structure and location of the data stored within Hadoop. It handles metadata management service to understand the structure and location of the data stored within HDFS and supports 90 P a g e

abstraction and is language independent, allowing the choice of different data processing tools. HCatalog acts as an adapter between Hadoop on one side and the query language frameworks on the other.
Storm: Distributed real-time computation system for processing fast, large streams of data, adding reliable real-time processing capabilities to Hadoop. The fault-tolerant and high-performance real-time computations of Storm facilitate reliably processing continuous feeds and unbounded streams of data to respond to real-time events as they happen.
Mahout: Provides scalable machine learning algorithms to support clustering, classification and batch-based collaborative filtering. This data-mining library provides algorithms for clustering unstructured data, collaborative filtering, regression testing, and statistical modeling to analyze insights. It has the potential to pull in vast troves of data from exposed social media sites and make far-reaching conclusions.
Accumulo: High-performance data storage and retrieval system with cell-level access control. It works on top of Hadoop and ZooKeeper and supports high-performance storage and retrieval of data to allow predictive analysis.
Flume: Allows efficiently aggregating and moving large amounts of log and stream data from many different sources into HDFS for storage.
Sqoop: Open-source tool that allows users to extract data from a relational database into Hadoop for further processing. It speeds and eases the movement

of monitoring-generated data from different smart meter databases.
Operational components:
ZooKeeper: Coordinates distributed processes. Stores and mediates updates to important configuration information. It handles configuration and synchronization services and the management layer of the Hadoop platform. It is necessary to enable Accumulo to work efficiently.
Ambari: Manages and monitors Hadoop clusters.
Oozie: Schedules Hadoop jobs. Combines multiple jobs sequentially into one logical unit.
Falcon: Data management framework for simplifying data lifecycle management and processing pipelines on Hadoop.
Knox: System that provides a single point of authentication and access for Hadoop services in a cluster, simplifying Hadoop security for users and operators.
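The decomposition that MapReduce performs, as described in Table 1B, can be illustrated with a minimal in-memory word-count sketch; this is plain Python for illustration, not Hadoop code.

```python
# In-memory illustration of the map / shuffle / reduce decomposition that
# MapReduce performs (a word-count analogy); plain Python, not Hadoop code.
from collections import defaultdict
from itertools import chain

def map_phase(partition):
    """Emit (word, 1) pairs for one data partition."""
    return [(word.lower(), 1) for line in partition for word in line.split()]

def shuffle(mapped_pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each key's values into a final count."""
    return {key: sum(values) for key, values in groups.items()}

partitions = [["ebola scare in saudi arabia"], ["ebola virus ebola panic"]]
mapped = chain.from_iterable(map_phase(p) for p in partitions)
print(reduce_phase(shuffle(mapped)))  # {'ebola': 3, 'scare': 1, ...}
```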

APPENDIX C
DIMENSIONS OF THE W*H MODEL FOR SERVICES
The W*H model bridges the gap between the main service design initiatives and their abstraction-level interpretations. It introduces four dimensions that specify a full description of IT services [6].
1. The Annotation dimension covers the Party and Activity dimensions.
a. Party: this dimension describes the stakeholders that are requesting the service by identifying the supplier, consumer, and producer. These factors are declared by answering the by whom, to whom, and whichever questions.
b. Activity: this dimension describes the actions processed during service execution by identifying the input and the output of the service. It is declared by answering the what in and what out questions.
2. The Content dimension describes the supporting means through the application domain dimension. The application domain covers the area where the service will be used, the case, the problem to be solved by the service, the organizational unit that requests the service, the events that trigger the service for execution, and the IT. These factors are captured through answering questions such as wherein, where, for what, wherefrom, whence, what, and how.
3. The Concept dimension describes the need for and purpose of the required service. It is declared by answering why, whereto, for when, and for which reason.
4. The Added value dimension describes the surplus value a potential user is expected to gain by using the service, through defining its context characteristics. The context captures the provider, developer or supplier context for a service, the user context for a service, the environment that must exist for service utilization, and the coexistence context for a service within a set of services. These context dimensions are declared by answering whereat, whereabout, whither, and when.
Figure 1C combines the service dimensions into a general W*H framework [6].

FIGURE 1C. THE W*H INQUIRY-BASED CONCEPTUAL MODEL FOR SERVICES
MAPPING OF THE W*H SERVICE MODEL TO THE BIG DATA COLLECTION DOMAIN
The success of W*H as a requirements specification framework for services inspires the requirements specification framework for scenario-based big data collection. W*H performs as an inquiry system for service requirements that are specified before designing and integrating the service into the actual environment, in order to fulfill the usefulness, usage, and usability requirements for service systems. The questions describing the dimensions of a service in the W*H framework provide a well-defined basis for using a similar set of questions to guide the data collection process in the big data domain.

Each dimension of the service in the W*H model inspires the factors that describe the requirements of the data collection and help eliminate the collection of irrelevant and unwanted data. The mapping of the dimensions from the W*H model to the proposed framework for big data collection based on scenarios of interest appears as follows: The ends or purpose in the W*H model, which captures the purpose and usefulness of the service, is mapped to the scenario dimension that describes the scenario of interest or the business problem that drives an inquiry for the data collection. It is described by defining the problem, the domain of the business scenario, the period in which to start and end fetching the data, and the motivation behind the data collection process. These factors are captured through answering the why, whereto, for when, and for which reason questions respectively. The sources of the service, which refer to the service environment in the W*H model, are mapped in the proposed framework to the sources that generate and hold the data. These sources in the requirements framework for data collection are described by capturing the minimum amount of data that is sufficient for real-time decision support, the type of the data, the location where the data exists, and the consumer and the provider of the data. These factors are captured through answering the how much, in what format, where, to whom, and where from questions. The supporting means, which describes the application domain properties that a user must know for service utilization in the W*H model, is mapped to the search patterns in the proposed framework. These search patterns refer to the patterns, phrases and keywords that are relevant to the scenario of interest. This dimension captures the Supporting needs and the Activity dimensions. o The supporting needs describe the technical requirements of the data collection process. They capture the collecting and analyzing technique, the frequency of collecting data batches, and whether the processing will take place in real time or be batched for later processing. These factors are specified through answering the what, how, whence, and whether questions. o The activity, which describes the processes played during service application in the W*H model, inspires and is mapped to the activity dimension in the proposed framework. This activity describes the input to the data

collection process and the expected output. It is specified through answering the what in and what out questions. The surplus value, which describes the worthiness that service utilization may provide to the user in the W*H model, gives a good basis for specifying the data collection requirements. It is therefore mapped to the value dimension in the proposed framework. This value describes the data collection worthiness in terms of the scenario context (the value provided to the business domain) and the time context (on-demand or continuous data collection). It is specified through answering the when and where about questions. A summary mapping the W*H primary, secondary, and additional questions to the Scenario-based Big Data Collection Requirements Framework appears in table 1C.
TABLE 1C: MAPPING W*H MODEL QS TO SCENARIO-BASED DATA COLLECTION QS
W*H Model Questions -> Scenario-based Data Collection
Ends or Purpose (Why? Where to? For when? For which reason?) -> Scenario (Why? Where to? For when? For which reason?)
Supporting means (Wherein? Wherefrom? For what? Where? Whence? What? How?) -> Search Patterns (What? How? Whence? Whether? What in? What out?)
Sources (By whom? To whom? Whichever? What in? What out?) -> Sources (Where from? To whom? In what format? Where? How much?)
Surplus value (Where at? Where about? Whither? When?) -> Value (Where about? When?)
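For illustration, the mapping in table 1C can also be expressed as a simple lookup structure, which a tool could use to translate W*H answers into the scenario-based data collection dimensions; this sketch is an assumption about one possible representation, not part of the framework itself.

```python
# One possible representation of the mapping in table 1C: each W*H dimension
# points to the scenario-based data collection dimension that reuses or
# replaces its questions. An illustrative sketch only.
WH_TO_SCENARIO_BASED = {
    "Ends or Purpose": {
        "maps_to": "Scenario",
        "questions": ["Why?", "Where to?", "For when?", "For which reason?"],
    },
    "Supporting means": {
        "maps_to": "Search Patterns",
        "questions": ["What?", "How?", "Whence?", "Whether?",
                      "What in?", "What out?"],
    },
    "Sources": {
        "maps_to": "Sources",
        "questions": ["Where from?", "To whom?", "In what format?",
                      "Where?", "How much?"],
    },
    "Surplus value": {
        "maps_to": "Value",
        "questions": ["Where about?", "When?"],
    },
}
```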

FIGURE 2C. THE REQUIREMENTS SPECIFICATION FRAMEWORK FOR SCENARIO-BASED BIG DATA COLLECTION

APPENDIX D
COLLECTING DATA IN DISCOVERTEXT
1. The DiscoverText Dashboard. It has a free trial version for one month that requires only a login to start collecting data.
FIGURE 1D. DISCOVERTEXT DASHBOARD
2. Start a new project.
FIGURE 2D. START A NEW PROJECT

3. Name your project.
FIGURE 3D. NAME YOUR PROJECT
4. Import data from the project details screen.
FIGURE 4D. IMPORT DATA


Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Are You Ready for Big Data?

Are You Ready for Big Data? Are You Ready for Big Data? Jim Gallo National Director, Business Analytics February 11, 2013 Agenda What is Big Data? How do you leverage Big Data in your company? How do you prepare for a Big Data initiative?

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

The Future of Data Management

The Future of Data Management The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class

More information

ANALYTICS BUILT FOR INTERNET OF THINGS

ANALYTICS BUILT FOR INTERNET OF THINGS ANALYTICS BUILT FOR INTERNET OF THINGS Big Data Reporting is Out, Actionable Insights are In In recent years, it has become clear that data in itself has little relevance, it is the analysis of it that

More information

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BIG DATA TECHNOLOGY. Hadoop Ecosystem BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big

More information

Buyer s Guide to Big Data Integration

Buyer s Guide to Big Data Integration SEPTEMBER 2013 Buyer s Guide to Big Data Integration Sponsored by Contents Introduction 1 Challenges of Big Data Integration: New and Old 1 What You Need for Big Data Integration 3 Preferred Technology

More information

TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP

TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP Pythian White Paper TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP ABSTRACT As companies increasingly rely on big data to steer decisions, they also find themselves looking for ways to simplify

More information

Reference Architecture, Requirements, Gaps, Roles

Reference Architecture, Requirements, Gaps, Roles Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture

More information

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

Chapter 6. Foundations of Business Intelligence: Databases and Information Management Chapter 6 Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

Data Refinery with Big Data Aspects

Data Refinery with Big Data Aspects International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data

More information

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012. Viswa Sharma Solutions Architect Tata Consultancy Services

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012. Viswa Sharma Solutions Architect Tata Consultancy Services Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012 Viswa Sharma Solutions Architect Tata Consultancy Services 1 Agenda What is Hadoop Why Hadoop? The Net Generation is here Sizing the

More information

The 4 Pillars of Technosoft s Big Data Practice

The 4 Pillars of Technosoft s Big Data Practice beyond possible Big Use End-user applications Big Analytics Visualisation tools Big Analytical tools Big management systems The 4 Pillars of Technosoft s Big Practice Overview Businesses have long managed

More information

Luncheon Webinar Series May 13, 2013

Luncheon Webinar Series May 13, 2013 Luncheon Webinar Series May 13, 2013 InfoSphere DataStage is Big Data Integration Sponsored By: Presented by : Tony Curcio, InfoSphere Product Management 0 InfoSphere DataStage is Big Data Integration

More information

Exploiting Data at Rest and Data in Motion with a Big Data Platform

Exploiting Data at Rest and Data in Motion with a Big Data Platform Exploiting Data at Rest and Data in Motion with a Big Data Platform Sarah Brader, sarah_brader@uk.ibm.com What is Big Data? Where does it come from? 12+ TBs of tweet data every day 30 billion RFID tags

More information

Big Data Challenges and Success Factors. Deloitte Analytics Your data, inside out

Big Data Challenges and Success Factors. Deloitte Analytics Your data, inside out Big Data Challenges and Success Factors Deloitte Analytics Your data, inside out Big Data refers to the set of problems and subsequent technologies developed to solve them that are hard or expensive to

More information

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

More information

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data INFO 1500 Introduction to IT Fundamentals 5. Database Systems and Managing Data Resources Learning Objectives 1. Describe how the problems of managing data resources in a traditional file environment are

More information

Big Data: Opportunities & Challenges, Myths & Truths 資 料 來 源 : 台 大 廖 世 偉 教 授 課 程 資 料

Big Data: Opportunities & Challenges, Myths & Truths 資 料 來 源 : 台 大 廖 世 偉 教 授 課 程 資 料 Big Data: Opportunities & Challenges, Myths & Truths 資 料 來 源 : 台 大 廖 世 偉 教 授 課 程 資 料 美 國 13 歲 學 生 用 Big Data 找 出 霸 淩 熱 點 Puri 架 設 網 站 Bullyvention, 藉 由 分 析 Twitter 上 找 出 提 到 跟 霸 凌 相 關 的 詞, 搭 配 地 理 位 置

More information

Big Data. Fast Forward. Putting data to productive use

Big Data. Fast Forward. Putting data to productive use Big Data Putting data to productive use Fast Forward What is big data, and why should you care? Get familiar with big data terminology, technologies, and techniques. Getting started with big data to realize

More information

Unlocking The Value of the Deep Web. Harvesting Big Data that Google Doesn t Reach

Unlocking The Value of the Deep Web. Harvesting Big Data that Google Doesn t Reach Unlocking The Value of the Deep Web Harvesting Big Data that Google Doesn t Reach Introduction Every day, untold millions search the web with Google, Bing and other search engines. The volumes truly are

More information

WHITE PAPER ON. Operational Analytics. HTC Global Services Inc. Do not copy or distribute. www.htcinc.com

WHITE PAPER ON. Operational Analytics. HTC Global Services Inc. Do not copy or distribute. www.htcinc.com WHITE PAPER ON Operational Analytics www.htcinc.com Contents Introduction... 2 Industry 4.0 Standard... 3 Data Streams... 3 Big Data Age... 4 Analytics... 5 Operational Analytics... 6 IT Operations Analytics...

More information

Approaches for parallel data loading and data querying

Approaches for parallel data loading and data querying 78 Approaches for parallel data loading and data querying Approaches for parallel data loading and data querying Vlad DIACONITA The Bucharest Academy of Economic Studies diaconita.vlad@ie.ase.ro This paper

More information

White Paper. Version 1.2 May 2015 RAID Incorporated

White Paper. Version 1.2 May 2015 RAID Incorporated White Paper Version 1.2 May 2015 RAID Incorporated Introduction The abundance of Big Data, structured, partially-structured and unstructured massive datasets, which are too large to be processed effectively

More information

Navigating Big Data business analytics

Navigating Big Data business analytics mwd a d v i s o r s Navigating Big Data business analytics Helena Schwenk A special report prepared for Actuate May 2013 This report is the third in a series and focuses principally on explaining what

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

Big Data and Analytics: Challenges and Opportunities

Big Data and Analytics: Challenges and Opportunities Big Data and Analytics: Challenges and Opportunities Dr. Amin Beheshti Lecturer and Senior Research Associate University of New South Wales, Australia (Service Oriented Computing Group, CSE) Talk: Sharif

More information

Software Engineering for Big Data. CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo

Software Engineering for Big Data. CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo Software Engineering for Big Data CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo Big Data Big data technologies describe a new generation of technologies that aim

More information

Hadoop Big Data for Processing Data and Performing Workload

Hadoop Big Data for Processing Data and Performing Workload Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON BIG DATA MANAGEMENT AND ITS SECURITY PRUTHVIKA S. KADU 1, DR. H. R.

More information

Interactive data analytics drive insights

Interactive data analytics drive insights Big data Interactive data analytics drive insights Daniel Davis/Invodo/S&P. Screen images courtesy of Landmark Software and Services By Armando Acosta and Joey Jablonski The Apache Hadoop Big data has

More information

The Next Wave of Data Management. Is Big Data The New Normal?

The Next Wave of Data Management. Is Big Data The New Normal? The Next Wave of Data Management Is Big Data The New Normal? Table of Contents Introduction 3 Separating Reality and Hype 3 Why Are Firms Making IT Investments In Big Data? 4 Trends In Data Management

More information

Big Data Explained. An introduction to Big Data Science.

Big Data Explained. An introduction to Big Data Science. Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of

More information

Data Warehousing in the Age of Big Data

Data Warehousing in the Age of Big Data Data Warehousing in the Age of Big Data Krish Krishnan AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD * PARIS SAN DIEGO SAN FRANCISCO SINGAPORE SYDNEY TOKYO Morgan Kaufmann is an imprint of Elsevier

More information

Big Data a threat or a chance?

Big Data a threat or a chance? Big Data a threat or a chance? Helwig Hauser University of Bergen, Dept. of Informatics Big Data What is Big Data? well, lots of data, right? we come back to this in a moment. certainly, a buzz-word but

More information

SPATIAL DATA CLASSIFICATION AND DATA MINING

SPATIAL DATA CLASSIFICATION AND DATA MINING , pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

More information

White Paper: What You Need To Know About Hadoop

White Paper: What You Need To Know About Hadoop CTOlabs.com White Paper: What You Need To Know About Hadoop June 2011 A White Paper providing succinct information for the enterprise technologist. Inside: What is Hadoop, really? Issues the Hadoop stack

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A SURVEY ON BIG DATA ISSUES AMRINDER KAUR Assistant Professor, Department of Computer

More information

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum Siva Ravada Senior Director of Development Oracle Spatial and MapViewer 2 Evolving Technology Platforms

More information

The 3 questions to ask yourself about BIG DATA

The 3 questions to ask yourself about BIG DATA The 3 questions to ask yourself about BIG DATA Do you have a big data problem? Companies looking to tackle big data problems are embarking on a journey that is full of hype, buzz, confusion, and misinformation.

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli (alberto.ceselli@unimi.it)

More information

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme Big Data Analytics Prof. Dr. Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie,

More information

Trustworthiness of Big Data

Trustworthiness of Big Data Trustworthiness of Big Data International Journal of Computer Applications (0975 8887) Akhil Mittal Technical Test Lead Infosys Limited ABSTRACT Big data refers to large datasets that are challenging to

More information

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012 Big Data Buzzwords From A to Z By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012 Big Data Buzzwords Big data is one of the, well, biggest trends in IT today, and it has spawned a whole new generation

More information

Are You Ready for Big Data?

Are You Ready for Big Data? Are You Ready for Big Data? Jim Gallo National Director, Business Analytics April 10, 2013 Agenda What is Big Data? How do you leverage Big Data in your company? How do you prepare for a Big Data initiative?

More information

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image

More information

Where is... How do I get to...

Where is... How do I get to... Big Data, Fast Data, Spatial Data Making Sense of Location Data in a Smart City Hans Viehmann Product Manager EMEA ORACLE Corporation August 19, 2015 Copyright 2014, Oracle and/or its affiliates. All rights

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

Bringing Big Data into the Enterprise

Bringing Big Data into the Enterprise Bringing Big Data into the Enterprise Overview When evaluating Big Data applications in enterprise computing, one often-asked question is how does Big Data compare to the Enterprise Data Warehouse (EDW)?

More information

DAMA NY DAMA Day October 17, 2013 IBM 590 Madison Avenue 12th floor New York, NY

DAMA NY DAMA Day October 17, 2013 IBM 590 Madison Avenue 12th floor New York, NY Big Data Analytics DAMA NY DAMA Day October 17, 2013 IBM 590 Madison Avenue 12th floor New York, NY Tom Haughey InfoModel, LLC 868 Woodfield Road Franklin Lakes, NJ 07417 201 755 3350 tom.haughey@infomodelusa.com

More information

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM QlikView Technical Case Study Series Big Data June 2012 qlikview.com Introduction This QlikView technical case study focuses on the QlikView deployment

More information

Data Isn't Everything

Data Isn't Everything June 17, 2015 Innovate Forward Data Isn't Everything The Challenges of Big Data, Advanced Analytics, and Advance Computation Devices for Transportation Agencies. Using Data to Support Mission, Administration,

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

NTT DATA Big Data Reference Architecture Ver. 1.0

NTT DATA Big Data Reference Architecture Ver. 1.0 NTT DATA Big Data Reference Architecture Ver. 1.0 Big Data Reference Architecture is a joint work of NTT DATA and EVERIS SPAIN, S.L.U. Table of Contents Chap.1 Advance of Big Data Utilization... 2 Chap.2

More information

Big Data. A general approach to process external multimedia datasets. David Mera

Big Data. A general approach to process external multimedia datasets. David Mera Big Data A general approach to process external multimedia datasets David Mera Laboratory of Data Intensive Systems and Applications (DISA) Masaryk University Brno, Czech Republic 7/10/2014 Table of Contents

More information

Taking Data Analytics to the Next Level

Taking Data Analytics to the Next Level Taking Data Analytics to the Next Level Implementing and Supporting Big Data Initiatives What Is Big Data and How Is It Applicable to Anti-Fraud Efforts? 2 of 20 Definition Gartner: Big data is high-volume,

More information

Enhance Collaboration and Data Sharing for Faster Decisions and Improved Mission Outcome

Enhance Collaboration and Data Sharing for Faster Decisions and Improved Mission Outcome Enhance Collaboration and Data Sharing for Faster Decisions and Improved Mission Outcome Richard Breakiron Senior Director, Cyber Solutions Rbreakiron@vion.com Office: 571-353-6127 / Cell: 803-443-8002

More information

The big data revolution

The big data revolution The big data revolution Friso van Vollenhoven (Xebia) Enterprise NoSQL Recently, there has been a lot of buzz about the NoSQL movement, a collection of related technologies mostly concerned with storing

More information

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Datenverwaltung im Wandel - Building an Enterprise Data Hub with Datenverwaltung im Wandel - Building an Enterprise Data Hub with Cloudera Bernard Doering Regional Director, Central EMEA, Cloudera Cloudera Your Hadoop Experts Founded 2008, by former employees of Employees

More information

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

White Paper. How Streaming Data Analytics Enables Real-Time Decisions White Paper How Streaming Data Analytics Enables Real-Time Decisions Contents Introduction... 1 What Is Streaming Analytics?... 1 How Does SAS Event Stream Processing Work?... 2 Overview...2 Event Stream

More information

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing

More information

TRENDS IN THE DEVELOPMENT OF BUSINESS INTELLIGENCE SYSTEMS

TRENDS IN THE DEVELOPMENT OF BUSINESS INTELLIGENCE SYSTEMS 9 8 TRENDS IN THE DEVELOPMENT OF BUSINESS INTELLIGENCE SYSTEMS Assist. Prof. Latinka Todoranova Econ Lit C 810 Information technology is a highly dynamic field of research. As part of it, business intelligence

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology,

More information

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume

More information

There s no way around it: learning about Big Data means

There s no way around it: learning about Big Data means In This Chapter Chapter 1 Introducing Big Data Beginning with Big Data Meeting MapReduce Saying hello to Hadoop Making connections between Big Data, MapReduce, and Hadoop There s no way around it: learning

More information

The Rise of Industrial Big Data

The Rise of Industrial Big Data GE Intelligent Platforms The Rise of Industrial Big Data Leveraging large time-series data sets to drive innovation, competitiveness and growth capitalizing on the big data opportunity The Rise of Industrial

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014 White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page

More information

Danny Wang, Ph.D. Vice President of Business Strategy and Risk Management Republic Bank

Danny Wang, Ph.D. Vice President of Business Strategy and Risk Management Republic Bank Danny Wang, Ph.D. Vice President of Business Strategy and Risk Management Republic Bank Agenda» Overview» What is Big Data?» Accelerates advances in computer & technologies» Revolutionizes data measurement»

More information

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

How to Enhance Traditional BI Architecture to Leverage Big Data

How to Enhance Traditional BI Architecture to Leverage Big Data B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...

More information

Search Engine Marketing Analytics with Hadoop. Ecosystem: Case Study for Red Engine Digital Agency

Search Engine Marketing Analytics with Hadoop. Ecosystem: Case Study for Red Engine Digital Agency Search Engine Marketing Analytics with Hadoop Ecosystem: Case Study for Red Engine Digital Agency Mia Alowaish Northwestern University MaiAlowaish2012@u.northwestern.edu Sunil Kakade Northwestern University

More information

REAL-TIME OPERATIONAL INTELLIGENCE. Competitive advantage from unstructured, high-velocity log and machine Big Data

REAL-TIME OPERATIONAL INTELLIGENCE. Competitive advantage from unstructured, high-velocity log and machine Big Data REAL-TIME OPERATIONAL INTELLIGENCE Competitive advantage from unstructured, high-velocity log and machine Big Data 2 SQLstream: Our s-streaming products unlock the value of high-velocity unstructured log

More information

HDP Hadoop From concept to deployment.

HDP Hadoop From concept to deployment. HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some

More information

IJRCS - International Journal of Research in Computer Science ISSN: 2349-3828

IJRCS - International Journal of Research in Computer Science ISSN: 2349-3828 ISSN: 2349-3828 Implementing Big Data for Intelligent Business Decisions Dr. V. B. Aggarwal Deepshikha Aggarwal 1(Jagan Institute of Management Studies, Delhi, India, vbaggarwal@jimsindia.org) 2(Jagan

More information