A Requirements Acquisition Tool Architecture for the Decision Back Approach to Social Media Big Data Capturing


A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Masters of Science in Software Engineering
At the College of Computer and Information Sciences
At Prince Sultan University

By: Mashail A. Alswilmi

May 2015

By Mashail A. Alswilmi

This thesis was defended on 25th May 2015.

Supervisor: Prof. Dr. Ajantha Dahanayake

Members of the Exam Committee:
Prof. Dr. Ajantha Dahanayake (Chair)
Dr. Areej Alwabil (Member)
Dr. Sarab AlMuhaideb (Member)

ACKNOWLEDGMENTS

First and foremost, praise and thanks to Allah for His showers of blessings throughout my research work, which enabled me to complete it successfully.

I would like to express my deep and sincere gratitude to my research supervisor, Prof. Ajantha Dahanayake, for her continuous support of my master's study and research, for giving me the opportunity to do research, and for her invaluable guidance, patience, motivation, enthusiasm, and immense knowledge throughout this research. She taught me the methodology to carry out the research and to present it as clearly as possible. It was a great privilege and honor to work and study under her guidance. Her dynamism, vision, sincerity and motivation have deeply inspired me, and I am extremely grateful for what she has offered me.

I am extremely grateful to my parents for their love, prayers, care and sacrifices in educating me and preparing me for my future. I am very thankful to my husband and my son for their love, understanding, prayers and continuing support in completing this research. I also wish to thank my sisters and brothers for their support and valuable prayers.

Finally, my thanks extend to all those who supported me in completing the research project, either directly or indirectly.

Abstract

This master's thesis utilizes a decision-back concept to optimize the process of social media data collection. Leveraging this type of Big Data goes beyond the capabilities of traditional data capturing techniques because of its large volume, velocity, variety, and veracity. Comprehensively analyzing the properties of the problem at hand and determining the analysis needs upfront, before data collection, reduces the chance of being overwhelmed by masses of irrelevant data and helps users and businesses make management decisions and answer mission-critical questions in an efficient and timely manner. Therefore, this master's thesis develops an architecture for a requirements acquisition tool that applies a decision-back approach to capture social media data analyzing requirements.

The tool captures the requirements by asking a set of questions in multiple phases. In the first phase, the Problem Domain set of questions, the system analyzes the user's answers with an NLP technique to extract keywords and time and location constraints. In the second phase, the Data Source set of questions, the system analyzes the user's selections with a data source recommendation system to recommend the most suitable data source. In the final phase, the Analytical Tool set of questions, the system analyzes the user's selections with an analytical tool recommendation system to recommend the most suitable analytical tool. The tool's outputs are keywords, time and location constraints, a recommended data source, and a recommended analytical tool.

The tool is validated for the correctness and efficiency quality factors through an experiment that compares data collection for social media analytics with and without the tool. The experiment showed that the average rates of improvement for correctness and efficiency increased after using the tool.

The main contribution of this research is the design of a value-added and well-defined process to capture social media data analyzing requirements upfront, before data collection, in order to accelerate the analytics tasks. The requirements acquisition tool also contributes to: 1) the requirements engineering field, by building a tool that helps users capture their requirements prior to the data collection process in social media data analytics; and 2) the software engineering field, by providing a user-centered solution that captures the user's social media data analyzing needs within a user-friendly environment.

Abstract in Arabic (English translation)

This master's thesis uses the decision-back concept to improve the process of collecting social media data. Leveraging this type of Big Data goes beyond what traditional data collection techniques offer, owing to its volume, velocity, variety, and veracity. A comprehensive decision-back analysis, which analyzes the properties of the problem at hand and determines the analysis needs before data collection begins, reduces the chance of being swamped by masses of irrelevant data and helps users and businesses make management decisions and answer mission-critical questions efficiently and in a timely manner. This master's thesis therefore develops the architecture of a requirements acquisition tool that applies the decision-back concept to the collection of social media data.

The tool gathers the requirements through a set of questions in several phases. The first phase, the Problem Domain set of questions, provides the basis for analyzing the user's answers with natural language processing (NLP) to extract search keywords and time and location constraints. In the second phase, the Data Source set of questions, the user's selections are analyzed by a data source recommendation system. In the final phase, the Analytical Tool set of questions, the user's selections are analyzed by an analytical tool recommendation system. The tool's outputs are the search keywords, the time and location constraints, the recommended data source, and the recommended analytical tool.

The tool was validated for the correctness and efficiency factors through an experiment that compares social media data analysis with and without the tool. The experiment showed that the average rate of improvement for the correctness and efficiency factors rose after using the tool.

The main contribution of this research is the design of a value-added, well-defined process for gathering social media data analysis requirements before data collection, in order to accelerate the analysis tasks. The requirements acquisition tool also contributes to: 1) the field of software requirements engineering, by building a tool that provides a set of questions helping the user capture his needs before the data collection process in social media data analytics; and 2) the field of software engineering, by providing a user-centered solution that serves users' social media data analysis needs in an easy and comfortable environment.

Table of Contents

Acknowledgment
Abstract
Abstract in Arabic
Table of Contents
List of Figures
List of Tables
List of Appendix Figures
List of Appendix Tables
List of Abbreviations

Chapter 1: Introduction
  Introduction
  Motivation
  Definition of Big Data
  Social Media Big Data
  Definition of Social Media
  Data Capturing and Analyzing Challenges of Social Media
  Requirements Engineering for Social Media Big Data Analytics
  Problem Statement
  Research Questions and Objectives
  1.8. Scope of the Thesis
  Related Published Paper
  Outline of the Thesis

Chapter 2: Research Methods
  Research Methods
  Research Design
  Research Participants
  Research Techniques and Data Analysis
  Research Work Packages
  Research Instruments and Procedures
  Social Mention
  Trackur

Chapter 3: Literature Review
  Big Data and Social Media Related Works
  Innovative Big Data and Data Capturing Approaches
  Literature Analysis
  Innovative Social Media Data Collection and Analytics Approaches
  Literature Analysis
  Software Engineering and Social Media Data Analytics
  Reverse Engineering
  Software Requirements Engineering
  Decision-back Data Capturing Approach
  Literature Analysis
  3.4. Related Tools and Environments for Social Media Data Analytics
  Hadoop The Big Data Management Framework
  Apache Hadoop
  Literature Analysis
  Theories and Frameworks
  W*H Conceptual Model for Services
  Stanford CoreNLP Framework
  Summary

Chapter 4: Social Media Types and Analytical Techniques
  Social Media Types
  Social Media Sites Categorizations
  Social Networking
  Microblogging
  Blogging
  Photo Sharing
  Video Sharing
  Social Media Sites Examples
  Facebook
  Twitter
  LinkedIn
  Google+
  Summary of Social Media Sites Characteristics
  Social Media Analytical Tools
  Social Listening Software / Social Media Monitoring Software
  Social Conversation Software / Social Media Engagement Software, Social Media Management Software
  Social Marketing Software / Social Media Management Software
  Social Analytics Software
  Social Influencer Software
  Social Media Analytical Tools Examples

Chapter 5: Decision-Back Data Capturing Approach for Social Media Data
  Backward Analysis
  Capturing Social Media Data Plan
  The Conceptual Model
  Identification of the Problem Domain
  Identification of the Data Source and the Analytical Tool
  W*H Conceptual Model for Services
  Defining the Social Media Data Capturing Model
  Tool Architecture and Design
  Tool Layers
  Data Ingest Module (Presentation Layer)
  Data Analysis Module (Middle Layer)
  Database Layer
  Tool's User Interface Design
  Part 1: Problem Domain
  Part 2: Data Source
  Part 3: Analytical Tool

Chapter 6: Case Study
  Stanford CoreNLP Tool
  Part of Speech Tagger
  Named Entity Recognizer
  Case 1: Start On-Line Business Project
    Problem Description
    Tool Application
  Case 2: A Saving Lincoln Movie Promotion
    Problem Description
    Tool Application
  Case 3: YouTube Music Channel Promotion
    Problem Description
    Tool Application
  Case 4: Middle East Respiratory Syndrome Awareness
    Problem Description
    Tool Application
  Case 5: DAESH Terrorist Movement
    Problem Description
    Tool Application

Chapter 7: Tool Experiment and Validation
  Purpose of the Experiment
  Design and the Scope of the Experiment
  7.3 Experiment
  Case 1: Start On-Line Business Project
    Without Tool
    With Tool
    Results
  Case 4: Middle East Respiratory Syndrome Awareness
    Without Tool
    With Tool
    Results
  Case 5: Start On-Line Business Project
    Without Tool
    With Tool
    Results
  Results Comparison
  Rate of Improvements (ROI)
  Unpaired T-Test

Chapter 8: Discussion
  Analysis of Research Outcomes
  Resulting outcome of tool
  Tool Evaluation Case Studies
  Tool Validation Experiment

Chapter 9: Conclusion and Future Work
  Conclusion
  9.2 Limitations
  Limited Number of Cases
  Limited Databases tools and Social Media sites
  Experiment and Validation Lack of Generalizability
  Limited Quality Factors Validation
  The Limited use of NLP Tool
  Limited Number of Cases in the Experiment
  Future Work Directions

References

Appendices
  Appendix A. Glossary
  Appendix B. Hadoop Components
  Appendix C. Dimensions of the W*H Model for Services
  Appendix D. Analytical Tools Database
  Appendix E. Related published papers

List of Figures

Figure 1.1: The Interest for the Term "Big Data" on Google Feb,
Figure 1.2: Timeline of the Launch Dates of Many Major Social Networks Sites and Dates Until 2005 [21]
Figure 1.3: Social Media Analytics Life Cycle
Figure 2.1: Thesis Work Packages (WPs)
Figure 3.1: Hortonworks Data Platform
Figure 5.1: Decision Back Approach Applied in the Analysis Process
Figure 5.2: The Conceptual Model of the Decision Back Capturing Approach
Figure 5.3: The W*H Service Description Model [34]
Figure 5.4: Refined Model for Decision-Back Approach for Capturing Social Media Data Analyzing Requirements
Figure 5.5: The 4+1 View Model [80]
Figure 5.6: Requirements Acquisition Tool Architecture for Decision-Back Approach for Capturing Social Media Data Analyzing Requirements
Figure 5.7: NLP Analysis Subsystem
Figure 5.8: Stanford CoreNLP Example
Figure 5.9: Data Source Recommendation Subsystem
Figure 5.10: Analytical Tool Recommendation Subsystem
Figure 5.11: Tool Interface Design (Home Page)
Figure 5.12: Tool Interface Design (Process Part1)
Figure 5.13: Tool Interface Design (Process Part2)
Figure 5.14: Tool Interface Design (Process Part3)
Figure 5.15: Tool Interface Design (Result)
Figure 6.1: Annotation Guidelines [96]
Figure 6.2: Part of Speech NLP - Case
Figure 6.3: Named Entity Recognition NLP - Case
Figure 6.4: Part of Speech NLP - Case
Figure 6.5: Named Entity Recognition NLP - Case
Figure 6.6: Part of Speech NLP - Case
Figure 6.7: Named Entity Recognition NLP - Case
Figure 6.8: Part of Speech NLP - Case
Figure 6.9: Named Entity Recognition NLP - Case
Figure 6.10: Part of Speech NLP - Case
Figure 6.11: Named Entity Recognition NLP - Case
Figure 7.1: Experiment Time Recording Log
Figure 7.2: Snapshot of Trackur - Case 1 Without Tool
Figure 7.3: Time Recording Log Sheet - Case 1 Without Tool
Figure 7.4: Snapshot of Social Mention - Case 1 With Tool
Figure 7.5: Time Recording Log Sheet - Case 1 With Tool
Figure 7.6: Quality Factors Comparison - Case
Figure 7.7: Time Recording Log Sheet - Case 4 Without Tool
Figure 7.8: Time Recording Log Sheet - Case 4 With Tool
Figure 7.9: Quality Factors Comparison - Case
Figure 7.10: Time Recording Log Sheet - Case 5 Without Tool
Figure 7.11: Time Recording Log Sheet for Case 5 - With Tool
Figure 7.12: Quality Factors Comparison - Case
Figure 7.13: Quality Factors Comparison Chart Average Results
Figure 8.1: Experiment Summary - Correctness Factor Comparison
Figure 8.3: Experiment Summary - Efficiency Factor Comparison

List of Tables

Table 4.1: Summary of Social Media Sites Categorization Based on their Functionalities
Table 4.2: Facebook Information [70]
Table 4.3: Twitter Information [71]
Table 4.4: LinkedIn Information [71]
Table 4.5: Google+ Information [69]
Table 4.6: Social Media Sites Characteristics Summary
Table 4.7: Social Media Analytical Tools' Characteristics Example
Table 7.1: Keywords Relevant Feeds Numbers - Case 1 Without Tool
Table 7.2: Keywords Relevant Feeds Numbers - Case 1 With Tool
Table 7.3: Keywords Relevant Feeds Numbers - Case 4 Without Tool
Table 7.4: Keywords Relevant Feeds Numbers - Case 4 With Tool
Table 7.5: Keywords Relevant Feeds Numbers - Case 5 Without Tool
Table 7.6: Keywords Relevant Feeds Numbers - Case 5 With Tool
Table 7.7: Results Comparison
Table 7.8: Correctness and Efficiency Rate of Improvements (ROI)
Table 7.9: Experiment Data Sample

List of Appendix Figures

Figure C1: The W*H Inquiry Based Conceptual Model for Services

List of Appendix Tables

Table B1: Hadoop Ecosystem Components [2][16]
Table D1: Analytical Tools Database

List of Abbreviations

SDLC      Software Development Life Cycle
RE        Requirements Engineering
HDFS      Hadoop Distributed File System
HDP       Hortonworks Data Platform
YARN      Yet Another Resource Negotiator
NOSQL     Not Only SQL
SM        Social Media
ETL       Extract, Transform, and Load
NLP       Natural Language Processing
MIDIS     Multi-Intelligence Data Integration Services
SNAP      Stanford Network Analysis Platform
POS       Part Of Speech
MOH       Ministry of Health
MOI       Ministry Of Interior
CCC       Command & Control Center
SHC       Supreme Hajj Committee
NER       Named Entity Recognizer
MERS-COV  Middle East Respiratory Syndrome Coronavirus
WRM       Wholesale Revenue Management

CHAPTER 1: INTRODUCTION

1.1. Introduction

Social Media (SM) data is representative of Big Data, with its massive growth, its multiple channels, and the enormous scope of its content and subject matter [1]. In the business world, SM is a powerful marketing tool that is reshaping the way organizations engage with their customers and nurture their relationships with brands, products and services [1] [2]. It can be deployed to share news from a corporate event on a near real-time basis, to create a buzz about a great new product within minutes of its launch, or to share the details of an unpleasant experience with customer services [3] [4]. Its many other innovative uses include political leaders trying to influence public opinion [5], the creation of job applications, the organization of learning groups, online training sessions, and many others [2] [6].

When it comes to analyzing this powerful source of data, many organizations are concerned that the amount of collected data becomes so cumbersome that it is difficult to find the most valuable pieces of information. Many questions arise [3]:

- What if the data volume gets so large and varied that one does not know how to deal with it?
- How much data should be stored? All the data, or only a subset?
- How much data should be analyzed? All the data, or only a subset?
- How can one find out which data sets are really important?

Until recently, organizations were limited to using subsets of their data, or they were constrained to simplistic analyses, because the sheer volume of data overwhelmed their processing platforms [7]. There are two choices in this context [4] [8]:

- Incorporate massive data volumes in the analysis. The needed answers may be better provided by analyzing all the data. High-performance technologies that extract value from massive amounts of data are available today. One approach is to apply high-performance analytics to the massive amounts of data, using technologies such as grid computing, in-database processing and in-memory analytics.
- Determine upfront which data is relevant. Traditionally, the trend has been to store everything (some call it data hoarding), so that the analyst discovers what is relevant only when querying the data. The alternative is to apply analytics on the front end and determine relevance based on the particular context. This type of analysis determines which data should be included in analytical processes and which can be placed in low-cost storage for later use if needed.

Gathering massive amounts of data is proving impractical in an SM world that is expanding with infinite amounts of user-generated data [5]. The consequence of this approach in the case of SM data is that users are often unable to obtain specific, relevant information from large-scale, highly volatile, varied SM data collections. On the other hand, determining the relevant data upfront and specifying the analyzing requirements prior to data collection is the approach that should be followed in SM data analytics. It should not be a fishing expedition [8], because discovering patterns and information in this large and complex collection of datasets is not only challenging but also immensely time consuming. Due to advances in data acquisition and business computing, today's datasets are becoming increasingly complex [8]. Some authors and data analysts, such as [3], [8], [9] and many others, recommend the decision-back approach, which begins with answering the right questions; these give the road map for more structured data collection and SM data analytics processes. Therefore, a more structured plan for capturing SM data analyzing requirements is needed to avoid wasting time and resources on analyzing irrelevant data.

Motivation

Analyzing SM Big Data with low-latency updates, almost in real time, is a challenge for the near future [10]. It has special characteristics and requires continuous investigation and analysis [11], since in real-life cases it is important to know what is happening now and

make decisions as quickly as possible [12]. Therefore, this thesis is motivated by the vision of ensuring access to the most valuable sources with minimal resources. It emphasizes the demand for a well-defined mechanism that supports an effective process, one that takes the maximum value from the available data and brings decision makers closer to extracting value from SM data. The main contribution of this research is the design of a value-added and well-defined process to capture SM data analyzing requirements upfront for data collection.

Definition of Big Data

There is no perfect definition of Big Data. The term is used by many companies and in the literature with varying definitions, and it has become more popular as a search keyword, as shown in Figure 1.1 using Google's tool Google Trends.

Figure 1.1: The Interest for the Term "Big Data" on Google Feb,

Big Data is defined by Gartner, the leading IT industry research group, as: "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization" [13]. Gartner characterized Big Data by three main elements, volume, velocity, and variety, which are known as the 3V's model [14]:

- Volume: The size of the data is very large, in the range of terabytes and petabytes.
- Velocity: A conventional understanding of velocity considers how quickly the data arrives and is stored, and its associated rates of retrieval.
- Variety: It extends beyond structured data to include unstructured data of all varieties: text, audio, video, posts, log files, etc.

Some researchers use a slightly modified 3V's model. Sam Madden describes Big Data as data that is "too big, too fast, or too hard" [15], where "too hard" refers to data that does not fit neatly into existing processing tools. Therefore, "too hard" is very similar to data variety. Kaisler et al. define Big Data as "the amount of data just beyond technology's capability to store, manage and process efficiently", but mention variety and velocity as additional characteristics [16]. Tim Kraska moves away from the 3V's, but still acknowledges that Big Data is more than just volume. He describes Big Data as "data for which the normal application of current technology doesn't enable users to obtain timely, cost-effective, and quality answers to data-driven questions" [17]. However, he leaves open which characteristics of this data go beyond the "normal application of current technology" [18].

IBM uses the 3V's model but introduces an additional V, veracity:

- Veracity: Uncertainty of data and data trustworthiness [19]; it signals that data keeps changing, so one cannot trust the data when making decisions.

The leader in analytics, the Statistical Analysis System (SAS) Institute, considers two additional dimensions [7]:

- Variability: In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Seasonal and event-triggered peak data loads can be challenging to manage, and the challenge intensifies with unstructured data.
- Complexity: Today's data comes from multiple sources, and it is still an undertaking to link, match, cleanse and transform data across systems. However, it is necessary to connect and correlate relationships, hierarchies and multiple data linkages, or the data can quickly spin out of control.

Overall, the 4V's model, or adaptations of it, seems to be the most widely used and accepted description of what the term Big Data means [20]: the Gartner 3V's model [14] plus IBM's additional V [19]. The model clearly describes characteristics that can be used to derive requirements for respective technologies and products. However, the primary concerns of this thesis are volume, velocity, veracity, and variety, as they are the main barriers to an interoperable analytic platform [20]. Handling the same volume might be a really hard problem if it is arriving fast and needs to be processed within seconds. Meanwhile, handling volume might get harder as the data set to be processed becomes unstructured. This adds the necessity to conduct pre-filtering steps so that only the data that matter enter processing and analysis.

Social Media Big Data

Definition of Social Media

Nowadays, SM networks such as MySpace, Facebook, Cyworld, Twitter, Instagram, Bebo, Snapchat, LinkedIn, etc. (see Figure 1.2) have become increasingly popular, and they support a wide range of interests and practices. While their key technological features are fairly consistent, the cultures that emerge around social networks are varied. Most sites support the maintenance of pre-existing social networks, but others help strangers connect

based on shared interests, political views, or activities. Some sites cater to diverse audiences, while others attract people based on a common language or shared racial, sexual, religious, or nationality-based identities. Sites also vary in the extent to which they incorporate new information and communication tools, such as mobile connectivity, blogging, and photo/video sharing [21]. Many such social networks are extremely rich, and they typically contain a tremendous amount of content and linkage data that can be leveraged for analysis. The linkage data is essentially the graph structure of the social network and the communications between entities, whereas the content data contains the text, images and other multimedia data in the network [22] [23].
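The distinction between linkage data and content data can be illustrated with a small data structure. The following is a minimal sketch and not part of the thesis tool: the class name `SocialNetworkData`, its methods, and the sample users are all hypothetical, and real SM sites expose far richer structures. It only shows that the graph of connections and the posted material can be held and queried separately.

```python
from collections import defaultdict


class SocialNetworkData:
    """Hypothetical container that separates linkage data from content data."""

    def __init__(self):
        # Linkage data: the graph structure, stored as entity -> connected entities.
        self.links = defaultdict(set)
        # Content data: entity -> list of posted items (text, image references, ...).
        self.content = defaultdict(list)

    def connect(self, a, b):
        """Record an undirected connection between two entities."""
        self.links[a].add(b)
        self.links[b].add(a)

    def post(self, entity, item):
        """Attach a content item to the entity that published it."""
        self.content[entity].append(item)

    def neighbours(self, entity):
        """Return the entities directly connected to the given one, sorted."""
        return sorted(self.links.get(entity, set()))


# Made-up sample data for illustration only.
net = SocialNetworkData()
net.connect("alice", "bob")
net.connect("alice", "carol")
net.post("alice", "Hello from the conference")

print(net.neighbours("alice"))
print(net.content["alice"])
```

Keeping the two kinds of data apart means that a graph-oriented analysis (who is connected to whom) and a content-oriented analysis (what was posted) can each read only the part they need.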

Figure 1.2: Timeline of the Launch Dates of Many Major Social Networks Sites and Dates Until 2005 [21]

Social networks have been defined by [24] as web-based services that allow individuals to:
(1) Construct a public or semi-public profile within a bounded system.
(2) Articulate a list of other users with whom they share a connection.
(3) View and traverse their list of connections and those made by others within the system.

These connections or relationships are often displayed in a diagram, where entities are the points (also called nodes) and connections are the lines [25]. This definition is the one used to define SM sites in this thesis, as it is widely used in many publications [24] [25].

Data Capturing and Analyzing Challenges of Social Media

Data generated from SM sites differ from the conventional attribute-value data of classical data mining. SM data are largely user-generated content on SM sites [2]. SM data are typically Big Data with special characteristics: volatile, noisy, distributed, unstructured, and vast. The main challenges and issues of SM data as a Big Data representative are [26]:

- Privacy and Security: This is the most important issue with Big Data; it is sensitive and has conceptual, technical, and legal significance.
- Data Access and Sharing of Information: If data is to be used to make accurate decisions in time, it must be available in a precise, complete and timely manner. This makes the data management and governance process more complex, adding the necessity to make data open and available to government agencies in a standardized manner, with standardized APIs, metadata and formats, thus leading to better decision making, business intelligence and productivity improvements.
- Storage and Processing Issues: The available storage cannot accommodate the large amount of data being produced, since SM sites are themselves a great contributor, along with sensor devices. Processing such enormous data sets is also time consuming. To find suitable elements, the whole data set needs to be scanned, which is practically impossible.
- Analytical Issues: The main challenging questions are:

  (1) What if the data volume gets so unwieldy and varied that it is too difficult to manipulate?
  (2) Does all data need to be stored?
  (3) Does all data need to be analyzed?
  (4) Which data points are really important, and for what reasons?
  (5) How can the data be used to the best advantage?
- Skill Requirements: Since Big Data is a fledgling, emerging technology, it needs to attract organizations and youth with diverse new skill sets. These skills should not be limited to technical ones but should also extend to research, analytical, interpretive and creative ones.
- Technical Challenges:
  (1) Fault Tolerance.
  (2) Scalability.
  (3) Quality of Data.
  (4) Heterogeneous Data.

This thesis is motivated to address the storage and processing issues and the analytical issues. The lack of an effective process for capturing SM data analyzing requirements in organizations adopting SM data solutions can have a negative impact on finances as well as on the industry's reputation and credibility [27].

Requirements Engineering for Social Media Big Data Analytics

Requirements acquisition is recognized as one of the most important, albeit difficult, phases in software engineering [28]. The literature repeatedly cites the role of well-defined requirements and a requirements acquisition process in problem analysis and project management as beneficial to software development throughout the life cycle: during design, coding, testing, maintenance and documentation of software [28] [29]. Because SM Big Data collection and analytics resemble the design of IT software systems, they call for a Requirements Engineering approach that specifies the requirements prior to data collection and provides a structure for gathering the user's analytics requirements. Therefore, a tool architecture for requirements acquisition is the supporting software solution for the requirements engineering phase of the SM data collection process. The guiding questions within the tool define a structured process for system analysts to elicit the SM data analyzing requirements in a more effective and user-friendly manner. This approach to Requirements Engineering is one of the main principles of the Software Development Life Cycle (SDLC) [28].

Problem Statement

As a result of SM's rapid growth, recent years have seen an accelerating shift in different domains away from traditional channels, such as print and broadcast, to digital channels [1]. This transformation is being driven by the cost advantages and precision offered by digital platforms. It is particularly visible in the growing area of applications for managing the increasing volume and influence of SM [5]. Here are some statistics that offer an insight into the scope of the SM phenomenon [1]:

- 1.43 billion people worldwide visited a social networking site last year (2014).
- Nearly 1 in 8 people worldwide have their own Facebook page.
- In 2014, one million new accounts were added to Twitter every day.
- Three million new blogs come online every month.
- 65 percent of SM users say they use it to learn more about brands, products and services.

The amount of information continues to increase at an enormous rate. Therefore, it is imperative that businesses, organizations, and associations find better approaches to information filtering and requirements capturing, which would effectively decrease the information overload and improve the precision of analytics results [30]. All things considered, SM data analytics can only be effective when the underlying data collection

processes are able to leverage the information relevant to a particular domain [31]. It is critical to improve the usefulness of the analysis results and to accelerate SM data analytics. Therefore, a more powerful mechanism for guiding the capture of data analytics requirements is needed to reduce the time and resources consumed by analyzing irrelevant data.

Research Questions and Objectives

The study examines the decision-back approach to data capturing and its applicability to capturing SM data analytics requirements. Therefore, the research question is formulated as follows:

How can we define an architecture for an SM data requirements capturing tool that accelerates the analytics tasks?

This research defines a requirements acquisition tool architecture that captures SM data analytics requirements using the decision-back approach, which can play a role in the SM data capturing process. Therefore, the main objectives of the study are:

- To examine SM sites and determine what makes them different from each other.
- To explore different SM data analytical tools, their different techniques, and their main vendors.
- To examine the decision-back approach and how it can benefit SM data collection.
- To provide a well-planned tool architecture that eases the analyst's task of capturing SM analysis requirements. The tool applies the decision-back approach by determining what the output requirements are and then filtering the input data accordingly.
- To examine specific real-life cases from different problem domains using this tool to prove its worthiness.

33 Chapter1: Introduction To validate the tool for its correctness and efficiency to ensure that it answers the research question. Therefore, the goal of this thesis is to define coherent processes to acquiring user s analyzing requirements. Thus, data analytics can be done in smaller time frames, allowing decisions to be made faster and with higher precision, by improving the current data capturing process from where one can draw accurate and useful conclusions. Then it will contribute to changing the way people are collecting analyzing requirements and subsequently transform decision making in a way that gives businesses the required advantage Scope of the Thesis Figure 1.3 is a simplified adaptation to SM analytics life cycle. As presented, it has four main stages: Data Collection, Data Processing, Data Storage, and Data Analysis [2]. The first stage: Data Collection, is the phase that is concerned with collecting SM data from different SM sources e.g. blogs, microblogs, etc. The goal of this thesis is to reduce this tremendous amount of data by identifying analysis requirements prior to data collection phase in SM analytics solution. This is inspired by and is similar to the primary phase of a Software Development Life Cycle (SDLC), which is Requirements Engineering (RE) [28]. This study follows RE in providing a well-defined tool architecture to capture SM analyzing requirements to improve the data collection process and accelerate the analytics tasks. Investigating the other phases of SDLC or Big Data analytics process, and examining other SM analytics problems are beyond the scope of this thesis. 13 Page
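The decision-back idea stated in the objectives, fixing the output requirements first and then filtering the input accordingly, can be illustrated with a minimal Python sketch. All names used here (`Post`, `AnalysisRequirements`, `capture`, and their fields) are hypothetical and are not taken from the tool proposed in this thesis. The sketch only assumes that requirements can be expressed as a set of sources, a set of keywords, and a date range, and shows them acting as a filter before any post is kept for analysis.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Iterable, Iterator


@dataclass(frozen=True)
class Post:
    """One SM item as it arrives from a source (hypothetical shape)."""
    source: str      # e.g. "twitter" or "blog"
    text: str
    posted_on: date


@dataclass(frozen=True)
class AnalysisRequirements:
    """What the analysis output must cover, captured before collection starts."""
    sources: FrozenSet[str]
    keywords: FrozenSet[str]   # assumed to be lowercase
    start: date
    end: date

    def accepts(self, post: Post) -> bool:
        """Return True only if the post can contribute to the required output."""
        text = post.text.lower()
        return (
            post.source in self.sources
            and self.start <= post.posted_on <= self.end
            and any(keyword in text for keyword in self.keywords)
        )


def capture(posts: Iterable[Post], req: AnalysisRequirements) -> Iterator[Post]:
    """Yield only the posts that match the requirements; the rest are skipped."""
    return (post for post in posts if req.accepts(post))


# Usage: the requirements are defined first, then applied to an incoming stream.
requirements = AnalysisRequirements(
    sources=frozenset({"twitter"}),
    keywords=frozenset({"delay", "refund"}),
    start=date(2015, 1, 1),
    end=date(2015, 3, 31),
)
stream = [
    Post("twitter", "Flight delay again, want a refund", date(2015, 2, 10)),
    Post("blog", "Notes on a flight delay", date(2015, 2, 11)),
    Post("twitter", "Lovely weather today", date(2015, 2, 12)),
]
kept = list(capture(stream, requirements))
print([post.text for post in kept])
```

In this sketch the blog post is rejected by the source requirement and the weather tweet by the keyword requirement, so only data that can contribute to the requested analysis is passed on to the later stages.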

Figure 1.3: Social Media Analytics Life Cycle

1.9. Related Published Paper

Published papers under this research:

[1] M. Alswilmi, A. Dahanayake (2015), "A Requirements Acquisition Tool Architecture for the Decision Back Approach for Social Media Big Data Capturing," 5th Advances in Software Engineering Conference, Prince Sultan University.
[2] M. Alswilmi, N. Alnajran, A. Dahanayake (2014), "Conceptual Framework for Big Data Analytics Solutions," Proceedings of the 24th International Conference on Information Modelling and Knowledge Bases, EJC.
[3] M. Alswilmi, N. Alnajran, A. Dahanayake (2013), "Conceptual Framework for Big Data Analytics Solutions," 2nd Advances in Software Engineering Conference, Prince Sultan University.

Outline of the Thesis

Apart from the introduction, the remainder of this thesis is structured into eight chapters: Research Method; Literature Review; SM Types and Analytical Techniques; Decision-back Data Capturing Approach for SM Data; Case Study; Tool Experiment and Validation; Discussion; and Conclusion and Future Work.

Chapter 2 presents the research methods and demonstrates the methodology adopted to conduct this research. The literature review in Chapter 3 discusses related work, including some available data reduction approaches, and highlights the innovativeness of this research. It also gives an overview of the tools and frameworks that have been used to build the proposed tool. Chapter 4 discusses SM site categorizations, SM data analytical tools, and different analytical techniques. The tool architecture, the core of this research, is built and proposed in Chapter 5 along with supporting materials. Chapter 6 applies the framework to five case studies from different problem domains. In Chapter 7, the framework is validated through a prototype and an experiment to demonstrate its correctness and efficiency. Chapter 8 then provides the research analysis and a discussion of the tool prototype and its evaluation and validation. Finally, Chapter 9 contains the conclusion, the limitations of this research, and future research directions.

CHAPTER 2: RESEARCH METHODS

In this chapter, the research methods followed within this study are outlined, including the research design, the participants, the techniques and data analysis methods used for research data analysis, and the approaches used to evaluate and validate the results. The tools used to conduct the experimental work are also discussed.

Research Methods

Research Design

The major aim of this research is to apply the decision-back approach and to develop a requirements acquisition tool architecture to capture SM data analytics requirements. For this purpose, qualitative and, to some extent, quantitative research methods are chosen. The research is descriptive in nature and allows a more in-depth contextual understanding of the topic to be gathered. Initially, the inductive approach is followed to analyze the qualitative data. The research begins from general information about the decision-back capturing approach and SM analysis requirements and moves towards more specific conclusions about how to apply this data capturing approach in SM Big Data analytics and how to build a requirements acquisition tool architecture.

Research Participants

In order to maximize the validity of the findings, the research uses a hybrid access type [32] to gather the relevant data. The primary source of data is in-depth Internet access to SM sites and a review of various scientific publications and white papers of interest to this research. Supporting data is collected through traditional access, by observing several leading companies that benefit from Big Data and SM data capture and analysis technologies. The choice of companies, such as IBM, Gartner, and SAS, is determined by the availability of information, reputation, and level of involvement in this field.

Research Techniques and Data Analysis

This is mixed-method research. It uses a variety of data collection techniques and analytical procedures to develop the foundation of the tool architecture and to validate it. In order to maximize the validity and trustworthiness of the findings, the research uses a hybrid access type to gather a richer set of data. The research advanced through multiple Work Packages (WPs) to develop the tool architecture and the tool's prototype, as explained below (see Figure 2.1).

Research Work Packages

1. WP1: Learning from the Available Literature
   1.1. The primary source of data is literature exploration, in-depth Internet access to SM sites and SM analytical tools, and a review of relevant publications and white papers that discuss the decision-back approach and SM data analytics.
   1.2. Supporting data is collected through:
      1.2.1. Traditional access and conversations with interested participants at scientific conferences such as the European Japanese Conference.
      1.2.2. Observations involving reviews of documents on the data analytics solutions of several companies.
2. WP2: Developing the Conceptual Framework to Facilitate the Decision-Back SM Big Data Requirements Capturing
   2.1. By examining the decision-back approach and how it has been used in the literature, general questions from the article [8] are used to identify the main concepts for applying this approach to the analysis of SM data.
   2.2. Each concept in the framework is examined to describe how it can help capture SM data within more efficient timelines and with less consumption of resources.
   2.3. Connecting the framework with the SM analytics life cycle and showing its relevance to the SDLC.
3. WP3: Fine-tuning the Conceptual Framework
   3.1. Examining the W*H Conceptual Model for Services [33] and customizing it to make the concepts in the framework more descriptive.
   3.2. After relating the conceptual framework to the SDLC and showing how it works as a requirements acquisition phase in the Big Data analytics life cycle, the requirements framework is built and its components are described accordingly.
4. WP4: Design of a Tool Prototype and the Component Architecture of a Tool that Supports the Decision-Back SM Big Data Requirements Capturing
   4.1. Based on the requirements acquisition framework, the tool architecture is designed.
   4.2. Each model in the tool is described, showing how it supports capturing the SM data analyst's requirements.
5. WP5: Validation
   5.1. Two types of validation tests are provided: theoretical and experimental.
      5.1.1. Theoretical, by presenting case studies from different problem domains.
      5.1.2. Experimental, by using actual analytical tools and measuring the correctness and efficiency quality factors.
6. WP6: Discussion, Conclusions and Future Research Directions
   6.1. Discussing the worth of the proposed tool architecture by comparing two results: analysis with the tool and analysis without it.
   6.2. Concluding the research, discussing its limitations, and providing directions for future work and further improvements.

Figure 2.1: Thesis Work Packages (WPs)

2.2. Research Instruments and Procedures

This research attempts to build a requirements acquisition tool architecture for the decision-back approach to capturing SM data. In order to validate this tool for correctness and efficiency, a prototype consisting of a combination of tools is needed to support the tool evaluation process. These tools are described below.

Social Mention

Social Mention is a free Social Media search and analysis platform that aggregates user-generated content from across the universe into a single stream of information. It allows users to easily track and measure what people are saying about a person, a company, a new product, or any topic across the web's Social Media landscape in real time. Social Mention directly monitors more than 100 Social Media properties, including Twitter, Facebook, FriendFeed, YouTube, Digg, and Google.

Trackur

Trackur is an SM monitoring tool designed to assist companies and public relations (PR) professionals in tracking what is said about brands on the Internet. It scans hundreds of millions of web pages, including news, blogs, videos, images, and forums, and alerts the user to anything that matches the monitored keywords. It costs at least $97 a month and offers a 10-day trial.

CHAPTER 3: LITERATURE REVIEW

3.1. Big Data and Social Media Related Works

Many software startups and research and development efforts are actively trying to harness the power of Big Data and SM and to create software with the potential to improve almost every aspect of human life. As these efforts continue to increase, full consideration needs to be given to the engineering aspects of Big Data and SM software. Since these systems exist to make predictions on complex, continuous, and massive datasets, they pose unique problems in collecting, processing, and analyzing data that needs to be delivered on time and within budget [34]. This research focuses on an SM requirements capturing approach, and this chapter therefore reviews studies that discuss SM and data capturing approaches.

Innovative Big Data and Data Capturing Approaches

In [35], IBM provides a means of classifying Big Data business problems according to specified criteria. It offers a pattern-based approach to facilitate the task of defining an overall Big Data architecture. The idea of classifying data in order to map each problem to its suitable solution pattern provides an understanding of how a structured classification approach can lead to an analysis of the needs and a clear vision of what needs to be captured. Moreover, IBM has presented several real-life samples of Big Data case studies in [36].

The authors in [37] studied different Big Data types and problems. They developed a conceptual framework that classifies Big Data problems according to the format of the data that must be processed. It maps the Big Data types to the appropriate combinations of data processing components. These components are the processing and analytic tools needed to generate useful patterns from each type of data.

The constraint-driven data mining technique proposed in [38] identifies the following classes of constraints: database constraints, pattern constraints, and time constraints. Database constraints are used to specify the source dataset. Pattern constraints specify which patterns are interesting and should be returned by the query. Finally, time constraints influence the process of checking whether a given data sequence contains a given pattern. According to [39], data mining can only be applied to structured data that can be stored in a relational database; nevertheless, this constraint-driven approach can provide an understanding of how these types of constraints can lead to more efficient data collection.

The article [40] proposes a novel approach for the consistent collective evaluation of multiple continuous queries that filter two different types of data streams: a relational stream and an XML stream. The proposed approach provides region-based selection constructs for both: an attribute selection construct for relational queries and a path selection construct for XPath queries. Both collectively evaluate the selection predicates of the same attribute (or path), based on the precomputed matching results of the queries in each of the disjoint regions into which the selection predicates divide the domain. The performance experiments show that the proposed approach is more efficient and stable at run time.

In [41], C. Anne and B. Boury proposed a framework that facilitates the integration of heterogeneous unstructured and structured data, enables Hard/Soft fusion, and prepares for various analytics exploitation. It provides timely and relevant information to the analyst through intuitive search and discovery mechanisms. The authors described the design and implementation of a prototype for scalable Multi-Intelligence Data Integration Services (MIDIS), based on a flexible data integration approach that makes use of Semantic Web and Big Data technologies.

The white paper published by Intel [42] walks through the challenge of extracting Big Data from multiple sources. It explains how Hadoop infrastructure can contribute to the process of Big Data Extract, Transform, and Load (ETL), and it illustrates, from a technical point of view, the process of loading different data formats from multiple data sources into Hadoop's warehouse. However, it does not address the idea of reducing the capture of useless data or of producing real-time management decisions.
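The region-based evaluation idea summarized above for [40] can be made concrete with a small Python sketch. This is a simplified illustration under stated assumptions, not the construct defined in [40]: it covers a single numeric attribute only, each query is assumed to be a half-open range predicate `low <= value < high`, and the class name `RegionIndex` and its methods are hypothetical. The range endpoints split the attribute domain into disjoint regions, the set of matching queries is computed once per region, and each incoming value is then resolved with one binary search.

```python
from bisect import bisect_right
from typing import Dict, FrozenSet, List, Tuple

Range = Tuple[float, float]  # half-open predicate: low <= value < high


class RegionIndex:
    """Precomputed query matches for the disjoint regions of one attribute."""

    def __init__(self, queries: Dict[str, Range]) -> None:
        # Every predicate endpoint is a region boundary.
        self.boundaries: List[float] = sorted(
            {bound for query_range in queries.values() for bound in query_range}
        )
        # Region 0 lies below the smallest boundary, so no query matches it.
        self.matches: List[FrozenSet[str]] = [frozenset()]
        # Region i (i >= 1) starts at boundaries[i - 1]; no endpoint falls
        # strictly inside a region, so testing its left edge decides the
        # match for the whole region.
        for left in self.boundaries:
            self.matches.append(frozenset(
                name for name, (low, high) in queries.items()
                if low <= left < high
            ))

    def evaluate(self, value: float) -> FrozenSet[str]:
        """Return the names of all queries whose predicate accepts the value."""
        return self.matches[bisect_right(self.boundaries, value)]


# Usage: three continuous queries over the same attribute.
index = RegionIndex({"q1": (0, 10), "q2": (5, 20), "q3": (15, 30)})
print(sorted(index.evaluate(7)))    # queries matching the value 7
print(sorted(index.evaluate(17)))   # queries matching the value 17
```

The design choice illustrated here is that the per-tuple cost depends on the number of boundaries (through the binary search) and not on evaluating every query predicate separately, which is the intuition behind evaluating the predicates of the same attribute collectively.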

Literature Analysis

The idea of a decision-back concept for a structured approach to SM data collection emerged from the IBM contributions to the field of Big Data [35], [36]. Moreover, [37], [40], [41], and [42] discussed how classifying data according to certain parameters can lead to a better understanding of the problem at hand, while [38] discussed the constraint-driven approach and how its types of constraints can lead to more efficient data collection.

Innovative Social Media Data Collection and Analytics Approaches

In [43], the authors present a multi-layered approach to knowledge extraction from social networks, with a comprehensive survey of relevant notions and techniques from multiple disciplines. They analyzed SM characteristics along multi-mode, multi-layer knowledge dimensions, using Twitter as an example. They also improved the hypergraph model of social network behaviors based on the dimensions proposed in the model, with a case study on Twitter illustrating the multi-dimensional relations between Twitter users. Their main focus was to improve the understanding of social network services.

The authors in [23] studied the application of web mining concepts and techniques to on-line social networks, in terms of how web mining can be used and a general process for using it in on-line social network analysis. They discussed several challenges in this research area. For example, data sampling is a major issue when using web mining for on-line social network analysis: in other web mining applications, data sampling is a simple way to reduce the amount of data, but in on-line social network analysis it becomes difficult to select suitable samples that are representative of the real social networks.

In [44], the authors empirically designed and developed the Real-time Twitter Trend Mining (RT²M) system, which makes it possible, in real time, to: 1) crawl and store every textual tweet produced on Twitter in a local database; 2) keep track of social issues through temporal topic modeling; and 3) visualize mention-based user networks. They also demonstrated a