A Requirements Acquisition Tool Architecture for the Decision Back Approach to Social Media Big Data Capturing


A Requirements Acquisition Tool Architecture for the Decision Back Approach to Social Media Big Data Capturing

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Software Engineering

College of Computer and Information Sciences
Prince Sultan University

By: Mashail A. Alswilmi
May, 2015

A Requirements Acquisition Tool Architecture for the Decision Back Approach to Social Media Big Data Capturing
By Mashail A. Alswilmi

This thesis was defended on 25th May 2015.

Supervisor: Prof. Dr. Ajantha Dahanayake

Members of the Exam Committee:
Prof. Dr. Ajantha Dahanayake (Chair)
Dr. Areej Alwabil (Member)
Dr. Sarab AlMuhaideb (Member)

ACKNOWLEDGMENTS

First and foremost, praises and thanks to Allah for His showers of blessings throughout my research work, which enabled me to complete it successfully. I would like to express my deep and sincere gratitude to my research supervisor, Prof. Ajantha Dahanayake, for her continuous support of my master's study and research, for giving me the opportunity to do research, and for her invaluable guidance, patience, motivation, enthusiasm, and immense knowledge throughout this research. She taught me the methodology for carrying out research and presenting it as clearly as possible. It was a great privilege and honor to work and study under her guidance. Her dynamism, vision, sincerity, and motivation have deeply inspired me, and I am extremely grateful for what she has offered me. I am extremely grateful to my parents for their love, prayers, care, and sacrifices in educating and preparing me for my future. I am very thankful to my husband and my son for their love, understanding, prayers, and continuing support in completing this research. I also wish to thank my sisters and brothers for their support and valuable prayers. Finally, my thanks extend to all those individuals who supported me in completing this research project, directly or indirectly.

Abstract

This master's thesis utilizes a decision-back concept to optimize the process of social media data collection. Leveraging this type of Big Data extends the requirements of traditional data capturing techniques, due to its large volume, velocity, variety, and veracity. Comprehensively analyzing the properties of the problem at hand and determining the analyzing needs upfront of data collection eliminates the chance of being overwhelmed by masses of irrelevant data, and helps users and businesses generate management decisions and answer mission-critical questions in an efficient and timely manner. This master's thesis therefore develops the architecture of a requirements acquisition tool that applies a decision-back approach to capture social media data analyzing requirements. The tool captures the requirements through a set of questions posed in multiple phases. In the first phase, the Problem Domain questions, the system analyzes the user's answers using an NLP technique to extract keywords and time and location constraints. In the second phase, the Data Source questions, the system analyzes the user's selections using a data source recommendation system to recommend the most suitable data source. In the final phase, the Analytical Tool questions, the system analyzes the user's selections using an analytical tool recommendation system to recommend the most suitable analytical tool. The tool's outputs are: keywords, time and location constraints, a recommended data source, and a recommended analytical tool. The tool is validated for the correctness and efficiency quality factors through an experiment that compares data collection for social media analytics with and without the use of the tool. The experiment showed that the average rates of improvement in correctness and efficiency increased after using the tool.

The main contribution of this research is the design of a value-added and well-defined process to capture social media data analyzing requirements upfront of data collection, in order to accelerate the analytics tasks. The requirements acquisition tool also contributes to: 1) the requirements engineering field, by building a tool that helps users capture their requirements prior to the data collection process during social media data analytics; and 2) the software engineering field, by providing a user-centered solution that captures the user's social media data analyzing needs within a user-friendly environment.

ملخص البحث

تستخدم رسالة الماجستير هذه مفهوم "القرار التابع للنتيجة" Decision-back لتحسين عملية جمع بيانات وسائل الإعلام الاجتماعية. لأن الاستفادة من هذا النوع من البيانات الضخمة يتجاوز ما توفره تقنيات جمع البيانات التقليدية نظراً لضخامة كميتها وسرعتها وتنوعها ومدى صحتها. إن التحليل الشامل على طريقة "القرار التابع للنتيجة" عن طريق تحليل خصائص المشكلة الحالية وتحديد احتياجات التحليل قبل البدء بجمع البيانات سوف يقلل من فرص الانغمار في كتل من البيانات غير ذات العلاقة ومساعدة المستخدمين والشركات لاتخاذ قرارات إدارية والإجابة على أسئلة المهمات الحاسمة بطريقة فعالة وفي الوقت المناسب. لذلك رسالة الماجستير هذه تطور بنية أداة جمع المتطلبات والتي تطبق مفهوم "القرار التابع للنتيجة" لجمع بيانات وسائل الإعلام الاجتماعية. الأداة تجمع المتطلبات بمجموعة من الأسئلة على عدة مراحل. المرحلة الأولى: مجموعة أسئلة مجال المشكلة، توفر الأساس لتحليل إجابات المستخدم بطريقة "معالجة اللغة الطبيعية" NLP لاستخراج مفاتيح البحث وقيود الوقت والمكان. المرحلة الثانية: مجموعة أسئلة مصدر البيانات، حيث يتم تحليل اختيارات المستخدم عن طريق نظام توصية مصدر البيانات. المرحلة الأخيرة: مجموعة أسئلة أدوات تحليل البيانات، حيث يتم تحليل اختيارات المستخدم عن طريق نظام توصية أداة تحليل البيانات. مخرجات الأداة هي: مفاتيح البحث وقيود الوقت والمكان، مصدر البيانات الموصى به، وأداة التحليل الموصى بها. تم التحقق من فعالية هذه الأداة بالنسبة لعاملي الملاءمة والكفاءة من خلال إجراء تجربة تقارن تحليل بيانات وسائل الإعلام الاجتماعية مع وبدون استخدام الأداة. أثبتت التجربة أن متوسط معدل تحسن عاملي الملاءمة والكفاءة ارتفع بعد استخدام الأداة.

المساهمة الرئيسية لهذا البحث هي تصميم عملية ذات قيمة مضافة واضحة المعالم لجمع متطلبات تحليل بيانات وسائل الإعلام الاجتماعية قبل جمع البيانات لتسريع مهام التحليل. كما تساهم أداة اكتساب المتطلبات أيضاً في: ۱) مجال هندسة المتطلبات للبرمجيات من خلال بناء الأداة التي توفر مجموعة من الأسئلة التي تساعد المستخدم على التقاط احتياجاته قبل عملية جمع البيانات خلال تحليلات بيانات وسائل الإعلام الاجتماعية. ۲) مجال هندسة البرمجيات من خلال توفير حل محوره المستخدم والذي يخدم احتياجات المستخدمين في تحليل بيانات وسائط الإعلام الاجتماعية في بيئة سهلة ومريحة.

Table of Contents

Acknowledgments ... I
Abstract ... II
Abstract in Arabic ... IV
Table of Contents ... V
List of Figures ... XII
List of Tables ... XV
List of Appendix Figures ... XVI
List of Appendix Tables ... XVII
List of Abbreviations ... XVIII

Chapter 1: Introduction
    Introduction
    Motivation
    Definition of Big Data
    Social Media Big Data
        Definition of Social Media
        Data Capturing and Analyzing Challenges of Social Media
    Requirements Engineering for Social Media Big Data Analytics
    Problem Statement
    Research Questions and Objectives

    1.8. Scope of the Thesis
    Related Published Papers
    Outline of the Thesis

Chapter 2: Research Methods
    Research Methods
    Research Design
    Research Participants
    Research Techniques and Data Analysis
    Research Work Packages
    Research Instruments and Procedures
        Social Mention
        Trackur

Chapter 3: Literature Review
    Big Data and Social Media Related Works
        Innovative Big Data and Data Capturing Approaches
        Literature Analysis
        Innovative Social Media Data Collection and Analytics Approaches
        Literature Analysis
    Software Engineering and Social Media Data Analytics
        Reverse Engineering
        Software Requirements Engineering
        Decision-back Data Capturing Approach
        Literature Analysis

    3.4. Related Tools and Environments for Social Media Data Analytics
        Hadoop: The Big Data Management Framework
        Apache Hadoop
        Literature Analysis
    Theories and Frameworks
        W*H Conceptual Model for Services
        Stanford CoreNLP Framework
    Summary

Chapter 4: Social Media Types and Analytical Techniques
    Social Media Types
    Social Media Sites Categorizations
        Social Networking
        Microblogging
        Blogging
        Photo Sharing
        Video Sharing
    Social Media Sites Examples
        Facebook
        Twitter
        LinkedIn
        Google+
    Summary of Social Media Sites Characteristics
    Social Media Analytical Tools

        Social Listening Software / Social Media Monitoring Software
        Social Conversation Software / Social Media Engagement Software / Social Media Management Software
        Social Marketing Software / Social Media Management Software
        Social Analytics Software
        Social Influencer Software
    Social Media Analytical Tools Examples

Chapter 5: Decision-Back Data Capturing Approach for Social Media Data
    Backward Analysis
    Capturing Social Media Data Plan
    The Conceptual Model
        Identification of the Problem Domain
        Identification of the Data Source and the Analytical Tool
    W*H Conceptual Model for Services
    Defining the Social Media Data Capturing Model
    Tool Architecture and Design
        Tool Layers:
            Data Ingest Module (Presentation Layer)
            Data Analysis Module (Middle Layer)
            Database Layer
    Tool's User Interface Design
        Part 1: Problem Domain
        Part 2: Data Source

        Part 3: Analytical Tool

Chapter 6: Case Study
    Stanford CoreNLP Tool
        Part of Speech Tagger
        Named Entity Recognizer
    Case 1: Start On-Line Business Project
        Problem Description
        Tool Application
    Case 2: A Saving Lincoln Movie Promotion
        Problem Description
        Tool Application
    Case 3: YouTube Music Channel Promotion
        Problem Description
        Tool Application
    Case 4: Middle East Respiratory Syndrome Awareness
        Problem Description
        Tool Application
    Case 5: DAESH Terrorist Movement
        Problem Description
        Tool Application

Chapter 7: Tool Experiment and Validation
    Purpose of the Experiment
    Design and the Scope of the Experiment

    7.3 Experiment
        Case 1: Start On-Line Business Project
            Without Tool
            With Tool
            Results
        Case 4: Middle East Respiratory Syndrome Awareness
            Without Tool
            With Tool
            Results
        Case 5: DAESH Terrorist Movement
            Without Tool
            With Tool
            Results
    Results Comparison
        Rate of Improvements (ROI)
        Unpaired T-Test

Chapter 8: Discussion
    Analysis of Research Outcomes
        Resulting Outcome of the Tool
    Tool Evaluation
        Case Studies
    Tool Validation Experiment

Chapter 9: Conclusion and Future Work
    Conclusion

    9.2 Limitations
        Limited Number of Cases
        Limited Databases, Tools, and Social Media Sites
        Experiment and Validation
            Lack of Generalizability
            Limited Quality Factors Validation
            The Limited Use of the NLP Tool
            Limited Number of Cases in the Experiment
    Future Work Directions

References

Appendices
    Appendix A. Glossary
    Appendix B. Hadoop Components
    Appendix C. Dimensions of the W*H Model for Services
    Appendix D. Analytical Tools Database
    Appendix E. Related Published Papers

List of Figures

Figure 1.1: The Interest for the Term "Big Data" on Google, Feb.
Figure 1.2: Timeline of the Launch Dates of Many Major Social Network Sites Until 2005 [21]
Figure 1.3: Social Media Analytics Life Cycle
Figure 2.1: Thesis Work Packages (WPs)
Figure 3.1: Hortonworks Data Platform
Figure 5.1: Decision Back Approach Applied in the Analysis Process
Figure 5.2: The Conceptual Model of the Decision Back Capturing Approach
Figure 5.3: The W*H Service Description Model [34]
Figure 5.4: Refined Model for Decision-Back Approach for Capturing Social Media Data Analyzing Requirements
Figure 5.5: The 4+1 View Model [80]
Figure 5.6: Requirements Acquisition Tool Architecture for Decision-Back Approach for Capturing Social Media Data Analyzing Requirements
Figure 5.7: NLP Analysis Subsystem
Figure 5.8: Stanford CoreNLP Example
Figure 5.9: Data Source Recommendation Subsystem
Figure 5.10: Analytical Tool Recommendation Subsystem
Figure 5.11: Tool Interface Design (Home Page)
Figure 5.12: Tool Interface Design (Process Part 1)
Figure 5.13: Tool Interface Design (Process Part 2)
Figure 5.14: Tool Interface Design (Process Part 3)
Figure 5.15: Tool Interface Design (Result)
Figure 6.1: Annotation Guidelines [96]

Figure 6.2: Part of Speech NLP - Case 1
Figure 6.3: Named Entity Recognition NLP - Case 1
Figure 6.4: Part of Speech NLP - Case 2
Figure 6.5: Named Entity Recognition NLP - Case 2
Figure 6.6: Part of Speech NLP - Case 3
Figure 6.7: Named Entity Recognition NLP - Case 3
Figure 6.8: Part of Speech NLP - Case 4
Figure 6.9: Named Entity Recognition NLP - Case 4
Figure 6.10: Part of Speech NLP - Case 5
Figure 6.11: Named Entity Recognition NLP - Case 5
Figure 7.1: Experiment Time Recording Log
Figure 7.2: Snapshot of Trackur - Case 1 Without Tool
Figure 7.3: Time Recording Log Sheet - Case 1 Without Tool
Figure 7.4: Snapshot of Social Mention - Case 1 With Tool
Figure 7.5: Time Recording Log Sheet - Case 1 With Tool
Figure 7.6: Quality Factors Comparison - Case 1
Figure 7.7: Time Recording Log Sheet - Case 4 Without Tool
Figure 7.8: Time Recording Log Sheet - Case 4 With Tool
Figure 7.9: Quality Factors Comparison - Case 4
Figure 7.10: Time Recording Log Sheet - Case 5 Without Tool
Figure 7.11: Time Recording Log Sheet - Case 5 With Tool
Figure 7.12: Quality Factors Comparison - Case 5
Figure 7.13: Quality Factors Comparison Chart: Average Results

Figure 8.1: Experiment Summary - Correctness Factor Comparison
Figure 8.3: Experiment Summary - Efficiency Factor Comparison

List of Tables

Table 4.1: Summary of Social Media Sites Categorization Based on Their Functionalities
Table 4.2: Facebook Information [70]
Table 4.3: Twitter Information [71]
Table 4.4: LinkedIn Information [71]
Table 4.5: Google+ Information [69]
Table 4.6: Social Media Sites Characteristics Summary
Table 4.7: Social Media Analytical Tools' Characteristics Example
Table 7.1: Keywords Relevant Feeds Numbers - Case 1 Without Tool
Table 7.2: Keywords Relevant Feeds Numbers - Case 1 With Tool
Table 7.3: Keywords Relevant Feeds Numbers - Case 4 Without Tool
Table 7.4: Keywords Relevant Feeds Numbers - Case 4 With Tool
Table 7.5: Keywords Relevant Feeds Numbers - Case 5 Without Tool
Table 7.6: Keywords Relevant Feeds Numbers - Case 5 With Tool
Table 7.7: Results Comparison
Table 7.8: Correctness and Efficiency Rates of Improvement (ROI)
Table 7.9: Experiment Data Sample

List of Appendix Figures

Figure C1: The W*H Inquiry-Based Conceptual Model for Services

List of Appendix Tables

Table B1: Hadoop Ecosystem Components [2][16]
Table D1: Analytical Tools Database

List of Abbreviations

SDLC: Software Development Life Cycle
RE: Requirements Engineering
HDFS: Hadoop Distributed File System
HDP: Hortonworks Data Platform
YARN: Yet Another Resource Negotiator
NOSQL: Not Only SQL
SM: Social Media
ETL: Extract, Transform, and Load
NLP: Natural Language Processing
MIDIS: Multi-Intelligence Data Integration Services
SNAP: Stanford Network Analysis Platform
POS: Part Of Speech
MOH: Ministry of Health
MOI: Ministry of Interior
CCC: Command & Control Center
SHC: Supreme Hajj Committee
NER: Named Entity Recognizer
MERS-CoV: Middle East Respiratory Syndrome Coronavirus
WRM: Wholesale Revenue Management

CHAPTER 1: INTRODUCTION

1.1. Introduction

Social Media (SM) data is representative of Big Data, with its massive growth, its multiple channels, and the enormous scope of its content and subject matter [1]. In the business world, SM is a powerful marketing tool that is reshaping the way organizations engage with their customers and nurture their relationships with brands, products, and services [1] [2]. It can be deployed to share news from a corporate event on a near real-time basis, to create a buzz about a great new product within minutes of its launch, or to share the details of an unpleasant experience with customer services [3] [4]. It has many other innovative uses as well: political leaders try to influence public opinion through it [5], and it supports job applications, the organization of learning groups, online training sessions, and many others [2] [6].

When it comes to analyzing this powerful source of data, many organizations are concerned that the amount of collected data becomes so cumbersome that it is difficult to find the most valuable pieces of information. Many questions arise [3]:

- What if data volume gets so large and varied that one does not know how to deal with it?
- How much data should be stored: all of it, or only a subset?
- How much data should be analyzed: all of it, or only a subset?
- How can one find out which data sets are really important?

Until recently, organizations were limited to using subsets of their data, or were constrained to simplistic analyses, because the sheer volume of data overwhelmed their processing platforms [7]. There are two choices in this context [4] [8]:

- Incorporate massive data volumes in the analysis. The needed answers may be better provided by analyzing all the data, and high-performance technologies that extract value from massive amounts of data are here today. One approach is to apply high-performance analytics to the massive amounts of data, using technologies such as grid computing, in-database processing, and in-memory analytics.
- Determine upfront which data is relevant. Traditionally, the trend has been to store everything (some call it data hoarding); only when querying the data does the analyst discover what is relevant. Applying analytics on the front end instead determines relevance based on the particular context. This type of analysis determines which data should be included in analytical processes and which can be placed in low-cost storage for later use if needed.

Gathering massive amounts of data is proving impractical in a SM world that is expanding with infinite amounts of user-generated data [5]. The consequence of this approach in the case of SM data is that users are often unable to obtain specific, relevant information from large-scale, highly volatile, varied SM data collections. On the other hand, determining the relevant data upfront and specifying the analyzing requirements prior to data collection is the approach that should be followed in SM data analytics. It should not be a fishing expedition [8], because discovering patterns and information in such large and complex collections of datasets is not only challenging but also immensely time consuming. Due to advances in data acquisition and business computing, today's datasets are becoming increasingly complex [8]. Some authors and data analysts, such as [3], [8], and [9], recommend the decision-back approach, which begins with answering the right questions that can give the road map for more structured data collection and SM data analytics processes.
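As a minimal illustration of the decision-back idea, the analysis requirements (keywords, a time window, and a location constraint) can be fixed first and then used as a gate in front of the collection step. The field names and sample data below are hypothetical, not the thesis tool itself:

```python
from datetime import datetime

# Decision-back: the analysis requirements are specified *before* collection.
requirements = {
    "keywords": {"online", "business", "startup"},
    "time_window": (datetime(2015, 1, 1), datetime(2015, 5, 31)),
    "location": "Riyadh",
}

def is_relevant(post, req):
    """Admit a post only if it satisfies the upfront requirements."""
    words = set(post["text"].lower().split())
    if not (words & req["keywords"]):          # keyword constraint
        return False
    start, end = req["time_window"]
    if not (start <= post["created_at"] <= end):  # time constraint
        return False
    return post.get("location") == req["location"]  # location constraint

posts = [
    {"text": "Starting an online business", "created_at": datetime(2015, 3, 2), "location": "Riyadh"},
    {"text": "Weather today", "created_at": datetime(2015, 3, 2), "location": "Riyadh"},
]
# Irrelevant data never enters the pipeline.
collected = [p for p in posts if is_relevant(p, requirements)]
```

Because the filter runs at collection time, storage and downstream analysis only ever see data that matches the stated requirements.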
Therefore, a more structured plan for capturing SM data analyzing requirements is needed, to avoid wasting time and resources on analyzing irrelevant data.

1.2. Motivation

Analyzing SM Big Data with low-latency updates, almost in real time, is a challenge for the near future [10]. SM data has special characteristics and requires continuous investigation and analysis [11], since in real-life cases it is important to know what is happening now and to make decisions as quickly as possible [12]. Therefore, this thesis is motivated by the vision of ensuring access to the most valuable sources with minimal resources. It emphasizes the demand for a well-defined mechanism that aims to develop an effective process, one that takes the maximum value from the available data and brings decision makers close to extracting value out of SM data. The need for a value-added and well-defined process to capture SM data analyzing requirements upfront of data collection is the main contribution of this research.

1.3. Definition of Big Data

There is no perfect definition of Big Data. The term is used by many companies and in the literature with varying definitions, and it has become increasingly popular as a search keyword, as shown in Figure 1.1, based on Google's tool Google Trends.

[Figure 1.1: The Interest for the Term "Big Data" on Google, Feb (search interest over time, by year)]

Big Data is defined by Gartner, the leading IT industry research group, as: "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization" [13]. Gartner characterized Big Data by three main elements, known as the 3V's model [14]:

- Volume: The size of the data is very large, measured in terabytes and petabytes.
- Velocity: A conventional understanding of velocity typically considers how quickly the data arrives and is stored, and its associated rates of retrieval.
- Variety: Data extends beyond structured data to include unstructured data of all varieties: text, audio, video, posts, log files, etc.

Some researchers use a slightly modified 3V's model. Sam Madden describes Big Data as data that is "too big, too fast, or too hard" [15], where "too hard" refers to data that does not fit neatly into existing processing tools; "too hard" is therefore very similar to data variety. Kaisler et al. define Big Data as the amount of data just beyond technology's capability to store, manage, and process efficiently, but mention variety and velocity as additional characteristics [16]. Tim Kraska moves away from the 3V's but still acknowledges that Big Data is more than just volume. He describes Big Data as data for which "the normal application of current technology doesn't enable users to obtain timely, cost-effective, and quality answers to data-driven questions" [17]. However, he leaves open which characteristics of this data go beyond the normal application of current technology [18].

IBM uses the 3V's model but introduced an additional V, veracity:

- Veracity: Uncertainty of the data and its trustworthiness [19]; it signals that data keeps changing, so one cannot fully trust the data when making decisions.

The leader in analytics, the Statistical Analysis System (SAS) Institute, considers two additional dimensions [7]:

- Variability: In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Seasonal and event-triggered peak data loads can be challenging to manage, which further intensifies with unstructured data.
- Complexity: Today's data comes from multiple sources, and it is still an undertaking to link, match, cleanse, and transform data across systems. However, it is necessary to connect and correlate relationships, hierarchies, and multiple data linkages, or the data can quickly spin out of control.

Overall, the 4V's model, or adaptations of it, seems to be the most widely used and accepted description of what the term Big Data means [20]: Gartner's 3V's model [14] plus IBM's additional V [19]. The model clearly describes characteristics that can be used to derive requirements for respective technologies and products. The primary concerns of this thesis are volume, velocity, veracity, and variety, as they are the main barriers to an interoperable analytics platform [20]. Handling a given volume might be a really hard problem if the data arrives fast and needs to be processed within seconds; meanwhile, handling volume might get harder as the data set to be processed becomes unstructured. This adds the necessity of conducting pre-filtering steps so that only the data that matter enter to be processed and analyzed.

1.4. Social Media Big Data

1.4.1. Definition of Social Media

Nowadays, SM networks such as MySpace, Facebook, Cyworld, Twitter, Instagram, Bebo, Snapchat, LinkedIn, etc. (see Figure 1.2) have become increasingly popular, and they support a wide range of interests and practices. While their key technological features are fairly consistent, the cultures that emerge around social networks are varied. Most sites support the maintenance of pre-existing social networks, but others help strangers connect based on shared interests, political views, or activities. Some sites cater to diverse audiences, while others attract people based on a common language or shared racial, sexual, religious, or nationality-based identities. Sites also vary in the extent to which they incorporate new information and communication tools, such as mobile connectivity, blogging, and photo/video sharing [21]. Many such social networks are extremely rich: they typically contain a tremendous amount of content and linkage data that can be leveraged for analysis. The linkage data is essentially the graph structure of the social network and the communications between entities, whereas the content data contains the text, images, and other multimedia data in the network [22] [23].
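The distinction between linkage data and content data can be sketched as a toy graph structure. The names and posts below are illustrative placeholders, not data from the thesis:

```python
# A toy social network: linkage data is the graph structure (who connects
# to whom); content data is the text attached to each user.
linkage = {                      # adjacency list: user -> users they follow
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": [],
}
content = {                      # user -> list of posted texts
    "alice": ["Launching our new product today!"],
    "bob": ["Great event last night"],
    "carol": [],
}

def followers(user):
    """Derive the reverse edges (followers) from the linkage data."""
    return [u for u, follows in linkage.items() if user in follows]

# carol is followed by both alice and bob
print(sorted(followers("carol")))  # → ['alice', 'bob']
```

Analyses over the linkage data (e.g., who influences whom) and over the content data (e.g., what is being said) draw on these two complementary structures.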

Figure 1.2: Timeline of the Launch Dates of Many Major Social Network Sites Until 2005 [21]

Social networks have been defined by [24] as web-based services that allow individuals to:

(1) Construct a public or semi-public profile within a bounded system.
(2) Articulate a list of other users with whom they share a connection.
(3) View and traverse their list of connections and those made by others within the system.

These connections or relationships are often displayed in a diagram, where entities are the points (also called nodes) and connections are the lines [25]. This is the definition used for SM sites in this thesis, as it is widely used in many publications [24] [25].

1.4.2. Data Capturing and Analyzing Challenges of Social Media

Data generated from SM sites differ from the conventional attribute-value data of classical data mining. SM data are largely user-generated content on SM sites [2]. SM data are typically Big Data, with special characteristics: volatile, noisy, distributed, unstructured, and vast. The main SM data challenges and issues, as a representative of Big Data, are [26]:

- Privacy and Security: The most important issue with Big Data; it is sensitive and carries conceptual, technical, as well as legal significance.
- Data Access and Sharing of Information: If data is to be used to make accurate decisions in time, it must be available in a precise, complete, and timely manner. This makes the data management and governance process more complex, adding the necessity of making data open and available to government agencies in a standardized manner, with standardized APIs, metadata, and formats, thus leading to better decision making, business intelligence, and productivity improvements.
- Storage and Processing Issues: The available storage cannot accommodate the large amount of data being produced, since SM sites are themselves a great contributor, along with sensor devices. Processing such enormous sets of data is also time consuming: to find suitable elements, the entire data set needs to be scanned, which is practically infeasible.
- Analytical Issues: The main challenging questions are:
  (1) What if the data volume gets so unwieldy and varied that it is too difficult to manipulate?
  (2) Does all the data need to be stored?
  (3) Does all the data need to be analyzed?
  (4) Which data points are really important, and for what reasons?
  (5) How can the data be used to the best advantage?
- Skill Requirements: Since Big Data is a fledgling and emerging technology, it needs to attract organizations and young professionals with diverse new skill sets. These skills should not be limited to technical ones but should extend to research, analytical, interpretive, and creative ones.
- Technical Challenges: (1) Fault tolerance. (2) Scalability. (3) Quality of data. (4) Heterogeneous data.

Indeed, this thesis is motivated to address the storage and processing issues and the analytical issues. The lack of an effective process for capturing SM data analyzing requirements in organizations adopting SM data solutions can have a negative impact on finances as well as on the industry's reputation and credibility [27].

1.5. Requirements Engineering for Social Media Big Data Analytics

Requirements acquisition is recognized as one of the most important, albeit difficult, phases in software engineering [28]. The literature repeatedly cites the role of well-defined requirements and of the requirements acquisition process in problem analysis and project management as beneficial to software development throughout the life cycle: during design, coding, testing, maintenance, and documentation of software [28] [29]. By recognizing that SM Big Data collection and analytics are similar to designing IT software systems, one needs to invest in a requirements engineering approach that specifies the requirements prior to data collection and provides the structure for gathering and collecting the user's analytics requirements. Therefore, a tool architecture for requirements acquisition is the supporting software solution for the requirements engineering phase of the SM data collection process. The guiding questions within the tool define a structured process for system analysts to elicit the SM data analyzing requirements in a more effective and user-friendly manner. This approach to requirements engineering is one of the main principles of the Software Development Life Cycle (SDLC) [28].

1.6. Problem Statement

As a result of SM's rapid growth, recent years have seen an accelerating shift in different domains away from traditional channels, such as print and broadcast, toward digital channels [1]. This transformation is driven by the cost advantages and precision offered by digital platforms, in particular the growing range of applications to manage the increasing volume and influence of SM [5]. Some statistics offer an insight into the scope of the SM phenomenon [1]:

- 1.43 billion people worldwide visited a social networking site last year (2014).
- Nearly 1 in 8 people worldwide have their own Facebook page.
- In 2014, one million new accounts were added to Twitter every day.
- Three million new blogs come online every month.
- 65 percent of SM users say they use it to learn more about brands, products, and services.

The amount of information continues to increase at an enormous rate. Therefore, it is imperative that businesses, organizations, and associations find better approaches to information filtering and requirements capturing, which would effectively decrease the information overload and improve the precision of analytics results [30]. All things considered, SM data analytics can only be effective when the underlying data collection processes are able to leverage the information relevant to a particular domain [31]. It is critical to improve the usefulness of the analysis results and accelerate SM data analytics. Therefore, a more powerful mechanism of guidance for capturing data analytics requirements is needed, to reduce both the time and the resources consumed in analyzing irrelevant data.

1.7. Research Questions and Objectives

The study examines the decision-back approach to data capturing and its applicability to capturing SM data analytics requirements. Therefore, the research question is formulated as follows: How can we define an architecture for a SM data requirements capturing tool that accelerates the analytics tasks?

This research defines a requirements acquisition tool architecture that captures SM data analytics requirements using the decision-back approach, which can play a role in the SM data capturing process. The main objectives of the study are:

- To examine SM sites and determine what makes them different from each other.
- To explore different SM data analytical tools, their different techniques, and their main vendors.
- To examine the decision-back approach and how it can leverage SM data collection.
- To provide a well-planned tool architecture that eases the analyst's task of capturing SM analysis requirements. The tool applies the decision-back approach by determining what the output requirements are and then filtering the input data accordingly.
- To examine specific real-life cases from different problem domains using this tool, to prove its worthiness.

- To validate the tool for its correctness and efficiency, to ensure that it answers the research question.

Therefore, the goal of this thesis is to define coherent processes for acquiring users' analysis requirements. Thus, data analytics can be done in smaller time frames, allowing decisions to be made faster and with higher precision, by improving the current data capturing process so that one can draw accurate and useful conclusions. This will contribute to changing the way people collect analysis requirements and subsequently transform decision making in a way that gives businesses the required advantage.

Scope of the Thesis

Figure 1.3 is a simplified adaptation of the SM analytics life cycle. As presented, it has four main stages: Data Collection, Data Processing, Data Storage, and Data Analysis [2]. The first stage, Data Collection, is the phase concerned with collecting SM data from different SM sources, e.g. blogs, microblogs, etc. The goal of this thesis is to reduce this tremendous amount of data by identifying analysis requirements prior to the data collection phase of an SM analytics solution. This is inspired by, and is similar to, the primary phase of the Software Development Life Cycle (SDLC), which is Requirements Engineering (RE) [28]. This study follows RE in providing a well-defined tool architecture to capture SM analysis requirements, to improve the data collection process and accelerate the analytics tasks. Investigating the other phases of the SDLC or the Big Data analytics process, and examining other SM analytics problems, are beyond the scope of this thesis.

Figure 1.3: Social Media Analytics Life Cycle

1.9. Related Published Papers

Published papers under this research:

[1] M. Alswilmi, A. Dahanayake, (2015), "A Requirements Acquisition Tool Architecture for the Decision Back Approach for Social Media Big Data Capturing", 5th Advances in Software Engineering Conference, Prince Sultan University.

[2] M. Alswilmi, N. Alnajran, A. Dahanayake, (2014), "Conceptual Framework for Big Data Analytics Solutions", Proceedings of the 24th International Conference on Information Modelling and Knowledge Bases, EJC.

[3] M. Alswilmi, N. Alnajran, A. Dahanayake, (2013), "Conceptual Framework for Big Data Analytics Solutions", 2nd Advances in Software Engineering Conference, Prince Sultan University.

Outline of the Thesis

Apart from the introduction, the remainder of this research is structured into eight chapters: Research Method; Literature Review; SM Types and Analytical Techniques; Decision-back Data Capturing Approach for SM Data; Case Study; Tool Experiment and Validation; Discussion; and Conclusion and Future Work.

Chapter 2 consists of the research methods. It provides a demonstration of the methodology adopted to conduct this research. The literature review in Chapter 3 discusses the related works, including some available data reduction approaches, highlighting the innovativeness of this research. Additionally, an overview of the tools and frameworks that have been used to build the proposed tool is presented. Chapter 4 discusses SM site categorizations, SM data analytical tools, and different analytical techniques. The tool architecture is built and proposed as the core of this research in Chapter 5, along with supporting materials. Chapter 6 provides an application of the framework on five case studies from different problem domains. The framework is validated through a prototype and an experiment to prove its correctness and efficiency in Chapter 7. Afterwards, the research analysis, a discussion on the tool prototype, and its evaluation and validation are provided in Chapter 8. Finally, Chapter 9 contains the conclusion, the limitations of this research, and future research directions.

CHAPTER 2: RESEARCH METHODS

In this Chapter, the research methods followed within this study are outlined, including the research design, the participants, the techniques and data analysis methods used, and the evaluation and validation approaches for the results. Moreover, the tools used to conduct the experimental work are also discussed.

Research Methods

Research Design

The major aim of this research is to apply the decision-back approach concept and to develop a requirements acquisition tool architecture to capture SM data analytics requirements. For this purpose, qualitative and, to some extent, quantitative research methods of investigation are chosen. The research is descriptive in nature and allows gathering a more in-depth contextual understanding of the topic. Initially, the inductive approach is followed to analyze the qualitative data. The research moves from general information about the decision-back capturing approach and SM analysis requirements toward more specific conclusions about how to apply this data capturing approach in SM Big Data analytics and in building a requirements acquisition tool architecture.

Research Participants

In order to maximize the validity of findings, the research uses a hybrid access type [32] to gather the relevant data. The primary source of data collection is in-depth Internet access to SM sites and various scientific publications and white papers that are of interest to this research. Supporting data is collected through traditional access, observing several leading companies who benefit from Big Data and SM data capture and analysis technologies. The choice of companies is determined by the availability of information, reputation, and level of involvement in this field, such as IBM, Gartner, and SAS.

Research Techniques and Data Analysis

This is a mixed-methods research. It uses a variety of data collection techniques and analytical procedures to develop the foundation and to validate the tool architecture. In order to maximize the validity and trustworthiness of the findings, the research uses a hybrid access type to gather a richer set of data. The research advanced through multiple Work Packages (WPs) to develop the tool architecture and the tool's prototype, as explained below (see Figure 2.1).

Research Work Packages

1. WP1: Learning from the Available Literature
1.1. The primary source of data collection is literature exploration, in-depth Internet access to SM sites and SM analytical tools, and perusing various relevant publications and white papers that discuss the decision-back approach and SM data analytics.
1.2. Supporting data is collected through:
- Traditional access and conversations with interested participants in scientific conferences such as the European Japanese Conference.
- Observations involving reviews of the documented data analytics solutions of several companies.

2. WP2: Developing the Conceptual Framework to Facilitate the Decision-Back SM Big Data Requirements Capturing
2.1. By examining the decision-back approach and how it has been used in a variety of literature, general questions from the article [8] have been used to identify the main concepts for using this approach for analyzing SM data.
2.2. Each concept in the framework is examined to describe how it can be beneficial for capturing SM data within more efficient timelines and with less consumption of resources.
2.3. Connecting the framework with the SM analytics life cycle and showing its relevance to the SDLC.

3. WP3: Fine-tuning the Conceptual Framework
3.1. Examining the W*H Conceptual Model for Services [33], and customizing it to make the concepts in the framework more descriptive.
3.2. After relating the conceptual framework to the SDLC, and showing how it works as a requirements acquisition phase in the Big Data analytics life cycle, the requirements framework is built and its components are described accordingly.

4. WP4: Design of a Tool Prototype and the Component Architecture of a Tool that Supports the Decision-Back SM Big Data Requirements Capturing
4.1. Based on the requirements acquisition framework, the tool architecture is designed.
4.2. Each model in the tool is described, showing how it supports capturing the SM data analyst's requirements.

5. WP5: Validation
5.1. Two types of validation tests are provided: theoretical and experimental.
5.2. Theoretical: by examining case studies from different problem domains.
5.3. Experimental: by using actual analytical tools and measuring the correctness and efficiency quality factors.

6. WP6: Discussion, Conclusions and Future Research Directions
6.1. Discussing the worthiness of the provided tool architecture by comparing two results: analysis with the tool, and analysis without the tool.
6.2. Concluding the research, discussing its limitations, and providing some future work directions for further improvements.

Figure 2.1: Thesis Work Packages (WPs)

2.2. Research Instruments and Procedures

This research attempts to build a requirements acquisition tool architecture for the decision-back approach for capturing SM data. In order to validate this tool for correctness and efficiency, a prototype consisting of a combination of tools needs to be available to support the tool evaluation process; these tools are described below.

Social Mention

Social Mention is a free Social Media search and analysis platform that aggregates user-generated content from across the universe into a single stream of information. It allows users to

easily track and measure what people are saying about a person, a company, a new product, or any topic across the web's Social Media landscape in real time. Social Mention monitors 100+ Social Media properties directly, including Twitter, Facebook, FriendFeed, YouTube, Digg, and Google.

Trackur

Trackur is a SM monitoring tool designed to assist companies and public relations (PR) professionals in tracking what is said about brands on the Internet. It scans hundreds of millions of web pages, including news, blogs, videos, images, and forums, and alerts the user to anything that matches the monitored keywords. It costs at least $97 a month and offers a 10-day trial.
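The keyword-alerting behavior described above can be sketched in a few lines. This is a toy illustration of the idea, not Trackur's actual implementation; the items, field names, and monitored keywords are all hypothetical.

```python
# Minimal sketch of keyword-based monitoring: scan a stream of collected
# items and flag any whose text mentions a monitored keyword.
# ITEMS and the keyword list are illustrative, not real data.

def match_alerts(items, keywords):
    """Return the items whose text mentions any monitored keyword."""
    lowered = [k.lower() for k in keywords]
    return [item for item in items
            if any(k in item["text"].lower() for k in lowered)]

ITEMS = [
    {"source": "blog",  "text": "Great review of BrandX headphones"},
    {"source": "forum", "text": "Anyone tried the new phone?"},
    {"source": "news",  "text": "BrandX announces quarterly results"},
]

alerts = match_alerts(ITEMS, ["brandx"])
print(len(alerts))  # 2 items mention the monitored keyword
```

A real monitoring tool adds crawling, scheduling, and notification on top of this core matching step.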

CHAPTER 3: LITERATURE REVIEW

3.1. Big Data and Social Media Related Works

Many software startups and research and development efforts are actively trying to harness the power of Big Data and SM, and to create software with the potential to improve almost every aspect of human life. As these efforts continue to increase, full consideration needs to be given to the engineering aspects of Big Data and SM software. Since these systems exist to make predictions on complex and continuous massive datasets, they pose unique problems during the collection, processing, and analysis of data that needs to be delivered on time and within budget [34]. This research focuses on the SM requirements capturing approach and on studies that discuss SM and data capturing approaches.

Innovative Big Data and Data Capturing Approaches

IBM in [35] provides a means of classifying Big Data business problems according to specified criteria. They have provided a pattern-based approach to facilitate the task of defining an overall Big Data architecture. Their idea of classifying data in order to map each problem to its suitable solution pattern provides an understanding of how a structured classification approach can lead to an analysis of the needs and a clear vision of what needs to be captured. Moreover, IBM has presented several real-life samples of Big Data case studies in [36].

The authors in [37] studied different Big Data types and problems. They developed a conceptual framework that classifies Big Data problems according to the format of the data that must be processed. It maps the Big Data types to the appropriate combinations of data processing components, which are the processing and analytic tools needed to generate useful patterns from that type of data.

The Constraint-Driven Data Mining technique proposed by [38] identifies the following classes of constraints: database constraints, pattern constraints, and time constraints.
Database constraints are used to specify the source dataset. Pattern constraints specify which patterns are interesting and should be returned by the query. Finally, time constraints influence the

process of checking whether a given data/sequence contains a given pattern. However, data mining can only be applied to structured data that can be stored in a relational database [39]; nevertheless, this constraint-driven approach provides an understanding of how these types of constraints can lead to more efficient data collection.

The article [40] proposes a novel approach for the consistent collective evaluation of multiple continuous queries for filtering two different types of data streams: a relational stream and an XML stream. The proposed approach provides region-based selection constructs: an attribute selection construct for relational queries and a path selection construct for XPath queries. Both collectively evaluate the selection predicates of the same attribute (path), based on the precomputed matching results of the queries in each of the disjoint regions divided by the selection predicates. The performance experiments show that the proposed approach is more efficient and stable at run time.

C. Anne and B. Boury in [41] proposed a framework facilitating the integration of heterogeneous unstructured and structured data, enabling hard/soft fusion and preparing for various analytics exploitation. It provides timely and relevant information to the analyst through intuitive search and discovery mechanisms. The authors described the design and implementation of a prototype for scalable Multi-Intelligence Data Integration Services (MIDIS), based on a flexible data integration approach making use of Semantic Web and Big Data technologies.

In [42], the white paper published by Intel walks through the challenge of extracting Big Data from multiple sources. It explains how the Hadoop infrastructure can contribute to the Big Data Extract, Transform and Load (ETL) process, and illustrates, from a technical point of view, the process of loading different data formats from multiple data sources into Hadoop's warehouse.
However, they did not address reducing the capture of useless data, nor producing real-time management decisions.
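The three constraint classes identified by [38] can be illustrated with a small sketch. This is not the implementation from the cited work; the records, field names, and constraint values are all invented for illustration.

```python
# Minimal sketch of constraint-driven data selection: database constraints
# pick the source, pattern constraints keep only interesting records, and
# time constraints bound the window that is checked.

from datetime import datetime

RECORDS = [
    {"db": "sales", "pattern": "complaint", "ts": datetime(2014, 3, 1)},
    {"db": "sales", "pattern": "praise",    "ts": datetime(2014, 6, 1)},
    {"db": "stock", "pattern": "complaint", "ts": datetime(2014, 4, 1)},
    {"db": "sales", "pattern": "complaint", "ts": datetime(2015, 1, 1)},
]

def constrained_query(records, db, patterns, start, end):
    return [r for r in records
            if r["db"] == db                 # database constraint
            and r["pattern"] in patterns     # pattern constraint
            and start <= r["ts"] < end]      # time constraint

hits = constrained_query(RECORDS, "sales", {"complaint"},
                         datetime(2014, 1, 1), datetime(2015, 1, 1))
print(len(hits))  # only the 2014 sales complaint survives all constraints
```

The point mirrored here is that each constraint class prunes the candidate set before any expensive mining takes place, which is why constraint-driven approaches can make data collection more efficient.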

Literature Analysis

From the IBM contributions [35], [36] in the field of Big Data, the idea of a decision-back concept for a structured approach to SM data collection has emerged. Moreover, [37], [40], [41], [42] discussed how classifying data according to some parameters can lead to a better understanding of the problem at hand, while [38] discussed the constraint-driven approach and how it can provide an understanding of how these types of constraints can lead to more efficient data collection.

Innovative Social Media Data Collection and Analytics Approaches

In [43] the authors present a multi-layered knowledge extraction approach for social networks, with a comprehensive survey of relevant notions and techniques from multiple disciplines. They analyzed the SM characteristics in multi-mode, multi-layer knowledge dimensions using Twitter as an example. They also improve the hypergraph model of social network behaviors based on the dimensions proposed in the model, with a case study in Twitter illustrating the multi-dimensional relations between Twitter users. Their main focus was to improve the understanding of social network services.

The authors in [23] studied the application of the concepts and techniques of web mining to online social networks, in terms of how to use web mining and a general process for its use in online social network analysis. They discussed several challenges in this research area; for example, data sampling is a big issue when using web mining for online social network analysis. In other web mining applications, data sampling is a simple task to reduce the data size. However, in online social network analysis, it becomes difficult to select suitable samples representative of the real social networks.
In [44], the authors empirically designed and developed the Real-time Twitter Trend Mining (RT²M) system, which allows one in real time to: 1) crawl and store every textual tweet produced in Twitter into a local database; 2) keep track of social issues by temporal topic modeling; and 3) visualize mention-based user networks. They also demonstrated a

case study related to the 2012 Korean presidential election carried out by RT²M. The major contribution of the study is making it possible to mine dynamic social trends and content-based networks generated in Twitter through integration techniques.

The authors in [45] developed models of content-based social network analysis in order to find the discussed topics. They focused on the Latent Dirichlet Allocation (LDA) model, the Author-Recipient-Topic (ART) model, Gibbs sampling, and automatic labeling of topics. From this, they built a system with three main parts: term extraction, detection, and automatic labeling of topics.

The authors in [46] examined the extent to which emotion is present in Myspace comments, using a combination of data mining and content analysis, and exploring age and gender. They began the process of moving from opinion mining to emotion detection by using a case study of Myspace comments to demonstrate that it is possible to extract emotion-bearing comments on a large scale, to gain preliminary results about the social role of emotion, and to identify key problems for the task of identifying emotion in informal textual communications online.

The paper [47] reported on an in-depth case study of 15 graduate students who used NodeXL to learn Social Network Analysis (SNA) concepts and apply them to study online interaction. Their successes demonstrate that novice analysts can effectively adopt and apply SNA techniques within a relatively short time. Their practices are characterized by the Network Analysis and Visualization (NAV) process model, which can be used to articulate challenges in the sense-making of network datasets.

In [48], the authors considered the problem of inferring the sentiment or emotion expressed in SM content.
Furthermore, motivated by the nature of security informatics, they seek the capability to infer sentiment/emotion for new classes of content (e.g., written in an unfamiliar language or about a novel emerging topic) without the need to collect and label new training documents. They presented a new method for estimating sentiment or emotion which addresses the challenges associated with SM analysis. They formulate the problem

as one of text classification, model the data as a bipartite graph of documents and words, and construct the sentiment/emotion classifier using a combination of semi-supervised learning and graph transduction. The classifier can be implemented with no labeled training documents, and enables accurate text classification using only a modest lexicon of words of known sentiment/emotion polarity.

The authors in [49] conducted a study to evaluate sentiment analysis for the Arabic language using two popular social analysis tools: Social Mention and SentiStrength. A dataset of Arabic posts was collected and prepared from Facebook and Twitter posts. A tool was developed to judge the polarity of comments based on their contents. Several classification algorithms were used and evaluated to judge which one can better predict comment polarities, especially for the Arabic language.

Literature Analysis

The literature in [43], [23], [44], [45], [46], [47], [48], [49] discussed how utilizing SM characteristics can benefit the analysis of SM data. Nevertheless, these works were more interested in improving the SM analysis phase in order to improve and accelerate the analytics process.

Software Engineering and Social Media Data Analytics

Reverse Engineering

Decision-back analysis is the phase that introduces requirements engineering into the SM Big Data analytics process. This requirements engineering phase, as presented for Big Data capturing using backwards analysis, is called reverse engineering in software engineering terminology [50]. It refers to looking at the properties of the solution that is needed (the output) to figure out the appropriate input [51]. The system's specifications may be reverse engineered and provided as an input to the requirements specification process for system replacement. In reverse engineering, the system may be restructured and re-documented without changing its functionality, in order to support frequent maintenance [50] [51].

Software Requirements Engineering

The hardest part of building a software system is deciding precisely what to build. No other part of the conceptual work is as difficult as establishing the detailed technical requirements, including all of the interfaces to people, to machines, and to other software systems [52]. Therefore, for any software application to be successful, whether it is a SM data application or not, Requirements Engineering must exist as the primary phase in its life cycle [51].

Decision-back Data Capturing Approach

The authors in [9] applied the decision-back approach to heterogeneous legacy databases. To start with, they developed a novel approach to data analysis by reversing the analysis task: the analysis task drives the features of data collectors. These collectors are small databases which collect data within their interest profile. Data from the collector databases is then used for the presentation database. In addition, they built their analysis approach in a layered database, developing it for a real example: the Tokyo metropolitan railway databases. The feature of this approach is to realize dynamic data integration and analysis among heterogeneous databases by computing spatial and temporal interrelationships objective-dependently, while such integration determines how to retrieve, analyze, and extract new information generated from the viewpoint of spatial and temporal occurrences among legacy databases.

The master's thesis [53] applied the concept of backward analysis by introducing a scenario-based data collection approach to the Big Data solution. The research studied the scenario-based factors that govern the data collection process and organized them in the form of primary, secondary, and additional questions. These questions form the kernel of the Requirements Specification Framework developed as a structured, well-defined approach for the scenario-based Big Data collection process.
It has emphasized the essence of analyzing the requirements of the scenario that needs to be addressed prior to attempting

data collection. The study developed an understanding of the different Big Data analytical techniques, in order to know how to choose the right one according to the scenario under investigation.

The decision-back approach is discussed in the article [8] in the context of advanced data analytics. The authors claim that companies have to take three steps to capture value from Big Data and advanced analytics. First, companies must be able to choose the right data and manage multiple data sources. Second, they need the capability to turn the data into insights; that is, they must combine deep analytical talent with commercial judgment. Third, and most critical, management must undertake a transformational-change program so that the insights actually yield better business decisions and translate into effective frontline action. To choose the right data, they recommend a decision-back approach, which begins with the company answering two related questions: Which decisions do we want to improve? What data and analyses will help us improve those decisions? According to their observations, they saw great potential for consumer-facing organizations that adopt Big Data and advanced analytics as a platform for growth.

Literature Analysis

Bernhard Thalheim and Yasushi discussed the application of this approach to heterogeneous legacy databases [9]. While their analysis-driven data collection approach proved to be workable, it is limited to legacy databases. Nouf Alnajran discussed the application of this approach to Big Data [53]. She applied scenario-based data collection to collect the relevant data according to a scenario. This scenario is built by asking two sets of questions inspired by the W*H model [33]. Her approach improved the data collection process, but it did not consider the special characteristics of SM data.
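The two decision-back questions from [8] can be sketched as a toy selection procedure: fix the decision to improve, derive the data requirements from it, and only then filter the candidate sources. The decision catalogue, source names, and attribute tags below are hypothetical illustrations, not part of the cited work.

```python
# Minimal sketch of the decision-back idea: start from the decision
# (the output), derive its data requirements, and keep only the input
# sources that can satisfy them. All names and tags are invented.

DECISION_REQUIREMENTS = {
    # decision to improve -> data attributes the analysis needs
    "improve_campaign_targeting": {"demographics", "brand_mentions"},
    "reduce_churn":               {"sentiment", "support_tickets"},
}

SOURCES = {
    "twitter_stream": {"brand_mentions", "sentiment"},
    "facebook_pages": {"demographics", "brand_mentions"},
    "server_logs":    {"clickstream"},
}

def relevant_sources(decision):
    """Keep only sources that supply at least one required attribute."""
    required = DECISION_REQUIREMENTS[decision]
    return sorted(name for name, attrs in SOURCES.items()
                  if attrs & required)

print(relevant_sources("improve_campaign_targeting"))
# server_logs is dropped: it answers no question the decision needs
```

The design choice mirrored here is that filtering happens before collection: sources that contribute nothing to the decision are never crawled, which is exactly how decision-back capturing reduces irrelevant data.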

3.4. Related Tools and Environments for Social Media Data Analytics

The IBM white paper [1] describes the role that SM can play in presenting a more strategic view of customer data, and how the right combination of technologies can deliver insight to help companies more effectively meet perpetually shifting consumer demands expressed through, and influenced by, these dynamic communication channels. It covers the implications not only for marketing and sales, but also for IT, and considers how and why SM tools and applications can be integrated with existing technology investments.

Enghouse Systems' Wholesale Revenue Management (WRM) platform [4] is a suite for interconnect and roaming billing, including intelligent routing and efficient fraud detection capabilities. WRM includes a Fraud Management system, earlier branded under the name Watchdog FMS by Basset. This system collects only relevant data by determining upfront what is actually relevant. It has the ability to apply analytics on the front end to determine relevance based on context. This type of analysis determines which data should be included in analytical processes and what can be placed in low-cost storage for later use if needed. They believe that the only way to maximize the system's effectiveness in fighting wholesale fraud is to rely on a solution dedicated to analyzing business-specific relevant data, which is essentially applying the decision-back approach to collect relevant data.

The SAS white paper [54] proposed the "stream it, score it, store it" approach, which determines the 1 percent that is truly important in all the data within an organization. The idea is to use analytics to determine relevance instead of always putting all data in storage before analyzing it. SAS's approach incorporates high-performance analytics and analytical intelligence into the data management process for highly efficient modeling and faster results.
For instance, one can analyze all the information within an organization (such as e-mail, product catalogs, wiki articles and blogs), extract important concepts from that information, and look at the links between them to identify and assign weights to millions of terms and concepts. This organizational context is then used to assess data as it streams into the organization, churns out of internal

systems, or sits in offline data stores. This up-front analysis identifies the relevant data that should be pushed to the enterprise data warehouse or to high-performance analytics.

Hadoop: The Big Data Management Framework

This Section provides an overview of Apache Hadoop as a Big Data processing framework, together with its core components. This framework is one of the most used platforms to store and process Big Data [55]. Hadoop's components are explained in a highly simplified manner in Appendix B. A detailed description of them can be found in [56] [57] [58] [59] [60] [61] [62] [63].

Apache Hadoop

Hadoop is the name that creator Doug Cutting's son gave to his stuffed toy elephant. He was looking for something that was easy to say and stands for nothing in particular [56]. Hadoop provides a distributed file system (HDFS) and a framework for the capturing, processing, and transformation of very large data sets using the MapReduce [58] paradigm. The important characteristic of Hadoop is the partitioning of data and computation across thousands of hosts, and the execution of application computations in parallel, close to their data. Hadoop is an Apache project; all components are available via the Apache open source license. Yahoo! has developed and contributed 80% of the core of Hadoop [59] (see Figure 3.1).
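The MapReduce paradigm that Hadoop implements at scale can be illustrated with a self-contained, single-process word-count sketch. Real Hadoop distributes the map, shuffle, and reduce phases across many hosts; this toy runs them in memory to show the flow of key-value pairs.

```python
# Single-process sketch of MapReduce: map each record to (key, value)
# pairs, shuffle (group) by key, then reduce each group to a result.

from collections import defaultdict

def map_phase(record):
    for word in record.split():
        yield word.lower(), 1           # emit (key, value) pairs

def reduce_phase(key, values):
    return key, sum(values)             # aggregate all values per key

def mapreduce(records):
    shuffle = defaultdict(list)         # the shuffle groups values by key
    for record in records:
        for key, value in map_phase(record):
            shuffle[key].append(value)
    return dict(reduce_phase(k, v) for k, v in shuffle.items())

counts = mapreduce(["big data big decisions", "big insights"])
print(counts["big"])  # 3
```

Because each map call touches one record and each reduce call touches one key, both phases can run independently on many machines, which is the property Hadoop exploits to process very large data sets in parallel close to their data.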

Figure 3.1: Hortonworks Data Platform

Literature Analysis

The leading IT companies behind the SAS Information Management System [54] and the WRM platform [4] discussed the approach and how they apply it in their tools to capture relevant data. Their approach to determining the relevant data is based on data context, so the irrelevant data is placed in low-cost storage for later use if needed.

Theories and Frameworks

W*H Conceptual Model for Services

The authors in their research [33] studied the concept of services as a design artifact. They aimed to bridge the gap between the main service design initiatives and their abstraction-level interpretations. In order to address their research goal, the authors

have developed an inquiry-based conceptual model for designing service systems. This model formulates the right questions that specify service systems innovation, design, and development.

Stanford CoreNLP Framework

Stanford CoreNLP is a tool that provides a set of natural language analysis tools which can take raw text input and give the base forms of words and their parts of speech, recognize named entities (e.g. names of companies), and normalize dates, times, and numeric quantities. In addition, it can mark up the structure of sentences in terms of phrases and word dependencies, indicate sentiment, and identify which noun phrases refer to the same entities. Stanford CoreNLP is an integrated framework whose goal is to simplify the application of a set of linguistic analysis tools to a piece of text. Starting from plain text, a developer can run all the tools on it with just two lines of code. It is designed to be highly flexible and extensible: with a single option, a developer can change which tools should be enabled and which should be disabled. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications [64].

Summary

The analysis of massive data volumes is not recommended in SM data analytics [54]. Even though relatively low-cost storage has driven a propensity to hoard data, this habit is seen as unsustainable. Organizations do not have to grapple with overwhelming data volumes if the data does not meet the requirements, nor do they have to rely solely on analysis based on subsets of available data. Given the pivotal role of collecting relevant data, it is important to establish both a better information engineering pipeline and a governance

approach for data collection [4] [8] [54]. One study has shown that determining upfront what the relevant data is can accelerate analysis tasks, which yields better business decisions and translates into more effective frontline actions [4]. Consequently, much research has been conducted on applying the decision-back capturing approach to capture relevant data [4] [8] [9] [53] [54]. The research presented in this thesis examines applying this approach to build a requirements acquisition tool architecture to capture the necessary SM data. Until recently, this area has been surprisingly neglected, as the majority of the literature on SM data analytics has focused on improving the process of data analytics [43] [23] [44] [45] [46] [47] [48] [49].

To emphasize the effects of the decision-back capturing approach on accelerating the analysis tasks, the different methodologies that have been used to apply this approach were examined. Bernhard Thalheim and Yasushi discussed the application of this approach to heterogeneous legacy databases [9]. While their analysis-driven data collection approach proved to be workable, it is limited to legacy databases. Nouf Alnajran discussed the application of this approach to Big Data [53]. She applied scenario-based data collection to collect the relevant data according to a scenario. This scenario is built by asking two sets of questions inspired by the W*H model [33]. Her approach improved the data collection process, but it did not consider the special characteristics of SM data. Companies such as SAS, with its Information Management System [54], and Enghouse, with its WRM platform [4], discussed the approach and how they apply it in their tools to capture relevant data. Their approach to determining the relevant data is based on data context, so the irrelevant data is placed in low-cost storage for later use if needed.
Therefore, this research focuses on the development of a requirements acquisition tool architecture that applies the decision-back approach to capturing SM data. It integrates requirements acquisition [50] [51] into data capturing when collecting SM data for analytics. Furthermore, this research studies the main concepts in the process of collecting SM data based on [8] and tailors that process to the analysis needs, in order to decrease the analysis time and increase the value by increasing the accuracy of the results and enabling faster management decisions.

CHAPTER 4: SOCIAL MEDIA TYPES AND ANALYTICAL TECHNIQUES

In this chapter, SM sites are introduced and their different characteristics are underlined. SM analytical tools are also discussed, highlighting their unique characteristics and main vendors.

4.1. Social Media Types

Since their introduction, SM sites such as MySpace, Facebook, Twitter, and LinkedIn have attracted millions of users, many of whom have integrated these sites into their daily practices. As of this writing, there are hundreds of SM sites, with various technological affordances, supporting a wide range of interests and practices [21]. Since these SM sites share some characteristics (Chapter 1 Section 1.4), an efficient strategy is needed to identify the best data source for the project being undertaken. Melissa Leiter [65], an SM expert, suggested two important questions to consider when choosing the data source:

Who are the problem's targeted people? Once you have a good understanding of your audience, you can then move forward [65]. User type and geographic demographics can be vital factors in choosing the best data source. For example, information such as "22.3% of Facebook members are from the United States" and "26% of members are between 18 and 24 years old" can help marketers choose Facebook to initiate a campaign if they are targeting the U.S. market.

What SM platform(s) do they prefer? This can be answered by categorizing SM sites by their special characteristics. The authors in [66] [67] recommended categorizing SM sites according to their functionalities into five categories. These categories are presented in the following sections and summarized in Table 4.1.

4.1.1. Social Media Sites Categorizations

1. Social Networking
Definition: Using websites and applications to communicate informally with others, locate people, and share similar interests. Allows users to directly connect with one another through groups, networks, and location.
Examples: Facebook, Google+, and LinkedIn.

2. Microblogging
Definition: Posting very short entries or updates on a social networking site. Allows users to subscribe to other users' content, send direct messages, and reply publicly, as well as create and share hashtags about related subjects.
Examples: Twitter and Tumblr.

3. Blogging
Definition: Recording opinions, stories, articles, and links to other websites on a personal website.
Examples: WordPress and Blogger.

4. Photo Sharing
Definition: Publishing a user's digital photos, enabling the user to share photos with others either publicly or privately.
Examples: Instagram, Flickr, and Pinterest.

5. Video Sharing
Definition: Publishing a user's digital videos, enabling the user to share videos with others either publicly or privately. Allows users to embed media in a blog or Facebook post, or link media to a tweet.
Examples: YouTube, Vimeo, and Vine.

Category            Examples
Social Networking   Facebook, Google+, and LinkedIn
Microblogging       Twitter and Tumblr
Blogging            WordPress and Blogger
Photo Sharing       Instagram, Flickr, and Pinterest
Video Sharing       YouTube, Vimeo, and Vine

Table 4.1: Summary of Social Media Sites Categorization Based on Their Functionalities

Social Media Sites Examples

Below is a brief description of four of the major SM sites: Facebook, Twitter, LinkedIn, and Google+ [68]. For each SM site, a respective table describes its user type and geographic demographics. Table 4.6 summarizes these SM sites' characteristics.

Facebook

Benefits: Family photos and videos, personal updates, chronicling life changes, and sharing interesting links.
Shortfalls: Up-to-the-minute news, one-way follows for individuals [69].

Facebook is a popular free social networking website that allows registered users to create profiles, upload photos and video, send messages, and keep in touch with friends, family, and colleagues. The site, which is available in 37 different languages, includes public features such as [70]:

Marketplace - allows members to post, read, and respond to classified ads.
Groups - allows members who have common interests to locate each other and interact.
Events - allows members to publicize an event, invite guests, and track who plans to attend.
Pages - allows members to create and promote a public page built around a specific topic.
Presence technology - allows members to see which contacts are online and chat.

Number: 500,000,000+
Website Category: General
Gender: 57% Female
Education: 81% College
Geographic Demographic: Visits by country

Table 4.2: Facebook Information [70]

Twitter

Benefits: Instant news, site updates, breaking news, and quick links.
Shortfalls: Long-form anything and reading everything in a feed [69].

Twitter transforms lengthy blog articles into small snippets of information. This is great for grabbing people's attention quickly and driving more people to the blog, which leads to new blog subscribers. Hashtags are also a special Twitter attribute: they are great for search purposes when looking for people to follow and for seeing what people are tweeting [71].

Number: 10,000,000 - 100,000,000
Website Category: Blogging/social networking site
Gender: 62% Female
Education: 84% College
Age: 56% 35+ years; 37% and 7% in younger age brackets
Geographic Demographic: Visits by country

Table 4.3: Twitter Information [71]

LinkedIn

Benefits: Communicating with people in professional environments, connecting with businesses as a whole.
Shortfalls: Sharing with friends and family, and fast news.

LinkedIn is definitely different from the rest of the SM outlets out there for business. Instead of connecting only with individuals such as clients or customers, one can connect with other businesses as a whole. LinkedIn carries a unique feature that most SM platforms do not support: the means to see who views one's profile. A LinkedIn profile is basically an online resume where one can list data such as experience, schooling, and interests [71].

Number: 100,000,000+
Website Category: Professional/business social networking site
Gender: 50% Female
Education: 88% College
Age: 80% 35+ years; 19% and 1% in younger age brackets
Geographic Demographic: Visits by country

Table 4.4: LinkedIn Information [71]

Google+

Benefits: One-way following, photography, videos, long-form text content, selective sharing, and animated GIFs.
Shortfalls: Sharing with friends and family, and fast news.

Google+ is a relative newcomer to the field. It was first introduced a couple of years ago; however, by some accounts, it has already outpaced Twitter as the second-largest social network. Its biggest feature is the site's powerful tools for organizing one's contacts and controlling what shows up in one's feed. Unlike most other services, one cannot add people without sorting them into some category; default categories exist, but one can also create new ones spontaneously. This means that, from the start, one is already approaching one's follow lists as though some feeds have more priority than others (which they do). It paves the way to carefully curate a casual feed [69].

Number: 10,000,000 - 100,000,000
Website Category: General
Gender: -
Education: -
Age: -
Geographic Demographic: Visits by country

Table 4.5: Google+ Information [69]

Summary of Social Media Sites Characteristics

SM Site    Category           Adult Usage %  More Female  More Male  More Educated  Higher Income  Geographic
Facebook   Social Networking  71%            X            -          X              -              22.3% US, 7.5% India, 48.8% Other
Twitter    Microblogging      18%            X            -          X              -              24.5% US, 5.1% UK, 46% Other
LinkedIn   Social Networking  22%            -            X          X              X              31% US, 10% India, 10% Other
Instagram  Photo Sharing      17%            X            -          -              -              NA
Pinterest  Photo Sharing      21%            X            -          X              X              NA
Google+    Social Networking  NA             NA           NA         NA             NA             28% US, 8% India, 47% Other

Table 4.6: Social Media Sites Characteristics Summary

4.2. Social Media Analytical Tools

The sheer number of SM analysis tool vendors has grown dramatically. Just a few years ago there were only a handful to choose from, whereas today there are hundreds of them [72]. The increasing importance of SM data in shaping brand sentiment and brand reputation and in influencing consumer behavior is now crucial for any business's success. Companies no longer control SM and, more importantly, they have recognized it. Coca-Cola, for example, has mentioned that it does not have the extensive resources to monitor and control all global conversations. Other brands try to engage key influencers and control the key topics they believe are important for their brand health and success [73].

In [74], Jay Baer provided an efficient categorization of the SM software provided by different vendors. He organized it into five categories based on users' needs, which is useful for this thesis's goal of providing a user-centered solution. Those categories are described below (see Table 4.7).

Social Listening Software / Social Media Monitoring Software
When to use it: To know what is being said about a company, its competitors, and a category in SM. What words are used in association with a brand? Where is this chatter occurring?
Key Vendors: Crimson Hexagon, Lithium, Meltwater Buzz, Radian6, Sysomos, Trackur, Visible Technologies.
Small business alternative: ViralHeat, SocialMention.

Social Conversation Software / Social Media Engagement Software / Social Media Management Software
When to use it: To efficiently respond to questions posed to a company or profile in SM, and to find real-time opportunities to provide assistance. Ideally, it can assign conversation opportunities to various people in the company.
Key Vendors: Argyle Social (a Convince & Convert sponsor), Attensity, Awareness, CoTweet (now SocialEngage from ExactTarget, a Convince & Convert client), Spredfast, Sprinklr.
Small business alternative: Hootsuite, Jugnoo, Postling.

Social Marketing Software / Social Media Management Software
When to use it: To create custom Facebook apps, launch and administer promotions, and manage creative assets on YouTube and beyond. Ideally, multiple people can create with workflow and approvals.
Key Vendors: Buddy Media, EngageSciences.
Small business alternative: Agorapulse, ShortStack.

Social Analytics Software
When to use it: To know how effective a user's SM efforts are, both on specific platforms and (ideally) overall, and whether all of this is productive.
Key Vendors: SAS, Social Bakers.
Small business alternative: Google Analytics.

Social Influencer Software
When to use it: To find SM participants who are disproportionately interested in, or influential about, a particular topic, and to understand their passions and spheres of influence.
Key Vendors: Appinions, GroupHigh.

Social Media Analytical Tools Examples

Table 4.7 shows a short list of SM analytical tools and their characteristics. For more detailed information see Appendix D, where an expanded list of SM analytical tools is presented.

Analytical Tool: SAS (Statistical Analysis System)
Software Category: Social Analytics Software
Problem Domain: Marketing Analytics
Data Source: Any data source
Cost: Basic Windows package costs $8,700 for the first year (high).

Analytical Tool: Facebook Insights
Software Category: Social Analytics Software
Problem Domain: Personal Improvements
Data Source: Facebook
Cost: Free (the page has to have more than 30 likes).

Analytical Tool: YouTube Insights
Software Category: SM Analytics Software
Problem Domain: Any Domain
Data Source: YouTube
Cost: Free.

Analytical Tool: Meltwater
Software Category: Social Listening Software
Problem Domain: Marketing, Sales, Public Relations
Data Source: Any digital data source
Cost: There is a free trial for companies, and different prices for different solutions: an online news and PR solution (Meltwater News), a complete solution (Meltwater News + Meltwater Buzz), and an SM marketing solution (Meltwater News).

Table 4.7: Social Media Analytical Tools' Characteristics Example

CHAPTER 5: DECISION-BACK DATA CAPTURING APPROACH FOR SOCIAL MEDIA DATA

In this chapter, the tool architecture is proposed, starting with an introduction to the decision-back approach and how it can be utilized to capture SM analytics requirements. The W*H conceptual model [33] is used to define the main concepts in the proposed decision-back approach.

5.1. Backward Analysis

Backward analysis is the phase that introduces requirements engineering into the SM Big Data analytics process. This requirements engineering phase, as presented for Big Data capturing using backwards analysis, is called reverse engineering in software engineering terminology [50]. It refers to looking at the properties of the solution that is needed (the output) to figure out the appropriate input [51]. According to the online Computing Dictionary, backward analysis is the process of defining the properties of the input, given or based on the context and properties of the output.

This concept is utilized in optimizing the process of data collection. Analyzing the properties of the problem at hand and determining the relevant elements that, when collected, will probably reveal hidden values has to be executed prior to the data collection process. Comprehensive backward analysis will eliminate the chance of being overwhelmed by bulks of irrelevant data, and will help users and businesses to generate management decisions and answer mission-critical questions in an efficient and timely manner [53].

5.2. Capturing Social Media Data Plan

Figure 1.3 (Chapter 1 Section 1.8) represents the normal process flow of SM analytics. In order to capture data in a more efficient way and set some "filters", the analyst should start planning for the first stage in the process: planning to collect data. This phase is the basic building block in the analytics process, as it is equivalent to planning for requirements elicitation in the Software Development Life Cycle (SDLC) [53]. Using the decision-back approach, certain concepts can be defined for the analytics tasks, beginning by answering three related questions suggested by [8] and improved by [75]:

Why do decisions need to be made? By answering this question, the problem domain is determined.

Which decisions need to be improved? By answering this question, the constraints are determined.

What data and analysis will help improve those decisions? Answering this question helps identify the SM data source and the suitable analytical tool.

These are general questions that help build the conceptual framework for the decision-back approach to capturing SM data analysis requirements (see Figure 5.1). Later in this chapter, these questions are detailed further to help the data analyst capture the analysis requirements.
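As a rough sketch, the answers to the three planning questions above can be captured in one small structure that later stages of the process consume. All names and example values here are illustrative assumptions, not part of the thesis tool:

```python
from dataclasses import dataclass, field

@dataclass
class CapturePlan:
    """Answers to the three decision-back planning questions [8] [75]."""
    why_decisions: str                                    # why decisions are needed -> problem domain
    which_decisions: list = field(default_factory=list)   # decisions to improve -> constraints
    what_data: list = field(default_factory=list)         # helpful data/analysis -> source and tool

# Example: a hypothetical public-health monitoring scenario.
plan = CapturePlan(
    why_decisions="Track public awareness of an outbreak",
    which_decisions=["limit scope to one country", "last six months only"],
    what_data=["microblogging posts", "a social analytics tool"],
)
print(plan.why_decisions)
```

The point of the structure is only that the problem domain is fixed before any data is collected; the filters derived from it then constrain the capture step.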

Figure 5.1: Decision-Back Approach Applied in the Analysis Process

5.3. The Conceptual Model

A conceptual model is useful for structuring contextual elements within complex settings. It is defined as "a visual or written product, one that explains, either graphically or in narrative form, the main things to be studied (the key factors, concepts, or variables) and the presumed relationships among them" [76]. It is something that is constructed, not found. It incorporates pieces that are borrowed from elsewhere, but the structure, the overall coherence, is something that is built, not something that exists ready-made [76] [77]. Ultimately, the basic concepts for capturing SM Big Data and the general questions [8] [75] (Section 5.2) for incorporating the decision-back data capturing approach follow the above definition. The following steps need to be followed to collect SM data in a more efficient manner with less volume, variety, veracity, and velocity:

1. Determine the problem domain: objectives and constraints.
2. Identify the data source.
3. Determine the analytical tool.
4. Analyze the data collection based on the concept structure defined under the conceptual model.
5. Conduct further analysis based on the collected data.

(See Figure 5.2.)

Figure 5.2: The Conceptual Model of the Decision-Back Capturing Approach
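The five numbered steps can be read as a linear pipeline in which each step consumes the previous step's output. A schematic sketch, in which every function is a placeholder stub rather than part of the proposed tool:

```python
def determine_problem_domain(answers):
    # Step 1: objectives and constraints from the analyst's answers.
    return {"objectives": answers["purpose"], "constraints": answers["constraints"]}

def identify_data_source(domain):
    # Step 2: pick an SM site matching the domain (stubbed).
    return "Twitter"

def determine_analytical_tool(domain, source):
    # Step 3: pick a tool compatible with the chosen source (stubbed).
    return "SAS"

def plan_collection(domain, source, tool):
    # Step 4: analyze the data collection under the conceptual model.
    return {"domain": domain, "source": source, "tool": tool}

answers = {"purpose": "track brand sentiment", "constraints": ["English only"]}
domain = determine_problem_domain(answers)
source = identify_data_source(domain)
tool = determine_analytical_tool(domain, source)
collection_plan = plan_collection(domain, source, tool)
# Step 5 (further analysis) would then run on data captured under this plan.
print(collection_plan["source"], collection_plan["tool"])
```

The ordering matters: the data source and tool are chosen only after the domain is fixed, which is exactly the decision-back direction.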

Identification of the Problem Domain

A domain is a sphere of thought or operation; the situations where a particular science, law, etc., is applicable. The problem domain (or problem space) is an engineering term referring to all information that defines the problem and constrains the solution (the constraints being part of the problem). It includes the goals that the problem owner wishes to achieve, the context within which the problem exists, and all rules that define essential functions or other aspects of any solution product. It represents the environment in which a solution will have to operate, as well as the problem itself [78].

In broad terms, the problem domain describes the area undergoing analysis, and identifying the problem domain is the first step toward arriving at a solution. This is also applicable in SM analytics, where the analyst should start by identifying the problem domain: the constraints and scope of the problem. These can vary greatly depending on the goals of the project being undertaken. Identifying them limits the focus and provides the clear direction needed to make sense of the enormous amount of data at hand.

Identification of the Data Source and the Analytical Tool

As mentioned in Chapter 4, there are different SM data sources and analytical tools. To differentiate data sources from each other, we can use the SM sites' categorization [66] [67] and the user type and geographic demographics [70] (Chapter 4 Section 4.1); these two factors help once the problem domain is described and the goals of the undertaken project are defined. The analytical tool should likewise be identified using the categorization of [74] (Chapter 4 Section 4.2) and the price.

5.4. W*H Conceptual Model for Services

In order to define the conceptual model for SM data capturing, the W*H model for services is followed (see Figure 5.3) [33]. W*H is applied in the domain of IT services to identify, describe, and define IT services. It has been tested in different cases ranging from medical domains to other IT services. In the W*H framework, the service is primarily specified by:

The ends or purpose (wherefore) of the service, and thus the benefit a potential user may obtain when using the service. The purpose description governs the service and allows it to be characterized. This characterization is based on the answers to the following questions: why, whereto, for when, for which reason. These properties are called primary since they define in which cases a service has usefulness, usage, and usability. They define the potential and the capability of the service.

The sources (whereof) of the service, with a general description of the environment for the service.

The supporting means (wherewith), which must be known to potential users when utilizing the service.

The surplus value (worthiness) that a service utilization potentially gives to the user.

The service description language is thus based on the following questions:

Primary: wherefore, whereof, wherewith, worthiness (supporting means and added value) (= W4); additionally, for the purpose: why, whereto, for when, for which reason (= W4);

Secondary (application domain): by whom, to whom, whichever party; wherein, where, for what, wherefrom, whence, what, how (= W10H);

Additional (context): whereat, whereabout, whither, when (= W4).

A more detailed description of the W*H model is given in Appendix C.
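Counting the question words makes the W4 and W10H labels concrete. A sketch that holds the grouping as plain data (the group names are illustrative; the question words follow [33]):

```python
# W*H question groups for service description [33].
WSTAR_H = {
    "primary":   ["wherefore", "whereof", "wherewith", "worthiness"],           # W4
    "purpose":   ["why", "whereto", "for when", "for which reason"],            # W4
    "secondary": ["by whom", "to whom", "whichever party", "wherein", "where",
                  "for what", "wherefrom", "whence", "what", "how"],            # W10H
    "context":   ["whereat", "whereabout", "whither", "when"],                  # W4
}

for group, questions in WSTAR_H.items():
    print(group, len(questions))
```

Keeping the groups as data is convenient later, when the tool walks through them to generate the acquisition dialogue.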

Figure 5.3: The W*H Service Description Model [34]

5.5. Defining the Social Media Data Capturing Model

Requirements acquisition for capturing SM data is a software application, and it too can be defined as an IT service. According to [33], an IT service is defined as a kind of web information system service; such services are notorious for their low half-life period and their high potential for evolution, migration, and integration. Therefore, the questions for describing IT services can be mapped to describe SM Big Data capturing problems and to assist in specifying the analyst's requirements. In this approach, each concept is mapped to questions to make it more useful, easier to use, and better understood by users. It thus forms the foundation of the requirements acquisition tool that applies the decision-back data capturing concept for capturing SM data (see Figure 5.4):

1. Problem Domain: To be more specific when describing the problem domain using the model, these questions should be answered:
o Worthiness? (Surplus value)
o Wherefore? (Purpose)
o Whereof? (Sources)
o For when? (Time frame)

2. Data Source: To choose the most suitable data source, one should be precise, and the following information should be available (using the model questions):
o Wherewith? (Supporting means)
o For what? (The type of SM)
o Where to? (Geographic demographic and the targeted people)

3. Analytical Tool:
o For what? (The type of analytics)
o How? (Price)
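The concept-to-question mapping above can be kept as a simple table that the acquisition dialogue walks through in order. A sketch (the dictionary layout is an implementation assumption; the question wording follows this section):

```python
ACQUISITION_QUESTIONS = {
    "Problem Domain": [
        ("Worthiness?", "Surplus value"),
        ("Wherefore?", "Purpose"),
        ("Whereof?", "Sources"),
        ("For when?", "Time frame"),
    ],
    "Data Source": [
        ("Wherewith?", "Supporting means"),
        ("For what?", "Type of SM"),
        ("Where to?", "Geographic demographic and targeted people"),
    ],
    "Analytical Tool": [
        ("For what?", "Type of analytics"),
        ("How?", "Price"),
    ],
}

# Walking the table yields the full acquisition dialogue in order.
for concept, questions in ACQUISITION_QUESTIONS.items():
    for question, meaning in questions:
        print(f"{concept}: {question} ({meaning})")
```

Because the dialogue is data rather than code, adding or rewording a question does not require changing the modules that consume the answers.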

Figure 5.4: Refined Model for the Decision-Back Approach for Capturing Social Media Data Analysis Requirements

5.6. Tool Architecture and Design

Software architecture is a field of software engineering devoted to describing the architecture of software systems [79]. Software architecture deals with the design and implementation of the high-level structure of software. It is the result of assembling a certain number of architectural elements in some well-chosen forms to satisfy the major functionality and performance requirements of the system. Software architecture deals with abstraction, with decomposition and composition, with style and esthetics [80]. To describe a software architecture, the architect should use a model composed of single or multiple views or perspectives. In order to eventually address large and challenging architectures, the model is made up of five main views (see Figure 5.5) [80] [81]:

1. The logical view, which is the object model of the design (when an object-oriented design method is used).

2. The process view, which captures the concurrency and synchronization aspects of the design.

3. The physical view, which describes the mapping(s) of the software onto the hardware and reflects its distributed aspect.

4. The development view, which describes the static organization of the software in its development environment.

The description of an architecture, that is, the decisions made, can be organized around these four views and then illustrated by a few selected use cases, or scenarios, which become a fifth view [80] [81].

Figure 5.5: The 4+1 View Model [80]

To describe the proposed tool architecture based on the conceptual model defined in Section 5.4 (see Figure 5.4), the development view is used, because according to [82] not all of these views apply to every software architecture, and some will be more important than others depending on the stakeholders' concerns. The development architecture view is chosen because it focuses on the actual software module organization in the software development environment. The software in this view is packaged in small chunks (program libraries or subsystems) that can be developed by one or a small number of developers. The subsystems are organized in a hierarchy of layers, with each layer providing a narrow and well-defined interface to the layers above it [80] [82].

Figure 5.6 represents the development organization in three layers for the proposed requirements acquisition tool:

User Interface Layer, which is the Data Ingest Module.
Middle Layer, which is the Data Analysis Module.
Database Layer.

These layers and the design of the tool's user interface are described in more detail in the following sections.
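The three layers can be sketched as objects with narrow interfaces, each layer calling only the layer directly beneath it. All class and method names here are illustrative, not taken from the thesis tool:

```python
class DatabaseLayer:
    """Bottom layer: holds the SM site and analytical tool tables."""
    def __init__(self, sites, tools):
        self.sites = sites    # site name -> category
        self.tools = tools    # tool name -> category

class DataAnalysisModule:
    """Middle layer: recommendation logic; talks only to the database layer."""
    def __init__(self, db):
        self.db = db
    def recommend_source(self, category):
        return [name for name, cat in self.db.sites.items() if cat == category]

class DataIngestModule:
    """Presentation layer: collects answers and forwards them downward."""
    def __init__(self, analysis):
        self.analysis = analysis
    def submit(self, category):
        return self.analysis.recommend_source(category)

db = DatabaseLayer(
    sites={"Twitter": "Microblogging", "Tumblr": "Microblogging",
           "Flickr": "Photo Sharing"},
    tools={},
)
ui = DataIngestModule(DataAnalysisModule(db))
print(ui.submit("Microblogging"))
```

The narrow interface between layers means the presentation layer never touches the tables directly, so the database layer can later be swapped for a real database without changing the upper layers.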

Figure 5.6: Requirements Acquisition Tool Architecture for the Decision-Back Approach for Capturing Social Media Data Analysis Requirements

Tool Layers

Data Ingest Module (Presentation Layer)

This module is called Data Ingest because it interacts directly with the user to get his/her analytics requirements data. Each part of this module provides the user with a set of questions (see Section 5.3) to capture requirements. The answers are then passed to the middle layer for analysis, so that the system can provide the user with keywords, a recommended data source, and a recommended analytical tool.

Data Analysis Module (Middle Layer)

This module is responsible for analyzing the data that comes from the Data Ingest Module. It has three subsystems:

1. Natural Language Processing (NLP) Subsystem

Natural Language Processing (NLP) is a field in computer science and linguistics that is closely related to Artificial Intelligence (AI) and Computational Linguistics (CL). NLP is generally employed to convert information stored in natural language into a machine-understandable format [83] [84]. This subsystem uses NLP techniques to process the user's answers and obtain the keywords that help in filtering the data (see Figure 5.7). Understanding the user's answers can lead to providing better services and making more accurate choices of keywords and filters. At this stage of the research we concentrate on constructing a composition that works as a prototype; therefore Stanford CoreNLP [64] (see Section 3.5.2), which is the most applicable NLP toolkit for this purpose, is used.

Figure 5.7: NLP Analysis Subsystem

Using this tool, the nouns, locations, and time constraints in the user's answers are extracted, and the system then generates them as keywords and constraints. Pursuing more advanced NLP will be considered as a future direction.

Example:
The System: What is the purpose of doing this data analytics? (Purpose)

User: Ministry of Health wants to track the scare of Corona in Saudi Arabia, and what people know about Corona infection.
Stanford CoreNLP: See Figure 5.8.

Figure 5.8: Stanford CoreNLP Example

The System: Keywords: Saudi Arabia, Ministry Health, Scare Corona, People, Corona infection.

2. Data Source Recommendation Subsystem

Recommendation systems are software tools and techniques that provide suggestions for items likely to be of use to a user. The suggestions are aimed at supporting users in various decision-making processes [85]. This recommendation subsystem analyzes the user's choices from the upper level (the Data Ingest Module) and, by using the Data Source Database (using Table 4.6 as an example), can suggest an SM site if the user does not have one in mind, to help improve decision making (see Figure 5.9).

Figure 5.9: Data Source Recommendation Subsystem

Example:
What SM type do you prefer? (List of SM categories, Chapter 4 Section 4.1.1)
Photo Sharing
More popular with: NA/Men/Women?
Women
Age preference: NA, 13-18, 19-35, above 35?
NA
Education: NA, College, no College?
College
Higher income: NA, Yes, No?

NA
System Recommendation: Pinterest

3. Analytical Tool Recommendation Subsystem

This subsystem analyzes the user's analytical tool requirements, together with the output from the other subsystems, to find the most suitable analytical tool using the Analytical Tool Database (see Figure 5.10).

Figure 5.10: Analytical Tool Recommendation Subsystem

Example:
NLP Subsystem output: keywords.
Data Source Recommendation Subsystem: Facebook.
What type of analytics do you want? (List of tool categories, Chapter 4 Section 4.2)
Social Analytics Software

Price: Free/NA?
Free
System Recommendation: Facebook Insights

Database Layer

This layer contains two different databases: the analytical tools database and the SM sites characteristics table. As we are at the stage of building a prototype, these database tables (Table 4.7 and Table 4.6) are used in the tool. Further improvements will be considered as a future direction of this research.

Analytical Tools Database: a tools database example is shown in Table 4.7; the expanded list is presented in Appendix D.

SM Sites Table: the SM sites table is Table 4.6 in Chapter 4.

Tool's User Interface Design

Figures 5.11 to 5.15 show how the tool should be applied. When the user clicks on Start, a process of three parts starts:

Part 1: Problem Domain

Figure 5.11: Tool Interface Design (Home Page)

This part asks the user four questions (Figure 5.12):
o Worthiness? (Surplus value) What will you achieve when you analyze the data?
o Wherefore? (Purpose) What is the purpose of doing this data analytics?
o Whereof? (Sources) Where does the data you are looking for come from?
o For when? (Time frame) When do you need this data?

These questions should help the user describe their problem domain, so that the system can highlight the keywords and constraints, which are necessary in the analysis process. The system should use the Natural Language Processing (NLP) subsystem (Section 3.5.2) to analyze the user's answers and highlight the keywords and problem constraints.
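A minimal stand-in for this keyword-highlighting step, purely for illustration: the real tool would call Stanford CoreNLP, whereas here a tiny hand-made stop-word filter plays its role, and the stop-word list is an assumption of this sketch:

```python
import re

# Toy stop-word list; a real system would use POS tags to keep nouns only.
STOP_WORDS = {"the", "of", "in", "and", "to", "a", "wants", "track",
              "what", "people", "know", "about"}

def extract_keywords(answer: str) -> list:
    """Very rough keyword extraction: keep tokens that are not stop words."""
    tokens = re.findall(r"[A-Za-z]+", answer)
    return [t for t in tokens if t.lower() not in STOP_WORDS]

answer = "Ministry of Health wants to track the scare of Corona in Saudi Arabia"
print(extract_keywords(answer))
# -> ['Ministry', 'Health', 'scare', 'Corona', 'Saudi', 'Arabia']
```

Even this crude filter shows the contract of the subsystem: free text in, a keyword list out, which the later parts of the tool then consume.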

Figure 5.12: Tool Interface Design (Process Part 1)

Part 2: Data Source

The system gives the user a choice: he/she can either choose one particular SM website, or select the SM type, geographic demographic, and targeted people characteristics, based on which the system suggests the most suitable data source. The system should use the Data Source Recommendation Subsystem (Figure 5.13).
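Under the hood, Part 2 amounts to filtering the site characteristics table. A sketch with a few rows hard-coded; the boolean skew columns are an illustrative simplification of Table 4.6, and in the tool the rows would come from the Database Layer:

```python
# Illustrative rows in the spirit of Table 4.6; booleans mark demographic skews.
SITES = [
    {"name": "Instagram", "category": "Photo Sharing", "female": True,
     "college": False, "income": False},
    {"name": "Pinterest", "category": "Photo Sharing", "female": True,
     "college": True, "income": True},
    {"name": "Twitter", "category": "Microblogging", "female": True,
     "college": True, "income": False},
]

def suggest_site(category, female=None, college=None, income=None):
    """Return site names matching the category; None criteria mean NA (ignored)."""
    def ok(site, key, want):
        return want is None or site[key] == want
    return [s["name"] for s in SITES
            if s["category"] == category
            and ok(s, "female", female)
            and ok(s, "college", college)
            and ok(s, "income", income)]

# The Pinterest dialogue from earlier in this chapter:
# Photo Sharing, women, college, income NA.
print(suggest_site("Photo Sharing", female=True, college=True))
```

Treating NA answers as ignored criteria is what lets a partially specified dialogue still narrow the table down to one suggestion.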

Figure 5.13: Tool Interface Design (Process Part 2)

Part 3: Analytical Tool

This part asks the user to choose the analytics type and the price for the tool. The categories are presented as drop-down lists (Figure 5.14).
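Part 3 is the same filtering pattern applied to the analytical tools table. A sketch with abridged rows in the spirit of Table 4.7; the matching rule (a tool fits if it is bound to the chosen source or to any source) is an assumption of this sketch:

```python
TOOLS = [
    {"name": "SAS", "category": "Social Analytics Software",
     "source": "Any", "free": False},
    {"name": "Facebook Insights", "category": "Social Analytics Software",
     "source": "Facebook", "free": True},
    {"name": "YouTube Insights", "category": "SM Analytics Software",
     "source": "YouTube", "free": True},
]

def suggest_tool(category, source, free=None):
    """Match on category; accept tools bound to this source or to any source."""
    return [t["name"] for t in TOOLS
            if t["category"] == category
            and t["source"] in (source, "Any")
            and (free is None or t["free"] == free)]

# The dialogue from earlier in this chapter: Facebook data, social analytics, free.
print(suggest_tool("Social Analytics Software", "Facebook", free=True))
```

Note how the output of Part 2 (the chosen data source) feeds directly into Part 3, which is the cross-subsystem flow shown in Figure 5.10.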

Figure 5.14: Tool Interface Design (Process Part 3)

The system output is given to the user, based on his/her requirements, as shown in Figure 5.15.

Figure 5.15: Tool Interface Design (Result)

CHAPTER 6: CASE STUDY

This chapter will discuss the application of the tool to different cases. It will start by discussing the Stanford CoreNLP tool and how it helps with analyzing users' answers. Then it will discuss the different cases by analyzing each problem and developing a solution using the tool.

Stanford CoreNLP Tool
Stanford CoreNLP [64] will be used in the following cases to analyze users' answers, as explained in Chapter 3. The Stanford CoreNLP system integrates many of the Stanford Natural Language Processing Group's (SNLP) NLP tools, including the part-of-speech (POS) tagger [86] [87], the named entity recognizer (NER) [88], the parser [89], the coreference resolution system [90] [91] [92], the sentiment analysis [93], and the bootstrapped pattern learning tools [94]. The basic distribution provides model files for the analysis of English, but the engine is compatible with models for other languages. There are packaged models for Chinese and Spanish, and Stanford NLP models for German and Arabic are usable inside CoreNLP [64]. At this point of the research we are going to focus on analyzing the English language, using only the part-of-speech (POS) tagger [86] [87] and the named entity recognizer (NER) [88] to analyze users' answers. Using the other tools and analyzing other languages will be considered as a future direction for this research.

Part of Speech Tagger
The part-of-speech (POS) tagger [86] [87] is a piece of software that reads text in some language and assigns a part of speech to each word (and other tokens), such as noun, verb, adjective, etc. [95]. This tool will be used to analyze the first two answers in the Problem Domain part to extract nouns as keywords (see Chapter 5 for more information). Nouns will be tagged as NN, NNS, NNP, or NNPS, as shown in Figure 6.1.

Figure 6.1: Annotation Guidelines [96]
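The noun-extraction step can be sketched as a filter over the tagger's output. This is a simplified stand-in, assuming the POS results arrive as (word, tag) pairs rather than CoreNLP's actual annotation objects:

```python
# Penn Treebank noun tags used by the Stanford POS tagger.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def extract_nouns(tagged_tokens):
    """Keep only the tokens whose POS tag marks them as nouns."""
    return [word for word, tag in tagged_tokens if tag in NOUN_TAGS]

# A tagged fragment of the Case 1 answer "To sell phone accessories."
tagged = [("To", "TO"), ("sell", "VB"), ("phone", "NN"), ("accessories", "NNS")]
print(extract_nouns(tagged))  # ['phone', 'accessories']
```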

Named Entity Recognizer
The named entity recognizer (NER) [88] labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for named entity recognition, along with many options for defining feature extractors. Included with the download are good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION) [97]. This tool will be used to analyze the last two answers in the Problem Domain part to find location and time constraints (see Chapter 5 for more information).

Case 1: Start On-Line Business Project

Problem Description
A small group of people want to start an online business, but they do not know how to start. They have a limited budget, so they want to know what people are attracted to in terms of phone accessories: original products, phone brands, head phones, phone cases, phone chargers and phone accessories trends. They are targeting the Saudi market and teenagers, as they are the subgroup who primarily embrace these kinds of products.
Key Problem: They want to implement SM analytics to explore people's opinions about this type of product and explain the current trends.

Tool Application
The requirements acquisition tool that applies the decision-back data collecting approach described in Chapter 5 guides the development of a well-defined data collection plan. It can provide the team with guidance for the most applicable tool, data

source and for building virtual boundaries around the collection process, based on the answers to the tool's questions. Below is a version of the application of the proposed tool in the form of questions and answers.
1. Problem Domain Screen:
o What are you going to achieve when you analyze the data? To sell phone accessories.
o What is the purpose of doing this data analytics? Want to know what currently appeals to people in the field of phone accessories: original phone products, phone brands, head phones, phone cases, phone chargers, phone accessories trends.
o Where is the data you are looking for? Saudi Arabia market.
o For when do you need this data? Last month's reviews and opinions.
2. Data Source Screen:
o Do you have something in mind? (List of SM sites) All
o What SM type do you prefer? (List of SM categories, Chapter 4 Section 4.1.1) NA
o More popular with: NA/Men/Women? NA
o Age preference: NA, 13-18, 19-35, Above 35? NA
o Education: NA, College, no College? NA
o Higher income: NA, Yes, No? NA
3. Analytical Tool Screen:

o What type of analytics do you want? (List of tool categories, Chapter 4 Section 4.2) SM Monitoring Software
o Price: Free/NA? Free
4. Tool Output:
Keywords and Constraints: Using Stanford CoreNLP to analyze the user's answers, the system extracts keywords from the first two answers; the third answer gives the location constraint and the last one the time constraint.
Answer1: To sell phone accessories.
Answer2: Want to know what currently appeals to people in the field of phone accessories: original phone products, phone brands, head phones, phone cases, phone chargers, phone accessories trends.

Figure 6.2: Part of Speech NLP - Case 1

Nouns: phone accessories people days field phone accessories phone products phone brands head phones phone cases phone chargers phone accessories trends (see Figure 6.2). To make useful keywords, the system will first take each single word as a keyword; then it will combine every two adjacent words into a phrase and remove the duplicates.
Keywords:
Single words: phone, accessories, people, days, field, accessories, products, phone, brands, head, phones, cases, chargers, accessories, trends.

Phrases: phone accessories, people days, field phone, accessories phone, products phone, brands head, phones phone, cases phone, chargers phone, accessories trends.

Figure 6.3: Named Entity Recognition NLP - Case 1

Answer3: Saudi Arabia Market. The system will recognize Saudi Arabia as a location (see Figure 6.3). Location: Saudi Arabia.
Answer4: Last month's reviews and opinions. The system will recognize last month as a date (see Figure 6.3). Time Frame: Last month.
Recommended Data Source:
o For all types of SM.
Recommended Analytical Tool:
o Because they want SM Monitoring Software for all types of SM for free, the recommended tool is Social Mention.

Case 2: A Saving Lincoln Movie Promotion

Problem Description
As reported in [98], with thousands of films released yearly, it is a challenge for filmmakers to break through all the noise. Traditionally, films with larger marketing budgets have more success in reaching their audience. With the onset of SM, the rules are changing. As a result, smaller marketing and advertising agencies are increasingly incorporating SM as a bigger

part of their marketing and promotional campaigns. A team of writers wanted to launch a Twitter storytelling campaign aimed at building an audience for the release of the independent film Saving Lincoln. With a limited budget, they recognized the importance of having an avant-garde SM strategy that was cost-effective and easy to implement. They wanted to:
Build brand awareness and buzz for an unreleased movie.
Create a network of fans for the film.
Sustain the campaign for over a year.
Implement the strategy within a $1000 monthly budget.
The team focused on telling the history around Abraham Lincoln through the voice of Ward Hill Lamon, an often neglected historical figure who was Lincoln's friend and legal partner in Illinois and played a major role in keeping the President alive during the Civil War as his personal bodyguard. By using Lamon's voice, based on his own writings from the Civil War era, and keeping with the true spirit of the film, the writers were able to add another dimension to the story. Instead of using a traditional website to market the film, Twitter was used as a medium for telling Ward Hill Lamon's stories. Each story was structured in the way serial comic strips are: a story beat per day, told in several tweets, as opposed to a few art panels. Each day's series of tweets furthers the story while leaving readers eager to find out what happens next. Fridays became the cliffhanger day that set the stage for the tale to resume on Monday.
Key Problem: The team needed each tweet published in a specific order, as it was critical they were published the same way, at specific intervals. They needed a streamlined process for reviews, edits, and contributions to their weekly Twitter stories.

Tool Application
If this group used the tool, they would have to answer the following questions:

1. Problem Domain:
o What are you going to achieve when you analyze the data? Through their Twitter profile, they are aiming to: build brand awareness and buzz for an unreleased movie; create a network of fans for the film; sustain the campaign for over a year; implement the strategy within a $1000 monthly budget.
o What is the purpose of doing this data analytics? Want to manage their Twitter profile, so tweets can be published in a specific order at specific intervals. They need a streamlined process for reviews, edits and contributions to their weekly Twitter stories.
o Where is the data you are looking for? USA.
o For when do you need this data? Before releasing the film: for a year from August.
2. Data Source:
o Do you have something in mind? (List of SM sites) Twitter
o What SM type do you prefer? (List of SM categories, Chapter 4 Section 4.1.1) NA
o More popular with: NA/Men/Women? NA
o Age preference: NA, 13-18, 19-35, Above 35? NA
o Education: NA, College, no College? NA
o Higher income: NA, Yes, No? NA

3. Analytical Tool:
o What type of analytics do you want? (List of tool categories, Chapter 4 Section 4.2) SM Management Software
o Price: Free/NA? Free
Tool Output:
Keywords and Constraints: Using Stanford CoreNLP [64] to analyze the user's answers, the system extracts keywords from the first two answers; the third answer gives the location constraint and the last one the time constraint.
Answer1: By Twitter profile, they are aiming to: Build brand awareness and buzz for an unreleased movie. Create a network of fans for the film. Sustain the campaign for over a year. Implement the strategy within a $1000 monthly budget.
Answer2: Want to manage their Twitter profile, so tweets can be published in a specific order at specific intervals. They needed a streamlined process for review, edits and contributions to their weekly Twitter stories.

Figure 6.4: Part of Speech NLP - Case 2

Nouns: SavingLincoln twitter profile brand awareness buzz movie network fans film campaign year strategy budget want twitter profile tweets order intervals process review contributions Twitter stories (see Figure 6.4). To make useful keywords, the system will first take each single word as a keyword; then it will combine every two adjacent words into a phrase and remove the duplicates.
Keywords:
Single words: SavingLincoln, twitter, profile, brand, awareness, buzz, movie, network, fans, film, campaign, year, strategy, budget, want, profile, tweets, order, intervals, process, review, contributions, stories.
Phrases: SavingLincoln twitter, profile brand, awareness buzz, movie network, fans film, campaign year, strategy budget, want twitter, profile tweets, order intervals, process review, contributions Twitter, stories.

Figure 6.5: Named Entity Recognition NLP - Case 2

Answer3: USA. The system will recognize USA as a location (see Figure 6.5). Location: USA.
Answer4: Before releasing the film: a year from August. The system will recognize a year from August as a date (see Figure 6.5). Time Frame: a year from August.
Recommended Data Source:
o Twitter.
Recommended Analytical Tool: Because they wanted SM Management Software for their Twitter profile for free, the recommended tool is Hootsuite.

Case 3: YouTube Music Channel Promotion

Problem Description
A group of musicians set out on their band's first country tour. They thought about YouTube's potential as a vehicle for promotion, especially in the music industry. They wanted to create a channel to promote their songs and launch it simultaneously with the start of their tour on the 16th of July, which will last for one month. However, the ability to analyze data from YouTube was seriously lacking. They want to manage their channel and measure their popularity and the effectiveness of their SM efforts.
Key Problem: They want to use SM analytics to help them manage their channel and measure their popularity and the effectiveness of their SM efforts.

Tool Application
When this group uses the tool, they have to answer the following questions:

1. Problem Domain:
o What are you going to achieve when you analyze the data? To measure their popularity, by using their channel videos.
o What is the purpose of doing this data analytics? They want to use SM analytics to help them manage their channel and to measure their popularity and the effectiveness of their SM efforts.
o Where is the data you are looking for? YouTube channel.
o For when do you need this data? When they start their tour, effective the 16th of July, for one month.
2. Data Source:
o Do you have something in mind? (List of SM sites) YouTube
o What SM type do you prefer? (List of SM categories, Chapter 4 Section 4.1.1) NA
o More popular with: NA/Men/Women? NA
o Age preference: NA, 13-18, 19-35, Above 35? NA
o Education: NA, College, no College? NA
o Higher income: NA, Yes, No? NA
3. Analytical Tool:
o What type of analytics do you want? (List of tool categories, Chapter 4 Section 4.2) SM Analytics Software
o Price: Free/NA? Free

Tool Output:
Keywords and Constraints: Using Stanford CoreNLP [64] to analyze the user's answers, the system extracts keywords from the first two answers; the third answer gives the location constraint and the last one the time constraint.
Answer1: To measure their popularity, by using their Channel Videos.
Answer2: They want to use SM analytics to help them manage their channel and to measure their popularity and effectiveness of their SM efforts.

Figure 6.6: Part of Speech NLP - Case 3

Nouns: popularity Channel Videos media analytics channel popularity media efforts (see Figure 6.6). To make useful keywords, the system will first take each single word as a keyword; then it will combine every two adjacent words into a phrase and remove the duplicates.
Keywords:
Single words: popularity, Channel, Videos, media, analytics, channel, efforts.
Phrases: popularity Channel, Videos media, analytics channel, popularity media, efforts.
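The keyword-construction step repeated in each case — single words first, then two-word phrases from adjacent nouns, with duplicates removed — can be sketched as:

```python
def make_keywords(nouns):
    """Build single-word keywords and two-word phrases from the extracted
    nouns, dropping duplicates while preserving order. A sketch of the
    keyword-construction step described in the cases."""
    singles = list(dict.fromkeys(nouns))
    phrases = list(dict.fromkeys(
        f"{a} {b}" for a, b in zip(nouns, nouns[1:])
    ))
    return singles, phrases

# A shortened noun list from Case 1.
singles, phrases = make_keywords(["phone", "accessories", "phone", "brands"])
print(singles)  # ['phone', 'accessories', 'brands']
print(phrases)  # ['phone accessories', 'accessories phone', 'phone brands']
```

`dict.fromkeys` is used for deduplication because, unlike a set, it keeps the order in which keywords first appear.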

Figure 6.7: Named Entity Recognition NLP - Case 3

Answer3: YouTube Channel. The system will not recognize any location, so it will keep the answer as it is (see Figure 6.7). Location: YouTube Channel.
Answer4: When they start their tour from the 16th of July for one month. The system will recognize a month from the 16th of July as a date (see Figure 6.7). Time Frame: a month from the 16th of July.
Recommended Data Source:
o YouTube.
Recommended Analytical Tool: Because they want SM Analytics Software for free, the recommended tool is YouTube Insights.

Case 4: Middle East Respiratory Syndrome Awareness

Problem Description
A novel coronavirus called Middle East Respiratory Syndrome Coronavirus (MERS-CoV) was first reported in 2012 in Saudi Arabia. It has caused severe illness and deaths, and it has been identified in multiple countries in the Arabian Peninsula. There have also been cases in several other countries with individuals who traveled to the Arabian Peninsula and, in some instances, their close contacts. Two cases were confirmed in May

2014 among two health care workers living in Saudi Arabia who were visiting the United States [99]. Subsequently, the Ministry of Health's (MOH) Command & Control Center (CCC) in Saudi Arabia launched a rigorous examination of data related to MERS-CoV patients from 2012 onwards. The main objective of that review was to ensure a more complete and accurate understanding of the MERS-CoV outbreak in the Kingdom of Saudi Arabia. The review has already enhanced the Ministry's policy development process and improved the measures taken to address the situation. Following the review, the new total number of cases recorded in the Kingdom between 2012 and today is 688. Of that total, 282 of the cases were fatal, 53 are currently receiving treatment and 353 have recovered. Dr Tariq Madani, Head of the Scientific Advisory Board within the Ministry's Command and Control Center, said: "The Ministry is committed to fully understanding MERS-CoV and putting in place the policies needed to protect public health and safety. To do this the Ministry has reviewed historical cases of MERS-CoV to give a more comprehensive understanding of the facts. While the review has resulted in a higher total number of previously unreported cases, we still see a decline in the number of new cases reported over the past few weeks." [100]
Key Problem: The Saudi Arabia MOH, to which the CCC unit belongs, wants to engage the public with an awareness campaign through SM sites. To do that, they want to:
Track citizens' reactions about MERS-CoV infection in Saudi Arabia, and how they protect themselves.
Based on the feedback, make timely informed decisions regarding the disease awareness campaign.

Tool Application
The following details present how the tool will help the CCC with their SM data collection and capture their requirements:
1. Problem Domain:
o What are you going to achieve when you analyze the data? To provide MERS-CoV awareness and virus precautions.
o What is the purpose of doing this data analytics? To track MERS-CoV news in Saudi Arabia, and how people are protecting themselves from MERS-CoV infections.
o Where is the data you are looking for? Saudi Arabia.
o For when do you need this data? A week starting from June 1st 2014.
2. Data Source:
o Do you have something in mind? (List of SM sites) No
o What SM type do you prefer? (List of SM categories, Chapter 4 Section 4.1.1) Microblogging
o More popular with: NA/Men/Women? NA
o Age preference: NA, 13-18, 19-35, Above 35? NA
o Education: NA, College, no College? NA
o Higher income: NA, Yes, No? NA
3. Analytical Tool:

o What type of analytics do you want? (List of tool categories, Chapter 4 Section 4.2) SM Monitoring Software
o Price: Free/NA? NA
Tool Output:
Keywords and Constraints: Using Stanford CoreNLP [64] to analyze the user's answers, the system extracts keywords from the first two answers; the third answer gives the location constraint and the last one the time constraint.
Answer1: To provide MERS-CoV awareness and virus precautions.
Answer2: To track MERS-CoV news in Saudi Arabia, and how people are protecting themselves from MERS-CoV infections.

Figure 6.8: Part of Speech NLP - Case 4

Nouns: MERS-CoV awareness virus precautions MERS-CoV news Saudi Arabia people MERS-CoV infection (see Figure 6.8). To make useful keywords, the system will first take each single word as a keyword; then it will combine every two adjacent words into a phrase and remove the duplicates.
Keywords:
Single words: MERS-CoV, awareness, virus, precautions, news, Saudi, Arabia, people, infections.

Phrases: MERS-CoV awareness, virus precautions, MERS-CoV news, Saudi Arabia, people MERS-CoV, infections.

Figure 6.9: Named Entity Recognition NLP - Case 4

Answer3: Saudi Arabia. The system will recognize Saudi Arabia as a location (see Figure 6.9). Location: Saudi Arabia.
Answer4: A week starting from June 1st 2014. The system will recognize a week from June 1st 2014 as a date (see Figure 6.9). Time Frame: A week from June 1st 2014.
Recommended Data Source:
o Twitter.
Recommended Analytical Tool: Because they want SM Monitoring Software and the cost was not important, the recommended tool is Meltwater Buzz.

Case 5: DAESH Terrorist Movement

Problem Description
The Islamic State of Iraq and the Levant (ISIL) is an extremist Islamist rebel group that controls territory in Iraq and Syria and also has operations in eastern Libya, the Sinai

109 Chapter6: Case Study Peninsula of Egypt, and other areas of the Middle East, North Africa, South Asia, and Southeast Asia. The group's Arabic name is transliterated as (ad-dawlah al-islāmīyah fī al- Irāq wash-shām) leading to the Arabic acronym (Da ish داعش ) or DAESH. On 29 June 2014, the group proclaimed itself to be a worldwide caliphate and renamed itself the Islamic State (IS), but the new name has been widely criticized and condemned, with various governments, and mainstream Muslim groups refusing to acknowledge it. Many Islamic and non-islamic communities judge the group to be unrepresentative of Islam [101]. Mufti of Saudi Arabia's top cleric, Sheikh Abdul Aziz bin Abdullah Al-Sheikh confirmed that the members of the organization of the Islamic State "Daash" distorted the image of Islam abroad, pointing out that they are "harmful to the people of the infidel obvious disbelief. Those wrongdoers criminals hoping their careers and sounding conditions and consider their work, really aware that they were brought in order to humiliate the Muslim Ummah and hit the hearts to each other and later to be said about Islam is a religion bloodshed does not care and saves money nor blood". This assertion was part of his speech at the World Conference on "Islam and the fight against terrorism", which was organized by the Muslim World League on Sunday in Mecca, under King Salman bin Abdul Aziz, and included the participation of a large number of scholars from around the world. "These distorted the image of Islam abroad and attributed to Islam what s not from it, and they claimed an Islamic state (Daash) and God knows that the hypocrites are liars. [102] " Daash planned to attack Saudi Arabia during the Eid al-adha (September 2014), where the report noted that the fighters Daash planned armed operations in Saudi Arabia during the Hajj season in an attempt to topple Saudi Arabia [103]. 
Prince Mohammed bin Naif bin Abdulaziz, Minister of Interior and Chairman of the Supreme Hajj Committee (SHC), inspected the preparedness of the Hajj-related authorities participating in the general plan of this year's Hajj season. The warnings of the Custodian of the Two Holy Mosques, King Abdullah bin Abdulaziz Al Saud, came to light during his recent speech to

the Arab and Islamic nations and the world at large. "Regarding the Daash group, we know that Daash was formed with the support of some states and organizations. The Saudi security authorities will firmly combat this organization and others. The Saudi security forces deal with hundreds of terrorist operations with full ability and professionalism. The Kingdom of Saudi Arabia has also worked to prevent citizens from joining the countries of conflict or foreign groups, and has issued firm instructions in this regard." Regarding the International Alliance against Daesh, Prince Mohammed said that this was an urgent requirement to combat these terrorist organizations, especially since these organizations are found in important and strategic places.
Key Problem: The Saudi Arabia Ministry of Interior (MOI) and the SHC want to keep an eye on what people are talking about and thinking regarding Daesh. They want to use SM data as a source of information because SM sites are among the major online platforms for political associations as well as various interest groups expressing their opinions and thoughts.

Tool Application
The details below present how the tool will help the MOI and SHC with their SM data collection and capture their requirements:
1. Problem Domain:
o What are you going to achieve when you analyze the data? Based on the feedback gained, we need to make timely informed decisions regarding the terrorism awareness campaign.
o What is the purpose of doing this data analytics? To track Daesh attack news in Saudi Arabia, and how people are thinking about that terrorist attack.
o Where is the data you are looking for? Saudi Arabia.

o For when do you need this data? During the Hajj season, September 2014.
2. Data Source:
o Do you have something in mind? (List of SM sites) No
o What SM type do you prefer? (List of SM categories, Chapter 4 Section 4.1.1) Microblogging
o More popular with: NA/Men/Women? NA
o Age preference: NA, 13-18, 19-35, Above 35? NA
o Education: NA, College, no College? NA
o Higher income: NA, Yes, No? NA
3. Analytical Tool:
o What type of analytics do you want? (List of tool categories, Chapter 4 Section 4.2) SM Monitoring Software
o Price: Free/NA? NA
Tool Output:
Keywords and Constraints: Using Stanford CoreNLP [64] to analyze the user's answers, the system extracts keywords from the first two answers; the third answer gives the location constraint and the last one the time constraint.

Answer1: Based on the feedback gained, we need to make timely informed decisions regarding the terrorism awareness campaign.
Answer2: To track Daesh attack news in Saudi Arabia, and how people are thinking about that terrorist's attack.

Figure 6.10: Part of Speech NLP - Case 5

Nouns: feedback decisions terrorism awareness campaign attack news Saudi Arabia people terrorist's attack (see Figure 6.10). To make useful keywords, the system will first take each single word as a keyword; then it will combine every two adjacent words into a phrase and remove the duplicates.
Keywords:
Single words: feedback, decisions, terrorism, awareness, campaign, attack, news, Saudi, Arabia, people, terrorist's, attack.
Phrases: feedback decisions, terrorism awareness, campaign attack, news Saudi, Arabia people, terrorist's, attack.

Figure 6.11: Named Entity Recognition NLP - Case 5

Answer3: Saudi Arabia. The system will recognize Saudi Arabia as a location (see Figure 6.11).

Location: Saudi Arabia.
Answer4: During the Hajj season, September 2014. The system will recognize Season September 2014 as a date (see Figure 6.11). Time Frame: Season September 2014.
Recommended Data Source:
o Twitter.
Recommended Analytical Tool: Because they want SM Monitoring Software and the cost is not an issue, the recommended tool is Meltwater Buzz.
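Across the five cases, the location and time constraints are pulled from the NER output in the same way. A minimal sketch, assuming the NER result arrives as (word, label) pairs and falling back to the raw answer when no entity is found, as the tool does in Case 3:

```python
def extract_constraint(ner_tokens, label, raw_answer):
    """Group contiguous tokens tagged with `label` (e.g. LOCATION or DATE)
    into one constraint string; if nothing carries the label, fall back to
    the raw answer, as in the YouTube Channel case."""
    spans, current = [], []
    for word, tag in ner_tokens:
        if tag == label:
            current.append(word)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans[0] if spans else raw_answer

# The Case 1 answer "Saudi Arabia Market." as NER-labelled tokens.
tokens = [("Saudi", "LOCATION"), ("Arabia", "LOCATION"), ("Market", "O")]
print(extract_constraint(tokens, "LOCATION", "Saudi Arabia Market."))
# Saudi Arabia
```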

CHAPTER 7: TOOL EXPERIMENT AND VALIDATION

In this chapter, experimental work is presented in order to examine the effectiveness of the tool. The experimentation is applied to the cases given in Chapter 6, and it is divided into two parts. The first experiment performs a random data collection without applying the tool as a first step prior to the actual data collection. The second experiment is where the tool has been applied in an attempt to collect relevant feeds and accelerate SM data analytics. A comparison and an analysis of the experiments and the validation of the tool are also discussed.

7.1 Purpose of the Experiment
Validation is determining whether the system complies with the requirements, performs the functions for which it is intended, and meets the goals and user needs. It is a continuous process which must be applied in each phase of the software engineering process [28]. With the experimental work, we aim to validate that the tool answers the research question and fulfils its purpose. This can be made visible by measuring correctness and efficiency when analyzing SM data with and without using the tool. Therefore, in order to ensure the acceleration of the analytics task through providing correct and sufficient data and the most fitting analytical tool according to user requirements, the validity of the tool can be proven to the extent of its correctness and efficiency.

7.2 Design and the Scope of the Experiment
This experiment is designed to answer the research question: How can we define an architecture of a SM data capturing tool which accelerates the analysis tasks? In order to validate the quality of the experiment to answer the question, it needs to be measured using software measurement factors, especially those concerned with how well the tool runs, the Product Operation Factors: correctness and efficiency [50]. The definitions of those factors and the methods of measurement are specified below:

Correctness:
Definition: The extent to which a program satisfies its specifications and fulfils the user's mission objectives [104].
Method of Measurement: Specifying the percentage of relevant outputs (feeds) by using equation (1).

Correctness = (Total # of relevant feeds / (Sample Size × Number of Keywords)) × 100    (1)

Efficiency:
Definition: The amount of computing resources and code required by a program to perform a function [104].
Method of Measurement: Because we are concerned with accelerating the analysis task, the time the SM analysis process takes needs to be measured. In order to do that, the user will have to use a Time Recording Log sheet (see Figure 7.1) to record the time consumed to analyze SM data. This log sheet is inspired by the Time Recording Log sheet of the Personal Software Process (PSP) [105].

Figure 7.1: Experiment Time Recording Log
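Equation (1) can be computed directly from the per-keyword counts of a results table. The counts in the example call below are hypothetical, chosen only to illustrate the arithmetic with the fixed sample size of 300 feeds:

```python
def correctness(relevant_per_keyword, sample_size=300):
    """Equation (1): percentage of relevant feeds over all feeds
    inspected (sample size x number of keywords)."""
    total_relevant = sum(relevant_per_keyword)
    total_feeds = sample_size * len(relevant_per_keyword)
    return 100.0 * total_relevant / total_feeds

# Hypothetical per-keyword counts for six keywords (1800 feeds inspected).
print(correctness([150, 145, 140, 148, 145, 145]))  # 48.5
```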

Different cases from Chapter 6 will be shown to random users, and then these two factors will be measured in both situations: with the use of the tool, and without the tool. The data collection sample size is fixed at 300 feeds. According to sampling theory [106], we are in a situation where the data is too large and the measurement process too time-consuming to allow more than a small segment of the population to be observed. After conducting the experiment, both results are analyzed. Other quality factors such as usability and reliability [50] [104] are beyond the scope of this experiment.

7.3 Experiment

Case 1: Start On-Line Business Project

Without Tool
As given in Section 6.2, according to the case description, after searching, the user used Trackur as a SM analytics tool (for more information about this tool, see Section 2.2.2). From the case he extracted these keywords: K1: original phone products, K2: phone brands, K3: head phones, K4: phone cases, K5: phone chargers, K6: phone accessories trends. When those keywords were used for the analysis task with Trackur [105], the retrieved relevant feeds for each keyword K# are given in Table 7.1:
Keywords K1 K2 K3 K4 K5 K6 Total # of relevant Feeds
Table 7.1: Keywords Relevant Feeds Numbers - Case 1 Without Tool

Figure 7.2: Snapshot of Trackur - Case 1 Without Tool

Correctness:
Correctness = (Total # of relevant feeds / (Sample Size × Number of Keywords)) × 100 = (Total # of relevant feeds / (300 × 6)) × 100 = 48.5%
Efficiency:
Total Consumed Time = 235 minutes. The detailed Time Recording Log sheet is shown in Figure 7.3.

Figure 7.3: Time Recording Log Sheet - Case 1 Without Tool

With Tool
As described in Section 6.2.2, the Tool Output is:
Keywords:
o Single words: phone, accessories, people, days, field, accessories, products, phone, brands, head, phones, cases, chargers, accessories, trends.
o Phrases: phone accessories, people days, field phone, accessories phone, products phone, brands head, phones phone, cases phone, chargers phone, accessories trends.
Location: Saudi Arabia.
Time Frame: Last month.
Recommended Data Source:
o For all types of SM.
Recommended Analytical Tool:

o Because they want SM Monitoring Software for all types of SM for free, the recommended tool is Social Mention.
The following keywords were composed by the user after considering the system's suggestions: K1: phone accessories, K2: phone brands, K3: phone products, K4: phone trends, K5: phone accessories trends, K6: head phones, K7: head phones brands, K8: phone cases, K9: phone cases trends, K10: phone cases brands, K11: phone chargers, K12: phone chargers accessories. Filtered by the time and location constraints as provided by the tool, Table 7.2 presents how the number of relevant feeds improves.
Keywords K1 K2 K3 K4 K5 K6 K7 K8 K9 K10 K11 K12 Total # of relevant Feeds
Table 7.2: Keywords Relevant Feeds Numbers - Case 1 With Tool

Figure 7.4: Snapshot of SocialMention - Case 1 With Tool

Correctness:
Correctness = (Total # of relevant feeds / (Sample Size × Number of Keywords)) × 100 = (Total # of relevant feeds / 3600) × 100 = 87.5%

Efficiency:
Total Consumed Time = 90 minutes. The detailed Time Recording Log sheet is shown in Figure 7.5.

Figure 7.5: Time Recording Log Sheet - Case 1 With Tool

Results

The chart in Figure 7.6 compares the two quality factors for Case 1. It shows how relevancy (correctness) improves while analysis time (efficiency) decreases after using the tool.

Figure 7.6: Quality Factors Comparison - Case 1 (Correctness %, Efficiency in minutes; WTT vs. WT)

Case 4: Middle East Respiratory Syndrome Awareness

Without Tool

As given in Section 6.5, according to the case description, after searching, the user chose Trackur (more information about the tool in Section 2.2.2) as a SM analytics tool. From the case he extracted these keywords: K1: MERS-CoV Infection, K2: MERS-CoV Protection, K3: MERS-CoV Awareness. The number of relevant feeds for each keyword is given in Table 7.3:

Keywords K1 K2 K3 Total
# of relevant Feeds
Table 7.3: Keywords Relevant Feeds Numbers - Case 4 Without Tool

Correctness:
Correctness = (Total # of relevant feeds / 900) × 100 = 61.9%

Efficiency:
Total Consumed Time = 205 minutes. The detailed Time Recording Log sheet is shown in Figure 7.7.

Figure 7.7: Time Recording Log Sheet - Case 4 Without Tool

With Tool

As described in Section 6.5.2, the Tool Output is:

Keywords:
o Single words: MERS-CoV, awareness, virus, precautions, news, Saudi, Arabia, people, infections.
o Phrases: MERS-CoV awareness, virus precautions, MERS-CoV news, Saudi Arabia, people MERS-CoV, infections.

Location: Saudi Arabia.
Time Frame: A week starting from June 1st.
Recommended Data Source:
o Twitter.
Recommended Analytical Tool: Meltwater Buzz.

The following keywords are made by the user after considering the system suggestions: K1: MERS-CoV awareness, K2: MERS-CoV precautions, K3: MERS-CoV news, K4: MERS-CoV infections, K5: MERS-CoV infections news.

Filtered by the time and location constraints provided by the tool, Table 7.4 presents how the number of relevant feeds improves.

Keywords K1 K2 K3 K4 K5 Total
# of relevant Feeds
Table 7.4: Keywords Relevant Feeds Numbers - Case 4 With Tool

Correctness:
Relevancy Percentage (Correctness) = (Total # of relevant feeds / 1500) × 100 = 85.8%

Efficiency:
Total Consumed Time = 85 minutes. The detailed Time Recording Log sheet is shown in Figure 7.8.

Figure 7.8: Time Recording Log Sheet - Case 4 With Tool

Results

The chart in Figure 7.9 compares the two quality factors for Case 4. It shows how relevancy (correctness) improves while analysis time (efficiency) decreases after using the tool.

Figure 7.9: Quality Factors Comparison - Case 4 (Correctness %, Efficiency in minutes; WTT vs. WT)

Case 5: Start On-Line Business Project

Without Tool

As given in Section 6.6, according to the case description, after searching, the user chose Trackur (more information about the tool in Section 2.2.2) as a SM analytics tool. From the case he extracted these keywords: K1: Daesh Attack, K2: Daesh terrorist, K3: Daesh Hajj Attack. The number of relevant feeds for each keyword is given in Table 7.5:

Keywords K1 K2 K3 Total
# of relevant Feeds
Table 7.5: Keywords Relevant Feeds Numbers - Case 5 Without Tool

Correctness:
Relevancy Percentage (Correctness) = (Total # of relevant feeds / 900) × 100 = 59.4%

Efficiency:
Total Consumed Time = 145 minutes. The detailed Time Recording Log sheet is shown in Figure 7.10.

Figure 7.10: Time Recording Log Sheet - Case 5 Without Tool

With Tool

As described in Section 6.6.2, the Tool Output is:

Keywords:
o Single words: feedback, decisions, terrorism, awareness, campaign, attack, news, Saudi, Arabia, people, terrorists, attack.

o Phrases: feedback decisions, terrorism awareness, campaign attack, news Saudi, Arabia people, terrorists attack.

Location: Saudi Arabia.
Time Frame: During the Hajj Season, September.
Recommended Data Source:
o Twitter.
Recommended Analytical Tool: Meltwater Buzz.

The following keywords are made by the user after considering the system suggestions: K1: terrorism awareness, K2: Daesh terrorism, K3: terrorist attack, K4: Daesh attack, K5: Daesh Hajj Attack, K6: Hajj terrorism.

Filtered by the time and location constraints provided by the tool, Table 7.6 presents how the number of relevant feeds improves.

Keywords K1 K2 K3 K4 K5 K6 Total
# of relevant Feeds
Table 7.6: Keywords Relevant Feeds Numbers - Case 5 With Tool

Correctness:
Relevancy Percentage (Correctness) = (Total # of relevant feeds / (Sample Size × Number of Keywords)) × 100 = 84.5%

Efficiency:
Total Consumed Time = 65 minutes. The detailed Time Recording Log sheet is shown in Figure 7.11.

Figure 7.11: Time Recording Log Sheet - Case 5 With Tool

Results

The chart in Figure 7.12 compares the two quality factors for Case 5. It shows how relevancy (correctness) improves while analysis time (efficiency) decreases after using the tool.

Figure 7.12: Quality Factors Comparison - Case 5 (Correctness %, Efficiency in minutes; WTT vs. WT)

7.4 Results Comparison

A comparison of the two processes based on the quality factors, with the tool (WT) and without the tool (WTT), is summarized in Table 7.7:

Quality Factors   Case 1                      Case 4                      Case 5
                  Correctness %  Efficiency   Correctness %  Efficiency   Correctness %  Efficiency
WTT               48.5           235 min      61.9           205 min      59.4           145 min
WT                87.5           90 min       85.8           85 min       84.5           65 min
Table 7.7: Results Comparison

Figure 7.13: Quality Factors Comparison Chart - Average Results

The chart presented in Figure 7.13 is a visualization of the data in Table 7.7. It shows that the tool improved the correctness of the relevant feeds and reduced the time for analyzing SM data (efficiency) for the cases under experiment. Furthermore, this experiment shows the effectiveness of the tool based on two quality factors, correctness and efficiency, and helps answer the research question: How can we define an architecture of a SM data capturing tool, which accelerates the analysis tasks? The importance of these two factors is discussed in the next chapter as contributions to SM analytics.

Rate of Improvements (ROI)

To measure the rate of improvement (ROI) for the two factors in the experiment, the following equations are used [107]:

Percentage Increase = ((New Number - Original Number) / Original Number) × 100  (1)

Percentage Decrease = ((Original Number - New Number) / Original Number) × 100  (2)

For correctness, equation (1) is used, because relevancy should increase after applying the tool. For efficiency, equation (2) is used, because time should decrease after applying the tool. Equations (3) and (4) are these equations customized to the variables of the experiment. Table 7.8 reflects the ROI for the two factors in the three cases.

Correctness ROI = ((WT Value - WTT Value) / WTT Value) × 100  (3)

Efficiency ROI = ((WTT Value - WT Value) / WTT Value) × 100  (4)

Quality Factors   Case 1                      Case 4                      Case 5
                  Correctness %  Efficiency   Correctness %  Efficiency   Correctness %  Efficiency
WTT               48.5           235 min      61.9           205 min      59.4           145 min
WT                87.5           90 min       85.8           85 min       84.5           65 min
ROI               80%            62%          38%            59%          42%            55%
Table 7.8: Correctness and Efficiency Rate of Improvements (ROI)

Unpaired T-Test

Statistical tests analyze a particular set of data in order to draw more general conclusions. There are several approaches to this, but the most common is based on assuming that the data in the population follow a certain distribution. The distribution used most commonly is the bell-shaped Gaussian distribution, also called the Normal distribution. This assumption
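Equations (3) and (4) can be checked directly against the measured values; a minimal sketch:

```python
def correctness_roi(wtt, wt):
    """Equation (3): percentage increase in correctness after applying the tool."""
    return (wt - wtt) / wtt * 100

def efficiency_roi(wtt, wt):
    """Equation (4): percentage decrease in analysis time after applying the tool."""
    return (wtt - wt) / wtt * 100

# Case 1: correctness 48.5% -> 87.5%, analysis time 235 -> 90 minutes
print(round(correctness_roi(48.5, 87.5)))  # → 80
print(round(efficiency_roi(235, 90)))      # → 62
```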

underlies many statistical tests such as t-tests and ANOVA, as well as linear and nonlinear regression [108]. The experiment data have been gathered in Table 7.9. As shown, the experiment data follow the normal distribution: there is no very high nor very low value.

K1 K2 K3 K4 K5 K6 K7 K8 K9 K10 K11 K12
WTT
WT
K13 K14 K15 K16 K17 K18 K19 K20 K21 K22 K23
WTT
WT
Table 7.9: Experiment Data Sample

An unpaired t-test [109] is conducted to compare two population means: the relevant feeds with the tool (WT) and without the tool (WTT). The unpaired t-test is chosen for the following reasons [109]:

An unpaired test, because the two rows of data are not matched: the WT and WTT measurements were taken for different keywords, so corresponding values are not related to each other.
A parametric test, because the data sample follows the normal distribution.

GraphPad, an online statistics tool, is used to conduct the unpaired t-test. Unpaired t-test results:

P value and statistical significance: The two-tailed P value is less than

The P value is a probability, with a value ranging from zero to one, that answers this question: "In an experiment of this size, if the populations really have the same mean, what is the probability of observing at least as large a difference between sample means as was, in fact, observed?" [109]. By conventional criteria, this difference is considered to be extremely statistically significant.

Confidence interval: The confidence interval (CI) of a mean tells how precisely the mean is determined [109]. The mean of WTT minus WT, together with the 95% confidence interval of this difference, is reported. Intermediate values used in the calculations include the t statistic, df = 31, and the standard error of the difference.

The t-test results proved that the proposed tool has achieved its objectives, and with high power, given the statistically significant P value.
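For reference, the unpaired (Student's) t statistic that GraphPad reports can be reproduced by hand. The samples below are made up for illustration and are not the Table 7.9 data:

```python
import math

def unpaired_t(a, b):
    """Student's unpaired t-test statistic and degrees of freedom.

    Assumes equal variances, as in the classic unpaired t-test [109].
    """
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # Unbiased sample variances
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    # Pooled variance and standard error of the difference of means
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(sp2 * (1 / na + 1 / nb))
    return (ma - mb) / se, na + nb - 2

# Illustrative samples only
t, df = unpaired_t([10, 12, 14], [20, 22, 24])
print(round(t, 3), df)  # → -6.124 4
```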

CHAPTER 8: DISCUSSION

This chapter reflects a discussion of the research in terms of its contributions to the data collection process in relation to SM Big Data analytics solutions, and to requirements engineering for the data collection process.

8.1 Analysis of Research Outcomes

This thesis started with the motivation to develop a well-defined approach able to assist executives and decision makers in making timely management decisions. The development of this approach is challenging, especially with the overwhelming SM data generated and stored everywhere [21]. This leaves data analysts unsure about how to gather and analyze all the available data and produce results in shorter time frames [10], [11]. What has been observed and acknowledged is that the key enabler to obtaining valuable insights in real time is the collection of less, more focused data [110], using the most suitable tool according to the user's requirements. Thus, the right answers can be obtained without analyzing huge volumes of data, much of which is likely to be needless and irrelevant. From this point of view, this research aimed to examine the decision-back approach [8] for capturing SM data and to study the different factors that govern the SM data collection and analysis processes. It has emphasized the essence of analyzing the requirements that need to be addressed prior to attempting to collect data. The research developed an understanding of the different SM Big Data analytical techniques and tool categories, and of different SM sites' features and categorizations. This helped to develop the conceptual model, which is the first building block in designing the requirements acquisition tool architecture.

Resulting outcome of the tool

The proposed Requirements Acquisition Tool Architecture is an application of the decision-back approach concept [8], enhanced by [75] and the W*H model [33]. The tool performs data collection and analytics based on answering three phases of

questions, which leads to a stronger understanding of the research project and offers better choices to accelerate the SM data analytics task.

For this Big Data collection challenge, a tool architecture was developed to apply the decision-back SM data collection approach (Figure 5.6). The architecture in Figure 5.6 fulfills the identified needs and serves the following purposes:

Each phase of questions in the tool is designed to cover the main concepts of the decision-back data collection approach [8] [75] and to develop a systematic plan for capturing the user's analytics requirements prior to collecting data.
The tool's questions are produced to define the keywords, the time and location constraints, a suitable data source, and a SM analytical tool, inspired by the W*H service modeling approach [33], which is an extension of the inquiry system of the Hermagoras of Temnos frameworks.
In the problem domain phase, the user's answers are analyzed using an NLP technique [64], which gives the user a chance to describe the problem in his/her own words. Furthermore, analyzing those answers generates more related keywords, including the time and location constraints, and thus improves correctness.
The second phase's questions are designed using the W*H service modeling approach [33] and a deep study of SM sites' features and categorization.
The third phase's questions are designed using the W*H service modeling approach [33] and an intensive study of SM analytical tool types.
The tool's 3 phases comprise 12 questions in total, addressing the description of the user's SM data analytics requirements, and give a tailored solution according to his/her needs.
The tool's comprehensiveness and flexibility support its application to any given SM data analytics case prior to the data collection phase of SM Big Data solutions, and this has been proven to be statistically significant.
The main contribution of this research is the design of a value-added and well-defined process to capture SM data analysis requirements upfront of the data

collection, in order to accelerate the analytics tasks. The requirements acquisition tool also contributes to:

a. The requirements engineering field, by building a tool that helps the user capture his/her requirements prior to the data collection process during SM data analytics.
b. The software engineering field, by providing a user-centered solution that captures the user's SM data analysis needs within a user-friendly environment.

Tool Evaluation Case Studies

The tool is examined in Chapter 6, through its application to five different case studies of real-life SM data analytics problems. Without identifying the key factors for collecting the right SM data for analysis, the right SM data source, and the right SM data analytical tool, it would be difficult in such cases to make decisions about what to capture, from where, and using what. The case studies have shown how a systematic and structured process can provide a useful SM analytics requirements elicitation plan that yields timely and more relevant analytics results (see Figure 8.1). Therefore, it is worthwhile to incorporate the requirements acquisition tool into any SM Big Data solution, as it provides a roadmap for a SM data analytical solution.

Tool Validation Experiment

The effectiveness of the tool has been validated in Chapter 7, through three experiments using three cases from Chapter 6. The purpose of the experiment is to validate that the tool answers the research question and fulfils its purpose. This was made visible by measuring correctness and efficiency when analyzing SM data in two situations, with and without the tool, for the three cases. Moreover, the quality measurement process conducted to validate the tool is applied during the tool design; this key aspect further distinguishes it from the testing and verification

activities. Correctness and efficiency are predictive in nature and oriented toward the development phases rather than toward the finished system. This early measurement gives an indication of how well the software product will operate in relation to the quality requirements levied on it. In other words, an initial assessment is made of the quality of the software system. By obtaining such an assessment before testing or final delivery, faults or inadequacies can be identified and corrected early enough in the development process to result in substantial cost savings [104].

The findings of the experiment have shown that when doing analytics without the tool (WTT), correctness was low and the time required for analytics was high (efficiency was low). However, when applying the tool (WT) to the same cases, correctness and efficiency improved, and the ROI for correctness and the ROI for efficiency both increased. The figures below show how the proposed tool improved both factors for each case in the experiment.

Correctness for the three cases with and without the tool:

Figure 8.1: Experiment Summary - Correctness Factor Comparison (Correctness % for Cases 1, 4, and 5; WTT vs. WT)

Efficiency for the three cases with and without the tool:

Figure 8.2: Experiment Summary - Efficiency Factor Comparison (Efficiency in minutes for Cases 1, 4, and 5; WTT vs. WT)

It is visible in the figures above how the quality factors sharply improved for each case when the tool was applied (WT), compared to the situation in which the tool was not applied (WTT). Therefore, the validation experiment and the unpaired t-test have proved the tool's statistical significance and its effectiveness in answering the research question: How can we define an architecture of a SM data capturing tool, which accelerates the analysis tasks? It improved the current process of capturing the user's SM data analytics requirements.

CHAPTER 9: CONCLUSION AND FUTURE WORK

9.1 Conclusion

In spite of the challenging analytical workload of SM Big Data, it is a key enabler for business success. Investigating the problem and adopting the right track to recognize patterns and reveal customer insights from large volumes and varieties of data creates new opportunities and resolves bottlenecks. Indeed, this thesis has provided a view of the Big Data space, focusing on SM Big Data, which is a broad and complex Big Data environment, and considering the four Big Data characteristics: volume, velocity, variety, and value. Moreover, it described different SM sites, analytical types and tools, and produced a framework for applying the decision-back approach to capturing SM data according to analytical needs. The initial description of SM Big Data and its distinguishing features from other types of Big Data helped define different methodologies and techniques for SM data analysis, and the various analytical tools were then categorized according to their functionalities.

This innovative research was conducted in the areas of the decision-back approach for analyzing SM data, and SM analytics. The model was built on the following basic concepts: SM Problem Domain, SM Data Sources, and SM Analytical Tools. After studying the W*H model, a model used to describe IT services, the model's concepts were fine-tuned so that they could be formulated in a more descriptive manner. The fine-tuned model has been used to develop the requirements acquisition tool for the decision-back approach to capturing SM data analysis requirements. The tool has been applied to different cases; for each of those cases, the problem is analyzed by answering the questions at each module in the tool. The answers highlight the keywords and the constraints, filter the most appropriate data source, and suggest a suitable analytical tool.
The experimentation with the tool for the decision-back approach proves that using the tool improves the normal process of data collection by providing just the relevant data, and therefore allows extracting value from large datasets in a timely manner. This reflects on addressing the research

question How can we define an architecture of a SM data capturing tool, which accelerates the analysis tasks? Clearly, the proposed tool contributes a paradigm revision to the SM data analytics process, and the Requirements Acquisition Tool is an added value, as it brings simplicity, efficiency, and correctness into the arena of SM Big Data collection.

9.2 Limitations

This tool provides a roadmap for SM Big Data analysts: rather than initiating a mass collection without any criteria, collect only the data that matters and determine the analytics requirements upfront, which in turn significantly reduces the volume, veracity, velocity, and variety of the collected data and the time required to analyse SM data. In addition to the possible improvements already mentioned in this thesis, there are several limitations and valuable enhancements that are important to mention as further research. These limitations and recommendations are described in no particular order.

Limited Number of Cases

The five cases studied in this thesis covered various types of SM analytical techniques and different problem domains. Nevertheless, using more cases to cover multiple SM Big Data sources, tools, and techniques, and asking the same tool questions for each of the different cases, might lead to further tool refinements. Overall, this may increase the level of validity and generality of the tool.

Limited Database of Tools and Social Media Sites

Table 5.1, which represents the database of SM analytical tools, and Table 5.2, which represents SM site characteristics, are used as sources for the recommendation system in the proposed tool. They showed how the proposed tool should recommend the most applicable data source and analytical tool, even though they contain a limited number of records (tools/SM sites). A major advantage is that adding more records to those tables will improve the tool's recommendation system and expand its validity.

Experiment and Validation Lack of Generalizability

As the experiment was applied to three real-life cases that require SM feeds, the study may be limited in external validity to other cases, because the findings are based on only a few cases.

Limited Quality Factors Validation

The experiment validated two quality factors, correctness and efficiency, since we were aiming to answer the research question. Other quality factors should also be measured in an improved version of the tool to ascertain that the tool meets software quality purposes.

The Limited Use of the NLP Tool

The experiment used just two components of the Stanford CoreNLP toolkit [64]: the part-of-speech (POS) tagger [86] [87] and the named entity recognizer (NER) [88], to analyze the user's answers. Using the other components (the parser, the coreference resolution system, the sentiment analysis, and the bootstrapped pattern learning tools) may lead to improved language processing and more accurate keywords and constraints.

Limited Number of Cases in the Experiment

The three cases tested in this thesis covered various types of SM analytical techniques and different problem domains, and they demonstrated the validity of the tool by improving correctness and efficiency. However, the ROI was generally around ±50% because only 3 test cases were used; if the number of cases were higher, the generalized ROI would be higher.

Future Work Directions

To conduct ongoing experimental research on the tool for further improvements.
To pursue more advanced NLP and analyze users' answers in different languages.

To develop the tool as a plug-in used before starting data collection and analytics.
To provide the relevant data, recommended tool, and data source for a timely decision-making strategy and process improvement.

REFERENCES

[1] S. B. Siewert (Feb. 2013), "Social media analytics: Making customer insights actionable". IBM Corporation.
[2] P. Gundecha and H. Liu (2014), "Mining Social Media: A Brief Introduction". Cambridge University Press, UK.
[3] A. Nelson (2013), "How to Use Social Media Data for Customer Insight". [Online]. [Accessed 12 February 2014].
[4] (n.d.) (2015), "Wholesale Fraud Management". Enghouse Networks, Ontario, Canada.
[5] A. Semenov (May 31, 2013), Principles of Social Media Monitoring and Analysis Software. Jyväskylä, Finland: University of Jyväskylä.
[6] G. Lotan, E. Graeff, M. Ananny, D. Gaffney, I. Pearce and d. boyd (2011), "The Revolutions Were Tweeted: Information Flows during the 2011 Tunisian and Egyptian Revolutions". International Journal of Communication 5.
[7] META Group (2011), "Big Data, What it is and why it matters". SAS. [Online]. [Accessed 4 March 2014].
[8] P. Breuer, L. Forina and J. Moulton (2013), "Beyond the hype: Capturing value from big data and advanced analytics". McKinsey & Company.
[9] B. Thalheim and Y. Kiyoki (2012), "Analysis-Driven Data Collection, Integration and Preparation for Visualization". Frontiers in Artificial Intelligence and Applications, EJC.
[10] D. Zeng, H. Chen, R. Lusch and S.-H. Li (Nov.-Dec. 2010), "Social Media Analytics and Intelligence". Intelligent Systems, IEEE, vol. 25, no. 6.

[11] I. Kompatsiaris, D. Gatica-Perez, X. Xie and J. Luo (2013), "Special Section on Social Media as Sensors". IEEE Transactions on Multimedia, vol. 15, no. 6.
[12] T. Chardonnens (June 2013), Big Data analytics on high velocity streams. Switzerland: University of Fribourg.
[13] C. Regina, M. Beyer, M. Adrian, T. Friedman and D. Logan (2013), "Top 10 Technology Trends Impacting Information Infrastructure 2013". Gartner.
[14] S. Singh and N. Singh (2012), "Big Data analytics". In Communication, Information & Computing Technology (ICCICT), International Conference.
[15] S. Madden (2012), "From Databases to Big Data". IEEE Internet Computing, 16(3):4-6.
[16] S. Kaisler, F. Armour, J. A. Espinosa and W. Money (2013), "Big Data: Issues and Challenges Moving Forward". In Proceedings of the 46th Hawaii International Conference on System Sciences, HICSS '13.
[17] T. Kraska (2013), "Finding the Needle in the Big Data Systems Haystack". IEEE Internet Computing, 17(1).
[18] M. Markus (Oct. 2013), Towards a Big Data Reference Architecture. Eindhoven, Netherlands: Department of Mathematics and Computer Science, Eindhoven University of Technology.
[19] IBM Corp. (2013), "The Big Data and Analytics Hub". IBM. [Online]. [Accessed 14 Feb. 2014].
[20] A. De Mauro, M. Greco and M. Grimaldi (September 2014), "What is big data? A consensual definition and a review of key research topics". AIP Conference Proceedings.
[21] D. M. Boyd and N. B. Ellison (2007), "Social Network Sites: Definition, History, and Scholarship". Journal of Computer-Mediated Communication, vol. 13, no. 1.

[22] C. C. Aggarwal (2011), Social Network Data Analytics. New York: Springer.
[23] H. Ting (2008), "Web Mining Techniques for On-line Social Networks Analysis". In Service Systems and Service Management, 2008 International Conference.
[24] D. M. Boyd and N. B. Ellison (2007), "Social Network Sites: Definition, History, and Scholarship". University of California-Berkeley, Michigan State University, USA.
[25] N. Sharma (2011), "Sphere of Influence, The Importance of Social Network Analysis". Solutions for Enabling Lifetime Customer Relationships, Pitney Bowes Software.
[26] A. Katal, M. Wazid and R. H. Goudar (8-10 Aug. 2013), "Big data: Issues, challenges, tools and good practices". In Contemporary Computing (IC3), Sixth International Conference, Noida.
[27] G. Guest, K. M. MacQueen and E. E. Namey (2012), "Applied Thematic Analysis". Sage Publications, Thousand Oaks, Calif.
[28] I. Sommerville (2010), Software Engineering, 9th Edition, Addison Wesley.
[29] D. Damian, J. Chisan, L. Vaidyanathasamy and Y. Pal (2005), "Requirements Engineering and Downstream Software Development: Findings from a Case Study". Empirical Software Engineering, vol. 10, no. 3.
[30] S. W. Hermansen (2012), "Reducing Big Data to Manageable Portions". In Southeastern SAS User's Group (SESUG) Conference, USA.
[31] Z. Guo and J. Wang (2011), "Information retrieval from large data sets via multiple winners-take-all". International Symposium on Circuits and Systems (ISCAS) Conference, Rio de Janeiro.
[32] M. Saunders, P. Lewis and A. Thornhill (2012), Research Methods for Business Students. Pearson, England.

[33] A. Dahanayake and B. Thalheim (2014), "W*H: The Conceptual Model for Services". ESF 2014 workshop on "Correct software for web applications", Springer-Verlag.
[34] C. Otero and A. Peter (2015), "Research Directions for Engineering Big Data Analytics Software". Intelligent Systems, IEEE, vol. 30, no. 1.
[35] D. Mysore, S. Khupat and S. Jain (2013), "Big Data architecture and patterns, Part 1: Introduction to Big Data classification and architecture". IBM Corp.
[36] C. Spencer, "Big Data scenarios and case studies". IBM Corp.
[37] M. Alswilmi, N. Alnajran and A. Dahanayake (2014), "Conceptual Framework for Big Data Analytics Solutions". Proceedings of the 24th International Conference on Information Modelling and Knowledge Bases (EJC 2014).
[38] T. Morzy, M. Wojciechowski and M. Zakrzewicz (2002), "Efficient Constraint-Based Sequential Pattern Mining Using Dataset Filtering Techniques". Poznan University of Technology, Institute of Computing Science, Poland.
[39] D. G. Andersen and N. Feamster (2006), "Challenges and Opportunities in Internet Data Mining". Carnegie Mellon University, Pittsburgh, PA.
[40] H.-H. Lee and W.-S. Lee (2010), "Consistent collective evaluation of multiple continuous queries for filtering heterogeneous data streams". Knowledge and Information Systems, vol. 22, no. 2.
[41] A. Claire and B. Brisset (2013), "Managing Semantic Big Data for Intelligence". STIDS - CEUR Workshop Proceedings, vol. 1097.
[42] F. Daniel (2012), "Extract, Transform, and Load Big Data with Apache Hadoop". White paper, Intel.
[43] Y. W. Zhao, W.-J. van den Heuvel and X. Ye (2013), "Exploring big data in small forms: A multi-layered knowledge extraction of social networks". Big Data, 2013 IEEE International Conference.
[44] M. Song and M. C. Kim (2013), "RT^2M: Real-Time Twitter Trend Mining

System". Social Intelligence and Technology (SOCIETY), 2013 International Conference.
[45] M. Nguyen, T. Ho and P. Do (2013), "Social Networks Analysis Based on Topic Modeling". Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2013 IEEE RIVF International Conference.
[46] M. Thelwall, D. Wilkinson and S. Uppal (2010), "Data Mining Emotion in Social Network Communication: Gender differences in MySpace". Journal of the American Society for Information Science and Technology, vol. 61, no. 1.
[47] D. Hansen, D. Rotman, E. Bonsignore, N. Milic-Frayling, E. Rodrigues, M. Smith and B. Shneiderman (2012), "Do You Know the Way to SNA?: A Process Model for Analyzing and Visualizing Social Media Network Data". In Social Informatics (SocialInformatics), 2012 International Conference.
[48] R. Colbaugh and K. Glass (2013), "Analyzing Social Media Content for Security Informatics". In Intelligence and Security Informatics Conference (EISIC), 2013 European.
[49] R. T. Khasawneh, H. A. Wahsheh, M. N. Al-Kabi and I. M. Alsmadi (2013), "Sentiment analysis of Arabic social media content: a comparative study". In Internet Technology and Secured Transactions (ICITST), International Conference.
[50] D. Galin (2004), Software Quality Assurance: From Theory to Practice. England: Pearson.
[51] S. Pfleeger and J. Atlee (2010), Software Engineering: Theory and Practice, 4th edition, Pearson Education.
[52] L. Westfall (2005), "Software Requirements Engineering: What, Why, Who, When, and How". Software Quality Professional, vol. 7, no. 4.
[53] N. Alnajran (2015), "Big Data Analytics and Scenario-based Big Data Collection". Master Thesis, Prince Sultan University, Riyadh, Saudi Arabia.
[54] M. Troester (2012), "Big Data meets Big Data Analytics". SAS.

152 References [55] J. Abuin, J. Pichel, T. Pena, P. Gamallo and M. García (2014), "Perldoop: Efficient execution of Perl scripts on Hadoop clusters" Big Data (Big Data), 2014 IEEE International Conference, pp [56] T. White (2012) "Hadoop: The Definitive Guide. UK: O'Reilly. [57] Apache Hadoop. The Apache Software Foundation. Retrieved from [Accessed 26 April 2014]. [58] J. Dean, and S. Ghemawat (2008), MapReduce: Simplified Data Processing on Large Clusters Communications of the ACM, Vol. 51, No. 1, Pages [59] S. Kurazumi, T. Tsumura, S. Saito, and H. Matsuo (2012), "Dynamic Processing Slots Scheduling for I/O Intensive Jobs of Hadoop MapReduce" Networking and Computing (ICNC), Third International Conference, Pages 288,292, 5-7. [60] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur (2000), PVFS: A parallel file system for Linux clusters in Proceedings of 4th Annual Linux Showcase and Conference, Pages [61] M. K. McKusick, and S. Quinlan (2009), GFS: Evolution on Fast-forward ACM Queue, Vol. 7, No. 7, Page 10. [62] K. Shvachko, H. Kuang, S. Radia, and R. Chansler (2010), "The Hadoop Distributed File System" Mass Storage Systems and Technologies (MSST), IEEE 26th Symposium, Pages 1,10, 3-7. [63] X. Qin, H. Wang, F. Li, B. Zhou, Y. Cao, C. Li, H. Chen, X. Zhou, X. Du, and S. Wang (2012), "Beyond Simple Integration of RDBMS and MapReduce -- Paving the Way toward a Unified System for Big Data Analytics: Vision and Progress" Cloud and Green Computing (CGC), Second International Conference, Pages 716,725, 1-3. [64] C. D. Manning, Surdeanu, Miha, Bauer, John, Finkel, B. Jenny, Bethard, Steven J. and,. D. McClosky (2014), "The Stanford CoreNLP Natural Language Processing Toolkit." of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp Page

153 References [65] M. Leiter (2014), "How to Choose Social Media Platforms," Melissa Leiter, San Francisco. [66] P. Meier (2013), "Classification of Social Media Platforms" Delvalle Institute Knowledge Base. [67] A. Mayfield (2008), What Is Social Media. US: icrossing. [68] A. DuVander (2013), "How Top Social APIs Use Social Media". Programmable web. [69] E. Ravinscraft (2013), "Which Social Network Should I Use?". LifeHacker. [Online]. Available: [70] A. Dean (2009), "WhatIs". TechTarget. [Online]. Available: [Accessed ]. [71] A. Bozzuto (2012), "The Difference Between Facebook, Tiwitter, Linkedin, Google+, Youtube, Pinterest". IMPACT Branding and Design. [72] J. Taylor (2014), "Choosing a social media monitoring tool" Our Social Time. [Online]. Available: [73] M. E. Mármol (2013), "How to choose a Social Media Monitoring tool". edigital - Digital Marketing Consultants and Trainers in Sydney, Sydney. [74] J. Bear (2012), "Clearing Clouds of Confusion the 5 Categories of Social Media Software, Convince and Convert - Digital Marketing Advisors". [Online]. Available: [75] M. M. Berg (2014), Modelling of Natural Dialogues in the Context of Speechbased Information and Control Systems. Kiel: Christian-Albrechts University. [76] C. D. Ennis (1986), "Conceptual Frameworks as a Foundation for the Study of Operational Curriculum". Journal of Curriculum and Supervision, vol. 2, no. 1, pp. 133 Page

154 References [77] J. A. Michel (2012), Qualitative Research Design: An Interactive Approach. 3rd Edition, Sagepub. [78] D. Mills, (n.d.) Problem Domain. Cunningham & Cunningham, Inc.," [Online]. Available: [Accessed 10 April 2014]. [79] A. Smeda (2010), "A formal definition of software architecture behavioral concepts". Research Challenges in Information Science (RCIS), 2010 Fourth International Conference, pp , Nice, France. [80] P. Kruchten (1995), "Architectural Blueprints The 4+1 View Model of Software Architecture". IEEE Software, pp [81] Len Bass, P. Clements and R. Kazman (2003), Software Architecture in Practice. Second Edition: Addison Wesley. [82] N. Rozanski and E. Woods (2011), "Applying Viewpoint and Views to Software Architecture". Addison-Wesley Professional. [83] M. J. Bates and M. N. Maack (2010), Encyclopedia of Library and Information Science. 3rd Edition :Marcel Decker, Inc. [84] F. Hogenboom, F. Frasincar and U. Kaymak (2010), "An Overview of Approaches to Extract Information from Natural Language Corpora". 10th Dutch-Belgian Information Retrieval Workshop, p [85] F. Ricci, L. Rokach and B. Shapira (2011), Recommender Systems Handbook Newyork: Springer. [86] K. T. Alex and C. D. Manning (2000), " Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger". Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC- 2000), pp [87] K. Toutanova, D. Klein, C. Manning and Y. Singer (2003), "Feature-Rich Part-of- Speech Tagging with a Cyclic Dependency Network". HLT-NAACL 2003, pp Page

155 References 259. [88] J. R. Finkel, T. Grenager and C. Manning (2005), "Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling." 43nd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp [89] D. Chen, and C. D. Manning (2014), A Fast and Accurate Dependency Parser using Neural Networks. Proceedings of EMNLP [90] M. Recasens, M. D. Marneffe, and C. Potts (2013), The Life and Death of Discourse Entities: Identifying Singleton Mentions. In Proceedings of NAACL [91] H. Lee, A. Chang, Y. Peirsman, N. Chambers, M. Surdeanu and D. Jurafsky (2013), Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics 39(4). [92] H. Lee, Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, and D. Jurafsky (2011), Stanford's Multi-Pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task. In Proceedings of the CoNLL-2011 Shared Task. [93] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng and C. Potts (2013), Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. Conference on Empirical Methods in Natural Language Processing. [94] S. Gupta and C. D. Manning (2014), Improved Pattern Learning for Bootstrapped Entity Extraction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning (CoNLL) [95] The Stanford Natural Language Processing Group (2011), "Stanford Log-linear Part-Of-Speech Tagger," Stanford University. [Online]. Available: [Accessed 4 January 2015]. [96] Natural Language Processing Musings (2011), "Part of Speech Tags," Stanford University. [Online]. Available: [Accessed 4 January 2015]. [97] The Stanford Natural Language Processing Group (2015), "Stanford Named Entity 135 Page

156 References Recognizer (NER)". Stanford University. [Online]. Available: [Accessed 3 March 2015]. [98] (2011), "A Saving Lincoln Case Study," Hootsuite Inc.. [Online]. Available: [99] Centers for Disease Control and Prevention (2014), "MERS-CoV" Ministry of Health. [Online]. Available: [Accessed Feb 2015]. [100] Command and Control Center (2014), "Ministry of Health Institutes New Standards for Reporting of MERS-CoV". Ministry of Health. [Online]. Available: [Accessed 28 Feb 2015]. [101] D. I. Khan (2014), "Pakistan Taliban splinter group vows allegiance to Islamic State". Reuters. [102] Shafaqna (2015), "Mufti of Saudi Arabia:Daash is a terrorist not related to Islam". Shafaqna Online News. [103] Casey (2014), "Daash plans to enter Saudi Arabia during the pilgrimage season". worldanalysis. [104] J. P. Cavano and M. A. James (1978), "A framework for the measurement of software quality". ACM SIGMETRICS Performance, vol. 7, pp [105] W. S. Humphrey (2000), "The Personal Software Process". Software Engineering Ins., Pittsburgh. [106] T. &. L. P. Hill (2007), "Designing an Experiment, Power Analysis" in STATISTICS: Methods and Applications,Tulsa, Oklahoma, StatSoft, Inc. [107] J.M. (2011), "Percentage Change Increase and Decrease," SkillsYouNeed. [Online]. Available: [Accessed 12 March 2015]. [108] S. Nadarajah (2005), A Generalized Normal Distribution. Journal of Applied 136 Page

157 References Statistics, Vol.32, No. 7, P [109] J.C.F. Winter (2013), Using the Student s t-test with extremely small sample sizes, Practical Assessment, Research & Evaluation Journal, Vol.18, No. 10. [110] M. Bamberger, J. Rugh and L. Mabry (2012), "Chapter 3: Not Enough Money: Addressing Budget Constraints in RealWorld evaluation: working under budget, time, data and political constraints." in In RealWorld Evaluation Working Under Budget,Time, Data and Political Constraints., CA: Sage, Thousand Oaks. pp Page

APPENDICES

APPENDIX A. GLOSSARY

Architecture: The structure of a software-containing system, including the software and hardware components that make up the system and the interfaces and relationships between those components.
Business Requirement: A high-level business objective of the organization that builds a product, or of a customer who procures it.
Constraint: A restriction imposed on the choices available to the user and/or developer for the use/design and construction of a product.
Functional Requirement: A statement of a piece of required functionality or a behaviour that a system will exhibit under specific conditions.
IBM: The International Business Machines Corporation, an American multinational technology and consulting corporation headquartered in Armonk, New York. IBM manufactures and markets computer hardware and software, and offers infrastructure, hosting and consulting services in areas ranging from mainframe computers to nanotechnology.
IEEE: The Institute of Electrical and Electronics Engineers, a professional society that maintains a set of standards for managing and executing software and system engineering projects.
Gartner: Gartner, Inc., an American information technology research and advisory firm providing technology-related insight, headquartered in Stamford, Connecticut, United States.
Non-functional Requirement: A description of a property or characteristic that a software product must exhibit, or a constraint that it must respect, other than an observable system behaviour.
Paper Prototype: A non-executable mock-up of a software system's user interface using inexpensive, low-tech screen sketches.
Process: A sequence of activities performed for a given purpose.
Prototype: A partial, preliminary, or possible implementation of a program, used to explore and validate requirements and design approaches.
Quality Attribute: A kind of non-functional requirement that describes a quality or property of a system; examples include usability and portability. Quality attributes describe the extent to which a software product demonstrates desired characteristics, not what the product does.
Requirement: A statement of a customer need or objective, or of a condition or capability that a product must possess to satisfy such a need or objective.
Requirement Attribute: Descriptive information about a requirement that enriches its definition beyond the statement of intended functionality.
Requirement Allocation: The process of apportioning system requirements among various architectural subsystems and components.
Requirement Elicitation: The process of identifying software or system requirements from various sources through interviews, workshops, workflow and task analysis, document analysis, and other mechanisms.
Software Development Lifecycle: A sequence of activities by which a software product is designed, defined, built, and verified.
SAS: SAS Institute, an American developer of analytics software based in Cary, North Carolina. SAS develops and markets a suite of analytics software (also called SAS), which helps manage, access, analyse and report on data to aid decision-making.
Validation: The process of evaluating a work product to determine whether it satisfies customer requirements.
Verification: The process of evaluating a work product to determine whether it satisfies the specifications and conditions imposed on it at the beginning of the development phase during which it was created.

APPENDIX B. HADOOP COMPONENTS

Hadoop Distributed File System (HDFS)

HDFS is the file system component of Hadoop, designed for storing very large files with streaming data access patterns on clusters of commodity hardware [56]. HDFS stores file system metadata and application data separately. As in other distributed file systems, such as PVFS [60], Lustre and GFS [61], HDFS stores metadata on a dedicated server, called the NameNode. Application data are stored on other servers, called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols [62].

YARN (MapReduce 2.0)

MapReduce was created by Google mainly to process enormous volumes of unstructured data. MapReduce is a general execution engine that is ignorant of storage layouts and data schemas. The runtime system automatically parallelizes computations across a large cluster of machines, handles failures and manages disk and network efficiency. The user only needs to provide a map function and a reduce function. The map function is applied to all input rows of the dataset and produces an intermediate output that is later aggregated by the reduce function to produce the final result [63].

In 2010, a group at Yahoo! began to design the next generation of MapReduce. The result was YARN, short for Yet Another Resource Negotiator. YARN addresses the scalability shortcomings of classic MapReduce. YARN is more general than MapReduce; in fact, MapReduce is just one type of YARN application. The beauty of YARN's design is that different YARN applications can co-exist on the same cluster, so a MapReduce application can run at the same time as an MPI (Message Passing Interface) application. YARN performs the resource management function in Hadoop 2.0 and extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models. On the whole, it offers greater benefits for manageability and cluster utilization [56].

Other Hadoop Components

Core services:
- HDFS: Provides scalable and reliable storage of massive amounts of data (data blocks are distributed among clusters) for further processing. It suits applications with large, multi-structured data sets (e.g., web and social data, human-generated logs, and biometrics data) used for predictive analysis and pattern recognition. HDFS can serve batch data processing as well as real-time event data (sensors or fraud detection) even before it lands on HDFS.
- MapReduce: Framework for writing applications that process large amounts of structured and unstructured data in parallel by decomposing a massive job into smaller tasks, and a massive data set into smaller partitions, such that each task processes a different partition in parallel on commodity hardware, reliably and in a fault-tolerant manner.
- YARN: Framework for Hadoop data processing that supports MapReduce and other programming models. It handles resource management, security, etc., and allows multiple ways to interact with the data in HDFS (batch with MapReduce, streaming with Apache Storm, interactive SQL with Apache Hive and Apache Tez).
- Tez: Generalizes MapReduce to support near real-time processing. It can scale up requests and meet demands for fast response times, providing a suitable framework for near real-time processing systems.

Data services:
- Pig: Platform paired with MapReduce and HDFS for processing large Big Data. It performs data processing by compiling its Pig Latin scripts into sequences of MapReduce programs.
- Hive: Data warehouse that enables easy data summarization and ad-hoc queries. It also provides a mechanism for structuring semi-structured data (customer logs) and unstructured data (machine-generated and transaction data) and for querying them using a SQL-like language called HiveQL. Hive resides on top of MapReduce and next to Pig.
- HBase: A distributed, scalable Big Data store with random, real-time read/write access. For storing huge amounts of unstructured data, an RDBMS is inadequate: as data sets grow, scaling issues arise, since relational databases were not designed to be distributed. HBase, a column-based NoSQL (Not Only SQL) database that allows low-latency, quick lookups in Hadoop, provides non-relational data storage that supports data consistency, scalability and excellent performance.
- HCatalog: Provides a centralized way for data processing systems to understand the structure and location of the data stored within Hadoop. It offers a metadata management service that is abstract and language-independent, allowing a choice among different data processing tools. HCatalog acts as an adapter between Hadoop on one side and the query language frameworks on the other.
- Storm: Distributed real-time computation system for processing fast, large streams of data, adding reliable real-time processing capabilities to Hadoop. Storm's fault-tolerant, high-performance real-time computation facilitates reliably processing continuous feeds and unbounded streams of data, responding to real-time events as they happen.
- Mahout: Provides scalable machine learning algorithms supporting clustering, classification and batch-based collaborative filtering. This data-mining library offers algorithms for clustering unstructured data, collaborative filtering, regression testing, and statistical modeling to analyze for insights; it can pull in vast troves of data from exposed social media sites and draw far-reaching conclusions.
- Accumulo: High-performance data storage and retrieval system with cell-level access control. It works on top of Hadoop and ZooKeeper and supports high-performance storage and retrieval of data for predictive analysis.
- Flume: Efficiently aggregates and moves large amounts of log and stream data from many different sources into HDFS for storage.
- Sqoop: Open-source tool that extracts data from a relational database into Hadoop for further processing. It speeds and eases the movement of monitoring data generated by, for example, different smart-meter databases.

Operational services:
- ZooKeeper: Coordinates distributed processes, and stores and mediates updates to important configuration information. It handles configuration and synchronization services and the management layer of the Hadoop platform; it is necessary for Accumulo to work efficiently.
- Ambari: Manages and monitors Hadoop clusters.
- Oozie: Schedules Hadoop jobs, combining multiple jobs sequentially into one logical unit.
- Falcon: Data management framework for simplifying data lifecycle management and processing pipelines on Hadoop.
- Knox: Provides a single point of authentication and access for Hadoop services in a cluster, to simplify Hadoop security for users and operators.

Table B1: Hadoop Ecosystem Components [2][16]
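The map/reduce contract described above can be sketched in plain Python. This is an illustrative single-process simulation only, not the Hadoop API: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the in-memory "shuffle" stands in for Hadoop's distributed grouping of intermediate keys.

```python
from collections import defaultdict

# Hypothetical user-supplied functions: map emits intermediate
# (key, value) pairs, reduce aggregates all values sharing a key.
def map_fn(line):
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    return word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    # "Shuffle" phase: group intermediate values by key
    # (Hadoop does this across the cluster; here it is in-memory).
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

print(run_mapreduce(["big data", "big clusters"], map_fn, reduce_fn))
# {'big': 2, 'clusters': 1, 'data': 1}
```

Because the engine is ignorant of what the keys mean, swapping in different map and reduce functions repurposes the same runtime for other aggregations, which is exactly the generality the text attributes to MapReduce.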

APPENDIX C. DIMENSIONS OF THE W*H MODEL FOR SERVICES

The W*H model bridges the gap between the main service design initiatives and their abstraction-level interpretations. It introduces four dimensions that together specify a full description of IT services [33].

1. The Annotation dimension covers the Party and Activity dimensions.
   a. Party: describes the stakeholders requesting the service by identifying the supplier, consumer, and producer. These factors are declared by answering the "by whom", "to whom", and "whichever" questions.
   b. Activity: describes the actions processed during service execution by identifying the input and output of the service. It is declared by answering the "what in" and "what out" questions.
2. The Content dimension describes the supporting means through the application domain dimension. The application domain covers the area where the service will be used, the case, the problem to be solved by the service, the organizational unit that requests the service, the events that trigger the service for execution, and the IT. These factors are captured by answering questions such as "wherein", "where", "for what", "where-from", "whence", "what", and "how".
3. The Concept dimension describes the need for and purpose of the required service. It is declared by answering "why", "whereto", "for when", and "for which reason".
4. The Added Value dimension describes the surplus value a potential user is expected to gain by using the service, through defining its context characteristics. The context captures the provider (developer or supplier) context for a service, the user context, the environment that must exist for service utilization, and the coexistence context for a service within a set of services. These context dimensions are declared by answering "whereat", "whereabout", "whither", and "when".

Figure C1 combines the service dimensions into a general W*H framework [33].

Figure C1: The W*H Inquiry-Based Conceptual Model for Services
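As an illustration of how the four dimensions decompose into concrete questions, the W*H question catalogue above can be encoded as a simple lookup structure and checked for completeness. This is a hypothetical sketch, not part of the W*H model itself: the `WSTAR_H` dictionary and the `open_questions` helper are invented names for illustration.

```python
# The W*H question catalogue from the text, grouped by dimension
# (hypothetical encoding for illustration only).
WSTAR_H = {
    "annotation": {
        "party": ["by whom", "to whom", "whichever"],
        "activity": ["what in", "what out"],
    },
    "content": ["wherein", "where", "for what", "where-from", "whence", "what", "how"],
    "concept": ["why", "whereto", "for when", "for which reason"],
    "added_value": ["whereat", "whereabout", "whither", "when"],
}

def open_questions(answers):
    """Return the W*H questions not yet answered for a service description."""
    def walk(node):
        # Recurse through nested dimension groups down to question lists.
        if isinstance(node, dict):
            for child in node.values():
                yield from walk(child)
        else:
            yield from node
    return [q for q in walk(WSTAR_H) if q not in answers]

# A partial service description: "why" and "what in" already answered.
print(open_questions({"why": "capture relevant posts", "what in": "keywords"})[:3])
```

A requirements acquisition tool could use such a checklist to drive its dialogue, prompting the stakeholder until every W*H question has an answer.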

APPENDIX D. ANALYTICAL TOOLS DATABASE

Tool (URL) | Software Category | Problem Domain | Data Source | Cost
SAS, Statistical Analysis System (sas.com/en_us/home.html) | Social Analytics Software | Marketing, Statistical Analysis | Any data source | Basic Windows package costs $8,700 for the first year (high)
Facebook Insights (facebook.com/insights/) | Social Analytics Software | Sales and Marketing | Facebook | Free (the page has to have more than 30 likes)
Google Analytics (google.com/analytics/) | Social Analytics Software | Marketing | Any digital data source | Free if the site generates 10 million or fewer hits per month; above that, a fee of $150,000 (USD) covers all the data, guarantees and service (high)
Social Mention (socialmention.com/) | SM Monitoring Software | Marketing | All SM sites | Free
Hootsuite (hootsuite.com/dashboard) | SM Management Software | Any domain | All SM sites | Free
YouTube Insights (youtube.com/analytics) | SM Analytics Software | Any domain | YouTube | Free
Meltwater (meltwater.com/) | Social Listening Software | Marketing, Sales, Public Relations | Any digital data source | Free trial for companies; different prices for different solutions: Online News and PR (Meltwater News), Complete (Meltwater News + Meltwater Buzz), and SM Marketing
Radian6 (radian6.com/) | Social Listening Software | Sales, Business | Twitter, Google Buzz, LinkedIn, Facebook | $100/month for a user account
Sysomos | Social Listening Software | Business | Twitter, Facebook | $500/month for 5 queries, 5 competitors, 10 tags, 5 users and 1 Facebook page
Google Alerts (google.com/alerts) | Social Listening Software | Limited business monitoring | Google | Free
Crimson Hexagon (crimsonhexagon.com/) | Social Listening Software | Business, Marketing, Research | Different SM sites, especially Twitter | Free demo for large companies
Lithium (lithium.com/products-solutions/social-marketing) | Social Management Software | Customer services, Marketing | Any SM site | Free trial for the basic package
Meltwater Buzz (meltwater.com/products/) | Social Listening Software | Marketing, Sales, Public Relations | Any digital data source | Free trial for companies; different prices for different solutions
Trackur (trackur.com/) | Social Listening Software | PR, Marketing | Any digital data source | Free 10-day demo; at least $97 a month
Visible Technologies (cision.com/us/) | Social Listening Software | PR, Marketing, Sales | Any digital data source | Free demo for registered companies; different pricing for different packages
ViralHeat (viralheat.com/) | Social Management Software | Marketing | Any digital data source | Free demo for registered companies; different pricing for different packages
Argyle Social | Social Conversation Software | Marketing | Any digital data source | Free demo for registered companies; different pricing for different packages
Sprinklr | Social Conversation Software | Customer service, Marketing, PR | Any SM site | Free demo for registered companies; different pricing for different packages
Spredfast | Social Conversation and Management Software | Marketing | Any SM site | Free demo for registered companies; different pricing for different packages
CoTweet (exacttarget.com/products/social-media-marketing) | Social Conversation and Management Software | Customer Service, Marketing | Twitter and Facebook | Standard edition is free and allows up to six Twitter accounts; the Enterprise version costs $1,500 a month, with a free demo
Awareness (awarenesshub.com/) | Social Conversation and Management Software | Marketing | Any SM site | Free demo for registered companies; different pricing for different packages
Attensity (attensity.com/what-we-do) | Social Conversation Software | Business | Any data source | Free demo for registered companies; different pricing for different packages
Plumlytics | Social Listening Software | Marketing | Any digital source | Starts at $69/month/user
Buddy Media (exacttarget.com/products/social-media-marketing/buddy-media) | Social Marketing Software | Marketing and Sales | Any digital source | Free demo for registered companies; different pricing for different packages
EngageSciences (engagesciences.com/) | Social Engagement Software | Marketing | Any digital source | Free demo for registered companies; different pricing for different packages
Agorapulse (agorapulse.com/) | Social Marketing Software | Marketing | Any digital source | Free 10-day trial; at least $29 a month
ShortStack (shortstack.com/) | Social Marketing Software | Marketing | Any SM source | Free campaign launching; free demo for registered companies; different pricing for different packages
Social Bakers (socialbakers.com/) | Social Analytics Software | Marketing, Research | Any digital source | Free demo for registered companies; different pricing for different packages
Appinions (appinions.com/) | Social Influencer Software | Marketing, Sales | Any digital source | Free demo for registered companies; different pricing for different packages
GroupHigh (grouphigh.com/) | Social Influencer Software | Marketing | Any digital source | Free trial; different pricing for different packages

Table D1: Analytical Tools Database
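A requirements acquisition tool could treat Table D1 as a queryable database, matching a stakeholder's problem domain and budget against the catalogued tools. The sketch below encodes a few rows as records and filters them; `TOOLS`, `shortlist`, and the boolean `free_tier` flag are hypothetical simplifications of the table's columns, not an implementation from this thesis.

```python
# A small illustrative subset of Table D1, encoded as records.
TOOLS = [
    {"name": "Facebook Insights", "category": "Social Analytics",
     "domain": {"Sales", "Marketing"}, "source": {"Facebook"}, "free_tier": True},
    {"name": "Hootsuite", "category": "SM Management",
     "domain": {"Any"}, "source": {"All SM sites"}, "free_tier": True},
    {"name": "Radian6", "category": "Social Listening",
     "domain": {"Sales", "Business"},
     "source": {"Twitter", "Facebook", "LinkedIn"}, "free_tier": False},
]

def shortlist(tools, domain, require_free=False):
    """Select candidate tools covering a problem domain, optionally free ones."""
    return [t["name"] for t in tools
            if (domain in t["domain"] or "Any" in t["domain"])
            and (t["free_tier"] or not require_free)]

print(shortlist(TOOLS, "Sales", require_free=True))
# ['Facebook Insights', 'Hootsuite']
```

Further filters (data source, software category, monthly budget) would follow the same pattern, narrowing the table to the tools that satisfy the captured requirements.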

APPENDIX E. RELATED PUBLISHED PAPERS

Papers published under this research:

[4] M. Alswilmi and A. Dahanayake (2015), "A Requirements Acquisition Tool Architecture for the Decision Back Approach for Social Media Big Data Capturing". 5th Advances in Software Engineering Conference, Prince Sultan University.
[5] M. Alswilmi, N. Alnajran and A. Dahanayake (2014), "Conceptual Framework for Big Data Analytics Solutions". Proceedings of the 24th International Conference on Information Modelling and Knowledge Bases, EJC.
[6] M. Alswilmi, N. Alnajran and A. Dahanayake (2013), "Conceptual Framework for Big Data Analytics Solutions". 2nd Advances in Software Engineering Conference, Prince Sultan University.


More information

Unlocking The Value of the Deep Web. Harvesting Big Data that Google Doesn t Reach

Unlocking The Value of the Deep Web. Harvesting Big Data that Google Doesn t Reach Unlocking The Value of the Deep Web Harvesting Big Data that Google Doesn t Reach Introduction Every day, untold millions search the web with Google, Bing and other search engines. The volumes truly are

More information

Are You Ready for Big Data?

Are You Ready for Big Data? Are You Ready for Big Data? Jim Gallo National Director, Business Analytics April 10, 2013 Agenda What is Big Data? How do you leverage Big Data in your company? How do you prepare for a Big Data initiative?

More information

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme Big Data Analytics Prof. Dr. Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie,

More information

BIG DATA CHALLENGES AND PERSPECTIVES

BIG DATA CHALLENGES AND PERSPECTIVES BIG DATA CHALLENGES AND PERSPECTIVES Meenakshi Sharma 1, Keshav Kishore 2 1 Student of Master of Technology, 2 Head of Department, Department of Computer Science and Engineering, A P Goyal Shimla University,

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A SURVEY ON BIG DATA ISSUES AMRINDER KAUR Assistant Professor, Department of Computer

More information

Hadoop for Enterprises:

Hadoop for Enterprises: Hadoop for Enterprises: Overcoming the Major Challenges Introduction to Big Data Big Data are information assets that are high volume, velocity, and variety. Big Data demands cost-effective, innovative

More information

Big Data Executive Survey

Big Data Executive Survey Big Data Executive Full Questionnaire Big Date Executive Full Questionnaire Appendix B Questionnaire Welcome The survey has been designed to provide a benchmark for enterprises seeking to understand the

More information

ANALYTICS STRATEGY: creating a roadmap for success

ANALYTICS STRATEGY: creating a roadmap for success ANALYTICS STRATEGY: creating a roadmap for success Companies in the capital and commodity markets are looking at analytics for opportunities to improve revenue and cost savings. Yet, many firms are struggling

More information

Keywords Big Data, NoSQL, Relational Databases, Decision Making using Big Data, Hadoop

Keywords Big Data, NoSQL, Relational Databases, Decision Making using Big Data, Hadoop Volume 4, Issue 1, January 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Transitioning

More information

Data Warehousing in the Age of Big Data

Data Warehousing in the Age of Big Data Data Warehousing in the Age of Big Data Krish Krishnan AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD * PARIS SAN DIEGO SAN FRANCISCO SINGAPORE SYDNEY TOKYO Morgan Kaufmann is an imprint of Elsevier

More information

Integrated Social and Enterprise Data = Enhanced Analytics

Integrated Social and Enterprise Data = Enhanced Analytics ORACLE WHITE PAPER, DECEMBER 2013 THE VALUE OF SOCIAL DATA Integrated Social and Enterprise Data = Enhanced Analytics #SocData CONTENTS Executive Summary 3 The Value of Enterprise-Specific Social Data

More information

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014 5 Keys to Unlocking the Big Data Analytics Puzzle Anurag Tandon Director, Product Marketing March 26, 2014 1 A Little About Us A global footprint. A proven innovator. A leader in enterprise analytics for

More information

Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges

Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges Prerita Gupta Research Scholar, DAV College, Chandigarh Dr. Harmunish Taneja Department of Computer Science and

More information

TRENDS IN THE DEVELOPMENT OF BUSINESS INTELLIGENCE SYSTEMS

TRENDS IN THE DEVELOPMENT OF BUSINESS INTELLIGENCE SYSTEMS 9 8 TRENDS IN THE DEVELOPMENT OF BUSINESS INTELLIGENCE SYSTEMS Assist. Prof. Latinka Todoranova Econ Lit C 810 Information technology is a highly dynamic field of research. As part of it, business intelligence

More information

We are Big Data A Sonian Whitepaper

We are Big Data A Sonian Whitepaper EXECUTIVE SUMMARY Big Data is not an uncommon term in the technology industry anymore. It s of big interest to many leading IT providers and archiving companies. But what is Big Data? While many have formed

More information

How To Make Data Streaming A Real Time Intelligence

How To Make Data Streaming A Real Time Intelligence REAL-TIME OPERATIONAL INTELLIGENCE Competitive advantage from unstructured, high-velocity log and machine Big Data 2 SQLstream: Our s-streaming products unlock the value of high-velocity unstructured log

More information

Exploiting Data at Rest and Data in Motion with a Big Data Platform

Exploiting Data at Rest and Data in Motion with a Big Data Platform Exploiting Data at Rest and Data in Motion with a Big Data Platform Sarah Brader, [email protected] What is Big Data? Where does it come from? 12+ TBs of tweet data every day 30 billion RFID tags

More information

AdTheorent s. The Intelligent Solution for Real-time Predictive Technology in Mobile Advertising. The Intelligent Impression TM

AdTheorent s. The Intelligent Solution for Real-time Predictive Technology in Mobile Advertising. The Intelligent Impression TM AdTheorent s Real-Time Learning Machine (RTLM) The Intelligent Solution for Real-time Predictive Technology in Mobile Advertising Worldwide mobile advertising revenue is forecast to reach $11.4 billion

More information

Big Data Analytics. Copyright 2011 EMC Corporation. All rights reserved.

Big Data Analytics. Copyright 2011 EMC Corporation. All rights reserved. Big Data Analytics 1 Priority Discussion Topics What are the most compelling business drivers behind big data analytics? Do you have or expect to have data scientists on your staff, and what will be their

More information

Big Data a threat or a chance?

Big Data a threat or a chance? Big Data a threat or a chance? Helwig Hauser University of Bergen, Dept. of Informatics Big Data What is Big Data? well, lots of data, right? we come back to this in a moment. certainly, a buzz-word but

More information

The Lab and The Factory

The Lab and The Factory The Lab and The Factory Architecting for Big Data Management April Reeve DAMA Wisconsin March 11 2014 1 A good speech should be like a woman's skirt: long enough to cover the subject and short enough to

More information

Apache Hadoop: The Big Data Refinery

Apache Hadoop: The Big Data Refinery Architecting the Future of Big Data Whitepaper Apache Hadoop: The Big Data Refinery Introduction Big data has become an extremely popular term, due to the well-documented explosion in the amount of data

More information

Big Data for Investment Research Management

Big Data for Investment Research Management IDT Partners www.idtpartners.com Big Data for Investment Research Management Discover how IDT Partners helps Financial Services, Market Research, and Investment Management firms turn big data into actionable

More information

TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP

TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP Pythian White Paper TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP ABSTRACT As companies increasingly rely on big data to steer decisions, they also find themselves looking for ways to simplify

More information

Chartis RiskTech Quadrant for Model Risk Management Systems 2014

Chartis RiskTech Quadrant for Model Risk Management Systems 2014 Chartis RiskTech Quadrant for Model Risk Management Systems 2014 The RiskTech Quadrant is copyrighted June 2014 by Chartis Research Ltd. and is reused with permission. No part of the RiskTech Quadrant

More information

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum Siva Ravada Senior Director of Development Oracle Spatial and MapViewer 2 Evolving Technology Platforms

More information

ANALYTICS BUILT FOR INTERNET OF THINGS

ANALYTICS BUILT FOR INTERNET OF THINGS ANALYTICS BUILT FOR INTERNET OF THINGS Big Data Reporting is Out, Actionable Insights are In In recent years, it has become clear that data in itself has little relevance, it is the analysis of it that

More information

BIG DATA FUNDAMENTALS

BIG DATA FUNDAMENTALS BIG DATA FUNDAMENTALS Timeframe Minimum of 30 hours Use the concepts of volume, velocity, variety, veracity and value to define big data Learning outcomes Critically evaluate the need for big data management

More information

The 3 questions to ask yourself about BIG DATA

The 3 questions to ask yourself about BIG DATA The 3 questions to ask yourself about BIG DATA Do you have a big data problem? Companies looking to tackle big data problems are embarking on a journey that is full of hype, buzz, confusion, and misinformation.

More information

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ End to End Solution to Accelerate Data Warehouse Optimization Franco Flore Alliance Sales Director - APJ Big Data Is Driving Key Business Initiatives Increase profitability, innovation, customer satisfaction,

More information

There s no way around it: learning about Big Data means

There s no way around it: learning about Big Data means In This Chapter Chapter 1 Introducing Big Data Beginning with Big Data Meeting MapReduce Saying hello to Hadoop Making connections between Big Data, MapReduce, and Hadoop There s no way around it: learning

More information

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,

More information

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

More information

How To Use Big Data To Help A Retailer

How To Use Big Data To Help A Retailer IBM Software Big Data Retail Capitalizing on the power of big data for retail Adopt new approaches to keep customers engaged, maintain a competitive edge and maximize profitability 2 Capitalizing on the

More information

How To Turn Big Data Into An Insight

How To Turn Big Data Into An Insight mwd a d v i s o r s Turning Big Data into Big Insights Helena Schwenk A special report prepared for Actuate May 2013 This report is the fourth in a series and focuses principally on explaining what s needed

More information

Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities

Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities Technology Insight Paper Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities By John Webster February 2015 Enabling you to make the best technology decisions Enabling

More information

Turning Big Data into a Big Opportunity

Turning Big Data into a Big Opportunity Customer-Centricity in a World of Data: Turning Big Data into a Big Opportunity Richard Maraschi Business Analytics Solutions Leader IBM Global Media & Entertainment Joe Wikert General Manager & Publisher

More information

Beyond listening Driving better decisions with business intelligence from social sources

Beyond listening Driving better decisions with business intelligence from social sources Beyond listening Driving better decisions with business intelligence from social sources From insight to action with IBM Social Media Analytics State of the Union Opinions prevail on the Internet Social

More information

Addressing government challenges with big data analytics

Addressing government challenges with big data analytics IBM Software White Paper Government Addressing government challenges with big data analytics 2 Addressing government challenges with big data analytics Contents 2 Introduction 4 How big data analytics

More information

TECHNIQUES FOR OPTIMIZING THE RELATIONSHIP BETWEEN DATA STORAGE SPACE AND DATA RETRIEVAL TIME FOR LARGE DATABASES

TECHNIQUES FOR OPTIMIZING THE RELATIONSHIP BETWEEN DATA STORAGE SPACE AND DATA RETRIEVAL TIME FOR LARGE DATABASES Techniques For Optimizing The Relationship Between Data Storage Space And Data Retrieval Time For Large Databases TECHNIQUES FOR OPTIMIZING THE RELATIONSHIP BETWEEN DATA STORAGE SPACE AND DATA RETRIEVAL

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

Using Tableau Software with Hortonworks Data Platform

Using Tableau Software with Hortonworks Data Platform Using Tableau Software with Hortonworks Data Platform September 2013 2013 Hortonworks Inc. http:// Modern businesses need to manage vast amounts of data, and in many cases they have accumulated this data

More information

ISSN:2321-1156 International Journal of Innovative Research in Technology & Science(IJIRTS)

ISSN:2321-1156 International Journal of Innovative Research in Technology & Science(IJIRTS) Nguyễn Thị Thúy Hoài, College of technology _ Danang University Abstract The threading development of IT has been bringing more challenges for administrators to collect, store and analyze massive amounts

More information

Apache Hadoop Patterns of Use

Apache Hadoop Patterns of Use Community Driven Apache Hadoop Apache Hadoop Patterns of Use April 2013 2013 Hortonworks Inc. http://www.hortonworks.com Big Data: Apache Hadoop Use Distilled There certainly is no shortage of hype when

More information

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata BIG DATA: FROM HYPE TO REALITY Leandro Ruiz Presales Partner for C&LA Teradata Evolution in The Use of Information Action s ACTIVATING MAKE it happen! Insights OPERATIONALIZING WHAT IS happening now? PREDICTING

More information

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BIG DATA TECHNOLOGY. Hadoop Ecosystem BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big

More information

Traditional BI vs. Business Data Lake A comparison

Traditional BI vs. Business Data Lake A comparison Traditional BI vs. Business Data Lake A comparison The need for new thinking around data storage and analysis Traditional Business Intelligence (BI) systems provide various levels and kinds of analyses

More information

A Strategic Approach to Unlock the Opportunities from Big Data

A Strategic Approach to Unlock the Opportunities from Big Data A Strategic Approach to Unlock the Opportunities from Big Data Yue Pan, Chief Scientist for Information Management and Healthcare IBM Research - China [contacts: [email protected] ] Big Data or Big Illusion?

More information

Understanding Your Customer Journey by Extending Adobe Analytics with Big Data

Understanding Your Customer Journey by Extending Adobe Analytics with Big Data SOLUTION BRIEF Understanding Your Customer Journey by Extending Adobe Analytics with Big Data Business Challenge Today s digital marketing teams are overwhelmed by the volume and variety of customer interaction

More information

QUICK FACTS. Implementing a Big Data Solution on Behalf of a Media House TEKSYSTEMS GLOBAL SERVICES CUSTOMER SUCCESS STORIES

QUICK FACTS. Implementing a Big Data Solution on Behalf of a Media House TEKSYSTEMS GLOBAL SERVICES CUSTOMER SUCCESS STORIES [ Communications, Services ] TEKSYSTEMS GLOBAL SERVICES CUSTOMER SUCCESS STORIES Client Profile (parent company) Industry: Media, broadcasting and entertainment Revenue: Approximately $28 billion Employees:

More information

Real-Time Data Access Using Restful Framework for Multi-Platform Data Warehouse Environment

Real-Time Data Access Using Restful Framework for Multi-Platform Data Warehouse Environment www.wipro.com Real-Time Data Access Using Restful Framework for Multi-Platform Data Warehouse Environment Pon Prabakaran Shanmugam, Principal Consultant, Wipro Analytics practice Table of Contents 03...Abstract

More information

Integrating SAP and non-sap data for comprehensive Business Intelligence

Integrating SAP and non-sap data for comprehensive Business Intelligence WHITE PAPER Integrating SAP and non-sap data for comprehensive Business Intelligence www.barc.de/en Business Application Research Center 2 Integrating SAP and non-sap data Authors Timm Grosser Senior Analyst

More information

Getting the most out of big data

Getting the most out of big data IBM Software White Paper Financial Services Getting the most out of big data How banks can gain fresh customer insight with new big data capabilities 2 Getting the most out of big data Banks thrive on

More information

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn Presented by :- Ishank Kumar Aakash Patel Vishnu Dev Yadav CONTENT Abstract Introduction Related work The Ecosystem Ingress

More information

Annex: Concept Note. Big Data for Policy, Development and Official Statistics New York, 22 February 2013

Annex: Concept Note. Big Data for Policy, Development and Official Statistics New York, 22 February 2013 Annex: Concept Note Friday Seminar on Emerging Issues Big Data for Policy, Development and Official Statistics New York, 22 February 2013 How is Big Data different from just very large databases? 1 Traditionally,

More information

VIEWPOINT. High Performance Analytics. Industry Context and Trends

VIEWPOINT. High Performance Analytics. Industry Context and Trends VIEWPOINT High Performance Analytics Industry Context and Trends In the digital age of social media and connected devices, enterprises have a plethora of data that they can mine, to discover hidden correlations

More information

Big Data Discovery: Five Easy Steps to Value

Big Data Discovery: Five Easy Steps to Value Big Data Discovery: Five Easy Steps to Value Big data could really be called big frustration. For all the hoopla about big data being poised to reshape industries from healthcare to retail to financial

More information

BIG DATA: BIG CHALLENGE FOR SOFTWARE TESTERS

BIG DATA: BIG CHALLENGE FOR SOFTWARE TESTERS BIG DATA: BIG CHALLENGE FOR SOFTWARE TESTERS Megha Joshi Assistant Professor, ASM s Institute of Computer Studies, Pune, India Abstract: Industry is struggling to handle voluminous, complex, unstructured

More information

Beyond the Single View with IBM InfoSphere

Beyond the Single View with IBM InfoSphere Ian Bowring MDM & Information Integration Sales Leader, NE Europe Beyond the Single View with IBM InfoSphere We are at a pivotal point with our information intensive projects 10-40% of each initiative

More information

Big Data. Fast Forward. Putting data to productive use

Big Data. Fast Forward. Putting data to productive use Big Data Putting data to productive use Fast Forward What is big data, and why should you care? Get familiar with big data terminology, technologies, and techniques. Getting started with big data to realize

More information

Teradata s Big Data Technology Strategy & Roadmap

Teradata s Big Data Technology Strategy & Roadmap Teradata s Big Data Technology Strategy & Roadmap Artur Borycki, Director International Solutions Marketing 18 March 2014 Agenda > Introduction and level-set > Enabling the Logical Data Warehouse > Any

More information

White Paper: Big Data and the hype around IoT

White Paper: Big Data and the hype around IoT 1 White Paper: Big Data and the hype around IoT Author: Alton Harewood 21 Aug 2014 (first published on LinkedIn) If I knew today what I will know tomorrow, how would my life change? For some time the idea

More information

How To Listen To Social Media

How To Listen To Social Media WHITE PAPER Turning Insight Into Action The Journey to Social Media Intelligence Turning Insight Into Action The Journey to Social Media Intelligence From Data to Decisions Social media generates an enormous

More information

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014 Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014 Defining Big Not Just Massive Data Big data refers to data sets whose size is beyond the ability of typical database software tools

More information

The Rise of Industrial Big Data

The Rise of Industrial Big Data GE Intelligent Platforms The Rise of Industrial Big Data Leveraging large time-series data sets to drive innovation, competitiveness and growth capitalizing on the big data opportunity The Rise of Industrial

More information

Master big data to optimize the oil and gas lifecycle

Master big data to optimize the oil and gas lifecycle Viewpoint paper Master big data to optimize the oil and gas lifecycle Information management and analytics (IM&A) helps move decisions from reactive to predictive Table of contents 4 Getting a handle on

More information

IBM System x reference architecture solutions for big data

IBM System x reference architecture solutions for big data IBM System x reference architecture solutions for big data Easy-to-implement hardware, software and services for analyzing data at rest and data in motion Highlights Accelerates time-to-value with scalable,

More information

Detecting Anomalous Behavior with the Business Data Lake. Reference Architecture and Enterprise Approaches.

Detecting Anomalous Behavior with the Business Data Lake. Reference Architecture and Enterprise Approaches. Detecting Anomalous Behavior with the Business Data Lake Reference Architecture and Enterprise Approaches. 2 Detecting Anomalous Behavior with the Business Data Lake Pivotal the way we see it Reference

More information

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014 Big Data Analytics An Introduction Oliver Fuchsberger University of Paderborn 2014 Table of Contents I. Introduction & Motivation What is Big Data Analytics? Why is it so important? II. Techniques & Solutions

More information

Beyond Watson: The Business Implications of Big Data

Beyond Watson: The Business Implications of Big Data Beyond Watson: The Business Implications of Big Data Shankar Venkataraman IBM Program Director, STSM, Big Data August 10, 2011 The World is Changing and Becoming More INSTRUMENTED INTERCONNECTED INTELLIGENT

More information

The 4 Pillars of Technosoft s Big Data Practice

The 4 Pillars of Technosoft s Big Data Practice beyond possible Big Use End-user applications Big Analytics Visualisation tools Big Analytical tools Big management systems The 4 Pillars of Technosoft s Big Practice Overview Businesses have long managed

More information

What do Big Data & HAVEn mean? Robert Lejnert HP Autonomy

What do Big Data & HAVEn mean? Robert Lejnert HP Autonomy What do Big Data & HAVEn mean? Robert Lejnert HP Autonomy Much higher Volumes. Processed with more Velocity. With much more Variety. Is Big Data so big? Big Data Smart Data Project HAVEn: Adaptive Intelligence

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION Exploration is a process of discovery. In the database exploration process, an analyst executes a sequence of transformations over a collection of data structures to discover useful

More information

Big Data and Open Data

Big Data and Open Data Big Data and Open Data Bebo White SLAC National Accelerator Laboratory/ Stanford University!! [email protected] dekabytes hectobytes Big Data IS a buzzword! The Data Deluge From the beginning of

More information

MCCM: An Approach to Transform

MCCM: An Approach to Transform MCCM: An Approach to Transform the Hype of Big Data into a Real Solution for Getting Better Customer Insights and Experience Muhammad Salman Sami Khan, Chief Research Analyst, Global Marketing Team, ZTEsoft

More information