DATA ANALYSIS USING BUSINESS INTELLIGENCE TOOL

A Thesis Presented to the Faculty of San Diego State University
In Partial Fulfillment of the Requirements for the Degree
Master of Science in Computer Science

by
Vikas Kumar
Spring 2013
Copyright 2013 by Vikas Kumar. All Rights Reserved.
DEDICATION

To my father Mr. B.P. Singh, my mother Mrs. Meena Singh, and my family and friends, who have always given me endless support and love.
ABSTRACT OF THE THESIS

Data Analysis Using Business Intelligence Tool
by Vikas Kumar
Master of Science in Computer Science
San Diego State University, 2013

Information is growing at a very high rate. As information grows, so does the need for organizations to manage it and make it processable. As the volume of information grows, it becomes difficult to get the right information to the right place for the right people, yet doing so is essential for a successful enterprise. That is why business intelligence is a top technology priority today. Business Intelligence (BI) is the process of going from raw data to legible information. A BI solution helps transform raw data into actionable information that supports business decision making, which can help firms develop new opportunities. Identifying these opportunities and implementing an effective strategy can provide a competitive market advantage and long-term stability. In this report, we will see how to analyze and visualize vast amounts of data using a Business Intelligence tool to help us make better business decisions.
TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
ACKNOWLEDGEMENTS
CHAPTER
1 INTRODUCTION
  1.1 Why Business Intelligence
  1.2 Benefits of Using Business Intelligence
2 BUSINESS INTELLIGENCE CONCEPT
3 FUNCTIONS OF BUSINESS INTELLIGENCE
  3.1 Functions Which Can Be Performed Using Business Intelligence
  3.2 Popular KDP Models
    3.2.1 Nine-Step Model by Fayyad et al.
    3.2.2 CRISP-DM (Cross-Industry Standard Process for Data Mining) Model
    3.2.3 Six-Step Model by Cios et al.
4 BUSINESS INTELLIGENCE IMPLEMENTATION, CREATING DATA MART AND DASHBOARD DESIGN
5 CONCLUSION
REFERENCES
LIST OF FIGURES

Figure 2.1. Business intelligence components.
Figure 2.2. Dashboard showing sales in different continents.
Figure 3.1. Dashboard showing total sales by year through different sales channels.
Figure 3.2. Dashboard showing total sales in different countries in Asia Pacific.
Figure 3.3. Sequential structure of the KDP model.
Figure 3.4. The six-step KDP model.
Figure 4.1. Data source 1 (contains product ID, customer ID, customer name, sales channel, units sold, and date sold).
Figure 4.2. Data source 1.
Figure 4.3. Data source 2 (contains product ID, product name, standard cost, and standard price).
Figure 4.4. Data source 3.
Figure 4.5. The data mart created from the data sources.
Figure 4.6. The QlikView script (similar to SQL).
Figure 4.7. Dashboard showing total sales by year, by country, by continent, through different sales channels, and the total profit.
Figure 4.8. Dashboard showing total sales by year, by country, by continent, and the total profit through direct sales.
Figure 4.9. Dashboard showing total sales in different countries in Asia Pacific.
Figure 4.10. Dashboard showing total sales and total profit in different countries in Asia Pacific.
ACKNOWLEDGEMENTS

First of all, I would like to express my deepest thanks to Dr. Roman Swiniarski for giving me the opportunity to work on this project and for serving as the chairperson of my committee. Every suggestion he made encouraged me in the development of this project. Despite his busy work schedule, he continuously supported and motivated me throughout the project work. I am grateful to Dr. Joseph Lewis and Professor Mahasweta Sarkar for being on my committee and for their help and cooperation. Finally, I would like to convey my gratitude to all the people, especially my wife Mrs. Priyam Singh and my manager Mr. Julian Vanderpump, who have supported and encouraged me in completing this thesis project.
CHAPTER 1

INTRODUCTION

Information is growing at a very high rate. As information grows, so does the need for organizations to manage it and make it processable. As the volume of information grows, it becomes difficult to get the right information to the right place for the right people, yet doing so is essential for a successful enterprise. Business Intelligence is a technology to manage, export, and present raw data in a useful and meaningful way that can help any organization make correct business decisions. Business intelligence aims to support better business decision-making. BI relies on data collected from other systems, so the quality of that data is very important to BI.

1.1 WHY BUSINESS INTELLIGENCE

The goal of BI is to help decision-makers make more informed and better decisions to guide the business. BI allows organizations to get a more accurate and detailed picture of what is going on in terms of business and customers. It can do this in different ways: through an accurate view of costs, liabilities, risks, customer buying patterns, supplier cost-effectiveness, and so on.

1.2 BENEFITS OF USING BUSINESS INTELLIGENCE

Some of the benefits of having a Business Intelligence system include the ability to access data in a common format from multiple sources and a way to measure goals and analyze data. A well-executed and maintained Business Intelligence environment will:

1. Eliminate guesswork: Business intelligence can provide more accurate historical data and can give real-time updates. Management is able to see detailed, current data on all aspects of the business, such as financial data, production data, and customer data. Hence management can make fact-based decisions rather than guesses.

2. Help in identifying business opportunities: Business intelligence can help a company assess its own capabilities; compare its relative strengths and weaknesses against its competitors; identify trends and market conditions; and respond quickly to change in order to gain a competitive advantage and identify new business opportunities.
3. Help in understanding customer behavior: Identify, track, and monitor customer purchase habits to effectively segment the current and future customer base and gain insight into customer behavior.

4. Help in setting realistic goals: Business Intelligence enables organizations to accurately identify current performance levels through data analysis, allows setting realistic goals, and helps in comparing them against a benchmark.

5. Help in identifying cross-selling and up-selling opportunities: By analyzing customer behavior and patterns, sales representatives can up-sell and cross-sell products to the appropriate customers.

6. Improve efficiency: Using Business Intelligence, information can be centralized and viewed in a dashboard or report, saving enormous amounts of time and eliminating inefficiencies.
CHAPTER 2

BUSINESS INTELLIGENCE CONCEPT

Business Intelligence is the process of going from raw data to legible information. It is a tool to consolidate, analyze, and visualize vast amounts of data to help users make better business decisions. A BI solution helps transform raw data into actionable information that supports business decision making. Business intelligence systems are primarily focused on reporting, querying, and analysis of data residing in an enterprise data warehouse (EDW) and in both dependent and independent data marts. Business Intelligence comprises five main components: data sources, ETL (Extract, Transform, and Load), the data warehouse, data marts, and interactive dashboards and reports. Figure 2.1 shows the components of a BI system.

Figure 2.1. Business intelligence components (data sources feed ETL, which loads the data marts and data warehouse, which in turn feed reports and dashboards).

1. Data Source: Let us first define what data is. Data can be any facts and statistics collected together for reference or analysis. Hence any source from which data are obtained, such as people, documents, products, activities, events, and records, can be called a data source.

2. ETL (Extract, Transform, and Load): The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL. ETL stands for Extraction, Transformation, and Loading, three database functions that pull data from one or more data sources and place it into another database or data warehouse. ETL is used to migrate data from one database to another, to build data marts and data warehouses, and to convert databases from one format or type to another.
a. Extract: Extraction is the process of reading data from a data source or database.

b. Transform: Transformation is the process of converting the extracted data from its source form into the form it needs to be in so that it can be placed into the data warehouse or another database. The transform stage applies a series of rules or functions to the extracted data to derive the data for loading into the end target.

c. Load: Loading is the process of loading the data into the target database or data warehouse.

3. Data Warehouse: A data warehouse is a place to store data; it is a database that can be used for reporting, querying, and data analysis. It is a central repository of data created by integrating data from one or more data sources.

a. Characteristics of data warehousing: There are four characteristics of a data warehouse, as set out by William Inmon:

i. Subject oriented: Data warehouses are designed to help analyze data, so they should be constructed according to the analyses to be done on them. For example, to learn about an organization's sales data, we must build a data warehouse that concentrates on sales.

ii. Integrated: Data comes from many sources and can be in different formats, so we should convert it to a single format, which reduces inconsistency and aids further interpretation.

iii. Nonvolatile: Once the data is loaded into the data warehouse, there should not be any correction or alteration of it. It is better to verify the data before loading it, since modifying data in a warehouse takes a lot of time and effort.

iv. Time variant: In order to identify changes in business trends, large amounts of data have to be analyzed over time. Time variance means that data can change with time and that any data changed in the data warehouse can be tracked.

4. Data Marts: A data mart is a database that has the same characteristics as a data warehouse, but is usually smaller and is focused on the data for one division or one workgroup within an enterprise (e.g., sales data or finance data). We can also say that a data mart is a subset of the data warehouse, usually oriented to a specific business line or team. The primary use for a data mart is business intelligence (BI) applications, which can be used to gather, store, access, and analyze data. A data mart takes less time to implement and is less expensive than a data warehouse. There are two basic types of data marts: dependent and independent. The categorization is based primarily on the data source that feeds the data mart. Dependent data marts draw data from a central data warehouse that has already been created. Independent data marts, in contrast, are standalone systems built by drawing data directly from operational or external sources.
5. Reports and Dashboards: We can create many kinds of reports and dashboards using Business Intelligence, and they can be highly customizable, interactive, and user friendly. These dashboards can be used to analyze data and make decisions. For example, Figure 2.2 is an interactive dashboard showing sales in different continents.

Figure 2.2. Dashboard showing sales in different continents.
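The ETL flow described in this chapter can be sketched in a few lines of Python. This is a minimal, self-contained illustration rather than how a BI tool actually implements ETL; the field names (unit_sold, std_cost, std_price) and the list standing in for a warehouse table are assumptions made for the example.

```python
# Minimal ETL sketch: extract rows from a source, transform them
# (convert types, derive sales and profit), and load them into a
# target table (a plain list standing in for a warehouse table).

def extract(source_rows):
    """Read raw records from a data source (already parsed here)."""
    return list(source_rows)

def transform(rows):
    """Convert raw strings to typed values and derive new columns."""
    out = []
    for r in rows:
        units = int(r["unit_sold"])
        cost = float(r["std_cost"])
        price = float(r["std_price"])
        out.append({
            "product": r["product"],
            "units": units,
            "sales": units * price,
            "profit": units * (price - cost),
        })
    return out

def load(rows, warehouse):
    """Append the transformed rows into the target warehouse table."""
    warehouse.extend(rows)
    return warehouse

raw = [
    {"product": "A", "unit_sold": "10", "std_cost": "2.0", "std_price": "3.5"},
    {"product": "B", "unit_sold": "4", "std_cost": "5.0", "std_price": "8.0"},
]
warehouse = load(transform(extract(raw)), [])
print(warehouse[0]["profit"])  # 10 * (3.5 - 2.0) = 15.0
```

Once loaded in this uniform, typed form, the warehouse rows can be queried and aggregated by the reporting layer.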
CHAPTER 3

FUNCTIONS OF BUSINESS INTELLIGENCE

3.1 FUNCTIONS WHICH CAN BE PERFORMED USING BUSINESS INTELLIGENCE

1. Reporting: We can create many kinds of reports and dashboards using Business Intelligence, which can be highly customizable and interactive. For example, Figure 3.1 gives total sales by year through different sales channels (Retail, Online, and Direct Sales).

2. Analytics: Analytics is the discovery and communication of meaningful patterns in data [5]. Analytics is an umbrella term that encapsulates data collection, statistics, data mining, and decision making.

a. There are three types of data analysis:

i. Predictive (forecasting): Predictive data analysis is a technique that analyzes current and historical facts to make predictions about the future. In business, we look at patterns in historical and transactional data to identify risks and opportunities.

ii. Descriptive: Descriptive data analysis provides simple summaries about the data and about the observations that have been made, e.g., sales data that gives only the performance of a salesperson or a team.

iii. Prescriptive: Prescriptive analytics is the area of business analytics (BA) dedicated to finding the best course of action for a given situation, to optimize the accuracy of predictions and provide better decision options.

3. Business performance management: Business performance management (BPM) is a form of business intelligence used to monitor and manage a company's performance by collecting data from various sources, analyzing it, and using this knowledge to improve the company's performance. Key performance indicators (KPIs) are used for this purpose. These KPIs include sales, revenue, return on investment, overhead, and operational costs.

4. Benchmarking: Business Intelligence can be used for benchmarking, which is done by evaluating or comparing performance against a standard. Businesses use benchmarking to identify the areas in which they are either under-performing or exceeding expectations.
For example, I am using QlikView for benchmarking (Figure 3.2), comparing total sales in different countries in Asia Pacific against the average sales (710.00) for the year 2012. Here we see that only 16 countries have sales higher than the average (Nepal having the maximum sales amount). Hence, we can see how a Business Intelligence tool can be used for benchmarking.
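The benchmarking comparison above can be sketched in plain Python. The country figures below are invented stand-ins, not the thesis's sample data, so the average is computed from them rather than fixed at 710.00.

```python
# Benchmarking sketch: compute the average sales across countries and
# flag the countries that exceed it, as the QlikView dashboard does.
from statistics import mean

sales_by_country = {"Nepal": 1450.0, "India": 900.0, "Japan": 650.0, "Fiji": 300.0}
benchmark = mean(sales_by_country.values())
above = {c: s for c, s in sales_by_country.items() if s > benchmark}
print(benchmark)      # 825.0
print(sorted(above))  # ['India', 'Nepal']
```

In a BI tool the same comparison is expressed as a chart expression over the live data, so the benchmark updates automatically as selections change.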
Figure 3.1. Dashboard showing total sales by year through different sales channels.

Figure 3.2. Dashboard showing total sales in different countries in Asia Pacific.

5. Data Mining: Data mining (sometimes called data or knowledge discovery) is the process of finding hidden patterns and relationships in data and summarizing them into useful information. The users of data mining are often domain experts who not only own the data but also collect it themselves.
Analyzing data involves the recognition of significant patterns. Human analysts can see patterns in small data sets; specialized data mining tools are able to find patterns in large amounts of data. These tools are also able to analyze significant relationships that exist only when several dimensions are viewed at the same time. The aim of data mining is to make sense of large amounts of mostly unsupervised data in any domain (e.g., finance, marketing, sales). For any data to make sense, it should be understandable, valid, novel, and useful. Data mining is meant for analyzing large amounts of data (millions of records), because small data sets can easily be analyzed using many standard techniques, or even manually.

Data mining is also called knowledge discovery. The Knowledge Discovery Process (KDP) is defined as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Figure 3.3 shows the sequential structure of the KDP model: input data (databases, images, video, semistructured data, etc.) passes through a sequence of steps (one of which is DM), each attempting to complete a particular task, and produces knowledge (patterns, rules, clusters, classifications, associations, etc.).

Figure 3.3. Sequential structure of the KDP model.

The Knowledge Discovery Process covers how the data is stored and accessed, how to use efficient and scalable algorithms, how to interpret and visualize the results, and how to model and support interaction between human and machine. It concerns support for learning and analyzing the application domain. In a Knowledge Discovery Process model:

- The steps are executed in a sequence.
- The next step is initiated upon successful completion of the previous step: the results generated by the previous step are its input.

- It stretches from the task of understanding the project domain and data, through data preparation and analysis, to evaluation and application of the generated results.
- It is iterative, i.e., it includes feedback loops that are triggered by revisions.

3.2 POPULAR KDP MODELS

Although the models usually emphasize independence from specific applications and tools, they can be broadly divided into those that take into account industrial issues, those that concern academic issues, and those that address both:

1. Nine-step model by Fayyad et al. (academic research model) [1].

2. CRISP-DM (Cross-Industry Standard Process for Data Mining) model (industrial research model).

3. Six-step KDP model by Cios et al. (hybrid academic/industrial research model) [1].

3.2.1 Nine-Step Model by Fayyad et al.

1. Developing an Understanding of the Application Domain: It includes learning the relevant prior knowledge and the goals specified by the end user.

2. Creating a Target Data Set: It selects a subset of attributes and data points (examples) that will be used to perform discovery tasks. It includes querying the existing data to select the desired subset.

3. Data Cleaning and Preprocessing: It consists of removing outliers, dealing with noise and missing values, and accounting for time sequence information.

4. Data Reduction and Projection: It consists of finding useful attributes by applying dimension reduction and transformation methods, and finding invariant representations of the data.

5. Choosing the Data Mining Task: It matches the goals defined in step 1 with a particular DM method, such as classification, regression, clustering, etc.

6. Choosing the Data Mining Algorithm: It selects methods for searching for patterns in the data, and decides which models and parameters may be appropriate.

7. Data Mining: It generates patterns in a particular representational form, such as classification rules, decision trees, regression models, trends, etc.

8. Interpreting Mined Patterns: It usually involves visualization of the extracted patterns and models, and visualization of the data.

9. Consolidating Discovered Knowledge: It consists of incorporating the discovered knowledge into the performance system, and documenting and reporting it to the end user. It may include checking and resolving potential conflicts with previously believed knowledge.

In the nine-step model, the process is iterative. The model gives a detailed technical description with respect to data analysis, but lacks a description of the business aspects.
Major applications: a commercial knowledge discovery system called MineSet, which was used to facilitate projects in a number of domains including engineering, medicine, production, e-business, and software development.

3.2.2 CRISP-DM (Cross-Industry Standard Process for Data Mining) Model

The CRISP-DM model consists of six steps, which are summarized below:

1. Business Understanding: Focuses on understanding objectives and requirements from a business perspective, converts them into a DM problem definition, and designs a preliminary project plan to achieve the objectives. It is broken into several sub-steps: i. determination of business objectives, ii. assessment of the situation, iii. determination of DM goals, and iv. generation of a project plan.

2. Data Understanding: Starts with initial data collection and familiarization with the data. Includes identification of data quality problems, discovery of initial insights into the data, and detection of interesting data subsets. It is broken down into: i. collection of initial data, ii. description of data, iii. exploration of data, and iv. verification of data quality.

3. Data Preparation: Covers all activities needed to construct the final dataset, which constitutes the data to be fed into the DM tool(s) in the next step. It includes table, record, and attribute selection, data cleaning, construction of new attributes, and data transformation. This step is divided into the sub-steps: i. selection of data, ii. cleansing of data, iii. construction of data, iv. integration of data, and v. formatting of data.
4. Modeling: Selects and applies various modeling tools. It involves using several methods for the same DM problem and calibrating their parameters to optimal values. Since some methods require a specific format for input data, reiteration of the previous step is often necessary. This step is subdivided into: i. selection of modeling technique(s), ii. generation of test design, iii. creation of models, and iv. assessment of generated models.

5. Evaluation: After one or more high-quality models (from a data analysis perspective) have been built, they are evaluated from a business objective perspective, and a review of the steps executed to construct the models is performed. A key objective is to determine whether there are important business issues that have not been considered. At the end, a decision on the use of the data mining results is reached. The key sub-steps include: i. evaluation of the results, ii. process review, and iii. determination of the next step.

6. Deployment: Involves organization and presentation of the discovered knowledge in a user-friendly way. Depending on the requirements, this can be as simple as generating a report or as complex as implementing a repeatable KDP. This step is subdivided into the sub-steps: i. planning of the deployment, ii. planning of the monitoring and maintenance, iii. generation of a final report, and iv. review of the process.

The CRISP-DM model:

- uses easy-to-understand vocabulary and is well documented,

- acknowledges the iterative nature of the process with loops between the steps,

- is an extensively used model, mainly because of its grounding in industrial real-world experience,

- has major applications in medicine, engineering, marketing, and sales, and

- was turned into a commercial KD system called Clementine.
3.2.3 Six-Step Model by Cios et al.

It was inspired by the CRISP-DM model and adapted for academic research. Figure 3.4 [2] shows the six-step model by Cios et al. [1]. The main differences and extensions include:

- providing a more general, research-oriented description of the steps,

- having a Data Mining step instead of the Modeling step,

- introducing several new explicit feedback mechanisms (the CRISP-DM model has only three major feedbacks, while this model has detailed feedback mechanisms), and

- modification of the last step: the knowledge discovered for a particular domain may be applied in other domains.

A description of the six steps follows:

1. Understanding the Problem Domain: Involves working closely with domain experts to define the problem and determine the project goals, identifying key people, and learning about current solutions to the problem. It involves learning domain-specific terminology. A description of the problem and its restrictions is prepared. Project goals are translated into DM goals, and an initial selection of the DM tools to be used is performed.

2. Understanding of the Data: Includes collecting sample data and deciding which data, including its format and size, will be needed. Background knowledge is used to guide these efforts. Data is checked for completeness, redundancy, missing values, plausibility of attribute values, etc. This step includes verification of the usefulness of the data with respect to the DM goals.

3. Preparation of the Data: Concerns deciding which data will be used as input to the DM tools in the next step. Involves sampling, running correlation and significance tests, data cleaning, checking the completeness of data records, removing or correcting noise and missing values, etc. The cleaned data is further processed by feature selection and extraction algorithms (to reduce dimensionality), by derivation of new attributes (say, by discretization), and by summarization of data. The results are data that meet the specific input requirements of the DM tools.
4. Data Mining: Involves using various DM methods to derive new knowledge or information from the preprocessed data.

5. Evaluation of the Discovered Knowledge: Includes understanding the results, checking whether the discovered knowledge is novel and interesting, interpretation of the results by domain experts, and checking the possible impact of the discovered knowledge. Only the approved models are retained, and the entire process is revisited to identify which alternative actions could have been taken to improve the results. A list of errors made in the process is prepared.
Figure 3.4. The six-step KDP model (Understanding of the Problem Domain, Understanding of the Data, Preparation of the Data, Data Mining, Evaluation of the Discovered Knowledge, and Use of the Discovered Knowledge, with input data flowing in, knowledge flowing out, and the discovered knowledge extended to other domains). Source: N. R. Pal and L. Jain, eds. Advanced Techniques in Data Mining and Knowledge Discovery. Springer, London, UK, 2005.

6. Use of the Discovered Knowledge: Consists of planning where and how the discovered knowledge will be used. The application in the current domain may be extended to other domains. A plan to monitor the implementation of the discovered knowledge is created, and the entire project is documented. Finally, the discovered knowledge is deployed.
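The sequential structure shared by these KDP models can be sketched as a simple pipeline: each step consumes the state produced by the previous one and returns an updated state. The step names below follow the six-step model by Cios et al.; the step bodies are invented placeholders for illustration, not a real mining workflow.

```python
# Sketch of the sequential KDP structure: each step takes the output of
# the previous step as its input. The bodies are toy placeholders.

def understand_problem(state):
    state["goal"] = "find the largest value"   # goal set with domain experts
    return state

def understand_data(state):
    state["data"] = [3, 1, None, 4]            # collected sample data
    return state

def prepare_data(state):
    # clean the data: drop missing values before mining
    state["data"] = [x for x in state["data"] if x is not None]
    return state

def data_mining(state):
    state["pattern"] = max(state["data"])      # "discover" a pattern
    return state

def evaluate_knowledge(state):
    state["approved"] = state["pattern"] is not None
    return state

def use_knowledge(state):
    state["deployed"] = state["approved"]      # deploy approved knowledge
    return state

steps = [understand_problem, understand_data, prepare_data,
         data_mining, evaluate_knowledge, use_knowledge]

state = {}
for step in steps:   # each step starts after the previous one completes
    state = step(state)
print(state["pattern"], state["deployed"])  # 4 True
```

A real KDP implementation would add the feedback loops the models describe, e.g., returning from evaluation to data preparation when a model is rejected.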
CHAPTER 4

BUSINESS INTELLIGENCE IMPLEMENTATION, CREATING DATA MART AND DASHBOARD DESIGN

I have used the QlikView Business Intelligence (BI) software for doing data analysis and creating dashboards and reports. QlikView can be used for all types of reporting, forecasting, and general data analysis of virtually any type of information (e.g., GA, AR, AP, sales, inventory, estimates, trends, etc.). And because it allows end users to interact with the data, QlikView can also be used in any field where real-time data analysis is needed (e.g., science, engineering, academic research, art, etc.). I am using the Personal Edition of QlikView, which is free and can be downloaded from the QlikView website [3]. I have taken sample sales data from the TM1 Tutorials website [4] and am using it for my analysis. The data consists of:

Data Source 1: Figures 4.1 and 4.2 show data source 1. It contains Customer ID, Customer Name, Country, Product ID, Sales Channel, and Date Sold, and has almost 1000 records.

Data Source 2: Figure 4.3 shows data source 2. It contains product information: Product ID, Product Name, the standard cost of the product, and its standard price.

Data Source 3: Figure 4.4 shows data source 3. It contains each country and its continent.

Figure 4.5 is a screenshot showing the data mart and how the tables are linked. Figure 4.6 is a screenshot showing the QlikView script, which is similar to SQL statements. Figure 4.7 is a screenshot showing total sales by year, by country, by continent, through different sales channels, and the total profit. If we select the Direct sales channel (which turns green after selection), the dashboard shows total sales by year, by country, by continent, and the total profit through direct sales, as shown in Figure 4.8.
This dashboard also has drill-down functionality. Suppose we want to see total sales in different countries in Asia Pacific: we can select Asia, and the dashboard will look like Figure 4.9. The dashboard showing total sales and total profit country-wise in Asia Pacific through different sales channels is shown in Figure 4.10.

Figure 4.1. Data source 1 (contains product ID, customer ID, customer name, sales channel, units sold, and date sold).

Figure 4.2. Data source 1.
Figure 4.3. Data source 2 (contains product ID, product name, standard cost, and standard price).

Figure 4.4. Data source 3.
Figure 4.5. The data mart created from the data sources.

Figure 4.6. The QlikView script (similar to SQL).
Figure 4.7. Dashboard showing total sales by year, by country, by continent, through different sales channels, and the total profit.

Figure 4.8. Dashboard showing total sales by year, by country, by continent, and the total profit through direct sales.
Figure 4.9. Dashboard showing total sales in different countries in Asia Pacific.

Figure 4.10. Dashboard showing total sales and total profit in different countries in Asia Pacific.
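The data mart built in this chapter links the three sources on product ID and country and derives sales and profit. A Python sketch of that join is shown below; the rows and prices are invented stand-ins for the sample sales data, and in QlikView the equivalent association is performed by its load script rather than by hand-written loops.

```python
# Data mart sketch: join transactions (source 1) with product costs and
# prices (source 2) and country-to-continent mapping (source 3), then
# derive total sales and profit per transaction.

sales = [  # data source 1: one row per transaction
    {"customer": "C1", "country": "Nepal", "product_id": "P1", "units": 5},
    {"customer": "C2", "country": "India", "product_id": "P2", "units": 2},
]
products = {  # data source 2: product_id -> (std_cost, std_price)
    "P1": (2.0, 3.0),
    "P2": (4.0, 7.0),
}
continents = {"Nepal": "Asia Pacific", "India": "Asia Pacific"}  # source 3

mart = []
for row in sales:
    cost, price = products[row["product_id"]]
    mart.append({
        **row,
        "continent": continents[row["country"]],
        "total_sales": row["units"] * price,
        "profit": row["units"] * (price - cost),
    })

total_profit = sum(r["profit"] for r in mart)
print(total_profit)  # 5*(3.0-2.0) + 2*(7.0-4.0) = 11.0
```

The dashboards in Figures 4.7 through 4.10 are essentially interactive aggregations over a joined table of this shape, grouped by year, country, continent, or sales channel.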
CHAPTER 5

CONCLUSION

This report shows how we can transform raw data into meaningful and useful information using a Business Intelligence tool. Business intelligence can be a great tool for reporting, benchmarking, and data analysis in support of sound business decisions. Business intelligence is used by decision makers throughout the firm: at senior levels it is used for making strategy, and at lower managerial levels it helps individuals do their day-to-day jobs. Gartner predicts that worldwide business intelligence (BI) software revenue will reach $13.8 billion in 2013 and $17.1 billion by 2016. So we can see that analytics and business intelligence are a top technology priority for CIOs, and demand for such data mining tools will continue to rise in the future.
REFERENCES

[1] K. J. Cios, W. Pedrycz, R. W. Swiniarski, and L. A. Kurgan. Data Mining: A Knowledge Discovery Approach. Springer, New York, NY, 2007.

[2] N. R. Pal and L. Jain, eds. Advanced Techniques in Data Mining and Knowledge Discovery. Springer, London, UK, 2005.

[3] QlikView. Free Download, n.d. http://www.qlikview.com/us/explore/experience/free-download, accessed Mar. 2013.

[4] TM1 Tutorials. Sample Sales Data, n.d. http://blog.tm1tutorials.com/wp-content/uploads/sample-sales-data.xls, accessed Mar. 2013.

[5] Wikipedia. Analytics, 2013. http://en.wikipedia.org/wiki/Analytics, accessed Mar. 2013.