IT IN FINANCE. 3. Data warehouses, OLAP and Big Data technologies. Jerzy Korczak.



IT IN FINANCE
3. Data warehouses, OLAP and Big Data technologies
Jerzy Korczak
email: jerzy.korczak@ue.wroc.pl
http://kti.ue.wroc.pl

IT in Finance: course outline
1. Overview of Information Systems
2. Data preprocessing techniques
3. Data warehouses, OLAP and Big Data technology
4. DSS, knowledge discovery methods
5. Web-based and mobile technology

What is a Database System?
A database is a very large, integrated collection of data that models a real-world enterprise:
- entities (e.g., teams, games)
- relationships (e.g., The Forty-Niners are playing in The Superbowl)
- more recently, also active components, often called business logic.
A Database Management System (DBMS) is a software system designed to store, manage, and facilitate access to databases.

Database Systems and DBMS
- A database is a collection of interrelated data; a DBMS is the set of programs used to access those data.
- The database contains information about a particular enterprise.
- The DBMS provides an environment that is both convenient and efficient to use.
Database applications:
- Banking: all transactions
- Airlines: reservations, schedules
- Universities: registration, grades
- Sales: customers, products, purchases
- Manufacturing: production, inventory, orders, supply chain
- Human resources: employee records, salaries, tax deductions
Databases touch all aspects of our lives.

Levels of Abstraction
- Views describe how users see the data.
- The conceptual schema defines the logical structure.
- The physical schema describes the files and indexes used.
(Diagram: users work through View 1, View 2, View 3, which sit on the conceptual schema, which in turn sits on the physical schema and the DB.)

Example: University Database
Conceptual schema:
- Students(sid: string, name: string, login: string, age: integer, gpa: real)
- Courses(cid: string, cname: string, credits: integer)
- Enrolled(sid: string, cid: string, grade: string)
External schema (view):
- Course_info(cid: string, enrollment: integer)
Physical schema:
- Relations stored as unordered files.
- Index on the first column of Students.
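The university schema above can be made concrete. The sketch below uses Python with SQLite purely as an illustrative engine (the tool choice is mine, not the slides'); the table, column, and view names follow the slide's notation, and the sample rows are invented for the example.

```python
import sqlite3

# Base tables form the conceptual schema; a view plays the role of the
# external schema Course_info(cid, enrollment).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE Students(sid TEXT PRIMARY KEY, name TEXT, login TEXT,
                          age INTEGER, gpa REAL);
    CREATE TABLE Courses(cid TEXT PRIMARY KEY, cname TEXT, credits INTEGER);
    CREATE TABLE Enrolled(sid TEXT, cid TEXT, grade TEXT);

    -- External schema: enrollment counts derived from the base tables
    CREATE VIEW Course_info AS
        SELECT cid, COUNT(*) AS enrollment FROM Enrolled GROUP BY cid;

    -- Sample data (hypothetical)
    INSERT INTO Students VALUES ('53666', 'Jones', 'jones@cs', 18, 3.4);
    INSERT INTO Students VALUES ('53688', 'Smith', 'smith@ee', 18, 3.2);
    INSERT INTO Courses  VALUES ('CS101', 'Databases', 6);
    INSERT INTO Enrolled VALUES ('53666', 'CS101', 'A');
    INSERT INTO Enrolled VALUES ('53688', 'CS101', 'B');
""")
enrollment = cur.execute(
    "SELECT enrollment FROM Course_info WHERE cid = 'CS101'").fetchone()[0]
```

Users query the view without knowing the physical schema underneath, which is exactly the data independence the abstraction levels provide.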

Relational model: example of tabular data

customer-id   customer-name  customer-street  customer-city  account-number
192-83-7465   Johnson        Alma             Palo Alto      A-101
019-28-3746   Smith          North            Rye            A-215
192-83-7465   Johnson        Alma             Palo Alto      A-201
321-12-3123   Jones          Main             Harrison       A-217
019-28-3746   Smith          North            Rye            A-201

Columns are called attributes; rows are tuples.

Tables Explained
- A tuple is a record. Restriction: all attributes are of atomic type.
- A table is a set of tuples, like a list, but unordered: no first(), no next(), no last().
- No nested tables; only flat tables are allowed.
- The schema of a table is the table name and its attributes: Product(PName, Price, Category, Manufacturer).
- A key is an attribute whose values are unique; we underline a key: Product(PName, Price, Category, Manufacturer).

Structured Query Language (SQL)
SQL is a widely used non-procedural language.
Example: find the name of the customer with customer-id 192-83-7465:
  select customer.customer-name
  from customer
  where customer.customer-id = '192-83-7465'
Example: find the balances of all accounts held by the customer with customer-id 192-83-7465:
  select account.balance
  from depositor, account
  where depositor.customer-id = '192-83-7465'
    and depositor.account-number = account.account-number
Application programs generally access databases through one of:
- language extensions that allow embedded SQL
- an application program interface (e.g. ODBC/JDBC) that allows SQL queries to be sent to a database.

Tables in SQL
Table name: Product; attribute names: PName, Price, Category, Manufacturer.

PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi

SQL Query: basic form
  SELECT attributes
  FROM   relations (possibly multiple, joined)
  WHERE  conditions (selections)
What goes in the WHERE clause: x = y, x < y, x <= y, etc.
- for numbers, these have the usual meanings
- for CHAR and VARCHAR: lexicographic ordering
- for dates and times, what you expect.
Reference: P. Greenspun, SQL for Web Nerds, Ch. 3 "Simple Queries", Ch. 4 "More Complex Queries", http://philip.greenspun.com/sql/
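The basic SELECT-FROM-WHERE form can be tried directly. This is a minimal sketch assuming Python with SQLite (the engine choice is mine); the Product rows are copied from the slide.

```python
import sqlite3

# Product table from the slides, loaded into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE Product(PName TEXT, Price REAL, Category TEXT,
                         Manufacturer TEXT);
    INSERT INTO Product VALUES ('Gizmo',        19.99, 'Gadgets',     'GizmoWorks');
    INSERT INTO Product VALUES ('Powergizmo',   29.99, 'Gadgets',     'GizmoWorks');
    INSERT INTO Product VALUES ('SingleTouch', 149.99, 'Photography', 'Canon');
    INSERT INTO Product VALUES ('MultiTouch',  203.99, 'Household',   'Hitachi');
""")
# Selection: all attributes of products in the Gadgets category
gadgets = cur.execute(
    "SELECT * FROM Product WHERE Category = 'Gadgets'").fetchall()
# Selection and projection: name, price, manufacturer of products over $100
expensive = cur.execute(
    "SELECT PName, Price, Manufacturer FROM Product WHERE Price > 100"
).fetchall()
```

The first query keeps whole rows (selection only); the second also drops attributes (projection), which is the distinction the next slides illustrate.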

Simple SQL Query
  SELECT *
  FROM   Product
  WHERE  Category = 'Gadgets'
Result (selection):
  PName       Price   Category  Manufacturer
  Gizmo       $19.99  Gadgets   GizmoWorks
  Powergizmo  $29.99  Gadgets   GizmoWorks

Simple SQL Query (2)
  SELECT PName, Price, Manufacturer
  FROM   Product
  WHERE  Price > 100
Result (selection and projection):
  PName        Price    Manufacturer
  SingleTouch  $149.99  Canon
  MultiTouch   $203.99  Hitachi

Ordering the Results
  SELECT pname, price, manufacturer
  FROM   Product
  WHERE  category = 'gizmo' AND price > 50
  ORDER BY price, pname
Ordering is ascending, unless you specify the DESC keyword.

Joins in SQL
A join connects two or more tables. Given the schemas:
  Product(pname, price, category, manufacturer)
  Company(cname, stockprice, country)

Product:
  PName        Price    Category     Manufacturer
  Gizmo        $19.99   Gadgets      GizmoWorks
  Powergizmo   $29.99   Gadgets      GizmoWorks
  SingleTouch  $149.99  Photography  Canon
  MultiTouch   $203.99  Household    Hitachi
Company:
  CName       StockPrice  Country
  GizmoWorks  25          USA
  Canon       65          Japan
  Hitachi     15          Japan

What is the connection between them? Manufacturer in Product refers to CName in Company.

Find all products under $200 manufactured in Japan; return their names and prices:
  SELECT PName, Price
  FROM   Product, Company
  WHERE  Manufacturer = CName AND Country = 'Japan' AND Price <= 200
Result:
  PName        Price
  SingleTouch  $149.99
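The Japan join above runs end to end as written. A minimal sketch, assuming Python with SQLite as the engine (my choice, not the slides'); both tables are copied from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE Product(PName TEXT, Price REAL, Category TEXT,
                         Manufacturer TEXT);
    CREATE TABLE Company(CName TEXT, StockPrice REAL, Country TEXT);
    INSERT INTO Product VALUES ('Gizmo',        19.99, 'Gadgets',     'GizmoWorks');
    INSERT INTO Product VALUES ('Powergizmo',   29.99, 'Gadgets',     'GizmoWorks');
    INSERT INTO Product VALUES ('SingleTouch', 149.99, 'Photography', 'Canon');
    INSERT INTO Product VALUES ('MultiTouch',  203.99, 'Household',   'Hitachi');
    INSERT INTO Company VALUES ('GizmoWorks', 25, 'USA');
    INSERT INTO Company VALUES ('Canon',      65, 'Japan');
    INSERT INTO Company VALUES ('Hitachi',    15, 'Japan');
""")
# Join Product to Company on Manufacturer = CName, then filter and project.
rows = cur.execute("""
    SELECT PName, Price
    FROM   Product, Company
    WHERE  Manufacturer = CName AND Country = 'Japan' AND Price <= 200
""").fetchall()
```

Only SingleTouch satisfies both conditions: MultiTouch is Japanese but over $200, and the GizmoWorks products are American.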

Joins in SQL (continued)
Given the schemas:
  Product(pname, price, category, manufacturer)
  Purchase(buyer, seller, store, product)
  Person(persname, phonenumber, city)
Find the names of people living in Wroclaw who bought some product in the Gadgets category, and the names of the stores they bought such products from:
  SELECT DISTINCT persname, store
  FROM   Person, Purchase, Product
  WHERE  persname = buyer AND product = pname
    AND  city = 'Wroclaw' AND category = 'Gadgets'
Compare a query such as:
  SELECT Country
  FROM   Product, Company
  WHERE  Manufacturer = CName AND Category = 'Gadgets'
What is the problem? The result contains duplicate countries. What is the solution? SELECT DISTINCT.

Queries, Query Plans, and Operators
(The slide showed example queries over Emp, Proj, and Asgn tables, combining joins, GROUP BY, HAVING, and COUNT(DISTINCT ...), together with the corresponding operator trees over Employees, Projects, and Assignments.) The system handles query plan generation and optimization, and ensures correct execution.

Transaction Management
- A transaction is a collection of operations that performs a single logical function in a database application.
- The transaction-management component ensures that the database remains in a consistent (correct) state despite system failures (e.g., power failures and operating system crashes) and transaction failures.
- The concurrency-control manager controls the interaction among concurrent transactions, to ensure the consistency of the database.

Transactions: ACID Properties
The key concept is a transaction: a sequence of database actions (reads/writes).
- Atomicity: the DBMS ensures the all-or-nothing property, even if the system crashes in the middle of a transaction.
- Consistency: each transaction, executed completely, must take the DB between consistent states, or must not run at all.
- Isolation: the DBMS ensures that concurrent transactions appear to run in isolation.
- Durability: the DBMS ensures durability of committed transactions, even if the system crashes.
Idea: keep a log (history) of all actions carried out by the DBMS while executing a set of transactions:
- Before a change is made to the database, the corresponding log entry is forced to a safe location.
- After a crash, the effects of partially executed transactions are undone using the log, and the effects of committed transactions are redone using the log.

Database Users
Users are differentiated by the way they expect to interact with the system:
- Application programmers interact with the system through DML calls.
- Sophisticated users form requests in a database query language.
- Specialized users write specialized database applications that do not fit into the traditional data-processing framework.
- Naïve users invoke one of the permanent application programs written previously, e.g. people accessing the database over the web, bank tellers, clerical staff.
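Atomicity and the rollback idea can be demonstrated in a few lines. This is a sketch assuming Python with SQLite (my choice of engine); the account numbers echo the earlier banking examples, and the "crash" is simulated with an exception.

```python
import sqlite3

def transfer(conn, src, dst, amount, fail_midway=False):
    """Debit src and credit dst as one transaction; roll back on failure."""
    try:
        conn.execute("UPDATE account SET balance = balance - ? WHERE no = ?",
                     (amount, src))
        if fail_midway:
            raise RuntimeError("simulated crash between debit and credit")
        conn.execute("UPDATE account SET balance = balance + ? WHERE no = ?",
                     (amount, dst))
        conn.commit()      # both updates become durable together
    except RuntimeError:
        conn.rollback()    # atomicity: the partial debit is undone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account(no TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("A-101", 500.0), ("A-215", 700.0)])
conn.commit()

transfer(conn, "A-101", "A-215", 100, fail_midway=True)   # rolled back
after_crash = dict(conn.execute("SELECT no, balance FROM account"))
transfer(conn, "A-101", "A-215", 100)                     # commits
after_ok = dict(conn.execute("SELECT no, balance FROM account"))
```

After the simulated crash both balances are unchanged (all-or-nothing); only the successful run moves the money. In a real DBMS the same guarantee is provided by the write-ahead log described above.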

Internet DB architecture
(Diagram: a WEB client exchanges HTML and XML documents with a WEB server over HTTP (TCP/IP) across the Internet; the WEB server talks to the DB.)

Advantages of database systems
- Data independence
- Efficient data access
- Data integrity & security
- Data administration
- Concurrent access, crash recovery
- Reduced application development time
So why not use them always?
- They are expensive and complicated to set up and maintain; this cost and complexity must be offset by need.
- They are general-purpose, and not suited for special-purpose tasks (e.g. text search!).

What is a Data Warehouse?
A data warehouse refers to a database that provides on-line analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining.

Characteristics of a Data Warehouse [Inmon]
W. H. Inmon: "a data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process".
- Subject-oriented: data are organized by subject instead of application; e.g. an insurance company would organize its data by customer, premium, and claim, instead of by product (auto, life, etc.); the warehouse contains only the information necessary for decision-support processing.
- Integrated: encoding of data is often inconsistent, e.g. gender might be coded as "m" and "f" or as 0 and 1, but when data are moved from the operational environment into the data warehouse they assume a consistent coding convention.
- Time-variant: the data warehouse contains a place for storing data that are five to ten years old, or older; these data are used for comparisons, trends, and forecasting, and are not updated.
- Non-volatile: data are not updated or changed in any way once they enter the data warehouse.

Differences between Operational Database Systems and Data Warehouses
(Compared along: users and system orientation, data contents, database design, view, access patterns.)

A Star Is Born: Global Computing Company, Sales and Marketing
(Star schema: a SALES fact table surrounded by the PRODUCTS, CHANNELS, CUSTOMERS, and DAYS dimensions.)

Data modeling: a multidimensional model
Data warehouses and OLAP tools are based on a multidimensional data model. A data cube allows data to be modeled and viewed in multiple dimensions; it is defined by dimensions and facts.
Aggregation hierarchy for the location dimension: Country > Region > City; sales reports can be produced at each level.

Data cube: example
Sales data (shown on the slides both as a table and as a 3D cube), with dimensions time (quarter), item (home entertainment, computer, phone, security), and location:

        Chicago                 New York                Toronto                 Vancouver
        ent  comp  ph   sec     ent  comp  ph   sec     ent  comp  ph   sec     ent  comp  ph   sec
  Q1    854   882  89   623    1087   968  38   872     818   746  43   591     605   825  14   400
  Q2    943   890  64   698    1130  1024  41   925     894   769  52   682     680   952  31   512
  Q3   1032   924  59   789    1034  1048  45  1002     940   795  58   728     812  1023  30   501
  Q4   1129   992  63   870    1142  1091  54   984     978   864  59   784     927  1038  38   580

(Item abbreviations: ent = home entertainment, comp = computer, ph = phone, sec = security.)

A 3D data cube can also be used to model student performance over three dimensions: age group, gender, and department/course. The dimensional axes hold the metric to be analyzed; in this case it is student performance, represented as the number of classes completed with a satisfactory grade as a fraction of the number attempted. The views of the metric are called dimensions. An individual student performance metric is given by the intersection of the three axes, and is referred to as a fact. In this particular representation, the facts are for a given year, aggregated over all students in that age group, gender, and course.

New terms
- Merge (data): combine two or more data sets, values, or structures.
- Data replication: extract data from several platforms, perform some filtering and transformation, and distribute and load to another database or databases. Usually the term replication implies limited or no transformation, and moves within a homogeneous environment (see Pump).
- Pump: a data pump extracts data from several mainframe and client/server platforms, performs some filtering and transformation, and distributes and loads to other database(s). Usually the term pump is used rather than "replicator" to connote its applicability in a cross-platform environment.
- Scrub (data): see Clean.
- Transform (data): see Abstract.

Drill-down and Roll-up
Drill-down is the repetitive selection and analysis of summarized facts, with each repetition of data selection occurring at a lower level of summarization. An example of drill-down is a multiple-step process where sales revenue is first analyzed by year, then by quarter, and finally by month; each iteration returns sales revenue at a lower level of aggregation along the period dimension. Roll-up is the opposite of drill-down: the repetitive selection and analysis of summarized facts, with each repetition of data selection occurring at a higher level of summarization.
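Drill-down along the period dimension is just a sequence of GROUP BY queries at finer levels. A minimal sketch, assuming Python with SQLite and an invented sales fact table (the figures are illustrative, not from the slides):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(year INTEGER, quarter TEXT, month TEXT, revenue REAL);
    INSERT INTO sales VALUES (2001, 'Q1', 'Jan', 100), (2001, 'Q1', 'Feb', 120),
                             (2001, 'Q1', 'Mar', 130), (2001, 'Q2', 'Apr', 150);
""")
# Roll-up: total revenue per year (highest level of summarization)
by_year = conn.execute(
    "SELECT year, SUM(revenue) FROM sales GROUP BY year").fetchall()
# Drill-down: the same facts, one level lower, per quarter
by_quarter = conn.execute(
    "SELECT year, quarter, SUM(revenue) FROM sales "
    "GROUP BY year, quarter ORDER BY quarter").fetchall()
```

Each drill-down step adds one more column to the GROUP BY list (year, then quarter, then month); roll-up removes one.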

Process of Extraction/Transformation/Loading (ETL)
- Extraction of source data
- Transformation/cleaning
- Indexing and aggregation
- Data loading into the DW
- Detection of changes
- Data refreshment
(Diagram: OLTP programs feed the data warehouse through gateways and the ETL process; access tools sit on top of the warehouse.)

M-OLAP: a multidimensional database (hypercube), a MOLAP server, and an OLAP client.
R-OLAP: a relational database (star or snowflake schema), a ROLAP server providing a multidimensional view, and an OLAP client.

Mapping
- Definition of the attributes used
- Definition of the required transformations
- Metadata
Example: File A (F1 = 123, F2 = Bloggs, F3 = 10/12/56) is mapped to Staging File One (Number = USA123, Name = Mr. Bloggs, DOB = 10-Dec-56).

Materialized View: query rewrite dramatically improves query performance
- Queries are rewritten automatically.
- Materialized views hold data that are prejoined and/or presummarized and automatically maintained by the database, e.g. sums by city and sums by day over Region/Date/City/Sales.

Materialized view: example
Users at Acme Bank were complaining that typical, repeated queries were taking too long to execute. Because these queries are executed over and over by many users, any improvement in response time is bound to help. The DBAs at Acme Bank have tuned the most commonly used queries, all the needed indexes are present, and no further SQL tuning is going to make any difference to the performance of these queries. The Acme Bank DBAs' solution: use materialized views (MVs). This example discusses how to plan for MVs, how to set up and confirm different MV capabilities, how to automatically generate the scripts to create MVs, how to make query rewrite (QR) available, and how to make sure that QR gets used.

Now suppose a user wants the total of all account balances for the account type 'C' and issues the following query:
  SELECT SUM(cleared_bal)
  FROM   accounts
  WHERE  acc_type = 'C';
Because the mv_bal MV already contains the totals by account type, the user could have gotten this information directly from the MV:
  SELECT totbal
  FROM   mv_bal
  WHERE  acc_type = 'C';
This query against the mv_bal MV returns results much more quickly than the query against the accounts table: querying the MV does not touch the (much larger) source tables.
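The effect of a materialized view can be sketched with a precomputed summary table. SQLite has no materialized views, so this is only an analogy, assuming Python/SQLite; the names (accounts, mv_bal, totbal) mirror the Acme example, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts(acc_no TEXT, acc_type TEXT, cleared_bal REAL);
    INSERT INTO accounts VALUES ('A-1', 'C', 100), ('A-2', 'C', 250),
                                ('A-3', 'S', 400);

    -- "Materialized view": totals by account type, precomputed and stored.
    -- A real MV would be kept up to date automatically by the database.
    CREATE TABLE mv_bal AS
        SELECT acc_type, SUM(cleared_bal) AS totbal
        FROM accounts GROUP BY acc_type;
""")
# The rewritten query reads one row of the small summary table
# instead of scanning and aggregating the accounts table.
totbal = conn.execute(
    "SELECT totbal FROM mv_bal WHERE acc_type = 'C'").fetchone()[0]
```

Query rewrite in Oracle does exactly this substitution transparently: the user still writes the SUM over accounts, and the optimizer answers it from mv_bal.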

Materialized view: query definition
Code Listing 1: Original query
  SELECT acc_mgr_id, acc_type_desc,
         DECODE(a.sub_acc_type, NULL, '?', sub_acc_type_desc) sub_acc_type_desc,
         SUM(cleared_bal)   tot_cleared_bal,
         SUM(uncleared_bal) tot_uncleared_bal,
         AVG(cleared_bal)   avg_cleared_bal,
         AVG(uncleared_bal) avg_uncleared_bal,
         SUM(cleared_bal + uncleared_bal) tot_total_bal,
         AVG(cleared_bal + uncleared_bal) avg_total_bal,
         MIN(cleared_bal + uncleared_bal) min_total_bal
  FROM   balances b, accounts a, acc_types at, sub_acc_types sat
  WHERE  a.acc_no = b.acc_no
    AND  at.acc_type = a.acc_type
    AND  sat.sub_acc_type (+) = a.sub_acc_type
  GROUP BY acc_mgr_id, acc_type_desc,
           DECODE(a.sub_acc_type, NULL, '?', sub_acc_type_desc)

CUBE clause: example
(The slide showed the output of a GROUP BY CUBE over YEAR, MONTH, and NUM: one row per combination of the three dimensions, plus subtotal rows in which one or more dimensions are left blank.) Representative rows:

  YEAR  MONTH  NUM  TOT_SALES  TOT_RED  TOT_TAXES  TOT_SALES_NET
  2001  FEB    100    1250.00   150.00     250.00         850.00
  2001  FEB    200    2000.00   200.00     300.00        1600.00
  2001  FEB           3250.00   350.00     550.00        2450.00
  2001                3250.00   350.00     550.00        2450.00
        FEB           3250.00   350.00     550.00        2450.00
               100    1250.00   150.00     250.00         850.00

DISCOVERER: Data selection
DISCOVERER: Data hierarchisation
DISCOVERER: Data extraction
(Screenshots of the Oracle Discoverer tool.)

DATABASE TRENDS - Databases and the Web
- Database server: a computer in a client/server environment that runs a DBMS to process SQL statements and perform database management tasks.
- Application server: software handling all application operations.

DATABASE TRENDS - Linking Internal Databases to the Web
DATABASE TRENDS - A Hypermedia Database
(Slide diagrams.)

Big Data - Introduction
History of Big Data:
- The term "Big Data" was first used by John R. Mashey, then the chief scientist of Silicon Graphics Inc., in a 1999 USENIX conference invited talk titled "Big Data and the Next Big Wave of InfraStress".
- The term was also used in a paper by Bryson et al. (1999) published in Communications of the ACM.
- A report from the META Group (now Gartner) by Laney (2001) was the first to identify the 3 Vs: the volume, variety, and velocity perspective of data.
- Google's paper on MapReduce (Dean and Ghemawat 2004) was the trigger that led to many developments in the big data area.
(Figure: Healy J., "Why what happens in an internet minute really matters", M2M.)

Concepts of Big Data
Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value. It requires:
- new data architectures and analytic sandboxes
- new tools
- new analytical methods
- integrating multiple skills into the new role of the data scientist.
Organizations are deriving business benefit from analyzing ever larger and more complex data sets that increasingly require real-time or near-real-time capabilities.

Types of Data Structures
- Structured data
- Semi-structured data (e.g. a web page's "view source")
- Quasi-structured data, e.g. a clickstream URL such as:
  http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651
- Unstructured data, e.g. "The Red Wheelbarrow" by William Carlos Williams.
Source: McKinsey, May 2011, "Big Data: The next frontier for innovation, competition, and productivity".

More Structured Data Structures
- Structured: data containing a defined data type, format, and structure. Example: transaction data and OLAP.
- Semi-structured: textual data files with a discernable pattern, enabling parsing. Example: XML data files that are self-describing and defined by an XML schema.
- Quasi-structured: textual data with erratic data formats, which can be formatted with effort, tools, and time. Example: web clickstream data that may contain some inconsistencies in data values and formats.
- Unstructured: data that has no inherent structure and is usually stored as different types of files. Example: text documents, PDFs, images, and video.

Business Requirements
Current business problems provide opportunities for organizations to become more analytical and data-driven. Drivers and examples:
1. Desire to optimize business operations: sales, pricing, profitability, efficiency.
2. Desire to identify business risk: customer churn, fraud, default.
3. Predict new business opportunities: upsell, cross-sell, best new customer prospects.
4. Comply with laws or regulatory requirements: anti-money laundering, fair lending, Basel II.

Analytical Approaches for Meeting Business Drivers
Business Intelligence (lower business value, oriented to the past):
- Typical techniques and data types: standard and ad hoc reporting, dashboards, alerts, queries, details on demand; structured data, traditional sources, manageable data sets.
- Common questions: What happened last quarter? How many did we sell? Where is the problem? In which situations?
Predictive Analytics & Data Mining / Data Science (higher business value, oriented to the future):
- Typical techniques and data types: optimization, predictive modeling, forecasting, statistical analysis; structured and unstructured data, many types of sources, very large data sets.
- Common questions: Why is this happening? What if...? What is the optimal scenario for our business? What will happen next? What if these trends continue?

A typical analytical architecture
1. Data sources.
2. Spread marts and departmental warehouses: static schemas accrete over time; errant data and marts.
3. Enterprise applications and warehouse: prioritized operational processes; reporting; non-agile.
4. Siloed analytics: non-prioritized data provisioning.

Implications of the Typical Architecture for Data Science
- High-value data is hard to reach and leverage.
- Predictive analytics and data mining activities are last in line for data, queued after prioritized operational processes.
- Data moves in batches from the EDW to local analytical tools (in-memory analytics such as R, SAS, SPSS, Excel): slow time-to-insight; sampling can skew model accuracy.
- Isolated, ad hoc analytic projects rather than centrally managed harnessing of analytics: reduced business impact; non-standardized initiatives; frequently not aligned with corporate business goals.

Big Data Analytics Development: the Data Analytics Lifecycle
1. Discovery: Do I have enough information to draft an analytic plan and share it for peer review?
2. Data preparation: Do I have enough good-quality data to start building the model?
3. Model planning: Do I have a good idea about the type of model to try? Can I refine the analytic plan?
4. Model building: Is the model robust enough? Have we failed for sure?
5. Communicate results.
6. Operationalize.

Big Data Analytics Development (cont.) Phase 1: Discovery Formulate Initial Hypotheses 1 Discovery Do I have enough information to draft an analytic plan and share for peer review? IH, Operationalize H 1, H 2, H 3, H n Data Prep Gather and assess hypotheses from stakeholders and domain experts Preliminary data exploration to inform discussions with stakeholders Communicate during the hypothesis forming stage Identify Results Data Sources Begin Learning the Data Planning Aggregate sources for previewing the data and provide high-level understanding Review the raw data Building Is the model robust Determine enough? Have the we structures and tools needed failed for sure? Scope the kind of data needed for this kind of problem Do I have enough good quality data to start building the model? Do I have a good idea about the type of model to try? Can I refine the analytic plan? 61 61 Big Data Analytics Development (cont.) Phase 2: Data Preparation Do I have enough information to draft an analytic plan and share for peer review? Prepare Analytic Sandbox Discovery Do I have Work space for the analytic team enough good quality data to 10x+ vs. EDW 2 start building Perform Operationalize ELT Determine needed transformations Data Prep the model? Assess data quality and structuring Derive statistically useful measures Communicate Extract Results data and determine data Planning connections for raw data, OLTP transactions, OLAP cubes or data feeds Big ELT and Big ETL Do I have a good idea Building about the type of model Is the model robust to try? Can I refine the Useful enough? Tools for Have this we phase: analytic plan? For failed Data for Transformation sure? & Cleansing: SQL, Hadoop, MapReduce, Alpine Miner 62 62 Big Data Analytics Development (cont.) 
Phase 3: Planning
Determine methods:
- Select methods based on the hypotheses, data structure, and volume.
- Ensure the techniques and approach will meet the business objectives.
Techniques and workflow:
- Candidate tests and their sequence.
- Identify and document modeling assumptions.
Gating question: Do I have a good idea about the type of model to try? Can I refine the analytic plan?
Useful tools for this phase: R/PostgreSQL, SQL Analytics, Alpine Miner, SAS/ACCESS, SPSS/ODBC.

Phase 4: Building
- Develop data sets for testing, training, and production purposes.
- Ensure the data is sufficiently robust for the model and the analytical techniques: smaller test sets for validating the approach, a training set for initial experiments.
- Get the best environment you can for building models and workflows: fast hardware, parallel processing.
Gating question: Is the model robust enough? Have we failed for sure?
Useful tools for this phase: R, PL/R, SQL, Alpine Miner, SAS Enterprise Miner.

Phase 5: Communicate Results
Did we succeed? Did we fail?
- Interpret the results and compare them to the initial hypotheses (IHs) from Phase 1.
- Identify key findings and quantify the business value.
- Summarize findings in a form suited to the audience.

Phase 6: Operationalize
- Run a pilot and assess the benefits.
- Define a process to update and retrain the model as needed.
- Deliver the final deliverables; move to execution in a production environment.
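Phase 4's split of the data into training and test sets can be sketched as a minimal, deterministic toy routine. In practice a library function such as scikit-learn's train_test_split would be used; the names below are illustrative:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle reproducibly, then hold out the last test_fraction of rows."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

train, test = train_test_split(range(100))
print(len(train), len(test))  # 80 20
```

The fixed seed makes experiments repeatable, which matters when comparing candidate models during initial experiments.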

IT Solutions: What is Hadoop?
- A framework for performing big data analytics; an implementation of the MapReduce paradigm.
- Hadoop glues the storage and analytics together and provides reliability, scalability, and management.
Two main components:
- Storage (big data): HDFS, the Hadoop Distributed File System - a reliable, redundant, distributed file system optimized for large files.
- MapReduce (analytics): a programming model for processing sets of data, mapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answers.

Hadoop (version 1.0) comprises the MapReduce implementation along with the Hadoop Distributed File System (HDFS). It works well for a number of use cases, including those in which the data can be partitioned into independent chunks - naturally parallel applications. Applications run at Disney, Sears, Walmart, AT&T, and more.

IT Solutions: learning from aggregation
(Figure: a learning algorithm interacting with a data system through statistical queries.)
D. Caragea et al., "A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees", Int. J. Hybrid Intell. Syst., 2004.

IT Solutions: MapReduce
Transform (Map) input key/value pairs to output key/value pairs: <k1,v1> -> <k2,v2>.
- Input key/value pairs: for instance, key = line number, value = text string.
- Map function: the steps to transform input pairs to output pairs; for example, count the different words in the input.
- Output key/value pairs: for example, key = <word>, value = <count>.
- Map output is the input to Reduce.
Chu et al., "Map-Reduce for Machine Learning on Multicore", NIPS '06.

(Figure: an example of MapReduce - mapping, then reducing.)

Merge (Reduce) the values from the Map phase. Reduce is optional; sometimes all the work is done in the Mapper.
- Input: the values for a given key from all the Mappers.
- Reduce function: the steps to combine the values (sum? count? print? ...).
- Output: print the values? load them into a DB? send them to the next MapReduce job?
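The Map and Reduce roles just described (line-numbered text in, <word, count> pairs out) can be sketched as a single-process Python simulation. The function names and sample records are illustrative, not Hadoop APIs; the shuffle step stands in for what Hadoop does between the Map and Reduce phases:

```python
from collections import defaultdict

def map_phase(records):
    """Map: (line_number, text) pairs -> (word, 1) pairs."""
    for _, text in records:
        for word in text.split():
            yield (word, 1)

def shuffle(pairs):
    """Group all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: (word, [1, 1, ...]) -> (word, count)."""
    return {word: sum(counts) for word, counts in groups.items()}

records = [(1, "big data big analytics"), (2, "big data")]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)  # {'big': 3, 'data': 2, 'analytics': 1}
```

In real Hadoop the map and reduce functions run on many nodes in parallel and the shuffle moves data across the network; the logic per key/value pair is the same.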

IT Solutions: the Map-Reduce abstraction
(Figure: a record flows into Map, which emits key/value pairs; Reduce combines the values for each key into output key/value pairs.)
Word-count example (Dean & Ghemawat, OSDI '04):

    Map(docRecord) {
        for (word in docRecord) {
            emit(word, 1)
        }
    }

    Reduce(word, counts) {
        emit(word, SUM(counts))
    }

Map is idempotent; Reduce is commutative and associative.

IT Solutions - Why looking beyond Hadoop?
Further hindrances to widespread adoption of Hadoop across enterprises:
- Lack of ODBC (Open Database Connectivity) support: a lot of BI tools are forced to build separate Hadoop connectors.
- Hadoop is not suitable for all types of applications: if data splits are interrelated, or the computation needs to access data across splits, joins may be involved and may not run efficiently over Hadoop.
- For iterative computations, Hadoop MapReduce is not well suited, for two reasons. First, there is the overhead of fetching data from HDFS for each iteration. Second, Hadoop lacks long-lived MapReduce jobs: typically a termination-condition check must be executed outside the MapReduce job to determine whether the computation is complete, so each iteration starts a new, costly MapReduce initialization.

Apache Mahout, an open-source library of algorithms on Hadoop: ALS matrix factorization, SVD, random forests, LDA, k-means, naive Bayes, PCA, spectral clustering, canopy clustering, logistic regression, ...

(Figure: a memory-optimized dataflow view - training data in HDFS flows through Map and Reduce stages, moving data efficiently between stages.)

In-memory data-flow systems: the common pattern is multi-stage aggregation; the abstraction is dataflow operations on immutable datasets.

What is Spark?
A fault-tolerant distributed dataflow framework.
Improves efficiency through:
- In-memory computing primitives.
- Pipelined computation.
Improves usability through:
- Rich APIs in Scala, Java, Python.
- An interactive shell.
Up to 100x faster than Hadoop MapReduce (2-10x on disk), with 2-5x less code.
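The in-memory dataflow idea - operations on immutable datasets, where a result is defined by the chain of transformations that produced it - can be mimicked in miniature with plain Python. This is a conceptual sketch, not the Spark API; the class and method names are invented for illustration:

```python
class MiniDataset:
    """Toy immutable dataset: stores a lineage of transformations, not materialized data."""
    def __init__(self, source, ops=()):
        self.source = source  # base data (in a real system: files/partitions)
        self.ops = ops        # lineage: sequence of ('map'|'filter', fn)

    def map(self, fn):
        # Transformations return a NEW dataset; the original is never mutated.
        return MiniDataset(self.source, self.ops + (('map', fn),))

    def filter(self, fn):
        return MiniDataset(self.source, self.ops + (('filter', fn),))

    def collect(self):
        """Action: replay the lineage over the source. A system like Spark
        rebuilds a lost partition the same way - by re-running its lineage."""
        data = iter(self.source)
        for kind, fn in self.ops:
            data = map(fn, data) if kind == 'map' else filter(fn, data)
        return list(data)

ds = MiniDataset(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(ds.collect())  # [0, 4, 16, 36, 64]
```

Because transformations are recorded rather than executed eagerly, the stages can be pipelined in one pass over the data, and re-running the recorded lineage gives fault tolerance without replicating intermediate results.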

Spark programming abstraction: write programs in terms of transformations on distributed datasets.
Resilient Distributed Datasets (RDDs):
- Distributed collections of objects that can be stored in memory or on disk.
- Built via parallel transformations (map, filter, ...).
- Automatically rebuilt on failure.

Mahout moves to Spark. On 25 April 2014 ("Goodbye MapReduce"), the Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. Future implementations are being built on top of a DSL for linear algebraic operations, developed over the preceding months; programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

Other related systems:
- Microsoft Dryad and Naiad: http://research.microsoft.com/en-us/projects/dryad/ and http://research.microsoft.com/en-us/projects/naiad/
- Hyracks: http://hyracks.org
- Stratosphere: http://stratosphere.eu
- MADlib: http://madlib.net
- Hadoop Tez: http://hortonworks.com/hadoop/tez/

Conclusions and Perspectives
Trend: after C-C, availability of Big Data in-memory technology, lower cost, real time.
- According to McKinsey, a retailer using big data to the full could increase its operating margin by more than 60%.
- Bad data or poor data quality costs US businesses $600 billion annually.
- According to Gartner, Big Data will drive $232 billion in spending through 2016.
- By 2015, 4.4 million IT jobs globally will be created to support big data, generating 1.9 million IT jobs in the US.

Open Issues
- Scalable big/fast data infrastructures.
- Diversity in the data management landscape.
- End-to-end processing and understanding of data.
- Cloud services.
- Roles of humans in the data life cycle.