IT IN FINANCE. 3. Data warehouses, OLAP and Big Data technologies. Jerzy Korczak.


IT in Finance
1. Overview of Information Systems
2. Data preprocessing techniques
3. Data warehouses, OLAP and Big Data technology
4. DSS, knowledge discovery methods
5. Web-based and mobile technology

What is a Database System?
Database: a very large, integrated collection of data.
- Models a real-world enterprise
  - Entities (e.g., teams, games)
  - Relationships (e.g., The Forty-Niners are playing in The Superbowl)
  - More recently, also includes active components, often called "business logic"
A Database Management System (DBMS) is a software system designed to store, manage, and facilitate access to databases.

Database Systems and DBMS
- Collection of interrelated data
- Set of programs to access the data
- DBMS contains information about a particular enterprise
- DBMS provides an environment that is both convenient and efficient to use
Database Applications:
- Banking: all transactions
- Airlines: reservations, schedules
- Universities: registration, grades
- Sales: customers, products, purchases
- Manufacturing: production, inventory, orders, supply chain
- Human resources: employee records, salaries, tax deductions
Databases touch all aspects of our lives.

Levels of Abstraction
- Views describe how users see the data.
- Conceptual schema defines the logical structure.
- Physical schema describes the files and indexes used.
[Figure: Users -> View 1, View 2, View 3 -> Conceptual Schema -> Physical Schema -> DB]

Example: University Database
Conceptual schema:
    Students(sid: string, name: string, login: string, age: integer, gpa: real)
    Courses(cid: string, cname: string, credits: integer)
    Enrolled(sid: string, cid: string, grade: string)
External schema (view):
    Course_info(cid: string, enrollment: integer)
Physical schema:
- Relations stored as unordered files.
- Index on first column of Students.
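The three levels of abstraction in the University example can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the sample enrollment rows are invented, and the `PRIMARY KEY` on `sid` stands in for the slide's "index on first column of Students".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conceptual schema: the three relations from the slide.
-- PRIMARY KEY makes SQLite build an index on sid (physical level).
CREATE TABLE Students (sid TEXT PRIMARY KEY, name TEXT, login TEXT,
                       age INTEGER, gpa REAL);
CREATE TABLE Courses  (cid TEXT PRIMARY KEY, cname TEXT, credits INTEGER);
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);

-- External schema: users of this view see only enrollment counts.
CREATE VIEW Course_info AS
    SELECT cid, COUNT(*) AS enrollment FROM Enrolled GROUP BY cid;
""")

# Invented sample data, for illustration only.
conn.executemany("INSERT INTO Enrolled VALUES (?, ?, ?)",
                 [("s1", "c1", "A"), ("s2", "c1", "B"), ("s1", "c2", "A")])

course_info = conn.execute(
    "SELECT cid, enrollment FROM Course_info ORDER BY cid").fetchall()
print(course_info)
```

A user of `Course_info` never sees student identifiers or grades, and the view keeps working if the physical storage of `Enrolled` changes.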

Relational model
Example of tabular data in the relational model (columns are attributes):

    customer-id  customer-name  customer-street  customer-city  account-number
                 Johnson        Alma             Palo Alto      A-101
                 Smith          North            Rye            A-215
                 Johnson        Alma             Palo Alto      A-201
                 Jones          Main             Harrison       A-217
                 Smith          North            Rye            A-201

Tables Explained
- A tuple = a record
  - Restriction: all attributes are of atomic type
- A table = a set of tuples
  - Like a list, but it is unordered: no first(), no next(), no last()
- No nested tables, only flat tables are allowed!
- The schema of a table is the table name and its attributes:
    ___(PName, Price, Category, Manufacturer)
- A key is an attribute whose values are unique; we underline a key:
    ___(PName, Price, Category, Manufacturer)

A Sample Relational Database

Structured Query Language (SQL)
SQL: widely used non-procedural language
- e.g. find the name of the customer with customer-id
    select customer.customer-name
    from customer
    where customer.customer-id =
- e.g. find the balances of all accounts held by the customer with customer-id
    select account.balance
    from depositor, account
    where depositor.customer-id =
      and depositor.account-number = account.account-number
Application programs generally access databases through one of:
- Language extensions to allow embedded SQL
- Application program interface (e.g. ODBC/JDBC) which allows SQL queries to be sent to a database

Tables in SQL
[Figure callouts: table name, attribute names, tuples or rows]

    PName        Price   Category     Manufacturer
    Gizmo        $19.99  Gadgets      GizmoWorks
    Powergizmo   $29.99  Gadgets      GizmoWorks
    SingleTouch  $       Photography  Canon
    MultiTouch   $       Household    Hitachi

SQL Query - basic form
    SELECT attributes
    FROM relations (possibly multiple, joined)
    WHERE conditions (selections)

What goes in the WHERE clause
x = y, x < y, x <= y, etc.
- for numbers, they have the usual meanings
- for CHAR and VARCHAR: lexicographic ordering
- for dates and times, what you expect...

P. Greenspun, SQL for Web Nerds, Ch. 3 Simple Queries, and Ch. 4 More Complex Queries.
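The basic SELECT / FROM / WHERE form can be tried against the sample table using Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions: the table name `Product` and the two prices above 100 are placeholders chosen for illustration, since neither survives in the transcription.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table name "Product" is an assumption made for this example.
conn.execute("""CREATE TABLE Product (
    PName TEXT PRIMARY KEY, Price REAL, Category TEXT, Manufacturer TEXT)""")
conn.executemany("INSERT INTO Product VALUES (?, ?, ?, ?)", [
    ("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
    ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
    ("SingleTouch", 149.99, "Photography", "Canon"),  # price assumed
    ("MultiTouch", 203.99, "Household", "Hitachi"),   # price assumed
])

# Selection: keep only the rows that satisfy the WHERE condition.
gadgets = conn.execute(
    "SELECT * FROM Product WHERE Category = 'Gadgets' ORDER BY PName"
).fetchall()

# Selection and projection: filter rows, then keep three attributes.
# ORDER BY is ascending unless DESC is given.
expensive = conn.execute(
    """SELECT PName, Price, Manufacturer FROM Product
       WHERE Price > 100 ORDER BY Price DESC, PName"""
).fetchall()

print(gadgets)
print(expensive)
```

Because a table is an unordered set of tuples, the only way to get a predictable row order is an explicit ORDER BY, which is why both queries include one.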

Simple SQL Query
    SELECT *
    FROM ___
    WHERE category = 'Gadgets'

    PName        Price   Category     Manufacturer
    Gizmo        $19.99  Gadgets      GizmoWorks
    Powergizmo   $29.99  Gadgets      GizmoWorks
    SingleTouch  $       Photography  Canon
    MultiTouch   $       Household    Hitachi

Result ("selection"):

    PName        Price   Category     Manufacturer
    Gizmo        $19.99  Gadgets      GizmoWorks
    Powergizmo   $29.99  Gadgets      GizmoWorks

Simple SQL Query (2)
    SELECT PName, Price, Manufacturer
    FROM ___
    WHERE Price > 100

Result ("selection and projection"):

    PName        Price   Manufacturer
    SingleTouch  $       Canon
    MultiTouch   $       Hitachi

Ordering the Results
    SELECT pname, price, manufacturer
    FROM ___
    WHERE category = 'gizmo' AND price > 50
    ORDER BY price, pname
Ordering is ascending, unless you specify the DESC keyword.

Joins in SQL (3)
Connect two or more tables. What is the connection between them?

    Company
    CName       StockPrice  Country
    GizmoWorks  25          USA
    Canon       65          Japan
    Hitachi     15          Japan

Joins in SQL (4)
    ___ (pname, price, category, manufacturer)
    Company (cname, stockprice, country)
Find all products under $200 manufactured in Japan; return their names and prices.
    SELECT pname, price
    FROM ___, Company
    WHERE manufacturer = cname AND country = 'Japan' AND price <= 200
The condition manufacturer = cname is the join between ___ and Company.

Joins in SQL (5)
    SELECT PName, Price
    FROM ___, Company
    WHERE Manufacturer = CName AND Country = 'Japan' AND Price <= 200

Result:

    PName        Price
    SingleTouch  $
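The join described above ("all products under $200 manufactured in Japan") can be run end to end as a hedged sketch with `sqlite3`. As before, the table name `Product` and the prices of SingleTouch and MultiTouch are assumptions made only so the example executes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- "Product" is an assumed table name; Company comes from the slides.
CREATE TABLE Product (PName TEXT, Price REAL, Category TEXT, Manufacturer TEXT);
CREATE TABLE Company (CName TEXT, StockPrice INTEGER, Country TEXT);
""")
conn.executemany("INSERT INTO Product VALUES (?, ?, ?, ?)", [
    ("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
    ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
    ("SingleTouch", 149.99, "Photography", "Canon"),  # price assumed
    ("MultiTouch", 203.99, "Household", "Hitachi"),   # price assumed
])
conn.executemany("INSERT INTO Company VALUES (?, ?, ?)", [
    ("GizmoWorks", 25, "USA"), ("Canon", 65, "Japan"), ("Hitachi", 15, "Japan"),
])

# Manufacturer = CName is the join condition; the other two are selections.
japanese_under_200 = conn.execute("""
    SELECT PName, Price
    FROM Product, Company
    WHERE Manufacturer = CName AND Country = 'Japan' AND Price <= 200
""").fetchall()
print(japanese_under_200)
```

With the assumed prices, only SingleTouch qualifies, which matches the single-row result shown on the slide.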

Joins in SQL (6)
Using the same two tables as on the previous slides:
    SELECT Country
    FROM ___, Company
    WHERE Manufacturer = CName AND Category = 'Gadgets'
What is the problem? What's the solution?

Joins
    ___ (pname, price, category, manufacturer)
    Purchase (buyer, seller, store, product)
    Person (persname, phonenumber, city)
Find names of people living in Wroclaw that bought some product in the Gadgets category, and the names of the stores they bought such products from.
    SELECT DISTINCT persname, store
    FROM Person, Purchase, ___
    WHERE persname = buyer AND product = pname
      AND city = 'Wroclaw' AND category = 'Gadgets'

Queries, Query Plans, and Operators
    SELECT eid, ename, title
    FROM Emp E
    WHERE E.sal > $50K

    SELECT E.loc, AVG(E.sal)
    FROM Emp E
    GROUP BY E.loc
    HAVING Count(*) > 5

    SELECT COUNT DISTINCT (E.eid)
    FROM Emp E, Proj P, Asgn A
    WHERE E.eid = A.eid AND P.pid = A.pid AND E.loc <> P.loc
System handles query plan generation & optimization; ensures correct execution.
[Figure: query plans built from the operators Select, Join, Group(agg), Having and Count distinct over Emp (Employees), Proj (Projects) and Asgn (Assignments)]

Transaction Management
- A transaction is a collection of operations that performs a single logical function in a database application.
- The transaction-management component ensures that the database remains in a consistent (correct) state despite system failures (e.g., power failures and operating system crashes) and transaction failures.
- The concurrency-control manager controls the interaction among the concurrent transactions, to ensure the consistency of the database.

Transactions: ACID Properties
Key concept is a transaction: a sequence of database actions (reads/writes).
- DBMS ensures atomicity (all-or-nothing property) even if the system crashes in the middle of a transaction.
- Each transaction, executed completely, must take the DB between consistent states or must not run at all.
- DBMS ensures that concurrent transactions appear to run in isolation.
- DBMS ensures durability of committed transactions even if the system crashes.
Idea: keep a log (history) of all actions carried out by the DBMS while executing a set of transactions:
- Before a change is made to the database, the corresponding log entry is forced to a safe location.
- After a crash, the effects of partially executed transactions are undone using the log.
- Effects of committed transactions are redone using the log.

Database Users
Users are differentiated by the way they expect to interact with the system.
- Application programmers interact with the system through DML calls.
- Sophisticated users form requests in a database query language.
- Specialized users write specialized database applications that do not fit into the traditional data processing framework.
- Naïve users invoke one of the permanent application programs that have been written previously, e.g. people accessing a database over the web, bank tellers, clerical staff.
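Atomicity can be observed directly with a small sketch using `sqlite3`. The account numbers echo the earlier example table, but the balances, the `transfer` helper and the simulated crash are all invented for illustration. The `with conn:` block commits if its body succeeds and rolls back if it raises.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (account_number TEXT PRIMARY KEY, balance REAL)")
# Starting balances are invented for this example.
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("A-101", 500.0), ("A-215", 700.0)])
conn.commit()


def transfer(conn, src, dst, amount, crash=False):
    """Move money between accounts as one all-or-nothing transaction."""
    with conn:  # commit on success, roll back on any exception
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE account_number = ?",
            (amount, src))
        if crash:
            raise RuntimeError("simulated failure between the two updates")
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE account_number = ?",
            (amount, dst))


def balances(conn):
    return dict(conn.execute("SELECT account_number, balance FROM account"))


try:
    transfer(conn, "A-101", "A-215", 100.0, crash=True)
except RuntimeError:
    pass
after_failure = balances(conn)   # the debit was undone

transfer(conn, "A-101", "A-215", 100.0)
after_success = balances(conn)   # both updates were applied together
print(after_failure, after_success)
```

The failed transfer leaves both balances untouched, so the total amount of money is the same before and after, which is the consistent state the slide refers to.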

Internet DB - architecture
[Figure: WEB client <-> HTTP (TCP/IP), HTML document <-> Internet <-> HTTP (TCP/IP), XML document <-> WEB server <-> DB]

Advantages of database systems
- Data independence
- Efficient data access
- Data integrity & security
- Data administration
- Concurrent access, crash recovery
- Reduced application development time
So why not use them always?
- Expensive/complicated to set up & maintain
- This cost & complexity must be offset by need
- General-purpose, not suited for special-purpose tasks (e.g. text search!)

What is a Data Warehouse?
A data warehouse is a database that provides on-line analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining.
W.H. Inmon: "... a data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process"

Characteristics of a Data Warehouse [Inmon]
* subject-oriented - data organized by subject instead of application
  e.g. - an insurance company would organize their data by customer, premium, and claim, instead of by different products (auto, life, etc.)
       - contains only the information necessary for decision support processing
* integrated - encoding of data is often inconsistent
  e.g. - gender might be coded as "m" and "f" or 0 and 1, but when data are moved from the operational environment into the data warehouse they assume a consistent coding convention
* time-variant - the data warehouse contains a place for storing data that are five to 10 years old, or older
  e.g. - this data is used for comparisons, trends, and forecasting
       - these data are not updated
* non-volatile - data are not updated or changed in any way once they enter the data warehouse
J. Korczak, UE

Differences between Operational Database Systems and Data Warehouses
- Users and system orientation
- Data contents
- Database design
- View
- Access patterns

A Star Is Born
Global Computing Company: Sales and Marketing
[Figure: star schema with the SALES fact table in the center, surrounded by the PRODUCTS, CHANNELS, CUSTOMERS and DAYS dimensions]
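A star schema like the one in the figure can be sketched as one fact table whose foreign keys point at the dimension tables. The table names follow the slide; every column name and data row below is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (column names are assumptions).
CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE channels  (channel_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE days      (day_id      INTEGER PRIMARY KEY, day TEXT, quarter TEXT);

-- Fact table: one row per sale, keyed by the four dimensions.
CREATE TABLE sales (
    product_id  INTEGER REFERENCES products(product_id),
    channel_id  INTEGER REFERENCES channels(channel_id),
    customer_id INTEGER REFERENCES customers(customer_id),
    day_id      INTEGER REFERENCES days(day_id),
    amount      REAL
);

INSERT INTO products  VALUES (1, 'Laptop', 'Hardware'), (2, 'Antivirus', 'Software');
INSERT INTO channels  VALUES (1, 'Direct'), (2, 'Internet');
INSERT INTO customers VALUES (1, 'Wroclaw');
INSERT INTO days      VALUES (1, '2015-01-15', 'Q1');
INSERT INTO sales VALUES (1, 1, 1, 1, 1000.0), (1, 2, 1, 1, 500.0),
                         (2, 2, 1, 1, 50.0),   (2, 2, 1, 1, 70.0);
""")

# A typical analytical query: join the fact table to the dimensions it
# needs and aggregate the measure.
sales_by_category_channel = conn.execute("""
    SELECT p.category, c.name, SUM(s.amount)
    FROM sales s
    JOIN products p ON p.product_id = s.product_id
    JOIN channels c ON c.channel_id = s.channel_id
    GROUP BY p.category, c.name
    ORDER BY p.category, c.name
""").fetchall()
print(sales_by_category_channel)
```

Each analytical question touches the fact table plus only the dimensions it groups or filters by, which keeps the joins shallow.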

Data modeling
Data warehouses and OLAP tools are based on a multidimensional data model. A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.

A Multidimensional Data Aggregation
[Figure: sales report aggregated along the hierarchy Country -> Region -> City]

Example: 3D view of sales data
[Table: sales by time (quarters), item (home entertainment, computer, phone, security) and location (Chicago, New York, Toronto, Vancouver); the cell values are not preserved in this transcription]

Data Cube - Example
[Figure: the same sales data drawn as a cube with axes time (quarters), item (home entertainment, computer, phone, security) and location (Chicago, NY, Toronto, Vancouver)]

A 3D data cube is used to model student performance over three measures or dimensions: age group, gender and department/course. The dimensional axes hold the metrics to be analyzed. In this case it is student performance, represented as the number of classes completed with a satisfactory grade as a fraction of the number attempted. The views of the metrics are called dimensions. An individual student performance metric is given by the intersection of the three axes, and is referred to as a fact. In this particular representation, the facts are for a given year, aggregated over all students in that age group, gender and course.

New terms
- Merge (data): combine two or more data sets, values or structures.
- Data replication: extract data from several platforms, perform some filtering and transformation, and distribute and load to another database or databases. Usually the term replication implies limited or no transformation, and moves within a homogeneous environment. (See Pump.)
- Pump: a data pump extracts data from several mainframe and client-server platforms, performs some filtering and transformation, and distributes and loads to another database(s). Usually the term pump is used rather than "replicator" to connote its applicability in a cross-platform environment.
- Scrub (data): see Clean.
- Transform (data): see Abstract.

Drill-down and Roll-up
Drill-down is the repetitive selection and analysis of summarized facts, with each repetition of data selection occurring at a lower level of summarization. An example of drill-down is a multiple-step process where sales revenue is first analyzed by year, then by quarter, and finally by month. Each iteration of drill-down returns sales revenue at a lower level of aggregation along the period dimension.
Roll-up is the opposite of drill-down. Roll-up is the repetitive selection and analysis of summarized facts, with each repetition of data selection occurring at a higher level of summarization.
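The year -> quarter -> month example of drill-down can be sketched as the same aggregation query run at successively finer levels of the period dimension. The `sales` table, its columns, the sample rows and the `revenue_by` helper are all invented for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (year INTEGER, quarter TEXT, month INTEGER, revenue REAL)")
# Invented sample rows.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (2014, "Q1", 1, 100.0), (2014, "Q1", 2, 150.0),
    (2014, "Q2", 4, 200.0), (2015, "Q1", 1, 300.0),
])

# The period hierarchy, from coarsest to finest level.
HIERARCHY = ["year", "quarter", "month"]


def revenue_by(level):
    """Aggregate revenue at the given level of the period hierarchy."""
    columns = ", ".join(HIERARCHY[: HIERARCHY.index(level) + 1])
    query = (f"SELECT {columns}, SUM(revenue) FROM sales "
             f"GROUP BY {columns} ORDER BY {columns}")
    return conn.execute(query).fetchall()


by_year = revenue_by("year")        # most summarized
by_quarter = revenue_by("quarter")  # drill-down one level
by_month = revenue_by("month")      # drill-down to the finest level
print(by_year, by_quarter, by_month, sep="\n")
```

Reading the three results from top to bottom is a drill-down; reading them from bottom to top is a roll-up. The column list is built only from the fixed `HIERARCHY` names, so no user input reaches the SQL text.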

Process of Extraction/Transformation/Loading (ETL)
- Extraction of source data
- Transformation/cleaning
- Indexing and aggregation
- Data loading into DW
- Detection of changes
- Data refreshment
[Figure: OLTP -> Gateways / Programs -> ETL -> Data warehouse -> Access tools]

M-OLAP
[Figure: Multidimensional database (hypercube) -> MOLAP server -> OLAP client]

R-OLAP
[Figure: Relational database (star or snowflake) -> ROLAP server -> Multidimensional view -> OLAP client]

Mapping
Metadata:
- Definition of used attributes
- Definition of required transformations

    File A (source)        Staging File One (target)
    F1  123                Number  USA123
    F2  Bloggs             Name    Mr. Bloggs
    F3  10/12/56           DOB     10-Dec-56

Materialized Views
Data prejoined and/or presummarized and automatically maintained by the database.
[Figure: a table with columns Region, Date, City, Sales and summary views such as "Sum by City" and "Sum by Day"]

Materialized View: The Query Rewrite Dramatically Improves Query Performance
Queries are rewritten automatically.

Materialized view: Example
Users at Acme Bank were complaining that typical and repeated queries were taking too long to execute. Because these queries are executed over and over by many users, any improvement in response time is bound to help. The DBAs at Acme Bank have tuned the most commonly used queries, all the needed indexes are present, and no further SQL tuning is going to make any difference to the performance of these queries.
The Acme Bank DBAs' solution: use materialized views (MVs). This example discusses how to plan for MVs, how to set up and confirm different MV capabilities, how to automatically generate the scripts to create MVs, how to make query rewrite (QR) available, and how to make sure that QR gets used.

Materialized view: Query
Now suppose a user wants to get the total of all account balances for the account type 'C' and issues the following query:
    SELECT SUM(cleared_bal) FROM accounts WHERE acc_type = 'C';
Because the mv_bal MV already contains the totals by account type, the user could have gotten this information directly from the MV:
    SELECT totbal FROM mv_bal WHERE acc_type = 'C';
This query against the mv_bal MV would have returned results much more quickly than the query against the accounts table, because querying the MV reads the precomputed totals and does not touch the source tables.
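The idea behind `mv_bal` can be imitated in a few lines. This is a sketch under clear assumptions: SQLite has no materialized views, so the summary is an ordinary table refreshed by a hypothetical `refresh_mv_bal` helper, whereas the database in the example maintains the view and rewrites queries automatically. The account rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (acc_no INTEGER PRIMARY KEY, acc_type TEXT, cleared_bal REAL);
-- Stand-in for the materialized view: a plain summary table.
CREATE TABLE mv_bal (acc_type TEXT PRIMARY KEY, totbal REAL);
""")
# Invented sample data.
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)",
                 [(1, "C", 100.0), (2, "C", 250.0), (3, "S", 40.0)])


def refresh_mv_bal(conn):
    """Recompute the presummarized totals (done manually in this sketch)."""
    with conn:
        conn.execute("DELETE FROM mv_bal")
        conn.execute("""INSERT INTO mv_bal
                        SELECT acc_type, SUM(cleared_bal)
                        FROM accounts GROUP BY acc_type""")


refresh_mv_bal(conn)

# The original query scans and sums the detail rows ...
from_base = conn.execute(
    "SELECT SUM(cleared_bal) FROM accounts WHERE acc_type = 'C'").fetchone()[0]
# ... while the rewritten query reads a single precomputed row.
from_mv = conn.execute(
    "SELECT totbal FROM mv_bal WHERE acc_type = 'C'").fetchone()[0]
print(from_base, from_mv)
```

Both queries return the same total; the second does far less work once `accounts` is large, at the cost of keeping the summary fresh when the base data changes.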

Materialized view: Query definition
Code Listing 1: Original query
    SELECT acc_mgr_id,
           acc_type_desc,
           DECODE(a.sub_acc_type, null, '?', sub_acc_type_desc) sub_acc_type_desc,
           sum(cleared_bal) tot_cleared_bal,
           sum(uncleared_bal) tot_uncleared_bal,
           avg(cleared_bal) avg_cleared_bal,
           avg(uncleared_bal) avg_uncleared_bal,
           sum(cleared_bal + uncleared_bal) tot_total_bal,
           avg(cleared_bal + uncleared_bal) avg_total_bal,
           min(cleared_bal + uncleared_bal) min_total_bal
    FROM balances b, accounts a, acc_types at, sub_acc_types sat
    WHERE a.acc_no = b.acc_no
      AND at.acc_type = a.acc_type
      AND sat.sub_acc_type (+) = a.sub_acc_type
    GROUP BY acc_mgr_id, acc_type_desc,
             DECODE(a.sub_acc_type, null, '?', sub_acc_type_desc)

CUBE clause: example
[Table: columns YEAR, MONTH, NUM, TOT_SALES, TOT_RED, TOT_TAXES, TOT_SALES_NET; the rows (all for month FEB) are not preserved in this transcription]

DISCOVERER: Data selection
DISCOVERER: Data hierarchisation
DISCOVERER: Data extraction

DATABASE TRENDS - Databases and the Web
- Database server: computer in a client/server environment that runs a DBMS to process SQL statements and perform database management tasks.
- Application server: software handling all application operations.

DATABASE TRENDS - Linking Internal Databases to the Web
DATABASE TRENDS - A Hypermedia Database

Big Data - Introduction
History of Big Data:
- The term Big Data was first used by John R. Mashey, then the chief scientist of Silicon Graphics Inc., in an invited talk at a Usenix conference (1999) titled "Big Data and the Next Big Wave of Infra Stress".
- The term Big Data was also used in a paper by Bryson et al. (1999) published in Communications of the ACM.
- The report from META Group (now Gartner) by Laney (2001) was the first to identify the 3 Vs (volume, variety, and velocity perspective of data).
- Google's paper on MapReduce (Dean and Ghemawat 2004) was the trigger that led to many developments in the big data area.

Healy J., Why what happens in an internet minute really matters, M2M

Concepts of Big Data
- Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value.
  - Requires new data architectures, analytic sandboxes
  - New tools
  - New analytical methods
  - Integrating multiple skills into the new role of data scientist
- Organizations are deriving business benefit from analyzing ever larger and more complex data sets that increasingly require real-time or near-real-time capabilities.
Source: McKinsey May 2011 article Big Data: The next frontier for innovation, competition, and productivity

Types of Data Structures
- Structured Data
- Semi-Structured Data (example: View Source)
- Quasi-Structured Data (example: http:// +data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651)
- Unstructured Data (example: The Red Wheelbarrow, by William Carlos Williams)

Data Structures (from more structured to less structured)
- Structured: data containing a defined data type, format, structure.
  Example: transaction data and OLAP
- Semi-Structured: textual data files with a discernible pattern, enabling parsing.
  Example: XML data files that are self-describing and defined by an XML schema
- Quasi-Structured: textual data with erratic data formats, can be formatted with effort, tools, and time.
  Example: web clickstream data that may contain some inconsistencies in data values and formats
- Unstructured: data that has no inherent structure and is usually stored as different types of files.
  Example: text documents, PDFs, images and video

Business Requirements
Current Business Problems Provide Opportunities for Organizations to Become More Analytical & Data Driven

    Driver                                           Examples
    1 Desire to optimize business operations         Sales, pricing, profitability, efficiency
    2 Desire to identify business risk               Customer churn, fraud, default
    3 Predict new business opportunities             Upsell, cross-sell, best new customer prospects
    4 Comply with laws or regulatory requirements    Anti-Money Laundering, Fair Lending, Basel II

Business Requirements
Analytical Approaches for Meeting Business Drivers
[Figure: Business Intelligence and Data Science plotted by BUSINESS VALUE (Low to High) against TIME (Past to Future)]
Predictive Analytics & Data Mining (Data Science)
- Typical techniques & data types: optimization, predictive modeling, forecasting, statistical analysis; structured/unstructured data, many types of sources, very large data sets
- Common questions: What if..? What's the optimal scenario for our business? What will happen next? What if these trends continue? Why is this happening?
Business Intelligence
- Typical techniques & data types: standard and ad hoc reporting, dashboards, alerts, queries, details on demand; structured data, traditional sources, manageable data sets
- Common questions: What happened last quarter? How many did we sell? Where is the problem? In which situations?

Business Requirements
A typical analytical architecture
[Figure: 1 Data Sources; 2 Departmental Warehouses and Spread Marts; 3 Enterprise Applications and Reporting; 4 Siloed Analytics. Annotations: static schemas accrete over time; non-agile models; prioritized operational processes; non-prioritized data provisioning; errant data & marts]

Business Requirements
Implications of Typical Architecture for Data Science
- High-value data is hard to reach and leverage
- Predictive analytics & data mining activities are last in line for data
  - Queued after prioritized operational processes
- Data is moving in batches from EDW to local analytical tools
  - In-memory analytics (such as R, SAS, SPSS, Excel)
  - Sampling can skew model accuracy
- Isolated, ad hoc analytic projects, rather than centrally-managed harnessing of analytics
  - Non-standardized initiatives
  - Frequently, not aligned with corporate business goals
Result: slow "time-to-insight" and reduced business impact.

Big Data Analytics Development
Data Analytics Lifecycle
1 Discovery -> 2 Data Prep -> 3 Model Planning -> 4 Model Building -> 5 Communicate Results -> 6 Operationalize
Questions asked before moving to the next phase:
- After Discovery: Do I have enough information to draft an analytic plan and share for peer review?
- After Data Prep: Do I have enough good quality data to start building the model?
- After Model Planning: Do I have a good idea about the type of model to try? Can I refine the analytic plan?
- After Model Building: Is the model robust enough? Have we failed for sure?

Big Data Analytics Development (cont.)
Phase 1: Discovery
- Formulate Initial Hypotheses (IH): H1, H2, H3, ... Hn
  - Gather and assess hypotheses from stakeholders and domain experts
  - Preliminary data exploration to inform discussions with stakeholders during the hypothesis forming stage
- Identify Data Sources: Begin Learning the Data
  - Aggregate sources for previewing the data and provide high-level understanding
  - Review the raw data
  - Determine the structures and tools needed
  - Scope the kind of data needed for this kind of problem

Big Data Analytics Development (cont.)
Phase 2: Data Preparation
- Prepare Analytic Sandbox
  - Work space for the analytic team
  - 10x+ vs. EDW
- Perform ELT
  - Determine needed transformations
  - Assess data quality and structuring
  - Derive statistically useful measures
  - Extract data and determine data connections for raw data, OLTP transactions, OLAP cubes or data feeds
  - Big ELT and Big ETL
- Useful tools for this phase (data transformation & cleansing): SQL, Hadoop, MapReduce, Alpine Miner

Big Data Analytics Development (cont.)
Phase 3: Model Planning
- Determine Methods
  - Select methods based on hypotheses, data structure and volume
  - Ensure techniques and approach will meet business objectives
- Techniques & Workflow
  - Candidate tests and sequence
  - Identify and document modeling assumptions
- Useful tools for this phase: R/PostgreSQL, SQL Analytics, Alpine Miner, SAS/ACCESS, SPSS/ODBC

Big Data Analytics Development (cont.)
Phase 4: Model Building
- Develop data sets for testing, training, and production purposes
  - Need to ensure that the model data is sufficiently robust for the model and analytical techniques
  - Smaller test sets for validating the approach, training set for initial experiments
- Get the best environment you can for building models and workflows: fast hardware, parallel processing
- Useful tools for this phase: R, PL/R, SQL, Alpine Miner, SAS Enterprise Miner

Big Data Analytics Development (cont.)
Phase 5: Communicate Results
- Did we succeed? Did we fail?
- Interpret the results
- Compare to the IHs from Phase 1
- Identify key findings
- Quantify business value
- Summarize findings, depending on audience

Big Data Analytics Development (cont.)
Phase 6: Operationalize
- Run a pilot
- Assess the benefits
- Deliver final deliverables
- Execution in production environment
- Define process to update and retrain the model, as needed

IT Solutions - What is Hadoop?
- A framework for performing big data analytics
  - An implementation of the MapReduce paradigm
  - Hadoop glues the storage and analytics together and provides reliability, scalability, and management
- Two main components:
  - Storage (Big Data): HDFS, the Hadoop Distributed File System. Reliable, redundant, distributed file system optimized for large files.
  - MapReduce (Analytics): programming model for processing sets of data. Mapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answer(s).

IT Solutions - Hadoop
- Hadoop (Version 1.0) comprises the MapReduce implementation, along with the Hadoop Distributed File System (HDFS).
- Hadoop (Version 1.0) is good for a number of use cases, including those in which the data can be partitioned into independent chunks (the embarrassingly parallel applications).
- Applications running at Disney, Sears, Walmart, AT&T and more.

IT Solutions - Learning from aggregation
[Figure: a learning algorithm sends queries to the system holding the data and learns from the aggregated answers]
D. Caragea et al., A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. Int. J. Hybrid Intell. Syst.
Chu et al., Map-Reduce for Machine Learning on Multicore. NIPS

IT solutions - MapReduce: Map
Transform (Map) input values to output values: <k1, v1> -> <k2, v2>
- Input: key/value pairs
  - For instance, Key = line number, Value = text string
- Map function: steps to transform input pairs to output pairs
  - For example, count the different words in the input
- Output: key/value pairs
  - For example, Key = <word>, Value = <count>
- Map output is the input to Reduce

IT solutions - Example of MapReduce
[Figure: mapping and reducing stages]

IT solutions - MapReduce: Reduce
Merge (Reduce) values from the Map phase
- Reduce is optional. Sometimes all the work is done in the Mapper.
- Input: values for a given key from all the Mappers
- Reduce function: steps to combine (Sum?, Count?, Print?, ...) the values
- Output: print values? load into a DB? send to the next MapReduce job?

IT Solutions - Map-Reduce abstraction
[Figure: Map turns each record into key/value pairs; Reduce combines all values of one key into a key/value pair]
Example: Word-Count (Dean & Ghemawat, OSDI 04):
    Map(docRecord) {
      for (word in docRecord) {
        emit (word, 1)
      }
    }
    Reduce(word, counts) {
      emit (word, SUM(counts))
    }
Map: idempotent. Reduce: commutative and associative.

Why look beyond Hadoop?
Further hindrances to widespread adoption of Hadoop across enterprises are the following:
- Lack of Open Database Connectivity (ODBC): a lot of BI tools are forced to build separate Hadoop connectors.
- Hadoop's lack of suitability for all types of applications:
  - If data splits are interrelated or computation needs to access data across splits, this might involve joins and might not run efficiently over Hadoop.
  - For iterative computations, Hadoop MapReduce is not well suited, for two reasons:
    - The first is the overhead of fetching data from HDFS for each iteration.
    - The second is the lack of long-lived MapReduce jobs in Hadoop. Typically, the termination condition that determines whether the computation is complete must be checked outside the MapReduce job, so every iteration starts a new MapReduce job and pays its costly initialization.

IT Solutions - Why look beyond Hadoop?
Apache Mahout: open-source library of algorithms on Hadoop
- ALS Matrix Fact.
- SVD
- Random Forests
- LDA
- K-Means
- Naïve Bayes
- PCA
- Spectral Clustering
- Canopy Clustering
- Logistic Regression?

IT Solutions - Why look beyond Hadoop?
Memory Opt. Dataflow View
[Figure: Training Data (HDFS) feeds a chain of Map and Reduce stages; data is moved efficiently between stages]

IT Solutions - Why look beyond Hadoop?
In-Memory Data-Flow Systems
- Common pattern: multi-stage aggregation
- Abstraction: dataflow operations on immutable datasets

What is Spark?
- Fault-tolerant distributed dataflow framework
- Improves efficiency through:
  - In-memory computing primitives
  - Pipelined computation
- Improves usability through:
  - Rich APIs in Scala, Java, Python
  - Interactive shell
- Up to 100x faster (2-10x on disk), 2-5x less code
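The word-count example can be made concrete with a single-process sketch of the MapReduce flow. Nothing here uses Hadoop or Spark: `map_fn`, `shuffle`, `reduce_fn` and `word_count` are hypothetical helper names, and the two input documents are invented.

```python
from collections import defaultdict


def map_fn(doc_record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in doc_record.split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)


def word_count(records):
    pairs = (pair for record in records for pair in map_fn(record))
    return dict(reduce_fn(word, counts) for word, counts in shuffle(pairs).items())


docs = ["big data big insight", "data warehouse"]
counts = word_count(docs)

# Because summation is commutative and associative, partial counts computed
# per document (as a combiner would) can be merged in any order.
partial = [word_count([doc]) for doc in docs]
merged = defaultdict(int)
for part in reversed(partial):
    for word, n in part.items():
        merged[word] += n

print(counts)
```

The final check on `merged` illustrates why the Reduce function is required to be commutative and associative: the framework is then free to combine partial results from different mappers in whatever order they arrive.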

IT Solutions - Why look beyond Hadoop?
Spark Programming Abstraction
- Write programs in terms of transformations on distributed datasets
- Resilient Distributed Datasets (RDDs)
  - Distributed collections of objects that can be stored in memory or on disk
  - Built via parallel transformations (map, filter, ...)
  - Automatically rebuilt on failure

IT Solutions - Why look beyond Hadoop?
Mahout Moves to Spark: on 25 April, "Goodbye MapReduce"
The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We are building our future implementations on top of a DSL for linear algebraic operations which has been developed over the last months. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

IT Solutions - Why look beyond Hadoop?
Other related systems:
- Microsoft Dryad and Naiad
- Hyracks
- Stratosphere
- MADlib
- Hadoop Tez

Conclusions and Perspectives
- TREND: AFTER C-C, availability of BD IN-MEMORY technology, LOWER cost, REAL TIME
- According to McKinsey, a retailer using big data to the full could increase its operating margin by more than 60%.
- Bad data or poor data quality costs US businesses $600 billion annually.
- According to Gartner, Big Data will drive $232 billion in spending through
- By 2015, 4.4 million IT jobs globally will be created to support big data, generating 1.9 million IT jobs in the US.

Open Issues
- Scalable Big/Fast Data Infrastructures
- Diversity in the Data Management Landscape
- End-to-End Processing and Understanding of Data
- Cloud Services
- Roles of Humans in the Data Life Cycle
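The RDD idea of "built via transformations, automatically rebuilt on failure" can be illustrated with a toy, single-process class. This is not the Spark API: `ToyRDD`, `lose_partition` and the other names are invented here, and real RDDs are distributed and evaluated by a cluster scheduler.

```python
class ToyRDD:
    """Immutable dataset that remembers how it was derived (its lineage)."""

    def __init__(self, partitions, lineage=()):
        self._partitions = [list(p) for p in partitions]  # source data
        self._lineage = tuple(lineage)   # recorded transformations
        self._cache = {}                 # computed partitions held "in memory"

    def map(self, fn):
        # Transformations return a new dataset; nothing is computed yet.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._partitions, self._lineage + (("filter", predicate),))

    def _compute(self, index):
        # Replay the lineage on one source partition.
        rows = self._partitions[index]
        for kind, fn in self._lineage:
            if kind == "map":
                rows = [fn(row) for row in rows]
            else:
                rows = [row for row in rows if fn(row)]
        return rows

    def partition(self, index):
        if index not in self._cache:   # missing: rebuild from lineage
            self._cache[index] = self._compute(index)
        return self._cache[index]

    def lose_partition(self, index):
        """Simulate losing one cached partition, e.g. when a node fails."""
        self._cache.pop(index, None)

    def collect(self):
        return [row for i in range(len(self._partitions))
                for row in self.partition(i)]


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = even_squares.collect()
even_squares.lose_partition(1)       # drop cached results of partition 1
recovered = even_squares.collect()   # recomputed from the recorded lineage
print(first, recovered)
```

Because each dataset is immutable and carries its lineage, a lost partition can be recomputed from the source data alone, without replicating intermediate results.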

Chapter 1: Introduction. Database Management System (DBMS)

Chapter 1: Introduction. Database Management System (DBMS) Chapter 1: Introduction Purpose of Database Systems View of Data Data Models Data Definition Language Data Manipulation Language Transaction Management Storage Management Database Administrator Database

More information

Introduction to database management systems

Introduction to database management systems Introduction to database management systems Database management systems module Myself: researcher in INRIA Futurs, Ioana.Manolescu@inria.fr The course: follows (part of) the book "", Fourth Edition Abraham

More information

Chapter 1: Introduction

Chapter 1: Introduction Chapter 1: Introduction Database System Concepts, 5th Ed. See www.db book.com for conditions on re use Chapter 1: Introduction Purpose of Database Systems View of Data Database Languages Relational Databases

More information

Chapter 1: Introduction

Chapter 1: Introduction Chapter 1: Introduction Purpose of Database Systems View of Data Data Models Data Definition Language Data Manipulation Language Transaction Management Storage Management Database Administrator Database

More information

Introdução às Bases de Dados

Introdução às Bases de Dados Introdução às Bases de Dados 2011/12 http://ssdi.di.fct.unl.pt/ibd1112 Joaquim Silva (jfs@di.fct.unl.pt) The Bases de Dados subject Objective: To provide the basis for the modeling, implementation, analysis

More information

Database System Concepts

Database System Concepts s Design Chapter 1: Introduction Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth

More information

Lesson 8: Introduction to Databases E-R Data Modeling

Lesson 8: Introduction to Databases E-R Data Modeling Lesson 8: Introduction to Databases E-R Data Modeling Contents Introduction to Databases Abstraction, Schemas, and Views Data Models Database Management System (DBMS) Components Entity Relationship Data

More information

Data Warehouse: Introduction

Data Warehouse: Introduction Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of base and data mining group,

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Chapter 1: Introduction. Database Management System (DBMS) University Database Example

Chapter 1: Introduction. Database Management System (DBMS) University Database Example This image cannot currently be displayed. Chapter 1: Introduction Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Database Management System (DBMS) DBMS contains information

More information

Introduction to Database Systems. Module 1, Lecture 1. Instructor: Raghu Ramakrishnan raghu@cs.wisc.edu UW-Madison

Introduction to Database Systems. Module 1, Lecture 1. Instructor: Raghu Ramakrishnan raghu@cs.wisc.edu UW-Madison Introduction to Database Systems Module 1, Lecture 1 Instructor: Raghu Ramakrishnan raghu@cs.wisc.edu UW-Madison Database Management Systems, R. Ramakrishnan 1 What Is a DBMS? A very large, integrated

More information

Ramesh Bhashyam Teradata Fellow Teradata Corporation bhashyam.ramesh@teradata.com

Ramesh Bhashyam Teradata Fellow Teradata Corporation bhashyam.ramesh@teradata.com Challenges of Handling Big Data Ramesh Bhashyam Teradata Fellow Teradata Corporation bhashyam.ramesh@teradata.com Trend Too much information is a storage issue, certainly, but too much information is also

More information

www.ijreat.org Published by: PIONEER RESEARCH & DEVELOPMENT GROUP (www.prdg.org) 28

www.ijreat.org Published by: PIONEER RESEARCH & DEVELOPMENT GROUP (www.prdg.org) 28 Data Warehousing - Essential Element To Support Decision- Making Process In Industries Ashima Bhasin 1, Mr Manoj Kumar 2 1 Computer Science Engineering Department, 2 Associate Professor, CSE Abstract SGT

More information

OLAP & DATA MINING CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

OLAP & DATA MINING CS561-SPRING 2012 WPI, MOHAMED ELTABAKH OLAP & DATA MINING CS561-SPRING 2012 WPI, MOHAMED ELTABAKH 1 Online Analytic Processing OLAP 2 OLAP OLAP: Online Analytic Processing OLAP queries are complex queries that Touch large amounts of data Discover

More information

DATA WAREHOUSING AND OLAP TECHNOLOGY

DATA WAREHOUSING AND OLAP TECHNOLOGY DATA WAREHOUSING AND OLAP TECHNOLOGY Manya Sethi MCA Final Year Amity University, Uttar Pradesh Under Guidance of Ms. Shruti Nagpal Abstract DATA WAREHOUSING and Online Analytical Processing (OLAP) are

More information

Data Warehousing and OLAP Technology for Knowledge Discovery

Data Warehousing and OLAP Technology for Knowledge Discovery 542 Data Warehousing and OLAP Technology for Knowledge Discovery Aparajita Suman Abstract Since time immemorial, libraries have been generating services using the knowledge stored in various repositories

More information

How to Enhance Traditional BI Architecture to Leverage Big Data

How to Enhance Traditional BI Architecture to Leverage Big Data B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...

More information

Luncheon Webinar Series May 13, 2013

Luncheon Webinar Series May 13, 2013 Luncheon Webinar Series May 13, 2013 InfoSphere DataStage is Big Data Integration Sponsored By: Presented by : Tony Curcio, InfoSphere Product Management 0 InfoSphere DataStage is Big Data Integration

More information

Advanced In-Database Analytics

Advanced In-Database Analytics Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??

More information

Data Warehousing Systems: Foundations and Architectures

Data Warehousing Systems: Foundations and Architectures Data Warehousing Systems: Foundations and Architectures Il-Yeol Song Drexel University, http://www.ischool.drexel.edu/faculty/song/ SYNONYMS None DEFINITION A data warehouse (DW) is an integrated repository

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA OLAP and OLTP AMIT KUMAR BINDAL Associate Professor Databases Databases are developed on the IDEA that DATA is one of the critical materials of the Information Age Information, which is created by data,

More information

SQL Server 2005 Features Comparison

SQL Server 2005 Features Comparison Page 1 of 10 Quick Links Home Worldwide Search Microsoft.com for: Go : Home Product Information How to Buy Editions Learning Downloads Support Partners Technologies Solutions Community Previous Versions

More information

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Turban, Aronson, and Liang Decision Support Systems and Intelligent Systems, Seventh Edition Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

More information

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc.

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc. Oracle9i Data Warehouse Review Robert F. Edwards Dulcian, Inc. Agenda Oracle9i Server OLAP Server Analytical SQL Data Mining ETL Warehouse Builder 3i Oracle 9i Server Overview 9i Server = Data Warehouse

More information

Course 803401 DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Course 803401 DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Oman College of Management and Technology Course 803401 DSS Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization CS/MIS Department Information Sharing

More information

The Internet of Things and Big Data: Intro

The Internet of Things and Big Data: Intro The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific

More information

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata BIG DATA: FROM HYPE TO REALITY Leandro Ruiz Presales Partner for C&LA Teradata Evolution in The Use of Information Action s ACTIVATING MAKE it happen! Insights OPERATIONALIZING WHAT IS happening now? PREDICTING

More information

Data Integration Checklist

Data Integration Checklist The need for data integration tools exists in every company, small to large. Whether it is extracting data that exists in spreadsheets, packaged applications, databases, sensor networks or social media

More information

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics BIG DATA & ANALYTICS Transforming the business and driving revenue through big data and analytics Collection, storage and extraction of business value from data generated from a variety of sources are

More information

ECS 165A: Introduction to Database Systems

ECS 165A: Introduction to Database Systems ECS 165A: Introduction to Database Systems Todd J. Green based on material and slides by Michael Gertz and Bertram Ludäscher Winter 2011 Dept. of Computer Science UC Davis ECS-165A WQ 11 1 1. Introduction

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional

More information

IST722 Data Warehousing

IST722 Data Warehousing IST722 Data Warehousing Components of the Data Warehouse Michael A. Fudge, Jr. Recall: Inmon s CIF The CIF is a reference architecture Understanding the Diagram The CIF is a reference architecture CIF

More information

Introduction to Database Systems CS4320. Instructor: Christoph Koch koch@cs.cornell.edu CS 4320 1

Introduction to Database Systems CS4320. Instructor: Christoph Koch koch@cs.cornell.edu CS 4320 1 Introduction to Database Systems CS4320 Instructor: Christoph Koch koch@cs.cornell.edu CS 4320 1 CS4320/1: Introduction to Database Systems Underlying theme: How do I build a data management system? CS4320

More information

Logistics. Database Management Systems. Chapter 1. Project. Goals for This Course. Any Questions So Far? What This Course Cannot Do.

Logistics. Database Management Systems. Chapter 1. Project. Goals for This Course. Any Questions So Far? What This Course Cannot Do. Database Management Systems Chapter 1 Mirek Riedewald Many slides based on textbook slides by Ramakrishnan and Gehrke 1 Logistics Go to http://www.ccs.neu.edu/~mirek/classes/2010-f- CS3200 for all course-related

More information

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

Chapter 5. Warehousing, Data Acquisition, Data. Visualization Decision Support Systems and Intelligent Systems, Seventh Edition Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization 5-1 Learning Objectives

More information

Moving Large Data at a Blinding Speed for Critical Business Intelligence. A competitive advantage

Moving Large Data at a Blinding Speed for Critical Business Intelligence. A competitive advantage Moving Large Data at a Blinding Speed for Critical Business Intelligence A competitive advantage Intelligent Data In Real Time How do you detect and stop a Money Laundering transaction just about to take

More information

COMP5138 Relational Database Management Systems. Databases are Everywhere!

COMP5138 Relational Database Management Systems. Databases are Everywhere! COMP5138 Relational Database Management Systems Week 1: COMP 5138 Intro to Database Systems Professor Joseph Davis and Boon Ooi Databases are Everywhere! Database Application Examples: Banking: all transactions

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

Enterprise Data Warehouse (EDW) UC Berkeley Peter Cava Manager Data Warehouse Services October 5, 2006

Enterprise Data Warehouse (EDW) UC Berkeley Peter Cava Manager Data Warehouse Services October 5, 2006 Enterprise Data Warehouse (EDW) UC Berkeley Peter Cava Manager Data Warehouse Services October 5, 2006 What is a Data Warehouse? A data warehouse is a subject-oriented, integrated, time-varying, non-volatile

More information

14. Data Warehousing & Data Mining

14. Data Warehousing & Data Mining 14. Data Warehousing & Data Mining Data Warehousing Concepts Decision support is key for companies wanting to turn their organizational data into an information asset Data Warehouse "A subject-oriented,

More information

Traditional BI vs. Business Data Lake A comparison

Traditional BI vs. Business Data Lake A comparison Traditional BI vs. Business Data Lake A comparison The need for new thinking around data storage and analysis Traditional Business Intelligence (BI) systems provide various levels and kinds of analyses

More information

ORACLE OLAP. Oracle OLAP is embedded in the Oracle Database kernel and runs in the same database process

ORACLE OLAP. Oracle OLAP is embedded in the Oracle Database kernel and runs in the same database process ORACLE OLAP KEY FEATURES AND BENEFITS FAST ANSWERS TO TOUGH QUESTIONS EASILY KEY FEATURES & BENEFITS World class analytic engine Superior query performance Simple SQL access to advanced analytics Enhanced

More information

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015 Pulsar Realtime Analytics At Scale Tony Ng April 14, 2015 Big Data Trends Bigger data volumes More data sources DBs, logs, behavioral & business event streams, sensors Faster analysis Next day to hours

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2

More information

Reference Architecture, Requirements, Gaps, Roles

Reference Architecture, Requirements, Gaps, Roles Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture

More information

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D. Big Data Technology ดร.ช ชาต หฤไชยะศ กด Choochart Haruechaiyasak, Ph.D. Speech and Audio Technology Laboratory (SPT) National Electronics and Computer Technology Center (NECTEC) National Science and Technology

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Mike Maxey. Senior Director Product Marketing Greenplum A Division of EMC. Copyright 2011 EMC Corporation. All rights reserved.

Mike Maxey. Senior Director Product Marketing Greenplum A Division of EMC. Copyright 2011 EMC Corporation. All rights reserved. Mike Maxey Senior Director Product Marketing Greenplum A Division of EMC 1 Greenplum Becomes the Foundation of EMC s Big Data Analytics (July 2010) E M C A C Q U I R E S G R E E N P L U M For three years,

More information

Emerging Technologies Shaping the Future of Data Warehouses & Business Intelligence

Emerging Technologies Shaping the Future of Data Warehouses & Business Intelligence Emerging Technologies Shaping the Future of Data Warehouses & Business Intelligence Appliances and DW Architectures John O Brien President and Executive Architect Zukeran Technologies 1 TDWI 1 Agenda What

More information

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ End to End Solution to Accelerate Data Warehouse Optimization Franco Flore Alliance Sales Director - APJ Big Data Is Driving Key Business Initiatives Increase profitability, innovation, customer satisfaction,

More information

Testing Big data is one of the biggest

Testing Big data is one of the biggest Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

More information

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP Data Warehousing and End-User Access Tools OLAP and Data Mining Accompanying growth in data warehouses is increasing demands for more powerful access tools providing advanced analytical capabilities. Key

More information

Data Mining in the Swamp

Data Mining in the Swamp WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all

More information

SAS BI Course Content; Introduction to DWH / BI Concepts

SAS BI Course Content; Introduction to DWH / BI Concepts SAS BI Course Content; Introduction to DWH / BI Concepts SAS Web Report Studio 4.2 SAS EG 4.2 SAS Information Delivery Portal 4.2 SAS Data Integration Studio 4.2 SAS BI Dashboard 4.2 SAS Management Console

More information

Apache Kylin Introduction Dec 8, 2014 @ApacheKylin

Apache Kylin Introduction Dec 8, 2014 @ApacheKylin Apache Kylin Introduction Dec 8, 2014 @ApacheKylin Luke Han Sr. Product Manager lukhan@ebay.com @lukehq Yang Li Architect & Tech Leader yangli9@ebay.com Agenda What s Apache Kylin? Tech Highlights Performance

More information

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367 HBase A Comprehensive Introduction James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367 Overview Overview: History Began as project by Powerset to process massive

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

Agile Business Intelligence Data Lake Architecture

Agile Business Intelligence Data Lake Architecture Agile Business Intelligence Data Lake Architecture TABLE OF CONTENTS Introduction... 2 Data Lake Architecture... 2 Step 1 Extract From Source Data... 5 Step 2 Register And Catalogue Data Sets... 5 Step

More information

Week 3 lecture slides

Week 3 lecture slides Week 3 lecture slides Topics Data Warehouses Online Analytical Processing Introduction to Data Cubes Textbook reference: Chapter 3 Data Warehouses A data warehouse is a collection of data specifically

More information

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap 3 key strategic advantages, and a realistic roadmap for what you really need, and when 2012, Cognizant Topics to be discussed

More information

DRIVING THE CHANGE ENABLING TECHNOLOGY FOR FINANCE 15 TH FINANCE TECH FORUM SOFIA, BULGARIA APRIL 25 2013

DRIVING THE CHANGE ENABLING TECHNOLOGY FOR FINANCE 15 TH FINANCE TECH FORUM SOFIA, BULGARIA APRIL 25 2013 DRIVING THE CHANGE ENABLING TECHNOLOGY FOR FINANCE 15 TH FINANCE TECH FORUM SOFIA, BULGARIA APRIL 25 2013 BRAD HATHAWAY REGIONAL LEADER FOR INFORMATION MANAGEMENT AGENDA Major Technology Trends Focus on

More information

LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES

LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES LITERATURE SURVEY ON DATA WAREHOUSE AND ITS TECHNIQUES MUHAMMAD KHALEEL (0912125) SZABIST KARACHI CAMPUS Abstract. Data warehouse and online analytical processing (OLAP) both are core component for decision

More information

Business Intelligence Solutions. Cognos BI 8. by Adis Terzić

Business Intelligence Solutions. Cognos BI 8. by Adis Terzić Business Intelligence Solutions Cognos BI 8 by Adis Terzić Fairfax, Virginia August, 2008 Table of Content Table of Content... 2 Introduction... 3 Cognos BI 8 Solutions... 3 Cognos 8 Components... 3 Cognos

More information

Big Data and Data Science: Behind the Buzz Words

Big Data and Data Science: Behind the Buzz Words Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing

More information

Part 22. Data Warehousing

Part 22. Data Warehousing Part 22 Data Warehousing The Decision Support System (DSS) Tools to assist decision-making Used at all levels in the organization Sometimes focused on a single area Sometimes focused on a single problem

More information

How To Improve Performance In A Database

How To Improve Performance In A Database Some issues on Conceptual Modeling and NoSQL/Big Data Tok Wang Ling National University of Singapore 1 Database Models File system - field, record, fixed length record Hierarchical Model (IMS) - fixed

More information

European Archival Records and Knowledge Preservation Database Archiving in the E-ARK Project

European Archival Records and Knowledge Preservation Database Archiving in the E-ARK Project European Archival Records and Knowledge Preservation Database Archiving in the E-ARK Project Janet Delve, University of Portsmouth Kuldar Aas, National Archives of Estonia Rainer Schmidt, Austrian Institute

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

Databricks. A Primer

Databricks. A Primer Databricks A Primer Who is Databricks? Databricks was founded by the team behind Apache Spark, the most active open source project in the big data ecosystem today. Our mission at Databricks is to dramatically

More information

Topics. Introduction to Database Management System. What Is a DBMS? DBMS Types

Topics. Introduction to Database Management System. What Is a DBMS? DBMS Types Introduction to Database Management System Linda Wu (CMPT 354 2004-2) Topics What is DBMS DBMS types Files system vs. DBMS Advantages of DBMS Data model Levels of abstraction Transaction management DBMS

More information

B.Sc (Computer Science) Database Management Systems UNIT-V

B.Sc (Computer Science) Database Management Systems UNIT-V 1 B.Sc (Computer Science) Database Management Systems UNIT-V Business Intelligence? Business intelligence is a term used to describe a comprehensive cohesive and integrated set of tools and process used

More information

Introducing Oracle Exalytics In-Memory Machine

Introducing Oracle Exalytics In-Memory Machine Introducing Oracle Exalytics In-Memory Machine Jon Ainsworth Director of Business Development Oracle EMEA Business Analytics 1 Copyright 2011, Oracle and/or its affiliates. All rights Agenda Topics Oracle

More information

Cisco Data Preparation

Cisco Data Preparation Data Sheet Cisco Data Preparation Unleash your business analysts to develop the insights that drive better business outcomes, sooner, from all your data. As self-service business intelligence (BI) and

More information

1 File Processing Systems

1 File Processing Systems COMP 378 Database Systems Notes for Chapter 1 of Database System Concepts Introduction A database management system (DBMS) is a collection of data and an integrated set of programs that access that data.

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART A: Architecture Chapter 1: Motivation and Definitions Motivation Goal: to build an operational general view on a company to support decisions in

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Content Problems of managing data resources in a traditional file environment Capabilities and value of a database management

More information

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Datenverwaltung im Wandel - Building an Enterprise Data Hub with Datenverwaltung im Wandel - Building an Enterprise Data Hub with Cloudera Bernard Doering Regional Director, Central EMEA, Cloudera Cloudera Your Hadoop Experts Founded 2008, by former employees of Employees

More information

Applied Business Intelligence. Iakovos Motakis, Ph.D. Director, DW & Decision Support Systems Intrasoft SA

Applied Business Intelligence. Iakovos Motakis, Ph.D. Director, DW & Decision Support Systems Intrasoft SA Applied Business Intelligence Iakovos Motakis, Ph.D. Director, DW & Decision Support Systems Intrasoft SA Agenda Business Drivers and Perspectives Technology & Analytical Applications Trends Challenges

More information

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture

Apps and data source extensions with APIs Future white label, embed or integrate Power BI Deploy Intelligent

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

Data Warehouse. Subject Oriented: Organized around major subjects, such as customer, product, sales. Focusing on

ANALYTICS CENTER LEARNING PROGRAM

Overview of Curriculum. The following courses are offered by Analytics Center as part of its learning program: Course Duration Prerequisites 1- Math and Theory 101 - Fundamentals

Decoding the Big Data Deluge a Virtual Approach. Dan Luongo, Global Lead, Field Solution Engineering Data Virtualization Business Unit, Cisco

High-volume, velocity and variety information assets that demand

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark-project.org. University of California, Berkeley

My Background: Grad student in the AMP Lab at UC Berkeley » 50-person

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem:

VIDEO CASES Chapter 6 Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

Safe Harbor Statement

"Safe Harbor" Statement: Statements in this presentation relating to Oracle's future plans, expectations, beliefs, intentions and prospects are "forward-looking statements" and are

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Dr. Liangxiu Han, Future Networks and Distributed Systems Group (FUNDS), School of Computing, Mathematics and Digital Technology,

Enhancing the Performance and Analytic Content of the Data Warehouse Using Oracle OLAP Option

The following is intended to outline our general product direction. It is intended for

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Wayne W. Eckerson, Director of Research, TechTarget; Founder, BI Leadership Forum. Business Analytics
