
DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS

The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed that data is memory resident, as for example in the Apriori algorithm for association rule mining or in basic clustering algorithms. Such algorithms must be modified to work efficiently with large persistent data sets. Moreover, much data is no longer held in persistent form in data warehouses at all: it is generated at such a rate that storing it in that way is infeasible, and it is instead represented by transient data streams. This necessitates further modified solutions to the data mining techniques we have studied so far.

As scalable architectures based on commodity hardware, such as Hadoop, have emerged, techniques to support data mining on these architectures have been developed. Architectures are also required which support analytics with both real-time processing of stream data and batch-oriented processing of historical data. Another problem is that once a data mining model has been developed, there has traditionally been no mechanism for that model to be reused programmatically by other applications on other data sets; hence, standards for data mining model exchange have been developed. Finally, data mining has traditionally been performed by dumping data from the data warehouse to an external file, which is then transformed and mined. This results in a series of files for each data mining application, with the attendant problems of data redundancy, inconsistency and data dependence which database technology was designed to overcome. Hence, techniques and standards for tighter integration of database and data mining technology have been developed.

DATA MINING OF LARGE DATA SETS

Algorithms for classification, clustering and association rule mining are considered.

CLASSIFYING LARGE DATA SETS: SUPPORT VECTOR MACHINES

To reduce the computational cost of solving the SVM optimization problem with large training sets, chunking may be used: the training set is partitioned into chunks, each of which fits into memory, and the support vector parameters are computed iteratively chunk by chunk. However, multiple passes over the data are required to obtain an optimal solution. Another approach is squashing, in which the SVM is trained over clusters derived from the original training set, the clusters reflecting the distribution of the original training records. A further approach reformulates the optimization problem so that it can be solved by efficient iterative algorithms.
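The last of these approaches, solving the optimization iteratively, can be illustrated with a minimal sketch: a Pegasos-style stochastic sub-gradient trainer for a linear SVM. The data layout, parameter values and function names below are illustrative, not from the original text; the relevant property is that each update touches a single record, so the training set can be streamed from disk rather than held in memory.

```python
import random

def pegasos_train(data, labels, lam=0.01, epochs=2000, seed=0):
    """Train a linear SVM by stochastic sub-gradient descent
    (Pegasos-style). Labels are +1/-1; a constant 1.0 feature in
    each record plays the role of the bias term."""
    rng = random.Random(seed)
    dim = len(data[0])
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(data)), len(data)):
            t += 1
            eta = 1.0 / (lam * t)          # decaying learning rate
            x, y = data[i], labels[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # sub-gradient of hinge loss plus L2 regularizer
            if margin < 1:
                w = [(1 - eta * lam) * wj + eta * y * xj
                     for wj, xj in zip(w, x)]
            else:
                w = [(1 - eta * lam) * wj for wj in w]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```

Because the optimum depends only on a running weight vector, this formulation sidesteps the memory-resident kernel matrix of the classical quadratic-programming solution.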

CLUSTERING LARGE DATA SETS: K-MEANS

Unless there is sufficient main memory to hold the data being clustered, the data scan at each iteration of the k-means algorithm will be very costly. An approach suitable for large databases would:
- perform at most one scan of the database;
- work with limited memory.

Approaches include the following. One is to identify three kinds of data objects:
- those which are discardable, because their membership of a cluster has been established;
- those which are compressible: while not discardable, they belong to a well-defined subcluster which can be characterized in a compact structure;
- those which are neither discardable nor compressible, and must be retained in main memory.

Alternatively, data objects may first be grouped into microclusters, with k-means clustering then performed on those microclusters.

An approach developed at Microsoft uses such techniques as follows:
1. Read a sample subset of data from the database.
2. Cluster that data with the existing model as usual to produce an updated model.
3. On the basis of the updated model, decide for each data item from the sample whether it should be:
   - retained in memory;
   - discarded, with summary information being updated;
   - retained in a compressed form as summary information.
4. Repeat from 1 until a termination condition is met.
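A much-simplified sketch of this loop follows, assuming numeric feature vectors and folding every sampled point straight into per-cluster summary statistics (count and coordinate sums) rather than maintaining the full retain/discard/compress sets. Memory use is O(k) regardless of database size; all names are illustrative.

```python
def scalable_kmeans(read_batch, k, n_batches):
    """Single-pass clustering sketch: each call to read_batch() plays
    the role of step 1 (sampling from the database); assignment to the
    nearest centroid is step 2; folding the point into count/sum
    summaries and discarding it is step 3."""
    centroids = counts = sums = None
    for _ in range(n_batches):
        batch = read_batch()
        if centroids is None:
            # initialise centroids from the first k points seen
            centroids = [list(p) for p in batch[:k]]
            counts = [0] * k
            sums = [[0.0] * len(p) for p in batch[:k]]
        for point in batch:
            # nearest centroid by squared Euclidean distance
            j = min(range(k),
                    key=lambda c: sum((p - q) ** 2
                                      for p, q in zip(point, centroids[c])))
            counts[j] += 1
            sums[j] = [s + p for s, p in zip(sums[j], point)]
            # centroid is the running mean of its summary statistics
            centroids[j] = [s / counts[j] for s in sums[j]]
    return centroids
```

The raw points are never kept: once a point has updated the summary of its cluster, it is discarded, which is what makes the scheme single-scan.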

ASSOCIATION MINING OF LARGE DATA SETS: APRIORI

With one database scan for each itemset size tested, the cost of scans would be prohibitive for the Apriori algorithm unless the database is memory resident. Approaches to enhance the efficiency of Apriori include the following.

While generating 1-itemsets, generate the 2-itemsets of each transaction at the same time, hashing them into a hash table of bucket counts. All buckets whose final count of itemsets is less than the minimum support threshold can be ignored subsequently, since no itemset hashed to such a bucket can itself have the minimum required support.
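This hash-based pruning of candidate pairs (the DHP refinement of Apriori) can be sketched as follows; the function and parameter names are illustrative. Buckets may still admit false candidates (several pairs can share a bucket), but no frequent pair is ever lost, so correctness is preserved while many candidates are eliminated before the pair-counting scan.

```python
from collections import Counter
from itertools import combinations

def hash_pruned_pairs(transactions, min_support, n_buckets=7):
    """Count 1-itemsets and, in the same pass, hash every 2-itemset of
    each transaction into a small table of bucket counts. A pair can
    only be frequent if its bucket total reaches min_support, so whole
    buckets are discarded before candidate pairs are counted."""
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for t in transactions:
        item_counts.update(t)
        for pair in combinations(sorted(t), 2):
            bucket_counts[hash(pair) % n_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    # keep only candidate pairs whose bucket passed the threshold
    return [pair
            for pair in combinations(sorted(frequent_items), 2)
            if bucket_counts[hash(pair) % n_buckets] >= min_support]
```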

Overlap the testing of k-itemsets and (k+1)-itemsets by counting (k+1)-itemsets in parallel with counting k-itemsets. Unlike conventional Apriori, in which candidate (k+1)-itemsets are only generated after the k-itemset database scan has completed, in this approach the database scan is divided into blocks, and candidate (k+1)-itemsets can begin to be generated and counted at block boundaries while the k-itemset scan is still in progress.

Only two database scans are needed if a partitioning approach is adopted, under which transactions are divided into n partitions, each of which can be held in memory. In the first scan, frequent itemsets are generated for each partition; these are combined to create a list of candidate frequent itemsets for the database as a whole. In the second scan, the actual support for members of the candidate list is checked.
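The two-scan partitioning scheme can be sketched as follows. The helper enumerates itemsets of a memory-resident partition exhaustively, which is only sensible for small itemset sizes; the names and the max_size cap are illustrative. The scheme is lossless because any globally frequent itemset must be locally frequent in at least one partition.

```python
import math
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_count, max_size=3):
    """Exhaustively count itemsets of one memory-resident partition."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            counts.update(combinations(sorted(t), size))
    return {s for s, c in counts.items() if c >= min_count}

def partitioned_apriori(partitions, min_fraction, max_size=3):
    # scan 1: locally frequent itemsets become global candidates
    candidates = set()
    for part in partitions:
        local_min = math.ceil(min_fraction * len(part))
        candidates |= frequent_itemsets(part, local_min, max_size)
    # scan 2: verify actual support against the whole database
    total = sum(len(p) for p in partitions)
    counts = Counter()
    for part in partitions:
        for t in part:
            ts = set(t)
            for cand in candidates:
                if set(cand) <= ts:
                    counts[cand] += 1
    return {s for s in candidates if counts[s] >= min_fraction * total}
```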

Alternatively, pick a random sample of transactions which will fit in memory and search for frequent itemsets in that sample. This may result in some global frequent itemsets being missed. The chance of this happening can be lessened by adopting a lower minimum support threshold for the sample, with the remainder of the database then being used to check the actual support of the candidate itemsets. A second database scan may be needed to ensure no frequent itemsets have been missed.
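A sketch of this sampling approach follows, assuming (as an arbitrary illustrative choice) a lowered sample threshold of 75% of the global one, and checking itemsets of sizes one and two only; all names are illustrative. Because every returned itemset has had its support verified against the full database, the result can contain no false positives; the residual risk is only of missed itemsets, which the second full scan mentioned above would catch.

```python
import random
from collections import Counter
from itertools import combinations

def sampled_candidates(transactions, min_fraction, sample_size,
                       slack=0.75, seed=1):
    """Mine a random sample at a lowered threshold, then verify the
    support of the candidates found with one scan of the full database."""
    rng = random.Random(seed)
    sample = rng.sample(transactions, sample_size)
    # mine the sample at the lowered support threshold
    counts = Counter()
    for t in sample:
        for size in (1, 2):
            counts.update(combinations(sorted(t), size))
    lowered = min_fraction * slack * sample_size
    candidates = {s for s, c in counts.items() if c >= lowered}
    # verification scan over the whole database
    full = Counter()
    for t in transactions:
        ts = set(t)
        for cand in candidates:
            if set(cand) <= ts:
                full[cand] += 1
    return {s for s in candidates
            if full[s] >= min_fraction * len(transactions)}
```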

An alternative approach to increasing efficiency, FP-growth, holds frequent itemsets in compressed form in a prefix-tree structure known as an FP-tree. Using this, only two database scans are needed to identify all frequent itemsets: the first scan identifies the frequent items; the second builds the FP-tree, in which each path from the root represents a frequent itemset. Consider again the set of transactions seen when the Apriori algorithm was introduced, but with two additional transactions T5 and T6:

Trans_id  List_of_items
T1        I1,I3,I4
T2        I2,I3,I5
T3        I1,I2,I3,I5
T4        I2,I5
T5        I3,I5
T6        I1,I4,I5

Assume frequent itemsets with minimum support 50% are required.

The first database scan is the same as in Apriori, generating the 1-itemsets and their support. These are then stored in a list in descending order of support count:

I5  5
I3  4
I2  3
I1  3
I4  2

The second database scan builds the FP-tree by processing the items in each transaction in the order of the list generated in the first scan. A branch is created for each transaction, sharing paths from the root with previously constructed branches containing the same items. Each node, representing an item, also contains a count, which is incremented as each transaction including that node is added. Finally, a linked list is created from each item in the list generated in the first scan to the nodes representing that item in the tree.

Hence, after the second scan of transactions T1-T6 the following structure results (the item list I5:5, I3:4, I2:3, I1:3 links to the corresponding nodes; I4 is excluded as it has insufficient support):

root:6
├── I5:5
│   ├── I3:3
│   │   └── I2:2
│   │       └── I1:1
│   ├── I2:1
│   └── I1:1
└── I3:1
    └── I1:1

FP-growth then processes the FP-tree to identify frequent itemsets.
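The two-scan FP-tree construction can be sketched in Python as follows, using the transactions above with minimum support count 3. Class and function names are illustrative, and ties in support (I1 and I2 both have count 3) are broken alphabetically here, so the shape of the subtree below I5-I3 may differ from the slide's figure even though the counts at the upper levels agree.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fptree(transactions, min_count):
    counts = Counter(i for t in transactions for i in t)         # first scan
    order = {i: c for i, c in counts.items() if c >= min_count}  # frequent items
    root, header = FPNode(None, None), {}
    for t in transactions:                                       # second scan
        # keep frequent items only, in descending order of support
        items = sorted((i for i in t if i in order),
                       key=lambda i: (-order[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)  # per-item node list
            node = node.children[item]
            node.count += 1
    return root, header, order

transactions = [["I1", "I3", "I4"], ["I2", "I3", "I5"],
                ["I1", "I2", "I3", "I5"], ["I2", "I5"],
                ["I3", "I5"], ["I1", "I4", "I5"]]
root, header, order = build_fptree(transactions, min_count=3)
```

The header dictionary plays the role of the linked lists from the item list into the tree, which FP-growth follows when extracting frequent itemsets.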

STREAM MINING

Data streams arise in many application areas, including real-time sensors, internet activity and scientific experiments. Such data is characterized by being:
- infinite;
- arriving continuously;
- arriving at a high rate;
- arriving in a time sequence which is itself significant.

It is not possible for such data to be stored in a conventional database or data warehouse: the data volumes and stream flow rates are too great. Hence, new techniques with new data structures and algorithms are needed.

These techniques include the following.
- Sampling: a probabilistic method is used to choose which data items within the stream should be processed.
- Load shedding: a sequence of the data stream is not processed at all.
- Sketching: only a subset of the information within the stream is processed.
- Synopsis data structures: summarization techniques are used to hold information about the stream in a more space-efficient form.
- Aggregation: statistical measures are used to summarize the data stream.
- Approximation algorithms: given the computational complexity of exact mining algorithms, algorithms which give approximate solutions with defined error bounds are sought instead.
- Sliding window: analysis is restricted to the most recent data within the stream.
- Algorithm output granularity: the development of algorithms which adapt to available resources such as processor load and memory availability.

These techniques have been used in the development of classical data mining tasks such as clustering, classification and frequent itemset identification within stream data. For example, in the case of frequent itemset identification, the FP-tree structure can be used to hold itemset information which is updated incrementally as transaction data arrives. As the first batch of transactions is received, the ordering of items can be determined on the basis of their support, as in the FP-growth algorithm seen above, and an FP-tree created. This ordering is not subsequently changed. As subsequent batches of transactions are received, the FP-tree is updated to reflect the revised support for itemsets. Since it is infeasible to store itemset information for all batches indefinitely, a tilted window approach is often used: as new batches of transactions are received, itemsets are computed for those new batches individually, while itemset information for older batches is aggregated over progressively longer time periods.
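The tilted-window aggregation can be sketched as follows. For simplicity this tracks only a scalar transaction count per batch; in frequent-itemset mining each stored entry would instead be a table of itemset counts. The logarithmic merge policy and all names are illustrative.

```python
def add_batch(windows, batch_count, max_per_level=2):
    """Logarithmic tilted-time windows: level 0 holds per-batch counts;
    whenever a level exceeds max_per_level entries, its two oldest
    entries are merged (summed) into the next, coarser level. Hence n
    batches need only O(log n) stored summaries, with recent history
    kept at fine granularity and older history at coarse granularity."""
    windows.setdefault(0, []).append(batch_count)   # newest, finest level
    level = 0
    while len(windows[level]) > max_per_level:
        # merge the two oldest summaries into the next level up
        merged = windows[level].pop(0) + windows[level].pop(0)
        windows.setdefault(level + 1, []).append(merged)
        level += 1
    return windows
```

After eight unit batches, for example, the structure holds two fine-grained recent counts, one mid-range summary of two batches and one coarse summary of four.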

DATA MINING ARCHITECTURE TRENDS

The Hadoop technologies already seen supporting data warehousing are increasingly being used to support data mining and analytics. For example, Pig has been used extensively in organisations including Twitter to support data mining and analytics. With the increasing importance of stream data, a particular architectural challenge is how to support the real-time processing required for stream data as well as the batch-oriented processing needed for historical data: the Hadoop technologies seen so far were originally designed with the latter in mind. One influential approach has been the lambda architecture, which consists of three layers, with all incoming data dispatched to both the batch and speed layers:
- a batch layer manages the append-only master data set and precomputes views on it;
- a serving layer indexes these views to enable ad-hoc, low-latency queries;
- a speed layer deals with recent data only, to avoid the high latency of updates to the serving layer.

Hadoop technologies are well suited to supporting the requirements of the batch layer. Column store or NoSQL technologies are often used to support the storage requirements of the serving layer. Stream technologies are required to support the speed layer, for example Storm: https://storm.apache.org/

Storm is a distributed computing framework designed for real-time stream processing. The processing model is similar to MapReduce in that it supports a graph-oriented workflow. However, while the MapReduce model is batch-oriented, Storm has been designed to support real-time processing of stream rather than batch data, with streams continuing indefinitely. Storm was developed at BackType, which was acquired by Twitter; it was developed and deployed for internal use before being made open source.

DATA MINING STANDARDS

Data mining standards, and related standards for data grids, web services and the semantic web, enable the easier deployment of data mining applications across platforms. Standards cover:
- the overall KDD process;
- metadata interchange with data warehousing applications;
- the representation of data cleaning, data reduction and transformation processes;
- the representation of data mining models;
- APIs for performing data mining processes from other languages, including SQL and Java.

CRISP-DM; PREDICTIVE MODEL MARKUP LANGUAGE (PMML)

CRISP-DM (CRoss Industry Standard Process for Data Mining) specifies a process model covering the following six phases of the KDD process: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment.

PMML is an XML-based standard developed by the Data Mining Group (www.dmg.org), a consortium of data mining product vendors. PMML represents data mining models as well as operations for cleaning and transforming data prior to modeling. The aim is to enable an application to produce a data mining model in a form (PMML XML) which another data mining application can read and apply.

The slides next show the PMML representation of an example association rules model for a set of transaction data; the listing itself is not reproduced in this transcription.

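In place of the missing listing, the following is a hedged, illustrative sketch of what such a PMML association model looks like, built around the rule I3 => I5 from the transaction data used earlier (support 3/6 = 0.5, confidence 3/4 = 0.75). The element names follow the PMML AssociationModel schema, but the attribute details here are a best-effort reconstruction and should be checked against the specification at www.dmg.org; the document is parsed with Python's standard xml.etree module simply to show it is well formed.

```python
import xml.etree.ElementTree as ET

PMML_EXAMPLE = """\
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header copyright="example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="transaction" optype="categorical" dataType="string"/>
    <DataField name="item" optype="categorical" dataType="string"/>
  </DataDictionary>
  <AssociationModel functionName="associationRules"
                    numberOfTransactions="6" minimumSupport="0.5"
                    minimumConfidence="0.6" numberOfItems="2"
                    numberOfItemsets="3" numberOfRules="1">
    <MiningSchema>
      <MiningField name="transaction" usageType="group"/>
      <MiningField name="item" usageType="active"/>
    </MiningSchema>
    <Item id="1" value="I3"/>
    <Item id="2" value="I5"/>
    <Itemset id="1"><ItemRef itemRef="1"/></Itemset>
    <Itemset id="2"><ItemRef itemRef="2"/></Itemset>
    <Itemset id="3"><ItemRef itemRef="1"/><ItemRef itemRef="2"/></Itemset>
    <AssociationRule support="0.5" confidence="0.75"
                     antecedent="1" consequent="2"/>
  </AssociationModel>
</PMML>
"""

model = ET.fromstring(PMML_EXAMPLE)
```

The data dictionary and mining schema sections correspond to the first two PMML components described below, and the AssociationModel element to the model itself.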

The slides then show the association model XML schema specification, likewise not reproduced in this transcription.


The components of a PMML document are the following (the first two and the last being used in the example model above):
- Data dictionary: defines a model's input attributes, with their types and value ranges.
- Mining schema: defines the attributes and roles specific to a particular model.
- Transformation dictionary: defines the following mappings: normalization (continuous or discrete values to numbers), discretization (continuous to discrete values), value mapping (discrete to discrete values) and aggregation (grouping values, as in SQL).
- Model statistics: statistics about individual attributes.
- Models: includes regression models, cluster models, association rules, neural networks, Bayesian models and sequence models.

PMML is used within the standards CWM, SQL/MM Part 6 Data Mining, JDM and MS Analysis Services (OLE DB for Data Mining), providing a degree of compatibility between them all.

COMMON WAREHOUSE METAMODEL (CWM)

CWM supports the interchange of warehouse and business intelligence metadata between warehouse tools, warehouse platforms and warehouse metadata repositories in distributed heterogeneous environments.

SQL/MM DATA MINING

The SQL Multimedia and Application Packages standard (SQL/MM) Part 6 specifies an SQL interface to data mining applications and services through SQL:1999 user-defined types, as follows:
- user-defined types for the four data mining functions: association rules, clustering, classification and regression;

- routines to manipulate these user-defined types, allowing:
  - setting of parameters for mining activities;
  - training of mining models, in which a particular mining technique is chosen, parameters for that technique are set, and the mining model is built with training data sets;
  - testing of mining models (applicable only to regression and classification models), in which the trained model is evaluated by comparison with results for known data;
  - application of mining models, in which the model is applied to new data to cluster, predict or classify as appropriate (this phase is not applicable to rule models, in which the rules are determined during the training phase);
- user-defined types for data structures common across these data mining models;
- functions to capture metadata for data mining input.

For example, for the association rule model type DM_RuleModel the following methods are supported:

DM_impRuleModel(CHARACTER LARGE OBJECT(DM_MaxContentLength))
    Import a rule model expressed as PMML; returns a DM_RuleModel.
DM_expRuleModel()
    Export the rule model as PMML.
DM_getNORules()
    Return the number of rules.
DM_getRuleTask()
    Return the data mining task value, data mining settings, etc.

JAVA DATA MINING (JDM)

Java Data Mining (http://www.jcp.org/en/jsr/detail?id=73) is a Java API developed under the Java Community Process supporting common data mining operations, as well as the metadata supporting mining activities. JDM 1.0 supports the following mining functions: classification, regression, attribute importance (ranking), clustering and association rules. JDM 1.0 supports the following tasks: model building, testing, application and model import/export. JDM does not support tasks such as data transformation, visualization and the mining of unstructured data. JDM has been designed so that its metadata maps closely to PMML, to provide support for the generation of XML for mining models. Likewise, its metadata maps closely to CWM, to support the generation of XML for mining tasks. The JDM API maps closely to SQL/MM Data Mining, to support an implementation of JDM on top of SQL/MM.

OLE DB FOR DATA MINING & DMX: SQL SERVER ANALYSIS SERVICES

OLE DB for Data Mining, developed by Microsoft and incorporated in SQL Server Analysis Services, specifies a structure for holding the information defining a mining model, and a language for creating and working with these mining models. The approach has been to adopt an SQL-like framework for creating, training and using a mining model: a mining model is treated as though it were a special kind of table. The DMX language, which is SQL-like, is used to create and work with models.

CREATE MINING MODEL [AGE PREDICTION] (
    [CUSTOMER ID]    LONG   KEY,
    [GENDER]         TEXT   DISCRETE,
    [AGE]            DOUBLE DISCRETIZED() PREDICT,
    [ITEM PURCHASES] TABLE (
        [ITEM NAME]     TEXT   KEY,
        [ITEM QUANTITY] DOUBLE NORMAL CONTINUOUS,
        [ITEM TYPE]     TEXT   RELATED TO [ITEM NAME]
    )
) USING [MS DECISION TREE]

The column to be predicted, AGE, is identified, together with the keyword DISCRETIZED(), indicating that a discretization into ranges of values is to take place. ITEM QUANTITY is identified as having a normal distribution, which may be exploited by some mining algorithms.

ITEM TYPE is identified as being related to ITEM NAME. This reflects a 1-many constraint: each item has one type. It can be seen from the column specification that a nested table representation is used, with ITEM PURCHASES itself being a table nested within AGE PREDICTION. A conventional table representation would result either in duplicate data in a single non-normalized table or in data spread across multiple normalized tables. The USING clause specifies the algorithm that will be used to construct the model. Having created a model, it may be populated with a caseset of training data using an INSERT statement. Predictions are obtained by executing a prediction join to match the trained model with the caseset to be mined. This process can be thought of as matching each case in the data to be mined with every possible case in the trained model, to find a predicted value for each case which matches a case in the model.

SQL Server Analysis Services supports data mining algorithms for use with:
- conventional relational tables;
- OLAP cube data.

Mining techniques supported include:
- classification: decision trees;
- clustering: k-means;
- association rule mining.

Predictive Model Markup Language (PMML) is supported. See also the SQL Server Analysis Services data mining tutorials.

DATA MINING PRODUCTS: OPEN SOURCE

A number of open-source packages and tools support data mining capabilities, including R, Weka, RapidMiner and Mahout.

R is both a language for statistical computing and the visualisation of results, and a wider environment consisting of packages and other tools for the development of statistical applications. Data mining functionality is supported through a number of packages, including classification with decision trees using the rpart package, clustering with the k-means function, and association rule mining with Apriori using the arules package.

Weka is a collection of data mining algorithms written in Java, including algorithms for classification, clustering and association rule mining, as well as for visualisation.

RapidMiner consists of both tools for developing standalone data mining applications and an environment for calling RapidMiner functions from other programming languages. Weka and R algorithms may be integrated within RapidMiner. An XML-based interchange format is used to enable the interchange of data between data mining algorithms.

Mahout is an Apache project to develop data mining algorithms for the Hadoop platform. Core MapReduce algorithms for clustering and classification are provided, but the project also incorporates algorithms designed to run on single-node architectures and non-Hadoop cluster architectures.

DATA MINING PRODUCTS: ORACLE

Oracle supports data mining algorithms for use with conventional relational tables. Mining techniques supported include:
- classification: decision trees, support vector machines and others;
- clustering: k-means and others;
- association rule mining: Apriori.

Predictive Model Markup Language (PMML) support is included. In addition to SQL and PL/SQL interfaces, until Oracle 11 a Java API was supported to allow applications to be developed which mine data. This was Oracle's implementation of JDM 1.0, introduced above.

From Oracle 12, the Java API is no longer supported. Instead, support for R has been introduced with the Oracle R Enterprise component. Oracle R Enterprise allows R to be used to perform analysis on Oracle database tables. A collection of packages supports the mapping of R data types to Oracle database objects and the transparent rewriting of R expressions into SQL expressions on those corresponding objects. A related product is Oracle R Connector for Hadoop, an R package which provides an interface between a local R environment and file system and Hadoop, enabling R functions to be executed on data in memory, on the local file system and in HDFS.

READING

- P. Bradley et al., Scaling Mining Algorithms to Large Databases, CACM, 45(8), 2002.
- J. Gama, A Survey on Learning from Data Streams: Current and Future Trends, Progress in Artificial Intelligence, 1(1), 2012 (sections 3.1, 3.2 optional).
- J. Lin and A. Kolcz, Large-scale Machine Learning at Twitter, Proc. SIGMOD '12, May 2012 (sections 5, 6 optional).
- X. Liu et al., Survey of Real-time Processing Systems for Big Data, Proc. IDEAS '14, July 2014.
- A. Toshniwal et al., Storm @Twitter, Proc. SIGMOD '14, June 2014.


More information

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

More information

Apache Hama Design Document v0.6

Apache Hama Design Document v0.6 Apache Hama Design Document v0.6 Introduction Hama Architecture BSPMaster GroomServer Zookeeper BSP Task Execution Job Submission Job and Task Scheduling Task Execution Lifecycle Synchronization Fault

More information

How to Enhance Traditional BI Architecture to Leverage Big Data

How to Enhance Traditional BI Architecture to Leverage Big Data B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...

More information

Manifest for Big Data Pig, Hive & Jaql

Manifest for Big Data Pig, Hive & Jaql Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,

More information

Data Mining and Database Systems: Where is the Intersection?

Data Mining and Database Systems: Where is the Intersection? Data Mining and Database Systems: Where is the Intersection? Surajit Chaudhuri Microsoft Research Email: surajitc@microsoft.com 1 Introduction The promise of decision support systems is to exploit enterprise

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard Hadoop and Relational base The Best of Both Worlds for Analytics Greg Battas Hewlett Packard The Evolution of Analytics Mainframe EDW Proprietary MPP Unix SMP MPP Appliance Hadoop? Questions Is Hadoop

More information

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research

More information

Luncheon Webinar Series May 13, 2013

Luncheon Webinar Series May 13, 2013 Luncheon Webinar Series May 13, 2013 InfoSphere DataStage is Big Data Integration Sponsored By: Presented by : Tony Curcio, InfoSphere Product Management 0 InfoSphere DataStage is Big Data Integration

More information

BIG DATA SOLUTION DATA SHEET

BIG DATA SOLUTION DATA SHEET BIG DATA SOLUTION DATA SHEET Highlight. DATA SHEET HGrid247 BIG DATA SOLUTION Exploring your BIG DATA, get some deeper insight. It is possible! Another approach to access your BIG DATA with the latest

More information

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

In this presentation, you will be introduced to data mining and the relationship with meaningful use. In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine

More information

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies Big Data: Global Digital Data Growth Growing leaps and bounds by 40+% Year over Year! 2009 =.8 Zetabytes =.08

More information

Executive Summary... 2 Introduction... 3. Defining Big Data... 3. The Importance of Big Data... 4 Building a Big Data Platform...

Executive Summary... 2 Introduction... 3. Defining Big Data... 3. The Importance of Big Data... 4 Building a Big Data Platform... Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure Requirements... 5 Solution Spectrum... 6 Oracle s Big Data

More information

Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics

Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics Please note the following IBM s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

Data Mining for Scientific & Engineering Applications

Data Mining for Scientific & Engineering Applications Data Mining for Scientific & Engineering Applications Robert Grossman, Laboratory for Advanced Computing, University of Illinois & Magnify Chandrika Kamath, Lawrence Livermore National Laboratory Vipin

More information

Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.

Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p. Introduction p. xvii Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p. 9 State of the Practice in Analytics p. 11 BI Versus

More information

Oracle9i Database Release 2 Product Family

Oracle9i Database Release 2 Product Family Database Release 2 Product Family An Oracle White Paper January 2002 Database Release 2 Product Family INTRODUCTION Database Release 2 is available in three editions, each suitable for different development

More information

Big Data and Data Science: Behind the Buzz Words

Big Data and Data Science: Behind the Buzz Words Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing

More information

Analytics on Big Data

Analytics on Big Data Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis

More information

Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm

Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm R. Sridevi et al Int. Journal of Engineering Research and Applications RESEARCH ARTICLE OPEN ACCESS Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm R. Sridevi,*

More information

International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April-2014 442 ISSN 2229-5518

International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April-2014 442 ISSN 2229-5518 International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April-2014 442 Over viewing issues of data mining with highlights of data warehousing Rushabh H. Baldaniya, Prof H.J.Baldaniya,

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Course 803401 DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Course 803401 DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Oman College of Management and Technology Course 803401 DSS Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization CS/MIS Department Information Sharing

More information

Navigating the Big Data infrastructure layer Helena Schwenk

Navigating the Big Data infrastructure layer Helena Schwenk mwd a d v i s o r s Navigating the Big Data infrastructure layer Helena Schwenk A special report prepared for Actuate May 2013 This report is the second in a series of four and focuses principally on explaining

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

I/O Considerations in Big Data Analytics

I/O Considerations in Big Data Analytics Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very

More information

In-Memory Analytics for Big Data

In-Memory Analytics for Big Data In-Memory Analytics for Big Data Game-changing technology for faster, better insights WHITE PAPER SAS White Paper Table of Contents Introduction: A New Breed of Analytics... 1 SAS In-Memory Overview...

More information

Big Data Analytics - Accelerated. stream-horizon.com

Big Data Analytics - Accelerated. stream-horizon.com Big Data Analytics - Accelerated stream-horizon.com Legacy ETL platforms & conventional Data Integration approach Unable to meet latency & data throughput demands of Big Data integration challenges Based

More information

Data Modeling for Big Data

Data Modeling for Big Data Data Modeling for Big Data by Jinbao Zhu, Principal Software Engineer, and Allen Wang, Manager, Software Engineering, CA Technologies In the Internet era, the volume of data we deal with has grown to terabytes

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

Big Data on Microsoft Platform

Big Data on Microsoft Platform Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4

More information

Model Deployment. Dr. Saed Sayad. University of Toronto 2010 saed.sayad@utoronto.ca. http://chem-eng.utoronto.ca/~datamining/

Model Deployment. Dr. Saed Sayad. University of Toronto 2010 saed.sayad@utoronto.ca. http://chem-eng.utoronto.ca/~datamining/ Model Deployment Dr. Saed Sayad University of Toronto 2010 saed.sayad@utoronto.ca http://chem-eng.utoronto.ca/~datamining/ 1 Model Deployment Creation of the model is generally not the end of the project.

More information

How to Choose Between Hadoop, NoSQL and RDBMS

How to Choose Between Hadoop, NoSQL and RDBMS How to Choose Between Hadoop, NoSQL and RDBMS Keywords: Jean-Pierre Dijcks Oracle Redwood City, CA, USA Big Data, Hadoop, NoSQL Database, Relational Database, SQL, Security, Performance Introduction A

More information

Play with Big Data on the Shoulders of Open Source

Play with Big Data on the Shoulders of Open Source OW2 Open Source Corporate Network Meeting Play with Big Data on the Shoulders of Open Source Liu Jie Technology Center of Software Engineering Institute of Software, Chinese Academy of Sciences 2012-10-19

More information

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Turban, Aronson, and Liang Decision Support Systems and Intelligent Systems, Seventh Edition Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

More information

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,

More information

Mr. Apichon Witayangkurn apichon@iis.u-tokyo.ac.jp Department of Civil Engineering The University of Tokyo

Mr. Apichon Witayangkurn apichon@iis.u-tokyo.ac.jp Department of Civil Engineering The University of Tokyo Sensor Network Messaging Service Hive/Hadoop Mr. Apichon Witayangkurn apichon@iis.u-tokyo.ac.jp Department of Civil Engineering The University of Tokyo Contents 1 Introduction 2 What & Why Sensor Network

More information

Fast and Easy Delivery of Data Mining Insights to Reporting Systems

Fast and Easy Delivery of Data Mining Insights to Reporting Systems Fast and Easy Delivery of Data Mining Insights to Reporting Systems Ruben Pulido, Christoph Sieb rpulido@de.ibm.com, christoph.sieb@de.ibm.com Abstract: During the last decade data mining and predictive

More information

Business Intelligence for Big Data

Business Intelligence for Big Data Business Intelligence for Big Data Will Gorman, Vice President, Engineering May, 2011 2010, Pentaho. All Rights Reserved. www.pentaho.com. What is BI? Business Intelligence = reports, dashboards, analysis,

More information

PMML and UIMA Based Frameworks for Deploying Analytic Applications and Services

PMML and UIMA Based Frameworks for Deploying Analytic Applications and Services PMML and UIMA Based Frameworks for Deploying Analytic Applications and Services David Ferrucci 1, Robert L. Grossman 2 and Anthony Levas 1 1. Introduction - The Challenges of Deploying Analytic Applications

More information

Oracle Database Directions Fred Louis Principal Sales Consultant Ohio Valley Region

<Insert Picture Here> Oracle Database Directions Fred Louis Principal Sales Consultant Ohio Valley Region Oracle Database Directions Fred Louis Principal Sales Consultant Ohio Valley Region 1977 Oracle Database 30 Years of Sustained Innovation Database Vault Transparent Data Encryption

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Problem: HP s numerous systems unable to deliver the information needed for a complete picture of business operations, lack of

More information

Introduction Predictive Analytics Tools: Weka

Introduction Predictive Analytics Tools: Weka Introduction Predictive Analytics Tools: Weka Predictive Analytics Center of Excellence San Diego Supercomputer Center University of California, San Diego Tools Landscape Considerations Scale User Interface

More information

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

Chapter 5. Warehousing, Data Acquisition, Data. Visualization Decision Support Systems and Intelligent Systems, Seventh Edition Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization 5-1 Learning Objectives

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Wienand Omta Fabiano Dalpiaz 1 drs. ing. Wienand Omta Learning Objectives Describe how the problems of managing data resources

More information

Workshop on Hadoop with Big Data

Workshop on Hadoop with Big Data Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

More information

Real-time Big Data Analytics with Storm

Real-time Big Data Analytics with Storm Ron Bodkin Founder & CEO, Think Big June 2013 Real-time Big Data Analytics with Storm Leading Provider of Data Science and Engineering Services Accelerating Your Time to Value IMAGINE Strategy and Roadmap

More information

Customized Report- Big Data

Customized Report- Big Data GINeVRA Digital Research Hub Customized Report- Big Data 1 2014. All Rights Reserved. Agenda Context Challenges and opportunities Solutions Market Case studies Recommendations 2 2014. All Rights Reserved.

More information

The Internet of Things and Big Data: Intro

The Internet of Things and Big Data: Intro The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific

More information

DAMA NY DAMA Day October 17, 2013 IBM 590 Madison Avenue 12th floor New York, NY

DAMA NY DAMA Day October 17, 2013 IBM 590 Madison Avenue 12th floor New York, NY Big Data Analytics DAMA NY DAMA Day October 17, 2013 IBM 590 Madison Avenue 12th floor New York, NY Tom Haughey InfoModel, LLC 868 Woodfield Road Franklin Lakes, NJ 07417 201 755 3350 tom.haughey@infomodelusa.com

More information

1. What are the uses of statistics in data mining? Statistics is used to Estimate the complexity of a data mining problem. Suggest which data mining

1. What are the uses of statistics in data mining? Statistics is used to Estimate the complexity of a data mining problem. Suggest which data mining 1. What are the uses of statistics in data mining? Statistics is used to Estimate the complexity of a data mining problem. Suggest which data mining techniques are most likely to be successful, and Identify

More information

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional

More information

1 File Processing Systems

1 File Processing Systems COMP 378 Database Systems Notes for Chapter 1 of Database System Concepts Introduction A database management system (DBMS) is a collection of data and an integrated set of programs that access that data.

More information

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances INSIGHT Oracle's All- Out Assault on the Big Data Market: Offering Hadoop, R, Cubes, and Scalable IMDB in Familiar Packages Carl W. Olofson IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA

More information

Oracle Database 12c Plug In. Switch On. Get SMART.

Oracle Database 12c Plug In. Switch On. Get SMART. Oracle Database 12c Plug In. Switch On. Get SMART. Duncan Harvey Head of Core Technology, Oracle EMEA March 2015 Safe Harbor Statement The following is intended to outline our general product direction.

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem:

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem: Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Chapter 6 Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

Alejandro Vaisman Esteban Zimanyi. Data. Warehouse. Systems. Design and Implementation. ^ Springer

Alejandro Vaisman Esteban Zimanyi. Data. Warehouse. Systems. Design and Implementation. ^ Springer Alejandro Vaisman Esteban Zimanyi Data Warehouse Systems Design and Implementation ^ Springer Contents Part I Fundamental Concepts 1 Introduction 3 1.1 A Historical Overview of Data Warehousing 4 1.2 Spatial

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Data Warehouse design

Data Warehouse design Data Warehouse design Design of Enterprise Systems University of Pavia 10/12/2013 2h for the first; 2h for hadoop - 1- Table of Contents Big Data Overview Big Data DW & BI Big Data Market Hadoop & Mahout

More information

Big Data and Apache Hadoop Adoption:

Big Data and Apache Hadoop Adoption: Expert Reference Series of White Papers Big Data and Apache Hadoop Adoption: Key Challenges and Rewards 1-800-COURSES www.globalknowledge.com Big Data and Apache Hadoop Adoption: Key Challenges and Rewards

More information

Oracle Architecture, Concepts & Facilities

Oracle Architecture, Concepts & Facilities COURSE CODE: COURSE TITLE: CURRENCY: AUDIENCE: ORAACF Oracle Architecture, Concepts & Facilities 10g & 11g Database administrators, system administrators and developers PREREQUISITES: At least 1 year of

More information

Big Data Course Highlights

Big Data Course Highlights Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like

More information

Information Architecture

Information Architecture The Bloor Group Actian and The Big Data Information Architecture WHITE PAPER The Actian Big Data Information Architecture Actian and The Big Data Information Architecture Originally founded in 2005 to

More information

What s Cooking in KNIME

What s Cooking in KNIME What s Cooking in KNIME Thomas Gabriel Copyright 2015 KNIME.com AG Agenda Querying NoSQL Databases Database Improvements & Big Data Copyright 2015 KNIME.com AG 2 Querying NoSQL Databases MongoDB & CouchDB

More information