DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS

The basic data mining algorithms introduced may be enhanced in a number of ways.

Data mining algorithms have traditionally assumed that data is memory resident, as for example in the Apriori algorithm for association rule mining or the basic clustering algorithms. Modifications to such algorithms are required for them to work efficiently with large persistent data sets.

Also, much data is no longer held in persistent form in data warehouses: it is generated at such a rate that storing it in that way is infeasible. Instead, the data is represented by transient data streams. This necessitates further modified solutions to the data mining techniques we have studied so far.
As scalable architectures based on commodity hardware, such as Hadoop, have emerged, techniques to support data mining on these architectures have been developed. Architectures which support analytics with both real-time processing of stream data and batch-oriented processing of historical data are also required.

Another problem is that once a data mining model has been developed, there has traditionally been no mechanism for that model to be reused programmatically by other applications on other data sets. Hence, standards for data mining model exchange have been developed.

Also, data mining has traditionally been performed by dumping data from the data warehouse to an external file, which is then transformed and mined. This results in a series of files for each data mining application, with the attendant problems of data redundancy, inconsistency and data dependence which database technology was designed to overcome. Hence, techniques and standards for tighter integration of database and data mining technology have been developed.
DATA MINING OF LARGE DATA SETS

Algorithms for classification, clustering, and association rule mining are considered.

CLASSIFYING LARGE DATA SETS: SUPPORT VECTOR MACHINES

To reduce the computational cost of solving the SVM optimization problem with large training sets, chunking may be used. This partitions the training set into chunks, each of which fits into memory, and the support vector parameters are computed iteratively chunk by chunk. However, multiple passes over the data are required to obtain an optimal solution (a chunking sketch follows below).

Another approach is squashing, in which the SVM is trained over clusters derived from the original training set, with the clusters reflecting the distribution of the original training records. A further approach reformulates the optimization problem so that it can be solved by efficient iterative algorithms.
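A minimal sketch of the chunking idea, assuming scikit-learn is available: only the support vectors found in each chunk are carried forward and combined with the next chunk before retraining. A single pass is shown here, whereas the full technique iterates until the optimality conditions hold.

import numpy as np
from sklearn.svm import SVC

def chunked_svm(X, y, chunk_size=1000):
    model = SVC(kernel="rbf")
    sv_X = np.empty((0, X.shape[1]))   # support vectors retained so far
    sv_y = np.empty(0)
    for start in range(0, len(X), chunk_size):
        # train on the retained support vectors plus the next chunk
        chunk_X = np.vstack([sv_X, X[start:start + chunk_size]])
        chunk_y = np.concatenate([sv_y, y[start:start + chunk_size]])
        model.fit(chunk_X, chunk_y)
        sv_X = model.support_vectors_  # keep only what defines the boundary
        sv_y = chunk_y[model.support_]
    return model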
CLUSTERING LARGE DATA SETS: K-MEANS

Unless there is sufficient main memory to hold the data being clustered, the data scan at each iteration of the k-means algorithm will be very costly. An approach for large databases should:

- Perform at most one scan of the database.
- Work with limited memory.

Approaches include the following:

- Identify three kinds of data objects: those which are discardable because their cluster membership has been established; those which are compressible because, while not discardable, they belong to a well-defined subcluster which can be characterized in a compact structure; and those which are neither discardable nor compressible and so must be retained in main memory.
- Alternatively, first group the data objects into microclusters and then perform k-means clustering on those microclusters.
An approach developed at Microsoft combines these ideas as follows:

1. Read a sample subset of data from the database.
2. Cluster that data with the existing model as usual, producing an updated model.
3. On the basis of the updated model, decide for each data item in the sample whether it should be retained in memory, discarded (with summary information being updated), or retained in a compressed form as summary information.
4. Repeat from 1 until a termination condition is met.
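The single-scan, limited-memory flavour of this loop can be sketched with scikit-learn's MiniBatchKMeans, which likewise updates an existing model from successive samples; the discard/compress bookkeeping of the full algorithm is omitted from this sketch.

from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=4, random_state=0)

def cluster_in_one_scan(read_samples):
    # read_samples is a hypothetical generator yielding batches of rows
    for batch in read_samples():
        model.partial_fit(batch)   # update the existing model with the sample
    return model.cluster_centers_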
ASSOCIATION MINING OF LARGE DATA SETS: APRIORI

With one database scan for each itemset size tested, the cost of scans would be prohibitive for the Apriori algorithm unless the database is resident in memory. Approaches to enhance the efficiency of Apriori include the following.

While generating 1-itemsets, for each transaction generate its 2-itemsets at the same time, hashing the 2-itemsets into a hash table structure. All buckets whose final count of itemsets is less than the minimum support threshold can be ignored subsequently, since no itemset in such a bucket can itself have the minimum required support (see the sketch below).
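A minimal sketch of this hash-based pruning of candidate 2-itemsets: the bucket count is an upper bound on the support of every pair hashing to that bucket, so low-count buckets eliminate all of their pairs.

from itertools import combinations

def hash_prune_pairs(transactions, min_support, n_buckets=1024):
    item_count = {}
    buckets = [0] * n_buckets
    for t in transactions:
        for item in t:                              # count 1-itemsets
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):     # hash 2-itemsets in the same scan
            buckets[hash(pair) % n_buckets] += 1
    frequent = {i for i, c in item_count.items() if c >= min_support}
    # a candidate pair survives only if both items are frequent and its
    # bucket count reaches the threshold
    return [p for p in combinations(sorted(frequent), 2)
            if buckets[hash(p) % n_buckets] >= min_support]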
The testing of k-itemsets and (k+1)-itemsets can be overlapped by counting (k+1)-itemsets in parallel with counting k-itemsets. Unlike conventional Apriori, in which candidate (k+1)-itemsets are only generated after the k-itemset database scan, in this approach a database scan is divided into blocks, and candidate (k+1)-itemsets can be generated from any completed block while the k-itemset scan continues.

Only two database scans are needed if a partitioning approach is adopted, under which the transactions are divided into n partitions, each of which can be held in memory. In the first scan, frequent itemsets are generated for each partition; these are combined to create a list of candidate frequent itemsets for the database as a whole. (Any itemset frequent in the whole database must be frequent in at least one partition, so no candidate is missed.) In the second scan, the actual support for the members of the candidate list is checked, as sketched below.
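A sketch of the two-scan partitioning approach. Here local_frequent is a hypothetical helper standing in for any in-memory frequent-itemset miner applied to a single partition, returning a set of frozensets.

def partition_apriori(partitions, min_support_ratio, local_frequent):
    # scan 1: the union of locally frequent itemsets gives the global candidates
    candidates = set()
    for part in partitions:
        candidates |= local_frequent(part, min_support_ratio)
    # scan 2: count the actual global support of each candidate
    counts = dict.fromkeys(candidates, 0)
    total = 0
    for part in partitions:
        for t in part:
            total += 1
            items = set(t)
            for c in candidates:
                if c <= items:
                    counts[c] += 1
    return {c for c, n in counts.items() if n / total >= min_support_ratio}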
Alternatively, pick a random sample of transactions which will fit in memory and search for frequent itemsets in that sample. This may result in some globally frequent itemsets being missed. The chance of this happening can be lessened by adopting a lower minimum support threshold for the sample, with the remainder of the database then being used to check the actual support for the candidate itemsets. A second database scan may be needed to ensure that no frequent itemsets have been missed, as in the sketch below.
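A sketch of the sampling approach; local_frequent is again a hypothetical in-memory miner, and the slack factor that lowers the sample threshold is an illustrative assumption.

import random

def sample_then_verify(transactions, min_support_ratio, local_frequent,
                       sample_size=10_000, slack=0.8):
    sample = random.sample(transactions, min(sample_size, len(transactions)))
    # the lowered threshold reduces the chance of missing a frequent itemset
    candidates = local_frequent(sample, min_support_ratio * slack)
    total = len(transactions)
    counts = {c: sum(1 for t in transactions if c <= set(t)) for c in candidates}
    return {c for c, n in counts.items() if n / total >= min_support_ratio}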
An alternative approach to increasing efficiency, FP-growth, holds the frequent itemsets in compressed form in a prefix-tree structure known as an FP-tree. Using this, only two database scans are needed to identify all frequent itemsets:

- The first scan identifies the frequent items.
- The second scan builds the FP-tree, in which each path from the root represents a frequent itemset.

Consider again the set of transactions seen when the Apriori algorithm was introduced, but with two additional transactions T5 and T6:

Trans_id  List_of_items
T1        I1, I3, I4
T2        I2, I3, I5
T3        I1, I2, I3, I5
T4        I2, I5
T5        I3, I5
T6        I1, I4, I5

Assume frequent itemsets with minimum support 50% are required.
The first database scan is the same as in Apriori, generating the 1-itemsets and their support. These are then stored in a list in descending order of support count:

I5  5
I3  4
I2  3
I1  3
I4  2

The second database scan builds the FP-tree by processing the items in each transaction in the order of the list generated in the first scan. A branch is created for each transaction, sharing paths from the root with previously constructed branches containing the same items. Each node, representing an item, also contains a count which is incremented as each transaction including that node is added. Finally, a linked list is created from each item in the first-scan list to the nodes representing that item in the tree.
Hence, after the second scan, for transactions T1 to T6 (minimum support 3, so I4 has insufficient support and is omitted) the following structure results, with the header list I5:5, I3:4, I2:3, I1:3 linking into the tree:

root:6
+-- I3:1
|   +-- I1:1
+-- I5:5
    +-- I3:3
    |   +-- I2:2
    |       +-- I1:1
    +-- I2:1
    +-- I1:1

FP-growth then processes the FP-tree to identify the frequent itemsets. A construction sketch follows.
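A minimal sketch of the two-scan FP-tree construction just described; the header's linked lists are represented here as Python lists of nodes.

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_support):
    # scan 1: frequent items, in descending order of support
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    # note: items tied on support (here I2 and I1) may be ordered either way
    order = [i for i, c in sorted(counts.items(), key=lambda kv: -kv[1])
             if c >= min_support]
    root, header = Node(None, None), {i: [] for i in order}
    # scan 2: insert each transaction, sharing prefixes from the root
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header

# The T1-T6 transactions above, with minimum support 3 (50% of 6):
tx = [{"I1", "I3", "I4"}, {"I2", "I3", "I5"}, {"I1", "I2", "I3", "I5"},
      {"I2", "I5"}, {"I3", "I5"}, {"I1", "I4", "I5"}]
root, header = build_fp_tree(tx, min_support=3)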
STREAM MINING

Data streams arise in many application areas, including real-time sensors, internet activity and scientific experiments. Such data is characterized by being:

- Infinite.
- Arriving continuously.
- Arriving at a high rate.
- Arriving in a time sequence whose ordering is significant.

It is not possible for such data to be stored in a conventional database or data warehouse: the data volumes and stream flow rates are too great. Hence, new techniques with new data structures and algorithms are needed.
These techniques include the following.

- Sampling: a probabilistic method is used to choose which data items within the stream are processed (see the reservoir sampling sketch after this list).
- Load shedding: a sequence of the data stream is dropped and not processed.
- Sketching: only a subset of the information within the stream is processed.
- Synopsis data structures: summarization techniques are used to hold information about the stream in a more space-efficient form.
- Aggregation: statistical measures are used to summarize the data stream.
- Approximation algorithms: given the computational complexity of exact mining algorithms, algorithms which give approximate solutions with defined error bounds are sought instead.
- Sliding window: analysis is restricted to the most recent data within the stream.
- Algorithm output granularity: algorithms are developed which adapt to the available resources, such as processor load and memory availability.
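As an illustration of the sampling technique, here is a sketch of reservoir sampling, which keeps a fixed-size uniform random sample of an unbounded stream:

import random

def reservoir_sample(stream, k):
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, n)        # inclusive of both endpoints
            if j < k:
                reservoir[j] = item         # replace with probability k/(n+1)
    return reservoir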
These techniques have been used in the development of classical data mining tasks such as clustering, classification and frequent itemset identification within stream data.

For example, in the case of frequent itemset identification, the FP-tree structure can be used to hold itemset information which is updated incrementally as transaction data arrives. As the first batch of transactions is received, the ordering of items can be determined on the basis of their support, as in the FP-growth algorithm seen above, and an FP-tree created. This ordering is not subsequently changed. As subsequent batches of transactions are received, the FP-tree is updated to reflect the revised support for itemsets.

Since it is infeasible to store itemset information for all batches indefinitely, a tilted time window approach is often used: as new batches of transactions are received, itemset counts are computed for those new batches individually, while itemset information for older batches is aggregated over progressively longer time periods, as in the sketch below.
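A sketch of a tilted time window over per-batch itemset counts; the window structure (groups of four batches merged per level) is an illustrative assumption, not the specific scheme of any one published algorithm.

from collections import Counter

class TiltedWindow:
    def __init__(self, batches_per_level=4, n_levels=3):
        self.k = batches_per_level
        self.levels = [[] for _ in range(n_levels)]  # level 0 holds the newest batches

    def add_batch(self, itemset_counts):
        self.levels[0].append(Counter(itemset_counts))
        for i in range(len(self.levels) - 1):
            if len(self.levels[i]) > self.k:
                merged = Counter()               # aggregate the oldest k windows
                for c in self.levels[i][:self.k]:
                    merged.update(c)
                del self.levels[i][:self.k]
                self.levels[i + 1].append(merged)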
DATA MINING ARCHITECTURE TRENDS

The Hadoop technologies already seen supporting data warehousing are increasingly being used to support data mining and analytics. For example, Pig has been used extensively in organisations including Twitter to support data mining and analytics.

With the increasing importance of stream data, a particular architectural challenge is how to support both the real-time processing required for stream data and the batch-oriented processing needed for historical data: the Hadoop technologies seen so far were originally designed with the latter in mind.

One influential approach has been the lambda architecture, which consists of three layers, with all incoming data dispatched to both the batch and speed layers:

- A batch layer manages the append-only master data set and precomputes views on it.
- A serving layer indexes these views to enable ad hoc, low-latency queries.
- A speed layer deals with recent data only, to avoid the high latency of updates to the serving layer.

A sketch of the query-time combination of these layers follows.
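A minimal sketch of how a query combines the layers, assuming simple counts as the view type; the keys and values are illustrative.

# Precomputed by the batch layer and indexed by the serving layer:
batch_view = {"page_a": 10_000, "page_b": 7_500}
# Maintained by the speed layer over data arriving since the last batch run:
speed_view = {"page_a": 42, "page_c": 7}

def query(key):
    # the answer covers all data seen so far: historical plus recent
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("page_a"))   # 10042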
Hadoop technologies are well suited to supporting the requirements of the batch layer. Column-store or NoSQL technologies are often used to support the storage requirements of the serving layer. Stream technologies are required to support the speed layer, for example Storm.

Storm is a distributed computing framework designed for real-time stream processing. The processing model is similar to MapReduce in that it supports a graph-oriented workflow. However, while the MapReduce model is batch oriented, Storm has been designed to support real-time processing of stream rather than batch data, with streams continuing indefinitely.

Storm was developed at BackType, which was acquired by Twitter. It was developed and deployed for internal use before being made open source.
DATA MINING STANDARDS

Data mining standards, and related standards for data grids, web services and the semantic web, enable the easier deployment of data mining applications across platforms. Standards cover:

- The overall KDD process.
- Metadata interchange with data warehousing applications.
- The representation of data cleaning, data reduction and transformation processes.
- The representation of data mining models.
- APIs for performing data mining processes from other languages, including SQL and Java.
CRISP-DM AND THE PREDICTIVE MODEL MARKUP LANGUAGE (PMML)

CRISP-DM (CRoss Industry Standard Process for Data Mining) specifies a process model covering the following six phases of the KDD process:

- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment

PMML is an XML-based standard developed by the Data Mining Group, a consortium of data mining product vendors. PMML represents data mining models as well as operations for cleaning and transforming data prior to modeling. The aim is to enable one application to produce a data mining model in a form (PMML XML) which another data mining application can read and apply.
The original slides showed at this point the PMML representation of an example association rules model, together with its transaction data; the figure is not reproduced in this transcription.
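As an illustration, below is a minimal sketch of what such a document can look like, using elements from the PMML AssociationModel schema with values computed from the T1-T6 transactions above (I3 has support 4/6, I5 support 5/6, {I3, I5} support 3/6, and the rule I3 => I5 confidence 3/4). The document is held as a Python string so that it can be parsed programmatically; a real document also declares the PMML namespace and version, omitted here for brevity.

import xml.etree.ElementTree as ET

PMML_SKETCH = """
<PMML>
  <DataDictionary numberOfFields="2">
    <DataField name="transaction" optype="categorical" dataType="string"/>
    <DataField name="item" optype="categorical" dataType="string"/>
  </DataDictionary>
  <AssociationModel functionName="associationRules"
                    numberOfTransactions="6" minimumSupport="0.5"
                    minimumConfidence="0.5">
    <MiningSchema>
      <MiningField name="transaction" usageType="group"/>
      <MiningField name="item" usageType="active"/>
    </MiningSchema>
    <Item id="1" value="I3"/>
    <Item id="2" value="I5"/>
    <Itemset id="1" support="0.67"><ItemRef itemRef="1"/></Itemset>
    <Itemset id="2" support="0.83"><ItemRef itemRef="2"/></Itemset>
    <Itemset id="3" support="0.5">
      <ItemRef itemRef="1"/><ItemRef itemRef="2"/>
    </Itemset>
    <AssociationRule support="0.5" confidence="0.75"
                     antecedent="1" consequent="2"/>
  </AssociationModel>
</PMML>
"""

root = ET.fromstring(PMML_SKETCH)
for rule in root.iter("AssociationRule"):     # list the rules in the model
    print(rule.get("antecedent"), "=>", rule.get("consequent"),
          "confidence", rule.get("confidence"))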
The XML schema specification for the association model, which defines the structure of such documents, was likewise shown in the original slides and is not reproduced here.
The components of a PMML document are as follows (the first two and the last being used in the example model above):

- Data dictionary: defines a model's input attributes, with their types and value ranges.
- Mining schema: defines the attributes and roles specific to a particular model.
- Transformation dictionary: defines the following mappings: normalization (continuous or discrete values to numbers), discretization (continuous to discrete values), value mapping (discrete to discrete values), and aggregation (grouping values, as in SQL).
- Model statistics: statistics about individual attributes.
- Models: includes regression models, cluster models, association rules, neural networks, Bayesian models and sequence models.

PMML is used within the standards CWM, SQL/MM Part 6 Data Mining, JDM and MS Analysis Services (OLE DB for Data Mining), providing a degree of compatibility between them all.
COMMON WAREHOUSE METAMODEL (CWM)

CWM supports the interchange of warehouse and business intelligence metadata between warehouse tools, warehouse platforms and warehouse metadata repositories in distributed heterogeneous environments.

SQL/MM DATA MINING

The SQL Multimedia and Application Packages (SQL/MM) standard, Part 6, specifies an SQL interface to data mining applications and services through SQL:1999 user-defined types, as follows:

- User-defined types for four data mining functions: association rules, clustering, classification and regression.
- Routines to manipulate these user-defined types, allowing:
  - Setting parameters for mining activities.
  - Training of mining models, in which a particular mining technique is chosen, parameters for that technique are set, and the mining model is built with training data sets.
  - Testing of mining models (applicable only to regression and classification models), in which the trained model is evaluated by comparison with results for known data.
  - Application of mining models, in which the model is applied to new data to cluster, predict or classify as appropriate. This phase is not applicable to rule models, in which the rules are determined during the training phase.
- User-defined types for data structures common across these data mining models.
- Functions to capture metadata for data mining input.
For example, for the association rule model type DM_RuleModel the following methods are supported:

DM_impRuleModel(CHARACTER LARGE OBJECT(DM_MaxContentLength))
    Import a rule model expressed as PMML, returning a DM_RuleModel.
DM_expRuleModel()
    Export the rule model as PMML.
DM_getNORules()
    Return the number of rules.
DM_getRuleTask()
    Return the data mining task value, data mining settings, etc.
JAVA DATA MINING (JDM)

Java Data Mining (JDM) is a Java API, developed under the Java Community Process, supporting common data mining operations as well as the metadata supporting mining activities.

JDM 1.0 supports the following mining functions: classification, regression, attribute importance (ranking), clustering and association rules. It supports the following tasks: model building, testing, application, and model import/export. JDM does not support tasks such as data transformation, visualization and the mining of unstructured data.

JDM has been designed so that its metadata maps closely to PMML, to support the generation of XML for mining models. Likewise, its metadata maps closely to CWM, to support the generation of XML for mining tasks. The JDM API maps closely to SQL/MM Data Mining, to support an implementation of JDM on top of SQL/MM.
OLE DB FOR DATA MINING AND DMX: SQL SERVER ANALYSIS SERVICES

OLE DB for Data Mining, developed by Microsoft and incorporated in SQL Server Analysis Services, specifies a structure for holding the information defining a mining model, and a language for creating and working with these mining models. The approach has been to adopt an SQL-like framework for creating, training and using a mining model: a mining model is treated as though it were a special kind of table. The DMX language, which is SQL-like, is used to create and work with models. For example:
CREATE MINING MODEL [AGE PREDICTION] (
    [CUSTOMER ID]    LONG   KEY,
    [GENDER]         TEXT   DISCRETE,
    [AGE]            DOUBLE DISCRETIZED() PREDICT,
    [ITEM PURCHASES] TABLE (
        [ITEM NAME]     TEXT   KEY,
        [ITEM QUANTITY] DOUBLE NORMAL CONTINUOUS,
        [ITEM TYPE]     TEXT   RELATED TO [ITEM NAME]
    )
) USING [MS DECISION TREE]

The column to be predicted, AGE, is identified by the PREDICT keyword, together with DISCRETIZED(), indicating that a discretization into ranges of values is to take place. ITEM QUANTITY is identified as having a normal distribution, which may be exploited by some mining algorithms.
ITEM TYPE is identified as being related to ITEM NAME. This reflects a one-to-many constraint: each item has exactly one type. It can be seen from the column specification that a nested table representation is used, with ITEM PURCHASES itself being a table nested within AGE PREDICTION. A conventional representation would result either in duplicate data in a single non-normalized table or in data spread across multiple normalized tables.

The USING clause specifies the algorithm that will be used to construct the model. Having created a model, it may be populated with a caseset of training data using an INSERT statement. Predictions are then obtained by executing a prediction join to match the trained model with the caseset to be mined. This process can be thought of as matching each case in the data to be mined against every possible case in the trained model, to find a predicted value for each case which matches a case in the model.
SQL Server Analysis Services supports data mining algorithms for use with:

- conventional relational tables
- OLAP cube data

Mining techniques supported include:

- classification (decision trees)
- clustering (k-means)
- association rule mining

The Predictive Model Markup Language (PMML) is supported, and Microsoft provides data mining tutorials for SQL Server Analysis Services.
DATA MINING PRODUCTS: OPEN SOURCE

A number of open-source packages and tools provide data mining capabilities, including R, Weka, RapidMiner and Mahout.

R is both a language for statistical computing and the visualisation of results, and a wider environment consisting of packages and other tools for the development of statistical applications. Data mining functionality is provided through a number of packages, including classification with decision trees using the rpart package, clustering with k-means using the kmeans function, and association rule mining with Apriori using the arules package.

Weka is a collection of data mining algorithms written in Java, including algorithms for classification, clustering and association rule mining, as well as for visualisation.
RapidMiner consists of both tools for developing standalone data mining applications and an environment for using RapidMiner functions from other programming languages. Weka and R algorithms may be integrated within RapidMiner. An XML-based interchange format is used to enable the interchange of data between data mining algorithms.

Mahout is an Apache project to develop data mining algorithms for the Hadoop platform. Core MapReduce algorithms for clustering and classification are provided, but the project also incorporates algorithms designed to run on single-node architectures and non-Hadoop cluster architectures.
DATA MINING PRODUCTS: ORACLE

Oracle supports data mining algorithms for use with conventional relational tables. Mining techniques supported include:

- classification: decision trees, support vector machines, and others
- clustering: k-means, and others
- association rule mining: Apriori

Predictive Model Markup Language (PMML) support is included.

In addition to the SQL and PL/SQL interfaces, until Oracle 11 a Java API was supported to allow applications to be developed which mine data. This was Oracle's implementation of JDM 1.0, introduced above.
From Oracle 12, the Java API is no longer supported. Instead, support for R has been introduced with the Oracle R Enterprise component. Oracle R Enterprise allows R to be used to perform analysis on Oracle database tables. A collection of packages supports the mapping of R data types to Oracle database objects and the transparent rewriting of R expressions into SQL expressions on those corresponding objects.

A related product is Oracle R Connector for Hadoop. This is an R package which provides an interface between a local R environment and file system and Hadoop, enabling R functions to be executed on data in memory, on the local file system, and in HDFS.
READING

P. Bradley et al., Scaling Mining Algorithms to Large Databases, CACM, 45(8), 2002.
J. Gama, A Survey on Learning From Data Streams: Current and Future Trends, Progress in Artificial Intelligence, 1(1), 2012 (sections 3.1, 3.2 optional).
J. Lin and A. Kolcz, Large-scale Machine Learning at Twitter, Proc. SIGMOD '12, May 2012 (sections 5, 6 optional).
X. Liu et al., Survey of Real-time Processing Systems for Big Data, Proc. IDEAS '14, July 2014.
A. Toshniwal et al., Storm@Twitter, Proc. SIGMOD '14, June 2014.