What is Analytic Infrastructure and Why Should You Care?
Robert L. Grossman
University of Illinois at Chicago and Open Data Group
[email protected]

ABSTRACT

We define analytic infrastructure to be the services, applications, utilities and systems that are used for preparing data for modeling, estimating models, validating models, scoring data, or related activities. For example, analytic infrastructure includes databases and data warehouses, statistical and data mining systems, scoring engines, grids and clouds. Note that, with this definition, analytic infrastructure need not be used exclusively for modeling; it need only be useful as part of the modeling process. In this article, we discuss the importance of analytic infrastructure and some of the standards that can be used to support it. We also discuss some specialized analytic infrastructure applications and services, including applications that can manage very large datasets and build models over them, and cloud-based analytic services.

1. INTRODUCTION

If your data is small, your statistical model is simple, your only output is a report, and the analysis needs to be done just once, then there are quite a few statistical and data mining applications that will satisfy your requirements. On the other hand, if your data is large, your model is complicated, your output is not a report but a statistical or data mining model that needs to be deployed into operational systems, or parts of the work need to be done more than once, then you might benefit from some of the infrastructure components, services, applications and systems that have been developed over the last decade to support analytics. In this article, we refer to these types of components, services and applications as analytic infrastructure. We will define analytic infrastructure a bit more precisely below. We discuss the importance of analytic infrastructure and some of the standards that can be used to support it. We also discuss some specialized analytic infrastructure applications and services, including applications that can manage very large datasets and build models over them, and cloud-based analytic services.

2. DEFINING ANALYTIC INFRASTRUCTURE

The rest of this article will be clearer if we define some terms. Define analytics to be the process of using data to make decisions based upon models. By models we mean statistical, mathematical or data mining models that are derived from data (empirically derived) using techniques that are generally accepted by the community (statistically valid).

As is standard, we divide the process of building a model into several steps. We use the term data preparation or data shaping for the process that takes the available data and transforms it to produce the inputs to the models (features). We use the term modeling (in the narrow sense) or model estimation for the process that, given a dataset of features, computes the model, which usually involves estimating the parameters of the model. The application that produces the model is sometimes called a model producer. Finally, after the model is estimated, it is usually validated. Once a model is produced, model consumers take a model and data to produce scores; this is called scoring. Finally, scores are often post-processed using business rules to produce alerts or recommendations, or used as an input to a model or to additional data shaping.
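To fix this vocabulary, the following is a minimal Python sketch of the four steps; the function names, the toy records, and the trivial mean-cutoff "model" are illustrative assumptions, not anything prescribed by the article.

# Illustrative sketch of the steps defined above; names and the trivial
# mean-cutoff "model" are hypothetical, not part of the article.

def prepare(records):
    """Data preparation / data shaping: raw data fields -> features."""
    return [{"amount": float(r["amount"]),
             "is_weekend": r["day"] in ("Sat", "Sun")} for r in records]

def estimate(features):
    """Model estimation (a model producer): fit the model's parameters."""
    cutoff = sum(f["amount"] for f in features) / len(features)
    return {"type": "threshold", "cutoff": cutoff}

def score(model, features):
    """Scoring (a model consumer): model + data -> scores."""
    return [1.0 if f["amount"] > model["cutoff"] else 0.0 for f in features]

def postprocess(scores):
    """Post-processing: business rules turn scores into alerts."""
    return ["alert" if s > 0.5 else "ok" for s in scores]

records = [{"amount": "12.00", "day": "Mon"}, {"amount": "980.00", "day": "Sun"}]
features = prepare(records)
model = estimate(features)
print(postprocess(score(model, features)))  # ['ok', 'alert']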
In this article, we define analytic infrastructure to be the applications, services, utilities and systems that are used for preparing data for modeling, estimating models, validating models, scoring, or related activities. For example, analytic infrastructure includes databases and data warehouses, statistical and data mining systems, scoring engines, grids and clouds. Note that, with this definition, analytic infrastructure need not be used exclusively for modeling; it need only be useful as part of the modeling process.

3. STANDARDS FOR ANALYTIC INFRASTRUCTURE

For data mining applications that consist of several services or components, standards provide agreed-upon interfaces for accessing the services or components. This is particularly important if you ever need to change one or more services or components in the application.

Standards for describing statistical and data mining models are also important. These types of standards provide an application- and system-independent format for persisting and interchanging models. Data mining projects often generate several models, and providing life cycle management for these models is easier if there is a standard format for them. The models in this format can then be persisted in a repository or otherwise managed. Having a standard format for models is also useful for business continuity purposes and for compliance-related requirements.
Broadly speaking, there are two basic approaches that have been used to capture an analytic process. The first approach is to think of the analytic process as a workflow, described as a directed acyclic graph (DAG), where the nodes are arbitrary code and the edges indicate that the output of one program is used as an input to another program. The advantage of this approach is that it is very general, easy to understand, and relatively simple to process. A disadvantage is that, from an archival point of view, computer code can present difficulties over time: the environment necessary to execute it may no longer be available, for example.

Another approach is to agree, via a standardization process, on the meaning of the elements necessary to define a statistical model and then to specify these in the description of the model. For example, such a standard can specify that a regression tree consists of nodes and that nodes are associated with predicates, etc. To describe a specific model, one then just needs to specify the assignments for all the elements in the model. The advantage of this approach is that the description of a model is quite explicit and, in principle, will be just as easy to understand ten years later as it was when it was created. Another advantage is that updating a model does not require testing new code, but instead just requires reading new parameters for the model. This approach can dramatically reduce the time to deploy new models. The second approach is the approach taken by the Predictive Model Markup Language (PMML) [5], which defines several hundred XML elements that can be used to describe most of the common statistical and data mining models.

4. THREE IMPORTANT INTERFACES

Table 1 describes some of the steps in the data mining process that are most relevant for analytic infrastructure.

Step           | Inputs                        | Outputs
---------------|-------------------------------|--------------------
Preprocessing  | dataset (data fields)         | dataset of features
Modeling       | dataset of features           | model
Scoring        | dataset (data fields), model  | scores
Postprocessing | scores                        | actions

Table 1: Some of the steps in building and deploying analytic models that are directly relevant to analytic infrastructure.

Interfaces for producers and consumers of models. Perhaps the most important interface in analytics is the interface between components of the analytic infrastructure that produce models, such as statistical packages, which have a human in the loop, and components that score data using models, which often reside in operational environments. Recall that the former are examples of model producers, while the latter are examples of model consumers. The Predictive Model Markup Language (PMML) is a widely deployed XML standard for describing statistical and data mining models so that model producers and model consumers can exchange models in an application-independent fashion. Many applications now support PMML. By using these applications, it is possible to build an open, modular, standards-based environment for analytics. With this type of open analytic environment, it is quicker and less labor-intensive to deploy new analytic models and to refresh currently deployed models.
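To make the declarative approach and the producer/consumer split concrete, here is a minimal sketch of a model consumer that scores a record against a tree model expressed purely as data. The XML dialect below is a simplified stand-in for PMML (real PMML uses different element and attribute names), and the fields and operators are hypothetical; the point is that deploying a new model means loading a new document, not shipping new code.

# A toy declarative tree model: the model is data, not code.
# The XML dialect is a simplified stand-in for PMML (illustrative only).
import xml.etree.ElementTree as ET

MODEL_XML = """
<TreeModel>
  <Node score="0.5">
    <Node score="0.9">
      <Predicate field="amount" op="greaterThan" value="500"/>
      <Node score="0.1">
        <Predicate field="is_weekend" op="equal" value="false"/>
      </Node>
    </Node>
  </Node>
</TreeModel>
"""

OPS = {
    "greaterThan": lambda x, v: float(x) > float(v),
    "equal": lambda x, v: str(x).lower() == v,
}

def walk(node, record):
    # Descend to the deepest child node whose predicate the record
    # satisfies; the score attribute of that node is the model's output.
    for child in node.findall("Node"):
        pred = child.find("Predicate")
        if OPS[pred.get("op")](record[pred.get("field")], pred.get("value")):
            return walk(child, record)
    return float(node.get("score"))

root = ET.fromstring(MODEL_XML).find("Node")
print(walk(root, {"amount": 980, "is_weekend": False}))  # prints 0.1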
Interfaces for producers and consumers of features. Since Version 2.0 of PMML was released in 2001, PMML has included a rich enough set of transformations that data preprocessing can be described using PMML models. Using these transformations, it is possible to use PMML to define an interface between analytic infrastructure components and services that produce features (such as data preprocessing components) and those that consume features (such as modeling applications). This is probably the second most important interface in analytics.

Interfaces for producers and consumers of scores. The current version of PMML is Version 4.0, and the PMML working group is now working on Version 4.1. One of the goals of Version 4.1 is to enable PMML to describe the post-processing of scores. This would allow PMML to be used as an interface between analytic infrastructure components and services that produce scores (such as scoring engines) and those that consume scores (such as recommendation engines). This is probably the third most important interface in analytics.

Today, by using PMML to describe these interfaces, it is straightforward for analytic infrastructure components and services to run on different systems. For example, a modeler might use a statistical application to build a model, but scoring might be done in a cloud, or a cloud might be used for preprocessing the data to produce features for the modeler.

5. ANALYTIC INFRASTRUCTURE FOR WORKING WITH LARGE DATA

Today, it is still a challenge to build models over data that is too large to fit into a database. One solution is to use grids [7]. Grids have two main limitations. First, they were not designed for working with large data per se, but rather to support virtual communities. Second, many programmers found the MPI-based programming model rather difficult to use.

An alternative approach that overcomes both of these limitations was outlined in a series of three Google technical reports [8], [6] and [3]. The architecture consists of a distributed storage system called the Google File System (GFS) [8] and a parallel programming framework called MapReduce [6]. GFS was designed to scale to clusters containing thousands of nodes and was optimized for appending and for reading data. Contrast this with a traditional relational database, which is optimized to support safe writing of rows with ACID semantics and is usually supported by a single computer (perhaps with multiple cores) over a RAID file system.

A good way to describe MapReduce is through an example. Assume that GFS manages nodes spanning several, perhaps many, racks, and that the data consists of key-value pairs. Assume that node i stores web pages p_{i,1}, p_{i,2}, p_{i,3}, ..., and that web page p_j contains words w_{j,1}, w_{j,2}, w_{j,3}, .... A basic structure important in information retrieval is an inverted index, which is a data structure consisting of a word followed by a list of web pages:
(w_1; p_{1,1}, p_{1,2}, p_{1,3}, ...)
(w_2; p_{2,1}, p_{2,2}, p_{2,3}, ...)
(w_3; p_{3,1}, p_{3,2}, p_{3,3}, ...)

with the properties:

1. The inverted index is sorted by the word w_j;
2. If a word w_j occurs in a web page p_i, then the web page p_i is on the list associated with the word w_j.

A mapping function processes each web page independently, on its local storage node, providing data parallelism. The mapping function emits multiple <key, value> pairs (<word, page id> in this example) as its outputs. This is called the Map Phase. A partition function m(w), which, given a word w, assigns a machine labeled m(w), is then used to send the outputs to multiple common locations for further processing. This second step is usually called the Shuffle Phase. In the third step, the processor m(w_i) sorts all the <key, value> pairs according to the key. (Note that multiple keys may be sent to the same node, i.e., m(w_i) = m(w_j) for i != j.) Pairs with the same key (a word in this example) are then merged to generate a portion of the inverted index <w_i; p_{x,y}, ...>. This is called the Reduce Phase. To use MapReduce, a programmer simply defines a parser for input records (called a Record Reader) and Map, Partition, Sort (or Comparison), and Reduce functions, and the infrastructure takes care of the rest.

Since many applications need access to rows and columns of data (not just the bytes of data provided by GFS), a GFS application called BigTable [3] was developed that provides data services scaling to thousands of nodes. BigTable is optimized for appending data and for reading data. Instead of the ACID requirements of traditional databases, BigTable chose an eventual consistency model.

Google's GFS, MapReduce and BigTable are proprietary and not generally available. Hadoop [11] is an Apache open source cloud that provides on-demand computing capacity and that generally follows the design described in the technical reports [8] and [6]. There is also an open source application called HBase that runs over Hadoop and generally follows the BigTable design described in [3].

Sector is another open source system designed for data intensive computing [13]. Sector was not developed following the design described in the Google technical reports; instead, it was designed to manage and distribute large scientific datasets, especially over wide area high performance networks. One of the first Sector applications was the distribution of the 10+ TB Sloan Digital Sky Survey [10]. Sector is based upon a network protocol called UDT that is designed to be fair and friendly to other flows (including TCP flows), but to use all the otherwise available bandwidth in a wide area high performance network [9]. On top of the Sector Distributed File System is a parallel programming framework called Sphere that can invoke user-defined functions (UDFs) over the data managed by Sector. From the perspective of Sphere's UDFs, MapReduce is simply the concatenation of three specific, but very important, UDFs: the Map, Shuffle and Reduce UDFs described above. These specific UDFs are available with the Sector distribution, as are several others. As measured by the MalStone Benchmark [12], Sector is approximately twice as fast as Hadoop. Table 2 contains a summary of GFS/MapReduce, Hadoop and Sector.

Design Decision                | Google's GFS, MapReduce, BigTable | Hadoop                   | Sector
-------------------------------|-----------------------------------|--------------------------|----------------------------------------------
data management                | block-based file system           | block-based file system  | dividing data into segments & using the native file system
protocol for message passing   | TCP                               | TCP                      | Group Messaging Protocol
protocol for transferring data | TCP                               | TCP                      | UDP-based Data Transport (UDT)
programming model              | MapReduce                         | MapReduce                | user-defined functions, MapReduce
replication strategy           | at the time of writing            | at the time of writing   | periodically
security                       | not mentioned                     | not yet                  | HIPAA capable
language                       | C++                               | Java                     | C++

Table 2: Some of the similarities and differences between Google's GFS/MapReduce, Hadoop and Sector.
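The Map, Shuffle and Reduce phases described above can be imitated in a few lines of single-process Python. This sketch shows only the semantics of building an inverted index, not a distributed implementation; the page contents and the number of "nodes" are made up for illustration.

# Single-process sketch of the Map/Shuffle/Reduce phases for an inverted index.
from collections import defaultdict

pages = {"p1": "big data cloud", "p2": "cloud analytics", "p3": "big analytics"}

# Map phase: process each page independently, emitting <word, page id> pairs.
emitted = [(w, pid) for pid, text in pages.items() for w in text.split()]

# Shuffle phase: a partition function m(w) routes each word to a node.
NUM_NODES = 2
partitions = defaultdict(list)
for w, pid in emitted:
    partitions[hash(w) % NUM_NODES].append((w, pid))

# Reduce phase: on each node, sort pairs by key and merge pairs that share
# a word into one posting list, yielding a portion of the inverted index.
index = {}
for node in partitions:
    for w, pid in sorted(partitions[node]):
        index.setdefault(w, []).append(pid)

print(index)  # {'big': ['p1', 'p3'], 'cloud': ['p1', 'p2'], ...}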
6. CLOUD-BASED ANALYTIC INFRASTRUCTURE SERVICES

Over the next several years, cloud-based services will begin to impact analytics significantly. Although there are security, compliance and policy issues to work out before it becomes common to use public clouds for analytics, I expect that these and related issues will be worked out over the next several years.

Cloud-based services provide several advantages for analytics. Perhaps the most important is elastic capacity: if 25 processors are needed for one job for a single hour, then they can be used for just that hour and no more. The ability of clouds to handle this type of surge capacity is important for many groups that do analytics. With the appropriate surge capacity provided by clouds, modelers can be more productive, and in many cases this can be accomplished without requiring any capital expense. (Third-party clouds provide computing capacity that is an operating expense, not a capital expense.)

Packaging. One of the advantages that cloud instances provide is that all the required system software, application software, auxiliary files, and data can be integrated into a single machine image [1]. This can sometimes significantly simplify the process required to prepare data, build models, or score data.

Preparing data. For interactive exploration of small datasets, for preparing a single dataset for analysis, and for related tasks, single workstations with desktop applications usually work just fine and are probably the easiest to use. On the other hand, for the pipelined analysis of multiple datasets with the same pipeline, a cloud that provides on-demand computing instances offers several advantages. By a pipeline here, we mean a sequence or workflow of computations where the inputs of one computation are the outputs of one or more prior computations. Such computations occur quite frequently in bioinformatics, for example. Once the pipeline is prepared and an appropriate computing instance is defined, handling new datasets in the future is quite simple: when a new dataset arrives, simply invoke a new computing instance, supply the dataset, and save the results of the analysis. With this approach, computing instances can easily manage the surge requirements that arise when multiple datasets must be processed at the same time. To handle such a surge, simply invoke a new computing instance for each dataset so that the datasets can be processed in parallel, as in the sketch below. MapReduce on clouds that provide on-demand computing capacity has proven particularly effective for preparing data in which large numbers of relatively small records have to be prepared for modeling, such as web pages [6], transactional business data [2], and network packet data.
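As a rough sketch of this pattern, the following Python fragment processes each arriving dataset with its own worker. Locally the workers are processes; in a cloud, each task would instead invoke its own pre-configured on-demand instance. The directory name and the pipeline body are assumptions for illustration.

# Sketch of surge handling: one independent worker per arriving dataset.
from concurrent.futures import ProcessPoolExecutor
import glob

def run_pipeline(path):
    # Stand-in for the prepared pipeline: read one dataset, save one result.
    with open(path) as f:
        n = sum(1 for _ in f)               # placeholder computation
    with open(path + ".result", "w") as out:
        out.write("%d records\n" % n)
    return path + ".result"

if __name__ == "__main__":
    datasets = glob.glob("incoming/*.csv")  # hypothetical arrival directory
    # One worker per dataset: locally these are processes; in a cloud, each
    # task would launch its own pre-configured on-demand computing instance.
    with ProcessPoolExecutor() as pool:
        for result in pool.map(run_pipeline, datasets):
            print("finished", result)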
Building models. For very large datasets, several traditional statistical and data mining algorithms have been reformulated as MapReduce jobs, which enables Hadoop and similar systems to be used for computing statistical models on very large datasets (see, for example, [4]). MapReduce and similar programming models enable basic statistical models to be built on very large datasets with very little effort, something that was not so easy just a short time ago. For example, the technical report [2] describes how a statistical model to identify potentially compromised web sites was built on log files containing over 10 billion records on a 20-node cluster using just a few lines of MapReduce code.

Once a source of data is well understood, it is sometimes necessary to estimate models for different instantiations of the data. For example, this is a common scenario for some analytic models used in online systems: a separate model may be required for each visitor to a web site or each user of an online application. In this case, given data from the user, an on-demand computing instance can be invoked to build a visitor-specific or user-specific model.

Scoring models. As mentioned above, once a model has been estimated, it is usually straightforward to apply the model to a file or stream of data to produce scores (the outputs of the model). To say it another way, it is usually easier to develop a model consumer than a model producer. Given a dataset and a model described in PMML, it is straightforward to create an on-demand computing instance that can score the data using the model. An architecture like this, which i) separates model producers and model consumers and ii) scores data using on-demand computing instances, provides three important advantages:

1. First, with this approach, the computing instance can be pre-configured so that essentially no knowledge is required by the system scoring the data other than the name of the pre-configured computing instance, the name of the data file, and the name of the PMML file. In contrast, estimating a statistically valid model using a statistical or data mining package often requires considerable expertise.

2. Second, by using on-demand computing instances for scoring, it is straightforward to invoke multiple computing instances so that multiple datasets can be scored at the same time (in other words, to handle surges).

3. Third, it is also straightforward to score very large datasets by partitioning them and scoring each partition in parallel using on-demand computing instances. It is often not straightforward to estimate models by partitioning the data, but it is almost always easy to score large datasets by partitioning, as the sketch below illustrates.
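A minimal sketch of the third advantage follows; score_partition is a trivial stand-in for a PMML-driven scorer, and the dataset and cutoff value are invented. The dataset is split into partitions and each partition is scored independently and in parallel (locally with processes; in a cloud, with one on-demand instance per partition).

# Sketch of scoring a large dataset by partitioning it and scoring each
# partition in parallel.
from concurrent.futures import ProcessPoolExecutor

def score_partition(rows):
    # Stand-in for a PMML-driven scorer: in practice, the parameters would
    # be read from the PMML file named when the instance is invoked.
    cutoff = 500.0
    return [1.0 if r > cutoff else 0.0 for r in rows]

def partition(rows, k):
    # Split the dataset into k contiguous partitions of nearly equal size.
    size = (len(rows) + k - 1) // k
    return [rows[i:i + size] for i in range(0, len(rows), size)]

if __name__ == "__main__":
    data = [12.0, 980.0, 44.0, 700.0] * 1000    # illustrative dataset
    with ProcessPoolExecutor() as pool:
        chunks = list(pool.map(score_partition, partition(data, 4)))
    scores = [s for chunk in chunks for s in chunk]
    print(len(scores), "scores")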
7. SUMMARY

In this article, we used the term analytic infrastructure to refer to the tools, services, applications and platforms that support the analytic process. Except for the most basic models, building models requires specialized software. For example, for many years, databases have been an important component of analytic infrastructure, both to manage data and to help prepare data for modeling. More recently, a variety of other analytic tools and services have become common, including scoring engines to deploy models, workflow systems for managing analytic tasks that need to be repeated, and cloud-based services for a variety of tasks, including data preparation, modeling, and scoring. With the proper analytic infrastructure, models can be built more quickly, deployed more easily, and updated more efficiently. For these reasons, the proper analytic infrastructure is often the difference between a successful and an unsuccessful analytic project.

8. REFERENCES

[1] Amazon. Amazon Web Services. aws.amazon.com.
[2] J. Andersen, C. Bennett, R. L. Grossman, D. Locke, J. Seidman, and S. Vejcik. Detecting sites of compromise in large collections of log files. Submitted for publication.
[3] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a distributed storage system for structured data. In OSDI '06: Seventh Symposium on Operating System Design and Implementation, 2006.
[4] C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun. Map-Reduce for machine learning on multicore. In NIPS, volume 19, 2007.
[5] Data Mining Group. Predictive Model Markup Language (PMML), version 4.0. www.dmg.org.
[6] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI '04: Sixth Symposium on Operating System Design and Implementation, 2004.
[7] I. Foster and C. Kesselman. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco, California, 2004.
[8] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 29-43, New York, NY, USA, 2003. ACM.
[9] Y. Gu and R. L. Grossman. UDT: UDP-based data transfer for high-speed wide area networks. Computer Networks, 51(7), 2007.
[10] Y. Gu, R. L. Grossman, A. Szalay, and A. Thakar. Distributing the Sloan Digital Sky Survey using UDT and Sector. In Proceedings of e-Science 2006, 2006.
[11] Hadoop. Welcome to Hadoop! hadoop.apache.org.
[12] MalGen: a utility for generating synthetic site-entity log data for testing and benchmarking data intensive applications. code.google.com/p/malgen.
[13] Sector. sector.sourceforge.net.