FOUNDATIONS OF A CROSS-DISCIPLINARY PEDAGOGY FOR BIG DATA
Joshua Eckroth
Stetson University
DeLand, Florida
[email protected]

ABSTRACT

The increasing awareness of big data is transforming the academic and business landscape across many disciplines. Yet, big data programming environments are still too complex for non-programmers to utilize. To our knowledge, only computer scientists are ever exposed to big data processing concepts and tools in undergraduate education. Furthermore, non-computer scientists may lack sufficient common ground with computer scientists to explain their specialized big data processing needs. In order to bridge this gap and enhance collaboration among persons with big data processing needs and persons who are trained in programming and system building, we propose the foundations of a cross-disciplinary pedagogy that exposes big data processing paradigms and design decisions at an abstract level. With these tools, students and experts from different disciplines can more effectively collaborate on solving big data problems.

1. INTRODUCTION

Data is growing at an exponential rate. This growth brings new opportunities for data mining and analysis, but also brings significant challenges in data storage and processing. Many fields of study, including business intelligence and analytics [4], health care [8], and social science [6], are beginning to explore how to make use of big data. However, fundamental processing paradigms for big data, such as parallel computation, are difficult for first-year computer science students to master [3], and we expect that students in other disciplines with little computer science background have at least as much difficulty. Additionally, complex programming is normally required to build and execute a big data processing job. This leaves non-computer scientists at a disadvantage and possibly uncertain about how to design a big data processing task and communicate this design to experts who are capable of programming and executing the design.

As computer scientists and educators, we have the requisite background knowledge for understanding big data processing and we have experience teaching complex computer science subjects. To our knowledge, few courses outside of computer science departments expose students to parallel and distributed processing techniques. Typically, non-computer science students are taught how to use Microsoft Excel, SPSS, Stata, and R to process data on a single machine. These techniques simply do not work on big data, though they are useful for analyzing data
that result from the aggregation and summarization of big data. This work gives the foundations of a pedagogy for big data for non-computer scientists and computer scientists alike.

Big data has a variety of definitions. Our definition is adapted from Russom [9]. It makes use of three other terms: data volume, which represents the size (perhaps in bytes) of the data; data velocity, which represents the speed (perhaps in bytes/second) at which data is arriving; and commodity machine, which identifies a computer or virtual machine in a cloud computing service that has moderate specifications and cost. Commodity machines in 2015 are, roughly speaking, capable of storing 5-10 terabytes and have up to 32 GB of RAM. Most importantly, commodity machines are not supercomputers. We define big data as stated below.

A data mining/analysis task may be described as big data if the data to be processed have such high volume or velocity that more than one commodity machine is required to store and/or process the data.

The key aspect of this definition is the phrase "more than one." If the data fit on one commodity machine, or can be processed in real time by one commodity machine, then big data processing tools are not required and, in fact, just introduce additional overhead. Identifying what is and is not big data is a key skill for students and practitioners so that they may avoid introducing complexity and overhead where it is not necessary.

The big data paradigms we present below support processing and data storage operations that span several machines. Breaking up a computation across several machines is nontrivial, and is the sole reason big data paradigms are nonintuitive and should be taught with paradigmatic cases.
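As a minimal illustration of this definition, the following Python sketch encodes the "more than one commodity machine" test as a function. The storage figure follows the rough 2015 numbers above; the throughput figure and the function name are purely our own assumptions for the example.

    # Sketch: the "more than one commodity machine" test from the definition
    # above. The storage figure follows the rough 2015 numbers in the text;
    # the throughput figure is only an assumption for illustration.
    COMMODITY_STORAGE_BYTES = 10 * 10**12       # roughly 10 TB of local disk
    COMMODITY_THROUGHPUT_BPS = 100 * 10**6      # assumed sustainable rate, bytes/second

    def is_big_data(volume_bytes, velocity_bytes_per_sec):
        """True if one commodity machine cannot store the data (volume)
        or cannot keep up with its arrival rate (velocity)."""
        too_big = volume_bytes > COMMODITY_STORAGE_BYTES
        too_fast = velocity_bytes_per_sec > COMMODITY_THROUGHPUT_BPS
        return too_big or too_fast

    print(is_big_data(volume_bytes=2 * 10**9, velocity_bytes_per_sec=0))     # False: fits on one machine
    print(is_big_data(volume_bytes=50 * 10**12, velocity_bytes_per_sec=0))   # True: several machines needed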
The main contributions of this paper are: (1) an outline for a pedagogy for big data; (2) big data paradigms that cover a wide range of tasks and give a common language for collaborating; (3) a decision tree for helping decide which paradigm to apply in a particular situation.

The rest of this paper is organized as follows. Section 2 discusses our goal for big data pedagogy. Section 3 identifies three processing paradigms and demonstrates directed computation graphs. This section also gives a decision tree for helping students and practitioners decide which paradigm is appropriate for a given situation. Section 4 walks through a concrete example. Finally, Section 5 explores related work and Section 6 offers concluding remarks.

2. BIG DATA PEDAGOGY

Our goal in developing a cross-disciplinary pedagogy for big data is to teach students from a variety of disciplines, including computer science among others, about the design of big data processing tasks. We expect non-computer science students to generate artifacts that abstractly illustrate a processing design in the form of directed computation graphs (demonstrated in the figures below). We expect computer science students would also generate such computation graphs but may also be asked to actually implement and test the design on a cluster of machines.

A big data pedagogy should explain and continually reinforce the following general design process:

1. Identify a question to ask about the data. Different disciplines may be concerned with very different kinds of questions.
2. Determine data sources and their volume and velocity. Identify how many commodity machines may be needed to store and process the data. The outcome of this step serves as constraints for future steps.
3. Identify a data sink for the output data. Normally, the output data is very small in volume and/or velocity compared to the input data.
4. Decide if big data processing techniques are necessary. If not, choose a traditional tool (e.g., Microsoft Excel, Stata, etc.). The decision tree at the bottom of Section 3 may help.
5. Design a directed computation graph by adapting a big data paradigm from Section 3, as indicated by the decision tree.
6. (Possibly) implement the solution. Programming expertise may be required for this final step in the process.

Our learning objectives are as follows. Students from various disciplines will be able to:

LO1: determine whether a data processing task requires big data tools and techniques, or can be done with traditional methods;
LO2: identify appropriate big data paradigms to solve the problem;
LO3: design an abstract representation, in the form of a directed graph, of the required processing steps and identify input and output formats of each step (one possible way to record such a graph is sketched below).

In the following sections, we introduce paradigms that serve as the foundation of the pedagogy. We then illustrate one of these paradigms with an example.
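As one minimal illustration of the kind of abstract artifact LO3 asks for, the following Python sketch records processing steps and the directed edges between them. The class and field names are our own and are not prescribed by the pedagogy.

    # Sketch: recording a directed computation graph of processing steps.
    # Class and field names are illustrative only.
    from dataclasses import dataclass, field

    @dataclass
    class ProcessingStep:
        name: str                 # what the step does, e.g. "find local maximum"
        input_format: str         # e.g. "one numeric value per line"
        output_format: str        # e.g. "a single number"
        downstream: list = field(default_factory=list)

        def feeds(self, other):
            """Add a directed edge from this step to a downstream step."""
            self.downstream.append(other)
            return other

    # A tiny two-stage graph: three per-machine steps feeding one aggregator.
    aggregate = ProcessingStep("global aggregate", "partial results", "final answer")
    local_steps = [ProcessingStep(f"local processing on machine {i}",
                                  "local data split", "partial result")
                   for i in range(1, 4)]
    for step in local_steps:
        step.feeds(aggregate)
        print(step.name, "->", step.downstream[0].name)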
3. BIG DATA PARADIGMS

We have identified three big data paradigms that cover a wide range of big data processing tasks. The first two are adapted from Kumar [7]. These paradigms are illustrated with directed computation graphs, which indicate processing steps along with data sources and sinks, identified by arrows on the left and right (respectively) of each processing step. Figure 1 shows a processing step. The key concept of a processing step is that the machine processes only those data that are already stored on the machine's disk and/or data that are made available as inputs from other steps or an external data source. Machines do not communicate with each other except as indicated by the arrows. Furthermore, each machine is assumed to be a commodity machine with modest computational and storage capabilities. Thus, no single machine is able to support a big data processing task on its own. The challenge for students is to design a solution to a big data processing problem that splits the task into distinct processing steps with appropriate inputs and outputs, while limiting communication among machines and exploiting parallel processing.

Figure 1: A processing step for directed computation graphs.

Processing High Volume Data

The fundamental challenge with high volume data is that the data are initially split and stored across several machines. Thus, an efficient processing strategy would task each machine with processing the data that are already stored on that machine, and only those data. The MapReduce architecture, commonly used with the Hadoop platform [1], breaks the task into two distinct phases. During the Map phase, each machine that holds a subset of the data is tasked with processing that subset. A MapReduce coordinator (which typically runs on yet another machine) waits until each Map job is complete for all the machines, and then delivers the results from each Map job to machines that are tasked with aggregating the results (perhaps reusing machines from the Map jobs). This second Reduce phase is provided the results of the Map jobs and is responsible for producing the final output of the entire processing task. Figure 2 shows an example directed computation graph for the MapReduce architecture. A concrete MapReduce processing example is detailed later.

Figure 2: High volume, batch processing paradigm.
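To show the shape of the two phases without any Hadoop machinery, the following Python sketch simulates the paradigm in a single process: each inner list stands for the data already stored on one machine, and counting words is only a stand-in task.

    # Sketch: a single-process simulation of the MapReduce paradigm (this is
    # not Hadoop code). Each inner list stands for the data already stored on
    # one machine; counting words is only a stand-in task.
    from collections import Counter

    splits = [
        ["big data needs many machines"],          # data on machine 1
        ["one machine is not big data"],           # data on machine 2
        ["machines process their local data"],     # data on machine 3
    ]

    def map_job(local_lines):
        """Map phase: runs on each machine over its own split only."""
        counts = Counter()
        for line in local_lines:
            counts.update(line.split())
        return counts                               # partial result for the coordinator

    def reduce_job(partial_results):
        """Reduce phase: runs after every map job has finished."""
        total = Counter()
        for partial in partial_results:
            total.update(partial)
        return total

    partials = [map_job(split) for split in splits]   # in reality these run in parallel
    print(reduce_job(partials)["data"])               # -> 3

A real platform such as Hadoop additionally handles the data distribution, job scheduling, and fault tolerance that this simulation ignores.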
Processing High Velocity Data

The fundamental challenge with high velocity data is that the data are arriving faster than any one machine can process them. Thus, the processing must be handled by several machines. The data are not necessarily stored on any machine, though such configurations are possible. A more common design challenge for high velocity data processing is deciding which machine performs which subtask and how these subtasks (and therefore machines) coordinate. The directed computation graph indicates how the data are streamed through various transformations and aggregations. No particular computation graph is more common than any other for stream processing; its structure depends entirely on the task. This makes high velocity stream processing more challenging for students and instructors alike, because it is less constrained, and it should be taught after introducing the more constrained MapReduce architecture. Figure 3 shows one such example of a directed computation graph for the stream processing paradigm.

Figure 3: High velocity, stream processing paradigm.
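Because the structure depends entirely on the task, the following Python sketch shows only one possible transform/filter pipeline, simulated with generators in a single process. The stage names and the threshold are invented for illustration; in a real deployment each stage would run on its own machine.

    # Sketch: one possible transform/filter pipeline for high velocity data,
    # simulated with Python generators in a single process. The stage names
    # and the threshold are invented for illustration.

    def source(readings):
        """Stand-in for data arriving faster than one machine can handle."""
        for reading in readings:
            yield reading

    def transform(stream):
        """Normalize each record as it streams past."""
        for value in stream:
            yield float(value)

    def keep_large(stream, threshold=10.0):
        """Filter: only records at or above the threshold travel downstream."""
        for value in stream:
            if value >= threshold:
                yield value

    def sink(stream):
        """Aggregate whatever reaches the end of the graph."""
        return sum(stream)

    incoming = ["3", "12.5", "9", "40"]
    print(sink(keep_large(transform(source(incoming)))))   # -> 52.5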
Merging Data

When two or more datasets must be merged within a big data processing task, one is faced with one of two scenarios. In the first scenario, the data to be merged are small enough to fit in memory or on disk on a single machine. In this case, the merge may be performed by each machine that needs the data. For example, in the MapReduce paradigm, the merge may be performed during the Map stage. On the other hand, if the data to be merged are themselves big data, then no single machine is capable of performing the merge, so the merge must occur piecewise across many machines. In the MapReduce paradigm, such a merge would occur during the Reduce stage. The details of merging are outside the scope of this report, but we note that merging data is a common task and should be explored in a big data curriculum.

Decision Tree

We have described three paradigms for big data tasks. When designing a solution to a problem, students should evaluate first whether their problem requires big data processing, and if so, which paradigm to apply. The following decision tree may be offered as a tool to aid this design challenge.

1. Can the data be processed in R, SPSS, SAS, Stata, etc.?
   - Yes: Avoid big data technologies; use traditional methods.
   - No: Are the data high volume or high velocity?
     - High volume: Does more than one data set need to be merged?
       - Yes: Is more than one data set big data?
         - Yes: Design a join in the Reduce stage of MapReduce.
         - No: Design a join in the Map stage of MapReduce.
       - No: Design a simple MapReduce processing job.
     - High velocity: Decompose the processing into transform/filter stages.
     - Both high volume and velocity: Design a streaming processing job that saves the resulting data in splits across multiple machines. Then process the saved data with MapReduce.

4. CONCRETE EXAMPLE

We now develop a concrete example that is appropriate for a first exploration of the high-volume, batch processing paradigm. The task is to find the maximum value of a single-column numeric data set with one trillion records, stored across several machines. First, we look at some bad ideas:

- Use Excel. This is a bad idea because the data are too voluminous to be stored on a single machine.
- Read through all the values on machine 1 and record the maximum. Then read all the values on machine 2 and record the maximum. And so on for all the machines with data. Then find the maximum of all the stored maximums from the subsets. This is a bad idea because the procedure does not exploit parallelism. The maximum on each subset of data may be found simultaneously since the machines need not communicate with each other.

A good idea would be to start with the decision tree. Based on the decision tree, we conclude that the task is a big data task and it is high volume. There is only one data set, so no merge is required. Thus, we should design a MapReduce job. Each machine would perform a Map stage, in parallel. During the Map stage, the maximum value for the data stored on that machine is found and communicated to the MapReduce manager. Once all the machines are finished finding their subset maximum, the maximum of this collection of maximums is found (by yet another machine or by reusing a machine). Figure 4 shows the directed graph of computation for this task. Notice that it is a specialization of the high volume, batch processing paradigm (Figure 2).

Figure 4: Directed graph of computation for the maximum value example.
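In the same simulation style used earlier, the following Python sketch is a single-process stand-in for this design: each inner list models the values already stored on one machine, the Map stage finds each local maximum, and the Reduce stage takes the maximum of the collected maximums.

    # Sketch: the maximum-value task, simulated in one process. Each inner
    # list stands for the values already stored on one machine; the map jobs
    # never communicate with each other, so they could run in parallel.
    machine_splits = [
        [17, 4, 92, 33],       # values stored on machine 1
        [8, 120, 56],          # values stored on machine 2
        [73, 73, 2, 41],       # values stored on machine 3
    ]

    def map_local_max(values):
        """Map stage: each machine reports the maximum of its own subset."""
        return max(values)

    def reduce_global_max(local_maxima):
        """Reduce stage: take the maximum of the reported maximums."""
        return max(local_maxima)

    local_maxima = [map_local_max(split) for split in machine_splits]
    print(reduce_global_max(local_maxima))    # -> 120

The structure mirrors Figure 4: one local-maximum step per machine feeding a single aggregation step.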
These paradigms and the corresponding decision tree form the foundation of our pedagogy for big data. The paradigms serve as an abstract representation of big data processing tasks that students from a wide variety of disciplines should be able to understand and design. The decision tree assists in the design of big data jobs and may be further elaborated as more paradigms are introduced and specialized.

5. RELATED WORK

According to our research, there are no published reports documenting attempts to teach non-computer science students how to design big data processing tasks. Silva et al. [10] explored pedagogy for computer science students. They identified different categories of big data management systems (BDMS) for splitting big data across several machines, and developed guidelines for deciding which to use in various situations. They designed big data processing exercises for each BDMS. These exercises required strong programming skills. They found that diagrams, similar to the diagrams presented above, helped students understand big data processing steps.

Dobre and Xhafa [5] explore a wide variety of big data paradigms and tools. Their report is more detailed and extensive than we consider appropriate for non-computer scientists. However, it may help instructors expand the big data paradigms presented here to cover more subtle and sophisticated techniques. One such technique is Map-Reduce-Merge, which adds a Merge step after the normal MapReduce. This Merge step can combine the results of multiple parallel MapReduce jobs that operate on and produce heterogeneous data. They also explore the history of big data processing and survey a wide variety of software tools for actually programming and building big data jobs.

6. CONCLUSIONS

This work introduced foundations for a big data pedagogy that is appropriate for students across various disciplines. We introduced a common language that may serve as the basis for collaboration. This language includes terms like big data, data volume, data velocity, batch processing / MapReduce, stream processing, and
merging. We exposed three paradigms of big data processing and provided a decision tree for making sense of when to use each paradigm.

We hope that researchers and developers will continue to expand the big data ecosystem. In particular, we hope that the tools are simplified to the point that non-computer scientists may implement and execute big data jobs without assistance from big data experts. We believe that a simplified programming language such as Pig Latin, from the Apache Pig project [2], may be a step in the right direction.

ACKNOWLEDGEMENTS

We wish to thank the anonymous reviewers for their helpful feedback.

REFERENCES

[1] Apache Hadoop, The Apache Foundation.
[2] Apache Pig, The Apache Foundation.
[3] S. A. Bogaerts. Limited time and experience: Parallelism in CS1. IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW).
[4] H. Chen, R. H. Chiang, and V. C. Storey. Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4).
[5] C. Dobre and F. Xhafa. Parallel programming paradigms and frameworks in big data era. International Journal of Parallel Programming, 42:710-738.
[6] S. González-Bailón. Social science in the era of big data. Policy & Internet, 5(2):147-160.
[7] R. Kumar. Two computational paradigms for big data. KDD Summer School on Mining the Big Data (ACM SIGKDD).
[8] T. B. Murdoch and A. S. Detsky. The inevitable application of big data to health care. JAMA, 309(13).
[9] P. Russom. Big data analytics. TDWI Best Practices Report, Fourth Quarter.
[10] Y. N. Silva, S. W. Dietrich, J. M. Reed, and L. M. Tsosie. Integrating big data into the computing curricula. SIGCSE, 2014.
