
Lightweight Stack for Big Data Analytics

Muhammad Asif Saleem

Dissertation 2014
Erasmus Mundus MSc in Dependable Software Systems
Department of Computer Science
University of St Andrews

A dissertation submitted in partial fulfillment of the requirements for the Erasmus Mundus MSc in Dependable Software Systems

Head of Department: Dr. Steve Linton
Supervisor: Dr. Adam Barker

June 2014

Declaration

I hereby certify that this material, which I now submit for assessment of the program of study leading to the award of Master of Science in Dependable Software Systems, is entirely my own work and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work, and was performed during the current academic year except where otherwise stated. The main text of this project report is words long.

In submitting this project report to the University of St Andrews, I give permission for it to be made available for use in accordance with the regulations of the University Library. I also give permission for the title and abstract to be published and for copies of the report to be made and supplied at cost to any bona fide library or research worker, and to be made available on the World Wide Web. I retain the copyright in this work.

If there is a strong case for the protection of confidential data, the parts of the declaration giving permission for its use and publication may be omitted with the prior permission of the Erasmus Mundus DESEM Program Coordinator.

Acknowledgements

I would like to express my gratitude to my supervisor Dr. Adam Barker for his keen attention, guidance and encouragement. I am also thankful to my co-supervisor Dr. Blesson Varghese for his creative ideas, suggestions and moral support over the duration of this work. I would also like to extend my gratitude to all of my lecturers at the university for the knowledge, skills and support they imparted to me.

Abstract

A wide variety of Big Data sets are employed in social science, physical science and the business world. Analytics on such data can make predictions and estimations that were not previously possible. While there are technologies to support big data analytics, it is not easy for a non-specialist computer user, whether a social scientist, a physical scientist or a business analyst, to make use of them without significant big data expertise. The aim of this project is to develop a lightweight stack that supports technologies for big data analytics.

Contents

1 :-: Introduction
  1.1 Big Data
  1.2 Problem Description
  1.3 Motivation
  1.4 Goals of this research
  1.5 Contribution of this thesis
  1.6 Overview of the thesis

2 :-: Related Work
  2.1 HIVE Web based interface
  2.2 IBM InfoSphere BigInsights
  2.3 SAP Big Data Analytics
  2.4 TERADATA Big Data Analytics
  2.5 Cloudera Big Data Solutions
  2.6 HortonWorks
  2.7 Amazon Big Data Analytics Platform
  2.8 1010data Big Data Platform
  2.9 Oracle Big Data Platform
  2.10 Hewlett Packard Big Data Platform
  2.11 Summary

3 :-: Big Data Processing Technologies
  3.1 Hadoop
  3.2 Hive
  3.3 Summary

4 :-: Development Process & Solution Design
  4.1 Problem Statement Revisited
  4.2 Development Process
  4.3 Solution Requirements Specifications
  4.4 Solution Design
  4.5 Solution and Research Objectives
  4.6 Summary

5 :-: Solution Implementation
  5.1 Implementation Decisions & Development
  5.2 Web Services
  5.3 Data Browser Implementation (User Interaction Layer)
  5.4 Query Management Layer Implementation
  5.5 Summary

6 :-: Case Studies and Experiments
  6.1 Case Study 1
  6.2 Case Study 2
  6.3 Summary

7 :-: Evaluation
  7.1 Comparison with related work
  7.2 Performance Evaluation
  7.3 Summary

8 :-: Conclusion & Future Work
  8.1 Conclusions
  8.2 Future Work

List of Figures

Figure I MapReduce Programming Model
Figure II Hadoop Architecture with two Nodes
Figure III Hive Architecture [Ref Figure II]
Figure IV Hive MapReduce Job Process [25]
Figure V Web based framework for Big Data exploration in real time
Figure VI Buzz Score From May
Figure VII Buzz Score From June
Figure VIII Daily Aggregated Buzz Scores From 01 April 2005 to 31 July
Figure IX Comparison of Actual and Predicted Buzz Scores using EP technique
Figure X Comparison of Actual and Predicted Buzz Scores using RP technique
Figure XI Actual and Predicted Score with 4 week sample
Figure XII Actual and Predicted Score with 8 week sample
Figure XIII Number of bird flu related statements found in the corpus
Figure XIV Frequency of related statements found in the corpus
Figure XV Frequency of related statements found in the corpus

List of Tables

Table 2-1 Comparison of different Big Data Platforms/Frameworks
Table 4-1 Research Goals and proposed solution
Table 5-1 REST vs SOAP [35]
Table 6-1 % Error between predicted and actual buzz score
Table 6-2 Correlation between actual and predicted scores
Table 7-1 Comparisons between related work and BigExcel
Table 7-2 Performance Analysis


1 :-: Introduction

Big Data technologies are transforming the way data is analyzed. One reason is the massive amount of data being generated from different sources such as social networks, sensors, search engines, banks, telecommunications and the web; handling this massive amount of data takes us into the era of Big Data. According to YouTube statistics, 100 hours of video are uploaded to YouTube [1] servers every minute. Facebook deals with more than 500 terabytes of data daily [2]. Companies such as Google and Yahoo record search engine results to analyze search trends, crawl different web sources to detect important events, and gather marketing data to analyze current and future trends, all of which results in the generation of large data sets, also referred to as Big Data.

Data is everywhere, from the social sciences to the physical sciences and the business and commercial world. For example, digitizing the past fifty years' newspapers results in a massive amount of data; in astronomy, storing billions of astronomical objects, and in biology, storing genes, proteins and small molecules, results in massive amounts of data. In the business world, handling millions of call data records in telecommunications, millions of transactions in banking and millions of transactions for a multinational grocery store results in large data sets. Analyzing these large data sets and extracting meaningful information from them is a challenge in itself.

1.1 Big Data

Big Data can be described in terms of three Vs: variety, volume and velocity [3].

Variety

Data comes in different variations, for example semi-structured or unstructured: data generated from web sites, social networks, sensors and web logs is unstructured. Structured data includes, for example, call data records converted to tabular format in order to calculate their monetary value, bank transaction data, and data generated by airline ticketing systems. These are all different varieties of data.

Volume

Volume refers to the amount of data, or the size of the data set. Nowadays figures are in terabytes and petabytes. For instance, an Airbus can generate half a terabyte of data in one flight [4].

Velocity

Velocity refers to the speed at which data is generated, which is very high nowadays. For example, weather sensors keep generating data as new readings arrive, Twitter generates data at 9,100 tweets per second, and Facebook users send 3 million messages to each other every 20 minutes [5].

There are different technologies to analyze Big Data. Hadoop [6] is one of the most popular among them. There are many others, such as MongoDB [7], a NoSQL database, and warehousing technologies such as Hive and HBase [6], which can be used in conjunction with Hadoop to ease analysis and provide an abstraction over the Hadoop platform.

1.2 Problem Description

There are different technologies for Big Data analysis, but most of them are complex and require expertise to use. For non-computer scientists such as social scientists in particular, they demand good programming skills and knowledge of configuring and maintaining the infrastructure, which makes it almost impossible for them to explore large data sets or to perform ad hoc analytics on them. For example, how can a social scientist explore fifty years of newspaper data to find an event in 1975? Or how can a social scientist predict human behavior by analyzing the previous five years of data gathered from different sources such as cell phone records with GPS tracking, search engine queries, internet transaction data, consumer behavior or social network activity? [8]

There are plenty of Big Data analysis platforms and frameworks available in the market nowadays, but the problem for non-computer scientists is mastering them, because of the complexity involved and the training required to use them for exploring Big Data in an ad hoc manner and performing analytics on large data sets. These systems also inherit the problem of maintenance, at both the application and infrastructure level.

1.3 Motivation

The motivation behind this research is to assist non-computer scientists such as social scientists to explore large data sets and to perform ad hoc analytics without requiring any technical expertise in Big Data technologies or in maintaining and configuring the underlying infrastructure. Nowadays Big Data is an important area to look at within every field, whether social or scientific. Experts such as social scientists who do not have a technical background in handling computer software systems, especially Big Data analytics platforms, cannot get insights into large data sets as they do not have enough technical expertise to deal with these systems.

One of the motivations behind this research is to give access to Big Data exploration to scientists who want to analyze information for cutting-edge research, or who want to predict future trends in their respective fields by analyzing the large data sets generated from a particular event or from a group of events. Technological advancements in Big Data technologies give computer scientists and commercial companies the opportunity to look inside large data sets, but these technologies are too complex for non-computer scientists, which leads to the need for a simple, understandable framework that hides the complexity of Big Data technologies for exploration and analytics.

1.4 Goals of this research

The goals of this research are to enable Big Data exploration in an ad hoc manner by facilitating the management of user interaction with large data sets and by providing an abstraction over the deployment, configuration and maintenance of the underlying infrastructure and Big Data technologies. Frameworks such as BigExcel 1 will help non-expert users such as social scientists to harness Big Data in just a few clicks and with little effort. For instance, if a social scientist wants to explore one year's worth of language data gathered from different web sources to find the most popular event of that year and draw conclusions for their research, the scientist can do so by accessing a simple application and following a few simple steps for loading and analyzing the data. All state-of-the-art Big Data platforms and frameworks are either commercial or too complex for non-technical users, and hence they are out of reach of non-computer scientists without considerable training. The goal of BigExcel is to address this issue by developing a framework for non-computer scientists.

1.5 Contribution of this thesis

The contributions of this thesis are:
1- A lightweight web based framework for exploration and analytics of large data sets in an ad hoc manner
2- Abstraction for non-computer scientists such as social scientists over using Big Data technologies
3- Abstraction for non-computer scientists such as social scientists over configuring and maintaining the underlying infrastructure for Big Data technologies
4- Access to Big Data exploration for non-computer scientists such as social scientists
5- Real time analytics on large data sets

1 Framework developed under this research

1.6 Overview of the thesis

In this chapter, we discussed what Big Data is, how it is generated and the challenges it brings with it. We discussed the problem statement, research goals and contributions of this work. In the next chapter we present the related work and a comparison between different Big Data platforms/frameworks, followed by an overview of Big Data technologies (Hadoop 2 and Hive 3) in chapter three. In chapter four we present the design of the lightweight web based framework BigExcel, followed by the implementation of the framework in chapter five. In chapter six we present two case studies: one demonstrates a predictive analytics use case and the other demonstrates the exploration of a large language corpus. These two data sets are taken from the Yahoo research labs sandbox [9]. In chapter seven we evaluate our work by comparing it with the related work and by analyzing the performance and usability of the framework. In chapter eight, we conclude our work and provide future directions for this research.

2 Big Data processing technology based on the MapReduce programming model
3 Database for handling large data sets with SQL as an interfacing language

2 :-: Related Work

In this chapter, we discuss the related work by analyzing tools and frameworks which are available to explore Big Data and are capable of performing analytics. We look at these tools and frameworks with respect to the difficulty and expertise required to use them by non-expert users, and also whether they support the exploration of Big Data or not.

2.1 HIVE Web based interface

Hive has an interactive web interface which is designed for administering Hive as well as for querying the database. Users can create and delete tables and browse the database schema. In addition, users can execute queries by submitting them through the web based interface. It is not really an analytical platform; it was developed so that Hive could be used easily and interactively through a web interface. The web interface is easy to use, but it is still too technical for non-technical users to create and browse schemas and to perform other tasks such as starting and stopping Hive. Also, for the Hive web interface to work, the user must have Hive configured and deployed on their computer system, which in turn requires Hadoop for processing; this makes it very complex for non-expert users, with no support for analytics.

2.2 IBM InfoSphere BigInsights

InfoSphere BigInsights is a Big Data analytics platform from IBM which supports different types of analytics under one roof. InfoSphere is built on top of Hadoop to enhance its capabilities and provides an interactive interface for analyzing Big Data. InfoSphere has built-in analytics capabilities, including text analytics for getting insights from large textual data, a social data analyzer for analyzing social media data, and machine data analytics for analyzing machine data such as data from sensors and GPS; InfoSphere also supports integration with other Big Data technologies. In addition, InfoSphere provides a SQL interface, namely BigSQL, and a spreadsheet-like interface called BigSheets for analyzing and exploring Big Data easily, along with development tools, analytics and security for Big Data operations.

BigSheets and BigSQL are among the core components of the system. BigSQL gives users the facility to explore the database schema and provides access to analytics using structured query language. BigSheets provides Big Data analysis, exploration and manipulation for non-programmers and non-technical users. The system can load data from multiple sources such as databases, web crawlers, text files, JSON and CSV files, and can store it in the distributed file system for processing.

BigSheets provides users with an interactive, Excel-like interface where they can select or import data for analysis, for example by applying filters or by using built-in aggregation functions such as sum or average. It also supports user defined functions, where users can write their own logic for the computation. BigSheets also supports interactive visualization of data and analysis results, with multiple chart types for rendering the results, such as line, bar and pie charts. Despite all the benefits and ease of use InfoSphere provides, the issue with InfoSphere is that the system is not easy to configure; even technical computer scientists need training to install and configure it. Hence InfoSphere takes away the opportunity for non-expert users such as social scientists to use the system without technical expertise [10].

2.3 SAP Big Data Analytics

SAP is one of the leading providers of Big Data analytics platforms. It was one of the first companies to introduce an in-memory database for analytics. The aim of building the in-memory database was to provide a single database for both transactional and analytical data processing, commonly referred to as OLTP 4 and OLAP 5 systems. Any application can be built on top of the SAP HANA 6 in-memory database for analytics, and the HANA back-end engine is available for use in the cloud as well, but few applications exist which can be used by non-expert users for analytics. HANA has a strong library called the Predictive Analysis Library (PAL) which can be used by HANA developers to perform predictive calculations by simply invoking the library functions, hence providing ease for Big Data analytics. For setting up a HANA system for Big Data exploration, the user needs some type of ETL 7 tool for loading the data into the system and then an application which can manipulate that data. Many sample applications exist to demonstrate the power of HANA, but most of them address business use cases and are designed for expert users or for those who can use the system after adequate training [11] [12].

2.4 TERADATA Big Data Analytics

The Aster Big Analytics Appliance is a solution from TERADATA for Big Data analytics. Aster is basically a database developed by TERADATA supporting both row based and column based storage, and it is the key component of their Big Data analytical platform, which consists of the Aster database, Aster SQL-MapReduce and their own supplied and tested hardware.

4 Online Transaction Processing
5 Online Analytical Processing
6 High Performance Analytic Appliance
7 Extraction, Transformation and Load

Aster SQL-MapReduce is basically an interfacing SQL language for the Aster database, but it includes support for built-in MapReduce functions expressed in SQL, similar to Hive, which sits on top of Hadoop. We discuss the architecture and working of Hive and Hadoop in detail in the next chapter. The Aster Big Analytics Appliance is a combination of software and hardware; the basic idea is to run Big Data technologies on dedicated and specially designed hardware by connecting multiple nodes with InfiniBand [13], instead of running Hadoop on commodity hardware with traditional network connections. New nodes can be added as needed using separate dedicated hardware machines. Despite the dedicated hardware, the fast database, the built-in MapReduce functions and the SQL interface for analytics, there are few applications available for non-expert users to explore and analyze Big Data. The technology's main focus is on business users, and users have to have a physical machine (not commodity hardware) together with a front end application to use the system, which is not suitable for non-technical users [14].

2.5 Cloudera Big Data Solutions

Cloudera was founded in 2008 and was the first enterprise distributor of Apache Hadoop and other Big Data technologies. Cloudera's contribution to the Big Data world is an abstraction over different Big Data technologies such as Hadoop, Hive and HCatalog [15], which makes it easier for users to use these technologies without going into technical details. Cloudera solutions are ready to install on any commodity hardware, hiding the technical details of compiling and configuring the Big Data technologies, and provide system management facilities such as configuration, deployment, security management, diagnostics and operational report generation. Cloudera is at the forefront of providing back-end solutions for Big Data exploration and analysis, but it does not provide any tool or framework which can actually be used on top of these technologies, especially by non-expert users [16].

2.6 HortonWorks

HortonWorks was founded in 2011 by former Yahoo engineers. The concept is the same as Cloudera's: to provide different Big Data technologies for enterprise computing. As a result, the company developed the HortonWorks Data Platform for enterprise computing, which builds on top of Hadoop and has the capability to provide different types of analysis such as batch, real time and interactive. The platform also supports data management, a facility which is built on top of the Hadoop file system and resource manager. In addition to the multiple analysis capabilities of the HortonWorks Data Platform, it supports different technologies for data integration and data flow control.

Technologies such as Apache Falcon [17], Sqoop [18] and Flume [19] are part of the platform and provide easy and systematic access for moving data in and out of Hadoop. Furthermore, the platform provides AAA 8 and data protection by encrypting traffic during transfer between different nodes, for example when using remote procedure calls or transferring data over HTTP, JDBC and the data transfer protocol. The platform also supports the management of Hadoop cluster operations. Deploying and configuring the HortonWorks Data Platform requires considerable expertise and training, along with experience with other technologies for loading data into the distributed file system for processing. Furthermore, the system lacks applications where the user can simply log in and start analyzing the data without having any expert knowledge of the system [20].

2.7 Amazon Big Data Analytics Platform

Amazon is a pioneer in cloud computing and provides technological services through web interfaces, usually referred to as web services. Amazon Elastic MapReduce (EMR) is a web service from Amazon through which Big Data technologies such as Hadoop and Hive can be accessed via Amazon's exposed APIs or by launching an EC2 9 instance, where users can deploy map and reduce scripts for processing on Hadoop and Hive. Hive can be accessed via the Amazon exposed APIs such as JDBC, or by using a Hive compatible client such as the Hive command line utility or beeline. Non-expert users who do not have the technical background to write map and reduce scripts are not able to use this technology, and non-programmers cannot work with the exposed APIs, as they would have to be proficient in programming to do so. In addition, users need to build an application on top of Amazon web services to provide access to non-expert users for Big Data exploration.

2.8 1010data Big Data Platform

1010data is a leading provider of big data exploration and analysis, serving multiple industries such as retail, telecom, financial, healthcare and government. 1010data systems can be installed on premises or accessed through the cloud. The 1010data Big Data platform has a very interactive, spreadsheet-like interface called The Trillion-Row Spreadsheet [21]. It is similar to an Excel spreadsheet, but it is capable of handling billions or trillions of rows of data. The cloud solution can be used by non-expert users, as it only requires authentication to access the system and explore the data. In addition, the platform provides the facility to import and load data from multiple sources easily.

8 Authentication, Authorization and Audit
9 Elastic Compute Cloud by Amazon

The 1010data Big Data platform depends on an underlying high performance database technology which supports both row and column based storage. Column based storage is better for set type operations, as the data is physically adjacent on disk and the disk takes less time to locate the data when a retrieval query arrives. The database supports multiple built-in functions for analytics, usually referred to as in-database analytics. The functions include grouping functions; statistical functions such as mean and standard deviation; time series functions such as regression and correlation coefficients; and text functions, for example finding specific text in the data, such as finding tweets in a large data set. 1010data provides an easy to use interface even for non-expert users, but the technology they are using is now outdated. Although they use a row and column based storage database, other databases are available, such as SAP HANA, which provides faster responses to OLAP queries than other databases. Furthermore, Hadoop and other Big Data technologies are at the forefront of analyzing Big Data and provide a reliable and fast way to explore it, and the 1010data platform does not support Hadoop and other Big Data technologies, which means this platform might not be of interest to most non-technical users [21].

2.9 Oracle Big Data Platform

Oracle is not behind any other company in Big Data solutions. Oracle has introduced an in-memory database option in version 12c of the Oracle database. Oracle partners with Cloudera to provide Hadoop for Big Data analytics. In addition, Oracle has developed its own analytical tools using R [22] for Hadoop, together with in-memory and NoSQL databases. The system also has a data management facility for moving data in and out of Hadoop and to and from the database, with dedicated and tested hardware from Oracle. In addition, Oracle provides the Big Data platform as a service in the cloud, so any application can be built on top of the Oracle Big Data platform. Despite the powerful hardware and software capable of delivering real insights into Big Data, there is a need to build a simple application on top of this technology to support non-expert users in exploring Big Data [23].

2.10 Hewlett Packard Big Data Platform

HP provides analytics through its HP Vertica analytics platform. The product is based on the Vertica database, which is the backbone of the platform. The database provides fast execution of queries, especially set type queries and warehouse applications, due to its columnar storage structure. In addition, the database uses advanced compression techniques to compress the data stored on disk, along with in-database analytics. Furthermore, the database integrates with Hadoop for processing data stored in the Hadoop File System, which can be loaded into the database through the Hadoop-Vertica connector.

However, the system does not provide the flexibility to execute MapReduce jobs in Hadoop; instead, the platform is more interested in pulling data out of the Hadoop distributed file system for processing with its rich in-database analytical functions. Despite the fact that the system provides high performance analytics, it cannot be used by non-expert users, as the system has to be configured and deployed on premises, and there are few applications which can use the platform easily for Big Data exploration. Furthermore, the system does not provide MapReduce computation facilities, which is a key factor for any Big Data exploration and analytics application nowadays [24].

Table 2-1 shows a tabular comparison of the different technologies and applications that we have discussed above.

2.11 Summary

In this chapter, we discussed different Big Data platforms for analyzing and exploring large data sets, examined their capabilities and considered whether non-technical users can use them. We also provided a tabular comparison between them. In the next chapter, we discuss the popular Big Data processing technologies Hadoop and Hive.

Features                   | InfoSphere | SAP | TERADATA | Cloudera | HortonWorks | Amazon | 1010data     | Oracle | HP  | Hive
Analysis Web Interface     | Yes        | No  | No       | Yes      | No          | Yes    | Yes          | No     | No  | Yes
Hadoop Support             | Yes        | Yes | Yes      | Yes      | Yes         | Yes    | No           | Yes    | No  | No
Dedicated Hardware Support | Yes        | Yes | Yes      | No       | No          | No     | No           | Yes    | Yes | N/A
Analytics                  | Yes        | Yes | Yes      | No       | No          | No     | Yes          | Yes    | Yes | No
Database Support           | No         | Yes | Yes      | No       | No          | No     | Yes          | Yes    | Yes | No
Data Management            | Yes        | Yes | Yes      | Yes      | Yes         | No     | Yes          | Yes    | Yes | No
Non-expert users support   | Yes (Low)  | No  | No       | No       | No          | No     | Yes (Medium) | No     | No  | No
Business users support     | Yes        | Yes | Yes      | Yes      | Yes         | Yes    | Yes          | Yes    | Yes | Yes

Table 2-1 Comparison of different Big Data Platforms/Frameworks

3 :-: Big Data Processing Technologies

In this chapter, we discuss the Big Data processing technologies that we have used in developing the framework. We start with Hadoop and describe its programming model and architecture, followed by an overview of Hive, its architecture and the benefits Hive provides for analyzing large data sets.

3.1 Hadoop

Hadoop is based on the MapReduce programming model, originally developed by Google in 2004, and is currently maintained by the Apache Software Foundation as open source software. Hadoop is the most widely used Big Data technology for analyzing large data sets. Hadoop's batch and parallel processing nature makes it perfect for crunching large amounts of data easily and cost effectively.

Figure I MapReduce Programming Model

The MapReduce programming model is based on the concept of key/value pairs, as shown in Figure I. The user's input data file is split into multiple chunks of 64 MB in size and passed to the mappers, with different chunks of the same file assigned to different mappers for parallel processing. Each mapper parses its input split, converts it to key/value pairs for processing and generates intermediate key/value pairs, as shown in Figure I. The reducer then loads all these key/value pairs and sorts them in order to group the related keys.
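To make this key/value flow concrete, the sketch below shows the canonical word-count job written against the standard Hadoop MapReduce Java API: each mapper emits an intermediate <word, 1> pair for every token in its input split, and the reducer sums the values grouped under each word. This is only an illustrative example of the programming model described above, not code from the BigExcel framework; the class names are chosen for the example and the input and output paths are supplied as command line arguments.

```java
// Canonical Hadoop word-count sketch: mapper emits <word, 1>, reducer sums counts.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: parses one line of its input split and emits an intermediate
    // <word, 1> key/value pair for every token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: receives all values grouped under the same key (after the
    // framework's shuffle and sort phase) and sums them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // map-side pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Reusing the reducer class as a combiner pre-aggregates counts on the map side, which reduces the volume of intermediate key/value pairs shuffled between nodes.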


More information

Big Data on Microsoft Platform

Big Data on Microsoft Platform Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4

More information

Reference Architecture, Requirements, Gaps, Roles

Reference Architecture, Requirements, Gaps, Roles Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture

More information

Lecture 10 - Functional programming: Hadoop and MapReduce

Lecture 10 - Functional programming: Hadoop and MapReduce Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional

More information

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop

More information

MapReduce with Apache Hadoop Analysing Big Data

MapReduce with Apache Hadoop Analysing Big Data MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues

More information

Hadoop Big Data for Processing Data and Performing Workload

Hadoop Big Data for Processing Data and Performing Workload Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer

More information

Workshop on Hadoop with Big Data

Workshop on Hadoop with Big Data Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

More information

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BIG DATA TECHNOLOGY. Hadoop Ecosystem BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

The Inside Scoop on Hadoop

The Inside Scoop on Hadoop The Inside Scoop on Hadoop Orion Gebremedhin National Solutions Director BI & Big Data, Neudesic LLC. VTSP Microsoft Corp. Orion.Gebremedhin@Neudesic.COM B-orgebr@Microsoft.com @OrionGM The Inside Scoop

More information

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com Agenda The rise of Big Data & Hadoop MySQL in the Big Data Lifecycle MySQL Solutions for Big Data Q&A

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# So What is Data Science?" Doing Data Science" Data Preparation" Roles" This Lecture" What is Data Science?" Data Science aims to derive knowledge!

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012 Big Data Buzzwords From A to Z By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012 Big Data Buzzwords Big data is one of the, well, biggest trends in IT today, and it has spawned a whole new generation

More information

Big Data Analysis and HADOOP

Big Data Analysis and HADOOP Big Data Analysis and HADOOP B.Jegatheswari and M.Muthulakshmi III year MCA AVC College of engineering, Mayiladuthurai. Email ID: jjega.cool@gmail.com Mobile: 8220380693 Abstract: - Digital universe with

More information

Peers Techno log ies Pv t. L td. HADOOP

Peers Techno log ies Pv t. L td. HADOOP Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and

More information

Google Bing Daytona Microsoft Research

Google Bing Daytona Microsoft Research Google Bing Daytona Microsoft Research Raise your hand Great, you can help answer questions ;-) Sit with these people during lunch... An increased number and variety of data sources that generate large

More information