The Hadoop ecosystem: an overview
Lamine Aouad, University College Dublin


Hadoop [1] has become the de facto standard in the big data space. It has also become the kernel of a large distributed ecosystem. A collection of projects, many of them at Apache, has emerged as essential tooling to be used alongside Hadoop for a number of important operations, including data management, collection and aggregation, transfer, high-level interfaces, BI reporting, and so on. The community of applications and use cases running on Hadoop, and using these projects, is constantly growing. In this paper, we will have a look at these different projects, how they are used, and how they complement Hadoop. A high-level map of the ecosystem is illustrated in Figure 1.

Before describing these projects, it is worth mentioning a very interesting emerging project in the ecosystem called Bigtop [2]. Its primary goal is to build a community around the Hadoop ecosystem by providing common packaging and interoperability testing. Indeed, the Hadoop ecosystem includes a range of sophisticated projects, and although each of these projects has a rigorous community practicing a high degree of software development discipline, there was a gap when it came to inter-project integration, with various dependency issues between the projects. The Hadoop space is now actually full of companies filling this gap and delivering fully integrated, pre-packaged Hadoop stacks, such as Cloudera [4] and Karmasphere [3], among others. Bigtop also fills this gap for the community, by focusing on the ecosystem as a whole rather than on individual projects.

Figure 1. The Hadoop ecosystem: Core (MapReduce, HDFS); Data collection / I/O (Scribe, Flume, Chukwa, Sqoop); Workflow (Oozie, Azkaban, Cascading); OLTP & NoSQL (HBase, Cassandra); UI & BI, monitoring / Packaging (Hue, Cloudera, Karmasphere); Management / Support (Zookeeper, Avro); High-level interfaces (Pig, Hive); Cloud deployments (Whirr, Amazon EMR, Infochimps).

A vanilla Hadoop distribution comes with two core components: its MapReduce engine and its file system, HDFS (Hadoop Distributed File System). The MapReduce engine coordinates and executes jobs via two roles: the jobtracker and the tasktrackers. Note that, in Hadoop terminology, a job is the full program, which is then broken down into a set of tasks. On the other hand, as already mentioned, Hadoop comes with HDFS by default, but it can be (and has been) integrated and used with other file systems and storage systems.
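To make the job and task terminology concrete, here is a minimal sketch of a native Java MapReduce application: a job that counts records per key, anticipating the per-location check-in analysis developed later in this paper. The class names, paths, and tab-delimited input format are illustrative assumptions, not taken from any particular deployment.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocationCount {

  // Map phase: emit (locationId, 1) for each tab-delimited check-in record
  public static class CheckinMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      context.write(new Text(fields[4]), ONE); // field 4: location id
    }
  }

  // Reduce phase: sum the counts emitted for each location
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "location-count");
    job.setJarByClass(LocationCount.class);
    job.setMapperClass(CheckinMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The jobtracker schedules the map and reduce tasks of this job across the tasktrackers; the mapper and reducer classes are essentially all the developer writes.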

Many distributions and deployments actually use custom file systems, such as QFS from Quantcast [44], recently released as open source, IBM's GPFS [49], and Amazon's S3 [45], among others. HDFS also has two basic roles: a single namenode storing the metadata (and not involved in ordinary data activity), and the datanodes storing the actual data chunks (typically collocated with the tasktrackers). There is a lot to say about Hadoop execution and the way MapReduce and HDFS work, how to develop an application (using the native Java interface or streaming), and so on. However, this is outside the scope of this paper, as the primary focus is to introduce the ecosystem around these two core components; there are plenty of introductory courses, training resources, and tutorials available online. In the following, we start by describing data integration within the ecosystem, which I believe is one of the most important aspects here.

Data connections

Data comes in many shapes and formats, is captured and stored using a range of technologies, and is distributed across different places. A number of tools deal with delivering that data, structured or unstructured, to Hadoop. Many companies have built their own tools to serve their own needs; some have been released as open source to the community, and have gained in functionality and support. These include Scribe [5], Flume [6], and Sqoop [7], among others. There are also a number of proprietary tools providing connectivity, extraction, and mapping capabilities, such as Pervasive's data integrator. The next few paragraphs briefly introduce these tools.

Unstructured data

Reliable and efficient data preparation is paramount. Data preparation here means the basic assembly and delivery of the data: collection, aggregation, and movement of distributed data, as opposed to the data mining point of view, where it includes an extra step of manipulating the data to enhance its utility for the mining process. The challenge is how to get large amounts of data, logs or any other data, to HDFS to be processed by Hadoop, and to do so in a reliable and fault-tolerant manner: losing all the click logs of a website for a given period of time, because of a namenode failure for instance, is not tolerable. A number of (quite similar) tools have been developed to stream large flows of data to Hadoop. These include Scribe and Flume (already mentioned), and Chukwa [8]. They provide similar functionality, but differ in the way they are configured, their monitoring, the levels of reliability available, etc. Note that basic tools exist and can be scripted, such as the HDFS copy commands, which provide a way to get data in and out of Hadoop, but these cannot realistically be used in a large-scale production environment.
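The HDFS copy commands just mentioned also have a programmatic counterpart. A minimal sketch using Hadoop's FileSystem API, with illustrative paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster location (core-site.xml) from the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Equivalent of: hadoop fs -put /var/log/access.log /logs/access.log
    fs.copyFromLocalFile(new Path("/var/log/access.log"),
                         new Path("/logs/access.log"));

    // Equivalent of: hadoop fs -get /logs/access.log /tmp/access.log
    fs.copyToLocalFile(new Path("/logs/access.log"),
                       new Path("/tmp/access.log"));
    fs.close();
  }
}

This kind of scripted copying is fine for ad-hoc transfers; continuous, fault-tolerant collection at scale is precisely the gap that Scribe, Flume, and Chukwa fill.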

Facebook's Scribe, for instance, was initially designed to reliably scale to tens of billions of messages a day; it is also used at Twitter and Digg, among others. Cloudera's Flume is likewise used in production by a number of companies, including Photobucket. An additional similar system is Kafka [9], initially developed at LinkedIn. It is comparable to Flume and Scribe and can be used for activity stream processing, but it differs in its primitives, as it was initially designed as a distributed messaging system. Although these tools offer similar functionality, they might serve different use cases in terms of set-up, manageability, integration, etc., in addition to the support and development momentum around them, which could be a deciding factor in choosing one over the other.

Structured data

The same challenge applies to structured data sources, including relational databases, enterprise warehouses, and also NoSQL systems. Although some tools exist, properly placing Hadoop alongside existing sources of data is still quite challenging. In this case, it is not a collection issue but rather an integration or mapping issue, i.e. the integration of Hadoop into the existing infrastructure, and the mapping of the existing data to Hadoop and related systems. Efficient integration and mapping processes will certainly make the transition to Hadoop a lot easier. Sqoop [7] is one of the tools allowing the transfer of large amounts of data between Hadoop and structured data stores. Sqoop can be used to import data into HDFS or into systems working alongside Hadoop, including HBase [10] (cf. the NoSQL section). It comes with various connectors to popular databases, including MySQL and PostgreSQL, as well as proprietary ones such as Oracle, Teradata, etc., and it can be extended with new connectors. Sqoop also provides an API for operation and management, which is very handy for integration with external support and workflow systems. Sqoop was started at Cloudera, and is now quite established in enterprise systems. Another system, Hiho [11], from a company called Nubetech, aims at providing similar integration functionality with existing data stores; however, it does not seem to be in active development anymore.
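Sqoop is usually driven from its command line, but the operation and management API just mentioned also allows it to be embedded in other systems. A hedged sketch, assuming Sqoop 1's Sqoop.runTool entry point; the connection string, credentials, and table name are hypothetical:

import org.apache.sqoop.Sqoop;

public class OrdersImport {
  public static void main(String[] args) {
    // Import a MySQL table into HDFS as delimited text files;
    // each argument pair mirrors a flag of the sqoop CLI
    String[] importArgs = {
        "import",
        "--connect", "jdbc:mysql://db.example.com/sales",
        "--username", "etl",
        "--table", "orders",
        "--target-dir", "/data/sales/orders",
        "--num-mappers", "4"
    };
    int ret = Sqoop.runTool(importArgs);
    System.exit(ret);
  }
}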

The importance of properly bridging existing data stores with Hadoop has led many companies, especially those in the fields of data delivery and integration, to enter the Hadoop space. A data integration company called Pervasive [12] has developed a Hadoop edition of its data integrator, which enables users to roll data in and out of more than 200 different data sources. It offers a browser-based user interface and visual drag-and-drop mappers integrating sources and targets. Another visual drag-and-drop data extraction and integration tool is Pentaho's Kettle [13], which has an open source community edition. Kettle offers data migration, export, and loading capabilities for a range of formats and database engines, including CSV, XML, Oracle, SQL Server, MySQL, and PostgreSQL, among others.

Management - support

Deploying, configuring, managing, and maintaining distributed platforms and applications are complex matters. One of the main reasons is the very large scale, and thus failure, which is intrinsic to these systems. There have been efforts by the community to develop coordination and support tools around Hadoop. One of these tools is Yahoo!'s Zookeeper [14], which can basically be described as a fault-tolerant middleware providing the common services needed in distributed infrastructures to maintain the general availability, configuration, and synchronisation of the system. Zookeeper is in essence a centralised service, but it can maintain a hierarchical tree of servers containing what are called znodes, a core concept in Zookeeper. A znode is basically a unified notion of a node that holds the data and metadata managing the cluster-wide configuration and status of the system and applications.
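A minimal sketch of the Zookeeper Java client maintaining such a configuration znode; the ensemble address, path, and payload are illustrative, and a production client would also wait for the connection event before issuing calls:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ConfigZnode {
  public static void main(String[] args) throws Exception {
    // Connect to the ensemble (session timeout in milliseconds)
    ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 3000,
        new Watcher() {
          public void process(WatchedEvent event) {
            System.out.println("event: " + event);
          }
        });

    // Create a persistent znode holding a piece of cluster-wide configuration
    zk.create("/app-config", "maxWorkers=42".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Any node in the cluster can now read (and watch) the same znode
    byte[] data = zk.getData("/app-config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}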

In another context, a useful support tool for Hadoop is a serialisation tool called Avro [15]. Note that Hadoop has its own marshalling format and does not use Java's native one, which, incidentally, according to Doug Cutting (the creator of Hadoop), was big and hairy, whereas he thought Hadoop needed something lean and mean. Avro defines a serialisation format with support for richer data types (such as nested data), as well as for RPC, versioning, and many languages. The variety of possible pieces in a Hadoop-based system has indeed made data interoperability a primary issue, and Avro is a good support for managing data throughout the potentially complex data flows generated by Hadoop set-ups; it is becoming a common format for the ecosystem. A Cloudera tool called RecordBreaker [53] automatically turns text-formatted data (server logs, sensor readings, etc.) into structured Avro data, without the need to write parsers or extractors, which is extremely useful.

High-level interfaces

Hadoop's MapReduce is still fairly low-level. Developers must think in terms of maps and reduces, keys and values, partitioning, etc. They also need skills in Java, which is the native interface, and/or in a scripting language using the streaming interface. Implementing complicated applications in MapReduce can be quite challenging: mapping applications onto the map and reduce patterns, which usually requires multiple stages, is far from trivial. The Hadoop community has therefore developed high-level interfaces raising the level of abstraction, essentially implementing common building blocks and operators for ad-hoc data analysis. Pig [16] and Hive [17] are two widely used high-level interfaces. I am going to be a bit biased towards Pig, since I have been using it for a little while. Pig was initially developed at Yahoo!, and is a data analysis platform made up of two pieces: a data flow language called PigLatin, and an execution environment mapping statements (written in PigLatin) into maps and reduces and executing them on Hadoop. In Pig, the data structure is much richer, typically multivalued and nested. A PigLatin script defines a DAG (Directed Acyclic Graph), a step-by-step set of operations, each performing some transformation on the data. PigLatin will be quite straightforward for people familiar with a scripting language. It also allows rapid prototyping of algorithms for processing large datasets, and it has a pretty decent compiler: PigMix [18], a set of queries used to compare Pig performance against pure MapReduce jobs, currently shows Pig at about 1.1 times the MapReduce speed on average (over 20+ benchmarks).

Pig runs as a client-side application, and can actually reside outside of the Hadoop cluster. It can be used in different ways (the embedded route is sketched below):

o the shell, called Grunt,
o submitting scripts directly, or
o embedded in Java, using the PigServer Java class (similar to SQL using JDBC), or in Python.
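As a sketch of the embedded route, the following assumes Pig's PigServer class and previews the check-in analysis developed below; paths and aliases are illustrative:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class TopLocationsDriver {
  public static void main(String[] args) throws Exception {
    // ExecType.MAPREDUCE submits to the cluster; ExecType.LOCAL runs locally
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Each registerQuery call adds a statement to Pig's logical plan
    pig.registerQuery("records = LOAD 'input/checkindata' AS " +
        "(userid:int, date:chararray, lat:double, lg:double, locid:int);");
    pig.registerQuery("checkins = FOREACH records GENERATE userid, locid;");
    pig.registerQuery("grouped = GROUP checkins BY locid;");
    pig.registerQuery("counts = FOREACH grouped GENERATE $0, COUNT($1);");

    // store() forces compilation into MapReduce jobs and executes them
    pig.store("counts", "output/checkins_per_location");
    pig.shutdown();
  }
}

Note that registerQuery only extends the logical plan; nothing runs until store (or an iterator) forces execution, a point discussed further below.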

Pig scripts can be executed in two modes: local and Hadoop. In local mode, Pig still depends on Hadoop's local job runner (in versions >0.7), which reads from the local file system and executes jobs locally. Hadoop mode needs access to a Hadoop cluster and an HDFS installation. There is also a range of tools and plugins that ease the editing, monitoring, debugging, and integration of Pig workflows, including PigPen, Penny, and PigPy, among many others [19].

In terms of the language, PigLatin has multiple statements that make it look like SQL; however, it is actually quite different from relational systems. There are many features Pig has that are absent in relational database systems, and vice versa. These include more complex data structures in Pig, as opposed to the flatter data structures of SQL, as well as Pig's ability to use user-defined functions and streaming, which makes it more customisable. On the other hand, transactions and indexes are absent in Pig, as are the random reads and writes supported in relational systems, and so on. The other high-level interface we have mentioned, Hive, actually sits between Pig and relational database systems, as its language is a variant of SQL and it operates on data stored in tables.

Let's have a look at a simple PigLatin example. The following calculates the top locations in check-in data. The data comes from Gowalla [20], which was a location-based social networking site; this very simple analysis will show the top social landmarks across a city.

records = LOAD 'input/checkindata' AS (userid:int, date:chararray, lat:double, lg:double, locid:int);
-- get the users and their check-in locations
checkins = FOREACH records GENERATE userid, locid;
-- group checkins by locid
grouped_records = GROUP checkins BY locid;
-- number of checkins per location
how_many_per_location = FOREACH grouped_records GENERATE $0, COUNT($1);
ordered = ORDER how_many_per_location BY $1 DESC;
-- get the top 50
top_locations = LIMIT ordered 50;
STORE top_locations INTO 'output/toplocations';

This script loads the input data from a field-delimited text file, each line holding the user id, the date, the latitude and longitude, and the location id. Pig supports different formats, and can be customised to load and store additional ones. The AS clause in the LOAD statement associates a schema with records: the schema simply associates names and types with the fields of the relation, e.g. the user id is an integer. This is optional but recommended; if the AS clause is absent, Pig treats all fields as bytearray, which may lead to errors. LOAD takes a URI argument. It is important to note that the output of any statement in Pig is a set of tuples, a tuple being a sequence of fields of any type. This statement produces a set of (user id, date, latitude, longitude, location id) tuples. The user can give any name, or alias, to the successive statements. Note that each statement is not immediately executed, but rather added to a logical plan; execution does not start until the whole flow has been defined, at which point the logical plan is compiled into a physical one and executed. In the meantime, there are a few useful diagnostic operators in the Grunt shell, which allow the user to interact with each statement and see what it produces:

DUMP: triggers the execution and displays the results,
ILLUSTRATE: shows a sample execution,
DESCRIBE: shows the schema,
EXPLAIN: shows the logical and physical plans.

The second statement in the example uses FOREACH GENERATE, which acts on columns of every row of a relation; in this case, from records it generates only two fields in the relation checkins. The checkins are then grouped using locid as the grouping key. At this stage, we have one tuple per location in the input data, containing the location id and a bag of tuples for that location. A bag is just a collection of tuples, represented by curly braces in Pig. It looks like this:

grouped_records: (group: int, checkins: {(userid: int, locid: int)})

By grouping checkins in this way, we have created a row per location. The grouping field (the location) is given the alias group, and the second field has the same structure as checkins. What remains is simply to add up the tuples of each bag to find the actual number of check-ins per location. The next statement carries out this counting, using positional references ($0 and $1), which is equivalent to using the names of the fields. The following statements order the results, take the top 50, and store the result. In a handful of lines of PigLatin we have carried out an analytics operation that would have been a lot more complex using maps and reduces. An interesting read about the mapping of PigLatin statements into MapReduce jobs can be found in [21].

On the other hand, Hive, as already mentioned, sits between Pig and relational systems, as it uses a variant of SQL called HiveQL. Hive defines tables for the data and keeps its metadata, such as schemas and partitions (subdirectories in HDFS), in either a shared or a local database. The schema is explicit and the types are defined upfront (unlike in Pig). As in Pig, on the other hand, users can define custom functions, called User Defined Functions. Hive provides both shell and web interfaces. Hive is a good option for data analysts already familiar with SQL, and is consequently better suited to traditional data analytics. Hive was initially developed at Facebook, and is currently used by many sites, including Digg and Scribd, among others. There is also a query language called Jaql [50], based on JSON (JavaScript Object Notation). It was open sourced by IBM, but its development has been on hold for a little while now; it is, however, part of BigInsights, IBM's Hadoop distribution [51].
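To make Hive's positioning concrete, the top-locations analysis from the Pig example collapses into a single HiveQL query. A hedged sketch using Hive's JDBC interface; the server address and table name are hypothetical, and the original HiveServer1 driver is assumed:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopLocationsHive {
  public static void main(String[] args) throws Exception {
    // HiveServer1 JDBC driver; later HiveServer2 uses a different driver/URL
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive://hive.example.com:10000/default", "", "");
    Statement stmt = con.createStatement();

    // Same analysis as the Pig script: check-ins per location, top 50
    ResultSet rs = stmt.executeQuery(
        "SELECT locid, COUNT(*) AS checkins " +
        "FROM checkin_data GROUP BY locid " +
        "ORDER BY checkins DESC LIMIT 50");
    while (rs.next()) {
      System.out.println(rs.getInt(1) + "\t" + rs.getLong(2));
    }
    con.close();
  }
}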

Workflow systems

Workflow systems have become increasingly attractive in a range of fields, including scientific computing, the coordination of business processes, etc. Workflow systems basically describe, coordinate, and map a sequence of jobs onto one of several underlying execution systems (or onto individuals). A workflow engine supporting different types of Hadoop jobs (Java, streaming, Pig, etc.) is Oozie [22]. Many people in distributed systems are actually very familiar with workflow systems, and would prefer to use them in developing their distributed applications, as they offer simple interfaces and languages coupled with resource-independent management technologies able to efficiently schedule and map the workflow specification onto a large pool of resources. Like many workflow languages, Oozie uses XML syntax to define the dependent set of actions as a DAG. There is another workflow tool called Azkaban [23], from LinkedIn, which has more or less similar functionality.

Another useful system, in the same spirit of allowing the composition of rich analytics applications on top of Hadoop, is Cascading [24]. It consists of a data processing API and query planner to define, share, and execute workflows on top of Hadoop. One of the core concepts in Cascading is the notion of pipes, which define a series of reusable data processing operations such as parsing, filtering, joining, etc.; a workflow is then a set of pipes. Cascading can be seen as a different way of doing what Pig does, but with a Java API. This differs from the workflow systems mentioned above, which describe the workflow specification using a dedicated language (XML, for instance); in this sense, one could end up with an implementation combining both Oozie and Cascading, for example. There is another tool, from Cloudera, called Crunch [52]: a Java library focused on providing a set of primitive operations and user-defined functions that can be combined to create complex multi-stage pipelines. Crunch then compiles these pipelines into a sequence of MapReduce jobs and manages their execution. This sounds a lot like Cascading; the main difference seems to be that Crunch is more suitable for complex data types that do not map naturally onto the tuple-based model, avoiding the implementation of complex user-defined functions to support such data types under Cascading or Pig.
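A hedged sketch of a small Crunch pipeline, assuming the MRPipeline API with illustrative names and paths: the parallelDo primitive applies a user-defined function, and count() compiles down to a MapReduce aggregation, mirroring the GROUP/COUNT stages of the Pig script above.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class TopLocationsCrunch {
  public static void main(String[] args) {
    MRPipeline pipeline = new MRPipeline(TopLocationsCrunch.class);
    PCollection<String> lines = pipeline.readTextFile("input/checkindata");

    // Primitive operation: extract the location id from each record
    PCollection<String> locations = lines.parallelDo(
        new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            emitter.emit(line.split("\t")[4]); // field 4: location id
          }
        }, Writables.strings());

    // count() becomes a MapReduce aggregation when the pipeline runs
    PTable<String, Long> counts = locations.count();
    pipeline.writeTextFile(counts, "output/checkins_per_location");
    pipeline.done();
  }
}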

NoSQL data stores

There are too many aspects to NoSQL technologies for us even to begin to do them justice in this small section; there will be a dedicated paper on these technologies and on NoSQL data management on the cloud. Briefly, the core aspects of NoSQL have initially been unstructured data, a variety of data models, and the sacrifice of consistency in favour of availability and scalability. The origin of the latter is Brewer's CAP theorem, from 2000. It basically stated that consistency and availability cannot both be maintained when a database is partitioned across a fallible large network; or, in general, that for any given system, one can strongly support only two of the three requirements: Consistency, Availability, and Partition tolerance. Many of the NoSQL technologies rose alongside the big Internet companies, and the list of existing NoSQL systems is now quite impressive [25].

As opposed to the master/slave setup of relational databases, most NoSQL stores feature a peer-to-peer protocol to maintain their data (though not all of them: MongoDB [28], for instance, uses a master/slave scheme). Decentralisation has the advantage of removing the single point of failure, since all the nodes play identical roles; such systems are also easier to maintain and configure for the same reason. It also serves scalability, which is an essential architectural feature: a core requirement is that the data store must be able to accept new data, nodes, and users without major disruption of the system, reconfiguration, or performance degradation. Existing NoSQL data stores have different data models:

o Key-value stores, such as Redis [30],
o Column family stores, such as Cassandra [27],
o Document stores, such as CouchBase [26] or MongoDB [28],
o Graph-based stores, such as Neo4j [29].

The integration of these technologies with Hadoop & co. is also a hot topic, and is of great interest to the community and the big data players. A company called DataStax [41], for instance, has built its business model around the integration of Cassandra and Hadoop.

Packaged distributions - deployments

The Hadoop marketplace is quite crowded nowadays! Hadoop-based solution packaging and distributions have flourished in the last couple of years. The best-known (and most used?) distribution would be Cloudera's [4] (the company that employs the original author of Hadoop, Doug Cutting). Others include MapR [31], Hortonworks [32], and Karmasphere [3], among others. The Bigtop project, already mentioned earlier in this paper, serves a similar purpose of packaging the ecosystem, but under the Apache umbrella. On the other hand, there are also various ready-to-use deployments on the cloud, including Amazon Elastic MapReduce (EMR) [33], Infochimps [34], and Hadoop on Azure [35], among others. Google's AppEngine [46] also offers a MapReduce facility that can be used as part of an application, but it is not an analytics offering per se (Google has its own big data offerings, including BigQuery). Most of these offerings can be programmed in the usual ways, using native Java, streaming, Pig, Hive, etc., and they offer SDKs, tools, and interfaces to develop, access, and manage resources and jobs, in addition to NoSQL stores and working and big data storage solutions. There is also the possibility for users to deploy their own Hadoop cluster on the cloud. Whirr [36], for instance, provides a set of libraries for running cloud-neutral services using a common API. It started as a set of bash scripts in the Hadoop project allowing deployment on EC2, and was then ported to Python to include extra services and a wider range of providers. Its CLI provides a quite easy and convenient way to launch a cluster in a few minutes.

A quick word on the multi-cloud trend, which allows the use of various cloud providers to deploy a given application or service: many libraries (and also PaaS offerings) now support a range of provider operations, languages, frameworks, and services. These include libcloud [37] and jclouds [38] (which was used to create the Java version of the initial scripts in Whirr). On the PaaS side, Cloud Foundry would be an interesting project to mention here.

Bigtop Hadoop distribution

Installing Hadoop and each of its projects independently can be very cumbersome. Installing the Bigtop Hadoop distribution instead provides a running instance of Hadoop, with various ecosystem projects, quickly and easily. In [39], the interested reader will find a walkthrough of how to install Hadoop from Bigtop on a range of Linux distributions; this avoids the pain of figuring out the dependency issues the hard way! Alternatively, a user can download pre-packaged and configured virtual machines, from vendors and other sources, which can be used to explore Hadoop and the ecosystem straight away (such as Cloudera's Hadoop demo VMs [4]).

User interfaces / BI

Packaged solutions also provide additional user interfaces facilitating interaction with Hadoop and the other tools in the ecosystem. Cloudera's distributions, for instance, offer a web-based interface called Hue [40], which includes a file browser, a job designer and browser, a Hive UI (called Beeswax), workflow scheduling, profiling, etc. Note that Hadoop itself comes with basic web interfaces that provide some information about what is happening in the cluster in terms of job tracking, capacity, job statistics, etc. Hue is open source, but seems to work only with Cloudera's distributions. There are also quite a few proprietary tools; we have mentioned DataStax, for instance, which has a management tool called OpsCenter providing advanced functionality for managing their Cassandra/Hadoop stack. Another visual development interface is Pentaho [42] (we have already mentioned Kettle, the open source data integration tool from this company). In addition to development, data access, integration, and so on, these tools also offer business intelligence and data mining capabilities. Another dedicated tool in this area is Mahout [43], a machine learning library implementing a range of algorithms in clustering, classification, filtering, etc. on top of Hadoop. Mahout has also built a vibrant community around a range of use cases in frequent itemset generation, personalisation, recommendation, and so on; Yahoo!, Foursquare, and Twitter, among many others, use Mahout implementations for various services. Another tool, dedicated to web mining, is Bixo [47]. Bixo uses Cascading to run specialised web mining applications (pipes, to use the Cascading terminology) on top of Hadoop: it fetches content from the web, parses it (using Tika parsers [48]), and analyses the data (tokenising, classifying, etc.).

Conclusion

This article briefly presented the Hadoop ecosystem, which is still rapidly growing. There are, and will be, many business and research opportunities around the whole ecosystem in terms of implementation, integration, support, management, user experience, and so on. On the business side, for instance, one of the big challenges is to help potential users sort out their options to suit their objectives, and end up using the right stack and components for their needs. In terms of research, many areas are of prime importance, including availability (the story of the namenode being a single point of failure has certainly been widely looked at, and many distributions and research projects provide solutions to it), as well as performance issues, the need for concurrency-optimised and a wider range of input/output patterns, efficient data placement and movement, versioning, improved data integration, more resource-aware scheduling, support for real-time data streaming, and so on. These issues, among others, will make up an important part of the future of Hadoop and its ecosystem.

Links

[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] [48] [49] [50] [51] [52] [53]


More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform interactive execution of map reduce jobs Pig is the name of the system Pig Latin is the

More information

ITG Software Engineering

ITG Software Engineering Introduction to Cloudera Course ID: Page 1 Last Updated 12/15/2014 Introduction to Cloudera Course : This 5 day course introduces the student to the Hadoop architecture, file system, and the Hadoop Ecosystem.

More information

TRAINING PROGRAM ON BIGDATA/HADOOP

TRAINING PROGRAM ON BIGDATA/HADOOP Course: Training on Bigdata/Hadoop with Hands-on Course Duration / Dates / Time: 4 Days / 24th - 27th June 2015 / 9:30-17:30 Hrs Venue: Eagle Photonics Pvt Ltd First Floor, Plot No 31, Sector 19C, Vashi,

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Big Data and Hadoop for the Executive A Reference Guide

Big Data and Hadoop for the Executive A Reference Guide Big Data and Hadoop for the Executive A Reference Guide Overview The amount of information being collected by companies today is incredible. Wal- Mart has 460 terabytes of data, which, according to the

More information

Big Data and Data Science: Behind the Buzz Words

Big Data and Data Science: Behind the Buzz Words Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing

More information

Lightweight Stack for Big Data Analytics. Department of Computer Science University of St Andrews

Lightweight Stack for Big Data Analytics. Department of Computer Science University of St Andrews Lightweight Stack for Big Data Analytics Muhammad Asif Saleem Dissertation 2014 Erasmus Mundus MSc in Dependable Software Systems Department of Computer Science University of St Andrews A dissertation

More information

Open Source Technologies on Microsoft Azure

Open Source Technologies on Microsoft Azure Open Source Technologies on Microsoft Azure A Survey @DChappellAssoc Copyright 2014 Chappell & Associates The Main Idea i Open source technologies are a fundamental part of Microsoft Azure The Big Questions

More information

Chase Wu New Jersey Ins0tute of Technology

Chase Wu New Jersey Ins0tute of Technology CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at

More information

HADOOP. Revised 10/19/2015

HADOOP. Revised 10/19/2015 HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...

More information

HadoopRDF : A Scalable RDF Data Analysis System

HadoopRDF : A Scalable RDF Data Analysis System HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information