The Hadoop ecosystem: an overview
Lamine Aouad, University College Dublin
Hadoop [1] has become the de facto standard in the big data space. It has also become the kernel of a large distributed ecosystem. A collection of projects, many at Apache, have emerged as essential tools to be used alongside Hadoop for a number of important operations, including data management, collection and aggregation, transfer, high-level interfaces, BI reporting, and so on. The community of applications and use cases running on Hadoop, and using these projects, is constantly growing. In this paper, we will have a look at these different projects, and at how they are used and complement Hadoop. A high-level map of the ecosystem is illustrated in Figure 1. Before describing these projects, it is worth mentioning a very interesting emerging project in the ecosystem called Bigtop [2]. Its primary goal is to build a community around the Hadoop ecosystem by providing common packaging and interoperability testing. Indeed, the Hadoop ecosystem includes a range of sophisticated projects, and although each of these projects has a rigorous community practicing a high degree of software development discipline, there was a gap when it came to inter-project integration, with various dependency issues between these projects. The Hadoop space is now actually full of companies filling this gap and delivering fully integrated, pre-packaged Hadoop stacks, such as Cloudera [4] and Karmasphere [3], among others. Bigtop is also filling this gap for the community by focusing on the ecosystem as a whole, rather than individual projects. Figure 1.
The Hadoop ecosystem: the core (MapReduce and HDFS) surrounded by data collection / I/O (Scribe, Flume, Chukwa, Sqoop), workflow (Oozie, Azkaban, Cascading), high-level interfaces (Pig, Hive), OLTP & NoSQL (HBase, Cassandra), management / support (Zookeeper, Avro), UI & BI / packaging (Hue, Cloudera, Karmasphere), and cloud deployments (Whirr, Amazon EMR, Infochimps).

A vanilla Hadoop distribution comes with two core components: its MapReduce engine and its file system, called HDFS (Hadoop Distributed File System). The MapReduce engine coordinates and executes jobs via two roles: the jobtracker and the tasktrackers. Note that in Hadoop terminology, a job is the full program, which is then broken down into a set of tasks. On the other hand, as already mentioned, Hadoop comes with HDFS by default, but it can be (and has been) integrated and used
with other file systems and storage systems. Many distributions and deployments actually use custom file systems, such as QFS from Quantcast [44], recently released as open source, IBM's GPFS [49], and Amazon's S3 [45], among others. HDFS also has two basic roles: a single namenode storing metadata (not involved in ordinary data activity) and the datanodes storing the actual data chunks (typically collocated with the tasktrackers). There is a lot to say about Hadoop execution and the way MapReduce and HDFS work, how to develop an application (using the native Java interface or streaming), and so on. However, this is out of the scope of this paper, as the primary focus is to introduce the ecosystem around these two core components. There are many introductory courses, training resources, and tutorials available online. In the following, we are going to start by describing data integration within the ecosystem, which I believe is one of the most important aspects here.

Data connections

Data has many shapes and formats, and is captured and stored using a range of technologies. It is also distributed in different places. There are a number of tools that deal with delivering data, structured or unstructured, to Hadoop. Many companies have built their own tools to serve their own needs. Some have been released as open source to the community, and have gained in functionality and support. These include Scribe [5], Flume [6], and Sqoop [7], among others. There are also a number of proprietary tools providing connectivity, extraction, and mapping capabilities, such as Pervasive's data integrator. The next few paragraphs briefly introduce these tools.

Unstructured data

Reliable and efficient data preparation is paramount. However, data preparation here means basic assembly and delivery of the data.
This includes collection, aggregation, and movement of the distributed data, as opposed to the data mining point of view, where it includes an extra step manipulating the data to enhance its utility for the mining process. The challenge here is how to get large amounts of data, logs or any other data, to HDFS to be processed by Hadoop. This has to be done in a reliable and fault-tolerant manner. Losing all the click logs of a website for a given period of time, because of a namenode failure for instance, is not tolerable. A number of (quite similar) tools have been developed to stream large flows of data to Hadoop. These include Scribe and Flume (already mentioned), and Chukwa [8]. They provide similar functionality, but differ in the way they are configured, their monitoring, the levels of reliability available, etc. Note that basic tools exist and can be scripted, such as the HDFS copy commands, which provide a way to get data in and out of Hadoop, but this cannot realistically be used in a large-scale production environment. Facebook's Scribe, for instance, was initially designed to reliably scale to tens of billions of messages a day. It is also used at Twitter and Digg, among others. Cloudera's Flume is also used in production by a number of companies including Photobucket. An additional similar system, called Kafka [9], was initially developed at LinkedIn. It is comparable to Flume and Scribe, and can be used for activity stream processing.
However, it is different in its primitives, as it was initially designed as a distributed messaging system. Although they offer similar functionality, these tools might serve different use cases, in terms of set-up, manageability, integration, etc., in addition to the support and development momentum around them, which could be a deciding factor for choosing one over the other.

Structured data

The same challenge applies to structured data sources, including relational databases, enterprise warehouses, and also NoSQL systems. Although some tools exist, properly placing Hadoop alongside existing sources of data is still quite challenging. In this case, it is not a collection issue but rather an integration or a mapping issue, i.e. the integration of Hadoop into the existing infrastructure, and the mapping of the existing data to Hadoop and related systems. Efficient integration and mapping processes will certainly make the transition to Hadoop a lot easier. Sqoop [7] is one of the tools allowing the transfer of large amounts of data between Hadoop and structured data stores. Sqoop can be used to import data into HDFS or into systems working alongside Hadoop, including HBase [10] (c.f. the NoSQL section). It comes with various connectors to popular databases, including MySQL and PostgreSQL, or proprietary ones such as Oracle, Teradata, etc., and new connectors can be plugged in. Sqoop provides an API for operation and management, which is very handy for integration with external support and workflow systems. Sqoop was initially started at Cloudera, and is now quite established in enterprise systems. There is another system called Hiho [11], from a company called Nubetech. It aims at providing similar integration functionality with existing data stores; however, it does not seem to be in active development anymore.
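For illustration, the shape of a Sqoop-style import can be sketched in a few lines of Python: rows are read from a relational source and sharded into delimited "part" outputs, one per map task, which is the file layout Sqoop produces in HDFS by default. The table, its columns, and the round-robin split below are invented for this sketch; a real import goes through Sqoop's JDBC connectors and splits on a key range rather than connecting to the database directly.

```python
import sqlite3

def export_table(conn, table, num_splits=2):
    """Mimic a Sqoop-style import: read a table and shard its rows
    into delimited 'part' outputs, one per map task."""
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    parts = [[] for _ in range(num_splits)]
    for i, row in enumerate(rows):
        # Sqoop splits on a key range; round-robin is only a stand-in here.
        parts[i % num_splits].append(",".join(str(c) for c in row))
    return {f"part-m-{i:05d}": "\n".join(p) for i, p in enumerate(parts)}

# Toy relational source (stands in for MySQL/PostgreSQL behind a connector).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checkins (userid INTEGER, locid INTEGER)")
conn.executemany("INSERT INTO checkins VALUES (?, ?)",
                 [(1, 10), (2, 10), (3, 42), (4, 10)])

for name, body in sorted(export_table(conn, "checkins").items()):
    print(name, body.replace("\n", " | "))
```

Each "part" file would then land in HDFS, ready for MapReduce jobs, Pig, or Hive to consume.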
The importance of properly bridging existing data stores with Hadoop has led many companies, especially those in the fields of data delivery and integration, to enter the Hadoop space. A data integration company called Pervasive [12] has developed a Hadoop edition of their data integrator, which enables users to roll data in and out of more than 200 different data sources. It offers a browser-based user interface, and visual drag-and-drop mappers integrating sources and targets. Another visual drag-and-drop data extraction and integration tool is Pentaho's Kettle [13], which has an open source community edition. Kettle offers data migration, export, and loading capabilities for a range of formats and database engines, including CSV, XML, Oracle, SQL Server, MySQL, and PostgreSQL, among others.

Management - support

Deploying, configuring, managing, and maintaining distributed platforms and applications are complex matters. One of the main reasons for that is the very large scale, and thus failure, which is intrinsic to these systems. There have been efforts by the community to develop coordination and support tools around Hadoop. One of these tools is Yahoo!'s Zookeeper [14], which can basically be described as a fault-tolerant middleware providing common services that are needed in distributed infrastructures in order to maintain the general
availability, configuration, and synchronisation of the system. Zookeeper is in essence a centralised service, but it can maintain a hierarchical tree of servers containing what are called znodes, a core concept in Zookeeper. A znode is basically a unified notion of a node that holds the data and metadata managing the cluster-wide configuration and status of the system and applications. In another context, a useful support tool for Hadoop is a serialisation tool called Avro [15]. Note that Hadoop has its own marshalling format and does not use the native Java one, which, according to Doug Cutting (creator of Hadoop), was "big and hairy", and he thought Hadoop needed something "lean and mean". Avro defines a serialisation format with support for richer data types (such as nested data), RPC, versioning, and many languages. The variety of possible pieces in a Hadoop-based system has indeed made data interoperability a primary issue, and Avro is a good way to manage data throughout the potentially complex data flows generated by Hadoop set-ups. It is becoming a common format for the ecosystem. A Cloudera tool called RecordBreaker [53] automatically turns text-formatted data (server logs, sensor readings, etc.) into structured Avro data, without the need to write parsers or extractors, which is extremely useful.

High-level interfaces

Hadoop's MapReduce is still fairly low-level. Developers must think in terms of maps and reduces, keys and values, the partitioning, etc. They also need skills in Java, which is the native interface, and/or in a scripting language using the streaming interface. Implementing complicated applications in MapReduce can be quite challenging. Mapping applications into the map and reduce patterns, which usually requires multiple stages, is far from trivial.
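To make the low-level nature of the model concrete, here is a minimal pure-Python simulation of a MapReduce word count, with the map, shuffle, and reduce phases written out by hand. No Hadoop is involved; this only mirrors the dataflow a streaming job would follow, and even this trivial analytic forces the developer to think in key/value pairs.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word, as a streaming mapper would.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group values by key, the step Hadoop runs between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"], counts["fox"])  # -> 3 2
```

In a real job, each phase runs distributed across the tasktrackers and the shuffle moves data over the network, but the programming burden is the same: everything must be expressed as these three steps.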
The Hadoop community has developed high-level interfaces raising the level of abstraction, essentially implementing common building blocks and operators for ad-hoc data analysis. Pig [16] and Hive [17] are two widely used high-level interfaces. I am going to be a bit biased towards Pig since I have been using it for a little while. Pig was initially developed at Yahoo!, and is a data analysis platform made up of two pieces: a data flow language called PigLatin, and an execution environment mapping statements (written in PigLatin) into maps and reduces, and executing them on Hadoop. In Pig, the data structure is much richer, typically multivalued and nested. A PigLatin script defines a DAG (Directed Acyclic Graph), which is a step-by-step set of operations, each performing some transformation on the data. PigLatin would be quite straightforward for people familiar with a scripting language. It also allows rapid prototyping of algorithms for processing large datasets, and it has a pretty decent compiler. Indeed, PigMix [18], a set of queries used to test Pig performance against pure MapReduce jobs, currently shows Pig running at 1.1 times the MapReduce execution time on average (over 20+ benchmarks). Pig runs as a client-side application, and can actually reside outside of the Hadoop cluster. It can be used in different ways:
o The shell, called Grunt,
o Submitting scripts directly, or
o Embedded in Java, using the PigServer Java class (similar to SQL using JDBC), or in Python.
These can be executed under two modes: local and Hadoop. In the local mode, Pig still depends on Hadoop's local job runner (in versions after 0.7), which reads from the local file system and executes jobs locally. The Hadoop mode needs access to a Hadoop cluster and an HDFS installation. There is also a range of tools and plugins that ease the editing, monitoring, debugging, and integration of Pig workflows, including PigPen, Penny, and PigPy, among many others [19]. In terms of the language, PigLatin has multiple statements that make it look like SQL. However, it is actually quite different from relational systems. There are many features that Pig has that are absent in relational database systems, and vice versa. These include more complex data structures in Pig, as opposed to flatter data structures in SQL. There is also Pig's ability to use user-defined functions and streaming, which makes it more customisable. On the other hand, transactions and indexes are absent in Pig, as are the random reads and writes supported in relational systems, and so on. The other high-level interface we have mentioned, namely Hive, actually sits between Pig and relational database systems, as its language is a variant of SQL and it operates on data stored in tables. Let's have a look at a simple PigLatin example. The following example calculates the top locations in check-in data. This data comes from Gowalla [20], which was a location-based social networking site. This very simple analysis will show the top social landmarks across a city.
records = LOAD 'input/checkindata' AS (userid:int, date:chararray, lat:double, lg:double, locid:int);
-- get the users and their check-in locations
checkins = FOREACH records GENERATE userid, locid;
-- group checkins by locid
grouped_records = GROUP checkins BY locid;
-- number of checkins per location
how_many_per_location = FOREACH grouped_records GENERATE $0, COUNT($1);
ordered = ORDER how_many_per_location BY $1 DESC;
-- get the top 50
top_locations = LIMIT ordered 50;
STORE top_locations INTO 'output/toplocations';

This script loads the input data from a field-delimited text file, with each line holding the user id, the date, latitude and longitude, and the location id. Pig supports different formats, and can be customised to load and store additional ones. The AS clause, in the LOAD statement, associates a schema with records. The schema simply associates names and types with the fields in the relation, e.g. the user id is an integer. This is optional but recommended: if the AS clause is not there, Pig will consider all the fields as bytearray, and this may lead to errors. LOAD takes a URI argument. It is important to note that the output of any statement in Pig is a set of tuples. A tuple is a sequence of fields of any type. This statement produces a set of (user id, date, latitude, longitude, location id) tuples. The user can give any name or alias to the successive statements. Note that each statement is not immediately executed, but rather added to a logical plan. The execution is not started until the whole flow has been defined. The logical plan is then compiled into a physical one and executed. In the meantime, there are a few
useful diagnostic operators when using the Grunt shell, which allow the user to interact with each statement and see what it produces: DUMP, which in fact triggers the execution and displays the results; ILLUSTRATE, which shows a sample execution; DESCRIBE, which shows the schema; and EXPLAIN, which shows the logical and physical plans. The second statement in the example uses the FOREACH ... GENERATE statement, which acts on columns of every row of a relation. In this case, from records it generates only two fields in the relation checkins. The checkins are then grouped using locid as the grouping key. At this stage, we have one tuple for each location in the input data, containing the location id and a bag of tuples for that location. A bag is just a collection of tuples, represented by curly braces in Pig. It will look like this:

grouped_records: {group: int, checkins: {(userid: int, locid: int)}}

By grouping checkins in this way, we have created a row per location. The grouping field (the location) is given the alias group, and the second field has the same structure as checkins. What remains now is just to add up the tuples of each bag to find the actual number of check-ins per location. The next statement carries out this counting. It uses positional references in this case, i.e. $0 and $1, which is similar to using the names of the fields. The following statements order the results, take the top 50, and store them. In a few lines of PigLatin code we have carried out an analytics operation that would have been a lot more complex using maps and reduces. An interesting read about the mapping of PigLatin statements into MapReduce jobs can be found in [21]. On the other hand, Hive, as already mentioned, sits between Pig and relational systems, as it uses a variant of SQL as its language, called HiveQL. Hive defines tables for the data and keeps its metadata, such as schemas and partitions (subdirectories in HDFS), in either a shared or local database.
The schema is explicit and the types are defined upfront (different from Pig). As in Pig, on the other hand, users can define custom functions, called User Defined Functions. It also provides both shell and web interfaces. Hive would be a good option for data analysts already familiar with SQL, and is consequently better suited to traditional data analytics. Hive was initially developed at Facebook, and is currently used by many sites including Digg and Scribd, among others. There is also a query language called Jaql [50], based on JSON (JavaScript Object Notation). It was open sourced by IBM, but its development has been on hold for a little while now. It is however part of BigInsights, IBM's Hadoop distribution [51].

Workflow systems

Workflow systems have become increasingly attractive in a range of fields, including scientific computing, coordinating business processes, etc. Workflow systems basically describe, coordinate, and map a sequence of jobs to one of different underlying execution systems (or individuals). A workflow engine supporting different types of Hadoop jobs (Java, streaming, Pig, etc.) is called Oozie [22]. Many people in distributed systems are actually very familiar with workflow systems, and would prefer to use them in developing their distributed
applications, as they offer simple interfaces and languages coupled with resource-independent management technologies able to efficiently schedule and map the workflow specification onto a large pool of resources. Like many workflow languages, Oozie uses an XML syntax to define the dependent set of actions as a DAG. There is another workflow tool called Azkaban [23], from LinkedIn, with more or less similar functionality. Another useful system, in the same spirit of allowing the composition of rich analytics applications on top of Hadoop, is called Cascading [24]. It consists of a data processing API and query planner to define, share, and execute workflows on top of Hadoop. One of the core concepts in Cascading is the notion of pipes, which define a series of reusable data processing operations such as parsing, filtering, joining, etc.; a workflow is then a set of pipes. Cascading can be seen as a different way of doing what Pig does, through a Java API, which is different from the workflow systems mentioned above, as they describe the workflow specification using a dedicated language (XML for instance). In this sense, one can end up with an implementation combining both Oozie and Cascading, for instance. There is another tool from Cloudera called Crunch [52], a Java library focusing on providing a set of primitive operations and user-defined functions that can be combined to create complex multi-stage pipelines. Crunch then compiles these pipelines into a sequence of MapReduce jobs and manages their execution. This sounds a lot like Cascading; the main difference seems to be that Crunch is more suitable for complex data types that would not naturally map to the tuple-based model, avoiding the implementation of complex user-defined functions to support these data types under Cascading or Pig.
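The pipe idea behind Cascading (and Crunch's primitive operations) can be sketched in a few lines of Python: each stage is a reusable transformation over a stream of records, and a workflow is just their composition. The stage names and the log format below are invented for this illustration and do not correspond to Cascading's actual Java API.

```python
def pipe(*stages):
    """Compose stages into a single workflow: each stage consumes and
    yields an iterable of records, like a Cascading pipe assembly."""
    def workflow(records):
        for stage in stages:
            records = stage(records)
        return records
    return workflow

# Reusable stages, in the spirit of pipes: parse, filter, project.
parse   = lambda recs: (line.split(",") for line in recs)
keep_ok = lambda recs: (r for r in recs if r[1] == "200")
project = lambda recs: (r[0] for r in recs)

# A toy workflow over "path,status" log lines.
log_flow = pipe(parse, keep_ok, project)
print(list(log_flow(["/home,200", "/missing,404", "/about,200"])))
```

The point is the reuse: the same parse or filter stage can be dropped into many workflows, and the planner (here, trivially, the composition) decides how the whole assembly maps onto execution, which in Cascading or Crunch means a sequence of MapReduce jobs.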
NoSQL data stores

There are too many aspects to NoSQL technologies for us to even begin to do them justice in this small section. There will be a dedicated paper on these technologies and NoSQL data management on the cloud. Briefly, the core aspects of NoSQL have initially been around unstructured data, a variety of data models, and the sacrifice of consistency in favour of availability and scalability. The origin of the latter is Brewer's CAP theorem. It basically stated that consistency and availability cannot both be maintained when a database is partitioned across a fallible large network. Or, in general, for any given system, one can strongly support only two of these three requirements: Consistency, Availability, and Partition tolerance. Many of the NoSQL technologies rose alongside big Internet companies. The list of existing NoSQL systems is now quite impressive [25]. As opposed to the master/slave setup in relational databases, most of the NoSQL stores feature a peer-to-peer protocol to maintain their data. Not all of them though, as MongoDB [28] uses a master/slave scheme. Decentralisation actually has an advantage in removing the single point of failure, because all the nodes play identical roles. Such systems are also easier to maintain and configure for the same reason. It also serves scalability, which is an essential architectural feature. A core requirement
is that the data store must be able to accept new data, nodes, and users, without major disruption of the system, reconfiguration, or performance degradation. Existing NoSQL data stores have different data models:
o Key-value stores, such as Redis [30],
o Column family stores, such as Cassandra [27],
o Document stores, such as CouchBase [26] or MongoDB [28],
o Graph-based stores, such as Neo4j [29].
The integration of these technologies with Hadoop & co. is also a hot topic and is of great interest to the community and the big data players. A company called DataStax [41], for instance, has its business model around the integration of Cassandra and Hadoop.

Packaged distributions - deployments

The Hadoop marketplace is quite crowded nowadays! Hadoop-based solution packaging and distributions have flourished in the last couple of years. The most well known (used?) distribution would be from Cloudera [4] (the company that employs the original author of Hadoop, Doug Cutting). Others include MapR [31], Hortonworks [32], and Karmasphere [3], among others. The Bigtop project, already mentioned earlier in this paper, serves a similar purpose of packaging the ecosystem, but under the Apache umbrella. On the other hand, there are also different ready-to-use deployments on the cloud, including Amazon Elastic MapReduce (EMR) [33], Infochimps [34], and Hadoop on Azure [35], among others. Google's AppEngine [46] also offers a MapReduce facility that can be used as part of an application, but it is not an analytics offering per se (Google has its own big data offerings, including BigQuery). Most of these offerings can be programmed in the usual ways, using native Java, streaming, Pig, Hive, etc., and offer SDKs, tools, and interfaces to develop, access, and manage resources and jobs, in addition to NoSQL stores and working and big data storage solutions. There is also the possibility for users to deploy their own Hadoop cluster on the cloud.
Whirr [36], for instance, provides a set of libraries for running cloud-neutral services using a common API. It initially started as a set of bash scripts in the Hadoop project to allow deployment on EC2, and was then ported to Python to include extra services and a wider range of providers. Its CLI provides a quite easy and convenient way to launch a cluster in a few minutes. A quick word on the multi-cloud trend, which allows the use of various cloud providers to deploy a given application or service: many libraries (and also PaaS offerings) actually support a range of provider operations, languages, frameworks, and services. These include libcloud [37] and jclouds [38] (which was used to create a Java version of the initial scripts in Whirr). On the PaaS side, Cloud Foundry would be an interesting project to mention here.

Bigtop Hadoop distribution

Installing Hadoop and each of its projects independently can be very cumbersome. Installing the Bigtop Hadoop distribution provides a running instance of Hadoop with various ecosystem projects quickly and easily. In [39], the interested reader will find a walk-through on how to install Hadoop from
Bigtop on a range of Linux distributions. This will avoid the pain of figuring out the dependency issues the hard way! Alternatively, a user can also download pre-packaged and configured virtual machines, from vendors and other sources, which can be used to explore Hadoop and the ecosystem straightaway (such as Cloudera's Hadoop demo VMs [4]).

User interfaces / BI

Packaged solutions also provide additional user interfaces facilitating the interaction with Hadoop and other tools in the ecosystem. Cloudera's distributions, for instance, offer a web-based interface called Hue [40], which includes a file browser, a job designer and browser, a Hive UI (called Beeswax), workflow scheduling, profiling, etc. Note that Hadoop comes with basic web interfaces that provide some information about what is happening in the cluster in terms of job tracking, capacity, job statistics, etc. Hue is open source, but seems to only work with Cloudera's distributions. There are also quite a few proprietary tools; we have mentioned DataStax for instance, which has a management tool called OpsCenter providing advanced functionality for managing their Cassandra/Hadoop stack. Another visual development interface is called Pentaho [42] (we have mentioned Kettle, the open source data integration tool from this company). In addition to development, data access, integration, and so on, these tools also offer business intelligence and data mining capabilities. Another dedicated tool in that respect is Mahout [43]. It is a machine learning library implementing a range of algorithms in clustering, classification, filtering, etc. on top of Hadoop. Mahout has also built a vibrant community around a range of use cases, in frequent itemset generation, personalisation, recommendation, and so on. Yahoo!, Foursquare, and Twitter, among many others, use Mahout implementations for various services. Another tool, dedicated to web mining, is called Bixo [47].
Bixo uses Cascading to run specialised web mining applications (pipes, to use the Cascading terminology) on top of Hadoop. Bixo fetches content from the web, parses it (using Tika parsers [48]), and analyses the data (tokenising, classifying, etc.).

Conclusion

This article briefly presented the Hadoop ecosystem, which is still rapidly growing. There is, and will be, a lot of business and research opportunity around the whole ecosystem in terms of implementation, integration, support, management, user experience, and so on. On the business side, for instance, one of the big challenges will be to help potential users sort out their options to suit their objectives, and end up using the right stack and components for their needs. In terms of research, many areas are of prime importance, including availability (the story of the namenode being a single point of failure has certainly been widely looked at, and many distributions and research projects provide solutions to it), performance issues, the need for concurrency-optimised and a wider range of input/output patterns, efficient data placement and movement, versioning, improved data integration, more resource-aware scheduling, support for real-time data streaming, and so on. These issues, among others, will make up an important part of the future of Hadoop and its ecosystem.
Links
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] [48] [49] [50] [51] [52] [53]
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationSpring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE
Spring,2015 Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Contents: Briefly About Big Data Management What is hive? Hive Architecture Working
More informationInternals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
More informationHadoop Introduction. Olivier Renault Solution Engineer - Hortonworks
Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013
More informationCOSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015
COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationBIG DATA TRENDS AND TECHNOLOGIES
BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.
More informationt] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from
Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')
More informationData-Intensive Programming. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti
More informationHADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM
HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction
More informationBIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig
BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig Contents Acknowledgements... 1 Introduction to Hive and Pig... 2 Setup... 2 Exercise 1 Load Avro data into HDFS... 2 Exercise 2 Define an
More informationWorkshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
More informationHadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
More informationITG Software Engineering
Introduction to Apache Hadoop Course ID: Page 1 Last Updated 12/15/2014 Introduction to Apache Hadoop Course Overview: This 5 day course introduces the student to the Hadoop architecture, file system,
More informationYou should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.
What is this course about? This course is an overview of Big Data tools and technologies. It establishes a strong working knowledge of the concepts, techniques, and products associated with Big Data. Attendees
More informationMySQL and Hadoop. Percona Live 2014 Chris Schneider
MySQL and Hadoop Percona Live 2014 Chris Schneider About Me Chris Schneider, Database Architect @ Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationBringing Big Data to People
Bringing Big Data to People Microsoft s modern data platform SQL Server 2014 Analytics Platform System Microsoft Azure HDInsight Data Platform Everyone should have access to the data they need. Process
More informationPrepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
More informationPro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah
Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big
More informationMapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationGAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION
GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION Syed Rasheed Solution Manager Red Hat Corp. Kenny Peeples Technical Manager Red Hat Corp. Kimberly Palko Product Manager Red Hat Corp.
More informationHadoop Job Oriented Training Agenda
1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module
More informationCOURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
More informationIntroduction To Hive
Introduction To Hive How to use Hive in Amazon EC2 CS 341: Project in Mining Massive Data Sets Hyung Jin(Evion) Kim Stanford University References: Cloudera Tutorials, CS345a session slides, Hadoop - The
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationBIG DATA & HADOOP DEVELOPER TRAINING & CERTIFICATION
FACT SHEET BIG DATA & HADOOP DEVELOPER TRAINING & CERTIFICATION BIGDATA & HADOOP CLASS ROOM SESSION GreyCampus provides Classroom sessions for Big Data & Hadoop Developer Certification. This course will
More informationCloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
More informationHadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到
More informationIBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look
IBM BigInsights Has Potential If It Lives Up To Its Promise By Prakash Sukumar, Principal Consultant at iolap, Inc. IBM released Hadoop-based InfoSphere BigInsights in May 2013. There are already Hadoop-based
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationAmerican International Journal of Research in Science, Technology, Engineering & Mathematics
American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629
More informationHadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013
Hadoop 101 Lars George NoSQL- Ma4ers, Cologne April 26, 2013 1 What s Ahead? Overview of Apache Hadoop (and related tools) What it is Why it s relevant How it works No prior experience needed Feel free
More informationThe Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang
The Big Data Ecosystem at LinkedIn Presented by Zhongfang Zhuang Based on the paper The Big Data Ecosystem at LinkedIn, written by Roshan Sumbaly, Jay Kreps, and Sam Shah. The Ecosystems Hadoop Ecosystem
More informationFrom Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten
From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten MC Brown, Director of Documentation Linas Virbalas, Senior Software Engineer. About Tungsten Replicator Open source drop-in
More informationNative Connectivity to Big Data Sources in MSTR 10
Native Connectivity to Big Data Sources in MSTR 10 Bring All Relevant Data to Decision Makers Support for More Big Data Sources Optimized Access to Your Entire Big Data Ecosystem as If It Were a Single
More informationBig Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies
Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies Big Data: Global Digital Data Growth Growing leaps and bounds by 40+% Year over Year! 2009 =.8 Zetabytes =.08
More informationComprehensive Analytics on the Hortonworks Data Platform
Comprehensive Analytics on the Hortonworks Data Platform We do Hadoop. Page 1 Page 2 Back to 2005 Page 3 Vertical Scaling Page 4 Vertical Scaling Page 5 Vertical Scaling Page 6 Horizontal Scaling Page
More informationChukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84
Index A Amazon Web Services (AWS), 50, 58 Analytics engine, 21 22 Apache Kafka, 38, 131 Apache S4, 38, 131 Apache Sqoop, 37, 131 Appliance pattern, 104 105 Application architecture, big data analytics
More informationBig Data Analytics - Accelerated. stream-horizon.com
Big Data Analytics - Accelerated stream-horizon.com Legacy ETL platforms & conventional Data Integration approach Unable to meet latency & data throughput demands of Big Data integration challenges Based
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationApache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
More informationCloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu
Lecture 4 Introduction to Hadoop & GAE Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu Outline Introduction to Hadoop The Hadoop ecosystem Related projects
More informationBig Data Advanced Analytics for Game Monetization. Kimberly Chulis
Big Data Advanced Analytics for Game Monetization Kimberly Chulis CEO Core Analytics, LLC Core Analytics / Game Loyalty Bay area and Chicago based digital advanced analytics firm Big Data / NoSQL Advanced
More informationBIG DATA HADOOP TRAINING
BIG DATA HADOOP TRAINING DURATION 40hrs AVAILABLE BATCHES WEEKDAYS (7.00AM TO 8.30AM) & WEEKENDS (10AM TO 1PM) MODE OF TRAINING AVAILABLE ONLINE INSTRUCTOR LED CLASSROOM TRAINING (MARATHAHALLI, BANGALORE)
More informationHadoop: The Definitive Guide
FOURTH EDITION Hadoop: The Definitive Guide Tom White Beijing Cambridge Famham Koln Sebastopol Tokyo O'REILLY Table of Contents Foreword Preface xvii xix Part I. Hadoop Fundamentals 1. Meet Hadoop 3 Data!
More informationBig Data and Hadoop with components like Flume, Pig, Hive and Jaql
Abstract- Today data is increasing in volume, variety and velocity. To manage this data, we have to use databases with massively parallel software running on tens, hundreds, or more than thousands of servers.
More informationMySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering
MySQL and Hadoop: Big Data Integration Shubhangi Garg & Neha Kumari MySQL Engineering 1Copyright 2013, Oracle and/or its affiliates. All rights reserved. Agenda Design rationale Implementation Installation
More informationGoogle Bing Daytona Microsoft Research
Google Bing Daytona Microsoft Research Raise your hand Great, you can help answer questions ;-) Sit with these people during lunch... An increased number and variety of data sources that generate large
More informationBig Data and Industrial Internet
Big Data and Industrial Internet Keijo Heljanko Department of Computer Science and Helsinki Institute for Information Technology HIIT School of Science, Aalto University keijo.heljanko@aalto.fi 16.6-2015
More informationMicrosoft SQL Server 2012 with Hadoop
Microsoft SQL Server 2012 with Hadoop Debarchan Sarkar Chapter No. 1 "Introduction to Big Data and Hadoop" In this package, you will find: A Biography of the author of the book A preview chapter from the
More informationSession: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop
More informationIntroduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# So What is Data Science?" Doing Data Science" Data Preparation" Roles" This Lecture" What is Data Science?" Data Science aims to derive knowledge!
More informationApache Hadoop: The Big Data Refinery
Architecting the Future of Big Data Whitepaper Apache Hadoop: The Big Data Refinery Introduction Big data has become an extremely popular term, due to the well-documented explosion in the amount of data
More informationSQL on NoSQL (and all of the data) With Apache Drill
SQL on NoSQL (and all of the data) With Apache Drill Richard Shaw Solutions Architect @aggress Who What Where NoSQL DB Very Nice People Open Source Distributed Storage & Compute Platform (up to 1000s of
More informationConstructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.
More informationUsing MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com
Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com Agenda The rise of Big Data & Hadoop MySQL in the Big Data Lifecycle MySQL Solutions for Big Data Q&A
More informationMoving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
More informationThe Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn
The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn Presented by :- Ishank Kumar Aakash Patel Vishnu Dev Yadav CONTENT Abstract Introduction Related work The Ecosystem Ingress
More informationIntroduction to Big Data Training
Introduction to Big Data Training The quickest way to be introduce with NOSQL/BIG DATA offerings Learn and experience Big Data Solutions including Hadoop HDFS, Map Reduce, NoSQL DBs: Document Based DB
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies
ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationApache Hadoop: Past, Present, and Future
The 4 th China Cloud Computing Conference May 25 th, 2012. Apache Hadoop: Past, Present, and Future Dr. Amr Awadallah Founder, Chief Technical Officer aaa@cloudera.com, twitter: @awadallah Hadoop Past
More informationRole of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
More informationBig Data Management and Security
Big Data Management and Security Audit Concerns and Business Risks Tami Frankenfield Sr. Director, Analytics and Enterprise Data Mercury Insurance What is Big Data? Velocity + Volume + Variety = Value
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationHadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics
In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning
More informationHadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science
A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org
More informationBig Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform interactive execution of map reduce jobs Pig is the name of the system Pig Latin is the
More informationITG Software Engineering
Introduction to Cloudera Course ID: Page 1 Last Updated 12/15/2014 Introduction to Cloudera Course : This 5 day course introduces the student to the Hadoop architecture, file system, and the Hadoop Ecosystem.
More informationTRAINING PROGRAM ON BIGDATA/HADOOP
Course: Training on Bigdata/Hadoop with Hands-on Course Duration / Dates / Time: 4 Days / 24th - 27th June 2015 / 9:30-17:30 Hrs Venue: Eagle Photonics Pvt Ltd First Floor, Plot No 31, Sector 19C, Vashi,
More informationOpen source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
More informationBig Data and Hadoop for the Executive A Reference Guide
Big Data and Hadoop for the Executive A Reference Guide Overview The amount of information being collected by companies today is incredible. Wal- Mart has 460 terabytes of data, which, according to the
More informationBig Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
More informationLightweight Stack for Big Data Analytics. Department of Computer Science University of St Andrews
Lightweight Stack for Big Data Analytics Muhammad Asif Saleem Dissertation 2014 Erasmus Mundus MSc in Dependable Software Systems Department of Computer Science University of St Andrews A dissertation
More informationOpen Source Technologies on Microsoft Azure
Open Source Technologies on Microsoft Azure A Survey @DChappellAssoc Copyright 2014 Chappell & Associates The Main Idea i Open source technologies are a fundamental part of Microsoft Azure The Big Questions
More informationChase Wu New Jersey Ins0tute of Technology
CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at
More informationHADOOP. Revised 10/19/2015
HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...
More informationHadoopRDF : A Scalable RDF Data Analysis System
HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn
More informationOpen source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
More information