Deciphering Big Data Stacks: An Overview of Big Data Tools
Tomislav Lipic (1), Karolj Skala (1), Enis Afgan* (1,2)

(1) Centre for Informatics and Computing, Rudjer Boskovic Institute (RBI), Zagreb, Croatia, {tlipic, skala}@irb.hr
(2) Department of Biology, Johns Hopkins University, Baltimore, MD, USA, [email protected]

Abstract: With its ability to ingest, process, and decipher an abundance of incoming data, Big Data is considered by many a cornerstone of future research and development. However, the large number of available tools, and the overlap among them, is impeding their technological potential. In this paper, we present a systematic grouping of the available tools and a network of dependencies among them, with the aim of composing individual tools into functional software stacks required to perform Big Data analyses.

Keywords: Big Data; distributed systems; Cloud computing; data platform trends; interdependency

I. INTRODUCTION

With the multi-V model [1] used to categorize Big Data, the task of effectively addressing Big Data analysis needs to balance a combination of intricate issues, including data curation, management, and privacy, as well as resource provisioning and management. Further complicating the situation is that each Big Data analysis project requires a unique yet composite environment driven by the project's goals. One increasingly complicated facet of the environment is the required software stack: deploying the appropriate stack is progressively becoming a technologically challenging undertaking.

Simultaneous to the emergence of the Big Data software stack challenge, the availability of Cloud computing platforms has spread beyond specialized commercial offerings into widespread installations at universities, institutes, and industry [2]. This adoption was fueled by the demonstrated increase in available infrastructure flexibility, functionality, and service scalability. That alone, however, was not sufficient to sustain the wide adoption of the Cloud; in addition, well-defined programmatic access (i.e., APIs) and clear terms of service (i.e., pricing/allocation) were key to transforming otherwise abstract technical terms (e.g., virtual machines, volumes, objects) into accessible functionality (e.g., scalable web applications), enabling the cloud-transformation.

Conceptually, the Cloud provides many features suitable for deploying Big Data stacks and has been demonstrated as a viable platform for deploying certain Big Data solutions [3]. The Cloud allows information integration from Big Data sources by offering the capacity to store and process the data, enabling actionable insights to be derived. However, the process of utilizing Big Data solutions is still complex and time consuming, mostly left in the hands of adept data scientists and engineers [4]. Namely, in order to take advantage of the benefits that Big Data solutions offer, applications need to be (re)designed and (re)developed using appropriate concepts and technologies, making them Big Data-aware. Combined with the required computing infrastructure, Big Data requires a significant commitment on behalf of an organization, a group, or an individual.

To ease this transition, this paper sheds light on the tools used to deliver a software stack required to perform Big Data analysis. It does so by providing a systematic characterization of the current tools in the form of a tool catalog, describing tool capabilities, and offering a functional comparison among them.
It then presents a network of deployment dependencies among the described tools. By providing a concise description of the components used to deliver Big Data analysis platforms, we believe the paper can contribute towards enabling a Big Data-transformation, comparable to the cloud-transformation, and support adoption of the available technologies. In future work, we will leverage this information to enable automated composition and deployment of the available technologies into full-featured stacks that deliver functional Big Data analysis platforms in the Cloud.

II. TECHNOLOGIES FOR BIG DATA CHALLENGES

The field of Big Data focuses on the challenges that the available data imposes and aims to produce actionable results based on the data. Inspired by the diversity of inputs and the multiplicity of conceivable outcomes, the number of technological options covering the Big Data space has exploded. One way to navigate this space is to follow a trajectory of the developed technologies as they address different Big Data challenges. Four such trajectories have emerged and can be identified as recognizable trends: (1) batch, (2) query, (3) low-latency query, and (4) continuous processing. These trends represent a grouping of technologies, embodied by tools with similar purpose.

A. Batch Processing

Batch can be seen as the 'traditional' Big Data: it revolves around the MapReduce paradigm [5], where an input file is split into a number of smaller input files (the map step), an algorithm is executed in parallel batch mode on each of the input files across possibly many commodity computers, and the outputs from each step are then combined into a single output (the reduce step). The Apache Hadoop project implements this paradigm; this open source implementation has been adopted and modified to varying degrees by a number of derivative distributions, most notably Cloudera CDH, Hortonworks Data Platform (HDP), the MapR M series, and Amazon Elastic MapReduce (EMR). Utilizing the Hadoop framework requires an application to be developed using the MapReduce paradigm and is typically recommended when the input data is large (terabytes or larger) and diverse.

Hadoop is accompanied by its own file system, the Hadoop Distributed File System (HDFS) [6], which provides storage and data redundancy on top of commodity hardware. HDFS is not a POSIX-compliant file system, and any data operated on via Hadoop must first be imported into HDFS. Upon completion of the data analysis, the data is exported onto a POSIX file system to be accessed by other tools.
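To make the paradigm concrete, the following is a minimal word-count sketch of the two MapReduce steps in Python. It is our own illustration rather than code from any of the surveyed tools; under Hadoop Streaming, the two functions would run as separate mapper and reducer executables, with the framework handling the input splits and the shuffle/sort between them.

#!/usr/bin/env python
# Word count expressed as a map step and a reduce step (illustrative sketch).
import sys
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map step: emit a (word, 1) pair for every word in the input split.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    # Reduce step: with pairs sorted by key (normally done by the
    # framework's shuffle/sort phase), sum the counts for each word.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")

Run locally as "cat input.txt | python wordcount.py"; on a cluster, Hadoop executes many mapper instances in parallel over HDFS-resident splits and routes each key to a reducer.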
B. Query Processing

MapReduce provides key features for addressing the analysis and storage challenges of large volumes of data. However, it requires programmatic access to the data. By providing higher-level interfaces instead, access to data is democratized, allowing users not versed in programming to access the data directly. As a first step in this direction, a query language was devised and implemented as part of Apache Pig [7], which translates high-level custom queries into MapReduce jobs. Apache Hive [8] was subsequently developed to take SQL-like queries and translate them into MapReduce jobs. Unlike the raw MapReduce approach, which requires a significant adoption effort, these solutions allow existing investments in SQL to be leveraged in the Big Data context.

The usage of NoSQL databases is another approach to storing and querying Big Data. Instead of storing the data directly in HDFS and retrieving it via MapReduce jobs, the data is stored in a database and retrieved via high-level queries [9]. These databases provide non-relational, distributed, and horizontally scalable storage solutions while hiding away the details of replicas and table sharding otherwise necessary when storing data at large scale. At the same time, they provide direct access to the data in the form of simpler, query-based interfaces. NoSQL databases are roughly categorized into the following four classes [10], [11]: document stores, key-value stores, columnar databases, and graph databases. Tools such as CouchDB (document), Riak (key-value), HBase (column), and Neo4j (graph) provide this horizontal storage scalability and implement the query interface, making Big Data stores more accessible (see Han et al. [12] for a comprehensive survey and an up-to-date list of NoSQL databases).
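As an illustration of the reduced effort, the word count sketched above collapses into a single declarative statement in Hive's SQL dialect. The sketch below, our own example rather than the paper's, submits such a query from Python through the Hive command-line interface; the docs table (a single STRING column named line) is hypothetical, and a local Hive installation is assumed.

import subprocess

# Word count as one declarative HiveQL statement; Hive compiles the query
# into MapReduce jobs behind the scenes. The 'docs' table is hypothetical.
# The doubled backslashes produce '\\s+' in the query text, which Hive in
# turn unescapes to the regular expression '\s+' (split on whitespace).
query = """
SELECT word, COUNT(*) AS total
FROM (SELECT explode(split(line, '\\\\s+')) AS word FROM docs) words
GROUP BY word
"""

# 'hive -e' executes a query string passed directly from the shell.
subprocess.run(["hive", "-e", query], check=True)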
C. Low-latency Query Processing

For certain use cases (e.g., interactive visual analytics), it may be desirable to analyze data in near real-time (i.e., where a query lasts from seconds up to a minute). Batch processing is simply too slow for this type of data analysis, and modified query engines have been devised that reduce computation time by up to 100 times [13]. The improved responsiveness is achieved by performing computation in memory [14] or by mapping SQL-like queries onto a columnar data layout through multi-level execution trees, without translating them into MapReduce jobs [15]. Although conceptually similar, low-latency querying is not intended to replace batch and query processing within the Hadoop framework but to complement them: it is used to analyze the outputs of a traditional Hadoop job or to prototype computations. Tools such as Cloudera Impala [16] and Apache Drill [17] implement these paradigms and are themselves based on Google's internal Dremel system [15]. Apache Drill also supports data sources other than HDFS, namely NoSQL databases, while Impala can query data stored in HBase.

D. Continuous Processing

Complementing the real-time capabilities of low-latency queries is continuous data, or data streams. For applications such as online machine learning or real-time applications, there is a need to process unbounded streams of data and convert them into a desired form. Projects such as Apache Storm [18], Apache S4 [19], and Apache Samza [20], or managed platform solutions such as AWS Kinesis [21], address this challenge by providing programmable interfaces for operating on countless tuples, or events, of incoming data. These applications allow one to define operations on tuples that transform incoming data as needed (e.g., transform raw sensor data into a precursor for emergency situations, or stock market data into trading trends). Individual transformations can be composed into a topology, which becomes an execution workflow capable of addressing arbitrarily complex transformations. It is worth noting that, unlike MapReduce jobs and queries that have a start and an end, a data stream runs continuously.

Underlying the data stream analysis layer is an effective data delivery system. Projects such as Apache Kafka [22] or Apache Flume [23] act as messengers that provide a robust and scalable mechanism for data to be published and consumed by a data stream framework.
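A minimal sketch of this pattern, assuming a local Kafka broker and the third-party kafka-python client (the topic names and the toy transformation are our own illustration): a consumer reads an unbounded stream of events and republishes a transformed tuple for each one, which is the elementary operation that a Storm-style topology composes into a workflow.

from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

# Continuously transform an unbounded stream of raw sensor readings.
# Unlike a MapReduce job or a query, this loop has no natural end; it
# runs for as long as events keep arriving. All names are illustrative.
consumer = KafkaConsumer("raw-sensor-data", bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for message in consumer:                          # blocks, yielding events as they arrive
    celsius = float(message.value.decode("utf-8"))  # raw tuple: a temperature reading
    fahrenheit = celsius * 9 / 5 + 32               # a toy per-tuple transformation
    producer.send("transformed-sensor-data", str(fahrenheit).encode("utf-8"))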
E. Crossing the Trends

In addition to the projects that function in the context of a single trend, there are efforts that sit at the intersection of multiple trends. Apache Spark [24] is an example of such a project; it supports low-latency queries as well as data streams. The project implements a distributed memory abstraction, called Resilient Distributed Datasets, that allows computations to be performed in memory on large clusters in a fault-tolerant manner [14]. Similarly, the Stratosphere project [25] builds on the work brought about by Spark by also implementing an in-memory alternative to traditional MapReduce. Unlike Spark with its Resilient Distributed Datasets implementation, Stratosphere implements a programming model called Parallel Contracts as part of its execution engine, Nephele [26]. Parallel Contracts extend the MapReduce paradigm with additional data operators, for example join, union, and iterate. In combination with the map and reduce operators, Stratosphere allows complex data-flow graphs to be composed into more comprehensive data workflows.

The data processing tools described above rely on resources provisioned and managed by cluster resource managers, such as Apache YARN [27] or Apache Mesos [28]. These resource managers dynamically allocate the necessary resources for a specific tool via resource isolation, which allows Hadoop, MPI, and other environments to readily utilize the same infrastructure. This increases infrastructure flexibility because the same set of physical resources is easily repurposed based on need.

F. Domain-specific Applications and Libraries

The described trends identify a layer of data infrastructure. Alone, this layer acts as a set of engines suitable for handling different types of stored data. On top of these engines sit domain-specific libraries and applications that provide higher-level interfaces for interacting with the engines. This layered structure is visualized in Fig. 1, while Table I provides a reasonably comprehensive coverage of the currently available applications and libraries, grouped by their function and intended here as a user reference.

TABLE I. FUNCTIONAL CLASSIFICATION OF EXISTING HIGHER-LEVEL BIG DATA APPLICATIONS AND LIBRARIES

High-level programming abstractions
- Apache Pig, Apache DataFu, Cascading Lingual, Cascalog: provide a higher-level interface, for example SQL, for composing Hadoop data analysis jobs.
- Shark (for Spark), Trident (for Storm), Meteor, Sopremo, and PonIC (for Stratosphere): facilitate easier and more efficient programming of complex data analysis tasks in a batch, iterative, and real-time manner.

Big Data-aware machine-learning toolkits
- Apache Mahout (for Hadoop), MLlib (for Spark), Cascading Pattern, GraphLab framework, Yahoo SAMOA: allow machine learning algorithms to be more easily utilized in the context of the MapReduce paradigm, individually tailored for different types of input data and/or tasks.

Graph systems
- Apache Giraph (for Hadoop), Bagel (for Spark), Stratosphere Spargel, GraphX (for Spark), Pegasus, Aurelius Faunus, GraphLab PowerGraph: a range of tools covering generic graph-based computations, complex networks, and interactive graph computations.
- Stinger, Neo4j, Aurelius Titan: storage facilities for graph-based tools and data structures.

Data ingestion and scheduling systems
- Apache Sqoop, Apache Chukwa, Apache Flume: act as orchestrator frameworks by facilitating bulk data transfers between Hadoop and structured data stores, log collection, or data streams into central data stores.
- Apache Falcon, Apache Oozie: handle data management and workflow scheduling, including data discovery, process orchestration, and lifecycle management.

Systems management solutions
- Apache Hue: a web user interface providing browsers for HDFS files, MapReduce jobs, and a range of higher-level applications (e.g., HBase, Hive, Pig, Sqoop).
- Apache Ambari, Apache Helix, Apache Whirr, Cask Coopr: cluster managers used for provisioning, managing, and monitoring applications, services, and cloud resources.

Benchmarking and testing applications
- Berkeley Big Data benchmark, BigBench, BigDataBench, Big Data Top 100, Apache Bigtop: used for statistical workload collection and generation, or for testing Hadoop-related projects.
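To give a feel for the in-memory model behind Spark (Section II.E) and the higher-level interfaces cataloged in Table I, the following PySpark sketch repeats the word count over a Resilient Distributed Dataset and caches it in cluster memory for repeated queries. The input path is a placeholder and a local Spark installation is assumed.

from pyspark import SparkContext

# RDDs keep intermediate results in cluster memory, which is what enables
# Spark's low-latency and iterative workloads. 'input.txt' is a placeholder.
sc = SparkContext("local[*]", "rdd-sketch")

counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

counts.cache()  # pin the dataset in memory so follow-up queries avoid re-reading disk
print(counts.takeOrdered(10, key=lambda kv: -kv[1]))  # ten most frequent words
sc.stop()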
III. STACK DEPLOYMENT DEPENDENCY GRAPH

To make them useful for complex analyses, the available technologies often need to be assembled into composite software stacks. The schematic provided in Fig. 1 depicts such an abstract stack, mapping the technologies discussed thus far onto the appropriate stack layers. The Cloud offers an opportune platform for quickly and flexibly composing the available technologies into the exact stack required. Doing so, however, requires a detailed understanding of the tools' functions and deployment requirements. The previous section describes the tools' functions, while this section provides insight into their deployment requirements.

Fig. 1. A layered view of an abstract Big Data analysis stack. Shaded areas indicate functional overlap between the trends. (The layers span Cloud resources; cluster resource management and HDFS; processing engines for the batch, query, low-latency, and continuous trends; data in the form of silos, DBMS systems, streams, and (re)structured data; and applications such as programming abstractions, machine learning and graph toolkits, system managers, schedulers, and benchmarks.)

Discovering tool deployment requirements is, unfortunately, not a clear-cut task. Fig. 2 demonstrates the complexity of this effort by depicting a deployment interdependency graph for Big Data tools. This graph has been constructed by manually examining the available Big Data tools and technologies and identifying explicit links between them. Constructing such a graph for an even more comprehensive set of tools in an automatic or semi-automatic manner would itself be a Big Data analysis problem.

The interdependency graph represents an ordering mechanism among the tools when deploying them. For example, if one wants to use the Pig Latin language, they need the Pig runtime environment, which in turn requires Hadoop, which in turn requires either YARN (and consequently HDFS) or HBase and HDFS. Along with the absolute requirements on dependent tools, some tools also have optional dependencies (indicated by faint arrows). Optional dependencies either enhance or modify the functionality of a tool to make it suitable for more tasks. Pig can, for example, utilize Oozie as a workflow scheduling engine to enhance the execution of workflows run via Pig.

In addition to providing an effective overview of the current Big Data tools' dependencies, the network provides insights into the correlation of the available tools. Inspecting the network, several locus points are visible (indicated by the size of a node). The size of a node is proportional to the number of incoming dependencies the node (i.e., tool) has, and thus the largest nodes correspond to a set of core tools required most often for functional Big Data stacks. Most notably, these are HDFS and the cluster resource managers (e.g., YARN), which act as common denominators in the stack. Color-based locus points (e.g., Hadoop, Spark, Stratosphere) represent alternative technologies with comparable functionality; the choice of technology in this case is based on usage and is rooted in the type of input data being processed. The leaf nodes in the network represent unique or niche tools that fill a specific role. Also visible from the graph is that the majority of tools favor YARN as a central resource manager. Building on top of YARN are either tools that provide user-facing functions (i.e., leaf nodes) or a class of tools that act as higher-level interfaces to the resources for other tools (e.g., Spark, Hadoop). Tools linking to those form groups that cover a range of domains (e.g., Spark-based tools cover a number of domains and all link to Spark). This is in contrast to the expectation that such tools would focus only on a single problem domain. Hence, color is used throughout the graph to indicate the notion of a similar problem domain for the tools; for example, Giraph and GraphX are alternative tools for solving similar problems.
Fig. 2. The Big Data tools and technologies graph, with links indicating functional deployment interdependencies. Dark lines imply an absolute requirement while faint lines imply an optional requirement; whether to fill an optional requirement depends on the desired functionality of the final stack, so if the functionality is desired, the dependency needs to be filled. Colors of nodes indicate functional similarity of the tools, and the size of a node indicates the degree of dependence other tools have on the given tool.
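As a sketch of how the interdependency graph can drive deployment ordering, the following fragment encodes the Pig example above as a dependency map (taking the YARN branch) and derives a valid bottom-up install order via topological sort; the four-node edge set is a tiny illustrative slice of Fig. 2, not the full graph.

from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each tool maps to the set of tools it depends on, mirroring the dark
# (absolute) arrows in Fig. 2 for the Pig example in the text.
requires = {
    "Pig":    {"Hadoop"},
    "Hadoop": {"YARN"},
    "YARN":   {"HDFS"},
    "HDFS":   set(),
}

# static_order() lists every tool after all of its dependencies, i.e., a
# valid deployment order for the stack.
print(list(TopologicalSorter(requires).static_order()))
# -> ['HDFS', 'YARN', 'Hadoop', 'Pig']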
IV. CONCLUSIONS

The Big Data field was instigated only a few years ago by a novel algorithm parallelization model, MapReduce, and an open source implementation of the same, Hadoop. It has quickly evolved into one of the major movements driving today's research and economy. Fueled by the many V's of data, the Big Data field has exploded with a myriad of options, ranging from technical solutions and commercial offerings to conceptual best practices. Trying to adopt or leverage the field of Big Data feels like walking through a minefield of terms, unclear or even conflicting statements, and implicitly defined requirements. Simultaneously, the influx of data is mandating the use of the available technologies in a wide spectrum of analyses: from financial stock exchanges, to monitoring consumer sentiment, to analyzing the state of running systems.

In response to this turbulent state of the Big Data landscape, in this paper we summarized the current technological state of the Big Data space. We have done this by compiling a catalog of the most recognized and notable Big Data tools and technologies and grouping them based on their function. Next, based on this classification of the real-world state, we devised a graph of their deployment interdependencies. This graph allows different layers of the Big Data stack to be defined, which in turn enables dependencies between individual tools to be identified. Having the ability to identify the layers and the dependencies between individual tools demystifies much of the Big Data landscape and enables functional systems composed of multiple tools to be more easily built.

We believe that the focus of future work in the space of Big Data should be on democratizing access to the available tools by offering functional solutions rather than independent technologies. This will foster the development of tuned workflows and pipelines, and thus tangible value, instead of mere capacity. As part of ongoing work (e.g., [4]), we will be leveraging the information available in the devised graph to enable custom and autonomous deployment of Big Data stacks on Cloud resources.

ACKNOWLEDGMENT

This work was supported in part by an FP7-PEOPLE programme grant (AIS-DC).

REFERENCES

[1] M. D. Assunção, R. N. Calheiros, S. Bianchi, M. A. S. Netto, and R. Buyya, "Big data computing and clouds: Trends and future directions," J. Parallel Distrib. Comput., Aug.
[2] S. Pandey and S. Nepal, "Cloud Computing and Scientific Applications - Big Data, Scalable Analytics, and Beyond," Future Gener. Comput. Syst., vol. 29, no. 7.
[3] D. Talia, "Clouds for Scalable Big Data Analytics," Computer, vol. 46, no. 5, May 2013.
[4] L. Forer, T. Lipic, S. Schonherr, H. Weisensteiner, D. Davidovic, F. Kronenberg, and E. Afgan, "Delivering bioinformatics MapReduce applications in the cloud," in International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014.
[5] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Commun. ACM, vol. 51, no. 1, pp. 107-113, 2008.
[6] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST 2010), 2010.
[7] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A Not-so-foreign Language for Data Processing," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 2008.
[8] A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, "Hive - a petabyte scale data warehouse using Hadoop," in 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), 2010.
[9] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A Distributed Storage System for Structured Data," in OSDI, 2006.
[10] R. Cattell, "Scalable SQL and NoSQL data stores," ACM SIGMOD Record, vol. 39, no. 4, p. 12.
[11] S. Weber and C. Strauch, "NoSQL Databases," lecture notes, Stuttgart Media University, pp. 1-8.
[12] J. Han, E. Haihong, G. Le, and J. Du, "Survey on NoSQL database," in Proceedings of the International Conference on Pervasive Computing and Applications (ICPCA 2011), 2011.
[13] S. Shenker, I. Stoica, M. Zaharia, and R. Xin, "Shark: SQL and Rich Analytics at Scale," in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013.
[14] M. Zaharia, M. Chowdhury, T. Das, and A. Dave, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in NSDI'12: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, 2012.
[15] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, "Dremel: Interactive Analysis of Web-Scale Datasets," Proc. VLDB Endow., vol. 3, no. 1-2, 2010.
[16] Cloudera Impala. [Online]. Available:
[17] M. Hausenblas and J. Nadeau, "Apache Drill: Interactive Ad-Hoc Analysis at Scale," Jun. 2013.
[18] Storm: Distributed and fault-tolerant realtime computation. [Online]. Available:
[19] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, "S4: Distributed Stream Computing Platform," in 2010 IEEE International Conference on Data Mining Workshops, 2010.
[20] Apache Samza. [Online]. Available:
[21] Amazon Kinesis. [Online]. Available:
[22] J. Kreps, N. Narkhede, and J. Rao, "Kafka: A distributed messaging system for log processing," in Proceedings of NetDB, 2011.
[23] Apache Flume. [Online]. Available:
[24] Spark. [Online]. Available:
[25] Stratosphere Project. [Online]. Available:
[26] A. Alexandrov, M. Heimel, V. Markl, D. Battré, F. Hueske, E. Nijkamp, S. Ewen, O. Kao, and D. Warneke, "Massively parallel data analysis with PACTs on Nephele," Proc. VLDB Endow., vol. 3, no. 1-2, Sep. 2010.
[27] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler, "Apache Hadoop YARN: Yet Another Resource Negotiator," in ACM Symposium on Cloud Computing, 2013, p. 16.
[28] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica, "Mesos: A platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, 2011, p. 22.