Big Data Platforms: What's Next?
Three computer scientists from UC Irvine address the question "What's next for big data?" by summarizing the current state of the big data platform space and then describing ASTERIX, their next-generation big data management system.

By Vinayak R. Borkar, Michael J. Carey, and Chen Li

A wealth of digital information is being generated daily; this information has great potential value for many purposes if captured and aggregated effectively. In the past, data warehouses were largely an enterprise phenomenon, with large companies being unique in recording their day-to-day operations in databases, and in warehousing and analyzing historical data to improve their businesses. Today, organizations and researchers in a wide variety of areas are seeing tremendous potential value and insight to be gained by warehousing the emerging wealth of digital information, popularly referred to as big data, and making it available for analysis, querying, and other purposes [1]. Online businesses of all shapes and sizes are tracking customers' purchases, product searches, website interactions, and other information to increase the effectiveness of their marketing and customer service efforts. Governments and businesses are tracking the content of blogs and tweets to perform sentiment analysis. Public health organizations are monitoring news articles, tweets, and Web search trends to track the progress of epidemics. Social scientists are studying tweets and social networks to understand how information of various kinds spreads and how it can be more effectively utilized for the public good. It is no surprise that support for data-intensive computing, search, and information storage (a.k.a. big data analytics and management) is being touted as a critical challenge in today's computing landscape.

In this article we take a look at the current big data landscape, including its database origins, its more recent distributed-systems rebirth, and current industrial practices and related trends in the research world. We then ask the question "What's next?" and provide a very brief tour of what one particular project, our ASTERIX project, is doing to explore potential answers to this question (and why).

A BIT OF BIG DATA HISTORY

It is fair to say that the IT world has been facing big data challenges for over four decades; it's just that the definition of "big" has been changing. In the 1970s, big meant megabytes; over time, big grew to gigabytes and then to terabytes. Today, the IT notion of big has reached the petabyte range for conventional, high-end data warehouses, and exabytes are presumably waiting in the wings. In the world of relational database systems, the need to scale databases to data volumes beyond the storage and/or processing capabilities of a single large computer system gave birth to a class of parallel database management systems known as shared-nothing parallel database systems [2]. As the name suggests, these systems run on networked clusters of computers, each with their own processors, memories, and disks. Data is spread over the cluster based on a partitioning strategy (usually hash partitioning, but sometimes range or random partitioning), and queries are processed by employing parallel, hash-based divide-and-conquer techniques. A first generation of systems appeared in the 1980s, with pioneering prototypes from the University of Wisconsin and the University of Tokyo, and the first commercial offering coming from Teradata Corporation.
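To make the divide-and-conquer idea concrete, the following minimal Python sketch (illustrative only; the node count, record format, and helper names are assumptions, not taken from any particular system) hash-partitions records across nodes and computes a parallel aggregate by merging per-node partial results:

    from collections import defaultdict

    NUM_NODES = 4  # assumed cluster size, for illustration only

    def hash_partition(records, key):
        """Route each record to a node by hashing its partitioning key."""
        partitions = defaultdict(list)
        for rec in records:
            partitions[hash(rec[key]) % NUM_NODES].append(rec)
        return partitions

    def local_count(partition, key):
        """Each node aggregates only its own partition, in parallel."""
        counts = defaultdict(int)
        for rec in partition:
            counts[rec[key]] += 1
        return counts

    def parallel_count(records, key):
        """Divide (hash partition), conquer (local counts), merge partials."""
        totals = defaultdict(int)
        for partition in hash_partition(records, key).values():
            for k, c in local_count(partition, key).items():
                totals[k] += c
        return dict(totals)

    orders = [{"customer": "alice"}, {"customer": "bob"}, {"customer": "alice"}]
    print(parallel_count(orders, "customer"))  # -> {'alice': 2, 'bob': 1}

Because the partitioning key and the grouping key coincide here, each group lives entirely on one node and the merge step is trivial; shared-nothing systems exploit exactly this property when choosing partitioning strategies.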
The past decade has seen the emergence of a second major wave of these systems, with a number of startups delivering new parallel database systems that were then swallowed up through acquisitions by the industry's major hardware and software vendors. Because a high-level, declarative language (SQL) fronts relational databases, users of parallel database systems have been shielded from the complexities of parallel programming. As a result, until quite recently, these systems have arguably been the most successful utilization of parallel computing.

During the latter 1990s, while the database world was admiring its finished work on parallel databases and major database software vendors were busy commercializing the results, the distributed systems world began facing its own set of big data challenges. The rapid growth of the World Wide Web and the resulting need to index and query its mushrooming content created big data challenges for search companies such as Inktomi, Yahoo, and Google. The processing needs in the search world were quite different, however, and SQL was not the answer, though shared-nothing clusters once again emerged as the hardware platform of choice. Google responded to these challenges by developing the Google File System (GFS), offering a familiar byte-stream-based file view of data that is randomly partitioned over hundreds or even thousands of nodes in a cluster [3]. GFS was then coupled with a programming model, MapReduce, to enable programmers to process big data by writing two user-defined functions, map and reduce [4]. The MapReduce framework applied these functions in parallel to individual data instances (map) in GFS files and to sorted groups of instances that share a common key (reduce), similar to the partitioned parallelism used in shared-nothing parallel database systems. Yahoo and other big Web companies such as Facebook created an Apache open-source version of Google's big data stack, yielding the now highly popular Hadoop platform with its associated HDFS storage layer.
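As a rough illustration of the programming model (a single-process Python sketch of the map/reduce contract and the framework's sort/group step, not Hadoop's actual API), here is the classic word-count example:

    from itertools import groupby
    from operator import itemgetter

    def map_fn(_, line):
        # Map: emit a (word, 1) pair for every word in an input record.
        for word in line.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Reduce: sum the 1s for each sorted group of pairs sharing a key.
        yield (word, sum(counts))

    def run_mapreduce(inputs, map_fn, reduce_fn):
        """Toy framework: apply map, then sort/group by key, then reduce."""
        pairs = [kv for i, rec in enumerate(inputs) for kv in map_fn(i, rec)]
        pairs.sort(key=itemgetter(0))  # the shuffle/sort phase
        out = []
        for key, group in groupby(pairs, key=itemgetter(0)):
            out.extend(reduce_fn(key, (v for _, v in group)))
        return out

    print(run_mapreduce(["big data is big"], map_fn, reduce_fn))
    # -> [('big', 2), ('data', 1), ('is', 1)]

The framework, not the user, handles partitioning, sorting, grouping, and fault tolerance; the user supplies only the two functions.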
Similar to the big data back-end storage and analysis dichotomy, the historical record for big data also has a front-end (i.e., user-facing) story worth noting. As enterprises in the 1980s and 1990s began automating more and more of their day-to-day operations using databases, the database world had to scale up its online transaction processing (OLTP) systems as well as its data warehouses. Companies such as Tandem Computers responded with fault-tolerant, cluster-based SQL systems. Similarly, but later, in the distributed systems world, large Web companies were driven by very expansive user bases (up to tens or even hundreds of millions of Web users) to find solutions that could achieve very fast, simple lookups and updates to large, keyed data sets such as collections of user profiles. Monolithic SQL databases built for OLTP were rejected as being too expensive, too complex, and/or not fast enough, and today's NoSQL movement was born [5]. Again, companies such as Google and Amazon developed their own answers (BigTable and Dynamo, respectively) to meet this set of needs, and again, the Apache open-source community created corresponding clones (HBase and Cassandra, two of today's most popular and scalable key-value stores).

TODAY'S BIG DATA PLATFORM(S)

Hadoop and HDFS have grown to become the dominant platform for big data analytics at large Web companies as well as in less traditional corners of traditional enterprises (e.g., for clickstream and log analyses). At the same time, data analysts have grown tired of the low-level MapReduce programming model, now choosing instead from among a handful of high-level declarative languages and frameworks that allow data analyses to be expressed much more easily and written and debugged much more quickly. These languages include Hive from Facebook (a variant of SQL) and Pig from Yahoo (roughly, a functional variant of the relational algebra). Tasks expressed in these languages are compiled down into a series of MapReduce jobs for execution on Hadoop clusters. Looking at workloads on real clusters, it has been reported that well over 60 percent of Yahoo's Hadoop jobs and more than 90 percent of Facebook's jobs now come from these higher-level languages rather than hand-written MapReduce jobs. MapReduce is essentially being relegated to the role of a big data runtime for higher-level, declarative data languages (which are not so very different from SQL). Given this fact, it is interesting to analyze the pros and cons of MapReduce in this role as compared to more traditional parallel SQL runtime systems [6].

Important pros of Hadoop compared with parallel SQL systems include:

1. Open-source availability versus expensive software licenses.

2. Multiple non-monolithic layers and components versus having only a top-level query API through which to access the data.

3. Support for access to file-based external data versus having to first design, load, and then index tables before being able to proceed.

4. Support for automatic and incremental forward recovery of jobs with failed tasks versus rolling long jobs back to their beginning to start all over again on failure.

5. Automatic data placement and rebalancing as data grows and machines come and go versus manual, DBA-driven data placement.

6. Support for replication and machine fail-over without operator intervention versus pager-carrying DBAs having to guide data recovery activities.

Some of the cons are:

1. Similar to early observations on why database systems' needs were not met by traditional operating systems and their file systems [7], layering a record-based abstraction on top of a very large byte-sequential file abstraction leads to an impedance mismatch.

2. There is no imaginable reason, other than "because it is already there," to layer a high-level data language on top of a two-unary-operator runtime like MapReduce, as it can be quite unnatural (e.g., for joins) and can lead to suboptimal performance.

3. With random data block partitioning, the only available parallel query processing strategy is to "spray and pray" every query to all blocks of the relevant data files.

4. A flexible, semi-structured [8], schema-less data model (based on keys and values) means that important information about the data being operated on is known only to the programs operating on it (so program maintenance troubles await).

5. Coupling front- and back-end big data platforms to cover the full big data lifecycle requires significant use of bubble gum, baling wire, and hand-written ETL-like scripts.

6. While Hadoop definitely scales, its computational model is quite heavy: It always sorts the data flowing between map and reduce, always persists temporary data to HDFS between jobs in a multi-job query plan, and so on [9]. (A small sketch of this multi-job overhead appears after this list.)
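To illustrate the last point, here is a toy Python sketch of a two-job query plan (the helper names are invented for illustration, not Hadoop's real interfaces) in which the first job's entire output is sorted and persisted to storage before the second job can start, even when pipelining would suffice:

    import json, os, tempfile

    def run_job(input_recs, map_fn, reduce_fn, out_path):
        """Stand-in for one MapReduce job: map, sort, group, reduce, persist."""
        pairs = sorted(kv for rec in input_recs for kv in map_fn(rec))  # mandatory sort
        groups = {}
        for k, v in pairs:
            groups.setdefault(k, []).append(v)
        with open(out_path, "w") as out:  # full result always hits stable storage
            for k in sorted(groups):
                for kv in reduce_fn(k, groups[k]):
                    out.write(json.dumps(kv) + "\n")

    def read_job_output(path):
        with open(path) as f:
            return [tuple(json.loads(line)) for line in f]

    # Job 1 counts words; job 2 groups words by their count. Job 1's output
    # is materialized to disk and re-read, rather than streamed to job 2.
    tmpdir = tempfile.mkdtemp()
    job1_out = os.path.join(tmpdir, "job1.txt")
    job2_out = os.path.join(tmpdir, "job2.txt")

    run_job(["big data is big"],
            lambda line: [(w, 1) for w in line.split()],
            lambda w, ones: [(w, sum(ones))],
            job1_out)

    run_job(read_job_output(job1_out),
            lambda kv: [(kv[1], kv[0])],
            lambda count, words: [(count, sorted(words))],
            job2_out)
    print(read_job_output(job2_out))  # -> [(1, ['data', 'is']), (2, ['big'])]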
WHAT'S NEXT?

Given the largely accidental nature of the current open-source Hadoop stack, and a need to store and manage, not simply analyze, data, we set out three years ago to design and implement a highly scalable platform for next-generation information storage, search, and analytics. We call the result a Big Data Management System (BDMS). By combining and extending ideas drawn from semi-structured data management, parallel databases, and first-generation data-intensive computing platforms (notably Hadoop/HDFS), ASTERIX aims to be able to access, ingest, store, index, query, analyze, and publish very large quantities of semi-structured data. The design of the ASTERIX BDMS is well suited to handling use cases that range all the way from rigid, relation-like data collections, whose structures are well understood and largely invariant, to flexible and more complex data, where little is planned ahead of time and the data instances are highly variant and self-describing. Figure 1 provides an overview of a shared-nothing ASTERIX cluster and how its various software components map to cluster nodes.

Figure 1. ASTERIX architecture. Each node of a shared-nothing ASTERIX cluster, connected by a high-speed network, runs an Asterix client interface, a metadata manager, the AQL compiler, the Hyracks dataflow layer, and dataset/feed storage built on an LSM-tree manager over local disks; data loads and feeds arrive from external sources (JSON, XML, ...), and data is published back out to external sources and applications.
The bottom-most layer of ASTERIX provides storage capabilities for ASTERIX-managed datasets based on LSM-tree indexing (chosen in order to support high data-ingestion rates). Further up the stack is our data-parallel runtime, Hyracks [9], which sits at roughly the same level as Hadoop does in implementations of Hive and Pig but supports a much more flexible computational model. The topmost layer of ASTERIX is a full parallel BDMS, complete with its own flexible data model (ADM) and query language (AQL) for describing, querying, and analyzing data. AQL is comparable to languages such as Pig or Hive; however, ASTERIX supports native storage and indexing of data in addition to having the ability to operate on externally resident data (e.g., data in HDFS).

Figure 2. Example AQL schemas, queries, and results.

The ASTERIX data model (ADM) borrowed data concepts from JSON and added more primitive types as well as type constructors from semi-structured and object databases. Figure 2(a) illustrates ADM by showing how it might be used to define a record type for modeling Twitter messages. The record type shown is an open type, meaning that its instances must conform to its specification but are allowed to contain arbitrary additional fields that can vary from one instance to another. The example also shows how ADM includes features such as optional fields with known types (e.g., sender-location), nested collections of primitive values (referred-topics), and nested records (user). More information about ADM can be found in a recent paper that provides an overview of the ASTERIX project [10]. Figure 2(d) shows an example of how a set of TweetMessageType instances would look.

Data storage in ASTERIX is based on the concept of a dataset, a declared collection of instances of a given type. ASTERIX supports both system-managed datasets, such as the TweetMessages dataset declared at the bottom of Figure 2(a), which are stored and managed by ASTERIX as partitioned, LSM-based B+ trees with optional secondary indexes, and external datasets, whose data can reside in existing HDFS files or in collections of files in the cluster nodes' local file systems.

The ASTERIX query language is called AQL, a declarative query language designed by borrowing the essence of the XQuery language, most notably its FLWOR expression constructs and composability, and then simplifying and adapting it to the types and data modeling constructs of ADM.
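Since the figure itself is not reproduced here, the sketch below shows what a single TweetMessageType instance might look like, rendered as a Python dictionary; the field names and values are illustrative guesses based on the features just described, not the actual contents of Figure 2:

    # A hypothetical ADM instance of TweetMessageType as a Python dict.
    # Field names are assumptions for illustration, not the real schema.
    tweet = {
        "tweetid": "1023",
        "user": {                           # nested record
            "screen-name": "dataFan42",
            "followers-count": 312,
        },
        "sender-location": (40.44, -80.00), # optional field with a known type
        "send-time": "2012-06-01T10:10:00",
        "referred-topics": {"verizon", "phone-plans"},  # nested collection
        "message-text": "My verizon phone plan is great!",
        "retweet-count": 2,                 # extra field allowed by the open type
    }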
Figure 2(b) illustrates AQL by example. This AQL query runs over the TweetMessages dataset to compute, for those tweets mentioning "verizon," the number of tweets that refer to each topic appearing in those tweets. Figure 2(c) shows the results of this example query when run against the sample data of Figure 2(d).

One of the primary application areas envisioned for ASTERIX is warehouse-based Web data integration [11]. As such, ASTERIX comes out of the box with a set of interesting capabilities that we feel are critical for such use cases. One is built-in support for a notion of data feeds to continuously ingest, pre-process, and persist data from external data sources such as Twitter. Another is support for fuzzy selection and fuzzy (a.k.a. set-similarity) joins, as Web data and searches are frequently riddled with typos and/or involve sets (e.g., of interests) that should be similar but not identical. Figure 2(e) illustrates a fuzzy join query in AQL. Yet another built-in capability is basic support for spatial data (e.g., locations of mobile users) and for queries whose predicates include spatial criteria.
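As a rough sketch of what such a set-similarity match computes (using Jaccard similarity, one common measure for fuzzy joins; the data and the 0.5 threshold are invented for illustration):

    def jaccard(a, b):
        """Jaccard similarity of two sets: |intersection| / |union|."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a or b else 1.0

    users = {"u1": {"hiking", "movies", "jazz"},
             "u2": {"hiking", "jazz", "cooking"}}
    ads = {"ad1": {"hiking", "jazz"}, "ad2": {"opera"}}

    # A fuzzy (set-similarity) join: pair users with ads whose interest
    # sets are similar enough, rather than exactly equal.
    matches = [(u, ad) for u, ui in users.items() for ad, ai in ads.items()
               if jaccard(ui, ai) >= 0.5]
    print(matches)  # -> [('u1', 'ad1'), ('u2', 'ad1')]

Evaluating such predicates efficiently in parallel is what ASTERIX's fuzzy join support addresses, with additional indexing support for fuzzy queries among the ongoing work described below.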
Figure 3. ASTERIX software stack. AQL (AsterixQL), HiveQL (via Hivesterix), Piglet, and other high-level-language compilers sit atop the Algebricks algebra layer, while Hadoop M/R jobs (via a compatibility layer), Pregelix jobs, IMRU jobs, and native Hyracks jobs all run on the Hyracks data-parallel platform.

Figure 3 shows the nature of the open-source ASTERIX software stack, which supports the ASTERIX system but also aims to address other big data requirements. To process queries such as the example from Figure 2(b), ASTERIX compiles each AQL query into an Algebricks algebraic program. This program is then optimized via algebraic rewrite rules that reorder the Algebricks operators and introduce partitioned parallelism for scalable execution, after which code generation translates the resulting physical query plan into a corresponding Hyracks job that computes the desired query result. The left-hand side of Figure 3 shows this layering. As also indicated in the figure, the Algebricks algebra layer is data-model-neutral and is therefore able to support other high-level data languages as well (such as a Hive port that we have built). The ASTERIX open-source stack also offers a compatibility layer for users with Hadoop jobs who wish to run them using our software, as well as several other experimental big data programming packages, including Pregelix, a Pregel-like layer that runs on Hyracks, and IMRU, an iterative map/reduce/update layer that targets large-scale machine-learning applications [12].

ASTERIX GOING FORWARD

Currently the ADM/AQL layer of ASTERIX can run parallel queries including lookups, large scans, parallel joins (both regular and fuzzy), and parallel aggregates. Data is stored natively in partitioned B+ trees and can be indexed via secondary indexes such as B+ trees, R-trees, or inverted indexes. The system's external data access and data feed features are also operational. We plan to offer a first open-source release of ASTERIX in late 2012, and we are seeking a few early partners who would like to run ASTERIX on their favorite big data problems. Our ongoing work includes preparing the code base for an initial public release, completing our initial transaction story, adding additional indexing support for fuzzy queries, and providing a key-value API for applications that prefer a NoSQL-style API over a more general query-based one. More information about the project and its current code base can be found on our project website.

It is worth pointing out that ASTERIX is a counter-cultural project in several ways. First, rather than tweaking Hadoop or other existing packages, we set out to explore the big data platform space from the ground up. We are learning a great deal from doing so, as it is surprising just how many interesting engineering and research problems are still lurking in places related to things that have already been done before. Second, rather than building a highly specialized system to later be glued into a patchwork of such systems, we are exploring the feasibility of a "one size fits a bunch" system that addresses a broader set of needs (e.g., by offering data storage and indexing as well as support for external data analysis, short- and medium-sized queries as well as large batch jobs, and a key-value API as well as a query-based one).

Acknowledgments
This work has been supported in part by a grant from the UC Discovery program with matching support.

References
[1] Lohr, S. The age of big data. The New York Times, February 12, 2012.
[2] DeWitt, D. and Gray, J. Parallel database systems: The future of high performance database systems. Communications of the ACM 35, 6 (1992).
[3] Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (2003).
[4] Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (2004).
[5] Cattell, R. Scalable SQL and NoSQL data stores. SIGMOD Record 39, 4 (2010).
[6] Borkar, V., Carey, M., and Li, C. Inside "Big Data management": Ogres, onions, or parfaits? In Proceedings of the 15th International Conference on Extending Database Technology (2012).
[7] Stonebraker, M. Operating system support for database management. Communications of the ACM 24, 7 (1981).
[8] Abiteboul, S., Buneman, P., and Suciu, D. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 1999.
[9] Borkar, V., Carey, M., Grover, R., Onose, N., and Vernica, R. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proceedings of the 2011 IEEE International Conference on Data Engineering.
[10] Behm, A., Borkar, V., Carey, M., et al. ASTERIX: Towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases 29, 3 (2011).
[11] Alsubaiee, S., Behm, A., Grover, R., Vernica, R., et al. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2012).

Biographies
Vinayak R. Borkar is a Ph.D. student in the Bren School of Information and Computer Sciences at UC Irvine. He has an M.S. in computer science and engineering from IIT Bombay.
Michael J. Carey is a professor in the Bren School of Information and Computer Sciences at UC Irvine. He has a Ph.D. in computer science from the University of California, Berkeley.
Chen Li is an associate professor in the Bren School of Information and Computer Sciences at UC Irvine. He has a Ph.D. in computer science from Stanford University.