Business Intelligence and Column-Oriented Databases
|
|
|
- Rodger Black
- 10 years ago
- Views:
Transcription
1 Page 12 of 344 Business Intelligence and Column-Oriented Databases Kornelije Rabuzin Faculty of Organization and Informatics University of Zagreb Pavlinska 2, Nikola Modrušan NTH Mobile, Međimurska 28, Abstract. In recent years, NoSQL databases are becoming more and more popular. We distinguish several different types of such databases and columnoriented databases are very important in this context, for sure. The purpose of this paper is to see how column-oriented databases can be used for data warehousing purposes and what the benefits of such an approach are. HBase as a data management system is used to store the data warehouse in a column-oriented format. Furthermore, we discuss how star schema can be modelled in HBase. Moreover, we test the performances that such a solution can provide and we compare them to relational database management system Microsoft SQL Server. Keywords. Business Intelligence, Data Warehouse, Column-Oriented Database, Big Data, NoSQL 1 Introduction In the past few years NoSQL databases are becoming more and more popular. Some features that are crucial for the relational data model do not seem to be appropriate for today's applications. Applications that are built today require much better response time, they do not need a strict ACID (Atomicity, Consistency, Isolation, Durability) [18], versioning is actually a desirable feature, the volumes of data are becoming very large etc. Since the relational data model does not cope with these issues very well, NoSQL databases are here to fill in the gap. When we talk about NoSQL databases, we have to have in mind that four different types can be distinguished; key value, document-oriented, columnoriented and graph databases. Document-oriented databases store data in documents, but documents are not in a form that we are used to. Document does not mean a.doc (or similar) file, but usually JSON is used as a data format because it can be easily read and interpreted [15]. In JSON, objects are crucial. An object is in fact a nested string/value structure (Fig. 1). A value can be another object, a list of values etc. One of the most popular document-oriented database systems is MongoDB. Figure 1. JSON object [15] Graph databases, on the other hand, rely on some segment of the graph theory. They are good to represent nodes (entities) and relationships among them. This is especially suitable to analyze social networks and some other scenarios. Key value databases are important as well for a certain key you store (assign) a certain value. Document-oriented databases can be treated as key value as long as you know the document id. Here we skip the details as it would take too much time to discuss different systems [21]. Column-oriented databases such as HBase are exactly that, column oriented. This means that column values are stored together (instead of rows). Since queries that are used to extract data usually retrieve only some columns from a table (queries such as SELECT * FROM table should be avoided), columnoriented solutions should be much faster. This paper explores how column-oriented databases (HBase in our case) can be used for data warehousing purposes and what are the benefits of such an approach. The paper is organized as follows. First section explores HBase as a column-oriented database representative. The second section presents several major findings in the field of data warehousing for NoSQL databases. The third section discusses an example scenario that is then implemented as well as some experimental results. Finally, the conclusion is presented. 2 HBase storage system Data warehouses have been used for a long time. Their main purpose is to integrate data from heterogeneous sources in order to build a unified and
2 Page 13 of 344 cleaned structure called a data warehouse. Many good sources can be found on the subject of facts and dimensions (for example, [16], [13], [4], [19] or [2]) and we will not go into detail here. Collection of data can be stored by means of different storage models. Standard data warehouses are mostly based on the relational data model. When we talk about data warehouses, we have to have in mind that there is an open community that focuses on free data warehouse technologies with performances that are even better than the technologies in commercial use [3]. So which technology should be used and when? Namely, NoSQL database systems have certain advantages in terms of preprocessing or exploration of raw unstructured data, flexible programming languages running in parallel, analysis of provisional data etc. [14]. Companies such as Yahoo, Google and Facebook are collecting enormous amounts of data and relational (traditional) data storage models are just not sufficient in such scenarios. In order to avoid issues, these companies started to develop their own technologies to cope with large amounts of data, for example Hadoop platform ([8], [22]). Hadoop has two types of data storage: Hive and HBase. The Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.. [23] Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al [8]. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. [24]. As we said earlier, HBase is a column-oriented NoSQL representative. Data is stored in tables that have rows and columns, but columns are grouped in column families (one column family has more columns). It is important to notice that one cell can have more values over time and versioning is supported. One can look at the following picture (Fig. 3) to see column families, columns, rows and values: Since column values are stored together, it is easy to add new columns. For the same key, there are more rows with different values over time. In order to uniquely identify the row, we need to specify the version as well. HBase has some other advantages, as well as some drawbacks. Data does not have to be joined as we are used to do in SQL because data are denormalized. However, we do not have a system catalogue that one could use to find out interesting data about tables and other objects (we have to take care of it by ourselves). 3 Previous research Converting data warehouse star schema to Hadoop Hive tables using Sqoop in order to test advantages and disadvantages was described in [7], and is also presented in Figure 2. below. Sqoop represents special developed tools that can migrate data to Hadoop. It allows easy import and export of data from structured data stores: relation database, enterprise data warehouse and NoSQL Data store [7]. SAS is business analytics software. SAS data collection was used as a primary data source, from which the authors extracted data as a flat file without duplicates and stored data in star schema model. Both types are stored in the Hive data management system by the earlier mentioned Sqoop tool. Authors made a query performance and data compression test, which resulted with a difference in the query performance and data compress ratio. Our idea was to use a similar approach, but instead of Hive we used the HBase data management system. Furthermore, we tried to implement a star schema in HBase table model. Figure 2. The Data load process [12] Figure 3. HBase table [9] The type of database which an organization require depends on its' needs [5]. Here we reached the discussion of advantages and limitations of columnand row-oriented databases. Row-oriented database is more efficient than column-oriented one when we have to select many columns of a single row, or when
3 Page 14 of 344 a row can be selected with a single disk seek. Also, row-oriented databases are efficient when we have to insert a row within all column data at the same time [5], [11], [17], [6]. Otherwise, column-oriented databases are more efficient with the aggregation of many rows, but with a smaller subset of column data. Here we have a big plus on the analytics performance regarding the same type of value stored at one place, but under a limited number of situations CPU performance is not as good [5], [11], [17], [6]. Authors in [1] succeed to simulate a column store in a row store system via vertical partitioning and indexonly plans, but simulation didn t yield a good performance because of high tuple reconstruction costs. 4 Implementation 4.1 Relational model of SMS messaging In this section we want to present a small data warehouse scenario that is used for SMS message analysis. Nowadays, there are many services, such as parking, voting, content download, that are using SMS messages. Third world countries use old mobile technologies (SMS messages),, so there is still a need to build such data warehouses in order to serve the management with fresh and accurate data. Management wants to find out how many messages each service has per hour, which operator is the most represented, what is the amount of billed messages, what are the top five services by billed messages etc. The star schema is given below. We have one fact and six dimension tables (Fig 4.). This data warehouse was implemented in MS SQL Server. Thus, each SMS message is stored in the central table facttransaction. A fact table is connected with dimension tables and contains a foreign key towards each of the dimensions. In the Country table we can find data about countries (name, currency, Vat ); the User table contains data about the user of the service (mobile number). Data about the mobile operator is included in the Operator table (network code, name, ) and Service table includes details like Name, portfolio name etc. Finally DateH dimension gives us information about the date and time a message was received and submitted. Figure 4. Data warehouse model 4.2 HBase implementation of SMS messaging There are a few different ways to store data and make aggregations in HBase [10]: 1. Making HBase tables as external Hive tables - data can be accessed via HiveQL language here, but that is an unwanted additional step; 2. Writing MapReduce jobs and working with HBase data in the HDFS; 3. Using applications such as Crux, Phoenix, etc. in order to create HBase tables together with metadata needed for query processing. In this paper, we applied the third approach, because we used HBase as a standalone installation, without a Hadoop file system. Moreover, we wanted to make analysis on HBase tables (not on Hive), so copying tables from one format to another was not applicable. Results that we got have been compared with SQL Server 2012 standalone installation. Crux is a depreciated application. Since Apache took Phoenix in its project/application pool, we used Phoenix as the main application for querying HBase tables (Fig. 5). As we noted in the text above, HBase stored data without metadata, so there were some problems with aggregation and grouping. Phoenix
4 Page 15 of 344 successfully avoids such problems and adds certain metadata. We wanted to test both technologies with big data sets. When we inserted approx rows in HBase, there was already a big difference in the performance. Finally, we made a test with one million data rows. The data warehouse model was described above. We extracted and imported data to HBase tables by the means of a Phoenix Python script (psql.py) and we compared the performances for both solutions. Furthermore, we observed several scenarios of data model imported in HBase: 1. Denormalization of the entire data warehouse model in one flat table facts could be joined with dimensions and imported to a single HBase table. 2. Second possible approach is to import dimensions and then the fact table, in that specific order. Dimension tables can have only one column family and this one would look different for the fact table. 3. The third approach would be to import each entity (fact, dimension) in a separate HBase table and then join the tables on the HBase querying level. The second and the third option seem to be pretty complex for column-oriented databases. For example, joining tables in HBase and using relationships (foreign keys) for querying purposes, presents a problem with HBase querying framework. So we choose option one flat data warehouse file. Table 1. HBase vs. MS SQL Server HBase Queries Select id, TransactionSour ceid, IsDelivered, TransportCost from phoenix_tbl Select, ue) from phoenix_tbl group by select, ue) from phoenix_ tbl where isdelivered = 1 and like 'Im%' group by 5 Conclusion SQL Server Queries Select id, TransactionSourc eid, IsDelivered, TransportCost from dbo.fact Select name, ue) from dbo.fact f inner join dbo.dimcustome r dc on f.customerkey = dc.customerkey group by name Select name, ue) from dbo.fact f inner join dbo.dimcustome r dc on f.customerkey= dc.customerkey where isdelivered = 1 and name like 'Im%' group by name HBase exec. Time 11000ms, ms 1400ms, 24994ms 60.6ms, 1397ms SQL Server exec. time 116ms, ms 8ms, 243ms 2ms 36ms Figure 5. Phoenix query Experimental results are given in Table 1. We created three different queries (some of them use GROUP BY clause) and we measured the execution time on both systems. One can see that MS SQL Server has much better performances than HBase: In this paper we explored how NoSQL columnoriented representative called HBase can be used to implement the data warehouse. Although HBase is a column-oriented system, we expected that performances would be better, but that was not the case. SQL Server 2012 confirmed to be a robust solution with very good performances. Although similar attempts are made by other authors, they didn t use HBase system for data warehouse purposes; instead, they use Hive. In the implementation phase, many problems occurred. Some tools that we tried didn't work and some were even not available (any more). So we can say that technologies that are not mature do show some peculiarities and that things are not always straightforward. We also discussed how star schema can be modelled in HBase and the best approach turned out to be the flat data model, i.e, denormalization of the entire data warehouse. In the future, we would like to implement all tree approaches mentioned in the text above, in order to compare them. Moreover, using the column families in HBase could be an interesting subject for study as well.
5 Page 16 of References [1] Abadi, D. J; Madden, S; Hachem, N. Column-Stores vs. Row-Stores: How Different Are They Really?, ACM, [2] Adamson, C. Mastering Data Warehouse Aggregates, Solutions for Star Schema Performance, USA: Wiley Publishing, [3] Awadallah, A; Graham, D. Hadoop and the Data Warehouse: When to Use Which, Cloudera, [4] Ballard, C; Herreman, D; Schau, D; Schau, R; Kim, E; Valencic, A. Data Modeling Techniques for Data Warehousing, IBM redbook, IBM Corporation, International Technical Support Organization, [5] Bhagat, V; Gopal, A. Comparative Study of Row and Column Oriented Database, Fifth International Conference on Emerging Trends in Engineering and Technology, [6] Bhatia, A; Patil, S. Column Oriented DBMS an Approach, International Journal of Computer Science & Communication Networks, [7] Bradley, C; Hollinshead, R; Kraus, S; Lefler, J; Taheri, R. Data Modeling Considerations in Hadoop and Hive, SAS Institute Inc., [8] Chang, F; Dean, J; Ghemawat, S; Hsieh, W. C. D; Wallach, A; Burrows, M; Chandra, T; Fikes, A; Gruber, R. E. Bigtable A Distributed Storage System for Structured Data, Google Research Publication, [9] Chaudhuri, S; Dayal, U. An Overview of Data Warehousing and OLAP Technology, ACM Sigmod Record, [10] Gruzman, D. Group by In HBase, downloaded: April 21 th [11] Harizopoulos, S; Liang, V; Abadi, D. J; Madden, S. Performance Tradeoffs in Read- Optimized Databases, VLDB, [13] Inmon, H. W. Building the Data Warehouse Third Edition, New York: John Wiley & Sons, Inc., [14] Intel Corporation. Extract, Transform, and Load Big Data with Apache Hadoop, Intel Corporation, [15] JSON, Introducing JSON, downloaded: April 20 th [16] Kimball, R; Caserta, J. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data, Indianapolis: Wiley Publishing, Inc., [17] Matei, G. Column-Oriented Databases, an Alternative for Analytical Environment, Database Systems Journal vol. I, [18] Medjahed, B; Ouzzani, M; Elmagarmid, A. K. Generalization of ACID Properties, Encyclopedia of Database Systems, [19] Ponniah, P. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, New York: John Wiley & Sons, Inc., [20] Prabhakar, A. Apache Sqoop: A Data Transfer Tool from Hadoop, Cloudera Inc., [21] Redmond, E; Wilson, J. R. Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement, Pragmatic Programmers, USA, [22] Russom, P. Integrating hadoop business intelligence datawarehousing, TDWI, [23] The Apache Software Foundation, Apache Hive TM, downloaded: April 25 th [24] The Apache Software Foundation, Welcome to Apache HBase, downloaded: April 25 th [12] Haque, A; Perkins, L. Distributed RDF Triple Store Using HBase and Hive, University of Texas at Austin, 2012.
Deductive Data Warehouses and Aggregate (Derived) Tables
Deductive Data Warehouses and Aggregate (Derived) Tables Kornelije Rabuzin, Mirko Malekovic, Mirko Cubrilo Faculty of Organization and Informatics University of Zagreb Varazdin, Croatia {kornelije.rabuzin,
American International Journal of Research in Science, Technology, Engineering & Mathematics
American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
BUSINESS INTELLIGENCE AND NOSQL DATABASES
INFORMATION SYSTEMS IN MANAGEMENT Information Systems in Management (2012) Vol. 1 (1) 25 37 BUSINESS INTELLIGENCE AND NOSQL DATABASES JERZY DUDA Department of Applied Computer Science, Faculty of Management,
Implement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
ITG Software Engineering
Introduction to Apache Hadoop Course ID: Page 1 Last Updated 12/15/2014 Introduction to Apache Hadoop Course Overview: This 5 day course introduces the student to the Hadoop architecture, file system,
Big Data and Hadoop with Components like Flume, Pig, Hive and Jaql
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 7, July 2014, pg.759
Big Data and Hadoop with components like Flume, Pig, Hive and Jaql
Abstract- Today data is increasing in volume, variety and velocity. To manage this data, we have to use databases with massively parallel software running on tens, hundreds, or more than thousands of servers.
You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.
What is this course about? This course is an overview of Big Data tools and technologies. It establishes a strong working knowledge of the concepts, techniques, and products associated with Big Data. Attendees
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
Review of Query Processing Techniques of Cloud Databases Ruchi Nanda Assistant Professor, IIS University Jaipur.
Suresh Gyan Vihar University Journal of Engineering & Technology (An International Bi Annual Journal) Vol. 1, Issue 2, 2015,pp.12-16 ISSN: 2395 0196 Review of Query Processing Techniques of Cloud Databases
RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG
1 RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG Background 2 Hive is a data warehouse system for Hadoop that facilitates
An Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
Lecture 10: HBase! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the
Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre
NoSQL systems: introduction and data models Riccardo Torlone Università Roma Tre Why NoSQL? In the last thirty years relational databases have been the default choice for serious data storage. An architect
A Novel Cloud Computing Data Fragmentation Service Design for Distributed Systems
A Novel Cloud Computing Data Fragmentation Service Design for Distributed Systems Ismail Hababeh School of Computer Engineering and Information Technology, German-Jordanian University Amman, Jordan Abstract-
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
Testing 3Vs (Volume, Variety and Velocity) of Big Data
Testing 3Vs (Volume, Variety and Velocity) of Big Data 1 A lot happens in the Digital World in 60 seconds 2 What is Big Data Big Data refers to data sets whose size is beyond the ability of commonly used
Using the column oriented NoSQL model for implementing big data warehouses
Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'15 469 Using the column oriented NoSQL model for implementing big data warehouses Khaled. Dehdouh 1, Fadila. Bentayeb 1, Omar. Boussaid 1, and Nadia
Can the Elephants Handle the NoSQL Onslaught?
Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented
SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford
SQL VS. NO-SQL Adapted Slides from Dr. Jennifer Widom from Stanford 55 Traditional Databases SQL = Traditional relational DBMS Hugely popular among data analysts Widely adopted for transaction systems
Cloud Scale Distributed Data Storage. Jürmo Mehine
Cloud Scale Distributed Data Storage Jürmo Mehine 2014 Outline Background Relational model Database scaling Keys, values and aggregates The NoSQL landscape Non-relational data models Key-value Document-oriented
REAL-TIME BIG DATA ANALYTICS
www.leanxcale.com [email protected] REAL-TIME BIG DATA ANALYTICS Blending Transactional and Analytical Processing Delivers Real-Time Big Data Analytics 2 ULTRA-SCALABLE FULL ACID FULL SQL DATABASE LeanXcale
Lecture Data Warehouse Systems
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores
Big Systems, Big Data
Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,
Oracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
Bringing Big Data into the Enterprise
Bringing Big Data into the Enterprise Overview When evaluating Big Data applications in enterprise computing, one often-asked question is how does Big Data compare to the Enterprise Data Warehouse (EDW)?
Big Data: Tools and Technologies in Big Data
Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can
European Archival Records and Knowledge Preservation Database Archiving in the E-ARK Project
European Archival Records and Knowledge Preservation Database Archiving in the E-ARK Project Janet Delve, University of Portsmouth Kuldar Aas, National Archives of Estonia Rainer Schmidt, Austrian Institute
Data Modeling for Big Data
Data Modeling for Big Data by Jinbao Zhu, Principal Software Engineer, and Allen Wang, Manager, Software Engineering, CA Technologies In the Internet era, the volume of data we deal with has grown to terabytes
An Introduction to Data Warehousing. An organization manages information in two dominant forms: operational systems of
An Introduction to Data Warehousing An organization manages information in two dominant forms: operational systems of record and data warehouses. Operational systems are designed to support online transaction
Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.
Big Data Technology ดร.ช ชาต หฤไชยะศ กด Choochart Haruechaiyasak, Ph.D. Speech and Audio Technology Laboratory (SPT) National Electronics and Computer Technology Center (NECTEC) National Science and Technology
Cloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
Slave. Master. Research Scholar, Bharathiar University
Volume 3, Issue 7, July 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper online at: www.ijarcsse.com Study on Basically, and Eventually
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
How to Enhance Traditional BI Architecture to Leverage Big Data
B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...
Using distributed technologies to analyze Big Data
Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/
Introduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# So What is Data Science?" Doing Data Science" Data Preparation" Roles" This Lecture" What is Data Science?" Data Science aims to derive knowledge!
Benchmark for OLAP on NoSQL Technologies
Benchmark for OLAP on NoSQL Technologies Comparing NoSQL Multidimensional Data Warehousing Solutions Max Chevalier 1, Mohammed El Malki 1,2, Arlind Kopliku 1, Olivier Teste 1, Ronan Tournier 1 1 - University
Big Data Technologies Compared June 2014
Big Data Technologies Compared June 2014 Agenda What is Big Data Big Data Technology Comparison Summary Other Big Data Technologies Questions 2 What is Big Data by Example The SKA Telescope is a new development
A Study on Big Data Integration with Data Warehouse
A Study on Big Data Integration with Data Warehouse T.K.Das 1 and Arati Mohapatro 2 1 (School of Information Technology & Engineering, VIT University, Vellore,India) 2 (Department of Computer Science,
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to
Big Data Analysis using Hadoop components like Flume, MapReduce, Pig and Hive
Big Data Analysis using Hadoop components like Flume, MapReduce, Pig and Hive E. Laxmi Lydia 1,Dr. M.Ben Swarup 2 1 Associate Professor, Department of Computer Science and Engineering, Vignan's Institute
Data Modeling Considerations in Hadoop and Hive
Technical Paper Data Modeling Considerations in and Hive Clark Bradley, Ralph Hollinshead, Scott Kraus, Jason Lefler, Roshan Taheri October 2013 Table of Contents Introduction... 2 Understanding HDFS
Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012
Big Data Buzzwords From A to Z By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012 Big Data Buzzwords Big data is one of the, well, biggest trends in IT today, and it has spawned a whole new generation
THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES
THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon [email protected] [email protected] XLDB
BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014
BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 Ralph Kimball Associates 2014 The Data Warehouse Mission Identify all possible enterprise data assets Select those assets
Big Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB
Overview of Databases On MacOS Karl Kuehn Automation Engineer RethinkDB Session Goals Introduce Database concepts Show example players Not Goals: Cover non-macos systems (Oracle) Teach you SQL Answer what
HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367
HBase A Comprehensive Introduction James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367 Overview Overview: History Began as project by Powerset to process massive
Big Data: Opportunities & Challenges, Myths & Truths 資 料 來 源 : 台 大 廖 世 偉 教 授 課 程 資 料
Big Data: Opportunities & Challenges, Myths & Truths 資 料 來 源 : 台 大 廖 世 偉 教 授 課 程 資 料 美 國 13 歲 學 生 用 Big Data 找 出 霸 淩 熱 點 Puri 架 設 網 站 Bullyvention, 藉 由 分 析 Twitter 上 找 出 提 到 跟 霸 凌 相 關 的 詞, 搭 配 地 理 位 置
Cleveland State University
Cleveland State University CIS 695 Big Data Processing and Data Analytics (3-0-3) 2016 Section 51 Class Nbr. 5493. Tues, Thur TBA Prerequisites: CIS 505 and CIS 530. CIS 612, CIS 660 Preferred. Instructor:
Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem:
Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Chapter 6 Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data
INFO 1500 Introduction to IT Fundamentals 5. Database Systems and Managing Data Resources Learning Objectives 1. Describe how the problems of managing data resources in a traditional file environment are
Data Mining in the Swamp
WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
White Paper: What You Need To Know About Hadoop
CTOlabs.com White Paper: What You Need To Know About Hadoop June 2011 A White Paper providing succinct information for the enterprise technologist. Inside: What is Hadoop, really? Issues the Hadoop stack
Data Warehouse design
Data Warehouse design Design of Enterprise Systems University of Pavia 10/12/2013 2h for the first; 2h for hadoop - 1- Table of Contents Big Data Overview Big Data DW & BI Big Data Market Hadoop & Mahout
CIO Guide How to Use Hadoop with Your SAP Software Landscape
SAP Solutions CIO Guide How to Use with Your SAP Software Landscape February 2013 Table of Contents 3 Executive Summary 4 Introduction and Scope 6 Big Data: A Definition A Conventional Disk-Based RDBMs
Application Development. A Paradigm Shift
Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
low-level storage structures e.g. partitions underpinning the warehouse logical table structures
DATA WAREHOUSE PHYSICAL DESIGN The physical design of a data warehouse specifies the: low-level storage structures e.g. partitions underpinning the warehouse logical table structures low-level structures
The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect
The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect IT Insight podcast This podcast belongs to the IT Insight series You can subscribe to the podcast through
Sentimental Analysis using Hadoop Phase 2: Week 2
Sentimental Analysis using Hadoop Phase 2: Week 2 MARKET / INDUSTRY, FUTURE SCOPE BY ANKUR UPRIT The key value type basically, uses a hash table in which there exists a unique key and a pointer to a particular
Reference Architecture, Requirements, Gaps, Roles
Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture
BIG DATA WEB ORGINATED TECHNOLOGY MEETS TELEVISION BHAVAN GHANDI, ADVANCED RESEARCH ENGINEER SANJEEV MISHRA, DISTINGUISHED ADVANCED RESEARCH ENGINEER
BIG DATA WEB ORGINATED TECHNOLOGY MEETS TELEVISION BHAVAN GHANDI, ADVANCED RESEARCH ENGINEER SANJEEV MISHRA, DISTINGUISHED ADVANCED RESEARCH ENGINEER TABLE OF CONTENTS INTRODUCTION WHAT IS BIG DATA?...
Tiber Solutions. Understanding the Current & Future Landscape of BI and Data Storage. Jim Hadley
Tiber Solutions Understanding the Current & Future Landscape of BI and Data Storage Jim Hadley Tiber Solutions Founded in 2005 to provide Business Intelligence / Data Warehousing / Big Data thought leadership
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84
Index A Amazon Web Services (AWS), 50, 58 Analytics engine, 21 22 Apache Kafka, 38, 131 Apache S4, 38, 131 Apache Sqoop, 37, 131 Appliance pattern, 104 105 Application architecture, big data analytics
Joining Cassandra. Luiz Fernando M. Schlindwein Computer Science Department University of Crete Heraklion, Greece [email protected].
Luiz Fernando M. Schlindwein Computer Science Department University of Crete Heraklion, Greece [email protected] Joining Cassandra Binjiang Tao Computer Science Department University of Crete Heraklion,
Optimizing the Performance of the Oracle BI Applications using Oracle Datawarehousing Features and Oracle DAC 10.1.3.4.1
Optimizing the Performance of the Oracle BI Applications using Oracle Datawarehousing Features and Oracle DAC 10.1.3.4.1 Mark Rittman, Director, Rittman Mead Consulting for Collaborate 09, Florida, USA,
Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges
Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges Prerita Gupta Research Scholar, DAV College, Chandigarh Dr. Harmunish Taneja Department of Computer Science and
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Benchmarking and Analysis of NoSQL Technologies
Benchmarking and Analysis of NoSQL Technologies Suman Kashyap 1, Shruti Zamwar 2, Tanvi Bhavsar 3, Snigdha Singh 4 1,2,3,4 Cummins College of Engineering for Women, Karvenagar, Pune 411052 Abstract The
Cleveland State University
Cleveland State University CIS 612 Modern Database Programming & Big Data Processing (3-0-3) Fall 2014 Section 50 Class Nbr. 2670. Tues, Thur 4:00 5:15 PM Prerequisites: CIS 505 and CIS 530. CIS 611 Preferred.
BUILDING OLAP TOOLS OVER LARGE DATABASES
BUILDING OLAP TOOLS OVER LARGE DATABASES Rui Oliveira, Jorge Bernardino ISEC Instituto Superior de Engenharia de Coimbra, Polytechnic Institute of Coimbra Quinta da Nora, Rua Pedro Nunes, P-3030-199 Coimbra,
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.
Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics
Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.
Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new
Data Warehouse Snowflake Design and Performance Considerations in Business Analytics
Journal of Advances in Information Technology Vol. 6, No. 4, November 2015 Data Warehouse Snowflake Design and Performance Considerations in Business Analytics Jiangping Wang and Janet L. Kourik Walker
Chapter 6. Foundations of Business Intelligence: Databases and Information Management
Chapter 6 Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:
INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & MANAGEMENT INFORMATION SYSTEM (IJITMIS)
INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & MANAGEMENT INFORMATION SYSTEM (IJITMIS) International Journal of Information Technology & Management Information System (IJITMIS), ISSN 0976 ISSN 0976
In-memory databases and innovations in Business Intelligence
Database Systems Journal vol. VI, no. 1/2015 59 In-memory databases and innovations in Business Intelligence Ruxandra BĂBEANU, Marian CIOBANU University of Economic Studies, Bucharest, Romania [email protected],
Navigating the Big Data infrastructure layer Helena Schwenk
mwd a d v i s o r s Navigating the Big Data infrastructure layer Helena Schwenk A special report prepared for Actuate May 2013 This report is the second in a series of four and focuses principally on explaining
BIG DATA TECHNOLOGY. Hadoop Ecosystem
BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big
Qsoft Inc www.qsoft-inc.com
Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:
Advanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
www.ijreat.org Published by: PIONEER RESEARCH & DEVELOPMENT GROUP (www.prdg.org) 28
Data Warehousing - Essential Element To Support Decision- Making Process In Industries Ashima Bhasin 1, Mr Manoj Kumar 2 1 Computer Science Engineering Department, 2 Associate Professor, CSE Abstract SGT
Apache Hadoop: The Big Data Refinery
Architecting the Future of Big Data Whitepaper Apache Hadoop: The Big Data Refinery Introduction Big data has become an extremely popular term, due to the well-documented explosion in the amount of data
From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten
From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten MC Brown, Director of Documentation Linas Virbalas, Senior Software Engineer. About Tungsten Replicator Open source drop-in
Enhancing Massive Data Analytics with the Hadoop Ecosystem
www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3, Issue 11 November, 2014 Page No. 9061-9065 Enhancing Massive Data Analytics with the Hadoop Ecosystem Misha
Dell In-Memory Appliance for Cloudera Enterprise
Dell In-Memory Appliance for Cloudera Enterprise Hadoop Overview, Customer Evolution and Dell In-Memory Product Details Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert [email protected]/
Foundations of Business Intelligence: Databases and Information Management
Foundations of Business Intelligence: Databases and Information Management Wienand Omta Fabiano Dalpiaz 1 drs. ing. Wienand Omta Learning Objectives Describe how the problems of managing data resources
