# This is an author-deposited version published in OATAO (Eprints ID: 13232)


Open Archive TOULOUSE Archive Ouverte (OATAO). OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible.

This is an author-deposited version published in: http://oatao.univ-toulouse.fr/ Eprints ID: 13232. To link to this article: DOI: /j.jss URL:

To cite this version: Song, Jie and Guo, Chaopeng and Wang, Zhi and Zhang, Yichuan and Yu, Ge and Pierson, Jean-Marc. HaoLap: a Hadoop based OLAP system for big data. (2015) Journal of Systems and Software, vol. , pp. . ISSN .

Any correspondence concerning this service should be sent to the repository administrator:

Fig. 2. Example of dimension coding and hierarchy of dimension values.

### Dimension traverse algorithm

In HaoLap, the hierarchy of dimension values can be treated as a special rooted tree called the hierarchy tree. In the hierarchy tree, ALL is the root node, denoted as level 0. The values of each dimension level can be treated as the nodes of the corresponding hierarchy level, and sibling nodes have the same number of descendants. OLAP involves traversing a dimension, as in roll up or drill down. The coding mechanism is helpful for maintaining these relationships, and roll up can be implemented on the dimension tree by code calculation: if code(v_{i+1}) is given, code(v_i) can be calculated when v_{i+1} rolls up to v_i. The drill down from v_i to v_{i+1} is treated as the roll up from v_{i+2} to v_{i+1}.

Let v_1, v_2, ..., v_i, v_{i+1}, ..., v_m be a top-down path of the dimension tree, and let order(v_i) be the number of previous sibling nodes of v_i (nodes with the same parent node, in left-to-right order). Taking Fig. 2 as an example, for a path v_1, v_2, v_3 with order(v_1) = 0, order(v_2) = 1 and order(v_3) = 1, the codes are code(v_3) = 32, code(v_2) = 1 and code(v_1) = 0. The relationship between code(v_i) and order(v_i) is given by Eq. (1):

    code(v_i) = (···((0 + order(v_1)) · l_2 + order(v_2)) · l_3 + order(v_3) ···) · l_i + order(v_i)    (1)

Given code(v_i), order(v_1) to order(v_i) can be calculated by Eq. (2):

    temp_i = code(v_i),              order(v_i) = temp_i % l_i
    temp_{i-1} = ⌊temp_i / l_i⌋,     order(v_{i-1}) = temp_{i-1} % l_{i-1}
    ...
    temp_1 = ⌊temp_2 / l_2⌋,         order(v_1) = temp_1 % l_1    (2)

Based on Eqs. (1) and (2), code(v_{i-1}) down to code(v_1) can be calculated if code(v_i) is given, so that each node on the path can be determined. For example, in Fig. 2, at the day level code(v_3) = 32 and the path is v_1, v_2, v_3; with l_3 = 31, l_2 = 12 and l_1 = 50, Eq. (2) yields order(v_2) = 1 and order(v_1) = 0, and thus code(v_2) = 1 and code(v_1) = 0 are calculated according to Eq. (1).
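Eqs. (1) and (2) amount to a mixed-radix (Horner) encoding of the order values, with the level sizes l_2, ..., l_i as radices. A minimal Python sketch (the function and variable names are ours, not from HaoLap; the level sizes 50/12/31 follow the day-level example of Fig. 2):

```python
def dim_code(orders, sizes):
    """Eq. (1): fold the order values top-down with the level sizes l_2..l_i."""
    code = orders[0]
    for order, size in zip(orders[1:], sizes[1:]):
        code = code * size + order
    return code

def dim_orders(code, sizes):
    """Eq. (2): recover order(v_1)..order(v_i) from code(v_i) by mod/div."""
    orders = []
    for size in reversed(sizes):
        orders.append(code % size)
        code //= size
    return orders[::-1]

def roll_up(code, size):
    """Roll up one level: code(v_i) = floor(code(v_{i+1}) / l_{i+1})."""
    return code // size

# Level sizes from the example: l_1 = 50, l_2 = 12, l_3 = 31.
sizes = [50, 12, 31]
print(dim_code([0, 1, 1], sizes))   # 32, the day-level code from the text
print(dim_orders(32, sizes))        # [0, 1, 1]
print(roll_up(32, 31))              # 1, the code one level up
```

Note that roll up is a single integer division, which is why the hierarchy itself never needs to be materialized.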
### Cube partition algorithm

The purpose of dividing the cube into chunks is to filter the input data of MapReduce, which improves query performance; in addition, a chunk can be the storage unit of a cube (discussed in Section 4.4). In this section, we introduce the chunk partition algorithm. The performance of OLAP is related to the chunk size. When the chunks are smaller, the parallelism is better and fewer cells are scanned for a given query range, but the scheduling cost is larger. Therefore, the chunk size is critical and should be carefully determined. Inspired by Sarawagi and Stonebraker (1994), we propose an approach that determines the chunk size based on all query conditions and each condition's probability. However, the query conditions analyzed are a random sample, because it is impossible to enumerate all query conditions. In addition, the chunk size is determined not only by the query conditions but also by the features of MapReduce, such as file addressing time, data processing time and so on. Table 2 illustrates the definitions of the symbols. Among the symbols, only λ_i is a variable, while T and N_a are calculation results.

Table 2. Definition of related symbols.

| Symbol | Description |
|---|---|
| T | The average execution time of OLAP |
| n | The number of dimensions |
| λ_i | The chunk size at d_i |
| N_a | The average number of chunks selected by queries; N_a can be calculated by aggregating a_ij and ξ_j |
| v | The number of map tasks processed per second |
| t_0 | The file addressing time |
| t_1 | The scheduling time of the task |
| q | The number of query conditions |
| a_ij | The number of dimension values of d_i that match query condition j |
| ξ_j | The occurrence probability of condition j |

    N_a = Σ_{j=1}^{q} ξ_j · ∏_{i=1}^{n} ⌈a_ij / λ_i⌉    (3)

When we take the effects of MapReduce into account, we obtain the average time of OLAP:

    T = Σ_{j=1}^{q} ξ_j · ∏_{i=1}^{n} ⌈a_ij / λ_i⌉ · ( t_0 + t_1 + (∏_{i=1}^{n} λ_i) / v )    (4)

The chunk size can be obtained from Eq. (4) by finding the values of λ_i that minimize T, and the chunk capacity can then be calculated as well. The details of the partition algorithm are abbreviated in this paper.
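Eqs. (3) and (4) are cheap to evaluate numerically, so a suitable chunk size can be found by searching over candidate λ values. A sketch in Python (function names and the toy workload are ours; the paper abbreviates the actual search procedure):

```python
from itertools import product
from math import ceil

def prod_ceil(row, lam):
    """prod_i ceil(a_ij / lam_i) for one query condition j."""
    p = 1
    for a_ij, l_i in zip(row, lam):
        p *= ceil(a_ij / l_i)
    return p

def avg_chunks(xi, a, lam):
    """Eq. (3): N_a = sum_j xi_j * prod_i ceil(a_ij / lam_i)."""
    return sum(x * prod_ceil(row, lam) for x, row in zip(xi, a))

def avg_time(xi, a, lam, t0, t1, v):
    """Eq. (4): T = N_a * (t0 + t1 + prod_i lam_i / v)."""
    cells = 1
    for l_i in lam:
        cells *= l_i
    return avg_chunks(xi, a, lam) * (t0 + t1 + cells / v)

def best_chunk_size(xi, a, candidates, t0, t1, v):
    """Grid search for the lambda vector minimizing T."""
    return min(product(*candidates), key=lambda lam: avg_time(xi, a, lam, t0, t1, v))

# Toy example: one query condition over two dimensions.
xi = [1.0]        # the condition occurs with probability 1
a = [[10, 6]]     # matched dimension values per dimension
print(avg_chunks(xi, a, [4, 3]))                 # ceil(10/4)*ceil(6/3) = 6
print(avg_time(xi, a, [4, 3], 1.0, 1.0, 12.0))   # 6 * (1 + 1 + 12/12) = 18.0
```

Larger chunks reduce N_a (fewer map tasks, less addressing and scheduling) but increase the per-chunk scan term ∏λ_i / v, which is exactly the trade-off the text describes.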
As we can see from the definitions, the sum of the chunk capacities of all chunks may be larger than the cube capacity because of the ceiling operation; thus some null cells can be found in a few boundary chunks.

### Data storage algorithms

The storage cost of MOLAP could be huge, because OLAP is accelerated by keeping the multidimensional array in memory; such a storage approach is impossible in the face of big data. The multidimensional array of HaoLap is calculated, not stored. Therefore, the storage of a HaoLap dimension only contains the size of each level, because the codes of dimension values in the same level are continuous and sibling nodes have the same number of descendants. Let d be a dimension with m dimension levels, denoted as {l_i | i ∈ [1, m]}; the storage of d is a set, denoted as L(d) = {⟨l_i, |l_i|⟩ | i ∈ [1, m]}. In practice, l_i is simplified as the level identity, and XML files are used to persist L(d) permanently in the Metadata Server.

Logically, both the cells of the cube and the chunks of the cube can be regarded as values of a multidimensional array. Physically, a chunk is the storage unit of a cube, stored as a MapFile (Lai and ZhongZhi, 2010) in HDFS. Both chunk files and cells are linearized for persistence and reverse-linearized for queries, similarly to the linearization and reverse-linearization of a multidimensional array. Let a multidimensional array consist of n dimensions, let the sizes of the array be denoted as A_1, A_2, ..., A_n, and let the coordinate of a value in the array be denoted as (X_1, X_2, ..., X_n); the linearization

and reverse-linearization equations are illustrated as Eqs. (5) and (6), respectively:

    index(x) = (···((X_n · A_{n-1} + X_{n-1}) · A_{n-2} + ··· + X_3) · A_2 + X_2) · A_1 + X_1    (5)

    temp_1 = index,                    X_1 = temp_1 % A_1
    temp_2 = ⌊temp_1 / A_1⌋,           X_2 = temp_2 % A_2
    ...
    temp_n = ⌊temp_{n-1} / A_{n-1}⌋,   X_n = temp_n % A_n    (6)

Fig. 3. Example of chunks selection algorithm.

As far as a cube is concerned, let x be a cell referred to by d_1, d_2, ..., d_n, let v_i be the dimension value of d_i to which x refers, and let code(v_i) be x_i; then the coordinate of x is (x_1, x_2, ..., x_n). We can calculate the linearized index of x by Eq. (5), in which (X_1, X_2, ..., X_n) is (x_1, x_2, ..., x_n) and A_1, A_2, ..., A_n are the dimension sizes |d_1|, |d_2|, ..., |d_n|. The reverse-linearization is the same as Eq. (6). The cell is linearized and stored as a record in the file, which includes the linearized coordinate of the cell and the measures in the cell.

As far as a chunk is concerned, let y be a chunk, a partition of the cube with chunk size ⟨λ_1, λ_2, ..., λ_n⟩, and let the coordinate of y be (y_1, y_2, ..., y_n), in which y_i = ⌊x_i / λ_i⌋. The name of the chunk file is the linearized index of the chunk. The linearized index of y can be calculated by Eq. (5), in which (X_1, X_2, ..., X_n) is (y_1, y_2, ..., y_n) and A_1, A_2, ..., A_n are the numbers of chunks along each dimension, ⌈|d_1|/λ_1⌉, ..., ⌈|d_n|/λ_n⌉; the reverse-linearization is the same as Eq. (6). In the implementation, a chunk is a MapFile of Hadoop HDFS, and the linearized coordinate of the chunk is the name of the chunk file, which is convenient for addressing. In a chunk file, records are stored in the form of key-values, in which the values correspond to the measures and the keys are the indexes of the corresponding cells, linearized according to Eq. (5). In order to avoid the sparsity problem of measures, a measure whose value is NULL is not stored in the chunk, and neither is its key. The index is an integer, but the size of a cube can be very large in the big data environment; taking Java as an example, the value of the index can overflow the range of a long integer (2^64), so we use a string to represent the integer.
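Eqs. (5) and (6) are the usual linearization of a multidimensional coordinate, and the same pair of routines serves both cells (A_i = |d_i|) and chunk names (A_i = the number of chunks along d_i). A small Python sketch (the function names are ours):

```python
def linearize(coord, sizes):
    """Eq. (5): index = (..((X_n*A_{n-1} + X_{n-1})*A_{n-2} + ..) + X_2)*A_1 + X_1."""
    index = coord[-1]
    for x, a in zip(reversed(coord[:-1]), reversed(sizes[:-1])):
        index = index * a + x
    return index

def reverse_linearize(index, sizes):
    """Eq. (6): peel the coordinates off with mod and integer division."""
    coord = []
    for a in sizes:
        coord.append(index % a)
        index //= a
    return tuple(coord)

sizes = (4, 5, 6)                    # A_1, A_2, A_3
idx = linearize((1, 2, 3), sizes)    # 1 + 4*(2 + 5*3) = 69
print(idx)                           # 69
print(reverse_linearize(idx, sizes)) # (1, 2, 3)
```

The two functions are inverses of each other, which is what lets HaoLap store only the linearized key and recover the coordinate at query time.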
In addition, the cost of storing the keys may be much larger than that of the values in the key-values data model, because each index is a big number; we therefore optimize the key-values storage so that only the minimum index and the offsets from it are stored in the chunk file.

### Chunk selection algorithm

OLAP operations mainly include roll up, drill down, slice, dice and pivot. Slice and dice can be treated as range query operations, roll up and drill down can be treated as the combination of query and aggregation, and pivot can be treated as creating a different view of the cube. Thus, the OLAP operations can be abstracted as an aggregated query, which aggregates data over a certain range. In data warehouses, the measures are referred to and queried by dimensions; therefore, the query conditions are defined on each dimension.

Range indicates a query range of cells in Target. Obviously, a Range is a multidimensional tuple. Let {d_1, d_2, ..., d_n} be the dimensions of Target, and let an ordered pair ⟨α_i, β_i⟩ (α_i < β_i) be the given query range on dimension d_i; therefore Range = {⟨α_i, β_i⟩ | i ∈ [1, n]}. For convenience, Range is written as an ordered pair of tuples: ⟨{α_i}, {β_i}⟩ = ⟨{α_1, α_2, ..., α_n}, {β_1, β_2, ..., β_n}⟩.

The OLAP query includes three steps: chunk selection, data filtering, and data aggregation. Many researchers have discussed how to implement the query and aggregation in MapReduce (Jing-hua et al., 2012; Wu et al., 2013). In this section, chunk selection is introduced. HaoLap provides a chunk selection algorithm that reduces the queried space by selecting the chunks that match the query range. Let a Range be ⟨{α_1, α_2, ..., α_n}, {β_1, β_2, ..., β_n}⟩, the coordinate of a chunk be (c_1, c_2, ..., c_n), and the chunk size be ⟨λ_1, λ_2, ..., λ_n⟩. If ∀i ∈ [1, n] Eq. (7) is satisfied, we say that the chunk matches the Range:

    ⌊α_i / λ_i⌋ ≤ c_i ≤ ⌊β_i / λ_i⌋    (7)

Chunks that match the Range are processed by MapReduce jobs. Taking Fig.
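Under the coordinate convention above (y_i = ⌊x_i / λ_i⌋), Eq. (7) and the enumeration of matching chunks can be sketched in a few lines of Python (names are ours; in HaoLap the chunk-file name would be the linearized chunk coordinate):

```python
from itertools import product

def matches(chunk_coord, alpha, beta, lam):
    """Eq. (7): floor(alpha_i/lam_i) <= c_i <= floor(beta_i/lam_i) for every i."""
    return all(a // l <= c <= b // l
               for c, a, b, l in zip(chunk_coord, alpha, beta, lam))

def select_chunks(alpha, beta, lam):
    """Enumerate the coordinates of all chunks overlapping the Range."""
    ranges = [range(a // l, b // l + 1) for a, b, l in zip(alpha, beta, lam)]
    return list(product(*ranges))

# 2-D example: cell range [3..9] x [4..5], chunk size <4, 3>.
alpha, beta, lam = (3, 4), (9, 5), (4, 3)
print(select_chunks(alpha, beta, lam))     # [(0, 1), (1, 1), (2, 1)]
print(matches((1, 1), alpha, beta, lam))   # True
print(matches((0, 0), alpha, beta, lam))   # False
```

The selected coordinates form the smallest box of whole chunks containing the Range, which is exactly the A′B′C′D′-E′F′G′H′ bounding cube of Fig. 3.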
3 for example, assuming that the Range is ⟨{x_1, y_1, z_1}, {x_2, y_2, z_2}⟩, the point B is (x_1, y_1, z_1) and the point H is (x_2, y_2, z_2); then the sub-cube ABCD-EFGH contains the required cells. Through Eq. (7), we get the smallest cube A′B′C′D′-E′F′G′H′ that is composed of whole chunks and contains ABCD-EFGH. These chunks are the input files of the MapReduce job. Chunk selection avoids most of the cells that do not match the query conditions: the cells in the interior of A′B′C′D′-E′F′G′H′ match the query conditions and need no further filtering. However, at the boundary of A′B′C′D′-E′F′G′H′ there are still some cells that do not match the query conditions, and these need to be filtered in the MapReduce job.

## 5. Implementation

In this section, the data loading and OLAP implementations based on the MapReduce framework are introduced. A MapReduce job consists of four parts: InputFormatter, Mapper, Reducer, and OutputFormatter. The MapReduce based OLAP process is shown in Fig. 4. Before the MapReduce job is performed, the query quadruple is submitted by the client and verified in order to avoid deterministic failures. Then HaoLap determines the chunk-file list through the chunk selection algorithm mentioned in Section 4.5. In InputFormatter, the chunk files in the chunk-file list are read and the cells in the chunks are scanned. Meanwhile, the coordinates of each cell are reverse-linearized and checked against the query conditions. If a coordinate matches the query conditions, the linearized coordinate of the cell and the measure are passed to Mapper; otherwise the cell is abandoned.

Both the input and output of the Mapper and Reducer functions are key-value pairs: ⟨Key_M_In, Value_M_In⟩ and ⟨Key_M_Out, Value_M_Out⟩ are the input and output formats of the Map function, respectively; ⟨Key_R_In, Value_R_In⟩ and ⟨Key_R_Out, Value_R_Out⟩ are the input and output formats of the Reduce function, respectively. Key_M_Out should be implicitly convertible to Key_R_In, and Value_R_In is the collection of Value_M_Out.
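The InputFormatter step described above — scan the selected chunk files, reverse-linearize each cell's key, and keep only cells inside the Range — can be sketched as follows. This is a pure-Python stand-in, not Hadoop's actual InputFormat API; the names and the cell layout as (index, measure) pairs are our assumptions:

```python
def reverse_linearize(index, sizes):
    """Eq. (6): recover a coordinate from a linearized index."""
    coord = []
    for a in sizes:
        coord.append(index % a)
        index //= a
    return tuple(coord)

def input_formatter(records, sizes, alpha, beta):
    """Yield (key, measure) for cells whose coordinates fall inside the Range;
    other cells are abandoned, as described in Section 5."""
    for index, measure in records:
        coord = reverse_linearize(index, sizes)
        if all(a <= x <= b for x, a, b in zip(coord, alpha, beta)):
            yield index, measure   # passed on to Mapper

# Toy chunk content: (linearized index, measure) pairs in a 4x5 cube.
sizes = (4, 5)
records = [(0, 1.0), (5, 2.0), (13, 3.0), (19, 4.0)]
kept = list(input_formatter(records, sizes, alpha=(1, 1), beta=(2, 3)))
print(kept)   # [(5, 2.0), (13, 3.0)]
```

Only the boundary cells of the selected chunk box are rejected here; interior cells always pass, which is why chunk selection already removes most of the scan work.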
In HaoLap, Key_M_In is the linearized coordinate of a cell and Value_M_In is the value of the measures read by InputFormatter. The

Yichuan Zhang received the Ph.D. degree in computer science from Northeastern University of China in 2012. He is a lecturer in the Software College of Northeastern University. His main research interests include data-intensive computing and iterative computing.

Jean-Marc Pierson received his Ph.D. in computer science from École Normale Supérieure de Lyon. He has been a professor at Paul Sabatier University (Toulouse 3) since . His research interests include distributed systems and high performance computing, distributed data management, and energy-aware distributed computing.

Ge Yu received his Ph.D. degree in computer science from Kyushu University of Japan in . He has been a professor at Northeastern University of China since . His research interests include database theory and technology, and distributed and parallel systems.


### A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

### I/O Considerations in Big Data Analytics

Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very

### Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

### The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn Presented by :- Ishank Kumar Aakash Patel Vishnu Dev Yadav CONTENT Abstract Introduction Related work The Ecosystem Ingress

### Keywords: Big Data, HDFS, Map Reduce, Hadoop

Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

### OLAP in the Data Warehouse

Introduction OLAP in the Data Warehouse Before the advent of data warehouse, tools were available for business data analysis. One of these is structured query language or SQL for query processing. However,

### A Study of Data Management Technology for Handling Big Data

Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 9, September 2014,

### Monitoring Genebanks using Datamarts based in an Open Source Tool

Monitoring Genebanks using Datamarts based in an Open Source Tool April 10 th, 2008 Edwin Rojas Research Informatics Unit (RIU) International Potato Center (CIP) GPG2 Workshop 2008 Datamarts Motivation

### MapReduce With Columnar Storage

SEMINAR: COLUMNAR DATABASES 1 MapReduce With Columnar Storage Peitsa Lähteenmäki Abstract The MapReduce programming paradigm has achieved more popularity over the last few years as an option to distributed

### DATA WAREHOUSE AND OLAP TECHNOLOGIES. Outline. Data Warehouse Data Warehouse OLAP. A data warehouse is a:

DATA WAREHOUSE AND OLAP TECHNOLOGIES Keep order, and the order shall save thee. Latin maxim Outline 2 Data Warehouse Definition Architecture OLAP Multidimensional data model OLAP cube computing Data Warehouse

### International Journal of Advance Research in Computer Science and Management Studies

Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

### Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

### DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

### BUILDING OLAP TOOLS OVER LARGE DATABASES

BUILDING OLAP TOOLS OVER LARGE DATABASES Rui Oliveira, Jorge Bernardino ISEC Instituto Superior de Engenharia de Coimbra, Polytechnic Institute of Coimbra Quinta da Nora, Rua Pedro Nunes, P-3030-199 Coimbra,

### Using the column oriented NoSQL model for implementing big data warehouses

Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'15 469 Using the column oriented NoSQL model for implementing big data warehouses Khaled. Dehdouh 1, Fadila. Bentayeb 1, Omar. Boussaid 1, and Nadia

REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

### 2074 : Designing and Implementing OLAP Solutions Using Microsoft SQL Server 2000

2074 : Designing and Implementing OLAP Solutions Using Microsoft SQL Server 2000 Introduction This course provides students with the knowledge and skills necessary to design, implement, and deploy OLAP

### low-level storage structures e.g. partitions underpinning the warehouse logical table structures

DATA WAREHOUSE PHYSICAL DESIGN The physical design of a data warehouse specifies the: low-level storage structures e.g. partitions underpinning the warehouse logical table structures low-level structures