Power Grid Time Series Data Analysis with Pig on a Hadoop Cluster compared to Multi Core Systems


Felix Bach*, Hueseyin K. Çakmak*, Heiko Maass, Uwe Kuehnapfel
Institute for Applied Computer Sciences, Karlsruhe Institute of Technology, Karlsruhe, Germany
{bach, cakmak, maass, kuehnapfel}@kit.edu
* Both authors contributed equally to this paper.

Abstract: In order to understand the dependencies in the power system, we try to derive state information by combining high-rate voltage time series captured at different locations with data analysis at different scales. This may enable large-scale simulation and modeling of the grid. Data captured by our recently introduced Electrical Data Recorders (EDR) as well as power grid simulation data are stored in the Large Scale Data Facility (LSDF) at Karlsruhe Institute of Technology (KIT) and are growing rapidly in size. In this article we compare classical multi-threaded time series data processing to distributed processing using Pig on a Hadoop cluster. Furthermore, we present our ideas for a better organization of our raw data and metadata that is indexable, searchable and suitable for big data.

Keywords: power system, time series, data analysis, big data, LSDF, Hadoop, Pig, multicore

I. INTRODUCTION

Knowledge of the power grid system state and of its dynamics and dependencies is an important prerequisite for reliable control and understanding of the power system. The worldwide integration of renewable energy sources into the power grid demands a new dynamic grid architecture, the so-called Smart Grid [1], which is proposed to replace centralized grids. For its realization, appropriate methods for detailed modeling and simulation at large scale, based on the analysis of real measurements, are necessary. Available consumption data are not sufficient: the coarse time scale of measurements by so-called smart meters, which provide cumulated power-value time series with typical frequencies of only one sample per 1 to 15 minutes, may be used for short-term local load forecasts on a statistical level, but is not sufficient for global fine-grained analysis or for use in physical simulations that could increase the knowledge of dynamics and dependencies in the grid. We therefore designed a system for capturing, archiving and analyzing voltage data using our previously presented high-rate voltage measurement device with sub-second resolution, the EDR (Electrical Data Recorder), and presented first feature extraction and analysis results [2][3]. However, we found that our current approach lacks data scalability and usability. As a consequence, we evaluated new technology for the organization and analysis of our data. We implemented a first experimental processing of large amounts of archived data utilizing a Hadoop cluster together with Pig scripts and user-defined Pig functions (UDFs) and compared its performance to classical sequential multi-threaded processing on a local multi-core machine, in order to find out which method is appropriate for which data scale and whether our data allows efficient processing in Hadoop.

The paper is organized as follows: In chapter II we give an overview of related work and a problem description. In chapter III we describe our data and the cluster setup. In chapter IV the results of our performance analysis of distributed and local sequential processing are shown; we also discuss data visualization aspects for time series data from the LSDF. In chapter V we present new concepts for improvements.
II. RELATED WORK AND PROBLEM IDENTIFICATION

A. Related work

Different groups have worked on the capturing, archiving and analysis of voltage data. However, to our knowledge no such work has previously been done for long-term data analysis with synchronized sub-second sampling, at least in Europe. High-rate network analyzers capture more detailed data than smart meters, but are currently only used for short-term failure analysis and are not synchronized between different locations. Phasor measurement units (PMUs) are high-speed sensors with the option of synchronous acquisition to enable monitoring of power grid quality. However, PMUs are rarely used in Germany and data from the real network are expensive [1]. The Tennessee Valley Authority (TVA) collects PMU data on behalf of the North American Electric Reliability Corporation (NERC) to help ensure the reliability of the bulk power system in North America [4]. High-voltage electric system busses and transmission lines are sampled at a substation several thousand times a second; the samples are then reported for collection and aggregation. The entire stream is passed to archiving servers. A real-time data stream is forwarded to a server program hosted by TVA, which passes the data in a standard phasor data protocol (IEEE C37.118) to client visualization tools. Agents may move PMU archive files into a Hadoop cluster or directly request these data at their remote location for analyses.

By the end of 2010, around 40 TB of PMU data had been collected; over the next 5 years, an additional 500 TB of PMU data are expected. The Open Source Phasor Data Concentrator (openPDC), administered by the Grid Protection Alliance (GPA) [5], is a project providing a free software system designed to process streaming time series data in real time. According to the homepage, measured data is gathered with GPS time from many hundreds of input sources, time-sorted and provided to user-defined actions. It is scalable and designed to consume all standard PMU input protocols. However, performance statistics are logged only every 10 seconds. Another free phasor data concentrator, based on the IEEE C37.118 synchrophasor standard, is iPDC [6]. It is more academic than openPDC, allows the development and testing of algorithms and applications related to PMU data, and may be used and modified without any restriction. The project Lumberyard [7] provides scalable indexing and low-latency fuzzy pattern searching in time series data. It uses HBase (a column-based Hadoop storage) and iSAX [8] to achieve both scale and indexing/searching. Lumberyard is based on iSAX and uses the jmotif Java library, which provides the symbolic aggregate approximation (SAX) [9] and iSAX for time series data, enabling outlier detection and the search for frequently occurring patterns.

B. Problem description

Our EDR measurements produce massive amounts of data every day, and additional data will arise from power grid simulations. When measuring at the high rate of 25 kHz, each EDR produces ~16 GiB per day. This adds up to a total of 5.7 TiB per year per device, exceeding typical hard disk storage sizes (a back-of-the-envelope check of these figures follows at the end of this section). As soon as we add many devices or run simulations with virtual EDRs, storing the data on disk drives connected to a single PC is obviously no longer possible, and processing is not efficient. In order to transform our data into useful and searchable scientific information, we need new ways to manage, archive, explore and analyze the resulting data, because classical storage and mining methods were not found to be appropriate for such huge amounts of data. As a first step towards understanding the meaning of the data, we wanted to compute some statistical measures for the time series. For small time ranges in the measurement history, we implemented some basic sequential processing of the data. But the larger the size and number of data files that needed to be processed, the less appropriate classical sequential methods proved to be (as presented in chapter IV). To overcome these limits in time ranges and data sizes, both for storage and processing, we searched for ways to bring computation and data residence together.

1) Technical Requirements

First, we need data storage that does not run out of space like traditional storage on hard disks and that holds our valuable data fail-safe, e.g. by using redundant storage, to enable maintenance and accessibility. Second, we need ways to access and analyze the data efficiently. It should be possible to flexibly analyze the archived data at any later time for arbitrary time ranges, and to quickly search and filter the data using customizable queries. For this, the storage should provide one namespace, so that the data distribution is abstracted away from users. Also, a graphical frontend for interactive navigation, supported by different visualizations of the data, was needed to provide a data overview.
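As the plausibility check referenced above, the following minimal sketch in Java reproduces the quoted volumes. The factors are our assumptions, not from the measurement specification: three phases of 16-bit samples at 25 kHz and the roughly 4/3 size inflation of Base64 text encoding, ignoring XML metadata overhead.

    // Back-of-the-envelope check of the EDR data volume; all factors are assumptions.
    public class EdrVolume {
        public static void main(String[] args) {
            // 25 kHz * 3 phases * 2 bytes per sample * seconds per day * Base64 overhead
            double bytesPerDay = 25_000.0 * 3 * 2 * 86_400 * (4.0 / 3.0);
            System.out.printf("per day:  %.1f GiB%n", bytesPerDay / Math.pow(2, 30));
            System.out.printf("per year: %.1f TiB%n", bytesPerDay * 365 / Math.pow(2, 40));
            // prints ~16.1 GiB per day and ~5.7 TiB per year, matching the text
        }
    }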
2) Goals of this study

We wanted to find out the scale at which distributed processing outperforms local parallel processing. Since the cluster our data resides on already has a Hadoop installation available, we tried utilizing its Map-Reduce framework, following the approach of bringing the processing to the data rather than transferring the data to processing units. A convenient way to do this was to write user-defined functions (UDFs) for Apache Pig [10] (see chapter III). By running the same experimental statistical analyses over differently sized test data sets, using this distributed setup on the one hand and more classical parallel processing on the other, we then evaluated and compared both approaches. A good storage and processing approach is the precondition for developing and implementing more sophisticated methods for the analysis, classification and clustering of big datasets, due to the totally different design patterns and requirements for the archival data layout involved in the two processing approaches.

III. INFRASTRUCTURE AND DATA LAYOUT

A. Hadoop cluster setup

We utilize the Large Scale Data Facility (LSDF), which is currently maintained at KIT, Steinbuch Centre for Computing (SCC) [11]. It is designed to fulfill the requirements of different data-intensive scientific experiments and applications. At the moment, the Hadoop cluster available as part of the LSDF consists of 58 nodes with 464 physical cores in total, each node having 2 sockets with an Intel Xeon E5520 CPU (2.27 GHz, 4 cores), 36 GB of RAM, 2 TB of disk, a 1 GE network connection, and Scientific Linux 5.5 as operating system. The two additional name nodes are identical to the data nodes except for their increased 96 GB of RAM and a 10 GE network connection. Cloudera's distribution of Hadoop, version CDH3u4, is currently installed, providing different interfaces. We used the Apache Pig API. Pig consists of a high-level data flow description language called Pig Latin and a server running on top of Hadoop that internally uses Map-Reduce.

B. Data Sources and Formats

We have two data sources: collected measurement data produced by our EDRs, and simulation data gathered at virtual smart meters in a power grid simulation model.

1) Power Grid Time Series Data Captured by EDRs

The investigation site is the Campus North (CN) of KIT, an enclosed research center located near Karlsruhe. One 110 kV incoming transmission line from the local energy supplier as well as one 2 MW block heating station, also used for electrical power generation, provide its electrical energy.

534 electrical active power consumption smart meters are installed at the KIT Campus North, and their data are available over several years from a central database [3]. Since February 2012 we have additionally been conducting continuous voltage time series measurements in this island-like electric network using our EDRs, where the three phase voltages are acquired simultaneously at a typical rate of 10 kHz (up to 25 kHz). A GPS pulse-per-second (PPS) signal is captured at the same time for synchronization purposes [2]. Our EDR raw data is first stored as XML files locally on the disk drive of the PC that records the time series. These files are then aggregated at a file server, which writes the data to the Hadoop file system (HDFS) using Hadoop's Java API. Each XML file contains the measured voltage data for one minute, in 60 one-second blocks for each of the 3 phases, stored as Base64-encoded 16-bit raw binary numbers. Some additional metadata describing the acquisition process and supplemental information is stored together with the raw data in the XML files for each one-second block. This data layout was designed when the first measurements were made, in order to keep the highest accuracy at the lowest data transfer volume. Even if this layout does not seem to fit the idea of distributed processing on a Hadoop cluster, we need a way to process it, having collected daily measurements in this format for nearly one year now. We have already aggregated a large amount of time series data (files in 172 folders) with various recording sample rates. The amount of data increases over time and multiplies with the number of installed EDRs.

2) Power grid simulation data

To understand the effects in a complex power grid, we also put effort into modeling it. We developed the new software framework easimov (Electric Grid Analysis, Simulation, Modeling and Visualization) as a collection of multiple modules. The modeling module enables the interactive creation of a geo-information-based representation of the power grid structure and components. We have access to the data of the 534 smart meters at the KIT-CN, which we utilize to simulate the currents based on the load and voltage profiles. We use GridLAB-D [12] as the simulation software. Each virtual smart meter in the simulation model can record voltage and current for each phase once per second; higher sampling rates will be supported in GridLAB-D v3. The output of the recorders is saved to individual CSV files. The file sizes depend on the number of smart meters (n_sm), the number of captured parameters (n_p), e.g. current and voltage, the duration of the simulation (t_sim), and the sampling interval of the output (dt). The total data size also depends on the number of simulation runs (n_sim). The expected data size per model can be calculated as:

datasize_sim = c * n_sm * n_p * n_sim * t_sim / dt

where the constant c (in bytes) was determined for n_sm = n_p = n_sim = 1, t_sim = 1 h and dt = 1 s. For a long-term simulation, e.g. over 1 year, datasize_sim rises by a factor of 8.765*10^3 (the number of hours per year); for a high-sample-rate simulation matching the EDR resolution, datasize_sim rises by a factor of up to 2.5*10^4. The KIT-CN model will produce 1.5 TB for one long-term simulation (n_sm = 534, n_p = n_sim = 1, t_sim = 1 y, dt = 1 s). A worked example of this estimate follows at the end of this section.

C. Current data organisation

The raw data files are transferred continuously to the LSDF's HDFS storage, where each day is represented by a folder containing all the data files of that day, each holding the information of one minute. The file names allow identification by date, time and the identifier of the measuring GPS device.
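As the worked example referenced above, here is a minimal sketch of the size estimate. The per-unit constant is not preserved in this transcription, so the value below is back-derived from the quoted 1.5 TB result and is purely an assumption for illustration:

    // Sketch of datasize_sim = c * n_sm * n_p * n_sim * t_sim / dt.
    // The constant c is NOT given in the text; ~3.2e5 bytes is back-derived
    // from the quoted 1.5 TB of the KIT-CN long-term run (assumption).
    public class SimVolume {
        static final double C = 3.2e5; // bytes for n_sm = n_p = n_sim = 1, t_sim = 1 h, dt = 1 s

        static double datasizeSim(int nSm, int nP, int nSim, double tSimHours, double dtSeconds) {
            return C * nSm * nP * nSim * tSimHours / dtSeconds;
        }

        public static void main(String[] args) {
            // one long-term run: 534 smart meters, 1 parameter, 1 year, 1 sample per second
            System.out.printf("%.2f TB%n", datasizeSim(534, 1, 1, 8766, 1) / 1e12); // ~1.50 TB
        }
    }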
IV. PERFORMANCE ANALYSIS OF DATA PROCESSING

In this study we analyze the voltage time series data captured by one EDR device at the KIT-CN during June 2012. We are interested in efficient analysis and the extraction of statistical information. As a first step towards understanding the meaning of the data and evaluating different processing approaches, we wanted to compute some statistical measures for the time series data. In order to test the capabilities of the new technology (Hadoop, Pig) and the limits of PC computing for our data, we implemented the analysis algorithms both in a multicore environment and on the Hadoop cluster. Furthermore, we measured the runtime of each step of the data visualization with direct LSDF access.

A. Comparison of the Processing Time: Multicore vs. Pig

To compare the performance of the statistical analysis for big data, we developed a Java-based multi-threaded application, which was tested on two multicore PCs (PC-1: Intel Core i CPU, 4 cores, 3.3 GHz, 8 GiB RAM, Windows 7 64-bit; PC-2: 2x Intel Xeon CPU, 8 cores, 3.0 GHz, 16 GiB RAM, Windows 7 64-bit). The multi-threaded application operates on measurement data that is stored locally. For testing purposes we concentrated on simple statistical calculations such as the minimum, maximum, average and the standard deviations of the minima and maxima for all three phases of the EDR data. Since we are primarily interested in comparing processing speed, the selected methods may be insufficient for serious electrical data analysis. The same functionality for data decoding and the calculations was also implemented as a user-defined function (UDF) for use in Pig scripts. UDFs are compiled to Map-Reduce code running distributed on the Hadoop cluster. To be able to write and automatically start Pig Latin scripts from a development PC, the scripts were embedded in Java code. For this purpose, the Pig Java API provides a PigServer class with methods for registering and running scripts. Using this API, a function was implemented that runs a simple Pig script, which reads the data XML tags of all XML files in a given HDFS folder and applies the processing of a UDF written in Java to them. The function includes all statistical analysis implementations in Java as well as the needed Base64 decoding. The compiled Java class is packed into a jar file, which has to be registered at the Pig server in order to be usable in Pig scripts. Processing results are written to HDFS.
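To illustrate this setup, the following sketch shows what such a UDF could look like. It is not the authors' actual code: the class and field names, the reduction of the XML handling to a pre-extracted Base64 string, and the byte order of the samples are all assumptions.

    import java.io.IOException;
    import java.util.Base64;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    // Hypothetical Pig UDF: decodes one Base64-encoded block of 16-bit raw
    // samples and emits (min, max, average). In the real pipeline the Base64
    // payload would first be extracted from the data tags of the XML files.
    public class BlockStats extends EvalFunc<Tuple> {
        private static final TupleFactory TF = TupleFactory.getInstance();

        @Override
        public Tuple exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) return null;
            byte[] raw = Base64.getDecoder().decode((String) input.get(0));
            int n = raw.length / 2;
            long sum = 0;
            int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
            for (int i = 0; i < n; i++) {
                // interpret two bytes as a little-endian signed sample (byte order assumed)
                int v = (short) ((raw[2 * i] & 0xFF) | (raw[2 * i + 1] << 8));
                if (v < min) min = v;
                if (v > max) max = v;
                sum += v;
            }
            Tuple out = TF.newTuple(3);
            out.set(0, min);
            out.set(1, max);
            out.set(2, (double) sum / n);
            return out;
        }
    }

Embedded via the PigServer class, a driver along the lines described above might then read as follows (the HDFS paths and aliases are invented):

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class RunStats {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            pig.registerJar("edr-udfs.jar"); // jar containing the BlockStats UDF
            pig.registerQuery("blocks = LOAD '/edr/2012-06' USING TextLoader() AS (b64:chararray);");
            pig.registerQuery("stats = FOREACH blocks GENERATE BlockStats(b64);");
            pig.store("stats", "/edr/results/2012-06"); // results are written to HDFS
        }
    }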

We analyzed voltage measurement data for time periods ranging from six hours (~855 MiB, 360 files) up to three weeks, which amounts to about 69.7 GiB of data files. The files had a Hadoop replication factor of 2, and an HDFS block size of 64 MB was used. Fig. 1 compares the measured time needed for the calculation of the statistical parameters on the multicore PCs (multi-threaded Java code, 8 threads) and on the Hadoop cluster.

Figure 1: Runtime comparison Pig/Hadoop cluster vs. multicore PCs

From 6.2 GiB onwards, Hadoop cluster processing is superior to multi-threaded data processing. Due to the more advanced configuration of PC-2, its processing times are shorter, but both curves show similar characteristics. We can observe that Pig processing always needs a certain time for the initialization of the Map-Reduce step (creating and distributing jar files); for small data sizes this initialization appears to be the retarding factor. In a further step we examined the variation of the processing time over several runs within the Hadoop cluster. We calculated the average and median values of five runtime measurements for the same data configuration. During the execution we encountered two Hadoop socket warnings (DataStreamer Exception), which nearly doubled the execution time for the affected combinations.

B. Runtime Analysis for local Data Visualization

We developed a visualization module for the easimov software, which scans the EDR raw data on the LSDF and sorts it by the ID of the EDR devices and the captured time intervals. We were interested in analyzing the time needed for direct data retrieval from the LSDF and for the visualization with our current data organization. For testing purposes we selected a 15-minute interval of the voltage time series of July 21st, 2012 and measured the runtime of the data access, the data retrieval, the data processing and the data visualization (Fig. 2).

Figure 2: Runtime evaluation of data processing for LSDF data

The 15-minute example described above needs about 105 seconds from the interactive selection of the time range to the visualization. For this short time range, the data transfer from the LSDF is not the bottleneck, but rather the local processing of the large amount of data with Java. We verified this by measuring the time needed for each processing step. The most time-consuming step is the data decoding and the data formatting for the rendering module (93.3%), followed by the parsing of the XML files with the SAX parser (4.43%). A drawback of this processing pipeline is the repeatedly necessary access to the LSDF and the recalculation of the statistics from scratch. Local data processing after a data transfer from the LSDF will not be viable for the expected size of our measured and simulated data.

V. PROPOSED IMPROVEMENTS

Our experiences have reconfirmed that processing must be carried out near the data. This applies to the data analysis as well as to the preparation of the data visualization. A further improvement will be the creation of metadata, which will structure our data and provide additional semantic information. The local runtime analysis also made clear that decoding the Base64-encoded raw data is computationally too intensive to scale with the data growth. We will have to accept increased file sizes and store values in a text-based format.

A. New Metadata Structure for Indexing and Queries

One problem with the present data organization in the LSDF is its flatness (one folder per day) and the lack of metadata at higher data abstraction layers. Computing the total size of our data folder took more than five hours, illustrating the access problems for unpartitioned data. Currently, metadata is only kept for one-second measurement intervals.
It is desirable to also have information for longer time intervals: metadata for a specific day, month or year could describe the data characteristics for that time interval in a compressed manner and relate them to events and properties in external databases. In order to access the data domain in a top-down manner, a hierarchical metadata representation is proposed, which we explain based on the statistical functions introduced above. The procedure gives us information on the measured data at the top level (year), containing the value ranges of the minima, maxima, averages, etc., and makes it possible to traverse the tree down to significant data sections. In order to build up the hierarchical metadata structure, we analyze the content of the data blocks bottom-up, create metadata descriptions and merge them at the next higher level (time interval). Figure 3 shows the principle: we have data blocks representing meta- and sample data for each second. Based on these, we calculate the minima, maxima, averages and the standard deviations of the minima and maxima. From 60 one-second data blocks we create a minute data block containing the ranges of all characteristics; we save the minima and maxima of these in the minute description file, together with the file names pointing to the measurement data files. For the next hierarchy level, we group the minute metadata files into hour metadata files following the same principle, that is, we outline the value ranges of the characteristics obtained from the minute metadata files. This is repeated for each EDR until we reach the top of the hierarchy, the year metadata file. Currently this is defined as our top level and is stored separately for every EDR device.

Figure 3: The proposed hierarchical metadata structuring and indexing

The information collected on a year basis is thus available and processable separately for each EDR device. This indexing, combined with the data analysis of specific characteristics, can be run as a regular job for incoming new data. Additionally, we will be able to submit queries for arbitrary time ranges, which can then be answered efficiently from the metadata instead of being computed from the raw data. A sketch of the underlying merge rule follows below.
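As the sketch referenced above, the merge rule behind this hierarchy can be expressed in a few lines of Java. The class and field names are ours; the real metadata files would additionally carry the pointers to the underlying data files.

    import java.util.List;

    // One metadata node (second, minute, hour, day, month or year block).
    // A parent stores the value ranges of its children's statistics, so that
    // a query can prune whole subtrees without touching any raw data.
    public class MetaBlock {
        double minOfMins = Double.POSITIVE_INFINITY;
        double maxOfMaxs = Double.NEGATIVE_INFINITY;
        double avgLow    = Double.POSITIVE_INFINITY; // lower bound of child averages
        double avgHigh   = Double.NEGATIVE_INFINITY; // upper bound of child averages

        void absorb(MetaBlock child) {
            minOfMins = Math.min(minOfMins, child.minOfMins);
            maxOfMaxs = Math.max(maxOfMaxs, child.maxOfMaxs);
            avgLow    = Math.min(avgLow, child.avgLow);
            avgHigh   = Math.max(avgHigh, child.avgHigh);
        }

        // e.g. 60 second-blocks -> minute block, 60 minute-blocks -> hour block, ...
        static MetaBlock merge(List<MetaBlock> children) {
            MetaBlock parent = new MetaBlock();
            for (MetaBlock c : children) parent.absorb(c);
            return parent;
        }
    }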

B. Further Improvements

We will work on automated methods that perform the analysis, indexing and structuring of our collected voltage time series measurements directly on the Hadoop cluster. The data analysis methods will be extended towards advanced state-of-the-art analysis and mining. We plan to convert our measurement data to the SAX representation and to make it indexable using the iSAX approach. Furthermore, we will concentrate on the compact visual presentation of, and interaction with, statistical analysis data for the proposed new data structure. Promising starting points are the intelligent icons approach [13], VizTree [14] and the methods in [15], which could provide a semantic overview of our datasets and their similarity. We also plan to develop methods for semantic searches, search by data (similarity search), dataset recommenders, and classification as well as clustering using the metadata extracted by our analyses. Further optimizations of the data transfer between the EDRs and the HDFS storage, such as aggregating the XML blocks to fit the HDFS block size of 64 MB, are planned to speed up data processing. The security of the system is another aspect that needs to be dealt with seriously. We will have to provide secure archiving and querying. This may include the separation and anonymization of stored personal data, e.g. by using pseudonyms and secure but fast encryption methods.

VI. CONCLUSION

We presented our approach for high-rate voltage power network measurements using EDRs, together with the associated data archiving and analysis problems and methods. We compared multicore with distributed data processing for simple statistical analyses. For this purpose we implemented the statistical analysis functionality in Java both for the Hadoop cluster, using the Pig API, and for the local environment. We could show that for data sets in our format exceeding ~6.2 GiB, using Pig on our Hadoop cluster is superior to classical multicore processing with regard to computation speed and scalability. We also introduced a new time-oriented schema for the organization of raw time series and metadata, enabling fast and easy information retrieval.

REFERENCES

[1] J. Depablos, V. Centeno, A. G. Phadke, and M. Ingram, "Comparative testing of synchronized phasor measurement units," in IEEE Power Engineering Society General Meeting, 2004, Vol. 1.
[2] H. Maass, H. K. Çakmak, W. Suess, A. Quinte, W. Jakob, K. U. Stucky, and U. Kuehnapfel, "Introducing the Electrical Data Recorder as a New Capturing Device for Power Grid Analysis," in AMPS 2012 Proceedings, 2012.
[3] H. Maass, H. K. Çakmak, W. Suess, A. Quinte, W. Jakob, E. Müller, K. Boehm, K. U. Stucky, and U. Kuehnapfel, "Introducing a New Voltage Time Series Approach for Electrical Power Grid Analysis," presented at the 2nd IEEE ENERGYCON Conference & Exhibition / ICT for Energy Symposium, 2012.
[4] J. Patterson, "The Smart Grid: Hadoop at the Tennessee Valley Authority (TVA)," Cloudera. [Online]. [Accessed: 15-Aug-2012].
[5] Grid Protection Alliance (OpenPDC, OpenHist., ...). [Online]. [Accessed: 13-Sep-2012].
[6] iPDC - Free Phasor Data Concentrator. [Online]. [Accessed: 15-Aug-2012].
[7] "Lumberyard: Time Series Indexing at Scale," OSCON, O'Reilly Conferences, July 25-29, 2011, Portland, OR.
[8] J. Shieh and E. Keogh, "iSAX: disk-aware mining and indexing of massive time series datasets," Data Mining and Knowledge Discovery, vol. 19, no. 1, 2009.
[9] E. Keogh, J. Lin, and A. Fu, "HOT SAX: Efficiently finding the most unusual time series subsequence," in Proceedings of the Fifth IEEE International Conference on Data Mining, 2005.
[10] A. Gates, Programming Pig, O'Reilly Media, 2011.
[11] A. O. Garcia, S. Bourov, A. Hammad, V. Hartmann, T. Jejkal, J. C. Otte, S. Pfeiffer, T. Schenker, C. Schmidt, P. Neuberger, et al., "Data-intensive analysis for scientific experiments at the Large Scale Data Facility," in IEEE Symposium on Large Data Analysis and Visualization (LDAV), 2011.
[12] D. P. Chassin, K. Schneider, and C. Gerkensmeyer, "GridLAB-D: An open-source power systems modeling and simulation environment," in IEEE/PES Transmission and Distribution Conference and Exposition, 2008.
[13] L. Wei, E. Keogh, X. Xi, and S. Lonardi, "Integrating Lite-Weight but Ubiquitous Data Mining into GUI Operating Systems," Journal of Universal Computer Science, vol. 11, no. 11, 2005.
[14] VizTree. [Online]. [Accessed: 07-Nov-2012].
[15] W. Aigner, S. Miksch, W. Müller, H. Schumann, and C. Tominski, "Visualizing time-oriented data: A systematic view," Computers & Graphics, vol. 31, no. 3, June 2007.
