Science of Big Data: Background and Requirements


RADEK BENDA
Department of Statistics and Quantitative Methods
Tomas Bata University in Zlín
nam. T. G. Masaryka 5555, Zlin
CZECH REPUBLIC

Abstract: - The Big Data concept has progressively become the next evolutionary phase in batch processing, storing, manipulating and visualizing relations in vast numbers of records. Computational power required for these operations is usually abundant in organizations dealing with Big Data; the lack of experts with experience in statistics, data mining, machine learning and databases, however, currently poses a challenge. As investments from medium- and large-sized companies in the area are expected to grow substantially in the future, demand for data scientists will most certainly follow a similar trend. The article aims to present fundamental prerequisites for a data scientist as well as to provide an overview of massive database management tools, along with different approaches towards Big Data compared to classical database architectures.

Key-Words: - Big, data, database, knowledge, discovery, mining, processing

1 Introduction
Batch data processing has steadily become more popular due to decreasing prices of hardware components coupled with increasing computational power, freely available tools for analyses and results visualization, and electronisation and digitisation allowing for instantaneous remote data collection (cash registers, smart sensors, mobile devices) over the Internet regardless of geographic location, with subsequent storing in predefined database structures. Computational power follows the thesis presented by Gordon E. Moore, who stated that the complexity of a CPU (Central Processing Unit), represented by its transistor count, would rise thanks to decreasing production costs, and that at least in the short term it is reasonable to expect annual doubling and therefore increased power.
Having proved to be an astute observation, the thesis was later extended to various other hardware components, namely HDD (Hard-Disk Drive) capacity over time. Hard disk prices essentially follow the same pattern, with simultaneous capacity increments. Widespread availability of hardware, low storage access times, parallelization and suitable data structures are the main prerequisites for Big Data processing in real-time or near real-time. At the same time, organisations demand expertise combining advanced statistical analysis, data manipulation and understanding in order to visualise and present the results to higher-level management, who expect Big Data to provide a competitive advantage. The term data scientist appeared some time ago to denote an expert capable of discerning patterns applicable to organizational processes. The position's unequivocal delimitation, however, raises questions as to what skills and proficiencies a data scientist should possess. As no formal definition of a data scientist is known to the authors, it is only natural to attempt to propose the profile ourselves.
The article is structured as follows: in the second part, Big Data will be introduced together with differences from classical database models and the tools used for management and analyses. In the third part, minimum prerequisites for a data scientist will be delineated. Discussion and Conclusion summarizes the text and briefly hypothesises about future trends in the area. The importance of Big Data was demonstrated by an analysis which stipulated that in 2011, the amount of newly created and replicated data reached 1.8 ZB (1.8 trillion GB) in absolute terms [7]. Harnessing the knowledge within should be made a priority for every organization wishing to remain competitive.

2 Problem Formulation
The problem the article addresses is devising a concise profile for the data scientist, including the competencies, prerequisites and skills required.
Despite some efforts in the area by organizations seeking qualified professionals, no proposal has been industry-accepted as of yet. Considering Big

Data analysis to be a challenging area of expertise, fusing disciplines such as informatics, statistical analysis, visualization, and programming, the lack of a formal delimitation may become all the more pronounced in the future. The authors believe the problem to be both feasible and beneficial for practice.

3 Big Data
Academia did not reflect on the emergence of Big Data until around 2011, as evidenced by the limited number of articles on the topic. One of the reasons may be the fact that many researchers and even business representatives considered it a scaled version of classical databases with the same principles applicable. Multinational concerns and data-intensive research centres, however, took a different approach towards Big Data processing.
The predominant database model is currently the relational database, with a database schema visualizing data organization. Freely available RDBMS (Relational Database Management System) software packages (MySQL, PostgreSQL) are frequently used. For those preferring commercial alternatives, a plethora of options is available, too. Most of them offer standardized functions ensuring compatibility across platforms, but many specificities exist in particular products, the main one being incompatible data types for value storage. These differences may cause unexpected behaviour when a database is ported to a different platform, errors, or outright incorrect functionality, because the ANSI (American National Standards Institute) standard was released only after the first RDBMS tools were commercially launched. Vendors decided to implement it, but in order to secure compatibility for existing clients, the non-standard features were kept.
The ubiquitous industry-accepted standard was titled SQL (Structured Query Language), a relational database management programming language. Introduced in 1986, it has proved immensely popular for specifying queries which allow the user to obtain a subset of records complying with parameterised statements.
Apart from queries, SQL supports data manipulation and design routines. The latest revision was jointly published by ANSI and ISO (International Organization for Standardization) and labelled ISO/IEC 9075:2011. Unfortunately, as with the ANSI standard, SQL was incorporated differently by various vendors and cross-platform compatibility is thus not guaranteed.
The abovementioned inconsistencies led to the proposal of the NoSQL model in the late 1990s. Free from the necessity to standardize, it did not follow SQL rigorously, and later abandoned it and the relational database model altogether. NoSQL accentuates modularity and extensive system customization according to specific application needs, in contrast with the standardization efforts vendors have to undergo. The concept benefitted Big Data processing, which relies on case-by-case tweaking and design rather than commoditisation. In 2011, the UnQL (Unstructured Query Language) project was initiated, aiming at semi-structured and document database manipulation.
Unlike SQL, NoSQL does not guarantee ACID (atomicity, consistency, isolation, durability), which specifies the following properties each transaction (i.e., interaction with the database) has to fulfil:
- Atomicity: every transaction must be executed in whole; if any part fails, the database content is not modified in any way,
- Consistency: every write to a database has to pass all the requirements (such as correct data type) and thus be valid,
- Isolation: a transaction must not collide with any other, usually ensured by locking the columns in which alterations are to be made,
- Durability: after the transaction has been executed, all changes to the database have to remain permanent until the execution of another transaction modifying the content again [12].
Big Data management systems sometimes guarantee only a subset or none of the properties described above. Nevertheless, they usually contain dedicated mechanisms to keep data consistent and current for the users.
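The atomicity property described above can be sketched with SQLite, a lightweight RDBMS bundled with Python's standard library; the table, column names and amounts here are purely illustrative:

```python
# Atomicity sketch: two updates either succeed together or neither applies.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE id = 1")
        # This statement violates the NOT NULL constraint and aborts
        # the whole transaction, including the first update:
        conn.execute("UPDATE accounts SET balance = NULL WHERE id = 2")
except sqlite3.IntegrityError:
    pass

# Neither update took effect; the database content is unchanged.
balances = [row[1] for row in conn.execute("SELECT id, balance FROM accounts ORDER BY id")]
print(balances)  # [100, 50]
```

The consistency property is visible in the same snippet: the invalid write (a NULL where a value is required) is rejected outright.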
The need for Big Data management is accentuated by Web 2.0, where multimedia content is created and shared primarily by users over the Internet, by social networking, which generates an order of magnitude more data, by the proliferation of online shopping and the broadening of the Internet user base, as well as by the digitisation of public administration's and medical agendas.
Among Big Data's challenges is a suitable file system for their storage. A file system may be defined as follows: "At the highest level a file system is a way to organize, store, retrieve, and manage information on a permanent storage medium such as disk" [9]. Size is generally not a limiting factor or a constraint requiring additional resources in standard databases, so operating system-supported file systems are used. For larger databases, a data warehouse, an inherent feature of corporate information systems, is employed instead. A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support

of management's decision making process [13]. Data warehouses marked the advent of Big Data, as continuous data aggregation provided organizations with enormous record counts, not exploitable into practical knowledge without deep expertise. Many organizations previously invested in information technology infrastructure in hopes of gaining a competitive advantage, but without professionals the data remain usable only for static analyses limited by vendors' products. Due to file system properties, lags when executing queries may be noticeable when working with large datasets on a standard file system unsuitable for Big Data. Therefore, dedicated solutions were proposed, alleviating the problem while introducing additional features.

3.1 GFS, BigTable
Some companies introduced dedicated file systems natively capable of handling Big Data, driven by demands on storage capacities and acceptable access times. One of them is GFS (Google File System), whose designers stated the following properties: the system is built from many inexpensive components that often fail; it stores a modest number of large files (millions, in the range of 100 MB to several GB); multiple clients append to the same files; the workloads consist primarily of large streaming reads and small random reads; writes are large, sequential, and append data; high sustained bandwidth is more important than low latency [8]. Obviously, the majority of these properties may be applied to databases, even the sustained bandwidth constraint, as large volumes of data are usually transferred.
Google utilizes GFS as the infrastructure for an in-house developed DBMS titled BigTable, based on a three-layered architecture as depicted in Figure 1. Every tablet, the elementary data storage unit, is served by a single server which may be dynamically allocated or deallocated on-the-fly, with the file system providing sufficient redundancy to facilitate access to the database for all users [6].

Fig. 1: BigTable's tablet structure (Chang et al., 2006).
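The three-layered lookup can be illustrated with a toy sketch: a root node points to a metadata layer, which maps row-key ranges to the tablets serving them. All names, ranges and values below are invented for illustration and do not reflect BigTable's actual API.

```python
# Toy three-layer lookup: root -> metadata -> tablet -> value.
tablets = {
    "tablet-A": {"row001": "alpha", "row042": "beta"},
    "tablet-B": {"row107": "gamma"},
}

# Metadata layer: each entry maps a row-key range to the tablet serving it.
metadata = [
    (("row000", "row099"), "tablet-A"),
    (("row100", "row199"), "tablet-B"),
]

root = metadata  # a single root node referencing the metadata layer

def lookup(row_key):
    """Walk root -> metadata -> tablet to find the value for a row key."""
    for (low, high), tablet_id in root:
        if low <= row_key <= high:
            return tablets[tablet_id].get(row_key)
    return None

print(lookup("row042"))  # beta
print(lookup("row107"))  # gamma
```

Because the tablet layer is just a mapping, a tablet server can be reassigned by updating the metadata entry, mirroring the dynamic allocation described above.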
Metadata ("data about data") form an integral part of every data collection, aggregating information about the content associated with it.

3.2 Apache Hadoop
As GFS and BigTable have not been released commercially and are not freely available, further research in Big Data based on findings from Google led to the inception of the Hadoop project, available on a non-commercial basis. Built on a three-layered model (as seen in Figure 1), it has a wide user base, primarily in distributed data processing applications. Hadoop's designers stipulated very similar properties to those of GFS: hardware failure; streaming data access; large data sets; a simple coherency model; "moving computations is cheaper than moving data"; portability across heterogeneous hardware and software platforms [2]. Support for hundreds of thousands of concurrently accessed files, each typically GB (1024 MB) to TB (1024 GB) in size, is imperative. Hadoop allows real-time processing, analysing and storing of the data in conjunction with a NoSQL frontend. It uses its own file system, HDFS (Hadoop Distributed File System), enabling the distribution of Big Data computational tasks to dedicated nodes in a server configuration [1]. Several high-profile companies run Hadoop, such as Microsoft, Apple, Facebook, Amazon.com, Hewlett-Packard, IBM or LinkedIn.
Other Big Data-enabled file systems include Amazon Dynamo, whose main feature is a focus on the ability to write into the database at any moment, an approach opposite to BigTable and Hadoop, both of which prefer throughput.

3.3 MapReduce
The third component of efficient Big Data processing is a framework supporting heavily parallelized computations. A concurrency model during data mining is not merely a recommendation for Big Data but a necessity. Otherwise, the bottleneck would not be the parameters connected to Big Data storage (input/output operations, file system properties) but insufficient processing power.
A universal approach, originally proposed by engineers at Google, was titled MapReduce. It is based on the divide-and-conquer principle, the only currently known paradigm for working with large data sets. The originally time-infeasible problem is divided into independent parts, each of which may be processed separately by threads, cores, processors or servers; the results are aggregated when all the workers finish and subsequently presented to the user.
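The divide-and-conquer flow can be sketched as a minimal word count using Python's standard library, with threads standing in for the workers. Real frameworks add scheduling, fault tolerance and distributed storage; the corpus below is invented for illustration.

```python
# Minimal word-count sketch of the divide-and-conquer flow.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_chunk(lines):
    """Map: count words in one independent part of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    """Reduce: aggregate per-chunk results into the final answer."""
    total = Counter()
    for part in partials:
        total.update(part)
    return total

corpus = ["big data big tools", "data science", "big questions"]
chunks = [corpus[:2], corpus[2:]]            # divide into independent parts
with ThreadPoolExecutor() as pool:           # workers process parts separately
    partials = list(pool.map(map_chunk, chunks))
result = reduce_counts(partials)             # aggregate when all workers finish
print(result["big"])  # 3
```

The same structure scales from threads to processes or servers because the map workers share no state; only the partial results travel back for reduction.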

The MapReduce algorithm comprises two separate primitives:
- Map: partitioning the instance into independent parts and distributing them to worker nodes, which may either process the batch themselves or delegate the process further on,
- Reduce: requesting, aggregating, and presenting the results to the user as a return value [5].
MapReduce is universally applicable, but Big Data in particular benefit from high-processing-power platforms such as the cloud, which stores the data outside the organization, thereby decreasing IT TCO (Information Technology Total Cost of Ownership). Cloud resources are accessible regardless of geographical location or time zone via the Internet. Commercial and scientific data mining applications may benefit from the fact that MapReduce modules were ported to numerous high-level programming languages (see Chapter 4.2), greatly simplifying manipulation with large data sets.
Compared to classical relational models, MapReduce does not support indexing or database schemas. Results from a recent comparative study showed that for some operations, classical DBMS tools should be preferred performance-wise over MapReduce and Hadoop [14]. The study's authors suggested that DBMSs are heavily optimized thanks to years of practice, but also noted the simple set-up of both MapReduce and Hadoop in relation to DBMSs. Hadoop implements its own version of MapReduce.

4 Requirements for a Data Scientist
The term data science is not at all new; the Data Science Journal, focusing on original research in the field of Big Data, has been released since . The term data scientist does not have, as far as the authors know, a formal, widely accepted or preferred delimitation, nor even a requirements list for prospective employees. In this section, a collection of prerequisites necessary for thorough and relevant Big Data analyses will be provided. However, proficiency in some areas alone is not sufficient. Only by combining the knowledge is it possible to achieve synergic effects.
An important feature of a data scientist should also be experience and practical problem-solving skills.

4.1 Database Management
Systems for Big Data management and processing, as introduced in Chapter 3, are distinct from the standard RDBMS paradigm, but are nevertheless inspired by a variety of its procedures. Transforming different kinds of data into a single repository (typically a data warehouse) relies on correct execution of the ETL (Extract, Transform, Load) phase. Despite tools automating some steps, human intervention forms a pivotal element. Aggregation from heterogeneous sources presupposes knowledge of file system structures, HTML (HyperText Markup Language), XML (Extensible Markup Language) and CSV (Comma-Separated Values), as well as unification and exporting into common structures. The transformation itself is a time-intensive process, depending on the extent of the ETL, parallelization, storage access times and infrastructure locality (outsourcing into the cloud, moving to local storage capacities). On non-parallel systems, hours and even days are a fair estimate of the total time required for batch Big Data processing. A single dedicated station is therefore not a viable alternative to a server configuration.
A sound basis in working with database systems, SQL syntax and modification through DML (Data Manipulation Language) and DDL (Data Definition Language) is a must. Collaborative activities across different positions, especially when dealing with sensitive data, may further utilize DCL (Data Control Language).

4.2 Programming Languages
A suitable programming language is recommended for streamlining the ETL process. Higher-level languages (Perl, Python) allow data to be treated as objects for which many operations are permissible, reducing the time involved significantly. In comparison with lower-level languages (MASM, FASM), the syntax is self-explanatory and intuitive, as the commands are based on natural human language.
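Both points can be illustrated together: the ETL phase discussed above, expressed in a few lines of a higher-level language. This is a hedged sketch; the source data, table and column names are invented for illustration.

```python
# Compact ETL sketch: extract from a CSV source, transform (type
# enforcement, dropping malformed records), load into a warehouse table.
import csv
import io
import sqlite3

raw = "customer,amount\nalice,120.5\nbob,not-a-number\ncarol,80\n"

# Extract: read the heterogeneous source (here an in-memory CSV).
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: enforce types and skip malformed records.
clean = []
for row in rows:
    try:
        clean.append((row["customer"], float(row["amount"])))
    except ValueError:
        continue  # in practice, log the bad row instead of dropping it silently

# Load: DDL creates the target structure, DML inserts the data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")  # DDL
db.executemany("INSERT INTO purchases VALUES (?, ?)", clean)       # DML
db.commit()

total = db.execute("SELECT SUM(amount) FROM purchases").fetchone()[0]
print(total)  # 200.5
```

The extract, transform and load steps each fit in a handful of statements because the records are handled as objects, which is precisely the time saving attributed to higher-level languages above.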
The main disadvantage of higher-level programming languages is the "abstraction penalty": lower-level languages provide facilities which provably execute programs using fewer resources [4]. However, increases in processing power have rendered this shortcoming largely irrelevant. Other higher-level programming languages include C and its derivatives (C++, C#, Objective-C), Java, PHP, and Delphi. For the purposes of statistical processing, dedicated environments (R, Julia) are freely available. A data scientist should be comfortable working without a GUI (Graphical User Interface) in CLI (Command Line Interface) mode, have a thorough understanding of Unix-based operating systems, and be comfortable interacting with various API (Application Programming Interface) modules.
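The CLI comfort mentioned above often means writing small filters that sit inside a Unix pipeline. A hedged sketch follows; the field layout, threshold and sample records are invented, and the script is exercised on an in-memory sample rather than a real pipe (with `--stdin` it would read standard input instead, e.g. `cat sales.csv | python filter.py --stdin`).

```python
# A pipeline-style filter: keep CSV records whose second field exceeds
# a threshold, skipping malformed lines.
import sys

THRESHOLD = 100.0

def filter_lines(lines):
    """Keep CSV lines whose second field parses as a number above THRESHOLD."""
    kept = []
    for line in lines:
        fields = line.strip().split(",")
        try:
            if float(fields[1]) > THRESHOLD:
                kept.append(line.strip())
        except (IndexError, ValueError):
            continue  # skip malformed records
    return kept

sample = ["a,120.5", "b,80.0", "c,150.2"]
source = sys.stdin if "--stdin" in sys.argv else sample
for line in filter_lines(source):
    print(line)
```

Chaining such filters with standard tools (sort, uniq, head) is a routine way to pre-inspect large datasets before a full analysis.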

4.3 Statistical Analysis
A fundamental requirement is also proficiency in statistical analysis: internalized hypothesis testing, DOE (Design of Experiments) and the application of correct tests to different kinds of data. Data mining, the analytical part of KDD (Knowledge Discovery in Databases), fuses statistics with AI (Artificial Intelligence), machine learning and data visualization techniques. As Big Data started to appear as early as the 1990s, it is by no means a recently discovered trend, and statistics, which deals with phenomena observable in large populations, is a logical complement. Data mining comprises six distinct phases: anomaly (outlier) detection, association rule learning, cluster analysis, statistical classification, regression analysis and summarization [10].
Without statistical analysis, programming tools are incapable of interpreting the results in the context of the selected problem, despite advances in machine learning and AI. If the postulate that data science is data mining's next evolutionary step holds, the prerequisites recommended for the latter practice may be usable in the former as well. Such an outline was devised as a result of collaboration among students, teachers and experts [3]. It emphasises merging theoretical background with practical training and, apart from the phases listed earlier, consists of ETL techniques, data warehouse analyses, OLAP (Online Analytical Processing) and data cube creation, and harvesting knowledge from textual and web sources, along with time series analysis.

4.4 Visualizing Tools
Concise and unequivocal presentation of results is essential for the receiver. While a technically correct analysis is a prime requirement for methodically adequate output, the way it is communicated is also a part of KDD. In the last few years, visualisation has become an emerging field as visualisation and graphical aggregation tools were introduced, a trend attributed among other factors to social network analyses.
Data visualisation is understood as the science of visual representation of data, defined as "information which has been abstracted in some schematic form, including attributes and variables for the units of information" [11]. Big Data presentation in the mass media (TV, newspapers, the Internet) often utilizes the information graphics (infographics) technique. Academia, though, shows only marginal interest in the area as measured by unique article counts, in spite of it currently being a tool of choice for Big Data visualization. Visualisation tools such as Gephi, ggplot2 and VisIt are distributed on a non-commercial basis.

5 Discussion and Conclusion
The shift towards Big Data is progressive, and scarcely any arguments can be made supporting the claim that this is about to change in the future. Big Data already exist in business organizations, schools, medical centres and statistical bureaus, and are aggregated on the Internet, on social platforms, in online payment processing and shopping systems, and in science (space exploration, statistics, particle physics, economics). It is, however, imperative to harvest relevant and applicable knowledge, which may boost customer relationship and SCM (Supply Chain Management) activities, predict demand for products and determine favoured combinations of products or services, simplify and direct in-store orientation towards frequently purchased items, enable medical personnel to predict diagnoses by querying symptoms from the database, provide relevant, to-the-point online recommendations based on the contents of a shopping cart, etc. The data scientist's goal is to uncover these patterns inside vast amounts of records. The lacking formal delimitation may be explained by the fact that every organization necessitates an individual approach, which hampers unified profiling of basic requirements.
The authors are convinced a widely acceptable prerequisites list may help specify the abstract term, ensure an adequate level of competencies in students, and serve as a framework for professionals wishing to continually develop their skills. Universities may modify curricula by incorporating courses on statistical analysis and informatics. Data science is perceived as a highly technical discipline, and its inclusion at economic faculties is unlikely. Paradoxically, economic data (time series, banking statistics, purchasing records, marketing events) form a highly suitable and desirable area in which data scientists may operate. Broad specialization should therefore be reconsidered in favour of a focus on promising, developing areas where a surplus of labour demand is to be expected in the foreseeable future.
Two milestones are expected to take place: a massive increase of Big Data from mobile sources (mobile analytics) and a transfer to distributed processing models, specifically cloud computing. The former will require data scientists able to draw not only generalized but also geographically targeted outputs, for instance devising an optimum product mix shipped to various continents or even countries based on historical as well as real-time data. The latter allows Big Data analyses to be performed without purchasing the hardware and software

infrastructure, both supplied on a pay-per-use scheme. The cloud is a platform supporting flexible Big Data processing due to lower TCO and managed resource consumption. Purchasing a single instance is a financially efficient alternative for organizations dealing with large data sets at discrete intervals, as investments into information technology may not be justifiable in such a case. The future of Big Data processing is thus open, which, in conjunction with extensive application possibilities, helps to foster both the education of future practitioners and closer cooperation between practice and academia.

References:
[1] Bakshi, Kapil. (2012). Considerations for Big Data: Architecture and Approach, 2012 IEEE Aerospace Conference, March 3-10, 2012, Big Sky, Montana.
[2] Borthakur, Dhruba. (2008). HDFS Architecture Design, [online]. The Apache Software Foundation. Available at: rent/hdfs_design.pdf [Accessed on ]
[3] Chakrabarti, Soumen, Ester, Martin, Fayyad, Usama, Gehrke, Johannes, Han, Jiawei. (2006). Data Mining Curriculum: A Proposal (Version 1.0), [online]. ACM Special Interest Group on Knowledge Discovery and Data Mining. Available at: [Accessed on ]
[4] Chatzigeorgiou, Alexander, and Stephanides, George. (2002). Evaluating Performance and Power of Object-Oriented Vs. Procedural Programming in Embedded Processors, 7th International Conference on Reliable Software Technologies (Ada-Europe 2002), June 17-21, 2002, Vienna, Austria.
[5] Dean, Jeffrey, and Ghemawat, Sanjay. (2004). MapReduce: Simplified Data Processing on Large Clusters, 6th USENIX Symposium on Operating Systems Design & Implementation (OSDI 04), December 6-8, 2004, San Francisco, California.
[6] Chang, Fay, Dean, Jeffrey, Ghemawat, Sanjay, Hsieh, Wilson C., Wallach, Deborah A. et al. (2006). Bigtable: A Distributed Storage System for Structured Data, 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI 06), November 6-8, 2006, Seattle, Washington.
[7] Gantz, John, and Reinsel, David. (2011).
Extracting Value from Chaos, [online]. Framingham: IDC. Available at: [Accessed on ]
[8] Ghemawat, Sanjay, Gobioff, Howard, and Leung, Shun-Tak. (2003). The Google File System, 19th ACM Symposium on Operating System Principles (SOSP 2003), October 22-23, 2003, Lake George, New York.
[9] Giampaolo, Dominic. (1999). Practical File System Design with the Be File System. San Francisco: Morgan Kaufmann Publishers.
[10] Fayyad, Usama, Piatetsky-Shapiro, Gregory, and Smyth, Padhraic. (1996). From Data Mining to Knowledge Discovery in Databases, AI Magazine, Vol. 17, No. 3, pp.
[11] Friendly, Michael. (2009). Milestones in the history of thematic cartography, statistical graphics, and data visualization, [online]. Ontario: York University. Available at: SCS/Gallery/milestone/milestone.pdf [Accessed ]
[12] Haerder, Theo, and Reuter, Andreas. (1983). Principles of Transaction-Oriented Database Recovery, ACM Computing Surveys, Vol. 14, No. 4, pp.
[13] Inmon, William H. (2005). Building the Data Warehouse. New Jersey: Wiley.
[14] Pavlo, Andrew, Paulson, Erik, Rasin, Alexander, Abadi, Daniel J., DeWitt, David J. et al. (2009). A Comparison of Approaches to Large-Scale Data Analysis, 2009 ACM SIGMOD/PODS Conference, June 29-July 2, 2009, Providence, Rhode Island.


A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

An Approach to Implement Map Reduce with NoSQL Databases

An Approach to Implement Map Reduce with NoSQL Databases www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh

More information

Big Data - Infrastructure Considerations

Big Data - Infrastructure Considerations April 2014, HAPPIEST MINDS TECHNOLOGIES Big Data - Infrastructure Considerations Author Anand Veeramani / Deepak Shivamurthy SHARING. MINDFUL. INTEGRITY. LEARNING. EXCELLENCE. SOCIAL RESPONSIBILITY. Copyright

More information

Figure 1 Cloud Computing. 1.What is Cloud: Clouds are of specific commercial interest not just on the acquiring tendency to outsource IT

Figure 1 Cloud Computing. 1.What is Cloud: Clouds are of specific commercial interest not just on the acquiring tendency to outsource IT An Overview Of Future Impact Of Cloud Computing Shiva Chaudhry COMPUTER SCIENCE DEPARTMENT IFTM UNIVERSITY MORADABAD Abstraction: The concept of cloud computing has broadcast quickly by the information

More information

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing

More information

Manifest for Big Data Pig, Hive & Jaql

Manifest for Big Data Pig, Hive & Jaql Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,

More information

Scalable Multiple NameNodes Hadoop Cloud Storage System

Scalable Multiple NameNodes Hadoop Cloud Storage System Vol.8, No.1 (2015), pp.105-110 http://dx.doi.org/10.14257/ijdta.2015.8.1.12 Scalable Multiple NameNodes Hadoop Cloud Storage System Kun Bi 1 and Dezhi Han 1,2 1 College of Information Engineering, Shanghai

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

Data Refinery with Big Data Aspects

Data Refinery with Big Data Aspects International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

Application Development. A Paradigm Shift

Application Development. A Paradigm Shift Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

Chapter 6. Foundations of Business Intelligence: Databases and Information Management Chapter 6 Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data INFO 1500 Introduction to IT Fundamentals 5. Database Systems and Managing Data Resources Learning Objectives 1. Describe how the problems of managing data resources in a traditional file environment are

More information

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology,

More information

Introduction to Hadoop

Introduction to Hadoop 1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools

More information

The basic data mining algorithms introduced may be enhanced in a number of ways.

The basic data mining algorithms introduced may be enhanced in a number of ways. DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,

More information

Massive Data Storage

Massive Data Storage Massive Data Storage Storage on the "Cloud" and the Google File System paper by: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung presentation by: Joshua Michalczak COP 4810 - Topics in Computer Science

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

More information

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth MAKING BIG DATA COME ALIVE Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth Steve Gonzales, Principal Manager steve.gonzales@thinkbiganalytics.com

More information

Big Data Storage Architecture Design in Cloud Computing

Big Data Storage Architecture Design in Cloud Computing Big Data Storage Architecture Design in Cloud Computing Xuebin Chen 1, Shi Wang 1( ), Yanyan Dong 1, and Xu Wang 2 1 College of Science, North China University of Science and Technology, Tangshan, Hebei,

More information

Generic Log Analyzer Using Hadoop Mapreduce Framework

Generic Log Analyzer Using Hadoop Mapreduce Framework Generic Log Analyzer Using Hadoop Mapreduce Framework Milind Bhandare 1, Prof. Kuntal Barua 2, Vikas Nagare 3, Dynaneshwar Ekhande 4, Rahul Pawar 5 1 M.Tech(Appeare), 2 Asst. Prof., LNCT, Indore 3 ME,

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Hosting Transaction Based Applications on Cloud

Hosting Transaction Based Applications on Cloud Proc. of Int. Conf. on Multimedia Processing, Communication& Info. Tech., MPCIT Hosting Transaction Based Applications on Cloud A.N.Diggikar 1, Dr. D.H.Rao 2 1 Jain College of Engineering, Belgaum, India

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

CitusDB Architecture for Real-Time Big Data

CitusDB Architecture for Real-Time Big Data CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing

More information

BIG DATA: A CASE STUDY ON DATA FROM THE BRAZILIAN MINISTRY OF PLANNING, BUDGETING AND MANAGEMENT

BIG DATA: A CASE STUDY ON DATA FROM THE BRAZILIAN MINISTRY OF PLANNING, BUDGETING AND MANAGEMENT BIG DATA: A CASE STUDY ON DATA FROM THE BRAZILIAN MINISTRY OF PLANNING, BUDGETING AND MANAGEMENT Ruben C. Huacarpuma, Daniel da C. Rodrigues, Antonio M. Rubio Serrano, João Paulo C. Lustosa da Costa, Rafael

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Oracle s Big Data solutions. Roger Wullschleger.

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here> s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline

More information

How to Enhance Traditional BI Architecture to Leverage Big Data

How to Enhance Traditional BI Architecture to Leverage Big Data B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...

More information

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume

More information

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance

More information

Mobile Cloud Computing for Data-Intensive Applications

Mobile Cloud Computing for Data-Intensive Applications Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, vct@andrew.cmu.edu Advisor: Professor Priya Narasimhan, priya@cs.cmu.edu Abstract The computational and storage

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel

A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel A Next-Generation Analytics Ecosystem for Big Data Colin White, BI Research September 2012 Sponsored by ParAccel BIG DATA IS BIG NEWS The value of big data lies in the business analytics that can be generated

More information

USING BIG DATA FOR INTELLIGENT BUSINESSES

USING BIG DATA FOR INTELLIGENT BUSINESSES HENRI COANDA AIR FORCE ACADEMY ROMANIA INTERNATIONAL CONFERENCE of SCIENTIFIC PAPER AFASES 2015 Brasov, 28-30 May 2015 GENERAL M.R. STEFANIK ARMED FORCES ACADEMY SLOVAK REPUBLIC USING BIG DATA FOR INTELLIGENT

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES Relational vs. Non-Relational Architecture Relational Non-Relational Rational Predictable Traditional Agile Flexible Modern 2 Agenda Big Data

More information

TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP

TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP Pythian White Paper TAMING THE BIG CHALLENGE OF BIG DATA MICROSOFT HADOOP ABSTRACT As companies increasingly rely on big data to steer decisions, they also find themselves looking for ways to simplify

More information

Scala Storage Scale-Out Clustered Storage White Paper

Scala Storage Scale-Out Clustered Storage White Paper White Paper Scala Storage Scale-Out Clustered Storage White Paper Chapter 1 Introduction... 3 Capacity - Explosive Growth of Unstructured Data... 3 Performance - Cluster Computing... 3 Chapter 2 Current

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Wienand Omta Fabiano Dalpiaz 1 drs. ing. Wienand Omta Learning Objectives Describe how the problems of managing data resources

More information

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem:

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem: Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Chapter 6 Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

Transforming the Telecoms Business using Big Data and Analytics

Transforming the Telecoms Business using Big Data and Analytics Transforming the Telecoms Business using Big Data and Analytics Event: ICT Forum for HR Professionals Venue: Meikles Hotel, Harare, Zimbabwe Date: 19 th 21 st August 2015 AFRALTI 1 Objectives Describe

More information

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances INSIGHT Oracle's All- Out Assault on the Big Data Market: Offering Hadoop, R, Cubes, and Scalable IMDB in Familiar Packages Carl W. Olofson IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

Cloud Database Emergence

Cloud Database Emergence Abstract RDBMS technology is favorable in software based organizations for more than three decades. The corporate organizations had been transformed over the years with respect to adoption of information

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Data Warehousing and Data Mining in Business Applications

Data Warehousing and Data Mining in Business Applications 133 Data Warehousing and Data Mining in Business Applications Eesha Goel CSE Deptt. GZS-PTU Campus, Bathinda. Abstract Information technology is now required in all aspect of our lives that helps in business

More information

Big Data Analysis using Hadoop components like Flume, MapReduce, Pig and Hive

Big Data Analysis using Hadoop components like Flume, MapReduce, Pig and Hive Big Data Analysis using Hadoop components like Flume, MapReduce, Pig and Hive E. Laxmi Lydia 1,Dr. M.Ben Swarup 2 1 Associate Professor, Department of Computer Science and Engineering, Vignan's Institute

More information

Affordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale

Affordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale WHITE PAPER Affordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale Sponsored by: IBM Carl W. Olofson December 2014 IN THIS WHITE PAPER This white paper discusses the concept

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Database Management. Chapter Objectives

Database Management. Chapter Objectives 3 Database Management Chapter Objectives When actually using a database, administrative processes maintaining data integrity and security, recovery from failures, etc. are required. A database management

More information

Big Data and Hadoop with Components like Flume, Pig, Hive and Jaql

Big Data and Hadoop with Components like Flume, Pig, Hive and Jaql Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 7, July 2014, pg.759

More information

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D. Big Data Technology ดร.ช ชาต หฤไชยะศ กด Choochart Haruechaiyasak, Ph.D. Speech and Audio Technology Laboratory (SPT) National Electronics and Computer Technology Center (NECTEC) National Science and Technology

More information

Turning Big Data into Big Insights

Turning Big Data into Big Insights mwd a d v i s o r s Turning Big Data into Big Insights Helena Schwenk A special report prepared for Actuate May 2013 This report is the fourth in a series and focuses principally on explaining what s needed

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900)

Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900) Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900) Ian Foster Computation Institute Argonne National Lab & University of Chicago 2 3 SQL Overview Structured Query Language

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Content Problems of managing data resources in a traditional file environment Capabilities and value of a database management

More information

Performance and Scalability Overview

Performance and Scalability Overview Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics Platform. Contents Pentaho Scalability and

More information

Study and Analysis of Data Mining Concepts

Study and Analysis of Data Mining Concepts Study and Analysis of Data Mining Concepts M.Parvathi Head/Department of Computer Applications Senthamarai college of Arts and Science,Madurai,TamilNadu,India/ Dr. S.Thabasu Kannan Principal Pannai College

More information

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop

More information

Data Mining System, Functionalities and Applications: A Radical Review

Data Mining System, Functionalities and Applications: A Radical Review Data Mining System, Functionalities and Applications: A Radical Review Dr. Poonam Chaudhary System Programmer, Kurukshetra University, Kurukshetra Abstract: Data Mining is the process of locating potentially

More information

American International Journal of Research in Science, Technology, Engineering & Mathematics

American International Journal of Research in Science, Technology, Engineering & Mathematics American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

Multi-Site Software Development It s Not Just Replication Anymore

Multi-Site Software Development It s Not Just Replication Anymore Multi-Site Software Development It s Not Just Replication Anymore An MKS White Paper By David J. Martin Vice President Product Management Multi-site Software Development It s Just Not Replication Anymore

More information

NoSQL Data Base Basics

NoSQL Data Base Basics NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS

More information

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford SQL VS. NO-SQL Adapted Slides from Dr. Jennifer Widom from Stanford 55 Traditional Databases SQL = Traditional relational DBMS Hugely popular among data analysts Widely adopted for transaction systems

More information

Jagir Singh, Greeshma, P Singh University of Northern Virginia. Abstract

Jagir Singh, Greeshma, P Singh University of Northern Virginia. Abstract 224 Business Intelligence Journal July DATA WAREHOUSING Ofori Boateng, PhD Professor, University of Northern Virginia BMGT531 1900- SU 2011 Business Intelligence Project Jagir Singh, Greeshma, P Singh

More information