1 Placing Big Data in Official Statistics: A Big Challenge? Monica Scannapieco 1, Antonino Virgillito 2, Diego Zardetto 3 1 Istat - Italian National Institute of Statistics, 2 Istat - Italian National Institute of Statistics, 3 Istat - Italian National Institute of Statistics, Abstract Big Data refers to data sets that are impossible to store and process using common software tools, regardless of the computing power or the physical storage at hand. The Big Data phenomenon is rapidly spreading in several domains including healthcare, communication, social sciences and life sciences, just to cite some relevant examples. However, the opportunities and challenges of Big Data in Official Statistics are matter of an open debate that involves both statisticians and IT specialists of National Statistical Institutes. In this paper, we first analyse the concept of Big Data under the Official Statistics perspective by identifying its potentialities and risks. Then, we provide some examples that show how NSIs can deal with such a new paradigm by adopting Big Data technologies, on one side, and rethinking methods to enable sound statistical analyses on Big Data, on the other side. Keywords: Web data sources, large scale data processing, data and information quality. 1. Introduction All the estimates about the quantity of data produced every day in the world reveal impressive figures. Streams of data flow unceasingly from miscellaneous sources, such as sensors, GPS trackers, users posting messages on social networks, Internet documents, log files, scientific experiments and more. The problem of extracting useful insight from this often unstructured mass of information poses major technical and organizational challenges. The term Big Data is used for identifying structured or unstructured data sets that are impossible to store and process using common software tools (e.g. relational databases), regardless of the computing power or the physical storage at hand. The size of data, typically spanning dimensions of tera- and peta-bytes orders of magnitude, is not the only aspect that make data Big. Indeed, the problem of feasibility in treating data increases when data sets grow continuously over time while a timely processing is necessary for producing business value. In contexts such as search engines and social networks, where the amount of content to be stored and indexed (Internet documents, user data, user searches) has become overwhelming, it is mandatory to exploit techniques such as parallel computation and distribution on large clusters of computers.
2 Nevertheless, the availability of large quantities of data opens up new possibilities for creating new types of analysis and extracting new information. Examples are appearing in many domains, from healthcare to communication to social sciences, where the analysis of massive data sets allows to unveil hidden information in form of, e.g., clusters and patterns. A popular application of this paradigm is the extraction and the analysis of messages posted to social networks for identifying trends in opinions ( opinion mining ). For example, the days before the 2012 U.S. presidential elections, several news web sites provided their own real-time polls extracted from the analysis of Twitter posts. In general, the Big Data upsurge provides us with two innovation directions: (i) the disposal of new high-performance techniques and tools for storage and processing and (ii) the opportunity for new types of analysis and products. As far as Official Statistics is concerned, the question arises about what could be the impact of this innovation on production processes. The High-Level Group for the Modernisation of Statistical Production and Services (formerly known as HLG-BAS) recently recognized that the abundance of data available nowadays should trigger a profound rethink of products offered by NSIs, and highlighted the risk for NSIs to be superseded by non-official statistics providers that are more ready to exploit the opportunities offered by Big Data (HLG-BAS 2011). However, it is not obvious to derive from such a general statement actual directions in research and production. In this paper we analyze the general issue of exploiting Big Data in the context of Official Statistics from the perspective of the support IT can offer in processing large amounts of data. We first identify the challenges and the opportunities that can stem from the introduction of Big Data in statistical processes. Then, we provide some practical examples on how some established statistical techniques and practices could be revisited through the introduction of parallel computation tools and huge data sets. 2. Challenges Using Big Data in Official Statistics poses several challenges that are, on one side, related to the inherent complexity of this data management paradigm, and, on the other side, related to the official nature of the products provided by NSIs. We envision four major challenges for Big Data usage in Official Statistics, namely: (i) Dimension, (ii) Quality, (iii) Time dependence, (iv) Accessibility. Among these, Quality and Accessibility seem to be even more critical, as we will argue later on in the paper. The Dimension challenge is directly related to the Big characteristic, where Big should not only be intended as the absolute size of the considered datasets (i.e., huge number of records and/or columns), but also the fact that these can continuously grow and change from time to time, such as in a continuous stream, and then reach a non-predictable (and eventually non-treatable) size. Dimension has an impact on: (i) storage of data and (ii) processing of data, both considered in the following.
3 2.1. Dimension Challenge: Storing Big Data Storage of Big Data requires advanced data architectures that can be supported by recent technologies involving hardware as well as software. Probably the most popular technology related to storage of Big Data is the distributed file system abstraction, the most popular incarnation of which being the Hadoop project. Hadoop is an open-source software project, developed by the Apache Software Foundation, targeted at supporting the execution of data-oriented application on clusters of generic hardware. The core component of the Hadoop architecture is a distributed file system implementation, namely HDFS. HDFS realizes a file system abstraction on top of a cluster of machines. Its main feature is the ability to scale to a virtually unlimited storage capacity by simply adding new machines to the cluster at any time. The framework transparently handles data distribution among machines and fault-tolerance (through replication of files on multiple machines). A distributed file system is restricted to handle and store files rather than structured or semi-structured data, then it is not suited to be directly used in statistical applications for querying, processing and analysis. The distributed file systems is a low-level abstraction that is mainly intended to represent the basic building block of cluster management, to be exploited by other services. For example, in the Hadoop architecture, HDFS is the foundation of the other main component of the architecture, namely MapReduce. We will examine MapReduce later, when discussing the issue of processing Big Data. Other kinds of distributed storage systems exist, specifically targeted at handling huge amounts of data, and based on loosely-structured data models that allow for storage and querying through user-oriented high-level languages. Such systems are commonly identified with the umbrella term NoSQL databases ( not only SQL ), that embraces products based on different technologies and data models 1. Two of such systems, namely HBase and Hive, are part of the Hadoop ecosystem and represent high-level interfaces to HDFS. Other popular NoSQL databases are Google BigTable, Cassandra, MongoDB. In very general terms, the main difference between a NoSQL data model and the relational model is that in the former the data schema is not fixed a priori but can be flexibly adapted to the actual form of the data. For example, two records in the same dataset can have a different number of columns or a same column can contain data of different type. Such a flexibility is the key for reaching high scalability: physical data representation is simplified, few constraints on data exist, and data can be easily partitioned and stored on different machines. The obvious drawback lies in the lack of strong consistency guarantees (e.g., ACID properties and foreign key constraints) and the difficulty of performing joins between different datasets. NoSQL databases are powerful tools for storing and querying large datasets and can be effectively exploited for statistical purposes. The focus is on the treatment of non-transactional data to be used for aggregation and mining, or even simply as a source for other data processing steps. Other examples of technologies that can aid the storage of Big Data include: database machines (like Oracle Exadata) and vertical DBMSs (like MS PowerPivot support 1 A survey on the different types of data models used in NoSQL databases is out of the scope of this paper.
4 engine). Despite the availability on the IT market of such solutions, their usage by NSIs is at an early stage. As far as Istat is concerned, there are however some notable examples: (i) Istat is starting a migration towards the database machine Oracle Exadata of its whole data asset; (ii) current Population Census activities are supported by Microsoft Business Intelligence, including PowerPivot technology Dimension Challenge: Processing Big Data Processing of Big Data is a hard task both in terms of feasibility and efficiency. The feasibility of solving a specific processing problem not only depends on the absolute size of the input datasets but is also related to the inherent complexity of the algorithm. If an algorithm is highly demanding in terms of memory and processor usage, data quickly can become big even if the input size is treatable for storage. We expand this concept through the examples presented in Section 3. The other challenge in processing of Big Data is related to efficiency, in the sense that even if a processing task can be handled with the computing resources at hand, the time required for completing the task should be reasonable. The basic idea behind the processing of Big Data consists in distributing the processing load among different machines in a cluster, so that the complexity of the computation can be shared. The number of the machines can be adapted in order to fit to the complexity of the task, the input size and the expected processing time. Besides managing distributed storage, a technological platform such as Hadoop also provide the abstraction of distribution of processing in a cluster. However, parallel processing not only involves a technological problem, but mainly requires the algorithm to be written in such a way that the final result can be obtained with each machine in the cluster working on a partition of the input, independently of the others. The MapReduce paradigm (Dean and Ghemawat 2004) is a programming model specifically designed for the purpose of writing programs whose execution can be parallelized. A MapReduce program is composed by two functions, map, specifying a criteria to partition the input into categories and reduce where the actual processing is performed on all the input records that belong to a same category. The distributed computation framework (e.g. Hadoop) takes care of splitting the input and assigning it to a different node in the cluster that takes in turn the role of mapper and reducer. Hence, the general processing task is split into separate sub-task and the result of the computation of each node is independent from all the others. The physical distribution of data through the cluster is transparently handled by the platform and the programmer must only write the processing algorithm according to the MapReduce rules. MapReduce is a general paradigm not tied to a specific programming language. While Hadoop requires MapReduce programs to be written in Java, MapReduce interfaces exists for all the common programming and data-oriented languages, including R.
5 The possible limit of this approach is that not all processing tasks are inherently parallelizable, that is some tasks cannot be easily split into set of independent sub-tasks. In statistical/mathematical computing there are several examples (think, e.g., of matrix inversion). In Section 3, we present some examples of statistically relevant problems and discuss how they can be solved through a parallel approach. 2.3 Quality, Time and Accessibility Challenges Big Data are noisy, that means that they do need to undergo a not trivial filtering and cleaning process before they can be used. Hence, one significant issue regards balancing the complexity of the cleaning process with the information value of the obtained results. Quality of Big Data can be more specifically characterized by: Quality dimensions, e.g. accuracy, timeliness, consistency, completeness. Each of these dimensions could be interpreted in a fundamentally revised way, in dependence of the specific contexts where Big Data are collected. For instance, there are dedicated works to the quality of sensor data, that take into account sensor precision and sensor failures (see e.g., (Klein and Lehner 2009)). Metadata. Most of Big Data, especially Web available ones, are loosely structured, i.e. there is limited metadata information that permit their interpretation. This is among the hardest challenges: data with no metadata are often useless. However, there are several projects that are pushing towards a concrete realization of the Web of Things, where objects have their identity and semantics based on various kind of metadata information associated to them: Linked Open Data (http://linkeddata.org) and WikiData (http://www.wikidata.org) are notable examples. Another aspect is related to process metadata. For instance, provenance is a significant process metadata describing an information describing origin, derivation, history, custody, or context of an object (Cheney et al. 2008). The relationship between time and Big Data has two main aspects. The first one is data volatility, i.e. a temporal variability of the information the data are meant to represent: there are data that are highly volatile (e.g. stock options), other which exhibit some degree of volatility (e.g. product prices), and some which are not volatile at all (e.g. birth dates). The second aspect is more generally related to the time features of the data generating mechanism. For instance, some Web data (e.g. data coming from social networks) spring up and get updated in an almost unpredictable fashion, so that their time dimension is not available in a direct way, but does need to be re-constructed, if wishing to use those data in any meaningful analysis. The major role that NSIs could have in respect to Big Data is as data consumers, rather than as data producers, hence they must have access to them. This is not an easy task first of all because several Big Data producers (e.g. Google, Facebook) gain their competitive advantage by maintaining their data locked. That is, Big Data are not (necessarily) Open Data. Hence accessing data sets for research purposes can be tricky and might involve issues related to the actual utility of the retrieved data. As an example, only a limited
6 subset of (public) tweets is made available by Twitter for querying through its APIs, and the selection criteria of such subset is not documented. Even when the competitive advantage is not an issue, privacy could be a limiting factor, due to legal regulations or to business policies (think, e.g., of hospital files or health data produced by clinical laboratories). 3. Big Data in Statistical Production Processes: Some Examples NSIs carry out their statistical production processes by exploiting a wide variety of different methods. Be the origin of such methods in the statistical/mathematical realm or in the computer science domain, such methods have evolved in NSIs by hinging upon specific features of the data they were meant to handle. Therefore, current NSIs methods bear a footprint that can be easily traced back to the nature of the data collected and consumed in the Official Statistics context (in the following: traditional data). In particular, the quality of traditional data is ensured by a well-defined set of metadata. The time dimension of traditional data is completely under the control of each NSI, as it is planned and managed since the preliminary design phases of surveys. As producers of traditional data, NSIs do not have any accessibility problems, and also when using data coming from administrative archives, they can rely on well-established exchange protocols. Though obvious, traditional data (even Census ones) are not Big Data, as they were defined in the Introduction. Hence, while Big Data can appear as a good opportunity to catch, we must admit that traditional NSIs methods seem not ready to deal with the specificity of such kind of data. Therefore, we envision the possibility for NSIs to deal with Big Data in two ways: adopting Big Data technologies, on one side, and rethinking methods to enable sound statistical analyses on Big Data, on the other side. In the following, we will use Record Linkage (RL) and data cleaning as a working examples to clarify this concept. RL refers to the set of statistical and IT methods that NSIs traditionally exploit in order to integrate data coming from different sources (but referring to the same real-world target population) when common record identifiers are lacking or unreliable. The need of such methods arises, for instance, whenever an NSI tries to use administrative archives as auxiliary sources in order to improve the quality of inferences based on survey data. Let s first focus on a particular RL phase, for showing how Big Data technology can be used for improving efficiency in traditional data processing. RL between two data sets, each of size equals to n records, requires n 2 record comparisons. When n is of the order of 10 4 or higher, the space of all the comparison pairs must be reduced in order to permit its practical exploration. To the scope, it is possible to adopt techniques that reduce the search space, i.e. the set of record pairs that will undergo the phase that decides their Matching or Non-Matching status. Among them, Sorted neighborhood is one of the most popular approaches: it sorts all entities using an appropriate blocking key and only compares entities within a predefined distance window. In (Kolb et al. 2012) it is shown how to use MapReduce for the parallel execution of Sorted neighborhood, demonstrating a highly efficient record linkage implementation with a Big Data technology. In such an implementation, the Map
7 function determines the blocking key for ordering records. The map output is then distributed across to multiple Reduce tasks, each implementing the sliding window strategy. Insisting on the RL scenario, let s show why some traditional methods cannot directly be used with Big Data. Let us suppose that (i) some Big Data sources exist that could provide further valuable information on phenomena which are currently addressed by a traditional survey, and that (ii) such sources are actually accessible to the NSI in charge of the survey. As it is well known, careful sampling designs and strictly probabilistic sample selection techniques are real pillars of traditional surveys carried out by NSIs, acting as fundamental ex-ante guarantees of the quality of the obtained inferences. The survey weights tied to respondent units are imprints of such pillars and play a special role in estimation. Each sampled unit enters the survey with its own sampling weight and cannot be substituted ex-post with any other unit (be it a similar one) without the risk of severely impairing estimation. Therefore, any attempt to enrich a traditional survey with further information coming from a Big Data source would require to identify the tiny component of that source which is actually linked to the real-world units sampled in the survey. This would allow to transfer new information from the Big Data source to the right sample units, without hindering the probabilistic sampling design of the survey, i.e. leaving the survey weights unaffected. In other words, in order to take advantage from the Big Data opportunity without jeopardizing the quality of inferences, the NSI should be able to link the survey data with the Big Data. Unfortunately, such result seems definitely out of the reach of traditional RL methods adopted in Official Statistics. Indeed, the Big scale of the Big Data sources would claim for fully automated RL methods, that could be deployed in high performance computing environments and then run robotically (i.e. without any human bottleneck). This clearly rules out most of the tools from the traditional RL kit: supervised learning techniques, clerically prepared training sets, user-defined classification thresholds, clerical review, interactive evaluation of obtained results. Despite the latter conclusion looks unavoidable, the challenge of devising alternative RL methods suitable to set upon the Big Data scale is admittedly far from being met, at least in the Official Statistics field. Indeed, NSIs currently perceive traditional RL methods almost as a mandatory choice, on the grounds that by leveraging domain knowledge at the price of some amount of human work those methods proved to be able to ensure high-quality RL results. On the contrary, existing fully automated RL systems like approximate join algorithms from the database research community, deliberately biased toward the goal of high efficiency tend to be regarded with skepticism in NSIs, as they are commonly believed to perform too much poorly with respect to accuracy. As a more encouraging example, let us switch to the problem of handling matching constraints in RL. Such constraints arise whenever one knows in advance that either or both the data sets to be linked do not contain duplicates. This is precisely the case of the RL scenario in Official Statistics, since thanks to well established checking and cleaning procedures routinely applied by NSIs ordinary survey data can be assumed to be duplicate-free. As a consequence, each record coming from the Big Data source can match at most a single survey record. This property translates into a set of constraints that
8 have to be fulfilled when solving the RL problem. The latter is generally modelled as a constrained Maximum Likelihood problem, which NSIs traditionally address by using Simplex-based algorithms (since both its objective function and constraints are linear). Now, while the Simplex is an exact optimization technique, a major concern with it is tied to memory usage. If n denotes the average size of the data sets to be matched (i.e., in our scenario, the Big Data scale: n ~ ½N Big ), the number of unknowns and inequalities for the problem grow like n 2 and n respectively, yielding a Simplex solver space complexity of O(n 3 ) = O(N Big 3 ) at least. The net result is that the Simplex approach cannot be directly applied to the Big Data scale, no matter how much computing power an NSI can rely on. Fortunately, there exist alternative techniques that can ensure the computational efficiency required to attack the Big Data scale, provided NSIs are ready to accept approximate (rather than exact) solutions to their optimization problems. For instance, (Zardetto et al. 2010) showed how to tackle the RL problem under matching constraints by means of a purposefully designed Evolutionary Algorithm (EA). On the one hand, the EA overcomes the memory problem (the biggest data structure it stores grows only linearly with n), on the other hand, it is able to find high-quality (though not necessarily optimal) RL results. Moreover, the EA thanks to the inherent parallelism of the EA metaheuristic is a natural candidate for further scalability via the MapReduce paradigm, as testified by a growing body of research (see e.g. (Verma et al. 2009) and references therein). Data cleaning tasks are carried out by consolidated methods in NSIs. For instance, consistency checks are performed by defining sets of edits, localizing edit violations and operating repairs to erroneous records by traditional methods like Fellegi-Holt. If Big Data becomes a new data collection basin, one could think to have access to a huge number of sources that contain redundant data, like, e.g. Web sources. This could revolutionize the way in which consistency checks are performed. Indeed, instead of thinking to complex edit-based models to be verified on a single dataset, one could think to cross-match distinct datasets in order to verify data consistency. One possibility is to find out rules by identifying anomalous (i.e. exceptionally rare) combinations of values in the huge amount of data provided by Web sources. Such rules could be used to complement or even possibly to replace the edit rule sets that are traditionally based on prior knowledge on phenomena under investigation. A second possibility could be related to a two-step process. In the first step, data sources could be linked on the basis of record linkage procedures, while, in the second one, some fusion techniques could be applied in order to resolve inconsistencies (Bleiholder and Naumann 2008). In synthesis, the above described examples illustrate that both Big Data technologies and Big Data can have a place in the Official Statistics context: Big Data technologies can improve the efficiency of statistical production processes and Big Data can add value to statistical information products, though at the price of a consistent rethinking of traditional statistical methods.
9 4. Concluding Remarks Current methods in the survey analysis realm largely evolved from common ancestors (e.g., design based survey sampling theory, regression theory, generalized linear models), which were originally devised to be used on small amounts of high quality data. Such methods are, indeed, extremely sensitive to outliers and erroneous data, a circumstance that explains the tremendous effort put by NSIs in checking and cleaning activities. Moreover, these traditional methods typically exhibit high computational complexity (power-behavior is the rule, linear algorithms are lucky exceptions, sub-linear ones a real chimera), a feature that hinders their scalability on huge amounts of data. As a consequence, most of the statistical methods adopted by NSIs seem to be positioned poles apart, when examined in the Big Data perspective. NSIs will have to undertake some radical paradigm shift in statistical methodology, in order to let Big Data gain ground in Official Statistics. Despite it is far from obvious how to translate such awareness into actual proposals, we deem new candidate statistical methods should be: (i) more robust (i.e. more tolerant towards both erroneous data and departures from model assumptions), perhaps at the price of some accuracy loss; (ii) less demanding in terms of a clear and complete understanding of obtained results in the light of an explicit statistical model (think, e.g., of Artificial Neural Networks, Support Vector Machines); (iii) based on approximate (rather than exact) optimization techniques, which are able to cope with noisy objective functions (as implied by low quality input data) while, at the same time, ensuring the mandatory scalability requirement inherent in Big Data processing (thanks to their implicit parallelism: think of stochastic metaheuristics like, e.g., evolutionary algorithms, ant colonies, swarm particles). In order to exploit the (potentially enormous) treasure shining through the Big Data mountain, NSIs will have to climb that mountain. Moreover, during such climbing NSIs will have to avoid possible crevasses (think of the reputation risk inherent to sentiment analysis/social mining techniques: stakeholders tend to recognize NSIs authority to the extent that they perceive results of statistical production as objective ). While a highly non-trivial technological leap will be necessary, this will definitely turn out to be insufficient. Indeed, putting Big Data in an Official Statistics environment, also requires a radical paradigm shift in statistical methodology. References Bleiholder, J., Naumann, F. (2008). Data fusion. ACM Computing Surveys, 41 (1). Cheney, J., Buneman, P., Ludäscher, B. (2008). Report on the Principles of Provenance Workshop. SIGMOD Record, 37 (1).
10 Dean, J., Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. Operating Systems Design and Implementation (OSDI). HLG-BAS (2011). Strategic vision of the High-level group for strategic developments in business architecture in statistics. Conference of European Statisticians, 59 th Plenary Session. URL: Klein, A., Lehner, W. (2009). Representing Data Quality in Sensor Data Streaming Environments. Journal of Data and Information Quality (JDIQ), 1 (2). Kolb, L., Thor, A., Rahm, E. (2012). Multi-pass sorted neighborhood blocking with MapReduce. Computer Science Research and Development, 27(1). Verma, A., Llora, X., Goldberg, D.E., Campbell, R.H. (2009). Scaling Genetic Algorithms Using MapReduce. Intelligent Systems Design and Applications (ISDA). Zardetto, D., Scannapieco, M., Catarci, T. (2010). Effective Automated Object Matching. International Conference on Data Engineering (ICDE).
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION Pilar Rey del Castillo May 2013 Introduction The exploitation of the vast amount of data originated from ICT tools and referring to a big variety
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
Data Modeling for Big Data by Jinbao Zhu, Principal Software Engineer, and Allen Wang, Manager, Software Engineering, CA Technologies In the Internet era, the volume of data we deal with has grown to terabytes
Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,
Programme of the ESTP training course on BIG DATA EFFECTIVE PROCESSING AND ANALYSIS OF VERY LARGE AND UNSTRUCTURED DATA FOR OFFICIAL STATISTICS Rome, 5 9 May 2014 Istat Piazza Indipendenza 4, Room Vanoni
W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
Introduction to Hadoop and MapReduce THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large quantities of data
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen email@example.com Abstract. The Hadoop Framework offers an approach to large-scale
ISSN: 2454-2377, October 2015 Big Data and Hadoop Simmi Bagga 1 Satinder Kaur 2 1 Assistant Professor, Sant Hira Dass Kanya MahaVidyalaya, Kala Sanghian, Distt Kpt. INDIA E-mail: firstname.lastname@example.org
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University email@example.com 14.9-2015 1/36 Google MapReduce A scalable batch processing
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Knowledgent White Paper Series Big Data and Healthcare Payers WHITE PAPER Summary With the implementation of the Affordable Care Act, the transition to a more member-centric relationship model, and other
beyond possible Big Use End-user applications Big Analytics Visualisation tools Big Analytical tools Big management systems The 4 Pillars of Technosoft s Big Practice Overview Businesses have long managed
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
A Next-Generation Analytics Ecosystem for Big Data Colin White, BI Research September 2012 Sponsored by ParAccel BIG DATA IS BIG NEWS The value of big data lies in the business analytics that can be generated
WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all
With Saurabh Singh firstname.lastname@example.org The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Big Data and Analytics: Challenges and Opportunities Dr. Amin Beheshti Lecturer and Senior Research Associate University of New South Wales, Australia (Service Oriented Computing Group, CSE) Talk: Sharif
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
Testing 3Vs (Volume, Variety and Velocity) of Big Data 1 A lot happens in the Digital World in 60 seconds 2 What is Big Data Big Data refers to data sets whose size is beyond the ability of commonly used
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Big Data Management in the Clouds Alexandru Costan IRISA / INSA Rennes (KerData team) Cumulo NumBio 2015, Aussois, June 4, 2015 After this talk Realize the potential: Data vs. Big Data Understand why we
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
Trustworthiness of Big Data International Journal of Computer Applications (0975 8887) Akhil Mittal Technical Test Lead Infosys Limited ABSTRACT Big data refers to large datasets that are challenging to
Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure Requirements... 5 Solution Spectrum... 6 Oracle s Big Data
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, email@example.com Assistant Professor, Information
, pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal
Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum Siva Ravada Senior Director of Development Oracle Spatial and MapViewer 2 Evolving Technology Platforms
Report on the Dagstuhl Seminar Data Quality on the Web Michael Gertz M. Tamer Özsu Gunter Saake Kai-Uwe Sattler U of California at Davis, U.S.A. U of Waterloo, Canada U of Magdeburg, Germany TU Ilmenau,
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
Agile Business Intelligence Data Lake Architecture TABLE OF CONTENTS Introduction... 2 Data Lake Architecture... 2 Step 1 Extract From Source Data... 5 Step 2 Register And Catalogue Data Sets... 5 Step
Volume 4, Issue 1, January 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Transitioning
B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of
A New vision for network architecture David Clark M.I.T. Laboratory for Computer Science September, 2002 V3.0 Abstract This is a proposal for a long-term program in network research, consistent with the
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Composite Data Virtualization Composite Data Virtualization And NOSQL Data Stores Composite Software October 2010 TABLE OF CONTENTS INTRODUCTION... 3 BUSINESS AND IT DRIVERS... 4 NOSQL DATA STORES LANDSCAPE...
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 36 Outline
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,
Sharing Solutions for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences Nicoletta Cibella 1, Gervasio-Luis Fernandez 2, Marco Fortini 1, Miguel Guigò 2, Francisco Hernandez 2,
Big Data Big Data Defined Introducing DataStack 3.0 Inside: Executive Summary... 1 Introduction... 2 Emergence of DataStack 3.0... 3 DataStack 1.0 to 2.0... 4 DataStack 2.0 Refined for Large Data & Analytics...
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
INFO 1500 Introduction to IT Fundamentals 5. Database Systems and Managing Data Resources Learning Objectives 1. Describe how the problems of managing data resources in a traditional file environment are
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research
CLOUD COMPUTING USING HADOOP TECHNOLOGY DHIRAJLAL GANDHI COLLEGE OF TECHNOLOGY SALEM B.NARENDRA PRASATH S.PRAVEEN KUMAR 3 rd year CSE Department, 3 rd year CSE Department, Email:firstname.lastname@example.org
Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable
Medical Big Data Workshop 12:30-5pm Star Conference Room #MedBigData15 Welcome! Today s Goals: Introduce you to the Big Data @ CSAIL Introduce you to the popular MIMIC II Dataset Overview of Database Technologies
Foundations of Business Intelligence: Databases and Information Management Problem: HP s numerous systems unable to deliver the information needed for a complete picture of business operations, lack of
International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data
Big Data Analytics An Introduction Oliver Fuchsberger University of Paderborn 2014 Table of Contents I. Introduction & Motivation What is Big Data Analytics? Why is it so important? II. Techniques & Solutions
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,