GRANULARITIES AND INCONSISTENCIES IN BIG DATA ANALYSIS




International Journal of Software Engineering and Knowledge Engineering
World Scientific Publishing Company

DU ZHANG
Department of Computer Science, California State University
Sacramento, 95819-6021, USA
zhangd@ecs.csus.edu
http://gaia.ecs.csus.edu/~zhangd

Big data and big data analysis are a multi-dimensional scientific and technological pursuit that has a profound impact on society as a whole. Though big data has become a catchy buzzword, to make any significant stride in this pursuit we must have a clear picture of what big data is and what big data analysis entails. In this paper, after a brief account of the landscape of big data and big data analysis, we focus attention on two issues: granularities of knowledge content in big data, and the utility of inconsistencies in big data analysis.

Keywords: Big data; big data analysis; granularities of knowledge content; inconsistencies.

1. Introduction

Big data and big data analysis are a multi-dimensional scientific and technological pursuit that has a profound impact on society as a whole. Because big data has become a catchy buzzword that has spurred great interest and curiosity among a broad audience, to make any significant stride in this pursuit we must have a clear picture of what big data is and what big data analysis entails. Figure 1 highlights various dimensions of big data and big data analysis, though the list is not exhaustive. After some general comments, we will focus our attention on two issues in this scientific pursuit: granularities of knowledge content in big data, and inconsistencies in big data analysis.

The objectives of big data analysis are largely driven by the objectives of big data stakeholders or customers. These can range from creating value in healthcare, accelerating the pace of scientific discovery in the life and physical sciences, improving productivity in manufacturing, and developing a competitive edge for business, retail, or service industries, to innovating in education, media, transportation, or government. How to better utilize data assets, in addition to physical assets and human capital, to create value has become fertile ground for enterprises seeking competitive advantages. As big data analysis becomes the next frontier for the advancement of knowledge, innovation, and enhanced decision-making, the significance of its impact on society as a whole should not be underestimated.

Domains that benefit from the big data push include: life and physical sciences, medicine, education, healthcare, location-based services, manufacturing, retail, communication and media, government, transportation, banking, insurance, financial services, utilities, environment, and the energy industry [3, 9, 12].

Figure 1. Dimensions in Big Data and Big Data Analysis.

Big data as a technical term has generated many different interpretations and definitions. A meta-definition based on the size dimension is given in [8]: "big data should be defined at any point in time as data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time." The volume-variety-velocity definition [7] attempts to capture not only the size of the datasets we encounter today, but also their types and the speed at which they are generated. The survey results in [11] list a number of alternative definitions for big data.

What has been glossed over in the literature is the following: what exactly does a dataset contain? Primitive data elements, or meta-data in terms of information, knowledge, or meta-knowledge, or any combination of them? The terms data and information have been used interchangeably in the literature, but data, information, knowledge, meta-knowledge, and expertise each have a distinct definition.

Big data analysis, on the other hand, is defined as a pipeline of acquisition and recording; extraction, cleaning and annotation; integration, aggregation and representation; analysis and modeling; and interpretation [1]. There are alternative definitions of what big data analysis entails [9, 12].
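The five-phase pipeline from [1] can be pictured as a chain of functions, each consuming the output of the previous phase. The following is only a minimal sketch under stated assumptions: every function name, the toy log lines, and the "highest revenue" analysis are hypothetical illustrations and are not taken from [1] or from any real system.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical stage functions; only the phase names come from the pipeline in [1].

def acquire(_):
    """Acquisition and recording: toy log lines stand in for a real source."""
    return [
        "2013-01-02,store_a, 19.99",
        "2013-01-02,store_b,5.00",
        "bad line",
        "2013-01-03,store_a,7.50",
    ]

def extract_and_clean(lines):
    """Extraction, cleaning and annotation: parse fields, drop malformed lines."""
    records = []
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip lines that do not have exactly three fields
        date, store, amount = parts
        records.append({"date": date, "store": store, "amount": float(amount)})
    return records

def integrate_and_aggregate(records):
    """Integration, aggregation and representation: total amount per store."""
    totals = defaultdict(float)
    for record in records:
        totals[record["store"]] += record["amount"]
    return dict(totals)

def analyze(totals):
    """Analysis and modeling: here, simply pick the store with the largest total."""
    return max(totals, key=totals.get)

def interpret(top_store):
    """Interpretation: turn the analytic result into a statement for a stakeholder."""
    return f"Highest revenue: {top_store}"

PIPELINE = [acquire, extract_and_clean, integrate_and_aggregate, analyze, interpret]

def run_pipeline(stages, seed=None):
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda value, stage: stage(value), stages, seed)

report = run_pipeline(PIPELINE)
print(report)
```

Keeping each phase as a separate function makes it possible to swap or parallelize individual phases without touching the others, which is one reason the pipeline view is convenient.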

There are many sources of big data, from transactions, scientific experiments, genomics, logs, events, emails, social media, sensors, RFID scans, texts, geospatial data, audio, medical records, surveillance, and images to videos [3, 11]. These sources of big data contain elements or instances that can be semi-structured (e.g., tabular, relational, categorical, or meta-data), or unstructured (e.g., text, messages).

Elements in a dataset have many properties. First, data elements may have the same or different probability distributions. Second, as observed in [8], what makes big data big is repeated observations over time and/or space. Hence, most large datasets have inherent temporal or spatial dimensions, or both [8]. Recognizing this inherent temporal/spatial property is very important because this is where performance problems stem from when we try to conduct big data analysis using the prevailing database model (the current RDBMS model does not honor the order of rows in tables [8]). Another property is that most large datasets exhibit predictable characteristics in the following sense: the largest cardinalities of most datasets (specifically, the number of distinct entities about which observations are made) are small compared with the total number of observations [8]. This is a very important heuristic in big data analysis. Scientific datasets are typically multi-dimensional, have embedded physical models, possess meta-data about experiments and their provenance, and have low update rates, with most updates append-only [2].

Technologies brought to bear on big data analysis tasks include: machine learning, cloud computing, crowdsourcing [12], data mining, time series analysis, stream processing, and visualization [4, 9].

Many challenges remain in big data analysis.
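The cardinality heuristic attributed to [8] above (few distinct entities, many repeated observations) is easy to check on a concrete dataset. The sketch below is only an illustration: the function name `cardinality_profile` and the toy sensor readings are assumptions made for the example, not something defined in [8].

```python
from collections import Counter

def cardinality_profile(observations, entity_key):
    """Compare the number of distinct entities with the number of observations.

    `entity_key` maps an observation to the entity it is about
    (a hypothetical helper argument introduced for this sketch).
    """
    counts = Counter(entity_key(obs) for obs in observations)
    total = sum(counts.values())
    distinct = len(counts)
    return {
        "observations": total,
        "distinct_entities": distinct,
        "ratio": distinct / total if total else 0.0,
    }

# Toy data: 3 sensors, each observed once per hour for a day (repeated
# observations over time), stored as (sensor_id, hour, value) tuples.
readings = [
    (sensor, hour, 20.0 + sensor + 0.1 * hour)
    for sensor in range(3)
    for hour in range(24)
]

profile = cardinality_profile(readings, entity_key=lambda r: r[0])
print(profile)
```

A small ratio of distinct entities to observations suggests that grouping or partitioning by entity (and ordering by time within each group) is a natural way to decompose the analysis.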
In addition to volume, variety and velocity, which create challenges in storage, curation, search, retrieval, and visualization, veracity complicates matters by requiring that data uncertainty be handled [11]. Meeting the challenges brought on by these four Vs relies critically on recognizing regularities, patterns and correlations in data (with the assistance of domain knowledge about inherent temporal and spatial properties of data), and on decomposing analytic tasks and carrying them out in parallel. There is a whole host of inconsistent or conflicting circumstances during big data analysis [1, 5]. How to properly handle various types of inconsistencies during data pre-processing and analysis is another challenge. Additional challenges include privacy, security, provenance, and modeling [1, 9].

Several potential pitfalls exist in the process of advancing knowledge or creating value out of data. The first is to assume that data alone suffice. While data are plentiful in today's digital society, we need to be mindful that data alone are not enough to advance knowledge or create value. Every learner must embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it [6]. The second pitfall is the curse of dimensionality. When machine learning algorithms are used to generalize beyond the input data, generalizing correctly becomes exponentially harder as the dimensionality (number of features) of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space. Even with a moderate dimension of 100 and a huge training set of a trillion examples, the latter covers only a fraction of about $10^{-18}$ of the input space [6] (with binary features, the input space holds $2^{100} \approx 1.3 \times 10^{30}$ points). A related issue is feature engineering [6]: a large dataset in its raw format is often not in a form that is amenable to learning, but features that are can be constructed from it.

In the next two sections, we will briefly examine the following two issues: granularities of knowledge content in big data, and inconsistencies in big data analysis.

2. Granularities of Knowledge Content in Big Data

In the hierarchy of knowledge, there are layers of knowledge content. Noise can be described as items that carry no knowledge content. Data denotes values drawn from some domain of discourse. Information defines the meanings of data values as understood by those who use them. Knowledge represents specialized information about some domain that allows one to make decisions. Meta-knowledge is knowledge about knowledge. Expertise is specialized operative knowledge that is inherently task-specific and relatively inflexible.

Figure 2 depicts the knowledge hierarchy, where knowledge content in a higher layer is more structured, has richer representation and semantics, and has small connotations. Induction goes from data to knowledge (bottom-up arrow) and deduction applies knowledge to individual entities (top-down arrow). Knowledge content of large granularity has small connotations and knowledge content of small granularity has large connotations.

Big data has been used as a categorical phrase for large datasets. What has been glossed over by the term is what exactly such a large dataset contains: primitive data elements, pieces of information, pieces of knowledge, pieces of meta-knowledge, or any combination of them? We need to be precise and should not regard data, information, and knowledge as interchangeable terms denoting the same entities (see examples of differences in Table 1).

Figure 2. Granularity of Knowledge Content in Big Data.
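One way to be precise about what a dataset contains is to tag each element with its layer in the knowledge hierarchy and then summarize the mix. The sketch below is a minimal illustration under assumptions: the `Layer` enum, its integer ordering, the helper `content_profile`, and the toy payloads are hypothetical; only the layer names and the location-based-services examples (coordinates as data, restaurants as information, ratings as knowledge) follow the text and Table 1.

```python
from collections import Counter
from enum import IntEnum

class Layer(IntEnum):
    """Layers of knowledge content, ordered bottom-up as in the hierarchy.

    The integer values are an assumption of this sketch; they only encode
    "higher layer = larger granularity".
    """
    NOISE = 0
    DATA = 1
    INFORMATION = 2
    KNOWLEDGE = 3
    META_KNOWLEDGE = 4
    EXPERTISE = 5

def content_profile(items):
    """Count how many elements of a dataset fall into each layer.

    `items` is an iterable of (layer, payload) pairs.
    """
    return Counter(layer for layer, _payload in items)

# Toy location-based-services dataset with invented payloads.
items = [
    (Layer.DATA, (38.5816, -121.4944)),           # latitude-longitude coordinates
    (Layer.DATA, (38.5449, -121.7405)),
    (Layer.INFORMATION, "restaurant at the first coordinate pair"),
    (Layer.KNOWLEDGE, "rating for that restaurant"),
]

profile = content_profile(items)
richest = max(profile)  # highest layer present in the dataset

for layer in sorted(profile):
    print(f"{layer.name}: {profile[layer]}")
print("highest layer present:", richest.name)
```

A profile like this could inform which learning algorithm to use, for example whether the input consists of data elements only or of data elements plus domain knowledge.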
Bringing the concepts of granularities of knowledge content explicitly into big data analysis is conducive to various tasks at different stages of the analysis process. For instance, depending on the circumstances of an input set (e.g., containing data elements only, or data elements plus domain knowledge), a learning algorithm that works best under those circumstances can be selected. Terminology-wise, in addition to big data, the terms big information, big knowledge, or big meta-knowledge can be used to describe more accurately circumstances where an input set contains a large volume of information, knowledge, or meta-knowledge, respectively.

Table 1. Examples of knowledge content granularities in big data.

             | Location-based services        | Social networks                                      | Healthcare   | Retail
Knowledge    | Restaurant ratings             | Social network structures                            | Diagnoses    | Purchase patterns
Information  | Restaurants                    | People who tweet and people who follow other people  | Patients     | Groups of customers
Data         | Latitude-longitude coordinates | Tweets                                               | X-ray images | Transactions

3. Inconsistencies in Big Data Analysis

Inconsistencies are commonplace in the human behaviors and decision-making processes for which big data are acquired, fused, and represented. Once captured in big data, inconsistent or conflicting phenomena can occur at various granularities of knowledge content, from data, information, knowledge, and meta-knowledge to expertise, and can adversely affect the quality of the outcomes of the big data analysis process [1, 5]. Inconsistencies can also manifest themselves in the reasoning methods, heuristics, or problem-solving approaches of various analysis tasks, creating challenges for big data analysis.

Let $X$ and $Y$ be a set of data instances and a set of labels for data instances, respectively. Given a dataset $S$ and two data elements $d_i \in S$ and $d_j \in S$, let $d_i = (x, y)$ and $d_j = (x', y')$, where $x, x' \in X$ and $y, y' \in Y$. $d_i$ and $d_j$ are data instances with inconsistent labels when the following holds:

$(x = x') \wedge (y \neq y') \wedge (y\ \ldots\ y') \wedge (y'\ \ldots\ y)$

The presence of $d_i$ and $d_j$ in $S$ is referred to as data inconsistency. When a machine learning algorithm is applied to a dataset $S$ that contains data inconsistency, the model thus learned will have reduced predictive accuracy.
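A simple way to surface data inconsistency before training is to group the elements of $S$ by instance and flag any instance that carries more than one label. The sketch below is an illustration under a simplifying assumption: labels are treated as flat, so only the "same instance, different labels" part of the condition is checked, and any further relation between the two labels is ignored. The function name `find_label_conflicts` and the toy dataset are hypothetical.

```python
from collections import defaultdict

def find_label_conflicts(dataset):
    """Return instances that appear in the dataset with more than one label.

    `dataset` is an iterable of (x, y) pairs, where x is a hashable feature
    tuple and y is a label. Simplification assumed here: labels are flat, so
    two elements conflict whenever x == x' and y != y'.
    """
    labels_by_instance = defaultdict(set)
    for x, y in dataset:
        labels_by_instance[x].add(y)
    return {x: labels for x, labels in labels_by_instance.items() if len(labels) > 1}

# Toy dataset S: the first two elements share the same instance but disagree
# on the label; the last two are exact duplicates and therefore consistent.
S = [
    ((5.1, 3.5), "a"),
    ((5.1, 3.5), "b"),
    ((6.0, 2.9), "b"),
    ((6.0, 2.9), "b"),
]

conflicts = find_label_conflicts(S)
print(conflicts)
```

Whether flagged elements should be dropped, relabeled, or used as a trigger for further learning depends on the analytic task; the check itself only makes the inconsistency visible.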
We need to recognize types of inconsistencies for different types of big data. For instance, for location-based or time-series data, temporal or spatial inconsistencies will dominate, whereas for unstructured text data, inconsistencies pertaining to antonyms, negation, mismatched values, structural or lexical contrasts, or world knowledge will occupy a commanding position [13].

In addition, it is necessary to differentiate categories of inconsistent phenomena at the different levels of data, information, knowledge, and meta-knowledge. Inconsistencies at the data level involve various types of values for features of data instances (symbolic, numeric, categorical, waveform, etc.) and different types of labels; inconsistencies at the information level manifest in terms of functional dependencies or associations; at the knowledge level, inconsistencies appear in declarative or procedural beliefs; meta-knowledge inconsistencies are demonstrated through control strategies or learning decisions [13].

There are different big data analytic tasks or objectives, such as prediction, classification, regression, association analysis, clustering, and outlier analysis. Which type of inconsistency has what impact on which analytic objective is yet another issue to be investigated.

The goal is to utilize inconsistencies as valuable heuristics in guiding the development of inconsistency-specific tools that assist tasks in big data analysis. One example is inconsistency-induced learning, or i²Learning [14, 15], which allows inconsistencies to be utilized as stimuli to initiate learning episodes that lead to the resolution of data or knowledge inconsistencies, or to refined or augmented knowledge, which in turn improves the performance of a system.

4. Concluding Remarks

Overemphasizing the "big" in big data may create some unintended consequences. In the zeal to go after big data, people may forget that what is at stake is the adequacy and relevance of the data with regard to the objective of the analysis, and may overlook not-so-big data or small data that could be just what it takes to create value or discover knowledge. The big-data-small-segmentation scenario and the real-time microsegmentation technique used to target promotions and advertising in [9] substantiate this point.

As indicated in [5, 10], the real-time performance requirement is increasingly exerting pressure on the underlying methods and techniques for big data analysis. Before long, we will need to bring the real-time requirement into machine learning and devise real-time machine learning algorithms to meet this challenge.

Acknowledgments

The author appreciates the support and guidance from Dr. Jerry Gao, editor of the Viewpoints section, and the comments by anonymous reviewers that helped improve the paper.

References

[1] D. Agrawal, P.
Bernstein, E. Bertino, S. Davidson, and U. Dayal, Challenges and opportunities with big data, Cyber Center Technical Report 2011-1, Purdue University, January 1, 2011.
[2] A. Ailamaki, V. Kantere, and D. Dash, Managing scientific data, Communications of the ACM, Vol. 53, No. 6 (2010) 68-78.
[3] Big Data, http://en.wikipedia.org/wiki/big_data
[4] S. Bryson, D. Kenwright, M. Cox, D. Ellsworth, and R. Haimes, Visually exploring gigabyte data sets in real time, Communications of the ACM, Vol. 42, No. 8 (1999) 83-90.
[5] S. Chaudhuri, U. Dayal, and V. Narasayya, An overview of business intelligence technology, Communications of the ACM, Vol. 54, No. 8 (2011) 88-98.
[6] P. Domingos, A few useful things to know about machine learning, Communications of the ACM, Vol. 55, No. 10 (2012) 78-87.
[7] Gartner Group press release, Pattern-based strategy: getting value from big data, July 2011.
[8] A. Jacobs, The pathologies of big data, Communications of the ACM, Vol. 52, No. 8 (2009) 36-44.
[9] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, Big Data: the next frontier for innovation, competition, and productivity, McKinsey Global Institute, June 2011.
[10] G. Mone, Beyond Hadoop, Communications of the ACM, Vol. 56, No. 1 (2013) 22-24.
[11] M. Schroeck, R. Shockley, J. Smart, D. Romero-Morales, and P. Tufano, Analytics: the real-world use of big data: how innovative enterprises extract value from uncertain data, Executive Report, IBM Institute for Business Value and Saïd Business School at the University of Oxford, 2012.
[12] The White House Big Data Research and Development Initiative, http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf
[13] D. Zhang and E. Gregoire, The landscape of inconsistency: a perspective, International Journal of Semantic Computing, Vol. 5, No. 3 (2011) 235-256.
[14] D. Zhang, i²Learning: perpetual learning through bias shifting, in Proc. of the 24th International Conference on Software Engineering and Knowledge Engineering, July 2012, pp. 249-255.
[15] D. Zhang and M. Lu, Learning through Overcoming Inheritance Inconsistencies, in Proc. of the 13th IEEE International Conference on Information Reuse and Integration, August 2012, pp. 201-206.