EDA and ML: A Perfect Pair for Large-Scale Data Analysis
EDA and ML: A Perfect Pair for Large-Scale Data Analysis

Ryan Hafen, Terence Critchlow
Pacific Northwest National Laboratory, Richland, WA, USA
[email protected], [email protected]

Abstract—In this position paper, we discuss how Exploratory Data Analysis (EDA) and Machine Learning (ML) can work together in large-scale data analysis environments. In particular, we describe how applying EDA techniques and ML methods in a complementary fashion can address some of the challenges faced when applying ML techniques to large, real-world data sets, and we discuss tools that help do the job. This iterative approach is demonstrated with a simple example in which extracting events from a historical sensor data set was enabled by iteratively identifying and filtering various types of erroneous data.

Keywords—exploratory data analysis, machine learning, large-data analysis

I. INTRODUCTION

In this short position paper, we discuss how Exploratory Data Analysis (EDA) and Machine Learning (ML) can work together to enable effective large-scale data analysis. Among the significant challenges in applying ML to large data sets are creating the training data, evaluating the results, and determining how to refine the data to improve the performance of the algorithm. While solutions to these problems have been established on relatively small data sets, overcoming them in the realm of big data remains a challenge. Recent advances in statistical computing environments have allowed EDA techniques to transition from small data sets to significantly larger ones. This transition allows EDA to be used in conjunction with ML to address these challenges.

Section II provides a working description of EDA and describes a typical EDA analysis environment. In Section III, we describe recent advances that have made EDA feasible for large data analysis scenarios. Section IV describes how EDA and ML can be effectively combined to provide an enhanced analytical environment.
Finally, in Section V, an example of a large-scale analysis using this combined approach is presented.

II. WHAT IS EDA?

EDA is the interactive, hypothesis-driven exploration of data. In this approach, analysis of the data is driven by a set of questions and/or suspicions, but as the analysis proceeds, results are examined not only to answer the original question, but to potentially gain additional, possibly unexpected, insights about the data: "after exploring we have some answers and more questions" [1]. It is important to recognize that the exploration is directed, based on the initial questions being asked. "We explore data with expectations. We revise our expectations based on what we see in the data. And we iterate this process." [2] Because of the importance of the user's direction, EDA is greatly informed by domain expertise and domain-driven models of how data should behave. It often plays an important role in either confirming a priori expectations or suggesting new hypotheses based on the data. As John Tukey, the father of EDA, points out, "Exploratory data analysis is an attitude, a flexibility, and a reliance on display, NOT a bundle of techniques" [3]. However, in practice, the availability of good tools and techniques is critical to the useful application of EDA to a specific problem. In particular, both numerical and visual methods are typically applied to the data to identify trends and patterns as well as deviations from the expected patterns. This requires tools that support iterative interaction with the data, allow users to easily describe complex analytics, and provide the ability to easily display analysis results.

A. Requirements for an EDA Environment

The quote from Tukey just cited highlights two important aspects of EDA: flexibility and a reliance on display. Any successful EDA environment must support both components.

EDA environments must be flexible. The EDA approach to analysis is inherently iterative.
The models or algorithms that capture the salient features of the data are inherently complex, and initial descriptions are never complete. Typically, analysis begins with either simple summaries or displays of the data. These provide insight into the data, leading to new ideas. Algorithms to investigate the validity of these new hypotheses are developed and evaluated. These algorithms lead to new insights, and the process is repeated. Inherent in a successful environment is the ability to rapidly implement a method so that each new idea can be examined. Thus, a successful EDA environment must provide a high-level, interactive language for data analysis with direct support for complex numerical, statistical, and machine learning methods.

Visualization is critical in an EDA environment. EDA requires a human to drive the data exploration process. It is well documented that humans have an extraordinary ability to identify patterns in complex data when presented with the information through an appropriate visualization. Thus, the most effective way for a human to be of value in this loop is to enable easy visualization of the analysis results. To accomplish this, a successful EDA environment must
provide a flexible interface for creating a wide variety of visualizations.

III. TOOLS FOR EDA WITH LARGE DATA

EDA principles have been successfully used for many years [4], and several popular environments, including R [5] and MATLAB [6], have been developed. These environments allow users to easily and interactively explore their data using scripting languages designed expressly for expressing complex mathematical and statistical algorithms. R is particularly popular because it is an open source project, allowing the broad research community to contribute thousands of specialized analysis packages. The supporting ecosystem makes it easy to identify, import, and use algorithms that perform every imaginable analysis task.

Until recently, R and similar EDA environments have been limited to relatively small data sets because they require the ability to analyze the entire data set in memory. To facilitate the iterative process of EDA with large data, it is important that methods can be applied at scale so that results are received in a reasonable amount of time. Fortunately, recent research has advanced the state of the art, making EDA on multi-terabyte data sets feasible. In this section, we introduce the R and Hadoop Integrated Programming Environment (RHIPE) [7], which provides this environment, and discuss some research efforts aimed at scalable analytic and visualization methods based on this platform.

A. RHIPE

RHIPE (pronounced "hree-pay") is a merger of the R statistical programming environment and the Hadoop distributed processing system. RHIPE allows an analyst to carry out analysis of complex big data wholly from within R, using Hadoop to handle the distributed computing. RHIPE provides a dynamic language for analysis of big data.
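As an illustration of the MapReduce pattern that RHIPE exposes from R, the following sketch computes a per-sensor mean. It is written in Python rather than R for concreteness; the record layout and the sequential driver are illustrative stand-ins for what Hadoop distributes across a cluster.

```python
from collections import defaultdict

def map_fn(record):
    # record: (timestamp, sensor_id, flag, value).
    # Emit the sensor id as the key so that all readings from
    # one sensor reach the same reducer.
    t, sensor, flag, value = record
    yield sensor, value

def reduce_fn(sensor, values):
    # Summarize all values sharing a key; here, a per-sensor mean.
    vals = list(values)
    return sensor, sum(vals) / len(vals)

def run_mapreduce(records):
    # Sequential stand-in for the shuffle/sort that Hadoop performs.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

readings = [(0, "s1", 0, 60.0), (1, "s1", 0, 59.5), (0, "s2", 0, 60.25)]
print(run_mapreduce(readings))  # {'s1': 59.75, 's2': 60.25}
```

In RHIPE, the map and reduce functions are written as ordinary R expressions and shipped to the Hadoop workers, so the analyst never leaves the interactive R session.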
RHIPE is tightly integrated with Hadoop, providing support for many Hadoop features, such as combiners, custom partitioning, the distributed cache, and traditional input/output formats (map and sequence files, text files, and HBase), as well as support for adding custom jars for other input formats. Data is serialized using protocol buffers, making it easy to share data with other applications.

The compute paradigm for Hadoop is MapReduce. MapReduce provides a versatile, high-level parallelization for solving many data-intensive problems, including many statistical and machine learning algorithms, through user-specified Map and Reduce functions [8]. In this approach, an algorithm is split into a map component, which performs an initial data processing step and outputs a set of key-value pairs, and a reduce component, which performs a second processing step on all values with the same key. This two-step approach is relatively general and has been widely adopted by the data analysis community. However, many algorithms that fall into the MapReduce paradigm are not practically feasible, such as algorithms that require several iterations or require large amounts of state information to be saved between iterations [8]. Further, many algorithms do not fit into the MapReduce paradigm at all; for example, certain graph analyses cannot be easily translated to self-contained map and reduce steps, but rather require communication across tasks.

Divide and Recombine (D&R) is a new statistical approach to the analysis of large complex data [9] that seeks to overcome some of the limitations of MapReduce. This approach is similar to MapReduce in that data is divided into subsets, analytic methods are applied to each subset in an embarrassingly parallel manner, and the outputs of each method are recombined to form a result for the entire data. The key difference is in how the data subsets are generated.
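Although D&R work is carried out in R with RHIPE, the divide/fit/recombine pattern itself can be sketched briefly. In this illustrative Python sketch the linear model, the noise level, and the split by record order are all assumptions chosen for the example; the recombination is the coefficient averaging mentioned above.

```python
import random
import statistics

random.seed(42)

# Hypothetical data following y = 2x + 1 plus noise.
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.5))
        for x in (random.uniform(0, 10) for _ in range(5000))]

# Divide: ten subsets, processed in an embarrassingly parallel manner.
subsets = [data[i::10] for i in range(10)]

def fit_line(subset):
    # Analytic method applied independently to one subset:
    # ordinary least squares for slope and intercept.
    xs, ys = zip(*subset)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in subset)
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Recombine: average the per-subset coefficients into a global fit.
fits = [fit_line(s) for s in subsets]
slope = statistics.fmean(f[0] for f in fits)
intercept = statistics.fmean(f[1] for f in fits)
print(round(slope, 1), round(intercept, 1))  # close to 2.0 and 1.0
```

For a well-behaved division such as this one, the averaged coefficients closely approximate the global least-squares fit; the research question noted below is how to choose divisions for which this holds.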
Specifically, subsets are created such that the results of applying a method independently to each subset can be recombined (e.g., by averaging model coefficients) to achieve a very good approximation to a global model fit. Similar ideas are being researched [10]. The challenge is to develop subsetting algorithms that produce good results while being sufficiently simpler than the analysis algorithm being approximated. Assuming that the data can be effectively subset, D&R allows a data analyst to apply almost any existing statistical or visualization method to large complex data. Current research in D&R seeks to develop the best division and recombination procedures for various analytic methods.

B. Visualization Databases

As previously noted, EDA requires the ability to visualize analytical results. Additionally, for comprehensive analysis, it is important to be able to visualize the data in detail. One approach is to subset the data and examine a few subsets in detail, but this may not be sufficient to identify patterns that span subsets or occur infrequently. In these cases, it is useful to apply a visualization method to every subset, creating a collection of small multiples [11] that spans the entire data set. Viewing a collection of small multiples is effective in revealing repetitions or changes, while allowing an analyst to see the data at finer levels of detail. This visualization approach has been an effective tool for many years and is built into many successful visualization software packages, such as the lattice package for the R statistical computing environment [12] and the underlying trellis framework [13]. A collection of small multiples is referred to as a single display. Since many displays are created during analysis, an effective way to manage them is required. This is the role of a visualization database (VDB) [14].
This capability is particularly important when working with large data sets: as the size or complexity of the data increases, the number of small multiples grows beyond the capacity of the human mind to simultaneously capture both the details and the larger-scale trends or patterns. If these displays are stored and available for additional analysis, it is possible to have the computer determine which displays might be interesting [15]. This is accomplished by calculating diagnostic quantities, or cognostics [16], for each panel to provide a rank for its potential usefulness. Cognostics can be simple statistical quantities (range, standard deviation, model coefficients) or can be tailored to the visualization at hand to differentiate among
large numbers of panels with respect to context-relevant attributes.

IV. USING EDA AND ML TOGETHER

There has been significant research in the area of distributed algorithms for machine learning, leading to an impressive collection of tools for large-scale data analysis. Unfortunately, difficulties can arise when using these tools in practice. In particular, with numerical methods alone, it can be difficult to:
- identify and understand incorrect or anomalous records within the data set that may be negatively affecting the ML algorithm
- determine or understand the effectiveness of the model that the ML algorithm has learned, and how to change the model to provide better results
- interpret the scientific meaning or impact of the algorithm results

As outlined below, EDA can be used to address these concerns, forming a complementary cycle in which EDA and ML are performed iteratively.

A. Data Preparation and Identification of Incorrect Data

EDA naturally complements ML in the preprocessing and cleaning stage. Environments suited to EDA, as outlined previously, can be very helpful with tasks like getting the data into the proper format, getting a feel for the data, cleaning the data, performing feature selection, and determining appropriate directions to take. Often these tasks are pushed to the background, with all importance placed on the learning stage. However, in the experience of practitioners, these tasks constitute 80% of the effort and determine 80% of the value of the ultimate results [17]. These tasks are typically thought of as a one-time up-front cost, but in practical applications they are ongoing efforts, with ML algorithm results bringing new issues to light at each iteration. In Section V, we illustrate this point with an example.

B. Understanding the Effectiveness of the ML Algorithm

Often the overall classification accuracy or prediction error metrics from an ML algorithm do not tell the whole story.
When focusing only on these metrics, we run the risk of choosing the best performer from a class of poor performers. Injecting EDA techniques into the process can provide insight into which regions of the design space are performing well or poorly, and spark ideas for improvements to ML algorithms. A little bit of strategy from the human (EDA) can go a long way in harnessing the immense tactical power of the machine (ML).

The benefit of EDA, however, rests on the ability of the analyst to guide the process. This benefit is minimized when the analyst is unable to guide the process, for example because they are unfamiliar with the domain or unclear about what they are looking for. In this case, an unsupervised approach is likely to generate a more useful model. An unsupervised approach is also likely to be more efficient when the features that comprise the model are known in advance, since the analyst would otherwise need to develop a similar model from scratch.

As EDA is a rather general approach, it is difficult to prescribe a concrete set of steps that illustrate its use in this context. However, in our experience we have noted the use of exploratory techniques, primarily driven by visualization, in helping to validate assumptions of the ML method being applied, identify transformations that yield a far more parsimonious model, identify interactions missed by the ML method, and leverage domain expertise. The importance of EDA in the ML process is well summarized by Tukey: "How do we avoid analysis that the data before us indicate should be avoided? By exploring the data, before, during, and after analysis, for hints, ideas, and, sometimes, a few conclusions." [3]

C. Interpreting Results

When the context of an analysis relates to scientific discovery, the goal is not necessarily the most accurate prediction or classification, but a sound understanding of the phenomena upon which the data is based.
In this case, we are most interested in building interpretable models, checking whether the data supports hypothesized models, and looking for new phenomena. The role of ML in interpretable scientific discovery has been a topic of debate [18], but we see ML methods as essential here, with EDA helping with the interpretation of results.

V. USE CASE

A detailed example of how EDA and ML can be iteratively used to provide novel analytical results is outside the scope of this paper. Instead, a simple use case demonstrates how combining EDA-based data cleaning with event extraction can be used in a complementary fashion on a realistic data set: specifically, given power grid sensor data, identify and extract records detailing a specific type of event, an islanding event. The data is a time series of readings from 38 sensors on a single network, each reporting a set of values thirty times per second. Each record can be thought of as a tuple (t, s, f, v), where t is the timestamp, s is the sensor, f is a flag value indicating the sensor's state, and v is a vector of values reported by that sensor at the specified time. When operating normally, the network connects all sensors, and given physical constraints, each sensor should report closely related values. Islanding occurs when network connections are broken and the network is divided into two smaller, disconnected networks. In this case, sensors may report significantly different values if they are on different partitions.

A model was easily developed that could identify islands within a simple test data set. Unfortunately, when the model was applied to the entire data set, an unusually high number of islanding events was detected. EDA was used to determine why the model was not performing as expected: the sensor data stream had extremely high occurrences of erroneous data, leading the model to generate incorrect results.
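The intuition behind islanding detection can be sketched as a check on the spread of values across sensors at one timestamp. This is an illustrative Python sketch, not the model used in the study; the spread statistic (population standard deviation) and the threshold are hypothetical assumptions.

```python
from statistics import pstdev

# Hypothetical threshold: under normal operation, all sensors on a
# single connected network should report closely related values.
SPREAD_THRESHOLD = 0.5

def possible_islanding(readings_at_t):
    # readings_at_t: {sensor_id: value} for one timestamp.
    # A large spread across sensors suggests the network has split
    # into disconnected partitions reporting divergent values.
    values = list(readings_at_t.values())
    return pstdev(values) > SPREAD_THRESHOLD

print(possible_islanding({"s1": 60.0, "s2": 60.02, "s3": 59.99}))  # False
print(possible_islanding({"s1": 60.0, "s2": 58.1, "s3": 58.2}))    # True
```

As the use case shows, a check this simple is easily fooled by erroneous data, which is precisely why the EDA-driven cleaning described next was necessary.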
Starting with an initial collection of statistics, EDA was used to identify three distinct types of errors within the data set: previously unknown flags (f) indicating sensor errors, sensors producing a single repeated value, and sensors generating white noise. The first collection of erroneous records was identified when error conditions previously unknown to local experts were discovered through an analysis that found a 100% correlation between f and v. The second set of erroneous records was identified when v was unchanged for a length of time determined by a statistical analysis of the data, flagging repeated sequences whose length exceeded the extreme tail of a fitted geometric distribution. The final set of records was defined as those records with no significant autocorrelation between values on a given sensor across a small timescale. The interested reader is directed to [19] for additional insight into this analysis.

Determining the specific statistical models that identified the erroneous data required iteratively computing and analyzing statistical measures against the underlying data set. This would typically begin with an analysis computed against the entire data set using RHIPE, which would take between 10 and 30 minutes. From there, we would identify subsets of the data that appeared to be of interest, for example, records with a specific flag value. These subsets were then extracted from the full data set and analyzed individually to determine whether there was a pattern of interest (e.g., a specific flag always had a certain value). Once a potential pattern was identified and defined as a model, the model was validated by running it against the entire data set and analyzing all instances it identified. Once validated, the model could be used either to filter the data stream in real time to improve downstream analysis tasks or to filter the archived data as appropriate for off-line analysis.
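The second and third error types above lend themselves to small sketches. The following illustrative Python functions are assumptions about how such filters might look, not the study's actual code: a geometric-tail cutoff for runs of repeated values (the "stuck" sensor) and a lag-1 autocorrelation statistic for white-noise detection, with the tail probability and correlation cutoff chosen arbitrarily for the example.

```python
import math

def max_run_length(values):
    # Longest run of consecutive identical values in the series.
    best = run = 1
    for prev, cur in zip(values, values[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def stuck_cutoff(p_repeat, tail=1e-6):
    # If repeats occur independently with probability p_repeat, run
    # lengths behave geometrically: P(run > k) = p_repeat ** k.
    # Flag runs falling past the extreme tail of that distribution.
    return math.ceil(math.log(tail) / math.log(p_repeat))

def lag1_autocorrelation(values):
    # White noise shows no significant correlation between successive
    # readings; a real sensor signal does at this timescale.
    n = len(values)
    mean = sum(values) / n
    num = sum((values[i] - mean) * (values[i + 1] - mean)
              for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in values)
    return num / den

signal = [60.0, 60.01, 60.02, 60.02, 60.01, 60.0, 59.99, 59.98]
print(lag1_autocorrelation(signal) > 0.2)   # smooth signal: True
print(max_run_length([1, 1, 1, 2, 2, 3]))   # 3
```

In practice, p_repeat and the run-length statistics would themselves be estimated from the data with RHIPE, following the iterative compute-and-inspect loop described above.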
After each filter was developed, the event detection model was re-run. The results from each run were then analyzed using EDA to determine why there were more events than expected. This iteration continued until the results produced by the event detection model on the filtered data were validated.

In our application, EDA was particularly useful because it was impossible to describe in advance the patterns that needed to be identified. For example, it would be extremely unlikely to know in advance that sensors will occasionally generate data that looks like white noise. Without this pattern being known in advance, it is unlikely that training sets for supervised learning algorithms would include sufficient data for an algorithm to determine the pattern. Similarly, if the pattern that represents the event is unexpected, it may be outside the scope of what traditional machine learning algorithms can identify. Many traditional algorithms would not generate correlation measures both along a time series and across different time series, and would therefore have missed some of our erroneous records. Finally, in high-dimensional spaces, expert insight can be critical in reducing the search space to a manageable set of alternatives. For example, knowing that network readings should not remain constant for an extended period of time, even though brief runs of constant values occur frequently, led to the geometric distribution analysis that identifies when a sensor is stuck.

VI. CONCLUSIONS

Traditional machine learning algorithms are effective at identifying, classifying, and categorizing large-scale data sets using complex statistical models. Unfortunately, training and validating these models requires access to data subsets that accurately represent the larger data set. Ensuring the model has sufficient information to learn everything it needs to know is a challenge unto itself.
Exploratory data analysis techniques can be used to address these concerns. Historically, such interactive techniques have been excluded from this domain because they were limited to small data sets. This has changed with new, scalable EDA frameworks such as RHIPE, which combine the flexibility of R with the scalability of Hadoop. Based on our experience iteratively validating event detection models, we believe that applying EDA techniques in combination with ML provides the opportunity to overcome the limitations inherent in both approaches and to produce better analytical results.

REFERENCES

[1] L. Wilkinson. The impact of Tukey's exploratory data analysis. Chicago chapter of the American Statistical Association Spring Conference, May 5.
[2] L. Wilkinson, A. Anand, and R. Grossman. High-dimensional visual analytics: Interactive exploration guided by pairwise views of point distributions. IEEE Transactions on Visualization and Computer Graphics, 12(6).
[3] J. W. Tukey. We need both exploratory and confirmatory. American Statistician, pages 23-25.
[4] J. W. Tukey. Exploratory Data Analysis. Addison-Wesley, Reading, Massachusetts, USA.
[5] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
[6] MATLAB. Natick, Massachusetts: The MathWorks Inc.
[7] RHIPE.
[8] C. T. Chu, S. K. Kim, Y. A. Lin, Y. Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. Advances in Neural Information Processing Systems, 19:281.
[9] S. Guha, R. Hafen, J. Rounds, J. Xia, J. Li, B. Xi, and W. Cleveland. Large complex data: divide and recombine (D&R) with RHIPE. Stat, 1(1):53-67.
[10] A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan. Bootstrapping big data. Big Learn.
[11] E. Tufte, N. Goeler, and R. Benson. Envisioning Information, volume 21. Graphics Press, Cheshire, CT.
[12] D. Sarkar. Lattice: Multivariate Data Visualization with R.
Springer Verlag.
[13] R. A. Becker, W. S. Cleveland, and M.-J. Shyu. The visual design and control of trellis display. Journal of Computational and Graphical Statistics, 5(2), 1996.
[14] S. Guha, P. Kidwell, R. P. Hafen, and W. S. Cleveland. Visualization databases for the analysis of large complex datasets. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS).
[15] J. Friedman and W. Stuetzle. John W. Tukey's work on interactive graphics. Annals of Statistics.
[16] W. S. Cleveland. The Collected Works of John W. Tukey: Graphics, volume 5. Chapman & Hall/CRC.
[17] T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. Wiley-Interscience.
[18] L. Breiman. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 2001.
[19] R. Hafen, T. Gibson, K. Kleese van Dam, and T. Critchlow. Power grid data analysis with R and Hadoop. In Data Mining Applications with R. Elsevier, in press.
Data, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
not possible or was possible at a high cost for collecting the data.
Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du [email protected] University of British Columbia
Perspectives on Data Mining
Perspectives on Data Mining Niall Adams Department of Mathematics, Imperial College London [email protected] April 2009 Objectives Give an introductory overview of data mining (DM) (or Knowledge Discovery
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
Navigating Big Data business analytics
mwd a d v i s o r s Navigating Big Data business analytics Helena Schwenk A special report prepared for Actuate May 2013 This report is the third in a series and focuses principally on explaining what
BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics
BIG DATA & ANALYTICS Transforming the business and driving revenue through big data and analytics Collection, storage and extraction of business value from data generated from a variety of sources are
Information Visualization WS 2013/14 11 Visual Analytics
1 11.1 Definitions and Motivation Lot of research and papers in this emerging field: Visual Analytics: Scope and Challenges of Keim et al. Illuminating the path of Thomas and Cook 2 11.1 Definitions and
Foundations of Business Intelligence: Databases and Information Management
Foundations of Business Intelligence: Databases and Information Management Problem: HP s numerous systems unable to deliver the information needed for a complete picture of business operations, lack of
Pentaho Data Mining Last Modified on January 22, 2007
Pentaho Data Mining Copyright 2007 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at www.pentaho.org
Statistics 215b 11/20/03 D.R. Brillinger. A field in search of a definition a vague concept
Statistics 215b 11/20/03 D.R. Brillinger Data mining A field in search of a definition a vague concept D. Hand, H. Mannila and P. Smyth (2001). Principles of Data Mining. MIT Press, Cambridge. Some definitions/descriptions
Big Data Analytics. The Hype and the Hope* Dr. Ted Ralphs Industrial and Systems Engineering Director, COR@L Laboratory
Big Data Analytics The Hype and the Hope* Dr. Ted Ralphs Industrial and Systems Engineering Director, COR@L Laboratory * Source: http://www.economistinsights.com/technology-innovation/analysis/hype-and-hope/methodology
Cloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
Big Data: Rethinking Text Visualization
Big Data: Rethinking Text Visualization Dr. Anton Heijs [email protected] Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important
Specific Usage of Visual Data Analysis Techniques
Specific Usage of Visual Data Analysis Techniques Snezana Savoska 1 and Suzana Loskovska 2 1 Faculty of Administration and Management of Information systems, Partizanska bb, 7000, Bitola, Republic of Macedonia
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
The Internet of Things and Big Data: Intro
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific
Data Mining Analytics for Business Intelligence and Decision Support
Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing
REVIEW OF ENSEMBLE CLASSIFICATION
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.
Introduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
Predicting the Stock Market with News Articles
Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is
Mobile Phone APP Software Browsing Behavior using Clustering Analysis
Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis
Trelliscope: A System for Detailed Visualization in the Deep Analysis of Large Complex Data
Trelliscope: A System for Detailed Visualization in the Deep Analysis of Large Complex Data Ryan Hafen Karin Rodland Luke Gosink Kerstin Kleese-Van Dam Jason McDermott William S. Cleveland Purdue University
Data Mining Applications in Higher Education
Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2
Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.
Introduction p. xvii Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p. 9 State of the Practice in Analytics p. 11 BI Versus
Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control
Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control Andre BERGMANN Salzgitter Mannesmann Forschung GmbH; Duisburg, Germany Phone: +49 203 9993154, Fax: +49 203 9993234;
Introduction to Data Mining
Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA
Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,
DATA MINING TECHNIQUES AND APPLICATIONS
DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,
Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
MapReduce/Bigtable for Distributed Optimization
MapReduce/Bigtable for Distributed Optimization Keith B. Hall Google Inc. [email protected] Scott Gilpin Google Inc. [email protected] Gideon Mann Google Inc. [email protected] Abstract With large data
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.
How To Use Neural Networks In Data Mining
International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and
DATA PREPARATION FOR DATA MINING
Applied Artificial Intelligence, 17:375 381, 2003 Copyright # 2003 Taylor & Francis 0883-9514/03 $12.00 +.00 DOI: 10.1080/08839510390219264 u DATA PREPARATION FOR DATA MINING SHICHAO ZHANG and CHENGQI
16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
1.1 Difficulty in Fault Localization in Large-Scale Computing Systems
Chapter 1 Introduction System failures have been one of the biggest obstacles in operating today s largescale computing systems. Fault localization, i.e., identifying direct or indirect causes of failures,
Advanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
Big Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
3TU.BSR: Big Software on the Run
Summary Millions of lines of code - written in different languages by different people at different times, and operating on a variety of platforms - drive the systems performing key processes in our society.
Big Data and Market Surveillance. April 28, 2014
Big Data and Market Surveillance April 28, 2014 Copyright 2014 Scila AB. All rights reserved. Scila AB reserves the right to make changes to the information contained herein without prior notice. No part
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
Accelerating Hadoop MapReduce Using an In-Memory Data Grid
Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Why do statisticians "hate" us?
Why do statisticians "hate" us? David Hand, Heikki Mannila, Padhraic Smyth "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data
A GENERAL TAXONOMY FOR VISUALIZATION OF PREDICTIVE SOCIAL MEDIA ANALYTICS
A GENERAL TAXONOMY FOR VISUALIZATION OF PREDICTIVE SOCIAL MEDIA ANALYTICS Stacey Franklin Jones, D.Sc. ProTech Global Solutions Annapolis, MD Abstract The use of Social Media as a resource to characterize
A Mind Map Based Framework for Automated Software Log File Analysis
2011 International Conference on Software and Computer Applications IPCSIT vol.9 (2011) (2011) IACSIT Press, Singapore A Mind Map Based Framework for Automated Software Log File Analysis Dileepa Jayathilake
Integrating a Big Data Platform into Government:
Integrating a Big Data Platform into Government: Drive Better Decisions for Policy and Program Outcomes John Haddad, Senior Director Product Marketing, Informatica Digital Government Institute s Government
Tom Khabaza. Hard Hats for Data Miners: Myths and Pitfalls of Data Mining
Tom Khabaza Hard Hats for Data Miners: Myths and Pitfalls of Data Mining Hard Hats for Data Miners: Myths and Pitfalls of Data Mining By Tom Khabaza The intrepid data miner runs many risks, including being
Concept and Project Objectives
3.1 Publishable summary Concept and Project Objectives Proactive and dynamic QoS management, network intrusion detection and early detection of network congestion problems among other applications in the
EFFICIENT DATA PRE-PROCESSING FOR DATA MINING
EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College
Introduction. A. Bellaachia Page: 1
Introduction 1. Objectives... 3 2. What is Data Mining?... 4 3. Knowledge Discovery Process... 5 4. KD Process Example... 7 5. Typical Data Mining Architecture... 8 6. Database vs. Data Mining... 9 7.
Pattern-Aided Regression Modelling and Prediction Model Analysis
San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Fall 2015 Pattern-Aided Regression Modelling and Prediction Model Analysis Naresh Avva Follow this and
Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health
Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining
ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS
CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant
Internals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing
Introduction to Data Mining and Machine Learning Techniques Iza Moise, Evangelos Pournaras, Dirk Helbing Iza Moise, Evangelos Pournaras, Dirk Helbing 1 Overview Main principles of data mining Definition
