Big Data Analytics: An Introduction
Oliver Fuchsberger
University of Paderborn, 2014

Table of Contents
I. Introduction & Motivation
  o What is Big Data Analytics?
  o Why is it so important?
II. Techniques & Solutions
  o Business Strategies
  o Data Storage
  o Data Diversity: Information Filtering, Real-Time Data, Analysis Techniques
III. Conclusion

PART I: Introduction & Motivation

Big Data Analytics in a Cloud.

What is Big Data Analytics?
Buzzword for a combination of:
  o Big Data
  o Advanced Analytics
Not just one data type and not just one technique
But we will see this in a minute!

Big Data: The Three Vs (I)
Most definitions focus on the data size
  o NOT SUFFICIENT!
Big Data can be defined using the three Vs:
  o Volume
  o Velocity
  o Variety
The measurements for each V are very diverse

Big Data: The Three Vs (II)
Volume:
  o Gigabytes, terabytes or petabytes
  o Number of files or records
Velocity:
  o Real-time (as a stream)
  o Batches
Variety:
  o Structure of the data (unstructured, semi-structured or structured)
  o Web data
  o Real-time data

Advanced Analytics (I)
Advanced Analytics, like Big Data Analytics, is a buzzword!
It stands for a collection of different analysis techniques
  o All techniques are suited to deal with unknown data sets
A.k.a. Discovery Analytics

Advanced Analytics (II)
Some techniques:
  o Predictive analytics
  o Data mining
  o Statistical analysis
  o Natural language processing
  o Database capabilities: MapReduce, in-database analytics, in-memory databases

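Of the database capabilities listed above, MapReduce is the easiest to show in a few lines. The following is a minimal single-process sketch of the map, shuffle and reduce steps using word counting as the task; the function names and the sample documents are illustrative assumptions, and a real framework would distribute each phase across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit one (word, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input.
    print(word_count(["big data analytics", "big data is not just big"]))
```

Because each map call only sees one document and each reduce call only sees one key, both phases can run in parallel on separate nodes.
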
Importance of Big Data Analytics (I)
According to TDWI, Big Data Analytics is seen as one of the most profound trends in Business Intelligence
Today more and more data is collected by enterprises
  o See Big Data
To gain new insights, this data has to be analysed
  o Not possible with standard analytic platforms

Importance of Big Data Analytics (II)
The 5 main benefits are:
1. Better targeted social influencer marketing (61%)
2. More numerous and accurate business insights (45%)
3. Segmentation of customer base (41%)
4. Recognition of sales and market opportunities (38%)
5. Automated decisions for real-time processes (37%)

Importance of Big Data Analytics (III)
The 5 main barriers are:
1. Inadequate staffing or skills for big data analytics (46%)
2. Cost, overall (42%)
3. Lack of business sponsorship (38%)
4. Difficulty of architecting a big data analytics system (33%)
5. Current database software lacks in-database analytics (32%)

PART II: Techniques & Solutions

Business Strategies: Problems
A strategy or architecture for dealing with Big Data Analytics is needed
Problems:
  o Different programming abstractions (compared to a desktop environment)
  o Every choice has direct dollar costs, regardless of the field: computation, upload / download, data storage

Business Strategies: Cloud Computing
Every choice directly affects the computation time!
Supports many virtual machines
Paying more correlates with more computation power
  o Doubling memory or speed does not necessarily halve the computation time!
There are many vendor-based solutions for uploading data into cloud databases

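One common way to see why doubling resources does not halve the time is Amdahl's law, which the slides do not name and which is brought in here only as an illustrative assumption. The sketch below supposes that a fixed fraction of the job benefits from extra resources while the rest stays serial; the fraction and the job length in the usage note are hypothetical.

```python
def speedup(parallel_fraction, resource_factor):
    """Amdahl's law: overall speedup when only part of a job scales with resources.

    parallel_fraction: share of the runtime that benefits from more resources (0..1)
    resource_factor:   how many times more resources are bought (e.g. 2 = doubled)
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if resource_factor <= 0:
        raise ValueError("resource_factor must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / resource_factor)


def runtime(base_hours, parallel_fraction, resource_factor):
    """Estimated runtime after scaling the resources by resource_factor."""
    return base_hours / speedup(parallel_fraction, resource_factor)
```

Under these assumptions, a 10-hour job of which 80% scales would take about 6 hours after doubling the resources (`runtime(10, 0.8, 2)`), not 5, while the bill for the doubled resources still applies to the whole run.
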
Data Storage: The HDFS Goals
HDFS (Hadoop Distributed File System) is often grouped with the so-called NoSQL systems, although strictly it is a distributed file system rather than a database
Goals of the HDFS:
  o Fault detection & fast automatic recovery
  o Streaming data access
  o Handling large data sets
  o Simple coherency model
  o Moving computation is cheaper than moving data
  o Portability

Data Storage: The HDFS Architecture
[Figure: HDFS architecture]

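HDFS keeps file metadata on a NameNode and stores replicated, fixed-size blocks on DataNodes. The sketch below models only that bookkeeping. The block size of 128 MB and the replication factor of 3 are common defaults assumed here, the node names are hypothetical, and the round-robin placement is a deliberate simplification of HDFS's rack-aware policy.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # bytes; common default, assumed for this sketch
REPLICATION = 3                 # common default replication factor, assumed


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a NameNode-style table: block index -> DataNodes holding a replica.

    Replicas are placed round-robin, which is simpler than real HDFS placement.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    table = {}
    for block in range(n_blocks):
        table[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return table


def surviving_replicas(table, failed_node):
    """Replicas that remain readable for each block after one DataNode fails."""
    return {
        block: [node for node in nodes if node != failed_node]
        for block, nodes in table.items()
    }
```

For a 300 MB file on four DataNodes this yields three blocks, and every block stays readable after any single node fails, which is the basis for the fault detection and automatic recovery goal.
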
Data Diversity: Filtering Information (I)
Data mining describes:
  o The application of methods and algorithms
  o Supporting or enabling the extraction of empirical links between data objects in data sets
Goals of data mining:
  o Find new correlations, patterns and trends inside large amounts of data

Data Diversity: Filtering Information (II)
Most of the arriving data is unlabeled => classification is not possible
A cluster is:
  o A group of same or similar elements gathered or occurring closely together
Task:
  o Organize a collection of n objects into a partitioning or a hierarchy of partitions
  o Label the data

Data Diversity: Filtering Information (III)
Problems:
  o Measuring similarity
  o The unknown number of clusters needed
  o Cluster validity
  o Outliers

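The slides do not prescribe a clustering algorithm. As one illustration, the sketch below uses k-means with Euclidean distance as the similarity measure, which is an assumption made for this example. Starting from the first k points is a simplification that keeps the run deterministic, and `inertia` is a hypothetical helper for comparing different values of k.

```python
import math


def euclidean(a, b):
    """Similarity measure used here: straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, max_iterations=100):
    """Partition points into k clusters; returns (labels, centroids)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]  # simplistic deterministic start
    labels = None
    for _ in range(max_iterations):
        # Assignment step: label each point with its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


def inertia(points, labels, centroids):
    """Sum of squared distances to the assigned centroid (lower = tighter clusters)."""
    return sum(euclidean(p, centroids[label]) ** 2 for p, label in zip(points, labels))
```

Running this for several values of k and comparing the inertia is one rough way to approach the unknown number of clusters. It does not address cluster validity or outliers, which can pull a centroid away from its group.
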
Data Diversity: Real-Time Data (I)
CEP: Complex Event Processing
Events are complex in the sense of the relations between arriving data parts
CEP systems do not only consider arriving events in isolation from each other
  o Timestamp + content + optional constraints
The goal is to identify interesting situations by processing event notifications (not generic data)

Data Diversity: Real-Time Data (II)
CEP is an extension of the traditional publish-subscribe interaction concept:
  o Observer: RSS feed (example)
  o Consumer: other systems
Examples of CEP engines:
  o Next CEP (rule-based pattern detection)
  o PB-CEP (plan-based pattern detection)

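To make rule-based pattern detection concrete, here is a minimal sketch of a single CEP rule. It is not the API of any engine named above: the `Event` and `BurstDetector` names, the event kinds and the rule itself (several events of one kind from the same source within a time window) are hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds
    kind: str         # event type, e.g. "login_failed"
    source: str       # content: who or what produced the event


class BurstDetector:
    """Rule: `threshold` events of `kind` from one source within `window` seconds."""

    def __init__(self, kind, threshold, window):
        self.kind = kind
        self.threshold = threshold
        self.window = window
        self._recent = {}  # source -> deque of timestamps still inside the window

    def on_event(self, event):
        """Consume one event notification; return a composite event or None."""
        if event.kind != self.kind:
            return None
        recent = self._recent.setdefault(event.source, deque())
        recent.append(event.timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while recent and event.timestamp - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.threshold:
            recent.clear()  # start fresh so one burst raises one composite event
            return Event(event.timestamp, self.kind + "_burst", event.source)
        return None
```

The detector relates each arriving event to earlier ones from the same source, so the interesting situation is the combination of events and not any single notification.
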
Data Diversity: Analysis Techniques (I)
In-database analytics: analytical computations are moved into the database system
  o Model scoring
  o Predictive analytics
  o And others
Calculations are executed in a single, centralized location
  o Data is accessed right where it is stored
  o No data extraction
  o Memory capabilities
  o Load balancing
  o Parallel processing

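Model scoring inside the database can be sketched with Python's built-in `sqlite3` module standing in for an analytic database. The table, its columns and the linear model weights are hypothetical. The point is only that the score is computed by the SQL engine where the data lives, and that only the results leave the database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, made-up customers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, visits INTEGER, spend REAL)"
    )
    conn.executemany(
        "INSERT INTO customers (id, visits, spend) VALUES (?, ?, ?)",
        [(1, 2, 50.0), (2, 10, 400.0), (3, 5, 120.0)],
    )
    conn.commit()
    return conn


def score_in_database(conn, w_visits, w_spend, bias):
    """Score every row with a linear model inside the database engine.

    No raw rows are extracted; the query returns (id, score) pairs only.
    """
    return conn.execute(
        "SELECT id, ? + ? * visits + ? * spend AS score "
        "FROM customers ORDER BY score DESC",
        (bias, w_visits, w_spend),
    ).fetchall()
```

A call such as `score_in_database(build_demo_db(), 1.0, 0.01, 0.5)` returns the customers ranked by score. In a production system the same idea lets the database apply its own memory management, load balancing and parallel processing to the computation.
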
Data Diversity: Analysis Techniques (II)
Using historical data to predict the future (long or short term)
  o Data mining techniques (clustering, regression, classification)
  o Statistical analysis techniques
Build a predictive model
  o Exploit patterns in historical data to identify risks and opportunities
Combination with CEP makes sense:
  o CEP can ensure the calculation of the predictors (main problem!)
  o Short-term realization of complex events

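As a minimal illustration of building a predictive model from historical data, the sketch below fits a straight line by ordinary least squares, one simple form of the regression mentioned above. The function names and the sales figures in the usage note are hypothetical, and real predictive models usually use many predictors and a proper validation step.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new x value."""
    slope, intercept = model
    return slope * x + intercept
```

For example, fitting made-up monthly sales of 10, 12, 14 and 16 for months 1 to 4 and calling `predict(model, 5)` extrapolates the trend to month 5. In the combination with CEP described above, the inputs to `predict` would be the predictors calculated from the event stream.
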
PART III: Conclusion

Summary: What we've seen!
Big Data is not all about size
Big Data Analytics is important due to its positive influence on many enterprise departments. But it is expensive!
One needs the right computation platform, storage system and analysis techniques, depending on the data one is working with
  o Cloud Computing
  o HDFS
  o CEP / In-database Analytics

FINAL WORDS
All presented techniques are just examples
  o Many more systems and software products are available in this field
People from many different fields have to work together to enable the analysis of big data.
  o Business analysts
  o Database specialists
  o System engineers
  o ...

Thank You for Your Attention!
ANY QUESTIONS?