Big Data and Data Science: Behind the Buzz Words

Peggy Brinkmann, FCAS, MAAA
Actuary, Milliman, Inc.
April 1, 2014

Contents
- Big data: from hype to value
- Deconstructing data science
- Managing big data
- Analyzing big data

Big data: from hype to value
"Show me the money." - Jerry Maguire (1996)

The real issue
Data that you can't process and use quickly enough with the technology you have.
Possible reasons for this:
- Volume
- Velocity
- Variety (diverse/unstructured formats)
Not a new problem, but new data sources are increasing the amount of challenging data.

Sources of challenging data
- Transactions
- Web log files
- Mobile
- Voice, images, text, video from web and other sources
- Sensors
- Genomic

New data management solutions
- The need to handle larger volumes, unstructured formats, and/or real-time processing has driven new technologies
- These can lower costs and increase processing speeds for data that can't be handled well with relational databases and/or single servers

Opportunities from big data
- Cost reduction
- Improve models/decisions with:
  - new data
  - more data
  - faster cycle times
- New products and services

What about insurance?
- Product design
- Marketing
- Underwriting
- Pricing
- Sales management
- Claims
- IT

Develop a strategy
- What does your business need?
- What data do you have that is underutilized?
- What data are you missing that would be valuable?

Deconstructing data science
Mr. Maguire: I just want to say one word to you, just one word.
Ben: Yes, sir.
Mr. Maguire: Are you listening?
Ben: Yes, I am.
Mr. Maguire: Plastics.
- The Graduate (1967)

Some definitions of "data scientist"
- A data analyst in California
- A statistician under 35
- A developer of data products
- A practitioner of data jujitsu

Something new, or re-branding?
C. F. Jeff Wu (1998):
- Data collection
- Modeling and analysis
- Problem solving and decision making
William S. Cleveland (2001):
- Multidisciplinary investigation
- Models and methods
- Computing with data
- Tool evaluation

Some more recent attempts
- The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it
- Combine the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data
- Start by looking at what the data can tell them, and then picking interesting threads to follow, rather than the traditional scientist's approach of choosing the problem first and then finding data to shed light on it
- Extract information from large datasets and then present something of use to non-data experts

What seems different
- Using large datasets
- Hands-on, heavy data prep of unstructured data
- Coding with general purpose languages (Python, C++, Java)
- Starting with the data, not a question?
- Emphasis on storytelling/visualization

Family Tree
(Diagram; node labels: Statistics, Machine Learning, Data Prep (RDB), Data Mining, Analytics, Data Prep (NoSQL), Data Science, Data Visualization)

Managing big data
"You're gonna need a bigger boat." - Jaws (1975)

Managing big data
- Distribute data storage and data processing across multiple computers
- Can use cheaper, commodity hardware: because data is duplicated on multiple machines, it can be recovered when one fails
- Faster run times: use the parallel computing power of the machines where the data is stored, and avoid the I/O of extracting data first
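
The replication idea above can be illustrated with a small sketch. This is a toy round-robin placement written for this note, not the placement policy of HDFS or any other real system; the node names, the function names, and the choice of three copies per block are all assumptions made for illustration.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        # Consecutive offsets modulo the node count give distinct nodes.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block
        for block, replicas in placement.items()
        if any(node not in failed for node in replicas)
    }


# Hypothetical four-machine cluster holding eight blocks.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(8, nodes, replication=3)
survivors = readable_blocks(placement, failed={"node-b"})
print(len(survivors), "of", len(placement), "blocks still readable")
```

With a single copy per block, losing one machine would lose every block stored only there; with several copies on distinct machines, the same failure leaves every block readable.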

Let's talk about the elephant in the room, Hadoop
- Software framework for storing and processing structured and unstructured data
- Distributes (and replicates) your data across multiple commodity machines (a "cluster")
- File system (HDFS) keeps track of where the data is
- Programming framework (MapReduce) to process the data

Many Hadoop vendors
- Apache
- Cloudera
- Hortonworks
- IBM
- MapR (although technically a different file system)
- Microsoft
- Pivotal

What is MapReduce?
Source: http://kickstarthadoop.blogspot.com
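
As a rough illustration of the MapReduce pattern, here is a minimal single-process sketch of the classic word-count example. It only mimics the three phases (map, shuffle, reduce) in plain Python; it does not use Hadoop, and the function names are made up for this note. On a real cluster the map and reduce calls would run in parallel on the machines holding the data.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big hype", "data science"])
print(counts)  # {'big': 2, 'data': 2, 'hype': 1, 'science': 1}
```

The point of the split is that each map call sees only its own document and each reduce call sees only one key, so both phases can be spread over many machines without coordination beyond the shuffle.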

Other Hadoop tools
- Hive: SQL-like query language
- Pig Latin: scripting language for creating MapReduce programs
- HBase: column-oriented database within Hadoop
- Mahout: Java machine learning library
- Sqoop: moves data between Hadoop and relational databases

Not Only Hadoop

Family | Category | Examples | Pros | Cons
Relational | Massively Parallel Processing (MPP) | Teradata, Netezza, Greenplum, Vertica, Oracle Exadata | Fast and familiar | Expensive; poor for unstructured data
Not Only SQL | Key-Value | Redis, Riak, Voldemort | Simple, fast I/O | Poor for complex data
Not Only SQL | Column | HBase, Hypertable, Cassandra | Good for unstructured data | Poor for interconnected data
Not Only SQL | Document | CouchDB, MongoDB | Good for unstructured data | Poor for interconnected data
Not Only SQL | Graph | Neo4j, InfiniteGraph | Certain types of problems | Not really scalable

Analyzing big data
"I feel the need... the need for speed!" - Top Gun (1986)

First, it isn't always as big as it seems
- Use big data tools to summarize it down, then apply the usual analysis software
- Do you really need every observation? If not, sample it down
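
One way to "sample it down" without loading everything first is reservoir sampling, which draws a fixed-size uniform random sample in a single pass over data of unknown length. The sketch below is one standard technique chosen for illustration, not something the slides prescribe; the function name and the `seed` parameter are assumptions.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable, in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                sample[j] = item
    return sample


# A generator stands in for a data source too large to hold in memory.
observations = (x * x for x in range(1_000_000))
subset = reservoir_sample(observations, k=1000, seed=42)
print(len(subset))  # 1000
```

The sample can then be handed to the usual analysis software, since only `k` items are ever held in memory.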

Intermediate steps
- Use software/algorithms that process outside of memory (bigglm, Revolution R)
- Get more memory: a new machine, or a big-memory instance on a cloud
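
The idea behind processing "outside of memory" can be sketched as follows: read the data in chunks, keep only a few running totals, and compute the model from those totals at the end. This is a minimal illustration for a one-variable least-squares line and is not how bigglm or Revolution R are implemented; all names here are hypothetical.

```python
def read_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable of (x, y) pairs."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def fit_line_in_chunks(chunks):
    """Fit y = intercept + slope * x, holding only five running sums in memory."""
    n = sx = sy = sxx = sxy = 0.0
    for chunk in chunks:
        for x, y in chunk:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Synthetic rows standing in for a file too large to load at once.
rows = ((float(x), 2.0 + 3.0 * x) for x in range(1000))
intercept, slope = fit_line_in_chunks(read_chunks(rows, chunk_size=100))
print(intercept, slope)
```

Because the running sums from separate chunks simply add together, the same approach extends naturally to chunks processed on different machines.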

If you go for it...
Need analysis software that has been written to work in parallel.

Product | Algorithms supported for distributed processing
SAS on Hadoop | C&RT, Time series, GLM, Logistic regression, Random Forest, Clustering
Revolution R Enterprise | Regression, Logistic regression, GLM, Clustering, Decision Trees, Random Forest
IBM SPSS Analytic Server | Linear regression, Neural Net, C&RT, CHAID
Mahout | Collaborative filtering, Naïve Bayes, Random Forest, Clustering, Principal Components
MapReduce | Write your own MapReduce directly or with an interface like RHadoop

THANK YOU
peggy.brinkmann@milliman.com