BIG DATA: What it is and how to use it?




Lauri Ilison, PhD
Data Scientist
21.11.2014

Big Data definition?
There is no clear definition of BIG DATA.
BIG DATA is more of a concept than a precise term.

BIG DATA concept versions
1. Unstructured vs structured: Big Data focuses on unstructured data
2. Big Data could be a volume issue: petabyte-scale data (1 PB = 1 million GB)
3. The 3 Vs of Big Data: Volume, Velocity, Variety

[Figure: the 3 Vs. Volume: MB to GB to TB. Variety: calls, scripts, purchases, weather, social media, logs. Velocity: gigabytes of data transported every hour.]

It all started with...

Google Whitepapers
Between 2003 and 2006 Google released three academic papers describing Google's technology for massive data processing:
1. Google File System (GFS): Google storing all web content
2. Map-Reduce: Google calculating PageRank and the web search index
3. BigTable: Google storing crawl data and the data behind Analytics, Earth and Personalized Search in a columnar database

Hadoop historical background
In 2004/5 Doug Cutting was developing Nutch, an open source web search engine, and was struggling with huge data processing issues.
Doug implemented an analog of the Google File System and named it HADOOP.
Since 2006 Hadoop has been an Apache Software Foundation project.

Hadoop file system (HDFS)
HDFS is a file system that:
- can store very large data sets
- scales out across a cluster of hosts
- is optimized for throughput instead of latency
- achieves high availability through replication instead of hardware redundancy
- expects node failures to be the norm rather than the exception

HDFS Architecture
[Figure: an HDFS client exchanges metadata with the name node, which handles block management, and reads and writes data on the data nodes.]
http://static.googleusercontent.com/media/research.google.com/et//archive/gfs-sosp2003.pdf
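
To make the replication idea above concrete, here is a minimal toy sketch in Python. It is not HDFS code: the function names, the node names, the block size and the round-robin placement are all assumptions made for illustration (real HDFS uses its own placement policy). It only shows why a file stays readable when a node fails, as long as every block has replicas on other nodes.

```python
def place_blocks(file_size, block_size, data_nodes, replication=3):
    """Return toy name-node metadata: block index -> list of data nodes."""
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds number of data nodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    metadata = {}
    for block in range(num_blocks):
        # Hypothetical round-robin placement: replica r of a block goes to
        # node (block + r) mod n, so replicas land on distinct nodes.
        metadata[block] = [data_nodes[(block + r) % len(data_nodes)]
                           for r in range(replication)]
    return metadata


def readable_blocks(metadata, failed_nodes):
    """Blocks that still have at least one replica on a healthy node."""
    return [block for block, nodes in metadata.items()
            if any(node not in failed_nodes for node in nodes)]


# Hypothetical cluster of four data nodes and a file of 1000 size units.
nodes = ["datanode1", "datanode2", "datanode3", "datanode4"]
meta = place_blocks(file_size=1000, block_size=128, data_nodes=nodes)

print(len(meta))                                        # number of blocks
print(readable_blocks(meta, failed_nodes={"datanode2"}) == list(meta))
```

With three replicas per block on distinct nodes, losing one or even two nodes in this toy cluster still leaves every block readable, which is the sense in which node failure is treated as normal.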

MAP-REDUCE concept
[Figure: a huge job (TASK) is split into Job 1 to Job 4. MAP: worker 1 does job 1, worker 2 does job 2, worker 3 does job 3, worker 4 does job 4. REDUCE: combine the results of jobs 1 and 2, combine the results of jobs 3 and 4, then process the combined results into the huge job result (RESULT).]

MAP-REDUCE is a framework for processing huge jobs:
1. Split the huge job between workers
2. Combine the workers' results into a single result

How it works? Step 1: MAP
DATA: 5 baskets of apples, oranges and pears
Task: find the number of apples, oranges and pears that I have
[Figure: the initial data spread over Server 1 to Server 5.]
In each basket we count the apples, oranges and pears.

How it works? Step: Shuffle
[Figure: Server 1 to Server 5 before and after the shuffle.]

How it works? Step 2: Reduce
[Figure: the counts on Server 1 to Server 5 are reduced to the final result: totals of 50, 42 and 31, one per fruit type.]
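
The map, shuffle and reduce steps of the fruit-basket example can be sketched in a few lines of plain Python. This is a single-process illustration, not Hadoop code; the basket contents and the function names are made up, since the slides do not list what is in each basket.

```python
from collections import Counter, defaultdict

# Hypothetical basket contents, one basket per server.
baskets = [
    ["apple", "orange", "apple"],
    ["pear", "apple"],
    ["orange", "orange", "pear"],
    ["apple", "pear", "pear"],
    ["orange", "apple"],
]


def map_phase(basket):
    """MAP: each server counts the fruit in its own basket."""
    return list(Counter(basket).items())


def shuffle_phase(mapped):
    """SHUFFLE: group the partial counts by fruit type."""
    grouped = defaultdict(list)
    for pairs in mapped:
        for fruit, count in pairs:
            grouped[fruit].append(count)
    return grouped


def reduce_phase(grouped):
    """REDUCE: sum the partial counts of each fruit type."""
    return {fruit: sum(counts) for fruit, counts in grouped.items()}


mapped = [map_phase(basket) for basket in baskets]
totals = reduce_phase(shuffle_phase(mapped))
print(totals)
```

In a real cluster the map calls run in parallel on the servers that hold the data, and the shuffle moves the partial counts over the network so that all counts for one fruit reach the same reducer.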

Hadoop + MAP-REDUCE
The Hadoop filesystem with MAP-REDUCE is a distributed grid with storage and processing power.
[Figure: Hadoop = storage + processing power]

Hadoop has been adopted!
[Figure: adoption timeline from 2003 to 2010, starting with the Google whitepapers and the Google file system reimplementation.]

Hadoop ecosystem
Non-relational DBMS, fine-grained data handling
- Hive: data warehouse that provides an SQL interface; the data structure is projected ad hoc onto the underlying unstructured data
- HBase: column-oriented, schema-less, distributed database modeled after Google's BigTable; random real-time read/write
Scripting
- Pig: platform for manipulating and analyzing large data sets; scripting language for analysis
Machine Learning
- Mahout: machine learning libraries for recommendations, clustering, classification and item sets
Hadoop Core Platform
- HDFS: distributes and replicates data across machines
- MapReduce: distributes and monitors tasks, restarts failed tasks

Big Data technical stack
[Figure: layered stack. Labels recovered: business analytics tools (business analytics, business intelligence, forecasting); data mining / modeling tools (data mining, data modeling); document databases, columnar databases, key-value stores; data integration; data sources, batch data integration, streaming data flow; processing modes SQL, batch/Map-Reduce, real-time, script, machine learning, search; metadata management (HCatalog); output to on-line database and in-memory; Hadoop cluster of hosts with cluster management / monitoring (Ambari) and HDFS.]

Relational Data vs BIG DATA
Relational data management (structure first):
1. Apply data schema
2. Store in relational database
3. Apply analytics
BIG DATA management (structure later, schema on READ):
1. Store data
2. Apply data schema
3. Apply analytics

How to find the value in data? Machine Learning
Supervised learning: we have previous knowledge about the sample cases that are the basis for learning.
- Classification
- Regression
- Decision trees
Unsupervised learning: we do not have any previous knowledge about the sample cases that are the basis for learning.
- Clustering
- Hidden Markov chains
- Dimensionality reduction
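
The "structure later" idea can be illustrated with a small Python sketch. The raw records, field names and the `read_with_schema` helper are all hypothetical. The data is stored exactly as it arrived, and a schema (which fields to read, how to convert them, what default to use) is applied only when the data is read for a particular analysis.

```python
import json

# Raw events stored as-is, without an upfront schema (hypothetical records).
raw_store = [
    '{"user": "u1", "amount": "12.50", "channel": "web"}',
    '{"user": "u2", "amount": "7"}',
    '{"user": "u3", "note": "no purchase"}',
]

# The schema is chosen at read time: field name -> (converter, default).
purchase_schema = {"user": (str, None), "amount": (float, 0.0)}


def read_with_schema(lines, schema):
    """Project a schema onto raw JSON lines while reading them."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


rows = list(read_with_schema(raw_store, purchase_schema))
total = sum(row["amount"] for row in rows)
print(rows)
print(total)
```

A different analysis could read the same raw store with a different schema (for example one that keeps `channel`), without reloading or migrating the stored data.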

How does Linear Regression work?
Example: Linear Regression
TASK: find the price of a 46 m2 apartment
In order to find the price of a 46 m2 apartment we find the linear relation of the samples.
1. We assume a linear relation: Price = a * Size + b (y = ax + b)
2. We calculate each sample's distance from the line
3. We search for the equation of the blue line with minimal total distance from the samples
4. Knowing the line function, we calculate the price of the 46 m2 apartment
[Figure: price plotted against apartment size with the fitted blue line; 46 m2 is marked on the size axis and 56K on the price axis.]

Example: Customer churn
Customer historical data -> Decision TREE algorithm -> Customer churn prediction rules

Customer historical data
Gender  Customer age  Card type  Brand   Sales total (EUR)  Purchase frequency  Purchase no.  Churn
Male    37            type1      brand1  62                 1                   123           no
Female  49            type2      brand1  15                 125                 6             no
Female  38            type3      brand3  116                31                  5             no
Male    64            type4      brand1  12                 4                   8             no
Female  30            type5      brand6  47                 21                  43            no
Female  30            type4      brand1  25                 82                  16            no
Female  47            type2      brand7  31                 97                  3             yes
Male    30            type3      brand2  35                 162                 6             yes
Female  51            type1      brand3  24                 88                  73            no
Female  30            type3      brand2  31                 32                  22            no
Male    42            type4      brand3  57                 279                 3             yes
Female  30            type1      brand1  25                 175                 11            no
Female  30            type3      brand2  54                 5                   40            no
Male    30            type2      brand7  44                 467                 3             yes

Customer churn prediction rules
purchace.freq.sdev <= 165:
:...purchase.no > 7: no
    purchase.no <= 7:
    :...purchace.freq.sdev > 86:
        :...purchase.no > 4:
        :   :...purchace.freq.sdev <= 126:
        :   :   :...purchase.no > 5: no
        :   :   :   purchase.no <= 5:
        :   :   :   :...brand in {brand1,brand2,brand4}: no
        :   :   :       brand = brand3: yes
        :   :   purchace.freq.sdev > 126:
        :   :   :...purchase.no <= 6: yes
        :   :       purchase.no > 6:
        :   :       :...purchace.freq.sdev <= 139: no
        :   :           purchace.freq.sdev > 139: yes
...............

Female  30            type3      brand1  46                 150                 3             no

Actionable insights for enterprise
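
Returning to the linear regression steps above, the following is a minimal Python sketch of fitting Price = a * Size + b and predicting the price of a 46 m2 apartment. Two assumptions are worth stating loudly: the sample sizes and prices are invented for illustration (the slide's data points are not recoverable), and "minimal total distance" is implemented here as ordinary least squares, which minimizes the sum of squared vertical distances from the line.

```python
def fit_line(sizes, prices):
    """Ordinary least squares fit of price = a * size + b."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(prices) / n
    # Covariance-style and variance-style sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b


# Hypothetical samples: size in m2, price in thousand EUR.
sizes = [30, 40, 50, 60, 70]
prices = [38, 49, 61, 70, 82]

a, b = fit_line(sizes, prices)
predicted = a * 46 + b
print(f"price = {a:.2f} * size + {b:.2f}")
print(f"predicted price for 46 m2: {predicted:.2f}K")
```

The closed-form solution is used instead of an iterative search because, for a single input variable, the line with the smallest squared error can be computed directly from the sample means.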

Example 3: Predict loan payment default?
Example: Bank loan decision
TASK: find the probability of default for an applicant
Historical loan application data: 3000 samples, 16 factors (input parameters), target T (No Default = 0, Default = 1)
[Figure: table of the historical samples, one row per loan application, with the input parameters followed by the target column T.]

In order to predict the probability of default we use multivariate logistic regression:
1. Logistic function: f(x) = 1 / (1 + e^(-x))
2. We create a model, based on historical data, that predicts the default
3. To test the model we split the dataset randomly into a training set (80%) and a test set (20%)

Error matrix (rows: actual, columns: predicted; class 0, No Default, is the positive class)
            Predicted 0      Predicted 1
Actual 0    True positive    False negative
Actual 1    False positive   True negative

Big Data.
1. Technology invented by Google, further developed by all big internet companies
2. Linear scalability, open-source
3. Decreased costs: low-cost hardware, no licenses
4. Increased capabilities: schema on read, massive analytics
5. Machine Learning to discover value in the data
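
The building blocks of the loan example (the logistic function, the random 80/20 split and the error matrix) can be sketched in plain Python as follows. This does not train a model: the scores, the actual labels, the 0.5 threshold and all function names are assumptions chosen only to show how the pieces fit together.

```python
import math
import random


def logistic(x):
    """Logistic function f(x) = 1 / (1 + e^(-x)), mapping a score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def train_test_split(samples, train_fraction=0.8, seed=0):
    """Randomly split the samples into a training set and a test set."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def error_matrix(actual, predicted):
    """Count outcomes, keyed by (actual class, predicted class)."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


# Stand-ins for the 3000 historical loan applications.
samples = list(range(3000))
train, test = train_test_split(samples)
print(len(train), len(test))

# Hypothetical model scores, turned into 0/1 predictions at a 0.5 threshold.
scores = [-2.0, 0.3, 1.5, -0.7]
predicted = [1 if logistic(s) >= 0.5 else 0 for s in scores]
actual = [0, 1, 0, 0]
print(error_matrix(actual, predicted))
```

In a real setup the scores would come from a logistic regression model fitted on the training set, and the error matrix would be computed on the held-out test set only.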

4 steps approach for Big Data problems
STEP 1: Knowledge creation
- Seminars, workshops
- Real-life examples
STEP 2: Ideas discovery
- Find potentially valuable data
- Apply short validation, test
STEP 3: Plan and prototype
- Minimalistic prototypes
- Setup and business value validation
STEP 4: Implementation
- Implement fast, low risk
- Integrate with existing processes

Where to start?
- Look at tutorials on the internet
- Read some books about BIG DATA and Machine Learning
- Participate in online courses (Coursera.org or similar)
- Experiment with tools: sandboxes, sample setups
- Participate in online competitions (like Kaggle.com)

If you are interested?
Nortal has interesting Big Data and Machine Learning tasks to solve!
Lauri Ilison, PhD
email: lauri.ilison@nortal.com