BIG DATA
What is it and how to use it?

Lauri Ilison, PhD
Data Scientist
21.11.2014

Big Data definition?
There is no clear definition of BIG DATA.
BIG DATA is more of a concept than a precise term.

BIG DATA concept versions
1. Unstructured vs structured - Big Data focuses on unstructured data
2. Big Data could be a volume issue - petabyte-scale data (1 million GB)
3. The 3 Vs of Big Data - Volume, Velocity, Variety

[Figure: from DATA to BIG DATA along the three Vs]
- Volume: MB to GB to TB
- Variety: calls, scripts, purchases, weather, social media, logs
- Velocity: gigabytes of data transported every hour

It all started with
Google Whitepapers
Between 2003 and 2006 Google released three academic papers describing Google's technology for massive data processing:
1. Google File System (GFS) - Google storing all web content
2. Map-Reduce - Google calculating PageRank and the web search index
3. BigTable - Google storing crawl data, Analytics, Earth and Personalized Search data in a columnar database

Hadoop historical background
In 2004/5 Doug Cutting was developing Nutch, an open-source web search engine, and was struggling with huge data processing issues.
Doug implemented an analog of the Google File System and named it HADOOP.
Since 2006 Hadoop has been an Apache Foundation project.

Hadoop file system (HDFS)
HDFS:
- is a file system that can store very large data sets
- scales out across a cluster of hosts
- is optimized for throughput instead of latency
- achieves high availability through replication of data rather than hardware redundancy
- expects node failures to be the norm rather than the exception

HDFS Architecture
[Figure: the HDFS client exchanges metadata with the name node, which handles block management, and reads and writes data on the data nodes]
http://static.googleusercontent.com/media/research.google.com/et//archive/gfs-sosp2003.pdf

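As a rough illustration of the replication idea above, the following toy sketch splits a file into blocks, places several copies of each block on different data nodes, and checks which blocks stay readable when nodes fail. The function names, the round-robin placement, the node names and the tiny block size are all assumptions made for this example; real HDFS uses its own placement policy and far larger blocks.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, data_nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    This stands in for the name node's block map (hypothetical policy).
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the ids of blocks that still have at least one live replica."""
    return [
        block_id
        for block_id, nodes in placement.items()
        if any(node not in failed_nodes for node in nodes)
    ]


blocks = split_into_blocks(b"abcdefghij", block_size=4)
nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical data node names
placement = place_replicas(len(blocks), nodes, replication=3)

# With three replicas on four nodes, losing any two nodes keeps all blocks readable.
print(readable_blocks(placement, failed_nodes={"dn1", "dn2"}))
```

The point of the sketch is the design choice named on the slide: availability comes from keeping extra copies on ordinary machines, so a failed node is handled by reading another replica.
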
MAP-REDUCE concept
[Figure: a TASK (huge job) is split in the MAP step into jobs 1 to 4, each done by its own worker; the REDUCE step combines the results of jobs 1 and 2 and of jobs 3 and 4, then processes the combined results into the RESULT (huge job result)]

MAP REDUCE is a framework for processing huge jobs:
1. Split the huge job between workers
2. Combine the workers' results into a single result

How does it work? Step 1: MAP
DATA: 5 baskets of apples, oranges and pears
Task: find the number of apples, oranges and pears that I have
[Figure: initial data, one basket on each of servers 1 to 5]
In each basket we count the apples, oranges and pears.

How does it work? Step: Shuffle
[Figure: the per-basket counts are shuffled between servers 1 to 5]

How does it work? Step 2: Reduce
[Figure: the shuffled counts on servers 1 to 5 are reduced to one total per fruit]
Final result: one total per fruit (50, 42 and 31)

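The map, shuffle and reduce steps of the fruit example can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop code: the basket contents and the function names are made up for the example, and each list stands in for the data held by one server.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: five baskets, one per server.
BASKETS = [
    ["apple", "apple", "orange", "pear"],
    ["orange", "orange", "apple"],
    ["pear", "pear", "pear", "apple"],
    ["apple", "orange"],
    ["pear", "orange", "orange"],
]


def map_phase(basket):
    """MAP: count each fruit inside one basket, giving (fruit, count) pairs."""
    return list(Counter(basket).items())


def shuffle_phase(mapped):
    """SHUFFLE: bring all counts for the same fruit together."""
    grouped = defaultdict(list)
    for pairs in mapped:
        for fruit, count in pairs:
            grouped[fruit].append(count)
    return dict(grouped)


def reduce_phase(grouped):
    """REDUCE: add up the grouped counts into one total per fruit."""
    return {fruit: sum(counts) for fruit, counts in grouped.items()}


def count_fruit(baskets):
    mapped = [map_phase(basket) for basket in baskets]  # one map task per basket
    return reduce_phase(shuffle_phase(mapped))


print(count_fruit(BASKETS))
```

In a real cluster the map tasks run in parallel on the servers that hold the data, and only the small per-basket counts travel over the network during the shuffle.
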
Hadoop + MAP-REDUCE
The Hadoop file system with MAP-REDUCE is a distributed grid with storage and processing power.
Hadoop = storage + processing power

Hadoop has been adopted!
[Figure: adoption timeline 2003-2010, starting with the Google whitepapers, followed by the Google File System reimplementation]

Hadoop ecosystem
Non-relational DBMS, fine-grained data handling
- Hive: data warehouse that provides an SQL interface; the data structure is projected ad hoc onto the underlying unstructured data
- HBase: column-oriented, schema-less, distributed database modeled after Google's BigTable; random real-time read/write
Scripting
- Pig: platform for manipulating and analyzing large data sets; scripting language for analysis
Machine learning
- Mahout: machine learning libraries for recommendations, clustering, classification and item sets
Hadoop core platform
- HDFS: distributes and replicates data across machines
- MapReduce: distributes and monitors tasks, restarts failed tasks

Big Data technical stack
[Figure: stack diagram with the following labels]
- Business analytics tools: business analytics, business intelligence, forecasting
- Data mining / modeling tools: data mining, data modeling
- Document databases, columnar databases, key-value stores
- Data integration: data sources, batch data integration, streaming data flow
- SQL, batch/Map-Reduce, real-time, script, machine learning, search
- On-line database, in-memory, output
- Metadata management (HCatalog)
- Hadoop cluster of hosts: HDFS, cluster management / monitoring (Ambari)

Relational Data vs BIG DATA
Relational data management (structure first):
1. Apply data schema
2. Store in relational database
3. Apply analytics
BIG DATA management (structure later, schema on READ):
1. Store data
2. Apply data schema
3. Apply analytics

How to find the value in data? Machine Learning
Supervised learning: we have previous knowledge about the sample cases that are the basis for learning
- Classification
- Regression
- Decision trees
Unsupervised learning: we do not have any previous knowledge about the sample cases that are the basis for learning
- Clustering
- Hidden Markov models
- Dimensionality reduction

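To make the "structure later" idea of schema on read concrete, here is a minimal sketch: raw records are stored as they arrive, and a schema is applied only when somebody reads them. The sample records, field names and the `read_with_schema` helper are hypothetical and chosen only for illustration.

```python
import json

# Raw data is stored as-is; records may differ in shape or be malformed.
RAW_EVENTS = [
    '{"user": "u1", "amount": "12.50", "ts": "2014-11-21"}',
    '{"user": "u2", "amount": "7", "channel": "web"}',
    'not valid json',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> converter) at read time.

    Unparseable lines are skipped; fields missing from a record become None.
    """
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        row = {}
        for field, convert in schema.items():
            value = record.get(field)
            row[field] = convert(value) if value is not None else None
        yield row


# Two different "views" of the same stored data, decided at read time.
sales_view = list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
channel_view = list(read_with_schema(RAW_EVENTS, {"channel": str}))
print(sales_view)
print(channel_view)
```

The same stored lines serve both views, which is the contrast with the relational flow, where the schema has to be fixed before anything is stored.
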
How does Linear Regression work?
Example: Linear Regression
TASK: find the price of a 46 m2 apartment
To find the price of a 46 m2 apartment we find the linear relation between the samples:
1. We assume a linear relation: Price = a * Size + b (y = ax + b)
2. We calculate the distance of each sample from the line
3. We search for the line equation with the minimal total distance from the samples
4. Knowing the line function, we calculate the price of the 46 m2 apartment
[Figure: price plotted against apartment size with the fitted line in blue; a size of 46 m2 corresponds to a price of 56K]

Example: Customer churn
Churn? Customer historical data is the input to the decision TREE algorithm:

Gender  Customer age  Card type  Brand   Sales total (EUR)  Purchase frequency  Purchase no.  Churn
Male    37            type1      brand1  62                 1                   123           no
Female  49            type2      brand1  15                 125                 6             no
Female  38            type3      brand3  116                31                  5             no
Male    64            type4      brand1  12                 4                   8             no
Female  30            type5      brand6  47                 21                  43            no
Female  30            type4      brand1  25                 82                  16            no
Female  47            type2      brand7  31                 97                  3             yes
Male    30            type3      brand2  35                 162                 6             yes
Female  51            type1      brand3  24                 88                  73            no
Female  30            type3      brand2  31                 32                  22            no
Male    42            type4      brand3  57                 279                 3             yes
Female  30            type1      brand1  25                 175                 11            no
Female  30            type3      brand2  54                 5                   40            no
Male    30            type2      brand7  44                 467                 3             yes

Customer churn prediction rules:

    purchace.freq.sdev <= 165:
    :...purchase.no > 7: no
        purchase.no <= 7:
        :...purchace.freq.sdev > 86:
            :...purchase.no > 4:
            :   :...purchace.freq.sdev <= 126:
            :   :   :...purchase.no > 5: no
            :   :   :   purchase.no <= 5:
            :   :   :   :...brand in {brand1,brand2,brand4}: no
            :   :   :       brand = brand3: yes
            :   :   purchace.freq.sdev > 126:
            :   :   :...purchase.no <= 6: yes
            :   :       purchase.no > 6:
            :   :       :...purchace.freq.sdev <= 139: no
            :   :           purchace.freq.sdev > 139: yes
    ...............

Sample customer (same columns as above):
Female  30            type3      brand1  46                 150                 3             no

Actionable insights for enterprise

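Returning to the linear regression example, steps 1 to 4 can be sketched as follows. The slide only speaks of a "minimal total distance"; this sketch assumes the usual choice of summing the squared vertical distances (ordinary least squares). The sample sizes and prices are invented for the illustration and are not the data behind the slide.

```python
def fit_line(sizes, prices):
    """Least-squares fit of price = a * size + b; returns (a, b)."""
    n = len(sizes)
    mean_size = sum(sizes) / n
    mean_price = sum(prices) / n
    spread = sum((x - mean_size) ** 2 for x in sizes)
    co_spread = sum(
        (x - mean_size) * (y - mean_price) for x, y in zip(sizes, prices)
    )
    a = co_spread / spread
    b = mean_price - a * mean_size
    return a, b


def total_squared_distance(sizes, prices, a, b):
    """Sum of squared vertical distances between the samples and the line."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(sizes, prices))


def predict_price(size, a, b):
    """Step 4: read the price off the fitted line."""
    return a * size + b


# Hypothetical samples: size in m2, price in thousands of euros.
sizes = [30, 40, 50, 60]
prices = [41, 49, 61, 69]

a, b = fit_line(sizes, prices)
print("a =", a, "b =", b)
print("price for 46 m2:", predict_price(46, a, b))
print("total squared distance:", total_squared_distance(sizes, prices, a, b))
```

Any other line through the same samples gives a larger total squared distance, which is what "searching for the line with minimal total distance" means under this assumption.
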
Example 3: Predict loan payment default?
Example: Bank loan decision
TASK: find the probability of default for an applicant
Historical loan application data: 3000 samples, 16 factors (input parameters) and a target T (No Default = 0, Default = 1)
[Figure: data matrix of the 3000 samples, with the input parameter columns and the target column T]

To predict the probability of default we use multivariate logistic regression:
1. Logistic function: $f(x) = \frac{1}{1 + e^{-x}}$
2. We create a model, based on the historical data, that predicts default
3. To test the model we split the dataset randomly into a training set (80%) and a test set (20%)

Error matrix (as labeled on the slide, class 0 is treated as the positive class):

            Predicted 0      Predicted 1
Actual 0    True positive    False negative
Actual 1    False positive   True negative

Big Data
1. Technology invented by Google, further developed by all the big internet companies
2. Linear scalability, open source
3. Decreased costs: low-cost hardware, no licenses
4. Increased capabilities: schema on read, massive analytics
5. Machine learning to discover value in the data

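The pieces of the loan default example (logistic function, random 80/20 split, error matrix) can be sketched with the Python standard library alone. Everything concrete here is an assumption for illustration: the rows have two input parameters instead of 16, the weights are invented rather than fitted, and the helper names are hypothetical.

```python
import math
import random


def logistic(x):
    """Logistic function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def default_probability(features, weights, bias):
    """Probability of default: logistic of the weighted sum of the inputs."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(score)


def split_dataset(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (training, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def error_matrix(actual, predicted):
    """Count (actual, predicted) pairs for the two classes 0 and 1."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


# Hypothetical rows: ((two input parameters), target); not the slide's data.
rows = [((i % 5, i % 3), i % 2) for i in range(10)]
training, test = split_dataset(rows)

# Hypothetical weights; a real model would be fitted on `training`.
weights, bias = [0.8, -0.5], -1.0
predicted = [int(default_probability(x, weights, bias) >= 0.5) for x, _ in test]
actual = [target for _, target in test]
print(error_matrix(actual, predicted))
```

The dictionary keys are (actual, predicted) pairs, so the four counts correspond to the four cells of the error matrix; which class is called "positive" is a labeling convention.
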
4-step approach for Big Data problems
STEP 1: Knowledge creation
- Seminars, workshops
- Real-life examples
STEP 2: Idea discovery
- Find potentially valuable data
- Apply short validation, test
STEP 3: Plan and prototype
- Minimalistic prototypes
- Setup and business value validation
STEP 4: Implementation
- Implement fast, low risk
- Integrate with existing processes

Where to start?
- Look at tutorials on the internet
- Read some books about BIG DATA and Machine Learning
- Participate in online courses (Coursera.org or similar)
- Experiment with tools: sandboxes, sample setups
- Participate in online competitions (like Kaggle.com)

If you are interested?
Nortal has interesting Big Data and Machine Learning tasks to solve!

Lauri Ilison, PhD
email: lauri.ilison@nortal.com