Getting to Know Big Data Dr. Putchong Uthayopas Department of Computer Engineering, Faculty of Engineering, Kasetsart University Email: putchong@ku.th
Information Tsunami
- Rapid expansion of smartphone usage: social computing, mobile applications, gaming
- Rapid increases in network bandwidth and coverage: Wi-Fi, 4G
- Rapid move toward the Internet of Things (IoT): sensors everywhere, multimedia information
During the first day of a baby's life, the amount of data generated by humanity is equivalent to 70 times the information contained in the Library of Congress. Photo Credit: Catherine Balet, Strangers in the Light (Steidl), 2012 / from The Human Face of Big Data
By signing up with the personal genetics company 23andMe, Yasmine Delawari Johnson, producer of the documentary We Came Home, was able to get a glimpse into the future. Photo Credit: Douglas Kirkland, 2012 / from The Human Face of Big Data
"Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." Gartner Inc.
Properties of Big Data: Volume, Velocity, Variety
Volume
- Big data is huge: beyond the capability of a single server to process
- It may be possible to store the data, but it is difficult to process it
Velocity
- Big data accumulates at a very fast speed: stock market data, Internet access logs, social media data (Twitter, Facebook, Instagram)
- We need to extract as much meaning as we can, as fast as we can, before throwing the data away
Variety
- Data comes in many forms: traditional databases, documents, web pages, social media data, images, video/audio, location data
Diya Soubra, The 3Vs that define Big Data, 2012 http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data
BIG DATA BENEFITS AND USE CASES
Why? "Know thyself, know thy enemy. A thousand battles, a thousand victories."
"The real value of big data is in the insights it produces when analyzed: discovered patterns, derived meaning, indicators for decisions, and ultimately the ability to respond to the world with greater intelligence." (http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/big-data-cloud-technologies-brief.pdf)
- Improve products and services
- Increase customer satisfaction and understand customer behavior
- Improve operational efficiency
- Understand emerging market trends
Google Flu Trends
A pattern emerges when all the flu-related search queries are added together. "We compared our query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening. By counting how often we see these search queries, we can estimate how much flu is circulating in different countries and regions around the world." http://www.google.org/flutrends/about/how.html
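The underlying idea can be shown with a toy calculation. The sketch below is not Google's model (which was far more elaborate), and every number in it is invented for illustration: it simply fits a linear relationship between weekly query counts and surveillance flu rates, then estimates the flu level for a new week.

```python
# Toy illustration of the Flu Trends idea: fit a linear model mapping
# flu-related query counts to surveillance flu rates. All numbers are
# made up; Google's actual model was far more sophisticated.
import numpy as np

query_counts = np.array([120, 340, 560, 900, 640, 300])  # weekly query counts
flu_rates = np.array([1.1, 2.8, 4.5, 7.3, 5.2, 2.5])     # surveillance flu rates

slope, intercept = np.polyfit(query_counts, flu_rates, 1)  # least-squares fit

# Estimate how much flu is circulating in a week with 700 queries.
print(round(slope * 700 + intercept, 2))
```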
Social Media Analytics Social media analytics is the practice of gathering data from blogs and social media websites and analyzing that data to make business decisions. The most common use of social media analytics is to mine customer sentiment in order to support marketing and customer service activities. What is social media analytics? - Definition from WhatIs.com
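At its simplest, sentiment mining can be done with a word lexicon. The sketch below is a deliberately naive approach with a tiny made-up lexicon; production systems use trained classifiers rather than word counting.

```python
# Naive lexicon-based sentiment scoring: count positive and negative
# words in each post. The lexicon here is invented for illustration.
POSITIVE = {"love", "great", "happy", "excellent"}
NEGATIVE = {"hate", "bad", "terrible", "slow"}

def sentiment(post: str) -> int:
    words = post.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

posts = ["I love this product, great service!",
         "Terrible support, very slow delivery"]
for p in posts:
    print(sentiment(p), p)   # positive score -> happy customer
```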
Cupid in Your Network
A Facebook study of matchmakers surveyed approximately 1,500 English speakers around the world who had listed a relationship on their profile at least one year (but no more than two years) earlier, asking them how they met their partner and who, if anyone, introduced them. Researchers then analyzed the network properties of couples and their matchmakers using de-identified, aggregated data.
Matchmaker characteristics:
- Matchmakers have far more friends than the people they are setting up.
- Matchmakers' networks have a different structure: they are less dense, meaning their friends are less likely to know each other.
- Matchmakers were more likely to be close friends rather than acquaintances.
https://research.facebook.com/blog/448802398605370/cupid-in-your-network/
Considerations for Applying Big Data http://fredericgonzalo.com/en/2013/07/07/big-data-in-tourism-hospitality-4-key-components/
NoSQL (Not Only SQL)
A NoSQL (often interpreted as "Not only SQL") database provides a mechanism for the storage and retrieval of data that is modeled in ways other than the tabular relations used in relational databases. NoSQL databases are typically non-relational, distributed, open-source, and horizontally scalable, and are used to handle huge amounts of data. The original intention was to support modern web-scale databases. Reference: http://nosql-database.org/
MongoDB is a general-purpose, open-source database. MongoDB features:
- Document data model with dynamic schemas
- Full, flexible index support and rich queries
- Auto-sharding for horizontal scalability
- Built-in replication for high availability
- Text search
- Advanced security
A minimal example of the document model and a rich query is sketched below.
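The following sketch uses the pymongo driver. It assumes a MongoDB server running locally on the default port; the database, collection, and field names are hypothetical.

```python
# Minimal sketch of MongoDB's document model via the pymongo driver.
# Assumes a local MongoDB server on the default port 27017; the names
# below are hypothetical.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
products = client["shop_demo"]["products"]

# Dynamic schema: documents in one collection need not share fields.
products.insert_many([
    {"name": "laptop", "price": 900, "specs": {"ram_gb": 16}},
    {"name": "paperback", "price": 12, "author": "J. Doe"},
])

# Index support plus a rich query on the indexed field.
products.create_index("price")
for doc in products.find({"price": {"$lt": 100}}):
    print(doc["name"], doc["price"])
```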
Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. The base Apache Hadoop framework is composed of the following modules:
- Hadoop Common: libraries and utilities needed by other Hadoop modules
- Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
- Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and scheduling users' applications
- Hadoop MapReduce: a programming model for large-scale data processing
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
The Magic behind Hadoop and HDFS
A problem is divided into two phases:
- Map: apply some action to the data as <key, value> pairs and produce intermediate results
- Reduce: summarize the intermediate <key, value> results and return them to the main program
Ricky Ho, How Hadoop Map/Reduce works, http://architects.dzone.com/articles/how-hadoop-mapreduce-works
Example: Word Count
Counting words in an input text file, e.g. how many times the word "love" appears in a novel. ^_^ In the map phase, each sentence is split into words, and each word forms an initial key-value pair <word, 1>: "tring tring the phone rings" becomes <tring,1>, <tring,1>, <the,1>, <phone,1>, <rings,1>. In the reduce phase, the keys are grouped together and the values for similar keys are added. Here there is only one pair of similar keys, "tring", whose values are added, so the output key-value pairs are <tring,2>, <the,1>, <phone,1>, <rings,1>. Reduce thus forms an aggregation phase over keys, giving the number of occurrences of each word in the input. A single-machine sketch of both phases appears below.
http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html
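Here is a single-machine Python sketch of the two phases. Real Hadoop distributes the map and reduce tasks across a cluster and shuffles intermediate pairs between machines; this only mirrors the logic.

```python
# Single-machine sketch of the MapReduce word-count example above.
from collections import defaultdict

def map_phase(text):
    # Emit an intermediate <word, 1> pair for every word in the input.
    return [(word, 1) for word in text.split()]

def reduce_phase(pairs):
    # Group the pairs by key and sum the values for each word.
    counts = defaultdict(int)
    for word, value in pairs:
        counts[word] += value
    return dict(counts)

pairs = map_phase("tring tring the phone rings")
print(reduce_phase(pairs))
# {'tring': 2, 'the': 1, 'phone': 1, 'rings': 1}
```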
Data Products
A data product provides actionable information without exposing decision makers to the underlying data or analytics. Examples:
- Movie recommendations
- Weather forecasts
- Stock market prediction
- Operational improvement
- Health diagnosis
- Targeted advertising
Source: The Field Guide to Data Science, Booz Allen Hamilton
Bottom-Up Approach
- What data do we have?
- How can we collect and store it?
- What infrastructure and tools are needed to process this big data?
- What analytics methods can be applied?
- What insights can we gain from this data and analysis?
Top-Down Approach
- What business challenge can create value and impact for the organization?
- What data do we need?
- What tools and analytics approaches should be used?
- What infrastructure is needed?
Some Trends
In-Memory Databases
An in-memory database is a database management system that primarily relies on main memory for data storage. It is faster than a disk-optimized database because its internal optimization algorithms are simpler and execute fewer CPU instructions, and because accessing data in memory eliminates seek time when querying, giving faster and more predictable performance than disk. Source: http://en.wikipedia.org/wiki/In-memory_database
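The concept can be demonstrated with Python's built-in sqlite3 module, which can hold an entire database in RAM. Production in-memory systems are far more sophisticated; this only shows the idea.

```python
# A whole SQLite database held in main memory: queries never touch disk.
import sqlite3

conn = sqlite3.connect(":memory:")   # database lives in RAM, not on disk
conn.execute("CREATE TABLE trades (symbol TEXT, price REAL)")
conn.executemany("INSERT INTO trades VALUES (?, ?)",
                 [("AAA", 10.5), ("BBB", 20.0), ("AAA", 11.0)])

# Aggregate query served entirely from memory.
for row in conn.execute("SELECT symbol, AVG(price) FROM trades GROUP BY symbol"):
    print(row)
```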
Spark at Yahoo
Yahoo uses Spark in two main cases: personalizing news pages for web visitors, and running analytics for advertising.
- For news personalization, the company uses ML algorithms running on Spark to figure out what individual users are interested in, and to categorize news stories as they arise so as to figure out what types of users would be interested in reading them. Yahoo wrote the Spark ML algorithm in 120 lines of Scala (previously, its ML algorithm for news personalization was written in 15,000 lines of C++). With just 30 minutes of training on a large, hundred-million-record data set, the Scala ML algorithm was ready for business.
- The second use case shows off the interactive capability of Hive on Spark (Shark): analysts use existing BI tools to view and query their advertising analytics data collected in Hadoop.
http://www.datanami.com/2014/03/06/apache_spark_3_realworld_use_cases/
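For a flavor of the interactive-query use case, here is a minimal PySpark sketch. This is not Yahoo's actual code (which was in Scala), and the ad-impression data is invented for illustration.

```python
# Minimal PySpark sketch of an interactive SQL query over analytics data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ad-analytics-sketch").getOrCreate()

# Tiny made-up ad-impression data; in practice this would be read from HDFS.
impressions = spark.createDataFrame(
    [("sports", 120), ("news", 340), ("sports", 80)],
    ["category", "clicks"])
impressions.createOrReplaceTempView("impressions")

# SQL-style interactive query, the kind a BI tool would issue.
spark.sql("SELECT category, SUM(clicks) AS total "
          "FROM impressions GROUP BY category").show()
spark.stop()
```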
Big Data Infrastructure Goes to the Cloud
- Data is already on the cloud: virtual organizations, cloud-based SaaS services
- Big Data as a Service on the cloud: private cloud and public cloud offerings
- Examples: IBM Bluemix, Amazon AWS (EMR), and many more
Big Data Analytics
A set of advanced technologies designed to work with large volumes of heterogeneous data. It explores the data to discover interrelationships and patterns using sophisticated quantitative methods such as:
- Machine learning
- Neural networks
- Robotics algorithms
- Computational mathematics
- Artificial intelligence
Deep Learning
Deep learning is a subcategory of machine learning that uses neural networks to improve things like speech recognition, computer vision, and natural language processing. It can use unsupervised learning to form abstract concepts. A toy sketch of a small neural network appears below.
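The sketch below is a toy two-layer network forward pass in NumPy, just to make the stacked-layers idea concrete. Real deep learning systems train many such layers on large data sets with frameworks such as TensorFlow or PyTorch; the weights here are random, not trained.

```python
# Toy two-layer neural network forward pass with random (untrained) weights.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                      # one input with 4 features
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)    # hidden layer parameters
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)    # output layer parameters

h = np.maximum(0, x @ W1 + b1)                   # hidden layer with ReLU
scores = h @ W2 + b2                             # raw class scores
probs = np.exp(scores) / np.exp(scores).sum()    # softmax over 2 classes
print(probs)
```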
Applying Deep Learning
- In 2011, Stanford computer science professor Andrew Ng founded Google's Google Brain project, which created a neural network trained with deep learning algorithms. It famously proved capable of recognizing high-level concepts, such as cats, after watching just YouTube videos, without ever having been told what a cat is.
- Facebook is using deep learning expertise to help create solutions that will better identify faces and objects in the 350 million photos and videos uploaded to Facebook each day.
- Voice recognition systems like Google Now and Apple's Siri now use deep learning. According to Google researchers, the voice error rate in the new version of Android, after adding insights from deep learning, is 25% lower than in previous versions of the software.
http://www.wired.com/2014/08/deep-learning-yann-lecun/
Source: http://www.fastcolabs.com/3026423/why-google-is-investing-in-deep-learning
IBM Watson and Cognitive Technology
Watson is a cognitive technology that processes information more like a human than a computer: by understanding natural language, generating hypotheses based on evidence, and learning as it goes. And learn it does. Watson gets smarter in three ways: by being taught by its users, by learning from prior interactions, and by being presented with new information. This means organizations can more fully understand and use the data that surrounds them, and use that data to make better decisions.
Applying Watson in Healthcare
WellPoint, Inc. is an Indianapolis-based health benefits company with approximately 37 million health plan members, processing more than 550 million claims per year. It is using IBM Watson to improve the quality and efficiency of healthcare decisions. WellPoint trained Watson with 25,000 historical cases. Now Watson uses hypothesis generation and evidence-based learning to generate confidence-scored recommendations that help nurses make decisions about utilization management (UM). Natural language processing leverages unstructured data, such as text-based treatment requests.
Benefits:
- Helps UM nurses make faster decisions about treatment requests
- Could accelerate healthcare preapprovals, which can be critical when treatments are time-sensitive
- Includes unstructured data in the streamlined decision process
Challenges
- Developing big data applications is not simple: new algorithms and new software development tools are needed
- Proper policies about data security and ownership are required
- There is a lack of data scientists, a role distinct from that of a software developer