Getting to Know Big Data




Dr. Putchong Uthayopas, Department of Computer Engineering, Faculty of Engineering, Kasetsart University. Email: putchong@ku.th

Information Tsunami
Rapid expansion of smartphone usage, social computing, mobile applications, and gaming.
Rapid increases in network bandwidth and coverage: Wi-Fi, 4G.
Rapid move toward the Internet of Things (IoT): sensors everywhere, multimedia information.

During the first day of a baby's life, the amount of data generated by humanity is equivalent to 70 times the information contained in the Library of Congress. Photo credit: Catherine Balet, Strangers in the Light (Steidl), 2012 / from The Human Face of Big Data

By signing up with the personal genetics company 23andMe, Yasmine Delawari Johnson, producer of the documentary We Came Home, was able to get a glimpse into the future. Photo credit: Douglas Kirkland, 2012 / from The Human Face of Big Data

"Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." (Gartner Inc.)

Properties of Big Data: Volume, Velocity, Variety

Volume
Big data must be huge: beyond the capability of a single computer server to process. It may be possible to store the data but difficult to process it.

Velocity
Big data accumulates at a very fast speed. Examples include stock market data, Internet access logs, and social media data (Twitter, Facebook, Instagram). We need to extract meaning as fast and as much as we can before throwing away the data.

Variety
Data comes in many forms: traditional databases, documents, web pages, social media data, images, video and audio, and location data.

Diya Soubra, The 3Vs that define Big Data, 2012 http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data

BIG DATA BENEFIT AND USE CASE

Why?
"Know thyself, know thy enemy. A thousand battles, a thousand victories."
The real value of big data is in the insights it produces when analyzed: discovered patterns, derived meaning, indicators for decisions, and ultimately the ability to respond to the world with greater intelligence. (http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/big-data-cloud-technologies-brief.pdf)
Benefits: improve products and services; increase customer satisfaction and understand customer behavior; improve operational efficiency; understand emerging market trends.

Google Flu Trends
A pattern emerges when all the flu-related search queries are added together. "We compared our query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening. By counting how often we see these search queries, we can estimate how much flu is circulating in different countries and regions around the world." http://www.google.org/flutrends/about/how.html

Social Media Analytics
Social media analytics is the practice of gathering data from blogs and social media websites and analyzing that data to make business decisions. The most common use of social media analytics is to mine customer sentiment in order to support marketing and customer service activities. Source: "What is social media analytics?", definition from WhatIs.com
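As a rough illustration of mining customer sentiment, the sketch below scores posts by counting positive and negative words. The word lists, the sample posts, and the function name `sentiment_score` are all hypothetical; real systems use far richer language models than this word-counting toy.

```python
# Hypothetical, tiny sentiment lexicons (assumptions for illustration only).
POSITIVE = {"love", "great", "good", "fast"}
NEGATIVE = {"hate", "bad", "slow", "broken"}

def sentiment_score(post):
    """Return (#positive words - #negative words) for one post."""
    words = [w.strip(".,!?").lower() for w in post.split()]
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    return positives - negatives

# Made-up posts standing in for data gathered from social media.
posts = [
    "I love the new phone, great camera!",
    "Delivery was slow and the box was broken.",
    "It arrived today.",
]

for post in posts:
    print(sentiment_score(post), post)
```

A score above zero suggests a positive post and below zero a negative one; aggregating scores over many posts gives a crude view of customer sentiment.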

Cupid in Your Network
Matchmaker study: researchers surveyed approximately 1,500 English speakers around the world who had listed a relationship on their profile at least one year ago but no more than two years ago, asking them how they met their partner and who introduced them (if anyone). They analyzed network properties of couples and their matchmakers using de-identified, aggregated data.
Matchmaker characteristics:
Matchmakers have far more friends than the people they're setting up.
Matchmakers' networks have a different structure: they are less dense, meaning their friends are less likely to know each other.
Matchmakers were more likely to be close friends, rather than acquaintances.
https://research.facebook.com/blog/448802398605370/cupid-in-your-network/
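The idea of a "less dense" network can be made concrete: density is the fraction of possible friend pairs who actually know each other. The sketch below is only an illustration of that definition; the function name `ego_density` and the toy friend lists are assumptions, not data or code from the study.

```python
from itertools import combinations

def ego_density(friends, friendships):
    """Fraction of possible pairs among `friends` that know each other."""
    friends = set(friends)
    possible = len(friends) * (len(friends) - 1) // 2
    if possible == 0:
        return 0.0
    known = {frozenset(pair) for pair in friendships}
    actual = sum(
        1 for a, b in combinations(sorted(friends), 2)
        if frozenset((a, b)) in known
    )
    return actual / possible

# Hypothetical person whose friends mostly know each other (3 of 6 pairs).
dense = ego_density(["a", "b", "c", "d"],
                    [("a", "b"), ("b", "c"), ("c", "d")])

# Hypothetical matchmaker whose friends rarely know each other (1 of 6 pairs).
sparse = ego_density(["a", "b", "c", "d"], [("a", "b")])

print(dense, sparse)
```

A lower value for the second network reflects the pattern described above: the matchmaker bridges friends who are less likely to know each other.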

Consideration for Applying Big Data http://fredericgonzalo.com/en/2013/07/07/big-data-in-tourism-hospitality-4-key-components/

NoSQL (Not Only SQL)
A NoSQL (often interpreted as "Not only SQL") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. NoSQL databases are typically non-relational, distributed, open-source, and horizontally scalable, and are used to handle huge amounts of data. The original intention was modern web-scale databases. Reference: http://nosql-database.org/

MongoDB is a general-purpose, open-source database. MongoDB features:
Document data model with dynamic schemas
Full, flexible index support and rich queries
Auto-sharding for horizontal scalability
Built-in replication for high availability
Text search
Advanced security
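To make "dynamic schemas" and "sharding for horizontal scalability" more concrete, here is a minimal sketch in plain Python. It does not use MongoDB or its driver; the documents, the helper `shard_for`, the hash choice, and the shard count of 3 are all assumptions for illustration, not a description of how MongoDB distributes data internally.

```python
import hashlib

def shard_for(key, num_shards):
    """Pick a shard by hashing the document key (illustrative scheme only)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Documents in one collection need not share the same fields (dynamic schema).
documents = [
    {"_id": "u1", "name": "Ann", "email": "ann@example.com"},
    {"_id": "u2", "name": "Bob", "tags": ["sports", "music"]},
    {"_id": "u3", "name": "Cy", "location": {"city": "Bangkok"}},
]

NUM_SHARDS = 3  # hypothetical cluster size
shards = {i: [] for i in range(NUM_SHARDS)}
for doc in documents:
    shards[shard_for(doc["_id"], NUM_SHARDS)].append(doc)

for shard_id, docs in shards.items():
    print(shard_id, [d["_id"] for d in docs])
```

Because the shard is computed from the key alone, any node can work out where a document lives, which is what lets capacity grow by adding machines.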

Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. The base Apache Hadoop framework is composed of the following modules:
Hadoop Common: contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.
Hadoop MapReduce: a programming model for large-scale data processing.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.

Magic behind Hadoop and HDFS
The problem is divided into two phases:
Map: apply some action to data in <key, value> pairs and get intermediate results.
Reduce: summarize the intermediate <key, value> results and return them to the main program.
Source: Ricky Ho, How Hadoop Map/Reduce works, http://architects.dzone.com/articles/how-hadoop-mapreduce-works

Example: Word Count
Counting words in an input text file. How many times does the word "love" appear in a novel? ^_^
In the map phase, the sentence is split into words that form the initial key-value pairs <word, 1>. For example, "tring tring the phone rings" becomes <tring,1>, <tring,1>, <the,1>, <phone,1>, <rings,1>.
In the reduce phase, the keys are grouped together and the values for identical keys are added. Here there is only one pair of identical keys, "tring"; the values for these keys are added, so the output key-value pairs are <tring,2>, <the,1>, <phone,1>, <rings,1>.
Reduce acts as an aggregation phase for keys. This gives the number of occurrences of each word in the input.
http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html
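The word count walk-through can be sketched in a few lines of plain Python. This is a single-process imitation of the map, group, and reduce steps, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are assumptions chosen for the illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a <word, 1> pair for every word in the line."""
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    """Group values by key, as happens between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: add up the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["tring tring the phone rings"]))
# prints {'tring': 2, 'the': 1, 'phone': 1, 'rings': 1}
```

In a real cluster the map calls run in parallel on different blocks of the input and the grouping step moves data between machines; only the per-key logic shown here stays the same.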

Data Product
A data product provides actionable information without exposing the decision maker to the underlying data or analytics. Examples: movie recommendations, weather forecasts, stock market prediction, operations improvement, health diagnosis, targeted advertising.

Source: The Field Guide to Data Science, Booz Allen Hamilton

Bottom-up approach
What data do we have? How can we collect and store it? What infrastructure and tools are needed to process this big data? What analytics methods can be applied? What insight can we gain from this data and analysis?

Top-down approach
What is the business challenge that can create value and impact for the organization? What data do we need? What tools and analytics approach should be used? What infrastructure is needed?

Some Trends

In-memory Database
An in-memory database is a database management system that primarily relies on main memory for computer data storage. It is faster than disk-optimized databases since the internal optimization algorithms are simpler and execute fewer CPU instructions. Accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk. Source: http://en.wikipedia.org/wiki/in-memory_database
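For a quick feel of a database that lives entirely in main memory, Python's standard `sqlite3` module can open one with the special name `:memory:`. This is only a small-scale illustration, not one of the large in-memory systems the trend refers to; the table name `access_log` and its rows are made up.

```python
import sqlite3

# ":memory:" asks SQLite to keep the whole database in RAM, with no disk file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (visitor TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO access_log VALUES (?, ?)",
    [("alice", "/home"), ("bob", "/home"), ("alice", "/news")],
)

# Query the in-memory table: hits per page.
rows = conn.execute(
    "SELECT page, COUNT(*) FROM access_log GROUP BY page ORDER BY page"
).fetchall()
print(rows)

conn.close()  # the data disappears when the connection is closed
```

The trade-off is visible in the last line: nothing persists once the connection closes, which is why production in-memory systems add their own durability mechanisms.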

Spark at Yahoo
Two use cases: personalizing news pages for Web visitors, and running analytics for advertising.
For news personalization, the company uses ML algorithms running on Spark to figure out what individual users are interested in, and also to categorize news stories as they arise to figure out what types of users would be interested in reading them. Yahoo wrote a Spark ML algorithm in 120 lines of Scala. (Previously, its ML algorithm for news personalization was written in 15,000 lines of C++.) With just 30 minutes of training on a large, hundred-million-record data set, the Scala ML algorithm was ready for business.
The second use case shows off the interactive capability of Hive on Spark (Shark): existing BI tools can be used to view and query the advertising analytic data collected in Hadoop.
http://www.datanami.com/2014/03/06/apache_spark_3_realworld_use_cases/

Big Data Infrastructure Goes to the Cloud
Data is already on the cloud.
Virtual organizations.
Cloud-based SaaS services.
Big Data as a Service on the cloud: private cloud, or public cloud such as IBM Bluemix, Amazon AWS (EMR), and many others.
(Diagram labels: App, Services, Big Data)

Big Data Analytics
A set of advanced technologies designed to work with large volumes of heterogeneous data. It is used to explore the data and to discover interrelationships and patterns using sophisticated quantitative methods such as machine learning, neural networks, robotics algorithms, computational mathematics, and artificial intelligence.

Deep Learning
Deep learning is a subcategory of machine learning that uses neural networks to improve things like speech recognition, computer vision, and natural language processing. It supports unsupervised learning of abstract concepts.

Applying Deep Learning
In 2011, Stanford computer science professor Andrew Ng founded Google's Google Brain project, which created a neural network trained with deep learning algorithms that famously proved capable of recognizing high-level concepts, such as cats, after watching just YouTube videos and without ever having been told what a cat is.
Facebook is using deep learning expertise to help create solutions that will better identify faces and objects in the 350 million photos and videos uploaded to Facebook each day.
Voice recognition such as Google Now and Apple's Siri now uses deep learning. According to Google researchers, the voice error rate in the new version of Android, after adding insights from deep learning, stands at 25% lower than in previous versions of the software.
Sources: http://www.wired.com/2014/08/deep-learning-yann-lecun/ and http://www.fastcolabs.com/3026423/why-google-is-investing-in-deep-learning

IBM Watson and Cognitive Technology
Watson is a cognitive technology that processes information more like a human than a computer by understanding natural language, generating hypotheses based on evidence, and learning as it goes. And learn it does. Watson gets smarter in three ways: by being taught by its users, by learning from prior interactions, and by being presented with new information. This means organizations can more fully understand and use the data that surrounds them, and use that data to make better decisions.

Applying Watson in Healthcare
WellPoint, Inc. is an Indianapolis-based health benefits company with approximately 37 million health plan members; it processes more than 550 million claims per year. It is using IBM Watson to improve the quality and efficiency of healthcare decisions.
WellPoint trained Watson with 25,000 historical cases. Now Watson uses hypothesis generation and evidence-based learning to generate confidence-scored recommendations that help nurses make decisions about utilization management (UM). Natural language processing leverages unstructured data, such as text-based treatment requests.
Benefits:
Helps UM nurses make faster UM decisions about treatment requests.
Could accelerate healthcare preapprovals, which can be critical when treatments are time-sensitive.
Includes unstructured data in the streamlined decision process.

Challenges
Developing a big data application is not simple: it requires new algorithms and new software development tools.
Proper policy about data security and ownership is needed.
There is a lack of data scientists, whose role is different from that of software developers.