Data Processing in the Era of Big Data




Pangfeng Liu
Department of Computer Science and Information Engineering, National Taiwan University
October 3, 2014

Big Data: a New Jargon

Big data is a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications (Wikipedia 1). Big data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization (Gartner 2).
1 http://en.wikipedia.org/wiki/big_data
2 http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3d-data-management-controlling-data-volume-velocity-and-variety.pdf

We can derive more information from a single large data set than from many separate data sets with the same total volume.

The Size Matters
The current limit on data set size is about an exabyte, as Wikipedia claims.
Megabyte 10^6
Gigabyte 10^9
Terabyte 10^12
Petabyte 10^15
Exabyte 10^18, or 1,000,000,000,000,000,000.

The Size Matters
How do we store an exabyte of data? You need one million 1-terabyte disks.
Price: 2,000 NT × 1,000,000 = 2,000,000,000 NT
Weight: 0.6 kg × 1,000,000 = 600,000 kg
Power: 2 W × 1,000,000 = 2,000,000 W
Height: 3 cm × 1,000,000 = 30 km, about 100 times the height of the Taipei 101 tower.
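This back-of-envelope arithmetic is easy to mechanize. A minimal Python sketch, using the per-disk estimates from the slide (price, weight, power, and thickness are the slide's assumptions, not measured figures):

```python
# Scaling one commodity 1 TB disk up to the one million disks
# an exabyte would need, using the slide's per-disk estimates.
DISKS = 10**6                      # 1 EB / 1 TB per disk

price_nt = 2000 * DISKS            # 2,000,000,000 NT
weight_kg = 0.6 * DISKS            # 600,000 kg
power_w = 2 * DISKS                # 2,000,000 W
height_km = 3 * DISKS / 100_000    # 3 cm per disk, stacked: 30 km

print(price_nt, weight_kg, power_w, height_km)
```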

Who Needs Big Data?
Meteorology
Genomics
Connectomics
Complex physics simulations
Biological and environmental research
Internet search
Finance
Business informatics

Where Does Big Data Come from?
Ubiquitous information-sensing mobile devices
Remote sensing
Software logs
Cameras
Microphones
Radio-frequency identification (RFID) readers
Wireless sensor networks

Big data offers a cost-effective way to improve decision-making in critical development areas such as health care, employment, economic productivity, crime and security, and natural disaster and resource management. It can be used to spot business trends, determine the quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.

Big data has the potential to create a new kind of digital divide: a divide in data-based intelligence to inform decision-making. Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, and HP have spent more than $15 billion on software firms specializing only in data management and analytics. In 2010, this industry on its own was worth more than $100 billion and was growing at almost 10 percent a year, about twice as fast as the software business as a whole.

Big Data Potential Index 3
[Figure]
3 http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

An Example
Tobias Preis et al. used Google Trends data to demonstrate that Internet users from countries with a higher per capita gross domestic product (GDP) are more likely to search for information about the future than information about the past. The findings suggest there may be a link between on-line behavior and real-world economic indicators 4.
4 http://en.wikipedia.org/wiki/big_data

The study examined Google query logs from 45 different countries in 2010 and calculated a future orientation index, which is the ratio of the volume of searches for the coming year to the volume of searches for the previous year. The authors compared the future orientation index to the per capita GDP of each country and found a strong tendency for countries in which Google users inquire more about the future to exhibit a higher GDP. The results hint that there may be a relationship between the economic success of a country and the information-seeking behavior of its citizens captured in big data.
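For concreteness, a minimal sketch of how such an index could be computed. The function name and the sample volumes are hypothetical stand-ins for Google Trends query counts; the study itself defines only the ratio:

```python
# Future orientation index for a given calendar year y:
# search volume mentioning year y+1 divided by search volume
# mentioning year y-1, both measured during year y.
def future_orientation_index(searches_next_year, searches_prev_year):
    return searches_next_year / searches_prev_year

# e.g., during 2010: queries containing "2011" vs. queries containing "2009"
print(future_orientation_index(searches_next_year=120_000,
                               searches_prev_year=80_000))   # 1.5
```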

Challenges
Capture
Storage
Search
Sharing
Analysis
Visualization

Big Data
Volume: the amount of data.
Velocity: the speed of data in and out.
Variety: the range of data types and sources.
Veracity: the correctness of data.

Volume
Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information 5. The per-capita data volume has roughly doubled every 40 months since the 1980s, and we now create 2.5 exabytes (2.5 × 10^18 bytes) of data every day. Global data generation is projected to grow 40% per year, against only 5% growth in IT spending.
5 http://www-01.ibm.com/software/data/bigdata/

Volume
The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster.
Walmart has more than 2.5 petabytes of customer data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.
The Utah Data Center constructed by the United States National Security Agency will hold yottabytes (10^24 bytes) of information collected by the NSA over the Internet.

Volume
Facebook hosts 50 billion photos from its user base, and Facebook users share 30 billion pieces of content every month.
The Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts world-wide.
Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work at various times of the day.

Velocity
The Large Hadron Collider has 150 million sensors delivering data 40 million times per second, and it would generate 500 exabytes per day before replication and filtering.
The Sloan Digital Sky Survey (SDSS) collected more astronomical data in its first few weeks than had been gathered in the entire history of astronomy before it, and it generates about 200 GB per night.
Decoding the human genome originally took 10 years; now it can be done in one week.
Walmart handles more than 1 million customer transactions every hour.

Variety
Big data includes any type of data: structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more 6. New insights are found when these data types are analyzed together.
6 http://www-01.ibm.com/software/data/bigdata/

Veracity
One in three business leaders don't trust the information they use to make decisions 7. How can you act upon information if you don't trust it? Establishing trust in big data presents a huge challenge as the variety and number of sources grow.
7 http://www-01.ibm.com/software/data/bigdata/

The Outlook
Q: What are our chances of doing exabyte computing?
A: Not very good, so far.

Our Enemy
[Cartoon: monster 8]
8 http://us.123rf.com/400wm/400/400/soify/soify1210/soify121000002/15649042-monster-cartoon.jpg

Our Weapon
[Cartoon: a tiny cute weapon 9]
9 http://cdn.smosh.com/sites/default/files/bloguploads/cute-weapon-tiny2-b.jpg

Big Data Technology
Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times.

Technologies
A 2011 McKinsey report 10 suggests suitable technologies for big data:
A/B testing
Association rule learning
Classification
Cluster analysis
Crowdsourcing
Data fusion and integration
Ensemble learning
Genetic algorithms
Machine learning
10 http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

Technologies
Natural language processing
Neural networks
Pattern recognition
Anomaly detection
Predictive modeling
Regression
Sentiment analysis
Signal processing
Supervised and unsupervised learning
Simulation
Time series analysis and visualization
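As a concrete instance of one technique from the lists above (cluster analysis), here is a minimal k-means sketch using scikit-learn. The tool choice and the toy data are our assumptions; the McKinsey report names techniques, not implementations:

```python
# k-means clustering on four 2-D points that form two obvious groups.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 9.0]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)   # e.g., [0 0 1 1]: the two groups are recovered
```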

Really??
This is questionable. None of these techniques even addresses the problem of storing the data.

The Road
[Figure 11]

Data Science
1 Fundamentals
2 Statistics
3 Programming
4 Machine Learning
5 Text Mining/Natural Language Processing
6 Visualization
7 Big Data
8 Data Ingestion
9 Data Munging
10 Toolbox

Technologies
Additional technologies being applied to big data include:
Massively parallel-processing (MPP) databases
Search-based applications
Data-mining grids
Distributed file systems
Distributed databases
Cloud-based infrastructure

Lack of Technologies
Gartner suggests the following to deal with the volume issue 12:
Limiting the data collected to that which will be leveraged by current or imminent business processes.
Limiting certain analytic structures to a percentage of statistically valid sample data (see the sketch below).
Profiling data sources to identify and subsequently eliminate redundancy.
Monitoring data usage to determine cold spots.
Outsourcing. (You can never beat that.)
12 http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3d-data-management-controlling-data-volume-velocity-and-variety.pdf
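One standard way to realize "a percentage of statistically valid sample data" is reservoir sampling, which keeps a fixed-size uniform random sample of a stream too large to store. The algorithm choice is ours, not Gartner's; a minimal sketch:

```python
# Reservoir sampling (Algorithm R): a uniform sample of k items
# from a stream of unknown length, in one pass and O(k) memory.
import random

def reservoir_sample(stream, k):
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = random.randint(0, i)   # item i survives with prob k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(10**6), k=5))
```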

What do we need to deal with Big Data?
Cloud infrastructure
Extremely large-scale databases
Data mining and machine learning
Domain knowledge

Cloud Infrastructure
Only cloud computing can provide the processing capability for big data. Just a simple question: where are you going to place the one million hard disks for an exabyte-scale database?

Data Center
A data center, also called a server farm, is a facility used to house computer systems and associated components, such as telecommunications and storage systems. It generally includes redundant or backup power supplies, redundant data communications connections, environmental controls (e.g., air conditioning, fire suppression), and security devices.

Data Center
Continuity: companies rely on their information systems to run their operations.
Security: a data center has to offer a secure environment that minimizes the chances of a security breach.
Integrity: redundancy of both fiber-optic cables and power, including emergency backup power generation, ensures the integrity of data.

Infrastructure as a Service
Infrastructure as a Service (IaaS) is the delivery of computer infrastructure (typically a platform virtualization environment) as a service. It was originally called Hardware as a Service (HaaS).

Amazon Elastic Compute Cloud
Amazon Elastic Compute Cloud (also known as EC2) allows customers to rent computers on which to run their own applications. EC2 enables scalable deployment of applications by providing a web services interface through which a customer can create virtual machines, i.e., server instances, and load them with any software of their choice.
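A minimal sketch of that web services interface in use, here via the boto3 Python SDK (our assumption for illustration; the AMI id and region are placeholders):

```python
# Launch one EC2 server instance through the RunInstances web service call.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-00000000",   # placeholder machine image id
    InstanceType="t2.micro",  # small general-purpose instance type
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```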

Amazon Elastic Compute Cloud
Elastic
Completely controlled
Flexible
Designed for use with other Amazon Web Services
Reliable
Secure
Inexpensive

NoSQL
A NoSQL database provides a simple, lightweight mechanism for the storage and retrieval of data, with higher scalability and availability than traditional relational databases 13.
13 http://en.wikipedia.org/wiki/nosql

Relational Database?
"If you want vast, on-demand scalability, you need a non-relational database." 14 Is that so? What are the differences between relational and non-relational databases? Is this a sign that relational databases have had their day and will decline over time?
14 http://readwrite.com/2009/02/12/is-the-relational-database-doomed

Relational Database
Relational databases have been around for over 30 years: well studied, well optimized, no major changes. All of the would-be revolutions fizzled out, and none even made a dent in the dominance of relational databases.

Relational Database
For an increasing number of applications, one benefit of non-relational databases, scalability, is becoming more and more critical; and while still considered a niche, it is rapidly becoming mainstream, so much so that for an increasing number of database users this requirement is beginning to eclipse others in importance 15.
15 http://readwrite.com/2009/02/12/is-the-relational-database-doomed

Scalability
Scalability is the key issue. To achieve scalability, you need scalable infrastructure.

Database Scalability
Web 2.0 applications, social networking, and on-line multi-player gaming have become more and more popular, and they typically deal with ever-increasing amounts of data. Deploying these applications on traditional relational database management systems typically suffers from limited scalability.

NoSQL Databases
Various NoSQL databases are used to manage large amounts of data:
BigTable from Google
HBase
Cassandra from Facebook
Dynamo from Amazon

From SQL to NoSQL
NoSQL databases are not a direct replacement for traditional relational database management systems.
Many applications require multi-row transaction support.
Data management tools and many existing applications typically interface with databases using SQL.

Our Work
A SQL interface for NoSQL 16.
A multi-row transaction system on HBase 17.
A cloud-based BSP-model graph computation engine 18.
16 Meng-Ju Hsieh, Chao-Rui Chang, Li-Yung Ho, Jan-Jan Wu, Pangfeng Liu: A Scalable Database Management System for Cloud Computing. ICPP 2011: 315-324.
17 Chao-Rui Chang, Meng-Ju Hsieh, Jan-Jan Wu, Po-Yen Wu, Pangfeng Liu: A Highly Scalable Cloud Database for Multi-user Query Processing. IEEE CLOUD 2012: 943-944.
18 Li-Yung Ho, Tsung-Han Li, Jan-Jan Wu, Pangfeng Liu: An efficient and scalable graph data processing system. BigData Conference 2013: 193-198.

A MapReduce-based interface for SQL applications. It provides high-performance OLAP processing with SQL syntax. A joint work with the Institute of Information Science, Academia Sinica. http://otl.sinica.edu.tw/index.php?t=9&group_id=25&article_id=1208

Technology Transfer

A highly scalable database for OLTP applications. It is built on top of HBase and supports many desirable features that OLTP applications require.

Features
High scalability
SQL interface
Multi-row transaction support
Secondary index support

Contributions
Provide a SQL interface on HBase.
Support multi-row transactions on HBase.
Design a distributed secondary indexing scheme for HBase.

HBase
A NoSQL database with high scalability.
Designed to host very large tables.
Good at random read/write access.
Built on the Hadoop framework.
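A minimal sketch of HBase's random read/write access from Python, using the third-party happybase client (our assumption; the systems in this talk use HBase through its native Java API, and the host, table, and row names below are placeholders):

```python
# Write one row to an HBase table and read it back by key.
import happybase

connection = happybase.Connection("hbase-host")   # hypothetical host
table = connection.table("users")                 # assumes table exists

# random write: one row keyed by user id, columns in family "info"
table.put(b"user42", {b"info:name": b"Alice", b"info:city": b"Taipei"})

# random read: fetch the row back by its key
row = table.row(b"user42")
print(row[b"info:name"])                          # b'Alice'
```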

Architecture of HBase
[Figure: HBase clients connect to region servers; each region server hosts regions with coprocessors and stores/tables, and writes through DFS clients to HDFS data nodes.]

System Architecture
[Figure: a client layer with a table manager and transaction clients; a server layer of HBase region servers, each with a region coprocessor and a local transaction manager; and a storage layer on Hadoop HDFS.]

Transaction Throughput
[Figure: transaction throughput, scale factor = 100]

Transaction Throughput
[Figure: transaction throughput, scale factor = 200]

Summary
Our system is a highly scalable database for OLTP applications. It provides a SQL interface for applications, supports multi-row transactions on HBase, and uses a distributed B-tree scheme to improve performance. Experimental results indicate that it scales well on large data sets.

Large-Scale Graph Computation
MapReduce has been proven efficient for a specific class of large-scale data processing, but it does not perform well on graph data processing. Google proposed Pregel, which utilizes the Bulk Synchronous Parallel (BSP) model for large-scale graph processing.

BSP Model for Graph Computation
A vertex-centric, iterative computation model.
The user implements a compute function that targets a single vertex, resembling the map and reduce functions in the MapReduce model.
Computation consists of a sequence of iterations, called supersteps.
The execution of the compute function is synchronized between supersteps.

BSP Model Execution Flow
At first, all vertices are set to the active state. In each iteration, the compute function is invoked on each active vertex to
1 read the messages sent to it in the previous iteration,
2 modify its vertex value according to the messages,
3 send messages to neighboring vertices (activating them), and
4 optionally vote to halt the computation (become inactive).
When all vertices are inactive, the computation ends. (A minimal sketch of this flow follows.)
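A self-contained Python sketch of this execution flow, using maximum-value propagation (one of the benchmark applications shown later) as the compute function. This illustrates the model only, not the actual system from the talk:

```python
# Pregel-style BSP supersteps: each active vertex reads its messages,
# updates its value, and sends messages that re-activate its neighbors.
def bsp_max_value(graph, values):
    """graph: {vertex: [neighbors]}; values: {vertex: number}."""
    inbox = {v: [] for v in graph}
    active = set(graph)                 # all vertices start active
    superstep = 0
    while active:
        outbox = {v: [] for v in graph}
        for v in active:
            incoming = inbox[v]         # 1. read last superstep's messages
            new_value = max([values[v]] + incoming)
            changed = new_value > values[v]
            values[v] = new_value       # 2. modify the vertex value
            if changed or superstep == 0:
                for u in graph[v]:      # 3. message (and activate) neighbors
                    outbox[u].append(new_value)
            # 4. otherwise the vertex votes to halt (stays out of 'active')
        inbox = outbox                  # synchronization barrier
        active = {v for v in graph if inbox[v]}
        superstep += 1
    return values                       # all vertices inactive: computation ends

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(bsp_max_value(graph, {"a": 3, "b": 1, "c": 5}))
# -> {'a': 5, 'b': 5, 'c': 5}
```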

Giraph
An iterative graph processing system.
Originated as the open-source counterpart to Google's Pregel.
Used at Facebook to analyze the social graph.

Hama
A BSP computing framework on top of HDFS.
Designed for massive scientific computations such as matrix, graph, and network algorithms.

An efficient and scalable graph data processing system.
Highly optimized for processing large-scale graphs.
Cooperates with HBase to achieve scalable data manipulation.

System Architecture
[Figure: the architecture. A master runs a partition manager and a query manager; each worker holds graph data with a data loader and a query processor; the system is backed by a NoSQL database (HBase) and a coordination system (ZooKeeper).]

Optimizations
Pull messaging: applied to algorithms that require all neighboring data in order to compute.
Lazy vertex loading: applied to sub-graph queries.
Vertex-weighted partitioning.

Dataset
Four real social networks as the input data:
Social Network   Nodes (millions)   Edges (millions)
Orkut                 3.07              117.26
Flickr                1.86               15.97
LiveJournal           5.28               49.4
YouTube               1.16                3.01

Overall Performance
[Four charts: running time in seconds of Hama and Giraph on the Maxvalue, N-steps, Pagerank, Bipartite, SSSP, Inference, and Label applications.]

Scalability is the key issue, and cloud infrastructure is essential.
Data is big only when it reaches a scale at which we cannot process it with traditional IT infrastructure.
NoSQL will be crucial because of its scalability.
We still have a long way to go before we can process exabyte data sets.