What is Big Data
Outline: What is big data, and where does it come from? How do we deal with big data?
Big Data Everywhere! As humans, we generate a lot of data in our everyday activities. When you buy something, you generate a transaction record for your purchase; when you go online or message a friend on your phone, that too generates tons of data. In the past, most of this data was simply thrown away, but in recent years people have started realizing that we can find many interesting things in it. For example, a store can use the data to learn your purchasing behavior and sell you more things, and biologists can use it to trace how a disease propagates across different places. In an environment like the IoT, where everything is connected to the Internet, we will generate even more data.
How much data? Bill Gates is often (perhaps apocryphally) quoted as saying that 640K of memory ought to be enough for a computer. Today, we count big data in TB and PB: a TB is 1,000 GB and a PB is 1,000 TB. To give some examples, Google processes about 20 PB per day, and Facebook and eBay each generate roughly 10-50 TB per day. Assuming we used a 4G link at 100 MB/s to send this data, it would take more than a day to transmit one day's worth of data generated by Facebook users.
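The transfer-time claim above is easy to check with back-of-the-envelope arithmetic, sketched here using the 10-50 TB/day and 100 MB/s figures from the text (decimal units assumed):

```python
# Back-of-the-envelope transfer time for one day of generated data.
TB = 1000 ** 4          # 1 TB in bytes (decimal units)
MB = 1000 ** 2          # 1 MB in bytes

def transfer_days(data_bytes, rate_bytes_per_sec):
    """How many days it takes to send the data at the given rate."""
    return data_bytes / rate_bytes_per_sec / 86400  # 86400 seconds per day

low = transfer_days(10 * TB, 100 * MB)    # ~1.16 days
high = transfer_days(50 * TB, 100 * MB)   # ~5.79 days
print(f"{low:.2f} to {high:.2f} days")
```

Even at the low end of 10 TB/day, the transfer takes more than a full day, which is why moving computation to the data (rather than data to the computation) matters so much for big data systems.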
Some scientific projects generate even more data than these online services. The LHC (Large Hadron Collider), used for high-energy physics, generates more than 15 PB per year, and EarthScope generates 67 TB per day. Without a supercomputer, it would be impossible to analyze this data.
The EarthScope EarthScope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data.
Types of Data This data can be generated in different forms: structured, semi-structured, graph, and text. It can also be real-time or non-real-time.
What to do with this data? What can we do with the data? You can use it to generate statistics about the past; for example, you can use Amazon's data to find out the current most popular book people have bought. Or, given a question, you can use the data to find the answer; for example, the FBI can use your facial image to find out everything about you. Or you can discover something new in the data, which is what many scientists do every day; for example, biologists use biological data to figure out how to make people live longer.
Warehouse Architecture The first type of usage is so-called data warehousing. We normally collect data from various places, integrate it, and put it together in a central server, so that people can access the central server to do the analysis they want. During integration and analysis, we also generate some intermediate data, so-called metadata.
Aggregates For example, you can query how many products were sold on day 1 by adding up the amounts for day 1 with a simple SQL statement:

SELECT sum(amt) FROM sale WHERE date = 1

sale:
  prodid  storeid  date  amt
  p1      c1       1     12
  p2      c1       1     11
  p1      c3       1     50
  p2      c2       1     8
  p1      c1       2     44
  p1      c2       2     4

The query returns 12 + 11 + 50 + 8 = 81.
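The same aggregate can be sketched in plain Python over the sale table from the slide, which makes explicit what the SQL engine is doing:

```python
# The sale table from the slide: (prodid, storeid, date, amt) rows.
sale = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

# Equivalent of: SELECT sum(amt) FROM sale WHERE date = 1
total = sum(amt for prodid, storeid, date, amt in sale if date == 1)
print(total)  # 81
```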
What is Data Mining? Data mining is generally different from the first two usages of the data: here, we are trying to discover something unexpected or unknown in the data. In the previous examples, by contrast, we already knew what we stored and what we would get back from the database.
Data Mining Tasks There are many different techniques that can be used for data mining; here we briefly describe some common ones.
Classification: Definition Classification is one of the most common tasks in data mining. The idea is to use the data to train a model based on some features of the data. For example, suppose we want to divide people into two classes, one of which has a healthy lifestyle: eating well, sleeping well, and exercising regularly. We first collect a dataset of such people and use it to train a model based on their sleeping time, diet, and exercise hours. In the future, when we have a new person's data, we can use this model to tell whether that person is living healthily or not. When building a model, we usually need to test its accuracy; the general practice is to use half of the data as training data to build the model and the remaining half to validate it.
Decision Trees A decision tree is one mechanism for classification. For example, from the training set below we can find that if a person is from SF, or drives a van, they are more likely to buy a new car.

training set:
  custid  car     age  city  newcar
  c1      taurus  27   sf    yes
  c2      van     35   la    yes
  c3      van     40   sf    yes
  c4      taurus  22   sf    yes
  c5      merc    50   la    no
  c6      taurus  25   la    no
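The rule read off this table ("from SF, or drives a van, likely buys a new car") can be written out as a tiny hand-built decision tree and checked against the same six rows:

```python
# Training set from the slide: (custid, car, age, city, newcar).
training_set = [
    ("c1", "taurus", 27, "sf", "yes"),
    ("c2", "van",    35, "la", "yes"),
    ("c3", "van",    40, "sf", "yes"),
    ("c4", "taurus", 22, "sf", "yes"),
    ("c5", "merc",   50, "la", "no"),
    ("c6", "taurus", 25, "la", "no"),
]

def predict(car, city):
    # Root split on city; the "la" branch splits again on car type.
    if city == "sf":
        return "yes"
    return "yes" if car == "van" else "no"

# This hand-built tree classifies every training row correctly.
assert all(predict(car, city) == newcar
           for _, car, _, city, newcar in training_set)
print("all 6 rows classified correctly")
```

A real decision-tree learner would choose these splits automatically, by picking at each node the feature that best separates the classes.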
Clustering Clustering is a way to divide data into different groups. For example, if you have people's age, education, and income data, you may see that people who are older and have more education generally have a higher income.
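Grouping like this can be done automatically. Below is a minimal k-means sketch (k = 2) on made-up (age, income) points; the data and initial centers are illustrative only:

```python
def kmeans(points, centers, rounds=10):
    """Plain k-means: assign points to nearest center, then move centers."""
    for _ in range(rounds):
        # Assign each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Move each center to the mean of its cluster.
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else c
                   for cl, c in zip(clusters, centers)]
    return centers, clusters

points = [(25, 30), (28, 35), (30, 32),    # younger, lower income
          (55, 90), (60, 95), (58, 88)]    # older, higher income
centers, clusters = kmeans(points, centers=[(25, 30), (60, 95)])
print(clusters[0])  # the younger / lower-income group
print(clusters[1])  # the older / higher-income group
```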
Association Rule Mining Association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. The most commonly used example is analyzing sales records.
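Two of the standard interestingness measures, support and confidence, can be sketched on a made-up set of sales transactions. The rule checked here, {bread} -> {butter}, is purely illustrative:

```python
# Made-up sales transactions, each a set of items bought together.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"beer", "bread", "butter"},
    {"beer", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "butter"}))       # 3/5 = 0.6
print(confidence({"bread"}, {"butter"}))  # 3/4 = 0.75
```

A "strong" rule is one whose support and confidence both exceed chosen thresholds; algorithms such as Apriori search for all such rules efficiently.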
Other Types of Mining The techniques above assume you have a structured database, with structured columns and rows for the analysis. There is also unstructured data, handled by techniques such as text mining and graph mining. Text mining is most often used to mine information from web pages, for example to find which web pages are more related to each other. Graph mining works on a special kind of structured data in which entities are stored in graph form: the nodes are features and the links are the relationships between features.
Data Streams Data stream mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that, in many data stream mining applications, can be read only once or a small number of times using limited computing and storage capabilities. We normally look at only a subset of the data at a time, using a so-called window technique; there are different ways of defining the window. Examples of data streams include computer network traffic, phone conversations, ATM transactions, and web searches.
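The window technique can be sketched as follows: keep only the most recent N stream items and recompute statistics as each new item arrives (a sliding window; the stream values here are made up):

```python
from collections import deque

WINDOW = 3
window = deque(maxlen=WINDOW)   # oldest item falls off automatically

stream = [5, 8, 6, 100, 7, 9]   # e.g. per-second request counts
for value in stream:
    window.append(value)        # read each item once, as a stream requires
    avg = sum(window) / len(window)
    print(f"value={value:3d}  window={list(window)}  avg={avg:.1f}")
```

Because only the last N items are stored, memory stays constant no matter how long the stream runs, which is exactly the constraint described above.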
Challenges in Handling Big Data The issue with big data is that, because it is big, you need big storage and big processing power to handle it. You also need fast algorithms and architectures to process it.
Big Data Landscape Many technologies have been proposed to handle big data. In this course, we will focus on Hadoop, but we will also briefly mention some of the other technologies.
Big Data Technology (#1) Current big data technologies generally focus on three aspects: 1. how to reduce the running time of computing over big data; 2. how to make big data analysis tools more and more effective; 3. how to get more and more insight out of the data and use it for business. Another trend we can foresee is that data will become bigger and bigger, from terabytes to even zettabytes! It should be understood that there are at least three significant aspects of big data that make it unique, beyond just "an order of magnitude more data beyond what you have now." First, we need to recognize that traditional methods for moving, processing, and querying data were not sufficient; the big data industry has created an entirely new set of techniques (and adapted some that existed) so that organizations can actually process the full universe of information they possess in enough time to get inside the windows of key business processes and critical decision trees. Thus, Fast Data techniques provide the ability to 'see' all (or at least enough) of what you know in a short enough time to actually do something with what you've learned.
Big Data Technology (#2) Second, there are qualitative differences between traditional business databases and big data. While Fast Data is about new techniques to process and transform raw information considerably faster than ever before, we need Big Analytics to turn information into knowledge using a combination of existing and new approaches. As you can see from the slides, some of the classic players in analytics are in use here, including MATLAB, SAS, and R. But some of the most interesting aspects of big data can be found in relatively new entrants such as Apache Hive and Mahout, the latter of which brings automated machine learning to bear to find hidden trends and otherwise unthought-of or unconsidered ideas. In fact, an entire industry is growing up in smart information management systems that will "not rely on users dreaming up smart questions to ask computers; rather, they will automatically determine if new observations reveal something of sufficient interest to warrant some reaction, e.g., sending an automatic notification to a user or a system about an opportunity or risk."
Big Data Technology (#3) Finally, the powerful yet unfocused tools of Big Analytics are not by themselves sufficient to reap the rewards of big data. That requires taking the sum of the information at hand, applying analytic processes to it, and finally generating new knowledge and insights using a specific, situated method. Insight must be in the domain of the business to be useful, and this part of big data is where the technology is connected to ground truth in a feedback loop. That is, the tools of Big Analytics are just tools by themselves; they are not actually useful in a business context until they are directed at deriving a particular type of result. Insights must also be connected to specific objectives (examples depicted in the moving parts visual above) in order to have high levels of impact.