1 Big Data Technology Choochart Haruechaiyasak, Ph.D., Speech and Audio Technology Laboratory (SPT), National Electronics and Computer Technology Center (NECTEC), National Science and Technology Development Agency (NSTDA)
2 Presentation outline Big Data Technology: overview, definition, motivation and properties; BI vs. data science; applications. Big Data Tools: Hadoop; NoSQL; MongoDB.
3 What is Big Data?
5 Big Data: Motivation Source:
6 Structured vs. Unstructured Data Source:
7 Structured vs. Unstructured Data Source:
8 Big Data vs. Traditional Data Source:
9 Big Data vs. Traditional Data Source:
10 Big Data vs. Traditional Data Source:
11 Big Data vs. Traditional Data Source:
12 7 Key Drivers for the Big Data Market Source:
15 3Vs of Big Data
16 3Vs of Big Data
18 4Vs of Big Data
19 4Vs of Big Data
20 4Vs of Big Data
21 4Vs of Big Data
22 4Vs of Big Data
23 Where does Big Data come from? Source:
24 Gartner's Hype Cycle for Big Data chart
25 Big Data Business Model Maturity Chart Source: https://infocus.emc.com/william_schmarzo/big-data-business-model-maturity-chart/
26 Big Data Business Model Maturity Chart Source: https://infocus.emc.com/william_schmarzo/big-data-business-model-maturity-chart/
27 Big Data Business Model Maturity Chart Source: Bill Schmarzo, Big Data: Understanding How Data Powers Big Business
28 Difference between BI and Data Science Source: Bill Schmarzo, CTO, EMC Consulting, Enterprise Information Management & Analytics,
29 Difference between BI and Data Science Source: Bill Schmarzo, CTO, EMC Consulting, Enterprise Information Management & Analytics,
30 Difference between BI and Data Science Source: Bill Schmarzo, CTO, EMC Consulting, Enterprise Information Management & Analytics,
31 Data Scientist Source: Source:
37 What is Hadoop? Source: The content from this section is summarized from
38 Data is growing fast Three reasons why we are generating data faster than ever: (1) Processes are increasingly automated; (2) Systems are increasingly interconnected; (3) People are increasingly living online.
39 Data and system evolution
40 Real-time data analytics The continuous challenge in Web 2.0 is how to improve site relevance and performance, understand user behavior, and gain predictive insight to influence decisions. Consumer-oriented industries (travel, retail, financial services, digital media, search, etc.) all face similar real-time information dynamics.
41 What is Hadoop? Hadoop is a scalable, fault-tolerant distributed system for data storage and processing (Apache license). Core Hadoop has two main systems: (1) Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage. (2) MapReduce: distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.
42 Hadoop: MapReduce
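The MapReduce programming abstraction can be illustrated with a small single-process sketch. This is not Hadoop's API: the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are hypothetical, and the word-count task is chosen only because it is the customary introductory example. In a real cluster the map and reduce calls would run on different nodes and the framework would perform the shuffle.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different node; here we loop locally.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())
```

Calling `word_count(["big data", "Big cluster"])` counts `big` twice and the other words once. Only `map_phase` and `reduce_phase` are task-specific; the grouping step stays the same for any job.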
43 Hadoop's benefits Flexibility: store any data (structured or not) and run any analysis. Scalability: start at 1 TB on 3 nodes and grow to petabytes on thousands of nodes. Economics: cost per TB at a fraction of traditional options.
44 Traditional DBMS Traditional relational databases and data warehouse products excel at OLAP and OLTP workloads over structured data. These form the underpinnings of most IT applications. Use relational databases when dealing with (1) interactive OLAP analytics; (2) multistep ACID transactions; (3) 100% SQL compliance. It is becoming increasingly difficult for classic techniques to support the wide range of use cases and workloads that power the next wave of digital business.
45 Hadoop's approach Hadoop is designed to solve a different problem: the fast, reliable analysis of structured, unstructured, and complex data. Hadoop and related software are designed for the 3Vs: (1) Volume: commodity hardware and open source software lower cost and increase capacity; (2) Velocity: data ingest speed is aided by append-only and schema-on-read design; (3) Variety: multiple tools to structure, process, and access data.
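The schema-on-read idea mentioned under Velocity can be sketched as follows. Raw records are appended untouched, and a schema is applied only when a reader needs one. The data in `RAW_LOG`, the helper `read_with_schema`, and the `(type, default)` schema layout are assumptions made for illustration; they do not correspond to any particular Hadoop tool.

```python
import json

# Hypothetical raw records, appended as-is without validating a schema.
RAW_LOG = [
    '{"user": "a", "amount": "12.5", "ts": "2014-01-01"}',
    '{"user": "b", "amount": "3"}',
    '{"user": "c", "note": "no amount field"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field -> (type, default)) only when reading."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else default
            for field, (cast, default) in schema.items()
        }


# The schema is chosen by the reader, not fixed at ingest time.
purchases_schema = {"user": (str, ""), "amount": (float, 0.0)}
rows = list(read_with_schema(RAW_LOG, purchases_schema))
```

Because nothing is validated at write time, ingest stays cheap, and a different reader could apply a different schema to the same raw lines.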
46 Scenarios for Using Hadoop When a user types a query, it isn't practical to exhaustively scan millions of items. Instead, it makes sense to create an index and use it to rank items and find the best matches. Hadoop provides a distributed indexing capability. Hadoop runs on a cluster of commodity, shared-nothing x86 servers. You can add or remove servers in a Hadoop cluster (sizes from 50, 100 to even nodes) at will; the system detects and compensates for hardware or system problems on any server. Hadoop is self-healing and fault tolerant. It can deliver data and run large-scale, high-performance batch processing jobs in spite of system changes or failures.
47 Scenarios for Using Hadoop (cont'd) 1) Hadoop as an ETL and Filtering Platform One of the biggest challenges with high-volume data sources is extracting valuable signal from a lot of noise. Hadoop platforms can read in the raw data, apply appropriate filters and logic, and output a structured summary or refined data set. This output (e.g., hourly index refreshes) can be further analyzed or serve as an input to a more traditional analytic environment like SAS. Typically only a small percentage of a raw data feed is required for any business problem.
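The read, filter, and summarize flow described above can be sketched on a single machine. The log format in `RAW_LINES` and the helpers `extract` and `hourly_error_summary` are hypothetical; on a real cluster the same logic would be expressed as map and reduce steps over files in HDFS.

```python
from collections import Counter

# Hypothetical raw feed: "timestamp level message" lines.
RAW_LINES = [
    "2014-09-03T10:05:11 INFO page_view /home",
    "2014-09-03T10:17:42 ERROR checkout failed",
    "2014-09-03T11:01:03 DEBUG cache miss",
    "2014-09-03T11:30:59 ERROR payment timeout",
    "malformed line",
]


def extract(line):
    """Parse one raw line; return None for records that cannot be parsed."""
    parts = line.split(" ", 2)
    if len(parts) < 3:
        return None
    timestamp, level, message = parts
    # The first 13 characters of the timestamp identify the hour.
    return {"hour": timestamp[:13], "level": level, "message": message}


def hourly_error_summary(lines):
    """Filter the feed down to ERROR records and count them per hour."""
    records = (extract(line) for line in lines)
    errors = (r for r in records if r is not None and r["level"] == "ERROR")
    return Counter(r["hour"] for r in errors)
```

Of the five raw lines only two survive the filter, which mirrors the point that a small fraction of a raw feed is usually what a business question needs.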
48 Scenarios for Using Hadoop (cont'd) 2) Hadoop as an exploration engine Once the data is in the MapReduce cluster, it makes sense to use tools that analyze the data where it sits. Because the refined output is in a Hadoop cluster, new data can be added to the existing pile without reindexing everything; in other words, new data can be added to existing data summaries. Once the data is distilled, it can be loaded into corporate systems so users have wider access to it.
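Adding new data to an existing summary without reprocessing the old data can be sketched with mergeable counts. The summary type (term counts) and the helpers `summarize` and `add_batch` are assumptions for illustration; the slide does not say which kind of summary is meant.

```python
from collections import Counter


def summarize(batch):
    """Build a summary (term counts) for one batch of records."""
    return Counter(word for record in batch for word in record.split())


def add_batch(existing_summary, new_batch):
    """Fold a new batch into an existing summary without reprocessing old data."""
    return existing_summary + summarize(new_batch)


summary = summarize(["hadoop cluster", "hadoop index"])
summary = add_batch(summary, ["index refresh"])
```

This works because counts combine by simple addition. Summaries that cannot be merged this way (for example, exact medians) would need the original data again.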
49 Scenarios for Using Hadoop (cont'd) 3) Hadoop as an Archive Historical data is usually archived to tape or disk in secondary storage or sent offsite. When this data is needed for analysis, it's painful and costly to retrieve it and load it back up. With cheap storage in a distributed cluster, lots of data can be kept active for continuous analysis. Hadoop is efficient: it allows better utilization of hardware by allowing the generation of different index types in one cluster.
50 The Hadoop Stack Cloudera's Distribution for Hadoop (CDH)
54 Hadoop's Case Study:
56 Inverted index Source: https://developer.apple.com/library/mac/documentation/userexperience/conceptual/searchkitconcepts/ searchkit_basics/searchkit_basics.html
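An inverted index maps each term to the documents that contain it, so a query touches only the postings for its terms instead of scanning every document. The sketch below is a minimal in-memory version; the function names, the whitespace tokenizer, and the toy collection `docs` are assumptions, and ranking is left out.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return IDs of documents containing every query term (AND semantics)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical toy collection.
docs = {1: "big data tools", 2: "big cluster", 3: "data cluster tools"}
index = build_inverted_index(docs)
```

With this collection, `search(index, "data tools")` intersects the postings of `data` and `tools` and returns documents 1 and 3.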
57 Source: This slide is from Stanford course: CS 276 / LING 286: Information Retrieval and Web Search,
70 Hadoop: MapReduce
71 MapReduce: Google Distributed Indexing
72 Source: This slide is from Stanford course: CS 276 / LING 286: Information Retrieval and Web Search,
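Index construction fits the MapReduce pattern naturally: mappers emit (term, document ID) pairs, pairs are partitioned by term, and each reducer builds the postings lists for the terms it owns. The sketch below simulates this in one process. All function names are hypothetical, and the character-sum partitioner is a deliberately simple stand-in for a real hash function, not a description of Google's or Hadoop's implementation.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in a document."""
    return [(term, doc_id) for term in sorted(set(text.lower().split()))]


def partition(pairs, num_reducers):
    """Route each pair to a reducer by term so a term's postings stay together."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for term, doc_id in pairs:
        # Sum of character codes is a deterministic stand-in for a hash function.
        reducer = sum(map(ord, term)) % num_reducers
        buckets[reducer][term].append(doc_id)
    return buckets


def reduce_postings(bucket):
    """Reduce: turn the grouped doc IDs for each term into a sorted postings list."""
    return {term: sorted(doc_ids) for term, doc_ids in bucket.items()}


def distributed_index(documents, num_reducers=2):
    pairs = [p for doc_id, text in documents.items() for p in map_document(doc_id, text)]
    partial_indexes = [reduce_postings(b) for b in partition(pairs, num_reducers)]
    # Each reducer owns a disjoint set of terms, so merging is a plain union.
    merged = {}
    for partial in partial_indexes:
        merged.update(partial)
    return merged
```

Partitioning by term is the key design choice: because every occurrence of a term reaches the same reducer, the partial indexes never overlap and need no reconciliation.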
73 Handling Big Data with MongoDB
74 What is NoSQL?
75 RDBMS vs. NoSQL ACID database transaction properties: Atomic: everything in a transaction succeeds or the entire transaction is rolled back. Consistent: a transaction cannot leave the database in an inconsistent state. Isolated: transactions cannot interfere with each other. Durable: completed transactions persist, even when servers restart. BASE: Basically Available, Soft state, Eventual consistency.
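The atomicity property can be demonstrated with Python's built-in `sqlite3` module, used here only as a convenient relational database. The `accounts` table, the `transfer` and `balances` helpers, and the account names are hypothetical. The sketch relies on the documented behavior that a `sqlite3` connection used as a context manager commits on success and rolls back when an exception escapes the block.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balance of every account as a dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()
```

A transfer that would overdraw the source account raises inside the transaction, so both updates are undone and the balances are left exactly as they were.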
76 Why is NoSQL becoming popular?
77 RDBMS vs. NoSQL: an example
78 RDBMS vs. NoSQL: an example
86 Google Trends
89 MongoDB: Key idea
92 MongoDB: Auto-sharding
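The routing side of sharding can be sketched with a toy in-memory collection that places each document on a shard chosen by hashing its shard key. This is an illustration of the general idea only, not MongoDB's implementation: `shard_for` and `ShardedCollection` are hypothetical names, and the chunk splitting and automatic rebalancing that make sharding "auto" are not modeled.

```python
import hashlib


def shard_for(key, num_shards):
    """Pick a shard from a stable hash of the shard key (illustrative only)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedCollection:
    """Toy in-memory collection that spreads documents over several shards."""

    def __init__(self, num_shards, shard_key):
        self.shard_key = shard_key
        self.shards = [dict() for _ in range(num_shards)]

    def insert(self, document):
        key = document[self.shard_key]
        self.shards[shard_for(key, len(self.shards))][key] = document

    def find(self, key):
        # Only the shard that owns the key is consulted.
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

A lookup by shard key goes to exactly one shard, which is why the choice of shard key matters: queries that do not include it would have to ask every shard.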
93 RDBMS vs. MongoDB
98 Example 1. Create a Java Project 2. Get Mongo Java Driver