What s Big Data? Big Data: 3V s. Variety (Complexity) 5/5/2016. Introduction to Big Data, mostly from by Ruoming Jin
|
|
- Jack Greene
- 8 years ago
- Views:
Transcription
1 data every day 5/5/2016 Introduction to Big Data, mostly from by Ruoming Jin What s Big Data? No single definition; here is from Wikipedia: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions. 2 Big Data: 3V s 12+ TBs of tweet data every day 30 billion RFID tags today (1.3B in 2005) 4.6 billion camera phones world wide? TBs of 100s of millions of GPS enabled devices sold annually 25+ TBs of log data every day 76 million smart meters in M by billion people on the Web by end Variety (Complexity) Relational Data (Tables/Transaction/Legacy Data) Text Data (Web) Semi-structured Data (XML) Graph Data Social Network, Semantic Web (RDF), Streaming Data You can only scan the data once A single application can be generating/collecting many types of data Big Public Data (online, weather, finance, etc) Maximilien Brice, CERN CERN s Large Hydron Collider (LHC) generates 15 PB a year To extract knowledge all these types of data need to linked together 6 1
2 Velocity (Speed) Real-time/Fast Data Data is begin generated fast and need to be processed fast Online Data Analytics Late decisions missing opportunities Examples E-Promotions: Based on your current location, your purchase history, what you like send promotions right now for store next to you Social media and networks (all of us are generating data) Scientific instruments (collecting all sorts of data) Mobile devices (tracking all objects all the time) Sensor technology and networks (measuring all kinds of data) Healthcare monitoring: sensors monitoring your activities and body any abnormal measurements require immediate reaction The progress and innovation is no longer hindered by the ability to collect data But, by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion 7 8 Real-Time Analytics/Decision Requirement Harnessing Big Data Product Recommendations that are Relevant & Compelling Influence Behavior Learning why Customers Switch to competitors and their offers; in time to Counter Improving the Marketing Effectiveness of a Promotion while it is still in Play Customer Preventing Fraud as it is Occurring & preventing more proactively Friend Invitations to join a Game or Activity that expands business OLTP: Online Transaction Processing (DBMSs) OLAP: Online Analytical Processing (Data Warehousing) RTAP: Real-Time Analytics Processing (Big Data Architecture & technology) 10 The Model Has Changed THE EVOLUTION OF BUSINESS INTELLIGENCE The Model of Generating/Consuming Data has Changed Old Model: Few companies are generating data, all others are consuming data New Model: all of us are generating data, and all of us are consuming data Speed BI Reporting OLAP & Data warehouse Business Objects, SAS, Informatica, Cognos other SQL Reporting Tools Interactive Business Intelligence & In-memory RDBMS QliqView, Tableau, HANA Scale Scale Big Data: Real Time & Single View Graph Databases Big Data: Batch Processing & Distributed Data Store Hadoop/Spark; HBase/Cassandra Speed 1990 s 2000 s 2010 s 11 2
3 Big Data Analytics Big data is more real-time in nature than traditional DW applications Traditional DW architectures (e.g. Exadata, Teradata) are not wellsuited for big data apps Shared nothing, massively parallel processing, scale out architectures are well-suited for big data apps 13 Big Data Technology Cloud Computing IT resources provided as a service Compute, storage, databases, queues Clouds leverage economies of scale of commodity hardware Cheap storage, high bandwidth networks & multicore processors Geographically distributed data centers Offerings from Microsoft, Amazon, Google, 15 Topic 2: Hadoop/ Programming & Data Processing Architecture of Hadoop, HDFS, and Yarn Programming on Hadoop Basic Data Processing: Sort and Join Information Retrieval using Hadoop Data Mining using Hadoop (Kmeans+Histograms) Machine Learning on Hadoop (EM) Hive/Pig Spark vs. on HDFS HBase and Cassandra References References: Hadoop: The Definitive Guide, Tom White, O Reilly Hadoop In Action, Chuck Lam, Manning Doing Data Science, Rachel Schutt and Cathy O Neil, O Reilly Data-Intensive Text Processing with, Jimmy Lin and Chris Dyer ( Good tutorial presentation & examples at: The definitive original paper:
4 Cloud Resources Hadoop on your local machine Hadoop in a virtual machine on your local machine (Pseudo-Distributed on Ubuntu) Hadoop in the clouds with Amazon EC2 Introduction to /Hadoop From Ruoming Jin s Slides, themselves adapted from Jimmy Lin s slides (at UMD) Limitations of Existing Data Analytics Architecture Key Ideas 2 BI Reports + Interactive Apps RDBMS (aggregated data) ETL Compute Grid Moving Data To Compute Doesn t Scale Storage Only Grid (original raw data) Mostly Append Collection Instrumentation 2011 Cloudera, Inc. All Rights Reserved. Can t Explore Original High Fidelity Raw Data Archiving = Premature Data Death Scale out, not up Limits of SMP and large shared-memory machines Move processing to the data Cluster may have limited bandwidth Process data sequentially, avoid random access Seeks are expensive, disk throughput is reasonable Seamless scalability From the mythical man-month to the tradable machine-hour Slides from Dr. Amr Awadallah s Hadoop talk at Stanford, CTO & VPE from Cloudera The datacenter is the computer! Apache Hadoop Scalable fault-tolerant distributed system for Big Data: Data Storage Data Processing A virtual Big Data machine Borrowed concepts/ideas from Google; Open source under the Apache license Core Hadoop has two main systems: Hadoop/: distributed big data processing infrastructure (abstract/paradigm, fault-tolerant, schedule, execution) HDFS (Hadoop Distributed File System): fault-tolerant, highbandwidth, high availability distributed storage More recently (since 2014): Apache Spark on Hadoop/HDFS and... Apache Spark is now the most active open source project in big data with more than 600 contributors within the past year. Image from 4
5 Example: word counts : Big Data Processing Abstraction Millions of documents in Word counts out: brown, 2 fox, 2 the, 3 In practice, before and related technologies: The first 10 computers are easy; The first 100 computers are hard; The first 1000 computers are impossible; But now with, engineers at Google often use computers! What s wrong with 1000 computers? Some will crash while you re working If probability of crash =.001 Then probability of all up = (1-.001) 1000 = 0.37 expects crashes, tracks partial work, keeps going Typical Large-Data Problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate results Aggregate intermediate results Generate final output Key idea: provide a functional abstraction for these two operations (Dean and Ghemawat, OSDI 2004) Programmers specify two functions: map (k, v) [(k, v )] reduce (k, [v ]) [(k, v )] or simpler All values with the same key (k ) are sent to the same reducer, in k order for each reducer Here [] means a sequence The execution framework handles everything else Spark: has map, reduce as operations, plus others. Hello World : Word Count (String docid, String text): for each word w in text: Emit(w, 1); (String term, Iterator<Int> values): int sum = 0; for each v in values: sum += v; Emit(term, sum); This can be done in or Spark 5
6 Runtime Handles scheduling Assigns workers to map and reduce tasks Handles data distribution Moves processes to data Handles synchronization Gathers, sorts, and shuffles intermediate data Handles errors and faults Detects worker failures and restarts Everything happens on top of a distributed FS (later) This description also valid for Spark, but it uses memory more, so can run faster in many cases. Programmers specify two functions: map (k, v) [(k, v )] reduce (k, [v ]) [(k, v )] All values with the same key are reduced together The execution framework handles everything else Not quite usually, programmers also specify: partition (k, number of partitions) partition for k Often a simple hash of the key, e.g., hash(k ) mod n Divides up key space for parallel reduce operations and eventual delivery of results to certain partitions combine (k, [v ]) [(k, v )] Mini-reducers that run in memory after the map phase Used as an optimization to reduce network traffic k 1 v 1 k 2 v 2 k 3 v 3 k 4 v 4 k 5 v 5 k 6 v 6 map map map combine combine combine map b a 1 2 c c 3 6 a c 5 2 b c 7 8 combine b a 1 2 c 9 a c 5 2 b c 7 8 partition partition partition partition Shuffle and Sort: aggregate values by keys a 1 5 b 2 7 c reduce reduce reduce the quick brown fox the fox ate the mouse how now brown cow Word Count Execution Input Shuffle & Sort Output brown: 1,1 fox: 1,1 how:1 now:1 the:1,1,1 ate: 1 cow: 1 mouse: 1 quick: 1 brown, 2 fox, 2 the, 3 r 1 s 1 r 2 s 2 r 3 s 3 Implementations Google has a proprietary implementation in C++ Bindings in Java, Python Hadoop is an open-source implementation in Java Development led by Yahoo, used in production Now an Apache project Rapidly expanding software ecosystem Lots of custom research implementations For GPUs, cell processors, etc. Hadoop History Dec 2004 Google GFS paper published July 2005 Nutch uses Apr 2007 Yahoo! on 1000-node cluster Jan 2008 An Apache Top Level Project Jul 2008 A 4000 node test cluster Sept 2008 Hive becomes a Hadoop subproject Feb 2009 The Yahoo! Search Webmap is a Hadoop application that runs on more than 10,000 core Linux cluster and produces data that is now used in every Yahoo! Web search query. June 2009 On June 10, 2009, Yahoo! made available the source code to the version of Hadoop it runs in production. In 2010 Facebook claimed that they have the largest Hadoop cluster in the world with 21 PB of storage. On July 27, 2011 they announced the data has grown to 30 PB. Feb 2014: Apache Spark starts as Top-Level Project. Originally developed (2011+) at the University of California, Berkeley's AMPLab, the Spark codebase was at this point donated to Apache (open source). 6
7 Amazon/A9 Facebook Google IBM Joost Last.fm New York Times PowerSet Veoh Yahoo! Who uses Hadoop? Example Word Count () public static class Tokenizerper extends per<object, Text, Text, IntWritable>{ private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizeritr = new StringTokenizer(value.toString()); while (itr.hasmoretokens()) { word.set(itr.nexttoken()); context.write(word,one); Example Word Count () public static class IntSumr extends r<text,intwritable,text,intwritable> { private IntWritable result = new IntWritable(); public void reduce(text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); result.set(sum); context.write(key, result); Example Word Count (Driver) public static void main(string[] args) throws Exception { Configuration conf = new Configuration(); String[] otherargs = new GenericOptionsParser(conf, args).getremainingargs(); if (otherargs.length!= 2) { System.err.println("Usage: wordcount <in> <out>"); System.exit(2); Job job = new Job(conf, "word count"); job.setjarbyclass(wordcount.class); job.setperclass(tokenizerper.class); job.setcombinerclass(intsumr.class); job.setrclass(intsumr.class); job.setoutputkeyclass(text.class); job.setoutputvalueclass(intwritable.class); FileInputFormat.addInputPath(job, new Path(otherArgs[0])); FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); System.exit(job.waitForCompletion(true)? 0 : 1); the quick brown fox the fox ate the mouse how now brown cow Word Count Execution Input Shuffle & Sort Output brown: 1,1 fox: 1,1 how:1 now:1 the:1,1,1 ate: 1 cow: 1 mouse: 1 quick: 1 brown, 2 fox, 2 the, 3 An Optimization: The Combiner A combiner is a local aggregation function for repeated keys produced by same map For associative ops. like sum, count, max Decreases size of intermediate data Example: local counting for Word Count: def combiner(key, values): output(key, sum(values)) 7
8 Word Count with Combiner Input & Combine Shuffle & Sort Output User Program (1) submit the quick brown fox the fox ate the mouse the, 2 brown, 2 fox, 2 the, 3 split 0 split 1 split 2 split 3 split 4 (3) read worker worker Master (2) schedule map (2) schedule reduce (5) remote read worker (4) local write worker (6) write output file 0 output file 1 how now brown cow Input files worker phase Intermediate files (on local disk) phase Output files Adapted from (Dean and Ghemawat, OSDI 2004) Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes in the cluster Start up the workers on the node that has the data local Why? Not enough RAM to hold all the data in memory Disk access is slow, but disk throughput is reasonable (i.e. sequential reading of disk for stream processing) A distributed file system is the answer GFS (Google File System) for Google s HDFS (Hadoop Distributed File System) for Hadoop Another example of Clickstream-like data: for each ad viewing, user info and whether they clicked on the ad: {userid, ip, zip, adnum, clicked Want unique users who saw, clicked, by zip First Try First try key as zip: can emit {90210, {0,1 if saw and failed to click, {90210, {1,1 if saw and clicked receives, say: {90210, [{1,1, {0,1, {0,1] This shows three visits, one click, but we don t know if these visits were by different users, so we don t know the number of unique users Second try We need to preserve user identity longer Use {zip, userid as key Value: again {0,1 or {1,1 if saw and clicked emits {{90210,user123, {0,1, etc. r gets info on one user, one zip: {{90210,user123, [{0,1, {1,1] r can process list, emit {90210,user123, {1,2 But not done yet 8
9 Second pass r (pass 1) emits {90210,user123, {1,2 Second reads this, emits its contribution to zip s stats (one user saw and clicked): {90210, {1, 1 Second receives 2210 user reports for this zip: {90210, {{1,1, {0,1, {0,1, And counts up unique users and their clicks: emits {90210, {56, 2210 for 2210 unique users in zip viewed ads, 56 of them clicked. Compare to SQL Table T of {userid, ip, zip, adnum, clicked Using a trick, we can do this in one select: select zip, count (distinct userid), count (distinct clicked*userid) from T group by zip Assumes clicked = 0 or 1 in T row Note that DB2, Oracle, and mysql can do count (distinct expr), though entry SQL92 only requires count(distinct column) Compare to SQL Table T of {userid, ip, zip, adnum, clicked Closer to processing select zip, userid, count (clicked) cc from T group by zip, userid Put results into table T1 (zip, userid, cc) select zip, count(*), sum(sign(cc)) from T1 group by zip Scalar function sign(x) = -1, 0, +1 is available on Oracle, DB2, mysql, but not in Entry SQL92 Do it in SQL92? CASE is the conditional value capability in SQL92, but is not required for Entry SQL92 (it is supported by all respectable DBs) Sign(x) as case: case when x < 0 then -1 when x > 0 then 1 else 0 End Something better? We see that using means telling the system in detail how to solve the problem SQL just states the problem, lets the QP figure out how to do it Next time: Hive, the SQL-like query language built on top of Spark also has a SQL-like language 9
Big Data Explained. An introduction to Big Data Science.
Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of
More informationWhat is cloud computing?
Introduction to Clouds and MapReduce Jimmy Lin University of Maryland What is cloud computing? With mods by Alan Sussman This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike
More informationCS54100: Database Systems
CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics
More informationHadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
More informationXiaoming Gao Hui Li Thilina Gunarathne
Xiaoming Gao Hui Li Thilina Gunarathne Outline HBase and Bigtable Storage HBase Use Cases HBase vs RDBMS Hands-on: Load CSV file to Hbase table with MapReduce Motivation Lots of Semi structured data Horizontal
More informationApplications for Big Data Analytics
Smarter Healthcare Applications for Big Data Analytics Multi-channel sales Finance Log Analysis Homeland Security Traffic Control Telecom Search Quality Manufacturing Trading Analytics Fraud and Risk Retail:
More informationBIG DATA APPLICATIONS
BIG DATA ANALYTICS USING HADOOP AND SPARK ON HATHI Boyu Zhang Research Computing ITaP BIG DATA APPLICATIONS Big data has become one of the most important aspects in scientific computing and business analytics
More informationHadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab
IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态
More informationMrs: MapReduce for Scientific Computing in Python
Mrs: for Scientific Computing in Python Andrew McNabb, Jeff Lund, and Kevin Seppi Brigham Young University November 16, 2012 Large scale problems require parallel processing Communication in parallel processing
More informationHadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers
More informationThe Hadoop Eco System Shanghai Data Science Meetup
The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationProcessing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems
Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationHadoop and its Usage at Facebook. Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009
Hadoop and its Usage at Facebook Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed on Hadoop Distributed File System Facebook
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationEnterprise Data Storage and Analysis on Tim Barr
Enterprise Data Storage and Analysis on Tim Barr January 15, 2015 Agenda Challenges in Big Data Analytics Why many Hadoop deployments under deliver What is Apache Spark Spark Core, SQL, Streaming, MLlib,
More informationOpen source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
More informationBig Data Analytics* Outline. Issues. Big Data
Outline Big Data Analytics* Big Data Data Analytics: Challenges and Issues Misconceptions Big Data Infrastructure Scalable Distributed Computing: Hadoop Programming in Hadoop: MapReduce Paradigm Example
More informationHadoop: Distributed Data Processing. Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010
Hadoop: Distributed Data Processing Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010 Outline Scaling for Large Data Processing What is Hadoop? HDFS and MapReduce
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationUsing Big Data to Explore New Opportunities. Fandhy Haristha Siregar, M.Kom, CIA, CRMA, CISA, CISM, CISSP, CEH, CEP-PM, QIA, COBIT5
Using Big Data to Explore New Opportunities Fandhy Haristha Siregar, M.Kom, CIA, CRMA, CISA, CISM, CISSP, CEH, CEP-PM, QIA, COBIT5 Introduction to Big Data The Myth About Big Data Source: Big Data: New
More informationInternals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
More informationLambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014
Lambda Architecture CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014 1 Goals Cover the material in Chapter 8 of the Concurrency Textbook The Lambda Architecture Batch Layer MapReduce
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
More informationEXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics
BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents
More informationArchitectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
More informationBig Data 2012 Hadoop Tutorial
Big Data 2012 Hadoop Tutorial Oct 19th, 2012 Martin Kaufmann Systems Group, ETH Zürich 1 Contact Exercise Session Friday 14.15 to 15.00 CHN D 46 Your Assistant Martin Kaufmann Office: CAB E 77.2 E-Mail:
More informationMAPREDUCE Programming Model
CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce
More informationHadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013
Hadoop Scalable Distributed Computing Claire Jaja, Julian Chan October 8, 2013 What is Hadoop? A general-purpose storage and data-analysis platform Open source Apache software, implemented in Java Enables
More informationBig Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect
Big Data & QlikView Democratizing Big Data Analytics David Freriks Principal Solution Architect TDWI Vancouver Agenda What really is Big Data? How do we separate hype from reality? How does that relate
More informationBig Data for the JVM developer. Costin Leau, Elasticsearch @costinl
Big Data for the JVM developer Costin Leau, Elasticsearch @costinl Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system Data Landscape Data Trends http://www.emc.com/leadership/programs/digital-universe.htm
More informationHadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed
More informationA Tour of the Zoo the Hadoop Ecosystem Prafulla Wani
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationFrom Wikipedia, the free encyclopedia
Page 1 sur 5 Hadoop From Wikipedia, the free encyclopedia Apache Hadoop is a free Java software framework that supports data intensive distributed applications. [1] It enables applications to work with
More informationApache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
More informationMapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues
More informationBIG DATA TECHNOLOGY. Hadoop Ecosystem
BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big
More informationBig Data Workshop. dattamsha.com
Big Data Workshop About Praveen Has more than15 years of experience working on various technologies. Is a Cloudera Certified Developer for Apache Hadoop CDH4 (CCD-410) with 95% score and got through the
More informationHadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?
Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time? Kai Wähner kwaehner@tibco.com @KaiWaehner www.kai-waehner.de Disclaimer! These opinions are my own and do not necessarily
More informationOpen source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
More informationApplication Development. A Paradigm Shift
Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the
More informationMap Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
More informationCIS492 Special Topics: Cloud Computing د. منذر الطزاونة
CIS492 Special Topics: Cloud Computing د. منذر الطزاونة Big Data Definition No single standard definition Big Data is data whose scale, diversity, and complexity require new architecture, techniques, algorithms,
More informationApache Hadoop FileSystem Internals
Apache Hadoop FileSystem Internals Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Storage Developer Conference, San Jose September 22, 2010 http://www.facebook.com/hadoopfs
More informationReal Time Big Data Processing
Real Time Big Data Processing Cloud Expo 2014 Ian Meyers Amazon Web Services Global Infrastructure Deployment & Administration App Services Analytics Compute Storage Database Networking AWS Global Infrastructure
More informationAGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW
AGENDA What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story Hadoop PDW Our BIG DATA Roadmap BIG DATA? Volume 59% growth in annual WW information 1.2M Zetabytes (10 21 bytes) this
More informationParallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
More informationIntegrating Big Data into the Computing Curricula
Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationRelational Processing on MapReduce
Relational Processing on MapReduce Jerome Simeon IBM Watson Research Content obtained from many sources, notably: Jimmy Lin course on MapReduce. Our Plan Today 1. Recap: Key relational DBMS notes Key Hadoop
More informationINTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
More informationBIG DATA, MAPREDUCE & HADOOP
BIG, MAPREDUCE & HADOOP LARGE SCALE DISTRIBUTED SYSTEMS By Jean-Pierre Lozi A tutorial for the LSDS class LARGE SCALE DISTRIBUTED SYSTEMS BIG, MAPREDUCE & HADOOP 1 OBJECTIVES OF THIS LAB SESSION The LSDS
More informationOutline. What is Big data and where they come from? How we deal with Big data?
What is Big Data Outline What is Big data and where they come from? How we deal with Big data? Big Data Everywhere! As a human, we generate a lot of data during our everyday activity. When you buy something,
More informationData-Intensive Programming. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti
More informationJeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More informationBIG DATA TRENDS AND TECHNOLOGIES
BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.
More informationApache Hadoop: Past, Present, and Future
The 4 th China Cloud Computing Conference May 25 th, 2012. Apache Hadoop: Past, Present, and Future Dr. Amr Awadallah Founder, Chief Technical Officer aaa@cloudera.com, twitter: @awadallah Hadoop Past
More informationThe Future of Data Management
The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class
More informationOracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
More informationIntroduc)on to Map- Reduce. Vincent Leroy
Introduc)on to Map- Reduce Vincent Leroy Sources Apache Hadoop Yahoo! Developer Network Hortonworks Cloudera Prac)cal Problem Solving with Hadoop and Pig Slides will be available at hgp://lig- membres.imag.fr/leroyv/
More informationTransforming the Telecoms Business using Big Data and Analytics
Transforming the Telecoms Business using Big Data and Analytics Event: ICT Forum for HR Professionals Venue: Meikles Hotel, Harare, Zimbabwe Date: 19 th 21 st August 2015 AFRALTI 1 Objectives Describe
More informationTutorial- Counting Words in File(s) using MapReduce
Tutorial- Counting Words in File(s) using MapReduce 1 Overview This document serves as a tutorial to setup and run a simple application in Hadoop MapReduce framework. A job in Hadoop MapReduce usually
More informationAssociate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
More informationTap into Hadoop and Other No SQL Sources
Tap into Hadoop and Other No SQL Sources Presented by: Trishla Maru What is Big Data really? The Three Vs of Big Data According to Gartner Volume Volume Orders of magnitude bigger than conventional data
More information<Insert Picture Here> Oracle and/or Hadoop And what you need to know
Oracle and/or Hadoop And what you need to know Jean-Pierre Dijcks Data Warehouse Product Management Agenda Business Context An overview of Hadoop and/or MapReduce Choices, choices,
More informationBig Data Big Data/Data Analytics & Software Development
Big Data Big Data/Data Analytics & Software Development Danairat T. danairat@gmail.com, 081-559-1446 1 Agenda Big Data Overview Business Cases and Benefits Hadoop Technology Architecture Big Data Development
More informationIntroduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationIntroduction to Big Data Training
Introduction to Big Data Training The quickest way to be introduce with NOSQL/BIG DATA offerings Learn and experience Big Data Solutions including Hadoop HDFS, Map Reduce, NoSQL DBs: Document Based DB
More informationHadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Hadoop for MySQL DBAs + 1 About me Sarah Sproehnle, Director of Educational Services @ Cloudera Spent 5 years at MySQL At Cloudera for the past 2 years sarah@cloudera.com 2 What is Hadoop? An open-source
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationChapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
More informationBig Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012
Big Data Buzzwords From A to Z By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012 Big Data Buzzwords Big data is one of the, well, biggest trends in IT today, and it has spawned a whole new generation
More informationBig Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel
Big Data and Analytics: Getting Started with ArcGIS Mike Park Erik Hoel Agenda Overview of big data Distributed computation User experience Data management Big data What is it? Big Data is a loosely defined
More informationHadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart
Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated
More informationHadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
More informationSQLSaturday #399 Sacramento 25 July, 2015. Big Data Analytics with Excel
SQLSaturday #399 Sacramento 25 July, 2015 Big Data Analytics with Excel Presenter Introduction Peter Myers Independent BI Expert Bitwise Solutions BBus, SQL Server MCSE, SQL Server MVP since 2007 Experienced
More informationReal Time Data Processing using Spark Streaming
Real Time Data Processing using Spark Streaming Hari Shreedharan, Software Engineer @ Cloudera Committer/PMC Member, Apache Flume Committer, Apache Sqoop Contributor, Apache Spark Author, Using Flume (O
More informationConnecting Hadoop with Oracle Database
Connecting Hadoop with Oracle Database Sharon Stephen Senior Curriculum Developer Server Technologies Curriculum The following is intended to outline our general product direction.
More informationBiomap Jobs and the Big Picture
Lots of Data, Little Money. A Last.fm perspective Martin Dittus, martind@last.fm London Geek Nights, 2009-04-23 Big Data Little Money You have lots of data You want to process it For your product (Last.fm:
More informationHow To Use Big Data For Telco (For A Telco)
ON-LINE VIDEO ANALYTICS EMBRACING BIG DATA David Vanderfeesten, Bell Labs Belgium ANNO 2012 YOUR DATA IS MONEY BIG MONEY! Your click stream, your activity stream, your electricity consumption, your call
More informationBig Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014
Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014 Defining Big Not Just Massive Data Big data refers to data sets whose size is beyond the ability of typical database software tools
More informationNoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
More informationBig Data Technologies Compared June 2014
Big Data Technologies Compared June 2014 Agenda What is Big Data Big Data Technology Comparison Summary Other Big Data Technologies Questions 2 What is Big Data by Example The SKA Telescope is a new development
More informationImpact of Big Data in Oil & Gas Industry. Pranaya Sangvai Reliance Industries Limited 04 Feb 15, DEJ, Mumbai, India.
Impact of Big Data in Oil & Gas Industry Pranaya Sangvai Reliance Industries Limited 04 Feb 15, DEJ, Mumbai, India. New Age Information 2.92 billions Internet Users in 2014 Twitter processes 7 terabytes
More informationBig Data Analytics Hadoop and Spark
Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software
More informationCan the Elephants Handle the NoSQL Onslaught?
Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented
More informationJournal of Environmental Science, Computer Science and Engineering & Technology
JECET; March 2015-May 2015; Sec. B; Vol.4.No.2, 202-209. E-ISSN: 2278 179X Journal of Environmental Science, Computer Science and Engineering & Technology An International Peer Review E-3 Journal of Sciences
More informationHDInsight Essentials. Rajesh Nadipalli. Chapter No. 1 "Hadoop and HDInsight in a Heartbeat"
HDInsight Essentials Rajesh Nadipalli Chapter No. 1 "Hadoop and HDInsight in a Heartbeat" In this package, you will find: A Biography of the author of the book A preview chapter from the book, Chapter
More informationHadoop WordCount Explained! IT332 Distributed Systems
Hadoop WordCount Explained! IT332 Distributed Systems Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize,
More informationBig Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum
Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All
More informationBig Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.
Big Data Technology ดร.ช ชาต หฤไชยะศ กด Choochart Haruechaiyasak, Ph.D. Speech and Audio Technology Laboratory (SPT) National Electronics and Computer Technology Center (NECTEC) National Science and Technology
More information