Massive Data Analysis: Course Overview


1 Massive Data Analysis: Course Overview. Juliana Freire. Content obtained from many sources, including: Agrawal et al., VLDB 2010 tutorial; Shim, VLDB 2012 tutorial; Jeff Ullman's lecture notes; G. Weikum.

2 Course Staff and Information. Instructors: Juliana Freire and Jerome Simeon. In our wiki you will find the tentative schedule, news and announcements, the reading list, and assignments. Check it often!

3 What we will cover. Infrastructure: architecture, computing models (e.g., MapReduce), storage solutions (e.g., Bigtable, MongoDB), query/processing languages. Algorithms and analysis: statistics, data mining techniques. Tentative schedule: see the course wiki. Readings from scientific papers and textbooks (they are free to download): Mining of Massive Data Sets (version 1.1), by Anand Rajaraman, Jure Leskovec and Jeff Ullman; Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer.

4 Prerequisites. A course in database systems, covering application programming in SQL and other database-related languages such as XQuery. A course on algorithms and data structures. Good programming skills.

5 What you will do. Programming assignments (50%), done individually: you will need a computer, and we will provide you access to Amazon AWS (more details later). Quizzes (15%): you will use Gradiance; register using token 00B06796. Final exam (35%).

6 Motivation

7 Big Data: What is the Big deal?

8 Big Data: What is the Big deal? Many success stories. Google: many billions of pages indexed, products, structured data. Facebook: 1.1 billion users using the site each month. Twitter: 517 million accounts, 250 million tweets/day. "Google grew from processing 100 TB of data a day with MapReduce in 2004 [45] to processing 20 PB a day with MapReduce in 2008 [46]. In April 2009, a blog post was written about eBay's two enormous data warehouses: one with 2 petabytes of user data, and the other with 6.5 petabytes of user data spanning 170 trillion records and growing by 150 billion new records per day. Shortly thereafter, Facebook revealed similarly impressive numbers, boasting of 2.5 petabytes of user data, growing at about 15 terabytes per day." (Lin and Dyer, 2010) This is changing society!

9 The McKinsey Report on Big Data. Data have swept into every industry and business function and are now an important factor of production, alongside labor and capital. We estimate that, by 2009, nearly all sectors in the US economy had at least an average of 200 terabytes of stored data (twice the size of US retailer Wal-Mart's data warehouse in 1999) per company with more than 1,000 employees. The use of big data will become a key basis of competition and growth for individual firms. There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

10 Big Data: New Opportunities. Enable scientific breakthroughs. Petabytes of data generated each day, e.g., Australian radio telescopes, Large Hadron Collider, Sloan Sky Survey, genomes, climate data, ... Social data, e.g., Facebook, Twitter. 3,180,000 and 3,410,000 results in Google Scholar!

11 Big Data: New Opportunities. Smart Cities: 50% of the world population lives in cities. Census, crime, emergency visits, taxis, public transportation, real estate, noise, energy, ... Cities are making their data available: https://nycopendata.socrata.com/ Make cities more efficient and sustainable, and improve the lives of their citizens.

12 NYC Inspections. New York City gets 25,000 illegal-conversion complaints a year, but it has only 200 inspectors to handle them. Flowers' group integrated information from 19 different agencies that provided indication of issues in buildings. Result: hit rate for inspections went from 13% to 70%.

13 Big Data: New Opportunities. NYU CUSP aims to use New York City as its laboratory and classroom to help cities around the world become more productive, livable, equitable, and resilient. CUSP observes, analyzes, and models cities to optimize outcomes, prototype new solutions, formalize new tools and processes, and develop new expertise/experts. http://cusp.nyu.edu/

14 Big Data: New Opportunities. Data is currency: companies are profiting from knowledge extracted from Big Data. Better understand customers, targeted advertising, ...

15 Big Data: New Opportunities. http://blogs.wsj.com/venturecapital/tag/big-data/

16 What is Massive/Big Data? The three V's of big data: Volume, Variety, and Velocity. Too big: petabyte-scale collections, or lots of (not necessarily big) data sets. Too hard: does not fit neatly in an existing tool; data sets that need to be cleaned, processed and integrated, e.g., Twitter, news, customer transactions. Too fast: needs to be processed quickly.

17 Big Data: What is the Big deal? Big data is not new: financial transactions, call detail records, astronomy, ... What is new: many more data enthusiasts; more data are widely available, e.g., Web, data.gov, scientific data, social and urban data; computing is cheap and easy to access (server with 64 cores and 512 GB RAM ~$11k; cluster with 1000 cores ~$150k; pay as you go: Amazon EC2).

18 Big Data: More than Volume. Volume = Length × Width × Depth. Big Data Length: Collect & Compare. Big Data Width: Discover & Integrate. Big Data Depth: Analyze & Understand. Slide by Gerhard Weikum.

19 Big Urban Data: NYC Taxis

20 Collect, Clean, and Compare. Compare different cities. [Figure panels: Beijing, NYC]

21 Collect, Clean, and Compare. Compare effects over time: a bigger picture of city life. [Figure panels: 7-8am, 8-9am, 9-10am, 10-11am]

22 Discover and Integrate Compare with other data sources, e.g., NYC Citi bikes Was there a traffic problem? An important event? Discover information in news, blogs, etc.

23 Discover and Integrate Compare with other data sources, e.g., NYC Citi bikes Discover information in news, blogs, etc.

24 Analyze and Understand The Sandy Effect

25 Analyze and Understand. Studying traffic patterns to and from the airports.

26 Taxis in NYC: Rides per Hour

27 Big Data: What is hard? Scalability for computations? Not! There is lots of work on distributed systems and parallel databases, and elasticity lets you simply add more nodes. But there is no one-size-fits-all solution: often, you have to build your own. Rapidly evolving technology. Many different tools. Different computation model: need new algorithms.

28 Big Data: What is hard? Scalability for people: data exploration is hard, regardless of whether data are big or small. [Figure labels: algorithms, provenance, data integration, machine learning, visual encodings, statistics, interaction modes, data curation, math, data management, data, knowledge]

29 (Big) Data Analysis Pipeline

30 Big Data: Challenges

31 Big Data: Challenges Apple: Fruit or company?

32 Big Data: Opportunities and Challenges

33 Big Data: Challenges. Taxi data: > 0.5 billion trips. Can't load into Excel, and even commercial databases are too slow. Our solution to support interactive queries: a new spatio-temporal index, and a new index that leverages the GPU (work in progress).

34 Big Data: New Technologies. Infrastructure: new computing paradigms (Cloud, Hadoop MapReduce); new storage solutions (NoSQL, column stores, Bigtable); new languages (JAQL, Pig Latin). We will study these and how they relate to previous technologies. Analysis and Mining: new infrastructure demands new approaches to explore data. We will study algorithms to process and analyze data in Big-Data environments.

35 Infrastructure

36 What is Cloud Computing? Old idea: Software as a Service (SaaS), i.e., delivering applications over the Internet. Recently: [Hardware, Infrastructure, Platform] as a service. Utility Computing: pay-as-you-go computing; illusion of infinite resources; no up-front cost; fine-grained billing (e.g., hourly). Agrawal et al., VLDB 2010 Tutorial.

37 Cloud Computing: Why Now? Experience with very large data centers: unprecedented economies of scale; transfer of risk. Technology factors: pervasive broadband Internet; maturity in virtualization technology. Business factors: minimal capital expenditure; pay-as-you-go billing model. Agrawal et al., VLDB 2010 Tutorial.

38 Warehouse Scale Computing. Google's data center in Oregon. 16 Million Nodes per building. Agrawal et al., VLDB 2010 Tutorial.

39 Economics of Cloud Users. Pay by use instead of provisioning for peak. [Figure: resources vs. time, showing capacity and demand for a static data center (with unused resources) and for a data center in the cloud.] Agrawal et al., VLDB 2010 Tutorial. Slide credits: Berkeley RAD Lab.

40 Economics of Cloud Users. Risk of over-provisioning: underutilization. [Figure: resources vs. time for a static data center, showing capacity, demand, and unused resources.] Agrawal et al., VLDB 2010 Tutorial. Slide credits: Berkeley RAD Lab.

41 Economics of Cloud Users. Heavy penalty for under-provisioning. [Figure: three panels of resources vs. time (days), showing capacity and demand, with lost revenue and lost users.] Agrawal et al., VLDB 2010 Tutorial. Slide credits: Berkeley RAD Lab.

42 Just hype? "Cloud Computing? What are you talking about? Cloud Computing is nothing but a computer attached to a network." Larry Ellison, excerpts from an interview. Agrawal et al., VLDB 2010 Tutorial.

43 Cloud Computing: Hype or Reality. Unlike the earlier attempts (Distributed Computing, Distributed Databases, Grid Computing), Cloud Computing is REAL: organic growth (Google, Yahoo, Microsoft, and Amazon); poised to be an integral aspect of National Infrastructure in the US and elsewhere. Agrawal et al., VLDB 2010 Tutorial.

44 Cloud Computing Modalities. "Can we outsource our IT software and hardware infrastructure?" Hosted applications and services; pay-as-you-go model; scalability, fault-tolerance, elasticity, and self-manageability. "We have terabytes of click-stream data: what can we do with it?" Very large data repositories; complex analysis; distributed and parallel data processing. Agrawal et al., VLDB 2010 Tutorial.

45 Why Data Analysis? What is the most effective distribution channel? Who are our lowest/highest margin customers? Who are my customers and what products are they buying? What product promotions have the biggest impact on revenue? What impact will new products/services have on revenue and margins? Which customers are most likely to go to the competition? Agrawal et al., VLDB 2010 Tutorial.

46 Why Data Analysis? What would the impacts be of a fare change? Where are our lowest/highest margin passengers? What is the distribution of trip lengths? What is the quickest route from midtown to downtown at 4pm on Monday? What impact will the introduction of additional medallions have? Where should drivers go to get passengers?

47 Decision Support. Used to manage and control business. Data is historical or point-in-time. Optimized for inquiry rather than update. Use of the system is loosely defined and can be ad hoc. Used by managers and end-users to understand the business and make judgements. Agrawal et al., VLDB 2010 Tutorial.

48 Decision Support. Data analysis in the enterprise context emerged as a tool to build decision support systems: data-centric decision making instead of using intuition. New term: Business Intelligence. Traditional approach: decision makers wait for reports from disparate OLTP systems and put it all together in a spreadsheet, a manual process. Agrawal et al., VLDB 2010 Tutorial.

49 Data Analytics in the Web Context. Data capture at the user interaction level, in contrast to the client transaction level in the Enterprise context. As a consequence, the amount of data increases significantly. Need to analyze such data to understand user behaviors. Agrawal et al., VLDB 2010 Tutorial.

50 Data Analytics outside Big Corporations. Even data capture at client transaction level leads to a lot of data! Need to analyze such data to understand behavior. Cannot afford expensive warehouse solutions.

51 Data Analytics in the Cloud. Scalability to large data volumes: scan 100 TB on 1 node @ 50 MB/sec = 23 days; scan 100 TB on a 1000-node cluster = 33 minutes ⇒ divide-and-conquer (i.e., data partitioning). Cost-efficiency: commodity nodes (cheap, but unreliable); commodity network; automatic fault-tolerance (fewer admins); easy to use (fewer programmers). Agrawal et al., VLDB 2010 Tutorial.
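
The scan-time figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), a per-node read rate of 50 MB/sec, and perfectly even partitioning with no coordination overhead; the function name `scan_seconds` is made up for illustration.

```python
TB = 10**12  # decimal terabyte in bytes (assumed convention)
MB = 10**6   # decimal megabyte in bytes (assumed convention)


def scan_seconds(total_bytes, node_bytes_per_sec, nodes=1):
    """Time to scan data split evenly over `nodes` machines reading in parallel."""
    return total_bytes / (node_bytes_per_sec * nodes)


# One node reading 100 TB at 50 MB/sec.
single_node_days = scan_seconds(100 * TB, 50 * MB) / 86_400

# The same data partitioned over 1000 nodes, each reading at 50 MB/sec.
cluster_minutes = scan_seconds(100 * TB, 50 * MB, nodes=1000) / 60

print(f"1 node:     {single_node_days:.1f} days")
print(f"1000 nodes: {cluster_minutes:.1f} minutes")
```

Under these idealized assumptions the speedup is linear in the number of nodes, which is the motivation for data partitioning.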

52 Platforms for Large-scale Data Analysis. Parallel DBMS technologies: proposed in the late eighties; matured over the last two decades; multi-billion dollar industry (proprietary DBMS engines intended as data warehousing solutions for very large enterprises). MapReduce: pioneered by Google; popularized by Yahoo! (Hadoop). Agrawal et al., VLDB 2010 Tutorial.

53 Parallel DBMS technologies. Popularly used for more than two decades. Research projects: Gamma, Grace, ... Commercial: multi-billion dollar industry, but access to only a privileged few. Relational data model. Indexing. Familiar SQL interface. Advanced query optimization. Well understood and studied. Very reliable! Agrawal et al., VLDB 2010 Tutorial.

54 MapReduce. Overview: a data-parallel programming model, and an associated parallel and distributed implementation for commodity clusters. Pioneered by Google: processes 20 PB of data per day (circa 2008). Popularized by the open-source Hadoop project: used by Yahoo!, Facebook, Amazon, and the list is growing. [Dean et al., OSDI 2004, CACM Jan 2008, CACM Jan 2010] Agrawal et al., VLDB 2010 Tutorial.

55 Hadoop. Open-source implementation of the MapReduce framework, an Apache project. Hadoop Distributed File System (HDFS): stores big files across machines; stores each file as a sequence of blocks; blocks of a file are replicated for fault tolerance. Distributes processing of large data across thousands of commodity machines. Key components: MapReduce distributes applications; HDFS distributes data. A single Namenode (master) and multiple Datanodes (slaves). Namenode: manages the file system and access to files by clients. Datanode: manages the storage attached to the node it runs on. Kyuseok Shim (VLDB 2012 Tutorial).

56 MapReduce Programming Model. Borrows from functional programming. Users should implement two primary methods: Map: (key1, val1) → [(key2, val2)]; Reduce: (key2, [val2]) → [(key3, val3)]. Kyuseok Shim (VLDB 2012 Tutorial).

57 Word Counting with MapReduce. Documents: Doc1: Financial, IMF, Economics, Crisis; Doc2: Financial, IMF, Crisis; Doc3: Economics, Harry; Doc4: Financial, Harry, Potter, Film; Doc5: Crisis, Harry, Potter. Map task M1 (Doc1, Doc2) emits the key/value pairs (Financial, 1), (IMF, 1), (Economics, 1), (Crisis, 1), (Financial, 1), (IMF, 1), (Crisis, 1). Map task M2 (Doc3, Doc4, Doc5) emits (Economics, 1), (Harry, 1), (Financial, 1), (Harry, 1), (Potter, 1), (Film, 1), (Crisis, 1), (Harry, 1), (Potter, 1). Kyuseok Shim (VLDB 2012 Tutorial).

58 Word Counting with MapReduce. Before reduce functions are called, for each distinct key, the list of its values is generated: Financial: [1, 1, 1]; IMF: [1, 1]; Economics: [1, 1]; Crisis: [1, 1, 1]; Harry: [1, 1, 1]; Film: [1]; Potter: [1, 1]. Reduce is then called once per key (Financial, IMF, Economics, Crisis, Harry, Film, Potter) with its value list. Kyuseok Shim (VLDB 2012 Tutorial).
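
The word-counting example can be simulated on a single machine. The following is a minimal in-memory sketch of the map, group-by-key, and reduce steps, not Hadoop code: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are invented for illustration, and a real framework would run the map and reduce calls in parallel on different machines.

```python
from collections import defaultdict

documents = {
    "Doc1": ["Financial", "IMF", "Economics", "Crisis"],
    "Doc2": ["Financial", "IMF", "Crisis"],
    "Doc3": ["Economics", "Harry"],
    "Doc4": ["Financial", "Harry", "Potter", "Film"],
    "Doc5": ["Crisis", "Harry", "Potter"],
}


def map_fn(doc_id, words):
    """Map: (doc_id, words) -> [(word, 1)]."""
    for word in words:
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: (word, [1, 1, ...]) -> [(word, total)]."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    # Shuffle step: collect the list of values for each distinct key.
    groups = defaultdict(list)
    for key, value in records.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce step: one call per distinct key.
    results = {}
    for out_key, values in groups.items():
        for final_key, final_value in reducer(out_key, values):
            results[final_key] = final_value
    return results


word_counts = run_mapreduce(documents, map_fn, reduce_fn)
print(word_counts)
```

The user writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` stands in for what the run-time system does.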

59 MapReduce Advantages. Automatic parallelization: depending on the size of the raw input data ⇒ instantiate multiple MAP tasks; similarly, depending upon the number of intermediate <key, value> partitions ⇒ instantiate multiple REDUCE tasks. Run-time: data partitioning; task scheduling; handling machine failures; managing inter-machine communication. Completely transparent to the programmer/analyst/user. Agrawal et al., VLDB 2010 Tutorial.
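
One way the run-time can split intermediate keys into partitions is to hash each key and take the result modulo the number of reduce tasks. The sketch below illustrates that idea only; the function name `partition` and the choice of CRC32 as the hash are assumptions made for the example, not the scheme of any particular system.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Assign a key to a reduce task; the same key always maps to the same task."""
    # CRC32 is used here only because it is deterministic across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


keys = ["Financial", "IMF", "Economics", "Crisis", "Harry", "Potter", "Film"]
assignment = {key: partition(key, 3) for key in keys}

for key, task in assignment.items():
    print(f"{key} -> reduce task {task}")
```

Because the assignment depends only on the key, every value emitted for a given key ends up at the same reduce task, regardless of which map task produced it.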

60 MapReduce Experience. Runs on large commodity clusters: 1000s to 10,000s of machines. Processes many terabytes of data. Easy to use, since run-time complexity is hidden from the users. 1000s of MR jobs/day at Google (circa 2004). 100s of MR programs implemented (circa 2004). Agrawal et al., VLDB 2010 Tutorial.

61 The Need. Special-purpose programs to process large amounts of data: crawled documents, Web query logs, etc. At Google and others (Yahoo!, Facebook): inverted index; graph structure of the Web documents; summaries of #pages/host, set of frequent queries, etc.; ad optimization; spam filtering. Agrawal et al., VLDB 2010 Tutorial.

62 Takeaway. MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance. Principal philosophies: make it scale, so you can throw hardware at problems; make it cheap, saving hardware, programmer and administration costs (but requiring fault tolerance). Hive and Pig further simplify programming. MapReduce is not suitable for all problems, but when it works, it may save you a lot of time. Agrawal et al., VLDB 2010 Tutorial.

63 MapReduce vs Parallel DBMS

                                   Parallel DBMS               MapReduce
Schema support                     ✓                           Not out of the box
Indexing                           ✓                           Not out of the box
Programming model                  Declarative (SQL)           Imperative (C/C++, Java, ...); extensions through Pig and Hive
Optimizations (compression,        ✓                           Not out of the box
  query optimization)
Flexibility                        Not out of the box          ✓
Fault tolerance                    Coarse-grained techniques   ✓

Agrawal et al., VLDB 2010 Tutorial. [Pavlo et al., SIGMOD 2009, Stonebraker et al., CACM 2010, ...]

64 MapReduce: A step backwards? Don't need 1000 nodes to process petabytes: parallel DBs do it in fewer than 100 nodes. No support for schema: sharing across multiple MR programs is difficult. No indexing: wasteful access to unnecessary data. Non-declarative programming model: requires highly skilled programmers. No support for JOINs: requires multiple MR phases for the analysis. We will study this in more detail. Agrawal et al., VLDB 2010 Tutorial.
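
To see why joins take extra work without built-in support, here is a minimal in-memory sketch of one common pattern, a reduce-side join: records from both inputs are grouped by the join key, and the pairs are formed per key. The data, the function name `reduce_side_join`, and the table names are made up for illustration; a real job would also tag each record with its source table and ship it across the network.

```python
from collections import defaultdict


def reduce_side_join(left, right):
    """Join two lists of (key, value) records on their key."""
    # "Map + shuffle": group values from both inputs by join key,
    # keeping track of which input each value came from.
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    # "Reduce": for each key, pair every left value with every right value.
    joined = []
    for key, (left_values, right_values) in groups.items():
        for left_value in left_values:
            for right_value in right_values:
                joined.append((key, left_value, right_value))
    return joined


# Hypothetical inputs: customers keyed by id, orders keyed by customer id.
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]

joined_rows = sorted(reduce_side_join(customers, orders))
print(joined_rows)
```

In SQL this is a single declarative JOIN; in MapReduce the programmer writes the grouping and pairing logic by hand, and any further aggregation over the joined rows needs another MR phase.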

65 Analysis and Mining

66 Data Analysis and Mining. Many challenges, even when data is not big. Data cleaning and curation: detection and correction of errors in data, e.g., age = 150; entity resolution and disambiguation, e.g., apple the fruit vs. Apple the company.

67 Data Analysis and Mining. Many challenges, even when data is not big. Data cleaning and curation: detection and correction of errors in data, e.g., age = 150; entity resolution and disambiguation, e.g., apple the fruit vs. Apple the company. Visualization: pictures help us to think; substitute perception for cognition; external memory frees up limited cognitive/memory resources for higher-level problems. Mining: discovery of useful, possibly unexpected, patterns in data.

68 (Big) Data Analysis Pipeline

69 Data Analysis and Mining. In exploratory tasks, change is the norm! Data analysis and mining are iterative processes, with many trial-and-error steps. [Figure labels: Data, Process, Data Product, Perception & Cognition, Knowledge, Specification, Exploration, Data Manipulation, User. Figure modified from J. van Wijk, IEEE Vis 2005.]

70 Data Analysis and Mining. In exploratory tasks, change is the norm! Data analysis and mining are iterative processes, with many trial-and-error steps: it is easy to get lost. Need to manage the data exploration process: guide users; need provenance for reproducibility [Freire et al., CISE 2008]. [Figure labels: Data, Process, Data Product, Perception & Cognition, Knowledge, Specification, Exploration, Data Manipulation, User. Figure modified from J. van Wijk, IEEE Vis 2005.]

71 Analyzing and Mining Big Data: Issues. Besides scalability for algorithms and computations, a big data-mining risk is that you will discover patterns that are meaningless. Statisticians call it Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find junk. Jeff Ullman's lecture notes.

72 Examples of Bonferroni's Principle. 1. A big objection to Total Information Awareness (TIA) was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents' privacy. 2. The Rhine Paradox: a great example of how not to conduct scientific research. Jeff Ullman's lecture notes.

73 "Stanford Professor Proves Tracking Terrorists Is Impossible". A reporter from the LA Times picked up an example in Professor Ullman's class. Despite attempts by Professor Ullman, the reporter was unable to grasp the point that the story was made up to illustrate Bonferroni's Principle, and was not real. Modified from Jeff Ullman's lecture notes.

74 The TIA Example. Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil. We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day. Jeff Ullman's lecture notes.

75 TIA Example: Details. $10^9$ people being tracked. 1000 days. Each person stays in a hotel 1% of the time (10 days out of 1000). Hotels hold 100 people (so $10^5$ hotels to hold 1% of the people being tracked). If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious? Jeff Ullman's lecture notes.

76 TIA Example: Calculations (1). Probability that given persons p and q will be at the same hotel on a given day d (p at some hotel, q at some hotel, same hotel): $\frac{1}{100} \times \frac{1}{100} \times \frac{1}{10^5} = 10^{-9}$. Probability that p and q will be at the same hotel on given days $d_1$ and $d_2$: $10^{-9} \times 10^{-9} = 10^{-18}$. Pairs of days: $\binom{1000}{2} = \frac{1000!}{(1000-2)!\,2!} \approx 5 \times 10^5$. Jeff Ullman's lecture notes.

77 TIA Example: Calculations (2). Probability that p and q will be at the same hotel on some two days: $5 \times 10^5 \times 10^{-18} = 5 \times 10^{-13}$. Pairs of people: $\binom{10^9}{2} \approx 5 \times 10^{17}$. Expected number of suspicious pairs of people: $5 \times 10^{17} \times 5 \times 10^{-13} = 250{,}000$. Jeff Ullman's lecture notes.
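
The TIA arithmetic can be reproduced exactly in a few lines. This sketch uses the parameters stated on the slides (10^9 people, 1000 days, a 1% chance of being in a hotel on any day, 10^5 hotels) and the same simplification the slides make, namely multiplying the number of day pairs by the probability for one fixed pair of days. The variable names are invented for the example.

```python
from fractions import Fraction
from math import comb

people = 10**9
days = 1000
p_in_hotel = Fraction(1, 100)  # each person is in some hotel 1% of the time
hotels = 10**5

# p and q are both in a hotel, and it is the same one, on one given day.
p_same_hotel_one_day = p_in_hotel * p_in_hotel * Fraction(1, hotels)

# The same coincidence on two given days d1 and d2.
p_same_hotel_two_days = p_same_hotel_one_day**2

day_pairs = comb(days, 2)
people_pairs = comb(people, 2)

# Expected number of pairs of people that look "suspicious" by pure chance.
expected_pairs = float(people_pairs * day_pairs * p_same_hotel_two_days)

print(f"P(same hotel, one day)  = {float(p_same_hotel_one_day):.0e}")
print(f"pairs of days           = {day_pairs}")
print(f"expected suspicious pairs ~ {expected_pairs:,.0f}")
```

With exact binomial coefficients the expectation comes out just under the 250,000 on the slide, which uses the rounded values $5 \times 10^5$ and $5 \times 10^{17}$.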

78 Conclusion. Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice. Analysts have to sift through 250,000 candidates to find the 10 real cases. Not gonna happen. But how can we improve the scheme? Jeff Ullman's lecture notes.

79 Moral. When looking for a property (e.g., "two people stayed at the same hotel twice"), make sure that the property does not allow so many possibilities that random data will surely produce facts of interest. Jeff Ullman's lecture notes.

80 Rhine Paradox (1). Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception. He devised (something like) an experiment where subjects were asked to guess 10 hidden cards, red or blue. He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right! Jeff Ullman's lecture notes.

81 Rhine Paradox (2). He told these people they had ESP and called them in for another test of the same type. Alas, he discovered that almost all of them had lost their ESP. What did he conclude? You shouldn't tell people they have ESP; it causes them to lose it. Jeff Ullman's lecture notes.
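
The "almost 1 in 1000" figure is what pure guessing predicts. A minimal sketch, assuming each of the 10 cards is an independent 50/50 guess; the pool of 1000 subjects is a made-up number for illustration, not a figure from the experiment.

```python
from fractions import Fraction

cards = 10
p_all_correct = Fraction(1, 2) ** cards  # chance of guessing every card right

subjects = 1000  # hypothetical number of people tested
expected_first_round = subjects * p_all_correct

# Chance that someone who passed by luck also passes an identical second test.
p_pass_again = p_all_correct

print(f"P(all {cards} right by guessing) = {p_all_correct}")
print(f"expected 'ESP' subjects out of {subjects}: {float(expected_first_round):.2f}")
print(f"P(passing the retest by luck) = {float(p_pass_again):.4f}")
```

So about one subject per thousand passes the first test by chance alone, and that subject has the same 1-in-1024 chance on the retest, which is why nearly all of them "lost" their ESP.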

82 Moral. Understanding Bonferroni's Principle will help you look a little less stupid than a parapsychologist. Jeff Ullman's lecture notes.

83 Next Class. Introduction to Map-Reduce and high-level data processing languages.


More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Tap into Hadoop and Other No SQL Sources

Tap into Hadoop and Other No SQL Sources Tap into Hadoop and Other No SQL Sources Presented by: Trishla Maru What is Big Data really? The Three Vs of Big Data According to Gartner Volume Volume Orders of magnitude bigger than conventional data

More information

Application Development. A Paradigm Shift

Application Development. A Paradigm Shift Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the

More information
