1 Massive Data Analysis: Course Overview
Juliana Freire
Content obtained from many sources, including: Agrawal et al., VLDB 2010 tutorial; Shim, VLDB 2012 tutorial; Jeff Ullman's lecture notes; G. Weikum
2 Course Staff and Information
- Instructors:
  o Juliana Freire
  o Jerome Simeon
- Reach us at
- More info on
- In our wiki you will find:
  o Tentative schedule
  o News and announcements
  o Reading list
  o Assignments
- Check it often!!!
3 What we will cover
- Infrastructure: architecture, computing models (e.g., MapReduce), storage solutions (e.g., Big Table, MongoDB), query/processing languages
- Algorithms and analysis: statistics, data mining techniques
- Tentative schedule in:
- Readings from:
  o Scientific papers
  o Textbooks (they are free to download!)
    Mining of Massive Data Sets (version 1.1), by Anand Rajaraman, Jure Leskovec and Jeff Ullman
    Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer
4 Prerequisites
- A course in database systems, covering application programming in SQL and other database-related languages such as XQuery
- A course on algorithms and data structures
- Good programming skills
5 What you will do
- Programming assignments (50%), done individually
  o You will need a computer
  o We will provide you access to Amazon AWS (more details later)
- Quizzes (15%): you will use Gradiance
  o Register at
  o Use token 00B06796
- Final exam (35%)
7 Big Data: What is the Big deal?
8 Big Data: What is the Big deal?
- Many success stories
  o Google: many billions of pages indexed, products, structured data
  o Facebook: 1.1 billion users using the site each month
  o Twitter: 517 million accounts, 250 million tweets/day
- "Google grew from processing 100 TB of data a day with MapReduce in 2004 to processing 20 PB a day with MapReduce in 2008. In April 2009, a blog post was written about eBay's two enormous data warehouses: one with 2 petabytes of user data, and the other with 6.5 petabytes of user data spanning 170 trillion records and growing by 150 billion new records per day. Shortly thereafter, Facebook revealed similarly impressive numbers, boasting of 2.5 petabytes of user data, growing at about 15 terabytes per day." (Lin and Dyer, 2010)
- This is changing society!
9 The McKinsey Report on Big Data
- Data have swept into every industry and business function and are now an important factor of production, alongside labor and capital. We estimate that, by 2009, nearly all sectors in the US economy had at least an average of 200 terabytes of stored data (twice the size of US retailer Wal-Mart's data warehouse in 1999) per company with more than 1,000 employees.
- The use of big data will become a key basis of competition and growth for individual firms.
- There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
10 Big Data: New Opportunities
- Enable scientific breakthroughs
- Petabytes of data generated each day, e.g., Australian radio telescopes, Large Hadron Collider, Sloan Sky Survey, genomes, climate data, ...
- Social data, e.g., Facebook, Twitter
- 3,180,000 and 3,410,000 results in Google Scholar
11 Big Data: New Opportunities
- Smart Cities: 50% of the world population lives in cities
  o Census, crime, emergency visits, taxis, public transportation, real estate, noise, energy, ...
- Cities are making their data available! https://nycopendata.socrata.com/
- Make cities more efficient and sustainable, and improve the lives of their citizens
12 NYC Inspections
- New York City gets 25,000 illegal-conversion complaints a year, but it has only 200 inspectors to handle them.
- Flowers group integrated information from 19 different agencies that provided indication of issues in buildings.
- Result: hit rate for inspections went from 13% to 70%.
13 Big Data: New Opportunities
- NYU CUSP aims to use New York City as its laboratory and classroom to help cities around the world become more productive, livable, equitable, and resilient. CUSP observes, analyzes, and models cities to optimize outcomes, prototype new solutions, formalize new tools and processes, and develop new expertise/experts.
- http://cusp.nyu.edu/
14 Big Data: New Opportunities
- Data is currency: companies are profiting from knowledge extracted from Big Data
  o Better understand customers, targeted advertising, ...
15 Big Data: New Opportunities
http://blogs.wsj.com/venturecapital/tag/big-data/
16 What is Massive/Big Data?
- The three V's of big data: Volume, Variety, and Velocity
- Too big: petabyte-scale collections or lots of (not necessarily big) data sets
- Too hard: does not fit neatly in an existing tool
  o Data sets that need to be cleaned, processed and integrated
  o E.g., Twitter, news, customer transactions
- Too fast: needs to be processed quickly
17 Big Data: What is the Big deal?
- Big data is not new: financial transactions, call detail records, astronomy, ...
- What is new:
  - Many more data enthusiasts
  - More data are widely available, e.g., Web, data.gov, scientific data, social and urban data
  - Computing is cheap and easy to access
    o Server with 64 cores, 512GB RAM ~$11k
    o Cluster with 1000 cores ~$150k
    o Pay as you go: Amazon EC2
18 Big Data: More than Volume
- Volume = Length × Width × Depth
- Big Data Length: Collect & Compare
- Big Data Width: Discover & Integrate
- Big Data Depth: Analyze & Understand
Slide by Gerhard Weikum
19 Big Urban Data: NYC Taxis
20 Collect, Clean, and Compare
[Figure panels: Beijing, NYC] Compare different cities
21 Collect, Clean, and Compare
[Figure panels: 7-8am, 8-9am, 9-10am, 10-11am] Compare effects over time. Bigger picture of city life!
22 Discover and Integrate Compare with other data sources, e.g., NYC Citi bikes Was there a traffic problem? An important event? Discover information in news, blogs, etc.
23 Discover and Integrate Compare with other data sources, e.g., NYC Citi bikes Discover information in news, blogs, etc.
24 Analyze and Understand The Sandy Effect
25 Analyze and Understand
Studying traffic patterns to and from the airports
26 Taxis in NYC: Rides per Hour
27 Big Data: What is hard?
- Scalability for computations? NOT!
  o Lots of work on distributed systems, parallel databases, ...
  o Elasticity: add more nodes!
- But there is no one-size-fits-all solution: often, you have to build your own
- Rapidly evolving technology
- Many different tools
- Different computation model: need new algorithms
28 Big Data: What is hard?
- Scalability for people: data exploration is hard, regardless of whether data are big or small
[Figure labels: algorithms, provenance, data integration, machine learning, visual encodings, statistics, interaction modes, data curation, math, data management; data, knowledge]
29 (Big) Data Analysis Pipeline
30 Big Data: Challenges
31 Big Data: Challenges Apple: Fruit or company?
32 Big Data: Opportunities and Challenges
33 Big Data: Challenges
- Taxi data: >.5 billion trips
- Can't load it into Excel, and even commercial databases are too slow
- Our solution to support interactive queries:
  o New spatio-temporal index
  o New index that leverages GPU (work in progress)
34 Big Data: New Technologies
- Infrastructure:
  o New computing paradigms: Cloud, Hadoop Map/Reduce
  o New storage solutions: NoSQL, column stores, Big Table
  o New languages: JAQL, Pig Latin
  o We will study these and how they relate to previous technologies
- Analysis and Mining:
  o New infrastructure demands new approaches to explore data
  o We will study algorithms to process and analyze data in Big-Data environments
36 What is Cloud Computing?
- Old idea: Software as a Service (SaaS)
  o Delivering applications over the Internet
- Recently: [Hardware, Infrastructure, Platform] as a service
- Utility Computing: pay-as-you-go computing
  o Illusion of infinite resources
  o No up-front cost
  o Fine-grained billing (e.g., hourly)
Agrawal et al., VLDB 2010 Tutorial
37 Cloud Computing: Why Now?
- Experience with very large data centers
  o Unprecedented economies of scale
  o Transfer of risk
- Technology factors
  o Pervasive broadband Internet
  o Maturity in virtualization technology
- Business factors
  o Minimal capital expenditure
  o Pay-as-you-go billing model
Agrawal et al., VLDB 2010 Tutorial
38 Warehouse Scale Computing
Google's data center in Oregon. 16 Million Nodes per building.
Agrawal et al., VLDB 2010 Tutorial
39 Economics of Cloud Users
- Pay by use instead of provisioning for peak
[Figure: resources vs. time, showing capacity and demand for a static data center (with unused resources) and for a data center in the cloud]
Agrawal et al., VLDB 2010 Tutorial. Slide credits: Berkeley RAD Lab
40 Economics of Cloud Users
- Risk of over-provisioning: underutilization
[Figure: resources vs. time for a static data center, showing capacity, demand, and unused resources]
Agrawal et al., VLDB 2010 Tutorial. Slide credits: Berkeley RAD Lab
41 Economics of Cloud Users
- Heavy penalty for under-provisioning
[Figure: three panels of resources vs. time (days), showing capacity and demand, lost revenue, and lost users]
Agrawal et al., VLDB 2010 Tutorial. Slide credits: Berkeley RAD Lab
42 Just hype?
"Cloud Computing? What are you talking about? Cloud Computing is nothing but a computer attached to a network."
-- Larry Ellison, excerpts from an interview
Agrawal et al., VLDB 2010 Tutorial
43 Cloud Computing: Hype or Reality
- Unlike the earlier attempts:
  o Distributed Computing
  o Distributed Databases
  o Grid Computing
- Cloud Computing is REAL:
  o Organic growth: Google, Yahoo, Microsoft, and Amazon
  o Poised to be an integral aspect of National Infrastructure in US and elsewhere
Agrawal et al., VLDB 2010 Tutorial
44 Cloud Computing Modalities
- "Can we outsource our IT software and hardware infrastructure?"
  o Hosted applications and services
  o Pay-as-you-go model
  o Scalability, fault-tolerance, elasticity, and self-manageability
- "We have terabytes of click-stream data: what can we do with it?"
  o Very large data repositories
  o Complex analysis
  o Distributed and parallel data processing
Agrawal et al., VLDB 2010 Tutorial
45 Why Data Analysis?
- What is the most effective distribution channel?
- Who are our lowest/highest margin customers?
- Who are my customers and what products are they buying?
- What product promotions have the biggest impact on revenue?
- What impact will new products/services have on revenue and margins?
- Which customers are most likely to go to the competition?
VLDB 2010 Tutorial
46 Why Data Analysis?
- What would the impacts be of a fare change?
- Where are our lowest/highest margin passengers?
- What is the distribution of trip lengths?
- What is the quickest route from midtown to downtown at 4pm on Monday?
- What impact will the introduction of additional medallions have?
- Where should drivers go to get passengers?
47 Decision Support
- Used to manage and control business
- Data is historical or point-in-time
- Optimized for inquiry rather than update
- Use of the system is loosely defined and can be ad-hoc
- Used by managers and end-users to understand the business and make judgements
Agrawal et al., VLDB 2010 Tutorial
48 Decision Support
- Data analysis in the enterprise context emerged:
  o As a tool to build decision support systems
  o Data-centric decision making instead of using intuition
  o New term: Business Intelligence
- Traditional approach:
  o Decision makers wait for reports from disparate OLTP systems
  o Put it all together in a spreadsheet
  o Manual process
Agrawal et al., VLDB 2010 Tutorial
49 Data Analytics in the Web Context
- Data capture at the user interaction level:
  o In contrast to the client transaction level in the enterprise context
- As a consequence, the amount of data increases significantly
- Need to analyze such data to understand user behaviors
Agrawal et al., VLDB 2010 Tutorial
50 Data Analytics outside Big Corporations
- Even data capture at client transaction level leads to a lot of data!
- Need to analyze such data to understand behavior
- Cannot afford expensive warehouse solutions
51 Data Analytics in the Cloud
- Scalability to large data volumes:
  o Scan 100 TB on 1 node @ 50 MB/sec = 23 days
  o Scan 100 TB on 1000-node cluster = 33 minutes
  → Divide-and-conquer (i.e., data partitioning)
- Cost-efficiency:
  o Commodity nodes (cheap, but unreliable)
  o Commodity network
  o Automatic fault-tolerance (fewer admins)
  o Easy to use (fewer programmers)
Agrawal et al., VLDB 2010 Tutorial
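The scan times on this slide follow from simple arithmetic. The sketch below reproduces them under assumptions that are not stated on the slide: decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), data split evenly across nodes, and no coordination overhead.

```python
# Back-of-the-envelope scan times.
# Assumptions (not from the slide): decimal units, perfectly even
# partitioning, and zero coordination or network overhead.
TB = 10**12
MB = 10**6

def scan_seconds(data_bytes, nodes, mb_per_sec_per_node=50):
    """Seconds needed to scan data split evenly across `nodes` machines."""
    per_node_bytes = data_bytes / nodes
    return per_node_bytes / (mb_per_sec_per_node * MB)

single = scan_seconds(100 * TB, nodes=1)
cluster = scan_seconds(100 * TB, nodes=1000)

print(f"1 node:     {single / 86400:.1f} days")
print(f"1000 nodes: {cluster / 60:.1f} minutes")
```

Under these idealized assumptions the speedup is exactly linear in the number of nodes, which is the point of the divide-and-conquer argument; a real cluster would likely fall somewhat short of this.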
52 Platforms for Large-scale Data Analysis
- Parallel DBMS technologies
  o Proposed in the late eighties
  o Matured over the last two decades
  o Multi-billion dollar industry: proprietary DBMS engines intended as data warehousing solutions for very large enterprises
- MapReduce
  o Pioneered by Google
  o Popularized by Yahoo! (Hadoop)
Agrawal et al., VLDB 2010 Tutorial
53 Parallel DBMS technologies
- Popularly used for more than two decades
  o Research projects: Gamma, Grace, ...
  o Commercial: multi-billion dollar industry, but access to only a privileged few
- Relational data model
- Indexing
- Familiar SQL interface
- Advanced query optimization
- Well understood and studied
- Very reliable!
Agrawal et al., VLDB 2010 Tutorial
54 MapReduce
- Overview:
  o Data-parallel programming model
  o An associated parallel and distributed implementation for commodity clusters
- Pioneered by Google
  o Processes 20 PB of data per day (circa 2008)
- Popularized by open-source Hadoop project
  o Used by Yahoo!, Facebook, Amazon, and the list is growing
[Dean et al., OSDI 2004, CACM Jan 2008, CACM Jan 2010]
Agrawal et al., VLDB 2010 Tutorial
55 Hadoop
- Open-source MapReduce framework, an Apache project
- Hadoop Distributed File System (HDFS)
  o Store big files across machines
  o Store each file as a sequence of blocks
  o Blocks of a file are replicated for fault tolerance
- Distribute processing of large data across thousands of commodity machines
- Key components
  o MapReduce: distributes applications
  o Hadoop Distributed File System (HDFS): distributes data
- A single Namenode (master) and multiple Datanodes (slaves)
  o Namenode: manages the file system and access to files by clients
  o Datanode: manages the storage attached to the node it runs on
Kyuseok Shim (VLDB 2012 Tutorial)
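To make "store each file as a sequence of blocks" and "blocks are replicated" concrete, here is a toy model. It is not the real HDFS placement policy: the 64 MB block size, the replication factor of 3, the round-robin placement, and the datanode names are all assumptions chosen for illustration.

```python
# Toy model of HDFS-style block storage. NOT the actual HDFS policy.
# Assumptions (not from the slides): 64 MB blocks, 3 replicas per block,
# round-robin placement, and hypothetical datanode names.
import math

def place_blocks(file_size_mb, datanodes, block_mb=64, replicas=3):
    """Return {block_index: [datanodes holding a replica of that block]}."""
    if replicas > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    n_blocks = math.ceil(file_size_mb / block_mb)
    placement = {}
    for b in range(n_blocks):
        # Consecutive nodes (wrapping around) keep replicas on distinct machines.
        placement[b] = [datanodes[(b + r) % len(datanodes)] for r in range(replicas)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
layout = place_blocks(200, nodes)  # a 200 MB file needs 4 blocks
for block, holders in layout.items():
    print(block, holders)
```

In this sketch the returned dictionary plays the role of the Namenode's metadata (which node holds which block), while the named nodes stand in for Datanodes. Losing any single node still leaves two copies of every block.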
56 MapReduce Programming Model
- Borrows from functional programming
- Users should implement two primary methods:
  o Map: (key1, val1) → [(key2, val2)]
  o Reduce: (key2, [val2]) → [(key3, val3)]
Kyuseok Shim (VLDB 2012 Tutorial)
58 Word Counting with MapReduce
Documents:
  Doc1: Financial, IMF, Economics, Crisis
  Doc2: Financial, IMF, Crisis
  Doc3: Economics, Harry
  Doc4: Financial, Harry, Potter, Film
  Doc5: Crisis, Harry, Potter
[Figure: Map emits a (key, 1) pair for each word; the pairs are grouped into one value list per key (e.g., Financial: 1, 1, 1; IMF: 1, 1); Reduce is then applied to each key: Financial, IMF, Economics, Crisis, Harry, Film, Potter]
Before reduce functions are called, for each distinct key, the list of its values is generated.
Kyuseok Shim (VLDB 2012 Tutorial)
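The word-count example on this slide can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not Hadoop code: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and the grouping step stands in for the shuffle that a real framework performs across machines.

```python
from collections import defaultdict

# The five documents from the slide.
docs = {
    "Doc1": ["Financial", "IMF", "Economics", "Crisis"],
    "Doc2": ["Financial", "IMF", "Crisis"],
    "Doc3": ["Economics", "Harry"],
    "Doc4": ["Financial", "Harry", "Potter", "Film"],
    "Doc5": ["Crisis", "Harry", "Potter"],
}

def map_fn(doc_id, words):
    """Map: (key1, val1) -> [(key2, val2)]; emit (word, 1) per word."""
    for word in words:
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: (key2, [val2]) -> [(key3, val3)]; sum the 1s for a word."""
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs.items():
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)      # builds the value list for each key
    result = {}
    for k2, values in groups.items():
        for k3, v3 in reduce_fn(k2, values):
            result[k3] = v3
    return result

counts = run_mapreduce(docs, map_fn, reduce_fn)
for word, n in sorted(counts.items()):
    print(word, n)
```

The user writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is what the framework would take care of, including building the per-key value lists before any reduce call.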
59 MapReduce Advantages
- Automatic parallelization:
  o Depending on the size of the raw input data → instantiate multiple MAP tasks
  o Similarly, depending upon the number of intermediate <key, value> partitions → instantiate multiple REDUCE tasks
- Run-time:
  o Data partitioning
  o Task scheduling
  o Handling machine failures
  o Managing inter-machine communication
- Completely transparent to the programmer/analyst/user
Agrawal et al., VLDB 2010 Tutorial
60 MapReduce Experience
- Runs on large commodity clusters:
  o 1000s to 10,000s of machines
- Processes many terabytes of data
- Easy to use since run-time complexity is hidden from the users
- 1000s of MR jobs/day at Google (circa 2004)
- 100s of MR programs implemented (circa 2004)
Agrawal et al., VLDB 2010 Tutorial
61 The Need
- Special-purpose programs to process large amounts of data: crawled documents, Web query logs, etc.
- At Google and others (Yahoo!, Facebook):
  o Inverted index
  o Graph structure of the Web documents
  o Summaries of #pages/host, set of frequent queries, etc.
  o Ad optimization
  o Spam filtering
Agrawal et al., VLDB 2010 Tutorial
62 Takeaway
- MapReduce's data-parallel programming model hides complexity of distribution and fault tolerance
- Principal philosophies:
  o Make it scale, so you can throw hardware at problems
  o Make it cheap, saving hardware, programmer and administration costs (but requiring fault tolerance)
- Hive and Pig further simplify programming
- MapReduce is not suitable for all problems, but when it works, it may save you a lot of time
Agrawal et al., VLDB 2010 Tutorial
63 MapReduce vs Parallel DBMS
Feature | Parallel DBMS | MapReduce
Schema support | Yes | Not out of the box
Indexing | Yes | Not out of the box
Programming model | Declarative (SQL) | Imperative (C/C++, Java, ...); extensions through Pig and Hive
Optimizations (compression, query optimization) | Yes | Not out of the box
Flexibility | Not out of the box | Yes
Fault tolerance | Coarse-grained techniques | Yes
[Pavlo et al., SIGMOD 2009; Stonebraker et al., CACM 2010; ...]
Agrawal et al., VLDB 2010 Tutorial
64 MapReduce: A step backwards?
- Don't need 1000 nodes to process petabytes:
  o Parallel DBs do it in fewer than 100 nodes
- No support for schema:
  o Sharing across multiple MR programs is difficult
- No indexing:
  o Wasteful access to unnecessary data
- Non-declarative programming model:
  o Requires highly-skilled programmers
- No support for JOINs:
  o Requires multiple MR phases for the analysis
- We will study this in more detail
Agrawal et al., VLDB 2010 Tutorial
65 Analysis and Mining
66 Data Analysis and Mining
- Many challenges, even when data is not big
- Data cleaning and curation:
  o Detection and correction of errors in data, e.g., age = 150
  o Entity resolution and disambiguation, e.g., apple the fruit vs. Apple the company
67 Data Analysis and Mining
- Many challenges, even when data is not big
- Data cleaning and curation:
  o Detection and correction of errors in data, e.g., age = 150
  o Entity resolution and disambiguation, e.g., apple the fruit vs. Apple the company
- Visualization: pictures help us to think
  o Substitute perception for cognition
  o External memory: free up limited cognitive/memory resources for higher-level problems
- Mining: discovery of useful, possibly unexpected, patterns in data
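Error detection of the "age = 150" kind can be as simple as a range check. The sketch below is a minimal illustration only: the records, the field name `age`, and the plausible range of 0 to 120 years are all hypothetical choices, and real cleaning pipelines typically combine many such rules.

```python
# Minimal range-check sketch for detecting implausible values.
# The records and the 0-120 plausibility range are hypothetical.
records = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": 150},   # the kind of error mentioned on the slide
    {"name": "Cy", "age": -2},
    {"name": "Dee", "age": None},  # missing value
]

def find_age_errors(rows, lo=0, hi=120):
    """Return indexes of rows whose age is missing or outside [lo, hi]."""
    bad = []
    for i, row in enumerate(rows):
        age = row.get("age")
        if age is None or not (lo <= age <= hi):
            bad.append(i)
    return bad

suspect = find_age_errors(records)
print([records[i]["name"] for i in suspect])
```

Detection is the easy half: deciding how to correct a flagged value (drop it, impute it, or go back to the source) depends on the data and the analysis.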
68 (Big) Data Analysis Pipeline
69 Data Analysis and Mining
- In exploratory tasks, change is the norm!
  o Data analysis and mining are iterative processes
  o Many trial-and-error steps
[Figure (modified from J. van Wijk, IEEE Vis 2005): Data, Process, Data Product, Perception & Cognition, Knowledge, Specification, Exploration, Data Manipulation, User]
70 Data Analysis and Mining
- In exploratory tasks, change is the norm!
  o Data analysis and mining are iterative processes
  o Many trial-and-error steps, easy to get lost
- Need to manage the data exploration process:
  o Guide users
  o Need provenance for reproducibility [Freire et al., CISE 2008]
[Figure (modified from J. van Wijk, IEEE Vis 2005): Data, Process, Data Product, Perception & Cognition, Knowledge, Specification, Exploration, Data Manipulation, User]
71 Analyzing and Mining Big Data: Issues
- Besides scalability for algorithms and computations:
- A big data-mining risk is that you will discover patterns that are meaningless.
- Statisticians call it Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find junk.
Jeff Ullman's lecture notes
72 Examples of Bonferroni's Principle
1. A big objection to Total Information Awareness (TIA) was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents' privacy.
2. The Rhine Paradox: a great example of how not to conduct scientific research.
Jeff Ullman's lecture notes
73 Stanford Professor Proves Tracking Terrorists Is Impossible!
- Reporter from the LA Times picked an example in Professor Ullman's class
- Despite attempts by Professor Ullman, the reporter was unable to grasp the point that the story was made up to illustrate Bonferroni's Principle, and was not real
Modified from Jeff Ullman's lecture notes
74 The TIA Example
- Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
- We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day.
Jeff Ullman's lecture notes
75 TIA Example: Details
- $10^9$ people being tracked.
- 1000 days.
- Each person stays in a hotel 1% of the time (10 days out of 1000).
- Hotels hold 100 people (so $10^5$ hotels to hold 1% of the people being tracked).
- If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?
Jeff Ullman's lecture notes
76 TIA Example: Calculations (1)
[Figure: p at some hotel, q at some hotel, same hotel]
- Probability that given persons p and q will be at the same hotel on given day d:
  o $\frac{1}{100} \times \frac{1}{100} \times \frac{1}{10^5} = 10^{-9}$
- Probability that p and q will be at the same hotel on given days $d_1$ and $d_2$:
  o $10^{-9} \times 10^{-9} = 10^{-18}$
- Pairs of days: $\binom{1000}{2} = \frac{1000!}{(1000-2)!\,2!}$
  o $\approx 5 \times 10^5$
Jeff Ullman's lecture notes
77 TIA Example: Calculations (2)
- Probability that p and q will be at the same hotel on some two days:
  o $5 \times 10^5 \times 10^{-18} = 5 \times 10^{-13}$
- Pairs of people: $\binom{10^9}{2} \approx$
  o $5 \times 10^{17}$
- Expected number of suspicious pairs of people:
  o $5 \times 10^{17} \times 5 \times 10^{-13} = 250{,}000$
Jeff Ullman's lecture notes
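The calculation above can be checked numerically. The sketch below uses exact binomial coefficients rather than the rounded powers of ten, so the result comes out slightly under the slide's 250,000. The variable names are illustrative; the inputs are the numbers given in the TIA example.

```python
from math import comb  # available in Python 3.8+

people = 10**9          # people being tracked
days = 1000             # observation period
p_in_hotel = 1 / 100    # chance a person is in some hotel on a given day
hotels = 10**5          # number of hotels

# p and q are both in a hotel, and it is the same one, on a given day.
p_same_day = p_in_hotel * p_in_hotel / hotels

# Same hotel on two given days (days assumed independent).
p_two_given_days = p_same_day ** 2

# Any of the possible pairs of days; these tiny probabilities are
# simply added, as in the slide's approximation.
day_pairs = comb(days, 2)
p_some_two_days = day_pairs * p_two_given_days

# Expected number of "suspicious" pairs among all pairs of people.
people_pairs = comb(people, 2)
expected_pairs = people_pairs * p_some_two_days

print(f"pairs of days:   {day_pairs:,}")
print(f"pairs of people: {people_pairs:.3e}")
print(f"expected suspicious pairs: {expected_pairs:,.0f}")
```

The exact count of day pairs is 499,500 rather than 500,000, which is why the expected number lands just below the rounded figure; the conclusion is unchanged, since purely random behavior still yields roughly a quarter of a million suspicious pairs.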
78 Conclusion
- Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
- Analysts have to sift through 250,000 candidates to find the 10 real cases.
  o Not gonna happen.
  o But how can we improve the scheme?
Jeff Ullman's lecture notes
79 Moral
- When looking for a property (e.g., "two people stayed at the same hotel twice"), make sure that the property does not allow so many possibilities that random data will surely produce facts "of interest."
Jeff Ullman's lecture notes
80 Rhine Paradox (1)
- Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception.
- He devised (something like) an experiment where subjects were asked to guess 10 hidden cards: red or blue.
- He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!
Jeff Ullman's lecture notes
81 Rhine Paradox (2)
- He told these people they had ESP and called them in for another test of the same type.
- Alas, he discovered that almost all of them had lost their ESP.
- What did he conclude?
- You shouldn't tell people they have ESP; it causes them to lose it.
Jeff Ullman's lecture notes
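The "almost 1 in 1000" figure is what pure guessing predicts. The sketch below works this out, assuming each card is an independent fair red/blue guess; the pool of 100,000 subjects is a hypothetical number chosen only to make the expected counts readable.

```python
from fractions import Fraction

# Chance of guessing 10 independent red/blue cards correctly by luck alone.
p_all_right = Fraction(1, 2) ** 10   # 1/1024, i.e. "almost 1 in 1000"

# Hypothetical pool size, for illustration only.
subjects = 100_000

# Subjects expected to get all 10 right in the first test by chance.
expected_first_round = subjects * p_all_right

# Of those, how many are expected to get all 10 right again.
expected_second_round = expected_first_round * p_all_right

print(f"P(all 10 right by chance) = {p_all_right}")
print(f"Expected 'ESP' subjects after test 1: {float(expected_first_round):.1f}")
print(f"Expected to pass test 2 as well:      {float(expected_second_round):.3f}")
```

With no ESP at all, about 1 subject in 1024 passes the first test, and each of those has the same 1-in-1024 chance of passing again, so "losing" the ability on retest is the expected outcome, not evidence of anything.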
82 Moral
- Understanding Bonferroni's Principle will help you look a little less stupid than a parapsychologist.
Jeff Ullman's lecture notes
83 Next Class
- Introduction to MapReduce and high-level data processing languages
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System firstname.lastname@example.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
Real Time Big Data Processing Cloud Expo 2014 Ian Meyers Amazon Web Services Global Infrastructure Deployment & Administration App Services Analytics Compute Storage Database Networking AWS Global Infrastructure
L1: Introduction to Hadoop Feng Li email@example.com School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System firstname.lastname@example.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
So What s the Big Deal? Presentation Agenda Introduction What is Big Data? So What is the Big Deal? Big Data Technologies Identifying Big Data Opportunities Conducting a Big Data Proof of Concept Big Data
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Big Data Putting data to productive use Fast Forward What is big data, and why should you care? Get familiar with big data terminology, technologies, and techniques. Getting started with big data to realize
Big Data Technology ดร.ช ชาต หฤไชยะศ กด Choochart Haruechaiyasak, Ph.D. Speech and Audio Technology Laboratory (SPT) National Electronics and Computer Technology Center (NECTEC) National Science and Technology
Large-Scale Data Processing Eiko Yoneki email@example.com http://www.cl.cam.ac.uk/~ey204 Systems Research Group University of Cambridge Computer Laboratory 2010s: Big Data Why Big Data now? Increase
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
CLOUD COMPUTING USING HADOOP TECHNOLOGY DHIRAJLAL GANDHI COLLEGE OF TECHNOLOGY SALEM B.NARENDRA PRASATH S.PRAVEEN KUMAR 3 rd year CSE Department, 3 rd year CSE Department, Email:firstname.lastname@example.org
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
Managing Scale in Ontological Systems 1 This presentation offers a brief look scale in ontological (semantic) systems, tradeoffs in expressivity and data scale, and both information and systems architectural
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
ISSN: 2454-2377, October 2015 Big Data and Hadoop Simmi Bagga 1 Satinder Kaur 2 1 Assistant Professor, Sant Hira Dass Kanya MahaVidyalaya, Kala Sanghian, Distt Kpt. INDIA E-mail: email@example.com
CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensie Computing Uniersity of Florida, CISE Department Prof. Daisy Zhe Wang Map/Reduce: Simplified Data Processing on Large Clusters Parallel/Distributed
In-Memory Analytics for Big Data Game-changing technology for faster, better insights WHITE PAPER SAS White Paper Table of Contents Introduction: A New Breed of Analytics... 1 SAS In-Memory Overview...
Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure Requirements... 5 Solution Spectrum... 6 Oracle s Big Data
OW2 Open Source Corporate Network Meeting Play with Big Data on the Shoulders of Open Source Liu Jie Technology Center of Software Engineering Institute of Software, Chinese Academy of Sciences 2012-10-19
PAGE 1 l Teradata Magazine l Q1/2011 l 2011 Teradata Corporation l AR-6309 It s going mainstream, and it s your next opportunity. by Merv Adrian Enterprises have never had more data, and it s no surprise
CIS 4930/6930 Spring 2014 Introduction to Data Science Data Intensive Computing University of Florida, CISE Department Prof. Daisy Zhe Wang Cloud Computing and Amazon Web Services Cloud Computing Amazon
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
Abstract- Today data is increasing in volume, variety and velocity. To manage this data, we have to use databases with massively parallel software running on tens, hundreds, or more than thousands of servers.
Application and practice of parallel cloud computing in ISP Guangzhou Institute of China Telecom Zhilan Huang 2011-10 Outline Mass data management problem Applications of parallel cloud computing in ISPs
+ Breakaway Session By Johnson Iyilade, Ph.D. University of Saskatchewan, Canada 23-July, 2015 BIG DATA USING HADOOP + Outline n Framing the Problem Hadoop Solves n Meet Hadoop n Storage with HDFS n Data
Big Data and Analytics: Challenges and Opportunities Dr. Amin Beheshti Lecturer and Senior Research Associate University of New South Wales, Australia (Service Oriented Computing Group, CSE) Talk: Sharif
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
While a number of technologies fall under the Big Data label, Hadoop is the Big Data mascot. Remember it stands front and center in the discussion of how to implement a big data strategy. Early adopters
3 Speak Tech Brief RichRelevance Distributed Computing: creating a scalable, reliable infrastructure Overview Scaling a large database is not an overnight process, so it s difficult to plan and implement
Big Data Are You Ready? Jorge Plascencia Solution Architect Manager Big Data: The Datafication Of Everything Thoughts Devices Processes Thoughts Things Processes Run the Business Organize data to do something