Lecture #1. An overview of Big Data
1 Additional Topics: Big Data Lecture #1 An overview of Big Data Joseph Bonneau April 27, 2012
2 Course outline 0 Google on Building Large Systems (Mar. 14) David Singleton 1 Overview of Big Data (today) 2 Algorithms for Big Data (April 30) 3 Case studies from Big Data startups (May 2) Pete Warden
3 Goals after four lectures recognise some of the main terminology remember that there are many tools available realise the potential of Big Data all of you are skilled enough to get involved!
4 What is Big Data? "Data sets that grow so large that they become awkward to work with using on-hand database management tools" (Wikipedia)
5 What is Big Data? distributed file systems NoSQL databases grid computing, cloud computing MapReduce and other new paradigms large scale machine learning
6 What is Big Data? from Big Data and the Web: Algorithms for Data Intensive Scalable Computing PhD Thesis, Gianmarco De Francisci Morales
7 What is Big Data? buzzword? bubble? gold rush? revolution? funding fad? DARPA XDATA project, March 2012
8 Who's profiting?
9 RapLeaf { "age":"21-24", "gender":"male", "interests":{ "Blogging":true, "High-End Brand Buyer":true, "Sports":true }, "education":"completed Graduate School", "occupation":"professional", "children":"no", "household_income":"75k-100k", "marital_status":"single", "home_owner_status":"rent" } 10 TB of user data; all computation done using local Hadoop cluster
10 Gnip grand central station for social web streams aggregates several TB of new social data per day all data stored on Amazon S3
11 Climate Corporation 14 TB of historical weather data all computation done using Amazon EC2 30 technical staff, including 12 PhDs 10,000 sales agents
12 FICO 50+ years of experience doing credit ratings transitioning to predictive analytics
13 Cloudera employs many of the original Hadoop developers now facing many competitors
14 Palantir Technologies data analysis, exploration tools for government, finance; from 0 to $2 billion in 5 years
15 The modern data scientist Engineer collect & scrub disparate data sources manage a large computing cluster Mathematician machine learning statistics Artist visualise data beautifully tell a convincing story
16 Sources of Big Data
17 Proprietary data: Facebook 40+ billion photos 6 billion messages per day 100 PB 5 10 TB 900 million users 1 trillion connections?
18 Common Crawl covers about 5 billion web pages; 50 TB of data; available for free, hosted by Amazon
19 Twitter Firehose { "in_reply_to_status_id_str":null, "retweet_count":0, "favorited":false, "text":"new ipad vs ipad 2: which should you choose? Redsn0w b6 i4siri Proxy Server For Spire GEVEY Ultra S WP7 _83", "in_reply_to_user_id_str":null, "in_reply_to_status_id":null, "created_at":"mon Mar 19 22:22: ", "geo":null, "in_reply_to_user_id":null, "truncated":false, "source":"<a href=\" rel=\"nofollow\">tech Discovery</a>", "id_str":" ", "entities":{ "hashtags":[], "urls": [ { "indices": [45,65], "expanded_url":" "url":" "display_url":"ow.ly/9a4 kc" } ], "user_mentions": [] }, "contributors":null, "in_reply_to_screen_name":null, "place":null, "retweeted":false, "possibly_sensitive_editable":true, "possibly_sensitive":false, "coordinates":null, "user": { "profile_text_color":"333333", "profile_image_url_https":" _normal.png", "screen_name":"ricegyeat", "default_profile_image":false, "profile_background_image_url":" 0.twimg.com/images/themes/theme1/bg.png", "favourites_count":0, "created_at":"thu Feb 09 18:26: ", "profile_link_color":"0084b4", "verified":false, "friends_count":0, "url":null, "description":"", "profile_background_color":"c0deed", "id_str":" ", "lang":"en", "profile_background_tile":false, "listed_count":0, "contributors_enabled":false, "geo_enabled":false, "profile_sidebar_fill_color":"ddeef6", "location":"", "time_zone":null, "protected":false, "default_profile":true, "following":null, "notifications":null, "profile_sidebar_border_color":"c0deed", "name":"rice gyeat", "is_translator":false, "show_all_inline_media":false, "follow_request_sent":null, "statuses_count":8253, "followers_count":109, "profile_image_url":" "id": , "profile_use_background_image":true, "profile_background_image_url_https":" /images/themes/theme1/bg.png", "utc_offset":null }, "id": } 350 M tweets/day x 2-3 KB/tweet ≈ 1 TB/day
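The closing estimate on this slide is simple arithmetic, and a short sketch makes the range explicit. It assumes the garbled figure reads "2-3 KB per tweet" and uses decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes); both readings are assumptions, not something the transcription confirms.

```python
# Back-of-envelope volume of the Twitter Firehose.
# Assumption: "2 3Kb/tweet" on the slide means 2-3 KB of JSON per tweet.
TWEETS_PER_DAY = 350_000_000
BYTES_PER_TWEET_RANGE = (2_000, 3_000)  # decimal kilobytes, low and high estimate


def daily_volume_tb(tweets_per_day, bytes_per_tweet):
    """Return the daily data volume in terabytes (1 TB = 1e12 bytes)."""
    return tweets_per_day * bytes_per_tweet / 1e12


low_tb, high_tb = (daily_volume_tb(TWEETS_PER_DAY, b) for b in BYTES_PER_TWEET_RANGE)
print(f"between {low_tb:.2f} and {high_tb:.2f} TB per day")
```

Both ends of the range sit close to the slide's rounded figure of 1 TB per day.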
20 Sloan Digital Sky Survey 35% of the sky mapped 500 million objects classified 50 TB of data available
21 Large Hadron Collider 15 PB of data generated annually mostly stored in Oracle databases (SQL)
22 Human genome humans and 50 other species only 250 GB
23 US census data (2000) detailed demographic data for small localities only 200 GB
24 Roll your own big data crawling as a service web sites, social profiles, product listings, etc. free accounts offer crawls of up to 10k URLs
25 Roll your own big data Needlebase: graphical tagging of website structure
26 Roll your own big data Boilerpipe: remove clutter from web pages (metadata, JavaScript, etc.); Google Refine: clean up human-entered data (fix common typos, spacing, etc.); NLPToolkit: simplify natural language (stem words, replace synonyms); Lucene: index terms for text search; Amazon MTurk: human analysis
27 Storing big data
28 Traditional SQL databases
TABLE instructor
ID  Name
14  David Singleton
27  Joseph Bonneau
52  Pete Warden
TABLE lectures
ID  Title              Lecturer
1   BD at Google       14
2   Overview of BD     27
3   Algorithms for BD  27
4   BD at startups     52
29 Traditional SQL databases (same two tables) most interesting queries require computing joins
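To make the point about joins concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the slide; the IDs that the transcription dropped (14 for David Singleton, 1 for the first lecture, 52 for the last lecturer) are filled in from the course outline and should be read as assumptions.

```python
import sqlite3

# In-memory database holding the two tables from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE instructor (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE lectures (
    id INTEGER PRIMARY KEY,
    title TEXT,
    lecturer INTEGER REFERENCES instructor(id)
);
INSERT INTO instructor VALUES
    (14, 'David Singleton'), (27, 'Joseph Bonneau'), (52, 'Pete Warden');
INSERT INTO lectures VALUES
    (1, 'BD at Google', 14), (2, 'Overview of BD', 27),
    (3, 'Algorithms for BD', 27), (4, 'BD at startups', 52);
""")

# "Who gives which lecture?" cannot be answered from either table alone:
# the lecturer column only stores an ID, so the query has to join.
rows = conn.execute("""
    SELECT l.title, i.name
    FROM lectures AS l
    JOIN instructor AS i ON l.lecturer = i.id
    ORDER BY l.id
""").fetchall()

for title, name in rows:
    print(f"{title}: {name}")
```

On a single machine this join is cheap. Once the two tables are sharded across many nodes, matching rows may live on different machines, which is one reason many NoSQL stores drop joins altogether.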
30 Data assumptions
Traditional RDBMS (SQL) | NoSQL
integrity is mission-critical | OK as long as most data is correct
data format consistent, well-defined | data format unknown or inconsistent
data is of long-term value | data will be replaced
data updates are frequent | write-once, read multiple
predictable, linear growth | unpredictable growth (exponential?)
non-programmers writing queries | only programmers writing queries
regular backup | replication
access through master server | sharding
31 Replication and sharding master where is a?
32 Replication and sharding a a a master a is at nodes 2, 3, 6
33 Replication and sharding a a a a? master
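The three slides above show a master that answers "where is a?" with a list of nodes. A toy sketch of that idea follows. The class name `Master`, the CRC32-based placement, and the choice of three replicas on six nodes are all illustrative assumptions; real systems use more careful placement (for example consistent hashing) and handle node membership changes.

```python
import zlib


class Master:
    """Toy master: shards keys across nodes and keeps several replicas of each."""

    def __init__(self, num_nodes, replicas=3):
        self.nodes = {n: {} for n in range(1, num_nodes + 1)}  # node id -> local data
        self.replicas = replicas
        self.directory = {}  # key -> list of node ids holding a copy

    def _choose_nodes(self, key):
        # Deterministic placement (an assumption): hash the key, then take
        # consecutive node ids, wrapping around at the end.
        start = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return [(start + i) % len(self.nodes) + 1 for i in range(self.replicas)]

    def put(self, key, value):
        locations = self._choose_nodes(key)
        for node_id in locations:
            self.nodes[node_id][key] = value
        self.directory[key] = locations

    def locate(self, key):
        """Answer the client's question: which nodes hold this key?"""
        return self.directory[key]

    def get(self, key, failed=()):
        # Try each replica in turn, skipping nodes known to be down.
        for node_id in self.locate(key):
            if node_id not in failed:
                return self.nodes[node_id][key]
        raise KeyError(f"all replicas of {key!r} are unavailable")


master = Master(num_nodes=6)
master.put("a", 1)
locations = master.locate("a")
print("a is at nodes", locations)
```

Because every key lives on several nodes, a read still succeeds when some of them fail, for example `master.get("a", failed=locations[:2])`.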
34 Integrity guarantees Traditional RDBMS (SQL), ACID: Atomic, Consistent, Isolated transactions, Durable writes. NoSQL, BASE: Basically Available, Soft state, Eventually consistent
35 Strong consistency Xiuying read(a) = 1 read(a) = 1 Yves read(a) = 1 Zaira
36 Strong consistency Xiuying read(a) = 1 read(a) = 1 Yves read(a) = 1 Zaira write(a) = 2
37 Strong consistency Xiuying read(a) = 1 read(a) = 1 write(a) = 2 read(a) = 2 read(a) = 2 Yves read(a) = 1 Zaira read(a) = 2
38 Eventual consistency Xiuying read(a) = 1 read(a) = 1 Yves read(a) = 1 Zaira write(a) = 2
39 Eventual consistency Xiuying read(a) = 1 read(a) = 1 write(a) = 2 read(a) = 1 read(a) = 2 Yves read(a) = 1 Zaira read(a) = 1
40 Eventual consistency Xiuying read(a) = 1 read(a) = 1 write(a) = 2 read(a) = 1 read(a) = 2 read(a) = 2 Yves read(a) = 1 Zaira read(a) = 1 inconsistent window read(a) = 2
41 Read-own writes consistency Xiuying read(a) = 1 read(a) = 1 write(a) = 2 read(a) = 2 read(a) = 2 read(a) = 2 Yves read(a) = 1 Zaira read(a) = 1 inconsistent window read(a) = 2
42 Monotonic read consistency Xiuying read(a) = 1 read(a) = 1 Yves read(a) = 1 Zaira write(a) = 2 read(a) = 1 read(a) = 2 read(a) = 2 read(a) = 1 inconsistent window read(a) = 2
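The timelines on the consistency slides can be imitated with a small simulation. In this sketch each client reads from its own replica and a write reaches the other replicas only when `propagate()` is called; the replica names and the explicit propagation step are assumptions made for illustration, since real stores propagate asynchronously in the background.

```python
class EventuallyConsistentStore:
    """Toy replicated store in which writes reach other replicas with a delay."""

    def __init__(self, replica_names, initial):
        self.replicas = {name: dict(initial) for name in replica_names}
        self.pending = []  # queued updates: (target replica, key, value)

    def write(self, replica, key, value):
        # The write is applied locally at once and queued for everyone else.
        self.replicas[replica][key] = value
        for other in self.replicas:
            if other != replica:
                self.pending.append((other, key, value))

    def read(self, replica, key):
        return self.replicas[replica][key]

    def propagate(self):
        # Deliver all queued updates; this ends the inconsistent window.
        for target, key, value in self.pending:
            self.replicas[target][key] = value
        self.pending.clear()


store = EventuallyConsistentStore(["r_xiuying", "r_yves", "r_zaira"], {"a": 1})
store.write("r_zaira", "a", 2)                # Zaira: write(a) = 2

during_window = store.read("r_xiuying", "a")  # Xiuying still sees the old value
own_write = store.read("r_zaira", "a")        # Zaira reads her own write
store.propagate()
after_window = store.read("r_xiuying", "a")   # replicas have now converged

print(during_window, own_write, after_window)
```

Pinning each client to one replica also illustrates the two weaker guarantees named on the slides: a client that always reads where it writes gets read-own-writes consistency, and a client that never switches replica cannot see a value go backwards, which is monotonic read consistency.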
43 CAP theorem (Brewer) Consistency Availability Partitioning Pick any two...
44 CAP theorem - informal proof partition 1 partition 2 Ahmed Bernardo write(a, 2) a=1 a=1
45 CAP theorem - informal proof partition 1 partition 2 Ahmed Bernardo read(a) a=1 a=2 How should P1 respond to Ahmed?
46 CAP theorem - informal proof partition 1 partition 2 Ahmed Bernardo read(a) a=1 a=2 How should P1 respond to Ahmed? respond that a = 1 (forgo consistency) delay (forgo availability) be in constant contact with P2 (forgo partitioning)
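The informal proof can be restated as a few lines of code. This is a sketch under assumptions: the `Partition` class, the "AP"/"CP" policy labels, and the version counter are illustrative names, not part of the lecture. When the link between partitions is down, P1 must either answer from its local copy (forgoing consistency) or refuse to answer (forgoing availability); only when the link is up can it do both.

```python
class Unavailable(Exception):
    """Raised when a partition refuses to answer rather than risk a stale read."""


class Partition:
    def __init__(self, name, value, policy):
        self.name = name
        self.value = value
        self.version = 0
        self.policy = policy  # "AP": answer anyway, "CP": refuse when cut off

    def write(self, value):
        self.value = value
        self.version += 1

    def read(self, peer, link_up):
        if link_up:
            # No network split: adopt the newer copy before answering.
            if peer.version > self.version:
                self.value, self.version = peer.value, peer.version
            return self.value
        if self.policy == "AP":
            return self.value  # possibly stale: consistency is given up
        raise Unavailable(f"{self.name} cannot reach its peer")


p2 = Partition("P2", 1, "AP")
p2.write(2)  # Bernardo: write(a, 2) on partition 2

p1_ap = Partition("P1", 1, "AP")
stale = p1_ap.read(p2, link_up=False)       # Ahmed gets a = 1

p1_cp = Partition("P1", 1, "CP")
try:
    p1_cp.read(p2, link_up=False)
    refused = False
except Unavailable:
    refused = True                          # Ahmed gets no answer

fresh = p1_cp.read(p2, link_up=True)        # link restored: a = 2
print(stale, refused, fresh)
```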
47 Approaches to NoSQL key/value column oriented document oriented graph oriented
48 Key/value stores simplest possible interface: put('foo', 'bar') get('foo') will return 'bar' dates to at least the 1970s (dbm) Memcached/MemcacheDB as a caching layer BerkeleyDB (now maintained by Oracle) others: Tokyo Cabinet, Voldemort, Redis, Scalaris, Google LevelDB
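The put/get interface on this slide is small enough to sketch in full. The in-memory class below is only an illustration of the interface; the class name and the `default` parameter are assumptions, and stores such as Redis or Voldemort add persistence, replication and networking on top of the same idea.

```python
class KeyValueStore:
    """Minimal in-memory key/value store exposing only put and get."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Overwrites any previous value stored under the same key.
        self._data[key] = value

    def get(self, key, default=None):
        # Lookup is by exact key only: no queries on values, no joins.
        return self._data.get(key, default)


store = KeyValueStore()
store.put("foo", "bar")
print(store.get("foo"))      # the value stored under 'foo'
print(store.get("missing"))  # None, since nothing was stored under this key
```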
49 Column-oriented
RowKey | TimeStamp | ColumnFamily contents | ColumnFamily anchor
com.cnn.www | t1 | contents.html =... | anchor:cnnsi.com = CNN
com.cnn.www | t0 | contents.html =... | anchor:cnnsi.com = News
uk.ac.cam.www | t1 | contents.html =... | anchor:cl.cam.ac.uk = Home
uk.ac.cam.cl.www | t1 | contents.html =... | anchor:cl.cam.ac.uk/jcb82 = My Lab, anchor:cam.ac.uk = Computer Lab
Hadoop HBase
50 Column-oriented maintain unique keys per row; much more complicated; multi-valued columns; Google's BigTable was a pioneer; richer querying possible; Bigtable: A Distributed Storage System for Structured Data, Chang et al.; others: Apache Cassandra, Hadoop HBase, Amazon DynamoDB, Hypertable
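The column-oriented data model can be pictured as a nested map: row key, then column (written `family:qualifier`), then timestamp, then value. The sketch below assumes integer timestamps (0 for t0, 1 for t1) and a hypothetical `ColumnStore` class; it mirrors the shape of the HBase example on the previous slide, not the actual HBase API.

```python
class ColumnStore:
    """Toy column-oriented table: row key -> column -> {timestamp: value}."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, timestamp, value):
        versions = self.rows.setdefault(row_key, {}).setdefault(column, {})
        versions[timestamp] = value

    def get(self, row_key, column, as_of=None):
        """Return the newest value, or the newest one no later than as_of."""
        versions = self.rows.get(row_key, {}).get(column, {})
        eligible = [t for t in versions if as_of is None or t <= as_of]
        if not eligible:
            return None
        return versions[max(eligible)]

    def columns_in_family(self, row_key, family):
        # A column family can hold any number of columns, varying per row.
        prefix = family + ":"
        return sorted(c for c in self.rows.get(row_key, {}) if c.startswith(prefix))


table = ColumnStore()
table.put("com.cnn.www", "anchor:cnnsi.com", 0, "News")
table.put("com.cnn.www", "anchor:cnnsi.com", 1, "CNN")
table.put("uk.ac.cam.cl.www", "anchor:cl.cam.ac.uk/jcb82", 1, "My Lab")
table.put("uk.ac.cam.cl.www", "anchor:cam.ac.uk", 1, "Computer Lab")

latest = table.get("com.cnn.www", "anchor:cnnsi.com")
older = table.get("com.cnn.www", "anchor:cnnsi.com", as_of=0)
anchors = table.columns_in_family("uk.ac.cam.cl.www", "anchor")
print(latest, older, anchors)
```

Keeping every timestamped version is what lets one row hold both the t0 and t1 values of the same anchor, and the per-row set of columns is what makes the columns "multi-valued".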
51 Document-oriented like a key value store, but stores documents can be serialised objects or binary files can query attributes of serialised objects
52 XML <person id="11"> <name>Joseph Bonneau</name> </person> can specify separate DTD schema; can display using XSLT
53 JSON { "id": 11, "name": "Joseph Bonneau", " ": "jcb82@cam" } human readable; native parsing in JavaScript; BSON binary version developed for MongoDB
54 Protocol buffers message Person { required int32 id = 1; required string name = 2; optional string = 3; } compiles to binary format not human readable developed by Google many similar approaches Apache Thrift
55 Document-oriented: MongoDB { "_id": ObjectId("4efa8d2b7d284dad101e4bc9"), "name": "Joseph Bonneau", "role": "PhD student", "office number": 17 }, { "_id": ObjectId("4efa8d2b7d284dad101e4bc7"), "name": "Ross Anderson", "role": "Professor", "subject": "Security Engineering" } db.find({"name" : "Joseph Bonneau"}); db.find({"name" : { $regex : '.*Anderson$' } }); db.find({"office number" : { $gt: 16 } });
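What the three queries on this slide select can be shown without a database. The sketch below keeps the two documents as plain Python dictionaries and uses a hypothetical `find` helper with ordinary predicates; it imitates the meaning of exact match, `$regex` and `$gt`, and is not the MongoDB driver API.

```python
import re

# The two documents from the slide, without the ObjectId values.
staff = [
    {"name": "Joseph Bonneau", "role": "PhD student", "office number": 17},
    {"name": "Ross Anderson", "role": "Professor", "subject": "Security Engineering"},
]


def find(docs, predicate):
    """Return every document for which the predicate holds (hypothetical helper)."""
    return [doc for doc in docs if predicate(doc)]


# Exact match on an attribute.
by_name = find(staff, lambda d: d.get("name") == "Joseph Bonneau")

# Regular expression match, as with $regex.
by_regex = find(staff, lambda d: re.search(r".*Anderson$", d.get("name", "")) is not None)

# Numeric comparison, as with $gt. Documents lacking the field never match,
# which matters because documents need not share a schema.
by_office = find(
    staff,
    lambda d: isinstance(d.get("office number"), (int, float)) and d["office number"] > 16,
)

print([d["name"] for d in by_name])
print([d["name"] for d in by_regex])
print([d["name"] for d in by_office])
```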
56 Document-oriented MongoDB uses BSON (Binary JSON) open source, commercially developed Apache CouchDB uses vanilla JSON open source, community developed
57 Graph-oriented developed specifically for large graphs index free lookup of neighbors OrientDB is perhaps the best known less practical use than other approaches
58 Summary of NoSQL Pros: scalable and fast flexible can be easier for experienced programmers Cons: fewer consistency/concurrency guarantees weaker set of queries supported less mature
59 Big Data stacks
60 Big Data stacks Google Apache Hadoop proprietary but influential totally open source Amazon Web Services (AWS) mixed source, including some Hadoop
61 Hadoop founded in 2004 by a Yahoo! employee spun into open source Apache project general purpose framework for Big Data MapReduce implementation supporting tools (distributed storage, concurrency) used by everybody... Yahoo!, Facebook, Amazon, Microsoft, Apple
62 Hadoop Distributed File Sys.
63 Review: MapReduce Example from rapidgremlin.com
64 Hadoop MapReduce example
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Mapper: emits (word, 1) for every token of every input line
public class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}
Example from rapidgremlin.com
65 Hadoop MapReduce example
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Reducer: sums the counts collected for each word
public class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Example from rapidgremlin.com
66 Hadoop MapReduce example
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Driver: configures the job and submits it to the cluster
public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(MapClass.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
Example from rapidgremlin.com
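The data flow behind the Hadoop word count above can be traced on a single machine. The sketch below is plain Python, not Hadoop: the function names `map_phase`, `shuffle` and `reduce_phase` are illustrative, and the shuffle that Hadoop performs across the cluster is reduced to grouping values in a dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every token in the line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "data is data"])
print(counts)
```

Because summing is associative, the same reduce function can also run as a combiner on each mapper's local output, which is why the Hadoop driver registers the reducer class for both roles.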
67 Hadoop JobTracker
68 MapReduce alternative: Pig lines = LOAD '../data/words.txt' USING TextLoader() AS (sentence:chararray); words = FOREACH lines GENERATE FLATTEN(TOKENIZE(sentence)) AS word; groupedwords = GROUP words BY word; counts = FOREACH groupedwords GENERATE group, COUNT(words); STORE counts INTO 'output/wordcounts' USING PigStorage(); Example from rapidgremlin.com
69 MapReduce alternative: Hive create table textlines(text string); load data local inpath 'C:\work\ClearPoint\Data20\data\words.txt' overwrite into table textlines; create table words(word string); insert overwrite table words select explode(split(text, '[ \t]+')) word from textlines; select word, count(*) from words group by word; Example from rapidgremlin.com
70 The Hadoop Industry
71 Amazon web services launched 2006 largest, most popular cloud computing platform others: Rackspace, Azure, Google App Engine
72 Elastic Compute Cloud (EC2) rent Elastic compute units by the hour one ECU = one 1 GHz processor machine can choose Linux, FreeBSD, Solaris, Windows virtual private servers running on Xen pricing: US$ per hour varies by machine capacity spot pricing varies by demand
73 Simple Storage Service (S3) store arbitrary files (objects) index by bucket and key accessible via HTTP, SOAP, BitTorrent directly readable from EC2 machines over 1 trillion objects now uploaded pricing: US$ per GB per month similar rates for transfer out, free transfer in
74 Other AWS elements Elastic MapReduce: run Hadoop on EC2 machines with S3 storage, free data transfer; Relational Database Service: SQL database; many features for running a data-driven website: content delivery networks, caches, etc.
75 Comparison
                   Google      Hadoop      Amazon WS
storage            GoogleFS    HDFS        S3
caching            memcacheg   -           ElastiCache
locking            Chubby      Zookeeper   -
key-value          LevelDB     -           -
column-oriented    BigTable    Cassandra   DynamoDB
document-oriented  -           CouchDB     SimpleDB
76 Visualisation
77 Visualisation Charles Minard 1869
78 Visualisation re-creation with Protovis, Google Maps
79 Visualisation Kirk Goldsberry
80 Visualisation Martin Wattenberg, Fernanda Viégas
81 Visualisation Eric Fischer
82 Visualisation challenges 2 spatial dimensions + 3 colour dimensions; big data sets can have thousands! animation/interactivity often necessary
83 Graph visualisation tools GraphViz See also: Protovis, Gephi, Processing,
84 Geographic visualisation tools OpenHeatMap See also: FusionTables, GoogleMaps API
85 Interactive chart tools Rickshaw See also: Tableau, Highcharts JS, ExtJS, Raphaël, flot, dojox.charting
86 Privacy
87 Privacy Technology is neither good nor bad, it is neutral big data is often generated by people obtaining consent is often impossible anonymisation is very hard...
88 You only need 33 bits... birth date, postcode, gender: unique for 87% of US population (Sweeney 1997); preference in movies: 99% of 500k with 8 ratings (Narayanan 2007); web browser: 94% of 500k users (Eckersley 2010); writing style: 20% accurate out of 100k users (Narayanan 2012)
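The "33 bits" in the slide title follows from a one-line calculation: singling out one person among N requires about log2(N) bits of information. The sketch below assumes a world population of roughly 7 billion (an approximate 2012 figure, not stated on the slide) and also shows how much a single attribute contributes.

```python
import math

WORLD_POPULATION = 7_000_000_000  # assumption: approximate 2012 figure


def bits_to_identify(population):
    """Whole bits needed to give every member of the population a unique label."""
    return math.ceil(math.log2(population))


def surprisal_bits(fraction_sharing_value):
    """Bits revealed by an attribute value shared by this fraction of people."""
    return -math.log2(fraction_sharing_value)


needed = bits_to_identify(WORLD_POPULATION)
gender_bits = surprisal_bits(0.5)  # an attribute that splits people in half
print(needed, gender_bits)
```

Every attribute that narrows the candidate set contributes its share of bits, so a handful of seemingly harmless facts such as birth date, postcode and gender can add up to a unique identifier.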
89 Cross-graph de-anonymisation public private
90 Cross-graph de-anonymisation public private B A B' A' C C' Step 1: identify seed nodes
91 Cross-graph de-anonymisation public private B A D C B' A' D' Step 2: assign mappings based on mapped neighbors C'
92 Cross-graph de-anonymisation public private B A D B' A' D' C E C' E' Step 3: iterate
93 Cross-graph de-anonymisation public private B A D B' A' D' C E C' E' Step 3: Iterate 31% of common Twitter/Flickr users found given 30 seeds!
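The three steps on these slides (identify seeds, map nodes via already-mapped neighbours, iterate) can be sketched as a small propagation loop. This is a deliberately simplified illustration: the five-node graphs, their edges and the scoring rule (count of shared mapped neighbours, accepted only when there is a unique best candidate) are assumptions, and the published attack uses more robust scoring to tolerate noisy, partially overlapping graphs.

```python
def propagate(public, private, seeds):
    """Extend a seed mapping from public-graph nodes to private-graph nodes."""
    mapping = dict(seeds)
    changed = True
    while changed:  # Step 3: iterate until no new node can be mapped
        changed = False
        taken = set(mapping.values())
        for node in sorted(public):
            if node in mapping:
                continue
            # Images of this node's neighbours that are already mapped.
            anchors = {mapping[n] for n in public[node] if n in mapping}
            if not anchors:
                continue
            # Step 2: score each free private node by shared mapped neighbours.
            scores = {c: len(anchors & private[c]) for c in private if c not in taken}
            if not scores:
                continue
            best = max(scores.values())
            winners = [c for c, s in scores.items() if s == best]
            if best > 0 and len(winners) == 1:  # accept unambiguous matches only
                mapping[node] = winners[0]
                taken.add(winners[0])
                changed = True
    return mapping


# Hypothetical example graphs (adjacency sets); the private graph has the
# same structure with anonymised labels.
public = {
    "A": {"B", "C", "D"}, "B": {"A", "C", "D"}, "C": {"A", "B", "E"},
    "D": {"A", "B", "E"}, "E": {"C", "D"},
}
private = {
    "A'": {"B'", "C'", "D'"}, "B'": {"A'", "C'", "D'"}, "C'": {"A'", "B'", "E'"},
    "D'": {"A'", "B'", "E'"}, "E'": {"C'", "D'"},
}
seeds = {"A": "A'", "B": "B'", "C": "C'"}  # Step 1: seed nodes

mapping = propagate(public, private, seeds)
print(mapping)
```

Starting from three seeds, D is matched through its links to A and B, and E is then matched through C and the newly mapped D, which is the iteration shown on the last two slides.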
94 Homophily Birds of a feather link together age gender political orientation sexual orientation
95 Thank you
Word count example Abdalrahman Alsaedi
Word count example Abdalrahman Alsaedi To run word count in AWS you have two different ways; either use the already exist WordCount program, or to write your own file. First: Using AWS word count program
Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış
Istanbul Şehir University Big Data Camp 14 Hadoop Map Reduce Aslan Bakirov Kevser Nur Çoğalmış Agenda Map Reduce Concepts System Overview Hadoop MR Hadoop MR Internal Job Execution Workflow Map Side Details
CS54100: Database Systems
CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics
Hadoop WordCount Explained! IT332 Distributed Systems
Hadoop WordCount Explained! IT332 Distributed Systems Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize,
Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan [email protected]
Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan [email protected] Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.
Hadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN
Hadoop Framework technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate addi8onal
Data Science in the Wild
Data Science in the Wild Lecture 3 Some slides are taken from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 Data Science and Big Data Big Data: the data cannot
Introduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology [email protected] Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process
Hadoop and Eclipse. Eclipse Hawaii User s Group May 26th, 2009. Seth Ladd http://sethladd.com
Hadoop and Eclipse Eclipse Hawaii User s Group May 26th, 2009 Seth Ladd http://sethladd.com Goal YOU can use the same technologies as The Big Boys Google Yahoo (2000 nodes) Last.FM AOL Facebook (2.5 petabytes
Introduction To Hadoop
Introduction To Hadoop Kenneth Heafield Google Inc January 14, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise noted, the
How To Write A Mapreduce Program In Java.Io 4.4.4 (Orchestra)
MapReduce framework - Operates exclusively on pairs, - that is, the framework views the input to the job as a set of pairs and produces a set of pairs as the output
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
map/reduce connected components
1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains
Hadoop: Understanding the Big Data Processing Method
Hadoop: Understanding the Big Data Processing Method Deepak Chandra Upreti 1, Pawan Sharma 2, Dr. Yaduvir Singh 3 1 PG Student, Department of Computer Science & Engineering, Ideal Institute of Technology
SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford
SQL VS. NO-SQL Adapted Slides from Dr. Jennifer Widom from Stanford 55 Traditional Databases SQL = Traditional relational DBMS Hugely popular among data analysts Widely adopted for transaction systems
Creating.NET-based Mappers and Reducers for Hadoop with JNBridgePro
Creating.NET-based Mappers and Reducers for Hadoop with JNBridgePro CELEBRATING 10 YEARS OF JAVA.NET Apache Hadoop.NET-based MapReducers Creating.NET-based Mappers and Reducers for Hadoop with JNBridgePro
Lecture 10: HBase! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the
Open source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
Programming Hadoop Map-Reduce Programming, Tuning & Debugging. Arun C Murthy Yahoo! CCDI [email protected] ApacheCon US 2008
Programming Hadoop Map-Reduce Programming, Tuning & Debugging Arun C Murthy Yahoo! CCDI [email protected] ApacheCon US 2008 Existential angst: Who am I? Yahoo! Grid Team (CCDI) Apache Hadoop Developer
Hadoop. Dawid Weiss. Institute of Computing Science Poznań University of Technology
Hadoop Dawid Weiss Institute of Computing Science Poznań University of Technology 2008 Hadoop Programming Summary About Config 1 Open Source Map-Reduce: Hadoop About Cluster Configuration 2 Programming
Cloud Scale Distributed Data Storage. Jürmo Mehine
Cloud Scale Distributed Data Storage Jürmo Mehine 2014 Outline Background Relational model Database scaling Keys, values and aggregates The NoSQL landscape Non-relational data models Key-value Document-oriented
So What s the Big Deal?
So What s the Big Deal? Presentation Agenda Introduction What is Big Data? So What is the Big Deal? Big Data Technologies Identifying Big Data Opportunities Conducting a Big Data Proof of Concept Big Data
A programming model in Cloud: MapReduce
A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value
MR-(Mapreduce Programming Language)
MR-(Mapreduce Programming Language) Siyang Dai Zhi Zhang Shuai Yuan Zeyang Yu Jinxiong Tan sd2694 zz2219 sy2420 zy2156 jt2649 Objective of MR MapReduce is a software framework introduced by Google, aiming
Big Data Management. Big Data Management. (BDM) Autumn 2013. Povl Koch November 11, 2013 10-11-2013 1
Big Data Management Big Data Management (BDM) Autumn 2013 Povl Koch November 11, 2013 10-11-2013 1 Overview Today s program 1. Little more practical details about this course 2. Recap from last time (Google
Lecture Data Warehouse Systems
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores
extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010
System/ Scale to Primary Secondary Joins/ Integrity Language/ Data Year Paper 1000s Index Indexes Transactions Analytics Constraints Views Algebra model my label 1971 RDBMS O tables sql-like 2003 memcached
Can the Elephants Handle the NoSQL Onslaught?
Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
Programming in Hadoop Programming, Tuning & Debugging
Programming in Hadoop Programming, Tuning & Debugging Venkatesh. S. Cloud Computing and Data Infrastructure Yahoo! Bangalore (India) Agenda Hadoop MapReduce Programming Distributed File System HoD Provisioning
Xiaoming Gao Hui Li Thilina Gunarathne
Xiaoming Gao Hui Li Thilina Gunarathne Outline HBase and Bigtable Storage HBase Use Cases HBase vs RDBMS Hands-on: Load CSV file to Hbase table with MapReduce Motivation Lots of Semi structured data Horizontal
BIG DATA ANALYTICS HADOOP PERFORMANCE ANALYSIS. A Thesis. Presented to the. Faculty of. San Diego State University. In Partial Fulfillment
BIG DATA ANALYTICS HADOOP PERFORMANCE ANALYSIS A Thesis Presented to the Faculty of San Diego State University In Partial Fulfillment of the Requirements for the Degree Master of Science in Computer Science
Applications for Big Data Analytics
Smarter Healthcare Applications for Big Data Analytics Multi-channel sales Finance Log Analysis Homeland Security Traffic Control Telecom Search Quality Manufacturing Trading Analytics Fraud and Risk Retail:
Structured Data Storage
Structured Data Storage Xgen Congress Short Course 2010 Adam Kraut BioTeam Inc. Independent Consulting Shop: Vendor/technology agnostic Staffed by: Scientists forced to learn High Performance IT to conduct
How To Write A Mapreduce Program On An Ipad Or Ipad (For Free)
Course NDBI040: Big Data Management and NoSQL Databases Practice 01: MapReduce Martin Svoboda Faculty of Mathematics and Physics, Charles University in Prague MapReduce: Overview MapReduce Programming
Step 4: Configure a new Hadoop server This perspective will add a new snap-in to your bottom pane (along with Problems and Tasks), like so:
Codelab 1 Introduction to the Hadoop Environment (version 0.17.0) Goals: 1. Set up and familiarize yourself with the Eclipse plugin 2. Run and understand a word counting program Setting up Eclipse: Step
Preparing Your Data For Cloud
Preparing Your Data For Cloud Narinder Kumar Inphina Technologies 1 Agenda Relational DBMS's : Pros & Cons Non-Relational DBMS's : Pros & Cons Types of Non-Relational DBMS's Current Market State Applicability
Hadoop & Pig. Dr. Karina Hauser Senior Lecturer Management & Entrepreneurship
Hadoop & Pig Dr. Karina Hauser Senior Lecturer Management & Entrepreneurship Outline Introduction (Setup) Hadoop, HDFS and MapReduce Pig Introduction What is Hadoop and where did it come from? Big Data
Why NoSQL? Your database options in the new non- relational world. 2015 IBM Cloudant 1
Why NoSQL? Your database options in the new non- relational world 2015 IBM Cloudant 1 Table of Contents New types of apps are generating new types of data... 3 A brief history on NoSQL... 3 NoSQL s roots
Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13
Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex
MongoDB in the NoSQL and SQL world. Horst Rechner [email protected] Berlin, 2012-05-15
MongoDB in the NoSQL and SQL world. Horst Rechner [email protected] Berlin, 2012-05-15 1 MongoDB in the NoSQL and SQL world. NoSQL What? Why? - How? Say goodbye to ACID, hello BASE You
Evaluating NoSQL for Enterprise Applications. Dirk Bartels VP Strategy & Marketing
Evaluating NoSQL for Enterprise Applications Dirk Bartels VP Strategy & Marketing Agenda The Real Time Enterprise The Data Gold Rush Managing The Data Tsunami Analytics and Data Case Studies Where to go
wow CPSC350 relational schemas table normalization practical use of relational algebraic operators tuple relational calculus and their expression in a declarative query language relational schemas CPSC350
NOSQL DATABASE SYSTEMS
NOSQL DATABASE SYSTEMS Big Data Technologies: NoSQL DBMS - SoSe 2015 1 Categorization NoSQL Data Model Storage Layout Query Models Solution Architectures NoSQL Database Systems Data Modeling id ti Application
HPCHadoop: MapReduce on Cray X-series
HPCHadoop: MapReduce on Cray X-series Scott Michael Research Analytics Indiana University Cray User Group Meeting May 7, 2014 1 Outline Motivation & Design of HPCHadoop HPCHadoop demo Benchmarking Methodology
NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015
NoSQL Databases Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015 Database Landscape Source: H. Lim, Y. Han, and S. Babu, How to Fit when No One Size Fits., in CIDR,
A COMPARATIVE STUDY OF NOSQL DATA STORAGE MODELS FOR BIG DATA
A COMPARATIVE STUDY OF NOSQL DATA STORAGE MODELS FOR BIG DATA Ompal Singh Assistant Professor, Computer Science & Engineering, Sharda University, (India) ABSTRACT In the new era of distributed system where
Sentimental Analysis using Hadoop Phase 2: Week 2
Sentimental Analysis using Hadoop Phase 2: Week 2 MARKET / INDUSTRY, FUTURE SCOPE BY ANKUR UPRIT The key value type basically, uses a hash table in which there exists a unique key and a pointer to a particular
3 Case Studies of NoSQL and Java Apps in the Real World
Eugene Ciurana [email protected] - pr3d4t0r ##java, irc.freenode.net 3 Case Studies of NoSQL and Java Apps in the Real World This presentation is available from: http://ciurana.eu/geecon-2011 About Eugene...
Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12
Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
Open Source Technologies on Microsoft Azure
Open Source Technologies on Microsoft Azure A Survey @DChappellAssoc Copyright 2014 Chappell & Associates The Main Idea i Open source technologies are a fundamental part of Microsoft Azure The Big Questions
Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world
Analytics March 2015 White paper Why NoSQL? Your database options in the new non-relational world 2 Why NoSQL? Contents 2 New types of apps are generating new types of data 2 A brief history of NoSQL 3
Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu
Lecture 4 Introduction to Hadoop & GAE Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu Outline Introduction to Hadoop The Hadoop ecosystem Related projects
Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
Real Time Big Data Processing
Real Time Big Data Processing Cloud Expo 2014 Ian Meyers Amazon Web Services Global Infrastructure Deployment & Administration App Services Analytics Compute Storage Database Networking AWS Global Infrastructure
Comparison of the Frontier Distributed Database Caching System with NoSQL Databases
Comparison of the Frontier Distributed Database Caching System with NoSQL Databases Dave Dykstra [email protected] Fermilab is operated by the Fermi Research Alliance, LLC under contract No. DE-AC02-07CH11359
Introduction to Big Data Training
Introduction to Big Data Training The quickest way to be introduce with NOSQL/BIG DATA offerings Learn and experience Big Data Solutions including Hadoop HDFS, Map Reduce, NoSQL DBs: Document Based DB
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there is more and more data every day about everything. For instance, here are some of the astonishing
Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop
Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop Thomas Brenner, 08-928-434 1 Introduction and Task Temporal databases are databases expanded with a time dimension in order to
Connecting Hadoop with Oracle Database
Connecting Hadoop with Oracle Database Sharon Stephen Senior Curriculum Developer Server Technologies Curriculum The following is intended to outline our general product direction.
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012
Big Data Buzzwords From A to Z By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012 Big Data Buzzwords Big data is one of the, well, biggest trends in IT today, and it has spawned a whole new generation
How To Write A Map Reduce In Hadoop Hadooper 2.5.2.2 (Ahemos)
Processing Data with Map Reduce Allahbaksh Mohammedali Asadullah Infosys Labs, Infosys Technologies 1 Content Map Function Reduce Function Why Hadoop HDFS Map Reduce Hadoop Some Questions 2 What is Map
BIG DATA, MAPREDUCE & HADOOP
BIG DATA, MAPREDUCE & HADOOP LARGE SCALE DISTRIBUTED SYSTEMS By Jean-Pierre Lozi A tutorial for the LSDS class OBJECTIVES OF THIS LAB SESSION The LSDS
Lambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014
Lambda Architecture CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014 1 Goals Cover the material in Chapter 8 of the Concurrency Textbook The Lambda Architecture Batch Layer MapReduce
NoSQL Systems for Big Data Management
NoSQL Systems for Big Data Management Venkat N Gudivada East Carolina University Greenville, North Carolina USA Venkat Gudivada NoSQL Systems for Big Data Management 1/28 Outline 1 An Overview of NoSQL
Comparing Scalable NOSQL Databases
Comparing Scalable NOSQL Databases Functionalities and Measurements Dory Thibault UCL Sponsor : Euranova Website : nosqlbenchmarking.com February 15, 2011 Clarifications
NoSQL Databases. Nikos Parlavantzas
NoSQL Databases Nikos Parlavantzas Lecture overview Objective: Present the main concepts necessary for understanding NoSQL databases. Provide an overview of current NoSQL technologies. Outline
NoSQL Data Base Basics
NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS
Download and install Download virtual machine Import virtual machine in Virtualbox
Hadoop/Pig Install Download and install Virtualbox www.virtualbox.org Virtualbox Extension Pack Download virtual machine link in schedule (https://rmacchpcsymposium2015.sched.org/? iframe=no) Import virtual
Internals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
The Hadoop Eco System Shanghai Data Science Meetup
The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related
Building a distributed search system with Apache Hadoop and Lucene. Mirko Calvaresi
Building a distributed search system with Apache Hadoop and Lucene Mirko Calvaresi a Barbara, Leonardo e Vittoria 2 Index Preface... 5 1 Introduction: the Big Data Problem... 6 1.1 Big data: handling the
Introduction to MapReduce and Hadoop
UC Berkeley Introduction to MapReduce and Hadoop Matei Zaharia UC Berkeley RAD Lab What is MapReduce? Data-parallel programming model for clusters of commodity machines Pioneered
nosql and Non Relational Databases
nosql and Non Relational Databases Image src: http://www.pentaho.com/big-data/nosql/ Matthias Lee Johns Hopkins University What NoSQL? Yes no SQL.. At least not only SQL Large class of Non Relational Databases
Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems
Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google was the first to apply MapReduce to big data analysis. Their idea was introduced in their seminal paper MapReduce:
Case-Based Reasoning Implementation on Hadoop and MapReduce Frameworks Done By: Soufiane Berouel Supervised By: Dr Lily Liang
Case-Based Reasoning Implementation on Hadoop and MapReduce Frameworks Done By: Soufiane Berouel Supervised By: Dr Lily Liang Independent Study Advanced Case-Based Reasoning Department of Computer Science
Introduction to NOSQL
Introduction to NOSQL Université Paris-Est Marne la Vallée, LIGM UMR CNRS 8049, France January 31, 2014 Motivations NOSQL stands for Not Only SQL Motivations Exponential growth of data set size (161Eo
Introduction to Hadoop, MapReduce and HDFS for Big Data Applications Serge Blazhievsky Nice Systems
Introduction to Hadoop, MapReduce and HDFS for Big Data Applications Serge Blazhievsky Nice Systems SNIA Legal Notice The material contained in this tutorial is copyrighted
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
NoSQL replacement for SQLite (for Beatstream) Antti-Jussi Kovalainen Seminar OHJ-1860: NoSQL databases
NoSQL replacement for SQLite (for Beatstream) Antti-Jussi Kovalainen Seminar OHJ-1860: NoSQL databases Background Inspiration: postgresapp.com demo.beatstream.fi (modern desktop browsers without
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
An Open Source NoSQL solution for Internet Access Logs Analysis
An Open Source NoSQL solution for Internet Access Logs Analysis A practical case of why, what and how to use a NoSQL Database Management System instead of a relational one José Manuel Ciges Regueiro
these three NoSQL databases because I wanted to see the two different sides of the CAP
Michael Sharp Big Data CS401r Lab 3 For this paper I decided to do research on MongoDB, Cassandra, and Dynamo. I chose these three NoSQL databases because I wanted to see the two different sides of the
Big Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
Not Relational Models For The Management of Large Amount of Astronomical Data. Bruno Martino (IASI/CNR), Memmo Federici (IAPS/INAF)
Not Relational Models For The Management of Large Amount of Astronomical Data Bruno Martino (IASI/CNR), Memmo Federici (IAPS/INAF) What is a DBMS A Data Base Management System is a software infrastructure
NOSQL INTRODUCTION WITH MONGODB AND RUBY GEOFF LANE @GEOFFLANE
NOSQL INTRODUCTION WITH MONGODB AND RUBY GEOFF LANE @GEOFFLANE WHAT IS NOSQL? NON-RELATIONAL DATA STORAGE USUALLY SCHEMA-FREE ACCESS DATA WITHOUT SQL (THUS... NOSQL) WIDE-COLUMN / TABULAR
BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research &
BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research & Innovation 04-08-2011 to the EC 8 th February, Luxembourg Your Atos business Research technologists. and Innovation
Alexey Vovchenko, Candidate of Technical Sciences, Senior Researcher, CMC MSU, IPI RAN
Alexey Vovchenko, Candidate of Technical Sciences, Senior Researcher, CMC MSU, IPI RAN Zettabytes Petabytes ABC Sharding A B C Id Fn Ln Addr 1 Fred Jones Liberty, NY 2 John Smith?????? 122+ NoSQL Database
Composite Data Virtualization Composite Data Virtualization And NOSQL Data Stores
Composite Data Virtualization Composite Data Virtualization And NOSQL Data Stores Composite Software October 2010 TABLE OF CONTENTS INTRODUCTION... 3 BUSINESS AND IT DRIVERS... 4 NOSQL DATA STORES LANDSCAPE...
Big Data With Hadoop
With Saurabh Singh The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect
The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect IT Insight podcast This podcast belongs to the IT Insight series You can subscribe to the podcast through
Big Systems, Big Data
Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,
Riding the Data Wave. New Capabilities New Techniques. Bill Chute Acadiant Limited
Riding the Data Wave New Capabilities New Techniques Bill Chute Acadiant Limited There are new challenges New technologies are on your side 2 MiFID II & MIFIR Basel III NAV Volcker VaR Dodd-Frank MAD II
